the future is now: an update on the csu data lake

Post on 22-Apr-2022


The Future is Now: An Update on the CSU Data Lake

CalStateTech – August 2019


R. Brendan Aldrich, Chief Data Officer

California State University, Chancellor’s Office

• 18+ years focused on the fields of data warehousing, business intelligence, analytics & research

• Have led data transformation and modernization initiatives for private and public organizations

• Leading the BI/DW Team within the CSU Chancellor’s Office

• Work areas: Data Lake Project, CFS Data Warehouse, CHRS Data Warehouse

Rajveer Singh, Transformation Architect

Unisys

• 14+ years experience in leading Strategy, Innovation, Digital Transformation and IT Leadership

• Hybrid Cloud, Security, Data Management, Workload Migrations, Automation, Cost Management

Gartner Predicts 2019: Analytics & BI Strategy

• Key Findings

• Organizations use only a fraction of the analytic potential they already possess.

• Modern analytic technologies accelerate and diversify the creation of analytic insight, but do little to ensure deployment and use.

• Although the visibility and interest in analytics have been transformed by artificial intelligence (AI) over the last few years, quantum computing (QC) has been under the radar for most organizations. However, its eventual impact may be equally significant.


EDUCAUSE: Digital Transformation (Dx)

• What is Digital Transformation?

• Digital transformation (Dx) is a series of deep and coordinated culture, workforce, and technology shifts that enable new educational and operating models and transform an institution’s operations, strategic directions, and value proposition.


EDUCAUSE: Data and Digital Transformation


• Higher Education Chief Data Officers Working Group

• Digital Transformation Task Force

• While the tools and technologies have changed, the vast majority of organizations are still managing their data using the same techniques they have for the last 30 years.

Solving “Traditional” Data Issues

• Create a Stable Data History from source systems

• We can answer questions that haven’t yet been asked

• ALL our in-use data (and a year’s worth of everything else)

• It’s easy and FAST to add new data

• Focus on cleaning data in sources

• Make the interesting easy and curate the useful

• Every team at every campus can iterate independently while maintaining order

Conceptual Architecture

[Diagram labels: Data Sources (PS, FIN, HR, ADV, LMS, Swipe; CO and Campus); Delphix (Rolling 365 Days); Data Lake (No historical limit), Cloud-based Copy; Transformations; CO DW with S, S x T, S x C, S x D, …; Campus DW with S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp; Athena; BI App]

CSU Data Lake: A Retrospective

• June 2017: Data Lake Prototype

• Data already being provided to the CO was collected and housed in SQL Server tables

• Jan. 2018: CSU Is First California Higher Ed To Appoint A CDO

• The work thus far:

[Gantt chart: one row per activity, monthly columns from 02/18 through 12/19 and beyond]

Ideation and Planning

Pre-Development

Development: Data Lake

Development: Curated Student Data

Development: Validation BI Platform

Validation: Curated Student Data →

Data Governance Orchestration →

Campus Data Lake Access →

RFP: Production BI Platform →

New BI/DW Sub-Teams

• Discovery Team: Data Lake Architecture & Functionality

• Tomorrow Team: ETL & Modeling

• FED Team: Front End Design

• Information Security: Data privacy, protection and security

Data and Analytics Strategies Driving the Future


CSU Challenge: Data is highly distributed across the system and not easily accessible/usable

Architectural Deep Dive

• Shifting Data from On Premise to Cloud: Delphix

• Populating the Data Lake: DMS

• Curated Data Collections: Multiple Technologies (AWS / Airflow + Python / Alteryx)

© 2019 Unisys Corporation. All rights reserved.

Data Virtualization = Secure, Lightweight & Portable Data

[Diagram labels: App Tier – Prod, App Data Files, 1 TB; Dev, QA, UAT, Integration as full copies, 1 TB each; Unique Block Mapping, Block Aware Filtering, Efficient Compression; .3 TB (3:1); Dev, QA, UAT, Integration as virtual copies, 20 MB each; Secure Data Masking]
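The storage figures on this slide can be turned into a quick back-of-the-envelope comparison. The sketch below is illustrative only: it assumes the four non-production environments each need a 1 TB full copy without virtualization, and that with virtualization they share one .3 TB compressed copy plus 20 MB per environment. That reading of the diagram, and all function and constant names, are assumptions.

```python
# Figures as read off the slide; how they combine is an assumption.
FULL_COPY_TB = 1.0          # one full copy of the production data
COMPRESSED_SOURCE_TB = 0.3  # ".3 TB (3:1)" shared compressed copy
VIRTUAL_COPY_MB = 20        # footprint of each virtual copy
ENVIRONMENTS = ["Dev", "QA", "UAT", "Integration"]

MB_PER_TB = 1_000_000  # decimal units


def full_copy_storage_tb(n_envs: int) -> float:
    """Storage if every environment holds its own full copy."""
    return n_envs * FULL_COPY_TB


def virtual_copy_storage_tb(n_envs: int) -> float:
    """Storage if environments share one compressed copy plus small deltas."""
    return COMPRESSED_SOURCE_TB + n_envs * VIRTUAL_COPY_MB / MB_PER_TB


if __name__ == "__main__":
    n = len(ENVIRONMENTS)
    print(f"Full copies:    {full_copy_storage_tb(n):.5f} TB")
    print(f"Virtual copies: {virtual_copy_storage_tb(n):.5f} TB")
```

Under these assumptions the non-production footprint drops from 4 TB to roughly 0.3 TB.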


Hybrid Cloud without Delphix

ON PREMISE

1. Submit Request
2. Approve Request
3. Ready Target
4. Ready Storage
5. Restore Version
6. Configure Database
7. Mask Database
8. Backup masked database

Migrate

CLOUD

9. Ready Cloud Target
10. Configure cloud storage
11. Restore Database
12. Validate Environment

Pain points:

• Days or weeks to prep data on premise

• Days or weeks to provision cloud DBs from migrated data

• Slow lift and shift process

• Static data load does not capture updates

• Data refresh requires repeat of full process

• IT tickets create bottleneck

• Multiple handoffs cause delay

• Multiple tools required for data security and data movement

Complex, Manual Processes Slow Data Movement & Refresh for Hybrid Cloud


How Delphix Accelerates Cloud Migrations

[Diagram labels: Client Network (ON PREMISE); Cloud Provider Network (CLOUD); PRODUCTION (APP, RDBMS, STORAGE); TEST, DEV, STAGE; NON-PRODUCTION (TEST, DEV); Data Lake STAGING; Continuously Replicate; steps 1 to 4]

• Deploy First Delphix Instance On Premises & Synchronize with Prod Systems

• Deploy Second Delphix Instance in Cloud, Continuously Replicate Between Instances

• Mask Data On Premises, Ensuring No Sensitive Data Leaves Production Network

• Provision Masked/Unmasked Test & Trial Cutover Environments in Either Location


Deliver Data with Delphix

[Diagram labels: ON-PREMISE: CO Source (CMS/CHRS/CFS), Campus Source; CLOUD: Campus Data in Cloud]

Efficiently Deliver & Refresh Data


AWS - DMS

To Migrate Databases to AWS Quickly & Securely

Homogeneous & heterogeneous DB migrations

Continuous replication with high availability

Streaming data to Amazon Redshift & S3

AWS Schema Conversion Tool

Fast and easy to set-up

Supports widely used databases


Discovery Team: Architectural Issue

• Oracle vs. Amazon Data Definition Language

[Diagram labels: Oracle, Redshift; Oracle, .CSV, Redshift]
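To make the DDL mismatch concrete, here is a minimal sketch of translating Oracle column types into Redshift ones when generating a target table. It is not the Discovery Team's code: the function names, the handful of types covered, and the choice to map Oracle DATE to TIMESTAMP (Oracle DATE carries a time component) are assumptions for illustration.

```python
import re


def oracle_to_redshift_type(oracle_type: str) -> str:
    """Map a small, illustrative subset of Oracle types to Redshift types."""
    t = oracle_type.strip().upper()

    # Character types. Oracle lengths may be in characters while Redshift
    # lengths are in bytes, so multi-byte data may need a larger length.
    m = re.fullmatch(r"(VARCHAR2|NVARCHAR2|CHAR|NCHAR)\((\d+)(?:\s+(?:BYTE|CHAR))?\)", t)
    if m:
        base = "CHAR" if m.group(1) in ("CHAR", "NCHAR") else "VARCHAR"
        return f"{base}({m.group(2)})"

    # NUMBER(p) or NUMBER(p,s) becomes DECIMAL, capped at precision 38.
    m = re.fullmatch(r"NUMBER\((\d+)(?:\s*,\s*(\d+))?\)", t)
    if m:
        precision = min(int(m.group(1)), 38)
        scale = int(m.group(2) or 0)
        return f"DECIMAL({precision},{scale})"

    if t == "DATE":
        return "TIMESTAMP"  # keep the time-of-day that Oracle DATE stores
    if t == "CLOB":
        return "VARCHAR(65535)"  # largest VARCHAR Redshift allows

    raise ValueError(f"No mapping defined for Oracle type: {oracle_type}")


def redshift_create_table(table: str, columns: list[tuple[str, str]]) -> str:
    """Build a CREATE TABLE statement from (column name, Oracle type) pairs."""
    cols = ",\n  ".join(
        f"{name} {oracle_to_redshift_type(typ)}" for name, typ in columns
    )
    return f"CREATE TABLE {table} (\n  {cols}\n);"
```

Unknown types raise an error rather than guessing, so gaps in the mapping surface early instead of producing a silently wrong table.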

Discovery Team: The Teleporter

• The teleporter is a tool that will “beam” Oracle tables to AWS Redshift

[Diagram labels: Oracle, CSV, S3, Redshift]
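The Oracle, CSV, S3, Redshift path can be sketched in two small pieces: serializing rows to CSV, and building the Redshift COPY statement that loads a CSV file from S3. This is a rough illustration, not the teleporter itself. The function names, table name, bucket path and IAM role below are hypothetical, and the S3 upload and database connections are left out.

```python
import csv
import io


def rows_to_csv(header: list[str], rows: list[list]) -> str:
    """Serialize a header and rows to CSV text, quoting fields where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement for a CSV file with one header row.

    Inputs are assumed to come from trusted configuration, not user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


if __name__ == "__main__":
    print(rows_to_csv(["id", "name"], [[1, "Ann"], [2, "Raj"]]))
    print(build_copy_statement(
        "student",                                      # hypothetical table
        "s3://example-bucket/student.csv",              # hypothetical path
        "arn:aws:iam::123456789012:role/example-role",  # hypothetical role
    ))
```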

Something to Know…

• Issue: [CR], [LF], and delimiter values in VARCHAR fields

• Impact: When bulk copying to Redshift, data after these values in a field is dropped

• Resolution:

• Post-Processing Procedure: Replace issue values with a space

• Adds time to daily process

• Other Options…
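The "replace issue values with a space" resolution can be illustrated with a short Python sketch. It shows the idea only: where the actual post-processing procedure runs and how it is written is not stated on the slide, and the function names and default delimiter here are assumptions.

```python
def clean_field(value: str, delimiter: str = ",") -> str:
    """Replace carriage returns, line feeds and the delimiter with spaces.

    `delimiter` must be a single character.
    """
    replacements = {ord("\r"): " ", ord("\n"): " ", ord(delimiter): " "}
    return value.translate(replacements)


def clean_row(row: list[str], delimiter: str = ",") -> list[str]:
    """Apply clean_field to every field in a row."""
    return [clean_field(field, delimiter) for field in row]


if __name__ == "__main__":
    print(clean_row(["Line one\r\nLine two", "Doe, Jane"]))
```

Each problem character becomes its own space, so a CR/LF pair turns into two spaces. Because this rewrites every text field, it adds a pass over the data, which matches the "adds time to daily process" caveat above.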



Cost Optimization

AWS DMS Acceleration Results


Application of private patches

Curated Student Collections


Prototyping Technologies

• AWS: Crawlers, Data Catalogs, Glue

• Airflow + Python: Hand-crafted ETL platform

• Alteryx, Matillion, Others: Visual ETL (new prototypes)


Curated Data Sets

• In the next 30 days: Work with CIOs and Heads of Institutional Research to identify participants

• Data Validation: No statewide normalization. Does this data look like what’s in your SIS?

• The Goal

• Access to a set of curated data sets refreshed on a daily basis

• Once validated, we will give you the ETL code

• We will assist and advise in implementing a campus environment, if desired
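One way a campus might check whether a curated data set "looks like what's in your SIS" is a key-by-key comparison of the two extracts. The sketch below is purely illustrative and is not the validation process used in the project. The function name, the `emplid` key and the sample fields are hypothetical.

```python
def compare_records(curated: list[dict], sis: list[dict], key: str) -> dict:
    """Compare two record sets by key and report where they disagree."""
    curated_by_key = {row[key]: row for row in curated}
    sis_by_key = {row[key]: row for row in sis}

    shared = curated_by_key.keys() & sis_by_key.keys()
    return {
        "missing_in_curated": sorted(sis_by_key.keys() - curated_by_key.keys()),
        "extra_in_curated": sorted(curated_by_key.keys() - sis_by_key.keys()),
        "mismatched": sorted(
            k for k in shared if curated_by_key[k] != sis_by_key[k]
        ),
    }


if __name__ == "__main__":
    # Hypothetical sample rows; real extracts would come from the SIS
    # and from the curated data set.
    sis_rows = [
        {"emplid": "001", "units": 12},
        {"emplid": "002", "units": 9},
        {"emplid": "003", "units": 15},
    ]
    curated_rows = [
        {"emplid": "001", "units": 12},
        {"emplid": "002", "units": 6},
        {"emplid": "004", "units": 3},
    ]
    print(compare_records(curated_rows, sis_rows, key="emplid"))
```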


Data Validation: Pentaho

Direct Data Lake Access

• In the next 60 days: Work with CIOs to identify initial participants

• Looking for pilot campuses: 3 - 5 pilot campuses with rollout to all other campuses to follow

• The Goal

• Direct access to stored copies of all source tables via data lake

• Campus Teleporter: To help campuses spin up Redshift tables from files

• We will assist and advise in connection and best practices


Conceptual Architecture

[Diagram labels: Data Sources (PS, FIN, HR, ADV, LMS, Swipe; CO and Campus); Delphix (Rolling 365 Days); Data Lake (No historical limit), Cloud-based Copy; Transformations; CO DW with S, S x T, S x C, S x D, …; Campus DW with S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp; Athena; BI App]

Data Governance Orchestration

• Cross Functional Data Governance Teams: 17 of our 23 Campuses

• Over the Next Six Months: We will start coordinating with those teams to actively help share data governance practices and data dictionary definitions across campuses

• Introducing our new Student Analytics Project Manager: Angela Williams

Let’s Connect!


LinkedIn: www.linkedin.com/in/brendanaldrich/

Twitter: @CalStateCDO

R. Brendan Aldrich, Chief Data Officer

California State University, Chancellor’s Office

Appendix


Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; EC2 Instance; Table List For Sync; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: T2.micro; S3 Buckets & Folders; Redshift Cluster]


Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; EC2 Instance; Table List For Sync; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: C4.8X Large; S3 Buckets & Folders; Redshift Cluster]


Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; EC2 Instance; Table List For Sync; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: T2.micro; Cloud Watch performance monitoring and reporting; S3 Buckets & Folders; Redshift Cluster]
