The Future is Now: An Update on the CSU Data Lake

CalStateTech – August 2019

Post on 22-Apr-2022

Page 1: The Future is Now: An Update on the CSU Data Lake

The Future is Now: An Update on the CSU Data Lake

CalStateTech – August 2019

Page 2: The Future is Now: An Update on the CSU Data Lake

R. Brendan Aldrich
Chief Data Officer

California State University, Chancellor’s Office

• 18+ years focused on the fields of data warehousing, business intelligence, analytics & research

• Have led data transformation and modernization initiatives for private and public organizations:

• Leading the BI/DW Team within the CSU Chancellor’s Office

• Work areas: Data Lake Project, CFS Data Warehouse, CHRS Data Warehouse

Page 3: The Future is Now: An Update on the CSU Data Lake

Rajveer Singh
Transformation Architect

Unisys

• 14+ years experience in leading Strategy, Innovation, Digital Transformation and IT Leadership

• Hybrid Cloud, Security, Data Management, Workload Migrations, Automation, Cost Management

Page 4: The Future is Now: An Update on the CSU Data Lake

Gartner Predicts 2019: Analytics & BI Strategy

• Key Findings
  • Organizations use only a fraction of the analytic potential they already possess.
  • Modern analytic technologies accelerate and diversify the creation of analytic insight, but do little to ensure deployment and use.
  • Although the visibility and interest in analytics have been transformed by artificial intelligence (AI) over the last few years, quantum computing (QC) has been under the radar for most organizations. However, its eventual impact may be equally significant.

Page 5: The Future is Now: An Update on the CSU Data Lake

EDUCAUSE: Digital Transformation (Dx)

• What is Digital Transformation?
  • Digital transformation (Dx) is a series of deep and coordinated culture, workforce, and technology shifts that enable new educational and operating models and transform an institution’s operations, strategic directions, and value proposition.

Page 6: The Future is Now: An Update on the CSU Data Lake

EDUCAUSE: Data and Digital Transformation

• Higher Education Chief Data Officers Working Group

• Digital Transformation Task Force

Page 7: The Future is Now: An Update on the CSU Data Lake

• While the tools and technologies have changed, the vast majority of organizations are still managing their data using the same techniques they have for the last 30 years.

Page 8: The Future is Now: An Update on the CSU Data Lake

Solving “Traditional” Data Issues

• Create a Stable Data History from source systems

• We can answer questions that haven’t yet been asked

• ALL our in-use data (and a year’s worth of everything else)

• It’s easy and FAST to add new data

• Focus on cleaning data in sources

• Make the interesting easy and curate the useful

• Every team at every campus can iterate independently while maintaining order

Page 9: The Future is Now: An Update on the CSU Data Lake

Conceptual Architecture

[Diagram. Labels recovered from the slide:
• Data Sources (CO and Campus): PS, FIN, HR, ADV, LMS, Swipe
• Delphix (Rolling 365 Days)
• Cloud-based Copy
• Data Lake (No historical limit)
• Transformations
• CO DW: S, S x T, S x C, S x D, …; BI App
• Athena
• Campus DW: S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp; BI App]

Page 10: The Future is Now: An Update on the CSU Data Lake

CSU Data Lake: A Retrospective

• June 2017: Data Lake Prototype
  • Data that was being provided to CO collected and housed in SQL Server tables

• Jan. 2018: CSU Is First California Higher Ed To Appoint A CDO
  • The work thus far:

Activity 02/18 03/18 04/18 05/18 06/18 07/18 08/18 09/18 10/18 11/18 12/18 01/19 02/19 03/19 04/19 05/19 06/19 07/19 08/19 09/19 10/19 11/19 12/19 …

Ideation and Planning

Pre-Development

Development: Data Lake

Development: Curated Student Data

Development: Validation BI Platform

Validation: Curated Student Data →

Data Governance Orchestration →

Campus Data Lake Access →

RFP: Production BI Platform →

Page 11: The Future is Now: An Update on the CSU Data Lake

New BI/DW Sub-Teams

• Discovery Team
  • Data Lake Architecture & Functionality
• Tomorrow Team
  • ETL & Modeling
• FED Team
  • Front End Design
• Information Security
  • Data privacy, protection and security

Page 12: The Future is Now: An Update on the CSU Data Lake

Data and Analytics Strategies Driving the Future

CSU Challenge: Data is highly distributed across the system and not easily accessible or usable

Page 13: The Future is Now: An Update on the CSU Data Lake

Architectural Deep Dive

• Shifting Data from On Premise to Cloud
  • Delphix
• Populating the Data Lake
  • DMS
• Curated Data Collections
  • Multiple Technologies: AWS / Airflow + Python / Alteryx

Page 14: The Future is Now: An Update on the CSU Data Lake

© 2019 Unisys Corporation. All rights reserved.

Data Virtualization = Secure, Lightweight & Portable Data

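[Diagram. Labels recovered from the slide:
• App Tier – Prod: App Data Files, 1 TB
• Dev, QA, UAT, Integration as full copies: 1 TB each
• Unique Block Mapping, Block Aware Filtering, Efficient Compression: .3 TB (3:1)
• Dev, QA, UAT, Integration as virtual copies: 20 MB each
• Secure Data Masking]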
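
The storage figures on this slide can be worked through as simple arithmetic. The sketch below is only illustrative: it assumes the four non-production environments (Dev, QA, UAT, Integration) each either hold a full 1 TB copy or a 20 MB virtual copy on top of one .3 TB compressed copy, and it assumes decimal units (1 TB = 1,000,000 MB). The function names are hypothetical.

```python
# Figures come from the slide; decimal units are an assumption of this sketch.
MB_PER_TB = 1_000_000


def physical_copies_tb(source_tb: float, environments: int) -> float:
    """Storage needed when every environment holds a full copy of the source."""
    return source_tb * environments


def virtual_copies_tb(compressed_tb: float, environments: int, per_copy_mb: float) -> float:
    """Storage needed for one compressed copy plus a small virtual copy per environment."""
    return compressed_tb + environments * per_copy_mb / MB_PER_TB


full = physical_copies_tb(source_tb=1.0, environments=4)
virtual = virtual_copies_tb(compressed_tb=0.3, environments=4, per_copy_mb=20)

print(f"Full copies:    {full:.5f} TB")
print(f"Virtual copies: {virtual:.5f} TB")
```

Under these assumptions the four virtual copies add only 80 MB on top of the compressed copy, so nearly all of the footprint is the single .3 TB copy.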

Page 15: The Future is Now: An Update on the CSU Data Lake

Hybrid Cloud without Delphix

Complex, Manual Processes Slow Data Movement & Refresh for Hybrid Cloud

ON PREMISE (days or weeks to prep data on premise)
1. Submit Request
2. Approve Request
3. Ready Target
4. Ready Storage
5. Restore Version
6. Configure Database
7. Mask Database
8. Backup masked database

Migrate

CLOUD (days or weeks to provision cloud DBs from migrated data)
9. Ready Cloud Target
10. Configure cloud storage
11. Restore Database
12. Validate Environment

Issues flagged on the slide:
• Slow lift and shift process
• Static data load does not capture updates
• Data refresh requires repeat of full process
• IT tickets create bottleneck
• Multiple handoffs cause delay
• Multiple tools required for data security and data movement

Page 16: The Future is Now: An Update on the CSU Data Lake

How Delphix Accelerates Cloud Migrations

[Diagram labels: Client Network, Cloud Provider Network; ON PREMISE, CLOUD; PRODUCTION (APP, RDBMS, STORAGE); NON-PRODUCTION; TEST, DEV, STAGE; TEST, DEV; Data Lake STAGING; Continuously Replicate]

• Deploy First Delphix Instance On Premises & Synchronize with Prod Systems
• Deploy Second Delphix Instance in Cloud, Continuously Replicate Between Instances
• Mask Data On Premises, Ensuring No Sensitive Data Leaves Production Network
• Provision Masked/Unmasked Test & Trial Cutover Environments in Either Location

Page 17: The Future is Now: An Update on the CSU Data Lake

Deliver Data with Delphix

Efficiently Deliver & Refresh Data

[Diagram labels: ON-PREMISE: CO Source (CMS/CHRS/CFS), Campus Source; CLOUD: Campus Data in Cloud]

Page 18: The Future is Now: An Update on the CSU Data Lake

AWS - DMS

To Migrate Databases to AWS Quickly & Securely

• Homogeneous & heterogeneous DB migrations
• Continuous replication with high availability
• Streaming data to Amazon Redshift & S3
• AWS Schema Conversion Tool
• Fast and easy to set up
• Supports widely used databases

Page 19: The Future is Now: An Update on the CSU Data Lake

Discovery Team: Architectural Issue

• Oracle vs. Amazon Data Definition Language

[Diagram: Oracle → Redshift; Oracle → .CSV → Redshift]
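
To illustrate the kind of DDL mismatch this slide refers to, here is a minimal sketch of translating a few Oracle column types into Redshift equivalents. It is not the Discovery Team's tool: the function name, the set of types covered, and the default chosen for an unqualified NUMBER are all assumptions made for the example.

```python
import re

REDSHIFT_MAX_VARCHAR = 65535   # Redshift VARCHAR length limit, in bytes
REDSHIFT_MAX_PRECISION = 38    # Redshift DECIMAL precision limit


def oracle_to_redshift_type(oracle_type: str) -> str:
    """Map a small, illustrative subset of Oracle column types to Redshift types."""
    t = oracle_type.strip().upper()

    # VARCHAR2(n), VARCHAR2(n BYTE), VARCHAR2(n CHAR), NVARCHAR2(n).
    # Redshift lengths are bytes, so CHAR-semantics columns holding
    # multi-byte text may need a wider column than this sketch produces.
    m = re.fullmatch(r"N?VARCHAR2\((\d+)(?:\s+(?:BYTE|CHAR))?\)", t)
    if m:
        return f"VARCHAR({min(int(m.group(1)), REDSHIFT_MAX_VARCHAR)})"

    # NUMBER(p) and NUMBER(p,s)
    m = re.fullmatch(r"NUMBER\((\d+)(?:\s*,\s*(\d+))?\)", t)
    if m:
        precision = min(int(m.group(1)), REDSHIFT_MAX_PRECISION)
        scale = int(m.group(2) or 0)
        return f"DECIMAL({precision},{scale})"

    if t == "NUMBER":
        # Assumption: a fixed default, since Oracle's unqualified NUMBER
        # has no direct Redshift counterpart.
        return "DECIMAL(38,10)"

    # Oracle DATE carries a time component, so TIMESTAMP is the closer match.
    if t == "DATE" or t.startswith("TIMESTAMP"):
        return "TIMESTAMP"

    # Redshift has no CLOB type; long text is capped at the VARCHAR maximum.
    if t in ("CLOB", "NCLOB"):
        return f"VARCHAR({REDSHIFT_MAX_VARCHAR})"

    raise ValueError(f"No mapping defined for Oracle type: {oracle_type}")


for oracle_type in ("VARCHAR2(30)", "NUMBER(10,2)", "DATE", "CLOB"):
    print(oracle_type, "->", oracle_to_redshift_type(oracle_type))
```

Types the sketch does not recognize raise an error rather than being guessed at, which keeps unmapped columns visible.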

Page 20: The Future is Now: An Update on the CSU Data Lake

Discovery Team: The Teleporter

• The teleporter is a tool that will “beam” Oracle tables to AWS Redshift

Page 21: The Future is Now: An Update on the CSU Data Lake

Discovery Team: The Teleporter

• The teleporter is a tool that will “beam” Oracle tables to AWS Redshift

[Diagram: Oracle → CSV → S3 → Redshift]
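
The flow on this slide (table rows written to CSV, staged in S3, then loaded into Redshift) can be sketched roughly as follows. This is not the Teleporter's actual code: the helper names, table name, bucket, and IAM role ARN are hypothetical placeholders, and the S3 upload and the database connection are left out so the example stays self-contained. The sketch only produces the CSV text and the text of a Redshift COPY statement.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv(header: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Serialize rows (for example, fetched from an Oracle table) as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads a CSV file with one header row from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical example values, for illustration only.
csv_text = rows_to_csv(["emplid", "name"], [(1001, "Ada"), (1002, "Grace")])
copy_sql = build_copy_statement(
    table="staging.students",
    s3_uri="s3://example-bucket/students/students.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
print(csv_text)
print(copy_sql)
```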

Page 22: The Future is Now: An Update on the CSU Data Lake

Something to Know…

• Issue: [CR], [LF], and Delimiter Values in Varchar fields
• Impact: When bulk copying to Redshift, data after these values in a field is dropped
• Resolution:
  • Post-Processing Procedure: Replace issue values with space
    • Adds time to daily process
  • Other Options…
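
The resolution described on this slide, replacing carriage returns, line feeds, and delimiter characters with a space, could look something like the sketch below. The function names and the comma default are assumptions for illustration, and the slide does not say where in the pipeline the real procedure runs.

```python
from typing import Sequence


def sanitize_field(value: str, delimiter: str = ",") -> str:
    """Replace CR, LF, and the delimiter character with a single space each."""
    for problem_char in ("\r", "\n", delimiter):
        value = value.replace(problem_char, " ")
    return value


def sanitize_row(row: Sequence[object], delimiter: str = ",") -> list:
    """Sanitize only the string fields of a row; other values pass through unchanged."""
    return [
        sanitize_field(value, delimiter) if isinstance(value, str) else value
        for value in row
    ]


cleaned = sanitize_row([42, "Line one\r\nLine two, continued", None])
print(cleaned)
```

Each offending character becomes its own space, so a CR+LF pair leaves two spaces behind. Collapsing those would be a further design choice that the slide does not specify.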

Page 23: The Future is Now: An Update on the CSU Data Lake

Cost Optimization

Page 24: The Future is Now: An Update on the CSU Data Lake

AWS DMS Acceleration Results

Application of private patches

Page 25: The Future is Now: An Update on the CSU Data Lake

Curated Student Collections

Page 26: The Future is Now: An Update on the CSU Data Lake

Prototyping Technologies

• AWS
  • Crawlers, Data Catalogs, Glue
• Airflow + Python
  • Hand-crafted ETL platform
• Alteryx, Matillion, Others
  • Visual ETL (new prototypes)

Page 27: The Future is Now: An Update on the CSU Data Lake

Page 28: The Future is Now: An Update on the CSU Data Lake

Curated Data Sets

• In the next 30 Days
  • Work with CIOs and Heads of Institutional Research to identify participants
• Data Validation
  • No Statewide Normalization: Does this data look like what’s in your SIS?
• The Goal
  • Access to a set of curated data sets refreshed on a daily basis
  • Once validated, we will give you the ETL code
  • We will assist and advise in implementing a campus environment, if desired
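
One way a campus might approach the validation question above ("does this data look like what's in your SIS?") is a keyed comparison between a campus SIS extract and the curated data set. The sketch below is a generic illustration, not the validation process used in the project: the record layout, key name, and function name are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def compare_by_key(
    sis_rows: Iterable[dict],
    curated_rows: Iterable[dict],
    key: str,
) -> Dict[str, List[Hashable]]:
    """Report keys missing from, unexpected in, or mismatched in the curated data."""
    sis = {row[key]: row for row in sis_rows}
    curated = {row[key]: row for row in curated_rows}
    return {
        "missing_from_curated": sorted(sis.keys() - curated.keys()),
        "unexpected_in_curated": sorted(curated.keys() - sis.keys()),
        "mismatched": sorted(
            k for k in sis.keys() & curated.keys() if sis[k] != curated[k]
        ),
    }


# Hypothetical sample records, for illustration only.
sis_extract = [
    {"emplid": 1, "plan": "BIOL"},
    {"emplid": 2, "plan": "MATH"},
    {"emplid": 3, "plan": "HIST"},
]
curated_set = [
    {"emplid": 1, "plan": "BIOL"},
    {"emplid": 2, "plan": "STAT"},
    {"emplid": 4, "plan": "CHEM"},
]
report = compare_by_key(sis_extract, curated_set, key="emplid")
print(report)
```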

Page 29: The Future is Now: An Update on the CSU Data Lake

Data Validation: Pentaho

Page 30: The Future is Now: An Update on the CSU Data Lake

Direct Data Lake Access

• In the next 60 days
  • Work with CIOs to identify initial participants
• Looking for pilot campuses
  • 3 - 5 pilot campuses with rollout to all other campuses to follow
• The Goal
  • Direct access to stored copies of all source tables via data lake
  • Campus Teleporter: To help campuses spin up Redshift tables from files
  • We will assist and advise in connection and best practices

Page 31: The Future is Now: An Update on the CSU Data Lake

Conceptual Architecture

[Diagram. Labels recovered from the slide:
• Data Sources (CO and Campus): PS, FIN, HR, ADV, LMS, Swipe
• Delphix (Rolling 365 Days)
• Cloud-based Copy
• Data Lake (No historical limit)
• Transformations
• CO DW: S, S x T, S x C, S x D, …; BI App
• Athena
• Campus DW: S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp; BI App]

Page 32: The Future is Now: An Update on the CSU Data Lake

Data Governance Orchestration

• Cross Functional Data Governance Teams
  • 17 of our 23 Campuses
• Over the Next Six Months
  • We will start coordinating with those teams to actively help share data governance practices and data dictionary definitions across campuses
• Introducing our new Student Analytics Project Manager
  • Angela Williams

Page 33: The Future is Now: An Update on the CSU Data Lake

Let’s Connect!

LinkedIn: www.linkedin.com/in/brendanaldrich/

Twitter: @CalStateCDO

R. Brendan Aldrich
Chief Data Officer

California State University, Chancellor’s Office

Page 34: The Future is Now: An Update on the CSU Data Lake

Appendix

Page 35: The Future is Now: An Update on the CSU Data Lake

Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; Table List For Sync; EC2 Instance; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: T2.micro; S3 Buckets & Folders; Redshift Cluster]

Page 36: The Future is Now: An Update on the CSU Data Lake

Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; Table List For Sync; EC2 Instance; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: C4.8X Large; S3 Buckets & Folders; Redshift Cluster]

Page 37: The Future is Now: An Update on the CSU Data Lake

Automated Data As a Service

[Diagram labels: On Prem Delphix; AWS Delphix; Data masking profile; Table List For Sync; EC2 Instance; DMS Tasks (Source Endpoint, Destination Endpoint); DMS Replication Instances: T2.micro; S3 Buckets & Folders; Redshift Cluster; CloudWatch performance monitoring and reporting]