The Future is Now: An Update on the CSU Data Lake
CalStateTech – August 2019
R. Brendan Aldrich, Chief Data Officer
California State University, Chancellor’s Office
• 18+ years focused on the fields of data warehousing, business intelligence, analytics & research
• Have led data transformation and modernization initiatives for private and public organizations
• Leading the BI/DW Team within the CSU Chancellor’s Office
• Work areas: Data Lake Project, CFS Data Warehouse, CHRS Data Warehouse
Rajveer Singh, Transformation Architect
Unisys
• 14+ years of experience in Strategy, Innovation, Digital Transformation and IT Leadership
• Hybrid Cloud, Security, Data Management, Workload Migrations, Automation, Cost Management
Gartner Predicts 2019: Analytics & BI Strategy
• Key Findings
  • Organizations use only a fraction of the analytic potential they already possess.
  • Modern analytic technologies accelerate and diversify the creation of analytic insight, but do little to ensure deployment and use.
  • Although the visibility and interest in analytics have been transformed by artificial intelligence (AI) over the last few years, quantum computing (QC) has been under the radar for most organizations. However, its eventual impact may be equally significant.
EDUCAUSE: Digital Transformation (Dx)
• What is Digital Transformation?
  • Digital transformation (Dx) is a series of deep and coordinated culture, workforce, and technology shifts that enable new educational and operating models and transform an institution’s operations, strategic directions, and value proposition.
EDUCAUSE: Data and Digital Transformation
• Higher Education Chief Data Officers Working Group
• Digital Transformation Task Force
• While the tools and technologies have changed, the vast majority of organizations are still managing their data using the same techniques they have for the last 30 years.
Solving “Traditional” Data Issues
• Create a Stable Data History from source systems
• We can answer questions that haven’t yet been asked
• ALL our in-use data (and a year’s worth of everything else)
• It’s easy and FAST to add new data
• Focus on cleaning data in sources
• Make the interesting easy and curate the useful
• Every team at every campus can iterate independently while maintaining order
Conceptual Architecture
(Diagram) Data Sources (PS, FIN, HR, ADV, LMS, Swipe; CO and Campus) → Delphix (Rolling 365 Days) → Data Lake (Cloud-based Copy, No historical limit) → Transformations → CO DW and Campus DW.
CO DW: S, S x T, S x C, S x D, … → BI App
Campus DW: S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp → BI App
Also shown: Athena
CSU Data Lake: A Retrospective
• June 2017: Data Lake Prototype
  • Data that was being provided to CO collected and housed in SQL Server tables
• Jan. 2018: CSU Is First California Higher Ed To Appoint A CDO
  • The work thus far:
Timeline chart (02/18 through 12/19 …), activities in order:
• Ideation and Planning
• Pre-Development
• Development: Data Lake
• Development: Curated Student Data
• Development: Validation BI Platform
• Validation: Curated Student Data →
• Data Governance Orchestration →
• Campus Data Lake Access →
• RFP: Production BI Platform →
New BI/DW Sub-Teams
• Discovery Team
  • Data Lake Architecture & Functionality
• Tomorrow Team
  • ETL & Modeling
• FED Team
  • Front End Design
• Information Security
  • Data privacy, protection and security
Data and Analytics Strategies Driving the Future
CSU Challenge: Data is highly distributed across the system and not easily accessible or usable
Architectural Deep Dive
• Shifting Data from On Premise to Cloud
  • Delphix
• Populating the Data Lake
  • DMS
• Curated Data Collections
  • Multiple Technologies: AWS / Airflow + Python / Alteryx
© 2019 Unisys Corporation. All rights reserved.
Data Virtualization = Secure, Lightweight & Portable Data
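(Diagram) App Tier – Prod: 1 TB of App Data Files
Dev, QA, UAT, Integration as full copies: 1 TB each
With data virtualization: 0.3 TB compressed (3:1), plus 20 MB for each of Dev, QA, UAT, Integration
Techniques shown: Unique Block Mapping, Block Aware Filtering, Efficient Compression, Secure Data Masking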
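The figures on this slide suggest a simple storage comparison, sketched below. It assumes one reading of the diagram: without virtualization each of the four non-production environments holds a full 1 TB copy, and with virtualization they share one 0.3 TB compressed image plus a 20 MB virtual copy each. Decimal units (1 TB = 1,000,000 MB) are also an assumption, chosen for simplicity.

```python
# Back-of-the-envelope storage comparison using the figures on the slide.
# Assumption: every non-production environment needs a full copy without
# virtualization, and only a small virtual copy with it.

MB_PER_TB = 1_000_000  # decimal units, for simplicity

ENVIRONMENTS = ["Dev", "QA", "UAT", "Integration"]
FULL_COPY_TB = 1.0      # one full copy of the production data
COMPRESSED_TB = 0.3     # shared compressed image (3:1)
VIRTUAL_COPY_MB = 20    # per-environment virtual copy


def full_copy_storage_tb(environments):
    """Storage needed when each environment holds a full copy."""
    return len(environments) * FULL_COPY_TB


def virtualized_storage_tb(environments):
    """Storage needed for one compressed image plus small virtual copies."""
    return COMPRESSED_TB + len(environments) * VIRTUAL_COPY_MB / MB_PER_TB


full = full_copy_storage_tb(ENVIRONMENTS)
virtual = virtualized_storage_tb(ENVIRONMENTS)
print(f"Full copies: {full:.2f} TB")
print(f"Virtualized: {virtual:.5f} TB")
print(f"Reduction:   {1 - virtual / full:.1%}")
```

Under these assumptions nearly all of the virtualized footprint is the shared compressed image; the per-environment copies are negligible by comparison.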
Hybrid Cloud without Delphix
ON PREMISE
1. Submit Request
2. Approve Request
3. Ready Target
4. Ready Storage
5. Restore Version
6. Configure Database
7. Mask Database
8. Backup masked database
Migrate
CLOUD
9. Ready Cloud Target
10. Configure cloud storage
11. Restore Database
12. Validate Environment
Issues called out on the diagram:
• Slow lift and shift process
• Days or weeks to prep data on premise
• Days or weeks to provision cloud DBs from migrated data
• Static data load does not capture updates
• Data refresh requires repeat of full process
• IT tickets create bottleneck
• Multiple handoffs cause delay
• Multiple tools required for data security and data movement
Complex, Manual Processes Slow Data Movement & Refresh for Hybrid Cloud
How Delphix Accelerates Cloud Migrations
(Diagram) Client Network (ON PREMISE): PRODUCTION (APP, RDBMS, STORAGE); NON-PRODUCTION (TEST, DEV, STAGE)
Cloud Provider Network (CLOUD): TEST, DEV, Data Lake STAGING
• Deploy First Delphix Instance On Premises & Synchronize with Prod Systems
• Mask Data On Premises, Ensuring No Sensitive Data Leaves Production Network
• Deploy Second Delphix Instance in Cloud, Continuously Replicate Between Instances
• Provision Masked/Unmasked Test & Trial Cutover Environments in Either Location
Deliver Data with Delphix
(Diagram) ON-PREMISE: CO Source (CMS/CHRS/CFS), Campus Source
CLOUD: Campus Data in Cloud
Efficiently Deliver & Refresh Data
AWS - DMS
To Migrate Databases to AWS Quickly & Securely
Homogeneous & heterogeneous DB migrations
Continuous Replication with high availability
Streaming data to Amazon Redshift & S3
AWS Schema Conversion Tool
Fast and easy to set up
Supports widely used databases
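A DMS task selects which tables to replicate through a JSON table-mapping document made of selection rules. The sketch below builds such a document in plain Python; the schema and table names are hypothetical placeholders, not CSU's actual tables, and nothing here calls AWS.

```python
import json


def build_table_mappings(schema, tables):
    """Build a DMS table-mapping document that includes only the listed tables."""
    rules = []
    for i, table in enumerate(tables, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(i),
            "rule-name": str(i),
            "object-locator": {"schema-name": schema, "table-name": table},
            "rule-action": "include",
        })
    return json.dumps({"rules": rules}, indent=2)


# Hypothetical schema and table names, for illustration only.
mappings = build_table_mappings("EXAMPLE_SCHEMA", ["EXAMPLE_TABLE_A", "EXAMPLE_TABLE_B"])
print(mappings)
```

The resulting string is the kind of value a replication task accepts as its table mappings, for example through the `TableMappings` parameter of boto3's `create_replication_task`.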
Discovery Team: Architectural Issue
• Oracle vs. Amazon Data Definition Language
(Diagram) Oracle → Redshift; Oracle → .CSV → Redshift
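To illustrate the kind of DDL mismatch involved, here is a minimal sketch that translates a few Oracle column types into Redshift equivalents and emits a `CREATE TABLE` statement. The mapping choices (for example, the default precision for an unconstrained `NUMBER`) and the table and column names are assumptions for illustration, not the Discovery Team's actual rules.

```python
import re

# Illustrative Oracle -> Redshift type mapping (an assumption, not the
# Discovery Team's actual rules).
SIMPLE_TYPES = {
    "DATE": "TIMESTAMP",         # Oracle DATE carries a time component
    "CLOB": "VARCHAR(65535)",    # Redshift's maximum VARCHAR length
    "NUMBER": "DECIMAL(38,10)",  # arbitrary default for unconstrained NUMBER
}


def oracle_to_redshift_type(oracle_type):
    """Translate one Oracle column type into a Redshift column type."""
    t = oracle_type.strip().upper()
    if t in SIMPLE_TYPES:
        return SIMPLE_TYPES[t]
    # Redshift VARCHAR lengths are in bytes, so multibyte data may need more room.
    m = re.fullmatch(r"N?VARCHAR2\((\d+)(?:\s+(?:BYTE|CHAR))?\)", t)
    if m:
        return f"VARCHAR({m.group(1)})"
    m = re.fullmatch(r"NUMBER\((\d+)(?:\s*,\s*(\d+))?\)", t)
    if m:
        precision, scale = m.group(1), m.group(2) or "0"
        return f"DECIMAL({precision},{scale})"
    raise ValueError(f"No mapping defined for Oracle type: {oracle_type}")


def redshift_create_table(table, columns):
    """Build a Redshift CREATE TABLE statement from (name, oracle_type) pairs."""
    cols = ",\n".join(
        f"    {name} {oracle_to_redshift_type(otype)}" for name, otype in columns
    )
    return f"CREATE TABLE {table} (\n{cols}\n);"


# Hypothetical table and columns, for illustration only.
ddl = redshift_create_table(
    "example_table",
    [("ID", "NUMBER(10)"), ("NAME", "VARCHAR2(50)"), ("CREATED", "DATE")],
)
print(ddl)
```

Raising an error for unmapped types, rather than guessing, keeps unsupported columns visible instead of silently loading them with the wrong type.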
Discovery Team: The Teleporter
• The teleporter is a tool that will “beam” Oracle tables to AWS Redshift
(Diagram) Oracle → CSV → S3 → Redshift
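The Oracle-to-CSV-to-S3-to-Redshift path can be sketched as two small steps: serialize the rows to CSV, then issue a Redshift `COPY` from the uploaded file. This is not the Teleporter's code; the table name, bucket, and IAM role below are hypothetical, and the Oracle query and S3 upload are left out so the example stays self-contained.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize rows to CSV text (the file that would be uploaded to S3)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def redshift_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement that loads the CSV file from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical data, bucket, and role, for illustration only.
csv_text = rows_to_csv(["id", "name"], [(1, "Ada"), (2, "Grace")])
copy_sql = redshift_copy_statement(
    "example_table",
    "s3://example-bucket/example_table.csv",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(csv_text)
print(copy_sql)
```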
Something to Know…
• Issue: [CR], [LF], and Delimiter Values in Varchar fields
• Impact: When bulk copying to Redshift, data after these values in a field is dropped
• Resolution:
  • Post-Processing Procedure: Replace issue values with space
    • Adds time to daily process
  • Other Options…
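The replacement described above can be sketched in a few lines of Python. The slide does not say where the post-processing procedure runs or how it is written, so this is only an illustration; the function name and the comma default for the delimiter are assumptions.

```python
def sanitize_field(value, delimiter=","):
    """Replace carriage returns, line feeds, and the delimiter with spaces."""
    table = str.maketrans({"\r": " ", "\n": " ", delimiter: " "})
    return value.translate(table)


# Hypothetical field value containing all three problem characters.
raw = "Line 1\r\nLine 2, continued"
clean = sanitize_field(raw)
print(repr(clean))
```

Each problem character becomes a single space, so field length is preserved and the row still fits its original column width.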
Cost Optimization
AWS DMS Acceleration Results
Application of private patches
Curated Student Collections
Prototyping Technologies
• AWS
  • Crawlers, Data Catalogs, Glue
• Airflow + Python
  • Hand-crafted ETL platform
• Alteryx, Matillion, Others
  • Visual ETL (new prototypes)
Curated Data Sets
• In the next 30 days
  • Work with CIOs and Heads of Institutional Research to identify participants
• Data Validation
  • No Statewide Normalization: Does this data look like what’s in your SIS?
• The Goal
  • Access to a set of curated data sets refreshed on a daily basis
  • Once validated, we will give you the ETL code
  • We will assist and advise in implementing a campus environment, if desired
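One simple way to ask "does this data look like what's in your SIS" is to compare record counts per category between the curated set and the source system. The sketch below does this in plain Python; it is not the validation tooling used in the project, and the `term` field and its values are hypothetical.

```python
from collections import Counter


def count_by(rows, key):
    """Count rows per distinct value of the given field."""
    return Counter(row[key] for row in rows)


def compare_counts(curated_rows, sis_rows, key):
    """Return {value: (curated_count, sis_count)} wherever the counts differ."""
    curated = count_by(curated_rows, key)
    sis = count_by(sis_rows, key)
    return {
        value: (curated.get(value, 0), sis.get(value, 0))
        for value in sorted(set(curated) | set(sis))
        if curated.get(value, 0) != sis.get(value, 0)
    }


# Hypothetical enrollment rows keyed by term, for illustration only.
curated_rows = [{"term": "2194"}, {"term": "2194"}, {"term": "2197"}]
sis_rows = [{"term": "2194"}, {"term": "2194"}, {"term": "2197"}, {"term": "2197"}]

mismatches = compare_counts(curated_rows, sis_rows, "term")
print(mismatches)
```

An empty result means the counts agree for every value; any entry points to a category worth inspecting row by row.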
Data Validation: Pentaho
Direct Data Lake Access
• In the next 60 days
  • Work with CIOs to identify initial participants
• Looking for pilot campuses
  • 3 - 5 pilot campuses with rollout to all other campuses to follow
• The Goal
  • Direct access to stored copies of all source tables via data lake
  • Campus Teleporter: To help campuses spin up Redshift tables from files
  • We will assist and advise in connection and best practices
Conceptual Architecture
(Diagram) Data Sources (PS, FIN, HR, ADV, LMS, Swipe; CO and Campus) → Delphix (Rolling 365 Days) → Data Lake (Cloud-based Copy, No historical limit) → Transformations → CO DW and Campus DW.
CO DW: S, S x T, S x C, S x D, … → BI App
Campus DW: S_cmp, S x T_cmp, S x C_cmp, S x D_cmp, … S x ?_cmp → BI App
Also shown: Athena
Data Governance Orchestration
• Cross Functional Data Governance Teams
  • 17 of our 23 Campuses
• Over the Next Six Months
  • We will start coordinating with those teams to actively help share data governance practices and data dictionary definitions across campuses
• Introducing our new Student Analytics Project Manager
  • Angela Williams
Let’s Connect!
LinkedIn: www.linkedin.com/in/brendanaldrich/
Twitter: @CalStateCDO
R. Brendan Aldrich, Chief Data Officer
California State University, Chancellor’s Office
Appendix
Automated Data As a Service (DMS Replication Instances: T2.micro)
(Diagram) On Prem Delphix, AWS Delphix, Data masking profile, Table List For Sync, EC2 Instance, DMS Tasks (Source Endpoint, Destination Endpoint), S3 Buckets & Folders, Redshift Cluster
Automated Data As a Service (DMS Replication Instances: C4.8X Large)
(Diagram) Same components as the T2.micro slide
Automated Data As a Service (DMS Replication Instances: T2.micro, with CloudWatch)
(Diagram) Same components as the T2.micro slide, plus CloudWatch performance monitoring and reporting