"who moved my data? - why tracking changes and sources of data is critical to your data lake...
TRANSCRIPT
cask.co
Agenda
● The problem
● The facets of data governance
● Why it’s hard in Hadoop
● How Cask is tackling this
But you’re not the only one
Golden_Data
Golden_Data_v1
Golden_Data_final
Golden_Data_v3_final
Golden_Data_v3_final_v1
Golden_Data_chris
Yellowish_data_v4
14k_Golden_data
Golden_Data_Is_Forever
Oh and it only gets worse
Golden_Data
Golden_Data_v1
Golden_Data_final
Golden_Data_v3_final
Golden_Data_v3_final_v1
Golden_Data_chris
Yellowish_data_v4
14k_Golden_data
Golden_Data_Is_Forever
Golden_Data_Russ_v1
Golden_Data_v2
Golden_Data_final_really
Golden_Data_v3_final_final
Golden_Data_v3_final_v2
Golden_Data_chris_dan
Yellowish_data_v5
14k_Golden_data_rosie
Golden_Data_Is_Forever_v1
And everyone has a favorite tool
Excel is the only thing I use!
Python is the only thing I use!
Java is the only thing I use!
R is the only thing I use!
“Just throw all the data into the cluster now, and worry about cleansing, reconciliation and enrichment later.”
Core Elements of Data Governance
● Auditing
● Lineage
● Data lifecycle management and policy enforcement
● Data stewardship and curation
● Metadata management
Challenges of Data Governance in Hadoop
● Hadoop stores extremely diverse data and lots of it
● Users access that data with an increasing number of tools
Remember this?
Excel is the only thing I use!
Python is the only thing I use!
Java is the only thing I use!
R is the only thing I use!
Metadata enables Data Governance
• Audit data records who is accessing your data and when
• Lineage data shows where the data came from
• Lifecycle data tells you if this data is on its way in or out
• Catalogue data ensures people can find everything
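The four kinds of metadata listed on this slide can be pictured as fields on a single per-dataset record. The following is a minimal sketch under stated assumptions: `DatasetMetadata`, `upstream`, and `find_by_tag` are hypothetical names invented for illustration, and nothing here is taken from CDAP, Tracker, or any Hadoop API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DatasetMetadata:
    """Hypothetical per-dataset record; not a CDAP or Hadoop structure."""
    name: str
    derived_from: list = field(default_factory=list)  # lineage: upstream dataset names
    retire_after: Optional[datetime] = None            # lifecycle: when the data is on its way out
    tags: set = field(default_factory=set)             # catalogue: labels people can search on
    audit_log: list = field(default_factory=list)      # audit: (user, action, timestamp) tuples

    def record_access(self, user, action, when=None):
        """Audit: note who touched the dataset, what they did, and when."""
        when = when or datetime.now(timezone.utc)
        self.audit_log.append((user, action, when))

    def is_retiring(self, now):
        """Lifecycle: True once the retirement date has been reached."""
        return self.retire_after is not None and now >= self.retire_after


def upstream(catalog, name):
    """Lineage: every dataset that `name` was derived from, directly or indirectly."""
    seen = []
    stack = list(catalog[name].derived_from)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            if parent in catalog:
                stack.extend(catalog[parent].derived_from)
    return seen


def find_by_tag(catalog, tag):
    """Catalogue: names of all datasets carrying the given tag, sorted."""
    return sorted(n for n, m in catalog.items() if tag in m.tags)
```

With a record like this for each dataset, a question such as "where did Golden_Data_final come from?" becomes a lookup over `derived_from` rather than a guess based on the file name.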
The fewer humans that are involved in metadata collection, the easier data governance becomes.
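One way to take humans out of metadata collection is to have the processing code emit the metadata itself as a side effect of running. The sketch below assumes a hypothetical `tracked` decorator and an in-memory `AUDIT_LOG` list standing in for a real metadata store. It illustrates the idea only and is not how CDAP or Tracker is implemented.

```python
import functools
import os
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a metadata store.
AUDIT_LOG = []


def tracked(reads=(), writes=()):
    """Record an audit and lineage entry each time the wrapped job completes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            AUDIT_LOG.append({
                "job": func.__name__,
                "user": os.environ.get("USER", "unknown"),
                "when": datetime.now(timezone.utc),
                "reads": list(reads),    # upstream datasets (lineage)
                "writes": list(writes),  # datasets produced by this job
            })
            return result
        return wrapper
    return decorator


# Example job: the dataset names are declared once, next to the code,
# instead of being written down by hand after every run.
@tracked(reads=["Golden_Data"], writes=["Golden_Data_v1"])
def clean(rows):
    return [r.strip() for r in rows]
```

Each call to `clean` appends one entry to `AUDIT_LOG`, so the record of who ran what, when, and against which datasets does not depend on anyone remembering to update it.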
Cask Data Application Platform (CDAP)
Unified Integration Framework for Building and Running Data Applications on Hadoop and Spark
• 100% open source and highly extensible
• Supports all major Hadoop Distributions
• Integrates the latest big data technologies, including Kafka, YARN, Spark, Impala, Hive, Hive on Spark, and Hive on Tez
Coming Soon
• Security based on metadata tags
• Enhanced Audit Log display
• Auto tagging datasets
• Data dictionary support (more metadata)
In Closing
• Data Governance is critical to the success of your cluster
• Multiple systems and tools complicate things in Hadoop
• Metadata is key to solving this in your cluster
• Tracker is working to solve governance in CDAP