Who Moved My Data? - Why tracking changes and sources of data is critical to your data lake...



Who Moved My Data?

September 14th, 2016

cask.co

Russ Savage
Application Engineer @ Cask


Agenda

● The problem
● The facets of data governance
● Why it’s hard in Hadoop
● How Cask is tackling this


The Elusive Golden Dataset


But you’re not the only one


Golden_Data
Golden_Data_v1
Golden_Data_final
Golden_Data_v3_final
Golden_Data_v3_final_v1
Golden_Data_chris
Yellowish_data_v4
14k_Golden_data
Golden_Data_Is_Forever


Oh and it only gets worse


Golden_Data
Golden_Data_v1
Golden_Data_final
Golden_Data_v3_final
Golden_Data_v3_final_v1
Golden_Data_chris
Yellowish_data_v4
14k_Golden_data
Golden_Data_Is_Forever
Golden_Data_Russ_v1
Golden_Data_v2
Golden_Data_final_really
Golden_Data_v3_final_final
Golden_Data_v3_final_v2
Golden_Data_chris_dan
Yellowish_data_v5
14k_Golden_data_rosie
Golden_Data_Is_Forever_v1


And everyone has a favorite tool


Excel is the only thing I use!

Python is the only thing I use!

Java is the only thing I use!

R is the only thing I use!


“Just throw all the data into the cluster now, and worry about cleansing, reconciliation and enrichment later.”


Welcome to your new Data Lake…

Welcome to your new Data SWAMP

Core Elements of Data Governance

● Auditing

● Lineage

● Data lifecycle management and policy enforcement

● Data stewardship and curation

● Metadata management


Challenges of Data Governance in Hadoop

● Hadoop stores extremely diverse data and lots of it

● Users access that data with an increasing number of tools


Remember this?


Excel is the only thing I use!

Python is the only thing I use!

Java is the only thing I use!

R is the only thing I use!


So of course, we need more data


Metadata enables Data Governance

• Audit data records who is accessing your data and when
• Lineage data shows where the data came from
• Lifecycle data tells you if this data is on its way in or out
• Catalogue data ensures people can find everything
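
As a rough illustration of the four kinds of metadata listed above, the sketch below models them as plain Python records. Every class and field name here (`DatasetMetadata`, `AuditEvent`, `ttl_days`, `find_by_tag`, and so on) is hypothetical and is not a CDAP API; the point is only that one record per dataset can answer who touched it, where it came from, how long it should live, and how to find it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class AuditEvent:
    """One access to a dataset: who did what, and when."""
    user: str
    action: str  # e.g. "read" or "write"
    at: datetime


@dataclass
class DatasetMetadata:
    """Hypothetical per-dataset metadata record (not a CDAP type)."""
    name: str
    created: datetime
    ttl_days: int                                # lifecycle: how long to keep it
    parents: list = field(default_factory=list)  # lineage: direct source datasets
    tags: set = field(default_factory=set)       # catalogue: searchable labels
    audit: list = field(default_factory=list)    # audit: access history

    def record_access(self, user, action, at):
        self.audit.append(AuditEvent(user, action, at))

    def expires_at(self):
        return self.created + timedelta(days=self.ttl_days)

    def is_expired(self, now):
        return now >= self.expires_at()


def find_by_tag(catalogue, tag):
    """Return the names of all datasets carrying `tag`, sorted."""
    return sorted(m.name for m in catalogue if tag in m.tags)


if __name__ == "__main__":
    created = datetime(2016, 9, 14, tzinfo=timezone.utc)
    golden = DatasetMetadata("Golden_Data", created, ttl_days=30, tags={"golden"})
    golden.record_access("russ", "read", created)
    print(golden.expires_at(), find_by_tag([golden], "golden"))
```

With records like these, "is this the golden dataset or a stale copy?" becomes a lookup rather than a guess based on the file name.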


The richer your metadata collection is, the easier data governance becomes.


The more automated your metadata collection is, the easier data governance becomes.


The fewer humans that are involved in metadata collection, the easier data governance becomes.


Collect Metadata at a Single Layer Across All Hadoop Tools
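
One way to picture "a single layer" is a thin wrapper that every tool must go through, so that audit and lineage records are a side effect of normal reads and writes rather than something each user remembers to log. The sketch below is a toy in-memory version under that assumption; `TrackedStore` and its methods are hypothetical names, not how CDAP implements this, and a real platform would infer `derived_from` from the running program instead of taking it as an argument.

```python
from collections import defaultdict


class TrackedStore:
    """Hypothetical single access layer shared by every tool."""

    def __init__(self):
        self._data = {}
        self.audit_log = []              # (tool, action, dataset) tuples
        self.lineage = defaultdict(set)  # dataset -> direct upstream datasets

    def read(self, tool, dataset):
        # Every read is audited, whichever tool issued it.
        self.audit_log.append((tool, "read", dataset))
        return self._data[dataset]

    def write(self, tool, dataset, rows, derived_from=()):
        # Every write is audited and its sources are remembered.
        self.audit_log.append((tool, "write", dataset))
        self.lineage[dataset].update(derived_from)
        self._data[dataset] = list(rows)

    def upstream(self, dataset):
        """All datasets that `dataset` depends on, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            for parent in self.lineage.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


if __name__ == "__main__":
    store = TrackedStore()
    store.write("java", "Golden_Data", [1, 2, 3])
    rows = store.read("python", "Golden_Data")
    store.write("python", "Golden_Data_v1", [r * 2 for r in rows],
                derived_from=["Golden_Data"])
    print(store.upstream("Golden_Data_v1"))
    print(store.audit_log)
```

Because the Java, Python, and R users all pass through the same layer, the metadata is collected once and consistently, with no per-tool instrumentation.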


Cask Data Application Platform (CDAP)

Unified Integration Framework for Building and Running Data Applications on Hadoop and Spark

• 100% open source and highly extensible

• Supports all major Hadoop Distributions

• Integrates the latest big data technologies, including Kafka, YARN, Spark, Impala, Hive, Hive on Spark, Hive on Tez, etc.


A self-service data discovery tool to explore metadata, audits and lineage


Audit Logs


Lineage


Data Lifecycle


Data Lifecycle


Metadata Management


Metadata Management


Coming Soon

• Security based on metadata tags
• Enhanced Audit Log display
• Auto tagging datasets
• Data dictionary support (more metadata)


In Closing

• Data Governance is critical to the success of your cluster

• Multiple systems and tools complicate things in a Hadoop cluster

• Metadata is key to solving this in your cluster

• Tracker is working to solve governance in CDAP


Thanks!

Russ Savage @russellsavage

email : [email protected]