A company of Daimler AG ARE DATA LAKES THE NEW CORE DWHS? ANDREAS BUCKENHOFER, DAIMLER TSS DOAG BIG DATA, REPORTING, GEODATA DAYS - KASSEL 2017

Post on 28-Oct-2019


Page 1

ARE DATA LAKES THE NEW CORE DWHS?

ANDREAS BUCKENHOFER, DAIMLER TSS

DOAG BIG DATA, REPORTING, GEODATA DAYS - KASSEL 2017

A company of Daimler AG

Page 2

ABOUT ME

Andreas Buckenhofer
Senior DB [email protected]

Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Page 3

DAIMLER TSS. IT EXCELLENCE: COMPREHENSIVE, INNOVATIVE, CLOSE.

We're a specialist and strategic business partner for innovative IT solutions within Daimler, not just another supplier! As a 100% subsidiary of Daimler, we live the culture of excellence and aspire to take an innovative and technological lead. With our outstanding technological and methodical know-how, we are a competent provider of services that help those who benefit from them to stand out from the competition. When it comes to demanding IT questions, we create impetus, especially in the core fields of car IT and mobility, information security, analytics, shared services and Digital Customer Experience.

TSS 2020: ALWAYS ON THE MOVE.

Page 4

LOCATIONS

• Daimler TSS China, Hub Beijing: 6 employees
• Daimler TSS Malaysia, Hub Kuala Lumpur: 38 employees
• Daimler TSS India, Hub Bangalore: 16 employees
• Daimler TSS Germany: more than 1000 employees
  • Ulm (Headquarters)
  • Stuttgart Area: Böblingen, Echterdingen, Leinfelden, Möhringen
  • Berlin
  • Karlsruhe

Page 5

AGENDA

1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary

Page 6

DIGITIZATION – DATA AS AN ASSET FOR ANALYTICAL DECISIONS

• Software is becoming more and more important
  • 100 million lines of code
• Physical products
  • are significantly enhanced with digital service capabilities, e.g. the value of the car comes increasingly from digital assets
  • become digital services, e.g. car2go
• IoT, robotics, etc.

Source image: https://www.linkedin.com/pulse/20140626152045-3625632-car-software-100m-lines-of-code-and-counting

Page 7

DWH AS INTEGRATION SYSTEM FOR DIGITAL ASSETS: SOME OF TODAY’S MAIN CHALLENGES

Agility
• Is the organization ready? IT (Dev + Ops) and business

Flexibility
• Data modeling under pressure, model as you go
• New data formats coming from logs, sensors, etc.

Performance
• Right time
• Scale to high volumes
• Integrate data arriving at high speed

Page 8

IS THE DATA WAREHOUSE DEAD? AND ETL, TOO?

Sources:
https://www.linkedin.com/groups/45685/45685-6224210695295168512?trk=hp-feed-group-discussion&_mSplash=1
https://speakerdeck.com/nehanarkhede/etl-is-dead-long-live-streams
https://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx

Page 9

AGENDA

1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary

Page 10

REFERENCE DATA WAREHOUSE ARCHITECTURE

[Figure: reference data warehouse architecture. Labels: external data sources; internal data sources (OLTP); Data Warehouse with backend and frontend; Staging Layer (Input Layer); Integration Layer (Cleansing Layer); Core Warehouse Layer (Storage Layer); Aggregation Layer; Mart Layer (Output Layer / Reporting Layer); Metadata Management; Security; DWH Manager. Annotation: subject-oriented, integrated, time-variant, non-volatile.]


Page 12

[Image with terms: Data Lake on Hadoop, Data Swamp, Data Reservoir, Landing Zone, Data Library, Data Repository, Data Archive, Data Lake on Spark, Data Lake 3.0]

Page 13

DATA LAKE REFERENCE ARCHITECTURE: DATA LAKE OVERALL ARCHITECTURE VS DATA LAKE LAYER

[Figure: data lake reference architecture. Layers: Landing Zone, Data Lake, Data Reservoir / Presentation, Data Archival. Cross-cutting: Data Governance, Metadata Management, Data Security.]

Page 14

DATA LAKE REFERENCE ARCHITECTURE

[Figure: data lake reference architecture. Layers: Landing Zone, Data Lake, Data Reservoir / Presentation, Data Archival. Cross-cutting: Data Governance, Metadata Management, Data Security. Additional labels: Sources, Firewall (twice), Sqoop, Kafka, Knox, REST API, ODBC/JDBC, RESTful client.]

Page 15

DATA LAKE VS HADOOP

• Data Lake: architecture, concept
• Hadoop, Spark, Elastic Stack: tools (that can be used to implement a Lake)

Page 16

HOW TO STRUCTURE THE DATA LAKE? SCHEMA-LESS REVOLUTION?

• Data has a structure: schema-less does not exist
• You apply
  • schema-on-read, e.g. copy files (csv, json, html, …) into HDFS
  • schema-on-write, e.g. create table on data files in HDFS
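The point that data always has a structure can be illustrated with a minimal Python sketch of schema-on-read. It is only an illustration under stated assumptions: the sample payloads, the field names (`vin`, `ts`, `soc`) and the `SCHEMA` mapping are hypothetical and do not come from the slides. The raw text is stored unchanged, and the reader decides at access time which records fit.

```python
import csv
import io
import json

# Hypothetical raw payloads, as they might be copied unchanged into a landing directory.
RAW_CSV = "vin,ts,soc\nWDB123,2017-05-01T08:00:00,81\nWDB456,2017-05-01T08:05:00,n/a\n"
RAW_JSON = '{"vin": "WDB789", "ts": "2017-05-01T08:10:00", "soc": 77}\n'

# The schema lives in the reader: column names and the type each value must have.
SCHEMA = {"vin": str, "ts": str, "soc": int}


def apply_schema(record):
    """Cast one raw record to SCHEMA; raises KeyError or ValueError if it does not fit."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


def read_with_schema(raw_csv, raw_json):
    """Schema-on-read: interpret the stored text only when it is consumed."""
    good, rejected = [], []
    records = list(csv.DictReader(io.StringIO(raw_csv)))
    records += [json.loads(line) for line in raw_json.splitlines() if line]
    for record in records:
        try:
            good.append(apply_schema(record))
        except (KeyError, ValueError):
            rejected.append(record)
    return good, rejected


good, rejected = read_with_schema(RAW_CSV, RAW_JSON)
print(len(good), "records fit the schema,", len(rejected), "rejected")
```

With schema-on-write, the same check would run once before the data is stored; here every reader has to repeat it on each access.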

Page 17

SCHEMA-ON-READ

Flexibility
• For whom? Writing the data vs reading the data

Simplicity
• For whom? Writing the data vs reading the data
• Human mistakes while trying to read the data

Agility / Model as you go
• Just copy files into the directory

Page 18

LAMBDA ARCHITECTURE: AN EARLY COMPREHENSIVE BIG DATA ARCHITECTURE

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015

• The complexity of the Lambda architecture is debatable
• More interesting is the authors’ view on data
  • Rawness: Store the data as it is. No transformations.
  • Immutability: Don’t update or delete data, just add more.
• Graph-like schema recommended
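The immutability principle ("don't update or delete data, just add more") can be sketched as an append-only log in Python. This is an illustrative sketch, not the implementation from the cited book; the `Fact` and `AppendOnlyLog` names and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: a stored fact cannot be modified afterwards
class Fact:
    entity: str
    attribute: str
    value: str
    recorded_at: datetime


class AppendOnlyLog:
    """Hypothetical master dataset: facts are only ever appended."""

    def __init__(self):
        self._facts = []

    def append(self, fact):
        self._facts.append(fact)

    def history(self, entity, attribute):
        """All facts for one attribute, oldest first."""
        matches = [f for f in self._facts
                   if f.entity == entity and f.attribute == attribute]
        return sorted(matches, key=lambda f: f.recorded_at)

    def current(self, entity, attribute):
        """Derive the latest value at read time; older facts stay available."""
        facts = self.history(entity, attribute)
        return facts[-1].value if facts else None


log = AppendOnlyLog()
log.append(Fact("vehicle-1", "location", "Ulm", datetime(2017, 5, 1, 8, 0)))
log.append(Fact("vehicle-1", "location", "Kassel", datetime(2017, 5, 2, 9, 0)))
print(log.current("vehicle-1", "location"))
```

A "change" is recorded as a new fact with a later timestamp, so the current state is a derived view and the full history remains auditable.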

Page 19

LAMBDA ARCHITECTURE

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015

“Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs”
(see page 103, Nathan Marz, “Big Data: Principles and best practices of scalable realtime data systems”, Manning Publications)

Page 20

STRUCTURING THE DATA LAKE: DATA SECURITY

Just dumping data into the Lake?
• General Data Protection Regulation, e.g. Privacy by Design
• The vehicle identifier (VIN) is already sensitive data that needs to be protected (anonymized) depending on usage
• Earmarked use of data

Schema-on-read: How do you protect data assets if you are not aware that the data exists or where it exists?
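One way to protect a VIN before it reaches analysts is keyed hashing, sketched below in Python with the standard `hmac` and `hashlib` modules. This is an assumption-laden illustration, not the approach described in the talk: the key handling, the sample VIN and the function name are hypothetical, and a keyed hash is pseudonymization rather than full anonymization, because whoever holds the key can recompute the mapping.

```python
import hashlib
import hmac

# Assumption: the key is managed outside the lake (for example in a secrets store).
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize_vin(vin, key=SECRET_KEY):
    """Return a stable keyed hash of the VIN (pseudonymization, not anonymization)."""
    normalized = vin.strip().upper().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


record = {"vin": "WDB1234567F000001", "soc": 81}  # hypothetical sample record
protected = {**record, "vin": pseudonymize_vin(record["vin"])}
print(protected)
```

Because the hash is stable, records of the same vehicle can still be joined, which only works if the sensitive column is known in the first place. That is the catalog problem the slide raises for schema-on-read.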

Page 21

DATA LAKE REFERENCE ARCHITECTURE

[Figure: data lake reference architecture. Layers: Landing Zone, Data Lake, Data Presentation, Data Archival. Cross-cutting: Data Governance, Metadata Management, Data Security. Arrow labels: load, structure, transform, archive (three times), access. Annotations: temporary storage; immutable, modeled data, tool neutral; structured data for fast access; raw data.]

Page 22

DATA LAKE HAS LAYERS (1): DATA LAKE AS CONCEPT VS DATA LAKE AS LAYER

Distinguish Data Lake as overall concept vs Data Lake as a layer
• Landing Zone
  • Source data programmatically loaded
  • Data is partitioned for processing
  • Governance includes catalog and ILM (Security, Retention)
• Data Lake
  • Lightly integrated by keys
  • Data accessible via SQL-on-Hadoop or using SerDes on raw data
  • Data is partitioned for access
  • Governance includes catalog, ILM, lightweight model

Page 23

DATA LAKE HAS LAYERS (2)

• Presentation Zone
  • Data is structured and partitioned/tuned for data access
  • Full governance including e.g. catalog, ILM, model
    • Known schema including metadata about tables and columns
    • Lineage
    • Documented quality

Page 24

GOVERNANCE BY DAIMLER AG / COE: E.G. SAMPLE HDFS LAYOUT

/
  Landing_zone: scripts, data, Source_system
  Data_archive: scripts, data, Source_system
  Data_lake: scripts, data, Source_system_object
  Data_science_results: model, data
  Data_reservoir: scripts, data, Use_case
  Data_science_sandbox: scripts, data

Page 25

AGENDA

1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary

Page 26

USE CASES: WHAT IS THE BUSINESS PROBLEM TO SOLVE?

Source: http://www.azquotes.com/

Page 27

USE CASE: ANALYSIS OF BATTERY AGING

[Figure labels: max capacity, current capacity]

• CSV data ingested into HDFS, Hive tables on files
• Identify breaks (“> 8h”) and compute current drain
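The break detection step can be sketched in a few lines of Python. The slide gives no data format, so this sketch makes assumptions: readings are (timestamp, state of charge in percent) pairs sorted by time, a gap of more than 8 hours between two readings counts as a break, and drain is expressed as percentage points lost per hour of break. All sample values are hypothetical.

```python
from datetime import datetime, timedelta

BREAK_THRESHOLD = timedelta(hours=8)

# Hypothetical readings: (timestamp, state of charge in percent), sorted by time.
readings = [
    (datetime(2017, 5, 1, 18, 0), 80.0),
    (datetime(2017, 5, 2, 7, 0), 78.7),   # 13 h gap: counts as a break
    (datetime(2017, 5, 2, 7, 30), 76.0),  # 30 min gap: vehicle in use
    (datetime(2017, 5, 3, 6, 30), 73.7),  # 23 h gap: counts as a break
]


def find_breaks(rows, threshold=BREAK_THRESHOLD):
    """Yield (start, end, drain_per_hour) for each gap longer than threshold."""
    for (t0, soc0), (t1, soc1) in zip(rows, rows[1:]):
        gap = t1 - t0
        if gap > threshold:
            hours = gap.total_seconds() / 3600
            yield t0, t1, (soc0 - soc1) / hours


breaks = list(find_breaks(readings))
for start, end, drain in breaks:
    print(start, end, round(drain, 3))
```

On a cluster the same pairwise comparison would typically be written as a window function over the Hive table; the plain Python version only shows the logic.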

Page 28

STRUCTURING THE DATA LAKE: NEW DATA SOURCES – SENSOR DATA

• Sensor data formats change without notice
  • Sensors get regularly updated with new versions
  • Names of metrics may change
  • Sensors with various versions in the field
  • Sensors from different suppliers
• Often many fields (>>100), increasing with new sensor versions
• Easy storing of data in HDFS and applying schema later
• Data from robots, vehicles, …
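Changing metric names across sensor versions can be handled at read time with an alias map, as in the following Python sketch. The alias table, the canonical names and the sample readings are hypothetical; the slides do not name any concrete metrics.

```python
# Hypothetical alias map: metric names seen across sensor versions -> canonical name.
ALIASES = {
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "temp_celsius": "temperature_c",
    "volt": "voltage_v",
    "voltage": "voltage_v",
}


def normalize(reading):
    """Map known aliases to canonical names; keep unknown fields for later review."""
    known, unknown = {}, {}
    for name, value in reading.items():
        canonical = ALIASES.get(name.lower())
        if canonical is None:
            unknown[name] = value
        else:
            known[canonical] = value
    return known, unknown


v1 = {"temp": 21.5, "volt": 12.1}                             # older sensor version
v2 = {"Temperature": 22.0, "voltage": 12.0, "humidity": 40}   # newer sensor version
print(normalize(v1))
print(normalize(v2))
```

Unknown fields are kept rather than dropped, so a new sensor version does not silently lose data; someone still has to extend the alias map, which is the data engineering effort the talk points to.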

Page 29

STRUCTURING THE DATA LAKE: “SCHEMA-ON-READ”

• Sensor data formats change without notice
• Time-consuming and error-prone data integration into the Data Lake
• Therefore preparation of data for usage in the Data Reservoir is required: “Data Engineer”

[Figure: data lake reference architecture with Landing Zone, Data Lake, Data Reservoir, Data Archival and the cross-cutting Data Governance, Metadata Management, Data Security. Labels on the data flow: csv; sampling / filter; Hive tables (twice); structure; R, Python.]

Page 30

USE CASE: OPTIMIZE CYCLE TIME FOR LIGHTWEIGHT ROBOTS

• JSON data from Orient NoSQL-DB ingested into HDFS, Hive tables on files
• Partly automate the diagnosis of anomalies (e.g. the identification of reasons for idle times)

Page 31

USE CASE: BOM EXPLOSION, HADOOP COMPUTING POWER

Page 32

USE CASE: BOM EXPLOSION, HADOOP COMPUTING POWER

• PLMXML files supplied by source systems
• Compute changes by comparing last BOM with current BOM
• Data Lake contains data across all tiers
• Data Reservoir contains “dedicated, secured” views for tiers
• Transfer changes to local relational DBs
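Comparing the last BOM with the current BOM amounts to a snapshot diff. The Python sketch below assumes each snapshot has already been parsed into a flat `{part_number: quantity}` mapping; the part numbers are hypothetical, and PLMXML parsing as well as the multi-level structure of a real BOM are left out.

```python
def diff_bom(previous, current):
    """Compare two BOM snapshots given as {part_number: quantity} dicts."""
    added = {p: q for p, q in current.items() if p not in previous}
    removed = {p: q for p, q in previous.items() if p not in current}
    changed = {p: (previous[p], q) for p, q in current.items()
               if p in previous and previous[p] != q}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots; real input would first be parsed from the PLMXML files.
last_bom = {"A100": 1, "B200": 4, "C300": 2}
current_bom = {"A100": 1, "B200": 6, "D400": 1}
changes = diff_bom(last_bom, current_bom)
print(changes)
```

Only the resulting delta would then need to be transferred to the local relational databases.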

Page 33

STRUCTURING THE DATA LAKE LAYER: EXISTING INTERNAL DATA FOR ANALYTICS

• Several stakeholders, e.g. different (independent) truck units
• Dumping existing systems (or new data sources like logs) into the Data Lake
  • Data is available fast, but
  • Different data models
  • No integration: if ETL is reduced to EL, then T is performed by Data Scientists many times
• Some lightweight data integration required: Data Vault

Page 34

STRUCTURING THE DATA LAKE LAYER: DATA VAULT CHALLENGES WITH HADOOP

• Hub and Link tables: how to ensure uniqueness?
  • No unique constraints or indexes as in an RDBMS
  • Use a view with DISTINCT or GROUP BY on the Hub or Link table
  • Don’t create a Hub or Link table; create a view with DISTINCT or GROUP BY on the original persisted incoming files
  • Use the HBase NoSQL wide-column store for Hub, Link (+ Sat) and Phoenix for SQL access via Hive
  • Hub and Link in RDBMS only
• Data Reservoir needs a different structure, or export data into a Data Mart in an RDBMS for faster access
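The "view with GROUP BY instead of a physical Hub table" option can be sketched with Python's built-in `sqlite3` module. SQLite only stands in for a SQL-on-Hadoop engine here, and the table, view and column names (`staged_vehicle`, `hub_vehicle`, `vin`, `load_ts`) are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stand-in for persisted incoming files: no unique constraint, duplicates allowed.
con.execute("CREATE TABLE staged_vehicle (vin TEXT, load_ts TEXT, source TEXT)")
con.executemany(
    "INSERT INTO staged_vehicle VALUES (?, ?, ?)",
    [
        ("WDB1", "2017-05-01", "plant_a"),
        ("WDB1", "2017-05-02", "plant_a"),  # same business key delivered again
        ("WDB2", "2017-05-02", "plant_b"),
    ],
)
# The hub is a view: one row per business key with its first load timestamp.
con.execute(
    """
    CREATE VIEW hub_vehicle AS
    SELECT vin, MIN(load_ts) AS first_load_ts
    FROM staged_vehicle
    GROUP BY vin
    """
)
hub_rows = con.execute(
    "SELECT vin, first_load_ts FROM hub_vehicle ORDER BY vin"
).fetchall()
print(hub_rows)
```

Uniqueness is guaranteed by the query rather than by a constraint, at the cost of recomputing the grouping on every access.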

Page 35

DATA LAKE IN ANALOGY TO AN ENTERPRISE DWH?

• Vision: One central Enterprise DWH
• Reality for many organizations: Many DWHs
  • more flexible
  • acquisition of companies. Merge of systems?
  • units with different (innovation) speeds and different interests, e.g. trucks (Mercedes-Benz LKW, Freightliner, Fuso, BharatBenz, Western Star, Fleetboard)
  • legal requirements (e.g. data export)
• Vision: One central Data Lake
• Reality: ?

Page 36

BARRY DEVLIN – LOGICAL DATA WAREHOUSE

“The long-term vision was clear – the data warehouse should not be confined physically to a single database or machine” (09-MAR-2017)

Source: https://upside.tdwi.org/articles/2017/03/09/making-the-most-of-a-logical-data-warehouse.aspx

Barry Devlin wrote the first published article describing a data warehouse architecture in 1988 (http://www.9sight.com/1988/02/art-ibmsj-ebis/)

Page 37

AGENDA

1. Introduction/Motivation
2. From the classic DWH architecture to the Data Lake
3. Data Lake usage scenarios
4. Summary

Page 38

WHY DATA MODELING?

“Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application.”

• Learn about the data and promote collective data understanding
• Derive security classification and measures
• Design for performance
• Accelerate development
• Improve software quality
• Reduce maintenance costs
• Generate code
• NoSQL schema-on-read: understand model versions after years

Source quote: Steve Hoberman: Data Modeling for Mongo DB, Technics Publications 2014

Page 39

DWH AND DATA LAKE

DWH on RDBMS: Methods, Concepts, Techniques
• Slowly Changing Dimension
• ELT vs ETL
• 3-Layer vs 2-Layer
• Kimball Approach
• Inmon Definition
• Star Schema
• Data Vault
• Anchor Modeling
• etc.

Data Lake on Hadoop: Tools, Tools, Tools
• Schema-on-Read
• Agility
• Parquet
• Hive
• HBase
• SQL-on-Hadoop
• Impala
• Oozie
• ZooKeeper

Page 40

NO DATA INTEGRATION - IS ETL DEAD? DATA SCIENCE REQUIRES PROPER DATA ENGINEERING

Many ETL problems are home-made, e.g.
• Inefficient: ETL vs ELT / row-based vs set-based
• Expensive: repetitive tasks should be accomplished with generators

“Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms; it’s the data collection and labeling.” Source: https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.ywjvuca6z (Luke de Oliveira)
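The row-based vs set-based contrast can be made concrete with Python's built-in `sqlite3` module. SQLite is only a stand-in for the DWH database, and the table names and the `RATE` factor are hypothetical. Both variants load the same result; the difference is where the transformation runs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage (id INTEGER, amount REAL)")
con.execute("CREATE TABLE core_row_based (id INTEGER, amount_eur REAL)")
con.execute("CREATE TABLE core_set_based (id INTEGER, amount_eur REAL)")
con.executemany("INSERT INTO stage VALUES (?, ?)", [(i, i * 1.5) for i in range(1000)])

RATE = 0.9  # hypothetical conversion factor applied during the transformation

# Row-based: fetch every row into the client, transform it, insert it one by one.
for row_id, amount in con.execute("SELECT id, amount FROM stage").fetchall():
    con.execute("INSERT INTO core_row_based VALUES (?, ?)", (row_id, amount * RATE))

# Set-based: a single statement; the engine transforms the whole set.
con.execute("INSERT INTO core_set_based SELECT id, amount * ? FROM stage", (RATE,))

count_rows = con.execute("SELECT COUNT(*) FROM core_row_based").fetchone()[0]
count_set = con.execute("SELECT COUNT(*) FROM core_set_based").fetchone()[0]
print(count_rows, count_set)
```

The row-based loop issues one statement per row and moves every value through the client, which is the kind of home-made inefficiency the slide refers to. The set-based statement leaves the work to the database engine.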

Page 41

IS THE CLASSICAL DWH DEAD? ARE DATA LAKES THE NEW CORE DWHS?

Data Lakes currently focus too much on tools instead of concepts and methods
• Tools come and go
• Flexibility / schema-on-read: integration is just postponed to the Data Reservoir or, in the worst case, even later to the end user

PoCs vs production-ready implementation
• Many tools, but still low-productivity tools (Oozie, etc.)
• Error handling is a coding nightmare across tools

Data Lakes and Core DWHs will coexist
• Another choice that makes sense for many use cases
• DWH: e.g. the Data Vault 2.0 architecture, which stores raw data and postpones data cleansing / harmonization for lightweight data integration, follows similar ideas

Page 42

THANK YOU

Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

Page 43

GARTNER DATA LAKE ARCHITECTURE STYLES

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Page 44

GARTNER DATA LAKE ARCHITECTURE STYLES

• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place
• Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.
• Data Science Lab: most suitable for data discovery and for developing new advanced analytics models

Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/ and https://www.asug.com/news/gartner-separate-data-lakes-myths-from-facts-before-you-dive-in

Page 45

IMAGE ATTRIBUTION

Slide 12: Creative Commons Licence, Hernán Piñera
https://www.flickr.com/photos/hernanpc/7175577368/in/photolist-bW5Hab-JF9HNW-a2LHAF-pwWNjx-oC1Jq8-noeV4d-oLsHUa-gUjhFx-qNB2Sw-jKLDCR-DB3B8-pRUpx2-crB6A7-nTUuNp-cXdPgN-bX7mA4-7oHeKJ-arQCtK-njdhWh-nSadX3-dykooG-sjSZHV-eq69Ux-oW44NF-i2eUbE-5AyaGL-QkmoFh-nU7KcU-QEG6Nf-oziZ4t-oUbQi4-e2NWAT-i3Yna1-eJchKZ-pGC8eC-GDux8r-5FQt95-cWdzfh-ciwtqL-jQg8BL-4X83Uc-nBZXBA-nogVER-oekb6A-9F7w4M-jKPnYQ-bAGrjd-qNB4Hq-8gJRqp-ahC2fg

Slide 47: Creative Commons Licence, James Loesch
https://www.flickr.com/photos/jal33/5182574275/in/photolist-8TY3LT-7M8Fb9-4jWYv1-hrdbHV-4jSWSn-6cHmvc-m4NnDV-s9Efoy-ccFCcW-5t3Csw-8R87fq-mT6WNq-89mMuL-pzzDjq-2iq7ti-bBA7PT-rjPdnX-buU2V9-aottwt-4zHTZv-mT6gA6-5hLzzx-9aWGiZ-s9DJRY-jwfgr3-7WZA75-bVmho1-bXkF7U-9aWGba-3mJSwv-sa4Esa-4jWZaA-aottqr-8bj7rS-5NiZbm-oowJXV-3vp25c-5t3EkQ-NnLMaJ-naLPJm-m78nWk-nqnUYk-mT7Wso-o54T1J-bVmgA9-emeyU1-5hQFV5-akhQQL-naLDim-pPeh93

Page 46

DWH = inflexible development, bad performance, complex architecture with 3 layers

Page 47

SCHEMA-ON-READ OR WHY MODELING CAN STILL BE USEFUL

• Failure to talk to business to obtain proper requirements
• Ingestion of wrong data
• Storage of data with errors
• Business keys (independent objects) nested into documents
• Read performance

Page 48

SCHEMA-ON-READ OR WHICH BUSINESS PROBLEMS ARE SOLVED

Data storage
• Schema-on-read: yes, flexible
• Remark: store data from various systems

Data integration
• Schema-on-read: no
• Remark: integrate data from various systems. Has to be done during each access by each user

Data historization
• Schema-on-read: yes, auditable
• Remark: stamp data with timestamp

Information delivery
• Schema-on-read: no
• Remark: turn data into valuable information. Has to be done during each access by each user

Page 49

DATA MODELS IN THE DWH

Staging Layer
• Characteristics: temporary storage; ingest of source data
• Data model: normally a 1:1 copy of the source table structure, usually without constraints and indexes

Core Warehouse Layer
• Characteristics: historization / bitemporal data; integration; tool-independent; non-redundant data storage
• Data model: 3NF with historization; Head and Version modeling; Data Vault; Anchor modeling; dimensional model with historization (possible)

Data Mart Layer
• Characteristics: performance for end user queries required; tool-dependent; lots of joins necessary to answer complex questions
• Data model: flat structures, esp. dimensional model (ROLAP / MOLAP / HOLAP)

Page 50

WHY MODEL?

• Understand business requirements
• Understand problem space
• Design solution space
• Think ideas (incl. alternatives) through

Page 51

MAKE SQL GREAT AGAIN OR WHY SQL ON BIG DATA?

• SQL is the universal language to access and manipulate data in an RDBMS
• SQL is a language not only for DBAs or developers
• SQL is the standard for OLTP and OLAP, especially for BI tools

Page 52

STRATA 2012 VS 2016

Source: http://www.cazena.com/blog/strata-word-cloud-2012-vs-2016-data-lakes-spark-real-time-and-other-trends

Page 53

ATLAS FOR METADATA MANAGEMENT

• Architecture with Atlas
• Supports the classical tools:
  • Hive
  • Sqoop
• HDFS?
• Schema-on-read?

Page 54

NO DATA INTEGRATION NECESSARY OR WHO REALLY UNDERSTANDS DATA MODELS?

• 3NF is inefficient for query processing
• 3NF models are difficult to understand
• 3NF gets even more complicated with history added
• Many ways from person to order

Source: Corr / Stagnitto: Agile Data Warehouse Design, DecisionOne Press, 2011, page 5

Page 55

WHY DATA MODELING?

“Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application.”

• Learn about the data and promote collective data understanding
• Derive security classification and measures
• Design for performance
• Accelerate development
• Improve software quality
• Reduce maintenance costs
• Generate code
• NoSQL schema-on-read: understand model versions after years

Source quote: Steve Hoberman: Data Modeling for Mongo DB, Technics Publications 2014

“Expanding your modeling skills enables you to reduce documentation.” Scott Ambler

Page 56

DIMENSIONAL MODELING

• Standard approach in Data Marts in DWH
• Not just for performance reasons
  • Performance is also an issue on Hadoop-based systems, e.g. Hive, Spark
  • Joins!
• But also due to understandability for end users
  • Understandability is also an issue on Hadoop-based systems
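A tiny star schema shows why dimensional models are considered easy to query: one fact table, a few dimension tables, and every question follows the same join pattern. The sketch uses Python's built-in `sqlite3` module as a stand-in for any SQL engine; the tables (`fact_trip`, `dim_vehicle`, `dim_date`) and their contents are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE dim_vehicle (vehicle_key INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_trip (vehicle_key INTEGER, date_key INTEGER, distance_km REAL);
    INSERT INTO dim_vehicle VALUES (1, 'Truck A'), (2, 'Truck B');
    INSERT INTO dim_date VALUES (20170501, 2017, 5), (20170601, 2017, 6);
    INSERT INTO fact_trip VALUES (1, 20170501, 120.0), (1, 20170601, 80.0),
                                 (2, 20170501, 200.0);
    """
)
# Every analytical question joins the fact table to the dimensions it needs.
rows = con.execute(
    """
    SELECT v.model, d.month, SUM(f.distance_km)
    FROM fact_trip f
    JOIN dim_vehicle v ON v.vehicle_key = f.vehicle_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY v.model, d.month
    ORDER BY v.model, d.month
    """
).fetchall()
print(rows)
```

The join path from fact to dimension is always one hop, which keeps both the query plan and the mental model simple compared with navigating a 3NF schema.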

Page 57

IMPORTANCE OF STRONG SCHEMA @GOOGLE

“A prime motivation for this evolution towards a more “database-like” system was driven by the experiences of Google developers trying to build on previous “key-value” storage systems. The prototypical example of such a key-value system is Bigtable, which continues to see massive usage at Google for a variety of applications. However, developers of many OLTP applications found it difficult to build these applications without a strong schema system, cross-row transactions, consistent replication and a powerful query language.”

Source: https://research.google.com/pubs/pub46103.html

Page 58

HADOOP VS CLASSIC DWH: SQL APPROACH

Tables
• Classic DWH: yes
• Hadoop: yes

SQL language
• Classic DWH: yes
• Hadoop: yes, SQL-on-Hadoop

Query optimizer
• Classic DWH: yes
• Hadoop: yes

Indexes, PKs
• Classic DWH: yes
• Hadoop: no

Data “owner”
• Classic DWH: proprietary RDBMS
• Hadoop: open data format; access by many engines like Spark, Hive; many open formats like Parquet, Avro

Metadata dictionary
• Classic DWH: user data + dictionary in RDBMS
• Hadoop: user data and dictionary (“Hive Metastore”) separate

Page 59

STRUCTURING THE DATA LAKE

New data sources
• Sensors, logs, NoSQL, etc. as data source
• Schema-on-read is useful because sensor data formats change frequently

Existing internal data
• Dump RDBMS exports into the Data Lake for data analytics
• Schema-on-read does not make any sense as the data is already in a documented data model