Delta Lake: Making Cloud Data Lakes Transactional and Scalable
Stanford University, 2019-05-15
Reynold Xin (@rxin)
About Me
- Databricks co-founder & Chief Architect
- Designed most major things in "modern day" Apache Spark
- #1 contributor to Spark by commits and net lines deleted
- PhD in databases from Berkeley
Building a data analytics platform is hard
Data streams → ???? → Insights
Traditional Data Warehouses
OLTP databases → ETL → Data Warehouse → SQL → Insights
Challenges with Data Warehouses
- ETL pipelines are often complex and slow: ad-hoc pipelines process data and ingest it into the warehouse, and there are no insights until the daily data dumps have been processed
- Performance is expensive: scaling up/out usually comes at a high cost
- Workloads are often limited to SQL and BI tools: data is in proprietary formats, and it is hard to integrate streaming, ML, and AI workloads
Dream of Data Lakes
Data streams → Data Lake (scalable ETL, SQL, streaming, ML & AI) → Insights
Data Lakes + Spark = Awesome!
Data streams → Data Lake + Spark (Structured Streaming; SQL, ML, streaming) → Insights
The 1st unified analytics engine
Advantages of Data Lakes
- ETL pipelines: from complex and slow to simpler and faster. A unified Spark API between batch and streaming simplifies ETL, and raw unstructured data becomes available as structured data in minutes.
- Performance: from expensive to cheaper. Easy and cost-effective to scale out compute and storage.
- Workloads: no longer limited to anything! Data lives in files with open formats, and integrates with data processing and BI tools as well as ML and AI workloads and tools.
Challenges of Data Lakes in practice
Evolution of a Cutting-Edge Data Pipeline
Events → Data Lake → Streaming Analytics and Reporting (how?)
Challenge #1: Historical Queries?
Solution (1): a λ-architecture. Events feed both a streaming branch for Streaming Analytics and a batch branch into the Data Lake for Reporting.
Challenge #2: Messy Data?
Solution (2): add validation steps on both branches of the λ-architecture (1) so bad records are caught before they reach Streaming Analytics and Reporting.
Challenge #3: Mistakes and Failures?
Solution (3): partition the Data Lake and add a reprocessing path so mistaken or failed runs can be recomputed per partition, on top of validation (2) and the λ-architecture (1).
Challenge #4: Query Performance?
Solution (4): compact small files, and schedule jobs to avoid running during compaction, on top of reprocessing (3), validation (2), and the λ-architecture (1).
Data Lake Reliability Challenges
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of schema enforcement creates inconsistent and low-quality data
- Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads
Data Lake Performance Challenges
- Too many small (or very big) files: more time is spent opening and closing files than reading their content (worse with streaming)
- Partitioning, aka "poor man's indexing", breaks down when data has many dimensions and/or high-cardinality columns
- Neither storage systems nor processing engines are good at handling very large numbers of subdirectories/files
Figuring out what to read is too slow
Data integrity is hard
Band-aid solutions made it worse!
Everyone has the same problems
THE GOOD OF DATA LAKES
- Massive scale-out
- Open formats
- Mixed workloads

THE GOOD OF DATA WAREHOUSES
- Pristine data
- Transactional reliability
- Fast SQL queries
DELTA
The LOW LATENCY of streaming
The RELIABILITY & PERFORMANCE of data warehouses
The SCALE of data lakes
DELTA
Scalable storage + Transactional log
DELTA
Scalable storage
Transactional log
pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+ ...
|
+---- _delta_log/
      +---- 000.json
      +---- 001.json
      ...

Table data is stored as Parquet files on HDFS, AWS S3, or Azure Blob Store. The transactional log is a sequence of metadata files that track the operations made on the table, stored in scalable storage along with the table.
Log Structured Storage
Changes to the table are stored as ordered, atomic commits. Each commit is a file of actions in the _delta_log directory:

000.json (INSERT actions):
  Add 001.parquet
  Add 002.parquet

001.json (UPDATE actions):
  Remove 001.parquet
  Remove 002.parquet
  Add 003.parquet
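The replay rule above can be sketched in a few lines of Python (a toy model, not Delta's actual log schema): the live contents of the table at any version are the result of applying the add/remove actions of every commit, in order.

```python
# Toy model of a Delta-style commit log: each commit is an ordered list of
# add/remove actions, and a table version's live file set is the result of
# replaying every commit in order.
def replay(commits):
    """Apply ordered commits; return the set of live data files."""
    live = set()
    for commit in commits:
        for action in commit:
            if action["op"] == "add":
                live.add(action["path"])
            elif action["op"] == "remove":
                live.discard(action["path"])
    return live

# 000.json: INSERT adds two files; 001.json: UPDATE rewrites them into one.
commit_000 = [{"op": "add", "path": "001.parquet"},
              {"op": "add", "path": "002.parquet"}]
commit_001 = [{"op": "remove", "path": "001.parquet"},
              {"op": "remove", "path": "002.parquet"},
              {"op": "add", "path": "003.parquet"}]

print(sorted(replay([commit_000])))              # ['001.parquet', '002.parquet']
print(sorted(replay([commit_000, commit_001])))  # ['003.parquet']
```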
Log Structured Storage
Readers process the log in atomic units, and thus read consistent snapshots: with the commits above, readers will see either [001+002].parquet or 003.parquet, and nothing in-between.
Mutual Exclusion
Concurrent writers need to agree on the order of changes, so new commit files must be created mutually exclusively.

After 000.json and 001.json, Writer 1 and Writer 2 both try to write 002.json concurrently: only one of them must succeed.
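A local sketch of this rule (a hypothetical `try_commit` helper, not Delta's API), using the filesystem's create-if-absent flag as a stand-in for a cloud store's put-if-absent primitive:

```python
import os
import tempfile

# Toy model of mutually exclusive commit-file creation: O_CREAT|O_EXCL makes
# the create atomic, so exactly one writer can produce a given version.
def try_commit(log_dir, version, payload):
    """Atomically create <version>.json; return True iff this writer won."""
    path = os.path.join(log_dir, "%03d.json" % version)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer already committed this version
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

log_dir = tempfile.mkdtemp()
print(try_commit(log_dir, 2, '{"add": "003.parquet"}'))  # True  (writer 1 wins)
print(try_commit(log_dir, 2, '{"add": "004.parquet"}'))  # False (writer 2 must retry at version 3)
```

The losing writer does not lose its data files, only the commit; it re-checks for conflicts and retries at the next version number.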
Challenges with cloud storage
Different cloud storage systems have different semantics for providing atomic guarantees:

| Cloud Storage | Atomic file visibility | Atomic put-if-absent | Solution |
| --- | --- | --- | --- |
| Azure Blob Store, Azure Data Lake | ✘ | ✔ | Write to a temp file, rename to the final file if not present |
| AWS S3 | ✔ | ✘ | Separate service performs all writes directly (single writer) |
Concurrency Control
Pessimistic concurrency: block others from writing anything; hold a lock, write the data files, commit to the log.
✔ Avoids wasted work
✘ Requires distributed locks

Optimistic concurrency: assume it'll be okay, write the data files, try to commit to the log, and fail on conflict. This is enough, as write concurrency is usually low.
✔ Mutual exclusion is enough!
✘ Breaks down if there are a lot of conflicts
Solving Conflicts Optimistically
1. Record the start version.
2. Record reads/writes.
3. If someone else wins, check if anything you read has changed.
4. Try again.

User 1 reads A and writes B; User 2 reads A and writes C. User 1 wins and commits 000001.json; the new file C does not conflict with the new file B, so User 2 retries and commits successfully as 000002.json.
Solving Conflicts Optimistically
1. Record the start version.
2. Record reads/writes.
3. If someone else wins, check if anything you read has changed.
4. Try again.

User 1 reads A and writes A and B; User 2 reads A and writes A and C. User 1 wins and commits 000001.json; the deletion of file A by User 1 conflicts with the deletion by User 2, so User 2's operation fails.
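Both outcomes follow from one simple check, sketched here as a toy `conflicts` helper (an illustration of the idea, not Delta's actual conflict-detection code):

```python
# Toy conflict check for optimistic concurrency: a transaction records the
# files it read; when another writer wins the version, the transaction
# conflicts only if the winner removed something it read.
def conflicts(read_set, winning_actions):
    """True if the winning commit invalidated anything this txn read."""
    removed = {a["path"] for a in winning_actions if a["op"] == "remove"}
    return bool(read_set & removed)

# Winner only added B; a transaction that read A can safely retry.
print(conflicts({"A"}, [{"op": "add", "path": "B"}]))  # False
# Winner rewrote A (remove A, add B); a transaction that read A must fail.
print(conflicts({"A"}, [{"op": "remove", "path": "A"},
                        {"op": "add", "path": "B"}]))  # True
```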
Metadata/Checkpoints as Data
Large tables can have millions of files in them! Even pulling the file list out of a Hive metastore (MySQL) would be a bottleneck.

Add 1.parquet
Add 2.parquet, Remove 1.parquet
Remove 2.parquet, Add 3.parquet
→ Checkpoint

The log is periodically rolled up into a checkpoint, stored as data alongside the table, so readers do not have to replay the whole log.
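A sketch of how a reader would use such a checkpoint (a toy model: here the checkpoint is just the live-file set at some version, and only the commits after it are replayed):

```python
# Toy model of checkpointed log reading: start from the checkpoint's full
# live-file set instead of replaying the entire commit history.
def snapshot(checkpoint, tail_commits):
    """Live file set = checkpoint contents + replay of commits after it."""
    live = set(checkpoint)
    for commit in tail_commits:
        for action in commit:
            if action["op"] == "add":
                live.add(action["path"])
            else:  # "remove"
                live.discard(action["path"])
    return live

# Checkpoint taken after the first two commits; one commit follows it.
checkpoint = {"2.parquet"}  # state after Add 1.parquet; Add 2.parquet, Remove 1.parquet
tail = [[{"op": "remove", "path": "2.parquet"},
         {"op": "add", "path": "3.parquet"}]]
print(sorted(snapshot(checkpoint, tail)))  # ['3.parquet']
```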
Challenges solved: Reliability
Problem: failed production jobs leave data in a corrupt state, requiring tedious recovery.
Solution: failed write jobs do not update the commit log, so partial or corrupt files are never visible to readers.
Challenges solved: Reliability
Challenge: lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads.
Solution: all reads have full snapshot consistency, and all successful writes are consistent. In practice, most writes don't conflict. Isolation levels are tunable (serializability by default).
Challenges solved: Reliability
Challenge: lack of schema enforcement creates inconsistent and low-quality data.
Solution: the schema is recorded in the log; attempts to commit data with an incorrect schema fail; explicit schema evolution is allowed; invariant and constraint checks keep data quality high.
Challenges solved: Performance
Challenge: too many small files increase resource usage significantly.
Solution: transactionally performed compaction using OPTIMIZE:

OPTIMIZE table WHERE date = '2019-04-04'
Challenges solved: Performance
Challenge: partitioning breaks down with many dimensions and/or high-cardinality columns.
Solution: multi-dimensional clustering on multiple columns:

OPTIMIZE conns WHERE date = '2019-04-04' ZORDER BY (srcIP, destIP)
Querying connection data at Apple
Ad-hoc queries over connection data (columns: date, srcIp, dstIp; PBs of data, trillions of rows) filter on different columns:

SELECT count(*) FROM conns WHERE date = '2019-04-04' AND srcIp = '1.1.1.1'
SELECT count(*) FROM conns WHERE date = '2019-04-04' AND dstIp = '1.1.1.1'

Partitioning on these columns is bad, as their cardinality is high.
Multidimensional Sorting
[8×8 grid of (srcIp, dstIp) cells, queried by srcIp = '1.1.1.1' and by dstIp = '1.1.1.1']
[Same grid, split into files; ideal file size = 4 rows]
Sorted by srcIp first: the srcIp = '1.1.1.1' query reads only 2 files.
But the dstIp = '1.1.1.1' query reads 8 files: sorting is great for the major dimension, not for the others.
Multidimensional Clustering
[Same 8×8 (srcIp, dstIp) grid, laid out along a Z-order space-filling curve]
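The Z-order curve can be sketched by bit-interleaving the two clustering columns (a toy model of the idea; Delta's actual implementation differs):

```python
# Toy Z-ordering: interleave the bits of the two clustering columns so that
# points close in either dimension tend to land in the same file, giving
# usable min/max pruning on both srcIp and dstIp.
def z_value(x, y, bits=3):
    """Interleave the bits of x and y into a single Z-order sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bit -> odd position
        z |= ((y >> i) & 1) << (2 * i)      # y bit -> even position
    return z

# Sort the 8x8 (srcIp, dstIp) grid along the Z curve.
pts = sorted(((x, y) for x in range(8) for y in range(8)),
             key=lambda p: z_value(*p))
print(pts[:4])  # [(0, 0), (0, 1), (1, 0), (1, 1)]: a 2x2 block, so a file
                # of these rows can be pruned by min/max on either column.
```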
With Z-order clustering, each query reads 4 files: reasonably good for all dimensions.
Data Pipeline @ Apple (> 100 TB new data/day, > 300B events/day)

Sources:
- Security infra: IDS/IPS, DLP, antivirus, load balancers, proxy servers
- Cloud infra & apps: AWS, Azure, Google Cloud
- Server infra: Linux, Unix, Windows
- Network infra: routers, switches, WAPs, databases, LDAP

Goals:
- Detect signal across user, application, and network logs
- Quickly analyze the blast radius with ad-hoc queries
- Respond quickly in an automated fashion
- Scale across petabytes of data and hundreds of security analysts
The old pipeline: messy data not ready for analytics lands in DATALAKE1; a dump plus complex ETL feeds DATALAKE2; separate warehouses (DW1, DW2, DW3) serve each type of analytics: Incident Response, Alerting, and Reports.
The cost of this pipeline:
- Took 20 engineers + 24 weeks
- Hours of delay in accessing data
- Very expensive to scale
- Only 2 weeks of data, in proprietary formats
- No advanced analytics (ML)
Data Pipeline @ Apple with Delta: Structured Streaming replaces the dump and complex ETL, landing data in DELTA, which serves SQL, ML, and streaming for Incident Response, Alerting, and Reports.
- Took 2 engineers + 2 weeks
- Data usable in minutes/seconds
- Easy and cheaper to scale
- Stores 2 years of data in open formats
- Enables advanced analytics
Current ETL pipeline at Databricks
Every stage, λ-arch (1), validation (2), reprocessing (3), and compaction (4), now runs on DELTA tables feeding Streaming Analytics and Reporting:
- Easy, as short-term and long-term data are in one location
- Easy and seamless with Delta's transactional guarantees
- Not needed: Delta handles both short-term and long-term data
Easy to use Delta with Spark APIs
Instead of parquet:

CREATE TABLE ...
USING parquet
...

dataframe.write.format("parquet").save("/data")

... simply say delta:

CREATE TABLE ...
USING delta
...

dataframe.write.format("delta").save("/data")
DELTA
- MASSIVE SCALE: scalable compute & storage
- RELIABILITY: ACID transactions & data validation
- PERFORMANCE: data indexing & caching (10-100x)
- LOW LATENCY: integrated with Structured Streaming
- OPEN: open source & data stored as Parquet
Questions?