From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015
![Page 1: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/1.jpg)
From DataFrames to Tungsten: A Peek into Spark’s Future
Reynold Xin (@rxin)
Spark Summit, San Francisco, June 16th, 2015
![Page 2: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/2.jpg)
DataFrame (noun): making Spark accessible to everyone (data scientists, engineers, statisticians, …)
![Page 3: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/3.jpg)
Tungsten (noun): making Spark faster and preparing it for the next five years.
![Page 4: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/4.jpg)
How do DataFrames and Tungsten relate to each other?
![Page 5: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/5.jpg)
Google Trends for “dataframe”
Single-node tabular data structure, with APIs for relational algebra (filter, join, …), math and stats, input/output (CSV, JSON, …), ad infinitum.
![Page 6: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/6.jpg)
Data frame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
![Page 7: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/7.jpg)
Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
Distributed data frame for Java, Python, R, and Scala.
APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn.
![Page 8: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/8.jpg)
[Figure: data size scale (KB, MB, GB, TB, PB) showing the ranges labeled "Existing Single-node Data Frames" and "Spark DataFrame"]
![Page 9: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/9.jpg)
It is not Spark vs Python/R, but Spark and Python/R.
![Page 10: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/10.jpg)
Spark and Python/R
Spark DF: scalability (multi-core, multi-machines)
Python/R DF: wealth of libraries (viz, machine learning, stats)
![Page 11: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/11.jpg)
Spark RDD Execution
Java/Scala API -> JVM Execution
Python API -> Python Execution
Both paths run opaque closures (user-defined functions).
![Page 12: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/12.jpg)
Spark DataFrame Execution
DataFrame -> Logical Plan -> (Catalyst optimizer) -> Physical Execution
Logical Plan: intermediate representation for computation
![Page 13: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/13.jpg)
Spark DataFrame Execution
Python DF, Java/Scala DF, R DF -> Logical Plan -> (Catalyst optimizer) -> Physical Execution
Logical Plan: intermediate representation for computation
Language DataFrames: simple wrappers to create the logical plan
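The idea that each language's DataFrame is only a thin wrapper that builds a logical plan can be sketched in a few lines of plain Python. This is a toy illustration, not Spark's or Catalyst's implementation: the node classes (`Scan`, `Filter`, `Project`), the `DataFrame` wrapper, and the single rewrite rule `push_filter_below_project` are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Tuple


# Hypothetical plan nodes (illustrative only, not Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    op: str
    value: float


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


class DataFrame:
    """Thin frontend wrapper: each method only builds a larger plan."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, column, op, value):
        return DataFrame(Filter(self.plan, column, op, value))

    def select(self, *columns):
        return DataFrame(Project(self.plan, tuple(columns)))


def push_filter_below_project(plan):
    """Toy rewrite rule: Filter(Project(x)) becomes Project(Filter(x)).

    Applied only when the filtered column is kept by the projection.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        if plan.column in proj.columns:
            pushed = push_filter_below_project(
                Filter(proj.child, plan.column, plan.op, plan.value)
            )
            return Project(pushed, proj.columns)
    if isinstance(plan, (Filter, Project)):
        # Rebuild the node around its rewritten child.
        return replace(plan, child=push_filter_below_project(plan.child))
    return plan


df = DataFrame(Scan("faithful")).select("eruptions", "waiting").filter("waiting", "<", 50)
optimized = push_filter_below_project(df.plan)
print(optimized)
```

Because the wrapper never executes anything itself, a frontend in another language would only need to produce the same plan nodes; the optimizer and the execution layer stay shared.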
![Page 14: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/14.jpg)
Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
![Page 15: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/15.jpg)
Performance
[Chart: runtime for an example aggregation workload with the RDD API, Java/Scala vs. Python; axis from 0 to 10]
![Page 16: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/16.jpg)
Benefit of Logical Plan: Performance Parity Across Languages
[Chart: runtime for an example aggregation workload (secs), axis from 0 to 10; RDD: Java/Scala, Python; DataFrame: Java/Scala, Python, R, SQL]
![Page 17: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/17.jpg)
What about Tungsten?
![Page 18: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/18.jpg)
Hardware Trends
Storage
Network
CPU
![Page 19: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/19.jpg)
Hardware Trends
2010
Storage 50+MB/s (HDD)
Network 1Gbps
CPU ~3GHz
![Page 20: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/20.jpg)
Hardware Trends
          2010             2015
Storage   50+MB/s (HDD)    500+MB/s (SSD)
Network   1Gbps            10Gbps
CPU       ~3GHz            ~3GHz
![Page 21: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/21.jpg)
Hardware Trends
          2010             2015              Change
Storage   50+MB/s (HDD)    500+MB/s (SSD)    10X
Network   1Gbps            10Gbps            10X
CPU       ~3GHz            ~3GHz             ☹
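The ratios on this slide follow directly from the 2010 and 2015 figures. A small sketch of the arithmetic, using the slide's numbers with the "+" and "~" qualifiers dropped (the dictionary keys and the `improvement` helper are names made up for this example):

```python
# Figures taken from the slide; "+" and "~" qualifiers are ignored for the arithmetic.
trends = {
    "Storage (MB/s)": (50, 500),   # HDD in 2010, SSD in 2015
    "Network (Gbps)": (1, 10),
    "CPU (GHz)": (3, 3),
}


def improvement(before, after):
    """Ratio of the 2015 figure to the 2010 figure."""
    return after / before


ratios = {name: improvement(*values) for name, values in trends.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}X")
```

Storage and network throughput each come out at 10X, while CPU clock speed comes out at 1X, which is the motivation for focusing on CPU efficiency.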
![Page 22: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/22.jpg)
Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
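To make item (1) concrete, here is a minimal sketch of runtime code generation in plain Python. It is only an illustration of the general idea, not how Tungsten is implemented (Spark's code generation targets the JVM). The tuple-based expression format and the helpers `interpret`, `to_source`, and `generate` are assumptions made up for this example.

```python
import operator

# Hypothetical expression tree: nested tuples such as
# ("<", ("col", "waiting"), ("lit", 50)).
OPS = {"<": operator.lt, ">": operator.gt, "+": operator.add, "*": operator.mul}


def interpret(expr, row):
    """Walk the tree for every row: flexible, but dispatches per node per row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the tree into a Python source fragment."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr):
    """Compile the expression once into a function specialized for it."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


expr = ("<", ("+", ("col", "waiting"), ("lit", 1)), ("lit", 50))
rows = [{"eruptions": 1.750, "waiting": 47}, {"eruptions": 3.600, "waiting": 79}]

predicate = generate(expr)
generated_results = [predicate(row) for row in rows]
interpreted_results = [interpret(expr, row) for row in rows]
print(to_source(expr), generated_results)
```

The generated function evaluates the whole expression as straight-line code, so the per-row cost of walking the tree and looking up operators is paid once at generation time instead of on every row.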
![Page 23: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/23.jpg)
From DataFrame to Tungsten
Python DF, Java/Scala DF, R DF -> Logical Plan -> Tungsten Execution
5PM: Deep Dive into Project Tungsten (Developer Track), by Josh Rosen
![Page 24: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/24.jpg)
Initial Performance Results
[Chart: run time (seconds, axis from 0 to 1200) vs. data set size (relative: 1x, 2x, 4x, 8x), comparing Tungsten-off and Tungsten-on]
![Page 25: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/25.jpg)
Unified API, One Engine, Automatically Optimized
Language frontend: Python, Java/Scala, R, SQL, …
DataFrame Logical Plan
Tungsten backend: LLVM, JVM, GPU, NVRAM, …
![Page 26: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/26.jpg)
Python, SQL, R, Streaming, Advanced Analytics
DataFrame
Tungsten Execution
![Page 27: From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032501/55b6e32abb61eb75268b4868/html5/thumbnails/27.jpg)
Spark Office Hours Today
Databricks booth A1
Topic Area
1:00-1:45 Core, YARN, Ops
1:45-2:30 Core/SQL/Data Science
3:00-3:40 Streaming
3:40-4:15 Core, Python, R
4:30-5:15 Machine Learning
5:15-6:00 Matei Zaharia