a look ahead at spark’s development...2015/10/29 · a look ahead at spark’s development...
TRANSCRIPT
![Page 1: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/1.jpg)
A look ahead at Spark’s development
Reynold Xin @rxinSpark Summit EU, AmsterdamOct 29th, 2015
![Page 2: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/2.jpg)
SQL Streaming MLlib
Spark Core (RDD)
GraphX
Spark stack diagram
![Page 3: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/3.jpg)
Frontend(user facing APIs)
Backend(execution)
Spark stack diagram(a different take)
![Page 4: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/4.jpg)
Frontend(RDD, DataFrame, ML pipelines, …)
Backend(scheduler, shuffle, operators, …)
Spark stack diagram(a different take)
![Page 5: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/5.jpg)
Last 12 months of Spark evolution
Frontend
DataFramesData sourcesRMachine learning pipelines…
Backend
Project TungstenSort-based shuffleNetty-based network…
![Page 6: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/6.jpg)
Last 12 months of Spark evolution
Frontend
DataFramesData sourcesRMachine learning pipelines…
Backend
Project TungstenSort-based shuffleNetty-based network…
![Page 7: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/7.jpg)
DataFrame:A Frontend Perspective
![Page 8: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/8.jpg)
Spark DataFrame
> head(filter(df, df$waiting < 50)) # an example in R## eruptions waiting##1 1.750 47##2 1.750 47##3 1.867 48
Scalable data frame for Java, Python, R, Scala
Similar APIs as single-node tools (Pandas, dplyr), i.e. easy to learn
![Page 9: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/9.jpg)
Spark RDD Execution
Java/Scalafrontend
JVMbackend
Pythonfrontend
Pythonbackend
opaque closures(user-defined functions)
![Page 10: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/10.jpg)
Spark DataFrame Execution
DataFramefrontend
Logical Plan
Physical execution
Catalystoptimizer
Intermediate representation for computation
![Page 11: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/11.jpg)
Spark DataFrame Execution
PythonDF
Logical Plan
Physicalexecution
Catalystoptimizer
Java/ScalaDF
RDF
Intermediate representation for computation
Simple wrappers to create logical plan
![Page 12: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/12.jpg)
Benefit of Logical Plan: Simpler Frontend
Python : ~2000 line of code (built over a weekend)
R : ~1000 line of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
![Page 13: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/13.jpg)
Performance
0 2 4 6 8 10
Java/Scala
Python
Runtime for an example aggregation workload
RDD
![Page 14: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/14.jpg)
Benefit of Logical Plan:Performance Parity Across Languages
0 2 4 6 8 10
Java/Scala
Python
Java/Scala
Python
R
SQL
Runtime for an example aggregation workload (secs)
DataFrame
RDD
![Page 15: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/15.jpg)
Tungsten:A Backend Perspective
![Page 16: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/16.jpg)
Hardware Trends
Storage
Network
CPU
![Page 17: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/17.jpg)
Hardware Trends
2010
Storage 50+MB/s(HDD)
Network 1Gbps
CPU ~3GHz
![Page 18: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/18.jpg)
Hardware Trends
2010 2015
Storage 50+MB/s(HDD)
500+MB/s(SSD)
Network 1Gbps 10Gbps
CPU ~3GHz ~3GHz
![Page 19: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/19.jpg)
Hardware Trends
2010 2015
Storage 50+MB/s(HDD)
500+MB/s(SSD) 10X
Network 1Gbps 10Gbps 10X
CPU ~3GHz ~3GHz L
![Page 20: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/20.jpg)
Project Tungsten
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation(2) Exploiting cache locality(3) Off-heap memory management
![Page 21: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/21.jpg)
From DataFrame to Tungsten
PythonDF
Logical Plan
Java/ScalaDF
RDF
TungstenExecution
Initial phase in Spark 1.5
More work coming in 2016
![Page 22: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/22.jpg)
3 Things to Look Forward To
![Page 23: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/23.jpg)
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Int)
val dataframe = read.json(“people.json”)val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith(“M”)).groupBy(“name”).avg(“age”)
![Page 24: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/24.jpg)
Dataset
“Encoder” to specify type informationso Spark can translate it into DataFrameand generate optimized memory layouts
Checkout SPARK-9999
Dataset[T]
DataFrame
encoder
![Page 25: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/25.jpg)
Streaming DataFrames
Easier-to-use APIs (batch, streaming, and interactive)
And optimizations:- Tungsten backends- native support for out-of-order data- data sources and sinks
val stream = read.kafka("...")stream.window(5 mins, 10 secs)
.agg(sum("sales"))
.write.jdbc("mysql://...")
![Page 26: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/26.jpg)
![Page 27: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/27.jpg)
3D XPoint
- DRAM latency- SSD capacity- Byte addressible
![Page 28: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/28.jpg)
Python Java/Scala RSQL …
DataFrameLogical Plan
LLVMJVM SIMD 3D XPoint
Unified API, One Engine, Automatically Optimized
Tungstenbackend
languagefrontend
…
![Page 29: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/29.jpg)
Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
AdvancedAnalytics
![Page 30: A look ahead at Spark’s development...2015/10/29 · A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015 SQL Streaming MLlib Spark](https://reader030.vdocuments.mx/reader030/viewer/2022041014/5ec5cfb34a29781b3c1abe30/html5/thumbnails/30.jpg)
Office Hours Today @ Databricks booth
Topic Area
10:30 – 11:30 Spark general (Reynold)
13:00 – 14:00 R and data science (Hossein)
13:30 – 14:30 machine learning (Joseph)
14:00 – 15:00 Spark, YARN, etc (Andrew)