Slides: shivaram.org/talks/sparkr-summit-2015.pdf

SparkR The Past, Present and Future
Shivaram Venkataraman

Big Data & R
DataFrames, Visualization, Libraries + Data

Big Data & R
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

1. Big Data, Small Learning
Diagram labels: Data; Cleaning, Filtering, Aggregation; Collect; Subset; DataFrames, Visualization, Libraries

2(a). Partition Aggregate: Parameter Tuning
Diagram labels: Data, Collect, Subset, Best Model Params
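
The parameter-tuning pattern on this slide can be sketched locally in base R. This is only an illustration of the idea, not SparkR code: the synthetic data, the candidate grid `degrees`, and the helper `evalDegree` are all hypothetical, and `lapply` stands in for whatever parallel primitive would distribute the fits across workers.

```r
# Hypothetical illustration in base R only; no Spark is involved.
set.seed(42)
n <- 200
x <- runif(n, -1, 1)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.1)
train <- data.frame(x = x[1:150], y = y[1:150])
valid <- data.frame(x = x[151:200], y = y[151:200])

# Candidate parameter values (assumed grid): polynomial degree.
degrees <- 1:4

# "Partition" step: each candidate is fitted and scored independently,
# so these calls could run on different workers.
evalDegree <- function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  pred <- predict(fit, newdata = valid)
  list(degree = d, mse = mean((valid$y - pred)^2))
}
results <- lapply(degrees, evalDegree)

# "Aggregate" step: collect the scores and keep the best parameter.
mses <- sapply(results, function(r) r$mse)
best <- results[[which.min(mses)]]
print(best$degree)
```

Because every call to `evalDegree` is independent, only the small list of scores has to be collected before picking the best parameter.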

2(b). Partition Aggregate: Model Averaging
Diagram labels: Data, Combine Models
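
A minimal base R sketch of model averaging, assuming a linear model and a simple element-wise mean of coefficients as the combine step. The slide does not say how models are combined, so the data, the four-way split, and the averaging rule below are assumptions made for illustration.

```r
# Hypothetical illustration in base R only; no Spark is involved.
set.seed(1)
n <- 400
df <- data.frame(b = rnorm(n), c = rnorm(n))
df$a <- 0.5 + 2 * df$b - 1 * df$c + rnorm(n, sd = 0.2)

# Partition: split the rows into 4 groups of equal size.
parts <- split(df, rep(1:4, length.out = n))

# Fit one model per partition; each fit only sees its own rows.
coefs <- lapply(parts, function(p) coef(lm(a ~ b + c, data = p)))

# Combine models: average the coefficient vectors element-wise.
avgCoef <- Reduce(`+`, coefs) / length(coefs)
print(round(avgCoef, 2))
```

Only the per-partition coefficient vectors are moved to the combine step, which is what keeps the pattern cheap when the data itself is large.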

3. Large Scale Machine Learning
Diagram labels: Data, Featurize, Learning, Model

Big Data & R
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
SparkR: Unified approach

Outline
Project History
Current Release
SparkR Future

Speed, Scalable, Flexible
Statistics, Visualization, DataFrames

RDD: Parallel Collection
Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …

R + RDD = RRDD
lapply, lapplyPartition, groupByKey, collect, cache, …
broadcast, includePackage
textFile

Example: Word Count
    library(SparkR)
    # sc is a Spark context created before this snippet runs
    lines <- textFile(sc, "hdfs://my_text_file")
    # Split each line into words
    words <- flatMap(lines, function(line) {
      strsplit(line, " ")[[1]]
    })
    # Pair each word with a count of 1
    wordCount <- lapply(words, function(word) {
      list(word, 1L)
    })
    # Sum the counts per word using 2 partitions, then fetch the result
    counts <- reduceByKey(wordCount, "+", 2L)
    output <- collect(counts)
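
To see what each step of the word count computes, the same pipeline can be traced on a small in-memory vector with base R alone. This is a local stand-in, not the SparkR RDD API: the input strings are made up, and `split` plus `vapply` play the role of `reduceByKey`.

```r
# Hypothetical local stand-in for the RDD word count; base R only.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# flatMap: split every line and flatten into one vector of words.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# lapply: turn each word into a (word, 1L) pair.
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey with "+": group the 1s by word and sum each group.
keys <- vapply(pairs, function(p) p[[1]], character(1))
vals <- vapply(pairs, function(p) p[[2]], integer(1))
counts <- vapply(split(vals, keys), sum, integer(1))

print(counts[["the"]])  # 3
print(counts[["fox"]])  # 2
```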

Initial Prototype
Standalone R package
Install from github

Open Source Development
1. Architecture
2. Usability

Architecture
Local: R Spark Context, R-JVM bridge, Java Spark Context
Worker: Spark Executor, exec R
Worker: Spark Executor, exec R

R-JVM Bridge
Layer to call JVM methods directly from R
Automatic argument serialization
    result <- callJStatic("sparkr.RRDD", "someMethod", arg1, arg2)

R-JVM Bridge
Use sockets for communication
Supported across platforms, languages
Diagram labels: R, JVM, Netty Server

Usability
Need for data inputs: read in CSV, JSON, JDBC, etc.
Need for a high-level API for data manipulation

SparkR DataFrames
    people <- read.df("people.json", "json")
    avgAge <- select(people, avg(people$age))
    head(avgAge)
DataSources API
Support for schema
dplyr-like syntax

SparkR DataFrames
Scala optimizations
Released in Spark 1.4!
[Bar chart: Time (s), axis from 0 to 3, for SparkR DataFrame, Scala DataFrame, Python DataFrame]
Demo: github.com/cafreeman/SparkR_DataFrame_Demo

SparkR Future

Big Data, Small Learning
SparkR DataFrames: read input, aggregation
Collect results, apply machine learning
Upcoming features:
Support for R transformations
More column functions (e.g. math, strings)

Partition Aggregate
Upcoming feature: simple, parallel API for SparkR
Ex: parameter tuning, model averaging
Integrated with DataFrames
Use existing R packages

Large Scale Machine Learning
Integration with MLlib
Support for GLM, KMeans, etc.
    model <- glm(a ~ b + c, data = df)

Large Scale Machine Learning: Key Features
DataFrame inputs
R-like formulas
Model statistics
    model <- glm(a ~ b + c, data = df)
    summary(model)
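
For comparison, the formula interface and model statistics the slide refers to look like this in base R. The sketch uses `stats::glm` on a small local `data.frame` with made-up columns `a`, `b`, `c`; it is not the SparkR/MLlib implementation, only the R idiom that the proposed API mirrors.

```r
# Hypothetical local example with base R's glm; no Spark is involved.
set.seed(7)
df <- data.frame(b = rnorm(100), c = rnorm(100))
df$a <- 1 + 0.5 * df$b - 2 * df$c + rnorm(100, sd = 0.3)

# R-like formula: response a, predictors b and c (gaussian family by default).
model <- glm(a ~ b + c, data = df)

# Model statistics: estimates, standard errors, t values, p-values.
stats <- summary(model)
print(stats$coefficients)
```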

Extensibility
Existing data sources
R package support on spark-packages.org
Example packages
    ./bin/sparkR --packages spark-csv

Developer Community
>20 contributors including AMPLab, Databricks, Alteryx, Intel
R and Scala contributions welcome!

SparkR
Big data processing from R
DataFrames in Spark 1.4
Future: Large Scale ML & more