SDSS - Hossein Falaki - Interacting with Distributed Data from R using SparkR
Interacting with Distributed Data from R using SparkR
Hossein Falaki (@mhfalaki)
Outline
• SparkR Architecture & API
  • Data distribution
  • R and JVM RPC
• Distributing R functions
• Use cases and best practices
  • Code & platform unification
  • Parallelizing existing logic
  • Distributed ML
SparkR Architecture & API
What is SparkR
An R package released with Apache Spark:
• Provides an R front-end to Apache Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Enables distributed execution of R functions
[Diagram: Spark contributes general distributed processing, data sources, and off-memory DataFrames; R contributes a dynamic environment, interactivity, 10K+ packages, and visualizations.]
Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation
SparkR architecture

[Diagram: the R process connects through the R Backend to the Spark Driver JVM, which coordinates Worker JVMs (Worker 1, Worker 2) reading from Data Sources.]
SparkR architecture (with R workers)

[Diagram: same topology as above, with each Worker JVM additionally launching R processes to execute user functions.]
Overview of SparkR API
http://spark.apache.org/docs/latest/api/R/

• IO: read.df / write.df, createDataFrame / collect
• Caching: cache / persist / unpersist, cacheTable / uncacheTable
• SQL: sql / table, registerTempTable / tables
• ML Lib: glm / kmeans / naive Bayes, survival regression / ...
• DataFrame API: select / subset / groupBy, head / avg / nrow / dim
• UDF functionality: spark.lapply / dapply / gapply / ...
SparkR API :: Spark Session
# Importing the library
> library(SparkR)
Attaching package: 'SparkR'

# Initializing connection to Spark
> sparkR.session()
Spark package found in SPARK_HOME: /databricks/spark
Java ref type org.apache.spark.sql.SparkSession id 1
Session Initialization
1. RBackend opens a server port and waits for connections
2. SparkR establishes a TCP connection
3. Each SparkR call sends serialized data over the socket and waits for a response
4. RBackendHandler handles and processes the requests
SparkR Serialization
R and the JVM use a proprietary serialization format as the wire protocol.

Basic type: [type] [binary data]
List: [type] [size] [element 1] [element 2] [element 3] ...
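The framing above can be sketched in plain R. This is an illustrative sketch only, not SparkR's actual byte layout; the type tags and big-endian encoding are assumptions for the example.

```r
# Illustrative sketch of a type-tagged, length-prefixed wire format.
con <- rawConnection(raw(0), "wb")

# Basic type: one type-tag byte followed by the binary payload
writeBin(charToRaw("i"), con)              # type tag: integer (assumed)
writeBin(42L, con, endian = "big")         # binary data

# List: type tag, then element count, then each element
writeBin(charToRaw("l"), con)              # type tag: list (assumed)
writeBin(3L, con, endian = "big")          # size
for (x in c(1L, 2L, 3L)) writeBin(x, con, endian = "big")

payload <- rawConnectionValue(con)
close(con)
length(payload)  # 1 + 4 + 1 + 4 + 3*4 = 22 bytes
```

The point of the length prefix is that the receiver can read a list without a terminator byte: read the tag, read the size, then read exactly that many elements.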
SparkR API :: I/O
Reading
> spark.df <- read.df(path = "...", source = "csv")
> cache(spark.df)
> dim(spark.df)
[1] 1235349690 31

Writing
> write.df(spark.df, path = "...", source = "json")
Control plane
R → JVM → R

1. Serialize method name + arguments
2. Send to backend
3. De-serialize
4. Find Spark method
5. Invoke method
6. Serialize returned value
7. Send to R process
8. De-serialize and return result to user
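Every public SparkR verb goes through this round trip. A rough sketch using SparkR's internal backend helper (note the `:::` access — this is a private API that may change between releases):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)

# callJMethod serializes the method name and arguments, sends them to the
# backend, and de-serializes the returned value — steps 1-8 above in one call.
# df@sdf is the R-side handle to the Java DataFrame object.
n <- SparkR:::callJMethod(df@sdf, "count")   # equivalent to nrow(df)
```

This is why a long chain of SparkR calls is cheap: each call only ships a small serialized method descriptor over a local socket, not the data itself.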
Data plane

[Diagram: data moves between the Data Sources and the Worker JVMs; the R process, attached via the R Backend, exchanges only control messages with the driver JVM.]
SparkR API :: I/O
JVM to R> r.df <- collect(spark.df)
R to JVM> spark.df <- createDataFrame(iris)
Data plane

[Diagram: with collect / createDataFrame, data additionally moves between the R process and the driver JVM through the R Backend.]
SparkR API :: Transforming data
# Filtering
> iad <- subset(df, df$Dest == "IAD")

# Grouping and aggregation
> arrivals <- count(groupBy(iad, iad$Origin))

# Ordering
> head(orderBy(arrivals, -arrivals$count))
SparkR API :: SQL
> createOrReplaceTempView(df, "table")
> arrivals <- sql("
    SELECT count(FlightNum) as count, Origin
    FROM table
    WHERE Dest == 'IAD'
    GROUP BY Origin")
> plot(collect(arrivals))
Distributing R Computation
Overview of SparkR API
http://spark.apache.org/docs/latest/api/R/

• IO: read.df / write.df, createDataFrame / collect
• Caching: cache / persist / unpersist, cacheTable / uncacheTable
• SQL: sql / table / saveAsTable, registerTempTable / tables
• ML Lib: glm / kmeans / naive Bayes, survival regression
• DataFrame API: select / subset / groupBy, head / avg / column / dim
• UDF functionality: spark.lapply / dapply, gapply / dapplyCollect
SparkR UDF API
• spark.lapply(): runs a function over a list of elements on workers
• dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
• gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame
spark.lapply
For each element of a list:
1. Sends the function to an R worker
2. Executes the function
3. Returns the results from all workers as a list to the R driver
spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
spark.lapply control flow
R Driver → Driver JVM → Worker JVM → R Worker (and back)

1. Serialize R closure
2. Transfer over local socket
3. Transfer serialized closure over the network
4. Transfer over local socket
5. De-serialize closure
6. Execution
7. Serialize result
8. Transfer over local socket
9. Transfer serialized result over the network
10. Transfer over local socket
11. De-serialize result
dapply
For each partition of the input Spark DataFrame:
1. Collects the partition as an R data.frame inside the worker
2. Sends the R function to the R process inside the worker
3. Executes the function with the R data.frame as input
dapply(sparkDF, func, schema)
Returns results as DataFrame with provided schema
dapplyCollect(sparkDF, func)
Returns results as an R data.frame
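A minimal dapply() sketch; the input dataset, output columns, and the km-per-liter conversion are illustrative choices, not from the talk:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# func receives each partition as a plain R data.frame and must return
# a data.frame matching the declared output schema.
out <- dapply(df,
  function(pdf) {
    data.frame(mpg = pdf$mpg,
               kpl = pdf$mpg * 0.425144)   # derived column
  },
  schema = structType(structField("mpg", "double"),
                      structField("kpl", "double")))

head(out)
```

Note that the schema must be declared up front: Spark cannot infer the output columns of an arbitrary R function, so a mismatch between the returned data.frame and the schema fails at runtime, not at call time.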
dapply control & data flow
[Diagram: on each worker, the input partition is serialized and transferred over the local socket to the R process; the result data is serialized and transferred back to the worker JVM.]
dapplyCollect control & data flow
[Diagram: as with dapply, plus the results are transferred over the cluster network to the driver JVM and deserialized into an R data.frame in the driver R process.]
gapply
Groups the Spark DataFrame on one or more columns, then:
1. Collects each group as an R data.frame inside the worker
2. Sends the R function to the R process inside the worker
3. Executes the function with the R data.frame as input
gapply(df, cols, func, schema)
Returns results as DataFrame with provided schema
gapplyCollect(df, cols, func)
Returns results as an R data.frame
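A minimal gapply() sketch; the dataset and the per-group aggregate are illustrative:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# func receives the grouping key and that group's rows as an R data.frame.
out <- gapply(df, "cyl",
  function(key, pdf) {
    data.frame(cyl = key[[1]],
               avg_mpg = mean(pdf$mpg))   # one row per group
  },
  schema = structType(structField("cyl", "double"),
                      structField("avg_mpg", "double")))

head(collect(out))
```

Unlike dapply, the user function gets a key argument, and Spark first shuffles the data so that all rows of a group land on the same worker.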
gapplyCollect control & data flow
[Diagram: as with dapplyCollect, with an additional data shuffle that first moves each group's rows to a single worker; results are then transferred to the driver and deserialized into an R data.frame.]
gapply vs. dapply

                         | gapply                          | dapply
signature                | gapply(df, cols, func, schema)  | dapply(df, func, schema)
                         | gapply(gdf, func, schema)       |
user function signature  | function(key, data)             | function(data)
data partitioning        | controlled by grouping          | not controlled
Use Cases
1. Framework & code unification
[Diagram: the traditional workflow spans a distributed framework, distributed storage, and local disk:
1. Load, clean, transform, aggregate, sample (distributed framework)
2. Save to local storage
3. Read & analyze (in R)
4. Iterate
SparkR covers this entire loop within a single framework.]
Advantages of SparkR
What                                       | Why
No need to save intermediate data on disk  | Transfer data directly into the R process
No need to manage different code bases     | Write the entire pipeline in R
Interactive iterations on data             | Distributed data can be cached on the cluster
Work with any data source & format         | Use the Apache Spark data sources ecosystem

Perform interactive analysis on distributed data at the speed of thinking, from R.
What to watch for
1. SparkR functions shadowing other packages
2. How much data can you collect()?
3. Cache your active Spark DataFrame
4. Push feature engineering to Spark as much as possible:
   • Merging datasets
   • Transformations
   • Filtering and projections
   • Sampling
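Point 1 refers to masking: SparkR exports generics whose names collide with functions from packages loaded earlier. A quick way to see what got masked and to disambiguate — the colliding package here (dplyr) is just an example:

```r
library(dplyr)   # exports filter, etc.
library(SparkR)  # attached later, so its generics win the search path

# List everything the SparkR attach masked:
conflicts(detail = TRUE)$`package:SparkR`

# Disambiguate with explicit namespaces:
df   <- SparkR::createDataFrame(faithful)
long <- SparkR::filter(df, df$eruptions > 3)    # Spark-side filter
# dplyr::filter(faithful, eruptions > 3)        # local filter, if needed
```

Loading order matters: whichever package is attached last sits first on the search path, so unqualified calls resolve to it.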
data.frame vs. DataFrame
• Example error messages
• ... doesn't know how to deal with data of class SparkDataFrame
• no method for coercing this S4 class to a ...
• Expressions other than filtering predicates are not supported in the first parameter of extract operator.
R function vs. SparkSQL expression
Expressions translate to JVM calls, but functions run in the R process of the driver or workers:

• filter(logs$type == "ERROR")
• ifelse(df$level > 2, "deep", "shallow")
• dapply(logs, function(x) {
    subset(x, type == "ERROR")
  }, schema(logs))
2. Parallelizing Existing R Programs

Data-bound:
• Usually simple statistics
• Data is growing by volume or complexity
• Number of interactions is growing
• Example: GWAS

Compute-bound:
• Complex logic implemented in R
• Multi-threaded to use more than a single core
• Requires a large workstation with a lot of RAM and many cores
• Example: simulations, bootstrapping
SparkR to the rescue
• Instead of a single large machine (scaling up), use many commodity computers (scaling out):
1. Perform data preparation (ETL) using SparkSQL
2. Distribute input data according to algorithm logic
3. Parallelize the R function to execute on each partition
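The three steps might look like this in SparkR. The table name, columns, grouping key, and the per-group model are hypothetical placeholders for existing single-machine logic:

```r
library(SparkR)
sparkR.session()

# 1. Data preparation (ETL) with Spark SQL — table name is hypothetical
prepared <- sql("SELECT region, x, y FROM measurements WHERE y IS NOT NULL")

# 2. Distribute input by the grouping the algorithm needs
# 3. Run the existing R logic once per group, in parallel on the workers
fits <- gapply(prepared, "region",
  function(key, pdf) {
    m <- lm(y ~ x, data = pdf)          # unchanged single-machine logic
    data.frame(region = key[[1]], slope = coef(m)[["x"]])
  },
  schema = structType(structField("region", "string"),
                      structField("slope", "double")))

collect(fits)
```

The existing R function body stays intact; only the driver-side plumbing changes.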
What to watch for
1. Skew in data/computation
   • Are partitions evenly sized?
   • Is computation uniform across partitions?
2. Packing too much data into the closure
3. Using third-party packages
4. Returned data schema
5. Keeping the pipeline full to minimize the impact of fixed costs
Packing too much into the closure
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...):
  org.apache.spark.SparkException: Job aborted due to stage failure:
  Serialized task 29877:0 was 520644552 bytes, which exceeds max allowed:
  spark.rpc.message.maxSize (268435456 bytes).
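Two hedged ways out of this error: prefer keeping large objects out of the closure by loading data on the workers; raising the RPC limit is a last resort. The limit value, path, and helper function below are illustrative:

```r
library(SparkR)

# Option A (last resort): raise the RPC message limit at session start.
# spark.rpc.message.maxSize is specified in MiB; "512" is an arbitrary example.
sparkR.session(sparkConfig = list(spark.rpc.message.maxSize = "512"))

# Option B (preferred): don't capture a big local object in the closure;
# read it inside the function so each worker loads it locally instead.
results <- spark.lapply(1:100, function(i) {
  lookup <- readRDS("/shared/path/lookup.rds")  # hypothetical shared path
  doWork(i, lookup)                             # hypothetical user function
})
```

Option B works because spark.lapply serializes the closure and ships it with every task; anything the function references from the driver environment is included in that payload.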
Using the wrong apply function
• Do not use spark.lapply() to distribute data; instead:
  • createDataFrame()
  • Load from storage
• Know your data partitioning:
  • partitionBy() %>% dapply()
  • groupBy() %>% gapply()
3. Distributed Machine Learning
• SparkR exposes a wide range of distributed learning algorithms implemented in Spark
• Formula-based API that takes a SparkDataFrame instead of an R data.frame
> df <- createDataFrame(as.data.frame(Titanic))
> model <- glm(Freq ~ Sex + Age, df, family = "gaussian")
> summary(model)
What to watch for
• Spark MLlib models do not include the data, so you need to compute residuals yourself
• In many cases you need to specify the number of iterations, because Spark uses a distributed iterative optimizer to fit models
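For example, spark.glm() exposes a maxIter parameter for its iterative solver; the cap of 50 below is an arbitrary illustration:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(as.data.frame(Titanic))

# Cap the iterative solver explicitly rather than relying on the default.
model <- spark.glm(df, Freq ~ Sex + Age, family = "gaussian", maxIter = 50)
summary(model)
```

Because the optimizer is iterative, coefficients can differ slightly from R's local glm() on the same data, especially with a low iteration cap.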
Thank You