Parallelizing Existing R Packages with SparkR Hossein Falaki @mhfalaki


Page 1: Parallelizing Existing R Packages

Parallelizing Existing R Packages with SparkR

Hossein Falaki @mhfalaki

Page 2: Parallelizing Existing R Packages

About me

• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and the Databricks R Notebook feature
• Currently focusing on the R experience at Databricks

Page 3: Parallelizing Existing R Packages

What is SparkR?

An R package distributed with Apache Spark (soon on CRAN):
- Provides an R frontend to Spark
- Exposes Spark DataFrames (inspired by R and pandas)
- Convenient interoperability between R and Spark DataFrames

Spark brings distributed/robust processing, data sources, and off-memory data structures
+ R brings a dynamic environment, interactivity, packages, and visualization

Page 4: Parallelizing Existing R Packages

SparkR architecture

Diagram: the R process on the Spark driver communicates with the driver JVM through RBackend; the driver JVM dispatches work to the JVM workers, which access the data sources.

Page 5: Parallelizing Existing R Packages

SparkR architecture (since 2.0)

Diagram: same layout as before, except that each JVM worker now also launches an R process, so R code can run on the workers as well (the R driver still talks to the driver JVM through RBackend).

Page 6: Parallelizing Existing R Packages

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / Naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/
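To make the grouping concrete, here is a brief hedged sketch that touches the IO, Caching, and SQL groups in one pass; the path, source format, and the type column are placeholders, not from the slides.

# Illustrative only: path, source format, and the 'type' column are assumptions
logs <- read.df("data/logs", source = "json")      # IO
cache(logs)                                        # Caching
registerTempTable(logs, "logsTable")               # SQL
errors <- sql("SELECT * FROM logsTable WHERE type = 'error'")
head(errors)                                       # first rows back as an R data.frame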

Page 7: Parallelizing Existing R Packages

Overview of SparkR API :: Session

The Spark session is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
o Since 2.0, the sqlContext is wrapped in a new object called the SparkSession

> spark <- sparkR.session()

All SparkR functions work if you pass them a session; otherwise they assume an existing session.
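A minimal sketch of that behavior (the appName value is just an example): start the session once, then let later calls pick it up implicitly.

library(SparkR)
sparkR.session(appName = "sparkr-example")   # creates (or reuses) the session
df <- createDataFrame(faithful)              # no session argument: the existing session is used
head(df)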

Page 8: Parallelizing Existing R Packages

Reading/Writing data

Diagram: read.df() and write.df() are issued from R and pass through RBackend to the driver JVM; the JVM workers then read from and write to HDFS, S3, and other storage directly, so the data itself does not pass through the R driver.
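A hedged sketch of the two calls in the diagram; the storage paths and formats are assumptions.

# Read JSON from distributed storage into a SparkDataFrame (path is illustrative)
logs <- read.df("hdfs:///data/logs", source = "json")

# Write it back out as Parquet; mode = "overwrite" replaces any existing output
write.df(logs, path = "hdfs:///data/logs_parquet", source = "parquet", mode = "overwrite")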

Page 9: Parallelizing Existing R Packages

Moving data between R and JVM

Diagram: SparkR::collect() moves data from the JVM to the R process, and SparkR::createDataFrame() moves data from R to the JVM, both via RBackend.
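For example (a small sketch; mtcars stands in for any local data set):

# R -> JVM: ship a local data.frame to Spark
sparkDF <- createDataFrame(mtcars)

# JVM -> R: bring a (small!) SparkDataFrame back as a local data.frame
localDF <- collect(sparkDF)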

Page 10: Parallelizing Existing R Packages

Overview of SparkR API :: DataFrame API

A SparkR DataFrame behaves similarly to an R data.frame:
> sparkDF$newCol <- sparkDF$col + 1
> subsetDF <- sparkDF[, c("date", "type")]
> recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
> firstRow <- sparkDF[1, ]
> names(subsetDF) <- c("Date", "Type")
> dim(recentData)
> head(count(group_by(subsetDF, "Date")))

Page 11: Parallelizing Existing R Packages

Overview of SparkR API :: SQL

You can register a DataFrame as a table and query it in SQL:
> logs <- read.df("data/logs", source = "json")
> registerTempTable(logs, "logsTable")
> errorsByType <- sql("select count(*) as num, type from logsTable where type == 'error' group by type order by num desc")

> reviewsDF <- tableToDF("reviewsTable")
> registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")

Page 12: Parallelizing Existing R Packages

Moving between languages

Both languages operate on the same Spark tables:

R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala:
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) => text.split(' ') }
val model = KMeans.train(parsed)

Page 13: Parallelizing Existing R Packages

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / Naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/

Page 14: Parallelizing Existing R Packages

SparkR UDF API

spark.lapply(): runs a function over a list of elements
dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame

Page 15: Parallelizing Existing R Packages

spark.lapply

The simplest SparkR UDF pattern. For each element of a list, it:

1. Sends the function to an R worker
2. Executes the function
3. Returns the results from all workers as a list to the R driver

spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
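A slightly fuller, hedged sketch of the same pattern; the model-fitting body and the use of mtcars are illustrative only, not from the talk.

# Fit one model per candidate polynomial degree, in parallel on the workers
degrees <- 1:4
rsq <- spark.lapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(hp, d), data = mtcars)   # mtcars ships with base R, so it exists on every worker
  summary(fit)$r.squared
})
rsq   # an ordinary R list of results, back on the driver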

Page 16: Parallelizing Existing R Packages

spark.lapply control flow

Control flow (R driver, driver JVM, worker JVMs, R workers):

1. Serialize the R closure
2. Transfer it over a local socket to the driver JVM
3. Transfer the serialized closure over the network to the worker JVMs
4. Transfer it over a local socket to each R worker
5. De-serialize the closure
6. Serialize the result
7. Transfer it over a local socket to the worker JVM
8. Transfer the serialized result over the network to the driver JVM
9. Transfer it over a local socket to the R driver
10. De-serialize the result

Page 17: Parallelizing Existing R Packages

dapply

For each partition of a Spark DataFrame:

1. Collects the partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema)
  combines the results as a DataFrame with the given schema
dapplyCollect(sparkDF, func)
  combines the results as an R data.frame
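A minimal sketch of both calls, assuming a SparkDataFrame df with numeric columns x and y (the names and schema are placeholders):

# The schema must describe the data.frame that the function returns
schema <- structType(structField("x", "double"),
                     structField("y", "double"),
                     structField("ratio", "double"))

withRatio <- dapply(df, function(part) {
  # 'part' is an ordinary R data.frame holding one partition
  part$ratio <- part$x / part$y
  part
}, schema)

# dapplyCollect needs no schema and returns a local R data.frame
partSizes <- dapplyCollect(df, function(part) {
  data.frame(rows = nrow(part))
})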

Page 18: Parallelizing Existing R Packages

dapply control & data flow

Diagram: for each partition, the input data is serialized in the worker JVM and transferred over a local socket to the R worker; the result data is serialized and transferred back over the local socket, while the driver JVM and worker JVMs communicate over the cluster network.

Page 19: Parallelizing Existing R Packages

dapplyCollect control & data flow

Diagram: as with dapply, the input data is serialized and transferred from the worker JVMs to the R workers over local sockets; the results are then transferred back over the cluster network and de-serialized on the R driver.

Page 20: Parallelizing Existing R Packages

gapply

Groups a Spark DataFrame on one or more columns, then for each group:

1. Collects the group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, cols, func, schema)
  combines the results as a DataFrame with the given schema
gapplyCollect(sparkDF, cols, func)
  combines the results as an R data.frame
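A minimal sketch, assuming a SparkDataFrame df with a string column group and a numeric column value (both names are placeholders):

schema <- structType(structField("group", "string"),
                     structField("mean_value", "double"))

groupMeans <- gapply(df, "group", function(key, part) {
  # 'key' holds the grouping value(s); 'part' is that group's rows as an R data.frame
  # Columns are matched to the schema by position
  data.frame(key, mean(part$value))
}, schema)

head(groupMeans)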

Page 21: Parallelizing Existing R Packages

gapply control & data flow

Diagram: same flow as dapply, with one addition: the data is first shuffled across the cluster network so that all rows of a group land in the same partition before being handed to the R workers over local sockets.

Page 22: Parallelizing Existing R Packages

dapply vs. gapply

gapply
  signature: gapply(df, cols, func, schema) or gapply(gdf, func, schema)
  user function signature: function(key, data)
  data partition: controlled by grouping

dapply
  signature: dapply(df, func, schema)
  user function signature: function(data)
  data partition: not controlled

Page 23: Parallelizing Existing R Packages

Parallelizing data

• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in the data
  – Are partitions evenly sized?
• Auxiliary data
  – Can be joined with the input DataFrame (see the sketch after this list)
  – Can be distributed to all the workers using the file system
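For the auxiliary-data case, a hedged sketch of the join approach; inputDF and the column names are invented for illustration.

# Small lookup table built on the driver, then joined to the large input DataFrame
lookupDF <- createDataFrame(data.frame(code  = c("a", "b"),
                                       label = c("Alpha", "Beta")))
joined <- join(inputDF, lookupDF, inputDF$code == lookupDF$code, "left_outer")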


Page 24: Parallelizing Existing R Packages

Packages on workers

• SparkR closure capture does not include packages
• You need to load packages on each worker inside your function
• If not installed, install packages on the workers out-of-band
• spark.lapply() can be used to install packages (see the sketch below)
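A hedged sketch of both points: an out-of-band install pass with spark.lapply(), then loading the package inside the function that runs on the workers. The package name is an example, and running one task per worker is a heuristic rather than a guarantee.

# One-time install pass; numWorkers is a hypothetical count of workers in the cluster
numWorkers <- 8
invisible(spark.lapply(seq_len(numWorkers), function(i) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
  NULL
}))

# Later, load the package inside the UDF: closure capture does not bring packages along
results <- spark.lapply(1:10, function(x) {
  library(forecast)
  x   # ... real work using the package would go here ...
})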


Page 25: Parallelizing Existing R Packages

Debugging user code

1. Verify your code on the driver (see the sketch after this list)
2. Interactively execute the code on the cluster
   – When an R worker fails, the Spark driver throws an exception with the R error text
3. Inspect the failure reason of failed jobs in the Spark UI
4. Inspect the stdout/stderr of the workers
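For steps 1 and 2, a small sketch of verifying the function locally before running it on the cluster; myFunc and sparkDF are placeholders.

# Step 1: verify on the driver with a small local sample
localSample <- head(sparkDF, 100)     # head() returns an ordinary R data.frame
localResult <- myFunc(localSample)    # myFunc is the hypothetical UDF under test

# Step 2: run the same function interactively on the cluster
clusterResult <- dapplyCollect(sparkDF, myFunc)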


Page 26: Parallelizing Existing R Packages

Demo

Notebooks available at:

• http://bit.ly/2krYMwC
• http://bit.ly/2ltLVKs

Page 27: Parallelizing Existing R Packages

Thank you!