Parallelizing Existing R Packages with SparkR Hossein Falaki @mhfalaki


Page 1: Parallelizing Existing R Packages

Parallelizing Existing R Packages with SparkR

Hossein Falaki @mhfalaki

Page 2: Parallelizing Existing R Packages

About me

• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and the Databricks R Notebook feature
• Currently focusing on the R experience at Databricks

Page 3: Parallelizing Existing R Packages

What is SparkR?

An R package distributed with Apache Spark (soon on CRAN):
- Provides an R frontend to Spark
- Exposes Spark DataFrames (inspired by R and pandas)
- Convenient interoperability between R and Spark DataFrames

Spark brings distributed/robust processing, data sources, and off-memory data structures
+ R brings a dynamic environment, interactivity, packages, and visualization

Page 4: Parallelizing Existing R Packages

SparkR architecture

Diagram: the R process on the Spark driver communicates with the driver JVM through RBackend; the driver JVM dispatches work to the JVM workers, which access the data sources.

Page 5: Parallelizing Existing R Packages

SparkR architecture (since 2.0)

Diagram: same layout as before, except that each JVM worker now also launches an R process, so R code can run on the workers as well (the R driver still talks to the driver JVM through RBackend).

Page 6: Parallelizing Existing R Packages

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / Naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/
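To make the grouping concrete, here is a brief hedged sketch that touches the IO, Caching, and SQL groups in one pass; the path, source format, and the type column are placeholders, not from the slides.

# Illustrative only: path, source format, and the 'type' column are assumptions
logs <- read.df("data/logs", source = "json")      # IO
cache(logs)                                        # Caching
registerTempTable(logs, "logsTable")               # SQL
errors <- sql("SELECT * FROM logsTable WHERE type = 'error'")
head(errors)                                       # first rows back as an R data.frame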

Page 7: Parallelizing Existing R Packages

Overview of SparkR API :: Session

The Spark session is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
o Since 2.0, the sqlContext is wrapped in a new object called the SparkSession

> spark <- sparkR.session()

All SparkR functions work if you pass them a session; otherwise they assume an existing session.
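A minimal sketch of that behavior (the appName value is just an example): start the session once, then let later calls pick it up implicitly.

library(SparkR)
sparkR.session(appName = "sparkr-example")   # creates (or reuses) the session
df <- createDataFrame(faithful)              # no session argument: the existing session is used
head(df)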

Page 8: Parallelizing Existing R Packages

Reading/Writing data

Diagram: read.df() and write.df() are issued from R and pass through RBackend to the driver JVM; the JVM workers then read from and write to HDFS, S3, and other storage directly, so the data itself does not pass through the R driver.
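A hedged sketch of the two calls in the diagram; the storage paths and formats are assumptions.

# Read JSON from distributed storage into a SparkDataFrame (path is illustrative)
logs <- read.df("hdfs:///data/logs", source = "json")

# Write it back out as Parquet; mode = "overwrite" replaces any existing output
write.df(logs, path = "hdfs:///data/logs_parquet", source = "parquet", mode = "overwrite")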

Page 9: Parallelizing Existing R Packages

Moving data between R and JVM

Diagram: SparkR::collect() moves data from the JVM to the R process, and SparkR::createDataFrame() moves data from R to the JVM, both via RBackend.
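For example (a small sketch; mtcars stands in for any local data set):

# R -> JVM: ship a local data.frame to Spark
sparkDF <- createDataFrame(mtcars)

# JVM -> R: bring a (small!) SparkDataFrame back as a local data.frame
localDF <- collect(sparkDF)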

Page 10: Parallelizing Existing R Packages

Overview of SparkR API :: DataFrame API

A SparkR DataFrame behaves similarly to an R data.frame:
> sparkDF$newCol <- sparkDF$col + 1
> subsetDF <- sparkDF[, c("date", "type")]
> recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
> firstRow <- sparkDF[1, ]
> names(subsetDF) <- c("Date", "Type")
> dim(recentData)
> head(count(group_by(subsetDF, "Date")))

Page 11: Parallelizing Existing R Packages

Overview of SparkR API :: SQL

You can register a DataFrame as a table and query it in SQL:
> logs <- read.df("data/logs", source = "json")
> registerTempTable(logs, "logsTable")
> errorsByType <- sql("select count(*) as num, type from logsTable where type == 'error' group by type order by num desc")

> reviewsDF <- tableToDF("reviewsTable")
> registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")

Page 12: Parallelizing Existing R Packages

Moving between languages

Both languages operate on the same Spark tables:

R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala:
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) => text.split(' ') }
val model = KMeans.train(parsed)

Page 13: Parallelizing Existing R Packages

Overview of SparkR API

IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / Naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/

Page 14: Parallelizing Existing R Packages

SparkR UDF API

spark.lapply(): runs a function over a list of elements
dapply() / dapplyCollect(): applies a function to each partition of a SparkDataFrame
gapply() / gapplyCollect(): applies a function to each group within a SparkDataFrame

Page 15: Parallelizing Existing R Packages

spark.lapply

The simplest SparkR UDF pattern. For each element of a list, it:

1. Sends the function to an R worker
2. Executes the function
3. Returns the results from all workers as a list to the R driver

spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
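A slightly fuller, hedged sketch of the same pattern; the model-fitting body and the use of mtcars are illustrative only, not from the talk.

# Fit one model per candidate polynomial degree, in parallel on the workers
degrees <- 1:4
rsq <- spark.lapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(hp, d), data = mtcars)   # mtcars ships with base R, so it exists on every worker
  summary(fit)$r.squared
})
rsq   # an ordinary R list of results, back on the driver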

Page 16: Parallelizing Existing R Packages

spark.lapply control flow

Control flow (R driver, driver JVM, worker JVMs, R workers):

1. Serialize the R closure
2. Transfer it over a local socket to the driver JVM
3. Transfer the serialized closure over the network to the worker JVMs
4. Transfer it over a local socket to each R worker
5. De-serialize the closure
6. Serialize the result
7. Transfer it over a local socket to the worker JVM
8. Transfer the serialized result over the network to the driver JVM
9. Transfer it over a local socket to the R driver
10. De-serialize the result

Page 17: Parallelizing Existing R Packages

dapply

For each partition of a Spark DataFrame:

1. Collects the partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema)
  combines the results as a DataFrame with the given schema
dapplyCollect(sparkDF, func)
  combines the results as an R data.frame
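A minimal sketch of both calls, assuming a SparkDataFrame df with numeric columns x and y (the names and schema are placeholders):

# The schema must describe the data.frame that the function returns
schema <- structType(structField("x", "double"),
                     structField("y", "double"),
                     structField("ratio", "double"))

withRatio <- dapply(df, function(part) {
  # 'part' is an ordinary R data.frame holding one partition
  part$ratio <- part$x / part$y
  part
}, schema)

# dapplyCollect needs no schema and returns a local R data.frame
partSizes <- dapplyCollect(df, function(part) {
  data.frame(rows = nrow(part))
})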

Page 18: Parallelizing Existing R Packages

dapply control & data flow

Diagram: for each partition, the input data is serialized in the worker JVM and transferred over a local socket to the R worker; the result data is serialized and transferred back over the local socket, while the driver JVM and worker JVMs communicate over the cluster network.

Page 19: Parallelizing Existing R Packages

dapplyCollect control & data flow

Diagram: as with dapply, the input data is serialized and transferred from the worker JVMs to the R workers over local sockets; the results are then transferred back over the cluster network and de-serialized on the R driver.

Page 20: Parallelizing Existing R Packages

gapply

Groups a Spark DataFrame on one or more columns, then for each group:

1. Collects the group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, cols, func, schema)
  combines the results as a DataFrame with the given schema
gapplyCollect(sparkDF, cols, func)
  combines the results as an R data.frame
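A minimal sketch, assuming a SparkDataFrame df with a string column group and a numeric column value (both names are placeholders):

schema <- structType(structField("group", "string"),
                     structField("mean_value", "double"))

groupMeans <- gapply(df, "group", function(key, part) {
  # 'key' holds the grouping value(s); 'part' is that group's rows as an R data.frame
  # Columns are matched to the schema by position
  data.frame(key, mean(part$value))
}, schema)

head(groupMeans)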

Page 21: Parallelizing Existing R Packages

gapply control & data flow

Diagram: same flow as dapply, with one addition: the data is first shuffled across the cluster network so that all rows of a group land in the same partition before being handed to the R workers over local sockets.

Page 22: Parallelizing Existing R Packages

dapply vs. gapply

gapply
  signature: gapply(df, cols, func, schema) or gapply(gdf, func, schema)
  user function signature: function(key, data)
  data partition: controlled by grouping

dapply
  signature: dapply(df, func, schema)
  user function signature: function(data)
  data partition: not controlled

Page 23: Parallelizing Existing R Packages

Parallelizing data

• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in the data
  – Are partitions evenly sized?
• Auxiliary data
  – Can be joined with the input DataFrame (see the sketch after this list)
  – Can be distributed to all the workers using the file system
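For the auxiliary-data case, a hedged sketch of the join approach; inputDF and the column names are invented for illustration.

# Small lookup table built on the driver, then joined to the large input DataFrame
lookupDF <- createDataFrame(data.frame(code  = c("a", "b"),
                                       label = c("Alpha", "Beta")))
joined <- join(inputDF, lookupDF, inputDF$code == lookupDF$code, "left_outer")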


Page 24: Parallelizing Existing R Packages

Packages on workers

• SparkR closure capture does not include packages
• You need to load packages on each worker inside your function
• If not installed, install packages on the workers out-of-band
• spark.lapply() can be used to install packages (see the sketch below)
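A hedged sketch of both points: an out-of-band install pass with spark.lapply(), then loading the package inside the function that runs on the workers. The package name is an example, and running one task per worker is a heuristic rather than a guarantee.

# One-time install pass; numWorkers is a hypothetical count of workers in the cluster
numWorkers <- 8
invisible(spark.lapply(seq_len(numWorkers), function(i) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
  NULL
}))

# Later, load the package inside the UDF: closure capture does not bring packages along
results <- spark.lapply(1:10, function(x) {
  library(forecast)
  x   # ... real work using the package would go here ...
})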


Page 25: Parallelizing Existing R Packages

Debugging user code

1. Verify your code on the driver (see the sketch after this list)
2. Interactively execute the code on the cluster
   – When an R worker fails, the Spark driver throws an exception with the R error text
3. Inspect the failure reason of failed jobs in the Spark UI
4. Inspect the stdout/stderr of the workers
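For steps 1 and 2, a small sketch of verifying the function locally before running it on the cluster; myFunc and sparkDF are placeholders.

# Step 1: verify on the driver with a small local sample
localSample <- head(sparkDF, 100)     # head() returns an ordinary R data.frame
localResult <- myFunc(localSample)    # myFunc is the hypothetical UDF under test

# Step 2: run the same function interactively on the cluster
clusterResult <- dapplyCollect(sparkDF, myFunc)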


Page 26: Parallelizing Existing R Packages

Demo

Notebooks available at:

• http://bit.ly/2krYMwC
• http://bit.ly/2ltLVKs

Page 27: Parallelizing Existing R Packages

Thank you!