R workshop xx -- Parallel Computing with R

Parallel computing in R
Vivian Zhang, Yuan Huang, Tong He
SupStat Inc

Parallel computing in R http://nycdatascience.com/slides/parallel_R/index.html#1 1 of 28 6/12/14, 5:26 PM

Upload: vivian-s-zhang

Post on 26-Jan-2015


DESCRIPTION

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, parallel computing, R programming

TRANSCRIPT


Page 2: R workshop xx -- Parallel Computing with R

Outline

· Introduction to Parallel computing
· Implementation in R
  - Overview
  - Package: foreach
· Examples

Page 3: R workshop xx -- Parallel Computing with R

Introduction to Parallel computing


Page 4: R workshop xx -- Parallel Computing with R

Serial vs Parallel Computation

Serial Computation

Traditionally, software is written for serial computation, where tasks must be performed in sequence on a single processor. Only one instruction may execute at any moment in time.

[Figure: Illustration of Serial Computation]

Page 5: R workshop xx -- Parallel Computing with R

Serial vs Parallel Computation

Parallel Computation

Parallel computing aims to speed up the computation. In parallel computing,

· The problem is broken apart into discrete pieces of work.
· Instructions from each part execute simultaneously on different processors.

[Figure: Illustration of Parallel Computation]
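The contrast between serial and parallel execution can be sketched with the parallel package that ships with R. This is only an illustration under stated assumptions: slow_square is a hypothetical stand-in for real work, and the two-worker cluster size is arbitrary.

```r
library(parallel)

# Hypothetical task: a deliberately slow function applied to four inputs.
slow_square <- function(x) {
  Sys.sleep(0.2)  # stand-in for real computation
  x^2
}

inputs <- 1:4

# Serial: the tasks run one after another on a single processor.
serial_time <- system.time(serial_res <- lapply(inputs, slow_square))

# Parallel: the same tasks are spread over two worker processes.
cl <- makeCluster(2)
parallel_time <- system.time(parallel_res <- parLapply(cl, inputs, slow_square))
stopCluster(cl)

# Both approaches return the same values; only the elapsed time differs.
same_result <- identical(serial_res, parallel_res)
serial_time["elapsed"]
parallel_time["elapsed"]
```

The elapsed times depend on the machine, so no particular speed-up is claimed here.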

Page 6: R workshop xx -- Parallel Computing with R

Parallel paradigm

Master-worker paradigm

· The master submits jobs to the workers and collects the results from them.
· Workers do not communicate with each other.
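A minimal sketch of the master-worker pattern, assuming the base parallel package and an arbitrary cluster of two workers: the master session starts the workers, hands each job to one of them, and gathers the answers. The workers never exchange data with each other.

```r
library(parallel)

# The master (this R session) starts two worker processes.
cl <- makeCluster(2)

# Each worker reports its own process id back to the master.
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

# The master submits four jobs and collects the results in order.
results <- clusterApply(cl, 1:4, function(x) x * 10)

stopCluster(cl)

unlist(results)
```

The worker process ids differ from the master's own Sys.getpid(), which confirms that the jobs ran in separate processes.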

Page 7: R workshop xx -- Parallel Computing with R

Steps towards Parallel computing

1. Hardware platforms
2. Is the problem parallelizable?
3. Tips to improve the efficiency of parallel computing
4. Implementation in R

Page 8: R workshop xx -- Parallel Computing with R

Hardware platforms

Two representative hardware platforms are multicore and cluster.

· Multicore: Most PCs have multiple processors on a single chip, which enables parallel computing. The CPU is subdivided into multiple "cores", each being a unique execution unit.
· Cluster: A set of independent nodes connected by a network.
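On a multicore machine, the number of available cores can be queried before choosing how many workers to start. The sketch below uses detectCores() from the parallel package; leaving one core free for the master session is an assumption made for illustration, not a rule from the slides.

```r
library(parallel)

# Number of logical cores reported by the operating system (may be NA).
n_cores <- detectCores()

# Assumption for illustration: keep one core free, but use at least one worker.
n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

n_workers
```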

Page 9: R workshop xx -- Parallel Computing with R

Is the problem parallelizable?

Recap: parallel computing requires that

· The problem can be broken apart into discrete pieces of work.
· Instructions from each part can execute simultaneously on different processors.

Example:

vec <- runif(10)

# sum(vec): the i-th iteration uses the result of the (i-1)-th iteration.
sum.vec <- 0
for (i in seq_along(vec)) sum.vec <- sum.vec + vec[i]

# cumsum(vec): the i-th iteration is independent of the (i-1)-th iteration.
cumsum.vec <- 0 * seq_along(vec)
for (i in seq_along(vec)) cumsum.vec[i] <- sum(vec[1:i])

Page 10: R workshop xx -- Parallel Computing with R

Is this a good parallelization scheme?

For this cumsum example,

vec <- runif(10)

# cumsum(vec): the i-th iteration is independent of the (i-1)-th iteration.
cumsum.vec <- 0 * seq_along(vec)
for (i in seq_along(vec)) cumsum.vec[i] <- sum(vec[1:i])

Scheme: Prepare 10 cores and let each core compute one sum(vec[1:i]).

Question: Is this a good parallel scheme?

Page 11: R workshop xx -- Parallel Computing with R

Is this a good parallelization scheme?

No. Different values of i can require widely different computation times (sum(vec[1:i]) adds i numbers), which can cause a serious load-balancing problem.

Consider load balancing!

Page 12: R workshop xx -- Parallel Computing with R

Load balancing

Load balancing aims to spread tasks evenly across processors.

When tasks and processors are not load balanced:

· Some processes finish early and sit idle, waiting.
· The global computation finishes only when the slowest processor(s) complete their tasks.
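The effect of the assignment scheme can be worked out by hand for the cumsum example. The sketch below assumes that computing sum(vec[1:i]) costs about i additions and that there are two workers; both are simplifying assumptions made for illustration.

```r
# Assumed cost model: task i costs i additions.
cost <- 1:10

# Scheme A: contiguous halves (tasks 1-5 to worker 1, tasks 6-10 to worker 2).
contiguous <- split(cost, rep(1:2, each = 5))

# Scheme B: alternate the tasks between the two workers.
round_robin <- split(cost, rep(1:2, times = 5))

load_contiguous  <- sapply(contiguous, sum)   # 15 and 40
load_round_robin <- sapply(round_robin, sum)  # 25 and 30
```

Under scheme A the second worker carries 40 of the 55 units of work, so the first worker sits idle for most of the run. Scheme B finishes after about 30 units. When task costs are not known in advance, a dynamic scheduler such as clusterApplyLB() in the snow and parallel packages hands the next task to whichever worker becomes free.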

Page 13: R workshop xx -- Parallel Computing with R

Parallel overhead

Parallel overhead is the amount of time required to coordinate parallel tasks, as opposed to doing useful work. It can include factors such as:

· Task start-up time
· Synchronizations
· Data communications
· Software overhead imposed by parallel languages, libraries, operating system, etc.

Communication is much slower than computation; care should be taken to minimize unnecessary data transfer to and from workers.

Assigning tasks in chunks helps reduce parallel overhead.
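Chunking can be sketched with the parallel package as follows. The vector length, the seed, and the two-worker cluster are assumptions chosen for illustration. Instead of sending 1000 one-element tasks, the master sends each worker a single chunk of data and receives a single number back.

```r
library(parallel)

set.seed(1)
vec <- runif(1000)

cl <- makeCluster(2)

# One block of indices per worker instead of one task per element.
chunks <- splitIndices(length(vec), length(cl))

# Ship only the data each worker needs, in a single message per worker.
pieces <- lapply(chunks, function(idx) vec[idx])
partial_sums <- clusterApply(cl, pieces, sum)

stopCluster(cl)

# The master combines the partial results.
total <- sum(unlist(partial_sums))
all.equal(total, sum(vec))
```

Two messages out and two back replace a thousand round trips, which is where the overhead saving comes from.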

Page 14: R workshop xx -- Parallel Computing with R

Random number generators

Random number generators require extra care. Random number streams on different nodes need to be independent. It is important to avoid producing the same random number stream on each worker and, at the same time, to facilitate reproducible research.

Special-purpose packages (rsprng, rlecuyer) are available; the snow package provides an integrated interface to these packages.

In the snow package, run the following code before the simulations.

clusterSetupSPRNG(cl)
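The same two goals (different streams on each worker, reproducible runs) can be sketched with clusterSetRNGStream() from the base parallel package. This is a swapped-in alternative for illustration, not the snow/rsprng call shown on the slide; the seed value 123 is arbitrary.

```r
library(parallel)

cl <- makeCluster(2)

# Give every worker its own L'Ecuyer stream, derived from one seed.
clusterSetRNGStream(cl, iseed = 123)
run1 <- clusterCall(cl, function() runif(3))

# Resetting the streams with the same seed reproduces the draws.
clusterSetRNGStream(cl, iseed = 123)
run2 <- clusterCall(cl, function() runif(3))

stopCluster(cl)

reproducible    <- identical(run1, run2)            # same seed, same draws
workers_differ  <- !identical(run1[[1]], run1[[2]]) # each worker has its own stream
```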

Page 15: R workshop xx -- Parallel Computing with R

Applications in statistical modeling

Model selection

· Subset selection
· Tuning parameter selection (e.g. tuning in regularized regression)
· K-fold cross-validation (see examples)

Data mining

· Random forest (see examples)
· Clustering (see examples)
· Principal component analysis

Monte Carlo simulations

Bootstrap (see examples)

Web scraper (see examples)

Page 16: R workshop xx -- Parallel Computing with R

Implementation in R


Page 17: R workshop xx -- Parallel Computing with R

Overview: Packages

· Rmpi (R interface to MPI; flexible and powerful, but more complex)
· snow (used today as the backend for the foreach package)
· multicore (works only on a single node and on Unix-like machines)
· parallel (hybrid package containing snow and multicore)
· foreach (parallel backends: doSNOW / doMPI / doMC)

Use foreach + doSNOW.

Page 18: R workshop xx -- Parallel Computing with R

foreach: One Ring to Rule Them All

· foreach was written by Steve Weston (formerly at Revolution, now at Yale).
· It is an elegant framework for parallel computing: loop construct + parallel execution.
· It allows the user to specify the parallel environment. The parallel backends include:
  - doMC (multicore)
  - doSNOW (snow)
  - doMPI (Rmpi)
  - doParallel (parallel)
  - ...

Page 19: R workshop xx -- Parallel Computing with R

Register the parallel backends:

The doMC package acts as an interface between foreach and the multicore functionality.

# Register multicore as backend.
library(doMC)
registerDoMC(2)
# ... foreach code ...

The doSNOW package acts as an interface between foreach and the snow functionality.

# Register snow as backend.
library(doSNOW)
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)
# ... foreach code ...
stopCluster(cl)

Page 20: R workshop xx -- Parallel Computing with R

Package: snow

Interfaces provided by the snow package include:

1. MPI: Message Passing Interface, via Rmpi
2. NWS: NetWork Spaces, via nws
3. PVM: Parallel Virtual Machine
4. Sockets: via the operating system

Page 21: R workshop xx -- Parallel Computing with R

Register the parallel backends

Get the name of the currently registered backend:

getDoParName()

Get the version of the currently registered backend:

getDoParVersion()

Check how many workers foreach is going to use:

getDoParWorkers()
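The three query functions can be tried right after registering a backend. The sketch below assumes the doParallel backend (listed earlier among the foreach backends) rather than doSNOW, simply because it builds on the parallel package that ships with R; the two-worker cluster is arbitrary.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)
registerDoParallel(cl)

backend_name    <- getDoParName()     # name of the registered backend
backend_version <- getDoParVersion()  # version of the backend package
n_workers       <- getDoParWorkers()  # number of workers foreach will use

stopCluster(cl)
registerDoSEQ()  # return to the sequential backend once the cluster is gone
```

Calling registerDoSEQ() at the end avoids leaving foreach pointed at a cluster that no longer exists.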

Page 22: R workshop xx -- Parallel Computing with R

foreach code: Syntax

1. foreach object %do% expression
2. foreach object %dopar% expression

# Here foreach(i=1:4) is the foreach object. We call i the iterator.
x <- foreach(i=1:4) %dopar% {exp(i)}

· foreach object: specifies the looping structure, similar to a for loop (for -> foreach).
· %do%, %dopar%: specify the execution method.
  - %do%: executes the R expression sequentially.
  - %dopar%: executes the R expression using the currently registered backend.
· expression: specifies what to execute in each iteration.
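A small sketch of the two operators side by side. To keep it self-contained, the sequential backend is registered explicitly with registerDoSEQ(), so %dopar% here runs in the current session; with a real parallel backend registered, only the place of execution would change.

```r
library(foreach)

# %do%: always evaluates the expression sequentially in this session.
x_do <- foreach(i = 1:4) %do% exp(i)

# %dopar%: evaluates the expression on whatever backend is registered.
# Assumption for this sketch: the sequential backend, registered explicitly.
registerDoSEQ()
x_dopar <- foreach(i = 1:4) %dopar% exp(i)

# The loop body is identical, and so is the result.
same <- identical(x_do, x_dopar)
```

Because the loop code does not change between the two operators, a loop can be developed and debugged with %do% and switched to %dopar% afterwards.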

Page 23: R workshop xx -- Parallel Computing with R

foreach code:

The return value is a list by default. Use the .combine argument in the foreach object to change this.

library(doSNOW)
# Register snow as backend.
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)
# foreach code
foreach(i=1:3) %dopar% {i^2}

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

stopCluster(cl)

Page 24: R workshop xx -- Parallel Computing with R

foreach code:

Set .combine='c' to return the results as a vector.

# Register snow as backend.
library(doSNOW)
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)
# foreach code
foreach(i=1:3, .combine='c') %dopar% {i^2}

[1] 1 4 9

stopCluster(cl)
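The .combine argument accepts other combining functions as well. The sketch below uses %do% so that it runs without a cluster; the choice of '+' and 'rbind' is only for illustration.

```r
library(foreach)

# Sum the results as they are combined: 1 + 4 + 9.
total <- foreach(i = 1:3, .combine = "+") %do% i^2

# Stack one row per iteration into a matrix with columns i and i^2.
mat <- foreach(i = 1:3, .combine = "rbind") %do% c(i, i^2)

total
dim(mat)
```

Here total is 14 and mat has three rows and two columns.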

Page 25: R workshop xx -- Parallel Computing with R

foreach code: two iterators.

# Register snow as backend.
library(doSNOW)
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)
# foreach code: here we have two iterators, i and j.
foreach(i=1:3, j=4:6, .combine='c') %dopar% {i+j}

[1] 5 7 9

# When the iterators differ in length, the loop stops at the shortest one,
# so j=4:9 gives the same result.
foreach(i=1:3, j=4:9, .combine='c') %dopar% {i+j}

[1] 5 7 9

stopCluster(cl)

Page 26: R workshop xx -- Parallel Computing with R

Nesting the loops

Syntax

foreach object 1 %:% foreach object 2 %dopar% { expression }

Example

# foreach code
bvec <- c(1,2,3)
avec <- c(-1,-2,-3)
x <- foreach(b=bvec, .combine='c') %:% foreach(a=avec, .combine='c') %dopar% { a + b }

Page 27: R workshop xx -- Parallel Computing with R

Nesting the loops

library(doSNOW)
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)
# foreach code
bvec <- c(1,2,3)
avec <- c(-1,-2,-3)
x <- foreach(b=bvec, .combine='c') %:% foreach(a=avec, .combine='c') %dopar% { a + b }
stopCluster(cl)
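To see what the nested loop returns, the same example can be run sequentially with %do% (an assumption made so that the sketch needs no cluster) and compared with base R's outer().

```r
library(foreach)

bvec <- c(1, 2, 3)
avec <- c(-1, -2, -3)

# Outer loop over b, inner loop over a; inner results are concatenated first.
x <- foreach(b = bvec, .combine = "c") %:%
  foreach(a = avec, .combine = "c") %do% {
    a + b
  }

x
# [1]  0 -1 -2  1  0 -1  2  1  0

# The same nine values, computed as a 3 x 3 table and flattened column by column.
reference <- as.vector(outer(avec, bvec, "+"))
identical(x, reference)
```

The first three values belong to b = 1, the next three to b = 2, and the last three to b = 3.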

Page 28: R workshop xx -- Parallel Computing with R

Reference

1. Good overview: State of the Art in Parallel Computing with R, Journal of Statistical Software.
2. Hands-on tutorial:
   · Package foreach, Steve Weston
   · Using The foreach Package, Steve Weston
   · Nesting Foreach Loops, Steve Weston
   · Getting Started with doMC and foreach, Steve Weston
3. Comprehensive textbook:
   · Programming on Parallel Machines, Norm Matloff
   · Introduction to Parallel Computing, Blaise Barney