Parallel Computing for Econometricians with Amazon Web Services

Stephen J. Barr
University of Rochester
March 2, 2011

Page 1: Parallel Computing for Econometricians with Amazon Web Services

Parallel Computing for Econometricians with Amazon Web Services

Stephen J. Barr

University of Rochester

March 2, 2011

Page 2: Parallel Computing for Econometricians with Amazon Web Services


The Old Way

Page 3: Parallel Computing for Econometricians with Amazon Web Services


Page 4: Parallel Computing for Econometricians with Amazon Web Services


The New Way

Page 5: Parallel Computing for Econometricians with Amazon Web Services


Page 6: Parallel Computing for Econometricians with Amazon Web Services

Table of Contents

Tools Overview

Hadoop

Amazon Web Services

A Simple EMR and R Example
  The R code - mapper
  Resources List

segue and a SML Example
  Simulated Maximum Likelihood Example
  multicore - on the way to segue
  diving into segue

Other EC2 Software Options

Conclusion


Page 8: Parallel Computing for Econometricians with Amazon Web Services

Algorithms and Implementations

- "Stupidly parallel" - e.g. a for loop where each iteration is independent.
- Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node.
- Need more? Use Hadoop / MapReduce - can do complicated mapping and aggregation, in addition to the stupidly parallel stuff.
- MapReduce - use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
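To make the "stupidly parallel" case concrete, here is a small sketch of the refactoring the rest of the talk relies on: an independent-iteration for loop rewritten as an lapply over a list. The simulate.once function is invented for illustration; the point is that once a job has this shape, lapply can be swapped for multicore's mclapply (or, later, segue's emrlapply) with no other changes.

```r
# A hypothetical per-iteration task: each call depends only on its
# input seed, so iterations can run in any order, on any machine.
simulate.once <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

# "Stupidly parallel" for loop...
results.loop <- numeric(100)
for (i in 1:100) results.loop[i] <- simulate.once(i)

# ...is equivalent to an lapply over a list of inputs.
results.list <- unlist(lapply(1:100, simulate.once))

all.equal(results.loop, results.list)
```

Because each call reseeds from its own input, the loop and the lapply versions return identical results, which is what makes the parallel substitution safe.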

Page 9: Parallel Computing for Econometricians with Amazon Web Services

In this presentation, we will use Hadoop either directly through Elastic MapReduce or indirectly via the segue package for R.

Page 10: Parallel Computing for Econometricians with Amazon Web Services

Alternatives

- Wait a long time
- Use multiple cores, e.g. http://www.rforge.net/doc/packages/multicore/mclapply.html
- Take over the computer lab and start jobs by hand
- Buy your own cluster (huge initial cost, and it will sit underutilized most of the time)


Page 12: Parallel Computing for Econometricians with Amazon Web Services

What is it?

- Hadoop is made by the Apache Software Foundation, which produces open source software. Contributors to the foundation include both large companies and individuals.
- Hadoop Common: the common utilities that support the other Hadoop subprojects.
- HDFS: a distributed file system that provides high-throughput access to application data.
- MapReduce: a software framework for distributed processing of large data sets on compute clusters.
- Often, when people say "Hadoop" they mean Hadoop's implementation of the MapReduce algorithm.
- The algorithm was created by Google. Documented here: http://labs.google.com/papers/mapreduce.html

Page 13: Parallel Computing for Econometricians with Amazon Web Services

What is it for?

- Used to process many TB of webserver logs for metrics, targeted ad placement, etc.
- Users include:
  - Google - calculating PageRank, processing traffic, etc.
  - Yahoo - more than 100,000 CPUs in various clusters, including a 4,000-node cluster. Used for ad placement, etc.
  - LinkedIn - huge social network graphs - "you may know..."
  - Amazon - creating product search indices
- See: http://wiki.apache.org/hadoop/PoweredBy

Page 14: Parallel Computing for Econometricians with Amazon Web Services

MAPREDUCE EXAMPLE – WORD COUNT

[Figure: word-count data flow. Input documents pass through mappers, which emit ("word", document) pairs - e.g. "This", Doc1; "Word", Doc1; "This", Doc2; "This", Doc3; "Word", Doc3. A sort phase groups the pairs by word, and reducers then produce the output counts: "This", 3 and "Word", 2.]

Page 15: Parallel Computing for Econometricians with Amazon Web Services

Algorithm

The idea is that the job is broken into map and reduce steps.

- Mapper: processes input and creates chunks
- Reducer: aggregates the chunks

Hadoop provides a Java implementation of this algorithm. Features include fault tolerance, adding nodes on the fly, extreme speed, and more. Although Hadoop itself is implemented in Java, Hadoop Streaming allows mappers and reducers in any language, communicating over <STDIN> and <STDOUT>.
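To show the shape of the map and reduce steps, here is the word-count example mimicked entirely in local R: a mapper emits (word, 1) pairs, a grouping step stands in for Hadoop's sort/shuffle, and a reducer sums each group. This is only a sketch of the algorithm's structure, not Hadoop's actual API.

```r
# Mapper: one input record (a line of text) -> list of (key, value) pairs
map.words <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nchar(words) > 0]
  lapply(words, function(w) list(key = w, value = 1))
}

# Shuffle/sort: flatten the emitted pairs and group them by key
input <- c("This word", "This", "This word")
pairs <- unlist(lapply(input, map.words), recursive = FALSE)
keys  <- sapply(pairs, `[[`, "key")
vals  <- sapply(pairs, `[[`, "value")
grouped <- split(vals, keys)

# Reducer: aggregate each key's values independently
counts <- sapply(grouped, sum)
counts["this"]  # 3
counts["word"]  # 2
```

Because each reducer only ever sees one key's values, the reduce phase parallelizes across keys exactly as the map phase parallelizes across input records.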

Page 16: Parallel Computing for Econometricians with Amazon Web Services

Hadoop Performance Statistics

- Hadoop is FAST! From the 2010 competition: http://sortbenchmark.org/


Page 18: Parallel Computing for Econometricians with Amazon Web Services

What is this cloud?

- Cloud computing is the idea of abstracting away from hardware
- All data and computing resources are managed services
- Pay per hour, based on need

Page 19: Parallel Computing for Econometricians with Amazon Web Services

AWS Overview

Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:

- EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Instances range from micro instances ($0.02/hr) to 8-core, 70 GB RAM "quad-XL" machines ($2.00/hr) to GPU machines ($2.10/hr).
- EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
- S3 - Simple Storage Service - store VERY large objects in the cloud.
- RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the RMySQL package. E.g.:
  select date, price from myTable where TICKER='AMZN'
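A sketch of the RDS-to-R workflow via the generic DBI interface. RSQLite (in-memory) stands in for the MySQL connection here so the example runs locally; against RDS you would instead pass RMySQL's driver and your RDS endpoint and credentials to dbConnect. The table contents are invented for illustration.

```r
library(DBI)
library(RSQLite)

# In-memory SQLite standing in for an RDS MySQL instance.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented sample data in the layout the slide's query assumes.
dbWriteTable(con, "myTable",
             data.frame(TICKER = c("AMZN", "GOOG", "AMZN"),
                        date   = c("2011-03-01", "2011-03-01", "2011-03-02"),
                        price  = c(173.3, 600.8, 172.1),
                        stringsAsFactors = FALSE))

# The same query shape as above, returned directly as a data frame:
prices <- dbGetQuery(con,
  "select date, price from myTable where TICKER = 'AMZN'")
dbDisconnect(con)
```

The appeal of this pattern is that dbGetQuery hands back an ordinary data frame, so downstream analysis code does not care whether the source was RDS, MySQL, or SQLite.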

Page 20: Parallel Computing for Econometricians with Amazon Web Services

AWS Links

- EC2 - http://aws.amazon.com/ec2/
- EMR - http://aws.amazon.com/elasticmapreduce/
  - Getting started guide - http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
- S3 - http://aws.amazon.com/s3/


Page 22: Parallel Computing for Econometricians with Amazon Web Services

Steps

1. Write the mapper in R. The output will be aggregated by Hadoop's aggregate function.
2. Create input files
3. Upload all to S3
4. Configure the EMR job in the AWS Management Console
5. Done!

Page 23: Parallel Computing for Econometricians with Amazon Web Services

Files

The directory emr.simpleExample/simpleSimRmapper contains the following:

- makeData.R generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
- fileSplit.sh takes a directory of input files and prepares them for use with EMR (more on this later)
- sjb.simpleMapper.R takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
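A minimal sketch of what a makeData.R-style generator might look like. The true coefficients, column names, and file layout here are invented, not taken from the original script; the point is that each file holds a response plus three regressors, so the mapper's per-file regression has known values to recover.

```r
# Generate one simulated csv in a 4-column layout: a response y and
# regressors x1..x3. The coefficients (1.5, 2.0, -0.5, 0.25) are made up.
make.one.file <- function(path, n = 1000) {
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1.5 + 2.0 * x1 - 0.5 * x2 + 0.25 * x3 + rnorm(n, sd = 0.1)
  write.csv(data.frame(y, x1, x2, x3), path, row.names = FALSE)
}

# A worker can then read a file back and run the regression the
# mapper would run; the estimates should land near the true values.
f <- tempfile(fileext = ".csv")
make.one.file(f)
dat <- read.csv(f)
cf <- coef(lm(y ~ x1 + x2 + x3, data = dat))
```

In the real pipeline this loop would run 1000 times, once per output file, before fileSplit.sh and the upload to S3.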


Page 25: Parallel Computing for Econometricians with Amazon Web Services

Mapper functions

- INPUT: <STDIN>. This can be
  - A seed to a random number generator
  - Raw data text to process
  - A list of file names to process - we are doing this one.
- OUTPUT: <STDOUT> (print it!), which next goes to the reducer.

Page 26: Parallel Computing for Econometricians with Amazon Web Services

General R Mapper Code Outline

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)

  # process and print results
}
close(con)

Page 27: Parallel Computing for Econometricians with Amazon Web Services

Simple Mapper

file: sjb.simpleMapper.R. Algorithm:

- get the file from S3
- read it
- run the regression
- print results in a way that aggregate can read
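A hedged sketch of the per-file step of such a mapper. The real sjb.simpleMapper.R also fetches the file from S3, which is omitted here, and the column names are placeholders. The "DoubleValueSum:" key prefix is the convention Hadoop's aggregate package uses to pick a summing reducer for each key; how the summed coefficients are combined across files (e.g. divided by the file count to get an average) is an assumption, not taken from the original script.

```r
# Process one local csv: run the regression and print each coefficient
# as a "DoubleValueSum:<key>\t<value>" line, which Hadoop's aggregate
# reducer will sum per key across all files.
process.file <- function(fname) {
  dat <- read.csv(fname)
  cf <- coef(lm(y ~ x1 + x2 + x3, data = dat))
  for (nm in names(cf)) {
    cat(sprintf("DoubleValueSum:%s\t%.10f\n", nm, cf[[nm]]))
  }
}
```

Under Hadoop Streaming this function would be called once per file name read from <STDIN>, exactly as in the mapper outline on the previous slides.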

Page 28: Parallel Computing for Econometricians with Amazon Web Services

Let's run it!

Page 29: Parallel Computing for Econometricians with Amazon Web Services

Overview

1. Made some data with makeData.R
2. Used fileSplit.sh to make lists of files to grab from S3. These lists will be fed into the mapper. Then transferred the data and lists to S3. See moveToS3.sh for a list of commands, but don't try to run it directly.
3. sjb.simpleMapper.R reads lines. Each line names a file. It opens the file, does some work, and prints some output.
4. Configured the job on EMR using the AWS Management Console, using the standard aggregator to aggregate results.

Page 30: Parallel Computing for Econometricians with Amazon Web Services

Numbers

Consider this: in less than 10 minutes, we

- Instantiated a cluster of 13 m2.xlarge instances (68.4 GB RAM, 8 cores each)
- Installed the Linux OS and Hadoop software on all nodes
- Distributed approx. 20 GB of data to the nodes
- Ran some analysis in R
- Aggregated the results
- Shut down the cluster


Page 32: Parallel Computing for Econometricians with Amazon Web Services

Useful Links

- Good EMR R Discussion
- Hadoop on EMR with C# and F#
- Hadoop Aggregate


Page 34: Parallel Computing for Econometricians with Amazon Web Services

Description

From the project website:

  Segue has a simple goal: Parallel functionality in R; two lines of code; in under 15 minutes.

  - J.D. Long

From the segue homepage: http://code.google.com/p/segue/

Page 35: Parallel Computing for Econometricians with Amazon Web Services

AWS API - the underpinning of segue

- API stands for Application Program Interface
- All Amazon Web Services have APIs, which allow programmatic access. This exposes many more features than the AWS Management Console.
- For example, through the API one can start and stop a cluster without adding jobs, add nodes to a running cluster, etc.
- Using the API, you can write programs that treat clusters as native objects
- segue is such a program

Page 36: Parallel Computing for Econometricians with Amazon Web Services

segue usage

- segue is ideal for CPU-bound applications - e.g. simulations
- It replaces lapply, which applies a function to the elements of a list, with emrlapply, which distributes the evaluation of the function to a cluster via Elastic MapReduce
- The list can be anything - seeds to a random number generator, matrices to invert, data frames to analyse, etc.


Page 38: Parallel Computing for Econometricians with Amazon Web Services

code overview

Note: code is available on my website, http://econsteve.com/r. We show 3 levels of optimization:

- For loops to matrices
- Evaluating firms on multiple cores
- Evaluating firms on multiple computers on EC2

Page 39: Parallel Computing for Econometricians with Amazon Web Services

Simulated MLE

We use the simulator

\[
\ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left( \frac{1}{R} \sum_{r=1}^{R} \left[ \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_{ri}) \right] \right)
\]

where i ∈ {1, ..., N} indexes a person among people, or a firm in a set of firms, R is the number of simulation draws, with R ∝ √N, and T_i is the length of the data for firm i.

Page 40: Parallel Computing for Econometricians with Amazon Web Services

With for loops - R pseudocode

panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
  logLik <- 0
  uir <- qnorm(seedMatrix)
  for (n in 1:N) {
    LiR <- 0
    for (r in 1:R) {
      myProduct <- 1
      alpha.r <- mu.a + uir[r, (2*n)-1] * sigma.a
      beta.r  <- mu.b + uir[r, (2*n)]   * sigma.b
      for (t in 1:T) {
        # fi = residual density using Y, THETA
        myProduct <- myProduct * fi
      }
      LiR <- LiR + myProduct
    } # end for r in R
    Li <- LiR / R
    logLik <- logLik + log(Li)
  } # end for n
  return(logLik)
}

Page 41: Parallel Computing for Econometricians with Amazon Web Services

With for loops - R pseudocode

We then maximize the likelihood function as:

optimRes <- optim(THETA.init1, panelLogLik.simple, gr = NULL,
                  dataList, seedMatrix, control = list(fnscale = -1))

This is extremely slow on one processor and does not lend itself to parallelization. (30 min for 60 firms - we didn't bother to test more.)

Page 42: Parallel Computing for Econometricians with Amazon Web Services

Opt 1 - matrices, lists, lapply

We adopt a new approach with the following rules:

- Structure the data as a list of lists, where each sublist contains the data, the ticker symbol, and the u_ri draws for the relevant coefficients
- Make a firm (i ∈ N) likelihood function, and an outer panel likelihood function which sums the results over firms
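For concreteness, one element of this list-of-lists structure might look as follows. The field names DATA, UIRALPHA, and UIRBETA follow the firm likelihood code shown on the next slide; the ticker, dimensions, and contents are invented for illustration.

```r
# One element of dataList: everything the firm likelihood needs for one firm.
R  <- 50   # number of simulation draws (invented)
Tn <- 80   # length of this firm's data (invented)

firm.item <- list(
  TICKER   = "XYZ",                                   # invented ticker symbol
  DATA     = data.frame(X = rnorm(Tn), Y = rnorm(Tn)),
  UIRALPHA = matrix(rnorm(R * Tn), nrow = R),         # draws for alpha
  UIRBETA  = matrix(rnorm(R * Tn), nrow = R)          # draws for beta
)

dataList <- list(firm.item)  # in practice, one sublist per firm

# The outer panel likelihood can then apply a function over firms:
# firmLik <- lapply(dataList, firmLikelihood, THETA, R)
```

Packaging each firm's data and draws into one self-contained sublist is what makes the per-firm evaluation independent, and therefore trivially distributable.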

Page 43: Parallel Computing for Econometricians with Amazon Web Services

Opt 1 - matrices, lists, lapply - firm likelihood

# this should be an extremely fast firm likelihood function
firmLikelihood <- function(dataListItem, THETA, R) {
  sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
  mu.b <- THETA[4]; sigma.b <- THETA[5]

  data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
  T <- nrow(data.n)

  uirAlpha <- dataListItem$UIRALPHA
  uirBeta  <- dataListItem$UIRBETA

  alpha.rmat <- mu.a + uirAlpha * sigma.a
  beta.rmat  <- mu.b + uirBeta  * sigma.b
  YtStack <- repmat(Y.n, R, 1)
  XtStack <- repmat(X.n, R, 1)
  residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
  fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
  myProductVec <- apply(fitMat, 1, prod)
  Li2 <- sum(myProductVec) / R
  return(Li2)
}

Page 44: Parallel Computing for Econometricians with Amazon Web Services

The list-based outer loop

panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
  # the seed matrix has R rows and 2*N columns, where there are
  # N firms and 2 parameters of interest (alpha and beta)
  uir <- qnorm(seedMatrix)
  R <- nrow(seedMatrix)

  # notice that we can calculate the likelihoods independently for
  # each firm, so we can make a function and use lapply. This will
  # be useful for parallelization
  firmLik <- lapply(dataList, firmLikelihood, THETA, R)
  logLik <- sum(log(unlist(firmLik)))
  return(logLik)
}


Page 46: Parallel Computing for Econometricians with Amazon Web Services


The list-based outer loop - multicore

Use the R multicore library, and replace lapply with mclapply atthe outer loop.

library(multicore)
...
firmLik <- mclapply(dataList, firmLikelihood, THETA, R)

This will lead to some substantial speedups.

Page 47: Parallel Computing for Econometricians with Amazon Web Services

multicore

N: 200, R: 150, T: 80, logLik: -34951.8. On a 4-core laptop:

> proc.time()
    user   system  elapsed
 389.180   36.960  125.674

N: 1000, R: 320, T: 80, logLik: -174621.9. On an EC2 2XL:

> proc.time()
    user   system  elapsed
 2705.77  2686.08   417.74

N: 5000, R: 710, T: 80, logLik: -870744.4:

> proc.time()
     user    system   elapsed
16206.480 16067.150  2768.588

multicore can provide quick and easy parallelization. Write the program so that the parallel part is an operation on a list, then replace lapply with mclapply.

Page 48: Parallel Computing for Econometricians with Amazon Web Services


Bad

Page 49: Parallel Computing for Econometricians with Amazon Web Services


Good

Page 50: Parallel Computing for Econometricians with Amazon Web Services

- multicore is nice for optimizing a local job.
- Most machines today have at least 2 cores. Many have 4 or 8.
- However, that is still only 1 machine. Let's use n of them →


Page 52: Parallel Computing for Econometricians with Amazon Web Services

installing segue

Install the prerequisite packages rJava and caTools. On Ubuntu Linux:

sudo apt-get install r-cran-rjava r-cran-catools

Then, download and install segue: http://code.google.com/p/segue/

Page 53: Parallel Computing for Econometricians with Amazon Web Services

Using segue

Now in R we do:

> library(segue)

Since we will be using our AWS account, we need to set credentials so that other people can't launch clusters in our name. To get our credentials, go to http://aws.amazon.com/account/ and click "Security Credentials". Then go back into R:

setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")

Page 54: Parallel Computing for Econometricians with Amazon Web Services

Firing up the cluster in segue

Use the createCluster command:

createCluster(numInstances = 2, cranPackages, filesOnNodes,
              rObjectsOnNodes, enableDebugging = FALSE, instancesPerNode,
              masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
              location = "us-east-1a", ec2KeyName, copy.image = FALSE,
              otherBootstrapActions, sourcePackagesToInstall)

In our case, let's fire up 10 m2.4xlarge instances. This gives us 80 cores and 684 GB of RAM to play with.

Page 55: Parallel Computing for Econometricians with Amazon Web Services

parallel random number generation

> myList <- NULL
> set.seed(1)
> for (i in 1:10) {
    a <- c(rnorm(999), NA)
    myList[[i]] <- a
  }
> outputLocal <- lapply(myList, mean, na.rm = TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = TRUE)
> all.equal(outputEmr, outputLocal)
[1] TRUE

segue handles this for you. This is very important for simulation.

Page 56: Parallel Computing for Econometricians with Amazon Web Services

Monte Carlo π estimation

estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- .5  # radius... in case the unit circle is too boring
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}

seedList <- as.list(1:100)
require(segue)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
myPi <- Reduce(sum, myEstimates) / length(myEstimates)

> format(myPi, digits = 10)
[1] "3.14166556"

Page 57: Parallel Computing for Econometricians with Amazon Web Services

parallel MLE

Using the code from sml.segue.R on my website. It is exactly the same as the multicore example, but with the addition of 2 lines to start the cluster.


Page 59: Parallel Computing for Econometricians with Amazon Web Services

EC2 has GPUs

Cluster GPU Quadruple Extra Large Instance

- 22 GB of memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem architecture)
- 2 x NVIDIA Tesla Fermi M2050 GPUs, 1690 GB of instance storage, 64-bit platform
- I/O Performance: Very High (10 Gigabit Ethernet)
- API name: cg1.4xlarge

The Fermi chips are important because they have ECC memory, so simulations are accurate. These are much more robust than gamer GPUs and cost $2800 per card. Each machine has 2. You can use them for $2.10 per hour.

Page 60: Parallel Computing for Econometricians with Amazon Web Services

RHIPE

- RHIPE = R and Hadoop Integrated Processing Environment
- http://www.stat.purdue.edu/~sguha/rhipe/
- Implements an rhlapply function
- Exposes much more of Hadoop's underlying functionality, including the HDFS
- May be better for large-data applications

Page 61: Parallel Computing for Econometricians with Amazon Web Services

StarCluster I

- Allows instantiation of generic clusters on EC2
- Use MPI (Message Passing Interface) for much more complicated parallel programs, e.g. holding one giant matrix across the RAM of several nodes
- From their page:
  - Simple configuration with sensible defaults
  - Single "start" command to automatically launch and configure one or more clusters on EC2
  - Support for attaching and NFS-sharing Amazon Elastic Block Storage (EBS) volumes for persistent storage across a cluster
  - Comes with a publicly available Amazon Machine Image (AMI) configured for scientific computing
  - The AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and other useful libraries

Page 62: Parallel Computing for Econometricians with Amazon Web Services

StarCluster II

- Clusters are automatically configured with NFS, the Sun Grid Engine queuing system, and password-less ssh between machines
- Supports user-contributed "plugins" that allow users to perform additional setup routines on the cluster after StarCluster's defaults
- http://web.mit.edu/stardev/cluster/

Page 63: Parallel Computing for Econometricians with Amazon Web Services

Matlab

- You can do it in theory, but you need either a license manager or the Matlab compiler
- It will cost you.
- Whitepaper from Mathworks: http://www.mathworks.com/programs/techkits/ec2_paper.html
- You may be able to coax EMR into running a compiled Matlab script, but you would have to bootstrap each machine with the libraries required to run compiled Matlab applications
- Mathworks has no incentive to support this behaviour
- Requires toolboxes ($$$).


Page 65: Parallel Computing for Econometricians with Amazon Web Services

EC2 and Hadoop are Extremely Powerful

- Huge and active communities behind both Hadoop (Apache) and EC2 (Amazon).
- EC2 and AWS in general let you change the way you think about computing resources: as a service rather than as devices to manage.
- New AWS features are always being added

Page 66: Parallel Computing for Econometricians with Amazon Web Services

AWS in Education

AMAZON WILL GIVE YOU MONEY

- Researcher: send them your proposal, they send you credits, you thank them in the paper.
- Teacher: if you are teaching a class, each student gets a $100 credit, good for one year. This would be great for teaching econometrics, where you can provide a machine image with software and data already available.
- Additionally, AWS can cover your backups (S3) and other tech needs

Page 67: Parallel Computing for Econometricians with Amazon Web Services

Resources

- My website http://www.econsteve.com/r for the code in this presentation
- AWS Management Console http://aws.amazon.com/console/
- AWS Blog http://aws.typepad.com
- AWS in Education http://aws.amazon.com/education/