Parallel Computing for Econometricians with Amazon Web Services

Stephen J. Barr
University of Rochester
March 2, 2011

Page 1: Parallel Computing for Econometricians with Amazon Web Services

Parallel Computing for Econometricians with Amazon Web Services

Stephen J. Barr

University of Rochester

March 2, 2011

Page 2: Parallel Computing for Econometricians with Amazon Web Services


The Old Way

Page 3: Parallel Computing for Econometricians with Amazon Web Services


Page 4: Parallel Computing for Econometricians with Amazon Web Services


The New Way

Page 5: Parallel Computing for Econometricians with Amazon Web Services


Page 6: Parallel Computing for Econometricians with Amazon Web Services

Table of Contents

Tools Overview

Hadoop

Amazon Web Services

A Simple EMR and R Example
  The R code - mapper
  Resources List

segue and a SML Example
  Simulated Maximum Likelihood Example
  multicore - on the way to segue
  diving into segue

Other EC2 Software Options

Conclusion


Page 8: Parallel Computing for Econometricians with Amazon Web Services

Algorithms and Implementations

- "Stupidly parallel" - e.g. a for loop where each iteration is independent.
- Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node.
- Need more? Use Hadoop / MapReduce - can do complicated mapping and aggregation, in addition to the stupidly parallel stuff.
- MapReduce - use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
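To make the "stupidly parallel" case concrete, here is a small sketch of the refactoring the rest of the talk relies on: an independent-iteration for loop rewritten as an lapply over a list. The simulate.once function is invented for illustration; the point is that once a job has this shape, lapply can be swapped for multicore's mclapply (or, later, segue's emrlapply) with no other changes.

```r
# A hypothetical per-iteration task: each call depends only on its
# input seed, so iterations can run in any order, on any machine.
simulate.once <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

# "Stupidly parallel" for loop...
results.loop <- numeric(100)
for (i in 1:100) results.loop[i] <- simulate.once(i)

# ...is equivalent to an lapply over a list of inputs.
results.list <- unlist(lapply(1:100, simulate.once))

all.equal(results.loop, results.list)
```

Because each call reseeds from its own input, the loop and the lapply versions return identical results, which is what makes the parallel substitution safe.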

Page 9: Parallel Computing for Econometricians with Amazon Web Services

In this presentation, we will use Hadoop either directly through Elastic MapReduce or indirectly via the segue package for R.

Page 10: Parallel Computing for Econometricians with Amazon Web Services

Alternatives

- Wait a long time
- Use multiple cores, e.g. http://www.rforge.net/doc/packages/multicore/mclapply.html
- Take over the computer lab and start jobs by hand
- Buy your own cluster (huge initial cost, and it will sit underutilized most of the time)


Page 12: Parallel Computing for Econometricians with Amazon Web Services

What is it?

- Hadoop is made by the Apache Software Foundation, which produces open source software. Contributors to the foundation include both large companies and individuals.
- Hadoop Common: the common utilities that support the other Hadoop subprojects.
- HDFS: a distributed file system that provides high-throughput access to application data.
- MapReduce: a software framework for distributed processing of large data sets on compute clusters.
- Often, when people say "Hadoop" they mean Hadoop's implementation of the MapReduce algorithm.
- The algorithm was created by Google. Documented here: http://labs.google.com/papers/mapreduce.html

Page 13: Parallel Computing for Econometricians with Amazon Web Services

What is it for?

- Used to process many TB of webserver logs for metrics, targeted ad placement, etc.
- Users include:
  - Google - calculating PageRank, processing traffic, etc.
  - Yahoo - more than 100,000 CPUs in various clusters, including a 4,000-node cluster. Used for ad placement, etc.
  - LinkedIn - huge social network graphs - "you may know..."
  - Amazon - creating product search indices
- See: http://wiki.apache.org/hadoop/PoweredBy

Page 14: Parallel Computing for Econometricians with Amazon Web Services

MAPREDUCE EXAMPLE – WORD COUNT

[Figure: word-count data flow. Input documents pass through mappers, which emit ("word", document) pairs - e.g. "This", Doc1; "Word", Doc1; "This", Doc2; "This", Doc3; "Word", Doc3. A sort phase groups the pairs by word, and reducers then produce the output counts: "This", 3 and "Word", 2.]

Page 15: Parallel Computing for Econometricians with Amazon Web Services

Algorithm

The idea is that the job is broken into map and reduce steps.

- Mapper: processes input and creates chunks
- Reducer: aggregates the chunks

Hadoop provides a Java implementation of this algorithm. Features include fault tolerance, adding nodes on the fly, extreme speed, and more. Although Hadoop itself is implemented in Java, Hadoop Streaming allows mappers and reducers in any language, communicating over <STDIN> and <STDOUT>.
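To show the shape of the map and reduce steps, here is the word-count example mimicked entirely in local R: a mapper emits (word, 1) pairs, a grouping step stands in for Hadoop's sort/shuffle, and a reducer sums each group. This is only a sketch of the algorithm's structure, not Hadoop's actual API.

```r
# Mapper: one input record (a line of text) -> list of (key, value) pairs
map.words <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nchar(words) > 0]
  lapply(words, function(w) list(key = w, value = 1))
}

# Shuffle/sort: flatten the emitted pairs and group them by key
input <- c("This word", "This", "This word")
pairs <- unlist(lapply(input, map.words), recursive = FALSE)
keys  <- sapply(pairs, `[[`, "key")
vals  <- sapply(pairs, `[[`, "value")
grouped <- split(vals, keys)

# Reducer: aggregate each key's values independently
counts <- sapply(grouped, sum)
counts["this"]  # 3
counts["word"]  # 2
```

Because each reducer only ever sees one key's values, the reduce phase parallelizes across keys exactly as the map phase parallelizes across input records.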

Page 16: Parallel Computing for Econometricians with Amazon Web Services

Hadoop Performance Statistics

- Hadoop is FAST! From the 2010 competition: http://sortbenchmark.org/


Page 18: Parallel Computing for Econometricians with Amazon Web Services

What is this cloud?

- Cloud computing is the idea of abstracting away from hardware
- All data and computing resources are managed services
- Pay per hour, based on need

Page 19: Parallel Computing for Econometricians with Amazon Web Services

AWS Overview

Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:

- EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Instances range from micro instances ($0.02/hr) to 8-core, 70 GB RAM "quad-XL" machines ($2.00/hr) to GPU machines ($2.10/hr).
- EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
- S3 - Simple Storage Service - store VERY large objects in the cloud.
- RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the RMySQL package. E.g.:
  select date, price from myTable where TICKER='AMZN'
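A sketch of the RDS-to-R workflow via the generic DBI interface. RSQLite (in-memory) stands in for the MySQL connection here so the example runs locally; against RDS you would instead pass RMySQL's driver and your RDS endpoint and credentials to dbConnect. The table contents are invented for illustration.

```r
library(DBI)
library(RSQLite)

# In-memory SQLite standing in for an RDS MySQL instance.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented sample data in the layout the slide's query assumes.
dbWriteTable(con, "myTable",
             data.frame(TICKER = c("AMZN", "GOOG", "AMZN"),
                        date   = c("2011-03-01", "2011-03-01", "2011-03-02"),
                        price  = c(173.3, 600.8, 172.1),
                        stringsAsFactors = FALSE))

# The same query shape as above, returned directly as a data frame:
prices <- dbGetQuery(con,
  "select date, price from myTable where TICKER = 'AMZN'")
dbDisconnect(con)
```

The appeal of this pattern is that dbGetQuery hands back an ordinary data frame, so downstream analysis code does not care whether the source was RDS, MySQL, or SQLite.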

Page 20: Parallel Computing for Econometricians with Amazon Web Services

AWS Links

- EC2 - http://aws.amazon.com/ec2/
- EMR - http://aws.amazon.com/elasticmapreduce/
  - Getting started guide - http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
- S3 - http://aws.amazon.com/s3/


Page 22: Parallel Computing for Econometricians with Amazon Web Services

Steps

1. Write the mapper in R. The output will be aggregated by Hadoop's aggregate function.
2. Create input files
3. Upload all to S3
4. Configure the EMR job in the AWS Management Console
5. Done!

Page 23: Parallel Computing for Econometricians with Amazon Web Services

Files

The directory emr.simpleExample/simpleSimRmapper contains the following:

- makeData.R generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
- fileSplit.sh takes a directory of input files and prepares them for use with EMR (more on this later)
- sjb.simpleMapper.R takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
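A minimal sketch of what a makeData.R-style generator might look like. The true coefficients, column names, and file layout here are invented, not taken from the original script; the point is that each file holds a response plus three regressors, so the mapper's per-file regression has known values to recover.

```r
# Generate one simulated csv in a 4-column layout: a response y and
# regressors x1..x3. The coefficients (1.5, 2.0, -0.5, 0.25) are made up.
make.one.file <- function(path, n = 1000) {
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1.5 + 2.0 * x1 - 0.5 * x2 + 0.25 * x3 + rnorm(n, sd = 0.1)
  write.csv(data.frame(y, x1, x2, x3), path, row.names = FALSE)
}

# A worker can then read a file back and run the regression the
# mapper would run; the estimates should land near the true values.
f <- tempfile(fileext = ".csv")
make.one.file(f)
dat <- read.csv(f)
cf <- coef(lm(y ~ x1 + x2 + x3, data = dat))
```

In the real pipeline this loop would run 1000 times, once per output file, before fileSplit.sh and the upload to S3.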


Page 25: Parallel Computing for Econometricians with Amazon Web Services

Mapper functions

- INPUT: <STDIN>. This can be
  - A seed to a random number generator
  - Raw data text to process
  - A list of file names to process - we are doing this one.
- OUTPUT: <STDOUT> (print it!), which next goes to the reducer.

Page 26: Parallel Computing for Econometricians with Amazon Web Services

General R Mapper Code Outline

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)

  # process and print results
}
close(con)

Page 27: Parallel Computing for Econometricians with Amazon Web Services

Simple Mapper

file: sjb.simpleMapper.R. Algorithm:

- get the file from S3
- read it
- run the regression
- print results in a way that aggregate can read
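A hedged sketch of the per-file step of such a mapper. The real sjb.simpleMapper.R also fetches the file from S3, which is omitted here, and the column names are placeholders. The "DoubleValueSum:" key prefix is the convention Hadoop's aggregate package uses to pick a summing reducer for each key; how the summed coefficients are combined across files (e.g. divided by the file count to get an average) is an assumption, not taken from the original script.

```r
# Process one local csv: run the regression and print each coefficient
# as a "DoubleValueSum:<key>\t<value>" line, which Hadoop's aggregate
# reducer will sum per key across all files.
process.file <- function(fname) {
  dat <- read.csv(fname)
  cf <- coef(lm(y ~ x1 + x2 + x3, data = dat))
  for (nm in names(cf)) {
    cat(sprintf("DoubleValueSum:%s\t%.10f\n", nm, cf[[nm]]))
  }
}
```

Under Hadoop Streaming this function would be called once per file name read from <STDIN>, exactly as in the mapper outline on the previous slides.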

Page 28: Parallel Computing for Econometricians with Amazon Web Services

Let's run it!

Page 29: Parallel Computing for Econometricians with Amazon Web Services

Overview

1. Made some data with makeData.R
2. Used fileSplit.sh to make lists of files to grab from S3. These lists will be fed into the mapper. Then transferred the data and lists to S3. See moveToS3.sh for a list of commands, but don't try to run it directly.
3. sjb.simpleMapper.R reads lines. Each line names a file. It opens the file, does some work, and prints some output.
4. Configured the job on EMR using the AWS Management Console, using the standard aggregator to aggregate results.

Page 30: Parallel Computing for Econometricians with Amazon Web Services

Numbers

Consider this: in less than 10 minutes, we

- Instantiated a cluster of 13 m2.xlarge instances (68.4 GB RAM, 8 cores each)
- Installed the Linux OS and Hadoop software on all nodes
- Distributed approx. 20 GB of data to the nodes
- Ran some analysis in R
- Aggregated the results
- Shut down the cluster


Page 32: Parallel Computing for Econometricians with Amazon Web Services

Useful Links

- Good EMR R Discussion
- Hadoop on EMR with C# and F#
- Hadoop Aggregate


Page 34: Parallel Computing for Econometricians with Amazon Web Services

Description

From the project website:

  Segue has a simple goal: Parallel functionality in R; two lines of code; in under 15 minutes.

  - J.D. Long

From the segue homepage: http://code.google.com/p/segue/

Page 35: Parallel Computing for Econometricians with Amazon Web Services

AWS API - the underpinning of segue

- API stands for Application Program Interface
- All Amazon Web Services have APIs, which allow programmatic access. This exposes many more features than the AWS Management Console.
- For example, through the API one can start and stop a cluster without adding jobs, add nodes to a running cluster, etc.
- Using the API, you can write programs that treat clusters as native objects
- segue is such a program

Page 36: Parallel Computing for Econometricians with Amazon Web Services

segue usage

- segue is ideal for CPU-bound applications - e.g. simulations
- It replaces lapply, which applies a function to the elements of a list, with emrlapply, which distributes the evaluation of the function to a cluster via Elastic MapReduce
- The list can be anything - seeds to a random number generator, matrices to invert, data frames to analyse, etc.


Page 38: Parallel Computing for Econometricians with Amazon Web Services

code overview

Note: code is available on my website, http://econsteve.com/r. We show 3 levels of optimization:

- For loops to matrices
- Evaluating firms on multiple cores
- Evaluating firms on multiple computers on EC2

Page 39: Parallel Computing for Econometricians with Amazon Web Services

Simulated MLE

We use the simulator

\[
\ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left( \frac{1}{R} \sum_{r=1}^{R} \left[ \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_{ri}) \right] \right)
\]

where i ∈ {1, ..., N} indexes a person among people, or a firm in a set of firms, R is the number of simulation draws, with R ∝ √N, and T_i is the length of the data for firm i.

Page 40: Parallel Computing for Econometricians with Amazon Web Services

With for loops - R pseudocode

panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
  logLik <- 0
  uir <- qnorm(seedMatrix)
  for (n in 1:N) {
    LiR <- 0
    for (r in 1:R) {
      myProduct <- 1
      alpha.r <- mu.a + uir[r, (2*n)-1] * sigma.a
      beta.r  <- mu.b + uir[r, (2*n)]   * sigma.b
      for (t in 1:T) {
        # fi = residual density using Y, THETA
        myProduct <- myProduct * fi
      }
      LiR <- LiR + myProduct
    } # end for r in R
    Li <- LiR / R
    logLik <- logLik + log(Li)
  } # end for n
  return(logLik)
}

Page 41: Parallel Computing for Econometricians with Amazon Web Services

With for loops - R pseudocode

We then maximize the likelihood function as:

optimRes <- optim(THETA.init1, panelLogLik.simple, gr = NULL,
                  dataList, seedMatrix, control = list(fnscale = -1))

This is extremely slow on one processor and does not lend itself to parallelization. (30 min for 60 firms - we didn't bother to test more.)

Page 42: Parallel Computing for Econometricians with Amazon Web Services

Opt 1 - matrices, lists, lapply

We adopt a new approach with the following rules:

- Structure the data as a list of lists, where each sublist contains the data, the ticker symbol, and the u_ri draws for the relevant coefficients
- Make a firm (i ∈ N) likelihood function, and an outer panel likelihood function which sums the results over firms
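For concreteness, one element of this list-of-lists structure might look as follows. The field names DATA, UIRALPHA, and UIRBETA follow the firm likelihood code shown on the next slide; the ticker, dimensions, and contents are invented for illustration.

```r
# One element of dataList: everything the firm likelihood needs for one firm.
R  <- 50   # number of simulation draws (invented)
Tn <- 80   # length of this firm's data (invented)

firm.item <- list(
  TICKER   = "XYZ",                                   # invented ticker symbol
  DATA     = data.frame(X = rnorm(Tn), Y = rnorm(Tn)),
  UIRALPHA = matrix(rnorm(R * Tn), nrow = R),         # draws for alpha
  UIRBETA  = matrix(rnorm(R * Tn), nrow = R)          # draws for beta
)

dataList <- list(firm.item)  # in practice, one sublist per firm

# The outer panel likelihood can then apply a function over firms:
# firmLik <- lapply(dataList, firmLikelihood, THETA, R)
```

Packaging each firm's data and draws into one self-contained sublist is what makes the per-firm evaluation independent, and therefore trivially distributable.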

Page 43: Parallel Computing for Econometricians with Amazon Web Services

Opt 1 - matrices, lists, lapply - firm likelihood

# this should be an extremely fast firm likelihood function
firmLikelihood <- function(dataListItem, THETA, R) {
  sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
  mu.b <- THETA[4]; sigma.b <- THETA[5]

  data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
  T <- nrow(data.n)

  uirAlpha <- dataListItem$UIRALPHA
  uirBeta  <- dataListItem$UIRBETA

  alpha.rmat <- mu.a + uirAlpha * sigma.a
  beta.rmat  <- mu.b + uirBeta  * sigma.b
  YtStack <- repmat(Y.n, R, 1)
  XtStack <- repmat(X.n, R, 1)
  residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
  fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
  myProductVec <- apply(fitMat, 1, prod)
  Li2 <- sum(myProductVec) / R
  return(Li2)
}

Page 44: Parallel Computing for Econometricians with Amazon Web Services

The list-based outer loop

panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
  # the seed matrix has R rows and 2*N columns, where there are
  # N firms and 2 parameters of interest (alpha and beta)
  uir <- qnorm(seedMatrix)
  R <- nrow(seedMatrix)

  # notice that we can calculate the likelihoods independently for
  # each firm, so we can make a function and use lapply. This will
  # be useful for parallelization
  firmLik <- lapply(dataList, firmLikelihood, THETA, R)
  logLik <- sum(log(unlist(firmLik)))
  return(logLik)
}


Page 46: Parallel Computing for Econometricians with Amazon Web Services


The list-based outer loop - multicore

Use the R multicore library, and replace lapply with mclapply atthe outer loop.

library(multicore)
...
firmLik <- mclapply(dataList, firmLikelihood, THETA, R)

This will lead to some substantial speedups.

Page 47: Parallel Computing for Econometricians with Amazon Web Services

multicore

N: 200, R: 150, T: 80, logLik: -34951.8. On a 4-core laptop:

> proc.time()
    user   system  elapsed
 389.180   36.960  125.674

N: 1000, R: 320, T: 80, logLik: -174621.9. On an EC2 2XL:

> proc.time()
    user   system  elapsed
 2705.77  2686.08   417.74

N: 5000, R: 710, T: 80, logLik: -870744.4:

> proc.time()
     user    system   elapsed
16206.480 16067.150  2768.588

multicore can provide quick and easy parallelization. Write the program so that the parallel part is an operation on a list, then replace lapply with mclapply.

Page 48: Parallel Computing for Econometricians with Amazon Web Services


Bad

Page 49: Parallel Computing for Econometricians with Amazon Web Services


Good

Page 50: Parallel Computing for Econometricians with Amazon Web Services

- multicore is nice for optimizing a local job.
- Most machines today have at least 2 cores. Many have 4 or 8.
- However, that is still only 1 machine. Let's use n of them →


Page 52: Parallel Computing for Econometricians with Amazon Web Services

installing segue

Install the prerequisite packages rJava and caTools. On Ubuntu Linux:

sudo apt-get install r-cran-rjava r-cran-catools

Then, download and install segue: http://code.google.com/p/segue/

Page 53: Parallel Computing for Econometricians with Amazon Web Services

Using segue

Now in R we do:

> library(segue)

Since we will be using our AWS account, we need to set credentials so that other people can't launch clusters in our name. To get our credentials, go to http://aws.amazon.com/account/ and click "Security Credentials". Then go back into R:

setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")

Page 54: Parallel Computing for Econometricians with Amazon Web Services

Firing up the cluster in segue

Use the createCluster command:

createCluster(numInstances = 2, cranPackages, filesOnNodes,
              rObjectsOnNodes, enableDebugging = FALSE, instancesPerNode,
              masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
              location = "us-east-1a", ec2KeyName, copy.image = FALSE,
              otherBootstrapActions, sourcePackagesToInstall)

In our case, let's fire up 10 m2.4xlarge instances. This gives us 80 cores and 684 GB of RAM to play with.

Page 55: Parallel Computing for Econometricians with Amazon Web Services

parallel random number generation

> myList <- NULL
> set.seed(1)
> for (i in 1:10) {
    a <- c(rnorm(999), NA)
    myList[[i]] <- a
  }
> outputLocal <- lapply(myList, mean, na.rm = TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = TRUE)
> all.equal(outputEmr, outputLocal)
[1] TRUE

segue handles this for you. This is very important for simulation.

Page 56: Parallel Computing for Econometricians with Amazon Web Services

Monte Carlo π estimation

estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- .5  # radius... in case the unit circle is too boring
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}

seedList <- as.list(1:100)
require(segue)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
myPi <- Reduce(sum, myEstimates) / length(myEstimates)

> format(myPi, digits = 10)
[1] "3.14166556"

Page 57: Parallel Computing for Econometricians with Amazon Web Services

parallel MLE

Using the code from sml.segue.R on my website. It is exactly the same as the multicore example, but with the addition of 2 lines to start the cluster.


Page 59: Parallel Computing for Econometricians with Amazon Web Services

EC2 has GPUs

Cluster GPU Quadruple Extra Large Instance

- 22 GB of memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem architecture)
- 2 x NVIDIA Tesla Fermi M2050 GPUs, 1690 GB of instance storage, 64-bit platform
- I/O Performance: Very High (10 Gigabit Ethernet)
- API name: cg1.4xlarge

The Fermi chips are important because they have ECC memory, so simulations are accurate. These are much more robust than gamer GPUs and cost $2800 per card. Each machine has 2. You can use them for $2.10 per hour.

Page 60: Parallel Computing for Econometricians with Amazon Web Services

RHIPE

- RHIPE = R and Hadoop Integrated Processing Environment
- http://www.stat.purdue.edu/~sguha/rhipe/
- Implements an rhlapply function
- Exposes much more of Hadoop's underlying functionality, including the HDFS
- May be better for large-data applications

Page 61: Parallel Computing for Econometricians with Amazon Web Services

StarCluster I

- Allows instantiation of generic clusters on EC2
- Use MPI (Message Passing Interface) for much more complicated parallel programs, e.g. holding one giant matrix across the RAM of several nodes
- From their page:
  - Simple configuration with sensible defaults
  - Single "start" command to automatically launch and configure one or more clusters on EC2
  - Support for attaching and NFS-sharing Amazon Elastic Block Storage (EBS) volumes for persistent storage across a cluster
  - Comes with a publicly available Amazon Machine Image (AMI) configured for scientific computing
  - The AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and other useful libraries

Page 62: Parallel Computing for Econometricians with Amazon Web Services

StarCluster II

- Clusters are automatically configured with NFS, the Sun Grid Engine queuing system, and password-less ssh between machines
- Supports user-contributed "plugins" that allow users to perform additional setup routines on the cluster after StarCluster's defaults
- http://web.mit.edu/stardev/cluster/

Page 63: Parallel Computing for Econometricians with Amazon Web Services

Matlab

- You can do it in theory, but you need either a license manager or the Matlab compiler
- It will cost you.
- Whitepaper from Mathworks: http://www.mathworks.com/programs/techkits/ec2_paper.html
- You may be able to coax EMR into running a compiled Matlab script, but you would have to bootstrap each machine with the libraries required to run compiled Matlab applications
- Mathworks has no incentive to support this behaviour
- Requires toolboxes ($$$).


Page 65: Parallel Computing for Econometricians with Amazon Web Services

EC2 and Hadoop are Extremely Powerful

- Huge and active communities behind both Hadoop (Apache) and EC2 (Amazon).
- EC2 and AWS in general let you change the way you think about computing resources: as a service rather than as devices to manage.
- New AWS features are always being added

Page 66: Parallel Computing for Econometricians with Amazon Web Services

AWS in Education

AMAZON WILL GIVE YOU MONEY

- Researcher: send them your proposal, they send you credits, you thank them in the paper.
- Teacher: if you are teaching a class, each student gets a $100 credit, good for one year. This would be great for teaching econometrics, where you can provide a machine image with software and data already available.
- Additionally, AWS can cover your backups (S3) and other tech needs

Page 67: Parallel Computing for Econometricians with Amazon Web Services

Resources

- My website http://www.econsteve.com/r for the code in this presentation
- AWS Management Console http://aws.amazon.com/console/
- AWS Blog http://aws.typepad.com
- AWS in Education http://aws.amazon.com/education/