[harvard cs264] 07 - gpu cluster programming (mpi & zeromq)
Lecture #7: GPU Cluster Programming | March 8th, 2011
Nicolas Pinto (MIT, Harvard) pinto@mit.edu
Massively Parallel Computing (CS 264 / CSCI E-292)
Administrivia
• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11
• Project info: http://www.cs264.org/projects/projects.html
• Project ideas: http://forum.cs264.org/index.php?board=6.0
• Project proposal deadline: Fri 3/25/11 (but you should submit well before then so you can start working on it asap)
• Need a private repo for your project?
Let us know! Poll on the forum: http://forum.cs264.org/index.php?topic=228.0
Goodies
• Guest Lectures: 14 distinguished speakers
• Schedule updated (see website)
Goodies (cont’d)
• Amazon AWS free credits coming soon (only for students who completed HW0+1)
• It’s a more-than-$14,000 donation for the class!
• Special thanks: Kurt Messersmith @ Amazon
Goodies (cont’d)
• Best Project Prize: Tesla C2070 (Fermi) Board
• It’s a more-than-$4,000 donation for the class!
• Special thanks: David Luebke & Chandra Cheij @ NVIDIA
During this course, we’ll try to reuse existing material, adapted for CS264 ;-)
Today
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
Types of Parallelism
Some computations are ‘embarrassingly parallel’
Can do a lot of computation on minimal data
RC5, DES, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
Some Terminology
One way to classify machines distinguishes between:
• shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.
• distributed memory: private memory for each processor, accessible only by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[Figure: “distributed memory” (each processor P with its own memory M, linked by an interconnection network) vs. “shared memory” (processors P sharing memories M through an interconnection network)]
Hybrid approaches are increasingly common; today most machines are hybrid.
Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now ‘shared memory machines’
slide by Richard Edgar
Shared Memory Machines
NASA ‘Columbia’ Computer: up to 2048 cores in a single system
slide by Richard Edgar
Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don’t have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
Distributed Memory Clusters
NASA ‘Pleiades’ Cluster: 51,200 cores
slide by Richard Edgar
Distributed Memory Model
Communication is key issue
Each node has its own address space (exclusive access, no global memory)
Could use TCP/IP
Painfully low level
Solution: a communication protocol like message-passing (e.g. MPI)
slide by Richard Edgar
Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by working group at SC92
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
Means of sending messages is invisible to the programmer
Uses shared memory if available! (i.e. shared memory can be used behind the scenes on shared-memory architectures)
Sits at Layer 5 (Session) and above of the OSI model
slide by Richard Edgar
OSI Model ?
Message Passing Interface
MPI is a standard, a specification, for message-passing libraries
Two major implementations of MPI
MPICH
OpenMPI
Programs should work with either
slide by Richard Edgar
Basic Idea
• Usually programmed with the SPMD model (single program, multiple data)
• In MPI-1 the number of tasks is static: cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2.
• No assumptions on the type of interconnection network; all processors can send a message to any other processor.
• All parallelism is explicit: the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms
adapted from Berger & Klöckner (NYU 2010)
Credits: James Carr (OCI)
Hello World
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
adapted from Berger & Klöckner (NYU 2010)
Hello World
To compile: need to load the MPI wrappers in addition to the compiler modules (OpenMPI, MPICH, ...), e.g.
module load openmpi/intel/1.3.3
or
module load mpi/openmpi/1.2.8/gnu
To compile: mpicc hello.c
To run: need to tell how many processes you are requesting
mpiexec -n 10 a.out   (or: mpirun -np 10 a.out)
adapted from Berger & Klöckner (NYU 2010)
http://www.youtube.com/watch?v=pLqjQ55tz-U
The beauty of data visualization
Example: gprof2dot
“ They’ve done studies, you know. 60% of the time, it works every time... ”
- Brian Fantana (Anchorman, 2004)
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
A Minimal MPI Program
Very much a minimal program
No actual communication occurs
#include <iostream>
#include <cstdlib>
using namespace std;

#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    cout << "Hello World!" << endl;
    MPI::Finalize();
    return EXIT_SUCCESS;
}
slide by Richard Edgar
A Minimal MPI Program
To compile MPI programs use mpic++:
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for the default compiler
Adds in libraries
Use mpic++ --show to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
A Minimal MPI Program
To run the program, use mpirun:
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print “Hello World” twice
slide by Richard Edgar
Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
Communicators
Queries COMM_WORLD communicator for
Number of processes
Current process rank (ID)
Prints these out
Process rank counts from zero
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();
iMyProc = MPI::COMM_WORLD.Get_rank();
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;
MPI::Finalize();
slide by Richard Edgar
Communicators
By convention, the process with rank 0 is the master:
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
Messages
Haven’t sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count, const Datatype& datatype, int dest, int tag ) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
Receiving Messages
A similar call receives:
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );
The method prototype is
void Comm::Recv( void* buf, int count, const Datatype& datatype, int source, int tag ) const
Blocks until data arrives
Wildcards: MPI::ANY_SOURCE, MPI::ANY_TAG
slide by Richard Edgar
MPI Datatypes
MPI datatypes are independent of language and endianness. The most common are:

MPI Datatype   C/C++
MPI::CHAR      signed char
MPI::SHORT     signed short
MPI::INT       signed int
MPI::LONG      signed long
MPI::FLOAT     float
MPI::DOUBLE    double
MPI::BYTE      untyped byte data
slide by Richard Edgar
MPI Send & Receive
Master process sends out numbers
Worker processes print out number received
if( iMyProc == iMasterProc ) {
    for( int i = 1; i < nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage << " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag );
    }
} else {
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag );
    cout << "Process " << iMyProc << " received " << iMessage << endl;
}
slide by Richard Edgar
Six Basic MPI Routines
Have now encountered six MPI routines:
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank()
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
Barriers
Barriers require all processes to synchronise:
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
Broadcasts
Suppose one process has an array to be shared with all:
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
All will have a[10] identical to iSrcProc on completion
slide by Richard Edgar
MPI Broadcast
[Figure: Broadcast — before, only P0 holds A; after, P0–P3 all hold A]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
MPI Reduce
ABCDP0
P1
P2
P3
P0
P1
P2
P3
Reduce
A
B
C
D
Reduction operators can be min, max, sum, multiply, logicalops, max value and location ... Must be associative(commutative optional)
adapted from Berger & Klöckner (NYU 2010)
Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
MPI Scatter/Gather
[Figure: Scatter — P0 holds A, B, C, D and sends one piece to each of P0–P3; Gather is the inverse]
adapted from Berger & Klöckner (NYU 2010)
MPI Allgather
[Figure: Allgather — P0–P3 each contribute A, B, C, D; afterwards every process holds all of A, B, C, D]
adapted from Berger & Klöckner (NYU 2010)
MPI Alltoall
[Figure: Alltoall — P0–P3 start with rows A0–A3, B0–B3, C0–C3, D0–D3; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on (a transpose of the data across processes)]
adapted from Berger & Klöckner (NYU 2010)
Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Will give better performance
Trickier to use
slide by Richard Edgar
User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define own datatypes for this
slide by Richard Edgar
MPI-2
• One-sided RMA (remote memory access) communication
  • potential for greater efficiency, easier programming
  • uses “windows” into memory to expose regions for access
  • race conditions now possible
• Parallel I/O: like message passing, but to the file system rather than to other processes
• Allows for a dynamic number of processes and inter-communicators (as opposed to intra-communicators)
• Cleaned-up MPI-1
adapted from Berger & Klöckner (NYU 2010)
RMA
• Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate).
• RMA window objects are created by collective window-creation functions (MPI_Win_create must be called by all participants).
• Before accessing, call MPI_Win_fence (or other synchronization mechanisms) to start an RMA access epoch; a fence (like a barrier) separates local ops on the window from remote ops.
• RMA operations are non-blocking; separate synchronization is needed to check completion: call MPI_Win_fence again.
[Figure: a Put through an RMA window, from P0 local memory into P1 local memory]
adapted from Berger & Klöckner (NYU 2010)
Some MPI Bugs
Sample MPI bug #1 (code shown on slide): only works for an even number of processors. What’s wrong?
adapted from Berger & Klöckner (NYU 2010)
Sample MPI bug #2: suppose you have a local variable, e.g. energy, and you want to sum all the processors’ energy values to find the total energy of the system.
Recall:
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
Using the same variable for both buffers, as in
MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)
will bomb: MPI forbids aliasing the send and receive buffers. (MPI-2 added MPI_IN_PLACE, passed as sendbuf on the root, for exactly this case.)
adapted from Berger & Klöckner (NYU 2010)
Communication Topologies
Some topologies very common
Grid, hypercube etc.
API provided to set up communicators following these
slide by Richard Edgar
Parallel Performance
Recall Amdahl’s law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really:
T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
• Topology: the structure used to connect the nodes
• Routing algorithm: how messages are transmitted between processors, along which path (= the nodes along which the message is transferred)
• Switching strategy: how a message is cut into pieces and assigned a path
• Flow control (for dealing with congestion): stall, store data in buffers, re-route data, tell the source to halt, discard, etc.
adapted from Berger & Klöckner (NYU 2010)
Interconnection Network
Represent as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional (transfer messages in both directions at the same time). Characterize a network by:
• diameter: maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission)
• degree: number of direct links at a node (number of direct neighbors)
• bisection bandwidth: minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously)
• node/edge connectivity: number of nodes/edges that must fail to disconnect the network (a measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
Linear Array
• p vertices, p − 1 links
• Diameter = p − 1
• Degree = 2
• Bisection bandwidth = 1
• Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klöckner (NYU 2010)
Ring topology
• diameter = p/2
• degree = 2
• bisection bandwidth = 2
• node connectivity = 2, edge connectivity = 2
adapted from Berger & Klöckner (NYU 2010)
Mesh topology
• diameter = 2(√p − 1); for a 3-d mesh, 3(∛p − 1)
• degree = 4 (6 in 3-d)
• bisection bandwidth = √p
• node connectivity = 2, edge connectivity = 2
Route along each dimension in turn
adapted from Berger & Klöckner (NYU 2010)
Torus topology
Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled over the mesh
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
[Figure: 1-, 2-, 3-, and 4-dimensional hypercubes with nodes labelled by binary strings]
• p = 2^k processors labelled with binary numbers of length k
• The k-dimensional cube is constructed from two (k − 1)-cubes
• Connect corresponding processors if their labels differ in 1 bit
(Hamming distance d between two k-bit binary words = a path of length d between the 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
• diameter = k (= log p)
• degree = k
• bisection bandwidth = p/2
• node connectivity = k, edge connectivity = k
adapted from Berger & Klöckner (NYU 2010)
Dynamic Networks
The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links.
Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.
• bus
• crossbar
• multistage network, e.g. butterfly, omega, baseline
adapted from Berger & Klöckner (NYU 2010)
Crossbar
[Figure: processors P1..Pn connected to memories M1..Mm through an n × m grid of switches]
• Connecting n inputs and m outputs takes nm switches (typically only for small numbers of processors)
• At each switch the connection can either go straight or change direction
• Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
Butterfly
[Figure: 16 × 16 butterfly network, stages 0–3, rows labelled 000–111]
For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switches
adapted from Berger & Klöckner (NYU 2010)
Fat tree
• Complete binary tree
• Processors at leaves
• Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
Current picture
• Old style: mapped algorithms to topologies
• New style: avoid topology-specific optimizations
• Want code that runs on next year’s machines too
• Topology awareness in vendor MPI libraries?
• Software topology: ease of programming, but not used for performance?
adapted from Berger & Klöckner (NYU 2010)
Should we care?
• Old school: map algorithms to specific topologies
• New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure...)
• Meta-programming / auto-tuning?
Top500 Interconnects
[Chart: “Interconnect Family Share Over Time”, top500.org (06/2010 list, page retrieved 10/30/10)]
adapted from Berger & Klöckner (NYU 2010)
MPI References
• Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/
• Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
• Using MPI-2: Advanced Features of the Message Passing Interface, by Gropp, Lusk, Thakur
• Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
Ignite: Google Trends
http://www.youtube.com/watch?v=m0b-QX0JDXc
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem matching MPI processes to GPUs
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
MPI Summary
MPI provides cross-platform interprocess communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
ZeroMQ
• ‘messaging middleware’, ‘TCP on steroids’, ‘a new layer on the networking stack’
• not a complete messaging system
• just a simple messaging library to be used programmatically
• a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
ZeroMQ
• Fastest. Messaging. Ever.
• Excellent documentation:
• examples
• white papers for everything
• Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
Message Patterns
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
Demo: Why ZeroMQ ?
http://www.youtube.com/watch?v=_JCBphyciAs
MPI vs ZeroMQ ?
• MPI is a specification, ZeroMQ is an implementation.
• Design:
• MPI is designed for tightly-coupled compute clusters with fast and reliable networks.
• ZeroMQ is designed for large distributed systems (web-like).
• Fault tolerance:
• MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!).
• ZeroMQ is resilient to faults and network instability.
• ZeroMQ could be a good transport layer for an MPI-like implementation.
http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
CUDASA
Fast Forward
CUDASA: Compute Unified Device Systems Architecture
Visualization Research Institute (VISUS), University of Stuttgart
[Slides: the CUDASA slide text was garbled by a font-encoding error during extraction. In outline, the slides presented: motivation (extending the CUDA programming model from a single GPU to multi-GPU systems and GPU clusters); the CUDASA layer model (bus, network, and application layers added on top of CUDA’s GPU layer, with minimally invasive language extensions embedded in the CUDA compile process); the implementation (a self-contained pre-compiler translating CUDASA code to CUDA plus threads/MPI); scaling results for multi-GPU matrix multiplication and parallel image rendering; and conclusions, including planned extensions for data-locality awareness and preparations for a public release.]
MultiGPU MapReduce
Fast Forward
MapReduce
http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
Why MapReduce?
• Simple programming model
• Parallel programming model
• Scalable
• Previous GPU work: neither multi-GPU nor out-of-core
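The "simple programming model" claim is easiest to see in code: the user supplies only a map and a reduce function, and the framework handles grouping by key. A single-process word-count sketch (no GPU involved; all names invented):

```python
# Minimal MapReduce-style word count: map emits (word, 1) pairs,
# the framework groups values by key, reduce sums each group.
from collections import defaultdict

def map_fn(doc):
    return [(w, 1) for w in doc.split()]

def reduce_fn(key, values):
    return key, sum(values)

def mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        for k, v in map_fn(doc):
            groups[k].append(v)            # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["to be or not to be"], map_fn, reduce_fn)
print(counts["to"], counts["be"])          # 2 2
```

Everything a multi-GPU runtime like GPMR adds (partitioning, transfers, scheduling) hides behind this same two-function interface.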
Benchmarks—Which
• Matrix Multiplication (MM)
• Word Occurrence (WO)
• Sparse-Integer Occurrence (SIO)
• Linear Regression (LR)
• K-Means Clustering (KMC)
• (Volume Renderer—presented 90 minutes ago @ MapReduce ’10)
Benchmarks—Why
• Needed to stress aspects of GPMR
• Unbalanced work (WO)
• Multiple emits/Non-uniform number of emits (LR, KMC, WO)
• Sparsity of keys (SIO)
• Accumulation (WO, LR, KMC)
• Many key-value pairs (SIO)
• Compute Bound Scalability (MM)
Benchmarks—Results
Benchmarks—Results
MM | SIO | WO | KMC | LR
Input Element Size: — | 4 bytes | 1 byte | 16 bytes | 8 bytes
# Elems in first set (×10^6): 1024^2, 2048^2, 4096^2, 16384^2 | 1, 8, 32, 128 | 1, 16, 64, 512 | 1, 8, 32, 512 | 1, 16, 64, 512
# Elems in second set (×10^6/GPU): — | 1, 2, 4, 8, 16, 32 | 1, 2, 4, 8, 16, 32, 64, 128, 256 | 1, 2, 4, 8, 16, 32 | 1, 2, 4, 8, 16, 32, 64
TABLE 1: Dataset sizes for all four benchmarks. We tested Phoenix against the first input set for SIO, KMC, LR, and the second set for WO. We test GPMR against all available input sets.
MM KMC LR SIO WO
1-GPU Speedup 162.712 2.991 1.296 1.450 11.080
4-GPU Speedup 559.209 11.726 4.085 2.322 18.441
TABLE 2: Speedup for GPMR over Phoenix on our large (second-
biggest) input data from our first set. The exception is MM, for which
we use our small input set (Phoenix required almost twenty seconds
to multiply two 1024×1024 matrices).
MM KMC WO
1-GPU Speedup 2.695 37.344 3.098
4-GPU Speedup 10.760 129.425 11.709
TABLE 3: Speedup for GPMR over Mars on 4096×4096 Matrix
Multiplication, an 8M-point K-Means Clustering, and a 512 MB
Word Occurrence. These sizes represent the largest problems that
can meet the in-core memory requirements of Mars.
Table 2 summarizes speedup results over Phoenix, while Table 3 gives
speedup results of GPMR over Mars. Note that GPMR, even
in the one-GPU configuration, is faster on all benchmarks than
either Phoenix or Mars, and GPMR shows good scalability to
four GPUs as well.
Source code size is another important metric. One signif-
icant benefit of MapReduce in general is its high level of
abstraction: as a result, code sizes are small and development
time is reduced, since the developer does not have to focus
on the low-level details of communication and scheduling but
instead on the algorithm. Table 4 shows the different number
of lines required for each of three benchmarks implemented
in Phoenix, Mars, and GPMR. We would also like to show
developer time required to implement each benchmark for
each platform, but neither Mars nor Phoenix published such
information (and we wanted to use the applications provided
so as not to introduce bias in Mars’s or Phoenix’s runtimes). As
a frame of reference, the lead author of this paper implemented
and tested MM in GPMR in three hours, SIO in half an hour,
KMC in two hours, LR in two hours, and WO in four hours.
KMC, LR, and WO were then later modified in about half an
hour each to add Accumulation.
7 CONCLUSION
GPMR offers many benefits to MapReduce programmers.
The most important is scalability. While it is unrealistic to
expect perfect scalability from all but the most compute-bound
tasks, GPMR’s minimal overhead and transfer costs position
it well in comparison to other MapReduce implementations.
MM KMC WO
Phoenix 317 345 231
Mars 235 152 140
GPMR 214 129 397
TABLE 4: Lines of source code for three common benchmarks
written in Phoenix, Mars, and GPMR. We exclude setup code from
all counts as it was roughly the same for all benchmarks and had
little to do with the actual MapReduce code. For GPMR we included
boilerplate code in the form of class header files and C++ wrapper
functions that invoke CUDA kernels. If we excluded these files,
GPMR’s totals would be even smaller. Also, WO is so large because
of the hashing required in GPMR’s implementation.
Fig. 2: GPMR runtime breakdowns on our largest datasets.
This figure shows how each application exhibits different runtime
characteristics, and also how exhibited characteristics change as we
increase the number of GPUs.
GPMR also offers flexibility to developers in several areas,
particularly when compared with Mars. GPMR allows flexible
mappings between threads and keys and customization of the
MapReduce pipeline with additional communication-reducing
stages while still providing sensible default implementations.
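One such communication-reducing stage can be sketched as a combiner: key-value pairs are partially reduced before they leave a partition, so far fewer pairs cross the network. This is an illustrative stand-in for the idea, not GPMR's actual implementation:

```python
# Combiner sketch: partially reduce (key, value) pairs locally within a
# partition before communication, shrinking the data that must be sent.
from collections import Counter

def combine(partition_pairs):
    c = Counter()
    for k, v in partition_pairs:
        c[k] += v                  # partial reduction, local to the partition
    return list(c.items())

pairs = [("a", 1)] * 1000 + [("b", 1)] * 500
combined = combine(pairs)
print(len(pairs), "->", len(combined))   # 1500 -> 2
```

Because addition is associative and commutative, the final reduce over combined pairs yields the same answer as reducing the raw pairs, which is what makes the stage a safe, optional addition to the pipeline.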
Our results demonstrate that even difficult applications that
have not traditionally been addressed by GPUs can still show
good performance and scalability when implemented in GPMR.
vs. CPU
vs. GPU
Benchmarks - Results
Good
iPhD: one more thing... or two...
Life/Code Hacking #3: The Pomodoro Technique
http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
Life/Code Hacking #3: The Pomodoro Technique
http://www.youtube.com/watch?v=QYyJZOHgpco