Massively Parallel Computing — CS 264 / CSCI E-292

Lecture #7: GPU Cluster Programming | March 8th, 2011

Nicolas Pinto (MIT, Harvard) [email protected]


Page 1: Massively Parallel Computing

Lecture #7: GPU Cluster Programming | March 8th, 2011

Nicolas Pinto (MIT, Harvard) [email protected]

Massively Parallel Computing CS 264 / CSCI E-292

Page 2: Massively Parallel Computing

Administrivia

• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11

• Project info: http://www.cs264.org/projects/projects.html

• Project ideas: http://forum.cs264.org/index.php?board=6.0

• Project proposal deadline: Fri 3/25/11 (but you should submit well before that, so you can start working on it ASAP)

• Need a private repo for your project?

Let us know! Poll on the forum: http://forum.cs264.org/index.php?topic=228.0

Page 3: Massively Parallel Computing

Goodies

• Guest Lectures: 14 distinguished speakers

• Schedule updated (see website)

Page 4: Massively Parallel Computing

Goodies (cont’d)

• Amazon AWS free credits coming soon (only for students who completed HW0+1)

• It’s more than a $14,000 donation to the class!

• Special thanks: Kurt Messersmith @ Amazon

Page 5: Massively Parallel Computing

Goodies (cont’d)

• Best Project Prize: Tesla C2070 (Fermi) Board

• It’s more than a $4,000 donation to the class!

• Special thanks: David Luebke & Chandra Cheij @ NVIDIA

Page 6: Massively Parallel Computing

During this course, we’ll try to “ ” and use existing material ;-)

(adapted for CS264)

Page 7: Massively Parallel Computing

Today... yey!!

Page 8: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 9: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 10: Massively Parallel Computing

The Problem

Many computational problems too big for single CPU

Lack of RAM

Lack of CPU cycles

Want to distribute work between many CPUs

slide by Richard Edgar

Page 11: Massively Parallel Computing

Types of Parallelism

Some computations are ‘embarrassingly parallel’

Can do a lot of computation on minimal data

RC5, DES, SETI@home, etc.

Solution is to distribute across the Internet

Use TCP/IP or similar

slide by Richard Edgar

Page 12: Massively Parallel Computing

Types of Parallelism

Some computations very tightly coupled

Have to communicate a lot of data at each step

e.g. hydrodynamics

Internet latencies much too high

Need a dedicated machine

slide by Richard Edgar

Page 13: Massively Parallel Computing

Tightly Coupled Computing

Two basic approaches

Shared memory

Distributed memory

Each has advantages and disadvantages

slide by Richard Edgar

Page 14: Massively Parallel Computing

Some terminology

One way to classify machines distinguishes between:

• Shared memory — global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.

• Distributed memory — private memory for each processor, accessible only by that processor, so no synchronization is needed for memory accesses. Information is exchanged by sending data from one processor to another via an interconnection network, using explicit communication operations.

[Diagrams: “distributed memory” — each processor P has its own memory M, all connected by an interconnection network; “shared memory” — processors P access shared memories M through an interconnection network.]

Hybrid approach increasingly common (now: mostly hybrid).

Page 15: Massively Parallel Computing

Some terminology (cont’d)

(Same classification as above: “distributed memory” vs. “shared memory” diagrams; hybrid approach increasingly common — now mostly hybrid.)

Page 16: Massively Parallel Computing

Shared Memory Machines

Have lots of CPUs share the same memory banks

Spawn lots of threads

Each writes to globally shared memory

Multicore CPUs now ubiquitous

Most computers now ‘shared memory machines’

slide by Richard Edgar

Page 17: Massively Parallel Computing

Shared Memory Machines

NASA ‘Columbia’ Computer — up to 2,048 cores in a single system

slide by Richard Edgar

Page 18: Massively Parallel Computing

Shared Memory Machines

Spawning lots of threads (relatively) easy

pthreads, OpenMP

Don’t have to worry about data location

Disadvantage is memory performance scaling

Frontside bus saturates rapidly

Can use Non-Uniform Memory Architecture (NUMA)

Silicon Graphics Origin & Altix series

Gets expensive very fast

slide by Richard Edgar

Page 19: Massively Parallel Computing

Some terminology (recap)

(“Distributed memory” vs. “shared memory”, as above; hybrid approach increasingly common — now mostly hybrid.)

Page 20: Massively Parallel Computing

Distributed Memory Clusters

Alternative is a lot of cheap machines

High-speed network between individual nodes

Network can cost as much as the CPUs!

How do nodes communicate?

slide by Richard Edgar

Page 21: Massively Parallel Computing

Distributed Memory Clusters

NASA ‘Pleiades’ Cluster — 51,200 cores

slide by Richard Edgar

Page 22: Massively Parallel Computing

Distributed Memory Model

Communication is key issue

Each node has its own address space (exclusive access, no global memory?)

Could use TCP/IP

Painfully low level

Solution: a communication protocol like message-passing (e.g. MPI)

slide by Richard Edgar

Page 23: Massively Parallel Computing

Distributed Memory Model

All data must be explicitly partitioned

Exchange of data by explicit communication

slide by Richard Edgar

Page 24: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 25: Massively Parallel Computing

Message Passing Interface

MPI is a communication protocol for parallel programs

Language independent

Open standard

Originally created by working group at SC92

Bindings for C, C++, Fortran, Python, etc.

http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/

slide by Richard Edgar

Page 26: Massively Parallel Computing

Message Passing Interface

MPI processes have independent address spaces

Communicate by sending messages

Means of sending messages invisible

Use shared memory if available! (i.e. it can be used behind the scenes on shared-memory architectures)

On Level 5 (Session) and higher of OSI model

slide by Richard Edgar

Page 27: Massively Parallel Computing

OSI Model ?

Page 28: Massively Parallel Computing

Message Passing Interface

MPI is a standard, a specification, for message-passing libraries

Two major implementations of MPI

MPICH

OpenMPI

Programs should work with either

slide by Richard Edgar

Page 29: Massively Parallel Computing

Basic Idea

• Usually programmed with the SPMD model (single program, multiple data)

• In MPI-1 the number of tasks is static — cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2.

• No assumptions on the type of interconnection network; all processors can send a message to any other processor.

• All parallelism is explicit — the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms

adapted from Berger & Klöckner (NYU 2010)

Page 30: Massively Parallel Computing

Credits: James Carr (OCI)

Page 31: Massively Parallel Computing

Hello World

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello world from %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

adapted from Berger & Klöckner (NYU 2010)

Page 32: Massively Parallel Computing

Hello World

To compile: need to load the “MPI” wrappers in addition to the compiler modules (OpenMPI, MPICH, ...), e.g.:

module load openmpi/intel/1.3.3
module load mpi/openmpi/1.2.8/gnu

Then compile with: mpicc hello.c

To run: need to tell it how many processes you are requesting:

mpiexec -n 10 a.out    (or: mpirun -np 10 a.out)

adapted from Berger & Klöckner (NYU 2010)

Page 33: Massively Parallel Computing

http://www.youtube.com/watch?v=pLqjQ55tz-U

The beauty of data visualization

Page 34: Massively Parallel Computing

http://www.youtube.com/watch?v=pLqjQ55tz-U

The beauty of data visualization

Page 35: Massively Parallel Computing

Example: gprof2dot

Page 36: Massively Parallel Computing
Page 37: Massively Parallel Computing

“ They’ve done studies, you know. 60% of the time, it works every time... ”

- Brian Fantana (Anchorman, 2004)

Page 38: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 39: Massively Parallel Computing

Basic MPI

MPI is a library of routines

Bindings exist for many languages

Principal languages are C, C++ and Fortran

Python: mpi4py

We will discuss C++ bindings from now on

http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm

slide by Richard Edgar

Page 40: Massively Parallel Computing

Basic MPI

MPI allows processes to exchange messages

Processes are members of communicators

Communicator shared by all is MPI::COMM_WORLD

In C++ API, communicators are objects

Within a communicator, each process has unique ID

slide by Richard Edgar

Page 41: Massively Parallel Computing

A Minimal MPI Program

Very much a minimal program

No actual communication occurs

#include <iostream>
using namespace std;

#include <cstdlib>   // for EXIT_SUCCESS
#include "mpi.h"

int main( int argc, char* argv[] ) {

    MPI::Init( argc, argv );

    cout << "Hello World!" << endl;

    MPI::Finalize();

    return( EXIT_SUCCESS );
}

slide by Richard Edgar

Page 42: Massively Parallel Computing

A Minimal MPI Program

To compile MPI programs use mpic++:

mpic++ -o MyProg myprog.cpp

The mpic++ command is a wrapper for default compiler

Adds in libraries

Use mpic++ --show to see what it does

Will also find mpicc, mpif77 and mpif90 (usually)

slide by Richard Edgar

Page 43: Massively Parallel Computing

A Minimal MPI Program

To run the program, use mpirun:

mpirun -np 2 ./MyProg

The -np 2 option launches two processes

Check documentation for your cluster

Number of processes might be implicit

Program should print “Hello World” twice

slide by Richard Edgar

Page 44: Massively Parallel Computing

Communicators

Processes are members of communicators

A process can

Find the size of a given communicator

Determine its ID (or rank) within it

Default communicator is MPI::COMM_WORLD

slide by Richard Edgar

Page 45: Massively Parallel Computing

Communicators

Queries COMM_WORLD communicator for

Number of processes

Current process rank (ID)

Prints these out

Process rank counts from zero

int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs  = MPI::COMM_WORLD.Get_size();
iMyProc = MPI::COMM_WORLD.Get_rank();
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;
MPI::Finalize();

slide by Richard Edgar

Page 46: Massively Parallel Computing

Communicators

By convention, the process with rank 0 is the master:

const int iMasterProc = 0;

Can have more than one communicator

Process may have different rank within each

slide by Richard Edgar
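A minimal sketch (not from the slides) of creating an additional communicator with the C++ bindings: Split() partitions COMM_WORLD by a “color”, and each process gets a (possibly different) rank in the new communicator. The even/odd grouping and the name halfComm are illustrative assumptions.

// Group processes by even/odd world rank; 'key' orders ranks inside each group.
int color = MPI::COMM_WORLD.Get_rank() % 2;
MPI::Intracomm halfComm = MPI::COMM_WORLD.Split( color, /* key = */ 0 );

int iMyHalfRank = halfComm.Get_rank();   // may differ from the COMM_WORLD rank
halfComm.Free();                         // release the communicator when done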

Page 47: Massively Parallel Computing

Messages

Haven’t sent any data yet

Communicators have Send and Recv methods for this

One process posts a Send

Must be matched by Recv in the target process

slide by Richard Edgar

Page 48: Massively Parallel Computing

Sending Messages

A sample send is as follows:

int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );

The method prototype is:

void Comm::Send( const void* buf, int count, const Datatype& datatype,
                 int dest, int tag ) const

MPI copies the buffer into a system buffer and returns

No delivery notification

slide by Richard Edgar

Page 49: Massively Parallel Computing

Receiving Messages

Similar call to receive:

int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );

Function prototype is:

void Comm::Recv( void* buf, int count, const Datatype& datatype,
                 int source, int tag ) const

Blocks until data arrives

Wildcards: MPI::ANY_SOURCE, MPI::ANY_TAG

slide by Richard Edgar

Page 50: Massively Parallel Computing

MPI Datatypes

MPI datatypes are independent of

Language

Endianess

Most common are listed below

MPI Datatype    C/C++
MPI::CHAR       signed char
MPI::SHORT      signed short
MPI::INT        signed int
MPI::LONG       signed long
MPI::FLOAT      float
MPI::DOUBLE     double
MPI::BYTE       untyped byte data

slide by Richard Edgar

Page 51: Massively Parallel Computing

MPI Send & Receive

Master process sends out numbers

Worker processes print out number received

if( iMyProc == iMasterProc ) {
    for( int i = 1; i < nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage << " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag );
    }
} else {
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag );
    cout << "Process " << iMyProc << " received " << iMessage << endl;
}

slide by Richard Edgar

Page 52: Massively Parallel Computing

Six Basic MPI Routines

Have now encountered six MPI routines:

MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank()
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()

These are enough to get started ;-)

More sophisticated routines available...

slide by Richard Edgar

Page 53: Massively Parallel Computing

Collective Communications

Send and Recv are point-to-point

Communicate between specific processes

Sometimes we want all processes to exchange data

These are called collective communications

slide by Richard Edgar

Page 54: Massively Parallel Computing

Barriers

Barriers require all processes to synchronise:

MPI::COMM_WORLD.Barrier();

Processes wait until all processes arrive at barrier

Potential for deadlock

Bad for performance

Only use if necessary

slide by Richard Edgar

Page 55: Massively Parallel Computing

Broadcasts

Suppose one process has an array to be shared with all:

int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );

If process has rank iSrcProc, it will send the array

Other processes will receive it

On completion, all will have a[10] identical to the copy on iSrcProc

slide by Richard Edgar

Page 56: Massively Parallel Computing

MPI Broadcast

Broadcast: before the call only the root (e.g. P0) holds A; afterwards every process P0–P3 holds A.

MPI_Bcast(&buf, count, datatype, root, comm)

All processors must call MPI_Bcast with the same root value.

adapted from Berger & Klöckner (NYU 2010)

Page 57: Massively Parallel Computing

Reductions

Suppose we have a large array split across processes

We want to sum all the elements

Use MPI::COMM_WORLD.Reduce() with the MPI::SUM operation (an MPI::Op)

Also MPI::COMM_WORLD.Allreduce() variant

Can perform MAX, MIN, MAXLOC, MINLOC too

slide by Richard Edgar
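A minimal sketch (not from the slides) of a sum reduction with the C++ bindings; the variable names are illustrative assumptions, and iMasterProc is the rank-0 constant defined earlier.

double partialSum = 0.0, totalSum = 0.0;
// ... each process accumulates its own partialSum ...

// The root (iMasterProc) receives the sum of all partialSum values.
MPI::COMM_WORLD.Reduce( &partialSum, &totalSum, 1, MPI::DOUBLE,
                        MPI::SUM, iMasterProc );

// Allreduce: every process gets the result, so there is no root argument.
MPI::COMM_WORLD.Allreduce( &partialSum, &totalSum, 1, MPI::DOUBLE, MPI::SUM );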

Page 58: Massively Parallel Computing

MPI Reduce

Reduce: values A, B, C, D held by P0–P3 are combined into a single result on the root process.

Reduction operators can be min, max, sum, multiply, logical ops, max value and location, ... They must be associative (commutativity optional).

adapted from Berger & Klöckner (NYU 2010)

Page 59: Massively Parallel Computing

Scatter and Gather

Split a large array between processes

Use MPI::COMM_WORLD.Scatter()

Each process receives part of the array

Combine small arrays into one large one

Use MPI::COMM_WORLD.Gather()

Designated process will construct entire array

Has MPI::COMM_WORLD.Allgather() variant

slide by Richard Edgar
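A minimal sketch (not from the slides) of Scatter and Gather with the C++ bindings; the chunk size, array names, and the assumption that the array divides evenly among 16 processes are illustrative.

const int nPerProc = 4;
int fullArray[ nPerProc * 16 ];          // only meaningful on the root
int myChunk[ nPerProc ];

// The root splits fullArray; every process receives nPerProc elements.
MPI::COMM_WORLD.Scatter( fullArray, nPerProc, MPI::INT,
                         myChunk,   nPerProc, MPI::INT, iMasterProc );

// ... work on myChunk locally ...

// The root reassembles the chunks into fullArray.
MPI::COMM_WORLD.Gather( myChunk,   nPerProc, MPI::INT,
                        fullArray, nPerProc, MPI::INT, iMasterProc );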

Page 60: Massively Parallel Computing

MPI Scatter/Gather

Scatter: the root’s array [A B C D] is split so that P0–P3 each receive one piece.

Gather: pieces A, B, C, D from P0–P3 are combined into one array [A B C D] on the root.

adapted from Berger & Klöckner (NYU 2010)

Page 61: Massively Parallel Computing

MPI Allgather

Allgather: P0–P3 each contribute one piece (A, B, C, D); afterwards every process holds the full array [A B C D].

adapted from Berger & Klöckner (NYU 2010)

Page 62: Massively Parallel Computing

MPI Alltoall

Alltoall: beforehand P0 holds A0 A1 A2 A3, P1 holds B0 B1 B2 B3, P2 holds C0 C1 C2 C3, P3 holds D0 D1 D2 D3. Afterwards each Pi holds the i-th element from every process, e.g. P0 holds A0 B0 C0 D0.

adapted from Berger & Klöckner (NYU 2010)

Page 63: Massively Parallel Computing

Asynchronous Messages

An asynchronous API exists too

Have to allocate buffers

Have to check if send or receive has completed

Will give better performance

Trickier to use

slide by Richard Edgar
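A minimal sketch (not from the slides) of the non-blocking calls with the C++ bindings; the buffers, ranks, and tag are illustrative assumptions. The buffers must not be touched until the matching Wait() returns.

int aSend[10], aRecv[10];

MPI::Request sendReq =
    MPI::COMM_WORLD.Isend( aSend, 10, MPI::INT, iTargetProc, iTag );
MPI::Request recvReq =
    MPI::COMM_WORLD.Irecv( aRecv, 10, MPI::INT, iSrcProc, iTag );

// ... overlap useful computation here ...

sendReq.Wait();   // aSend may be reused after this
recvReq.Wait();   // aRecv now holds the incoming message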

Page 64: Massively Parallel Computing

User-Defined Datatypes

Usually have complex data structures

Require means of distributing these

Can pack & unpack manually

MPI allows us to define own datatypes for this

slide by Richard Edgar
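A minimal sketch (not from the slides) of a user-defined datatype with the C++ bindings; the Particle struct, the array size, and the destination rank/tag are illustrative assumptions.

struct Particle { double x, y, z; };

// Describe a Particle as three contiguous doubles.
MPI::Datatype particleType = MPI::DOUBLE.Create_contiguous( 3 );
particleType.Commit();                    // must commit before use

Particle p[16];
MPI::COMM_WORLD.Send( p, 16, particleType, iTargetProc, iTag );

particleType.Free();                      // release when no longer needed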

Page 65: Massively Parallel Computing

MPI-2

• One-sided RMA (remote memory access) communication

• potential for greater efficiency, easier programming

• uses “windows” into memory to expose regions for access

• race conditions now possible

• Parallel I/O — like message passing, but to the file system rather than to other processes

• Allows for a dynamic number of processes and inter-communicators (as opposed to intra-communicators)

• Cleaned up MPI-1

adapted from Berger & Klöckner (NYU 2010)

Page 66: Massively Parallel Computing

RMA

• Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate).

• RMA window objects are created by collective window-creation functions (MPI_Win_create must be called by all participants).

• Before accessing, call MPI_Win_fence (or another synchronization mechanism) to start an RMA access epoch; the fence (like a barrier) separates local operations on the window from remote operations.

• RMA operations are non-blocking; separate synchronization is needed to check completion — call MPI_Win_fence again.

[Diagram: a Put transfers data from P0’s local memory into an RMA window in P1’s local memory.]

adapted from Berger & Klöckner (NYU 2010)
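A minimal sketch (not from the slides) of the calls named above, using the C API; the window contents, the ranks involved, and the single fence/Get/fence epoch are illustrative assumptions ('rank' is the value obtained from MPI_Comm_rank earlier).

double local = 0.0, remoteCopy = 0.0;
MPI_Win win;

/* Expose one double of this process's memory to the other processes. */
MPI_Win_create( &local, sizeof(double), sizeof(double),
                MPI_INFO_NULL, MPI_COMM_WORLD, &win );

MPI_Win_fence( 0, win );                  /* start the RMA access epoch   */
if ( rank == 0 )
    MPI_Get( &remoteCopy, 1, MPI_DOUBLE,  /* read one double ...          */
             1, 0, 1, MPI_DOUBLE, win );  /* ... from rank 1, offset 0    */
MPI_Win_fence( 0, win );                  /* complete the epoch           */

MPI_Win_free( &win );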

Page 67: Massively Parallel Computing

Some MPI Bugs

Page 68: Massively Parallel Computing

Sample MPI Bugs

Only works for even number of processors.

MPI Bugs

What’s wrong?

adapted from Berger & Klöckner (NYU 2010)

Page 69: Massively Parallel Computing

Sample MPI Bugs

Only works for even number of processors.

MPI Bugs

adapted from Berger & Klöckner (NYU 2010)

Page 70: Massively Parallel Computing

Sample MPI Bugs

Suppose you have a local variable, e.g. energy, and you want to sum all the processors’ energy to find the total energy of the system.

Recall:

MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)

Using the same variable for both buffers, as in

MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)

will bomb.

What’s wrong?

adapted from Berger & Klöckner (NYU 2010)
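One possible fix (a sketch, not from the slides, shown with the C API and MPI_DOUBLE): give MPI_Reduce distinct send and receive buffers — or pass MPI_IN_PLACE as the send buffer on the root.

double energy = 0.0;         /* this processor's local energy               */
double totalEnergy = 0.0;    /* result, valid on the root after the Reduce  */

MPI_Reduce( &energy, &totalEnergy, 1, MPI_DOUBLE, MPI_SUM,
            0, MPI_COMM_WORLD );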

Page 71: Massively Parallel Computing

Communication Topologies

Page 72: Massively Parallel Computing

Communication Topologies

Some topologies very common

Grid, hypercube etc.

API provided to set up communicators following these

slide by Richard Edgar

Page 73: Massively Parallel Computing

Parallel Performance

Recall Amdahl’s law:

if T1 = serial cost + parallel cost

then

Tp = serial cost + parallel cost/p

But really

Tp = serial cost + parallel cost/p + Tcommunication

How expensive is it?

adapted from Berger & Klöckner (NYU 2010)
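An illustrative worked example (numbers assumed, not from the slides): take T1 = 1 with a 10% serial fraction and p = 16 processors.

Tp = 0.1 + 0.9/16 ≈ 0.156          → speedup ≈ 6.4×
with communication, e.g. Tcomm = 0.05:
Tp = 0.1 + 0.9/16 + 0.05 ≈ 0.206   → speedup ≈ 4.9×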

Page 74: Massively Parallel Computing

Network Characteristics

Interconnection network connects nodes, transfers data

Important qualities:

• Topology — the structure used to connect the nodes

• Routing algorithm — how messages are transmitted between processors, along which path (= nodes along which the message is transferred)

• Switching strategy — how a message is cut into pieces and assigned a path

• Flow control (for dealing with congestion) — stall, store data in buffers, re-route data, tell the source to halt, discard, etc.

adapted from Berger & Klöckner (NYU 2010)

Page 75: Massively Parallel Computing

Interconnection Network

Represent the network as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional — they transfer messages in both directions at the same time. Characterize a network by:

• diameter — maximum over all pairs of nodes of the shortest path between the nodes (length of the path in message transmission)

• degree — number of direct links for a node (number of direct neighbors)

• bisection bandwidth — minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously)

• node/edge connectivity — number of nodes/edges that must fail to disconnect the network (a measure of reliability)

adapted from Berger & Klöckner (NYU 2010)

Page 76: Massively Parallel Computing

Linear Array

• p vertices, p − 1 links
• Diameter = p − 1
• Degree = 2
• Bisection bandwidth = 1
• Node connectivity = 1, edge connectivity = 1

adapted from Berger & Klöckner (NYU 2010)

Page 77: Massively Parallel Computing

Ring topology

• diameter = p/2
• degree = 2
• bisection bandwidth = 2
• node connectivity = 2, edge connectivity = 2

adapted from Berger & Klöckner (NYU 2010)

Page 78: Massively Parallel Computing

Mesh topology

• diameter = 2(√p − 1); a 3D mesh has diameter 3(∛p − 1)
• degree = 4 (6 in 3D)
• bisection bandwidth = √p
• node connectivity = 2, edge connectivity = 2

Route along each dimension in turn

adapted from Berger & Klöckner (NYU 2010)

Page 79: Massively Parallel Computing

Torus topology

Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled compared to the mesh

adapted from Berger & Klöckner (NYU 2010)

Page 80: Massively Parallel Computing

Hypercube topology

[Figure: hypercubes of dimension 1, 2, 3, and 4, with nodes labelled by binary strings of length k.]

• p = 2^k processors, labelled with binary numbers of length k
• a k-dimensional cube is constructed from two (k − 1)-cubes
• connect corresponding processors if their labels differ in 1 bit

(Hamming distance d between two k-bit binary words = path of length d between the 2 nodes)

adapted from Berger & Klöckner (NYU 2010)

Page 81: Massively Parallel Computing

Hypercube topology

[Figure: the same hypercube diagrams as on the previous slide.]

• diameter = k (= log p)
• degree = k
• bisection bandwidth = p/2
• node connectivity = k, edge connectivity = k

adapted from Berger & Klöckner (NYU 2010)

Page 82: Massively Parallel Computing

Dynamic Networks

The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links.

Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.

• bus
• crossbar
• multistage network — e.g. butterfly, omega, baseline

adapted from Berger & Klöckner (NYU 2010)

Page 83: Massively Parallel Computing

Crossbar

[Figure: an n × m crossbar connecting processors P1 ... Pn to memories M1 ... Mm.]

• Connecting n inputs and m outputs takes nm switches (typically only used for small numbers of processors).

• At each switch you can either go straight or change direction.

• Diameter = 1, bisection bandwidth = p

adapted from Berger & Klöckner (NYU 2010)

Page 84: Massively Parallel Computing

Butterfly

[Figure: a 16 × 16 butterfly network, stages 0–3, rows labelled 000–111.]

For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switches

adapted from Berger & Klöckner (NYU 2010)

Page 85: Massively Parallel Computing

Fat tree

• Complete binary tree
• Processors at the leaves
• Increase links for higher bandwidth near the root

adapted from Berger & Klöckner (NYU 2010)

Page 86: Massively Parallel Computing

Current picture

• Old style: mapped algorithms to topologies

• New style: avoid topology-specific optimizations

• Want code that runs on next year’s machines too.

• Topology awareness in vendor MPI libraries?

• Software topology — ease of programming, but not used for performance?

adapted from Berger & Klöckner (NYU 2010)

Page 87: Massively Parallel Computing

Should we care ?

• Old school: map algorithms to specific topologies

• New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure...)

• Meta-programming / auto-tuning?

Page 88: Massively Parallel Computing

Top500 Interconnects

[Screenshot: “Interconnect Family Share Over Time”, TOP500 Supercomputing Sites, http://www.top500.org/overtime/list/35/connfam (retrieved 10/30/10).]

adapted from Berger & Klöckner (NYU 2010)

Page 89: Massively Parallel Computing

MPI References

• Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/

• Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum

• Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, Thakur

• Lots of other on-line tutorials, books, etc.

adapted from Berger & Klöckner (NYU 2010)

Page 90: Massively Parallel Computing

Ignite: Google Trends

http://www.youtube.com/watch?v=m0b-QX0JDXc

Page 91: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 92: Massively Parallel Computing

MPI with CUDA

MPI and CUDA almost orthogonal

Each node simply becomes faster

Problem matching MPI processes to GPUs

Use compute-exclusive mode on GPUs

Tell cluster environment to limit processes per node

Have to know your cluster documentation

slide by Richard Edgar
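A minimal sketch (not from the slides) of one naive way to pair MPI processes with GPUs when the cluster launches several processes per node; the modulo mapping is an illustrative assumption (compute-exclusive mode, as above, makes the assignment safer).

#include <cuda_runtime.h>
#include "mpi.h"

// After MPI::Init(): pick a GPU based on this process's rank.
// Assumes the scheduler places no more processes per node than there are GPUs.
int rank = MPI::COMM_WORLD.Get_rank();

int deviceCount = 0;
cudaGetDeviceCount( &deviceCount );
cudaSetDevice( rank % deviceCount );   // naive rank-to-GPU mapping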

Page 93: Massively Parallel Computing

Data Movement

Communication now very expensive

GPUs can only communicate via their hosts

Very laborious

Again: need to minimize communication

slide by Richard Edgar

Page 94: Massively Parallel Computing

MPI Summary

MPI provides cross-platform interprocess communication

Invariably available on computer clusters

Only need six basic commands to get started

Much more sophistication available

slide by Richard Edgar

Page 95: Massively Parallel Computing

Outline

1. The problem

2. Intro to MPI

3. MPI Basics

4. MPI+CUDA

5. Other approaches

Page 96: Massively Parallel Computing
Page 97: Massively Parallel Computing

ZeroMQ

• ‘messaging middleware’ ‘TCP on steroids’ ‘new layer on the networking stack’

• not a complete messaging system

• just a simple messaging library to be used programmatically.

• a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort

http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
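A minimal sketch (not from the slides) of the “pimped socket” style, using the plain libzmq C API (function names as in libzmq 3.x/4.x; the port and reply string are illustrative assumptions).

#include <zmq.h>
#include <stdio.h>

int main(void) {
    // Reply server: receive one request, send back a fixed answer.
    void *ctx  = zmq_ctx_new();
    void *sock = zmq_socket( ctx, ZMQ_REP );
    zmq_bind( sock, "tcp://*:5555" );

    char buf[64];
    int n = zmq_recv( sock, buf, sizeof(buf) - 1, 0 );   // blocks until a request arrives
    if ( n >= 0 ) {
        if ( n > (int)sizeof(buf) - 1 ) n = sizeof(buf) - 1;  // message may be truncated
        buf[n] = '\0';
        printf( "got: %s\n", buf );
        zmq_send( sock, "ok", 2, 0 );
    }

    zmq_close( sock );
    zmq_ctx_destroy( ctx );
    return 0;
}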

Page 100: Massively Parallel Computing

Demo: Why ZeroMQ ?

http://www.youtube.com/watch?v=_JCBphyciAs

Page 101: Massively Parallel Computing

MPI vs ZeroMQ ?

• MPI is a specification, ZeroMQ is an implementation.

• Design:

• MPI is designed for tightly-coupled compute clusters with fast and reliable networks.

• ZeroMQ is designed for large distributed systems (web-like).

• Fault tolerance:

• MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!).

• ZeroMQ is resilient to faults and network instability.

• ZeroMQ could be a good transport layer for an MPI-like implementation.

http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq

Page 102: Massively Parallel Computing

CUDASA

Fast Forward

Page 103: Massively Parallel Computing
Page 104: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%&#$"'()*"'#+,

!"#$%&'()*)++$+,-.'/0'1234'0/*')'-,%5+$'672'#/

-./(01%102.8+#,9672'-:-#$.-;<=>'?$-+)>'@8)&*/7+$">'AAA

31#4"56(01%102 672'B+8-#$*'$%C,*/%.$%#-D:*,%$#>'=%0,%,E)%&>'D:*,FG6>'AAA

1/%-,-#$%#'&$C$+/($*',%#$*0)B$

!)-,+:'$.H$&&$&'#/'#I$'1234'B/.(,+$'(*/B$--

1234J'1/.(8#$'2%,0,$&'3$C,B$'78 4*BI,#$B#8*$'&'9(8:/#1;/(CUDASA: Computed Unified Device Systems Architecture

Page 105: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%&'()%*)+%,

!"#$%&'()*+,$%&'()-(./0)1$%&'()233456&.507

*&,56$(8(6+.507$+75.$9*:#;<$-.%'#+./0%'123'9=(>56(;;-.*-& ?4061,$57$3&)&44(4

-0$60@@+756&.507$?(./((7$;-.*-& ?4061,

#7@0=5A5(=$B#C2$3)0D)&@@57D$57.()A&6(-0$60=($@0=5A56&.507$)(E+5)(=

Page 106: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%&'()%*)+%,

!"#$%&'()*+,$%&'()-(./0)1$%&'()233456&.507

*8#9$-./'0'1./'0) -./'234"':;0,.<=7($*8#$(>+&4,$07($"=?@A$.;)(&B

(%#; C4061,$57$3&)&44(4244$C4061,$,;&)($60DD07$,',.(D$D(D0)'$:(>EF$G#H2$I40C&4$D(D0)'<J0)140&BKC&4&76(B$,6;(B+457I$0L$C4061,$.0$.;($*8#,

Page 107: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%&'()%*)+%,

!"#$%&'()*+,$%&'()-(./0)1$%&'()233456&.507

*8#9$#+-./%'0/1#$%*'23':70;(<=7($*8#$(>+&4,$07($?"@$3)06(,,

?"@$A)0+3$3)06(,,$;)< B4061,$57$3&)&44(4-0$57.)57,56$A40B&4$C(C0)'$ D5,.)5B+.(;$,E&)(;$C(C0)'$C&7&A(C(7.

F0)140&;GB&4&76(;$,6E(;+457A$0H$B4061,$.0$.E($*8#,

Page 108: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%&'()%*)+%,

!"#$%&'()*+,$%&'()-(./0)1$%&'()233456&.507

8(9+(7.5&4$&33456&.507$3)06(,,2):5.)&)'$;<;==$&33456&.507$60>(24406&.507<?(&4406&.507$0@$>5,.)5:+.(><,',.(A$A(A0)'B,,+($0@$@+76.507$6&44,$07$7(./0)1<:+,$4(C(4

Page 109: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%"$&'()*&#+,-#+

!"#$%&'()#$*+#,%-.!)/0.12#'+3.+#-.1%4+#'5$1.6"7.89).:+2%7;5#54+:.1%'."6.%3'%#15"#1.6"7.+--5'5"#+:.+<1'7+$'5"#.:+2%71

8%#%7+:5=%.&7%1%#'.&7",7+445#,.&+7+-5,4;545$.89).5#'%76+$%.6"7.+::.#%>.?@)1

97",7+44+<5:5'2

A5--%#.B#-%7:25#,.4%$*+#5141.6"7.&+7+::%:514C$"44B#5$+'5"#

Page 110: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :

!"#$%"$&'()*&#+,-#+'./-#*01

!"#$%&'()*+',-.&/

!!"#$%&#!!'($)*'"+,-./0#$&12'3&4&561647'8999:';;'<=>?@=

!!A$B1!!'($)*'A+,-./9997'8 ;;'CDEF

999'

"+,-.'GGG'<"H'<%H'IB'JJJ/3&4&561647K

:

01&,234-5'(-564728"34-5

925,34-5':2"%464&8

!!1&BL!! ($)*'1+,-./0#$&12'3&4&561647'8 ;;'CDEF

A+,-./3&4&561647K

:

!!B6M,6-.6!! ($)*'B+,-./9997'8 ;;'NOO

1+,-.'GGG'<"'JJJ/3&4&561647K

:

!"#$%&'()*+!+',-.&/

;&5&8"%4<&.01&,234-5'

(-564728"34-5

=&>925,34-5':2"%464&8

Page 111: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%"$&'()*&#+,-#+'./-#*01

!"#$%&'$()* +,-)#./ 0*$.%*&1 23(1$4(*#

&--1('&$()*51&6.% !!"#$%#&'#!!

*.$7)%851&6.% !!()*!! !!&)+#!! ()*,+-./()*012

"3#51&6.% !!34"5!! !!6)"3!! 34"5,+-./34"5012

9:;51&6.% !!78)*48!! !!+#91'#!! 7:1+012./*8)'5,+-./*8)'5012./36:#4+,+-

+,-)#./5<3*'$()*#5&%.5&''.##("1.5<%)=5*.,$5>(?>.%5&"#$%&'$()*23(1$4(*#5&%.5&3$)=&$('&1165-%)-&?&$./5$)5&1153*/.%16(*?51&6.%#

Page 112: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( ::

!"#$%"%&'(')*&

!"#$%&'()%*+,+-

;)< ,#.((%#= /01'."2%#34.2'(567/(+'8"'/$(#'9(9:+;2:,.

<%/$+%*"$'.(/1,+'.(&'&:+-(=:+(#'$9:+;(2,-'+

>:&&:#(%#$'+=,0'(="#0$%:#/(?'@3@(,$:&%0(="#0$%:#/A

>7<BCB(>:&D%2'+

>:.'($+,#/2,$%:#(=+:&(>7<BCB(0:.'($:(>7<B(9%$1($1+',./EFG4

C'2=H0:#$,%#'.(D+'H0:&D%2'+($:(>7<B(0:&D%2'+(D+:0'//

5,/'.(:#(62/,(?>II(D,+/'+A(9%$1(,..'.(>7<BE>7<BCB(="#0$%:#,2%$-

J"22(,#,2-/%/(:=(/-#$,K(,#.(/'&,#$%0/(+'8"%+'.

Page 113: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%"%&'(')*&+,-./,0(1%2

!"#$%&'()*+'(,-#./&#,0*.1!!"#$%!!&'()*&"+,-./)-"&)0&12(#"&314&5&666&7

"89:*:1&$";,."&5

)-"&)<&12(#"&31<&

*)=>&"#$%?*@0&"#$%A)=<

7&B;#99:;!$";,."!"+,-.<

'()*&"+,-./B;#99:;!$",."!"+,-.&39#;#=4&5

)-"&)&C&9#;#=DE)<&12(#"&31&C&9#;#=DE1<

*)=>&"#$%?*@&C&9#;#=DE"#$%?*@<

*)=>&"#$%A)=&C&9#;#=DE"#$%A)=<

5&666&7

7

23456(,7-'#+()*$%#,08&'(/09.#,:-'

;/'-<+'=0.'+

>:0&,<0./

3-090.#&(=:.),0*.(8*+?

Page 114: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%"%&'(')*&+,-./,0(1%2

!"#$%&'()*+'(,-#./&#,0*.1

2'.'-#,'+()*+'(&#3*4,5*%3(64.),0*.(%#-#$','-/(0.,*(7-#%%'-(/,-4), 8 !"#$#9:*%4&#,'(/);'+4&'-(<4'4'(70,;(#&&(=&*)>/(*6(,;'((%#<9."=+ 8 %&#9?','-$0.'(=40&,@0./(6*-('#);(=&*)> 8 '()*+,-"#'()*%!. 9

A#>'(4%(B!C(7*->'-(,;-'#+/(6-*$(,;'(,;-'#+(%**&D+&'(B!C/(-'<4'/,(.'",(%'.+0.E(=&*)>(6-*$(<4'4'

A#0,(6*-(#&&(=&*)>/(,*(='(%-*)'//'+D//40.E(#((%#<9."= 0/(#(=&*)>0.E()#&&

'/012#333#%&#4445!"#$67

Page 115: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%"%&'(')*&+,-%'.*/0,1(2%/

!"#$%&'()*+(&,")-)(+".%&"%/0*%+(1$'%0*,).%234%,)&$'5(6$789*%'0)%6":;,+$<&,:$%.$)$'(&$#%$=$)&%+"";%>$?=@%&"%&A'$(#%;""+B-;;+,6(&,")%,**0$*%/'"(#6(*&%:$**(.$*%&"

,**0$%$C$60&,")%<)=+8$*/(")*#;$'5"':%*A('$#%#,*&',/0&$#%:$:"'1%";$'(&,")*

Page 116: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%"%&'(')*&+,-%'.*/0,1(2%/

!"#$%&'&()*$(+,*%&'-%-.$/01#+2%'3.-4,*#*(.1)'%53%%&(16')/)*%-'-%-.$/'.7')(162%'1.&%

0#3"'32,)*%$'1.&%'&%&(3#*%)'4#$*'.7')/)*%-'-%-.$/'7.$'8!9:.1*(1,.,)';($*,#2'#&&$%))'$#16%

<1*%$7#3%';(#'!"#$%$&$''(!)*!"#$%$&+,!-.)*!"#$%$/0++0;=>'*.':?8@'62.+#2'-%-.$/'-#1#6%-%1*

<-42%-%1*%&',)(16'9A<'B%-.*%'9%-.$/'@33%))'CB9@DE.'6,#$#1*(%)'7.$'3.13,$$%1*'1.1F#*.-(3'-%-.$/'#33%))%)'C#)'(1':?8@D:?8@!@'#*.-(3'7,13*(.1)'7.$'8!9

Page 117: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%&#'()$#(*+,+%%"%-#.

!"#$%&'#"%()%*+,-%."/"0'1%2'$034%251$3617%8-9:;;<*531=%>/%$>6%>?%@A*+,-%-9:;;%13B0'07%?5/&$3>/*1>&CDB'#"=%#5BD2'$034%60>&"##3/.%'6613"=%>/%'11%1"E"1#%>?%6'0'11"13#2

@FA)

,;G%H6$"0>/%IJKL%I4I%&>0"#

M/$"1%NOOKKL%P%&>0"#

9FA)

QRMGM,%N5'=0>%STUOKK

QRMGM,%VVKK9!T%A1$0'

I%&'0=#% 8(OW(O%1'/"#<

QRMGM,%VVKK9!

I%&'0=#% 8(OWP%1'/"#<X%&'0=#% 8(OWPWP%1'/"#<P%&'0=#% 8(OWPWPWP%1'/"#<

Page 118: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%&#'()$#(*+,+%%"%-#.(/012&34

!"#$%&'#"%()%*+,-./0,1"2%03%410,'1%511+256'$5061<=&"/"(+!"#">"&"(7+%*?+@*(".%?"%*/-+8).+1*(-.%/("6-+A&)>%&+1&&$<"*%(")*78%9'&:#,'&:"/%"$%'18;%(<<=

>'/$5$506%03%'11%#&"6"%"1"2"6$#%56%$'#?%'6@%?"/6"1%,10&?#A6530/2%@5/"&$506'1%/'@5'6&"%@5#$/5,+$506%BC(D%#'2.1"#E

F6%25115#"&06@#%30/%'%#5641"-60@"%2+1$5-G>A%#H#$"2%I5$:%30+/%JKF9FL%DD<<G!#

J+2,"/%03%#&"6"%"1"2"6$#

C%G>A (%G>A# M%G>A#

N(=OD P(O (ON C(Q

CNC<=( C<N< P(< (OP

Page 119: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%&#'()"&*+,-(./,/%%"%0#1

!"##$%#$%&#'&()*+%,-.//%012'3%'(*4#"5%)6"677(72189(1*%&61(%:+

;%&701*("%'#<(1=%*4#%>>??-91%(6&@=%-236A2*%.*@("'(*

;B???; 86*"2&(1% 234(56%+7#<)= &#8)0*6*2#'+% :CD1%

<)= &#880'2&6*2#'+ ECF1

G23@%&#880'2&6*2#'%&#1*1

,2'37(%!H%42*@%E%-!I1%01(<%61%E%12'37(%-!I%&701*("%'#<(1

:?;E?; 86*"2&(1% 489(56%+7#(JA01%7(K(7%#'7L+%M:E%-N7#)1OP(Q02"(1%2'*("R)"#&(11%&#880'2&6*2#'

S,/%6&&(11(1%*65(%:CB%*28(1%7#'3("%*@6'%&#8)0*6*2#'

T#%646"('(11%#$%<6*6%7#&672*L

G23@%J0''(&(116"LO%&#880'2&6*2#'%#K("@(6<

Page 120: Massively Parallel Computing

!"#$%&"'%(")*+,-#-%./0+1*#("($(-+2!13435+ 4*"6-.#"(7+)8+3($((9%.( :;

!"#$%&'("#

!"#$%$&'()*+,-./,'*/'!"#$'0/1'+,234.,5'36-7,+*8/19':21244+4.-;<.,.;24'=>2,5+-'*/'/1.5.,24'42,5625+

?/8':1/512;;.,5'2,@'4+21,.,5'/A+1>+2@

B//@'-=24.,5'3+>2A./1'/,'36-'4+A+4(-:+=.244C'0/1'A+1C'4215+'*215+*'=/;:6*2*./,-

(2-C'*/'.,*+512*+'.,*/'*>+'!"#$'@+A+4/:;+,*':1/=+--

!611+,*':1/D+=*'-*2*+&()*+,-./,'0/1'!"#$%$'*/'2@@'2821+,+--'/0'@2*2'4/=24.*C

E@+2&'!24432=9';+=>2,.-;'.,'-<-/$(")*+/)*8"9$.%(")*<.,.;.F+'2;/6,*'/0'#%<'@2*2'*/'3+'=/;;6,.=2*+@$6*/;2*.=244C';29+'6-+'/0'2-C,=>1/,/6-'@2*2'*12,-0+1'*/'*>+'BG"-

G1+:212*./,-'0/1';29.,5'!"#$%$':634.=4C'2A2.4234+

Page 121: Massively Parallel Computing

MultiGPU MapReduce

Fast Forward

Page 122: Massively Parallel Computing
Page 123: Massively Parallel Computing

MapReduce

http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png

Page 124: Massively Parallel Computing

Why MapReduce?

• Simple programming model

• Parallel programming model

• Scalable

• Previous GPU work: neither multi-GPU nor out-of-core

Page 125: Massively Parallel Computing

Benchmarks — Which

• Matrix Multiplication (MM)

• Word Occurrence (WO)

• Sparse-Integer Occurrence (SIO)

• Linear Regression (LR)

• K-Means Clustering (KMC)

• (Volume Renderer — presented 90 minutes ago @ MapReduce ’10)

Page 126: Massively Parallel Computing

Benchmarks—Why

• Needed to stress aspects of GPMR

• Unbalanced work (WO)

• Multiple emits/Non-uniform number of emits (LR, KMC, WO)

• Sparsity of keys (SIO)

• Accumulation (WO, LR, KMC)

• Many key-value pairs (SIO)

• Compute Bound Scalability (MM)

Page 127: Massively Parallel Computing

Benchmarks—Results

Page 128: Massively Parallel Computing

Benchmarks—Results

TABLE 1: Dataset Sizes for all four benchmarks. We tested Phoenix against the first input set for SIO, KMC, LR, and the second set for WO. We test GPMR against all available input sets.

                                     MM                                 SIO               WO                                  KMC               LR
Input Element Size                   —                                  4 bytes           1 byte                              16 bytes          8 bytes
# Elems in first set (×10^6)         1024^2, 2048^2, 4096^2, 16384^2    1, 8, 32, 128     1, 16, 64, 512                      1, 8, 32, 512     1, 16, 64, 512
# Elems in second set (×10^6/GPU)    —                                  1, 2, 4, 8,       1, 2, 4, 8, 16, 32,                 1, 2, 4, 8,       1, 2, 4, 8, 16,
                                                                        16, 32            64, 128, 256                        16, 32            32, 64

TABLE 2: Speedup for GPMR over Phoenix (“vs. CPU”) on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024×1024 matrices).

                  MM        KMC      LR      SIO     WO
1-GPU Speedup     162.712   2.991    1.296   1.450   11.080
4-GPU Speedup     559.209   11.726   4.085   2.322   18.441

TABLE 3: Speedup for GPMR over Mars (“vs. GPU”) on 4096×4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.

                  MM       KMC       WO
1-GPU Speedup     2.695    37.344    3.098
4-GPU Speedup     10.760   129.425   11.709

Table 2 summarizes speedup results over Phoenix, while Table 3 gives speedup results of GPMR over Mars. Note that GPMR, even in the one-GPU configuration, is faster on all benchmarks than either Phoenix or Mars, and GPMR shows good scalability to four GPUs as well.

Source code size is another important metric. One significant benefit of MapReduce in general is its high level of abstraction: as a result, code sizes are small and development time is reduced, since the developer does not have to focus on the low-level details of communication and scheduling but instead on the algorithm. Table 4 shows the different number of lines required for each of three benchmarks implemented in Phoenix, Mars, and GPMR. We would also like to show developer time required to implement each benchmark for each platform, but neither Mars nor Phoenix published such information (and we wanted to use the applications provided so as not to introduce bias in Mars’s or Phoenix’s runtimes). As a frame of reference, the lead author of this paper implemented and tested MM in GPMR in three hours, SIO in half an hour, KMC in two hours, LR in two hours, and WO in four hours. KMC, LR, and WO were then later modified in about half an hour each to add Accumulation.

TABLE 4: Lines of source code for three common benchmarks written in Phoenix, Mars, and GPMR. We exclude setup code from all counts as it was roughly the same for all benchmarks and had little to do with the actual MapReduce code. For GPMR we included boilerplate code in the form of class header files and C++ wrapper functions that invoke CUDA kernels. If we excluded these files, GPMR’s totals would be even smaller. Also, WO is so large because of the hashing required in GPMR’s implementation.

            MM    KMC   WO
Phoenix     317   345   231
Mars        235   152   140
GPMR        214   129   397

Fig. 2: GPMR runtime breakdowns on our largest datasets. This figure shows how each application exhibits different runtime characteristics, and also how exhibited characteristics change as we increase the number of GPUs.

7 CONCLUSION

GPMR offers many benefits to MapReduce programmers. The most important is scalability. While it is unrealistic to expect perfect scalability from all but the most compute-bound tasks, GPMR’s minimal overhead and transfer costs position it well in comparison to other MapReduce implementations. GPMR also offers flexibility to developers in several areas, particularly when compared with Mars. GPMR allows flexible mappings between threads and keys and customization of the MapReduce pipeline with additional communication-reducing stages while still providing sensible default implementations. Our results demonstrate that even difficult applications that have not traditionally been addressed by GPUs can still show ...

Page 129: Massively Parallel Computing

Benchmarks - Results

Good

Page 130: Massively Parallel Computing

Benchmarks - Results

Good

Page 131: Massively Parallel Computing

Benchmarks - Results

Good

Page 132: Massively Parallel Computing

iPhD — one more thing, or two...

Page 133: Massively Parallel Computing

Life/Code Hacking #3: The Pomodoro Technique

Page 134: Massively Parallel Computing

http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions

Life/Code Hacking #3: The Pomodoro Technique

Page 136: Massively Parallel Computing

COME