[harvard cs264] 07 - gpu cluster programming (mpi & zeromq)
Lecture #7: GPU Cluster Programming | March 8th, 2011
Nicolas Pinto (MIT, Harvard) pinto@mit.edu
Massively Parallel Computing (CS 264 / CSCI E-292)
Administrivia
• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11
• Project info: http://www.cs264.org/projects/projects.html
• Project ideas: http://forum.cs264.org/index.php?board=6.0
• Project proposal deadline: Fri 3/25/11 (but you should submit well before then so you can start working on it asap)
• Need a private repo for your project?
Let us know! Poll on the forum: http://forum.cs264.org/index.php?topic=228.0
Goodies
• Guest Lectures: 14 distinguished speakers
• Schedule updated (see website)
Goodies (cont’d)
• Amazon AWS free credits coming soon (only for students who completed HW0+1)
• It’s a more-than-$14,000 donation for the class!
• Special thanks: Kurt Messersmith @ Amazon
Goodies (cont’d)
• Best Project Prize: Tesla C2070 (Fermi) Board
• It’s a more-than-$4,000 donation for the class!
• Special thanks: David Luebke & Chandra Cheij @ NVIDIA
During this course, we’ll try to reuse existing material, adapted for CS264 ;-)
Today
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
Types of Parallelism
Some computations are ‘embarrassingly parallel’
Can do a lot of computation on minimal data
RC5, DES, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
Some Terminology
One way to classify machines distinguishes between:
• shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.
• distributed memory: private memory for each processor, accessible only by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[Figure: “distributed memory” (each processor P with its own memory M, linked by an interconnection network) vs. “shared memory” (processors P sharing memories M through an interconnection network)]
Hybrid approaches are increasingly common; today most machines are hybrid.
Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now ‘shared memory machines’
slide by Richard Edgar
Shared Memory Machines
NASA ‘Columbia’ Computer: up to 2048 cores in a single system
slide by Richard Edgar
Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don’t have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
Distributed Memory Clusters
NASA ‘Pleiades’ Cluster: 51,200 cores
slide by Richard Edgar
Distributed Memory Model
Communication is key issue
Each node has its own address space (exclusive access, no global memory)
Could use TCP/IP
Painfully low level
Solution: a communication protocol like message-passing (e.g. MPI)
slide by Richard Edgar
Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by working group at SC92
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
Means of sending messages is invisible to the programmer
Uses shared memory if available! (i.e. shared memory can be used behind the scenes on shared-memory architectures)
Sits at Layer 5 (Session) and above of the OSI model
slide by Richard Edgar
OSI Model ?
Message Passing Interface
MPI is a standard, a specification, for message-passing libraries
Two major implementations of MPI
MPICH
OpenMPI
Programs should work with either
slide by Richard Edgar
Basic Idea
• Usually programmed with the SPMD model (single program, multiple data)
• In MPI-1 the number of tasks is static: cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2.
• No assumptions on the type of interconnection network; all processors can send a message to any other processor.
• All parallelism is explicit: the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms
adapted from Berger & Klöckner (NYU 2010)
Credits: James Carr (OCI)
Hello World
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
adapted from Berger & Klöckner (NYU 2010)
Hello World
To compile: need to load the MPI wrappers in addition to the compiler modules (OpenMPI, MPICH, ...), e.g.
module load openmpi/intel/1.3.3
or
module load mpi/openmpi/1.2.8/gnu
To compile: mpicc hello.c
To run: need to tell how many processes you are requesting
mpiexec -n 10 a.out   (or: mpirun -np 10 a.out)
adapted from Berger & Klöckner (NYU 2010)
http://www.youtube.com/watch?v=pLqjQ55tz-U
The beauty of data visualization
Example: gprof2dot
“ They’ve done studies, you know. 60% of the time, it works every time... ”
- Brian Fantana (Anchorman, 2004)
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
A Minimal MPI Program
Very much a minimal program
No actual communication occurs
#include <iostream>
#include <cstdlib>
using namespace std;

#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    cout << "Hello World!" << endl;
    MPI::Finalize();
    return EXIT_SUCCESS;
}
slide by Richard Edgar
A Minimal MPI Program
To compile MPI programs use mpic++:
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for the default compiler
Adds in libraries
Use mpic++ --show to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
A Minimal MPI Program
To run the program, use mpirun:
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print “Hello World” twice
slide by Richard Edgar
Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
Communicators
Queries COMM_WORLD communicator for
Number of processes
Current process rank (ID)
Prints these out
Process rank counts from zero
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();
iMyProc = MPI::COMM_WORLD.Get_rank();
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;
MPI::Finalize();
slide by Richard Edgar
Communicators
By convention, the process with rank 0 is the master:
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
Messages
Haven’t sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count, const Datatype& datatype, int dest, int tag ) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
Receiving Messages
A similar call receives:
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );
The method prototype is
void Comm::Recv( void* buf, int count, const Datatype& datatype, int source, int tag ) const
Blocks until data arrives
Wildcards: MPI::ANY_SOURCE, MPI::ANY_TAG
slide by Richard Edgar
MPI Datatypes
MPI datatypes are independent of language and endianness. The most common are:

MPI Datatype   C/C++
MPI::CHAR      signed char
MPI::SHORT     signed short
MPI::INT       signed int
MPI::LONG      signed long
MPI::FLOAT     float
MPI::DOUBLE    double
MPI::BYTE      untyped byte data
slide by Richard Edgar
MPI Send & Receive
Master process sends out numbers
Worker processes print out number received
if( iMyProc == iMasterProc ) {
    for( int i = 1; i < nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage << " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag );
    }
} else {
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag );
    cout << "Process " << iMyProc << " received " << iMessage << endl;
}
slide by Richard Edgar
Six Basic MPI Routines
Have now encountered six MPI routines:
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank()
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
Barriers
Barriers require all processes to synchronise:
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
Broadcasts
Suppose one process has an array to be shared with all:
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
All will have a[10] identical to iSrcProc on completion
slide by Richard Edgar
MPI Broadcast
[Figure: Broadcast — before, only P0 holds A; after, P0–P3 all hold A]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
MPI Reduce
ABCDP0
P1
P2
P3
P0
P1
P2
P3
Reduce
A
B
C
D
Reduction operators can be min, max, sum, multiply, logicalops, max value and location ... Must be associative(commutative optional)
adapted from Berger & Klöckner (NYU 2010)
Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
MPI Scatter/Gather
[Figure: Scatter — P0 holds A, B, C, D and sends one piece to each of P0–P3; Gather is the inverse]
adapted from Berger & Klöckner (NYU 2010)
MPI Allgather
[Figure: Allgather — P0–P3 each contribute A, B, C, D; afterwards every process holds all of A, B, C, D]
adapted from Berger & Klöckner (NYU 2010)
MPI Alltoall
[Figure: Alltoall — P0–P3 start with rows A0–A3, B0–B3, C0–C3, D0–D3; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on (a transpose of the data across processes)]
adapted from Berger & Klöckner (NYU 2010)
Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Will give better performance
Trickier to use
slide by Richard Edgar
User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define own datatypes for this
slide by Richard Edgar
MPI-2
• One-sided RMA (remote memory access) communication
  • potential for greater efficiency, easier programming
  • uses “windows” into memory to expose regions for access
  • race conditions now possible
• Parallel I/O: like message passing, but to the file system rather than to other processes
• Allows for a dynamic number of processes and inter-communicators (as opposed to intra-communicators)
• Cleaned-up MPI-1
adapted from Berger & Klöckner (NYU 2010)
RMA
• Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate).
• RMA window objects are created by collective window-creation functions (MPI_Win_create must be called by all participants).
• Before accessing, call MPI_Win_fence (or other synchronization mechanisms) to start an RMA access epoch; a fence (like a barrier) separates local ops on the window from remote ops.
• RMA operations are non-blocking; separate synchronization is needed to check completion: call MPI_Win_fence again.
[Figure: a Put through an RMA window, from P0 local memory into P1 local memory]
adapted from Berger & Klöckner (NYU 2010)
Some MPI Bugs
Sample MPI bug #1 (code shown on slide): only works for an even number of processors. What’s wrong?
adapted from Berger & Klöckner (NYU 2010)
Sample MPI bug #2: suppose you have a local variable, e.g. energy, and you want to sum all the processors’ energy values to find the total energy of the system.
Recall:
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
Using the same variable for both buffers, as in
MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)
will bomb: MPI forbids aliasing the send and receive buffers. (MPI-2 added MPI_IN_PLACE, passed as sendbuf on the root, for exactly this case.)
adapted from Berger & Klöckner (NYU 2010)
Communication Topologies
Some topologies very common
Grid, hypercube etc.
API provided to set up communicators following these
slide by Richard Edgar
Parallel Performance
Recall Amdahl’s law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really:
T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
• Topology: the structure used to connect the nodes
• Routing algorithm: how messages are transmitted between processors, along which path (= the nodes along which the message is transferred)
• Switching strategy: how a message is cut into pieces and assigned a path
• Flow control (for dealing with congestion): stall, store data in buffers, re-route data, tell the source to halt, discard, etc.
adapted from Berger & Klöckner (NYU 2010)
Interconnection Network
Represent as a graph G = (V, E): V = set of nodes to be connected, E = direct links between the nodes. Links are usually bidirectional (transfer messages in both directions at the same time). Characterize a network by:
• diameter: maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission)
• degree: number of direct links at a node (number of direct neighbors)
• bisection bandwidth: minimum number of edges that must be removed to partition the network into two parts of equal size with no connection between them (measures network capacity for transmitting messages simultaneously)
• node/edge connectivity: number of nodes/edges that must fail to disconnect the network (a measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
Linear Array
• p vertices, p − 1 links
• Diameter = p − 1
• Degree = 2
• Bisection bandwidth = 1
• Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klöckner (NYU 2010)
Ring topology
• diameter = p/2
• degree = 2
• bisection bandwidth = 2
• node connectivity = 2, edge connectivity = 2
adapted from Berger & Klöckner (NYU 2010)
Mesh topology
• diameter = 2(√p − 1); for a 3-d mesh, 3(∛p − 1)
• degree = 4 (6 in 3-d)
• bisection bandwidth = √p
• node connectivity = 2, edge connectivity = 2
Route along each dimension in turn
adapted from Berger & Klöckner (NYU 2010)
Torus topology
Diameter halved, bisection bandwidth doubled, edge and node connectivity doubled over the mesh
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
[Figure: 1-, 2-, 3-, and 4-dimensional hypercubes with nodes labelled by binary strings]
• p = 2^k processors labelled with binary numbers of length k
• The k-dimensional cube is constructed from two (k − 1)-cubes
• Connect corresponding processors if their labels differ in 1 bit
(Hamming distance d between two k-bit binary words = a path of length d between the 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
• diameter = k (= log p)
• degree = k
• bisection bandwidth = p/2
• node connectivity = k, edge connectivity = k
adapted from Berger & Klöckner (NYU 2010)
Dynamic Networks
The networks above were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links.
Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.
• bus
• crossbar
• multistage network, e.g. butterfly, omega, baseline
adapted from Berger & Klöckner (NYU 2010)
Crossbar
[Figure: processors P1..Pn connected to memories M1..Mm through an n × m grid of switches]
• Connecting n inputs and m outputs takes nm switches (typically only for small numbers of processors)
• At each switch the connection can either go straight or change direction
• Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
Butterfly
[Figure: 16 × 16 butterfly network, stages 0–3, rows labelled 000–111]
For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switches
adapted from Berger & Klöckner (NYU 2010)
Fat tree
• Complete binary tree
• Processors at leaves
• Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
Current picture
• Old style: mapped algorithms to topologies
• New style: avoid topology-specific optimizations
• Want code that runs on next year’s machines too
• Topology awareness in vendor MPI libraries?
• Software topology: ease of programming, but not used for performance?
adapted from Berger & Klöckner (NYU 2010)
Should we care?
• Old school: map algorithms to specific topologies
• New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure...)
• Meta-programming / auto-tuning?
Top500 Interconnects
[Chart: “Interconnect Family Share Over Time”, top500.org (06/2010 list, page retrieved 10/30/10)]
adapted from Berger & Klöckner (NYU 2010)
MPI References
• Lawrence Livermore tutorial: https://computing.llnl.gov/tutorials/mpi/
• Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
• Using MPI-2: Advanced Features of the Message Passing Interface, by Gropp, Lusk, Thakur
• Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
Ignite: Google Trends
http://www.youtube.com/watch?v=m0b-QX0JDXc
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem matching MPI processes to GPUs
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
MPI Summary
MPI provides cross-platform interprocess communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
ZeroMQ
• ‘messaging middleware’, ‘TCP on steroids’, ‘a new layer on the networking stack’
• not a complete messaging system
• just a simple messaging library to be used programmatically
• a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
ZeroMQ
• Fastest. Messaging. Ever.
• Excellent documentation:
• examples
• white papers for everything
• Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
Message Patterns
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
Demo: Why ZeroMQ ?
http://www.youtube.com/watch?v=_JCBphyciAs
MPI vs ZeroMQ ?
• MPI is a specification, ZeroMQ is an implementation.
• Design:
• MPI is designed for tightly-coupled compute clusters with fast and reliable networks.
• ZeroMQ is designed for large distributed systems (web-like).
• Fault tolerance:
• MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!).
• ZeroMQ is resilient to faults and network instability.
• ZeroMQ could be a good transport layer for an MPI-like implementation.
http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
CUDASA
Fast Forward
CUDASA: Compute Unified Device Systems Architecture
Visualization Research Institute (VISUS), University of Stuttgart
[Slides: the CUDASA slide text was garbled by a font-encoding error during extraction. In outline, the slides presented: motivation (extending the CUDA programming model from a single GPU to multi-GPU systems and GPU clusters); the CUDASA layer model (bus, network, and application layers added on top of CUDA’s GPU layer, with minimally invasive language extensions embedded in the CUDA compile process); the implementation (a self-contained pre-compiler translating CUDASA code to CUDA plus threads/MPI); scaling results for multi-GPU matrix multiplication and parallel image rendering; and conclusions, including planned extensions for data-locality awareness and preparations for a public release.]
MultiGPU MapReduce
Fast Forward
MapReduce
http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
Why MapReduce?
• Simple programming model
• Parallel programming model
• Scalable
• Previous GPU work: neither multi-GPU nor out-of-core
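The "simple programming model" claim is easiest to see in code: the user supplies only a map and a reduce function, and the framework handles grouping by key. A single-process word-count sketch (no GPU involved; all names invented):

```python
# Minimal MapReduce-style word count: map emits (word, 1) pairs,
# the framework groups values by key, reduce sums each group.
from collections import defaultdict

def map_fn(doc):
    return [(w, 1) for w in doc.split()]

def reduce_fn(key, values):
    return key, sum(values)

def mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        for k, v in map_fn(doc):
            groups[k].append(v)            # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["to be or not to be"], map_fn, reduce_fn)
print(counts["to"], counts["be"])          # 2 2
```

Everything a multi-GPU runtime like GPMR adds (partitioning, transfers, scheduling) hides behind this same two-function interface.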
Benchmarks—Which
• Matrix Multiplication (MM)
• Word Occurrence (WO)
• Sparse-Integer Occurrence (SIO)
• Linear Regression (LR)
• K-Means Clustering (KMC)
• (Volume Renderer—presented 90 minutes ago @ MapReduce ’10)
Benchmarks—Why
• Needed to stress aspects of GPMR
• Unbalanced work (WO)
• Multiple emits/Non-uniform number of emits (LR, KMC, WO)
• Sparsity of keys (SIO)
• Accumulation (WO, LR, KMC)
• Many key-value pairs (SIO)
• Compute Bound Scalability (MM)
Benchmarks—Results
Benchmarks—Results
MM | SIO | WO | KMC | LR
Input Element Size: — | 4 bytes | 1 byte | 16 bytes | 8 bytes
# Elems in first set (×10^6): 1024^2, 2048^2, 4096^2, 16384^2 | 1, 8, 32, 128 | 1, 16, 64, 512 | 1, 8, 32, 512 | 1, 16, 64, 512
# Elems in second set (×10^6/GPU): — | 1, 2, 4, 8, 16, 32 | 1, 2, 4, 8, 16, 32, 64, 128, 256 | 1, 2, 4, 8, 16, 32 | 1, 2, 4, 8, 16, 32, 64
TABLE 1: Dataset sizes for all four benchmarks. We tested Phoenix against the first input set for SIO, KMC, LR, and the second set for WO. We test GPMR against all available input sets.
MM KMC LR SIO WO
1-GPU Speedup 162.712 2.991 1.296 1.450 11.080
4-GPU Speedup 559.209 11.726 4.085 2.322 18.441
TABLE 2: Speedup for GPMR over Phoenix on our large (second-
biggest) input data from our first set. The exception is MM, for which
we use our small input set (Phoenix required almost twenty seconds
to multiply two 1024×1024 matrices).
MM KMC WO
1-GPU Speedup 2.695 37.344 3.098
4-GPU Speedup 10.760 129.425 11.709
TABLE 3: Speedup for GPMR over Mars on 4096×4096 Matrix
Multiplication, an 8M-point K-Means Clustering, and a 512 MB
Word Occurrence. These sizes represent the largest problems that
can meet the in-core memory requirements of Mars.
Table 2 summarizes speedup results over Phoenix, while Table 3 gives
speedup results of GPMR over Mars. Note that GPMR, even
in the one-GPU configuration, is faster on all benchmarks than
either Phoenix or Mars, and GPMR shows good scalability to
four GPUs as well.
Source code size is another important metric. One signif-
icant benefit of MapReduce in general is its high level of
abstraction: as a result, code sizes are small and development
time is reduced, since the developer does not have to focus
on the low-level details of communication and scheduling but
instead on the algorithm. Table 4 shows the different number
of lines required for each of three benchmarks implemented
in Phoenix, Mars, and GPMR. We would also like to show
developer time required to implement each benchmark for
each platform, but neither Mars nor Phoenix published such
information (and we wanted to use the applications provided
so as not to introduce bias in Mars’s or Phoenix’s runtimes). As
a frame of reference, the lead author of this paper implemented
and tested MM in GPMR in three hours, SIO in half an hour,
KMC in two hours, LR in two hours, and WO in four hours.
KMC, LR, and WO were then later modified in about half an
hour each to add Accumulation.
7 CONCLUSION
GPMR offers many benefits to MapReduce programmers.
The most important is scalability. While it is unrealistic to
expect perfect scalability from all but the most compute-bound
tasks, GPMR’s minimal overhead and transfer costs position
it well in comparison to other MapReduce implementations.
MM KMC WO
Phoenix 317 345 231
Mars 235 152 140
GPMR 214 129 397
TABLE 4: Lines of source code for three common benchmarks
written in Phoenix, Mars, and GPMR. We exclude setup code from
all counts as it was roughly the same for all benchmarks and had
little to do with the actual MapReduce code. For GPMR we included
boilerplate code in the form of class header files and C++ wrapper
functions that invoke CUDA kernels. If we excluded these files,
GPMR’s totals would be even smaller. Also, WO is so large because
of the hashing required in GPMR’s implementation.
Fig. 2: GPMR runtime breakdowns on our largest datasets.
This figure shows how each application exhibits different runtime
characteristics, and also how exhibited characteristics change as we
increase the number of GPUs.
GPMR also offers flexibility to developers in several areas,
particularly when compared with Mars. GPMR allows flexible
mappings between threads and keys and customization of the
MapReduce pipeline with additional communication-reducing
stages while still providing sensible default implementations.
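One such communication-reducing stage can be sketched as a combiner: key-value pairs are partially reduced before they leave a partition, so far fewer pairs cross the network. This is an illustrative stand-in for the idea, not GPMR's actual implementation:

```python
# Combiner sketch: partially reduce (key, value) pairs locally within a
# partition before communication, shrinking the data that must be sent.
from collections import Counter

def combine(partition_pairs):
    c = Counter()
    for k, v in partition_pairs:
        c[k] += v                  # partial reduction, local to the partition
    return list(c.items())

pairs = [("a", 1)] * 1000 + [("b", 1)] * 500
combined = combine(pairs)
print(len(pairs), "->", len(combined))   # 1500 -> 2
```

Because addition is associative and commutative, the final reduce over combined pairs yields the same answer as reducing the raw pairs, which is what makes the stage a safe, optional addition to the pipeline.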
Our results demonstrate that even difficult applications that
have not traditionally been addressed by GPUs can still show
good performance and scalability when implemented in GPMR.
vs. CPU
vs. GPU
Benchmarks - Results
Good
iPhD: one more thing... or two...
Life/Code Hacking #3: The Pomodoro Technique
http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
Life/Code Hacking #3: The Pomodoro Technique
http://www.youtube.com/watch?v=QYyJZOHgpco