Advanced Features of MPI - PRACE Research
TRANSCRIPT
2
When not to use Collective Operations
• Sequences of collective communication can be pipelined for better efficiency
• Example: Processor 0 reads data from a file and broadcasts it to all other processes:

    Do i=1,m
       if (rank .eq. 0) read *, a
       call MPI_Bcast( a, n, MPI_INTEGER, 0, comm, ierr )
    EndDo

• Takes m n log p time.
• It can be done in (m+p) n time!
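To make the cost claim concrete, here is a back-of-the-envelope check (a sketch: the log p factor assumes a tree-based broadcast, message time is proportional to n, and constants are ignored):

```python
import math

def bcast_loop_time(m, n, p):
    # m broadcasts, each a tree broadcast over p processes: m * n * log2(p)
    return m * n * math.log2(p)

def pipeline_time(m, n, p):
    # pipelined point-to-point: p-1 steps to fill the pipe, then one
    # message per iteration: roughly (m + p) * n
    return (m + p) * n

m, n, p = 100, 1, 64
print(bcast_loop_time(m, n, p))  # 600.0
print(pipeline_time(m, n, p))    # 164
```

For m much larger than p, the pipelined version wins by roughly a factor of log p.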
3
Pipeline the Messages
• Process 0 reads data from a file and sends it to the next process. Others forward the data:

    Do i=1,m
       if (rank .eq. 0) then
          read *, a
          call MPI_Send( a, n, type, 1, 0, comm, ierr )
       else
          call MPI_Recv( a, n, type, rank-1, 0, comm, status, ierr )
          ! next = rank+1; the last process does not forward
          if (rank .lt. p-1) call MPI_Send( a, n, type, next, 0, comm, ierr )
       endif
    EndDo
4
Concurrency between Steps
• [Timeline diagram comparing the broadcast and pipeline versions]
• Another example of deferring synchronization
• Each broadcast takes less time than the pipeline version, but the total time is longer
5
Timing MPI Programs
• The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "time is %f\n", t2 - t1 );

• The value returned by a single call to MPI_Wtime has little value by itself.
• Times are in general local, but an implementation might offer synchronized times.
• For advanced users: see the attribute MPI_WTIME_IS_GLOBAL.
Sample Timing Harness
• Average times; make several trials:

    t1 = MPI_Wtime();
    for (i = 0; i < maxloop; i++) {
       <operation to be timed>
    }
    time = (MPI_Wtime() - t1) / maxloop;

• Use MPI_Wtick to discover the clock resolution
• Use getrusage (Unix) to measure other effects (e.g., context switches, paging)
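The same averaging pattern can be sketched in plain Python, with time.perf_counter standing in for MPI_Wtime (a sketch of the harness structure, not MPI code):

```python
import time

def time_operation(op, maxloop=1000):
    """Average the cost of op() over maxloop trials, MPI_Wtime-style."""
    t1 = time.perf_counter()
    for _ in range(maxloop):
        op()                      # <operation to be timed>
    return (time.perf_counter() - t1) / maxloop

avg = time_operation(lambda: sum(range(100)))
print(avg >= 0.0)  # True
```

Averaging over many iterations matters because a single interval may be shorter than the clock resolution (which is what MPI_Wtick reports).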
7
MPI Profiling Interface (PMPI)
• PMPI allows selective replacement of MPI routines at link time (no need to recompile)
• Every MPI function also exists under the name PMPI_
• Often implemented using weak symbols – a profiling library need not duplicate the whole MPI library
8
[Diagram: the user program calls MPI_Send and MPI_Bcast. The profiling library intercepts MPI_Send, does its bookkeeping, and calls PMPI_Send in the MPI library; MPI_Bcast, which is not profiled, goes straight to the MPI library.]
9
Example Use of Profiling Interface
static int nsend = 0;

int MPI_Send( void *start, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm )
{
    nsend++;
    return PMPI_Send( start, count, datatype, dest, tag, comm );
}
10
Finding Unsafe uses of MPI_Send
    subroutine MPI_Send( start, count, datatype, dest, tag, comm, ierr )
    integer start(*), count, datatype, dest, tag, comm, ierr
    call PMPI_Ssend( start, count, datatype, dest, tag, comm, ierr )
    end
• MPI_Ssend will not complete until the matching receive starts
• MPI_Send can be implemented as MPI_Ssend
• At some value of count, MPI_Send will act like MPI_Ssend (or fail)
11
Finding Unsafe MPI_Send II
• Have the application generate a message about unsafe uses of MPI_Send
• Hint: use MPI_Issend
• C users can use __FILE__ and __LINE__
  • sometimes possible in Fortran (.F files)
12
Reporting on Unsafe MPI_Send
    subroutine MPI_Send( start, count, datatype, dest, tag, comm, ierr )
    include 'mpif.h'
    integer start(*), count, datatype, dest, tag, comm, ierr
    integer request, status(MPI_STATUS_SIZE)
    double precision t1
    logical flag

    call PMPI_Issend( start, count, datatype, dest, tag, comm, request, ierr )
    flag = .false.
    t1 = MPI_Wtime()
    do while (.not. flag .and. t1 + 10 .gt. MPI_Wtime())
       call PMPI_Test( request, flag, status, ierr )
    enddo
    if (.not. flag) then
       print *, 'MPI_Send appears to be hanging'
       call MPI_Abort( MPI_COMM_WORLD, 1, ierr )
    endif
    end
13
Defining Your Own Communicators
• MPI has a large number of routines for manipulating groups and defining communicators
• All you need is MPI_Comm_split
  • MPI_Comm_split( old_comm, color, key, &new_comm )
  • Splits old_comm into several new communicators
  • Each new_comm contains those processes of old_comm that specified the same value of color
  • Ranking in new_comm is controlled by key
• MPI_Comm_dup creates a duplicate of the input communicator
  • The duplicate has its own "context" – a safe communication space
  • Libraries should use MPI_Comm_dup to get a private communicator
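The grouping rule of MPI_Comm_split can be modeled without MPI at all: collect processes by color, then order each group by (key, old rank). This hypothetical Python sketch illustrates the semantics, not the MPI API:

```python
def comm_split(ranks_colors_keys):
    """ranks_colors_keys: list of (old_rank, color, key) tuples.
    Returns {color: [old_rank, ...]}, where a process's new rank is
    its position in the list for its color."""
    groups = {}
    for old_rank, color, key in ranks_colors_keys:
        groups.setdefault(color, []).append((key, old_rank))
    # within a group, new ranks are ordered by key, ties broken by old rank
    return {c: [r for _, r in sorted(members)] for c, members in groups.items()}

# 4 processes split into even/odd groups, keyed by old rank
result = comm_split([(0, 0, 0), (1, 1, 1), (2, 0, 2), (3, 1, 3)])
print(result)  # {0: [0, 2], 1: [1, 3]}
```

Passing the old rank as the key (as above) preserves the relative ordering of processes inside each new communicator.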
14
Why Contexts?
• Parallel libraries require isolation of messages from one another and from the user that cannot be adequately handled by tags
• Consider the following examples
  • Sub1 and Sub2 are from different libraries:
    Sub1();
    Sub2();
  • Sub1a and Sub1b are from the same library:
    Sub1a();
    Sub2();
    Sub1b();
15
Correct Execution of Library Calls
[Timeline diagram: processes 0-2 first run Sub1, which uses Recv(any) and matching sends; all of Sub1's messages are delivered before Sub2 starts, so Sub2's sends and receives also match as intended.]
16
Incorrect Execution of Library Calls
[Timeline diagram: the same calls, but a delay lets a Recv(any) posted in Sub1 match a message that Sub2 intended for one of its own receives.]
Program hangs (Recv(1) never satisfied)
17
Correct Execution of Library Calls with Pending Communication
[Timeline diagram: Sub1a leaves communication (a Recv(any)) pending across the call to Sub2; the messages happen to match as intended, and Sub1b completes the pending communication correctly.]
18
Incorrect Execution of Library Calls with Pending Communication
[Timeline diagram: the same calls, but the pending Recv(any) from Sub1a matches a message sent by Sub2.]
Program runs – but with wrong data!
19
Datatypes in MPI
• MPI datatypes have two purposes:
  • Heterogeneity
  • Noncontiguous data
• Basic vs. derived datatypes:
  • Basic datatypes
  • Derived datatypes: contiguous, vector, indexed, hindexed, struct
20
Basic Datatypes in C

    MPI datatype          C datatype
    MPI_BYTE              (untyped bytes)
    MPI_CHAR              signed char
    MPI_DOUBLE            double
    MPI_FLOAT             float
    MPI_INT               int
    MPI_LONG              long
    MPI_LONG_DOUBLE       long double
    MPI_PACKED            (packed data)
    MPI_SHORT             short
    MPI_UNSIGNED_CHAR     unsigned char
    MPI_UNSIGNED          unsigned int
    MPI_UNSIGNED_LONG     unsigned long
    MPI_UNSIGNED_SHORT    unsigned short

Additional datatypes defined in MPI 2.2 correspond to C99 language types (int32_t, int64_t, etc.)
21
Typemaps in MPI

• In MPI, a datatype is represented as a typemap: a sequence of (basic type, displacement) pairs
• The extent of a datatype is the span from its lower bound (LB) to its upper bound (UB)
• An artificial extent can be set by using the MPI_UB and MPI_LB markers
• [Diagram: memory locations specified by a datatype, with LB and UB delimiting the extent]
22
Typemaps in MPI (cont.)
• Example:
  • (int,0),(char,4) is a typemap
  • The extent of this typemap is 5 (ignoring alignment padding)
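The extent rule can be checked in a few lines of Python (a sketch that ignores alignment padding and the MPI_LB/MPI_UB markers; the byte sizes assumed here are the usual ones):

```python
# assumed sizes of the basic types used in the example (in bytes)
SIZE = {"int": 4, "char": 1, "double": 8}

def extent(typemap):
    """typemap: list of (basic_type, displacement) pairs.
    Extent = upper bound - lower bound."""
    lb = min(d for _, d in typemap)
    ub = max(d + SIZE[t] for t, d in typemap)
    return ub - lb

print(extent([("int", 0), ("char", 4)]))  # 5
```

A real MPI implementation may round the extent up so that arrays of the type satisfy alignment requirements.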
23
VECTOR Datatype
[Figure: a 5×7 array of elements numbered 1-35; the vector datatype below selects one column – elements 1, 8, 15, 22, 29 – i.e., 5 blocks of 1 element with stride 7.]
MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)
MPI_Type_commit(&newtype)

Example: send one column of the array (5 blocks of 1 element, stride 7):

    MPI_Type_vector(5, 1, 7, MPI_DOUBLE, &newtype);
    MPI_Type_commit(&newtype);
    MPI_Send(buffer, 1, newtype, dest, tag, comm);
    MPI_Type_free(&newtype);
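The elements a vector type picks out can be enumerated directly. For count=5, blocklength=1, stride=7, these are the offsets of one column of the 5×7 array (a sketch in units of oldtype, not bytes):

```python
def vector_offsets(count, blocklength, stride):
    """Offsets (in units of oldtype) covered by an MPI_Type_vector-style type."""
    return [i * stride + j for i in range(count) for j in range(blocklength)]

offs = vector_offsets(5, 1, 7)
print(offs)  # [0, 7, 14, 21, 28]
# with the figure's 1-based numbering: elements 1, 8, 15, 22, 29
```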
24
CONTIGUOUS Datatype
• Assume an original datatype oldtype has typemap (double,0),(char,8); then

    MPI_Type_contiguous(3, oldtype, &newtype);

  creates a datatype newtype with typemap (assuming doubles must fall on 8-byte boundaries, so the extent of oldtype is 16):

    (double,0),(char,8),(double,16),(char,24),(double,32),(char,40)

• To actually send such data, use the sequence of calls:

    MPI_Type_contiguous(count, datatype, &newtype);
    MPI_Type_commit(&newtype);
    MPI_Send(buffer, 1, newtype, dest, tag, comm);
    MPI_Type_free(&newtype);

MPI_Type_contiguous(count, oldtype, &newtype)
MPI_Type_commit(&newtype)
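The resulting typemap can be derived mechanically: each replica of oldtype is shifted by one extent. A Python sketch (the extent is passed in explicitly; 16 assumes 8-byte alignment for double):

```python
def contiguous_typemap(count, old_typemap, old_extent):
    """Typemap of count concatenated replicas of old_typemap."""
    return [(t, d + i * old_extent)
            for i in range(count)
            for t, d in old_typemap]

old = [("double", 0), ("char", 8)]
print(contiguous_typemap(3, old, 16))
# [('double', 0), ('char', 8), ('double', 16), ('char', 24),
#  ('double', 32), ('char', 40)]
```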
25
INDEXED Datatype
• Assume an original datatype oldtype has typemap (double,0),(char,8) and extent 16. Let B=(3,1) and D=(4,0); then

    MPI_Type_indexed(2, B, D, oldtype, &newtype);

  creates a datatype newtype with typemap (the displacements D are in units of the oldtype extent):

    (double,64),(char,72),(double,80),(char,88),(double,96),(char,104),(double,0),(char,8)

MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, &newtype)
MPI_Type_commit(&newtype)
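Indexed works the same way, except that block i contributes blocklength[i] replicas starting at a displacement of D[i] extents. A sketch reproducing the example above:

```python
def indexed_typemap(blocklengths, displacements, old_typemap, old_extent):
    """Typemap of an MPI_Type_indexed-style type; displacements are in
    units of the oldtype extent."""
    out = []
    for blen, disp in zip(blocklengths, displacements):
        for i in range(blen):
            for t, d in old_typemap:
                out.append((t, d + (disp + i) * old_extent))
    return out

old = [("double", 0), ("char", 8)]
tm = indexed_typemap([3, 1], [4, 0], old, 16)
print(tm)
# [('double', 64), ('char', 72), ('double', 80), ('char', 88),
#  ('double', 96), ('char', 104), ('double', 0), ('char', 8)]
```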
26
Structure Datatype
MPI_Type_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, &newtype)
MPI_Type_commit(&newtype)

Example:

    struct {
        char display[50];
        int max;
        double xmin, ymin;
        double xmax, ymax;
        int width;
        int length;
    } cmdline;

    /* set up 4 blocks (plus a fifth for MPI_UB) */
    int blockcounts[5] = {50, 1, 4, 2, 1};
    MPI_Datatype types[5];
    MPI_Aint displs[5];
    MPI_Datatype cmdtype;

    MPI_Address(&cmdline.display, &displs[0]);
    MPI_Address(&cmdline.max, &displs[1]);
    MPI_Address(&cmdline.xmin, &displs[2]);
    MPI_Address(&cmdline.width, &displs[3]);
    MPI_Address(&cmdline + 1, &displs[4]);
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    types[4] = MPI_UB;
    for (i = 4; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct(5, blockcounts, displs, types, &cmdtype);
    MPI_Type_commit(&cmdtype);
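The displacements in the example can be sanity-checked with Python's ctypes, which applies the same kind of alignment rules a C compiler does (exact offsets are platform-dependent; on a typical LP64 system doubles are aligned to 8 bytes):

```python
import ctypes

class Cmdline(ctypes.Structure):
    # mirrors the C struct from the example
    _fields_ = [("display", ctypes.c_char * 50),
                ("max", ctypes.c_int),
                ("xmin", ctypes.c_double), ("ymin", ctypes.c_double),
                ("xmax", ctypes.c_double), ("ymax", ctypes.c_double),
                ("width", ctypes.c_int),
                ("length", ctypes.c_int)]

# one displacement per block, relative to the start of the struct;
# the last entry plays the role of the MPI_UB marker (&cmdline + 1)
displs = [Cmdline.display.offset, Cmdline.max.offset,
          Cmdline.xmin.offset, Cmdline.width.offset,
          ctypes.sizeof(Cmdline)]
print(displs)
```

Note the padding between the char[50] block and the int, and possibly before the doubles: this is exactly why displacements must be measured with MPI_Address (or ctypes offsets) rather than computed by hand.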
27
Cartesian Topology
• MPI lets the user specify various application topologies
• A Cartesian topology is a mesh
• Example: 3×4 Cartesian mesh with arrows pointing at the right neighbors:
(0,2) (1,2) (2,2) (3,2)
(0,1) (1,1) (2,1) (3,1)
(0,0) (1,0) (2,0) (3,0)
28
Defining a Cartesian Topology
• The routine MPI_Cart_create() creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument
• This creates a new communicator with the same processes as the input communicator, but with the specified topology
    dims[0] = 4; dims[1] = 3;
    periods[0] = 0; periods[1] = 0;   /* specify if wraparound */
    reorder = 0;
    ndim = 2;
    MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, &comm2d);
29
Finding Neighbors
• The question "who are my neighbors?" can be answered with MPI_Cart_shift:

    MPI_Cart_shift( comm2d, 1, 1, &nbrleft, &nbrright );
    MPI_Cart_shift( comm2d, 0, 1, &nbrbottom, &nbrtop );

  The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by 1 in the two dimensions.
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
30
Periodic vs. Nonperiodic Grids
• Who are my neighbors if I am at the edge of a Cartesian mesh?
• In the nonperiodic case, a neighbor may not exist. This is indicated by a rank of MPI_PROC_NULL.
[Diagram: nonperiodic grid (edge processes have MPI_PROC_NULL neighbors) vs. periodic grid (edges wrap around)]
31
Who Am I?
• This question can be answered with:
    int coords[2];
    MPI_Comm_rank( comm2d, &myrank );
    MPI_Cart_coords( comm2d, myrank, 2, coords );

  This returns the Cartesian coordinates of the calling process in coords.
int MPI_Cart_coords ( MPI_Comm comm, int rank, int maxdims, int *coords )
32
Partitioning
• When creating a Cartesian topology, one question is "What is a good choice for the decomposition of the processors?"
• This question can be answered with MPI_Dims_create:
    int dims[2] = {0, 0};
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
    MPI_Dims_create( nprocs, 2, dims );

  e.g., MPI_Dims_create(6, 2, dims) returns dims = (3,2)
int MPI_Dims_create(int nnodes, int ndims, int *dims)
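The balanced factorization that MPI_Dims_create computes can be sketched for the 2-D case (MPI returns dims in non-increasing order; a zero entry in dims means "choose for me"):

```python
import math

def dims_create_2d(nnodes):
    """Most balanced 2-D factorization of nnodes, largest dimension first."""
    for a in range(math.isqrt(nnodes), 0, -1):
        if nnodes % a == 0:
            return [nnodes // a, a]

print(dims_create_2d(6))   # [3, 2]
print(dims_create_2d(16))  # [4, 4]
```

Starting the search at the integer square root guarantees the two factors are as close to each other as possible.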
33
Other Topology Routines
• MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.
• MPI_Graph_create allows the creation of a general graph topology
• MPI_Dist_graph_create is a more scalable version defined in MPI 2.2
• In summary, all these routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect
34
Error Handling
• By default, an error causes all processes to abort
• The user can cause routines to return (with an error code) instead
  • MPI_Comm_set_errhandler
• A user can also write and install custom error handlers
• Libraries can handle errors differently from applications
  • MPI provides a way for each library to have its own error handler without changing the default behavior for other libraries or for the user's code
• MPI_Error_string() can be used to convert an error code into a string that can be printed
35
MPI-2
Same process of definition by the MPI Forum. MPI-2 is an extension of MPI:
– Extends the message-passing model
  • Parallel I/O
  • Remote memory operations (one-sided)
  • Dynamic process management
– Adds other functionality
  • C++ and Fortran 90 bindings – similar to the original C and Fortran-77 bindings
  • External interfaces
  • Language interoperability
  • MPI interaction with threads
36
MPI-2 Implementation Status
Most parallel computer vendors now support MPI-2 on their machines
– Except in some cases the dynamic process management functions, which require interaction with other system software
Cluster MPIs, such as MPICH2 and Open MPI, support most of MPI-2, including dynamic process management
38
MPI and Threads
MPI describes parallelism between processes (with separate address spaces)
Thread parallelism provides a shared-memory model within a process
OpenMP and Pthreads are common models
– OpenMP provides convenient features for loop-level parallelism. Threads are created and managed by the compiler, based on user directives.
– Pthreads provides more complex and dynamic approaches. Threads are created and managed explicitly by the user.
39
Programming for Multicore
Almost all chips are multicore these days
Today's clusters often comprise multiple CPUs per node sharing memory; the nodes themselves are connected by a network
Common options for programming such clusters:
– All MPI
  • Use MPI to communicate between processes both within a node and across nodes
  • The MPI implementation internally uses shared memory to communicate within a node
– MPI + OpenMP
  • Use OpenMP within a node and MPI across nodes
– MPI + Pthreads
  • Use Pthreads within a node and MPI across nodes
The latter two approaches are known as "hybrid programming"
40
MPI’s Four Levels of Thread Safety
MPI defines four levels of thread safety. These are in the form of commitments the application makes to the MPI implementation.
– MPI_THREAD_SINGLE: only one thread exists in the application
– MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls (the one that called MPI_Init or MPI_Init_thread)
– MPI_THREAD_SERIALIZED: multithreaded, but only one thread at a time makes MPI calls
– MPI_THREAD_MULTIPLE: multithreaded and any thread can make MPI calls at any time (with some restrictions to avoid races – see next slide)
MPI defines an alternative to MPI_Init – MPI_Init_thread(requested, provided)
• Application indicates what level it needs; MPI implementation returns the level it supports