Advanced Features of MPI - PRACE Research
TRANSCRIPT
2
When not to use Collective Operations
• Sequences of collective communication can be pipelined for better efficiency
• Example: Processor 0 reads data from a file and broadcasts it to all other processes:

    Do i=1,m
       if (rank .eq. 0) read *, a
       call MPI_Bcast( a, n, MPI_INTEGER, 0, comm, ierr )
    EndDo

• Takes m n log p time.
• It can be done in (m+p) n time!
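To make the cost claim concrete, here is a back-of-the-envelope check (a sketch: the log p factor assumes a tree-based broadcast, message time is proportional to n, and constants are ignored):

```python
import math

def bcast_loop_time(m, n, p):
    # m broadcasts, each a tree broadcast over p processes: m * n * log2(p)
    return m * n * math.log2(p)

def pipeline_time(m, n, p):
    # pipelined point-to-point: p-1 steps to fill the pipe, then one
    # message per iteration: roughly (m + p) * n
    return (m + p) * n

m, n, p = 100, 1, 64
print(bcast_loop_time(m, n, p))  # 600.0
print(pipeline_time(m, n, p))    # 164
```

For m much larger than p, the pipelined version wins by roughly a factor of log p.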
3
Pipeline the Messages
• Process 0 reads data from a file and sends it to the next process. Others forward the data:

    Do i=1,m
       if (rank .eq. 0) then
          read *, a
          call MPI_Send( a, n, type, 1, 0, comm, ierr )
       else
          call MPI_Recv( a, n, type, rank-1, 0, comm, status, ierr )
          ! next = rank+1; the last process does not forward
          if (rank .lt. p-1) call MPI_Send( a, n, type, next, 0, comm, ierr )
       endif
    EndDo
4
Concurrency between Steps
• [Timeline diagram comparing the broadcast and pipeline versions]
• Another example of deferring synchronization
• Each broadcast takes less time than the pipeline version, but the total time is longer
5
Timing MPI Programs
• The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "time is %f\n", t2 - t1 );

• The value returned by a single call to MPI_Wtime has little value by itself.
• Times are in general local, but an implementation might offer synchronized times.
• For advanced users: see the attribute MPI_WTIME_IS_GLOBAL.
Sample Timing Harness
• Average times; make several trials:

    t1 = MPI_Wtime();
    for (i = 0; i < maxloop; i++) {
       <operation to be timed>
    }
    time = (MPI_Wtime() - t1) / maxloop;

• Use MPI_Wtick to discover the clock resolution
• Use getrusage (Unix) to measure other effects (e.g., context switches, paging)
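The same averaging pattern can be sketched in plain Python, with time.perf_counter standing in for MPI_Wtime (a sketch of the harness structure, not MPI code):

```python
import time

def time_operation(op, maxloop=1000):
    """Average the cost of op() over maxloop trials, MPI_Wtime-style."""
    t1 = time.perf_counter()
    for _ in range(maxloop):
        op()                      # <operation to be timed>
    return (time.perf_counter() - t1) / maxloop

avg = time_operation(lambda: sum(range(100)))
print(avg >= 0.0)  # True
```

Averaging over many iterations matters because a single interval may be shorter than the clock resolution (which is what MPI_Wtick reports).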
7
MPI Profiling Interface (PMPI)
• PMPI allows selective replacement of MPI routines at link time (no need to recompile)
• Every MPI function also exists under the name PMPI_
• Often implemented using weak symbols – a profiling library need not duplicate the whole MPI library
8
[Diagram: the user program calls MPI_Send and MPI_Bcast. The profiling library intercepts MPI_Send, does its bookkeeping, and calls PMPI_Send in the MPI library; MPI_Bcast, which is not profiled, goes straight to the MPI library.]
9
Example Use of Profiling Interface
static int nsend = 0;

int MPI_Send( void *start, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm )
{
    nsend++;
    return PMPI_Send( start, count, datatype, dest, tag, comm );
}
10
Finding Unsafe uses of MPI_Send
    subroutine MPI_Send( start, count, datatype, dest, tag, comm, ierr )
    integer start(*), count, datatype, dest, tag, comm, ierr
    call PMPI_Ssend( start, count, datatype, dest, tag, comm, ierr )
    end
• MPI_Ssend will not complete until the matching receive starts
• MPI_Send can be implemented as MPI_Ssend
• At some value of count, MPI_Send will act like MPI_Ssend (or fail)
11
Finding Unsafe MPI_Send II
• Have the application generate a message about unsafe uses of MPI_Send
• Hint: use MPI_Issend
• C users can use __FILE__ and __LINE__
  • sometimes possible in Fortran (.F files)
12
Reporting on Unsafe MPI_Send
    subroutine MPI_Send( start, count, datatype, dest, tag, comm, ierr )
    include 'mpif.h'
    integer start(*), count, datatype, dest, tag, comm, ierr
    integer request, status(MPI_STATUS_SIZE)
    double precision t1
    logical flag

    call PMPI_Issend( start, count, datatype, dest, tag, comm, request, ierr )
    flag = .false.
    t1 = MPI_Wtime()
    do while (.not. flag .and. t1 + 10 .gt. MPI_Wtime())
       call PMPI_Test( request, flag, status, ierr )
    enddo
    if (.not. flag) then
       print *, 'MPI_Send appears to be hanging'
       call MPI_Abort( MPI_COMM_WORLD, 1, ierr )
    endif
    end
13
Defining Your Own Communicators
• MPI has a large number of routines for manipulating groups and defining communicators
• All you need is MPI_Comm_split
  • MPI_Comm_split( old_comm, color, key, &new_comm )
  • Splits old_comm into several new communicators
  • Each new_comm contains those processes of old_comm that specified the same value of color
  • Ranking in new_comm is controlled by key
• MPI_Comm_dup creates a duplicate of the input communicator
  • The duplicate has its own "context" – a safe communication space
  • Libraries should use MPI_Comm_dup to get a private communicator
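The grouping rule of MPI_Comm_split can be modeled without MPI at all: collect processes by color, then order each group by (key, old rank). This hypothetical Python sketch illustrates the semantics, not the MPI API:

```python
def comm_split(ranks_colors_keys):
    """ranks_colors_keys: list of (old_rank, color, key) tuples.
    Returns {color: [old_rank, ...]}, where a process's new rank is
    its position in the list for its color."""
    groups = {}
    for old_rank, color, key in ranks_colors_keys:
        groups.setdefault(color, []).append((key, old_rank))
    # within a group, new ranks are ordered by key, ties broken by old rank
    return {c: [r for _, r in sorted(members)] for c, members in groups.items()}

# 4 processes split into even/odd groups, keyed by old rank
result = comm_split([(0, 0, 0), (1, 1, 1), (2, 0, 2), (3, 1, 3)])
print(result)  # {0: [0, 2], 1: [1, 3]}
```

Passing the old rank as the key (as above) preserves the relative ordering of processes inside each new communicator.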
14
Why Contexts?
• Parallel libraries require isolation of messages from one another and from the user that cannot be adequately handled by tags
• Consider the following examples
  • Sub1 and Sub2 are from different libraries:
    Sub1();
    Sub2();
  • Sub1a and Sub1b are from the same library:
    Sub1a();
    Sub2();
    Sub1b();
15
Correct Execution of Library Calls
[Timeline diagram: processes 0-2 first run Sub1, which uses Recv(any) and matching sends; all of Sub1's messages are delivered before Sub2 starts, so Sub2's sends and receives also match as intended.]
16
Incorrect Execution of Library Calls
[Timeline diagram: the same calls, but a delay lets a Recv(any) posted in Sub1 match a message that Sub2 intended for one of its own receives.]
Program hangs (Recv(1) never satisfied)
17
Correct Execution of Library Calls with Pending Communication
[Timeline diagram: Sub1a leaves communication (a Recv(any)) pending across the call to Sub2; the messages happen to match as intended, and Sub1b completes the pending communication correctly.]
18
Incorrect Execution of Library Calls with Pending Communication
[Timeline diagram: the same calls, but the pending Recv(any) from Sub1a matches a message sent by Sub2.]
Program runs – but with wrong data!
19
Datatypes in MPI
• MPI datatypes have two purposes:
  • Heterogeneity
  • Noncontiguous data
• Basic vs. derived datatypes:
  • Basic datatypes
  • Derived datatypes: contiguous, vector, indexed, hindexed, struct
20
Basic Datatypes in C

    MPI datatype          C datatype
    MPI_BYTE              (untyped bytes)
    MPI_CHAR              signed char
    MPI_DOUBLE            double
    MPI_FLOAT             float
    MPI_INT               int
    MPI_LONG              long
    MPI_LONG_DOUBLE       long double
    MPI_PACKED            (packed data)
    MPI_SHORT             short
    MPI_UNSIGNED_CHAR     unsigned char
    MPI_UNSIGNED          unsigned int
    MPI_UNSIGNED_LONG     unsigned long
    MPI_UNSIGNED_SHORT    unsigned short

Additional datatypes defined in MPI 2.2 correspond to C99 language types (int32_t, int64_t, etc.)
21
Typemaps in MPI

• In MPI, a datatype is represented as a typemap: a sequence of (basic type, displacement) pairs
• The extent of a datatype is the span from its lower bound (LB) to its upper bound (UB)
• An artificial extent can be set by using the MPI_UB and MPI_LB markers
• [Diagram: memory locations specified by a datatype, with LB and UB delimiting the extent]
22
Typemaps in MPI (cont.)
• Example:
  • (int,0),(char,4) is a typemap
  • The extent of this typemap is 5 (ignoring alignment padding)
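The extent rule can be checked in a few lines of Python (a sketch that ignores alignment padding and the MPI_LB/MPI_UB markers; the byte sizes assumed here are the usual ones):

```python
# assumed sizes of the basic types used in the example (in bytes)
SIZE = {"int": 4, "char": 1, "double": 8}

def extent(typemap):
    """typemap: list of (basic_type, displacement) pairs.
    Extent = upper bound - lower bound."""
    lb = min(d for _, d in typemap)
    ub = max(d + SIZE[t] for t, d in typemap)
    return ub - lb

print(extent([("int", 0), ("char", 4)]))  # 5
```

A real MPI implementation may round the extent up so that arrays of the type satisfy alignment requirements.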
23
VECTOR Datatype
[Figure: a 5×7 array of elements numbered 1-35; the vector datatype below selects one column – elements 1, 8, 15, 22, 29 – i.e., 5 blocks of 1 element with stride 7.]
MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)
MPI_Type_commit(&newtype)

Example: send one column of the array (5 blocks of 1 element, stride 7):

    MPI_Type_vector(5, 1, 7, MPI_DOUBLE, &newtype);
    MPI_Type_commit(&newtype);
    MPI_Send(buffer, 1, newtype, dest, tag, comm);
    MPI_Type_free(&newtype);
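The elements a vector type picks out can be enumerated directly. For count=5, blocklength=1, stride=7, these are the offsets of one column of the 5×7 array (a sketch in units of oldtype, not bytes):

```python
def vector_offsets(count, blocklength, stride):
    """Offsets (in units of oldtype) covered by an MPI_Type_vector-style type."""
    return [i * stride + j for i in range(count) for j in range(blocklength)]

offs = vector_offsets(5, 1, 7)
print(offs)  # [0, 7, 14, 21, 28]
# with the figure's 1-based numbering: elements 1, 8, 15, 22, 29
```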
24
CONTIGUOUS Datatype
• Assume an original datatype oldtype has typemap (double,0),(char,8); then

    MPI_Type_contiguous(3, oldtype, &newtype);

  creates a datatype newtype with typemap (assuming doubles must fall on 8-byte boundaries, so the extent of oldtype is 16):

    (double,0),(char,8),(double,16),(char,24),(double,32),(char,40)

• To actually send such data, use the sequence of calls:

    MPI_Type_contiguous(count, datatype, &newtype);
    MPI_Type_commit(&newtype);
    MPI_Send(buffer, 1, newtype, dest, tag, comm);
    MPI_Type_free(&newtype);

MPI_Type_contiguous(count, oldtype, &newtype)
MPI_Type_commit(&newtype)
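The resulting typemap can be derived mechanically: each replica of oldtype is shifted by one extent. A Python sketch (the extent is passed in explicitly; 16 assumes 8-byte alignment for double):

```python
def contiguous_typemap(count, old_typemap, old_extent):
    """Typemap of count concatenated replicas of old_typemap."""
    return [(t, d + i * old_extent)
            for i in range(count)
            for t, d in old_typemap]

old = [("double", 0), ("char", 8)]
print(contiguous_typemap(3, old, 16))
# [('double', 0), ('char', 8), ('double', 16), ('char', 24),
#  ('double', 32), ('char', 40)]
```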
25
INDEXED Datatype
• Assume an original datatype oldtype has typemap (double,0),(char,8) and extent 16. Let B=(3,1) and D=(4,0); then

    MPI_Type_indexed(2, B, D, oldtype, &newtype);

  creates a datatype newtype with typemap (the displacements D are in units of the oldtype extent):

    (double,64),(char,72),(double,80),(char,88),(double,96),(char,104),(double,0),(char,8)

MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, &newtype)
MPI_Type_commit(&newtype)
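Indexed works the same way, except that block i contributes blocklength[i] replicas starting at a displacement of D[i] extents. A sketch reproducing the example above:

```python
def indexed_typemap(blocklengths, displacements, old_typemap, old_extent):
    """Typemap of an MPI_Type_indexed-style type; displacements are in
    units of the oldtype extent."""
    out = []
    for blen, disp in zip(blocklengths, displacements):
        for i in range(blen):
            for t, d in old_typemap:
                out.append((t, d + (disp + i) * old_extent))
    return out

old = [("double", 0), ("char", 8)]
tm = indexed_typemap([3, 1], [4, 0], old, 16)
print(tm)
# [('double', 64), ('char', 72), ('double', 80), ('char', 88),
#  ('double', 96), ('char', 104), ('double', 0), ('char', 8)]
```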
26
Structure Datatype
MPI_Type_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, &newtype)
MPI_Type_commit(&newtype)

Example:

    struct {
        char display[50];
        int max;
        double xmin, ymin;
        double xmax, ymax;
        int width;
        int length;
    } cmdline;

    /* set up 4 blocks (plus a fifth for MPI_UB) */
    int blockcounts[5] = {50, 1, 4, 2, 1};
    MPI_Datatype types[5];
    MPI_Aint displs[5];
    MPI_Datatype cmdtype;

    MPI_Address(&cmdline.display, &displs[0]);
    MPI_Address(&cmdline.max, &displs[1]);
    MPI_Address(&cmdline.xmin, &displs[2]);
    MPI_Address(&cmdline.width, &displs[3]);
    MPI_Address(&cmdline + 1, &displs[4]);
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    types[4] = MPI_UB;
    for (i = 4; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct(5, blockcounts, displs, types, &cmdtype);
    MPI_Type_commit(&cmdtype);
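The displacements in the example can be sanity-checked with Python's ctypes, which applies the same kind of alignment rules a C compiler does (exact offsets are platform-dependent; on a typical LP64 system doubles are aligned to 8 bytes):

```python
import ctypes

class Cmdline(ctypes.Structure):
    # mirrors the C struct from the example
    _fields_ = [("display", ctypes.c_char * 50),
                ("max", ctypes.c_int),
                ("xmin", ctypes.c_double), ("ymin", ctypes.c_double),
                ("xmax", ctypes.c_double), ("ymax", ctypes.c_double),
                ("width", ctypes.c_int),
                ("length", ctypes.c_int)]

# one displacement per block, relative to the start of the struct;
# the last entry plays the role of the MPI_UB marker (&cmdline + 1)
displs = [Cmdline.display.offset, Cmdline.max.offset,
          Cmdline.xmin.offset, Cmdline.width.offset,
          ctypes.sizeof(Cmdline)]
print(displs)
```

Note the padding between the char[50] block and the int, and possibly before the doubles: this is exactly why displacements must be measured with MPI_Address (or ctypes offsets) rather than computed by hand.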
27
Cartesian Topology
• MPI lets the user specify various application topologies
• A Cartesian topology is a mesh
• Example: 3×4 Cartesian mesh with arrows pointing at the right neighbors:
(0,2) (1,2) (2,2) (3,2)
(0,1) (1,1) (2,1) (3,1)
(0,0) (1,0) (2,0) (3,0)
28
Defining a Cartesian Topology
• The routine MPI_Cart_create() creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument
• This creates a new communicator with the same processes as the input communicator, but with the specified topology
    dims[0] = 4; dims[1] = 3;
    periods[0] = 0; periods[1] = 0;   /* specify if wraparound */
    reorder = 0;
    ndim = 2;
    MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, &comm2d);
29
Finding Neighbors
• The question "who are my neighbors?" can be answered with MPI_Cart_shift:

    MPI_Cart_shift( comm2d, 1, 1, &nbrleft, &nbrright );
    MPI_Cart_shift( comm2d, 0, 1, &nbrbottom, &nbrtop );

  The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by 1 in the two dimensions.
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
30
Periodic vs. Nonperiodic Grids
• Who are my neighbors if I am at the edge of a Cartesian mesh?
• In the nonperiodic case, a neighbor may not exist. This is indicated by a rank of MPI_PROC_NULL.
[Diagram: nonperiodic grid (edge processes have MPI_PROC_NULL neighbors) vs. periodic grid (edges wrap around)]
31
Who Am I?
• This question can be answered with:
    int coords[2];
    MPI_Comm_rank( comm2d, &myrank );
    MPI_Cart_coords( comm2d, myrank, 2, coords );

  This returns the Cartesian coordinates of the calling process in coords.
int MPI_Cart_coords ( MPI_Comm comm, int rank, int maxdims, int *coords )
32
Partitioning
• When creating a Cartesian topology, one question is "What is a good choice for the decomposition of the processors?"
• This question can be answered with MPI_Dims_create:
    int dims[2] = {0, 0};
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
    MPI_Dims_create( nprocs, 2, dims );

  e.g., MPI_Dims_create(6, 2, dims) returns dims = (3,2)
int MPI_Dims_create(int nnodes, int ndims, int *dims)
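The balanced factorization that MPI_Dims_create computes can be sketched for the 2-D case (MPI returns dims in non-increasing order; a zero entry in dims means "choose for me"):

```python
import math

def dims_create_2d(nnodes):
    """Most balanced 2-D factorization of nnodes, largest dimension first."""
    for a in range(math.isqrt(nnodes), 0, -1):
        if nnodes % a == 0:
            return [nnodes // a, a]

print(dims_create_2d(6))   # [3, 2]
print(dims_create_2d(16))  # [4, 4]
```

Starting the search at the integer square root guarantees the two factors are as close to each other as possible.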
33
Other Topology Routines
• MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.
• MPI_Graph_create allows the creation of a general graph topology
• MPI_Dist_graph_create is a more scalable version defined in MPI 2.2
• In summary, all these routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect
34
Error Handling
• By default, an error causes all processes to abort
• The user can cause routines to return (with an error code) instead
  • MPI_Comm_set_errhandler
• A user can also write and install custom error handlers
• Libraries can handle errors differently from applications
  • MPI provides a way for each library to have its own error handler without changing the default behavior for other libraries or for the user's code
• MPI_Error_string() can be used to convert an error code into a string that can be printed
35
MPI-2
Same process of definition by the MPI Forum. MPI-2 is an extension of MPI:
– Extends the message-passing model
  • Parallel I/O
  • Remote memory operations (one-sided)
  • Dynamic process management
– Adds other functionality
  • C++ and Fortran 90 bindings – similar to the original C and Fortran-77 bindings
  • External interfaces
  • Language interoperability
  • MPI interaction with threads
36
MPI-2 Implementation Status
Most parallel computer vendors now support MPI-2 on their machines
– Except in some cases the dynamic process management functions, which require interaction with other system software
Cluster MPIs, such as MPICH2 and Open MPI, support most of MPI-2, including dynamic process management
38
MPI and Threads
MPI describes parallelism between processes (with separate address spaces)
Thread parallelism provides a shared-memory model within a process
OpenMP and Pthreads are common models
– OpenMP provides convenient features for loop-level parallelism. Threads are created and managed by the compiler, based on user directives.
– Pthreads provides more complex and dynamic approaches. Threads are created and managed explicitly by the user.
39
Programming for Multicore
Almost all chips are multicore these days
Today's clusters often comprise multiple CPUs per node sharing memory; the nodes themselves are connected by a network
Common options for programming such clusters:
– All MPI
  • Use MPI to communicate between processes both within a node and across nodes
  • The MPI implementation internally uses shared memory to communicate within a node
– MPI + OpenMP
  • Use OpenMP within a node and MPI across nodes
– MPI + Pthreads
  • Use Pthreads within a node and MPI across nodes
The latter two approaches are known as "hybrid programming"
40
MPI’s Four Levels of Thread Safety
MPI defines four levels of thread safety. These are in the form of commitments the application makes to the MPI implementation.
– MPI_THREAD_SINGLE: only one thread exists in the application
– MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls (the one that called MPI_Init or MPI_Init_thread)
– MPI_THREAD_SERIALIZED: multithreaded, but only one thread at a time makes MPI calls
– MPI_THREAD_MULTIPLE: multithreaded and any thread can make MPI calls at any time (with some restrictions to avoid races – see next slide)
MPI defines an alternative to MPI_Init – MPI_Init_thread(requested, provided)
• Application indicates what level it needs; MPI implementation returns the level it supports