a very short introduction to mpi vincent keller, cadmos (with basile schaeli, epfl – i&c –...

A Very Short Introduction to MPI

Vincent Keller, CADMOS

(with Basile Schaeli, EPFL – I&C – LSP, 2008 - 2013)

MPI

• Run multiple instances of the same program– mpiexec -n p myApp args– Starts p instances of myApp

• Instances exchange information by sending messages to each other

• Communications take place within a communicator– Set of processes, indexed from 0 to communicatorSize-1– MPI_COMM_WORLD predefined communicator containing all

processes

• Compile with mpicc (C), mpicxx (C++) or mpif77/mpif90 (Fortran)– All are wrappers around regular compilers (try e.g. mpicc –show)

hello.c

int main(int argc, char *argv[]){

int size, rank;MPI_Init(&argc, &argv);MPI_Comm_size(MPI_COMM_WORLD, &size);MPI_Comm_rank(MPI_COMM_WORLD, &rank);

printf("I'm process %d out of %d\n", rank, size);

MPI_Finalize();}

Point-to-point communications

MPI_Send(buf, count, datatype, destination, tag, communicator)

MPI_Recv(buf, count, datatype, source, tag, communicator, status)

• Sends (receives) count elements of type datatype from (into) buffer buf.

• Each send must be matched by a receive!– You must know source, tag, communicator and size of incoming

message– If you don’t know, use MPI_Probe / MPI_Iprobe to get that

information

Mismatches cause race conditions or deadlocks

ping.c

int main(int argc, char *argv[]) {

int rank;

int buf[100];

MPI_Status status;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)

MPI_Send(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);

else

MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

MPI_Finalize();

}

Watch out for deadlocks!

send(1, 0) recv(0, 0)

Process 0 Process 1

• ping.c runs ok with 2 processes

• Deadlocks if more than 2

send(1, 0) recv(0, 0)

Process 0 Process 1 Process 2

finalize() finalize()

finalize() finalize()

recv(0, 0)

ping_correct.c

int main(int argc, char *argv[]) {

int rank;

int buf[100];

MPI_Status status;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)

MPI_Send(buf, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);

else if (rank == 1)

MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

MPI_Finalize();

}

Blocking communications

• MPI_Send– Returns when buffer can be reused

• MPI_Ssend– Returns when other end called matching recv

• MPI_Recv– Returns when message has been received

• MPI_Sendrecv– Sends and receive within the same call to avoid deadlocks

• See also:– MPI_Bsend, MPI_Rsend, MPI_Sendrecv_replace

Non-blocking communications

• MPI_Isend, MPI_Irecv– Do not wait for message to be buffered / sent / received– Fills an additional MPI_Request parameter that identifies the request

• Do not use / modify / delete buffers until request completed

• Wait calls block until request(s) completed

MPI_Wait(request, status)

MPI_Waitall(count, array_of_requests, array_of_statuses)

• See also:– MPI_Issend, MPI_Ibsend, MPI_Irsend, MPI_Waitsome, MPI_Waitany,

MPI_Test, MPI_Testall, MPI_Testany, MPI_Testsome

iping.c

int main(int argc, char *argv[]) {int rank;int buf[100];MPI_Request request;MPI_Status status;MPI_Init(&argc, &argv);MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)MPI_Isend(buf, 100, MPI_INT, 1, 0,

MPI_COMM_WORLD, &request);else if (rank == 1)

MPI_Irecv(buf, 100, MPI_INT, 0, 0,MPI_COMM_WORLD, &request);

MPI_Wait(&request, &status);

MPI_Finalize();}

exchange.c

if (rank == 0) {MPI_Isend(buf1, 10, MPI_INT, 1, 0,

MPI_COMM_WORLD, &request);MPI_Recv(buf2, 10, MPI_INT, 1, 0,

MPI_COMM_WORLD, MPI_STATUS_IGNORE);}else if (rank == 1){

MPI_Isend(buf1, 10, MPI_INT, 0, 0,MPI_COMM_WORLD, &request);

MPI_Recv(buf2, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

}

MPI_Wait(&request, &status);memcpy(buf1, buf2, 10*sizeof(int));

• Processes 0 and 1 exchange the content of their buffers

exchange_sendrecv.c

if (rank == 0){

MPI_Sendrecv(buf1, 10, MPI_INT, 1, 0, buf2, 10, MPI_INT, 1, 0,


MPI_Sendrecv(buf1, 10, MPI_INT, 0, 0, buf2, 10, MPI_INT, 0, 0,

MPI_COMM_WORLD, MPI_STATUS_IGNORE);}

memcpy(buf1, buf2, 10*sizeof(int));


exchange_sendrecv_replace.c

if (rank == 0){

MPI_Sendrecv_replace(buf1, 10, MPI_INT, 1, 0, 1, 0,


MPI_Sendrecv_replace(buf1, 10, MPI_INT, 0, 0, 0, 0,

MPI_COMM_WORLD, MPI_STATUS_IGNORE);}


Wildcard receives

• MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards

switch(rank) {

case 0: MPI_Send(..., 1, 0, ...); break;

case 1:

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);


break;

case 2: MPI_Send(..., 1, 1, ...); break;

}

send(1, 0) send(1, 1)recv(*, *)

recv(*, *)


Barrier synchronization

MPI_Barrier(communicator)• Returns when all processes in communicator have joined

switch(rank) {

case 0:

MPI_Send(..., dest = 1, tag = 0, ...);

MPI_Barrier(MPI_COMM_WORLD); break;

case 1:


MPI_Barrier(MPI_COMM_WORLD);

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...); break;

case 2:


MPI_Send(..., src = 1, tag = 1, ...); break;

}

Barriers

send(1, 0)

send(1, 1)

recv(*, *)

recv(*, *)


barrier barrier barrier

• Order of calls on process 1 is important– recv – barrier – recv: correct– barrier – recv – recv: message race or deadlock (depends

on msg size)– recv – recv – barrier: deadlock

Barriers

• barrier – recv – recv

• MPI_Send on process 0 may succeed if message can be buffered (MPI_Ssend always fails, MPI_Isend always succeeds)

send(1, 0)

send(1, 1)recv(*, *)

recv(*, *)



Barriers

• recv – recv – barrier

• Process 1 never enters the barrier since its second recv is never matched

send(1, 0)

send(1, 1)

recv(*, 0)

recv(2, *)



Collective communications

• MPI_Bcast( buffer, count, datatype, root, comm ) – Sends the same data to every process

• MPI_Scatter( sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)– Sends pieces of a buffer to every process of the

communicator– “The outcome is as if the root executed n send operations

MPI_Send(sendbuf+i*sendcount*extent(sendtype), sendcount, sendtype, i, ...)and each process executed a receiveMPI_Recv(recvbuf, recvcount, recvtype, i, ...) “


• MPI_Gather– Retrieves pieces of data from every process

• MPI_Allgather– All pieces retrieved by all processes

• MPI_Reduce– Performs a reduction operation (MPI_SUM, …) across all nodes– E.g. dot product on distributed vectors (produit scalaire)

• MPI_Allreduce– Result distributed to all processes

• MPI_Alltoall– Sends all data to all processes

• Every process of the communicator must participate– Parameters must match

Mismatches cause race conditions or deadlocks

Receiving image parts in order

/* Generate image parts */.../* Each process sends */MPI_Isend(imgPart, partSize, MPI_BYTE, 0, 0,

MPI_COMM_WORLD, &request);

/* Process 0 receives all parts into bufif (rank == 0){

char *buf = malloc(nProcs*partSize);for (int i=0; i<nProcs; i++)

MPI_Recv(buf + i*partSize, partSize, MPI_BYTE, i, 0, MPI_COMM_WORLD,

MPI_STATUS_IGNORE);}

MPI_Wait(&request, MPI_STATUS_IGNORE);

Receiving image parts out-of-order/* Generate image parts */

...

/* Each process sends */

MPI_Isend(imgPart, partSize, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &request);

/* Process 0 receives all parts into buf

if (rank == 0) {

char *buf = malloc(nProcs*partSize);

MPI_Status s; int count;

for (int i=0; i<nProcs; i++) {

MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &s);

MPI_Get_count(&s, MPI_BYTE, &count);

MPI_Recv(buf + s.MPI_SOURCE*count, count, MPI_BYTE,s.MPI_SOURCE, s.MPI_TAG, MPI_COMM_WORLD,

MPI_STATUS_IGNORE);

} } MPI_Wait(&request, MPI_STATUS_IGNORE);

Receiving image parts with a collective

/* Generate image parts */

...

/* Process 0 is the root of the collective,

i.e. the receiver of all parts */

int root = 0;

char *buf = NULL;

if (rank == root) /* Only the root must allocate buf */

buf = malloc(nProcs*partSize)

MPI_Gather(part, partSize, MPI_BYTE, buf, partSize, MPI_BYTE, root, MPI_COMM_WORLD);


• Avoid putting collective calls within conditional clauses

• Ex from slide 11:

/* Send from 0 to 1 */

if (rank == 0) MPI_Send(..., 1, 0, ...);

else if (rank == 1)



/* Send from 2 to 1 */

if (rank == 1)


else if (rank == 2)

MPI_Send(..., 1, 1, ...);

Custom communicators and datatypes

• You can define your own communicators– MPI_Comm_dup duplicates a communicator (e.g. to

enable private communications within library functions such as DPS)

– MPI_Comm_split splits a communicator into multiple smaller communicators (useful when using 2D and 3D domain decompositions, DPS game of life is 1D)

• You can define custom datatypes, e.g.– Simple structs (MPI_Type_struct)

– Vectors (MPI_Type_vector)

– No pointers

Does this program terminate? (assume 2 processes)

int rank;MPI_Init(&argc, &argv);MPI_Comm_rank(&rank, MPI_COMM_WORLD);

if (rank) MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);else MPI_Recv(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,

MPI_STATUS_IGNORE);

if (rank) MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);else MPI_Recv(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,

MPI_STATUS_IGNORE);

MPI_Finalize();

MPI vs. DPS

• MPI does not impose communication pattern + Suitable for all problems– Everything is left to the user– Serialization up to the user (but custom datatypes can be created)– Deadlocks may occur

• DPS requires acyclic graph– Some (many?) problems not naturally mapped on a graph– Overheads for creating operations, threads, data objects+ Graph provides convenient graphical representation, makes it easy

to build pipelined applications+ Built-in load-balancing, flow control+ Automatic serialization of complex datatypes (pointers, STL, …)+ Built-in checkpointing+ Deadlock-free

MPI vs. DPS

• MPI communicators represent groups of processes– Memory space private to each process

– One process may belong to multiple communicators

• DPS thread collections represent groups of threads– Thread state private to each thread

– One thread only belongs to a single thread collection

• DPS runs on top of MPI– DPS applications become MPI applications

– Submit DPS jobs like any MPI job (integrates nicely with infrastructure)

• We could not implement MPI on top of DPS

MPI vs. DPS

• MPI allocates processes statically– MPI 2.0 supports for adding and removing processes dynamically– Whole application stops if one process (or one computer) crashes

• When using TCP, DPS allocates processes dynamically– Fault-tolerance is built into the framework– For simple applications: just add –ft on command line– More in “Fault-tolerant applications” in “Advanced Tutorial” chapter

• Both can run on heterogeneous clusters– Hence datatypes (MPI_INT, MPI_DOUBLE, …) to handle byte

ordering– DPS tested on Solaris/Windows/Mac/Linux on x86/Sparc/PowerPC

DPS over MPI

dpsMain(int argv, char *argv[], dps::Application *app){

int support; // Initialize MPI with multithreadingMPI_Init_thread(&argv, &argv, MPI_THREAD_MULTIPLE,

&support);assert(support == MPI_THREAD_MULTIPLE);

dps::Controller c(argc, argv, app);if (!app->init()) // Every process calls init()

return; // abort if init returned falseMPI_Comm_rank(MPI_COMM_WORLD, &rank);if (rank == 0)

app->start(); // Only main instance calls start()c.run(); // All instances start waiting for messagesMPI_Finalize();

}

int main(int argc, char *argv[]) { return dps::dpsMain(new MyApp());}

Useful links

Function list and official MPI documentation:

http://www.mpi-forum.org/docs/mpi-11-html/node182.html

A quick MPI reference:

http://www.mathematik.uni-marburg.de/~berthold/parprog05/MPI-QRef.pdf

a very short introduction to mpi vincent keller, cadmos (with basile schaeli, epfl – i&c –...

Documents