
Page 1: PDS Lectures 6 & 7 An Introduction to MPI


PDS Lectures 6 & 7 An Introduction to MPI

Originally by William Gropp and Ewing Lusk

Argonne National Laboratory

Adapted by Anda Iamnitchi

Page 2: PDS Lectures 6 & 7 An Introduction to MPI

Outline

Background
  The message-passing model
  Origins of MPI and current status
  Sources of further MPI information
Basics of MPI message passing
  Hello, World!
  Fundamental concepts
  Simple examples in Fortran and C
Extended point-to-point operations
  Non-blocking communication
  Modes
Collective operations

Page 3: PDS Lectures 6 & 7 An Introduction to MPI

The Message-Passing Model

A process is (traditionally) a program counter and address space.
Processes may have multiple threads (program counters and associated stacks) sharing a single address space.
MPI is for communication among processes, which have separate address spaces.
Interprocess communication consists of:
  Synchronization
  Movement of data from one process's address space to another's.

Page 4: PDS Lectures 6 & 7 An Introduction to MPI

Types of Parallel Computing Models (review)

Data Parallel - the same instructions are carried out simultaneously on multiple data items (SIMD)
Task Parallel - different instructions on different data (MIMD)
  SPMD (single program, multiple data): not synchronized at the individual operation level
  SPMD is equivalent to MIMD, since each MIMD program can be made SPMD (similarly for SIMD, but not in a practical sense).
Message passing (and MPI) is for MIMD/SPMD parallelism.

Page 5: PDS Lectures 6 & 7 An Introduction to MPI

Cooperative Operations for Communication

The message-passing approach makes the exchange of data cooperative.
Data is explicitly sent by one process and received by another.
An advantage is that any change in the receiving process's memory is made with the receiver's explicit participation.
Communication and synchronization are combined.

  Process 0: Send(data)        Process 1: Receive(data)

Page 6: PDS Lectures 6 & 7 An Introduction to MPI

One-Sided Operations for Communication

One-sided operations between processes include remote memory reads and writes.
Only one process needs to explicitly participate.
An advantage is that communication and synchronization are decoupled.
One-sided operations are part of MPI-2.

  Process 0                     Process 1
  Put(data)  ------------->     (memory)
  Get(data)  <-------------     (memory)
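As a hedged sketch of the Put picture above (the window setup, the window name win, and the assumption of at least two processes are illustrative, not from the slide), one-sided transfer in MPI-2 looks roughly like this:

/* Hedged sketch (not from the slides): each process exposes one int
   in a window; process 0 then puts a value into process 1's window.
   Assumes the program is run with at least 2 processes. */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, local = 0, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose 'local' for remote access by the other processes. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* the Put is complete here */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Only process 0 names any data in a communication call; process 1 merely participates in the fences.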

Page 7: PDS Lectures 6 & 7 An Introduction to MPI

What is MPI?

A message-passing library specification
  extended message-passing model
  not a language or compiler specification
  not a specific implementation or product
For parallel computers, clusters, and heterogeneous networks
Designed to provide access to advanced parallel hardware for
  end users
  library writers
  tool developers

Page 8: PDS Lectures 6 & 7 An Introduction to MPI


A Minimal MPI Program (C)

#include "mpi.h"

#include <stdio.h>

int main( int argc, char *argv[] )

{

MPI_Init( &argc, &argv );

printf( "Hello, world!\n" );

MPI_Finalize();

return 0;

}

Page 9: PDS Lectures 6 & 7 An Introduction to MPI

Finding Out About the Environment

Two important questions that arise early in a parallel program are:
  How many processes are participating in this computation?
  Which one am I?
MPI provides functions to answer these questions:
  MPI_Comm_size reports the number of processes.
  MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process.

Page 10: PDS Lectures 6 & 7 An Introduction to MPI

Better Hello (C)

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

Page 11: PDS Lectures 6 & 7 An Introduction to MPI

MPI Basic Send/Receive

We need to fill in the details in:

  Process 0: Send(data)        Process 1: Receive(data)

Things that need specifying:
  How will "data" be described?
  How will processes be identified?
  How will the receiver recognize/screen messages?
  What will it mean for these operations to complete?

Page 12: PDS Lectures 6 & 7 An Introduction to MPI

What is message passing?

Data transfer plus synchronization
  Requires cooperation of sender and receiver
  Cooperation not always apparent in code

[Diagram: Process 0 asks Process 1 "May I Send?"; Process 1 answers "Yes"; the data is then transferred over time.]

Page 13: PDS Lectures 6 & 7 An Introduction to MPI

Some Basic Concepts

Processes can be collected into groups.
Each message is sent in a context, and must be received in the same context.
A group and context together form a communicator.
A process is identified by its rank in the group associated with a communicator.
There is a default communicator whose group contains all initial processes, called MPI_COMM_WORLD.

Page 14: PDS Lectures 6 & 7 An Introduction to MPI

MPI Datatypes

The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as:
  predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
  a contiguous array of MPI datatypes
  a strided block of datatypes
  an indexed array of blocks of datatypes
  an arbitrary structure of datatypes
There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise (see the sketch below).
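As a hedged sketch of the "row of a matrix stored columnwise" case (the 4x4 size, the column-major storage, and the function name are illustrative assumptions):

/* Hedged sketch: describe row 0 of a 4x4 matrix stored column-major,
   i.e. 4 doubles spaced 4 elements apart, and send it as one item. */
#include "mpi.h"

void send_first_row(double a[16], int dest, MPI_Comm comm)
{
    MPI_Datatype rowtype;

    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &rowtype);  /* 4 blocks of 1, stride 4 */
    MPI_Type_commit(&rowtype);

    MPI_Send(a, 1, rowtype, dest, 0, comm);          /* one "rowtype" element */

    MPI_Type_free(&rowtype);
}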

Page 15: PDS Lectures 6 & 7 An Introduction to MPI


MPI Tags

Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.

Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.

Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.

Page 16: PDS Lectures 6 & 7 An Introduction to MPI

MPI Basic (Blocking) Send

MPI_SEND(start, count, datatype, dest, tag, comm)

The message buffer is described by (start, count, datatype).
The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process.

MPI_Send(buf, count, datatype, dest, tag, comm)
  buf      - address of send buffer
  count    - number of items to send
  datatype - datatype of each item
  dest     - rank of destination process
  tag      - message tag
  comm     - communicator

Page 17: PDS Lectures 6 & 7 An Introduction to MPI

MPI Basic (Blocking) Receive

MPI_RECV(start, count, datatype, source, tag, comm, status)

Waits until a matching (on source and tag) message is received from the system, and the buffer can be used.
source is a rank in the communicator specified by comm, or MPI_ANY_SOURCE.
status contains further information.
Receiving fewer than count occurrences of datatype is OK, but receiving more is an error.

MPI_Recv(buf, count, datatype, src, tag, comm, status)
  buf      - address of receive buffer
  count    - maximum number of items to receive
  datatype - datatype of each item
  src      - rank of source process
  tag      - message tag
  comm     - communicator
  status   - status after the operation

Page 18: PDS Lectures 6 & 7 An Introduction to MPI

Example

To send an integer x from process 0 to process 1:

int x, myrank, msgtag = 0;                /* msgtag: any user-chosen tag */
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

Page 19: PDS Lectures 6 & 7 An Introduction to MPI

Retrieving Further Information

Status is a data structure allocated in the user's program.

In C:

int recvd_tag, recvd_from, recvd_count;
MPI_Status status;

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &recvd_count );

Page 20: PDS Lectures 6 & 7 An Introduction to MPI

Why Datatypes?

Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
Specifying application-oriented layout of data in memory
  reduces memory-to-memory copies in the implementation
  allows the use of special hardware (scatter/gather) when available

Page 21: PDS Lectures 6 & 7 An Introduction to MPI

Tags and Contexts

Separation of messages used to be accomplished by use of tags, but
  this requires libraries to be aware of tags used by other libraries;
  this can be defeated by use of "wild card" tags.
Contexts are different from tags:
  no wild cards allowed
  allocated dynamically by the system when a library sets up a communicator for its own use.
User-defined tags are still provided in MPI for user convenience in organizing the application.
Use MPI_Comm_split to create new communicators (a sketch follows below).
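A hedged sketch of MPI_Comm_split (the even/odd split, the variable names, and the printed message are illustrative assumptions):

/* Hedged sketch: split MPI_COMM_WORLD into two communicators,
   one containing the even ranks and one containing the odd ranks. */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Same "color" -> same new communicator; "key" orders ranks within it. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);
    printf("world rank %d is rank %d in its subcommunicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}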

Page 22: PDS Lectures 6 & 7 An Introduction to MPI

Communicators

A communicator defines the scope of a communication operation.
Processes have ranks associated with a communicator.
Initially, all processes are enrolled in a "universe" called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, for p processes.
Other communicators can be established for groups of processes.

Page 23: PDS Lectures 6 & 7 An Introduction to MPI

MPI is Simple

Many parallel programs can be written using just these six functions, only two of which are non-trivial:
  MPI_INIT
  MPI_FINALIZE
  MPI_COMM_SIZE
  MPI_COMM_RANK
  MPI_SEND
  MPI_RECV

Point-to-point (send/recv) isn't the only way...

Page 24: PDS Lectures 6 & 7 An Introduction to MPI


Introduction to Collective Operations in MPI

Collective operations are called by all processes in a communicator.

MPI_BCAST distributes data from one process (the root) to all others in a communicator.

MPI_REDUCE combines data from all processes in communicator and returns it to one process.

In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.

Page 25: PDS Lectures 6 & 7 An Introduction to MPI

Example: PI in C - 1

#include "mpi.h"
#include <stdio.h>   /* for printf and scanf */
#include <math.h>

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

Page 26: PDS Lectures 6 & 7 An Introduction to MPI

Example: PI in C - 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}

Page 27: PDS Lectures 6 & 7 An Introduction to MPI

Alternative Set of 6 Functions for Simplified MPI

  MPI_INIT
  MPI_FINALIZE
  MPI_COMM_SIZE
  MPI_COMM_RANK
  MPI_BCAST
  MPI_REDUCE

What else is needed (and why)?

Page 28: PDS Lectures 6 & 7 An Introduction to MPI

Collective Communication

Involves a set of processes, defined by an intra-communicator.
Message tags are not present.
Principal collective operations:
  MPI_Bcast() - Broadcast from root to all other processes
  MPI_Gather() - Gather values from a group of processes
  MPI_Scatter() - Scatters a buffer in parts to a group of processes (sketched below)
  MPI_Alltoall() - Sends data from all processes to all processes
  MPI_Reduce() - Combine values on all processes to a single value
  MPI_Reduce_scatter() - Combine values and scatter the results
  MPI_Scan() - Compute prefix reductions of data on processes
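A hedged sketch of MPI_Scatter() (the chunk size of 10 and the variable names are assumptions, chosen to mirror the gather example on the next slide):

/* Hedged sketch: the root fills numprocs*10 ints and scatters 10 of
   them to every process (itself included). */
#include "mpi.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs, i, chunk[10], *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (myrank == 0) {                       /* only the root needs the send buffer */
        all = (int *)malloc(numprocs * 10 * sizeof(int));
        for (i = 0; i < numprocs * 10; i++)
            all[i] = i;
    }

    /* Each process receives its own 10-int slice into chunk[]. */
    MPI_Scatter(all, 10, MPI_INT, chunk, 10, MPI_INT, 0, MPI_COMM_WORLD);

    if (myrank == 0)
        free(all);
    MPI_Finalize();
    return 0;
}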

Page 29: PDS Lectures 6 & 7 An Introduction to MPI

Example

To gather items from a group of processes into process 0, using dynamically allocated memory in the root process:

int data[10];              /* data to be gathered from processes */
int *buf, grp_size, myrank;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);            /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);      /* find group size */
    buf = (int *)malloc(grp_size*10*sizeof(int));  /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

The receive count in MPI_Gather() is the number of items received from each process (10 here), not the total.
MPI_Gather() gathers from all processes, including the root.

Page 30: PDS Lectures 6 & 7 An Introduction to MPI

Sources of Deadlocks

Send a large message from process 0 to process 1.
If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
What happens with:

  Process 0        Process 1
  Send(1)          Send(0)
  Recv(1)          Recv(0)

This is called "unsafe" because it depends on the availability of system buffers.

Page 31: PDS Lectures 6 & 7 An Introduction to MPI

Some Solutions to the "unsafe" Problem

Order the operations more carefully:

  Process 0        Process 1
  Send(1)          Recv(0)
  Recv(1)          Send(0)

Use non-blocking operations (sketched in C below):

  Process 0        Process 1
  Isend(1)         Isend(0)
  Irecv(1)         Irecv(0)
  Waitall          Waitall
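Written out as a hedged C sketch for two processes exchanging one int each (the tag 0 and the variable names are illustrative):

/* Hedged sketch: both processes post Isend and Irecv first, then wait
   on both requests, so neither blocks on the other's receive.
   Assumes the program is run with exactly 2 processes. */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = 1 - rank;
    sendval = rank;

    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both complete here */

    MPI_Finalize();
    return 0;
}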

Page 32: PDS Lectures 6 & 7 An Introduction to MPI

MPI Nonblocking Routines

Nonblocking send - MPI_Isend() - will return "immediately", even before the source location is safe to be altered.
Nonblocking receive - MPI_Irecv() - will return even if there is no message to accept.

Page 33: PDS Lectures 6 & 7 An Introduction to MPI

Nonblocking Routine Formats

MPI_Isend(buf, count, datatype, dest, tag, comm, request)
MPI_Irecv(buf, count, datatype, source, tag, comm, request)

Completion is detected by MPI_Wait() and MPI_Test():
  MPI_Wait() waits until the operation has completed and then returns.
  MPI_Test() returns with a flag set, indicating whether the operation has completed at that time.
Need to know whether a particular operation has completed, determined by accessing the request parameter (a polling sketch follows below).
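A hedged sketch of the MPI_Test() polling pattern (the busywork counter stands in for useful computation; the tag and variable names are assumptions):

/* Hedged sketch: rank 1 posts a non-blocking receive and polls it with
   MPI_Test, doing other work until the message from rank 0 arrives.
   Assumes at least 2 processes. */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x = 7, flag = 0, busywork = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            busywork++;                          /* stand-in for useful work */
            MPI_Test(&request, &flag, &status);  /* flag != 0 once complete */
        }
        printf("received %d after %d polls\n", x, busywork);
    }

    MPI_Finalize();
    return 0;
}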

Page 34: PDS Lectures 6 & 7 An Introduction to MPI

Example

To send an integer x from process 0 to process 1 and allow process 0 to continue:

int x, myrank, msgtag = 0;
MPI_Request req1;
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();                            /* overlap work with the send */
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

Page 35: PDS Lectures 6 & 7 An Introduction to MPI

MPI-2 (available on C4 lab machines)

Extends MPI-1 with:
  Dynamic process management
    Dynamic process startup
    Dynamic establishment of connections
  One-sided communication (Remote Memory Access)
    Put/get
    Other operations
  Parallel I/O
  Other features
    Bindings for C++ / Fortran-90; interlanguage issues
    Etc.

Page 36: PDS Lectures 6 & 7 An Introduction to MPI


MPI-2 Specifics: Parallel I/O

Page 37: PDS Lectures 6 & 7 An Introduction to MPI

/* example of sequential Unix write into a common file */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, numprocs, buf[BUFSIZE];
    MPI_Status status;
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    if (myrank != 0)
        MPI_Send(buf, BUFSIZE, MPI_INT, 0, 99, MPI_COMM_WORLD);
    else {
        myfile = fopen("testfile", "w");
        fwrite(buf, sizeof(int), BUFSIZE, myfile);
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buf, BUFSIZE, MPI_INT, i, 99, MPI_COMM_WORLD, &status);
            fwrite(buf, sizeof(int), BUFSIZE, myfile);
        }
        fclose(myfile);
    }
    MPI_Finalize();
    return 0;
}

Page 38: PDS Lectures 6 & 7 An Introduction to MPI

/* example: each process writes to its own file (testfile.<rank>) */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}

Page 39: PDS Lectures 6 & 7 An Introduction to MPI

/* example of parallel MPI write into a single file */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}

Page 40: PDS Lectures 6 & 7 An Introduction to MPI

C bindings for the I/O functions

int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                  MPI_Info info, MPI_File *fh)

int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                      MPI_Datatype filetype, char *datarep, MPI_Info info)

int MPI_File_write(MPI_File fh, void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)

int MPI_File_close(MPI_File *fh)

Page 41: PDS Lectures 6 & 7 An Introduction to MPI

Remote Memory Access

"Window": the portion of each process's address space that it explicitly exposes to remote memory operations by other processes in the MPI communicator.

Page 42: PDS Lectures 6 & 7 An Introduction to MPI

RMA (cont.)

Objectives and performance issues:
  balance efficiency and portability across several classes of architectures, including SMPs, NUMAs, distributed-memory massively parallel processors (MPPs), and heterogeneous networks;
  retain the "look and feel" of MPI-1;
  deal with subtle memory behavior issues, such as cache coherence and sequential consistency; and
  separate synchronization from data movement to enhance performance.

Page 43: PDS Lectures 6 & 7 An Introduction to MPI

/* Compute pi by numerical integration, RMA version */
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Win nwin, piwin;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        /* Process 0 exposes n and pi for remote access by the others. */
        MPI_Win_create(&n, sizeof(int), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(&pi, sizeof(double), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }
    else {
        /* The other processes contribute no memory to the windows. */
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }

Page 44: PDS Lectures 6 & 7 An Introduction to MPI

    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            fflush(stdout);
            scanf("%d", &n);
            pi = 0.0;
        }
        MPI_Win_fence(0, nwin);
        if (myid != 0)
            MPI_Get(&n, 1, MPI_INT, 0, 0, 1, MPI_INT, nwin);
        MPI_Win_fence(0, nwin);
        if (n == 0)
            break;
        else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            MPI_Win_fence(0, piwin);
            MPI_Accumulate(&mypi, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE,
                           MPI_SUM, piwin);
            MPI_Win_fence(0, piwin);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Win_free(&nwin);
    MPI_Win_free(&piwin);
    MPI_Finalize();
    return 0;
}

Page 45: PDS Lectures 6 & 7 An Introduction to MPI

C bindings for the RMA functions used in the cpi example

int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win)

int MPI_Win_fence(int assert, MPI_Win win)

int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)

int MPI_Accumulate(void *origin_addr, int origin_count,
                   MPI_Datatype origin_datatype, int target_rank,
                   MPI_Aint target_disp, int target_count,
                   MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)

int MPI_Win_free(MPI_Win *win)

Page 46: PDS Lectures 6 & 7 An Introduction to MPI


Dynamic Processes
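As a hedged sketch of MPI-2 dynamic process creation (the "./worker" executable name and the count of 4 are illustrative assumptions), a parent program might spawn children with MPI_Comm_spawn:

/* Hedged sketch: the parent spawns 4 copies of a hypothetical "./worker"
   executable and obtains an intercommunicator to talk to them. */
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Comm workers;

    MPI_Init(&argc, &argv);

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);

    /* ... communicate with the spawned processes through the "workers"
       intercommunicator (point-to-point or collective calls) ... */

    MPI_Finalize();
    return 0;
}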

Page 47: PDS Lectures 6 & 7 An Introduction to MPI

When to use MPI

Portability and performance
Irregular data structures
Building tools for others
  Libraries
Need to manage memory on a per-processor basis

Page 48: PDS Lectures 6 & 7 An Introduction to MPI

When not to use MPI

Regular computation matches HPF/SIMD
  But see the PETSc/HPF comparison (ICASE 97-72)
Solution (e.g., library) already exists
  http://www.mcs.anl.gov/mpi/libraries.html
Require fault tolerance
  Sockets
Distributed computing
  CORBA, DCOM, etc.

Page 49: PDS Lectures 6 & 7 An Introduction to MPI

Error Handling

By default, an error causes all processes to abort.
The user can cause routines to return (with an error code) instead; in C++, exceptions are thrown (MPI-2). A sketch follows below.
A user can also write and install custom error handlers.
Libraries might want to handle errors differently from applications.
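A hedged sketch of switching to returned error codes (the deliberately invalid count and the message handling are illustrative assumptions; MPI_Comm_set_errhandler is the MPI-2 name for installing an error handler on a communicator):

/* Hedged sketch: ask MPI to return error codes on MPI_COMM_WORLD
   instead of aborting, then check a call's return value. */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc, x = 0, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* A deliberately invalid call (negative count) now yields an
       error code rather than aborting all processes. */
    rc = MPI_Send(&x, -1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}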

Page 50: PDS Lectures 6 & 7 An Introduction to MPI

Resources

On-line information (including tutorials):
  http://www.mcs.anl.gov/mpi/
  https://computing.llnl.gov/tutorials/mpi/
Using MPI-2: Advanced Features of the Message-Passing Interface, William Gropp ... [et al.]. E-book available from the USF library.
The Standard itself: at http://www.mpi-forum.org
  All MPI official releases, in both PostScript and HTML
Online examples available at http://www.mcs.anl.gov/mpi/tutorials/perf
ftp://ftp.mcs.anl.gov/mpi/mpiexmpl.tar.gz contains source code and run scripts that allow you to evaluate an MPI implementation.