PARALLEL COMPUTING WITH MPI Anne Weill-Zrahia With acknowledgments to Cornell Theory Center


Page 1: PARALLEL COMPUTING WITH MPI

PARALLEL COMPUTING WITH MPI

Anne Weill-Zrahia

With acknowledgments to Cornell Theory Center

Page 2: PARALLEL COMPUTING WITH MPI

Introduction to Parallel Computing

• Parallel computer: a set of processors that work cooperatively to solve a computational problem.

• Distributed computing: a number of processors communicating over a network

• Metacomputing: use of several parallel computers

Page 3: PARALLEL COMPUTING WITH MPI

Why parallel computing

• Single processor performance – limited by physics

• Multiple processors – break down problem into simple tasks or domains

• Plus – obtain same results as in sequential program, faster.

• Minus – need to rewrite code

Page 4: PARALLEL COMPUTING WITH MPI

Parallel classification

• Parallel architectures

Shared Memory / Distributed Memory

• Programming paradigms

Data parallel / Message passing

Page 5: PARALLEL COMPUTING WITH MPI

Shared memory

[Diagram: several processors P connected to one shared Memory.]

Page 6: PARALLEL COMPUTING WITH MPI

Shared Memory

• Each processor can access any part of the memory

• Access times are uniform (in principle)

• Easier to program (no explicit message passing)

• Bottleneck when several tasks access same location

Page 7: PARALLEL COMPUTING WITH MPI

Data-parallel programming

• Single program defining operations

• Single memory

• Loosely synchronous (completion of loop)

• Parallel operations on array elements

Page 8: PARALLEL COMPUTING WITH MPI

Distributed Memory

• Processor can only access local memory

• Access times depend on location

• Processors must communicate via explicit message passing

Page 9: PARALLEL COMPUTING WITH MPI

Distributed Memory

[Diagram: nodes, each with a processor and its own local memory, connected by an interconnection network.]

Page 10: PARALLEL COMPUTING WITH MPI

Message Passing Programming

• Separate program on each processor

• Local Memory

• Control over distribution and transfer of data

• Additional complexity of debugging due to communications

Page 11: PARALLEL COMPUTING WITH MPI

Performance issues

• Concurrency – ability to perform actions simultaneously

• Scalability – performance is not impaired by increasing number of processors

• Locality – high ratio of local memory accesses to remote memory accesses (or low communication)

Page 12: PARALLEL COMPUTING WITH MPI

SP2 Benchmark

• Goal: checking the performance of real-world applications on the SP2

• Execution time (seconds): CPU time for the applications

• Speedup = (Execution time for 1 processor) / (Execution time for p processors)
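For example (illustrative numbers, not benchmark results): a run that takes 100 seconds on 1 processor and 25 seconds on 8 processors has a speedup of 100/25 = 4.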

Page 13: PARALLEL COMPUTING WITH MPI
Page 14: PARALLEL COMPUTING WITH MPI

WHAT is MPI?

• A message-passing library specification

• Extended message-passing model

• Not specific to implementation or computer

Page 15: PARALLEL COMPUTING WITH MPI

BASICS of MPI PROGRAMMING

• MPI is a message-passing library

• Assumes : a distributed memory architecture

• Includes : routines for performing communication (exchange of data and synchronization) among the processors.

Page 16: PARALLEL COMPUTING WITH MPI

Message Passing

• Data transfer + synchronization

• Synchronization : the act of bringing one or more processes to known points in their execution

• Distributed memory: memory split up into segments, each may be accessed by only one process.

Page 17: PARALLEL COMPUTING WITH MPI
Page 18: PARALLEL COMPUTING WITH MPI

MPI STANDARD

• Standard by consensus, designed in an open forum

• Introduced by the MPI FORUM in May 1994, updated in June 1995.

• MPI-2 (1998) provided extensions to the MPI standard

Page 19: PARALLEL COMPUTING WITH MPI

IS MPI Large or Small?

• A large number of features is included (blocking/non-blocking communication, collective vs. point-to-point operations, efficiency features)

However ...

• A small subset of functions is sufficient

Page 20: PARALLEL COMPUTING WITH MPI

Why use MPI ?

• Standardization

• Portability

• Performance

• Richness

• Designed to enable libraries

Page 21: PARALLEL COMPUTING WITH MPI

Writing an MPI Program

• If there is a serial version, make sure it is debugged

• If not, try to write a serial version first

• When debugging in parallel, start with a few nodes first.

Page 22: PARALLEL COMPUTING WITH MPI

Format of MPI routines

C:

MPI_Xxx(parameters)

#include "mpi.h"

FORTRAN:

call MPI_XXX(parameters, ierror)

include 'mpif.h'

Page 23: PARALLEL COMPUTING WITH MPI

Six useful MPI functions

MPI_INIT – Initializes the MPI environment

MPI_COMM_SIZE – Returns the number of processes

MPI_COMM_RANK – Returns this process's number (rank)

Page 24: PARALLEL COMPUTING WITH MPI

Communication routines

MPI_SEND – Sends a message

MPI_RECV – Receives a message

Page 25: PARALLEL COMPUTING WITH MPI

End MPI part of program

MPI_FINALIZE – Exits MPI in an orderly way

Page 26: PARALLEL COMPUTING WITH MPI

#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main( int argc, char *argv[] ){

    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    strcpy(message, "Hello,world");

    if (rank == 0) {
        for (i = 1; i < size; i++) {
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }

    printf("node %d : %s \n", rank, message);

    MPI_Finalize();
    return 0;
}

Page 27: PARALLEL COMPUTING WITH MPI

MPI Messages

• DATA data to be sent

• ENVELOPE – information to route the data.

Page 28: PARALLEL COMPUTING WITH MPI

Description of MPI_Send (MPI_Recv)

Startbuf – the address where the data start

Count – the number of elements in the message

Datatype – the type of the elements

Destination/source – rank in the communicator (0 .. size-1)

Page 29: PARALLEL COMPUTING WITH MPI

Description of MPI_Send (MPI_Recv)

Tag – an arbitrary number to help distinguish between messages

Communicator – the communications universe

Status – for the receive only! Contains 3 fields: sender, tag and error code
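For reference, a sketch of the corresponding C prototypes (MPI-1 style bindings; the Fortran versions take an extra ierror argument):

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);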

Page 30: PARALLEL COMPUTING WITH MPI

Some useful remarks

• Source = MPI_ANY_SOURCE means that any source is acceptable

• Tags specified by sender and receiver must match, or MPI_ANY_TAG: any tag is acceptable

• Communicator must be the same for send/receive. Usually: MPI_COMM_WORLD
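A small illustrative C fragment (assumes MPI is already initialized and <stdio.h> is included; the buffer size and type are made up) that accepts a message from any source with any tag and then inspects the status fields:

int buf[16];
MPI_Status status;

MPI_Recv(buf, 16, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* the envelope of the message actually received */
printf("received from rank %d, tag %d\n",
       status.MPI_SOURCE, status.MPI_TAG);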

Page 31: PARALLEL COMPUTING WITH MPI

POINT-TO-POINT COMMUNICATION

• Transmission of a message between one pair of processes

• Programmer can choose mode of transmission


Page 32: PARALLEL COMPUTING WITH MPI

MODE of TRANSMISSION

• Can be chosen by programmer

• …or let the system decide

• Synchronous mode

• Ready mode

• Buffered mode

• Standard mode

Page 33: PARALLEL COMPUTING WITH MPI

BLOCKING /NON-BLOCKING COMMUNICATIONS

Blocking – a send or receive suspends execution until the message buffer is safe to use.

Non-blocking – separates computation from communication. The send is initiated but not completed; a separate call can be used to verify that the communication has completed.

Page 34: PARALLEL COMPUTING WITH MPI

BLOCKING STANDARD SEND (message size > threshold)

[Diagram: S calls MPI_SEND, R calls MPI_RECV. Because the message is larger than the threshold the sending task waits; the transfer begins when MPI_RECV has been posted, the sender continues when the data transfer to the buffer is complete, and the receive completes when the data transfer from the source is complete.]

Page 35: PARALLEL COMPUTING WITH MPI

NON-BLOCKING STANDARD SEND (message size > threshold)

[Diagram: S calls MPI_ISEND, R calls MPI_IRECV, and both later call MPI_WAIT. The transfer begins when MPI_IRECV has been posted; the receive completes when the data transfer from the source is complete. There is no interruption if the wait is placed late enough.]

Page 36: PARALLEL COMPUTING WITH MPI

BLOCKING STANDARD SEND (message size <= threshold)

[Diagram: S calls MPI_SEND, R calls MPI_RECV. The message is copied to a buffer on the receiver; the sending task continues as soon as that copy is done, and the receive completes when the data transfer to the user's buffer is complete.]

Page 37: PARALLEL COMPUTING WITH MPI

NON-BLOCKING STANDARD SEND (message size <= threshold)

[Diagram: S calls MPI_ISEND, R calls MPI_IRECV, and both later call MPI_WAIT. The sender is not delayed even though the message is not yet in a buffer on R; the copy into an intermediate buffer can be avoided if MPI_IRECV is posted early enough, and there is no delay if the wait is late enough.]

Page 38: PARALLEL COMPUTING WITH MPI

BLOCKING COMMUNICATION

printf("Task %d has sent the message \n", isrc);

MPI_Send(&rmessage1, mslen, MPI_DOUBLE, idest, isend_tag,
         MPI_COMM_WORLD);

MPI_Recv(&rmessage2, mslen, MPI_DOUBLE, isrc, irecv_tag,
         MPI_COMM_WORLD, &status);

Page 39: PARALLEL COMPUTING WITH MPI

NON-BLOCKING

MPI_Isend(&rmessage1, mslen, MPI_DOUBLE, idest, isend_tag,
          MPI_COMM_WORLD, &request_send);

MPI_Irecv(&rmessage2, mslen, MPI_DOUBLE, isrc, irecv_tag,
          MPI_COMM_WORLD, &request_rec);

MPI_Wait(&request_rec, &istatus);

Page 40: PARALLEL COMPUTING WITH MPI

program deadlock

implicit none
include 'mpif.h'
integer MSGLEN, ITAG_A, ITAG_B
parameter ( MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200 )
real rmessage1(MSGLEN),           ! message buffers
.    rmessage2(MSGLEN)
integer irank,                    ! rank of task in communicator
.       idest, isrc,              ! ranks in communicator of destination and source tasks
.       isend_tag, irecv_tag,     ! message tags
.       istatus(MPI_STATUS_SIZE), ! status of communication
.       ierr,                     ! return status
.       i

call MPI_Init ( ierr )
call MPI_Comm_Rank ( MPI_COMM_WORLD, irank, ierr )
print *, " Task ", irank, " initialized"
C     initialize message buffers
do i = 1, MSGLEN
   rmessage1(i) = 100
   rmessage2(i) = -100
end do

Page 41: PARALLEL COMPUTING WITH MPI

Deadlock program (cont.)

if ( irank .EQ. 0 ) then
   idest = 1
   isrc = 1
   isend_tag = ITAG_A
   irecv_tag = ITAG_B
else if ( irank .EQ. 1 ) then
   idest = 0
   isrc = 0
   isend_tag = ITAG_B
   irecv_tag = ITAG_A
end if
C     ----------------------------------------------------------------
C     send and receive messages
C     ----------------------------------------------------------------
print *, " Task ", irank, " has sent the message"
call MPI_Send ( rmessage1, MSGLEN, MPI_REAL, idest, isend_tag,
.               MPI_COMM_WORLD, ierr )
call MPI_Recv ( rmessage2, MSGLEN, MPI_REAL, isrc, irecv_tag,
.               MPI_COMM_WORLD, istatus, ierr )
print *, " Task ", irank, " has received the message"

call MPI_Finalize ( ierr )
end

Page 42: PARALLEL COMPUTING WITH MPI

DEADLOCK example

[Diagram: tasks A and B both call MPI_SEND first and MPI_RECV second.]

Page 43: PARALLEL COMPUTING WITH MPI

Deadlock example

• SP2 implementation: no receive has been posted yet, so both processes block

• Solutions

Different ordering

Non-blocking calls

MPI_Sendrecv
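A minimal C sketch of the MPI_Sendrecv solution (an illustrative program, not part of the original slides); with the combined call the library pairs the send and the receive, so neither task can block the other:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, partner, sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;              /* run with exactly 2 processes */
    sendval = rank;

    /* combined send+receive: no ordering problem between the two ranks */
    MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 100,
                 &recvval, 1, MPI_INT, partner, 100,
                 MPI_COMM_WORLD, &status);

    printf("Task %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}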

Page 44: PARALLEL COMPUTING WITH MPI

Determining Information about Messages

• Wait

• Test

• Probe

Page 45: PARALLEL COMPUTING WITH MPI

MPI_WAIT

• Useful for both sender and receiver of non-blocking communications

• Receiving process blocks until message is received, under programmer control

• Sending process blocks until send operation completes, at which time the message buffer is available for re-use

Page 46: PARALLEL COMPUTING WITH MPI

MPI_WAIT

[Diagram: sender S and receiver R compute while the message is transmitted, then call MPI_WAIT to complete the communication.]

Page 47: PARALLEL COMPUTING WITH MPI

MPI_TEST

[Diagram: the sender starts the transfer with MPI_Isend, keeps computing while the message is transmitted, and periodically checks for completion with MPI_TEST.]

Page 48: PARALLEL COMPUTING WITH MPI

MPI_TEST

• Used by both the sender and the receiver of non-blocking communication

• Non-blocking call

• The receiver checks whether a specific sender has sent a message that is waiting to be delivered; messages from all other senders are ignored

Page 49: PARALLEL COMPUTING WITH MPI

MPI_TEST (cont.)

The sender can find out whether the message buffer can be re-used; it has to wait until the operation is complete before doing so.
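A hedged C fragment (assumes MPI is initialized; buffer, count, dest, tag and do_useful_work are illustrative names) showing the usual test-then-compute loop:

MPI_Request request;
MPI_Status  status;
int complete = 0;

MPI_Isend(buffer, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &request);
while (!complete) {
    do_useful_work();                       /* overlap computation with the transfer */
    MPI_Test(&request, &complete, &status); /* non-blocking completion check */
}
/* only now is it safe to re-use buffer */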

Page 50: PARALLEL COMPUTING WITH MPI

MPI_PROBE

• Receiver is notified when messages from potentially any sender arrive and are ready to be processed.

• Blocking call

Page 51: PARALLEL COMPUTING WITH MPI

Programming recommendations

• Blocking calls are needed when:

• Tasks must synchronize

• MPI_Wait immediately follows the communication call

Page 52: PARALLEL COMPUTING WITH MPI

Collective Communication

• Establish a communication pattern within a group of nodes.

• All processes in the group call the communication routine, with matching arguments.

• Collective routine calls can return when their participation in the collective communication is complete.

Page 53: PARALLEL COMPUTING WITH MPI

Properties of collective calls

• On completion: the caller is free to access locations in the communication buffer.

• Does NOT indicate that other processors in the group have completed

• Only MPI_BARRIER will synchronize all processes

Page 54: PARALLEL COMPUTING WITH MPI

Properties

• MPI guarantees that a message generated by collective communication calls will not be confused with a message generated by point-to-point communication

• Communicator is the group identifier.

Page 55: PARALLEL COMPUTING WITH MPI

Barrier

• Synchronization primitive. A node calling it will block until all the nodes within the group have called it.

• Syntax

MPI_Barrier(Comm, Ierr)

Page 56: PARALLEL COMPUTING WITH MPI

Broadcast

• Send data on one node to all other nodes in communicator.

• MPI_Bcast(buffer, count, datatype, root, comm, ierr)
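A minimal C sketch of a broadcast (illustrative program; the root fills the buffer, every rank has the same value afterwards):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) value = 42;                  /* data exists only on the root */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}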

Page 57: PARALLEL COMPUTING WITH MPI

Broadcast

[Diagram: the root's block A0 is replicated; after the broadcast, P0, P1, P2 and P3 all hold A0.]

Page 58: PARALLEL COMPUTING WITH MPI

Gather and Scatter

[Diagram: scatter distributes blocks A0, A1, A2, A3 from P0, one block per process P0 through P3; gather collects the blocks from the processes back onto P0.]

Page 59: PARALLEL COMPUTING WITH MPI

Allgather effect

[Diagram: before the allgather, P0 through P3 hold one block each (A0, B0, C0, D0); afterwards every process holds all four blocks.]

Page 60: PARALLEL COMPUTING WITH MPI

Syntax for Scatter & Gather

MPI_Gather(sendbuf, scount, datatype, recvbuf, rcount, rdatatype, root, comm, ierr)

MPI_Scatter(sendbuf, scount, datatype, recvbuf, rcount, rdatatype, root, comm, ierr)

Page 61: PARALLEL COMPUTING WITH MPI

Scatter and Gather

• Gather: Collect data from every member of the group (including the root) on the root node in linear order by the rank of the node.

• Scatter: Distribute data from the root to every member of the group in linear order by node.

Page 62: PARALLEL COMPUTING WITH MPI

ALLGATHER

• All processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process

• Syntax :

MPI_Allgather(sndbuf, scount, datatype, recvbuf, rcount, rdatatype, comm, ierr)

Page 63: PARALLEL COMPUTING WITH MPI

Gather example

DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
INTEGER root
DATA root/0/

DO I = 1, 25
   cpart(I) = 0.
   DO K = 1, 100
      cpart(I) = cpart(I) + A(I,K)*b(K)
   END DO
END DO
call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
.               root, MPI_COMM_WORLD, ierr)

Page 64: PARALLEL COMPUTING WITH MPI

AllGather example

DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
INTEGER root

DO I = 1, 25
   cpart(I) = 0.
   DO K = 1, 100
      cpart(I) = cpart(I) + A(I,K)*b(K)
   END DO
END DO
call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
.                  MPI_COMM_WORLD, ierr)

Page 65: PARALLEL COMPUTING WITH MPI

Parallel matrix-vector multiplication

[Diagram: parallel matrix-vector multiplication A * b = c; the rows of A are divided into blocks of 25 handled by P1, P2, P3 and P4, each computing 25 elements of c.]

Page 66: PARALLEL COMPUTING WITH MPI

Global Computations

• Reduction

• Scan

Page 67: PARALLEL COMPUTING WITH MPI

Reduction

• The partial result in each process in the group is combined in one specified process

Page 68: PARALLEL COMPUTING WITH MPI

Reduction

Dj – the jth item of data at the root process

* – the reduction operation (sum, max, min, ...)

Dj = D(0,j) * D(1,j) * ... * D(n-1,j)
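A minimal C sketch of a sum reduction (illustrative program; each process contributes its rank and rank 0 ends up with the global sum):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* combine the partial values of all processes on the root (rank 0) */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}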

Page 69: PARALLEL COMPUTING WITH MPI

Scan operation

• The scan (or prefix-reduction) operation performs partial reductions on distributed data

• D(k,j) = D(0,j) * D(1,j) * ... * D(k,j),   k = 0, 1, ..., n-1
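A hedged C fragment of the corresponding call (assumes MPI is initialized): after MPI_Scan with MPI_SUM, process k holds the sum over ranks 0 through k.

int rank, prefix;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* inclusive prefix sum over the ranks: prefix on rank k is 0+1+...+k */
MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);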

Page 70: PARALLEL COMPUTING WITH MPI

Varying size gather and scatter

• Both the size and the memory location of the messages vary

• More flexibility in writing code

• Less need to copy data into temporary buffers

• More compact final code

• The vendor implementation may be optimal

Page 71: PARALLEL COMPUTING WITH MPI

Scatterv syntax

MPI_Scatterv(sendbuf, scounts, displs, sendtype, recvbuf, rcount, recvtype, root, comm, ierr)

SCOUNTS(I) – the number of items to send from process root to process I

DISPLS(I) – the displacement from sendbuf to the beginning of the Ith message
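An illustrative C fragment (assumes exactly 4 processes and that MPI is initialized; the counts and displacements are made-up values):

int scounts[4] = {1, 2, 3, 4};   /* items sent to ranks 0..3 */
int displs[4]  = {0, 1, 3, 6};   /* where each rank's piece starts in sendbuf */
double sendbuf[10] = {0}, recvbuf[4];
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Scatterv(sendbuf, scounts, displs, MPI_DOUBLE,
             recvbuf, scounts[rank], MPI_DOUBLE,
             0, MPI_COMM_WORLD);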

Page 72: PARALLEL COMPUTING WITH MPI

SCATTER

[Diagram: P0 scatters equal-sized pieces to P0, P1, P2 and P3.]

Page 73: PARALLEL COMPUTING WITH MPI

SCATTERV

[Diagram: P0 scatters pieces of varying sizes, taken from varying positions in its buffer, to P0, P1, P2 and P3.]

Page 74: PARALLEL COMPUTING WITH MPI
Page 75: PARALLEL COMPUTING WITH MPI

Advanced Datatypes

• Predefined basic datatypes -- contiguous data of the same type.

• We sometimes need:

non-contiguous data of single type

contiguous data of mixed types

Page 76: PARALLEL COMPUTING WITH MPI

Solutions

• multiple MPI calls to send and receive each data element

• copy the data to a buffer before sending it (MPI_PACK)

• use MPI_BYTE to get around the datatype-matching rules

Page 77: PARALLEL COMPUTING WITH MPI

Drawback

• Slow, clumsy and wasteful of memory

• Using MPI_BYTE or MPI_PACKED can hamper portability

Page 78: PARALLEL COMPUTING WITH MPI

General Datatypes and Typemaps

• a sequence of basic datatypes

• a sequence of integer (byte) displacements

Page 79: PARALLEL COMPUTING WITH MPI

Typemaps

typemap = [(type0, disp0), (type1, disp1), ..., (typen, dispn)]

Displacements are relative to the buffer.

Example:

Typemap(MPI_INT) = [(int, 0)]

Page 80: PARALLEL COMPUTING WITH MPI

Extent of a Derived Datatype

Lb = min(disp0, disp1, ..., dispn)

Ub = max(disp0 + sizeof(type0), ..., dispn + sizeof(typen))

Extent = Ub – Lb + pad

Page 81: PARALLEL COMPUTING WITH MPI

MPI_TYPE_EXTENT

• MPI_TYPE_EXTENT(datatype,extent,ierr)

Describes the distance (in bytes) from the start of the datatype to the start of the next datatype.

Page 82: PARALLEL COMPUTING WITH MPI

How to use

• Construct the datatype

• Allocate (commit) the datatype

• Use the datatype

• Deallocate (free) the datatype
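A hedged C sketch of this lifecycle using MPI_Type_vector (dest and tag are illustrative; the "allocate" and "deallocate" steps correspond to MPI_Type_commit and MPI_Type_free):

MPI_Datatype everyother;
double a[10];

/* construct: 5 blocks of 1 double with stride 2 -> elements 0,2,4,6,8 of a */
MPI_Type_vector(5, 1, 2, MPI_DOUBLE, &everyother);
MPI_Type_commit(&everyother);                            /* allocate   */
MPI_Send(a, 1, everyother, dest, tag, MPI_COMM_WORLD);   /* use        */
MPI_Type_free(&everyother);                              /* deallocate */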

Page 83: PARALLEL COMPUTING WITH MPI

PERFORMANCE ISSUES

• Hidden communication takes place

• Performance depends on implementation of MPI

• Because of forced synchronization, it is not always best to use collective communication

Page 84: PARALLEL COMPUTING WITH MPI

Example : simple broadcast

[Diagram: the root sends the B-byte message to each of the other processes (2, 3, ..., 8) one after another. Data: B*(P-1); Steps: P-1]

Page 85: PARALLEL COMPUTING WITH MPI

Example : simple scatter

[Diagram: the root sends each of the other processes its own B-byte piece, one after another. Data: B*(P-1); Steps: P-1]

Page 86: PARALLEL COMPUTING WITH MPI

Example : better scatter

[Diagram: tree-based scatter on 8 processes; in the first step process 1 sends half of the data (4*B) to another process, in the second step two processes each forward 2*B, and in the third step four processes each forward B. Data: B*P*log P; Steps: log P]

Page 87: PARALLEL COMPUTING WITH MPI

Timing for sending a message

Time is composed of the startup time (the time to send a zero-length message) and the transfer time (the time to transfer one byte of data).

Tcomm = Tstartup + B * Ttransfer

It may be worthwhile to group several sends together
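As an illustration with made-up numbers: if Tstartup = 50 microseconds and Ttransfer = 0.01 microseconds per byte, a 100-byte message costs about 50 + 100*0.01 = 51 microseconds, so ten separate messages cost about 510 microseconds, while one grouped 1000-byte message costs only about 50 + 1000*0.01 = 60 microseconds.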

Page 88: PARALLEL COMPUTING WITH MPI

Performance evaluation

Fortran:

Real*8 t1
t1 = MPI_Wtime()    ! returns wall-clock time in seconds

C:

double t1;
t1 = MPI_Wtime();

Elapsed time is obtained as the difference of two MPI_Wtime() calls.

Page 89: PARALLEL COMPUTING WITH MPI

Example : better broadcast

[Diagram: tree-based broadcast on 8 processes; in each step every process that already has the B-byte message forwards it to one more process, so the number of holders doubles. Data: B*(P-1); Steps: log P]

Page 90: PARALLEL COMPUTING WITH MPI

MPI References

• The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html

• Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997

• Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999