
COMPE472 Parallel Computing 2.1

Message-Passing Computing, Chapter 2

– Programming a message-passing computer:

1. Using a special parallel programming language

2. Extending an existing language

3. Using a high-level language and providing a message-passing library

– Here, the third option is employed.


COMPE472 Parallel Computing 2.2

Message-Passing Programming using User-level Message-Passing Libraries

Two primary mechanisms needed:

1. A method of creating separate processes for execution on different computers

• Static process creation (MPI-1): the number of processes is fixed before execution starts.

• Dynamic process creation (MPI-2): processes can be created at run time.

2. A method of sending and receiving messages


COMPE472 Parallel Computing 2.3

Programming Models: 1. Multiple program, multiple data (MPMD) model

Figure: each source file is compiled to suit its processor, producing a separate executable for each of Processor 0 through Processor p - 1.


COMPE472 Parallel Computing 2.4

Programming models: 2. Single Program Multiple Data (SPMD) model

Figure: a single source file is compiled to suit each processor, producing one executable per processor (Processor 0 through Processor p - 1).

Basic MPI way

Different processes merged into one program. Control statements select different parts for each processor to execute. All executables started together - static process creation


COMPE472 Parallel Computing 2.5

Multiple Program Multiple Data (MPMD) Model

Figure: process 1 calls spawn() at some point in time, which starts execution of process 2.

Separate programs for each processor. One processor executes master process. Other processes started from within master process - dynamic process creation.


COMPE472 Parallel Computing 2.6

Basic “point-to-point” Send and Receive Routines

Process 1: send(&x, 2);        Process 2: recv(&y, 1);

Figure: the data in x on process 1 moves to y on process 2.

Generic syntax (actual formats later)

Passing a message between processes using send() and recv() library calls:


COMPE472 Parallel Computing 2.7

Synchronous Message Passing

Routines that return only when the message transfer has completed.

Synchronous send routine
• Waits until the complete message can be accepted by the receiving process before sending the message.

Synchronous receive routine
• Waits until the message it is expecting arrives.
• No need for buffer storage.

Synchronous routines intrinsically perform two actions: They transfer data and they synchronize processes.
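As a minimal sketch of synchronous transfer in MPI (using the synchronous-mode send MPI_Ssend(), which appears in the communication-mode table later in this chapter; the ranks, tag, and variable names are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x = 42, y = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Completes only after the matching receive has started: the call
           both transfers the data and synchronizes the two processes. */
        MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}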


COMPE472 Parallel Computing 2.8

Synchronous send() and recv() using 3-way protocol

Figure (a) When send() occurs before recv(): process 1 issues a request to send and suspends; once process 2 reaches recv() it returns an acknowledgment, the message is transferred, and both processes continue.

Figure (b) When recv() occurs before send(): process 2 suspends at recv(); when process 1 reaches send() it issues a request to send, process 2 acknowledges, the message is transferred, and both processes continue.


COMPE472 Parallel Computing 2.9

Asynchronous Message Passing

• Routines that do not wait for actions to complete before returning. Usually require local storage for messages.

• More than one version depending upon the actual semantics for returning.

• In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.


COMPE472 Parallel Computing 2.10

MPI Definitions of Blocking and Non-Blocking

• Blocking - return after their local actions complete, though the message transfer may not have been completed.

• Non-blocking - return immediately.

Non-blocking routines assume that the data storage used for the transfer is not modified by subsequent statements before the transfer completes; it is left to the programmer to ensure this. These terms may have different interpretations in other systems.


COMPE472 Parallel Computing 2.11

How message-passing routines return before message transfer completed

Figure: process 1 issues send(), the message is copied into a message buffer, and process 1 continues; process 2 later issues recv() and reads the message from the buffer.

Message buffer needed between source and destination to hold message:


COMPE472 Parallel Computing 2.12

Asynchronous (blocking) routines changing to synchronous routines

• Once local actions completed and message is safely on its way, sending process can continue with subsequent work.

• Buffers are only of finite length, so a point could be reached where a send routine is held up because all available buffer space has been exhausted.

• The send routine then waits until storage becomes available again - i.e. the routine behaves as a synchronous routine.


COMPE472 Parallel Computing 2.13

Message Tag

• Used to differentiate between different types of messages being sent.

• Message tag is carried within message.

• If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().


COMPE472 Parallel Computing 2.14

Message Tag Example

Process 1: send(&x, 2, 5);        Process 2: recv(&y, 1, 5);

Figure: the data in x moves to y.

Waits for a message from process 1 with a tag of 5

To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y:


COMPE472 Parallel Computing 2.15

“Group” message passing routines

Have routines that send message(s) to a group of processes or receive message(s) from a group of processes

Higher efficiency than separate point-to-point routines although not absolutely necessary.


COMPE472 Parallel Computing 2.16

Scatter

Figure: every process calls scatter(); the elements of the root's buf array are distributed so that each of Process 0 through Process p - 1 receives one element into its own data location.

Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process.


COMPE472 Parallel Computing 2.17

Gather

Figure: every process calls gather(); the data item from each of Process 0 through Process p - 1 is collected into the buf array on the root process.

Having one process collect individual values from set of processes.


COMPE472 Parallel Computing 2.18

Reduce

Figure: every process calls reduce(); the data items from Process 0 through Process p - 1 are combined (here with +) into buf on the root process.

Gather operation combined with specified arithmetic/logical operation.

Example: Values could be gathered and then added together by root:


COMPE472 Parallel Computing 2.19

AllGather & AllReduce

• AllGather and AllReduce: perform a gather/reduce and broadcast the result.
• First a group must be formed and a root process selected.


COMPE472 Parallel Computing 2.20

Barrier

• Barrier: a synchronization point.
• Example: a barrier can be built from an allReduce (see the sketch below).
• Typically more efficient implementations are used.
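A minimal sketch of the allReduce-based barrier mentioned above (the function name and dummy value are illustrative; real MPI_Barrier() implementations are normally more efficient, as the slide notes):

#include <mpi.h>

/* Improvised barrier: no process can obtain the global sum until every
   process in the communicator has contributed its local value. */
void allreduce_barrier(MPI_Comm comm)
{
    int dummy = 1, sum = 0;
    MPI_Allreduce(&dummy, &sum, 1, MPI_INT, MPI_SUM, comm);
}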


COMPE472 Parallel Computing 2.21

PVM (Parallel Virtual Machine)

Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform; developed by Oak Ridge National Laboratory. Available at no charge.

Programmer decomposes problem into separate programs (usually master and group of identical slave programs).

Programs compiled to execute on specific types of computers.

Set of computers used on a problem first must be defined prior to executing the programs (in a hostfile).


COMPE472 Parallel Computing 2.22

Message routing between computers done by PVM daemon processes installed by PVM on computers that form the virtual machine.

Figure: each workstation runs a PVM daemon and one or more application programs (executables); messages between workstations are sent through the network via the daemons.

MPI implementation we use is similar.

Can have more than one process running on each computer.


COMPE472 Parallel Computing 2.23

MPI (Message Passing Interface)

• Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability.

• Defines routines, not implementation.

• Several free implementations exist.


COMPE472 Parallel Computing 2.24

MPI Process Creation and Execution

• Purposely not defined - Will depend upon implementation.

• Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together.

• Originally SPMD model of computation.
• MPMD also possible with static creation - each program to be started together is specified.


COMPE472 Parallel Computing 2.25

Communicators

• Defines scope of a communication operation.

• Processes have ranks associated with communicator.

• Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes.

• Other communicators can be established for groups of processes.


COMPE472 Parallel Computing 2.26

Using SPMD Computational Model

main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.


COMPE472 Parallel Computing 2.27

Unsafe message passing - Example

(a) Intended behavior: in each process the user code and the library routine lib() keep to themselves - process 0's send(…,1,…) is received by process 1's recv(…,0,…), and the sends and receives issued inside the two lib() calls match each other.

(b) Possible behavior: because messages are matched only on source and destination, a message sent by the user code of process 0 may instead be consumed by the recv(…,0,…) inside process 1's lib() (and vice versa), so the wrong pairs of sends and receives match.


COMPE472 Parallel Computing 2.28

MPI Solution: “Communicators”

• Defines a communication domain - a set of processes that are allowed to communicate between themselves.

• Communication domains of libraries can be separated from that of a user program.

• Used in all point-to-point and collective MPI message-passing communications.


COMPE472 Parallel Computing 2.29

Default Communicator MPI_COMM_WORLD

• Exists as first communicator for all processes existing in the application.

• A set of MPI routines exists for forming communicators.

• Processes have a “rank” in a communicator.


COMPE472 Parallel Computing 2.30

MPI Point-to-Point Communication

• Uses send and receive routines with message tags (and communicator).

• Wild card message tag (MPI_ANY_TAG) is available


COMPE472 Parallel Computing 2.31

MPI Blocking Routines

• Return when “locally complete” - when location used to hold message can be used again or altered without affecting message being sent.

• Blocking send will send message and return - does not mean that message has been received, just that process free to move on without adversely affecting message.


COMPE472 Parallel Computing 2.32

Parameters of blocking send

MPI_Send(buf, count, datatype, dest, tag, comm)

buf      - address of send buffer
count    - number of items to send
datatype - datatype of each item
dest     - rank of destination process
tag      - message tag
comm     - communicator


COMPE472 Parallel Computing 2.33

! Example send-recv, MPI Fortran 90
program main
    use mpi
    implicit none
    integer status(MPI_STATUS_SIZE)
    integer :: ierr, rank, size, i, p, tag
    character*100 message

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
    if (rank .ne. 0) then
        message = "Hello from the processor!"
        call MPI_SEND(message, len(message), MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
    else
        tag = 0
        do i = 1, p-1
            call MPI_RECV(message, len(message), MPI_CHARACTER, i, tag, MPI_COMM_WORLD, status, ierr)
            print *, "Processor:", i, "message:", message
        end do
    end if
    call MPI_FINALIZE(ierr)
end


COMPE472 Parallel Computing 2.34

! Mid-point Rule to Compute PI using MPI_BCAST and MPI_REDUCE
      program main
      use mpi
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      double precision starttime, endtime
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)        ! function to integrate
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
 10   if (myid .eq. 0) then
         print *, 'Enter the number of intervals: (0 quits) '
         read(*,*) n
      endif
      starttime = MPI_WTIME()
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
!     check for quit signal
      if (n .le. 0) goto 30
!     calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
!     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)
      endtime = MPI_WTIME()
      if (myid .eq. 0) then
         print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
         print *, 'time is ', endtime-starttime, ' seconds'
      endif
      go to 10
 30   call MPI_FINALIZE(ierr)
      stop
      end


COMPE472 Parallel Computing 2.35

Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf      - address of receive buffer
count    - maximum number of items to receive
datatype - datatype of each item
src      - rank of source process
tag      - message tag
comm     - communicator
status   - status after operation


COMPE472 Parallel Computing 2.36

Example

To send an integer x from process 0 to process 1,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}


COMPE472 Parallel Computing 2.37

C - MPI Datatypes


COMPE472 Parallel Computing 2.38

Fortran – MPI Basic Datatypes


COMPE472 Parallel Computing 2.39

The status array

• Status is a data structure allocated in the user's program.

• C language:
    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count(&status, datatype, &recvd_count);

• Fortran language:
    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)
    call MPI_RECV(.., MPI_ANY_SOURCE, MPI_ANY_TAG, .., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)


COMPE472 Parallel Computing 2.40

MPI Nonblocking Routines

• Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered.

• Nonblocking receive - MPI_Irecv() - will return even if no message to accept.


COMPE472 Parallel Computing 2.41

Nonblocking Routine Formats

MPI_Isend(buf,count,datatype,dest,tag,comm,request)

MPI_Irecv(buf,count,datatype,source,tag,comm, request)

Completion detected by MPI_Wait() and MPI_Test().

MPI_Wait() waits until operation completed and returns then.

MPI_Test() returns with flag set indicating whether operation completed at that time.

Need to know whether particular operation completed.

Determined by accessing request parameter.
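A small sketch of both completion mechanisms (do_other_work() and msgtag are hypothetical placeholders, not part of MPI):

#include <mpi.h>

/* Receive one int from rank 0 without blocking, overlapping other work. */
void receive_overlapped(int *x, int msgtag, void (*do_other_work)(void))
{
    MPI_Request req;
    MPI_Status status;
    int flag = 0;

    MPI_Irecv(x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &req);
    do {
        do_other_work();                  /* useful work while the transfer proceeds */
        MPI_Test(&req, &flag, &status);   /* poll: flag is set once the receive has completed */
    } while (!flag);
    /* Alternatively, MPI_Wait(&req, &status); would simply block here
       until the operation completed. */
}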


COMPE472 Parallel Computing 2.42

MPI_ISEND & MPI_IRECV

• Fortran:
    MPI_ISEND(buf, count, type, dest, tag, comm, req, ierr)
    MPI_IRECV(buf, count, type, sour, tag, comm, req, ierr)

    buf   - array of type type
    count (INTEGER) - number of elements of buf to be sent
    type  (INTEGER) - MPI type of buf
    dest  (INTEGER) - rank of the destination process
    sour  (INTEGER) - rank of the source process
    tag   (INTEGER) - number identifying the message
    comm  (INTEGER) - communicator of the sender and receiver
    req   (INTEGER) - output, identifier of the communication handle
    ierr  (INTEGER) - output, error code (if ierr=0 no error occurs)


COMPE472 Parallel Computing 2.43

Example

To send an integer x from process 0 to process 1 and allow process 0 to continue,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}


COMPE472 Parallel Computing 2.44

Send Communication Modes

• Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached.

• Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach() (see the sketch after this list).

• Synchronous Mode - Send and receive can start before each other but can only complete together.

• Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
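For the buffered mode above, a minimal sketch of attaching user buffer space with MPI_Buffer_attach() before calling MPI_Bsend(); the function name and the data, dest, and tag parameters are illustrative assumptions:

#include <mpi.h>
#include <stdlib.h>

/* Buffered-mode send of 'count' ints to 'dest'. */
void buffered_send(int *data, int count, int dest, int tag)
{
    int bufsize = count * (int)sizeof(int) + MPI_BSEND_OVERHEAD;
    char *buffer = malloc(bufsize);

    MPI_Buffer_attach(buffer, bufsize);       /* supply the buffer space */
    MPI_Bsend(data, count, MPI_INT, dest, tag, MPI_COMM_WORLD);  /* returns once copied */
    MPI_Buffer_detach(&buffer, &bufsize);     /* waits until buffered sends have drained */
    free(buffer);
}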


COMPE472 Parallel Computing 2.45

• Each of the four modes can be applied to both blocking and nonblocking send routines.

• Only the standard mode is available for the blocking and nonblocking receive routines.

• Any type of send routine can be used with any type of receive routine.


COMPE472 Parallel Computing 2.46

Communication Modes and MPI Subroutines

Standard send    - completes when the message has been sent (receive state unknown); blocking: MPI_SEND, non-blocking: MPI_ISEND
Receive          - completes when a message has arrived; blocking: MPI_RECV, non-blocking: MPI_IRECV
Synchronous send - only completes when the receive has completed; blocking: MPI_SSEND, non-blocking: MPI_ISSEND
Buffered send    - always completes, irrespective of receiver; blocking: MPI_BSEND, non-blocking: MPI_IBSEND
Ready send       - always completes, irrespective of whether the receive has completed; blocking: MPI_RSEND, non-blocking: MPI_IRSEND


COMPE472 Parallel Computing 2.47

Collective Communication

Involves a set of processes, defined by an intra-communicator. Message tags are not present. Principal collective operations:

• MPI_BCAST() - Broadcast from root to all other processes
• MPI_GATHER() - Gather values for group of processes
• MPI_SCATTER() - Scatters buffer in parts to group of processes
• MPI_ALLTOALL() - Sends data from all processes to all processes
• MPI_REDUCE() - Combine values on all processes to single value
• MPI_REDUCE_SCATTER() - Combine values and scatter results
• MPI_SCAN() - Compute prefix reductions of data on processes


COMPE472 Parallel Computing 2.48

Broadcast

Sending the same message to all processes concerned with the problem.
Multicast - sending the same message to a defined group of processes.

Figure: every process calls bcast(); the contents of buf on the root (Process 0) are copied into the data locations of Process 1 through Process p - 1.


COMPE472 Parallel Computing 2.49

Broadcast Illustrated


COMPE472 Parallel Computing 2.50

Broadcast (MPI_BCAST)

One-to-all communication: the same data is sent from the root process to all others in the communicator.

• Fortran:
    INTEGER count, type, root, comm, ierr
    CALL MPI_BCAST(buf, count, type, root, comm, ierr)
    buf - array of type type

• C:
    int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

• All processes must specify the same root and comm.


COMPE472 Parallel Computing 2.51

Broadcast Example

PROGRAM broad_cast
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
  END IF
  CALL MPI_BCAST(a, 2, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': a(1)=', a(1), 'a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
  END


COMPE472 Parallel Computing 2.52

Reduction (MPI_REDUCE)

The reduction operation allows you to:
• Collect data from each process
• Reduce the data to a single value
• Store the result on the root process, or
• Store the result on all processes
• The reduction function works with arrays
• Operations: sum, product, min, max, and, …
• Internally it is usually implemented with a binary tree
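A minimal C sketch of a global sum with MPI_Reduce() (result on the root only) and MPI_Allreduce() (result on every process); local_val is an assumed per-process input:

#include <mpi.h>

double global_sum(double local_val)
{
    double sum_on_root = 0.0, sum_everywhere = 0.0;

    /* Result stored only on the root process (rank 0): */
    MPI_Reduce(&local_val, &sum_on_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Result stored on all processes: */
    MPI_Allreduce(&local_val, &sum_everywhere, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    return sum_everywhere;
}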


COMPE472 Parallel Computing 2.53

Reduction Operation (SUM)


COMPE472 Parallel Computing 2.54

Reduction in FORTRAN

• MPI_REDUCE(snd_buf, rcv_buf, count, type, op, root, comm, ierr)

    snd_buf - input array of type type containing local values
    rcv_buf - output array of type type containing global results
    count (INTEGER) - number of elements of snd_buf and rcv_buf
    type  (INTEGER) - MPI type of snd_buf and rcv_buf
    op    (INTEGER) - parallel operation to be performed
    root  (INTEGER) - MPI id of the process storing the result
    comm  (INTEGER) - communicator of processes involved in the operation
    ierr  (INTEGER) - output, error code (if ierr=0 no error occurs)

• MPI_ALLREDUCE(snd_buf, rcv_buf, count, type, op, comm, ierr)
    The root argument is missing; the result is stored on all processes.


COMPE472 Parallel Computing 2.55

Predefined Reduction Operations


COMPE472 Parallel Computing 2.56

MPI_Scatter

• One-to-all communication: different data is sent from the root process to each of the others in the communicator.

• Fortran:
    CALL MPI_SCATTER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

    - Argument definitions are like those of other MPI subroutines
    - sndcount is the number of elements sent to each process, not the size of sndbuf, which should be sndcount times the number of processes in the communicator
    - The sender arguments are significant only at root
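The same call sketched in C, assuming one integer is sent to each of the p processes; the names sndbuf and myval are illustrative:

#include <mpi.h>
#include <stdlib.h>

int scatter_one_int(int root)
{
    int p, rank, myval, i;
    int *sndbuf = NULL;

    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == root) {                     /* send buffer significant only at root */
        sndbuf = malloc(p * sizeof(int));
        for (i = 0; i < p; i++) sndbuf[i] = i * i;
    }
    /* sndcount = 1: one element of sndbuf goes to each process */
    MPI_Scatter(sndbuf, 1, MPI_INT, &myval, 1, MPI_INT, root, MPI_COMM_WORLD);
    free(sndbuf);                           /* free(NULL) is harmless on non-root ranks */
    return myval;
}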


COMPE472 Parallel Computing 2.57

MPI_Gather

All-to-one communication: different data is collected by the root process from all other processes in the communicator. It is the opposite of Scatter.

• Fortran:
    CALL MPI_GATHER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

    - Argument definitions are like those of other MPI subroutines
    - rcvcount is the number of elements collected from each process, not the size of rcvbuf, which should be rcvcount times the number of processes in the communicator
    - The receiver arguments are significant only at root


COMPE472 Parallel Computing 2.58

Scatter/Gather


COMPE472 Parallel Computing 2.59

GATHER Example

program gather
    use mpi
    integer ierr, myid, p, nsd, i, root
    integer status(MPI_STATUS_SIZE)
    real a(21), b(3)
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    root = 0
    b(1) = myid
    b(2) = myid
    b(3) = myid
    nsd = 3
    np = nsd * p
    call MPI_GATHER(b, nsd, MPI_REAL, a, nsd, MPI_REAL, root, MPI_COMM_WORLD, ierr)
    if (myid .eq. root) then
        do i = 1, np
            write(6,*) myid, ': a(', i, ')=', a(i)
        end do
    end if
    call MPI_FINALIZE(ierr)
end


COMPE472 Parallel Computing 2.60

Execution

bash-3.00$ mpirun -np 3 gather_test
 0 : a( 1 )= 0.0E+0
 0 : a( 2 )= 0.0E+0
 0 : a( 3 )= 0.0E+0
 0 : a( 4 )= 1.0
 0 : a( 5 )= 1.0
 0 : a( 6 )= 1.0
 0 : a( 7 )= 2.0
 0 : a( 8 )= 2.0
 0 : a( 9 )= 2.0


COMPE472 Parallel Computing 2.61

Scatter/Gather Examples


COMPE472 Parallel Computing 2.62

MPI_Barrier()

• Stops processes until all processes within a communicator reach the barrier.

• Almost never required in a parallel program.

• Occasionally useful in measuring performance and load balancing.

• Fortran:
    CALL MPI_BARRIER(comm, ierr)

• C:
    int MPI_Barrier(MPI_Comm comm)


COMPE472 Parallel Computing 2.63

Barrier


COMPE472 Parallel Computing 2.64

Barrier routine

• A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call.


COMPE472 Parallel Computing 2.65

Evaluating Parallel Programs


COMPE472 Parallel Computing 2.66

Sequential execution time, ts: Estimate by counting computational steps of best sequential algorithm.

Parallel execution time, tp: In addition to number of computational steps, tcomp, need to estimate communication overhead, tcomm:

tp = tcomp + tcomm


COMPE472 Parallel Computing 2.67

Computational Time

Count the number of computational steps. When more than one process is executed simultaneously, count the computational steps of the most complex process. Generally, a function of n and p, i.e.

tcomp = f (n, p)

Often break down computation time into parts. Then

tcomp = tcomp1 + tcomp2 + tcomp3 + …

Analysis is usually done assuming that all processors are identical and operate at the same speed.


COMPE472 Parallel Computing 2.68

Communication Time

Many factors, including network structure and network contention. As a first approximation, use

tcomm = tstartup + ntdata

tstartup is startup time, essentially time to send a message with no data. Assumed to be constant.

tdata is transmission time to send one data word, also assumed constant, and there are n data words.
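Purely as an illustration (the numbers below are assumptions, not course data): with tstartup = 1000 computational steps, tdata = 50 steps, and n = 10000 data words,

t_{comm} = t_{startup} + n\,t_{data} = 1000 + 10000 \times 50 = 501{,}000 \text{ steps}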


COMPE472 Parallel Computing 2.69

Idealized Communication Time

Figure: communication time grows linearly with the number of data items (n); the intercept is the startup time.


COMPE472 Parallel Computing 2.70

Final communication time, tcomm

Summation of communication times of all sequential messages from a process, i.e.

tcomm = tcomm1 + tcomm2 + tcomm3 + …

Communication patterns of all processes assumed same and take place together so that only one process need be considered.

Both tstartup and tdata are measured in units of one computational step, so that tcomp and tcomm can be added together to obtain the parallel execution time, tp.


COMPE472 Parallel Computing 2.71

Benchmark Factors

With ts, tcomp, and tcomm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation:
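Written out (standard definitions, using the symbols above):

\text{Speedup factor: } S(p) = \frac{t_s}{t_p}, \qquad \text{Computation/communication ratio: } \frac{t_{comp}}{t_{comm}}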

Both functions of number of processors, p, and number of data elements, n.


COMPE472 Parallel Computing 2.72

Factors give indication of scalability of parallel solution with increasing number of processors and problem size.

Computation/communication ratio will highlight effect of communication with increasing problem size and system size.


COMPE472 Parallel Computing 2.73

Debugging/Evaluating Parallel Programs Empirically


COMPE472 Parallel Computing 2.74

Visualization Tools

Programs can be watched as they are executed in a space-time diagram (or process-time diagram):

Figure: one horizontal line per process (Process 1, 2, 3) against time; segments show computing, waiting, and time spent in message-passing system routines, with arrows marking messages between processes.


COMPE472 Parallel Computing 2.75

Implementations of visualization tools are available for MPI.

An example is the Upshot program visualization system.


COMPE472 Parallel Computing 2.76

Evaluating Programs Empirically: Measuring Execution Time

To measure the execution time between point L1 and point L2 in the code, we might have a construction such as

L1: time(&t1);   /* start timer */
    .
    .
L2: time(&t2);   /* stop timer */
    .
    elapsed_time = difftime(t2, t1);   /* elapsed_time = t2 - t1 */
    printf("Elapsed time = %5.2f seconds", elapsed_time);

MPI provides the routine MPI_Wtime() for returning time (in seconds).
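The same measurement sketched with MPI_Wtime(), which returns wall-clock time in seconds as a double (the code between the two calls is whatever is being timed):

double t1, t2;

t1 = MPI_Wtime();                 /* start timer */
/* ... code to be timed ... */
t2 = MPI_Wtime();                 /* stop timer  */
printf("Elapsed time = %5.2f seconds\n", t2 - t1);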


COMPE472 Parallel Computing 2.77

Message transmission time

! Ping-pong timing
program main
    use mpi
    double precision starttime, endtime
    integer n, myid, numprocs, i, ierr, status(MPI_STATUS_SIZE), x
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
    x = 2
    if (myid .eq. 0) then
        starttime = MPI_WTIME()
        do 1 i = 1, 1000000
            call MPI_SEND(x, 1, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, ierr)
            call MPI_RECV(x, 1, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, status, ierr)
1       continue
        endtime = MPI_WTIME()
        print *, 'Elapsed time of sending a word:', 0.5*(endtime-starttime)/1000000, ' seconds'
    else
        do 2 i = 1, 1000000
            call MPI_RECV(x, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
            call MPI_SEND(x, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
2       continue
    end if
    call MPI_FINALIZE(ierr)
    stop
end


COMPE472 Parallel Computing 2.78

Parallel Programming Home Page

http://www.cs.uncc.edu/par_prog

Gives step-by-step instructions for compiling and executing programs, and other information.


COMPE472 Parallel Computing 2.79

Compiling/Executing MPI Programs: Preliminaries

• Set up paths
• Create required directory structure
• Create a file (hostfile) listing machines to be used (required)

Details described on home page.


COMPE472 Parallel Computing 2.80

Hostfile

Before starting MPI for the first time, need to create a hostfile

Sample hostfile

ws404

#is-sm1 //Currently not executing, commented

pvm1 //Active processors, UNCC sun cluster called pvm1 - pvm8

pvm2

pvm3

pvm4

pvm5

pvm6

pvm7

pvm8


COMPE472 Parallel Computing 2.81

Compiling/executing (SPMD) MPI program

For LAM MPI version 6.5.2. At a command line:

To start MPI:

First time: lamboot -v hostfile

Subsequently: lamboot

To compile MPI programs:

mpicc -o file file.c

or mpiCC -o file file.cpp

To execute MPI program:

mpirun -v -np no_processors file

To remove processes for reboot

lamclean -v

Terminate LAM

lamhalt

If fails

wipe -v lamhost


COMPE472 Parallel Computing 2.82

Compiling/Executing Multiple MPI Programs

Create a file specifying programs:

Example: 1 master and 2 slaves; “appfile” contains

n0 master

n0-1 slave

To execute:

mpirun -v appfile

Sample output

3292 master running on n0 (o)

3296 slave running on n0 (o)

412 slave running on n1


COMPE472 Parallel Computing 2.83

Circuit Satisfiability MPI-C

/*
 * Circuit Satisfiability, Version 2
 * This enhanced version of the program prints the
 * total number of solutions.
 */
#include "mpi.h"
#include <stdio.h>

int main (int argc, char *argv[]) {
    int count;          /* Solutions found by this proc */
    int global_count;   /* Total number of solutions */
    int i;
    int id;             /* Process rank */
    int p;              /* Number of processes */
    int check_circuit (int, int);

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Comm_size (MPI_COMM_WORLD, &p);

    count = 0;
    for (i = id; i < 65536; i += p)
        count += check_circuit (id, i);

    MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    printf ("Process %d is done\n", id);
    fflush (stdout);
    MPI_Finalize();
    if (!id) printf ("There are %d different solutions\n", global_count);
    return 0;
}


COMPE472 Parallel Computing 2.84

MPI-C …

/* Return 1 if 'i'th bit of 'n' is 1; 0 otherwise */
#define EXTRACT_BIT(n,i) ((n&(1<<i))?1:0)

int check_circuit (int id, int z) {
    int v[16];   /* Each element is a bit of z */
    int i;

    for (i = 0; i < 16; i++) v[i] = EXTRACT_BIT(z,i);
    if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
        && (!v[3] || !v[4]) && (v[4] || !v[5]) && (v[5] || !v[6])
        && (v[5] || v[6]) && (v[6] || !v[15]) && (v[7] || !v[8])
        && (!v[7] || !v[13]) && (v[8] || v[9]) && (v[8] || !v[9])
        && (!v[9] || !v[10]) && (v[9] || v[11]) && (v[10] || v[11])
        && (v[12] || v[13]) && (v[13] || !v[14]) && (v[14] || v[15])) {
        printf ("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
                v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7],v[8],v[9],
                v[10],v[11],v[12],v[13],v[14],v[15]);
        fflush (stdout);
        return 1;
    } else return 0;
}


COMPE472 Parallel Computing 2.85

Floyd’s Algorithm

• to find the least-expensive paths between all the vertices in a graph.

• operates on a matrix representing the costs of edges between vertices.

Fig: Determine whether a path going from Vi to Vj via Vk is shorter than the best-known path from Vi to Vj


COMPE472 Parallel Computing 2.86

Parallel Floyd: Version 1

• One-dimensional decomposition of the I matrix.
• At most N processors.
• In (a), the data allocated to a single task are shaded: a contiguous block of rows.
• In (b), the data required by this task in the kth step of the algorithm are shaded: its own block and the kth row.

for k = 0 to N-1
    for i = local_i_start to local_i_end
        for j = 0 to N-1
            I[i,j](k+1) = min( I[i,j](k), I[i,k](k) + I[k,j](k) )
        endfor
    endfor
endfor


COMPE472 Parallel Computing 2.87

Parallel Floyd: Version 2

• Two-dimensional decomposition of the I matrix.
• Up to N^2 processors.
• In (a), the data allocated to a single task are shaded: a contiguous submatrix.
• In (b), the data required by this task in the kth step of the algorithm are shaded: its own block, and part of the kth row and column.

for k = 0 to N-1
    for i = local_i_start to local_i_end
        for j = local_j_start to local_j_end
            I[i,j](k+1) = min( I[i,j](k), I[i,k](k) + I[k,j](k) )
        endfor
    endfor
endfor


COMPE472 Parallel Computing 2.88

Problem all-pairs shortest paths

• Given a weighted graph G(V,E,w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi, vj that belong to V.


COMPE472 Parallel Computing 2.89

Sequential Floyd

• Input: the adjacency matrix D[i,j]

for k = 1 to |V|
    for i = 1 to |V|
        for j = 1 to |V|
            D[i,j] = min( D[i,j], D[i,k] + D[k,j] )

• Output: D[i,j] contains the shortest path from i to j
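A direct C rendering of the sequential algorithm above (a sketch; it assumes the matrix is stored row-major in a flat array and that a large sentinel value, rather than true infinity, marks missing edges so the addition cannot overflow):

/* In-place Floyd's algorithm: d[i*n + j] holds the current best distance
   from vertex i to vertex j; use e.g. 1 << 29 as "infinity". */
void floyd(int *d, int n)
{
    int i, j, k;
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                if (d[i*n + k] + d[k*n + j] < d[i*n + j])
                    d[i*n + j] = d[i*n + k] + d[k*n + j];
}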


COMPE472 Parallel Computing 2.90

What is the Adjacency Matrix

• Create an adjacency matrix:

    0  1  ∞  ∞
    1  0  1  4
    ∞  1  0  2
    ∞  4  2  0

Figure: the corresponding weighted graph on vertices 1-4, with edges 1-2 (weight 1), 2-3 (weight 1), 2-4 (weight 4), and 3-4 (weight 2).


COMPE472 Parallel Computing 2.91

Understanding Data Dependency

• D[i,j] = min( D[i,j], D[i,k] + D[k,j] )
• k=1, i=1:  D[1, 1..N] = min( D[1, 1..N], D[1,1] + D[1, 1..N] )
• k=1, i=2:  D[2, 1..N] = min( D[2, 1..N], D[2,1] + D[1, 1..N] )
• . .
• k=t, i=f:  D[f, 1..N] = min( D[f, 1..N], D[f,t] + D[t, 1..N] )
• . .
• k=N, i=N:  D[N, 1..N] = min( D[N, 1..N], D[N,N] + D[N, 1..N] )

• Where 0 < t, f <= N.
• Realize that after each (i,j) loop the matrix D has been fully updated: updating row i at step k needs only row i itself and row k.


COMPE472 Parallel Computing 2.92

Key elements in solving the problem easily

• Send the Adjacency Matrix to every process ‘Broadcasting’

• Compute partitions and scatter them for each process ‘Scattering’

• Compute your ith row part and send the results to everybody ‘Broadcasting’

• After K iterations, get the results: ‘Gathering’


COMPE472 Parallel Computing 2.93

/* Send data size to all other processes. */
MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* Compute my area of working, the partition. */
partition_size = size / processors;
if (size % processors) partition_size++;

my_partition = (int*) malloc(size * partition_size * sizeof(int));
kth_row      = (int*) malloc(size * 1 * sizeof(int));

MPI_Scatter(graph, partition_size * size, MPI_INT,
            my_partition, partition_size * size, MPI_INT, 0, MPI_COMM_WORLD);

/* Calculation. */
for (k = 0; k < size; k++) {

    /* Broadcast the kth row. */
    if (my_rank == (k / partition_size))
        for (i = 0; i < size; i++)
            kth_row[i] = my_partition[(k % partition_size)*size + i];
    MPI_Bcast(kth_row, size, MPI_INT, (k / partition_size), MPI_COMM_WORLD);

    /* Update my rows. */
    for (i = 0; (i < partition_size) && (i < size); i++) {
        if (my_partition[i*size + k] < 1) continue;
        for (j = 0; j < size; j++) {
            if (kth_row[j] < 1) continue;
            if (my_partition[i*size + j] < 0)
                my_partition[i*size + j] = my_partition[i*size + k] + kth_row[j];
            else
                my_partition[i*size + j] = min(my_partition[i*size + j],
                                               my_partition[i*size + k] + kth_row[j]);
        }
    }
}

/* Collect the data. */
printf(" Collecting results ... ... \n");
MPI_Gather(my_partition, partition_size * size, MPI_INT,
           graph, partition_size * size, MPI_INT, 0, MPI_COMM_WORLD);


COMPE472 Parallel Computing 2.94

Matrix-vector Multiplication

• The figure below demonstrates schematically how a matrix-vector multiplication, A=B*C, can be decomposed into four independent computations, each involving a scalar multiplying a column vector.

• This approach is different from that which is usually taught in a linear algebra course because this decomposition lends itself better to parallelization.

• These computations are independent and do not require communication, something that usually reduces performance of parallel code.


COMPE472 Parallel Computing 2.95

Matrix-vector Multiplication (Columnwise)

Schematic of parallel decomposition for vector-matrix multiplication, A=B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions, columns, and elements assigned to each processor, respectively.


COMPE472 Parallel Computing 2.96

Matrix-vector Multiplication (Columnwise)

A = B*C

Figure: each element of the result is a_i = b_{i,0} c_0 + b_{i,1} c_1 + b_{i,2} c_2 + b_{i,3} c_3 for i = 0..3. Processor Pj (P0 through P3) holds column j of B and element c_j of C and forms the partial products b_{i,j} c_j for every i; a Reduction (SUM) across P0-P3 then adds the four partial vectors to give A on the root.


COMPE472 Parallel Computing 2.97

Matrix-vector Multiplication

• The columns of matrix B and the elements of column vector C must be distributed to the various processors using MPI commands called scatter operations.

• Note that MPI provides two types of scatter operations depending on whether the problem can be divided evenly among the number of processors or not.

• Each processor now has a column of B, called Bpart, and an element of C, called Cpart. Each processor can now perform an independent vector-scalar multiplication.

• Once this has been accomplished, every processor will have a part of the final column vector A, called Apart.

• The column vectors on each processor can be added together with an MPI reduction command that computes the final sum on the root processor.


COMPE472 Parallel Computing 2.98

Matrix-Vector Mult.

#include <stdio.h>
#include <mpi.h>
#define NCOLS 4

int main(int argc, char **argv) {
    int i, j, k, l;
    int ierr, rank, size, root;
    float A[NCOLS], T[NCOLS][NCOLS];
    float Apart[NCOLS];
    float Bpart[NCOLS], C[NCOLS];
    float A_exact[NCOLS];
    float B[NCOLS][NCOLS];
    float Cpart[1];

    root = 0;
    /* Initiate MPI. */
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Initialize B and C. */
    if (rank == root) {
        B[0][0] = 1; B[0][1] = 2;  B[0][2] = 3; B[0][3] = 4;
        B[1][0] = 4; B[1][1] = -5; B[1][2] = 6; B[1][3] = 4;
        B[2][0] = 7; B[2][1] = 8;  B[2][2] = 9; B[2][3] = 2;
        B[3][0] = 3; B[3][1] = -1; B[3][2] = 5; B[3][3] = 0;

        /* Transpose B so its columns can be scattered as rows of T. */
        for (i = 0; i < NCOLS; i++)
            for (j = 0; j < NCOLS; j++)
                T[i][j] = B[j][i];

        C[0] = 1; C[1] = -4; C[2] = 7; C[3] = 3;
    }


COMPE472 Parallel Computing 2.99

Matrix-Vector Mult.

    /* Put up a barrier until I/O is complete. */
    ierr = MPI_Barrier(MPI_COMM_WORLD);

    /* Scatter the columns of B (the rows of T). */
    ierr = MPI_Scatter(T, NCOLS, MPI_FLOAT, Bpart, NCOLS, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Scatter the elements of C. */
    ierr = MPI_Scatter(C, 1, MPI_FLOAT, Cpart, 1, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Do the vector-scalar multiplication. */
    for (j = 0; j < NCOLS; j++)
        Apart[j] = Cpart[0] * Bpart[j];

    /* Reduce to matrix A. */
    ierr = MPI_Reduce(Apart, A, NCOLS, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("\nThis is the result of the parallel computation:\n\n");
        printf("A[0]=%g\n", A[0]);
        printf("A[1]=%g\n", A[1]);
        printf("A[2]=%g\n", A[2]);
        printf("A[3]=%g\n", A[3]);

        for (k = 0; k < NCOLS; k++) {
            A_exact[k] = 0.0;
            for (l = 0; l < NCOLS; l++) {
                A_exact[k] += B[k][l] * C[l];
            }
        }
        printf("\nThis is the result of the serial computation:\n\n");
        printf("A_exact[0]=%g\n", A_exact[0]);
        printf("A_exact[1]=%g\n", A_exact[1]);
        printf("A_exact[2]=%g\n", A_exact[2]);
        printf("A_exact[3]=%g\n", A_exact[3]);
    }

    MPI_Finalize();
}


COMPE472 Parallel Computing 2.100

Matrix-matrix Multiplication

• A similar, albeit naive, type of decomposition can be achieved for matrix-matrix multiplication, A=B*C.

• The figure below shows schematically how matrix-matrix multiplication of two 4x4 matrices can be decomposed into four independent vector-matrix multiplications, which can be performed on four different processors.


COMPE472 Parallel Computing 2.101

Matrix-matrix Multiplication

Schematic of a decomposition for matrix-matrix multiplication, A=B*C, in Fortran 90. The matrices A and C are depicted as multicolored columns with each color denoting a different processor. The matrix B, in yellow, is broadcast to all processors.


COMPE472 Parallel Computing 2.102

Matrix-matrix Multiplication

• The basic steps are

1. Distribute the columns of C among the processors using a scatter operation.

2. Broadcast the matrix B to every processor.

3. Form the product of B with the columns of C on each processor. These are the corresponding columns of A.

4. Bring the columns of A back to one processor using a gather operation.


COMPE472 Parallel Computing 2.103

Matrix-matrix Multiplication

• Again, in C, the problem could be decomposed in rows. This is shown schematically below.

• The code is left as your homework!!!


COMPE472 Parallel Computing 2.104

Matrix-matrix Multiplication

Schematic of a decomposition for matrix-matrix multiplication, A=B*C, in the C programming language. The matrices A and B are depicted as multicolored rows with each color denoting a different processor. The matrix C, in yellow, is broadcast to all processors.


COMPE472 Parallel Computing 2.105

The Use of Ghost Cells to solve a Poisson Equation

• The objective in data parallelism is for all processors to work on a single task simultaneously. The computational domain (e.g., a 2D or 3D grid) is divided among the processors such that the computational work load is balanced. Before each processor can compute on its local data, it must perform communications with other processors so that all of the necessary information is brought onto each processor in order for it to accomplish its local task.


COMPE472 Parallel Computing 2.106

The Use of Ghost Cells to solve a Poisson Equation

• As an instructive example of data parallelism, an arbitrary number of processors is used to solve the 2D Poisson Equation in electrostatics (i.e., Laplace Equation with a source). The equation to solve is

$$\nabla^2 \phi(x,y) = 4\pi\rho(x,y), \qquad \rho(x,y) = a\left(e^{-a\left[(x-L/4)^2 + y^2\right]} - e^{-a\left[(x-3L/4)^2 + y^2\right]}\right)$$

where phi(x,y) is our unknown potential function and rho(x,y) is the known source charge density. The domain of the problem is the box defined by the x-axis, the y-axis, and the lines x=L and y=L.

Poisson Equation on a 2D grid with periodic boundary conditions.


COMPE472 Parallel Computing 2.107

The Use of Ghost Cells to solve a Poisson Equation

• Serial Code:
• To solve this equation, an iterative scheme is employed using finite differences. The update equation for the field phi at the (n+1)th iteration is written in terms of the values at the nth iteration via

$$\phi^{n+1}_{i,j} = \frac{1}{4}\left(\phi^{n}_{i+1,j} + \phi^{n}_{i-1,j} + \phi^{n}_{i,j+1} + \phi^{n}_{i,j-1} - 4\pi\rho_{i,j}\,\Delta x^2\right)$$

iterating until the condition

$$\left|\frac{\phi^{\mathrm{new}}_{i,j} - \phi^{\mathrm{old}}_{i,j}}{\phi^{\mathrm{old}}_{i,j}}\right| < \epsilon \quad \text{for all } i,j$$

has been satisfied.


COMPE472 Parallel Computing 2.108

The Use of Ghost Cells to solve a Poisson Equation

• Parallel Code:
• In this example, the domain is chopped into rectangles, in what is often called block-block decomposition. This is shown in the figure below.

Parallel Poisson solver via domain decomposition on a 3x5 processor grid.


COMPE472 Parallel Computing 2.109

The Use of Ghost Cells to solve a Poisson Equation

• An example computational grid of N=64 by M=64 points is shown, which will be divided amongst NP=15 processors.

• The number of processors, NP, is purposely chosen such that it does not divide evenly into either N or M.

• Because the computational domain has been divided into rectangles, the 15 processors {P(0),P(1),...,P(14)} (which are laid out in row-major order on the processor grid) can be given a 2-digit designation that represents their processor grid row number and processor grid column number. MPI has commands that allow you to do this.
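For example, MPI's Cartesian topology routines MPI_Cart_create and MPI_Cart_coords can set up exactly this kind of (row, column) designation. The short sketch below is illustrative rather than part of the course code; the 3x5 dimensions, the periodicity flags, and the variable names are assumptions.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical sketch: map ranks onto a 3x5 process grid (run with at
   least 15 processes; dimensions and names are illustrative). */
int main(int argc, char *argv[])
{
    int dims[2] = {3, 5}, periods[2] = {1, 1}, coords[2], grid_rank;
    MPI_Comm grid_comm;

    MPI_Init(&argc, &argv);

    /* Periodic in both directions, matching the periodic boundary
       conditions of the Poisson problem. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

    if (grid_comm != MPI_COMM_NULL) {   /* surplus processes get COMM_NULL */
        MPI_Comm_rank(grid_comm, &grid_rank);
        MPI_Cart_coords(grid_comm, grid_rank, 2, coords);
        printf("P(%d) is P(%d,%d) on the process grid\n",
               grid_rank, coords[0], coords[1]);
    }

    MPI_Finalize();
    return 0;
}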


COMPE472 Parallel Computing 2.110

The Use of Ghost Cells to solve a Poisson Equation

Indexing in a parallel Poisson solver on a 3x5 processor grid.


COMPE472 Parallel Computing 2.111

The Use of Ghost Cells to solve a Poisson Equation

• Note that P(1,2) (i.e., P(7)) is responsible for indices i=23-43 and j=27-39 in the serial code double do-loop.

• A parallel speedup is obtained because each processor is working on essentially 1/15 of the total data.

• However, there is a problem. What does P(1,2) do when its 5-point stencil hits the boundaries of its domain (i.e., when i=23 or i=43, or j=27 or j=39)? The 5-point stencil now reaches into another processor's domain, which means that boundary data exists in memory on another separate processor.

• Because the update formula for phi at grid point (i,j) involves neighboring grid indices {i-1,i,i+1;j-1,j,j+1}, P(1,2) must communicate with its North, South, East, and West (N, S, E, W) neighbors to get one column of boundary data from its E, W neighbors and one row of boundary data from its N,S neighbors.

• This is illustrated in Figure below.


COMPE472 Parallel Computing 2.112

The Use of Ghost Cells to solve a Poisson Equation

Boundary data movement in the parallel Poisson solver following each iteration of the stencil.


COMPE472 Parallel Computing 2.113

The Use of Ghost Cells to solve a Poisson Equation

• In order to accommodate this transfer of boundary data between processors, each processor must dimension its local array phi to have two extra rows and two extra columns.

• This is illustrated in Figure where the shaded areas indicate the extra rows and columns needed for the boundary data from other processors.

Ghost cells: Local indices.
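As a minimal sketch, such a local declaration might look as follows; the local sizes are illustrative (roughly 64/3 by 64/5, to match the grid above), not taken from the course code.

#define N_LOCAL 21   /* owned rows on this processor (illustrative)    */
#define M_LOCAL 13   /* owned columns on this processor (illustrative) */

/* Two extra rows and two extra columns: indices 0 and N_LOCAL+1 (rows),
   0 and M_LOCAL+1 (columns) hold ghost copies of the neighbours' data. */
double phi[N_LOCAL + 2][M_LOCAL + 2];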


COMPE472 Parallel Computing 2.114

The Use of Ghost Cells to solve a Poisson Equation

• Note that even though this example speaks of global indices, the whole point of parallelism is that no single processor ever holds the entire global phi matrix in its memory.

• Each processor has only its local version of phi with its own sub-collection of i and j indices.

• Locally these indices are labeled beginning at either 0 or 1, as in Figure 13.14, rather than beginning at their corresponding global values, as in Figure 13.12.

• Keeping track of the on-processor local indices and the global (in-your-head) indices is the bookkeeping that you have to manage when using message passing parallelism.


COMPE472 Parallel Computing 2.115

The Use of Ghost Cells to solve a Poisson Equation

• Other parallel paradigms, such as High Performance Fortran (HPF) or OpenMP, are directive-based, i.e., compiler directives are inserted into the code to tell the supercomputer to distribute data across processors or to perform other operations. The difference between the two paradigms is akin to the difference between an automatic and stick-shift transmission car.

• In the directive based paradigm (automatic), the compiler (car) does the data layout and parallel communications (gear shifting) implicitly.

• In the message passing paradigm (stick-shift), the user (driver) performs the data layout and parallel communications explicitly. In this example, this communication can be performed in a regular prescribed pattern for all processors.

• For example, all processors could first communicate with their N-most partners, then S, then E, then W. What is happening when all processors communicate with their E neighbors is illustrated in Figure below.


COMPE472 Parallel Computing 2.116

The Use of Ghost Cells to solve a Poisson Equation

Data movement, shift right (East).


COMPE472 Parallel Computing 2.117

The Use of Ghost Cells to solve a Poisson Equation

• Note that in this shift right communication, P(i,j) places its right-most column of boundary data into the left-most ghost column of P(i,j+1). In addition, P(i,j) receives the right-most column of boundary data from P(i,j-1) into its own left-most ghost column.
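A minimal sketch of this shift-East exchange is given below, assuming a 2-D Cartesian communicator (as in the earlier topology sketch) and hand-packed column buffers, since columns of a C array are not contiguous in memory; the function and variable names are illustrative, not taken from the course code.

#include <mpi.h>

/* Hypothetical helper: shift-right (East) exchange of one boundary column.
   phi is the local array with ghost cells, dimensioned
   phi[m_local+2][n_local+2]; grid_comm is a 2-D Cartesian communicator. */
void shift_east(int m_local, int n_local,
                double phi[m_local+2][n_local+2], MPI_Comm grid_comm)
{
    int west, east, i;
    double send_col[m_local], recv_col[m_local];

    /* Ranks of the neighbours one step away in the column direction. */
    MPI_Cart_shift(grid_comm, 1, 1, &west, &east);

    /* Pack the right-most owned column. */
    for (i = 1; i <= m_local; i++)
        send_col[i-1] = phi[i][n_local];

    /* Send it to the East neighbour while receiving the corresponding
       column from the West neighbour (periodic grid: everyone has both). */
    MPI_Sendrecv(send_col, m_local, MPI_DOUBLE, east, 0,
                 recv_col, m_local, MPI_DOUBLE, west, 0,
                 grid_comm, MPI_STATUS_IGNORE);

    /* Unpack into the left-most ghost column. */
    for (i = 1; i <= m_local; i++)
        phi[i][0] = recv_col[i-1];
}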

• For each iteration, the pseudo-code for the parallel algorithm is thus:

t = 0
(0) Initialize phi
(1) Loop over stencil iterations
    (2) Perform parallel N shift communications of boundary data
    (3) Perform parallel S shift communications of boundary data
    (4) Perform parallel E shift communications of boundary data
    (5) Perform parallel W shift communications of boundary data
    (6) for (i = 1; i <= N_local; i++) {
            for (j = 1; j <= M_local; j++) {
                update phi[i][j]
            }
        }
    End Loop over stencil iterations
(7) Output data
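Step (6) might look like the following sketch, which applies the update formula reconstructed above to the owned points only; the array and variable names are assumptions, not the course code.

/* Hypothetical sketch of one Jacobi sweep over the owned points.
   phi_old already has its ghost rows and columns filled by the four
   shift communications; rho is the local source array, dx the spacing. */
void jacobi_sweep(int n_local, int m_local, double dx,
                  double phi_new[n_local+2][m_local+2],
                  double phi_old[n_local+2][m_local+2],
                  double rho[n_local+2][m_local+2])
{
    const double pi = 3.14159265358979323846;
    int i, j;

    for (i = 1; i <= n_local; i++)          /* interior (owned) points only */
        for (j = 1; j <= m_local; j++)
            phi_new[i][j] = 0.25 * (phi_old[i+1][j] + phi_old[i-1][j]
                                  + phi_old[i][j+1] + phi_old[i][j-1]
                                  - 4.0 * pi * rho[i][j] * dx * dx);
}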


COMPE472 Parallel Computing 2.118

The Use of Ghost Cells to solve a Poisson Equation

• Note that initializing the data should be performed in parallel. That is, each processor P(i,j) should only initialize the portion of phi for which it is responsible. (Recall NO processor contains the full global phi).

• In relation to this point, step (7), Output data, is not such a simple-minded task when performing parallel calculations. Should you reduce all the data from phi_local on each processor to one giant phi_global on P(0,0) and then print out the data? This is certainly one way to do it, but it seems to defeat the purpose of not having all the data reside on one processor.

• For example, what if phi_global is too large to fit in memory on a single processor? A second alternative is for each processor to write out its own phi_local to a file "phi.ij", where ij indicates the processor's 2-digit designation (e.g. P(1,2) writes out to file "phi.12").

• The data then has to be manipulated off processor by another code to put it into a form that may be rendered by a visualization package. This code itself may have to be a parallel code.
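A minimal sketch of the second alternative, with each processor writing its own "phi.ij" file, could look as follows; the function name and arguments are illustrative, not taken from the course code.

#include <stdio.h>

/* Hypothetical sketch: each processor writes its own portion of phi to a
   file named after its grid coordinates, e.g. P(1,2) writes "phi.12". */
void write_local_phi(int my_row, int my_col, int n_local, int m_local,
                     double phi[n_local+2][m_local+2])
{
    char fname[32];
    int i, j;

    sprintf(fname, "phi.%d%d", my_row, my_col);
    FILE *fp = fopen(fname, "w");
    if (fp == NULL) return;

    /* Write only the owned points, not the ghost layer. */
    for (i = 1; i <= n_local; i++) {
        for (j = 1; j <= m_local; j++)
            fprintf(fp, "%g ", phi[i][j]);
        fprintf(fp, "\n");
    }
    fclose(fp);
}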


COMPE472 Parallel Computing 2.119

The Use of Ghost Cells to solve a Poisson Equation

• As you can see, the issue of parallel I/O is not a trivial one and is in fact a topic of current research among parallel language developers and researchers.


COMPE472 Parallel Computing 2.120

Matrix-vector Multiplication using a Client-Server Approach

• In Section 13.2.1, a simple data decomposition for multiplying a matrix and a vector was described. This decomposition is also used here to demonstrate a "client-server" approach. The code for this example is in the C program, server_client_c.c.

• In server_client_c.c, all input/output is handled by the "server" (preset to be processor 0). This includes parsing the command-line arguments, reading the file containing the matrix A and vector x, and writing the result to standard output. The file containing the matrix A and the vector x has the form

m n
x1 x2 ...
a11 a12 ...
a21 a22 ...
...

where A is m (rows) by n (columns), and x is a column vector with n elements.


COMPE472 Parallel Computing 2.121

Matrix-vector Multiplication using a Client-Server Approach

• After the server reads in the size of A, it broadcasts this information to all of the clients.

• It then checks to make sure that there are fewer processors than columns. (If there are more processors than columns, then using a parallel program is not efficient and the program exits.)

• The server and all of the clients then allocate memory locations for A and x. The server also allocates memory for the result.

• Because there are more columns than client processors, the first "round" consists of the server sending one column to each of the client processors.

• All of the clients receive a column to process. Upon finishing, the clients send results back to the server. As the server receives a "result" buffer from a client, it sends the next unprocessed column to that client.


COMPE472 Parallel Computing 2.122

Matrix-vector Multiplication using a Client-Server Approach

• The source code is divided into two sections: the "server" code and the "client" code. The pseudo-code for each of these sections is:

• Server:
  – Broadcast (vector) x to all client processors.
  – Send a column of A to each client processor.
  – While there are more columns to process OR results are still expected, receive results and send the next unprocessed column.
  – Print the result.

• Client:
  – Receive (vector) x.
  – Receive a column of A with tag = column number.
  – Multiply the column by the corresponding element of (vector) x (given by the tag) to produce a (vector) result.
  – Send the result back to the server.

• Note that the numbers used in the pseudo-code (for both the server and client) have been added to the source code.
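The server's dispatch loop might be sketched as below. This follows the pseudo-code above rather than the actual server_client_c.c: the function name, the use of the message tag as the column number, and the zero-length "stop" message tagged with ncols are all illustrative conventions.

#include <mpi.h>

/* Hypothetical sketch of the server's dispatch loop in a column-wise
   client-server matrix-vector product y = A*x.  acol[j] points to column j
   of A (length nrows); clients have ranks 1..nclients. */
void serve(int nrows, int ncols, int nclients, double **acol, double *y)
{
    int j, sent = 0, received = 0;
    double partial[nrows];
    MPI_Status status;

    for (j = 0; j < nrows; j++) y[j] = 0.0;

    /* First round: hand one column to each client (tag = column number). */
    for (j = 1; j <= nclients && sent < ncols; j++, sent++)
        MPI_Send(acol[sent], nrows, MPI_DOUBLE, j, sent, MPI_COMM_WORLD);

    /* Receive partial results; while columns remain, send the next one
       back to whichever client just finished. */
    while (received < ncols) {
        MPI_Recv(partial, nrows, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        received++;
        for (j = 0; j < nrows; j++)       /* clients return x[tag]*column(tag), */
            y[j] += partial[j];           /* so the server just accumulates     */

        if (sent < ncols) {
            MPI_Send(acol[sent], nrows, MPI_DOUBLE, status.MPI_SOURCE,
                     sent, MPI_COMM_WORLD);
            sent++;
        } else {                          /* no work left: empty stop message */
            MPI_Send(partial, 0, MPI_DOUBLE, status.MPI_SOURCE, ncols,
                     MPI_COMM_WORLD);
        }
    }
}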


COMPE472 Parallel Computing 2.123

Matrix-vector Multiplication using a Client-Server Approach

• Source code similar to server_client_c.c, called server_client_r.c, is also provided as an example.

• The main difference between these codes is the way the data is stored.

• Because only contiguous memory locations can be sent using MPI_SEND, server_client_c.c stores the matrix A "column-wise" in memory, while server_client_r.c stores the matrix A "row-wise" in memory.
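As an aside, MPI derived datatypes offer a way around this contiguity restriction: a strided MPI_Type_vector can describe one column of a row-major C matrix so that it can be sent in place, without reordering the storage. A minimal sketch, with illustrative names, is:

#include <mpi.h>

/* Hypothetical alternative to storing A column-wise: describe one column
   of a row-major matrix with a strided datatype and send it directly. */
void send_column(int nrows, int ncols, double A[nrows][ncols],
                 int j, int dest)
{
    MPI_Datatype column_t;

    /* nrows blocks of one double each, spaced ncols doubles apart. */
    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* Send column j of A to 'dest', tagging it with the column number. */
    MPI_Send(&A[0][j], 1, column_t, dest, j, MPI_COMM_WORLD);

    MPI_Type_free(&column_t);
}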

• The pseudo-code for server_client_c.c and server_client_r.c is stated in the "block" documentation at the beginning of the source code.