
COMPE472 Parallel Computing 2.1

Message-Passing Computing, Chapter 2

– Programming a message-passing computer:

1. Using a special parallel programming language

2. Extending an existing language

3. Using a high-level language and providing a message-passing library

– Here, the third option is employed.


COMPE472 Parallel Computing 2.2

Message-Passing Programming using User-level Message-Passing Libraries

Two primary mechanisms needed:

1. A method of creating separate processes for execution on different computers

• Static process creation (MPI-1): the number of processes is fixed before execution starts.

• Dynamic process creation (MPI-2): processes can be created at run time.

2. A method of sending and receiving messages


COMPE472 Parallel Computing 2.3

Programming Models: 1. Multiple program, multiple data (MPMD) model

Figure: each source file is compiled to suit its processor, producing a separate executable for each of Processor 0 through Processor p - 1.


COMPE472 Parallel Computing 2.4

Programming models: 2. Single Program Multiple Data (SPMD) model

Figure: a single source file is compiled to suit each processor, producing one executable per processor (Processor 0 through Processor p - 1).

Basic MPI way

Different processes merged into one program. Control statements select different parts for each processor to execute. All executables started together - static process creation


COMPE472 Parallel Computing 2.5

Multiple Program Multiple Data (MPMD) Model

Figure: process 1 calls spawn() at some point in time, which starts execution of process 2.

Separate programs for each processor. One processor executes master process. Other processes started from within master process - dynamic process creation.


COMPE472 Parallel Computing 2.6

Basic “point-to-point” Send and Receive Routines

Process 1: send(&x, 2);        Process 2: recv(&y, 1);

Figure: the data in x on process 1 moves to y on process 2.

Generic syntax (actual formats later)

Passing a message between processes using send() and recv() library calls:


COMPE472 Parallel Computing 2.7

Synchronous Message Passing

Routines that return only when the message transfer has completed.

Synchronous send routine
• Waits until the complete message can be accepted by the receiving process before sending the message.

Synchronous receive routine
• Waits until the message it is expecting arrives.
• No need for buffer storage.

Synchronous routines intrinsically perform two actions: They transfer data and they synchronize processes.
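As a minimal sketch of synchronous transfer in MPI (using the synchronous-mode send MPI_Ssend(), which appears in the communication-mode table later in this chapter; the ranks, tag, and variable names are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x = 42, y = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Completes only after the matching receive has started: the call
           both transfers the data and synchronizes the two processes. */
        MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}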


COMPE472 Parallel Computing 2.8

Synchronous send() and recv() using 3-way protocol

Figure (a) When send() occurs before recv(): process 1 issues a request to send and suspends; once process 2 reaches recv() it returns an acknowledgment, the message is transferred, and both processes continue.

Figure (b) When recv() occurs before send(): process 2 suspends at recv(); when process 1 reaches send() it issues a request to send, process 2 acknowledges, the message is transferred, and both processes continue.


COMPE472 Parallel Computing 2.9

Asynchronous Message Passing

• Routines that do not wait for actions to complete before returning. Usually require local storage for messages.

• More than one version depending upon the actual semantics for returning.

• In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.


COMPE472 Parallel Computing 2.10

MPI Definitions of Blocking and Non-Blocking

• Blocking - return after their local actions complete, though the message transfer may not have been completed.

• Non-blocking - return immediately.

Non-blocking routines assume that the data storage used for the transfer is not modified by subsequent statements before the transfer completes; it is left to the programmer to ensure this. These terms may have different interpretations in other systems.


COMPE472 Parallel Computing 2.11

How message-passing routines return before message transfer completed

Figure: process 1 issues send(), the message is copied into a message buffer, and process 1 continues; process 2 later issues recv() and reads the message from the buffer.

Message buffer needed between source and destination to hold message:


COMPE472 Parallel Computing 2.12

Asynchronous (blocking) routines changing to synchronous routines

• Once local actions completed and message is safely on its way, sending process can continue with subsequent work.

• Buffers are only of finite length, so a point could be reached where a send routine is held up because all available buffer space has been exhausted.

• The send routine then waits until storage becomes available again - i.e. the routine behaves as a synchronous routine.


COMPE472 Parallel Computing 2.13

Message Tag

• Used to differentiate between different types of messages being sent.

• Message tag is carried within message.

• If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().


COMPE472 Parallel Computing 2.14

Message Tag Example

Process 1: send(&x, 2, 5);        Process 2: recv(&y, 1, 5);

Figure: the data in x moves to y.

Waits for a message from process 1 with a tag of 5

To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y:


COMPE472 Parallel Computing 2.15

“Group” message passing routines

Have routines that send message(s) to a group of processes or receive message(s) from a group of processes

Higher efficiency than separate point-to-point routines although not absolutely necessary.


COMPE472 Parallel Computing 2.16

Scatter

Figure: every process calls scatter(); the elements of the root's buf array are distributed so that each of Process 0 through Process p - 1 receives one element into its own data location.

Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process.


COMPE472 Parallel Computing 2.17

Gather

Figure: every process calls gather(); the data item from each of Process 0 through Process p - 1 is collected into the buf array on the root process.

Having one process collect individual values from set of processes.


COMPE472 Parallel Computing 2.18

Reduce

Figure: every process calls reduce(); the data items from Process 0 through Process p - 1 are combined (here with +) into buf on the root process.

Gather operation combined with specified arithmetic/logical operation.

Example: Values could be gathered and then added together by root:


COMPE472 Parallel Computing 2.19

AllGather & AllReduce

• AllGather and AllReduce: perform a gather/reduce and broadcast the result.
• First a group must be formed and a root process selected.


COMPE472 Parallel Computing 2.20

Barrier

• Barrier: a synchronization point.
• Example: a barrier can be built from an allReduce (see the sketch below).
• Typically more efficient implementations are used.
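A minimal sketch of the allReduce-based barrier mentioned above (the function name and dummy value are illustrative; real MPI_Barrier() implementations are normally more efficient, as the slide notes):

#include <mpi.h>

/* Improvised barrier: no process can obtain the global sum until every
   process in the communicator has contributed its local value. */
void allreduce_barrier(MPI_Comm comm)
{
    int dummy = 1, sum = 0;
    MPI_Allreduce(&dummy, &sum, 1, MPI_INT, MPI_SUM, comm);
}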


COMPE472 Parallel Computing 2.21

PVM (Parallel Virtual Machine)

Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform; developed by Oak Ridge National Laboratory. Available at no charge.

Programmer decomposes problem into separate programs (usually master and group of identical slave programs).

Programs compiled to execute on specific types of computers.

Set of computers used on a problem first must be defined prior to executing the programs (in a hostfile).


COMPE472 Parallel Computing 2.22

Message routing between computers done by PVM daemon processes installed by PVM on computers that form the virtual machine.

Figure: each workstation runs a PVM daemon and one or more application programs (executables); messages between workstations are sent through the network via the daemons.

MPI implementation we use is similar.

Can have more than one process running on each computer.


COMPE472 Parallel Computing 2.23

MPI (Message Passing Interface)

• Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability.

• Defines routines, not implementation.

• Several free implementations exist.


COMPE472 Parallel Computing 2.24

MPI Process Creation and Execution

• Purposely not defined - Will depend upon implementation.

• Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together.

• Originally SPMD model of computation.
• MPMD also possible with static creation - each program to be started together is specified.


COMPE472 Parallel Computing 2.25

Communicators

• Defines scope of a communication operation.

• Processes have ranks associated with communicator.

• Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes.

• Other communicators can be established for groups of processes.


COMPE472 Parallel Computing 2.26

Using SPMD Computational Model

main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.


COMPE472 Parallel Computing 2.27

Unsafe message passing - Example

(a) Intended behavior: in each process the user code and the library routine lib() keep to themselves - process 0's send(…,1,…) is received by process 1's recv(…,0,…), and the sends and receives issued inside the two lib() calls match each other.

(b) Possible behavior: because messages are matched only on source and destination, a message sent by the user code of process 0 may instead be consumed by the recv(…,0,…) inside process 1's lib() (and vice versa), so the wrong pairs of sends and receives match.


COMPE472 Parallel Computing 2.28

MPI Solution: “Communicators”

• Defines a communication domain - a set of processes that are allowed to communicate between themselves.

• Communication domains of libraries can be separated from that of a user program.

• Used in all point-to-point and collective MPI message-passing communications.


COMPE472 Parallel Computing 2.29

Default Communicator MPI_COMM_WORLD

• Exists as first communicator for all processes existing in the application.

• A set of MPI routines exists for forming communicators.

• Processes have a “rank” in a communicator.


COMPE472 Parallel Computing 2.30

MPI Point-to-Point Communication

• Uses send and receive routines with message tags (and communicator).

• Wild card message tag (MPI_ANY_TAG) is available


COMPE472 Parallel Computing 2.31

MPI Blocking Routines

• Return when “locally complete” - when location used to hold message can be used again or altered without affecting message being sent.

• Blocking send will send message and return - does not mean that message has been received, just that process free to move on without adversely affecting message.


COMPE472 Parallel Computing 2.32

Parameters of blocking send

MPI_Send(buf, count, datatype, dest, tag, comm)

buf      - address of send buffer
count    - number of items to send
datatype - datatype of each item
dest     - rank of destination process
tag      - message tag
comm     - communicator


COMPE472 Parallel Computing 2.33

! Example send-recv, MPI Fortran 90
program main
    use mpi
    implicit none
    integer status(MPI_STATUS_SIZE)
    integer :: ierr, rank, size, i, p, tag
    character*100 message

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
    if (rank .ne. 0) then
        message = "Hello from the processor!"
        call MPI_SEND(message, len(message), MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
    else
        tag = 0
        do i = 1, p-1
            call MPI_RECV(message, len(message), MPI_CHARACTER, i, tag, MPI_COMM_WORLD, status, ierr)
            print *, "Processor:", i, "message:", message
        end do
    end if
    call MPI_FINALIZE(ierr)
end


COMPE472 Parallel Computing 2.34

! Mid-point Rule to Compute PI using MPI_BCAST and MPI_REDUCE
      program main
      use mpi
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      double precision starttime, endtime
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)        ! function to integrate
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
 10   if (myid .eq. 0) then
         print *, 'Enter the number of intervals: (0 quits) '
         read(*,*) n
      endif
      starttime = MPI_WTIME()
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
!     check for quit signal
      if (n .le. 0) goto 30
!     calculate the interval size
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
!     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)
      endtime = MPI_WTIME()
      if (myid .eq. 0) then
         print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
         print *, 'time is ', endtime-starttime, ' seconds'
      endif
      go to 10
 30   call MPI_FINALIZE(ierr)
      stop
      end


COMPE472 Parallel Computing 2.35

Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf      - address of receive buffer
count    - maximum number of items to receive
datatype - datatype of each item
src      - rank of source process
tag      - message tag
comm     - communicator
status   - status after operation


COMPE472 Parallel Computing 2.36

Example

To send an integer x from process 0 to process 1,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}


COMPE472 Parallel Computing 2.37

C - MPI Datatypes


COMPE472 Parallel Computing 2.38

Fortran – MPI Basic Datatypes


COMPE472 Parallel Computing 2.39

The status array

• Status is a data structure allocated in the user's program.

• C language:
    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;
    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count(&status, datatype, &recvd_count);

• Fortran language:
    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)
    call MPI_RECV(.., MPI_ANY_SOURCE, MPI_ANY_TAG, .., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)


COMPE472 Parallel Computing 2.40

MPI Nonblocking Routines

• Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered.

• Nonblocking receive - MPI_Irecv() - will return even if no message to accept.


COMPE472 Parallel Computing 2.41

Nonblocking Routine Formats

MPI_Isend(buf,count,datatype,dest,tag,comm,request)

MPI_Irecv(buf,count,datatype,source,tag,comm, request)

Completion detected by MPI_Wait() and MPI_Test().

MPI_Wait() waits until operation completed and returns then.

MPI_Test() returns with flag set indicating whether operation completed at that time.

Need to know whether particular operation completed.

Determined by accessing request parameter.
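A small sketch of both completion mechanisms (do_other_work() and msgtag are hypothetical placeholders, not part of MPI):

#include <mpi.h>

/* Receive one int from rank 0 without blocking, overlapping other work. */
void receive_overlapped(int *x, int msgtag, void (*do_other_work)(void))
{
    MPI_Request req;
    MPI_Status status;
    int flag = 0;

    MPI_Irecv(x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &req);
    do {
        do_other_work();                  /* useful work while the transfer proceeds */
        MPI_Test(&req, &flag, &status);   /* poll: flag is set once the receive has completed */
    } while (!flag);
    /* Alternatively, MPI_Wait(&req, &status); would simply block here
       until the operation completed. */
}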


COMPE472 Parallel Computing 2.42

MPI_ISEND & MPI_IRECV

• Fortran:
    MPI_ISEND(buf, count, type, dest, tag, comm, req, ierr)
    MPI_IRECV(buf, count, type, sour, tag, comm, req, ierr)

    buf   - array of type type
    count (INTEGER) - number of elements of buf to be sent
    type  (INTEGER) - MPI type of buf
    dest  (INTEGER) - rank of the destination process
    sour  (INTEGER) - rank of the source process
    tag   (INTEGER) - number identifying the message
    comm  (INTEGER) - communicator of the sender and receiver
    req   (INTEGER) - output, identifier of the communication handle
    ierr  (INTEGER) - output, error code (if ierr=0 no error occurs)


COMPE472 Parallel Computing 2.43

Example

To send an integer x from process 0 to process 1 and allow process 0 to continue,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}


COMPE472 Parallel Computing 2.44

Send Communication Modes

• Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached.

• Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach() (see the sketch after this list).

• Synchronous Mode - Send and receive can start before each other but can only complete together.

• Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
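For the buffered mode above, a minimal sketch of attaching user buffer space with MPI_Buffer_attach() before calling MPI_Bsend(); the function name and the data, dest, and tag parameters are illustrative assumptions:

#include <mpi.h>
#include <stdlib.h>

/* Buffered-mode send of 'count' ints to 'dest'. */
void buffered_send(int *data, int count, int dest, int tag)
{
    int bufsize = count * (int)sizeof(int) + MPI_BSEND_OVERHEAD;
    char *buffer = malloc(bufsize);

    MPI_Buffer_attach(buffer, bufsize);       /* supply the buffer space */
    MPI_Bsend(data, count, MPI_INT, dest, tag, MPI_COMM_WORLD);  /* returns once copied */
    MPI_Buffer_detach(&buffer, &bufsize);     /* waits until buffered sends have drained */
    free(buffer);
}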


COMPE472 Parallel Computing 2.45

• Each of the four modes can be applied to both blocking and nonblocking send routines.

• Only the standard mode is available for the blocking and nonblocking receive routines.

• Any type of send routine can be used with any type of receive routine.


COMPE472 Parallel Computing 2.46

Communication Modes and MPI Subroutines

Standard send    - completes when the message has been sent (receive state unknown); blocking: MPI_SEND, non-blocking: MPI_ISEND
Receive          - completes when a message has arrived; blocking: MPI_RECV, non-blocking: MPI_IRECV
Synchronous send - only completes when the receive has completed; blocking: MPI_SSEND, non-blocking: MPI_ISSEND
Buffered send    - always completes, irrespective of receiver; blocking: MPI_BSEND, non-blocking: MPI_IBSEND
Ready send       - always completes, irrespective of whether the receive has completed; blocking: MPI_RSEND, non-blocking: MPI_IRSEND


COMPE472 Parallel Computing 2.47

Collective Communication

Involves a set of processes, defined by an intra-communicator. Message tags are not present. Principal collective operations:

• MPI_BCAST() - Broadcast from root to all other processes
• MPI_GATHER() - Gather values for group of processes
• MPI_SCATTER() - Scatters buffer in parts to group of processes
• MPI_ALLTOALL() - Sends data from all processes to all processes
• MPI_REDUCE() - Combine values on all processes to single value
• MPI_REDUCE_SCATTER() - Combine values and scatter results
• MPI_SCAN() - Compute prefix reductions of data on processes


COMPE472 Parallel Computing 2.48

Broadcast

Sending the same message to all processes concerned with the problem.
Multicast - sending the same message to a defined group of processes.

Figure: every process calls bcast(); the contents of buf on the root (Process 0) are copied into the data locations of Process 1 through Process p - 1.


COMPE472 Parallel Computing 2.49

Broadcast Illustrated


COMPE472 Parallel Computing 2.50

Broadcast (MPI_BCAST)

One-to-all communication: the same data is sent from the root process to all others in the communicator.

• Fortran:
    INTEGER count, type, root, comm, ierr
    CALL MPI_BCAST(buf, count, type, root, comm, ierr)
    buf - array of type type

• C:
    int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

• All processes must specify the same root and comm.


COMPE472 Parallel Computing 2.51

Broadcast Example

PROGRAM broad_cast
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
  END IF
  CALL MPI_BCAST(a, 2, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': a(1)=', a(1), 'a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
  END


COMPE472 Parallel Computing 2.52

Reduction (MPI_REDUCE)

The reduction operation allows you to:
• Collect data from each process
• Reduce the data to a single value
• Store the result on the root process, or
• Store the result on all processes
• The reduction function works with arrays
• Operations: sum, product, min, max, and, …
• Internally it is usually implemented with a binary tree
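A minimal C sketch of a global sum with MPI_Reduce() (result on the root only) and MPI_Allreduce() (result on every process); local_val is an assumed per-process input:

#include <mpi.h>

double global_sum(double local_val)
{
    double sum_on_root = 0.0, sum_everywhere = 0.0;

    /* Result stored only on the root process (rank 0): */
    MPI_Reduce(&local_val, &sum_on_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Result stored on all processes: */
    MPI_Allreduce(&local_val, &sum_everywhere, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    return sum_everywhere;
}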


COMPE472 Parallel Computing 2.53

Reduction Operation (SUM)


COMPE472 Parallel Computing 2.54

Reduction in FORTRAN

• MPI_REDUCE(snd_buf, rcv_buf, count, type, op, root, comm, ierr)

    snd_buf - input array of type type containing local values
    rcv_buf - output array of type type containing global results
    count (INTEGER) - number of elements of snd_buf and rcv_buf
    type  (INTEGER) - MPI type of snd_buf and rcv_buf
    op    (INTEGER) - parallel operation to be performed
    root  (INTEGER) - MPI id of the process storing the result
    comm  (INTEGER) - communicator of processes involved in the operation
    ierr  (INTEGER) - output, error code (if ierr=0 no error occurs)

• MPI_ALLREDUCE(snd_buf, rcv_buf, count, type, op, comm, ierr)
    The root argument is missing; the result is stored on all processes.


COMPE472 Parallel Computing 2.55

Predefined Reduction Operations


COMPE472 Parallel Computing 2.56

MPI_Scatter

• One-to-all communication: different data is sent from the root process to each of the others in the communicator.

• Fortran:
    CALL MPI_SCATTER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

    - Argument definitions are like those of other MPI subroutines
    - sndcount is the number of elements sent to each process, not the size of sndbuf, which should be sndcount times the number of processes in the communicator
    - The sender arguments are significant only at root
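The same call sketched in C, assuming one integer is sent to each of the p processes; the names sndbuf and myval are illustrative:

#include <mpi.h>
#include <stdlib.h>

int scatter_one_int(int root)
{
    int p, rank, myval, i;
    int *sndbuf = NULL;

    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == root) {                     /* send buffer significant only at root */
        sndbuf = malloc(p * sizeof(int));
        for (i = 0; i < p; i++) sndbuf[i] = i * i;
    }
    /* sndcount = 1: one element of sndbuf goes to each process */
    MPI_Scatter(sndbuf, 1, MPI_INT, &myval, 1, MPI_INT, root, MPI_COMM_WORLD);
    free(sndbuf);                           /* free(NULL) is harmless on non-root ranks */
    return myval;
}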


COMPE472 Parallel Computing 2.57

MPI_Gather

All-to-one communication: different data is collected by the root process from all other processes in the communicator. It is the opposite of Scatter.

• Fortran:
    CALL MPI_GATHER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

    - Argument definitions are like those of other MPI subroutines
    - rcvcount is the number of elements collected from each process, not the size of rcvbuf, which should be rcvcount times the number of processes in the communicator
    - The receiver arguments are significant only at root


COMPE472 Parallel Computing 2.58

Scatter/Gather


COMPE472 Parallel Computing 2.59

GATHER Example

program gather
    use mpi
    integer ierr, myid, p, nsd, i, root
    integer status(MPI_STATUS_SIZE)
    real a(21), b(3)
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    root = 0
    b(1) = myid
    b(2) = myid
    b(3) = myid
    nsd = 3
    np = nsd * p
    call MPI_GATHER(b, nsd, MPI_REAL, a, nsd, MPI_REAL, root, MPI_COMM_WORLD, ierr)
    if (myid .eq. root) then
        do i = 1, np
            write(6,*) myid, ': a(', i, ')=', a(i)
        end do
    end if
    call MPI_FINALIZE(ierr)
end


COMPE472 Parallel Computing 2.60

Execution

bash-3.00$ mpirun -np 3 gather_test
 0 : a( 1 )= 0.0E+0
 0 : a( 2 )= 0.0E+0
 0 : a( 3 )= 0.0E+0
 0 : a( 4 )= 1.0
 0 : a( 5 )= 1.0
 0 : a( 6 )= 1.0
 0 : a( 7 )= 2.0
 0 : a( 8 )= 2.0
 0 : a( 9 )= 2.0


COMPE472 Parallel Computing 2.61

Scatter/Gather Examples


COMPE472 Parallel Computing 2.62

MPI_Barrier()

• Stops processes until all processes within a communicator reach the barrier.

• Almost never required in a parallel program.

• Occasionally useful in measuring performance and load balancing.

• Fortran:
    CALL MPI_BARRIER(comm, ierr)

• C:
    int MPI_Barrier(MPI_Comm comm)


COMPE472 Parallel Computing 2.63

Barrier


COMPE472 Parallel Computing 2.64

Barrier routine

• A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call.


COMPE472 Parallel Computing 2.65

Evaluating Parallel Programs


COMPE472 Parallel Computing 2.66

Sequential execution time, ts: Estimate by counting computational steps of best sequential algorithm.

Parallel execution time, tp: In addition to number of computational steps, tcomp, need to estimate communication overhead, tcomm:

tp = tcomp + tcomm


COMPE472 Parallel Computing 2.67

Computational Time

Count the number of computational steps. When more than one process is executed simultaneously, count the computational steps of the most complex process. Generally, a function of n and p, i.e.

tcomp = f (n, p)

Often break down computation time into parts. Then

tcomp = tcomp1 + tcomp2 + tcomp3 + …

Analysis is usually done assuming that all processors are identical and operate at the same speed.


COMPE472 Parallel Computing 2.68

Communication Time

Many factors, including network structure and network contention. As a first approximation, use

tcomm = tstartup + ntdata

tstartup is startup time, essentially time to send a message with no data. Assumed to be constant.

tdata is transmission time to send one data word, also assumed constant, and there are n data words.
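Purely as an illustration (the numbers below are assumptions, not course data): with tstartup = 1000 computational steps, tdata = 50 steps, and n = 10000 data words,

t_{comm} = t_{startup} + n\,t_{data} = 1000 + 10000 \times 50 = 501{,}000 \text{ steps}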


COMPE472 Parallel Computing 2.69

Idealized Communication Time

Figure: communication time grows linearly with the number of data items (n); the intercept is the startup time.


COMPE472 Parallel Computing 2.70

Final communication time, tcomm

Summation of communication times of all sequential messages from a process, i.e.

tcomm = tcomm1 + tcomm2 + tcomm3 + …

Communication patterns of all processes assumed same and take place together so that only one process need be considered.

Both tstartup and tdata are measured in units of one computational step, so that tcomp and tcomm can be added together to obtain the parallel execution time, tp.


COMPE472 Parallel Computing 2.71

Benchmark Factors

With ts, tcomp, and tcomm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation:
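Written out (standard definitions, using the symbols above):

\text{Speedup factor: } S(p) = \frac{t_s}{t_p}, \qquad \text{Computation/communication ratio: } \frac{t_{comp}}{t_{comm}}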

Both functions of number of processors, p, and number of data elements, n.


COMPE472 Parallel Computing 2.72

Factors give indication of scalability of parallel solution with increasing number of processors and problem size.

Computation/communication ratio will highlight effect of communication with increasing problem size and system size.


COMPE472 Parallel Computing 2.73

Debugging/Evaluating Parallel Programs Empirically


COMPE472 Parallel Computing 2.74

Visualization Tools

Programs can be watched as they are executed in a space-time diagram (or process-time diagram):

Figure: one horizontal line per process (Process 1, 2, 3) against time; segments show computing, waiting, and time spent in message-passing system routines, with arrows marking messages between processes.


COMPE472 Parallel Computing 2.75

Implementations of visualization tools are available for MPI.

An example is the Upshot program visualization system.


COMPE472 Parallel Computing 2.76

Evaluating Programs Empirically: Measuring Execution Time

To measure the execution time between point L1 and point L2 in the code, we might have a construction such as

L1: time(&t1);   /* start timer */
    .
    .
L2: time(&t2);   /* stop timer */
    .
    elapsed_time = difftime(t2, t1);   /* elapsed_time = t2 - t1 */
    printf("Elapsed time = %5.2f seconds", elapsed_time);

MPI provides the routine MPI_Wtime() for returning time (in seconds).
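The same measurement sketched with MPI_Wtime(), which returns wall-clock time in seconds as a double (the code between the two calls is whatever is being timed):

double t1, t2;

t1 = MPI_Wtime();                 /* start timer */
/* ... code to be timed ... */
t2 = MPI_Wtime();                 /* stop timer  */
printf("Elapsed time = %5.2f seconds\n", t2 - t1);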


COMPE472 Parallel Computing 2.77

Message transmission time

! Ping-pong timing
program main
    use mpi
    double precision starttime, endtime
    integer n, myid, numprocs, i, ierr, status(MPI_STATUS_SIZE), x
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
    x = 2
    if (myid .eq. 0) then
        starttime = MPI_WTIME()
        do 1 i = 1, 1000000
            call MPI_SEND(x, 1, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, ierr)
            call MPI_RECV(x, 1, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, status, ierr)
1       continue
        endtime = MPI_WTIME()
        print *, 'Elapsed time of sending a word:', 0.5*(endtime-starttime)/1000000, ' seconds'
    else
        do 2 i = 1, 1000000
            call MPI_RECV(x, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
            call MPI_SEND(x, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
2       continue
    end if
    call MPI_FINALIZE(ierr)
    stop
end


COMPE472 Parallel Computing 2.78

Parallel Programming Home Page

http://www.cs.uncc.edu/par_prog

Gives step-by-step instructions for compiling and executing programs, and other information.


COMPE472 Parallel Computing 2.79

Compiling/Executing MPI Programs: Preliminaries

• Set up paths
• Create required directory structure
• Create a file (hostfile) listing machines to be used (required)

Details described on home page.


COMPE472 Parallel Computing 2.80

Hostfile

Before starting MPI for the first time, need to create a hostfile

Sample hostfile

ws404

#is-sm1 //Currently not executing, commented

pvm1 //Active processors, UNCC sun cluster called pvm1 - pvm8

pvm2

pvm3

pvm4

pvm5

pvm6

pvm7

pvm8


COMPE472 Parallel Computing 2.81

Compiling/executing (SPMD) MPI program

For LAM MPI version 6.5.2. At a command line:

To start MPI:

First time: lamboot -v hostfile

Subsequently: lamboot

To compile MPI programs:

mpicc -o file file.c

or mpiCC -o file file.cpp

To execute MPI program:

mpirun -v -np no_processors file

To remove processes for reboot

lamclean -v

Terminate LAM

lamhalt

If fails

wipe -v lamhost


COMPE472 Parallel Computing 2.82

Compiling/Executing Multiple MPI Programs

Create a file specifying programs:

Example: 1 master and 2 slaves; “appfile” contains

n0 master

n0-1 slave

To execute:

mpirun -v appfile

Sample output

3292 master running on n0 (o)

3296 slave running on n0 (o)

412 slave running on n1


COMPE472 Parallel Computing 2.83

Circuit Satisfiability MPI-C

/*
 * Circuit Satisfiability, Version 2
 * This enhanced version of the program prints the
 * total number of solutions.
 */
#include "mpi.h"
#include <stdio.h>

int main (int argc, char *argv[]) {
    int count;          /* Solutions found by this proc */
    int global_count;   /* Total number of solutions */
    int i;
    int id;             /* Process rank */
    int p;              /* Number of processes */
    int check_circuit (int, int);

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Comm_size (MPI_COMM_WORLD, &p);

    count = 0;
    for (i = id; i < 65536; i += p)
        count += check_circuit (id, i);

    MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    printf ("Process %d is done\n", id);
    fflush (stdout);
    MPI_Finalize();
    if (!id) printf ("There are %d different solutions\n", global_count);
    return 0;
}


COMPE472 Parallel Computing 2.84

MPI-C …

/* Return 1 if 'i'th bit of 'n' is 1; 0 otherwise */
#define EXTRACT_BIT(n,i) ((n&(1<<i))?1:0)

int check_circuit (int id, int z) {
    int v[16];   /* Each element is a bit of z */
    int i;

    for (i = 0; i < 16; i++) v[i] = EXTRACT_BIT(z,i);
    if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
        && (!v[3] || !v[4]) && (v[4] || !v[5]) && (v[5] || !v[6])
        && (v[5] || v[6]) && (v[6] || !v[15]) && (v[7] || !v[8])
        && (!v[7] || !v[13]) && (v[8] || v[9]) && (v[8] || !v[9])
        && (!v[9] || !v[10]) && (v[9] || v[11]) && (v[10] || v[11])
        && (v[12] || v[13]) && (v[13] || !v[14]) && (v[14] || v[15])) {
        printf ("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
                v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7],v[8],v[9],
                v[10],v[11],v[12],v[13],v[14],v[15]);
        fflush (stdout);
        return 1;
    } else return 0;
}


COMPE472 Parallel Computing 2.85

Floyd’s Algorithm

• to find the least-expensive paths between all the vertices in a graph.

• operates on a matrix representing the costs of edges between vertices.

Fig: Determine whether a path going from Vi to Vj via Vk is shorter than the best-known path from Vi to Vj


COMPE472 Parallel Computing 2.86

Parallel Floyd: Version 1

• One-dimensional decomposition of the I matrix.
• At most N processors.
• In (a), the data allocated to a single task are shaded: a contiguous block of rows.
• In (b), the data required by this task in the kth step of the algorithm are shaded: its own block and the kth row.

for k = 0 to N-1
    for i = local_i_start to local_i_end
        for j = 0 to N-1
            I[i,j](k+1) = min( I[i,j](k), I[i,k](k) + I[k,j](k) )
        endfor
    endfor
endfor


COMPE472 Parallel Computing 2.87

Parallel Floyd: Version 2

• Two-dimensional decomposition of the I matrix.
• Up to N^2 processors.
• In (a), the data allocated to a single task are shaded: a contiguous submatrix.
• In (b), the data required by this task in the kth step of the algorithm are shaded: its own block, and part of the kth row and column.

for k = 0 to N-1
    for i = local_i_start to local_i_end
        for j = local_j_start to local_j_end
            I[i,j](k+1) = min( I[i,j](k), I[i,k](k) + I[k,j](k) )
        endfor
    endfor
endfor


COMPE472 Parallel Computing 2.88

Problem all-pairs shortest paths

• Given a weighted graph G(V,E,w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi, vj that belong to V.


COMPE472 Parallel Computing 2.89

Sequential Floyd

• Input: the adjacency matrix D[i,j]

for k = 1 to |V|
    for i = 1 to |V|
        for j = 1 to |V|
            D[i,j] = min( D[i,j], D[i,k] + D[k,j] )

• Output: D[i,j] contains the shortest path from i to j
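A direct C rendering of the sequential algorithm above (a sketch; it assumes the matrix is stored row-major in a flat array and that a large sentinel value, rather than true infinity, marks missing edges so the addition cannot overflow):

/* In-place Floyd's algorithm: d[i*n + j] holds the current best distance
   from vertex i to vertex j; use e.g. 1 << 29 as "infinity". */
void floyd(int *d, int n)
{
    int i, j, k;
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                if (d[i*n + k] + d[k*n + j] < d[i*n + j])
                    d[i*n + j] = d[i*n + k] + d[k*n + j];
}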


COMPE472 Parallel Computing 2.90

What is the Adjacency Matrix

• Create an adjacency matrix:

    0  1  ∞  ∞
    1  0  1  4
    ∞  1  0  2
    ∞  4  2  0

Figure: the corresponding weighted graph on vertices 1-4, with edges 1-2 (weight 1), 2-3 (weight 1), 2-4 (weight 4), and 3-4 (weight 2).


COMPE472 Parallel Computing 2.91

Understanding Data Dependency

• D[i,j] = min( D[i,j], D[i,k] + D[k,j] )
• k=1, i=1:  D[1, 1..N] = min( D[1, 1..N], D[1,1] + D[1, 1..N] )
• k=1, i=2:  D[2, 1..N] = min( D[2, 1..N], D[2,1] + D[1, 1..N] )
• . .
• k=t, i=f:  D[f, 1..N] = min( D[f, 1..N], D[f,t] + D[t, 1..N] )
• . .
• k=N, i=N:  D[N, 1..N] = min( D[N, 1..N], D[N,N] + D[N, 1..N] )

• Where 0 < t, f <= N.
• Realize that after each (i,j) loop the matrix D has been fully updated: updating row i at step k needs only row i itself and row k.


COMPE472 Parallel Computing 2.92

Key elements in solving the problem easily

• Send the Adjacency Matrix to every process ‘Broadcasting’

• Compute partitions and scatter them for each process ‘Scattering’

• Compute your ith row part and send the results to everybody ‘Broadcasting’

• After K iterations, get the results: ‘Gathering’


COMPE472 Parallel Computing 2.93

/* Send data size to all other processes. */
MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* Compute my area of working, the partition. */
partition_size = size / processors;
if (size % processors) partition_size++;

my_partition = (int*) malloc(size * partition_size * sizeof(int));
kth_row      = (int*) malloc(size * 1 * sizeof(int));

MPI_Scatter(graph, partition_size * size, MPI_INT,
            my_partition, partition_size * size, MPI_INT, 0, MPI_COMM_WORLD);

/* Calculation. */
for (k = 0; k < size; k++) {

    /* Broadcast the kth row. */
    if (my_rank == (k / partition_size))
        for (i = 0; i < size; i++)
            kth_row[i] = my_partition[(k % partition_size)*size + i];
    MPI_Bcast(kth_row, size, MPI_INT, (k / partition_size), MPI_COMM_WORLD);

    /* Update my rows. */
    for (i = 0; (i < partition_size) && (i < size); i++) {
        if (my_partition[i*size + k] < 1) continue;
        for (j = 0; j < size; j++) {
            if (kth_row[j] < 1) continue;
            if (my_partition[i*size + j] < 0)
                my_partition[i*size + j] = my_partition[i*size + k] + kth_row[j];
            else
                my_partition[i*size + j] = min(my_partition[i*size + j],
                                               my_partition[i*size + k] + kth_row[j]);
        }
    }
}

/* Collect the data. */
printf(" Collecting results ... ... \n");
MPI_Gather(my_partition, partition_size * size, MPI_INT,
           graph, partition_size * size, MPI_INT, 0, MPI_COMM_WORLD);


COMPE472 Parallel Computing 2.94

Matrix-vector Multiplication

• The figure below demonstrates schematically how a matrix-vector multiplication, A=B*C, can be decomposed into four independent computations, each involving a scalar multiplying a column vector.

• This approach is different from that which is usually taught in a linear algebra course because this decomposition lends itself better to parallelization.

• These computations are independent and do not require communication, something that usually reduces performance of parallel code.


COMPE472 Parallel Computing 2.95

Matrix-vector Multiplication (Columnwise)

Schematic of parallel decomposition for vector-matrix multiplication, A=B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions, columns, and elements assigned to each processor, respectively.


COMPE472 Parallel Computing 2.96

Matrix-vector Multiplication (Columnwise)

A = B*C

Figure: each element of the result is a_i = b_{i,0} c_0 + b_{i,1} c_1 + b_{i,2} c_2 + b_{i,3} c_3 for i = 0..3. Processor Pj (P0 through P3) holds column j of B and element c_j of C and forms the partial products b_{i,j} c_j for every i; a Reduction (SUM) across P0-P3 then adds the four partial vectors to give A on the root.


COMPE472 Parallel Computing 2.97

Matrix-vector Multiplication

• The columns of matrix B and the elements of column vector C must be distributed to the various processors using MPI commands called scatter operations.

• Note that MPI provides two types of scatter operations depending on whether the problem can be divided evenly among the number of processors or not.

• Each processor now has a column of B, called Bpart, and an element of C, called Cpart. Each processor can now perform an independent vector-scalar multiplication.

• Once this has been accomplished, every processor will have a part of the final column vector A, called Apart.

• The column vectors on each processor can be added together with an MPI reduction command that computes the final sum on the root processor.


COMPE472 Parallel Computing 2.98

Matrix-Vector Mult.

#include <stdio.h>
#include <mpi.h>
#define NCOLS 4

int main(int argc, char **argv) {
    int i, j, k, l;
    int ierr, rank, size, root;
    float A[NCOLS], T[NCOLS][NCOLS];
    float Apart[NCOLS];
    float Bpart[NCOLS], C[NCOLS];
    float A_exact[NCOLS];
    float B[NCOLS][NCOLS];
    float Cpart[1];

    root = 0;
    /* Initiate MPI. */
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Initialize B and C. */
    if (rank == root) {
        B[0][0] = 1; B[0][1] = 2;  B[0][2] = 3; B[0][3] = 4;
        B[1][0] = 4; B[1][1] = -5; B[1][2] = 6; B[1][3] = 4;
        B[2][0] = 7; B[2][1] = 8;  B[2][2] = 9; B[2][3] = 2;
        B[3][0] = 3; B[3][1] = -1; B[3][2] = 5; B[3][3] = 0;

        /* Transpose B so its columns can be scattered as rows of T. */
        for (i = 0; i < NCOLS; i++)
            for (j = 0; j < NCOLS; j++)
                T[i][j] = B[j][i];

        C[0] = 1; C[1] = -4; C[2] = 7; C[3] = 3;
    }


COMPE472 Parallel Computing 2.99

Matrix-Vector Mult.

    /* Put up a barrier until I/O is complete. */
    ierr = MPI_Barrier(MPI_COMM_WORLD);

    /* Scatter the columns of B (the rows of T). */
    ierr = MPI_Scatter(T, NCOLS, MPI_FLOAT, Bpart, NCOLS, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Scatter the elements of C. */
    ierr = MPI_Scatter(C, 1, MPI_FLOAT, Cpart, 1, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Do the vector-scalar multiplication. */
    for (j = 0; j < NCOLS; j++)
        Apart[j] = Cpart[0] * Bpart[j];

    /* Reduce to matrix A. */
    ierr = MPI_Reduce(Apart, A, NCOLS, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("\nThis is the result of the parallel computation:\n\n");
        printf("A[0]=%g\n", A[0]);
        printf("A[1]=%g\n", A[1]);
        printf("A[2]=%g\n", A[2]);
        printf("A[3]=%g\n", A[3]);

        for (k = 0; k < NCOLS; k++) {
            A_exact[k] = 0.0;
            for (l = 0; l < NCOLS; l++) {
                A_exact[k] += B[k][l] * C[l];
            }
        }
        printf("\nThis is the result of the serial computation:\n\n");
        printf("A_exact[0]=%g\n", A_exact[0]);
        printf("A_exact[1]=%g\n", A_exact[1]);
        printf("A_exact[2]=%g\n", A_exact[2]);
        printf("A_exact[3]=%g\n", A_exact[3]);
    }

    MPI_Finalize();
}


COMPE472 Parallel Computing 2.100

Matrix-matrix Multiplication

• A similar, albeit naive, type of decomposition can be achieved for matrix-matrix multiplication, A=B*C.

• The figure below shows schematically how matrix-matrix multiplication of two 4x4 matrices can be decomposed into four independent vector-matrix multiplications, which can be performed on four different processors.


COMPE472 Parallel Computing 2.101

Matrix-matrix Multiplication

Schematic of a decomposition for matrix-matrix multiplication, A=B*C, in Fortran 90. The matrices A and C are depicted as multicolored columns with each color denoting a different processor. The matrix B, in yellow, is broadcast to all processors.


COMPE472 Parallel Computing 2.102

Matrix-matrix Multiplication

• The basic steps are

1. Distribute the columns of C among the processors using a scatter operation.

2. Broadcast the matrix B to every processor.

3. Form the product of B with the columns of C on each processor. These are the corresponding columns of A.

4. Bring the columns of A back to one processor using a gather operation.


COMPE472 Parallel Computing 2.103

Matrix-matrix Multiplication

• Again, in C, the problem could be decomposed in rows. This is shown schematically below.

• The code is left as your homework!!!


COMPE472 Parallel Computing 2.104

Matrix-matrix Multiplication

Schematic of a decomposition for matrix-matrix multiplication, A=B*C, in the C programming language. The matrices A and B are depicted as multicolored rows with each color denoting a different processor. The matrix C, in yellow, is broadcast to all processors.


COMPE472 Parallel Computing 2.105

The Use of Ghost Cells to solve a Poisson Equation

• The objective in data parallelism is for all processors to work on a single task simultaneously. The computational domain (e.g., a 2D or 3D grid) is divided among the processors such that the computational work load is balanced. Before each processor can compute on its local data, it must perform communications with other processors so that all of the necessary information is brought onto each processor in order for it to accomplish its local task.


COMPE472 Parallel Computing 2.106

The Use of Ghost Cells to solve a Poisson Equation

• As an instructive example of data parallelism, an arbitrary number of processors is used to solve the 2D Poisson Equation in electrostatics (i.e., Laplace Equation with a source). The equation to solve is

$$\nabla^2 \phi(x,y) = 4\pi\rho(x,y), \qquad \rho(x,y) = a\left(e^{-a\left[(x-L/4)^2 + y^2\right]} - e^{-a\left[(x-3L/4)^2 + y^2\right]}\right)$$

where phi(x,y) is our unknown potential function and rho(x,y) is the known source charge density. The domain of the problem is the box defined by the x-axis, the y-axis, and the lines x=L and y=L.

Poisson Equation on a 2D grid with periodic boundary conditions.


COMPE472 Parallel Computing 2.107

The Use of Ghost Cells to solve a Poisson Equation

• Serial Code:
• To solve this equation, an iterative scheme is employed using finite differences. The update equation for the field phi at the (n+1)th iteration is written in terms of the values at the nth iteration via

$$\phi^{n+1}_{i,j} = \frac{1}{4}\left(\phi^{n}_{i+1,j} + \phi^{n}_{i-1,j} + \phi^{n}_{i,j+1} + \phi^{n}_{i,j-1} - 4\pi\rho_{i,j}\,\Delta x^2\right)$$

iterating until the condition

$$\left|\frac{\phi^{\mathrm{new}}_{i,j} - \phi^{\mathrm{old}}_{i,j}}{\phi^{\mathrm{old}}_{i,j}}\right| < \epsilon \quad \text{for all } i,j$$

has been satisfied.


COMPE472 Parallel Computing 2.108

The Use of Ghost Cells to solve a Poisson Equation

• Parallel Code:
• In this example, the domain is chopped into rectangles, in what is often called block-block decomposition. This is shown in the figure below.

Parallel Poisson solver via domain decomposition on a 3x5 processor grid.


COMPE472 Parallel Computing 2.109

The Use of Ghost Cells to solve a Poisson Equation

• An example computational grid of N=64 by M=64 points is shown, which will be divided amongst NP=15 processors.

• The number of processors, NP, is purposely chosen such that it does not divide evenly into either N or M.

• Because the computational domain has been divided into rectangles, the 15 processors {P(0),P(1),...,P(14)} (which are laid out in row-major order on the processor grid) can be given a 2-digit designation that represents their processor grid row number and processor grid column number. MPI has commands that allow you to do this.
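For example, MPI's Cartesian topology routines MPI_Cart_create and MPI_Cart_coords can set up exactly this kind of (row, column) designation. The short sketch below is illustrative rather than part of the course code; the 3x5 dimensions, the periodicity flags, and the variable names are assumptions.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical sketch: map ranks onto a 3x5 process grid (run with at
   least 15 processes; dimensions and names are illustrative). */
int main(int argc, char *argv[])
{
    int dims[2] = {3, 5}, periods[2] = {1, 1}, coords[2], grid_rank;
    MPI_Comm grid_comm;

    MPI_Init(&argc, &argv);

    /* Periodic in both directions, matching the periodic boundary
       conditions of the Poisson problem. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

    if (grid_comm != MPI_COMM_NULL) {   /* surplus processes get COMM_NULL */
        MPI_Comm_rank(grid_comm, &grid_rank);
        MPI_Cart_coords(grid_comm, grid_rank, 2, coords);
        printf("P(%d) is P(%d,%d) on the process grid\n",
               grid_rank, coords[0], coords[1]);
    }

    MPI_Finalize();
    return 0;
}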


COMPE472 Parallel Computing 2.110

The Use of Ghost Cells to solve a Poisson Equation

Indexing in a parallel Poisson solver on a 3x5 processor grid.


COMPE472 Parallel Computing 2.111

The Use of Ghost Cells to solve a Poisson Equation

• Note that P(1,2) (i.e., P(7)) is responsible for indices i=23-43 and j=27-39 in the serial code double do-loop.

• A parallel speedup is obtained because each processor is working on essentially 1/15 of the total data.

• However, there is a problem. What does P(1,2) do when its 5-point stencil hits the boundaries of its domain (i.e., when i=23 or i=43, or j=27 or j=39)? The 5-point stencil now reaches into another processor's domain, which means that boundary data exists in memory on another separate processor.

• Because the update formula for phi at grid point (i,j) involves neighboring grid indices {i-1,i,i+1;j-1,j,j+1}, P(1,2) must communicate with its North, South, East, and West (N, S, E, W) neighbors to get one column of boundary data from its E, W neighbors and one row of boundary data from its N,S neighbors.

• This is illustrated in Figure below.


COMPE472 Parallel Computing 2.112

The Use of Ghost Cells to solve a Poisson Equation

Boundary data movement in the parallel Poisson solver following each iteration of the stencil.


COMPE472 Parallel Computing 2.113

The Use of Ghost Cells to solve a Poisson Equation

• In order to accommodate this transfer of boundary data between processors, each processor must dimension its local array phi to have two extra rows and two extra columns.

• This is illustrated in Figure where the shaded areas indicate the extra rows and columns needed for the boundary data from other processors.

Ghost cells: Local indices.
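As a minimal sketch, such a local declaration might look as follows; the local sizes are illustrative (roughly 64/3 by 64/5, to match the grid above), not taken from the course code.

#define N_LOCAL 21   /* owned rows on this processor (illustrative)    */
#define M_LOCAL 13   /* owned columns on this processor (illustrative) */

/* Two extra rows and two extra columns: indices 0 and N_LOCAL+1 (rows),
   0 and M_LOCAL+1 (columns) hold ghost copies of the neighbours' data. */
double phi[N_LOCAL + 2][M_LOCAL + 2];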


COMPE472 Parallel Computing 2.114

The Use of Ghost Cells to solve a Poisson Equation

• Note that even though this example speaks of global indices, the whole point of parallelism is that no single processor ever holds the entire global phi matrix in its memory.

• Each processor has only its local version of phi with its own sub-collection of i and j indices.

• Locally these indices are labeled beginning at either 0 or 1, as in Figure 13.14, rather than beginning at their corresponding global values, as in Figure 13.12.

• Keeping track of the on-processor local indices and the global (in-your-head) indices is the bookkeeping that you have to manage when using message passing parallelism.


COMPE472 Parallel Computing 2.115

The Use of Ghost Cells to solve a Poisson Equation

• Other parallel paradigms, such as High Performance Fortran (HPF) or OpenMP, are directive-based, i.e., compiler directives are inserted into the code to tell the supercomputer to distribute data across processors or to perform other operations. The difference between the two paradigms is akin to the difference between an automatic and stick-shift transmission car.

• In the directive based paradigm (automatic), the compiler (car) does the data layout and parallel communications (gear shifting) implicitly.

• In the message passing paradigm (stick-shift), the user (driver) performs the data layout and parallel communications explicitly. In this example, this communication can be performed in a regular prescribed pattern for all processors.

• For example, all processors could first communicate with their N-most partners, then S, then E, then W. What is happening when all processors communicate with their E neighbors is illustrated in Figure below.


COMPE472 Parallel Computing 2.116

The Use of Ghost Cells to solve a Poisson Equation

Data movement, shift right (East).


COMPE472 Parallel Computing 2.117

The Use of Ghost Cells to solve a Poisson Equation

• Note that in this shift right communication, P(i,j) places its right-most column of boundary data into the left-most ghost column of P(i,j+1). In addition, P(i,j) receives the right-most column of boundary data from P(i,j-1) into its own left-most ghost column.
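A minimal sketch of this shift-East exchange is given below, assuming a 2-D Cartesian communicator (as in the earlier topology sketch) and hand-packed column buffers, since columns of a C array are not contiguous in memory; the function and variable names are illustrative, not taken from the course code.

#include <mpi.h>

/* Hypothetical helper: shift-right (East) exchange of one boundary column.
   phi is the local array with ghost cells, dimensioned
   phi[m_local+2][n_local+2]; grid_comm is a 2-D Cartesian communicator. */
void shift_east(int m_local, int n_local,
                double phi[m_local+2][n_local+2], MPI_Comm grid_comm)
{
    int west, east, i;
    double send_col[m_local], recv_col[m_local];

    /* Ranks of the neighbours one step away in the column direction. */
    MPI_Cart_shift(grid_comm, 1, 1, &west, &east);

    /* Pack the right-most owned column. */
    for (i = 1; i <= m_local; i++)
        send_col[i-1] = phi[i][n_local];

    /* Send it to the East neighbour while receiving the corresponding
       column from the West neighbour (periodic grid: everyone has both). */
    MPI_Sendrecv(send_col, m_local, MPI_DOUBLE, east, 0,
                 recv_col, m_local, MPI_DOUBLE, west, 0,
                 grid_comm, MPI_STATUS_IGNORE);

    /* Unpack into the left-most ghost column. */
    for (i = 1; i <= m_local; i++)
        phi[i][0] = recv_col[i-1];
}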

• For each iteration, the pseudo-code for the parallel algorithm is thus:

t = 0
(0) Initialize phi
(1) Loop over stencil iterations
    (2) Perform parallel N shift communications of boundary data
    (3) Perform parallel S shift communications of boundary data
    (4) Perform parallel E shift communications of boundary data
    (5) Perform parallel W shift communications of boundary data
    (6) for (i = 1; i <= N_local; i++) {
            for (j = 1; j <= M_local; j++) {
                update phi[i][j]
            }
        }
    End Loop over stencil iterations
(7) Output data
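Step (6) might look like the following sketch, which applies the update formula reconstructed above to the owned points only; the array and variable names are assumptions, not the course code.

/* Hypothetical sketch of one Jacobi sweep over the owned points.
   phi_old already has its ghost rows and columns filled by the four
   shift communications; rho is the local source array, dx the spacing. */
void jacobi_sweep(int n_local, int m_local, double dx,
                  double phi_new[n_local+2][m_local+2],
                  double phi_old[n_local+2][m_local+2],
                  double rho[n_local+2][m_local+2])
{
    const double pi = 3.14159265358979323846;
    int i, j;

    for (i = 1; i <= n_local; i++)          /* interior (owned) points only */
        for (j = 1; j <= m_local; j++)
            phi_new[i][j] = 0.25 * (phi_old[i+1][j] + phi_old[i-1][j]
                                  + phi_old[i][j+1] + phi_old[i][j-1]
                                  - 4.0 * pi * rho[i][j] * dx * dx);
}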


COMPE472 Parallel Computing 2.118

The Use of Ghost Cells to solve a Poisson Equation

• Note that initializing the data should be performed in parallel. That is, each processor P(i,j) should only initialize the portion of phi for which it is responsible. (Recall NO processor contains the full global phi).

• In relation to this point, step (7), Output data, is not such a simple-minded task when performing parallel calculations. Should you reduce all the data from phi_local on each processor to one giant phi_global on P(0,0) and then print out the data? This is certainly one way to do it, but it seems to defeat the purpose of not having all the data reside on one processor.

• For example, what if phi_global is too large to fit in memory on a single processor? A second alternative is for each processor to write out its own phi_local to a file "phi.ij", where ij indicates the processor's 2-digit designation (e.g. P(1,2) writes out to file "phi.12").

• The data then has to be manipulated off processor by another code to put it into a form that may be rendered by a visualization package. This code itself may have to be a parallel code.
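A minimal sketch of the second alternative, with each processor writing its own "phi.ij" file, could look as follows; the function name and arguments are illustrative, not taken from the course code.

#include <stdio.h>

/* Hypothetical sketch: each processor writes its own portion of phi to a
   file named after its grid coordinates, e.g. P(1,2) writes "phi.12". */
void write_local_phi(int my_row, int my_col, int n_local, int m_local,
                     double phi[n_local+2][m_local+2])
{
    char fname[32];
    int i, j;

    sprintf(fname, "phi.%d%d", my_row, my_col);
    FILE *fp = fopen(fname, "w");
    if (fp == NULL) return;

    /* Write only the owned points, not the ghost layer. */
    for (i = 1; i <= n_local; i++) {
        for (j = 1; j <= m_local; j++)
            fprintf(fp, "%g ", phi[i][j]);
        fprintf(fp, "\n");
    }
    fclose(fp);
}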


COMPE472 Parallel Computing 2.119

The Use of Ghost Cells to solve a Poisson Equation

• As you can see, the issue of parallel I/O is not a trivial one and is in fact a topic of current research among parallel language developers and researchers.


COMPE472 Parallel Computing 2.120

Matrix-vector Multiplication using a Client-Server Approach

• In Section 13.2.1, a simple data decomposition for multiplying a matrix and a vector was described. This decomposition is also used here to demonstrate a "client-server" approach. The code for this example is in the C program, server_client_c.c.

• In server_client_c.c, all input/output is handled by the "server" (preset to be processor 0). This includes parsing the command-line arguments, reading the file containing the matrix A and vector x, and writing the result to standard output. The file containing the matrix A and the vector x has the form

m n
x1 x2 ...
a11 a12 ...
a21 a22 ...
...

where A is m (rows) by n (columns), and x is a column vector with n elements.


COMPE472 Parallel Computing 2.121

Matrix-vector Multiplication using a Client-Server Approach

• After the server reads in the size of A, it broadcasts this information to all of the clients.

• It then checks to make sure that there are fewer processors than columns. (If there are more processors than columns, then using a parallel program is not efficient and the program exits.)

• The server and all of the clients then allocate memory locations for A and x. The server also allocates memory for the result.

• Because there are more columns than client processors, the first "round" consists of the server sending one column to each of the client processors.

• All of the clients receive a column to process. Upon finishing, the clients send results back to the server. As the server receives a "result" buffer from a client, it sends the next unprocessed column to that client.


COMPE472 Parallel Computing 2.122

Matrix-vector Multiplication using a Client-Server Approach

• The source code is divided into two sections: the "server" code and the "client" code. The pseudo-code for each of these sections is:

• Server:
  – Broadcast (vector) x to all client processors.
  – Send a column of A to each client processor.
  – While there are more columns to process OR results are still expected, receive results and send the next unprocessed column.
  – Print the result.

• Client:
  – Receive (vector) x.
  – Receive a column of A with tag = column number.
  – Multiply the column by the corresponding element of (vector) x (given by the tag) to produce a (vector) result.
  – Send the result back to the server.

• Note that the numbers used in the pseudo-code (for both the server and client) have been added to the source code.
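The server's dispatch loop might be sketched as below. This follows the pseudo-code above rather than the actual server_client_c.c: the function name, the use of the message tag as the column number, and the zero-length "stop" message tagged with ncols are all illustrative conventions.

#include <mpi.h>

/* Hypothetical sketch of the server's dispatch loop in a column-wise
   client-server matrix-vector product y = A*x.  acol[j] points to column j
   of A (length nrows); clients have ranks 1..nclients. */
void serve(int nrows, int ncols, int nclients, double **acol, double *y)
{
    int j, sent = 0, received = 0;
    double partial[nrows];
    MPI_Status status;

    for (j = 0; j < nrows; j++) y[j] = 0.0;

    /* First round: hand one column to each client (tag = column number). */
    for (j = 1; j <= nclients && sent < ncols; j++, sent++)
        MPI_Send(acol[sent], nrows, MPI_DOUBLE, j, sent, MPI_COMM_WORLD);

    /* Receive partial results; while columns remain, send the next one
       back to whichever client just finished. */
    while (received < ncols) {
        MPI_Recv(partial, nrows, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        received++;
        for (j = 0; j < nrows; j++)       /* clients return x[tag]*column(tag), */
            y[j] += partial[j];           /* so the server just accumulates     */

        if (sent < ncols) {
            MPI_Send(acol[sent], nrows, MPI_DOUBLE, status.MPI_SOURCE,
                     sent, MPI_COMM_WORLD);
            sent++;
        } else {                          /* no work left: empty stop message */
            MPI_Send(partial, 0, MPI_DOUBLE, status.MPI_SOURCE, ncols,
                     MPI_COMM_WORLD);
        }
    }
}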


COMPE472 Parallel Computing 2.123

Matrix-vector Multiplication using a Client-Server Approach

• Source code similar to server_client_c.c, called server_client_r.c, is also provided as an example.

• The main difference between these codes is the way the data is stored.

• Because only contiguous memory locations can be sent using MPI_SEND, server_client_c.c stores the matrix A "column-wise" in memory, while server_client_r.c stores the matrix A "row-wise" in memory.
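As an aside, MPI derived datatypes offer a way around this contiguity restriction: a strided MPI_Type_vector can describe one column of a row-major C matrix so that it can be sent in place, without reordering the storage. A minimal sketch, with illustrative names, is:

#include <mpi.h>

/* Hypothetical alternative to storing A column-wise: describe one column
   of a row-major matrix with a strided datatype and send it directly. */
void send_column(int nrows, int ncols, double A[nrows][ncols],
                 int j, int dest)
{
    MPI_Datatype column_t;

    /* nrows blocks of one double each, spaced ncols doubles apart. */
    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* Send column j of A to 'dest', tagging it with the column number. */
    MPI_Send(&A[0][j], 1, column_t, dest, j, MPI_COMM_WORLD);

    MPI_Type_free(&column_t);
}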

• The pseudo-code for server_client_c.c and server_client_r.c is stated in the "block" documentation at the beginning of the source code.