Basics of Message-passing


Page 1: Basics of Message-passing

Basics of Message-passing
• Mechanics of message-passing
  – A means of creating separate processes on different computers
  – A way to send and receive messages
• Single program multiple data (SPMD) model
  – Logic for multiple processes merged into one program
  – Control statements separate the blocks of logic that each process executes
  – A compiled copy of the program is stored on each processor
  – All executables are started together statically
  – Example: MPI
• Multiple program multiple data (MPMD) model
  – Each processor has a separate master program
  – The master program spawns child processes dynamically
  – Example: PVM

Page 2: Basics of Message-passing

Point-to-point Communication

• Generic syntax (actual formats later)
  send(data, destination, message tag)
  recv(data, source, message tag)
• Synchronous
  – Completes after the data is safely transferred
  – No copying between message buffers
• Asynchronous
  – Completes when transmission begins
  – Local buffers are then free for application use

[Figure: process 1 executes send(&x, 2) and process 2 executes recv(&y, 1); the contents of x in process 1 move to y in process 2.]

Page 3: Basics of Message-passing

Synchronized sends and receives

[Figure (a), send() occurs before recv(): process 1 issues a request to send and suspends; when process 2 reaches recv(), an acknowledgment comes back, the message is transferred, and both processes continue.]

[Figure (b), recv() occurs before send(): process 2 suspends in recv(); when process 1 reaches send(), its request to send is acknowledged at once, the message is transferred, and both processes continue.]

Page 4: Basics of Message-passing

Point to Point MPI calls
• Synchronous send – completes when the data is successfully received (the four send variants are sketched after this list)
• Buffered send – completes after the data is copied to a user-supplied buffer
  – Becomes synchronous if no buffer space is available
• Ready send – synchronous; a matching receive must precede the send
  – Completion occurs when the remote processor receives the data
• Receive – completes when the data becomes available
• Standard send
  – If a receive is posted, completes once the data is on its way (asynchronous)
  – If no receive is posted, completes when the data is buffered by MPI
  – Becomes synchronous if no buffer space is available
• Blocking – the call returns when the operation completes
• Non-blocking – the call returns immediately
  – The application must poll or wait for completion
  – Allows more parallel processing
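A minimal sketch of the four send variants, in the same fragment style as the slides (MPI_Init already called, myrank already set, exactly two processes assumed); rank 1 pre-posts its receives so that the ready send is legal:

int x = 42, y[4], ready = 0;
MPI_Status stat;
if (myrank == 1) {
    MPI_Request req[4];
    for (int tag = 0; tag < 4; tag++)                    /* pre-post all four receives      */
        MPI_Irecv(&y[tag], 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[tag]);
    MPI_Send(&ready, 1, MPI_INT, 0, 99, MPI_COMM_WORLD); /* tell rank 0 the receives exist  */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
} else if (myrank == 0) {
    char *bbuf = (char *)malloc(MPI_BSEND_OVERHEAD + sizeof(int));
    MPI_Buffer_attach(bbuf, MPI_BSEND_OVERHEAD + sizeof(int)); /* staging space for MPI_Bsend */
    MPI_Recv(&ready, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &stat);
    MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* synchronous: waits for the recv */
    MPI_Bsend(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);     /* buffered: copies and returns    */
    MPI_Rsend(&x, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);     /* ready: recv is already posted   */
    MPI_Send (&x, 1, MPI_INT, 1, 3, MPI_COMM_WORLD);     /* standard: MPI chooses the mode  */
}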

Page 5: Basics of Message-passing

Buffered Send Example

Applications supply a data buffer area using MPI_Buffer_attach() to hold the data during transmission.

[Figure: process 1 calls send(), which copies the message into the attached buffer and lets the process continue; process 2 later calls recv() and reads the message from the buffer.]
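A sketch of this pattern with the real MPI calls, again in fragment style (includes and MPI_Init omitted); the buffer only needs to cover the message plus MPI_BSEND_OVERHEAD:

#define N 100
int data[N], bsize;
char *bbuf;
MPI_Status stat;

if (myrank == 0) {
    bsize = N * sizeof(int) + MPI_BSEND_OVERHEAD;          /* message + MPI bookkeeping       */
    bbuf  = (char *)malloc(bsize);
    MPI_Buffer_attach(bbuf, bsize);                        /* hand the staging area to MPI    */
    MPI_Bsend(data, N, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* returns once data is copied     */
    /* ... data[] can be reused immediately while the transfer proceeds ... */
    MPI_Buffer_detach(&bbuf, &bsize);                      /* blocks until the send is done   */
    free(bbuf);
} else if (myrank == 1) {
    MPI_Recv(data, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
}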

Page 6: Basics of Message-passing

Message Tags
• Differentiate between types of messages
• The message tag is carried within the message
• Wild-card receive operations
  – MPI_ANY_TAG: matches any message tag
  – MPI_ANY_SOURCE: matches messages from any sender

[Figure: send(&x,2,5) in process 1 sends message tag 5 from buffer x to process 2; recv(&y,1,5) in process 2 waits for a message from process 1 with a tag of 5 and stores it in y.]
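A small receive-side fragment showing the wild cards, with the actual sender and tag read back from the status structure:

int value;
MPI_Status stat;

/* Accept the next message from any sender, whatever its tag. */
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);

printf("received %d from rank %d with tag %d\n",
       value, stat.MPI_SOURCE, stat.MPI_TAG);   /* status records who actually sent it */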

Page 7: Basics of Message-passing

Collective Communication

Route a message to a communicator (a group of processors)

• Broadcast (MPI_Bcast()): Broadcast or multicast data to the processors in a group
• Scatter (MPI_Scatter()): Send part of an array to separate processes
• Gather (MPI_Gather()): Collect array elements from separate processes
• AlltoAll (MPI_Alltoall()): A combination of gather and scatter
• Reduce (MPI_Reduce()): Combine values from all processes into a single value
• Reduce-scatter (MPI_Reduce_scatter()): A reduce followed by a scatter of the result
• Scan (MPI_Scan()): Perform a prefix reduction on data across all processors
• Barrier (MPI_Barrier()): Pause until all processors reach the barrier call

Advantages:
• MPI can use the processor hierarchy to improve efficiency
• Less programming is needed for collective operations
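For example, a minimal broadcast in the same fragment style (rank 0 chosen as the root):

int n;
if (myrank == 0)
    n = 100;                                  /* only the root has the value initially      */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* every rank passes the same root and count  */
/* After the call, n == 100 on every process in MPI_COMM_WORLD. */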

Page 8: Basics of Message-passing

Broadcast
• Broadcast: sending the same message to all processes
• Multicast: sending the same message to a defined group of processes

[Figure: every process, from process 0 to process p-1, calls bcast(); the data in the root's buffer (buf) is copied into each process's data area. The diagram shows the action, the generic code, and the MPI form.]

Page 9: Basics of Message-passing

Scatter

Distributing each element of an array to separate processes: the contents of the ith location of the array are transmitted to process i.

[Figure: every process, from process 0 to process p-1, calls scatter(); the root's buffer (buf) is divided and one block of data is delivered to each process. The diagram shows the action, the generic code, and the MPI form.]
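A hedged fragment of the corresponding MPI_Scatter() call, assuming ten elements per process and rank 0 as the root:

int p, myrank, chunk[10], *all = NULL;

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    all = (int *)malloc(p * 10 * sizeof(int));     /* the root holds the whole array   */
    for (int i = 0; i < p * 10; i++) all[i] = i;   /* illustrative values              */
}
/* Block i of ten elements lands in chunk[] on the process with rank i. */
MPI_Scatter(all, 10, MPI_INT, chunk, 10, MPI_INT, 0, MPI_COMM_WORLD);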

Page 10: Basics of Message-passing

Gather
One process collects individual values from a set of processes.

[Figure: every process, from process 0 to process p-1, calls gather(); the data item from each process is collected into the root's buffer (buf). The diagram shows the action, the generic code, and the MPI form.]

Page 11: Basics of Message-passing

Reduce
Perform a distributed calculation. Example: perform addition over a distributed array.

[Figure: every process, from process 0 to process p-1, calls reduce(); the data items from all processes are combined (here with +) into a single result in the root's buffer (buf). The diagram shows the action, the generic code, and the MPI form.]
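A matching MPI_Reduce() fragment in which each process contributes one partial value:

int partial = myrank;     /* each process's local contribution (placeholder value)  */
int total   = 0;          /* meaningful only on the root after the call             */

/* Add the partial values from every process and leave the sum on rank 0. */
MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (myrank == 0)
    printf("sum over all ranks = %d\n", total);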

Page 12: Basics of Message-passing

PVM (Parallel Virtual Machine)

• Oak Ridge National Laboratory; free distribution
• A host process controls the environment
• The parent process spawns other processes
• Daemon processes handle the message passing
• Send/receive sequence: get buffer, pack, send, then receive and unpack (sketched below)
• Non-blocking send; blocking or non-blocking receive
• Wild-card tags and process source
• Sample program: page 52
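A hedged sketch of that get-buffer/pack/send/unpack sequence using the standard PVM 3 calls; the task name "worker", the tag 7, and the single-int payload are made up for illustration:

#include <pvm3.h>

void master_side(void)                    /* in the master program                     */
{
    int n = 42, tid;
    pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid); /* spawn one worker task   */
    pvm_initsend(PvmDataDefault);         /* get (and clear) the send buffer           */
    pvm_pkint(&n, 1, 1);                  /* pack one int with stride 1                */
    pvm_send(tid, 7);                     /* non-blocking send with message tag 7      */
}

void worker_side(void)                    /* in the spawned "worker" program           */
{
    int m;
    pvm_recv(pvm_parent(), 7);            /* blocking receive of tag 7 from the parent */
    pvm_upkint(&m, 1, 1);                 /* unpack the int                            */
    pvm_exit();
}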

Page 13: Basics of Message-passing

mpij and MpiJava
• Overview
  – MpiJava is a wrapper sitting on mpich or lamMpi
  – mpij is a native Java implementation of MPI
• Documentation
  – MpiJava (http://www.hpjava.org/mpiJava.html)
  – mpij (uses the same API as MpiJava)
• Java Grande consortium (http://www.javagrande.org)
  – Sponsors conferences and encourages Java for parallel programming
  – Maintains Java-based paradigms (mpiJava, HPJava, and mpiJ)
• Other Java-based implementations
  – JavaMpi is another, less popular, MPI Java wrapper

Page 14: Basics of Message-passing

SPMD Computation

main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();
    else
        slave();
    ...
    MPI_Finalize();
}

The master process executes master()

The slave processes execute slave()

Page 15: Basics of Message-passing

Unsafe message passing

[Figure (a) Intended behavior: in each process the application-level send(...,1,...) and recv(...,0,...) match each other, and the send/recv pair issued inside the library routine lib() match separately.]

[Figure (b) Possible behavior: because messages are identified only by source and destination, an application-level send in process 0 can instead be matched by the recv() inside lib() in process 1, and vice versa.]

Delivery order is a function of timing among the processors.

Page 16: Basics of Message-passing

Communicators

A communicator is a collection of processes.

• Communicators allow:
  – Collective communication within groups of processors
  – A mechanism to identify processors for point-to-point transfers
• The default communicator is MPI_COMM_WORLD
  – A unique rank corresponds to each executing process
  – The rank is an integer from 0 to p-1, where p is the number of executing processes
• Applications can create subset communicators (a sketch follows this list)
  – Each processor has a unique rank in each sub-communicator
  – The rank is an integer from 0 to g-1, where g is the number of processors in the group
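A minimal sketch of creating a subset communicator with MPI_Comm_split(); splitting on even/odd rank is just an illustrative choice:

MPI_Comm subcomm;
int myrank, subrank, color, token = 0;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
color = myrank % 2;                             /* same color -> same sub-communicator   */
MPI_Comm_split(MPI_COMM_WORLD, color, myrank, &subcomm);

MPI_Comm_rank(subcomm, &subrank);               /* a new rank from 0 to g-1              */
MPI_Bcast(&token, 1, MPI_INT, 0, subcomm);      /* collectives now stay inside the group */
MPI_Comm_free(&subcomm);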

Page 17: Basics of Message-passing

Point-to-point Message Transfer

MPI_Send(buf, count, datatype, dest, tag, comm)
  buf      – address of the send buffer
  count    – number of items to send
  datatype – datatype of each item
  dest     – rank of the destination process
  tag      – message tag
  comm     – communicator

MPI_Recv(buf, count, datatype, src, tag, comm, status)
  buf      – address of the receive buffer
  count    – maximum number of items to receive
  datatype – datatype of each item
  src      – rank of the source process
  tag      – message tag
  comm     – communicator
  status   – status after the operation

int x;
MPI_Status stat;                       /* records the source, tag, and length of the message */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

Page 18: Basics of Message-passing

Non-blocking Point-to-point Transfer

int x;
MPI_Request io;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
    doSomeProcessing();                   /* overlap computation with the transfer      */
    MPI_Wait(&io, &stat);                 /* x must not be reused before this completes */
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

• MPI_Isend() and MPI_Irecv() return immediately
• MPI_Wait() returns after the operation completes
• MPI_Test() sets its flag argument to non-zero if the operation is complete (see the sketch below)
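A short sketch of polling with MPI_Test() instead of blocking in MPI_Wait(), reusing x, io, and stat from the fragment above; doALittleMoreWork() is a hypothetical unit of useful work:

int flag = 0;
MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
do {
    doALittleMoreWork();              /* overlap useful computation with the transfer  */
    MPI_Test(&io, &flag, &stat);      /* flag becomes non-zero once the send completes */
} while (!flag);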

Page 19: Basics of Message-passing

Collective Communication Example

• Processor 0 gathers items from a group of processes
• The master processor allocates memory to hold the data
• The remote processors initialize the data array
• All processors execute the MPI_Gather() function

int data[10];                                   /* data to gather from each process       */
int *buf, grp_size, i;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
    buf = (int *)malloc(grp_size * 10 * sizeof(int));   /* room for every process's block */
} else {
    for (i = 0; i < 10; i++)
        data[i] = myrank;
}
/* The receive count is the number of items gathered from each process, so 10 here. */
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Page 20: Basics of Message-passing

Calculate Parallel Run Time

Sequential execution time: ts
  – ts = the number of compute steps of the best sequential algorithm (big-O)

Parallel execution time: tp = tcomp + tcomm
  – Communication overhead: tcomm = m(tstartup + n*tdata), where
      tstartup is the message latency (the time to send a message with no data)
      tdata is the transmission time for one data element
      n is the number of data elements per message, and m is the number of messages
  – Computation overhead: tcomp = f(n, p)

• Assumptions
  – All processors are homogeneous and run at the same speed
  – tp is the worst-case execution time over all processors
  – tstartup, tdata, and tcomp are measured in computational step units so they can be added together
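As a quick illustration (numbers invented, not taken from the slides): sending one message of n = 1000 elements with tstartup = 100 and tdata = 1 step units gives tcomm = 1(100 + 1000*1) = 1100 steps; if the computation takes tcomp = n/p = 250 steps on p = 4 processors, then tp = 250 + 1100 = 1350 steps.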

Page 21: Basics of Message-passing

Estimating Scalability

[Table of complexity formulas not reproduced in the transcript.]

Notes:
• p and n respectively indicate the number of processors and data elements
• The formulas above help estimate scalability with respect to p and n

Page 22: Basics of Message-passing

Parallel Visualization Tools

[Figure: a space-time diagram for processes 1, 2, and 3, with time running horizontally; shading distinguishes computing, waiting, and time spent in message-passing system routines, and arrows show messages between processes.]

Observe program behavior using a space-time diagram (also called a process-time diagram).

Page 23: Basics of Message-passing

Parallel Program Development
• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms do not lend themselves to running in parallel
• Advised steps of development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel on one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – Parallel debuggers can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime() (a timing sketch follows)
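A small timing sketch using MPI_Wtime(); the barrier is just one way to start all processes at roughly the same point, and doParallelWork() is hypothetical:

double start, elapsed;

MPI_Barrier(MPI_COMM_WORLD);          /* line the processes up before timing          */
start = MPI_Wtime();                  /* wall-clock time in seconds                   */

doParallelWork();                     /* the section being measured                   */

elapsed = MPI_Wtime() - start;
printf("rank %d took %f seconds\n", myrank, elapsed);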