CSCE569 Parallel Computing
Lecture 4, TTH 03:30PM-04:45PM
Dr. Jianjun Hu
http://mleg.cse.sc.edu/edu/csce569/
CSCE569 Parallel Computing
University of South Carolina, Department of Computer Science and Engineering
Chapter Objectives
- Creating 2-D arrays
- Thinking about “grain size”
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap
Outline
- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking
All-pairs Shortest Path Problem
[Figure: weighted directed graph on vertices A, B, C, D, E]
Resulting Adjacency Matrix Containing Distances:

        A    B    C    D    E
   A    0    6    3    6    4
   B    4    0    7   10    8
   C   12    6    0    3    1
   D    7    3   10    0   11
   E    9    5   12    2    0
Floyd’s Algorithm
for k ← 0 to n-1
   for i ← 0 to n-1
      for j ← 0 to n-1
         a[i,j] ← min (a[i,j], a[i,k] + a[k,j])
      endfor
   endfor
endfor
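A minimal sequential C sketch of the same triple loop, with a tiny made-up 4-vertex distance matrix (the size N, the INF sentinel, and the sample data are illustrative only, not the course's test case):

#include <stdio.h>

#define N   4
#define INF 1.0e9                  /* stands in for "no edge" */

/* Floyd's algorithm: after the k-th outer iteration, a[i][j] holds the
   shortest distance from i to j using only intermediate vertices 0..k. */
void floyd (double a[N][N])
{
   for (int k = 0; k < N; k++)
      for (int i = 0; i < N; i++)
         for (int j = 0; j < N; j++)
            if (a[i][k] + a[k][j] < a[i][j])
               a[i][j] = a[i][k] + a[k][j];     /* shorter path through k */
}

int main (void)
{
   double a[N][N] = {{  0,   2, INF, INF},
                     {INF,   0,   3, INF},
                     {INF, INF,   0,   1},
                     {  5, INF, INF,   0}};
   floyd (a);
   printf ("shortest distance 0 -> 3: %g\n", a[0][3]);   /* 2 + 3 + 1 = 6 */
   return 0;
}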
Why It Works
[Figure: vertices i, k, and j, connected by the three paths below]
- Shortest path from i to k through 0, 1, …, k-1
- Shortest path from k to j through 0, 1, …, k-1
- Shortest path from i to j through 0, 1, …, k-1
- All three paths were computed in previous iterations, so iteration k only has to check whether going through k (the first two paths combined) is shorter than the current i-to-j path; the update takes the minimum of the two.
Dynamic 1-D Array Creation
[Figure: pointer A on the run-time stack referencing an array allocated on the heap]
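A minimal C sketch of this layout, assuming n elements of type double (the size and element type are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
   int n = 10;                                   /* illustrative size */

   /* A itself lives on the run-time stack; the n doubles it points to
      are allocated on the heap. */
   double *A = (double *) malloc (n * sizeof(double));
   if (A == NULL) return 1;

   for (int i = 0; i < n; i++) A[i] = i;
   printf ("last element: %g\n", A[n-1]);

   free (A);
   return 0;
}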
Dynamic 2-D Array Creation
[Figure: B points to an array of row pointers on the heap, each of which points into the contiguous block Bstorage on the heap; B and Bstorage are referenced from the run-time stack]
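A minimal C sketch of this two-level layout: Bstorage holds every element contiguously, while B holds one pointer per row so the matrix can be indexed as B[i][j]. The matrix dimensions and element type below are illustrative:

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
   int rows = 4, cols = 5;                       /* illustrative sizes */

   /* one contiguous block holding all rows*cols elements */
   int *Bstorage = (int *) malloc (rows * cols * sizeof(int));

   /* one pointer per row, each aimed at the start of that row in Bstorage */
   int **B = (int **) malloc (rows * sizeof(int *));
   for (int i = 0; i < rows; i++)
      B[i] = &Bstorage[i * cols];

   B[2][3] = 7;                                  /* normal 2-D indexing */
   printf ("B[2][3] = %d\n", B[2][3]);

   free (B);
   free (Bstorage);
   return 0;
}

Keeping the storage contiguous matters later: a whole block of rows can be read from a file or passed to an MPI send as a single buffer.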
Designing Parallel Algorithm
- Partitioning
- Communication
- Agglomeration and Mapping
Partitioning
- Domain or functional decomposition? Look at the pseudocode
- Same assignment statement executed n³ times
- No functional parallelism
- Domain decomposition: divide matrix A into its n² elements
Communication
- Primitive tasks updating a[3,4] when k = 1
- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row
Agglomeration and Mapping
- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy:
  - Agglomerate tasks to minimize communication
  - Create one task per MPI process
Two Data Decompositions
- Rowwise block striped
- Columnwise block striped
Comparing Decompositions
- Columnwise block striped: broadcast within columns eliminated
- Rowwise block striped: broadcast within rows eliminated, and reading the matrix from a file is simpler
- Choose rowwise block striped decomposition
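A small sketch of the bookkeeping behind a rowwise block-striped decomposition, using BLOCK_LOW/BLOCK_HIGH/BLOCK_SIZE style macros (the macro names follow a common textbook convention and are assumed here, not taken from the course code): process id out of p owns rows BLOCK_LOW(id,p,n) through BLOCK_HIGH(id,p,n) of an n-row matrix.

#include <stdio.h>

#define BLOCK_LOW(id,p,n)   ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id)+1,(p),(n)) - 1)
#define BLOCK_SIZE(id,p,n)  (BLOCK_HIGH((id),(p),(n)) - BLOCK_LOW((id),(p),(n)) + 1)

int main (void)
{
   int n = 10, p = 3;                      /* illustrative sizes */
   for (int id = 0; id < p; id++)
      printf ("process %d owns rows %d..%d (%d rows)\n",
              id, BLOCK_LOW(id,p,n), BLOCK_HIGH(id,p,n), BLOCK_SIZE(id,p,n));
   return 0;
}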
File Input
Pop Quiz
Why don’t we input the entire file at once and then scatter its contents among the processes, allowing concurrent message passing?
Point-to-point Communication
- Involves a pair of processes
- One process sends a message
- Other process receives the message
Send/Receive Not Collective
Function MPI_Send
int MPI_Send (
   void         *message,     /* starting address of the data to send */
   int           count,       /* number of elements to send */
   MPI_Datatype  datatype,    /* type of each element */
   int           dest,        /* rank of the destination process */
   int           tag,         /* message tag (label) */
   MPI_Comm      comm         /* communicator */
)
Function MPI_Recv
int MPI_Recv (
   void         *message,     /* starting address of the receive buffer */
   int           count,       /* maximum number of elements to receive */
   MPI_Datatype  datatype,    /* type of each element */
   int           source,      /* rank of the sender (or MPI_ANY_SOURCE) */
   int           tag,         /* expected tag (or MPI_ANY_TAG) */
   MPI_Comm      comm,        /* communicator */
   MPI_Status   *status       /* details about the message received */
)
Coding Send/Receive
…
if (ID == j) {
   …
   Receive from i
   …
}
…
if (ID == i) {
   …
   Send to j
   …
}
…

The receive appears before the send. Why does this work?
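It works because the two if-branches run on different processes: process j simply blocks inside MPI_Recv until process i’s MPI_Send arrives, so the textual ordering in the shared source file does not matter. A minimal complete sketch, assuming illustrative ranks i = 0 and j = 1 and tag 0 (not the course's actual program):

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
   int id, value;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   if (id == 1) {            /* "process j": receive, written first */
      MPI_Recv (&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf ("process 1 received %d\n", value);
   }
   if (id == 0) {            /* "process i": send */
      value = 42;
      MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
   }

   MPI_Finalize ();
   return 0;
}

Run with at least two processes, e.g. mpirun -np 2 ./a.out.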
Inside MPI_Send and MPI_Recv
[Figure: MPI_Send copies the message from the sending process's program memory into a system buffer; the message travels to a system buffer on the receiving process, and MPI_Recv copies it from there into that process's program memory]
Return from MPI_Send
- Function blocks until the message buffer is free
- The message buffer is free when:
  - the message has been copied to a system buffer, or
  - the message has been transmitted
- Typical scenario: the message is copied to a system buffer and its transmission overlaps computation
Return from MPI_Recv
- Function blocks until the message is in the buffer
- If the message never arrives, the function never returns
Deadlock
- Deadlock: a process waiting for a condition that will never become true
- Easy to write send/receive code that deadlocks:
  - Two processes: both receive before sending
  - Send tag doesn’t match receive tag
  - Process sends message to the wrong destination process
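A minimal sketch of the first case ("both receive before send"): each of two processes posts a blocking MPI_Recv for the other's message before sending its own, so neither receive can ever complete. The ranks and tag are illustrative; the program is intentionally broken.

#include <mpi.h>

int main (int argc, char *argv[])
{
   int id, other, sendval, recvval;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   other = 1 - id;                  /* run with exactly two processes */
   sendval = id;

   /* Both processes block here forever: each is waiting for a message
      that the other has not sent yet (and never will). */
   MPI_Recv (&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
   MPI_Send (&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

   MPI_Finalize ();
   return 0;
}

Reversing the receive/send order on one of the two processes (or using MPI_Sendrecv) removes the deadlock.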
Computational Complexity
- Innermost loop has complexity Θ(n)
- Middle loop executed at most ⌈n/p⌉ times
- Outer loop executed n times
- Overall complexity Θ(n³/p)
Communication Complexity
- No communication in inner loop
- No communication in middle loop
- Broadcast in outer loop: complexity Θ(n log p)
- Overall complexity Θ(n² log p)
Execution Time Expression (1)
   n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)

- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- n: iterations of outer loop (one broadcast per iteration)
- ⌈log p⌉: messages per broadcast
- λ + 4n/β: message-passing time per message (latency λ plus transmission of n 4-byte elements at bandwidth β)
Computation/communication Overlap
Execution Time Expression (2)
   n ⌈n/p⌉ n χ + n ⌈log p⌉ λ + ⌈log p⌉ (4n/β)

- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- ⌈log p⌉: messages per broadcast
- λ: message-passing (latency) time
- 4n/β: message transmission time

Because message transmission now overlaps computation, the transmission term is no longer multiplied by the n iterations of the outer loop.
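A tiny C sketch that simply evaluates both expressions numerically, to make the role of each term concrete. The values used for χ (cell update time), λ (latency), and β (bandwidth) are made-up placeholders, not measurements from the course's cluster:

#include <stdio.h>
#include <math.h>

int main (void)
{
   double n = 1000, p = 8;        /* illustrative problem and machine size */
   double chi    = 25.0e-9;       /* hypothetical cell update time (s)    */
   double lambda = 250.0e-6;      /* hypothetical message latency (s)     */
   double beta   = 1.0e7;         /* hypothetical bandwidth (bytes/s)     */

   double compute = n * ceil (n / p) * n * chi;
   double steps   = ceil (log2 (p));            /* messages per broadcast */

   double expr1 = compute + n * steps * (lambda + 4.0 * n / beta);
   double expr2 = compute + n * steps * lambda + steps * 4.0 * n / beta;

   printf ("expression (1): %.2f s\n", expr1);
   printf ("expression (2): %.2f s\n", expr2);
   return 0;
}

Compile with -lm; comparing the two outputs shows how much the overlap saves for a given n and p.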
Predicted vs. Actual Performance
Processes   Predicted (sec)   Actual (sec)
    1           25.54            25.54
    2           13.02            13.89
    3            9.01             9.60
    4            6.89             7.29
    5            5.86             5.99
    6            5.01             5.16
    7            4.40             4.50
    8            3.94             3.98
Summary
- Two matrix decompositions
  - Rowwise block striped
  - Columnwise block striped
- Blocking send/receive functions
  - MPI_Send
  - MPI_Recv
- Overlapping communications with computations