CSCE569 Parallel Computing


DESCRIPTION

CSCE569 Parallel Computing. Lecture 4, TTH 03:30PM-04:45PM. Dr. Jianjun Hu, http://mleg.cse.sc.edu/edu/csce569/. University of South Carolina, Department of Computer Science and Engineering. Chapter objectives: creating 2-D arrays; thinking about "grain size"; introducing point-to-point communications; reading and printing 2-D matrices; analyzing performance when computations and communications overlap.

TRANSCRIPT

Page 1: CSCE569 Parallel Computing

Lecture 4, TTH 03:30PM-04:45PM

Dr. Jianjun Hu

http://mleg.cse.sc.edu/edu/csce569/

CSCE569 Parallel Computing

University of South Carolina
Department of Computer Science and Engineering

Page 2: CSCE569 Parallel Computing

Chapter Objectives
- Creating 2-D arrays
- Thinking about "grain size"
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap

Page 3: CSCE569 Parallel Computing

Outline
- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking

Page 4: CSCE569 Parallel Computing

All-pairs Shortest Path Problem

[Figure: a weighted, directed graph with vertices A, B, C, D, E]

Resulting Adjacency Matrix Containing Distances:

         A    B    C    D    E
    A    0    6    3    6    4
    B    4    0    7   10    8
    C   12    6    0    3    1
    D    7    3   10    0   11
    E    9    5   12    2    0

Page 5: CSCE569 Parallel Computing

Floyd’s Algorithm

for k ← 0 to n-1
    for i ← 0 to n-1
        for j ← 0 to n-1
            a[i,j] ← min(a[i,j], a[i,k] + a[k,j])
        endfor
    endfor
endfor
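A minimal serial C sketch of this pseudocode, assuming the distance matrix is an n x n array of ints accessed as a[i][j] (the function name and MIN macro here are illustrative, not taken from the course code):

    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    /* Serial Floyd's algorithm: on return, a[i][j] holds the length of
       the shortest path from vertex i to vertex j. */
    void floyd (int **a, int n)
    {
        int i, j, k;
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                for (j = 0; j < n; j++)
                    a[i][j] = MIN(a[i][j], a[i][k] + a[k][j]);
    }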

Page 6: CSCE569 Parallel Computing

Why It Works

[Figure: vertices i, k, and j, with arcs showing the three paths below]

- Shortest path from i to k through 0, 1, …, k-1
- Shortest path from k to j through 0, 1, …, k-1
- Shortest path from i to j through 0, 1, …, k-1
- The first two paths were computed in previous iterations
- In iteration k the algorithm asks whether going through vertex k (the first two paths combined) is shorter than the best i-to-j path found so far; if so, a[i,j] is updated to a[i,k] + a[k,j]

Page 7: CSCE569 Parallel Computing

Dynamic 1-D Array Creation

[Figure: the pointer A on the run-time stack points to a block of elements allocated on the heap]
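A minimal C sketch of the idea, assuming an array of n ints (the names are illustrative):

    #include <stdlib.h>

    /* Allocate a 1-D array of n ints: the pointer A sits on the caller's
       run-time stack, while the n elements themselves live on the heap. */
    int *create_vector (int n)
    {
        int *A = (int *) malloc (n * sizeof(int));
        return A;    /* caller releases the heap block with free(A) */
    }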

Page 8: CSCE569 Parallel Computing

Dynamic 2-D Array Creation

[Figure: on the run-time stack, Bstorage points to one contiguous heap block holding all the elements, and B points to a heap array of row pointers into that block]
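A C sketch of this two-step allocation, assuming an n x m matrix of ints and using the names Bstorage and B from the figure (the function name is illustrative):

    #include <stdlib.h>

    /* Allocate an n x m array of ints in two pieces, as in the figure:
       Bstorage is one contiguous heap block holding all n*m elements,
       and B is an array of n row pointers into it, so B[i][j] works
       while the whole matrix is still reachable through one pointer. */
    int **create_matrix (int n, int m)
    {
        int  *Bstorage = (int *)  malloc (n * m * sizeof(int));
        int **B        = (int **) malloc (n * sizeof(int *));
        int   i;

        for (i = 0; i < n; i++)
            B[i] = &Bstorage[i * m];
        return B;
    }

Keeping the elements contiguous is what later lets a whole block of rows be read from the file or passed to a single MPI call as one buffer.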

Page 9: CSCE569 Parallel Computing

Designing the Parallel Algorithm
- Partitioning
- Communication
- Agglomeration and Mapping

Page 10: CSCE569 Parallel Computing

Partitioning
- Domain or functional decomposition?
- Look at the pseudocode: the same assignment statement is executed n³ times
- No functional parallelism
- Domain decomposition: divide matrix A into its n² elements

Page 11: CSCE569 Parallel Computing

Communication

[Figure: grid of primitive tasks, one per matrix element, showing the update of a[3,4] when k = 1]

- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row

Page 12: CSCE569 Parallel Computing

Agglomeration and Mapping
- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy:
  - Agglomerate tasks to minimize communication
  - Create one task per MPI process

Page 13: CSCE569 Parallel Computing

Two Data Decompositions

[Figure: the matrix divided two ways: rowwise block striped and columnwise block striped]

Page 14: CSCE569 Parallel Computing

Comparing Decompositions
- Columnwise block striped
  - Broadcast within columns eliminated
- Rowwise block striped
  - Broadcast within rows eliminated
  - Reading the matrix from a file is simpler
- Choose rowwise block striped decomposition
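With the rowwise block-striped decomposition, process id owns a contiguous block of roughly n/p rows. One common way to compute the block boundaries in C is a set of macros like the following (the names are illustrative, not necessarily those used in the course code):

    /* Rows owned by process id when n rows are divided among p processes */
    #define BLOCK_LOW(id,p,n)   ((id)*(n)/(p))                      /* first row owned   */
    #define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id)+1,p,n) - 1)         /* last row owned    */
    #define BLOCK_SIZE(id,p,n)  (BLOCK_HIGH(id,p,n) - BLOCK_LOW(id,p,n) + 1)
    #define BLOCK_OWNER(k,p,n)  (((p)*((k)+1)-1)/(n))               /* process owning row k */

For example, with n = 1000 rows and p = 8 processes, process 0 owns rows 0-124 and process 7 owns rows 875-999.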

Page 15: CSCE569 Parallel Computing

File Input

[Figure: the last process reads the matrix file one block of rows at a time and passes each block to the process that owns it]
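A simplified C sketch of this input pattern, assuming the matrix is stored in row-major binary form as ints, the contiguous 2-D allocation and block macros sketched earlier, and the MPI_Send/MPI_Recv calls introduced on the following slides (names and file handling are illustrative; error handling omitted):

    #include <stdio.h>
    #include <mpi.h>

    /* Process p-1 reads one block of rows at a time and sends it to its
       owner; the final block it reads is its own and stays in place. */
    void read_row_striped (char *path, int **a, int n, int id, int p)
    {
        int dest, rows;
        MPI_Status status;

        if (id == p-1) {
            FILE *f = fopen (path, "rb");
            for (dest = 0; dest < p; dest++) {
                rows = BLOCK_SIZE(dest,p,n);
                fread (a[0], sizeof(int), (size_t) rows * n, f);
                if (dest < p-1)
                    MPI_Send (a[0], rows * n, MPI_INT, dest, 0, MPI_COMM_WORLD);
            }
            fclose (f);
        } else {
            MPI_Recv (a[0], BLOCK_SIZE(id,p,n) * n, MPI_INT, p-1, 0,
                      MPI_COMM_WORLD, &status);
        }
    }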

Page 16: CSCE569 Parallel Computing

Pop Quiz

Why don’t we input the entire file at once and then scatter its contents among the processes, allowing concurrent message passing?

Page 17: CSCE569 Parallel Computing

Point-to-point Communication
- Involves a pair of processes
- One process sends a message
- The other process receives the message

Page 18: CSCE569 Parallel Computing

Send/Receive Not Collective

Page 19: CSCE569 Parallel Computing

Function MPI_Send

int MPI_Send (

void *message,

int count,

MPI_Datatype datatype,

int dest,

int tag,

MPI_Comm comm

)

Page 20: CSCE569 Parallel Computing

Function MPI_Recv

int MPI_Recv (

void *message,

int count,

MPI_Datatype datatype,

int source,

int tag,

MPI_Comm comm,

MPI_Status *status

)

Page 21: CSCE569 Parallel Computing

Coding Send/Receive

if (ID == j) {
   …
   Receive from i
   …
}
…
if (ID == i) {
   …
   Send to j
   …
}

Receive is before Send. Why does this work?
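A minimal runnable C version of this pattern, with illustrative ranks i = 0 and j = 1 and a single int as the message; the following slides explain why posting the receive first is safe:

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int id, value;
        MPI_Status status;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);

        if (id == 1) {                  /* process "j": receive */
            MPI_Recv (&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf ("Process 1 received %d\n", value);
        }
        if (id == 0) {                  /* process "i": send */
            value = 42;
            MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize ();
        return 0;
    }

Run it with at least two processes, e.g. with mpirun -np 2.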

Page 22: CSCE569 Parallel Computing

Inside MPI_Send and MPI_Recv

[Figure: on the sending process, MPI_Send copies the message from program memory into a system buffer; the message travels to the receiving process's system buffer, and MPI_Recv copies it into program memory]

Page 23: CSCE569 Parallel Computing

Return from MPI_Send
- Function blocks until the message buffer is free
- The message buffer is free when
  - the message has been copied to a system buffer, or
  - the message has been transmitted
- Typical scenario
  - Message copied to system buffer
  - Transmission overlaps computation

Page 24: CSCE569 Parallel Computing

Return from MPI_Recv
- Function blocks until the message is in the buffer
- If the message never arrives, the function never returns

Page 25: CSCE569 Parallel Computing

Deadlock
- Deadlock: a process waiting for a condition that will never become true
- It is easy to write send/receive code that deadlocks
  - Two processes: both receive before they send
  - Send tag doesn't match receive tag
  - Process sends message to the wrong destination process
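A minimal C sketch of the first failure mode and its fix, assuming two processes (ranks 0 and 1, names illustrative) that each want the other's int: if both ranks called MPI_Recv before MPI_Send, neither receive would ever be satisfied. Ordering the calls differently on the two ranks (or using MPI_Sendrecv) avoids the deadlock:

    #include <mpi.h>

    /* Exchange one int between ranks 0 and 1 without deadlock:
       rank 0 sends first, rank 1 receives first. If both ranks
       called MPI_Recv before MPI_Send, neither call would return. */
    void exchange (int id, int mine, int *theirs)
    {
        int other = 1 - id;
        MPI_Status status;

        if (id == 0) {
            MPI_Send (&mine,  1, MPI_INT, other, 0, MPI_COMM_WORLD);
            MPI_Recv (theirs, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv (theirs, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
            MPI_Send (&mine,  1, MPI_INT, other, 0, MPI_COMM_WORLD);
        }
    }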

Page 26: CSCE569 Parallel Computing

Computational Complexity
- Innermost loop has complexity Θ(n)
- Middle loop executed at most ⌈n/p⌉ times
- Outer loop executed n times
- Overall complexity Θ(n³/p)

Page 27: CSCE569 Parallel Computing

Communication Complexity
- No communication in the inner loop
- No communication in the middle loop
- Broadcast in the outer loop: complexity is Θ(n log p)
- Overall complexity Θ(n² log p)

Page 28: CSCE569 Parallel Computing

Execution Time Expression (1)

The expected parallel execution time is

    n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)

where
- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- ⌈log p⌉: messages per broadcast
- λ + 4n/β: message-passing time (latency λ plus transmission time 4n/β)
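A small C helper that evaluates this expression, with the machine parameters χ (seconds per cell update), λ (message latency in seconds), and β (bandwidth in bytes per second) passed in rather than assumed; the function name is illustrative:

    #include <math.h>

    /* Predicted execution time (seconds) of parallel Floyd's algorithm
       for n vertices on p processes, per the expression above:
       computation term plus one row broadcast per outer-loop iteration. */
    double predicted_time (int n, int p, double chi, double lambda, double beta)
    {
        double comp = (double) n * ceil ((double) n / p) * n * chi;
        double comm = (double) n * ceil (log2 ((double) p)) * (lambda + 4.0 * n / beta);
        return comp + comm;
    }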

Page 29: CSCE569 Parallel Computing

Computation/communication Overlap

Page 30: CSCE569 Parallel Computing

Execution Time Expression (2)

With computation/communication overlap, the expected execution time becomes

    n ⌈n/p⌉ n χ + n ⌈log p⌉ λ + ⌈log p⌉ (4n/β)

where
- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- ⌈log p⌉: messages per broadcast
- λ: message-passing time (latency)
- 4n/β: message transmission

Page 31: CSCE569 Parallel Computing

Predicted vs. Actual Performance

    Processes   Predicted (sec)   Actual (sec)
        1           25.54            25.54
        2           13.02            13.89
        3            9.01             9.60
        4            6.89             7.29
        5            5.86             5.99
        6            5.01             5.16
        7            4.40             4.50
        8            3.94             3.98

Page 32: CSCE569 Parallel Computing

Summary
- Two matrix decompositions
  - Rowwise block striped
  - Columnwise block striped
- Blocking send/receive functions
  - MPI_Send
  - MPI_Recv
- Overlapping communications with computations