Lecture 4, TTH 03:30AM-04:45PM
Dr. Jianjun Hu, http://mleg.cse.sc.edu/edu/csce569/
CSCE569 Parallel Computing
University of South Carolina, Department of Computer Science and Engineering

TRANSCRIPT

Page 1:

Lecture 4
TTH 03:30AM-04:45PM

Dr. Jianjun Hu
http://mleg.cse.sc.edu/edu/csce569/

CSCE569 Parallel Computing
University of South Carolina
Department of Computer Science and Engineering

Page 2:

Outline: Parallel Programming Model and Algorithm Design

Task/channel model
Algorithm design methodology
Case studies

Page 3:

Task/Channel Model

Parallel computation = set of tasks
A task consists of:
  - A program
  - Local memory
  - A collection of I/O ports

Tasks interact by sending messages through channels

Page 4:

Task/Channel Model

(Diagram: tasks connected by channels)

Page 5:

Foster's Design Methodology

Partitioning
Communication
Agglomeration
Mapping

Page 6:

Foster's Methodology

(Diagram: Problem → Partitioning → Communication → Agglomeration → Mapping)

Page 7:

Partitioning

Dividing computation and data into pieces

Domain decomposition
  - Divide data into pieces
  - Determine how to associate computations with the data

Functional decomposition
  - Divide computation into pieces
  - Determine how to associate data with the computations
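As a concrete illustration of domain decomposition (a sketch, not taken from the slides), the C fragment below splits an n-element array into one contiguous block per task; the helper names block_low, block_high, and block_size are hypothetical.

/* Block domain decomposition: task id (0 <= id < p) owns a contiguous
 * slice of an n-element array; block sizes differ by at most one. */
#include <stdio.h>

int block_low(int id, int p, int n)  { return (int)((long)id * n / p); }
int block_high(int id, int p, int n) { return block_low(id + 1, p, n) - 1; }
int block_size(int id, int p, int n) { return block_high(id, p, n) - block_low(id, p, n) + 1; }

int main(void)
{
    int n = 10, p = 3;
    for (int id = 0; id < p; id++)
        printf("task %d owns elements %d..%d (%d items)\n",
               id, block_low(id, p, n), block_high(id, p, n), block_size(id, p, n));
    return 0;
}

With n = 10 and p = 3 this prints blocks of 3, 3, and 4 elements, so no task holds more than one extra element.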

Page 8:

Example Domain Decompositions

Page 9:

Example Functional Decomposition

Page 10:

Partitioning Checklist

At least 10x more primitive tasks than processors in target computer
Minimize redundant computations and redundant data storage
Primitive tasks roughly the same size
Number of tasks an increasing function of problem size

Page 11:

Communication

Determine values passed among tasks

Local communication
  - Task needs values from a small number of other tasks
  - Create channels illustrating data flow

Global communication
  - Significant number of tasks contribute data to perform a computation
  - Don't create channels for them early in design

Page 12:

Communication Checklist

Communication operations balanced among tasks
Each task communicates with only a small group of neighbors
Tasks can perform communications concurrently
Tasks can perform computations concurrently

Page 13:

Agglomeration

Grouping tasks into larger tasks

Goals
  - Improve performance
  - Maintain scalability of program
  - Simplify programming

In MPI programming, the goal is often to create one agglomerated task per processor

Page 14:

Agglomeration Can Improve Performance

Eliminate communication between primitive tasks agglomerated into a consolidated task
Combine groups of sending and receiving tasks

Page 15:

Agglomeration Checklist

Locality of parallel algorithm has increased
Replicated computations take less time than the communications they replace
Data replication doesn't affect scalability
Agglomerated tasks have similar computational and communication costs
Number of tasks increases with problem size
Number of tasks suitable for likely target systems
Tradeoff between agglomeration and code modification costs is reasonable

Page 16:

Mapping

Process of assigning tasks to processors
Centralized multiprocessor: mapping done by operating system
Distributed memory system: mapping done by user

Conflicting goals of mapping
  - Maximize processor utilization
  - Minimize interprocessor communication

Page 17:

Mapping Example

Page 18:

Optimal Mapping

Finding an optimal mapping is NP-hard
Must rely on heuristics

Page 19:

Mapping Decision Tree

Static number of tasks
  - Structured communication
      - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
      - Variable computation time per task: cyclically map tasks to processors
  - Unstructured communication: use a static load balancing algorithm
Dynamic number of tasks

Page 20:

Mapping Strategy

Static number of tasks: see the decision tree on the previous slide
Dynamic number of tasks
  - Frequent communication between tasks: use a dynamic load balancing algorithm
  - Many short-lived tasks: use a run-time task-scheduling algorithm

Page 21:

Mapping Checklist

Considered designs based on one task per processor and multiple tasks per processor
Evaluated static and dynamic task allocation
If dynamic task allocation chosen, task allocator is not a bottleneck to performance
If static task allocation chosen, ratio of tasks to processors is at least 10:1

Page 22:

Case Studies

Boundary value problem
Finding the maximum
The n-body problem
Adding data input

Page 23:

Boundary Value Problem

(Figure: a thin rod surrounded by insulation, with its ends in ice water)

Page 24:

Rod Cools as Time Progresses

Page 25:

Finite Difference Approximation
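A standard finite-difference update for the 1-D rod, stated here as an assumption since the slide's own equation is not reproduced in this transcript:

    u_{i,j+1} = u_{i,j} + (k / h^2) (u_{i-1,j} - 2 u_{i,j} + u_{i+1,j})

where u_{i,j} is the temperature at grid point i after time step j, h is the spatial step size, and k is the time step size.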

Page 26:

Partitioning

One data item per grid point
Associate one primitive task with each grid point
Two-dimensional domain decomposition

Page 27:

Communication

Identify communication pattern between primitive tasks
Each interior primitive task has three incoming and three outgoing channels

Page 28:

Agglomeration and Mapping

(Figure: agglomeration of the primitive grid-point tasks)

Page 29:

Sequential Execution Time

χ – time to update one element
n – number of elements
m – number of iterations

Sequential execution time: mχ(n − 1)

Page 30:

Parallel Execution Time

p – number of processors
λ – message latency

Parallel execution time: m(χ(n − 1)/p + 2λ)
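A minimal sketch (not from the slides) of the per-iteration work behind this formula, assuming one agglomerated block of the rod per MPI process with a ghost cell at each end; the names update, u, u_new, local_n, and r are hypothetical. The two neighbor exchanges account for the 2λ term, and the loop over local_n points accounts for the χ(n − 1)/p term.

/* One iteration of the 1-D finite-difference update on a block of the rod.
 * Each process stores local_n interior points plus two ghost cells,
 * u[0] and u[local_n + 1]. */
#include <mpi.h>

void update(double *u, double *u_new, int local_n, double r,
            int left, int right, MPI_Comm comm)
{
    /* Exchange boundary values with the neighboring processes
       (pass MPI_PROC_NULL for left/right at the ends of the rod). */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);

    /* Finite-difference update of the interior points. */
    for (int i = 1; i <= local_n; i++)
        u_new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}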

Page 31:

Finding the Maximum Error

Computed:   0.15    0.16    0.16    0.19
Correct:    0.15    0.16    0.17    0.18
Error (%):  0.00%   0.00%   6.25%   5.26%

Maximum error: 6.25%

Page 32:

Reduction

Given an associative operator ⊕, compute a_0 ⊕ a_1 ⊕ a_2 ⊕ … ⊕ a_{n-1}

Examples
  - Add
  - Multiply
  - And, Or
  - Maximum, Minimum

Page 33:

Parallel Reduction Evolution

Page 34:

Parallel Reduction Evolution

Page 35:

Parallel Reduction Evolution

Page 36:

Binomial Trees

Subgraph of hypercube

Page 37:

Finding Global Sum

Initial values (one per task):
 4   2   0   7
-3   5  -6  -3
 8   1   2   3
-4   4   6  -1

Page 38:

Finding Global Sum

After the first reduction step (8 partial sums):
 1   7  -6   4
 4   5   8   2

Page 39:

Finding Global Sum

After the second reduction step (4 partial sums):
 8  -2
 9  10

Page 40:

Finding Global Sum

After the third reduction step (2 partial sums):
17   8

Page 41:

Finding Global Sum

Final result after the last step: 25

(Binomial tree reduction)
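A sketch of the binomial-tree global sum illustrated above, written with MPI point-to-point calls; it assumes the number of processes is a power of two, and the function name tree_sum is hypothetical. MPI_Reduce gives the same result and typically uses a tree of this shape internally.

/* Binomial-tree reduction of one double per process: after log2(p)
 * steps the global sum is held by rank 0.  Assumes p is a power of two. */
#include <mpi.h>

double tree_sum(double local, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == step) {
            /* Higher-numbered member of this pair: send the partial sum
               to the partner and drop out of the reduction. */
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        } else if (rank % (2 * step) == 0) {
            /* Lower-numbered member: receive the partner's partial sum. */
            double partner;
            MPI_Recv(&partner, 1, MPI_DOUBLE, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            local += partner;
        }
    }
    return local;   /* the complete sum only on rank 0 */
}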

Page 42:

Agglomeration

Page 43:

Agglomeration

(Figure: each agglomerated task computes a local sum of its elements before the tree reduction)
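A sketch of the agglomerated sum this figure suggests (an interpretation, not the slide's own code): each process first adds up its own block of values sequentially, then a single reduction combines the p partial sums; the names block_sum, local_sum, and global_sum are hypothetical.

/* Agglomerated global sum: p tasks instead of n primitive tasks.
 * Each process sums its block locally, then one reduction combines
 * the p partial sums. */
#include <mpi.h>

double block_sum(const double *block, int count, MPI_Comm comm)
{
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < count; i++)
        local_sum += block[i];

    /* Combine the per-process partial sums; the result lands on rank 0. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return global_sum;   /* valid on rank 0 only */
}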

Page 44:

The n-body Problem

Page 45:

The n-body Problem

Page 46:

Partitioning

Domain partitioning
Assume one task per particle
Task has particle's position and velocity vector

Iteration
  - Get positions of all other particles
  - Compute new position and velocity
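A sketch (not from the slides) of one agglomerated n-body step, assuming every process owns the same number of particles and exchanges positions with MPI_Allgather; the names nbody_step, my_pos, all_pos, and local_n are hypothetical.

/* One n-body time step: each process owns local_n particles (the same
 * number on every process) and needs the positions of all particles
 * before it can compute forces. */
#include <mpi.h>

void nbody_step(double *my_pos, double *all_pos, int local_n, MPI_Comm comm)
{
    /* All-gather the 3-D positions: every process contributes
       3 * local_n doubles and receives the full position array. */
    MPI_Allgather(my_pos, 3 * local_n, MPI_DOUBLE,
                  all_pos, 3 * local_n, MPI_DOUBLE, comm);

    /* ... compute the forces on the local particles from all_pos and
       update their positions and velocities ... */
}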

Page 47:

Gather

Page 48:

All-gather

Page 49:

Complete Graph for All-gather

Page 50:

Hypercube for All-gather

Page 51:

Communication Time

Hypercube: Σ from i = 1 to log p of (λ + 2^(i−1) n/(βp)) = λ log p + n(p − 1)/(βp)

Complete graph: (p − 1)(λ + n/(βp)) = (p − 1)λ + n(p − 1)/(βp)

Here λ is the message latency and β is the channel bandwidth (data items per unit time), so sending k items over one channel takes λ + k/β time.
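As a worked comparison with assumed values (not from the slides): take p = 8, n = 8192 data items, λ = 100 time units, and β = 1 item per time unit. The hypercube all-gather then costs 100·log 8 + 8192·7/8 = 300 + 7168 = 7468 time units, while the complete-graph exchange costs 7·(100 + 8192/8) = 7·1124 = 7868 time units. The hypercube's advantage comes from paying the latency λ only log p times rather than p − 1 times.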

Page 52:

Adding Data Input

Page 53:

Scatter

Page 54:

Scatter in log p Steps

(Diagram: recursive halving scatter of blocks 1 2 3 4 5 6 7 8; the holder of 12345678 sends 5678 to a partner, the halves 1234 and 5678 are then split into 12, 34, 56, 78, and so on, until each process holds one block after log p steps)
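A sketch (not from the slides) of the data-input step, assuming process 0 reads the whole array from a file and MPI_Scatter hands every process an equal-sized block; the names read_and_scatter, full_data, and my_block are hypothetical.

/* Process 0 reads n doubles and scatters n/p of them to each process
 * (assumes p divides n evenly). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void read_and_scatter(const char *filename, double *my_block,
                      int n, int p, MPI_Comm comm)
{
    int rank;
    double *full_data = NULL;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        full_data = malloc(n * sizeof(double));
        FILE *f = fopen(filename, "rb");
        if (f == NULL) MPI_Abort(comm, 1);
        fread(full_data, sizeof(double), n, f);
        fclose(f);
    }

    /* Hand n/p items to each process; a scatter can be carried out in
       log p steps, as the diagram above illustrates. */
    MPI_Scatter(full_data, n / p, MPI_DOUBLE,
                my_block, n / p, MPI_DOUBLE, 0, comm);

    if (rank == 0) free(full_data);
}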

Page 55:

Summary: Task/Channel Model

Parallel computation
  - Set of tasks
  - Interactions through channels

Good designs
  - Maximize local computations
  - Minimize communications
  - Scale up

Page 56:

Summary: Design Steps

Partition computation
Agglomerate tasks
Map tasks to processors

Goals
  - Maximize processor utilization
  - Minimize inter-processor communication

Page 57:

Summary: Fundamental Algorithms

Reduction
Gather and scatter
All-gather