TRANSCRIPT
Lecture 4, TTH 03:30PM-04:45PM
Dr. Jianjun Hu, http://mleg.cse.sc.edu/edu/csce569/
CSCE569 Parallel Computing
University of South Carolina, Department of Computer Science and Engineering
Outline: Parallel Programming Model and Algorithm Design
- Task/channel model
- Algorithm design methodology
- Case studies
Task/Channel Model
- Parallel computation = set of tasks
- Task:
  - Program
  - Local memory
  - Collection of I/O ports
- Tasks interact by sending messages through channels
Task/Channel Model
[Figure: tasks connected by channels]
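A minimal sketch of the model in code, assuming MPI as the message-passing layer: each rank plays a task with its own local memory, and a point-to-point send/receive pair stands in for a channel. The payload value is arbitrary.

```c
/* Two tasks connected by one channel, sketched with MPI point-to-point
 * messages. Build: mpicc channel.c -o channel; run: mpirun -np 2 ./channel */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    if (id == 0) {
        value = 42;  /* task 0's local memory (arbitrary payload) */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* output port */
    } else if (id == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* input port */
        printf("task 1 received %d over the channel\n", value);
    }
    MPI_Finalize();
    return 0;
}
```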
Foster’s Design Methodology
- Partitioning
- Communication
- Agglomeration
- Mapping
Foster’s Methodology
[Figure: Problem → Partitioning → Communication → Agglomeration → Mapping]
Partitioning
- Dividing computation and data into pieces
- Domain decomposition (see the sketch after this list):
  - Divide data into pieces
  - Determine how to associate computations with the data
- Functional decomposition:
  - Divide computation into pieces
  - Determine how to associate data with the computations
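As a concrete sketch of domain decomposition, the common block-distribution formulas for splitting n data items among p tasks; the macro names are illustrative (they follow the id*n/p convention used in Quinn's text), not part of any standard API.

```c
/* Block domain decomposition of n items over p tasks: task id owns
 * indices BLOCK_LOW..BLOCK_HIGH. Illustrative macro names. */
#define BLOCK_LOW(id, p, n)   ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, (p), (n)) - 1)
#define BLOCK_SIZE(id, p, n)  (BLOCK_HIGH((id), (p), (n)) - BLOCK_LOW((id), (p), (n)) + 1)

/* Task id then associates computation with its own data only:
 *   for (i = BLOCK_LOW(id,p,n); i <= BLOCK_HIGH(id,p,n); i++) ... */
```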
Example Domain Decompositions
Example Functional Decomposition
Partitioning Checklist
- At least 10x more primitive tasks than processors in target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size
Communication
- Determine values passed among tasks
- Local communication:
  - Task needs values from a small number of other tasks
  - Create channels illustrating data flow
- Global communication:
  - Significant number of tasks contribute data to perform a computation
  - Don’t create channels for them early in design
Communication Checklist
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
Agglomeration
- Grouping tasks into larger tasks
- Goals:
  - Improve performance
  - Maintain scalability of program
  - Simplify programming
- In MPI programming, goal often to create one agglomerated task per processor
Agglomeration Can Improve Performance
- Eliminate communication between primitive tasks agglomerated into consolidated task
- Combine groups of sending and receiving tasks
Agglomeration Checklist
- Locality of parallel algorithm has increased
- Replicated computations take less time than communications they replace
- Data replication doesn’t affect scalability
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems
- Tradeoff between agglomeration and code modification costs is reasonable
Mapping
- Process of assigning tasks to processors
- Centralized multiprocessor: mapping done by operating system
- Distributed memory system: mapping done by user
- Conflicting goals of mapping:
  - Maximize processor utilization
  - Minimize interprocessor communication
Mapping Example
Optimal Mapping
- Finding optimal mapping is NP-hard
- Must rely on heuristics
Mapping Decision Tree
- Static number of tasks:
  - Structured communication:
    - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
    - Variable computation time per task: cyclically map tasks to processors
  - Unstructured communication: use a static load-balancing algorithm
- Dynamic number of tasks: see next slide
Mapping Strategy
- Dynamic number of tasks:
  - Frequent communications between tasks: use a dynamic load-balancing algorithm
  - Many short-lived tasks: use a run-time task-scheduling algorithm
A sketch contrasting block and cyclic static mappings follows.
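The helper names below are hypothetical; block mapping suits constant per-task cost, while cyclic mapping spreads variable-cost tasks more evenly across processors.

```c
/* Static mappings of t tasks onto p processors (illustrative helpers). */
int block_owner(int task, int t, int p) {   /* contiguous chunks of tasks */
    return (p * (task + 1) - 1) / t;
}
int cyclic_owner(int task, int p) {         /* round-robin: task i -> i mod p */
    return task % p;
}
```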
Mapping Checklist
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation chosen, task allocator is not a bottleneck to performance
- If static task allocation chosen, ratio of tasks to processors is at least 10:1
Case Studies
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input
Boundary Value Problem
[Figure: rod with ice water at both ends, surrounded by insulation]
Rod Cools as Time Progresses
[Figure: temperature profile of the rod flattening over successive time steps]
Finite Difference Approximation
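The slide’s equation did not survive transcription; what follows is a sequential sketch of the standard explicit finite-difference update for the rod, assuming u[0] and u[n] are the fixed (ice-water) boundary values and r = kΔt/Δx².

```c
/* Explicit finite-difference update for the cooling rod:
 * next u[i] = r*u[i-1] + (1-2r)*u[i] + r*u[i+1].
 * Grid points 0..n; u[0] and u[n] are fixed boundary values.
 * Each of the m iterations updates the n-1 interior points. */
void relax(double u[], double scratch[], int n, int m, double r) {
    for (int j = 0; j < m; j++) {
        for (int i = 1; i < n; i++)
            scratch[i] = r * u[i-1] + (1.0 - 2.0 * r) * u[i] + r * u[i+1];
        for (int i = 1; i < n; i++)
            u[i] = scratch[i];          /* advance one time step */
    }
}
```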
Partitioning
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
Communication
- Identify communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
Agglomeration and Mapping
Agglomeration
Sequential Execution Time
- χ: time to update one element
- n: number of elements
- m: number of iterations
- Sequential execution time: m(n-1)χ
Parallel Execution Time
- p: number of processors
- λ: message latency
- Parallel execution time: m(χ⌈(n-1)/p⌉ + 2λ)
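To make the formulas concrete, a worked example with purely hypothetical parameter values:

```latex
% Hypothetical values: n = 10^6 elements, m = 100 iterations, p = 8,
% \chi = 100\,\mathrm{ns} = 10^{-7}\,\mathrm{s}, \lambda = 10\,\mu\mathrm{s} = 10^{-5}\,\mathrm{s}.
T_{\text{seq}} = m(n-1)\chi \approx 100 \cdot 10^{6} \cdot 10^{-7}\,\mathrm{s} = 10\,\mathrm{s}
T_{\text{par}} = m\left(\chi\left\lceil \frac{n-1}{p} \right\rceil + 2\lambda\right)
             \approx 100 \cdot (0.0125 + 2\cdot 10^{-5})\,\mathrm{s} \approx 1.25\,\mathrm{s}
```

With these numbers the communication term 2λ is negligible; it matters when p grows or the per-element work χ shrinks.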
Finding the Maximum Error
Computed 0.15 0.16 0.16 0.19Correct 0.15 0.16 0.17 0.18Error (%) 0.00% 0.00% 6.25% 5.26%
6.25%
Reduction
- Given associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ a(n-1)
- Examples: add; multiply; and, or; maximum, minimum
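In MPI the whole reduction is one collective call; a minimal sketch using MPI_MAX (the other operators listed map to MPI_SUM, MPI_PROD, MPI_LAND/MPI_LOR, MPI_MIN). Each task’s local value here is an arbitrary stand-in.

```c
/* Reduction as an MPI collective: every task contributes one value,
 * and the global maximum lands on task 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    int local = id * id % 7;   /* stand-in for each task's local value */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (id == 0) printf("global maximum = %d\n", global);
    MPI_Finalize();
    return 0;
}
```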
Parallel Reduction Evolution
[Three figures: successive stages of a parallel reduction]
Binomial Trees
- Subgraph of hypercube
Finding Global Sum (worked example, 16 values reduced pairwise)
Initial values:
  4  2  0  7
 -3  5 -6 -3
  8  1  2  3
 -4  4  6 -1
After step 1:
  1  7 -6  4
  4  5  8  2
After step 2:
  8 -2
  9 10
After step 3:
 17  8
After step 4 (global sum):
 25
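The same log-step pairwise reduction, sketched as sequential code assuming the number of values is a power of two; each pass of the outer loop corresponds to one parallel step. The pairing order differs from the figure, but any associative pairing gives the same sum.

```c
/* Pairwise (tree) reduction: fold the upper half of the active values
 * into the lower half, halving the active set each step.
 * With the 16 values above this yields 25 in log2(16) = 4 steps. */
double tree_sum(double a[], int n) {   /* n must be a power of two */
    for (int half = n / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            a[i] += a[i + half];       /* one "parallel step" per half */
    return a[0];
}
```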
Binomial Tree
Agglomeration
[Figures: primitive tasks agglomerated into per-processor "sum" tasks, then combined]
The n-body Problem
[Figures: particles exerting forces on one another]
Partitioning
- Domain partitioning
- Assume one task per particle
- Task has particle’s position, velocity vector
- Iteration (sketched below):
  - Get positions of all other particles
  - Compute new position, velocity
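Getting every other particle’s position each iteration is an all-gather, the operation named on the following slides. A toy sketch, assuming one particle per task and a made-up 1-D force law; the names and physics are placeholders, not the actual case-study code.

```c
/* One n-body iteration, one particle per task: all-gather the positions,
 * then update this task's particle locally. Toy 1-D force, placeholder names. */
#include <mpi.h>

void nbody_step(double *x, double *v, double *all_x, int p, double dt) {
    MPI_Allgather(x, 1, MPI_DOUBLE,            /* send my position      */
                  all_x, 1, MPI_DOUBLE,        /* receive all positions */
                  MPI_COMM_WORLD);
    double f = 0.0;
    for (int i = 0; i < p; i++)                /* toy pairwise "force"  */
        if (all_x[i] != *x)
            f += 1.0 / (all_x[i] - *x);
    *v += f * dt;                              /* new velocity          */
    *x += *v * dt;                             /* new position          */
}
```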
Gather
All-gather
Complete Graph for All-gather
Hypercube for All-gather
Communication Time (for the all-gather step; λ: message latency, β: bandwidth in data items per second)
- Hypercube: Σ (i = 1 to log p) of (λ + 2^(i-1)·n/(βp)) = λ·log p + n(p-1)/(βp)
- Complete graph: (p-1)·(λ + n/(βp))
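A worked comparison with purely hypothetical values, chosen so the latency term dominates (many processors, small messages):

```latex
% Hypothetical values: p = 1024 (so \log p = 10), n = 1024 items,
% \lambda = 10\,\mu\mathrm{s}, \beta = 10^6 items/s.
\text{Hypercube: } \lambda \log p + \frac{n(p-1)}{\beta p}
  \approx 100\,\mu\mathrm{s} + 1023\,\mu\mathrm{s} \approx 1.1\,\mathrm{ms}
\text{Complete graph: } (p-1)\left(\lambda + \frac{n}{\beta p}\right)
  = 1023 \cdot (10 + 1)\,\mu\mathrm{s} \approx 11.3\,\mathrm{ms}
```

Both structures move the same total data; the hypercube pays the startup cost λ only log p times instead of p-1 times, which is why it wins for short messages.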
Adding Data Input
Scatter
Scatter in log p Steps
[Figure: task 0 starts with items 12345678; each step, every holder forwards half of its items: 1234 / 5678, then 12 / 34 / 56 / 78]
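In MPI the distribution is a single collective; a minimal sketch scattering the eight items of the figure, one per task. A quality MPI implementation typically uses a tree scheme like the recursive halving shown above.

```c
/* Scatter 8 items from task 0, one item per task. Run with -np 8. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p, item;
    int items[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* only root's copy is used */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Scatter(items, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("task %d received item %d\n", id, item);
    MPI_Finalize();
    return 0;
}
```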
Summary: Task/Channel Model
- Parallel computation: set of tasks; interactions through channels
- Good designs:
  - Maximize local computations
  - Minimize communications
  - Scale up
Summary: Design Steps
- Partition computation
- Agglomerate tasks
- Map tasks to processors
- Goals:
  - Maximize processor utilization
  - Minimize inter-processor communication
Summary: Fundamental Algorithms
- Reduction
- Gather and scatter
- All-gather