Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 3 Parallel Algorithm Design
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn (modified by Wonyong Sung)
Parallel Programming
- Load balancing: best with data-level parallel processing
- Low communication overhead: architecture dependent
- Precedence relations and scheduling: memory/cache/I/O considerations
Partitioning: dividing both computation and data into pieces
Domain decomposition (partitioning the data, e.g., along the x-axis)
- Divide the data into pieces; in many cases the best partitioning
- SPMD (Single Program, Multiple Data) paradigm
- Good for massively parallel distributed-memory multicomputer systems
Functional decomposition
- Divide the computation into pieces
- May be good for heterogeneous multiprocessors
Partitioning Checklist
- At least 10x more primitive tasks than processors in the target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size (scalable partitioning)
Communication: determine the values passed among tasks
Local communication
- A task needs values from a small number of other tasks
- Create channels illustrating the data flow
Global communication
- A significant number of tasks contribute data to perform a computation
- Don't create channels for them early in the design
Communication Checklist
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
Agglomeration: grouping tasks into larger tasks
Goals
- Eliminate communication between primitive tasks agglomerated into a consolidated task
- Maintain scalability of the program
- Combine groups of sending and receiving tasks
In MPI programming, the goal is often to create one agglomerated task per processor
Agglomeration Checklist
- Locality of the parallel algorithm has increased
- Replicated computations take less time than the communications they replace
- Data replication doesn't affect scalability
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks is suitable for likely target systems
- Tradeoff between agglomeration and code-modification costs is reasonable
Mapping: the process of assigning tasks to processors
MPI and OpenMP programming models
- MPI: purely parallel; a parallel algorithm is needed from the start
- OpenMP: fork-join model; start from a sequential version and parallelize incrementally
Conflicting goals of mapping
- Maximize processor utilization
- Minimize interprocessor communication
Finding an optimal mapping is an NP-hard problem
Mapping Decision Tree
Static number of tasks
- Structured communication
  - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
  - Variable computation time per task: cyclically map tasks to processors
- Unstructured communication
  - Use a static load-balancing algorithm
Dynamic number of tasks
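As a concrete illustration of the "variable computation time per task → cyclic mapping" branch, the plain-C sketch below compares block and cyclic mappings on a hypothetical set of task costs (the cost array and helper names are illustrative, not from the slides):

```c
#include <assert.h>

/* Block mapping: task t goes to processor t / (n/p); assumes p divides n. */
static int block_owner(int t, int n, int p) { return t / (n / p); }

/* Cyclic mapping: task t goes to processor t % p. */
static int cyclic_owner(int t, int p) { return t % p; }

/* Load of processor q = sum of the (hypothetical) costs of its tasks. */
static int proc_load(const int *cost, int n, int p, int q, int cyclic) {
    int load = 0;
    for (int t = 0; t < n; t++) {
        int owner = cyclic ? cyclic_owner(t, p) : block_owner(t, n, p);
        if (owner == q) load += cost[t];
    }
    return load;
}
```

With costs that grow with the task index, block mapping concentrates all the expensive tasks on the last processor, while cyclic mapping interleaves cheap and expensive tasks across processors.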
Mapping Strategy
Static number of tasks
Dynamic number of tasks
- Frequent communication between tasks: use a dynamic load-balancing algorithm, which analyzes the current tasks and produces a new mapping of tasks to processors
- Many short-lived tasks: use a run-time task-scheduling algorithm
Task Scheduling
Centralized method
- One manager and many workers
- When a worker processor has nothing to do, it requests a task from the manager
- The manager can sometimes become a bottleneck
Distributed method
- Each processor maintains its own list of tasks
- Push (processors with too many tasks send some to others) and pull strategies
Hybrid method
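The centralized (manager/worker) pull protocol can be sketched without MPI as a single-threaded simulation: the "manager" is just a shared counter over the task list, and an idle worker "requests" work by taking the next index. All names and task counts here are illustrative:

```c
#include <assert.h>

typedef struct { int next, ntasks; } manager_t;

/* A worker asks the manager for work; -1 means no tasks remain. */
static int request_task(manager_t *m) {
    return (m->next < m->ntasks) ? m->next++ : -1;
}

/* Cycle through nworkers until the manager's list is empty; record how
   many tasks each worker executed and return the total. */
static int run_pull(manager_t *m, int nworkers, int *done_per_worker) {
    int total = 0, t;
    for (int w = 0; w < nworkers; w++) done_per_worker[w] = 0;
    for (int w = 0; ; w = (w + 1) % nworkers) {
        if ((t = request_task(m)) < 0) break;  /* manager exhausted */
        done_per_worker[w]++;                  /* "execute" task t */
        total++;
    }
    return total;
}
```

Because workers only take a task when idle, the pull scheme balances load automatically; the cost is that every request goes through the single manager, which is the bottleneck the slide warns about.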
Mapping Checklist
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation is chosen, the task allocator is not a performance bottleneck
- If static task allocation is chosen, the ratio of tasks to processors is at least 10:1
Case Studies
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input
Partitioning
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
Communication
- Identify the communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
Sequential execution time
χ – time to update an element
n – number of elements
m – number of iterations
Sequential execution time: m(n−1)χ
Parallel Execution Time
p – number of processors
λ – message latency
Parallel execution time: m(χ⌈(n−1)/p⌉ + 2λ)
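The two execution-time formulas can be turned into a small cost model in C. The helper names are ours, and the χ and λ values plugged in below are placeholders for measured constants:

```c
#include <assert.h>

/* Sequential time for the boundary value problem: m(n-1)chi. */
static double seq_time(double chi, int n, int m) {
    return (double)m * (n - 1) * chi;
}

/* Parallel time: m(chi * ceil((n-1)/p) + 2*lambda). */
static double par_time(double chi, int n, int m, int p, double lambda) {
    int per_proc = ((n - 1) + p - 1) / p;   /* integer ceil((n-1)/p) */
    return (double)m * (chi * per_proc + 2.0 * lambda);
}
```

Comparing the two makes the tradeoff on the slide concrete: more processors shrink the χ⌈(n−1)/p⌉ term, but the 2λ latency term per iteration is paid regardless of p.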
Finding the Maximum Error
Computed:  0.15  0.16  0.16  0.19
Correct:   0.15  0.16  0.17  0.18
Error (%): 0.00  0.00  6.25  5.26
Maximum error: 6.25%
Reduction
Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1
Examples
- Add
- Multiply
- And, Or
- Maximum, Minimum
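A sequential reduction parameterized by the associative operator might look like this in C; the function names are illustrative, and any of the example operators above could be passed in:

```c
#include <assert.h>

/* Two of the example associative operators from the slide. */
static int op_add(int a, int b) { return a + b; }
static int op_max(int a, int b) { return a > b ? a : b; }

/* Fold the operator over a[0..n-1]; associativity is what later lets a
   parallel version regroup these applications into a tree. */
static int reduce(const int *a, int n, int (*op)(int, int)) {
    int acc = a[0];
    for (int i = 1; i < n; i++) acc = op(acc, a[i]);
    return acc;
}
```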
Binomial Trees
Subgraph of hypercube
Finding Global Sum
4 2 0 7
-3 5 -6 -3
8 1 2 3
-4 4 6 -1
Finding Global Sum
1 7 -6 4
4 5 8 2
Finding Global Sum
8 -2
9 10
Finding Global Sum
17 8
Finding Global Sum
25
Binomial Tree
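The log-step global sum traced on the preceding slides (16 → 8 → 4 → 2 → 1 values) can be simulated in plain C. This sketch assumes p is a power of two and replaces the actual message passing with array updates:

```c
#include <assert.h>

/* p "processors" each hold one value in v[]. In each of log2(p) steps,
   processor i (for i < half) receives its partner's value from the upper
   half and adds it, halving the number of active values. After the last
   step, processor 0 holds the global sum. */
static int tree_sum(int *v, int p) {
    for (int half = p / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];   /* proc i receives from proc i+half */
    return v[0];
}
```

The pairing order differs from the figure, but because addition is associative the result is the same, and only ⌈log₂ p⌉ communication steps are needed instead of p − 1.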
Agglomeration
Agglomeration
(Figure: the primitive tasks are agglomerated into larger tasks, each computing a local sum before the global reduction.)
Partitioning
Domain partitioning
- Assume one task per particle
- Each task has the particle's position and velocity vector
Iteration
- Get the positions of all other particles
- Compute new position and velocity
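One iteration of this scheme might look as follows, reduced to 1-D for brevity. Unit masses, G = 1, explicit Euler integration, and the small softening constant in the denominator are illustrative assumptions, not part of the slides:

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, v; } particle;

static void nbody_step(particle *b, int n, double dt) {
    double acc[64];                 /* accelerations; assumes n <= 64 */
    /* Phase 1: every position is available (the "gather" step), so each
       particle sums inverse-square attractions from all the others. */
    for (int i = 0; i < n; i++) {
        acc[i] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d = b[j].x - b[i].x;
            acc[i] += d / (fabs(d) * d * d + 1e-9);   /* ~ d / |d|^3 */
        }
    }
    /* Phase 2: update velocity, then position. */
    for (int i = 0; i < n; i++) {
        b[i].v += acc[i] * dt;
        b[i].x += b[i].v * dt;
    }
}
```

The two-phase structure matters: all accelerations are computed from the old positions before any particle moves, which is exactly why each task must first obtain every other particle's position (the all-gather of the next slides).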
Gather
All-gather
Scatter
Scatter in log p Steps
(Figure: one processor starts with all 8 items (12345678); in each of log p = 3 steps, every processor holding more than one item sends half of its items to a partner, until each of the 8 processors holds a single item.)
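The scatter pattern can be simulated by tracking which block of items each processor holds. This sketch assumes p is a power of two and that processor 0 is the root; the function name is ours:

```c
#include <assert.h>

/* After the call, start[i] is the index of the single item processor i
   holds and len[i] == 1. At each step, every processor holding more than
   d items sends the upper half of its block to the processor d away. */
static void scatter_logp(int p, int *start, int *len) {
    for (int i = 0; i < p; i++) { start[i] = 0; len[i] = 0; }
    len[0] = p;                          /* root holds items 0..p-1 */
    for (int d = p / 2; d >= 1; d /= 2)
        for (int i = 0; i < p; i++)
            if (len[i] > d) {            /* send upper half to proc i+d */
                start[i + d] = start[i] + d;
                len[i + d]   = len[i] - d;
                len[i]       = d;
            }
}
```

Each step doubles the number of processors holding data, so all p processors are reached in log₂ p steps rather than the p − 1 sends a naive root-to-each scatter would need.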
Summary: Task/channel Model
Parallel computation
- A set of tasks
- Interactions through channels
Good designs
- Maximize local computations
- Minimize communications
- Scale up
Summary: Design Steps
- Partition computation
- Agglomerate tasks
- Map tasks to processors
Goals
- Maximize processor utilization
- Minimize inter-processor communication
Summary: Fundamental Algorithms
- Reduction
- Gather and scatter
- All-gather