Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 3 Parallel Algorithm Design
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn (modified by Wonyong Sung)
Parallel Programming
- Load balancing: best with data-level parallel processing
- Low communication overhead: architecture dependent
- Precedence relations and scheduling: memory/cache/I/O considerations
Partitioning: dividing both computation and data into pieces
Domain decomposition (partitioning the data, e.g., along the x-axis)
- Divide the data into pieces; in many cases the best partitioning
- SPMD (Single Program, Multiple Data) paradigm
- Good for massively parallel distributed-memory multicomputer systems
Functional decomposition
- Divide the computation into pieces
- May be good for heterogeneous multiprocessors
Partitioning Checklist
- At least 10x more primitive tasks than processors in the target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size (scalable partitioning)
Communication: determine the values passed among tasks
Local communication
- A task needs values from a small number of other tasks
- Create channels illustrating the data flow
Global communication
- A significant number of tasks contribute data to perform a computation
- Don't create channels for them early in the design
Communication Checklist
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
Agglomeration: grouping tasks into larger tasks
Goals
- Eliminate communication between primitive tasks agglomerated into a consolidated task
- Maintain scalability of the program
- Combine groups of sending and receiving tasks
In MPI programming, the goal is often to create one agglomerated task per processor
Agglomeration Checklist
- Locality of the parallel algorithm has increased
- Replicated computations take less time than the communications they replace
- Data replication doesn't affect scalability
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks is suitable for likely target systems
- Tradeoff between agglomeration and code-modification costs is reasonable
Mapping: the process of assigning tasks to processors
MPI and OpenMP programming models
- MPI: purely parallel; a parallel algorithm is needed from the start
- OpenMP: fork-join model; start from a sequential version and parallelize incrementally
Conflicting goals of mapping
- Maximize processor utilization
- Minimize interprocessor communication
Finding an optimal mapping is an NP-hard problem
Mapping Decision Tree
Static number of tasks
- Structured communication
  - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
  - Variable computation time per task: cyclically map tasks to processors
- Unstructured communication
  - Use a static load-balancing algorithm
Dynamic number of tasks
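As a concrete illustration of the "variable computation time per task → cyclic mapping" branch, the plain-C sketch below compares block and cyclic mappings on a hypothetical set of task costs (the cost array and helper names are illustrative, not from the slides):

```c
#include <assert.h>

/* Block mapping: task t goes to processor t / (n/p); assumes p divides n. */
static int block_owner(int t, int n, int p) { return t / (n / p); }

/* Cyclic mapping: task t goes to processor t % p. */
static int cyclic_owner(int t, int p) { return t % p; }

/* Load of processor q = sum of the (hypothetical) costs of its tasks. */
static int proc_load(const int *cost, int n, int p, int q, int cyclic) {
    int load = 0;
    for (int t = 0; t < n; t++) {
        int owner = cyclic ? cyclic_owner(t, p) : block_owner(t, n, p);
        if (owner == q) load += cost[t];
    }
    return load;
}
```

With costs that grow with the task index, block mapping concentrates all the expensive tasks on the last processor, while cyclic mapping interleaves cheap and expensive tasks across processors.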
Mapping Strategy
Static number of tasks
Dynamic number of tasks
- Frequent communication between tasks: use a dynamic load-balancing algorithm, which analyzes the current tasks and produces a new mapping of tasks to processors
- Many short-lived tasks: use a run-time task-scheduling algorithm
Task Scheduling
Centralized method
- One manager and many workers
- When a worker processor has nothing to do, it requests a task from the manager
- The manager can sometimes become a bottleneck
Distributed method
- Each processor maintains its own list of tasks
- Push (processors with too many tasks send some to others) and pull strategies
Hybrid method
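The centralized (manager/worker) pull protocol can be sketched without MPI as a single-threaded simulation: the "manager" is just a shared counter over the task list, and an idle worker "requests" work by taking the next index. All names and task counts here are illustrative:

```c
#include <assert.h>

typedef struct { int next, ntasks; } manager_t;

/* A worker asks the manager for work; -1 means no tasks remain. */
static int request_task(manager_t *m) {
    return (m->next < m->ntasks) ? m->next++ : -1;
}

/* Cycle through nworkers until the manager's list is empty; record how
   many tasks each worker executed and return the total. */
static int run_pull(manager_t *m, int nworkers, int *done_per_worker) {
    int total = 0, t;
    for (int w = 0; w < nworkers; w++) done_per_worker[w] = 0;
    for (int w = 0; ; w = (w + 1) % nworkers) {
        if ((t = request_task(m)) < 0) break;  /* manager exhausted */
        done_per_worker[w]++;                  /* "execute" task t */
        total++;
    }
    return total;
}
```

Because workers only take a task when idle, the pull scheme balances load automatically; the cost is that every request goes through the single manager, which is the bottleneck the slide warns about.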
Mapping Checklist
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation is chosen, the task allocator is not a performance bottleneck
- If static task allocation is chosen, the ratio of tasks to processors is at least 10:1
Case Studies
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input
Partitioning
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
Communication
- Identify the communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
Sequential execution time
χ – time to update an element
n – number of elements
m – number of iterations
Sequential execution time: m(n−1)χ
Parallel Execution Time
p – number of processors
λ – message latency
Parallel execution time: m(χ⌈(n−1)/p⌉ + 2λ)
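The two execution-time formulas can be turned into a small cost model in C. The helper names are ours, and the χ and λ values plugged in below are placeholders for measured constants:

```c
#include <assert.h>

/* Sequential time for the boundary value problem: m(n-1)chi. */
static double seq_time(double chi, int n, int m) {
    return (double)m * (n - 1) * chi;
}

/* Parallel time: m(chi * ceil((n-1)/p) + 2*lambda). */
static double par_time(double chi, int n, int m, int p, double lambda) {
    int per_proc = ((n - 1) + p - 1) / p;   /* integer ceil((n-1)/p) */
    return (double)m * (chi * per_proc + 2.0 * lambda);
}
```

Comparing the two makes the tradeoff on the slide concrete: more processors shrink the χ⌈(n−1)/p⌉ term, but the 2λ latency term per iteration is paid regardless of p.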
Finding the Maximum Error
Computed:  0.15  0.16  0.16  0.19
Correct:   0.15  0.16  0.17  0.18
Error (%): 0.00  0.00  6.25  5.26
Maximum error: 6.25%
Reduction
Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1
Examples
- Add
- Multiply
- And, Or
- Maximum, Minimum
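A sequential reduction parameterized by the associative operator might look like this in C; the function names are illustrative, and any of the example operators above could be passed in:

```c
#include <assert.h>

/* Two of the example associative operators from the slide. */
static int op_add(int a, int b) { return a + b; }
static int op_max(int a, int b) { return a > b ? a : b; }

/* Fold the operator over a[0..n-1]; associativity is what later lets a
   parallel version regroup these applications into a tree. */
static int reduce(const int *a, int n, int (*op)(int, int)) {
    int acc = a[0];
    for (int i = 1; i < n; i++) acc = op(acc, a[i]);
    return acc;
}
```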
Binomial Trees
Subgraph of hypercube
Finding Global Sum
4 2 0 7
-3 5 -6 -3
8 1 2 3
-4 4 6 -1
Finding Global Sum
1 7 -6 4
4 5 8 2
Finding Global Sum
8 -2
9 10
Finding Global Sum
17 8
Finding Global Sum
25
Binomial Tree
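The log-step global sum traced on the preceding slides (16 → 8 → 4 → 2 → 1 values) can be simulated in plain C. This sketch assumes p is a power of two and replaces the actual message passing with array updates:

```c
#include <assert.h>

/* p "processors" each hold one value in v[]. In each of log2(p) steps,
   processor i (for i < half) receives its partner's value from the upper
   half and adds it, halving the number of active values. After the last
   step, processor 0 holds the global sum. */
static int tree_sum(int *v, int p) {
    for (int half = p / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];   /* proc i receives from proc i+half */
    return v[0];
}
```

The pairing order differs from the figure, but because addition is associative the result is the same, and only ⌈log₂ p⌉ communication steps are needed instead of p − 1.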
Agglomeration
Agglomeration
(Figure: the primitive tasks are agglomerated into larger tasks, each computing a local sum before the global reduction.)
Partitioning
Domain partitioning
- Assume one task per particle
- Each task has the particle's position and velocity vector
Iteration
- Get the positions of all other particles
- Compute new position and velocity
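One iteration of this scheme might look as follows, reduced to 1-D for brevity. Unit masses, G = 1, explicit Euler integration, and the small softening constant in the denominator are illustrative assumptions, not part of the slides:

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, v; } particle;

static void nbody_step(particle *b, int n, double dt) {
    double acc[64];                 /* accelerations; assumes n <= 64 */
    /* Phase 1: every position is available (the "gather" step), so each
       particle sums inverse-square attractions from all the others. */
    for (int i = 0; i < n; i++) {
        acc[i] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d = b[j].x - b[i].x;
            acc[i] += d / (fabs(d) * d * d + 1e-9);   /* ~ d / |d|^3 */
        }
    }
    /* Phase 2: update velocity, then position. */
    for (int i = 0; i < n; i++) {
        b[i].v += acc[i] * dt;
        b[i].x += b[i].v * dt;
    }
}
```

The two-phase structure matters: all accelerations are computed from the old positions before any particle moves, which is exactly why each task must first obtain every other particle's position (the all-gather of the next slides).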
Gather
All-gather
Scatter
Scatter in log p Steps
(Figure: one processor starts with all 8 items (12345678); in each of log p = 3 steps, every processor holding more than one item sends half of its items to a partner, until each of the 8 processors holds a single item.)
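The scatter pattern can be simulated by tracking which block of items each processor holds. This sketch assumes p is a power of two and that processor 0 is the root; the function name is ours:

```c
#include <assert.h>

/* After the call, start[i] is the index of the single item processor i
   holds and len[i] == 1. At each step, every processor holding more than
   d items sends the upper half of its block to the processor d away. */
static void scatter_logp(int p, int *start, int *len) {
    for (int i = 0; i < p; i++) { start[i] = 0; len[i] = 0; }
    len[0] = p;                          /* root holds items 0..p-1 */
    for (int d = p / 2; d >= 1; d /= 2)
        for (int i = 0; i < p; i++)
            if (len[i] > d) {            /* send upper half to proc i+d */
                start[i + d] = start[i] + d;
                len[i + d]   = len[i] - d;
                len[i]       = d;
            }
}
```

Each step doubles the number of processors holding data, so all p processors are reached in log₂ p steps rather than the p − 1 sends a naive root-to-each scatter would need.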
Summary: Task/channel Model
Parallel computation
- A set of tasks
- Interactions through channels
Good designs
- Maximize local computations
- Minimize communications
- Scale up
Summary: Design Steps
- Partition computation
- Agglomerate tasks
- Map tasks to processors
Goals
- Maximize processor utilization
- Minimize inter-processor communication
Summary: Fundamental Algorithms
- Reduction
- Gather and scatter
- All-gather