Concurrency and Mapping
TRANSCRIPT
-
8/3/2019 Con Currency Mapping
1/40
Parallel Processing
Samer Arandi
Parallel Processing
66523
Computer Engineering Department
An-Najah National University
Concurrency and Mapping
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Characteristics of Tasks
Key characteristics
- generation strategy
- associated work
- associated data size
Impact choice and performance of parallel algorithms
-
Task Sizes
Uniform: all the same size (example?)
Non-uniform
- sometimes sizes are known or can be estimated a-priori
- sometimes not
- example: tasks in quicksort; the size of each partition depends upon the pivot selected
Task Size: amount of time required for completion
Implications on mapping?
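The quicksort case can be made concrete. A minimal sketch (hypothetical data, not from the slides): the sizes of the two sub-tasks depend entirely on the pivot chosen, so task sizes cannot be known a-priori.

```python
# Sketch: quicksort task sizes are non-uniform and pivot-dependent.
def partition(data, pivot):
    """Split data into the two sub-problems quicksort would recurse on."""
    left = [x for x in data if x < pivot]
    right = [x for x in data if x >= pivot]
    return left, right

data = list(range(16))
balanced = partition(data, 8)   # median-ish pivot: task sizes 8 and 8
skewed = partition(data, 2)     # extreme pivot: task sizes 2 and 14
```

A static mapping that assumed equal-sized partitions would leave the process holding the small task idle while the large one finishes.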
-
Size of Data Associated with Tasks
Data may be small or large compared to the computation
- size(input) < size(computation), e.g., 15-puzzle
- size(input) = size(computation) > size(output), e.g., min
- size(input) = size(output)
-
Characteristics of Task Interactions
Orthogonal classification criteria
Static vs. dynamic
Regular vs. irregular
Read-only vs. read-write
One-sided vs. two-sided
-
Characteristics of Task Interactions
Static interactions
- tasks and interactions are known a-priori
- simpler to code
Dynamic interactions
- timing or interacting tasks cannot be determined a-priori
- harder to code
- especially using two-sided message passing APIs
-
Characteristics of Task Interactions
Regular interactions
- interactions have a pattern that can be described with a function
  (e.g., mesh, ring)
- regular patterns can be exploited for efficient implementation
  (e.g., schedule communication to avoid conflicts on network links)
Irregular interactions
- lack a well-defined topology
- modeled by a graph
-
Static Regular Task Interaction Pattern
Image operations, e.g. edge detection
- nearest-neighbor interactions on a 2D mesh
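A sketch of such a nearest-neighbor pattern: the 4-point Laplacian stencil is one common edge-detection building block. The function and data below are illustrative, not from the slides; the point is that each mesh point interacts only with its four neighbors.

```python
# Sketch: 4-point nearest-neighbor stencil on a 2-D mesh (edge detection).
def laplacian(grid):
    """Apply a 4-point Laplacian; boundary cells are left unchanged for brevity."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # each point reads only its 4 mesh neighbors: a static, regular pattern
            out[i][j] = (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1] - 4 * grid[i][j])
    return out

flat = [[5] * 3 for _ in range(3)]
result = laplacian(flat)
```

When the mesh is block-partitioned, only the boundary rows/columns of each block must be exchanged between neighboring tasks.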
-
Static Irregular Task Interaction Pattern
Sparse matrix-vector multiply
A task must scan its associated row(s) of A to know which entries of vector b it requires (this implies the tasks it needs to interact with)
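This row scan can be sketched with a CSR-like per-row list of nonzero column indices (the data below is hypothetical):

```python
# Sketch: a task owning row i scans its nonzero column indices to learn which
# entries of b it needs -- and hence which owning tasks it must interact with.
row_cols = {0: [0, 3], 1: [1], 2: [0, 2, 4]}  # row -> columns of nonzeros

def needed_entries(row):
    """Entries of b required to compute y[row] = sum over nonzero j of A[row, j] * b[j]."""
    return set(row_cols[row])
```

Because the sparsity pattern is fixed but has no closed-form structure, the resulting interaction pattern is static yet irregular.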
-
Characteristics of Task Interactions
Read-only interactions
- tasks only read data associated with other tasks
- example: matrix multiplication (shared: A and B)
Read-write interactions
- tasks read and modify data associated with other tasks
- example: shared task priority queue
- harder to code: requires synchronization
  - need to avoid ordering races (read-write, write-write, etc.)
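The shared priority queue example can be sketched as follows (class and method names are illustrative): every operation both reads and writes the shared heap, so a lock is needed to rule out read-write and write-write races.

```python
import heapq
import threading

# Sketch of a read-write interaction: a shared task priority queue.
class SharedPriorityQueue:
    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def push(self, priority, task):
        with self._lock:                 # writers are serialized
            heapq.heappush(self._heap, (priority, task))

    def pop(self):
        with self._lock:                 # pop both reads and modifies the heap
            return heapq.heappop(self._heap) if self._heap else None

q = SharedPriorityQueue()
threads = [threading.Thread(target=q.push, args=(p, f"t{p}")) for p in (3, 1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two concurrent pushes could interleave their heap updates and corrupt the structure.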
-
Characteristics of Task Interactions
One-sided
- initiated & completed independently by one of the two interacting tasks
- GET
- PUT
Two-sided
- both tasks coordinate in an interaction
- SEND + RECV
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Mapping Techniques
Map concurrent tasks to processes for execution
Overheads of mappings
- serialization (idling), due to uneven load balancing or dependencies
- communication
A good mapping tries to minimize both sources of overhead
Conflicting objectives: minimizing one increases the other
- assigning all work to one processor (the extreme case)
  - minimizes communication
  - causes significant idling
- minimizing serialization introduces communication
Goal: all tasks complete in the shortest possible time
-
Mapping Techniques for Minimum Idling
Overall load balancing alone doesn't necessarily minimize idling
Must balance computation and interactions at each stage
Task dependency graph determines when a task can run
-
Mapping Techniques for Minimum Idling
Static vs. dynamic mappings
Static mapping
- a-priori mapping of tasks to processes
- requires a good estimate of task size
- even so, optimal mapping may be NP-complete (e.g., multiple knapsack problem)
Dynamic mapping
- map tasks to processes at runtime
- why?
  - tasks are generated at runtime, or
  - their sizes are unknown
Factors that influence choice of mapping
- size of data associated with a task
- nature of the underlying domain
Need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping
-
Schemes for Static Mapping
Data partitionings
Task graph partitionings
Hybrid strategies
-
Mappings Based on Data Partitioning
Partition computation using a combination of
- data partitioning
- the owner-computes rule
Example: 1-D block distribution for dense matrices
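A 1-D block distribution can be sketched as follows (helper name is illustrative), using the common convention that the first `n mod p` processes get one extra row:

```python
# Sketch: 1-D block distribution of n rows over p processes.
def block_range(n, p, rank):
    """Half-open row range [start, end) owned by the given rank."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size

# 10 rows over 3 processes -> block sizes 4, 3, 3
ranges = [block_range(10, 3, r) for r in range(3)]
```

Under the owner-computes rule, the process owning a row range performs all computation that writes into those rows.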
-
Block Array Distribution Schemes
Multi-dimensional block distributions
Multi-dimensional partitioning enables larger # of processes
-
Block Array Distribution Example
Multiplying two dense matrices C = A x B
Partition the output matrix C using a block decomposition
Give each task the same number of elements of C
- each element of C corresponds to a dot product
- even load balance
Obvious choices: 1D or 2D decomposition
Select to minimize associated communication overhead
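The owner-computes view of this decomposition can be sketched as follows (function name and block-indexing convention are illustrative): the task owning block (bi, bj) of C computes every dot product in that block, reading the matching row-block of A and column-block of B.

```python
# Sketch: compute one 2-D block of C = A x B under the owner-computes rule.
def c_block(A, B, bi, bj, bs):
    """Return the (bi, bj) block of C = A x B, with square block size bs."""
    n = len(A)  # shared inner dimension
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(bj * bs, (bj + 1) * bs)]
            for i in range(bi * bs, (bi + 1) * bs)]

A = [[1, 0], [0, 1]]   # identity, so C should equal B
B = [[5, 6], [7, 8]]
```

Every block holds the same number of dot products, so the load is even; the 1D vs 2D choice then trades off how much of A and B each task must fetch.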
-
Imbalance and Block Array Distributions
Consider a block distribution for LU decomposition
- computing different blocks requires different amounts of work
If we map all tasks associated with a certain block onto a process in a 9-process ensemble => imbalance => significant idle time
Another computation with similar distribution challenges: Gaussian elimination
-
Block Cyclic Distribution
Variant of the block distribution scheme that can be used to alleviate load imbalance and idling
Steps
1. partition an array into many more blocks than the number of available processes
2. assign blocks to processes in a round-robin manner
   - each process gets several non-adjacent blocks
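The round-robin step can be sketched in one line (helper name is illustrative):

```python
# Sketch: block-cyclic ownership -- blocks are dealt out round-robin.
def block_cyclic_owner(block_index, p):
    """Process that owns the given block when p processes share the blocks."""
    return block_index % p

# 8 blocks over 3 processes: each process gets several non-adjacent blocks.
owners = [block_cyclic_owner(b, 3) for b in range(8)]
```

Because each process holds blocks scattered across the array, work that concentrates in one region (as in LU) is still spread over all processes.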
-
Block-Cyclic Distribution
(a) 1D block-cyclic (b) 2D block-cyclic
In certain cases even block-cyclic results in imbalance
- remedy: randomized block distribution
-
Decomposition by Graph Partitioning
Sparse-matrix vector multiply
Graph of the matrix is useful for decomposition
- work ~ number of edges
- communication for a node ~ node degree
Goal: balance work & minimize communication
Partition the graph
- assign an equal number of nodes to each process
- minimize the edge count of the graph partition
Data partitioning is very effective for problems that use dense matrices and have regular interaction patterns. However, some problems utilize sparse matrices and have data-dependent and irregular interaction patterns.
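The quantity being minimized can be sketched directly (hypothetical graph data): the communication cost of a partition is proportional to its edge cut, the number of edges whose endpoints land in different parts.

```python
# Sketch: edge cut of a graph partition ~ communication volume.
def edge_cut(edges, part):
    """edges: list of (u, v) pairs; part: dict mapping node -> partition id."""
    return sum(1 for u, v in edges if part[u] != part[v])

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
part = {0: 0, 1: 0, 2: 1, 3: 1}            # split into two halves
cut = edge_cut(edges, part)
```

Graph partitioners search for a partition with balanced part sizes and a small value of this cut.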
-
Partitioning a Graph of Lake Superior
Partitioning for minimum edge-cut (8 processes)
Random partitioning (8 processes)
-
Mappings Based on Task Partitioning
Partitioning a task-dependency graph
Optimal partitioning for a general task-dependency graph
- NP-complete problem
Excellent heuristics exist for structured graphs
-
Mapping a Binary Tree Dependency Graph
Dependency graph for quicksort
Task assignment to processes in a hypercube*
*hypercube: node numbers that differ in 1 bit are adjacent
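The footnote's adjacency rule can be checked with a one-liner: two labels differ in exactly one bit iff their XOR is a power of two (function name is illustrative).

```python
# Sketch: hypercube adjacency -- node labels adjacent iff they differ in 1 bit.
def hypercube_adjacent(a, b):
    """True iff integer node labels a and b differ in exactly one bit."""
    x = a ^ b
    return x != 0 and (x & (x - 1)) == 0   # x is a nonzero power of two
```

This is why mapping a binary-tree dependency graph onto a hypercube keeps communicating tasks on directly connected nodes.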
-
Task Partitioning: Mapping a Sparse Graph
17 items to communicate
13 items to communicate
-
Hierarchical Mappings
Sometimes a single mapping is inadequate
- e.g., task mapping of a binary tree cannot readily use a large number of processors (e.g., parallel quicksort)
Hierarchical approach
- use a task mapping at the top level
- data partitioning within each level
-
Schemes for Dynamic Mapping
Dynamic mapping AKA dynamic load balancing
- load balancing is the primary motivation for dynamic mapping
Styles
- centralized
- distributed
-
Centralized Dynamic Mapping
Processes are designated as master(s) or slaves
General strategy
- when a slave runs out of work, it requests more from the master
Challenge
- the master may become a bottleneck for a large number of processes
Approach
- chunk scheduling: a process picks up several tasks at once
- however:
  - large chunk sizes may cause significant load imbalances
  - gradually decrease chunk size as the computation progresses
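The "gradually decrease chunk size" idea can be sketched as guided scheduling (helper name and the `remaining // n_workers` rule are one common choice, shown here as an assumption):

```python
# Sketch: guided chunk scheduling -- the master hands out chunks whose size
# shrinks with the remaining work, reducing end-of-run imbalance.
def guided_chunks(n_tasks, n_workers, min_chunk=1):
    """Sequence of chunk sizes handed out until all tasks are dispatched."""
    remaining, chunks = n_tasks, []
    while remaining > 0:
        chunk = max(min_chunk, remaining // n_workers)  # shrinks over time
        chunk = min(chunk, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

sizes = guided_chunks(100, 4)
```

Early requests get big chunks (low master traffic); late requests get small chunks (fine-grained balancing near the end).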
-
Distributed Dynamic Mapping
All processes are peers
Each process can send work to or receive work from other processes
- avoids the centralized bottleneck
Four critical design questions
- how are sending and receiving processes paired together?
- who initiates work transfer?
- how much work is transferred?
- when is a transfer triggered?
Ideal answers can be application-specific
Cilk uses a distributed dynamic mapping: work stealing
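The work-stealing idea behind Cilk-style runtimes can be sketched as follows (a single-threaded illustration with hypothetical data; a real runtime adds per-deque synchronization): each worker pops new work from its own deque, and an idle worker steals the oldest task from a busy victim.

```python
from collections import deque

# Sketch: per-worker deques; idle workers steal the oldest task from a victim.
queues = [deque([1, 2, 3]), deque()]   # worker 0 is busy, worker 1 is idle

def steal(thief, victim):
    """Move the oldest task from victim's deque to thief's deque."""
    if queues[victim]:
        queues[thief].append(queues[victim].popleft())  # steal from the top
        return True
    return False

stolen = steal(1, 0)   # idle worker 1 steals from worker 0
```

Stealing the oldest task tends to transfer large units of work, answering the "how much work is transferred?" question implicitly.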
Distributed vs. shared memory architectures: suitability
- for message-passing computers, the computation size should be >> the data size
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Minimizing Interaction Overheads (1)
Rules of thumb
Maximize data locality
- don't fetch data you already have
- restructure computation to reuse data promptly
Minimize volume of data exchange
- partition the interaction graph to minimize edge crossings
Minimize frequency of communication
- try to aggregate messages where possible
Minimize contention and hot-spots
- use decentralized techniques (avoidance)
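The "aggregate messages" rule of thumb can be sketched with a tiny batching helper (illustrative, not a real communication API): sending one batched message amortizes the per-message startup cost that dominates when payloads are small.

```python
# Sketch: batch many small messages into fewer large ones.
def batch_messages(messages, batch_size):
    """Group individual messages into batches of at most batch_size."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

batches = batch_messages(list(range(10)), 4)
```

Ten sends become three, at the cost of slightly delaying the early messages.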
-
Minimizing Interaction Overheads (2)
Techniques
Overlap communication with computation
- use non-blocking communication primitives
  - overlap communication with your own computation
  - one-sided: prefetch remote data to hide latency
- multithread code on a processor
  - overlap communication with another thread's computation
Replicate data or computation to reduce communication
Use group communication instead of point-to-point primitives
Issue multiple communications and overlap their latency (reduces exposed latency)
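The overlap idea can be sketched with threads (the `fetch` function stands in for a remote data fetch and is purely illustrative): a producer thread runs "communication" ahead of the consumer's "computation".

```python
import queue
import threading

# Sketch: overlap simulated communication (producer) with computation (consumer).
def fetch(i):
    return i * i   # stand-in for fetching remote data

def process_overlapped(n):
    results, q = [], queue.Queue(maxsize=1)

    def producer():
        for i in range(n):
            q.put(fetch(i))        # "communication" runs ahead of the consumer
        q.put(None)                # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        results.append(item + 1)   # "computation" on already-fetched data
    return results
```

While the consumer works on item i, the producer is already fetching item i+1, so fetch latency is hidden behind computation.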
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Parallel Algorithm Model
Definition: a way of structuring a parallel algorithm
Aspects of a model
- decomposition
- mapping technique
- strategy to minimize interactions
-
Common Parallel Algorithm Models
Data parallel
- each task performs similar operations on different data
- typically statically map tasks to processes
Task graph
- use task dependency graph relationships to promote locality or reduce interaction costs
Master-slave
- one or more master processes generate work and allocate it to worker processes
- allocation may be static or dynamic
Pipeline / producer-consumer
- pass a stream of data through a sequence of processes
- each performs some operation on it
Hybrid
- apply multiple models hierarchically, or
- apply multiple models in sequence to different phases
-
References
Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
Slides originally from John Mellor-Crummey (Rice), COMP 422