Concurrency and Mapping
TRANSCRIPT
-
8/3/2019 Con Currency Mapping
1/40
Parallel Processing
Samer Arandi
Parallel Processing
66523
Computer Engineering Department
An-Najah National University
Concurrency and Mapping
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Characteristics of Tasks
Key characteristics
- generation strategy
- associated work
- associated data size
Impact choice and performance of parallel algorithms
-
Task Sizes
Uniform: all the same size (example?)
Non-uniform
- sometimes sizes are known or can be estimated a-priori
- sometimes not
- example: tasks in quicksort; the size of each partition depends upon the pivot selected
Task Size: amount of time required for completion
Implications on mapping?
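The quicksort case can be made concrete. A minimal sketch (hypothetical data, not from the slides): the sizes of the two sub-tasks depend entirely on the pivot chosen, so task sizes cannot be known a-priori.

```python
# Sketch: quicksort task sizes are non-uniform and pivot-dependent.
def partition(data, pivot):
    """Split data into the two sub-problems quicksort would recurse on."""
    left = [x for x in data if x < pivot]
    right = [x for x in data if x >= pivot]
    return left, right

data = list(range(16))
balanced = partition(data, 8)   # median-ish pivot: task sizes 8 and 8
skewed = partition(data, 2)     # extreme pivot: task sizes 2 and 14
```

A static mapping that assumed equal-sized partitions would leave the process holding the small task idle while the large one finishes.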
-
Size of Data Associated with Tasks
Data may be small or large compared to the computation
- size(input) < size(computation), e.g., 15-puzzle
- size(input) = size(computation) > size(output), e.g., min
- size(input) = size(output)
-
Characteristics of Task Interactions
Orthogonal classification criteria
Static vs. dynamic
Regular vs. irregular
Read-only vs. read-write
One-sided vs. two-sided
-
Characteristics of Task Interactions
Static interactions
- tasks and interactions are known a-priori
- simpler to code
Dynamic interactions
- timing or interacting tasks cannot be determined a-priori
- harder to code
- especially using two-sided message passing APIs
-
Characteristics of Task Interactions
Regular interactions
- interactions have a pattern that can be described with a function
  (e.g., mesh, ring)
- regular patterns can be exploited for efficient implementation
  (e.g., schedule communication to avoid conflicts on network links)
Irregular interactions
- lack a well-defined topology
- modeled by a graph
-
Static Regular Task Interaction Pattern
Image operations, e.g. edge detection
- nearest-neighbor interactions on a 2D mesh
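A sketch of such a nearest-neighbor pattern: the 4-point Laplacian stencil is one common edge-detection building block. The function and data below are illustrative, not from the slides; the point is that each mesh point interacts only with its four neighbors.

```python
# Sketch: 4-point nearest-neighbor stencil on a 2-D mesh (edge detection).
def laplacian(grid):
    """Apply a 4-point Laplacian; boundary cells are left unchanged for brevity."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # each point reads only its 4 mesh neighbors: a static, regular pattern
            out[i][j] = (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1] - 4 * grid[i][j])
    return out

flat = [[5] * 3 for _ in range(3)]
result = laplacian(flat)
```

When the mesh is block-partitioned, only the boundary rows/columns of each block must be exchanged between neighboring tasks.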
-
Static Irregular Task Interaction Pattern
Sparse matrix-vector multiply
A task must scan its associated row(s) of A to know which entries of vector b it requires (this implies the tasks it needs to interact with)
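This row scan can be sketched with a CSR-like per-row list of nonzero column indices (the data below is hypothetical):

```python
# Sketch: a task owning row i scans its nonzero column indices to learn which
# entries of b it needs -- and hence which owning tasks it must interact with.
row_cols = {0: [0, 3], 1: [1], 2: [0, 2, 4]}  # row -> columns of nonzeros

def needed_entries(row):
    """Entries of b required to compute y[row] = sum over nonzero j of A[row, j] * b[j]."""
    return set(row_cols[row])
```

Because the sparsity pattern is fixed but has no closed-form structure, the resulting interaction pattern is static yet irregular.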
-
Characteristics of Task Interactions
Read-only interactions
- tasks only read data associated with other tasks
- example: matrix multiplication (shared: A and B)
Read-write interactions
- tasks read and modify data associated with other tasks
- example: shared task priority queue
- harder to code: requires synchronization
  - need to avoid ordering races (read-write, write-write, etc.)
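The shared priority queue example can be sketched as follows (class and method names are illustrative): every operation both reads and writes the shared heap, so a lock is needed to rule out read-write and write-write races.

```python
import heapq
import threading

# Sketch of a read-write interaction: a shared task priority queue.
class SharedPriorityQueue:
    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def push(self, priority, task):
        with self._lock:                 # writers are serialized
            heapq.heappush(self._heap, (priority, task))

    def pop(self):
        with self._lock:                 # pop both reads and modifies the heap
            return heapq.heappop(self._heap) if self._heap else None

q = SharedPriorityQueue()
threads = [threading.Thread(target=q.push, args=(p, f"t{p}")) for p in (3, 1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two concurrent pushes could interleave their heap updates and corrupt the structure.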
-
Characteristics of Task Interactions
One-sided
- initiated & completed independently by one of the two interacting tasks
- GET
- PUT
Two-sided
- both tasks coordinate in an interaction
- SEND + RECV
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Mapping Techniques
Map concurrent tasks to processes for execution
Overheads of mappings
- serialization (idling), due to uneven load balancing or dependencies
- communication
A good mapping tries to minimize both sources of overhead
Conflicting objectives: minimizing one increases the other
- assigning all work to one processor (the extreme case)
  - minimizes communication
  - causes significant idling
- minimizing serialization introduces communication
Goal: all tasks complete in the shortest possible time
-
Mapping Techniques for Minimum Idling
Overall load balancing alone doesn't necessarily minimize idling
Must balance computation and interactions at each stage
Task dependency graph determines when a task can run
-
Mapping Techniques for Minimum Idling
Static vs. dynamic mappings
Static mapping
- a-priori mapping of tasks to processes
- requires a good estimate of task size
- even so, optimal mapping may be NP-complete (e.g., multiple knapsack problem)
Dynamic mapping
- map tasks to processes at runtime
- why?
  - tasks are generated at runtime, or
  - their sizes are unknown
Factors that influence choice of mapping
- size of data associated with a task
- nature of the underlying domain
Need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping
-
Schemes for Static Mapping
Data partitionings
Task graph partitionings
Hybrid strategies
-
Mappings Based on Data Partitioning
Partition computation using a combination of
- data partitioning
- the owner-computes rule
Example: 1-D block distribution for dense matrices
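A 1-D block distribution can be sketched as follows (helper name is illustrative), using the common convention that the first `n mod p` processes get one extra row:

```python
# Sketch: 1-D block distribution of n rows over p processes.
def block_range(n, p, rank):
    """Half-open row range [start, end) owned by the given rank."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size

# 10 rows over 3 processes -> block sizes 4, 3, 3
ranges = [block_range(10, 3, r) for r in range(3)]
```

Under the owner-computes rule, the process owning a row range performs all computation that writes into those rows.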
-
Block Array Distribution Schemes
Multi-dimensional block distributions
Multi-dimensional partitioning enables larger # of processes
-
Block Array Distribution Example
Multiplying two dense matrices C = A x B
Partition the output matrix C using a block decomposition
Give each task the same number of elements of C
- each element of C corresponds to a dot product
- even load balance
Obvious choices: 1D or 2D decomposition
Select to minimize associated communication overhead
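The owner-computes view of this decomposition can be sketched as follows (function name and block-indexing convention are illustrative): the task owning block (bi, bj) of C computes every dot product in that block, reading the matching row-block of A and column-block of B.

```python
# Sketch: compute one 2-D block of C = A x B under the owner-computes rule.
def c_block(A, B, bi, bj, bs):
    """Return the (bi, bj) block of C = A x B, with square block size bs."""
    n = len(A)  # shared inner dimension
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(bj * bs, (bj + 1) * bs)]
            for i in range(bi * bs, (bi + 1) * bs)]

A = [[1, 0], [0, 1]]   # identity, so C should equal B
B = [[5, 6], [7, 8]]
```

Every block holds the same number of dot products, so the load is even; the 1D vs 2D choice then trades off how much of A and B each task must fetch.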
-
Imbalance and Block Array Distributions
Consider a block distribution for LU decomposition
- computing different blocks requires different amounts of work
If we map all tasks associated with a certain block onto a process in a 9-process ensemble => imbalance => significant idle time
Another computation with similar distribution challenges: Gaussian elimination
-
Block Cyclic Distribution
Variant of the block distribution scheme that can be used to alleviate load imbalance and idling
Steps
1. partition an array into many more blocks than the number of available processes
2. assign blocks to processes in a round-robin manner
   - each process gets several non-adjacent blocks
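The round-robin step can be sketched in one line (helper name is illustrative):

```python
# Sketch: block-cyclic ownership -- blocks are dealt out round-robin.
def block_cyclic_owner(block_index, p):
    """Process that owns the given block when p processes share the blocks."""
    return block_index % p

# 8 blocks over 3 processes: each process gets several non-adjacent blocks.
owners = [block_cyclic_owner(b, 3) for b in range(8)]
```

Because each process holds blocks scattered across the array, work that concentrates in one region (as in LU) is still spread over all processes.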
-
Block-Cyclic Distribution
(a) 1D block-cyclic (b) 2D block-cyclic
In certain cases even block-cyclic results in imbalance
- remedy: randomized block distribution
-
Decomposition by Graph Partitioning
Sparse-matrix vector multiply
Graph of the matrix is useful for decomposition
- work ~ number of edges
- communication for a node ~ node degree
Goal: balance work & minimize communication
Partition the graph
- assign an equal number of nodes to each process
- minimize the edge count of the graph partition
Data partitioning is very effective for problems that use dense matrices and have regular interaction patterns. However, some problems utilize sparse matrices and have data-dependent and irregular interaction patterns.
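The quantity being minimized can be sketched directly (hypothetical graph data): the communication cost of a partition is proportional to its edge cut, the number of edges whose endpoints land in different parts.

```python
# Sketch: edge cut of a graph partition ~ communication volume.
def edge_cut(edges, part):
    """edges: list of (u, v) pairs; part: dict mapping node -> partition id."""
    return sum(1 for u, v in edges if part[u] != part[v])

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
part = {0: 0, 1: 0, 2: 1, 3: 1}            # split into two halves
cut = edge_cut(edges, part)
```

Graph partitioners search for a partition with balanced part sizes and a small value of this cut.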
-
Partitioning a Graph of Lake Superior
Partitioning for minimum edge-cut (8 processes)
Random partitioning (8 processes)
-
Mappings Based on Task Partitioning
Partitioning a task-dependency graph
Optimal partitioning for a general task-dependency graph
- NP-complete problem
Excellent heuristics exist for structured graphs
-
Mapping a Binary Tree Dependency Graph
Dependency graph for quicksort
Task assignment to processes in a hypercube*
*hypercube: node numbers that differ in 1 bit are adjacent
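The footnote's adjacency rule can be checked with a one-liner: two labels differ in exactly one bit iff their XOR is a power of two (function name is illustrative).

```python
# Sketch: hypercube adjacency -- node labels adjacent iff they differ in 1 bit.
def hypercube_adjacent(a, b):
    """True iff integer node labels a and b differ in exactly one bit."""
    x = a ^ b
    return x != 0 and (x & (x - 1)) == 0   # x is a nonzero power of two
```

This is why mapping a binary-tree dependency graph onto a hypercube keeps communicating tasks on directly connected nodes.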
-
Task Partitioning: Mapping a Sparse Graph
17 items to communicate
13 items to communicate
-
Hierarchical Mappings
Sometimes a single mapping is inadequate
- e.g., task mapping of a binary tree cannot readily use a large number of processors (e.g., parallel quicksort)
Hierarchical approach
- use a task mapping at the top level
- data partitioning within each level
-
Schemes for Dynamic Mapping
Dynamic mapping AKA dynamic load balancing
- load balancing is the primary motivation for dynamic mapping
Styles
- centralized
- distributed
-
Centralized Dynamic Mapping
Processes are designated as master(s) or slaves
General strategy
- when a slave runs out of work, it requests more from the master
Challenge
- the master may become a bottleneck for a large number of processes
Approach
- chunk scheduling: a process picks up several tasks at once
- however:
  - large chunk sizes may cause significant load imbalances
  - gradually decrease chunk size as the computation progresses
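The "gradually decrease chunk size" idea can be sketched as guided scheduling (helper name and the `remaining // n_workers` rule are one common choice, shown here as an assumption):

```python
# Sketch: guided chunk scheduling -- the master hands out chunks whose size
# shrinks with the remaining work, reducing end-of-run imbalance.
def guided_chunks(n_tasks, n_workers, min_chunk=1):
    """Sequence of chunk sizes handed out until all tasks are dispatched."""
    remaining, chunks = n_tasks, []
    while remaining > 0:
        chunk = max(min_chunk, remaining // n_workers)  # shrinks over time
        chunk = min(chunk, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

sizes = guided_chunks(100, 4)
```

Early requests get big chunks (low master traffic); late requests get small chunks (fine-grained balancing near the end).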
-
Distributed Dynamic Mapping
All processes are peers
Each process can send work to or receive work from other processes
- avoids the centralized bottleneck
Four critical design questions
- how are sending and receiving processes paired together?
- who initiates work transfer?
- how much work is transferred?
- when is a transfer triggered?
Ideal answers can be application-specific
Cilk uses a distributed dynamic mapping: work stealing
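The work-stealing idea behind Cilk-style runtimes can be sketched as follows (a single-threaded illustration with hypothetical data; a real runtime adds per-deque synchronization): each worker pops new work from its own deque, and an idle worker steals the oldest task from a busy victim.

```python
from collections import deque

# Sketch: per-worker deques; idle workers steal the oldest task from a victim.
queues = [deque([1, 2, 3]), deque()]   # worker 0 is busy, worker 1 is idle

def steal(thief, victim):
    """Move the oldest task from victim's deque to thief's deque."""
    if queues[victim]:
        queues[thief].append(queues[victim].popleft())  # steal from the top
        return True
    return False

stolen = steal(1, 0)   # idle worker 1 steals from worker 0
```

Stealing the oldest task tends to transfer large units of work, answering the "how much work is transferred?" question implicitly.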
Distributed vs. shared memory architectures: suitability
- for message-passing computers, the computation size should be >> the data size
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Minimizing Interaction Overheads (1)
Rules of thumb
Maximize data locality
- don't fetch data you already have
- restructure computation to reuse data promptly
Minimize volume of data exchange
- partition the interaction graph to minimize edge crossings
Minimize frequency of communication
- try to aggregate messages where possible
Minimize contention and hot-spots
- use decentralized techniques (avoidance)
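The "aggregate messages" rule of thumb can be sketched with a tiny batching helper (illustrative, not a real communication API): sending one batched message amortizes the per-message startup cost that dominates when payloads are small.

```python
# Sketch: batch many small messages into fewer large ones.
def batch_messages(messages, batch_size):
    """Group individual messages into batches of at most batch_size."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

batches = batch_messages(list(range(10)), 4)
```

Ten sends become three, at the cost of slightly delaying the early messages.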
-
Minimizing Interaction Overheads (2)
Techniques
Overlap communication with computation
- use non-blocking communication primitives
  - overlap communication with your own computation
  - one-sided: prefetch remote data to hide latency
- multithread code on a processor
  - overlap communication with another thread's computation
Replicate data or computation to reduce communication
Use group communication instead of point-to-point primitives
Issue multiple communications and overlap their latency (reduces exposed latency)
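The overlap idea can be sketched with threads (the `fetch` function stands in for a remote data fetch and is purely illustrative): a producer thread runs "communication" ahead of the consumer's "computation".

```python
import queue
import threading

# Sketch: overlap simulated communication (producer) with computation (consumer).
def fetch(i):
    return i * i   # stand-in for fetching remote data

def process_overlapped(n):
    results, q = [], queue.Queue(maxsize=1)

    def producer():
        for i in range(n):
            q.put(fetch(i))        # "communication" runs ahead of the consumer
        q.put(None)                # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        results.append(item + 1)   # "computation" on already-fetched data
    return results
```

While the consumer works on item i, the producer is already fetching item i+1, so fetch latency is hidden behind computation.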
-
Outline
Characteristics of tasks and interactions
- task generation, granularity, and context
- characteristics of task interactions
Mapping techniques for load balancing
- static mappings
- dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
-
Parallel Algorithm Model
Definition: a way of structuring a parallel algorithm
Aspects of a model
- decomposition
- mapping technique
- strategy to minimize interactions
-
Common Parallel Algorithm Models
Data parallel
- each task performs similar operations on different data
- typically statically map tasks to processes
Task graph
- use task dependency graph relationships to promote locality or reduce interaction costs
Master-slave
- one or more master processes generate work and allocate it to worker processes
- allocation may be static or dynamic
Pipeline / producer-consumer
- pass a stream of data through a sequence of processes
- each performs some operation on it
Hybrid
- apply multiple models hierarchically, or
- apply multiple models in sequence to different phases
-
References
Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
Slides originally from John Mellor-Crummey (Rice), COMP 422