Concurrency and Mapping


TRANSCRIPT

  • Slide 1/40

    Parallel Processing

    Samer Arandi

    [email protected]

    Parallel Processing 66523

    Computer Engineering Department

    An-Najah National University

    Concurrency and Mapping

  • Slide 2/40

    Outline

    Characteristics of tasks and interactions
    - task generation, granularity, and context
    - characteristics of task interactions

    Mapping techniques for load balancing
    - static mappings
    - dynamic mappings

    Methods for minimizing interaction overheads

    Parallel algorithm design templates


  • Slide 3/40

    Characteristics of Tasks

    Key characteristics:
    - generation strategy
    - associated work
    - associated data size

    These characteristics impact the choice and performance of parallel algorithms


  • Slide 4/40

  • Slide 5/40

    Task Sizes

    Uniform: all the same size (example?)

    Non-uniform:
    - sometimes sizes are known or can be estimated a priori
    - sometimes not
    - example: tasks in quicksort; the size of each partition depends on the pivot selected


    Task Size: amount of time required for completion

    Implications on mapping?

  • Slide 6/40

    Size of Data Associated with Tasks

    Data may be small or large compared to the computation:
    - size(input) < size(computation), e.g., the 15-puzzle
    - size(input) = size(computation) > size(output), e.g., min
    - size(input) = size(output)

  • Slide 7/40

    Characteristics of Task Interactions

    Orthogonal classification criteria

    Static vs. dynamic

    Regular vs. irregular

    Read-only vs. read-write

    One-sided vs. two-sided


  • Slide 8/40

    Characteristics of Task Interactions

    Static interactions:
    - tasks and interactions are known a priori
    - simpler to code

    Dynamic interactions:
    - timing or interacting tasks cannot be determined a priori
    - harder to code

    - especially using two-sided message passing APIs


  • Slide 9/40

    Characteristics of Task Interactions

    Regular interactions:
    - interactions have a pattern that can be described with a function
      - e.g. mesh, ring
    - regular patterns can be exploited for efficient implementation
      - e.g. schedule communication to avoid conflicts on network links

    Irregular interactions:
    - lack a well-defined topology
    - modeled by a graph


  • Slide 10/40

    Static Regular Task Interaction Pattern

    Image operations, e.g. edge detection:
    - nearest-neighbor interactions on a 2D mesh


  • Slide 11/40

    Static Irregular Task Interaction Pattern

    Sparse matrix-vector multiply


    A task must scan its associated row(s) of A to know which entries of vector b it requires (this implies the tasks it needs to interact with).
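    To make the row-scanning idea concrete, here is a minimal C sketch (the small CSR matrix and vector are assumed example data, not from the slides): each task owns one row of A, and the column indices of that row identify which entries of b, and therefore which other tasks, it depends on.

```c
/* Minimal sketch (hypothetical CSR data): each task owns one row of A and
 * scans that row's column indices to learn which entries of b it must fetch
 * from other tasks before computing its element of y = A*b. */
#include <stdio.h>

int main(void) {
    /* 4x4 sparse matrix in CSR form (assumed example values) */
    int    row_ptr[] = {0, 2, 3, 5, 6};        /* start of each row in col_idx/val */
    int    col_idx[] = {0, 2, 1, 0, 3, 2};     /* column index of each nonzero     */
    double val[]     = {4, 1, 3, 2, 5, 6};
    double b[]       = {1, 1, 1, 1};
    int n = 4;

    for (int row = 0; row < n; row++) {        /* one "task" per row */
        double y = 0.0;
        printf("task %d needs b entries:", row);
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++) {
            printf(" %d", col_idx[k]);         /* entry of b owned by another task */
            y += val[k] * b[col_idx[k]];
        }
        printf("  -> y[%d] = %g\n", row, y);
    }
    return 0;
}
```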

  • Slide 12/40

    Characteristics of Task Interactions

    Read-only interactions:
    - tasks only read data associated with other tasks
    - example: matrix multiplication (A and B are shared)

    Read-write interactions:
    - read and modify data associated with other tasks
    - example: a shared task priority queue
    - harder to code: requires synchronization (see the sketch below)
      - need to avoid ordering races (read-write, write-write, etc.)
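    As an illustration of why read-write interactions need synchronization, here is a minimal pthreads sketch in C (the fixed-size task array and two-thread setup are assumptions for illustration): a mutex serializes access to the shared queue index so concurrent workers cannot race on it.

```c
/* Minimal sketch of a shared task queue guarded by a mutex; the lock
 * serializes the read-write interaction so concurrent pops cannot race. */
#include <pthread.h>
#include <stdio.h>

static int tasks[8] = {0, 1, 2, 3, 4, 5, 6, 7};  /* assumed example tasks */
static int next_task = 0;                        /* shared, read-write state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&lock);               /* synchronize the interaction */
        if (next_task >= 8) { pthread_mutex_unlock(&lock); break; }
        int t = tasks[next_task++];
        pthread_mutex_unlock(&lock);
        printf("thread %ld runs task %d\n", (long)arg, t);
    }
    return NULL;
}

int main(void) {
    pthread_t th[2];
    for (long i = 0; i < 2; i++) pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(th[i], NULL);
    return 0;
}
```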


  • Slide 13/40

    Characteristics of Task Interactions

    One-sided:
    - initiated and completed independently by one of the two interacting tasks

    - GET

    - PUT

    Two-sided:
    - both tasks coordinate in an interaction

    - SEND + RECV
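    A minimal MPI sketch of the two styles, assuming an MPI installation and at least two ranks (buffers and values are illustrative only): the SEND+RECV pair requires both tasks to participate, while the PUT is issued by the origin rank alone.

```c
/* Minimal sketch contrasting two-sided and one-sided interactions in MPI
 * (assumes at least 2 ranks; data values are illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-sided: both ranks take part in the interaction. */
    int msg = rank;
    if (rank == 0) MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    if (rank == 1) MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* One-sided: rank 0 PUTs into rank 1's window; rank 1 issues no matching call. */
    int win_buf = -1;
    MPI_Win win;
    MPI_Win_create(&win_buf, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0) MPI_Put(&msg, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 1) printf("two-sided recv=%d one-sided put=%d\n", msg, win_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```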


  • Slide 14/40

    Outline

    Characteristics of tasks and interactions
    - task generation, granularity, and context
    - characteristics of task interactions

    Mapping techniques for load balancing
    - static mappings
    - dynamic mappings

    Methods for minimizing interaction overheads

    Parallel algorithm design templates


  • Slide 15/40

    Mapping Techniques

    Map concurrent tasks to processes for execution

    Overheads of mapping:
    - serialization (idling), due to uneven load balancing or dependencies
    - communication

    A good mapping tries to minimize both sources of overheads

    Conflicting objectives: minimizing one increases the other
    - assigning all work to one processor (the extreme case)
      - minimizes communication
      - causes significant idling
    - minimizing serialization introduces communication


    Goal: all tasks complete in the shortest possible time

  • Slide 16/40

    Mapping Techniques for Minimum Idling

    Overall load balancing alone doesn't necessarily minimize idling


    Must balance computation and interactions at each stage

    Task dependency graph determines when a task can run

  • Slide 17/40

    Mapping Techniques for Minimum Idling

    Static vs. dynamic mappings

    Static mapping:
    - a priori mapping of tasks to processes
    - requirements:
      - a good estimate of task size
      - even so, finding an optimal mapping may be NP-complete (e.g., the multiple knapsack problem)

    Dynamic mapping:
    - map tasks to processes at runtime
    - why?

    - tasks are generated at runtime, or

    - their sizes are unknown

    Factors that influence the choice of mapping:
    - the size of data associated with a task
    - the nature of the underlying domain


    Need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping.

  • Slide 18/40

    Schemes for Static Mapping

    Data partitioning

    Task graph partitioning

    Hybrid strategies


  • Slide 19/40

    Mappings Based on Data Partitioning

    Partition the computation using a combination of:
    - data partitioning
    - the owner-computes rule

    Example: 1-D block distribution for dense matrices
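    A minimal C sketch of a 1-D block distribution with the owner-computes rule (the row count and process count are assumed for illustration): each process owns a contiguous block of rows and performs only the updates on the rows it owns.

```c
/* Minimal sketch of a 1-D block distribution with the owner-computes rule
 * (hypothetical sizes): each of p processes owns a contiguous block of rows
 * and performs only the work on the rows it owns. */
#include <stdio.h>

int main(void) {
    int n = 10, p = 3;                           /* 10 rows, 3 processes (assumed) */
    int block = (n + p - 1) / p;                 /* ceiling(n / p) rows per block  */

    for (int rank = 0; rank < p; rank++) {       /* stand-in for per-process code  */
        int lo = rank * block;
        int hi = (rank + 1) * block; if (hi > n) hi = n;
        printf("process %d owns rows [%d, %d)\n", rank, lo, hi);
        for (int i = lo; i < hi; i++) {
            /* owner-computes: only the owner updates row i here */
        }
    }
    return 0;
}
```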


  • Slide 20/40

    Block Array Distribution Schemes

    Multi-dimensional block distributions

    Multi-dimensional partitioning enables the use of a larger number of processes


  • Slide 21/40

    Block Array Distribution Example

    Multiplying two dense matrices C = A x B

    Partition the output matrix C using a block decomposition

    Give each task the same number of elements of C
    - each element of C corresponds to a dot product
    - even load balance

    Obvious choices: 1D or 2D decomposition

    Select to minimize associated communication overhead


  • Slide 22/40

    Imbalance and Block Array Distributions

    Consider a block distribution for LU decomposition: computing different blocks requires different amounts of work.

    If we map all tasks associated with a certain block onto a process in a 9-process ensemble, the result is imbalance and significant idle time.

    Another computation with similar distribution challenges: Gaussian elimination

  • Slide 23/40

    Block Cyclic Distribution

    A variant of the block distribution scheme that can be used to alleviate load imbalance and idling

    Steps (sketched in the code below):

    1. partition the array into many more blocks than the number of available processes

    2. assign blocks to processes in a round-robin manner
    - each process gets several non-adjacent blocks
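    A minimal C sketch of the 1-D case (the array size, process count, and block size are assumed for illustration): with many more blocks than processes, the owner of element i is simply (i / block) mod p, so each process ends up with several non-adjacent blocks.

```c
/* Minimal sketch of a 1-D block-cyclic distribution (hypothetical sizes):
 * the array is split into many more blocks than processes and the blocks
 * are dealt out round-robin. */
#include <stdio.h>

int main(void) {
    int n = 16, p = 4, block = 2;                  /* 8 blocks for 4 processes */
    for (int i = 0; i < n; i++) {
        int owner = (i / block) % p;               /* round-robin over blocks  */
        printf("element %2d -> block %d -> process %d\n", i, i / block, owner);
    }
    return 0;
}
```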


  • Slide 24/40

    Block-Cyclic Distribution

    (a) 1D block-cyclic (b) 2D block-cyclic


    In certain cases even a block-cyclic distribution results in imbalance:
    - alternative: randomized block distribution

  • Slide 25/40

    Decomposition by Graph Partitioning

    Sparse-matrix vector multiply

    The graph of the matrix is useful for decomposition:
    - work ~ number of edges
    - communication for a node ~ node degree

    Goal: balance work & minimize communication

    Partition the graph:
    - assign an equal number of nodes to each process
    - minimize the edge count of the graph partition

    Data partitioning is very effective for problems that use dense matrices and have regular interaction patterns. However, some problems use sparse matrices and have data-dependent, irregular interaction patterns.

  • Slide 26/40

    Partitioning a Graph of Lake Superior

    Partitioning for minimum edge-cut (8 processes)

    Random partitioning (8 processes)

  • Slide 27/40

    Mappings Based on Task Partitioning

    Partitioning a task-dependency graph

    Optimal partitioning of a general task-dependency graph is an NP-complete problem

    Excellent heuristics exist for structured graphs


  • Slide 28/40

    Mapping a Binary Tree Dependency Graph

    Dependency graph for quicksort; task assignment to processes in a hypercube*

    *hypercube: node numbers that differ in 1 bit are adjacent
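    A minimal C sketch of the hypercube adjacency rule (an 8-process cube is assumed for illustration): the partner at each tree level is obtained by flipping one bit of the rank, so every handoff in the binary-tree dependency graph stays between adjacent hypercube nodes.

```c
/* Minimal sketch (assumed 8-process hypercube): the partner at tree level k
 * differs from the rank in exactly one bit, i.e. is an adjacent node. */
#include <stdio.h>

int main(void) {
    int p = 8, dims = 3;                           /* 2^3 = 8 processes (assumed) */
    for (int rank = 0; rank < p; rank++) {
        printf("rank %d partners:", rank);
        for (int k = dims - 1; k >= 0; k--)        /* highest bit first = top of tree */
            printf(" level %d -> %d;", dims - 1 - k, rank ^ (1 << k));
        printf("\n");
    }
    return 0;
}
```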


  • Slide 29/40

    Task Partitioning: Mapping a Sparse Graph


    17 items to communicate

    13 items to communicate

  • Slide 30/40

    Hierarchical Mappings

    Sometimes a single mapping is inadequate
    - e.g., the task mapping of a binary tree cannot readily use a large number of processors (e.g. parallel quicksort)

    Hierarchical approach:
    - use a task mapping at the top level
    - data partitioning within each level


  • Slide 31/40

    Schemes for Dynamic Mapping

    Dynamic mapping is also known as dynamic load balancing
    - load balancing is the primary motivation for dynamic mapping

    Styles:
    - centralized
    - distributed


  • Slide 32/40

    Centralized Dynamic Mapping

    Processes are designated master(s) or slaves

    General strategy:
    - when a slave runs out of work, it requests more from the master

    Challenge

    - the master may become a bottleneck for a large number of processes

    Approach:
    - chunk scheduling: a process picks up several tasks at once (sketched below)
    - however:

    - large chunk sizes may cause significant load imbalances

    - gradually decrease chunk size as the computation progresses
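    A minimal master/slave sketch in MPI, assuming at least two ranks and a hypothetical pool of 100 tasks (the shrinking chunk-size formula is illustrative, not from the slides): rank 0 answers work requests with chunks that get smaller as the remaining work drains, and a chunk of size 0 tells a slave to stop.

```c
/* Minimal master/slave sketch with a gradually decreasing chunk size
 * (assumes >= 2 MPI ranks; 100 hypothetical tasks). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, total = 100;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                     /* master */
        int next = 0, done = 0;
        while (done < size - 1) {
            int who;
            MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int remaining = total - next;
            int chunk = remaining / (2 * (size - 1));    /* shrinks as work drains */
            if (chunk < 1) chunk = remaining > 0 ? 1 : 0;
            int msg[2] = { next, chunk };                /* first task + chunk size */
            MPI_Send(msg, 2, MPI_INT, who, 1, MPI_COMM_WORLD);
            next += chunk;
            if (chunk == 0) done++;                      /* this slave is finished  */
        }
    } else {                                             /* slave */
        for (;;) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* request work */
            int msg[2];
            MPI_Recv(msg, 2, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (msg[1] == 0) break;                      /* no work left */
            printf("rank %d runs tasks %d..%d\n", rank, msg[0], msg[0] + msg[1] - 1);
        }
    }
    MPI_Finalize();
    return 0;
}
```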


  • Slide 33/40

    Distributed Dynamic Mapping

    All processes are peers

    Each process can send work to or receive work from other processes
    - avoids the centralized bottleneck

    Four critical design questions

    how are sending and receiving processes paired together?

    who initiates work transfer?

    how much work is transferred?

    when is a transfer triggered?

    Ideal answers can be application specific

    Cilk uses a distributed dynamic mapping: work stealing


    Distributed vs. shared memory architecture suitability:
    - for message-passing computers, the computation size should be >> the data size

  • Slide 34/40

    Outline

    Characteristics of tasks and interactions
    - task generation, granularity, and context
    - characteristics of task interactions

    Mapping techniques for load balancing
    - static mappings
    - dynamic mappings

    Methods for minimizing interaction overheads

    Parallel algorithm design templates


  • Slide 35/40

    Minimizing Interaction Overheads (1)

    Rules of thumb

    Maximize data locality
    - don't fetch data you already have
    - restructure computation to reuse data promptly

    Minimize the volume of data exchange
    - partition the interaction graph to minimize edge crossings

    Minimize the frequency of communication
    - try to aggregate messages where possible

    Minimize contention and hot spots
    - use decentralized techniques (avoidance)


  • Slide 36/40

    Minimizing Interaction Overheads (2)

    Techniques

    Overlap communication with computation (see the sketch after this list)

    use non-blocking communication primitives

    - overlap communication with your own computation

    - one-sided: prefetch remote data to hide latency

    multithread code on a processor

    - overlap communication with another thread's computation

    Replicate data or computation to reduce communication

    Use group communication instead of point-to-point primitives

    Issue multiple communications and overlap their latency (reduces exposed latency)
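    A minimal MPI sketch of overlapping communication with computation, assuming exactly two ranks exchanging one hypothetical boundary value: non-blocking MPI_Isend/MPI_Irecv are posted first, independent local work proceeds while the transfer is in flight, and MPI_Waitall completes the exchange before the remote value is used.

```c
/* Minimal sketch of communication/computation overlap with non-blocking
 * primitives (assumes exactly 2 ranks; values are illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                           /* assumes exactly 2 ranks */

    double mine = rank + 1.0, theirs = 0.0, local = 0.0;
    MPI_Request reqs[2];
    MPI_Irecv(&theirs, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&mine,   1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < 1000000; i++)              /* independent local computation */
        local += i * 1e-6;                         /* overlaps with the transfer    */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);     /* now the remote value is usable */
    printf("rank %d: local=%g neighbor=%g\n", rank, local, theirs);
    MPI_Finalize();
    return 0;
}
```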


  • Slide 37/40

    Outline

    Characteristics of tasks and interactions
    - task generation, granularity, and context
    - characteristics of task interactions

    Mapping techniques for load balancing

    static mappings

    dynamic mappings

    Methods for minimizing interaction overheads

    Parallel algorithm design templates


  • Slide 38/40

    Parallel Algorithm Model

    Definition: ways of structuring a parallel algorithm

    Aspects of a model:

    decomposition

    mapping technique

    strategy to minimize interactions


  • Slide 39/40

    Common Parallel Algorithm Models

    Data parallel
    - each task performs similar operations on different data
    - tasks are typically mapped statically to processes

    Task graph
    - use task-dependency graph relationships to:

    - promote locality, or reduce interaction costs

    Master-slave
    - one or more master processes generate work and allocate it to worker processes
    - allocation may be static or dynamic

    Pipeline / producer-consumer
    - pass a stream of data through a sequence of processes
    - each process performs some operation on it

    Hybrid
    - apply multiple models hierarchically, or
    - apply multiple models in sequence to different phases


  • Slide 40/40

    References

    Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama.

    Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison-Wesley, 2003.


    Slides originally from John Mellor-Crummey (Rice), COMP 422