
Page 1:

Page 2:

Parallel Programming Concepts Summary

Dr. Peter Tröger, M.Sc. Frank Feinbube

Page 3:

Course Topics

■  The Parallelization Problem
  □  Power wall, memory wall, Moore's law
  □  Terminology and metrics
■  Shared Memory Parallelism
  □  Theory of concurrency, hardware today and in the past
  □  Programming models, optimization, profiling
■  Shared Nothing Parallelism
  □  Theory of concurrency, hardware today and in the past
  □  Programming models, optimization, profiling
■  Accelerators
■  Patterns
■  Future trends

Page 4:

Scaring Students with Word Clouds ...

Page 5:

The Free Lunch Is Over

■  Clock speed curve flattened in 2003
  □  Heat
  □  Power consumption
  □  Leakage
■  2-3 GHz since 2001 (!)
■  Speeding up serial instruction execution through clock speed improvements no longer works
■  We stumbled into the Many-Core Era

[Herb Sutter, 2009]

Page 6:

The Power Wall

■  Air cooling capabilities are limited
  □  Maximum temperature of 100-125 °C, hot spot problem
  □  Static and dynamic power consumption must be limited
■  Power consumption increases with Moore's law, but growth of hardware performance is still expected
■  Further reducing voltage as compensation
  □  We can't do that endlessly, lower limit around 0.7 V
  □  Strange physical effects
■  Next-generation processors need to use even less power
  □  Lower the frequencies, scale them dynamically
  □  Use only parts of the processor at a time ('dark silicon')
  □  Build energy-efficient special-purpose hardware
■  No chance for faster processors through frequency increase

Page 7:

Memory Wall

■  Caching: well-established optimization technique for performance
■  Relies on data locality
  □  Some instructions are often used (e.g. loops)
  □  Some data is often used (e.g. local variables)
  □  Hardware keeps a copy of the data in the faster cache
  □  On read attempts, data is taken directly from the cache
  □  On write, data is cached and eventually written to memory
■  Similar to ILP, the potential is limited
  □  Larger caches do not help automatically
  □  At some point, all data locality in the code is already exploited
  □  Manual vs. compiler-driven optimization

[arstechnica.com]

Page 8:

Memory Wall

■  If caching is limited, we simply need faster memory
■  The problem: shared memory is 'shared'
  □  Interconnect contention
  □  Memory bandwidth
    ◊  Memory transfer speed is limited by the power wall
    ◊  Memory transfer size is limited by the power wall
■  Transfer technology cannot keep up with GHz processors
■  Memory is too slow, and the effects cannot be hidden through caching completely → "memory wall"

[dell.com]

Page 9:

The Situation

■  Hardware people
  □  Number of transistors N is still increasing
  □  Building larger caches no longer helps (memory wall)
  □  ILP is out of options (ILP wall)
  □  Voltage / power consumption is at the limit (power wall)
    ◊  Some help with dynamic scaling approaches
  □  Frequency F is stalled (power wall)
  □  The only possible offer is to use the increasing N for more cores
■  For faster software in the future ...
  □  Speedup must come from the utilization of an increasing core count, since F is now fixed
  □  Software must participate in the power wall handling, to keep F fixed
  □  Software must tackle the memory wall

Page 10:

Three Ways Of Doing Anything Faster [Pfister]

■  Work harder (clock speed)
  Ø  Power wall problem
  Ø  Memory wall problem
■  Work smarter (optimization, caching)
  Ø  ILP wall problem
  Ø  Memory wall problem
■  Get help (parallelization)
  □  More cores per single CPU
  □  Software needs to exploit them in the right way
  Ø  Memory wall problem

[Figure: one problem mapped onto a CPU with multiple cores]

Page 11:

Parallelism on Different Levels

■  A processor chip (socket)
  □  Chip multi-processing (CMP)
    ◊  Multiple CPUs per chip, called cores
    ◊  Multi-core / many-core
  □  Simultaneous multi-threading (SMT)
    ◊  Interleaved execution of tasks on one core
    ◊  Example: Intel Hyper-Threading
  □  Chip multi-threading (CMT) = CMP + SMT
  □  Instruction-level parallelism (ILP)
    ◊  Parallel processing of single instructions per core
■  Multiple processor chips in one machine (multi-processing)
  □  Symmetric multi-processing (SMP)
■  Multiple processor chips in many machines (multi-computer)

Page 12:

Parallelism on Different Levels

[Figure: CMP architecture with ILP and SMT in each core; arstechnica.com]

Page 13:

Parallelism on Different Levels

Blue Gene/Q [IBM System Technology Group, © 2011 IBM Corporation]

1. Chip: 16+2 µP cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

• Sustained single-node performance: 10x P, 20x L
• MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models for exploitation of node hardware concurrency

Page 14:

Memory on Different Levels

From fast / expensive / small to slow / cheap / large:
■  Registers (volatile)
■  Processor caches (volatile)
■  Random Access Memory (RAM) (volatile)
■  Flash / SSD memory (non-volatile)
■  Hard drives (non-volatile)
■  Tapes (non-volatile)

Page 15:

A Wild Mixture

[Figure: a mixed set of machines connected by a network]

Page 16:

GF100

[Figure: NVIDIA GF100 GPU block diagram]

Page 17:

A Wild Mixture

[Figure: four cluster nodes, each with two multi-core CPUs (QPI interconnect, DDR3 memory), a GPU and a MIC accelerator with GDDR5 memory attached via 16x PCIe, and dual Gigabit LAN]

Page 18:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 19:

Hardware Abstraction: Flynn's Taxonomy

■  Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension

■  Single Instruction, Single Data (SISD): one instruction stream processes one data item per processing step
■  Single Instruction, Multiple Data (SIMD): one instruction is applied to multiple data items per processing step
■  Multiple Instruction, Single Data (MISD): multiple instructions process the same data item
■  Multiple Instruction, Multiple Data (MIMD): multiple instruction streams process multiple data items

Page 20:

Hardware Abstraction: Tasks + Processing Elements

[Figure: programs consist of processes / tasks, which are mapped to processing elements (PEs); each node provides several PEs with shared memory, and the nodes are connected by a network]

Page 21:

Hardware Abstraction: PRAM

■  RAM assumptions: Constant memory access time, unlimited memory

■  PRAM assumptions: Non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors

■  Alternative models: BSP, LogP

[Figure: RAM model – one CPU between input memory and output; PRAM model – multiple CPUs accessing memory via a shared bus]

Page 22:

Hardware Abstraction: BSP

■  Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
■  Success of the von Neumann model
  □  Bridge between hardware and software
  □  High-level languages can be efficiently compiled on this model
  □  Hardware designers can optimize the realization of this model
■  Similar model for parallel machines
  □  Should be neutral about the number of processors
  □  Program should be written for v virtual processors that are mapped to p physical ones
  □  When v >> p, the compiler has options
■  BSP computation consists of a series of supersteps (sketched below):
  1.) Concurrent computation on all processors
  2.) Exchange of data between all processes
  3.) Barrier synchronization
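A single BSP superstep, sketched in C with MPI (assumed available; the local computation and buffer layout are illustrative, and n must be divisible by the process count):

/* One BSP superstep: local computation, all-to-all data exchange,
   barrier synchronization. */
#include <mpi.h>

void superstep(double *local, double *incoming, int n, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    for (int i = 0; i < n; i++)              /* 1) concurrent local computation */
        local[i] *= 2.0;                     /*    (placeholder work) */

    MPI_Alltoall(local, n / p, MPI_DOUBLE,   /* 2) exchange of data between */
                 incoming, n / p, MPI_DOUBLE,/*    all processes            */
                 comm);

    MPI_Barrier(comm);                       /* 3) barrier ends the superstep */
}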

Page 23:

Hardware Abstraction: CSP

■  Behavior of real-world objects can be described through their interaction with other objects
  □  Leave out internal implementation details
  □  The interface of a process is described as a set of atomic events
■  Event examples for an ATM:
  □  card – insertion of a credit card in the ATM card slot
  □  money – extraction of money from the ATM dispenser
■  Events for a printer: {accept, print}
■  Alphabet – set of relevant (!) events for an object description
  □  An event may never happen in the interaction
  □  Interaction is restricted to this set of events
  □  αATM = {card, money}
■  A CSP process is the behavior of an object, described with its alphabet

Page 24:

Hardware Abstraction: LogP

■  Criticism of oversimplification in PRAM-based approaches, which encourages the exploitation of 'formal loopholes' (e.g. communication)
■  Trend towards multicomputer systems with large local memories
■  Characterization of a parallel machine by:
  □  P: Number of processors
  □  g (gap): Minimum time between two consecutive transmissions
    ◊  Reciprocal corresponds to per-processor communication bandwidth
  □  L (latency): Upper bound on messaging time
  □  o (overhead): Exclusive processor time needed for a send / receive operation
■  L, o, g are measured in multiples of processor cycles

Page 25:

Hardware Abstraction: OpenCL

■  Private memory: per work-item
■  Local memory: shared within a workgroup
■  Global / constant memory: visible to all workgroups
■  Host memory: on the CPU

[4]

Page 26:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 27:

Software View: Concurrency vs. Parallelism

■  Concurrency means dealing with several things at once
  □  Programming concept for the developer
  □  In shared-memory systems, implemented by time sharing
■  Parallelism means doing several things at once
  □  Demands parallel hardware
■  'Parallel programming' is a misnomer
  □  Concurrent programming aiming at parallel execution
■  Any parallel software is concurrent software
  □  Note: some researchers disagree, most practitioners agree
■  Concurrent software is not always parallel software
  □  Many server applications achieve scalability by optimizing concurrency only (web server)

Page 28:

Server Example: No Concurrency, No Parallelism


Page 29:

Server Example: Concurrency for Throughput


Page 30:

Server Example: Parallelism for Throughput


Page 31:

Server Example: Parallelism for Speedup


Page 32:

Concurrent Execution

■  Program as a sequence of atomic statements
  □  "Atomic": executed without interruption
■  Concurrent execution is the interleaving of atomic statements from multiple tasks
  □  Tasks may share resources (variables, operating system handles, ...)
  □  Operating system timing is not predictable, so the interleaving is not predictable
  □  May impact the result of the application
■  Since parallel programs are concurrent programs, we need to deal with that!

Example: initial state x=1; Task 1: y=x; y=y-1; x=y; Task 2: z=x; z=z+1; x=z

Case 1: y=x, z=x, y=y-1, z=z+1, x=y, x=z → x=2
Case 2: y=x, y=y-1, x=y, z=x, z=z+1, x=z → x=1
Case 3: y=x, y=y-1, z=x, z=z+1, x=z, x=y → x=0
Case 4: y=x, z=x, y=y-1, z=z+1, x=z, x=y → x=0
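The same nondeterminism appears with real threads. A minimal sketch in C with POSIX threads (assumed available); the final counter value depends on how the unsynchronized increments interleave:

/* Race condition: two threads increment a shared counter without
   synchronization, so updates can be lost. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;              /* read-modify-write, not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* often less than 2000000 */
    return 0;
}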

Page 33:

Critical Section

■  N threads have some code – the critical section – with shared data access
■  Mutual exclusion demand
  □  Only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
■  Progress demand
  □  If no other thread is in the critical section, the decision for entering should not be postponed indefinitely. Only threads that wait for entering the critical section are allowed to participate in such decisions.
■  Bounded waiting demand
  □  It must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)

Page 34:

Critical Sections with Mutexes

[Figure: threads T1-T3 call m.lock() / m.unlock() around their critical sections; while one thread is inside, the others wait in the mutex's waiting queue]
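A minimal sketch in C with POSIX threads (assumed available), protecting the shared counter from the earlier race-condition example:

#include <pthread.h>

static long counter = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);     /* enter critical section */
        counter++;
        pthread_mutex_unlock(&m);   /* leave critical section */
    }
    return NULL;
}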

Page 35:

Critical Sections with High-Level Primitives

■  Today: multitude of high-level synchronization primitives
■  Spinlock
  □  Performs busy waiting, lowest overhead for short locks
■  Reader / writer lock (sketch below)
  □  Special case of mutual exclusion through semaphores
  □  Multiple "reader" processes can enter the critical section at the same time, but a "writer" process should gain exclusive access
  □  Different optimizations possible: minimum reader delay, minimum writer delay, throughput, ...
■  Mutex
  □  Semaphore that works amongst operating system processes
■  Concurrent collections
  □  Blocking queues and key-value maps with concurrency support
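A reader / writer lock in C with POSIX threads (a minimal sketch, assuming pthread rwlocks are available):

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value = 0;

int read_value(void) {
    pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rw);   /* writer gets exclusive access */
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}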

Page 36:

Critical Sections with High-Level Primitives

■  Reentrant lock
  □  Lock can be obtained several times without locking on itself
  □  Useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive
  □  A reentrant mutex needs to remember the locking thread(s), which increases the overhead
■  Barriers (sketch below)
  □  All concurrent activities stop there and continue together
  □  Participants statically defined at compile or start time
  □  Newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
  □  Memory barriers (memory fences) enforce separation of memory operations before and after the barrier
    ◊  Needed for low-level synchronization implementation
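A thread barrier in C with POSIX threads (a minimal sketch, assuming pthread barriers are available):

#include <pthread.h>

#define NTHREADS 4
static pthread_barrier_t barrier;  /* in main():
                                      pthread_barrier_init(&barrier, NULL, NTHREADS); */

static void *phase_worker(void *arg) {
    /* ... phase 1 work ... */
    pthread_barrier_wait(&barrier);  /* all NTHREADS threads stop here */
    /* phase 2 starts only after every participant has arrived */
    return NULL;
}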

Page 37:


Nasty Stuff

Deadlock
■  Two or more processes / threads are unable to proceed
■  Each is waiting for one of the others to do something (see the sketch below)

Livelock
■  Two or more processes / threads continuously change their states in response to changes in the other processes / threads
■  No global progress for the application

Race condition
■  Two or more processes / threads are executed concurrently
■  The final result of the application depends on the relative timing of their execution
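A classic deadlock sketch in C with POSIX threads (assumed available): two threads take the same two locks in opposite order, creating a circular wait:

#include <pthread.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void *t1(void *arg) {
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   /* may wait forever for t2 to release b */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

static void *t2(void *arg) {
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* may wait forever for t1 to release a */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

/* Fix: establish a global lock order (always a before b)
   to break the circular wait condition. */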

Page 38:

Coffman Conditions

■  1970. E.G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
  □  All conditions must be fulfilled to allow a deadlock to happen
  □  Mutual exclusion condition – individual resources are available or held by no more than one thread at a time
  □  Hold-and-wait condition – threads already holding resources may attempt to hold new resources
  □  No-preemption condition – once a thread holds a resource, it must voluntarily release it on its own
  □  Circular wait condition – it is possible for a thread to wait for a resource held by the next thread in the chain
■  Avoiding circular wait turned out to be the easiest solution for deadlock avoidance
■  Avoiding mutual exclusion leads to non-blocking synchronization
  □  These algorithms no longer have a critical section

Page 39:

Terminology

Starvation
■  A runnable process / thread is overlooked indefinitely
■  Although it is able to proceed, it is never chosen to run (dispatching / scheduling)

Atomic operation
■  A function or action implemented as a sequence of one or more instructions
■  Appears to be indivisible – no other process / thread can see an intermediate state or interrupt the operation
■  Executed as a group, or not executed at all

Mutual exclusion
■  The requirement that when one process / thread is using a resource, no other shall be allowed to do so

Page 40:

Is it worth the pain?

■  Parallelization metrics are application-dependent, but follow a common set of concepts
  □  Speedup: more resources lead to less time for solving the same task
  □  Linear speedup: n times more resources → n times speedup
  □  Scaleup: more resources solve a larger version of the same task in the same time
  □  Linear scaleup: n times more resources → n times larger problem solvable
■  The most important goal depends on the application
  □  Transaction processing usually heads for throughput (scalability)
  □  Decision support usually heads for response time (speedup)

Page 41:

Speedup

■  Idealized assumptions
  □  All tasks are equal sized
  □  All code parts can run in parallel

Tasks: v=12, processing elements: N=1 → time needed: T1=12
Tasks: v=12, processing elements: N=3 → time needed: T3=4, (linear) speedup: T1/T3 = 12/4 = 3

Page 42:

Speedup with Load Imbalance

■  Assumptions
  □  Tasks have different sizes; the best-possible speedup depends on optimized resource usage
  □  All code parts can run in parallel

Tasks: v=12, processing elements: N=1 → time needed: T1=16
Tasks: v=12, processing elements: N=3 → time needed: T3=6, speedup: T1/T3 = 16/6 ≈ 2.67

Page 43:

Speedup with Serial Parts

■  Each application has inherently non-parallelizable serial parts
  □  Algorithmic limitations
  □  Shared resources acting as bottleneck
  □  Overhead for program start
  □  Communication overhead in shared-nothing systems

[Figure: execution timeline alternating serial and parallel phases: tSER1, tPAR1, tSER2, tPAR2, tSER3]

Page 44:

Amdahl’s Law

■  Gene Amdahl. "Validity of the single processor approach to achieving large scale computing capabilities". AFIPS 1967
  □  Serial parts: TSER = tSER1 + tSER2 + tSER3 + ...
  □  Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + ...
  □  Execution time with one processing element: T1 = TSER + TPAR
  □  Execution time with N parallel processing elements: TN >= TSER + TPAR / N
    ◊  Equality only on perfect parallelization, e.g. no load imbalance
  □  Amdahl's Law for the maximum speedup with N processing elements:

S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
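A small numeric illustration of the bound, as a C sketch (the function name is ours):

#include <stdio.h>

/* Maximum speedup by Amdahl's Law for serial time t_ser,
   parallelizable time t_par and N processing elements. */
double amdahl_speedup(double t_ser, double t_par, int n) {
    return (t_ser + t_par) / (t_ser + t_par / n);
}

int main(void) {
    /* 10% serial part: speedup saturates below 10 */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N=%4d  S=%.2f\n", n, amdahl_speedup(0.1, 0.9, n));
    return 0;
}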

Page 45:

Amdahl’s Law

[Figure: Amdahl's Law speedup curves]

Page 46:

Amdahl’s Law

■  Speedup through parallelism is hard to achieve
■  For unlimited resources, the speedup is bound by the serial parts (assume T1 = 1):

S_N→∞ = T1 / T_N→∞ = 1 / TSER

■  The parallelization problem relates to all system layers
  □  Hardware offers some degree of parallel execution
  □  Speedup gained is bound by serial parts:
    ◊  Limitations of hardware components
    ◊  Necessary serial activities in the operating system, virtual runtime system, middleware and the application
    ◊  Overhead for the parallelization itself

Page 47:

Gustafson-Barsis’ Law (1988)

■  Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time
  □  Rather solve the biggest problem in reasonable time
■  Problem size could then scale with the number of processors
  □  Leads to a larger parallelizable part with increasing N
  □  Typical goal in simulation problems
■  Time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
■  Formally:
  □  PN: portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
  □  Maximum scaled speedup by N processors (the standard Gustafson-Barsis formulation): S(N) = (1 − PN) + N · PN

Page 48:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 49:

Programming Model for Shared Memory

■  Different programming models for concurrency in shared memory
■  Processes and threads are mapped to processing elements (cores)
■  Process- and thread-based programming is typically part of operating system lectures

[Figure: concurrent processes with explicitly shared memory, concurrent threads sharing one process's memory, and concurrent tasks mapped to threads]

Page 50:

OpenMP

■  Programming with the fork-join model (see the sketch below)
  □  Master thread forks into declared tasks
  □  Runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
  □  Worker task barrier before finalization (join)

[Figure: fork-join model; Wikipedia]
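A minimal fork-join sketch in C with OpenMP (assumed available); the parallel region forks a thread team, loop iterations are distributed across it, and the implicit barrier at the end joins the team:

#include <stdio.h>

int main(void) {
    double a[1000];

    #pragma omp parallel for    /* fork: a team of threads shares the loop */
    for (int i = 0; i < 1000; i++)
        a[i] = i * 2.0;
    /* implicit barrier + join at the end of the parallel region */

    printf("a[999] = %f\n", a[999]);
    return 0;
}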

Page 51:

Task Scheduling

■  Classical task scheduling with a central queue
  □  All worker threads fetch tasks from a central queue
  □  Scalability issue with increasing thread (resp. core) count
■  Work stealing in OpenMP (and other libraries)
  □  Task queue per thread
  □  An idling thread "steals" tasks from another thread
  □  Independent from thread scheduling
  □  Only mutual synchronization
  □  No central queue

[Figure: work stealing – each thread pushes new tasks to and pops next tasks from its own task queue; idle threads steal from another thread's queue]

Page 52:

PGAS Languages

■  Non-uniform memory architectures (NUMA) became the default
■  But: the understanding of memory in programming is flat
  □  All variables are equal in access time
  □  Considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
■  Partitioned global address space (PGAS) approach
  □  Driven by the high-performance computing community
  □  Modern approach for large-scale NUMA
  □  Explicit notion of a memory partition per processor
    ◊  Data is designated as local (near) or global (possibly far)
    ◊  Programmer is aware of NUMA nodes
  □  Performance optimization for deep memory hierarchies

Page 53:

Parallel Programming for Accelerators

■  OpenCL exposes CPUs, GPUs, and other accelerators as "devices"
■  Each "device" contains one or more "compute units", i.e. cores, SMs, ...
■  Each "compute unit" contains one or more SIMD "processing elements"

[4]

Page 54:

The BIG idea behind OpenCL

OpenCL execution model: execute a kernel at each point in a problem domain.

E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.
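As a sketch in OpenCL C (the kernel name and image layout are illustrative): each work-item of the 2D domain handles exactly one pixel:

/* One kernel instance per pixel of a width-pixel-wide 8-bit image. */
__kernel void invert(__global uchar *img, int width) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    img[y * width + x] = 255 - img[y * width + x];
}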

Page 55:

Message Passing

■  Programming paradigm targeting shared-nothing infrastructures
  □  Implementations for shared memory are available, but typically not the best-possible approach
■  Multiple instances of the same application run on a set of nodes (SPMD)

[Figure: a submission host starts instances 0-3 of the same program on the execution hosts]

Page 56:

Single Program Multiple Data (SPMD)

■  Sequential program and data distribution
■  Sequential node program with message passing
■  Identical copies (P0, P1, P2, P3) with different process identifications

Page 57:

Actor Model

■  Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973.
  □  Another mathematical model for concurrent computation
  □  No global system state concept (relationship to physics)
  □  Actor as computation primitive
    ◊  Makes local decisions
    ◊  Concurrently creates more actors
    ◊  Concurrently sends / receives messages
  □  Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
  □  Recipient is identified by mailing address
■  "Everything is an actor"

Page 58:

Actor Model

■  Interaction with asynchronous, unordered, distributed messaging
■  Fundamental aspects
  □  Emphasis on local state, time and name space
  □  No central entity
  □  Actor A gets to know actor B only by direct creation, or by name transmission from another actor C
■  Computation
  □  Not a global state sequence, but a partially ordered set of events
    ◊  Event: receipt of a message by a target actor
    ◊  Each event is a transition from one local state to another
    ◊  Events may happen in parallel
■  Messaging reliability declared as an orthogonal aspect

Page 59:

Message Passing Interface (MPI)

■  MPI_GATHER(IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm)
  □  Each process sends its buffer to the root process, including root
  □  Incoming messages are stored in rank order
  □  The receive buffer is ignored for all non-root processes
  □  MPI_GATHERV allows a varying count of data to be received
  □  Returns as soon as the local buffer is re-usable (completion at the other processes is not promised)
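A minimal usage sketch in C (assuming an MPI implementation): every rank contributes one integer, and the root receives them in rank order:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;   /* each process' contribution */
    int recv[64];             /* assumes size <= 64 */

    MPI_Gather(&mine, 1, MPI_INT, recv, 1, MPI_INT,
               0 /* root */, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("rank %d sent %d\n", i, recv[i]);

    MPI_Finalize();
    return 0;
}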

Page 60:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 61:

Execution Environment Mapping

[Figure: execution environments ranging from Single Instruction, Multiple Data (SIMD) to Multiple Instruction, Multiple Data (MIMD)]

Page 62:

Patterns for Parallel Programming [Mattson]

■  Finding Concurrency design space
  □  Task / data decomposition, task grouping and ordering due to data-flow dependencies, design evaluation
■  Algorithm Structure design space
  □  Task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination
  □  Mapping of concurrent design elements to execution units
■  Supporting Structures design space
  □  SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array
  □  Program structures and data structures used for code creation
■  Implementation Mechanisms design space

Page 63:

Designing Parallel Algorithms [Foster]

■  Map a workload problem onto an execution environment
  □  Concurrency for speedup
  □  Data locality for speedup
  □  Scalability
■  The best parallel solution typically differs massively from the sequential version of an algorithm
■  Foster defines four distinct stages of a methodological approach
■  Example: parallel sum

Page 64:

Example: Parallel Reduction

■  Reduce a set of elements into one, given an operation

■  Example: Sum
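The summation example as a C sketch with OpenMP (assumed available); the reduction clause gives every thread a private partial sum and combines the partial results at the end:

#include <stdio.h>

#define N 1000000
static double data[N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];          /* each thread sums into a private copy */

    printf("sum = %f\n", sum);   /* 1000000.000000 */
    return 0;
}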

Page 65:

Designing Parallel Algorithms [Foster]

■  A) Search for concurrency and scalability
  □  Partitioning – decompose computation and data into small tasks
  □  Communication – define the necessary coordination of task execution
■  B) Search for locality and other performance-related issues
  □  Agglomeration – consider performance and implementation costs
  □  Mapping – maximize processor utilization, minimize communication
■  Might require backtracking or parallel investigation of steps

Page 66:

Partitioning

■  Expose opportunities for parallel execution – fine-grained decomposition
■  A good partition keeps computation and data together
  □  Data partitioning leads to data parallelism
  □  Computation partitioning leads to task parallelism
  □  Complementary approaches, can lead to different algorithms
  □  Reveal hidden structures of the algorithm that have potential
  □  Investigate complementary views on the problem
■  Avoid replication of either computation or data; can be revised later to reduce communication overhead
■  The step results in multiple candidate solutions

Page 67:

Partitioning - Decomposition Types

■  Domain decomposition
  □  Define small data fragments
  □  Specify the computation for them
  □  Different phases of computation on the same data are handled separately
  □  Rule of thumb: first focus on large or frequently used data structures
■  Functional decomposition
  □  Split up the computation into disjoint tasks, ignoring the data accessed for the moment
  □  With significant data overlap, domain decomposition is more appropriate

Page 68:

Partitioning Strategies [Breshears]

■  Produce at least as many tasks as there will be threads / cores
  □  But: it might be more effective to use only a fraction of the cores (granularity)
  □  The computation must pay off with respect to the overhead
■  Avoid synchronization, since it adds up as overhead to the serial execution time
■  Patterns for data decomposition
  □  By element (one-dimensional)
  □  By row, by column group, by block (multi-dimensional)
  □  Influenced by the ratio of computation and synchronization

Page 69:

Partitioning - Checklist

■  Checklist for the resulting partitioning scheme
  □  Order of magnitude more tasks than processors?
    -> Keeps flexibility for the next steps
  □  Avoidance of redundant computation and storage requirements?
    -> Scalability for large problem sizes
  □  Tasks of comparable size?
    -> Goal to allocate equal work to processors
  □  Does the number of tasks scale with the problem size?
    -> The algorithm should be able to solve larger problems with more processors
■  Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem

Page 70:

Communication Step

■  Specify the links between data consumers and data producers
■  Specify the kind and number of messages on these links
■  Domain decomposition problems might have tricky communication infrastructures, due to data dependencies
■  Communication in functional decomposition problems can easily be modeled from the data flow between the tasks
■  Categorization of communication patterns
  □  Local communication (few neighbors) vs. global communication
  □  Structured communication (e.g. tree) vs. unstructured communication
  □  Static vs. dynamic communication structure
  □  Synchronous vs. asynchronous communication

Page 71:

Communication - Hints

■  Distribute computation and communication, don't centralize the algorithm
  □  Bad example: a central manager for parallel summation
  □  Divide-and-conquer helps as a mental model to identify concurrency
■  Unstructured communication is hard to agglomerate, better avoid it
■  Checklist for the communication design
  □  Do all tasks perform the same amount of communication?
    -> Distribute or replicate communication hot spots
  □  Does each task perform only local communication?
  □  Can communication happen concurrently?
  □  Can computation happen concurrently?

Page 72:

Ghost Cells

■  Domain decomposition might lead to chunks that demand data from each other for their computation
■  Solution 1: copy the necessary portion of data ('ghost cells'), as sketched below
  □  If no synchronization is needed after an update
  □  Data amount and frequency of updates influence the resulting overhead and efficiency
  □  Additional memory consumption
■  Solution 2: access the relevant data 'remotely'
  □  Delays thread coordination until the data is really needed
  □  Correctness ("old" data vs. "new" data) must be considered on parallel progress
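A ghost-cell exchange for a 1D domain decomposition, sketched in C with MPI (assumed available); each rank keeps one ghost cell per side and swaps boundary values with its neighbors before computing:

#include <mpi.h>

/* u holds n interior cells plus ghost cells u[0] and u[n+1]. */
void exchange_ghost_cells(double *u, int n, int rank, int size) {
    MPI_Comm comm = MPI_COMM_WORLD;
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my left border, receive my right ghost cell */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my right border, receive my left ghost cell */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);
}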

Page 73:

Agglomeration Step

■  The algorithm so far is correct, but not specialized for some execution environment
■  Check the partitioning and communication decisions again
  □  Agglomerate tasks for efficient execution on some machine
  □  Replicate data and / or computation for efficiency reasons
■  The resulting number of tasks can still be greater than the number of processors
■  Three conflicting guiding decisions
  □  Reduce communication costs by coarser granularity of computation and communication
  □  Preserve flexibility with respect to later mapping decisions
  □  Reduce software engineering costs (serial -> parallel version)

Page 74:

Agglomeration [Foster]

Page 75:

Agglomeration – Granularity vs. Flexibility

■  Reduce communication costs by coarser granularity
  □  Sending less data
  □  Sending fewer messages (per-message initialization costs)
  □  Agglomerate, especially if tasks cannot run concurrently
    ◊  Also reduces task creation costs
  □  Replicate computation to avoid communication (also helps with reliability)
■  Preserve flexibility
  □  A flexibly large number of tasks is still a prerequisite for scalability
■  Define granularity as a compile-time or run-time parameter

Page 76:

Agglomeration - Checklist

■  Are communication costs reduced by increasing locality?
■  Does replicated computation outweigh its costs in all cases?
■  Does data replication restrict the range of problem sizes / processor counts?
■  Do the larger tasks still have similar computation / communication costs?
■  Do the larger tasks still act with sufficient concurrency?
■  Does the number of tasks still scale with the problem size?
■  How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
■  Is the transition to parallel code worth the engineering costs?

Page 77:

Mapping Step

■  Only relevant for shared-nothing systems, since shared-memory systems typically perform automatic task scheduling
■  Minimize execution time by
  □  Placing concurrent tasks on different nodes
  □  Placing tasks with heavy communication on the same node
■  Conflicting strategies, additionally restricted by resource limits
  □  In general, an NP-complete bin-packing problem
■  Set of sophisticated (dynamic) heuristics for load balancing
  □  Preference for local algorithms that do not need global scheduling state

Page 78:

Surface-To-Volume Effect [Foster, Breshears]

■  Visualize the data to be processed (in parallel) as a sliced 3D cube
■  Synchronization requirements of a task
  □  Proportional to the surface of the data slice it operates upon
  □  Visualized by the amount of 'borders' of the slice
■  Computation work of a task
  □  Proportional to the volume of the data slice it operates upon
  □  Represents the granularity of decomposition
■  Ratio of synchronization and computation
  □  High synchronization, low computation, high ratio → bad
  □  Low synchronization, high computation, low ratio → good
  □  The ratio decreases with increasing data size per task
■  Coarse granularity by agglomerating tasks in all dimensions
  □  For a given volume, the surface then goes down → good

Page 79:

Surface-To-Volume Effect [Foster, Breshears]

[Figure: surface-to-volume effect; (C) nicerweb.com]

Page 80:

Surface-to-Volume Effect [Foster]

■  Computation on an 8x8 grid
■  (a): 64 tasks, one point each
  □  64x4 = 256 synchronizations
  □  256 data values are transferred
■  (b): 4 tasks, 16 points each
  □  4x4 = 16 synchronizations
  □  16x4 = 64 data values are transferred

Page 81:

Designing Parallel Algorithms [Breshears]

■  A parallel solution must keep the sequential consistency property
■  "Mentally simulate" the execution of the parallel streams
  □  Check critical parts of the parallelized sequential application
■  Amount of computation per parallel task
  □  Overhead is always introduced by moving from serial to parallel code
  □  The speedup must offset the parallelization overhead (Amdahl)
■  Granularity: amount of parallel computation done before synchronization is needed
  □  Fine-grained granularity overhead vs. coarse-grained granularity concurrency
    ◊  Iterative approach of finding the right granularity
    ◊  A decision might be correct only for a chosen execution environment

Page 82:

OK ?!?

Page 83:

Certificate 'for free'
