
Page 1:

Page 2:

Parallel Programming Concepts Summary

Dr. Peter Tröger, M.Sc. Frank Feinbube

Page 3:

Course Topics

■  The Parallelization Problem
  □  Power wall, memory wall, Moore's law
  □  Terminology and metrics
■  Shared Memory Parallelism
  □  Theory of concurrency, hardware today and in the past
  □  Programming models, optimization, profiling
■  Shared Nothing Parallelism
  □  Theory of concurrency, hardware today and in the past
  □  Programming models, optimization, profiling
■  Accelerators
■  Patterns
■  Future trends

Page 4:

Scaring Students with Word Clouds ...

Page 5:

The Free Lunch Is Over

■  Clock speed curve flattened in 2003
  □  Heat
  □  Power consumption
  □  Leakage
■  2-3 GHz since 2001 (!)
■  Speeding up serial instruction execution through clock speed improvements no longer works
■  We stumbled into the Many-Core Era

[Herb Sutter, 2009]

Page 6:

The Power Wall

■  Air cooling capabilities are limited
  □  Maximum temperature of 100-125 °C, hot spot problem
  □  Static and dynamic power consumption must be limited
■  Power consumption increases with Moore's law, but growth of hardware performance is still expected
■  Further reducing voltage as compensation
  □  We can't do that endlessly, lower limit around 0.7 V
  □  Strange physical effects
■  Next-generation processors need to use even less power
  □  Lower the frequencies, scale them dynamically
  □  Use only parts of the processor at a time ('dark silicon')
  □  Build energy-efficient special-purpose hardware
■  No chance for faster processors through frequency increase

Page 7:

Memory Wall

■  Caching: well-established optimization technique for performance
■  Relies on data locality
  □  Some instructions are often used (e.g. loops)
  □  Some data is often used (e.g. local variables)
  □  Hardware keeps a copy of the data in the faster cache
  □  On read attempts, data is taken directly from the cache
  □  On write, data is cached and eventually written to memory
■  Similar to ILP, the potential is limited
  □  Larger caches do not help automatically
  □  At some point, all data locality in the code is already exploited
  □  Manual vs. compiler-driven optimization

[arstechnica.com]

Page 8:

Memory Wall

■  If caching is limited, we simply need faster memory
■  The problem: shared memory is 'shared'
  □  Interconnect contention
  □  Memory bandwidth
    ◊  Memory transfer speed is limited by the power wall
    ◊  Memory transfer size is limited by the power wall
■  Transfer technology cannot keep up with GHz processors
■  Memory is too slow, and the effects cannot be hidden through caching completely → "memory wall"

[dell.com]

Page 9:

The Situation

■  Hardware people
  □  Number of transistors N is still increasing
  □  Building larger caches no longer helps (memory wall)
  □  ILP is out of options (ILP wall)
  □  Voltage / power consumption is at the limit (power wall)
    ◊  Some help with dynamic scaling approaches
  □  Frequency F is stalled (power wall)
  □  The only possible offer is to use the increasing N for more cores
■  For faster software in the future ...
  □  Speedup must come from the utilization of an increasing core count, since F is now fixed
  □  Software must participate in the power wall handling, to keep F fixed
  □  Software must tackle the memory wall

Page 10:

Three Ways Of Doing Anything Faster [Pfister]

■  Work harder (clock speed)
  Ø  Power wall problem
  Ø  Memory wall problem
■  Work smarter (optimization, caching)
  Ø  ILP wall problem
  Ø  Memory wall problem
■  Get help (parallelization)
  □  More cores per single CPU
  □  Software needs to exploit them in the right way
  Ø  Memory wall problem

[Figure: one problem mapped onto a CPU with multiple cores]

Page 11:

Parallelism on Different Levels

■  A processor chip (socket)
  □  Chip multi-processing (CMP)
    ◊  Multiple CPUs per chip, called cores
    ◊  Multi-core / many-core
  □  Simultaneous multi-threading (SMT)
    ◊  Interleaved execution of tasks on one core
    ◊  Example: Intel Hyper-Threading
  □  Chip multi-threading (CMT) = CMP + SMT
  □  Instruction-level parallelism (ILP)
    ◊  Parallel processing of single instructions per core
■  Multiple processor chips in one machine (multi-processing)
  □  Symmetric multi-processing (SMP)
■  Multiple processor chips in many machines (multi-computer)

Page 12:

Parallelism on Different Levels

[Figure: CMP architecture with ILP and SMT in each core; arstechnica.com]

Page 13:

Parallelism on Different Levels

Blue Gene/Q [IBM System Technology Group, © 2011 IBM Corporation]

1. Chip: 16+2 µP cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

• Sustained single-node performance: 10x P, 20x L
• MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models for exploitation of node hardware concurrency

Page 14:

Memory on Different Levels

From fast / expensive / small to slow / cheap / large:
■  Registers (volatile)
■  Processor caches (volatile)
■  Random Access Memory (RAM) (volatile)
■  Flash / SSD memory (non-volatile)
■  Hard drives (non-volatile)
■  Tapes (non-volatile)

Page 15:

A Wild Mixture

[Figure: a mixed set of machines connected by a network]

Page 16:

GF100

[Figure: NVIDIA GF100 GPU block diagram]

Page 17:

A Wild Mixture

[Figure: four cluster nodes, each with two multi-core CPUs (QPI interconnect, DDR3 memory), a GPU and a MIC accelerator with GDDR5 memory attached via 16x PCIe, and dual Gigabit LAN]

Page 18:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 19:

Hardware Abstraction: Flynn's Taxonomy

■  Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension

■  Single Instruction, Single Data (SISD): one instruction stream processes one data item per processing step
■  Single Instruction, Multiple Data (SIMD): one instruction is applied to multiple data items per processing step
■  Multiple Instruction, Single Data (MISD): multiple instructions process the same data item
■  Multiple Instruction, Multiple Data (MIMD): multiple instruction streams process multiple data items

Page 20:

Hardware Abstraction: Tasks + Processing Elements

[Figure: programs consist of processes / tasks, which are mapped to processing elements (PEs); each node provides several PEs with shared memory, and the nodes are connected by a network]

Page 21:

Hardware Abstraction: PRAM

■  RAM assumptions: Constant memory access time, unlimited memory

■  PRAM assumptions: Non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors

■  Alternative models: BSP, LogP

[Figure: RAM model – one CPU between input memory and output; PRAM model – multiple CPUs accessing memory via a shared bus]

Page 22:

Hardware Abstraction: BSP

■  Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
■  Success of the von Neumann model
  □  Bridge between hardware and software
  □  High-level languages can be efficiently compiled on this model
  □  Hardware designers can optimize the realization of this model
■  Similar model for parallel machines
  □  Should be neutral about the number of processors
  □  Program should be written for v virtual processors that are mapped to p physical ones
  □  When v >> p, the compiler has options
■  BSP computation consists of a series of supersteps (sketched below):
  1.) Concurrent computation on all processors
  2.) Exchange of data between all processes
  3.) Barrier synchronization
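A single BSP superstep, sketched in C with MPI (assumed available; the local computation and buffer layout are illustrative, and n must be divisible by the process count):

/* One BSP superstep: local computation, all-to-all data exchange,
   barrier synchronization. */
#include <mpi.h>

void superstep(double *local, double *incoming, int n, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    for (int i = 0; i < n; i++)              /* 1) concurrent local computation */
        local[i] *= 2.0;                     /*    (placeholder work) */

    MPI_Alltoall(local, n / p, MPI_DOUBLE,   /* 2) exchange of data between */
                 incoming, n / p, MPI_DOUBLE,/*    all processes            */
                 comm);

    MPI_Barrier(comm);                       /* 3) barrier ends the superstep */
}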

Page 23:

Hardware Abstraction: CSP

■  Behavior of real-world objects can be described through their interaction with other objects
  □  Leave out internal implementation details
  □  The interface of a process is described as a set of atomic events
■  Event examples for an ATM:
  □  card – insertion of a credit card in the ATM card slot
  □  money – extraction of money from the ATM dispenser
■  Events for a printer: {accept, print}
■  Alphabet – set of relevant (!) events for an object description
  □  An event may never happen in the interaction
  □  Interaction is restricted to this set of events
  □  αATM = {card, money}
■  A CSP process is the behavior of an object, described with its alphabet

Page 24:

Hardware Abstraction: LogP

■  Criticism of oversimplification in PRAM-based approaches, which encourages the exploitation of 'formal loopholes' (e.g. communication)
■  Trend towards multicomputer systems with large local memories
■  Characterization of a parallel machine by:
  □  P: Number of processors
  □  g (gap): Minimum time between two consecutive transmissions
    ◊  Reciprocal corresponds to per-processor communication bandwidth
  □  L (latency): Upper bound on messaging time
  □  o (overhead): Exclusive processor time needed for a send / receive operation
■  L, o, g are measured in multiples of processor cycles

Page 25:

Hardware Abstraction: OpenCL

■  Private memory: per work-item
■  Local memory: shared within a workgroup
■  Global / constant memory: visible to all workgroups
■  Host memory: on the CPU

[4]

Page 26:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 27:

Software View: Concurrency vs. Parallelism

■  Concurrency means dealing with several things at once
  □  Programming concept for the developer
  □  In shared-memory systems, implemented by time sharing
■  Parallelism means doing several things at once
  □  Demands parallel hardware
■  'Parallel programming' is a misnomer
  □  Concurrent programming aiming at parallel execution
■  Any parallel software is concurrent software
  □  Note: some researchers disagree, most practitioners agree
■  Concurrent software is not always parallel software
  □  Many server applications achieve scalability by optimizing concurrency only (web server)

Page 28:

Server Example: No Concurrency, No Parallelism


Page 29:

Server Example: Concurrency for Throughput


Page 30:

Server Example: Parallelism for Throughput


Page 31:

Server Example: Parallelism for Speedup


Page 32:

Concurrent Execution

■  Program as a sequence of atomic statements
  □  "Atomic": executed without interruption
■  Concurrent execution is the interleaving of atomic statements from multiple tasks
  □  Tasks may share resources (variables, operating system handles, ...)
  □  Operating system timing is not predictable, so the interleaving is not predictable
  □  May impact the result of the application
■  Since parallel programs are concurrent programs, we need to deal with that!

Example: initial state x=1; Task 1: y=x; y=y-1; x=y; Task 2: z=x; z=z+1; x=z

Case 1: y=x, z=x, y=y-1, z=z+1, x=y, x=z → x=2
Case 2: y=x, y=y-1, x=y, z=x, z=z+1, x=z → x=1
Case 3: y=x, y=y-1, z=x, z=z+1, x=z, x=y → x=0
Case 4: y=x, z=x, y=y-1, z=z+1, x=z, x=y → x=0
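The same nondeterminism appears with real threads. A minimal sketch in C with POSIX threads (assumed available); the final counter value depends on how the unsynchronized increments interleave:

/* Race condition: two threads increment a shared counter without
   synchronization, so updates can be lost. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;              /* read-modify-write, not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* often less than 2000000 */
    return 0;
}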

Page 33:

Critical Section

■  N threads have some code – the critical section – with shared data access
■  Mutual exclusion demand
  □  Only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
■  Progress demand
  □  If no other thread is in the critical section, the decision for entering should not be postponed indefinitely. Only threads that wait for entering the critical section are allowed to participate in such decisions.
■  Bounded waiting demand
  □  It must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)

Page 34:

Critical Sections with Mutexes

[Figure: threads T1-T3 call m.lock() / m.unlock() around their critical sections; while one thread is inside, the others wait in the mutex's waiting queue]
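A minimal sketch in C with POSIX threads (assumed available), protecting the shared counter from the earlier race-condition example:

#include <pthread.h>

static long counter = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);     /* enter critical section */
        counter++;
        pthread_mutex_unlock(&m);   /* leave critical section */
    }
    return NULL;
}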

Page 35:

Critical Sections with High-Level Primitives

■  Today: multitude of high-level synchronization primitives
■  Spinlock
  □  Performs busy waiting, lowest overhead for short locks
■  Reader / writer lock (sketch below)
  □  Special case of mutual exclusion through semaphores
  □  Multiple "reader" processes can enter the critical section at the same time, but a "writer" process should gain exclusive access
  □  Different optimizations possible: minimum reader delay, minimum writer delay, throughput, ...
■  Mutex
  □  Semaphore that works amongst operating system processes
■  Concurrent collections
  □  Blocking queues and key-value maps with concurrency support
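A reader / writer lock in C with POSIX threads (a minimal sketch, assuming pthread rwlocks are available):

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value = 0;

int read_value(void) {
    pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rw);   /* writer gets exclusive access */
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}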

Page 36:

Critical Sections with High-Level Primitives

■  Reentrant lock
  □  Lock can be obtained several times without locking on itself
  □  Useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive
  □  A reentrant mutex needs to remember the locking thread(s), which increases the overhead
■  Barriers (sketch below)
  □  All concurrent activities stop there and continue together
  □  Participants statically defined at compile or start time
  □  Newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
  □  Memory barriers (memory fences) enforce separation of memory operations before and after the barrier
    ◊  Needed for low-level synchronization implementation
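A thread barrier in C with POSIX threads (a minimal sketch, assuming pthread barriers are available):

#include <pthread.h>

#define NTHREADS 4
static pthread_barrier_t barrier;  /* in main():
                                      pthread_barrier_init(&barrier, NULL, NTHREADS); */

static void *phase_worker(void *arg) {
    /* ... phase 1 work ... */
    pthread_barrier_wait(&barrier);  /* all NTHREADS threads stop here */
    /* phase 2 starts only after every participant has arrived */
    return NULL;
}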

Page 37:


Nasty Stuff

Deadlock
■  Two or more processes / threads are unable to proceed
■  Each is waiting for one of the others to do something (see the sketch below)

Livelock
■  Two or more processes / threads continuously change their states in response to changes in the other processes / threads
■  No global progress for the application

Race condition
■  Two or more processes / threads are executed concurrently
■  The final result of the application depends on the relative timing of their execution
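A classic deadlock sketch in C with POSIX threads (assumed available): two threads take the same two locks in opposite order, creating a circular wait:

#include <pthread.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void *t1(void *arg) {
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   /* may wait forever for t2 to release b */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

static void *t2(void *arg) {
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* may wait forever for t1 to release a */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

/* Fix: establish a global lock order (always a before b)
   to break the circular wait condition. */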

Page 38:

Coffman Conditions

■  1970. E.G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
  □  All conditions must be fulfilled to allow a deadlock to happen
  □  Mutual exclusion condition – individual resources are available or held by no more than one thread at a time
  □  Hold-and-wait condition – threads already holding resources may attempt to hold new resources
  □  No-preemption condition – once a thread holds a resource, it must voluntarily release it on its own
  □  Circular wait condition – it is possible for a thread to wait for a resource held by the next thread in the chain
■  Avoiding circular wait turned out to be the easiest solution for deadlock avoidance
■  Avoiding mutual exclusion leads to non-blocking synchronization
  □  These algorithms no longer have a critical section

Page 39:

Terminology

Starvation
■  A runnable process / thread is overlooked indefinitely
■  Although it is able to proceed, it is never chosen to run (dispatching / scheduling)

Atomic operation
■  A function or action implemented as a sequence of one or more instructions
■  Appears to be indivisible – no other process / thread can see an intermediate state or interrupt the operation
■  Executed as a group, or not executed at all

Mutual exclusion
■  The requirement that when one process / thread is using a resource, no other shall be allowed to do so

Page 40:

Is it worth the pain?

■  Parallelization metrics are application-dependent, but follow a common set of concepts
  □  Speedup: more resources lead to less time for solving the same task
  □  Linear speedup: n times more resources → n times speedup
  □  Scaleup: more resources solve a larger version of the same task in the same time
  □  Linear scaleup: n times more resources → n times larger problem solvable
■  The most important goal depends on the application
  □  Transaction processing usually heads for throughput (scalability)
  □  Decision support usually heads for response time (speedup)

Page 41:

Speedup

■  Idealized assumptions
  □  All tasks are equal sized
  □  All code parts can run in parallel

Tasks: v=12, processing elements: N=1 → time needed: T1=12
Tasks: v=12, processing elements: N=3 → time needed: T3=4, (linear) speedup: T1/T3 = 12/4 = 3

Page 42:

Speedup with Load Imbalance

■  Assumptions
  □  Tasks have different sizes; the best-possible speedup depends on optimized resource usage
  □  All code parts can run in parallel

Tasks: v=12, processing elements: N=1 → time needed: T1=16
Tasks: v=12, processing elements: N=3 → time needed: T3=6, speedup: T1/T3 = 16/6 ≈ 2.67

Page 43:

Speedup with Serial Parts

■  Each application has inherently non-parallelizable serial parts
  □  Algorithmic limitations
  □  Shared resources acting as bottleneck
  □  Overhead for program start
  □  Communication overhead in shared-nothing systems

[Figure: execution timeline alternating serial and parallel phases: tSER1, tPAR1, tSER2, tPAR2, tSER3]

Page 44:

Amdahl’s Law

■  Gene Amdahl. "Validity of the single processor approach to achieving large scale computing capabilities". AFIPS 1967
  □  Serial parts: TSER = tSER1 + tSER2 + tSER3 + ...
  □  Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + ...
  □  Execution time with one processing element: T1 = TSER + TPAR
  □  Execution time with N parallel processing elements: TN >= TSER + TPAR / N
    ◊  Equality only on perfect parallelization, e.g. no load imbalance
  □  Amdahl's Law for the maximum speedup with N processing elements:

S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
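A small numeric illustration of the bound, as a C sketch (the function name is ours):

#include <stdio.h>

/* Maximum speedup by Amdahl's Law for serial time t_ser,
   parallelizable time t_par and N processing elements. */
double amdahl_speedup(double t_ser, double t_par, int n) {
    return (t_ser + t_par) / (t_ser + t_par / n);
}

int main(void) {
    /* 10% serial part: speedup saturates below 10 */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N=%4d  S=%.2f\n", n, amdahl_speedup(0.1, 0.9, n));
    return 0;
}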

Page 45:

Amdahl’s Law

[Figure: Amdahl's Law speedup curves]

Page 46:

Amdahl’s Law

■  Speedup through parallelism is hard to achieve
■  For unlimited resources, the speedup is bound by the serial parts (assume T1 = 1):

S_N→∞ = T1 / T_N→∞ = 1 / TSER

■  The parallelization problem relates to all system layers
  □  Hardware offers some degree of parallel execution
  □  Speedup gained is bound by serial parts:
    ◊  Limitations of hardware components
    ◊  Necessary serial activities in the operating system, virtual runtime system, middleware and the application
    ◊  Overhead for the parallelization itself

Page 47:

Gustafson-Barsis’ Law (1988)

■  Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time
  □  Rather solve the biggest problem in reasonable time
■  Problem size could then scale with the number of processors
  □  Leads to a larger parallelizable part with increasing N
  □  Typical goal in simulation problems
■  Time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
■  Formally:
  □  PN: portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
  □  Maximum scaled speedup by N processors (the standard Gustafson-Barsis formulation): S(N) = (1 − PN) + N · PN

Page 48:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 49:

Programming Model for Shared Memory

■  Different programming models for concurrency in shared memory
■  Processes and threads are mapped to processing elements (cores)
■  Process- and thread-based programming is typically part of operating system lectures

[Figure: concurrent processes with explicitly shared memory, concurrent threads sharing one process's memory, and concurrent tasks mapped to threads]

Page 50:

OpenMP

■  Programming with the fork-join model (see the sketch below)
  □  Master thread forks into declared tasks
  □  Runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
  □  Worker task barrier before finalization (join)

[Figure: fork-join model; Wikipedia]
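A minimal fork-join sketch in C with OpenMP (assumed available); the parallel region forks a thread team, loop iterations are distributed across it, and the implicit barrier at the end joins the team:

#include <stdio.h>

int main(void) {
    double a[1000];

    #pragma omp parallel for    /* fork: a team of threads shares the loop */
    for (int i = 0; i < 1000; i++)
        a[i] = i * 2.0;
    /* implicit barrier + join at the end of the parallel region */

    printf("a[999] = %f\n", a[999]);
    return 0;
}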

Page 51:

Task Scheduling

■  Classical task scheduling with a central queue
  □  All worker threads fetch tasks from a central queue
  □  Scalability issue with increasing thread (resp. core) count
■  Work stealing in OpenMP (and other libraries)
  □  Task queue per thread
  □  An idling thread "steals" tasks from another thread
  □  Independent from thread scheduling
  □  Only mutual synchronization
  □  No central queue

[Figure: work stealing – each thread pushes new tasks to and pops next tasks from its own task queue; idle threads steal from another thread's queue]

Page 52:

PGAS Languages

■  Non-uniform memory architectures (NUMA) became the default
■  But: the understanding of memory in programming is flat
  □  All variables are equal in access time
  □  Considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
■  Partitioned global address space (PGAS) approach
  □  Driven by the high-performance computing community
  □  Modern approach for large-scale NUMA
  □  Explicit notion of a memory partition per processor
    ◊  Data is designated as local (near) or global (possibly far)
    ◊  Programmer is aware of NUMA nodes
  □  Performance optimization for deep memory hierarchies

Page 53:

Parallel Programming for Accelerators

■  OpenCL exposes CPUs, GPUs, and other accelerators as "devices"
■  Each "device" contains one or more "compute units", i.e. cores, SMs, ...
■  Each "compute unit" contains one or more SIMD "processing elements"

[4]

Page 54:

The BIG idea behind OpenCL

OpenCL execution model: execute a kernel at each point in a problem domain.

E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.
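As a sketch in OpenCL C (the kernel name and image layout are illustrative): each work-item of the 2D domain handles exactly one pixel:

/* One kernel instance per pixel of a width-pixel-wide 8-bit image. */
__kernel void invert(__global uchar *img, int width) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    img[y * width + x] = 255 - img[y * width + x];
}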

Page 55:

Message Passing

■  Programming paradigm targeting shared-nothing infrastructures
  □  Implementations for shared memory are available, but typically not the best-possible approach
■  Multiple instances of the same application run on a set of nodes (SPMD)

[Figure: a submission host starts instances 0-3 of the same program on the execution hosts]

Page 56:

Single Program Multiple Data (SPMD)

■  Sequential program and data distribution
■  Sequential node program with message passing
■  Identical copies (P0, P1, P2, P3) with different process identifications

Page 57:

Actor Model

■  Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973.
  □  Another mathematical model for concurrent computation
  □  No global system state concept (relationship to physics)
  □  Actor as computation primitive
    ◊  Makes local decisions
    ◊  Concurrently creates more actors
    ◊  Concurrently sends / receives messages
  □  Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
  □  Recipient is identified by mailing address
■  "Everything is an actor"

Page 58:

Actor Model

■  Interaction with asynchronous, unordered, distributed messaging
■  Fundamental aspects
  □  Emphasis on local state, time and name space
  □  No central entity
  □  Actor A gets to know actor B only by direct creation, or by name transmission from another actor C
■  Computation
  □  Not a global state sequence, but a partially ordered set of events
    ◊  Event: receipt of a message by a target actor
    ◊  Each event is a transition from one local state to another
    ◊  Events may happen in parallel
■  Messaging reliability declared as an orthogonal aspect

Page 59:

Message Passing Interface (MPI)

■  MPI_GATHER(IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm)
  □  Each process sends its buffer to the root process, including root
  □  Incoming messages are stored in rank order
  □  The receive buffer is ignored for all non-root processes
  □  MPI_GATHERV allows a varying count of data to be received
  □  Returns as soon as the local buffer is re-usable (completion at the other processes is not promised)
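A minimal usage sketch in C (assuming an MPI implementation): every rank contributes one integer, and the root receives them in rank order:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;   /* each process' contribution */
    int recv[64];             /* assumes size <= 64 */

    MPI_Gather(&mine, 1, MPI_INT, recv, 1, MPI_INT,
               0 /* root */, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("rank %d sent %d\n", i, recv[i]);

    MPI_Finalize();
    return 0;
}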

Page 60:

The Parallel Programming Problem

[Figure: does the (flexible) parallel application match the type and configuration of the execution environment?]

Page 61:

Execution Environment Mapping

[Figure: execution environments ranging from Single Instruction, Multiple Data (SIMD) to Multiple Instruction, Multiple Data (MIMD)]

Page 62:

Patterns for Parallel Programming [Mattson]

■  Finding Concurrency design space
  □  Task / data decomposition, task grouping and ordering due to data-flow dependencies, design evaluation
■  Algorithm Structure design space
  □  Task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination
  □  Mapping of concurrent design elements to execution units
■  Supporting Structures design space
  □  SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array
  □  Program structures and data structures used for code creation
■  Implementation Mechanisms design space

Page 63:

Designing Parallel Algorithms [Foster]

■  Map a workload problem onto an execution environment
  □  Concurrency for speedup
  □  Data locality for speedup
  □  Scalability
■  The best parallel solution typically differs massively from the sequential version of an algorithm
■  Foster defines four distinct stages of a methodological approach
■  Example: parallel sum

Page 64:

Example: Parallel Reduction

■  Reduce a set of elements into one, given an operation

■  Example: Sum
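The summation example as a C sketch with OpenMP (assumed available); the reduction clause gives every thread a private partial sum and combines the partial results at the end:

#include <stdio.h>

#define N 1000000
static double data[N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];          /* each thread sums into a private copy */

    printf("sum = %f\n", sum);   /* 1000000.000000 */
    return 0;
}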

Page 65:

Designing Parallel Algorithms [Foster]

■  A) Search for concurrency and scalability
  □  Partitioning – decompose computation and data into small tasks
  □  Communication – define the necessary coordination of task execution
■  B) Search for locality and other performance-related issues
  □  Agglomeration – consider performance and implementation costs
  □  Mapping – maximize processor utilization, minimize communication
■  Might require backtracking or parallel investigation of steps

Page 66:

Partitioning

■  Expose opportunities for parallel execution – fine-grained decomposition
■  A good partition keeps computation and data together
  □  Data partitioning leads to data parallelism
  □  Computation partitioning leads to task parallelism
  □  Complementary approaches, can lead to different algorithms
  □  Reveal hidden structures of the algorithm that have potential
  □  Investigate complementary views on the problem
■  Avoid replication of either computation or data; can be revised later to reduce communication overhead
■  The step results in multiple candidate solutions

Page 67:

Partitioning - Decomposition Types

■  Domain decomposition
  □  Define small data fragments
  □  Specify the computation for them
  □  Different phases of computation on the same data are handled separately
  □  Rule of thumb: first focus on large or frequently used data structures
■  Functional decomposition
  □  Split up the computation into disjoint tasks, ignoring the data accessed for the moment
  □  With significant data overlap, domain decomposition is more appropriate

Page 68:

Partitioning Strategies [Breshears]

■  Produce at least as many tasks as there will be threads / cores
  □  But: it might be more effective to use only a fraction of the cores (granularity)
  □  The computation must pay off with respect to the overhead
■  Avoid synchronization, since it adds up as overhead to the serial execution time
■  Patterns for data decomposition
  □  By element (one-dimensional)
  □  By row, by column group, by block (multi-dimensional)
  □  Influenced by the ratio of computation and synchronization

Page 69:

Partitioning - Checklist

■  Checklist for the resulting partitioning scheme
  □  Order of magnitude more tasks than processors?
    -> Keeps flexibility for the next steps
  □  Avoidance of redundant computation and storage requirements?
    -> Scalability for large problem sizes
  □  Tasks of comparable size?
    -> Goal to allocate equal work to processors
  □  Does the number of tasks scale with the problem size?
    -> The algorithm should be able to solve larger problems with more processors
■  Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem

Page 70:

Communication Step

■  Specify the links between data consumers and data producers
■  Specify the kind and number of messages on these links
■  Domain decomposition problems might have tricky communication infrastructures, due to data dependencies
■  Communication in functional decomposition problems can easily be modeled from the data flow between the tasks
■  Categorization of communication patterns
  □  Local communication (few neighbors) vs. global communication
  □  Structured communication (e.g. tree) vs. unstructured communication
  □  Static vs. dynamic communication structure
  □  Synchronous vs. asynchronous communication

Page 71:

Communication - Hints

■  Distribute computation and communication, don't centralize the algorithm
  □  Bad example: a central manager for parallel summation
  □  Divide-and-conquer helps as a mental model to identify concurrency
■  Unstructured communication is hard to agglomerate, better avoid it
■  Checklist for the communication design
  □  Do all tasks perform the same amount of communication?
    -> Distribute or replicate communication hot spots
  □  Does each task perform only local communication?
  □  Can communication happen concurrently?
  □  Can computation happen concurrently?

Page 72:

Ghost Cells

■  Domain decomposition might lead to chunks that demand data from each other for their computation
■  Solution 1: copy the necessary portion of data ('ghost cells'), as sketched below
  □  If no synchronization is needed after an update
  □  Data amount and frequency of updates influence the resulting overhead and efficiency
  □  Additional memory consumption
■  Solution 2: access the relevant data 'remotely'
  □  Delays thread coordination until the data is really needed
  □  Correctness ("old" data vs. "new" data) must be considered on parallel progress
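A ghost-cell exchange for a 1D domain decomposition, sketched in C with MPI (assumed available); each rank keeps one ghost cell per side and swaps boundary values with its neighbors before computing:

#include <mpi.h>

/* u holds n interior cells plus ghost cells u[0] and u[n+1]. */
void exchange_ghost_cells(double *u, int n, int rank, int size) {
    MPI_Comm comm = MPI_COMM_WORLD;
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my left border, receive my right ghost cell */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my right border, receive my left ghost cell */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);
}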

Page 73:

Agglomeration Step

■  The algorithm so far is correct, but not specialized for some execution environment
■  Check the partitioning and communication decisions again
  □  Agglomerate tasks for efficient execution on some machine
  □  Replicate data and / or computation for efficiency reasons
■  The resulting number of tasks can still be greater than the number of processors
■  Three conflicting guiding decisions
  □  Reduce communication costs by coarser granularity of computation and communication
  □  Preserve flexibility with respect to later mapping decisions
  □  Reduce software engineering costs (serial -> parallel version)

Page 74:

Agglomeration [Foster]

Page 75:

Agglomeration – Granularity vs. Flexibility

■  Reduce communication costs by coarser granularity
  □  Sending less data
  □  Sending fewer messages (per-message initialization costs)
  □  Agglomerate, especially if tasks cannot run concurrently
    ◊  Also reduces task creation costs
  □  Replicate computation to avoid communication (also helps with reliability)
■  Preserve flexibility
  □  A flexibly large number of tasks is still a prerequisite for scalability
■  Define granularity as a compile-time or run-time parameter

Page 76:

Agglomeration - Checklist

■  Are communication costs reduced by increasing locality?
■  Does replicated computation outweigh its costs in all cases?
■  Does data replication restrict the range of problem sizes / processor counts?
■  Do the larger tasks still have similar computation / communication costs?
■  Do the larger tasks still act with sufficient concurrency?
■  Does the number of tasks still scale with the problem size?
■  How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
■  Is the transition to parallel code worth the engineering costs?

Page 77:

Mapping Step

■  Only relevant for shared-nothing systems, since shared-memory systems typically perform automatic task scheduling
■  Minimize execution time by
  □  Placing concurrent tasks on different nodes
  □  Placing tasks with heavy communication on the same node
■  Conflicting strategies, additionally restricted by resource limits
  □  In general, an NP-complete bin-packing problem
■  Set of sophisticated (dynamic) heuristics for load balancing
  □  Preference for local algorithms that do not need global scheduling state

Page 78:

Surface-To-Volume Effect [Foster, Breshears]

■  Visualize the data to be processed (in parallel) as a sliced 3D cube
■  Synchronization requirements of a task
  □  Proportional to the surface of the data slice it operates upon
  □  Visualized by the amount of 'borders' of the slice
■  Computation work of a task
  □  Proportional to the volume of the data slice it operates upon
  □  Represents the granularity of decomposition
■  Ratio of synchronization and computation
  □  High synchronization, low computation, high ratio → bad
  □  Low synchronization, high computation, low ratio → good
  □  The ratio decreases with increasing data size per task
■  Coarse granularity by agglomerating tasks in all dimensions
  □  For a given volume, the surface then goes down → good

Page 79:

Surface-To-Volume Effect [Foster, Breshears]

[Figure: surface-to-volume effect; (C) nicerweb.com]

Page 80:

Surface-to-Volume Effect [Foster]

■  Computation on an 8x8 grid
■  (a): 64 tasks, one point each
  □  64x4 = 256 synchronizations
  □  256 data values are transferred
■  (b): 4 tasks, 16 points each
  □  4x4 = 16 synchronizations
  □  16x4 = 64 data values are transferred

Page 81:

Designing Parallel Algorithms [Breshears]

■  A parallel solution must keep the sequential consistency property
■  "Mentally simulate" the execution of the parallel streams
  □  Check critical parts of the parallelized sequential application
■  Amount of computation per parallel task
  □  Overhead is always introduced by moving from serial to parallel code
  □  The speedup must offset the parallelization overhead (Amdahl)
■  Granularity: amount of parallel computation done before synchronization is needed
  □  Fine-grained granularity overhead vs. coarse-grained granularity concurrency
    ◊  Iterative approach of finding the right granularity
    ◊  A decision might be correct only for a chosen execution environment

Page 82:

OK ?!?

Page 83:

Certificate 'for free'
