OpenHPI - Parallel Programming Concepts - Week 1


DESCRIPTION

Week 1 in the OpenHPI course on parallel programming concepts is about hardware and software trends that lead to the rise of parallel programming for ordinary developers. Find the whole course at http://bit.ly/1l3uD4h.

TRANSCRIPT

Page 1: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.1: Welcome!

Dr. Peter Tröger + Teaching Team

Page 2: OpenHPI - Parallel Programming Concepts - Week 1

Course Content

■ Overview of theoretical and practical concepts
■ This course is for you if …
  □ … you have skills in software development, regardless of the programming language.
  □ … you want to get an overview of parallelization concepts.
  □ … you want to assess the feasibility of parallel hardware, software and libraries for your parallelization problem.
■ This course is not for you if …
  □ … you have no practical experience with software development at all.
  □ … you want a solution for a specific parallelization problem.
  □ … you want to learn one specific parallel programming tool or language in detail.


Page 3: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts


Page 4: OpenHPI - Parallel Programming Concepts - Week 1

Course Organization

■ Six lecture weeks, final exam in week 7
■ Several lecture units per week, per unit:
  □ Video, slides, non-graded self-test
  □ Sometimes mandatory and optional readings
  □ Sometimes optional programming tasks
  □ Week finished with a graded assignment
■ Six graded assignments sum up to max. 90 points
■ Graded final exam with max. 90 points
■ OpenHPI certificate awarded for getting ≥90 points in total
■ Forum can be used to discuss with other participants
■ FAQ is constantly updated


Page 5: OpenHPI - Parallel Programming Concepts - Week 1

Course Organization

■ Week 1: Terminology and fundamental concepts
  □ Moore’s law, power wall, memory wall, ILP wall, speedup vs. scaleup, Amdahl’s law, Flynn’s taxonomy, …
■ Week 2: Shared memory parallelism – The basics
  □ Concurrency, race condition, semaphore, mutex, deadlock, monitor, …
■ Week 3: Shared memory parallelism – Programming
  □ Threads, OpenMP, Intel TBB, Cilk, Scala, …
■ Week 4: Accelerators
  □ Hardware today, CUDA, GPU Computing, OpenCL, …
■ Week 5: Distributed memory parallelism
  □ CSP, Actor model, clusters, HPC, MPI, MapReduce, …
■ Week 6: Patterns, best practices and examples


Page 6: OpenHPI - Parallel Programming Concepts - Week 1

Why Parallel?


Page 7: OpenHPI - Parallel Programming Concepts - Week 1

Computer Markets

■ Embedded and Mobile Computing
  □ Cars, smartphones, entertainment industry, medical devices, …
  □ Power/performance and price as relevant issues
■ Desktop Computing
  □ Price/performance ratio and extensibility as relevant issues
■ Server Computing
  □ Business service provisioning as typical goal
  □ Web servers, banking back-end, order processing, ...
  □ Performance and availability as relevant issues
■ Most software benefits from having better performance
■ The computer hardware industry is constantly delivering it


Page 8: OpenHPI - Parallel Programming Concepts - Week 1

Running Applications

[Diagram: an application consisting of a stream of instructions]


Page 9: OpenHPI - Parallel Programming Concepts - Week 1

Three Ways Of Doing Anything Faster [Pfister]

■ Work harder (clock speed)
  □ Hardware solution
  □ No longer feasible
■ Work smarter (optimization, caching)
  □ Hardware solution
  □ No longer feasible as the only solution
■ Get help (parallelization)
  □ Hardware + software in cooperation

[Diagram: an application’s instructions executed over time t]


Page 10: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.2: Moore’s Law and the Power Wall

Dr. Peter Tröger + Teaching Team

Page 11: OpenHPI - Parallel Programming Concepts - Week 1

Processor Hardware

■ First computers had fixed programs (e.g. electronic calculator)
■ Von Neumann architecture (1945)
  □ Instructions for central processing unit (CPU) in memory
  □ Program is treated as data
  □ Loading of code during runtime, self-modification
■ Multiple such processors: Symmetric multiprocessing (SMP)

[Diagram: Von Neumann architecture – a CPU with control unit and arithmetic logic unit, memory, input and output, connected by a bus]


Page 12: OpenHPI - Parallel Programming Concepts - Week 1

Moore’s Law

■ “...the number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years. ...” (Gordon Moore, 1965)
  □ CPUs contain different hardware parts, such as logic gates
  □ Parts are built from transistors
  □ Rule of exponential growth for the number of transistors on one CPU chip
  □ Meanwhile a self-fulfilling prophecy
  □ Applied not only in processor industry, but also in other areas
  □ Sometimes misinterpreted as performance indication
  □ May still hold for the next 10-20 years

[Wikipedia]

Page 13: OpenHPI - Parallel Programming Concepts - Week 1

Moore’s Law

[Chart: Moore’s Law – transistor counts over time. Source: Wikimedia]

Page 14: OpenHPI - Parallel Programming Concepts - Week 1

Moore’s Law vs. Software

■ Nathan P. Myhrvold, “The Next Fifty Years of Software”, 1997
  □ “Software is a gas. It expands to fit the container it is in.”
    ◊ Constant increase in the amount of code
  □ “Software grows until it becomes limited by Moore’s law.”
    ◊ Software often grows faster than hardware capabilities
  □ “Software growth makes Moore’s Law possible.”
    ◊ Software and hardware market stimulate each other
  □ “Software is only limited by human ambition & expectation.”
    ◊ People will always find ways for exploiting performance
■ Jevons paradox:
  □ “Technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.”


Page 15: OpenHPI - Parallel Programming Concepts - Week 1

Processor Performance Development

[Chart: transistor count, clock speed (MHz), power (W) and performance per clock (ILP) over time, with the “work harder” and “work smarter” phases marked. Source: Herb Sutter, 2009]


Page 16: OpenHPI - Parallel Programming Concepts - Week 1

A Physics Problem

■ Power: Energy needed to run the processor
■ Static power (SP): Leakage in transistors while being inactive
■ Dynamic power (DP): Energy needed to switch a transistor
  □ DP (approx.) = Number of Transistors (N) × Capacitance (C) × Voltage² (V²) × Frequency (F) (see the sketch below)
■ Moore’s law: N goes up exponentially, C goes down with size
■ Power dissipation demands cooling
  □ Power density: Watt/cm²
■ Make dynamic power increase less dramatic:
  □ Bringing down V reduces energy consumption, quadratically!
  □ Don’t use N only for logic gates
■ Industry was able to increase the frequency (F) for decades
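As a rough illustration of the relation above, the following C sketch evaluates DP for a few supply voltages and shows the quadratic pay-off of lowering V. All numeric values are made up for the example and are not taken from the course.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical values, chosen only to make the quadratic V effect visible. */
    const double N = 1e9;     /* number of transistors                         */
    const double C = 1e-17;   /* effective capacitance per transistor (farad)  */
    const double F = 3e9;     /* clock frequency (Hz)                          */
    const double voltages[] = {1.2, 1.0, 0.8};

    for (int i = 0; i < 3; i++) {
        double V  = voltages[i];
        double DP = N * C * V * V * F;   /* DP (approx.) = N * C * V^2 * F */
        printf("V = %.1f V  ->  dynamic power approx. %.1f W\n", V, DP);
    }
    return 0;
}
```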


Page 17: OpenHPI - Parallel Programming Concepts - Week 1

Processor Supply Voltage

[Chart: processor supply voltage (Volt, log scale) from 1970 to 2010. Source: Moore, ISSCC]


Page 18: OpenHPI - Parallel Programming Concepts - Week 1

Power Density

■ Growth of watts per square centimeter in microprocessors
■ Higher temperatures: Increased leakage, slower transistors

[Chart: microprocessor power density, 1992–2005, relative to the “Hot Plate” level and the air cooling limit]

Page 19: OpenHPI - Parallel Programming Concepts - Week 1

Power Density

“Cooking-Aware” Computing?
[Kevin Skadron, 2007]

Page 20: OpenHPI - Parallel Programming Concepts - Week 1

Second Problem: Leakage Increase

[Chart: processor power (W, log scale), 1960–2010, split into active and leakage power. Source: www.ieeeghn.org]

■  Static leakage today: Up to 40% of CPU power consumption


Page 21: OpenHPI - Parallel Programming Concepts - Week 1

The Power Wall

■ Air cooling capabilities are limited
  □ Maximum temperature of 100-125 °C, hot spot problem
  □ Static and dynamic power consumption must be limited
■ Power consumption increases with Moore’s law, but growth of hardware performance is still expected
■ Further reducing voltage as compensation
  □ We can’t do that endlessly, lower limit around 0.7 V
  □ Strange physical effects
■ Next-generation processors need to use even less power
  □ Lower the frequencies, scale them dynamically
  □ Use only parts of the processor at a time ('dark silicon')
  □ Build energy-efficient special purpose hardware
■ No chance for faster processors through frequency increase


Page 22: OpenHPI - Parallel Programming Concepts - Week 1

The Free Lunch Is Over

■ Clock speed curve flattened in 2003
  □ Heat, power, leakage
■ Speeding up the serial instruction execution through clock speed improvements no longer works
■ Additional issues
  □ ILP wall
  □ Memory wall

[Herb Sutter, 2009]


Page 23: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.3: ILP Wall and Memory Wall

Dr. Peter Tröger + Teaching Team

Page 24: OpenHPI - Parallel Programming Concepts - Week 1

Three Ways Of Doing Anything Faster [Pfister]

■ Work harder (clock speed)
  □ Hardware solution
  ! Power wall problem
■ Work smarter (optimization, caching)
  □ Hardware solution
■ Get help (parallelization)
  □ Hardware + Software

[Diagram: an application’s instructions executed over time]


Page 25: OpenHPI - Parallel Programming Concepts - Week 1

Instruction Level Parallelism

■ Increasing the frequency is no longer an option
■ Provide smarter instruction processing for better performance
■ Instruction level parallelism (ILP)
  □ Processor hardware optimizes low-level instruction execution
  □ Instruction pipelining
    ◊ Overlapped execution of serial instructions
  □ Superscalar execution
    ◊ Multiple units of one processor are used in parallel
  □ Out-of-order execution
    ◊ Reorder instructions that do not have data dependencies
  □ Speculative execution
    ◊ Control flow speculation and branch prediction
■ Today’s processors are packed with such ILP logic (see the sketch below)
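The following sketch is not part of the course material; it only hints at why data dependencies matter for ILP, assuming an out-of-order CPU and a compiler that does not reassociate floating-point additions (e.g. cc -O2 without -ffast-math). One accumulator forms a single dependency chain; two accumulators give the pipelined, superscalar hardware independent work to overlap, so the second loop usually finishes noticeably faster.

```c
#include <stdio.h>
#include <time.h>

#define N (1 << 16)          /* 64 Ki doubles, small enough to stay cached */
#define REPEAT 20000
static double data[N];

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    clock_t t0 = clock();
    double sum = 0.0;
    for (int r = 0; r < REPEAT; r++)
        for (int i = 0; i < N; i++)
            sum += data[i];                   /* one long dependency chain */
    clock_t t1 = clock();

    double s0 = 0.0, s1 = 0.0;
    for (int r = 0; r < REPEAT; r++)
        for (int i = 0; i < N; i += 2) {
            s0 += data[i];                    /* two independent chains that */
            s1 += data[i + 1];                /* the hardware can overlap    */
        }
    clock_t t2 = clock();

    printf("one accumulator:  %.2f s (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
    printf("two accumulators: %.2f s (sum %.0f)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, s0 + s1);
    return 0;
}
```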


Page 26: OpenHPI - Parallel Programming Concepts - Week 1

The ILP Wall

■  No longer cost-effective to dedicate new transistors to ILP mechanisms

■  Deeper pipelines make the power problem worse

■  High ILP complexity effectively reduces the processing speed for a given frequency (e.g. misprediction)

■  More aggressive ILP technologies too risky due to unknown real-world workloads

■ No ground-breaking new ideas → “ILP wall”
■ Ok, let’s use the transistors for better caching

[Wikipedia]


Page 27: OpenHPI - Parallel Programming Concepts - Week 1

Caching

■ von Neumann architecture
  □ Instructions are stored in main memory
  □ Program is treated as data
  □ For each instruction execution, data must be fetched
■ When the frequency increases, main memory becomes a performance bottleneck
■ Caching: Keep data copy in very fast, small memory on the CPU

[Diagram: Von Neumann architecture with a cache added between the CPU and main memory]


Page 28: OpenHPI - Parallel Programming Concepts - Week 1

Memory Hardware Hierarchy

[Diagram: the memory hierarchy, from fast / small / expensive at the top to slow / large / cheap at the bottom]
  ■ Registers (volatile)
  ■ Processor Caches (volatile)
  ■ Random Access Memory (RAM) (volatile)
  ■ Flash / SSD Memory (non-volatile)
  ■ Hard Drives (non-volatile)
  ■ Tapes (non-volatile)

Page 29: OpenHPI - Parallel Programming Concepts - Week 1

Memory Hardware Hierarchy

[Diagram: four CPU cores, each with a private L1 cache, pairs of cores sharing an L2 cache, all cores sharing an L3 cache attached to the bus; L = Level]

Page 30: OpenHPI - Parallel Programming Concepts - Week 1

Caching for Performance

■ Well established optimization technique for performance
■ Caching relies on data locality (see the sketch below)
  □ Some instructions are often used (e.g. loops)
  □ Some data is often used (e.g. local variables)
  □ Hardware keeps a copy of the data in the faster cache
  □ On read attempts, data is taken directly from the cache
  □ On write, data is cached and eventually written to memory
■ Similar to ILP, the potential is limited
  □ Larger caches do not help automatically
  □ At some point, all data locality in the code is already exploited
  □ Manual vs. compiler-driven optimization

[arstechnica.com]
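A minimal sketch of data locality (not from the slides; matrix size and timings are arbitrary and machine-dependent): the same matrix is summed twice, once row by row (sequential accesses, cache friendly) and once column by column (large strides, frequent cache misses).

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N][N];            /* ~128 MiB, larger than typical caches */

int main(void) {
    double sum = 0.0;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];       /* row-major traversal: good locality   */
    clock_t t1 = clock();

    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];       /* column traversal: poor locality      */
    clock_t t2 = clock();

    printf("row-wise:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-wise: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("checksum: %f\n", sum); /* keeps the loops from being removed  */
    return 0;
}
```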


Page 31: OpenHPI - Parallel Programming Concepts - Week 1

Memory Wall

■ If caching is limited, we simply need faster memory
■ The problem: Shared memory is ‘shared’
  □ Interconnect contention
  □ Memory bandwidth
    ◊ Memory transfer speed is limited by the power wall
    ◊ Memory transfer size is limited by the power wall
■ Transfer technology cannot keep up with GHz processors
■ Memory is too slow, effects cannot be hidden through caching completely → “Memory wall”

[dell.com]


Page 32: OpenHPI - Parallel Programming Concepts - Week 1

Problem Summary

■ Hardware perspective
  □ Number of transistors N is still increasing
  □ Building larger caches no longer helps (memory wall)
  □ ILP is out of options (ILP wall)
  □ Voltage / power / frequency is at the limit (power wall)
    ◊ Some help with dynamic scaling approaches
  □ Remaining option: Use N for more cores per processor chip
■ Software perspective
  □ Performance must come from the utilization of this increasing core count per chip, since F is now fixed
  □ Software must tackle the memory wall


Page 33: OpenHPI - Parallel Programming Concepts - Week 1

Three Ways Of Doing Anything Faster [Pfister]

■ Work harder (clock speed)
  ! Power wall problem
  ! Memory wall problem
■ Work smarter (optimization, caching)
  ! ILP wall problem
  ! Memory wall problem
■ Get help (parallelization)
  □ More cores per single CPU
  □ Software needs to exploit them in the right way
  ! Memory wall problem

[Diagram: a problem mapped onto one CPU with multiple cores]


Page 34: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.4: Parallel Hardware Classification

Dr. Peter Tröger + Teaching Team

Page 35: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism [Mattson et al.]

■ Task
  □ Parallel program breaks a problem into tasks
■ Execution unit
  □ Representation of a concurrently running task (e.g. thread)
  □ Tasks are mapped to execution units
■ Processing element (PE)
  □ Hardware element running one execution unit
  □ Depends on scenario - logical processor vs. core vs. machine
  □ Execution units run simultaneously on processing elements, controlled by some scheduler
■ Synchronization - Mechanism to order activities of parallel tasks
■ Race condition - Program result depends on the scheduling order (see the sketch below)
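A minimal POSIX-threads sketch of a race condition (an illustration only, not course material; compile with cc -pthread, ideally without optimization so the lost updates stay visible): two execution units increment a shared counter without synchronization, so the final value depends on the scheduling order.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;              /* shared state, no synchronization */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                    /* read-modify-write, not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```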


Page 36: OpenHPI - Parallel Programming Concepts - Week 1

Faster Processing through Parallelization

[Diagram: a program is decomposed into tasks that can be processed in parallel]


Page 37: OpenHPI - Parallel Programming Concepts - Week 1


Flynn’s Taxonomy (1966)

■  Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension

[Diagram: the four classes as combinations of instruction and data streams per processing step – Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD)]

Page 38: OpenHPI - Parallel Programming Concepts - Week 1

Flynn’s Taxonomy (1966)

■ Single Instruction, Single Data (SISD)
  □ No parallelism in the execution
  □ Old single processor architectures
■ Single Instruction, Multiple Data (SIMD) (see the sketch below)
  □ Multiple data streams processed with one instruction stream at the same time
  □ Typical in graphics hardware and GPU accelerators
  □ Special SIMD machines in high-performance computing
■ Multiple Instructions, Single Data (MISD)
  □ Multiple instructions applied to the same data in parallel
  □ Rarely used in practice, only for fault tolerance
■ Multiple Instructions, Multiple Data (MIMD)
  □ Every modern processor, compute clusters
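As a small illustration of the SIMD idea referenced above (a sketch using the GCC/Clang vector extension, not an official course example): conceptually, one vector addition is applied to eight data elements in a single processing step.

```c
#include <stdio.h>

typedef float v8f __attribute__((vector_size(32)));   /* 8 x 32-bit float */

int main(void) {
    v8f a = {1, 2, 3, 4, 5, 6, 7, 8};
    v8f b = {10, 10, 10, 10, 10, 10, 10, 10};
    v8f c = a + b;           /* single (vector) operation, multiple data elements */

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```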


Page 39: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism on Different Levels

[Diagram: programs consist of processes and tasks; tasks are mapped to processing elements (PEs), PEs within a node share memory, and nodes are connected by a network]


Page 40: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism on Different Levels

■ A processor chip (socket)
  □ Chip multi-processing (CMP)
    ◊ Multiple CPUs per chip, called cores
    ◊ Multi-core / many-core
  □ Simultaneous multi-threading (SMT)
    ◊ Interleaved execution of tasks on one core
    ◊ Example: Intel Hyperthreading (see the sketch below)
  □ Chip multi-threading (CMT) = CMP + SMT
  □ Instruction-level parallelism (ILP)
    ◊ Parallel processing of single instructions per core
■ Multiple processor chips in one machine (multi-processing)
  □ Symmetric multi-processing (SMP)
■ Multiple processor chips in many machines (multi-computer)
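A tiny POSIX sketch (assuming a POSIX system; not part of the course) that shows how many logical processing elements the operating system exposes; with SMT enabled this counts hardware threads, not physical cores or sockets.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical PEs currently online: cores x SMT threads per core (x sockets). */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processing elements: %ld\n", n);
    return 0;
}
```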


Page 41: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism on Different Levels

[Diagram: a CMP architecture with several cores per chip, each core providing ILP and SMT. Source: arstechnica.com]


Page 42: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.5: Memory Architectures

Dr. Peter Tröger + Teaching Team

Page 43: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism on Different Levels

[Diagram: programs consist of processes and tasks; tasks are mapped to processing elements (PEs), PEs within a node share memory, and nodes are connected by a network]


Page 44: OpenHPI - Parallel Programming Concepts - Week 1

Shared Memory vs. Shared Nothing

■ Organization of parallel processing hardware as …
  □ Shared memory system
    ◊ Tasks can directly access a common address space
    ◊ Implemented as memory hierarchy with different cache levels
  □ Shared nothing system
    ◊ Tasks can only access local memory
    ◊ Global coordination of parallel execution by explicit communication (e.g. messaging) between tasks
  □ Hybrid architectures possible in practice
    ◊ Cluster of shared memory systems
    ◊ Accelerator hardware in a shared memory system
      ● Dedicated local memory on the accelerator
      ● Example: SIMD GPU hardware in SMP computer system


Page 45: OpenHPI - Parallel Programming Concepts - Week 1

Shared Memory vs. Shared Nothing

■ Pfister: “shared memory” vs. “distributed memory”
■ Foster: “multiprocessor” vs. “multicomputer”
■ Tanenbaum: “shared memory” vs. “private memory”

[Diagram: shared memory – tasks on processing elements access one shared memory; shared nothing – each processing element’s task works on its own data, coordination happens by exchanging messages]


Page 46: OpenHPI - Parallel Programming Concepts - Week 1

Shared Memory

■ Processing elements act independently
■ Use the same global address space
■ Changes are visible to all processing elements
■ Uniform memory access (UMA) system
  □ Equal access time for all PEs to all memory locations
  □ Default approach for SMP systems of the past
■ Non-uniform memory access (NUMA) system
  □ Delay on memory access according to the accessed region
  □ Typically due to core / processor interconnect technology
■ Cache-coherent NUMA (CC-NUMA) system
  □ NUMA system that keeps all caches consistent
  □ Transparent hardware mechanisms
  □ Became the standard approach with recent x86 chips


Page 47: OpenHPI - Parallel Programming Concepts - Week 1

UMA Example

■ Two dual-core processor chips in an SMP system
■ Level 1 cache (fast, small), Level 2 cache (slower, larger)
■ Hardware manages cache coherency among all cores

[Diagram: two sockets, each with two cores and per-core L1 caches over a shared L2 cache, connected via the system bus to a chipset / memory controller and the RAM modules]

Page 48: OpenHPI - Parallel Programming Concepts - Week 1

NUMA Example

■ Eight cores on 2 sockets in an SMP system
■ Memory controllers + chip interconnect realize a single memory address space for the software

[Diagram: two sockets, each with four cores (per-core L1 and L2 caches, shared L3 cache) and a local memory controller with attached RAM, coupled by a chip interconnect]

Page 49: OpenHPI - Parallel Programming Concepts - Week 1

NUMA Example: 4-way Intel Nehalem SMP

[Diagram: four Nehalem processors, each with four cores, an L3 cache and an integrated memory controller with local memory, fully connected by QPI links and attached to I/O]

Page 50: OpenHPI - Parallel Programming Concepts - Week 1

Shared Nothing

■ Processing elements no longer share a common global memory
■ Easy scale-out by adding machines to the messaging network
■ Cluster computing: Combine machines with cheap interconnect
  □ Compute cluster: Speedup for an application
    ◊ Batch processing, data parallelism
  □ Load-balancing cluster: Better throughput for some service
  □ High Availability (HA) cluster: Fault tolerance
■ Cluster to the extreme
  □ High Performance Computing (HPC)
  □ Massively Parallel Processing (MPP) hardware
  □ TOP500 list of the fastest supercomputers


Page 51: OpenHPI - Parallel Programming Concepts - Week 1

Clusters

[Diagram: processing elements run tasks on their own local data and coordinate only by exchanging messages]

Page 52: OpenHPI - Parallel Programming Concepts - Week 1

Shared Nothing Example

[Diagram: three machines, each with one socket (two cores with L1/L2 caches and a shared L3 cache), local RAM behind a memory controller, and a network interface; the machines communicate over the interconnection network]

Page 53: OpenHPI - Parallel Programming Concepts - Week 1

Hybrid Example

[Diagram: two machines, each with two sockets coupled by a chip interconnect (per-socket cores, caches, memory controller and local RAM) and a network interface; the machines communicate over the interconnection network]

Page 54: OpenHPI - Parallel Programming Concepts - Week 1

Example: Cluster of Nehalem SMPs

[Diagram: several Nehalem SMP machines connected by a network]

Page 55: OpenHPI - Parallel Programming Concepts - Week 1

The Parallel Programming Problem

■ The execution environment has a particular type (SIMD, MIMD, UMA, NUMA, …)
■ The execution environment may be configurable (number of resources)
■ The parallel application must be mapped to the available resources

[Diagram: does the parallel application match the execution environment’s type and (flexible) configuration?]


Page 56: OpenHPI - Parallel Programming Concepts - Week 1

Parallel Programming Concepts
OpenHPI Course, Week 1: Terminology and fundamental concepts
Unit 1.6: Speedup and Scaleup

Dr. Peter Tröger + Teaching Team

Page 57: OpenHPI - Parallel Programming Concepts - Week 1

Which One Is Faster?

■ Usage scenario
  □ Transporting a fridge
■ Usage environment
  □ Driving through a forest
■ Perception of performance
  □ Maximum speed
  □ Average speed
  □ Acceleration
■ We need some kind of application-specific benchmark


Page 58: OpenHPI - Parallel Programming Concepts - Week 1

Parallelism for …

■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ …

[Diagram: scaling up – adding processing elements (A1, A2, A3) to one machine and its main memory; scaling out – adding further machines with processing elements (B1, B2, B3) and their own main memory]


Page 59: OpenHPI - Parallel Programming Concepts - Week 1

Metrics

■ Parallelization metrics are application-dependent, but follow a common set of concepts
  □ Speedup: Adding more resources leads to less time for solving the same problem.
  □ Linear speedup: n times more resources → n times speedup
  □ Scaleup: Adding more resources solves a larger version of the same problem in the same time.
  □ Linear scaleup: n times more resources → n times larger problem solvable
■ The most important goal depends on the application
  □ Throughput demands scalability of the software
  □ Response time demands speedup of the processing


Page 60: OpenHPI - Parallel Programming Concepts - Week 1

Speedup

■ Idealized assumptions
  □ All tasks are equal sized
  □ All code parts can run in parallel

[Diagram: an application split into v = 12 equal tasks, executed serially and on three processing elements]
■ Tasks: v = 12, processing elements: N = 1, time needed: T1 = 12
■ Tasks: v = 12, processing elements: N = 3, time needed: T3 = 4
■ (Linear) speedup: T1/T3 = 12/4 = 3


Page 61: OpenHPI - Parallel Programming Concepts - Week 1

Speedup with Load Imbalance

■ Assumptions
  □ Tasks have different sizes; the best-possible speedup depends on optimized resource usage
  □ All code parts can run in parallel

[Diagram: an application split into v = 12 differently sized tasks, executed serially and on three processing elements]
■ Tasks: v = 12, processing elements: N = 1, time needed: T1 = 16
■ Tasks: v = 12, processing elements: N = 3, time needed: T3 = 6
■ Speedup: T1/T3 = 16/6 = 2.67
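A minimal sketch of this effect (the individual task sizes below are made up so that they add up to the slide’s T1 = 16; the greedy assignment is just one simple scheduling strategy, not necessarily the one used on the slide): each task goes to the currently least-loaded processing element, and the parallel time is the largest per-element load.

```c
#include <stdio.h>

/* Hypothetical task sizes in time units; their sum is T1 = 16. */
static const int task[] = {2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1};
#define V (sizeof task / sizeof task[0])
#define NPE 3                                 /* processing elements */

int main(void) {
    int load[NPE] = {0};
    int t1 = 0;

    for (unsigned i = 0; i < V; i++) {
        t1 += task[i];
        int least = 0;                        /* greedy: pick least-loaded PE */
        for (int p = 1; p < NPE; p++)
            if (load[p] < load[least]) least = p;
        load[least] += task[i];
    }

    int tn = 0;                               /* parallel time = largest load */
    for (int p = 0; p < NPE; p++)
        if (load[p] > tn) tn = load[p];

    printf("T1 = %d, T%d = %d, speedup = %.2f\n", t1, NPE, tn, (double)t1 / tn);
    return 0;
}
```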


Page 62: OpenHPI - Parallel Programming Concepts - Week 1

Speedup with Serial Parts

■ Each application has inherently non-parallelizable serial parts
  □ Algorithmic limitations
  □ Shared resources acting as bottleneck
  □ Overhead for program start
  □ Communication overhead in shared-nothing systems

[Diagram: execution of the 12 tasks alternates between serial phases (tSER1, tSER2, tSER3) and parallel phases (tPAR1, tPAR2)]


Page 63: OpenHPI - Parallel Programming Concepts - Week 1

Amdahl’s Law

■ Gene Amdahl. “Validity of the single processor approach to achieving large scale computing capabilities”. AFIPS 1967
  □ Serial parts: TSER = tSER1 + tSER2 + tSER3 + …
  □ Parallelizable parts: TPAR = tPAR1 + tPAR2 + tPAR3 + …
  □ Execution time with one processing element: T1 = TSER + TPAR
  □ Execution time with N parallel processing elements: TN >= TSER + TPAR / N
    ◊ Equal only on perfect parallelization, e.g. no load imbalance
  □ Amdahl’s Law for maximum speedup with N processing elements:
    S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
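A small sketch of the formula in use (the 10% serial fraction is a hypothetical value, with T1 = 1): even with arbitrarily many processing elements the speedup stays below 1 / TSER = 10.

```c
#include <stdio.h>

int main(void) {
    const double t_ser = 0.1, t_par = 0.9;   /* hypothetical serial/parallel parts, T1 = 1 */
    const int n_values[] = {1, 2, 4, 8, 16, 64, 1024};

    for (unsigned i = 0; i < sizeof n_values / sizeof n_values[0]; i++) {
        int n = n_values[i];
        double s = (t_ser + t_par) / (t_ser + t_par / n);   /* Amdahl's law */
        printf("N = %4d  ->  maximum speedup %.2f\n", n, s);
    }
    return 0;
}
```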

Page 64: OpenHPI - Parallel Programming Concepts - Week 1

Amdahl’s Law


Page 65: OpenHPI - Parallel Programming Concepts - Week 1

Amdahl’s Law

■ Speedup through parallelism is hard to achieve
■ For unlimited resources, speedup is bound by the serial parts:
  □ Assume T1 = 1, then S(N→∞) = T1 / T(N→∞) = 1 / TSER
■ Parallelization problem relates to all system layers
  □ Hardware offers some degree of parallel execution
  □ Speedup gained is bound by serial parts:
    ◊ Limitations of hardware components
    ◊ Necessary serial activities in the operating system, virtual runtime system, middleware and the application
    ◊ Overhead for the parallelization itself

Page 66: OpenHPI - Parallel Programming Concepts - Week 1

Amdahl’s Law

■ “Everyone knows Amdahl’s law, but quickly forgets it.” [Thomas Puzak, IBM]
■ 90% parallelizable code leads to not more than 10x speedup
  □ Regardless of the number of processing elements
■ Parallelism is only useful …
  □ … for small number of processing elements
  □ … for highly parallelizable code
■ What’s the sense in big parallel / distributed hardware setups?
■ Relevant assumptions
  □ Put the same problem on different hardware
  □ Assumption of fixed problem size
  □ Only consideration of execution time for one problem


Page 67: OpenHPI - Parallel Programming Concepts - Week 1

Gustafson-Barsis’ Law (1988)

■ Gustafson and Barsis: People are typically not interested in the shortest execution time
  □ Rather solve a bigger problem in reasonable time
■ Problem size could then scale with the number of processors
  □ Typical in simulation and farmer / worker problems
  □ Leads to larger parallel fraction with increasing N
  □ Serial part is usually fixed or grows slower
■ Maximum scaled speedup by N processors (see the sketch below):
  S = (TSER + N · TPAR) / (TSER + TPAR)
■ Linear speedup now becomes possible
■ Software needs to ensure that serial parts remain constant
■ Other models exist (e.g. Work-Span model, Karp-Flatt metric)
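A small sketch contrasting the two views (again with a hypothetical 10% serial part): the Gustafson-Barsis scaled speedup keeps growing with N because the parallel work grows with the machine, while Amdahl’s fixed-size speedup saturates.

```c
#include <stdio.h>

int main(void) {
    const double t_ser = 0.1, t_par = 0.9;   /* hypothetical serial/parallel parts */

    for (int n = 1; n <= 1024; n *= 4) {
        double scaled = (t_ser + n * t_par) / (t_ser + t_par);  /* Gustafson-Barsis */
        double fixed  = (t_ser + t_par) / (t_ser + t_par / n);  /* Amdahl           */
        printf("N = %4d   scaled speedup %7.1f   fixed-size speedup %5.1f\n",
               n, scaled, fixed);
    }
    return 0;
}
```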

Page 68: OpenHPI - Parallel Programming Concepts - Week 1

Summary: Week 1

■ Moore’s Law and the Power Wall
  □ Processing element speed no longer increases
■ ILP Wall and Memory Wall
  □ Memory access is not fast enough for modern hardware
■ Parallel Hardware Classification
  □ From ILP to SMP, SIMD vs. MIMD
■ Memory Architectures
  □ UMA vs. NUMA
■ Speedup and Scaleup
  □ Amdahl’s Law and Gustafson’s Law

Since we need parallelism for speedup, how can we express it in software?
