
Parallel Computing Architectures

Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI), Università di Bologna
http://www.moreno.marzolla.name/

Parallel Architectures 2

Parallel Architectures 3

An Abstract Parallel Architecture

● How is parallelism handled?
● What is the exact physical location of the memories?
● What is the topology of the interconnect network?

[Figure: an abstract parallel architecture with multiple processors and multiple memories connected by an interconnect network]

Parallel Architectures 4

Why are parallel architectures important?

● There is no "typical" parallel computer: different vendors use different architectures

● There is currently no “universal” programming paradigm that fits all architectures
– Parallel programs must be tailored to the underlying parallel architecture
– The architecture of a parallel computer limits the choice of the programming paradigm that can be used

Parallel Architectures 5

Von Neumann architectureand its extensions

Parallel Architectures 6

Von Neumann architecture

[Figure: the processor (CPU), memory and I/O subsystem connected through the system bus]

Parallel Architectures 7

Von Neumann architecture

[Figure: CPU internals (ALU, general-purpose registers R0..Rn-1, the PC, PSW and IR registers, and the bus control unit) connected to memory through the data, address and control lines]

Parallel Architectures 8

The Fetch-Decode-Execute cycle

● The CPU performs an infinite loop
● Fetch
– Fetch the opcode of the next instruction from the memory address stored in the PC register, and put the opcode in the IR register
● Decode
– The content of the IR register is analyzed to identify the instruction to execute
● Execute
– The control unit activates the appropriate functional units of the CPU to perform the actions required by the instruction (e.g., read values from memory, execute arithmetic computations, and so on)
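A minimal C sketch of this cycle (not taken from the slides; the instruction set, opcode names and accumulator register are made up for illustration):

#include <stdint.h>
#include <stdio.h>

enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD = 2, OP_PRINT = 3 };

int main(void)
{
    /* program: acc = 2; acc += 3; print acc; halt */
    uint8_t memory[] = { OP_LOAD_IMM, 2, OP_ADD, 3, OP_PRINT, OP_HALT };
    unsigned pc = 0;   /* program counter */
    uint8_t  ir;       /* instruction register */
    int      acc = 0;  /* a single accumulator register */

    for (;;) {
        ir = memory[pc++];          /* Fetch: read the opcode at PC, advance PC */
        switch (ir) {               /* Decode: inspect IR */
        case OP_LOAD_IMM: acc  = memory[pc++]; break;   /* Execute */
        case OP_ADD:      acc += memory[pc++]; break;
        case OP_PRINT:    printf("%d\n", acc); break;
        case OP_HALT:     return 0;
        }
    }
}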

Parallel Architectures 9

How to limit the bottlenecks of the Von Neumann architecture

● Reduce the memory access latency
– Rely on CPU registers whenever possible
– Use caches
● Hide the memory access latency
– Multithreading and context-switch during memory accesses
● Execute multiple instructions at the same time
– Pipelining
– Multiple issue
– Branch prediction
– Speculative execution
– SIMD extensions

CPU times compared to the real world

Operation                   Latency     Scaled (1 cycle = 1 s)
1 CPU cycle                 0.3 ns      1 s
Level 1 cache access        0.9 ns      3 s
Main memory access          120 ns      6 min
Solid-state disk I/O        50-150 μs   2-6 days
Rotational disk I/O         1-10 ms     1-12 months
Internet: SF to NYC         40 ms       4 years
Internet: SF to UK          81 ms       8 years
Internet: SF to Australia   183 ms      19 years
Physical system reboot      1 min       6 millennia

Source: https://blog.codinghorror.com/the-infinite-space-between-words/

Parallel Architectures 11

Caching

Parallel Architectures 12

Cache hierarchy

● Large memories are slow; fast memories are small

[Figure: the memory hierarchy, from the CPU through the L1, L2 and L3 caches down to DRAM, possibly across an interconnect bus]

Parallel Architectures 13

Cache hierarchy of the AMD Bulldozer architecture

Source: http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

Parallel Architectures 14

CUDA memory hierarchy

[Figure: each thread has its own registers and local memory; the threads of a block share a per-block shared memory; all blocks access the device-wide global, constant and texture memories]

Parallel Architectures 15

How the cache works

● Cache memory is very fast
– Often located inside the processor chip
– Expensive, very small compared to system memory
● The cache contains a copy of the content of some recently accessed (main) memory locations
– If the same memory location is accessed again, and its content is in cache, then the cache is accessed instead of the system RAM

Parallel Architectures 16

How the cache works

● If the CPU accesses a memory location whose content is not already in cache
– the content of that memory location and "some" adjacent locations are copied into the cache
– in doing so, it might be necessary to evict some old data from the cache to make room for the new data
● The smallest unit of data that can be transferred to/from the cache is the cache line
– On Intel processors, usually 1 cache line = 64B
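On Linux, the cache line size can be queried at run time; a minimal sketch, assuming glibc (the _SC_LEVEL1_DCACHE_LINESIZE constant is a glibc extension, not standard POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension: returns the L1 data cache line size in bytes,
       or a non-positive value if the system does not report it */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line: %ld bytes\n", line);
    else
        printf("cache line size not reported\n");
    return 0;
}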

Parallel Architectures 17

Example

[Figure, animated over slides 17-22: the CPU requests an item from RAM; since it is not in the cache, the entire cache line containing that item and its neighbouring locations is copied into the cache; the following accesses to nearby items are then served directly from the cache instead of RAM]

Parallel Architectures 23

Spatial and temporal locality

● Cache memory works well when applications exhibit spatial and/or temporal locality

● Spatial locality
– Accessing adjacent memory locations is OK
● Temporal locality
– Repeatedly accessing the same memory location(s) is OK
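A small illustration (not from the slides): the loop below keeps the running sum in one variable that is reused at every iteration (temporal locality) while scanning the array sequentially (spatial locality):

#include <stdio.h>

#define N 1000000

static double a[N];

int main(void)
{
    double sum = 0.0;               /* reused at every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        a[i] = i;
    for (int i = 0; i < N; i++)
        sum += a[i];                /* consecutive elements: spatial locality */
    printf("%f\n", sum);
    return 0;
}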

Parallel Architectures 24

Example: matrix-matrix product

● Given two square matrices p, q, compute r = p q

[Figure: row i of p is combined with column j of q to produce element (i, j) of r]

void matmul( double *p, double *q, double *r, int n )
{
    int i, j, k;
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            r[i*n + j] = 0.0;
            for (k=0; k<n; k++) {
                r[i*n + j] += p[i*n + k] * q[k*n + j];
            }
        }
    }
}

Parallel Architectures 25

Matrix representation

● Matrices in C are stored in row-major order
– Elements of each row are contiguous in memory
– Adjacent rows are contiguous in memory
– Therefore, element (i, j) of an n×n matrix stored in a flat array p lives at p[i*n + j]

[Figure: a 4×4 matrix and its row-major layout in memory, with the elements of row 0 followed by the elements of row 1, and so on]

Parallel Architectures 26

Row-wise vs column-wise access

Row-wise access is OK: the data is contiguous in memory, so the cache helps (spatial locality)

Column-wise access is NOT OK: the accessed elements are not contiguous in memory (strided access) so the cache does NOT help
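As a sketch of the difference (illustrative code, not from the slides), both functions below sum all elements of an n×n row-major matrix; only the loop order changes:

#include <stdio.h>

/* Row-wise: the inner loop walks consecutive addresses (cache-friendly) */
double sum_rowwise( const double *m, int n )
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += m[i*n + j];
    return s;
}

/* Column-wise: the inner loop jumps n elements at a time (strided, cache-unfriendly) */
double sum_colwise( const double *m, int n )
{
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += m[i*n + j];
    return s;
}

int main(void)
{
    enum { N = 4 };
    double m[N*N];
    for (int i = 0; i < N*N; i++) m[i] = i;
    printf("%f %f\n", sum_rowwise(m, N), sum_colwise(m, N));
    return 0;
}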

Parallel Architectures 27

Matrix-matrix product

● Given two square matrices p, q, compute r = p q

[Figure: in the inner loop, p is accessed row-wise (OK) while q is accessed column-wise (NOT ok)]

Parallel Architectures 28

Optimizing the memory access pattern

[Figure: the matrices p, q and r shown element by element; computing r = p q reads p by rows but q by columns]

Parallel Architectures 29

Optimizing the memory access pattern

[Figure: q is transposed into qT, so that computing r = p qT reads both p and qT by rows]

Parallel Architectures 30

But...

● Transposing the matrix q takes time. Do we actually gain anything by doing so?

● See cache.c
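A sketch of the idea (cache.c is the course's example file; the code below is an illustrative reconstruction, not the original): transpose q into a scratch buffer, then run the usual triple loop so that both operands are read row-wise.

#include <stdlib.h>

/* Illustrative sketch: r = p * q, with q transposed first so that the
   innermost loop reads both p and qT with unit stride. */
void matmul_transposed( const double *p, const double *q, double *r, int n )
{
    double *qT = malloc( (size_t)n * n * sizeof(*qT) );
    if (qT == NULL) return;                 /* allocation failed */

    for (int i = 0; i < n; i++)             /* qT[j][i] = q[i][j] */
        for (int j = 0; j < n; j++)
            qT[j*n + i] = q[i*n + j];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += p[i*n + k] * qT[j*n + k];   /* both read row-wise */
            r[i*n + j] = s;
        }
    }
    free(qT);
}

Whether the extra transpose pays off depends on n and on the cache; cache.c is meant to measure exactly that.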

Parallel Architectures 31

Instruction-Level Parallelism (ILP)

Parallel Architectures 32

Instruction-Level Parallelism

● Uses multiple functional units to increase the performance of a processor
– Pipelining: the functional units are organized like an assembly line, and can be used strictly in that order
– Multiple issue: the functional units can be used whenever required
● Static multiple issue: the order in which functional units are activated is decided at compile time (example: Intel IA64)
● Dynamic multiple issue (superscalar): the order in which functional units are activated is decided at run time

Parallel Architectures 33

Instruction-Level Parallelism

IF  Instruction Fetch
ID  Instruction Decode
EX  Execute
MEM Memory Access
WB  Write Back

[Figure: a five-stage pipeline (IF ID EX MEM WB), and a multiple-issue datapath in which the instruction fetch and decode unit issues instructions in order to integer, floating-point and load/store units that execute out of order, with a commit unit retiring them in order]

Parallel Architectures 34

Pipelining

[Figure: the instruction stream Instr1 … Instr5 flowing through the IF ID EX MEM WB stages; at every clock cycle a new instruction enters the pipeline while the previous ones advance one stage, so up to five instructions are in flight at the same time]

Parallel Architectures 35

Control Dependency

z = x + y;

if ( z > 0 ) {
    w = x;
} else {
    w = y;
}

● The instructions “w = x” and “w = y” have a control dependency on “z > 0”

● Control dependencies can limit the performance of pipelined architectures

The same computation can be rewritten without the branch:

z = x + y;
c = z > 0;
w = x*c + y*(1-c);

Parallel Architectures 36

In the real world...

● From the GCC documentation: http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

— Built-in Function: long __builtin_expect(long exp, long c)

You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:

if (__builtin_expect (x, 0)) foo ();

Parallel Architectures 37

Branch Hint: Example

#include <stdlib.h>

int main( void )
{
    int A[1000000];
    size_t i;
    const size_t n = sizeof(A) / sizeof(A[0]);
    for ( i=0; __builtin_expect( i<n, 1 ); i++ ) {
        A[i] = i;
    }
    return 0;
}

Refrain from this kind of micro-optimization: ideally, this is stuff for compiler writers

Parallel Architectures 38

Hardware multithreading

● Allows the CPU to switch to another task when the current task is stalled
● Fine-grained multithreading
– A context switch has essentially zero cost
– The CPU can switch to another task even on stalls of short duration (e.g., waiting for memory operations)
– Requires a CPU with specific support, e.g., Cray XMT
● Coarse-grained multithreading
– A context switch has non-negligible cost, and is appropriate only for threads blocked on I/O operations or similar
– The CPU is less efficient in the presence of stalls of short duration

Source: https://www.slideshare.net/jasonriedy/sting-a-framework-for-analyzing-spaciotemporal-interaction-networks-and-graphs

Parallel Architectures 40

Hardware multithreading

● Simultaneous multithreading (SMT) is an implementation of fine-grained multithreading where different threads can use multiple functional units at the same time
● HyperThreading is Intel's implementation of SMT
– Each physical processor core is seen by the Operating System as two "logical" processors
– Each “logical” processor maintains a complete set of the architecture state:
● general-purpose registers
● control registers
● advanced programmable interrupt controller (APIC) registers
● some machine state registers
– Intel claims that HT provides a 15—30% speedup with respect to a similar, non-HT processor

Parallel Architectures 41

HyperThreading

● The pipeline stages are separated by two buffers (one for each executing thread)

● If one thread is stalled, the other one might go ahead and fill the pipeline slot, resulting in higher processor utilization

[Figure: an IF ID EX MEM WB pipeline in which each pair of adjacent stages is separated by two queues, one per hardware thread]

Hyper-Threading Technology Architecture and Microarchitecture http://www.cs.virginia.edu/~mc2zk/cs451/vol6iss1_art01.pdf

HyperThreading

● See the output of lscpu ("Thread(s) per core") or lstopo (hwloc Ubuntu/Debian package)

[Figure: lstopo output for a processor without HT and for a processor with HT]

Parallel Architectures 43

Parallel Architectures

Parallel Architectures 44

Flynn's Taxonomy

Instruction streams and data streams can each be single or multiple, giving four classes:

SISD  Single Instruction Stream, Single Data Stream (the Von Neumann architecture)
SIMD  Single Instruction Stream, Multiple Data Streams
MISD  Multiple Instruction Streams, Single Data Stream
MIMD  Multiple Instruction Streams, Multiple Data Streams

Parallel Architectures 45

SIMD

● SIMD instructions apply the same operation (e.g., sum, product…) to multiple data elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands)
– This means that there must be 4/8/... independent ALUs

[Figure: four lanes executing LOAD A[i], LOAD B[i], C[i] = A[i] + B[i], STORE C[i] in lockstep for i = 0..3, one step per time unit]

Parallel Architectures 46

SSE (Streaming SIMD Extensions)

● Extension to the x86 instruction set
● Provide new SIMD instructions operating on small arrays of integer or floating-point numbers
● Introduces 8 new 128-bit registers (XMM0—XMM7)
● SSE2 instructions can handle
– 2 64-bit doubles, or
– 2 64-bit integers, or
– 4 32-bit integers, or
– 8 16-bit integers, or
– 16 8-bit chars

[Figure: a 128-bit XMM register holding four 32-bit lanes]

Parallel Architectures 47

SSE (Streaming SIMD Extensions)

[Figure: two XMM registers holding lanes X3..X0 and Y3..Y0; the same operation Op is applied to each of the 4 lanes, producing X3 op Y3 … X0 op Y0]

Parallel Architectures 48

Example

__m128 a  = _mm_set_ps( 1.0, 2.0, 3.0, 4.0 );
__m128 b  = _mm_set_ps( 2.0, 4.0, 6.0, 8.0 );
__m128 ab = _mm_mul_ps( a, b );

[Figure: lane-by-lane product: a = (1.0, 2.0, 3.0, 4.0), b = (2.0, 4.0, 6.0, 8.0), ab = (2.0, 8.0, 18.0, 32.0)]
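A complete, compilable sketch along the same lines (not from the slides): adding two float arrays four lanes at a time with SSE intrinsics; the unaligned load/store variants are used so that no particular alignment is assumed.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

#define N 8              /* assume N is a multiple of 4 for simplicity */

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions with one instruction */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }

    for (int i = 0; i < N; i++) printf("%f\n", c[i]);
    return 0;
}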

Parallel Architectures 49

GPU

[Figure: Fermi GPU chip (source: http://www.legitreviews.com/article/1100/1/)]

● Modern GPUs (Graphics Processing Units) have a large number of cores, and can be regarded as a form of SIMD architecture

Parallel Architectures 50

CPU vs GPU

[Figure: in a CPU, much of the die is devoted to control logic and cache, with a few ALUs and a DRAM controller; in a GPU, most of the die is devoted to a large number of ALUs, with little control logic and cache]

● The difference between CPUs and GPUs can be appreciated by looking at how the chip surface is used

Parallel Architectures 51

GPU core

● A single GPU core contains a fetch/decode unit shared among multiple ALUs
– If there are 8 ALUs, each instruction can operate on 8 values simultaneously
● Each GPU core maintains multiple execution contexts, and can switch between them at virtually zero cost
– Fine-grained parallelism

[Figure: a GPU core with one Fetch/Decode unit, 8 ALUs and several execution contexts]

Parallel Architectures 52

GPU

● Example: 12 instruction streams × 8 ALUs = 96 operations in parallel

Parallel Architectures 53

MIMD

● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit generally operates on its own input data
– Multiple Data Streams

[Figure: four execution units proceeding in parallel over time, each running its own, different instruction stream on its own data (one calls F(), one computes z = x + y, one runs a load/add/store sequence, and so on)]

Parallel Architectures 54

MIMD architectures

● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes can share data through explicit communications

[Figure: shared memory (several CPUs attached to one memory through an interconnect) vs distributed memory (several nodes, each with its own CPU and memory, connected by an interconnect)]
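As an illustration of how the two models are typically programmed (OpenMP is just one common choice for shared memory, assumed here rather than prescribed by the slide), the loop below is split among the threads of a single shared-memory node; on a distributed-memory system, each node would instead own a slice of the arrays and exchange data through explicit messages (e.g., with MPI).

#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Shared memory: every thread sees the same arrays; the directive
       distributes the iterations among the threads of this node.
       Compile with: gcc -fopenmp */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("%f\n", c[N-1]);
    return 0;
}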

Parallel Architectures 55

Hybrid architectures

● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor
– A large number of compute nodes is connected through an interconnect network

[Figure: several compute nodes connected by an interconnect; each node has its own memory, multiple CPUs and a pair of GPUs]

Parallel Architectures 56

Shared memory example

[Figure: die photos of shared-memory multicore processors: Intel Core i7 and AMD “Istanbul”]

Parallel Architectures 57

Distributed memory system: IBM BlueGene/Q @ CINECA

Architecture:      10 BGQ Frame
Model:             IBM-BG/Q
Processor Type:    IBM PowerA2, 1.6 GHz
Computing Cores:   163840
Computing Nodes:   10240
RAM:               1 GByte / core
Internal Network:  5D Torus
Disk Space:        2 PByte scratch space
Peak Performance:  2 PFlop/s

Parallel Architectures 58

Parallel Architectures 59

Parallel Architectures 61

[Figure: the TOP500 list, www.top500.org, June 2017]

Parallel Architectures 62

SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m2
Power consumption: 800,000 Watts

Parallel Architectures 63

Sony PLAYSTATION 3
Date: 2006
Peak performance: >1.8 Teraflops
Floor space: 0.08 m2
Power consumption: <200 Watts

SANDIA ASCI RED
Date: 1996
Peak performance: 1.8 Teraflops
Floor space: 150 m2
Power consumption: 800,000 Watts

Parallel Architectures 64

Inside SONY's PS3

[Figure: the Cell Broadband Engine processor]

Parallel Architectures 65

Empirical rules

● When writing parallel applications (especially on distributed-memory architectures), keep in mind that:
– Computation is fast
– Communication is slow
– Input/output is incredibly slow

Parallel Architectures 66

Recap

Shared memory
● Advantages:
– Easier to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth

Distributed memory
● Advantages:
– Highly scalable; provides very high computational power by adding more nodes
– Useful for applications with strong locality of reference and a high computation / communication ratio
● Disadvantages:
– Latency of the interconnect network
– Difficult to program