
GPU programming: 1. Parallel programming models

Sylvain Collange
Inria Rennes – Bretagne Atlantique

[email protected]


Self-introduction

Sylvain Collange

Researcher at Inria, Rennes, France

Visiting UFMG until the end of October

Topics:

Computer architecture

Computer arithmetic

Compilers

Contact: [email protected]


3

Graphics processing unit (GPU)

Graphics rendering accelerator for computer games

Mass market: low unit price, amortized R&D

Increasing programmability and flexibility

Inexpensive, high-performance parallel processor

GPUs are everywhere, from cell phones to supercomputers



4

GPGPU today

2002: General-Purpose computation on GPU (GPGPU)

Scientists run non-graphics computations on GPUs

2015: the fastest supercomputers use GPUs/accelerators

GPU architectures now in high-performance computing

#1 Tianhe-2 (China): 16,000 × (2 Intel Xeon + 3 Intel Xeon Phi)

#2 Titan (USA): 18,688 × (1 AMD Opteron + 1 NVIDIA Tesla K20X)


5

GPGPU today

2002: General-Purpose computation on GPU (GPGPU)

Scientists run non-graphics computations on GPUs

2015: the fastest supercomputers use GPUs/accelerators

GPU architectures now in high-performance computing

Grifo04 (Brazil): 544 × (1 Intel Xeon + 2 NVIDIA Tesla M2050)

Santos Dumont GPU (Brazil): (Intel Xeon + NVIDIA Tesla K40)


6

Goals of the course

How does a GPU work?

How to design parallel algorithms?

How to implement them on a GPU?

How to get good performance?

Will focus on NVIDIA CUDA environment

Similar concepts in OpenCL

Non-goals: the course is not about

Graphics programming (OpenGL, Direct3D)

High-level languages that target GPUs (OpenACC, OpenMP 4…)

Parallel programming for multi-core and clusters (OpenMP, MPI...)

Programming for SIMD extensions (SSE, AVX...)


7

Outline of the course

Today: Introduction to parallel programming models

PRAM

BSP

Multi-BSP

Introduction to GPU architecture

Let's pretend we are a GPU architect

How would we design a GPU?

GPU programming

The software side

Programming model

Performance optimization

Possible bottlenecks

Common optimization techniques


8

Organization: tentative schedule

Week  | Course (Tuesday 1:00pm, room 2006) | Lab (Thursday 1:00pm, room 2011)
08/09 | Programming models                 | Parallel algorithms
15/09 | GPU architecture 1                 | Know your GPU
22/09 | GPU architecture 2                 | Matrix multiplication
29/09 | GPU programming 1                  | Computing ln(2) the hard way
06/10 | GPU programming 2                  |
13/10 | GPU optimization 1                 | Game of life
20/10 | GPU optimization 2                 |
27/10 | Advanced features                  | Project

Lectures and lab exercises will be available online at: http://homepages.dcc.ufmg.br/~sylvain.collange/gpucourse/


9

Introduction to parallel prog. models

Today's course

Goals:

Be aware of the major parallel models and their characteristics

Write platform-independent parallel algorithms

Incomplete: just an overview, not an actual programming model course

Just cover what we need for the rest of the course

Informal: no formal definition, no proofs of theorems

Proofs by handwaving


10

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm example: parallel prefix


11

Why programming models?

Abstract away the complexity of the machine

Abstract away implementation differences (languages, architectures, runtime...)

Focus on the algorithmic choices

First design the algorithm, then implement it

Define the asymptotic algorithmic complexity of algorithms

Compare algorithms independently of the implementation / language / computer / operating system...


12

Serial model: RAM/Von Neumann/Turing

Abstract machine

One unlimited memory, constant access time

Basis for classical algorithmic complexity theory

Complexity = number of instructions executed as a function of problem size

[Diagram: a machine executes program instructions in sequence; instructions read and write memory, jump, and compute.]


13

Serial algorithms

Can you cite examples of serial algorithms and their complexity?


14

Serial algorithms

Can you cite examples of serial algorithms and their complexity?

Binary search: O(log n)

Quick sort: O(n log n)

Factorization of an n-digit number: O(eⁿ)

Multiplication of n-digit numbers:
O(n log n) < O(n log n · 2^O(log* n)) < O(n log n · log log n)
(conjectured lower bound < best known algorithm, Fürer 2007 < previously best known algorithm, Schönhage-Strassen 1971)

Still an active area of research!

But for parallel machines, we need parallel models


15

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm example: parallel prefix


16

Parallel model: PRAM

1 memory, constant access time

N processors

Synchronous steps

All processors run an instruction simultaneously

Usually the same instruction

Each processor knows its own number

S. Fortune and J. Wyllie. Parallelism in random access machines.Proc. ACM symposium on Theory of computing. 1978.

[Diagram: processors P0, P1, …, Pn execute instructions in sequence, all reading and writing a single shared memory.]


17

PRAM example: SAXPY

Given vectors X and Y of size n, compute vector R←a×X+Y with n processors

Processor Pi:
  xi ← X[i]
  yi ← Y[i]
  ri ← a × xi + yi
  R[i] ← ri

[Figure: processors P0…P3 each read one element of X and Y and write one element of R.]


20

PRAM example: SAXPY

Given vectors X and Y of size n, compute vector R←a×X+Y with n processors

Complexity: O(1) with n processors; O(n) with 1 processor

Processor Pi:
  xi ← X[i]
  yi ← Y[i]
  ri ← a × xi + yi
  R[i] ← ri
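As a sanity check, the PRAM SAXPY step can be simulated sequentially in Python; the loop over i below stands in for the n processors executing the same instructions in lockstep (the function name and structure are illustrative, not from the slides).

```python
def saxpy_pram(a, x, y):
    """Simulate the PRAM SAXPY kernel: every processor Pi executes the
    same instruction sequence on its own index i."""
    n = len(x)
    r = [0.0] * n
    for i in range(n):          # one iteration = one processor Pi
        xi = x[i]               # xi <- X[i]
        yi = y[i]               # yi <- Y[i]
        r[i] = a * xi + yi      # ri <- a * xi + yi ; R[i] <- ri
    return r

print(saxpy_pram(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # → [12.0, 24.0, 36.0]
```

Since every iteration is independent, the sequential loop and the one-step PRAM execution compute the same result.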


22

Reduction example

Given a vector a of size n, compute the sum of its elements with n/2 processors

Sequential algorithm

Complexity: O(n) with 1 processor

All operations are dependent: no parallelism

Need to use associativity to reorder the sum

sum ← 0
for i = 0 to n-1
  sum ← sum + a[i]
end for

[Dependency graph: a serial chain of additions through a₀, a₁, a₂, a₃.]


23

Parallel reduction: dependency graph

We compute r = ((a₀ ⊕ a₁) ⊕ (a₂ ⊕ a₃)) ⊕ … ⊕ aₙ₋₁, re-associated into pairs, then groups of 4, and so on.

[Figure: balanced binary tree of ⊕ nodes over a₀ … a₇, combining pairwise at each level down to the result r.]


24

Parallel reduction: PRAM algorithm

First step

Processor Pi:
  a[i] ← a[2i] + a[2i+1]

[Figure: P0…P3 compute the first tree level: a₀ ← a₀+a₁, a₁ ← a₂+a₃, a₂ ← a₄+a₅, a₃ ← a₆+a₇.]


25

Parallel reduction: PRAM algorithm

Second step

Reduction on n/2 processors

Other processors stay idle

Processor Pi:
  n' = n/2
  if i < n' then
    a[i] ← a[2i] + a[2i+1]
  end if

[Figure: P0 and P1 combine the partial sums; P2 and P3 stay idle.]


26

Parallel reduction: PRAM algorithm

Complete loop

Processor Pi:
  while n > 1 do
    n ← n / 2
    if i < n then
      a[i] ← a[2i] + a[2i+1]
    end if
  end while

[Figure: log₂(n) levels of pairwise additions reduce a₀ … a₇ to r.]

Complexity: O(log(n)) with O(n) processors
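A sequential Python simulation of this loop (assuming n is a power of two); the snapshot `old` models the PRAM rule that all reads in a step happen before all writes:

```python
def pram_reduce(a):
    """Simulate the PRAM reduction: at each synchronous step, processor i
    (for i < n) computes a[i] = a[2i] + a[2i+1]."""
    a = list(a)
    n = len(a) // 2
    while n >= 1:
        old = list(a)               # reads precede writes within a step
        for i in range(n):          # active processors P0 .. P(n-1)
            a[i] = old[2 * i] + old[2 * i + 1]
        n //= 2
    return a[0]

print(pram_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # → 36
```

The while loop runs log₂(n) times, matching the O(log n) step count above.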


27

Less than n processors

A computer that grows with problem size is not very realistic

Assume p processors for n elements, p < n

Back to R←a×X+Y example

Solution 1: group by blocks; assign a contiguous range to each processor

Processor Pi:
  for j = ⌈i·n/p⌉ to ⌈(i+1)·n/p⌉ - 1
    xj ← X[j]
    yj ← Y[j]
    rj ← a × xj + yj
    R[j] ← rj

[Figure: P0…P3 each process one contiguous block of X, Y and R, over two steps.]


29

Solution 2: slice and interleave

Processor Pi:
  for j = i to n-1 step p
    xj ← X[j]
    yj ← Y[j]
    rj ← a × xj + yj
    R[j] ← rj

[Figure: P0…P3 process interleaved elements of X, Y and R.]

30

Solution 2: slice and interleave

Equivalent in the PRAM model

But not in actual programming!
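The two distributions can be made concrete with a small sketch (hypothetical helper names): both cover every index exactly once, which is why they have the same PRAM cost even though their memory-access patterns differ on real hardware.

```python
import math

def block_indices(i, n, p):
    """Solution 1: processor i owns the contiguous range
    [ceil(i*n/p), ceil((i+1)*n/p))."""
    return list(range(math.ceil(i * n / p), math.ceil((i + 1) * n / p)))

def cyclic_indices(i, n, p):
    """Solution 2: processor i owns indices i, i+p, i+2p, ..."""
    return list(range(i, n, p))

# Both partitions cover 0..n-1 exactly once
n, p = 10, 4
print(sorted(j for i in range(p) for j in block_indices(i, n, p)))
print(sorted(j for i in range(p) for j in cyclic_indices(i, n, p)))
```

On a GPU, the interleaved (cyclic) form is usually preferred because neighboring threads then touch neighboring addresses.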


31

Brent's theorem

Generalization for any PRAM algorithm of complexity C(n) with n processors

Derive an algorithm for n elements, with p processors only

Idea: each processor emulates n/p virtual processors

Each emulated step takes n/p actual steps

Complexity: O(n/p * C(n))

[Figure: p physical processors P0…P(p-1) each emulate n/p virtual processors, turning one step of an n-processor algorithm into n/p steps.]
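Brent's theorem can be illustrated by running the n/2-processor reduction on p physical processors, each emulating ⌈active/p⌉ virtual processors per synchronous step (a sketch, assuming a power-of-two input; the function name is illustrative):

```python
import math

def reduce_with_p(a, p):
    """PRAM reduction emulated on p physical processors: each virtual
    step costs ceil(active/p) real sub-steps (Brent's theorem)."""
    a = list(a)
    n = len(a) // 2
    steps = 0
    while n >= 1:
        old = list(a)                        # reads precede writes
        per_proc = math.ceil(n / p)          # virtual processors per physical one
        for phys in range(p):                # p physical processors
            for k in range(per_proc):        # emulated virtual ids
                i = phys * per_proc + k
                if i < n:
                    a[i] = old[2 * i] + old[2 * i + 1]
        steps += per_proc                    # emulation cost of this step
        n //= 2
    return a[0], steps

print(reduce_with_p([1, 2, 3, 4, 5, 6, 7, 8], 2))  # → (36, 4)
```

The result is unchanged; only the step count grows by the emulation factor.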


32

Interpretation of Brent's theorem

We can design algorithms for a variable (unlimited) number of processors

They can run on any machine with fewer processors with known complexity

There may exist a more efficient algorithm with p processors only


33

PRAM: wrap-up

Unlimited parallelism

Unlimited shared memory

Only count operations: zero communication cost

Pros

Simple model: high-level abstraction; can model many processors or even hardware

Cons

Simple model: far from implementation


34

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm example: parallel prefix


35

PRAM limitations

PRAM model proposed in 1978

Inspired by SIMD machines of the time

Assumptions

All processors synchronized every instruction

Negligible communication latency

Useful as a theoretical model, but far from modern computers

ILLIAC-IV, an early SIMD machine


36

Modern architectures

Modern supercomputers are clusters of computers

Global synchronization costs millions of cycles

Memory is distributed

Inside each node

Multi-core CPUs, GPUs

Non-uniform memory access (NUMA)

Synchronization cost at all levels

MareNostrum, a modern distributed-memory machine


37

Bulk-Synchronous Parallel (BSP) model

Assumes distributed memory

But also works with shared memory

Good fit for GPUs too, with a few adaptations

Processors execute instructions independently

Communications between processors are explicit

Processors need to synchronize with each other

[Diagram: processors P0, P1, …, Pn, each with its own local memory, execute instructions independently and exchange data through a communication network that also provides synchronization.]

L. Valiant. A bridging model for parallel computation. Comm. ACM 1990.


38

Superstep

A program is a sequence of supersteps

Superstep: each processor

Computes

Sends result

Receives data

Barrier: wait until all processors have finished their superstep

Next superstep: can use data received in previous step

[Diagram: a sequence of supersteps, each a computation + communication phase, separated by barriers.]


39

Example: reduction in BSP

Start from dependency graph again

[Figure: the binary reduction tree over a₀ … a₇, ending in r.]


40

Reduction: BSP

Adding barriers

[Figure: the reduction tree over a₀ … a₇ on processors P0…P7, with a barrier after each level. In each superstep, active processor Pi computes aᵢ ← a₂ᵢ + a₂ᵢ₊₁; its right operand must be received from another processor before the barrier.]


41

Reducing communication

Data placement matters in BSP

Optimization: keep left-side operand local

Step 1: a₂ᵢ ← a₂ᵢ + a₂ᵢ₊₁
Step 2: a₄ᵢ ← a₄ᵢ + a₄ᵢ₊₂
Step 3: a₈ᵢ ← a₈ᵢ + a₈ᵢ₊₄

[Figure: reduction over a₀ … a₇ on P0…P7 with barriers between steps. Each partial sum stays on the processor that owns the left operand, so only right operands cross the network.]
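The stride-doubling pattern in which the left operand never moves can be sketched as a loop (indices as on the slide; a sequential simulation, not BSP code):

```python
def reduce_keep_left_local(a):
    """Reduction where each partial sum stays at the index of its left
    operand: step d does a[i] <- a[i] + a[i + 2^d] for i a multiple of
    2^(d+1). Only the right operand would cross the network."""
    a = list(a)
    stride = 1
    while stride < len(a):
        for i in range(0, len(a), 2 * stride):
            a[i] = a[i] + a[i + stride]     # right operand is communicated
        stride *= 2
    return a[0]

print(reduce_keep_left_local([1, 2, 3, 4, 5, 6, 7, 8]))  # → 36
```

Unlike the earlier compacting version, no snapshot is needed: within a step, the indices being read are never the ones being written.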


42

Wrapup

BSP takes synchronization cost into account

Explicit communications

Homogeneous communication cost: does not depend on which processors communicate


43

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm examples


44

The world is not flat

It is hierarchical !


45

Multi-BSP model

Multi-BSP: BSP generalization with groups of processors in multiple nested levels

[Diagram: level-1 supersteps separated by level-1 barriers, nested inside level-2 supersteps separated by level-2 barriers.]

Higher level: more expensive synchronization

Arbitrary number of levels

L. Valiant. A bridging model for multi-core computing. Algorithms-ESA 2008.


46

Multi-BSP and multi-core

Minimize communication cost on hierarchical platforms

Make parallel program hierarchical too

Take thread affinity into account

On clusters (MPI): add more levels up

On GPUs (CUDA): add more levels down

[Diagram: the hardware hierarchy (thread, core, L1 cache, shared L2 cache, shared L3 cache, computer) matched against the program hierarchy (L1, L2, L3 supersteps).]


47

Reduction: multi-BSP

Break into 2 levels

[Figure: reduction over a₀ … a₇ on P0…P7 in two levels. Each group of four processors performs a level-1 reduction synchronized by cheap L1 barriers; the two group results a₀ and a₄ are then combined in a level-2 reduction across an expensive L2 barrier.]
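The two-level structure can be sketched as follows (names hypothetical): each group reduces its own chunk under cheap level-1 synchronization, then the per-group partial sums are combined at level 2. `sum` stands in for a local tree reduction at each level.

```python
def two_level_reduce(a, group_size):
    """Multi-BSP-style reduction: level-1 reductions inside each group,
    then a level-2 reduction over the per-group partial sums."""
    # Level 1: each group reduces its chunk (cheap local barriers)
    partials = [sum(a[g:g + group_size])
                for g in range(0, len(a), group_size)]
    # Level 2: combine the group results (one expensive global barrier)
    return sum(partials)

print(two_level_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4))  # → 36
```

Only one synchronization per level crosses the expensive boundary, instead of one global barrier per tree level.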


48

Recap

PRAM

Single shared memory

Many processors in lockstep

BSP

Distributed memory, message passing

Synchronization with barriers

Multi-BSP

BSP with multiple scales

From more abstract (PRAM) to closer to implementation (Multi-BSP)


49

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm examples

Parallel prefix

Stencil


50

Work efficiency

Work: number of operations

Work efficiency: ratio between the work of the sequential algorithm and the work of the parallel algorithm

Example: reduction

Sequential sum: n-1 operations

Parallel reduction: n-1 operations

Work efficiency: 1. This is the best case!


51

Parallel prefix

Problem: on a given vector a, compute for each element the sum sᵢ of all elements up to and including aᵢ:

sᵢ = ∑ₖ₌₀ⁱ aₖ

Result: vector s = (s₀, s₁, …, sₙ₋₁)

"Sum" can be any associative operator

Applications: stream compaction, graph algorithms, addition in hardware…

Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel prefix sum (scan) with CUDA." GPU gems 3, 2007


52

Sequential prefix: graph

Sequential algorithm:

sum ← 0
for i = 0 to n-1
  s[i] ← sum
  sum ← sum + a[i]
end for

[Dependency graph: a serial chain; each sᵢ depends on the previous partial sum.]

Can we parallelize it?

Of course we can: this is a parallel programming course!

Add more computations to reduce the depth of the graph

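The sequential pseudocode, transcribed to Python (this is the exclusive variant: s[i] is written before a[i] is added):

```python
def sequential_prefix(a):
    """Sequential (exclusive) prefix sum: s[i] = a[0] + ... + a[i-1]."""
    s = [0] * len(a)
    total = 0
    for i in range(len(a)):
        s[i] = total            # s[i] <- sum
        total = total + a[i]    # sum <- sum + a[i]
    return s

print(sequential_prefix([1, 2, 3, 4]))  # → [0, 1, 3, 6]
```

Every iteration depends on `total` from the previous one, which is exactly the serial chain in the dependency graph.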


53

Parallel prefix

Optimize for graph depth

Make a reduction tree for each output

Reuse common intermediate results

[Figure: overlapping reduction trees computing s₀ … s₃ from a₀ … a₃, sharing common intermediate sums.]


54

A simple parallel prefix algorithm

Also known as Kogge-Stone in hardware

Hillis, W. Daniel, and Guy L. Steele Jr. "Data parallel algorithms." Communications of the ACM 29.12 (1986): 1170-1183.

Σi-j denotes the partial sum aᵢ ⊕ … ⊕ aⱼ.

s[i] ← a[i]
Step d (d = 0, 1, 2, …): if i ≥ 2ᵈ then s[i] ← s[i-2ᵈ] + s[i]

Step 0: processor i ≥ 1 adds s[i-1], so s[i] holds Σ(i-1)-i.
Step 1: processor i ≥ 2 adds s[i-2].
Step 2: processor i ≥ 4 adds s[i-4]; after log₂(n) steps every s[i] holds Σ0-i.
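A sequential simulation of this scan; the `old` snapshot again models one synchronous PRAM step, and the doubling offset gives log₂(n) steps at the cost of O(n log n) total additions:

```python
def hillis_steele_scan(a):
    """Inclusive scan: at step d, every processor i >= 2^d adds
    s[i - 2^d] into s[i]."""
    s = list(a)
    offset = 1
    while offset < len(s):
        old = list(s)                       # reads precede writes
        for i in range(offset, len(s)):
            s[i] = old[i - offset] + old[i]
        offset *= 2
    return s

print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # → [1, 3, 6, 10, 15, 21, 28, 36]
```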


55

Analysis

Complexity

log2(n) steps with n processors

Work efficiency

O(n log2(n)) operations

1/log2(n) work efficiency

All communications are global

Bad fit for Multi-BSP model


56

Another parallel prefix algorithm

Two passes

First, upsweep: parallel reduction

Compute partial sums

Keep “intermediate” (non-root) results

[Figure: upsweep over a₀ … a₇. After step d=0, odd positions hold the pair sums Σ0-1, Σ2-3, Σ4-5, Σ6-7; after d=1, position 3 holds Σ0-3 and position 7 holds Σ4-7; after d=2, position 7 holds Σ0-7.]

for d = 0 to log₂(n)-1
  if i mod 2ᵈ⁺¹ = 2ᵈ⁺¹-1 then
    s[i] ← s[i-2ᵈ] + s[i]
  end if
end for

Guy E. Blelloch. Prefix sums and their applications. Tech. report, 1990.


57

Second pass: downsweep

Set root to zero

Go down the tree

Left child: propagate parent

Right child: compute parent + left child

[Figure: downsweep over the upsweep result. The root Σ0-7 is replaced by 0; at each level, the left child receives the parent's value and the right child receives parent + left child, all swaps within a level done in parallel. The result is the exclusive prefix: 0, Σ0-0, Σ0-1, …, Σ0-6.]

Hopefully nobody will notice the off-by-one difference...

s[n-1] ← 0
for d = log₂(n)-1 to 0 step -1
  if i mod 2ᵈ⁺¹ = 2ᵈ⁺¹-1 then
    x ← s[i]
    s[i] ← s[i-2ᵈ] + x
    s[i-2ᵈ] ← x
  end if
end for
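Both passes together, as a sequential sketch (assuming a power-of-two length): the upsweep is the tree reduction keeping its intermediate results, and the downsweep zeroes the root and swaps values down the tree, producing the exclusive scan.

```python
def blelloch_scan(a):
    """Work-efficient exclusive scan: upsweep then downsweep.
    About 2n operations in 2*log2(n) steps; len(a) a power of two."""
    s = list(a)
    n = len(s)
    # Upsweep: the rightmost element of each 2^(d+1) group accumulates
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            s[i] = s[i - d] + s[i]
        d *= 2
    # Downsweep: root <- 0; left child gets parent, right gets parent + left
    s[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            x = s[i]                 # parent value
            s[i] = s[i - d] + x      # right child: parent + left child
            s[i - d] = x             # left child: parent
        d //= 2
    return s

print(blelloch_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # → [0, 1, 3, 6, 10, 15, 21, 28]
```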


58

Analysis

Complexity

2 log2(n) steps: O(log(n))

Work efficiency

Performs 2n operations

Work efficiency = 1/2: O(1)

Communications are local away from the root

Efficient algorithm in Multi-BSP model


59

Another variation: Brent-Kung

Same reduction in the upsweep phase, simpler computation in the downsweep phase

Many different topologies: Kogge-Stone, Ladner-Fischer, Han-Carlson...

Different tradeoffs, mostly relevant for hardware implementation

[Figure: Brent-Kung downsweep. Starting from the upsweep's partial sums, one extra level of local additions fills in the remaining prefixes Σ0-2, Σ0-4, Σ0-6.]

if i mod 2ᵈ⁺¹ = 2ᵈ⁺¹-2ᵈ then
  s[i] ← s[i-2ᵈ] + s[i]
end if


60

Outline of the lecture

Why programming models?

PRAM

Model

Example

Brent's theorem

BSP

Motivation

Description

Example

Multi-BSP

Algorithm examples

Parallel prefix

Stencil


61

Stencil

Given an n-D grid, compute each element as a function of its neighbors

All updates done in parallel: no dependence

Commonly, repeat the stencil operation in a loop

Applications: image processing, fluid flow simulation…


62

Stencil: naïve parallel algorithm

1-D stencil of size n, nearest neighbors

m successive steps

Processor i:
  for t = 0 to m
    s[i] = f(s[i-1], s[i], s[i+1])
  end for

[Figure: at each time step t = 0, 1, 2, every cell is recomputed from its neighbors by P0…P3.]

Needs global synchronization at each stage
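A sequential model of m stencil steps (boundary cells held fixed, a common simplification not spelled out on the slide); the `old` copy is exactly the global synchronization: every update must read values from the previous step.

```python
def stencil_steps(s, f, m):
    """Apply a 1-D nearest-neighbour stencil f for m steps.
    Interior cells are updated from the previous grid; boundaries fixed."""
    s = list(s)
    for _ in range(m):
        old = list(s)                       # grid from the previous step
        for i in range(1, len(s) - 1):
            s[i] = f(old[i - 1], old[i], old[i + 1])
    return s

print(stencil_steps([1, 1, 1, 1], lambda l, c, r: l + c + r, 1))  # → [1, 3, 3, 1]
```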


63

Ghost zone optimization

Idea: trade more computation for less global communication

Affect tiles to groups of processors with overlap

[Figure: the grid is split into overlapping tiles. Group P0–P8 holds s₀ … s₈ in its memory and group P9–P17 holds s₅ … s₁₃; the overlap s₅ … s₈ is the ghost zone, replicated in both groups' memories.]


64

Ghost zone optimization

Initialize all elements including ghost zones

Perform local stencil computations

Global synchronization only needed periodically

[Figure: each group runs steps t = 0, 1, 2 locally on its tile. Results inside the ghost zone gradually become stale, so communication + synchronization is only needed every few steps to refresh the ghost cells.]
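The correctness argument can be checked directly: a tile padded with m ghost cells on each side can run m steps purely locally, and its interior still matches the globally synchronized computation (a sketch with boundary cells held fixed; function names hypothetical).

```python
def stencil_once(s, f):
    """One stencil step on a grid, boundary cells held fixed."""
    old = list(s)
    return [old[0]] + [f(old[i - 1], old[i], old[i + 1])
                       for i in range(1, len(old) - 1)] + [old[-1]]

def ghost_zone_steps(tile, f, m):
    """Run m stencil steps on a tile that includes m ghost cells per
    side; only the returned interior cells are guaranteed exact."""
    t = list(tile)
    for _ in range(m):                      # no global synchronization here
        t = stencil_once(t, f)
    return t[m:-m]

# Interior of the tile matches the fully synchronized global result
f = lambda l, c, r: l + c + r
g = list(range(10))
ref = g
for _ in range(2):
    ref = stencil_once(ref, f)             # global reference: 2 full steps
print(ghost_zone_steps(g[1:9], f, 2) == ref[3:7])  # → True
```

Each step invalidates one layer of ghost cells, so after m local steps the tile must exchange data with its neighbors before continuing.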


65

Wrapup

Computation is cheap

Focus on communication and global memory access

Between communication/synchronization steps, this is familiar sequential programming

Local communication is cheap

Focus on global communication


66

Conclusion

This time: programming models

High level view of parallel programming

Next time: actual GPU architecture

Low level view

Rest of the course: how to bridge the gap

Programming: mapping an algorithm to an architecture