

Technology for a better society 1

Micro course on

High-performance simulation with high-level languages

André R. Brodtkorb, SINTEF

GPU Ocean Workshop: OpenCL part I


Technology for a better society 2

• Aim of course:

• To equip you with a set of tools and techniques for working

efficiently with high-performance software development.

• Abridged version of a five-hour short course

• Part 1: Theory. (3 hours of lectures)

• Part 2: Practice. (2 hours of laboratory exercises)

Overview of short course


Technology for a better society 3

• Part 1a – Introduction

• Motivation for going parallel

• Parallel algorithm design

• Programming GPUs with CUDA

• Part 1b – Solving conservation laws with pyopencl

• Solving ODEs and PDEs on a computer

• The heat equation in 1D and 2D

• The linear wave equation

Outline


Technology for a better society 4

Motivation for going parallel


Technology for a better society 5

• The key to increasing performance is to consider the full algorithm and architecture interaction.

• A good knowledge of both the

algorithm and the computer

architecture is required.

Why care about computer hardware?

Graph from David Keyes, Scientific Discovery through Advanced Computing, Geilo Winter School, 2008


Technology for a better society 6

History lesson: development of the microprocessor 1/2

1942: Digital electronic computer (Atanasoff and Berry)

1947: Transistor (Shockley, Bardeen, and Brattain)

1958: Integrated circuit (Kilby)

1971: Microprocessor (Hoff, Faggin, Mazor)

1971–: Exponential growth (Moore, 1965)


Technology for a better society 7

1971: 4004, 2300 transistors, 740 kHz

1982: 80286, 134 thousand transistors, 8 MHz

1993: Pentium P5, 1.18 million transistors, 66 MHz

2000: Pentium 4, 42 million transistors, 1.5 GHz

2010: Nehalem, 2.3 billion transistors, 8 cores, 2.66 GHz

History lesson: development of the microprocessor 2/2


Technology for a better society 8

• Heat density approaching that of nuclear reactor core: Power wall

• Traditional cooling solutions (heat sink + fan) insufficient

• Industry solution: multi-core and parallelism!

What happened in 2004?

Original graph by G. Taylor, “Energy Efficient Circuit Design and the Future of Power Delivery” EPEPS’09

[Graph: power density (W/cm²) versus critical dimension (µm)]


Technology for a better society 9

Why Parallelism?

[Figure: frequency, performance, and power of a single-core processor compared with a dual-core processor running at reduced clock frequency]

The power density of microprocessors

is proportional to the clock frequency cubed:1

1 Brodtkorb et al. State-of-the-art in heterogeneous computing, 2010


Technology for a better society 10

• Up to 5760 floating point

operations in parallel!

• 5-10 times as power

efficient as CPUs!

Massive Parallelism: The Graphics Processing Unit

[Graphs: GPU memory bandwidth (GB/s) and single-precision gigaflops, 2000–2015]


Technology for a better society 11

Parallel algorithm design


Technology for a better society 12

• When the processors are symmetric (identical),

we tend to use symmetric multiprocessing.

• Tasks take the same amount of time regardless of which processor they run on.

• All processors can see everything in memory

• If we have different (non-identical) processors, we speak of heterogeneous computing.

• Tasks will take a different amount of time

on different processors

• Not all tasks can run on all processors.

• Each processor sees only part of the memory

• We can even mix the two above, add message passing, etc.!

Type of parallel processing

[Illustrations: a multi-core CPU (symmetric multiprocessing), and a multi-core CPU with a GPU (heterogeneous computing)]


Technology for a better society 13

• Most algorithms are like baking recipes, tailored for a single person / processor:

• First, do A,

• Then do B,

• Continue with C,

• And finally complete by doing D.

• How can we utilize an "army of identical chefs"?

• How can we utilize an "army of different chefs"?

Mapping an algorithm to a parallel architecture

Picture: Daily Mail Reporter , www.dailymail.co.uk


Technology for a better society 14

• Data parallelism performs the same operation

for a set of different input data

• Scales well with the data size:

The larger the problem, the more processors you can utilize

• Trivial example:

Element-wise multiplication of two vectors:

• c[i] = a[i] * b[i],   i = 0…N

• Processor i multiplies element i of vectors a and b.

Data parallel workloads
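As a minimal sketch of this pattern (names and block size are chosen for the example; CUDA itself is introduced later in this part), a kernel that assigns one thread to each element might look as follows:

__global__ void multiplyKernel(float* c, const float* a, const float* b, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;  //Global index: one thread per element
    if (i < n) {                                  //Guard threads beyond the end of the vectors
        c[i] = a[i] * b[i];
    }
}

void multiplyOnGPU(float* c, const float* a, const float* b, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);       //Round up so every element is covered
    multiplyKernel<<<grid, block>>>(c, a, b, n);  //c, a and b must point to GPU memory
}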


Technology for a better society 15

• Most algorithms contain a mixture of workloads:

• Some serial parts

• Some task and / or data parallel parts

• Amdahl’s law:

• There is a limit to speedup offered by

parallelism

• Serial parts become the bottleneck for a

massively parallel architecture!

• Example: 5% of code is serial: maximum

speedup is 20 times!

Limits on performance 1/4

S: Speedup
P: Parallel portion of code
N: Number of processors

Graph from Wikipedia, user Daniels220, CC-BY-SA 3.0
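Written out with the notation above, Amdahl's law is:

S(N) = 1 / ((1 − P) + P / N)

With P = 0.95 (5% serial code), the speedup is bounded by S → 1 / 0.05 = 20 as N grows, which is the 20-times limit in the example above.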


Technology for a better society 16

• Gustafson's law:

• If you cannot reduce the serial parts of the

algorithm, make the parallel portion

dominate the execution time

• Essentially: solve a bigger problem!

Limits on performance 2/4

S: Speedup
P: Number of processors
α: Serial portion of code

Graph from Wikipedia, user Peahihawaii, CC-BY-SA 3.0
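Written out with the notation above, Gustafson's law is:

S(P) = P − α · (P − 1)

For example, with α = 0.05 and P = 100 processors, the scaled speedup is 100 − 0.05 · 99 ≈ 95: when the problem grows with the machine, the parallel part dominates.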


Technology for a better society 17

• Moving data has become the major bottleneck in computing.

• Downloading 1GB from Japan to Switzerland consumes

roughly the energy of 1 charcoal briquette1.

• A FLOP costs less than moving one byte2.

• Key insight: flops are free, moving data is expensive

Limits on performance 3/4

1 Energy content charcoal: 10 MJ / kg, kWh per GB: 0.2 (Coroama et al., 2013), Weight charcoal briquette: ~25 grams

2Simon Horst, Why we need Exascale, and why we won't get there by 2020, 2014


Technology for a better society 18

• A single precision number is four bytes

• You must perform over 60 operations for each

float read on a GPU!

• Over 25 operations on a CPU!

• This groups algorithms into two classes:

• Memory bound

Example: Matrix multiplication

• Compute bound

Example: Computing π

• The third limiting factor is latencies

• Waiting for data

• Waiting for floating point units

• Waiting for ...

Limits on performance 4/4

[Graph: optimal single-precision FLOPs per byte for CPUs and GPUs, 2000–2015]
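As a rough illustration of where these numbers come from (the exact figures depend on the processor): a GPU with roughly 4000 single-precision GFLOPS and roughly 250 GB/s of memory bandwidth must perform about 4000 / 250 = 16 floating point operations per byte transferred to keep its arithmetic units busy, i.e. about 16 · 4 = 64 operations per four-byte float read, in line with the "over 60 operations" figure above.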


Technology for a better society 19

• Total performance is the product of

algorithmic and numerical performance

• Your mileage may vary: algorithmic

performance is highly problem

dependent

• Many algorithms have low numerical

performance

• Only able to utilize a fraction of the

capabilities of processors, and often

worse in parallel

• Need to consider both the algorithm and

the architecture for maximum performance

Algorithmic and numerical performance

[Figure: numerical performance versus algorithmic performance]


Technology for a better society 20

Programming GPUs


Technology for a better society 21

• CUDA has the most mature development ecosystem

• Released by NVIDIA in 2007

• Enables programming GPUs using a C-like language

• Essentially C / C++ with some additional syntax for

executing a function in parallel on the GPU

• OpenCL is a very good alternative that also runs on

non-NVIDIA hardware (Intel Xeon Phi, AMD GPUs, CPUs)

• Equivalent to CUDA, but slightly more cumbersome.

• We will use pyopencl later on!

• For high-level development, languages like

OpenACC (pragma based) or C++ AMP (extension to C++) exist

• Typically works well for toy problems, but may not always work so well for complex algorithms

Computing with CUDA


Technology for a better society 22

• We want to add two matrices,

a and b, and store the result in c.

• For best performance, loop through one row at a time

(sequential memory access pattern)

Example: Adding two matrices in CUDA 1/2

Matrix from Wikipedia: Matrix addition

C++ on CPU:

void addFunctionCPU(float* c, float* a, float* b,
                    unsigned int cols, unsigned int rows) {
    for (unsigned int j=0; j<rows; ++j) {
        for (unsigned int i=0; i<cols; ++i) {
            unsigned int k = j*cols + i;
            c[k] = a[k] + b[k];
        }
    }
}


Technology for a better society 23

GPU function (kernel), its index calculations, and the CPU code that calls it:

__global__ void addMatricesKernel(float* c, float* a, float* b,
                                  unsigned int cols, unsigned int rows) {
    //Indexing calculations
    unsigned int global_x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int global_y = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int k = global_y*cols + global_x;

    //Actual addition
    c[k] = a[k] + b[k];
}

void addFunctionCUDA(float* c, float* a, float* b,
                     unsigned int cols, unsigned int rows) {
    dim3 block(8, 8);
    dim3 grid(cols/8, rows/8);
    ... //More code here: Allocate data on GPU, copy CPU data to GPU
    addMatricesKernel<<<grid, block>>>(gpu_c, gpu_a, gpu_b, cols, rows);
    ... //More code here: Download result from GPU to CPU
}

Example: Adding two matrices in CUDA 2/2

The kernel launch corresponds to an implicit double for loop over all blocks (and over all threads within each block):

for (int blockIdx.x = 0; blockIdx.x < grid.x; ++blockIdx.x) { … }
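Note that the launch above assumes that cols and rows are multiples of 8. A common way to handle arbitrary sizes (a sketch, not part of the original slides) is to round the grid size up and guard the kernel with a bounds check:

__global__ void addMatricesKernel(float* c, float* a, float* b,
                                  unsigned int cols, unsigned int rows) {
    unsigned int global_x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int global_y = blockIdx.y*blockDim.y + threadIdx.y;
    if (global_x < cols && global_y < rows) {  //Skip threads that fall outside the matrix
        unsigned int k = global_y*cols + global_x;
        c[k] = a[k] + b[k];
    }
}

void addFunctionCUDA(float* c, float* a, float* b,
                     unsigned int cols, unsigned int rows) {
    dim3 block(8, 8);
    //Round up so the grid covers all elements even when cols and rows are not multiples of 8
    dim3 grid((cols + block.x - 1)/block.x, (rows + block.y - 1)/block.y);
    ... //Allocate, copy, launch and download as before
}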


Technology for a better society 24

• Two-layered parallelism

• A block consists of threads:

Threads within the same block can

cooperate and communicate

• A grid consists of blocks:

All blocks run independently.

• Blocks and grid can be

1D, 2D, and 3D

• Global synchronization and

communication is only possible

between kernel launches

• Really expensive, and should be

avoided if possible

Grids and blocks in CUDA
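As a minimal sketch of what this means in practice (not from the original slides; block size and names are chosen for the example): threads within a block can cooperate through shared memory and __syncthreads(), while combining results across blocks requires writing per-block results to global memory and either a second kernel launch or a copy to the CPU.

__global__ void partialSums(const float* input, float* blockSums, int n) {
    __shared__ float s[256];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? input[i] : 0.0f;
    __syncthreads();  //Threads within a block can synchronize

    //Shared memory reduction within the block
    for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            s[threadIdx.x] += s[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = s[0];  //One partial result per block
    }
}

//Host side: blocks cannot synchronize with each other, so the partial results
//are combined in a second kernel launch (here assuming numBlocks <= 256):
//partialSums<<<numBlocks, 256>>>(input, blockSums, n);
//partialSums<<<1, 256>>>(blockSums, totalSum, numBlocks);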


Technology for a better society 25

• CUDA and OpenCL have a virtually identical programming/execution model

• The largest difference is that OpenCL requires a bit more code to get started, and

different concepts have different names.

• The major benefit of OpenCL is that it can run on multiple different devices

• Supports Intel CPUs, Intel Xeon Phi, NVIDIA GPUs, AMD GPUs, etc.

• CUDA supports only NVIDIA GPUs.

CUDA versus OpenCL


Technology for a better society 26

CUDA versus OpenCL

CUDA                                 OpenCL
__global__ function                  __kernel function
__device__ function                  no annotation necessary
__constant__ variable declaration    __constant variable declaration
__device__ variable declaration      __global variable declaration
__shared__ variable declaration      __local variable declaration

CUDA                                 OpenCL
SM (Streaming Multiprocessor)        CU (Compute Unit)
Thread                               Work-item
Block                                Work-group
Global memory                        Global memory
Constant memory                      Constant memory
Shared memory                        Local memory
Local memory                         Private memory

CUDA                                 OpenCL
gridDim                              get_num_groups()
blockDim                             get_local_size()
blockIdx                             get_group_id()
threadIdx                            get_local_id()
blockIdx * blockDim + threadIdx      get_global_id()
gridDim * blockDim                   get_global_size()

CUDA                                 OpenCL
__syncthreads()                      barrier()
__threadfence()                      no direct equivalent
__threadfence_block()                mem_fence()
no direct equivalent                 read_mem_fence()
no direct equivalent                 write_mem_fence()

CUDA                                 OpenCL
cudaGetDeviceProperties()            clGetDeviceInfo()
cudaMalloc()                         clCreateBuffer()
cudaMemcpy()                         clEnqueueReadBuffer() / clEnqueueWriteBuffer()
cudaFree()                           clReleaseMemObject()
kernel<<<...>>>()                    clEnqueueNDRangeKernel()


Technology for a better society 27

OpenCL matrix addition: the GPU function (kernel) and the CPU code that calls it:

__kernel void addMatricesKernel(__global float* c, __global float* a,
                                __global float* b, unsigned int cols, unsigned int rows) {
    //Indexing calculations
    unsigned int global_x = get_global_id(0);
    unsigned int global_y = get_global_id(1);
    unsigned int k = global_y*cols + global_x;

    //Actual addition
    c[k] = a[k] + b[k];
}

void addFunctionOpenCL() {
    ... //More code here: Allocate data on GPU, copy CPU data to GPU

    //Set arguments
    clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&gpu_c);
    clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&gpu_a);
    clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&gpu_b);
    clSetKernelArg(ckKernel, 3, sizeof(cl_int), (void*)&cols);
    clSetKernelArg(ckKernel, 4, sizeof(cl_int), (void*)&rows);

    //Launch kernel (2D global/local work sizes, matching the 2D indexing in the kernel)
    clEnqueueNDRangeKernel(queue, ckKernel, 2, NULL, gws, lws, 0, NULL, NULL);

    ... //More code here: Download result from GPU to CPU
}


Technology for a better society 28

Computing π with CUDA


Technology for a better society 29

• There are many ways of estimating Pi. One way is to

estimate the area of a circle.

• Sample random points within one quadrant

• Find the ratio of points inside the circle to the total number of points

• Area of quarter circle: Ac = πr²/4

Area of square: As = r²

• π = 4·Ac/As ≈ 4 · (#points inside) / (#points total)

• Increase accuracy by sampling more points

• Increase speed by using more nodes

• Algorithm:

1. Sample random points within a quadrant

2. Compute distance from point to origin

3. If distance less than r, point is inside circle

4. Estimate π as 4 · (#points inside) / (#points total)

Computing π with CUDA

[Figures: example estimates pi = 3.1345, 3.1305, 3.1597, and 3.14157 for increasing numbers of sample points]

Remember: the algorithm serves only as an example; it's far more efficient to approximate π as 22/7 or 355/113


Technology for a better society 30


float computePi(int n_points) {
    int n_inside = 0;
    for (int i=0; i<n_points; ++i) {
        //Generate coordinate
        float x = generateRandomNumber();
        float y = generateRandomNumber();
        //Compute distance
        float r = sqrt(x*x + y*y);
        //Check if within circle
        if (r < 1.0f) { ++n_inside; }
    }
    //Estimate Pi
    float pi = 4.0f * n_inside / static_cast<float>(n_points);
    return pi;
}

Serial CPU code (C/C++)


Technology for a better society 31

float computePi(int n_points) {
    int n_inside = 0;
    #pragma omp parallel for reduction(+:n_inside)
    for (int i=0; i<n_points; ++i) {
        //Generate coordinate
        float x = generateRandomNumber();
        float y = generateRandomNumber();
        //Compute distance
        float r = sqrt(x*x + y*y);
        //Check if within circle
        if (r <= 1.0f) { ++n_inside; }
    }
    //Estimate Pi
    float pi = 4.0f * n_inside / static_cast<float>(n_points);
    return pi;
}

Parallel CPU code (C/C++ with OpenMP)

reduction(+:n_inside): make sure that every expression involving n_inside modifies the global variable using the + operator.

parallel for: run the for loop in parallel using multiple threads.


Technology for a better society 32

• Parallel: 3.8 seconds @ 100% CPU

• Serial: 30 seconds @ 10% CPU

Performance


Technology for a better society 33

__global__ void computePiKernel1(unsigned int* output) {
    //Generate coordinate
    float x = generateRandomNumber();
    float y = generateRandomNumber();

    //Compute radius
    float r = sqrt(x*x + y*y);

    //Check if within circle
    if (r <= 1.0f) {
        output[blockIdx.x] = 1;
    } else {
        output[blockIdx.x] = 0;
    }
}

Parallel GPU version 1 (CUDA) 1/3

*Random numbers on GPUs can be slightly tricky; see cuRAND for more information


Technology for a better society 34

float computePi(int n_points) {
    dim3 grid = dim3(n_points, 1, 1);
    dim3 block = dim3(1, 1, 1);

    //Allocate data on graphics card for output
    cudaMalloc((void**)&gpu_data, gpu_data_size);

    //Execute function on GPU ("launch the kernel")
    computePiKernel1<<<grid, block>>>(gpu_data);

    //Copy results from GPU to CPU
    cudaMemcpy(&cpu_data[0], gpu_data, gpu_data_size, cudaMemcpyDeviceToHost);

    //Estimate Pi
    for (int i=0; i<cpu_data.size(); ++i) {
        n_inside += cpu_data[i];
    }
    return 4.0f * n_inside / n_points;
}

Parallel GPU version 1 (CUDA) 2/3


Technology for a better society 35

• Unable to run more than 65535

sample points

• Barely faster than single

threaded CPU version for

largest size!

• Kernel launch overhead

appears to dominate runtime

• The fit between algorithm and

architecture is poor:

• 1 thread per block: Utilizes

at most 1/32 of

computational power.

Parallel GPU version 1 (CUDA) 3/3

[Graph: runtime in seconds (logarithmic scale) versus number of sample points for single-threaded CPU (CPU ST), multi-threaded CPU (CPU MT), and GPU version 1]


Technology for a better society 36

• CPU scalar: 1 thread, 1 operand on 1 data element

• CPU SSE/AVX: 1 thread, 1 operand on 2-8 data elements

• GPU Warp: 32 threads, 32 operands on 32 data elements

• Exposed as individual threads

• Actually runs the same instruction

• Divergence implies serialization and masking

GPU Vector Execution Model

[Illustrations: scalar operation, SSE/AVX operation, and warp operation]


Technology for a better society 37

Hardware automatically serializes and masks divergent code flow:

• Execution time is the sum of all branches taken

• Programmer is relieved of fiddling with element masks (which is necessary for SSE/AVX)

• Worst case 1/32 performance

• Important to minimize divergent code flow within warps!

• Move conditionals into data, use min, max, conditional moves.

Serialization and masking
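For example, the inside-circle test from the π example can be written so that the conditional moves into the data rather than into a divergent store (a small sketch; the storeInside helper is only for illustration):

__device__ void storeInside(unsigned int* output, unsigned int k, float r) {
    //Branching version: threads in the same warp may take different paths
    //if (r <= 1.0f) { output[k] = 1; } else { output[k] = 0; }

    //Branchless version: the comparison result becomes the data that is written,
    //which the compiler can typically turn into a conditional select
    output[k] = (r <= 1.0f) ? 1u : 0u;
}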


Technology for a better society 38

Changes from version 1: 32 threads per block, and new output indexing:

__global__ void computePiKernel2(unsigned int* output) {
    //Generate coordinate
    float x = generateRandomNumber();
    float y = generateRandomNumber();

    //Compute radius
    float r = sqrt(x*x + y*y);

    //Check if within circle
    if (r <= 1.0f) {
        output[blockIdx.x*blockDim.x + threadIdx.x] = 1;
    } else {
        output[blockIdx.x*blockDim.x + threadIdx.x] = 0;
    }
}

float computePi(int n_points) {
    dim3 grid = dim3(n_points/32, 1, 1);
    dim3 block = dim3(32, 1, 1);
    …
    //Execute function on GPU ("launch the kernel")
    computePiKernel2<<<grid, block>>>(gpu_data);
    …
}

Parallel GPU version 2 (CUDA) 1/2


Technology for a better society 39

• Unable to run more than

32*65535 sample points

• Works well with 32-wide SIMD

• Able to keep up with multi-

threaded version at maximum

size!

• We perform roughly 16

operations per 4 bytes written (1

int): memory bound kernel!

Optimal is 60 operations!

Parallel GPU version 2 (CUDA) 2/2

[Graph: runtime in seconds (logarithmic scale) versus number of sample points for CPU ST, CPU MT, GPU version 1, and GPU version 2]


Technology for a better society 40

__global__ void computePiKernel3(unsigned int* output, unsigned int seed) {
    __shared__ int inside[32];

    //Generate coordinate
    //Compute radius
    …

    //Check if within circle
    if (r <= 1.0f) {
        inside[threadIdx.x] = 1;
    } else {
        inside[threadIdx.x] = 0;
    }

    … //Use shared memory reduction to find number of inside per block

Parallel GPU version 3 (CUDA) 1/4

Shared memory: a kind of “programmable cache”

We have 32 threads: One entry per thread


Technology for a better society 41

… //Continued from previous slide

    //Use shared memory reduction to find number of inside per block
    //Remember: 32 threads is one warp, which executes synchronously
    if (threadIdx.x < 16) {
        inside[threadIdx.x] = inside[threadIdx.x] + inside[threadIdx.x+16];
        inside[threadIdx.x] = inside[threadIdx.x] + inside[threadIdx.x+8];
        inside[threadIdx.x] = inside[threadIdx.x] + inside[threadIdx.x+4];
        inside[threadIdx.x] = inside[threadIdx.x] + inside[threadIdx.x+2];
        inside[threadIdx.x] = inside[threadIdx.x] + inside[threadIdx.x+1];
    }

    if (threadIdx.x == 0) {
        output[blockIdx.x] = inside[0];
    }
}

Parallel GPU version 3 (CUDA) 2/4


Technology for a better society 42

• Shared memory is a kind of programmable

cache

• Fast to access (just slightly slower

than registers)

• Programmer's responsibility to move

data into shared memory

• All threads in one block can see the

same shared memory

• Often used for communication

between threads

• Sum all elements in shared memory using

shared memory reduction

Parallel GPU version 3 (CUDA) 3/4


Technology for a better society 43

• Memory bandwidth use reduced

by factor 32!

• Good speed-up over

multithreaded CPU!

• Maximum size is still limited to

65535*32.

• Two ways of increasing size:

• Increase number of threads

• Make each thread do more

work

Parallel GPU version 3 (CUDA) 4/4

[Graph: runtime in seconds (logarithmic scale) versus number of sample points for CPU ST, CPU MT, and GPU versions 1–3]


Technology for a better society 44

__global__ void computePiKernel4(unsigned int* output) {
    int n_inside = 0;

    //Shared memory: All threads can access this
    __shared__ int inside[32];
    inside[threadIdx.x] = 0;

    for (unsigned int i=0; i<iters_per_thread; ++i) {
        //Generate coordinate
        //Compute radius
        //Check if within circle
        if (r <= 1.0f) { ++inside[threadIdx.x]; }
    }

    //Communicate with other threads to find sum per block
    //Write out to main GPU memory
}

Parallel GPU version 4 (CUDA) 1/2


Technology for a better society 45

• Overheads appear to dominate runtime up to 10,000,000 points:

• Memory allocation

• Kernel launch

• Memory copy

• Estimated GFLOPS: ~450

Theoretical peak: ~4000

• Things to investigate further:

• Profile-driven development*!

• Check number of threads,

memory access patterns,

instruction stalls, bank conflicts, ...

Parallel GPU version 4 (CUDA) 2/2

[Graph: runtime in seconds (logarithmic scale) versus number of sample points for CPU ST, CPU MT, and GPU versions 1–4]

*See e.g., Brodtkorb, Sætra, Hagen,

GPU Programming Strategies and Trends in GPU Computing, JPDC, 2013


Technology for a better society 46

• Previous slide indicates speedup of

• 100x versus OpenMP version

• 1000x versus single threaded version

• Theoretical performance gap is 10x: why so fast?

• Reasons why the comparison is fair:

• Same generation CPU (Core i7 3930K) and GPU (GTX 780)

• Code available on Github: you can test it yourself!

• Reasons why the comparison is unfair:

• Optimized GPU code, unoptimized CPU code.

• I do not show how much of CPU/GPU resources I actually use (profiling)

• I cheat with the random function (I use a simple linear congruential generator).

Comparing performance


Technology for a better society 47

Summary


Technology for a better society 48

• All current processors are parallel:

• You cannot ignore parallelization and expect high performance

• Serial programs utilize 1% of potential!

• We need to design our algorithms with a specific architecture in mind

• Data parallel, task parallel

• Symmetric multiprocessing, heterogeneous computing

• GPUs can be programmed using many different languages

• CUDA is the most mature

• OpenCL is portable across hardware platforms

• Need to consider the hardware

• Even for "simple" data-parallel workloads such as computing π

Summary part 1a