
Page 1:

Accelerate your Application with Kepler

Peter Messmer

Page 2:

Goals

How to analyze/optimize an existing application for GPUs

What to consider when designing new GPU applications

How to optimize performance with Kepler / CUDA 5

Page 3:

GPU Acceleration

Pages 4-5:

Here: Focus on Programming Languages

[Figure: applications can be accelerated via Libraries, OpenACC Directives, or Programming Languages; this talk focuses on programming languages]

Page 8:

APOD – A Systematic Path to Performance

Assess

Parallelize

Optimize

Deploy

Page 9:

Starting point: Matrix transpose

for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        out[j][i] = in[i][j];

[Figure: N x N matrix with row index i and column index j]

Page 10:

Matrix transpose on CPU

void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float in[N*N], out[N*N];
transpose(in, out);

Page 11:

An initial CUDA Version

__global__ void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float *in, *out;   // device buffers (allocated with cudaMalloc)
cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
transpose<<<1,1>>>(in, out);

Pages 12-13:

An initial CUDA Version

+ Quickly implemented   - Performance weak

=> Express Parallelism!

__global__ void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float *in, *out;
transpose<<<1,1>>>(in, out);

Page 14:

Recap: Kernel Execution Model

Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel

Thread block: a group of threads
- Executes on a single Streaming Multiprocessor (SM)
- Threads within a block can cooperate: light-weight synchronization, data exchange

Grid: a group of thread blocks
- Thread blocks of a grid execute on multiple SMs
- Communication between blocks is expensive
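To make the thread/block/grid hierarchy concrete, here is a minimal sketch (not taken from the slides; the kernel name and launch sizes are illustrative) of how a kernel typically derives a per-thread global index:

__global__ void scale(float *data, float alpha, int n)
{
    // blockIdx, blockDim and threadIdx are built-in CUDA variables
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be rounded up
        data[i] *= alpha;
}

// Launch enough 256-thread blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);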

Page 15:

First Parallelization: Inner Loop

Process source rows independently

[Figure: thread tid walks along index j, reading from in and writing to out]

__global__ void transpose(float in[], float out[])
{
    int tid = threadIdx.x;
    for (int j = 0; j < N; j++)
        out[tid*N + j] = in[j*N + tid];
}

float *in, *out;
transpose<<<1,N>>>(in, out);

Page 16:

Second Parallelization: One Block per Row

__global__ void transpose(float in[], float out[])
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    out[tid*N + bid] = in[bid*N + tid];
}

float *in, *out;
transpose<<<N,N>>>(in, out);

[Figure: block bid selects one row of in; thread tid within the block handles one element of out]

Page 17:

NVVP - NVIDIA Visual Profiler

- Application analysis
- Kernel properties

Pages 18-20:

Application Assessment with NVVP

[NVVP timeline screenshots]

Pages 21-22:

Source-Level Hot-spot Analysis in NVVP

[NVVP source-level analysis screenshots]

Page 23:

What does Uncoalesced Store mean?

Global memory access happens in transactions of 32 bytes

Coalesced access:
- a group of 32 threads (a "warp") accessing adjacent bytes

Uncoalesced access:
- a group of 32 threads accessing scattered bytes
- results in up to 32 transactions

[Figure: warp lanes 0..31 accessing contiguous vs. scattered addresses]
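A small sketch (not from the slides; names illustrative) contrasting the two cases:

// Adjacent threads read adjacent elements: one warp touches 32
// consecutive floats, so the loads coalesce into few transactions.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Adjacent threads read elements a large stride apart: the warp's 32
// loads hit scattered addresses and become up to 32 transactions.
// (src is assumed to hold at least n * stride elements.)
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i * stride];
}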

Page 24:

Memory Access Patterns

Array access:
- ~OK: x[i] = a[i+1] - a[i]
- Bad: x[i] = a[64*i] - a[i]

SoA vs AoS:
- OK:  point.x[i]
- Bad: point[i].x

Random access:
- Bad: a[rand_fun(i)]

[Figure: warp lanes 0..31 and the memory locations they touch]
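For the SoA vs. AoS point, a minimal sketch (struct and kernel names are illustrative, not from the slides):

struct PointAoS  { float x, y, z; };     // array of structs: point[i].x
struct PointsSoA { float *x, *y, *z; };  // struct of arrays: point.x[i]

// Struct of arrays: a warp touches 32 adjacent floats of p.x.
__global__ void shift_x_soa(PointsSoA p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] += dx;
}

// Array of structs: a warp touches every third float (stride 3).
__global__ void shift_x_aos(PointAoS *p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += dx;
}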

Pages 25-26:

How can we improve the write?

Coalesced read
Scattered write (stride N)

Process a matrix tile, not a single row/column
Transpose the matrix tile within the block

=> Need threads in a block to cooperate
=> Use shared memory

[Figure: a tile of in is read row-wise and written column-wise to out]

Page 27:

Shared memory

- Accessible by all threads in a block
- Fast compared to global memory: low access latency, high bandwidth (almost like registers)
- Common uses:
  - Software-managed cache
  - Data layout conversion

[Figure: each SM has its own registers and shared memory (SMEM); all SMs share global memory (DRAM)]

Pages 28-30:

Transpose with coalesced read/write

__global__ void transpose(float in[], float out[])
{
    __shared__ float tile[TILE][TILE];

    // Index computation is implicit on the slide; a standard choice is:
    int xIndex = blockIdx.x * TILE + threadIdx.x;
    int yIndex = blockIdx.y * TILE + threadIdx.y;
    int glob_in = xIndex + yIndex * N;

    // The output tile sits at the transposed block position:
    xIndex = blockIdx.y * TILE + threadIdx.x;
    yIndex = blockIdx.x * TILE + threadIdx.y;
    int glob_out = xIndex + yIndex * N;

    tile[threadIdx.y][threadIdx.x] = in[glob_in];    // coalesced read
    __syncthreads();
    out[glob_out] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

dim3 grid(N/TILE, N/TILE, 1);
dim3 threads(TILE, TILE, 1);
transpose<<<grid, threads>>>(in, out);

Pages 31-33:

What happens at the barrier? Thread serialization

- Synchronization in the kernel:

    tile[y][x] = in[glob_in];
    __syncthreads();
    out[glob_out] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum
- Use more thread blocks, but:
  - # blocks per SM is limited by # threads/block

Solution: Reduce the number of threads per block

[Figure: thread blocks mapped onto matrix tiles]

Pages 34-35:

Impact of Reduced Serialization

[NVVP screenshots]

Page 36:

Shared Memory Organization

- Organized in 32 independent banks
- Optimal access: disjoint banks or multicast
- Multiple accesses to the same bank: serialization
- Solution for the transpose: padding (see the sketch below)

    tile[16][16]  =>  tile[16][17]

[Figure: any 1:1 thread-to-bank mapping or a multicast is conflict-free; accesses hitting the same bank serialize]
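A minimal sketch of the padding fix in context (TILE assumed to be 16 as on the slide; the index code follows the coalesced version shown earlier):

#define TILE 16

__global__ void transpose_padded(float in[], float out[])
{
    // One extra column: consecutive rows start in different banks, so
    // the column-wise read tile[threadIdx.x][threadIdx.y] is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[x + y * N];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[x + y * N] = tile[threadIdx.x][threadIdx.y];
}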

Pages 37-38:

Final Solution

[NVVP screenshots of the final kernel]

Page 39:

APOD Cycle Summary

- Assessment: algorithm highly parallel
- Parallelization: 1 thread per column       12 GB/s
- Parallelization: 1 thread per element      99 GB/s
- Optimization: memory access coalescing     93 GB/s
- Optimization: latency hiding              124 GB/s
- Optimization: bank conflict resolution    170 GB/s

=> Ready for Deployment

APOD applied at each Parallelize/Optimize step

Page 40:

Additional Metrics

Page 41:

Control Flow

if ( ... )
{
    // then-clause
}
else
{
    // else-clause
}

[Figure: instructions issued over time for the two clauses]

Page 42:

Execution within warps is coherent

[Figure: instructions over time for two warps ("vectors" of 32 threads, lanes 0..31 and 32..63); each warp issues every instruction for all of its threads together]

Pages 43-44:

Execution diverges within a warp

[Figure: instructions over time; when threads of one warp take different branches, the two paths are executed one after the other]

Solution: Group threads with similar control flow (see the sketch below)
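An illustrative sketch (not from the slides) of the two situations, assuming blockDim.x is a multiple of 32:

// Diverges: even and odd lanes of the same warp take different paths,
// so the warp executes both branches serially.
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;
}

// Coherent: the condition is uniform within each warp (i/32 is the
// warp index), so every warp takes exactly one path.
__global__ void uniform(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;
    else                   x[i] += 1.0f;
}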

Page 45:

Occupancy

Need independent threads per SM to hide latencies
- Memory access
- Instruction

Hardware resources determine the maximum number of threads/thread blocks per SM

Consumed resources determine the actual number

Occupancy = N_actual / N_max

Page 46:

Occupancy

Limiting resources:
- Number of threads
- Number of registers per thread
- Number of blocks
- Amount of shared memory per block

No need for 100% occupancy
- Depends on the kernel

(A worked example follows below.)
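An illustrative worked example, assuming Kepler-class limits of 2048 resident threads per SM (the numbers are for illustration, not from the slides):

// A kernel launched with 256-thread blocks whose register/shared-memory
// use allows only 4 resident blocks per SM:
int   threads_per_block = 256;
int   blocks_per_sm     = 4;        // limited by consumed resources
int   max_threads_sm    = 2048;     // hardware limit on this SM
float occupancy = (float)(blocks_per_sm * threads_per_block)
                  / max_threads_sm; // 1024 / 2048 = 0.5, i.e. 50%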

Page 47:

Occupancy Calculator

Analyze the effect of resource consumption on occupancy

[Screenshot of the CUDA Occupancy Calculator]

Page 48:

Alternatives to NVVP: nvprof

- Command-line profiler
- Access to hardware counters
- List of supported counters: --query-events

% nvprof --print-gpu-trace ./transpose
Profiling result:
     Start  Duration  Grid Size  Block Size  Regs*  Size    Throughput  Name
  577.11ms  874.57us  -          -           -      4.19MB  4.80GB/s    [CUDA memcpy HtoD]
  598.45ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,
  600.12ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,
  601.79ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,

% nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose
Profiling result:
  Device  Event Name       Kernel                      Values
  0       sm_cta_launched  transposeNaive(float*, ..)  76 73 72 72 73 74 75 73 73 72 73 73 72 73

Page 49:

Alternatives to NVVP: nvprof

[nvprof output screenshot]

Page 50:

Alternatives to NVVP: Instrumentation

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float time;

cudaEventRecord(start, 0);
transpose<<<grid, threads>>>(..);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

[Figure: timeline with the start and stop events bracketing the transpose kernel; cudaEventSynchronize blocks the host until stop has been recorded]

Pages 51-52:

Characteristics of an Ideal GPU Candidate

Sufficient parallelism
- K20X: up to 28'672 threads in flight
- Rather >10'000-way than ~10-way

Memory access patterns
- Ideally close to stride-1 access possible

Control-flow patterns
- Low divergence, at least within groups of threads

Pages 53-60:

Streams

Kernel launches within the same stream execute in-order

[Figure: the host app submits Grid 1, Grid 2, Grid 3 to Stream 0; the Grid Management Unit dispatches them to the SMs one after another]

Pages 61-64:

Streams

Kernel launches
- within the same stream are in-order
- in different streams can be concurrent

All kernel launches are asynchronous to the host (a minimal stream example follows)

[Figure: Grids 1-3 queued in Stream 1 and Grids 5-7 in Stream 2; the Grid Management Unit can run grids from different streams concurrently on the SMs]
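A minimal sketch (kernel names, launch sizes and buffers are illustrative) of issuing work into two streams:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Same stream: kernelA then kernelB, strictly in order.
kernelA<<<grid, block, 0, s1>>>(d_a);
kernelB<<<grid, block, 0, s1>>>(d_a);

// Different stream: may run concurrently with the work in s1.
kernelC<<<grid, block, 0, s2>>>(d_b);

cudaDeviceSynchronize();    // the launches above returned immediately
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);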

Pages 65-71:

Asynchronous Data Transfer / Pipelining

[Figure: CPU and GPU timelines; issuing copies and kernels in different streams lets transfers of one chunk overlap with compute on another, instead of copy and compute running strictly one after the other]

cudaStream_t stream1, stream2;

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
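A sketch of the pipelining pattern the timeline illustrates (chunk count, kernel name and buffers are assumptions; h_in/h_out must be pinned with cudaMallocHost for the copies to be truly asynchronous):

const int NCHUNK = 4;
cudaStream_t s[NCHUNK];
for (int c = 0; c < NCHUNK; c++) cudaStreamCreate(&s[c]);

int chunk = n / NCHUNK;   // assume n divisible by NCHUNK (and chunk by 256)
for (int c = 0; c < NCHUNK; c++) {
    int off = c * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[c]);
    process<<<chunk / 256, 256, 0, s[c]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[c]);
}
cudaDeviceSynchronize();  // wait for all chunks to finish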

Page 72:

Hyper-Q Enables Efficient Scheduling

Grid management unit can select most appropriate grid from 32 streams

Improves scheduling of concurrently executed grids

Particularly interesting for MPI applications

Pages 73-76:

Strong Scaling of MPI Application

[Figure: per-rank work for N = 1, 2, 4, 8 ranks, multicore CPU only; each bar is split into a GPU-parallelizable part, a CPU-parallel part and a serial part]

Page 77:

GPU Accelerated MPI Application

[Figure: the same breakdown (GPU-parallelizable part, CPU-parallel part, serial part) comparing multicore CPU only at N = 1, 2, 4, 8 with a GPU-accelerated CPU at N = 1]

Pages 78-79:

GPU Accelerated Strong Scaling

[Figure: strong scaling at N = 1, 2, 4, 8 for multicore CPU only vs. GPU-accelerated CPU, with and without Hyper-Q/Proxy (available in K20)]

Page 80:

Example: Hyper-Q/Proxy for CP2K

Page 81:

How to use Hyper-Q

- No application modifications necessary
- Proxy process between user processes and the GPU
- nvidia-proxy-server-control -d

Page 82:

Don't Forget Large-Scale Behavior

Profile in a realistic environment: get the profile at scale
- Tau, Scalasca, VampirTrace+Vampir, CrayPat, ..

Fix messaging problems first! GPUs will accelerate your compute and amplify messaging problems
- Will also help CPU-only code

[Figure: per-rank timeline at Nrank = 384, split into compute and wasted (waiting) time]

Pages 83-84:

CUDA Dynamic Parallelism

[Figure: GPU as co-processor (the CPU launches every kernel) vs. autonomous, dynamic parallelism (the GPU launches kernels itself)]

Pages 85-87:

Dynamic Work Generation

Initial grid

- Statically assign a conservative worst-case grid (fixed grid)
- Dynamically assign performance where accuracy is required (dynamic grid)

[Figure: initial grid, fixed worst-case grid, and dynamically refined grid]

Page 88:

CUDA Dynamic Parallelism

- A kernel can launch grids
- Identical syntax as on the host
- CUDA runtime functions live in the cudadevrt library

__global__ void childKernel()
{
    printf("Hello %d", threadIdx.x);
}

__global__ void parentKernel()
{
    childKernel<<<1,10>>>();
    cudaDeviceSynchronize();
    printf("World!\n");
}

int main(int argc, char *argv[])
{
    parentKernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
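One way to build this example (the file name is assumed; the separate-compilation slides later show the equivalent two-step -dc / -dlink flow):

nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar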

Page 89:

Characteristics of an Ideal GPU Candidate

Sufficient parallelism
- K20X: up to 28'672 threads in flight
- Rather >10'000-way than ~10-way

Memory access patterns
- Ideally close to stride-1 access possible

Control-flow patterns
- Low divergence, at least within groups of threads

Concurrent grids

Page 91:

Before CUDA 5: Whole-Program Compilation

Earlier CUDA releases required a single source file for each kernel; linking with external code was not supported

[Figure: a.cu, b.cu, c.cu and main.cpp #included together to build program.exe]

Page 92:

CUDA 5: Separate Compilation & Linking

Separate compilation allows building independent object files

CUDA 5 can link multiple object files into one program

[Figure: a.cu, b.cu, c.cu compiled separately to a.o, b.o, c.o, then linked with main.cpp into program.exe]

Page 93:

Benefits of Separate Compilation & Linking

Easier to reuse your existing code
- No need to include all files together any more
- "extern" attribute is respected (see the sketch below)

Incremental compilation reduces build time
- e.g. a 47,000-line single file: 50 s down to 4 s

Use 3rd-party GPU-callable libraries or create your own
- GPU-callable BLAS library (libcublas_device.a) included in CUDA Toolkit 5.0; uses Dynamic Parallelism
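A minimal sketch of what this enables (file and function names are illustrative): a __device__ function defined in one .cu file and called from a kernel compiled in another, resolved at device-link time.

// util.cu -- defines a device function
__device__ float square(float x) { return x * x; }

// main.cu -- declares and uses it without seeing the definition
extern __device__ float square(float x);

__global__ void apply(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

// Built with the device-linker flow shown a few slides below:
//   nvcc -arch=sm_20 -dc util.cu main.cu
//   nvcc -arch=sm_20 -dlink util.o main.o -o link.o
//   g++  util.o main.o link.o -L<path> -lcudart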

Page 94:

CUDA 5: GPU Callable Libraries

Can combine object files into static libraries

Link and externally call device code

[Figure: a.cu and b.cu compiled to a.o and b.o, archived into ab.a, then linked with main.cpp and foo.cu into program.exe]

Page 95:

CUDA 5: GPU Callable Libraries

Combine object files into static libraries

Facilitates code reuse, reduces compile time (a build sketch follows below)

[Figure: the same ab.a archive linked into program.exe (with main.cpp and foo.cu) and into program2.exe (with main2.cpp and bar.cu)]
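A sketch of one possible build flow for such a library (paths, architecture and the -lib archiving step are assumptions, not taken from the slides):

nvcc -arch=sm_20 -dc a.cu b.cu                     # relocatable device code
nvcc -arch=sm_20 -lib a.o b.o -o ab.a              # archive into a static library
nvcc -arch=sm_20 -rdc=true foo.cu  main.cpp  ab.a -o program.exe
nvcc -arch=sm_20 -rdc=true bar.cu  main2.cpp ab.a -o program2.exe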

Page 96:

CUDA 5: Callbacks

Enables closed-source device libraries to call user-defined device callback functions

[Figure: vendor.a linked together with callback.cu, foo.cu and main.cpp into program.exe]

Pages 97-98:

Device Linker Invocation

Introduction of an optional link step for device code:

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L<path> -lcudart

Link the device-runtime library for dynamic parallelism:

nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart

Currently, linking occurs at the cubin level (PTX not yet supported)

Page 99:

GPUDirect enables GPU-aware MPI

- GPU-GPU transfer across the NIC, without CPU participation
- Unified Virtual Addressing allows detecting the location of the buffer pointed to

Page 100:

GPUDirect enables GPU-aware MPI

cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

Simplifies to

MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(for CPU and GPU buffers)

Page 101:

GPU Management: nvidia-smi

Multi-GPU systems are widely available; different systems are set up differently

Want to get quick information on:
- Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version

Inspect and modify GPU state

Thu Nov 1 09:10:29 2012

+------------------------------------------------------+

| NVIDIA-SMI 4.304.51 Driver Version: 304.51 |

|-------------------------------+----------------------+----------------------+

| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 Tesla K20X | 0000:03:00.0 Off | Off |

| N/A 30C P8 28W / 235W | 0% 12MB / 6143MB | 0% Default |

+-------------------------------+----------------------+----------------------+

| 1 Tesla K20X | 0000:85:00.0 Off | Off |

| N/A 28C P8 26W / 235W | 0% 12MB / 6143MB | 0% Default |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Compute processes: GPU Memory |

| GPU PID Process name Usage |

|=============================================================================|

| No running compute processes found |

+-----------------------------------------------------------------------------+

Page 103:

Where to find additional information: GTC

Kepler architecture:
- GTC12 session S0642: Inside Kepler

Assessing performance limiters:
- GTC10 session 2012: Analysis-driven Optimization (slides 5-19):
  http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010v2.pdf

Profiling tools:
- GTC12 sessions:
  - S0419: Optimizing Application Performance with CUDA Performance Tools
  - S0420: Nsight IDE for Linux and Mac
  - ...
- CUPTI documentation (describes all the profiler counters)
  Included in every CUDA toolkit (/cuda/extras/cupti/doc/Cupti_Users_Guide.pdf)

GPU computing webinars in general:
- http://developer.nvidia.com/gpu-computing-webinars
- http://www.gputechconf.com/gtcnew/on-demand-gtc.php

Page 104:

Kepler and CUDA5: Powerful yet Easy

Kepler and CUDA5 simplify GPU acceleration

Bypass optimization trial/error with APOD

Profile and analyze efficiently with NVVP

Improve MPI scalability with Hyper-Q/Proxy

Parallelize with CUDA Dynamic Parallelism

Page 105:

Thank you!

Page 106:

Backup Slides