
Page 1:

Accelerate your Application with Kepler

Peter Messmer

Page 2:

Goals

How to analyze/optimize an existing application for GPUs

What to consider when designing new GPU applications

How to optimize performance with Kepler / CUDA 5

Page 3:

GPU Acceleration

Pages 4-5:

Here: Focus on Programming Languages

[Figure: applications can be accelerated via Libraries, OpenACC Directives, or Programming Languages; this talk focuses on programming languages]

Page 8:

APOD – A Systematic Path to Performance

Assess

Parallelize

Optimize

Deploy

Page 9:

Starting point: Matrix transpose

for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        out[j][i] = in[i][j];

[Figure: N x N matrix with row index i and column index j]

Page 10:

Matrix transpose on CPU

void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float in[N*N], out[N*N];
transpose(in, out);

Page 11:

An initial CUDA Version

__global__ void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float *in, *out;   // device buffers (allocated with cudaMalloc)
cudaMemcpy(in, in_host, N*N*sizeof(float), cudaMemcpyHostToDevice);
transpose<<<1,1>>>(in, out);

Pages 12-13:

An initial CUDA Version

+ Quickly implemented   - Performance weak

=> Express Parallelism!

__global__ void transpose(float in[], float out[])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float *in, *out;
transpose<<<1,1>>>(in, out);

Page 14:

Recap: Kernel Execution Model

Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel

Thread block: a group of threads
- Executes on a single Streaming Multiprocessor (SM)
- Threads within a block can cooperate: light-weight synchronization, data exchange

Grid: a group of thread blocks
- Thread blocks of a grid execute on multiple SMs
- Communication between blocks is expensive
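To make the thread/block/grid hierarchy concrete, here is a minimal sketch (not taken from the slides; the kernel name and launch sizes are illustrative) of how a kernel typically derives a per-thread global index:

__global__ void scale(float *data, float alpha, int n)
{
    // blockIdx, blockDim and threadIdx are built-in CUDA variables
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be rounded up
        data[i] *= alpha;
}

// Launch enough 256-thread blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);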

Page 15:

First Parallelization: Inner Loop

Process source rows independently

[Figure: thread tid walks along index j, reading from in and writing to out]

__global__ void transpose(float in[], float out[])
{
    int tid = threadIdx.x;
    for (int j = 0; j < N; j++)
        out[tid*N + j] = in[j*N + tid];
}

float *in, *out;
transpose<<<1,N>>>(in, out);

Page 16:

Second Parallelization: One Block per Row

__global__ void transpose(float in[], float out[])
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    out[tid*N + bid] = in[bid*N + tid];
}

float *in, *out;
transpose<<<N,N>>>(in, out);

[Figure: block bid selects one row of in; thread tid within the block handles one element of out]

Page 17:

NVVP - NVIDIA Visual Profiler

- Application analysis
- Kernel properties

Pages 18-20:

Application Assessment with NVVP

[NVVP timeline screenshots]

Pages 21-22:

Source-Level Hot-spot Analysis in NVVP

[NVVP source-level analysis screenshots]

Page 23:

What does Uncoalesced Store mean?

Global memory access happens in transactions of 32 bytes

Coalesced access:
- a group of 32 threads (a "warp") accessing adjacent bytes

Uncoalesced access:
- a group of 32 threads accessing scattered bytes
- results in up to 32 transactions

[Figure: warp lanes 0..31 accessing contiguous vs. scattered addresses]
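A small sketch (not from the slides; names illustrative) contrasting the two cases:

// Adjacent threads read adjacent elements: one warp touches 32
// consecutive floats, so the loads coalesce into few transactions.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Adjacent threads read elements a large stride apart: the warp's 32
// loads hit scattered addresses and become up to 32 transactions.
// (src is assumed to hold at least n * stride elements.)
__global__ void copy_strided(float *dst, const float *src, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i * stride];
}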

Page 24:

Memory Access Patterns

Array access:
- ~OK: x[i] = a[i+1] - a[i]
- Bad: x[i] = a[64*i] - a[i]

SoA vs AoS:
- OK:  point.x[i]
- Bad: point[i].x

Random access:
- Bad: a[rand_fun(i)]

[Figure: warp lanes 0..31 and the memory locations they touch]
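For the SoA vs. AoS point, a minimal sketch (struct and kernel names are illustrative, not from the slides):

struct PointAoS  { float x, y, z; };     // array of structs: point[i].x
struct PointsSoA { float *x, *y, *z; };  // struct of arrays: point.x[i]

// Struct of arrays: a warp touches 32 adjacent floats of p.x.
__global__ void shift_x_soa(PointsSoA p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] += dx;
}

// Array of structs: a warp touches every third float (stride 3).
__global__ void shift_x_aos(PointAoS *p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += dx;
}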

Pages 25-26:

How can we improve the write?

Coalesced read
Scattered write (stride N)

Process a matrix tile, not a single row/column
Transpose the matrix tile within the block

=> Need threads in a block to cooperate
=> Use shared memory

[Figure: a tile of in is read row-wise and written column-wise to out]

Page 27:

Shared memory

- Accessible by all threads in a block
- Fast compared to global memory: low access latency, high bandwidth (almost like registers)
- Common uses:
  - Software-managed cache
  - Data layout conversion

[Figure: each SM has its own registers and shared memory (SMEM); all SMs share global memory (DRAM)]

Pages 28-30:

Transpose with coalesced read/write

__global__ void transpose(float in[], float out[])
{
    __shared__ float tile[TILE][TILE];

    // Index computation is implicit on the slide; a standard choice is:
    int xIndex = blockIdx.x * TILE + threadIdx.x;
    int yIndex = blockIdx.y * TILE + threadIdx.y;
    int glob_in = xIndex + yIndex * N;

    // The output tile sits at the transposed block position:
    xIndex = blockIdx.y * TILE + threadIdx.x;
    yIndex = blockIdx.x * TILE + threadIdx.y;
    int glob_out = xIndex + yIndex * N;

    tile[threadIdx.y][threadIdx.x] = in[glob_in];    // coalesced read
    __syncthreads();
    out[glob_out] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

dim3 grid(N/TILE, N/TILE, 1);
dim3 threads(TILE, TILE, 1);
transpose<<<grid, threads>>>(in, out);

Pages 31-33:

What happens at the barrier? Thread serialization

- Synchronization in the kernel:

    tile[y][x] = in[glob_in];
    __syncthreads();
    out[glob_out] = tile[x][y];

- Keep the number of threads blocked at the barrier to a minimum
- Use more thread blocks, but:
  - # blocks per SM is limited by # threads/block

Solution: Reduce the number of threads per block

[Figure: thread blocks mapped onto matrix tiles]

Pages 34-35:

Impact of Reduced Serialization

[NVVP screenshots]

Page 36:

Shared Memory Organization

- Organized in 32 independent banks
- Optimal access: disjoint banks or multicast
- Multiple accesses to the same bank: serialization
- Solution for the transpose: padding (see the sketch below)

    tile[16][16]  =>  tile[16][17]

[Figure: any 1:1 thread-to-bank mapping or a multicast is conflict-free; accesses hitting the same bank serialize]
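A minimal sketch of the padding fix in context (TILE assumed to be 16 as on the slide; the index code follows the coalesced version shown earlier):

#define TILE 16

__global__ void transpose_padded(float in[], float out[])
{
    // One extra column: consecutive rows start in different banks, so
    // the column-wise read tile[threadIdx.x][threadIdx.y] is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[x + y * N];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[x + y * N] = tile[threadIdx.x][threadIdx.y];
}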

Pages 37-38:

Final Solution

[NVVP screenshots of the final kernel]

Page 39:

APOD Cycle Summary

- Assessment: algorithm highly parallel
- Parallelization: 1 thread per column       12 GB/s
- Parallelization: 1 thread per element      99 GB/s
- Optimization: memory access coalescing     93 GB/s
- Optimization: latency hiding              124 GB/s
- Optimization: bank conflict resolution    170 GB/s

=> Ready for Deployment

APOD applied at each Parallelize/Optimize step

Page 40:

Additional Metrics

Page 41:

Control Flow

if ( ... )
{
    // then-clause
}
else
{
    // else-clause
}

[Figure: instructions issued over time for the two clauses]

Page 42:

Execution within warps is coherent

[Figure: instructions over time for two warps ("vectors" of 32 threads, lanes 0..31 and 32..63); each warp issues every instruction for all of its threads together]

Pages 43-44:

Execution diverges within a warp

[Figure: instructions over time; when threads of one warp take different branches, the two paths are executed one after the other]

Solution: Group threads with similar control flow (see the sketch below)
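An illustrative sketch (not from the slides) of the two situations, assuming blockDim.x is a multiple of 32:

// Diverges: even and odd lanes of the same warp take different paths,
// so the warp executes both branches serially.
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;
}

// Coherent: the condition is uniform within each warp (i/32 is the
// warp index), so every warp takes exactly one path.
__global__ void uniform(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;
    else                   x[i] += 1.0f;
}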

Page 45:

Occupancy

Need independent threads per SM to hide latencies
- Memory access
- Instruction

Hardware resources determine the maximum number of threads/thread blocks per SM

Consumed resources determine the actual number

Occupancy = N_actual / N_max

Page 46:

Occupancy

Limiting resources:
- Number of threads
- Number of registers per thread
- Number of blocks
- Amount of shared memory per block

No need for 100% occupancy
- Depends on the kernel

(A worked example follows below.)
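An illustrative worked example, assuming Kepler-class limits of 2048 resident threads per SM (the numbers are for illustration, not from the slides):

// A kernel launched with 256-thread blocks whose register/shared-memory
// use allows only 4 resident blocks per SM:
int   threads_per_block = 256;
int   blocks_per_sm     = 4;        // limited by consumed resources
int   max_threads_sm    = 2048;     // hardware limit on this SM
float occupancy = (float)(blocks_per_sm * threads_per_block)
                  / max_threads_sm; // 1024 / 2048 = 0.5, i.e. 50%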

Page 47:

Occupancy Calculator

Analyze the effect of resource consumption on occupancy

[Screenshot of the CUDA Occupancy Calculator]

Page 48:

Alternatives to NVVP: nvprof

- Command-line profiler
- Access to hardware counters
- List of supported counters: --query-events

% nvprof --print-gpu-trace ./transpose
Profiling result:
     Start  Duration  Grid Size  Block Size  Regs*  Size    Throughput  Name
  577.11ms  874.57us  -          -           -      4.19MB  4.80GB/s    [CUDA memcpy HtoD]
  598.45ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,
  600.12ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,
  601.79ms  1.67ms    (1 1 1)    (1024 1 1)  22     -       -           transposeNaive(float*,

% nvprof --print-gpu-trace --aggregate-mode-off --events sm_cta_launched ./transpose
Profiling result:
  Device  Event Name       Kernel                      Values
  0       sm_cta_launched  transposeNaive(float*, ..)  76 73 72 72 73 74 75 73 73 72 73 73 72 73

Page 49:

Alternatives to NVVP: nvprof

[nvprof output screenshot]

Page 50:

Alternatives to NVVP: Instrumentation

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float time;

cudaEventRecord(start, 0);
transpose<<<grid, threads>>>(..);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

[Figure: timeline with the start and stop events bracketing the transpose kernel; cudaEventSynchronize blocks the host until stop has been recorded]

Pages 51-52:

Characteristics of an Ideal GPU Candidate

Sufficient parallelism
- K20X: up to 28'672 threads in flight
- Rather >10'000-way than ~10-way

Memory access patterns
- Ideally close to stride-1 access possible

Control-flow patterns
- Low divergence, at least within groups of threads

Pages 53-60:

Streams

Kernel launches within the same stream execute in-order

[Figure: the host app submits Grid 1, Grid 2, Grid 3 to Stream 0; the Grid Management Unit dispatches them to the SMs one after another]

Pages 61-64:

Streams

Kernel launches
- within the same stream are in-order
- in different streams can be concurrent

All kernel launches are asynchronous to the host (a minimal stream example follows)

[Figure: Grids 1-3 queued in Stream 1 and Grids 5-7 in Stream 2; the Grid Management Unit can run grids from different streams concurrently on the SMs]
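A minimal sketch (kernel names, launch sizes and buffers are illustrative) of issuing work into two streams:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Same stream: kernelA then kernelB, strictly in order.
kernelA<<<grid, block, 0, s1>>>(d_a);
kernelB<<<grid, block, 0, s1>>>(d_a);

// Different stream: may run concurrently with the work in s1.
kernelC<<<grid, block, 0, s2>>>(d_b);

cudaDeviceSynchronize();    // the launches above returned immediately
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);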

Pages 65-71:

Asynchronous Data Transfer / Pipelining

[Figure: CPU and GPU timelines; issuing copies and kernels in different streams lets transfers of one chunk overlap with compute on another, instead of copy and compute running strictly one after the other]

cudaStream_t stream1, stream2;

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
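A sketch of the pipelining pattern the timeline illustrates (chunk count, kernel name and buffers are assumptions; h_in/h_out must be pinned with cudaMallocHost for the copies to be truly asynchronous):

const int NCHUNK = 4;
cudaStream_t s[NCHUNK];
for (int c = 0; c < NCHUNK; c++) cudaStreamCreate(&s[c]);

int chunk = n / NCHUNK;   // assume n divisible by NCHUNK (and chunk by 256)
for (int c = 0; c < NCHUNK; c++) {
    int off = c * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[c]);
    process<<<chunk / 256, 256, 0, s[c]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[c]);
}
cudaDeviceSynchronize();  // wait for all chunks to finish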

Page 72:

Hyper-Q Enables Efficient Scheduling

Grid management unit can select most appropriate grid from 32 streams

Improves scheduling of concurrently executed grids

Particularly interesting for MPI applications

Pages 73-76:

Strong Scaling of MPI Application

[Figure: per-rank work for N = 1, 2, 4, 8 ranks, multicore CPU only; each bar is split into a GPU-parallelizable part, a CPU-parallel part and a serial part]

Page 77:

GPU Accelerated MPI Application

[Figure: the same breakdown (GPU-parallelizable part, CPU-parallel part, serial part) comparing multicore CPU only at N = 1, 2, 4, 8 with a GPU-accelerated CPU at N = 1]

Pages 78-79:

GPU Accelerated Strong Scaling

[Figure: strong scaling at N = 1, 2, 4, 8 for multicore CPU only vs. GPU-accelerated CPU, with and without Hyper-Q/Proxy (available in K20)]

Page 80:

Example: Hyper-Q/Proxy for CP2K

Page 81:

How to use Hyper-Q

- No application modifications necessary
- Proxy process between user processes and the GPU
- nvidia-proxy-server-control -d

Page 82:

Don't Forget Large-Scale Behavior

Profile in a realistic environment: get the profile at scale
- Tau, Scalasca, VampirTrace+Vampir, CrayPat, ..

Fix messaging problems first! GPUs will accelerate your compute and amplify messaging problems
- Will also help CPU-only code

[Figure: per-rank timeline at Nrank = 384, split into compute and wasted (waiting) time]

Pages 83-84:

CUDA Dynamic Parallelism

[Figure: GPU as co-processor (the CPU launches every kernel) vs. autonomous, dynamic parallelism (the GPU launches kernels itself)]

Pages 85-87:

Dynamic Work Generation

Initial grid

- Statically assign a conservative worst-case grid (fixed grid)
- Dynamically assign performance where accuracy is required (dynamic grid)

[Figure: initial grid, fixed worst-case grid, and dynamically refined grid]

Page 88:

CUDA Dynamic Parallelism

- A kernel can launch grids
- Identical syntax as on the host
- CUDA runtime functions live in the cudadevrt library

__global__ void childKernel()
{
    printf("Hello %d", threadIdx.x);
}

__global__ void parentKernel()
{
    childKernel<<<1,10>>>();
    cudaDeviceSynchronize();
    printf("World!\n");
}

int main(int argc, char *argv[])
{
    parentKernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
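One way to build this example (the file name is assumed; the separate-compilation slides later show the equivalent two-step -dc / -dlink flow):

nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar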

Page 89:

Characteristics of an Ideal GPU Candidate

Sufficient parallelism
- K20X: up to 28'672 threads in flight
- Rather >10'000-way than ~10-way

Memory access patterns
- Ideally close to stride-1 access possible

Control-flow patterns
- Low divergence, at least within groups of threads

Concurrent grids

Page 91:

Before CUDA 5: Whole-Program Compilation

Earlier CUDA releases required a single source file for each kernel; linking with external code was not supported

[Figure: a.cu, b.cu, c.cu and main.cpp #included together to build program.exe]

Page 92:

CUDA 5: Separate Compilation & Linking

Separate compilation allows building independent object files

CUDA 5 can link multiple object files into one program

[Figure: a.cu, b.cu, c.cu compiled separately to a.o, b.o, c.o, then linked with main.cpp into program.exe]

Page 93:

Benefits of Separate Compilation & Linking

Easier to reuse your existing code
- No need to include all files together any more
- "extern" attribute is respected (see the sketch below)

Incremental compilation reduces build time
- e.g. a 47,000-line single file: 50 s down to 4 s

Use 3rd-party GPU-callable libraries or create your own
- GPU-callable BLAS library (libcublas_device.a) included in CUDA Toolkit 5.0; uses Dynamic Parallelism
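A minimal sketch of what this enables (file and function names are illustrative): a __device__ function defined in one .cu file and called from a kernel compiled in another, resolved at device-link time.

// util.cu -- defines a device function
__device__ float square(float x) { return x * x; }

// main.cu -- declares and uses it without seeing the definition
extern __device__ float square(float x);

__global__ void apply(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

// Built with the device-linker flow shown a few slides below:
//   nvcc -arch=sm_20 -dc util.cu main.cu
//   nvcc -arch=sm_20 -dlink util.o main.o -o link.o
//   g++  util.o main.o link.o -L<path> -lcudart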

Page 94:

CUDA 5: GPU Callable Libraries

Can combine object files into static libraries

Link and externally call device code

[Figure: a.cu and b.cu compiled to a.o and b.o, archived into ab.a, then linked with main.cpp and foo.cu into program.exe]

Page 95:

CUDA 5: GPU Callable Libraries

Combine object files into static libraries

Facilitates code reuse, reduces compile time (a build sketch follows below)

[Figure: the same ab.a archive linked into program.exe (with main.cpp and foo.cu) and into program2.exe (with main2.cpp and bar.cu)]
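A sketch of one possible build flow for such a library (paths, architecture and the -lib archiving step are assumptions, not taken from the slides):

nvcc -arch=sm_20 -dc a.cu b.cu                     # relocatable device code
nvcc -arch=sm_20 -lib a.o b.o -o ab.a              # archive into a static library
nvcc -arch=sm_20 -rdc=true foo.cu  main.cpp  ab.a -o program.exe
nvcc -arch=sm_20 -rdc=true bar.cu  main2.cpp ab.a -o program2.exe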

Page 96:

CUDA 5: Callbacks

Enables closed-source device libraries to call user-defined device callback functions

[Figure: vendor.a linked together with callback.cu, foo.cu and main.cpp into program.exe]

Pages 97-98:

Device Linker Invocation

Introduction of an optional link step for device code:

nvcc -arch=sm_20 -dc a.cu b.cu
nvcc -arch=sm_20 -dlink a.o b.o -o link.o
g++ a.o b.o link.o -L<path> -lcudart

Link the device-runtime library for dynamic parallelism:

nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
g++ a.o b.o link.o -L<path> -lcudadevrt -lcudart

Currently, linking occurs at the cubin level (PTX not yet supported)

Page 99:

GPUDirect enables GPU-aware MPI

- GPU-GPU transfer across the NIC, without CPU participation
- Unified Virtual Addressing allows detecting the location of the buffer pointed to

Page 100:

GPUDirect enables GPU-aware MPI

cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

Simplifies to

MPI_Send(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
MPI_Recv(r_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(for CPU and GPU buffers)

Page 101:

GPU Management: nvidia-smi

Multi-GPU systems are widely available; different systems are set up differently

Want to get quick information on:
- Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version

Inspect and modify GPU state

Thu Nov 1 09:10:29 2012

+------------------------------------------------------+

| NVIDIA-SMI 4.304.51 Driver Version: 304.51 |

|-------------------------------+----------------------+----------------------+

| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 Tesla K20X | 0000:03:00.0 Off | Off |

| N/A 30C P8 28W / 235W | 0% 12MB / 6143MB | 0% Default |

+-------------------------------+----------------------+----------------------+

| 1 Tesla K20X | 0000:85:00.0 Off | Off |

| N/A 28C P8 26W / 235W | 0% 12MB / 6143MB | 0% Default |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Compute processes: GPU Memory |

| GPU PID Process name Usage |

|=============================================================================|

| No running compute processes found |

+-----------------------------------------------------------------------------+

Page 103:

Where to find additional information: GTC

Kepler architecture:
- GTC12 session S0642: Inside Kepler

Assessing performance limiters:
- GTC10 session 2012: Analysis-driven Optimization (slides 5-19):
  http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010v2.pdf

Profiling tools:
- GTC12 sessions:
  - S0419: Optimizing Application Performance with CUDA Performance Tools
  - S0420: Nsight IDE for Linux and Mac
  - ...
- CUPTI documentation (describes all the profiler counters)
  Included in every CUDA toolkit (/cuda/extras/cupti/doc/Cupti_Users_Guide.pdf)

GPU computing webinars in general:
- http://developer.nvidia.com/gpu-computing-webinars
- http://www.gputechconf.com/gtcnew/on-demand-gtc.php

Page 104:

Kepler and CUDA5: Powerful yet Easy

Kepler and CUDA5 simplify GPU acceleration

Bypass optimization trial/error with APOD

Profile and analyze efficiently with NVVP

Improve MPI scalability with Hyper-Q/Proxy

Parallelize with CUDA Dynamic Parallelism

Page 105:

Thank you!

Page 106:

Backup Slides