CUDA 5.5 Release Candidate

June 25, 2013

© 2013 NVIDIA

Linux RPM and Debian Packages

SIMPLIFIED INSTALLATION AND UPGRADES!

Use native package installers

apt-get, yum, zypper

Everything: apt-get install cuda

One package: apt-get install cuda-documentation

Updates: apt-get upgrade

RPM and Debian Packaging Features

Side-by-side installations: apt-get install cuda-5-5 cuda-6-0 (when available)

Cross-platform development (e.g., 32-bit target on 64-bit OS): apt-get install cuda-cross

Version locking

Update to latest version: apt-get install cuda

Update to latest 5.5 version: apt-get install cuda-5-5

Everything is installed under /usr/local/cuda-5.5/

5.5 Download Page

RPM and Debian packages

One package repository per supported Linux distribution

Except Ubuntu 10.04 and RHEL 5

Ubuntu example:

$ [ download cuda-repo-<distro>_<version>_<architecture>.deb ]

$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb

$ sudo apt-get update

$ sudo apt-get install cuda

[Figure: shared-library CUDART distribution across three stages: 3rd Party Library ISV, Application Developer, End-User System. The CUDA App depends on libcudart.so.5.0, myCUDAPluginA.so on libcudart.so.4.2, and myCUDAPluginB.so on libcudart.so.3.2, so every runtime version must reach the end-user system.]

Complex coordination to ship correct components

Static Library CUDART (CUDA Runtime)


[Figure: with the static CUDART, the CUDA App, myCUDAPluginA.so, and myCUDAPluginB.so are shown without separate libcudart.so files.]

Distribution is made much simpler

Linux, Windows, Mac

Stream Priorities Accelerates Critical Path

[Figure: two timelines of Stream 1 (Kernel A, Kernel B, Kernel C) and Stream 2 (Kernel X), each marking the point where Kernel X is launched. One is labeled "No Priorities"; in the other, Stream 2 is high-priority.]

With priorities: especially useful when Kernel X generates data for MPI_Send()
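A minimal sketch of how the two-stream picture above might be set up with the runtime API. The kernel names, buffer sizes, and the helper run_with_priorities are hypothetical stand-ins for Kernels A, B, C, and X; only cudaDeviceGetStreamPriorityRange and cudaStreamCreateWithPriority are the actual priority calls. On devices without priority support, both range values are expected to come back equal and the code still runs.

```cuda
#include <cuda_runtime.h>

// Hypothetical background kernel (stands in for Kernels A, B, C).
__global__ void background_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical critical-path kernel (stands in for Kernel X).
__global__ void critical_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns 0 on success, 1 on any CUDA error.
// Error paths skip cleanup for brevity.
int run_with_priorities(int n) {
    int least = 0, greatest = 0;
    // Numerically lower values mean higher priority.
    if (cudaDeviceGetStreamPriorityRange(&least, &greatest) != cudaSuccess) return 1;

    cudaStream_t bulk, urgent;
    if (cudaStreamCreateWithPriority(&bulk, cudaStreamNonBlocking, least) != cudaSuccess) return 1;
    if (cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest) != cudaSuccess) return 1;

    float *a = NULL, *x = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc((void **)&a, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&x, bytes) != cudaSuccess) return 1;
    cudaMemset(a, 0, bytes);
    cudaMemset(x, 0, bytes);
    // Non-blocking streams do not wait on the default stream, so finish the memsets first.
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Stream 1: a chain of background kernels.
    for (int k = 0; k < 3; ++k)
        background_kernel<<<blocks, threads, 0, bulk>>>(a, n);

    // Stream 2 (high priority): the critical kernel, launched after the others.
    critical_kernel<<<blocks, threads, 0, urgent>>>(x, n);

    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(x);
    cudaStreamDestroy(bulk);
    cudaStreamDestroy(urgent);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling run_with_priorities(1 << 20) from main exercises both streams; a profiler timeline is the usual way to see whether the critical kernel actually started ahead of pending background work.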

CUDA Dynamic Parallelism (CDP)

CPU:

    int main() {
        float *data;
        setup(data);

        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);

        cudaDeviceSynchronize();
        return 0;
    }

GPU:

    __global__ void B(float *data)
    {
        do_stuff(data);

        X <<< ... >>> (data);
        Y <<< ... >>> (data);
        Z <<< ... >>> (data);
        cudaDeviceSynchronize();

        do_more_stuff(data);
    }

[Figure: main launches A, B, and C from the CPU; B launches X, Y, and Z on the GPU.]

CUDA 5.5 Scheduling Optimizations vs 5.0

~2.5X lower child-child latency

~3.5X lower completion latency

SW scheduler parallelized

Primarily benefits smaller grids and long dependent chains

~50% speedup vs. 5.0 for a chain of ten 5 μs grids

[Figure: GPU utilization (%) over time, without Hyper-Q and with Hyper-Q.]

Hyper-Q in CUDA 5.0: Streams

New in CUDA 5.5: Hyper-Q / MPI

Fermi: 1 MPI task at a time

Kepler: 32 simultaneous MPI tasks

Multi-Process Server Required for Hyper-Q / MPI

[Figure: four CUDA MPI ranks (Rank 0 to Rank 3) connected to a single CUDA Server Process, which sits in front of the GPU.]

$ mpirun -np 4 my_cuda_app

No application re-compile to share the GPU

No user configuration needed

Can be preconfigured by SysAdmin

MPI ranks using CUDA are clients

Server spawns on-demand per user

One job per user

No isolation between MPI ranks

Exclusive process mode enforces single server

One GPU per rank

No cudaSetDevice()

Only CUDA device 0 is visible

Strong Scaling of CP2K on Cray XK7

Hyper-Q with multiple MPI ranks leads to 2.5X speedup over single MPI rank using the GPU

Blog post by Peter Messmer of NVIDIA

Multi-user debugging with a single GPU

Nsight Visual Studio Edition (VSE) already supports this

Nsight VSE now bundled in CUDA installer

Now also supported by CUDA-GDB and Nsight Eclipse Edition

BETA feature

SM 3.5 Only

Debugger & GUI on one GPU

Multi-user debug on one GPU

2 ways to opt in

Environment variable: CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1

CUDA-GDB command: set cuda software_preemption on

CUDA-GDB and Nsight EE now also support Dynamic Parallelism debug

Remote Debugging using Nsight EE

Automatic Performance Analysis

New in 5.5: step-by-step optimization guidance

Identifying Candidate Kernels

Analysis system estimates which kernels are best candidates for speedup

Based on execution time and achieved occupancy

Primary Performance Bound

Most likely limiter to performance for a kernel

Memory bandwidth

Compute resources

Instruction and memory latency

Primary bound should be addressed first

Often beneficial to examine secondary bounds as well

GPU per-process accounting statistics

$ nvidia-smi -q -d ACCOUNTING

==============NVSMI LOG==============

Timestamp                           : Tue Apr 16 10:36:59 2013
Driver Version                      : 319.15

Attached GPUs                       : 1
GPU 0000:06:00.0
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 128
    Accounted Processes
        Process ID                  : 28043
            GPU Utilization         : 20 %
            Memory Utilization      : 4 %
            Max memory usage        : 461 MB
            Time                    : 5566 ms
        Process ID                  : 28085
            GPU Utilization         : 99 %
            Memory Utilization      : 100 %
            Max memory usage        : 101 MB
            Time                    : 11888 ms

Requirements

Tesla or Quadro GPU

Kepler

CUDA 5.5

Linux 32 & 64, Windows 64

CUFFT API Enhancements: Extensibility

Existing APIs (still work in 5.5)

    cufftPlanMany(…)
    /* Might recreate plan */
    cufftSetCompatibilityMode(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Each new configuration option may require expensive plan re-creation

New additional APIs in 5.5

    cufftCreate(…)
    cufftSetCompatibilityMode(…)
    cufftMakePlanMany(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Allows new configuration options to be employed without multiple re-plan steps

CUFFT API Enhancements: Memory

Query size of workspace

cufftEstimate1d(…), cufftEstimate2d(…), cufftEstimate3d(…), and cufftEstimateMany(…)

Helps determine if plan fits in GPU memory

Returned size is not exact in 5.5 (see docs for details)

Control scratch workspace

    cufftCreate(…)
    cufftSetAutoAllocate(0)
    /* returns size of work area */
    cufftMakePlanMany(…)
    cufftSetWorkArea(…)
    cufftExecC2C(…)
    cufftDestroy(…)

Share the same workspace across executions of different plans (non-concurrently)
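The workspace-control sequence above could look roughly like the following sketch, in which two batched 1D C2C plans share one scratch buffer sized for the larger of the two. The helper names make_plan_no_alloc and run_shared_workspace, the transform sizes, and the in-place layout are illustrative assumptions. Note that the real cufftSetAutoAllocate takes the plan handle as its first argument, which the slide abbreviates.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: builds a batched 1D C2C plan with auto-allocation
// disabled and reports the work-area size it needs. Returns 0 on success.
static int make_plan_no_alloc(cufftHandle *plan, int n, int batch, size_t *workSize) {
    if (cufftCreate(plan) != CUFFT_SUCCESS) return 1;
    if (cufftSetAutoAllocate(*plan, 0) != CUFFT_SUCCESS) return 1;
    int dims[1] = { n };
    // NULL embed arrays select the default contiguous layout.
    if (cufftMakePlanMany(*plan, 1, dims, NULL, 1, n, NULL, 1, n,
                          CUFFT_C2C, batch, workSize) != CUFFT_SUCCESS) return 1;
    return 0;
}

// Returns 0 on success, 1 on any error. Error paths skip cleanup for brevity.
int run_shared_workspace(int nA, int nB, int batch) {
    cufftHandle planA, planB;
    size_t sizeA = 0, sizeB = 0;
    if (make_plan_no_alloc(&planA, nA, batch, &sizeA)) return 1;
    if (make_plan_no_alloc(&planB, nB, batch, &sizeB)) return 1;

    // One scratch buffer large enough for either plan (at least 1 byte).
    size_t workBytes = sizeA > sizeB ? sizeA : sizeB;
    if (workBytes == 0) workBytes = 1;
    void *work = NULL;
    if (cudaMalloc(&work, workBytes) != cudaSuccess) return 1;
    if (cufftSetWorkArea(planA, work) != CUFFT_SUCCESS) return 1;
    if (cufftSetWorkArea(planB, work) != CUFFT_SUCCESS) return 1;

    cufftComplex *bufA = NULL, *bufB = NULL;
    size_t bytesA = (size_t)nA * batch * sizeof(cufftComplex);
    size_t bytesB = (size_t)nB * batch * sizeof(cufftComplex);
    if (cudaMalloc((void **)&bufA, bytesA) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&bufB, bytesB) != cudaSuccess) return 1;
    cudaMemset(bufA, 0, bytesA);
    cudaMemset(bufB, 0, bytesB);

    // Both transforms run in the default stream, so they execute one after
    // the other and never use the shared work area at the same time.
    int failed = 0;
    if (cufftExecC2C(planA, bufA, bufA, CUFFT_FORWARD) != CUFFT_SUCCESS) failed = 1;
    if (cufftExecC2C(planB, bufB, bufB, CUFFT_FORWARD) != CUFFT_SUCCESS) failed = 1;
    if (cudaDeviceSynchronize() != cudaSuccess) failed = 1;

    cufftDestroy(planA);
    cufftDestroy(planB);
    cudaFree(bufA);
    cudaFree(bufB);
    cudaFree(work);
    return failed;
}
```

A call such as run_shared_workspace(1024, 4096, 8) would be linked against the cuFFT library (for example with -lcufft). Sharing is only safe here because the two executions are serialized; plans bound to different streams would need their own work areas.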

CUFFT API Enhancements: FFTW support

Easily port from FFTW to CUFFT by changing link library

Supports All Combinations of:

Single and Double Precision

C2C, R2C, C2R Transforms

FFTW Basic Interface

FFTW Advanced Interface

FFTW Guru Interface

Does Not Support:

Extended Precision

Real to Real Transforms

“Split” Memory Layout

Distributed Memory with MPI

*guru64* APIs

FFTW compatibility header file helps developers detect when unsupported FFTW APIs are used
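As a rough illustration of the porting path described above, the sketch below is ordinary single-precision FFTW basic-interface code in which only the header has been changed to the compatibility header cufftw.h. The function name fftw_style_dc_check and the constant test signal are hypothetical; host buffers come from plain malloc to keep the example within the most basic part of the interface.

```cuda
#include <cmath>
#include <cstdlib>
#include <cufftw.h>   // used in place of <fftw3.h>

// Runs a forward 1D C2C transform of an all-ones signal of length n and
// checks that the DC bin equals n. Returns 0 on success, 1 otherwise.
int fftw_style_dc_check(int n) {
    fftwf_complex *in  = (fftwf_complex *)malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = (fftwf_complex *)malloc(sizeof(fftwf_complex) * n);
    if (!in || !out) { free(in); free(out); return 1; }

    for (int i = 0; i < n; ++i) {
        in[i][0] = 1.0f;  // real part
        in[i][1] = 0.0f;  // imaginary part
    }

    // Same calls as an FFTW program: plan, execute, destroy.
    fftwf_plan p = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    if (!p) { free(in); free(out); return 1; }
    fftwf_execute(p);

    // The sum of n ones lands in bin 0 of an unnormalized forward transform.
    int ok = fabsf(out[0][0] - (float)n) < 1e-3f * (float)n;

    fftwf_destroy_plan(p);
    free(in);
    free(out);
    return ok ? 0 : 1;
}
```

Building this is assumed to mean linking against the cuFFT libraries instead of FFTW (for example -lcufftw -lcufft), which is the "changing link library" step; code that calls any API from the unsupported list would need to be reworked first.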

CUDA 5.5 enables new platforms

Enable 3rd party toolchains…

Compiler SDK based on LLVM (libnvvm)

Allows 3rd party ports of new languages to GPUs

Enable ARM…

Cross compilation from x86 or native ARM compilation

CUDA development tools support

SECO mITX board for HPC:

OS and Compiler Support Matrix

Columns: 32, 64, Status

Ubuntu-12.04 X X New
Ubuntu-12.10 X X New
Ubuntu-10.04 X X Continued
Ubuntu-11.10 X X Removed
Fedora 18 X New
Fedora 16 X X Removed
RHEL-5.5+ X Continued
RHEL-6.X X Continued
Mac OS X 10.8 X X Continued
Mac OS X 10.7.x X X Continued
OpenSUSE-12.1 X Continued
SLES 11 SP2 X Continued
WinXP X X Continued
Vista/Win7/Win8 X X Continued (includes Win 2008 and 2012 server)
VC 9.0 (VS 2008) X X Continued
VC 10.0 (VS 2010) X X Continued
VC 11.0 (VS 2012) X X New

CUDA 5.5

Linux RPM/DEB installers

Stream Priorities

Static CUDART

Dynamic Parallelism performance improvements

MPS on Linux

Multi-user and remote debugging

New Visual Profiler guided optimization

CUFFT API Enhancements

LLVM based Compiler SDK
