GTC webinar transcript: on-demand.gputechconf.com
CUDA 5.5 Release Candidate
June 25, 2013
© 2013 NVIDIA
Linux RPM and Debian Packages
SIMPLIFIED INSTALLATION AND UPGRADES!
Use native package installers
apt-get, yum, zypper
Everything: apt-get install cuda
One package: apt-get install cuda-documentation
Updates: apt-get upgrade
RPM and Debian Packaging Features
Side-by-side installations: apt-get install cuda-5-5 cuda-6-0 (when available)
Cross-platform development (e.g., 32-bit target on 64-bit OS): apt-get install cuda-cross
Version locking:
  update to latest version: apt-get install cuda
  update to latest 5.5 version: apt-get install cuda-5-5
Everything is installed under /usr/local/cuda-5.5/
5.5 Download Page
RPM and Debian packages
One package repository per supported Linux distribution
(except Ubuntu 10.04 and RHEL 5)
Ubuntu example:
$ [ download cuda-repo-<distro>_<version>_<architecture>.deb ]
$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
$ sudo apt-get update
$ sudo apt-get install cuda
Shared CUDART: 3rd Party Library / ISV Application Developer / End-User System
[Diagram: the developer's CUDA App links libcudart.so.5.0, while third-party plugins myCUDAPluginA.so and myCUDAPluginB.so link libcudart.so.4.2 and libcudart.so.3.2 respectively; the end-user system must receive the app, both plugins, and all three matching libcudart.so versions.]
Complex coordination to ship correct components
Static Library CUDART (CUDA Runtime)
[Diagram: with CUDART linked statically, the developer ships only CUDA App, myCUDAPluginA.so, and myCUDAPluginB.so; no libcudart.so copies need to reach the end-user system.]
Distribution is made much simpler
Linux, Windows, Mac
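Statically linking the runtime is a build-time choice; a minimal sketch using nvcc's runtime-selection flag introduced alongside this feature (application and file names illustrative):

```shell
# Link the CUDA runtime statically: no libcudart.so needs to ship with the app.
nvcc -cudart=static -o myapp myapp.cu
```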
Stream Priorities Accelerate the Critical Path
[Timeline figure: Stream 1 runs Kernels A, B, C; Stream 2 runs Kernel X. With no priorities, Kernel X waits behind work launched earlier; with Stream 2 marked high-priority, Kernel X is scheduled ahead of Kernel C as soon as it is launched.]
With priorities: especially useful when Kernel X generates data for MPI_Send()
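The high-priority stream in the figure can be created with the stream-priority API added in CUDA 5.5; a minimal sketch (error checking omitted):

```cuda
// Create a high-priority stream for the critical-path kernel (Kernel X)
// and a lower-priority stream for the remaining work.
#include <cuda_runtime.h>

void setup_streams(cudaStream_t *hi, cudaStream_t *lo)
{
    int leastPriority, greatestPriority;
    // Query the legal priority range for this device.
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Numerically lower values mean higher priority.
    cudaStreamCreateWithPriority(hi, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(lo, cudaStreamNonBlocking, leastPriority);
}
```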
CUDA Dynamic Parallelism (CDP)
GPU code:
  __global__ void B(float *data)
  {
      do_stuff(data);
      X <<< ... >>> (data);
      Y <<< ... >>> (data);
      Z <<< ... >>> (data);
      cudaDeviceSynchronize();
      do_more_stuff(data);
  }
CPU code:
  int main() {
      float *data;
      setup(data);
      A <<< ... >>> (data);
      B <<< ... >>> (data);
      C <<< ... >>> (data);
      cudaDeviceSynchronize();
      return 0;
  }
[Diagram: main launches kernels A, B, C on the GPU; kernel B itself launches child kernels X, Y, Z.]
CUDA 5.5 Scheduling Optimizations vs 5.0
SW scheduler parallelized
~2.5X lower child-to-child latency
~3.5X lower completion latency; primarily benefits smaller grids
Long dependent chains: ~50% speedup vs. 5.0 for a chain of ten 5 μs grids
[Figure: GPU utilization (%) over time, without Hyper-Q vs. with Hyper-Q; Hyper-Q sustains much higher utilization.]
Hyper-Q in CUDA 5.0: Streams
New in CUDA 5.5: Hyper-Q / MPI
FERMI: 1 MPI task at a time
KEPLER: 32 simultaneous MPI tasks
Multi-Process Server Required for Hyper-Q / MPI
[Diagram: four MPI ranks (0-3), each a CUDA client, funnel work through a single CUDA Server Process onto the GPU.]
$ mpirun -np 4 my_cuda_app
No application re-compile to share the GPU
No user configuration needed; can be preconfigured by a SysAdmin
MPI ranks using CUDA are clients
Server spawns on demand, one per user
One job per user: no isolation between MPI ranks
Exclusive-process mode enforces a single server
One GPU per rank: no cudaSetDevice(), only CUDA device 0 is visible
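The SysAdmin preconfiguration mentioned above can be sketched as follows; the control-daemon name follows NVIDIA's MPS documentation, and the exact binary shipped with 5.5-era packages may differ:

```shell
# Hedged sketch: preconfigure the GPU for Hyper-Q / MPI sharing.
sudo nvidia-smi -c EXCLUSIVE_PROCESS   # enforce a single server per GPU
nvidia-cuda-mps-control -d             # start the MPS control daemon
mpirun -np 4 my_cuda_app               # ranks attach as MPS clients
```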
Strong Scaling of CP2K on Cray XK7
Hyper-Q with multiple MPI ranks leads to a 2.5X speedup over a single MPI rank using the GPU
(Blog post by Peter Messmer of NVIDIA)
Multi-user debugging with a single GPU
Already supported by Nsight Visual Studio Edition (Nsight VSE now bundled in the CUDA installer)
Now also supported by CUDA-GDB & Nsight Eclipse Edition
BETA feature, SM 3.5 only
Debugger & GUI on one GPU; multi-user debug on one GPU
Two ways to opt in:
  CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1
  set cuda software_preemption on
CUDA-GDB and Nsight EE now also support Dynamic Parallelism debugging
Remote Debugging using Nsight EE
Automatic Performance Analysis
NEW in 5.5: step-by-step optimization guidance
Identifying Candidate Kernels
The analysis system estimates which kernels are the best candidates for speedup, based on execution time and achieved occupancy
Primary Performance Bound
The most likely limiter of a kernel's performance:
  Memory bandwidth
  Compute resources
  Instruction and memory latency
The primary bound should be addressed first; it is often beneficial to examine secondary bounds as well
Visual Profiler: Performance Bound
Visual Profiler: Memory Efficiency
GPU per-process accounting statistics
$ nvidia-smi -q -d ACCOUNTING
==============NVSMI LOG==============
Timestamp                       : Tue Apr 16 10:36:59 2013
Driver Version                  : 319.15
Attached GPUs                   : 1
GPU 0000:06:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 128
    Accounted Processes
        Process ID              : 28043
            GPU Utilization     : 20 %
            Memory Utilization  : 4 %
            Max memory usage    : 461 MB
            Time                : 5566 ms
        Process ID              : 28085
            GPU Utilization     : 99 %
            Memory Utilization  : 100 %
            Max memory usage    : 101 MB
            Time                : 11888 ms
Requirements:
  Tesla or Quadro GPU (Kepler)
  CUDA 5.5
  Linux 32 & 64, Windows 64
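Accounting must be switched on before statistics accumulate; a sketch using nvidia-smi's accounting-mode flag (root required, flag name as documented for this driver generation):

```shell
# Enable per-process accounting on the GPU.
sudo nvidia-smi -am 1
# Later, query the accumulated per-process statistics.
nvidia-smi -q -d ACCOUNTING
```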
CUFFT API Enhancements: Extensibility
Existing APIs (still work in 5.5):
  cufftPlanMany(...)
  cufftSetCompatibilityMode(...)  /* might recreate plan */
  cufftExecC2C(...)
  cufftDestroy(...)
Each new configuration option may require expensive plan re-creation.
New additional APIs in 5.5:
  cufftCreate(...)
  cufftSetCompatibilityMode(...)
  cufftMakePlanMany(...)
  cufftExecC2C(...)
  cufftDestroy(...)
Allows new configuration options to be employed without multiple re-plan steps.
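The new call sequence above can be sketched concretely; sizes, the compatibility mode, and the data pointers are illustrative, and error checking is omitted:

```cuda
// Sketch of the CUDA 5.5 CUFFT planning flow: configure the handle first,
// then create the plan once.
#include <cufft.h>

void fft_55_flow(int n, int batch, cufftComplex *d_in, cufftComplex *d_out)
{
    cufftHandle plan;
    size_t workSize;

    cufftCreate(&plan);                               // handle only, no plan yet
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftMakePlanMany(plan, 1, &n,                    // 1-D transform of size n
                      NULL, 1, 0, NULL, 1, 0,         // tightly packed layout
                      CUFFT_C2C, batch, &workSize);   // plan created once
    cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```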
CUFFT API Enhancements: Memory
Query size of workspace:
  cufftEstimate1d(...), cufftEstimate2d(...), cufftEstimate3d(...), and cufftEstimateMany(...)
  Helps determine if a plan fits in GPU memory
  Returned size is not exact in 5.5 (see docs for details)
Control scratch workspace:
  cufftCreate(...)
  cufftSetAutoAllocate(0)
  cufftMakePlanMany(...)  /* returns size of work area */
  cufftSetWorkArea(...)
  cufftExecC2C(...)
  cufftDestroy(...)
Share the same workspace across executions of different plans (non-concurrently)
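Workspace sharing across plans can be sketched as follows; the 1-D plans, sizes, and in-place transforms are illustrative, and error checking is omitted:

```cuda
// Hedged sketch: share one scratch buffer across two plans that never
// execute concurrently.
#include <cufft.h>
#include <cuda_runtime.h>

void shared_workspace(int nx1, int nx2, cufftComplex *d1, cufftComplex *d2)
{
    cufftHandle p1, p2;
    size_t s1, s2;

    cufftCreate(&p1);
    cufftSetAutoAllocate(p1, 0);              // we will supply the work area
    cufftMakePlan1d(p1, nx1, CUFFT_C2C, 1, &s1);

    cufftCreate(&p2);
    cufftSetAutoAllocate(p2, 0);
    cufftMakePlan1d(p2, nx2, CUFFT_C2C, 1, &s2);

    void *work;
    cudaMalloc(&work, s1 > s2 ? s1 : s2);     // one buffer, max of both needs
    cufftSetWorkArea(p1, work);
    cufftSetWorkArea(p2, work);

    cufftExecC2C(p1, d1, d1, CUFFT_FORWARD);  // never concurrent with p2
    cufftExecC2C(p2, d2, d2, CUFFT_FORWARD);

    cufftDestroy(p1);
    cufftDestroy(p2);
    cudaFree(work);
}
```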
CUFFT API Enhancements: FFTW Support
Easily port from FFTW to CUFFT by changing the link library
Supports all combinations of:
  Single and double precision
  C2C, R2C, C2R transforms
  FFTW basic interface
  FFTW advanced interface
  FFTW guru interface
Does not support:
  Extended precision
  Real-to-real transforms
  "Split" memory layout
  Distributed memory with MPI
  *guru64* APIs
The FFTW compatibility header file helps developers detect when unsupported FFTW APIs are used
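Porting is primarily a link-time change; a build sketch, where the application name and the include/library paths for a 5.5 installation are illustrative:

```shell
# FFTW build:
#   gcc fft_app.c -lfftw3f -lm -o fft_app
# CUFFT FFTW-interface build: swap in the cufftw library.
gcc fft_app.c -I/usr/local/cuda-5.5/include \
    -L/usr/local/cuda-5.5/lib64 -lcufftw -lcufft -o fft_app
```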
CUDA 5.5 enables new platforms
Enable 3rd-party toolchains:
  Compiler SDK based on LLVM (libnvvm)
  Allows 3rd-party ports of new languages to GPUs
Enable ARM:
  Cross compilation from x86, or native ARM compilation
  CUDA development tools support
  SECO mITX board for HPC
OS and Compiler Support Matrix

OS / Compiler        32  64  Status
Ubuntu 12.04         X   X   New
Ubuntu 12.10         X   X   New
Ubuntu 10.04         X   X   Continued
Ubuntu 11.10         X   X   Removed
Fedora 18                X   New
Fedora 16            X   X   Removed
RHEL 5.5+                X   Continued
RHEL 6.x                 X   Continued
Mac OS X 10.8        X   X   Continued
Mac OS X 10.7.x      X   X   Continued
OpenSUSE 12.1            X   Continued
SLES 11 SP2              X   Continued
WinXP                X   X   Continued
Vista/Win7/Win8      X   X   Continued (includes Win 2008 and 2012 Server)
VC 9.0 (VS 2008)     X   X   Continued
VC 10.0 (VS 2010)    X   X   Continued
VC 11.0 (VS 2012)    X   X   New
CUDACasts on YouTube
CUDA 5.5
Linux RPM/DEB installers
Stream Priorities
Static CUDART
Dynamic Parallelism performance improvements
MPS on Linux
Multi-user and remote debugging
New Visual Profiler guided optimization
CUFFT API Enhancements
LLVM based Compiler SDK