GTC webinar transcript: on-demand.gputechconf.com
CUDA 5.5 Release Candidate
June 25, 2013
© 2013 NVIDIA
Linux RPM and Debian Packages
SIMPLIFIED INSTALLATION AND UPGRADES!
Use native package installers
apt-get, yum, zypper
Everything: apt-get install cuda
One package: apt-get install cuda-documentation
Updates: apt-get upgrade
RPM and Debian Packaging Features
Side-by-side installations: apt-get install cuda-5-5 cuda-6-0 (when available)
Cross-platform development (e.g., 32-bit target on 64-bit OS): apt-get install cuda-cross
Version locking:
  update to latest version: apt-get install cuda
  update to latest 5.5 version: apt-get install cuda-5-5
Everything is installed under /usr/local/cuda-5.5/
5.5 Download Page
RPM and Debian packages
One package repository per supported Linux distribution
(except Ubuntu 10.04 and RHEL 5)
Ubuntu example:
$ [ download cuda-repo-<distro>_<version>_<architecture>.deb ]
$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
$ sudo apt-get update
$ sudo apt-get install cuda
Shared CUDART: 3rd Party Library / ISV Application Developer / End-User System
[Diagram: the developer's CUDA App links libcudart.so.5.0, while third-party plugins myCUDAPluginA.so and myCUDAPluginB.so link libcudart.so.4.2 and libcudart.so.3.2 respectively; the end-user system must receive the app, both plugins, and all three matching libcudart.so versions.]
Complex coordination to ship correct components
Static Library CUDART (CUDA Runtime)
[Diagram: with CUDART linked statically, the developer ships only CUDA App, myCUDAPluginA.so, and myCUDAPluginB.so; no libcudart.so copies need to reach the end-user system.]
Distribution is made much simpler
Linux, Windows, Mac
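Statically linking the runtime is a build-time choice; a minimal sketch using nvcc's runtime-selection flag introduced alongside this feature (application and file names illustrative):

```shell
# Link the CUDA runtime statically: no libcudart.so needs to ship with the app.
nvcc -cudart=static -o myapp myapp.cu
```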
Stream Priorities Accelerate the Critical Path
[Timeline figure: Stream 1 runs Kernels A, B, C; Stream 2 runs Kernel X. With no priorities, Kernel X waits behind work launched earlier; with Stream 2 marked high-priority, Kernel X is scheduled ahead of Kernel C as soon as it is launched.]
With priorities: especially useful when Kernel X generates data for MPI_Send()
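The high-priority stream in the figure can be created with the stream-priority API added in CUDA 5.5; a minimal sketch (error checking omitted):

```cuda
// Create a high-priority stream for the critical-path kernel (Kernel X)
// and a lower-priority stream for the remaining work.
#include <cuda_runtime.h>

void setup_streams(cudaStream_t *hi, cudaStream_t *lo)
{
    int leastPriority, greatestPriority;
    // Query the legal priority range for this device.
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Numerically lower values mean higher priority.
    cudaStreamCreateWithPriority(hi, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(lo, cudaStreamNonBlocking, leastPriority);
}
```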
CUDA Dynamic Parallelism (CDP)
GPU code:
  __global__ void B(float *data)
  {
      do_stuff(data);
      X <<< ... >>> (data);
      Y <<< ... >>> (data);
      Z <<< ... >>> (data);
      cudaDeviceSynchronize();
      do_more_stuff(data);
  }
CPU code:
  int main() {
      float *data;
      setup(data);
      A <<< ... >>> (data);
      B <<< ... >>> (data);
      C <<< ... >>> (data);
      cudaDeviceSynchronize();
      return 0;
  }
[Diagram: main launches kernels A, B, C on the GPU; kernel B itself launches child kernels X, Y, Z.]
CUDA 5.5 Scheduling Optimizations vs 5.0
SW scheduler parallelized
~2.5X lower child-to-child latency
~3.5X lower completion latency; primarily benefits smaller grids
Long dependent chains: ~50% speedup vs. 5.0 for a chain of ten 5 μs grids
[Figure: GPU utilization (%) over time, without Hyper-Q vs. with Hyper-Q; Hyper-Q sustains much higher utilization.]
Hyper-Q in CUDA 5.0: Streams
New in CUDA 5.5: Hyper-Q / MPI
FERMI: 1 MPI task at a time
KEPLER: 32 simultaneous MPI tasks
Multi-Process Server Required for Hyper-Q / MPI
[Diagram: four MPI ranks (0-3), each a CUDA client, funnel work through a single CUDA Server Process onto the GPU.]
$ mpirun -np 4 my_cuda_app
No application re-compile to share the GPU
No user configuration needed; can be preconfigured by a SysAdmin
MPI ranks using CUDA are clients
Server spawns on demand, one per user
One job per user: no isolation between MPI ranks
Exclusive-process mode enforces a single server
One GPU per rank: no cudaSetDevice(), only CUDA device 0 is visible
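The SysAdmin preconfiguration mentioned above can be sketched as follows; the control-daemon name follows NVIDIA's MPS documentation, and the exact binary shipped with 5.5-era packages may differ:

```shell
# Hedged sketch: preconfigure the GPU for Hyper-Q / MPI sharing.
sudo nvidia-smi -c EXCLUSIVE_PROCESS   # enforce a single server per GPU
nvidia-cuda-mps-control -d             # start the MPS control daemon
mpirun -np 4 my_cuda_app               # ranks attach as MPS clients
```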
Strong Scaling of CP2K on Cray XK7
Hyper-Q with multiple MPI ranks leads to a 2.5X speedup over a single MPI rank using the GPU
(Blog post by Peter Messmer of NVIDIA)
Multi-user debugging with a single GPU
Already supported by Nsight Visual Studio Edition (Nsight VSE now bundled in the CUDA installer)
Now also supported by CUDA-GDB & Nsight Eclipse Edition
BETA feature, SM 3.5 only
Debugger & GUI on one GPU; multi-user debug on one GPU
Two ways to opt in:
  CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1
  set cuda software_preemption on
CUDA-GDB and Nsight EE now also support Dynamic Parallelism debugging
Remote Debugging using Nsight EE
Automatic Performance Analysis
NEW in 5.5: step-by-step optimization guidance
Identifying Candidate Kernels
The analysis system estimates which kernels are the best candidates for speedup, based on execution time and achieved occupancy
Primary Performance Bound
The most likely limiter of a kernel's performance:
  Memory bandwidth
  Compute resources
  Instruction and memory latency
The primary bound should be addressed first; it is often beneficial to examine secondary bounds as well
Visual Profiler: Performance Bound
Visual Profiler: Memory Efficiency
GPU per-process accounting statistics
$ nvidia-smi -q -d ACCOUNTING
==============NVSMI LOG==============
Timestamp                       : Tue Apr 16 10:36:59 2013
Driver Version                  : 319.15
Attached GPUs                   : 1
GPU 0000:06:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 128
    Accounted Processes
        Process ID              : 28043
            GPU Utilization     : 20 %
            Memory Utilization  : 4 %
            Max memory usage    : 461 MB
            Time                : 5566 ms
        Process ID              : 28085
            GPU Utilization     : 99 %
            Memory Utilization  : 100 %
            Max memory usage    : 101 MB
            Time                : 11888 ms
Requirements:
  Tesla or Quadro GPU (Kepler)
  CUDA 5.5
  Linux 32 & 64, Windows 64
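Accounting must be switched on before statistics accumulate; a sketch using nvidia-smi's accounting-mode flag (root required, flag name as documented for this driver generation):

```shell
# Enable per-process accounting on the GPU.
sudo nvidia-smi -am 1
# Later, query the accumulated per-process statistics.
nvidia-smi -q -d ACCOUNTING
```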
CUFFT API Enhancements: Extensibility
Existing APIs (still work in 5.5):
  cufftPlanMany(...)
  cufftSetCompatibilityMode(...)  /* might recreate plan */
  cufftExecC2C(...)
  cufftDestroy(...)
Each new configuration option may require expensive plan re-creation.
New additional APIs in 5.5:
  cufftCreate(...)
  cufftSetCompatibilityMode(...)
  cufftMakePlanMany(...)
  cufftExecC2C(...)
  cufftDestroy(...)
Allows new configuration options to be employed without multiple re-plan steps.
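The new call sequence above can be sketched concretely; sizes, the compatibility mode, and the data pointers are illustrative, and error checking is omitted:

```cuda
// Sketch of the CUDA 5.5 CUFFT planning flow: configure the handle first,
// then create the plan once.
#include <cufft.h>

void fft_55_flow(int n, int batch, cufftComplex *d_in, cufftComplex *d_out)
{
    cufftHandle plan;
    size_t workSize;

    cufftCreate(&plan);                               // handle only, no plan yet
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftMakePlanMany(plan, 1, &n,                    // 1-D transform of size n
                      NULL, 1, 0, NULL, 1, 0,         // tightly packed layout
                      CUFFT_C2C, batch, &workSize);   // plan created once
    cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```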
CUFFT API Enhancements: Memory
Query size of workspace:
  cufftEstimate1d(...), cufftEstimate2d(...), cufftEstimate3d(...), and cufftEstimateMany(...)
  Helps determine if a plan fits in GPU memory
  Returned size is not exact in 5.5 (see docs for details)
Control scratch workspace:
  cufftCreate(...)
  cufftSetAutoAllocate(0)
  cufftMakePlanMany(...)  /* returns size of work area */
  cufftSetWorkArea(...)
  cufftExecC2C(...)
  cufftDestroy(...)
Share the same workspace across executions of different plans (non-concurrently)
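Workspace sharing across plans can be sketched as follows; the 1-D plans, sizes, and in-place transforms are illustrative, and error checking is omitted:

```cuda
// Hedged sketch: share one scratch buffer across two plans that never
// execute concurrently.
#include <cufft.h>
#include <cuda_runtime.h>

void shared_workspace(int nx1, int nx2, cufftComplex *d1, cufftComplex *d2)
{
    cufftHandle p1, p2;
    size_t s1, s2;

    cufftCreate(&p1);
    cufftSetAutoAllocate(p1, 0);              // we will supply the work area
    cufftMakePlan1d(p1, nx1, CUFFT_C2C, 1, &s1);

    cufftCreate(&p2);
    cufftSetAutoAllocate(p2, 0);
    cufftMakePlan1d(p2, nx2, CUFFT_C2C, 1, &s2);

    void *work;
    cudaMalloc(&work, s1 > s2 ? s1 : s2);     // one buffer, max of both needs
    cufftSetWorkArea(p1, work);
    cufftSetWorkArea(p2, work);

    cufftExecC2C(p1, d1, d1, CUFFT_FORWARD);  // never concurrent with p2
    cufftExecC2C(p2, d2, d2, CUFFT_FORWARD);

    cufftDestroy(p1);
    cufftDestroy(p2);
    cudaFree(work);
}
```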
CUFFT API Enhancements: FFTW Support
Easily port from FFTW to CUFFT by changing the link library
Supports all combinations of:
  Single and double precision
  C2C, R2C, C2R transforms
  FFTW basic interface
  FFTW advanced interface
  FFTW guru interface
Does not support:
  Extended precision
  Real-to-real transforms
  "Split" memory layout
  Distributed memory with MPI
  *guru64* APIs
The FFTW compatibility header file helps developers detect when unsupported FFTW APIs are used
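Porting is primarily a link-time change; a build sketch, where the application name and the include/library paths for a 5.5 installation are illustrative:

```shell
# FFTW build:
#   gcc fft_app.c -lfftw3f -lm -o fft_app
# CUFFT FFTW-interface build: swap in the cufftw library.
gcc fft_app.c -I/usr/local/cuda-5.5/include \
    -L/usr/local/cuda-5.5/lib64 -lcufftw -lcufft -o fft_app
```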
CUDA 5.5 enables new platforms
Enable 3rd-party toolchains:
  Compiler SDK based on LLVM (libnvvm)
  Allows 3rd-party ports of new languages to GPUs
Enable ARM:
  Cross compilation from x86, or native ARM compilation
  CUDA development tools support
  SECO mITX board for HPC
OS and Compiler Support Matrix

OS / Compiler        32  64  Status
Ubuntu 12.04         X   X   New
Ubuntu 12.10         X   X   New
Ubuntu 10.04         X   X   Continued
Ubuntu 11.10         X   X   Removed
Fedora 18                X   New
Fedora 16            X   X   Removed
RHEL 5.5+                X   Continued
RHEL 6.x                 X   Continued
Mac OS X 10.8        X   X   Continued
Mac OS X 10.7.x      X   X   Continued
OpenSUSE 12.1            X   Continued
SLES 11 SP2              X   Continued
WinXP                X   X   Continued
Vista/Win7/Win8      X   X   Continued (includes Win 2008 and 2012 Server)
VC 9.0 (VS 2008)     X   X   Continued
VC 10.0 (VS 2010)    X   X   Continued
VC 11.0 (VS 2012)    X   X   New
CUDACasts on YouTube
CUDA 5.5
Linux RPM/DEB installers
Stream Priorities
Static CUDART
Dynamic Parallelism performance improvements
MPS on Linux
Multi-user and remote debugging
New Visual Profiler guided optimization
CUFFT API Enhancements
LLVM based Compiler SDK