
NVIDIA CUDA Libraries

Ujval Kapasi, March 29, 2012

[Software stack diagram: Applications; 3rd Party Libraries & Middleware; CUDA-C APIs; NVIDIA Libraries]

CUDA Libraries and Middleware

•  Basic building blocks
•  Easy to use
•  Fast
•  Programming language integration
•  Hybrid computation
•  Multi-node and multi-GPU
•  Domain specific languages

•  CUBLAS: dense linear algebra
•  CUSPARSE: sparse linear algebra
•  CUFFT: discrete Fourier transforms
•  CURAND: random number generation
•  NPP: signal and image processing
•  Thrust: scan, sort, reduce, transform
•  math.h: floating point
•  system calls: printf, malloc, assert
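As a concrete illustration of the "easy to use" goal above, here is a minimal Thrust sketch (data size and values are illustrative, not from the talk); it sorts and reduces a vector entirely on the GPU:

// Sort and reduce on the GPU with Thrust (illustrative sketch).
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    // Fill a host vector with descending values.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = (int)(h.size() - i);

    // Copy-constructing a device_vector moves the data into GPU memory.
    thrust::device_vector<int> d = h;

    // Both algorithms run entirely on the device.
    thrust::sort(d.begin(), d.end());
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL);

    std::printf("smallest = %d, sum = %lld\n", (int)d[0], sum);
    return 0;
}

Thrust ships with the CUDA Toolkit (since 4.0), so compiling with nvcc is enough; no extra libraries need to be linked.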

NVIDIA CUDA Libraries

4x-10x speedups over CPU on single precision

[Speedup charts vs. CPU: CUSPARSE; CUFFT (log2(size) from 1 to 25); Thrust (reduce, transform, scan, sort); CUBLAS (SGEMM, SSYMM, SSYRK, STRMM, STRSM)]
•  CUDA 4.1 on Tesla M2090, ECC on
•  MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz
NOTE: ROUTINES ASSUME INPUT AND OUTPUT DATA ARE IN GPU MEMORY
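To make the note above concrete, here is a minimal cuBLAS sketch (illustrative size and values, error checking omitted). The SGEMM call itself only touches device pointers, so the host-to-device copies sit outside what the benchmarks measure:

// Single-precision GEMM with data resident in GPU memory (illustrative sketch).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int n = 512;                               // illustrative matrix size
    const size_t bytes = (size_t)n * n * sizeof(float);

    // Host matrices in column-major order, as cuBLAS expects.
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; hC[i] = 0.0f; }

    // Explicit transfers into GPU memory; the benchmarked routine assumes this is already done.
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes); cudaMalloc((void**)&dB, bytes); cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, all operands in device memory.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    std::printf("C[0] = %f\n", hC[0]);               // expect 1024.0 for these inputs

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}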

Challenge #1: MATCH AN ESTABLISHED CPU ECOSYSTEM

Timeline: 2007 to 2011

CUDA Toolkit 1.x
• Single precision
• cuBLAS
• cuFFT
• math.h

CUDA Toolkit 2.x
• Double precision support in all libraries

CUDA Toolkit 3.x
• cuSPARSE
• cuRAND
• printf()
• malloc()

CUDA Toolkit 4.x
• Thrust
• NPP
• assert()
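The printf(), malloc(), and assert() entries above refer to calls made from device code. A minimal sketch, assuming a compute capability 2.0+ GPU; the kernel name is hypothetical:

// Device-side printf, malloc/free, and assert inside a kernel (illustrative sketch).
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

__global__ void demo_kernel(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    assert(n > 0);                                   // device-side assert

    int *scratch = (int*)malloc(4 * sizeof(int));    // device-side heap allocation
    if (scratch != NULL) {
        scratch[0] = i;
        if (i == 0) printf("thread %d wrote %d\n", i, scratch[0]);  // device-side printf
        free(scratch);
    }
}

int main() {
    demo_kernel<<<4, 64>>>(256);
    cudaDeviceSynchronize();                         // also flushes device printf output
    return 0;
}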

CUDA ecosystem today spans many domains

"   Languages (FORTRAN, python) "   Math, Numerics, Statistics "   Dense & Sparse Linear Algebra "   Algorithms (sort, etc.) "   Image Processing "   Computer Vision "   Signal Processing "   Finance

http://developer.nvidia.com/gpu-accelerated-libraries

!"

#!"

$!"

%!"

&!"

'!!"

'#!"

'$!"

'%!"

'&!"

!" '%" (#" $&" %$" &!" )%" ''#" '#&"

*+,-

./"

/012"3454546"

/07892".:2;0<0=7">99"/012<"#5#5#"?="'#&5'#&5'#&"

@A++B"$C'"

@A++B"$C!"

DE,"

Challenge #2: PREDICTABLE ACCELERATION

Consistently faster than MKL

>3x faster than 4.0 on average

•  cuFFT 4.1 and cuFFT 4.0 on Tesla M2090, ECC on •  MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz Performance may vary based on OS version and motherboard configuration

NOTE: ROUTINES ASSUME INPUT AND OUTPUT DATA ARE IN GPU MEMORY

•  cuBLAS 4.1 on Tesla M2090, ECC on •  MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz Performance may vary based on OS version and motherboard configuration
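For context, a minimal cuFFT sketch of the kind of 3D single-precision transform covered by the chart above (size picked from the benchmarked range; input data and error checking omitted). In a real benchmark the plan would be built once and reused across many executions:

// 3D single-precision complex-to-complex FFT with cuFFT (illustrative sketch).
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int nx = 128, ny = 128, nz = 128;          // one of the benchmarked sizes
    const size_t count = (size_t)nx * ny * nz;

    // Buffer resident in GPU memory, as the benchmarks assume.
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, count * sizeof(cufftComplex));
    cudaMemset(d_data, 0, count * sizeof(cufftComplex));

    // Build the plan, then execute the transform in place.
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    std::printf("3D C2C FFT of %dx%dx%d complete\n", nx, ny, nz);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}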

Challenge #2: PREDICTABLE ACCELERATION

[Chart: ZGEMM performance; GFLOPS (0 to 450) vs. Matrix Size (NxN) from 0 to 2048; series: CUBLAS-Zgemm, MKL-Zgemm]

NOTE: ROUTINES ASSUME INPUT AND OUTPUT DATA ARE IN GPU MEMORY

•  cuBLAS 4.1 on Tesla M2090, ECC on
•  MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Challenge #2: PREDICTABLE ACCELERATION

!"

#!"

$!"

%!"

&!"

'!!"

'#!"

'$!"

'%!"

'&!"

#!!"

!" '%" (#" $&" %$"

*+,-

./"

DF?:05"G0H27<0=7"34546"

;IJ,>/"'!!"HF?:0;2<" ;IJ,>/"'!K!!!"HF?:0;2<" DE,"'!K!!!"HF?:0;2<"

NOTE: ROUTINES ASSUME INPUT AND OUTPUT DATA ARE IN GPU MEMORY
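The many-small-matrices comparison above is the scenario targeted by cuBLAS's batched GEMM interface. A minimal sketch using cublasSgemmBatched (available since cuBLAS 4.1); sizes are illustrative, matrices are left unfilled, and error checking is omitted:

// Batched single-precision GEMM over many small matrices (illustrative sketch).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 32;                                // small matrix dimension
    const int batch = 100;                           // number of matrices in the batch
    const size_t bytes = (size_t)n * n * sizeof(float);

    // One device buffer per matrix; keep the device pointers on the host for now.
    std::vector<float*> hA(batch), hB(batch), hC(batch);
    for (int b = 0; b < batch; ++b) {
        cudaMalloc((void**)&hA[b], bytes);
        cudaMalloc((void**)&hB[b], bytes);
        cudaMalloc((void**)&hC[b], bytes);
        // (Real code would also fill A and B with data here.)
    }

    // The batched API takes device arrays of device pointers.
    float **dA, **dB, **dC;
    cudaMalloc((void**)&dA, batch * sizeof(float*));
    cudaMalloc((void**)&dB, batch * sizeof(float*));
    cudaMalloc((void**)&dC, batch * sizeof(float*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One call runs all the small GEMMs, avoiding per-matrix launch overhead.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const float**)dA, n, (const float**)dB, n,
                       &beta, dC, n, batch);
    cudaDeviceSynchronize();
    std::printf("batched SGEMM: %d matrices of size %dx%d\n", batch, n, n);

    cublasDestroy(handle);
    for (int b = 0; b < batch; ++b) { cudaFree(hA[b]); cudaFree(hB[b]); cudaFree(hC[b]); }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}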

"   CUDA Libraries accelerate basic building block algorithms "   BLAS ! CUBLAS, FFTW ! CUFFT, STL ! Thrust "   Enables a wide range of 3rd party libraries and middleware

"   CUDA Libraries coverage and performance constantly improving "   Automatic performance improvements with new CUDA releases and new

GPU architectures

Thank You!