CLBlast: A Tuned BLAS Library for Faster Deep Learning
Cedric Nugteren
May 11, 2017
http://github.com/cnugteren/clblast
http://cnugteren.github.io/clblast
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 2 out of 43
The Heart of Deep Learning
GEMM is at the Heart of Deep Learning
So where are the Matrix-Multiplications?
Convolutions as Matrix Multiplication
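A minimal sketch of the idea behind this slide, assuming a single-channel image, stride 1, no padding, and the deep-learning convention in which "convolution" means cross-correlation. The helper names (`im2col`, `conv2d_as_gemm`, `conv2d_direct`) are illustrative and not taken from the talk: every input patch is unrolled into a column, so all filters can be applied with one matrix multiplication.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one column."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    # One row per kernel element, one column per output pixel.
    cols = [[0] * (out_h * out_w) for _ in range(kh * kw)]
    for y in range(out_h):
        for x in range(out_w):
            for i in range(kh):
                for j in range(kw):
                    cols[i * kw + j][y * out_w + x] = image[y + i][x + j]
    return cols, out_h, out_w


def matmul(a, b):
    """Plain triple-loop GEMM: C = A * B."""
    inner, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][c] for p in range(inner)) for c in range(n)]
            for row in a]


def conv2d_as_gemm(image, kernels):
    """Apply several equally sized filters to one image with a single GEMM."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # Each filter is flattened into one row of the left-hand matrix.
    weights = [[f[i][j] for i in range(kh) for j in range(kw)] for f in kernels]
    flat = matmul(weights, cols)
    # Reshape every output row back into an out_h x out_w feature map.
    return [[row[y * out_w:(y + 1) * out_w] for y in range(out_h)]
            for row in flat]


def conv2d_direct(image, kernel):
    """Reference implementation: direct sliding-window cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)] for y in range(out_h)]
```

The GEMM version trades extra memory (patches overlap, so `cols` duplicates input values) for the ability to reuse a tuned GEMM routine.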
GEMM is the Heart of Deep Learning
Does everyone agree?
Still true in 2017!
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
– Not portable, not customisable, not open-source, ...
● Is AMD’s clBLAS great?
– Not performance portable, not well engineered, lack of new features, ...
Introducing CLBlast
● CLBlast: Modern C++11 OpenCL BLAS library
● Implements all BLAS routines for all precisions (S, D, C, Z)
● Accelerates all kinds of applications:
– Fluid dynamics, quantum chemistry, linear algebra, etc.
– Today’s focus: deep learning
● Already integrated into various projects:
– JOCLBlast (Java bindings)
– ArrayFire (GPU accelerated library and applications)
– OpenCL fork of Caffe (github.com/dividiti/ck-caffe)
– OpenCL fork of TF (github.com/hughperkins/tensorflow-cl)
Introducing CLBlast
[Figure callouts: CI and extensive testing; activity; community]
But… is it fast?
● All kernels are generic and tunable thanks to integration of the
CLTune auto-tuner (presented at last year’s GTC)
● Tuned out-of-the-box for 40 common devices
– For new devices: run the auto-tuner when installing CLBlast
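To make the auto-tuning idea concrete, here is a toy sketch, not CLTune's actual interface: it enumerates all combinations of tunable parameters, optionally samples a random subset, and keeps the cheapest one. The parameter names (`TILE`, `VECTOR_WIDTH`) and the `toy_cost` function are hypothetical stand-ins; a real tuner would compile the kernel with each configuration and time it on the target device.

```python
import itertools
import random


def tune(search_space, measure, fraction=1.0, seed=0):
    """Return (best_config, best_cost) over the full space or a random subset.

    search_space: dict mapping a parameter name to its list of allowed values.
    measure: callable taking a config dict and returning a cost (lower is better).
    fraction: share of all configurations to try; 1.0 means exhaustive search.
    """
    names = sorted(search_space)
    configs = [dict(zip(names, values))
               for values in itertools.product(*(search_space[n] for n in names))]
    if fraction < 1.0:
        rng = random.Random(seed)
        count = max(1, int(len(configs) * fraction))
        configs = rng.sample(configs, count)
    best = min(configs, key=measure)
    return best, measure(best)


def toy_cost(config):
    """Hypothetical cost model standing in for a measured kernel run time."""
    tile, vec = config["TILE"], config["VECTOR_WIDTH"]
    if tile % vec != 0:
        return float("inf")  # treat this combination as invalid
    return abs(tile - 32) + abs(vec - 4)


space = {"TILE": [8, 16, 32, 64], "VECTOR_WIDTH": [1, 2, 4, 8]}
best, cost = tune(space, toy_cost)
print(best, cost)
```

Sampling only a fraction of the space shortens tuning time but may miss the best configuration, which is why two tuning runs can end up with different results.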
CLBlast Benchmark Results
● Higher is better
● More results at http://cnugteren.github.io/clblast
[Figure: six benchmark panels: AXPY (in GB/s), GEMV (in GB/s), and GEMM (in GFLOPS), each for regular and odd sizes]
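For reference, the two units can be derived from a measured run time as sketched below. The operation and byte counts are the conventional ones (2·m·n·k floating-point operations for GEMM; two reads and one write per element for AXPY) and are an assumption here, since the slides do not state how the benchmark counts them.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for C = A * B with A of size m x k and B of size k x n.

    Each of the m*n outputs needs k multiplications and k additions.
    """
    return 2.0 * m * n * k / seconds / 1e9


def axpy_gbps(n, bytes_per_element, seconds):
    """GB/s for y = alpha * x + y: read x, read y, write y."""
    return 3.0 * n * bytes_per_element / seconds / 1e9
```

AXPY and GEMV are reported as bandwidth because they do little arithmetic per byte moved, whereas GEMM reuses each loaded value many times and is reported as compute throughput.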
CLBlast on GeForce GTX750Ti
● On-par or better than clBLAS (especially for GEMM)
● ...but not as fast as NVIDIA’s cuBLAS
CLBlast on Radeon M370X
● On-par or better than clBLAS (especially for odd-sized GEMM)
CLBlast on Skylake ULT GT2
● On-par or better than clBLAS (especially for GEMM)
CLBlast on Core i5-6200U
● On-par or better than clBLAS (especially for AXPY & GEMV)
CLBlast for Deep Learning
● What can we do for the deep-learning community?
– Problem-specific tuning
– Half-precision floating-point (FP16)
– Batched routines
Tuning Only for a Single Size?
● Default GEMM tuning:
– 1024x1024 matrices
● Deep-learning:
– Various but fixed matrix sizes (dependent on network layout)
– Typically smaller and/or rectangular matrices
● Potential for optimal performance in CLBlast:
– Tuning for a custom size possible
– C++ API to change parameters at run-time
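The idea of swapping in size-specific parameters at run time can be sketched as a small lookup table with a default fallback. This is an illustration in Python, not CLBlast's C++ API; the class name, the parameter names (`TILE_M`, `TILE_N`), and all values are hypothetical.

```python
# Hypothetical defaults, standing in for parameters tuned on 1024x1024 matrices.
DEFAULT_PARAMS = {"TILE_M": 64, "TILE_N": 64}


class ParameterDatabase:
    """Maps a GEMM problem size (m, n, k) to the kernel parameters to use."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def override(self, size, params):
        """Register parameters tuned for one (m, n, k) size at run time."""
        self._overrides[size] = dict(params)

    def lookup(self, size):
        """Return size-specific parameters, or the defaults if none exist."""
        return self._overrides.get(size, self._defaults)


db = ParameterDatabase(DEFAULT_PARAMS)
# Suppose a tuning run for a small layer produced these (made-up) values.
db.override((128, 128, 128), {"TILE_M": 16, "TILE_N": 16})
```

Because a network's matrix sizes are fixed by its layout, such a table only needs one entry per distinct layer shape.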
Problem-Specific Tuning
● SGEMM tuning for Radeon M370X GPU
● Best on the diagonal
● >100% due to random tuning
● Gain of ~2x for some cases
[Figure label: default]
Half-precision floating-point (FP16)
● Double-precision (FP64) not needed for deep-learning
● Even FP32 is too much → introducing half-precision FP16
● Implemented in low-power devices (ARM Mali, Intel GPUs) and deep-learning specific GPUs (P100)
● Potential for 2x savings in: bandwidth, storage, compute, energy
● Current FP16 support for GPUs:
– cuBLAS: HGEMM only
– clBLAS: no FP16 at all
– CLBlast: all routines!
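The storage side of the FP16 trade-off can be checked on the host with Python's standard `struct` module, whose `e` format is IEEE 754 half precision. This is only a sketch of the number format, not of CLBlast's FP16 routines; the example values are arbitrary.

```python
import struct


def to_fp16_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def to_fp32_bytes(values):
    """Pack floats as little-endian IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def from_fp16_bytes(data):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))


weights = [0.1, -1.5, 3.14159, 1024.0]
half = to_fp16_bytes(weights)
single = to_fp32_bytes(weights)
restored = from_fp16_bytes(half)

print(len(half), len(single))  # half the bytes for the same number of values
print(restored)                # values rounded to FP16's 10-bit mantissa
```

Halving the bytes per value is what halves storage and bandwidth; the rounding visible in `restored` is the precision given up in exchange.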
Half-precision FP16 on Intel Skylake GPU
● FP16 ~1.8x faster across the board!
[Figure legend: FP32, FP16, clBLAS]
Batching BLAS routines
● Small-sized GEMM is super slow
– Not enough work-groups
– Not enough threads
● Let’s make it fast again:
– Combine multiple small GEMM operations into a single kernel
– Use offsets to indicate where the next matrices start
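The offset scheme can be sketched on the CPU as follows. This is an illustration with a hypothetical function name and argument order, not the CLBlast signature, and it omits the alpha/beta scaling of a full GEMM. All matrices of the batch live in three flat row-major buffers, and one call processes every (A, B, C) triple.

```python
def gemm_batched(m, n, k, a, a_offsets, b, b_offsets, c, c_offsets):
    """Compute C_i = A_i * B_i for every matrix triple in the batch.

    a, b, c are flat row-major buffers; the offset lists give the index
    at which the i-th m x k, k x n and m x n matrix starts in its buffer.
    """
    # On a GPU this outer loop would map to extra work-groups of one kernel,
    # so many small multiplications together fill the device.
    for a_off, b_off, c_off in zip(a_offsets, b_offsets, c_offsets):
        for row in range(m):
            for col in range(n):
                acc = 0
                for p in range(k):
                    acc += a[a_off + row * k + p] * b[b_off + p * n + col]
                c[c_off + row * n + col] = acc


# Two 2x2 multiplications packed back to back in each buffer.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 0, 0, 1, 2, 0, 0, 2]
c = [0] * 8
gemm_batched(2, 2, 2, a, [0, 4], b, [0, 4], c, [0, 4])
```

Explicit offsets mean the matrices do not have to be evenly spaced or contiguous, only addressable within the same buffer.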
Batched GEMM on GeForce GTX 750Ti
● SGEMM 128x128x128:
– Regular: ~40 GFLOPS
– Batched: ~10 GFLOPS (1 GEMM) up to ~500 GFLOPS (8K)!
[Figure x-axis: batch size]
Batched GEMM on GeForce GTX 750Ti
● Significant benefits for larger sizes as well
– mostly beneficial in the range n=64 to 512
[Figure panels: 8 GEMMs, 64 GEMMs]
What’s next?
● More features for deep learning:
– ‘im2col’
– Winograd? FFT?
● Input-based auto-tuning using learned models
– Similar to S7150: The ISAAC library
● Integration into OpenCL deep-learning projects
– TensorFlow SYCL? LibDNN?
● Suggestions?
Why is BLAS Important for TomTom?
● HDMap making → Deep-learning
● Deep-learning → Fast BLAS libraries
● More info: S7809 - A Multi-Source, Multi-Sensor Approach to HDMap Creation
– Willem Strijbosch - Head of Autonomous Driving, TomTom
– Today at 10:30 AM in room 210D
Conclusion
● Introducing CLBlast: a modern C++11 OpenCL BLAS library
● Performance portable thanks to generic kernels and auto-tuning
● Especially targeted at accelerating deep-learning:
– Problem-size specific tuning:
● Up to 2x in an example experiment
– Half-precision FP16 support:
● Up to 2x benefit in speed and memory savings
– Batched GEMM routine:
● Order of magnitude benefit depending on the use-case