Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning
Cedric Nugteren
GPU Technology Conference, April 7, 2016
HPC center of the Netherlands Augmented Reality Technology
How to find the best flight ticket?
Which airline?
Fly a day earlier or later?
Which route?
Solution:
• Evaluate all combinations manually
• … or use a flight comparison website
How to find the best flight ticket? The same applies to GPU kernels:
Cache in shared memory?
Thread block sizes?
More work per thread or less?
Solution:
• Evaluate all combinations manually
• … or use the CLTune auto-tuner
Example: convolution
Targets:
• GPUs (CUDA & OpenCL)
• Multi-core CPUs
• Other OpenCL-capable devices
Example: blur filter
[Figure: 2D convolution of an input image A (X by Y) with filter coefficients (filter size: Xf by Yf), producing an output image B; example: 3 by 3 filter]
OpenCL 2D convolution
• Double for-loop
• Each thread: one output pixel
Tuning questions:
• Thread coarsening (2D)?
• Unroll loops?
• Work-group / thread-block size?
• Vector data-types?
• Cache in local / shared memory?
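The untuned starting point described on this slide can be sketched in plain Python. This is an illustration, not the OpenCL kernel from the talk: the function name `convolve2d`, the choice to skip the image borders, and the test image are all assumptions made for the example. The filter is applied without flipping, which makes no difference for a symmetric blur.

```python
def convolve2d(image, coeff):
    """Reference 2D convolution; borders are skipped, so the output is smaller."""
    yf, xf = len(coeff), len(coeff[0])
    out_h = len(image) - yf + 1
    out_w = len(image[0]) - xf + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    # On a GPU, each (y, x) output pixel would be handled by its own thread.
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            # The double for-loop over the Yf by Xf filter coefficients.
            for fy in range(yf):
                for fx in range(xf):
                    acc += coeff[fy][fx] * image[y + fy][x + fx]
            output[y][x] = acc
    return output


# 3 by 3 blur filter: every coefficient is 1/9, so each output is a local mean.
blur = [[1.0 / 9.0] * 3 for _ in range(3)]
image = [[float(x + y) for x in range(5)] for y in range(4)]
blurred = convolve2d(image, blur)
```

Each tuning question on the slide corresponds to a change in this loop nest: how many output pixels one thread computes, whether the inner loops are unrolled, and whether the input pixels are first staged in local memory.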
Search-space explosion
Tuning parameters:
• Thread coarsening (2D)?
• Unroll loops?
• Work-group / thread-block size?
• Vector data-types?
• Cache in local / shared memory?
16 x 2 x 16 x 4 x 5 = 10240 combinations
Filter illegal configurations: 3424 configurations remain
Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
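The arithmetic above is a Cartesian product followed by a legality filter. The sketch below reproduces the 10240 figure from the per-parameter counts on the slide. The slide does not say which parameter has which count or what the real constraints are, so the parameters are just indexed, and `is_legal` is a hypothetical stand-in that does not reproduce the 3424 figure.

```python
from itertools import product
from math import prod

# Number of candidate values per tuning parameter, as given on the slide.
sizes = [16, 2, 16, 4, 5]


def is_legal(config):
    # Hypothetical constraint, standing in for real device limits such as the
    # maximum work-group size or the amount of local memory available.
    return (config[0] + 1) * (config[2] + 1) <= 64


all_configs = list(product(*(range(n) for n in sizes)))
legal_configs = [c for c in all_configs if is_legal(c)]
print(len(all_configs), len(legal_configs))
```

Filtering before measuring matters because an illegal configuration typically fails to compile or launch, so it costs time without yielding a data point.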
Why do we need an auto-tuner?
Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
Wide variety of devices:
• Different optimal kernels
• Even from the same vendor
User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.
CLTune in action
Example: matrix-vector multiplication (y = Ax) with tiling
[Code slide, annotated as follows:]
• Kernel to tune
• Tile size (TS): 4 possible values
• Example input data
• Verifies output based on optional reference kernel
CLTune in action
• Tuneable tile-size TS
• Handles all the host-device data transfers!
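The workflow on these slides (one tunable tile size, example input data, verification against a reference) can be mimicked in plain Python. This is an emulation of the idea and not CLTune's API: the functions `matvec_tiled` and `tune` and the four tile sizes are assumptions, since the slide only states that TS has four possible values.

```python
import time


def matvec_reference(a, x):
    """Untiled reference, used only to verify the tuned variants."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def matvec_tiled(a, x, ts):
    """y = Ax, accumulating the dot products in tiles of `ts` columns."""
    y = [0.0] * len(a)
    for start in range(0, len(x), ts):
        x_tile = x[start:start + ts]  # stands in for a tile cached in fast memory
        for i, row in enumerate(a):
            y[i] += sum(r * v for r, v in zip(row[start:start + ts], x_tile))
    return y


def tune(a, x, tile_sizes):
    """Time every tile size and keep only those that match the reference."""
    reference = matvec_reference(a, x)
    timings = {}
    for ts in tile_sizes:
        begin = time.perf_counter()
        result = matvec_tiled(a, x, ts)
        elapsed = time.perf_counter() - begin
        # Discard configurations whose output deviates from the reference.
        if all(abs(r - e) <= 1e-6 for r, e in zip(result, reference)):
            timings[ts] = elapsed
    return min(timings, key=timings.get), timings


# Example input data (small integers stored as floats, so sums are exact).
size = 64
a = [[float((i * j) % 7) for j in range(size)] for i in range(size)]
x = [float(j % 5) for j in range(size)]
best_ts, timings = tune(a, x, [8, 16, 32, 64])
```

The measured times here say nothing about GPU behaviour; the point is the structure of the loop: configure, run, verify, record.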
Search strategies
Option 0: Full search
☺ Finds optimal solution
☓ Explores all options
[Figure: rotated histogram of the search space with its mean marked; 3424 convolution kernels on Tesla K40m GPU; y-axis: performance [% of best-known]]
Search strategies
Option 0: Full search
Option 1: Random search
☺ Explores arbitrary fraction
☓ Performance varies
Example: 107 out of 3424 configurations (1/32nd)
[Figure: random search on K40m; x-axis: search progress (steps), y-axis: performance [% of best-known]; colours: 3 example runs, labelled 56%, 94% and 74%]
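Random search as described here amounts to sampling a fixed fraction of the configurations without replacement and keeping the best one. In the sketch below the two-parameter search space and the cost function `fake_time` are synthetic assumptions, chosen only so that the space has 3424 entries; no real kernel is measured.

```python
import random


def random_search(configs, measure, fraction, rng):
    """Evaluate a random fraction of the configurations; return the best one."""
    budget = max(1, int(len(configs) * fraction))
    sampled = rng.sample(configs, budget)  # sampling without replacement
    return min(sampled, key=measure), budget


def fake_time(config):
    # Synthetic stand-in for a kernel run: a made-up cost, not a measurement.
    ts, wpt = config
    return abs(ts - 32) + 3 * abs(wpt - 4) + 1.0


# 107 x 32 = 3424 configurations, matching the size of the slide's search space.
configs = [(ts, wpt) for ts in range(1, 108) for wpt in range(1, 33)]
best, budget = random_search(configs, fake_time, 1 / 32, random.Random(0))
```

Changing the seed changes which 107 configurations are drawn, which is the run-to-run variation the three example runs on the slide illustrate.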
Search strategies
Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
☺ Explores arbitrary fraction
☓ Performance varies
☓ Meta-parameter
☓ Local optima
Example: 107 out of 3424 configurations (1/32nd)
[Figure: SA T=4 on K40m; x-axis: search progress (steps), y-axis: performance [% of best-known]; colours: 3 example runs, labelled 91%, 64% and 84%]
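A generic simulated-annealing loop over a discrete parameter grid is sketched below. It is a textbook variant and not necessarily the one CLTune implements: the neighbourhood definition, the linear cooling schedule and the synthetic cost `fake_time` are assumptions. The `temperature` argument plays the role of the meta-parameter T on the slide.

```python
import math
import random


def neighbours(config, shape):
    """Configurations that differ by one step in exactly one parameter."""
    result = []
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            value = config[dim] + step
            if 0 <= value < size:
                result.append(config[:dim] + (value,) + config[dim + 1:])
    return result


def simulated_annealing(shape, measure, steps, temperature, rng):
    current = tuple(rng.randrange(size) for size in shape)
    current_cost = measure(current)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Linear cooling: the temperature approaches zero towards the end.
        t = max(temperature * (1.0 - step / steps), 1e-9)
        candidate = rng.choice(neighbours(current, shape))
        cost = measure(candidate)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the regression grows and as the search cools down.
        if cost < current_cost or rng.random() < math.exp((current_cost - cost) / t):
            current, current_cost = candidate, cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


def fake_time(config):
    # Synthetic cost with a single optimum; not a real measurement.
    return abs(config[0] - 40) + 3 * abs(config[1] - 10) + 1.0


# One start point plus 106 steps gives 107 measurements in total.
best, best_cost = simulated_annealing((107, 32), fake_time, 106, 4.0, random.Random(1))
```

Because each step only moves to a neighbour, a run that starts far from a good region may not reach it within the budget, which is one way to read the "local optima" drawback listed above.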
Search strategies
Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
Option 3: Particle swarm optimisation
☺ Explores arbitrary fraction
☓ Performance varies
☓ Meta-parameter
☓ Local optima
Example: 107 out of 3424 configurations (1/32nd)
[Figure: PSO S=3 on K40m; x-axis: search progress (steps), y-axis: performance [% of best-known]; colours: 3 example runs, labelled 43%, 66% and 44%; line-types: 3 swarms]
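Particle swarm optimisation is usually defined for continuous spaces, so a discrete search space needs an adapted update rule. The sketch below uses one simple possibility: per dimension, a particle copies its own best value, the swarm's best value, a random value, or keeps its current value. The probabilities, the function name `particle_swarm`, the reading of S as the number of particles, and the synthetic cost are all assumptions for illustration.

```python
import random


def particle_swarm(shape, measure, swarm_size, steps, rng,
                   p_local=0.4, p_global=0.4, p_random=0.1):
    positions = [[rng.randrange(size) for size in shape] for _ in range(swarm_size)]
    costs = [measure(tuple(p)) for p in positions]
    local_best = [list(p) for p in positions]
    local_cost = list(costs)
    g = min(range(swarm_size), key=costs.__getitem__)
    global_best, global_cost = list(positions[g]), costs[g]

    for _ in range(steps):
        for i in range(swarm_size):
            for dim, size in enumerate(shape):
                r = rng.random()
                if r < p_local:
                    positions[i][dim] = local_best[i][dim]   # own best so far
                elif r < p_local + p_global:
                    positions[i][dim] = global_best[dim]     # swarm's best so far
                elif r < p_local + p_global + p_random:
                    positions[i][dim] = rng.randrange(size)  # random exploration
                # otherwise: keep the current value in this dimension
            cost = measure(tuple(positions[i]))
            if cost < local_cost[i]:
                local_best[i], local_cost[i] = list(positions[i]), cost
            if cost < global_cost:
                global_best, global_cost = list(positions[i]), cost
    return tuple(global_best), global_cost


def fake_time(config):
    # Synthetic cost with a single optimum; not a real measurement.
    return abs(config[0] - 40) + 3 * abs(config[1] - 10) + 1.0


# 3 particles and 35 steps: 3 + 3 * 35 = 108 measurements, close to the
# budget of 107 used on the slide.
best, best_cost = particle_swarm((107, 32), fake_time, 3, 35, random.Random(2))
```

With few particles, the swarm's best position dominates the updates early on, so the particles tend to cluster quickly around whatever the first good sample happened to be.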
Search strategies evaluation
[Figure: device: K40m; y-axis: performance [% of best-known]; x-axis: random, SA T=2, SA T=4, SA T=6, PSO S=3, PSO S=6, search space]
• Each search: 107 out of 3424 configurations (1/32nd)
• Average best result of 128 searches
• T and S: meta-parameters for SA and PSO
Conclusions:
• PSO performs poorly
• SA performs well, but not much better than random search
[1]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.
Search strategies evaluation
[Figure: four panels, one per device (K40m, GTX480, HD7970, Iris); y-axis: performance [% of best-known]; x-axis: random, SA T=2, SA T=4, SA T=6, PSO S=3, PSO S=6, search space]
Solution: machine learning?
Machine learning an auto-tuner
• Evaluate subset of all configurations (e.g. 100 out of 3424)
• Train model with the examples
• Predict execution time for all other configurations (take best 10 and evaluate them on actual hardware)
Models:
• Neural network (3-layer fully connected)
• Linear regression
Model input: parameter configuration; model output: execution time
[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
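The evaluate, train, predict and re-evaluate loop can be sketched with the simpler of the two models, linear regression, written here in plain Python via the normal equations. The synthetic cost `fake_time` is exactly linear so that the result is easy to check; real kernel run times generally are not, and the search space, seed and helper names are assumptions.

```python
import random


def fit_linear(samples, targets):
    """Least-squares fit of targets ~ w0 + w1*x1 + ... via the normal equations."""
    rows = [[1.0] + [float(v) for v in s] for s in samples]
    n = len(rows[0])
    # Augmented system (X^T X | X^T y).
    system = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
              + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(system[k][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for row in range(n):
            if row != col:
                factor = system[row][col] / system[col][col]
                system[row] = [a - factor * b for a, b in zip(system[row], system[col])]
    return [system[i][n] / system[i][i] for i in range(n)]


def predict(weights, config):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], config))


def fake_time(config):
    # Synthetic, exactly linear "execution time"; not a real measurement.
    return 5.0 + 0.5 * config[0] + 2.0 * config[1]


rng = random.Random(0)
configs = [(a, b) for a in range(107) for b in range(32)]  # 3424 configurations
train = rng.sample(configs, 100)                           # step 1: evaluate a subset
weights = fit_linear(train, [fake_time(c) for c in train])  # step 2: train the model
seen = set(train)
remaining = [c for c in configs if c not in seen]
# Step 3: predict the rest, then measure only the 10 most promising ones.
shortlist = sorted(remaining, key=lambda c: predict(weights, c))[:10]
best = min(train + shortlist, key=fake_time)
```

In total only 110 of the 3424 configurations are measured. How well this works in practice depends on how closely the model can follow the real performance surface.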
Machine learning an auto-tuner
[Figures: results for a model trained on a random subset of the convolution example (1/32nd), shown in full and zoomed in]
GEMM case-study
GEMM performance comparison (labels in GFLOPS; chart y-axis: performance [% of device-peak]):
• K40m: this work 1933, clBLAS 1197, cuBLAS 2870
• GTX480: this work 845, clBLAS 318, cuBLAS 882
• HD7970: this work 3144, clBLAS 2409, cuBLAS N/A
• Iris: this work 240, clBLAS 56, cuBLAS N/A
[Figure: tiled matrix multiplication on matrices A^T, B and C, with per-workgroup tiling (Mwg, Nwg, Kwg) and per-thread tiling (Mwi, Nwi); unrolled by a factor Kwi; Mwi x MdimC = Mwg]
Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS
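The relation Mwi x MdimC = Mwg from the tiling figure acts as a constraint on the search space: only combinations where MdimC divides Mwg are legal. The sketch below reads MdimC as the number of threads along the M dimension of a work-group, which is an interpretation of the figure, and the candidate values are hypothetical because the slide does not list them.

```python
def work_per_thread(mwg, mdimc):
    """Return Mwi such that Mwi * MdimC == Mwg, or None if MdimC does not divide Mwg."""
    if mwg % mdimc != 0:
        return None
    return mwg // mdimc


# Hypothetical candidate values for the work-group tile and thread count.
candidates = [(mwg, mdimc) for mwg in (16, 32, 64, 128) for mdimc in (8, 16, 32)]
legal = {}
for mwg, mdimc in candidates:
    mwi = work_per_thread(mwg, mdimc)
    if mwi is not None:
        legal[(mwg, mdimc)] = mwi
```

Here the pair (16, 32) is rejected, since 32 threads cannot evenly share a tile of 16 rows. This is the same kind of filtering that reduces a raw parameter product to a smaller set of legal configurations.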
Convolution case-study
Device: GTX480 (chart y-axis: performance [% of device-peak])
• 3x3 filter: 298 GFLOPS, 125 GB/s
• 7x7 filter: 572 GFLOPS, 46 GB/s
• 11x11 filter: 787 GFLOPS, 26 GB/s
Conclusions:
• Different best parameters for different devices (see paper)
• Different best parameters for different filter-sizes
• Performance equal or better than the state-of-the-art [3]
[3]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.
What about CUDA support?
Written for OpenCL, but …
• Uses the high-level OpenCL API CLCudaAPI
• And CLCudaAPI also has a CUDA back-end!
• Switch between OpenCL and CUDA with a single include
• Uses the CUDA 7.0 driver API and runtime compilation library (nvrtc)
Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning
Auto-tuning GPU kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent
Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation
Machine-learning:
• Train a model on a small subset
• Use the model to predict the remainder
Case-studies:
• Fastest 2D convolution
• CLBlast matrix-multiplication library
Source-code on GitHub: https://github.com/CNugteren/CLTune
[More info]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.
Slides available @ www.cedricnugteren.nl