Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA
Axel Eirola
28.01.2010
Table of Contents
Introduction
Multiplication with CPU: Naive implementation; IT++
Multiplication with CUDA: Naive implementation; Using Shared Memory; Optimising Block Size; CUBLAS
Discussion
Introduction
I Matrix multiplication for square, uniformly random matrices
I C = AB where A, B, C ∈ R^(n×n)
I Synthetic benchmarking, since we do not know anything about the matrices
I In real-life problems we usually have information about the matrix; it can be symmetric or orthogonal, or have some other pattern which can be exploited in the computations
About the benchmarks
I Executed on miranda
I CPU code in C++, GPU code in CUDA
I Measurements average of 5 runs after one warm-up run
I Calculations performed in single-precision floating point
I Only the actual calculation is timed, no allocation or copying between host and device
I Matrices of sizes 100×100 (40 kB) to 4000×4000 (64 MB) were used
Naive CPU implementation
I Simple "by definition" implementation
I Loops through the elements of the output matrix C and calculates each element separately
I No multithreading, no smart fetching of elements from A and B (see the sketch after this list)
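A minimal sketch of such a by-definition multiplication, assuming n×n matrices stored row-major in flat arrays (the function name and layout are assumptions of this sketch, not the benchmarked code):

// Naive O(n^3) matrix multiplication: C = A * B for n x n matrices
// stored row-major in flat arrays.
void matmul_naive(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i)        // row of C
        for (int j = 0; j < n; ++j) {  // column of C
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}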
Benchmarks
Figure: Naive CPU implementation (time in ms vs. matrix width; series: CPU Naive)
BLAS library IT++
I A general-purpose linear algebra and signal processing library for C++
I Utilizes underlying BLAS implementations
I Seems to do multithreading and smarter memory management
I Does not seem to use Strassen's (or anyone else's) matrix multiplication algorithm
Benchmarks
Figure: IT++ library (time in ms vs. matrix width; series: CPU Naive, IT++)
Naive GPU implementation
I Trivial reimplementation of the naive CPU code in CUDA
I Replaces the loops with threading, that is, one thread is created for each element of the output matrix C
I All data is retrieved from the global memory of the GPU (see the kernel sketch after this list)
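A sketch of what such a kernel can look like (the kernel name and launch indexing are illustrative, not the seminar's actual code):

__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    // One thread per output element; A and B are read straight
    // from global memory on every access.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}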
Benchmarks
Figure: Naive GPU implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive)
Speed it up with Shared Memory
I The naive GPU implementation only used global memory for accessing matrices A and B
I Since each element is accessed multiple times, it would be faster to store the elements somewhere close, such as the shared memory of the SM (streaming multiprocessor)
I Give each thread block the responsibility to calculate one block of the output matrix C
I Store the data needed to calculate that block in shared memory (a tiled kernel sketch follows this list)
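A sketch in the spirit of the CUDA Programming Guide's shared-memory example, assuming the matrix width n is a multiple of BLOCK_SIZE (the kernel name and the fixed tile width are assumptions of this sketch):

#define BLOCK_SIZE 16  // tile width; assumes n is a multiple of BLOCK_SIZE

__global__ void matMulShared(const float *A, const float *B, float *C, int n)
{
    // Each thread block computes one BLOCK_SIZE x BLOCK_SIZE tile of C.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this C tile.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the tiles are overwritten
    }
    C[row * n + col] = sum;
}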
Benchmarks
Figure: Naive matrix multiplication (diagram reproduced from the CUDA Programming Guide Version 2.3, Figure 3-1, "Matrix Multiplication without Shared Memory")
Benchmarks
Figure: Matrix multiplication with shared memory (diagram reproduced from the CUDA Programming Guide Version 2.3, Figure 3-2, "Matrix Multiplication with Shared Memory")
Benchmarks
Figure: GPU using shared memory (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory)
What can we do with block size?
I The block size determines the number of threads executed by one SM (streaming multiprocessor)
I The total number of threads stays constant (one per output element)
I But the amount of data kept in the shared memory of the SM is increased, decreasing the number of costly accesses to global memory
I Block size is limited to 22, since the maximum number of threads in one block is 512 (22² = 484 and 23² = 529); see the launch sketch after this list
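As a hypothetical illustration, a launch at that limit could reuse the matMulShared sketch from above with BLOCK_SIZE set to 22 (still assuming n is a multiple of the block size):

dim3 threads(22, 22);       // 484 threads per block, under the 512-thread limit
dim3 grid(n / 22, n / 22);  // one thread block per 22x22 tile of C
matMulShared<<<grid, threads>>>(d_A, d_B, d_C, n);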
Benchmarks
Figure: GPU with larger block size (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size)
CUBLAS library
I A C library provided by nVidia implementing the BLAS (Basic Linear Algebra Subprograms) specification
I Could not find what it actually does, but it seems to do something (a usage sketch follows)
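For reference, a single-precision multiply through the legacy CUDA 2.x-era CUBLAS API boils down to one sgemm call; the device buffers d_A, d_B, d_C are assumed to be already allocated and filled:

#include <cublas.h>

// C = 1.0 * A * B + 0.0 * C for n x n matrices.
// BLAS expects column-major storage; 'N' means no transpose.
cublasInit();
cublasSgemm('N', 'N', n, n, n,
            1.0f, d_A, n,
                  d_B, n,
            0.0f, d_C, n);
cublasShutdown();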
Benchmarks
Figure: CUBLAS library implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size, CUBLAS)
Benchmarks
Figure: This is interesting (same plot as above; note the spikes in the CUBLAS curve)
Benchmarks (Zoomed)
Figure: Zoom on spikes (CUBLAS only; time in ms vs. matrix width, 1008-1200)
I CUBLAS is twice as fast when the width of the matrix is divisible by 16
I Noticed by O. Schenk et al. in Algorithmic performance studies on graphics processing units, stating that: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed."
I The question is: why aren't the smaller matrices padded to become divisible by 16? (A sketch of such padding follows.)
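One hypothetical way to apply that padding on the caller's side (illustrative only; nothing here describes CUBLAS's own behaviour):

// Round n up to the next multiple of 16 and allocate a padded,
// zero-filled device buffer, so the leading dimension passed to
// sgemm is divisible by 16.
int n_pad = (n + 15) / 16 * 16;
float *d_A;
cudaMalloc((void **)&d_A, n_pad * n_pad * sizeof(float));
cudaMemset(d_A, 0, n_pad * n_pad * sizeof(float));
// Copy the tightly packed n x n host matrix h_A into the padded
// buffer; cudaMemcpy2D handles the differing strides.
cudaMemcpy2D(d_A, n_pad * sizeof(float),
             h_A, n * sizeof(float),
             n * sizeof(float), n,
             cudaMemcpyHostToDevice);
// ... same for B and C, then call cublasSgemm with lda = n_pad.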
Profit ratio
I A Tesla C1060 costs about $1200, and calculates a 2000×2000 matrix product in 50 ms
I A Core i7 920 costs about $300, and calculates a 2000×2000 matrix product in 2000 ms
I CUBLAS is about 40 times faster than IT++, while a Tesla costs only about 4 times more than a Core i7
I So the profit ratio becomes tenfold: ($300 × 2000 ms) / ($1200 × 50 ms) = 10
Summary
I GPGPU is fast :)
I But without proper memory management it isn't as fast as it could be
I Even the libraries aren’t as fast as they could be