Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA
Axel Eirola
28.01.2010
Table of Contents
Introduction
Multiplication with CPU: Naive implementation; IT++
Multiplication with CUDA: Naive implementation; Using Shared Memory; Optimising Block Size; CUBLAS
Discussion
Introduction
I Matrix multiplication for square, uniformly random matrices
I C = AB where A, B, C ∈ R^(n×n)
I Synthetic benchmarking, since we do not know anything about the matrices
I In real-life problems we usually have information about the matrix; it can be symmetric or orthogonal, or have some other pattern which can be exploited in the computations
About the benchmarks
I Executed on miranda
I CPU code in C++, GPU code in CUDA
I Measurements average of 5 runs after one warm-up run
I Calculations performed in single-precision floating point
I Only the actual calculation is timed, no allocation or copying between host and device
I Matrices of sizes 100×100 (40 kB) to 4000×4000 (64 MB) were used
Naive CPU implementation
I Simple "by definition" implementation
I Loops through the elements of the output matrix C and calculates each element separately
I No multithreading, no smart fetching of elements from A and B (see the sketch after this list)
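A minimal sketch of such a by-definition multiplication, assuming n×n matrices stored row-major in flat arrays (the function name and layout are assumptions of this sketch, not the benchmarked code):

// Naive O(n^3) matrix multiplication: C = A * B for n x n matrices
// stored row-major in flat arrays.
void matmul_naive(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i)        // row of C
        for (int j = 0; j < n; ++j) {  // column of C
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}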
Benchmarks
Figure: Naive CPU implementation (time in ms vs. matrix width; series: CPU Naive)
BLAS library IT++
I A general-purpose linear algebra and signal processing library for C++
I Utilizes underlying BLAS implementations
I Seems to do multithreading and smarter memory management
I Does not seem to use Strassen's (or anyone else's) matrix multiplication algorithm
Benchmarks
Figure: IT++ library (time in ms vs. matrix width; series: CPU Naive, IT++)
Naive GPU implementation
I Trivial reimplementation of the naive CPU code in CUDA
I Replaces the loops with threading, that is, one thread is created for each element of the output matrix C
I All data is retrieved from the global memory of the GPU (see the kernel sketch after this list)
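A sketch of what such a kernel can look like (the kernel name and launch indexing are illustrative, not the seminar's actual code):

__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    // One thread per output element; A and B are read straight
    // from global memory on every access.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}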
Benchmarks
Figure: Naive GPU implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive)
Speed it up with Shared Memory
I The naive GPU implementation only used global memory for accessing matrices A and B
I Since each element is accessed multiple times, it would be faster to store the elements somewhere close, such as the shared memory of the SM (streaming multiprocessor)
I Give each thread block the responsibility to calculate one block of the output matrix C
I Store the data needed to calculate that block in shared memory (a tiled kernel sketch follows this list)
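A sketch in the spirit of the CUDA Programming Guide's shared-memory example, assuming the matrix width n is a multiple of BLOCK_SIZE (the kernel name and the fixed tile width are assumptions of this sketch):

#define BLOCK_SIZE 16  // tile width; assumes n is a multiple of BLOCK_SIZE

__global__ void matMulShared(const float *A, const float *B, float *C, int n)
{
    // Each thread block computes one BLOCK_SIZE x BLOCK_SIZE tile of C.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this C tile.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the tiles are overwritten
    }
    C[row * n + col] = sum;
}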
Benchmarks
Figure: Naive matrix multiplication (diagram reproduced from the CUDA Programming Guide Version 2.3, Figure 3-1, "Matrix Multiplication without Shared Memory")
Benchmarks
Figure: Matrix multiplication with shared memory (diagram reproduced from the CUDA Programming Guide Version 2.3, Figure 3-2, "Matrix Multiplication with Shared Memory")
Benchmarks
Figure: GPU using shared memory (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory)
What can we do with block size?
I The block size determines the number of threads executed by one SM (streaming multiprocessor)
I The total number of threads stays constant (one per output element)
I But the amount of data kept in the shared memory of the SM is increased, decreasing the number of costly accesses to global memory
I Block size is limited to 22, since the maximum number of threads in one block is 512 (22² = 484 and 23² = 529); see the launch sketch after this list
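As a hypothetical illustration, a launch at that limit could reuse the matMulShared sketch from above with BLOCK_SIZE set to 22 (still assuming n is a multiple of the block size):

dim3 threads(22, 22);       // 484 threads per block, under the 512-thread limit
dim3 grid(n / 22, n / 22);  // one thread block per 22x22 tile of C
matMulShared<<<grid, threads>>>(d_A, d_B, d_C, n);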
Benchmarks
Figure: GPU with larger block size (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size)
CUBLAS library
I A C library provided by nVidia implementing the BLAS (Basic Linear Algebra Subprograms) specification
I Could not find what it actually does, but it seems to do something (a usage sketch follows)
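For reference, a single-precision multiply through the legacy CUDA 2.x-era CUBLAS API boils down to one sgemm call; the device buffers d_A, d_B, d_C are assumed to be already allocated and filled:

#include <cublas.h>

// C = 1.0 * A * B + 0.0 * C for n x n matrices.
// BLAS expects column-major storage; 'N' means no transpose.
cublasInit();
cublasSgemm('N', 'N', n, n, n,
            1.0f, d_A, n,
                  d_B, n,
            0.0f, d_C, n);
cublasShutdown();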
Benchmarks
Figure: CUBLAS library implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size, CUBLAS)
Benchmarks
Figure: This is interesting (same plot as above; note the spikes in the CUBLAS curve)
Benchmarks (Zoomed)
Figure: Zoom on spikes (CUBLAS only; time in ms vs. matrix width, 1008-1200)
I CUBLAS is twice as fast when the width of the matrix is divisible by 16
I Noticed by O. Schenk et al. in Algorithmic performance studies on graphics processing units, stating that: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed."
I The question is: why aren't the smaller matrices padded to become divisible by 16? (A sketch of such padding follows.)
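One hypothetical way to apply that padding on the caller's side (illustrative only; nothing here describes CUBLAS's own behaviour):

// Round n up to the next multiple of 16 and allocate a padded,
// zero-filled device buffer, so the leading dimension passed to
// sgemm is divisible by 16.
int n_pad = (n + 15) / 16 * 16;
float *d_A;
cudaMalloc((void **)&d_A, n_pad * n_pad * sizeof(float));
cudaMemset(d_A, 0, n_pad * n_pad * sizeof(float));
// Copy the tightly packed n x n host matrix h_A into the padded
// buffer; cudaMemcpy2D handles the differing strides.
cudaMemcpy2D(d_A, n_pad * sizeof(float),
             h_A, n * sizeof(float),
             n * sizeof(float), n,
             cudaMemcpyHostToDevice);
// ... same for B and C, then call cublasSgemm with lda = n_pad.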
Profit ratio
I A Tesla C1060 costs about $1200, and calculates a 2000×2000 matrix product in 50 ms
I A Core i7 920 costs about $300, and calculates a 2000×2000 matrix product in 2000 ms
I CUBLAS is about 40 times faster than IT++, while a Tesla costs only about 4 times more than a Core i7
I So the profit ratio becomes tenfold: ($300 × 2000 ms) / ($1200 × 50 ms) = 10
Summary
I GPGPU is fast :)
I But without proper memory management it isn't as fast as it could be
I Even the libraries aren’t as fast as they could be