Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA
Axel Eirola
28.01.2010


Page 1

Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA

Axel Eirola

28.01.2010

Page 2

Table of Contents

Introduction

Multiplication with CPU
  Naive implementation
  IT++

Multiplication with CUDA
  Naive implementation
  Using Shared Memory
  Optimising Block Size
  CUBLAS

Discussion

Page 3

Introduction

- Matrix multiplication for square, uniformly random matrices

- C = AB, where A, B, C ∈ R^(n×n)

- Synthetic benchmarking, since we do not know anything about the matrices

- In real-life problems we usually have information about the matrix: it can be symmetric or orthogonal, or have some other pattern that can be exploited in the computations

Page 4

About the benchmarks

- Executed on miranda

- CPU code in C++, GPU code in CUDA

- Measurements are the average of 5 runs after one warm-up run

- Calculations performed in single-precision floating point

- Only the actual calculation is timed; no allocation or copying between host and device

- Matrices of sizes 100 × 100 (40 KB) to 4000 × 4000 (64 MB) were used

Page 5

Naive CPU implementation

- Simple "by definition" implementation (a sketch follows below)

- Loops through the elements of the output matrix C and calculates each element separately

- No multithreading, no smart fetching of elements from A and B
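A minimal sketch of such a by-definition multiplication, assuming square n × n matrices stored as row-major float arrays; the function name and layout are assumptions, not the original benchmark code:

// Plain "by definition" C = A * B for n x n row-major single-precision matrices.
void matmulNaiveCpu(const float *A, const float *B, float *C, int n)
{
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;   // each element is computed independently
        }
    }
}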

Page 6

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive]

Figure: Naive CPU implementation

Page 7

BLAS library IT++

- A general-purpose linear algebra and signal processing library for C++ (usage sketch below)

- Utilizes underlying BLAS implementations

- Seems to do multithreading and smarter memory management

- Does not seem to use Strassen's (or anyone else's sub-cubic) matrix multiplication algorithm
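Roughly how a multiplication through IT++ might look. This is an illustrative sketch rather than the original benchmark code: it uses the library's default double-precision mat type instead of the single-precision setup described earlier, and assumes only the basic randu() generator and operator*:

#include <itpp/itbase.h>
using namespace itpp;

int main()
{
    const int n = 1000;
    mat A = randu(n, n);   // n x n matrix of uniformly random values
    mat B = randu(n, n);
    mat C = A * B;         // the multiplication is handed to the underlying BLAS
    return 0;
}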

Page 8

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++]

Figure: IT++ library

Page 9

Naive GPU implementation

- Trivial reimplementation of the naive CPU code in CUDA (kernel sketch below)

- Replaces the loops with threading: one thread is created for each element of the output matrix C

- All data is retrieved from the global memory of the GPU
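A minimal sketch of such a kernel, assuming row-major n × n float matrices already resident on the device; the function name and the 16 × 16 launch configuration are assumptions, not taken from the original benchmark:

__global__ void matmulNaive(const float *A, const float *B, float *C, int n)
{
    // One thread per element of C; every operand is read from global memory.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Launch with one thread per output element, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
//   matmulNaive<<<grid, block>>>(dA, dB, dC, n);

Every thread re-reads a full row of A and a full column of B from global memory; this is exactly the traffic the shared-memory version on the next slides avoids.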

Page 10

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++, GPU Naive]

Figure: Naive GPU implementation

Page 11

Speed it up with Shared Memory

- The naive GPU implementation only used global memory for accessing matrices A and B

- Since each element is accessed multiple times, it would be faster to store the elements somewhere close, such as the shared memory of the SM (streaming multiprocessor)

- Give each thread block the responsibility of calculating one block of the output matrix C

- Store the data needed to calculate that block in shared memory (see the kernel sketch below)
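A sketch of a tiled kernel along these lines, following the tiling scheme illustrated in the CUDA Programming Guide figure reproduced on Page 13; for brevity it assumes the matrix width n is a multiple of BLOCK_SIZE, which the real code would have to handle:

#define BLOCK_SIZE 16   // tile width; an assumption, and n is assumed to be a multiple of it

__global__ void matmulShared(const float *A, const float *B, float *C, int n)
{
    // Tiles of A and B staged in the SM's shared memory.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk along a block row of A and a block column of B.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product read entirely from shared memory.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}

Each element loaded from global memory is reused BLOCK_SIZE times from shared memory, so global-memory traffic drops by roughly a factor of BLOCK_SIZE compared with the naive kernel.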

Page 12

Benchmarks

[Figure omitted: reproduction of Figure 3-1, "Matrix Multiplication without Shared Memory", from the CUDA Programming Guide Version 2.3 — each thread reads one row of A and one column of B from global memory to compute one element of C]

Figure: Naive matrix multiplication

Page 13

Benchmarks

[Figure omitted: reproduction of Figure 3-2, "Matrix Multiplication with Shared Memory", from the CUDA Programming Guide Version 2.3 — each thread block computes one BLOCK_SIZE × BLOCK_SIZE sub-matrix Csub of C, staging the corresponding tiles of A and B in shared memory]

Figure: Matrix multiplication with shared memory

Page 14

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++, GPU Naive + Shared Memory]

Figure: GPU using shared memory

Page 15

What can we do with the block size?

- The block size is the number of threads in one thread block, which is executed by a single SM (streaming multiprocessor)

- The total number of threads stays constant

- But the amount of data kept in the SM's shared memory increases, which decreases the number of costly accesses to global memory

- The block size is limited to 22 × 22, since the maximum number of threads per block is 512 (22² = 484 fits, 23² = 529 does not); a launch sketch follows below
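A sketch of how the larger block size might be selected at launch time; the helper function is hypothetical, and it assumes the tiled kernel from earlier recompiled with BLOCK_SIZE defined as 22 and a matrix width that is a multiple of 22:

// Hypothetical launch with the larger block size.
void launchLargeBlocks(const float *dA, const float *dB, float *dC, int n)
{
    dim3 block(22, 22);          // 484 threads per block, just under the 512-thread limit
    dim3 grid(n / 22, n / 22);   // one block per 22 x 22 tile of C
    matmulShared<<<grid, block>>>(dA, dB, dC, n);
}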

Page 16

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++, GPU Naive + Shared Memory + large block size]

Figure: GPU with larger block size

Page 17

CUBLAS library

- A C library provided by NVIDIA implementing the BLAS (Basic Linear Algebra Subprograms) specification

- Could not find out what it actually does internally, but it evidently does something right (see the call sketch below)
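A sketch of how the benchmarked multiplication might be issued through CUBLAS, using the legacy interface of that CUDA generation; the wrapper function is an assumption, BLAS's column-major layout is assumed, and error checking is omitted:

#include <cublas.h>

// C = A * B for n x n single-precision matrices, column-major as BLAS expects.
// dA, dB, dC are device pointers.
void matmulCublas(const float *dA, const float *dB, float *dC, int n)
{
    cublasSgemm('N', 'N',        // no transposition of A or B
                n, n, n,         // dimensions m, n, k
                1.0f, dA, n,     // alpha, A and its leading dimension
                      dB, n,     // B and its leading dimension
                0.0f, dC, n);    // beta, C and its leading dimension
}

// cublasInit() must be called once before the first CUBLAS call
// and cublasShutdown() once at the end of the program.

For square matrices of uniform random data the column-major convention does not affect the timing, only the interpretation of the result.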

Page 18

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++, GPU Naive + Shared Memory + large block size, CUBLAS]

Figure: CUBLAS library implementation

Page 19

Benchmarks

[Plot omitted: execution time (ms) vs. matrix width, logarithmic scales; series: CPU Naive, IT++, GPU Naive + Shared Memory + large block size, CUBLAS]

Figure: This is interesting

Page 20

Benchmarks (Zoomed)

[Plot omitted: CUBLAS execution time (ms) for matrix widths 1008–1200]

Figure: Zoom on spikes

Page 21

- CUBLAS is about twice as fast when the width of the matrix is divisible by 16

- Noticed by O. Schenk et al. in Algorithmic Performance Studies on Graphics Processing Units, which states: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed."

- The question is: why aren't the smaller matrices padded to become divisible by 16? (a padding sketch follows below)
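One way such padding could look, sketched under the assumption that the extra rows and columns are zero-filled (so the padded product still contains the original product in its top-left n × n corner) and that matrices are stored row-major; the helper names are hypothetical:

#include <cuda_runtime.h>

// Round a matrix width up to the next multiple of 16.
int padTo16(int n) { return (n + 15) & ~15; }

// Copy an n x n host matrix into a zero-filled np x np device buffer,
// where np = padTo16(n).
float *uploadPadded(const float *hostA, int n, int np)
{
    float *dA;
    cudaMalloc((void **)&dA, (size_t)np * np * sizeof(float));
    cudaMemset(dA, 0, (size_t)np * np * sizeof(float));
    cudaMemcpy2D(dA, np * sizeof(float),       // destination and its row pitch
                 hostA, n * sizeof(float),     // source and its row pitch
                 n * sizeof(float), n,         // bytes per row, number of rows
                 cudaMemcpyHostToDevice);
    return dA;
}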

Page 22

Profit ratio

- A Tesla C1060 costs about $1200 and computes a 2000 × 2000 matrix product in 50 ms

- A Core i7 920 costs about $300 and computes a 2000 × 2000 matrix product in 2000 ms

- CUBLAS is about 40 times faster than IT++, while the Tesla costs only about 4 times more than the Core i7

- So the profit ratio becomes tenfold:

($300 × 2000 ms) / ($1200 × 50 ms) = 10

Page 23

Summary

- GPGPU is fast :)

- But without proper memory management it isn't as fast as it could be

- Even the libraries aren't as fast as they could be