
Lecture 7 – Matrix Multiplication Using Shared Memory
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden

Announcements
•  Special (one-time) office hours on Friday, 2pm to 4pm


Today’s lecture
•  Occupancy
•  Using shared memory to increase memory locality in matrix multiply



Recapping from last time

•  GPUs run 1000s of threads on 13 SMX units, each with 64 double-precision cores, 192 single-precision cores, and 64 special function units
•  Threads communicate slowly via global memory
•  On-chip memories are O(100x) faster: shared memory/cache + registers


The increment benchmark on Sorken
•  Total time: timing taken from the host, includes copying data to the device and copying back the result
•  Device only: time taken on the device only
•  Loop repeats the computation inside the kernel: 1 kernel launch + 1 set of data transfers in and out of the device
•  CUDA bandwidthTest reports 12.1 GB/s Host-Device for 32MB messages (in $PUB/Examples/CUDA/samples)

N = 8388480 (8M ints), block size = 128, times in milliseconds unless noted:

                               Repetitions:    10     100    1000    10^4
  Device time                                 1.46    12.1    119    1.19s
  Kernel launch + data xfer                   9.30    20.1    127    1.19s
  Host                                        44.0     358   3.53s     --
  Device, a[i] = 1 + sin(a[i])                3.31    24.1     233
  Host, a[i] = 1 + sin(a[i])                 3.10s   11.0s   89.5s

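For reference, a minimal sketch of an increment kernel in the spirit of this benchmark (the kernel body, the REPS constant, and the loop placement are illustrative assumptions, not the actual incrArray source):

#define REPS 1000                                     // repetition count (illustrative)

__global__ void incr(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
    if (i < n)
        for (int r = 0; r < REPS; r++)                // repeating inside the kernel means the
            a[i] = a[i] + 1;                          // host pays 1 launch + 1 transfer pair
}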

Naïve implementation of matrix multiply

•  Each thread computes one element of C
   u  Loads a row of matrix A
   u  Loads a column of matrix B
   u  Computes a dot product

•  Every value of A and B is loaded N times from global memory

[Figure: a thread block computing a BLOCK_SIZE × BLOCK_SIZE tile of C = A × B; Thread (2, 2) computes one element of the tile.]

Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC


Naïve Kernel

__global__ void matMul(DOUBLE* C, DOUBLE* A, DOUBLE* B) {
    int I = blockIdx.y*blockDim.y + threadIdx.y;
    int J = blockIdx.x*blockDim.x + threadIdx.x;
    int N = blockDim.y*gridDim.y;           // Assumes a square matrix
    if ((I < N) && (J < N)) {
        DOUBLE _c = 0;
        for (unsigned int k = 0; k < N; k++) {
            DOUBLE a = A[I * N + k];
            DOUBLE b = B[k * N + J];
            _c += a * b;
        }
        C[I * N + J] = _c;
    }
}

Reference serial code:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        DOUBLE sum = 0;
        for (k = 0; k < N; k++)
            sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = (DOUBLE) sum;
    }


Thread mapping

__global__ void matMul(DOUBLE* C, DOUBLE* A, DOUBLE* B) {
    int I = blockIdx.y*blockDim.y + threadIdx.y;
    int J = blockIdx.x*blockDim.x + threadIdx.x;
    int N = blockDim.y*gridDim.y;           // Square matrix
    if ((I < N) && (J < N)) {
        DOUBLE _c = 0;
        for (unsigned int k = 0; k < N; k++) {
            DOUBLE a = A[I * N + k];
            DOUBLE b = B[k * N + J];
            _c += a * b;
        }
        C[I * N + J] = _c;
    }
}

[Figure: mapping of thread indices (tx, ty) onto the WIDTH × WIDTH matrices M, N, P (i.e., A, B, C). Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC]


CUDA code on the host side

unsigned int n2 = N*N*sizeof(DOUBLE);       // Generic types in types.h
DOUBLE *h_A = (DOUBLE*) malloc(n2);
DOUBLE *h_B = (DOUBLE*) malloc(n2);
// Check that allocations went OK
assert(h_A); assert(h_B);
genMatrix(h_A, N, N); genMatrix(h_B, N, N); // Initialize matrices

DOUBLE *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, n2);              // ... likewise for &d_B and &d_C
checkCUDAError("Error allocating device memory arrays");

// copy host memory to device
cudaMemcpy(d_A, h_A, n2, cudaMemcpyHostToDevice);
checkCUDAError("Error copying data to device");
cudaMemcpy(d_B, h_B, n2, cudaMemcpyHostToDevice);
checkCUDAError("Error copying data to device");


Host code – continued

// setup execution configurations
dim3 threads(ntx, nty, 1);                  // ntx & nty are user input
dim3 grid(N / threads.x, N / threads.y);

// launch the kernel
matMul<<< grid, threads >>>(d_C, d_A, d_B);

// retrieve result
cudaMemcpy(h_C, d_C, n2, cudaMemcpyDeviceToHost);
checkCUDAError("Unable to retrieve result from device");

// Free device storage
assert(cudaSuccess == cudaFree(d_A));
assert(cudaSuccess == cudaFree(d_B));
assert(cudaSuccess == cudaFree(d_C));


Configuration variables
•  Types to manage thread geometries
•  dim3 gridDim, blockDim
   u  Dimensions of the grid in blocks
   u  Dimensions of a thread block in threads
•  dim3 blockIdx, threadIdx;
   u  Block index within the grid
   u  Thread index within the block

__global__ void KernelFunc(...);

dim3 DimGrid(2,3);      // 6 thread blocks
dim3 DimBlock(3,5,1);   // 15 threads/block
KernelFunc<<< DimGrid, DimBlock >>>(...);


[Figure: a kernel launch creates a grid of thread blocks on the device, e.g. Grid 1 with Blocks (0,0) through (2,1); Block (1,1) expands into Threads (0,0) through (4,2). David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

Performance
•  N = 512, double precision
•  GPU: 1 device of Sorken
•  CPU: Stampede, Intel Xeon E5-2680 0 @ 2.70GHz, peak 21.6 GF/core

GPU (Sorken):
  Geometry    GFLOPS
  16 x 16     92
  32 x 32     116

CPU (Stampede):
  # cores     GFLOPS
  1           22
  2           43
  4           84
  8           160



Parallel Speedup
•  How much did our GPU implementation improve over the traditional processor?
•  Speedup, S:
   S = running time of the fastest program on conventional processors ÷ running time of the accelerated program
•  Baseline: a multithreaded program

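As a rough worked example using the numbers from the performance slide (an assumption on my part: the GPU and CPU runs perform the same N = 512 computation, so running time is inversely proportional to delivered GFLOPS):

S = \frac{T_{\text{conventional}}}{T_{\text{accelerated}}}
  = \frac{\text{GFLOPS}_{\text{GPU}}}{\text{GFLOPS}_{\text{CPU}}}
  \approx \frac{116}{160} \approx 0.7 \;\;\text{(vs. the 8-core baseline)},
  \qquad \frac{116}{22} \approx 5.3 \;\;\text{(vs. a single core)}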

Measuring performance
•  Two ways
   u  Use an ordinary timer, e.g. gettimeofday()
   u  Use Cuda events/elapsed time (#ifdef CUDA_TIMER)
•  See incrArray
•  Note that kernel invocation is asynchronous

cudaThreadSynchronize();
double t_device_compute = -getTime();
incr<<< nBlocks, bSize >>> (a_d, N);
cudaThreadSynchronize();
t_device_compute += getTime();

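For the CUDA-events route, a minimal sketch of what such timing can look like (the event names and their placement around the launch are illustrative; this is not the course's CUDA_TIMER code):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // enqueue start marker
incr<<< nBlocks, bSize >>> (a_d, N);       // asynchronous kernel launch
cudaEventRecord(stop, 0);                  // enqueue stop marker
cudaEventSynchronize(stop);                // wait until stop has been reached

cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);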

CUDA Error Handling

Cuda error: Can't run kernel: invalid device function.
•  Cuda can silently fail, and you can observe misleading performance
•  E.g. if you specify invalid grid / thread block dimensions
•  Note: the last error can be cleared by successive kernel calls, so check frequently

   cudaMalloc((void **) &a_d, size);
   checkCUDAError("Unable to allocate storage on the device");

•  Consult checkCUDAError() in utils.cu (incrArr)
•  What about asynchronous calls?
•  cf CUDA Programming Guide, “Error Handling”

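A minimal sketch of what an error-checking helper along these lines can look like (not necessarily identical to the checkCUDAError() in utils.cu):

#include <cstdio>
#include <cstdlib>

// Print a message and abort if the most recent CUDA runtime call failed.
void checkCUDAError(const char *msg) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

For errors raised by asynchronous kernel launches, the error only becomes visible after a synchronization point, e.g. cudaDeviceSynchronize(), so checking immediately after the launch may miss it.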


Warp Scheduling
•  Threads are assigned to an SMX in units of a thread block, multiple blocks per SMX
•  Each block is divided into warps of 32 (SIMD) threads, a schedulable unit
   u  A warp becomes eligible for execution when all its operands are available
   u  Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
   u  All threads in a warp execute the same instruction; branches serialize execution (see the toy example below)
•  Multiple warps are simultaneously active, hiding data transfer delays
•  All registers in all the warps are available, 0 overhead scheduling
•  Hardware is free to assign blocks to any SMX
•  There are 4 warp schedulers/SMX

[Figure: an SMX with its MT issue unit and shared memory/L1, with several warps (t0 t1 t2 … tm) interleaved over time.]

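To illustrate the point about branches, a toy kernel (illustrative, not from the lecture) in which the two halves of each warp take different paths, so the hardware executes the two paths one after the other:

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16)      // first half of the warp takes this path...
        x[i] = x[i] * 2.0f;
    else                            // ...then the second half takes this one
        x[i] = x[i] + 1.0f;
}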

Today’s lecture
•  Occupancy
•  Using shared memory to increase memory locality in matrix multiply


Occupancy
•  A minimum number of warps is needed to hide memory latency
•  Occupancy: # active warps ÷ max # warps supported by the vector unit
•  Limited by vector unit resources
   u  Amount of shared memory
   u  Number of registers
   u  Maximum number of threads
•  Consider a kernel (16x16 block size)
   u  Shared memory/block = 2648 bytes
   u  Reg/thread = 38 [38*256 = 9728 < 16k]
   u  # available registers is the limiting factor
•  Tradeoff: more blocks with fewer threads, or more threads with fewer blocks
   u  Locality: want small blocks of data (and hence more plentiful warps) that fit into fast memory
   u  Register consumption
•  Maximizing the occupancy may not maximize performance



Occupancy Calculator

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls


Determining occupancy
•  Recall the definition of occupancy:
   # active warps ÷ max # warps supported by the vector unit
•  NVIDIA provides an occupancy calculator
•  Determine resource usage from nvcc for the provided A2 code on Sorken (capability 3.7):

   nvcc --ptxas-options=-v
   ptxas info : Used 26 registers, 352 bytes cmem[0]


Occupancy calculation with 16 x 16 threads

Occupancy = # active warps per SM ÷ maximum possible # active warps

Physical Limits for GPU Compute Capability 3.5:
  Threads per Warp                                  32
  Warps per Multiprocessor                          64
  Threads per Multiprocessor                        2048
  Thread Blocks per Multiprocessor                  16
  Total # of 32-bit registers per Multiprocessor    65536
  Register allocation unit size                     256
  Register allocation granularity                   warp
  Registers per Thread                              255
  Shared Memory per Multiprocessor (bytes)          49152
  Shared Memory allocation unit size                256
  Warp allocation granularity                       4
  Maximum Thread Block Size                         1024

Allocated Resources                                       Per Block   Limit Per SM   = Allocatable Blocks Per SM
  Warps (Threads Per Block / Threads Per Warp)                8            64                   8
  Registers (warp limit per SM due to per-warp reg count)     8            64                   8
  Shared Memory (bytes)                                       0          49152                 16
Note: SM is an abbreviation for (Streaming) Multiprocessor

Maximum Thread Blocks Per Multiprocessor                  Blocks/SM  * Warps/Block  = Warps/SM
  Limited by Max Warps or Max Blocks per Multiprocessor       8            8             64
  Limited by Registers per Multiprocessor                     8            8             64
  Limited by Shared Memory per Multiprocessor                16            8              0
Note: the occupancy limiter is highlighted (in orange) in the calculator
Physical Max Warps/SM = 64
Occupancy = 64 / 64 = 100%

CUDA GPU Occupancy Calculator
  1.)  Select Compute Capability: 3.5
  1.b) Select Shared Memory Size Config (bytes): 49152
  2.)  Enter your resource usage:
       Threads Per Block                 256
       Registers Per Thread              26
       Shared Memory Per Block (bytes)   0
  3.)  GPU Occupancy Data is displayed here and in the graphs:
       Active Threads per Multiprocessor        2048
       Active Warps per Multiprocessor          64
       Active Thread Blocks per Multiprocessor  8
       Occupancy of each Multiprocessor         100%

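The same number can also be obtained programmatically; below is a minimal sketch using the CUDA occupancy API (this API is not part of the lecture, and dummyKernel is a stand-in for a real kernel such as matMul):

#include <cstdio>

// Stand-in kernel so the example is self-contained; substitute your own kernel.
__global__ void dummyKernel(double *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 0.0;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;   // 16 x 16 threads, as in the calculation above
    int blocksPerSM = 0;   // active blocks per multiprocessor for this kernel
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %d / %d warps = %.0f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}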

Full occupancy


[Charts from the occupancy calculator: "Impact of Varying Block Size" plots multiprocessor warp occupancy (# warps, 0–64) against threads per block (0–1024), with the current block size of 256 marked; "Impact of Varying Register Count Per Thread" plots warp occupancy against registers per thread (0–256), with the current register count of 26 marked.]

Today’s lecture
•  Occupancy
•  Using shared memory to increase memory locality in matrix multiply


Kepler’s Memory Hierarchy
•  DRAM takes hundreds of cycles to access
•  Each SMX has on-chip memory that can be configured as part shared memory, part L1 cache: {¾ + ¼}, {¼ + ¾}, {½ + ½}; set the mode using cudaFuncSetCacheConfig()
•  All threads in a block share this memory, hence so does the set of warps making up the block
•  L2 cache shared by all SMXs (1.5 MB)
•  Cache inclusion (L1 ⊂ L2) partially configurable on a per-access basis with memory-reference instruction modifiers
•  128 byte cache line size

B. Wilkinson
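A minimal sketch of the cache-configuration call mentioned above (it assumes matMul is the kernel defined earlier; the mapping of enum values to Kepler's split ratios is the standard CUDA one):

// Prefer 48 KB shared memory + 16 KB L1 for matMul ({3/4 + 1/4} split).
cudaFuncSetCacheConfig(matMul, cudaFuncCachePreferShared);

// Other choices:
//   cudaFuncCachePreferL1     -> 16 KB shared + 48 KB L1   ({1/4 + 3/4})
//   cudaFuncCachePreferEqual  -> 32 KB shared + 32 KB L1   ({1/2 + 1/2})
//   cudaFuncCachePreferNone   -> no preference (driver default)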

Recall Blocked Matrix Multiplication
N blocks, n×n global matrix, b = n/N = block size

for i = 0 to N-1
    for j = 0 to N-1
        // load each block C[i,j] into cache, once : n^2
        for k = 0 to N-1
            // load each block A[i,k] and B[k,j] N^3 times = 2N^3 × (n/N)^2 = 2Nn^2
            C[i,j] += A[i,k] * B[k,j]       // do the block matrix multiply
        // write each block C[i,j] once : n^2

[Figure: block update C[i,j] = C[i,j] + A[i,k] * B[k,j]]

Total: (2N + 2) n^2

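To make the savings concrete, a quick back-of-the-envelope comparison (my assumption: n = 512 and b = 16, so N = n/b = 32, and the naive kernel reads one row of A and one column of B per element of C):

\text{naive traffic} \approx 2n^3 = 2\cdot 512^3 \approx 2.7\times 10^8 \text{ words},
\qquad
\text{blocked traffic} = (2N+2)\,n^2 = 66\cdot 512^2 \approx 1.7\times 10^7 \text{ words}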

Improving locality in matrix multiply
•  Naïve algorithm
   u  Each thread loads all the data it needs, independently loading a row and a column of input
   u  Each input element is loaded multiple times
   u  Each thread computes 1 MAD + 2 loads + 1 store
•  Blocked algorithm with shared memory (a kernel along these lines is sketched below)
   u  Threads cooperate to load a block of A & B into on-chip shared memory
   u  Each thread in the block performs the ijk loop within shared memory
   u  Each thread: b mpy-adds + 1 load + 1 store

[Figure: a thread block computing a BLOCK_SIZE × BLOCK_SIZE tile of C; Thread (2, 2) computes one element.]

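A minimal sketch of the kind of shared-memory (tiled) kernel this leads to; it is not the course-provided code, and it assumes a square matrix whose dimension N is a multiple of BLOCK_SIZE:

#define BLOCK_SIZE 16

__global__ void matMulShared(double *C, double *A, double *B, int N) {
    __shared__ double As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ double Bs[BLOCK_SIZE][BLOCK_SIZE];

    int ty = threadIdx.y, tx = threadIdx.x;
    int I = blockIdx.y * BLOCK_SIZE + ty;      // row of C this thread computes
    int J = blockIdx.x * BLOCK_SIZE + tx;      // column of C this thread computes
    double c = 0.0;

    for (int kb = 0; kb < N / BLOCK_SIZE; kb++) {
        // Threads cooperate: each loads one element of the A tile and one of the B tile
        As[ty][tx] = A[I * N + kb * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(kb * BLOCK_SIZE + ty) * N + J];
        __syncthreads();                       // wait until both tiles are loaded

        for (int k = 0; k < BLOCK_SIZE; k++)   // multiply the two tiles from shared memory
            c += As[ty][k] * Bs[k][tx];
        __syncthreads();                       // finish with the tiles before reloading them
    }
    C[I * N + J] = c;
}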