Page 1

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt

University of British Columbia

Page 2

• GPUs and CPUs on a collision course
  – First GPUs with programmable shaders appeared in 2001
  – Today: a TeraFLOP on a single card. Turing complete. Highly accessible: senior undergraduate students can learn to program CUDA in a few weeks (though not high-performance code).
  – Rapidly growing set of CUDA applications (209 listed on NVIDIA's CUDA website in February).
  – With OpenCL, we can safely expect the number of non-graphics applications written for GPUs to explode.
• GPUs are massively parallel systems:
  – Multicore + SIMT + fine-grained multithreading

Page 3

No detailed academic simulator for studying this?!?

Page 4

GPGPU-Sim

• An academic, detailed ("cycle-level") timing simulator developed from the ground up at the University of British Columbia (UBC) for modeling a modern GPU running non-graphics workloads.
• Relatively accurate (even though no effort was expended trying to make it match real hardware more closely).

Page 5

GPGPU-Sim

• Currently supports CUDA version 1.1 applications "out of the box".
• Microarchitecture model
  – Based on the notion of "shader cores", which approximate the "Streaming Multiprocessor" of NVIDIA's GeForce 8 series and above.
  – Shader cores connect to memory controllers through a detailed network-on-chip simulator (Dally & Towles' booksim).
  – Detailed DRAM timing model (everything except refresh).
• GPGPU-Sim v2.0b available: www.gpgpu-sim.org

Page 6

Rest of this talk

• Obligatory brief introduction to CUDA
• GPGPU-Sim internals (the 100,000-foot view)
  – Simulator software overview
  – Modeled microarchitecture
• Some results from the paper

Page 7

CUDA Example

Runs on the CPU:

main()
{
    cudaMalloc((void**) &d_idata, bytes);
    cudaMalloc((void**) &d_odata, maxNumBlocks*sizeof(int));
    cudaMemcpy(d_idata, h_idata, bytesin, cudaMemcpyHostToDevice);

    // <<<grid dim, block dim, shared mem bytes>>> launches
    // nblocks x nthreads copies of reduce() that run in parallel on the GPU
    reduce<<< nblocks, nthreads, smemSize >>>(d_idata, d_odata);
    cudaThreadSynchronize();

    // copy the per-block results back (destination argument comes first)
    cudaMemcpy(h_odata, d_odata, bytesout, cudaMemcpyDeviceToHost);
}

Runs on the GPU:

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];   // sized by smemSize at launch

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = g_idata[i];         // each thread loads one element
    __syncthreads();

    // tree reduction in the CTA's shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if ((tid % (2*s)) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // one result per block
}

Page 8

Normal CUDA Flow

• Applications are written in a mixture of C/C++ and CUDA.
• "nvcc" takes CUDA (.cu) files and generates host C code plus "Parallel Thread eXecution" (PTX) assembly language.
• The PTX is passed to the assembler/optimizer "ptxas" to generate machine code, which is packed into a C array (not human readable).
• The whole thing is combined and linked against the CUDA runtime API using a regular C/C++ compiler and linker.
• Run your app on the GPU.

Page 9

GPGPU-Sim Flow

• Uses CUDA's nvcc to generate host C code and PTX.
• A flex/bison parser reads in the PTX.
• The host (CPU) code and the simulator are linked together into one binary.
• CUDA API calls are intercepted using a custom libcuda that implements the functions declared in the header files that come with CUDA (sketched below).
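To make that interception concrete, here is a minimal sketch of how one runtime call might be taken over. This is not GPGPU-Sim's actual code: the simulated-address bookkeeping (g_next_free) is a hypothetical placeholder.

// Built into a replacement "libcuda": because it defines the same functions
// the CUDA headers declare, the application links against the simulator
// instead of the real driver.
#include <cstddef>
#include <cstdio>

typedef int cudaError_t;                  // normally supplied by CUDA's headers
static const cudaError_t cudaSuccess = 0;

static size_t g_next_free = 0x10000000;   // next free *simulated* device address

extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    // Hand out an address in the simulated GPU address space; no real device
    // is touched. The timing model would record this allocation.
    *devPtr = reinterpret_cast<void *>(g_next_free);
    g_next_free += size;
    printf("[sim] cudaMalloc(%zu bytes)\n", size);
    return cudaSuccess;
}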

Page 10

GPGPU-Sim Microarchitecture

• A set of "shader cores" connected to a set of memory controllers via a detailed interconnection network model (booksim).
• Memory controllers reorder requests to reduce activate/precharge overheads (see the sketch after this list).
• The topology and bandwidth of the interconnect can be varied.
• Cache for global memory operations.
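The reordering policy is FR-FCFS (first-ready, first-come first-served; see the simulation setup table later in the talk): among queued requests, prefer one that hits the DRAM row already open in its bank, since that avoids a precharge and activate; otherwise take the oldest request. A minimal sketch of the idea, with a hypothetical per-bank queue and a toy address-to-row mapping:

#include <cstdint>
#include <deque>

struct Request {
    std::uint64_t addr;
    std::uint64_t row() const { return addr >> 12; }  // toy address-to-row mapping
};

struct Bank {
    std::uint64_t open_row = ~0ull;   // row currently held in the row buffer
    std::deque<Request> queue;        // pending requests, in arrival order

    // FR-FCFS: service the oldest request that hits the open row, since it
    // needs no precharge/activate; otherwise fall back to the oldest overall.
    bool schedule(Request &out) {
        if (queue.empty()) return false;
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            if (it->row() == open_row) {    // row-buffer hit: serve it first
                out = *it;
                queue.erase(it);
                return true;
            }
        }
        out = queue.front();                // no hit: oldest request wins;
        queue.pop_front();                  // it pays precharge + activate
        open_row = out.row();
        return true;
    }
};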

Page 11

Shader Core Details

• A shader core is roughly a "Streaming Multiprocessor" in NVIDIA terminology.
• Scalar threads are grouped together into SIMD units called "warps" (32 threads per warp on current NVIDIA hardware); warps are grouped into CTAs, and CTAs into "grids". (A small illustration follows this list.)
• The set of warps on a core is fine-grain interleaved on the pipeline to hide off-chip memory access latency.
• Threads in one CTA can communicate via an on-chip 16KB "shared memory".
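As a small illustration of that grouping (not from the paper), each scalar thread can compute which warp and lane it occupies; warpSize is a CUDA built-in equal to 32 on the hardware discussed here:

__global__ void whoAmI(int *warp_of_thread, int *lane_of_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global scalar thread id
    warp_of_thread[tid] = threadIdx.x / warpSize;      // warp index within this CTA
    lane_of_thread[tid] = threadIdx.x % warpSize;      // lane within the warp;
                                                       // lanes execute in lockstep
}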

Page 12

Interconnection Network

• Baseline: Mesh
• Variations: Crossbar, Ring, Torus

(Figure: baseline mesh memory controller placement)

Page 13

Are more threads better?

• More CTAs on a core
  – Helps hide the latency when some warps wait at barriers
  – Can increase memory latency tolerance
  – Needs more resources (a back-of-envelope sketch follows this list)
• Fewer CTAs on a core
  – Less contention in the interconnection and memory system
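As a back-of-envelope illustration of the resource constraint, here is a sketch using the per-core limits from the simulation setup later in the talk (1024 threads, 8 CTAs, 16384 registers, 16KB shared memory); the example kernel parameters are made up:

#include <algorithm>
#include <cstdio>

// How many CTAs fit on one core, given what each CTA consumes.
int max_ctas_per_core(int threads_per_cta, int regs_per_thread, int smem_per_cta)
{
    int by_threads = 1024 / threads_per_cta;                      // thread slots
    int by_regs    = 16384 / (threads_per_cta * regs_per_thread); // register file
    int by_smem    = 16 * 1024 / (smem_per_cta > 0 ? smem_per_cta : 1);
    return std::min({by_threads, by_regs, by_smem, 8});           // 8 = hard CTA limit
}

int main()
{
    // e.g. 256-thread CTAs using 20 registers/thread and 4KB shared memory each:
    printf("%d CTAs/core\n", max_ctas_per_core(256, 20, 4096));   // prints 3
    return 0;
}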

Page 14

Memory Access Coalescing

• Grouping accesses from multiple, concurrently issued scalar threads into a single access to a contiguous memory region (see the sketch below)
• Always done within a single warp
• Coalescing among multiple warps
  – We explore its performance benefits
  – It is more expensive to implement
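A small sketch (not from the paper) contrasting an access pattern that intra-warp coalescing can merge into a few wide requests with one it cannot:

__global__ void coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];    // consecutive lanes touch consecutive words:
                       // the warp's accesses merge into wide contiguous requests
}

__global__ void uncoalesced(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];    // with a large stride, lanes touch words far apart:
                       // the warp generates many separate memory requests
}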

Page 15

Simulation Setup

Number of shader cores                28
Warp size                             32
SIMD pipeline width                   8
Threads / CTAs / registers per core   1024 / 8 / 16384
Shared memory per core                16KB (16 banks)
Constant cache per core               8KB (2-way set assoc., 64B lines, LRU)
Texture cache per core                64KB (2-way set assoc., 64B lines, LRU)
Memory channels                       8
Bandwidth per memory module           8 bytes/cycle
DRAM request queue size               32
Memory controller                     Out of order (FR-FCFS)
Branch divergence handling            Immediate post-dominator
Warp scheduling policy                Round-robin among ready warps
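To make the "immediate post-dominator" entry concrete, here is a tiny example (not from the paper) of the branch divergence it handles: lanes of a warp take different sides of the if, execute each side serially under a mask, and reconverge at the first statement both paths share, the branch's immediate post-dominator.

__global__ void divergent(int *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        data[tid] *= 2;   // even lanes run while odd lanes are masked off
    else
        data[tid] += 1;   // then odd lanes run while even lanes are masked
    data[tid] -= 3;       // immediate post-dominator: all lanes active again
}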

Page 16

Benchmark Selection

• Applications developed by third-party researchers
  – Less than 50x reported speedups
• Plus some applications from the CUDA SDK

Page 17

Benchmarks (more info in the paper)

Benchmark               Abbr.   Claimed Speedup
AES Cryptography        AES     12x
Breadth First Search    BFS     2x-3x
Coulombic Potential     CP      647x
gpuDG                   DG      50x
3D Laplace Solver       LPS     50x
LIBOR Monte Carlo       LIB     50x
MUMmerGPU               MUM     3.5x-10x
Neural Network          NN      10x
N-Queens Solver         NQU     2.5x
Ray Tracing             RAY     16x
StoreGPU                STO     9x
Weather Prediction      WP      20x

Page 18

Interconnection Network Latency Sensitivity

• A slight increase in interconnection latency has no severe effect on overall performance
  – No need to overdesign the interconnect to decrease latency

Page 19

Interconnection Network Bandwidth Sensitivity

• Low bandwidth (8B/cycle channels) decreases performance substantially
• Very high bandwidth merely moves the bottleneck elsewhere

Page 20

Effects of Varying the Number of CTAs

• Most benchmarks do not benefit substantially
• Some benchmarks even perform better with fewer concurrent threads (e.g. AES)
  – Less contention in DRAM

Page 21

More insights and data in the paper…

Page 22

Summary

• GPGPU-Sim: a novel GPU simulator
  – Capable of simulating CUDA applications
  – www.gpgpu-sim.org
• Performance of the simulated applications is
  – More sensitive to bisection bandwidth
  – Less sensitive to (zero-load) latency
• Sometimes running fewer CTAs can improve performance (less DRAM contention)


Page 25

Interconnect Topology (Fig. 9)

Page 26

ICNT Latency and BW Sensitivity (Figs. 10-11)

Page 27

Mem Controller Optimization Effects (Fig. 12)

Page 28

DRAM Utilization and Efficiency (Figs. 13-14)

Page 29

L1 / L2 Cache (Fig. 15)

Page 30

Varying CTAs (Fig. 16)

Page 31

Inter-Warp Coalescing (Fig. 17)