ECE 8823A
GPU Architectures
Module 5: Execution and Resources - I
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands on Approach,” Chapter 6
• CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
Objective
• To understand the implications of programming model constructs on demand for execution resources
• To be able to reason about the performance consequences of programming model parameters
– Thread blocks, warps, memory behaviors, etc.
– Need a deeper understanding of the architecture to be really valuable (later)
• To understand DRAM bandwidth
– Cause of the DRAM bandwidth problem
– Programming techniques that address the problem: memory coalescing, corner turning
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Closer Look: Formation of Warps
• How do you form warps out of multidimensional arrays of threads?
– Linearize thread IDs
[Figure: Grid 1 with Blocks (0,0), (1,0), (0,1), (1,1); Block (1,1) expanded into a 3D thread block with Thread(0,0,0) through Thread(1,0,3), shown next to a 1D thread block divided into warps]
Formation of Warps
[Figure: Block (1,1) of Grid 1, a 3D thread block, flattened into linear order]
Linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3
Mapping Thread Blocks to Warps
[Figure: a thread block with corner threads T0,0, T0,3, T7,0, and T7,3, split into Warp 0 (T0,0 through T3,3) and Warp 1 (through T7,3)]
An example with a warp size of 16 threads
• Follow row major order through the Z-dimension
• Linearize and then split into warps
• Understanding becomes important when optimizing global memory accesses
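The linearize-then-split rule can be sketched on the host in plain C++. This is a minimal sketch, not hardware code: the function names `linearThreadId` and `warpIndex` are hypothetical, and it assumes x varies fastest, then y, then z, which matches the linear order shown for the 3D block above.

```cpp
#include <cassert>

// Linear index of a thread inside its block: x varies fastest, then y, then z.
// dimX and dimY are the block dimensions in x and y (hypothetical helper).
int linearThreadId(int x, int y, int z, int dimX, int dimY) {
    assert(x >= 0 && x < dimX && y >= 0 && y < dimY && z >= 0);
    return (z * dimY + y) * dimX + x;
}

// Warp that a linearized thread index falls into.
int warpIndex(int linearId, int warpSize) {
    assert(linearId >= 0 && warpSize > 0);
    return linearId / warpSize;
}
```

For the 2x2x4 block above, thread T0,1,3 (z=0, y=1, x=3) lands at linear index 7. For the 16-thread-warp example, reading T3,3 as row 3, column 3 of a block 4 threads wide gives index 15, the last slot of Warp 0.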
Execution of Warps
• Each warp is executed as a SIMD bundle
• How do we handle divergent control flow among threads in a warp?
– Execution semantics
– How is it implemented? (later)
– How can we optimize against it?
Impact of Control Divergence
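• Occurs within a warp
• Branches lead to serialization of branch-dependent code
– Performance issue: low warp utilization
[Figure: a warp executing if (…) {…} else {…}; the two paths are serialized, threads not on the current path sit idle, and the warp reconverges after the branch]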
Causes
• Traditional nested branches
• Loops
– Variable number of iterations/thread
– Loop condition based on thread ID?
• Switching on thread ID
if (threadIdx.x > 5) {}
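To see how a thread-ID test like the one above splits warps, here is a small host-side C++ sketch. The function name `divergentWarps` is hypothetical, and the sketch assumes a 1D block whose warps are formed from consecutive thread indices.

```cpp
#include <cassert>

// Number of warps in a 1D block for which (threadIdx.x > threshold) is true
// for some threads and false for others, i.e. the warp diverges.
int divergentWarps(int blockSize, int warpSize, int threshold) {
    assert(blockSize > 0 && warpSize > 0);
    int divergent = 0;
    for (int first = 0; first < blockSize; first += warpSize) {
        int last = first + warpSize - 1;
        if (last >= blockSize) last = blockSize - 1;
        // The predicate is monotone in the thread index, so a warp diverges
        // exactly when its first and last threads disagree.
        bool firstTaken = first > threshold;
        bool lastTaken = last > threshold;
        if (firstTaken != lastTaken) ++divergent;
    }
    return divergent;
}
```

With a threshold of 5 only the first warp diverges. Moving the threshold to a warp boundary (for example 31 with 32-thread warps) removes the divergence entirely.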
Control Divergence Mitigation: Algorithmic Approach
Benefits of SIMD execution + flexibility of MIMD control flow
Can algorithmic techniques maximize the utilization achieved by a warp?
Reduction
• A commonly used strategy for processing large input data sets
– There is no required order of processing elements in a data set (associative and commutative)
– Partition the data set into smaller chunks
– Have each thread process a chunk
– Use a reduction tree to summarize the results from each chunk into the final answer
• We will focus on the reduction tree step for now.
• Google and Hadoop MapReduce frameworks are examples of this pattern
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
A parallel reduction tree algorithm performs N-1 Operations in log(N) steps
[Figure: max reduction tree]
Input: 3 1 7 0 4 1 6 3
Step 1 (max of adjacent pairs): 3 7 4 6
Step 2: 7 6
Step 3: 7
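The N-1 operations in log(N) steps claim can be checked with a sequential C++ sketch of the tree. This is an illustration only: `TreeResult` and `reduceMaxTree` are hypothetical names, and the input length is assumed to be a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct TreeResult { int value; int operations; int steps; };

// Pairwise max-reduction tree over a non-empty input whose length is a power
// of two. Each step halves the number of live values; each pair costs one max.
TreeResult reduceMaxTree(std::vector<int> data) {
    assert(!data.empty() && (data.size() & (data.size() - 1)) == 0);
    TreeResult r{0, 0, 0};
    while (data.size() > 1) {
        std::vector<int> next(data.size() / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            next[i] = std::max(data[2 * i], data[2 * i + 1]);
            ++r.operations;
        }
        data = next;
        ++r.steps;
    }
    r.value = data[0];
    return r;
}
```

For the eight inputs in the figure this performs 4 + 2 + 1 = 7 max operations in 3 steps. On a GPU the operations within one step would run in parallel.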
Reduction: Approach 1
extern __shared__ float partialsum[];  // size supplied at kernel launch
// ... (loading of the input into partialsum is elided on the slide)
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialsum[t] += partialsum[t+stride];
}
[Figure: data in shared memory, elements 0 to 7, reduced by the threads of one thread block (indexed by threadIdx.x): step 1 forms 0+1, 2+3, 4+5, 6+7; step 2 forms 0..3 and 4..7; step 3 forms 0..7]
• O(N) additions and therefore work efficient?
• Hardware efficiency?
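The hardware-efficiency question can be made concrete by counting, for each stride, how many warps contain a mix of active and idle threads under the Approach 1 predicate `t % (2*stride) == 0`. A minimal host-side C++ sketch follows; `divergentWarpsApproach1` is a hypothetical name, and warps are assumed to be formed from consecutive thread indices.

```cpp
#include <cassert>

// For Approach 1 (thread t is active when t % (2*stride) == 0), count the
// warps that hold both active and idle threads at the given stride.
int divergentWarpsApproach1(int blockSize, int warpSize, int stride) {
    assert(blockSize > 0 && warpSize > 0 && stride > 0);
    int divergent = 0;
    for (int first = 0; first < blockSize; first += warpSize) {
        int active = 0;
        int total = 0;
        for (int t = first; t < first + warpSize && t < blockSize; ++t) {
            if (t % (2 * stride) == 0) ++active;
            ++total;
        }
        if (active > 0 && active < total) ++divergent;
    }
    return divergent;
}
```

For 512 threads and 32-thread warps, all 16 warps are divergent at stride 1 and remain divergent until the stride reaches the warp size. The active threads are spread thinly across every warp instead of being packed together.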
A Better Strategy
• Principle: Shift the index usage to ensure high thread utilization in a warp
– Remap thread indices
• Keep the active threads consecutive
An Example of 16 threads
[Figure: Threads 0 to 15 each add element t+16 to element t (0+16 through 15+31); the active threads are consecutive, so there is no divergence]
Reduction: Approach 2
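extern __shared__ float partialsum[];  // size supplied at kernel launch
// ... (loading of the input into partialsum is elided on the slide)
// Assumes partialsum holds 2*blockDim.x elements, as in the 16-thread example.
unsigned int t = threadIdx.x;
// stride > 0 so that the final step (stride == 1) also runs
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialsum[t] += partialsum[t+stride];
}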
• Difference is in which threads diverge!
• For a thread block of 512 threads
– Threads 0-255 take the branch, 256-511 do not
• For a warp size of 32, all threads in a warp have identical branch conditions → no divergence!
• When #active threads < warp size → old problem
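The same warp-counting exercise for the Approach 2 predicate `t < stride` shows where divergence disappears and where it returns. This is a host-side C++ sketch with a hypothetical function name, again assuming warps are formed from consecutive thread indices.

```cpp
#include <cassert>

// For Approach 2 (thread t is active when t < stride), count the warps that
// hold both active and idle threads at the given stride.
int divergentWarpsApproach2(int blockSize, int warpSize, int stride) {
    assert(blockSize > 0 && warpSize > 0 && stride > 0);
    int divergent = 0;
    for (int first = 0; first < blockSize; first += warpSize) {
        int active = 0;
        int total = 0;
        for (int t = first; t < first + warpSize && t < blockSize; ++t) {
            if (t < stride) ++active;
            ++total;
        }
        if (active > 0 && active < total) ++divergent;
    }
    return divergent;
}
```

With 512 threads and 32-thread warps, strides of 256 down to 32 leave every warp either fully active or fully idle. Once the stride drops below the warp size, the first warp is partially active again, which is the "old problem" noted above.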
Global Memory Bandwidth
• How can we map thread access patterns to global memory addresses to maximize bandwidth utilization?
• Need to understand the organization of DRAMs!
– Hierarchy of latencies
Basic Organization
[Figure: DRAM core array with decoder, sense amps and buffer, mux, and I/O pins]
Example: 32x32 = 1024 bit array
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
1Gb Micron DDR2 SDRAM
[Figure labels: row access time, column access time]
Technology Trends
Past two decades:
• Data rate increase ~1000x
• RAS/CAS latency decrease = 56%
Courtesy: Synopsys DesignWare Technical Bulletin
How? → increasing burst length
DRAM Bursting for an 8x2 Bank
[Figure: two timing diagrams over time, labeled non-burst timing and burst timing; each shows address bits to decoder, core array access delay, and 2-bit transfers to the pins]
Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: Bank 0 and Bank 1, each with its own decoder, sense amps, and mux]
DRAM Bursting for the 8x2 Bank
[Figure: timing diagrams over time showing address bits to decoder, core array access delay, and 2-bit transfers to the pins]
Single-bank burst timing: dead time on interface
Multi-bank burst timing: reduced dead time
First-order Look at the GPU off-chip memory subsystem
• NVIDIA V100 (Volta) GPU:
– Peak global memory bandwidth = 900 GB/s
• Global memory (HBM2) interface @ 4096 bits
• Prior generation GPUs (e.g., Kepler): 384-bit wide GDDR5 @ 224 GB/s
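A rough way to read these numbers is peak bandwidth = (bus width in bytes) x (per-pin data rate). The C++ sketch below uses only the figures quoted above. The function names are hypothetical, and the per-pin rates it derives are implied values, not datasheet specifications.

```cpp
#include <cassert>

// Peak bandwidth in GB/s for an interface that is busWidthBits wide, with each
// pin moving gbitPerSecPerPin gigabits per second.
double peakBandwidthGBs(int busWidthBits, double gbitPerSecPerPin) {
    assert(busWidthBits > 0 && gbitPerSecPerPin > 0.0);
    return busWidthBits / 8.0 * gbitPerSecPerPin;
}

// Per-pin data rate (Gbit/s) implied by a quoted peak bandwidth and bus width.
double impliedPinRateGbps(double peakGBs, int busWidthBits) {
    assert(peakGBs > 0.0 && busWidthBits > 0);
    return peakGBs * 8.0 / busWidthBits;
}
```

Under this model the 4096-bit HBM2 interface needs only about 1.76 Gbit/s per pin to reach 900 GB/s, while the 384-bit GDDR5 interface needs about 4.67 Gbit/s per pin for 224 GB/s. The newer part gets its bandwidth from width rather than from faster pins.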
Multiple Memory Channels
• Divide the memory address space into N parts
– N is the number of memory channels
– Assign each portion to a channel
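One common way to assign portions of the address space to channels is to interleave fixed-size chunks across them. The slide does not say which mapping a given GPU uses, so the C++ sketch below is only an illustration: `channelOf` is a hypothetical helper, and the 256-byte chunk size used in the example is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Channel that serves a byte address when the address space is interleaved
// across numChannels channels in chunks of interleaveBytes.
// Both parameters are illustrative assumptions, not values from the slides.
int channelOf(std::uint64_t address, int numChannels, std::uint64_t interleaveBytes) {
    assert(numChannels > 0 && interleaveBytes > 0);
    return static_cast<int>((address / interleaveBytes) % numChannels);
}
```

With 4 channels and 256-byte chunks, bytes 0-255 go to channel 0, 256-511 to channel 1, and the pattern wraps at byte 1024, so a long sequential access keeps all four channels busy.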
[Figure: four memory channels, Channel 0 to Channel 3, each connected to a bank]
“You can buy bandwidth but you can’t bribe God”
-- Unknown
Lessons
• Organize data accesses to maximize burst mode bandwidth
– Access consecutive locations
– Algorithmic strategies + data layout
• Thread blocks issue warp-size load/store instructions
– 32 addresses for a warp size of 32
– Coalesce these accesses to create a smaller number of memory transactions → maximize memory bandwidth
– More later as we discuss microarchitecture
Memory Coalescing
• Memory references are coalesced into a sequence of memory transactions
– Accesses to a segment are coalesced (e.g., 128-byte segments)
• Ability and extent of coalescing depend on compute capability
[Figure: a warp issuing LD instructions]
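A first-order model of coalescing is to count how many distinct aligned segments the addresses of one warp-wide load fall into. The C++ sketch below does that on the host. The function names are hypothetical, 4-byte words are assumed, and real hardware rules vary with compute capability, so treat the counts as an approximation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct aligned segments of segmentBytes touched by the byte
// addresses that the threads of one warp present for a single load.
int segmentsTouched(const std::vector<std::uint64_t>& addresses, std::uint64_t segmentBytes) {
    assert(segmentBytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses) segments.insert(a / segmentBytes);
    return static_cast<int>(segments.size());
}

// Addresses for a warp in which thread t loads the 4-byte word located at
// base + t * strideWords * 4 (hypothetical helper for building test patterns).
std::vector<std::uint64_t> warpAddresses(std::uint64_t base, int warpSize, int strideWords) {
    assert(warpSize > 0 && strideWords > 0);
    std::vector<std::uint64_t> out;
    for (int t = 0; t < warpSize; ++t)
        out.push_back(base + static_cast<std::uint64_t>(t) * strideWords * 4);
    return out;
}
```

With 128-byte segments, 32 consecutive 4-byte words starting at an aligned address touch one segment. A stride of 2 words touches two, a misaligned base of 64 bytes also touches two, and a stride of 32 words touches 32 separate segments.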
Implications of Memory Coalescing
• Reduce the request rate to L1 and DRAM
• Distinct from CPU optimizations – why?
• Need to be able to re-map entries from each access back to threads
[Figure: streaming multiprocessor with warp schedulers, register file, SP cores, and L1/shared memory, connected to DRAM; the L1 access bandwidth and DRAM access bandwidth are marked]
Placing a 2D C array into linear memory space
[Figure: 4x4 matrix M with rows M0,0..M0,3, M1,0..M1,3, M2,0..M2,3, M3,0..M3,3]
Linearized order in increasing address: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and d_M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and d_N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}
Two Access Patterns
[Figure: (a) d_M accessed as d_M[Row*Width+k] and (b) d_N accessed as d_N[k*Width+Col], with Thread 1 and Thread 2 marked on the WIDTH x WIDTH matrices]
k is the loop counter in the inner product loop of the kernel code
Let's look at these access patterns
N accesses are coalesced.
[Figure: 4x4 matrix N and its linearized layout N0,0 N0,1 N0,2 N0,3 N1,0 ... N3,3. For d_N[k*Width+Col], threads T0 to T3 in a warp read adjacent elements in load iteration 0 and again in load iteration 1; the access direction in kernel code for each thread runs down a column]
M accesses are not coalesced.
[Figure: 4x4 matrix M and its linearized layout M0,0 M0,1 M0,2 M0,3 M1,0 ... M3,3. For d_M[Row*Width+k], threads T0 to T3 in a warp read elements one row apart in load iteration 0 and again in load iteration 1; the access direction in kernel code for each thread runs along a row]
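The two patterns can be compared by counting the 128-byte segments touched per load iteration. The C++ sketch below follows the picture on the slides, where successive threads take successive values of Col (for d_N) or Row (for d_M). The function name is hypothetical, floats are taken to be 4 bytes, and the arrays are assumed to start on a segment boundary.

```cpp
#include <cassert>
#include <set>

// Segments touched when threads 0..warpSize-1 each load one float in iteration k.
// nPattern = true:  thread t reads d_N[k*width + t]  (t plays the role of Col)
// nPattern = false: thread t reads d_M[t*width + k]  (t plays the role of Row)
int segmentsPerIteration(bool nPattern, int width, int k, int warpSize, int segmentBytes) {
    assert(warpSize > 0 && width >= warpSize && k >= 0 && k < width && segmentBytes > 0);
    std::set<long long> segments;
    for (int t = 0; t < warpSize; ++t) {
        long long element = nPattern
            ? static_cast<long long>(k) * width + t
            : static_cast<long long>(t) * width + k;
        segments.insert(element * 4 / segmentBytes);  // 4 bytes per float
    }
    return static_cast<int>(segments.size());
}
```

For Width = 1024 and a 32-thread warp, the d_N pattern touches a single segment per iteration, while the d_M pattern touches 32, one per thread.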
Using Shared Memory
[Figure: original access pattern (d_M and d_N read directly across WIDTH) versus tiled access pattern (copy into scratchpad memory, then perform multiplication with scratchpad values)]
Shared Memory Accesses
• Shared memory is banked – No coalescing
• Data access patterns should be structured to avoid bank conflicts
• Low order interleaved mapping?
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory.
        // Tiles are stored [ty][tx] so that Mds[ty][k] walks row Row of d_M
        // and Nds[k][tx] walks column Col of d_N.
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width+Col] = Pvalue;
}
• Accesses from shared memory, hence coalescing is not necessary
• Consider bank conflicts
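Bank conflicts can be estimated with a low-order interleaved mapping, bank = word index % number of banks. The C++ sketch below compares a warp whose threads index the tile's column (`tile[row][tx]`) with one whose threads index the tile's row (`tile[tx][col]`). The function names are hypothetical, and the 32 banks of 4-byte words, with identical words served as a broadcast, are assumptions rather than facts from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Largest number of distinct words that map to one bank for a single warp
// access, using bank = wordIndex % numBanks. A result of 1 means no conflict;
// n means an n-way conflict. Threads reading the same word count once.
int conflictDegree(const std::vector<int>& wordIndices, int numBanks) {
    assert(numBanks > 0);
    std::map<int, std::set<int>> wordsPerBank;
    for (int w : wordIndices) {
        assert(w >= 0);
        wordsPerBank[w % numBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank) worst = std::max(worst, entry.second.size());
    return static_cast<int>(worst);
}

// Word indices touched by threads tx = 0..tileWidth-1 in a tileWidth x tileWidth
// tile of 4-byte words. threadIndexesRow = false: tile[fixedIndex][tx];
// threadIndexesRow = true: tile[tx][fixedIndex].
std::vector<int> tileWords(int tileWidth, int fixedIndex, bool threadIndexesRow) {
    std::vector<int> words;
    for (int tx = 0; tx < tileWidth; ++tx)
        words.push_back(threadIndexesRow ? tx * tileWidth + fixedIndex
                                         : fixedIndex * tileWidth + tx);
    return words;
}
```

For a 32x32 tile and 32 banks, `tile[row][tx]` spreads the warp over all 32 banks, while `tile[tx][col]` sends every thread to the same bank, a 32-way conflict under these assumptions. A warp in which every thread reads the same word does not conflict in this model.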
Coalescing Behavior
[Figure: tiled matrix multiplication; d_M, d_N, and d_P are WIDTH x WIDTH, the TILE_WIDTH x TILE_WIDTH tile Pdsub of d_P is located by Row and Col, and the current d_M and d_N tiles start at offset m*TILE_WIDTH]
Thread Granularity
• Consider instruction bandwidth vs. memory bandwidth
• Control amount of work per thread
[Figure: streaming multiprocessor with fetch/decode, warp schedulers, register file, SP cores, and L1/shared memory, connected to DRAM]
Thread Granularity Tradeoffs
• Preserving instruction bandwidth (memory bandwidth)
– Increase thread granularity
– Merge adjacent tiles: sharing tile data
[Figure: tiled matrix multiplication; d_M, d_N, and d_P are WIDTH x WIDTH, the TILE_WIDTH x TILE_WIDTH tile Pdsub of d_P is located by Row and Col, and the current d_M and d_N tiles start at offset m*TILE_WIDTH]
Thread Granularity Tradeoffs (2)
• Impact on parallelism
– #TBs, #registers/thread
– Need to explore impact → autotuning
[Figure: tiled matrix multiplication; d_M, d_N, and d_P are WIDTH x WIDTH, the TILE_WIDTH x TILE_WIDTH tile Pdsub of d_P is located by Row and Col, and the current d_M and d_N tiles start at offset m*TILE_WIDTH]
ANY MORE QUESTIONS? READ CHAPTER 6!