Operation of the Basic SM Pipeline
Post on 24-Apr-2019
(1)
Operation of the Basic SM Pipeline
©Sudhakar Yalamanchili unless otherwise noted
(2)
Objectives
• Cycle-level examination of the operation of the major pipeline stages in a stream multiprocessor
• A breadth-first look at a basic pipeline
• Understand the type of information necessary for each stage of operation
• Identification of performance bottlenecks
  ◦ Detailed implementations are addressed in subsequent modules
(3)
Objectives
[Figure: baseline GPU organization. The host CPU connects over an interconnection bus to the GPU. A Kernel Management Unit holds pending kernels in HW work queues and feeds the Kernel Distributor (entry fields: PC, Dim, Param, ExeBL). The SMX Scheduler, with its control registers, launches thread blocks onto the SMXs; each SMX contains cores, registers, L1 cache / shared memory, warp schedulers, and warp contexts. A memory controller, L2 cache, and DRAM complete the chip. The following slides step inside one SMX.]
(4)
Reading
• Documentation for the GPGPU-Sim simulator
  ◦ Good source of information about the general organization and operation of a stream multiprocessor
  ◦ http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
• Operation of a scoreboard
  ◦ https://en.wikipedia.org/wiki/Scoreboarding
• General-Purpose Graphics Processor Architectures, T. Aamodt, W. Fung, and T. Rogers, Chapter 2.2
(5)
NVIDIA GK110 (Kepler): Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
Hierarchy of schedulers: kernel, thread block, warp, and memory transactions
(6)
SMX Organization: GK110
• Multiple warp schedulers
• 192 cores: 6 clusters of 32 cores each
• 64K 32-bit registers
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
What are the main stages of a generic SMX pipeline?
(7)
A Generic SM Pipeline
[Figure: the generic SM pipeline used throughout this module. Front-end: scalar fetch & decode (I-Fetch, I-Buffer, Decode) and instruction issue, with the warp scheduler selecting among pending warps (Warp 1, Warp 2, ..., Warp 6). Scalar cores: the predicate and general-purpose register files (PRF, RF) feed several scalar pipelines. Back-end: data memory access through the D-Cache (did all lanes hit, or was there a miss?) followed by writeback/commit.]
(8)
Single Warp Execution
Per-warp state: PC, AM (active mask), WID (warp ID), and execution state. A warp executes threads of a thread block from the grid.

PTX (assembly):
    setp.lt.s32 %p, %r5, %rd4;   // r5 = index, rd4 = N
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];    // r6 = &a[index]
    ld.global.f32 %f2, [%r7];    // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;    // r8 = &c[index]
L2: ret;
(9)
Instruction Fetch & Decode
• A per-warp entry (PC, AM, WID, State, Instr) tracks fetch state; the warp PCs (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) feed the I-Cache
• The fetch stage selects the next warp each cycle and may realize multiple fetch policies
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
Examples from Harmonica2 GPU
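The warp-selection step above can be sketched in a few lines. This is a minimal illustration of a round-robin fetch policy, assuming a simplified model in which a warp is fetchable whenever its I-buffer slot is empty; the function name and data layout are illustrative, not from any real simulator.

```python
# Sketch of a round-robin warp fetch policy (illustrative only).
# Each warp has an I-buffer slot; a warp is a fetch candidate when
# its slot is free.

def select_next_warp(last_fetched, ibuffer_free, num_warps):
    """Scan warps round-robin starting after the last fetched warp;
    return the first warp whose I-buffer slot is empty, or None."""
    for i in range(1, num_warps + 1):
        w = (last_fetched + i) % num_warps
        if ibuffer_free[w]:
            return w
    return None  # every warp's buffer is full: no fetch this cycle

# Example: 4 warps, warp 1's buffer is full, last fetch was warp 0.
free = [True, False, True, True]
assert select_next_warp(0, free, 4) == 2
```

The round-robin scan gives every warp a fair chance at the fetch slot; a real fetch unit would also skip warps with outstanding I-cache misses.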
(10)
Instruction Buffer
• Buffer a fixed number of instructions per warp
  ◦ Coordinated with instruction fetch: an empty I-buffer slot is needed for the warp
• V: valid instruction in the buffer
• R: instruction ready to be issued
  ◦ Set using the scoreboard logic
Example: buffering 2 decoded instructions per warp:
  V | Instr 1 W1 | R      V | Instr 2 W1 | R
  V | Instr 1 W2 | R      V | Instr 2 W2 | R
  ...                     V | Instr 2 Wn | R
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
ECE 6100/CS 6290
(11)
Instruction Buffer (2)
• Scoreboard enforces WAW and RAW hazards
  ◦ Indexed by warp ID
  ◦ Each entry hosts the required registers
  ◦ Destination registers are reserved at issue
  ◦ Reserved registers are released at writeback
• Enables multiple instructions to be in execution from a single warp
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(12)
Instruction Buffer (3)
Generic scoreboard (one entry per functional unit):
  Name | Busy | Op   | Fi (dest reg) | Fj, Fk (src regs) | Qj, Qk (FU producing each source) | Rj, Rk (source value ready?)
  Int  | Yes  | Load | F2            | R3, -             | -, -                              | No, -
• Next: a modified scoreboard design to address
  ◦ multiple instructions in transit from a warp
  ◦ excessive demand for register file ports
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
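The reserve-at-issue, release-at-writeback discipline described above can be sketched as a small per-warp scoreboard. This is a minimal model, not the GPGPU-Sim implementation: it keeps only a set of in-flight destination registers per warp and checks candidate instructions against it.

```python
# Minimal per-warp scoreboard sketch: destination registers are reserved
# at issue and released at writeback. An instruction may issue only if
# none of its source registers (RAW) or destinations (WAW) is reserved.

class Scoreboard:
    def __init__(self):
        self.reserved = {}  # warp id -> set of reserved register names

    def can_issue(self, wid, dests, srcs):
        pending = self.reserved.get(wid, set())
        return pending.isdisjoint(dests) and pending.isdisjoint(srcs)

    def issue(self, wid, dests):
        self.reserved.setdefault(wid, set()).update(dests)

    def writeback(self, wid, dests):
        self.reserved[wid] -= set(dests)

sb = Scoreboard()
sb.issue(0, ["r1"])                         # e.g. a load into r1 in flight
assert not sb.can_issue(0, ["r3"], ["r1"])  # RAW on r1: stall
assert not sb.can_issue(0, ["r1"], ["r4"])  # WAW on r1: stall
assert sb.can_issue(0, ["r5"], ["r4"])      # independent: may issue
sb.writeback(0, ["r1"])
assert sb.can_issue(0, ["r3"], ["r1"])      # r1 released at writeback
```

Because the reserved sets are indexed by warp ID, one warp's hazards never block another warp, which is exactly what lets the scheduler hide latency by switching warps.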
(13)
Instruction Issue
• The warp scheduler selects an instruction from a pool of ready warps (e.g., Warp 3, Warp 7, Warp 8)
• Manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(14)
Instruction Issue (2)
• Barriers: warps wait at the issue stage for barrier synchronization
  ◦ All threads in the thread block must reach the barrier before any warp may proceed past it
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
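The barrier behavior at issue reduces to a simple counting scheme: warps arriving at the barrier are parked, and all are released together once the whole thread block has arrived. A minimal sketch, assuming warps are identified by integer IDs and the block's membership is known:

```python
# Sketch of barrier handling at issue: a warp reaching a barrier is
# marked waiting; once every warp of the thread block has arrived,
# all waiting warps are released at once.

def barrier_arrive(waiting, wid, warps_in_block):
    """Mark warp `wid` as waiting at the barrier. Returns the set of
    released warps (empty while the barrier is still open)."""
    waiting.add(wid)
    if waiting == warps_in_block:
        released = set(waiting)
        waiting.clear()          # barrier resets for its next use
        return released
    return set()

block = {0, 1, 2}                # a thread block spanning three warps
waiting = set()
assert barrier_arrive(waiting, 0, block) == set()   # 0 parks
assert barrier_arrive(waiting, 2, block) == set()   # 2 parks
assert barrier_arrive(waiting, 1, block) == {0, 1, 2}  # last arrival releases all
assert waiting == set()
```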
(15)
Instruction Issue (3)
• Register dependencies are tracked through the scoreboard
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(16)
Instruction Issue (4)
• Control divergence: a per-warp SIMT stack keeps track of divergent threads at a branch
• The stack creates the execution mask that is read along with the operands
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
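The per-warp SIMT stack can be sketched as follows. This is a simplified model of the classic stack-based reconvergence scheme: at a divergent branch the top-of-stack entry is replaced by a reconvergence entry plus one entry per active path, and the top of stack always supplies the next PC and active mask. Lane masks are represented as sets of lane IDs for readability.

```python
# Sketch of a per-warp SIMT stack for control divergence.
# Each entry is (pc, active_mask); the top of stack drives execution.

def diverge(stack, taken_mask, not_taken_mask, target, fallthrough, reconv_pc):
    """Handle a divergent branch: replace the top entry with a
    reconvergence entry plus one entry per active path."""
    stack.pop()
    stack.append((reconv_pc, taken_mask | not_taken_mask))  # reconvergence
    if not_taken_mask:
        stack.append((fallthrough, not_taken_mask))
    if taken_mask:
        stack.append((target, taken_mask))

# Warp of 4 lanes, all active at PC 100; lanes {0,1} take the branch to
# 200, lanes {2,3} fall through to 104; both paths reconverge at 300.
stack = [(100, {0, 1, 2, 3})]
diverge(stack, {0, 1}, {2, 3}, target=200, fallthrough=104, reconv_pc=300)

pc, mask = stack[-1]
assert (pc, mask) == (200, {0, 1})       # taken path executes first
stack.pop()                              # taken path reaches reconv_pc
pc, mask = stack[-1]
assert (pc, mask) == (104, {2, 3})       # then the fall-through path
stack.pop()                              # it too reaches reconv_pc
assert stack[-1] == (300, {0, 1, 2, 3})  # full mask restored at reconvergence
```

Note how the full mask reappears only at the reconvergence entry; until then the two paths serialize, which is the performance cost of divergence.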
(17)
Instruction Issue (5)
• The scheduler can issue multiple instructions from a warp
• Issue conditions:
  ◦ Has valid instructions
  ◦ Not waiting at a barrier
  ◦ Passes the scoreboard check
  ◦ The pipeline is not stalled at the operand access stage (covered later)
• Destination registers are reserved at issue
• Instructions may issue to the memory, SP, or SFU pipelines
• Warp scheduling disciplines: more later in the course
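The four issue conditions listed above combine into a single eligibility predicate. The warp record below is an assumed simplification (two boolean flags), not the real hardware state:

```python
# The four issue conditions as one predicate (illustrative model).

def can_issue(warp, scoreboard_ok, operand_stage_stalled):
    return (warp["has_valid_instr"]        # valid instruction in I-buffer
            and not warp["at_barrier"]     # not parked at a barrier
            and scoreboard_ok              # no RAW/WAW hazard
            and not operand_stage_stalled) # operand access can accept it

w = {"has_valid_instr": True, "at_barrier": False}
assert can_issue(w, scoreboard_ok=True, operand_stage_stalled=False)
assert not can_issue(w, scoreboard_ok=False, operand_stage_stalled=False)
w["at_barrier"] = True
assert not can_issue(w, scoreboard_ok=True, operand_stage_stalled=False)
```

A warp scheduling discipline is then just the policy for choosing among the warps for which this predicate holds on a given cycle.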
(18)
Register File Access
[Figure: sixteen single-ported register file banks (Banks 0-15, 1024 bits wide) connect through a crossbar (Xbar) to operand collectors (OC); an arbiter resolves bank conflicts, and dispatch units (DU) forward complete operand sets to the ALUs, L/S units, and SFUs. Results return from the SPs.]
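Because each bank is single-ported, the arbiter can grant at most one operand read per bank per cycle; reads that collide on a bank serialize. A minimal sketch, assuming the common convention that register r lives in bank r mod num_banks (the mapping and greedy policy here are illustrative):

```python
# Sketch of the register file bank arbiter: single-ported banks serve
# at most one read per cycle, so same-bank requests serialize.
# Assumed mapping: register r lives in bank r % num_banks.

def schedule_reads(requests, num_banks):
    """Greedily grant at most one request per bank per cycle.
    `requests` is a list of register numbers; returns the per-cycle
    grant lists."""
    cycles = []
    pending = list(requests)
    while pending:
        granted_banks, this_cycle, remaining = set(), [], []
        for reg in pending:
            bank = reg % num_banks
            if bank in granted_banks:
                remaining.append(reg)   # bank already busy: retry next cycle
            else:
                granted_banks.add(bank)
                this_cycle.append(reg)
        cycles.append(this_cycle)
        pending = remaining
    return cycles

# Registers 0 and 16 collide in bank 0 (16 banks): 3 operands, 2 cycles.
assert schedule_reads([0, 5, 16], num_banks=16) == [[0, 5], [16]]
```

Operand collectors exist precisely to tolerate this serialization: each collector buffers the operands it has already received while the arbiter works through the remaining bank conflicts.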
(19)
Scalar Pipeline
A single core: Dispatch feeds the ALU, FPU, and LD/ST units, which write into a result queue.
• Functional units are pipelined
• Designs with multiple issue exist
(20)
Shared Memory Access
• Multiple-bank organization; data is interleaved across the banks
• Bank conflicts extend access times
[Figure: a conflict-free access pattern vs. a 2-way conflict access pattern]
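The cost of a bank conflict is easy to quantify: an access takes as many passes as the maximum number of distinct words that map to any one bank. A minimal sketch, assuming word-interleaved banks (bank = word address mod number of banks) and that accesses to the same word broadcast without conflict:

```python
# Bank-conflict sketch for word-interleaved shared memory.
# An access replays once per distinct word that collides in a bank.

from collections import defaultdict

def conflict_degree(word_addrs, num_banks=32):
    per_bank = defaultdict(set)
    for a in word_addrs:
        per_bank[a % num_banks].add(a)   # same word -> broadcast, no conflict
    return max(len(s) for s in per_bank.values())

# 32 lanes, stride 1: every lane hits a different bank, 1 pass.
assert conflict_degree([i for i in range(32)]) == 1
# Stride 2: the lanes share only the 16 even banks -> 2-way conflict.
assert conflict_degree([2 * i for i in range(32)]) == 2
```

This is why padding a shared-memory array by one element is a standard trick: it perturbs the stride so column accesses no longer pile onto the same bank.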
(21)
Memory Request Coalescing
[Figure: per-thread memory requests (Tid, RQ size, base address, offset) enter memory address coalescing; each Pending Request Table entry holds a pending RQ count and the address/thread-mask pair for every generated transaction.]
• The Pending Request Table (PRT) is filled whenever a memory request is issued
• A set of address masks is generated, one for each memory transaction
• The transactions are then issued
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
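The mask-generation step above can be sketched by grouping per-lane addresses into the aligned memory segments they touch; each segment becomes one transaction carrying a mask of the lanes it serves. The 128-byte segment size is an assumption for illustration (it matches common GPU cache-line sizing), not a fixed property of the scheme:

```python
# Coalescing sketch: per-lane byte addresses are grouped into the set
# of aligned segments they touch; each segment becomes one memory
# transaction with a mask of the lanes it serves.

def coalesce(lane_addrs, segment=128):
    txns = {}  # segment base address -> set of lane ids (the thread mask)
    for lane, addr in enumerate(lane_addrs):
        base = (addr // segment) * segment   # align down to segment start
        txns.setdefault(base, set()).add(lane)
    return txns

# 8 lanes reading consecutive 4-byte words from 0x100: one transaction.
txns = coalesce([0x100 + 4 * i for i in range(8)])
assert list(txns) == [0x100] and txns[0x100] == set(range(8))

# Scattered accesses generate one transaction per touched segment.
assert len(coalesce([0, 256, 512, 513])) == 3
```

The number of entries returned is exactly the number of transactions the PRT must track for that warp's request, which is why access patterns that stay within few segments are so much cheaper.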
(22)
Memory Hierarchy
• Configurable split of L1 cache and shared memory capacity
• Read-only cache for compiler or developer (intrinsics) use
• Shared L2 cache across all SMXs
• ECC coverage across the hierarchy
  ◦ Has a performance impact
[Figure: per-SMX L1 cache / shared memory and read-only cache, backed by a shared L2 cache and DRAM]
From GK110: NVIDIA white paper