Operation of the Basic SM Pipeline
Post on 24-Apr-2019
(1)
Operation of the Basic SM Pipeline
©Sudhakar Yalamanchili unless otherwise noted
(2)
Objectives
• Cycle-level examination of the operation of the major pipeline stages in a stream multiprocessor
• A breadth-first look at a basic pipeline
• Understand the type of information necessary for each stage of operation
• Identification of performance bottlenecks
  ◦ Detailed implementations are addressed in subsequent modules
(3)
Objectives
[Figure: baseline GPU organization. The host CPU connects over an interconnection bus to the GPU. A Kernel Management Unit holds pending kernels in HW work queues and feeds the Kernel Distributor (entry fields: PC, Dim, Param, ExeBL). The SMX Scheduler, with its control registers, launches thread blocks onto the SMXs; each SMX contains cores, registers, L1 cache / shared memory, warp schedulers, and warp contexts. A memory controller, L2 cache, and DRAM complete the chip. The following slides step inside one SMX.]
(4)
Reading
• Documentation for the GPGPU-Sim simulator
  ◦ Good source of information about the general organization and operation of a stream multiprocessor
  ◦ http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
• Operation of a scoreboard
  ◦ https://en.wikipedia.org/wiki/Scoreboarding
• General-Purpose Graphics Processor Architectures, T. Aamodt, W. Fung, and T. Rogers, Chapter 2.2
(5)
NVIDIA GK110 (Kepler): Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
Hierarchy of schedulers: kernel, thread block, warp, and memory transactions
(6)
SMX Organization: GK110
• Multiple warp schedulers
• 192 cores: 6 clusters of 32 cores each
• 64K 32-bit registers
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
What are the main stages of a generic SMX pipeline?
(7)
A Generic SM Pipeline
[Figure: the generic SM pipeline used throughout this module. Front-end: scalar fetch & decode (I-Fetch, I-Buffer, Decode) and instruction issue, with the warp scheduler selecting among pending warps (Warp 1, Warp 2, ..., Warp 6). Scalar cores: the predicate and general-purpose register files (PRF, RF) feed several scalar pipelines. Back-end: data memory access through the D-Cache (did all lanes hit, or was there a miss?) followed by writeback/commit.]
(8)
Single Warp Execution
Per-warp state: PC, AM (active mask), WID (warp ID), and execution state. A warp executes threads of a thread block from the grid.

PTX (assembly):
    setp.lt.s32 %p, %r5, %rd4;   // r5 = index, rd4 = N
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];    // r6 = &a[index]
    ld.global.f32 %f2, [%r7];    // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;    // r8 = &c[index]
L2: ret;
(9)
Instruction Fetch & Decode
• A per-warp entry (PC, AM, WID, State, Instr) tracks fetch state; the warp PCs (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) feed the I-Cache
• The fetch stage selects the next warp each cycle and may realize multiple fetch policies
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
Examples from Harmonica2 GPU
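The warp-selection step above can be sketched in a few lines. This is a minimal illustration of a round-robin fetch policy, assuming a simplified model in which a warp is fetchable whenever its I-buffer slot is empty; the function name and data layout are illustrative, not from any real simulator.

```python
# Sketch of a round-robin warp fetch policy (illustrative only).
# Each warp has an I-buffer slot; a warp is a fetch candidate when
# its slot is free.

def select_next_warp(last_fetched, ibuffer_free, num_warps):
    """Scan warps round-robin starting after the last fetched warp;
    return the first warp whose I-buffer slot is empty, or None."""
    for i in range(1, num_warps + 1):
        w = (last_fetched + i) % num_warps
        if ibuffer_free[w]:
            return w
    return None  # every warp's buffer is full: no fetch this cycle

# Example: 4 warps, warp 1's buffer is full, last fetch was warp 0.
free = [True, False, True, True]
assert select_next_warp(0, free, 4) == 2
```

The round-robin scan gives every warp a fair chance at the fetch slot; a real fetch unit would also skip warps with outstanding I-cache misses.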
(10)
Instruction Buffer
• Buffer a fixed number of instructions per warp
  ◦ Coordinated with instruction fetch: an empty I-buffer slot is needed for the warp
• V: valid instruction in the buffer
• R: instruction ready to be issued
  ◦ Set using the scoreboard logic
Example: buffering 2 decoded instructions per warp:
  V | Instr 1 W1 | R      V | Instr 2 W1 | R
  V | Instr 1 W2 | R      V | Instr 2 W2 | R
  ...                     V | Instr 2 Wn | R
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
ECE 6100/CS 6290
(11)
Instruction Buffer (2)
• Scoreboard enforces WAW and RAW hazards
  ◦ Indexed by warp ID
  ◦ Each entry hosts the required registers
  ◦ Destination registers are reserved at issue
  ◦ Reserved registers are released at writeback
• Enables multiple instructions to be in execution from a single warp
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(12)
Instruction Buffer (3)
Generic scoreboard (one entry per functional unit):
  Name | Busy | Op   | Fi (dest reg) | Fj, Fk (src regs) | Qj, Qk (FU producing each source) | Rj, Rk (source value ready?)
  Int  | Yes  | Load | F2            | R3, -             | -, -                              | No, -
• Next: a modified scoreboard design to address
  ◦ multiple instructions in transit from a warp
  ◦ excessive demand for register file ports
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
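The reserve-at-issue, release-at-writeback discipline described above can be sketched as a small per-warp scoreboard. This is a minimal model, not the GPGPU-Sim implementation: it keeps only a set of in-flight destination registers per warp and checks candidate instructions against it.

```python
# Minimal per-warp scoreboard sketch: destination registers are reserved
# at issue and released at writeback. An instruction may issue only if
# none of its source registers (RAW) or destinations (WAW) is reserved.

class Scoreboard:
    def __init__(self):
        self.reserved = {}  # warp id -> set of reserved register names

    def can_issue(self, wid, dests, srcs):
        pending = self.reserved.get(wid, set())
        return pending.isdisjoint(dests) and pending.isdisjoint(srcs)

    def issue(self, wid, dests):
        self.reserved.setdefault(wid, set()).update(dests)

    def writeback(self, wid, dests):
        self.reserved[wid] -= set(dests)

sb = Scoreboard()
sb.issue(0, ["r1"])                         # e.g. a load into r1 in flight
assert not sb.can_issue(0, ["r3"], ["r1"])  # RAW on r1: stall
assert not sb.can_issue(0, ["r1"], ["r4"])  # WAW on r1: stall
assert sb.can_issue(0, ["r5"], ["r4"])      # independent: may issue
sb.writeback(0, ["r1"])
assert sb.can_issue(0, ["r3"], ["r1"])      # r1 released at writeback
```

Because the reserved sets are indexed by warp ID, one warp's hazards never block another warp, which is exactly what lets the scheduler hide latency by switching warps.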
(13)
Instruction Issue
• The warp scheduler selects an instruction from a pool of ready warps (e.g., Warp 3, Warp 7, Warp 8)
• Manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(14)
Instruction Issue (2)
• Barriers: warps wait at the issue stage for barrier synchronization
  ◦ All threads in the thread block must reach the barrier before any warp may proceed past it
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
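The barrier behavior at issue reduces to a simple counting scheme: warps arriving at the barrier are parked, and all are released together once the whole thread block has arrived. A minimal sketch, assuming warps are identified by integer IDs and the block's membership is known:

```python
# Sketch of barrier handling at issue: a warp reaching a barrier is
# marked waiting; once every warp of the thread block has arrived,
# all waiting warps are released at once.

def barrier_arrive(waiting, wid, warps_in_block):
    """Mark warp `wid` as waiting at the barrier. Returns the set of
    released warps (empty while the barrier is still open)."""
    waiting.add(wid)
    if waiting == warps_in_block:
        released = set(waiting)
        waiting.clear()          # barrier resets for its next use
        return released
    return set()

block = {0, 1, 2}                # a thread block spanning three warps
waiting = set()
assert barrier_arrive(waiting, 0, block) == set()   # 0 parks
assert barrier_arrive(waiting, 2, block) == set()   # 2 parks
assert barrier_arrive(waiting, 1, block) == {0, 1, 2}  # last arrival releases all
assert waiting == set()
```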
(15)
Instruction Issue (3)
• Register dependencies are tracked through the scoreboard
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(16)
Instruction Issue (4)
• Control divergence: a per-warp SIMT stack keeps track of divergent threads at a branch
• The stack creates the execution mask that is read along with the operands
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
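The per-warp SIMT stack can be sketched as follows. This is a simplified model of the classic stack-based reconvergence scheme: at a divergent branch the top-of-stack entry is replaced by a reconvergence entry plus one entry per active path, and the top of stack always supplies the next PC and active mask. Lane masks are represented as sets of lane IDs for readability.

```python
# Sketch of a per-warp SIMT stack for control divergence.
# Each entry is (pc, active_mask); the top of stack drives execution.

def diverge(stack, taken_mask, not_taken_mask, target, fallthrough, reconv_pc):
    """Handle a divergent branch: replace the top entry with a
    reconvergence entry plus one entry per active path."""
    stack.pop()
    stack.append((reconv_pc, taken_mask | not_taken_mask))  # reconvergence
    if not_taken_mask:
        stack.append((fallthrough, not_taken_mask))
    if taken_mask:
        stack.append((target, taken_mask))

# Warp of 4 lanes, all active at PC 100; lanes {0,1} take the branch to
# 200, lanes {2,3} fall through to 104; both paths reconverge at 300.
stack = [(100, {0, 1, 2, 3})]
diverge(stack, {0, 1}, {2, 3}, target=200, fallthrough=104, reconv_pc=300)

pc, mask = stack[-1]
assert (pc, mask) == (200, {0, 1})       # taken path executes first
stack.pop()                              # taken path reaches reconv_pc
pc, mask = stack[-1]
assert (pc, mask) == (104, {2, 3})       # then the fall-through path
stack.pop()                              # it too reaches reconv_pc
assert stack[-1] == (300, {0, 1, 2, 3})  # full mask restored at reconvergence
```

Note how the full mask reappears only at the reconvergence entry; until then the two paths serialize, which is the performance cost of divergence.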
(17)
Instruction Issue (5)
• The scheduler can issue multiple instructions from a warp
• Issue conditions:
  ◦ Has valid instructions
  ◦ Not waiting at a barrier
  ◦ Passes the scoreboard check
  ◦ The pipeline is not stalled at the operand access stage (covered later)
• Destination registers are reserved at issue
• Instructions may issue to the memory, SP, or SFU pipelines
• Warp scheduling disciplines: more later in the course
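The four issue conditions listed above combine into a single eligibility predicate. The warp record below is an assumed simplification (two boolean flags), not the real hardware state:

```python
# The four issue conditions as one predicate (illustrative model).

def can_issue(warp, scoreboard_ok, operand_stage_stalled):
    return (warp["has_valid_instr"]        # valid instruction in I-buffer
            and not warp["at_barrier"]     # not parked at a barrier
            and scoreboard_ok              # no RAW/WAW hazard
            and not operand_stage_stalled) # operand access can accept it

w = {"has_valid_instr": True, "at_barrier": False}
assert can_issue(w, scoreboard_ok=True, operand_stage_stalled=False)
assert not can_issue(w, scoreboard_ok=False, operand_stage_stalled=False)
w["at_barrier"] = True
assert not can_issue(w, scoreboard_ok=True, operand_stage_stalled=False)
```

A warp scheduling discipline is then just the policy for choosing among the warps for which this predicate holds on a given cycle.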
(18)
Register File Access
[Figure: sixteen single-ported register file banks (Banks 0-15, 1024 bits wide) connect through a crossbar (Xbar) to operand collectors (OC); an arbiter resolves bank conflicts, and dispatch units (DU) forward complete operand sets to the ALUs, L/S units, and SFUs. Results return from the SPs.]
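Because each bank is single-ported, the arbiter can grant at most one operand read per bank per cycle; reads that collide on a bank serialize. A minimal sketch, assuming the common convention that register r lives in bank r mod num_banks (the mapping and greedy policy here are illustrative):

```python
# Sketch of the register file bank arbiter: single-ported banks serve
# at most one read per cycle, so same-bank requests serialize.
# Assumed mapping: register r lives in bank r % num_banks.

def schedule_reads(requests, num_banks):
    """Greedily grant at most one request per bank per cycle.
    `requests` is a list of register numbers; returns the per-cycle
    grant lists."""
    cycles = []
    pending = list(requests)
    while pending:
        granted_banks, this_cycle, remaining = set(), [], []
        for reg in pending:
            bank = reg % num_banks
            if bank in granted_banks:
                remaining.append(reg)   # bank already busy: retry next cycle
            else:
                granted_banks.add(bank)
                this_cycle.append(reg)
        cycles.append(this_cycle)
        pending = remaining
    return cycles

# Registers 0 and 16 collide in bank 0 (16 banks): 3 operands, 2 cycles.
assert schedule_reads([0, 5, 16], num_banks=16) == [[0, 5], [16]]
```

Operand collectors exist precisely to tolerate this serialization: each collector buffers the operands it has already received while the arbiter works through the remaining bank conflicts.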
(19)
Scalar Pipeline
A single core: Dispatch feeds the ALU, FPU, and LD/ST units, which write into a result queue.
• Functional units are pipelined
• Designs with multiple issue exist
(20)
Shared Memory Access
• Multiple-bank organization; data is interleaved across the banks
• Bank conflicts extend access times
[Figure: a conflict-free access pattern vs. a 2-way conflict access pattern]
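The cost of a bank conflict is easy to quantify: an access takes as many passes as the maximum number of distinct words that map to any one bank. A minimal sketch, assuming word-interleaved banks (bank = word address mod number of banks) and that accesses to the same word broadcast without conflict:

```python
# Bank-conflict sketch for word-interleaved shared memory.
# An access replays once per distinct word that collides in a bank.

from collections import defaultdict

def conflict_degree(word_addrs, num_banks=32):
    per_bank = defaultdict(set)
    for a in word_addrs:
        per_bank[a % num_banks].add(a)   # same word -> broadcast, no conflict
    return max(len(s) for s in per_bank.values())

# 32 lanes, stride 1: every lane hits a different bank, 1 pass.
assert conflict_degree([i for i in range(32)]) == 1
# Stride 2: the lanes share only the 16 even banks -> 2-way conflict.
assert conflict_degree([2 * i for i in range(32)]) == 2
```

This is why padding a shared-memory array by one element is a standard trick: it perturbs the stride so column accesses no longer pile onto the same bank.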
(21)
Memory Request Coalescing
[Figure: per-thread memory requests (Tid, RQ size, base address, offset) enter memory address coalescing; each Pending Request Table entry holds a pending RQ count and the address/thread-mask pair for every generated transaction.]
• The Pending Request Table (PRT) is filled whenever a memory request is issued
• A set of address masks is generated, one for each memory transaction
• The transactions are then issued
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
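The mask-generation step above can be sketched by grouping per-lane addresses into the aligned memory segments they touch; each segment becomes one transaction carrying a mask of the lanes it serves. The 128-byte segment size is an assumption for illustration (it matches common GPU cache-line sizing), not a fixed property of the scheme:

```python
# Coalescing sketch: per-lane byte addresses are grouped into the set
# of aligned segments they touch; each segment becomes one memory
# transaction with a mask of the lanes it serves.

def coalesce(lane_addrs, segment=128):
    txns = {}  # segment base address -> set of lane ids (the thread mask)
    for lane, addr in enumerate(lane_addrs):
        base = (addr // segment) * segment   # align down to segment start
        txns.setdefault(base, set()).add(lane)
    return txns

# 8 lanes reading consecutive 4-byte words from 0x100: one transaction.
txns = coalesce([0x100 + 4 * i for i in range(8)])
assert list(txns) == [0x100] and txns[0x100] == set(range(8))

# Scattered accesses generate one transaction per touched segment.
assert len(coalesce([0, 256, 512, 513])) == 3
```

The number of entries returned is exactly the number of transactions the PRT must track for that warp's request, which is why access patterns that stay within few segments are so much cheaper.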
(22)
Memory Hierarchy
• Configurable split of L1 cache and shared memory capacity
• Read-only cache for compiler or developer (intrinsics) use
• Shared L2 cache across all SMXs
• ECC coverage across the hierarchy
  ◦ Has a performance impact
[Figure: per-SMX L1 cache / shared memory and read-only cache, backed by a shared L2 cache and DRAM]
From GK110: NVIDIA white paper