
Page 1: (1) Kernel Execution ©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

(1)

Kernel Execution

©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

Page 2:

(2)

Objectives

• Understand the operation of major microarchitectural stages in launching and executing a kernel on a GPU device

• Enumerate generic information that must be maintained to track and sequence the execution of a large number of fine-grained threads

o What types of information are maintained at each step?

o Where can we augment functionality, e.g., scheduling?

Page 3:

(3)

Reading

• CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

• J. Wang, A. Sidelnik, N. Rubin, and S. Yalamanchili, “Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015 (Section 2)

Page 4:

(4)

Execution Model Overview

• GPU: Many-Core Architecture

• CUDA: NVIDIA’s GPU Programming Model

[Figure: CUDA execution hierarchy. Threads are grouped into thread blocks (TBs); a CUDA kernel is a 1D-3D grid of TBs. Each TB executes as warps of 32 threads, subject to branch divergence and reconvergence. Warps run on the SMXs of the GPU, each SMX with cores, a register file, and L1 cache/shared memory, backed by device memory.]

Page 5:

(5)

Baseline Microarchitecture

• Modeled after the Kepler GK110 (best guess)

• 14 stream multiprocessors (SMX)

• 32 threads/warp

• Maximum of 2048 threads/SMX

• Maximum of 16 TBs/SMX

• Remaining stats: 4 warp schedulers, 65536 registers, 64KB L1/shared memory

From Kepler GK110 Whitepaper

Page 6:

(6)

Kernel Launch: CUDA Streams

[Figure: host processor issues commands through CUDA streams, which map to HW queues in the Kernel Management Unit (device).]

• Commands by the host are issued through streams

o Kernels in the same stream are executed sequentially

o Kernels in different streams may be executed concurrently

• Streams are mapped to hardware queues in the device, in the kernel management unit (KMU)

o Multiple streams mapped to the same queue serializes some kernels

[Figure labels: kernel dispatch; CUDA context.]
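The many-to-one mapping from streams to hardware queues can be sketched as follows. The modulo mapping, queue count, and function names are illustrative assumptions for exposition, not NVIDIA's actual implementation:

```c
#include <assert.h>

#define NUM_HW_QUEUES 32   /* e.g. Kepler Hyper-Q exposes 32 queues */

/* A simple many-to-one mapping: streams whose ids collide modulo the
 * queue count share a hardware queue, so their kernels serialize even
 * though the programmer declared them in independent streams. */
int hw_queue_for_stream(int stream_id) {
    return stream_id % NUM_HW_QUEUES;
}

/* Two kernels can run concurrently only if their streams land in
 * different hardware queues. */
int can_run_concurrently(int stream_a, int stream_b) {
    return hw_queue_for_stream(stream_a) != hw_queue_for_stream(stream_b);
}
```

This illustrates why adding more streams than hardware queues reintroduces false serialization between logically independent kernels.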

Page 7:

(7)

CUDA Context

• Analogous to a CPU process

• Encapsulates all CUDA resources and actions: streams, memory objects, kernels

• Distinct address space

• Created by runtime during initialization and shared by all host threads in the same CPU process

• A different CUDA context for each process, e.g., each MPI rank

Page 8:

(8)

Baseline Microarchitecture: KMU

[Figure: baseline microarchitecture. The host CPU connects over the interconnection bus to the GPU. The Kernel Management Unit (KMU) holds the HW work queues and pending kernels and feeds the Kernel Distributor (entry fields: PC, Dim, Param, ExeBL). The SMX scheduler, with its control registers, distributes work to the SMXs (cores, registers, L1 cache/shared memory, warp schedulers, warp context), which share the memory controller, L2 cache, and DRAM.]

• Kernels are dispatched from the head of the HW queues

• The next kernel from a HWQ cannot be dispatched before the prior kernel from the same queue completes execution (today)

Page 9:

(9)

Kernel Distributor

[Figure: baseline microarchitecture (see the KMU slide), highlighting the Kernel Distributor.]

• Same number of entries as HW queues: the maximum number of independent kernels that can be dispatched by the KMU

Page 10:

(10)

Kernel Distributor (2)

• Kernel distributor provides one entry on a FIFO basis to the SMX scheduler

[Figure: baseline microarchitecture (see the KMU slide), highlighting the path from the Kernel Distributor to the SMX scheduler.]

Kernel Distributor Entry (KDE) fields:

PC: entry PC
Dim: grid and CTA dimension info
Param: parameter addresses
ExeBL: #CTAs to complete
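The four KDE fields above can be pictured as a small struct. The field types and the retire helper are illustrative assumptions; the slide only names the fields:

```c
#include <assert.h>

/* Illustrative layout of a Kernel Distributor Entry (KDE). */
typedef struct {
    unsigned pc;            /* entry PC of the kernel */
    unsigned grid_dim[3];   /* grid dimensions (in TBs) */
    unsigned cta_dim[3];    /* CTA (thread block) dimensions */
    unsigned param_addr;    /* base address of the kernel parameters */
    unsigned exe_bl;        /* ExeBL: #CTAs left to complete */
} KernelDistributorEntry;

/* Called as each TB finishes: decrement ExeBL; when it reaches zero
 * the kernel is done and the entry can be recycled.
 * Returns 1 if the kernel has completed. */
int kde_retire_tb(KernelDistributorEntry *e) {
    if (e->exe_bl > 0)
        e->exe_bl--;
    return e->exe_bl == 0;
}
```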

Page 11:

(11)

Kernel Parameters

• Managed by runtime and driver

• Dynamically allocated in device memory and mapped to constant parameter address space

• 4KB per kernel launch

.entry foo( .param .u32 len )
{
   .reg .u32 %n;
   ld.param.u32 %n, [len];  // load from parameter address space
   ...
}

Page 12:

(12)

SMX Scheduler

• Initialize control registers

• Distribute thread blocks (TBs) to SMXs

o Limited by #registers, shared memory, #TBs, #threads

• Update KDE and SMX scheduler entries as kernels complete execution

[Figure: baseline microarchitecture (see the KMU slide), highlighting the SMX scheduler and its control registers.]

SMX Scheduler Control Registers (SSCR):

KDEI: KDE index
NextBL: next TB to be scheduled

Page 13:

(13)

SMX Scheduler: TB Distribution

• Distribute TBs every cycle

o Get the next TB from the control registers

o Determine the destination SMX: round-robin (today)

o Distribute if enough resources are available and update the control registers; otherwise wait till the next cycle

SMX Scheduler Control Registers:

KDEI: KDE index
NextBL: next TB to be scheduled
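The per-cycle distribution loop above can be sketched in a few lines. Resource tracking is reduced here to a free-TB-slot count per SMX; real hardware also checks registers, shared memory, and thread counts, and all names are illustrative:

```c
#include <assert.h>

#define NUM_SMX     14   /* SMXs in the baseline microarchitecture */
#define MAX_TB_SMX  16   /* maximum resident TBs per SMX */

typedef struct {
    int free_slots[NUM_SMX];  /* remaining TB slots per SMX */
    int rr_next;              /* round-robin pointer */
} SmxScheduler;

/* Try to place one TB this cycle, scanning SMXs round-robin.
 * Returns the chosen SMX index, or -1 to retry next cycle. */
int distribute_tb(SmxScheduler *s) {
    for (int i = 0; i < NUM_SMX; i++) {
        int smx = (s->rr_next + i) % NUM_SMX;
        if (s->free_slots[smx] > 0) {
            s->free_slots[smx]--;              /* commit the resources */
            s->rr_next = (smx + 1) % NUM_SMX;  /* advance round-robin */
            return smx;
        }
    }
    return -1;  /* no SMX has room: wait till the next cycle */
}
```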

Page 14:

(14)

SMX Scheduler: TB Control

• Each SMX has a set of TBCRs, one for each TB executing on the SMX (max. 16)

o Record the KDEI and TB ID for this TB

o After TB execution finishes, notify the SMX scheduler

• Support for multiple kernels

[Figure: baseline microarchitecture (see the KMU slide), highlighting the per-SMX thread block control registers.]

Thread Block Control Registers (TBCR), per SMX:

KDEI: KDE index
BLKID: scheduled TB ID (in execution)

Page 15:

(15)

Streaming Multiprocessor (SMX)

• Thread blocks are organized as warps (here, 32 threads)

• Select from a pool of ready warps and issue an instruction to the 32-lane SMX

• Interleave warps for fine-grained multithreading

• Control-flow divergence and memory divergence

[Figure: baseline microarchitecture (see the KMU slide), highlighting one SMX.]

[Figure: a core, lane, or stream processor (SP): a dispatch port feeding ALU, FPU, and LD/ST units, with a result queue.]

Page 16:

(16)

Warp Management

• Warp context: PC, active mask for each thread, control-divergence stack, barrier information

• Warp contexts are stored in a set of hardware registers per SMX for fast warp context switching

o Max. 64 warp context registers (one per warp)
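One possible shape for the per-warp context listed above is sketched below. The sizes, field names, and divergence-stack depth are guesses for illustration; the slide specifies only the categories of state (64 contexts per SMX, 32-thread warps):

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE        32
#define MAX_WARPS_SMX    64   /* one context register per resident warp */
#define DIV_STACK_DEPTH  16   /* assumed depth, for illustration only */

typedef struct {
    uint32_t pc;                    /* next instruction for the warp */
    uint32_t active_mask;           /* one bit per thread in the warp */
    struct {
        uint32_t reconv_pc;         /* where diverged paths rejoin */
        uint32_t mask;              /* threads on the pending path */
    } div_stack[DIV_STACK_DEPTH];   /* control-divergence stack */
    int      div_top;               /* divergence stack pointer */
    int      at_barrier;            /* waiting at a barrier? */
} WarpContext;

/* An SMX holds one context per resident warp, so switching warps is
 * just a matter of indexing a different register set. */
typedef struct {
    WarpContext warps[MAX_WARPS_SMX];
} SmxWarpState;
```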

Page 17:

(17)

Warp Scheduling

• Ready warps: no unresolved dependency for the next instruction, e.g., the memory operand is ready

• Round-robin: switch to the next ready warp after executing one instruction in each warp

• Greedy-Then-Oldest (GTO): execute the current warp until there is an unresolved dependency, then move to the oldest warp in the ready-warp pool
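The GTO policy above can be sketched as a selection function. Representing warp age as a stamp assigned at warp creation is an assumption for illustration:

```c
#include <assert.h>

#define NUM_WARPS 8

typedef struct {
    int ready[NUM_WARPS];  /* 1 if the next instruction has no unresolved dependency */
    int age[NUM_WARPS];    /* smaller = older (stamped at warp creation) */
    int current;           /* warp issued last cycle, or -1 */
} WarpPool;

/* Pick the warp to issue from this cycle under GTO. */
int gto_select(WarpPool *p) {
    /* Greedy: keep issuing from the current warp while it is ready. */
    if (p->current >= 0 && p->ready[p->current])
        return p->current;
    /* Then-oldest: fall back to the ready warp with the smallest age. */
    int best = -1;
    for (int w = 0; w < NUM_WARPS; w++)
        if (p->ready[w] && (best < 0 || p->age[w] < p->age[best]))
            best = w;
    p->current = best;
    return best;  /* -1 if no warp is ready this cycle */
}
```

Contrast with round-robin, which would rotate to the next ready warp after every instruction regardless of whether the current warp could continue.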

Page 18:

(18)

Concurrent Kernel Execution

• If a kernel does not occupy all SMXs, distribute TBs from the next kernel

• When the SMX scheduler issues the next TB, it can issue from the next kernel in the KDE FIFO if the current kernel has all its TBs issued

• On each SMX there can be TBs from different kernels (recall the SMX TBCR)

Thread Block Control Registers (TBCR), per SMX:

KDEI: KDE index
BLKID: scheduled TB ID (in execution)
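The issue rule above can be sketched as a walk over the KDE FIFO: take a TB from the first kernel that still has unissued TBs, so a later kernel's TBs start only once the current kernel has all of its TBs issued. Names and the FIFO representation are illustrative:

```c
#include <assert.h>

#define KDE_ENTRIES 32   /* matches the number of HW queues */

typedef struct {
    int total_tbs[KDE_ENTRIES];   /* TBs in each kernel's grid */
    int issued_tbs[KDE_ENTRIES];  /* TBs already sent to SMXs */
    int head, count;              /* FIFO of active KDE entries */
} KdeFifo;

/* Returns the KDE index to issue the next TB from, or -1 if every
 * active kernel already has all of its TBs issued. */
int next_kernel_to_issue(KdeFifo *f) {
    for (int i = 0; i < f->count; i++) {
        int kde = (f->head + i) % KDE_ENTRIES;
        if (f->issued_tbs[kde] < f->total_tbs[kde]) {
            f->issued_tbs[kde]++;
            return kde;
        }
    }
    return -1;
}
```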

Page 19:

(19)

Executing Kernels

[Figure: concurrent kernel execution. Pending kernels queue in the KMU's HW work queues; several kernels occupy Kernel Distributor entries; their TBs, each executing as warps, are spread across the SMXs.]

Concurrent Kernel Execution

Page 20:

(20)

Occupancy, TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels

[Figure: Kernels 1-100 queued for execution on SMX0-SMX12, each SMX with cores, a register file, and L1 cache/shared memory.]

Page 21:

(21)

Occupancy, TBs, Warps, & Utilization

[Figure (animation step, same setup as the previous slide): Kernel 1 executes on an SMX; Kernels 2-100 wait.]

Page 22:

(22)

Occupancy, TBs, Warps, & Utilization

[Figure (animation step): Kernels 1 and 2 execute on the SMXs; Kernels 3-100 wait.]

Page 23:

(23)

Occupancy, TBs, Warps, & Utilization

[Figure (animation step): Kernels 1-13 each occupy one of the 13 SMXs; Kernels 14-100 wait.]

Page 24:

(24)

Occupancy, TBs, Warps, & Utilization

[Figure (animation step): Kernels 1-14 execute on the SMXs; Kernels 15-100 wait.]

Page 25:

(25)

Occupancy, TBs, Warps, & Utilization

[Figure (animation step): Kernels 1-32 execute, reaching the 32-concurrent-kernel limit; Kernels 33-100 wait till a previous kernel finishes.]

Page 26:

(26)

Occupancy, TBs, Warps, & Utilization

[Figure: Kernels 1-32 executing concurrently on the SMXs; Kernels 33-100 waiting.]

Achieved SMX occupancy: (32 kernels * 2 warps/kernel) / (64 warps/SMX * 13 SMXs) = 0.077

(Kernels beyond the 32nd wait till a previous kernel finishes.)
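The occupancy figure above, reproduced as arithmetic (the function name is illustrative):

```c
#include <assert.h>

/* Achieved occupancy = resident warps / maximum resident warps,
 * summed over the whole device. */
double achieved_occupancy(int kernels, int warps_per_kernel,
                          int max_warps_per_smx, int num_smx) {
    return (double)(kernels * warps_per_kernel)
         / (double)(max_warps_per_smx * num_smx);
}
```

With 32 concurrent kernels of 2 warps each against 64 warps/SMX on 13 SMXs, this gives 64/832, about 0.077: the concurrent-kernel limit, not the SMX capacity, is the binding constraint for many small kernels.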

Page 27:

(27)

Balancing Resources

[Figure: baseline microarchitecture annotated to show the limit of 32 concurrent kernels in the Kernel Distributor; with only a few TBs in flight, the SMXs are underutilized.]

Page 28:

(28)

Performance: Kernel Launch

• For a synchronous host kernel launch, launch time is ~10us (measured on a K20c)

• Launch time breakdown

Software:

o Runtime: resource allocation, e.g., parameters, streams, kernel information

o Driver: SW-HW interaction data structure allocation

Hardware:

o Kernel scheduling