
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads

S.-Y. Lee, A. A. Kumar, and C.-J. Wu

ISCA 2015

(2)

Goal

• Reduce warp divergence and hence increase throughput

• The key is the identification of critical (lagging) warps

• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence


(3)

Review: Resource Limits on Occupancy

[Block diagram: the kernel distributor dispatches work to the SMs through the SM scheduler, with DRAM below. Inside an SM, the warp contexts and the register file limit the number of resident threads, while the thread block control state and the L1/shared memory limit the number of resident thread blocks; the L1/shared memory also introduces locality effects. (SM – Stream Multiprocessor, SP – Stream Processor)]

(4)

Evolution of Warps in TB

• Coupled lifetimes of warps in a TB
v Start at the same time
v Synchronization barriers
v Kernel exit (implicit synchronization barrier)

Completed warps

Figure from P. Xiang et al., "Warp-Level Divergence: Characterization, Impact, and Mitigation"

Region where latency hiding is less effective


(5)

Warp Execution Time Disparity

• Causes: branch divergence, interference in the memory system, scheduling policy, and workload imbalance

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

(6)

Workload Imbalance

[Example: one warp executes 10 instructions while another executes 10,000]

• Imbalance exists even without control divergence


(7)

Branch Divergence

Warp-to-warp variation in dynamic instruction count (even without branch divergence)

Intra-warp branch divergence

Example: traversal over constant node-degree graphs

(8)

Branch Divergence

Extent of serialization over a CFG traversal varies across warps


(9)

Impact of the Memory System

Executing warps leave intra-wavefront footprints in the cache.

• Data that could have been re-used is evicted
• Re-use arrives too slowly

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

Scheduling gaps are greater than the re-use distance

(10)

Impact of Warp Scheduler

• Amplifies the critical warp effect

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015


(11)

Criticality Predictor Logic

[Example: one side of a branch executes m instructions, the other n]

• Non-divergent branches can generate large differences in dynamic instruction counts across warps (e.g., m>>n)

• Update CPL counter on branch: estimate dynamic instruction count

• Update CPL counter on instruction commit
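To make the two update rules concrete, below is a minimal sketch of how such a per-warp CPL counter might be maintained. The event hooks (on_branch, on_commit) and the per-branch path-length estimate are hypothetical stand-ins for what the hardware predictor would provide, not the paper's exact design.

// Hypothetical model of the CPL counter updates (a sketch, not CAWA's RTL).
#include <vector>
#include <cstdint>

struct WarpCPL {
    int64_t est_dynamic_instrs = 0;  // running estimate of the warp's dynamic instruction count
    int64_t committed_instrs   = 0;  // instructions actually committed so far
};

class CriticalityPredictor {
public:
    explicit CriticalityPredictor(int num_warps) : warps_(num_warps) {}

    // On a branch, fold in the estimated dynamic instruction count of the
    // chosen path (e.g. m vs. n in the slide's example).
    void on_branch(int warp_id, int64_t estimated_path_instrs) {
        warps_[warp_id].est_dynamic_instrs += estimated_path_instrs;
    }

    // On instruction commit, account for work actually retired.
    void on_commit(int warp_id) {
        warps_[warp_id].committed_instrs += 1;
    }

    // Remaining-work estimate: a larger value suggests a more critical (lagging) warp.
    int64_t remaining(int warp_id) const {
        const WarpCPL& w = warps_[warp_id];
        return w.est_dynamic_instrs - w.committed_instrs;
    }

private:
    std::vector<WarpCPL> warps_;
};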

(12)

Warp Criticality Problem

[Diagram: host CPU connected over the interconnection bus to the GPU. The GPU contains a kernel management unit (hardware work queues, pending kernels), a kernel distributor (entries holding PC, Dim, Param, ExeBL), the SMX scheduler with control registers, multiple SMXs, the memory controller, the L2 cache, and DRAM. Each SMX has cores, registers, L1 cache/shared memory, warp schedulers, and warp contexts.]

While the last warp of a TB lags:
• Available registers go unused: spatial underutilization
• Registers allocated to completed warps in the TB sit idle: temporal underutilization
• Completed (idle) warp contexts remain occupied

Manage resources and schedules around critical warps to address this temporal and spatial underutilization.


(13)

CPL Calculation

[Example: one branch path executes m instructions, the other n]

nCriticality = nInstr × w.CPIavg + nStall

where nInstr is the dynamic-instruction-count disparity between warps, w.CPIavg is the warp's average CPI, and nStall counts the inter-instruction memory stall cycles.

(14)

Scheduling Policy

• Select warp based on criticality

• Execute it until no more instructions are available
v A form of GTO

• Critical warps get higher priority and a larger time slice
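A sketch of how the criticality metric from the previous slide could drive a greedy, GTO-like pick; the per-warp statistics (instruction disparity, average CPI, stall cycles) are assumed to be supplied by the CPL hardware, and this is an illustration rather than the paper's exact scheduler.

// Greedy criticality-aware warp selection, a sketch only.
// nCriticality = nInstr * CPIavg + nStall, per the CPL calculation slide.
#include <vector>
#include <cstdint>

struct WarpState {
    bool    ready           = false;  // has an issuable instruction this cycle
    int64_t instr_disparity = 0;      // nInstr: dynamic-instruction gap vs. the leading warp
    double  avg_cpi         = 1.0;    // w.CPIavg
    int64_t stall_cycles    = 0;      // nStall: inter-instruction memory stall cycles
};

inline double criticality(const WarpState& w) {
    return static_cast<double>(w.instr_disparity) * w.avg_cpi +
           static_cast<double>(w.stall_cycles);
}

// Returns the index of the ready warp with the highest criticality, or -1 if none.
// Called every cycle, the same warp keeps winning (GTO-like greediness) as long as
// it stays ready and remains the most critical.
int select_warp(const std::vector<WarpState>& warps) {
    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i) {
        if (!warps[i].ready) continue;
        double s = criticality(warps[i]);
        if (s > best_score) { best_score = s; best = i; }
    }
    return best;
}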


(15)

Behavior of Critical Warp References

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

(16)

Criticality Aware Cache Prioritization

• Prediction: critical warp
• Prediction: re-reference interval
• Both used to manage the cache footprint (a toy policy is sketched below)

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
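A toy illustration of the idea behind the two predictions: lines fetched by a predicted-critical warp are inserted with a re-reference prediction that keeps them resident longer. This is a generic RRIP-style sketch under assumed constants, not the paper's exact CAP mechanism.

// Toy criticality-aware insertion/eviction for one cache set.
#include <array>
#include <cstdint>

constexpr int     kWays    = 4;
constexpr uint8_t kMaxRRPV = 3;   // 2-bit re-reference prediction value

struct Line {
    uint64_t tag   = 0;
    bool     valid = false;
    uint8_t  rrpv  = kMaxRRPV;
};

using Set = std::array<Line, kWays>;

// Victim: an invalid way, or the way with the largest RRPV (aging everyone if needed).
int find_victim(Set& set) {
    for (;;) {
        for (int w = 0; w < kWays; ++w)
            if (!set[w].valid || set[w].rrpv == kMaxRRPV) return w;
        for (auto& l : set) ++l.rrpv;   // age all lines and retry
    }
}

void insert(Set& set, uint64_t tag, bool from_critical_warp) {
    int v = find_victim(set);
    // Critical warps' data is predicted to be needed again, so give it a
    // near-immediate re-reference prediction; other data is easy to evict.
    set[v] = {tag, true,
              static_cast<uint8_t>(from_critical_warp ? 0 : kMaxRRPV - 1)};
}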


(17)

Integration into GPU Microarchitecture

Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

[Figure annotations: criticality prediction logic and cache management logic added to the microarchitecture]

(18)

Performance

Periodic computation of accuracy for critical warps

Due in part to low miss rates


(19)

Summary

• Warp divergence leads to some lagging warps → critical warps

• Expose the performance impact of critical warps → throughput reduction

• Coordinate scheduler and cache management to reduce warp divergence

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Cache Conscious Wavefront Scheduling
T. Rogers, M. O'Connor, and T. Aamodt

MICRO 2012


(21)

Goal

• Understand the relationship between schedulers (warp/wavefront) and locality behaviors
v Distinguish between inter-wavefront and intra-wavefront locality

• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
v Working set of the scheduled wavefronts fits in the cache
v Emphasis on intra-wavefront locality

(22)

Reference Locality

[Figure: taxonomy of reference locality — intra-thread and inter-thread locality within a wavefront make up intra-wavefront locality; sharing across wavefronts is inter-wavefront locality.]

• Scheduler decisions can affect locality

• Need non-oblivious (to locality) schedulers

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

Greatest gain: intra-wavefront locality


(23)

Key Idea

[Diagram: ready wavefronts waiting to issue alongside executing wavefronts whose intra-wavefront footprints occupy the cache. Issue another one?]

• Impact of issuing new wavefronts on the intra-warp locality of executing wavefronts
v Footprint in the cache
v When will a new wavefront cause thrashing?

(24)

Concurrency vs. Cache Behavior

[Figure: with a round-robin scheduler, peak performance occurs at less than maximum concurrency.]

Adding more wavefronts to hide latency is traded off against creating more long-latency memory references due to thrashing.

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012


(25)

Additions to the GPU Microarchitecture

Control the issue of wavefronts

[Pipeline diagram: I-Fetch, Decode, I-Buffer (pending warps), Issue, register file, scalar pipelines, D-Cache (all hit?), Writeback. The issue stage keeps track of locality behavior on a per-wavefront basis and uses it as feedback to control which wavefronts issue.]

(26)

Intra-Wavefront Locality

• Scalar thread traverses edges

• Edges stored in successive memory locations

• One thread vs. many

• Intra-thread locality leads to intra-wavefront locality

• The sustainable #wavefronts is determined by their reference footprints and the cache size

• The scheduler attempts to find the right #wavefronts


(27)

Static Wavefront Limiting

[Pipeline diagram as before: I-Fetch, Decode, I-Buffer (pending warps), Issue, register file, scalar pipelines, D-Cache, Writeback.]

• Limit the #wavefronts based on working set and cache size (a simple limit calculation is sketched below)

• Based on profiling?

• Current schemes allocate #wavefronts based on resource consumption, not effectiveness of utilization

• Seek to shorten the re-reference interval
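A back-of-the-envelope version of that limit computation. The per-wavefront footprint would come from a profiling run; the sizes in the usage comment are made up.

// How many wavefronts can be scheduled so their combined working set fits in L1?
#include <algorithm>
#include <cstddef>

std::size_t swl_limit(std::size_t l1_bytes,
                      std::size_t per_wavefront_footprint_bytes,  // from a profile run
                      std::size_t hw_max_wavefronts) {
    if (per_wavefront_footprint_bytes == 0) return hw_max_wavefronts;
    std::size_t fit = l1_bytes / per_wavefront_footprint_bytes;
    return std::max<std::size_t>(1, std::min(fit, hw_max_wavefronts));
}
// Example: swl_limit(32 * 1024, 6 * 1024, 48) -> 5 wavefronts allowed to issue.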

(28)

Cache Conscious Wavefront Scheduling

• Basic steps
v Keep track of the lost locality of each wavefront
v Control the number of wavefronts that can be issued

• Approach
v Maintain a list of wavefronts sorted by lost-locality score
v Stop issuing the wavefronts with the least lost locality

[Pipeline diagram as before: I-Fetch, Decode, I-Buffer (pending warps), Issue, register file, scalar pipelines, D-Cache, Writeback.]


(29)

Keeping Track of Locality

[Pipeline diagram as before, extended with a victim tag array: it keeps track of the wavefronts associated with evicted tags, indexed by wavefront ID (WID). Victim hits indicate "lost locality" and update that wavefront's lost-locality score.]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(30)

Updating Estimates of Locality

[Pipeline diagram as before. Wavefronts are kept sorted by lost-locality score; a wavefront's score is updated on a victim (VTA) hit. The sorted list notes which wavefronts are allowed to issue, and the allowed set changes with the effective #wavefronts.]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012


(31)

Estimating the Feedback Gain

• No VTA hits: scores drop
• VTA hit: increase the wavefront's score
• Wavefronts pushed below the cutoff are prevented from issuing

LLDS = (VTAHitsTotal / InstIssuedTotal) × KThrottle × CumLLSCutoff

The gain is scaled based on the VTA hit rate as a percentage of the cutoff.

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
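A simplified software model of the scoring and throttling loop sketched on the last three slides. The decay factor, the score gain, and the de-scheduling rule are assumptions chosen only to illustrate the feedback structure, not CCWS's exact constants.

// Simplified CCWS-style throttling model.
#include <vector>
#include <numeric>
#include <algorithm>

struct Wavefront {
    int    id        = 0;
    double lls       = 0.0;   // lost-locality score
    bool   can_issue = true;
};

// Victim-tag-array hit on data this wavefront recently lost from L1.
void on_vta_hit(Wavefront& w, double gain) { w.lls += gain; }

// Scores drop when no VTA hits occur.
void decay(std::vector<Wavefront>& wfs, double factor = 0.99) {
    for (auto& w : wfs) w.lls *= factor;
}

// Re-evaluate which wavefronts may issue this scheduling interval:
// when total lost locality exceeds the cutoff, de-schedule the wavefronts
// with the LEAST lost locality first.
void apply_cutoff(std::vector<Wavefront>& wfs, double cumulative_cutoff) {
    for (auto& w : wfs) w.can_issue = true;
    double total = std::accumulate(wfs.begin(), wfs.end(), 0.0,
        [](double s, const Wavefront& w) { return s + w.lls; });
    std::vector<Wavefront*> order;
    for (auto& w : wfs) order.push_back(&w);
    std::sort(order.begin(), order.end(),
              [](const Wavefront* a, const Wavefront* b) { return a->lls < b->lls; });
    for (Wavefront* w : order) {
        if (total <= cumulative_cutoff) break;
        w->can_issue = false;          // de-schedule; its footprint leaves the cache
        total -= w->lls;
    }
}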

(32)

Schedulers

• Loose Round Robin

• Greedy-Then-Oldest (GTO)
v Execute one wavefront until it stalls, then switch to the oldest

• 2LVL-GTO
v Two-level scheduler with GTO instead of RR

• Best SWL

• CCWS


(33)

Performance

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(34)

Some General Observations

• Performance is largely determined by
v The emphasis on oldest wavefronts
v The distribution of references across cache lines – the extent of (lack of) coalescing

• GTO works well for this reason → it prioritizes older wavefronts

• LRR touches too much data (too little reuse) to fit in the cache


(35)

Locality Behaviors

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

(36)

Summary

• Dynamic tracking of relationship between wavefronts and working sets in the cache

• Modify scheduling decisions to minimize interference in the cache

• Tunables: need profile information to create stable operation of the feedback control


© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Divergence-Aware Warp Scheduling
T. Rogers, M. O'Connor, and T. Aamodt

MICRO 2013

(38)

Goal

• Understand the relationship between schedulers (warp/wavefront) and control & memory locality behaviors
v Distinguish between inter-wavefront and intra-wavefront locality

• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
v Working set of the scheduled wavefronts fits in the cache
v Emphasis on intra-wavefront locality
v Couple in the effects of control flow divergence

• Differs from CCWS in being proactive
v Takes a deeper look at what happens inside loops


(39)

Key Idea

• Manage the relationship between control divergence, memory divergence and scheduling

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(40)

Key Idea (2)

• Couple control divergence and memory divergence
• The former indicates a reduced cache footprint requirement
• "Learn" eviction patterns


(41)

Key Idea (2)

[Example: both warps execute "while (i < C[tid+1])", a divergent branch with intra-thread locality. Warp 0's references fill the cache with 4 lines, so warp 1 is delayed; once divergence in warp 0 leaves room in the cache, warp 1 is scheduled. Warp 0's behavior is used to predict the interference warp 1 would cause.]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(42)

Goal: Simpler Portable Version

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

• Each thread computes a row (the scalar SpMV pattern sketched below)
• Structured similarly to a multithreaded CPU version
• Desirable
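For reference, the "simpler portable version" corresponds to the scalar CSR SpMV pattern, one thread per row; a minimal CUDA sketch (array names and launch configuration are illustrative, not taken from the paper's code):

// CSR "scalar" SpMV: one thread per row, the structure a CPU programmer would write.
// Each thread walks its own row, so accesses to row_ptr/cols/vals exhibit intra-thread
// (and hence intra-wavefront) locality that a divergence-aware scheduler can preserve.
__global__ void spmv_csr_scalar(int num_rows,
                                const int*   __restrict__ row_ptr,
                                const int*   __restrict__ cols,
                                const float* __restrict__ vals,
                                const float* __restrict__ x,
                                float*       __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // The loop trip count varies per thread -> branch divergence within the warp.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[cols[j]];
        y[row] = sum;
    }
}
// Launch: spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(...);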


(43)

Goal: GPU-Optimized Version

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

• Each warp handles a row
• Threads are renumbered modulo the warp size to index row elements
• The warp traverses the whole row; each thread handles a few of the row's column elements
• The partial sums computed by the threads of a row must then be summed

(44)

Optimizing the Vector Version

• What do we need to know?

• TB & warp sizes, grid mappings

• Tuning TB sizes as a function of machine size

• Orchestrate the partial product summation

Simplicity (productivity) with no sacrifice in performance


(45)

Goal: Simpler Portable Version vs. GPU-Optimized Version

Make their performance equivalent

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(46)

Additions to the GPU Microarchitecture

• Control the issue of warps
• Implements PDOM reconvergence for control divergence

[Pipeline diagram as before, with an I-Buffer/scoreboard at issue and a memory coalescer in front of the D-Cache.]


(47)

Observation

• Bulk of the accesses in a loop come from a few static load instructions

• Bulk of the locality in (these) applications is intra-loop

Loops

(48)

Distribution of Locality

• Bulk of the locality comes from a few static loads in loops

• Find temporal reuse

Hint: Can we keep data from last iteration?

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(49)

A Solution

• Prediction mechanisms for locality across iterations of a loop

• Schedule such that data fetched in one iteration is still present at next iteration

• Combine with control flow divergence (how much of the footprint needs to be in the cache?)

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(50)

Classification of Dynamic Loads

• Group static loads into equivalence classes → loads that reference the same cache line

• Identify these groups by a repetition ID
• Prediction for each load comes from the compiler or hardware

[Figure: loads classified as converged, somewhat diverged, or diverged, alongside control flow divergence]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(51)

Coupling Divergence Effects

• Can we improve performance with branch prediction?
v Diverged warps reduce the cache footprint

[Figure: converged vs. control-flow-divergent warps]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(52)

Predicting a Warp's Cache Footprint

[Diagram: a warp moving through the taken/not-taken paths of a loop]

• Entering the loop body: create the footprint prediction
• Some threads exit the loop: the predicted footprint drops
• Exiting the loop: reinitialize the prediction

• Predict the locality usage of static loads
v Not all loads increase the footprint

• Combine with control divergence to predict the footprint
• Use the footprint to throttle (or not throttle) warp issue


(53)

Predicting a Warp's Active Thread Count

[Diagram: a warp moving through taken/not-taken paths]

• Predict the cache usage of a thread
• Track the #active threads, i.e., control divergence

• Modulate the footprint based on predicted cache usage
v Predictions include control divergence effects

• Classify each static load in a loop based on divergence and repetition ID → profiled DAWS

(54)

Principles of Operation

• A prefix sum of each warp's cache footprint is used to select the warps that can be issued (sketched below)

EffCacheSize = kAssocFactor × TotalNumLines

• kAssocFactor scales back from a fully associative cache
• Empirically determined

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
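A sketch of the prefix-sum selection under the effective cache size; the footprint values and kAssocFactor are assumed inputs, and the tie-breaking details are not the paper's.

// Allow warps to issue, in priority order, while their combined predicted
// footprints fit in EffCacheSize = kAssocFactor * TotalNumLines.
#include <vector>
#include <cstddef>

struct WarpPred {
    int id;
    std::size_t predicted_footprint_lines;  // from the divergence/locality predictor
};

std::vector<int> daws_select(const std::vector<WarpPred>& warps_in_priority_order,
                             std::size_t total_cache_lines,
                             double k_assoc_factor /* e.g. 0.5, empirically chosen */) {
    const double eff_cache = k_assoc_factor * static_cast<double>(total_cache_lines);
    std::vector<int> allowed;
    double prefix = 0.0;
    for (const auto& w : warps_in_priority_order) {
        prefix += static_cast<double>(w.predicted_footprint_lines);
        // Always let at least one warp run to avoid deadlock; otherwise stop
        // as soon as the running footprint would overflow the effective cache.
        if (prefix > eff_cache && !allowed.empty()) break;
        allowed.push_back(w.id);
    }
    return allowed;
}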


(55)

Principles of Operation (2)

• Profile static load instructions
v Are they divergent?
v Loop repetition ID
o Assume all loads with the same base address and an offset within the same cache line are repeated each iteration

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(56)

Prediction Mechanisms

• Profiled Divergence-Aware Scheduling (DAWS)
v Uses offline profile results to dynamically determine de-scheduling decisions

• Detected Divergence-Aware Scheduling (DAWS)
v Behaviors derived at run time drive the de-scheduling decisions
o Loops that exhibit intra-warp locality
o Static loads are characterized as divergent or convergent


(57)

Extensions for DAWS

[Pipeline diagram as before, with the DAWS additions (the prediction and footprint-tracking logic) attached at the issue stage.]

(58)

Operation: Tracking

[Figure: the cache-footprint table — one entry per warp issue slot, created/removed at loop begin/end, holding profile-based information and the #active lanes; it is the basis for throttling.]

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(59)

Operation: Prediction

Sum the results from the static loads in this loop (see the sketch below):
- Add #active-lanes cache lines for divergent loads
- Add 2 for converged loads
- Count loads in the same equivalence class only once (unless divergent)

• Generally, only warps in loops are considered for de-scheduling
v Since most of the activity is there

• Can be extended to non-loop regions by associating non-loop code with the next loop

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
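The prediction rule above translates almost directly into code. A sketch, assuming the per-load classification (divergent vs. converged, repetition ID) is supplied by profiled or detected DAWS:

// Predict a warp's loop footprint in cache lines from its static loads.
#include <vector>
#include <set>
#include <cstddef>

struct StaticLoad {
    int  repetition_id;   // loads with the same id hit the same cache line each iteration
    bool divergent;       // memory-divergent (uncoalesced) load?
};

std::size_t predict_footprint_lines(const std::vector<StaticLoad>& loop_loads,
                                    int active_lanes) {
    std::size_t lines = 0;
    std::set<int> counted_classes;
    for (const auto& ld : loop_loads) {
        if (ld.divergent) {
            lines += static_cast<std::size_t>(active_lanes);   // one line per active lane
        } else if (counted_classes.insert(ld.repetition_id).second) {
            lines += 2;   // converged load: count its equivalence class once, at 2 lines
        }
    }
    return lines;
}
// e.g. one divergent load plus two converged loads in the same class,
// with 20 active lanes -> 20 + 2 = 22 predicted lines.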

(60)

Operation: Nested Loops

for (i = 0; i < limitx; i++) {
    ....
    for (j = 0; j < limity; j++) {
        ....
    }
    ..
}

• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction

De-scheduling of warps is determined by inner-loop behaviors!

• Predictions are re-used based on the innermost loops, which is where most of the data re-use is found

• Note the attempt to tie footprints to code segments


(61)

Detected DAWS: Prediction

[Figure: a sampling warp for the loop (> 2 active threads); loads are entered into the table when locality is detected.]

• Detect both memory divergence and intra-loop repetition at run time
• Fill the per-load-PC entries based on run-time information
• Use profile information to start

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(62)

Detected DAWS: Classification

Increment or decrement the counter depending on #memory accesses for a load

Create equivalence classes of loads (checking the PCs)

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(63)

Performance

Little to no degradation

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(64)

Performance

Significant intra-warp locality

SPMV-scalar normalized to the best SPMV-vector

Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013


(65)

Summary

• If we can characterize warp level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints

• Proactive scheme outperforms reactive management

• Understand interactions between memory divergence and control divergence

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al., ASPLOS 2013


(67)

Goal

• Understand memory effects of scheduling from deeper within the memory hierarchy

• Minimize idle cycles induced by stalling warps waiting on memory references

Off-chip Bandwidth is Critical!

[Figure: percentage of total execution cycles wasted waiting for data to come back from DRAM, across GPGPU applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, and others), split into Type-1 and Type-2 applications. Type-1 applications waste about 55% of their cycles on average; the average across all applications is 32%.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(69)

Source of Idle Cycles

• Warps stalled waiting on a memory reference
v Cache miss
v Service time at the memory controller
v Row buffer miss in DRAM
v Latency in the network (not addressed in this paper)

• The last-warp effect

• The last-CTA effect

• Lack of multiprogrammed execution
v One (small) kernel at a time

(70)

Impact of Idle Cycles

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


High-Level View of a GPU

[Figure: threads are grouped into warps and warps into cooperative thread arrays (CTAs); a scheduler feeds SIMT cores, each with a warp scheduler, ALUs, and L1 caches, connected through the interconnect to the L2 cache and DRAM.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

CTA-Assignment Policy (Example)

[Figure: a multi-threaded CUDA kernel launches CTA-1 through CTA-4, which are distributed across two SIMT cores (e.g., CTA-1 and CTA-3 on Core-1, CTA-2 and CTA-4 on Core-2), each core with its own warp scheduler, ALUs, and L1 caches.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(73)

Organizing CTAs Into Groups

• Set the minimum number of warps per group equal to the #pipeline stages
v Same philosophy as the two-level warp scheduler

• Use the same CTA grouping/numbering across SMs?

[Figure: two SIMT cores, each with a warp scheduler, ALUs, and L1 caches, holding CTA-1/CTA-3 and CTA-2/CTA-4 respectively.]

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Warp Scheduling Policy

• All launched warps on a SIMT core have equal priority
v Round-robin execution

• Problem: many warps stall at long-latency operations at roughly the same time

[Timeline figure: all warps of all CTAs compute with equal priority, then send their memory requests together, so the SIMT core stalls until the data returns.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


Solution

[Timeline figure: warps are split into groups; while one group's memory requests are outstanding, another group computes, saving the cycles the core would otherwise spend stalled.]

• Form warp groups (Narasiman, MICRO 2011)
• CTA-aware grouping
• Group switching is round-robin

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(76)

Two Level Round Robin Scheduler

[Figure: CTAs are organized into groups (Group 0 through Group 3); the scheduler round-robins between groups and round-robins among the warps within the selected group.]

Agnostic to when pending misses are satisfied
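A sketch of the two-level, CTA-aware round-robin selection: round-robin across groups, round-robin within the selected group, staying with a group until none of its warps are ready. Group membership and the readiness test are assumed inputs, not the paper's exact mechanism.

// Two-level round-robin over CTA-aware warp groups (illustrative model).
#include <vector>
#include <cstddef>

struct TwoLevelRR {
    std::vector<std::vector<int>> groups;      // groups[g] = warp ids of CTA group g
    std::size_t current_group = 0;
    std::vector<std::size_t> next_in_group;    // per-group round-robin pointer

    explicit TwoLevelRR(std::vector<std::vector<int>> g)
        : groups(std::move(g)), next_in_group(groups.size(), 0) {}

    template <typename ReadyFn>                // ReadyFn: bool(int warp_id)
    int pick(ReadyFn ready) {
        for (std::size_t g = 0; g < groups.size(); ++g) {
            std::size_t gi = (current_group + g) % groups.size();
            auto& warps = groups[gi];
            if (warps.empty()) continue;
            for (std::size_t k = 0; k < warps.size(); ++k) {
                std::size_t wi = (next_in_group[gi] + k) % warps.size();
                if (ready(warps[wi])) {
                    next_in_group[gi] = (wi + 1) % warps.size();
                    current_group = gi;        // stay with this group until it stalls
                    return warps[wi];
                }
            }
        }
        return -1;   // nothing ready: the core idles this cycle
    }
};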


(77)

Key Idea

Executing Warps (EW), Stalled Warps (SW), Cache-Resident Warps (CW)

• What is the relationship between EW, SW, and CW?

• When EW ≠ CW, we get interference

• Principle – optimize reuse: seek to make EW = CW

Objective 1: Improve Cache Hit Rates

[Timeline figure: when CTA 1's data arrives, switching back to CTA 1 means only 3 CTAs run in time T instead of 4.]

Fewer CTAs accessing the cache concurrently → less cache contention

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


Reduction in L1 Miss Rates

[Figure: normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and the average, comparing Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; average miss-rate reductions of roughly 8% and 18% are marked.]

• Limited benefits for cache-insensitive applications

• What is happening deeper in the memory system?

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(80)

The Off-Chip Memory Path

[Figure: compute units CU 0–15 and CU 16–31 connect through multiple memory controllers (MCs) to off-chip GDDR5 channels.]

Access patterns?

Ordering and buffering?


(81)

Inter-CTA Locality

[Figure: two SIMT cores, each with a warp scheduler, ALUs, and L1 caches, running CTA-1/CTA-3 and CTA-2/CTA-4, connected to several DRAM channels.]

How do CTAs interact at the memory controller and in DRAM?

(82)

Impact of the Memory Controller

• Memory scheduling policies
v Optimize bandwidth vs. memory latency

• Impact of row buffer access locality

• Cache lines?


(83)

Row Buffer Locality

CTA Data Layout (A Simple Example)

[Figure: a data matrix is partitioned among CTA 1 through CTA 4; with a row-major DRAM layout, the matrix rows A(0,*), A(1,*), A(2,*), and A(3,*) map to Bank 1 through Bank 4 respectively, so the data of consecutively numbered CTAs falls into the same DRAM rows.]

On average, 64% of consecutive CTAs (out of the total CTAs) access the same row.

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
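A toy address decomposition that illustrates why consecutive CTAs end up sharing a DRAM row under a row-major layout; all of the geometry constants (row size, bank count, per-CTA chunk) are made up for the example.

// Map a byte address to (bank, row) under a simple row-interleaved scheme.
#include <cstdint>
#include <cstdio>

struct DramLoc { uint32_t bank; uint32_t row; };

DramLoc map_address(uint64_t byte_addr,
                    uint64_t row_bytes = 2048,   // bytes per DRAM row (assumed)
                    uint32_t num_banks = 4) {
    uint64_t row_id = byte_addr / row_bytes;     // which row-sized chunk of memory
    return { static_cast<uint32_t>(row_id % num_banks),
             static_cast<uint32_t>(row_id / num_banks) };
}

int main() {
    // Two consecutive CTAs each touching 512 B that sit next to each other in memory:
    DramLoc a = map_address(0 * 512);   // CTA 1's first line
    DramLoc b = map_address(1 * 512);   // CTA 2's first line
    std::printf("CTA1 -> bank %u row %u, CTA2 -> bank %u row %u\n",
                a.bank, a.row, b.bank, b.row);   // same bank, same row
    return 0;
}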


Implications of High CTA-Row Sharing

[Figure: warps of CTA-1/CTA-3 on SIMT Core-1 and CTA-2/CTA-4 on SIMT Core-2 send requests through the L2 cache to Banks 1–4 (Rows 1–4). With the same CTA prioritization order on both cores, requests pile onto the same bank and row while the other banks sit idle.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

[Figure: queuing all requests for Row-1 of Bank-1 before touching Bank-2 gives high row locality but low bank-level parallelism; spreading the requests across Bank-1 and Bank-2 gives lower row locality but higher bank-level parallelism.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


(87)

Some Additional Details

• Spread references from multiple CTAs (on multiple SMs) across the row buffers of distinct banks

• Do not use the same CTA group prioritization across SMs
v Play the odds

• What happens with applications with unstructured, irregular memory access patterns?

Objective 2: Improving Bank-Level Parallelism

[Figure: with a different CTA prioritization order on each SIMT core, the warps of CTA-1/CTA-3 and CTA-2/CTA-4 spread their requests across Banks 1–4.]

11% increase in bank-level parallelism
14% decrease in row buffer locality

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013


Objective 3: Recovering Row Locality

[Figure: memory-side prefetching pulls the remaining lines of an open row into the L2 cache, so later accesses from the SIMT cores become L2 hits.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Memory-Side Prefetching

• Prefetch the so-far-unfetched cache lines of an already open row into the L2 cache, just before the row is closed

• What to prefetch?
v Sequentially prefetch the cache lines that were not accessed by demand requests
v More sophisticated schemes are left as future work

• When to prefetch?
v Opportunistic in nature
v Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical)
v Option 2: give prefetching more time, making demands wait if there are not many (demands are NOT always critical)

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
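A sketch of the prefetch decision described above: when an open row is about to close, collect the line addresses of that row that demand requests never touched. The row bookkeeping interface here is hypothetical.

// Memory-side prefetch candidate selection (illustrative only).
#include <vector>
#include <cstdint>

struct OpenRow {
    uint64_t row_base_addr;          // first byte of the open row
    uint32_t lines_per_row;          // e.g. 2048-byte row / 128-byte line = 16
    std::vector<bool> demanded;      // which lines demand requests already fetched
};

// Returns the line addresses to prefetch into L2 just before the row closes.
std::vector<uint64_t> lines_to_prefetch(const OpenRow& row, uint32_t line_bytes = 128) {
    std::vector<uint64_t> out;
    for (uint32_t i = 0; i < row.lines_per_row; ++i)
        if (!row.demanded[i])                    // sequentially pick the untouched lines
            out.push_back(row.row_base_addr + static_cast<uint64_t>(i) * line_bytes);
    return out;
}
// Option 1 from the slide would stop issuing these prefetches as soon as a demand
// request arrives for a different row, since demands are treated as critical.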


IPC Results (Normalized to Round-Robin)

[Figure: normalized IPC for the Type-1 applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP); the average (AVG-T1) improvement marked is 44%, and performance comes within 11% of a perfect L2.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(92)

Summary

• Coordinated scheduling across SMs, CTAs, and warps

• Consideration of effects deeper in the memory system

• Coordinating warp residence in the core with the presence of corresponding lines in the cache