Divergence-Aware Warp Scheduling
Timothy G. Rogers¹, Mike O'Connor², Tor M. Aamodt¹
¹The University of British Columbia  ²NVIDIA Research
MICRO 2013, Davis, CA
GPU
[Diagram: a GPU contains many Streaming Multiprocessors; each has a Warp Scheduler, a Memory Unit, and an L1D cache, and all share an L2 cache and Main Memory. Threads are grouped into warps W1, W2, …]
• 10,000s of concurrent threads
• Grouped into warps
• Scheduler picks a warp to issue each cycle
2 Types of Divergence
• Branch divergence: within a warp, e.g. if(…){…}, some threads take one path and others take the other (active mask 1 0 1 0). Affects functional unit utilization.
• Memory divergence: the threads of a warp access different cache lines on a single load, which can waste memory bandwidth.
This work is aware of both branch divergence and memory divergence, and focuses on improving performance.
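As a minimal illustration (a hypothetical kernel written for this transcript, not taken from the talk), the sketch below shows both kinds of divergence in one warp: the data-dependent if() causes branch divergence, and the indirect gather causes memory divergence.

    // Hypothetical CUDA kernel, for illustration only.
    __global__ void divergence_demo(const int *idx, const float *A,
                                    float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (idx[tid] % 2 == 0)          // branch divergence: lanes of a warp disagree
            out[tid] = A[idx[tid]];     // memory divergence: scattered gather addresses
        else
            out[tid] = 0.0f;
    }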
Motivation
• Transfer locality management from SW to HW
• Software solutions:
  • Complicate programming
  • Not always performance portable
  • Not guaranteed to improve performance
  • Sometimes impossible
• Improve performance of programs with memory divergence:
  • Parallel irregular applications
  • Economically important (server computing, big data)
Programmability Case Study: Sparse Vector-Matrix Multiply
• 2 versions from SHOC
• Divergent version: each thread has locality, but suffers divergence
• GPU-optimized version: added complication: dependent on warp size, parallel reduction, explicit scratchpad use
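For concreteness, here is a simplified sketch (assumed, not the exact SHOC source; array and kernel names are illustrative) of the two styles being compared. The divergent version keeps each row's reuse inside one thread; the GPU-optimized version assigns one warp per row and adds a scratchpad-based parallel reduction, which ties it to the warp size.

    // Divergent version: one thread per row of the CSR matrix. Simple to write,
    // and each thread has locality on its own row, but rows of different length
    // cause branch divergence and the gathers are memory divergent.
    __global__ void spmv_csr_scalar(const int *rowPtr, const int *col,
                                    const float *val, const float *x,
                                    float *y, int numRows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < numRows) {
            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[col[j]];
            y[row] = sum;
        }
    }

    // GPU-optimized version: one 32-thread warp per row, with an explicit
    // scratchpad (shared memory) used for the parallel reduction.
    __global__ void spmv_csr_vector(const int *rowPtr, const int *col,
                                    const float *val, const float *x,
                                    float *y, int numRows)
    {
        __shared__ float partial[128];               // assumes blockDim.x == 128
        int lane = threadIdx.x & 31;                 // lane within the warp
        int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
        float sum = 0.0f;
        if (row < numRows)
            for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += 32)
                sum += val[j] * x[col[j]];
        partial[threadIdx.x] = sum;
        __syncwarp();
        for (int off = 16; off > 0; off >>= 1) {     // parallel reduction, warp-size dependent
            if (lane < off)
                partial[threadIdx.x] += partial[threadIdx.x + off];
            __syncwarp();
        }
        if (lane == 0 && row < numRows)
            y[row] = partial[threadIdx.x];           // lane 0 holds the row sum
    }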
Previous Work vs. Divergence-Aware Warp Scheduling
• Previous work (CCWS, MICRO 2012): scheduling used to capture intra-thread locality. Reactive: detects interference (lost locality), then throttles. Unaware of branch divergence: all warps treated equally. Outperformed by profiled static throttling. Case study: divergent code 50% slowdown.
• Divergence-Aware Warp Scheduling: proactive: predict and be proactive. Adapts to branch divergence. Outperforms the static solution. Case study: divergent code <4% slowdown.
[Diagram: warp scheduler with warps W1 … WN, an L1D, and a memory unit; when lost locality is detected for a warp, the scheduler stops some warps (Stop) and lets others issue (Go)]
Divergence-Aware Warp Scheduling
How to be proactive:
• Identify where locality exists
• Limit the number of warps executing in high-locality regions
• Create a cache footprint prediction in high-locality regions
How to adapt to branch divergence:
• Account for the number of active lanes to create the per-warp footprint prediction
• Change the prediction as branch divergence occurs
Where is the locality?
• Examine every load instruction in the program
[Chart: hits and misses per kilo-instruction (PKI) for each static load instruction, Load 1 through Load 13, in the GC workload; a bracket marks the loads that sit inside a loop]
Locality is concentrated in loops.
Locality In Loops Limit Study
[Chart: fraction of cache hits in loops, averaged across workloads, broken down into lines accessed this iteration, lines accessed last iteration, and other]
• Hits are on data accessed in the immediately previous loop trip
• How much data should we keep around?
DAWS Objectives
1. Predict the amount of data accessed by each warp in a loop iteration.
2. Schedule warps in loops so that the aggregate predicted footprint does not exceed the L1D.
Observations that enable prediction
• Memory divergence in static instructions is predictable
• The data touched by divergent loads depends on the active mask
[Diagram: the same divergent load generates 4 accesses to main memory for a warp with active mask 1 1 1 1, but only 2 accesses for a warp with active mask 1 0 1 0]
Both observations are used to create the cache footprint prediction.
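A software sketch of how these observations could be turned into the prediction and the scheduling rule from the objectives above. This is an illustrative model only; the structure and names are assumptions, not the hardware design. Each diverged load in the loop is charged roughly one cache line per active lane, each non-diverged load one line, and a warp is only prioritized into the loop if the aggregate predicted footprint still fits in the L1D.

    // Illustrative software model of the footprint prediction; not the DAWS hardware.
    struct WarpInfo {
        unsigned activeMask;    // one bit per lane of the warp
        int divergedLoads;      // loads in the loop classified as memory-divergent
        int convergedLoads;     // loads in the loop that stay converged
    };

    // Predicted cache lines touched by one loop iteration of this warp:
    // a diverged load touches ~1 line per active lane, a converged load ~1 line.
    int predictedFootprint(const WarpInfo &w)
    {
        int activeLanes = 0;
        for (unsigned m = w.activeMask; m != 0; m >>= 1)
            activeLanes += m & 1;
        return w.divergedLoads * activeLanes + w.convergedLoads;
    }

    // A warp is prioritized into the high-locality loop only if its predicted
    // footprint plus those of warps already in the loop fits in the L1D.
    bool mayEnterLoop(const WarpInfo &candidate,
                      const WarpInfo *inLoop, int numInLoop, int l1dLines)
    {
        int aggregate = predictedFootprint(candidate);
        for (int i = 0; i < numInLoop; ++i)
            aggregate += predictedFootprint(inLoop[i]);
        return aggregate <= l1dLines;   // otherwise the warp is de-prioritized
    }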
Online characterization to create cache footprint prediction
1. Detect loops with locality: some loops have locality and some don't; limit multithreading only in the loops that do.
2. Classify the loads in the loop, e.g. in while(…){ load 1 … load 2 }, load 1 is diverged and load 2 is not diverged.
3. Compute the footprint from the active mask: for Warp 0, the diverged load generates one access per active lane (4 in the example) and the non-diverged load generates 1 access, so Warp 0's footprint is 5 cache lines.
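Plugging the example above into the illustrative predictedFootprint() sketch from earlier gives the same arithmetic:

    // Warp 0: all 4 lanes active, one diverged load and one non-diverged load.
    WarpInfo warp0 = { 0xF /* active mask 1 1 1 1 */, 1, 1 };
    int lines = predictedFootprint(warp0);   // 1*4 + 1 = 5 cache lines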
Example Compressed Sparse Row Kernel

    // Row pointers for the compressed sparse row (CSR) matrix.
    int C[] = {0, 64, 96, 128, 160, 160, 192, 224, 256};

    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];            // each thread walks its own row
        while (i < C[tid + 1]) {   // rows have different lengths: branch divergence
            sum += A[i];           // threads read scattered addresses: memory divergence
            ++i;
        }
        …
    }
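A quick reading of the row-pointer array above (a derived observation, not stated on the slide): the row lengths C[t+1] - C[t] are 64, 32, 32, 32, 0, 32, 32, 32 for threads 0 through 7. Thread 0's long row and thread 4's empty row are what produce the branch divergence seen in the operation example below.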
DAWS Operation Example
[Timeline figure: Warp 0 and Warp 1 execute the CSR kernel above; cache contents, per-warp footprints, and scheduler decisions (Go/Stop) are shown at Time 0, Time 1, and Time 2]
• Time 0 (Warp 0's 1st iteration): no locality has been detected yet, so there is no footprint; early warps profile the loop for later warps. Memory divergence on the load is detected.
• Time 1 (Warp 0's 2nd iteration): locality detected and 1 diverged load detected; with 4 active threads (active mask 1 1 1 1), Warp 0's footprint = 4 × 1 = 4 cache lines and it Goes, while Warp 1 is Stopped so the aggregate footprint stays within the cache. The cache holds A[0], A[64], A[96], A[128], and Warp 0 keeps hitting in those lines (Hit ×30 on each): the spatial locality we want to capture.
• Time 2 (Warp 0's 33rd iteration): Warp 0 now has branch divergence (active mask 1 0 0 0), so its footprint decreases. Warp 1 (1st iteration, active mask 0 1 1 1, footprint = 3 × 1) can now Go; both warps capture spatial locality together. The cache holds A[32], A[160], A[192], A[224].
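With the same illustrative model, the Time 2 predictions work out as follows (the CSR loop body has one diverged load and no converged loads):

    WarpInfo w0 = { 0x8 /* active mask 1 0 0 0 */, 1, 0 };   // 1 active lane
    WarpInfo w1 = { 0x7 /* active mask 0 1 1 1 */, 1, 0 };   // 3 active lanes
    int f0 = predictedFootprint(w0);   // 1 cache line
    int f1 = predictedFootprint(w1);   // 3 cache lines
    // f0 + f1 = 4 lines fits in the example's cache, so both warps may go.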
Methodology
• GPGPU-Sim (version 3.1.0): 30 streaming multiprocessors; 32 warp contexts (1024 threads total); 32KB L1D per streaming multiprocessor; 1MB unified L2 cache
• Compared schedulers: Cache-Conscious Wavefront Scheduling (CCWS); profile-based Best-SWL; Divergence-Aware Warp Scheduling (DAWS)
• More schedulers in the paper
Sparse MM Case Study Results
• Performance (normalized to the optimized version)
[Chart: execution time of the divergent code, normalized to the optimized version]
• The divergent code comes within 4% of the optimized version with no programmer input
Sparse MM Case Study Results
• Properties (normalized to the optimized version)
[Chart: off-chip accesses of the divergent code, normalized to the optimized version]
• Less than 20% increase in off-chip accesses
• The divergent code issues 2.8× fewer instructions, so the divergent version now has potential energy advantages
Cache-Sensitive Applications
• Breadth First Search (BFS)
• Memcached-GPU (MEMC)
• Sparse Matrix-Vector Multiply (SPMV-Scalar)
• Garbage Collector (GC)
• K-Means Clustering (KMN)
Cache-insensitive applications in the paper.
Results
[Chart: normalized speedup for BFS, MEMC, SPMV-Scalar, GC, KMN, and the harmonic mean (HMean), comparing CCWS, Best-SWL, and DAWS]
• DAWS outperforms Best-SWL on highly branch-divergent applications
• Overall 26% improvement over CCWS
Summary
• Problem: divergent loads in GPU programs; software solutions complicate programming
• Solution: DAWS captures these opportunities by accounting for divergence
• Result: overall 26% performance improvement over CCWS; case study: divergent code performs within 4% of code optimized to minimize divergence
Questions?