DESCRIPTION
PL-4049, Cache Coherence for GPU Architectures, by Arvindh Shriraman and Tor Aamodt at the AMD Developer Summit (APU13), November 11-13, 2013

TRANSCRIPT
Cache Coherence for GPU Architectures
Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19)
Agenda
Challenges with CPU coherence on GPUs.
Temporal Coherence: rethinking coherence for GPUs.
What is the cost of providing coherence?
Why provide coherence?
1. Inter-workgroup communication
2. Atomic operations
3. Task queues
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems, ISPASS 2012
Cache Coherence
[Diagram: four processors P sharing one memory]
Appearance to the programmer: one global copy of every location.
Cache Coherence
[Diagram: multicores pair a few processors, each with an L1, with shared L2 banks over memory; GPUs place many L1s over a shared L2 and memory; heterogeneous systems combine both]
How to provide coherence across such heterogeneous systems?
Challenges
Challenges with coherence
[Diagram: two L1s above a shared L2; keeping them coherent requires a multi-step message exchange (1, 2, 3)]
Challenge 1: Traffic
[Diagram: invalidation and acknowledgment messages flowing between the L1s and the shared L2]
Coherence messages add 30% more traffic than current GPUs.
Challenge 2: Buffer Overhead
[Diagram: a protocol buffer sitting alongside the shared L2]
Coherence protocol buffers require 28% of total L2.
Challenge 3: Complexity
An incoherent protocol needs only 4 states; a coherent protocol needs 16 states.
Coherence Overhead
Coherence messages cost us in three ways: 1. the traffic of transferring them, 2. area overhead, 3. protocol complexity.
How can we achieve coherence without messages?
TEMPORAL COHERENCE
Temporal Coherence: a time-based approach. Protocol events are triggered by timer expiry instead of by invalidation messages.
Temporal Coherence
[Diagram: L1s and a shared L2 driven by a single synchronized clock]
On a load, the L2 returns the data together with a local timestamp LT; the L1 copy is valid only while TIME < LT. The L2 keeps a global timestamp GT per line; the line may be shared in some L1 while TIME < GT.
Temporal Coherence
Example: at TIME = 0, an L1 issues a load. The L2 returns the data with timestamp 20 and records that the line may be shared until time 20.
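The load walkthrough above can be sketched as a tiny timestamp model (Python). The fixed `LEASE` of 20 ticks, the class and function names, and the single global counter are illustrative assumptions, not the paper's implementation (the real design predicts lifetimes dynamically):

```python
# Minimal sketch of Temporal Coherence's load path at the L2.
# LEASE is a hypothetical fixed lifetime; all caches read one
# globally synchronized counter `now`.

LEASE = 20  # illustrative lifetime, in clock ticks

class L2Line:
    def __init__(self, data):
        self.data = data
        self.gt = 0  # global timestamp: some L1 may cache this line while now < gt

def l2_load(line, now):
    """Return data plus a local timestamp LT, and extend the line's GT."""
    lt = now + LEASE
    line.gt = max(line.gt, lt)  # L2 conservatively tracks the latest lease
    return line.data, lt

def l1_is_valid(lt, now):
    """An L1 copy self-invalidates once its lease expires -- no messages needed."""
    return now < lt

line = L2Line(data=42)
_, lt0 = l2_load(line, now=0)   # first load:  LT = 20, GT = 20
_, lt5 = l2_load(line, now=5)   # second load: LT = 25, GT = 25
assert (lt0, lt5, line.gt) == (20, 25, 25)
assert l1_is_valid(lt0, now=15) and not l1_is_valid(lt0, now=20)
```

Note that the L2 never records which L1s hold the line, only until when any of them may; that is what removes the directory and the invalidation traffic.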
At TIME = 5, a second L1 loads the same line and receives timestamp 25; the L2 now records the line as shared until time 25.
At TIME = 15, another L1 issues a write to the line. Because the L2's global timestamp is 25, the write stalls at the L2. By TIME = 25 every L1 copy has self-invalidated (its timestamp has expired), so the write completes without a single invalidation message.
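The write side of this strong variant amounts to waiting out the leases; a sketch under the same illustrative assumptions as before (hypothetical names, single global clock):

```python
# Sketch of a TC-Strong write at the L2: the write may not complete
# until every outstanding L1 lease on the line has expired (now >= GT).
# Waiting replaces explicit invalidation messages.

class L2Line:
    def __init__(self, data, gt=0):
        self.data = data
        self.gt = gt  # latest lease handed out for this line

def tc_strong_write(line, value, now):
    """Return (completion_time, stall_cycles) for a write arriving at `now`."""
    done = max(now, line.gt)  # stall at the L2 until all leases expire
    line.data = value
    line.gt = done            # line is globally visible from `done`
    return done, done - now

line = L2Line(data=0, gt=25)                        # line shared until time 25
done, stalled = tc_strong_write(line, value=7, now=15)
assert (done, stalled) == (25, 10)                  # stalls 10 ticks, completes at 25
```

The `stall_cycles` term is exactly the cost the talk attributes to TC-Strong: the longer the predicted lifetime, the longer a conflicting write can be held up.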
Temporal Coherence
No coherence messages
All transactions are 2-hop
Minimal protocol complexity
Supports strong and weak memory models
Enables optimized communication (ask me later...)
How to set the block lifetime?
• Longer => writes may stall
• Shorter => may not exploit temporal locality
Solution: a lifetime predictor at the L2, updated on three events:
• Load to an expired block (lengthen lifetimes to exploit temporal locality)
• Store to an unexpired block (shorten lifetimes to reduce write stalls)
• Eviction of an unexpired block (shorten lifetimes to reduce L2 eviction stalls)
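The three update rules can be realized as a simple saturating adjustment. The step size `DELTA`, the initial lifetime, and the per-predictor granularity below are made-up constants for illustration; the paper's actual predictor parameters differ:

```python
# Sketch of an L2 lifetime predictor driven by the three events above.
# DELTA and the starting lifetime are hypothetical values.

DELTA = 4

class LifetimePredictor:
    def __init__(self, lifetime=16):
        self.lifetime = lifetime

    def on_load_to_expired(self):
        # Lease was too short: we lost temporal locality in the L1.
        self.lifetime += DELTA

    def on_store_to_unexpired(self):
        # Lease was too long: the store had to stall (or be deferred).
        self.lifetime = max(0, self.lifetime - DELTA)

    def on_evict_unexpired(self):
        # Lease was too long: the L2 eviction had to stall.
        self.lifetime = max(0, self.lifetime - DELTA)

p = LifetimePredictor()
p.on_load_to_expired()     # 16 -> 20
p.on_store_to_unexpired()  # 20 -> 16
assert p.lifetime == 16
```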
Temporal Coherence (Weak)
Stalling writes makes the strong protocol sensitive to lifetime misprediction, causes resource stalls, and hurts GPU applications.
Goal: eliminate write stalls!
Temporal Coherence (Weak)
In TC-Weak, a write issued at TIME = 15 to a line shared until 25 does not stall: the L2 completes it immediately and returns the line's timestamp, 25. The writing L1 records 25 as the write's global completion time; until then, other L1s may still read their old (stale) copies. A later memory fence stalls only the issuing wavefront until TIME reaches 25, at which point every stale copy has expired and the write is globally visible.
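The write/fence split above can be sketched as: writes return immediately carrying the line's GT, the core tracks the maximum returned timestamp, and only a fence waits for the clock to pass it. Names (`gwct`, `Core`) are illustrative:

```python
# Sketch of TC-Weak: writes never stall. Each write returns the line's
# global timestamp, which the core records as the write's global
# completion time (GWCT). Only a memory fence waits, and only until
# the largest outstanding GWCT has passed.

class L2Line:
    def __init__(self, data, gt):
        self.data, self.gt = data, gt

class Core:
    def __init__(self):
        self.gwct = 0  # max completion time over this core's pending writes

    def write(self, line, value):
        line.data = value                     # completes immediately at the L2
        self.gwct = max(self.gwct, line.gt)   # remember when it becomes visible

    def fence_wait(self, now):
        """Cycles this core must wait at a fence issued at time `now`."""
        return max(0, self.gwct - now)

core = Core()
core.write(L2Line(data=0, gt=25), value=7)  # no stall, even though gt = 25
assert core.fence_wait(now=20) == 5         # fence at 20 waits until 25
assert core.fence_wait(now=30) == 0         # all writes already visible
```

Because only fences pay for long leases, the predictor can afford much longer lifetimes than under TC-Strong, which is the source of TC-Weak's traffic and performance wins later in the talk.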
Temporal Coherence (Weak)
No access stalls
Efficient for GPU applications
Allows aggressive lifetime predictors
Supports weak memory models
Coherence Applications
• Lock-based programs: Barnes Hut, Cloth Physics, Place-and-Route
• Stencil: Max-Flow Min-Cut, 3D equation solver
• Load balancing: Octree Partitioning
Interconnect Traffic — GPU applications (that do not need coherence)
[Chart: interconnect traffic normalized to the non-coherent baseline (NO.CC). MESI reaches 2.3x; GPU-VI is annotated ".8x / Wr-Through" and TC ".3x / No msgs"]
Speedup — Coherence Applications
[Chart: speedup (scale 0 to 1.75) relative to a GPU with no L1 caches (NO L1 = 1) for MESI, GPU-VI, and TC]
Need a 32KB directory.
Protocol Complexity
                     L1 Stable   L1 Transient   L2 Stable   L2 Transient
Non-Coherent             2            2             2            2
GPU-VI                   2            1             5           10
Temporal Coherence       2            1             5            3
What did we learn?
• Throughput and heterogeneous architectures require a more streamlined caching framework.
• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.
• Efficient coherence protocols let programmers deploy accelerators for a wider range of purposes.
Obtain GPGPU-Sim with coherence support: http://www.ece.ubc.ca/~isingh/gpgpusim-ruby.tar.gz
Contact: [email protected]
Interconnect Energy

[Figure 9 charts: normalized interconnect energy and power for NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, and TCW, split into Link (Dynamic), Router (Dynamic), Link (Static), and Router (Static), for inter-workgroup and intra-workgroup applications]
Figure 9. Breakdown of interconnect power and energy.

[Figure 8 charts: normalized interconnect traffic for the benchmarks HSP, KMN, LPS, NDL, RG, SR and BH, CC, CL, DLB, STN, VPR (plus AVG), broken into RCL, INV, REQ, ATO, ST, and LD components; one bar is annotated 2.27]
(a) Inter-workgroup communication (b) Intra-workgroup communication
Figure 8. Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems.

...into a shared queue. As a result, the task-fetching and task-inserting invalidation latencies lie on the critical path for a large number of threads. TCW eliminates this critical-path invalidation latency in DLB and performs up to 2x faster than the invalidation-based protocols.

Figures 8(a) and 8(b) show the breakdown of interconnect traffic between different coherence protocols. LD, ST, and ATO are the data traffic from load, store, and atomic requests. MESI performs atomic operations at the L1 cache and this traffic is included in ST. REQ refers to control traffic for all protocols. INV and RCL are invalidation and recall traffic, respectively.

MESI's write-allocate policy at the L1 significantly increases store traffic due to unnecessary refills of write-once data. On average, MESI increases interconnect traffic over the baseline non-coherent GPU by 75% across all applications. The write-through GPU-VI and GPU-VIni introduce unnecessary invalidation and recall traffic, averaging to a traffic overhead of 31% and 30% for applications without inter-workgroup communication. TCW removes all invalidations and recalls and as a result reduces interconnect traffic by 56% over MESI, 23% over GPU-VI and 23% over GPU-VIni for this set of applications.

8.2 Power

Figure 9 shows the breakdown of interconnect power and energy usage. TCW lowers the interconnect power usage by 21%, 10% and 8%, and interconnect energy usage by 36%, 13% and 8% over MESI, GPU-VI and GPU-VIni, respectively. The reductions are both in dynamic power, due to lower interconnect traffic, and static power, due to fewer virtual channel buffers in TCW.

[Figure 10 charts: harmonic mean speedup and normalized interconnect traffic across all applications for TCS, TCW-FIXED, and TCW]
Figure 10. (a) Harmonic mean speedup. (b) Normalized average interconnect traffic.

8.3 TC-Weak vs. TC-Strong

Figures 10(a) and 10(b) compare the harmonic mean performance and average interconnect traffic, respectively, across all applications for TC-Strong and TC-Weak. TCS implements TC-Strong with the FIXED-DELTA prediction scheme proposed in LCC [34, 54], which selects a single fixed lifetime that works best across all applications. TCS uses a fixed lifetime prediction of 800 core cycles, which was found to yield the best harmonic mean performance over other lifetime values. TCW-FIXED uses TC-Weak and a fixed lifetime of 3200 core cycles, which was found to be best performing over other values. TCW implements TC-Weak with the proposed predictor, as before.

TCW-FIXED has the same predictor as TCS but outperforms it by 15% while reducing traffic by 13%. TCW achieves a 28% improvement in performance over TCS and reduces interconnect traffic by 26%. TC-Strong has a trade-off between additional write stalls with higher lifetimes and additional L1 misses with lower lifetimes. TC-Weak avoids this trade-off by not stalling writes. This permits longer lifetimes and fewer L1 misses, improving performance and reducing traffic over TC-Strong.
TC-Strong vs TC-Weak
[Charts: speedup across all applications, (left) with a single fixed lifetime for all applications and (right) with the best lifetime per application, comparing TCSUO, TCS, TCSOO, TCW, and TCW w/ predictor]