TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture
Jaekyu Lee and Hyesoon Kim
TLP-Aware Cache Management Policy (HPCA-18)

Outline

- Introduction
- Background
- TAP (TLP-Aware Cache Management Policy)
  - Core sampling
  - Cache block lifetime normalization
  - TAP-UCP and TAP-RRIP
- Evaluation Methodology
- Evaluation Results
- Conclusion

CPU-GPU Heterogeneous Architecture

- Combining GPU cores with conventional CMPs is a growing trend, as in Intel's Sandy Bridge, AMD's Fusion, and the Denver project.
- Various resources are shared between CPU and GPU cores: the LLC, the on-chip interconnect, the memory controllers, and DRAM.
- The shared last-level cache is one of the most important of these resources.

Cache Sharing Problem

- Researchers have proposed many mechanisms for managing shared caches:
  - Dynamic cache partitioning: Suh+ [HPCA'02], Kim+ [PACT'04], Qureshi+ [MICRO'06]
  - Dynamic cache insertion policies: Qureshi+ [ISCA'07], Jaleel+ [PACT'08, ISCA'10], Wu+ [MICRO'11]
  - Many other mechanisms
- All of these mechanisms target CMPs.
- They may not be directly applicable to CPU-GPU heterogeneous architectures, because CPU and GPU cores have different characteristics.

CPU vs. GPU Cores (1)

- GPU cores differ from CPU cores in many ways: SIMD execution, massive multithreading, lack of speculative execution, and more.
- GPU cores run an order of magnitude more threads: CPUs support 1-4-way SMT, while a GPU core keeps tens of threads active.
- GPU cores therefore have much higher TLP (thread-level parallelism) than CPU cores.
- TLP has a significant impact on how caching affects application performance.

TLP Effect on Caching

[Figure: MPKI and CPI as functions of cache size for three application types. Compute-intensive or thrashing: both MPKI and CPI stay flat as the cache grows. Cache-friendly (low TLP): MPKI and CPI both fall as the cache grows. TLP-dominant (high TLP): MPKI falls, but CPI stays flat; this type is hardly found in CPU applications.]

TLP Effect on Caching (cont.)

- Cache-oriented metrics cannot differentiate the TLP-dominant and cache-friendly types; they are unable to recognize the effect of TLP.
- We need to monitor the performance effect of caching directly.

[Figure: The TLP-dominant and cache-friendly types show identical MPKI curves as cache size grows, but different CPI curves.]

Core Sampling

- Samples GPU cores with different cache policies.

[Figure: CPU and GPU cores with private L1 caches share the last-level cache and DRAM. Among the GPU cores, the two sampled cores (POL1 and POL2) run different policies: one bypasses the LLC entirely (no L3), while the other uses the MRU insertion policy in the LLC. The remaining GPU cores are followers.]

Core Sampling (cont.)

- Measures the performance difference between the sampled cores.

[Figure: The core sampling controller collects performance samples IPC1 and IPC2 from the two sampled cores (one bypassing the LLC, the other using MRU insertion), calculates the performance delta ∆(IPC1, IPC2), and makes a decision: if ∆ > threshold, the application is cache-friendly (caching improves performance); otherwise it is not cache-friendly (caching does not affect performance).]
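The controller's decision step can be sketched as follows. This is a minimal illustration, not the paper's hardware: the name `is_cache_friendly` and the use of a relative IPC delta with an example threshold of 5% are assumptions.

```python
# Sketch of the core-sampling decision. THRESHOLD and the relative-delta
# formulation are illustrative assumptions, not values from the paper.

THRESHOLD = 0.05  # assumed: minimum relative IPC delta to call an app cache-friendly

def is_cache_friendly(ipc_pol1: float, ipc_pol2: float,
                      threshold: float = THRESHOLD) -> bool:
    """Compare the two sampled cores (one with MRU insertion, one
    bypassing the LLC). If caching changes performance by more than the
    threshold, classify the running GPGPU application as cache-friendly."""
    delta = abs(ipc_pol1 - ipc_pol2) / max(ipc_pol1, ipc_pol2)
    return delta > threshold

# A cache-friendly app: bypassing the LLC hurts its IPC noticeably.
print(is_cache_friendly(1.0, 0.8))   # True
# A TLP-dominant app: IPC is nearly identical under both policies.
print(is_cache_friendly(1.0, 0.99))  # False
```

Because all follower cores run the same SPMD program as the sampled cores, a single pair of samples is enough to characterize the whole application.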

Core Sampling Example

[Figure: For a TLP-dominant application, the LLC-bypassing and MRU-insertion sample cores show nearly identical CPI at every cache size, so ∆ < threshold: not cache-friendly. For a cache-friendly application, the two policies produce clearly different CPI, so ∆ > threshold: cache-friendly.]

Core Sampling (cont.)

- Cores run different LLC policies so that the effect of the last-level cache can be identified.
- Main goal: finding cache-friendly GPGPU applications.
- Why core sampling is viable: under the SPMD (Single Program, Multiple Data) model, each GPU core runs the same program. GPGPU applications usually behave symmetrically across their GPU cores, so the performance variance between GPU cores is very small.

CPU vs. GPU Cores (2)

- GPU cores have higher TLP (thread-level parallelism) than CPU cores.
- GPU cores generate an order of magnitude more cache accesses.
- GPUs have a higher tolerance for cache misses thanks to TLP: they keep generating cache accesses from other threads without stalling.
- With SIMD execution, one SIMD instruction can generate multiple memory requests.

More Frequent Accesses by GPU Cores

[Figure: A timeline contrasts GPU threads with a CPU thread: on a cache miss the CPU thread stalls the processor and issues fewer cache accesses, while GPU threads keep issuing accesses from other threads without stalls. Bar charts of LLC requests per 1000 cycles across namd, sjeng, gamess, h264ref, gobmk, wrf, sphinx3, bwaves, milc, leslie3d, and the average show fewer than 100 requests for the 1-core CPU, but more than 500 (up to several thousand) for the 6-core GPU.]

More Frequent Accesses by GPU Cores (cont.)

- Why are the much more frequent accesses from GPGPU applications problematic? They cause severe interference with CPU applications (e.g., under the base LRU replacement policy).
- The performance impact of cache hits also differs between applications: is the performance penalty of a cache miss the same for a CPU application as for a GPU application?
- We have to account for the different degrees of cache accesses.
- We propose Cache Block Lifetime Normalization.

Cache Block Lifetime Normalization

- A simple monitoring mechanism: monitor the difference in cache access rates between CPU and GPGPU applications, and periodically calculate the ratio.
- The result is stored in the XSRATIO register, which gives the proposed TAP mechanisms a hint about access rate differences.

[Figure: GPU and CPU cache access counters feed a ratio calculation; XSRATIO is set depending on whether the measured ratio exceeds a threshold.]
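The periodic update can be sketched as below. Since the slide elides the assigned values, the behavior here is an assumption: XSRATIO takes the measured ratio when it exceeds the threshold and falls back to 1 otherwise; the function name and the threshold value are likewise illustrative.

```python
# Sketch of cache block lifetime normalization. The fallback to 1 below
# the threshold, the threshold value, and the names are assumptions.

def update_xsratio(gpu_accesses: int, cpu_accesses: int,
                   threshold: float = 2.0) -> float:
    """Periodically compute the GPU:CPU LLC access-rate ratio.

    When GPU cores access the cache far more often than CPU cores,
    record the ratio as a hint for the TAP mechanisms; otherwise
    apply no scaling (ratio of 1)."""
    if cpu_accesses == 0:
        return 1.0  # no CPU traffic this period; nothing to normalize against
    ratio = gpu_accesses / cpu_accesses
    return ratio if ratio > threshold else 1.0

print(update_xsratio(gpu_accesses=3000, cpu_accesses=100))  # 30.0
print(update_xsratio(gpu_accesses=150, cpu_accesses=100))   # 1.0
```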

TLP-Aware Cache Management Policy

TAP combines two components:

- Core sampling, to find cache-friendly applications.
- Cache block lifetime normalization, to account for the different degrees of cache accesses.

Applied to existing policies, these yield two mechanisms:

- TAP-UCP, built on UCP (Utility-based Cache Partitioning): covered in this talk.
- TAP-RRIP, built on RRIP (Re-Reference Interval Prediction): covered in the paper.

TAP-UCP

UCP [Qureshi and Patt, MICRO-2006] maintains, per application, an ATD (an auxiliary tag directory holding an LRU stack) and per-way hit counters, and runs a partitioning algorithm over those counters to divide the LLC ways among applications.

TAP adds two components on top of UCP:

- Cache block lifetime normalization: the hit counters are divided by the XSRATIO register value to balance cache space between the CPU and GPU applications.
- Core sampling: the UCP-Mask register is set to 1 when the GPGPU application is not cache-friendly; in that case the partitioning algorithm assigns only 1 way to the GPGPU application.

TAP-UCP Case 1: Non-Cache-Friendly

CPU hit counters (MRU to LRU): 16, 3, 8, 20, 5, 8, 3, 2. GPU hit counters: 32, 6, 16, 40, 10, 16, 6, 4.

UCP allocates ways by marginal utility: how many more hits are expected if N more ways are given to an application (e.g., 3 hits for 1 way, 3+8 for 2 ways, 3+8+20 for 3 ways). Because the GPU's hit counters are uniformly larger, UCP's final partition gives 1 way to the CPU and 7 ways to the GPU.

With TAP-UCP, core sampling reports ∆ < threshold: the GPGPU application is not cache-friendly, so caching has little effect on its performance. TAP-UCP therefore assigns only 1 way to the GPGPU application, yielding a final partition of 7 CPU ways and 1 GPU way.

[Figure: performance across partitions from 1 CPU : 7 GPU through 4 : 4 to 7 CPU : 1 GPU; for this non-cache-friendly workload, more CPU ways yield higher performance, so TAP-UCP's 7 : 1 split outperforms UCP's 1 : 7.]

TAP-UCP Case 2: Cache-Friendly

Same hit counters as before: CPU 16, 3, 8, 20, 5, 8, 3, 2; GPU 32, 6, 16, 40, 10, 16, 6, 4. This time core sampling reports ∆ > threshold: the GPGPU application is cache-friendly. Plain UCP still gives 1 CPU way and 7 GPU ways, because the GPU's raw hit counts dominate.

TAP-UCP instead divides the GPU hit counters by XSRATIO (here XSRATIO = 2, reflecting the GPU's higher access rate), which makes the two applications' counters identical. The partitioning algorithm then produces a balanced final partition of 4 CPU ways and 4 GPU ways.

[Figure: performance across partitions from 1 CPU : 7 GPU to 7 CPU : 1 GPU peaks near the balanced 4 : 4 partition chosen by TAP-UCP.]
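Both cases can be reproduced with a short sketch of UCP's greedy lookahead partitioning plus TAP's adjustments. The function names are illustrative, and the tie-breaking choice (ties go to the CPU application) is an assumption; with the slide's counters it reproduces the 1 : 7, 4 : 4, and 7 : 1 partitions above.

```python
# Sketch of UCP-style lookahead way partitioning with TAP's adjustments.
# Hit counters are the slide's example; names and tie-breaking are assumptions.

def max_marginal_utility(hits, start, max_ways):
    """Best hits-per-way obtainable by granting 1..max_ways more ways,
    reading the per-way hit counters from the current allocation point."""
    best_mu, best_k, gained = -1.0, 1, 0
    for k in range(1, max_ways + 1):
        gained += hits[start + k - 1]
        if gained / k > best_mu:
            best_mu, best_k = gained / k, k
    return best_mu, best_k

def lookahead_partition(cpu_hits, gpu_hits, total_ways=8):
    """Repeatedly grant a block of ways to whichever application has the
    higher marginal utility (ties go to the CPU, an assumed tie-break)."""
    c = g = 0
    while c + g < total_ways:
        left = total_ways - c - g
        mu_c, k_c = max_marginal_utility(cpu_hits, c, min(left, len(cpu_hits) - c))
        mu_g, k_g = max_marginal_utility(gpu_hits, g, min(left, len(gpu_hits) - g))
        if mu_c >= mu_g:
            c += k_c
        else:
            g += k_g
    return c, g

CPU = [16, 3, 8, 20, 5, 8, 3, 2]      # per-way hit counters, MRU to LRU
GPU = [32, 6, 16, 40, 10, 16, 6, 4]   # GPU accesses the cache twice as often

# Plain UCP: the GPU's raw hit counts dominate -> 1 CPU way, 7 GPU ways.
print(lookahead_partition(CPU, GPU))                         # (1, 7)

# Case 2 (cache-friendly): divide GPU counters by XSRATIO = 2 -> balanced 4 : 4.
XSRATIO = 2
print(lookahead_partition(CPU, [h / XSRATIO for h in GPU]))  # (4, 4)

# Case 1 (not cache-friendly, UCP-Mask == 1): give the GPU a single way.
print((len(CPU) - 1, 1))                                     # (7, 1)
```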

Outline

- Introduction
- Background
- TAP (TLP-Aware Cache Management Policy): core sampling, cache block lifetime normalization
- TAP-UCP
- Evaluation Methodology
- Evaluation Results
- Conclusion

Evaluation Methodology

- Simulator: MacSim (http://code.google.com/p/macsim), a trace-driven timing simulator from Georgia Tech that models x86 and PTX instructions.
- Workloads: CPU applications from SPEC 2006; GPGPU applications from the CUDA SDK, Parboil, Rodinia, and ERCBench. Four workload groups: 1-CPU (1 CPU + 1 GPU, 152 combinations), 2-CPU (2 CPUs + 1 GPU, 150), 4-CPU (4 CPUs + 1 GPU, 75), and Stream-CPU (streaming CPU + 1 GPU, 25).
- Configuration: 1-4 out-of-order 4-wide CPU cores with private L1/L2 caches; 6 GPU cores with 16-wide SIMD and private L1 caches; a 32-way 8MB shared LLC (base policy: LRU); DDR3-1333 DRAM with 41.6 GB/s bandwidth and FR-FCFS scheduling.

Results

[Figure: speedup over LRU for UCP vs. TAP-UCP and for RRIP vs. TAP-RRIP across workload categories (compute intensive, thrashing, cache-friendly, TLP dominant, thrashing (TLP), and average). On average, TAP-UCP achieves an 11% speedup over LRU and TAP-RRIP 12%.]

- UCP is effective with thrashing workloads.
- It is less effective with cache-sensitive GPGPU applications.
- RRIP is generally less effective on heterogeneous workloads.

Case Study: Sphinx3 + Stencil

- Stencil is TLP dominant.
- MPKI: the CPU's decreases significantly, the GPGPU's increases considerably, and the overall MPKI increases.
- Performance: the CPU improves hugely, the GPU is unchanged, and overall performance improves hugely.

[Figure: normalized MPKI (CPU, GPU, overall) and speedup over LRU (CPU, GPU, overall) for previous policies vs. TAP.]

Results: Multiple CPU Applications

- The TAP mechanisms show higher benefits as more CPU applications share the cache.

[Figure: speedup over LRU for UCP, TAP-UCP, RRIP, and TAP-RRIP with 1, 2, and 4 CPU applications plus 1 GPGPU application. TAP-UCP improves from 11% to 12.5% to 17.5%, and TAP-RRIP from 12% to 14% to 24%.]

Conclusion

- CPU-GPU heterogeneous architectures are a popular trend, which makes the resource sharing problem more significant.
- We propose TAP, the first cache management proposal to consider resource sharing in CPU-GPU heterogeneous architectures.
- We introduce a core sampling technique that samples GPU cores with different policies to identify cache-friendliness.
- The two TAP mechanisms improve system performance significantly: TAP-UCP by 11% over LRU and 5% over UCP, and TAP-RRIP by 12% over LRU and 9% over RRIP.