TRANSCRIPT
Jayesh Gaur*, Alaa Alameldeen**, Sreenivas Subramoney*
* Microarchitecture Research Lab (MRL), Intel India
**Memory Architecture Lab (MAL), Intel Oregon
ISCA 2016
Seoul, Korea
Motivation
• Memory continues to be the bottleneck for modern CPUs
• Larger Last Level Cache (LLC) capacity improves performance and power
• Higher hit rates → better performance
• Fewer off-chip DRAM accesses → lower power and energy
• However, this comes at a cost: area and leakage
• Cache compression is an attractive option
• Increased capacity at lower area
But compression can interfere with replacement policies. We present a new compression architecture to address this.
How Cache Compression Works?

[Figure: LLC with a tag array (2X tags) and a data array; compression logic on fills and de-compression logic on reads; sets and ways labeled.]

~8% increase in area

Prior works have tried to change the SRAM layout to allow compression. Changing a dense, timing-sensitive SRAM layout is difficult.

Compressed data is fragmented across the set. How do we associate tags with data?
Agenda
• Practical Architecture for Compressed Cache
• Interaction between Compression and Replacement Policies
• Base-Victim Proposal
• Results
• Performance
• Power
• Conclusions
Creating a Compressed LLC
• Tags per Set are doubled
• Exactly two tags are associated with each way
• Tag hit to data fetch is optimized
• Only 64B of data for every two tags
• Data corresponding to the tags is compressed
[Figure: each of Ways 0-3 holds two tags (Tag0, Tag1) sharing a single 64B data entry (Data 0, Data 1).]

Data-0-size + Data-1-size <= 64B
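The sharing constraint above can be captured in a few lines. A minimal sketch, assuming a hypothetical `CompressedWay` holding two (tag, size) entries that must jointly fit in the way's 64B data entry:

```python
# Minimal sketch (names hypothetical) of one way in the compressed LLC:
# two tags share a single 64B data entry, so a compressed line can be
# placed in a tag slot only if it fits beside its partner line.

WAY_BYTES = 64

class CompressedWay:
    def __init__(self):
        self.lines = [None, None]  # [Tag0, Tag1], each (tag, size_bytes) or None

    def fits(self, slot, size_bytes):
        """True if a compressed line of size_bytes can occupy tag slot 0 or 1."""
        partner = self.lines[1 - slot]
        partner_size = partner[1] if partner else 0
        return size_bytes + partner_size <= WAY_BYTES

way = CompressedWay()
way.lines[0] = ("A", 48)   # a 48B compressed line in Tag0
print(way.fits(1, 16))     # True: 48 + 16 <= 64
print(way.fits(1, 24))     # False: 48 + 24 > 64
```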
Performance gain from compression

[Figure: IPC ratio and DRAM read ratio over baseline for each trace, spanning compression-friendly to compression-unfriendly workloads.]

Average 12% loss! Hit rates are lower in general. Why did larger LLC capacity lower performance?
Issues with Compressed Cache
Partner line victimization
• Replacement policy is broken because of size limitations
• Performs worse than baseline, with many negative outliers

[Figure: a 4-way set with per-line NRU age bits and compressed data sizes (48, 16, 24, 40, 64, 0, 32, 24 bytes). The LRU candidate way does not have space for the incoming request, so allocating into the LRU way would victimize the partner MRU line!]

We need to increase capacity but also preserve the replacement policy!
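The conflict above can be made concrete with a toy allocator. This is an illustrative sketch (names hypothetical, not from the paper) of why honoring size limits breaks the replacement policy: NRU picks its victim way by age, but the incoming compressed line may not fit beside that way's partner line.

```python
# Illustrative sketch of the partner-victimization problem: the replacement
# policy's chosen victim way may lack space for the incoming compressed line,
# forcing eviction of the (possibly MRU) partner line as well.

WAY_BYTES = 64

def naive_allocate(ways, incoming_size):
    """ways: list of dicts with an 'nru' age bit and 'sizes' [Tag0, Tag1]."""
    # NRU victim: the first way whose age bit is 0 (not recently used).
    victim = next(i for i, w in enumerate(ways) if w["nru"] == 0)
    # Replacing the Tag0 line frees its bytes; the partner's bytes remain.
    free = WAY_BYTES - ways[victim]["sizes"][1]
    if incoming_size > free:
        # Size limitation: the partner line must also go, breaking the
        # replacement policy's intent.
        return victim, "partner line victimized"
    return victim, "clean allocation"

ways = [
    {"nru": 1, "sizes": [48, 16]},
    {"nru": 0, "sizes": [24, 40]},  # the NRU victim way
    {"nru": 1, "sizes": [64, 0]},
    {"nru": 1, "sizes": [32, 24]},
]
print(naive_allocate(ways, 32))  # (1, 'partner line victimized'): only 24B free
```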
• Extra capacity (Tag1 in each way) logically belongs to a victim cache
• Tag0 victims are cached in the Tag1 space
• The base replacement policy is strictly maintained in Tag0
• Guarantees baseline cache hit behavior and performance
• The victim cache is always clean
• Partner line victimization is easy

[Figure: each way holds a Tag0 entry belonging to the Base (B) cache and a Tag1 entry belonging to the Victim (V) cache.]
Opportunistic Victim Cache

• A read miss allocates into the Tag0 LRU way
• If its size exceeds what is available in Tag0, victimize the partner line in the Tag1 victim cache
• In the baseline that partner line was never there, so we cannot be poorer than baseline
• If a Tag0 victim is created, insert it into the Tag1 victim cache

[Figure: animation of a miss. The baseline (Tag0) cache holds <Tag, Size> pairs D,24 C,8 A,48 B,24 across Ways 0-3, with F,32 E,8 X,16 Y,32 in the Tag1 victim cache. Incoming request Z,48 allocates into the LRU way; the victim B,24 moves into the victim cache, and clean victim-cache lines Y,32 and E,8 are dropped to make room.]
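The miss flow above can be sketched as a small set model. This is a simplified, hypothetical model (names are illustrative): when no victim-cache slot fits, it simply drops the victim, since victim lines are clean, whereas the animation shows the design evicting older victim-cache lines to make room.

```python
# Simplified model of the Base-Victim miss flow: each way has a base (Tag0)
# line and a clean victim (Tag1) line; sizes are compressed bytes.

WAY_BYTES = 64

class BaseVictimSet:
    def __init__(self, n_ways=4):
        self.base = [None] * n_ways    # Tag0 lines: (tag, size) or None
        self.victim = [None] * n_ways  # Tag1 victim-cache lines (always clean)
        self.lru = 0                   # index of the Tag0 LRU way (simplified)

    def on_read_miss(self, tag, size):
        way = self.lru
        evicted = self.base[way]
        # If the incoming line does not fit beside its Tag1 partner,
        # victimize the partner: it is clean, so no writeback is needed.
        partner = self.victim[way]
        if partner and size + partner[1] > WAY_BYTES:
            self.victim[way] = None
        self.base[way] = (tag, size)
        # Insert the Tag0 victim into any Tag1 slot where it fits;
        # if none fits, this sketch drops it (victim lines are clean).
        if evicted:
            for i, b in enumerate(self.base):
                if self.victim[i] is None and (b[1] if b else 0) + evicted[1] <= WAY_BYTES:
                    self.victim[i] = evicted
                    break

s = BaseVictimSet()
s.base = [("D", 24), ("C", 8), ("A", 48), ("B", 24)]
s.victim = [("F", 32), ("E", 8), ("X", 16), ("Y", 32)]
s.lru = 3                  # B is the Tag0 LRU line
s.on_read_miss("Z", 48)    # the Z,48 miss from the slide
print(s.base[3])           # ('Z', 48); partner Y,32 was victimized
```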
Compressed LLC: Hit in Victim $

• On a hit in the Tag1 victim cache, move the data to Tag0
• Behave as if the read miss were served from memory, but the latency is just the LLC lookup, which gives performance
• The victim cache is always clean; dirty lines are written back to memory
• Data is returned to the core after decompression
• Management of the victim cache is critical and needs more analysis

[Figure: animation of a victim-cache hit. The baseline (Tag0) cache holds D,24 C,56 A,48 B,24, with F,32 E,8 X,16 Y,32 in the Tag1 victim cache. Incoming request E,8 hits in the victim cache; E is promoted into Tag0, the Tag0 victim B,24 moves into the victim cache, and clean line Y,32 is dropped to make room.]
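The hit path above can be sketched as a lookup routine. This is a hypothetical, self-contained sketch: a Tag1 hit removes the line from the victim cache and re-allocates it into Tag0, as if the read miss had been served from memory but at LLC lookup latency; the Tag0 victim's re-insertion into the victim cache is noted in a comment rather than modeled.

```python
# Sketch of the victim-cache hit path: base/victim are per-way lists of
# (tag, size) or None for the Tag0 and Tag1 entries of one set.

WAY_BYTES = 64

def lookup(base, victim, tag, lru_way):
    for line in base:
        if line and line[0] == tag:
            return "base hit", line
    for way, line in enumerate(victim):
        if line and line[0] == tag:
            victim[way] = None              # remove from Tag1
            partner = victim[lru_way]
            if partner and line[1] + partner[1] > WAY_BYTES:
                victim[lru_way] = None      # partner victimized (clean: dropped)
            base[lru_way] = line            # promote into Tag0
            # (the old Tag0 line would be inserted into the victim cache here)
            return "victim hit", line       # decompressed and sent to the core
    return "miss", None

base = [("D", 24), ("C", 56), ("A", 48), ("B", 24)]
victim = [("F", 32), ("E", 8), ("X", 16), ("Y", 32)]
print(lookup(base, victim, "E", lru_way=3))  # ('victim hit', ('E', 8))
```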
Configuration
• x86 core running at 4 GHz
• 2MB, 16-way inclusive LLC per core
• Not Recently Used (NRU) replacement
• DDR3-1600 15-15-15-34
• Base-Delta-Immediate (BDI) compression
• Decompression latency of 2 cycles
Category      Traces
SPECFP 06     30
SPECINT 06    29
Productivity  14
Client        27
Overall       100

On average, each cache line compresses to 55% of its size, so doubling the tags should capture most of the gains.
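To give a feel for why lines compress to roughly half their size, here is a rough sketch of the base+delta idea behind BDI: one base value plus narrow deltas. The real BDI algorithm also tries multiple base sizes and an immediate (zero) base, which this sketch omits.

```python
# Rough sketch of base+delta compression: view a 64B line as eight 8-byte
# values, keep the first as the base, and store narrow deltas if they fit.
import struct

def bdi_size(line: bytes) -> int:
    """Compressed size in bytes of a 64B line, or 64 if incompressible."""
    assert len(line) == 64
    words = struct.unpack("<8Q", line)   # eight little-endian 8-byte values
    base = words[0]
    for delta_bytes in (1, 2, 4):        # try narrow signed delta widths
        limit = 1 << (8 * delta_bytes - 1)
        if all(-limit <= w - base < limit for w in words):
            return 8 + 8 * delta_bytes   # one base + eight deltas
    return 64

# Example: pointers 16 bytes apart compress to 8B base + eight 1B deltas.
line = struct.pack("<8Q", *(0x1000 + 16 * i for i in range(8)))
print(bdi_size(line))  # 16
```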
Results: IPC Gain

[Figure: IPC ratio over baseline for a 3MB uncompressed LLC vs. opportunistic compression, across SPECFP, SPECINT, Productivity, Client, and Average, for compression-friendly workloads and overall.]

An 8% area addition gives performance equal to a 50% area increase. Good gains across various categories of workloads.
Results: Correlation with Hit Rate Improvement

[Figure: IPC ratio and DRAM read ratio over baseline for each trace, from compression friendly to compression unfriendly.]

Hit rate >= baseline hit rate, so there are no negative outliers. Memory traffic reduces by 16% on average.
Effect of Baseline Replacement Policy

[Figure: IPC ratio over the NRU baseline for SRRIP, SRRIP + compression, CHAR, and CHAR + compression, across SPECFP, SPECINT, Productivity, Client, and Average, for compression-friendly workloads and overall.]

Good gains with various state-of-the-art replacement policies. Compression increases capacity while retaining the benefits of good replacement!
Energy Savings

[Figure: DRAM read ratio and energy ratio (with word enables) over baseline for each trace.]

Power saved in DRAM compensates for the increased power in the LLC. Overall 6.5% energy savings.
Conclusions

• Cache compression increases capacity with low area impact
• But compression interferes with replacement policies
• We propose Base-Victim compression
• An opportunistic victim cache is created by compression
• It preserves the gains from replacement policies
• No costly SRAM layout changes; all changes are in the cache controller
• ~50% increase in capacity with an 8% area addition