TRANSCRIPT
Jayesh Gaur*, Alaa Alameldeen**, Sreenivas Subramoney*
* Microarchitecture Research Lab (MRL), Intel India
**Memory Architecture Lab (MAL), Intel Oregon
ISCA 2016
Seoul, Korea
Motivation
• Memory continues to be the bottleneck for modern CPUs
• Larger Last Level Cache (LLC) capacity improves performance and power
• Higher hit rates → better performance
• Fewer off-chip DRAM accesses → lower power and energy
• However, this comes at a cost: area and leakage
• Cache compression is an attractive option
• Increased capacity at lower area
But compression can interfere with replacement policies. We present a new compression architecture to address this.
How Cache Compression Works?

[Figure: LLC with a tag array (2X tags) and a data array; compression logic on fills and de-compression logic on reads; sets and ways labeled.]

~8% increase in area

Prior works have tried to change the SRAM layout to allow compression. Changing a dense, timing-sensitive SRAM layout is difficult.

Compressed data is fragmented across the set. How do we associate tags with data?
Agenda
• Practical Architecture for Compressed Cache
• Interaction between Compression and Replacement Policies
• Base-Victim Proposal
• Results
• Performance
• Power
• Conclusions
Creating a Compressed LLC
• Tags per Set are doubled
• Exactly two tags are associated with each way
• Tag hit to data fetch is optimized
• Only 64B of data for every two tags
• Data corresponding to the tags is compressed
[Figure: each of Ways 0-3 holds two tags (Tag0, Tag1) sharing a single 64B data entry (Data 0, Data 1).]

Data-0-size + Data-1-size <= 64B
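The sharing constraint above can be captured in a few lines. A minimal sketch, assuming a hypothetical `CompressedWay` holding two (tag, size) entries that must jointly fit in the way's 64B data entry:

```python
# Minimal sketch (names hypothetical) of one way in the compressed LLC:
# two tags share a single 64B data entry, so a compressed line can be
# placed in a tag slot only if it fits beside its partner line.

WAY_BYTES = 64

class CompressedWay:
    def __init__(self):
        self.lines = [None, None]  # [Tag0, Tag1], each (tag, size_bytes) or None

    def fits(self, slot, size_bytes):
        """True if a compressed line of size_bytes can occupy tag slot 0 or 1."""
        partner = self.lines[1 - slot]
        partner_size = partner[1] if partner else 0
        return size_bytes + partner_size <= WAY_BYTES

way = CompressedWay()
way.lines[0] = ("A", 48)   # a 48B compressed line in Tag0
print(way.fits(1, 16))     # True: 48 + 16 <= 64
print(way.fits(1, 24))     # False: 48 + 24 > 64
```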
Performance gain from compression

[Figure: IPC ratio and DRAM read ratio over baseline for each trace, spanning compression-friendly to compression-unfriendly workloads.]

Average 12% loss! Hit rates are lower in general. Why did larger LLC capacity lower performance?
Issues with Compressed Cache
Partner line victimization
• Replacement policy is broken because of size limitations
• Performs worse than baseline, with many negative outliers

[Figure: a 4-way set with per-line NRU age bits and compressed data sizes (48, 16, 24, 40, 64, 0, 32, 24 bytes). The LRU candidate way does not have space for the incoming request, so allocating into the LRU way would victimize the partner MRU line!]

We need to increase capacity but also preserve the replacement policy!
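The conflict above can be made concrete with a toy allocator. This is an illustrative sketch (names hypothetical, not from the paper) of why honoring size limits breaks the replacement policy: NRU picks its victim way by age, but the incoming compressed line may not fit beside that way's partner line.

```python
# Illustrative sketch of the partner-victimization problem: the replacement
# policy's chosen victim way may lack space for the incoming compressed line,
# forcing eviction of the (possibly MRU) partner line as well.

WAY_BYTES = 64

def naive_allocate(ways, incoming_size):
    """ways: list of dicts with an 'nru' age bit and 'sizes' [Tag0, Tag1]."""
    # NRU victim: the first way whose age bit is 0 (not recently used).
    victim = next(i for i, w in enumerate(ways) if w["nru"] == 0)
    # Replacing the Tag0 line frees its bytes; the partner's bytes remain.
    free = WAY_BYTES - ways[victim]["sizes"][1]
    if incoming_size > free:
        # Size limitation: the partner line must also go, breaking the
        # replacement policy's intent.
        return victim, "partner line victimized"
    return victim, "clean allocation"

ways = [
    {"nru": 1, "sizes": [48, 16]},
    {"nru": 0, "sizes": [24, 40]},  # the NRU victim way
    {"nru": 1, "sizes": [64, 0]},
    {"nru": 1, "sizes": [32, 24]},
]
print(naive_allocate(ways, 32))  # (1, 'partner line victimized'): only 24B free
```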
• Extra capacity (Tag1 in each way) logically belongs to a victim cache
• Tag0 victims are cached in the Tag1 space
• The base replacement policy is strictly maintained in Tag0
• Guarantees baseline cache hit behavior and performance
• The victim cache is always clean
• Partner line victimization is easy

[Figure: each way holds a Tag0 entry belonging to the Base (B) cache and a Tag1 entry belonging to the Victim (V) cache.]
Opportunistic Victim Cache

• A read miss allocates into the Tag0 LRU way
• If its size exceeds what is available in Tag0, victimize the partner line in the Tag1 victim cache
• In the baseline that partner line was never there, so we cannot be poorer than baseline
• If a Tag0 victim is created, insert it into the Tag1 victim cache

[Figure: animation of a miss. The baseline (Tag0) cache holds <Tag, Size> pairs D,24 C,8 A,48 B,24 across Ways 0-3, with F,32 E,8 X,16 Y,32 in the Tag1 victim cache. Incoming request Z,48 allocates into the LRU way; the victim B,24 moves into the victim cache, and clean victim-cache lines Y,32 and E,8 are dropped to make room.]
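The miss flow above can be sketched as a small set model. This is a simplified, hypothetical model (names are illustrative): when no victim-cache slot fits, it simply drops the victim, since victim lines are clean, whereas the animation shows the design evicting older victim-cache lines to make room.

```python
# Simplified model of the Base-Victim miss flow: each way has a base (Tag0)
# line and a clean victim (Tag1) line; sizes are compressed bytes.

WAY_BYTES = 64

class BaseVictimSet:
    def __init__(self, n_ways=4):
        self.base = [None] * n_ways    # Tag0 lines: (tag, size) or None
        self.victim = [None] * n_ways  # Tag1 victim-cache lines (always clean)
        self.lru = 0                   # index of the Tag0 LRU way (simplified)

    def on_read_miss(self, tag, size):
        way = self.lru
        evicted = self.base[way]
        # If the incoming line does not fit beside its Tag1 partner,
        # victimize the partner: it is clean, so no writeback is needed.
        partner = self.victim[way]
        if partner and size + partner[1] > WAY_BYTES:
            self.victim[way] = None
        self.base[way] = (tag, size)
        # Insert the Tag0 victim into any Tag1 slot where it fits;
        # if none fits, this sketch drops it (victim lines are clean).
        if evicted:
            for i, b in enumerate(self.base):
                if self.victim[i] is None and (b[1] if b else 0) + evicted[1] <= WAY_BYTES:
                    self.victim[i] = evicted
                    break

s = BaseVictimSet()
s.base = [("D", 24), ("C", 8), ("A", 48), ("B", 24)]
s.victim = [("F", 32), ("E", 8), ("X", 16), ("Y", 32)]
s.lru = 3                  # B is the Tag0 LRU line
s.on_read_miss("Z", 48)    # the Z,48 miss from the slide
print(s.base[3])           # ('Z', 48); partner Y,32 was victimized
```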
Compressed LLC: Hit in Victim $

• On a hit in the Tag1 victim cache, move the data to Tag0
• Behave as if the read miss were served from memory, but the latency is just the LLC lookup, which gives performance
• The victim cache is always clean; dirty lines are written back to memory
• Data is returned to the core after decompression
• Management of the victim cache is critical and needs more analysis

[Figure: animation of a victim-cache hit. The baseline (Tag0) cache holds D,24 C,56 A,48 B,24, with F,32 E,8 X,16 Y,32 in the Tag1 victim cache. Incoming request E,8 hits in the victim cache; E is promoted into Tag0, the Tag0 victim B,24 moves into the victim cache, and clean line Y,32 is dropped to make room.]
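The hit path above can be sketched as a lookup routine. This is a hypothetical, self-contained sketch: a Tag1 hit removes the line from the victim cache and re-allocates it into Tag0, as if the read miss had been served from memory but at LLC lookup latency; the Tag0 victim's re-insertion into the victim cache is noted in a comment rather than modeled.

```python
# Sketch of the victim-cache hit path: base/victim are per-way lists of
# (tag, size) or None for the Tag0 and Tag1 entries of one set.

WAY_BYTES = 64

def lookup(base, victim, tag, lru_way):
    for line in base:
        if line and line[0] == tag:
            return "base hit", line
    for way, line in enumerate(victim):
        if line and line[0] == tag:
            victim[way] = None              # remove from Tag1
            partner = victim[lru_way]
            if partner and line[1] + partner[1] > WAY_BYTES:
                victim[lru_way] = None      # partner victimized (clean: dropped)
            base[lru_way] = line            # promote into Tag0
            # (the old Tag0 line would be inserted into the victim cache here)
            return "victim hit", line       # decompressed and sent to the core
    return "miss", None

base = [("D", 24), ("C", 56), ("A", 48), ("B", 24)]
victim = [("F", 32), ("E", 8), ("X", 16), ("Y", 32)]
print(lookup(base, victim, "E", lru_way=3))  # ('victim hit', ('E', 8))
```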
Configuration
• x86 core running at 4 GHz
• 2MB, 16-way inclusive LLC per core
• Not Recently Used (NRU) replacement
• DDR3-1600 15-15-15-34
• Base-Delta-Immediate (BDI) compression
• Decompression latency of 2 cycles
Category      Traces
SPECFP 06     30
SPECINT 06    29
Productivity  14
Client        27
Overall       100

On average, each cache line compresses to 55% of its size, so doubling the tags should capture most of the gains.
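To give a feel for why lines compress to roughly half their size, here is a rough sketch of the base+delta idea behind BDI: one base value plus narrow deltas. The real BDI algorithm also tries multiple base sizes and an immediate (zero) base, which this sketch omits.

```python
# Rough sketch of base+delta compression: view a 64B line as eight 8-byte
# values, keep the first as the base, and store narrow deltas if they fit.
import struct

def bdi_size(line: bytes) -> int:
    """Compressed size in bytes of a 64B line, or 64 if incompressible."""
    assert len(line) == 64
    words = struct.unpack("<8Q", line)   # eight little-endian 8-byte values
    base = words[0]
    for delta_bytes in (1, 2, 4):        # try narrow signed delta widths
        limit = 1 << (8 * delta_bytes - 1)
        if all(-limit <= w - base < limit for w in words):
            return 8 + 8 * delta_bytes   # one base + eight deltas
    return 64

# Example: pointers 16 bytes apart compress to 8B base + eight 1B deltas.
line = struct.pack("<8Q", *(0x1000 + 16 * i for i in range(8)))
print(bdi_size(line))  # 16
```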
Results: IPC Gain

[Figure: IPC ratio over baseline for a 3MB uncompressed LLC vs. opportunistic compression, across SPECFP, SPECINT, Productivity, Client, and Average, for compression-friendly workloads and overall.]

An 8% area addition gives performance equal to a 50% area increase. Good gains across various categories of workloads.
Results: Correlation with Hit Rate Improvement

[Figure: IPC ratio and DRAM read ratio over baseline for each trace, from compression friendly to compression unfriendly.]

Hit rate >= baseline hit rate, so there are no negative outliers. Memory traffic reduces by 16% on average.
Effect of Baseline Replacement Policy

[Figure: IPC ratio over the NRU baseline for SRRIP, SRRIP + compression, CHAR, and CHAR + compression, across SPECFP, SPECINT, Productivity, Client, and Average, for compression-friendly workloads and overall.]

Good gains with various state-of-the-art replacement policies. Compression increases capacity while retaining the benefits of good replacement!
Energy Savings

[Figure: DRAM read ratio and energy ratio (with word enables) over baseline for each trace.]

Power saved in DRAM compensates for the increased power in the LLC. Overall 6.5% energy savings.
Conclusions

• Cache compression increases capacity with low area impact
• But compression interferes with replacement policies
• We propose Base-Victim compression
• An opportunistic victim cache is created by compression
• It preserves the gains from replacement policies
• No costly SRAM layout changes; all changes are in the cache controller
• ~50% increase in capacity with an 8% area addition