Bypass and Insertion Algorithms for Exclusive Last-level Caches Jayesh Gaur 1 , Mainak Chaudhuri 2 , Sreenivas Subramoney 1 1 Intel Architecture Group, Intel Corporation, Bangalore, India 2 Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India Presented by Samira Khan Intel Labs, Intel Corporation and University of Texas at San Antonio International Symposium on Computer Architecture (ISCA), June 6 th , 2011


Page 1: Bypass and Insertion Algorithms for Exclusive Last-level Caches Jayesh Gaur 1, Mainak Chaudhuri 2, Sreenivas Subramoney 1 1 Intel Architecture Group, Intel

Bypass and Insertion Algorithms for Exclusive Last-level Caches

Jayesh Gaur 1, Mainak Chaudhuri 2, Sreenivas Subramoney 1

1 Intel Architecture Group, Intel Corporation, Bangalore, India

2 Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India

Presented by Samira Khan, Intel Labs, Intel Corporation and University of Texas at San Antonio

International Symposium on Computer Architecture (ISCA), June 6th, 2011

Page 2

Inclusive Vs Exclusive

• Inclusive Cache Hierarchy
– Last level cache (LLC) is the superset of all caches
– A block in L1 is also present in L2 and LLC

• Exclusive Cache Hierarchy
– A cache block is present only in one level
– A block in L1 is never present in L2 and LLC

[Figure: L1, L2, and LLC in an Inclusive Hierarchy and in an Exclusive Hierarchy]

Page 3

Inclusive Vs Exclusive

• Inclusive last-level caches (LLC) are a popular choice
– Inclusion wastes cache capacity

Exclusive caches have higher capacity and better performance

Some of the materials are taken from the original presentation

Page 4

This talk is about replacement and bypass policies for exclusive caches

Exclusive Last Level Cache

• Exclusive LLC (L3) serves as a victim cache for the L2 cache
– Data is filled into the L2
– On L2 eviction, data is filled into LLC
– On LLC hit, cache line is invalidated from LLC and moved to L2

[Figure: Core + L1 (32 KB), L2 (512 KB), LLC (2 MB), and DRAM. Labeled paths: Load, L2 Miss, LLC Miss, Fill, Evict, and LLC Hit (invalidate from LLC)]
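The fill, evict, and invalidate rules above can be sketched as a toy model. This is a minimal illustration, not the simulator used in the talk: the class name `ExclusiveHierarchy`, the fully associative organization, and the oldest-first victim choice in both levels are assumptions made only to keep the example short.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy L2 + exclusive LLC (hypothetical model, fully associative)."""

    def __init__(self, l2_lines, llc_lines):
        self.l2 = OrderedDict()   # address -> None, oldest entry first
        self.llc = OrderedDict()  # address -> None, oldest entry first
        self.l2_lines = l2_lines
        self.llc_lines = llc_lines

    def load(self, addr):
        """Return where the load was served from: 'L2', 'LLC', or 'DRAM'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.llc:
            # LLC hit: invalidate the line from the LLC and move it to L2.
            del self.llc[addr]
            source = "LLC"
        else:
            # LLC miss: data from DRAM is filled into the L2, not the LLC.
            source = "DRAM"
        self._fill_l2(addr)
        return source

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            self._fill_llc(victim)  # an L2 eviction fills the LLC
        self.l2[addr] = None

    def _fill_llc(self, addr):
        if len(self.llc) >= self.llc_lines:
            # Assumed placeholder policy: evict the oldest fill.
            self.llc.popitem(last=False)
        self.llc[addr] = None
```

After any sequence of loads, no address appears in both `l2` and `llc`, which is the exclusion property the slide describes.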

Page 5

Replacement Policy in Exclusive LLC

• Popular replacement policy: LRU
• Replaces the least recently used block
• Needs recency information to choose the victim

[Figure: lifetime of a line (fill, hit, hit, hit, last hit, eviction) and a cache set ordered from MRU to LRU, with the LRU block as the victim]

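Exclusive caches have no recency information: a line that hits in the LLC is invalidated and moved to L2, so the LLC never sees repeated hits to a resident line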

Page 6

Replacement Policy in Exclusive LLC

• How to choose victim in exclusive LLC?

• Can we bypass lines in LLC?

• Choose replacement victim with the help of some information from higher level caches

Do not place lines in the exclusive LLC that are never re-referenced before eviction

Page 7

Outline

• Motivation
• Problem Description
• Characterizing Dead and Live lines
• Basic Algorithm
• Results
• Conclusion

Page 8

TC captures the reuse distance between two clustered uses of a cache line

Characterizing Dead and Live Lines

• Dead allocation to LLC
– Cache line filled into LLC, but evicted before being recalled by L2

• Live allocation to LLC
– Cache line filled into LLC and sees a hit in LLC

• Trip Count (TC)
– Number of trips a cache line makes between LLC and L2 cache before eviction

[Figure: line movement among DRAM, L2, and LLC, labeled TC = 0, TC = 1, and Eviction From LLC]
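The definitions above can be made concrete with a short sketch that walks one cache line's lifetime. The event names (`"dram_fill"`, `"l2_evict"`, `"llc_hit"`, `"llc_evict"`) and the function `summarize_line` are hypothetical, chosen only for this illustration.

```python
def summarize_line(events):
    """Return (trip_count, outcomes) for one cache line's event trace.

    Assumed event names:
      "dram_fill" - line is filled from DRAM into L2
      "l2_evict"  - line is evicted from L2 and allocated in the LLC
      "llc_hit"   - line hits in the LLC and is recalled by L2 (one trip)
      "llc_evict" - line is evicted from the LLC
    """
    trip_count = 0
    outcomes = []   # one "live" or "dead" entry per LLC allocation
    in_llc = False
    for event in events:
        if event == "l2_evict":
            in_llc = True
        elif event == "llc_hit" and in_llc:
            trip_count += 1          # the line made a trip from LLC back to L2
            outcomes.append("live")  # this allocation saw a hit in the LLC
            in_llc = False
        elif event == "llc_evict" and in_llc:
            outcomes.append("dead")  # evicted before being recalled by L2
            in_llc = False
    return trip_count, outcomes
```

A line that is filled, evicted to the LLC, and then evicted from the LLC without a recall ends with a trip count of 0 and a single dead allocation.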

Page 9

Only 1 bit of TC is required for most applications: either TC = 0 or TC >= 1

Can we use the liveness information from TC to design insertion/bypass policies?

Oracle Analysis: Trip Count

Page 10

Refer to the paper, which shows that the <TC,UC> pair can best approximate Belady victim selection

Use Count in L2

• Use count (UC) is the number of times a cache line is hit in L2 cache due to demand requests
– For cache lines brought by demand requests, UC >= 1
• We need only 2 bits for learning UC

[Figure: line movement among DRAM, L2, and LLC. X hits in L2 give TC = 0, UC = X; Y hits in L2 after a trip through the LLC give TC = 1, UC = Y; the lifetime ends with Eviction From LLC]
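A per-line counter pair along these lines could look as follows. This is a sketch under stated assumptions: the class `L2LineCounters`, the `demand_fill` flag, and the choice to count the demand fill itself as the first use (so that demand-fetched lines have UC >= 1) are illustrative, not taken from the paper.

```python
UC_BITS = 2
UC_MAX = (1 << UC_BITS) - 1  # largest value a 2-bit counter can hold


class L2LineCounters:
    """Per-line <TC,UC> state kept while the line lives in L2 (hypothetical)."""

    def __init__(self, llc_hit, demand_fill=True):
        # 1-bit trip count: 1 if the line was recalled from the LLC, else 0.
        self.tc = 1 if llc_hit else 0
        # Assumption: a demand fill counts as the first use, so UC starts at 1.
        self.uc = 1 if demand_fill else 0

    def on_demand_hit(self):
        """Count a demand hit in L2, saturating at the 2-bit maximum."""
        self.uc = min(self.uc + 1, UC_MAX)

    def on_evict(self):
        """Return the <TC,UC> pair that accompanies the L2 eviction."""
        return (self.tc, self.uc)
```

Saturating the counter means every line with three or more uses reports the same UC value, which is what keeps the state at 2 bits.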

Page 11

More details in paper

TCxUC-based Algorithms

• Send <TC,UC> information for every L2 eviction
• Bin all L2 evictions into 8 <TC,UC> bins
• Learn the dead and live distributions in these bins
• Identify bins that have more dead blocks than live
• Bypass blocks that belong to a bin that has more dead blocks
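The binning and bypass decision listed above can be sketched as a small predictor. Several details are assumptions for illustration only: the class `BypassPredictor`, the bin encoding (1-bit TC as the high bit, 2-bit UC as the low bits), unbounded counters, and the way outcomes reach `record`. In particular, a bypassed line never produces a dead or live outcome, so a real design needs some way to keep learning; the slides defer those details to the paper.

```python
class BypassPredictor:
    """Dead/live counters per <TC,UC> bin (hypothetical sketch)."""

    NUM_BINS = 8  # 2 TC values x 4 UC values

    def __init__(self):
        self.dead = [0] * self.NUM_BINS
        self.live = [0] * self.NUM_BINS

    @staticmethod
    def bin_index(tc, uc):
        """Map a <TC,UC> pair to a bin; the encoding is an assumption."""
        if tc not in (0, 1) or not 0 <= uc <= 3:
            raise ValueError("expected 1-bit TC and 2-bit UC")
        return (tc << 2) | uc

    def record(self, tc, uc, was_live):
        """Learn from one LLC allocation whose outcome is now known."""
        b = self.bin_index(tc, uc)
        if was_live:
            self.live[b] += 1
        else:
            self.dead[b] += 1

    def should_bypass(self, tc, uc):
        """Bypass if the block's bin has seen more dead blocks than live."""
        b = self.bin_index(tc, uc)
        return self.dead[b] > self.live[b]
```

A bin with equal dead and live counts is not bypassed in this sketch, which matches the "more dead blocks than live" wording.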

Page 12

Experimental Methodology

– SPEC 2006 and SERVER categories
• 97 single-threaded (ST) traces
• 35 4-way multi-programmed (MP) workloads
• Cycle-accurate execution-driven simulation based on x86 ISA and Core i7 model
– Three-level cache hierarchy
– 32 KB L1 caches
– 2 MB LLC for ST and 8 MB LLC for MP (16-way)
– 512 KB 8-way L2 cache per core

Page 13

Overall, Bypass + TC_UC_AGE is the best policy

Policy Evaluation for ST Workloads

Page 14

Throughput = $\sum_i \mathrm{IPC}_i^{\mathrm{policy}} / \sum_i \mathrm{IPC}_i^{\mathrm{base}}$

Fairness = $\min_i \left( \mathrm{IPC}_i^{\mathrm{policy}} / \mathrm{IPC}_i^{\mathrm{base}} \right)$

Geomean throughput gain for our best proposal is 2.5%

Multi-programmed (MP) Workloads
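The two metrics above translate directly into code. The function names and the per-thread IPC lists are illustrative; the sample numbers in the usage note are made up and are not results from the talk.

```python
def throughput(ipc_policy, ipc_base):
    """Sum of per-thread IPC under the policy over the sum under the baseline."""
    if len(ipc_policy) != len(ipc_base):
        raise ValueError("need one IPC value per thread in both lists")
    return sum(ipc_policy) / sum(ipc_base)


def fairness(ipc_policy, ipc_base):
    """Smallest per-thread IPC ratio between the policy and the baseline."""
    if len(ipc_policy) != len(ipc_base):
        raise ValueError("need one IPC value per thread in both lists")
    return min(p / b for p, b in zip(ipc_policy, ipc_base))
```

For a hypothetical 4-way mix with policy IPCs `[1.5, 1.0, 1.0, 0.5]` and baseline IPCs `[1.0, 1.0, 1.0, 1.0]`, throughput is unchanged at 1.0 while fairness drops to 0.5, which is why the two metrics are reported together.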

Page 15

Conclusion

• For capacity and performance, exclusive LLC is more meaningful
• LRU and related inclusive cache replacement schemes do not work for exclusive LLC
• We presented several insertion/bypass schemes for exclusive caches
– Based on trip count and use count
– For ST workloads, we gain 4.3% higher average IPC
– For MP workloads, we gain 2.5% average throughput

Why is this paper important?

Page 16

Thank you

Questions?

Page 17

BACKUP

Page 18

TC enables us to mimic the inclusive replacement policies on exclusive caches

However, TC is insufficient to enable bypass. All cache lines start at TC = 0

TC-based Insertion Age

• TC-AGE policy (analogous to SRRIP, ISCA 2010)

[Figure: flowchart. L2 $ fill (1 bit per $ line): on an LLC hit, TC = 1; otherwise TC = 0. LLC fill (2 bits per $ line): if TC = 1, Age = 3; otherwise Age = 1. LLC eviction: choose least age as victim; maintain relative age order]

• DIP + TC-AGE policy (analogous to DRRIP, ISCA 2010)
– If TC = 1, fill LLC with age = 3
– If TC = 0, duel between age = 0 and age = 1

This slide is kindly provided by the authors
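The TC-AGE insertion and victim rules on this slide can be sketched for a single LLC set. The sketch is a reading of the slide and is not the authors' code: the class `LlcSet` is hypothetical, and the aging step (subtracting the victim's age from the surviving lines) is an assumption, by analogy with SRRIP, about what "maintain relative age order" means. The set-dueling part of DIP + TC-AGE is not modeled.

```python
AGE_TC0 = 1  # insertion age for lines arriving with TC = 0
AGE_TC1 = 3  # insertion age for lines arriving with TC = 1


def insertion_age(tc):
    """TC-AGE insertion: lines that already made a trip get the highest age."""
    return AGE_TC1 if tc == 1 else AGE_TC0


class LlcSet:
    """One set of an exclusive LLC with 2-bit ages (hypothetical sketch)."""

    def __init__(self, ways):
        self.ways = ways
        self.age = {}  # address -> age in the range 0..3

    def fill(self, addr, tc):
        """Insert addr on an L2 eviction; return the evicted address or None."""
        victim = None
        if len(self.age) >= self.ways:
            # Choose the least age as victim (earliest fill wins a tie).
            victim = min(self.age, key=self.age.get)
            floor = self.age.pop(victim)
            # Assumed aging step: lower every survivor by the victim's age,
            # which keeps their relative order intact.
            for other in self.age:
                self.age[other] -= floor
        self.age[addr] = insertion_age(tc)
        return victim

    def hit(self, addr):
        """LLC hit: the line is invalidated here and moves back to L2."""
        del self.age[addr]
```

With this aging step, a line inserted at age 3 is not protected forever: each eviction of a lower-age neighbor moves it closer to becoming the victim.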