Highly-Associative Caches for Low-Power Processors

Michael Zhang

Krste Asanovic

{rzhang|krste}@lcs.mit.edu

Motivation

Caches consume 30-60% of processor energy in embedded systems. Example: 43% for the StrongARM-1

Many academic studies on caches
[Albera, Bahar, ’98] – Power and performance trade-offs
[Amrutur, Horowitz, ’98, ’00] – Speed and power scaling
[Bellas, Hajj, Polychronopoulos, ’99] – Dynamic cache management
[Ghose, Kamble, ’99] – Power reduction through sub-banking, etc.
[Inoue, Ishihara, Murakami, ’99] – Way-predicting set-associative cache
[Kin, Gupta, Mangione-Smith, ’97] – Filter cache
[Ko, Balsara, Nanda, ’98] – Multilevel caches for RISC and CISC
[Wilton, Jouppi, ’94] – CACTI cache model

Many industrial low-power processors use CAM (content-addressable memory)
ARM3 – 64-way set-associative – [Furber et al. ’89]
StrongARM – 32-way set-associative – [Santhanam et al. ’98]
Intel XScale – 32-way set-associative – ’01

CAM: Fast and Energy-Efficient

Talk Outline

Structural Comparison

Area and Delay Comparison

Energy Comparison

Related work

Conclusion

Set-Associative RAM-tag Cache

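Not energy-efficient: all ways are read out

Two-phase approach (check the tags first, then read data only from the matching way): more energy-efficient, but 2X latency

[Figure: set-associative RAM-tag cache, two ways shown. The address is split into Tag, Index, and Offset fields; each way stores Tag, Status, and Data, and a comparator (=?) per way checks the stored tag.]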
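
The trade-off on this slide can be sketched as a toy model that counts array reads rather than energy. Everything here is illustrative and not from the talk: the class name `RamTagCache`, the 32-byte line size, and the assumption that set count and line size are powers of two.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    tag: int = 0


class RamTagCache:
    """Toy set-associative RAM-tag cache (hypothetical model, counts reads only)."""

    def __init__(self, num_sets, num_ways, line_bytes):
        # Assumes num_sets and line_bytes are powers of two.
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.offset_bits = line_bytes.bit_length() - 1
        self.index_bits = num_sets.bit_length() - 1
        self.sets = [[Line() for _ in range(num_ways)] for _ in range(num_sets)]

    def split(self, addr):
        """Split an address into (tag, index, offset)."""
        offset = addr & ((1 << self.offset_bits) - 1)
        index = (addr >> self.offset_bits) & (self.num_sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def fill(self, addr, way):
        """Install the line containing addr into the given way of its set."""
        tag, index, _ = self.split(addr)
        self.sets[index][way] = Line(True, tag)

    def lookup(self, addr, two_phase=False):
        """Return (hit, tag_reads, data_reads, phases) for one access."""
        tag, index, _ = self.split(addr)
        ways = self.sets[index]
        hit = any(line.valid and line.tag == tag for line in ways)
        if two_phase:
            # Phase 1 reads every tag; phase 2 reads data from the hit way only.
            return hit, self.num_ways, (1 if hit else 0), 2
        # Parallel access reads every tag and every data way at once.
        return hit, self.num_ways, self.num_ways, 1


cache = RamTagCache(num_sets=64, num_ways=4, line_bytes=32)  # 8KB in total
cache.fill(0x1234, way=2)
print(cache.lookup(0x1234))
print(cache.lookup(0x1234, two_phase=True))
```

In this model the parallel lookup reads four data ways in one phase, while the two-phase lookup reads one data way but needs two phases, which is the energy-versus-latency trade-off the slide describes.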

Set-Associative RAM-tag Sub-bank

Sub-banking: 1 sub-bank = 1 way

Low-swing bitlines: only for reads; writes are performed full-swing

Wordline gating

[Figure: RAM-tag cache sub-bank. Labels: I/O bus, address (addr), address decoder, global wordline (gwl), local wordlines (lwl), offset decoders, data SRAM cells, sense amps, tag SRAM cells, tag comparator; dimensions marked 32 and 128.]

CAM-tag Cache

Only one sub-bank activated

Associativity within sub-bank

Easy to implement high associativity

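[Figure: CAM-tag cache. The address is split into Tag, Bank, and Offset fields; each sub-bank holds Tag, Status, and Data, and produces a HIT? signal and the selected Word.]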
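
The CAM-tag organization above can be sketched as a toy model in which the bank bits select one sub-bank and the tag is then searched associatively within it. The class name `CamTagCache`, the 32-byte line size, and the 8-bank by 32-way layout are assumptions chosen to give an 8KB cache built from 1KB sub-banks; they are not taken from the slide.

```python
class CamTagCache:
    """Toy CAM-tag cache (hypothetical model): one sub-bank is activated per access."""

    def __init__(self, num_banks, ways_per_bank, line_bytes):
        # Assumes num_banks and line_bytes are powers of two.
        self.num_banks = num_banks
        self.offset_bits = line_bytes.bit_length() - 1
        self.bank_bits = num_banks.bit_length() - 1
        # Each sub-bank stores one tag per way; None marks an invalid entry.
        self.banks = [[None] * ways_per_bank for _ in range(num_banks)]

    def split(self, addr):
        """Split an address into (tag, bank, offset)."""
        offset = addr & ((1 << self.offset_bits) - 1)
        bank = (addr >> self.offset_bits) & (self.num_banks - 1)
        tag = addr >> (self.offset_bits + self.bank_bits)
        return tag, bank, offset

    def fill(self, addr, way):
        """Install the line containing addr into the given way of its sub-bank."""
        tag, bank, _ = self.split(addr)
        self.banks[bank][way] = tag

    def lookup(self, addr):
        """Return (hit, bank, data_reads) for one access."""
        tag, bank, _ = self.split(addr)
        # In hardware every entry of the selected sub-bank compares in parallel;
        # here the comparison is modeled with a simple scan.
        hit = any(stored == tag for stored in self.banks[bank])
        # Data is read only from the matching line, so a miss reads no data.
        return hit, bank, (1 if hit else 0)


cache = CamTagCache(num_banks=8, ways_per_bank=32, line_bytes=32)  # 8KB, 32-way
cache.fill(0x1234, way=5)
print(cache.lookup(0x1234))
print(cache.lookup(0x5234))
```

Because associativity lives inside the sub-bank, raising it in this model only lengthens the list of stored tags; the number of data reads per access stays at one.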

CAM-tag Cache Sub-bank

[Figure: CAM-tag cache sub-bank. Labels: I/O bus, tag, CAM-tag array, global wordline (gwl), local wordlines (lwl), offset decoders, SRAM cells, sense amps; dimensions marked 32 and 128.]

CAM Functionality and Energy Usage

10-T CAM cell with separate write/search lines and low-swing match line

[Figure: CAM cell schematic made of an SRAM storage part and an XOR compare part. Signals: WL, Bit, Bit_b, SBit, SBit_b, match. Two annotated examples show a Match case and a Mismatch case.]

CAM energy dissipation: search lines, match lines, drivers
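
A behavioral sketch of the search operation may help here. It assumes the common arrangement in which each cell XORs its stored bit with the search bit and any mismatching cell pulls a precharged match line low; the slide confirms the XOR and the match line but not the precharge scheme, and the function names are hypothetical.

```python
def match_line(stored_bits, search_bits):
    """Return True if the match line stays high, i.e. every bit matches."""
    if len(stored_bits) != len(search_bits):
        raise ValueError("stored and search words must have the same width")
    # Per-cell XOR: 1 means this cell sees a mismatch.
    mismatches = [s ^ q for s, q in zip(stored_bits, search_bits)]
    # Assumed behavior: a single mismatching cell discharges the match line.
    return not any(mismatches)


def search(cam_rows, search_bits):
    """Compare the search word against every row; one match flag per row."""
    return [match_line(row, search_bits) for row in cam_rows]


rows = [[1, 0, 1], [1, 1, 1]]
print(search(rows, [1, 0, 1]))
```

In a cache, the row whose match flag is set is the line whose wordline gets enabled for the data access.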

CAM-tag Cache Sub-bank Layout

10% area overhead over RAM-tag cache

2x12x32 CAM Array

1-KB cache sub-bank implemented in 0.25 µm CMOS technology

32x64 RAM Array

Delay Comparison

RAM-tag cache critical path: index bits -> global wordline decoding (gwl) -> local wordline decoding with decoded offset (lwl) -> tag readout and data readout -> tag comparison -> data out

CAM-tag cache critical path: tag bits -> tag bits broadcasting -> tag comparison -> gwl -> local wordline decoding with decoded offset (lwl) -> data readout -> data out

Within 3% of each other

Hit Energy Comparison

[Figure: bar chart of hit energy per access for an 8KB cache, in pJ (axis 0 to 450). X-axis, associativity and implementation: 1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM. Series: LZW, ijpeg, pegwit, perl, m88ksim, gcc, Avg.]

Miss Rate Results

[Figure: six panels of miss rate versus associativity (1-way, 2-way, 4-way, 8-way, 16-way, 32-way, 64-way) for 8KB and 16KB caches. Panels: LZW, ijpeg, m88ksim, perl, pegwit, gcc.]

Total Access Energy (pegwit)

Pegwit – High miss rate for high associativity

[Figure: total energy per access for an 8KB cache, in pJ (axis 0 to 2500), versus miss energy expressed in multiples of 32-bit read access energy (32X, 64X, 128X, 256X, 512X, 1024X). Series: 1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM.]

Total Access Energy (perl)

Perl – Very low miss rate for high associativity

[Figure: total energy per access for an 8KB cache, in pJ (axis 0 to 500), versus miss energy expressed in multiples of 32-bit read access energy (32X, 64X, 128X, 256X, 512X, 1024X). Series: 1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM.]
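
One plausible reading of the total-energy plots is that each point adds the hit energy to the miss rate times the miss energy, with the miss energy given as a multiple of a 32-bit read access. The sketch below assumes that model; the function names and every number in the usage lines are hypothetical placeholders, not values from the talk.

```python
def total_energy_per_access(hit_energy_pj, miss_rate, read32_energy_pj, miss_multiple):
    """Assumed model: E_total = E_hit + miss_rate * (miss_multiple * E_read32)."""
    miss_energy_pj = miss_multiple * read32_energy_pj
    return hit_energy_pj + miss_rate * miss_energy_pj


def sweep(hit_energy_pj, miss_rate, read32_energy_pj,
          multiples=(32, 64, 128, 256, 512, 1024)):
    """Total energy per access at each miss-energy multiple on the plot's x-axis."""
    return [
        total_energy_per_access(hit_energy_pj, miss_rate, read32_energy_pj, m)
        for m in multiples
    ]


# Hypothetical configurations: a cheaper hit with more misses versus
# a costlier hit with fewer misses.
print(sweep(hit_energy_pj=100.0, miss_rate=0.02, read32_energy_pj=10.0))
print(sweep(hit_energy_pj=120.0, miss_rate=0.01, read32_energy_pj=10.0))
```

Under this model the miss term grows linearly with the multiple, so a configuration with a lower miss rate gains ground as misses become more expensive, which matches the shape of the argument on the pegwit and perl slides.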

Other Advantages of CAM-tag

Hit signal generated earlier: simplifies pipelines

Simplified store operation
Wordline only enabled during a hit
Stores can happen in a single cycle
No write buffer necessary

Related Work

CACTI and CACTI2 [Wilton and Jouppi, ’94], [Reinman and Jouppi, ’99]
Accurate delay and energy estimates: results within 10%
Energy estimate not suited for low-power designs

Typical low-power features not included in CACTI: sub-banking, low-swing bitlines, wordline gating, separate CAM search lines, low-swing match lines

CACTI energy estimate is 10X greater than our model for one CAM-tag cache sub-bank
Our results closely agree with [Amrutur and Horowitz, ’98]

Conclusion

CAM tags: high performance and low power
Energy consumption of 32-way CAM < 2-way RAM
Easy to implement highly-associative tags
Low area overhead (10%)
Comparable access delay
Better CPI by reducing miss rate

Thank You!

http://www.cag.lcs.mit.edu/scale/