hardware caches with low access times and high...

45
Hardware Caches with Low Access Times and High Hit Ratios Xiaodong Zhang Ohio State University Acknowledgement of Contributions: Chenxi Zhang, Tongji University Zhichun Zhu, University of Illinois, Chicago

Upload: vuongtuong

Post on 25-Aug-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Hardware Caches with Low Access Times and High Hit Ratios

Xiaodong ZhangOhio State University

Acknowledgement of Contributions: Chenxi Zhang, Tongji University

Zhichun Zhu, University of Illinois, Chicago

Page 2: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Basics of Hardware Caches

A data item is referenced by its memory address.

It is first searched in the cache.

Three questions cover all the cache operations:How do we know it is in the cache? If it is (a hit), how do we find it? If it is not (a miss), how to replace the data if the location is already occupied?

Page 3: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Direct-Mapped CacheThe simplest, but efficient and commonly used.

Each access is mapped to exactly one location.

The mapping follows:(memory address) mod (number of cache blocks)

A standard block has 4 bytes (a word), but it is increasingly longer for spatial locality.

Page 4: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

An Example of an Direct Mapped-cache

000

001

010

011

100

101

110

111

0000

000

001

0001

000

011

0010

000

101

0011

000

111

0100

001

001

0101

001

011

0110

001

101

0111

001

111

1000

010

001

1001

010

011

1010

010

101

1011

010

111

1100

011

001

1101

011

011

1110

011

101

1111

011

111

Cache

Memory

710 (00111)2 Mod 810 = 11121510 (01111)2 Mod 810 = 11122310 (10111)2 Mod 810 = 11123110 (11111)2 Mod 810 = 1112

Page 5: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Nature of Direct-mapped Caches

If the number of cache blocks is a power of 2, the mapping is exactly the low-order log2 (cache size in blocks) bits of the address. If cache = 23 = 8 blocks, the 3 low-order bits are directly mapped addresses. The lower-order bits are also called cache index.

Page 6: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Tags in Direct-Mapped Caches

A cache index for a cache block is not enough because multiple addresses will be mapped to the same block. A ``tag” is used to make this distinction. The upper portion of the address forms the tag:

2 bits 3 bits

Tag Cache Index

If both the tag and index are matched between memory and cache, a ``hit” happens.A valid bit is also attached for each cache block.

Page 7: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Allocation of Tag, Index, and Offset Bits

Cache size: 64 Kbytes.Block size: 4 Bytes (2 bits for offset) 64 Kbytes = 16 K blocks = 2 14 (14 bits for index)For a 32-bit memory address: 16 bits left for tag.

Each cache line contains a total of 49 bits: 32 bits (data of 4 Bytes)16 bits (tag) 1 bit for valid bit.

16 bits 14 bits 2 bits

Tag Index for 16K words Byte offset

Address

Page 8: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Set-associative Caches

Direct-mapping one location causes high miss rate. How about increasing the number of possible locations for each mapping?The number of locations is called ``associativity”. A set contains more than one location.associativity = 1 for direct-mapped cacheSet-associative cache mapping follows:

(memory address) mod (number of sets)Associativity = number of blocks for fully associative cache.

Page 9: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Direct-mapped vs. Set-associativeDirect-mapped

2-way set-associative

Set 0

Set 3

Address

Mod #set Mod #setSet 0

Set 7

Way 0 Way 0 Way 1

Page 10: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Cache AccessesDirect-mapped

2-way set-associative

Set 0

Set 7 Set 3

Way 0 Way 0 Way 1

Set 0 2 7 A 4 2 7 A 2

2

2

7

7

2

A

7

A

4

A

4

2

2

4

7

7

2

A

7

A2

2

2

7 misses

2 7 A 4 2 7 A 22

272

7 A

A7

4

4

A

2

2

4

7

72

A

A7

2

2 A2

4 misses

References

Page 11: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Direct-mapped Cache Operations: Minimum Hit Time

CPU addresstag set offset Cache

tag data

Yes!

Data Ready

=?

tag

CPU

Page 12: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Set-associative Cache Operations:Delayed Hit Time

CPU addresstag set offset

=?To CPU

tagway0

dataway1 way2 way3

Mux 4:1

Page 13: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Set-associative Caches Reduce Miss Ratios

172.mgrid Data Cache Miss Rate

02468

1012

1-way 2-way 4-way 8-way 16-way

Cache Associativity (16KB, blocks size 32B)

Mis

s R

ate

(%)

172.mgrid: SPEC 2000, multi-grid solver

Page 14: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Trade-offs between High Hit Ratios (SA) and Low Access Times (DM)

Set- associative cache achieves high hit-ratios: 30% higher than direct-mapped cache.But it suffers high access times due to

Multiplexing logic delay during the selection.Tag checking, selection, and data dispatching are sequential.

Direct-mapped cache loads data and checks tag in parallel: minimizing the access time. Can we get both high hit ratios and low access times?The Key is the Way Prediction: speculatively determine which way is the hit so that only that way is accessed.

Page 15: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Best Case of Way Prediction: First HitCost: way prediction only

tag set offset

tagway0

dataway1 way2 way3

Way-prediction

=?Mux 4:1 To CPU

Page 16: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Way Prediction: Non-first Hit (in Set)Cost: way prediction + selection in set

tag set offset

tagway0

dataway1 way2 way3

Way-prediction

=?Mux 4:1≠ To CPU

Page 17: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Worst Case of Way-prediction: MissCost: way prediction + selection in set + miss

tag set offset

tagway0

dataway1 way2 way3

Way-prediction

=?Mux 4:1≠≠

Lower level Memory

To CPU

Page 18: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

MRU Way Predictions• Chang, et. al., ISCA’87, (for IBM 370 by IBM)• Kessler, et. al., ISCA’89. (Wisconsin)

• Mark the Most Recent Use (MRU) block in each set.

• Access this block first. If hits, low access time.

• If the prediction is wrong, search other blocks.

• If the search fails in the set, it is a miss.

Page 19: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

MRU Way Prediction

11010000 0111 101101

MRU Table Way 0 Way 1 Way 2 Way 3

100001

Reference 1

01 11010000 0111 1011000110

Miss

101011

Reference 2

000110 000110 0000 1101 1011101111Non-first Hit

Page 20: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Limits of MRU Set-Associative Caches

First hit is not equivalent to a direct-mapped hit.MRU index is fetched before accessing the block (either cache access cycle is lengthened or additional cycle is needed).

The MRU location is the only search entry in the set.The first hit ratio can be low in cases without many repeated accesses to single data (long reuse distance), such as loops.

MRU cache can reduce access times of set-associative caches by a certain degree but

It still has a big gap with that of direct-mapped cache.It can be worse than set-associative cache when first-hits to MRU locations are low.

Page 21: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-column Caches: Fast Accesses and High Hit Ratio

Zhang, et. al., IEEE Micro, 1997. (Tongji, and Ohio State)

Objectives:Each first hit is equivalent to a direct-mapped access.Maximize the number of first hits.Minimize the latency of non-first-hits. Additional hardware should be simple and low cost.

Page 22: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-Column Caches: Major Location

SetTag Offset

Unused bits directing to the block (way) in the set

Address

The unused bits and ``set bits” generate a direct-mapped location: Major Location.A major location mapping = direct-mapping. Major location only contains a MRU block either loaded from memory or just accessed.

Page 23: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-column Caches: Selected Locations

Multiple blocks can be direct-mapped to the same major location, but only MRU is the major. The non-MRU blocks are stored in other empty locations in the set: Selected Locations.If ``other locations” are used for their own major locations, there will be no space for selected ones. Swap:

A block in selected location is swapped to major locationas it becomes MRU.A block in major location is swapped to a selected location after a new block is loaded in it from memory.

Page 24: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-Column: Indexing Selected Locations

The selected locations associated with its major location are indexed for a fast search.

Location 0

Location 1

Location 2

Location 3

0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0

3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0

Major Location 1 has two selected locations at 0 and 2

Bit Vector 3 Bit Vector 2 Bit Vector 1 Bit Vector 0

Page 25: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Summary of Multi-column Caches

A major location in a set is the direct-mapped location of MRU. A selected location is the direct-mapped location but non-MRU.An selected location index is maintained for each major location. A ``swap” is used to ensure the block in the major location is always MRU.

Page 26: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-column Cache Operations

100001

Reference 1

Way 0 Way 1 Way 2 Way 3

0000 1101 0111 1011

Not at major location !

1101

101011

Reference 2

0001

Place 0001 at the major location !

0000 0001 0111 10111011

First Hit!

0000 0000 00000000

No selected location!

0010Selected location

Page 27: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Performance Summary of Multi-column Caches

Hit ratio to the major locations is about 90%.

The hit ratio is higher than that of direct-mapped cache due to high associativity while keeps low access time of direct-mapped cache in average.

First-hit is equivalent to direct-mapped.Non-first hits are faster than set-associative caches.

Not only outperform set associative caches but alsoColumn-associative caches (two way only, ISCA’93, MIT).

Page 28: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Comparing First-hit Ratios between Multicolumn and MRU Caches

64KB Data Cache Hit Rate (Program mgrid)

0%

20%

40%

60%

80%

100%

120%

4-way 8-way 16-way

Hit

Rat

e

overall

MRU first-hit

Multicolumn first-hit

Page 29: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Some Complexity of Multi-column Caches

Search of the selected locations can besequential based on the vector bits of each set, or parallel with a multiplexor for a selection.

If a mapping finds its major location is occupied by a selected location in another major location group,

either search the bit vectors to set the bit to 0, or simplyReplace it by the major location data. When the selected location is searched, it will be a miss.

The index may be omitted by only relying on the swapping. Partial indexing by only tracing one selected location.

Page 30: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Multi-column Technique is Critical for Low Power Caches

0.1

1

10

100

1000

10000

85 87 89 91 93 95 97 99 01 03

Freq

uenc

y (M

Hz)

Tra

nsis

tor

(M)

Pow

er (W

)

FrequencyTransistorPower

Source: Intel.com

80386

80486 Pentium

Pentium ProPentium II

Pentium IIIPentium 4

Page 31: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Importance of Low-power Designs

Portable systems:Limited battery lifetime

High-end systemsCooling and package cost

• > 40 W: 1 W $1• Air-cooled techniques: reaching limits

Electricity billReliability

Page 32: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Low-power Techniques

Physical (CMOS) levelCircuit levelLogic levelArchitectural levelOS levelCompiler levelAlgorithm/application level

Page 33: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Tradeoff between Performance and Power

Objects for general-purpose systemReduce power consumption without degrading performance

Common solutionAccess/activate resources only when necessary

QuestionWhen is necessary?

Page 34: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

On-chip Caches: Area & Power(Alpha 21264)

Power Consumption

Clock Issue CachesFP Int MemI/O Others

Source: CoolChip Tutorial

Page 35: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Standard Set-associative Caches: Energy Perspectivetag set offset

=?To CPU

tagway0

dataway1 way2 way3

Mux 4:1

datatag PNPN ×+×

power data : power, tag: ity,associativ cache : datatag PPN

Page 36: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Phased Cachetag set offset

=?To CPU

tagway0

dataway1 way2 way3

Mux 4:1

datatag PPN +×

power data : power, tag: ity,associativ cache : datatag PPN

Page 37: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Way-prediction Cachetag set offset

tagway0

dataway1 way2 way3

Way-prediction

=?Mux 4:1 To CPU

datatag PP +

datatag PNPN ×+×To CPU

power data : power, tag: ity,associativ cache : datatag PPN

Page 38: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Limits of Existing Techniques

Way-prediction cachesBenefits from cache hits (Ptag + Pdata)Effective for programs with strong locality

Phased cachesBenefits from cache misses (N×Ptag)Effective for programs with weak locality

Page 39: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Cache Hit Ratios are Very DifferentHit Rate (64KB D-Cache, 4MB L2 Cache)

40%50%60%70%80%90%

100%110%

gzip

vpr

gcc

mcf

craf

typa

rser

eon

perl

bmk

gap

vort

exbz

ip2

twol

fw

upw

ise

swim

mgr

idap

plu

mes

aga

lgel art

equa

kefa

cere

cam

mp

luca

sfm

a3d

sixt

rack

apsi

Hit

Rat

e

dL1 hit rateuL2 hit rate

Page 40: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Cache Optimization Subject to Both Power and Access Time

ObjectivesPursue lowest access latency and power consumption for both cache hit and miss.Achieve consistent power saving across a wide range of applications.

SolutionApply way-prediction to cache hits and phase cache to misses.Access mode prediction (AMP) cache.Zhu and Zhang, IEEE Micro, 2002 (W&M, now at Ohio State)

Page 41: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

AMP Cache

Access predictedway (1 tag + 1 data)

Way prediction

Predictioncorrect?

Yes

Access modeprediction

Way prediction

Access modeprediction

Access all N tags

Access 1 data

Access all otherways ((N-1) tag

+ (N-1) data)

No

Page 42: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Prediction and Way Prediction

PredictionAccess predictor is designed to predict next access be a hit or a miss.The prediction result is used to switch between phase cache and way prediction technique. Cache misses are clustered and program behavior is repetitive. Branch prediction techniques are adopted.

Way PredictionMulti-column is found the most effective.

Page 43: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Energy Consumption:Multi-column over MRU Caches

0%10%20%30%40%50%60%70%80%

gzip vpr

gcc

mcf

craf

typa

rser

eon

perlb

mk

gap

vort

exbz

ip2

twol

fw

upw

ise

swim

mgr

idap

plu

mes

aga

lgel art

equa

kefa

cere

cam

mp

luca

sfm

a3d

sixt

rack

apsi

Ener

gy R

educ

tion

4-way8-way16-way

Page 44: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

Energy Consumption

0.40.60.81.01.21.41.6

gzip

vpr

gcc

mcf

craf

typa

rser

eon

perl

bmk

gap

vort

exbz

ip2

twol

fw

upw

ise

swim

mgr

idap

plu

mes

aga

lgel art

equa

kefa

cere

cam

mp

luca

sfm

a3d

sixt

rack

apsi

Ave

rageN

orm

aliz

ed E

nerg

y C

onsu

mpt

ion Multicolumn Phased Access Mode Prediction

Page 45: Hardware Caches with Low Access Times and High …dragonstar.ict.ac.cn/course_09/XD_Zhang/(2)-hardware-caches.pdf · Hardware Caches with Low Access Times and High Hit Ratios Xiaodong

ConclusionMulti-column cache fundamentally addresses the performance issue for both high hit ratio and low access time.

major location mapping is dominant and has the minimum access time (=direct-mapped access)swap can increase the first hit ratios in major locations.Indexing selected locations make non-first-hits fast.

Multicolumn cache is also an effective way prediction mechanism for low powers.