asplos’02 presented by kim, sun-hee. technology trends ◦ the rate of frequency scaling is...

An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches

ASPLOS’02

Presented by Kim, Sun-Hee

Technology trends
  ◦ The rate of frequency scaling is slowing down
      - Performance must come from exploiting concurrency
  ◦ Increasing global on-chip wire delay problem
      - Architectures must be partitioned

NUCA (Non-Uniform access Cache Architecture)
  ◦ Composable on-chip memories
  ◦ Address the increasing wire delay problem in future large caches
  ◦ Array of fine-grained memory banks connected by a switched network

Introduction

UCA (Uniform Cache Access)
  ◦ Traditional cache
  ◦ Poor performance
      - Internal wire delays
      - Restricted number of ports

Level-2 Cache Architectures(1/5)

ML-UCA (Multi-level Cache)
  ◦ L2 and L3
  ◦ Aggressively banked
      - Multiple parallel accesses
      - Inclusion, replicating lines across levels


S-NUCA-1 (Static Non-Uniform Cache)
  ◦ Non-uniform access without inclusion
  ◦ Mapping is predetermined
      - Based on the block index
      - A block can reside in only one bank of the cache
  ◦ Private, two-way, pipelined transmission channel


S-NUCA-2
  ◦ 2D switched network
      - Permits a larger number of smaller, faster banks
      - Circumvents wire and decoder area overhead


D-NUCA (Dynamic NUCA)
  ◦ Migrating cache lines
      - Data is permitted to be mapped to many banks
      - Most requests are serviced by the fastest banks
  ◦ Fewer misses
      - By adapting to the working set


Experimental Methodology
  ◦ Cacti to derive the access times for cache
  ◦ sim-alpha to simulate cache performance

UCA Evaluation

UCA

Mappings of data to banks are static
  ◦ Low-order bits of the index determine the bank
  ◦ Four-way set associative

Advantages
  ◦ Different access times, proportional to the distance of the bank
  ◦ Accesses to different banks may proceed in parallel
      - Reducing contention
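The static mapping above can be sketched in a few lines. This is only an illustration: the function name `snuca_bank`, the 64-byte block size, and the 32-bank count are assumptions made for the example, not values taken from the slides.

```python
def snuca_bank(addr: int, block_bytes: int = 64, num_banks: int = 32) -> int:
    """Return the bank that statically holds the block containing addr.

    block_bytes and num_banks are hypothetical powers of two chosen
    for illustration only.
    """
    block_index = addr // block_bytes      # drop the block-offset bits
    return block_index & (num_banks - 1)   # low-order index bits pick the bank


# Four consecutive blocks fall into four different banks,
# so they could be accessed in parallel.
banks = [snuca_bank(a) for a in range(0, 4 * 64, 64)]
```

Because the bank is a pure function of the address, a block has exactly one possible home and no search is needed to find it.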

S-NUCA

2 private, per-bank 128-bit channels
  ◦ Each bank is accessed independently at maximum speed
  ◦ Small-bank advantages vs. area overheads

Bank conflict contention model
  ◦ Conservative policy: b+2d+3 cycles
  ◦ Aggressive pipelining policy: b+3 cycles
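The two contention policies can be compared with a small calculation. The slide does not define b and d; the sketch below assumes b is the bank access time and d is the one-way transmission time on the bank's channel, both in cycles. The function name and the sample values are hypothetical.

```python
def bank_busy_cycles(b: int, d: int, aggressive: bool = False) -> int:
    """Cycles before a bank can accept its next request.

    Assumption: b = bank access time, d = one-way channel delay (cycles).
    """
    if aggressive:
        return b + 3           # aggressive pipelining policy: b+3
    return b + 2 * d + 3       # conservative policy: b+2d+3


# Hypothetical far bank: 3-cycle access, 4-cycle one-way channel delay.
conservative = bank_busy_cycles(b=3, d=4)
pipelined = bank_busy_cycles(b=3, d=4, aggressive=True)
```

Under these assumed values the conservative policy keeps the bank busy for 14 cycles and the aggressive one for 6, so the gap between the policies grows with the channel delay d.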

S-NUCA-1 (Private Channel)

Lightweight, wormhole-routed 2-D mesh
Centralized tag store or broadcasting the tags to all of the banks

S-NUCA-2 (Switched Channel)

Spread sets
  ◦ The multibanked cache is treated as a set-associative structure
  ◦ Bank set: the banks that together hold the ways of one set

D-NUCA : Mapping

Bank set, 4-way
Number of rows may not equal the number of ways
Different latencies

Equal latencies
Complex path in a set
Potentially longer latencies
More contention

Fastest bank access

Incremental search
  ◦ From the closest bank
  ◦ Minimizes messages: low energy, but lower performance

Multicast search
  ◦ Multicast address to banks in a set
  ◦ Higher performance at the cost of more energy and contention

Limited multicast
  ◦ Search first M banks in parallel, then incrementally

Partitioned multicast
  ◦ Subsets of the bank set are searched iteratively
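The trade-off between these search policies can be made concrete with a toy model. In the sketch below a bank set is a list of tags ordered from the closest to the farthest bank, and `first_parallel` (a hypothetical parameter standing in for M) selects the policy: 1 gives incremental search, the full set size gives multicast, and anything in between gives limited multicast. Partitioned multicast is not modeled. Banks probed stands in for energy, and rounds for latency; both are simplifications.

```python
from typing import List, Optional, Tuple


def search(bank_tags: List[Optional[int]], tag: int,
           first_parallel: int) -> Tuple[Optional[int], int, int]:
    """Search one bank set; return (hit position or None, banks probed, rounds)."""
    n = len(bank_tags)
    m = min(max(first_parallel, 1), n)
    probed, rounds = m, 1
    # Round 1: probe the first m banks in parallel.
    for pos in range(m):
        if bank_tags[pos] == tag:
            return pos, probed, rounds
    # Remaining banks: one bank per round, closest first.
    for pos in range(m, n):
        probed += 1
        rounds += 1
        if bank_tags[pos] == tag:
            return pos, probed, rounds
    return None, probed, rounds


bank_set = [10, 11, 12, 13]                               # tags, closest bank first
incremental = search(bank_set, 12, first_parallel=1)      # (2, 3, 3)
multicast = search(bank_set, 12, first_parallel=4)        # (2, 4, 1)
limited = search(bank_set, 12, first_parallel=2)          # (2, 3, 2)
```

For a line in the third bank, incremental search probes three banks over three rounds, multicast probes all four in one round, and limited multicast with M = 2 sits in between.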

D-NUCA : Locating

Challenges in distributed cache array
  ◦ Many banks may need to be searched
  ◦ Miss resolution time grows as the number of ways increases

Partial tag comparison
  ◦ Reduces bank lookups and miss resolution time

Smart search
  ◦ Stores the partial tag bits in the cache controller
  ◦ ss-performance: enough tag bits to reduce false hits
  ◦ ss-energy: serialized search from the closest bank
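A rough sketch of partial tag comparison follows. It assumes the controller keeps the low-order bits of each line's tag for every bank in a bank set; the 6-bit width, the helper names, and the sample tags are all hypothetical. Banks whose partial tag does not match can be skipped, and if no bank matches, the miss is known without searching the array.

```python
PARTIAL_BITS = 6  # assumed width of the stored partial tag


def partial(tag: int, bits: int = PARTIAL_BITS) -> int:
    """Keep only the low-order bits of a tag."""
    return tag & ((1 << bits) - 1)


def candidate_banks(stored_partials, tag, bits=PARTIAL_BITS):
    """Positions (closest bank first) whose partial tag matches the request."""
    want = partial(tag, bits)
    return [pos for pos, p in enumerate(stored_partials) if p == want]


full_tags = [0x1A5, 0x2C3, 0x0E5, 0x3F0]     # one line per bank, closest first
stored = [partial(t) for t in full_tags]     # what the controller would hold

hit_candidates = candidate_banks(stored, 0x0E5)   # bank 0 is a false hit
early_miss = candidate_banks(stored, 0x111)       # no match: miss known early
```

In this example the request for 0x0E5 matches two banks because 0x1A5 shares its low six bits, which is the false-hit case that a wider partial tag would make less likely.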

D-NUCA : Searching

Maximize the hit ratio in the closest bank
  ◦ MRU line is in the closest bank
  ◦ Generational promotion
      - Approximates an LRU mapping
      - Reduces the amount of copying required by pure LRU
      - On a hit, the line is swapped with the line in the next closest bank

Zero-copy policy, one-copy policy
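Generational promotion can be illustrated with a toy model of a single bank set. The class below is a sketch under stated assumptions, not the presented design: index 0 is the closest bank, a hit swaps the line one bank closer, and a miss inserts the new line into the slowest bank and simply drops whatever was there (the zero-copy choice listed among the D-NUCA policies). The class and method names are hypothetical.

```python
class BankSet:
    """One bank set; lines[0] is the closest (fastest) bank."""

    def __init__(self, ways: int) -> None:
        self.lines = [None] * ways

    def access(self, tag: str) -> bool:
        """Return True on a hit, False on a miss."""
        if tag in self.lines:
            pos = self.lines.index(tag)
            if pos > 0:
                # Promotion: swap with the line in the next closest bank.
                self.lines[pos - 1], self.lines[pos] = (
                    self.lines[pos], self.lines[pos - 1])
            return True
        # Miss: insert in the slowest bank; the old line is dropped (zero copy).
        self.lines[-1] = tag
        return False


s = BankSet(ways=4)
results = [s.access(t) for t in ["A", "A", "B", "A", "B"]]
final_layout = list(s.lines)
```

Each hit costs a single swap, whereas a pure LRU ordering would shift every line between the hit position and the closest bank.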

D-NUCA : Movement

Mapping
  ◦ Simple or shared

Search
  ◦ Multicast, incremental, or combination

Promotion
  ◦ Promotion distance (1 bank), promotion trigger (1 hit)

Insertion
  ◦ Location (slowest bank) and replacement (zero copy)

Compare to pure LRU

D-NUCA : Policies

Evaluations (1/2)

UCA : 67.7, ML-UCA : 22.3, S-NUCA : 30.4

UCA : 0.41, S-NUCA : 0.65

Comparison to ML-UCA
  ◦ Similar to D-NUCA in that frequently used data is kept closer

Evaluations (2/2)

Working set > 2MB

Low latency access
Technology scalability
Performance stability
Flattening the memory hierarchy

Summary and Conclusions

Evaluations (2/)

Cache Design Comparison

Evaluations (3/3)
