An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches
ASPLOS’02
Presented by Kim, Sun-Hee

Upload: robert-chester-poole

Post on 21-Jan-2016


Introduction

Technology trends
◦ The rate of frequency scaling is slowing down
    ◦ Performance must come from exploiting concurrency
◦ Increasing global on-chip wire delay problem
    ◦ Architectures must be partitioned

NUCA (Non-Uniform Cache Architecture)
◦ Composable on-chip memories
◦ Addresses the increasing wire delay problem in future large caches
◦ Array of fine-grained memory banks connected by a switched network

Level-2 Cache Architectures (1/5)

UCA (Uniform Cache Architecture)
◦ Traditional cache
◦ Poor performance
    ◦ Internal wire delays
    ◦ Restricted number of ports

Level-2 Cache Architectures (2/5)

ML-UCA (Multi-level Cache)
◦ L2 and L3
◦ Aggressively banked
    ◦ Multiple parallel accesses
    ◦ Inclusion, replicating lines across levels

Level-2 Cache Architectures (3/5)

S-NUCA-1 (Static Non-Uniform Cache)
◦ Non-uniform access without inclusion
◦ Mapping is predetermined
    ◦ Based on the block index
    ◦ Each block maps to only one bank of the cache
◦ Private, two-way, pipelined transmission channel

Level-2 Cache Architectures (4/5)

S-NUCA-2
◦ 2D switched network
    ◦ Permits a larger number of smaller, faster banks
    ◦ Circumvents wire and decoder area overhead

Level-2 Cache Architectures (5/5)

D-NUCA (Dynamic NUCA)
◦ Migrating cache lines
    ◦ Data may be mapped to many banks
    ◦ Most requests are serviced by the fastest banks
◦ Fewer misses
    ◦ By adapting to the working set

UCA

Experimental Methodology
◦ Cacti to derive the access times for caches
◦ sim-alpha to simulate cache performance

UCA Evaluation

S-NUCA

Mappings of data to banks are static
◦ Low-order bits of the index determine the bank
◦ Four-way set associative

Advantages
◦ Access time differs per bank, in proportion to the distance of the bank
◦ Accesses to different banks may proceed in parallel
    ◦ Reducing contention
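The static mapping on the S-NUCA slide can be illustrated with a minimal sketch. The block size, the bank count, and the function name `snuca_bank` are assumptions chosen for illustration; the slides only state that the low-order bits of the index select the bank.

```python
def snuca_bank(address: int, block_bytes: int = 64, num_banks: int = 32) -> int:
    """Return the bank that holds `address` under a static, index-based mapping.

    block_bytes and num_banks are hypothetical values, not taken from the slides.
    """
    if num_banks <= 0 or num_banks & (num_banks - 1):
        raise ValueError("num_banks must be a power of two")
    block_index = address // block_bytes   # drop the block-offset bits
    return block_index & (num_banks - 1)   # low-order index bits select the bank


# Consecutive blocks land in consecutive banks; the mapping wraps after num_banks blocks.
banks = [snuca_bank(a) for a in (0x0000, 0x0040, 0x0080, 0x0800)]
print(banks)  # [0, 1, 2, 0]
```

Because the bank is a pure function of the address, a block can only ever live in one bank, which is why no search is needed in S-NUCA.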

S-NUCA-1 (Private Channel)

2 private, per-bank 128-bit channels
◦ Each bank is accessed independently at maximum speed
◦ Small-bank advantages vs. area overheads

Bank conflict contention model
◦ Conservative policy : b+2d+3 cycles
◦ Aggressive pipelining policy : b+3 cycles
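The two contention policies above can be compared with a small sketch. The slide does not define its symbols; here `b` is assumed to be the bank access time and `d` the one-way transmission delay on the bank's channel, and the numeric values are made up for illustration.

```python
def bank_busy_cycles(b: int, d: int, aggressive: bool = False) -> int:
    """Cycles a bank is unavailable after accepting a request.

    Assumed meaning of the symbols: b = bank access time,
    d = one-way channel delay (neither is defined on the slide).
    """
    return b + 3 if aggressive else b + 2 * d + 3


def start_cycles(arrivals, b, d, aggressive=False):
    """Cycle at which each request to ONE bank can start, given arrival cycles."""
    busy = bank_busy_cycles(b, d, aggressive)
    starts, free_at = [], 0
    for t in sorted(arrivals):
        s = max(t, free_at)    # wait until the bank accepts a new request
        starts.append(s)
        free_at = s + busy
    return starts


# Hypothetical bank: b = 3 cycles, d = 4 cycles, three back-to-back requests.
print(start_cycles([0, 1, 2], b=3, d=4))                   # [0, 14, 28]
print(start_cycles([0, 1, 2], b=3, d=4, aggressive=True))  # [0, 6, 12]
```

The gap between the two policies grows with `d`, so under these assumptions the conservative policy penalizes the most distant banks the most.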

S-NUCA-2 (Switched Channel)

Lightweight, wormhole-routed 2-D mesh

Centralized tag store, or broadcasting the tags to all of the banks

D-NUCA : Mapping

Spread sets
◦ The multibanked cache is treated as a set-associative structure
◦ Bank set

Simple mapping
◦ Bank set, 4-way
◦ The number of rows may not match the number of ways
◦ Different latencies

Fair mapping
◦ Equal latencies
◦ Complex path in a set
◦ Potentially longer latencies
◦ More contention

Shared mapping
◦ Fastest bank access

D-NUCA : Locating

Incremental search
◦ Starts from the closest bank
◦ Minimizes messages: low energy, but lower performance

Multicast search
◦ Multicasts the address to the banks in a set
◦ Higher performance at the cost of more energy and contention

Limited multicast
◦ Searches the first M banks in parallel, then incrementally

Partitioned multicast
◦ Subsets of the bank set are searched iteratively
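The search policies on the Locating slide differ mainly in how many banks are probed per step. The following is a minimal sketch under simplifying assumptions: a bank set is modeled as a list of tag sets ordered from closest to farthest, and the function name and the two group-size parameters are hypothetical.

```python
def search_bank_set(bank_set, tag, first_group=1, later_group=1):
    """Look for `tag` in a bank set (index 0 = closest bank).

    first_group / later_group = banks probed in parallel per step:
      (1, 1)              -> incremental search
      (len(bank_set), 1)  -> multicast search
      (M, 1)              -> limited multicast
      (k, k)              -> partitioned multicast
    Returns (hit_bank or None, banks_probed, steps).
    """
    if first_group < 1 or later_group < 1:
        raise ValueError("group sizes must be at least 1")
    probed = steps = 0
    start, group = 0, first_group
    while start < len(bank_set):
        end = min(start + group, len(bank_set))
        steps += 1
        probed += end - start            # every bank in the group is looked up
        for i in range(start, end):
            if tag in bank_set[i]:
                return i, probed, steps
        start, group = end, later_group
    return None, probed, steps


bank_set = [{"a"}, {"b"}, {"c"}, {"d"}]
print(search_bank_set(bank_set, "c"))                    # (2, 3, 3) incremental
print(search_bank_set(bank_set, "c", first_group=4))     # (2, 4, 1) multicast
print(search_bank_set(bank_set, "c", first_group=2))     # (2, 3, 2) limited, M = 2
```

Banks probed stands in for energy and messages, and steps stands in for latency, which is the trade-off the slide describes.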

D-NUCA : Searching

Challenges in a distributed cache array
◦ Many banks may need to be searched
◦ Miss resolution time grows as the number of ways increases

Partial tag comparison
◦ Reduces bank lookups and miss resolution time

Smart search
◦ Stores the partial tag bits in the cache controller
◦ ss-performance : enough tag bits to reduce false hits
◦ ss-energy : serialized search from the closest bank
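Partial tag comparison can be sketched as follows, in the spirit of the ss-energy variant: the controller keeps a few low-order tag bits per bank and probes only the banks whose partial tag matches, closest first. The number of partial bits, the data layout, and all names are assumptions made for illustration.

```python
PARTIAL_BITS = 6  # hypothetical number of tag bits kept in the controller


def partial(tag: int, bits: int = PARTIAL_BITS) -> int:
    """Low-order `bits` bits of a tag."""
    return tag & ((1 << bits) - 1)


def smart_search(banks, controller_partials, tag):
    """Probe only banks whose stored partial tag matches, closest bank first.

    banks[i] is the set of full tags in bank i (0 = closest);
    controller_partials[i] is the set of partial tags the controller holds for bank i.
    Returns (hit_bank or None, banks_probed).
    """
    p = partial(tag)
    probes = 0
    for i, partials in enumerate(controller_partials):
        if p not in partials:
            continue                 # cannot be in this bank: skip the lookup
        probes += 1
        if tag in banks[i]:
            return i, probes
        # otherwise this was a false hit: partial bits matched, full tag did not
    return None, probes              # no match left: the miss is resolved


banks = [{0x1A5}, {0x2E5}, {0x3C7}]
controller_partials = [{partial(t) for t in bank} for bank in banks]

print(smart_search(banks, controller_partials, 0x2E5))  # (1, 2): one false hit in bank 0
print(smart_search(banks, controller_partials, 0x3C7))  # (2, 1)
print(smart_search(banks, controller_partials, 0x111))  # (None, 0): miss without any lookup
```

Storing more partial bits makes false hits rarer at the cost of a larger structure in the controller.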

D-NUCA : Movement

Maximize the hit ratio in the closest bank
◦ MRU line is in the closest bank
◦ Generational promotion
    ◦ Approximates an LRU mapping
    ◦ Reduces the number of copies required by pure LRU
    ◦ On a hit, the line is swapped with the line in the next closest bank

Zero-copy policy, one-copy policy
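Generational promotion as described on the Movement slide can be sketched in a few lines. The model is deliberately simplified: one set is represented as a list with one line per bank, ordered from the closest bank to the farthest, and the function name is hypothetical.

```python
def access_with_promotion(ways, tag):
    """Return True on a hit; on a hit, swap the line with the one in the next closest bank.

    ways[0] is the closest (fastest) bank, ways[-1] the farthest.
    """
    if tag not in ways:
        return False
    i = ways.index(tag)
    if i > 0:
        ways[i - 1], ways[i] = ways[i], ways[i - 1]   # move one bank closer
    return True


ways = ["A", "B", "C", "D"]
access_with_promotion(ways, "D")
print(ways)  # ['A', 'B', 'D', 'C']
access_with_promotion(ways, "D")
print(ways)  # ['A', 'D', 'B', 'C']
```

A line needs repeated hits to reach the closest bank, so frequently used lines drift toward the controller while a single hit moves only two lines.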

D-NUCA : Policies

Mapping
◦ Simple or shared

Search
◦ Multicast, incremental, or combination

Promotion
◦ Promotion distance (1 bank), promotion trigger (1 hit)

Insertion
◦ Location (slowest bank) and replacement (zero copy)

Compare to pure LRU
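The policy combination listed above (promotion distance 1 bank, trigger 1 hit, insertion in the slowest bank, zero-copy replacement) can be compared with pure LRU by counting line movements. This is a rough sketch under stated assumptions: "zero copy" is read here as "the victim leaves the cache without being copied to another bank", and the move-counting cost model and all names are illustrative, not taken from the slides.

```python
def dnuca_access(ways, tag):
    """One access under 1-bank promotion and tail insertion. Returns lines moved."""
    if tag in ways:
        i = ways.index(tag)
        if i == 0:
            return 0
        ways[i - 1], ways[i] = ways[i], ways[i - 1]
        return 2                    # two lines exchange banks
    ways[-1] = tag                  # miss: fill the slowest bank, victim is dropped
    return 1


def lru_access(ways, tag):
    """One access under pure LRU ordering across banks. Returns lines moved."""
    if tag in ways:
        i = ways.index(tag)
        if i == 0:
            return 0
        ways.insert(0, ways.pop(i))
        return i + 1                # hit line plus every line ahead of it shifts
    ways.pop()                      # evict the LRU line from the slowest bank
    ways.insert(0, tag)
    return len(ways)                # new line plus all surviving lines shift


trace = ["D", "D", "E", "A"]
dnuca_ways, lru_ways = list("ABCD"), list("ABCD")
dnuca_moves = sum(dnuca_access(dnuca_ways, t) for t in trace)
lru_moves = sum(lru_access(lru_ways, t) for t in trace)
print(dnuca_moves, lru_moves)  # 5 11
```

On this short trace the promotion-based policy moves fewer lines than pure LRU, at the price of a less exact recency ordering.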

Evaluations (1/2)

UCA : 67.7
ML-UCA : 22.3
S-NUCA : 30.4

UCA : 0.41
S-NUCA : 0.65

Evaluations (2/2)

Comparison to ML-UCA
◦ Similar to D-NUCA in that frequently used data is kept closer

Working set > 2MB

Summary and Conclusions

Low latency access
Technology scalability
Performance stability
Flattening the memory hierarchy


Evaluations (2/)

Evaluations (3/3)

Cache Design Comparison