An Adaptive, Non-Uniform Cache Structure
for Wire-Delay Dominated On-Chip Caches
ASPLOS’02
Presented by Kim, Sun-Hee
Technology trends
◦ The rate of frequency scaling is slowing down
    Performance must come from exploiting concurrency
◦ Increasing global on-chip wire delay problem
    Architectures must be partitioned
NUCA (Non-Uniform access Cache Architecture)
◦ Composable on-chip memories
◦ Address the increasing wire delay problem in future large caches
◦ Array of fine-grained memory banks connected by a switched network
Introduction
UCA (Uniform Cache Access)
◦ Traditional cache
◦ Poor performance
    Internal wire delays
    Restricted number of ports
Level-2 Cache Architectures(1/5)
ML-UCA (Multi-level Cache)
◦ L2 and L3
◦ Aggressively banked
    Multiple parallel accesses
    Inclusion, replicating data across levels
Level-2 Cache Architectures(2/5)
S-NUCA-1 (Static Non-Uniform Cache)
◦ Non-uniform access without inclusion
◦ Mapping is predetermined
    Based on the block index
    Each block maps to only one bank of the cache
◦ Private, two-way, pipelined transmission channel
Level-2 Cache Architectures(3/5)
S-NUCA-2
◦ 2D switched network
    Permits a larger number of smaller, faster banks
    Circumvents wire and decoder area overhead
Level-2 Cache Architectures(4/5)
D-NUCA (Dynamic NUCA)
◦ Migrating cache lines
    By allowing data to be mapped to many banks
    Most requests are serviced by the fastest banks
◦ Fewer misses
    By adapting to the working set
Level-2 Cache Architectures(5/5)
Experimental Methodology
◦ Cacti to derive the access times for caches
◦ sim-alpha to simulate cache performance
UCA Evaluation
UCA
Mappings of data to banks are static
◦ Low-order bits of the index determine the bank
◦ Four-way set associative
Advantages
◦ Different access times, proportional to the distance of the bank
◦ Accesses to different banks may proceed in parallel
    Reducing contention
S-NUCA
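The static mapping described on this slide can be sketched as follows. This is only an illustration: the function name, the 64-byte block size, and the 32-bank count are assumptions chosen for the example, not figures from the slides.

```python
def snuca_bank(address: int, block_bytes: int = 64, num_banks: int = 32) -> int:
    """Return the bank that statically holds the block containing `address`.

    Assumes num_banks is a power of two, so the modulo below is the same
    as taking the low-order bits of the block index.
    """
    block_index = address // block_bytes  # drop the byte offset within the block
    return block_index % num_banks        # low-order index bits select the bank


# Consecutive blocks land in consecutive banks, so accesses to
# neighbouring blocks can proceed in parallel.
print([snuca_bank(a) for a in (0, 64, 128, 64 * 32)])
```

Because the bank is a pure function of the address, no search is needed: a request goes to exactly one bank.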
2 private, per-bank 128-bit channels
◦ Each bank is accessed independently at maximum speed
◦ Small-bank advantages vs. area overheads
Bank conflict contention model
◦ Conservative policy: b+2d+3 cycles
◦ Aggressive pipelining policy: b+3 cycles
S-NUCA-1 (Private Channel)
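The two contention policies can be compared with a small calculation. The slide does not define its symbols, so this sketch assumes b is the bank access time and d is the one-way transmission time on the bank's channel, both in cycles. The function name and the sample values are hypothetical.

```python
def bank_busy_cycles(b: int, d: int, aggressive: bool = False) -> int:
    """Cycles before a bank can accept the next request.

    b: bank access time in cycles (assumed meaning)
    d: one-way channel transmission time in cycles (assumed meaning)
    """
    if aggressive:
        return b + 3          # aggressive pipelining policy
    return b + 2 * d + 3      # conservative policy


# Hypothetical bank: 3-cycle access, 4-cycle one-way channel delay.
print(bank_busy_cycles(3, 4))                   # conservative
print(bank_busy_cycles(3, 4, aggressive=True))  # aggressive pipelining
```

Under these assumptions the gap between the two policies is 2d, so it grows with the distance of the bank from the controller.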
Lightweight, wormhole-routed 2-D mesh
Centralized tag store or broadcasting the tags to all of the banks
S-NUCA-2 (Switched Channel)
Spread sets
◦ The multibanked cache is treated as a set-associative structure
◦ Bank set
D-NUCA : Mapping
Bank set, 4-way; number of rows may not equal the number of ways; different latencies
Equal latencies; complex path in a set; potentially longer latencies; more contention
Fastest bank access
Incremental search
◦ From the closest bank
◦ Minimizes messages: low energy, but lower performance
Multicast search
◦ Multicast address to banks in a set
◦ Higher performance at the cost of more energy and contention
Limited multicast
◦ Search the first M banks in parallel, then incrementally
Partitioned multicast
◦ Subsets of the bank set are searched iteratively
D-NUCA : Locating
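The trade-off between these search policies can be illustrated by counting bank probes. This is a simplified sketch, not the simulator used in the paper: a bank set is modeled as a Python list with the closest bank first and one line (tag) per bank, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Result = Tuple[Optional[int], int]  # (index of hit bank or None, banks probed)


def incremental_search(bank_set: List[str], tag: str) -> Result:
    """Probe banks one at a time, closest first; stop at the first hit."""
    for i, line in enumerate(bank_set):
        if line == tag:
            return i, i + 1
    return None, len(bank_set)


def multicast_search(bank_set: List[str], tag: str) -> Result:
    """Probe every bank in the set at once; all banks are always accessed."""
    hit = next((i for i, line in enumerate(bank_set) if line == tag), None)
    return hit, len(bank_set)


def limited_multicast_search(bank_set: List[str], tag: str, m: int) -> Result:
    """Probe the first m banks in parallel, then the rest incrementally."""
    m = min(m, len(bank_set))
    for i in range(m):
        if bank_set[i] == tag:
            return i, m  # all m banks were probed in the parallel step
    for i in range(m, len(bank_set)):
        if bank_set[i] == tag:
            return i, i + 1
    return None, len(bank_set)


banks = ["A", "B", "C", "D"]
print(incremental_search(banks, "C"))
print(multicast_search(banks, "C"))
print(limited_multicast_search(banks, "A", m=2))
```

The probe count stands in for energy and network messages. Latency is not modeled here: incremental search pays for its low probe count with serialized lookups, while multicast finds the line in a single parallel step.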
Challenges in distributed cache array
◦ Many banks may need to be searched
◦ Miss resolution time grows as the number of ways increases
Partial tag comparison
◦ Reduces bank lookups and miss resolution time
Smart search
◦ Stores the partial tag bits in the cache controller
◦ ss-performance: enough tag bits to reduce false hits
◦ ss-energy: serialized search from the closest bank
D-NUCA : Searching
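Partial tag comparison can be sketched as a filter in the cache controller. The sketch below assumes the controller keeps the low-order 6 bits of each bank's tag; the bit width, the choice of low-order bits, the function names, and the tag values are all assumptions for illustration.

```python
from typing import List


def partial(tag: int, bits: int = 6) -> int:
    """Keep only the low-order `bits` of a tag (assumed partial-tag scheme)."""
    return tag & ((1 << bits) - 1)


def candidate_banks(partial_tags: List[int], tag: int, bits: int = 6) -> List[int]:
    """Banks whose stored partial tag matches the request.

    An empty result is a guaranteed miss, so miss handling can start early.
    A non-empty result may include false hits that a full tag check rejects.
    """
    want = partial(tag, bits)
    return [i for i, p in enumerate(partial_tags) if p == want]


# Hypothetical full tags held by the four banks of one bank set, closest first.
full_tags = [0x1A3, 0x2F0, 0x0E3, 0x455]
controller_copy = [partial(t) for t in full_tags]

print(candidate_banks(controller_copy, 0x0E3))  # bank 0 is a false hit here
print(candidate_banks(controller_copy, 0x7FF))  # no match: early miss
```

More partial tag bits mean fewer false hits at the cost of more controller storage, which is the trade-off behind the "enough tag bits" remark above.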
Maximize the hit ratio in the closest bank
◦ MRU line is in the closest bank
◦ Generational promotion
    Approximates an LRU mapping
    Reduces the number of copies required by pure LRU
    On a hit, the line is swapped with the line in the next closest bank
    Zero-copy policy, one-copy policy
D-NUCA : Movement
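Generational promotion can be sketched with a list that stands for one bank set, closest bank first. This is a minimal model under stated assumptions: one line per bank, promotion by one bank on every hit, and a miss that inserts into the slowest bank and simply evicts its previous occupant (zero-copy). The function name is hypothetical.

```python
from typing import List, Optional


def access(bank_set: List[Optional[str]], tag: str) -> bool:
    """Access `tag` in a bank set ordered closest-first; return True on a hit.

    Hit: swap the line with the one in the next closest bank.
    Miss: place the line in the slowest bank, evicting what was there.
    """
    if tag in bank_set:
        i = bank_set.index(tag)
        if i > 0:  # already in the closest bank: nothing to promote
            bank_set[i - 1], bank_set[i] = bank_set[i], bank_set[i - 1]
        return True
    bank_set[-1] = tag  # zero-copy: the victim leaves the cache entirely
    return False


banks = ["A", "B", "C", "D"]
access(banks, "C")
access(banks, "C")
access(banks, "E")
print(banks)
```

Each hit moves a line by one bank only, so a frequently used line needs several hits to reach the closest bank, whereas pure LRU would shift every line in the set on each access.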
Mapping
◦ Simple or shared
Search
◦ Multicast, incremental, or combination
Promotion
◦ Promotion distance (1 bank), promotion trigger (1 hit)
Insertion
◦ Location (slowest bank) and replacement (zero copy)
Compare to pure LRU
D-NUCA : Policies
Evaluations (1/2)
UCA : 67.7
ML-UCA : 22.3
S-NUCA : 30.4
UCA : 0.41
S-NUCA : 0.65
Comparison to ML-UCA
◦ Similar to D-NUCA in that frequently used data is kept closer
Evaluations (2/2)
Working set > 2MB
Low latency access
Technology scalability
Performance stability
Flattening the memory hierarchy
Summary and Conclusions
Evaluations (2/)
Cache Design Comparison
Evaluations (3/3)