Non-Uniform Cache Architectures for Wire Delay Dominated Caches
Abhishek Desai
Bhavesh Mehta
Devang Sachdev
Gilles Muller
Motivation
Bigger L2 and L3 caches are needed
– Programs are larger
– SMT requires large cache for spatial locality
– BW demands have increased on the package
– Smaller technologies permit more bits per mm²
Wire delays dominate in large caches
– Bulk of the access time will involve routing to and from the banks, not the bank accesses themselves
What is NUCA?
Data residing closer to the processor is accessed much faster than data residing physically farther from it.
Example:
The closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
UCA and ML-UCA
UCA
Avg. access time: 255 cycles
Banks: 1
Size: 16MB
Technology: 50nm
[Figure: UCA and ML-UCA cache diagrams; labels: L2 41, L3 41, L2 10]
ML-UCA
Avg. access time: 11/41 cycles
Banks: 8/32
Size: 16MB
Technology: 50nm
Static-NUCA-1
[Figure: S-NUCA-1 bank array; latency labels: 17, 41]
S-NUCA-1
Avg. access time: 34 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Wire overhead 20.9%
S-NUCA-1 cache design
[Figure: S-NUCA-1 cache design; labeled components: tag array, data bus, address bus, bank, sub-bank, predecoder, sense amplifier, wordline driver and decoder]
Static-NUCA-2
[Figure: S-NUCA-2 bank array; latency labels: 9, 32]
S-NUCA-2
Avg. access time: 24 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Channel overhead 5.9%
S-NUCA-2 cache design
[Figure: S-NUCA-2 cache design; labeled components: address bus, sense amplifier, bank, data bus, switch, tag array, wordline driver and decoder, predecoder]
Dynamic-NUCA
D-NUCA
Avg. access time: 18 cycles
Banks: 256
Size: 16MB
Technology: 50nm
[Figure: D-NUCA bank array; labels: 4, 47, data migration]
Management of Data in D-NUCA
Mapping:
– How are the data mapped to the banks, and in which banks can a datum reside?
Search:
– How is the set of possible locations searched to find a line?
Movement:
– Under what conditions should the data be migrated from one bank to another?
Simple Mapping (implemented)
[Figure: simple mapping; labels: 8 bank sets, way 1 to way 4, memory controller, one set, bank]
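As a rough illustration of the simple mapping slide, the sketch below splits an address into a bank set, a set index, and a tag, and lists the banks that may hold the line. The 8 bank sets and 4 ways come from the slide; the line size, the number of sets per bank, the bit ordering, and the function names are assumptions made only for this example.

```python
LINE_BYTES = 64        # assumed cache line size (not given on the slide)
NUM_BANK_SETS = 8      # from the slide: 8 bank sets
WAYS = 4               # from the slide: way 1 .. way 4
SETS_PER_BANK = 1024   # assumed number of sets held by each bank


def simple_map(addr):
    """Split a byte address into (bank_set, set_index, tag).

    Assumption: the low-order line-address bits select the bank set.
    """
    line = addr // LINE_BYTES
    bank_set = line % NUM_BANK_SETS
    set_index = (line // NUM_BANK_SETS) % SETS_PER_BANK
    tag = line // (NUM_BANK_SETS * SETS_PER_BANK)
    return bank_set, set_index, tag


def candidate_banks(bank_set):
    """Banks that may hold a line of this bank set, one bank per way.

    Way 0 is taken to be the bank closest to the controller.
    """
    return [(bank_set, way) for way in range(WAYS)]
```

Under these assumptions, consecutive lines fall into consecutive bank sets, and a line can reside in any of the four banks of its bank set.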
Fair and Shared Mapping
[Figure: two panels, Fair Mapping and Shared Mapping, each with a memory controller]
Searching Cached Lines
– Incremental search
– Multicast search (implemented)
– Limited multicast
– Partitioned multicast
Smart search:
– ss-performance
– ss-energy
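The difference between incremental and multicast search can be sketched as follows. The data layout is an assumption for illustration: a bank set is a list of banks ordered from closest to farthest, and each bank is a dict from set index to the tag it currently holds. The function names and the use of a probe count as a cost proxy are also hypothetical.

```python
def incremental_search(banks, set_index, tag):
    """Probe banks one at a time, closest first, stopping at the first hit.

    Returns (bank_position or None, number_of_banks_probed).
    """
    for probes, bank in enumerate(banks, start=1):
        if bank.get(set_index) == tag:
            return probes - 1, probes
    return None, len(banks)


def multicast_search(banks, set_index, tag):
    """Probe every bank of the bank set at once.

    Returns (bank_position or None, number_of_banks_probed).
    All banks are always probed, so the probe count is len(banks).
    """
    hits = [i for i, bank in enumerate(banks) if bank.get(set_index) == tag]
    return (hits[0] if hits else None), len(banks)
```

In this model, incremental search touches fewer banks on a hit in a near bank, while multicast search always touches every bank but does not wait on one probe before issuing the next.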
Dynamic Movement of Lines
LRU line furthest and MRU line closest
One-bank promotion on a hit (implemented)
Policy on miss:
Which line is evicted?
– Line in the furthest (slowest) bank (implemented)
Where is the new line placed?
– Closest (fastest) bank
– Furthest (slowest) bank (implemented)
What happens to the victim line?
– Zero copy policy (implemented)
– One copy policy
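The policies marked as implemented can be sketched for a single set. This is a minimal model under stated assumptions: the set is a Python list of tags with index 0 as the closest bank and the last index as the furthest, and "zero copy" is read here as the victim leaving the cache rather than being moved to another bank. The function name is hypothetical.

```python
def access(ways, tag):
    """Access one set; ways[0] is the closest bank, ways[-1] the furthest.

    Returns True on a hit, False on a miss. Mutates `ways` in place.
    """
    if tag in ways:
        i = ways.index(tag)
        if i > 0:
            # One-bank promotion: swap with the line in the next-closer bank.
            ways[i - 1], ways[i] = ways[i], ways[i - 1]
        return True
    # Miss: the line in the furthest (slowest) bank is the victim, and the
    # new line is placed in that same bank. The victim is dropped (zero copy).
    ways[-1] = tag
    return False
```

A short usage example: starting from `['a', 'b', 'c', 'd']`, a hit on `'c'` moves it one bank closer, and a miss on `'e'` replaces the line in the furthest bank.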
Advantages of D-NUCA over ML-UCA
D-NUCA does not enforce inclusion, thus preventing redundant copies of the same line.
In ML-UCA, the faster level may not match the working set size of an application, either being too large and thus slow, or being too small and thus incurring misses.
Configuration for simulation
– Used Sim-Alpha and Cacti
– Simple mapping
– Multicast search
– One-bank promotion on each hit
– Replacement policy that chooses the block in the slowest bank as the victim of a miss
Hit Rate Distribution for D-NUCA
[Figure: Hit Rate Distribution; x-axis: cache row; y-axis: hit rate, 0 to 0.6]
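One way to read a per-row hit distribution is as weights for an average hit latency. The sketch below shows that arithmetic only; the function name is hypothetical, and the numbers in the usage note are made up for illustration, not read from the plot.

```python
def average_hit_latency(hit_fraction_per_row, latency_per_row):
    """Weighted average latency of hits, given the share of hits per row.

    Both lists are indexed by cache row, closest row first.
    The fractions are normalized, so they need not sum to exactly 1.
    """
    if len(hit_fraction_per_row) != len(latency_per_row):
        raise ValueError("need one latency per row")
    total = sum(hit_fraction_per_row)
    if total <= 0:
        raise ValueError("hit fractions must sum to a positive value")
    weighted = sum(f * l for f, l in zip(hit_fraction_per_row, latency_per_row))
    return weighted / total
```

For example, with illustrative fractions `[0.5, 0.25, 0.125, 0.125]` and illustrative row latencies `[4, 10, 20, 40]` cycles, the function returns 12.0 cycles. The more the hits concentrate in the closest rows, the closer the average gets to the fastest bank's latency.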
Simulation results – integer benchmarks
[Figure: UCA vs D-NUCA on SPEC INT 2000; y-axis: IPC, 0 to 1.4; series: uca, d-nuca]
Simulation results – FP benchmarks
[Figure: UCA vs D-NUCA on SPEC FP 2000; y-axis: IPC, 0 to 1.8; series: uca, d-nuca]
Summary
D-NUCA has the following advantages:
– Low access latency
– Technology scalability
– Performance stability
– Flattens the memory hierarchy