Non-Uniform Cache Architectures for Wire Delay Dominated Caches
![Page 1: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/1.jpg)
Non-Uniform Cache Architectures for Wire Delay Dominated Caches
Abhishek Desai
Bhavesh Mehta
Devang Sachdev
Gilles Muller
![Page 2: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/2.jpg)
Plan
Motivation
What is NUCA
UCA and ML-UCA
Static NUCA
Dynamic NUCA
Simulation Results
![Page 3: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/3.jpg)
Motivation
Bigger L2 and L3 caches are needed
– Programs are larger
– SMT requires large cache for spatial locality
– BW demands have increased on the package
– Smaller technologies permit more bits per mm²
Wire delays dominate in large caches
– Bulk of the access time will involve routing to and from the banks, not the bank accesses themselves
![Page 4: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/4.jpg)
What is NUCA?
Data residing closer to the processor is accessed much faster than data residing physically farther from it
Example:
The closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
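The gap between the closest and farthest bank matters only in proportion to where the hits land. The following is a minimal sketch of that arithmetic, not a model from the slides: the function name `expected_latency` and the hit shares are hypothetical, and only the 4-cycle and 47-cycle latencies come from the example above.

```python
def expected_latency(bank_latencies, hit_fractions):
    """Average access latency, weighting each bank's latency by its share of hits."""
    if len(bank_latencies) != len(hit_fractions):
        raise ValueError("need one hit fraction per bank")
    if abs(sum(hit_fractions) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(lat * frac for lat, frac in zip(bank_latencies, hit_fractions))


# Closest and farthest bank latencies (cycles) from the example above.
latencies = [4, 47]

# Hypothetical hit shares for illustration only; these are not measured data.
uniform = expected_latency(latencies, [0.5, 0.5])
skewed = expected_latency(latencies, [0.9, 0.1])
```

Under these assumed shares, concentrating hits in the closest bank lowers the average latency, which is the effect a non-uniform design tries to exploit.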
![Page 5: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/5.jpg)
UCA and ML-UCA
UCA
Avg. access time: 255 cycles
Banks: 1
Size: 16MB
Technology: 50nm
ML-UCA
Avg. access time: 11/41 cycles (L2/L3)
Banks: 8/32 (L2/L3)
Size: 16MB
Technology: 50nm
[Figure: UCA and ML-UCA diagrams; labels in order of appearance: L2, 41, L3, 41, L2, 10]
![Page 6: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/6.jpg)
Static-NUCA-1
S-NUCA-1
Avg. access time: 34 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Wire overhead 20.9%
[Figure: S-NUCA-1 bank array; labels: 17, 41]
![Page 7: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/7.jpg)
S-NUCA-1 cache design
[Figure: S-NUCA-1 cache design; labeled components: tag array, data bus, address bus, bank, sub-bank, predecoder, sense amplifier, wordline driver and decoder]
![Page 8: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/8.jpg)
Static-NUCA-2
S-NUCA-2
Avg. access time: 24 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Channel overhead 5.9%
[Figure: S-NUCA-2 bank array; labels: 9, 32]
![Page 9: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/9.jpg)
S-NUCA-2 cache design
[Figure: S-NUCA-2 cache design; labeled components: address bus, sense amplifier, bank, data bus, switch, tag array, wordline driver and decoder, predecoder]
![Page 10: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/10.jpg)
Dynamic-NUCA
D-NUCA
Avg. access time: 18 cycles
Banks: 256
Size: 16MB
Technology: 50nm
[Figure: D-NUCA bank array; labels: 4, 47, Data migration]
![Page 11: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/11.jpg)
Management of Data in DNUCA
Mapping:
– How are the data mapped to the banks, and in which banks can a datum reside?
Search:
– How is the set of possible locations searched to find a line?
Movement:
– Under what conditions should the data be migrated from one bank to another?
![Page 12: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/12.jpg)
Simple Mapping (implemented)
[Figure: simple mapping; labels: 8 bank sets, way 1, way 2, way 3, way 4, memory controller, one set, bank]
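One way to read the simple mapping figure is that an address selects one of the 8 bank sets, and each bank in that bank set holds one of the 4 ways. The sketch below illustrates that reading under stated assumptions: the line size, the number of sets per bank, the bit ordering of the address, and the names `simple_map` and `candidate_banks` are all hypothetical and are not taken from the slides.

```python
from typing import NamedTuple

NUM_BANK_SETS = 8      # "8 bank sets" on the slide
WAYS = 4               # "way 1" to "way 4" on the slide
LINE_BYTES = 64        # assumption: cache line size
SETS_PER_BANK = 1024   # assumption: sets held by each bank


class Location(NamedTuple):
    bank_set: int   # which bank set the line maps to
    set_index: int  # which set inside each bank of that bank set
    tag: int        # remaining address bits


def simple_map(addr: int) -> Location:
    """Split an address into bank set, set index, and tag (assumed bit order)."""
    line = addr // LINE_BYTES
    bank_set = line % NUM_BANK_SETS
    set_index = (line // NUM_BANK_SETS) % SETS_PER_BANK
    tag = line // (NUM_BANK_SETS * SETS_PER_BANK)
    return Location(bank_set, set_index, tag)


def candidate_banks(addr: int):
    """Banks that may hold the line: one per way.

    Assumption: way 0 is the bank closest to the memory controller.
    """
    loc = simple_map(addr)
    return [(loc.bank_set, way) for way in range(WAYS)]
```

With this layout, consecutive cache lines fall into consecutive bank sets, and a lookup only ever needs to consider the four banks returned by `candidate_banks`.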
![Page 13: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/13.jpg)
Fair and Shared Mapping
[Figure: fair mapping and shared mapping, each shown with a memory controller]
![Page 14: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/14.jpg)
Searching Cached Lines
Incremental search
Multicast search (implemented)
Limited multicast
Partitioned multicast
Smart search:
– ss-performance
– ss-energy
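The trade-off between incremental and multicast search can be sketched with a toy cost model. This is an illustration only: it assumes incremental search probes banks serially from closest to farthest and pays each probed bank's latency in turn, while multicast probes every bank of the bank set in parallel. The function names and the per-bank latencies are hypothetical.

```python
def incremental_search(bank_latencies, hit_way):
    """Probe banks one at a time, closest first, stopping at the first hit.

    hit_way is the index of the bank holding the line, or None on a miss.
    Returns (total latency in cycles, number of banks probed).
    """
    last = len(bank_latencies) - 1 if hit_way is None else hit_way
    probed = bank_latencies[: last + 1]
    return sum(probed), len(probed)


def multicast_search(bank_latencies, hit_way):
    """Probe every bank of the bank set in parallel.

    Returns (latency in cycles, number of banks probed).
    """
    if hit_way is None:
        latency = max(bank_latencies)  # a miss is known once the slowest bank answers
    else:
        latency = bank_latencies[hit_way]
    return latency, len(bank_latencies)


# Hypothetical per-bank latencies (cycles), closest bank first.
lat = [4, 6, 8, 10]
inc_hit = incremental_search(lat, 2)
mc_hit = multicast_search(lat, 2)
inc_miss = incremental_search(lat, None)
mc_miss = multicast_search(lat, None)
```

In this model multicast finds the line sooner but always touches every bank, while incremental search touches fewer banks when the line sits close to the controller. That is consistent with offering separate performance-oriented and energy-oriented search variants.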
![Page 15: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/15.jpg)
Dynamic Movement of Lines
LRU line furthest and MRU line closest
One-bank promotion on a hit (implemented)
Policy on miss:
Which line is evicted?
– Line in the furthest (slowest) bank (implemented)
Where is the new line placed?
– Closest (fastest) bank
– Furthest (slowest) bank (implemented)
What happens to the victim line?
– Zero copy policy (implemented)
– One copy policy
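The policies marked as implemented can be combined into a small model of a single set within one bank set. This is a sketch under assumptions rather than the simulator's code: the class name `BankSet` is hypothetical, promotion is modeled as a swap with the line in the next-closer bank, and tags are assumed never to be `None`.

```python
class BankSet:
    """One set of a bank set; slot 0 is the closest (fastest) bank."""

    def __init__(self, ways):
        self.slots = [None] * ways  # None marks an empty slot

    def access(self, tag):
        """Return True on a hit and False on a miss, updating line placement."""
        if tag in self.slots:
            i = self.slots.index(tag)
            if i > 0:
                # One-bank promotion: swap with the line in the next-closer bank
                # (the swap itself is an assumption of this sketch).
                self.slots[i - 1], self.slots[i] = self.slots[i], self.slots[i - 1]
            return True
        # Miss: the victim is the line in the furthest bank. Under the zero copy
        # policy it is simply dropped, and the new line takes its place there.
        self.slots[-1] = tag
        return False


s = BankSet(4)
s.access("A")   # miss: A enters the furthest bank
s.access("A")   # hit: A moves one bank closer
s.access("B")   # miss: B enters the furthest bank
```

Repeated hits walk a line toward the closest bank one step at a time, so frequently used lines end up near the controller while rarely used lines stay in, or fall back to, the slower banks.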
![Page 16: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/16.jpg)
Advantages of DNUCA over ML-UCA
DNUCA does not enforce inclusion, which prevents redundant copies of the same line
In ML-UCA the faster level may not match the working set size of an application, either being too large and thus slow, or being too small and thus incurring misses
![Page 17: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/17.jpg)
Configuration for simulation
Used Sim-Alpha and Cacti
Simple mapping
Multicast search
One-bank promotion on each hit
Replacement policy that chooses the block in the slowest bank as the victim of a miss
![Page 18: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/18.jpg)
Hit Rate Distribution for D-NUCA
[Figure: hit rate distribution; x-axis: cache row; y-axis: hit rate, 0 to 0.6]
![Page 19: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/19.jpg)
Simulation results – integer benchmarks
[Figure: UCA vs D-NUCA on SPEC INT 2000; y-axis: IPC, 0 to 1.4; series: uca, d-nuca]
![Page 20: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/20.jpg)
Simulation results – FP benchmarks
[Figure: UCA vs D-NUCA on SPEC FP 2000; y-axis: IPC, 0 to 1.8; series: uca, d-nuca]
![Page 21: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/21.jpg)
Summary
D-NUCA has the following plus points:
Low access latency
Technology scalability
Performance stability
Flattens the memory hierarchy
![Page 22: Non-Uniform Cache Architectures for Wire Delay Dominated Caches](https://reader033.vdocuments.mx/reader033/viewer/2022051621/568147b0550346895db4f202/html5/thumbnails/22.jpg)