
Non-Uniform Cache Architectures for Wire Delay Dominated Caches

Abhishek Desai

Bhavesh Mehta

Devang Sachdev

Gilles Muller

Plan

Motivation
What is NUCA?
UCA and ML-UCA
Static NUCA
Dynamic NUCA
Simulation Results

Motivation

Bigger L2 and L3 caches are needed
– Programs are larger
– SMT requires a large cache for spatial locality
– Bandwidth demands on the package have increased
– Smaller technologies permit more bits per mm²

Wire delays dominate in large caches
– The bulk of the access time is spent routing to and from the banks, not in the bank accesses themselves

What is NUCA?

Data residing closer to the processor is accessed much faster than data residing physically farther from the processor

Example:

The closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
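To make the latency gradient concrete, here is a minimal sketch (not taken from the presentation) that models each bank's access time as a fixed bank latency plus a per-hop routing delay across the on-chip network. The constants and the controller placement are illustrative assumptions, so the printed numbers only show the shape of the gradient, not the 4 and 47 cycles quoted above.

```c
#include <stdio.h>

/* Hypothetical latency model: access time = intrinsic bank access
 * time + a routing delay proportional to the number of network hops
 * between the cache controller and the target bank.  All constants,
 * and the corner placement of the controller, are illustrative
 * assumptions, not the paper's parameters. */
#define BANK_ACCESS_CYCLES 3
#define CYCLES_PER_HOP     2
#define ROWS 16
#define COLS 16                      /* 16 x 16 = 256 banks */

static int access_latency(int row, int col)
{
    int hops = row + col;            /* controller assumed at bank (0, 0) */
    return BANK_ACCESS_CYCLES + CYCLES_PER_HOP * hops;
}

int main(void)
{
    printf("closest bank : %2d cycles\n", access_latency(0, 0));
    printf("farthest bank: %2d cycles\n", access_latency(ROWS - 1, COLS - 1));
    return 0;
}
```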

UCA and ML-UCA

UCA

Avg. access time: 255 cycles
Banks: 1
Size: 16MB
Technology: 50nm

[Diagram: UCA shown as a single monolithic L2 (41-cycle bank); ML-UCA shown as a fast L2 (10 cycles) backed by a slower L3 (41 cycles)]

ML-UCA

Avg. access time: 11/41 cycles
Banks: 8/32
Size: 16MB
Technology: 50nm

Static-NUCA-1

[Diagram: S-NUCA-1 bank array; bank access latencies range from 17 cycles (closest bank) to 41 cycles (farthest bank)]

S-NUCA-1

Avg. access time: 34 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Wire overhead 20.9%

S-NUCA-1 cache design

[Diagram: S-NUCA-1 cache design, showing the tag array, address and data buses, banks and sub-banks, predecoder, sense amplifiers, and wordline drivers and decoders]

Static-NUCA-2

[Diagram: S-NUCA-2 bank array; bank access latencies range from 9 cycles (closest bank) to 32 cycles (farthest bank)]

S-NUCA-2

Avg. access time: 24 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: Channel overhead 5.9%

S-NUCA-2 cache design

[Diagram: S-NUCA-2 cache design, showing the tag array, address bus, data bus, switches, banks, predecoder, sense amplifiers, and wordline drivers and decoders]

Dynamic-NUCA

D-NUCA

Avg. access time: 18 cycles
Banks: 256
Size: 16MB
Technology: 50nm

[Diagram: D-NUCA bank array with data migration; bank access latencies range from 4 cycles (closest bank) to 47 cycles (farthest bank)]

Management of Data in D-NUCA

Mapping:
– How is data mapped to the banks, and in which banks can a datum reside?

Search:
– How is the set of possible locations searched to find a line?

Movement:
– Under what conditions should data be migrated from one bank to another?

Simple Mapping (implemented)

[Diagram: simple mapping with 8 bank sets; each bank set is a column of banks holding ways 1 through 4 of one set, placed at increasing distance from the memory controller]
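A minimal sketch of this simple mapping, under assumed parameters (8 bank sets of 4 ways, 64-byte lines, and a made-up 512 sets per bank; none of these are claimed to be the paper's exact configuration): the low-order index bits of an address choose the bank set, and the line may then reside in any of that column's banks, ordered from closest to farthest from the memory controller.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry, for illustration only: 8 bank sets (columns),
 * 4 ways per set (one bank per way in the column), 64-byte lines,
 * 512 cache sets stored inside each bank. */
#define LINE_BYTES     64
#define NUM_BANK_SETS   8
#define WAYS_PER_SET    4
#define SETS_PER_BANK 512

/* Simple mapping: the low-order index bits pick the bank set. */
static unsigned bank_set_of(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_BANK_SETS);
}

/* Remaining index bits pick the cache set inside each candidate bank. */
static unsigned set_within_bank(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES / NUM_BANK_SETS) % SETS_PER_BANK);
}

int main(void)
{
    uint64_t addr = 0x12345680;
    printf("address 0x%llx -> bank set %u, set index %u; the line may sit in\n"
           "any of the %d banks of that column (way 0 closest, way %d farthest)\n",
           (unsigned long long)addr, bank_set_of(addr), set_within_bank(addr),
           WAYS_PER_SET, WAYS_PER_SET - 1);
    return 0;
}
```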

Fair and Shared Mapping

[Diagram: the fair mapping and shared mapping alternatives, each with its memory controller]

Searching Cached Lines

Incremental search
Multicast search (implemented)
Limited multicast
Partitioned multicast

Smart search:
– ss-performance
– ss-energy
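The sketch below contrasts incremental and multicast search over the banks of one bank set. The data layout, latencies, and cost model are simplified assumptions for illustration: each incremental probe is counted as a full round trip to its bank, and a multicast miss is charged the latency of the slowest bank.

```c
#include <stdbool.h>
#include <stdio.h>

#define WAYS_PER_SET 4

/* One cache set spread over the banks of its bank set; latency[w] is
 * the assumed round-trip time to the bank holding way w. */
typedef struct {
    unsigned long tag[WAYS_PER_SET];
    bool          valid[WAYS_PER_SET];
    int           latency[WAYS_PER_SET];
} bank_set_t;

/* Incremental search: probe banks one at a time, closest first, and
 * stop at the first hit.  Fewer bank lookups, but serialized delay. */
static int incremental_search(const bank_set_t *bs, unsigned long tag, int *cycles)
{
    *cycles = 0;
    for (int w = 0; w < WAYS_PER_SET; w++) {
        *cycles += bs->latency[w];            /* one round trip per probe */
        if (bs->valid[w] && bs->tag[w] == tag)
            return w;
    }
    return -1;                                /* miss in every bank */
}

/* Multicast search (the implemented policy): send the request to all
 * banks of the bank set at once; a hit costs only that bank's latency. */
static int multicast_search(const bank_set_t *bs, unsigned long tag, int *cycles)
{
    int hit = -1;
    *cycles = bs->latency[WAYS_PER_SET - 1];  /* a miss waits for the slowest bank */
    for (int w = 0; w < WAYS_PER_SET; w++) {
        if (bs->valid[w] && bs->tag[w] == tag) {
            hit = w;
            *cycles = bs->latency[w];         /* probes proceed in parallel */
        }
    }
    return hit;
}

int main(void)
{
    bank_set_t bs = {
        .tag     = { 0x10, 0x20, 0x30, 0x40 },
        .valid   = { true, true, true, true },
        .latency = { 4, 9, 15, 21 },
    };
    int c1, c2;
    int w1 = incremental_search(&bs, 0x30, &c1);
    int w2 = multicast_search(&bs, 0x30, &c2);
    printf("incremental: hit way %d in %d cycles\n", w1, c1);
    printf("multicast  : hit way %d in %d cycles\n", w2, c2);
    return 0;
}
```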

Dynamic Movement of Lines

LRU line furthest and MRU line closest
One-bank promotion on a hit (implemented)

Policy on miss:
Which line is evicted?
– Line in the furthest (slowest) bank (implemented)
Where is the new line placed?
– Closest (fastest) bank
– Furthest (slowest) bank (implemented)
What happens to the victim line?
– Zero-copy policy (implemented)
– One-copy policy
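A minimal sketch of the implemented movement policies above: one-bank promotion on a hit, eviction of the line in the slowest bank on a miss, placement of the incoming line in that same slowest bank, and zero-copy handling of the victim. The 4-way bank set and the tag-only line representation are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define WAYS_PER_SET 4   /* way 0 = closest (fastest) bank, way 3 = farthest */

/* Tag-only view of one cache set spread across the banks of a bank set. */
typedef struct {
    unsigned long tag[WAYS_PER_SET];
    bool          valid[WAYS_PER_SET];
} bank_set_t;

/* One-bank promotion (implemented): on a hit in bank w, swap the line
 * with the one in the next-closer bank, so frequently used lines
 * migrate toward the processor one step per hit. */
static void promote_on_hit(bank_set_t *bs, int w)
{
    if (w == 0)
        return;                      /* already in the closest bank */
    unsigned long t = bs->tag[w - 1];
    bool          v = bs->valid[w - 1];
    bs->tag[w - 1]   = bs->tag[w];
    bs->valid[w - 1] = bs->valid[w];
    bs->tag[w]   = t;
    bs->valid[w] = v;
}

/* Miss handling (implemented): evict the line in the farthest (slowest)
 * bank, discard it (zero-copy policy), and place the incoming line in
 * that same farthest bank. */
static void fill_on_miss(bank_set_t *bs, unsigned long new_tag)
{
    int victim = WAYS_PER_SET - 1;
    bs->tag[victim]   = new_tag;     /* old victim line is silently dropped */
    bs->valid[victim] = true;
}

int main(void)
{
    bank_set_t bs = { .tag   = { 0xA, 0xB, 0xC, 0xD },
                      .valid = { true, true, true, true } };
    promote_on_hit(&bs, 2);          /* hit on way 2: 0xC swaps with 0xB */
    fill_on_miss(&bs, 0xE);          /* miss: 0xD evicted, 0xE placed farthest */
    for (int w = 0; w < WAYS_PER_SET; w++)
        printf("way %d: tag 0x%lX\n", w, bs.tag[w]);
    return 0;
}
```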

Advantages of D-NUCA over ML-UCA

D-NUCA does not enforce inclusion, which prevents redundant copies of the same line

In ML-UCA, the faster level may not match an application's working set size: it is either too large (and thus slow) or too small (and thus incurs misses)

Configuration for simulation

Used Sim-Alpha and Cacti
Simple mapping
Multicast search
One-bank promotion on each hit
Replacement policy that chooses the block in the slowest bank as the victim of a miss

Hit Rate Distribution for D-NUCA

[Chart: hit rate distribution for D-NUCA; hit rate (0 to 0.6) plotted per cache row]

Simulation results – integer benchmarks

[Chart: IPC of UCA vs. D-NUCA across SPEC INT 2000 benchmarks; IPC axis from 0 to 1.4]

Simulation results – FP benchmarks

[Chart: IPC of UCA vs. D-NUCA across SPEC FP 2000 benchmarks; IPC axis from 0 to 1.8]

Summary

D-NUCA has the following plus points:
Low access latency
Technology scalability
Performance stability
Flattens the memory hierarchy