University of Utah & HP Labs

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Naveen Muralimanohar

Rajeev Balasubramonian

Norman P Jouppi


Large Caches

Cache hierarchies will dominate chip area

3D stacked processors with an entire die for on-chip cache could be common

Montecito has two private 12 MB L3 caches (27MB including L2)

Long global wires are required to transmit data/address

[Figure: Intel Montecito, with the two cache regions labeled]


Wire Delay/Power

Wire delays are costly for performance and power

Latencies of 60 cycles to reach ends of a chip at 32nm (@ 5 GHz)

50% of dynamic power is in interconnect switching (Magen et al. SLIP 04)

CACTI* access time for 24 MB cache is 90 cycles @ 5GHz, 65nm Tech

*version 4
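
To put the cycle counts above in absolute terms, the following minimal sketch converts them to nanoseconds at the quoted 5 GHz clock. The helper name `cycles_to_ns` is an illustrative assumption and is not part of CACTI.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a cycle count to nanoseconds for a clock frequency given in GHz."""
    return cycles / freq_ghz


# Figures quoted on the slide, both at 5 GHz.
cross_chip_ns = cycles_to_ns(60, 5.0)   # reaching the ends of the chip at 32nm
cache_24mb_ns = cycles_to_ns(90, 5.0)   # CACTI 4 access time for a 24 MB cache

print(f"cross-chip wire latency: {cross_chip_ns} ns")
print(f"24 MB cache access time: {cache_24mb_ns} ns")
```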


Contribution

Support for various interconnect models

Improved design space exploration

Support for modeling Non-Uniform Cache Access (NUCA)


Cache Design Basics

[Figure: cache block diagram. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output drivers, data output, valid output?]


Existing Model - CACTI

[Figure: two cache models, one with 4 sub-arrays and one with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]

Decoder delay = H-tree delay + logic delay


Power/Delay Overhead of Wires

[Figure: H-tree delay percentage and H-tree power percentage (0% to 70%) versus cache size (2, 4, 8, 16, 32 MB)]

H-tree delay increases with cache size

H-tree power continues to dominate

Bitlines are the other major contributors to total power


Motivation

The dominant role of interconnect is clear

The lack of a tool to model interconnect in detail can impede progress

Current solutions have limited wire options

Orion, CACTI

- Weak wire model

- No support for modeling multi-megabyte caches


CACTI 6.0 Enhancements

Incorporation of

Different wire models

Different router models

Grid topology for NUCA

Shared bus for UCA

Contention values for various cache configurations

Methodology to compute optimal NUCA organization

Improved interface that enables trade-off analysis

Validation analysis


Full-swing Wires


[Figure: full-swing wire diagram with points labeled X, Y, Z]


Full-swing Wires II


[Figure: plot against repeater size, with design points marked at 10%, 20%, and 30% delay penalty]

Three different design points

Caveat: Repeater sizing and spacing cannot be controlled precisely all the time


Full-Swing Wires

+ Fast and simple: delay proportional to sqrt(RC) as against RC

+ High bandwidth: can be pipelined

- Requires silicon area

- High energy: quadratic dependence on voltage
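
One common way to see the delay benefit of repeaters is through a simplified Elmore-delay model: an unrepeated distributed RC wire has delay that grows with the square of its length, while the same wire split into repeated segments has delay that grows linearly. The sketch below illustrates that contrast. All parameter values, the fixed repeater size, and the function names are illustrative assumptions, not CACTI 6.0's wire model.

```python
import math


def unrepeated_delay(r, c, length):
    """Elmore delay of a distributed RC wire; grows with length squared."""
    return 0.38 * r * c * length ** 2


def optimal_segment(r, c, r_drv, c_gate):
    """Segment length that minimizes delay per unit length for a fixed repeater."""
    return math.sqrt(0.69 * r_drv * c_gate / (0.38 * r * c))


def repeated_delay(r, c, length, r_drv, c_gate):
    """Delay of the wire when a repeater is inserted every optimal segment."""
    seg = optimal_segment(r, c, r_drv, c_gate)
    per_segment = (0.69 * r_drv * (c * seg + c_gate)   # repeater driving wire + next gate
                   + 0.38 * r * c * seg ** 2           # distributed wire delay
                   + 0.69 * r * seg * c_gate)          # wire resistance charging next gate
    return (length / seg) * per_segment


# Hypothetical electrical parameters (per mm of wire and per repeater).
R_WIRE = 1000.0     # ohm / mm
C_WIRE = 0.2e-12    # farad / mm
R_DRV = 1000.0      # ohm, repeater output resistance
C_GATE = 10e-15     # farad, repeater input capacitance

for length_mm in (1, 2, 4, 8):
    plain = unrepeated_delay(R_WIRE, C_WIRE, length_mm)
    rep = repeated_delay(R_WIRE, C_WIRE, length_mm, R_DRV, C_GATE)
    print(f"{length_mm} mm: unrepeated {plain * 1e12:.0f} ps, repeated {rep * 1e12:.0f} ps")
```

With these assumed numbers the repeated wire only wins for longer wires, which is consistent with repeaters being used on long global wires. The silicon area and energy cost of the repeaters is not modeled here.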


Low-swing wires


[Figure: differential low-swing wires; labels: 400mV, 50mV raise, 50mV drop]


Differential Low-swing

+ Very low-power, can be routed over other modules

- Relatively slow, low-bandwidth, high area requirement, requires special transmitter and receiver

Bitlines are a form of low-swing wire

Optimized for speed and area as against power

Driver and pre-charger employ full Vdd voltage
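
The power advantage of low-swing signaling follows from how much charge is drawn from the supply per transition. The sketch below is a rough estimate under stated assumptions: the full-swing wire is charged from 0 to Vdd, while each wire of the differential pair moves by a small swing from a reduced rail. Treating the 400mV and 50mV labels in the low-swing figure as the rail and the swing, the 1.0 V Vdd, the 1 pF wire capacitance, and the function names are all assumptions made for illustration.

```python
def full_swing_energy(c_wire, vdd):
    """Energy drawn from the supply to charge one wire from 0 to Vdd."""
    return c_wire * vdd ** 2


def low_swing_energy(c_wire, v_rail, v_swing, n_wires=2):
    """Energy drawn from a v_rail supply to move each wire of a pair by v_swing."""
    return n_wires * c_wire * v_rail * v_swing


C_WIRE = 1e-12   # farad, hypothetical total wire capacitance
VDD = 1.0        # volt, assumed full-swing supply

e_full = full_swing_energy(C_WIRE, VDD)
e_low = low_swing_energy(C_WIRE, v_rail=0.4, v_swing=0.05)

print(f"full-swing: {e_full:.2e} J, differential low-swing: {e_low:.2e} J")
print(f"halving Vdd cuts full-swing energy by {e_full / full_swing_energy(C_WIRE, VDD / 2):.0f}x")
```

The transmitter and receiver circuits that low-swing wires need also consume energy; that overhead is left out of this estimate.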


Delay Characteristics


Quadratic increase in delay


Energy Characteristics


Search Space of CACTI-5


Design space with global wires optimized for delay


Search Space of CACTI-6


Design space with global and low-swing wires

[Figure: design-space plot; series: least delay, 30% delay penalty, low-swing]
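
A wider design space is only useful if the tool can pick a point from it. The sketch below shows one plausible selection rule: among candidate designs, take the lowest-energy one whose delay stays within an allowed penalty of the fastest design. The `Design` record, the `pick_low_energy` helper, and every number in `candidates` are hypothetical; this is not CACTI 6.0's actual interface or data.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    delay_ns: float
    energy_nj: float


def pick_low_energy(designs, max_delay_penalty):
    """Lowest-energy design whose delay is within the penalty of the fastest one."""
    best_delay = min(d.delay_ns for d in designs)
    limit = best_delay * (1.0 + max_delay_penalty)
    allowed = [d for d in designs if d.delay_ns <= limit]
    return min(allowed, key=lambda d: d.energy_nj)


# Hypothetical design points, one per wire choice named on the slide.
candidates = [
    Design("least-delay wires", 10.0, 9.0),
    Design("30% delay-penalty wires", 12.5, 5.0),
    Design("low-swing wires", 20.0, 2.0),
]

for penalty in (0.0, 0.3, 1.0):
    choice = pick_low_energy(candidates, penalty)
    print(f"allowed delay penalty {penalty:.0%}: {choice.name}")
```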


CACTI – Another Limitation

Access delay is equal to the delay of slowest sub-array

Very high hit time for large caches

Employs a separate bus for each cache bank for multi-banked caches

Not scalable

Exploit different wire types and network design choices to improve the search space

Potential solution – NUCA

Extend CACTI to model NUCA


Non-Uniform Cache Access (NUCA)*

Large cache is broken into a number of small banks

Employs on-chip network for communication

Access delay depends on the distance between the bank and the cache controller

[Figure: CPU & L1 and an array of cache banks]

*(Kim et al. ASPLOS 02)
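
The contrast with a uniform-access cache can be made concrete with a small sketch: each bank's latency is its own access time plus a per-hop network cost times its distance from the controller, whereas a uniform design pays the worst-case latency on every access. The grid shape, the controller position at a grid corner, and the cycle counts are hypothetical values chosen for illustration.

```python
def bank_latencies(rows, cols, bank_cycles, hop_cycles, ctrl=(0, 0)):
    """Latency in cycles of every bank in a rows x cols grid, seen from the controller."""
    return [[bank_cycles + hop_cycles * (abs(r - ctrl[0]) + abs(c - ctrl[1]))
             for c in range(cols)]
            for r in range(rows)]


# Hypothetical 4x4 bank grid: 3-cycle bank access, 4 cycles per network hop.
latency = bank_latencies(4, 4, bank_cycles=3, hop_cycles=4)
flat = [cycles for row in latency for cycles in row]

nearest = min(flat)
farthest = max(flat)               # what a uniform-access design would pay every time
average = sum(flat) / len(flat)    # NUCA average, assuming evenly spread accesses

for row in latency:
    print(row)
print(f"nearest {nearest}, farthest {farthest}, average {average}")
```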


Extension to CACTI

On-chip network

Wire model based on ITRS 2005 parameters

Grid network

3-stage speculative router pipeline

Network latency vs Bank access latency tradeoff

Iterate over different bank sizes

Calculate the average network delay based on the number of banks and bank sizes

Consider contention values for different cache configurations

Similarly we also consider power consumed for each organization
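
The search described above can be sketched as a loop over bank counts that sums bank access latency, average network latency, and contention, then keeps the minimum. Only the 3-stage router pipeline and the grid network come from the slide. The bank and link latency formulas, the contention table, the corner placement of the controller, and all function names are hypothetical stand-ins for the models CACTI 6.0 actually uses.

```python
import math

ROUTER_STAGES = 3  # 3-stage speculative router pipeline, as listed on the slide

# Hypothetical contention cycles per bank count; real values would come from
# simulating each cache configuration.
CONTENTION = {2: 200, 4: 100, 8: 50, 16: 20, 32: 10, 64: 5}


def grid_dims(n_banks):
    """Arrange a power-of-two bank count as a near-square grid (rows <= cols)."""
    rows = 2 ** ((n_banks.bit_length() - 1) // 2)
    return rows, n_banks // rows


def avg_hops(rows, cols):
    """Average Manhattan distance from a controller assumed at grid corner (0, 0)."""
    total = sum(r + c for r in range(rows) for c in range(cols))
    return total / (rows * cols)


def bank_access_cycles(bank_mb):
    """Hypothetical bank model: latency grows with the square root of capacity."""
    return 3 + 10 * math.sqrt(bank_mb)


def link_cycles(bank_mb):
    """Hypothetical link model: larger banks mean longer links between routers."""
    return max(1, math.ceil(2 * math.sqrt(bank_mb)))


def total_cycles(cache_mb, n_banks):
    """Bank access + average network latency + contention for one organization."""
    bank_mb = cache_mb / n_banks
    rows, cols = grid_dims(n_banks)
    network = avg_hops(rows, cols) * (ROUTER_STAGES + link_cycles(bank_mb))
    return bank_access_cycles(bank_mb) + network + CONTENTION[n_banks]


results = {n: total_cycles(32, n) for n in CONTENTION}
best = min(results, key=results.get)

for n, cycles in results.items():
    print(f"{n:2d} banks: {cycles:6.1f} cycles")
print(f"lowest total latency at {best} banks")
```

The same loop could score energy instead of cycles by swapping in per-bank and per-hop energy terms, which matches the slide's note that power is considered for each organization as well.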


Trade-off Analysis (32 MB Cache)

[Figure: latency (cycles, 0 to 400) versus number of banks (2, 4, 8, 16, 32, 64) for a 16-core CMP; series: total number of cycles, network latency, bank access latency, network contention cycles]


Effect of Core Count

[Figure: contention cycles (0 to 300) versus bank count (2, 4, 8, 16, 32, 64); series: 16-core, 8-core, 4-core]


Power Centric Design (32MB Cache)

[Figure: energy (J, 0 to 1.E-08) versus bank count (2, 4, 8, 16, 32, 64); series: total energy, bank energy, network energy; the power optimal point is marked]


Validation

HSPICE tool

Predictive Technology Model (65nm tech.)

Analytical model that employs PTM parameters compared against HSPICE

Distributed wordlines, bitlines, low-swing transmitters, wires, receivers

Verified to be within 12%


Case Study: Heterogeneous D-NUCA

Dynamic-NUCA

Reduces access time by dynamic data movement

Near-by banks are accessed more frequently

Heterogeneous Banks

Near-by banks are made smaller and hence faster

Accesses to nearby banks consume less power

Other banks can be made larger and more power efficient
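
The argument for heterogeneous banks is a weighted-average one: if most requests hit nearby banks, speeding those up lowers the expected latency even when distant banks get slower. The sketch below works through that arithmetic. The access fractions and latencies are invented for illustration, and the heterogeneous case deliberately assumes the smaller near bank captures a smaller share of requests.

```python
def expected_latency(banks):
    """Expected access latency for a list of (access_fraction, latency_cycles) pairs."""
    total_fraction = sum(fraction for fraction, _ in banks)
    assert abs(total_fraction - 1.0) < 1e-9, "access fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in banks)


# Hypothetical near / middle / far bank groups.
uniform_banks = [(0.70, 12), (0.20, 20), (0.10, 28)]   # equal-size banks
hetero_banks = [(0.60, 8), (0.28, 20), (0.12, 32)]     # small fast near bank, large far banks

print(f"uniform banks:       {expected_latency(uniform_banks):.2f} cycles")
print(f"heterogeneous banks: {expected_latency(hetero_banks):.2f} cycles")
```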


Access Frequency

[Figure: % request satisfied by x KB of cache (y axis 0.00% to 120.00%); x axis ticks from 32,768 to 32,800,768 in steps of 3,276,800]


Few Heterogeneous Organizations Considered by CACTI


[Figure: two heterogeneous organizations, labeled Model 1 and Model 2]


Other Applications

Exposing wire properties

Novel cache pipelining

Early lookup, Aggressive lookup (ISCA 07)

Flit-reservation flow control (Peh et al., HPCA 00)

Novel topologies

Hybrid network (ISCA 07)


Conclusion

Network parameters and contention play a critical role in deciding NUCA organization

Wire choices have significant impact on cache properties

CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%

http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html

http://www.cs.utah.edu/~rajeev/cacti6/