
Page 1: The Effect of Interconnect Design on the Performance of Large L2 Caches

Naveen Muralimanohar, Rajeev Balasubramonian
University of Utah

Page 2: Motivation: Large Caches

Future processors will have large on-chip caches
- Intel Montecito has 24 MB of on-chip cache

Wire delay dominates in large caches
- A conventional design can lead to very high hit time (the CACTI access time for a 24 MB cache is 90 cycles at 5 GHz, 65 nm technology)

Careful network choices
- Improve access time
- Open room for several other optimizations
- Reduce power significantly

Page 3: Effect of L2 Hit Time

[Chart: IPC improvement (0-50%) per SPEC benchmark: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise]

Increase in IPC due to reduction in L2 access time
8-issue, out-of-order processor (L2 hit time reduced from 30 to 15 cycles)
Average improvement = 17%

Page 4: Cache Design

[Figure: cache organization for an input address -- decoder, wordlines, bitlines, tag array and data array, column muxes, sense amps, comparators, mux drivers, and output drivers producing the valid-output signal and the data output]

Page 5: Existing Model - CACTI

[Figure: decoder delay and wordline & bitline delay for a cache model with 4 sub-arrays vs. a cache model with 16 sub-arrays]

Page 6: Shortcomings of CACTI

- Suboptimal for large cache sizes: the access delay equals the delay of the slowest sub-array, leading to very high hit times for large caches
- Employs a separate bus for each cache bank in multi-banked caches

Page 7: Non-Uniform Cache Access (NUCA)

- The large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay depends on the distance between the bank and the cache controller (a latency sketch follows below)

[Figure: CPU & L1 connected to a grid of cache banks]
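As a minimal sketch (not from the talk) of the distance-dependent latency idea above, the snippet below estimates a bank's access latency as its internal access time plus a per-hop cost times the Manhattan distance from the cache controller. The grid dimensions, cycle counts, and controller placement are hypothetical assumptions.

```cpp
// Sketch: NUCA access latency = bank access time + hops * per-hop cost,
// assuming a grid of banks with the cache controller at the top-left corner.
#include <iostream>

struct NucaGrid {
    int rows, cols;          // bank array dimensions (hypothetical)
    int bank_access_cycles;  // delay inside one bank
    int link_cycles;         // per-hop link latency (one cycle in Kim et al.'s design)
    int router_cycles;       // per-hop router overhead (assumed)

    // Manhattan distance from the controller at (0,0) to bank (r,c),
    // times the per-hop cost, plus the bank's internal access time.
    int access_latency(int r, int c) const {
        int hops = r + c;
        return bank_access_cycles + hops * (link_cycles + router_cycles);
    }
};

int main() {
    NucaGrid grid{8, 8, /*bank*/ 3, /*link*/ 1, /*router*/ 2};  // illustrative numbers
    std::cout << "nearest bank:  " << grid.access_latency(0, 0) << " cycles\n";
    std::cout << "farthest bank: " << grid.access_latency(7, 7) << " cycles\n";
    return 0;
}
```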

Page 8: Shortcomings of NUCA

- Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS '02)
- Increased routing complexity
- Dissipates more power

Page 9: Extension to CACTI

- On-chip network: the wire model uses ITRS 2005 parameters
- Grid network: number of rows = number of columns (or half the number of columns)
- Network latency vs. bank access latency tradeoff: the exhaustive search is modified to include the network overhead (see the sketch below)
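The sketch below illustrates the kind of bank-count sweep described above: for each candidate organization it adds bank access time and average network delay and keeps the minimum. The two delay functions are placeholders with illustrative constants, not CACTI's actual analytical models.

```cpp
// Sketch: exhaustive sweep over bank counts, minimizing
// bank access time + average network delay (placeholder models).
#include <cmath>
#include <cstdio>

// Hypothetical: smaller banks have shorter word/bit lines, so access time shrinks.
double bank_access_cycles(int bank_count) {
    return 3.0 + 40.0 / std::sqrt(static_cast<double>(bank_count));
}

// Hypothetical: average hop count in a grid grows roughly with sqrt(bank_count).
double avg_network_cycles(int bank_count) {
    double hops = std::sqrt(static_cast<double>(bank_count));  // ~average path length
    const double per_hop = 3.0;                                // link + router cycles (assumed)
    return hops * per_hop;
}

int main() {
    double best = 1e9;
    int best_banks = 0;
    for (int banks = 2; banks <= 4096; banks *= 2) {
        double total = bank_access_cycles(banks) + avg_network_cycles(banks);
        std::printf("%5d banks: %.1f cycles\n", banks, total);
        if (total < best) { best = total; best_banks = banks; }
    }
    std::printf("delay-optimal point: %d banks (%.1f cycles)\n", best_banks, best);
    return 0;
}
```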

Page 10: Effect of Network Delay (32 MB cache)

[Chart: cycles at 5 GHz (0-140) vs. bank count (2 to 4096), plotting bank access time, average network delay, and average cache access latency with global wires; the delay-optimal point lies at an intermediate bank count]

Page 11: Outline

- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results

Page 12: Wire Characteristics

Wire resistance and capacitance per unit length:

$$R_{wire} = \frac{\rho}{(thickness - barrier)(width - 2 \cdot barrier)}$$

$$C_{wire} = \epsilon_0 \left( \frac{2 K \epsilon_{horiz} \cdot thickness}{spacing} + \frac{2 \epsilon_{vert} \cdot width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$$

[Figure: effect of wire width and spacing on resistance, capacitance, and bandwidth; a numerical sketch follows below]
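The snippet below evaluates the per-unit-length model above numerically. The formulas follow the slide; the 65 nm geometry, material constants, and fringe value are illustrative assumptions, not the values used in the talk.

```cpp
// Sketch: per-unit-length wire R, C, and RC product from the slide's model,
// with assumed (illustrative) 65 nm global-wire parameters.
#include <cstdio>

int main() {
    // Geometry (metres) -- illustrative values, not the talk's.
    const double width = 450e-9, spacing = 450e-9, thickness = 1200e-9;
    const double layerspacing = 300e-9, barrier = 10e-9;
    // Material / model constants (assumed).
    const double rho = 2.2e-8;       // Cu resistivity (ohm*m)
    const double eps0 = 8.85e-12;    // vacuum permittivity (F/m)
    const double K = 2.2;            // Miller-effect coupling factor (assumed)
    const double eps_horiz = 2.7, eps_vert = 2.7;  // relative permittivities (assumed)
    const double fringe = 40e-12;    // fringing capacitance (F/m, assumed)

    // R_wire = rho / ((thickness - barrier) * (width - 2*barrier))        [ohm/m]
    double r_wire = rho / ((thickness - barrier) * (width - 2 * barrier));
    // C_wire = eps0 * (2*K*eps_horiz*thickness/spacing
    //                  + 2*eps_vert*width/layerspacing) + fringe          [F/m]
    double c_wire = eps0 * (2 * K * eps_horiz * thickness / spacing
                            + 2 * eps_vert * width / layerspacing) + fringe;

    std::printf("R = %.1f ohm/mm, C = %.1f fF/mm, RC = %.2f ps/mm^2\n",
                r_wire * 1e-3, c_wire * 1e-3 * 1e15, r_wire * c_wire * 1e-6 * 1e12);
    return 0;
}
```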

Page 13: Design Space Exploration - Tuning Wire Width and Spacing

- Base case: B-wires
- L-wires: wider wires with larger spacing -- fast but low bandwidth

[Figure: increasing width & spacing lowers delay at the cost of bandwidth]

Page 14: Design Space Exploration - Tuning Repeater Size and Spacing

- Traditional wires: large repeaters, optimum spacing
- Power-optimal wires: smaller repeaters, increased spacing

[Figure: delay vs. power tradeoff for repeater sizing and spacing]

Page 15: Design Space Exploration - Wire Types

Wire type                        Plane   Latency   Power   Area
B-wires (base case)              8x      1x        1x      1x
W-wires (base case)              4x      1.6x      0.9x    0.5x
PW-wires (power optimized)       4x      3.2x      0.3x    0.5x
L-wires (fast, low bandwidth)    8x      0.5x      0.5x    5x

Page 16: Access Time for Different Link Types

Bank    Bank Access      Avg Access Time (cycles)
Count   Time (cycles)    8x-wires   4x-wires   L-wires
16      17               46         75         21
32      9                40         71         15
64      6                38         63         14
128     5                44         68         17
256     4                51         83         20
512     3                82         113        27
1024    3                100        133        35
2048    3                99         162        51
4096    3                131        196        67

Page 17: Outline

- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results

Page 18: Cache Look-Up

Total cache access time:
- Network delay (requires 6-8 bits of the address to identify the cache bank)
- Bank access: decoder, wordline, and bitline delay (requires 10-15 bits of the address)
- Bank access: comparator and output driver delay (requires the remaining address bits for the tag match)

The entire access happens in a sequential manner (see the sketch below).
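The sketch below adds up the stages of this baseline, fully sequential access: each stage waits for the previous one to finish. The stage names mirror the slide; the cycle counts are hypothetical.

```cpp
// Sketch: baseline sequential cache look-up -- every stage is serialized.
#include <cstdio>

struct AccessTiming {
    int network_addr;  // route the request to the target bank (needs 6-8 address bits)
    int dec_wl_bl;     // decoder + wordline + bitline (needs 10-15 address bits)
    int cmp_drv;       // comparator + output driver (needs the remaining tag bits)
    int network_data;  // route the data back to the cache controller
};

int serial_lookup(const AccessTiming& t) {
    return t.network_addr + t.dec_wl_bl + t.cmp_drv + t.network_data;
}

int main() {
    AccessTiming t{12, 4, 2, 12};  // hypothetical cycle counts
    std::printf("serial access: %d cycles\n", serial_lookup(t));
    return 0;
}
```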

Page 19: Early Look-Up

- Send the partial address (the 10-15 index bits) on L-wires
- Initiate the bank lookup
- Wait for the complete address
- Complete the access (tag match)

We can hide 60-70% of the bank access delay (see the sketch below).
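The sketch below shows the overlap that early look-up exploits: the index bits arrive early on the fast L-wires and start the decoder/wordline/bitline stage while the full address follows on the slower B-wires, and the tag match starts once both are available. Cycle counts are illustrative assumptions, not measurements from the talk.

```cpp
// Sketch: early look-up overlaps the bank read with delivery of the full address.
#include <algorithm>
#include <cstdio>

int main() {
    // Hypothetical cycle counts.
    const int l_wire_net = 6;   // partial address on L-wires (fast, narrow)
    const int b_wire_net = 12;  // complete address on B-wires
    const int dec_wl_bl = 4;    // decoder + wordline + bitline
    const int cmp_drv = 2;      // tag compare + output driver
    const int data_back = 12;   // data return on B-wires

    int bank_ready = l_wire_net + dec_wl_bl;  // tag/data bits read out of the bank
    int addr_ready = b_wire_net;              // full tag available at the bank
    int early = std::max(bank_ready, addr_ready) + cmp_drv + data_back;
    int serial = b_wire_net + dec_wl_bl + cmp_drv + data_back;
    std::printf("serial: %d cycles, early look-up: %d cycles\n", serial, early);
    return 0;
}
```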

Page 20: Aggressive Look-Up

- Send partial address bits on L-wires
- Do the early look-up and a partial tag match (requires an additional 8 bits of address)
- Aggressively send all matched blocks; the full tag match is done at the cache controller
- Network delay (for the address transfer) is reduced

Page 21: Aggressive Look-Up

Pros:
- Significant reduction in network delay (for the address transfer)
- Increase in traffic due to false matches is less than 1%
- Marginal increase in link overhead: an additional 8 bits of L-wires compared to early look-up

Cons:
- Adds complexity to the cache controller: needs logic to do the tag match (a sketch follows below)
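The snippet below is a minimal sketch of the partial tag match idea: the bank compares only the 8 low-order tag bits that arrived on the L-wires, forwards every matching way (which may include false matches), and the cache controller completes the full tag compare. The data layout and tag values are hypothetical.

```cpp
// Sketch: partial tag match at the bank, full tag match at the cache controller.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Way { uint32_t tag; int way_id; };

// At the bank: keep every way whose low 8 tag bits match (may include false hits).
std::vector<Way> partial_match(const std::vector<Way>& set, uint32_t partial_tag) {
    std::vector<Way> candidates;
    for (const Way& w : set)
        if ((w.tag & 0xFFu) == (partial_tag & 0xFFu))
            candidates.push_back(w);
    return candidates;
}

int main() {
    // A few ways of one set (illustrative tags): two share the low 8 bits.
    std::vector<Way> set = {{0x1A2B4Cu, 0}, {0x33FF4Cu, 1}, {0x77AA01u, 2}};
    uint32_t full_tag = 0x1A2B4Cu;

    auto candidates = partial_match(set, full_tag);  // done at the bank
    for (const Way& w : candidates)                  // full compare at the controller
        std::printf("way %d: %s\n", w.way_id,
                    w.tag == full_tag ? "real hit" : "false match, dropped");
    return 0;
}
```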

Page 22: Outline

- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results

Page 23: Experimental Setup

- SimpleScalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative, on-chip L2 cache (S-NUCA organization)
- 32 KB I-cache and 32 KB D-cache with a hit latency of 3 cycles
- Main memory latency of 300 cycles

Page 24: Cache Models

Model   Bank Access (cycles)   Bank Count   Network Link   Description
1       3                      512          B-wires        Based on prior work
2       6                      64           B-wires        CACTI-L2
3       6                      64           B & L-wires    Early lookup
4       6                      64           B & L-wires    Aggressive lookup
5       6                      64           B & L-wires    Upper bound

Page 25: Performance Results (Global Wires)

- Model 2 (CACTI-L2): average performance improvement of 11%; 16.3% for L2-latency-sensitive benchmarks
- Model 3 (Early lookup): average performance improvement of 14.4%; 21.6% for L2-latency-sensitive benchmarks
- Model 4 (Aggressive lookup): average performance improvement of 17.6%; 26.6% for L2-latency-sensitive benchmarks
- Model 6 (L-network): average performance improvement of 11.4%; 16.2% for L2-latency-sensitive benchmarks

[Chart: performance of Models 1-6, for all benchmarks and for latency-sensitive benchmarks]

Page 26: Performance Results (4x Wires)

- Wire-delay-constrained model, so the performance improvements are larger
- Early lookup performs 5% better
- Aggressive model performs 28% better

[Chart: IPC normalized to Model 1 for Models 1-5, for all benchmarks and for latency-sensitive benchmarks]

Page 27: Future Work

- Heterogeneous network in a CMP environment
- Hybrid network: employs a combination of point-to-point links and a bus for L-messages
- Effective use of L-wires: latency/bandwidth tradeoff
- Use of heterogeneous wires in a D-NUCA environment
- Cache design focusing on power: prefetching and writebacks over power-optimized wires

Page 28: Conclusion

- Traditional design approaches for large caches are sub-optimal
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes the network overhead, performs 16.3% better than previous models
- A heterogeneous network has the potential to further improve performance: early lookup by 21.6%, aggressive lookup by 26.6%