TRANSCRIPT
Wire Aware Architecture
Naveen Muralimanohar
Advisor – Rajeev Balasubramonian
University of Utah
Naveen Muralimanohar University of Utah
Effect of Technology Scaling
Power wall, temperature wall, reliability issues
Process variation, soft errors
Wire scaling: communication is expensive but computation is cheap
Wire Delay – Compelling Opportunity
Existing proposals are indirect:
Hide wire delay (prefetching, speculative coherence, run-ahead execution)
Reduce communication to save power
Wire-level optimizations are still limited to circuit designers
Thesis Statement
“The growing cost of on-chip wire delay requires a thorough understanding of wires. The dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management.”
Wire Delay/Power
Pentium 4 (@ 90nm) spent two cycles to send a signal across the chip
Wire delays are costly for performance and power
Latencies of 60 cycles to reach the ends of a chip at 32nm (@ 5 GHz)
50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
Large Caches
Cache hierarchies will dominate chip area
Montecito has two private 12 MB L3 caches (27 MB including L2)
Long global wires are required to transmit data/address
[Figure: Intel Montecito die photo with the two caches marked]
On-Chip Cache Challenges
[Chart: relative cache access time for 4 MB, 16 MB, and 64 MB caches: ~1X at a 130nm process, ~1.5X at 65nm, and ~2X at 32nm; access times calculated using CACTI]
Effect of L2 Hit Time
[Chart: IPC improvement (0–50%) for the SPEC CPU2000 benchmarks ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, and wupwise]
Increase in IPC due to reduction in L2 access time
An aggressive out-of-order processor (L2 hit time reduced from 30 to 15 cycles)
Avg = 17%
Coherence Traffic
CMPs have already become ubiquitous and require coherence among multiple cores
Coherence operations entail frequent communication
+ Different coherence messages have different latency and bandwidth needs
[Figure: three cores with private L1s over a shared L2; read-miss messages (read request, forward of the read request to the owner, latest copy) and write-miss messages (exclusive request, invalidate request, invalidate ack)]
L1 Accesses
Highly latency critical in aggressive out-of-order processors (such as a clustered processor)
The choice of inter-cluster communication fabric has a high impact on performance
On-chip Traffic
[Figure: 16 clusters (P0–P15, each with an I- and D-cache), two controllers, and L2 banks, illustrating the three traffic types: cache reads and writes, coherence transactions, and L1 accesses]
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Wire Characteristics
Wire resistance and capacitance per unit length:

R_wire = ρ / ((thickness − barrier) × (width − 2 × barrier))

C_wire = ε0 × (2K × thickness / spacing + 2 × width / layerspacing) + fringe(ε_horiz, ε_vert)

[Figure: increasing width lowers resistance and increasing spacing lowers capacitance; both come at the cost of bandwidth]
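The resistance and capacitance expressions above can be turned into a small calculator. The geometry and material values below are illustrative assumptions (not the dissertation's exact 65nm parameters); the point is only the shape of the trade-off:

```python
# Per-unit-length wire model sketch; parameter values are hypothetical.
EPS0 = 8.854e-12      # vacuum permittivity, F/m
RHO = 2.2e-8          # copper resistivity, ohm*m (illustrative)

def wire_resistance_per_m(width, thickness, barrier, rho=RHO):
    """R_wire = rho / ((thickness - barrier) * (width - 2*barrier))"""
    return rho / ((thickness - barrier) * (width - 2.0 * barrier))

def wire_capacitance_per_m(width, spacing, thickness, layer_spacing,
                           k=2.7, fringe=0.0):
    """C_wire = eps0*(2K*thickness/spacing + 2*width/layer_spacing) + fringe"""
    return EPS0 * (2.0 * k * thickness / spacing
                   + 2.0 * width / layer_spacing) + fringe

# Doubling width and spacing (an L-wire-like geometry) lowers the RC
# product: a faster but lower-bandwidth wire.
base_r = wire_resistance_per_m(width=135e-9, thickness=270e-9, barrier=10e-9)
base_c = wire_capacitance_per_m(width=135e-9, spacing=135e-9,
                                thickness=270e-9, layer_spacing=270e-9)
fast_r = wire_resistance_per_m(width=270e-9, thickness=270e-9, barrier=10e-9)
fast_c = wire_capacitance_per_m(width=270e-9, spacing=270e-9,
                                thickness=270e-9, layer_spacing=270e-9)
print(base_r * base_c > fast_r * fast_c)   # wider/sparser wire has lower RC
```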
Design Space Exploration
Tuning wire width and spacing:
Base case: B-wires
Fast but low-bandwidth: L-wires (increased width and spacing trade bandwidth for delay)
Design Space Exploration
Tuning repeater size and spacing:
Traditional wires: large repeaters, delay-optimal spacing
Power-optimal wires: smaller repeaters, increased spacing
ED Trade-off in a Repeated Wire
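The trade-off this slide plots can be reproduced numerically: sweep repeater count and size over an Elmore-style delay model, then compare the delay-optimal design with the energy-delay-optimal one. All device and wire constants below are illustrative assumptions, not measured 65nm values:

```python
# Energy-delay trade-off sketch for a repeated wire (hypothetical constants).
Rd = 2000.0     # driver resistance of a min-size repeater, ohm
Cg = 1e-15      # input capacitance of a min-size repeater, F
Rw = 3e5        # wire resistance per meter, ohm/m
Cw = 2e-10      # wire capacitance per meter, F/m
LENGTH = 5e-3   # 5 mm global wire

def delay_energy(n, h):
    """Elmore-style delay and switched capacitance for a wire split into
    n segments, each driven by a repeater of size h (relative to min)."""
    rseg, cseg = Rw * LENGTH / n, Cw * LENGTH / n
    seg = 0.7 * (Rd / h) * (cseg + h * Cg) + rseg * (0.4 * cseg + 0.7 * h * Cg)
    delay = n * seg
    energy = n * h * Cg + Cw * LENGTH   # switched capacitance as energy proxy
    return delay, energy

# Sweep repeater count/size; find the delay-optimal and ED-optimal points.
sweep = [(n, h) for n in range(1, 60) for h in range(1, 200)]
results = {cfg: delay_energy(*cfg) for cfg in sweep}
d_opt = min(sweep, key=lambda cfg: results[cfg][0])
ed_opt = min(sweep, key=lambda cfg: results[cfg][0] * results[cfg][1])
print(d_opt, ed_opt)   # ED-optimal uses fewer/smaller repeaters
```

By construction the ED-optimal wire is no faster than the delay-optimal one but burns less energy, which is exactly the "power-optimal wires: smaller repeaters, increased spacing" point of the previous slide.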
Design Space Exploration
Base case: B-wires (latency 1x, power 1x, area 1x)
Bandwidth-optimized: W-wires (latency 1.6x, power 0.9x, area 0.5x)
Power- and bandwidth-optimized: PW-wires (latency 3.2x, power 0.3x, area 0.5x)
Fast, low-bandwidth: L-wires (latency 0.5x, power 0.5x, area 4x)
Wire Model
[Figure: distributed RC wire model with repeaters, showing side-wall capacitance (C_side-wall) and capacitance to adjacent wires (C_adj)]

Wire Type    Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire 8x    1x                 1x              2.65            1x
B-Wire 4x    1.6x               0.5x            2.9             1.13x
L-Wire 8x    0.5x               4x              1.46            0.55x
PW-Wire 4x   3.2x               0.5x            0.87            0.3x
Ref: Banerjee et al.
65nm process, 10 Metal Layers – 4 in 1X and 2 in each 2X, 4X and 8X plane
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Cache Design Basics
[Figure: cache bank organization; the input address feeds a decoder, wordlines and bitlines access the tag array and data array, column muxes, sense amps, comparators, and mux drivers steer the result, and output drivers produce the valid signal and the data output]
Existing Model - CACTI
[Figure: cache model with 4 sub-arrays vs. 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay
CACTI Shortcomings
Access delay equals the delay of the slowest sub-array, giving very high hit times for large caches
Employs a separate bus for each cache bank in multi-banked caches, which is not scalable
Exploit different wire types and network design choices to reduce access latency
Potential solution: NUCA
Extend CACTI to model NUCA
Non-Uniform Cache Access (NUCA)*
A large cache is broken into a number of small banks
Employs an on-chip network for communication
Access delay depends on the distance between the bank and the cache controller
[Figure: CPU & L1 alongside an array of cache banks]
*(Kim et al., ASPLOS 02)
Extension to CACTI
On-chip network: wire model based on ITRS 2005 parameters
Grid network
3-stage speculative router pipeline
Network latency vs. bank access latency trade-off: iterate over different bank sizes
Calculate the average network delay based on the number of banks and bank sizes
Consider contention values for different cache configurations
Similarly, consider the power consumed by each organization
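The iteration described above can be sketched as a simple search loop. The two latency functions below are hypothetical stand-ins for CACTI's detailed bank and network models; only the structure of the search (and the existence of an interior optimum) is the point:

```python
# Bank-count search sketch: total access time = avg network delay + bank
# access delay. Latency models are illustrative, not CACTI's.
import math

CACHE_MB = 32

def bank_access_cycles(bank_mb):
    """Larger banks are slower (stand-in for CACTI's bank model)."""
    return 4 + 8 * math.sqrt(bank_mb)

def avg_network_cycles(n_banks, hop_cycles=4, contention=0.5):
    """Average hops in a sqrt(n) x sqrt(n) grid, plus the 3-stage router
    and a contention term that grows with bank count (assumptions)."""
    side = math.sqrt(n_banks)
    return side * (hop_cycles + 3 + contention * n_banks / 16)

def total_latency(n_banks):
    return avg_network_cycles(n_banks) + bank_access_cycles(CACHE_MB / n_banks)

candidates = [2, 4, 8, 16, 32, 64]
best = min(candidates, key=total_latency)
print({n: round(total_latency(n), 1) for n in candidates}, "best:", best)
```

With these stand-in models the minimum falls at an intermediate bank count: few banks suffer slow bank access, many banks suffer network hops and contention, mirroring the delay-optimal point on the next slide.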
Trade-off Analysis (32 MB Cache)
[Plot: latency in cycles (0–400) vs. number of banks (2–64), showing the total number of cycles, network latency, bank access latency, and network contention cycles; marks the delay-optimal point]
Effect of Core Count
[Plot: contention cycles (0–300) vs. bank count (2–64) for 4-core, 8-core, and 16-core configurations]
Power Centric Design (32MB Cache)
[Plot: energy per access (0 to 1e-8 J) vs. bank count (2–64) for the 32 MB cache, showing total energy, bank energy, and network energy; marks the power-optimal point]
Search Space of Old CACTI
Design space with global wires optimized for delay
Search Space of CACTI-6
Design space with various wire types; the chart marks the least-delay organization, organizations within a 30% delay penalty, and low-swing wires
Earlier NUCA Models
Made simplified assumptions for network parameters: minimum bank access time, minimum network hop latency, single-cycle router pipeline
Employed 512 banks for a 32 MB cache
+ More bandwidth
- 2.5X less efficient in terms of delay
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Cache Look-Up
The entire access happens in a sequential manner
[Figure: Core/L1 sends the address to an L2 bank; network routing logic uses 4-6 bits and the decoder 10-15 bits, then the tag and data arrays are accessed, followed by the comparator]
Early Look-Up
Break the sequential access; hides 70% of the bank access time
[Figure: the critical lower-order bits are sent ahead to the L2 bank so tag and data access can begin before the full address arrives at the comparator]
Aggressive Look-Up
[Figure: the critical lower-order bits plus 8 tag bits travel to the L2 bank; the 8-bit partial tag (11100010) is compared against stored tags (e.g. 1101…1101111100010) so the access can start before the full address arrives]
Aggressive Look-Up
Reduction in link delay (for address transfer)
Increase in traffic due to false matches: < 1%
Marginal increase in link overhead (additional 8 bits)
- More logic at the cache controller for tag match
- Address transfer for writes happens on L-wires
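A minimal sketch of the partial-tag idea behind aggressive lookup: ship the index bits plus a few low-order tag bits on the fast L-wires so the bank can begin tag and data access early; a partial-tag match can occasionally be a false positive, resolved when the full address arrives on B-wires. Field widths and tag values here are hypothetical:

```python
# Aggressive-lookup sketch: early candidate selection via partial tags.
PARTIAL_TAG_BITS = 8

def partial_tag(addr, index_bits=10, partial_bits=PARTIAL_TAG_BITS):
    """Low-order tag bits that are sent early on the L-wires."""
    return (addr >> index_bits) & ((1 << partial_bits) - 1)

def early_candidates(stored_tags, addr):
    """Ways whose stored partial tag matches the early bits; the data
    array read can start for these before the full tag arrives."""
    p = partial_tag(addr)
    return [way for way, tag in enumerate(stored_tags)
            if (tag & ((1 << PARTIAL_TAG_BITS) - 1)) == p]

# Full tags stored in one 4-way set (hypothetical values):
tags = [0x1A2B, 0x3C2B, 0x0F00, 0x0001]
addr = (0x1A2B << 10) | 0x155
print(early_candidates(tags, addr))   # ways 0 and 1 share the low 8 tag bits
```

Ways 0 and 1 alias in their low 8 tag bits, illustrating the rare false match the slide bounds at under 1%; the full-tag compare later discards way 1.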
Heterogeneous Network
Routers introduce significant overhead (especially in the L-network): L-wires can transfer a signal across four banks in four cycles, while a router adds three cycles per hop
Modify the network topology to take advantage of wire properties: different topologies for address and data transfers
Hybrid Network
Combination of point-to-point and bus:
Reduction in latency
Reduction in power
Efficient use of L-wires
- Low bandwidth
[Figure: cores attached to shared buses, which connect through routers to the L2 controller]
Experimental Setup
Simplescalar with contention modeled in detail
Single-core, 8-issue out-of-order processor
32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
32KB L1 I-cache and 32KB L1 D-cache with a hit latency of 3 cycles
Main memory latency 300 cycles
CMP Setup
Eight-core CMP (Simplescalar tool)
32 MB, 8-way set-associative (SNUCA organization)
Two cache controllers
Main memory latency 300 cycles
[Figure: eight cores (C1–C8) surrounding the L2 bank array]
Network Model
Virtual channel flow control: four virtual channels per physical channel
Credit-based flow control (for backpressure)
Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination
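The adaptive-routing rule in the last bullet can be sketched as a productive-port filter: at each hop the router may pick any output port that shrinks the Manhattan distance to the destination. Coordinates and port names below are illustrative:

```python
# Minimal-adaptive routing sketch: enumerate productive output ports.
def productive_ports(cur, dst):
    """Output directions that reduce Manhattan distance from cur to dst."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx: ports.append("E")
    if dx < cx: ports.append("W")
    if dy > cy: ports.append("N")
    if dy < cy: ports.append("S")
    return ports

# From (1, 1) toward (3, 0) both E and S are productive; an adaptive
# router can choose between them, e.g. by free-credit count.
print(productive_ports((1, 1), (3, 0)))
```

Because only productive ports are offered, every hop strictly reduces the remaining Manhattan distance, which is the stated routing constraint.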
Cache Models

Model   Bank Access (cycles)   Bank Count   Network Link   Description
1       3                      512          B-wires        Based on prior work
2       17                     16           B-wires        CACTI-6
3       17                     16           B & L-wires    Early lookup
4       17                     16           B & L-wires    Aggressive lookup
5       17                     16           B & L-wires    Hybrid network
6       17                     16           B-wires        Upper bound
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal), shown for all benchmarks and for latency-sensitive benchmarks]
Model derived from CACTI: improvement over the model assumed in prior work – 73% (L2 sensitive – 114%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Early lookup technique: average improvement over Model 2 – 6% (L2 sensitive – 8%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Aggressive lookup technique: average improvement over Model 2 – 8% (L2 sensitive – 9%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Hybrid model: average improvement over Model 2 – 15% (L2 sensitive – 20%)
Performance Results (CMP)
[Chart: normalized IPC (0–1.3) per benchmark set (mix, all L2-sensitive, half L2- and half non-L2-sensitive, memory-intensive, average) under the base, early lookup, aggressive lookup, hybrid, and ideal models]
Performance Results (4X – Wires)
Wire-delay-constrained model: performance improvements are better
Early lookup – 7%
Aggressive model – 20%
Hybrid model – 29%
[Chart: normalized IPC (0–4.50) for Models 1–6, all benchmarks and latency-sensitive benchmarks]
NUCA Design
Network parameters play a significant role in the performance of large caches
The modified CACTI model, which includes network overhead, performs 51% better than previous models
Methodology to compute an optimal baseline NUCA
NUCA Design II
Wires can be tuned for different metrics
Routers impose non-trivial overhead
Address and data have different bandwidth needs
We introduce heterogeneity at three levels
Different types of wires for address and data transfers
Different topologies for address and data networks
Different architectures within address network (point-to-point and bus)
(Yields an additional performance improvement of 15% over the optimal baseline NUCA)
Outline
Overview
Methodology to Design Scalable Caches
Wire Design Space
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Directory Based Protocol (Write-Invalidate)
Map critical/small messages on L-wires and non-critical messages on PW-wires
Hop imbalance in messages, e.g.:
Read-exclusive request for a block in shared state
Read request for a block in exclusive state
Negative acknowledgement (NACK) messages
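One way to picture the mapping: classify each coherence message by criticality and size, then route it on the matching wire type. The message names and the 64-bit size threshold below are illustrative assumptions, not the protocol's full message set:

```python
# Sketch of message-to-wire mapping: small, critical messages ride the
# low-latency L-wires; latency-insensitive ones take the power-optimized
# PW-wires; everything else defaults to baseline B-wires.
CRITICAL_SMALL = {"nack", "inval_req", "inval_ack", "spec_reply_req"}
NON_CRITICAL = {"writeback_data", "fwd_dirty_copy"}

def pick_wire(msg_type, size_bits):
    if msg_type in CRITICAL_SMALL and size_bits <= 64:
        return "L"      # fast, low bandwidth
    if msg_type in NON_CRITICAL:
        return "PW"     # slow, power optimized
    return "B"          # baseline

print(pick_wire("nack", 32),
      pick_wire("writeback_data", 512),
      pick_wire("read_req", 64))
```

The size check matters because L-wires are low bandwidth: only messages that fit in a few flits can afford them without serialization delay.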
Exclusive request for a shared copy:
1. Rd-Ex request from processor 1
2. Directory sends clean copy to processor 1
3. Directory sends invalidate message to processor 2
4. Cache 2 sends acknowledgement back to processor 1
[Figure: Cache 1/Processor 1, L2 & directory, and Cache 2/Processor 2 exchanging messages 1–4, with critical and non-critical messages distinguished]
Read to an Exclusive Block
[Figure: Proc 1 (L1), L2 & directory, and Proc 2 (L1) exchanging messages on a read to an exclusive block: Read Req, Spec Reply, Req, ACK, Fwd Dirty Copy, and WB Data; one message is marked critical and two non-critical]
Evaluation Platform & Simulation Methodology
Virtutech Simics functional simulator
Ruby timing model (GEMS)
SPLASH suite
[Figure: CMP with processors and a shared L2]
Heterogeneous Model
[Figure: CMP with processors and L2, with links built from L-wires, B-wires, and PW-wires]
11% performance improvement
22.5% power savings in wires
Summary
Coherence messages have diverse needs
Intelligent mapping of these messages to wires in a heterogeneous network can improve both performance and power
Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmark suite
Non-critical traffic on the power-optimized network decreases wire power by 22.5%
Ref: Interconnect Aware Coherence Protocol (ISCA 06), in collaboration with Liqun Cheng
On-Core Communications
L-wires: narrow bit-width operands, branch mispredict signal
PW-wires: non-critical register values, ready registers, store data
11% improvement in ED^2
Results Summary
[Figure: 16-cluster chip (P0–P15, each with I- and D-caches), two controllers, and L2 banks, annotated with the results per traffic type]
Cache reads and writes: 114% processor performance improvement, 50% power savings
Coherence transactions: 11% performance improvement, 22.5% power savings in wires
L1 accesses: 7% performance improvement, 11% ED^2 improvement
Conclusion
The impact of interconnect choices in modern processors is significant
Architectural-level wire management can improve both the power and performance of future communication-bound processors
Architects have a lot to offer in the area of wire-aware design
Future Research
Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
Transactional memory
Network to support register-register communication
Dynamic adaptation
Acknowledgements
Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
CS front office staff
Lab-mates: Karthik, Niti, Liqun, and other fellow grads
Avenues Explored
Inter-core communication (ISCA 2006)
Memory hierarchy (ISCA 2007)
CACTI 6.0, publicly released (MICRO 2007; IEEE Micro Top Picks 2008)
Out-of-order core (HPCA 2005, IEEE Micro 06)
Power- and Temperature-Aware Architectures (ISPASS 2006)
Current projects or under submission:
Scalable and Reliable Transactional Memory (PACT 08)
Rethinking Fundamentals: Route Wires or Packets?
3D Reconfigurable Caches