
Page 1: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah


Wire Aware Architecture

Naveen Muralimanohar

Advisor – Rajeev Balasubramonian

University of Utah

Page 2: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Effect of Technology Scaling

Power wall
Temperature wall
Reliability issues: process variation, soft errors
Wire scaling: communication is expensive but computation is cheap

Page 3: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Wire Delay – Compelling Opportunity

Existing proposals are indirect:
Hide wire delay (prefetching, speculative coherence, run-ahead execution)
Reduce communication to save power
Wire-level optimizations are still limited to circuit designers

Page 4: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Thesis Statement

“The growing cost of on-chip wire delay requires a thorough understanding of wires. This dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management.”

Page 5: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Wire Delay/Power

The Pentium 4 (at 90nm) spent two cycles to send a signal across the chip
Wire delays are costly for performance and power
Latencies of 60 cycles to reach the ends of a chip at 32nm (at 5 GHz)
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)

Page 6: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Large Caches

Cache hierarchies will dominate chip area
Montecito has two private 12 MB L3 caches (27 MB including L2)
Long global wires are required to transmit data/address

[Figure: Intel Montecito die photo with the two large cache regions highlighted]

Page 7: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

On-Chip Cache Challenges

[Figure: cache access time, calculated using CACTI, for 4 MB, 16 MB, and 64 MB caches – roughly 1X at a 130nm process, ~1.5X at 65nm, and ~2X at 32nm]

Page 8: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Effect of L2 Hit Time

[Figure: IPC improvement due to reduction in L2 access time on an aggressive out-of-order processor (L2 hit time reduced from 30 to 15 cycles) across SPEC benchmarks: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise]

Average IPC improvement = 17%

Page 9: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Coherence Traffic

CMPs have already become ubiquitous and require coherence among multiple cores
Coherence operations entail frequent communication
Different coherence messages have different latency and bandwidth needs

[Figure: three cores with private L1 caches sharing an L2; read-miss messages (read request, forwarded read request to owner, latest copy) and write-miss messages (exclusive request, invalidate request, invalidate ack)]

Page 10: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

L1 Accesses

Highly latency critical in aggressive out-of-order processors (such as a clustered processor)

The choice of inter-cluster communication fabric has a high impact on performance

Page 11: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

On-chip Traffic

[Figure: 16-cluster chip layout (P0–P15, each with I and D caches), L2 banks, and two cache controllers; on-chip traffic comprises L1 accesses within a cluster, cache reads and writes to L2 banks, and coherence transactions]

Page 12: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Outline

Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions

Page 13: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Wire Characteristics

Wire resistance and capacitance per unit length:

R_wire = ρ / ((thickness − barrier) × (width − 2 × barrier))

C_wire = ε0 × (2 × K_horiz × thickness / spacing + 2 × K_vert × width / layer_spacing) + fringe(horiz, vert)

[Figure: increasing wire width lowers resistance and increasing spacing lowers capacitance, but both reduce the bandwidth available in a fixed routing area]
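To make the relationships above concrete, here is a small Python sketch that evaluates the two expressions for a couple of wire geometries. The resistivity, dielectric constants, fringe term, and dimensions are illustrative placeholders, not the ITRS values used in the dissertation.

```python
# Per-unit-length wire model from the formulas above.
# All constants and dimensions are illustrative assumptions.

RHO = 2.2e-8      # effective copper resistivity (ohm*m), assumed
EPS0 = 8.854e-12  # vacuum permittivity (F/m)

def wire_resistance(width, thickness, barrier, rho=RHO):
    """Resistance per unit length (ohm/m), accounting for the barrier layer."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))

def wire_capacitance(width, spacing, thickness, layer_spacing,
                     k_horiz=2.2, k_vert=2.2, fringe=40e-12):
    """Capacitance per unit length (F/m): sidewall, interlayer, and fringe terms."""
    return EPS0 * (2 * k_horiz * thickness / spacing
                   + 2 * k_vert * width / layer_spacing) + fringe

# A minimum-pitch global wire vs. a wire with double the width and spacing.
for name, w, s in [("min pitch", 210e-9, 210e-9), ("2x width/space", 420e-9, 420e-9)]:
    r = wire_resistance(w, thickness=420e-9, barrier=10e-9)
    c = wire_capacitance(w, s, thickness=420e-9, layer_spacing=420e-9)
    print(f"{name:>15}: R = {r:.3g} ohm/m, C = {c*1e12:.0f} pF/m, RC = {r*c:.3g} s/m^2")
```

Doubling the width and spacing roughly halves resistance and trims capacitance, which is why fat wires are fast but consume more routing area per bit.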

Page 14: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Design Space Exploration

Tuning wire width and spacing:
Base case – B-wires
Fast but low bandwidth – L-wires

[Figure: increasing width and spacing reduces delay but also reduces bandwidth]

Page 15: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Design Space Exploration – Tuning Repeater Size and Spacing

Traditional wires: large repeaters, optimum spacing
Power-optimal wires: smaller repeaters, increased spacing

[Figure: smaller, more widely spaced repeaters trade delay for power]

Page 16: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah


ED Trade-off in a Repeated Wire
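The slide above names the energy-delay trade-off in a repeated wire; the sketch below evaluates a simple Elmore-style delay and switching-energy model for a few repeater size/spacing choices. The device and wire constants are assumed, illustrative values, not the 65nm parameters from the thesis.

```python
# Energy-delay sketch for a repeated wire using a common Elmore-style
# segment approximation. R0/C0 (driver resistance / input cap of a
# minimum-size repeater), VDD, and the wire r/c are assumed values.

R0, C0 = 10e3, 0.1e-15       # min repeater output resistance (ohm), input cap (F), assumed
VDD = 1.1                     # supply voltage (V), assumed
r_w, c_w = 2.8e5, 2.0e-10     # wire resistance (ohm/m) and capacitance (F/m), assumed

def segment_delay(k, h):
    """Elmore delay of one segment: a size-k repeater driving h metres of wire."""
    return 0.69 * (R0 / k) * (c_w * h + k * C0) + r_w * h * (0.38 * c_w * h + 0.69 * k * C0)

def per_mm(k, h):
    """Delay (s/mm) and switching energy (J/mm, one transition) for size k, spacing h."""
    delay = segment_delay(k, h) / h * 1e-3
    energy = 0.5 * VDD**2 * (c_w * h + k * C0) / h * 1e-3
    return delay, energy

# Shrinking the repeaters and spreading them out trades delay for energy,
# which is the direction of the ED trade-off this slide refers to.
for k, h in [(100, 0.5e-3), (50, 1.0e-3), (20, 2.0e-3)]:
    d, e = per_mm(k, h)
    print(f"k={k:3d}, spacing={h*1e3:.1f} mm -> {d*1e12:6.1f} ps/mm, {e*1e15:6.2f} fJ/mm")
```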

Page 17: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Design Space Exploration

Base case (B-wires):                       latency 1x,   power 1x,   area 1x
Bandwidth-optimized (W-wires):             latency 1.6x, power 0.9x, area 0.5x
Power- and bandwidth-optimized (PW-wires): latency 3.2x, power 0.3x, area 0.5x
Fast, low-bandwidth (L-wires):             latency 0.5x, power 0.5x, area 4x

Page 18: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Wire Model

[Figure: distributed RC wire model with driver resistance and capacitance, side-wall capacitance (C_side-wall), and coupling capacitance to adjacent wires (C_adj)]

Wire Type     Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire (8x)   1x                 1x              2.65            1x
B-Wire (4x)   1.6x               0.5x            2.9             1.13x
L-Wire (8x)   0.5x               4x              1.46            0.55x
PW-Wire (4x)  3.2x               0.5x            0.87            0.3x

Ref: Banerjee et al.
65nm process, 10 metal layers – 4 in the 1X plane and 2 in each of the 2X, 4X, and 8X planes
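As a rough illustration of how these relative numbers translate into message transfer times, the sketch below assumes a fixed routing-area budget plus a made-up baseline link width and latency, and computes how many cycles a small address versus a full cache line would take on each wire type.

```python
# Illustrative use of the relative latency/area figures from the table above.
# BASE_LINK_BITS and the baseline link latency are assumptions, not thesis values.

BASE_LINK_BITS = 128          # B-wire (8x) bits that fit in the area budget, assumed
WIRES = {                     # wire type -> (relative latency, relative area)
    "B-wire 8x":  (1.0, 1.0),
    "B-wire 4x":  (1.6, 0.5),
    "L-wire 8x":  (0.5, 4.0),
    "PW-wire 4x": (3.2, 0.5),
}

def transfer_cycles(msg_bits, wire, base_link_latency=4):
    """Link latency scaled by wire speed, plus serialization over a narrower link."""
    rel_latency, rel_area = WIRES[wire]
    link_bits = int(BASE_LINK_BITS / rel_area)   # fatter wires -> fewer of them
    flits = -(-msg_bits // link_bits)            # ceiling division
    return round(base_link_latency * rel_latency) + flits - 1   # pipelined flits

for wire in WIRES:
    print(f"{wire:11s}: 64-bit addr -> {transfer_cycles(64, wire):2d} cycles, "
          f"512-bit data -> {transfer_cycles(512, wire):2d} cycles")
```

The pattern this toy model produces (L-wires excel for small, critical messages and lose badly on wide data) is what motivates the heterogeneous networks later in the talk.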

Page 19: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Outline

Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions

Page 20: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Cache Design Basics

[Figure: basic cache organization – the input address drives a decoder and wordlines; bitlines feed the tag array and data array; column muxes, sense amps, comparators, mux drivers, and output drivers produce the valid signal and the data output]

Page 21: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Existing Model – CACTI

[Figure: cache models with 4 and 16 sub-arrays, each access consisting of decoder delay followed by wordline and bitline delay]

Decoder delay = H-tree delay + logic delay

Page 22: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

CACTI Shortcomings

Access delay is equal to the delay of the slowest sub-array: very high hit time for large caches
Employs a separate bus for each cache bank in multi-banked caches: not scalable

Potential solution – NUCA
Extend CACTI to model NUCA and exploit different wire types and network design choices to reduce access latency

Page 23: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Non-Uniform Cache Access (NUCA)*

The large cache is broken into a number of small banks
Employs an on-chip network for communication
Access delay depends on the distance between the bank and the cache controller

[Figure: CPU and L1 on one edge of a grid of cache banks]

*(Kim et al., ASPLOS '02)

Page 24: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Extension to CACTI

On-chip network:
Wire model based on ITRS 2005 parameters
Grid network
3-stage speculative router pipeline

Network latency vs. bank access latency trade-off:
Iterate over different bank sizes
Calculate the average network delay based on the number of banks and bank sizes
Consider contention values for different cache configurations
Similarly, consider the power consumed by each organization
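A minimal sketch of the search loop described above: sweep candidate bank counts, estimate bank access time, network delay, and contention for each, and keep the delay-optimal organization. The cost functions are crude placeholders, not CACTI-6's actual analytical models.

```python
# Sketch of a CACTI-6-style design-space sweep over bank count.
# All cost functions below are stand-ins for illustration only.

import math

def bank_access_cycles(bank_kb):
    """Placeholder: bigger banks have longer wordlines/bitlines, so slower access."""
    return 3 + int(math.log2(bank_kb / 64)) * 2

def network_cycles(num_banks, router_pipeline=3, link_cycles=1):
    """Average hops on a grid of num_banks nodes; each hop pays router + link delay."""
    side = int(math.sqrt(num_banks))
    avg_hops = side
    return avg_hops * (router_pipeline + link_cycles)

def contention_cycles(num_banks, cores=8):
    """Placeholder: fewer banks means more queuing at each bank and router."""
    return max(0, 4 * cores // num_banks)

def best_organization(cache_mb=32, cores=8):
    best = None
    for num_banks in [2, 4, 8, 16, 32, 64]:
        bank_kb = cache_mb * 1024 // num_banks
        total = (bank_access_cycles(bank_kb)
                 + network_cycles(num_banks)
                 + contention_cycles(num_banks, cores))
        if best is None or total < best[1]:
            best = (num_banks, total)
    return best

print(best_organization())   # (delay-optimal bank count, total access cycles)
```

The same loop can be repeated with an energy cost per bank access and per network hop to find the power-optimal point shown a few slides later.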

Page 25: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Trade-off Analysis (32 MB Cache)

[Figure: latency (cycles) vs. number of banks (2–64), showing bank access latency, network latency, network contention cycles, and the total number of cycles, with the delay-optimal point marked]

Page 26: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Effect of Core Count

[Figure: network contention cycles vs. bank count (2–64) for 4-core, 8-core, and 16-core configurations]

Page 27: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Power Centric Design (32 MB Cache)

[Figure: energy per access (J) vs. bank count (2–64), showing bank energy, network energy, and total energy, with the power-optimal point marked]

Page 28: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Search Space of Old CACTI

Design space with global wires optimized for delay

Page 29: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Search Space of CACTI-6

Design space with various wire types: least delay, 30% delay penalty, and low-swing wires

Page 30: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Earlier NUCA Models

Made simplified assumptions for network parameters: minimum bank access time, minimum network hop latency, single-cycle router pipeline
Employed 512 banks for a 32 MB cache:
+ more bandwidth
− 2.5X less efficient in terms of delay

Page 31: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Outline

Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions

Page 32: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Cache Look-Up

The entire access happens in a sequential manner: the network routing logic consumes 4–6 address bits, the bank decoder consumes 10–15 bits, and the tag comparison completes before the data is returned to the core/L1.

[Figure: core/L1 connected to an L2 bank containing tag and data arrays, with routing logic, decoder, and comparator on the access path]

Page 33: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Early Look-Up

Break the sequential access: the critical lower-order address bits are sent ahead so the bank can start its look-up before the full address arrives
Hides 70% of the bank access time

[Figure: core/L1 sends the critical lower-order bits early to the L2 bank's tag and data arrays]

Page 34: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Aggressive Look-Up

[Figure: the critical lower-order bits plus 8 additional tag bits (e.g. 11100010) are sent early to the L2 bank, where a partial comparison against the stored tag (1101…1101111100010) selects candidate ways]

Page 35: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Aggressive Look-Up

Reduction in link delay (for address transfer)
Increase in traffic due to false matches < 1%
Marginal increase in link overhead (an additional 8 bits)
- More logic at the cache controller for tag match
- Address transfer for writes happens on L-wires
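A toy Python sketch of the aggressive look-up flow: the early-arriving lower-order bits plus 8 tag bits select candidate ways with a partial compare, and the full tag, arriving later, resolves any false match. Structure names and bit widths are illustrative assumptions.

```python
# Minimal sketch of partial-tag filtering for aggressive look-up.
# PARTIAL_TAG_BITS and the example tags are illustrative, not the thesis's design.

PARTIAL_TAG_BITS = 8

def partial_lookup(cache_set, partial_tag):
    """Return the ways whose low-order tag bits match the early-arriving bits."""
    mask = (1 << PARTIAL_TAG_BITS) - 1
    return [way for way, tag in enumerate(cache_set) if (tag & mask) == (partial_tag & mask)]

def full_lookup(cache_set, full_tag, candidates):
    """Resolve false matches once the complete address arrives on the slower wires."""
    return next((way for way in candidates if cache_set[way] == full_tag), None)

# Example: a 4-way set where one way is a false partial match that the full tag rejects.
cache_set = [0x1A3F0, 0x2B4E2, 0x33CF0, 0x40111]
addr_tag = 0x33CF0
candidates = partial_lookup(cache_set, addr_tag)        # ways 0 and 2 share the low 8 bits
hit_way = full_lookup(cache_set, addr_tag, candidates)  # way 2 is the real hit
print(candidates, hit_way)
```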

Page 36: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Heterogeneous Network

Routers introduce significant overhead (especially in the L-network): L-wires can transfer a signal across four banks in four cycles, but the router adds three cycles at each hop
Modify the network topology to take advantage of wire properties
Use different topologies for address and data transfers

Page 37: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Hybrid Network

Combination of point-to-point links and buses:
Reduction in latency
Reduction in power
Efficient use of L-wires
- Low bandwidth

[Figure: cores connected to the L2 controller through routers, with shared buses covering groups of banks]

Page 38: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Experimental Setup

Simplescalar with contention modeled in detail
Single-core, 8-issue out-of-order processor
32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
32 KB L1 I-cache and 32 KB L1 D-cache with a hit latency of 3 cycles
Main memory latency of 300 cycles

Page 39: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

CMP Setup

Eight-core CMP (Simplescalar tool)
32 MB, 8-way set-associative L2 (SNUCA organization)
Two cache controllers
Main memory latency of 300 cycles

[Figure: eight cores (C1–C8) surrounding a grid of L2 banks]

Page 40: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Network Model

Virtual channel flow control
Four virtual channels per physical channel
Credit-based flow control (for backpressure)
Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination
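The routing constraint above is easy to state in code: a hop is allowed only if it reduces the Manhattan distance to the destination. The sketch below, with made-up grid coordinates and port names, enumerates the productive output ports a router could adaptively choose from.

```python
# Sketch of the adaptive-routing rule: only hops that strictly reduce the
# Manhattan distance to the destination are candidates. Coordinates and
# port names are illustrative.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def productive_ports(current, dest):
    """Return the neighbouring hops that strictly reduce the Manhattan distance."""
    x, y = current
    neighbours = {"N": (x, y + 1), "S": (x, y - 1), "E": (x + 1, y), "W": (x - 1, y)}
    d = manhattan(current, dest)
    return {port: nxt for port, nxt in neighbours.items() if manhattan(nxt, dest) < d}

# From (1, 1) to (3, 0) both "E" and "S" are productive; the router can pick
# whichever port currently has a free virtual channel and credits.
print(productive_ports((1, 1), (3, 0)))   # {'S': (1, 0), 'E': (2, 1)}
```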

Page 41: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Cache Models

Model  Bank Access (cycles)  Bank Count  Network Link  Description
1      3                     512         B-wires       Based on prior work
2      17                    16          B-wires       CACTI-6
3      17                    16          B & L-wires   Early lookup
4      17                    16          B & L-wires   Aggressive lookup
5      17                    16          B & L-wires   Hybrid network
6      17                    16          B-wires       Upper bound

Page 42: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (Uniprocessor)

[Figure: normalized IPC for Models 1–6 (prior work, CACTI-L2, Early, Aggressive, Hybrid, Ideal), shown for all benchmarks and for latency-sensitive benchmarks]

The model derived from CACTI improves over the model assumed in prior work by 73% (114% for L2-sensitive benchmarks).

Page 43: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (Uniprocessor)

[Figure: normalized IPC for Models 1–6, all benchmarks and latency-sensitive benchmarks]

The early look-up technique improves over Model 2 by 6% on average (8% for L2-sensitive benchmarks).

Page 44: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (Uniprocessor)

[Figure: normalized IPC for Models 1–6, all benchmarks and latency-sensitive benchmarks]

The aggressive look-up technique improves over Model 2 by 8% on average (9% for L2-sensitive benchmarks).

Page 45: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (Uniprocessor)

[Figure: normalized IPC for Models 1–6, all benchmarks and latency-sensitive benchmarks]

The hybrid model improves over Model 2 by 15% on average (20% for L2-sensitive benchmarks).

Page 46: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (CMP)

[Figure: normalized IPC for the Base, Early Lookup, Aggressive Lookup, Hybrid, and Ideal models across benchmark sets: mix, all L2-sensitive, half L2- and half non-L2-sensitive, memory-intensive, and the average]

Page 47: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Performance Results (4X Wires)

In a wire-delay-constrained model, the performance improvements are better:
Early lookup – 7%
Aggressive model – 20%
Hybrid model – 29%

[Figure: normalized IPC for Models 1–6, all benchmarks and latency-sensitive benchmarks]

Page 48: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

NUCA Design

Network parameters play a significant role in the performance of large caches
The modified CACTI model, which includes network overhead, performs 51% better than previous models
Provides a methodology to compute an optimal baseline NUCA

Page 49: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

NUCA Design II

Wires can be tuned for different metrics
Routers impose non-trivial overhead
Address and data have different bandwidth needs

We introduce heterogeneity at three levels:
Different types of wires for address and data transfers
Different topologies for the address and data networks
Different architectures within the address network (point-to-point and bus)

(Yields an additional performance improvement of 15% over the optimal baseline NUCA)

Page 50: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Outline

Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions

Page 51: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Directory-Based Protocol (Write-Invalidate)

Map critical/small messages onto L-wires and non-critical messages onto PW-wires

Examples with hop imbalance among messages:
Read-exclusive request for a block in shared state
Read request for a block in exclusive state
Negative Ack (NACK) messages
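A hedged sketch of the mapping policy this slide describes: small, critical messages go on L-wires, clearly non-critical messages on PW-wires, and everything else (such as full data replies) on B-wires. The message names and the assignment of individual types below are illustrative assumptions, not the exact classification used in the protocol.

```python
# Illustrative message-to-wire mapping; the message names and their
# classification are assumptions for the sake of the example.

CRITICAL_SMALL = {"read_req", "read_ex_req", "inval_req", "inval_ack", "fwd_req"}
NON_CRITICAL   = {"writeback_data", "spec_reply_data"}
L_WIRE_MAX_BITS = 64   # L-wires are fast but narrow, so only small messages fit (assumed)

def pick_wires(msg_type, size_bits):
    if msg_type in CRITICAL_SMALL and size_bits <= L_WIRE_MAX_BITS:
        return "L-wires"    # low latency, low bandwidth
    if msg_type in NON_CRITICAL:
        return "PW-wires"   # high latency, low power
    return "B-wires"        # baseline

for msg, bits in [("read_ex_req", 40), ("inval_ack", 8),
                  ("writeback_data", 512), ("data_reply", 512)]:
    print(f"{msg:15s} ({bits:3d} bits) -> {pick_wires(msg, bits)}")
```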

Page 52: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Exclusive Request for a Shared Copy

1. Rd-Ex request from processor 1
2. Directory sends a clean copy to processor 1
3. Directory sends an invalidate message to processor 2
4. Cache 2 sends an acknowledgement back to processor 1

[Figure: processors 1 and 2 with their caches and the L2/directory, with each message in the sequence labeled critical or non-critical]

Page 53: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Read to an Exclusive Block

[Figure: processor 1's read request to a block held exclusively by processor 2 – messages include the read request, the speculative reply from the L2/directory, the forwarded request to the owner, the forwarded dirty copy, the writeback data, and the final ACK; one message is marked critical and two are marked non-critical]

Page 54: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Evaluation Platform & Simulation Methodology

Virtutech Simics functional simulator
Ruby timing model (GEMS)
SPLASH suite

[Figure: simulated CMP with processors and a shared L2]

Page 55: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Heterogeneous Model

[Figure: CMP interconnect with the links implemented as a combination of L-wires, B-wires, and PW-wires]

11% performance improvement
22.5% power savings in the wires

Page 56: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Summary

Coherence messages have diverse needs
Intelligent mapping of these messages to wires in a heterogeneous network can improve both performance and power
Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmark suite
Sending non-critical traffic on the power-optimized network decreases wire power by 22.5%

Ref: Interconnect Aware Coherence Protocol (ISCA 06), in collaboration with Liqun Cheng

Page 57: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

On-Core Communications

L-wires: narrow bit-width operands, branch mispredict signal
PW-wires: non-critical register values, ready registers, store data

11% improvement in ED^2

Page 58: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Results Summary

[Figure: 16-cluster chip layout (P0–P15, each with I and D caches), L2 banks, and two cache controllers, annotated with the gains for each traffic class]

Cache reads and writes: 114% processor performance improvement, 50% power savings
Coherence transactions: 11% performance improvement, 22.5% power savings in the wires
L1 accesses: 7% performance improvement, 11% ED^2 improvement

Page 59: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Conclusion

The impact of interconnect choices in modern processors is significant
Architectural-level wire management can improve both the power and performance of future communication-bound processors
Architects have a lot to offer in the area of wire-aware design

Page 60: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Future Research

Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
Transactional memory
Networks to support register-to-register communication
Dynamic adaptation

Page 61: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Acknowledgements

Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
CS front office staff
Lab-mates: Karthik, Niti, Liqun, and other fellow grads

Page 62: 1 Wire Aware Architecture Naveen Muralimanohar Advisor – Rajeev Balasubramonian University of Utah

Avenues Explored

Inter-core communication (ISCA 2006)
Memory hierarchy (ISCA 2007)
CACTI 6.0 – publicly released (MICRO 2007, IEEE Micro Top Picks 2008)
Out-of-order core (HPCA 2005, IEEE Micro 2006)
Power- and Temperature-Aware Architectures (ISPASS 2006)

Current projects or work under submission:
Scalable and Reliable Transactional Memory (PACT 08)
Rethinking Fundamentals: Route Wires or Packets?
3D Reconfigurable Caches