
Feb 14th 2005 University of Utah 1

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy

Processor Architecture


Overview/Motivation

Wire delays hamper performance.

Power incurred in movement of data

50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)

MIT Raw processor’s on-chip network consumes 36% of total chip power (Wang et al. 2003)

Abundant number of metal layers


Wire characteristics

Wire Resistance and capacitance per unit length

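$$C_{wire} = \epsilon_0 \left( 2K\,\epsilon_{horiz}\,\frac{thickness}{spacing} + 2\,\epsilon_{vert}\,\frac{width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$$

$$R_{wire} = \frac{\rho}{(thickness - barrier)(width - 2\,barrier)}$$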

Increasing width: R decreases, C increases

Increasing spacing: C decreases

Result: delay decreases (as delay ∝ RC), bandwidth decreases
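The width and spacing trends can be checked numerically. The following is a minimal sketch, assuming the usual per-unit-length model in which resistance is resistivity divided by the conducting cross-section (thickness and width reduced by the barrier layer) and capacitance is a sidewall term plus a vertical term plus a fringe term. All numeric values (geometry, dielectric constants, K, resistivity) are hypothetical placeholders chosen for illustration, and the fringe term is simplified to a constant.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def wire_resistance(rho, thickness, width, barrier):
    """Resistance per unit length: rho over the conducting cross-section."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(width, spacing, thickness, layer_spacing,
                     k=1.5, eps_horiz=2.7, eps_vert=2.7, fringe=0.0):
    """Capacitance per unit length; fringe is treated as a constant here."""
    sidewall = 2 * k * eps_horiz * thickness / spacing   # to neighbouring wires
    vertical = 2 * eps_vert * width / layer_spacing      # to layers above/below
    return EPS0 * (sidewall + vertical) + fringe


# Hypothetical values in SI units, not taken from the slides.
RHO = 1.7e-8           # roughly copper, in ohm-metres
THICKNESS = 200e-9
LAYER_SPACING = 200e-9
BARRIER = 10e-9


def rc_product(width, spacing):
    """R*C per unit length squared, the quantity wire delay scales with."""
    r = wire_resistance(RHO, THICKNESS, width, BARRIER)
    c = wire_capacitance(width, spacing, THICKNESS, LAYER_SPACING)
    return r * c


narrow = rc_product(width=100e-9, spacing=100e-9)
wide = rc_product(width=200e-9, spacing=200e-9)
print(f"RC ratio (wide / narrow): {wide / narrow:.2f}")
```

With these placeholder numbers, doubling both width and spacing lowers the RC product, at the cost of fitting fewer wires in the same metal area.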


Design space exploration

Tuning wire width and spacing

[Figure: B wires (d) versus wires at 2d; quantities compared: resistance, capacitance, bandwidth]


Transmission Lines

Similar to L wires - extremely low delay

Constraining implementation requirements!

Large width

Large spacing between wires

Design of sensing circuits

Implemented in test CMOS chips



Tuning repeater size and spacing

Traditional wires: large repeaters, optimum spacing

Power-optimal wires: smaller repeaters, increased spacing

[Figure labels: Delay, Power]



Delay optimized: B wires

Bandwidth optimized: W wires

Power optimized: P wires

Power and B/W optimized: PW wires

Fast, low bandwidth: L wires


Heterogeneous Interconnects

Intercluster global interconnect

72 B wires: repeaters sized and spaced for optimum delay

18 L wires: wide wires and large spacing, occupies more area, low latencies

144 PW wires: poor delay, high bandwidth, low power


Outline

Overview

Design Space Exploration

Heterogeneous Interconnects

Employing L wires for performance

PW wires: The power optimizers

Evaluation

Results

Conclusion


L1 Cache pipeline

[Figure: pipeline between the LSQ and the L1 DCache. Eff. address transfer (10c), mem. dep. resolution (5c), cache access (5c); data return at 20c]


Exploiting L-Wires

[Figure: pipeline between the LSQ and the L1 DCache with L wires. 8-bit transfer (5c), partial mem. dep. resolution (3c), cache access (5c), eff. address transfer (10c); data return at 14c]


L wires: Accelerating cache access

Transmit LSB bits of effective address through L wires

Partial comparison of loads and stores in LSQ

Faster memory disambiguation

Introduces false dependences ( < 9%)

Indexing data and tag RAM arrays

LSB bits can prefetch data out of L1$

Reduce access latency of loads
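The partial comparison in the LSQ can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the 8-bit width is borrowed from the "8-bit Transfer" label on the pipeline slide, and access sizes and alignment are ignored. The key property is that a mismatch in the low bits proves the full addresses differ, while a match may turn out to be a false dependence once the full address arrives.

```python
def low_bits(addr, nbits=8):
    """Keep only the least significant nbits of an address."""
    return addr & ((1 << nbits) - 1)


def may_alias_partial(load_addr, store_addr, nbits=8):
    """Early check using only the bits that arrive over the L wires."""
    return low_bits(load_addr, nbits) == low_bits(store_addr, nbits)


def classify(load_addr, store_addr, nbits=8):
    """Compare the early (partial) verdict with the full-address verdict."""
    if not may_alias_partial(load_addr, store_addr, nbits):
        # Low bits differ, so the full addresses must differ as well.
        return "independent"
    if load_addr == store_addr:
        return "true dependence"
    # Low bits match but upper bits differ: the load waited needlessly.
    return "false dependence"


print(classify(0x1234, 0x5678))  # low bits 0x34 vs 0x78
print(classify(0x1234, 0x1234))  # identical addresses
print(classify(0x1234, 0x5634))  # same low byte, different upper bits
```

Because the early check never declares two equal addresses independent, it is safe to let a load proceed on a partial mismatch; the cost is the occasional false dependence.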


L wires: Narrow bit width operands

Transfer of 10 bit integers on L wires

Schedule wake up operations

Reduction in branch mispredict penalty

A predictor table of 8K two bit counters

Identifies 95% of all narrow bit-width results

Accuracy of 98%

Implemented in the PowerPC!
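A predictor table of two-bit counters could look like the sketch below. The table size (8K entries) and the 10-bit operand width come from the slide; the indexing function, the prediction threshold, the update rule, and the treatment of values as non-negative integers are assumptions made for illustration, not details given in the talk.

```python
TABLE_SIZE = 8 * 1024   # 8K two-bit counters, as on the slide
NARROW_BITS = 10        # width of integers sent on L wires, as on the slide


def is_narrow(value, nbits=NARROW_BITS):
    """Assumption: a result is narrow if it fits in nbits as an unsigned value."""
    return 0 <= value < (1 << nbits)


class NarrowWidthPredictor:
    def __init__(self):
        self.counters = [0] * TABLE_SIZE  # each entry saturates in 0..3

    def _index(self, pc):
        # Hypothetical indexing: drop the byte offset, then wrap into the table.
        return (pc >> 2) % TABLE_SIZE

    def predict_narrow(self, pc):
        # Assumed threshold: predict narrow in the upper half of the range.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, result):
        i = self._index(pc)
        if is_narrow(result):
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = NarrowWidthPredictor()
pc = 0x400100
for value in (5, 17, 900):      # three narrow results train the counter up
    predictor.update(pc, value)
print(predictor.predict_narrow(pc))
```

The saturating counter gives hysteresis: after several narrow results, a single wide result does not immediately flip the prediction.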


PW wires: Power/Bandwidth efficient

Idea: steer non-critical data through energy-efficient PW interconnect

Transfer of data at instruction dispatch

Transfer of input operands to remote register file

Covered by long dispatch to issue latency

Store data
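The steering idea can be expressed as a small selection function. This is a sketch under stated assumptions: the transfer-kind names are hypothetical labels for the two cases listed on this slide (operands already available at dispatch, and store data), and everything else is assumed to stay on the baseline B wires.

```python
from enum import Enum


class Wire(Enum):
    B = "B"
    PW = "PW"


# Hypothetical labels for the non-critical transfers named on the slide.
NON_CRITICAL = {
    "operand_ready_at_dispatch",  # latency hidden by dispatch-to-issue delay
    "store_data",
}


def steer(transfer_kind):
    """Send non-critical transfers to PW wires; default to B wires otherwise."""
    return Wire.PW if transfer_kind in NON_CRITICAL else Wire.B


for kind in ("store_data", "operand_ready_at_dispatch", "bypassed_result"):
    print(kind, "->", steer(kind).value)
```

The extra PW-wire latency is tolerable for these transfers because the receiving instruction is not waiting on them at the moment they are sent.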


Evaluation methodology

A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0

Crossbar interconnects

Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor

Crossbar latencies: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles)

[Figure: clusters connected to the L1 DCache]


Evaluation methodology

A dynamically scheduled 16-cluster processor modeled in Simplescalar-3.0

Ring latencies: B wires (4 cycles), PW wires (6 cycles), L wires (2 cycles)

[Figure: I-Cache, D-cache, LSQ, and clusters joined by a crossbar and a ring interconnect]


IPC improvements: L wires

L wires improve performance by 4% on a four-cluster system and 7.1% on a sixteen-cluster system


Four cluster system: ED2 gains

Link            Relative metal area   IPC    Relative processor energy (10%)   Relative ED2 (10%)   Relative ED2 (20%)
144 B           1.0                   0.95   100                               100                  100
288 PW          1.0                   0.92   97                                103.4                100.2
144 PW, 36 L    1.5                   0.96   97                                95.0                 92.1
288 B           2.0                   0.98   103                               96.6                 99.2
288 PW, 36 L    2.0                   0.97   99                                94.4                 93.2
144 B, 36 L     2.0                   0.99   101                               93.3                 94.5
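The relative metal area column can be reproduced from per-wire area ratios. The ratios below are inferred from the table itself rather than stated on the slide: 288 PW wires match the area of 144 B wires, so a PW wire is taken as half a B wire, and adding 36 L wires costs as much as 144 B wires, so an L wire is taken as four B wires. The function name is hypothetical.

```python
# Area of one wire relative to one B wire (inferred from the table rows).
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}

BASELINE = {"B": 144}  # the 144 B-wire link defines relative area 1.0


def relative_metal_area(link, baseline=BASELINE):
    """Total wire area of a link, normalised to the baseline link."""
    def area(wires):
        return sum(AREA_PER_WIRE[kind] * count for kind, count in wires.items())
    return area(link) / area(baseline)


links = [
    {"B": 144},
    {"PW": 288},
    {"PW": 144, "L": 36},
    {"B": 288},
    {"PW": 288, "L": 36},
    {"B": 144, "L": 36},
]
for link in links:
    print(link, relative_metal_area(link))
```

The same ratios are consistent with the 72 B, 18 L, and 144 PW wire counts on the heterogeneous interconnect slide, each of which works out to the area of 72 B wires.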


Sixteen Cluster system: ED2 gains

Link            IPC    Relative processor energy (20%)   Relative ED2 (20%)
144 B           1.11   100                               100
144 PW, 36 L    1.05   94                                105.3
288 B           1.18   105                               93.1
144 B, 36 L     1.19   102                               88.7
288 B, 36 L     1.22   107                               88.7
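The relative ED2 column can be approximated from the IPC and energy columns. The sketch below assumes the standard definition ED2 = energy × delay², with delay taken as inversely proportional to IPC; the slides do not spell out the formula, and the function name is hypothetical. Because the table lists rounded IPC values, the results agree with the published column only to within a few tenths.

```python
def relative_ed2(ipc, rel_energy, base_ipc, base_energy=100.0):
    """Energy x delay^2 relative to the baseline, scaled so the baseline is 100."""
    delay_ratio = base_ipc / ipc          # delay is assumed proportional to 1/IPC
    return 100.0 * (rel_energy / base_energy) * delay_ratio ** 2


BASE_IPC = 1.11  # 144 B-wire baseline, sixteen-cluster system

rows = {
    "144 PW, 36 L": (1.05, 94),
    "288 B": (1.18, 105),
    "144 B, 36 L": (1.19, 102),
    "288 B, 36 L": (1.22, 107),
}
for link, (ipc, energy) in rows.items():
    print(f"{link}: {relative_ed2(ipc, energy, BASE_IPC):.1f}")
```

The squared delay term explains why the 144 B, 36 L configuration comes out ahead of the baseline despite slightly higher energy: its IPC gain is weighted twice.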


Conclusions

Exposing the wire design space to the architecture

A case for micro-architectural wire management!

A low-latency, low-bandwidth network alone helps improve performance by up to 7%

ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect

Entails hardware complexity


Future work

A preliminary evaluation looks promising

Heterogeneous interconnect entails complexity

Design of heterogeneous clusters

Energy efficient interconnect


Questions and Comments?

Thank you!


Backup


L wires: Accelerating cache access

TLB access for page look up

Transmit a few bits of virtual page number on L wires

Prefetch data out of L1$ and TLB

18 L wires (6 tag bits, 8 L1 index and 4 TLB index bits)

Wire type   Crossbar delay (cycles)   Ring hop delay (cycles)
PW wires    3                         6
B wires     2                         4
L wires     1                         2


Model parameters

Simplescalar-3.0 with separate integer and floating point queues

32 KB 2 way Instruction cache

32 KB 4 way Data cache

128 entry 8 way I and D TLB


Overview/Motivation:

Three wire implementations employed in this study

B wires: traditional, optimal delay, huge power consumption

L wires: faster than B wires, lower bandwidth

PW wires: reduced power consumption, higher bandwidth compared to B wires, increased delay through the wires
