February 14th 2005, University of Utah
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
Processor Architecture
Overview/Motivation
Wire delays hamper performance.
Power incurred in movement of data
50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)
Abundant number of metal layers
Wire characteristics
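Wire resistance and capacitance per unit length:
$$R_{wire} = \frac{\rho}{(\mathrm{thickness} - \mathrm{barrier})(\mathrm{width} - 2 \cdot \mathrm{barrier})}$$
$$C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\frac{\mathrm{thickness}}{\mathrm{spacing}} + 2\epsilon_{vert}\frac{\mathrm{width}}{\mathrm{layerspacing}} \right) + \mathrm{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$
Increasing width: R decreases, C increases
Increasing spacing: C decreases
Net effect: delay decreases (as delay ∝ RC), bandwidth decreases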
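The width and spacing trends can be checked numerically with a minimal sketch. It evaluates the per-unit-length resistance rho / ((thickness - barrier)(width - 2*barrier)) and the capacitance eps0 * (2K*eps_horiz*thickness/spacing + 2*eps_vert*width/layerspacing) + fringe. All geometry and material values below are hypothetical placeholders chosen only for illustration, and the fringe term is assumed to be a constant (zero here); none of them come from the slides.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m

def r_wire(rho, thickness, width, barrier):
    """Resistance per unit length: rho / ((thickness - barrier) * (width - 2*barrier))."""
    return rho / ((thickness - barrier) * (width - 2.0 * barrier))

def c_wire(k, eps_horiz, eps_vert, thickness, spacing, width, layer_spacing, fringe=0.0):
    """Capacitance per unit length; `fringe` stands in for fringe(eps_horiz, eps_vert)."""
    side = 2.0 * k * eps_horiz * thickness / spacing      # coupling to neighbouring wires
    plate = 2.0 * eps_vert * width / layer_spacing        # coupling to adjacent layers
    return EPS0 * (side + plate) + fringe

# Hypothetical values (metres, ohm-metres); assumptions for illustration only.
THICKNESS, WIDTH, SPACING, LAYER_SPACING = 300e-9, 150e-9, 150e-9, 300e-9
RHO, BARRIER, K, EPS_H, EPS_V = 2.2e-8, 10e-9, 1.5, 2.7, 2.7

def rc_per_unit_length(width, spacing):
    """Return (R, C, R*C) per unit length for a given width and spacing."""
    r = r_wire(RHO, THICKNESS, width, BARRIER)
    c = c_wire(K, EPS_H, EPS_V, THICKNESS, spacing, width, LAYER_SPACING)
    return r, c, r * c

r_base, c_base, rc_base = rc_per_unit_length(WIDTH, SPACING)
r_wide, c_wide, rc_wide = rc_per_unit_length(2 * WIDTH, SPACING)       # width only
r_both, c_both, rc_both = rc_per_unit_length(2 * WIDTH, 2 * SPACING)   # width and spacing

print(f"wider wire:      R x{r_wide / r_base:.2f}, C x{c_wide / c_base:.2f}")
print(f"wider + spaced:  R x{r_both / r_base:.2f}, C x{c_both / c_base:.2f}, RC x{rc_both / rc_base:.2f}")
```

With these placeholder numbers, widening alone lowers R and raises C, while widening and spacing together lower the RC product, at the cost of fewer wires fitting in the same metal area.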
Design space exploration
Tuning wire width and spacing
[Figure: B wires at pitch d compared with wider, more widely spaced wires at pitch 2d; annotated with the resulting changes in resistance, capacitance, and bandwidth]
Transmission Lines
Similar to L wires - extremely low delay
Constraining implementation requirements!
Large width
Large spacing between wires
Design of sensing circuits
Implemented in test CMOS chips
Design space exploration
Tuning Repeater size and spacing
Traditional wires: large repeaters, optimum spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay and power for the two repeater configurations]
Design space exploration
Delay optimized: B wires
Bandwidth optimized: W wires
Power optimized: P wires
Power and bandwidth optimized: PW wires
Fast, low bandwidth: L wires
Heterogeneous Interconnects
Intercluster global interconnect
72 B wires: repeaters sized and spaced for optimum delay
18 L wires: wide wires and large spacing; occupies more area; low latencies
144 PW wires: poor delay; high bandwidth; low power
Outline
Overview
Design Space Exploration
Heterogeneous Interconnects
Employing L wires for performance
PW wires: The power optimizers
Evaluation
Results
Conclusion
L1 Cache pipeline
[Figure: baseline load path between the LSQ and the L1 DCache]
Eff. address transfer: 10c
Mem. dep. resolution: 5c
Cache access: 5c
Data return at 20c
Exploiting L-Wires
[Figure: load path between the LSQ and the L1 DCache with an early partial-address transfer on L wires]
Eff. address transfer: 10c
Partial mem. dep. resolution: 3c
Cache access: 5c
8-bit transfer: 5c
Data return at 14c
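As a quick arithmetic check, the sketch below sums the three serial stages of the baseline pipeline (10c + 5c + 5c) and compares the result with the 14c return time quoted for the L-wire design. The 14c figure is taken directly from the slide; how the overlapped stages combine to reach it is not derived here.

```python
# Baseline stage latencies in cycles, as listed for the L1 cache pipeline.
baseline_stages = {
    "eff_address_transfer": 10,
    "mem_dep_resolution": 5,
    "cache_access": 5,
}

# The baseline stages are serial, so the load data returns after their sum.
baseline_return = sum(baseline_stages.values())

# Quoted on the slide for the L-wire design; not derived in this sketch.
l_wire_return = 14

saved_cycles = baseline_return - l_wire_return
reduction = saved_cycles / baseline_return

print(f"baseline: {baseline_return}c, with L wires: {l_wire_return}c, "
      f"saved: {saved_cycles}c ({reduction:.0%})")
```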
L wires: Accelerating cache access
Transmit the least significant bits (LSBs) of the effective address through L wires
Partial comparison of loads and stores in LSQ
Faster memory disambiguation
Introduces false dependences (< 9%)
Indexing data and tag RAM arrays
LSBs can prefetch data out of L1$
Reduce access latency of loads
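The partial comparison can be illustrated with a minimal sketch: a load is checked against earlier stores using only the low-order address bits. Every true dependence is still found, but stores that merely share the same low bits show up as false dependences. The function names, the 8-bit width, and the sample addresses are assumptions for illustration, and the sketch compares whole addresses for equality, ignoring access sizes.

```python
def low_bits(addr, nbits):
    """Keep only the `nbits` least significant bits of an address."""
    return addr & ((1 << nbits) - 1)

def partial_conflicts(load_addr, store_addrs, nbits):
    """Stores whose low `nbits` match the load: possible dependences."""
    key = low_bits(load_addr, nbits)
    return [s for s in store_addrs if low_bits(s, nbits) == key]

def true_conflicts(load_addr, store_addrs):
    """Stores to exactly the same address: real dependences."""
    return [s for s in store_addrs if s == load_addr]

# Hypothetical addresses of earlier stores in the LSQ.
stores = [0x1040, 0x2F40, 0x1048]
load = 0x1040

partial = partial_conflicts(load, stores, nbits=8)
full = true_conflicts(load, stores)
false_deps = [s for s in partial if s not in full]

print("partial matches:", [hex(s) for s in partial])
print("true matches:   ", [hex(s) for s in full])
print("false deps:     ", [hex(s) for s in false_deps])
```

Here 0x2F40 matches the load in its low 8 bits but not in the full address, so it is reported as a dependence until the full address arrives.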
L wires: Narrow bit width operands
Transfer of 10-bit integers on L wires
Schedule wake-up operations
Reduction in branch mispredict penalty
A predictor table of 8K two-bit counters
Identifies 95% of all narrow bit-width results
Accuracy of 98%
Implemented in the PowerPC!
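A table of two-bit saturating counters for predicting narrow results might look like the sketch below. The table size (8K) and the 10-bit threshold follow the slide; the indexing by program counter, the prediction threshold, the initial counter value, and the treatment of values as unsigned are all assumptions made for illustration.

```python
class NarrowWidthPredictor:
    """Predicts whether an instruction's result fits in `narrow_bits` bits."""

    def __init__(self, entries=8192, narrow_bits=10):
        self.entries = entries
        self.narrow_bits = narrow_bits
        self.counters = [0] * entries  # two-bit saturating counters, range 0..3

    def _index(self, pc):
        # Assumed indexing: drop the two alignment bits, then wrap to table size.
        return (pc >> 2) % self.entries

    def is_narrow(self, value):
        # Assumed definition: an unsigned value that fits in `narrow_bits` bits.
        return 0 <= value < (1 << self.narrow_bits)

    def predict(self, pc):
        """Predict narrow when the counter is in its upper half (2 or 3)."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, value):
        """Train the counter with the actual result of the instruction."""
        i = self._index(pc)
        if self.is_narrow(value):
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = NarrowWidthPredictor()
pc = 0x400100                      # hypothetical instruction address
before = predictor.predict(pc)     # counter starts at 0
predictor.update(pc, 5)
predictor.update(pc, 7)
after_narrow = predictor.predict(pc)
predictor.update(pc, 100000)       # one wide result weakens the prediction
after_wide = predictor.predict(pc)
print(before, after_narrow, after_wide)
```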
PW wires: Power/Bandwidth efficient
Idea: steer non-critical data through energy-efficient PW interconnect
Transfer of data at instruction dispatch
Transfer of input operands to remote register file
Covered by long dispatch to issue latency
Store data
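One way to picture the steering decision is a small selection function: non-critical transfers go to PW wires, narrow or partial-address transfers go to L wires, and everything else stays on B wires. This is only a sketch. The message-kind strings and the priority order are hypothetical, and the latencies are the crossbar figures used in the evaluation (L: 1 cycle, B: 2 cycles, PW: 3 cycles).

```python
from enum import Enum

class Wire(Enum):
    L = "L"    # low latency, low bandwidth
    B = "B"    # baseline, delay-optimized
    PW = "PW"  # power- and bandwidth-optimized, slower

# Crossbar latencies in cycles, as used in the four-cluster evaluation.
CROSSBAR_LATENCY = {Wire.L: 1, Wire.B: 2, Wire.PW: 3}

# Hypothetical message kinds; the slides name these transfers but define no encoding.
NON_CRITICAL = {"dispatch_operand", "store_data"}

def choose_wire(kind, narrow=False):
    """Pick a wire class for one transfer (assumed policy, for illustration)."""
    if kind in NON_CRITICAL:
        return Wire.PW          # extra latency is hidden or off the critical path
    if narrow or kind == "address_lsb":
        return Wire.L           # few bits, latency-critical
    return Wire.B               # default full-width transfer

for kind, narrow in [("store_data", False), ("result", True), ("result", False)]:
    wire = choose_wire(kind, narrow)
    print(f"{kind:12s} narrow={narrow!s:5s} -> {wire.value} ({CROSSBAR_LATENCY[wire]} cycles)")
```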
Evaluation methodology
A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0
Crossbar interconnects
Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor
Crossbar latencies: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles)
[Figure: clusters and the L1 DCache connected by B, L, and PW wires]
Evaluation methodology
A dynamically scheduled 16-cluster processor, modeled in Simplescalar-3.0
Ring latencies: B wires (4 cycles), PW wires (6 cycles), L wires (2 cycles)
[Figure: I-Cache, D-cache, LSQ, and clusters connected by crossbars and a ring interconnect]
IPC improvements: L wires
L wires improve performance by 4% on a four-cluster system and 7.1% on a sixteen-cluster system
Four cluster system: ED2 gains
Link | Relative metal area | IPC | Relative processor energy (10%) | Relative ED2 (10%) | Relative ED2 (20%)
144 B | 1.0 | 0.95 | 100 | 100 | 100
288 PW | 1.0 | 0.92 | 97 | 103.4 | 100.2
144 PW, 36 L | 1.5 | 0.96 | 97 | 95.0 | 92.1
288 B | 2.0 | 0.98 | 103 | 96.6 | 99.2
288 PW, 36 L | 2.0 | 0.97 | 99 | 94.4 | 93.2
144 B, 36 L | 2.0 | 0.99 | 101 | 93.3 | 94.5
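The relative ED2 column can be approximately reproduced from the IPC and relative energy columns. The sketch below assumes that delay scales as 1/IPC (same clock and instruction count), so relative ED2 = relative energy × (baseline IPC / IPC)². That relationship is an assumption about how the columns relate, not a statement of the authors' procedure, and because the table entries are rounded the recomputed values only match to within a fraction of a point.

```python
def relative_ed2(rel_energy, ipc, base_ipc):
    """Relative ED^2 in percent, taking delay as 1/IPC.

    (delay / base_delay)^2 == (base_ipc / ipc)^2 under that assumption.
    """
    return rel_energy * (base_ipc / ipc) ** 2

BASE_IPC = 0.95  # the 144 B baseline row

# link: (IPC, relative energy at 10%, reported relative ED2 at 10%)
rows = {
    "288 PW": (0.92, 97, 103.4),
    "144 PW, 36 L": (0.96, 97, 95.0),
    "288 B": (0.98, 103, 96.6),
    "288 PW, 36 L": (0.97, 99, 94.4),
    "144 B, 36 L": (0.99, 101, 93.3),
}

computed = {link: relative_ed2(energy, ipc, BASE_IPC)
            for link, (ipc, energy, _) in rows.items()}

for link, (_, _, reported) in rows.items():
    print(f"{link:13s} computed {computed[link]:6.1f}   reported {reported}")
```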
Sixteen Cluster system: ED2 gains
Link | IPC | Relative processor energy (20%) | Relative ED2 (20%)
144 B | 1.11 | 100 | 100
144 PW, 36 L | 1.05 | 94 | 105.3
288 B | 1.18 | 105 | 93.1
144 B, 36 L | 1.19 | 102 | 88.7
288 B, 36 L | 1.22 | 107 | 88.7
Conclusions
Exposing the wire design space to the architecture
A case for micro-architectural wire management!
A low-latency, low-bandwidth network alone improves performance by up to 7%
ED2 improvements of about 11% compared to a baseline processor with a homogeneous interconnect
Entails hardware complexity
Future work
A preliminary evaluation looks promising
Heterogeneous interconnect entails complexity
Design of heterogeneous clusters
Energy efficient interconnect
Questions and Comments?
Thank you!
Backup
L wires: Accelerating cache access
TLB access for page lookup
Transmit a few bits of the virtual page number on L wires
Prefetch data out of L1$ and TLB
18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)
Wire type | Crossbar delay | Ring hop delay
PW wires | 3 | 6
B wires | 2 | 4
L wires | 1 | 2
Model parameters
Simplescalar-3.0 with separate integer and floating-point queues
32 KB 2-way instruction cache
32 KB 4-way data cache
128-entry 8-way I and D TLB
Overview/Motivation:
Three wire implementations employed in this study
B wires: traditional; optimal delay; huge power consumption
L wires: faster than B wires; lower bandwidth
PW wires: reduced power consumption; higher bandwidth compared to B wires; increased delay through the wires