TRANSCRIPT
Wire Aware Architecture
Naveen Muralimanohar
Advisor – Rajeev Balasubramonian
University of Utah
Naveen Muralimanohar University of Utah
Effect of Technology Scaling
Power wall, temperature wall, reliability issues
Process variation, soft errors
Wire scaling: communication is expensive but computation is cheap
Wire Delay – Compelling Opportunity
Existing proposals are indirect:
Hide wire delay (prefetching, speculative coherence, run-ahead execution)
Reduce communication to save power
Wire-level optimizations are still limited to circuit designers
Thesis Statement
“The growing cost of on-chip wire delay requires a thorough understanding of wires. The dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management.”
Wire Delay/Power
Pentium 4 (@ 90nm) spent two cycles to send a signal across the chip
Wire delays are costly for performance and power
Latencies of 60 cycles to reach the ends of a chip at 32nm (@ 5 GHz)
50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
Large Caches
Cache hierarchies will dominate chip area
Montecito has two private 12 MB L3 caches (27 MB including L2)
Long global wires are required to transmit data/address
[Figure: Intel Montecito die photo with the two caches marked]
On-Chip Cache Challenges
[Chart: relative cache access time for 4 MB, 16 MB, and 64 MB caches: ~1X at a 130nm process, ~1.5X at 65nm, and ~2X at 32nm; access times calculated using CACTI]
Effect of L2 Hit Time
[Chart: IPC improvement (0–50%) for the SPEC CPU2000 benchmarks ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, and wupwise]
Increase in IPC due to reduction in L2 access time
An aggressive out-of-order processor (L2 hit time reduced from 30 to 15 cycles)
Avg = 17%
Coherence Traffic
CMPs have already become ubiquitous and require coherence among multiple cores
Coherence operations entail frequent communication
+ Different coherence messages have different latency and bandwidth needs
[Figure: three cores with private L1s over a shared L2; read-miss messages (read request, forward of the read request to the owner, latest copy) and write-miss messages (exclusive request, invalidate request, invalidate ack)]
L1 Accesses
Highly latency critical in aggressive out-of-order processors (such as a clustered processor)
The choice of inter-cluster communication fabric has a high impact on performance
On-chip Traffic
[Figure: 16 clusters (P0–P15, each with an I- and D-cache), two controllers, and L2 banks, illustrating the three traffic types: cache reads and writes, coherence transactions, and L1 accesses]
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Wire Characteristics
Wire resistance and capacitance per unit length:

R_wire = ρ / ((thickness − barrier) × (width − 2 × barrier))

C_wire = ε0 × (2K × thickness / spacing + 2 × width / layerspacing) + fringe(ε_horiz, ε_vert)

[Figure: increasing width lowers resistance and increasing spacing lowers capacitance; both come at the cost of bandwidth]
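The resistance and capacitance expressions above can be turned into a small calculator. The geometry and material values below are illustrative assumptions (not the dissertation's exact 65nm parameters); the point is only the shape of the trade-off:

```python
# Per-unit-length wire model sketch; parameter values are hypothetical.
EPS0 = 8.854e-12      # vacuum permittivity, F/m
RHO = 2.2e-8          # copper resistivity, ohm*m (illustrative)

def wire_resistance_per_m(width, thickness, barrier, rho=RHO):
    """R_wire = rho / ((thickness - barrier) * (width - 2*barrier))"""
    return rho / ((thickness - barrier) * (width - 2.0 * barrier))

def wire_capacitance_per_m(width, spacing, thickness, layer_spacing,
                           k=2.7, fringe=0.0):
    """C_wire = eps0*(2K*thickness/spacing + 2*width/layer_spacing) + fringe"""
    return EPS0 * (2.0 * k * thickness / spacing
                   + 2.0 * width / layer_spacing) + fringe

# Doubling width and spacing (an L-wire-like geometry) lowers the RC
# product: a faster but lower-bandwidth wire.
base_r = wire_resistance_per_m(width=135e-9, thickness=270e-9, barrier=10e-9)
base_c = wire_capacitance_per_m(width=135e-9, spacing=135e-9,
                                thickness=270e-9, layer_spacing=270e-9)
fast_r = wire_resistance_per_m(width=270e-9, thickness=270e-9, barrier=10e-9)
fast_c = wire_capacitance_per_m(width=270e-9, spacing=270e-9,
                                thickness=270e-9, layer_spacing=270e-9)
print(base_r * base_c > fast_r * fast_c)   # wider/sparser wire has lower RC
```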
Design Space Exploration
Tuning wire width and spacing:
Base case: B-wires
Fast but low-bandwidth: L-wires (increased width and spacing trade bandwidth for delay)
Design Space Exploration
Tuning repeater size and spacing:
Traditional wires: large repeaters, delay-optimal spacing
Power-optimal wires: smaller repeaters, increased spacing
ED Trade-off in a Repeated Wire
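The trade-off this slide plots can be reproduced numerically: sweep repeater count and size over an Elmore-style delay model, then compare the delay-optimal design with the energy-delay-optimal one. All device and wire constants below are illustrative assumptions, not measured 65nm values:

```python
# Energy-delay trade-off sketch for a repeated wire (hypothetical constants).
Rd = 2000.0     # driver resistance of a min-size repeater, ohm
Cg = 1e-15      # input capacitance of a min-size repeater, F
Rw = 3e5        # wire resistance per meter, ohm/m
Cw = 2e-10      # wire capacitance per meter, F/m
LENGTH = 5e-3   # 5 mm global wire

def delay_energy(n, h):
    """Elmore-style delay and switched capacitance for a wire split into
    n segments, each driven by a repeater of size h (relative to min)."""
    rseg, cseg = Rw * LENGTH / n, Cw * LENGTH / n
    seg = 0.7 * (Rd / h) * (cseg + h * Cg) + rseg * (0.4 * cseg + 0.7 * h * Cg)
    delay = n * seg
    energy = n * h * Cg + Cw * LENGTH   # switched capacitance as energy proxy
    return delay, energy

# Sweep repeater count/size; find the delay-optimal and ED-optimal points.
sweep = [(n, h) for n in range(1, 60) for h in range(1, 200)]
results = {cfg: delay_energy(*cfg) for cfg in sweep}
d_opt = min(sweep, key=lambda cfg: results[cfg][0])
ed_opt = min(sweep, key=lambda cfg: results[cfg][0] * results[cfg][1])
print(d_opt, ed_opt)   # ED-optimal uses fewer/smaller repeaters
```

By construction the ED-optimal wire is no faster than the delay-optimal one but burns less energy, which is exactly the "power-optimal wires: smaller repeaters, increased spacing" point of the previous slide.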
Design Space Exploration
Base case: B-wires (latency 1x, power 1x, area 1x)
Bandwidth-optimized: W-wires (latency 1.6x, power 0.9x, area 0.5x)
Power- and bandwidth-optimized: PW-wires (latency 3.2x, power 0.3x, area 0.5x)
Fast, low-bandwidth: L-wires (latency 0.5x, power 0.5x, area 4x)
Wire Model
[Figure: distributed RC wire model with repeaters, showing side-wall capacitance (C_side-wall) and capacitance to adjacent wires (C_adj)]

Wire Type    Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire 8x    1x                 1x              2.65            1x
B-Wire 4x    1.6x               0.5x            2.9             1.13x
L-Wire 8x    0.5x               4x              1.46            0.55x
PW-Wire 4x   3.2x               0.5x            0.87            0.3x
Ref: Banerjee et al.
65nm process, 10 Metal Layers – 4 in 1X and 2 in each 2X, 4X and 8X plane
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Cache Design Basics
[Figure: cache bank organization; the input address feeds a decoder, wordlines and bitlines access the tag array and data array, column muxes, sense amps, comparators, and mux drivers steer the result, and output drivers produce the valid signal and the data output]
Existing Model - CACTI
[Figure: cache model with 4 sub-arrays vs. 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay
CACTI Shortcomings
Access delay equals the delay of the slowest sub-array, giving very high hit times for large caches
Employs a separate bus for each cache bank in multi-banked caches, which is not scalable
Exploit different wire types and network design choices to reduce access latency
Potential solution: NUCA
Extend CACTI to model NUCA
Non-Uniform Cache Access (NUCA)*
A large cache is broken into a number of small banks
Employs an on-chip network for communication
Access delay depends on the distance between the bank and the cache controller
[Figure: CPU & L1 alongside an array of cache banks]
*(Kim et al., ASPLOS 02)
Extension to CACTI
On-chip network: wire model based on ITRS 2005 parameters
Grid network
3-stage speculative router pipeline
Network latency vs. bank access latency trade-off: iterate over different bank sizes
Calculate the average network delay based on the number of banks and bank sizes
Consider contention values for different cache configurations
Similarly, consider the power consumed by each organization
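The iteration described above can be sketched as a simple search loop. The two latency functions below are hypothetical stand-ins for CACTI's detailed bank and network models; only the structure of the search (and the existence of an interior optimum) is the point:

```python
# Bank-count search sketch: total access time = avg network delay + bank
# access delay. Latency models are illustrative, not CACTI's.
import math

CACHE_MB = 32

def bank_access_cycles(bank_mb):
    """Larger banks are slower (stand-in for CACTI's bank model)."""
    return 4 + 8 * math.sqrt(bank_mb)

def avg_network_cycles(n_banks, hop_cycles=4, contention=0.5):
    """Average hops in a sqrt(n) x sqrt(n) grid, plus the 3-stage router
    and a contention term that grows with bank count (assumptions)."""
    side = math.sqrt(n_banks)
    return side * (hop_cycles + 3 + contention * n_banks / 16)

def total_latency(n_banks):
    return avg_network_cycles(n_banks) + bank_access_cycles(CACHE_MB / n_banks)

candidates = [2, 4, 8, 16, 32, 64]
best = min(candidates, key=total_latency)
print({n: round(total_latency(n), 1) for n in candidates}, "best:", best)
```

With these stand-in models the minimum falls at an intermediate bank count: few banks suffer slow bank access, many banks suffer network hops and contention, mirroring the delay-optimal point on the next slide.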
Trade-off Analysis (32 MB Cache)
[Plot: latency in cycles (0–400) vs. number of banks (2–64), showing the total number of cycles, network latency, bank access latency, and network contention cycles; marks the delay-optimal point]
Effect of Core Count
[Plot: contention cycles (0–300) vs. bank count (2–64) for 4-core, 8-core, and 16-core configurations]
Power Centric Design (32MB Cache)
[Plot: energy per access (0 to 1e-8 J) vs. bank count (2–64) for the 32 MB cache, showing total energy, bank energy, and network energy; marks the power-optimal point]
Search Space of Old CACTI
Design space with global wires optimized for delay
Search Space of CACTI-6
Design space with various wire types; the chart marks the least-delay organization, organizations within a 30% delay penalty, and low-swing wires
Earlier NUCA Models
Made simplified assumptions for network parameters: minimum bank access time, minimum network hop latency, single-cycle router pipeline
Employed 512 banks for a 32 MB cache
+ More bandwidth
- 2.5X less efficient in terms of delay
Outline
Overview
Wire Design Space
Methodology to Design Scalable Caches
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Cache Look-Up
The entire access happens in a sequential manner
[Figure: Core/L1 sends the address to an L2 bank; network routing logic uses 4-6 bits and the decoder 10-15 bits, then the tag and data arrays are accessed, followed by the comparator]
Early Look-Up
Break the sequential access; hides 70% of the bank access time
[Figure: the critical lower-order bits are sent ahead to the L2 bank so tag and data access can begin before the full address arrives at the comparator]
Aggressive Look-Up
[Figure: the critical lower-order bits plus 8 tag bits travel to the L2 bank; the 8-bit partial tag (11100010) is compared against stored tags (e.g. 1101…1101111100010) so the access can start before the full address arrives]
Aggressive Look-Up
Reduction in link delay (for address transfer)
Increase in traffic due to false matches: < 1%
Marginal increase in link overhead (additional 8 bits)
- More logic at the cache controller for tag match
- Address transfer for writes happens on L-wires
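A minimal sketch of the partial-tag idea behind aggressive lookup: ship the index bits plus a few low-order tag bits on the fast L-wires so the bank can begin tag and data access early; a partial-tag match can occasionally be a false positive, resolved when the full address arrives on B-wires. Field widths and tag values here are hypothetical:

```python
# Aggressive-lookup sketch: early candidate selection via partial tags.
PARTIAL_TAG_BITS = 8

def partial_tag(addr, index_bits=10, partial_bits=PARTIAL_TAG_BITS):
    """Low-order tag bits that are sent early on the L-wires."""
    return (addr >> index_bits) & ((1 << partial_bits) - 1)

def early_candidates(stored_tags, addr):
    """Ways whose stored partial tag matches the early bits; the data
    array read can start for these before the full tag arrives."""
    p = partial_tag(addr)
    return [way for way, tag in enumerate(stored_tags)
            if (tag & ((1 << PARTIAL_TAG_BITS) - 1)) == p]

# Full tags stored in one 4-way set (hypothetical values):
tags = [0x1A2B, 0x3C2B, 0x0F00, 0x0001]
addr = (0x1A2B << 10) | 0x155
print(early_candidates(tags, addr))   # ways 0 and 1 share the low 8 tag bits
```

Ways 0 and 1 alias in their low 8 tag bits, illustrating the rare false match the slide bounds at under 1%; the full-tag compare later discards way 1.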
Heterogeneous Network
Routers introduce significant overhead (especially in the L-network): L-wires can transfer a signal across four banks in four cycles, while a router adds three cycles per hop
Modify the network topology to take advantage of wire properties: different topologies for address and data transfers
Hybrid Network
Combination of point-to-point and bus:
Reduction in latency
Reduction in power
Efficient use of L-wires
- Low bandwidth
[Figure: cores attached to shared buses, which connect through routers to the L2 controller]
Experimental Setup
Simplescalar with contention modeled in detail
Single-core, 8-issue out-of-order processor
32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
32KB L1 I-cache and 32KB L1 D-cache with a hit latency of 3 cycles
Main memory latency 300 cycles
CMP Setup
Eight-core CMP (Simplescalar tool)
32 MB, 8-way set-associative (SNUCA organization)
Two cache controllers
Main memory latency 300 cycles
[Figure: eight cores (C1–C8) surrounding the L2 bank array]
Network Model
Virtual channel flow control: four virtual channels per physical channel
Credit-based flow control (for backpressure)
Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination
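The adaptive-routing rule in the last bullet can be sketched as a productive-port filter: at each hop the router may pick any output port that shrinks the Manhattan distance to the destination. Coordinates and port names below are illustrative:

```python
# Minimal-adaptive routing sketch: enumerate productive output ports.
def productive_ports(cur, dst):
    """Output directions that reduce Manhattan distance from cur to dst."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx: ports.append("E")
    if dx < cx: ports.append("W")
    if dy > cy: ports.append("N")
    if dy < cy: ports.append("S")
    return ports

# From (1, 1) toward (3, 0) both E and S are productive; an adaptive
# router can choose between them, e.g. by free-credit count.
print(productive_ports((1, 1), (3, 0)))
```

Because only productive ports are offered, every hop strictly reduces the remaining Manhattan distance, which is the stated routing constraint.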
Cache Models

Model   Bank Access (cycles)   Bank Count   Network Link   Description
1       3                      512          B-wires        Based on prior work
2       17                     16           B-wires        CACTI-6
3       17                     16           B & L-wires    Early lookup
4       17                     16           B & L-wires    Aggressive lookup
5       17                     16           B & L-wires    Hybrid network
6       17                     16           B-wires        Upper bound
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal), shown for all benchmarks and for latency-sensitive benchmarks]
Model derived from CACTI: improvement over the model assumed in prior work – 73% (L2 sensitive – 114%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Early lookup technique: average improvement over Model 2 – 6% (L2 sensitive – 8%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Aggressive lookup technique: average improvement over Model 2 – 8% (L2 sensitive – 9%)
Performance Results (Uniprocessor)
[Chart: normalized IPC (0–3.0) for Models 1–6 (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Hybrid model: average improvement over Model 2 – 15% (L2 sensitive – 20%)
Performance Results (CMP)
[Chart: normalized IPC (0–1.3) per benchmark set (mix, all L2-sensitive, half L2- and half non-L2-sensitive, memory-intensive, average) under the base, early lookup, aggressive lookup, hybrid, and ideal models]
Performance Results (4X – Wires)
Wire-delay-constrained model: performance improvements are better
Early lookup – 7%
Aggressive model – 20%
Hybrid model – 29%
[Chart: normalized IPC (0–4.50) for Models 1–6, all benchmarks and latency-sensitive benchmarks]
NUCA Design
Network parameters play a significant role in the performance of large caches
The modified CACTI model, which includes network overhead, performs 51% better than previous models
Methodology to compute an optimal baseline NUCA
NUCA Design II
Wires can be tuned for different metrics
Routers impose non-trivial overhead
Address and data have different bandwidth needs
We introduce heterogeneity at three levels
Different types of wires for address and data transfers
Different topologies for address and data networks
Different architectures within address network (point-to-point and bus)
(Yields an additional performance improvement of 15% over the optimal baseline NUCA)
Outline
Overview
Methodology to Design Scalable Caches
Wire Design Space
Heterogeneous Wires for Large Caches
Heterogeneous Wires for Coherence Traffic
Conclusions
Directory Based Protocol (Write-Invalidate)
Map critical/small messages on L-wires and non-critical messages on PW-wires
Hop imbalance in messages, e.g.:
Read-exclusive request for a block in shared state
Read request for a block in exclusive state
Negative acknowledgement (NACK) messages
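One way to picture the mapping: classify each coherence message by criticality and size, then route it on the matching wire type. The message names and the 64-bit size threshold below are illustrative assumptions, not the protocol's full message set:

```python
# Sketch of message-to-wire mapping: small, critical messages ride the
# low-latency L-wires; latency-insensitive ones take the power-optimized
# PW-wires; everything else defaults to baseline B-wires.
CRITICAL_SMALL = {"nack", "inval_req", "inval_ack", "spec_reply_req"}
NON_CRITICAL = {"writeback_data", "fwd_dirty_copy"}

def pick_wire(msg_type, size_bits):
    if msg_type in CRITICAL_SMALL and size_bits <= 64:
        return "L"      # fast, low bandwidth
    if msg_type in NON_CRITICAL:
        return "PW"     # slow, power optimized
    return "B"          # baseline

print(pick_wire("nack", 32),
      pick_wire("writeback_data", 512),
      pick_wire("read_req", 64))
```

The size check matters because L-wires are low bandwidth: only messages that fit in a few flits can afford them without serialization delay.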
Exclusive request for a shared copy:
1. Rd-Ex request from processor 1
2. Directory sends clean copy to processor 1
3. Directory sends invalidate message to processor 2
4. Cache 2 sends acknowledgement back to processor 1
[Figure: Cache 1/Processor 1, L2 & directory, and Cache 2/Processor 2 exchanging messages 1–4, with critical and non-critical messages distinguished]
Read to an Exclusive Block
[Figure: Proc 1 (L1), L2 & directory, and Proc 2 (L1) exchanging messages on a read to an exclusive block: Read Req, Spec Reply, Req, ACK, Fwd Dirty Copy, and WB Data; one message is marked critical and two non-critical]
Evaluation Platform & Simulation Methodology
Virtutech Simics functional simulator
Ruby timing model (GEMS)
SPLASH suite
[Figure: CMP with processors and a shared L2]
Heterogeneous Model
[Figure: CMP with processors and L2, with links built from L-wires, B-wires, and PW-wires]
11% performance improvement
22.5% power savings in wires
Summary
Coherence messages have diverse needs
Intelligent mapping of these messages to wires in a heterogeneous network can improve both performance and power
Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmark suite
Non-critical traffic on the power-optimized network decreases wire power by 22.5%
Ref: Interconnect Aware Coherence Protocol (ISCA 06), in collaboration with Liqun Cheng
On-Core Communications
L-wires: narrow bit-width operands, branch mispredict signal
PW-wires: non-critical register values, ready registers, store data
11% improvement in ED^2
Results Summary
[Figure: 16-cluster chip (P0–P15, each with I- and D-caches), two controllers, and L2 banks, annotated with the results per traffic type]
Cache reads and writes: 114% processor performance improvement, 50% power savings
Coherence transactions: 11% performance improvement, 22.5% power savings in wires
L1 accesses: 7% performance improvement, 11% ED^2 improvement
Conclusion
The impact of interconnect choices in modern processors is significant
Architectural-level wire management can improve both the power and performance of future communication-bound processors
Architects have a lot to offer in the area of wire-aware design
Future Research
Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
Transactional memory
Network to support register-register communication
Dynamic adaptation
Acknowledgements
Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
CS front office staff
Lab-mates: Karthik, Niti, Liqun, and other fellow grads
Avenues Explored
Inter-core communication (ISCA 2006)
Memory hierarchy (ISCA 2007)
CACTI 6.0, publicly released (MICRO 2007; IEEE Micro Top Picks 2008)
Out-of-order core (HPCA 2005, IEEE Micro 06)
Power- and Temperature-Aware Architectures (ISPASS 2006)
Current projects or under submission:
Scalable and Reliable Transactional Memory (PACT 08)
Rethinking Fundamentals: Route Wires or Packets?
3D Reconfigurable Caches