Lazy Logic
Mikko H. Lipasti, Associate Professor
Department of Electrical and Computer Engineering
University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm
July 6, 2007. Mikko Lipasti, University of Wisconsin. Seminar, University of Toronto.
CMOS History

CMOS has been a faithful servant
- 40+ years since invention
- Tremendous advances: device size, integration level, voltage scaling, yield, manufacturability, reliability
- Nearly 20 years now as the high-performance workhorse
- Result: life has been easy for architects, and ease leads to complacency and laziness
CMOS Futures

"The reports of my demise are greatly exaggerated." – Mark Twain
CMOS has some life left in it
- Device scaling will continue
- What comes after CMOS?
Many new challenges
- Process variability
- Device reliability
- Leakage power
- Dynamic power (focus of this talk)
Dynamic Power

Static CMOS: current flows when transistors switch
- Combinational logic evaluates new inputs
- Flip-flop or latch captures the new value (clock edge)
Terms
- C: capacitance of the circuit (wire length, number and size of transistors)
- V: supply voltage
- A: activity factor
- f: frequency

    P_dyn = k * sum over units i of (C_i * V^2 * A_i * f)

Architects can and should focus on C_i x A_i
- Reduce the capacitance of each unit
- Reduce the activity of each unit
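The power equation can be sketched numerically. A minimal sketch; the unit capacitances and activity factors below are made-up illustrative values, not figures from the talk:

```python
def dynamic_power(units, V, f, k=1.0):
    """Sum per-unit dynamic power: k * C_i * V^2 * A_i * f."""
    return k * sum(C * V**2 * A * f for C, A in units)

# Two hypothetical units: (capacitance in farads, activity factor)
units = [(1e-12, 0.5), (2e-12, 0.1)]
p = dynamic_power(units, V=1.0, f=2e9)       # 1 V supply, 2 GHz clock

# The lazy-logic lever: halving each unit's activity halves its
# power contribution, with V and f untouched.
p_lazy = dynamic_power([(C, A / 2) for C, A in units], V=1.0, f=2e9)
assert abs(p_lazy - p / 2) < 1e-15
```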
Design Objective Inversion

Historically, hardware was expensive
- Every gate, wire, cable, and unit mattered
- Squeeze maximum utilization from each
Now, power is expensive
- On-chip devices and wires, not so much
- Should minimize C_i x A_i
- Logic should be simple and infrequently used, both sequential and combinational
=> Lazy Logic
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
What is Lazy Logic?

A design philosophy with some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase the number of units/wires/devices, as long as the reduced activity (A_i) compensates; don't forget leakage
Result
- Reject conventional "good ideas"
- Reduce power without loss of performance; sometimes improve performance
Lazy Logic Applications

CMP interconnection networks
- Old: packet-switched, store-and-forward
- New: circuit-switched, reconfigurable
Stall cycle redistribution
- Transparent pipelines want fine-grained stalls
- Redistribute coarse stalls into fine stalls
High-performance dynamic scheduling
- Cycle time goal achieved by replicating ALUs
CMP Interconnection Networks

Options
- Buses don't scale
- Crossbars are too expensive
- Rings are too slow
- Packet-switched mesh: attractive for all the DSM reasons (scalable, low latency, high link utilization)
CMP Interconnection Networks

But cables/traces are now on-chip wires
- Fast, cheap, plentiful
- Short: 1 cycle per hop
Router latency adds up
- 3-4 CPU cycles per hop
- Store-and-forward
- Lots of activity/power
Is this the right answer?
Circuit-switched Interconnects

Communication patterns
- Spatial locality to memory
- Pairwise communication
Circuit-switched links
- Avoid switching/routing
- Reduce latency
- Save power?
Router Design

- Switches can be logically configured to appear as wires (no routing overhead)
- Can also act as a packet-switched network
- Can switch back and forth very easily
- Detailed router design not presented here
[Diagram: 5-port router with N, S, E, W, and processor (P) ports]
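As a rough illustration of the configurable-switch idea (the detailed router design is deferred here), a toy port model: in circuit mode the port behaves like a wire with no buffering or routing activity, in packet mode it buffers and routes each flit. Class and method names are hypothetical:

```python
class SwitchPort:
    """Toy model of one router port: either part of a configured
    circuit (pass-through wire) or a packet-switched hop."""

    def __init__(self):
        self.circuit_out = None   # fixed output port while a circuit is up
        self.buffer = []          # input buffer, used only in packet mode

    def configure_circuit(self, out_port):
        self.circuit_out = out_port

    def teardown_circuit(self):
        self.circuit_out = None

    def receive(self, flit, route_fn):
        if self.circuit_out is not None:
            return self.circuit_out   # wire-like: no buffering, no route computation
        self.buffer.append(flit)      # packet mode: buffer, then route
        return route_fn(flit)

port = SwitchPort()
assert port.receive("flit0", route_fn=lambda f: "east") == "east"
port.configure_circuit("west")
assert port.receive("flit1", route_fn=lambda f: "east") == "west"
assert port.buffer == ["flit0"]   # the circuit-mode flit never touched the buffer
```

The activity saving comes from the circuit branch: no buffer write and no routing function evaluation per flit.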
Dirty Miss Coverage

[Chart: % of dirty misses covered (40%-100%) vs. number of circuit-switched connections per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Directory Protocol

- Initial 3-hop miss establishes the CS path
Subsequent miss requests
- Sent directly on the CS path to the predicted owner, and in parallel to the home node
- Predicted owner sources data early
- Directory acks the sharing-list update
Benefits
- Reduced 3-hop latency
- Less activity, less power
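The subsequent-miss flow can be sketched as a list of (source, destination, message) tuples; the message names are illustrative, not the protocol's actual message types:

```python
def subsequent_miss(requester, predicted_owner, home):
    """Messages for a miss after a CS path exists (names illustrative)."""
    return [
        (requester, predicted_owner, "data_req_cs"),  # direct, over the CS path
        (requester, home, "dir_req"),                 # in parallel, to the home node
        (predicted_owner, requester, "data_early"),   # predicted owner sources data early
        (home, requester, "dir_ack"),                 # directory acks sharing-list update
    ]

msgs = subsequent_miss("P0", "P2", "Dir3")
assert msgs[0] == ("P0", "P2", "data_req_cs")
assert ("Dir3", "P0", "dir_ack") in msgs
```

The latency win is visible in the first tuple: data is requested from the owner directly instead of waiting for a 3-hop directory indirection.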
Circuit-switched Performance

[Chart: normalized cycle count (0-1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Base, Fully Connected/Oracle, Limit 1/Oracle, and Limit 1/Region Prediction]
Link Activity

[Chart: normalized link activity (0%-100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Limit 1/Oracle and Limit 1/Region Prediction]
Buffer Activity

[Chart: normalized input buffer activity (0%-100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, and Radiosity, comparing Limit 1/Oracle and Limit 1/Region Prediction]
Circuit-switched Coherence Summary

Reconfigurable interconnect with circuit-switched links
- Some performance benefit
- Substantial reduction in activity
Current status (slides are out of date)
- Router design and physical/area models
- Protocol tuning and tweaks
- Initial results in a CA Letters paper
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
April 21, 2023. Eric L. Hill – Preliminary Exam.
Pipeline Clocking Revisited

[Diagram: pipeline carrying work units A and B; two units of work, 10 clock pulses]
- Latches are clocked to propagate data
- Conventional pipeline clock gating: each valid work unit gets clocked into each latch
- This is needlessly conservative
Transparent Pipeline Gating

[Diagram: pipeline carrying work units A and B; two units of work, 5 clock pulses]
Transparent pipelining: a novel approach to clocking [Jacobsen 2004, 2005]
- Both master and slave latches can remain transparent
- Gating logic ensures no races; pipeline registers are clocked lazily, only when a race occurs
- Quite effective for low-utilization pipelines: gaps between valid work units enable transparent mode
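A toy pulse-count model of this comparison, under the simplifying assumption that a latch must be clocked only to separate back-to-back valid items; the real gating logic in [Jacobsen 2004, 2005] is per-stage and more subtle:

```python
def conventional_pulses(items, n_latches):
    """Conventional clock gating: every latch is pulsed once per valid item."""
    return sum(items) * n_latches

def transparent_pulses(items, n_latches):
    """Transparent mode: pulse latches only when two valid items are
    back-to-back and would race through transparent latches."""
    races = sum(1 for a, b in zip(items, items[1:]) if a and b)
    return races * n_latches

stream = [1, 0, 1, 0]                      # two items with a 1-cycle gap
assert conventional_pulses(stream, 5) == 10
assert transparent_pulses(stream, 5) == 0  # gaps: latches stay transparent
assert transparent_pulses([1, 1], 5) == 5  # back-to-back: must clock
```

This also shows why a 1-cycle gap already provides most of the benefit: a single bubble between items removes the race entirely in this toy model.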
Applications

Best suited for low-utilization pipelines
- E.g., FP and media-processing functional units
High-utilization pipelines see the least benefit
- E.g., instruction fetch pipelines
To benefit from the transparent approach, valid data items need fine-grained gaps (stalls); a 1-cycle gap provides the lion's share (50%)
Application: Front-end Pipelines

Provide the back-end with a sufficient supply of instructions to find ILP
- High branch prediction accuracy
- Low instruction cache miss rates
- Little opportunity for clock gating: designed to feed peak demand
- Poor match for transparent pipeline gating
In-Order Execution Model

In-order cores
- Power efficient, low design complexity, throughput oriented
- CMP systems trending towards simple cores (e.g., Sun Niagara)
Data dependences cause fine-grained stalls at dispatch
- Can we project these back to fetch?
- Exploit fetch slack
Pipeline Diagram

[Diagram: front-end pipeline with branch predictor (PC, bpred update), instruction fetch, issue buffer, and clock vector feeding the execution core]
Available Fetch Slack

[Chart: fraction of instruction groups observed (0-1) with fetch slack of 0 through 7+ cycles]
Implementation

- Stall cycle bits embedded in the BTB; EPIC ISAs (IA-64) could use stop bits
- Verify predictions by observing unperturbed groups: let high-confidence groups periodically execute unperturbed and observe any overall increase in execution time
- Modeled a Cell PPU-like PowerPC core with aggressive clock gating
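The redistribution idea can be sketched as follows: given a predicted dispatch-stall count per instruction group (e.g., from bits stored in the BTB), fetch inserts that many fine-grained bubbles ahead of the group, so the transparent front-end sees 1-cycle gaps instead of a burst followed by a coarse stall. A simplified sketch, not the actual hardware:

```python
def redistribute(groups):
    """groups: list of (group_id, predicted_dispatch_stall_cycles).
    Returns a fetch schedule with the stalls spread as bubbles (None)
    ahead of each group, exploiting the available fetch slack."""
    schedule = []
    for gid, stall in groups:
        schedule.extend([None] * stall)  # fine-grained bubbles at fetch
        schedule.append(gid)
    return schedule

sched = redistribute([("g0", 0), ("g1", 2), ("g2", 1)])
assert sched == ["g0", None, None, "g1", None, "g2"]
```

Total cycles are unchanged (the stalls would have happened at dispatch anyway); only their position moves, which is why performance is preserved.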
Latch Activity Reduction

[Chart: normalized latch activity factor (0-1.2) per benchmark, comparing scr and scr+tcg]
FE Energy-Delay Product

[Chart: normalized front-end energy-delay product (J*s, 0-1.2), broken into fe_latch, bpred, and icache components, for base, scr, and scr+tpg]
Stall Cycle Redistribution Summary [ISLPED 2006]

- Transparent pipelines reduce latch activity, but are not effective in pipelines with coarse-grained stalls (e.g., fetch)
- Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
Benefits
- Equivalent performance, lower power
- A transparent fetch pipeline is now attractive
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
A Brief Scheduler Overview

[Diagrams: pipeline evolution from an atomic schedule/execute stage (Fetch, Decode, Sched/Exe, Writeback, Commit) to a pipelined scheduler (Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit) with a wakeup/select loop, and finally to speculative wakeup/select, where speculatively issued instructions must be re-scheduled when a latency is mispredicted, e.g., an invalid input value when a load's latency changes]

Data capture vs. non-data capture schedulers, and speculative scheduling
- A data capture scheduler is desirable for many reasons, but its cycle time is not competitive because of data-path delay
- Current machines use speculative scheduling instead
- Misscheduled/replayed instructions burn power: depending on the recovery policy, up to 17% of issued instructions need to replay
Slicing the Core

Bitslice the core: narrow (16b) and wide (64b)
- The narrow core can be full data capture
- Still makes aggressive cycle time (with lazy logic)
- Completely nonspeculative, virtually no replays
- Further power benefits (not in this talk)
[Diagram: front-end, bitsliced OoO core, back-end]
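The bitsliced computation can be illustrated with a 16-bit narrow add whose carry-out is handed to the wide slice. This is a sketch of the arithmetic split only, not the actual datapath:

```python
NARROW = 16
NMASK = (1 << NARROW) - 1

def narrow_add(a, b):
    """Narrow-slice add: returns (low 16 result bits, carry out).
    The low bits alone are enough to steer scheduling decisions."""
    s = (a & NMASK) + (b & NMASK)
    return s & NMASK, s >> NARROW

def wide_add(a, b, carry_in):
    """Wide-slice add of the upper bits, consuming the narrow carry."""
    return ((a >> NARROW) + (b >> NARROW) + carry_in) << NARROW

a, b = 0x0001FFFF, 0x00000001          # carry propagates across the slice
low, carry = narrow_add(a, b)
assert (wide_add(a, b, carry) | low) == a + b
```

The narrow result is available a slice earlier; the wide core only has to finish the upper bits if anyone needs them.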
Dynamic Scheduling with Partial Operand Values

Narrow core
- Computes the partial operand
- Determines load latency
- Avoids misscheduling
Wide core
- Computes the rest of the operand (if needed)
[Diagram: pipeline with Fetch, Decode, Sched & Narrow Exe, Dispatch, RF, Exe, Writeback/Recover, Commit; wakeup/select happens in the narrow slice, and the rest of the data follows in the wide slice]
Scheduler w/ Narrow Data-Path

Non-data capture scheduler critical loop:
- Select - mux - tag broadcast & compare - ready write
[Diagram (a): scheduler entry with ROB ID, Tag1/Data1, Tag2/Data2, comparators, select logic, and destination tags; outputs feed the int ALU, LSQ, cache, and adder, and on to the wide data path]
Naive narrow data capture scheduler:
- Select - mux - tag broadcast & compare - ready write
- Select - mux - narrow ALU - data broadcast - data write
- Increased cycle time
Scheduler w/ Embedded ALUs

[Diagram (b): scheduler entry with ROB ID, Tag1/Data1/R, Tag2/Data2/R, comparators, muxes, embedded int ALUs, a latch, and select logic; outputs feed the int ALU, LSQ, and cache, and on to the wide data path]
With embedded ALUs:
- Select - mux - tag broadcast & compare - ready write
- Max(select, data broadcast - mux - narrow ALU) - mux - latch setup
Lazy Logic: replicated ALUs, low utilization, off the critical delay path
Cycle Time, Area, Energy

32 entries, implemented in Verilog; synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11um library

Scheduler                     Cycle Time (ns)   Area (mm2)   Energy (nJ)
Full-Data Capture             2.04              1.43         1.54
Non-Data Capture              1.28              1.53         1.48
Narrow-Data Capture w/ ALUs   1.28              1.49         1.46
Narrow-Data Capture           1.71              1.98         1.40
Dynamic Scheduling Summary

Benefits [JILP 2007]
- Saves 25-30% of total OoO window energy, i.e., 12-18% of total dynamic chip power
- Reduces misspeculated loads by 75%-80%
- Slightly improved IPC, comparable cycle time
Enabled by lazy narrow ALUs
- ALUs are cheap, so compute in parallel with the scheduling select logic
Talk Outline

- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic: circuit-switched coherence, stall-cycle redistribution, dynamic scheduling
- Conclusions
- Research Group Overview
Conclusions

Lazy Logic: a promising new design philosophy
Some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase the number of units/wires/devices
Initial results
- Circuit-switched CMP interconnects
- Stall cycle redistribution
- Dynamic scheduling
Who Are We?

Faculty: Mikko Lipasti
Current Ph.D. students:
- Profligate execution: Gordie Bell (joining IBM in 2006)
- Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
- Lazy Logic: circuit-switched coherence (Natalie Enright), stall cycle redistribution (Eric Hill), dynamic scheduling (Erika Gunadi)
- Dynamic code optimization: Lixin Su
- SMT/CMP scheduling and resource allocation: Dana Vantrease
Pharmed out:
- IBM: Trey Cain, Brian Mestan
- AMD: Kevin Lepak
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Research Group Overview

Faculty: Mikko Lipasti, since 1999
Current MS/PhD students
- Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
Graduates, current employment:
- AMD: Kevin Lepak
- IBM: Trey Cain, Jason Cantin, Brian Mestan
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Current Focus Areas

Multiprocessors
- Coherence protocol optimization
- Interconnection network design
- Fairness issues in hierarchical systems
Microprocessor design
- Complexity-effective microarchitecture
- Scalable dynamic scheduling hardware
- Speculation reduction for power savings
- Transparent clock gating
- Domain-specific ISA extensions
Software
- Java Virtual Machine run-time optimization
- Workload development and characterization
Funding

IBM
- Faculty Partnership Awards
- Shared University Research equipment
Intel
- Research council support
- Equipment donations
National Science Foundation
- CSA, ITR, NGS, CPA
- CAREER Award
Schneider ECE Faculty Fellowship
UW Graduate School
Questions?
http://www.ece.wisc.edu/~pharm
Backup slides
Technology Parameters

- 65 nm technology generation
- 16 tiled processors, each approximately 4 mm x 4 mm
- A signal can travel approximately 4 mm per cycle
- Circuit-switched interconnect consists of 5 mm unidirectional links
Broadcast Protocol

- Broadcast to all nodes; establish a circuit-switched path with the owner of the data
- Future broadcasts use the circuit-switched path to reduce power
- Predict when the CS path will suffice
- Use LRU information to tear down old paths when resources need to be claimed by a new path
Switch Design (from paper)

[Diagram: switch with N, S, E, W ports, a processor port, an input buffer, and per-port configuration memories (CM)]
Race Example (from paper, 1 of 2)

[Diagram: processors P0, P1, P2 and directory Dir3; message sequence: 1a. CS Req, 1b. CS Notify, 2. Upgrade, 3., 4. CS Resp (S), 5. Invalidate, 6. Inval Resp, 7. Downgrade]
Race Example (2 of 2)

[Diagram: processors P0, P1, P2 and directory Dir3; message sequence: 1a. CS Req, 1b. CS Notify, 2. Upgrade, 3., 4a. CS Resp (S), 4b. Nack, 5. Invalidate, 6. Inval Resp]
LRU Pairs for Dirty Misses

23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
[Chart: cumulative % of dirty misses (0%-100%) vs. number of LRU pairs (1-235), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Local LRU Pairs

2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
[Chart: miss rate with local LRU (0%-70%) vs. number of paths per processor (1-15), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Concurrent Links

- 5 concurrent links cover 90% of the necessary pairs
- Captures 50%-77% of the overall opportunity
[Chart: coverage with 2 circuit-switched paths per processor (0%-110%) vs. number of concurrent links (1-9), for SPECjbb, SPECweb, TPC-H, and TPC-W]
Experimental Setup

PHARMsim, with an activity-based power model based on Wattch added
- In-order issue; 4/2/2 fetch/issue/commit (based on the Cell PPU)
- 10-stage transparent front-end pipeline (conventional latches at the endpoints)
- Gshare (8K-entry) branch predictor; 1024-set, 4-way BTB
- 32KB I/D caches (1/4), 512KB L2 cache (12)
- 4 confidence bits, >4 high-confidence threshold, predictions checked randomly 10% of the time
- Benchmarks simulated for 250M instructions
Branch Predictor Activity

[Chart: normalized branch predictor activity (0-1.2) per benchmark, comparing scr_extra and normal]
Related Work

- Removing wrong-path instructions [Manne 1998]
- Flow-based throttling techniques [Baniasadi 2001, Karkhanis 2002]
Future Work

- Explore the performance of other fetch gating schemes with transparent pipelining
- Explore dependence-driven gating on an Itanium machine model
- Explore latch soft-error vulnerability (TVF) when lazy clocking is used
- Explore the change in AVF when fetch gating is used (less ACE state in flight)
Scheduling Replay Example

[Diagram: dependence chain LD -> ADD -> OR -> AND -> BR flowing through Sched, Disp, RF, Exe, Retire; the cache miss resolves after dependents have issued, so ALL instructions in the load shadow are invalidated and replayed]

Squashing/non-selective replay (Alpha 21264)
- Replays all dependent and independent instructions issued under the load shadow
- Analogous to squashing recovery on a branch misprediction
- Simple, but a high performance penalty: independent instructions are unnecessarily replayed
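A minimal sketch of the squashing policy described above; the shadow window length is an assumed parameter, and dependence information is deliberately ignored, which is exactly what makes this scheme simple but wasteful:

```python
def squashing_replay(issued, load_issue_cycle, shadow):
    """issued: list of (inst, issue_cycle). Replay every instruction
    issued within `shadow` cycles after the load, dependent or not."""
    return [inst for inst, cycle in issued
            if load_issue_cycle < cycle <= load_issue_cycle + shadow]

issued = [("LD", 0), ("ADD", 1), ("OR", 1), ("AND", 2), ("BR", 2)]
# AND and BR may be independent of LD, but they replay anyway.
assert squashing_replay(issued, 0, 2) == ["ADD", "OR", "AND", "BR"]
```

A selective scheme would filter this list by dependence on the load; the nonspeculative narrow-core scheduler avoids the replay entirely.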
Narrow Core

Narrow scheduler
- Captures partial operands
- Determines load latency (hit/miss)
Narrow data-path
- Narrow ALU provides partial data to consumers
- Narrow LSQ and partial-tag cache find the only possible load data source
Uses the least significant 16 bits
- Large enough to help predict load latency
- Small enough to achieve fast cycle time
L/S Disambiguation & Partial Tag Matching

Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
- Load/store disambiguation: 10 bits find 99% of matching stores
- Partial tag match: 16 bits give 97% (mcf) to 99% (bzip2) accuracy
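A toy illustration of partial tag matching with the low 16 bits: the partial comparison usually agrees with the full one, with rare false matches (the accuracies quoted on the slide). The addresses below are made up:

```python
PARTIAL_BITS = 16
MASK = (1 << PARTIAL_BITS) - 1

def partial_match(addr_tag, stored_tag):
    """Cheap comparison on the low 16 tag bits only."""
    return (addr_tag & MASK) == (stored_tag & MASK)

def full_match(addr_tag, stored_tag):
    """The full-width comparison the wide bank performs later."""
    return addr_tag == stored_tag

assert partial_match(0x5555AAAA, 0x5555AAAA)    # true match agrees
assert partial_match(0xABCD1234, 0x00001234)    # rare false partial match
assert not full_match(0xABCD1234, 0x00001234)   # caught by the full compare
```

The false-match case is why the wide bank still verifies the full tag; the narrow match only needs to be accurate enough to schedule against.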
Outline

- Motivation
- Dynamic scheduling with narrow values
- Scheduler with narrow data-path
- Pipelined data cache
- Pipeline integration
- Implementation and experiments
- Conclusions and future work
Dynamic Scheduling with Partial Operands

- Stores a subset of operands in the scheduler
- Exploits partial operand knowledge: load-store disambiguation, partial tag match
[Diagram: front-end, OoO core, back-end]
Pipelined Cache w/ Early Bits

[Diagram: cache split into a narrow bank (tag/data subarrays, comparator, muxes, partial bits to the narrow data path) and a wide bank (tag/data arrays, comparator, muxes, full bits to the wide data path), with row and subarray decoders and pipeline latches spanning the Disp1, Disp2, and Agen stages]

- Narrow bank for partial access, wide bank for the rest
- Uses partial tag match in the narrow bank; saves power in the wide bank
- Hides the wide-bank latency by starting it early
Narrow LSQ

- Stores partial addresses of stores, used for partial load-store disambiguation
- Accessed in parallel with the narrow bank
- Saves power in the wide LSQ: cheaper direct-mapped access rather than fully associative search
Pipeline Integration

[Diagram: pipeline with Fetch, Decode, Rename, Queue, Sched, and Disp stages feeding partial load, int ALU, mult/div, and agen/cache units, then WB and Commit]
- Simple ALU instructions link dependences in back-to-back cycles
- Complex ALU instructions link dependences non-speculatively
- Load instructions need another cycle to schedule dependences
Pipelined Data Cache & LSQ

Modeled using modified CACTI 3.0; configuration: 16KB, 4-way, 64B blocks

                                  Conventional        Pipelined
                                  Data Cache          Data Cache
Access Latency - Narrow Bank      N/A                 0.80 ns
Access Latency - Wide Bank        1.24 ns             0.60 ns
Total Energy (Cache + LSQ)        (0.62 + 0.11) nJ    (0.37 + 0.08) nJ
Total Area                        (1.21 + 0.40) mm2   (1.50 + 0.40) mm2
Experiments

SimpleScalar / Alpha 3.0 tool set
Machine model
- 64-entry ROB; 4-wide fetch/issue/commit
- 16-entry SQ, 16-entry LQ; 32-entry scheduler
- 13-stage pipeline
- 64KB I-cache (2-cycle), 16KB D-cache (2-cycle)
- 2-cycle store-to-load forwarding
Energy Dissipation

On average, narrow-capture scheduling consumes 25% less energy than non-data-capture scheduling
[Chart: total energy (normalized, 0-1) for bzip2, mcf, parser, vpr, and avg, comparing narrow_refetch, narrow_squash, squash, and parallel_selective]
Mispredicted Load Instructions

Reduces misspeculated loads by 75%-80%
[Chart: number of misscheduled load instructions (millions, 0-14) for bzip2, mcf, parser, and vpr, broken into miss-forward, store no-data, misaligned store, cache alias, and cache miss]
Optimized Model

- Uses the refetch replay scheme to reduce replay complexity
- Clears scheduler entries once instructions are issued: decreases scheduler occupancy, so instructions enter the OoO window sooner
- Reduces L1 cache latency from 2 cycles to 1 cycle
Optimized Model Performance

Small variations; always performs as well or better
[Chart: speedup (0.5-2) for bzip2, mcf, parser, vpr, and avg, comparing improved narrow_refetch, narrow_refetch, narrow_squash, squash, and selective]
Future Work

- Implement a more accurate dynamic power model
- Study custom design vs. the synthesized model
- Study opportunities for leakage power reduction
Delay Model

Processor 0 can reach processor 15 in 9 fewer cycles
[Tables: per-tile cycle-count grids for the circuit-switched interconnect vs. the baseline store-and-forward mesh]
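A back-of-envelope check on the delay model; the per-hop costs and circuit setup overhead below are assumptions chosen to be consistent with the slide's 9-cycle figure, not exact values from the original tables:

```python
def mesh_latency(hops, per_hop=3):
    """Store-and-forward mesh: every hop pays wire plus router latency
    (assumed 3 cycles/hop here)."""
    return hops * per_hop

def circuit_latency(hops, per_hop=1, setup=3):
    """Configured circuit: ~1 cycle per hop plus an assumed fixed
    path overhead."""
    return hops * per_hop + setup

# Corner to corner on a 4x4 tiled CMP (processor 0 to 15): 6 hops
assert mesh_latency(6) - circuit_latency(6) == 9   # 9 cycles saved
```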
Pipeline Unrolling