Lazy Logic
Mikko H. Lipasti
Associate Professor
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm
July 6, 2007 Mikko Lipasti, University of Wisconsin Seminar--University of Toronto

CMOS History
CMOS has been a faithful servant
  40+ years since invention
  Tremendous advances
    Device size, integration level
    Voltage scaling
    Yield, manufacturability, reliability
  Nearly 20 years now as high-performance workhorse
Result: life has been easy for architects
  Ease leads to complacency & laziness

CMOS Futures
“The reports of my demise are greatly exaggerated.” – Mark Twain
CMOS has some life left in it
  Device scaling will continue
  What comes after CMOS…
Many new challenges
  Process variability
  Device reliability
  Leakage power
  Dynamic power (focus of this talk)

Dynamic Power
Static CMOS: current flows when transistors switch
  Combinational logic evaluates new inputs
  Flip-flop, latch captures new value (clock edge)
Terms
  C: capacitance of circuit (wire length, number and size of transistors)
  V: supply voltage
  A: activity factor
  f: frequency
Architects can/should focus on Ci x Ai
  Reduce capacitance of each unit
  Reduce activity of each unit
$$P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f$$

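The dynamic power expression on this slide (a sum over units of capacitance, voltage squared, activity, and frequency, scaled by a constant k) can be evaluated with a few lines of Python. This is only a sketch: the `Unit` class, the function name, and every number below are hypothetical, chosen to show how several low-activity units can draw less dynamic power than one busy shared unit. Leakage is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    c: float        # switched capacitance (arbitrary units)
    a: float        # activity factor, between 0 and 1
    v: float = 1.0  # supply voltage (arbitrary units)


def dynamic_power(units, f=1.0, k=1.0):
    """Sum C * V^2 * A * f over all units, scaled by the constant k."""
    return k * sum(u.c * u.v ** 2 * u.a * f for u in units)


# Hypothetical numbers: one shared unit that switches on 80% of cycles...
shared = [Unit(c=1.0, a=0.8)]
# ...versus four simpler copies that each switch on 10% of cycles.
replicated = [Unit(c=0.6, a=0.1) for _ in range(4)]

print(dynamic_power(shared))
print(dynamic_power(replicated))
```

With these made-up values the replicated design has more total capacitance but a smaller sum of Ci x Ai, which is the trade the talk argues for.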
Design Objective Inversion
Historically, hardware was expensive
  Every gate, wire, cable, unit mattered
  Squeeze maximum utilization from each
Now, power is expensive
  On-chip devices & wires, not so much
  Should minimize Ci x Ai
    Logic should be simple, infrequently used
    Both sequential and combinational
Lazy Logic

Talk Outline
Motivation
What is Lazy Logic?
Applications of Lazy Logic
  Circuit-switched coherence
  Stall-cycle redistribution
  Dynamic scheduling
Conclusions
Research Group Overview

What is Lazy Logic?
Design philosophy
Some overall principles
  Minimize unit utilization
  Minimize unit complexity
  OK to increase number of units/wires/devices
    As long as reduced Ai (activity) compensates
    Don’t forget leakage
Result
  Reject conventional “good ideas”
  Reduce power without loss of performance
  Sometimes improve performance

Lazy Logic Applications
CMP interconnection networks
  Old: Packet-switched, store-and-forward
  New: Circuit-switched, reconfigurable
Stall cycle redistribution
  Transparent pipelines want fine-grained stalls
  Redistribute coarse stalls into fine stalls
High-performance dynamic scheduling
  Cycle time goal achieved by replicating ALUs

CMP Interconnection Networks
Options
  Buses don’t scale
  Crossbars are too expensive
  Rings are too slow
  Packet-switched mesh
    Attractive for all the DSM reasons
    Scalable
    Low latency
    High link utilization

CMP Interconnection Networks
But…
  Cables/traces are now on-chip wires
    Fast, cheap, plentiful
    Short: 1 cycle per hop
  Router latency adds up
    3-4 CPU cycles per hop
  Store-and-forward
    Lots of activity/power
Is this the right answer?

Circuit-switched Interconnects
Communication patterns
  Spatial locality to memory
  Pairwise communication
Circuit-switched links
  Avoid switching/routing
  Reduce latency
  Save power?

Router Design
Switches can be logically configured to appear as wires (no routing overhead)
Can also act as packet-switched network
Can switch back and forth very easily
Detailed router design not presented here
[Figure: router with N, S, E, W, and P ports]

Dirty Miss Coverage
[Figure: % of dirty misses (40% to 100%) vs. number of circuit-switched connections per processor (1 to 15); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Directory Protocol
Initial 3-hop miss establishes CS path
Subsequent miss requests
  Sent directly on CS path to predicted owner
  Also in parallel to home node
  Predicted owner sources data early
  Directory acks update to sharing list
Benefits
  Reduced 3-hop latency
  Less activity, less power

Circuit-switched Performance
[Figure: normalized cycle count (0 to 1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Base; Fully connected, Oracle; Limit 1, Oracle; Limit 1, Region Prediction]

Link Activity
[Figure: normalized link activity (0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Buffer Activity
[Figure: normalized input buffer activity (0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Circuit-switched Coherence Summary
Reconfigurable interconnect
  Circuit-switched links
  Some performance benefit
  Substantial reduction in activity
Current status (slides are out of date)
  Router design and physical/area models
  Protocol tuning and tweaks, etc.
  Initial results in CA Letters paper

May 9, 2023 Eric L. Hill – Preliminary Exam 20
Pipeline Clocking Revisited
[Figure: work units A and B in the pipeline]
Two units of work, 10 clock pulses
Latches clocked to propagate data
Conventional pipeline clock gating
  Each valid work unit gets clocked into each latch
  This is needlessly conservative

Transparent Pipeline Gating
[Figure: work units A and B in the pipeline]
Two units of work, 5 clock pulses
Transparent pipelining: novel approach to clocking [Jacobson 2004, 2005]
  Both master and slave latch can remain transparent
  Gating logic ensures no races
  Pipeline registers are clocked lazily, only when a race would occur
Quite effective for low utilization pipelines
  Gaps between valid work units enable transparent mode

Applications
Best suited for low utilization pipelines
  E.g. FP, media processing functional units
High utilization pipelines see least benefit
  E.g. instruction fetch pipelines
To benefit from transparent approach:
  Valid data items need fine-grained gaps (stalls)
  1-cycle gap provides lion’s share (50%)

Application: Front-end Pipelines
Provide back-end with sufficient supply of instructions to find ILP
  High branch prediction accuracy
  Low instruction cache miss rates
Little opportunity for clock gating
  Designed to feed peak demand
Poor match for transparent pipeline gating

In-Order Execution Model
In-order cores
  Power efficient
  Low design complexity
  Throughput oriented
CMP systems trending towards simple cores (e.g. Sun Niagara)
Data dependences cause fine-grained stalls at dispatch
Can we project these back to fetch?
  Exploit fetch slack

Pipeline Diagram
[Figure: pipeline diagram with labels PC, Bpred, bpred update, RP, Instruction Fetch, Issue Buffer, clock vector, Execution Core]

Available Fetch Slack
[Figure: fraction of instruction groups observed (0.00 to 1.00), broken down by available slack: 0, 1, 2, 3, 4, 5, 6, 7+]

Implementation
Stall cycle bits embedded in BTB
  EPIC ISAs (IA64) could use stop bits
Verify prediction by observing unperturbed groups
  Let high confidence groups periodically execute unperturbed
  Observe overall increase in execution time
Modeled Cell PPU-like PowerPC core with aggressive clock gating

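A toy model may help show what redistributing stalls means for the fetch pipeline. The sketch below assumes each fetch group carries a predicted slack in cycles (standing in for the stall cycle bits in the BTB); the function names and the slack values are hypothetical and are not taken from the modeled core.

```python
def baseline_fetch(num_groups):
    """Fetch groups back to back: one group per cycle, no gaps."""
    return list(range(num_groups))


def redistributed_fetch(slack):
    """Delay each group by its predicted slack (in cycles) before fetching it."""
    cycle = 0
    schedule = []
    for s in slack:
        cycle += s          # insert the predicted stall ahead of this group
        schedule.append(cycle)
        cycle += 1          # the group itself occupies one fetch cycle
    return schedule


def groups_after_gap(schedule):
    """Count groups that enter the pipeline right after at least one empty cycle."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:]) if cur - prev > 1)


slack = [0, 1, 0, 2]  # hypothetical per-group stall-cycle bits
base = baseline_fetch(len(slack))
lazy = redistributed_fetch(slack)

print(base, groups_after_gap(base))
print(lazy, groups_after_gap(lazy))
```

In the baseline schedule no group is preceded by a gap, so a transparent pipeline has nothing to exploit. In the redistributed schedule the same groups arrive with fine-grained gaps between them, which is the condition transparent gating needs.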
Latch Activity Reduction
[Figure: normalized latch activity factor (0 to 1.2); series: scr, scr+tcg]

FE Energy Delay Product
[Figure: normalized front end energy-delay product (J*s), 0 to 1.2; components: fe_latch, bpred, icache; configurations: base, scr, scr+tpg]

Stall Cycle Redistribution Summary [ISLPED 2006]
Transparent pipelines reduce latch activity
  Not effective in pipelines with coarse-grained stalls (e.g. fetch)
Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
Benefits
  Equivalent performance, lower power
  Transparent fetch pipeline now attractive

A Brief Scheduler Overview
Atomic Sched/Exe
  Fetch, Decode, Sched/Exe, Writeback, Commit
Wakeup/Select
  Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit
Spec wakeup/select
  Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit
  Speculatively issued instructions
  Re-schedule when latency mispredicted (latency changed, invalid input value)
Data capture / non-data capture scheduler
Speculative scheduling
Data capture scheduler desirable for many reasons
  Cycle time is not competitive because of data path delay
Current machines use speculative scheduling
  Misscheduled/replayed instructions burn power
  Depending on recovery policy, up to 17% of issued insts need to replay

Slicing the Core
Bitslice the core: narrow (16b) and wide (64b)
Narrow core can be full data capture
  Still makes aggressive cycle time (with lazy logic)
  Completely nonspeculative, virtually no replays
  Further power benefits (not in this talk)
[Figure: Front-End, OoO Core, Back-End]

Dynamic Scheduling with Partial Operand Values
Narrow core
  Computes partial operand
  Determines load latency
  Avoids misscheduling
Wide core
  Computes the rest of the operand (if needed)
[Figure: pipeline Fetch, Decode, Sched & Nrw Exe, Dispatch, RF, Exe, Writeback/Recover, Commit; annotations: wakeup/select, the rest of the data]

Scheduler w/ Narrow Data-Path
Non-data capture scheduler
  Select – mux – tag bcast & compare – ready wr
Naïve narrow data capture scheduler
  Select – mux – tag bcast & compare – ready wr
  Select – mux – narrow ALU – data bcast – data wr
  Increased cycle time
[Figure (a): scheduler entries (ROB ID, Tag1, Data1, Tag2, Data2, Dest) with tag comparators and select logic; Int ALU, LSQ, Cache, Adder; output to wide data path]

Scheduler w/ Embedded ALUs
With embedded ALUs
  Select – mux – tag bcast & compare – ready wr
  Max(select, data bcast – mux – narrow ALU) – mux – latch setup
Lazy Logic
  Replicated ALUs
  Low utilization
  Off critical delay path
[Figure (b): scheduler entries (ROB ID, Tag1, Data1, Tag2, Data2, Dest) with tag comparators, embedded Int ALUs, muxes, latch, and select logic; LSQ, Cache; output to wide data path]

Cycle Time, Area, Energy
32 entries, implemented using Verilog
Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11um

Cycle Time (ns)
  Full-Data Capture: 2.04
  Non-Data Capture: 1.28
  Narrow-Data Capture: 1.71
  Narrow-Data Capture w/ ALUs: 1.28
Energy (nJ): 1.54, 1.48, 1.46, 1.40
Area (mm2): 1.43, 1.53, 1.49, 1.98

Dynamic Scheduling Summary
Benefits: [JILP 2007]
  Save 25-30% of total OoO window energy => 12-18% total dynamic chip power
  Reduce misspeculated loads by 75%-80%
  Slightly improved IPC
  Comparable cycle time
Enabled by:
  Lazy narrow ALUs
  ALUs are cheap, so compute in parallel with scheduling select logic

Conclusions
Lazy Logic
  Promising new design philosophy
Some overall principles
  Minimize unit utilization
  Minimize unit complexity
  OK to increase number of units/wires/devices
Initial results
  Circuit-switched CMP interconnects
  Stall cycle redistribution
  Dynamic scheduling

Who Are We?
Faculty: Mikko Lipasti
Current Ph.D. students:
  Profligate execution: Gordie Bell (joining IBM in 2006)
  Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
  Lazy Logic
    Circuit-switched coherence: Natalie Enright
    Stall cycle redistribution: Eric Hill
    Dynamic scheduling: Erika Gunadi
  Dynamic code optimization: Lixin Su
  SMT/CMP scheduling/resource allocation: Dana Vantrease
Pharmed out:
  IBM: Trey Cain, Brian Mestan
  AMD: Kevin Lepak
  Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Research Group Overview
Faculty: Mikko Lipasti, since 1999
Current MS/PhD students
  Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
Graduates, current employment:
  AMD: Kevin Lepak
  IBM: Trey Cain, Jason Cantin, Brian Mestan
  Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Current Focus Areas
Multiprocessors
  Coherence protocol optimization
  Interconnection network design
  Fairness issues in hierarchical systems
Microprocessor design
  Complexity-effective microarchitecture
  Scalable dynamic scheduling hardware
  Speculation reduction for power savings
  Transparent clock gating
  Domain-specific ISA extensions
Software
  Java Virtual Machine run-time optimization
  Workload development and characterization

Funding
IBM
  Faculty Partnership Awards
  Shared University Research equipment
Intel
  Research council support
  Equipment donations
National Science Foundation
  CSA, ITR, NGS, CPA
  Career Award
Schneider ECE Faculty Fellowship
UW Graduate School

Questions?
http://www.ece.wisc.edu/~pharm

Backup slides
Technology Parameters
65 nm technology generation
16 tiled processors
  Approximately 4 mm x 4 mm
Signal can travel approximately 4 mm/cycle
Circuit-switched interconnect consists of 5 mm unidirectional links

Broadcast Protocol
Broadcast to all nodes
Establish circuit-switched path with owner of data
  Future broadcasts will use circuit-switched path to reduce power
  Predict when CS path will suffice
Use LRU information for paths to tear down old paths when resources need to be claimed by new path

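The LRU teardown policy can be sketched as a small per-processor table. This is an illustration only, assuming each processor may hold a fixed number of circuit-switched paths; the `PathTable` class and its `use` method are hypothetical names, not part of the protocol described in the talk.

```python
from collections import OrderedDict


class PathTable:
    """Tracks the circuit-switched paths held by one processor, in LRU order."""

    def __init__(self, max_paths):
        self.max_paths = max_paths
        self.paths = OrderedDict()  # destination node -> True, oldest first

    def use(self, dest):
        """Return (hit, torn_down): whether a path existed, and which path was evicted."""
        if dest in self.paths:
            self.paths.move_to_end(dest)  # mark as most recently used
            return True, None
        torn_down = None
        if len(self.paths) >= self.max_paths:
            # No free resources: tear down the least recently used path.
            torn_down, _ = self.paths.popitem(last=False)
        self.paths[dest] = True
        return False, torn_down


table = PathTable(max_paths=2)
results = [table.use(dest) for dest in [5, 9, 5, 12, 9]]
print(results)
```

With two paths per processor, the third request (to node 5 again) reuses the existing path, while the requests to nodes 12 and 9 each force the oldest path to be torn down.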
Switch Design from paper
[Figure: switch with N, S, E, W ports and a Processor port, five configuration memories (CM), and a buffer]
CM = Configuration Memory

Race example from paper (1 of 2)
[Figure: messages among P0, P1, P2, and Dir3]
  1a. CS Req
  1b. CS Notify
  2. Upgrade
  3.
  4. CS Resp (S)
  5. Invalidate
  6. Inval Resp
  7. Downgrade

Race example (2 of 2)
[Figure: messages among P0, P1, P2, and Dir3]
  1a. CS Req
  1b. CS Notify
  2. Upgrade
  3.
  4a. CS Resp (S)
  4b. Nack
  5. Invalidate
  6. Inval Resp

LRU pairs for Dirty Misses
23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
[Figure: percentage (0% to 100%) vs. number of pairs (1 to 235); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Local LRU pairs
2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
[Figure: Miss Rate (Local LRU), 0% to 70%, vs. 1 to 15 on the x-axis; series: SPECjbb, SPECweb, TPC-H, TPC-W]

Concurrent Links
5 concurrent links cover 90% of necessary pairs
Captures 50%-77% of overall opportunity
[Figure: 2 Circuit-Switched Paths per Processor; percentage (0% to 110%) vs. 1 to 9 on the x-axis; series: SPECjbb, SPECweb, TPC-H, TPC-W]

Experimental Setup
PHARMsim
  Activity-based power model based on Wattch added
  InOrder issue 4/2/2 fetch/issue/commit (based on Cell PPU)
  10 stage transparent front-end pipeline (conventional latches at endpoints)
  Gshare (8k entry) branch predictor, 1024 set, 4-way BTB
  32KB I/D cache (1/4), 512KB L2 cache (12)
  4 confidence bits / >4 high conf threshold / predictions checked randomly 10% of the time
Benchmarks simulated for 250M instructions

Branch Predictor Activity
[Figure: normalized bpred activity (0 to 1.2); series: scr_extra, normal]

Related Work
Removing Wrong Path Instructions [Manne 1998]
Flow Based Throttling Techniques [Baniasadi 2001, Karkhanis 2002]

Future Work
Explore performance of other fetch gating schemes with transparent pipelining
Explore dependence driven gating on Itanium machine model
Explore latch soft error vulnerability (TVF) when lazy clocking is used
Explore change in AVF when fetch gating is used
  Less ACE state in-flight

Scheduling Replay Example
Squashing/non-selective replay – Alpha 21264
  Replays all dependent and independent instructions issued under load shadow
  Analogous to squashing recovery in branch misprediction
  Simple but high performance penalty
    Independent instructions are unnecessarily replayed
[Figure: LD, ADD, OR, AND, BR moving through Sched, Disp, RF, Exe, Retire; annotations: cache miss; invalidate & replay ALL instructions in the load shadow; miss resolved]

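The cost of non-selective replay can be illustrated by comparing it with a replay that touches only the load's dependents. The sketch below uses the instruction names from the slide, but the dependence edges are hypothetical, and the selective policy is shown only for contrast.

```python
def dependents_of(root, producers):
    """Return every instruction that depends, directly or transitively, on root."""
    result = set()
    changed = True
    while changed:
        changed = False
        for inst, srcs in producers.items():
            if inst not in result and any(s == root or s in result for s in srcs):
                result.add(inst)
                changed = True
    return result


# Hypothetical dependences for the instructions issued in the load shadow.
producers = {
    "ADD": ["LD"],   # ADD consumes the load result
    "OR": ["ADD"],   # OR consumes ADD, so it depends on LD transitively
    "AND": [],       # independent of the load
    "BR": [],        # independent of the load
}

squash_replay = set(producers)                     # non-selective: replay everything
selective_replay = dependents_of("LD", producers)  # only true dependents
wasted = squash_replay - selective_replay

print(sorted(selective_replay), sorted(wasted))
```

Under these assumed dependences, squashing replays AND and BR even though neither consumes the load result, which is the unnecessary work the slide points out.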
Narrow Core
Narrow Scheduler
  Captures partial operands
  Determines load latency (hit/miss)
Narrow Data-Path
  Narrow ALU – provides partial data to consumers
  Narrow LSQ and partial tag cache
    Finds only possible load data source
  Uses least significant 16 bits
    Large enough to help predict load latency
    Small enough to achieve fast cycle time

L/S Disambiguation & Partial Tag Matching
Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
Load/store disambiguation
  10 bits finds 99% of matching stores
Partial tag match
  16 bits for 97% (mcf) - 99% (bzip2) accuracy

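Partial address comparison can be sketched as a mask-and-compare on the low-order bits. The addresses and helper names below are hypothetical, and the sketch compares whole addresses rather than modeling access sizes or alignment. The useful property is one-sided: a partial mismatch rules out a full match, while a partial match is only a candidate.

```python
PARTIAL_BITS = 16
MASK = (1 << PARTIAL_BITS) - 1


def partial_match(load_addr, store_addrs, mask=MASK):
    """Return indices of stores whose low-order address bits equal the load's."""
    key = load_addr & mask
    return [i for i, s in enumerate(store_addrs) if (s & mask) == key]


def full_match(load_addr, store_addrs):
    """Return indices of stores whose full address equals the load's."""
    return [i for i, s in enumerate(store_addrs) if s == load_addr]


stores = [0x1000_2340, 0x7FFF_2340, 0x1000_8888]  # hypothetical store addresses
load = 0x1000_2340

candidates = partial_match(load, stores)
exact = full_match(load, stores)
print(candidates, exact)
```

Here the second store aliases with the load in its low 16 bits, so it shows up as a candidate and must be filtered out by the wide comparison. Every exact match is always among the candidates.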
Outline
Motivation
Dynamic Scheduling with Narrow Values
  Scheduler with Narrow Data-Path
  Pipelined Data Cache
  Pipeline Integration
Implementation and Experiments
Conclusions and Future Work

Dynamic Scheduling with Partial Operands
Stores a subset of operands in scheduler
Exploits partial operand knowledge
  Load-store disambiguation
  Partial tag match
[Figure: Front-End, OoO Core, Back-End]

Pipelined Cache w/ Early Bits
[Figure: narrow bank and wide bank, each with row decoder, subarray decoder, tag and data (sub)arrays, comparators, muxes, and latches; partial bits go to the narrow data path and full bits to the wide data path; stages labeled Disp1, Disp2, Agen]
Narrow bank for partial access, wide bank for the rest
  Uses partial tag match in narrow bank
  Saves power in wide bank
  Hides wide cache bank latency by starting early

Narrow LSQ
Stores partial addresses of stores
Used for partial load-store disambiguation
Accessed in parallel with narrow bank
Saves power in the wide LSQ
  Cheaper direct mapped access rather than full associative search

Pipeline Integration
Simple ALU insts link dependences in back-to-back cycle
Complex ALU insts link dependences non-speculatively
Load insts need another cycle to schedule dependences
[Figure: pipeline Fetch, Fetch, Decode, Decode, Decode, Rename, Queue, Sched, Disp, Disp, then Int ALU, Mult/Div (3 stages), or Agen and Cache (with Partial Load), then WB, Commit]

Pipelined Data Cache & LSQ
Modeled using modified CACTI 3.0
Configuration: 16KB, 4-way, 64B blocks

                                         Conventional Data Cache   Pipelined Data Cache
Access Latency – Narrow Bank             N/A                       0.80 ns
Access Latency – Wide Bank               1.24 ns                   0.60 ns
Total Energy Consumption (Cache + LSQ)   (0.62 + 0.11) nJ          (0.37 + 0.08) nJ
Total Area                               (1.21 + 0.40) mm2         (1.50 + 0.40) mm2

Experiments
Simplescalar / Alpha 3.0 tool set
Machine Model
  64-entry ROB
  4-wide fetch/issue/commit
  16-entry SQ, 16-entry LQ
  32-entry scheduler
  13-stage pipeline
  64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
  2-cycle store to load forwarding

Energy Dissipation
On average, narrow data capture scheduling consumes 25% less energy than non-data capture scheduling
[Figure: total energy (0 to 1) for bzip2, mcf, parser, vpr, avg; series: narrow_refetch, narrow_squash, squash, parallel_selective]

Mispredicted Load Instructions
Reduce misspeculated loads by 75%-80%
[Figure: number of misscheduled load instructions (millions), 0 to 14, for bzip2, mcf, parser, vpr; categories: miss-forward, store no-data, misalign store, cache alias, cache miss]

Optimized Model
Using refetch replay scheme to reduce replay complexity
Clear the scheduler entries once instructions are issued
  Decreases scheduler occupancy
  Instructions enter OoO window sooner
Reduce L1 cache latency from 2-cycle to 1-cycle

Optimized Model Performance
Small variations
Always performs as well or better
[Figure: speedup (0.5 to 2) for bzip2, mcf, parser, vpr, avg; series: improved narrow_refetch, narrow_refetch, narrow_squash, squash, selective]

Future Work
Implement a more accurate dynamic power model
Study custom design vs. synthesized model
Study opportunities for leakage power reduction

Delay Model
Processor 0 can reach Processor 15 in 9 fewer cycles

Circuit Switched Interconnect (cycles from Processor 0)
  --   2   3   4
   2   3   4   6
   3   4   6   7
   4   6   7   9

Baseline Store and Forward Mesh (cycles from Processor 0)
  --   3   6   9
   3   6   9  12
   6   9  12  15
   9  12  15  18

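The baseline numbers on this slide are consistent with 3 cycles per hop on a 4 x 4 mesh, which the sketch below reproduces. It assumes row-major processor numbering (an assumption, not stated on the slide). The circuit-switched latencies are not derived here; they are copied from the slide, keyed by hop count.

```python
MESH_DIM = 4               # 16 tiled processors, assumed to be arranged 4 x 4
ROUTER_CYCLES_PER_HOP = 3  # per-hop cost implied by the baseline figures

# Circuit-switched latency in cycles by hop count, as read off the slide.
CIRCUIT_SWITCHED_BY_HOPS = {1: 2, 2: 3, 3: 4, 4: 6, 5: 7, 6: 9}


def hops(src, dst, dim=MESH_DIM):
    """Manhattan distance between two tiles, assuming row-major numbering."""
    sx, sy = src % dim, src // dim
    dx, dy = dst % dim, dst // dim
    return abs(sx - dx) + abs(sy - dy)


def baseline_latency(src, dst):
    """Store-and-forward mesh: every hop pays the router latency."""
    return ROUTER_CYCLES_PER_HOP * hops(src, dst)


def cycles_saved(src, dst):
    """Baseline latency minus the circuit-switched latency from the slide."""
    return baseline_latency(src, dst) - CIRCUIT_SWITCHED_BY_HOPS[hops(src, dst)]


print(baseline_latency(0, 15), cycles_saved(0, 15))
```

For Processor 0 to Processor 15 the path is 6 hops, so the baseline takes 18 cycles against 9 on the circuit-switched path, matching the 9-cycle difference quoted on the slide.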
Pipeline Unrolling