Advanced Microarchitecture
Lecture 13: Memory Scheduling
Executing Memory Instructions
Lecture 13: Memory Scheduling
    Load  R3 = 0[R6]
    Add   R7 = R3 + R9
    Store R4 → 0[R7]
    Sub   R1 = R1 – R2
    Load  R8 = 0[R1]

• If R1 != R7, then Load R8 gets the correct value from the cache
• If R1 == R7, then Load R8 should have gotten its value from the Store, but it didn’t!
[Timing diagram: Load R3 issues and misses in the cache. While the miss is serviced, Sub issues, then Load R8 issues and hits in the cache. Only after the miss is serviced can Add and Store issue – but a later load (R8) has already executed.]
Memory Disambiguation Problem
• The ordering problem is a data-dependence violation
• Why can’t this happen with non-memory insts?
  – Operand specifiers in non-memory insts are absolute
    • “R1” refers to one specific location
  – Operand specifiers in memory insts are ambiguous
    • “R1” refers to a memory location specified by the value of R1. As pointers change, so does this location.
• Determining whether it is safe to issue a load OOO requires disambiguating the operand specifiers
Two Problems
• Memory disambiguation
  – Are there any earlier unexecuted stores to the same address as myself? (I’m a load)
  – Binary question: the answer is yes or no
• Store-to-load forwarding problem
  – Which earlier store do I get my value from? (I’m a load)
  – Which later load(s) do I forward my value to? (I’m a store)
  – Non-binary question: the answer is one or more instruction identifiers
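The two questions can be sketched against a toy LSQ model. This is an illustrative sketch, not a real design: entries are dicts with made-up field names (seq, is_store, addr, executed, value), ordered oldest to youngest.

```python
# Toy LSQ: a list of entry dicts, oldest first. All field names are
# illustrative, not from any real microarchitecture.

def has_earlier_store_conflict(lsq, load_seq, load_addr):
    """Memory disambiguation (binary): is there an older store that is
    either unexecuted (address unknown, conservatively a conflict) or
    executed to the same address?"""
    return any(e["is_store"] and e["seq"] < load_seq and
               (not e["executed"] or e["addr"] == load_addr)
               for e in lsq)

def forwarding_source(lsq, load_seq, load_addr):
    """Store-to-load forwarding (non-binary): the youngest older store
    to the same address supplies the value, if one exists."""
    matches = [e for e in lsq if e["is_store"] and e["seq"] < load_seq
               and e["executed"] and e["addr"] == load_addr]
    return max(matches, key=lambda e: e["seq"]) if matches else None
```

The binary question alone decides whether a load may issue; the non-binary one additionally picks the forwarding source.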
Load Store Queue (LSQ)

L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42      ← oldest
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
S    0xF85C  41779  0x3290  0
L    0xF870  41780  0x3410  25
L    0xF628  41781  0x3290  0
L    0xF63C  41782  0x3300  1       ← youngest

Data Cache: 0x3290 = 42, 0x3410 = 38, 0x3418 = 1234, 0x3300 = 1
[The load values -17 (seq 41777) and 25 (seq 41780) come from store-to-load forwarding out of the LSQ, not from the data cache.]
Most Conservative Policy
• No memory reordering
• LSQ still needed for forwarded data (last slide)
• Easy to schedule
[Scheduler diagram: each memory instruction becomes ready, bids, and is granted strictly in program order.]
• Least IPC: all memory is executed sequentially
Loads OOO Between Stores
• Let loads exec OOO w.r.t. each other, but no ordering past earlier unexecuted stores
[Diagram: loads (L) in the queue may issue in any order relative to one another, but a load is ready only once all earlier stores (S) have executed.]
Loads Wait for Only STAs
• Stores normally don’t “execute” until both inputs are ready: address and data
• Only the address is needed to disambiguate
[Diagram: a load need only wait for the earlier store’s address to be ready, not for its data.]
Loads Execute When Ready
• Most aggressive approach
• Relies on the fact that store→load forwarding is not the common case
• Greatest potential IPC – loads never stall
• Potential for incorrect execution
Detecting Ordering Violations
• Case 1: older store executes before younger load
  – No problem; if same address, st→ld forwarding happens
• Case 2: older store executes after younger load
  – Store scans all younger loads
  – Address match → ordering violation
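Case 2 can be sketched as the scan a late-executing store performs. This is a toy model with assumed field names; it also assumes a not-yet-executed load may already have its address available (from a split ea-comp).

```python
def on_store_execute(lsq, store_seq, store_addr, store_value):
    """When a store's address finally resolves, scan *younger* loads in
    the LSQ. Already-executed loads to the same address are ordering
    violations; not-yet-executed ones (address assumed known from the
    ea-comp) simply capture the forwarded value."""
    violations = []
    for e in lsq:
        if not e["is_store"] and e["seq"] > store_seq and e["addr"] == store_addr:
            if e["executed"]:
                violations.append(e["seq"])    # must squash / re-execute
            else:
                e["value"] = store_value       # st->ld forwarding instead
    return violations
```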
Detecting Ordering Violations (2)

L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
S    0xF85C  41779  0x3290  0
L    0xF870  41780  0x3410  25
L    0xF628  41781  0x3290  42
L    0xF63C  41782  0x3300  1

• The store broadcasts its value, address and sequence #: (-17, 0x3290, 41775)
• Loads CAM-match on the address; a load only cares if the store’s seq # is lower than its own
  – Load 41773 ignores the broadcast because it has a lower seq #
• If a younger load hadn’t executed and the address matches, it grabs the broadcast value (-17)
• If a younger load has executed and the address matches, then ordering violation! It grabs the value, and the pipeline is flushed after the load
• A second broadcast (0, 0x3290, 41779) follows: an instruction may be involved in more than one ordering violation
Dealing with Misspeculations
• Instructions using the load’s stale/wrong value will propagate more wrong values
• These must somehow be re-executed
• Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch
  – Load uses the forwarded value
  – Correct value propagated when instructions re-execute
Recovery Complications
• When flushing only part of the pipeline (everything after the load), the RAT must be repaired to the state just after the load was renamed
• Solutions?
  – Checkpoint at every load
    • Not so good: between loads and branches, a very large number of checkpoints is needed
  – Roll back to the previous branch (which has its own checkpoint)
    • Make sure the load doesn’t misspeculate the 2nd time around
    • Have to redo the work between the branch and the load, which was all correct the first time around
  – Works with undo-list style of recovery
Flushing is Expensive
• Not all later instructions are dependent on the bogus load value
• Pipeline latency due to refetch is exposed
• Hunting down RS entries to squash is tricky
Selective Re-Execution
• Ideal case w.r.t. maintaining high IPC
• Very complicated
  – Need to hunt down only the data-dependent insts
  – Messier because some instructions may have already executed (now in the ROB) while others may not have executed yet (still in the RS)
    • Iteratively walk the dependence graph?
    • Use some sort of load/store coloring scheme?
• P4 uses replay for load-latency misspeculation
  – But replay wouldn’t work in this case (why?)
Load/Store Execution
• “SimpleScalar” style
[Diagram: at dispatch, a Store is cracked into an ea-comp op and a st-data op, allocated into both the RS and the LSQ. The two halves schedule and execute independently; once both have executed, the store is “complete” and can forward its value to later loads. A Load is similar, but its ld-data portion is data-dependent on the load’s ea-comp.]
Complications
• LSQ needs data-capture support
  – Store data needs to capture its value
  – EA-comps can write to LSQ entries directly using the LSQ index (no associative search)
[Diagram: example RS entries (op, dest, srcL, srcR):
    ADD     T17     T12   T43
    St-ea   Lsq-5   T18   #0
A store normally doesn’t have a dest, so the field is overloaded to hold the LSQ index. The load’s ea-comp is done the same way; the load’s LSQ entry handles the “real” destination tag broadcast.]
Complications (2)
• Load must bid/select twice
  – once for the ea-comp portion
  – once for the cache access (includes the LSQ check)
[Diagram: the Ld-ea op is selected from the RS for ea-comp execution; the Ld-data op is selected again from the LSQ for the cache access, with the data cache and the LSQ searched in parallel.]
Load/Store Execution
• “Pentium” style
[Diagram: at dispatch/alloc, a store is split into STA and STD µops in the RS, backed by a “store” entry in the LSQ; a load becomes a single LD µop backed by a “load” entry.]
• STA and STD still execute independently
• LSQ does not need data-capture
  – uses the RS’s data-capture (for a data-capture scheduler)
  – or RS → PRF → LSQ
• Potentially adds a little delay from STD-ready to ST→LD forwarding
Load Execution
• Only one select/bid
[Diagram: the Load is selected once from the RS for ea-comp execution, then accesses the data cache with the LSQ searched in parallel. The load-queue entry doesn’t “execute”; it just holds the address for detecting ordering violations.]
Store Execution
• STA and STD independently issue from the RS
  – STA does the ea-comp
  – STD just reads its operand and moves it to the LSQ
• When both have executed and reached the LSQ, perform an LSQ search for younger loads that have already executed (i.e., ordering violations)
LSQ Hardware in More Detail
• CAM logic – harder than the regular scheduler because we need address + age information
• Age information is not needed for physical registers, since register renaming guarantees one writer per address
  – There is no easy way to prevent more than one store to the same memory address
Loads Checking for Earlier Matching Stores
[Diagram: the load (LD 0x4000) compares its address, via the Address Bank CAM, against every older store queue entry (ST 0x4000, ST 0x4000, ST 0x4120). An entry is selected when its address matches, it is a valid store, and no younger matching store sits between it and the load; the winner’s value is read from the Data Bank.]
• This needs to be adjusted so that the load need not be at the bottom of the queue, and so that the LSQ can wrap around
• If |LSQ| is large, the logic can be adapted to have log delay
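A minimal sketch of the priority search, assuming a circular queue with a head pointer at the oldest entry. Real hardware performs all address comparisons in parallel and resolves priority with dedicated (log-depth) logic rather than this sequential walk.

```python
def youngest_older_matching_store(entries, head, load_pos, load_addr):
    """Walk backwards from the load toward the queue head (with
    wrap-around); the first valid store whose address matches the
    load's wins, i.e., the youngest older matching store."""
    n = len(entries)
    pos = load_pos
    while pos != head:
        pos = (pos - 1) % n            # step toward older entries
        e = entries[pos]
        if e is not None and e["is_store"] and e["addr"] == load_addr:
            return pos
    return None                        # no earlier match: use the cache
```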
Data Forwarding
• Similar logic to the previous slide
[Diagram: when a store broadcasts, each younger matching load captures the value, unless an intervening younger store to the same address has overwritten it. E.g., with ST 0x4000, ST 0x4120, ST 0x4000, LD 0x4000 in the queue, the first store’s value is “overwritten” by the third before reaching the load.]
• This logic is ugly, complicated, slow and power hungry!
Alternative: Store Colors
• Each store is assigned a unique, increasing number (its color)
• Loads inherit the color of the most recently alloc’d store
[Diagram: St(color=1), St(2), St(3), Ld, Ld, Ld, St(4), Ld – the three loads after the third store all have the same color: we only care about ordering w.r.t. stores, not other loads.]
• A load ignores store broadcasts if the store’s color is greater than its own
• Special care is needed to deal with the eventual overflow/wrap-around of the color/age counter
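A sketch of the color scheme, under the simplifying assumption that the counter never wraps (the wrap-around case, as noted above, needs special care). Names are illustrative.

```python
class ColorAllocator:
    """Assign each store a unique, increasing color; loads inherit the
    color of the most recently allocated store. Wrap-around of the
    counter is deliberately ignored in this sketch."""
    def __init__(self):
        self.current = 0
    def alloc_store(self):
        self.current += 1
        return self.current
    def alloc_load(self):
        return self.current

def load_ignores_broadcast(load_color, store_color):
    # A load ignores a store's broadcast if the store is younger,
    # i.e., the store's color exceeds the load's own color.
    return store_color > load_color
```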
Don’t Make Stores Forward
• When a load receives data, it still needs to wake up its dependents… the value isn’t needed until the dependents reach the execute stage
• Alternative timing/implementation:
  – Broadcast the address only
  – When the load wakes up, search the LSQ again (it should hit now)
Store→Load→Op Timing
[Timing diagrams comparing three cases: the ideal case (std/sta at cycle i, LD at i+1, the dependent add at i+2, with the load predicted dependent on the store and waiting for the STA); decoupled scheduling; and the re-search variant, where the LD searches the LSQ again after waking up. In every case the dependent add’s S-X-E pipeline ends up no earlier.]
• Even if the load value is ready, the dependent op hasn’t been scheduled yet
• No performance benefit for direct ST→LD forwarding at the time of the address broadcast
LSQ is Full of Associative Searches
• We should all know by now that associative searches do not scale well
• So how do we manage this?
Split Load Queue / Store Queue
• Stores don’t need to broadcast their address to other stores
• Loads don’t need to check for collisions against earlier loads
[Diagram: a Store Queue (STQ) and a Load Queue (LDQ). The associative search for earlier stores only needs to check entries that actually contain stores; the associative search for later loads (for ST→LD forwarding) only needs to check entries that actually contain loads.]
Load Execution
• Load issue → EA computation → DL1 access and LSQ search in parallel
• Typical latencies:
  – DL1: 3 cycles
  – LSQ search: 1 cycle (more?)
• Remember: instructions are speculatively scheduled!
Load Execution (2)
[Pipeline diagrams: a LOAD and its dependent ADD, each going S-X-X-X-E. One timing assumes an LSQ hit, the other a DL1 hit; the ADD must be scheduled to arrive when the load’s data is ready, and the two cases differ.]
• But at the time of scheduling, how do we know LSQ hit vs. DL1 hit?
Load Execution (3)
• Can predict the latency
  – similar to predicting L1 hit vs. L2 hit vs. going to DRAM
  – If we predict an LSQ hit but are wrong → scheduling replay
  – If we predict an L1 hit but are wrong → waste a few cycles
• Normalize latencies
  – Make an LSQ hit and an L1 hit have the same latency
  – Greatly simplifies the scheduler
  – Loses some performance, since in theory you could do ST→LD forwarding in less time than the L1 latency
    • The loss is not too great, since most loads do not hit in the LSQ
Reducing Ordering Violations
• Dependence violations can be predicted
[Diagram: store A and load B cause an ordering violation; a bit for B is set in a prediction table (“make note of it!”). The next time around, B is not allowed to issue before the previous STAs (A, X, Z) are known; once all previous STAs are known, it is safe to issue.]
• The table has a finite number of entries; eventually all will be set to “do not speculate”, which is equivalent to a machine with no ordering speculation
Dealing with a “Full” Table
• Do as branch predictors do: use counters
  – asymmetric costs
    • mispredicting a T-branch as NT, or an NT-branch as T, makes no difference; you need to flush and re-fetch either way
    • predicting a no-conflict load as a conflict causes the load to stall unnecessarily, but other insts may still execute
    • predicting a conflict as no-conflict causes a pipeline flush
  – asymmetric frequencies
    • no-conflict loads are much more common than conflicting loads
Dealing with “Full” Tables (2)
• Asymmetric updates
  – when there is no ordering violation, decrement the counter by 1
  – on an ordering violation, increment by X > 1
    • choose X based on the frequency of misspeculations and the penalty/performance cost of a misspeculation
• Periodic reset
  – Every K cycles, reset the entire table
  – Works reasonably well, with lower hardware cost than using saturating counters
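Both ideas, asymmetric updates and periodic reset, can be sketched with a small table of saturating counters. The table size, X, the saturation maximum, and the predict threshold below are all made-up parameters, not from any real design.

```python
class ConflictPredictor:
    """Per-PC saturating counters with asymmetric update: an ordering
    violation bumps the counter by X, correct speculation decays it
    by 1. All parameters are illustrative."""
    def __init__(self, entries=4096, x=4, cmax=15, threshold=8):
        self.table = [0] * entries
        self.x, self.cmax, self.threshold = x, cmax, threshold
    def _idx(self, pc):
        return pc % len(self.table)          # trivial hash for the sketch
    def predict_conflict(self, pc):
        return self.table[self._idx(pc)] >= self.threshold
    def update(self, pc, violated):
        i = self._idx(pc)
        if violated:
            self.table[i] = min(self.cmax, self.table[i] + self.x)
        else:
            self.table[i] = max(0, self.table[i] - 1)
    def periodic_reset(self):
        # Cheaper alternative: every K cycles, clear the whole table.
        self.table = [0] * len(self.table)
```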
Store-Load Pair Prediction
• Explicitly remember which load conflicted with which store
[Diagram: store A and load B cause an ordering violation; the pair (A, B) is recorded (“make note of it!”). The next time around, B is not allowed to issue before A’s STA is known, but it doesn’t have to wait for X and Z. Once A’s STA is known (X and Z still unknown), it is hopefully safe to issue.]
Store Sets Prediction
• A load may have conflicts with more than one previous store
[Diagram: basic block #1 contains store A (Store R1 → 0x4000); basic block #2 contains store B (Store R4 → 0x4000); basic block #3 contains load C (Load R2 = 0x4000). Depending on the path taken, C may conflict with A or with B.]
Store Sets Prediction (2)
[Diagram: store A and load C cause an ordering violation (“make note of it!”). Later, B and C cause another violation, so B joins the same store set as A. The next time around, C is not allowed to issue before both A’s and B’s STAs are known, but it doesn’t have to wait for Z. Once A’s and B’s STAs are known (Z still unknown), it is hopefully safe to issue.]
Store Sets Implementation
[Diagram: PCs hash into the Store Sets Identification Table (SSIT); each entry holds a store set ID (SSID). Here A, B and C all map to SSID 4, so they belong to the same store set. The Last Fetched Store Table (LFST) is indexed by SSID.
• A fetched: SSIT lookup → SSID = 4; update LFST[4] with A’s LSQ index (e.g., 12)
• C fetched: SSIT lookup → SSID = 4; the LFST says the load should wait on LSQ entry 12 before issuing
• If B is fetched before C, then B waits on A and updates the LFST, and C will then wait on B]
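The SSIT/LFST interaction can be sketched as follows. The table sizes, the PC hash, and the way an SSID is assigned on a violation are all illustrative assumptions, not the exact published mechanism.

```python
class StoreSets:
    """Sketch of Store Sets: the SSIT maps a PC (hashed) to a store-set
    ID (SSID); the LFST maps an SSID to the LSQ index of the last
    fetched store in that set. SSID assignment is simplified here."""
    def __init__(self, ssit_size=1024):
        self.ssit = [None] * ssit_size     # PC hash -> SSID
        self.lfst = {}                     # SSID -> LSQ index of last store
    def _hash(self, pc):
        return pc % len(self.ssit)         # trivial hash for the sketch
    def fetch_store(self, pc, lsq_index):
        ssid = self.ssit[self._hash(pc)]
        wait_on = self.lfst.get(ssid) if ssid is not None else None
        if ssid is not None:
            self.lfst[ssid] = lsq_index    # this store is now the last one
        return wait_on                     # serializes stores in the same set
    def fetch_load(self, pc):
        ssid = self.ssit[self._hash(pc)]
        return self.lfst.get(ssid) if ssid is not None else None
    def on_violation(self, store_pc, load_pc, ssid):
        # Merge both PCs into the same store set (SSID allocation policy
        # is abstracted away: the caller supplies the SSID).
        self.ssit[self._hash(store_pc)] = ssid
        self.ssit[self._hash(load_pc)] = ssid
```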
Note on Dependence Prediction
• Few processors actually support this
  – The 21264 did; it used the “load wait table”
  – Core 2 supports this now… so this is becoming much more important
• Many machines only use the wait-for-earlier-STAs approach
  – this becomes a bottleneck as the instruction window size increases