Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
J. Nelson Amaral
Tomasulo Algorithm
IBM 360/91 Floating Point Arithmetic Unit
Tomasulo Algorithm: a reservation station for each functional unit.
Baer, p. 97
Reservation-station entry:
• Free/Occupied bit
• Flag = on → Data = value; Flag = off → Data = tag
• A tag (pointer) to the ROB entry that will store the result.
Decode-rename Stage
Reservation station available?
• No → structural hazard: stall incoming instructions.
Free ROB entry?
• No → structural hazard: stall incoming instructions.
• Yes to both → assign a reservation station and the tail of the ROB to the instruction.
Baer p. 97
Dispatch Stage
Map for each source operand?
• Logical register → forward value to the Reservation Station (RS); ReadyBit(RS) ← 1.
• ROB entry → check the ROB entry flag:
  – Value → forward value to the RS; ReadyBit(RS) ← 1.
  – Tag → forward the ROB tag to the RS; ReadyBit(RS) ← 0.
Map the result register to a tag; enter the tag into the RS.
Enter the instruction at the tail of the ROB; ResultFlag(tail of ROB) ← 0.
Baer p. 98
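The operand lookup in the dispatch stage can be sketched as follows (a simplified model; the `ROBEntry` class and `dispatch_operand` function are invented for illustration, not Baer's notation):

```python
# Sketch of the dispatch-stage operand lookup (hypothetical model).
# The register map tells whether a source operand lives in a logical
# register or in a ROB entry; the ROB entry flag tells whether the
# value is already available or only a tag can be forwarded.

class ROBEntry:
    def __init__(self, tag, log_reg):
        self.tag = tag          # name of this ROB entry, e.g. "E1"
        self.log_reg = log_reg  # destination logical register
        self.flag = 0           # 0 = value not yet produced (Data = tag)
        self.data = tag         # holds the tag until the value arrives

def dispatch_operand(reg, reg_map, rob, reg_file):
    """Return (ready_bit, payload) to enter into the reservation station."""
    if reg in reg_map:                    # operand renamed to a ROB entry
        entry = rob[reg_map[reg]]
        if entry.flag:                    # value already produced
            return 1, entry.data          # forward value, ReadyBit <- 1
        return 0, entry.tag               # forward tag, ReadyBit <- 0
    return 1, reg_file[reg]               # operand is in the logical register

# Tiny usage example: R2 holds 7; R4 is renamed to pending ROB entry E1.
rob = {"E1": ROBEntry("E1", "R4")}
reg_map = {"R4": "E1"}
reg_file = {"R2": 7}
print(dispatch_operand("R2", reg_map, rob, reg_file))  # (1, 7)
print(dispatch_operand("R4", reg_map, rob, reg_file))  # (0, 'E1')
```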
Issue Stage
Both flags in the RS are on?
• No → wait.
• Yes → is the functional unit stalled (waiting for the CDB)?
  – Yes → wait.
  – No → issue the instruction to the functional unit to start execution.
If multiple functional units of the same type are available, use a scheduling algorithm.
CDB = Common Data Bus
Baer p. 98
Execute Stage
Last cycle of execution?
• Yes → request ownership of the CDB; once ownership is obtained, broadcast the result and its associated tag.
If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner.
Baer p. 98
The ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit.
RSs with the same tag store the result and set the corresponding flag.
Commit Stage
Is there a result at the head of the ROB?
• No → wait.
• Yes → store the result in the logical register; delete the ROB entry.
Baer p. 97
Operation Timings
Assuming no dependencies. (Baer, p. 98)
Addition (time 0–7): decoded → dispatched → issued → finishes execution → broadcast → commit (if at head of ROB).
Multiplication (time 0–7): decoded → dispatched → issued → finishes execution (longer latency) → broadcast → commit (if at head of ROB).
Example
i1: R4 ← R0 * R2    # uses reservation station 1 of the multiplier
i2: R6 ← R4 * R8    # uses reservation station 2 of the multiplier
i3: R8 ← R2 + R12   # uses reservation station 1 of the adder
i4: R4 ← R14 + R16  # uses reservation station 2 of the adder
[Snapshot: i1 executing, i2 dispatched]
• Register Map: R4 → E1, R6 → E2.
• ROB: E1 (flag 0, log. reg. R4) at head; E2 (flag 0, log. reg. R6); tail follows.
• Multiplier reservation station 2 holds i2: Flag1 = 0, Oper1 = E1 (waiting for i1's result); Flag2 = 1, Oper2 = (R8); Tag = E2.
• Adder reservation stations: free.
[Snapshot: i1 executing; i2 dispatched; i3 ready to broadcast; i4 dispatched]
• Register Map: R4 → E4, R6 → E2, R8 → E3.
• ROB: E1 (flag 0, R4) at head; E2 (flag 0, R6); E3 (flag 0, R8); E4 (flag 0, R4); tail follows.
• Multiplier reservation station 2 still holds i2 (Flag1 = 0, Oper1 = E1; Flag2 = 1, Oper2 = (R8); Tag = E2).
• Adder reservation station holds i4 with both operands ready: Flag1 = 1, Oper1 = (R14); Flag2 = 1, Oper2 = (R16); Tag = E4.
“register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)
[Snapshot: i3 broadcasts]
• ROB: E3 now holds the value (i3) with flag 1; E1, E2, and E4 are still pending.
• Multiplier reservation station 2 unchanged (i2 still waiting on E1).
• Status: i1 and i4 ready to broadcast; i2 dispatched; i3 broadcast.
Assume Adder has priority to broadcast.
[Snapshot: i4 broadcasts]
• ROB: E4 now holds the value (i4) with flag 1; E3 holds (i3); E1 and E2 are still pending.
• Status: i1 ready to broadcast; i2 dispatched; i4 broadcast.
Assume Adder has priority to broadcast.
[Snapshot: i1 broadcasts]
• ROB: E1 now holds the value (i1) with flag 1.
• Multiplier reservation station 2: Flag1 = 1, Oper1 = (i1) — i2 now has both operands and can issue.
• Status: i1 broadcast; i2 dispatched.
[Snapshot: i1 commits; i2 executing]
• Reservation stations: all free.
• ROB: head advances to E2 (R6), which waits for i2's result; E3 (i3) and E4 (i4) hold results but cannot commit until E2 does.
• Status: i1 committed; i2 executing.
IBM 360/91 – unveiled in 1966
Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors.
Baer p. 97
Data dependences between instructions
Where should these instructions wait?
How do they become ready for issue?
Several instructions get to the end of the front end and have to wait for operands.
Baer p. 177
Wakeup Stage
Detects instruction readiness.
We hope for m instructions (the machine width) to be woken up on each cycle.
Baer p. 177
Select Step
• Or Scheduling step: arbitrates between multiple instructions vying for the same functional unit.
  – Variations of first-come-first-served (or FIFO).
• Bypassing (or forwarding) of operands to units allows earlier selection.
• Critical instructions may have preference for selection.
Baer p. 177
Out-of-Order Architectures
Key idea: allow instructions following a stalled one to start execution out of order.
A FIFO schedule is not a good idea!
Where to store stalled instructions?
Baer p. 178
Two Extreme Solutions
Tomasulo: a separate reservation station for each functional unit (distributed window). Example: IBM PowerPC series.
Instruction Window: a centralized reservation station for all functional units (centralized window). Example: Intel P6 architecture.
Baer p. 178
A Hybrid Solution
Reservation stations are shared among groups of functional units (hybrid window).
MIPS R10000: 3 sets of reservation stations:
• address calculations
• floating-point units
• load-store units
Baer p. 178
How does a design team select between a centralized, distributed, or hybrid window?
What are the compromises?
Baer p. 179
Window design
• Resource allocation: centralized is better
  – static partitioning of resources is worse than dynamic allocation
• Large windows: speed and power come into play
Baer p. 179
Two-Step Instruction Issue
Wakeup: instruction is ready for execution
Select: instruction is assigned to an execution unit.
Wakeup Step
Baer p. 180
[Figure: w window entries, f functional units; each window entry receives buses from the execution units.]
• We need one bus from each functional unit to each window entry.
• We also need two comparators (one per source operand) for each functional unit in each window entry.
• Thus we need 2fw comparators.
• If we separate the functional units and window slots into two equal-size groups, we only need 2fw/2 = fw comparators.
• We will also need fewer (shorter) buses from units to slots.
Select Step
• Priority encoder: a circuit that receives several requests and issues one grant
• Woken-up instructions vying for the same unit send requests.
• Priority is related to position in the window.
• Smaller window → smaller priority encoder
Baer p. 181
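The wakeup (tag comparison on every broadcast) and select (position-based priority encoder) steps can be sketched together; the window-entry layout and function names are illustrative assumptions, not from Baer:

```python
# Wakeup: every broadcast tag is compared against both source-operand
# tags of every window entry (the comparators discussed above).
# Select: among fully ready entries, the one closest to the window
# head (lowest position) is granted the functional unit.

def wakeup(window, broadcast_tag):
    for entry in window:
        for i in (0, 1):
            if not entry["ready"][i] and entry["src"][i] == broadcast_tag:
                entry["ready"][i] = True

def select(window):
    """Priority encoder: first fully ready entry (lowest position) wins."""
    for pos, entry in enumerate(window):
        if all(entry["ready"]):
            return pos
    return None  # nothing ready this cycle

window = [
    {"src": ["E1", "E2"], "ready": [False, True]},
    {"src": ["E3", "E4"], "ready": [True, True]},
]
print(select(window))      # 1: only the second entry is fully ready
wakeup(window, "E1")       # broadcast of E1's result wakes entry 0
print(select(window))      # 0: entry 0 is ready and has higher priority
```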
When should a centralized window be replaced by a distributed or hybrid one?
When the wakeup-select steps are on the critical path.
The threshold appears to be windows with around 64 entries on a 4-wide superscalar processor.
Baer p. 182
Intel Pentium 4: 2 large windows, 2 schedulers per window.
Intel Pentium III and Intel Core: smaller centralized window.
AMD Opteron: 4 sets of reservation stations.
Baer p. 182
Relation between Select and Wake Up
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected.
Broadcasting the tag of R51 wakes up instruction i+1.
For single-cycle-latency instructions, the start of execution is too late to broadcast the tag.
Baer p. 183
Speculative Wake Up and Select
i: R51 ← load(R22)
i+1: R43 ← R27 – R51
i+2: R35 ← R51 + R28
Example:
In this case the tag of the destination of instruction i is broadcast.
Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency.
In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted.
Baer p. 183
Speculative Selection and the Reservation Stations
• An instruction must remain in a reservation station after it is scheduled.
  – A bit indicates that the instruction has been selected.
  – The station is freed once it is certain that the instruction selection is no longer speculative.
• Windows are large in comparison with the number of functional units.
  – They accommodate many instructions in flight, some speculatively.
Baer p. 183
Integrated Register File
What happens upon selection of an instruction?
[Figure: Tomasulo reservation stations hold the opcode and operands and feed the functional unit directly; with an integrated register file, the instruction window is backed by a physical register file that supplies the operands to the functional unit.]
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Compute i+1
The output of A must be forwarded to B, bypassing storage.
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Now the bypass must forward the output to the input of A.
Compute i+1
But the hardware has to implement both buses.
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Compute i+1
Also, we need buses to forward the output of B.
In general, given k functional units we may need k² buses.
Buses become long to avoid crossing each other.
Forwarding may limit the number of functional units in a processor.
Forwarding may need more than one cycle to complete.
Baer p. 184
Load Speculation
• Load Address Speculation
  – Used for data prefetching
• Memory dependence prediction
  – Used to speculate data flow from a store to a subsequent load.
Baer p. 185
Store Buffer
• Store Buffer: a circular queue.
  – Entry allocated when a store instruction is decoded.
  – Entry removed when the store is committed.
• Keeps data for stores that have not yet committed.
Baer p. 185
States of a Store Buffer Entry
• AV: Available
• AD: Address is known (the data to be stored is still to be computed by another instruction)
• RE: Result and address known
• CO: Committed
Transitions: AV → AD on address computation; AD → RE when the data arrives; RE → CO when the store instruction reaches the top of the ROB; CO → AV when the data is written to cache.
What happens to the store buffer on a branch misprediction?
Baer p. 185
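The store-buffer entry life cycle above can be sketched as a small state machine; the event names are invented for illustration:

```python
# Sketch of the store-buffer entry state machine (AV -> AD -> RE -> CO -> AV).
# States follow the slide: AV available, AD address known, RE result and
# address known, CO committed. Event names are hypothetical.

TRANSITIONS = {
    ("AV", "address_computed"): "AD",  # address known, data still pending
    ("AD", "data_ready"):       "RE",  # result and address both known
    ("RE", "head_of_rob"):      "CO",  # store reaches the top of the ROB
    ("CO", "written_to_cache"): "AV",  # data written to cache, entry recycled
}

def step(state, event):
    """Apply one event; unknown (state, event) pairs leave the state alone."""
    return TRANSITIONS.get((state, event), state)

s = "AV"
for ev in ("address_computed", "data_ready", "head_of_rob", "written_to_cache"):
    s = step(s, ev)
print(s)  # AV: the entry has gone around the full cycle
```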
Handling the Store Buffer on Branch Mispredictions and Exceptions
• Entries preceding the mispredicted branch:
  – are in COMMIT state
  – must be written to cache
• Entries following the misprediction:
  – become AVAILABLE
• Exceptions: similar
  – Must write the COMMIT entries to cache before handling the exception
Baer p. 186
Load Instructions and Load Speculation
Baer p. 187
Load/Store Window Implementation – Most Restricted
Single window (FIFO) for loads and stores.
Loads/stores are inserted in program order.
Loads/stores are removed in the same order – at most one per cycle.
Baer p. 187
Load Bypassing
• Compare the address of the load with all addresses in the store buffer.
  – Load bypassing: if there is no match → the load can proceed.
  – What happens if the operand address of any entry in the store buffer is not yet computed?
    • The load cannot proceed.
  – What happens if there is a match to an entry that is not committed?
    • The load cannot access the cache.
    • "Match" means the last match in program order.
• Requires an associative search of operand addresses in the store buffer.
Baer p. 187
Load Forwarding
• If these conditions are true:
  – the load matches a store buffer entry, AND
  – the result is available for the entry (the entry is in RE or CO state),
• then the result can be sent to the register specified by the load.
• If the match is with an entry in AD state:
  – the load waits for the entry to reach RE state.
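The bypassing and forwarding rules above can be sketched as one check against the store buffer; the entry layout and return values are illustrative assumptions:

```python
# Sketch of the load-bypassing / load-forwarding check against the store
# buffer. Entries are the preceding stores in program order, oldest
# first; states follow the slides (AD = address known, RE = result and
# address known, CO = committed). An entry with addr None has not yet
# computed its address.

def check_load(load_addr, store_buffer):
    """Return ('bypass',), ('forward', value), or ('wait',)."""
    decision = ("bypass",)           # no match so far: load may proceed
    for entry in store_buffer:
        if entry["state"] == "AV":
            continue                 # free slot
        if entry["addr"] is None:
            return ("wait",)         # an unknown address blocks the load
        if entry["addr"] == load_addr:
            if entry["state"] in ("RE", "CO"):
                decision = ("forward", entry["value"])  # last match wins
            else:                    # AD: matching store's data not ready
                decision = ("wait",)
    return decision

buf = [{"state": "RE", "addr": 0x100, "value": 42},
       {"state": "AD", "addr": 0x200, "value": None}]
print(check_load(0x100, buf))  # ('forward', 42)
print(check_load(0x200, buf))  # ('wait',)
print(check_load(0x300, buf))  # ('bypass',)
```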
Load Speculation in Out-of-Order Architectures
Dynamic Memory Disambiguation Problem: loads are issued speculatively ahead of preceding stores in program order. How do we ensure that data dependences are not violated?
Three approaches:
• Pessimistic: wait until it is certain that the load can proceed (as in load bypassing and forwarding).
• Optimistic: the load always proceeds speculatively; a recovery mechanism is needed.
• Dependence prediction: use a predictor to decide whether to speculate, trying to have fewer recoveries.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 188
(true dependency between i2 and i3)
Pessimistic: i3 and i4 cannot issue until i2 has computed its result:
• i2 must be at least in RE (Result).
• i4 proceeds once i1 and i2 are in AD (Address).
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Optimistic: i3 and i4 issue as soon as possible (load-buffer entries are created).
When a store reaches CO, its address is compared associatively with the load-buffer entries.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Store buffer: AD i1 memadd1; AD i2 memadd2.
Load buffer: 1 i3 memadd3; 1 i4 memadd4 (the bit indicates that the load is speculative).
i1 reaches CO: nothing happens because there is no match in the load buffer.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Store buffer: CO i1 memadd1; CO i2 memadd2.
Load buffer: 1 i3 memadd3; 1 i4 memadd4.
i2 reaches CO and matches i3's load-buffer entry:
• i3 has to be reissued.
• i4 has to be reissued because it is after i3 in program order.
• Some implementations only reissue instructions that depend on i3.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Dependence prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.
Motivation: Optimistic
Memory dependencies are rare: less than 10% of loads depend on an earlier store.
Baer p. 190
Motivation: Dependence Prediction
Load misspeculations are expensive and predictors can reduce them.
What strategy should we use forpredicting profitable speculations?
Baer p. 190
Simple Strategy
Memory dependencies are infrequent → predict that all loads can be speculated.
If a load L is misspeculated, all subsequent instances of L must wait.
We need a bit to remember. Where should this bit be stored?
Baer p. 190
Simple Strategy (cont.)
A single prediction bit P is associated with the instruction in the cache:
• When the load instruction is brought into the cache → P = 1.
• When the load is misspeculated → P = 0.
• When the line is evicted from the cache and reloaded → P = 1.
Strategy used in the DEC Alpha 21264.
Baer p. 190
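The single-bit policy above can be sketched directly (a minimal model of the Alpha 21264-style P bit; the class and method names are invented for illustration):

```python
# Sketch of the single-bit load-speculation predictor described above:
# a P bit attached to the load instruction in the instruction cache.

class CachedLoad:
    def __init__(self):
        self.p = 1                 # on cache fill: predict the load can speculate

    def on_misspeculation(self):
        self.p = 0                 # all later instances of this load must wait

    def on_reload(self):
        self.p = 1                 # eviction + reload resets the prediction

ld = CachedLoad()
print(ld.p)             # 1: speculate
ld.on_misspeculation()
print(ld.p)             # 0: subsequent instances wait
ld.on_reload()
print(ld.p)             # 1: optimism restored after the line is reloaded
```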
Principle Behind Load Prediction
“static store-load instruction pairs that cause most of the dynamic data mispredictions are relatively few and exhibit temporal locality.”
Moshovos A. , Breach S. E., Vijaykumar T. N., Sohi G. S.,“Dynamic Speculation and Synchronization of DataDependences,” International Symposium on ComputerArchitecture, (ISCA) 1997, Denver, CO, USA
Ideal Load Speculation
• Avoids mis-speculation.
• Allows loads to execute as early as possible.
• Loads with no true dependences → execute without delay.
• A load with a true dependence → executes as soon as the store that produces the data commits.
Moshovos ISCA97.
A Real Predictor
Moshovos ISCA97.
i. Dynamically identify store-load pairs that are likely to be data dependent.
ii. Provide a synchronization mechanism to instances of these dependences.
iii. Use this mechanism to synchronize the store and the load.
Load Predictor Table
Baer p. 190
Hash based on PC; saturating counters.
Predictor states:
• 00: strong no-speculate
• 01: weak no-speculate
• 10: weak speculate
• 11: strong speculate
Load-buffer entry fields:
• tag
• op.address: memory address of the operand
• spec.bit: speculative load?
• update.bit: should the predictor be updated at commit/abort?
Each load instruction has a loadspec bit.
Incrementing a saturating counter moves it toward strong speculate.
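The 2-bit saturating counter above can be sketched as follows (helper names are invented for illustration; the state encoding follows the slide):

```python
# Sketch of the 2-bit saturating counter used by the load predictor:
# 0 (00) strong no-speculate, 1 (01) weak no-speculate,
# 2 (10) weak speculate,      3 (11) strong speculate.

def increment(c):
    """Move toward strong speculate, saturating at 3."""
    return min(c + 1, 3)

def strong_nospeculate():
    """Reset applied on a misspeculated load."""
    return 0

def predict_speculate(c):
    """Speculate in the two upper states (10 and 11)."""
    return c >= 2

c = 1                                   # weak no-speculate
print(predict_speculate(c))             # False: do not speculate yet
c = increment(increment(c))             # two favorable outcomes later
print(c, predict_speculate(c))          # 3 True: strong speculate
c = strong_nospeculate()                # a misspeculation resets it
print(c, predict_speculate(c))          # 0 False: strong no-speculate
```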
Load/Decode Stage
• Set loadspec bit according to value of counter associated with the load PC
Baer p. 190
After the Operand Address is Computed
Uncommitted younger stores?
• No → enter in the load buffer: op.ad, spec.bit = 0, update.bit = 0; issue the cache access.
• Yes → check the loadspec bit:
  – On → enter in the load buffer: op.ad, spec.bit = 1, update.bit = 0; issue the cache access.
  – Off → enter in the load buffer: op.ad, spec.bit = 0, update.bit = 1; wait (as in the pessimistic solution).
Baer p. 190
Store Commit Stage
For all matches in the load buffer, check the spec.bit:
• On → load abort: predictor ← strong no-speculate; recover from the misspeculated load.
• Off → update.bit ← 0: it was correct to not speculate, and the load should keep not speculating in the future.
Baer p. 191
Load Commit Stage
Check the spec.bit:
• On → increment the saturating counter: speculating was correct.
• Off → check the update.bit:
  – On → increment the saturating counter: we would like to speculate in the future.
  – Off → predictor ← strong no-speculate.
Baer p. 191
Store Sets
Baer p. 191
Motivation for Store Sets
• The past is a good predictor of future memory-order violations.
• Must also predict:
  – when one load is dependent on multiple stores (e.g., store A, store B, and store C all feeding load D);
  – when multiple loads depend on one store (e.g., loads E and F depending on the same store).
Chrysos ISCA98
Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction using Store Sets,” International Symposium on Computer Architecture, 1998 pp. 142-153.
Store Set Definition
Given a load L, the store set of L is the set of all stores that L has ever depended upon.
Ideally, any time a store-load dependence is detected, the store is added to the load's store set table.
To make a prediction, the store set table of the load is searched for all uncommitted younger stores.
Chrysos ISCA98
Too expensive! We need an approximation.
Implementation of Store Sets Memory Dependence Prediction
Both loads and stores have entries in the Store Set ID Table (SSIT); each entry points into the Last Fetched Store Table (LFST), which records the most recently fetched store of each store set.
Chrysos ISCA98
Store Set Examples: multiple loads depend on one store
j: load add1
k: load add2
⋅⋅⋅
i: store add3
[Figure: the SSIT entries for i, j, and k all point to the same LFST entry.]
Baer p. 192
Store Set Examples: one load depends on multiple stores
i: store add2
j: store add3
⋅⋅⋅
k: load add1
[Figure: the SSIT entries for i, j, and k all point to the same LFST entry.]
Baer p. 192
Store Set Examples: multiple loads depend on multiple stores
i: store add2
j: store add3
⋅⋅⋅
k: load add1
⋅⋅⋅
l: load add4
[Figure: SSIT entries for i, j, k, and l and their LFST entries.]
We have a conflict between the LFST entries associated with i and l.
The winner is the entry with the smaller index in the SSIT; the loser is made to point to the winner's entry.
Baer p. 192
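The SSIT/LFST mechanism can be sketched as follows (a simplified model under stated assumptions: tables are dicts keyed by PC, store-set IDs are chosen by the smaller-index-wins merge rule from the slides; all names are invented for illustration):

```python
# Sketch of store-set prediction with an SSIT (PC -> store-set ID) and
# an LFST (store-set ID -> last fetched store of that set).

SSIT = {}   # maps instruction PC -> store-set ID (LFST index)
LFST = {}   # maps store-set ID -> PC of the last fetched store, or None

def record_violation(store_pc, load_pc):
    """A load was misspeculated on this store: put them in one store set."""
    sid_s, sid_l = SSIT.get(store_pc), SSIT.get(load_pc)
    if sid_s is None and sid_l is None:
        sid = min(store_pc, load_pc)     # new set
    elif sid_s is None:
        sid = sid_l                      # join the load's existing set
    elif sid_l is None:
        sid = sid_s                      # join the store's existing set
    else:
        sid = min(sid_s, sid_l)          # conflict: smaller index wins
    SSIT[store_pc] = SSIT[load_pc] = sid # loser now points to the winner
    LFST.setdefault(sid, None)

def fetch_store(pc):
    """Record this store as the last fetched store of its set."""
    sid = SSIT.get(pc)
    if sid is not None:
        LFST[sid] = pc

def fetch_load(pc):
    """Return the store this load must synchronize with, or None."""
    sid = SSIT.get(pc)
    return LFST.get(sid) if sid is not None else None

record_violation(store_pc=100, load_pc=200)  # a past memory-order violation
fetch_store(100)
print(fetch_load(200))   # 100: the load waits for store 100
print(fetch_load(300))   # None: an untracked load may speculate freely
```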
Evaluating Load Speculation
• Performance benefits from load speculation depend on:
  – the speculation miss rate
  – the cost of misspeculation recovery
Baer p. 194
Evaluating Load Speculation - Terminology
Conflicting load: at the time the load is ready to issue there is a previous store in the instruction window whose operand address is unknown.
Colliding load: the load is dependent on one of the stores with which it conflicts.
Baer p. 194
Evaluating Load Speculation – Typical measurements
• In a 32-entry load-store window:
  – 25% of loads are non-conflicting;
  – of the 75% conflicting loads, only 10% actually collide.
• In larger windows, the percentage of:
  – non-conflicting loads increases;
  – colliding loads decreases.
Baer p. 194
Back-End Optimizations
• Branch prediction
  – "a must"
• Load speculation (loads bypassing stores)
  – "important" because other instructions depend on the load
• Prediction of load latency
  – "common", to hide load latency in the cache hierarchy
Baer p. 195
Other Back-End Optimizations
• Value Prediction
  – Predict the value that an instruction will compute.
  – May be restricted to the values loaded by loads.
• Critical Instructions
  – Predict which instructions are on the critical path.
Baer p. 196-201
Clustered Microarchitectures
Baer p. 201
Back-end Limitations to m
Large windows: a large m requires large windows, which are expensive in hardware and power dissipation.
Many functional units: many (long) buses, which affect forwarding.
Centralized resources (e.g., the register file): large resources, many ports.
Baer p. 201
Definition of a Cluster
• A cluster is formed by:– A set of functional units– A register file– An instruction window (or reservation stations)
Baer p. 201
Clustered Microarchitecture
Baer p. 202
Register File Replication
• A copy of the register file in each cluster
  – Small number of clusters
  – Can use a crossbar switch for interconnection
  – Example (Alpha 21264):
    • the integer unit is two clusters;
    • each cluster has a full copy of the 80 registers
Baer p. 202
Changes because of Clustering
• Front end
  – steer instructions to the window of a cluster
    • static: compile-time decision
    • dynamic: by hardware at runtime
• Back end
  – copy results into the registers of other clusters
  – intercluster latency affects wakeup and select
Baer p. 202
Effect of Clustering in Performance
• Latency to forward results between clusters
• Sensitive to load balancing between clusters
• Conflicting goals:
  – keep producers and consumers of data in the same cluster
  – balance the workload
Baer p. 202
Distributed Register Files
• Steering affects renaming.
  – Assume that an instruction a is assigned to cluster ci.
    • A free register from ci will be used for the result of a.
  – If an operand of a is produced by an instruction b in a cluster cj, what needs to be done?
    1. Another free register of ci is assigned to this operand.
    2. A copy instruction is inserted in cj immediately after b.
    3. The copy is kept in ci for use by other instructions.
Baer p. 203
Clustered microarchitectures can be seen as a step in the evolution from monolithic processors to multiprocessors.
Chapter Summary: the back end is important for performance
– Tomasulo Algorithm
– Centralized/Distributed/Hybrid windows
– Wakeup/Select steps
– Scheduling: critical instructions first
– Loads:
  • bypassing stores
  • forwarding values
  • speculating on the absence of dependences with stores
– Clustering to reduce wiring complexity