Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
J. Nelson Amaral
Tomasulo Algorithm
IBM 360/91 Floating Point Arithmetic Unit
Tomasulo Algorithm: a reservation station for each functional unit.
Baer, p. 97
Reservation-station entry:
• Free/Occupied bit
• Flag = on → Data = value; Flag = off → Data = tag
• A tag (pointer) to the ROB entry that will store the result.
Decode-rename Stage
Reservation station available?
• No → structural hazard: stall incoming instructions.
Free ROB entry?
• No → structural hazard: stall incoming instructions.
• Yes to both → assign a reservation station and the tail of the ROB to the instruction.
Baer p. 97
Dispatch Stage
Map for each source operand?
• Logical register → forward value to the Reservation Station (RS); ReadyBit(RS) ← 1.
• ROB entry → check the ROB entry flag:
  – Value → forward value to the RS; ReadyBit(RS) ← 1.
  – Tag → forward the ROB tag to the RS; ReadyBit(RS) ← 0.
Map the result register to a tag; enter the tag into the RS.
Enter the instruction at the tail of the ROB; ResultFlag(tail of ROB) ← 0.
Baer p. 98
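The operand lookup in the dispatch stage can be sketched as follows (a simplified model; the `ROBEntry` class and `dispatch_operand` function are invented for illustration, not Baer's notation):

```python
# Sketch of the dispatch-stage operand lookup (hypothetical model).
# The register map tells whether a source operand lives in a logical
# register or in a ROB entry; the ROB entry flag tells whether the
# value is already available or only a tag can be forwarded.

class ROBEntry:
    def __init__(self, tag, log_reg):
        self.tag = tag          # name of this ROB entry, e.g. "E1"
        self.log_reg = log_reg  # destination logical register
        self.flag = 0           # 0 = value not yet produced (Data = tag)
        self.data = tag         # holds the tag until the value arrives

def dispatch_operand(reg, reg_map, rob, reg_file):
    """Return (ready_bit, payload) to enter into the reservation station."""
    if reg in reg_map:                    # operand renamed to a ROB entry
        entry = rob[reg_map[reg]]
        if entry.flag:                    # value already produced
            return 1, entry.data          # forward value, ReadyBit <- 1
        return 0, entry.tag               # forward tag, ReadyBit <- 0
    return 1, reg_file[reg]               # operand is in the logical register

# Tiny usage example: R2 holds 7; R4 is renamed to pending ROB entry E1.
rob = {"E1": ROBEntry("E1", "R4")}
reg_map = {"R4": "E1"}
reg_file = {"R2": 7}
print(dispatch_operand("R2", reg_map, rob, reg_file))  # (1, 7)
print(dispatch_operand("R4", reg_map, rob, reg_file))  # (0, 'E1')
```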
Issue Stage
Both flags in the RS are on?
• No → wait.
• Yes → is the functional unit stalled (waiting for the CDB)?
  – Yes → wait.
  – No → issue the instruction to the functional unit to start execution.
If multiple functional units of the same type are available, use a scheduling algorithm.
CDB = Common Data Bus
Baer p. 98
Execute Stage
Last cycle of execution?
• Yes → request ownership of the CDB; once ownership is obtained, broadcast the result and its associated tag.
If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner.
Baer p. 98
The ROB stores the result in the entry identified by the tag and sets the corresponding ReadyBit.
RSs with the same tag store the result and set the corresponding flag.
Commit Stage
Is there a result at the head of the ROB?
• No → wait.
• Yes → store the result in the logical register; delete the ROB entry.
Baer p. 97
Operation Timings
Assuming no dependencies. (Baer, p. 98)
Addition (time 0–7): decoded → dispatched → issued → finishes execution → broadcast → commit (if at head of ROB).
Multiplication (time 0–7): decoded → dispatched → issued → finishes execution (longer latency) → broadcast → commit (if at head of ROB).
Example
i1: R4 ← R0 * R2    # uses reservation station 1 of the multiplier
i2: R6 ← R4 * R8    # uses reservation station 2 of the multiplier
i3: R8 ← R2 + R12   # uses reservation station 1 of the adder
i4: R4 ← R14 + R16  # uses reservation station 2 of the adder
[Snapshot: i1 executing, i2 dispatched]
• Register Map: R4 → E1, R6 → E2.
• ROB: E1 (flag 0, log. reg. R4) at head; E2 (flag 0, log. reg. R6); tail follows.
• Multiplier reservation station 2 holds i2: Flag1 = 0, Oper1 = E1 (waiting for i1's result); Flag2 = 1, Oper2 = (R8); Tag = E2.
• Adder reservation stations: free.
[Snapshot: i1 executing; i2 dispatched; i3 ready to broadcast; i4 dispatched]
• Register Map: R4 → E4, R6 → E2, R8 → E3.
• ROB: E1 (flag 0, R4) at head; E2 (flag 0, R6); E3 (flag 0, R8); E4 (flag 0, R4); tail follows.
• Multiplier reservation station 2 still holds i2 (Flag1 = 0, Oper1 = E1; Flag2 = 1, Oper2 = (R8); Tag = E2).
• Adder reservation station holds i4 with both operands ready: Flag1 = 1, Oper1 = (R14); Flag2 = 1, Oper2 = (R16); Tag = E4.
“register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)
[Snapshot: i3 broadcasts]
• ROB: E3 now holds the value (i3) with flag 1; E1, E2, and E4 are still pending.
• Multiplier reservation station 2 unchanged (i2 still waiting on E1).
• Status: i1 and i4 ready to broadcast; i2 dispatched; i3 broadcast.
Assume Adder has priority to broadcast.
[Snapshot: i4 broadcasts]
• ROB: E4 now holds the value (i4) with flag 1; E3 holds (i3); E1 and E2 are still pending.
• Status: i1 ready to broadcast; i2 dispatched; i4 broadcast.
Assume Adder has priority to broadcast.
[Snapshot: i1 broadcasts]
• ROB: E1 now holds the value (i1) with flag 1.
• Multiplier reservation station 2: Flag1 = 1, Oper1 = (i1) — i2 now has both operands and can issue.
• Status: i1 broadcast; i2 dispatched.
[Snapshot: i1 commits; i2 executing]
• Reservation stations: all free.
• ROB: head advances to E2 (R6), which waits for i2's result; E3 (i3) and E4 (i4) hold results but cannot commit until E2 does.
• Status: i1 committed; i2 executing.
IBM 360/91 – unveiled in 1966
Some variant of the Tomasulo algorithm is the basis for the design of all out-of-order processors.
Baer p. 97
Data dependences between instructions
Where should these instructions wait?
How do they become ready for issue?
Several instructions get to the end of the front end and have to wait for operands.
Baer p. 177
Wakeup Stage
Detects instruction readiness.
We hope for m instructions (the machine width) to be woken up on each cycle.
Baer p. 177
Select Step
• Or Scheduling step: arbitrates between multiple instructions vying for the same functional unit.
  – Variations of first-come-first-served (or FIFO).
• Bypassing (or forwarding) of operands to units allows earlier selection.
• Critical instructions may have preference for selection.
Baer p. 177
Out-of-Order Architectures
Key idea: allow instructions following a stalled one to start execution out of order.
A FIFO schedule is not a good idea!
Where to store stalled instructions?
Baer p. 178
Two Extreme Solutions
Tomasulo: a separate reservation station for each functional unit (distributed window). Example: IBM PowerPC series.
Instruction Window: a centralized reservation station for all functional units (centralized window). Example: Intel P6 architecture.
Baer p. 178
A Hybrid Solution
Reservation stations are shared among groups of functional units (hybrid window).
MIPS R10000: 3 sets of reservation stations:
• address calculations
• floating-point units
• load-store units
Baer p. 178
How does a design team select between a centralized, distributed, or hybrid window?
What are the compromises?
Baer p. 179
Window design
• Resource allocation: centralized is better
  – static partitioning of resources is worse than dynamic allocation
• Large windows: speed and power come into play
Baer p. 179
Two-Step Instruction Issue
Wakeup: instruction is ready for execution
Select: instruction is assigned to an execution unit.
Wakeup Step
Baer p. 180
[Figure: w window entries, f functional units; each window entry receives buses from the execution units.]
• We need one bus from each functional unit to each window entry.
• We also need two comparators (one per source operand) for each functional unit in each window entry.
• Thus we need 2fw comparators.
• If we separate the functional units and window slots into two equal-size groups, we only need 2fw/2 = fw comparators.
• We will also need fewer (shorter) buses from units to slots.
Select Step
• Priority encoder: a circuit that receives several requests and issues one grant
• Woken-up instructions vying for the same unit send requests.
• Priority is related to position in the window.
• Smaller window → smaller priority encoder
Baer p. 181
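The wakeup (tag comparison on every broadcast) and select (position-based priority encoder) steps can be sketched together; the window-entry layout and function names are illustrative assumptions, not from Baer:

```python
# Wakeup: every broadcast tag is compared against both source-operand
# tags of every window entry (the comparators discussed above).
# Select: among fully ready entries, the one closest to the window
# head (lowest position) is granted the functional unit.

def wakeup(window, broadcast_tag):
    for entry in window:
        for i in (0, 1):
            if not entry["ready"][i] and entry["src"][i] == broadcast_tag:
                entry["ready"][i] = True

def select(window):
    """Priority encoder: first fully ready entry (lowest position) wins."""
    for pos, entry in enumerate(window):
        if all(entry["ready"]):
            return pos
    return None  # nothing ready this cycle

window = [
    {"src": ["E1", "E2"], "ready": [False, True]},
    {"src": ["E3", "E4"], "ready": [True, True]},
]
print(select(window))      # 1: only the second entry is fully ready
wakeup(window, "E1")       # broadcast of E1's result wakes entry 0
print(select(window))      # 0: entry 0 is ready and has higher priority
```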
When should a centralized window be replaced by a distributed or hybrid one?
When the wakeup-select steps are on the critical path.
The threshold appears to be windows with around 64 entries on a 4-wide superscalar processor.
Baer p. 182
Intel Pentium 4: 2 large windows, 2 schedulers per window.
Intel Pentium III and Intel Core: smaller centralized window.
AMD Opteron: 4 sets of reservation stations.
Baer p. 182
Relation between Select and Wake Up
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected.
Broadcasting the tag of R51 wakes up instruction i+1.
For single-cycle-latency instructions, the start of execution is too late to broadcast the tag.
Baer p. 183
Speculative Wake Up and Select
i: R51 ← load(R22)
i+1: R43 ← R27 – R51
i+2: R35 ← R51 + R28
Example:
In this case the tag of the destination of instruction i is broadcast.
Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency.
In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted.
Baer p. 183
Speculative Selection and the Reservation Stations
• An instruction must remain in a reservation station after it is scheduled.
  – A bit indicates that the instruction has been selected.
  – The station is freed once it is certain that the instruction selection is no longer speculative.
• Windows are large in comparison with the number of functional units.
  – They accommodate many instructions in flight, some speculatively.
Baer p. 183
Integrated Register File
What happens upon selection of an instruction?
[Figure: Tomasulo reservation stations hold the opcode and operands and feed the functional unit directly; with an integrated register file, the instruction window is backed by a physical register file that supplies the operands to the functional unit.]
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Compute i+1
The output of A must be forwarded to B, bypassing storage.
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Now the bypass must forward the output to the input of A.
Compute i+1
But the hardware has to implement both buses.
Baer p. 183
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
FunctionalUnit A
Compute i
FunctionalUnit B
Compute i+1
Also, we need buses to forward the output of B.
In general, given k functional units we may need k² buses.
Buses become long to avoid crossing each other.
Forwarding may limit the number of functional units in a processor.
Forwarding may need more than one cycle to complete.
Baer p. 184
Load Speculation
• Load Address Speculation
  – Used for data prefetching
• Memory dependence prediction
  – Used to speculate data flow from a store to a subsequent load.
Baer p. 185
Store Buffer
• Store Buffer: a circular queue.
  – Entry allocated when a store instruction is decoded.
  – Entry removed when the store is committed.
• Keeps data for stores that have not yet committed.
Baer p. 185
States of a Store Buffer Entry
• AV: Available
• AD: Address is known (the data to be stored is still to be computed by another instruction)
• RE: Result and address known
• CO: Committed
Transitions: AV → AD on address computation; AD → RE when the data arrives; RE → CO when the store instruction reaches the top of the ROB; CO → AV when the data is written to cache.
What happens to the store buffer on a branch misprediction?
Baer p. 185
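The store-buffer entry life cycle above can be sketched as a small state machine; the event names are invented for illustration:

```python
# Sketch of the store-buffer entry state machine (AV -> AD -> RE -> CO -> AV).
# States follow the slide: AV available, AD address known, RE result and
# address known, CO committed. Event names are hypothetical.

TRANSITIONS = {
    ("AV", "address_computed"): "AD",  # address known, data still pending
    ("AD", "data_ready"):       "RE",  # result and address both known
    ("RE", "head_of_rob"):      "CO",  # store reaches the top of the ROB
    ("CO", "written_to_cache"): "AV",  # data written to cache, entry recycled
}

def step(state, event):
    """Apply one event; unknown (state, event) pairs leave the state alone."""
    return TRANSITIONS.get((state, event), state)

s = "AV"
for ev in ("address_computed", "data_ready", "head_of_rob", "written_to_cache"):
    s = step(s, ev)
print(s)  # AV: the entry has gone around the full cycle
```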
Handling the Store Buffer on Branch Mispredictions and Exceptions
• Entries preceding the mispredicted branch:
  – are in COMMIT state
  – must be written to cache
• Entries following the misprediction:
  – become AVAILABLE
• Exceptions: similar
  – Must write the COMMIT entries to cache before handling the exception
Baer p. 186
Load Instructions and Load Speculation
Baer p. 187
Load/Store Window Implementation – Most Restricted
Single window (FIFO) for loads and stores.
Loads/stores are inserted in program order.
Loads/stores are removed in the same order – at most one per cycle.
Baer p. 187
Load Bypassing
• Compare the address of the load with all addresses in the store buffer.
  – Load bypassing: if there is no match → the load can proceed.
  – What happens if the operand address of any entry in the store buffer is not yet computed?
    • The load cannot proceed.
  – What happens if there is a match to an entry that is not committed?
    • The load cannot access the cache.
    • "Match" means the last match in program order.
• Requires an associative search of operand addresses in the store buffer.
Baer p. 187
Load Forwarding
• If these conditions are true:
  – the load matches a store buffer entry, AND
  – the result is available for the entry (the entry is in RE or CO state),
• then the result can be sent to the register specified by the load.
• If the match is with an entry in AD state:
  – the load waits for the entry to reach RE state.
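The bypassing and forwarding rules above can be sketched as one check against the store buffer; the entry layout and return values are illustrative assumptions:

```python
# Sketch of the load-bypassing / load-forwarding check against the store
# buffer. Entries are the preceding stores in program order, oldest
# first; states follow the slides (AD = address known, RE = result and
# address known, CO = committed). An entry with addr None has not yet
# computed its address.

def check_load(load_addr, store_buffer):
    """Return ('bypass',), ('forward', value), or ('wait',)."""
    decision = ("bypass",)           # no match so far: load may proceed
    for entry in store_buffer:
        if entry["state"] == "AV":
            continue                 # free slot
        if entry["addr"] is None:
            return ("wait",)         # an unknown address blocks the load
        if entry["addr"] == load_addr:
            if entry["state"] in ("RE", "CO"):
                decision = ("forward", entry["value"])  # last match wins
            else:                    # AD: matching store's data not ready
                decision = ("wait",)
    return decision

buf = [{"state": "RE", "addr": 0x100, "value": 42},
       {"state": "AD", "addr": 0x200, "value": None}]
print(check_load(0x100, buf))  # ('forward', 42)
print(check_load(0x200, buf))  # ('wait',)
print(check_load(0x300, buf))  # ('bypass',)
```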
Load Speculation in Out-of-Order Architectures
Dynamic Memory Disambiguation Problem: loads are issued speculatively ahead of preceding stores in program order. How do we ensure that data dependences are not violated?
Three approaches:
• Pessimistic: wait until it is certain that the load can proceed (as in load bypassing and forwarding).
• Optimistic: the load always proceeds speculatively; a recovery mechanism is needed.
• Dependence prediction: use a predictor to decide whether to speculate, trying to have fewer recoveries.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 188
(true dependency between i2 and i3)
Pessimistic: i3 and i4 cannot issue until i2 has computed its result:
• i2 must be at least in RE (Result).
• i4 proceeds once i1 and i2 are in AD (Address).
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Optimistic: i3 and i4 issue as soon as possible (load-buffer entries are created).
When a store reaches CO, its address is compared associatively with the load-buffer entries.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Store buffer: AD i1 memadd1; AD i2 memadd2.
Load buffer: 1 i3 memadd3; 1 i4 memadd4 (the bit indicates that the load is speculative).
i1 reaches CO: nothing happens because there is no match in the load buffer.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Store buffer: CO i1 memadd1; CO i2 memadd2.
Load buffer: 1 i3 memadd3; 1 i4 memadd4.
i2 reaches CO and matches i3's load-buffer entry:
• i3 has to be reissued.
• i4 has to be reissued because it is after i3 in program order.
• Some implementations only reissue instructions that depend on i3.
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
(true dependency between i2 and i3)
Dependence prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.
Motivation: Optimistic
Memory dependencies are rare: less than 10% of loads depend on an earlier store.
Baer p. 190
Motivation: Dependence Prediction
Load misspeculations are expensive and predictors can reduce them.
What strategy should we use forpredicting profitable speculations?
Baer p. 190
Simple Strategy
Memory dependencies are infrequent → predict that all loads can be speculated.
If a load L is misspeculated, all subsequent instances of L must wait.
We need a bit to remember. Where should this bit be stored?
Baer p. 190
Simple Strategy (cont.)
A single prediction bit P is associated with the instruction in the cache:
• When the load instruction is brought into the cache → P = 1.
• When the load is misspeculated → P = 0.
• When the line is evicted from the cache and reloaded → P = 1.
Strategy used in the DEC Alpha 21264.
Baer p. 190
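The single-bit policy above can be sketched directly (a minimal model of the Alpha 21264-style P bit; the class and method names are invented for illustration):

```python
# Sketch of the single-bit load-speculation predictor described above:
# a P bit attached to the load instruction in the instruction cache.

class CachedLoad:
    def __init__(self):
        self.p = 1                 # on cache fill: predict the load can speculate

    def on_misspeculation(self):
        self.p = 0                 # all later instances of this load must wait

    def on_reload(self):
        self.p = 1                 # eviction + reload resets the prediction

ld = CachedLoad()
print(ld.p)             # 1: speculate
ld.on_misspeculation()
print(ld.p)             # 0: subsequent instances wait
ld.on_reload()
print(ld.p)             # 1: optimism restored after the line is reloaded
```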
Principle Behind Load Prediction
“static store-load instruction pairs that cause most of the dynamic data mispredictions are relatively few and exhibit temporal locality.”
Moshovos A. , Breach S. E., Vijaykumar T. N., Sohi G. S.,“Dynamic Speculation and Synchronization of DataDependences,” International Symposium on ComputerArchitecture, (ISCA) 1997, Denver, CO, USA
Ideal Load Speculation
• Avoids mis-speculation.
• Allows loads to execute as early as possible.
• Loads with no true dependences → execute without delay.
• A load with a true dependence → executes as soon as the store that produces the data commits.
Moshovos ISCA97.
A Real Predictor
Moshovos ISCA97.
i. Dynamically identify store-load pairs that are likely to be data dependent.
ii. Provide a synchronization mechanism to instances of these dependences.
iii. Use this mechanism to synchronize the store and the load.
Load Predictor Table
Baer p. 190
Hash based on PC; saturating counters.
Predictor states:
• 00: strong no-speculate
• 01: weak no-speculate
• 10: weak speculate
• 11: strong speculate
Load-buffer entry fields:
• tag
• op.address: memory address of the operand
• spec.bit: speculative load?
• update.bit: should the predictor be updated at commit/abort?
Each load instruction has a loadspec bit.
Incrementing a saturating counter moves it toward strong speculate.
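The 2-bit saturating counter above can be sketched as follows (helper names are invented for illustration; the state encoding follows the slide):

```python
# Sketch of the 2-bit saturating counter used by the load predictor:
# 0 (00) strong no-speculate, 1 (01) weak no-speculate,
# 2 (10) weak speculate,      3 (11) strong speculate.

def increment(c):
    """Move toward strong speculate, saturating at 3."""
    return min(c + 1, 3)

def strong_nospeculate():
    """Reset applied on a misspeculated load."""
    return 0

def predict_speculate(c):
    """Speculate in the two upper states (10 and 11)."""
    return c >= 2

c = 1                                   # weak no-speculate
print(predict_speculate(c))             # False: do not speculate yet
c = increment(increment(c))             # two favorable outcomes later
print(c, predict_speculate(c))          # 3 True: strong speculate
c = strong_nospeculate()                # a misspeculation resets it
print(c, predict_speculate(c))          # 0 False: strong no-speculate
```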
Load/Decode Stage
• Set loadspec bit according to value of counter associated with the load PC
Baer p. 190
After the Operand Address is Computed
Uncommitted younger stores?
• No → enter in the load buffer: op.ad, spec.bit = 0, update.bit = 0; issue the cache access.
• Yes → check the loadspec bit:
  – On → enter in the load buffer: op.ad, spec.bit = 1, update.bit = 0; issue the cache access.
  – Off → enter in the load buffer: op.ad, spec.bit = 0, update.bit = 1; wait (as in the pessimistic solution).
Baer p. 190
Store Commit Stage
For all matches in the load buffer, check the spec.bit:
• On → load abort: predictor ← strong no-speculate; recover from the misspeculated load.
• Off → update.bit ← 0: it was correct to not speculate, and the load should keep not speculating in the future.
Baer p. 191
Load Commit Stage
Check the spec.bit:
• On → increment the saturating counter: speculating was correct.
• Off → check the update.bit:
  – On → increment the saturating counter: we would like to speculate in the future.
  – Off → predictor ← strong no-speculate.
Baer p. 191
Store Sets
Baer p. 191
Motivation for Store Sets
• The past is a good predictor of future memory-order violations.
• Must also predict:
  – when one load is dependent on multiple stores (e.g., store A, store B, and store C all feeding load D);
  – when multiple loads depend on one store (e.g., loads E and F depending on the same store).
Chrysos ISCA98
Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction using Store Sets,” International Symposium on Computer Architecture, 1998 pp. 142-153.
Store Set Definition
Given a load L, the store set of L is the set of all stores that L has ever depended upon.
Ideally, any time a store-load dependence is detected, the store is added to the load's store set table.
To make a prediction, the store set table of the load is searched for all uncommitted younger stores.
Chrysos ISCA98
Too expensive! We need an approximation.
Implementation of Store Sets Memory Dependence Prediction
Both loads and stores have entries in the Store Set ID Table (SSIT); each entry points into the Last Fetched Store Table (LFST), which records the most recently fetched store of each store set.
Chrysos ISCA98
Store Set Examples: multiple loads depend on one store
j: load add1
k: load add2
⋅⋅⋅
i: store add3
[Figure: the SSIT entries for i, j, and k all point to the same LFST entry.]
Baer p. 192
Store Set Examples: one load depends on multiple stores
i: store add2
j: store add3
⋅⋅⋅
k: load add1
[Figure: the SSIT entries for i, j, and k all point to the same LFST entry.]
Baer p. 192
Store Set Examples: multiple loads depend on multiple stores
i: store add2
j: store add3
⋅⋅⋅
k: load add1
⋅⋅⋅
l: load add4
[Figure: SSIT entries for i, j, k, and l and their LFST entries.]
We have a conflict between the LFST entries associated with i and l.
The winner is the entry with the smaller index in the SSIT; the loser is made to point to the winner's entry.
Baer p. 192
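The SSIT/LFST mechanism can be sketched as follows (a simplified model under stated assumptions: tables are dicts keyed by PC, store-set IDs are chosen by the smaller-index-wins merge rule from the slides; all names are invented for illustration):

```python
# Sketch of store-set prediction with an SSIT (PC -> store-set ID) and
# an LFST (store-set ID -> last fetched store of that set).

SSIT = {}   # maps instruction PC -> store-set ID (LFST index)
LFST = {}   # maps store-set ID -> PC of the last fetched store, or None

def record_violation(store_pc, load_pc):
    """A load was misspeculated on this store: put them in one store set."""
    sid_s, sid_l = SSIT.get(store_pc), SSIT.get(load_pc)
    if sid_s is None and sid_l is None:
        sid = min(store_pc, load_pc)     # new set
    elif sid_s is None:
        sid = sid_l                      # join the load's existing set
    elif sid_l is None:
        sid = sid_s                      # join the store's existing set
    else:
        sid = min(sid_s, sid_l)          # conflict: smaller index wins
    SSIT[store_pc] = SSIT[load_pc] = sid # loser now points to the winner
    LFST.setdefault(sid, None)

def fetch_store(pc):
    """Record this store as the last fetched store of its set."""
    sid = SSIT.get(pc)
    if sid is not None:
        LFST[sid] = pc

def fetch_load(pc):
    """Return the store this load must synchronize with, or None."""
    sid = SSIT.get(pc)
    return LFST.get(sid) if sid is not None else None

record_violation(store_pc=100, load_pc=200)  # a past memory-order violation
fetch_store(100)
print(fetch_load(200))   # 100: the load waits for store 100
print(fetch_load(300))   # None: an untracked load may speculate freely
```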
Evaluating Load Speculation
• Performance benefits from load speculation depend on:
  – the speculation miss rate
  – the cost of misspeculation recovery
Baer p. 194
Evaluating Load Speculation - Terminology
Conflicting load: at the time the load is ready to issue there is a previous store in the instruction window whose operand address is unknown.
Colliding load: the load is dependent on one of the stores with which it conflicts.
Baer p. 194
Evaluating Load Speculation – Typical measurements
• In a 32-entry load-store window:
  – 25% of loads are non-conflicting;
  – of the 75% conflicting loads, only 10% actually collide.
• In larger windows, the percentage of:
  – non-conflicting loads increases;
  – colliding loads decreases.
Baer p. 194
Back-End Optimizations
• Branch prediction
  – "a must"
• Load speculation (loads bypassing stores)
  – "important" because other instructions depend on the load
• Prediction of load latency
  – "common", to hide load latency in the cache hierarchy
Baer p. 195
Other Back-End Optimizations
• Value Prediction
  – Predict the value that an instruction will compute.
  – May be restricted to the values loaded by loads.
• Critical Instructions
  – Predict which instructions are on the critical path.
Baer p. 196-201
Clustered Microarchitectures
Baer p. 201
Back-end Limitations to m
Large windows: a large m requires large windows, which are expensive in hardware and power dissipation.
Many functional units: many (long) buses, which affect forwarding.
Centralized resources (e.g., the register file): large resources, many ports.
Baer p. 201
Definition of a Cluster
• A cluster is formed by:– A set of functional units– A register file– An instruction window (or reservation stations)
Baer p. 201
Clustered Microarchitecture
Baer p. 202
Register File Replication
• A copy of the register file in each cluster
  – Small number of clusters
  – Can use a crossbar switch for interconnection
  – Example (Alpha 21264):
    • the integer unit is two clusters;
    • each cluster has a full copy of the 80 registers
Baer p. 202
Changes because of Clustering
• Front end
  – steer instructions to the window of a cluster
    • static: compile-time decision
    • dynamic: by hardware at runtime
• Back end
  – copy results into the registers of other clusters
  – intercluster latency affects wakeup and select
Baer p. 202
Effect of Clustering in Performance
• Latency to forward results between clusters
• Sensitive to load balancing between clusters
• Conflicting goals:
  – keep producers and consumers of data in the same cluster
  – balance the workload
Baer p. 202
Distributed Register Files
• Steering affects renaming.
  – Assume that an instruction a is assigned to cluster ci.
    • A free register from ci will be used for the result of a.
  – If an operand of a is produced by an instruction b in a cluster cj, what needs to be done?
    1. Another free register of ci is assigned to this operand.
    2. A copy instruction is inserted in cj immediately after b.
    3. The copy is kept in ci for use by other instructions.
Baer p. 203
Clustered microarchitectures can be seen as a step in the evolution from monolithic processors to multiprocessors.
Chapter Summary: the back end is important for performance
– Tomasulo Algorithm
– Centralized/Distributed/Hybrid windows
– Wakeup/Select steps
– Scheduling: critical instructions first
– Loads:
  • bypassing stores
  • forwarding values
  • speculating on the absence of dependences with stores
– Clustering to reduce wiring complexity