dynamic scheduling ii admin - cs.duke.edu filecompsci 220 / ece 252 lecture notes dynamic scheduling...

18
COMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II 1 © 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti Dynamic Scheduling II • so far: static scheduling • dynamic scheduling (out-of-order execution) • Scoreboard • Tomasulo’s algorithm • register renaming: removing artificial dependences (WAR/WAW) • now: out-of-order execution + precise state • advanced topic: dynamic load scheduling • PentiumII vs. Pentium4 • limits of ILP COMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II 2 © 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti Admin HW #2 Due HW #3 Assigned Projects: Decide by Monday, talk with me if needed. Reading • H+P chapter 3 • Sohi and Vajapeyam: “Instruction Issue Logic” • Yeager: “The MIPS R10000 Superscalar Microprocessor” • Complexity-Effective Superscalar • Continual Flow Pipelines • DIVA • Power COMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II 3 © 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti Superscalar + Out-of-Order + Speculation superscalar + out-of-order + speculation • three concepts that work well (best?) when used together • CPI >= 1? • overcome with superscalar • superscalar increases hazards? • overcome with dynamic scheduling • RAW dependences still a problem? • overcome with a large instruction window • branches a problem for filling large window? • overcome with speculation COMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II 4 © 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti Speculation and Precise Interrupts Q: why are we discussing these together? • sequential (von Neumann) semantics for interrupts • all instructions before interrupt should be complete • all instructions after interrupt should look as if never started (abort) • basically, we also want the same thing for a mis-predicted branch • what makes precise interrupts hard? • out-of-order completion must undo post-interrupt writebacks • in-order no post-branch writebacks before branch completes • out-of-order can happen A: with out-of-order, precise interrupts and mis-speculation recovery are same problem same solution

Upload: truongque

Post on 01-Aug-2019

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

1© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Dynamic Scheduling II• so far: static scheduling• dynamic scheduling (out-of-order execution)

• Scoreboard• Tomasulo’s algorithm• register renaming: removing artificial dependences (WAR/WAW)

• now: out-of-order execution + precise state• advanced topic: dynamic load scheduling • PentiumII vs. Pentium4• limits of ILP

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

2© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

AdminHW #2 Due

HW #3 Assigned

Projects: Decide by Monday, talk with me if needed.

Reading• H+P chapter 3• Sohi and Vajapeyam: “Instruction Issue Logic”• Yeager: “The MIPS R10000 Superscalar Microprocessor”• Complexity-Effective Superscalar• Continual Flow Pipelines• DIVA• Power

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

3© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Superscalar + Out-of-Order + Speculationsuperscalar + out-of-order + speculation

• three concepts that work well (best?) when used together• CPI >= 1?

• overcome with superscalar

• superscalar increases hazards? • overcome with dynamic scheduling

• RAW dependences still a problem?• overcome with a large instruction window

• branches a problem for filling large window?• overcome with speculation

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

4© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Speculation and Precise InterruptsQ: why are we discussing these together?

• sequential (von Neumann) semantics for interrupts• all instructions before interrupt should be complete• all instructions after interrupt should look as if never started (abort)• basically, we also want the same thing for a mis-predicted branch

• what makes precise interrupts hard? • out-of-order completion ⇒ must undo post-interrupt writebacks• in-order ⇒ no post-branch writebacks before branch completes• out-of-order ⇒ can happen

A: with out-of-order, precise interrupts and mis-speculation recovery are same problem ⇒ same solution

Page 2: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

5© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Solution: Precise State• speculative execution requirements

• ability to abort & restart at every branch

• precise synchronous interrupt requirements• ability to abort & restart at every load, store, FP divide, ??

• precise asynchronous interrupt requirements• ability to abort & restart at every ??

• just bite the bullet • implement ability to abort & restart at every instruction• called “precise state”

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

6© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Ways to Implement Precise State• imprecise state: ignore the problem!

– makes page faults (any restartable exceptions) hard– makes speculative execution almost impossible• IEEE standard strongly suggests precise state

• force in-order completion (WB): stall pipe if necessary– slow

• precise state in software– even slower - would require a trap for every misprediction

• precise state in hardware: save recovery info internally+ everything is better in hardware

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

7© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

The Problem with Precise Stateproblem is in the writeback stage (WB)

• mixes two things together that should be separate• (1) broadcasts values to RS, forwards to other instructions

• OK for this to be out-of-order

• (2) writes values to registers• would like this to be in-order

solution to every functionality problem? add a level of indirection

• have already seen this for out-of-order execution• split ID into in-order DS and out-of-order IS• separate using instruction buffer (scoreboard, reservation stations)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

8© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Re-Order Buffer (ROB)

instruction buffer ⇒ re-order buffer (ROB)• buffers completed results en route to register file and D$!!!

• may be combined with RS or separate (combined in the picture)

• split writeback (WB) into two stages: Complete and Retire

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

Page 3: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

9© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Complete and Retire

• CM (complete)• completed values write results to ROB out-of-order• out-of-order stage ⇒ hazards result in waits

• RT (retire, but sometimes called “commit” or “graduate”)• ROB writes results to register file in-order• in-order stage ⇒ hazards result in stalls

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

10© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Memory Ordering Buffer (MOB)ROB makes register writes in-order, but what about stores?

• same as before (i.e., to D$ in MEM stage)?• bad idea: imprecise memory worse than imprecise registers• must do same trick for stores

Memory Ordering Buffer (MOB)• a.k.a. store buffer, store queue, load/store queue (LSQ)• completed (but not retired) stores write to MOB• on RT, head of MOB written to D$• loads look at MOB and D$ in parallel

• forward from MOB if matching store (i.e. to same address)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

11© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

ROB+MOB

modulo some gross simplifications, this picture is almost realistic!!

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

D$MOBstores

loads/storesstores loads

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

12© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Tomasulo+ROB• add ROB to Tomasulo’s algorithm

• combined ROB and RS are called RUU (or Sohi’s method)• RUU = register update unit

• separate ROB and RS are called P6-style (Intel P6 = Pentium Pro)

• our example: Simple-P6• separate ROB and RS• same RS organization as before: 1 ALU, 1 load, 1 store, 2 3-cycle FP

Page 4: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

13© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6-style Organization

• instruction fields and ready bits• tags• values

value

V1 V2

T+

T1 T2T

R

==

FUT

taildispatch

retirehead

value

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

14© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Data Structures• RS are the same as before• ROB

• head, tail: to keep sequential order• R: output register of instruction, V: output value of instruction

• tags are different• was: RS# → now: ROB#

• register status table is different• T+: tag + “ready-in-ROB” bit• tag == 0 ⇒ result ready in register file• tag != 0 ⇒ result not ready• tag != 0 + ⇒ result ready in ROB

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

15© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Data Structures

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No5 FP2 No

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0f2f4r1

CDBV T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

16© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Pipelinenew pipeline structure: IF, DS, IS, EX, CM, RT

• DS (dispatch)• (RS/ROB/MOB full) ? (stall) : • {allocate RS/ROB/MOB entries, set RS tag to ROB#, set register status entry to ROB# with “ready-in-ROB” bit off, read ready registers into RS}

• EX (execute)• free RS entry• used to be done at WB• can be earlier now because RS# are not tags

Page 5: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

17© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Pipeline• CM (complete)

• (CDB not available) ? (wait) :• {write value into ROB entry indicated by RS tag, mark ROB entry complete, mark register status entry “ready-in-ROB” bit (+)}

• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {write ROB head result to register file, if store, then write MOB head to D$, handle any exceptions, free ROB/MOB entries}

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

18© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Dispatch (DS) part I

• stall if RS or ROB or MOB is full• allocate RS+ROB entries (assign ROB# to RS output tag)• set register status entry to ROB# and “ready-in-ROB” bit to 0

value

V1 V2

T+

T1 T2T

valueR

==

FUT

taildispatch

retirehead

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

19© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Dispatch (DS) part II

• read tags for register inputs from register status table• if tag==0: copy value from RF (not shown)• if tag!=0: copy tag to RS • if tag!=0 +: copy value from ROB

value

V1 V2

T+

T1 T2T

valueR

==

FUT

taildispatch

retirehead

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

20© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Complete (CM)

• wait for CDB• broadcast <result,tag> on CDB• write result into ROB, set reg. status “ready-in-ROB” bit (+)• match tags, write CDB.V into RS of dependent instructions

value

V1 V2

T+

T1 T2T

valueR

==

FUT

tailfetch

retirehead

RSROB

reg status

fetch

RF

CD

B.V

CD

B.T

Page 6: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

21© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Retire (RT)

• stall until instruction at ROB head has completed• write ROB head result to reg-file (D$ if store), clear reg. status entry• free ROB entry

value

V1 V2

T+

T1 T2T

R

==

FUT

tailfetch

retirehead

value

RSROB

reg status

fetch

RF

CD

B.V

CD

B.T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

22© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 No5 FP2 No

P6 Example: Cycle 1

hdtl

ROB + MOB# instruction R V addr IS EX CM

ht 1 ldf f0,X(r1) f0 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1f2f4r1

allocate RS

CDBV T

set reg. status

before ROB,this was RS#2

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

23© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 2

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No

allocate RS,

Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1

CDBV T

set reg. status

allocate ROB,

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 &X[0] c2t 2 mulf f4,f0,f2 f4

3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

24© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 3

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No

allocatefree

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 &X[0] c2 c32 mulf f4,f0,f2 f4

t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1

CDBV T

Page 7: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

25© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 4

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 CDB.V REG[f2] ROB#15 FP2 No

allocate

ldf finished1. write result to ROB

f0 ready

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 c43 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 r15 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1+f2f4 ROB#2r1 ROB#4

CDBV T

[f0] ROB#1

2. CDB broadcast

grab from CDB

3. set ready-in-ROB bit

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

26© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 5

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 No

retire, write ROBresult into regfile

Reg. Statusreg T+f0 ROB#5f2f4 ROB#2r1 ROB#4

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5

t 5 ldf f0,X(r1) f06 mulf f4,f0,f27 stf f4,Z(r1)

allocate

free

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

27© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 6

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5 allocate

free

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5 c65 ldf f0,X(r1) f0

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

28© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 7

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 CDB.V ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5

stall DS, no free store RS

r1 ready

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+

CDBV T

[r1] ROB#4

write result into ROB, CDB

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

grab from CDB

add finished

Page 8: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

29© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 8

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 CDB.V REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5

stall DS, no free store RS

stall RT

free

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+

CDBV T

[f4] ROB#2

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c84 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

f4 readygrab from CDB

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

30© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 9

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5

stall RT

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

[f0] ROB#5

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9

t 7 stf f4,Z(r1) &Z[1]

f0 readygrab from CDB

free (ROB#3)allocate (ROB#7)

read from ROBnot reg. file (+)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

31© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 10

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No

stall RT

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c9 c104 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10

t 7 stf f4,Z(r1)

free

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

32© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 11

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c8 c9 c10

h 4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10

t 7 stf f4,Z(r1)retire stf

Page 9: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

33© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Precise State with ROBpoint of ROB is maintaining precise state

• how does that work?• easy as 1,2,3

• 1: wait until precise condition reaches retire (RT) stage• 2: clear contents of ROB, RS & register status table• 3: start over

• works because zeros signify the right state for start-over• 0 in ROB/RS ⇒ entry is empty• tag==0 in register status table ⇒ register is written

• and because register file and D$ writes take place at RT• let’s look at example page fault at stf ...

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

34© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example: Cycle 9

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

[f0] ROB#5

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9

t 7 stf f4,Z(r1)PAGE FAULT

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

35© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example:Cycle 10

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No5 FP2 No

faulting instruction at

Reg. Statusreg Tf0f2f4r1

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) ROB head?

CLEAR EVERYTHING

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

36© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example: Cycle 11

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No5 FP2 No

Reg. Statusreg Tf0f2f4r1

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

ht 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) START OVER

Page 10: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

37© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example: Cycle 12

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No5 FP2 No

Reg. Statusreg Tf0f2f4r1 ROB#4

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c12t 4 add r1,r1,#8 r1

5 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

38© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Performancein other words: what is the cost of precise interrupts?

+ in general: same performance as plain Tomasulo+ maybe a little better (RS freed earlier -> fewer RS structural hazards)

– unless ROB is very small– in which case ROB structural hazards become a problem

• rules of thumb for ROB size• >= superscalar width (N) * number of stages between DS and RT• at least N * thit-L2• what’s the rationale behind these rules?

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

39© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Organization Summary• popular design for a while

+ easy (relatively) to implement correctly• to recover, just zero out everything• examples: PentiumPro, PowerPC, AMD K6

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

40© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Problems with P6

– problems for high performance implementations• too much value movement (regfile/ROB→RS→ROB→regfile) • too many muxes/buses (values from too many places)• RS mixes values with control tags (long data paths make clock slow)

value

V1 V2

T+

T1 T2T

valueR

==

FUT

Page 11: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

41© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Alternative Implementation: MIPS R10K

• separate control (ROB/RS) from data (registers/FU)• one big physical register file (PRF) holds all data → no copies• ROB and RS used only for control and tags → small• register file close to FUs, everything else is on the side

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

42© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Register Renaming• no architectural register file!• physical register file holds all values

• #physical registers = #architectural registers + #ROB• map architectural registers to physical registers• removes WAW&WAR hazards (physical registers replace RS copies)

• register status table replaced by register map table• mappings cannot be 0 (there is no architectural register file)

• free list keeps track of unallocated physical registers• ROB responsible for returning physical registers to free list

• conceptually: true register renaming• have seen an example a few lectures ago ... here it is again ...

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

43© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Register Renaming Example• names: r1,r2,r3, locations: l1,l2,l3,l4,l5,l6,l7• original mapping: r1→l1, r2→l2, r3→l3 (l4-l7 “free”)

map tableraw instruction r1 r2 r3 free locations renamed instruction

l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

44© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Freeing Registers in R10Kissue: freeing physical registers

• P6• no need to free speculative storage explicitly• temporary storage comes with ROB entry• RT: copy value from ROB to register file, free ROB entry

• R10K• can’t free physical register when writing instruction retires• no architectural register to copy value to• but...• we can free phys register previously mapped to same logical register• why? all instructions that will ever read that value have retired

Page 12: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

45© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Freeing Registers in R10K

• when add commits: free l1• when sub commits: free l3• when mul commits: free ?• when div commits: free ?• see the pattern?

map tableraw instruction r1 r2 r3 free locations renamed instruction

l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

46© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Pipelinesame pipeline structure: IF,DS,IS,EX,CM,RT

• DS (dispatch)• (RS or ROB or MOB full or no physical registers) ? (stall) : • (allocate RS and ROB entries AND physical register)

• IS (issue)• (read physical registers)

• CM (completion)• (writeback destination register, mark ROB entry complete)

• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {if store then write MOB head to D$, handle any exceptions, free ROB/MOB entries, free previous physical register}

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

47© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Dispatch (DS)

• stall if no free RS, ROB, MOB or physical register (preg)• allocate RS and ROB entry• read physical register tags for input registers, store in RS• allocate new preg for output, set in RS, ROB and map table

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

48© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Complete (CM)

• wait for free CDB• broadcast tag on CDB.T• set instruction’s output register ready bit in map table• set ready bits for matching input tags in RS

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

Page 13: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

49© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Retire (RT)

• stall until instruction at ROB head is complete• return Told of ROB head to free list• free ROB head entry

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

50© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Data Structures

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#1+f2 PR#2+f4 PR#3+r1 PR#4+

Free ListPR#5,PR#6, PR#7, PR#8

Told for freeing oldphysical register

always T mappings, evenwitn no in-flight instructions

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

only tags in RS and CDBno values, so need ready bits

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

51© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 1

RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#3+r1 PR#4+

Free ListPR#5, PR#6, PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

ht 1 ldf f0,X(r1) PR#5 PR#1 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate new physical register to f0remember old physical registermapped to f0

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

52© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 2

RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+

Free ListPR#6,

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2t 2 mulf f4,f0,f2 PR#6 PR#3

3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate new physical register to f0remember old physical registermapped to f4

Page 14: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

53© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 3

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+

Free List

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c32 mulf f4,f0,f2 PR#6 PR#3

t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate RSfree RS

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

54© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 4

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5+ PR#2+5 FP2 No

Map Tablereg T+f0 PR#5+f2 PR#2+f4 PR#6r1 PR#7

Free List

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c42 mulf f4,f0,f2 PR#6 PR#3 c43 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 PR#7 PR#45 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

PR#5

match PR#5 tag from CDB & issue

allocate RS

ldf finishedset ready bit, CDB broadcast

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

55© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 5

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1 PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5

t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

free RS

allocate RS

ldf retiresreturn PR#1 to free list,no one will ever read it again

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

56© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Precise Stateproblem with R10K way? precise state is more difficult

– registers already written• but that’s OK• why? because there is no architectural register file• we can “free” written registers and restore old ones

• we need to restore register map table to the way it was• option 1: roll back ROB serially (slow)• option 2: restore from a checkpoint (fast)

Page 15: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

57© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 5

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5

t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo instructions 3–5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

58© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 6

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+3 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo ldf1. free load RS2. free T (PR#8)3. restore f0 mapping to Told (PR#5)4. free ROB entry #5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

59© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 7

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4

Free ListPR#1,PR#8

PR#7

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+t 3 stf f4,Z(r1) &Z[0]

4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo add1. free ALU RS2. free T (PR#7)3. restore r1 mapping to Told (PR#3)4. free ROB entry #4

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

60© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 8

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4

Free ListPR#1,PR#8

PR#7

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

ht 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo stf1. free store RS2. free ROB entry #33. no registers to restore/free4. how is store undone?

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No5 FP2 No

Page 16: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

61© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 vs. R10K

• R10K style is more popular today• e.g., Alpha, Pentium4

feature P6 R10Kvalue storage RF, ROB, RS PRF

register read DS: RF/ROB → RS IS: PRF → FU

register write RT: ROB → RF CM: FU → PRF

spec value free RT: automatic with ROB when overwriting instructionretires

datapaths RF/ROB→RS, RS→FU, FU→ROB, ROB→RF

PRF→FU, FU→PRF

precise state simple, zero all structures complex: serial/checkpoint

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

62© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Advanced Topic: Load Scheduling• all instructions except for loads are easy in Tomasulo

• register inputs only• register renaming captures all dependences• tags tell you exactly when you can execute

• loads not so easy• must check for older active stores with same address• register renaming doesn’t tell you that

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

63© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

The Data Memory FU• MOB+D$ = memory FU

• just like any other FU• 2 reg inputs: addr, data_in• 1 reg output: data_out

• what actually happens?address valueL/S

head

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

64© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Store/Load Dispatch• allocate MOB entry (tail)

• indicate store/load• remember MOB# in RS

address valueL/S

head

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

Page 17: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

65© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Store/Load Retire• free MOB entry (head)• load?

• done

• store?• address+value to D$/TLBaddress valueL/S

head

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

66© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Store Execute• address+value to MOB

• can be done separately• e.g., PentiumII: store→2µops

address valueL/S

head

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

67© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Load Executein parallel

• address to D$• read value

• address to MOB• compare to older store addrs

• no match ⇒ D$ value• match +value ⇒ MOB value

• multiple matches? youngest• forwarding or bypassing• same latency as D$ hit (why?)

• match –value ⇒ stall• try executing load again later

address valueL/S

head

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

68© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Memory Disambiguation Problemat load execution

• what if older store address unknown?• how to determine match?• can’t determine, have to guess• called “memory disambiguation”

• what if older load address unknown?• who cares

address L/S

==ad

dres

s

load

head

Page 18: Dynamic Scheduling II Admin - cs.duke.edu fileCOMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II © 2004 by Lebeck, Sorin, Roth, 5 Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

69© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Memory Disambiguation Alternatives• conservative: loads in-order with respect to stores

• don’t know address? assume match, wait+ pretty simple– many unnecessary waits on non-matching stores

• opportunistic: out-of-order loads• don’t know address? assume no match, go+ higher performance (most cases are not matching)– mis-speculations: went too soon? recover (complex+expensive!)

• selective: combination of conservative and opportunistic• start out opportunistic• load mis-speculation? remember PC in table, next time conservative+ pretty accurate prediction ⇒ pretty good performance

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

70© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Pentium II vs. Pentium4Pentium II Pentium4

Peak Clock 450MHz (Xeon) 2.4 (4.8 internal) GHzPipeline stages 14 22

Branch Prediction 512 entry local 2K entry hybridBTB 512 entries 4K entries

I$ 16KB 64KB T$ + 8KB I$D$ 16KB 8KBL2 512KB–2MB 256KB–2MB

Fetch Width 16 bytes 3 µops (16 bytes on miss)Rename/Retire Width 3 µops 3 µops

Execute Width 5 µops 7 µops (X2)Reservation Stations 20 60

ROB Size 40 128Register Renaming P6-style MIPS R10K-style

Memory Disambiguation Conservative Predictor-BasedAnything else? No Multithreading

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

71© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Limits of Instruction Level Parallelism (ILP)• no need to build bigger superscalar if there isn’t ILP

• ultimately, how much ILP is there?

• ILP study• assume perfect/infinite hardware• successively refine to more realistic hardware• e.g.: [Wall’88], [Wilson+Lam’92]

• some (surprising) results• perfect/infinite/single-cycle everything: int ILP: >50!, FP ILP: >150!!• on actual machines: int ILP: ~2!!, FP ILP: ~3!• culprits? branch prediction, memory latency, finite ROB, issue width• read on your own (H+P, 3.8, 3.9)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

72© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Summary• dynamic scheduling + precise state + speculation

• re-order buffer (ROB)• WB ⇒ CM (out-of-order) + RT (in-order)• P6 vs. R10K• how is recovery done?

• load scheduling using • MOB+D$, bypassing values from MOB• memory disambiguation

• ILP limits• read on your own

next up: static (compiler) exploitation of ILP