sections 3.2 and 3.3 dynamic scheduling – tomasulo’s algorithm

53
Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳吳吳 吳吳吳吳吳吳吳吳吳吳 October 2004 EEF011 Computer Architecture 計計計計計

Upload: dante

Post on 24-Jan-2016

73 views

Category:

Documents


3 download

DESCRIPTION

EEF011 Computer Architecture 計算機結構. Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm. 吳俊興 高雄大學資訊工程學系 October 2004. A Dynamic Algorithm: Tomasulo’s Algorithm. For IBM 360/91 (before caches!) – 3 years after CDC Goal: High Performance without special compilers - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Sections 3.2 and 3.3Dynamic Scheduling –Tomasulo’s Algorithm

吳俊興高雄大學資訊工程學系

October 2004

EEF011 Computer Architecture計算機結構

Page 2: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

A Dynamic Algorithm: Tomasulo’s Algorithm

• For IBM 360/91 (before caches!) – 3 years after CDC

• Goal: High Performance without special compilers• Small number of floating point registers (4 in 360)

prevented interesting compiler scheduling of operations

– This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!

• Why Study 1966 Computer? • The descendants of this have flourished!

– Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

Page 3: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Example to eleminate WAR and WAW by register renaming

• OriginalDIV.D F0, F2, F4

ADD.D F6, F0, F8

S.D F6, 0(R1)

SUB.D F8, F10, F14

MUL.D F6, F10, F8

WAR between ADD.D and SUB.D, WAW between ADD.D and MUL.D

(Due to that DIV.D needs to take much longer cycles to get F0)

• Register renamingDIV.D F0, F2, F4

ADD.D S, F0, F8

S.D S, 0(R1)

SUB.D T, F10, F14

MUL.D F6, F10, T

Page 4: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Algorithm

• Register renaming provided– by reservation stations, which buffer the operands of

instructions waiting to issue– by the issue logic

• Basic idea:– a reservation station fetches and buffers an operand

as soon as it is available, eliminating the need to get the operand from a register (WAR)

– pending instructions designate the reservation station that will provide their input (RAW)

– when successive writes to a register overlap in execution, only the last one is actually used to update the register (WAW)

As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming

• more reservation stations than real registers

Page 5: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Properties of Tomasulo Algorithm

1. Control & buffers distributed with Function Units (FU)– Hazard detection and execution control are distributed– FU buffers called “reservation stations”; have pending

operands– Registers in instructions replaced by values or pointers to

reservation stations(RS)» form of register renaming to avoids WAR, WAW hazards

2. Bypassing: Results passed directly to FU from RS, not through registers, over Common Data Bus– that broadcasts results to all FUs, so allows all units

waiting for an operand to be loaded simultaneously

• Load and Stores treated as FUs with RSs as well• Integer instructions can go past branches,

allowing FP ops beyond basic block in FP queue

Page 6: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Figure 3.2 Basic structure of a MIPS floating-point unit using Tomasulo’s

algorithmLoad buffers:1. hold components of the effected addr2. track outstanding loads that are waiting

on the memory3. hold the results of completed loads that

are waiting for the CDB

Store buffers:1. hold components of the

effected addr2. hold the destination

memory addresses of outstanding stores that are waiting for the data value to store

3. hold the addr and value to store until the memory unit is available

Page 7: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Three Stages of Tomasulo Algorithm1. Issue—get instruction from the head of the instruction queue

If reservation station free (no structural hazard), control issues instr with the operand values (renames registers).

– No free RS => there is a structural hazard– If the operands are not in the registers, keep track of FU

» This step renames registers, eliminating WAR and WAW hazards

2. Execute—operate on operands (EX) When both operands ready (placed into RS), then execute;

if not ready, monitor Common Data Bus for result– By delaying EX until the operands are available, RAW hazards are

avoided

3. Write result—finish execution (WB) Write on Common Data Bus to the registers and the RS of all awaiting

units; mark reservation station available

• Normal data bus: data + destination (“go to” bus)• Common data bus: data + source (“come from” bus)

– 64 bits of data + 4 bits of Functional Unit source address– Write if matches expected Functional Unit (produces result)– Does the broadcast

Page 8: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

7 Components of Reservation Station Op: Operation to perform in the unit (e.g., + or –)Qj, Qk: Reservation stations producing the corresponding source operand

– Note: Qj,Qk=0 => ready or unnessary– Store buffers only have Qi for RS producing result

Vj, Vk: Value of Source operands– Only one of V field or the Q field is valid– Store buffers has V field, result to be stored

A: used to hold information for the memory address calculation for a load or a storeBusy: Indicates reservation station or FU is busy

Register result status Qi—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

Page 9: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo ExampleInstruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F300 FU

Clock cycle counter

FU countdown

Instruction stream

3 Load/Buffers

3 FP Adder R.S.2 FP Mult R.S.

Example speed: 3 clocks for FP +,-; 10 for * ; 40 clks for /

Page 10: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 1Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1

Page 11: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 2Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1

Note: Can have multiple loads outstanding

Page 12: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 3Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1

• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued

• Load1 completing; what is waiting for Load1?

Page 13: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 4Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?

Page 14: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 5Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts down for Add1, Mult1

Page 15: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 6Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?

Page 16: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 7Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?

Page 17: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 8Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(A2) Add2 (M-M) Mult2

Page 18: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 9Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No1 Add2 Yes ADDD (M-M) M(A2)

Add3 No6 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 M(A2) Add2 (M-M) Mult2

Page 19: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 10Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No0 Add2 Yes ADDD (M-M) M(A2)

Add3 No5 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?

Page 20: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 11Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?• All quick instructions complete in this cycle!

Page 21: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 12Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

3 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 22: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 13Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

2 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 23: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 14Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

1 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 24: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 15Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it?

Page 25: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 16Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete

Page 26: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 55Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

Page 27: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 56Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it?

Page 28: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Example Cycle 57Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Page 29: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Drawbacks

• Complexity– delays of 360/91, MIPS 10000, Alpha 21264,

IBM PPC 620 in CA:AQA 2/e, but not in silicon!

• Many associative stores (CDB) at high speed

• Performance limited by Common Data Bus– Each CDB must go to multiple functional units

high capacitance, high wiring density– Number of functional units that can complete per

cycle limited to one!» Multiple CDBs more FU logic for parallel assoc

stores

• Non-precise interrupts!– We will address this later

Page 30: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo Loop Example

Loop: LD F0 0 R1

MULTD F4 F0 F2

SD F4 0 R1

SUBI R1 R1 #8

BNEZ R1 Loop

• This time assume Multiply takes 4 clocks• Assume 1st load takes 8 clocks

(L1 cache miss), 2nd load takes 1 clock (hit)• To be clear, will show clocks for SUBI, BNEZ

– Reality: integer instructions ahead of Fl. Pt. Instructions

• Show 2 iterations

Page 31: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop ExampleInstruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F300 80 Fu

Added Store Buffers

Value of Register used for address, iteration control

Instruction Loop

Iter-ationCount

Page 32: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 1Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 80

Load2 NoLoad3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Fu Load1

Page 33: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 2Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No

Load3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Fu Load1 Mult1

Page 34: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 3Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F303 80 Fu Load1 Mult1

• Implicit renaming sets up data flow graph

Page 35: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 4Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F304 80 Fu Load1 Mult1

• Dispatching SUBI Instruction (not in FP queue)

Page 36: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 5Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F305 72 Fu Load1 Mult1

• And, BNEZ instruction (not in FP queue)

Page 37: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 6Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1

Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F306 72 Fu Load2 Mult1

• Notice that F0 never sees Load from location 80

Page 38: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 7Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No

Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F307 72 Fu Load2 Mult2

• Register file completely detached from computation• First and Second iteration completely overlapped

Page 39: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 8Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F308 72 Fu Load2 Mult2

Page 40: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 9Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F309 72 Fu Load2 Mult2

• Load1 completing: who is waiting?• Note: Dispatching SUBI

Page 41: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 10Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3010 64 Fu Load2 Mult2

• Load2 completing: who is waiting?• Note: Dispatching BNEZ

Page 42: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 11Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3011 64 Fu Load3 Mult2

• Next load in sequence

Page 43: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 12Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3012 64 Fu Load3 Mult2

• Why not issue third multiply?

Page 44: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 13Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3013 64 Fu Load3 Mult2

• Why not issue third store?

Page 45: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 14Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3014 64 Fu Load3 Mult2

• Mult1 completing. Who is waiting?

Page 46: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 15Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3015 64 Fu Load3 Mult2

• Mult2 completing. Who is waiting?

Page 47: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 16Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3016 64 Fu Load3 Mult1

Page 48: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 17Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3017 64 Fu Load3 Mult1

Page 49: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 18Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3018 64 Fu Load3 Mult1

Page 50: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 19Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3019 56 Fu Load3 Mult1

Page 51: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Loop Example Cycle 20Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 Yes 561 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3020 56 Fu Load1 Mult1

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Page 52: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Why can Tomasulo overlap iterations of loops?

• Register renaming– Multiple iterations use different physical destinations

for registers (dynamic loop unrolling).

• Reservation stations – Permit instruction issue to advance past integer control

flow operations– Also buffer old values of registers - totally avoiding the

WAR stall that we saw in the scoreboard.

• Other perspective: Tomasulo building data flow dependency graph on the fly.

Page 53: Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm

Tomasulo’s scheme offers 2 major advantages

(1)the distribution of the hazard detection logic– distributed reservation stations and the CDB– If multiple instructions waiting on single result,

& each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB

– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards of scoreboard