dynamic instruction scheduling

Dynamic Instruction Scheduling PD-MIRI Ramon Canal

Post on 08-Feb-2022



Dynamic vs. Static Scheduling

• Data hazards in a program cause a processor to stall.

• With static scheduling, the compiler tries to reorder these instructions at compile time to reduce pipeline stalls.
– Uses less hardware
– Can use more powerful algorithms

• With dynamic scheduling, the hardware tries to rearrange the instructions at run time to reduce pipeline stalls.
– Simpler compiler
– Handles dependencies not known at compile time
– Allows code compiled for a different machine to run efficiently

Out-of-Order Execution

• In our previous model, all instructions executed in the order in which they appear.
• This can lead to unnecessary stalls:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

• SUBD stalls waiting for the ADDD to go first, even though SUBD does not have a data dependency.
• With out-of-order execution, the SUBD is allowed to execute before the ADDD.
– This can lead to out-of-order completion, which can cause WAW and WAR hazards.
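The three hazard classes named above can be checked mechanically from the register operands alone. The following is a minimal sketch, not part of the original slides; the `Instr` tuple and the `hazards` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads a register that earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites a register that earlier still reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


# The three-instruction sequence from the slide.
divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F12", ("F8", "F14"))

print(hazards(divd, addd))  # {'RAW'}: ADDD needs F0 from DIVD
print(hazards(divd, subd))  # set(): SUBD is independent of DIVD
print(hazards(addd, subd))  # set(): SUBD is independent of ADDD
```

Because SUBD has no hazard with either earlier instruction, it is safe to let it run ahead of the stalled ADDD.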

Scoreboarding

• The scoreboard implements a centralized control scheme that
– Detects all resource and data hazards
– Allows instructions to execute out of order when there are no resource hazards or data dependencies

• First implemented in 1964 by the CDC 6600, which had 18 separate functional units
– 4 FP units (2 multiply, 1 add, 1 divide)
– 7 memory units (5 loads, 2 stores)
– 7 integer units (add, shift, logical, compare, etc.)

• Our dynamic pipeline (much simpler)
– 2 FP multiply (10 EX cycles)
– 1 FP add (2 EX cycles)
– 1 FP divide (40 EX cycles)
– 1 integer unit (1 EX cycle)

Out-of-Order Execution

• Out-of-order execution divides the DR stage into:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands

• Scoreboards allow an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions

• CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)

Scoreboard Implications

• Out-of-order completion can lead to WAR and WAW hazards?

• Solution for WAW
– Detect WAW hazard before reading operands
– Stall write until other instruction completes

• Solution for WAR
– Detect WAR hazards before writing back to the register file and stall the write back

• This scoreboard does not take advantage of forwarding (i.e., bypasses), since it waits until both results are written back to the register file

• Scoreboard replaces DR, EX, WB with 4 stages

Four Stages of Scoreboard Control

• Decode+Issue (Issue)
– decode instructions
– check for structural and WAW hazards
– stall until structural and WAW hazards are resolved

• Read operands (Read)
– wait until no RAW hazards
– then read operands

• Execution (EX)
– operate on operands
– may take multiple cycles; notify scoreboard when done

• Write result (WB)
– finish execution
– stall if WAR hazard

Three Parts of the Scoreboard

1. Instruction status: which of 4 steps the instruction is in: Issue, Read, EX, or WB.

2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or –)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating when Fj, Fk are ready

3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
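The bookkeeping described above can be sketched as a small data structure with one check per pipeline stage. This is an illustrative sketch only: the class and method names are hypothetical, and it assumes that Rj/Rk are cleared once the operands have been read, so that the WAR test at write-back only sees operands that are still unread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # units producing Fj, Fk (None: no pending producer)
    qk: Optional[str] = None
    rj: bool = False          # operand ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in unit_names}
        self.result: Dict[str, str] = {}  # register result status: reg -> unit

    def can_issue(self, unit, dest):
        # Issue: the unit must be free (structural) and nobody may be about to write dest (WAW).
        return not self.fu[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src_j, src_k):
        qj, qk = self.result.get(src_j), self.result.get(src_k)
        self.fu[unit] = FUStatus(True, op, dest, src_j, src_k, qj, qk, qj is None, qk is None)
        self.result[dest] = unit

    def can_read(self, unit):
        # Read operands: both sources must be ready (no RAW hazard).
        f = self.fu[unit]
        return f.rj and f.rk

    def read_operands(self, unit):
        f = self.fu[unit]
        f.rj = f.rk = False  # assumption: flags are cleared once operands are consumed

    def can_write(self, unit):
        # Write result: stall while another unit still has to read our destination (WAR).
        fi = self.fu[unit].fi
        return all(
            not ((f.fj == fi and f.rj) or (f.fk == fi and f.rk))
            for name, f in self.fu.items()
            if name != unit
        )

    def write_result(self, unit):
        for f in self.fu.values():  # wake up units waiting on this result
            if f.qj == unit:
                f.qj, f.rj = None, True
            if f.qk == unit:
                f.qk, f.rk = None, True
        del self.result[self.fu[unit].fi]
        self.fu[unit] = FUStatus()


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Mult1", "Mult", "F0", "F2", "F4")
sb.issue("Divide", "Div", "F10", "F0", "F6")  # F0 is still pending from Mult1
sb.issue("Add", "Add", "F6", "F8", "F2")
print(sb.can_read("Divide"))  # False: RAW hazard on F0
print(sb.can_write("Add"))    # False: WAR hazard, Divide has not read F6 yet
sb.read_operands("Mult1")
sb.write_result("Mult1")
sb.read_operands("Divide")
print(sb.can_write("Add"))    # True: Divide has now read F6
```

The usage lines replay the DIVD/ADDD interaction on F6: the add may not write back until the divide has read its operands.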

Scoreboarding Example Cycle 0

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
0      FU

Scoreboarding Example Cycle 1

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
1      FU                              Integer

Scoreboarding Example Cycle 2

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
2      FU                              Integer

Issue 2nd Load?

Scoreboarding Example Cycle 3

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
3      FU                              Integer

Issue 2nd Load?

Scoreboarding Example Cycle 4

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
4      FU                              Integer

Scoreboarding Example Cycle 5

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
5      FU            Integer

Scoreboarding Example Cycle 6

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6
MULTD F0    F2   F4   6
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
6      FU   Mult1    Integer

Scoreboarding Example Cycle 7

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7
MULTD F0    F2   F4   6
SUBD  F8    F6   F2   7
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
7      FU   Mult1    Integer                    Add

Scoreboarding Example Cycle 8

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6
SUBD  F8    F6   F2   7
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
8      FU   Mult1    Integer                    Add      Divide

Scoreboarding Example Cycle 9

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9
SUBD  F8    F6   F2   7      9
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
10    Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
9      FU   Mult1                               Add      Divide

Scoreboarding Example Cycle 10

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 --
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
9     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
10     FU   Mult1                               Add      Divide

Scoreboarding Example Cycle 11

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
8     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
11     FU   Mult1                               Add      Divide

Scoreboarding Example Cycle 12

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
7     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
12     FU   Mult1                               Add      Divide

Scoreboarding Example Cycle 13

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
6     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
13     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 14

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
5     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
14     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 15

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 --

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
4     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
15     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 16

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
3     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
16     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 17

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
2     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
17     FU   Mult1                      Add               Divide

Write result of ADDD?

Scoreboarding Example Cycle 18

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 --
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
1     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
18     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 19

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
19     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 20

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
20     FU   Mult1                      Add               Divide

Scoreboarding Example Cycle 21

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21
ADDD  F6    F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
40    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
21     FU                              Add               Divide

Scoreboarding Example Cycle 22

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21    22 --
ADDD  F6    F8   F2   13     14    15 -- 16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
39    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
22     FU                              Add               Divide

Scoreboarding Example Cycle 23

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21    22 --
ADDD  F6    F8   F2   13     14    15 -- 16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
38    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
23     FU                                                Divide

Scoreboarding Example Cycle 61

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21    22 -- 61
ADDD  F6    F8   F2   13     14    15 -- 16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
61     FU                                                Divide

Scoreboarding Example Cycle 62

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21    22 -- 61                  62
ADDD  F6    F8   F2   13     14    15 -- 16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
62     FU                                                Divide

Scoreboarding Example Cycle 63

Instruction status
Instruction j    k    Issue  Read  Exec (start -- complete)  Write
LD    F6    34+  R2   1      2     3 -- 3                    4
LD    F2    45+  R3   5      6     7 -- 7                    8
MULTD F0    F2   F4   6      9     10 -- 19                  20
SUBD  F8    F6   F2   7      9     10 -- 11                  12
DIVD  F10   F0   F6   8      21    22 -- 61                  62
ADDD  F6    F8   F2   13     14    15 -- 16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
63     FU
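The cycle numbers in the final instruction status table can be reproduced with a short timing model. This is a simplified sketch rather than a cycle-by-cycle scoreboard: it walks the program in order and applies the four stage rules as "earliest cycle" constraints, which works because every stall in this scheme is caused by an earlier instruction. The names `LATENCY`, `UNIT_TYPE`, `UNIT_COUNT`, and `schedule` are hypothetical, and the assumption that a unit or register becomes usable in the cycle after the write is taken from the example tables.

```python
from typing import NamedTuple

# EX latencies and unit counts of the simple dynamic pipeline in this example.
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNIT_TYPE = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


class Timing(NamedTuple):
    issue: int
    read: int
    complete: int
    write: int


def schedule(program):
    """program: list of (op, dest, srcs) in program order; returns one Timing each."""
    free_at = {t: [0] * n for t, n in UNIT_COUNT.items()}  # first cycle each unit is free
    done = []  # (instruction, Timing) for all earlier instructions
    prev_issue = 0
    for op, dest, srcs in program:
        units = free_at[UNIT_TYPE[op]]
        u = min(range(len(units)), key=units.__getitem__)
        # Issue: in order, unit free (structural), no pending write to dest (WAW).
        waw = max((t.write + 1 for (_, d, _), t in done if d == dest), default=0)
        issue = max(prev_issue + 1, units[u], waw)
        # Read operands: wait until every producer has written back (RAW).
        raw = max((t.write + 1 for (_, d, _), t in done if d in srcs), default=0)
        read = max(issue + 1, raw)
        # Execute: starts the cycle after the read and lasts LATENCY cycles.
        complete = read + LATENCY[op]
        # Write result: wait until earlier readers of dest have read it (WAR).
        war = max((t.read + 1 for (_, _, s), t in done if dest in s), default=0)
        write = max(complete + 1, war)
        units[u] = write + 1
        done.append(((op, dest, srcs), Timing(issue, read, complete, write)))
        prev_issue = issue
    return [t for _, t in done]


PROGRAM = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]

for (op, dest, _), t in zip(PROGRAM, schedule(PROGRAM)):
    print(f"{op:5} {dest:3} issue={t.issue} read={t.read} complete={t.complete} write={t.write}")
```

Under these assumptions the ADDD completes execution in cycle 16 but cannot write F6 until cycle 22, one cycle after the DIVD has read F6 in cycle 21, which is the WAR stall shown in cycles 17 to 22.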

CDC 6600 Scoreboard Summary

• Speedup from scoreboard
– 1.7 for FORTRAN programs
– 2.5 for hand-coded assembly language programs
– Effects of modern compilers?

• Hardware
– Scoreboard hardware approximately the same as one FPU
– Main cost was buses (4x the normal amount)
– Could be more severe for modern processors

• Limitations
– No forwarding logic
– Limited instruction window
– Stalls for WAW hazards
– Waits for WAR hazards before WB


Tomasulo Algorithm for Dynamic Scheduling

• For IBM 360/91 in 1967, about 3 years after CDC 6600
• Goal: high performance without special compilers
• Differences between IBM 360 and CDC 6600
– IBM has only 2 register specifiers per instruction vs. 3 in CDC 6600
– IBM has register-memory instructions
– IBM has 4 FP registers vs. 8 in CDC 6600
– IBM has pipelined functional units (3 adds, 2 multiplies)

• Tomasulo algorithm is designed to handle name dependencies (WAW and WAR hazards) efficiently

SUB  F1, F2, F0
DIVF F2, F3, F2
ADDF F3, F0, F0
MULF F3, F1, F1

Tomasulo Algorithm

• Differences from Scoreboarding
– Distributed hazard detection and control (through reservation stations)
– Results are bypassed to functional units
– Common data bus (CDB) broadcasts results to all FUs
– HW renaming of registers to avoid WAR, WAW hazards
– Loads and stores treated as FUs as well
– Registers in instructions replaced by pointers to reservation station buffers

• Led to concepts used in Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …

Tomasulo Organization (FP + Load)

Reservation Station Components

Op: operation to perform in the unit (e.g., + or –)
Qj, Qk: reservation stations producing the source registers
Vj, Vk: values of the source operands
Rj, Rk: flags indicating when Vj, Vk are ready
Busy: indicates that the reservation station and its FU are busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If a reservation station is free, issue the instruction and send operands (renames registers).

2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result.

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available.
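The three stages can be sketched with a reservation-station record and a broadcast step. This is a minimal illustration under stated assumptions, not the IBM 360/91 design: the `RS` and `Tomasulo` classes are hypothetical, execution itself is left to the caller (who passes the computed value to `write_result`), and the register values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RS:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # station that will produce the operand, if pending
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs: Dict[str, RS] = {n: RS() for n in station_names}
        self.regs = dict(regs)             # architectural register values
        self.reg_tag: Dict[str, str] = {}  # register result status: reg -> station

    def _operand(self, reg):
        tag = self.reg_tag.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, name, op, dest, src_j, src_k):
        # Issue: needs a free station; operands are copied or replaced by tags.
        if self.rs[name].busy:
            return False
        vj, qj = self._operand(src_j)
        vk, qk = self._operand(src_k)
        self.rs[name] = RS(True, op, vj, vk, qj, qk)
        self.reg_tag[dest] = name  # renaming: dest now refers to this station
        return True

    def ready(self, name):
        # Execution may start once both operands have arrived.
        s = self.rs[name]
        return s.busy and s.qj is None and s.qk is None

    def write_result(self, name, value):
        # Write result: broadcast (tag, value) on the CDB to stations and registers.
        for s in self.rs.values():
            if s.qj == name:
                s.vj, s.qj = value, None
            if s.qk == name:
                s.vk, s.qk = value, None
        for reg, tag in list(self.reg_tag.items()):
            if tag == name:
                self.regs[reg] = value
                del self.reg_tag[reg]
        self.rs[name] = RS()


t = Tomasulo(
    ["Add1", "Add2", "Add3", "Mult1", "Mult2"],
    {"F0": 1.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0},
)
t.issue("Mult1", "MULTD", "F0", "F2", "F4")
t.issue("Mult2", "DIVD", "F10", "F0", "F6")  # waits for Mult1, already holds the old F6
t.issue("Add1", "ADDD", "F6", "F8", "F2")
t.write_result("Add1", 8.0 + 2.0)            # ADDD finishes first: no WAR stall
print(t.regs["F6"], t.rs["Mult2"].vk)        # 10.0 6.0
t.write_result("Mult1", 2.0 * 4.0)
print(t.ready("Mult2"), t.rs["Mult2"].vj)    # True 8.0
```

Because the divide copied the old value of F6 into its station at issue, the later add can overwrite F6 immediately, which is how renaming through reservation stations removes the WAR hazard.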

Tomasulo Example Cycle 0

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
0      FU

Tomasulo Example Cycle 1

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1
LD    F2    45+  R3
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
1      FU                              Load1

Tomasulo Example Cycle 2

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 --
LD    F2    45+  R3   2
MULTD F0    F2   F4
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
2      FU            Load2             Load1

Tomasulo Example Cycle 3

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3
LD    F2    45+  R3   2      3 --
MULTD F0    F2   F4   3
SUBD  F8    F6   F2
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  Yes   Mult       F4   Load2
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
3      FU   Mult1    Load2             Load1

Tomasulo Example Cycle 4

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4
MULTD F0    F2   F4   3
SUBD  F8    F6   F2   4
DIVD  F10   F0   F6
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   Yes   Sub   F6               Load2
0     Add2   No
      Add3   No
0     Mult1  Yes   Mult       F4   Load2
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
4      FU   Mult1    Load2                      Add1

Tomasulo Example Cycle 5

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3
SUBD  F8    F6   F2   4
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
2     Add1   Yes   Sub   F6   F2
0     Add2   No
      Add3   No
10    Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
5      FU   Mult1                               Add1     Mult2

Tomasulo Example Cycle 6

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 --
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
1     Add1   Yes   Sub   F6   F2
0     Add2   Yes   Add        F2   Add1
      Add3   No
9     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
6      FU   Mult1                      Add2     Add1     Mult2

Tomasulo Example Cycle 7

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   Yes   Sub   F6   F2
0     Add2   Yes   Add        F2   Add1
      Add3   No
8     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
7      FU   Mult1                      Add2     Add1     Mult2

Tomasulo Example Cycle 8

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
2     Add2   Yes   Add   F8   F2
      Add3   No
7     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
8      FU   Mult1                      Add2              Mult2

Tomasulo Example Cycle 9

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 --

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
1     Add2   Yes   Add   F8   F2
      Add3   No
6     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
9      FU   Mult1                      Add2              Mult2

Tomasulo Example Cycle 10

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 -- 10

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   Yes   Add   F8   F2
      Add3   No
5     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
10     FU   Mult1                      Add2              Mult2

Tomasulo Example Cycle 11

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
11     FU   Mult1                                        Mult2

Tomasulo Example Cycle 12

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 --
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
12     FU   Mult1                                        Mult2

Tomasulo Example Cycle 15

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 -- 15
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
15     FU   Mult1                                        Mult2

Tomasulo Example Cycle 16

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 -- 15             16
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   Div   F0   F6

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
16     FU                                                Mult2

Tomasulo Example Cycle 56

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 -- 15             16
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5      17 -- 56
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Div   F0   F6

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
56     FU                                                Mult2

Tomasulo Example Cycle 57

Instruction status
Instruction j    k    Issue  Execution complete  Write result
LD    F6    34+  R2   1      2 -- 3              4
LD    F2    45+  R3   2      3 -- 4              5
MULTD F0    F2   F4   3      6 -- 15             16
SUBD  F8    F6   F2   4      6 -- 7              8
DIVD  F10   F0   F6   5      17 -- 56            57
ADDD  F6    F8   F2   6      9 -- 10             11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  No

Register result status
Clock       F0       F2       F4       F6       F8       F10      F12  ...  F30
57     FU
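The issue, execution, and write-result cycles of the Tomasulo example can be reproduced with a short timing model. This is a sketch under stated assumptions, not a full simulator: it assumes a 2-cycle load, the latencies used in the tables (add 2, multiply 10, divide 40), one instruction issued per cycle, execution starting the cycle after the last operand is broadcast, and it ignores contention for the functional units and for the CDB. The names `LATENCY`, `STATION`, `STATION_COUNT`, and `schedule` are hypothetical.

```python
from typing import NamedTuple

LATENCY = {"LD": 2, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
STATION = {"LD": "Load", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Mult"}
STATION_COUNT = {"Load": 3, "Add": 3, "Mult": 2}


class Timing(NamedTuple):
    issue: int
    start: int
    complete: int
    write: int


def schedule(program):
    """program: list of (op, dest, srcs) in program order; returns one Timing each."""
    free_at = {k: [0] * n for k, n in STATION_COUNT.items()}  # first cycle each RS is free
    producer_write = {}  # register -> write cycle of its most recent producer (its tag)
    timings = []
    prev_issue = 0
    for op, dest, srcs in program:
        slots = free_at[STATION[op]]
        s = min(range(len(slots)), key=slots.__getitem__)
        # Issue: in order, only a free reservation station is required.
        issue = max(prev_issue + 1, slots[s])
        # Execute: start once every pending operand has appeared on the CDB (RAW only).
        waits = [producer_write[r] + 1 for r in srcs if r in producer_write]
        start = max([issue + 1] + waits)
        complete = start + LATENCY[op] - 1
        write = complete + 1
        slots[s] = write + 1
        producer_write[dest] = write  # later readers of dest wait for this instruction
        timings.append(Timing(issue, start, complete, write))
        prev_issue = issue
    return timings


PROGRAM = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]

for (op, dest, _), t in zip(PROGRAM, schedule(PROGRAM)):
    print(f"{op:5} {dest:3} issue={t.issue} exec={t.start}--{t.complete} write={t.write}")
```

Note that there is no WAW or WAR term in the model: the ADDD writes F6 in cycle 11 although the DIVD that reads the old F6 does not start executing until cycle 17, and the whole sequence finishes in cycle 57 instead of cycle 62 with the scoreboard.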

Tomasulo Summary

• Advantages
– Prevents registers from being the bottleneck
– Eliminates WAR, WAW hazards
– Allows loop unrolling in HW

• Common Data Bus
– Broadcasts results to multiple instructions
– Central bottleneck

• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation

Tomasulo Implementation

RS layout (1)

• How many RS exist for each FU type?
– One single RS (centralized RS)
» Intel Pentium Pro (Pentium II, III)

RS layout (2)

– One RS per FU
» PowerPC 6xx

RS layout (3)

– One RS per n FUs
» MIPS R10000, HP PA-8500, AMD Opteron, DEC/Compaq Alpha 21264, Intel Pentium IV, IBM Power 4

Register Renaming (1)

• What do we use to rename the registers (tags)?
– Reservation Station id
» IBM 360
– ROB entry id
» Intel Pentium Pro, AMD K5, HP PA-8000
– Future/architectural register file
» IBM PowerPC 6xx
– Merged future/architectural register file
» MIPS R10000, DEC/Compaq Alpha, Intel Pentium IV (NetBurst), AMD Opteron

Register Renaming (2)

• Merged future/architectural register file
– Two kinds of registers
» Logical registers (the ones the compiler sees)
• Typically R0, R1, ..., R31
» Physical registers (the ones the processor has)
• Typically R0, R1, ..., Rn (n > 31; the count is usually #logical registers + #ROB entries)
– Need a structure to know which physical register holds a logical value
» Rename table
– Need to know the free physical registers
» Free list

[Figure: the free list hands a free physical register to the rename table, which maps each logical register to a physical register.]

Register Renaming (3)

• Merged future/architectural register file: steps
– Decode
» Keep a copy of the old mapping
• Old_Dest_Physical_Reg = Rename Table [Dest_Logical_Reg]
» Get a new physical register for the destination register
• Dest_Physical_Reg = Free-List()
• Rename Table [Dest_Logical_Reg] = Dest_Physical_Reg
– Writeback
» On successful completion, free the old mapping
• Free-List += Old_Dest_Physical_Reg
» On an interrupt/exception
• Restore the old mapping
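The decode and writeback steps above map directly onto a rename table plus a free list. The following is a minimal sketch; the `Renamer` class, its method names, and the choice of 40 physical registers are assumptions made for illustration, and `squash` assumes that mis-speculated instructions are undone youngest first.

```python
from collections import deque


class Renamer:
    def __init__(self, num_logical=32, num_physical=40):
        # Initially each logical register Fi maps to physical register RFi.
        self.table = {f"F{i}": f"RF{i}" for i in range(num_logical)}
        self.free = deque(f"RF{i}" for i in range(num_logical, num_physical))

    def decode(self, dest, srcs):
        """Rename one instruction; returns (new dest, renamed sources, old dest mapping)."""
        phys_srcs = [self.table[s] for s in srcs]  # sources use the current mapping
        old = self.table[dest]                     # keep a copy of the old mapping
        new = self.free.popleft()                  # get a free physical register
        self.table[dest] = new
        return new, phys_srcs, old

    def commit(self, old):
        # Successful completion: the previous mapping can never be read again.
        self.free.append(old)

    def squash(self, dest, new, old):
        # Interrupt/exception: restore the old mapping and give the new register back.
        self.table[dest] = old
        self.free.appendleft(new)


r = Renamer()
new, srcs, old = r.decode("F6", ["F2"])  # LD F6, 34(F2)
print(new, srcs, old)                    # RF32 ['RF2'] RF6
r.commit(old)
print(r.free[-1])                        # RF6
new2, _, old2 = r.decode("F2", ["F3"])   # LD F2, 45(F3)
r.squash("F2", new2, old2)
print(r.table["F2"], r.free[0])          # RF2 RF33
```

Keeping the old mapping with the instruction until it commits is what makes both outcomes cheap: on commit the old register is recycled, on an exception it is simply written back into the table.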

Register Renaming (Example)

Sample code
LD    F6, 34(F2)
LD    F2, 45(F3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Cycle 0
• FETCH (in-order): LD F6, 34(F2)
• DECODE/Rename (in-order): empty
• ISSUE: empty
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
• Reg. map: F0→RF0, F1→RF1, ..., F10→RF10, ..., F31→RF31

Register Renaming (Example)

Cycle 1
• FETCH (in-order): LD F2, 45(F3)
• DECODE/Rename (in-order): LD F6, 34(F2) → LD F6, 34(RF2) → LD RF32, 34(RF2) (Old: RF6)
• ISSUE: empty
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
• Reg. map: F6→RF32; every other Fi→RFi

Register Renaming (Example)

Cycle 2
• FETCH (in-order): MULTD F0, F2, F4
• DECODE/Rename (in-order): LD F2, 45(F3) → LD F2, 45(RF3) → LD RF33, 45(RF3) (Old: RF2)
• ISSUE: LD RF32, 34(RF2) (Old: RF6)
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
• Reg. map: F2→RF33, F6→RF32; every other Fi→RFi

Register Renaming (Example)

Cycle 3
• FETCH (in-order): MULTD F0, F2, F4
• DECODE/Rename (in-order): MULTD F0, F2, F4 → MULTD F0, RF33, RF4 → MULTD RF34, RF33, RF4 (Old: RF0)
• ISSUE: LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
• Reg. map: F0→RF34, F2→RF33, F6→RF32; every other Fi→RFi

Register Renaming (Example)

Cycle N
• WB, then COMMIT (in-order, oldest to youngest): LD RF32, 34(RF2) (Old: RF6); LD RF33, 45(RF3) (Old: RF2)
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
• Reg. map: F0→RF34, F2→RF33, F6→RF37, F8→RF35, F10→RF36; every other Fi→RFi

Register Renaming (Example)

Cycle N
• WB, then COMMIT (in-order, oldest to youngest): LD RF32, 34(RF2) (Old: RF6); LD RF33, 45(RF3) (Old: RF2)
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ..., RF6, RF2 (the old mappings RF6 and RF2 are returned at commit)
• Reg. map: F0→RF34, F2→RF33, F6→RF37, F8→RF35, F10→RF36; every other Fi→RFi

Register Renaming (Example)

Cycle N
• WB, COMMIT (in-order, oldest to youngest): empty
• Free list: RF32, RF33, RF34, RF35, RF36, RF37, ..., RF6, RF2
• Reg. map: F0→RF34, F2→RF33, F6→RF37, F8→RF35, F10→RF36; every other Fi→RFi
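The register map reached at the end of the example can be checked by replaying the sample code through a rename loop. This is an illustrative sketch: the `rename` function and the pool of 40 physical registers are assumptions, and the program follows the sample code on the slides, where the loads use F2 and F3 as base registers.

```python
from collections import deque

PROGRAM = [
    ("LD", "F6", ["F2"]),
    ("LD", "F2", ["F3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def rename(program, num_logical=32, num_physical=40):
    """Rename every instruction in order; returns (renamed instructions, final map)."""
    table = {f"F{i}": f"RF{i}" for i in range(num_logical)}
    free = deque(f"RF{i}" for i in range(num_logical, num_physical))
    renamed = []
    for op, dest, srcs in program:
        phys_srcs = [table[s] for s in srcs]  # read the map before updating it
        old = table[dest]
        table[dest] = free.popleft()
        renamed.append((op, table[dest], phys_srcs, old))
    return renamed, table


renamed, table = rename(PROGRAM)
for op, dest, srcs, old in renamed:
    print(op, dest, srcs, f"(Old: {old})")
print({k: v for k, v in table.items() if v != "RF" + k[1:]})
```

The last line prints only the entries that no longer map to their initial register: F6, F2, F0, F8, and F10 end up in RF37, RF33, RF34, RF35, and RF36, matching the final register map, and the second write to F6 records RF32 as its old mapping.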

Issue Logic

• MIPS R10000

Kenneth C. Yeager, “The MIPS R10000 Superscalar Processor”, IEEE Micro, Volume 16, Issue 2, April 1996, pages 28-41

Issue Logic

• DEC/Compaq Alpha 21264

R.E. Kessler, E.J. McLellan, and D.A. Webb, “The Alpha 21264 Microprocessor Architecture,” Proc. 1998 IEEE Int’l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 90–95.

Issue Logic

• Intel Pentium III

Issue Logic

• Intel Pentium IV

Glenn Hinton et al., “The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal, Q1 2001

Issue Logic

• Intel Core 2 Lynnfield (2009)

Sandy Bridge (2011)

Issue Logic (1)

(1) Cache size and associativity, ROB size and RS number vary across generations

Issue Logic

• AMD Athlon

Issue Logic

• AMD Opteron

[Figure: AMD Opteron block diagram. Fetch, transit, pick, decode and pack stages feed three 8-entry schedulers (each serving an AGU/ALU pair) and a 36-entry scheduler (FADD, FMUL, FMISC). Also shown: Level 1 instruction cache and instruction TLB, Level 1 data cache with ECC and data TLB, Level 2 cache with L2 ECC, L2 tags and L2 tag ECC, branch prediction (2k branch targets, 16k history counters, RAS and target address), and the “Northbridge”: System Request Queue (SRQ), Cross Bar (XBAR), memory controller and HyperTransport™.]

Issue Logic

• AMD Phenom II (Bulldozer core)

AMD Bulldozer

Issue Logic

• ARM Cortex-A15 MPCore
– Samsung Galaxy SIII, tablets?, servers?