dynamic instruction scheduling
Dynamic vs. Static Scheduling
• Data hazards in a program cause a processor to stall.
• With static scheduling, the compiler tries to reorder these instructions at compile time to reduce pipeline stalls.
– Uses less hardware
– Can use more powerful algorithms
• With dynamic scheduling, the hardware tries to rearrange the instructions at run time to reduce pipeline stalls.
– Simpler compiler
– Handles dependencies not known at compile time
– Allows code compiled for a different machine to run efficiently
Out-Of-Order Execution
• In our previous model, all instructions executed in the order in which they appear.
• This can lead to unnecessary stalls:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
• SUBD stalls waiting for the ADDD to go first, even though SUBD has no data dependency on DIVD or ADDD.
• With out-of-order execution, the SUBD is allowed to execute before the ADDD.
– This can lead to out-of-order completion, which can cause WAW and WAR hazards
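The dependence argument above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `hazards` helper are hypothetical names introduced only for illustration, and the check looks at register names alone.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Classify register hazards between two instructions in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F12", ("F8", "F14"))

print(hazards(divd, addd))  # ADDD needs F0 from DIVD
print(hazards(divd, subd))  # no hazard: SUBD is independent of DIVD
print(hazards(addd, subd))  # no hazard: SUBD is independent of ADDD
```

Only the DIVD/ADDD pair reports a hazard (RAW on F0), which is why SUBD may safely run ahead of ADDD.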
Scoreboarding
• The scoreboard implements a centralized control scheme that
– Detects all resource and data hazards
– Allows instructions to execute out of order when there are no resource hazards or data dependencies
• First implemented in 1964 by the CDC 6600, which had 18 separate functional units
– 4 FP units (2 multiply, 1 add, 1 divide)
– 7 memory units (5 loads, 2 stores)
– 7 integer units (add, shift, logical, compare, etc.)
• Our dynamic pipeline (much simpler)
– 2 FP multiply (10 EX cycles)
– 1 FP add (2 EX cycles)
– 1 FP divide (40 EX cycles)
– 1 integer unit (1 EX cycle)
Out-of-Order Execution
• Out-of-order execution divides the DR stage into:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
• Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
• CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
Scoreboard Implications
• Out-of-order completion can lead to WAR and WAW hazards
• Solution for WAW
– Detect the WAW hazard at issue, before reading operands
– Stall until the other instruction completes its write
• Solution for WAR
– Detect WAR hazards before writing back to the register file and stall the write-back
• This scoreboard does not take advantage of forwarding (i.e., bypasses), since an instruction waits until both of its source operands have been written back to the register file
• The scoreboard replaces DR, EX, WB with 4 stages
Four Stages of Scoreboard Control
• Decode+Issue (Issue)
– decode instructions
– check for structural and WAW hazards
– stall until structural and WAW hazards are resolved
• Read operands (Read)
– wait until no RAW hazards
– then read operands
• Execution (EX)
– operate on operands
– may take multiple cycles; notify the scoreboard when done
• Write result (WB)
– finish execution
– stall if WAR hazard
Three Parts of the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in: Issue, Read, EX, or WB.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or –)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
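As a rough illustration of the functional unit status and register result status described above, here is a small sketch of the bookkeeping done at issue. The names `FUStatus` and `try_issue` are hypothetical, and read-operands, execute and write-back are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine functional-unit status fields listed on the slide."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None    # destination register
    fj: Optional[str] = None    # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None    # units that will produce Fj, Fk
    qk: Optional[str] = None
    rj: Optional[bool] = None   # True when Fj / Fk can be read
    rk: Optional[bool] = None


def try_issue(fus: Dict[str, FUStatus], reg_result: Dict[str, str],
              unit: str, op: str, fi: str, fj: Optional[str], fk: Optional[str]) -> bool:
    """Issue bookkeeping; returns False when the instruction must stall."""
    if fus[unit].busy:      # structural hazard: the unit is taken
        return False
    if fi in reg_result:    # WAW hazard: a pending instruction writes Fi
        return False
    qj, qk = reg_result.get(fj), reg_result.get(fk)
    fus[unit] = FUStatus(
        busy=True, op=op, fi=fi, fj=fj, fk=fk, qj=qj, qk=qk,
        rj=None if fj is None else qj is None,
        rk=None if fk is None else qk is None,
    )
    reg_result[fi] = unit   # register result status: this unit will write Fi
    return True


fus = {name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
reg_result: Dict[str, str] = {}

try_issue(fus, reg_result, "Integer", "Load", "F2", None, "R3")  # LD F2 45+ R3
try_issue(fus, reg_result, "Mult1", "Mult", "F0", "F2", "F4")    # MULTD F0 F2 F4
print(fus["Mult1"])
print(reg_result)
```

After these two calls the Mult1 entry has Qj = Integer, Rj = No and Rk = Yes: MULTD is waiting for the load to produce F2.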
Scoreboarding Example Cycle 0

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
0      FU
Scoreboarding Example Cycle 1

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
1      FU                         Integer
Scoreboarding Example Cycle 2

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
2      FU                         Integer

Issue 2nd Load?
Scoreboarding Example Cycle 3

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
3      FU                         Integer

Issue 2nd Load?
Scoreboarding Example Cycle 4

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F6        R2                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
4      FU                         Integer
Scoreboarding Example Cycle 5

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
5      FU           Integer
Scoreboarding Example Cycle 6

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6
MULTD   F0   F2   F4   6
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
6      FU  Mult1    Integer
Scoreboarding Example Cycle 7

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7
MULTD   F0   F2   F4   6
SUBD    F8   F6   F2   7
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
7      FU  Mult1    Integer                Add
Scoreboarding Example Cycle 8

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6
SUBD    F8   F6   F2   7
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  Yes   Load  F2        R3                          Yes
0     Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
8      FU  Mult1    Integer                Add      Divide
Scoreboarding Example Cycle 9

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9
SUBD    F8   F6   F2   7      9
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
10    Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
9      FU  Mult1                           Add      Divide
Scoreboarding Example Cycle 10

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 --
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
9     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
10     FU  Mult1                           Add      Divide
Scoreboarding Example Cycle 11

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
8     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
11     FU  Mult1                           Add      Divide
Scoreboarding Example Cycle 12

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
7     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
12     FU  Mult1                           Add      Divide
Scoreboarding Example Cycle 13

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
6     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
13     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 14

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
5     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
14     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 15

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 --

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
4     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
15     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 16

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
3     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
16     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 17

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
2     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
17     FU  Mult1                  Add               Divide

Write result of ADDD?
Scoreboarding Example Cycle 18

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 --
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
1     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
18     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 19

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
19     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 20

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
0     Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
20     FU  Mult1                  Add               Divide
Scoreboarding Example Cycle 21

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21
ADDD    F6   F8   F2   13     14    15 -- 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
40    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
21     FU                         Add               Divide
Scoreboarding Example Cycle 22

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21    22 --
ADDD    F6   F8   F2   13     14    15 -- 16   22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
39    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
22     FU                         Add               Divide
Scoreboarding Example Cycle 23

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21    22 --
ADDD    F6   F8   F2   13     14    15 -- 16   22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
38    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
23     FU                                           Divide
Scoreboarding Example Cycle 61

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21    22 -- 61
ADDD    F6   F8   F2   13     14    15 -- 16   22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
61     FU                                           Divide
Scoreboarding Example Cycle 62

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21    22 -- 61   62
ADDD    F6   F8   F2   13     14    15 -- 16   22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
62     FU                                           Divide
Scoreboarding Example Cycle 63

Instruction status
Instruction  j    k    Issue  Read  Execution  Write
LD      F6   34+  R2   1      2     3 -- 3     4
LD      F2   45+  R3   5      6     7 -- 7     8
MULTD   F0   F2   F4   6      9     10 -- 19   20
SUBD    F8   F6   F2   7      9     10 -- 11   12
DIVD    F10  F0   F6   8      21    22 -- 61   62
ADDD    F6   F8   F2   13     14    15 -- 16   22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
0     Integer  No
0     Mult1    No
      Mult2    No
0     Add      No
0     Divide   No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
63     FU
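The timing in the walkthrough above can be reproduced with a small cycle-by-cycle model. This is a simplified sketch under stated assumptions, not the CDC 6600 control logic: at most one instruction issues per cycle, a result written in cycle c becomes visible in cycle c+1, and execution takes the EX latencies of "our dynamic pipeline". The names `Instr`, `pending` and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}  # EX cycles
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}      # units per type


@dataclass
class Instr:
    text: str
    unit: str
    dest: str
    srcs: Tuple[str, ...]
    issue: Optional[int] = None
    read: Optional[int] = None
    done: Optional[int] = None   # cycle in which execution completes
    write: Optional[int] = None


def pending(ins: Instr, cycle: int) -> bool:
    """Issued, and its result is not yet visible at the start of `cycle`."""
    return ins.issue is not None and (ins.write is None or ins.write >= cycle)


def simulate(program: List[Instr]) -> List[Instr]:
    cycle = 0
    while any(ins.write is None for ins in program):
        cycle += 1
        # Issue (in order): needs a free unit and no WAW hazard.
        nxt = next((ins for ins in program if ins.issue is None), None)
        if nxt is not None:
            busy = sum(1 for ins in program
                       if ins.unit == nxt.unit and pending(ins, cycle))
            waw = any(ins.dest == nxt.dest and pending(ins, cycle)
                      for ins in program)
            if busy < UNITS[nxt.unit] and not waw:
                nxt.issue = cycle
        for idx, ins in enumerate(program):
            earlier = program[:idx]
            if ins.issue is None or ins.issue == cycle:
                continue
            if ins.read is None:
                # Read operands: wait until no earlier producer is pending (RAW).
                raw = any(e.dest in ins.srcs and pending(e, cycle) for e in earlier)
                if not raw:
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.unit]
            elif ins.write is None and ins.done < cycle:
                # Write result: wait until earlier readers of dest have read (WAR).
                war = any(ins.dest in e.srcs and (e.read is None or e.read >= cycle)
                          for e in earlier)
                if not war:
                    ins.write = cycle
    return program


PROGRAM = [
    Instr("LD    F6, 34+R2", "Integer", "F6", ("R2",)),
    Instr("LD    F2, 45+R3", "Integer", "F2", ("R3",)),
    Instr("MULTD F0, F2, F4", "Mult", "F0", ("F2", "F4")),
    Instr("SUBD  F8, F6, F2", "Add", "F8", ("F6", "F2")),
    Instr("DIVD  F10, F0, F6", "Divide", "F10", ("F0", "F6")),
    Instr("ADDD  F6, F8, F2", "Add", "F6", ("F8", "F2")),
]

for ins in simulate(PROGRAM):
    print(f"{ins.text:<20}{ins.issue:>4}{ins.read:>4}{ins.done:>4}{ins.write:>4}")
```

Under these assumptions the printed issue, read, execution-complete and write cycles agree with the final instruction status table (for example, ADDD finishes executing in cycle 16 but cannot write F6 until cycle 22, after DIVD has read F6 in cycle 21).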
CDC 6600 Scoreboard Summary
• Speedup from scoreboard
– 1.7 for FORTRAN programs
– 2.5 for hand-coded assembly language programs
– Effects of modern compilers?
• Hardware
– Scoreboard hardware approximately the same as one FPU
– Main cost was buses (4x the normal amount)
– Could be more severe for modern processors
• Limitations
– No forwarding logic
– Limited to the instructions in the instruction window
– Stalls for WAW hazards
– Waits for WAR hazards before WB
Tomasulo Algorithm for Dynamic Scheduling
• For IBM 360/91 in 1967, about 3 years after CDC 6600
• Goal: high performance without special compilers
• Differences between IBM 360 & CDC 6600
– IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
– IBM has register-memory instructions
– IBM has 4 FP registers vs. 8 in CDC 6600
– IBM has pipelined functional units (3 adds, 2 multiplies)
• Tomasulo algorithm is designed to handle name dependencies (WAW and WAR hazards) efficiently
SUB F1, F2, F0
DIVF F2, F3, F2
ADDF F3, F0, F0
MULF F3, F1, F1
Tomasulo Algorithm
• Differences from Scoreboarding
– Distributed hazard detection and control (through reservation stations)
– Results are bypassed to functional units
– Common data bus (CDB) broadcasts results to all FUs
– HW renaming of registers to avoid WAR, WAW hazards
– Loads and stores treated as FUs as well
– Registers in instructions replaced by pointers to reservation station buffers
• Led to concepts used in Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
Reservation Station Components
Op: operation to perform in the unit (e.g., + or –)
Qj, Qk: reservation stations producing the source registers
Vj, Vk: values of the source operands
Rj, Rk: flags indicating when Vj, Vk are ready
Busy: indicates that the reservation station and its FU are busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from the FP Op Queue
If a reservation station is free, issue the instruction and send the operands (renames registers).
2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result.
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available.
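The three stages can be sketched in a few lines. This is an illustrative model only, assuming one named reservation station per call and ignoring timing and load buffers; the class names `RS` and `Tomasulo` and their methods are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class RS:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations that will produce them
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, station_names: Iterable[str], regs: Dict[str, float]):
        self.stations = {n: RS(n) for n in station_names}
        self.regs = dict(regs)                 # architectural register values
        self.reg_status: Dict[str, str] = {}   # register -> producing station

    def _operand(self, reg: str):
        tag = self.reg_status.get(reg)
        if tag is not None:
            return None, tag           # value not ready: remember the tag
        return self.regs[reg], None    # value ready: copy it now

    def issue(self, station: str, op: str, dest: str, src_j: str, src_k: str) -> bool:
        rs = self.stations[station]
        if rs.busy:
            return False               # no free reservation station: stall
        rs.vj, rs.qj = self._operand(src_j)
        rs.vk, rs.qk = self._operand(src_k)
        rs.busy, rs.op = True, op
        self.reg_status[dest] = station   # renaming: dest now names this station
        return True

    def ready(self, station: str) -> bool:
        rs = self.stations[station]
        return rs.busy and rs.qj is None and rs.qk is None

    def broadcast(self, station: str, value: float) -> None:
        """Write result: put (tag, value) on the CDB and free the station."""
        for rs in self.stations.values():
            if rs.qj == station:
                rs.vj, rs.qj = value, None
            if rs.qk == station:
                rs.vk, rs.qk = value, None
        for reg, tag in list(self.reg_status.items()):
            if tag == station:         # only the newest writer updates the register
                self.regs[reg] = value
                del self.reg_status[reg]
        self.stations[station] = RS(station)


t = Tomasulo(["Add1", "Add2"], {"F2": 2.0, "F6": 6.0, "F8": 0.0})
t.issue("Add1", "SUB", "F8", "F6", "F2")   # SUBD F8, F6, F2 copies F6 = 6.0 now
t.issue("Add2", "ADD", "F6", "F8", "F2")   # ADDD F6, F8, F2 waits on tag Add1
t.broadcast("Add1", 6.0 - 2.0)             # CDB delivers F8 to Add2
t.broadcast("Add2", 4.0 + 2.0)
print(t.regs)
```

Because SUBD copied the value of F6 into its station at issue, the later ADDD can overwrite F6 without a WAR stall, which is the effect of renaming through reservation stations.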
Tomasulo Example Cycle 0

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
0      FU
Tomasulo Example Cycle 1

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1
LD      F2   45+  R3
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
1      FU                         Load1
Tomasulo Example Cycle 2

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 --
LD      F2   45+  R3   2
MULTD   F0   F2   F4
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
2      FU           Load2         Load1
Tomasulo Example Cycle 3

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3
LD      F2   45+  R3   2      3 --
MULTD   F0   F2   F4   3
SUBD    F8   F6   F2
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  Yes   Mult       F4   Load2
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
3      FU  Mult1    Load2         Load1
Tomasulo Example Cycle 4

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4
MULTD   F0   F2   F4   3
SUBD    F8   F6   F2   4
DIVD    F10  F0   F6
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   Yes   Sub   F6                  Load2
0     Add2   No
      Add3   No
0     Mult1  Yes   Mult       F4   Load2
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
4      FU  Mult1    Load2                  Add1
Tomasulo Example Cycle 5

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3
SUBD    F8   F6   F2   4
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
2     Add1   Yes   Sub   F6   F2
0     Add2   No
      Add3   No
10    Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
5      FU  Mult1                           Add1     Mult2
Tomasulo Example Cycle 6

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 --
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
1     Add1   Yes   Sub   F6   F2
0     Add2   Yes   Add        F2   Add1
      Add3   No
9     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
6      FU  Mult1                  Add2     Add1     Mult2
Tomasulo Example Cycle 7

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   Yes   Sub   F6   F2
0     Add2   Yes   Add        F2   Add1
      Add3   No
8     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
7      FU  Mult1                  Add2     Add1     Mult2
Tomasulo Example Cycle 8

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
2     Add2   Yes   Add   F8   F2
      Add3   No
7     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
8      FU  Mult1                  Add2              Mult2
Tomasulo Example Cycle 9

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 --

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
1     Add2   Yes   Add   F8   F2
      Add3   No
6     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
9      FU  Mult1                  Add2              Mult2
Tomasulo Example Cycle 10

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 -- 10

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
0     Add2   Yes   Add   F8   F2
      Add3   No
5     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
10     FU  Mult1                  Add2              Mult2
Tomasulo Example Cycle 11

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
11     FU  Mult1                                    Mult2
Tomasulo Example Cycle 12

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 --
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
12     FU  Mult1                                    Mult2
Tomasulo Example Cycle 15

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 -- 15
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   Mult  F2   F4
0     Mult2  Yes   Div        F6   Mult1

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
15     FU  Mult1                                    Mult2
Tomasulo Example Cycle 16

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 -- 15    16
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   Div   F0   F6

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
16     FU                                           Mult2
Tomasulo Example Cycle 56

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 -- 15    16
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5      17 -- 56
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   Div   F0   F6

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
56     FU                                           Mult2
Tomasulo Example Cycle 57

Instruction status
Instruction  j    k    Issue  Execution  Write
LD      F6   34+  R2   1      2 -- 3     4
LD      F2   45+  R3   2      3 -- 4     5
MULTD   F0   F2   F4   3      6 -- 15    16
SUBD    F8   F6   F2   4      6 -- 7     8
DIVD    F10  F0   F6   5      17 -- 56   57
ADDD    F6   F8   F2   6      9 -- 10    11

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
                         S1   S2   RS for j  RS for k
Time  Name   Busy  Op    Vj   Vk   Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  No

Register result status
Clock      F0       F2       F4   F6       F8       F10      F12  ...  F30
57     FU
Tomasulo Summary
• Advantages
– Prevents registers from being the bottleneck
– Eliminates WAR, WAW hazards
– Allows loop unrolling in HW
• Common Data Bus
– Broadcasts results to multiple instructions
– Central bottleneck
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
RS layout (1)
• How many RS exist for each FU type?
– One single RS (centralized RS)
» Intel Pentium Pro (Pentium II, III)
[Figure: row of FUs]
RS layout (3)
– One RS per n FU
» MIPS R10000, HP PA-8500, AMD Opteron, DEC/Compaq Alpha 21264, Intel Pentium IV, IBM Power 4
[Figure: row of FUs]
Register Renaming (1)
• What do we use to rename the registers (tags)?
– Reservation station id
» IBM 360
– ROB entry id
» Intel Pentium Pro, AMD K5, HP PA-8000
– Future/architectural register file
» IBM PowerPC 6xx
– Merged future/architectural register file
» MIPS R10000, DEC/Compaq Alpha, Intel Pentium IV (NetBurst), AMD Opteron
Register Renaming (2)
• Merged future/architectural register file
– Two kinds of registers
» Logical registers (the ones the compiler sees)
• Typically R0, R1, ..., R31
» Physical registers (the ones the processor has)
• Typically R0, R1, ..., Rn (n > 31; the count is usually #logical registers + #ROB entries)
– Need a structure to know which physical register holds a logical value
» Rename table
– Need to know the free physical registers
» Free list
[Figure: the free list hands out a free physical register on request ("Give me a free physical register"); the rename table maps each logical register to a physical register]
Register Renaming (3)
• Merged future/architectural register file - STEPS
– Decode
» Keep a copy of the old mapping
• Old_Dest_Physical_Reg = Rename Table [Dest_Logical_Reg]
» Get a new physical register for the destination register
• Dest_Physical_Reg = Free-List()
• Rename Table [Dest_Logical_Reg] = Dest_Physical_Reg
– Writeback
» On successful completion, free the old mapping
• Free-List += Old_Dest_Physical_Reg
» On an interrupt/exception
• Restore the old mapping
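The decode and writeback steps above map directly onto a rename table plus a free list. The sketch below is an assumption-laden illustration: the class `Renamer`, its method names, and the pool of 40 physical registers are hypothetical choices, and recovery is shown for a single instruction only (a real machine would undo mappings youngest-first).

```python
from collections import deque
from typing import List, Tuple


class Renamer:
    def __init__(self, n_logical: int = 32, n_physical: int = 40):
        # Initially logical Fi maps to physical RFi; the rest are free.
        self.table = {f"F{i}": f"RF{i}" for i in range(n_logical)}
        self.free = deque(f"RF{i}" for i in range(n_logical, n_physical))

    def rename(self, dest: str, srcs: List[str]) -> Tuple[str, List[str], str]:
        """Decode step: returns (new dest, renamed sources, old dest mapping)."""
        # Sources use the current mapping; integer registers pass through.
        new_srcs = [self.table.get(s, s) for s in srcs]
        old = self.table[dest]          # keep a copy of the old mapping
        new = self.free.popleft()       # Dest_Physical_Reg = Free-List()
        self.table[dest] = new
        return new, new_srcs, old

    def commit(self, old: str) -> None:
        """Successful completion: the old physical register becomes free."""
        self.free.append(old)

    def squash(self, dest: str, new: str, old: str) -> None:
        """Interrupt/exception: restore the old mapping and return `new`."""
        self.table[dest] = old
        self.free.appendleft(new)


renamer = Renamer()
ld1 = renamer.rename("F6", ["F2"])          # LD F6, 34(F2)
ld2 = renamer.rename("F2", ["F3"])          # LD F2, 45(F3)
mul = renamer.rename("F0", ["F2", "F4"])    # MULTD F0, F2, F4
print(ld1, ld2, mul, sep="\n")

renamer.commit(ld1[2])   # in-order commit frees RF6
renamer.commit(ld2[2])   # ... and then RF2
print(list(renamer.free))
```

The three renames give LD RF32, 34(RF2) (old RF6), LD RF33, 45(RF3) (old RF2) and MULTD RF34, RF33, RF4 (old RF0), and committing the two loads appends RF6 and RF2 to the free list.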
Register Renaming (Example)
Sample Code
LD F6, 34(F2)
LD F2, 45(F3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Cycle 0
FETCH (in-order): LD F6, 34(F2)
DECODE/Rename (in-order): (empty)
ISSUE: (empty)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
Reg. Map: F0 -> RF0, F1 -> RF1, F2 -> RF2, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF6, F7 -> RF7, F8 -> RF8, F9 -> RF9, F10 -> RF10, ..., F31 -> RF31
Register Renaming (Example)

Cycle 1
FETCH (in-order): LD F2, 45(F3)
DECODE/Rename (in-order): LD F6, 34(F2) -> LD F6, 34(RF2) -> LD RF32, 34(RF2) (Old: RF6)
ISSUE: (empty)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
Reg. Map: F0 -> RF0, F1 -> RF1, F2 -> RF2, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF32, F7 -> RF7, F8 -> RF8, F9 -> RF9, F10 -> RF10, ..., F31 -> RF31
Register Renaming (Example)

Cycle 2
FETCH (in-order): MULTD F0,F2,F4
DECODE/Rename (in-order): LD F2, 45(F3) -> LD F2, 45(RF3) -> LD RF33, 45(RF3) (Old: RF2)
ISSUE: LD RF32, 34(RF2) (Old: RF6)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
Reg. Map: F0 -> RF0, F1 -> RF1, F2 -> RF33, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF32, F7 -> RF7, F8 -> RF8, F9 -> RF9, F10 -> RF10, ..., F31 -> RF31
Register Renaming (Example)

Cycle 3
FETCH (in-order): MULTD F0,F2,F4
DECODE/Rename (in-order): MULTD F0,F2,F4 -> MULTD F0,RF33,RF4 -> MULTD RF34,RF33,RF4 (Old: RF0)
ISSUE: LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
Reg. Map: F0 -> RF34, F1 -> RF1, F2 -> RF33, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF32, F7 -> RF7, F8 -> RF8, F9 -> RF9, F10 -> RF10, ..., F31 -> RF31
Register Renaming (Example)

Cycle N
WB: LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
COMMIT (in-order, oldest to youngest): (empty)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ...
Reg. Map: F0 -> RF34, F1 -> RF1, F2 -> RF33, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF37, F7 -> RF7, F8 -> RF35, F9 -> RF9, F10 -> RF36, ..., F31 -> RF31
Register Renaming (Example)

Cycle N
WB: (empty)
COMMIT (in-order, oldest to youngest): LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ..., RF6, RF2
Reg. Map: F0 -> RF34, F1 -> RF1, F2 -> RF33, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF37, F7 -> RF7, F8 -> RF35, F9 -> RF9, F10 -> RF36, ..., F31 -> RF31
Register Renaming (Example)

Cycle N
WB: (empty)
COMMIT (in-order, oldest to youngest): (empty)
Free list: RF32, RF33, RF34, RF35, RF36, RF37, ..., RF6, RF2
Reg. Map: F0 -> RF34, F1 -> RF1, F2 -> RF33, F3 -> RF3, F4 -> RF4, F5 -> RF5, F6 -> RF37, F7 -> RF7, F8 -> RF35, F9 -> RF9, F10 -> RF36, ..., F31 -> RF31
Issue Logic
• MIPS R10000
Kenneth C. Yeager, “The MIPS R10000 Superscalar Processor”, IEEE Micro, Volume 16, Issue 2, April 1996, pages 28-41
Issue Logic
• DEC/Compaq Alpha 21264
R.E. Kessler, E.J. McLellan, and D.A. Webb, “The Alpha 21264 Microprocessor Architecture,” Proc. 1998 IEEE Int’l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 90–95.
Issue Logic
• Intel Pentium IV
Glenn Hinton et al., “The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal, Q1 2001
Issue Logic
• AMD Opteron
[Block diagram. Labels: Level 2 Cache, L2 ECC, L2 Tags, L2 Tag ECC; System Request Queue (SRQ), Cross Bar (XBAR), Memory Controller & HyperTransport™ (“Northbridge”); 2k Branch Targets, 16k History Counter, RAS & Target Address; Instr’n TLB, Level 1 Instr’n Cache; Data TLB, Level 1 Data Cache, ECC; Fetch 2 - transit, Pick; three Decode 1 / Decode 2 / Pack pipelines; three 8-entry Schedulers and one 36-entry Scheduler; AGU/ALU (x3), FADD, FMUL, FMISC]