1 comp 206: computer architecture and implementation montek singh mon., sep 30, 2002 topic:...
TRANSCRIPT
1
COMP 206:COMP 206:Computer Architecture and Computer Architecture and
ImplementationImplementation
Montek SinghMontek Singh
Mon., Sep 30, 2002Mon., Sep 30, 2002
Topic: Topic: Instruction-Level ParallelismInstruction-Level Parallelism
(Dynamic Scheduling: Tomasulo’s (Dynamic Scheduling: Tomasulo’s
Algorithm)Algorithm)
2
Dynamic Scheduling: Tomasulo’s Dynamic Scheduling: Tomasulo’s AlgorithmAlgorithm For IBM 360/91 (about three years after CDC 6600)For IBM 360/91 (about three years after CDC 6600) Goal: High performance without special compilersGoal: High performance without special compilers Differences between IBM 360 and CDC 6600 ISADifferences between IBM 360 and CDC 6600 ISA
IBM has only 2 register specifiers/instruction versus 3 in CDC IBM has only 2 register specifiers/instruction versus 3 in CDC 66006600
IBM has 4 FP registers versus 8 in CDC 6600IBM has 4 FP registers versus 8 in CDC 6600 Differences between Tomasulo Algorithm and Differences between Tomasulo Algorithm and
ScoreboardScoreboard Control and buffers distributed with Function Units versus Control and buffers distributed with Function Units versus
centralized in scoreboard; called “reservation stations”centralized in scoreboard; called “reservation stations” Registers in instructions replaced by pointers to reservation Registers in instructions replaced by pointers to reservation
station bufferstation buffer Hardware renaming of registers to avoid WAR and WAW Hardware renaming of registers to avoid WAR and WAW
hazardshazards Common Data Bus broadcasts results to all FUs (forwarding)Common Data Bus broadcasts results to all FUs (forwarding) Load and Stores treated as FUs as wellLoad and Stores treated as FUs as well
3
Tomasulo: OrganizationTomasulo: Organization
123456
Floatingpointoperandstack
DecoderFP
registers(4)
TagsBusy
bit
Tag Data Tag DataTag Data Tag DataTag Data Tag Data
Adder (2 cycle, pipelined)
Tag Data Tag DataTag Data Tag Data
Multiply (3 cycle, pipelined)Divide (12 cycle, not pipelined)
101112
89
Frommemory
TomemoryFrom
I-Fetch
Status
To all tags
Common data bus (CDB)
Tag AddrTag AddrTag Addr
DataDataData
4
More Details of Tomasulo More Details of Tomasulo OrganizationOrganization Entities that produce values are assigned 4-bit Entities that produce values are assigned 4-bit
tagstags 1, 2, 3, 4, 5, 6 for load buffers1, 2, 3, 4, 5, 6 for load buffers 8, 9 for multiplier reservation stations8, 9 for multiplier reservation stations 10, 11, 12 for adder reservation stations10, 11, 12 for adder reservation stations Tag 0 indicates presence of valid dataTag 0 indicates presence of valid data
FP registers have “busy bits”FP registers have “busy bits” 0 means that register holds valid data0 means that register holds valid data 1 means that it is waiting to receive value from source 1 means that it is waiting to receive value from source
identified by its tag fieldidentified by its tag field
5
Tomasulo: Representing Data Tomasulo: Representing Data DependencesDependences InputsInputs
Operand is a Operand is a register with busy bit = 0register with busy bit = 0Data copied immediately (through register bus) into Data copied immediately (through register bus) into
reservation stationreservation stationTag field of RS set to 0Tag field of RS set to 0
Operand is a Operand is a register with busy bit = 1register with busy bit = 1Tag field of RS receives a copy of the register tag fieldTag field of RS receives a copy of the register tag field
Operand is a Operand is a load buffer that contains valid dataload buffer that contains valid dataData copied into RSData copied into RS
Operand is a Operand is a load buffer that is awaiting dataload buffer that is awaiting dataTag field of RS receives tag of load bufferTag field of RS receives tag of load buffer
OutputsOutputs Output is a Output is a registerregister
Busy bit set to 1, tag set to RS tagBusy bit set to 1, tag set to RS tag Output is a Output is a store bufferstore buffer
Tag set to RS tag, destination address setTag set to RS tag, destination address set
6
Three Stages of Tomasulo Three Stages of Tomasulo AlgorithmAlgorithm1.1. Issue: Issue: get instruction from FP operation get instruction from FP operation
queuequeue If reservation station free, the scoreboard issues If reservation station free, the scoreboard issues
instruction and sends operands (renames registers)instruction and sends operands (renames registers)
2.2. Execution: Execution: operate on operands (EX)operate on operands (EX) When both operands ready then execute;When both operands ready then execute;
if not ready, watch CDB for resultif not ready, watch CDB for result
3.3. Write Result: Write Result: finish execution (WB)finish execution (WB) Write on Common Data Bus to all awaiting units; Write on Common Data Bus to all awaiting units;
mark reservation station availablemark reservation station available
7
Tomasulo: State TransitionsTomasulo: State TransitionsTag MM addr ---- Data has not arrived
Tag MM addr XXXX Data has arrived
Tag MM addr XXXX CDB transfer
0 ---- ---- After CDB transfer
Tag --- Before CDB transfer
Tag XXXX During CDB transfer
0 XXXX After CDB transfer
•In case of a CDB conflict, earlier instruction has priority•If more than one instruction is enabled in the reservation stations of adder or multiplier in same cycle, top entry has priority•If CDB transfer and issue occur in same cycle, CDB transfer is assumed to occur first•Every instruction should spend at least one cycle in R stage•If an instruction being issued both reads and writes the same register, and the source operand is actually in the register (busy bit = 0), then first the register is read, and then its busy bit is turned to 1, making it unreadable
Lo
ad B
uff
er
Register
8
Tomasulo: ExampleTomasulo: Example100: F0 A101: F0 F0 + F1102: F0 F0 + B103: F2 F2 + F3104: F1 F1 * F2105: C F1106: F0 F1 / F0
100: F0 A101: F0 F0 + F1102: F0 F0 + B103: F2 F2 + F3104: F1 F1 * F2105: C F1106: F0 F1 / F0
CDB tags 4 3 10 12 11 8 9Clocks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
100 I R X W101 I R R X X W102 I R R R R X X W103 I R X X W104 I R R R X X X W105 I R R R R R R X106 I R R R R R X X X X X X W
100
101
102
106
103
104
105
I IssuedR In reservation stationX In executionW Writing result through CDB
9
Tomasulo Example Cycle 0Tomasulo Example Cycle 0Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 (F2) 10 ID T D T D 1 100 2 0 F1 (F1) 11 8 2 11
1 0 F0 (F0) 12 9 3 12CDB =
System is quiescentSystem is quiescent
10
Tomasulo Example Cycle 1Tomasulo Example Cycle 1Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF0 = A ID MM D
6 ID C5 B T ID D 84 A 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 (F2) 10 ID T D T D 1 101 2 0 F1 (F1) 11 8 2 11
1 1 4 F0 12 9 3 12CDB =
(A) will arrive at tag 4(A) will arrive at tag 4 (F0) will come from tag 4(F0) will come from tag 4 F0 is set to “busy”F0 is set to “busy”
11
Tomasulo Example Cycle 2Tomasulo Example Cycle 2Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF0 = F0+F1 ID MM D
6 ID C5 B T ID D 84 A 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 (F2) 10 4 0 (F1) ID T D T D 1 102 2 0 F1 (F1) 11 8 2 11
1 1 10 F0 12 9 3 12CDB =
(F0) will be produced at tag 10(F0) will be produced at tag 10 Right input of adder came from register (tag bit = 0)Right input of adder came from register (tag bit = 0) Left input of adder will come from tag 4Left input of adder will come from tag 4 Forwarding tag of F0 has been changed from 4 to 10Forwarding tag of F0 has been changed from 4 to 10
12
Tomasulo Example Cycle 3Tomasulo Example Cycle 3Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF0 = F0+B ID MM D
6 ID C5 B T ID D 84 A X 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 B 0 F2 (F2) 10 4 0 (F1) ID T D T D 1 103 2 0 F1 (F1) 11 10 3 8 2 11
1 1 11 F0 12 9 3 12CDB =
(F0) will be produced at tag 11(F0) will be produced at tag 11 (B) will arrive at tag 3(B) will arrive at tag 3 Right input of adder will come from tag 3Right input of adder will come from tag 3 Left input of adder will come from tag 10Left input of adder will come from tag 10 (A) arrives from memory(A) arrives from memory Forwarding tag of F0 has been changed from 10 to 11Forwarding tag of F0 has been changed from 10 to 11
13
Tomasulo Example Cycle 4Tomasulo Example Cycle 4Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF2 = F2+F3 ID MM D
6 ID C5 B T ID D 84 A X 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 B 1 12 F2 10 4 X 0 (F1) ID T D T D 1 104 2 0 F1 (F1) 11 10 3 8 2 11
1 1 11 F0 12 0 (F2) 0 (F3) 9 3 12CDB = 4
(F2) will be produced at tag 12(F2) will be produced at tag 12 Right input of adder came from register (tag bit = 0)Right input of adder came from register (tag bit = 0) Left input of adder came from register (tag bit = 0)Left input of adder came from register (tag bit = 0) (A) with tag 4 is broadcast on CDB(A) with tag 4 is broadcast on CDB Adder (at tag 10) picks it up, and is thereby enabledAdder (at tag 10) picks it up, and is thereby enabled The instruction that will write F2 has already read the The instruction that will write F2 has already read the
old contents of F2old contents of F2
14
Tomasulo Example Cycle 5Tomasulo Example Cycle 5Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF1 = F1*F2 ID MM D
6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 B Y 1 12 F2 10 0 X 0 (F1) ID T D T D 1 10 X5 2 1 8 F1 11 10 3 8 0 (F1) 12 2 11
1 1 11 F0 12 0 (F2) 0 (F3) 9 3 12CDB =
(F1) will be produced at tag 8(F1) will be produced at tag 8 Right input of multiplier will come from tag 12Right input of multiplier will come from tag 12 Left input of multiplier came from register (tag bit = 0)Left input of multiplier came from register (tag bit = 0) Adder (at tag 10) starts computingAdder (at tag 10) starts computing (B) arrives from memory(B) arrives from memory
15
Tomasulo Example Cycle 6Tomasulo Example Cycle 6Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusC = F1 ID MM D
6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 B Y 1 12 F2 10 0 X 0 (F1) ID T D T D 1 8 C 10 X6 2 1 8 F1 11 10 3 Y 8 0 (F1) 12 2 11
1 1 11 F0 12 0 (F2) 0 (F3) 9 3 12 XCDB = 3
Memory address of destination is CMemory address of destination is C Data will come from tag 8Data will come from tag 8 Adder (at tag 10) finishes computingAdder (at tag 10) finishes computing (B) with tag 3 is broadcast on CDB(B) with tag 3 is broadcast on CDB Adder (at tag 11) picks it upAdder (at tag 11) picks it up Adder (at tag 12) starts computingAdder (at tag 12) starts computing
16
Tomasulo Example Cycle 7Tomasulo Example Cycle 7Issued From MM Registers ADD Unit MPY/DIV unit To MM FU statusF1 = F1/F0 ID MM D
6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 1 12 F2 10 0 X 0 (F1) ID T D T D 1 8 C 107 2 1 9 F1 11 10 Z 0 Y 8 0 (F1) 12 2 11
1 1 11 F0 12 0 (F2) 0 (F3) 9 8 11 3 12 XCDB = 10
(F1) will be produced at tag 9(F1) will be produced at tag 9 Right input of divider will come from tag 11Right input of divider will come from tag 11 Left input of divider will come from tag 8Left input of divider will come from tag 8 Result of adder (with tag 10) is broadcast on CDBResult of adder (with tag 10) is broadcast on CDB Adder (at tag 11) picks it up and is thereby enabledAdder (at tag 11) picks it up and is thereby enabled Adder (at tag 12) finishes computingAdder (at tag 12) finishes computing
17
Tomasulo Example Cycle 8Tomasulo Example Cycle 8Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 1 12 F2 W 10 ID T D T D 1 8 C 108 2 1 9 F1 11 0 Z 0 Y 8 0 (F1) 12 W 2 11 X
1 1 11 F0 12 0 (F2) 0 (F3) 9 8 11 3 12CDB = 12
Result of adder (at tag 12) is broadcast on CDBResult of adder (at tag 12) is broadcast on CDB Multiplier (at tag 12) picks it up and is thereby Multiplier (at tag 12) picks it up and is thereby
enabledenabled
18
Tomasulo Example Cycle 9Tomasulo Example Cycle 9Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 8 X4 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 W 10 ID T D T D 1 8 C 109 2 1 9 F1 11 0 Z 0 Y 8 0 (F1) 0 W 2 11 X
1 1 11 F0 12 9 8 11 3 12CDB =
Multiplier (at tag 8) starts computingMultiplier (at tag 8) starts computing Adder (at tag 11) finishes computingAdder (at tag 11) finishes computing
19
Tomasulo Example Cycle 10Tomasulo Example Cycle 10Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 8 X4 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 W 10 ID T D T D 1 8 C 1010 2 1 9 F1 11 0 Z 0 Y 8 0 (F1) 0 X 2 11
1 1 11 F0 V 12 9 8 11 V 3 12CDB = 11
Result of adder (with tag 11) is broadcast on CDBResult of adder (with tag 11) is broadcast on CDB Divider (at tag 9) picks it upDivider (at tag 9) picks it up Register F0 picks it upRegister F0 picks it up
20
Tomasulo Example Cycle 11Tomasulo Example Cycle 11Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 8 X4 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 W 10 ID T D T D 1 8 C 1011 2 1 9 F1 11 8 0 (F1) 0 X 2 11
1 0 F0 V 12 9 8 0 V 3 12CDB =
Multiplier (at tag 8) finishes computingMultiplier (at tag 8) finishes computing
21
Tomasulo Example Cycle 12Tomasulo Example Cycle 12Issued From MM Registers ADD Unit MPY/DIV unit To MM FU status
ID MM D6 ID C5 B T ID D 84 0 F3 (F3) ID T D T D ID T D MM 9
Clock = 3 0 F2 W 10 ID T D T D 1 8 U C 1012 2 1 9 F1 11 8 0 (F1) 0 X 2 11
1 0 F0 V 12 9 8 U 0 V 3 12CDB = 8
Result of multiplier (at tag 8) is broadcast on CDBResult of multiplier (at tag 8) is broadcast on CDB Divider (at tag 9) picks it up, and is thereby enabledDivider (at tag 9) picks it up, and is thereby enabled Store buffer (at tag 1) picks it up, and is thereby Store buffer (at tag 1) picks it up, and is thereby
enabledenabled
22
Observations on Tomasulo’s Observations on Tomasulo’s AlgorithmAlgorithm Instructions: move from decoder to reservation Instructions: move from decoder to reservation
stationsstations in program order in program order dependences can be correctly recordeddependences can be correctly recorded
Data Flow Graph: Data Flow Graph: The graph of pointers connecting The graph of pointers connecting the RS, registers, and memory buffersthe RS, registers, and memory buffers helps accomplish out-of-order sequencing of instructionshelps accomplish out-of-order sequencing of instructions
Chief cost of this scheme: high-speed associative Chief cost of this scheme: high-speed associative hardwarehardware RS hardware has to search for tags when CDB broadcasts RS hardware has to search for tags when CDB broadcasts
some value with its tagsome value with its tag Full load bypassing is supportedFull load bypassing is supported
load and store buffers are treated just like functional unitsload and store buffers are treated just like functional units additional hardware on 360/91 also supported load additional hardware on 360/91 also supported load
forwardingforwarding
23
Tomasulo: Example of Load Tomasulo: Example of Load BypassingBypassing
200: F0 A201: F0 F0 / F1202: C F0203: F0 D204: F0 F0 * F2
200: F0 A201: F0 F0 / F1202: C F0203: F0 D204: F0 F0 * F2
Instruction 202 depends Instruction 202 depends on instructions 200 and on instructions 200 and 201, so instruction 203 201, so instruction 203 will start executing will start executing much before 202 much before 202 (assuming C and D are (assuming C and D are found to be different found to be different memory addresses)memory addresses)
Work out details off-lineWork out details off-line
24
Tomasulo: “Loop Unrolling in Tomasulo: “Loop Unrolling in Hardware”Hardware” 360/91 supported limited kind of speculation360/91 supported limited kind of speculation
Small loops could be held in a loop bufferSmall loops could be held in a loop buffer Loop closing branches were predicted as takenLoop closing branches were predicted as taken
This has the effect of loop unrolling at run-timeThis has the effect of loop unrolling at run-time Given the small number of FP registers in machine, Given the small number of FP registers in machine,
software loop unrolling was not a viable optionsoftware loop unrolling was not a viable option
25
Tomasulo Loop ExampleTomasulo Loop ExampleLoop: Loop: L.DL.D F0F0 00 R1R1
MULT.DMULT.D F4F4 F0F0 F2F2
S.DS.D F4F4 00 R1R1
SUBISUBI R1R1 R1R1 #8#8
BNEZBNEZ R1R1 LoopLoop
Multiply takes 4 clocksMultiply takes 4 clocks Loads have cache missesLoads have cache misses
26
Loop Example Cycle 0Loop Example Cycle 0
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 Load1 NoMULTDF4 F0 F2 1 Load2 NoSD F4 0 R1 1 Load3 No QiLD F0 0 R1 2 Store1 NoMULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 No SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F300 80 Qi
27
Loop Example Cycle 1Loop Example Cycle 1
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 Load2 NoSD F4 0 R1 1 Load3 No QiLD F0 0 R1 2 Store1 NoMULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 No SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F301 80 Qi Load1
28
Loop Example Cycle 2Loop Example Cycle 2
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 Load3 No QiLD F0 0 R1 2 Store1 NoMULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F302 80 Qi Load1 Mult1
29
Loop Example Cycle 3Loop Example Cycle 3
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F303 80 Qi Load1 Mult1
30
Loop Example Cycle 4Loop Example Cycle 4
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F304 72 Qi Load1 Mult1
31
Loop Example Cycle 5Loop Example Cycle 5
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F305 72 Qi Load1 Mult1
32
Loop Example Cycle 6Loop Example Cycle 6
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F306 72 Qi Load1 Mult1Load2
33
Loop Example Cycle 7Loop Example Cycle 7
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 Yes MULTD R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F307 72 Qi Load2 Mult2
34
Loop Example Cycle 8Loop Example Cycle 8
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 Yes MULTD R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F308 72 Qi Load2 Mult2
35
Loop Example Cycle 9Loop Example Cycle 9
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 Yes MULTD R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F309 64 Qi Load2 Mult2
36
Loop Example Cycle 10Loop Example Cycle 10
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 10 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R14 Mult1 Yes MULTD M(80) R(F2) SUBI R1 R1 #80 Mult2 Yes MULTD R(F2) Load2 BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3010 64 Qi Load2 Mult2
37
Loop Example Cycle 11Loop Example Cycle 11
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R13 Mult1 Yes MULTD M(80) R(F2) SUBI R1 R1 #84 Mult2 Yes MULTD M(72) R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3011 64 Qi Mult2
38
Loop Example Cycle 12Loop Example Cycle 12
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R12 Mult1 Yes MULTD M(80) R(F2) SUBI R1 R1 #83 Mult2 Yes MULTD M(72) R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3012 64 Qi Mult2
39
Loop Example Cycle 13Loop Example Cycle 13
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R11 Mult1 Yes MULTD M(80) R(F2) SUBI R1 R1 #82 Mult2 Yes MULTD M(72) R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3013 64 Qi Mult2
40
Loop Example Cycle 14Loop Example Cycle 14
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD M(80) R(F2) SUBI R1 R1 #81 Mult2 Yes MULTD M(72) R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3014 64 Qi Mult2
41
Loop Example Cycle 15Loop Example Cycle 15
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 M(80)*R(F2)MULTDF4 F0 F2 2 7 15 Store2 Yes 72 Mult2SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 No SUBI R1 R1 #80 Mult2 Yes MULTD M(72) R(F2) BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3015 64 Qi Mult2
42
Loop Example Cycle 16Loop Example Cycle 16
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 M(80)*R(F2)MULTDF4 F0 F2 2 7 15 16 Store2 Yes 72 M(72)*R(72)SD F4 0 R1 2 8 Store3 NoReservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3016 64 Qi Mult1
43
Loop Example Cycle 17Loop Example Cycle 17
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 M(80)*R(F2)MULTDF4 F0 F2 2 7 15 16 Store2 Yes 72 M(72)*R(72)SD F4 0 R1 2 8 Store3 Yes 64 Mult1Reservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3017 64 Qi Mult1
44
Loop Example Cycle 18Loop Example Cycle 18
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 18 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 Yes 80 M(80)*R(F2)MULTDF4 F0 F2 2 7 15 16 Store2 Yes 72 M(72)*R(72)SD F4 0 R1 2 8 Store3 Yes 64 Mult1Reservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3018 56 Qi Mult1
45
Loop Example Cycle 19Loop Example Cycle 19
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 18 19 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 NoMULTDF4 F0 F2 2 7 15 16 Store2 Yes 72 M(72)*R(72)SD F4 0 R1 2 8 Store3 Yes 64 Mult1Reservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3019 56 Qi Mult1
46
Loop Example Cycle 20Loop Example Cycle 20
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 18 19 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 NoMULTDF4 F0 F2 2 7 15 16 Store2 Yes 72 M(72)*R(72)SD F4 0 R1 2 8 20 Store3 Yes 64 Mult1Reservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3020 56 Qi Mult1
47
Loop Example Cycle 21Loop Example Cycle 21
Instruction status ExecutionWriteInstruction j k iteration Issue completeResult Busy AddressLD F0 0 R1 1 1 9 10 Load1 NoMULTDF4 F0 F2 1 2 14 15 Load2 NoSD F4 0 R1 1 3 18 19 Load3 Yes 64 QiLD F0 0 R1 2 6 10 11 Store1 NoMULTDF4 F0 F2 2 7 15 16 Store2 NoSD F4 0 R1 2 8 20 21 Store3 Yes 64 Mult1Reservation Stations S1 S2 RS for jRS for k
Time Name Busy Op Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 No MULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load3 SUBI R1 R1 #80 Mult2 No BNEZ R1 Loop
Register result status
Clock R1 F0 F2 F4 F6 F8 F10 F12... F3021 56 Qi Mult1
48
Summary of Tomasulo’s Summary of Tomasulo’s AlgorithmAlgorithm Prevents registers as bottleneckPrevents registers as bottleneck Avoids WAR and WAW hazards of scoreboardAvoids WAR and WAW hazards of scoreboard Allows loop unrolling in hardwareAllows loop unrolling in hardware Not limited to basic blocks (provided we have Not limited to basic blocks (provided we have
branch prediction)branch prediction) Lasting contributionsLasting contributions
Dynamic schedulingDynamic scheduling Register renamingRegister renaming Load/store disambiguationLoad/store disambiguation
Next: Dynamic branch predictionNext: Dynamic branch prediction