COMP 206: Computer Architecture and Implementation
Montek Singh
Mon., Sep 30, 2002
Topic: Instruction-Level Parallelism (Dynamic Scheduling: Tomasulo’s Algorithm)


Dynamic Scheduling: Tomasulo’s Algorithm

• For IBM 360/91 (about three years after CDC 6600)
• Goal: High performance without special compilers
• Differences between IBM 360 and CDC 6600 ISA
  • IBM has only 2 register specifiers/instruction versus 3 in CDC 6600
  • IBM has 4 FP registers versus 8 in CDC 6600
• Differences between Tomasulo Algorithm and Scoreboard
  • Control and buffers distributed with Function Units versus centralized in scoreboard; called “reservation stations”
  • Registers in instructions replaced by pointers to reservation station buffer
  • Hardware renaming of registers to avoid WAR and WAW hazards
  • Common Data Bus broadcasts results to all FUs (forwarding)
  • Loads and Stores treated as FUs as well

Tomasulo: Organization

[Figure: block diagram of the floating-point unit. Load buffers (tags 1 to 6) are filled from memory; the floating-point operand stack and decoder are fed from I-Fetch; the 4 FP registers each carry a busy bit and a tag; store buffers (tag, address, data) drain to memory. The adder (2 cycle, pipelined) has reservation stations 10, 11, 12; the multiply (3 cycle, pipelined) / divide (12 cycle, not pipelined) unit has reservation stations 8, 9. Each reservation station holds two tag/data operand pairs. The common data bus (CDB) carries results and status to all tags.]

More Details of Tomasulo Organization

• Entities that produce values are assigned 4-bit tags
  • 1, 2, 3, 4, 5, 6 for load buffers
  • 8, 9 for multiplier reservation stations
  • 10, 11, 12 for adder reservation stations
  • Tag 0 indicates presence of valid data
• FP registers have “busy bits”
  • 0 means that register holds valid data
  • 1 means that it is waiting to receive value from source identified by its tag field

Tomasulo: Representing Data Dependences

• Inputs
  • Operand is a register with busy bit = 0
    • Data copied immediately (through register bus) into reservation station
    • Tag field of RS set to 0
  • Operand is a register with busy bit = 1
    • Tag field of RS receives a copy of the register tag field
  • Operand is a load buffer that contains valid data
    • Data copied into RS
  • Operand is a load buffer that is awaiting data
    • Tag field of RS receives tag of load buffer
• Outputs
  • Output is a register
    • Busy bit set to 1, tag set to RS tag
  • Output is a store buffer
    • Tag set to RS tag, destination address set
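
The input and output rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`Register`, `Operand`, `read_operand`, `claim_destination`) and the value 2.5 for (F1) are hypothetical, and load buffers are left out. The usage lines mirror the cycle-2 slide of the example, where F0 = F0 + F1 is issued to the adder station with tag 10 while F0 is still waiting on load buffer 4.

```python
from dataclasses import dataclass


@dataclass
class Register:
    busy: int = 0      # 0: register holds valid data; 1: waiting on `tag`
    tag: int = 0       # tag of the producing unit while busy == 1
    data: float = 0.0


@dataclass
class Operand:
    tag: int = 0       # 0 means `data` is already valid
    data: float = 0.0


def read_operand(reg: Register) -> Operand:
    """Fill one operand slot of a reservation station from a source register."""
    if reg.busy == 0:
        return Operand(tag=0, data=reg.data)   # value copied immediately
    return Operand(tag=reg.tag)                # wait for this tag on the CDB


def claim_destination(reg: Register, rs_tag: int) -> None:
    """Mark a register as the output of the reservation station `rs_tag`."""
    reg.busy = 1
    reg.tag = rs_tag


# F0 = F0 + F1 issued to adder station 10 (hypothetical data value for F1).
f0 = Register(busy=1, tag=4)       # F0 is awaiting load buffer 4
f1 = Register(busy=0, data=2.5)    # F1 holds valid data
left = read_operand(f0)            # sources are read first ...
right = read_operand(f1)
claim_destination(f0, 10)          # ... then the destination is renamed
```

Reading the sources before claiming the destination is what lets an instruction use the same register as both input and output.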

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP operation queue
   • If a reservation station is free, the control logic issues the instruction and sends operands (renames registers)
2. Execution: operate on operands (EX)
   • When both operands are ready, execute; if not ready, watch the CDB for the result
3. Write Result: finish execution (WB)
   • Write on Common Data Bus to all awaiting units; mark reservation station available
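
The "watch the CDB" and "write result" steps can be sketched as an associative tag match. This is a minimal sketch, not the 360/91 hardware: the names `Slot`, `Station`, and `broadcast`, the dictionary layout used for registers, and the data values are all assumptions made for illustration. The usage lines follow adder station 11 from the example, which waits on tag 10 (left) and tag 3 (right).

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    tag: int = 0       # 0 means `data` is valid
    data: float = 0.0


@dataclass
class Station:
    tag: int
    left: Slot = field(default_factory=Slot)
    right: Slot = field(default_factory=Slot)
    busy: bool = True

    def ready(self) -> bool:
        # Enabled for execution once both operand tags are 0.
        return self.busy and self.left.tag == 0 and self.right.tag == 0


def broadcast(cdb_tag, value, stations, registers):
    """Deliver (tag, value) to every waiting slot and register; free the producer."""
    assert cdb_tag != 0            # tag 0 is reserved for "valid data"
    for rs in stations:
        for slot in (rs.left, rs.right):
            if slot.tag == cdb_tag:
                slot.tag, slot.data = 0, value
        if rs.tag == cdb_tag:
            rs.busy = False        # producing station becomes available
    for reg in registers.values():
        if reg["busy"] == 1 and reg["tag"] == cdb_tag:
            reg.update(busy=0, tag=0, data=value)


adder11 = Station(tag=11, left=Slot(tag=10), right=Slot(tag=3))
regs = {"F0": {"busy": 1, "tag": 11, "data": 0.0}}
broadcast(3, 7.0, [adder11], regs)     # (B) arrives from load buffer 3
ready_after_load = adder11.ready()     # still waiting on tag 10
broadcast(10, 1.5, [adder11], regs)    # result of adder station 10
ready_after_add = adder11.ready()      # both operands now present
```

Every slot and register compares its tag against the bus on each broadcast, which is the associative search that the deck later names as the chief hardware cost.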

Tomasulo: State Transitions

Load buffer
Tag   MM addr   Data   State
Tag   MM addr   ----   Data has not arrived
Tag   MM addr   XXXX   Data has arrived
Tag   MM addr   XXXX   CDB transfer
0     ----      ----   After CDB transfer

Register
Tag   Data   State
Tag   ----   Before CDB transfer
Tag   XXXX   During CDB transfer
0     XXXX   After CDB transfer

• In case of a CDB conflict, the earlier instruction has priority
• If more than one instruction is enabled in the reservation stations of the adder or multiplier in the same cycle, the top entry has priority
• If a CDB transfer and an issue occur in the same cycle, the CDB transfer is assumed to occur first
• Every instruction should spend at least one cycle in the R (reservation station) stage
• If an instruction being issued both reads and writes the same register, and the source operand is actually in the register (busy bit = 0), then first the register is read, and then its busy bit is turned to 1, making it unreadable
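
The two priority rules above can be written down as a small selection sketch. The function names and the (instruction number, tag) encoding are hypothetical, and numbering station rows from 0 at the top is an assumption; the conflict in the usage line is made up for illustration and does not occur in the example that follows.

```python
def pick_cdb_winner(finished):
    """Return the tag allowed onto the CDB this cycle, or None if idle.

    `finished` holds (instruction_number, tag) pairs; the lowest instruction
    number is the earliest in program order and therefore has priority.
    """
    if not finished:
        return None
    return min(finished)[1]


def pick_station(enabled_rows):
    """Among enabled rows of one unit's reservation stations, the top row wins.

    Rows are assumed to be numbered from 0 at the top.
    """
    return min(enabled_rows) if enabled_rows else None


# Hypothetical conflict: instructions 101 (tag 10) and 103 (tag 12) finish together.
winner = pick_cdb_winner([(103, 12), (101, 10)])
```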

Tomasulo: Example

100: F0 = A
101: F0 = F0 + F1
102: F0 = F0 + B
103: F2 = F2 + F3
104: F1 = F1 * F2
105: C = F1
106: F1 = F1 / F0

Clock     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
CDB tag            4     3 10 12    11     8                    9
100       I  R  X  W
101          I  R  R  X  X  W
102             I  R  R  R  R  X  X  W
103                I  R  X  X  W
104                   I  R  R  R  X  X  X  W
105                      I  R  R  R  R  R  R  X
106                         I  R  R  R  R  R  X  X  X  X  X  X  W

I  Issued
R  In reservation station
X  In execution
W  Writing result through CDB

Tomasulo Example Cycle 0

[Figure: machine state at clock 0 (load buffers, FP registers with busy bits and tags, adder and multiply/divide reservation stations, store buffers). All four FP registers hold valid data (busy bit = 0); the CDB is idle.]

• System is quiescent

Tomasulo Example Cycle 1

Issued: F0 = A

[Figure: machine state at clock 1; the CDB is idle.]

• (A) will arrive at tag 4
• (F0) will come from tag 4
• F0 is set to “busy”

Tomasulo Example Cycle 2

Issued: F0 = F0+F1

[Figure: machine state at clock 2; the CDB is idle.]

• (F0) will be produced at tag 10
• Right input of adder came from register (tag bit = 0)
• Left input of adder will come from tag 4
• Forwarding tag of F0 has been changed from 4 to 10

Tomasulo Example Cycle 3

Issued: F0 = F0+B

[Figure: machine state at clock 3; the CDB is idle.]

• (F0) will be produced at tag 11
• (B) will arrive at tag 3
• Right input of adder will come from tag 3
• Left input of adder will come from tag 10
• (A) arrives from memory
• Forwarding tag of F0 has been changed from 10 to 11

Tomasulo Example Cycle 4

Issued: F2 = F2+F3

[Figure: machine state at clock 4; CDB = 4.]

• (F2) will be produced at tag 12
• Right input of adder came from register (tag bit = 0)
• Left input of adder came from register (tag bit = 0)
• (A) with tag 4 is broadcast on CDB
• Adder (at tag 10) picks it up, and is thereby enabled
• The instruction that will write F2 has already read the old contents of F2

Tomasulo Example Cycle 5

Issued: F1 = F1*F2

[Figure: machine state at clock 5; the CDB is idle.]

• (F1) will be produced at tag 8
• Right input of multiplier will come from tag 12
• Left input of multiplier came from register (tag bit = 0)
• Adder (at tag 10) starts computing
• (B) arrives from memory

Tomasulo Example Cycle 6

Issued: C = F1

[Figure: machine state at clock 6; CDB = 3.]

• Memory address of destination is C
• Data will come from tag 8
• Adder (at tag 10) finishes computing
• (B) with tag 3 is broadcast on CDB
• Adder (at tag 11) picks it up
• Adder (at tag 12) starts computing

Tomasulo Example Cycle 7

Issued: F1 = F1/F0

[Figure: machine state at clock 7; CDB = 10.]

• (F1) will be produced at tag 9
• Right input of divider will come from tag 11
• Left input of divider will come from tag 8
• Result of adder (with tag 10) is broadcast on CDB
• Adder (at tag 11) picks it up and is thereby enabled
• Adder (at tag 12) finishes computing

Tomasulo Example Cycle 8

[Figure: machine state at clock 8; no instruction issued; CDB = 12.]

• Result of adder (at tag 12) is broadcast on CDB
• Multiplier (at tag 8) picks it up and is thereby enabled

Tomasulo Example Cycle 9

[Figure: machine state at clock 9; no instruction issued; the CDB is idle.]

• Multiplier (at tag 8) starts computing
• Adder (at tag 11) finishes computing

Tomasulo Example Cycle 10

[Figure: machine state at clock 10; no instruction issued; CDB = 11.]

• Result of adder (with tag 11) is broadcast on CDB
• Divider (at tag 9) picks it up
• Register F0 picks it up

Tomasulo Example Cycle 11

[Figure: machine state at clock 11; no instruction issued; the CDB is idle; F0 holds valid data again (busy bit = 0).]

• Multiplier (at tag 8) finishes computing

Tomasulo Example Cycle 12

[Figure: machine state at clock 12; no instruction issued; CDB = 8.]

• Result of multiplier (at tag 8) is broadcast on CDB
• Divider (at tag 9) picks it up, and is thereby enabled
• Store buffer (at tag 1) picks it up, and is thereby enabled

Observations on Tomasulo’s Algorithm

• Instructions move from decoder to reservation stations
  • in program order
  • dependences can be correctly recorded
• Data Flow Graph: the graph of pointers connecting the RS, registers, and memory buffers
  • helps accomplish out-of-order sequencing of instructions
• Chief cost of this scheme: high-speed associative hardware
  • RS hardware has to search for tags when CDB broadcasts some value with its tag
• Full load bypassing is supported
  • load and store buffers are treated just like functional units
  • additional hardware on 360/91 also supported load forwarding

Tomasulo: Example of Load Bypassing

200: F0 = A
201: F0 = F0 / F1
202: C = F0
203: F0 = D
204: F0 = F0 * F2

• Instruction 202 depends on instructions 200 and 201, so instruction 203 will start executing well before 202 (assuming C and D are found to be different memory addresses)
• Work out details off-line
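
The condition in parentheses above amounts to an address comparison between the load and every store still waiting in a store buffer. A minimal sketch, assuming a hypothetical helper `load_may_bypass` and made-up addresses for C and D (the slide gives neither):

```python
def load_may_bypass(load_addr, pending_store_addrs):
    """A load may go ahead of earlier, still-pending stores only if
    none of them targets the same memory address."""
    return load_addr not in pending_store_addrs


# Hypothetical addresses for the memory operands C and D.
C, D = 0x100, 0x108

# Instruction 203 (load from D) versus the pending store of instruction 202 (to C).
can_bypass = load_may_bypass(D, [C])

# If the load read C instead, it would have to wait for (or be forwarded from) the store.
must_wait = not load_may_bypass(C, [C])
```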

Tomasulo: “Loop Unrolling in Hardware”

• 360/91 supported a limited kind of speculation
  • Small loops could be held in a loop buffer
  • Loop-closing branches were predicted as taken
• This has the effect of loop unrolling at run-time
• Given the small number of FP registers in the machine, software loop unrolling was not a viable option

Tomasulo Loop Example

Loop: L.D     F0, 0(R1)
      MULT.D  F4, F0, F2
      S.D     F4, 0(R1)
      SUBI    R1, R1, #8
      BNEZ    R1, Loop

• Multiply takes 4 clocks
• Loads have cache misses
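
How issuing successive iterations renames F0 and F4 can be sketched by tracking only the register result status. This is a simplified illustration: `issue_iterations` is a hypothetical helper, timing and execution are ignored, and station names are assumed to be allocated in order (Load1, Mult1, Store1, then Load2, Mult2, Store2), which is what the cycle tables that follow show for the first two iterations.

```python
def issue_iterations(n_iters, r1=80, step=8):
    """Issue the loop body n_iters times and record who produces each value.

    Only two multiplier stations exist in the example, so n_iters <= 2 is
    assumed; a third MULT.D would have to wait for a free station.
    """
    status = {}    # register result status: register -> producing station
    log = []
    for i in range(1, n_iters + 1):
        load, mult, store = f"Load{i}", f"Mult{i}", f"Store{i}"
        addr = r1
        status["F0"] = load          # L.D    F0, 0(R1)
        mult_src = status["F0"]      # MULT.D reads F0's current producer
        status["F4"] = mult          # MULT.D F4, F0, F2
        store_src = status["F4"]     # S.D    F4, 0(R1) waits on this station
        log.append((addr, load, (mult, mult_src), (store, store_src)))
        r1 -= step                   # SUBI   R1, R1, #8
    return log


log = issue_iterations(2)
```

Each iteration's multiply and store point at that iteration's own load and multiplier stations, so the two iterations do not conflict even though both name F0 and F4.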

Loop Example Cycle 0

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1
MULTD F4     F0  F2  1
SD    F4     0   R1  1
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   No
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
0      80

Loop Example Cycle 1

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1
SD    F4     0   R1  1
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
1      80  Load1

Loop Example Cycle 2

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
2      80  Load1      Mult1

Loop Example Cycle 3

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80       Mult1
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
3      80  Load1      Mult1

Loop Example Cycle 4

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80       Mult1
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
4      72  Load1      Mult1

Loop Example Cycle 5

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80       Mult1
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
5      72  Load1      Mult1

Loop Example Cycle 6

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6
MULTD F4     F0  F2  2
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80       Mult1
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  No

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
6      72  Load2      Mult1

Loop Example Cycle 7

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80       Mult1
Store2  No
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  Yes   MULTD           R(F2)    Load2

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
7      72  Load2      Mult2

Loop Example Cycle 8

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  Yes   MULTD           R(F2)    Load2

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
8      72  Load2      Mult2

Loop Example Cycle 9

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1      9
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  Yes   MULTD           R(F2)    Load1
0     Mult2  Yes   MULTD           R(F2)    Load2

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
9      64  Load2      Mult2

Loop Example Cycle 10

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6      10
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   No
Load2   Yes   72
Load3   No
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
4     Mult1  Yes   MULTD  M(80)    R(F2)
0     Mult2  Yes   MULTD           R(F2)    Load2

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
10     64  Load2      Mult2

Loop Example Cycle 11

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
3     Mult1  Yes   MULTD  M(80)    R(F2)
4     Mult2  Yes   MULTD  M(72)    R(F2)

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
11     64             Mult2

Loop Example Cycle 12

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
2     Mult1  Yes   MULTD  M(80)    R(F2)
3     Mult2  Yes   MULTD  M(72)    R(F2)

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
12     64             Mult2

Loop Example Cycle 13

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2
SD    F4     0   R1  1          3
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7
SD    F4     0   R1  2          8

Load and store buffers
Buffer  Busy  Address  Qi
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
1     Mult1  Yes   MULTD  M(80)    R(F2)
2     Mult2  Yes   MULTD  M(72)    R(F2)

Register result status (Qi)
Clock  R1  F0     F2  F4     F6  F8  F10  F12 ... F30
13     64             Mult2


Loop Example Cycle 14

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  -
SD    F4     0   R1  1          3      -                   -
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      -                   -
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       Mult1
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  M(80)    R(F2)    -              -
1     Mult2  Yes   MULTD  M(72)    R(F2)    -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
14     64  F4: Mult2


Loop Example Cycle 15

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      -                   -
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  -
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       Mult2
Store3  No    -        -

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  No    -      -        -        -              -
0     Mult2  Yes   MULTD  M(72)    R(F2)    -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
15     64  F4: Mult2
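
The change from cycle 14 to cycle 15 (Mult1 writes its result, and Store1 swaps the tag Mult1 for the value M(80)*R(F2)) can be sketched as a result broadcast. This is a minimal Python illustration, not the 360/91 hardware: the names StoreBuffer and broadcast are hypothetical, and the F4 entry in the register status is assumed from MULTD's destination register.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoreBuffer:
    name: str
    busy: bool = False
    address: Optional[int] = None
    qi: Optional[str] = None     # tag of the station still computing the data
    value: Optional[str] = None  # data to store, once it has been captured


def broadcast(tag: str, value: str,
              store_buffers: List[StoreBuffer],
              register_status: Dict[str, str]) -> None:
    """Deliver (tag, value) to every buffer and register waiting on tag."""
    for buf in store_buffers:
        if buf.busy and buf.qi == tag:
            buf.value = value  # capture the operand
            buf.qi = None      # no longer waiting on a station
    # A register whose pending producer is `tag` would receive the value too,
    # so its status entry is cleared.
    waiting = [reg for reg, producer in register_status.items() if producer == tag]
    for reg in waiting:
        del register_status[reg]


# State at the end of cycle 14 (store buffers and register status only).
stores = [StoreBuffer("Store1", True, 80, qi="Mult1"),
          StoreBuffer("Store2", True, 72, qi="Mult2"),
          StoreBuffer("Store3")]
reg_status = {"F4": "Mult2"}  # assumed: F4 is renamed to the newer MULTD

# Cycle 15: Mult1 writes its result.
broadcast("Mult1", "M(80)*R(F2)", stores, reg_status)
```

Store1 now holds the product while Store2 still waits on Mult2. F4 is untouched because it was already renamed to Mult2 by the second iteration, so the older result goes only to the store that asked for it.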


Loop Example Cycle 16

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      -                   -
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  No    -        -

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
16     64  F4: Mult1
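
The new Mult1 entry in cycle 16 (Vk = R(F2), Qj = Load3, and the register status now pointing at Mult1) is what the issue step produces. The sketch below is a simplified Python model under stated assumptions: the function issue and its dictionary layout are hypothetical, and the starting entry F0: Load3 is inferred from Qj = Load3 in the table rather than read from the register status row.

```python
from typing import Dict, Optional


def issue(station: str, op: str, dest: str, src_j: str, src_k: str,
          reg_status: Dict[str, str],
          stations: Dict[str, Dict[str, Optional[str]]]) -> Dict[str, Optional[str]]:
    """Fill a free reservation station and rename the destination register."""
    entry: Dict[str, Optional[str]] = {
        "op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None,
    }
    for slot, src in (("j", src_j), ("k", src_k)):
        producer = reg_status.get(src)
        if producer is not None:
            entry["Q" + slot] = producer       # operand pending: remember the tag
        else:
            entry["V" + slot] = f"R({src})"    # operand ready: copy the value
    stations[station] = entry
    reg_status[dest] = station                 # later readers of dest wait on this station
    return entry


reg_status = {"F0": "Load3"}  # assumed: the third LD has not yet delivered F0
stations: Dict[str, Dict[str, Optional[str]]] = {}

# Third iteration: MULTD F4 F0 F2 issues to the freed Mult1 station.
mult1 = issue("Mult1", "MULTD", "F4", "F0", "F2", reg_status, stations)
```

Because the destination is renamed at issue, a later SD F4 picks up the tag Mult1 instead of a stale register value. That is the mechanism by which the iterations overlap without WAR or WAW stalls.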


Loop Example Cycle 17

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      -                   -
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
17     64  F4: Mult1


Loop Example Cycle 18

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      18                  -
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  Yes   80       M(80)*R(F2)
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
18     56  F4: Mult1


Loop Example Cycle 19

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      18                  19
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      -                   -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
19     56  F4: Mult1


Loop Example Cycle 20

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      18                  19
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      20                  -

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  Yes   72       M(72)*R(F2)
Store3  Yes   64       Mult1

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
20     56  F4: Mult1


Loop Example Cycle 21

Instruction status
Instruction  j   k   Iteration  Issue  Execution complete  Write Result
LD    F0     0   R1  1          1      9                   10
MULTD F4     F0  F2  1          2      14                  15
SD    F4     0   R1  1          3      18                  19
LD    F0     0   R1  2          6      10                  11
MULTD F4     F0  F2  2          7      15                  16
SD    F4     0   R1  2          8      20                  21

Load/store buffers
Name    Busy  Address  Qi
Load1   No    -        -
Load2   No    -        -
Load3   Yes   64       -
Store1  No    -        -
Store2  No    -        -
Store3  Yes   64       Mult1

Reservation stations
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No    -      -        -        -              -
0     Add2   No    -      -        -        -              -
0     Add3   No    -      -        -        -              -
0     Mult1  Yes   MULTD  -        R(F2)    Load3          -
0     Mult2  No    -      -        -        -              -

Code:
LD    F0 0 R1
MULTD F4 F0 F2
SD    F4 0 R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status (registers F0, F2, F4, F6, F8, F10, F12, ..., F30)
Clock  R1  Qi
21     56  F4: Mult1


Summary of Tomasulo’s Algorithm

Prevents registers from becoming a bottleneck
Avoids the WAR and WAW hazards of the scoreboard
Allows loop unrolling in hardware
Not limited to basic blocks (provided we have branch prediction)
Lasting contributions
    Dynamic scheduling
    Register renaming
    Load/store disambiguation
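
As a rough illustration of load/store disambiguation, the check below lets a load go ahead only when no pending store targets the same address. It is a simplified Python sketch: the function name load_may_proceed is hypothetical, and exact matching on already-computed effective addresses is an assumption made for the example. The addresses are the ones in the loop example (stores to 80 and 72, load from 64).

```python
from typing import Iterable


def load_may_proceed(load_address: int,
                     pending_store_addresses: Iterable[int]) -> bool:
    """Return True when no earlier, still-pending store writes the load's address."""
    return all(load_address != store_address
               for store_address in pending_store_addresses)


pending_stores = [80, 72]  # Store1 and Store2 in the loop example

third_load_ok = load_may_proceed(64, pending_stores)   # different address
conflicting_ok = load_may_proceed(72, pending_stores)  # same address as Store2
```

In the loop example the third load (address 64) does not conflict with the stores still waiting on the multipliers, so it may run ahead of them. A load from 72 would have to wait for Store2 or take its data from it.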

Next: Dynamic branch prediction