Tomasulo’s Algorithm

Page 1: Tomasulo’s Algorithm

CIS 662 – Computer Architecture – Fall 2004 - Class 12 – 10/14/04


Tomasulo’s Algorithm

An instruction goes through only three stages:

Issue – get the next instruction from the FIFO instruction queue. If a reservation station is free, transfer the instruction there along with its operand values, or with the names (tags) of the reservation stations that will produce those values. If no reservation station is free, stall on a structural hazard.

Execute – start execution when all operands are available. Loads need only the effective address; stores also need the data to be stored. No instruction can start executing before all prior branches have been evaluated.

Write result – write the result on the CDB (common data bus), and from there into the registers and any waiting reservation stations, or to memory.

Page 2: Tomasulo’s Algorithm


Tomasulo’s Algorithm

Each reservation station has seven fields:

Op – operation to perform
Qj, Qk – tags of the reservation stations that will produce the operands (0 indicates the operand is ready)
Vj, Vk – operand values
A – immediate field, and later the effective address, of a load/store instruction
Busy – this reservation station and its functional unit are occupied

The register file has one field per register:

Qi – tag of the reservation station computing the result
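The fields above can be sketched in a few lines of Python. This is only an illustration, not the hardware design from the lecture: the names ReservationStation, read_operand, issue and write_result are hypothetical, and None (rather than 0) marks an operand that is ready.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of producing stations (None = ready)
    qk: Optional[str] = None
    a: Optional[int] = None     # immediate, later the effective address


def read_operand(reg, regs, qi):
    """Return (value, tag): a ready value, or the tag of the producing station."""
    tag = qi.get(reg)
    return (regs[reg], None) if tag is None else (None, tag)


def issue(rs, op, dest, src_j, src_k, regs, qi):
    """Issue step: capture operands or tags, then rename the destination."""
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_operand(src_j, regs, qi)
    rs.vk, rs.qk = read_operand(src_k, regs, qi)
    qi[dest] = rs.name


def write_result(tag, value, stations, regs, qi):
    """Write-result step: broadcast (tag, value) on the CDB."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False  # the producing station is freed
    for reg, producer in list(qi.items()):
        if producer == tag:
            regs[reg], qi[reg] = value, None
```

Issuing MUL.D F0, F2, F4 while F2 is still pending from Load2 leaves Qj = Load2 and Vk = Regs[F4] in the station; a later write_result("Load2", value, ...) fills Vj and clears Qj.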

Page 3: Tomasulo’s Algorithm


Time = 1

First load is issued.

Instruction status (columns: Issue, Execute, Write result) tracks this sequence:

  L.D   F6, 34(R2)
  L.D   F2, 45(R3)
  MUL.D F0, F2, F4
  SUB.D F8, F2, F6
  DIV.D F10, F0, F6
  ADD.D F6, F8, F2

Reservation stations (Load1, Load2, Add1, Add2, Add3, Mult1, Mult2; stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load1  yes   Load  Regs[R2]                                          34

Register result status (Qi): F6 = Load1

Page 4: Tomasulo’s Algorithm


Time = 2

First load calculates address. Second load is issued.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load1  yes   Load                                                    Regs[R2]+34
Load2  yes   Load  Regs[R3]                                          45

Register result status (Qi): F2 = Load2, F6 = Load1

Page 5: Tomasulo’s Algorithm


Time = 3

Mult is issued. First load reads from memory. Second load calculates address.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load1  yes   Load                                                    Regs[R2]+34
Load2  yes   Load                                                    Regs[R3]+45
Mult1  yes   Mult                    Regs[F4]          Load2

Register result status (Qi): F0 = Mult1, F2 = Load2, F6 = Load1

Page 6: Tomasulo’s Algorithm


Time = 4

Sub is issued. First load writes result. Second load reads from memory. Mult is stalled.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load1  yes   Load                                                    Regs[R2]+34
Load2  yes   Load                                                    Regs[R3]+45
Add1   yes   Sub                     Mem[34+Regs[R2]]  Load2
Mult1  yes   Mult                    Regs[F4]          Load2

Load1 writes Mem[34+Regs[R2]] on the CDB and is then freed; F6 no longer waits on Load1.

Register result status (Qi): F0 = Mult1, F2 = Load2, F8 = Add1

Page 7: Tomasulo’s Algorithm


Time = 5

Div is issued. Second load writes result. Mult is stalled. Sub is stalled.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load2  yes   Load                                                    Regs[R3]+45
Add1   yes   Sub   Mem[45+Regs[R3]]  Mem[34+Regs[R2]]
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Load2 writes Mem[45+Regs[R3]] on the CDB and is then freed; the Load2 tags held by Mult1, Add1 and F2 are replaced by that value.

Register result status (Qi): F0 = Mult1, F8 = Add1, F10 = Mult2

Page 8: Tomasulo’s Algorithm


Time = 6

Add is issued. Sub executes (cycle 1 of 2). Mult executes (cycle 1 of 10). Div is stalled.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add1   yes   Sub   Mem[45+Regs[R3]]  Mem[34+Regs[R2]]
Add2   yes   Add                     Mem[45+Regs[R3]]  Add1
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Page 9: Tomasulo’s Algorithm


Time = 7

Sub executes (cycle 2 of 2). Mult executes (cycle 2 of 10). Add is stalled. Div is stalled.

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add1   yes   Sub   Mem[45+Regs[R3]]  Mem[34+Regs[R2]]
Add2   yes   Add                     Mem[45+Regs[R3]]  Add1
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Page 10: Tomasulo’s Algorithm


Time = 8

Sub writes result. Add is stalled. Div is stalled. Mult executes (cycle 3 of 10).

X = Mem[45+Regs[R3]] - Mem[34+Regs[R2]] (SUB.D F8, F2, F6 computes F2 - F6)

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add1   yes   Sub   Mem[45+Regs[R3]]  Mem[34+Regs[R2]]
Add2   yes   Add   X                 Mem[45+Regs[R3]]
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Add1 writes X on the CDB and is then freed; the Add1 tags held by Add2 and F8 are replaced by X.

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Page 11: Tomasulo’s Algorithm


Time = 9

Add executes (cycle 1 of 2). Div is stalled. Mult executes (cycle 4 of 10).

Reservation stations (stations not listed are free; X is the result written by Sub):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add2   yes   Add   X                 Mem[45+Regs[R3]]
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Page 12: Tomasulo’s Algorithm


Time = 10

Add executes (cycle 2 of 2). Div is stalled. Mult executes (cycle 5 of 10).

Reservation stations (stations not listed are free; X is the result written by Sub):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add2   yes   Add   X                 Mem[45+Regs[R3]]
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Page 13: Tomasulo’s Algorithm


Time = 11

Add writes result. Div is stalled. Mult executes (cycle 6 of 10).

Reservation stations (stations not listed are free; X is the result written by Sub):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Add2   yes   Add   X                 Mem[45+Regs[R3]]
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div                     Mem[34+Regs[R2]]  Mult1

Add2 writes its result on the CDB and is then freed; F6 no longer waits on Add2.

Register result status (Qi): F0 = Mult1, F10 = Mult2

Page 14: Tomasulo’s Algorithm


Time = 16

Mult writes result. Div is stalled.

Y = Mem[45+Regs[R3]] * Regs[F4]

Reservation stations (stations not listed are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Mult1  yes   Mult  Mem[45+Regs[R3]]  Regs[F4]
Mult2  yes   Div   Y                 Mem[34+Regs[R2]]

Mult1 writes Y on the CDB and is then freed; the Mult1 tags held by Mult2 and F0 are replaced by Y.

Register result status (Qi): F10 = Mult2

Page 15: Tomasulo’s Algorithm


Time = 17

Div executes (cycle 1 of 40).

Reservation stations (stations not listed are free; Y is the result written by Mult):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Mult2  yes   Div   Y                 Mem[34+Regs[R2]]

Register result status (Qi): F10 = Mult2

Page 16: Tomasulo’s Algorithm


Time = 57

Div writes result.

Reservation stations (stations not listed are free; Y is the result written by Mult):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Mult2  yes   Div   Y                 Mem[34+Regs[R2]]

Mult2 writes its result on the CDB and is then freed; F10 no longer waits on Mult2.
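The write-result times in this walkthrough (cycles 4, 5, 16, 8, 57 and 11 in program order) can be reproduced with a small timing model. This is a sketch under stated assumptions, not a full simulator: one instruction issues per cycle, there are no structural hazards or CDB conflicts, a load takes 2 cycles (address calculation plus memory read), an instruction starts executing the cycle after its last operand appears on the CDB, and it writes its result the cycle after it finishes. The names LATENCY, PROGRAM and write_result_times are illustrative.

```python
# Execution latencies in cycles: add/sub 2, mult 10, div 40 as in the walkthrough;
# load = 2 is an assumption (address calculation + memory read).
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

# (operation, destination register, FP source registers) in program order.
PROGRAM = [
    ("L.D", "F6", []),
    ("L.D", "F2", []),
    ("MUL.D", "F0", ["F2", "F4"]),
    ("SUB.D", "F8", ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6", ["F8", "F2"]),
]


def write_result_times(program, latency):
    """Return the cycle in which each instruction writes its result."""
    ready_at = {}  # register -> cycle its newest pending value reaches the CDB
    times = []
    for issue_cycle, (op, dest, sources) in enumerate(program, start=1):
        # Operands are captured at issue, so each source refers to the
        # producer that was current when this instruction issued (renaming).
        operands_ready = max([issue_cycle] + [ready_at.get(r, 0) for r in sources])
        write_cycle = operands_ready + latency[op] + 1
        ready_at[dest] = write_cycle
        times.append(write_cycle)
    return times


print(write_result_times(PROGRAM, LATENCY))
```

Because DIV.D records its F6 producer when it issues, the later ADD.D that overwrites F6 at cycle 11 does not disturb it; this is the WAR hazard that renaming removes.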

Page 17: Tomasulo’s Algorithm


Tomasulo’s Alg. and Loop Unrolling

Consider a loop:

LOOP: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, LOOP

We will assume that the branch is always predicted as taken, and issue instructions from two loop iterations.
Assume none of the load/store or FP operations have completed.

Page 18: Tomasulo’s Algorithm


Instructions issued from the two iterations (status columns: Issue, Execute, Write result):

  L.D   F0, 0(R1)
  MUL.D F4, F0, F2
  S.D   F4, 0(R1)
  L.D   F0, -8(R1)
  MUL.D F4, F0, F2
  S.D   F4, -8(R1)

Reservation stations (Add1, Add2 and Add3 are free):

Name   Busy  Op    Vj                Vk                Qj     Qk     A
Load1  yes   Load                                                    Regs[R1]+0
Load2  yes   Load                                                    Regs[R1]-8
Mult1  yes   Mult                    Regs[F2]          Load1
Mult2  yes   Mult                    Regs[F2]          Load2
Store1 yes   Store                                            Mult1  Regs[R1]+0
Store2 yes   Store                                            Mult2  Regs[R1]-8

Register result status (Qi): F0 = Load2, F4 = Mult2

Page 19: Tomasulo’s Algorithm


Dynamic Memory Disambiguation

The order of loads and stores that access the same address must be preserved.
Since they access memory locations, we can examine their order only after we calculate the effective address.
Effective address calculation is performed in program order.
The address of a load is checked against the A fields of all store buffers.
The address of a store is checked against the A fields of all load and store buffers.
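The two address checks can be sketched as follows. This is a minimal illustration, assuming the buffers hold only the A fields of earlier loads and stores that are still pending; the function names are hypothetical.

```python
def load_may_access_memory(load_addr, store_buffer_addrs):
    """A load must wait while an earlier pending store targets the same address."""
    return load_addr not in store_buffer_addrs


def store_may_access_memory(store_addr, load_buffer_addrs, store_buffer_addrs):
    """A store must wait while any earlier pending load or store uses its address."""
    return (store_addr not in load_buffer_addrs
            and store_addr not in store_buffer_addrs)


# A load from address 100 is held back by a pending store to 100,
# but a load from 108 may go ahead of it.
pending_stores = [100]
print(load_may_access_memory(100, pending_stores))
print(load_may_access_memory(108, pending_stores))
```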

Page 20: Tomasulo’s Algorithm


Dynamic Hardware Branch Prediction

Predict the outcome of a branch.
Change the prediction after observing a few iterations.

To achieve good effectiveness we must:
  have an accurate prediction technique
  have a low cost for misprediction

Page 21: Tomasulo’s Algorithm


Local Prediction: Branch Prediction Buffer

A table indexed by the low bits of the branch instruction address.
It contains a bit indicating whether the branch was recently taken or not.
If the prediction turns out to be wrong, the bit is inverted.

[Figure: 4 bits of the branch address select one 1-bit entry in the table.]

Page 22: Tomasulo’s Algorithm


1-bit Branch Prediction Buffer

Problem – even the simplest branches are mispredicted twice.

      LD    R1, #5
Loop: LD    R2, 0(R5)
      ADD   R2, R2, R4
      STORE R2, 0(R5)
      ADD   R5, R5, #4
      SUB   R1, R1, #1
      BNEZ  R1, Loop

First time: prediction = 0 but the branch is taken; change prediction to 1 (miss).
Time 2, 3, 4: prediction = 1 and the branch is taken.
Time 5: prediction = 1 but the branch is not taken; change prediction to 0 (miss).
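The trace above can be replayed with a short sketch. It models a single 1-bit entry for the BNEZ branch, starting at 0 (not taken); the function name is illustrative.

```python
def one_bit_mispredictions(outcomes, prediction=False):
    """Replay branch outcomes (True = taken) through a 1-bit predictor.

    Returns (number of misses, final prediction bit).
    """
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken  # a wrong prediction inverts the bit
    return misses, prediction


# The loop runs 5 times: BNEZ is taken 4 times, then falls through.
loop_outcomes = [True, True, True, True, False]
print(one_bit_mispredictions(loop_outcomes))  # (2, False)
```

The bit ends at 0 again, so every later run of the same loop repeats both misses.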

Page 23: Tomasulo’s Algorithm


2-bit Branch Prediction Buffer

To amend this we use 2 bits: we must miss twice before we change our prediction.

State  Prediction
11     Predict taken
10     Predict taken
01     Predict not taken
00     Predict not taken

[Figure: state diagram over these four states; the arrows between states are labelled Taken or Not taken.]

Page 24: Tomasulo’s Algorithm


2-bit Branch Prediction Buffer

First time we encounter this loop:

      LD    R1, #5
Loop: LD    R2, 0(R5)
      ADD   R2, R2, R4
      STORE R2, 0(R5)
      ADD   R5, R5, #4
      SUB   R1, R1, #1
      BNEZ  R1, Loop

First time: prediction = 00 (not taken), the branch is taken; change prediction to 01 (miss).
Time 2: prediction = 01 (not taken), the branch is taken; change prediction to 11 (miss).
Time 3, 4: prediction = 11 (taken), the branch is taken.
Time 5: prediction = 11 (taken), the branch is not taken; change prediction to 10 (miss).
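A sketch of this predictor as a lookup table follows. The transitions 00 to 01, 01 to 11 and 11 to 10 come from the trace above; the remaining entries are assumptions that follow the usual textbook diagram (a correct prediction returns to the strong state, a second miss flips the prediction). The names NEXT_STATE and run_two_bit are illustrative.

```python
# (state, branch taken?) -> next state
NEXT_STATE = {
    (0b00, True): 0b01, (0b00, False): 0b00,
    (0b01, True): 0b11, (0b01, False): 0b00,  # 01 + not taken: assumed
    (0b10, True): 0b11, (0b10, False): 0b00,  # both assumed
    (0b11, True): 0b11, (0b11, False): 0b10,
}


def run_two_bit(outcomes, state=0b00):
    """Replay outcomes (True = taken); return (misses, final state)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 0b10  # states 10 and 11 predict taken
        if predicted_taken != taken:
            misses += 1
        state = NEXT_STATE[(state, taken)]
    return misses, state


loop_outcomes = [True, True, True, True, False]
first = run_two_bit(loop_outcomes)             # 3 misses, ends in state 10
second = run_two_bit(loop_outcomes, first[1])  # 1 miss on the loop exit
print(first, second)
```

Under these assumptions, the second encounter of the loop mispredicts only the final, not-taken branch.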

Page 25: Tomasulo’s Algorithm


n-bit Branch Prediction Buffer

We can generalize this technique to n-bit prediction buffers.
When the counter is ≥ 2^(n-1), the branch is predicted as taken.
Those predictors are not much more accurate than 2-bit predictors.

State  Prediction
111    Predict taken
110    Predict taken
100    Predict taken
011    Predict not taken
001    Predict not taken
000    Predict not taken

[Figure: state diagram over these states; the arrows between states are labelled Taken or Not taken.]
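The threshold rule can be sketched with a saturating counter. The increment and decrement by one is an assumption (the common formulation of an n-bit counter), and it may not match the six-state diagram on the slide; the function names are illustrative.

```python
def predict_taken(counter, n):
    """Predict taken when the n-bit counter is at least 2^(n-1)."""
    return counter >= 2 ** (n - 1)


def update(counter, taken, n):
    """Assumed update rule: count up on taken, down on not taken, saturating."""
    if taken:
        return min(counter + 1, 2 ** n - 1)
    return max(counter - 1, 0)


# With n = 3 the counter ranges over 0..7 and predicts taken from 4 upward.
print(predict_taken(4, 3), predict_taken(3, 3))
```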

Page 26: Tomasulo’s Algorithm


Correlating (Global) Branch Predictors

Assign two prediction bits: one used if the previous branch was not taken, the other used if it was taken.

b1: if (d==0) d=1;
b2: if (d==1)

b1:  BNEZ   R1, L1
     DADDUI R1, R0, #1
L1:  DSUBUI R3, R1, #1
b2:  BNEZ   R3, L2
     …….
L2:

If b1 is not taken, b2 is not taken.
If b1 is taken, b2 is taken (unless d == 1).

Predictor entry 0/0:
  first bit – what to do if the previous branch was not taken
  second bit – what to do if the previous branch was taken

Page 27: Tomasulo’s Algorithm


Correlating Branch Predictors

Assign two prediction bits: one used if the previous branch was not taken, the other used if it was taken.

b1:  BNEZ   R1, L1
     DADDUI R1, R0, #1
L1:  DSUBUI R3, R1, #1
b2:  BNEZ   R3, L2
     …….
L2:

R1=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2     NT/NT          T (miss)   T/NT               NT/NT          T (miss)   NT/T
0     T/NT           NT         T/NT               NT/T           NT         NT/T
2     T/NT           T          T/NT               NT/T           T          NT/T
0     T/NT           NT         T/NT               NT/T           NT         NT/T

This is a (1,1) predictor: it uses the outcome of 1 previous branch to select the prediction of a 1-bit predictor.
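The (1,1) scheme can be replayed on the sequence R1 = 2, 0, 2, 0 with a short sketch. It assumes both entries start as NT/NT and that the branch preceding the first b1 was not taken; the name simulate and the dictionary layout are illustrative.

```python
def simulate(d_values):
    """Replay b1 and b2 through a (1,1) correlating predictor.

    prediction[branch][last_taken] is the 1-bit prediction (True = taken)
    used when the previously executed branch had outcome last_taken.
    Returns (misses, final prediction table).
    """
    prediction = {"b1": {False: False, True: False},  # NT/NT
                  "b2": {False: False, True: False}}  # NT/NT
    last_taken = False  # assumed initial history
    misses = 0
    for d in d_values:
        b1_taken = d != 0      # b1: BNEZ R1, L1
        if d == 0:
            d = 1              # fall-through sets d = 1
        b2_taken = d != 1      # b2: BNEZ R3, L2 with R3 = d - 1
        for branch, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if prediction[branch][last_taken] != taken:
                misses += 1
                prediction[branch][last_taken] = taken
            last_taken = taken
    return misses, prediction


misses, table = simulate([2, 0, 2, 0])
print(misses)  # 2
```

Only the first b1 and the first b2 miss; the table then settles at T/NT for b1 and NT/T for b2.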

Page 28: Tomasulo’s Algorithm


Correlating Branch Predictors (m,n)

Observe the behavior of the m previous branches and use an n-bit predictor.

(1,1)  0/0
       first bit – what to do if the one previous branch was not taken
       second bit – what to do if the one previous branch was taken

(m,1)  0/0/0/…/0
       first bit – what to do if the m previous branches were not taken
       last bit – what to do if the m previous branches were taken

(m,n)  0111/0011/0001/…/1110
       first n bits – what to do if the m previous branches were not taken
       last n bits – what to do if the m previous branches were taken

Page 29: Tomasulo’s Algorithm


Correlating Branch Predictors (m,n)

2^m combinations, n bits each.

[Figure: 4 bits of the branch address select a table row; the m bits recording the outcome of the m previous branches select one of the n-bit entries in that row.]

Page 30: Tomasulo’s Algorithm


Correlating Branch Predictors (m,n)

How many bits do we need for an (m,n) predictor?

There are 2^m combinations of n bits each. Suppose we use the last t bits of the branch address to select the prediction:

2^m * n * 2^t
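As a quick check of the formula, the sketch below computes the storage for a few configurations and shows one possible way to form a table index from the address bits and the history bits. The function names and the index layout (row bits followed by history bits) are assumptions for illustration.

```python
def predictor_bits(m, n, t):
    """Total storage of an (m,n) predictor indexed by t address bits."""
    return (2 ** m) * n * (2 ** t)


def entry_index(branch_address, history, m, t):
    """Pick an entry: low t address bits choose the row, m history bits the column."""
    row = branch_address & ((1 << t) - 1)
    column = history & ((1 << m) - 1)
    return (row << m) | column


print(predictor_bits(2, 2, 4))   # 4 * 2 * 16 = 128 bits
print(predictor_bits(0, 2, 12))  # plain 2-bit buffer with 4096 entries: 8192 bits
```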

Page 31: Tomasulo’s Algorithm


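Tournament Predictors

Combine one global and one local predictor with a selector.

[Figure: selector state diagram with four states, two labelled "Use predictor 1" and two labelled "Use predictor 2". Arrows carry labels of the form a/b; for example, 1/0 means the first predictor was right and the second was wrong. Labels in the figure: 1/1, 0/0, 1/0; 0/0, 1/1; 1/0; 0/1; 1/0; 0/1; 0/1; 1/0; 0/0, 1/1; 1/1, 0/0, 0/1.]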
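One way to realize such a selector is a 2-bit saturating counter, sketched below. This is an assumption, not necessarily the exact diagram from the slide: values 0 and 1 mean "use predictor 1", values 2 and 3 mean "use predictor 2", and the counter moves only when exactly one predictor was right. The function names are illustrative.

```python
def update_selector(counter, p1_correct, p2_correct):
    """Move toward the predictor that was right when the other was wrong."""
    if p1_correct and not p2_correct:   # label 1/0
        return max(counter - 1, 0)
    if p2_correct and not p1_correct:   # label 0/1
        return min(counter + 1, 3)
    return counter                      # labels 0/0 and 1/1: no change


def choose(counter, p1_prediction, p2_prediction):
    """Return the prediction of the currently selected predictor."""
    return p1_prediction if counter <= 1 else p2_prediction


# Two 0/1 outcomes in a row switch the selector from predictor 1 to predictor 2.
counter = 0
for _ in range(2):
    counter = update_selector(counter, p1_correct=False, p2_correct=True)
print(counter, choose(counter, True, False))
```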