ECE4750/CS4420 Computer Architecture

L9: Tomasulo’s Algorithm

Edward Suh

Computer Systems Laboratory

[email protected]

2

Announcements

Lab2 grade

• Will be out on Friday

• Re-grade request within a week

HW2 due Sunday

Class schedule next week

• Tuesday: no class (fall break)

• Thursday: no class, evening prelim 7:30-9:30

ECE4750/CS4420 — Computer Architecture, Fall 2008

3

Overview

Scoreboard review

Limitations of Scoreboard

• WAR & WAW hazards

Tomasulo’s algorithm

Example

Reading: Chapter 2.4 & 2.5

4

Scoreboard Review

Step 1: issue

• if FU available (structural), and

• if no earlier instruction writes to same destination (WAW), then

• send instruction to FU

Step 2: read operands (a.k.a. dispatch)

• if no operand pending update (RAW), then

• instruct FU to read operands and start execution

Step 3: execution

• inform scoreboard at completion

Step 4: write result (a.k.a. retire)

• if WAR hazard possible, stall at WB stage

5

Parts of Scoreboard

Instruction Status – which of the four steps each instruction is in

FU Status – is FU available?

• Busy – FU is busy

• Op – operation to be performed

• Fi, Fj, Fk – Destination and source registers

• Qj, Qk – FUs producing Fj, Fk

• Rj, Rk – Operand-ready flags; reset after operands are read

Register Status – is a register (Reg) up-to-date?

• Result[Reg] – which FU will write Reg

6

Scoreboard Details

Issue

• Wait till no structural (not Busy[FU]) and WAW (not Result[D]) hazard

• Busy[FU] = yes; Op[FU] = op; Fi[FU] = D; Fj[FU] = S1; Fk[FU] = S2; Qj = Result[S1]; Qk = Result[S2]; Rj = not Qj; Rk = not Qk; Result[D] = FU

Read operands

• Wait till no RAW hazard: Rj and Rk

• Rj = No; Rk = No; Qj = 0; Qk = 0

Execution

Write result

• Wait till no WAR hazard: for all other FUs, sources (Fj, Fk) that are still to be read from the register file (Rj, Rk == yes) do not match the register to overwrite (Fi[FU])

• Rj[f], Rk[f] = yes if Qj[f], Qk[f] == FU; Result[Fi[FU]] = 0; Busy[FU] = No;
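
The bookkeeping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CDC 6600 logic itself: the class and method names (`FUStatus`, `Scoreboard`, `can_issue`, and so on) are made up for this sketch, execution latency is not modelled, and the caller decides when each step happens. The field names follow the slide (Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk, Result).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FU producing Fj (None: no pending producer)
    qk: Optional[str] = None
    rj: bool = False           # operand ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names):
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in fu_names}
        self.result: Dict[str, str] = {}   # Result[Reg]: FU that will write Reg

    def can_issue(self, fu, dest):
        # No structural hazard (FU free) and no WAW hazard (no pending writer).
        return not self.fu[fu].busy and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        if not self.can_issue(fu, dest):
            return False
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(True, op, dest, s1, s2, qj, qk,
                               qj is None, qk is None)
        self.result[dest] = fu
        return True

    def can_read(self, fu):
        # No RAW hazard: both operands are ready.
        return self.fu[fu].rj and self.fu[fu].rk

    def read_operands(self, fu):
        u = self.fu[fu]
        u.rj = u.rk = False
        u.qj = u.qk = None

    def can_write(self, fu):
        # WAR check: no other busy FU still has to read the register we overwrite.
        fi = self.fu[fu].fi
        return not any(
            (u.fj == fi and u.rj) or (u.fk == fi and u.rk)
            for name, u in self.fu.items() if name != fu and u.busy
        )

    def write_result(self, fu):
        for u in self.fu.values():
            if u.qj == fu:
                u.rj = True
            if u.qk == fu:
                u.rk = True
        self.result.pop(self.fu[fu].fi, None)
        self.fu[fu] = FUStatus()


sb = Scoreboard(["Int", "Mul1", "Mul2", "Add", "Div"])
sb.issue("Mul1", "fmul", "f0", "f2", "f4")
sb.issue("Div", "fdiv", "f10", "f0", "f6")    # waits for f0 from Mul1
sb.issue("Add", "fadd", "f6", "f8", "f2")
sb.read_operands("Mul1")
sb.read_operands("Add")
war_stall = not sb.can_write("Add")           # Div has not read f6 yet
waw_stall = not sb.can_issue("Mul2", "f6")    # Add is a pending writer of f6
sb.write_result("Mul1")                       # f0 becomes available to Div
sb.read_operands("Div")
war_cleared = sb.can_write("Add")
```

In this run the fadd cannot write f6 until the fdiv has read it (WAR), and a second writer of f6 cannot even issue (WAW). These are the two stalls that the rest of the lecture removes by renaming.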

7

Scoreboard: Example

Instruction Status
Instruction     | I  | R  | E  | W
fld1 f6,34($2)  |    |    |    |
fld2 f2,45($3)  |    |    |    |
fmul f0,f2,f4   |    |    |    |
fsub f8,f6,f2   |    |    |    |
fdiv f10,f0,f6  |    |    |    |
fadd f6,f8,f2   |    |    |    |

FU Status
FU   | Busy | Op | Fi | Fj | Fk | Qj | Qk | Rj | Rk
Int  |      |    |    |    |    |    |    |    |
Mul1 |      |    |    |    |    |    |    |    |
Mul2 |      |    |    |    |    |    |    |    |
Add  |      |    |    |    |    |    |    |    |
Div  |      |    |    |    |    |    |    |    |

Register Result Status
   | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30
FU |    |    |    |    |    |     |     |   |

Clock: 0

Latencies: fadd – 2 cycles, fmul – 10 cycles, fdiv – 40 cycles, fld – 1 cycle (cache hit)

8

Limitations: an Example

#  | Instruction        | Latency
1  | LD    F2, 34(R2)   | 1
2  | LD    F4, 45(R3)   | long
3  | MULTD F6, F4, F2   | 3
4  | SUBD  F8, F2, F2   | 1
5  | DIVD  F4, F2, F8   | 4
6  | ADDD  F10, F6, F4  | 1

In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6

[Figure: dependence graph with nodes 1-6]

Out-of-order: 1 (2,1)

9

Limitations in ISA

Which features of an ISA limit the number of instructions in the pipeline?

10

Instruction-level Parallelism via Renaming

#  | Instruction        | Latency
1  | LD    F2, 34(R2)   | 1
2  | LD    F4, 45(R3)   | long
3  | MULTD F6, F4, F2   | 3
4  | SUBD  F8, F2, F2   | 1
5  | DIVD  F4’, F2, F8  | 4
6  | ADDD  F10, F6, F4’ | 1

In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6

Out-of-order: 1 (2,1)

[Figure: dependence graph with nodes 1-6; one edge is marked X]

Any name dependence can be eliminated by renaming (renaming requires additional storage).
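
As a rough illustration of that claim, the sketch below lists the WAR and WAW pairs in the six-instruction sequence and then renames every destination to a fresh name. The helper names `name_hazards` and `rename` and the physical names `P1`..`P6` are hypothetical. The slide renames only F4 to F4’, whereas this sketch renames every destination for simplicity. The address registers R2 and R3 are left out because nothing writes them.

```python
def name_hazards(prog):
    """Return (kind, earlier, later) for every WAR/WAW pair; numbering starts at 1."""
    found = []
    for j, (dst_j, _) in enumerate(prog):
        for i in range(j):
            dst_i, srcs_i = prog[i]
            if dst_j in srcs_i:
                found.append(("WAR", i + 1, j + 1))
            if dst_j == dst_i:
                found.append(("WAW", i + 1, j + 1))
    return found


def rename(prog):
    """Give every destination a fresh name; sources follow the latest mapping."""
    latest = {}
    out = []
    for n, (dst, srcs) in enumerate(prog, start=1):
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        latest[dst] = f"P{n}"
        out.append((latest[dst], new_srcs))
    return out


# (destination, floating-point sources) for each instruction on the slide
prog = [
    ("F2", ()),              # 1 LD    F2, 34(R2)
    ("F4", ()),              # 2 LD    F4, 45(R3)
    ("F6", ("F4", "F2")),    # 3 MULTD F6, F4, F2
    ("F8", ("F2", "F2")),    # 4 SUBD  F8, F2, F2
    ("F4", ("F2", "F8")),    # 5 DIVD  F4, F2, F8
    ("F10", ("F6", "F4")),   # 6 ADDD  F10, F6, F4
]

before = name_hazards(prog)
renamed = rename(prog)
after = name_hazards(renamed)
```

Here `before` contains the WAW pair (2, 5) and the WAR pair (3, 5), both on F4, and `after` is empty. The true (RAW) dependences survive: instruction 6 still reads the result of instruction 5, now under its new name.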

11

Dynamic Scheduling by Tomasulo’s Algorithm

Developed for IBM 360/91 three years after CDC 6600

Goal was high performance without compiler help

• only four floating-point registers

• wanted portability of code

Innovations over scoreboard

• control and buffers distributed: “reservation stations”

• source operands point to reservation stations

– renaming, eliminates WAR, WAW hazards

– Common Data Bus (CDB) broadcasts results

Original IBM 360/91 used reg-mem ISA, but we’ll use MIPS ISA instead

12

Tomasulo-based FPU

13

Three Steps to Instruction Execution

Step 1: issue

• if reservation station available (structural), then

• rename operands, send instruction to reservation station

• read operands from a register file

Step 2: execution

• if operand(s) not available, monitor CDB (snoop)

• inform control logic at completion

Step 3: write result

• broadcast result via CDB

• if no WAW hazard, update register

14

Parts of Tomasulo

Instruction Status – which of the three steps each instruction is in

Reservation Station Status – is the reservation station available?

• Busy – reservation station is busy

• Op – operation to be performed

• Address – effective address (if load/store)

• Vj, Vk – Source values (not registers!)

• Qj, Qk – reservation stations producing Vj, Vk for this instruction

Register Status – which reservation station will update each register

• Qi – reservation station that will produce the most recent value of the register
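
A minimal Python sketch of these structures is given below, assuming one value per register and leaving the Address field, timing, and functional units out. The names `RS`, `Tomasulo`, `issue`, `ready` and `broadcast` are invented for this illustration, and the caller computes each result before broadcasting it. The fields Vj, Vk, Qj, Qk and Qi are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # source values (not registers)
    vk: Optional[float] = None
    qj: Optional[str] = None     # stations that will produce Vj, Vk
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, stations, regs):
        self.rs = {n: RS() for n in stations}
        self.regs = dict(regs)   # architectural register values
        self.qi = {}             # register -> station that will write it

    def _read(self, reg):
        """Return (value, tag); exactly one of the two is None."""
        if reg in self.qi:
            return None, self.qi[reg]
        return self.regs[reg], None

    def issue(self, station, op, dest, s1, s2):
        if self.rs[station].busy:
            return False                 # structural hazard: stall
        vj, qj = self._read(s1)
        vk, qk = self._read(s2)
        self.rs[station] = RS(True, op, vj, vk, qj, qk)
        self.qi[dest] = station          # renaming: later readers wait on this tag
        return True

    def ready(self, station):
        r = self.rs[station]
        return r.busy and r.qj is None and r.qk is None

    def broadcast(self, station, value):
        """Write result: every waiting station snoops (tag, value) on the CDB."""
        for r in self.rs.values():
            if r.qj == station:
                r.vj, r.qj = value, None
            if r.qk == station:
                r.vk, r.qk = value, None
        for reg, tag in list(self.qi.items()):
            if tag == station:           # only the latest writer updates the register
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[station] = RS()


t = Tomasulo(["Add1", "Mul1", "Mul2"],
             {"f0": 0.0, "f2": 2.0, "f4": 4.0, "f6": 5.0, "f8": 1.0, "f10": 0.0})
t.issue("Mul1", "fmul", "f0", "f2", "f4")
t.issue("Mul2", "fdiv", "f10", "f0", "f6")   # captures the old f6, waits on Mul1
t.issue("Add1", "fadd", "f6", "f8", "f2")
t.broadcast("Add1", 1.0 + 2.0)               # fadd finishes first and overwrites f6
old_f6_kept = t.rs["Mul2"].vk
t.broadcast("Mul1", 2.0 * 4.0)
div_ready = t.ready("Mul2")
t.broadcast("Mul2", t.rs["Mul2"].vj / t.rs["Mul2"].vk)
```

Because the fdiv copied the value of f6 into Vk at issue, the later fadd can overwrite f6 without waiting, which is the WAR case that stalls the scoreboard. The check `tag == station` in `broadcast` is what drops a register write from an older instruction when a younger one has since claimed the same register (WAW).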

15

Tomasulo: Example

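Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  |    |    |
fld2 f2,45($3)  |    |    |
fmul f0,f2,f4   |    |    |
fsub f8,f6,f2   |    |    |
fdiv f10,f0,f6  |    |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 |      |      |       |       |       |       |
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  |      |      |       |       |       |       |
     | Mul2  |      |      |       |       |       |       |

Register Result Status
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi |       |       |    |       |       |          |     |   |

Clock: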

16

Tomasulo: Example

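Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  |    |
fld2 f2,45($3)  |    |    |
fmul f0,f2,f4   |    |    |
fsub f8,f6,f2   |    |    |
fdiv f10,f0,f6  |    |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 | Y    | ld   | $2    |       |       |       | 34
     | Load2 |      |      |       |       |       |       |
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  |      |      |       |       |       |       |
     | Mul2  |      |      |       |       |       |       |

Register Result Status
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi |       |       |    | Load1 |       |          |     |   |

Clock: 1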

17

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  |    |
fld2 f2,45($3)  | 2  |    |
fmul f0,f2,f4   |    |    |
fsub f8,f6,f2   |    |    |
fdiv f10,f0,f6  |    |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
1    | Load1 | Y    | ld   | $2    |       |       |       | $2+34
     | Load2 | Y    | ld   | $3    |       |       |       | 45
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  |      |      |       |       |       |       |
     | Mul2  |      |      |       |       |       |       |

Register Result Status
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi |       | Load2 |    | Load1 |       |          |     |   |

Clock: 2

18

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  |
fld2 f2,45($3)  | 2  |    |
fmul f0,f2,f4   | 3  |    |
fsub f8,f6,f2   |    |    |
fdiv f10,f0,f6  |    |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
0    | Load1 | Y    | ld   | $2    |       |       |       | $2+34
1    | Load2 | Y    | ld   | $3    |       |       |       | $3+45
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  | Y    | fmul |       | f4    | Load2 |       |
     | Mul2  |      |      |       |       |       |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | Mul1  | Load2 |    | Load1 |       |          |     |   |

Clock: 3

19

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  |
fmul f0,f2,f4   | 3  |    |
fsub f8,f6,f2   | 4  |    |
fdiv f10,f0,f6  |    |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 | -    | -    | -     |       |       |       | -
0    | Load2 | Y    | ld   | $3    |       |       |       | $3+45
     | Add1  | Y    | fsub | M1    |       |       | Load2 |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  | Y    | fmul |       | f4    | Load2 |       |
     | Mul2  |      |      |       |       |       |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | Mul1  | Load2 |    | M1    | Add1  |          |     |   |

Clock: 4

20

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  | 5
fmul f0,f2,f4   | 3  |    |
fsub f8,f6,f2   | 4  |    |
fdiv f10,f0,f6  | 5  |    |
fadd f6,f8,f2   |    |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 | -    | -    | -     |       |       |       | -
2    | Add1  | Y    | fsub | M1    | M2    |       | -     |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
10   | Mul1  | Y    | fmul | M2    | f4    | -     |       |
     | Mul2  | Y    | fdiv |       | M1    | Mul1  |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | Mul1  | M2    |    | M1    | Add1  | Mul2     |     |   |

Clock: 5

21

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  | 5
fmul f0,f2,f4   | 3  |    |
fsub f8,f6,f2   | 4  | 7  | 8
fdiv f10,f0,f6  | 5  |    |
fadd f6,f8,f2   | 6  |    |

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 |      |      |       |       |       |       |
     | Add1  | -    | -    | -     | -     |       |       |
2    | Add2  | Y    | fadd | M1-M2 | M2    | -     |       |
     | Add3  |      |      |       |       |       |       |
7    | Mul1  | Y    | fmul | M2    | f4    |       |       |
     | Mul2  | Y    | fdiv |       | M1    | Mul1  |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | Mul1  | M2    |    | Add2  | M1-M2 | Mul2     |     |   |

Clock: 8

22

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  | 5
fmul f0,f2,f4   | 3  |    |
fsub f8,f6,f2   | 4  | 7  | 8
fdiv f10,f0,f6  | 5  |    |
fadd f6,f8,f2   | 6  | 10 | 11

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 |      |      |       |       |       |       |
     | Add1  |      |      |       |       |       |       |
     | Add2  | -    | -    | -     | -     |       |       |
     | Add3  |      |      |       |       |       |       |
4    | Mul1  | Y    | fmul | M2    | f4    |       |       |
     | Mul2  | Y    | fdiv |       | M1    | Mul1  |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | Mul1  | M2    |    | M1    | M1-M2 | Mul2     |     |   |

Clock: 11

23

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  | 5
fmul f0,f2,f4   | 3  | 15 | 16
fsub f8,f6,f2   | 4  | 7  | 8
fdiv f10,f0,f6  | 5  |    |
fadd f6,f8,f2   | 6  | 10 | 11

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 |      |      |       |       |       |       |
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  | -    | -    | -     | -     |       |       |
40   | Mul2  | Y    | fdiv | M2xf4 | M1    |       |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | M2xf4 | M2    |    | M1    | M1-M2 | Mul2     |     |   |

Clock: 16

24

Tomasulo: Example

Instr. Status
Instruction     | I  | E  | W
fld1 f6,34($2)  | 1  | 3  | 4
fld2 f2,45($3)  | 2  | 4  | 5
fmul f0,f2,f4   | 3  | 15 | 16
fsub f8,f6,f2   | 4  | 7  | 8
fdiv f10,f0,f6  | 5  | 56 | 57
fadd f6,f8,f2   | 6  | 10 | 11

Reservation Stations
Time | Name  | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1 |      |      |       |       |       |       |
     | Load2 |      |      |       |       |       |       |
     | Add1  |      |      |       |       |       |       |
     | Add2  |      |      |       |       |       |       |
     | Add3  |      |      |       |       |       |       |
     | Mul1  |      |      |       |       |       |       |
     | Mul2  | -    | -    | -     | -     |       |       |

Register Result Status (also values)
   | F0    | F2    | F4 | F6    | F8    | F10      | F12 | … | F30
Qi | M2xf4 | M2    |    | M1    | M1-M2 | M2xf4/M1 |     |   |

Clock: 57
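
The I/E/W cycles in this walkthrough can be reproduced with a small timing model. The sketch below is a simplification built on assumptions that happen to hold for this example: one instruction issues per cycle in program order, a reservation station and functional unit are always free, no two results compete for the CDB, and execution starts the cycle after the last operand is broadcast. The load latency of 2 (one cycle for the address, one for the cache hit) is inferred from the table rather than stated on a slide, and `LATENCY` and `schedule` are names made up here.

```python
# Execute-stage length in cycles; fsub is assumed to share the adder latency.
LATENCY = {"fld": 2, "fadd": 2, "fsub": 2, "fmul": 10, "fdiv": 40}


def schedule(program):
    """Return (issue, execute-complete, write-result) cycles per instruction."""
    last_write = {}   # register -> cycle its latest producer so far writes the CDB
    rows = []
    for cycle, (op, dest, srcs) in enumerate(program, start=1):
        issue = cycle
        # Sources are looked up at issue time, so a later writer of the same
        # register (for example fadd f6 after fdiv ... f6) cannot delay us.
        operands_ready = max([last_write.get(s, 0) for s in srcs], default=0)
        start = max(issue, operands_ready) + 1
        done = start + LATENCY[op] - 1
        write = done + 1
        last_write[dest] = write
        rows.append((issue, done, write))
    return rows


program = [
    ("fld", "f6", ("$2",)),
    ("fld", "f2", ("$3",)),
    ("fmul", "f0", ("f2", "f4")),
    ("fsub", "f8", ("f6", "f2")),
    ("fdiv", "f10", ("f0", "f6")),
    ("fadd", "f6", ("f8", "f2")),
]

timing = schedule(program)
for (op, dest, _), (i, e, w) in zip(program, timing):
    print(f"{op:5s}{dest:4s} I={i:<3d} E={e:<3d} W={w}")
```

Under these assumptions `timing` matches the final table: the fadd writes f6 at cycle 11 while the fdiv, which needs the older f6, is still waiting for the fmul result and only completes at cycle 57.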

25

Tomasulo: Loop Example

Renaming is a powerful tool across loops

• Scoreboard unable to process multiple iterations simultaneously

Hardware loop unrolling

• process several loop iterations simultaneously

• make it transparent to the compiler

L: fld f0,0($1)

fmul f4,f0,f2

fsd f4,0($1)

subi $1,$1,8

bne $1,$0,L
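
To see why renaming lets two iterations overlap, the sketch below issues the floating-point part of the loop body twice and records which reservation station each source waits on. It rests on assumptions: the subi and bne are handled elsewhere and the branch is taken, f2 is already in the register file, no station is freed during the two iterations, and stations are handed out in order. The function `issue_loop` and its data layout are hypothetical.

```python
def issue_loop(body, iterations, stations):
    """Issue `body` repeatedly; return (instruction, station, {source: tag waited on})."""
    free = {kind: list(names) for kind, names in stations.items()}
    qi = {}    # register -> station that will produce its newest value
    log = []
    for it in range(1, iterations + 1):
        for op, kind, dest, srcs in body:
            station = free[kind].pop(0)   # an empty list would mean a structural stall
            waits = {s: qi[s] for s in srcs if s in qi}
            if dest is not None:
                qi[dest] = station        # renaming: newest writer wins
            log.append((f"{op}{it}", station, waits))
    return log


body = [
    ("fld", "load", "f0", ()),              # fld  f0,0($1)
    ("fmul", "mul", "f4", ("f0", "f2")),    # fmul f4,f0,f2
    ("fsd", "store", None, ("f4",)),        # fsd  f4,0($1)
]
stations = {
    "load": ["Load1", "Load2"],
    "mul": ["Mul1", "Mul2"],
    "store": ["Store1", "Store2"],
}

log = issue_loop(body, 2, stations)
for name, station, waits in log:
    print(name, station, waits)
```

The second fmul waits on Load2 and the second fsd on Mul2, so the two iterations use f0 and f4 by name only and never wait on each other's values.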

26

Tomasulo: Example

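Instr. Status
Instruction     | I  | E  | W
fld1 f0,0($1)   |    |    |
fmul1 f4,f0,f2  |    |    |
fsd1 f4,0($1)   |    |    |
fld2 f0,0($1)   |    |    |
fmul2 f4,f0,f2  |    |    |
fsd2 f4,0($1)   |    |    |

Reservation Stations
Time | Name   | Busy | Op   | Vj    | Vk    | Qj    | Qk    | A
     | Load1  |      |      |       |       |       |       |
     | Load2  |      |      |       |       |       |       |
     | Add1   |      |      |       |       |       |       |
     | Add2   |      |      |       |       |       |       |
     | Add3   |      |      |       |       |       |       |
     | Mul1   |      |      |       |       |       |       |
     | Mul2   |      |      |       |       |       |       |
     | Store1 |      |      |       |       |       |       |
     | Store2 |      |      |       |       |       |       |

Register Result Status
   | F0    | F2    | F4
Qi |       |       |

Clock:
$1: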