
Post on 20-Dec-2015


1

Zvika Guz

Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez

Out Of Order Execution

2

Out Of Order Execution

• Goal:
  – Performance (IPC > 1)

• How?
  – Wide machine
  – Speculation (branch prediction)
  – Out-of-order execution
    » Essentially a data-flow execution model: operations execute as soon as their operands are available
  – Eliminate name dependencies (false dependencies): WAW (output) and WAR (anti)
    » Via register renaming

• But we still want a precise interrupt model
  – In-order commit
    » Via the Reorder Buffer (ROB)
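A minimal sketch of how renaming removes WAW and WAR dependencies while keeping RAW ones. The four-instruction program, the `Instr` type, and the `rename` helper are illustrative assumptions, not part of the slides; tags are named ROB1, ROB2, ... only to match the notation used in the rest of the deck.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]

def rename(program):
    """Give every destination a fresh tag; sources follow the newest mapping."""
    rat = {}   # architectural register -> tag of its most recent writer
    out = []
    for n, ins in enumerate(program, start=1):
        # A source reads the newest tag if one exists, else the register itself.
        srcs = tuple(rat.get(s, s) for s in ins.srcs)
        tag = f"ROB{n}"
        rat[ins.dest] = tag
        out.append(Instr(ins.op, tag, srcs))
    return out

# Hypothetical program, chosen to contain one hazard of each kind.
program = [
    Instr("ADDD", "F0", ("F4", "F6")),
    Instr("DIVD", "F2", ("F0", "F6")),   # RAW on F0: must be preserved
    Instr("ADDD", "F4", ("F6", "F6")),   # WAR on F4 with the first ADDD
    Instr("ADDD", "F0", ("F4", "F6")),   # WAW on F0 with the first ADDD
]

for ins in rename(program):
    print(ins.op, ins.dest, ins.srcs)
```

After renaming, every destination is unique, so the only ordering constraints left are the true dependencies carried by the tags in the source fields.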

3

MIPS FPU with Tomasulo and ROB

4

Tomasulo With Reorder buffer:

[Figure: block diagram. Components shown: FP Op Queue; Reorder Buffer with entries ROB1 (oldest) through ROB7 (newest) and fields Dest, Value, Instruction, Done?; Registers (F0, F2, F4, F10); RAT; Reservation Stations (each with a Dest field) in front of the FP adders and FP multipliers; load path from Memory (with a Dest field) and store path To Memory.]

5

4 Steps of Speculative Tomasulo Algorithm

1. Issue: get an instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").

2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not, watch the CDB for the result. Execution starts once both operands are in the reservation station; this checks RAW (this stage is sometimes called "issue").

3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.

4. Commit: update the register with the reorder buffer result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation").
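The four steps can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the slides' machine: there is no timing, no reservation-station or ROB capacity limit, no memory and no branches, and the class name `SpeculativeTomasulo` and its methods are hypothetical. Operands waiting on a producer are stored as its ROB tag (a string); available operands are stored as numbers.

```python
from collections import deque

class SpeculativeTomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)   # committed (architectural) register values
        self.rat = {}            # register -> ROB tag of its newest in-flight writer
        self.rob = deque()       # entries in program order, head = oldest
        self.issued = 0

    def _find(self, tag):
        return next(e for e in self.rob if e["tag"] == tag)

    def _read(self, reg):
        tag = self.rat.get(reg)
        if tag is None:
            return self.regs[reg]                 # value from the register file
        producer = self._find(tag)
        return producer["value"] if producer["done"] else tag

    def issue(self, op, dest, src1, src2):
        """Step 1: allocate a ROB entry, capture operands or tags, rename dest."""
        ops = [self._read(src1), self._read(src2)]
        self.issued += 1
        tag = f"ROB{self.issued}"
        self.rob.append({"tag": tag, "op": op, "dest": dest,
                         "ops": ops, "value": None, "done": False})
        self.rat[dest] = tag
        return tag

    def ready(self):
        """Tags of unfinished entries whose operands are all values."""
        return [e["tag"] for e in self.rob
                if not e["done"] and not any(isinstance(o, str) for o in e["ops"])]

    def execute(self, tag):
        """Steps 2 and 3: compute, then broadcast (tag, value) to all waiters."""
        if tag not in self.ready():
            raise ValueError(f"{tag} is still waiting for an operand")
        e = self._find(tag)
        a, b = e["ops"]
        e["value"] = a + b if e["op"] == "ADDD" else a / b   # only ADDD/DIVD modeled
        e["done"] = True
        for waiter in self.rob:                              # stand-in for the CDB
            waiter["ops"] = [e["value"] if o == tag else o for o in waiter["ops"]]

    def commit(self):
        """Step 4: retire finished entries from the head only, in program order."""
        retired = []
        while self.rob and self.rob[0]["done"]:
            e = self.rob.popleft()
            self.regs[e["dest"]] = e["value"]
            if self.rat.get(e["dest"]) == e["tag"]:
                del self.rat[e["dest"]]
            retired.append(e["tag"])
        return retired

m = SpeculativeTomasulo({"F0": 0.0, "F2": 0.0, "F4": 1.0, "F6": 2.0, "F10": 8.0})
m.issue("DIVD", "F2", "F10", "F6")   # ROB1
m.issue("ADDD", "F0", "F2", "F4")    # ROB2: RAW on F2, waits for ROB1
m.issue("ADDD", "F4", "F6", "F6")    # ROB3: independent of ROB1 and ROB2
m.execute("ROB3")                    # executes ahead of the older instructions
print(m.commit())                    # the head (ROB1) is not done, so nothing retires
m.execute("ROB1")
m.execute("ROB2")
print(m.commit())
```

In the usage above, ROB3 finishes first but its result reaches the register file only after ROB1 and ROB2 have committed, which is the in-order commit property the ROB provides.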

6

Code Example

1. LD   F0, 10(R2)
2. ADDD F10, F4, F0
3. DIVD F2, F10, F6
4. BNE  F2, <…>
5. LD   F4, 0(R3)
6. ADDD F0, F4, F6
7. ADDD F0, F4, F6
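To see which dependencies this example exercises, the sketch below classifies every ordered pair of instructions. The tuple encoding and the `hazards` helper are assumptions made for illustration; the branch target is elided in the slide, so the BNE is modeled only by its source register.

```python
# (number, opcode, destination or None, source registers)
PROGRAM = [
    (1, "LD",   "F0",  ("R2",)),
    (2, "ADDD", "F10", ("F4", "F0")),
    (3, "DIVD", "F2",  ("F10", "F6")),
    (4, "BNE",  None,  ("F2",)),        # branch target is elided in the slide
    (5, "LD",   "F4",  ("R3",)),
    (6, "ADDD", "F0",  ("F4", "F6")),
    (7, "ADDD", "F0",  ("F4", "F6")),
]

def hazards(program):
    """Return (earlier, later, register) triples for each dependency kind."""
    found = {"RAW": [], "WAR": [], "WAW": []}
    for pos, (i, _, dest_i, srcs_i) in enumerate(program):
        for j, _, dest_j, srcs_j in program[pos + 1:]:
            if dest_i is not None and dest_i in srcs_j:
                found["RAW"].append((i, j, dest_i))   # later reads what earlier writes
            if dest_j is not None and dest_j in srcs_i:
                found["WAR"].append((i, j, dest_j))   # later overwrites what earlier reads
            if dest_i is not None and dest_i == dest_j:
                found["WAW"].append((i, j, dest_i))   # both write the same register
    return found

for kind, triples in hazards(PROGRAM).items():
    print(kind, triples)
```

The WAR and WAW entries (on F4 and F0) are the name dependencies that renaming through the RAT removes; the RAW entries are the ones the reservation stations must wait for.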

7

Tomasulo With Reorder buffer:

Reorder Buffer: ROB1 (oldest) through ROB7 (newest), all empty
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: no mappings
Registers: F0, F2, F4, F10

8

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB1   F0            LD F0,10(R2)     N

Reservation Stations: empty
Load buffer (from Memory; Dest, address): 1 10+R2
RAT: F0 -> ROB1
Registers: F0, F2, F4, F10

9

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1

Load buffer (from Memory; Dest, address): 1 10+R2
RAT: F0 -> ROB1, F10 -> ROB2
Registers: F0, F2, F4, F10

10

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1
3 DIVD ROB2,R(F6)

Load buffer (from Memory; Dest, address): 1 10+R2
RAT: F0 -> ROB1, F2 -> ROB3, F10 -> ROB2
Registers: F0, F2, F4, F10

11

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB6   F0            ADDD F0,F4,F6    N
ROB5   F4            LD F4,0(R3)      N
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1
3 DIVD ROB2,R(F6)
6 ADDD ROB5,R(F6)

Load buffer (from Memory; Dest, address): 1 10+R2; 5 0+R3
RAT: F0 -> ROB6, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10

12

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0            ADDD F0,F4,F6    N
ROB6   F0            ADDD F0,F4,F6    N
ROB5   F4            LD F4,0(R3)      N
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1
3 DIVD ROB2,R(F6)
6 ADDD ROB5,R(F6)
7 ADDD ROB5,R(F6)

Load buffer (from Memory; Dest, address): 1 10+R2; 5 0+R3
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10

13

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0            ADDD F0,F4,F6    N
ROB6   F0            ADDD F0,F4,F6    N
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1
3 DIVD ROB2,R(F6)
6 ADDD M[10],R(F6)
7 ADDD M[10],R(F6)

Load buffer (from Memory; Dest, address): 1 10+R2
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10

14

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0            LD F0,10(R2)     N

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),ROB1
3 DIVD ROB2,R(F6)

Load buffer (from Memory; Dest, address): 1 10+R2
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10

15

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10           ADDD F10,F4,F0   N
ROB1   F0    M[2]    LD F0,10(R2)     Y

Reservation Stations (Dest, op, operands):
2 ADDD R(F4),M[2]
3 DIVD ROB2,R(F6)

Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10

16

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2            DIVD F2,F10,F6   N
ROB2   F10   <val4>  ADDD F10,F4,F0   Y
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations (Dest, op, operands):
3 DIVD <val4>,R(F6)

Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0 = M[2], F2, F4, F10

17

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2    <val5>  DIVD F2,F10,F6   Y
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5
Registers: F0 = M[2], F2, F4, F10 = <val4>

18

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       N
ROB3   F2    <val5>  DIVD F2,F10,F6   C
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F4 -> ROB5
Registers: F0 = M[2], F2 = <val5>, F4, F10 = <val4>

19

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      Y
ROB4   --            BNE F2,<…>       C
ROB3   F2    <val5>  DIVD F2,F10,F6   C
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F4 -> ROB5
Registers: F0 = M[2], F2 = <val5>, F4, F10 = <val4>

20

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    Y
ROB5   F4    M[10]   LD F4,0(R3)      C
ROB4   --            BNE F2,<…>       C
ROB3   F2    <val5>  DIVD F2,F10,F6   C
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7
Registers: F0 = M[2], F2 = <val5>, F4 = M[10], F10 = <val4>

21

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    Y
ROB6   F0    <val2>  ADDD F0,F4,F6    C
ROB5   F4    M[10]   LD F4,0(R3)      C
ROB4   --            BNE F2,<…>       C
ROB3   F2    <val5>  DIVD F2,F10,F6   C
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7
Registers: F0 = <val2>, F2 = <val5>, F4 = M[10], F10 = <val4>

22

Tomasulo With Reorder buffer:

Reorder Buffer (newest at top, ROB1 oldest):
Entry  Dest  Value   Instruction      Done?
ROB7   F0    <val3>  ADDD F0,F4,F6    C
ROB6   F0    <val2>  ADDD F0,F4,F6    C
ROB5   F4    M[10]   LD F4,0(R3)      C
ROB4   --            BNE F2,<…>       C
ROB3   F2    <val5>  DIVD F2,F10,F6   C
ROB2   F10   <val4>  ADDD F10,F4,F0   C
ROB1   F0    M[2]    LD F0,10(R2)     C

Reservation Stations: empty
Load buffer (from Memory): empty
RAT: no mappings
Registers: F0 = <val3>, F2 = <val5>, F4 = M[10], F10 = <val4>

23

Remarks

• What about timing?
  – What happens on which cycle? The figures show no cycle numbers
    » How many fetches/commits per cycle?
    » How many execution units?
      • Homework assignment

• Preserving the precise interrupt model
  – When an interrupt occurs, we can flush everything
    » Instructions that were not committed have no effect
      • Commit happens in order
  – Exceptions are taken on commit

• What happens if the ROB is full?
  – Fetch is stopped until some instruction commits
    » A committed instruction frees its ROB entry
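The last two remarks can be sketched with a bounded queue. This is a minimal illustration under assumptions: the `ReorderBuffer` class and its method names are hypothetical, instructions are identified by plain strings, and an exception is modeled as a flag that is only examined when the instruction reaches the head.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()          # head (left) = oldest instruction

    def allocate(self, instr):
        """Return False when the ROB is full, i.e. fetch must stall."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append({"instr": instr, "done": False, "exception": False})
        return True

    def mark_done(self, instr, exception=False):
        """Record that an instruction finished executing (possibly faulting)."""
        entry = next(e for e in self.entries if e["instr"] == instr)
        entry["done"] = True
        entry["exception"] = exception

    def commit_head(self):
        """Retire the oldest entry if it finished; exceptions are taken here."""
        if not self.entries or not self.entries[0]["done"]:
            return None                 # in-order commit: younger entries must wait
        head = self.entries.popleft()
        if head["exception"]:
            self.entries.clear()        # uncommitted instructions leave no effect
            return (head["instr"], "exception")
        return (head["instr"], "committed")

rob = ReorderBuffer(capacity=2)
rob.allocate("i1")
rob.allocate("i2")
print(rob.allocate("i3"))   # ROB full: the front end has to stall
rob.mark_done("i2")
print(rob.commit_head())    # i2 is done but i1 is not, so nothing commits
rob.mark_done("i1")
print(rob.commit_head())    # i1 commits and frees an entry
print(rob.allocate("i3"))   # fetch can proceed again
```

Because the exception flag is inspected only at the head, every older instruction has already committed and every younger one is discarded, which is what makes the interrupt precise.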

24

Memory Hazards

• When is memory updated?
  – On commit (or later)
    » Relevant for store instructions only

• WAR/WAW hazards?
  – Handled by the ROB

• RAW hazards?
  – Must ensure that no in-flight store is targeting the same address
  – What about memory disambiguation?
    » Simple answer: before starting the load, we must know the addresses of all earlier in-flight stores
    » In real life we speculate on this

ST 0(R2), F1
LD F2, 0(R2)

ST 0(R2), F1
LD F2, 0(R4)   // What if R4 = R2?
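The "simple answer" can be sketched as a conservative check performed before a load starts. The helper names `effective_address` and `load_may_start` and the register values are assumptions for illustration; a store whose address has not been computed yet is represented by `None`. Real machines, as noted above, speculate instead of waiting.

```python
def effective_address(offset, base, regs):
    """Address of offset(base), e.g. 0(R2)."""
    return offset + regs[base]

def load_may_start(load_addr, older_store_addrs):
    """Conservative rule: every older in-flight store address must be known
    and must differ from the load address."""
    for addr in older_store_addrs:
        if addr is None:          # address not computed yet: cannot rule out a match
            return False
        if addr == load_addr:     # RAW through memory: the load must wait for the store
            return False
    return True

# ST 0(R2), F1 followed by LD F2, 0(R4), with hypothetical register values.
regs = {"R2": 100, "R4": 100}
store_addr = effective_address(0, "R2", regs)
load_addr = effective_address(0, "R4", regs)
print(load_may_start(load_addr, [store_addr]))   # R4 = R2 here, so the load must wait
```

With different values in R2 and R4 the same check lets the load go ahead of the store, which is where the performance benefit of disambiguation comes from.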