Computer Structure 2014 – Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz

Upload: jakob-slee

Post on 14-Dec-2015



What's Next

• Goal: minimize CPU Time
  CPU Time = clock cycle × CPI × IC
• So far we have learned
  • Minimize clock cycle ⇒ add more pipe stages
  • Minimize CPI ⇒ use pipeline
  • Minimize IC ⇒ architecture
• In a pipelined CPU
  • CPI w/o hazards is 1
  • CPI with hazards is > 1
• Adding more pipe stages reduces clock cycle but increases CPI
  • Higher penalty due to control hazards
  • More data hazards
• What can we do? Further reduce the CPI!

A Superscalar CPU

• Duplicating HW in one pipe stage won't help
  • e.g., with 2 ALUs the bottleneck moves to other stages
• Getting IPC > 1 requires fetching, decoding, executing, and retiring more than 1 instruction per clock:

  IF ID EXE MEM WB
  IF ID EXE MEM WB

The Pentium Processor

• Fetches and decodes 2 instructions per cycle
• Before register file read, decide on pairing: can the two instructions be executed in parallel?
• Pairing decision is based on
  • Data dependencies: 2nd instruction must be independent of 1st
  • Resources: U-pipe and V-pipe are not symmetric (save HW)
    • Common instructions can execute on either pipe
    • Some instructions can execute only on the U-pipe
    • If the 2nd instruction requires the U-pipe, it cannot pair
    • Some instructions use resources of both pipes

[Figure: IF and ID stages, followed by the pairing decision, feeding the U-pipe and the V-pipe]
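The pairing rules on this slide can be sketched as a small predicate. This is only an illustration: the `Inst` record, its field names, and the rule that two writes to the same register block pairing are assumptions made here, not the Pentium's actual pairing tables.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers,
# and two resource flags suggested by the slide.
Inst = namedtuple("Inst", ["dst", "srcs", "u_only", "both_pipes"])


def can_pair(first, second):
    """Return True if `second` may issue in the V-pipe next to `first` in the U-pipe."""
    # An instruction that uses resources of both pipes cannot share a cycle.
    if first.both_pipes or second.both_pipes:
        return False
    # The 2nd instruction would go to the V-pipe, so it must not need the U-pipe.
    if second.u_only:
        return False
    # Data dependency: the 2nd instruction must not read the 1st one's result.
    if first.dst is not None and first.dst in second.srcs:
        return False
    # Assumption: two writes to the same register also block pairing.
    if first.dst is not None and first.dst == second.dst:
        return False
    return True


producer = Inst("r1", ("r2", "r3"), False, False)
consumer = Inst("r4", ("r1", "r5"), False, False)      # reads r1
independent = Inst("r6", ("r7",), False, False)
u_pipe_only = Inst("r8", ("r8",), True, False)

print(can_pair(producer, independent))   # independent instructions pair
print(can_pair(producer, consumer))      # dependent instructions do not
print(can_pair(producer, u_pipe_only))   # 2nd instruction needs the U-pipe
```

A U-pipe-only instruction can still be the first of a pair in this sketch, since the first instruction always takes the U-pipe.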

Misprediction Penalty in a Superscalar CPU

• MPI: miss-per-instruction

  MPI = (# incorrectly predicted branches) / (total # of instructions)
      = MPR × (# predicted branches) / (total # of instructions)

• MPI correlates well with performance, e.g., assume
  • MPR = 5%, %branches = 20% ⇒ MPI = 1%
  • Without hazards IPC = 2 (2 instructions per cycle)
  • Flush penalty of 5 cycles
• We get
  • MPI = 1% ⇒ flush every 100 instructions
  • IPC = 2 ⇒ flush every 100/2 = 50 cycles
  • 5 cycles flush penalty every 50 cycles ⇒ 10% performance hit
• For IPC = 1 we would get
  • 5 cycles flush penalty per 100 cycles ⇒ 5% performance hit
• Flush penalty increases as the machine is deeper and wider
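The arithmetic on this slide can be reproduced with a few lines of Python. The function name and parameter names are chosen here for illustration; the calculation follows the slide's approximation (flush penalty divided by the hazard-free cycles between flushes).

```python
def flush_overhead(mpr, branch_fraction, ipc, flush_penalty_cycles):
    """Performance hit from branch mispredictions, using the slide's approximation."""
    mpi = mpr * branch_fraction              # mispredictions per instruction
    instructions_per_flush = 1.0 / mpi       # e.g. 1% MPI: one flush per 100 instructions
    cycles_per_flush = instructions_per_flush / ipc
    return flush_penalty_cycles / cycles_per_flush


# Slide's numbers: MPR = 5%, 20% branches, 5-cycle flush penalty.
print(f"IPC=2: {flush_overhead(0.05, 0.20, 2, 5):.0%} performance hit")
print(f"IPC=1: {flush_overhead(0.05, 0.20, 1, 5):.0%} performance hit")
```

With the same MPI, doubling the IPC doubles the relative cost of each flush, which is the slide's point about wider machines.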

Extract More ILP

• ILP – Instruction Level Parallelism
  • A given program, executed on given input data, has a given parallelism
  • Can execute only independent instructions in parallel
  • If, for example, each instruction is dependent on the previous instruction, the ILP of the program is 1
    • Adding more HW will not change that
• Adjacent instructions are usually dependent
  • The utilization of the 2nd pipe is usually low
  • There are algorithms in which both pipes are highly utilized
• Solution: Out-Of-Order Execution
  • Look for independent instructions further ahead in the program
  • Execute instructions based on data readiness
  • Still need to keep the semantics of the original program

Data Flow Analysis

Example:

(1) r1 ← r4 / r7    ; assume divide takes 20 cycles
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4

Data Flow Graph:

  1 → 2 (r1)    3 → 5 (r5)    4 → 5 (r6)
  2 → 6 (r8)    5 → 6 (r4)

  Level 0: 1 3 4
  Level 1: 2 5
  Level 2: 6

[Figure: execution timelines of the six instructions under in-order execution and under out-of-order execution]
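The data flow graph can be derived mechanically from the register names. The sketch below tracks the last writer of each register to find true (read-after-write) dependencies, then computes each instruction's level in the graph and its earliest finish time. The 1-cycle latency for everything except the divide, and the unlimited number of execution units, are assumptions made for the example.

```python
# Each instruction: (number, destination register, source registers).
PROGRAM = [
    (1, "r1", ("r4", "r7")),   # divide, 20 cycles
    (2, "r8", ("r1", "r2")),
    (3, "r5", ("r5",)),
    (4, "r6", ("r6", "r3")),
    (5, "r4", ("r5", "r6")),
    (6, "r7", ("r8", "r4")),
]
LATENCY = {1: 20}              # assumption: every other instruction takes 1 cycle


def true_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for num, dst, srcs in program:
        deps[num] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dst] = num
    return deps


def dataflow_levels(program):
    """Level 0 has no producers; otherwise one more than the deepest producer."""
    deps = true_dependencies(program)
    level = {}
    for num, _, _ in program:
        level[num] = 1 + max((level[d] for d in deps[num]), default=-1)
    return level


def finish_times(program, latency):
    """Earliest finish cycle per instruction, assuming unlimited execution units."""
    deps = true_dependencies(program)
    finish = {}
    for num, _, _ in program:
        start = max((finish[d] for d in deps[num]), default=0)
        finish[num] = start + latency.get(num, 1)
    return finish


print(dataflow_levels(PROGRAM))
print(finish_times(PROGRAM, LATENCY))
```

Instruction (1) reads r4 and r7 before (5) and (6) write them, so those pairs are not true dependencies and do not appear in the graph.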

OOOE – General Scheme

• Fetch & decode instructions in parallel but in order
  • Fill the Instruction Pool
• Execute ready instructions from the instruction pool
  • All source data ready + needed execution resources available
• Once an instruction is executed
  • Signal all dependent instructions that data is ready
• Commit instructions in parallel but in-order
  • State change (memory, register) and fault/exception handling

[Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire (commit) (in-order)]
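The scheme above can be sketched as a toy cycle-by-cycle simulation: fetch in order into a pool, start any pooled instruction whose producers have finished, and retire in order. The function and variable names, the fetch/retire width of 2, the unlimited execution units, and the latencies are assumptions for illustration; the program is the six-instruction example from the Data Flow Analysis slide, numbered from 0.

```python
def simulate(program, latency, width=2):
    """program: list of (dst, srcs). Returns (execution order, retirement order)."""
    # True dependencies: each source is produced by its latest earlier writer.
    last_writer, producers = {}, []
    for idx, (dst, srcs) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dst] = idx

    never = float("inf")
    done_at = {}                 # idx -> cycle at which its result is available
    pool = []                    # fetched but not yet started, in program order
    next_fetch = next_retire = cycle = 0
    exec_order, retire_order = [], []

    while next_retire < len(program):
        # Retire: in order, up to `width` finished instructions per cycle.
        retired = 0
        while (next_retire < len(program) and retired < width
               and done_at.get(next_retire, never) <= cycle):
            retire_order.append(next_retire)
            next_retire += 1
            retired += 1
        # Execute: start every pooled instruction whose source data is ready.
        for idx in list(pool):
            if all(done_at.get(p, never) <= cycle for p in producers[idx]):
                done_at[idx] = cycle + latency[idx]
                exec_order.append(idx)
                pool.remove(idx)
        # Fetch & decode: in order, up to `width` instructions per cycle.
        for _ in range(width):
            if next_fetch < len(program):
                pool.append(next_fetch)
                next_fetch += 1
        cycle += 1
    return exec_order, retire_order


program = [("r1", ("r4", "r7")), ("r8", ("r1", "r2")), ("r5", ("r5",)),
           ("r6", ("r6", "r3")), ("r4", ("r5", "r6")), ("r7", ("r8", "r4"))]
latency = [20, 1, 1, 1, 1, 1]    # divide takes 20 cycles, the rest 1 (assumed)

exec_order, retire_order = simulate(program, latency)
print("executed:", exec_order)
print("retired: ", retire_order)
```

Execution order follows data readiness, while retirement order is always program order. The sketch ignores false dependencies, which the next slides address.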

Write-After-Write Dependency

(1) r1 ← R9 / 17
(2) r2 ← r2 + r1
(3) r1 ← 23
(4) r3 ← r3 + r1
(5) jcc L2
(6) L2: r1 ← 35
(7) r4 ← r3 + r1
(8) r3 ← 2

• If inst (3) is executed before inst (1), r1 ends up having a wrong value
  • Called write-after-write false dependency
• Inst (4) should use the value of r1 produced by inst (3), even if inst (1) is executed after inst (3)
• Write-After-Write (WAW) is a false dependency
  • Not a real data dependency, but an artifact of OOO execution

Speculative Execution

• About 1 in 5 instructions is a branch ⇒ continue fetching, decoding, and allocating instructions into the instruction pool according to the predicted path
  • Called "speculative execution"

Write-After-Read Dependency

• If inst (8) is executed before inst (7), inst (7) gets a wrong value of r3
  • Called write-after-read false dependency
• Write-After-Read (WAR) is a false dependency
  • Not a real data dependency, but an artifact of OOO execution
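The three kinds of dependency between a pair of instructions can be told apart by comparing destination and source registers. The following is a minimal sketch: the tuple layout and function name are assumptions, and it compares two instructions in isolation without considering writes that occur between them.

```python
def classify(earlier, later):
    """Dependencies of `later` on `earlier`; each instruction is (dst, srcs)."""
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    kinds = set()
    if e_dst is not None and e_dst in l_srcs:
        kinds.add("RAW")   # true dependency: later reads what earlier writes
    if e_dst is not None and e_dst == l_dst:
        kinds.add("WAW")   # false dependency: both write the same register
    if l_dst is not None and l_dst in e_srcs:
        kinds.add("WAR")   # false dependency: later overwrites what earlier reads
    return kinds


inst1 = ("r1", ("R9",))          # (1) r1 <- R9 / 17
inst3 = ("r1", ())               # (3) r1 <- 23
inst4 = ("r3", ("r3", "r1"))     # (4) r3 <- r3 + r1
inst7 = ("r4", ("r3", "r1"))     # (7) r4 <- r3 + r1
inst8 = ("r3", ())               # (8) r3 <- 2

print(classify(inst1, inst3))    # WAW on r1
print(classify(inst3, inst4))    # RAW on r1
print(classify(inst7, inst8))    # WAR on r3
```

Only RAW constrains the order in which values must be produced; WAW and WAR arise from reusing a register name.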

Register Renaming

• Hold a pool of physical registers
• Map architectural registers into physical registers
• When an instruction is allocated into the instruction pool (still in-order)
  • Allocate a free physical register from a pool
  • The physical register points to the architectural register
• When an instruction executes and writes a result
  • Write the result value to the physical register
• When an instruction needs data from a register
  • Read data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
    • If no such instruction exists, read the value from the architectural register
• When an instruction commits
  • Copy the value from its physical register to the architectural register

Register Renaming

Original instruction      Mapping   Renamed instruction
(1) r1 ← 17               r1:pr1    pr1 ← 17
(2) r2 ← r2 + r1          r2:pr2    pr2 ← r2 + pr1
(3) r1 ← 23               r1:pr3    pr3 ← 23
(4) r3 ← r3 + r1          r3:pr4    pr4 ← r3 + pr3
(5) jcc L2
(6) L2: r1 ← 35           r1:pr5    pr5 ← 35
(7) r4 ← r3 + r1          r4:pr6    pr6 ← pr4 + pr5
(8) r3 ← 2                r3:pr7    pr7 ← 2

Register mapping (in allocation order):

  r1: pr1, pr3, pr5
  r2: pr2
  r3: pr4, pr7
  r4: pr6

• When an instruction commits: copy its physical register into the architectural register
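The renaming table can be reproduced with a short sketch: each instruction that writes a register gets a fresh physical register, and each source is looked up in the current mapping before that mapping is updated. The function name and tuple layout are assumptions, and the sketch never frees physical registers.

```python
from itertools import count


def rename(program):
    """program: list of (dst, srcs). Returns (renamed instructions, final mapping)."""
    fresh = (f"pr{n}" for n in count(1))   # pool of free physical registers
    mapping = {}                           # arch register -> latest physical register
    renamed = []
    for dst, srcs in program:
        # Sources use the mapping as it stood BEFORE this instruction's own write;
        # an unmapped source is read from the architectural register.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        phys = None
        if dst is not None:
            phys = next(fresh)
            mapping[dst] = phys
        renamed.append((phys, new_srcs))
    return renamed, mapping


program = [
    ("r1", ()),              # (1) r1 <- 17
    ("r2", ("r2", "r1")),    # (2) r2 <- r2 + r1
    ("r1", ()),              # (3) r1 <- 23
    ("r3", ("r3", "r1")),    # (4) r3 <- r3 + r1
    (None, ()),              # (5) jcc L2
    ("r1", ()),              # (6) r1 <- 35
    ("r4", ("r3", "r1")),    # (7) r4 <- r3 + r1
    ("r3", ()),              # (8) r3 <- 2
]

renamed, mapping = rename(program)
for number, (phys, srcs) in enumerate(renamed, start=1):
    print(number, phys, srcs)
print(mapping)
```

After renaming, instructions (1), (3), and (6) write three different physical registers, so the WAW dependency on r1 disappears; likewise (8) writes pr7 while (7) reads pr4, which removes the WAR dependency on r3.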

Speculative Execution – Misprediction

• If the predicted branch path turns out to be wrong (when the branch is executed):
  • The instructions following the branch are flushed before they are committed ⇒ the architectural state is not changed
• But the register mapping was already wrongly updated by the wrong-path instructions

Jump Misprediction – Flush at Retire

• When the mispredicted jump retires
  • Flush the pipeline
    • When the branch commits, all the instructions remaining in the pipe are younger than the branch ⇒ from the wrong path
  • Reset the renaming map
    • So all registers are mapped to architectural registers
    • This is ok since there are no consumers of physical registers (pipe is flushed)
  • Start fetching instructions from the correct path
• Disadvantage
  • Very high misprediction penalty
  • Misprediction is already known after the jump was executed
  • We will see ways to recover from a misprediction at execution
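A minimal sketch of why resetting the renaming map at retirement is safe. The `RenameState` class and its method names are hypothetical, and allocation and execution are merged into a single `write` step for brevity.

```python
class RenameState:
    """Toy model: committed architectural registers plus a speculative rename map."""

    def __init__(self, arch_values):
        self.arch = dict(arch_values)   # committed architectural state
        self.mapping = {}               # arch register -> physical register
        self.phys = {}                  # physical register -> value

    def write(self, arch_reg, phys_reg, value):
        """Speculative write: update the map and the physical register only."""
        self.mapping[arch_reg] = phys_reg
        self.phys[phys_reg] = value

    def read(self, arch_reg):
        """Read through the map, falling back to the architectural register."""
        phys_reg = self.mapping.get(arch_reg)
        return self.arch[arch_reg] if phys_reg is None else self.phys[phys_reg]

    def commit(self, arch_reg, phys_reg):
        """In-order commit: copy the physical register to the architectural one."""
        self.arch[arch_reg] = self.phys[phys_reg]

    def flush_at_retire(self):
        """Mispredicted jump retires: drop all speculative state, reset the map."""
        self.mapping.clear()
        self.phys.clear()


state = RenameState({"r1": 0})
state.write("r1", "pr3", 23)      # (3) r1 <- 23, older than the jump
state.commit("r1", "pr3")         # (3) retires before the jump does
state.write("r1", "pr5", 35)      # (6) r1 <- 35, wrong path, never committed
speculative_value = state.read("r1")
state.flush_at_retire()           # the mispredicted jump (5) retires
recovered_value = state.read("r1")
print(speculative_value, recovered_value)
```

Because everything older than the jump has already committed by the time the jump retires, the architectural registers alone hold the correct state, and no surviving instruction refers to a physical register.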

OOO Requires Accurate Branch Predictor

• Accurate branch predictor increases the effective scheduling window size
  • Speculate across multiple branches (a branch every 5 – 10 instructions)

[Figure: % wrong instructions (y-axis, 0% to 60%) vs. prediction rate (x-axis, 70% to 100%)]

[Figure: instruction pool spanning several branches; instructions before the first branches have high chances to commit, instructions after many branches have low chances to commit]
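A rough way to see why the chance to commit drops deeper in the instruction pool: an instruction commits only if every older unresolved branch was predicted correctly. The sketch below assumes, purely for illustration, that predictions are independent and share one prediction rate; it does not reproduce the data behind the slide's chart.

```python
def commit_probability(prediction_rate, branches_ahead):
    """Chance that an instruction behind `branches_ahead` unresolved branches commits.

    Assumes each branch is predicted correctly with the same probability,
    independently of the others (a simplification).
    """
    return prediction_rate ** branches_ahead


for rate in (0.80, 0.90, 0.95, 0.99):
    chances = [commit_probability(rate, n) for n in (1, 2, 4, 8)]
    print(f"rate {rate:.0%}:", ", ".join(f"{c:.0%}" for c in chances))
```

Under these assumptions, even a 90% prediction rate leaves an instruction behind 5 branches with under a 60% chance to commit, so most of a large window does useful work only when the predictor is very accurate.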

Interrupts and Faults Handling

• Complications for pipelined and OOO execution
  • Interrupts occur in the middle of an instruction
  • A speculative instruction can get a fault (divide by 0, page fault)
• Faults are served in program order, at retirement only
  • Mark an instruction that takes a fault at execution
  • Instructions older than the faulting instruction are retired
  • Only when the faulting instruction retires: handle the fault
    • Flush subsequent instructions
    • Initiate the fault handling code according to the fault type
    • Restart faulting and/or subsequent instructions
• Interrupts are served when the next instruction retires
  • Let the instruction in the current cycle retire
  • Flush subsequent instructions and initiate the interrupt service code
  • Fetch the subsequent instructions
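The fault rule above (mark at execution, handle only at retirement) can be sketched as a retirement loop over an in-order list of instruction entries. The entry layout and the function name are assumptions for illustration.

```python
def retire_in_order(entries):
    """entries: program-ordered dicts with 'name', 'done', and 'fault' (or None).

    Returns (retired names, fault to handle or None, flushed names).
    """
    retired = []
    for position, entry in enumerate(entries):
        if not entry["done"]:
            break                                  # oldest instruction not finished yet
        if entry["fault"] is not None:
            # The faulting instruction reached retirement: handle its fault
            # and flush everything younger, including younger marked faults.
            flushed = [e["name"] for e in entries[position + 1:]]
            return retired, (entry["name"], entry["fault"]), flushed
        retired.append(entry["name"])
    return retired, None, []


pool = [
    {"name": "i1", "done": True, "fault": None},
    {"name": "i2", "done": True, "fault": "divide by 0"},
    {"name": "i3", "done": True, "fault": "page fault"},   # executed early, out of order
    {"name": "i4", "done": False, "fault": None},
]
retired, fault, flushed = retire_in_order(pool)
print(retired, fault, flushed)
```

The page fault marked on i3 is never reported: i3 is flushed when the older i2 fault is handled, which is what keeps faults in program order.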

Out Of Order Execution Summary

• Look ahead in a window of instructions
  • Dispatch ready instructions to execution
    • Do not depend on data from previous instructions still not executed
    • Have the required execution resources available
• Advantages
  • Exploit Instruction Level Parallelism beyond adjacent instructions
  • Help cover latencies (e.g., L1 data cache miss, divide)
  • Superior/complementary to compiler scheduler
    • Can look for ILP beyond conditional branches
    • In a given control path instructions may be independent
    • Register Renaming: use more than the number of architectural registers
• Complex micro-architecture
  • Register renaming, complex scheduler, misprediction recovery
  • Memory ordering: so far we did not talk about that