pipeline2
TRANSCRIPT
8/7/2019 pipeline2
Datorarkitektur F 4 - 1
Petru Eles, IDA, LiTH
INSTRUCTION PIPELINING (II)
1. Reducing Pipeline Branch Penalties
2. Instruction Fetch Units and Instruction Queues
3. Delayed Branching
4. Branch Prediction Strategies
5. Static Branch Prediction
6. Dynamic Branch Prediction
7. Branch History Table
Datorarkitektur F 4 - 2
Reducing Pipeline Branch Penalties
Branch instructions can dramatically affect pipeline performance. Control operations (conditional and unconditional branches) are very frequent in current programs.
Some statistics:
- 20% - 35% of the instructions executed are branches (conditional and unconditional).
- Conditional branches are much more frequent than unconditional ones (more than twice as frequent). More than 50% of conditional branches are taken.
It is very important to reduce the penalties produced by branches.
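The cost implied by these statistics can be sketched with a small back-of-envelope model. The 25% branch fraction and 60% taken fraction below are illustrative assumptions within the quoted ranges, and the 3/2-cycle penalties are the ones shown for this pipeline on the following slides:

```python
# Back-of-envelope model of the average stall cycles branches add per
# instruction. All numbers are illustrative assumptions, not measurements
# of a specific machine.

def avg_branch_overhead(branch_frac, taken_frac, taken_penalty, not_taken_penalty):
    """Average stall cycles added per executed instruction by branches."""
    per_branch = taken_frac * taken_penalty + (1 - taken_frac) * not_taken_penalty
    return branch_frac * per_branch

# 25% branches, 60% of them taken, penalties of 3 (taken) and 2 (not taken)
overhead = avg_branch_overhead(0.25, 0.60, 3, 2)
print(f"extra cycles per instruction: {overhead:.2f}")  # extra cycles per instruction: 0.65
```

With an ideal CPI of 1, this would push the effective CPI to 1.65, which is why reducing branch penalties matters so much.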
Datorarkitektur F 4 - 3
Instruction Fetch Units and Instruction Queues
Most processors employ sophisticated fetch units that fetch instructions before they are needed and store them in a queue.
The fetch unit also has the ability to recognize branch instructions and to generate the target address. Thus, the penalty produced by unconditional branches can be drastically reduced: the fetch unit computes the target address and continues to fetch instructions from that address, which are sent to the queue. The rest of the pipeline thus gets a continuous stream of instructions, without stalling.
The rate at which instructions can be read (from the instruction cache) must be sufficiently high to avoid an empty queue.
With conditional branches, penalties cannot be avoided. The branch condition, which usually depends on the result of the preceding instruction, has to be known in order to determine the following instruction.
Observation: In the Pentium 4, the instruction cache (trace cache) is located between the fetch unit and the instruction queue (see F 2, slide 31).
[Figure: Instruction cache → Instruction Fetch Unit → Instruction Queue → Rest of the pipeline]
Datorarkitektur F 4 - 4
Delayed Branching
These are the pipeline sequences for a conditional branch instruction (see slide 12, lecture 3). Pipeline stages: FI DI CO FO EI WO; "st" marks a stall cycle.

Branch is taken (penalty: 3 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10  11
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TARGET        FI  DI  CO  FO  EI  WO
  the target            FI  st  st  FI  DI  CO  FO  EI  WO

Branch is not taken (penalty: 2 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TARGET        FI  DI  CO  FO  EI  WO
  instr i+1             FI  st  st  DI  CO  FO  EI  WO

The idea with delayed branching is to let the CPU do some useful work during some of the cycles which are shown above to be stalled.
With delayed branching the CPU always executes the instruction that immediately follows the branch and only then alters (if necessary) the sequence of execution. The instruction after the branch is said to be in the branch delay slot.
Datorarkitektur F 4 - 5
Delayed Branching (contd)
This is what the programmer has written:

  MUL R3,R4      R3 ← R3*R4
  SUB #1,R2      R2 ← R2-1
  ADD R1,R2      R1 ← R1+R2
  BEZ TAR        branch if zero
  MOVE #10,R1    R1 ← 10
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -

The compiler (assembler) has to find an instruction which can be moved from its original place into the branch delay slot after the branch and which will be executed regardless of the outcome of the branch. MUL R3,R4 is such an instruction: it does not influence any of the instructions which follow, up to the branch, and it also doesn't influence the outcome of the branch. MOVE #10,R1, on the other hand, should be executed only if the branch is not taken.
This is what the compiler (assembler) has produced and what actually will be executed:

  SUB #1,R2
  ADD R1,R2
  BEZ TAR
  MUL R3,R4      executed regardless of the condition (branch delay slot)
  MOVE #10,R1    executed only if the branch has not been taken
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -
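The eligibility test the compiler applies can be illustrated in a few lines. This is a simplified, register-only sketch; the dict-based instruction encoding is hypothetical:

```python
# Sketch of the dependence check a compiler (assembler) makes before moving
# an instruction into the branch delay slot: the candidate's result must not
# be read or overwritten by any instruction between its old place and the
# branch, and the candidate must not consume a value produced there.

def can_move_to_delay_slot(candidate, intervening):
    """True if `candidate` can be moved below the branch, past `intervening`."""
    for instr in intervening:
        # candidate's result must not be read or overwritten in between
        if candidate["writes"] & (instr["reads"] | instr["writes"]):
            return False
        # candidate must not depend on a value produced in between
        if candidate["reads"] & instr["writes"]:
            return False
    return True

MUL = {"reads": {"R3", "R4"}, "writes": {"R3"}}     # MUL R3,R4
SUB = {"reads": {"R2"}, "writes": {"R2"}}           # SUB #1,R2
ADD = {"reads": {"R1", "R2"}, "writes": {"R1"}}     # ADD R1,R2 (sets the condition)
MUL_R2 = {"reads": {"R2", "R4"}, "writes": {"R2"}}  # MUL R2,R4 (variant on slide F 4 - 7)

print(can_move_to_delay_slot(MUL, [SUB, ADD]))      # True: independent, can be moved
print(can_move_to_delay_slot(MUL_R2, [SUB, ADD]))   # False: R2 is used by SUB and ADD
```

Because ADD produces the branch condition, the read/write checks also guarantee that the moved instruction cannot influence the branch outcome.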
Datorarkitektur F 4 - 6
Delayed Branching (contd)
This happens in the pipeline ("st" marks a stall cycle):

Branch is taken (penalty: 2 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10  11
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TAR           FI  DI  CO  FO  EI  WO
  MUL R3,R4             FI  DI  CO  FO  EI  WO
  the target                FI  st  FI  DI  CO  FO  EI  WO

The target is fetched at the moment when both the condition (set by ADD) and the target address are known.

Branch is not taken (penalty: 1 cycle):

  Clock cycle   1   2   3   4   5   6   7   8   9   10
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TAR           FI  DI  CO  FO  EI  WO
  MUL R3,R4             FI  DI  CO  FO  EI  WO
  MOVE                      FI  st  DI  CO  FO  EI  WO

Once the condition is known, the MOVE can go on.
Datorarkitektur F 4 - 7
Delayed Branching (contd)
What happens if the compiler is not able to find an instruction to be moved after the branch, into the branch delay slot?
In this case a NOP instruction (an instruction that does nothing) has to be placed after the branch. The penalty will then be the same as without delayed branching.

  MUL R2,R4      now, with R2 as destination, this instruction influences the ones that follow and cannot be moved from its place
  SUB #1,R2
  ADD R1,R2
  BEZ TAR
  NOP
  MOVE #10,R1
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -

Some statistics show that for between 60% and 85% of branches, sophisticated compilers are able to find an instruction to be moved into the branch delay slot.
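Combining these fill-rate statistics with the penalties from the diagrams gives a rough expected penalty per branch. The 60% taken fraction is an illustrative assumption; the 2/1-cycle penalties (useful instruction in the slot) and 3/2-cycle penalties (NOP in the slot, i.e. as without delayed branching) are the ones shown above:

```python
# Rough expected branch penalty under delayed branching. All fractions are
# illustrative assumptions based on the statistics quoted in the text.

def expected_penalty(fill_rate, taken_frac=0.6):
    filled = taken_frac * 2 + (1 - taken_frac) * 1  # useful instruction in the slot
    nop = taken_frac * 3 + (1 - taken_frac) * 2     # NOP in the slot
    return fill_rate * filled + (1 - fill_rate) * nop

for f in (0.60, 0.85):
    print(f"fill rate {f:.0%}: {expected_penalty(f):.2f} cycles per branch")
# fill rate 60%: 2.00 cycles per branch
# fill rate 85%: 1.75 cycles per branch
```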
Datorarkitektur F 4 - 8
Branch Prediction
In the last example we have considered that the branch will not be taken, and we fetched the instruction following the branch; in case the branch was taken, the fetched instruction was discarded. As a result, we had a branch penalty of:
- 1 if the branch is not taken (prediction fulfilled);
- 2 if the branch is taken (prediction not fulfilled).
Let us consider the opposite prediction: branch taken. For this solution the target address needs to be computed in advance by the instruction fetch unit.
[Pipeline diagrams (predict taken): ADD R1,R2 followed by BEZ TAR. If the branch is taken, the prefetched target instruction enters the pipeline after a single stall cycle: penalty 1 cycle (prediction fulfilled). If the branch is not taken, the prefetched target must be discarded and the sequential instruction refetched: penalty 2 cycles (prediction not fulfilled).]
Datorarkitektur F 4 - 9
Branch Prediction (contd)
Correct branch prediction is very important and can produce substantial performance improvements.
Based on the predicted outcome, the respective instruction can be fetched, as well as the instructions following it, and they can be placed into the instruction queue (see slide 3). If, after the branch condition is computed, it turns out that the prediction was correct, execution continues. On the other hand, if the prediction is not fulfilled, the fetched instruction(s) must be discarded and the correct instruction must be fetched.
To take full advantage of branch prediction, we can have the instructions not only fetched but also begin execution. This is known as speculative execution.
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution path. If it turns out that the prediction was correct, execution goes on without introducing any branch penalty. If, however, the prediction is not fulfilled, the instruction(s) started in advance and all their associated data must be purged and the state previous to their execution restored.
Branch prediction strategies:
1. Static prediction
2. Dynamic prediction
Datorarkitektur F 4 - 10
Static Branch Prediction
Static prediction techniques do not take into consideration execution history.
Static approaches:
Predict never taken (Motorola 68020): assumesthat the branch is not taken.
Predict always taken: assumes that the branch istaken.
Predict depending on the branch direction(PowerPC 601):
- predict branch taken for backward branches;
- predict branch not taken for forward branches.
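The direction-based rule can be stated in one line of code: a branch is backward exactly when its target lies before the branch instruction. The addresses below are made-up examples:

```python
# Direction-based static prediction (the PowerPC 601 rule above): backward
# branches, which typically close loops, are predicted taken; forward
# branches are predicted not taken. Addresses are hypothetical.

def static_predict_taken(branch_addr, target_addr):
    # backward branch: the target lies before the branch instruction
    return target_addr < branch_addr

print(static_predict_taken(0x1040, 0x1000))  # True: backward (loop-closing) branch
print(static_predict_taken(0x1040, 0x1080))  # False: forward branch
```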
Datorarkitektur F 4 - 11
Dynamic Branch Prediction
Dynamic prediction techniques improve the accuracy of the prediction by recording the history of conditional branches.
One-Bit Prediction Scheme
One bit is used in order to record whether the last execution resulted in a branch taken or not. The system predicts the same behavior as for the last time.
Shortcoming
When a branch is almost always taken, then when it is not taken, we will predict incorrectly twice, rather than once:
- - - - - - - - - - -
LOOP - - - - - - - - - - -
- - - - - - - - - - -
BNZ LOOP
- - - - - - - - - - -
- After the loop has been executed for the first time and left, it will be remembered that BNZ has not been taken. Now, when the loop is executed again, after the first iteration there will be a false prediction; the following predictions are OK until the last iteration, when there will be a second false prediction.
- In this case the result is even worse than with static prediction considering that backward branches are always taken (the PowerPC 601 approach).
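The double-misprediction behavior described above can be reproduced with a few lines. The predictor is a minimal sketch; the loop branch below is assumed to be taken 9 times and then fall through, and the loop body is executed twice in a row:

```python
# One-bit prediction scheme: remember only the last outcome and predict it
# again. On a loop branch executed twice, it mispredicts twice per pass:
# once on re-entering the loop, once at the exit.

def run_one_bit(outcomes):
    state = False          # last observed outcome (False = not taken)
    mispredictions = 0
    for taken in outcomes:
        if state != taken:
            mispredictions += 1
        state = taken      # remember only the most recent outcome
    return mispredictions

loop = [True] * 9 + [False]        # one execution of the loop (BNZ LOOP)
print(run_one_bit(loop + loop))    # 4: two mispredictions per pass
```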
Datorarkitektur F 4 - 12
Two-Bit Prediction Scheme
With a two-bit scheme, predictions can be made depending on the last two instances of execution.
A typical scheme is to change the prediction only if there have been two incorrect predictions in a row.
- - - - - - - - - - -
LOOP - - - - - - - - - - -
- - - - - - - - - - -
BNZ LOOP
- - - - - - - - - - -
[State transition diagram: four states encoded 00, 01, 10, 11. States 11 and 10 predict "taken"; states 01 and 00 predict "not taken". Taken/not-taken outcomes move the automaton between the states in such a way that the prediction changes only after two incorrect predictions in a row.]
After the first execution of the loop the bits attached to BNZ will be 01; now, there will always be one false prediction for the loop, at its exit.
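A common formulation of the two-bit scheme is a saturating counter, which shows the change-only-after-two-mispredictions behavior. This is a sketch, not the exact automaton from the diagram; run on the same doubled loop as before (taken 9 times, then fall through), it mispredicts only once per pass after warming up:

```python
# Two-bit prediction as a saturating counter: states 0..3, where states
# 2 and 3 predict taken. A single misprediction moves the counter one step
# but does not flip the prediction; only two in a row can flip it.

def run_two_bit(outcomes, state=3):
    mispredictions = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken != taken:
            mispredictions += 1
        # saturating step toward the observed outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredictions

loop = [True] * 9 + [False]        # one execution of the loop (BNZ LOOP)
print(run_two_bit(loop + loop))    # 2: one misprediction per pass, at loop exit
```

Compared with the one-bit scheme, the misprediction on re-entering the loop disappears, which is exactly the shortcoming the two-bit scheme fixes.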
Datorarkitektur F 4 - 13
Branch History Table
History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculating the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.
[Figure: the instruction fetch unit, connected to the instruction cache, consults the branch history table to decide the address to fetch from. Each table entry holds the address of the branch instruction, the target address, and the prediction bits. When the branch reaches the EI stage: if the branch was not in the table, a new entry is added; if it was, the entry is updated, and if the prediction was not OK, the fetched instructions are discarded and fetching restarts from the correct address.]
Datorarkitektur F 4 - 14
Branch History Table (contd)
Some explanations to the previous figure:
- Address where to fetch from: If the branch instruction is not in the table, the next instruction (address PC+1) is to be fetched. If the branch instruction is in the table, first of all a prediction based on the prediction bits is made. Depending on the prediction outcome, the next instruction (address PC+1) or the instruction at the target address is to be fetched.
- Update entry: If the branch instruction has been in the table, the respective entry has to be updated to reflect the correct or incorrect prediction.
- Add new entry: If the branch instruction has not been in the table, it is added to the table with the corresponding information concerning branch outcome and target address. If needed, one of the existing table entries is discarded. Replacement algorithms similar to those for cache memories are used.
Using dynamic branch prediction with history tables, up to 90% of predictions can be correct.
Both the Pentium and the PowerPC 620 use speculative execution with dynamic branch prediction based on a branch history table.
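The table structure from the figure can be sketched in a few lines: entries keyed by the branch instruction's address, each holding the target address and a two-bit counter. The class and method names are invented for illustration, and the counter is the saturating variant sketched earlier:

```python
# Minimal sketch of a branch history table. A lookup returns the address to
# fetch from next; an update after the branch resolves adds or adjusts the
# entry (here with a 2-bit saturating counter; states 2 and 3 predict taken).

class BranchHistoryTable:
    def __init__(self):
        self.entries = {}                      # branch addr -> [target, 2-bit state]

    def lookup(self, branch_addr, fall_through):
        """Return the address where to fetch from."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            return fall_through                # not in table: fetch next instruction
        target, state = entry
        return target if state >= 2 else fall_through

    def update(self, branch_addr, target, taken):
        """Record the actual outcome once the branch has executed."""
        if branch_addr not in self.entries:    # add new entry
            self.entries[branch_addr] = [target, 2 if taken else 1]
            return
        entry = self.entries[branch_addr]      # update entry
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

bht = BranchHistoryTable()
print(hex(bht.lookup(0x100, 0x104)))   # 0x104: unknown branch, fetch fall-through
bht.update(0x100, 0x200, taken=True)
print(hex(bht.lookup(0x100, 0x104)))   # 0x200: now predicted taken, fetch the target
```

A real table is a small fixed-size cache indexed by address bits, with a replacement policy, rather than an unbounded dictionary.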
Datorarkitektur F 4 - 15
The Intel 80486 Pipeline
The 80486 is the last x86 processor that is not superscalar. It is a typical example of an advanced non-superscalar pipeline.
The 80486 has a five-stage pipeline.
There is no branch prediction (or, in fact, branches are always predicted not taken).
The five pipeline stages, in order: Fetch instructions → Decode_1 → Decode_2 → Execute → Write back.
Fetch: instructions are fetched from the cache and placed into the instruction queue (organised as two prefetch buffers). Operates independently of the other stages and tries to keep the prefetch buffers full.
Decode_1: takes the first 3 bytes of the instruction and decodes opcode, addressing mode, and instruction length; the rest of the instruction is decoded by Decode_2.
Decode_2: decodes the rest of the instruction and produces control signals; performs address computation.
Execute: ALU operations; cache access for operands.
Write back: updates registers and status flags; for memory updates, sends values to the cache and to the write buffers.
Datorarkitektur F 4 - 16
The ARM pipeline
ARM7 pipeline: Fetch → Decode → Execute.
Fetch: instructions fetched from the cache.
Decode: instructions and operand registers decoded.
Execute: registers read; shift and ALU operations; results or loaded data written back to registers.
Datorarkitektur F 4 - 17
The ARM pipeline (contd)
ARM9 pipeline: Fetch → Decode → Execute → Data memory access → Register write.
Fetch: instructions fetched from the I-cache.
Decode: instructions and operand registers decoded; registers read.
Execute: shift and ALU operations.
Data memory access: fetch/store data from/to the D-cache.
Register write: results or loaded data written back to registers.
The performance of the ARM9 is significantly superior to that of the ARM7:
- Higher clock speed due to the larger number of pipeline stages.
- More even distribution of tasks among pipeline stages; tasks have been moved away from the execute stage.
Datorarkitektur F 4 - 18
The ARM pipeline (contd)
ARM11 pipeline: Fetch 1 → Fetch 2 → Decode → Issue, followed by the execution pipelines (Shift/Address → ALU/Memory 1 → ALU/Memory 2 → Writeback).
The performance of the ARM11 is further enhanced by:
- Higher clock speed due to the larger number of pipeline stages; more even distribution of tasks among pipeline stages.
- Branch prediction:
  - dynamic two-bit prediction based on a 64-entry branch history table (branch target address cache, BTAC);
  - if the instruction is not in the BTAC, static prediction is done: taken if backward, not taken if forward.
- Decoupling of the load/store pipeline from the ALU&MAC (multiply-accumulate) pipeline: ALU operations can continue while load/store operations complete (see next slide).
Datorarkitektur F 4 - 19
The ARM pipeline (contd)
ARM11 pipeline (detailed): Fetch 1 → Fetch 2 → Decode → Issue, then two parallel pipelines: the ALU pipeline (Shift → ALU 1 → ALU 2 → Writeback) and the load/store pipeline (Address → Memory 1 → Memory 2 → Writeback).
Fetch 1, 2: instructions fetched from the I-cache; dynamic branch prediction.
Decode: instructions decoded; static branch prediction (if needed).
Issue: instruction issued; registers read.
Shift: register shift/rotate.
ALU 1, 2: ALU/MAC operations.
Writeback (ALU pipeline): results written to registers.
Address: address calculation.
Memory 1, 2: data memory access.
Writeback (load/store pipeline): loaded data written to registers; stores committed.
Datorarkitektur F 4 - 20
Summary
Branch instructions can dramatically affect pipeline performance. It is very important to reduce the penalties produced by branches.
Instruction fetch units are able to recognize branch instructions and generate the target address. By fetching at a high rate from the instruction cache and keeping the instruction queue loaded, it is possible to reduce the penalty for unconditional branches to zero. For conditional branches this is not possible, because we have to wait for the outcome of the decision.
Delayed branching is a compiler-based technique aimed at reducing the branch penalty by moving instructions into the branch delay slot.
Efficient reduction of the branch penalty for conditional branches needs a clever branch prediction strategy. Static branch prediction does not take into consideration execution history. Dynamic branch prediction is based on a record of the history of a conditional branch.
Branch history tables are used to store both information on the outcome of branches and the target address of the respective branch.