pipeline2



    Datorarkitektur F 4 - 1

    Petru Eles, IDA, LiTH

    INSTRUCTION PIPELINING (II)

    1. Reducing Pipeline Branch Penalties

    2. Instruction Fetch Units and Instruction Queues

    3. Delayed Branching

    4. Branch Prediction Strategies

    5. Static Branch Prediction

    6. Dynamic Branch Prediction

    7. Branch History Table

    Datorarkitektur F 4 - 2


    Reducing Pipeline Branch Penalties

    Branch instructions can dramatically affect pipeline performance. Control operations (conditional and unconditional branch) are very frequent in current programs.

    Some statistics:

    - 20% - 35% of the instructions executed are branches (conditional and unconditional).

    - Conditional branches are much more frequent than unconditional ones (more than two times). More than 50% of conditional branches are taken.

    It is very important to reduce the penalties produced by branches.
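    To make the impact concrete, here is a minimal back-of-the-envelope sketch in Python. The branch fraction, the taken fraction and the per-branch penalties are illustrative assumptions chosen to match the statistics above and the penalties shown on the following slides; they are not figures from the lecture.

        # Illustrative only: how branch penalties inflate the average
        # cycles per instruction (CPI) of a pipelined processor.
        branch_fraction = 0.25      # 20% - 35% of executed instructions are branches
        taken_fraction = 0.6        # more than 50% of conditional branches are taken
        penalty_taken = 3           # stall cycles for a taken branch (see the diagrams below)
        penalty_not_taken = 2       # stall cycles for a not-taken conditional branch

        base_cpi = 1.0              # ideal pipelined CPI, one instruction per cycle
        extra = branch_fraction * (taken_fraction * penalty_taken
                                   + (1 - taken_fraction) * penalty_not_taken)
        print(f"effective CPI = {base_cpi + extra:.2f}")   # prints 1.65 with these numbers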

    Datorarkitektur F 4 - 3


    Instruction Fetch Units and Instruction Queues

    Most processors employ sophisticated fetch units that fetch instructions before they are needed and store them in a queue.

    The fetch unit also has the ability to recognize branch instructions and to generate the target address. Thus, the penalty produced by unconditional branches can be drastically reduced: the fetch unit computes the target address and continues to fetch instructions from that address, which are sent to the queue. This way, the rest of the pipeline gets a continuous stream of instructions, without stalling.

    The rate at which instructions can be read (from the instruction cache) must be sufficiently high to avoid an empty queue.

    With conditional branches, penalties cannot be avoided. The branch condition, which usually depends on the result of the preceding instruction, has to be known in order to determine the following instruction.

    Observation: in the Pentium 4, the instruction cache (trace cache) is located between the fetch unit and the instruction queue (see F 2, slide 31).

    [Figure: the Instruction Fetch Unit reads from the Instruction cache and fills the Instruction Queue, which feeds the rest of the pipeline.]
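    To illustrate how such a fetch unit can keep the queue full across unconditional branches, here is a small Python sketch. The program representation, the JMP mnemonic and the fetch count are invented for illustration and do not describe any particular processor.

        from collections import deque

        # Minimal sketch of a fetch unit that recognizes unconditional
        # branches and keeps the instruction queue filled without stalling.
        program = {
            0: {"op": "ADD"},
            1: {"op": "JMP", "target": 10},   # unconditional branch
            2: {"op": "SUB"},                 # never reached
            10: {"op": "MUL"},
            11: {"op": "MOVE"},
        }

        def fetch(pc, queue, n):
            """Fetch up to n instructions, redirecting at unconditional branches."""
            for _ in range(n):
                instr = program.get(pc)
                if instr is None:
                    break
                queue.append((pc, instr["op"]))
                # The fetch unit computes the target itself, so the pipeline
                # behind the queue sees a continuous stream of instructions.
                pc = instr["target"] if instr["op"] == "JMP" else pc + 1
            return pc

        queue = deque()
        fetch(0, queue, 5)
        print(list(queue))   # [(0, 'ADD'), (1, 'JMP'), (10, 'MUL'), (11, 'MOVE')]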

    Datorarkitektur F 4 - 4


    Delayed Branching

    These are the pipeline sequences for a conditional branch instruction (see slide 12, lecture 3):

    Branch is taken

    Penalty: 3 cycles

    Branch is not taken

    Penalty: 2 cycles

    The idea with delayed branching is to let the CPU do some useful work during some of the cycles which are shown above to be stalled.

    With delayed branching the CPU always executes the instruction that immediately follows after the branch and only then alters (if necessary) the sequence of execution. The instruction after the branch is said to be in the branch delay slot.

    [Pipeline diagrams: ADD R1,R2, BEZ TARGET and then either the target instruction (branch taken) or instr i+1 (branch not taken) pass through the FI DI CO FO EI WO stages over clock cycles 1-12; the instruction after the branch is stalled for 3 or 2 cycles respectively.]


    Datorarkitektur F 4 - 5


    Delayed Branching (contd)

    This is what the programmer has written:

    MUL R3,R4       R3 ← R3*R4

    SUB #1,R2       R2 ← R2-1

    ADD R1,R2       R1 ← R1+R2

    BEZ TAR         branch if zero

    MOVE #10,R1     R1 ← 10

    - - - - - - - - - - - - -

    TAR - - - - - - - - - - - - -

    The compiler (assembler) has to find an instruction which can be moved from its original place into the branch delay slot after the branch and which will be executed regardless of the outcome of the branch.

    This is what the compiler (assembler) has produced and what actually will be executed:

    SUB #1,R2

    ADD R1,R2

    BEZ TAR

    MUL R3,R4

    MOVE #10,R1

    - - - - - - - - - - - - -

    TAR - - - - - - - - - - - - -

    Notes from the slide's annotations: MUL R3,R4 does not influence any of the instructions which follow until the branch, and it also does not influence the outcome of the branch, so it can be moved into the delay slot. MOVE #10,R1 should be executed only if the branch is not taken, so it cannot be moved there. In the rearranged code, MUL R3,R4 (now in the delay slot) will be executed regardless of the condition, while MOVE #10,R1 will be executed only if the branch has not been taken.
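    The compiler's task described above can be sketched in a few lines of Python. This is a hedged illustration, not the actual algorithm from the lecture: each instruction is annotated with the registers it writes and reads, the branch is modelled as reading R1 (the register whose value decides the condition), and an instruction is movable into the delay slot only if it neither writes a register used afterwards (including by the branch) nor reads a register written afterwards.

        from dataclasses import dataclass, field

        @dataclass
        class Instr:
            text: str
            dst: set = field(default_factory=set)   # registers written
            src: set = field(default_factory=set)   # registers read

        prog = [
            Instr("MUL R3,R4",  dst={"R3"}, src={"R3", "R4"}),
            Instr("SUB #1,R2",  dst={"R2"}, src={"R2"}),
            Instr("ADD R1,R2",  dst={"R1"}, src={"R1", "R2"}),  # sets the condition
            Instr("BEZ TAR",    dst=set(),  src={"R1"}),        # branch reads the condition
        ]

        def fill_delay_slot(prog):
            branch = prog[-1]
            for i in range(len(prog) - 2, -1, -1):
                cand = prog[i]
                later = prog[i + 1:-1]     # instructions between candidate and branch
                used_later = set().union(*(ins.dst | ins.src for ins in later)) | branch.src
                written_later = set().union(*(ins.dst for ins in later))
                # Movable only if it neither writes something used later
                # (including by the branch) nor reads something written later.
                if not (cand.dst & used_later) and not (cand.src & written_later):
                    return prog[:i] + prog[i + 1:] + [cand]   # move it into the slot
            return prog + [Instr("NOP")]                      # nothing movable: insert NOP

        for ins in fill_delay_slot(prog):
            print(ins.text)

    Run on the program above, the sketch moves MUL R3,R4 into the delay slot and reproduces the rearranged sequence shown on this slide; if nothing is movable it falls back to inserting a NOP, which is exactly the case discussed on slide F 4 - 7.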

    Datorarkitektur F 4 - 6


    Delayed Branching (contd)

    This happens in the pipeline:

    Branch is taken

    Penalty: 2 cycles

    Branch is not taken

    Penalty: 1 cycle

    [Pipeline diagrams: ADD R1,R2, BEZ TAR and MUL R3,R4 (in the delay slot) pass through the FI DI CO FO EI WO stages over clock cycles 1-12. In the taken case the target is fetched as soon as both the condition (set by ADD) and the target address are known, giving 2 stall cycles; in the not-taken case the MOVE can go on as soon as the condition is known, giving 1 stall cycle.]

    Datorarkitektur F 4 - 7


    Delayed Branching (contd)

    What happens if the compiler is not able to find an instruction to be moved after the branch, into the branch delay slot?

    In this case a NOP instruction (an instruction that does nothing) has to be placed after the branch; the penalty will then be the same as without delayed branching.

    MUL R2,R4

    SUB #1,R2

    ADD R1,R2

    BEZ TAR

    NOP

    MOVE #10,R1

    - - - - - - - - - - - - -

    TAR - - - - - - - - - - - - -

    Some statistics show that for between 60% and 85% of branches, sophisticated compilers are able to find an instruction to be moved into the branch delay slot.

    Now, with R2 as its destination, the MUL instruction influences the following ones and cannot be moved from its place.

    Datorarkitektur F 4 - 8


    Branch Prediction

    In the last example we have considered that the branch will not be taken and we fetched the instruction following the branch; in case the branch was taken, the fetched instruction was discarded. As a result, we had a branch penalty of:

    - 1 if the branch is not taken (prediction fulfilled);

    - 2 if the branch is taken (prediction not fulfilled).

    Let us consider the opposite prediction: branch taken. For this solution it is needed that the target address is computed in advance by an instruction fetch unit.

    Branch is taken

    Penalty: 1 cycle (prediction fulfilled)

    Branch is not taken

    Penalty: 2 cycles (prediction not fulfilled)

    [Pipeline diagrams for the predict-taken strategy: ADD R1,R2, BEZ TAR, MUL R3,R4, MOVE and the target instruction in the FI DI CO FO EI WO stages over clock cycles 1-12, showing 1 stall cycle when the branch is taken (prediction fulfilled) and 2 stall cycles when it is not taken (prediction not fulfilled).]


    Datorarkitektur F 4 - 9


    Branch Prediction (contd)

    Correct branch prediction is very important and can produce substantial performance improvements.

    Based on the predicted outcome, the respective instruction can be fetched, as well as the instructions following it, and they can be placed into the instruction queue (see slide 3). If, after the branch condition is computed, it turns out that the prediction was correct, execution continues. On the other hand, if the prediction is not fulfilled, the fetched instruction(s) must be discarded and the correct instruction must be fetched.

    To take full advantage of branch prediction, we can have the instructions not only fetched but also begin execution. This is known as speculative execution.

    Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution path. If it turns out that the prediction was correct, execution goes on without introducing any branch penalty. If, however, the prediction is not fulfilled, the instruction(s) started in advance and all their associated data must be purged and the state previous to their execution restored.
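    The purge-and-restore behaviour can be pictured with a trivial Python sketch; the register file and checkpoint mechanism below are purely illustrative assumptions, not how any real processor implements recovery.

        import copy

        # Hedged sketch: checkpoint the architectural state before executing
        # speculated instructions, and restore it if the prediction was wrong.
        regs = {"R1": 5, "R2": 7}

        checkpoint = copy.deepcopy(regs)   # state before speculation
        regs["R1"] = 99                    # speculatively executed instruction writes R1

        prediction_correct = False         # the branch resolves: prediction was wrong
        if not prediction_correct:
            regs = checkpoint              # purge speculative results, restore old state

        print(regs)                        # {'R1': 5, 'R2': 7}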

    Branch prediction strategies:

    1. Static prediction

    2. Dynamic prediction

    Datorarkitektur F 4 - 10


    Static Branch Prediction

    Static prediction techniques do not take into consideration execution history.

    Static approaches:

    Predict never taken (Motorola 68020): assumes that the branch is not taken.

    Predict always taken: assumes that the branch is taken.

    Predict depending on the branch direction (PowerPC 601):

    - predict branch taken for backward branches;

    - predict branch not taken for forward branches.
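    Stated as code, the direction-based rule is a one-liner. This Python sketch is only an illustration; the function name and the example addresses are made up:

        def predict_taken(branch_addr, target_addr):
            """PowerPC 601-style static rule: backward branches (typically loops)
            are predicted taken, forward branches are predicted not taken."""
            return target_addr < branch_addr

        print(predict_taken(0x1040, 0x1000))   # backward branch -> True (taken)
        print(predict_taken(0x1040, 0x1080))   # forward branch  -> False (not taken)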

    Datorarkitektur F 4 - 11


    Dynamic Branch Prediction

    Dynamic prediction techniques improve the accuracy of the prediction by recording the history of conditional branches.

    One-Bit Prediction Scheme

    One bit is used in order to record if the last execution resulted in a branch taken or not. The system predicts the same behavior as for the last time.

    Shortcoming

    When a branch is almost always taken, then when it is not taken, we will predict incorrectly twice, rather than once:

    - - - - - - - - - - -

    LOOP - - - - - - - - - - -

    - - - - - - - - - - -

    BNZ LOOP

    - - - - - - - - - - -

    - After the loop has been executed for the first time and left, it will be remembered that BNZ has not been taken. Now, when the loop is executed again, after the first iteration there will be a false prediction; the following predictions are OK until the last iteration, when there will be a second false prediction.

    - In this case the result is even worse than with static prediction, which considers that backward branches are always taken (the PowerPC 601 approach).
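    The shortcoming can be reproduced with a small Python sketch of a one-bit predictor; the loop length, the branch address and the "predict not taken" default for unknown branches are assumptions made for illustration.

        # One-bit scheme: remember only the last outcome of each branch.
        history = {}                               # branch address -> last outcome (True = taken)

        def predict(addr):
            return history.get(addr, False)        # unknown branch: assume not taken

        def update(addr, taken):
            history[addr] = taken

        mispredictions = 0
        BNZ = 0x100                                # hypothetical address of the BNZ LOOP branch
        for _ in range(2):                         # the loop is entered twice...
            for i in range(5):                     # ...and iterates five times per entry
                taken = i < 4                      # taken on every iteration except the last
                if predict(BNZ) != taken:
                    mispredictions += 1
                update(BNZ, taken)

        print(mispredictions)   # prints 4: two mispredictions per loop execution (entry and exit)

    With a scheme that keeps predicting "taken" for the loop branch, only the exit iteration would be mispredicted, which is the point made in the second bullet above.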

    Datorarkitektur F 4 - 12


    Two-Bit Prediction Scheme

    With a two-bit scheme predictions can be made depending on the last two instances of execution.

    A typical scheme is to change the prediction only if there have been two incorrect predictions in a row.

    - - - - - - - - - - -

    LOOP - - - - - - - - - - -

    - - - - - - - - - - -

    BNZ LOOP

    - - - - - - - - - - -

    [State diagram: a four-state (two-bit) predictor with states 11, 10, 01, 00; two states predict "taken" and two predict "not taken", and each taken/not-taken outcome moves the branch to a neighbouring state, so the prediction changes only after two wrong predictions in a row.]

    After the first execution of the loop the bits attached to BNZ will be 01; now there will always be one false prediction for the loop, at its exit.
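    The two-bit scheme described above (change the prediction only after two wrong predictions in a row) can be sketched as follows in Python. The initial "predict taken" state and the loop shape are assumptions made for illustration, and the encoding may differ from the two bits shown in the state diagram.

        # Keep the current prediction plus one bit of hysteresis; flip the
        # prediction only after two incorrect predictions in a row.
        class TwoBitPredictor:
            def __init__(self, predict_taken=False):
                self.predict_taken = predict_taken
                self.last_was_wrong = False

            def predict(self):
                return self.predict_taken

            def update(self, taken):
                wrong = (taken != self.predict_taken)
                if wrong and self.last_was_wrong:
                    self.predict_taken = taken     # two misses in a row: change prediction
                    self.last_was_wrong = False
                else:
                    self.last_was_wrong = wrong

        p = TwoBitPredictor(predict_taken=True)    # assume we start in a "taken" state
        misses = 0
        for _ in range(3):                         # run the loop three times, 5 iterations each
            for i in range(5):
                taken = i < 4                      # taken on all but the last iteration
                if p.predict() != taken:
                    misses += 1
                p.update(taken)
        print(misses)   # prints 3: only the exit iteration of each execution is mispredicted

    This reproduces the annotation above: one false prediction per execution of the loop, at its exit.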


    Datorarkitektur F 4 - 13


    Branch History Table

    History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculation of the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.

    [Figure: the Instruction Fetch Unit looks up the address of each branch instruction in the branch history table (each entry holds the instruction address, the target address and the prediction bits) to decide the address where to fetch from. When the branch is resolved in the EI stage: if the branch was not in the table, a new entry is added; if it was, the entry is updated; and if the prediction was not OK, the fetched instructions are discarded and fetching restarts from the correct address.]

    Datorarkitektur F 4 - 14


    Branch History Table (contd)

    Some explanations to the previous figure:

    - Address where to fetch from: if the branch instruction is not in the table, the next instruction (address PC+1) is to be fetched. If the branch instruction is in the table, first of all a prediction based on the prediction bits is made. Depending on the prediction outcome, the next instruction (address PC+1) or the instruction at the target address is to be fetched.

    - Update entry: if the branch instruction has been in the table, the respective entry has to be updated to reflect the correct or incorrect prediction.

    - Add new entry: if the branch instruction has not been in the table, it is added to the table with the corresponding information concerning branch outcome and target address. If needed, one of the existing table entries is discarded. Replacement algorithms similar to those for cache memories are used.
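    As an illustration of how such a table can be organised, here is a hedged Python sketch. The field names, the two-bit saturating counter used as prediction bits and the PC+1 fall-through address are simplifying assumptions, and misprediction recovery (discarding fetches) is not modelled.

        # Minimal branch history table: one entry per branch instruction,
        # holding the target address and two prediction bits.
        bht = {}   # branch address -> {"target": int, "counter": 0..3}

        def fetch_address(pc):
            """Address where to fetch from (consulted by the fetch unit)."""
            entry = bht.get(pc)
            if entry is None or entry["counter"] < 2:   # not in table, or predicted not taken
                return pc + 1
            return entry["target"]                      # predicted taken: use stored target

        def resolve(pc, taken, target):
            """Called when the branch outcome is known (EI stage)."""
            entry = bht.get(pc)
            if entry is None:                           # add new entry
                bht[pc] = {"target": target, "counter": 2 if taken else 1}
                return
            if taken:                                   # update entry (saturating counter)
                entry["counter"] = min(3, entry["counter"] + 1)
            else:
                entry["counter"] = max(0, entry["counter"] - 1)
            entry["target"] = target

        # Usage: a backward branch at address 40 jumping to 10, taken once.
        print(fetch_address(40))              # 41 (unknown branch: fetch PC+1)
        resolve(40, taken=True, target=10)
        print(fetch_address(40))              # 10 (predicted taken, target from the table)

    On a misprediction, the real hardware additionally discards the speculatively fetched instructions and restarts fetching from the correct address, as shown in the figure.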

    Using dynamic branch prediction with history tables, up to 90% of predictions can be correct.

    Both Pentium and PowerPC 620 use speculative execution with dynamic branch prediction based on a branch history table.

    Datorarkitektur F 4 - 15


    The Intel 80486 Pipeline

    The 80486 is the last x86 processor that is not superscalar. It is a typical example of an advanced non-superscalar pipeline.

    The 80486 has a five stage pipeline.

    No branch prediction is used; in effect, branches are always predicted not taken.

    Pipeline stages: Fetch instructions, Decode_1, Decode_2, Execute, Write back.

    Fetch: instructions fetched from cache and placed into the instruction queue (organised as two prefetch buffers). Operates independently of the other stages and tries to keep the prefetch buffers full.

    Decode_1: takes the first 3 bytes of the instruction and decodes opcode, addressing mode and instruction length; the rest of the instruction is decoded by Decode_2.

    Decode_2: decodes the rest of the instruction and produces control signals; performs address computation.

    Execute: ALU operations; cache access for operands.

    Write back: updates registers, status flags; for memory updates, sends values to the cache and to the write buffers.

    Datorarkitektur F 4 - 16


    The ARM pipeline

    Pipeline stages: Fetch, Decode, Execute.

    Fetch: instructions fetched from cache.

    Decode: instructions and operand registers decoded.

    Execute: registers read; shift and ALU operations; results or loaded data written back to register.

    ARM7 pipeline


    Datorarkitektur F 4 - 17


    The ARM pipeline (contd)

    Pipeline stages: Fetch, Decode, Execute, Data memory access, Register write.

    Fetch: instructions fetched from I-cache.

    Decode: instructions and operand registers decoded; registers read.

    Execute: shift and ALU operations.

    Data memory access: fetch/store data from/to D-cache.

    Register write: results or loaded data written back to register.

    ARM9 pipeline

    The performance of the ARM9 is significantly superior to the ARM7:

    Higher clock speed due to the larger number of pipeline stages.

    More even distribution of tasks among pipeline stages; tasks have been moved away from the execute stage.

    Datorarkitektur F 4 - 18


    The ARM pipeline (contd)

    Pipeline stages: Fetch 1, Fetch 2, Decode, Issue, Shift/Address, ALU/Memory 1, ALU/Memory 2, Writeback.

    The performance of ARM11 is further enhanced by:

    Higher clock speed due to a larger number of pipeline stages; more even distribution of tasks among pipeline stages.

    Branch prediction:

    - Dynamic two-bit prediction based on a 64-entry branch history table (branch target address cache - BTAC).

    - If the instruction is not in the BTAC, static prediction is done: taken if backward, not taken if forward.

    Decoupling of the load/store pipeline from the ALU&MAC (multiply-accumulate) pipeline. ALU operations can continue while load/store operations complete (see next slide).

    ARM11 pipeline

    Datorarkitektur F 4 - 19


    The ARM pipeline (contd)

    Pipeline stages: Fetch 1, Fetch 2, Decode, Issue, followed by two parallel pipelines: Shift, ALU 1, ALU 2, Writeback (the ALU/MAC pipeline) and Address, Memory 1, Memory 2, Writeback (the load/store pipeline).

    Fetch 1, 2: instructions fetched from I-cache; dynamic branch prediction.

    Decode: instructions decoded; static branch prediction (if needed).

    Issue: instruction issued; registers read.

    Address: address calculation.

    Memory 1, 2: data memory access.

    Writeback: write loaded data to register; commit stores.

    Shift: register shift/rotate.

    ALU 1, 2: ALU/MAC operations.

    Writeback: results written to register.

    ARM11 pipeline

    Datorarkitektur F 4 - 20


    Summary

    Branch instructions can dramatically affect pipeline performance. It is very important to reduce the penalties produced by branches.

    Instruction fetch units are able to recognize branch instructions and generate the target address. By fetching at a high rate from the instruction cache and keeping the instruction queue loaded, it is possible to reduce the penalty for unconditional branches to zero. For conditional branches this is not possible because we have to wait for the outcome of the decision.

    Delayed branching is a compiler-based technique aimed at reducing the branch penalty by moving instructions into the branch delay slot.

    Efficient reduction of the branch penalty for conditional branches needs a clever branch prediction strategy. Static branch prediction does not take into consideration execution history. Dynamic branch prediction is based on a record of the history of a conditional branch.

    Branch history tables are used to store both information on the outcome of branches and the target address of the respective branch.