pipeline2
TRANSCRIPT
8/7/2019 pipeline2
Datorarkitektur F 4 - 1
Petru Eles, IDA, LiTH
INSTRUCTION PIPELINING (II)
1. Reducing Pipeline Branch Penalties
2. Instruction Fetch Units and Instruction Queues
3. Delayed Branching
4. Branch Prediction Strategies
5. Static Branch Prediction
6. Dynamic Branch Prediction
7. Branch History Table
Datorarkitektur F 4 - 2
Reducing Pipeline Branch Penalties
Branch instructions can dramatically affect pipeline performance. Control operations (conditional and unconditional branches) are very frequent in current programs.
Some statistics:
- 20% - 35% of the instructions executed are branches (conditional and unconditional).
- Conditional branches are much more frequent than unconditional ones (more than twice as frequent). More than 50% of conditional branches are taken.
It is very important to reduce the penalties produced by branches.
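The cost implied by these statistics can be sketched with a small back-of-envelope model. The 25% branch fraction and 60% taken fraction below are illustrative assumptions within the quoted ranges, and the 3/2-cycle penalties are the ones shown for this pipeline on the following slides:

```python
# Back-of-envelope model of the average stall cycles branches add per
# instruction. All numbers are illustrative assumptions, not measurements
# of a specific machine.

def avg_branch_overhead(branch_frac, taken_frac, taken_penalty, not_taken_penalty):
    """Average stall cycles added per executed instruction by branches."""
    per_branch = taken_frac * taken_penalty + (1 - taken_frac) * not_taken_penalty
    return branch_frac * per_branch

# 25% branches, 60% of them taken, penalties of 3 (taken) and 2 (not taken)
overhead = avg_branch_overhead(0.25, 0.60, 3, 2)
print(f"extra cycles per instruction: {overhead:.2f}")  # extra cycles per instruction: 0.65
```

With an ideal CPI of 1, this would push the effective CPI to 1.65, which is why reducing branch penalties matters so much.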
Datorarkitektur F 4 - 3
Instruction Fetch Units and Instruction Queues
Most processors employ sophisticated fetch units that fetch instructions before they are needed and store them in a queue.
The fetch unit also has the ability to recognize branch instructions and to generate the target address. Thus, the penalty produced by unconditional branches can be drastically reduced: the fetch unit computes the target address and continues to fetch instructions from that address, which are sent to the queue. The rest of the pipeline thus gets a continuous stream of instructions, without stalling.
The rate at which instructions can be read (from the instruction cache) must be sufficiently high to avoid an empty queue.
With conditional branches, penalties cannot be avoided. The branch condition, which usually depends on the result of the preceding instruction, has to be known in order to determine the following instruction.
Observation: In the Pentium 4, the instruction cache (trace cache) is located between the fetch unit and the instruction queue (see F 2, slide 31).
[Figure: Instruction cache → Instruction Fetch Unit → Instruction Queue → Rest of the pipeline]
Datorarkitektur F 4 - 4
Delayed Branching
These are the pipeline sequences for a conditional branch instruction (see slide 12, lecture 3). Pipeline stages: FI DI CO FO EI WO; "st" marks a stall cycle.

Branch is taken (penalty: 3 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10  11
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TARGET        FI  DI  CO  FO  EI  WO
  the target            FI  st  st  FI  DI  CO  FO  EI  WO

Branch is not taken (penalty: 2 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TARGET        FI  DI  CO  FO  EI  WO
  instr i+1             FI  st  st  DI  CO  FO  EI  WO

The idea with delayed branching is to let the CPU do some useful work during some of the cycles which are shown above to be stalled.
With delayed branching the CPU always executes the instruction that immediately follows the branch and only then alters (if necessary) the sequence of execution. The instruction after the branch is said to be in the branch delay slot.
Datorarkitektur F 4 - 5
Delayed Branching (contd)
This is what the programmer has written:

  MUL R3,R4      R3 ← R3*R4
  SUB #1,R2      R2 ← R2-1
  ADD R1,R2      R1 ← R1+R2
  BEZ TAR        branch if zero
  MOVE #10,R1    R1 ← 10
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -

The compiler (assembler) has to find an instruction which can be moved from its original place into the branch delay slot after the branch and which will be executed regardless of the outcome of the branch. MUL R3,R4 is such an instruction: it does not influence any of the instructions which follow, up to the branch, and it also doesn't influence the outcome of the branch. MOVE #10,R1, on the other hand, should be executed only if the branch is not taken.
This is what the compiler (assembler) has produced and what actually will be executed:

  SUB #1,R2
  ADD R1,R2
  BEZ TAR
  MUL R3,R4      executed regardless of the condition (branch delay slot)
  MOVE #10,R1    executed only if the branch has not been taken
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -
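The eligibility test the compiler applies can be illustrated in a few lines. This is a simplified, register-only sketch; the dict-based instruction encoding is hypothetical:

```python
# Sketch of the dependence check a compiler (assembler) makes before moving
# an instruction into the branch delay slot: the candidate's result must not
# be read or overwritten by any instruction between its old place and the
# branch, and the candidate must not consume a value produced there.

def can_move_to_delay_slot(candidate, intervening):
    """True if `candidate` can be moved below the branch, past `intervening`."""
    for instr in intervening:
        # candidate's result must not be read or overwritten in between
        if candidate["writes"] & (instr["reads"] | instr["writes"]):
            return False
        # candidate must not depend on a value produced in between
        if candidate["reads"] & instr["writes"]:
            return False
    return True

MUL = {"reads": {"R3", "R4"}, "writes": {"R3"}}     # MUL R3,R4
SUB = {"reads": {"R2"}, "writes": {"R2"}}           # SUB #1,R2
ADD = {"reads": {"R1", "R2"}, "writes": {"R1"}}     # ADD R1,R2 (sets the condition)
MUL_R2 = {"reads": {"R2", "R4"}, "writes": {"R2"}}  # MUL R2,R4 (variant on slide F 4 - 7)

print(can_move_to_delay_slot(MUL, [SUB, ADD]))      # True: independent, can be moved
print(can_move_to_delay_slot(MUL_R2, [SUB, ADD]))   # False: R2 is used by SUB and ADD
```

Because ADD produces the branch condition, the read/write checks also guarantee that the moved instruction cannot influence the branch outcome.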
Datorarkitektur F 4 - 6
Delayed Branching (contd)
This happens in the pipeline ("st" marks a stall cycle):

Branch is taken (penalty: 2 cycles):

  Clock cycle   1   2   3   4   5   6   7   8   9   10  11
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TAR           FI  DI  CO  FO  EI  WO
  MUL R3,R4             FI  DI  CO  FO  EI  WO
  the target                FI  st  FI  DI  CO  FO  EI  WO

The target is fetched at the moment when both the condition (set by ADD) and the target address are known.

Branch is not taken (penalty: 1 cycle):

  Clock cycle   1   2   3   4   5   6   7   8   9   10
  ADD R1,R2     FI  DI  CO  FO  EI  WO
  BEZ TAR           FI  DI  CO  FO  EI  WO
  MUL R3,R4             FI  DI  CO  FO  EI  WO
  MOVE                      FI  st  DI  CO  FO  EI  WO

Once the condition is known, the MOVE can go on.
Datorarkitektur F 4 - 7
Delayed Branching (contd)
What happens if the compiler is not able to find an instruction to be moved after the branch, into the branch delay slot?
In this case a NOP instruction (an instruction that does nothing) has to be placed after the branch. The penalty will then be the same as without delayed branching.

  MUL R2,R4      now, with R2 as destination, this instruction influences the ones that follow and cannot be moved from its place
  SUB #1,R2
  ADD R1,R2
  BEZ TAR
  NOP
  MOVE #10,R1
      - - - - - - - - - - - - -
  TAR - - - - - - - - - - - - -

Some statistics show that for between 60% and 85% of branches, sophisticated compilers are able to find an instruction to be moved into the branch delay slot.
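Combining these fill-rate statistics with the penalties from the diagrams gives a rough expected penalty per branch. The 60% taken fraction is an illustrative assumption; the 2/1-cycle penalties (useful instruction in the slot) and 3/2-cycle penalties (NOP in the slot, i.e. as without delayed branching) are the ones shown above:

```python
# Rough expected branch penalty under delayed branching. All fractions are
# illustrative assumptions based on the statistics quoted in the text.

def expected_penalty(fill_rate, taken_frac=0.6):
    filled = taken_frac * 2 + (1 - taken_frac) * 1  # useful instruction in the slot
    nop = taken_frac * 3 + (1 - taken_frac) * 2     # NOP in the slot
    return fill_rate * filled + (1 - fill_rate) * nop

for f in (0.60, 0.85):
    print(f"fill rate {f:.0%}: {expected_penalty(f):.2f} cycles per branch")
# fill rate 60%: 2.00 cycles per branch
# fill rate 85%: 1.75 cycles per branch
```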
Datorarkitektur F 4 - 8
Branch Prediction
In the last example we have considered that the branch will not be taken, and we fetched the instruction following the branch; in case the branch was taken, the fetched instruction was discarded. As a result, we had a branch penalty of:
- 1 if the branch is not taken (prediction fulfilled);
- 2 if the branch is taken (prediction not fulfilled).
Let us consider the opposite prediction: branch taken. For this solution the target address needs to be computed in advance by the instruction fetch unit.
[Pipeline diagrams (predict taken): ADD R1,R2 followed by BEZ TAR. If the branch is taken, the prefetched target instruction enters the pipeline after a single stall cycle: penalty 1 cycle (prediction fulfilled). If the branch is not taken, the prefetched target must be discarded and the sequential instruction refetched: penalty 2 cycles (prediction not fulfilled).]
Datorarkitektur F 4 - 9
Branch Prediction (contd)
Correct branch prediction is very important and can produce substantial performance improvements.
Based on the predicted outcome, the respective instruction can be fetched, as well as the instructions following it, and they can be placed into the instruction queue (see slide 3). If, after the branch condition is computed, it turns out that the prediction was correct, execution continues. On the other hand, if the prediction is not fulfilled, the fetched instruction(s) must be discarded and the correct instruction must be fetched.
To take full advantage of branch prediction, we can have the instructions not only fetched but also begin execution. This is known as speculative execution.
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution path. If it turns out that the prediction was correct, execution goes on without introducing any branch penalty. If, however, the prediction is not fulfilled, the instruction(s) started in advance and all their associated data must be purged and the state previous to their execution restored.
Branch prediction strategies:
1. Static prediction
2. Dynamic prediction
Datorarkitektur F 4 - 10
Static Branch Prediction
Static prediction techniques do not take into consideration execution history.
Static approaches:
Predict never taken (Motorola 68020): assumesthat the branch is not taken.
Predict always taken: assumes that the branch istaken.
Predict depending on the branch direction(PowerPC 601):
- predict branch taken for backward branches;
- predict branch not taken for forward branches.
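The direction-based rule can be stated in one line of code: a branch is backward exactly when its target lies before the branch instruction. The addresses below are made-up examples:

```python
# Direction-based static prediction (the PowerPC 601 rule above): backward
# branches, which typically close loops, are predicted taken; forward
# branches are predicted not taken. Addresses are hypothetical.

def static_predict_taken(branch_addr, target_addr):
    # backward branch: the target lies before the branch instruction
    return target_addr < branch_addr

print(static_predict_taken(0x1040, 0x1000))  # True: backward (loop-closing) branch
print(static_predict_taken(0x1040, 0x1080))  # False: forward branch
```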
Datorarkitektur F 4 - 11
Dynamic Branch Prediction
Dynamic prediction techniques improve the accuracy of the prediction by recording the history of conditional branches.
One-Bit Prediction Scheme
One bit is used in order to record whether the last execution resulted in a branch taken or not. The system predicts the same behavior as for the last time.
Shortcoming
When a branch is almost always taken, then when it is not taken, we will predict incorrectly twice, rather than once:
- - - - - - - - - - -
LOOP - - - - - - - - - - -
- - - - - - - - - - -
BNZ LOOP
- - - - - - - - - - -
- After the loop has been executed for the first time and left, it will be remembered that BNZ has not been taken. Now, when the loop is executed again, after the first iteration there will be a false prediction; the following predictions are OK until the last iteration, when there will be a second false prediction.
- In this case the result is even worse than with static prediction considering that backward branches are always taken (the PowerPC 601 approach).
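The double-misprediction behavior described above can be reproduced with a few lines. The predictor is a minimal sketch; the loop branch below is assumed to be taken 9 times and then fall through, and the loop body is executed twice in a row:

```python
# One-bit prediction scheme: remember only the last outcome and predict it
# again. On a loop branch executed twice, it mispredicts twice per pass:
# once on re-entering the loop, once at the exit.

def run_one_bit(outcomes):
    state = False          # last observed outcome (False = not taken)
    mispredictions = 0
    for taken in outcomes:
        if state != taken:
            mispredictions += 1
        state = taken      # remember only the most recent outcome
    return mispredictions

loop = [True] * 9 + [False]        # one execution of the loop (BNZ LOOP)
print(run_one_bit(loop + loop))    # 4: two mispredictions per pass
```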
Datorarkitektur F 4 - 12
Two-Bit Prediction Scheme
With a two-bit scheme, predictions can be made depending on the last two instances of execution.
A typical scheme is to change the prediction only if there have been two incorrect predictions in a row.
- - - - - - - - - - -
LOOP - - - - - - - - - - -
- - - - - - - - - - -
BNZ LOOP
- - - - - - - - - - -
[State transition diagram: four states encoded 00, 01, 10, 11. States 11 and 10 predict "taken"; states 01 and 00 predict "not taken". Taken/not-taken outcomes move the automaton between the states in such a way that the prediction changes only after two incorrect predictions in a row.]
After the first execution of the loop the bits attached to BNZ will be 01; now, there will always be one false prediction for the loop, at its exit.
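A common formulation of the two-bit scheme is a saturating counter, which shows the change-only-after-two-mispredictions behavior. This is a sketch, not the exact automaton from the diagram; run on the same doubled loop as before (taken 9 times, then fall through), it mispredicts only once per pass after warming up:

```python
# Two-bit prediction as a saturating counter: states 0..3, where states
# 2 and 3 predict taken. A single misprediction moves the counter one step
# but does not flip the prediction; only two in a row can flip it.

def run_two_bit(outcomes, state=3):
    mispredictions = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken != taken:
            mispredictions += 1
        # saturating step toward the observed outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredictions

loop = [True] * 9 + [False]        # one execution of the loop (BNZ LOOP)
print(run_two_bit(loop + loop))    # 2: one misprediction per pass, at loop exit
```

Compared with the one-bit scheme, the misprediction on re-entering the loop disappears, which is exactly the shortcoming the two-bit scheme fixes.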
Datorarkitektur F 4 - 13
Branch History Table
History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculating the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.
[Figure: the instruction fetch unit, connected to the instruction cache, consults the branch history table to decide the address to fetch from. Each table entry holds the address of the branch instruction, the target address, and the prediction bits. When the branch reaches the EI stage: if the branch was not in the table, a new entry is added; if it was, the entry is updated, and if the prediction was not OK, the fetched instructions are discarded and fetching restarts from the correct address.]
Datorarkitektur F 4 - 14
Branch History Table (contd)
Some explanations to the previous figure:
- Address where to fetch from: If the branch instruction is not in the table, the next instruction (address PC+1) is to be fetched. If the branch instruction is in the table, first of all a prediction based on the prediction bits is made. Depending on the prediction outcome, the next instruction (address PC+1) or the instruction at the target address is to be fetched.
- Update entry: If the branch instruction has been in the table, the respective entry has to be updated to reflect the correct or incorrect prediction.
- Add new entry: If the branch instruction has not been in the table, it is added to the table with the corresponding information concerning branch outcome and target address. If needed, one of the existing table entries is discarded. Replacement algorithms similar to those for cache memories are used.
Using dynamic branch prediction with history tables, up to 90% of predictions can be correct.
Both the Pentium and the PowerPC 620 use speculative execution with dynamic branch prediction based on a branch history table.
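The table structure from the figure can be sketched in a few lines: entries keyed by the branch instruction's address, each holding the target address and a two-bit counter. The class and method names are invented for illustration, and the counter is the saturating variant sketched earlier:

```python
# Minimal sketch of a branch history table. A lookup returns the address to
# fetch from next; an update after the branch resolves adds or adjusts the
# entry (here with a 2-bit saturating counter; states 2 and 3 predict taken).

class BranchHistoryTable:
    def __init__(self):
        self.entries = {}                      # branch addr -> [target, 2-bit state]

    def lookup(self, branch_addr, fall_through):
        """Return the address where to fetch from."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            return fall_through                # not in table: fetch next instruction
        target, state = entry
        return target if state >= 2 else fall_through

    def update(self, branch_addr, target, taken):
        """Record the actual outcome once the branch has executed."""
        if branch_addr not in self.entries:    # add new entry
            self.entries[branch_addr] = [target, 2 if taken else 1]
            return
        entry = self.entries[branch_addr]      # update entry
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

bht = BranchHistoryTable()
print(hex(bht.lookup(0x100, 0x104)))   # 0x104: unknown branch, fetch fall-through
bht.update(0x100, 0x200, taken=True)
print(hex(bht.lookup(0x100, 0x104)))   # 0x200: now predicted taken, fetch the target
```

A real table is a small fixed-size cache indexed by address bits, with a replacement policy, rather than an unbounded dictionary.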
Datorarkitektur F 4 - 15
The Intel 80486 Pipeline
The 80486 is the last x86 processor that is not superscalar. It is a typical example of an advanced non-superscalar pipeline.
The 80486 has a five-stage pipeline.
There is no branch prediction (or, in fact, branches are always predicted not taken).
The five pipeline stages, in order: Fetch instructions → Decode_1 → Decode_2 → Execute → Write back.
Fetch: instructions are fetched from the cache and placed into the instruction queue (organised as two prefetch buffers). Operates independently of the other stages and tries to keep the prefetch buffers full.
Decode_1: takes the first 3 bytes of the instruction and decodes opcode, addressing mode, and instruction length; the rest of the instruction is decoded by Decode_2.
Decode_2: decodes the rest of the instruction and produces control signals; performs address computation.
Execute: ALU operations; cache access for operands.
Write back: updates registers and status flags; for memory updates, sends values to the cache and to the write buffers.
Datorarkitektur F 4 - 16
The ARM pipeline
ARM7 pipeline: Fetch → Decode → Execute.
Fetch: instructions fetched from the cache.
Decode: instructions and operand registers decoded.
Execute: registers read; shift and ALU operations; results or loaded data written back to registers.
Datorarkitektur F 4 - 17
The ARM pipeline (contd)
ARM9 pipeline: Fetch → Decode → Execute → Data memory access → Register write.
Fetch: instructions fetched from the I-cache.
Decode: instructions and operand registers decoded; registers read.
Execute: shift and ALU operations.
Data memory access: fetch/store data from/to the D-cache.
Register write: results or loaded data written back to registers.
The performance of the ARM9 is significantly superior to that of the ARM7:
- Higher clock speed due to the larger number of pipeline stages.
- More even distribution of tasks among pipeline stages; tasks have been moved away from the execute stage.
Datorarkitektur F 4 - 18
The ARM pipeline (contd)
ARM11 pipeline: Fetch 1 → Fetch 2 → Decode → Issue, followed by the execution pipelines (Shift/Address → ALU/Memory 1 → ALU/Memory 2 → Writeback).
The performance of the ARM11 is further enhanced by:
- Higher clock speed due to the larger number of pipeline stages; more even distribution of tasks among pipeline stages.
- Branch prediction:
  - dynamic two-bit prediction based on a 64-entry branch history table (branch target address cache, BTAC);
  - if the instruction is not in the BTAC, static prediction is done: taken if backward, not taken if forward.
- Decoupling of the load/store pipeline from the ALU&MAC (multiply-accumulate) pipeline: ALU operations can continue while load/store operations complete (see next slide).
Datorarkitektur F 4 - 19
The ARM pipeline (contd)
ARM11 pipeline (detailed): Fetch 1 → Fetch 2 → Decode → Issue, then two parallel pipelines: the ALU pipeline (Shift → ALU 1 → ALU 2 → Writeback) and the load/store pipeline (Address → Memory 1 → Memory 2 → Writeback).
Fetch 1, 2: instructions fetched from the I-cache; dynamic branch prediction.
Decode: instructions decoded; static branch prediction (if needed).
Issue: instruction issued; registers read.
Shift: register shift/rotate.
ALU 1, 2: ALU/MAC operations.
Writeback (ALU pipeline): results written to registers.
Address: address calculation.
Memory 1, 2: data memory access.
Writeback (load/store pipeline): loaded data written to registers; stores committed.
Datorarkitektur F 4 - 20
Summary
Branch instructions can dramatically affect pipeline performance. It is very important to reduce the penalties produced by branches.
Instruction fetch units are able to recognize branch instructions and generate the target address. By fetching at a high rate from the instruction cache and keeping the instruction queue loaded, it is possible to reduce the penalty for unconditional branches to zero. For conditional branches this is not possible, because we have to wait for the outcome of the decision.
Delayed branching is a compiler-based technique aimed at reducing the branch penalty by moving instructions into the branch delay slot.
Efficient reduction of the branch penalty for conditional branches needs a clever branch prediction strategy. Static branch prediction does not take into consideration execution history. Dynamic branch prediction is based on a record of the history of a conditional branch.
Branch history tables are used to store both information on the outcome of branches and the target address of the respective branch.