Branch Prediction
J. Nelson Amaral
Why Branch Prediction?
• One in every 5 to 7 instructions of a program is a branch
• Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units.
Baer p. 129
Anatomy of a Predictor
Baer p. 130
Anatomy of a Branch Predictor
• Event Source: the execution of the program
  – Predictive information:
    • Can be encoded in the instruction code
      – a bit indicates the most likely outcome
      – forward/backward branch
    • Obtained from some profiling information
Baer p. 130
Prog. Exec.
Anatomy of a Branch Predictor (cont.)
• Event Selection: when to predict?
  – Simple solution: compute the prediction for every instruction (even non-branches)
    • Only use the result of the prediction for branches
Baer p. 130
Event Selec.
Anatomy of a Branch Predictor (cont.)
• Prediction Indexing:
  – Use part of the PC to index prediction tables:
    • history of outcomes of previous branches at this PC
    • history of the execution path leading to this PC
Baer p. 130
Pred. Index.
Anatomy of a Branch Predictor (cont.)
• Predictor Mechanism:
  – Static (example):
    • forward: always not taken
    • backward: always taken
  – Dynamic:
    • Finite State Machine predictor: saturating counters
    • Markov predictor: correlation
Baer p. 131
Pred. Mechan.
Anatomy of a Branch Predictor (cont.)
• Feedback and Recovery:
  – Use real outcome to reinforce prediction
  – Must recover from mispredictions
Baer p. 131
Feedback
Control Flow Statistics

Application   % control flow   % cond. branches (% taken)   % uncond. (% direct)   % calls   % returns
SPEC95int     20.4             14.9 (46)                    1.1 (77)               2.2       2.1
Desktop       18.7             13 (39)                      1.1 (92)               2.4       2.1

A 4-way superscalar has to predict a branch, on average, every other cycle.
Baer p. 131
Interbranch Distances
40% of the time there is 1 or 0 cycles between predictions.
Branch resolution takes about 10 cycles.
If the prediction is wrong, up to 40 wrong-path instructions (4 per cycle × 10 cycles) are in flight by the time the resolution occurs.
Simulation for a 4-way out-of-order architecture
Baer p. 131
Static Predictions
Always Taken OR Always Not Taken
Baer p. 132
Static Predictions
• Early studies indicated that 2/3 of branches are taken
  – but 30% of those branches were unconditional!
• For conditional branches there appears to be no preferred direction.
Always Taken
Baer p. 132
Alternative Static Predictions
Forward Always Not Taken, Backward Always Taken
Accuracy improvements are barely noticeable.
Static prediction based on profiling is slightly better.
Static branch-not-taken has no implementation cost on the pipeline.
Baer p. 132
Dynamic Predictors
• Prediction of a given branch changes with the execution of the program.
  – Simple: a finite-state machine encodes the outcome of a few recent executions of the branch.
  – Elaborate: not only earlier branch outcomes, but also other correlated parts of the program are considered.
Baer p. 132
When to predict?
• Static prediction: at the Instruction Decode stage
  – Know that the instruction is a branch
• Dynamic prediction: at the Instruction Fetch stage
  – Calculate prediction for every instruction, even non-branch ones.
Baer p. 133
What to Predict?
• Branch Direction: Is the branch taken or not?
• Branch Target: Address of next instruction for a taken branch
Baer p. 133
Predicting Direction
• Where do we find the prediction?
• How to encode the prediction?
Look at the recent past:
What was the direction the last time this same branch was executed?
A single bit encodes the prediction:
Prediction bit is set at prediction time.
Baer p. 133
Prediction Hysteresis
• Look at the last two resolutions
  – Two wrong predictions are necessary to change the prediction
  – Motivated by wrong predictions at the end of inner loops.
Baer p. 133
2-Bit Saturating Counter
Last two instances were taken
Last instance was taken but the previous was not
Last two instances were not taken
Last instance was not taken but the previous was taken
Baer p. 134
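The four states above can be sketched as a small counter in C. This is only an illustration: the numeric encoding (0 to 3) and the names `counter2_t`, `counter2_predict`, and `counter2_update` are assumptions made for this sketch and do not come from the slides or from Baer.

```c
#include <stdbool.h>

/* Assumed encoding: 0 = strong not-taken, 1 = weak not-taken,
   2 = weak taken, 3 = strong taken. */
typedef unsigned char counter2_t;

/* Predict taken when the counter is in one of the two "taken" states. */
static bool counter2_predict(counter2_t c) {
    return c >= 2;
}

/* Move one step toward the actual outcome, saturating at 0 and 3. */
static counter2_t counter2_update(counter2_t c, bool taken) {
    if (taken)
        return (counter2_t)(c < 3 ? c + 1 : 3);
    return (counter2_t)(c > 0 ? c - 1 : 0);
}
```

Starting from strong taken, a single not-taken outcome still leaves the prediction at taken; a second one flips it. That is the hysteresis described on the previous slide.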
2-Bit Saturating Counter (Example)
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    begin S1; S2; …; Sk end;

[Figure: control-flow graph of the nested loop. Nodes: i ← 0; m ≤ 0; n ≥ 0; j ← 0; S1; S2; …; Sk; j ← j+1; j < n; i ← i+1; i < m. Edges labeled T and NT.]

1-bit predictor, inner-loop branch:
i   j   Pred   Outc
0   0   NT     T
0   1   T      T
0   n   T      NT
1   0   NT     T
1   1   T      T

2 × m mispredictions
Baer p. 134
2-Bit Saturating Counter (Example)
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    begin S1; S2; …; Sk end;

Inner-loop branch, 1-bit and 2-bit predictors:
        1-bit           2-bit
i   j   Pred   Outc     State   Pred   Outc
0   0   NT     T        wNT     NT     T
0   1   T      T        sT      T      T
0   n   T      NT       sT      T      NT
1   0   NT     T        wT      T      T
1   1   T      T        sT      T      T

1-bit: 2 × m mispredictions
2-bit: m + 1 mispredictions
Baer p. 134
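The two misprediction counts can be checked with a small simulation. The sketch below makes several assumptions that are not in the slides: the inner-loop branch is taken `taken_per_pass` times and then falls through once per pass, the 1-bit predictor starts at not-taken, and the 2-bit predictor is a plain saturating counter that starts at weak not-taken. The function names are made up for illustration.

```c
#include <stdbool.h>

/* Mispredictions of the inner-loop branch with a 1-bit predictor
   (assumed initial state: not taken). */
static int mispredicts_1bit(int passes, int taken_per_pass) {
    bool last = false;
    int misses = 0;
    for (int i = 0; i < passes; i++) {
        for (int j = 0; j <= taken_per_pass; j++) {
            bool outcome = (j < taken_per_pass);  /* last execution exits */
            if (last != outcome) misses++;
            last = outcome;
        }
    }
    return misses;
}

/* Same branch with a 2-bit saturating counter
   (assumed initial state: 1 = weak not-taken). */
static int mispredicts_2bit(int passes, int taken_per_pass) {
    int c = 1;
    int misses = 0;
    for (int i = 0; i < passes; i++) {
        for (int j = 0; j <= taken_per_pass; j++) {
            bool outcome = (j < taken_per_pass);
            if ((c >= 2) != outcome) misses++;
            if (outcome) { if (c < 3) c++; }
            else         { if (c > 0) c--; }
        }
    }
    return misses;
}
```

With 10 passes and 4 taken executions per pass, the 1-bit version misses 20 times (2 × m) and the 2-bit version 11 times (m + 1). Under these assumptions the 2-bit result needs at least two taken executions per pass so the counter can reach the strong state.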
Accuracy of Branch Prediction
• Includes unconditional branches
• Predictions are associated with branches after each branch’s first execution
3-bit counters yield only minor improvements
Baer p. 135
Average of 26 traces (IBM 370, DEC PDP-11, CDC 6400)
Average of 32 traces (MIPS R2000, Sun SPARC, DEC VAX, Motorola 68000)
Fixed prediction: determined by the first execution of the branch.
Where to store the Prediction
Need one (or two) bits for each possible branch address.
32-bit address → 2^30 entries
Storing prediction bits with instructions.
Need to modify code every 5 instructions.
Use a cache (Branch Prediction Buffer – BPB).
Many more bits for tags than for predictions.
Solution: ditch the tags.
Baer p. 136
Pattern History Table (PHT)
Use selected bits from PC to index (or hash) the PHT.
Aliasing: multiple branches may index the same PHT entry.
Performance degrades slightly.
Baer p. 136
Each entry of the PHT stores the state of a finite state machine associated with a branch.
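A minimal sketch of a tagless PHT of 2-bit counters indexed by PC bits. The table size (2K entries), the choice of dropping the two byte-offset bits of a 4-byte-aligned PC, and all identifiers are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define PHT_BITS 11                    /* 2K entries: an assumed size */
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t pht[PHT_SIZE];          /* 2-bit counters, all start at 0 */

/* Drop the 2 byte-offset bits, keep the next PHT_BITS bits of the PC. */
static uint32_t pht_index(uint32_t pc) {
    return (pc >> 2) & (PHT_SIZE - 1);
}

static bool pht_predict(uint32_t pc) {
    return pht[pht_index(pc)] >= 2;
}

static void pht_update(uint32_t pc, bool taken) {
    uint8_t *c = &pht[pht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```

Because there are no tags, two branches whose PCs differ only in the bits above the index (for example 0x10001000 and 0x10003000 here) share one counter. This is the aliasing mentioned above.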
Accuracy of Bimodal Predictor(based on PHT)
Based on 10 SPEC89 traces.
Baer p. 137
Separate PHT
Embedded in Instruction cache
Where Is the Predictor Stored?
Alpha 21264: 1 counter per instruction? (2K counters)
Sun UltraSPARC: 2 counters/cache line (2K counters)
AMD K5: 1 counter/cache line (1K counters)
MIPS R10000: (512 counters)
IBM PowerPC 620: (512 counters)
Intel Pentium: combines PHT with Branch Target Buffer (512 entries)
Baer p. 137
Feedback and Recovery
Baer p. 137
Feedback
Feedback: Bimodal Predictor
• Feedback: update the 2-bit counter for the executing branch
• When is the update done?
  – When the actual direction is found (EX stage): other predictions of the same branch may already have been made.
  – When the branch commits: even more predictions may have been made.
  – Speculatively, when the prediction is made: only reinforces the prediction in a bimodal predictor.
Textbook typo (p. 137): choice for the timing of the “update”.
Baer p. 137
EX/commit updating makes little difference in performance.
Local × Global Predictor
• Local:
  – Only use history of the branch to be predicted
• Global:
  – Use history of other branches that precede the branch to be predicted.
Baer p. 138
Motivation for Global Prediction
• Example from SPEC program eqntott:
if (aa == 2)    /* b1 */
    aa = 0;
if (bb == 2)    /* b2 */
    bb = 0;
if (aa != bb) { /* b3 */
    ….
}
If b1 and b2 are taken, both aa and bb become 0, so b3 is not taken.
Baer p. 138
Correlator Predictor
History Register
1 inserted to the right when a branch is taken (0 otherwise)
Shifted-out bits are lost
Two-level predictor.
Baer p. 139
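A minimal sketch of the two-level idea: a global history register selects a 2-bit counter in the PHT. The 4-bit history length and the identifiers are assumptions for illustration, and the update is done in one step, ignoring the timing issues discussed on the following slides.

```c
#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS 4                     /* history length: an assumption */
#define PHT_SIZE  (1u << HIST_BITS)

static uint32_t ghr;                    /* global history register */
static uint8_t  pht[PHT_SIZE];          /* 2-bit counters, start at 0 */

/* Shift the outcome in on the right; bits shifted out are lost. */
static void ghr_shift(bool taken) {
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}

/* First level: the history. Second level: the counter it selects. */
static bool corr_predict(void) {
    return pht[ghr] >= 2;
}

/* Train the counter selected by the current history, then record the outcome. */
static void corr_update(bool taken) {
    uint8_t *c = &pht[ghr];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghr_shift(taken);
}
```

After a short warm-up, a repeating outcome stream such as taken, taken, not taken is predicted without error in this sketch, because each position in the pattern is preceded by a distinct 4-bit history.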
Update Problem in the Correlator Predictor
• PHT is updated non-speculatively at commit stage.
• What is the problem with non-speculative updates of the global register?
Baer p. 139
Updating the Global Register in the Correlator Predictor
if (aa == 2)    /* b1 */
    aa = 0;
if (bb == 2)    /* b2 */
    bb = 0;
if (aa != bb) { /* b3 */
    ….
}

Event              Time
Prediction of b1   t
Prediction of b2   t+1
Prediction of b3   t+2
Commit of b1       t+5

Branches b1 and b2 are not included in the prediction of branch b3!
Baer p. 139
Updating the Global Register in the Correlator Predictor (cont.)
Mispredictions and cache misses affect the commit time of earlier branches.
• Two consecutive predictions of a branch b may use different ancestors of b.
  – Even if the path leading to b is the same
Baer p. 139
Solution to the Update Problem in the Correlator Predictor
• Update Global Register speculatively when prediction is made.
• New problem:
  – Need a repair mechanism
  – All bits after a misprediction are from branches in the wrong path.
Baer p. 139
Repair Mechanism for Global Register in the Correlator Predictor
• Decode Stage:
  – Checkpoint current GR into a FIFO queue
• Commit Stage:
  – H: head of the queue
  – The corresponding checkpointed GR is H.
  – Correct prediction: discard H
  – Incorrect prediction: shift branch outcome into H and make it the new GR.
Baer p. 144
Optimization to GR Checkpointing
Put into the queue a GR that has the corrected bit shifted into it.
Baer p. 144
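The basic (unoptimized) repair mechanism can be sketched as follows. The queue depth, the 8-bit history, the identifiers, and the decision to drop all younger checkpoints on a misprediction are assumptions made for this illustration; the slides do not specify them.

```c
#include <stdint.h>
#include <stdbool.h>

#define QCAP      16                  /* queue depth: an assumption */
#define HIST_MASK 0xFFu               /* 8-bit history: an assumption */

static uint32_t gr;                   /* speculatively updated global register */
static uint32_t queue[QCAP];          /* checkpointed GRs, oldest at head */
static int head, count;

/* When a branch is predicted: checkpoint the GR, then shift in the
   predicted direction. Assumes fewer than QCAP branches are in flight. */
static void on_predict(bool predicted_taken) {
    queue[(head + count) % QCAP] = gr;
    count++;
    gr = ((gr << 1) | (predicted_taken ? 1u : 0u)) & HIST_MASK;
}

/* When the oldest branch commits: H is the head of the queue. */
static void on_commit(bool predicted_taken, bool actual_taken) {
    uint32_t h = queue[head];
    head = (head + 1) % QCAP;
    count--;
    if (predicted_taken != actual_taken) {
        /* Shift the real outcome into H and make it the new GR. */
        gr = ((h << 1) | (actual_taken ? 1u : 0u)) & HIST_MASK;
        count = 0;  /* assumed: younger checkpoints are wrong-path, drop them */
    }
}
```

With the optimization on the slide above, `on_predict` would instead enqueue the GR with the opposite of the predicted bit already shifted in, so a misprediction only needs to copy H into the GR.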
Issues with Correlator Predictor
• For small PHTs
  – Performance is worse than local predictors
• It does not use the location of the branch in the program for the prediction
  – May introduce excessive aliasing
• Solution to the aliasing problem:
  – Reintroduce the PC in the indexing of PHT
Baer p. 140
gshare Predictor
A common hash is an XOR function.
Baer p. 141
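The XOR hash can be sketched in a few lines. The 12-bit index width, the dropped byte-offset bits, and the name `gshare_index` are assumptions for illustration only.

```c
#include <stdint.h>

#define PHT_BITS 12                        /* table size: an assumption */
#define PHT_MASK ((1u << PHT_BITS) - 1)

/* XOR the low PC bits (byte offset dropped) with the global history. */
static uint32_t gshare_index(uint32_t pc, uint32_t ghr) {
    return ((pc >> 2) ^ ghr) & PHT_MASK;
}
```

For the same history value, two nearby branches now select different PHT entries, which is what reintroducing the PC into the index is meant to achieve.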
Accuracy and Use of gshare
• Almost perfect for SPEC FP95.
• 0.83 accuracy for SPEC INT95
  – 0.65 for program go
Use: AMD K5, Sun UltraSPARC, IBM Power4
Baer p. 141
Example
• Assume n=4:
  – bimodal mispredicts 1 out of every 5 times
  – global mispredicts from 0 to 5 times depending on other branches in the loop
• This branch has a fixed pattern:
  – “4 taken, 1 not taken”
• How can this pattern be learned?
  – Remember the history of individual branches
• We need predictors more attuned to locality of individual branches
[Figure: control-flow graph of the nested loop (i ← 0; j ← 0; S1; S2; …; Sk; j ← j+1; j < n; i ← i+1; i < m), with edges labeled T and NT]
Baer p. 142
global-set predictor
• First Level: A global shift register for correlations
• Second Level: A set of multiple PHTs to prevent aliasing
  – expensive in terms of storage
    • must use few PHTs to be viable
Baer p. 142/143
set-global predictor
• Set of Branch History registers (BHT)
• A single global PHT
Baer p. 143
set-set predictor
• A set of branch history registers (BHT)
• A set of PHTs
Baer p. 143
Predicting the Branch Target
• When is the target of a branch computed?
  – In a superscalar architecture (e.g., the IA-32 of the Intel P6) after several pipeline stages.
• What is the point of predicting direction early if we don’t know where the branch goes?
  – Need to also predict the branch target address.
Baer p. 145
Branch Target Buffer (BTB)
• A cache-like storage that records branch addresses and associated targets
• If there is a hit in BTB for branch predicted taken:
  – PC ← Target in BTB for branch
Baer p. 146
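A minimal sketch of the BTB lookup described above. The direct-mapped organization, the 256-entry size, the full-PC tag, and all identifiers are assumptions chosen to keep the example short.

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_SETS 256                    /* direct-mapped, size assumed */

struct btb_entry { bool valid; uint32_t tag; uint32_t target; };
static struct btb_entry btb[BTB_SETS];

static uint32_t btb_set(uint32_t pc) {
    return (pc >> 2) % BTB_SETS;
}

/* Record a branch address and its target once the outcome is known. */
static void btb_insert(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[btb_set(pc)];
    e->valid = true;
    e->tag = pc;                        /* full PC kept as tag for simplicity */
    e->target = target;
}

/* Next fetch address: the BTB target on a hit for a predicted-taken
   branch, otherwise the fall-through address PC + 4. */
static uint32_t next_pc(uint32_t pc, bool predicted_taken) {
    const struct btb_entry *e = &btb[btb_set(pc)];
    if (predicted_taken && e->valid && e->tag == pc)
        return e->target;
    return pc + 4;
}
```

Unlike the tagless PHT, the BTB keeps tags: fetching from the wrong target is costly, so a branch that merely maps to the same set as another must be treated as a miss.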
Integrated BTB-PHT
• BTB needs much more space than the PHT
  – # of entries is limited by BTB.
• BTB must be accessed in a single cycle
Baer p. 146
Decoupled BTB-PHT
• Parallel BTB and PHT access
• if PHT says ‘taken’ and hit in BTB then PC ← Address in BTB
Baer p. 146
Decoupled BTB-PHT
• For space efficiency:
  – Only taken branches are added to BTB
    • They are added at the backend when the outcome is known.
IBM PowerPC 620: 256-entry, 2-way set-associative BTB; 2K-counter PHT
Baer p. 146
Integrating the BTB with the Branch History Table (BHT)
• The history of all branches needs to be recorded in BTB+BHT
• Taken and not taken branches need to be included
Most likely, it is not the same bit field from the PC that is used to index the BTB+BHT and to select the PHT
Intel P6: 4-bit local history, 512 BTB entries, # of PHTs not published
What happens on a BTB miss?
“Backward taken, forward not taken” prediction.
Baer p. 147
Two Instances of Mispredictions
• Direction of branch b is mispredicted
  – Recovery only when b is at the head of the reorder buffer
    • lots of instructions to be nullified
• BTB miss for branch b (direction is correctly predicted taken)
  – Cannot fetch instructions until target is computed
    • only affects the filling of the front end
Baer p. 147
misfetch
• Branch is correctly predicted taken and
• There is a hit in the BTB
• but target address is wrong
  – caused by indirect jumps
    • more common in object-oriented languages
  – can modify a BTB entry after two misfetches
    • need a counter with each BTB entry
Intel Pentium M: has an indirect branch predictor that associates global history registers with target addresses
Baer p. 148
Chapter 2 — Instructions: Language of the Computer — 53
CMPUT 229 Flashback: Procedure Call Instructions
• Procedure call: jump and link
  jal ProcedureLabel
  – Address of following instruction put in $ra
  – Jumps to target address
• Procedure return: jump register
  jr $ra
  – Copies $ra to program counter
  – Can also be used for computed jumps
    • e.g., for case/switch statements
P-H p. 113
Example fact(3)

int fact ( int n ) { if (n < 1) return(1); else return(n * fact(n-1)); }

MIPS assembly:
fact:
    sub $sp, $sp, 8      # make room in stack for 2 more items
    sw  $ra, 4($sp)      # save the return address
    sw  $a0, 0($sp)      # save the argument n
    slt $t0, $a0, 1      # if ($a0 < 1) then $t0 ← 1 else $t0 ← 0
    beq $t0, $zero, L1   # if n ≥ 1, go to L1
    add $v0, $zero, 1    # return 1
    add $sp, $sp, 8      # pop two items from the stack
    jr  $ra              # return to the instruction after jal
L1: sub $a0, $a0, 1      # subtract 1 from argument
    jal fact             # call fact(n-1)
    lw  $a0, 0($sp)      # just returned from jal: restore n
    lw  $ra, 4($sp)      # restore the return address
    add $sp, $sp, 8      # pop two items from the stack
    mul $v0, $a0, $v0    # return n*fact(n-1)
    jr  $ra              # return to the caller

Caller:
0x1000 3FFC  addi $a0,$zero,3
0x1000 4000  jal fact
0x1000 4004  ….

Processor state before the call: $a0 = 3, $sp = 0x1000 2000; $t0, $v0 and $ra not yet set.
[Figure: memory drawn from high address to low address, with $sp marking the top of the stack]
Pat.-Hen. pp. 136-138 and A-26/A-29
Example fact(3) (cont.)
Same code as on the previous slide.
Processor state after jal fact at 0x1000 4000 executes: $a0 = 3, $sp = 0x1000 2000, $ra = 0x1000 4004; $t0 and $v0 not yet set.
Pat.-Hen. pp. 136-138 and A-26/A-29
Example fact(3) (cont.)
Same code as on the previous slides.
Processor state at the end: $t0 = 1, $v0 = 6, $a0 = 3, $sp = 0x1000 2000, $ra = 0x1000 4004.
Stack contents written during the calls (from high address to low address):
  return address 0x1000 4004, n = 3
  return address 0x1000 6FEC, n = 2
  return address 0x1000 6FEC, n = 1
  return address 0x1000 6FEC, n = 0
Pat.-Hen. pp. 136-138 and A-26/A-29
Call/Return Mechanisms
foo(….){
  …
  0x10001000 jal bar
  0x10001004 …
  …
  0x10001800 jal bar
  0x10001804 …
  …
  0x10001CE4 jal bar
  0x10001CE8 …
  ...
}
bar(….){
  …
  0x1000F0E0 jal baz
  0x1000F0E4 …
  ...
  jr $ra
}
baz(….){
  ...
  jr $ra
}
How to predict the next instruction to be executed after the return?
We know that the branch is always taken.
The return address is known since the time of each call!
Baer p. 150
Return Address Stack
Push return address into stack at the function call.
Pop address from stack at return.
Stack is implemented as a circular buffer. Wrong address on overflow. What is the best strategy to handle overflow?
Baer p. 150
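A minimal sketch of a return address stack kept in a small circular array. The depth of 4, the 4-byte instruction size, and the identifiers are assumptions for illustration; on overflow this version simply overwrites the oldest entry, which is one possible strategy and not necessarily the best one.

```c
#include <stdint.h>

#define RAS_SIZE 4                      /* depth: an assumption */

static uint32_t ras[RAS_SIZE];
static unsigned ras_top;                /* index of the next free slot */

/* On a call: push the address of the instruction after the call. */
static void ras_push(uint32_t call_pc) {
    ras[ras_top % RAS_SIZE] = call_pc + 4;
    ras_top++;
}

/* On a return: pop the predicted return address. */
static uint32_t ras_pop(void) {
    ras_top--;
    return ras[ras_top % RAS_SIZE];
}
```

With the addresses from the slide, pushing the calls at 0x10001000 (foo calls bar) and 0x1000F0E0 (bar calls baz) and then popping twice yields 0x1000F0E4 and then 0x10001004. With more than four nested calls, the oldest entry is overwritten and the outermost return is predicted wrongly.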
Speculative calls and returns
foo(….){
  …
  0x10000FFC beq … target
  0x10001000 jal bar
  0x10001004 …
  …
target:
  0x10001800 jal baz
  0x10001804 …
  …
  0x10001CE4 jal bar
  0x10001CE8 …
  ...
}
bar(….){
  …
  0x1000F0E0 bne … next
  0x1000F0E4 jr $ra
  ...
next: ….
}
Function calls and returns executed in the predicted path of a branch change the return address stack.
Need a recovery mechanism for the return address stack.
If a single path is followed, save the pointer to the top of the stack on a branch prediction and restore it in case of misprediction.
Baer p. 150
Return Stacks
MIPS R10000: 1-entry return stack
DEC Alpha 21164: 12-entry return stack
Intel Pentium III: 16-entry return stack
Baer p. 151
A different way of doing things…
Don’t know which way to go?
“Some people go both ways.”
(Scarecrow, The Wizard of Oz)
Baer p. 151
IBM System 360/91
• Upon decoding a branch:
  – fetch, decode, and enqueue both the taken and the not taken paths into separate buffers
• Upon branch resolution:
  – one buffer becomes the execution path
  – the other is discarded
Baer p. 151
In a restricted version …
Branch is predicted taken
There is a BTB hit
[Figure: an instruction cache line holding the branch instruction and the fall-through instructions in the cache line; a resume buffer]
On a misprediction: fetch from the Resume Buffer!
MIPS R10000, Intel P6
Baer p. 151
Loop Detector
• A separate loop predictor detects loop patterns:
  – TTTTTTTNTTTTTTTNTTTTTTTNTTTTTTTNTT….
• Uses a separate counter for each recognized loop
Intel Pentium M
Baer p. 151
Sophisticated Predictors
• Tension:
  – Branch Correlation (global information) × Individual Branch Patterns (local information)
• neutral aliasing
  – between branches biased the same way
• destructive aliasing
  – between branches with opposite bias
• bias bit
  – added to BTB
  – PHT predicts if direction agrees with the bias bit
    • two branches with strong opposite bias that alias do not destroy each other’s prediction.
Baer p. 152
skewed predictor
• Goal: reduce aliasing
• Use three PHTs
  – different hashing function for each PHT
  – Take majority vote
Baer p. 153
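A minimal sketch of the majority vote over three differently hashed PHTs. The three hash functions below are made up for illustration and are not the ones used in any published skewed predictor; the bank size, the total-update policy (train all three banks), and the identifiers are also assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define BANK_BITS 10                         /* bank size: an assumption */
#define BANK_MASK ((1u << BANK_BITS) - 1)

static uint8_t bank[3][1u << BANK_BITS];     /* three PHTs of 2-bit counters */

/* Illustrative hash functions only; one per bank. */
static uint32_t skew_hash(int b, uint32_t pc, uint32_t hist) {
    uint32_t x = pc >> 2;                    /* drop the byte offset */
    if (b == 0) return (x ^ hist) & BANK_MASK;
    if (b == 1) return (x ^ (x >> BANK_BITS) ^ (hist << 1)) & BANK_MASK;
    return (x ^ (x >> 5) ^ (hist << 2)) & BANK_MASK;
}

/* Each bank votes with its own counter; the majority wins. */
static bool skew_predict(uint32_t pc, uint32_t hist) {
    int votes = 0;
    for (int b = 0; b < 3; b++)
        if (bank[b][skew_hash(b, pc, hist)] >= 2) votes++;
    return votes >= 2;
}

/* Total update: train all three banks (a simplifying assumption). */
static void skew_update(uint32_t pc, uint32_t hist, bool taken) {
    for (int b = 0; b < 3; b++) {
        uint8_t *c = &bank[b][skew_hash(b, pc, hist)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }
}
```

With these hashes, the branches at 0x10001000 and 0x10002000 collide in bank 0 but not in banks 1 and 2, so training one of them does not change the majority prediction for the other.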
hybrid (or combining) predictor
Two different prediction strategies
Tournament predictor: predicts which strategy should be used
Baer p. 156
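A minimal sketch of the selection step in a combining predictor. It assumes a single 2-bit chooser counter, whereas real designs typically keep a table of choosers indexed by PC or history; the two component predictions are passed in as plain booleans, and the identifiers are made up for illustration.

```c
#include <stdbool.h>

/* Assumed encoding: 0 and 1 favor predictor A, 2 and 3 favor predictor B. */
static int chooser = 1;

/* Pick the prediction of the component the chooser currently favors. */
static bool tournament_predict(bool pred_a, bool pred_b) {
    return chooser >= 2 ? pred_b : pred_a;
}

/* When exactly one component was right, move the chooser toward it. */
static void tournament_update(bool pred_a, bool pred_b, bool taken) {
    bool a_ok = (pred_a == taken);
    bool b_ok = (pred_b == taken);
    if (b_ok && !a_ok && chooser < 3) chooser++;
    if (a_ok && !b_ok && chooser > 0) chooser--;
}
```

When both components agree, right or wrong, the chooser is left unchanged, since the outcome says nothing about which strategy is better for this branch.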
Tournament Predictor
Baer p. 155