Lecture 3: Branch Prediction
Young Cho
Graduate Computer Architecture I
2 - CSE/ESE 560M – Graduate Computer Architecture I
Cycles Per Instruction
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count
$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} CPI_j \times I_j$$
“Instruction Frequency”
$$CPI = \sum_{j=1}^{n} CPI_j \times F_j \quad \text{where } F_j = \frac{I_j}{\text{Instruction Count}}$$
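The weighted-average CPI relation on this slide can be checked with a short calculation. The following is a minimal Python sketch; the instruction classes, counts, per-class CPIs, and the 2 ns cycle time are made-up illustrative values, not figures from the lecture.

```python
# Hypothetical instruction mix: class -> (instruction count I_j, cycles CPI_j).
mix = {
    "alu":    (50, 1),
    "load":   (20, 2),
    "store":  (10, 2),
    "branch": (20, 3),
}

def average_cpi(mix):
    """CPI = sum_j CPI_j * F_j, where F_j = I_j / Instruction Count."""
    instruction_count = sum(count for count, _ in mix.values())
    return sum(cpi * (count / instruction_count) for count, cpi in mix.values())

def cpu_time(mix, cycle_time):
    """CPU time = Cycle Time * sum_j CPI_j * I_j."""
    total_cycles = sum(cpi * count for count, cpi in mix.values())
    return cycle_time * total_cycles

cpi = average_cpi(mix)
time_ns = cpu_time(mix, cycle_time=2.0)  # assumed 2 ns cycle time
print(f"CPI = {cpi:.2f}, CPU time = {time_ns:.0f} ns")
```

Both functions use the same per-class data, so the frequency-weighted CPI and the total cycle count stay consistent with each other.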

Typical Load/Store Processor
[Figure: pipelined datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Pipelining Laundry
[Figure: pipelined laundry timeline with consecutive intervals of 30, 35, 35, 35, and 25 minutes.]
Three sets of clean clothes in 2 hours 40 minutes
With a large number of sets, each load takes an average of ~35 minutes to complete
3X increase in productivity!!!

Introducing Problems
• Hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (a single person cannot dry and iron clothes simultaneously)
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock: both are needed before putting them away)
– Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

Data Hazards
• Read After Write (RAW)
– Instr2 tries to read operand before Instr1 writes it
– Caused by a “dependence” in compiler terms
• Write After Read (WAR)
– Instr2 writes operand before Instr1 reads it
– Called an “anti-dependence” in compiler terms
• Write After Write (WAW)
– Instr2 writes operand before Instr1 writes it
– Called an “output dependence” in compiler terms
• WAR and WAW arise in more complex systems

Branch Hazard (Control)
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, and Reg, staggered by one cycle.]
Three instructions are already in the pipeline before the instruction at the branch target can be fetched.

Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% of DLX branches not taken on average
– PC+4 already calculated, so use it to get next instr
• Predict Branch Taken
– 53% of DLX branches taken on average
– DLX still incurs 1 cycle branch penalty
– Other machines: branch target known before outcome

Branch Hazard Alternatives
• Delayed Branch
– Define branch to take place AFTER a following instruction (fill in the branch delay slot)
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken
  (Branch delay of length n)
– A 1-slot delay allows the proper decision and branch target address in a 5-stage pipeline

Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
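The speedup formula can be evaluated for each scheme in the table. This is a rough Python sketch; the pipeline depth of 5 is an assumption (the slide does not state it), as is the choice to charge predict-not-taken only for the 65% of branches that change the PC. The printed values land close to the table, which appears to round intermediate figures.

```python
def pipeline_speedup(depth, branch_freq, branch_penalty):
    """Return (speedup, CPI) with speedup = depth / (1 + freq * penalty)."""
    cpi = 1.0 + branch_freq * branch_penalty
    return depth / cpi, cpi

DEPTH = 5            # assumed pipeline depth (not stated on the slide)
BRANCH_FREQ = 0.14   # conditional + unconditional branches
TAKEN_FRAC = 0.65    # fraction of branches that change the PC

# Effective penalty per branch, in cycles, for each scheme.
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * TAKEN_FRAC,  # pay only when the branch is taken
    "Delayed branch":    0.5,
}

stall_speedup, _ = pipeline_speedup(DEPTH, BRANCH_FREQ, schemes["Stall pipeline"])
for name, penalty in schemes.items():
    speedup, cpi = pipeline_speedup(DEPTH, BRANCH_FREQ, penalty)
    ratio = speedup / stall_speedup
    print(f"{name:18s} CPI={cpi:.2f} speedup={speedup:.2f} vs. stall={ratio:.2f}")
```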

Solution to Hazards
• Structural Hazards
– Delaying HW Dependent Instruction
– Increase Resources (e.g. dual port memory)
• Data Hazards
– Data Forwarding
– Software Scheduling
• Control Hazards
– Pipeline Stalling
– Predict and Flush
– Fill Delay Slots with Previous Instructions

Administrative
• Literature Survey
– One Q&A per Literature
– Q&A should show that you read the paper
• Changes in Schedule
– Need to be out of town on Oct 4th (Tuesday)
– Quiz 2 moved up 1 lecture
• Tool and VHDL help

Typical Pipeline
• Example: MIPS R4000
[Figure: pipeline with IF, ID, MEM, and WB stages and four parallel execution units between ID and MEM: integer unit (ex), FP/int multiply (m1 to m7), FP adder (a1 to a4), and FP/int divider (Div, latency = 25, initiation interval = 25).]

Prediction
• Easy to fetch multiple (consecutive) instructions per cycle
– Essentially speculating on sequential flow
• Jump: unconditional change of control flow
– Always taken
• Branch: conditional change of control flow
– Taken typically ~50% of the time in applications
  • Backward: 30% of branches × 80% taken = ~24%
  • Forward: 70% of branches × 40% taken = ~28%

Current Ideas
• Reactive
– Adapt Current Action based on the Past
– TCP windows
– URL completion, ...
• Proactive
– Anticipate Future Action based on the Past
– Branch prediction
– Long Cache block
– Tracing

Branch Prediction Schemes
• Static Branch Prediction
• Dynamic Branch Prediction
– 1-bit Branch-Prediction Buffer
– 2-bit Branch-Prediction Buffer
– Correlating Branch Prediction Buffer
– Tournament Branch Predictor
• Branch Target Buffer
• Integrated Instruction Fetch Units
• Return Address Predictors

Static Branch Prediction
• Execution profiling
– Very accurate if you actually take the time to profile
– Inconvenient
• Heuristics based on nesting and coding
– Simple heuristics are very inaccurate
• Programmer-supplied hints...
– Inconvenient and potentially inaccurate

Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of mis-prediction)
• 1-bit Branch History Table
– Bitmap indexed by the lower bits of the PC address
– Says whether or not the branch was taken last time
– If the instruction is a branch, predict and update the table
• Problem
– 1-bit BHT will cause 2 mis-predictions for loops
  • First time through the loop, it predicts exit instead of loop
  • At the end of the loop, it predicts loop instead of exit
– Avg is 9 iterations before exit
  • Only 80% accuracy even if the branch loops 90% of the time
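The 80% figure can be reproduced with a small simulation. This is a minimal Python sketch of a single 1-bit BHT entry, assuming a loop branch that is taken 9 times and then falls through once, and an entry that starts out predicting not taken; the function name and trace are illustrative.

```python
def simulate_one_bit(outcomes, initial_prediction=False):
    """Count correct predictions of a single 1-bit BHT entry.

    outcomes: iterable of booleans, True = branch taken.
    The entry remembers the last outcome and predicts it again.
    """
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # 1-bit update: remember what happened last time
    return correct

# Loop branch: taken 9 times, then not taken on exit; the loop runs 100 times.
loop = [True] * 9 + [False]
trace = loop * 100

correct = simulate_one_bit(trace)
accuracy = correct / len(trace)
print(f"accuracy = {accuracy:.0%}")
```

Each pass through the loop mispredicts twice: once on re-entry (the entry still remembers the exit) and once on the exit itself.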

N-bit Dynamic Branch Prediction
• N-bit scheme where the prediction changes only after mispredicting N times
[Figure: 2-bit state diagram with two Predict Taken states and two Predict Not Taken states; taken (T) and not taken (NT) outcomes move between adjacent states.]
2-bit Scheme: Saturates the prediction up to 2 times
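A 2-bit saturating counter can be simulated in a few lines. This is a minimal Python sketch of one common encoding (states 0 and 1 predict not taken, states 2 and 3 predict taken), which is an assumption here; the loop trace is illustrative and the counter starts at strongly not taken.

```python
def simulate_two_bit(outcomes, state=0):
    """Count correct predictions of one 2-bit saturating counter.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """
    correct = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken == taken:
            correct += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# Loop branch: taken 9 times, then not taken on exit; the loop runs 100 times.
loop = [True] * 9 + [False]
trace = loop * 100
correct = simulate_two_bit(trace)
print(f"accuracy = {correct / len(trace):.1%}")
```

After warm-up, only the loop exit is mispredicted, because a single not-taken outcome moves the counter from state 3 to state 2 and the prediction stays taken.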

Correlating Branches
• (2,2) predictor
– 2-bit global: indicates the behavior of the last two branches
– 2-bit local (2-bit Dynamic Branch Prediction)
• Branch History Table
– Global branch history is used to choose one of four history bitmap tables
– Predicts the branch behavior, then updates only the selected bitmap table
[Figure: 4 bits of the branch address index the tables; the 2-bit recent global branch history (01 = not taken then taken) selects which table supplies the prediction.]
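The table selection described on this slide can be sketched as follows. This is an illustrative Python model, not the lecture's implementation: the class name, the 4-bit index, the counter initialization, and the word-aligned PC shift are all assumptions.

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select one of four tables
    of 2-bit saturating counters, each indexed by low-order address bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # outcomes of the last two branches, newest in bit 0
        # tables[history][index] is a 2-bit counter, started at weakly not taken.
        self.tables = [[1] * (1 << index_bits) for _ in range(4)]

    def _index(self, pc):
        return (pc >> 2) & self.mask  # assumes 4-byte aligned instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]  # only the selected table is updated
        i = self._index(pc)
        table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & 0b11


# A branch that strictly alternates taken / not taken is handled poorly by a
# plain 2-bit counter but is captured once global history selects the table.
predictor = CorrelatingPredictor()
pc = 0x40
trace = [True, False] * 50
correct = 0
for taken in trace:
    correct += predictor.predict(pc) == taken
    predictor.update(pc, taken)
print(f"{correct} of {len(trace)} predictions correct")
```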

Accuracy of Different Schemes
[Figure: bar chart of frequency of mispredictions (0% to 20%) for nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

BHT Accuracy
• Mispredict because either:
– Wrong guess for the branch
– Wrong index for the branch (the entry holds the history of a different branch)
• 4096-entry table
– Programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92
– 4096 entries are about as good as an infinite table

Tournament Branch Predictors
• Correlating Predictor
– 2-bit predictor failed on important branches
– Better results by also using global information
• Tournament Predictors
– 1 Predictor based on global information
– 1 Predictor based on local information
– Use the predictor that guesses better
[Figure: the branch address (addr) feeds Predictor A and Predictor B.]
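One way to picture "use the predictor that guesses better" is a table of chooser counters trained only when the two predictors disagree. The Python sketch below is a simplified model under stated assumptions: the class name, table sizes, counter initialization, and the chooser policy are illustrative choices, not details given in the lecture.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter toward 3 (up) or toward 0 (down)."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Chooser counters pick, per branch, between a local 2-bit predictor (A)
    and a predictor indexed by global branch history (B)."""

    def __init__(self, entries=16, history_bits=4):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.local = [1] * entries                 # predictor A, indexed by PC
        self.global_ = [1] * (1 << history_bits)   # predictor B, indexed by history
        self.chooser = [1] * entries               # >= 2 means "trust B"

    def _slot(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc):
        slot = self._slot(pc)
        a = self.local[slot] >= 2
        b = self.global_[self.history] >= 2
        return b if self.chooser[slot] >= 2 else a

    def update(self, pc, taken):
        slot = self._slot(pc)
        a = self.local[slot] >= 2
        b = self.global_[self.history] >= 2
        if a != b:  # train the chooser only when the predictors disagree
            self.chooser[slot] = bump(self.chooser[slot], b == taken)
        self.local[slot] = bump(self.local[slot], taken)
        self.global_[self.history] = bump(self.global_[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = TournamentPredictor()
trace = [True, False] * 100   # alternating branch: hard for A, easy for B
correct = 0
for taken in trace:
    correct += predictor.predict(0x40) == taken
    predictor.update(0x40, taken)
print(f"{correct} of {len(trace)} predictions correct")
```

On this trace the chooser quickly saturates toward the global predictor, since the local counter keeps flipping while the history-indexed counters settle.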

Alpha 21264
• 4K 2-bit counters to choose from among a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
– 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
– Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
– Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
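The storage total follows directly from the table sizes listed on the slide. A short Python check, with variable names chosen here for illustration:

```python
K = 1024  # entries per "K"

chooser_bits = 4 * K * 2    # 4K 2-bit chooser counters
global_bits = 4 * K * 2     # 4K 2-bit global predictor entries
history_bits = 1 * K * 10   # 1K local history entries, 10 bits each
local_bits = 1 * K * 3      # 1K 3-bit local saturating counters

total_bits = chooser_bits + global_bits + history_bits + local_bits
print(total_bits // K, "Kbits")

# A 12-bit global history covers exactly the 4K global predictor entries,
# and a 10-bit local history covers exactly the 1K local counters.
assert 2 ** 12 == 4 * K
assert 2 ** 10 == 1 * K
```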

Branch Prediction Accuracy
[Figure: bar chart of branch prediction accuracy (0% to 100%) for gcc, espresso, li, fpppp, doduc, and tomcatv, comparing profile-based, 2-bit dynamic, and tournament prediction.]
Values as extracted, six per series:
94%, 96%, 98%, 98%, 97%, 100%
70%, 82%, 77%, 82%, 84%, 99%
88%, 86%, 88%, 86%, 95%, 99%

Accuracy versus Size
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors.]

Branch Target Buffer
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since the wrong branch address cannot be used
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored Branch PC; each entry also holds a Predicted PC and extra prediction state bits.]
• Yes: instruction is a branch; use the predicted PC as the next PC
• No: branch not predicted; proceed normally (Next PC = PC+4)
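The lookup on this slide can be sketched as a small direct-mapped table with a tag check. This Python model is illustrative only: the class name, the 64-entry size, the word-aligned index, and the policy of dropping an entry when its branch is not taken are assumptions, and the extra prediction state bits are omitted.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry stores a branch PC tag and its predicted target."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, predicted_pc) or None

    def _slot(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def next_pc(self, pc):
        """Return the PC to fetch next while the instruction at pc is in FETCH."""
        entry = self.table[self._slot(pc)]
        if entry is not None and entry[0] == pc:   # the "=?" tag comparison
            return entry[1]                        # hit: use the predicted PC
        return pc + 4                              # miss: proceed sequentially

    def update(self, pc, taken, target):
        """After the branch resolves, record taken branches, drop not-taken ones."""
        slot = self._slot(pc)
        if taken:
            self.table[slot] = (pc, target)
        elif self.table[slot] is not None and self.table[slot][0] == pc:
            self.table[slot] = None


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x10)))      # no entry yet, so fetch falls through to PC+4
btb.update(0x10, taken=True, target=0x24)
print(hex(btb.next_pc(0x10)))      # entry present, so fetch is redirected
```

The tag comparison matters because different addresses can share a slot; without it, a non-branch instruction that aliases to the same slot would be redirected to the wrong target.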

Predicated Execution
• Built-in Hardware Support
– Bit for predicated instruction execution
– Both paths are in the code
– Execution based on the result of the condition
• No Branch Prediction is Required
– Instructions not selected are ignored
– Similar to inserting a Nop

Zero Cycle Jump
• What really has to be done at runtime?
– Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
– Very limited form of dynamic compilation?
• Use of “Pre-decoded” instruction cache
– Called “branch folding” in the Bell-Labs CRISP processor.
– Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction
– Notice that JAL introduces a structural hazard on write
A:  and r3,r1,r5
    addi r2,r3,#4
    sub r4,r2,r1
    jal doit
    subi r1,r1,#1
Internal Cache state:
    and r3,r1,r5     A+4    N
    addi r2,r3,#4    A+8    N
    sub r4,r2,r1     doit   L
    ---              -----
    subi r1,r1,#1    A+20   N

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of scalar execution
• Branch History Table
– 2 bits for loop accuracy
• Correlation
– Recently executed branches are correlated with the next branch
– Either different branches
– Or different executions of the same branch
• Tournament Predictor
– Devote more resources to competing predictors and pick between them
• Branch Target Buffer
– Branch address & prediction
• Predicated Execution
– No need for prediction
– Hardware support needed