Advanced Microarchitecture Lecture 4: Branch Predictors

Post on 15-Jan-2016


Page 1: Lecture 4: Branch Predictors. Direction: 0 or 1 Target: 32- or 64-bit value Turns out targets are generally easier to predict –Don’t need to predict NT

Advanced Microarchitecture
Lecture 4: Branch Predictors

Page 2: Direction vs. Target

• Direction: 0 or 1
• Target: 32- or 64-bit value

• Turns out targets are generally easier to predict
  – Don’t need to predict the NT target
  – The T target doesn’t usually change
    • or has a “nice” pattern, like subroutine returns

Lecture 4: Correlated Branch Predictors

Page 3: Branches Have Locality

• If a branch was previously taken, there’s a good chance it’ll be taken again in the future

    for(i = 0; i < 100000; i++){
        /* do stuff */
    }

This branch will be taken 99,999 times in a row.

Page 4: Simple Predictor

• Always predict NT
  – no fetch bubbles (always just fetch the next line)
  – does horribly on the previous for-loop example

• Always predict T
  – does pretty well on the previous example
  – but what if you have other control besides loops?

    p = calloc(num, sizeof(*p));
    if(p == NULL)
        error_handler( );

This branch is practically never taken.

Page 5: Last Outcome Predictor

• Do what you did last time

    0xDC08: for(i = 0; i < 100000; i++){
    0xDC44:     if( (i % 100) == 0 )
                    tick( );
    0xDC50:     if( (i & 1) == 1 )
                    odd( );
            }

Page 6: Misprediction Rates?

How often is the branch outcome != the previous outcome?

DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT ...  (100,000 iterations)
      2 / 100,000 mispredicts  →  99.998% prediction rate

DC44: TTTTT ... TNTTTTT ... TNTTTTT ...
      2 / 100  →  98.0%

DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT ...
      2 / 2  →  0.0%

Page 7: Saturating Two-Bit Counter

[FSM diagrams: last-outcome prediction uses two states (0 = predict NT, 1 = predict T). The 2bC (2-bit counter) uses four states 0–3, where states 0 and 1 predict NT and states 2 and 3 predict T. A T outcome transitions toward state 3; an NT outcome transitions toward state 0.]

Page 8: Example

[Worked warm-up example, 1bC vs. 2bC on the DC08 loop branch. The 1bC starts untrained at 0, mispredicts the first T, then predicts the remaining T’s correctly; each loop-ending N flips it back to 0, costing a second misprediction on the following T. The 2bC starts at 0 and takes a couple of T outcomes to saturate at 3, but each loop-ending N only drops it to state 2, so it still predicts T and mispredicts just once per loop.]

Only 1 mispredict per N branches now!
DC08: 99.999%   DC44: 99.0%

Page 9: Importance of Branches

• 98% → 99%
  – Whoop-Dee-Do!
  – Actually, it’s a 2% misprediction rate → 1%
  – That’s a halving of the number of mispredictions

• So what?
  – If the misprediction rate equals 50%, and 1 in 5 instructions is a branch, then the number of useful instructions that we can fetch is:
    5 × (1 + ½ + (½)² + (½)³ + … ) = 10
  – If we halve the miss rate down to 25%:
    5 × (1 + ¾ + (¾)² + (¾)³ + … ) = 20
  – Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from

Page 10: Typical Organization of 2bC Predictor

[Figure: the PC (32 or 64 bits) is hashed down to log2(n) bits, which index a table of n counters; the selected counter supplies the prediction. FSM update logic takes the actual outcome and writes the updated counter back into the table.]

… back to predictors

Page 11: Typical Hash

• Just take the log2(n) least significant bits of the PC
• May need to ignore a few bits
  – In a 32-bit RISC ISA, all instructions are 4 bytes wide and all instruction addresses are 4-byte aligned → the two least significant bits of the PC are always zero and so they are not included
    • equivalent to right-shifting the PC by two positions before hashing
  – In a variable-length CISC ISA (ex. x86), instructions may start on arbitrary byte boundaries
    • probably don’t want to shift

Page 12: How about the Branch at 0xDC50?

• 1bC and 2bC don’t do too well (50% at best)
• But it’s still obviously predictable
• Why?
  – It has a repeating pattern: (NT)*
  – How about other patterns? (TTNTN)*
• Use branch correlation
  – The outcome of a branch is often related to previous outcome(s)

Page 13: Idea: Track the History of a Branch

[Figure: each predictor entry stores the branch’s previous outcome plus two 2-bit counters, one used when prev = 0 and one used when prev = 1. For the alternating (TN)* branch at 0xDC50, the entry trains until the prev=1 counter saturates at 0 (predict N) and the prev=0 counter saturates at 3 (predict T), after which every prediction is correct.]

Page 14: Deeper History Covers More Patterns

• What pattern has this branch predictor entry learned?

[Figure: the entry stores the last 3 outcomes plus one 2-bit counter per 3-bit history value, from prev=000 through prev=111.]

001 → 1; 011 → 0; 110 → 0; 100 → 1  ⇒  00110011001… = (0011)*

Page 15: Predictor Organizations

[Figure: three organizations. (1) The PC hash selects a private history and counter set: a different pattern for each branch PC. (2) The PC hash indexes into a shared set of patterns. (3) A mix of both.]

Page 16: Example (1)

• 1024 counters (2^10)
  – 32 sets (2^5)
    • 5-bit PC hash chooses a set
  – Each set has 32 counters
    • 32 × 32 = 1024
    • History length of 5 (log2(32) = 5)
• Branch collisions
  – 1000’s of branches collapsed into only 32 sets

[Figure: a 5-bit PC hash selects the set; 5 bits of history select the counter within the set.]

Page 17: Example (2)

• 1024 counters (2^10)
  – 128 sets (2^7)
    • 7-bit PC hash chooses a set
  – Each set has 8 counters
    • 128 × 8 = 1024
    • History length of 3 (log2(8) = 3)
• Limited patterns/correlation
  – Can now only handle a history length of three

[Figure: a 7-bit PC hash selects the set; 3 bits of history select the counter within the set.]

Page 18: Two-Level Predictor Organization

• Branch History Table (BHT)
  – 2^a entries
  – h-bit history per entry
• Pattern History Table (PHT)
  – 2^b sets
  – 2^h counters per set
• Total size in bits
  – h·2^a + 2·2^(b+h)

[Figure: a bits of the PC hash index the BHT; b bits index a PHT set; the h-bit history read from the BHT selects a counter within that set. Each PHT entry is a 2-bit counter.]

Page 19: Classes of Two-Level Predictors

• h = 0 or a = 0 (degenerate case)
  – Regular table of 2bC’s (b = log2 of the number of counters)
• h > 0, a > 1
  – “Local History” 2-level predictor
• h > 0, a = 1
  – “Global History” 2-level predictor

Page 20: Global vs. Local Branch History

• Local behavior
  – What is the predicted direction of branch A given the outcomes of previous instances of branch A?
• Global behavior
  – What is the predicted direction of branch Z given the outcomes of all* previous branches A, B, …, X and Y?

* the number of previous branches tracked is limited by the history length

Page 21: Why Global Correlations Exist

• Example: related branch conditions

        p = findNode(foo);
    A:  if ( p is parent )
            do something;

        do other stuff; /* may contain more branches */

    B:  if ( p is a child )
            do something else;

The outcome of the second branch is always the opposite of the first branch.

Page 22: Other Global Correlations

• Testing same/similar conditions
  – code might test for NULL before a function call, and the function might test for NULL again
  – in some cases it may be faster to recompute a condition rather than save a previous computation in memory and re-load it
  – partial correlations: one branch could test for cond1, and another branch could test for cond1 && cond2 (if cond1 is false, then the second branch can be predicted as false)
  – multiple correlations: one branch tests cond1, a second tests cond2, and a third tests a combination of cond1 and cond2 (which can always be predicted if the first two branches are known)

Page 23: A Global-History Predictor

[Figure: a single global branch history register (BHR) holds the outcomes of the last h branches. Its h bits are combined with a b-bit PC hash to form the (b+h)-bit index into the table of 2-bit counters.]

Page 24: Similar Tradeoff Between b and h

• For a fixed number of counters
  – Larger h → smaller b
    • Larger h → longer history
      – able to capture more patterns
      – longer warm-up/training time
    • Smaller b → more branches map to the same set of counters
      – more interference
  – Larger b → smaller h
    • just the opposite…

Page 25: Motivation for Combined Indexing

• Not all 2^h “states” are used
  – (TTNN)* only uses half of the states for a history length of 3, and only ¼ of the states for a history length of 4
  – (TN)* only uses two states no matter how long the history is
• Not all bits of the PC are uniformly distributed
• Not all bits of the history are uniformly likely to be correlated
  – more recent history is more likely to be strongly correlated

Page 26: Combined Index Example: gshare

• S. McFarling (DEC-WRL TR, 1993)

[Figure: a k-bit PC hash is XORed with the k-bit global history to form the table index, where k = log2 of the number of counters.]

Page 27: gshare Example

Branch Address   Global History   Gselect 4/4   Gshare 8/8
00000000         00000001         00000001      00000001
00000000         00000000         00000000      00000000
11111111         00000000         11110000      11111111
11111111         10000000         11110000      01111111

Insufficient history leads to a conflict: gselect’s 4/4 concatenation produces the same index (11110000) for the last two cases, while gshare’s 8-bit XOR keeps them distinct.

Page 28: Some Interference May Be Tolerable

• Branch A: always not-taken
• Branch B: always taken
• Branch C: TNTNTN…
• Branch D: TTNNTTNN…

[Figure: even with these branches sharing counters, each global-history pattern (000, 111, 010, 101, 001, 011, 100, 110) still maps to a consistent outcome, so the shared counters saturate at 0 or 3 and the predictions stay correct.]

Page 29: And Then It Might Not

• Branch X: TTTNTTTN…
• Branch Y: TNTNTN…
• Branch Z: TTTT…

[Figure: with this mix of branches sharing the table, some history patterns now map to conflicting outcomes, so the shared counters (and the resulting predictions) become ambiguous.]

Page 30: Interference-Reducing Predictors

• There are patterns and asymmetries in branches
• Not all patterns occur with the same frequency
• Branches have biases
• This lecture:
  – Bi-Mode (Lee et al., MICRO ’97)
  – gskewed (Michaud et al., ISCA ’97)
• These are global-history predictors, but the ideas can be applied to other types of predictors

Page 31: gskewed Idea

• Interference occurs because two (or more) branches hash to the same index
• A different hash function can prevent this collision
  – but may cause other collisions
• Use multiple hash functions such that a collision can only occur in a few cases
  – use a majority vote to make the final decision

Page 32: gskewed Organization

[Figure: the PC and global history feed three different hash functions (hash1, hash2, hash3), each indexing its own pattern history table (PHT1, PHT2, PHT3); a majority vote over the three counters produces the prediction.]

The hash functions are designed so that if hash1(x) = hash1(y), then hash2(x) ≠ hash2(y) and hash3(x) ≠ hash3(y).

Page 33: gskewed Example

[Figure: worked example with two branches A and B. They collide in one of the three tables, but the majority vote over the other two still yields the correct prediction for each.]

Page 34: Combining Predictors

• Some branches exhibit local-history correlations
  – ex. loop branches
• While others exhibit global-history correlations
  – “spaghetti logic”, ex. if-elsif-elsif-elsif-else branches
• Using a global-history predictor prevents accurate prediction of branches exhibiting local-history correlations
• And vice versa

Page 35: Tournament Hybrid Predictors

Pred0       Pred1       Meta Update
incorrect   incorrect   ---
incorrect   correct     Inc
correct     incorrect   Dec
correct     correct     ---

[Figure: Pred0 and Pred1 each make a prediction; a meta-predictor (a table of 2-/3-bit counters) selects between them to form the final prediction.]

If the meta-counter MSB = 0, use pred0, else use pred1.

Page 36: Common Combinations

• Global history + local history
• “Easy” branches + global history
  – 2bC and gshare
• Short history + long history

• Many types of behavior, many combinations

Page 37: Multi-Hybrids

• Why only combine two predictors?

[Figure: four component predictors P0–P3, combined either by a tree of meta-predictors (M01 and M23 each pick within a pair, then another M picks between the pairs) or by a single meta-predictor M that selects among all four.]

• Tradeoff between making good individual predictions (P’s) vs. making good meta-predictions (M’s)
  – for a fixed hardware budget, improving one may hurt the other

Page 38: Prediction Fusion

• Selection discards information from n−1 predictors
• Fusion attempts to synthesize all information
  – more info to work with
  – possibly more junk to sort through

[Figure: with selection, M picks one of P0–P3’s predictions; with fusion, all of P0–P3’s predictions feed into M, which forms the final prediction.]

Page 39: Using Long Branch Histories

• Long global history provides more context for branch prediction/pattern matching
  – more potential sources of correlation
• Costs
  – For a PHT-based approach, HW cost increases exponentially: O(2^h) counters
  – Training time increases, which may decrease overall accuracy

Page 40: Predictor Training Time

• Ex: the prediction equals the opposite of the 2nd-most-recent outcome

• Hist len = 2 → 4 states to train:
  NN → T, NT → T, TN → N, TT → N

• Hist len = 3 → 8 states to train:
  NNN → T, NNT → T, NTN → N, NTT → N, TNN → T, …

Page 41: Neural Branch Prediction

• Uses the “perceptron” from classical machine learning theory
  – simplest form of a neural net (single layer, single node)
• Inputs are past branch outcomes
• Compute a weighted sum of the inputs
  – output is a linear function of the inputs
  – the sign of the output is used for the final prediction

Page 42: Perceptron Predictor

[Figure: each history bit is encoded as +1 (taken) or −1 (not-taken) to form inputs x1…xn; each xi is multiplied by its weight wi, and the products are summed in an adder along with a “bias” weight w0 whose input x0 is always 1. The prediction is taken if the sum ≥ 0.]

Page 43: Perceptron Predictor (2)

• The magnitude of weight wi determines how correlated branch i is to the current branch
• The sign of the weight determines positive or negative correlation
• Ex. the outcome is usually the opposite of the 5th-oldest branch
  – w5 has a large magnitude (L), but is negative
  – if x5 is taken, then w5·x5 = (−L)(1) = −L
    • tends to make the sum more negative (toward a NT prediction)
  – if x5 is not taken, then w5·x5 = (−L)(−1) = L

Page 44: Perceptron Predictor (3)

• When the actual branch outcome is known:
  – if xi = outcome, then increment wi (positive correlation)
  – if xi ≠ outcome, then decrement wi (negative correlation)
  – for x0, increment if the branch is taken, decrement if NT
• “Done with training”
  – if |Σ wi·xi| > θ, then don’t update the weights unless mispredicted

Page 45: Perceptron Trains Quickly

• If no correlation exists with branch i, then wi will just get incremented and decremented back and forth: wi ≈ 0
• If a correlation exists with branch j, then wj will be consistently incremented (or decremented) to have a large influence on the overall sum

Page 46: Linearly Inseparable Functions

• The perceptron computes a linear combination of its inputs
• It can only learn linearly separable functions

[Figure: two truth tables over xi, xj ∈ {−1, 1}. The left function (T only when xi = xj = −1, N otherwise) is separable, e.g. by f() = −3·xi − 4·xj − 5 with weights wi = −3, wj = −4, w0 = −5. The right function (T when xi ≠ xj, N when xi = xj) is not.]

• For the right function, no values of wi, wj, w0 exist to satisfy these outputs
• No straight line exists that separates the T’s from the N’s

Page 47: Overall Hardware Organization

[Figure: a PC hash selects one set of weights from a table of n perceptrons; the BHR bits are multiplied by the selected weights, the products are summed in an adder, and prediction = sign(sum).]

Size = (h+1)·k·n + h bits, plus Area(multipliers) + Area(adder)
  h = history length, k = counter (weight) width, n = number of perceptrons in the table

Page 48: GEHL

• GEometric History Length predictor

[Figure: several tables of k-bit weights are indexed by the PC hashed with increasingly long prefixes (L1 < L2 < L3 < L4) of a very long global branch history; the selected weights are summed and prediction = sign(sum).]

• L(i) = a^(i−1) · L(1): the history lengths form a geometric progression

Page 49: PPM Predictors

• PPM = Partial Pattern Matching
  – Used in data compression
  – Idea: Use the longest history necessary, but no longer

[Figure: a base table of 2bc counters plus four partially-tagged tables, indexed using progressively longer histories h1 < h2 < h3 < h4 (most recent to oldest). Each tagged table compares its stored partial tag against the lookup; a chain of muxes selects the prediction from the longest-history table whose tag matches, falling back to the base 2bc table.]

Page 50: TAGE Predictor

• Similar to PPM, but uses geometric history lengths
  – Currently the most accurate type of branch prediction algorithm
• References (www.jilp.org):
  – PPM: Michaud (CBP-1)
  – O-GEHL: Seznec (CBP-1)
  – TAGE: Seznec & Michaud (JILP)
  – L-TAGE: Seznec (CBP-2)