compiler techniques for ilp & branch prediction

ComputerArchitectureLecture5:CompilerTechniquesforILP&BranchPrediction(Chapter3)

ChihWeiLiuNationalChiao [email protected]

DataDependenceandParallelism

If2instructionsareparallel theycanbeexecutedsimultaneouslyinapipelinewithoutcausinganystalls(exceptthestructuralhazards)

theirexecutionordercanbeswapped If2instructionsaredependent

theymustbeexecutedinorder orpartiallyoverlapped.

Toexploitparallelismsoverinstructionsisequivalenttodeterminedependencesoverinstructions

CA-Lec5 [email protected]

Introduction

2

ExploitInstructionLevelParallelism

Twomainapproaches: Hardwarebaseddynamicapproaches

Hardwarelocatestheparallelisminruntime Usedinserveranddesktopprocessors(NotusedasextensivelyinPMPprocessors)

Superscalarprocessors:Pentium4,IBMPower,AMDOpteron

Compilerbasedstaticapproaches Softwarefindsparallelismatcompiletime UsedinDSPprocessors(Notassuccessfuloutsideofscientificapplications)

VLIWprocessors:Itanium2,ITRIPAC


Introduction

3

WhyILP?

MultiissueProcessor:twoormoreinstructionscanbeissued(orexecuted)inparallel ThegoalistomaximizeIPC(instructionpercycle) PipelineCPI=IdealpipelineCPI+Structuralstalls+Datahazardstalls+Controlstalls

Toreducetheimpactofdataandcontrolhazards BasicblockILP

CanwemakeCPIcloserto1? Ifwehavencyclelatency,thenweneedn1instructionsbetweenaproducinginstructionanditsuse


Introduction

4

BasicBlockILP BasicBlock(BB)ILPisquitesmall :

BB:astraightlinecodesequencewithnobranchesin excepttotheentryandnobranchesout exceptattheexit

averagedynamicbranchfrequency15%to25%=>4to7instructionsexecutebetweenapairofbranches

PlusinstructionsinBBlikelytodependoneachother WemustexploitILPacrossmultiplebasicblocks

Loopunrolling toexploitlooplevelparallelism

CA-Lec5 [email protected] 5

for (i=1; i

OvercometheDataDependence

Maintainingthedependencebutavoidingahazard schedulingthecodeinHW/SWapproach

Eliminatingadependencebytransformingthecode primarybysoftware

Dependencedetection byregisternames:simpler bymemorylocations:morecomplicated

Twoaddressesmayrefertothesamelocationbutlookquitedifferent(e.g.100(R4), 20(R6) maybeidentical)

Theeffectiveaddressofaload/storemaychangedfrominstructiontoinstruction(20(R4), 20(R4) maybedifferent)


(True)DataDependence

DataandControldependenciesareapropertyoftheprogram(application)

Datadependenceconveys: Possibilityofapipelinehazard Orderinwhichresultsmustbecalculated(i.e.programbehavior)

UpperboundonILP Dependenciesthatflowthroughmemorylocationsaredifficulttodetect Hardwarebaseddynamicapproachismoreattractive


Introduction

7

NameDependence Twoinstructionsusethesameregisterormemorylocation(i.e.thesamename)butnoflowofinformation Notatruedatadependence,butisaproblemwhenreorderinginstructionsorairregularlypipelineddatapath.

Antidependence:instructionjwritesaregisterormemorylocationthatinstructioni reads

Initialordering(i beforej)mustbepreserved Outputdependence:instructioni andinstructionjwritethesameregisterormemorylocation

Orderingmustbepreserved

Toresolve,userenamingtechniques


Introduction

8

RegisterRenamingandWAW/WAR

DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0 (R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8

DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0 (R1) SUB.D T, F10, F14 MUL.D F6, F10, T


WAW: ADD.D/MUL.D

WAR: ADD.D/SUB.D, S.D/MUL.D

RAW: DIV.D/ADD.D, ADD.D/S.D SUB.D/MUL.D

Register renaming result

ControlDependence

ControlDependence Orderingofinstructioni withrespecttoabranchinstruction

Instructioncontroldependentonabranchcannotbemovedbeforethebranchsothatitsexecutionisnolongercontrollerbythebranch

Aninstructionnotcontroldependentonabranchcannotbemovedafterthebranchsothatitsexecutioniscontrolledbythebranch


Introduction

10

ControlDependenceIgnored

Themethodtopreservethecontroldependence BeusedinmostsimplepipelineCPUs Simplebutinefficient

Controldependenceisnotthecriticalpropertythatmustbepreserved Wemayexecuteinstructionthatshouldnothavebeenexecuted,

therebyviolatingthecontroldependence,ifwecandosowithoutaffectingthecorrectnessoftheprogram

Thetwopropertiescriticaltoprogramcorrectnessare Theexceptionbehavior Dataflow


ControlDependenceExamples

R1inORinstructiondependsonDADDUorDSUBU,reliedonthebranchistakenornot.

AssumeR4isntusedafterskip PossibletomoveDSUBUbeforethe

branch(thisviolatesthecontroldependencebutnotaffectsthedataflow)


Introduction

Example 1:DADDU R1,R2,R3BEQZ R4,LDSUBU R1,R1,R6

L: OR R7,R1,R8

Example 2:DADDU R1,R2,R3BEQZ R12,skipDSUBU R4,R5,R6DADDU R5,R4,R9

skip:OR R7,R8,R9

12

PreserveDataFlowBehavior

Thedataflowistheactualflowofdata amonginstructions Branchesmakedataflowdynamic Example

R1 valuedependsonthebranchistakenornot DSUBU cannotbemovedabovethebranch. Speculationshouldtakecarethisproblem

Programorder,thatdetermineswhichpredecessorwillactuallydeliveradatavaluetotheinstruction,shouldbeensuredbymaintainingthecontroldependences

CA-Lec5 [email protected] 04-13

CompilerTechniquesforExposingILP

Pipelinescheduling Separatedependentinstructionfromthesourceinstructionbythe

pipelinelatency(orinstructionlatency)ofthesourceinstruction Example:

for(i=999;i>=0;i=i1)x[i]=x[i]+s;


Com

piler Techniques

14

Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)DADDUI R1,R1,#-8BNE R1,R2,Loop

DataDependenceLoop: L.D F0, 0(R1) ;F0=array element

ADD.D F4, F0, F2 ;add scalar in F2

S.D F4, 0(R1) ;store result

DADDUI R1, R1, #-8 ;decrement pointer

BNE R1, R2, Loop ;branch R1!=R2


The arrows show the order that must be preserved for correct execution.

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.

Step1:InsertPipelineStallsLoop: L.D F0,0(R1)

stallADD.D F4,F0,F2stallstallS.D F4,0(R1)DADDUI R1,R1,#8stall (assumeintegerloadlatencyis1)BNE R1,R2,Loop


Com

piler Techniques

16

9 C.C/ iteration

Step2:ReSchedulingScheduledcode:Loop: L.D F0,0(R1)

DADDUI R1,R1,#8ADD.D F4,F0,F2stallstallS.D F4,8(R1)BNE R1,R2,Loop


Com

piler Techniques

17

7 C.C/ iteration

Step3:LoopUnrolling Loopunrolling

Unrollbyafactorof4(assume#elementsisdivisibleby4) Eliminateunnecessaryinstructions

Loop: L.DF0,0(R1)ADD.DF4,F0,F2S.DF4,0(R1);dropDADDUI&BNEL.DF6,8(R1)ADD.DF8,F6,F2S.DF8,8(R1);dropDADDUI&BNEL.DF10,16(R1)ADD.DF12,F10,F2S.DF12,16(R1);dropDADDUI&BNEL.DF14,24(R1)ADD.DF16,F14,F2S.DF16,24(R1)DADDUIR1,R1,#32BNER1,R2,Loop


Com

piler Techniques

note:numberofliveregistersvs.originalloop

18

Step4:ReScheduletheUnrolledloop

Pipelinescheduletheunrolledloop:

Loop: L.DF0,0(R1)L.DF6,8(R1)L.DF10,16(R1)L.DF14,24(R1)ADD.DF4,F0,F2ADD.DF8,F6,F2ADD.DF12,F10,F2ADD.DF16,F14,F2S.DF4,0(R1)S.DF8,8(R1)DADDUIR1,R1,#32S.DF12,16(R1)S.DF16,8(R1)BNER1,R2,Loop


Com

piler Techniques

19

14 C.C/ 4 iterationsor 3.5 C.C/ iteration

ILPandDataDependencies

HW/SWmustpreserveprogramorder:orderinstructionswouldexecuteinifexecutedsequentiallyasdeterminedbyoriginalsourceprogram Dependencesareapropertyofprograms

Presenceofdependenceindicatespotential forahazard,butactualhazardandlengthofanystallispropertyofthepipeline

Importanceofthedatadependencies1)indicatesthepossibilityofahazard2)determinesorderinwhichresultsmustbecalculated3)setsanupperboundonhowmuchparallelismcanpossiblybeexploited

HW/SWgoal:exploitparallelismbypreservingprogramorderonlywhereitaffectstheoutcomeoftheprogram


UnrolledLoopDetail

Donotusuallyknowupperboundofloop Supposeitisn,andwewouldliketounrollthelooptomakek

copiesofthebody Insteadofasingleunrolledloop,wegenerateapairof

consecutiveloops: 1stexecutes(n mod k)timesandhasabodythatistheoriginalloop 2ndistheunrolledbodysurroundedbyanouterloopthatiterates

(n/k)times

Forlargevaluesofn,mostoftheexecutiontimewillbespentintheunrolledloop


Com

piler Techniques

21

5LoopUnrollingDecisions Requiresunderstandinghowoneinstructiondependsonanotherandhowthe

instructionscanbechangedorreorderedgiventhedependences:1. Determineloopunrollingusefulbyfindingthatloopiterationswereindependent

(exceptformaintenancecode)2. Usedifferentregisterstoavoidunnecessaryconstraintsforcedbyusingsame

registersfordifferentcomputations3. Eliminatetheextratestandbranchinstructions andadjustthelooptermination

anditerationcode4. Determinethatloadsandstoresinunrolledloopcanbeinterchanged by

observingthatloadsandstoresfromdifferentiterationsareindependent Transformationrequiresanalyzingmemoryaddressesandfindingthattheydonotrefer

tothesameaddress

5. Schedulethecode,preservinganydependencesneededtoyieldthesameresultastheoriginalcode


3LimitstoLoopUnrolling1. Decreaseinamountofoverheadamortizedwitheachextra

unrolling AmdahlsLaw

2. Growthincodesize Forlargerloops,concernitincreasestheinstructioncache

missrate3. Registerpressure(compilerlimitation):potentialshortfallin

registerscreatedbyaggressiveunrollingandscheduling Ifnotbepossibletoallocatealllivevaluestoregisters,

maylosesomeorallofitsadvantage Loopunrollingreducesimpactofbranchesonpipeline;

anotherwayisbranchprediction


GettingCPIbelow1 CPI 1ifissueonly1instructioneveryclockcycle Multipleissueprocessors comein3flavors:

1. staticallyscheduledsuperscalar processors,2. dynamicallyscheduledsuperscalarprocessors,and3. VLIW (verylonginstructionword)processors

2typesofsuperscalarprocessorsissuevaryingnumbersofinstructionsperclock useinorderexecutioniftheyarestaticallyscheduled,or outoforderexecutioniftheyaredynamicallyscheduled

VLIWprocessors,incontrast,issueafixednumberofinstructionsformattedeitherasonelargeinstructionorasafixedinstructionpacketwiththeparallelismamonginstructionsexplicitlyindicatedbytheinstruction(Intel/HPItanium)


BasicVLIW(VeryLongInstructionWord)

AVLIWusesmultiple,independentfunctionalunits AVLIWpackagesmultipleindependentoperationsintooneverylong

instruction Theburdenforchoosingandpackagingindependentoperationsfalls

oncompiler HWinasuperscalarmakestheseissuedecisionsisunnecessary

VLIWdependsonenoughparallelismforkeepingFUsbusy Loopunrollingandthencodescheduling Compilermayneedtodolocalschedulingandglobalscheduling

HereweconsideraVLIWprocessormighthaveinstructionsthatcontain5operations,including1integer(orbranch),2FP,and2memoryreferences DependontheavailableFUsandfrequencyofoperation


VLIW:VeryLargeInstructionWord

Eachinstruction hasexplicitcodingformultipleoperations InIA64,groupingcalledapacket InTransmeta,groupingcalledamolecule (withatoms asops)

Tradeoffinstructionspaceforsimpledecoding Thelonginstructionwordhasroomformanyoperations Bydefinition,alltheoperationsthecompilerputsinthelong

instructionwordareindependent=>executeinparallel E.g.,2integeroperations,2FPops,2Memoryrefs,1branch

16to24bitsperfield=>7*16or112bitsto7*24or168bitswide Needcompilingtechniquethatschedulesacrossseveralbranches


Recall:UnrolledLoopthatMinimizesStallsforScalar


1 Loop: L.D F0,0(R1)2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

L.D to ADD.D: 1 CycleADD.D to S.D: 2 Cycles

LoopUnrollinginVLIW


Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.29 clocks per iteration 23 ops in 9 clocks, average 2.5 ops per clock, 50% efficiency Note: Need more registers in VLIW

VLIWProblems

Increaseincodesize Ambitiousloopunrolling Wheneverinstructionsarenotfull,theunusedFUstranslatetowaste

bitsintheinstructionencoding Aninstructionmayneedtobeleftcompletelyemptyifnooperationcanbescheduled

Cleverencodingorcompress/decompress

Binarycodecompatibility differentnumbersoffunctionalunitsandunitlatenciesrequire

differentversionsofthecode Needrecompilation Solution:Objectcodetranslationoremulation


Intel/HPIA64ExplicitlyParallelInstructionComputer(EPIC)

IA64:instructionsetarchitecture 12864bitintegerregs+12882bitfloatingpointregs

NotseparateregisterfilesperfunctionalunitasinoldVLIW Hardwarechecksdependencies

(interlocks=>binarycompatibilityovertime) Predicatedexecution(select1outof641bitflags)

=>40%fewermispredictions? Itanium wasfirstimplementation(2001)

Highlyparallelanddeeplypipelinedhardwareat800Mhz 6wide,10stagepipelineat800Mhzon0.18 process

Itanium2 isnameof2ndimplementation(2005) 6wide,8stagepipelineat1666Mhzon0.13 process Caches:32KBI,32KBD,128KBL2I,128KBL2D,9216KBL3


AnotherPossibility:SoftwarePipelining

Observation:ifiterationsfromloopsareindependent,thencangetmoreILPbytakinginstructionsfromdifferent iterations

Softwarepipelining:reorganizesloopssothateachiterationismadefrominstructionschosenfromdifferentiterationsoftheoriginalloop

Iteration 0 Iteration

1 Iteration 2 Iteration

3 Iteration 4

Software- pipelined iteration

ControlHazardAvoidance

ConsiderEffectsofIncreasingtheILP Controldependenciesrapidlybecomethelimitingfactor Theytendtonotgetoptimizedbythecompiler

Higherbranchfrequenciesresult Plusmultipleissue(morethanoneinstructions/sec)morecontrolinstructionspersec.

Controlstallpenaltieswillgoupasmachinesgofaster AmdahlsLawinaction again

BranchPrediction:helpsifcanbedoneforreasonablecost Staticbycompiler:appendixC

e.g.predictnottaken,delaybranch DynamicbyHW:thissection


DynamicBranchPrediction

Whydoespredictionwork? Underlyingalgorithmhasregularities Datathatisbeingoperatedonhasregularities Instructionsequencehasredundancies thatareartifactsofwaythat

humans/compilersthinkaboutproblems

Isdynamicbranchpredictionbetterthanstaticbranchprediction? Seemstobe Thereareasmallnumberofimportantbranchesinprogramswhich

havedynamicbehavior


DynamicBranchPrediction Thepredictorwilldependonthebehaviorofthebranchat

runtime Goals:

allowtheprocessortoresolvetheoutcomeofabranchearly,preventcontroldependencesfromcausingstalls

Effectivenessofabranchpredictionschemedependsnotonlyontheaccuracybutalsoonthecostofabranch BP_Performance=f (accuracy,costofmisprediction)

BranchHistoryTable(BHT) LowerbitsofPCaddressindextableof1bitvalues

Nopreciseaddresscheck justmatchthelowerbits Sayswhetherornotbranchtakenlasttime


1bitDynamicHardwarePrediction


predict taken predict not-taken

taken

not-taken

taken not-taken

Problem: Loop case

LOOP: LOAD R1, 100(R2)MUL R6, R6, R1SUBI R2, R2, #4BNEZ R2, LOOP

The steady-state prediction behavior will mispredict on the first and last loop iterations

BHTPrediction


If two branch instructions withthe same lower bits

Useful only for the target address is known before CC is decided

ProblemwiththeSimpleBHT

Aliasing Allbrancheswiththesameindex(lower)bitsreferencesameBHT

entry Hencetheymutuallypredicteachother Noguaranteethatapredictionisright.Butitmaynotmatteranyway

Avoidance Makethetablebigger OKsinceitsonlyasinglebitvector Thisisacommoncacheimprovementstrategyaswell

Othercachestrategiesmayalsoapply

Considerhowthisworksforloops Alwaysmispredicttwiceforeveryloop

Oneisunavoidablesincetheexitisalwaysasurprise Howeverpreviousexitwillalwayscauseamispredictionthefirsttryofeverynewloopentry


clear benefit is that its cheap and understandable

NbitPredictors

2bitcounterimplies4states Statistically2bitsgetsmostoftheadvantage


idea: improve on the loop entry problem

predict taken

predict not-taken

taken

not-takentaken

predict not-taken

predict taken

not-taken

not-taken

not-taken

taken

taken

Compiler could hint 11 init. on loop branches or it will go to 11 anyway in the 4th iteration

Only the loop exit causes a mispredict

11 10

01 00

A prediction must miss twice before it is changed

BHTAccuracy Mispredictbecauseeither:

Wrongguessforthatbranch(accuracy) Gotbranchhistoryofwrongbranchwhenindexthetable(size)


4K of BPB with 2-bit entries misprediction rates on SPEC89@IBM Power

ToIncreasetheBHTSize 4096aboutasgoodasinfinitetable Thehitrateofthebufferisclearlynotthelimitingfactorforanenough

largeBHTsize


Theworstcaseforthe2bitpredictor

if (aa==2)aa=0;

if (bb==2)bb=0;

if (aa != bb) {


DSUBUI R3, R1, #2BNEZ R3, L1 ;branch b1(aa!=2)DADDD R1, R0, R0 ;aa=0

L1: DSUBUI R3, R2, #2BNEZ R3, L2 ;branch b2(bb!=2)DADDD R2, R0, R0 ;bb=0

L2: DSUBU R3, R1, R2BEQZ R3, L3 ;branch b3(aa==bb)

if the first 2 untaken then the 3rd will always be taken

aa and bb are assigned to R1 and R2

ImprovePredictionStrategyByCorrelatingBranches

Considertheworstcaseforthe2bitpredictorif (aa==2) then aa=0;if (bb==2) then bb=0;if (aa != bb) then whatever

singlelevelpredictorscannevergetthiscase

Correlatingor2levelpredictors Thepredictorusesthebehaviorofotherbranch(es)tomakea

prediction Correlation=whathappenedonthelastbranch Predictor=whichwaytogo


if the first 2 fail then the 3rdwill always be taken

CorrelatingBranches Hypothesis:recentlyexecutedbranchesarecorrelated

Idea:recordmmostrecently executedbranchesastakenornottaken,andusethatpatterntoselecttheproperbranchhistorytable

Ingeneral,(m,n)predictor meansrecordlastmbranchestoselectbetween2m historytableseachwithnbitcounters Old2bitBHTisthena(0,2)predictor

GlobalBranchHistory:mbitshiftregisterkeepingT/NTstatusoflastmbranches

Eachentryintablehasmnbitpredictors


Two-level predictors

2Level(m,n)BHT Usethebehaviorofthelastmbranches tochoosefrom2m branchpredictors,eachofwhichisannbitpredictor forasinglebranch

Totalbitsforthe(m,n)BHTpredictionbuffer:

pbitsofbufferindex =2P bitBHT 2m banksofmemory selectedbytheglobalbranchhistory(whichisjustashiftregister) e.g.acolumnaddress

Usepbitsofthebranchaddresstoselectrow Getthenpredictorbits intheentrytomakethedecision


pm nbitsmemoryTotal 22__

(2,2)PredictorImplementation


4 banks = each with 32 2-bit predictor entries

p=5m=2n=2

5:32

ExampleofCorrelatingBranchPredictors

if (d==0)d = 1;

if (d==1)


BNEZ R1, L1 ;branch b1 (d!=0)DAAIU R1, R0, #1 ;d==0, so d=1

L1: DAAIU R3, R1, #-1BNEZ R3, L2 ;branch b2 (d!=1)

L2:

d is assigned to R1

ExampleofCorrelatingBranchPredictors(Cont.)initial

value of dd==0? b1 value of d

before b2d==1? b2

0 YES not taken 1 YES not taken1 NO taken 1 YES not taken2 NO taken 2 NO taken


d=? b1 prediction

b1 action New b1 prediction

b2 prediction


2 NT T T NT T T

0 T NT NT T NT NT

2 NT T T NT T T

0 T NT NT T NT NT

1-bit predictor initialized to NT

All the branches are mispredicted !!!

ExampleofCorrelatingBranchPredictors(Cont.)Prediction bits Prediction if last branch

not takenPrediction if last branch

takenNT/NT NT NTNT/T NT TT/NT T NTT/T T T


d=? b1 prediction


b2 prediction


2 NT/NT T T/NT NT/NT T NT/T0 T/NT NT T/NT NT/T NT NT/T2 T/NT T T/NT NT/T T NT/T0 T/NT NT T/NT NT/T NT NT/T

Use 1-bit correlation + 1-bit prediction with initialized to NT/NT(1,1) predictor

Comparison:AccuracyofDifferent2bitPredictors


8K bits

8K bits

TournamentPredictors Recallthatthecorrelatorisjustalocalpredictor Adaptivelycombinelocalandglobal predictors

Multiplepredictors Onebasedonglobalinformation:Resultsofrecentlyexecutedmbranches

Onebasedonlocalinformation:Resultsofpastexecutionsofthecurrentbranchinstruction

Selector tochoosewhichpredictorstouse E.g.:2bitsaturatingcounter,incrementedwheneverthepredictedpredictoriscorrectandtheotherpredictorisincorrect,anditisdecrementedinthereversesituation

Advantage Abilitytoselecttherightpredictorfortherightbranch


The most popular one

StateTransitionDiagramforATournamentPredictor


BranchPredictionPerformance


HighPerformanceInstructionDelivery

Foramultipleissueprocessor,predictingbrancheswellisnotenough

Deliverahighbandwidthinstructionstream isnecessary(e.g.,4~8instructions/cycle) Branchtargetbuffer Integratedinstructionfetchunit Indirectbranchbypredictingreturnaddress


BranchTargetBuffer/Cache

Toreducethebranchpenaltyfrom1cycleto0 NeedtoknowwhattheaddressisattheendofIF Buttheinstructionisnotevendecodedyet Sousetheinstructionaddressratherthanwaitfordecode

Ifpredictionworksthenpenaltygoesto0!

BTBIdea Cachetostoretakenbranches(noneedtostoreuntaken) AccesstheBTBduringIFstage Matchtagisinstructionaddress comparewithcurrentPC DatafieldisthepredictedPC

Maywanttoaddpredictorfield Toavoidthemispredict twiceoneveryloopphenomenon Addscomplexitysincewenowhavetotrackuntakenbranchesaswell


BTB Illustration


FlowchartforBTB


Send out the predicted PC

A branch?

PenaltiesUsingthisApproachfor5StageMIPS

Instruction in buffer

Prediction Actual Branch Penalty Cycles

Yes Taken Taken 0Yes Taken Not Taken 2No Taken 2No Not Taken 0


Note: Predict_wrong = 1 CC to update BTB + 1 CC to restart fetching Not found and taken = 2CC to update BTBNote: For complex pipeline design, the penalties may be higher

Example Givenpredictionaccuracy(forinst.inbuffer):90% Givenhitrateinbuffer(forbranchespredictedtoken):90% Assume60%ofthebranchesaretaken Determinethetotalbranchpenalty=?

Solution Probability(branchinbuffer,butactuallynottaken)=percentbuffer

hitrate percentincorrectprediction=90% 10%=0.09 Probability(branchnotinbuffer,butactuallytaken)=10% Hence,wehave2cycles (0.09+0.1)=0.38cycles

Comparingthedelaybranchwiththepenalty=0.5cycles/branch


IntegratedInstructionFetchUnits

Considerthefetchunitasaseparateautonomousunit,notapipelinestage

Functionsfortheintegratedinstructionfetchunit Branchprediction Prefetch

Todelivermultipleinstructionspercycle Instructionmemoryaccessandbuffering

mayrequireaccessingmultiplecachelines prefetchmayhidethelatencyformemoryaccess bufferingmaybenecessary


ReturnAddressPredictor

Indirectjump jumpswhosedestinationaddressvariesatruntime indirectprocedurecall,selectorcase,procedurereturn SPEC89benchmarks:85%ofindirectjumpsareprocedure

returns AccuracyofBTBforprocedurereturnsarelow

ifprocedureiscalledfrommanyplaces,andthecallsfromoneplacearenotclusteredintime

Useasmallbufferofreturnaddressesoperatingasastack Cachethemostrecentreturnaddresses Pushareturnaddressatacall,andpoponeoffatareturn Ifthecacheissufficientlarge(maxcalldepth) prefect


DynamicBranchPredictionSummary

Branchpredictionschemearelimitedby Predictionaccuracy Mispredictionpenalty

BranchHistoryTable:2bitsforloopaccuracy Correlation:Recentlyexecutedbranchescorrelatedwithnextbranch Tournamentpredictorstakeinsighttonextlevel,byusingmultiple

predictors usuallyonebasedonglobalinformationandonebasedonlocalinformation,

andcombiningthemwithaselector In2006,tournamentpredictorsusing 30Kbitsareinprocessorslikethe

Power5andPentium4 BranchTargetBuffer:includebranchaddress&prediction Reducepenaltyfurtherbyfetchinginstructionsfromboththepredicted

andunpredicteddirection Requiredualportedmemory,interleavedcache HWcost CachingaddressesorinstructionsfrommultiplepathinBTB


compiler techniques for ilp & branch prediction

Documents