compiler techniques for ilp & branch prediction

61
Computer Architecture Lecture 5: Compiler Techniques for ILP & Branch Prediction (Chapter 3) ChihWei Liu 劉志尉 National Chiao Tung University [email protected]

Upload: doankhuong

Post on 04-Jan-2017

226 views

Category:

Documents


1 download

TRANSCRIPT

  • ComputerArchitectureLecture5:CompilerTechniquesforILP&BranchPrediction(Chapter3)

    ChihWeiLiuNationalChiao [email protected]

  • DataDependenceandParallelism

    If2instructionsareparallel theycanbeexecutedsimultaneouslyinapipelinewithoutcausinganystalls(exceptthestructuralhazards)

    theirexecutionordercanbeswapped If2instructionsaredependent

    theymustbeexecutedinorder orpartiallyoverlapped.

    Toexploitparallelismsoverinstructionsisequivalenttodeterminedependencesoverinstructions

    CA-Lec5 [email protected]

    Introduction

    2

  • ExploitInstructionLevelParallelism

    Twomainapproaches: Hardwarebaseddynamicapproaches

    Hardwarelocatestheparallelisminruntime Usedinserveranddesktopprocessors(NotusedasextensivelyinPMPprocessors)

    Superscalarprocessors:Pentium4,IBMPower,AMDOpteron

    Compilerbasedstaticapproaches Softwarefindsparallelismatcompiletime UsedinDSPprocessors(Notassuccessfuloutsideofscientificapplications)

    VLIWprocessors:Itanium2,ITRIPAC

    CA-Lec5 [email protected]

    Introduction

    3

  • WhyILP?

    MultiissueProcessor:twoormoreinstructionscanbeissued(orexecuted)inparallel ThegoalistomaximizeIPC(instructionpercycle) PipelineCPI=IdealpipelineCPI+Structuralstalls+Datahazardstalls+Controlstalls

    Toreducetheimpactofdataandcontrolhazards BasicblockILP

    CanwemakeCPIcloserto1? Ifwehavencyclelatency,thenweneedn1instructionsbetweenaproducinginstructionanditsuse

    CA-Lec5 [email protected]

    Introduction

    4

  • BasicBlockILP BasicBlock(BB)ILPisquitesmall :

    BB:astraightlinecodesequencewithnobranchesin excepttotheentryandnobranchesout exceptattheexit

    averagedynamicbranchfrequency15%to25%=>4to7instructionsexecutebetweenapairofbranches

    PlusinstructionsinBBlikelytodependoneachother WemustexploitILPacrossmultiplebasicblocks

    Loopunrolling toexploitlooplevelparallelism

    CA-Lec5 [email protected] 5

    for (i=1; i

  • OvercometheDataDependence

    Maintainingthedependencebutavoidingahazard schedulingthecodeinHW/SWapproach

    Eliminatingadependencebytransformingthecode primarybysoftware

    Dependencedetection byregisternames:simpler bymemorylocations:morecomplicated

    Twoaddressesmayrefertothesamelocationbutlookquitedifferent(e.g.100(R4), 20(R6) maybeidentical)

    Theeffectiveaddressofaload/storemaychangedfrominstructiontoinstruction(20(R4), 20(R4) maybedifferent)

    CA-Lec5 [email protected] 6

  • (True)DataDependence

    DataandControldependenciesareapropertyoftheprogram(application)

    Datadependenceconveys: Possibilityofapipelinehazard Orderinwhichresultsmustbecalculated(i.e.programbehavior)

    UpperboundonILP Dependenciesthatflowthroughmemorylocationsaredifficulttodetect Hardwarebaseddynamicapproachismoreattractive

    CA-Lec5 [email protected]

    Introduction

    7

  • NameDependence Twoinstructionsusethesameregisterormemorylocation(i.e.thesamename)butnoflowofinformation Notatruedatadependence,butisaproblemwhenreorderinginstructionsorairregularlypipelineddatapath.

    Antidependence:instructionjwritesaregisterormemorylocationthatinstructioni reads

    Initialordering(i beforej)mustbepreserved Outputdependence:instructioni andinstructionjwritethesameregisterormemorylocation

    Orderingmustbepreserved

    Toresolve,userenamingtechniques

    CA-Lec5 [email protected]

    Introduction

    8

  • RegisterRenamingandWAW/WAR

    DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0 (R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8

    DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0 (R1) SUB.D T, F10, F14 MUL.D F6, F10, T

    CA-Lec5 [email protected] 9

    WAW: ADD.D/MUL.D

    WAR: ADD.D/SUB.D, S.D/MUL.D

    RAW: DIV.D/ADD.D, ADD.D/S.D SUB.D/MUL.D

    Register renaming result

  • ControlDependence

    ControlDependence Orderingofinstructioni withrespecttoabranchinstruction

    Instructioncontroldependentonabranchcannotbemovedbeforethebranchsothatitsexecutionisnolongercontrollerbythebranch

    Aninstructionnotcontroldependentonabranchcannotbemovedafterthebranchsothatitsexecutioniscontrolledbythebranch

    CA-Lec5 [email protected]

    Introduction

    10

  • ControlDependenceIgnored

    Themethodtopreservethecontroldependence BeusedinmostsimplepipelineCPUs Simplebutinefficient

    Controldependenceisnotthecriticalpropertythatmustbepreserved Wemayexecuteinstructionthatshouldnothavebeenexecuted,

    therebyviolatingthecontroldependence,ifwecandosowithoutaffectingthecorrectnessoftheprogram

    Thetwopropertiescriticaltoprogramcorrectnessare Theexceptionbehavior Dataflow

    CA-Lec5 [email protected] 11

  • ControlDependenceExamples

    R1inORinstructiondependsonDADDUorDSUBU,reliedonthebranchistakenornot.

    AssumeR4isntusedafterskip PossibletomoveDSUBUbeforethe

    branch(thisviolatesthecontroldependencebutnotaffectsthedataflow)

    CA-Lec5 [email protected]

    Introduction

    Example 1:DADDU R1,R2,R3BEQZ R4,LDSUBU R1,R1,R6

    L: OR R7,R1,R8

    Example 2:DADDU R1,R2,R3BEQZ R12,skipDSUBU R4,R5,R6DADDU R5,R4,R9

    skip:OR R7,R8,R9

    12

  • PreserveDataFlowBehavior

    Thedataflowistheactualflowofdata amonginstructions Branchesmakedataflowdynamic Example

    R1 valuedependsonthebranchistakenornot DSUBU cannotbemovedabovethebranch. Speculationshouldtakecarethisproblem

    Programorder,thatdetermineswhichpredecessorwillactuallydeliveradatavaluetotheinstruction,shouldbeensuredbymaintainingthecontroldependences

    CA-Lec5 [email protected] 04-13

  • CompilerTechniquesforExposingILP

    Pipelinescheduling Separatedependentinstructionfromthesourceinstructionbythe

    pipelinelatency(orinstructionlatency)ofthesourceinstruction Example:

    for(i=999;i>=0;i=i1)x[i]=x[i]+s;

    CA-Lec5 [email protected]

    Com

    piler Techniques

    14

    Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)DADDUI R1,R1,#-8BNE R1,R2,Loop

  • DataDependenceLoop: L.D F0, 0(R1) ;F0=array element

    ADD.D F4, F0, F2 ;add scalar in F2

    S.D F4, 0(R1) ;store result

    DADDUI R1, R1, #-8 ;decrement pointer

    BNE R1, R2, Loop ;branch R1!=R2

    CA-Lec5 [email protected] 15

    The arrows show the order that must be preserved for correct execution.

    If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.

  • Step1:InsertPipelineStallsLoop: L.D F0,0(R1)

    stallADD.D F4,F0,F2stallstallS.D F4,0(R1)DADDUI R1,R1,#8stall (assumeintegerloadlatencyis1)BNE R1,R2,Loop

    CA-Lec5 [email protected]

    Com

    piler Techniques

    16

    9 C.C/ iteration

  • Step2:ReSchedulingScheduledcode:Loop: L.D F0,0(R1)

    DADDUI R1,R1,#8ADD.D F4,F0,F2stallstallS.D F4,8(R1)BNE R1,R2,Loop

    CA-Lec5 [email protected]

    Com

    piler Techniques

    17

    7 C.C/ iteration

  • Step3:LoopUnrolling Loopunrolling

    Unrollbyafactorof4(assume#elementsisdivisibleby4) Eliminateunnecessaryinstructions

    Loop: L.DF0,0(R1)ADD.DF4,F0,F2S.DF4,0(R1);dropDADDUI&BNEL.DF6,8(R1)ADD.DF8,F6,F2S.DF8,8(R1);dropDADDUI&BNEL.DF10,16(R1)ADD.DF12,F10,F2S.DF12,16(R1);dropDADDUI&BNEL.DF14,24(R1)ADD.DF16,F14,F2S.DF16,24(R1)DADDUIR1,R1,#32BNER1,R2,Loop

    CA-Lec5 [email protected]

    Com

    piler Techniques

    note:numberofliveregistersvs.originalloop

    18

  • Step4:ReScheduletheUnrolledloop

    Pipelinescheduletheunrolledloop:

    Loop: L.DF0,0(R1)L.DF6,8(R1)L.DF10,16(R1)L.DF14,24(R1)ADD.DF4,F0,F2ADD.DF8,F6,F2ADD.DF12,F10,F2ADD.DF16,F14,F2S.DF4,0(R1)S.DF8,8(R1)DADDUIR1,R1,#32S.DF12,16(R1)S.DF16,8(R1)BNER1,R2,Loop

    CA-Lec5 [email protected]

    Com

    piler Techniques

    19

    14 C.C/ 4 iterationsor 3.5 C.C/ iteration

  • ILPandDataDependencies

    HW/SWmustpreserveprogramorder:orderinstructionswouldexecuteinifexecutedsequentiallyasdeterminedbyoriginalsourceprogram Dependencesareapropertyofprograms

    Presenceofdependenceindicatespotential forahazard,butactualhazardandlengthofanystallispropertyofthepipeline

    Importanceofthedatadependencies1)indicatesthepossibilityofahazard2)determinesorderinwhichresultsmustbecalculated3)setsanupperboundonhowmuchparallelismcanpossiblybeexploited

    HW/SWgoal:exploitparallelismbypreservingprogramorderonlywhereitaffectstheoutcomeoftheprogram

    CA-Lec5 [email protected] 20

  • UnrolledLoopDetail

    Donotusuallyknowupperboundofloop Supposeitisn,andwewouldliketounrollthelooptomakek

    copiesofthebody Insteadofasingleunrolledloop,wegenerateapairof

    consecutiveloops: 1stexecutes(n mod k)timesandhasabodythatistheoriginalloop 2ndistheunrolledbodysurroundedbyanouterloopthatiterates

    (n/k)times

    Forlargevaluesofn,mostoftheexecutiontimewillbespentintheunrolledloop

    CA-Lec5 [email protected]

    Com

    piler Techniques

    21

  • 5LoopUnrollingDecisions Requiresunderstandinghowoneinstructiondependsonanotherandhowthe

    instructionscanbechangedorreorderedgiventhedependences:1. Determineloopunrollingusefulbyfindingthatloopiterationswereindependent

    (exceptformaintenancecode)2. Usedifferentregisterstoavoidunnecessaryconstraintsforcedbyusingsame

    registersfordifferentcomputations3. Eliminatetheextratestandbranchinstructions andadjustthelooptermination

    anditerationcode4. Determinethatloadsandstoresinunrolledloopcanbeinterchanged by

    observingthatloadsandstoresfromdifferentiterationsareindependent Transformationrequiresanalyzingmemoryaddressesandfindingthattheydonotrefer

    tothesameaddress

    5. Schedulethecode,preservinganydependencesneededtoyieldthesameresultastheoriginalcode

    CA-Lec5 [email protected] 22

  • 3LimitstoLoopUnrolling1. Decreaseinamountofoverheadamortizedwitheachextra

    unrolling AmdahlsLaw

    2. Growthincodesize Forlargerloops,concernitincreasestheinstructioncache

    missrate3. Registerpressure(compilerlimitation):potentialshortfallin

    registerscreatedbyaggressiveunrollingandscheduling Ifnotbepossibletoallocatealllivevaluestoregisters,

    maylosesomeorallofitsadvantage Loopunrollingreducesimpactofbranchesonpipeline;

    anotherwayisbranchprediction

    CA-Lec5 [email protected] 23

  • GettingCPIbelow1 CPI 1ifissueonly1instructioneveryclockcycle Multipleissueprocessors comein3flavors:

    1. staticallyscheduledsuperscalar processors,2. dynamicallyscheduledsuperscalarprocessors,and3. VLIW (verylonginstructionword)processors

    2typesofsuperscalarprocessorsissuevaryingnumbersofinstructionsperclock useinorderexecutioniftheyarestaticallyscheduled,or outoforderexecutioniftheyaredynamicallyscheduled

    VLIWprocessors,incontrast,issueafixednumberofinstructionsformattedeitherasonelargeinstructionorasafixedinstructionpacketwiththeparallelismamonginstructionsexplicitlyindicatedbytheinstruction(Intel/HPItanium)

    CA-Lec5 [email protected] 24

  • BasicVLIW(VeryLongInstructionWord)

    AVLIWusesmultiple,independentfunctionalunits AVLIWpackagesmultipleindependentoperationsintooneverylong

    instruction Theburdenforchoosingandpackagingindependentoperationsfalls

    oncompiler HWinasuperscalarmakestheseissuedecisionsisunnecessary

    VLIWdependsonenoughparallelismforkeepingFUsbusy Loopunrollingandthencodescheduling Compilermayneedtodolocalschedulingandglobalscheduling

    HereweconsideraVLIWprocessormighthaveinstructionsthatcontain5operations,including1integer(orbranch),2FP,and2memoryreferences DependontheavailableFUsandfrequencyofoperation

    CA-Lec5 [email protected] 25

  • VLIW:VeryLargeInstructionWord

    Eachinstruction hasexplicitcodingformultipleoperations InIA64,groupingcalledapacket InTransmeta,groupingcalledamolecule (withatoms asops)

    Tradeoffinstructionspaceforsimpledecoding Thelonginstructionwordhasroomformanyoperations Bydefinition,alltheoperationsthecompilerputsinthelong

    instructionwordareindependent=>executeinparallel E.g.,2integeroperations,2FPops,2Memoryrefs,1branch

    16to24bitsperfield=>7*16or112bitsto7*24or168bitswide Needcompilingtechniquethatschedulesacrossseveralbranches

    CA-Lec5 [email protected] 26

  • Recall:UnrolledLoopthatMinimizesStallsforScalar

    CA-Lec5 [email protected] 27

    1 Loop: L.D F0,0(R1)2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24

    14 clock cycles, or 3.5 per iteration

    L.D to ADD.D: 1 CycleADD.D to S.D: 2 Cycles

  • LoopUnrollinginVLIW

    CA-Lec5 [email protected] 28

    Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.29 clocks per iteration 23 ops in 9 clocks, average 2.5 ops per clock, 50% efficiency Note: Need more registers in VLIW

  • VLIWProblems

    Increaseincodesize Ambitiousloopunrolling Wheneverinstructionsarenotfull,theunusedFUstranslatetowaste

    bitsintheinstructionencoding Aninstructionmayneedtobeleftcompletelyemptyifnooperationcanbescheduled

    Cleverencodingorcompress/decompress

    Binarycodecompatibility differentnumbersoffunctionalunitsandunitlatenciesrequire

    differentversionsofthecode Needrecompilation Solution:Objectcodetranslationoremulation

    CA-Lec5 [email protected] 29

  • Intel/HPIA64ExplicitlyParallelInstructionComputer(EPIC)

    IA64:instructionsetarchitecture 12864bitintegerregs+12882bitfloatingpointregs

    NotseparateregisterfilesperfunctionalunitasinoldVLIW Hardwarechecksdependencies

    (interlocks=>binarycompatibilityovertime) Predicatedexecution(select1outof641bitflags)

    =>40%fewermispredictions? Itanium wasfirstimplementation(2001)

    Highlyparallelanddeeplypipelinedhardwareat800Mhz 6wide,10stagepipelineat800Mhzon0.18 process

    Itanium2 isnameof2ndimplementation(2005) 6wide,8stagepipelineat1666Mhzon0.13 process Caches:32KBI,32KBD,128KBL2I,128KBL2D,9216KBL3

    CA-Lec5 [email protected] 30

  • AnotherPossibility:SoftwarePipelining

    Observation:ifiterationsfromloopsareindependent,thencangetmoreILPbytakinginstructionsfromdifferent iterations

    Softwarepipelining:reorganizesloopssothateachiterationismadefrominstructionschosenfromdifferentiterationsoftheoriginalloop

    Iteration 0 Iteration

    1 Iteration 2 Iteration

    3 Iteration 4

    Software- pipelined iteration

  • ControlHazardAvoidance

    ConsiderEffectsofIncreasingtheILP Controldependenciesrapidlybecomethelimitingfactor Theytendtonotgetoptimizedbythecompiler

    Higherbranchfrequenciesresult Plusmultipleissue(morethanoneinstructions/sec)morecontrolinstructionspersec.

    Controlstallpenaltieswillgoupasmachinesgofaster AmdahlsLawinaction again

    BranchPrediction:helpsifcanbedoneforreasonablecost Staticbycompiler:appendixC

    e.g.predictnottaken,delaybranch DynamicbyHW:thissection

    CA-Lec5 [email protected] 32

  • DynamicBranchPrediction

    Whydoespredictionwork? Underlyingalgorithmhasregularities Datathatisbeingoperatedonhasregularities Instructionsequencehasredundancies thatareartifactsofwaythat

    humans/compilersthinkaboutproblems

    Isdynamicbranchpredictionbetterthanstaticbranchprediction? Seemstobe Thereareasmallnumberofimportantbranchesinprogramswhich

    havedynamicbehavior

    CA-Lec5 [email protected] 33

  • DynamicBranchPrediction Thepredictorwilldependonthebehaviorofthebranchat

    runtime Goals:

    allowtheprocessortoresolvetheoutcomeofabranchearly,preventcontroldependencesfromcausingstalls

    Effectivenessofabranchpredictionschemedependsnotonlyontheaccuracybutalsoonthecostofabranch BP_Performance=f (accuracy,costofmisprediction)

    BranchHistoryTable(BHT) LowerbitsofPCaddressindextableof1bitvalues

    Nopreciseaddresscheck justmatchthelowerbits Sayswhetherornotbranchtakenlasttime

    CA-Lec5 [email protected] 34

  • 1bitDynamicHardwarePrediction

    CA-Lec5 [email protected] 35

    predict taken predict not-taken

    taken

    not-taken

    taken not-taken

    Problem: Loop case

    LOOP: LOAD R1, 100(R2)MUL R6, R6, R1SUBI R2, R2, #4BNEZ R2, LOOP

    The steady-state prediction behavior will mispredict on the first and last loop iterations

  • BHTPrediction

    CA-Lec5 [email protected] 36

    If two branch instructions withthe same lower bits

    Useful only for the target address is known before CC is decided

  • ProblemwiththeSimpleBHT

    Aliasing Allbrancheswiththesameindex(lower)bitsreferencesameBHT

    entry Hencetheymutuallypredicteachother Noguaranteethatapredictionisright.Butitmaynotmatteranyway

    Avoidance Makethetablebigger OKsinceitsonlyasinglebitvector Thisisacommoncacheimprovementstrategyaswell

    Othercachestrategiesmayalsoapply

    Considerhowthisworksforloops Alwaysmispredicttwiceforeveryloop

    Oneisunavoidablesincetheexitisalwaysasurprise Howeverpreviousexitwillalwayscauseamispredictionthefirsttryofeverynewloopentry

    CA-Lec5 [email protected] 37

    clear benefit is that its cheap and understandable

  • NbitPredictors

    2bitcounterimplies4states Statistically2bitsgetsmostoftheadvantage

    CA-Lec5 [email protected] 38

    idea: improve on the loop entry problem

    predict taken

    predict not-taken

    taken

    not-takentaken

    predict not-taken

    predict taken

    not-taken

    not-taken

    not-taken

    taken

    taken

    Compiler could hint 11 init. on loop branches or it will go to 11 anyway in the 4th iteration

    Only the loop exit causes a mispredict

    11 10

    01 00

    A prediction must miss twice before it is changed

  • BHTAccuracy Mispredictbecauseeither:

    Wrongguessforthatbranch(accuracy) Gotbranchhistoryofwrongbranchwhenindexthetable(size)

    CA-Lec5 [email protected] 04-39

    4K of BPB with 2-bit entries misprediction rates on SPEC89@IBM Power

  • ToIncreasetheBHTSize 4096aboutasgoodasinfinitetable Thehitrateofthebufferisclearlynotthelimitingfactorforanenough

    largeBHTsize

    CA-Lec5 [email protected] 04-40

  • Theworstcaseforthe2bitpredictor

    if (aa==2)aa=0;

    if (bb==2)bb=0;

    if (aa != bb) {

    CA-Lec5 [email protected] 41

    DSUBUI R3, R1, #2BNEZ R3, L1 ;branch b1(aa!=2)DADDD R1, R0, R0 ;aa=0

    L1: DSUBUI R3, R2, #2BNEZ R3, L2 ;branch b2(bb!=2)DADDD R2, R0, R0 ;bb=0

    L2: DSUBU R3, R1, R2BEQZ R3, L3 ;branch b3(aa==bb)

    if the first 2 untaken then the 3rd will always be taken

    aa and bb are assigned to R1 and R2

  • ImprovePredictionStrategyByCorrelatingBranches

    Considertheworstcaseforthe2bitpredictorif (aa==2) then aa=0;if (bb==2) then bb=0;if (aa != bb) then whatever

    singlelevelpredictorscannevergetthiscase

    Correlatingor2levelpredictors Thepredictorusesthebehaviorofotherbranch(es)tomakea

    prediction Correlation=whathappenedonthelastbranch Predictor=whichwaytogo

    CA-Lec5 [email protected] 42

    if the first 2 fail then the 3rdwill always be taken

  • CorrelatingBranches Hypothesis:recentlyexecutedbranchesarecorrelated

    Idea:recordmmostrecently executedbranchesastakenornottaken,andusethatpatterntoselecttheproperbranchhistorytable

    Ingeneral,(m,n)predictor meansrecordlastmbranchestoselectbetween2m historytableseachwithnbitcounters Old2bitBHTisthena(0,2)predictor

    GlobalBranchHistory:mbitshiftregisterkeepingT/NTstatusoflastmbranches

    Eachentryintablehasmnbitpredictors

    CA-Lec5 [email protected] 43

    Two-level predictors

  • 2Level(m,n)BHT Usethebehaviorofthelastmbranches tochoosefrom2m branchpredictors,eachofwhichisannbitpredictor forasinglebranch

    Totalbitsforthe(m,n)BHTpredictionbuffer:

    pbitsofbufferindex =2P bitBHT 2m banksofmemory selectedbytheglobalbranchhistory(whichisjustashiftregister) e.g.acolumnaddress

    Usepbitsofthebranchaddresstoselectrow Getthenpredictorbits intheentrytomakethedecision

    CA-Lec5 [email protected] 44

    pm nbitsmemoryTotal 22__

  • (2,2)PredictorImplementation

    CA-Lec5 [email protected] 45

    4 banks = each with 32 2-bit predictor entries

    p=5m=2n=2

    5:32

  • ExampleofCorrelatingBranchPredictors

    if (d==0)d = 1;

    if (d==1)

    CA-Lec5 [email protected] 46

    BNEZ R1, L1 ;branch b1 (d!=0)DAAIU R1, R0, #1 ;d==0, so d=1

    L1: DAAIU R3, R1, #-1BNEZ R3, L2 ;branch b2 (d!=1)

    L2:

    d is assigned to R1

  • ExampleofCorrelatingBranchPredictors(Cont.)initial

    value of dd==0? b1 value of d

    before b2d==1? b2

    0 YES not taken 1 YES not taken1 NO taken 1 YES not taken2 NO taken 2 NO taken

    CA-Lec5 [email protected] 47

    d=? b1 prediction

    b1 action New b1 prediction

    b2 prediction

    b2 action New b2 prediction

    2 NT T T NT T T

    0 T NT NT T NT NT

    2 NT T T NT T T

    0 T NT NT T NT NT

    1-bit predictor initialized to NT

    All the branches are mispredicted !!!

  • ExampleofCorrelatingBranchPredictors(Cont.)Prediction bits Prediction if last branch

    not takenPrediction if last branch

    takenNT/NT NT NTNT/T NT TT/NT T NTT/T T T

    CA-Lec5 [email protected] 48

    d=? b1 prediction

    b1 action New b1 prediction

    b2 prediction

    b2 action New b2 prediction

    2 NT/NT T T/NT NT/NT T NT/T0 T/NT NT T/NT NT/T NT NT/T2 T/NT T T/NT NT/T T NT/T0 T/NT NT T/NT NT/T NT NT/T

    Use 1-bit correlation + 1-bit prediction with initialized to NT/NT(1,1) predictor

  • Comparison:AccuracyofDifferent2bitPredictors

    CA-Lec5 [email protected] 04-49

    8K bits

    8K bits

  • TournamentPredictors Recallthatthecorrelatorisjustalocalpredictor Adaptivelycombinelocalandglobal predictors

    Multiplepredictors Onebasedonglobalinformation:Resultsofrecentlyexecutedmbranches

    Onebasedonlocalinformation:Resultsofpastexecutionsofthecurrentbranchinstruction

    Selector tochoosewhichpredictorstouse E.g.:2bitsaturatingcounter,incrementedwheneverthepredictedpredictoriscorrectandtheotherpredictorisincorrect,anditisdecrementedinthereversesituation

    Advantage Abilitytoselecttherightpredictorfortherightbranch

    CA-Lec5 [email protected] 50

    The most popular one

  • StateTransitionDiagramforATournamentPredictor

    CA-Lec5 [email protected] 51

  • BranchPredictionPerformance

    CA-Lec5 [email protected] 52

  • HighPerformanceInstructionDelivery

    Foramultipleissueprocessor,predictingbrancheswellisnotenough

    Deliverahighbandwidthinstructionstream isnecessary(e.g.,4~8instructions/cycle) Branchtargetbuffer Integratedinstructionfetchunit Indirectbranchbypredictingreturnaddress

    CA-Lec5 [email protected] 53

  • BranchTargetBuffer/Cache

    Toreducethebranchpenaltyfrom1cycleto0 NeedtoknowwhattheaddressisattheendofIF Buttheinstructionisnotevendecodedyet Sousetheinstructionaddressratherthanwaitfordecode

    Ifpredictionworksthenpenaltygoesto0!

    BTBIdea Cachetostoretakenbranches(noneedtostoreuntaken) AccesstheBTBduringIFstage Matchtagisinstructionaddress comparewithcurrentPC DatafieldisthepredictedPC

    Maywanttoaddpredictorfield Toavoidthemispredict twiceoneveryloopphenomenon Addscomplexitysincewenowhavetotrackuntakenbranchesaswell

    CA-Lec5 [email protected] 54

  • BTB Illustration

    CA-Lec5 [email protected] 55

  • FlowchartforBTB

    CA-Lec5 [email protected] 56

    Send out the predicted PC

    A branch?

  • PenaltiesUsingthisApproachfor5StageMIPS

    Instruction in buffer

    Prediction Actual Branch Penalty Cycles

    Yes Taken Taken 0Yes Taken Not Taken 2No Taken 2No Not Taken 0

    CA-Lec5 [email protected] 57

    Note: Predict_wrong = 1 CC to update BTB + 1 CC to restart fetching Not found and taken = 2CC to update BTBNote: For complex pipeline design, the penalties may be higher

  • Example Givenpredictionaccuracy(forinst.inbuffer):90% Givenhitrateinbuffer(forbranchespredictedtoken):90% Assume60%ofthebranchesaretaken Determinethetotalbranchpenalty=?

    Solution Probability(branchinbuffer,butactuallynottaken)=percentbuffer

    hitrate percentincorrectprediction=90% 10%=0.09 Probability(branchnotinbuffer,butactuallytaken)=10% Hence,wehave2cycles (0.09+0.1)=0.38cycles

    Comparingthedelaybranchwiththepenalty=0.5cycles/branch

    CA-Lec5 [email protected] 58

  • IntegratedInstructionFetchUnits

    Considerthefetchunitasaseparateautonomousunit,notapipelinestage

    Functionsfortheintegratedinstructionfetchunit Branchprediction Prefetch

    Todelivermultipleinstructionspercycle Instructionmemoryaccessandbuffering

    mayrequireaccessingmultiplecachelines prefetchmayhidethelatencyformemoryaccess bufferingmaybenecessary

    CA-Lec5 [email protected] 59

  • ReturnAddressPredictor

    Indirectjump jumpswhosedestinationaddressvariesatruntime indirectprocedurecall,selectorcase,procedurereturn SPEC89benchmarks:85%ofindirectjumpsareprocedure

    returns AccuracyofBTBforprocedurereturnsarelow

    ifprocedureiscalledfrommanyplaces,andthecallsfromoneplacearenotclusteredintime

    Useasmallbufferofreturnaddressesoperatingasastack Cachethemostrecentreturnaddresses Pushareturnaddressatacall,andpoponeoffatareturn Ifthecacheissufficientlarge(maxcalldepth) prefect

    CA-Lec5 [email protected] 60

  • DynamicBranchPredictionSummary

    Branchpredictionschemearelimitedby Predictionaccuracy Mispredictionpenalty

    BranchHistoryTable:2bitsforloopaccuracy Correlation:Recentlyexecutedbranchescorrelatedwithnextbranch Tournamentpredictorstakeinsighttonextlevel,byusingmultiple

    predictors usuallyonebasedonglobalinformationandonebasedonlocalinformation,

    andcombiningthemwithaselector In2006,tournamentpredictorsusing 30Kbitsareinprocessorslikethe

    Power5andPentium4 BranchTargetBuffer:includebranchaddress&prediction Reducepenaltyfurtherbyfetchinginstructionsfromboththepredicted

    andunpredicteddirection Requiredualportedmemory,interleavedcache HWcost CachingaddressesorinstructionsfrommultiplepathinBTB

    CA-Lec5 [email protected] 61