![Page 1: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring](https://reader035.vdocuments.mx/reader035/viewer/2022071513/6133d99bdfd10f4dd73b5adf/html5/thumbnails/1.jpg)
CSC 631: High-Performance Computer Architecture
Spring 2017
Lecture 6: Out-of-Order Processors

Supercomputers
Definitions of a supercomputer:
§ Fastest machine in world at given task
§ A device to turn a compute-bound problem into an I/O bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray

§ CDC 6600 (Cray, 1964) regarded as first supercomputer

CDC 6600
Seymour Cray, 1963
§ A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
  - a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
  - over 100 sold ($7-10M each)

CDC 6600: A Load/Store Architecture
• Separate instructions to manipulate three types of reg.
  • 8 x 60-bit data registers (X)
  • 8 x 18-bit address registers (A)
  • 8 x 18-bit index registers (B)
• All arithmetic and logic instructions are register-to-register
    opcode (6 bits) | i (3) | j (3) | k (3)         Ri ← Rj op Rk
• Only Load and Store instructions refer to memory!
    opcode (6 bits) | i (3) | j (3) | disp (18)     Ri ← M[Rj + disp]
  Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
  - very useful for vector operations
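
CDC 6600: Datapath
[Figure: CDC 6600 datapath. Blocks: Central Memory (128 Kwords, 32 banks, 1 µs cycle); Operand Regs (8 x 60-bit); Address Regs (8 x 18-bit); Index Regs (8 x 18-bit); Inst. Stack (8 x 60-bit); IR; 10 Functional Units. Paths between memory and the registers are labeled operand, result, operand addr, and result addr.]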

CDC 6600 ISA designed to simplify high-performance implementation
§ Use of three-address, register-register ALU instructions simplifies pipelined implementation
  - Only 3-bit register specifier fields checked for dependencies
  - No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  - Software can schedule load of address register before use of value
  - Can interleave independent instructions in between
§ CDC 6600 has multiple parallel but unpipelined functional units
  - E.g., 2 separate multipliers
§ Follow-on machine CDC 7600 used pipelined functional units
  - Foreshadows later RISC designs

CDC 6600: Vector Addition
          B0 ← -n
    loop: JZE B0, exit
          A0 ← B0 + a0      load X0
          A1 ← B0 + b0      load X1
          X6 ← X0 + X1
          A6 ← B0 + c0      store X6
          B0 ← B0 + 1
          jump loop
Ai = address register
Bi = index register
Xi = data register
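
The loop above can be mimicked in Python as a rough sketch. It assumes (the slide does not say so) that a0, b0, and c0 are the addresses one past the last element of each array, so that the negative index in B0 counts up to zero. Memory is modeled as a flat Python list, and the function name `vector_add` is made up for this illustration.

```python
def vector_add(mem, a0, b0, c0, n):
    """Mimic the CDC 6600 loop: c[i] = a[i] + b[i] for n elements.

    Assumption: a0, b0, c0 are addresses one past the end of each array,
    so the index register B0 runs from -n up to 0.
    """
    B0 = -n                      # B0 <- -n
    while B0 != 0:               # loop: JZE B0, exit
        X0 = mem[B0 + a0]        # A0 <- B0 + a0   (setting A0 loads X0)
        X1 = mem[B0 + b0]        # A1 <- B0 + b0   (setting A1 loads X1)
        X6 = X0 + X1             # X6 <- X0 + X1
        mem[B0 + c0] = X6        # A6 <- B0 + c0   (setting A6 stores X6)
        B0 += 1                  # B0 <- B0 + 1, then jump loop
    return mem


# Three arrays of length 3 laid out back to back: a, b, c.
mem = [1, 2, 3, 10, 20, 30, 0, 0, 0]
vector_add(mem, a0=3, b0=6, c0=9, n=3)
```

Only the address-register writes touch memory, which is why the two loads can be scheduled ahead of the add that uses X0 and X1.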

CDC 6600 Scoreboard
§ Instructions dispatched in-order to functional units provided no structural hazard or WAW
  - Stall on structural hazard, no functional units available
  - Only one pending write to any register
§ Instructions wait for input operands (RAW hazards) before execution
  - Can execute out-of-order
§ Instructions wait for output register to be read by preceding instructions (WAR)
  - Result held in functional unit until register free
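
The dispatch rule in the first bullet can be sketched as follows. This is a simplified model, not the actual CDC 6600 control logic: it covers only the structural-hazard and WAW checks made at dispatch, and the class and field names are hypothetical. RAW and WAR handling are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    free_units: dict                                   # unit kind -> idle unit count
    pending_writes: set = field(default_factory=set)   # registers awaiting a result

    def can_dispatch(self, unit, dest):
        if self.free_units.get(unit, 0) == 0:
            return False          # structural hazard: no unit of this kind is free
        if dest in self.pending_writes:
            return False          # WAW: only one pending write per register
        return True

    def dispatch(self, unit, dest):
        """Try to dispatch; return False if the instruction must stall."""
        if not self.can_dispatch(unit, dest):
            return False
        self.free_units[unit] -= 1
        self.pending_writes.add(dest)
        return True

    def complete(self, unit, dest):
        """Result written back: free the unit and clear the pending write."""
        self.free_units[unit] += 1
        self.pending_writes.discard(dest)


sb = Scoreboard({"mul": 2, "add": 1})
first_mul = sb.dispatch("mul", "X6")     # accepted
second_mul = sb.dispatch("mul", "X6")    # stalls: WAW on X6
first_add = sb.dispatch("add", "X1")     # accepted
second_add = sb.dispatch("add", "X2")    # stalls: the only adder is busy
```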

[Photo © IBM]

IBM 360/91 Floating-Point Unit
R. M. Tomasulo, 1967
[Figure: instructions feed the Floating-Point Regfile (entries 1-4), the load buffers from memory (1-6), and the store buffers to memory; reservation stations sit in front of the Adder (1-3) and the Mult unit (1-2); each buffer and station slot holds <p, tag/data>; results return on a common bus as <tag, result>.]
Distribute reservation stations to functional units.
Common bus ensures that data is made available immediately to all the instructions waiting for it. Match tag, if equal, copy value & set presence “p”.
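
The tag-match step on the common bus can be sketched as below. This is a toy model under assumed names (`Operand`, `broadcast`); it shows only how a <tag, result> pair wakes up waiting operands, not how the 360/91 arbitrated the bus.

```python
class Operand:
    """One <p, tag/data> slot of a reservation station."""

    def __init__(self, tag=None, value=None):
        self.p = tag is None      # presence bit: True when the value is available
        self.tag = tag            # producer's tag while waiting
        self.value = value


def broadcast(stations, tag, result):
    """Deliver <tag, result> to every waiting operand with a matching tag."""
    for operands in stations:
        for op in operands:
            if not op.p and op.tag == tag:
                op.value = result     # copy value
                op.p = True           # set presence "p"
                op.tag = None


stations = [
    [Operand(value=1.5), Operand(tag=2)],   # waits on producer 2
    [Operand(tag=2), Operand(tag=5)],       # waits on producers 2 and 5
]
broadcast(stations, tag=2, result=4.0)
```

After the broadcast, both operands tagged 2 hold 4.0, while the operand tagged 5 keeps waiting.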

Out-of-Order Fades into Background
Out-of-order processing implemented commercially in 1960s, but disappeared again until 1990s as two major problems had to be solved:
§ Precise traps
  - Imprecise traps complicate debugging and OS code
  - Note, precise interrupts are relatively easy to provide
§ Branch prediction
  - Amount of exploitable instruction-level parallelism (ILP) limited by control hazards
Also, simpler machine designs in new technology beat complicated machines in old technology
  - Big advantage to fit processor & caches on one chip
  - Microprocessors had era of 1%/week performance scaling

Separating Completion from Commit
§ Re-order buffer holds register results from completion until commit
  - Entries allocated in program order during decode
  - Buffers completed values and exception state until in-order commit point
  - Completed values can be used by dependents before committed (bypassing)
  - Each entry holds program counter, instruction type, destination register specifier and value if any, and exception status (info often compressed to save hardware)
§ Memory reordering needs special data structures
  - Speculative store address and data buffers
  - Speculative load address and data buffers
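
In-Order Commit for Precise Traps
§ In-order instruction fetch and decode, and dispatch to reservation stations inside reorder buffer
§ Instructions issue from reservation stations out-of-order
§ Out-of-order completion, values stored in temporary buffers
§ Commit is in-order, checks for traps, and if none updates architectural state
[Figure: Fetch → Decode → Reorder Buffer → Commit, with Execute attached to the Reorder Buffer. Fetch and Decode are in-order, Execute is out-of-order, Commit is in-order. On "Trap?" at commit, Kill signals go to the earlier stages and the handler PC is injected at Fetch.]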
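
A minimal sketch of the in-order commit step, assuming each ROB entry is a dict with hypothetical keys `pc`, `dest`, `value`, `done`, and `exception`. Real designs commit a bounded number of entries per cycle; this version simply drains as many as it can.

```python
from collections import deque


def commit(rob, arch_regs):
    """Commit completed entries from the head of the ROB in program order.

    Returns the PC of a trapping instruction, or None if no trap was taken.
    """
    while rob and rob[0]["done"]:
        entry = rob[0]
        if entry["exception"]:
            trap_pc = entry["pc"]
            rob.clear()               # kill the trapping and all younger instructions
            return trap_pc            # caller injects the handler PC
        rob.popleft()
        if entry["dest"] is not None:
            arch_regs[entry["dest"]] = entry["value"]   # update architectural state
    return None


rob = deque([
    {"pc": 0, "dest": "x1", "value": 5, "done": True,  "exception": False},
    {"pc": 4, "dest": "x2", "value": 7, "done": False, "exception": False},
    {"pc": 8, "dest": "x3", "value": 9, "done": True,  "exception": False},
])
regs = {}
first_result = commit(rob, regs)      # commits only the oldest entry
```

The entry at PC 8 has completed out of order, but it cannot update `regs` until the older entry at PC 4 has committed. If that older entry traps, the value 9 is discarded, which is what makes the trap precise.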

Phases of Instruction Execution
[Figure: PC → I-cache → Fetch Buffer → Decode/Rename → Issue Buffer → Functional Units → Result Buffer → Commit → Architectural State]
Fetch: Instruction bits retrieved from instruction cache.
Decode: Instructions dispatched to appropriate issue buffer.
Execute: Instructions and operands issued to functional units. When execution completes, all results and exception flags are available.
Commit: Instruction irrevocably updates architectural state (aka “graduation”), or takes precise trap/interrupt.

In-Order versus Out-of-Order Phases
§ Instruction fetch/decode/rename always in-order
  - Need to parse ISA sequentially to get correct semantics
  - Proposals for speculative OoO instruction fetch, e.g., Multiscalar. Predict control flow and data dependencies across sequential program segments fetched/decoded/executed in parallel, fix up if prediction wrong
§ Dispatch (place instruction into machine buffers to wait for issue) also always in-order
  - Some use “Dispatch” to mean issue, but not in these lectures

In-Order Versus Out-of-Order Issue
§ In-order issue:
  - Issue stalls on RAW dependencies or structural hazards, or possibly WAR/WAW hazards
  - Instruction cannot issue to execution units unless all preceding instructions have issued to execution units
§ Out-of-order issue:
  - Instructions dispatched in program order to reservation stations (or other forms of instruction buffer) to wait for operands to arrive, or other hazards to clear
  - While earlier instructions wait in issue buffers, following instructions can be dispatched and issued out-of-order

In-Order versus Out-of-Order Completion
§ All but the simplest machines have out-of-order completion, due to different latencies of functional units and desire to bypass values as soon as available
§ Classic RISC 5-stage integer pipeline just barely has in-order completion
  - Load takes two cycles, but following one-cycle integer op completes at same time, not earlier
  - Adding pipelined FPU immediately brings OoO completion

In-Order versus Out-of-Order Commit
§ In-order commit supports precise traps, standard today
  - Some proposals to reduce the cost of in-order commit by retiring some instructions early to compact reorder buffer, but this is just an optimized in-order commit
§ Out-of-order commit was effectively what early OoO machines implemented (imprecise traps) as completion irrevocably changed machine state
  - i.e., complete == commit in these machines

OoO Design Choices
§ Where are reservation stations?
  - Part of reorder buffer, or in separate issue window?
  - Distributed by functional units, or centralized?
§ How is register renaming performed?
  - Tags and data held in reservation stations, with separate architectural register file
  - Tags only in reservation stations, data held in unified physical register file

“Data-in-ROB” Design (HP PA8000, Pentium Pro, Core2Duo, Nehalem)
    ROB entry: v | i | Opcode | p Tag Src1 | p Tag Src2 | p Reg Result | Except?
    (live entries lie between the Oldest and Free pointers)
§ Managed as circular buffer in program order, new instructions dispatched to free slots, oldest instruction committed/reclaimed when done (“p” bit set on result)
§ Tag is given by index in ROB (Free pointer value)
§ In dispatch, non-busy source operands read from architectural register file and copied to Src1 and Src2 with presence bit “p” set. Busy operands copy tag of producer and clear “p” bit.
§ Set valid bit “v” on dispatch, set issued bit “i” on issue
§ On completion, search source tags, set “p” bit and copy data into src on tag match. Write result and exception flags to ROB.
§ On commit, check exception status, and copy result into architectural register file if no trap.
§ On trap, flush machine and ROB, set free=oldest, jump to handler

Managing Rename for Data-in-ROB
Rename table associated with architectural registers, managed in decode/dispatch
    Rename table entry (one entry per arch. register): p | Tag | Value
§ If “p” bit set, then use value in architectural register file
§ Else, tag field indicates instruction that will/has produced value
§ For dispatch, read source operands <p, tag, value> from arch. regfile, and also read <p, result> from producing instruction in ROB, bypassing as needed. Copy to ROB
§ Write destination arch. register entry with <0, Free, _>, to assign tag to ROB index of this instruction
§ On commit, update arch. regfile with <1, _, Result>
§ On trap, reset table (All p=1)

Data Movement in Data-in-ROB Design
[Figure: data movement between the Architectural Register File, the ROB (Source Operands and Result Data fields), and the Functional Units. Labeled transfers:]
  - Read operands during decode
  - Write sources in dispatch
  - Bypass newer values at dispatch
  - Read operands at issue
  - Write results at completion
  - Read results for commit
  - Write results at commit

Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy/Ivy Bridge)
§ Rename all architectural registers into a single physical register file during decode, no register values read
§ Functional units read and write from single unified register file holding committed and temporary registers in execute
§ Commit only updates mapping of architectural register to physical register, no data movement
[Figure: the Decode Stage Register Mapping and the Committed Register Mapping both point into the Unified Physical Register File; Functional Units read operands at issue and write results at completion.]

Lifetime of Physical Registers
• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)
    Before rename            After rename
    ld   x1, (x3)            ld   P1, (Px)
    addi x3, x1, #4          addi P2, P1, #4
    sub  x6, x7, x9          sub  P3, Py, Pz
    add  x3, x3, x6          add  P4, P2, P3
    ld   x6, (x1)            ld   P5, (P1)
    add  x6, x6, x3          add  P6, P5, P4
    sd   x6, (x1)            sd   P6, (P1)
    ld   x6, (x11)           ld   P7, (Pw)
When can we reuse a physical register?
When next writer of same architectural register commits
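
The renaming shown on the slide can be reproduced with a short sketch. The tuple encoding `(op, dest, sources)` and the function name `rename` are assumptions made for this example; immediates are omitted because they are not renamed.

```python
def rename(program, rename_table, free_list):
    """Rename architectural registers to physical ones, one instruction at a time."""
    table = dict(rename_table)
    free = list(free_list)
    renamed = []
    for op, dest, srcs in program:
        psrcs = [table[s] for s in srcs]   # sources use the current mapping
        pdest = None
        if dest is not None:               # stores have no destination register
            pdest = free.pop(0)            # allocate a fresh physical register
            table[dest] = pdest            # later readers of dest see the new one
        renamed.append((op, pdest, psrcs))
    return renamed


program = [
    ("ld",   "x1", ["x3"]),
    ("addi", "x3", ["x1"]),
    ("sub",  "x6", ["x7", "x9"]),
    ("add",  "x3", ["x3", "x6"]),
    ("ld",   "x6", ["x1"]),
    ("add",  "x6", ["x6", "x3"]),
    ("sd",   None, ["x6", "x1"]),
    ("ld",   "x6", ["x11"]),
]
initial = {"x3": "Px", "x7": "Py", "x9": "Pz", "x11": "Pw"}
renamed = rename(program, initial, ["P1", "P2", "P3", "P4", "P5", "P6", "P7"])
```

Sources are looked up before the destination mapping is overwritten, which is why `add x3, x3, x6` reads P2 and writes P4.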

Physical Register Management
[Figure sequence (8 slides): step-by-step renaming of the instruction stream below, followed by execution and commit of the first two instructions.]

Instruction stream:
    ld   x1, 0(x3)
    addi x3, x1, #4
    sub  x6, x7, x6
    add  x3, x3, x6
    ld   x6, 0(x1)

Initial state:
    Rename Table:    x1 → P8, x3 → P7, x6 → P5, x7 → P6
    Physical Regs:   P5 = <x6> (p), P6 = <x7> (p), P7 = <x3> (p), P8 = <x1> (p)
    Free List:       P0, P1, P3, P2, P4

ROB after all five instructions have been renamed (one instruction per slide):
    use  ex  op    p1  PR1  p2  PR2  Rd  LPRd  PRd
    x        ld    p   P7            x1  P8    P0
    x        addi      P0            x3  P7    P1
    x        sub   p   P6   p   P5   x6  P5    P3
    x        add       P1       P3   x3  P1    P2
    x        ld        P0            x6  P3    P4
(LPRd requires third read port on Rename Table for each instruction)

Execute & Commit:
  - ld executes and commits: P0 holds <x1> with “p” set, source operands waiting on P0 are marked present, and its LPRd (P8) returns to the free list
  - addi executes and commits: P1 holds <x3> with “p” set, and its LPRd (P7) returns to the free list

MIPS R10K Trap Handling
§ Rename table is repaired by unrenaming instructions in reverse order using the PRd/LPRd fields
§ The Alpha 21264 had similar physical register file scheme, but kept complete rename table snapshots for each instruction in ROB (80 snapshots total)
  - Flash copy all bits from snapshot to active table in one cycle
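
The reverse-order repair in the first bullet can be sketched as follows. The ROB tuples `(Rd, PRd, LPRd)` follow the worked example from the Physical Register Management slides. The function name, the choice to squash the trapping instruction itself, and returning registers to the front of the free list are assumptions of this sketch, not documented R10K behavior.

```python
def unrename(rename_table, free_list, rob, first_squashed):
    """Undo renames from the youngest ROB entry back to index first_squashed."""
    for rd, prd, lprd in reversed(rob[first_squashed:]):
        rename_table[rd] = lprd       # restore the previous mapping of Rd
        free_list.insert(0, prd)      # the squashed destination is free again
    del rob[first_squashed:]


# State after renaming ld, addi, sub, add, ld from the worked example.
rename_table = {"x1": "P0", "x3": "P2", "x6": "P4", "x7": "P6"}
free_list = []
rob = [
    ("x1", "P0", "P8"),   # ld   x1, 0(x3)
    ("x3", "P1", "P7"),   # addi x3, x1, #4
    ("x6", "P3", "P5"),   # sub  x6, x7, x6
    ("x3", "P2", "P1"),   # add  x3, x3, x6
    ("x6", "P4", "P3"),   # ld   x6, 0(x1)
]
unrename(rename_table, free_list, rob, first_squashed=2)   # trap at the sub
```

Walking youngest-first matters: x6 is first restored to P3 (undoing the last ld) and only then to P5 (undoing the sub). The snapshot scheme attributed to the Alpha 21264 avoids this serial walk at the cost of storing a full table per instruction.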

Part II: Advanced Out-of-Order Superscalar Designs

Reorder Buffer Holds Active Instructions (Decoded but not Committed)
[Figure: ROB contents at cycle t and at cycle t+1. Commit, Execute, and Fetch pointers mark positions in the instruction stream below, with older instructions at the top and newer instructions at the bottom.]
    …
    ld  x1, (x3)
    add x3, x1, x2
    sub x6, x7, x9
    add x3, x3, x6
    ld  x6, (x1)
    add x6, x6, x3
    sd  x6, (x1)
    ld  x6, (x1)
    …

Separate Issue Window from ROB
The issue window holds only instructions that have been decoded and renamed but not issued into execution. Has register tags and presence bits, and pointer to ROB entry.
    Issue window entry: use | ex | op | p1 | PR1 | p2 | PR2 | PRd | ROB#
Reorder buffer used to hold exception information for commit.
    ROB entry (between the Oldest and Free pointers): Done? | Rd | LPRd | PC | Except?
ROB is usually several times larger than issue window. Why?

Superscalar Register Renaming
§ During decode, instructions allocated new physical destination register
§ Source operands renamed to physical register with newest value
§ Execution unit only sees physical register numbers
[Figure: two instructions (Inst 1, Inst 2), each Op | Dest | Src1 | Src2, drive the Read Addresses ports of the Rename Table; the Read Data outputs, together with new registers from the Register Free List, form the renamed Op | PDest | PSrc1 | PSrc2; the Write Ports update the mapping.]
Does this work?

Superscalar Register Renaming
[Figure: the same rename datapath with comparators (=?) added on the source fields.]
Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup.
MIPS R10K renames 4 serially-RAW-dependent insts/cycle
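
A sketch of renaming a group of instructions in one cycle, with the intra-group RAW check done alongside the table lookup. It assumes every instruction in the group has a destination register; `rename_group` and the tuple encoding are invented for illustration.

```python
def rename_group(group, table, free_list):
    """Rename all instructions of one decode group 'in the same cycle'."""
    # Parallel table lookups: every source sees the mapping from before this cycle.
    looked_up = [[table[s] for s in srcs] for _, _, srcs in group]
    pdests = [free_list.pop(0) for _ in group]

    renamed = []
    for i, (op, dest, srcs) in enumerate(group):
        psrcs = list(looked_up[i])
        # Comparators (=?): if an older instruction in the group writes a source,
        # take that instruction's new physical register instead of the stale lookup.
        for j in range(i):                    # oldest first, so the youngest match wins
            for k, s in enumerate(srcs):
                if s == group[j][1]:
                    psrcs[k] = pdests[j]
        renamed.append((op, pdests[i], psrcs))

    for (_, dest, _), pd in zip(group, pdests):
        table[dest] = pd                      # write ports update the mapping
    return renamed


table = {"x1": "P8", "x2": "P9", "x3": "P7"}
group = [("add", "x3", ["x1", "x2"]), ("sub", "x4", ["x3", "x1"])]
renamed = rename_group(group, table, ["P0", "P1"])
```

Without the comparator step, `sub` would read the stale mapping P7 for x3 instead of the P0 just allocated to `add`.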

Control Flow Penalty
[Figure: pipeline PC → Fetch (I-cache) → Fetch Buffer → Decode → Issue Buffer → Execute (Func. Units) → Result Buffer → Commit (Arch. State). "Next fetch started" is marked at the PC and "Branch executed" at Execute.]
Modern processors may have >10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if pipeline doesn’t follow correct instruction flow?
~ Loop length x pipeline width + buffers

Reducing Control Flow Penalty
§ Software solutions
  - Eliminate branches - loop unrolling
    - Increases the run length
  - Reduce resolution time - instruction scheduling
    - Compute the branch condition as early as possible (of limited value because branches often in critical path through code)
§ Hardware solutions
  - Find something else to do - delay slots
    - Replaces pipeline bubbles with useful work (requires software cooperation); quickly see diminishing returns
  - Speculate - branch prediction
    - Speculative execution of instructions beyond the branch
    - Many advances in accuracy

Branch Prediction
Motivation:
  Branch penalties limit performance of deeply pipelined processors
  Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
  Prediction structures:
  • Branch history tables, branch target buffers, etc.
  Mispredict recovery mechanisms:
  • Keep result computation separate from commit
  • Kill instructions following branch in pipeline
  • Restore state to that following branch

Importance of Branch Prediction
§ Consider 4-way superscalar with 8 pipeline stages from fetch to dispatch, and 80-entry ROB, and 3 cycles from issue to branch resolution
§ On a mispredict, could throw away 8*4 + (80-1) = 111 instructions
§ Improving from 90% to 95% prediction accuracy, removes 50% of branch mispredicts
  - If 1/6 instructions are branches, then move from 60 instructions between mispredicts, to 120 instructions between mispredicts
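
The arithmetic on this slide can be checked with a few lines of Python. The function names are made up for this sketch, and accuracy is passed as a whole-number percentage to keep the results exact.

```python
def squashed_on_mispredict(width, front_end_stages, rob_entries):
    """Instructions in flight behind a mispredicted branch, worst case."""
    # Full front end (stages x width) plus every ROB entry except the branch itself.
    return front_end_stages * width + (rob_entries - 1)


def instructions_between_mispredicts(accuracy_percent, insts_per_branch):
    """Average instruction count between two mispredicted branches."""
    mispredict_percent = 100 - accuracy_percent
    return insts_per_branch * 100 / mispredict_percent


lost = squashed_on_mispredict(width=4, front_end_stages=8, rob_entries=80)
at_90 = instructions_between_mispredicts(90, insts_per_branch=6)
at_95 = instructions_between_mispredicts(95, insts_per_branch=6)
```

With these inputs `lost` is 111, `at_90` is 60, and `at_95` is 120. At 90% accuracy, up to 111 instructions are thrown away roughly every 60 instructions, which is why small gains in accuracy pay off so heavily.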

Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
    backward branches: 90% taken
    forward branches: 50% taken
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110
    bne0 (preferred taken)    beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64
    typically reported as ~80% accurate

Dynamic Branch Prediction
learning based on past behavior
§ Temporal correlation
  - The way a branch resolves may be a good predictor of the way it will resolve at the next execution
§ Spatial correlation
  - Several branches may resolve in a highly correlated manner (a preferred path of execution)

One-Bit Branch History Predictor
§ For each branch, remember last way branch went
§ Has problem with loop-closing backward branches, as two mispredicts occur on every loop execution
  1. first iteration predicts loop backwards branch not-taken (loop was exited last time)
  2. last iteration predicts loop backwards branch taken (loop continued last time)
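
The two-mispredicts-per-loop behavior can be seen in a small simulation. The loop length of 10 iterations and the initial prediction are arbitrary choices for this sketch.

```python
def one_bit_mispredicts(outcomes, initial=False):
    """Count mispredicts of a predictor that repeats the last outcome."""
    last = initial
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken          # remember the last way the branch went
    return misses


# A 10-iteration loop: the backward branch is taken 9 times, then falls through.
loop_run = [True] * 9 + [False]
misses = one_bit_mispredicts(loop_run * 3)   # the loop is executed three times
```

Each execution of the loop costs one mispredict on entry and one on exit, so three executions give six mispredicts.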

Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[Figure: state diagram with four states, "take right", "take wrong", "¬take wrong", and "¬take right", with transitions labeled taken / ¬taken.]
BP state: (predict take/¬take) x (last prediction right/wrong)
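
One way to read the state description above is sketched below. The state is the pair (predict taken?, last prediction right?). What happens immediately after the prediction flips is not recoverable from the extracted figure; this sketch assumes the predictor lands in the "right" state of the new direction.

```python
def step(state, taken):
    """Advance the 2-bit state (predict_taken, last_right) by one branch outcome."""
    predict_taken, last_right = state
    if taken == predict_taken:
        return (predict_taken, True)       # correct: stay, mark last prediction right
    if last_right:
        return (predict_taken, False)      # first mistake: keep the prediction
    return (not predict_taken, True)       # second consecutive mistake: flip (assumed target state)


def two_bit_mispredicts(outcomes, state=(True, True)):
    misses = 0
    for taken in outcomes:
        if taken != state[0]:
            misses += 1
        state = step(state, taken)
    return misses


# Same 10-iteration loop as for the one-bit predictor, executed three times.
loop_run = [True] * 9 + [False]
misses = two_bit_mispredicts(loop_run * 3)
```

The single not-taken outcome at loop exit is only one mistake, so the prediction stays "take" and the next loop entry is predicted correctly: three mispredicts instead of six.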
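
Branch History Table (BHT)
[Figure: k bits of the Fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which outputs Taken/¬Taken?. In parallel, the I-Cache supplies the instruction (opcode, offset); the opcode answers "Branch?" and PC + offset gives the Target PC.]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions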
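
A sketch of the indexing scheme: drop the two low-order PC bits and use the next k bits as the table index. The two bits per entry are modeled here as a saturating counter, which is one common choice and an assumption of this sketch; the class name and initial counter value are also arbitrary.

```python
class BHT:
    """2^k-entry branch history table with 2 bits per entry."""

    def __init__(self, k):
        self.k = k
        self.counters = [1] * (1 << k)     # start weakly not-taken (arbitrary)

    def index(self, pc):
        # Instructions are 4-byte aligned, so the two low PC bits carry no information.
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2      # True means predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


bht = BHT(k=12)                            # 4K entries, as on the slide
before = bht.predict(0x1004)
bht.update(0x1004, taken=True)
after = bht.predict(0x1004)
```

There are no tags, so two branches whose PCs share the same k index bits (for example 0x1004 and 0x5004 when k = 12) share one entry and can interfere with each other.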

Exploiting Spatial Correlation
Yeh and Patt, 1992
    if (x[i] < 7) then
        y += 1;
    if (x[i] < 5) then
        c -= 4;
If first condition false, second condition also false
History register, H, records the direction of the last N branches executed by the processor
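
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure: k bits of the Fetch PC (above the two low-order 0 bits) index the BHT; a 2-bit global branch history shift register, which shifts in the Taken/¬Taken result of each branch, selects one of four sets of BHT bits to produce Taken/¬Taken?.]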
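
A sketch of the structure described above: a global history register selects one of four counter tables, each indexed by PC bits. This is a generic two-level model with assumed names and saturating counters; it is not a description of the Pentium Pro's actual circuit.

```python
class TwoLevelPredictor:
    def __init__(self, k, history_bits=2):
        self.k = k
        self.history_bits = history_bits
        self.history = 0                                   # global shift register
        # One set of 2-bit counters per history value (four sets for 2 history bits).
        self.tables = [[1] * (1 << k) for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = self._index(pc)
        table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask   # shift in the outcome


def count_mispredicts(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10           # a branch that alternates every time
misses = count_mispredicts(TwoLevelPredictor(k=4), 0x40, alternating)
```

After a short warm-up the history value alone tells the predictor which way the alternating branch will go, so only the first two taken outcomes are mispredicted. A single 2-bit counter per branch cannot learn this pattern.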
Speculating Both Directions
§ An alternative to branch prediction is to execute both directions of a branch speculatively
- resource requirement is proportional to the number of concurrent speculative executions
- only half the resources engage in useful work when both directions of a branch are executed speculatively
- branch prediction takes fewer resources than speculative execution of both paths
§ With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction!
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
UltraSPARC-III fetch pipeline:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
Remainder of execute pipeline (+ another 6 stages)
[Figure: the correctly predicted taken branch penalty and the jump register penalty are marked against these stages]
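Branch Target Buffer (BTB)
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
[Figure: 2^k-entry direct-mapped BTB (can also be associative); k bits of the PC, which also goes to the I-cache, index an entry holding a valid bit, the entry PC, and the predicted target PC; the entry PC is compared (=) with the PC, and a valid match selects the predicted target]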
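
The BTB lookup rules above can be sketched in a few lines. This is an illustrative sketch, assuming 4-byte instructions and a direct-mapped table; the class name `BranchTargetBuffer`, the default `k=9`, and the policy of invalidating an entry when its branch is later not taken are assumptions, not details from the slide.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry is (entry_pc, target_pc) or None if invalid."""

    def __init__(self, k=9):
        self.mask = (1 << k) - 1
        self.entries = [None] * (1 << k)

    def _index(self, pc):
        return (pc >> 2) & self.mask  # assumes 4-byte instructions

    def next_pc(self, pc):
        """Next fetch PC, decided before the instruction at pc is decoded."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]          # valid entry and matching PC: predicted target
        return pc + 4                # match fails: fall through

    def update(self, pc, taken, target=None):
        """Called when the branch resolves. Only taken branches are kept."""
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None   # assumed policy: drop a branch that was not taken


btb = BranchTargetBuffer(k=9)
miss_before_training = btb.next_pc(0x100)
btb.update(0x100, taken=True, target=0x200)
hit_after_training = btb.next_pc(0x100)
aliasing_pc = 0x100 + (4 << 9)       # same index, different entry PC
```

Storing the full entry PC is what makes BTB entries more expensive than BHT entries: the comparison prevents `aliasing_pc` from being redirected to another branch's target.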
Combining BTB and BHT
§ BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
§ BHT can hold many more entries and is more accurate
[Figure: fetch pipeline stages A (PC Generation/Mux) through E (Integer Execute), with the BTB and the BHT marked at their pipeline stages]
BHT in later pipeline stage corrects when BTB misses a predicted taken branch
BTB/BHT only updated after branch resolves in E stage
Uses of Jump Register (JR)
How well does BTB work for each of these cases?
§ Switch statements (jump to address of matching case)
- BTB works well if same case used repeatedly
§ Dynamic function call (jump to run-time function address)
- BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
§ Subroutine returns (jump to return address)
- BTB works well if usually return to the same place
⇒ Often one function called from many distinct call sites!
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Push call address when function call executed
Pop return address when subroutine return decoded
[Figure: return stack with entries &fb(), &fc(), &fd(); k entries (typically k=8-16)]
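
The push and pop behaviour of a return stack can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `ReturnStack` is hypothetical, the pushed value is taken to be the address of the instruction after the call (4-byte instructions), and on overflow the oldest entry is silently dropped.

```python
from collections import deque


class ReturnStack:
    """Fixed-size stack of predicted return addresses."""

    def __init__(self, k=8):
        # deque with maxlen discards the oldest entry when a push overflows.
        self.stack = deque(maxlen=k)

    def on_call(self, call_pc, inst_bytes=4):
        # Assumption: the return address is the instruction after the call.
        self.stack.append(call_pc + inst_bytes)

    def on_return(self):
        """Predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


# fa calls fb, fb calls fc, fc calls fd, with only k=2 entries available.
rs = ReturnStack(k=2)
for call_pc in (0x100, 0x200, 0x300):
    rs.on_call(call_pc)
predicted = [rs.on_return(), rs.on_return(), rs.on_return()]
```

The two innermost returns are predicted correctly; the outermost one finds the stack empty because its entry was pushed out, which is why the depth k matters for deeply nested calls.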
Return Stack in Pipeline
§ How to use return stack (RS) in deep fetch pipeline?
§ Only know if subroutine call/return at decode
[Figure: fetch pipeline stages A (PC Generation/Mux) through E (Integer Execute), with the RS marked after decode and the point where the return stack prediction is checked]
RS Push/Pop after decode gives large bubble in fetch stream.
Return Stack in Pipeline
§ Can remember whether PC is subroutine call/return using BTB-like structure
§ Instead of target-PC, just store push/pop bit
[Figure: fetch pipeline stages A (PC Generation/Mux) through E (Integer Execute), with the RS marked at the front of the pipeline and the point where the return stack prediction is checked]
Push/Pop before instructions decoded!
In-Order vs. Out-of-Order Branch Prediction
In-Order Issue
§ Speculative fetch but not speculative execution
- branch resolves before later instructions complete
§ Completed values held in bypass network until commit
Out-of-Order Issue
§ Speculative execution, with branches resolved after later instructions complete
§ Completed values held in rename registers in ROB or unified physical register file until commit
• Both styles of machine can use same branch predictors in front-end fetch pipeline, and both can execute multiple instructions per cycle
• Common to have 10-30 pipeline stages in either style of design
[Figure: two pipelines, each with Fetch, Decode, Execute, and Commit stages annotated with Br. Pred. and Resolve. The in-order issue pipeline is labelled In-Order; the out-of-order issue pipeline adds a ROB and is labelled In-Order, Out-of-Order, In-Order]
InO vs. OoO Mispredict Recovery
§ In-order execution?
- Design so no instruction issued after branch can write-back before branch resolves
- Kill all instructions in pipeline behind mispredicted branch
§ Out-of-order execution?
- Multiple instructions following branch in program order can complete before branch resolves
- A simple solution would be to handle like precise traps
- Problem?
Branch Misprediction in Pipeline
§ Can have multiple unresolved branches in ROB
§ Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch
§ MIPS R10K uses four mask bits to tag instructions that are dependent on up to four speculative branches
§ Mask bits cleared as branch resolves, and reused for next branch
[Figure: PC, Fetch, Decode, Reorder Buffer, and Commit stages, with Execute and Complete attached to the reorder buffer; Branch Prediction feeds the PC; Branch Resolution injects the correct PC and sends Kill signals to the earlier stages and the reorder buffer]
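
The mask-bit idea can be sketched in software. This is a rough sketch of the general technique, not the R10K implementation: the names `Inst` and `BranchMaskTracker`, the free-bit list, and the rule that killed younger branches release their bits are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inst:
    name: str
    branch_mask: int                # bit b set: depends on unresolved branch b
    own_bit: Optional[int] = None   # mask bit owned by this branch, if any


class BranchMaskTracker:
    NUM_BITS = 4                    # up to four speculative branches

    def __init__(self):
        self.free_bits = list(range(self.NUM_BITS))
        self.active_mask = 0        # bits of all currently unresolved branches
        self.window = []            # in-flight instructions, program order

    def issue(self, name, is_branch=False):
        inst = Inst(name, self.active_mask)   # tag with older unresolved branches
        if is_branch:
            if not self.free_bits:
                raise RuntimeError("stall: all mask bits are in use")
            inst.own_bit = self.free_bits.pop(0)
            self.active_mask |= 1 << inst.own_bit
        self.window.append(inst)
        return inst

    def _release(self, bit):
        self.active_mask &= ~(1 << bit)
        self.free_bits.append(bit)            # bit can be reused by a later branch

    def resolve(self, branch, mispredicted):
        """Resolve a branch (in any order); return the instructions killed."""
        flag = 1 << branch.own_bit
        killed = []
        if mispredicted:
            killed = [i for i in self.window if i.branch_mask & flag]
            self.window = [i for i in self.window if not (i.branch_mask & flag)]
            for inst in killed:               # killed branches give up their bits
                if inst.own_bit is not None:
                    self._release(inst.own_bit)
        else:
            for inst in self.window:          # no longer speculative on this branch
                inst.branch_mask &= ~flag
        self._release(branch.own_bit)
        branch.own_bit = None
        return killed


tracker = BranchMaskTracker()
tracker.issue("add")
b0 = tracker.issue("beq", is_branch=True)
tracker.issue("ld")
b1 = tracker.issue("bne", is_branch=True)
tracker.issue("mul")
killed_by_b1 = [i.name for i in tracker.resolve(b1, mispredicted=True)]
tracker.resolve(b0, mispredicted=False)
```

Resolving `b1` first shows out-of-order resolution: only `mul` is killed, while `ld` and `bne` survive and lose their dependence on `b0` once it resolves correctly.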
Rename Table Recovery
§ Have to quickly recover rename table on branch mispredicts
§ MIPS R10K has only four snapshots, one for each of four outstanding speculative branches
§ Alpha 21264 has 80 snapshots, one per ROB instruction
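
Snapshot-based recovery can be sketched as follows. This is a simplified sketch, not a model of either processor: the class name `RenameTable`, the ever-growing physical register counter (no free list), and the string branch identifiers are assumptions for illustration.

```python
class RenameTable:
    """Architectural-to-physical register map with per-branch snapshots."""

    def __init__(self, num_arch_regs=32, max_snapshots=4):
        self.table = list(range(num_arch_regs))
        self.next_phys = num_arch_regs   # assumption: no free list, just count up
        self.max_snapshots = max_snapshots
        self.snapshots = []              # (branch_id, saved table), oldest first

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for a destination register."""
        phys = self.next_phys
        self.next_phys += 1
        self.table[arch_reg] = phys
        return phys

    def checkpoint(self, branch_id):
        """Save the whole map when a speculative branch is renamed."""
        if len(self.snapshots) >= self.max_snapshots:
            raise RuntimeError("stall: no free snapshot")
        self.snapshots.append((branch_id, list(self.table)))

    def resolve(self, branch_id, mispredicted):
        pos = next(i for i, (b, _) in enumerate(self.snapshots) if b == branch_id)
        if mispredicted:
            # Restore in one step and discard snapshots of younger branches.
            self.table = self.snapshots[pos][1]
            del self.snapshots[pos:]
        else:
            del self.snapshots[pos]


rt = RenameTable(num_arch_regs=4)
rt.rename_dest(1)        # x1 -> p4, before the branch
rt.checkpoint("b0")
rt.rename_dest(1)        # x1 -> p5, speculative
rt.checkpoint("b1")
rt.rename_dest(2)        # x2 -> p6, speculative
rt.resolve("b0", mispredicted=True)
```

After the mispredict the map is back to its state at `b0`, which is the quick recovery the slide asks for; the cost is one full copy of the table per outstanding branch.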
Improving Instruction Fetch
§ Performance of speculative out-of-order machines often limited by instruction fetch bandwidth
- speculative execution can fetch 2-3x more instructions than are committed
- mispredict penalties dominated by time to refill instruction window
- taken branches are particularly troublesome
Increasing Taken Branch Bandwidth (Alpha 21264 I-Cache)
§ Fold 2-way tags and BTB into predicted next block
§ Take tag checks, inst. decode, branch predict out of loop
§ Raw RAM speed on critical loop (1 cycle at ~1 GHz)
§ 2-bit hysteresis counter per block prevents overtraining
[Figure: PC Generation supplies the PC to an I-cache holding cached instructions with Line Predict and Way Predict fields; the fast fetch path returns 4 insts; the tags of Way 0 and Way 1 are compared (=?) to give Hit/Miss/Way, alongside branch prediction, instruction decode, and validity checks]
Tournament Branch Predictor (Alpha 21264)
§ Choice predictor learns whether best to use local or global branch history in predicting next branch
§ Global history is speculatively updated but restored on mispredict
§ Claim 90-100% success on range of applications
[Figure: the PC indexes the local history table (1,024 x 10b), which feeds the local prediction table (1,024 x 3b); the global history (12b) feeds the global prediction table (4,096 x 2b) and the choice prediction table (4,096 x 2b); together they produce the prediction]
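
A tournament predictor with the table sizes listed on this slide can be sketched as below. This is a hedged sketch, not the 21264 logic: the class name, the counter thresholds and initial values, the rule of training the choice table only when the two components disagree, and updating the global history at resolve time (rather than speculatively) are all assumptions.

```python
def _bump(counter, up, cmax):
    """Saturating increment or decrement."""
    return min(cmax, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024     # 1,024 x 10b
        self.local_counters = [3] * 1024    # 1,024 x 3b, taken if >= 4
        self.global_counters = [1] * 4096   # 4,096 x 2b, taken if >= 2
        self.choice = [1] * 4096            # 4,096 x 2b, use global if >= 2
        self.global_history = 0             # 12b

    def _components(self, pc):
        li = (pc >> 2) & 1023               # assumes 4-byte instructions
        lh = self.local_history[li]
        local_pred = self.local_counters[lh] >= 4
        global_pred = self.global_counters[self.global_history] >= 2
        return li, lh, local_pred, global_pred

    def predict(self, pc):
        _, _, local_pred, global_pred = self._components(pc)
        use_global = self.choice[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        """Train all tables; return True if the prediction was correct."""
        correct = (self.predict(pc) == taken)
        li, lh, local_pred, global_pred = self._components(pc)
        gh = self.global_history
        if local_pred != global_pred:
            # Assumed rule: move the choice toward whichever component was right.
            self.choice[gh] = _bump(self.choice[gh], global_pred == taken, 3)
        self.local_counters[lh] = _bump(self.local_counters[lh], taken, 7)
        self.global_counters[gh] = _bump(self.global_counters[gh], taken, 3)
        self.local_history[li] = ((lh << 1) | int(taken)) & 1023
        self.global_history = ((gh << 1) | int(taken)) & 4095
        return correct


# One branch with a repeating taken, taken, not-taken pattern.
tp = TournamentPredictor()
outcomes = [tp.update(0x400, t) for t in [True, True, False] * 60]
```

Once the histories fill with the repeating pattern, each phase of the pattern maps to its own counter, so the late entries of `outcomes` are all correct.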
Taken Branch Limit
§ Integer codes have a taken branch every 6-9 instructions
§ To avoid fetch bottleneck, must execute multiple taken branches per cycle when increasing performance
§ This implies:
- predicting multiple branches per cycle
- fetching multiple non-contiguous blocks per cycle
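Branch Address Cache (Yeh, Marr, Patt)
[Figure: k bits of the PC index an entry holding a valid bit, the entry PC, predicted target #1 with len #1, and predicted target #2; the entry PC is compared (=) with the PC, and a valid match returns the predicted targets]
Extend BTB to return multiple branch predictions per cycle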
Fetching Multiple Basic Blocks
§ Requires either
- multiported cache: expensive
- interleaving: bank conflicts will occur
§ Merging multiple blocks to feed to decoders adds latency, increasing mispredict penalty and reducing branch throughput
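Trace Cache
§ Key Idea: Pack multiple non-contiguous basic blocks into one contiguous trace cache line
• Single fetch brings in multiple basic blocks
• Trace cache indexed by start address and next n branch predictions
• Used in Intel Pentium-4 processor to hold decoded uops
[Figure: basic blocks ending in BR at non-contiguous addresses, and the same blocks packed as BR BR BR in one trace cache line]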
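
The indexing rule (start address plus the next n branch predictions) can be sketched with a dictionary. This is a toy sketch under stated assumptions: the class name `TraceCache`, the unbounded dictionary (no capacity or replacement policy), and the string instruction names are hypothetical.

```python
class TraceCache:
    """Trace lines keyed by (start PC, next n branch directions)."""

    def __init__(self, n=2):
        self.n = n
        self.lines = {}

    def _key(self, start_pc, directions):
        return (start_pc, tuple(directions[: self.n]))

    def fill(self, start_pc, outcomes, instructions):
        """Record the instructions that were executed along one path."""
        self.lines[self._key(start_pc, outcomes)] = list(instructions)

    def fetch(self, start_pc, predictions):
        """One lookup returns several basic blocks, or None on a miss."""
        return self.lines.get(self._key(start_pc, predictions))


tc = TraceCache(n=2)
# Path seen at run time: first branch taken, second branch not taken.
tc.fill(0x100, [True, False], ["add", "beq", "ld", "bne", "mul"])
hit = tc.fetch(0x100, [True, False])
miss = tc.fetch(0x100, [False, False])
```

The same start address can hold different traces for different predicted paths, which is why the predictions are part of the key.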
Load-Store Queue Design
§ After control hazards, data hazards through memory are probably next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be speculatively issued
Speculative Store Buffer
§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, store buffer slot allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes tag
§ “Store data” execution writes data
§ Store commits when oldest instruction and both address and data available:
- clear speculative bit and eventually move data to cache
§ On store abort:
- clear valid bit
[Figure: speculative store buffer whose entries each hold Tag, Data, S (speculative), and V (valid) fields; Store Address writes the tags and Store Data writes the data; the store commit path leads to the tags and data of the L1 data cache]
Load bypass from speculative store buffer
§ If data in both store buffer and cache, which should we use?
- Speculative store buffer
§ If same address in store buffer twice, which should we use?
- Youngest store older than load
[Figure: the load address is checked against the tags of the speculative store buffer (entries with Tag, Data, S, V) and the tags of the L1 data cache; load data is returned from one of the two]
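
The store buffer rules and the bypass policy can be sketched together. This is a simplified sketch, not a hardware description: the class name, the program-order sequence numbers (`seq`), the dictionary standing in for the L1 data cache, and draining to the cache immediately at commit are assumptions. Stores whose address is still unknown are ignored by `load` here, which a real design cannot do without further checks.

```python
class SpeculativeStoreBuffer:
    def __init__(self, cache):
        self.cache = cache   # dict addr -> data, standing in for the L1 data cache
        self.slots = []      # allocated in program order, oldest first

    def allocate(self, seq):
        """At decode: reserve a slot; S = speculative, V = valid."""
        self.slots.append({"seq": seq, "tag": None, "data": None, "S": True, "V": True})

    def _slot(self, seq):
        return next(s for s in self.slots if s["seq"] == seq)

    def store_address(self, seq, addr):
        self._slot(seq)["tag"] = addr

    def store_data(self, seq, data):
        self._slot(seq)["data"] = data

    def commit(self, seq):
        slot = self._slot(seq)
        if slot["tag"] is None or slot["data"] is None:
            raise RuntimeError("store not ready to commit")
        slot["S"] = False
        self.cache[slot["tag"]] = slot["data"]   # assumed: drain immediately
        self.slots.remove(slot)

    def abort(self, seq):
        self._slot(seq)["V"] = False

    def load(self, load_seq, addr):
        # Youngest valid store that is older than the load wins over the cache.
        for slot in reversed(self.slots):
            if slot["V"] and slot["seq"] < load_seq and slot["tag"] == addr:
                if slot["data"] is None:
                    raise RuntimeError("load must wait for store data")
                return slot["data"]
        return self.cache.get(addr)


cache = {0x10: 1}
sb = SpeculativeStoreBuffer(cache)
for seq, value in ((1, 5), (3, 7)):      # two stores to the same address
    sb.allocate(seq)
    sb.store_address(seq, 0x10)
    sb.store_data(seq, value)
load_between = sb.load(2, 0x10)          # sees store 1 only
load_after = sb.load(4, 0x10)            # sees store 3, the youngest older store
cache_before_commit = cache[0x10]
```

Until a store commits, the cache keeps its old value, so an aborted store never becomes visible to memory.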
Memory Dependencies
sd x1, (x2)
ld x3, (x4)
§ When can we execute the load?
In-Order Memory Queue
§ Execute all loads and stores in program order
§ => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out-of-order with respect to other instructions
§ Need a structure to handle memory ordering…
Conservative O-o-O Load Execution
sd x1, (x2)
ld x3, (x4)
§ Can execute load before store, if addresses known and x4 != x2
§ Each load address compared with addresses of all previous uncommitted stores
- can use partial conservative check, i.e., bottom 12 bits of address, to save hardware
§ Don’t execute load if any previous store address not known
§ (MIPS R10K, 16-entry address queue)
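
The conservative issue check can be written as a small predicate. This is a minimal sketch: the function name `can_issue_load`, the use of `None` for a store address that is not yet known, and the `check_bits` parameter are assumptions made for illustration.

```python
def can_issue_load(load_addr, older_store_addrs, check_bits=12):
    """Return True if the load may execute ahead of all older uncommitted stores.

    older_store_addrs: addresses of older uncommitted stores, with None for a
    store whose address has not been computed yet.
    """
    mask = (1 << check_bits) - 1
    for store_addr in older_store_addrs:
        if store_addr is None:
            return False   # unknown store address: do not execute the load
        if (store_addr & mask) == (load_addr & mask):
            return False   # possible match on the low bits: wait
    return True


independent = can_issue_load(0x2000, [0x1008])
unknown_store = can_issue_load(0x2000, [0x1008, None])
false_conflict = can_issue_load(0x2000, [0x5000])            # low 12 bits equal
full_compare = can_issue_load(0x2000, [0x5000], check_bits=64)
```

The partial comparison can only err on the safe side: `false_conflict` delays a load that a full comparison (`full_compare`) would have let through, in exchange for narrower comparators.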
Address Speculation
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2
§ Execute load before store address known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If subsequently find x4==x2, squash load and all following instructions
§ => Large penalty for inaccurate address speculation
Memory Dependence Prediction (Alpha 21264)
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2 and execute load before store
§ If later find x4==x2, squash load and all following instructions, but mark load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear store-wait bits
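
The store-wait mechanism can be sketched as a small table of bits. This is an illustrative sketch, not the 21264 design: the class name `StoreWaitTable`, the table size, the PC-based indexing, and the clearing interval are all assumed values.

```python
class StoreWaitTable:
    """One store-wait bit per table entry, indexed by load PC."""

    def __init__(self, entries=1024, clear_interval=16384):
        # Both sizes are assumptions chosen for the example.
        self.entries = entries
        self.clear_interval = clear_interval
        self.bits = [False] * entries
        self.cycles = 0

    def _index(self, load_pc):
        return (load_pc >> 2) % self.entries   # assumes 4-byte instructions

    def may_speculate(self, load_pc):
        """True if the load may execute before older stores are known."""
        return not self.bits[self._index(load_pc)]

    def report_violation(self, load_pc):
        """Called when a speculated load turned out to alias an older store."""
        self.bits[self._index(load_pc)] = True

    def tick(self, cycles=1):
        """Periodically clear all bits so stale predictions do not persist."""
        self.cycles += cycles
        if self.cycles >= self.clear_interval:
            self.bits = [False] * self.entries
            self.cycles = 0


swt = StoreWaitTable(entries=1024, clear_interval=100)
first_time = swt.may_speculate(0x4000)
swt.report_violation(0x4000)          # x4 == x2 was discovered after the fact
after_violation = swt.may_speculate(0x4000)
swt.tick(100)
after_clear = swt.may_speculate(0x4000)
```

The periodic clear lets a load speculate again if its dependence on the store was only temporary, at the risk of paying the squash penalty once more.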
Part III: More Advanced Out-of-Order Superscalar Designs
Acknowledgements
§ These course notes were developed by:
- Krste Asanovic (UCB)
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)