A Memory Model for RISC-V
Sizhuo Zhang, Muralidaran Vijayaraghavan, Arvind
RISC-V Workshop, November 29, 2016
Why not SC/TSO?
• They both have simple specifications, both axiomatically and operationally
• But simple implementations have low performance
  - Strict ordering requirements for memory instructions
  - To improve performance, one must monitor coherence invalidation traffic to potentially squash executed loads
Why not POWER/ARM?
• Their operational models expose too many microarchitectural details
  - Branch speculation, OOO execution, rollback, etc. are exposed in the memory model specification!
• Their axiomatic models are too complex, with no well-understood relation to microarchitecture
  - One cannot say with confidence whether a particular microarchitectural implementation obeys the model
Why not RMO?
Thread 1        Thread 2
St a = 1        Ld r1 = b
MEMBAR          Branch r1 != 1 goto exit
St b = 1        St c = 1
                Ld r2 = c
                r3 = a + r2 - 1
                Ld r4 = [r3]
                exit:
RMO's dependency requirements are too strict
Initially everything's 0
Properties for a new memory model
• Simple specification without microarchitectural details like branch speculation, OOO execution, rollback, etc.
• But establish correspondence to microarchitecture implementations
• Weaker than SC/TSO for high-performance, simple implementations
• Inclusion of sufficient fences to force SC-like behavior when necessary
Our proposal for RISC-V memory model: WMM
Simple operational specification like SC, TSO, PSO
SC:
• Stores update memory instantly
• Loads read memory instantly
[Figure: processors, each with instantaneous in-order execution, connected to an instantaneous memory]
TSO:
• Stores are dequeued in order
• When a store is dequeued from the store buffer, it updates memory instantly
• A load reads the youngest store for the same address from the store buffer or, if not present, memory instantly
[Figure: each processor (instantaneous in-order execution) has a store buffer between it and the instantaneous memory]
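The TSO rules can be sketched as a small executable model. This is only an illustration under stated assumptions: the class name `TSOMachine` and its methods are hypothetical, memory locations default to 0, and the scheduling of dequeue steps is chosen by hand.

```python
from collections import deque


class TSOMachine:
    """Toy TSO abstract machine (hypothetical names, not from the slides)."""

    def __init__(self, num_procs):
        self.mem = {}                                  # address -> value, default 0
        self.sb = [deque() for _ in range(num_procs)]  # oldest entry on the left

    def store(self, p, addr, val):
        # A store is only buffered; other processors cannot see it yet.
        self.sb[p].append((addr, val))

    def load(self, p, addr):
        # Youngest matching store in the local store buffer wins.
        for a, v in reversed(self.sb[p]):
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def dequeue(self, p):
        # Stores leave the buffer strictly in order and update memory instantly.
        addr, val = self.sb[p].popleft()
        self.mem[addr] = val


def store_buffering():
    """Both processors store, then load the other address before any dequeue."""
    m = TSOMachine(2)
    m.store(0, "a", 1)
    m.store(1, "b", 1)
    r1 = m.load(0, "b")
    r2 = m.load(1, "a")
    m.dequeue(0)
    m.dequeue(1)
    return r1, r2, m.mem
```

In this schedule both loads miss in their own store buffer and read the initial memory value, which is the store-to-load reordering that TSO permits.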
PSO:
• Stores are dequeued in order only for the same address
• When a store is dequeued from the store buffer, it updates memory instantly
• A load reads the youngest store for the same address from the store buffer or, if not present, memory instantly
[Figure: each processor (instantaneous in-order execution) has a store buffer between it and the instantaneous memory]
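The only change from TSO is the dequeue rule. A minimal sketch of a single PSO store buffer follows; the class name `PSOStoreBuffer` and its methods are hypothetical, and memory locations are assumed to default to 0.

```python
from collections import deque


class PSOStoreBuffer:
    """Toy store buffer of one PSO processor (hypothetical names)."""

    def __init__(self, mem):
        self.mem = mem            # shared dict: address -> value
        self.entries = deque()    # (addr, val), oldest entry on the left

    def store(self, addr, val):
        self.entries.append((addr, val))

    def load(self, addr):
        # Youngest buffered store for this address, else memory.
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def dequeue(self, addr):
        # Any address may be chosen, but within one address the oldest goes first.
        entry = next(e for e in self.entries if e[0] == addr)
        self.entries.remove(entry)   # removes the first, i.e. oldest, occurrence
        self.mem[addr] = entry[1]


mem = {}
sb = PSOStoreBuffer(mem)
sb.store("a", 1)
sb.store("b", 1)
sb.store("a", 2)
sb.dequeue("b")   # the store to b reaches memory before the older store to a
```

After the last line, memory holds only the store to b, while the two stores to a are still buffered and will drain in program order relative to each other.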
WMM:
• Stores are dequeued in order only for the same address
• When a store is dequeued from the store buffer, it updates memory instantly, its address is removed from the processor's own invalidation buffer, and the overwritten (stale) value enters every other invalidation buffer instantly
• A load reads the youngest store for the same address from the store buffer or, if not present, the oldest entry in the invalidation buffer or, if not present, memory instantly
• The oldest invalidation buffer entry can be thrown out at any time
[Figure: each processor (instantaneous in-order execution) has a store buffer and an invalidation buffer between it and the instantaneous memory]
Fences in WMM
• Acquire/Reconcile fence: clears the invalidation buffer
• Release/Commit fence: waits for the store buffer to be flushed (non-atomically)
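The WMM rules and the two fences can be put together in a small executable sketch. Everything named here (`WMM`, `sb_test`, the method names) is hypothetical. The sketch assumes that the value entering the other invalidation buffers is the one just overwritten in memory, that locations start at 0, and it explores only one hand-picked interleaving of the store-buffering test (St a; Commit; Ld b on one processor, St b; Commit; Ld a on the other).

```python
from collections import deque


class WMM:
    """Toy WMM abstract machine: store buffers (sb) and invalidation buffers (ib)."""

    def __init__(self, num_procs):
        self.mem = {}
        self.sb = [deque() for _ in range(num_procs)]  # (addr, val), oldest left
        self.ib = [deque() for _ in range(num_procs)]  # stale (addr, val), oldest left

    def store(self, p, addr, val):
        self.sb[p].append((addr, val))

    def dequeue(self, p, addr):
        # Oldest buffered store for addr goes to memory.
        entry = next(e for e in self.sb[p] if e[0] == addr)
        self.sb[p].remove(entry)
        stale = (addr, self.mem.get(addr, 0))          # assumed: overwritten value
        self.mem[addr] = entry[1]
        # The address leaves the own ib; the stale value enters every other ib.
        self.ib[p] = deque(e for e in self.ib[p] if e[0] != addr)
        for q in range(len(self.ib)):
            if q != p:
                self.ib[q].append(stale)

    def load(self, p, addr):
        for a, v in reversed(self.sb[p]):              # youngest store in sb
            if a == addr:
                return v
        for a, v in self.ib[p]:                        # oldest stale entry in ib
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def drop_oldest_stale(self, p):
        if self.ib[p]:
            self.ib[p].popleft()                       # may happen at any time

    def reconcile(self, p):
        self.ib[p].clear()                             # Acquire/Reconcile fence

    def commit(self, p):
        while self.sb[p]:                              # Release/Commit fence
            self.dequeue(p, self.sb[p][0][0])


def sb_test(with_reconcile):
    m = WMM(2)
    m.store(0, "a", 1); m.commit(0)    # P1: St a 1; Commit
    m.store(1, "b", 1); m.commit(1)    # P2: St b 1; Commit
    if with_reconcile:
        m.reconcile(0); m.reconcile(1)
    r1 = m.load(0, "b")                # P1: r1 = Ld b
    r2 = m.load(1, "a")                # P2: r2 = Ld a
    return r1, r2
```

In this interleaving `sb_test(False)` yields `(0, 0)` because each load hits a stale entry in its invalidation buffer, while `sb_test(True)` yields `(1, 1)` because the Reconcile fences clear those entries first.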
Axiomatic Definition of WMM
• Memory order reordering axiom:
Can reorder? (rows: first instruction; columns: second instruction)
First \ Second    Ld b      St b v'   Acq/Reconcile   Rel/Commit
Ld a              a != b    No        No              No
St a v            Yes       a != b    Yes             No
Acq/Reconcile     No        No        No              No
Rel/Commit        Yes       No        No              No
• A load reads the younger (in memory order) of
  - the latest store in memory order for that address, OR
  - the latest store in program order (in that thread) for that address
• St-St fence: Commit
• Ld-Ld fence: Reconcile
• St-Ld fence: Commit + Reconcile
• Ld-St fence: not needed
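The reordering table can be encoded directly and used to check the fence recipes. This is a sketch with hypothetical names (`CAN_REORDER`, `can_reorder`, `stays_ordered`); the chain check only asks whether any adjacent pair may swap, which is a simplification of the axiomatic definition.

```python
YES, NO, DIFF = "yes", "no", "a != b"

# CAN_REORDER[first][second], transcribed from the "Can reorder?" table.
# "Acq" stands for Acquire/Reconcile, "Rel" for Release/Commit.
CAN_REORDER = {
    "Ld":  {"Ld": DIFF, "St": NO,   "Acq": NO,  "Rel": NO},
    "St":  {"Ld": YES,  "St": DIFF, "Acq": YES, "Rel": NO},
    "Acq": {"Ld": NO,   "St": NO,   "Acq": NO,  "Rel": NO},
    "Rel": {"Ld": YES,  "St": NO,   "Acq": NO,  "Rel": NO},
}


def can_reorder(first, second, same_address=False):
    """True if `second` may be ordered before `first` in memory order."""
    rule = CAN_REORDER[first][second]
    if rule == DIFF:
        return not same_address
    return rule == YES


def stays_ordered(sequence):
    """True if no adjacent pair in the program-order sequence may swap."""
    return all(not can_reorder(x, y) for x, y in zip(sequence, sequence[1:]))
```

For example, `stays_ordered(["St", "Rel", "Ld"])` is false because a load may overtake a Commit, whereas `stays_ordered(["St", "Rel", "Acq", "Ld"])` is true, matching the St-Ld fence being Commit + Reconcile.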
Implementing WMM
• An executed load won't get squashed later as long as it doesn't overtake a Reconcile or a memory instruction to the same address
  - No monitoring of coherence invalidations
  - Load address speculation allowed; squashed only if the predicted address is wrong
• All instructions are committed in order
  - Stores cannot overtake loads
  - Prevents "out-of-thin-air" generation of values
Formally proven: OOO + single-threaded correctness + in-order commit + value prediction + global store atomicity = WMM
Implementing WMM
• A write-back coherent cache hierarchy typically satisfies global store atomicity
• If L1 is write-through, it is easy to ensure global store atomicity unless the core is SMT
  - SMT cores with L1 write-through caches implement a "non-multicopy-atomic" memory. Don't do it
“Theoretically, the definition of the aq and rl bits allows for implementations without global store atomicity. When both aq and rl bits are set, however, we require full sequential consistency for the atomic operation which implies global store atomicity in addition to both acquire and release semantics. In practice, hardware systems are usually implemented with global store atomicity, embodied in local processor ordering rules together with single-writer cache coherence protocols.”
Mapping C++11 to WMM
C++11               WMM
Non-atomic Load     Load
Load Relaxed        Load
Load Consume        Load; Acquire/Reconcile
Load Acquire        Load; Acquire/Reconcile
Load SC             Rel/Commit; Acq/Reconcile; Load; Acq/Reconcile
Non-atomic Store    Store
Store Relaxed       Store
Store Release       Release/Commit; Store
Store SC            Release/Commit; Store
Using the operational specification of WMM makes it straightforward to derive/verify this mapping
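A compiler back end could hold this mapping as a lookup table. The sketch below is only a transcription of the table; the names `CPP11_TO_WMM` and `lower`, and the string spellings of the memory orders, are assumptions made for illustration.

```python
# (operation, C++11 memory order) -> WMM instruction sequence, from the table.
# "Commit" is the Release/Commit fence, "Reconcile" the Acquire/Reconcile fence.
CPP11_TO_WMM = {
    ("load", "non-atomic"):  ["Load"],
    ("load", "relaxed"):     ["Load"],
    ("load", "consume"):     ["Load", "Reconcile"],
    ("load", "acquire"):     ["Load", "Reconcile"],
    ("load", "seq_cst"):     ["Commit", "Reconcile", "Load", "Reconcile"],
    ("store", "non-atomic"): ["Store"],
    ("store", "relaxed"):    ["Store"],
    ("store", "release"):    ["Commit", "Store"],
    ("store", "seq_cst"):    ["Commit", "Store"],
}


def lower(operation, order):
    """Return a fresh copy of the WMM sequence for one C++11 access."""
    return list(CPP11_TO_WMM[(operation, order)])
```

For instance, `lower("store", "release")` gives `["Commit", "Store"]`, so the fence precedes the store, while acquire loads place the fence after the load.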
Conclusion
• WMM is a memory model with a simple specification and potentially high-performance implementations
  - Blends well with the RISC-V philosophy and should be used as the memory model for RISC-V
Thank you!
Advertisement: Formally verified RISC-V (subset of RV32I) multicore implementation in Kami, a hardware formal verification platform
Backup
Why not RMO?
Thread 1        Thread 2
St a = 1        Ld r1 = b
MEMBAR          Branch r1 != 1 goto exit
St b = 1        St c = 1
                Ld r2 = c
                r3 = a + r2 - 1
                Ld r4 = [r3]
                exit:
RMO's dependency requirements are too strict
Initially everything's 0
[Figure annotations: (1), (a), (1), (1), (1)]
Why not Release Consistency?
Fences are not strong enough to give Sequential Consistency
Thread 1        Thread 2          Thread 3
St val = 1      Ld r1 = val       Ld r2 = flag
                St flag = r1      Ld r3 = val
Non-cumulative fences
Initially, everything is 0
[Figure annotations: values (1), (1), (1), (0); fence labels Acquire, Release]
Out-of-thin-air issue
• No processor can produce values out of thin air
  - But an incomplete set of axioms seemingly allows this
• Insisting on in-order commits and advertising stores to other threads/processors only after commit takes care of this issue
Thread 1        Thread 2
Ld r1 = x       Ld r2 = y
St y = r1       St x = 42
Initially everything is 0
Finally x = y = r1 = r2 = 42
“The AMOs were designed to implement the C11 and C++11 memory models efficiently. Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set.”
Litmus Tests for WMM
Test SB
P1                P2
I1: St a 1        I4: St b 1
I2: Commit        I5: Commit
I3: r1 = Ld b     I6: r2 = Ld a
WMM allows: r1 = 0, r2 = 0
[Figure: monolithic memory; P1 and P2, each with reg state, store buffer and inv buffer; entries <a,1>, <a,0>, <b,1>, <b,0>; a Reconcile marked in each processor]
• WMM allows the behavior
  - Ld overtakes St and Commit
• Add Reconcile to forbid this
Litmus Tests for WMM
Test MP+data
P1                P2
I1: St a 1        I4: r1 = Ld b
I2: Commit        I5: r2 = Ld r1
I3: St b a
WMM allows: r1 = a, r2 = 0
[Figure: monolithic memory; P1 and P2, each with reg state, store buffer and inv buffer; entries <a,1>, <a,0>, <b,0>, <b,a>; a Reconcile marked]
• WMM allows the behavior
  - Ld overtakes Ld
  - No dependency ordering
  - Can be caused by value prediction in hardware
• Add Reconcile to forbid this
• Out-of-thin-air is impossible because of I2E
WMM-S
• The same abstract machine structure as WMM
• Model non-multi-copy-atomic stores
  - Make a store from processor i visible to processor j before the store updates monolithic memory
  - Make a copy of the store from the sb of processor i, and insert the copy into the sb of processor j
  - Each store has a unique tag; copies have the same tag
• Dequeue a store from sb to monolithic memory
  - All copies are dequeued from sb
  - All copies have to be the oldest one for the store address in their respective sb
• Copying of a store must be constrained for per-location SC
  - Each sb orders stores for a certain address as a list
  - Combining all such lists from all sb together forms a partial coherence order (<co) of the store tags for that address
  - After copying, the partial coherence order must still be acyclic
[Figure: abstract machine with monolithic memory m and processors ps[i], each with reg state s, st buffer sb and inv buffer ib]
Store copy example
• Current partial coherence order
  - t1 <co t2 <co t3 and t4 <co t2
  - t1 and t4 are unrelated
• If we copy C into sb of P1 as C'
  - Create cycle: t3 <co t4 <co t2 <co t3
  - Should not be allowed
• If we copy A into sb of P2
  - Create cycle: t3 <co t3
[Figure: store buffers P1 sb, P2 sb and P3 sb, each ordered from inserted earlier (older) to inserted later (younger), with entries A: t3, A': t3, B: t2, B': t2, C: t4, C': t4, D: t1. Primes are copies.]
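The acyclicity condition on the partial coherence order can be checked with a depth-first search. The sketch below writes the four store tags as t1 to t4, which is an assumed reading of the subscripts in this example; the function names `has_cycle` and `copy_allowed` are hypothetical, and the new edges a copy would add are supplied by hand instead of being derived from store-buffer contents.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (older, younger) tag pairs."""
    graph = {}
    for older, younger in edges:
        graph.setdefault(older, set()).add(younger)
        graph.setdefault(younger, set())
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True               # back edge: the order is cyclic
        visiting.add(node)
        found = any(dfs(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(dfs(node) for node in graph)


def copy_allowed(current_edges, new_edges):
    """A copy is allowed only if the coherence order stays acyclic."""
    return not has_cycle(list(current_edges) + list(new_edges))


# Current partial coherence order from the example.
current = [("t1", "t2"), ("t2", "t3"), ("t4", "t2")]
copy_c_into_p1 = [("t3", "t4")]   # edge the example says this copy would add
copy_a_into_p2 = [("t3", "t3")]   # self-loop the example says this copy would add
```

With these inputs `copy_allowed(current, copy_c_into_p1)` and `copy_allowed(current, copy_a_into_p2)` are both false, while an edge between the unrelated tags, such as `("t1", "t4")`, keeps the order acyclic.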
Litmus Tests for WMM-S
Test WRC
P1              P2                P3
I1: St a 1      I2: r1 = Ld a     I4: r2 = Ld b
                I3: St b r1       I5: Reconcile
                                  I6: r3 = Ld a
WMM-S allows: r1 = 1, r2 = 1, r3 = 0
[Figure: store buffers P1 sb, P2 sb, P3 sb and monolithic memory m, with entries <a,1>, <a,1>, <b,1>; a Commit marked]
• Add Commit in P2 to forbid this behavior
• Commit globally advertises observed stores (release)
• Reconcile prevents loads from reading stale values (acquire)
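Litmus Tests for WMM-S
Test IRIW
P1              P2                P3              P4
I1: St a 1      I2: r1 = Ld a     I5: St b 1      I6: r3 = Ld b
                I3: Reconcile                     I7: Reconcile
                I4: r2 = Ld b                     I8: r4 = Ld a
WMM-S allows: r1 = 1, r2 = 0, r3 = 1, r4 = 1
[Figure: store buffers P1 sb, P2 sb, P3 sb, P4 sb and monolithic memory m, with entries <a,1>, <a,1>, <b,1>, <b,1>; two Commits marked]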
WMM-S Implementation
• WMM-S can be implemented using OOO + non-atomic memory system
  - e.g. memory system of the ARM Flowing Model (FM) [1]
  - We do not need a store buffer in OOO, because FM has buffers
[Figure: OOO processors P1 to P4, each with a ROB, above segments s[1] to s[4], then segments s[5] and s[6], then monolithic memory m; the segments and memory are labeled FM]
[1] Flur et al. "Modelling the ARMv8 architecture, operationally: concurrency and ISA", POPL 2016
FM+OOO
• Each segment is a buffer of memory requests
  - Keeps FIFO ordering of requests to the same address
  - Flow rule: the oldest request for some address in a segment can be moved to the parent segment or monolithic memory
  - Bypass rule: a store can forward its data to a load, as long as there is no other request to the same address in between
• OOO commit
  - Store: directly insert into segment
  - Commit fence: if any segment contains a store observed by the commits of the OOO processor, then we cannot commit the fence
[Figure: P1 to P4, each with a ROB, above segments s[1] to s[4], then segments s[5] and s[6], then monolithic memory m]
A store observed by commits of Pi: either committed by Pi or returned by a load committed by Pi
Simplified version of FM (no fence in FM)
CCM+OOO ⊆ WMM, FM+OOO ⊆ WMM-S
• How WMM/WMM-S simulates CCM/FM+OOO
  - When the monolithic memory in CCM/FM is updated by a store
    - WMM/WMM-S dequeues that store from sb to monolithic memory
  - When OOO commits an instruction
    - WMM/WMM-S executes that instruction
• When OOO Pi commits a load L for address a with result v
  - Consider where v is in CCM/FM+OOO when L commits
  - v is in monolithic memory of CCM
    - WMM executes L by reading monolithic memory
  - v has been overwritten by another store in monolithic memory
    - WMM has previously inserted <a,v> into ib of ps[i]
    - Now WMM can execute L by reading ib
  - v is in store buffer of OOO Pi
    - If v has been observed by commits of Pi before L is committed, then WMM/WMM-S can execute L by reading local sb
    - Otherwise, WMM-S copies <a,v> into the local sb and lets L read it
Impact of Disallowing Ld-St Reordering
• Qualitative analysis
  - Store buffer can already hide the store miss latency
  - Stores are not on the critical path for single-thread performance
  - In extreme cases, the speculative store queue may be filled up with uncommitted stores
• Quantitative evaluation
  - Simulate 8-core multiprocessor using ESESC simulator
  - Run SPLASH2x benchmarks
  - Compare WMM, Alpha, and aggressive implementations of SC and TSO
  - Alpha = WMM + Ld-St reordering
    - Try to find younger stores to commit when the instruction at the commit slot of the ROB cannot commit
Simulation Configuration
Results
[Figure: Normalized execution time and its breakdown at the commit slot of ROB]
[Figure: Normalized execution time and its breakdown at the issue port to ROB]
[Figure: Average cycles to commit stores early in Alpha]
Non-Atomic Memory
• Models for non-atomic memory are more complicated
• We are unclear about the performance advantage of non-atomic memory
  - Because our understanding of the microarchitectural sources of non-atomic memory is limited
  - POWER: shared write-through L1 due to SMT
    - Other sources in the hierarchy starting from L2?
  - ARM: no clue
    - Many litmus tests for non-atomic stores are not observable on hardware
    - WRC+addrs, WWC+addrs, IRIW+addrs (http://diy.inria.fr/cats/model-arm/all.html)
    - WRC+addrs (http://www.cl.cam.ac.uk/~sf502/popl16/observations.pdf)
• Only by understanding the microarchitectural reasons for non-atomic memory are we able to analyze its benefit