
A Memory Model for RISC-V

Sizhuo Zhang, Muralidaran Vijayaraghavan, Arvind

RISC-V Workshop, November 29, 2016

Why not SC/TSO?

• They both have simple specifications, both axiomatically and operationally
• But simple implementations have low performance
  - Strict ordering requirements for memory instructions
  - To improve performance, one must monitor coherence invalidation traffic to potentially squash executed loads

Why not POWER/ARM?

• Their operational models expose too many microarchitectural details
  - Branch speculation, OOO execution, rollback, etc. are exposed in the memory model specification!
• Their axiomatic models are too complex, with no well-understood relation to microarchitecture
  - One cannot say with confidence whether a particular microarchitectural implementation obeys the model

Why not RMO?

Initially everything is 0

Thread 1        Thread 2
St a = 1        Ld r1 = b
MEMBAR          Branch r1 != 1 goto exit
St b = 1        St c = 1
                Ld r2 = c
                r3 = a + r2 - 1
                Ld r4 = [r3]
                exit:

RMO's dependency requirements are too strict

[Slide builds of the example above annotate it step by step with observed values; the final build shows (1), (a), (1), (1), (1)]

Properties for a new memory model

• Simple specification without microarchitectural details like branch speculation, OOO execution, rollback, etc.
• But establish correspondence to microarchitecture implementations
• Weaker than SC/TSO, for high-performance, simple implementations
• Inclusion of sufficient fences to force SC-like behavior when necessary

Our proposal for RISC-V memory model: WMM

Simple operational specification like SC, TSO, PSO

[Figure: Processor … connected to Instantaneous Memory]

SC:
• Stores update memory instantly
• Load reads memory instantly

[Figure: processors with instantaneous in-order execution, each with a store buffer, connected to Instantaneous Memory]

TSO:
• Stores are dequeued in order
• When a store is dequeued from the store buffer, it updates memory instantly
• Load reads the youngest store from the store buffer, or (if not present) memory, instantly
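The TSO rules above can be sketched as a small Python model. This is an illustration only, not part of the original slides: the class name `TSOMachine`, its method names, and the convention that unwritten addresses read as 0 are assumptions made for the example.

```python
from collections import deque


class TSOMachine:
    """Toy TSO abstract machine: one FIFO store buffer per processor."""

    def __init__(self, num_procs):
        self.memory = {}                                # address -> value (default 0)
        self.sb = [deque() for _ in range(num_procs)]   # oldest store on the left

    def store(self, p, addr, val):
        # A store only enters the local store buffer.
        self.sb[p].append((addr, val))

    def load(self, p, addr):
        # Youngest matching store in the local buffer, else memory.
        for a, v in reversed(self.sb[p]):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def dequeue(self, p):
        # Stores leave the buffer in order and update memory instantly.
        addr, val = self.sb[p].popleft()
        self.memory[addr] = val


m = TSOMachine(num_procs=2)
m.store(0, "a", 1)      # first processor: St a 1, still buffered
m.store(1, "b", 1)      # second processor: St b 1, still buffered
r1 = m.load(0, "b")     # no local store to b, so memory is read
r2 = m.load(1, "a")
print(r1, r2)           # 0 0
```

Both loads return 0 because each store is still sitting in the other processor's buffer, an outcome SC does not allow.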

[Figure: processors with instantaneous in-order execution, each with a store buffer, connected to Instantaneous Memory]

PSO:
• Stores are dequeued in order only for the same address
• When a store is dequeued from the store buffer, it updates memory instantly
• Load reads the youngest store from the store buffer, or (if not present) memory, instantly




[Figure: processors, each with a store buffer and an invalidation buffer, connected to Instantaneous Memory]

WMM:
• Stores are dequeued in order only for the same address
• When a store is dequeued from the store buffer, it updates memory instantly, removes the address from its own invalidation buffer, and enters every other invalidation buffer instantly
• Load reads the youngest store from the store buffer, or (if not present) the oldest entry in the invalidation buffer, or (if not present) memory, instantly
• The oldest invalidation buffer entry can be thrown out at any time
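A minimal Python sketch of the WMM rules above, for illustration only. The class and method names are hypothetical. Two readings of the slide are assumptions: the entry placed in the other invalidation buffers is the overwritten (stale) memory value, as the `<a,0>` entries in the litmus-test slides suggest, and "oldest entry" for a load means the oldest entry for the load's address.

```python
class WMMMachine:
    """Toy WMM abstract machine: a store buffer (sb) and an
    invalidation buffer (ib) per processor, over one shared memory."""

    def __init__(self, num_procs):
        self.memory = {}                            # address -> value (default 0)
        self.sb = [[] for _ in range(num_procs)]    # oldest entry first
        self.ib = [[] for _ in range(num_procs)]    # oldest entry first

    def store(self, p, addr, val):
        self.sb[p].append((addr, val))

    def dequeue(self, p, addr):
        # Dequeue the oldest buffered store to `addr` (order is kept per address).
        i = next(k for k, (a, _) in enumerate(self.sb[p]) if a == addr)
        _, val = self.sb[p].pop(i)
        stale = self.memory.get(addr, 0)            # value about to be overwritten
        self.memory[addr] = val
        # Remove the address from the storing processor's own ib ...
        self.ib[p] = [(a, v) for a, v in self.ib[p] if a != addr]
        # ... and enter the stale value into every other ib (assumed reading).
        for q in range(len(self.ib)):
            if q != p:
                self.ib[q].append((addr, stale))

    def load(self, p, addr):
        for a, v in reversed(self.sb[p]):           # youngest store in sb
            if a == addr:
                return v
        for a, v in self.ib[p]:                     # oldest matching ib entry
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def evict_oldest(self, p):
        # The oldest ib entry can be thrown out at any time.
        if self.ib[p]:
            self.ib[p].pop(0)


m = WMMMachine(num_procs=2)
m.store(0, "a", 1)
m.dequeue(0, "a")          # memory a = 1; stale <a,0> enters the other ib
stale = m.load(1, "a")     # the second processor still reads the stale value
m.evict_oldest(1)
fresh = m.load(1, "a")     # after eviction the load reaches memory
print(stale, fresh)        # 0 1
```

The storing processor never sees its own stale value, since its invalidation buffer receives no entry for that store.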


Fences in WMM

• Acquire/Reconcile fence: clears the invalidation buffer
• Release/Commit fence: waits for the store buffer to be flushed (non-atomically)

Axiomatic Definition of WMM

• Load reads the younger (in memory order) of
  - the latest store in memory order for that address, OR
  - the latest store in program order (in that thread) for that address
• Memory order reordering axiom:

Can reorder?        Second:
First:              Ld b      St b v'   Acq/Reconcile   Rel/Commit
Ld a                a != b    No        No              No
St a v              Yes       a != b    Yes             No
Acq/Reconcile       No        No        No              No
Rel/Commit          Yes       No        No              No

• St-St fence: Commit
• Ld-Ld fence: Reconcile
• St-Ld fence: Commit + Reconcile
• Ld-St fence: not needed
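The reordering table and the fence choices above can be encoded as a lookup, sketched here in Python for illustration. The names `CAN_REORDER`, `FENCES_BETWEEN`, and `can_reorder` are hypothetical, and the fence names are shortened to Reconcile and Commit.

```python
# Rows are the first (older) instruction, columns the second (younger) one.
# "diff" means the pair may be reordered only when the addresses differ.
CAN_REORDER = {
    "Ld":        {"Ld": "diff", "St": "no",   "Reconcile": "no",  "Commit": "no"},
    "St":        {"Ld": "yes",  "St": "diff", "Reconcile": "yes", "Commit": "no"},
    "Reconcile": {"Ld": "no",   "St": "no",   "Reconcile": "no",  "Commit": "no"},
    "Commit":    {"Ld": "yes",  "St": "no",   "Reconcile": "no",  "Commit": "no"},
}

# Fences to insert between two memory instructions to keep them ordered.
FENCES_BETWEEN = {
    ("St", "St"): ["Commit"],
    ("Ld", "Ld"): ["Reconcile"],
    ("St", "Ld"): ["Commit", "Reconcile"],
    ("Ld", "St"): [],                      # already ordered, no fence needed
}


def can_reorder(first, second, same_address=False):
    """Return True if `second` may be ordered before `first` in memory order."""
    rule = CAN_REORDER[first][second]
    return rule == "yes" or (rule == "diff" and not same_address)


print(can_reorder("St", "Ld", same_address=True))   # True
print(can_reorder("Ld", "Ld", same_address=True))   # False
print(FENCES_BETWEEN[("St", "Ld")])                 # ['Commit', 'Reconcile']
```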

Implementing WMM

• An executed load won't get squashed later as long as it doesn't overtake a Reconcile or a memory instruction to the same address
  - No monitoring of coherence invalidations
  - Load address speculation allowed; squashed only if the predicted address is wrong
• All instructions are committed in order
  - Stores cannot overtake loads
  - Prevents "out-of-thin-air" generation of values

Formally proven: OOO + Single-threaded correctness + In-order commit + Value prediction + Global store atomicity = WMM

Implementing WMM

• A write-back coherent cache hierarchy typically satisfies Global Store Atomicity
• If L1 is write-through, it is easy to ensure Global Store Atomicity unless the core is SMT
  - SMT cores with L1 write-through caches implement a "non-multicopy-atomic" memory. Don't do it

“Theoretically, the definition of the aq and rl bits allows for implementations without global store atomicity. When both aq and rl bits are set, however, we require full sequential consistency for the atomic operation which implies global store atomicity in addition to both acquire and release semantics. In practice, hardware systems are usually implemented with global store atomicity, embodied in local processor ordering rules together with single-writer cache coherence protocols.”


Mapping C++11 to WMM

C++11               WMM
Non-atomic Load     Load
Load Relaxed        Load
Load Consume        Load; Acquire/Reconcile
Load Acquire        Load; Acquire/Reconcile
Load SC             Rel/Commit; Acq/Reconcile; Load; Acq/Reconcile
Non-atomic Store    Store
Store Relaxed       Store
Store Release       Release/Commit; Store
Store SC            Release/Commit; Store

Using the operational specification of WMM makes it straightforward to derive/verify this mapping
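As an illustration, the mapping table can be written as a lookup that lowers a sequence of C++11 accesses into WMM instructions. This Python sketch is not from the slides: the names `CPP11_TO_WMM` and `lower` are hypothetical, and the fences are abbreviated to Reconcile (Acquire) and Commit (Release).

```python
# C++11 access kind -> WMM instruction sequence, copied from the table.
CPP11_TO_WMM = {
    "Non-atomic Load":  ["Load"],
    "Load Relaxed":     ["Load"],
    "Load Consume":     ["Load", "Reconcile"],
    "Load Acquire":     ["Load", "Reconcile"],
    "Load SC":          ["Commit", "Reconcile", "Load", "Reconcile"],
    "Non-atomic Store": ["Store"],
    "Store Relaxed":    ["Store"],
    "Store Release":    ["Commit", "Store"],
    "Store SC":         ["Commit", "Store"],
}


def lower(accesses):
    """Translate a list of C++11 access kinds into a flat WMM instruction list."""
    out = []
    for access in accesses:
        out.extend(CPP11_TO_WMM[access])
    return out


print(lower(["Store Release", "Load Acquire"]))
# ['Commit', 'Store', 'Load', 'Reconcile']
```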

Conclusion

• WMM is a memory model with a simple specification and potentially high-performance implementations
  - Blends well with the RISC-V philosophy and should be used as the memory model for RISC-V

Thank you!

Advertisement: Formally verified RISC-V (subset of RV32I) multicore implementation in Kami, a hardware formal verification platform

Backup


Why not Release Consistency?

• Fences are not strong enough to give Sequential Consistency
• Non-cumulative fences

Initially, everything is 0

Thread 1        Thread 2          Thread 3
St val = 1      Ld r1 = val       Ld r2 = flag
                St flag = r1      Ld r3 = val

[Slide annotations: observed values (1), (1), (1), (0); fence labels Acquire and Release]

Out-of-thin-air issue

• No processor can produce values out of thin air
  - But an incomplete set of axioms seemingly allows this
• Insisting on in-order commits, and advertising stores to other threads/processors only after commit, takes care of this issue

Initially everything is 0

Thread 1        Thread 2
Ld r1 = x       Ld r2 = y
St y = r1       St x = 42

Finally x = y = r1 = r2 = 42

“The AMOs were designed to implement the C11 and C++11 memory models efficiently. Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set.”

Litmus Tests for WMM

Test SB

P1                  P2
I1: St a 1          I4: St b 1
I2: Commit          I5: Commit
I3: r1 = Ld b       I6: r2 = Ld a

WMM allows: r1=0, r2=0

[Figure: monolithic memory; P1 and P2 each with reg state, store buffer, and inv buffer; buffer entries <a,1>, <a,0>, <b,1>, <b,0>; two Reconcile labels]

• WMM allows the behavior
  - Ld overtakes St and Commit
• Add Reconcile to forbid this
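The SB claim can be checked by exhaustively exploring the abstract machine, sketched here in Python. This is an illustration under stated assumptions, not the authors' tooling: all function names and the instruction encoding are hypothetical, and dequeuing a store is assumed to place the overwritten memory value into the other processors' invalidation buffers.

```python
def put(tup, i, x):
    """Return a copy of tuple `tup` with element i replaced by x."""
    return tup[:i] + (x,) + tup[i + 1:]


def updated(pairs, key, val):
    """Sorted (key, value) pairs with `key` set to `val` (a hashable dict)."""
    d = dict(pairs)
    d[key] = val
    return tuple(sorted(d.items()))


def read(sb, ib, mem, addr):
    """Load rule: youngest sb entry, else oldest ib entry, else memory (default 0)."""
    for a, v in reversed(sb):
        if a == addr:
            return v
    for a, v in ib:
        if a == addr:
            return v
    return dict(mem).get(addr, 0)


def successors(state, programs):
    pcs, sbs, ibs, mem, regs = state
    for p, prog in enumerate(programs):
        if pcs[p] < len(prog):                        # execute the next instruction
            ins, npcs = prog[pcs[p]], put(pcs, p, pcs[p] + 1)
            if ins[0] == "St":
                yield npcs, put(sbs, p, sbs[p] + ((ins[1], ins[2]),)), ibs, mem, regs
            elif ins[0] == "Ld":
                val = read(sbs[p], ibs[p], mem, ins[2])
                yield npcs, sbs, ibs, mem, updated(regs, ins[1], val)
            elif ins[0] == "Commit" and not sbs[p]:   # blocked until sb is empty
                yield npcs, sbs, ibs, mem, regs
            elif ins[0] == "Reconcile":               # clears the local ib
                yield npcs, sbs, put(ibs, p, ()), mem, regs
        for addr in {a for a, _ in sbs[p]}:           # dequeue oldest store per address
            i = next(k for k, (a, _) in enumerate(sbs[p]) if a == addr)
            stale = (addr, dict(mem).get(addr, 0))
            nibs = tuple(tuple(e for e in ib if e[0] != addr) if q == p else ib + (stale,)
                         for q, ib in enumerate(ibs))
            yield (pcs, put(sbs, p, sbs[p][:i] + sbs[p][i + 1:]), nibs,
                   updated(mem, addr, sbs[p][i][1]), regs)
        if ibs[p]:                                    # drop the oldest ib entry
            yield pcs, sbs, put(ibs, p, ibs[p][1:]), mem, regs


def explore(programs):
    """Return every final register assignment reachable under these rules."""
    n = len(programs)
    init = ((0,) * n, ((),) * n, ((),) * n, (), ())
    seen, stack, outcomes = {init}, [init], set()
    while stack:
        state = stack.pop()
        if all(pc == len(prog) for pc, prog in zip(state[0], programs)):
            outcomes.add(state[4])
        for nxt in successors(state, programs):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return outcomes


SB = [[("St", "a", 1), ("Commit",), ("Ld", "r1", "b")],
      [("St", "b", 1), ("Commit",), ("Ld", "r2", "a")]]
SB_RECONCILE = [[("St", "a", 1), ("Commit",), ("Reconcile",), ("Ld", "r1", "b")],
                [("St", "b", 1), ("Commit",), ("Reconcile",), ("Ld", "r2", "a")]]
both_zero = (("r1", 0), ("r2", 0))
print(both_zero in explore(SB), both_zero in explore(SB_RECONCILE))   # True False
```

With only Commit, each load can still hit the stale entry left in its invalidation buffer; a Reconcile placed after each Commit clears that entry, so at least one load must observe the other processor's store.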

Litmus Tests for WMM

Test MP+data

P1                  P2
I1: St a 1          I4: r1 = Ld b
I2: Commit          I5: r2 = Ld r1
I3: St b a

WMM allows: r1=a, r2=0

[Figure: monolithic memory; P1 and P2 each with reg state, store buffer, and inv buffer; buffer entries <a,1>, <a,0>, <b,0>, <b,a>; a Reconcile label]

• WMM allows the behavior
  - Ld overtakes Ld
  - No dependency ordering
  - Can be caused by value prediction in hardware
• Add Reconcile to forbid this
• Out-of-thin-air is impossible because of I2E

WMM-S

• The same abstract machine structure as WMM
• Model non-multi-copy-atomic stores
  - Make a store from processor i visible to processor j before the store updates monolithic memory
  - Make a copy of the store from the sb of processor i, and insert the copy into the sb of processor j
  - Each store has a unique tag; copies have the same tag
• Dequeue a store from sb to monolithic memory
  - All copies are dequeued from sb
  - All copies have to be the oldest one for the store address in their respective sb
• Copying of a store must be constrained for per-location SC
  - Each sb orders stores for a certain address as a list
  - Combining all such lists from all sb together forms a partial coherence order (<_co) of the store tags for that address
  - After copying, the partial coherence order must still be acyclic

[Figure: monolithic memory m; processor ps[i] with reg state s, st buffer sb, and inv buffer ib]

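Store copy example

• Current partial coherence order
  - t_D <_co t_B <_co t_A and t_C <_co t_B
  - t_D and t_C are unrelated
• If we copy C into the sb of P1 as C'
  - Creates a cycle: t_A <_co t_C <_co t_B <_co t_A
  - Should not be allowed
• If we copy A into the sb of P2
  - Creates a cycle: t_A <_co t_A

[Figure: store buffers of P1, P2, and P3, ordered from inserted earlier (older) to inserted later (younger), holding stores A: t_A, B: t_B, C: t_C, D: t_D and copies A': t_A, B': t_B, C': t_C. Primes are copies.]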
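The acyclicity check in the store copy example can be sketched in Python. This is an illustration, not the authors' formalization: `is_acyclic` is a hypothetical helper, tags are plain strings, and each pair `(x, y)` stands for x <_co y.

```python
def is_acyclic(edges):
    """Return True if the relation given as (older, younger) pairs has no cycle."""
    succ = {}
    for older, younger in edges:
        succ.setdefault(older, set()).add(younger)
        succ.setdefault(younger, set())
    state = {}                                  # tag -> "visiting" or "done"

    def visit(tag):
        if state.get(tag) == "done":
            return True
        if state.get(tag) == "visiting":        # reached a tag on the current path
            return False
        state[tag] = "visiting"
        ok = all(visit(nxt) for nxt in succ[tag])
        state[tag] = "done"
        return ok

    return all(visit(tag) for tag in succ)


# Current partial coherence order from the slide.
order = {("tD", "tB"), ("tB", "tA"), ("tC", "tB")}

copy_c_into_p1 = order | {("tA", "tC")}   # C' lands after A in the sb of P1
copy_a_into_p2 = order | {("tA", "tA")}   # the slide's cycle t_A <_co t_A

print(is_acyclic(order), is_acyclic(copy_c_into_p1), is_acyclic(copy_a_into_p2))
# True False False
```

A copy would be permitted only when the extended order passes this check, which rejects both copies discussed on the slide.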

Litmus Tests for WMM-S

Test WRC

P1              P2                P3
I1: St a 1      I2: r1 = Ld a     I4: r2 = Ld b
                I3: St b r1       I5: Reconcile
                                  I6: r3 = Ld a

WMM-S allows: r1=1, r2=1, r3=0

[Figure: store buffers P1 sb, P2 sb, P3 sb and monolithic memory m, with entries <a,1>, <a,1>, <b,1>; a Commit label]

• Add Commit in P2 to forbid this behavior
• Commit globally advertises observed stores (release)
• Reconcile prevents loads from reading stale values (acquire)

Litmus Tests for WMM-S

Test IRIW

P1              P2                P3              P4
I1: St a 1      I2: r1 = Ld a     I5: St b 1      I6: r3 = Ld b
                I3: Reconcile                     I7: Reconcile
                I4: r2 = Ld b                     I8: r4 = Ld a

WMM-S allows: r1=1, r2=0, r3=1, r4=1

[Figure: store buffers P1 sb, P2 sb, P3 sb, P4 sb and monolithic memory m, with entries <a,1>, <a,1>, <b,1>, <b,1>; two Commit labels]

WMM-S Implementation

• WMM-S can be implemented using OOO + a non-atomic memory system
  - e.g. the memory system of the ARM Flowing Model (FM) [1]
  - We do not need a store buffer in OOO, because FM has buffers

[Figure: FM. OOO processors P1 to P4, each with a ROB; segments s[1] to s[6]; monolithic memory m]

[1] Flur et al. "Modelling the ARMv8 architecture, operationally: concurrency and ISA", POPL 2016

FM + OOO

• Each segment is a buffer of memory requests
  - Keeps FIFO ordering of requests to the same address
  - Flow rule: the oldest request for some address in a segment can be moved to the parent segment or monolithic memory
  - Bypass rule: a store can forward its data to a load, as long as there is no other request to the same address in between
• OOO commit
  - Store: directly insert into segment
  - Commit fence: if any segment contains a store observed by the commits of the OOO processor, then we cannot commit the fence
• A store observed by commits of Pi: either committed by Pi or returned by a load committed by Pi

[Figure: simplified version of FM (no fence in FM). P1 to P4, each with a ROB; segments s[1] to s[6]; monolithic memory m]

CCM+OOO ⊆ WMM, FM+OOO ⊆ WMM-S

• How WMM/WMM-S simulates CCM/FM+OOO
  - When the monolithic memory in CCM/FM is updated by a store
    - WMM/WMM-S dequeues that store from sb to monolithic memory
  - When OOO commits an instruction
    - WMM/WMM-S executes that instruction
• When OOO Pi commits a load L for address a with result v
  - Consider where v is in CCM/FM+OOO when L commits
  - v is in the monolithic memory of CCM
    - WMM executes L by reading monolithic memory
  - v has been overwritten by another store in monolithic memory
    - WMM has previously inserted <a,v> into ib of ps[i]
    - Now WMM can execute L by reading ib
  - v is in the store buffer of OOO Pi
    - If v has been observed by commits of Pi before L is committed, then WMM/WMM-S can execute L by reading the local sb
    - Otherwise, WMM-S fires a copy of <a,v> into the local sb and lets L read it

Impact of Disallowing Ld-St Reordering

• Qualitative analysis
  - Store buffer can already hide the store miss latency
  - Stores are not on the critical path for single-thread performance
  - In extreme cases, the speculative store queue may be filled up with uncommitted stores
• Quantitative evaluation
  - Simulate an 8-core multiprocessor using the ESESC simulator
  - Run SPLASH2x benchmarks
  - Compare WMM, Alpha, and aggressive implementations of SC and TSO
  - Alpha = WMM + Ld-St reordering
    - Try to find younger stores to commit when the instruction at the commit slot of the ROB cannot commit

Simulation Configuration

Results

[Figure: Normalized execution time and its breakdown at the commit slot of ROB]
[Figure: Normalized execution time and its breakdown at the issue port to ROB]
[Figure: Average cycles to commit stores early in Alpha]

Non-Atomic Memory

• Models for non-atomic memory are more complicated
• We are unclear about the performance advantage of non-atomic memory
  - Because our understanding of the microarchitectural sources of non-atomic memory is limited
  - POWER: shared write-through L1 due to SMT
    - Other sources in the hierarchy starting from L2?
  - ARM: no clue
    - Many litmus tests for non-atomic stores are not observable on hardware
    - WRC+addrs, WWC+addrs, IRIW+addrs (http://diy.inria.fr/cats/model-arm/all.html)
    - WRC+addrs (http://www.cl.cam.ac.uk/~sf502/popl16/observations.pdf)
• Only by understanding the microarchitectural reasons for non-atomic memory are we able to analyze its benefit
