SISTEMI EMBEDDED
Computer Organization: Memory Hierarchy, Cache Memory
Federico Baronti. Last version: 2016-05-24
Memory Hierarchy
• Ideal memory is fast, large, and inexpensive
• Not feasible with current memory technology, so use a memory hierarchy
• Exploits program behavior (locality of reference) to make memory appear, on average, fast and large
Caches and Locality of Reference
• The cache sits between processor and memory
• Makes the large, slow main memory appear fast
• Typical program behavior involves executing instructions in loops and accessing data arrays
• Effectiveness is based on locality of reference
– Temporal locality: instructions/data that have been accessed recently are likely to be accessed again
– Spatial locality: nearby instructions or data are likely to be accessed after the current access
More Cache Concepts
• To exploit spatial locality, transfer a cache block (or line) with multiple adjacent words from memory
– Later accesses to nearby words are fast, provided that the cache still contains the block
• The mapping function determines where a block from memory is to be located in the cache
– Direct or associative mapping
• When the cache is full, a replacement algorithm determines which block has to be removed from the cache
Cache Operation
• Processor issues Read and Write requests as if it were accessing main memory directly
• But control circuitry first checks the cache
– If the desired information is present in the cache, a read or write hit occurs
• For a read hit, main memory is not involved; the cache provides the desired information
• For a write hit, there are two approaches:
– Write-back or write-through
Handling Cache Writes
• Write-through protocol: update cache & memory. Memory is always up to date.
• Write-back protocol: only update the cache; memory is updated later, when the block is replaced
– The write-back scheme needs a modified or dirty bit to mark blocks that are updated in the cache and need to be written to main memory when they are replaced
• If the same location is written repeatedly, then write-back is much better than write-through
– A block memory update is often more efficient, even if unchanged words are written back
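As an illustration, the two write-hit policies can be sketched in C for a single cache line; this is a minimal sketch with invented names (`write_hit`, `evict`), not code from the slides or the textbook:

```c
#include <string.h>

/* One cache line holding a 16-word block (layout is illustrative). */
enum policy { WRITE_THROUGH, WRITE_BACK };

struct line { int tag; int valid; int dirty; int data[16]; };

/* Write hit: the cache is always updated; memory is updated now
   (write-through) or marked for a later update (write-back). */
static void write_hit(enum policy p, struct line *ln, int word, int value,
                      int *mem_block /* the block's copy in main memory */)
{
    ln->data[word] = value;
    if (p == WRITE_THROUGH)
        mem_block[word] = value;   /* memory always kept up to date */
    else
        ln->dirty = 1;             /* remember the block was modified */
}

/* On replacement, a write-back cache must flush a dirty block first. */
static void evict(struct line *ln, int *mem_block)
{
    if (ln->dirty)
        memcpy(mem_block, ln->data, sizeof ln->data);
    ln->valid = 0;
    ln->dirty = 0;
}
```

Note how repeated writes to the same word under write-back touch memory only once, at eviction, which is exactly why it wins when the same location is written repeatedly.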
Handling Cache Misses
• If the desired information is not present in the cache, a read or write miss occurs
• For a read miss, the block with the desired word is transferred from main memory to the cache
• For a write miss under the write-through protocol, the information is written directly to main memory
• Under the write-back protocol, first transfer the block containing the addressed word into the cache, then overwrite the specific word in the cached block
Mapping Functions
• A block of consecutive words in main memory must be transferred to the cache after a miss
• The mapping function determines the location of a block in the cache
• Three mapping functions:
– Direct, associative, and set-associative mapping
• Let's consider the following scenario:
– Cache with 128 blocks of 16 words
– Main memory with 64K words (4K blocks), word-addressable, so 16-bit addresses
Direct Mapping
• Simplest approach uses a fixed mapping: memory block j → cache block (j mod 128)
• Only one unique location for each memory block
– Two blocks may contend for the same location even if the cache is not fully utilized
– A new block always overwrites the previous block
• The address is divided into 3 fields: tag, block (or line) index, and word (or offset)
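With the scenario above (16-bit word addresses, 128 cache blocks of 16 words), the three fields fall out of the address with shifts and masks. A small C sketch (function names are mine, not from the slides):

```c
#include <stdint.h>

/* Direct mapping, scenario from the slides:
   16-bit word address = | tag (5 bits) | block (7 bits) | word (4 bits) |
   - 16 words/block   -> 4-bit word (offset) field
   - 128 cache blocks -> 7-bit block (line) field
   - remaining 5 bits -> tag */
#define WORD_BITS  4
#define BLOCK_BITS 7

static uint16_t word_field (uint16_t a) { return a & 0x000F; }
static uint16_t block_field(uint16_t a) { return (a >> WORD_BITS) & 0x007F; }
static uint16_t tag_field  (uint16_t a) { return a >> (WORD_BITS + BLOCK_BITS); }
```

The 7-bit block field computes exactly j mod 128 for memory block j, so memory blocks j and j + 128 contend for the same cache line.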
Associative Mapping
• Full flexibility: locate a block anywhere in the cache
• The block field of the address no longer needs any bits
• The tag field is enlarged to encompass those bits
• The larger tag is stored in the cache with each block
• For hit/miss detection, all tags are compared simultaneously, in parallel, against the tag field of the given address
• This associative search increases complexity
• Flexible mapping also requires an appropriate replacement algorithm when the cache is full
Set-Associative Mapping
• Combination of direct & associative mapping
• Group blocks of the cache into sets
• The block field bits map a block to a unique set
• But any block within the set may be used
• Associative search involves only the tags in one set
• The replacement algorithm applies only to blocks in a set
• Reducing flexibility also reduces complexity
• k blocks/set → k-way set-associative cache
– Direct mapping corresponds to 1-way
– Associative mapping corresponds to all-way
2-way Set-Associative Mapping
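The three mappings differ only in how the 16-bit address is split. For k blocks per set, the field widths can be derived mechanically; a sketch assuming the slides' 128-block, 16-word-block cache (helper names are mine):

```c
/* Bits needed to index n equal items (n a power of two). */
static int log2i(int n) { int b = 0; while (n > 1) { n >>= 1; b++; } return b; }

/* Address-field widths for a k-way set-associative version of the
   slides' cache: 16-bit addresses, 128 blocks of 16 words. */
static void fields(int k, int *word, int *set, int *tag)
{
    int sets = 128 / k;        /* k blocks per set */
    *word = log2i(16);         /* 4 bits: word (offset) within a block */
    *set  = log2i(sets);       /* bits selecting the set */
    *tag  = 16 - *set - *word; /* the rest is stored as the tag */
}
```

k = 1 reproduces direct mapping (tag 5 / block 7 / word 4), k = 2 the 2-way case (6/6/4), and k = 128 (one set containing every block) the fully associative split (12/0/4).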
Stale Data
• Each block has a valid bit, initialized to 0
• No hit if the valid bit is 0, even if a tag match occurs
• The valid bit is set to 1 when a block is placed in the cache
• When power is turned on, all valid bits are set to 0
• Because of DMA, main memory can change without a read or write performed by the processor
– Invalidate a cache block when the corresponding block in memory is modified by DMA
– If write-back is used, transfer the block from cache to memory before starting a DMA operation that has such a block as source. This can be achieved by flushing the cache.
LRU Replacement Algorithm
• Replacement is trivial for direct mapping, but a method is needed for associative mapping
• Consider temporal locality of reference and use a least-recently-used (LRU) algorithm
• For k-way set associativity, each block in a set has a counter ranging from 0 to k−1, which is updated with the following rules:
– Hitting on a block clears its counter value to 0; counters originally lower in the set are incremented, and all the others remain unchanged
– When a miss occurs and the set is not full, the counter of the new block is set to 0 and all the others are increased by one
– When a miss occurs and the set is full, replace the block with counter = k−1, set its counter to 0, and increment all the other counters by one
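The three counter-update rules above can be written out directly; a C sketch for a 4-way set (array layout and names are my own):

```c
#define K 4  /* 4-way set-associative: counters range over 0..K-1 */

/* Counter 0 marks the most recently used block, K-1 the least recent. */

/* Hit: counters originally lower than the hit block's are incremented,
   the hit block's counter is cleared, the rest stay unchanged. */
static void lru_hit(int counter[K], int hit_way)
{
    int c = counter[hit_way];
    for (int i = 0; i < K; i++)
        if (counter[i] < c)
            counter[i]++;
    counter[hit_way] = 0;
}

/* Miss: fill an empty way if the set is not full, otherwise evict the
   block whose counter is K-1; either way the new block gets counter 0
   and every other valid counter is incremented. */
static int lru_miss(int counter[K], int valid[K])
{
    int way = -1;
    for (int i = 0; i < K && way < 0; i++)
        if (!valid[i]) way = i;               /* set not full */
    if (way < 0)
        for (int i = 0; i < K; i++)
            if (counter[i] == K - 1) way = i; /* set full: evict LRU */
    for (int i = 0; i < K; i++)
        if (valid[i] && i != way)
            counter[i]++;
    valid[way] = 1;
    counter[way] = 0;
    return way;
}
```

Because the counters of a full set always form a permutation of 0..K−1, exactly one block carries counter K−1 and the eviction choice is unambiguous.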
Hit Rate and Miss Penalty
• The performance of a memory hierarchy is determined by the hit rate and the miss penalty
• The hit rate depends on the cache size and its organization (mapping function, block size)
• The miss penalty includes the time to detect the miss, transfer one block from main memory to the cache, and finally deliver the requested word to the processor. It depends on the main memory access time, which is usually much larger for the first word of the block than for the remaining ones.
– Let's assume that the cache access time is 1 clock cycle, the access time for the first word in memory is Nfirst = 7 cycles and for the following words Nmore = 1 cycle, and the block size is B = 8 words.
– Then the miss penalty is Nmiss = (1 + 1 × Nfirst + (B − 1) × Nmore + 1) − 1 = 15, where one cycle is for detecting the cache miss and another for providing the requested word to the processor.
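The arithmetic of the example can be checked with a one-line helper (a sketch; parameter names follow the slides):

```c
/* Miss penalty in cycles: 1 cycle to detect the miss, nfirst for the
   first word of the block, nmore for each of the remaining words, and
   1 to forward the requested word to the processor, minus the 1-cycle
   cache access that a hit would have cost anyway. */
static int miss_penalty(int nfirst, int nmore, int block_words)
{
    return (1 + nfirst + (block_words - 1) * nmore + 1) - 1;
}
```

With Nfirst = 7, Nmore = 1 and B = 8 this yields the slides' Nmiss = 15.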
Effect on Pipelining Performance
• Assume the following: frequency of cache misses during fetch pmiss-fetch = 5%, frequency of cache misses during memory access pmiss-mem = 10%, frequency of Load and Store instructions pLD-ST = 30%. Then:
– δcache-miss = Nmiss × (pmiss-fetch + pLD-ST × pmiss-mem) = 15 × (0.05 + 0.30 × 0.10) = 1.2
– F = R/2.2 = 0.45R
• Without a cache, i.e., pmiss-fetch = 100% and pmiss-mem = 100%, the memory access time penalty is Nfirst − 1 cycles:
– δmem = (Nfirst − 1) × (1 + pLD-ST) = 7.8
– F = R/8.8 = 0.11R
• The cache improves performance by a factor of 4.
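The factor-of-4 claim follows from the average stall cycles per instruction; a quick check in C (function names are mine, symbols follow the slides):

```c
#include <math.h>

/* Average stall cycles per instruction added by cache misses:
   every fetch may miss, and Load/Store instructions may miss again
   on their data access. */
static double delta_cache_miss(double n_miss, double p_fetch,
                               double p_mem, double p_ldst)
{
    return n_miss * (p_fetch + p_ldst * p_mem);
}

/* Without a cache every memory access pays Nfirst - 1 extra cycles,
   once per fetch plus once per Load/Store. */
static double delta_mem(double n_first, double p_ldst)
{
    return (n_first - 1.0) * (1.0 + p_ldst);
}
```

With the slides' numbers, δcache-miss = 1.2 gives F = R/(1 + 1.2) = 0.45R, δmem = 7.8 gives F = R/(1 + 7.8) = 0.11R, and the ratio 8.8/2.2 is exactly 4.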
References
• C. Hamacher, Z. Vranesic, S. Zaky, N. Manjikian, "Computer Organization and Embedded Systems," McGraw-Hill International Edition
– Chapter 8: Sections 8.5–8.7.1