CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Caches Part 2
Instructors: John Wawrzynek & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/
Typical Memory Hierarchy
[Figure: memory hierarchy. On-chip components (Control, Datapath, RegFile, Instr Cache, Data Cache), then Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk or Flash).]
Speed (cycles): ½'s, 1's, 10's, 100's, 1,000,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost/bit: highest (closest to the processor) to lowest (secondary memory)
• Principle of locality + memory hierarchy presents the programmer with ≈ as much memory as is available in the cheapest technology at ≈ the speed offered by the fastest technology
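Review: Adding Cache to Computer
[Figure: Processor (Control; Datapath with PC, Registers, Arithmetic & Logic Unit (ALU)) connected through a Cache to Memory (Program, Data, Bytes) over the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data). Input and Output attach through the I/O-Memory Interfaces.]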
[Figure (slide dated 10/20/15): memory shown three ways, as bytes, as words, and as 8-byte blocks. A word address has its 2 LSBs equal to 0; an 8-byte block address has its 3 LSBs equal to 0. The low-order bits of an address are the byte offset in the block; the remaining upper bits are the block #.]
Memory Block-Addressing Example
Block number aliasing example: 12-bit memory addresses, 16-Byte blocks
Address        Block #   Block # mod 8   Block # mod 2
010100100000   82        2               0
010100110000   83        3               1
010101000000   84        4               0
010101010000   85        5               1
010101100000   86        6               0
010101110000   87        7               1
010110000000   88        0               0
010110010000   89        1               1
010110100000   90        2               0
010110110000   91        3               1
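The aliasing table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the 16-byte blocks of the example; the function names `block_number` and `cache_index` are illustrative and do not come from the lecture.

```python
BLOCK_SIZE = 16  # bytes per block, as in the example above


def block_number(address: int, block_size: int = BLOCK_SIZE) -> int:
    """Drop the byte-offset bits to get the memory block number."""
    return address // block_size


def cache_index(address: int, num_cache_blocks: int) -> int:
    """Direct-mapped placement: block number modulo the number of cache blocks."""
    return block_number(address) % num_cache_blocks


# Walk the ten block-aligned addresses used in the example.
for addr in range(0b010100100000, 0b010110110000 + 1, BLOCK_SIZE):
    print(f"{addr:012b}  block {block_number(addr)}  "
          f"mod 8 = {cache_index(addr, 8)}  mod 2 = {cache_index(addr, 2)}")
```

Blocks 82 and 90 land on the same index in the 8-block cache, and every even-numbered block lands on index 0 in the 2-block cache, which is the aliasing the slide illustrates.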
Caches Review
• Principle of Locality
  – Temporal Locality and Spatial Locality
• Hierarchy of Memories (speed/size/cost per bit) to Exploit Locality
• Cache – copy of data in lower level of memory hierarchy
• Direct Mapped: find block in cache using Tag field and Valid bit for Hit
• Cache design organization choices:
  – Fully Associative, Set-Associative, Direct-Mapped
Cache Organizations
• "Fully Associative": Block can go anywhere
  – First design in lecture
  – Note: No Index field, but 1 comparator/block
• "Direct Mapped": Block goes one place
  – Note: Only 1 comparator
  – Number of sets = number of blocks
• "N-way Set Associative": N places for a block
  – Number of sets = number of blocks / N
  – N comparators
  – Fully Associative: N = number of blocks
  – Direct Mapped: N = 1
Processor Address Fields used by Cache Controller
• Block Offset: Byte address within block
• Set Index: Selects which set
• Tag: Remaining portion of processor address
• Size of Index = log2(number of sets)
• Size of Tag = Address size – Size of Index – log2(number of bytes/block)
Processor Address (32 bits total), high bits to low bits: | Tag | Set Index | Block offset |
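The field definitions above translate directly into shifts and masks. The following is a sketch only, assuming the number of sets and the block size are powers of two; `split_address` is a hypothetical helper name, not something defined in the lecture.

```python
def split_address(address: int, num_sets: int, block_bytes: int):
    """Split a processor address into (tag, set index, block offset).

    Assumes num_sets and block_bytes are powers of two.
    """
    offset_bits = (block_bytes - 1).bit_length()  # log2(bytes per block)
    index_bits = (num_sets - 1).bit_length()      # log2(number of sets)
    offset = address & (block_bytes - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)   # remaining upper bits
    return tag, index, offset


# 1K one-word blocks, direct mapped: 1024 sets of 4 bytes each.
tag, index, offset = split_address(0x12345678, num_sets=1024, block_bytes=4)
print(hex(tag), hex(index), offset)  # 0x12345 0x19e 0
```

With one set (fully associative) the index width is zero, so the tag is every bit above the block offset.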
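Direct-Mapped Cache Review
• One-word blocks, cache size = 1K words (or 4 KB)
[Figure: 32-bit address split into a 20-bit Tag (bits 31 to 12), a 10-bit Index (bits 11 to 2), and a 2-bit Byte offset (bits 1 to 0). The Index selects one of 1024 entries (0 to 1023), each holding Valid, Tag, and 32-bit Data. A comparator checks the stored 20-bit Tag against the address Tag to produce Hit.]
• Valid bit ensures something useful is in the cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit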
Four-Way Set-Associative Cache
• 2^8 = 256 sets, each with four ways (each with one block)
[Figure: 32-bit address split into a 22-bit Tag, an 8-bit Index, and a 2-bit Byte offset. The Set Index selects one of 256 entries (0 to 255) in each of Way 0, Way 1, Way 2, and Way 3; each entry holds V, Tag, and Data. The tag comparisons drive a 4x1 select that outputs Hit and the 32-bit Data.]
Handling Stores with Write-Through
• Store instructions write to memory, changing values
• Need to make sure cache and memory have same values on writes: 2 policies
1) Write-Through Policy: write cache and write through the cache to memory
  – Every write eventually gets to memory
  – Too slow, so include Write Buffer to allow processor to continue once data is in Buffer
  – Buffer updates memory in parallel to processor
Write-Through Cache
• Write both values in cache and in memory
• Write buffer stops CPU from stalling if memory cannot keep up
• Write buffer may have multiple entries to absorb bursts of writes
• What if store misses in cache?
[Figure: Processor, Cache, and Memory connected by 32-bit Address and 32-bit Data paths, with a Write Buffer of (Addr, Data) entries between the Cache and Memory.]
Handling Stores with Write-Back
2) Write-Back Policy: write only to cache and then write cache block back to memory when evicting block from cache
  – Writes collected in cache, only single write to memory per block
  – Include bit to see if wrote to block or not, and then only write back if bit is set
    • Called "Dirty" bit (writing makes it "dirty")
Write-Back Cache
• Store/cache hit, write data in cache only & set dirty bit
  – Memory has stale value
• Store/cache miss, read data from memory, then update and set dirty bit
  – "Write-allocate" policy
• Load/cache hit, use value from cache
• On any miss, write back evicted block, only if dirty. Update cache with new block and clear dirty bit.
[Figure: Processor, Cache, and Memory connected by 32-bit Address and 32-bit Data paths; each cache entry carries a Dirty Bit (D).]
Write-Through vs. Write-Back
• Write-Through:
  – Simpler control logic
  – More predictable timing simplifies processor control logic
  – Easier to make reliable, since memory always has copy of data (big idea: Redundancy!)
• Write-Back:
  – More complex control logic
  – More variable timing (0, 1, 2 memory accesses per cache access)
  – Usually reduces write traffic
  – Harder to make reliable, sometimes cache has only copy of data
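To see why write-back "usually" reduces write traffic, the sketch below counts the writes that reach memory for a sequence of stores. It is a simplified model under stated assumptions: a direct-mapped, write-allocate cache with one-word blocks, stores only (so every cached block is dirty), and a final flush of dirty blocks. The function name `memory_writes` is illustrative.

```python
def memory_writes(store_blocks, num_cache_blocks, policy):
    """Count writes that reach memory for a list of stored-to block numbers."""
    if policy == "write-through":
        return len(store_blocks)       # every store is written through to memory
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    cache = {}                         # cache index -> block number held (dirty)
    writes = 0
    for block in store_blocks:
        index = block % num_cache_blocks
        if index in cache and cache[index] != block:
            writes += 1                # evicting a dirty block: write it back
        cache[index] = block           # write-allocate, then mark dirty
    return writes + len(cache)         # flush the remaining dirty blocks


repeated = [0, 0, 0, 0, 1, 1]          # stores with temporal locality
print(memory_writes(repeated, 4, "write-through"),
      memory_writes(repeated, 4, "write-back"))
```

Repeated stores to the same block collapse into one write-back, while a conflicting pattern such as stores to blocks 0, 4, 0, 4 in a 4-block cache gains nothing, which is why the slide says "usually" rather than "always".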
Administrivia
• Project 3-1 due date Wednesday 10/21.
• Project 3-2 due date now 10/28 (release 10/21)
• Midterm 1:
  – grades posted
Cache (Performance) Terms
• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 – Hit rate
• Miss penalty: time to bring a block from a lower level of the memory hierarchy into the cache, replacing a block already there
• Hit time: time to access cache memory (including tag comparison)
• Abbreviation: "$" = cache (A Berkeley innovation!)
Average Memory Access Time (AMAT)
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache
AMAT = Time for a hit + Miss rate × Miss penalty
Clickers/Peer Instruction
AMAT = Time for a hit + Miss rate × Miss penalty
Given a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache hit time of 1 clock cycle, what is AMAT?
☐ A: ≤ 200 psec
☐ B: 400 psec
☐ C: 600 psec
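The question can be checked by plugging the given numbers into the AMAT formula. A minimal sketch follows; the function name `amat` and the variable names are illustrative.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Time for a hit + Miss rate x Miss penalty (all in the same time unit)."""
    return hit_time + miss_rate * miss_penalty


CLOCK_PSEC = 200                       # clock period given in the question
cycles = amat(hit_time=1, miss_rate=0.02, miss_penalty=50)
print(f"{cycles:.1f} cycles = {cycles * CLOCK_PSEC:.0f} psec")  # 2.0 cycles = 400 psec
```

Working in cycles first and converting to time at the end keeps the units consistent: 1 + 0.02 × 50 = 2 cycles, or 400 psec at a 200 psec clock.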
Example: Direct-Mapped Cache with 4 Single-Word Blocks, Worst-Case Reference String
• Consider the main memory address reference string of word numbers: 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid
Access:  0     4     0     4     0     4     0     4
Result:  miss  miss  miss  miss  miss  miss  miss  miss
• Cache index 0 alternates between tag 00, Mem(0) and tag 01, Mem(4); each access evicts the block that the next access needs
• Ping-pong effect due to conflict misses: two memory locations that map into the same cache block
• 8 requests, 8 misses
Alternative Block Placement Schemes
• DM placement: mem block 12 in 8-block cache: only one cache block where mem block 12 can be found, (12 modulo 8) = 4
• SA placement: four sets x 2 ways (8 cache blocks), memory block 12 in set (12 mod 4) = 0; either element of the set
• FA placement: mem block 12 can appear in any cache block
Example: 2-Way Set Associative $ (4 words = 2 sets x 2 ways per set)
[Figure: Cache with Set 0 and Set 1, each with Way 0 and Way 1 holding V, Tag, and Data; Main Memory with word addresses 0000xx through 1111xx.]
• One-word blocks: two low-order bits define the byte in the word (32b words)
Q: How do we find it?
• Use next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)
Q: Is it there?
• Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache
Example: 4-Word 2-Way SA $, Same Reference String
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid
Access:  0     4     0    4    0    4    0    4
Result:  miss  miss  hit  hit  hit  hit  hit  hit
• After the first two accesses, set 0 holds tag 000, Mem(0) in one way and tag 010, Mem(4) in the other
• Solves the ping-pong effect in a direct-mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!
• 8 requests, 2 misses
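Both reference-string examples (direct mapped: 8 misses; 2-way set associative: 2 misses) can be checked with a small simulator. This is a sketch under stated assumptions: one-word blocks, LRU replacement within each set, and an initially empty cache. `count_misses` is a hypothetical name, and the `OrderedDict` bookkeeping is one convenient way to model LRU, not a description of the hardware.

```python
from collections import OrderedDict


def count_misses(word_refs, num_sets, ways):
    """Count misses for a set-associative cache with one-word blocks and LRU."""
    sets = [OrderedDict() for _ in range(num_sets)]  # per set: tags in LRU order
    misses = 0
    for word in word_refs:
        lines = sets[word % num_sets]   # set index = word number mod number of sets
        tag = word // num_sets          # tag = remaining upper bits
        if tag in lines:
            lines.move_to_end(tag)      # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used tag
            lines[tag] = True
    return misses


refs = [0, 4, 0, 4, 0, 4, 0, 4]
print("direct mapped, 4 blocks:", count_misses(refs, num_sets=4, ways=1))
print("2-way, 2 sets:          ", count_misses(refs, num_sets=2, ways=2))
```

Both configurations hold four words in total; only the placement rule differs, which is what removes the ping-pong effect.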
Different Organizations of an Eight-Block Cache
• Total size of $ in blocks is equal to number of sets × associativity.
• For fixed $ size and fixed block size, increasing associativity decreases number of sets while increasing number of elements per set.
• With eight blocks, an 8-way set-associative $ is same as a fully associative $.
Range of Set-Associative Caches
• For a fixed-size cache and fixed block size, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
Address fields, high bits to low bits: | Tag | Index | Word offset | Byte offset |
  – Tag: used for tag compare
  – Index: selects the set
  – Word offset: selects the word in the block
• Increasing associativity leads to fully associative (only one set): tag is all the bits except block and byte offset
• Decreasing associativity leads to direct mapped (only one way): smaller tags, only a single comparator
Total Cache Capacity = Associativity × # of sets × block_size
Bytes = blocks/set × sets × Bytes/block
C = N × S × B
Address fields, high bits to low bits: | Tag | Index | Byte Offset |
address_size = tag_size + index_size + offset_size
             = tag_size + log2(S) + log2(B)
Clickers/Peer Instruction
• For a cache with constant total capacity, if we increase the number of ways by a factor of 2, which statement is false:
  – A: The number of sets could be doubled
  – B: The tag width could decrease
  – C: The block size could stay the same
  – D: The block size could be halved
  – E: Tag width must increase
Total Cache Capacity: C = N × S × B
Clicker Question: C remains constant, and S and/or B can change such that C = 2N × (SB)', so (SB)' = SB/2
address_size = tag_size + index_size + offset_size = tag_size + log2(S) + log2(B)
Old tag_size = address_size – (log2(S) + log2(B)) = address_size – log2(SB)
New tag_size = address_size – log2((SB)') = address_size – log2(SB/2) = address_size – (log2(SB) – 1) = old tag_size + 1
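The derivation above can be spot-checked numerically. The sketch below computes the field widths from C, N, and B using C = N × S × B; it assumes power-of-two sizes and a 32-bit address, and `field_widths` is a hypothetical helper name.

```python
def field_widths(capacity_bytes, ways, block_bytes, address_bits=32):
    """Return the number of sets and the tag/index/offset widths in bits.

    Assumes capacity, ways, and block size are powers of two.
    """
    num_sets = capacity_bytes // (ways * block_bytes)   # S = C / (N x B)
    offset_bits = (block_bytes - 1).bit_length()        # log2(B)
    index_bits = (num_sets - 1).bit_length()            # log2(S)
    tag_bits = address_bits - index_bits - offset_bits
    return {"sets": num_sets, "tag": tag_bits,
            "index": index_bits, "offset": offset_bits}


# 4 KB cache: doubling the ways adds one tag bit whether S or B is halved.
print(field_widths(4096, ways=1, block_bytes=4))   # baseline, direct mapped
print(field_widths(4096, ways=2, block_bytes=4))   # halve the number of sets
print(field_widths(4096, ways=2, block_bytes=2))   # halve the block size instead
```

In both doubled-associativity cases the index and offset together lose one bit, so the tag gains one, matching the "old tag_size + 1" result.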
Costs of Set-Associative Caches
• N-way set-associative cache costs
  – N comparators (delay and area)
  – MUX delay (set selection) before data is available
  – Data available after set selection (and Hit/Miss decision). DM $: block is available before the Hit/Miss decision
    • In Set-Associative, not possible to just assume a hit and continue and recover later if it was a miss
• When miss occurs, which way's block selected for replacement?
  – Least Recently Used (LRU): one that has been unused the longest (principle of temporal locality)
    • Must track when each way's block was used relative to other blocks in the set
    • For 2-way SA $, one bit per set → set to 1 when a block is referenced; reset the other way's bit (i.e., "last used")
Cache Replacement Policies
• Random Replacement
  – Hardware randomly selects a cache entry to evict
• Least-Recently Used
  – Hardware keeps track of access history
  – Replace the entry that has not been used for the longest time
  – For 2-way set-associative cache, need one bit for LRU replacement
• Example of a Simple "Pseudo" LRU Implementation
  – Assume 64 Fully Associative entries
  – Hardware replacement pointer points to one cache entry
  – Whenever access is made to the entry the pointer points to:
    • Move the pointer to the next entry
  – Otherwise: do not move the pointer
  – (example of "not-most-recently used" replacement policy)
[Figure: Replacement Pointer pointing into a column of entries, Entry 0, Entry 1, ..., Entry 63.]
Benefits of Set-Associative Caches
• Choice of DM $ versus SA $ depends on the cost of a miss versus the cost of implementation
• Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)
Understanding Cache Misses: The 3Cs
• Compulsory (cold start or process migration, 1st reference):
  – First access to block impossible to avoid; small effect for long-running programs
  – Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size (may increase access time)
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity (may increase access time)
How to Calculate 3C's using Cache Simulator
1. Compulsory: set cache size to infinity and fully associative, and count number of misses
2. Capacity: Change cache size from infinity, usually in powers of 2, and count misses for each reduction in size
  – 16 MB, 8 MB, 4 MB, … 128 KB, 64 KB, 16 KB
3. Conflict: Change from fully associative to n-way set associative while counting misses
  – Fully associative, 16-way, 8-way, 4-way, 2-way, 1-way
3Cs Analysis
• Three sources of misses (SPEC2000 integer and floating-point benchmarks)
  – Compulsory misses 0.006%; not visible
  – Capacity misses, function of cache size
  – Conflict portion depends on associativity and cache size
Improving Cache Performance
AMAT = Time for a hit + Miss rate × Miss penalty
• Reduce the time to hit in the cache
  – E.g., Smaller cache
• Reduce the miss rate
  – E.g., Bigger cache
• Reduce the miss penalty
  – E.g., Use multiple cache levels
Impact of Larger Cache on AMAT?
• 1) Reduces misses (what kind(s)?)
• 2) Longer Access time (Hit time): smaller is faster
  – Increase in hit time will likely add another stage to the pipeline
• At some point, increase in hit time for a larger cache may overcome the improvement in hit rate, yielding a decrease in performance
• Computer architects expend considerable effort optimizing organization of cache hierarchy – big impact on performance and power!
Clickers: Impact of longer cache blocks on misses?
• For fixed total cache capacity and associativity, what is effect of longer blocks on each type of miss:
  – A: Decrease, B: Unchanged, C: Increase
• Compulsory?
• Capacity?
• Conflict?
Clickers: Impact of longer blocks on AMAT
• For fixed total cache capacity and associativity, what is effect of longer blocks on each component of AMAT:
  – A: Decrease, B: Unchanged, C: Increase
• Hit Time?
• Miss Rate?
• Miss Penalty?
Clickers/Peer Instruction: For fixed capacity and fixed block size, how does increasing associativity affect AMAT?
Cache Design Space
• Several interacting dimensions
  – Cache size
  – Block size
  – Associativity
  – Replacement policy
  – Write-through vs. write-back
  – Write allocation
• Optimal choice is a compromise
  – Depends on access characteristics
    • Workload
    • Use (I-cache, D-cache)
  – Depends on technology / cost
• Simplicity often wins
[Figure: design space with axes Associativity, Cache Size, and Block Size, and a plot running from Bad to Good as a parameter goes from Less to More, with curves for Factor A and Factor B.]
And, In Conclusion…
• Name of the Game: Reduce AMAT
  – Reduce Hit Time
  – Reduce Miss Rate
  – Reduce Miss Penalty
• Balance cache parameters (Capacity, associativity, block size)