Lecture 11: Memory Hierarchy -- Cache Organization and Performance

CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513


Page 1:

Lecture 11: Memory Hierarchy -- Cache Organization and Performance

CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513

Page 2:

Topics for Memory Hierarchy

• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques
    • CAQA: B.3
  – 10 Advanced Optimization Techniques
    • CAQA: 2.3
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skip for this course

Page 3:

Caching Exploits Both Types of Locality by Preloading and Keeping Data in Faster Memory

• Exploit temporal locality by remembering the contents of recently accessed locations.
• Exploit spatial locality by fetching blocks of data around recently accessed locations.

    double A[N];
    sum = 0;
    for (i = 0; i < N; i++)
        sum += A[i];
    return sum;

With a 64-byte block size and A[i] an 8-byte double, one block (cache line) can hold 8 elements of A.

Referencing A[0] causes the memory system to bring A[0:7] into the cache, so future references to A[1:7] are all cache hits -- faster access than reading from memory.
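The arithmetic above can be checked with a short sketch (function name is ours; it assumes a cold cache and 64-byte lines): sequential 8-byte accesses miss once per line, so the miss rate is 1/8.

```c
#include <assert.h>

/* Count cold-cache misses for a sequential scan of n 8-byte doubles,
 * assuming 64-byte cache lines: an access misses only when it is the
 * first touch of a new line, i.e. once every 8 elements. */
static int sequential_misses(int n) {
    int misses = 0;
    long last_line = -1;
    for (int i = 0; i < n; i++) {
        long line = (long)i * 8 / 64;   /* byte address of A[i] / line size */
        if (line != last_line) {        /* first touch of this line: miss */
            misses++;
            last_line = line;
        }
    }
    return misses;
}
```

For N = 64 this predicts 8 misses and 56 hits, i.e. the 12.5% miss rate implied by the loop on the slide.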

Page 4:

Terminology for Memory Hierarchy Performance

• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level.
  – Hit Time: time to access the upper level, which consists of
    • upper-level access time + time to determine hit/miss.
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate), i.e. #misses / #memory accesses
  – Miss Penalty: time to replace a block in the upper level
    • plus time to deliver the block to the processor.
• Hit Time << Miss Penalty (500 instructions on the 21264!)

[Figure: Block X is transferred between the processor and the upper-level memory; Block Y resides in the lower-level memory.]

Page 5:

4 Questions for Cache Organization

• Q1: Where can a block be placed in the upper level?
  – Block placement
• Q2: How is a block found if it is in the upper level?
  – Block identification
• Q3: Which block should be replaced on a miss?
  – Block replacement
• Q4: What happens on a write?
  – Write strategy

Page 6:

Q1: Where Can a Block be Placed in the Upper Level?

• Block placement
  – Direct mapped, fully associative, set associative
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
  – # of sets ≤ # of blocks
  – n-way: n blocks in a set
  – 1-way = direct mapped
• Fully associative: # of sets = 1

[Figure: an 8-block cache. Direct mapped: memory block 12 can go only into cache block 4 (12 mod 8). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4). Fully associative: block 12 can go anywhere.]
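The placement rule in the figure can be sketched as a small helper (function name is ours; the 8-block cache is the one on the slide):

```c
#include <assert.h>

/* Set index a memory block maps to in a cache of num_blocks blocks
 * organized as n-way sets: num_sets = num_blocks / ways.
 * ways == 1 is direct mapped; ways == num_blocks is fully associative. */
static int set_index(int block_number, int num_blocks, int ways) {
    int num_sets = num_blocks / ways;
    return block_number % num_sets;
}
```

Block 12 in an 8-block cache maps to block 4 when direct mapped (12 mod 8), to set 0 when 2-way set associative (12 mod 4), and to the single set 0 when fully associative.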

Page 7:

1 KB Direct Mapped Cache, 32 B Blocks

• For a 2^N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: a 32-bit address split into Cache Tag (bits 31:10, example 0x50), Cache Index (bits 9:5, example 0x01), and Byte Select (bits 4:0, example 0x00). Each cache entry stores a Valid Bit and the Cache Tag as part of the cache "state", alongside 32 bytes of Cache Data (Byte 0 ... Byte 31 per line, up to Byte 1023).]
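The field extraction for this 1 KB, 32 B-block direct-mapped cache can be sketched directly (function names are ours; the widths follow the slide: 5 byte-select bits, 5 index bits, 22 tag bits):

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit address for a 1 KB direct-mapped cache with 32 B blocks:
 * 5 byte-select bits (32 = 2^5), 5 index bits (32 lines = 1 KB / 32 B),
 * and the remaining 22 bits of tag. */
static uint32_t byte_select(uint32_t addr) { return addr & 0x1F; }
static uint32_t cache_index(uint32_t addr) { return (addr >> 5) & 0x1F; }
static uint32_t cache_tag(uint32_t addr)   { return addr >> 10; }
```

The slide's example fields -- tag 0x50, index 0x01, byte select 0x00 -- correspond to address (0x50 << 10) | (0x01 << 5) = 0x14020.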

Page 8:

Set Associative Cache

• N-way set associative: N entries for each Cache Index
  – N direct-mapped caches operate in parallel
• Example: two-way set associative cache
  – Cache Index selects a "set" from the cache;
  – The two tags in the set are compared to the input in parallel;
  – Data is selected based on the tag result.

[Figure: two-way set associative datapath. The Cache Index selects one entry from each of two Valid/Tag/Data arrays; two comparators match the address tag against both stored tags; their outputs are ORed to form Hit and drive the mux (Sel1/Sel0) that selects the Cache Block.]

Page 9:

Disadvantage of Set Associative Cache

• N-way set associative cache versus direct mapped cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if it was a miss.

[Figure: the same two-way set associative datapath as on the previous slide.]

Page 10:

Q2: Block Identification

• Tag on each block
  – No need to check the index or block offset
• Increasing associativity shrinks the index, expands the tag

[Figure: the Block Address splits into Tag and Index (set select), followed by the Block Offset (data select).]

Cache size = Associativity × 2^index_size × 2^offset_size
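The sizing identity can be sanity-checked numerically (a minimal sketch with our own names; the 1 KB direct-mapped cache from Page 7 serves as the test case):

```c
#include <assert.h>

/* Cache size in bytes = Associativity x 2^index_size x 2^offset_size. */
static long cache_size(int associativity, int index_bits, int offset_bits) {
    return (long)associativity * (1L << index_bits) * (1L << offset_bits);
}
```

The Page 7 cache (direct mapped, 5 index bits, 5 offset bits) comes out to 1 × 32 × 32 = 1024 bytes; making it 2-way while keeping 1 KB removes one index bit.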

Page 11:

Q3: Which block should be replaced on a miss?

• Easy for direct mapped
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)
  – First in, first out (FIFO)

Data-cache misses per 1000 instructions by associativity, replacement policy, and cache size:

             2-way                   4-way                   8-way
  Size     LRU    Ran.   FIFO      LRU    Ran.   FIFO      LRU    Ran.   FIFO
  16 KB    114.1  117.3  115.5     111.7  115.1  113.3     109.0  111.8  110.4
  64 KB    103.4  104.3  103.9     102.4  102.3  103.1     99.7   100.5  100.3
  256 KB   92.2   92.1   92.5      92.1   92.1   92.5      92.1   92.1   92.5

Page 12:

Q4: What Happens on a Write?

                       Write-Through                       Write-Back
  Policy               Data written to the cache block     1. Write data only to the cache.
                       is also written to lower-level      2. Update the lower level when the
                       memory.                                block falls out of the cache.
  Debug                Easy                                Hard
  Do read misses
  produce writes?      No                                  Yes
  Do repeated writes
  make it to the
  lower level?         Yes                                 No

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").

Page 13:

Write Buffers for Write-Through Caches

• Q. Why a write buffer?
  – A. So the CPU doesn't stall.
• Q. Why a buffer, why not just one register?
  – A. Bursts of writes are common.
• Q. Are Read After Write (RAW) hazards an issue for the write buffer?
  – A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

[Figure: Processor <-> Cache, with a Write Buffer between the cache and DRAM.]

Page 14:

Write-Miss Policy

• Two options on a write miss:
  – Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
    • Write misses act like read misses.
  – No-write allocate: write misses do not affect the cache. The block is modified only in the lower-level memory.
    • Blocks stay out of the cache under no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache.

Page 15:

Write-Miss Policy Example

• Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):

  Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].

  What are the numbers of hits and misses (reads and writes inclusive) when using no-write allocate versus write allocate?

• Answer:

  No-write allocate:               Write allocate:
  Write Mem[100]; 1 write miss     Write Mem[100]; 1 write miss
  Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
  Read Mem[200];  1 read miss      Read Mem[200];  1 read miss
  Write Mem[200]; 1 write hit      Write Mem[200]; 1 write hit
  Write Mem[100]; 1 write miss     Write Mem[100]; 1 write hit
  4 misses; 1 hit                  2 misses; 3 hits
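The example can be replayed with a tiny simulator (a sketch with our own names; it models only whether a block is resident, which is all the hit/miss count needs):

```c
#include <assert.h>

#define MAX_BLOCKS 16

/* Fully associative cache tracked as a set of resident block addresses. */
struct fa_cache { long blocks[MAX_BLOCKS]; int n; };

static int resident(struct fa_cache *c, long addr) {
    for (int i = 0; i < c->n; i++)
        if (c->blocks[i] == addr) return 1;
    return 0;
}

static void fill(struct fa_cache *c, long addr) {
    if (!resident(c, addr)) c->blocks[c->n++] = addr;
}

/* Replay accesses ('R' = read, 'W' = write); returns the hit count.
 * Reads always allocate; writes allocate only if write_allocate is set. */
static int replay(const char *ops, const long *addrs, int n,
                  int write_allocate, struct fa_cache *c) {
    int hits = 0;
    for (int i = 0; i < n; i++) {
        if (resident(c, addrs[i])) hits++;
        else if (ops[i] == 'R' || write_allocate) fill(c, addrs[i]);
    }
    return hits;
}
```

Replaying W[100], W[100], R[200], W[200], W[100] gives 1 hit (4 misses) under no-write allocate and 3 hits (2 misses) under write allocate, matching the slide.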

Page 16:

What Happens on a Cache Miss in a Pipeline?

• For an in-order pipeline, 2 options:
  – Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000)

      IF ID EX Mem stall stall stall ... stall Mem Wr
         IF ID EX  stall stall stall ... stall stall Ex Wr

  – Release the load from the pipeline
    • MSHR = "Miss Status/Handler Registers" (Kroft): each entry in this queue keeps track of the status of an outstanding memory request to one complete memory line.
      – Per cache line: keep info about the memory address.
      – For each word: the register (if any) that is waiting for the result.
      – Used to "merge" multiple requests to one memory line.
    • A new load creates an MSHR entry and sets the destination register to "Empty". The load is "released" from the pipeline.
    • An attempt to use the register before the result returns causes the instruction to block in the decode stage.

Page 17:

Cache Performance

Page 18:

Average Memory Access Time (AMAT)

Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty
                                  = Thit(L1) + Miss%(L1) × T(Mem)

• Miss penalty: time to fetch a block from the lower memory level
  – Access time: a function of latency
  – Transfer time: a function of the bandwidth between levels
    • Transfer one "cache block/line" at a time
• Example:
  – Cache hit = 1 cycle
  – Miss rate = 10%, miss penalty = 300 cycles
  – AMAT = 1 + 0.1 × 300 = 31 cycles

Page 19:

Miss Rate: Misses per Memory Reference

Miss Rate = #Misses / #Memory references

• Memory accesses include both instruction accesses and data accesses
  – Each instruction needs to be read from memory: one I-mem access
  – LW/SW are memory-access instructions: one D-mem access (in addition to the one I-mem access)
  – Count the # of memory accesses: each iteration, 14 instructions in total, 9 of them ld/lw/sw → 14 + 9, i.e. 9×2 + 5 = 23 memory accesses
    • 9 are data-mem accesses

    int sum(int N, int a, int *X) {
        int i;
        int result = 0;
        for (i = 0; i < N; ++i)
            result += X[i];
        return result;
    }

RISC-V assembly for the loop:

    .L3: lw   a5,-20(s0)   /* a5 = i */
         sll  a5,a5,2      /* a5 = i<<2, which is i*4 */
         ld   a4,-48(s0)   /* a4 = X */
         add  a5,a4,a5     /* &X[i] */
         lw   a5,0(a5)     /* X[i] */
         lw   a4,-24(s0)   /* load result */
         addw a5,a4,a5     /* result += X[i] */
         sw   a5,-24(s0)   /* store to result */
         lw   a5,-20(s0)   /* i */
         addw a5,a5,1      /* i++ */
         sw   a5,-20(s0)   /* store i */
    .L2: lw   a4,-20(s0)   /* i */
         lw   a5,-36(s0)   /* N */
         blt  a4,a5,.L3    /* if (i < N) goto .L3 */

Page 20:

Misses per Instruction (Textbook B-15 and B-16)

Misses in the instruction cache are very low. Why? Because instruction access has good locality!

Page 21:

Miss Rate ↔ Misses per Instruction (Textbook B-15 and B-16)

• Memory accesses per instruction:
  – Each instruction needs to be read from memory: one I-mem access
  – LW/SW instructions: one D-mem access (in addition to the one I-mem access)
    • E.g., if 36% of instructions are loads/stores:
  – Memory accesses per instruction = 1 + 0.36 = 1.36

Page 22:

Miss Rate ↔ Misses per Instruction (Textbook B-15 and B-16)

NOTE: very low instruction miss rate

Page 23:

Miss Rate for Programmers

• Because the instruction miss rate is very low → only count data-memory accesses
  – 1 data-mem access per iteration
    • Scalar variables (i, result) are all in registers
    • Only X[i] is an LW (an rvalue)

    int sum(int N, int a, int *X) {
        int i;
        int result = 0;
        for (i = 0; i < N; ++i)
            result += X[i];
        return result;
    }

Counting data-mem accesses only requires looking at the source code and counting the array references!

Page 24:

AMAT for a Multi-Level Cache

• Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty

Miss Penalty(L3) = T(Mem)
Miss Penalty(L2) = AMAT(L3) = Thit(L3) + Miss%(L3) × Miss Penalty(L3)
Miss Penalty(L1) = AMAT(L2) = Thit(L2) + Miss%(L2) × Miss Penalty(L2)

AMAT = Thit(L1) + Miss%(L1) × Miss Penalty(L1)
     = Thit(L1) + Miss%(L1) × {Thit(L2) + Miss%(L2) × Miss Penalty(L2)}
     = Thit(L1) + Miss%(L1) × {Thit(L2) + Miss%(L2) × [Thit(L3) + Miss%(L3) × T(Mem)]}

Page 25:

AMAT Example for a Multi-Level Cache

AMAT = Thit(L1) + Miss%(L1) × {Thit(L2) + Miss%(L2) × [Thit(L3) + Miss%(L3) × T(Mem)]}

  – Miss rate L1 = 10%, Thit(L1) = 1 cycle
  – Miss rate L2 = 5%, Thit(L2) = 10 cycles
  – Miss rate L3 = 1%, Thit(L3) = 20 cycles
  – T(Mem) = 300 cycles

• AMAT = 1 + 0.1 × {10 + 0.05 × [20 + 0.01 × 300]} = 2.115 cycles
  – compared to 31 cycles with a single-level L1 cache
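The recursion above folds into a short helper (a minimal sketch with our own names; the cycle counts are the slide's):

```c
#include <assert.h>
#include <math.h>

/* AMAT for a multi-level hierarchy: amat(level i) =
 * hit_time[i] + miss_rate[i] * amat(level i+1), bottoming out at t_mem.
 * Evaluated innermost-first, from the last level up to L1. */
static double amat(const double *hit_time, const double *miss_rate,
                   int levels, double t_mem) {
    double penalty = t_mem;
    for (int i = levels - 1; i >= 0; i--)
        penalty = hit_time[i] + miss_rate[i] * penalty;
    return penalty;
}
```

With the slide's numbers this yields 2.115 cycles for three levels, and 31 cycles when only the L1 is kept.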

Page 26:

CPU Performance Revisited

• CPU performance factors
  – Instruction count
    • Determined by ISA and compiler
  – CPI and cycle time
    • Determined by CPU hardware

CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)

Page 27:

CPU Performance with the Memory Factor

• Components of CPU time
  – Program execution cycles
    • Includes cache hit time
  – Memory stall cycles
    • Mainly from cache misses
• Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Page 28:

Memory Stall Cycles per Instruction

CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Page 29:

Memory Stall Cycles

• Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.
• Depends on both the number of misses and the cost per miss, i.e. the miss penalty:

  Memory stall cycles = Number of misses × Miss penalty
                      = IC × (Misses / Instruction) × Miss penalty
                      = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

† The advantage of the last form is that its components can be easily measured.

Page 30:

Cache Performance

• Given
  – I-cache miss rate = 2%
  – D-cache miss rate = 4%
  – Miss penalty = 100 cycles
  – Base CPI (ideal cache) = 2
  – Loads & stores are 36% of instructions
• Miss cycles per instruction
  – I-cache: 0.02 × 100 = 2
  – D-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 2 + 2 + 1.44 = 5.44
  – The ideal CPU is 5.44 / 2 = 2.72 times faster
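This CPI calculation generalizes to a one-liner (a sketch; the parameter names are ours):

```c
#include <assert.h>
#include <math.h>

/* Effective CPI = base CPI + I-cache stall cycles + D-cache stall cycles,
 * where each stall term is accesses/instruction x miss rate x penalty.
 * Instruction fetches are 1 access/instruction; data accesses are ls_frac. */
static double effective_cpi(double base_cpi, double i_miss, double d_miss,
                            double ls_frac, double penalty) {
    return base_cpi + 1.0 * i_miss * penalty + ls_frac * d_miss * penalty;
}
```

With the slide's numbers, the effective CPI is 2 + 2 + 1.44 = 5.44, so the ideal-cache machine is 2.72 times faster.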

Page 31:

Example (Page B-18 of the CAQA Book) - 1/2

Page 32:

Example (Page B-18 of the CAQA Book) - 2/2

• CPI for a perfect cache (miss rate 0): 1.00
• CPI for a 2% miss rate cache: 7
• CPI with no cache: 1.0 + 200 × 1.5 = 301
  – a factor of more than 40 times longer than a system with a cache!

Page 33:

Three Important Equations for Cache Performance

Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty

Memory stall cycles = Number of misses × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty
                    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

Page 34:

Summary of Performance Equations

Page 35:

Another Example

• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

• Answer:

1. Compute the performance for the computer that always hits:

   CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
                      = (IC × CPI + 0) × Clock cycle
                      = IC × 1.0 × Clock cycle

2. For the computer with the real cache, compute the memory stall cycles (1 instruction memory access + 0.5 data memory accesses per instruction):

   Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
                       = IC × (1 + 0.5) × 0.02 × 25
                       = IC × 0.75

3. Compute the total performance:

   CPU execution time (cache) = (IC × 1.0 + IC × 0.75) × Clock cycle time
                              = 1.75 × IC × Clock cycle time

4. Compute the performance ratio, which is the inverse of the execution times:

   CPU execution time (cache) / CPU execution time = (1.75 × IC × Clock cycle) / (1.0 × IC × Clock cycle) = 1.75

The computer with no cache misses is 1.75 times faster.
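The four steps reduce to a single ratio (a sketch; the names are ours, and IC and the clock cycle cancel out as in step 4):

```c
#include <assert.h>
#include <math.h>

/* Slowdown of the real cache versus an always-hit machine:
 * (base CPI + stall cycles per instruction) / base CPI. */
static double slowdown(double cpi_base, double accesses_per_instr,
                       double miss_rate, double penalty) {
    double stalls = accesses_per_instr * miss_rate * penalty;
    return (cpi_base + stalls) / cpi_base;
}
```

With 1.5 memory accesses per instruction, a 2% miss rate, and a 25-cycle penalty, the stall term is 0.75 cycles per instruction and the ratio is 1.75.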

Page 36:

Performance Summary

• When CPU performance increases
  – The miss penalty becomes more significant
• Decreasing the base CPI
  – A greater proportion of time is spent on memory stalls
• Increasing the clock rate
  – Memory stalls account for more CPU cycles
• Can't neglect cache behavior when evaluating system performance

Page 37:

Midterm

• Oct 15, Monday, 8:05 - 9:20
• Closed book and closed notes
• 1-page (one side of the paper) A4/Letter-size hand-written note allowed
• Some sample questions:
  – https://passlab.github.io/CSCE513/Midterm_Fall2016_formatted_WithSolutions.pdf

Page 38:

Additional

Page 39:

Multilevel Caches

• Primary cache attached to the CPU
  – Small, but fast
• Level-2 cache services misses from the primary cache
  – Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses
• Some high-end systems include an L-3 cache

Page 40:

Multilevel Cache Considerations

• Primary cache
  – Focus on minimal hit time
• L-2 cache
  – Focus on a low miss rate to avoid main memory access
  – Hit time has less overall impact
• Results
  – L-1 cache usually smaller than a single cache
  – L-1 block size smaller than L-2 block size

Page 41:

Multilevel Cache Example

• Given
  – CPU base CPI = 1, clock rate = 4 GHz
  – Miss rate/instruction = 2%
  – Main memory access time = 100 ns
• With just a primary cache
  – Miss penalty = 100 ns / 0.25 ns = 400 cycles
  – Effective CPI = 1 + 0.02 × 400 = 9

Page 42:

Example (cont.)

• Now add an L-2 cache
  – Access time = 5 ns
  – Global miss rate to main memory = 0.5%
• Primary miss with L-2 hit
  – Penalty = 5 ns / 0.25 ns = 20 cycles
• Primary miss with L-2 miss
  – Extra penalty = 400 cycles
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9 / 3.4 = 2.6
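The two CPI calculations on these slides can be checked together (a sketch; the function names are ours):

```c
#include <assert.h>
#include <math.h>

/* Effective CPI without an L2: every L1 miss pays the main-memory access. */
static double cpi_l1_only(double base, double l1_miss, double mem_cycles) {
    return base + l1_miss * mem_cycles;
}

/* Effective CPI with an L2: every L1 miss pays the L2 access, and the
 * (global) L2 misses additionally pay the main-memory access. */
static double cpi_with_l2(double base, double l1_miss, double l2_cycles,
                          double global_miss, double mem_cycles) {
    return base + l1_miss * l2_cycles + global_miss * mem_cycles;
}
```

With base CPI 1, a 2% L1 miss rate, a 20-cycle L2 access, a 0.5% global miss rate, and a 400-cycle memory access: CPI is 9 without the L2 and 3.4 with it, a 2.6x speedup.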

Page 43:

Memory Hierarchy Performance

• Two indirect performance measures have waylaid many a computer designer:
  – Instruction count is independent of the hardware;
  – Miss rate is independent of the hardware.
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):

  AMAT = Hit time + Miss rate × Miss penalty

  – CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  – AMAT = 1 + 0.05 × 20 = 2 ns
    • 2 cycles per instruction

Page 44:

Example (B-16): Separate vs. Unified Cache

• Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data cache, or a 32 KB unified cache? Use the miss rates in Figure B.6 to help calculate the correct answer, assuming 36% of the instructions are data transfer instructions. Assume a hit takes 1 clock cycle and the miss penalty is 200 clock cycles. A load or store hit takes 1 extra clock cycle on a unified cache if there is only one cache port to satisfy two simultaneous requests; using the pipelining terminology of Chapter 2, the unified cache leads to a structural hazard. What is the average memory access time in each case? Assume write-through caches with a write buffer and ignore stalls due to the write buffer.

• Answer: First let's convert misses per 1000 instructions into miss rates. Solving the general formula from above, the miss rate is

  Miss rate = (Misses per 1000 instructions / 1000) / (Memory accesses / Instruction)

Since every instruction access has exactly one memory access to fetch the instruction, the instruction miss rate is

  Miss rate (16 KB instruction) = (3.82 / 1000) / 1.00 = 0.004

Page 45:

Example (B-16)

Since 36% of the instructions are data transfers, the data miss rate is

  Miss rate (16 KB data) = (40.9 / 1000) / 0.36 = 0.114

The unified miss rate needs to account for both instruction and data accesses:

  Miss rate (32 KB unified) = (43.3 / 1000) / (1.00 + 0.36) = 0.0318

As stated above, about 74% of the memory accesses are instruction references. Thus, the overall miss rate for the split caches is

  (74% × 0.004) + (26% × 0.114) = 0.0326

Thus, a 32 KB unified cache has a slightly lower effective miss rate than two 16 KB caches.

The average memory access time formula can be divided into instruction and data accesses:

  Average memory access time
    = % instructions × (Hit time + Instruction miss rate × Miss penalty)
    + % data × (Hit time + Data miss rate × Miss penalty)

Therefore, the time for each organization is

  AMAT (split)   = 74% × (1 + 0.004 × 200) + 26% × (1 + 0.114 × 200) = 7.52
  AMAT (unified) = 74% × (1 + 0.0318 × 200) + 26% × (1 + 1 + 0.0318 × 200) = 7.62
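The whole comparison fits in a few lines (a sketch with our own names; the misses-per-1000-instructions figures are the textbook values quoted on the slide):

```c
#include <assert.h>
#include <math.h>

/* Convert misses per 1000 instructions to a miss rate, given the number
 * of memory accesses per instruction seen by that cache. */
static double miss_rate(double misses_per_1000, double accesses_per_instr) {
    return misses_per_1000 / 1000.0 / accesses_per_instr;
}

/* AMAT split over instruction and data references. extra_hit is the
 * 1-cycle structural-hazard penalty a unified cache charges data accesses. */
static double amat_mix(double instr_frac, double i_rate, double d_rate,
                       double penalty, double extra_hit) {
    double data_frac = 1.0 - instr_frac;
    return instr_frac * (1.0 + i_rate * penalty)
         + data_frac * (1.0 + extra_hit + d_rate * penalty);
}
```

miss_rate(3.82, 1.0) ≈ 0.004 and miss_rate(40.9, 0.36) ≈ 0.114; using the slide's rounded rates, the split AMAT comes out to 7.52 and the unified AMAT to 7.62 cycles.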

Page 46:

Introduction to Cache Organization

Page 47:

Cache Memory

• Cache memory
  – The level of the memory hierarchy closest to the CPU
• Given accesses X1, ..., Xn-1, Xn:
  – How do we know if the data is present?
  – Where do we look?

Page 48:

Direct Mapped Cache

• Location determined by address
• Direct mapped: only one choice
  – (Block address) modulo (#Blocks in cache)
  – #Blocks is a power of 2
  – Use the low-order address bits as the index to the entry

[Figure: a 5-bit address space for 32 bytes total. This is simplified; a cache line normally contains more bytes, e.g. 64 or 128 bytes.]

Page 49:

Tags and Valid Bits

• Which particular block is stored in a cache location?
  – Store the block address as well as the data
  – Actually, only the high-order bits are needed
  – Called the tag: the high-order bits
• What if there is no data in a location?
  – Valid bit: 1 = present, 0 = not present
  – Initially 0

Page 50:

Cache Example

• 8 blocks, 1 word/block, direct mapped
• Initial state:

  Index | V | Tag | Data
  000   | N |     |
  001   | N |     |
  010   | N |     |
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | N |     |
  111   | N |     |

Page 51:

Cache Example

  Index | V | Tag | Data
  000   | N |     |
  001   | N |     |
  010   | N |     |
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

  Word addr | Binary addr | Hit/miss | Cache block
  22        | 10 110      | Miss     | 110

Page 52:

Cache Example

  Index | V | Tag | Data
  000   | N |     |
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

  Word addr | Binary addr | Hit/miss | Cache block
  26        | 11 010      | Miss     | 010

Page 53:

Cache Example

  Index | V | Tag | Data
  000   | N |     |
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

  Word addr | Binary addr | Hit/miss | Cache block
  22        | 10 110      | Hit      | 110
  26        | 11 010      | Hit      | 010

Page 54:

Cache Example

  Index | V | Tag | Data
  000   | Y | 10  | Mem[10000]
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | Y | 00  | Mem[00011]
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

  Word addr | Binary addr | Hit/miss | Cache block
  16        | 10 000      | Miss     | 000
  3         | 00 011      | Miss     | 011
  16        | 10 000      | Hit      | 000

Page 55:

Cache Example

  Index | V | Tag | Data
  000   | Y | 10  | Mem[10000]
  001   | N |     |
  010   | Y | 10  | Mem[10010]
  011   | Y | 00  | Mem[00011]
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

  Word addr | Binary addr | Hit/miss | Cache block
  18        | 10 010      | Miss     | 010
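The whole access sequence from Pages 51-55 can be replayed with a toy direct-mapped cache (a sketch with our own names; 8 one-word blocks, as on the slides):

```c
#include <assert.h>

#define NBLOCKS 8

/* Toy direct-mapped cache: a valid bit and tag per block. */
struct dm_cache { int valid[NBLOCKS]; int tag[NBLOCKS]; };

/* Returns 1 on a hit, 0 on a miss (after filling/replacing the entry). */
static int access_cache(struct dm_cache *c, int block_addr) {
    int index = block_addr % NBLOCKS;   /* low-order bits select entry */
    int tag = block_addr / NBLOCKS;     /* high-order bits are the tag */
    if (c->valid[index] && c->tag[index] == tag)
        return 1;                       /* hit */
    c->valid[index] = 1;                /* miss: allocate / replace */
    c->tag[index] = tag;
    return 0;
}
```

Replaying 22, 26, 22, 26, 16, 3, 16, 18 gives miss, miss, hit, hit, miss, miss, hit, miss; the final access to 18 replaces block 26 in entry 010, exactly as on Page 55.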

Page 56:

Address Subdivision

• 4 bytes of data per cache entry: the block or line size
  – Cache line
• Why do we keep multiple bytes in one cache line?
  – Spatial locality

Page 57:

Example: Larger Block Size

• 64 blocks, 16 bytes/block
  – To what block number does address 1200 (decimal) map? Block 11.
  – All addresses between 1200 and 1215 map to the same block.

  Address layout (bits 31..0): Tag (22 bits, 31:10) | Index (6 bits, 9:4) | Offset (4 bits, 3:0)

  1200 = 0x4B0 = 0000 0000 0000 0000 0000 0100 1011 0000

  Block address = 1200 / 16 = 75; 75 modulo 64 = 11.
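The mapping can be computed directly (a minimal sketch for the 64-block, 16-byte-block cache above):

```c
#include <assert.h>

/* Cache block index for a byte address in a direct-mapped cache with
 * 64 blocks of 16 bytes: (address / block size) mod #blocks. */
static int block_index(unsigned addr) {
    unsigned block_addr = addr / 16;   /* drop the 4 offset bits */
    return (int)(block_addr % 64);     /* keep the 6 index bits  */
}
```

block_index(1200) through block_index(1215) are all 11; block_index(1216) moves on to the next block, 12.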

Page 58:

Block Size Considerations

• Larger blocks should reduce the miss rate
  – Due to spatial locality
• But in a fixed-sized cache:
  – Larger blocks ⇒ fewer of them
    • More competition ⇒ increased miss rate
  – Larger blocks ⇒ pollution
• Larger miss penalty
  – Can override the benefit of a reduced miss rate
  – Early restart and critical-word-first can help

Page 59:

Cache Misses

• On a cache hit, the CPU proceeds normally
  – The IF or MEM stage accesses instruction or data memory
  – 1 cycle
• On a cache miss: → 10x or 100x cycles
  – Stall the CPU pipeline
  – Fetch the block from the next level of the hierarchy
• Instruction cache miss
  – Restart the instruction fetch
• Data cache miss
  – Complete the data access

  100: LW X1, 100(X2)

Page 60: Lecture 11: Memory Hierarchy - GitHub PagesTerminology for Memory Hierarchy Performance •Hit: Data appears in some block in upper level (example: Block X) – Hit Rate: the fraction

Write-Through

• On data-write hit, could just update the block in cache
  – But then cache and memory would be inconsistent
• Write through: also update memory
  – But makes writes take longer
  – e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    • Effective CPI = 1 + 0.1 × 100 = 11
• Solution: write buffer
  – Holds data waiting to be written to memory
  – CPU continues immediately
  – Only stalls on a write if the write buffer is already full; can write multiple bytes at a time

200: SW X1, 100(X2)


Page 61

Write-Back

• Alternative: on data-write hit, just update the block in cache
  – Keep track of whether each block is dirty
• When a dirty block is replaced
  – Write it back to memory
  – Can use a write buffer to allow the replacing block to be read first


Page 62

Write Allocation

• What should happen on a write miss?
• Alternatives for write-through
  – Allocate on miss: fetch the block
  – Write around: don't fetch the block
    • Since programs often write a whole block before reading it (e.g., initialization)
• For write-back
  – Usually fetch the block


Page 63

Example: Intrinsity FastMATH

• Embedded MIPS processor
  – 12-stage pipeline
  – Instruction and data access on each cycle
• Split cache: separate I-cache and D-cache
  – Each 16KB: 256 blocks × 16 words/block
  – D-cache: write-through or write-back
• SPEC2000 miss rates
  – I-cache: 0.4%
  – D-cache: 11.4%
  – Weighted average: 3.2%


Page 64

Example: Intrinsity FastMATH


Page 65

Main Memory Supporting Caches

• Use DRAMs for main memory
  – Fixed width (e.g., 1 word)
  – Connected by fixed-width clocked bus
    • Bus clock is typically slower than CPU clock
• Example cache block read
  – 1 bus cycle for address transfer
  – 15 bus cycles per DRAM access
  – 1 bus cycle per data transfer
• For 4-word block, 1-word-wide DRAM
  – Miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles
  – Bandwidth = 16 bytes / 65 cycles ≈ 0.25 B/cycle


Page 66

Associative Caches

• Fully associative
  – Allow a given block to go in any cache entry
  – Requires all entries to be searched at once
  – Comparator per entry (expensive)
• n-way set associative
  – Each set contains n entries
  – Block number determines which set
    • (Block number) modulo (#Sets in cache)
  – Search all entries in a given set at once
  – n comparators (less expensive)


Page 67

Associative Cache Example


Page 68

Spectrum of Associativity

• For a cache with 8 entries


Page 69

Associativity Example

• Compare 4-block caches
  – Direct mapped, 2-way set associative, fully associative
  – Block access sequence: 0, 8, 0, 6, 8
• Direct mapped

  Block    Cache  Hit/   Cache content after access
  address  index  miss   Block 0  Block 1  Block 2  Block 3
  0        0      miss   Mem[0]
  8        0      miss   Mem[8]
  0        0      miss   Mem[0]
  6        2      miss   Mem[0]            Mem[6]
  8        0      miss   Mem[8]            Mem[6]


Page 70

Associativity Example

• 2-way set associative

  Block    Cache  Hit/   Cache content after access
  address  index  miss   Set 0             Set 1
  0        0      miss   Mem[0]
  8        0      miss   Mem[0]  Mem[8]
  0        0      hit    Mem[0]  Mem[8]
  6        0      miss   Mem[0]  Mem[6]
  8        0      miss   Mem[8]  Mem[6]

• Fully associative

  Block    Hit/   Cache content after access
  address  miss
  0        miss   Mem[0]
  8        miss   Mem[0]  Mem[8]
  0        hit    Mem[0]  Mem[8]
  6        miss   Mem[0]  Mem[8]  Mem[6]
  8        hit    Mem[0]  Mem[8]  Mem[6]


Page 71

How Much Associativity

• Increased associativity decreases miss rate
  – But with diminishing returns
• Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  – 1-way: 10.3%
  – 2-way: 8.6%
  – 4-way: 8.3%
  – 8-way: 8.1%


Page 72

Set Associative Cache Organization


Page 73

Replacement Policy

• Direct mapped: no choice
• Set associative
  – Prefer non-valid entry, if there is one
  – Otherwise, choose among entries in the set
• Least-recently used (LRU)
  – Choose the one unused for the longest time
    • Simple for 2-way, manageable for 4-way, too hard beyond that
• Random
  – Gives approximately the same performance as LRU for high associativity
