Lecture 17: Memory Hierarchy and Cache Coherence
Concurrent and Multicore Programming
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan
Parallelism in Hardware
• Instruction-Level Parallelism
– Pipelining
– Out-of-order execution
– Superscalar
• Thread-Level Parallelism
– Chip multithreading, multicore
– Coarse-grained and fine-grained multithreading
– SMT (simultaneous multithreading)
• Data-Level Parallelism
– SIMD/Vector
– GPU/SIMT
Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson
Topics (Part 2)
• Parallel architectures and hardware
– Parallel computer architectures
– Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
– GPU architectures
– CUDA programming
– Introduction to the offloading model in OpenMP and OpenACC
• Programming on large scale systems (Chapter 6)
– MPI (point to point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Parallel algorithms (Chapters 8, 9 & 10)
– Dense matrix and sorting
Outline
• Memory, locality of reference, and caching
• Cache coherence in shared memory systems
Memory until now…
• We've relied on a very simple model of memory for most of this class
– Main memory is a linear array of bytes that can be accessed given a memory address
– We also used registers to store values
• Reality is more complex: there is an entire memory system.
– Different memories exist at different levels of the computer
– Each varies in speed, size, and cost
Random-Access Memory (RAM)
• Key features
– RAM is packaged as a chip.
– The basic storage unit is a cell (one bit per cell).
– Multiple RAM chips form a memory.
• Static RAM (SRAM)
– Each cell stores a bit with a six-transistor circuit.
– Retains its value indefinitely, as long as it is kept powered.
– Relatively insensitive to disturbances such as electrical noise.
– Faster and more expensive than DRAM.
Random-Access Memory (RAM)
• Dynamic RAM (DRAM)
– Each cell stores a bit with a capacitor and a transistor.
– The value must be refreshed every 10-100 ms.
– Sensitive to disturbances.
– Slower and cheaper than SRAM.
Memory Modules… real-life DRAM
• In reality, several DRAM chips are bundled into memory modules
• SIMM - Single Inline Memory Module
• DIMM - Dual Inline Memory Module
• DDR - Double Data Rate
– Reads twice every clock cycle
• Quad Pump: simultaneous R/W
Source for pictures: http://en.kioskea.net/contents/pc/ram.php3
SDR, DDR, Quad Pump
[Figure: timing diagrams comparing single data rate, double data rate, and quad-pumped transfers.]
Memory Speeds
• Processor speeds: a 1 GHz processor has a 1 nsec cycle time.
• Memory speeds: roughly 50 nsec per access
• Access speed gap: instructions that store to or load from memory stall while waiting; a 50 ns access costs about 50 processor cycles at 1 GHz.

DIMM Module   Chip Type   Clock Speed (MHz)   Bus Speed (MHz)   Transfer Rate (MB/s)
PC1600        DDR200      100                 200               1600
PC2100        DDR266      133                 266               2133
PC2400        DDR300      150                 300               2400
Memory Hierarchy (Review)
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom:
• L0: registers — CPU registers hold words retrieved from the L1 cache.
• L1: on-chip L1 cache (SRAM) — holds cache lines retrieved from the L2 cache.
• L2: off-chip L2 cache (SRAM) — holds cache lines retrieved from main memory.
• L3: main memory (DRAM) — holds disk blocks retrieved from local disks.
• L4: local secondary storage (local disks) — holds files retrieved from disks on remote network servers.
• L5: remote secondary storage (distributed file systems, Web servers)
Cache Memories (SRAM)
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
– They hold frequently accessed blocks of main memory.
• The CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
[Figure: on the processor chip, the register file and ALU connect to the L1 cache; a cache bus links to the off-chip L2 cache; the bus interface attaches to the system bus, which reaches main memory and I/O through the I/O bridge and memory bus.]
How to Exploit the Memory Hierarchy
• Availability of memory
– Cost, size, speed
• Principle of locality
– Memory references are bunched together
– A small portion of the address space is accessed at any given time
• Keep this space in high speed memory
– Problem: not all of it may fit
Types of locality
• Temporal locality
– Tendency to access locations recently referenced
• Spatial locality
– Tendency to reference locations around those recently referenced
– After location x, accesses to x-k or x+k are likely
[Figure: repeated references to location X clustered along a time axis t.]
Sources of locality
• Temporal locality
– Code within a loop
– The same instructions are fetched repeatedly
• Spatial locality
– Data arrays
– Local variables on the stack
– Data allocated in chunks (contiguous bytes)

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * a;
}
What does locality buy?
• It addresses the gap between CPU speed and RAM speed
• Spatial and temporal locality imply that a subset of instructions and data can fit in high speed memory from time to time
• The CPU can access instructions and data from this high speed memory
• A small high speed memory can make the computer faster and cheaper
• Speed of 1-20 nsec at a cost of $50 to $100 per Mbyte
• This is caching!!
Inserting an L1 Cache Between the CPU and Main Memory
• The tiny, very fast CPU register file has room for four 4-byte words. The transfer unit between the CPU register file and the cache is a 4-byte word.
• The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1). The transfer unit between the cache and main memory is a 4-word block (16 bytes).
• The big, slow main memory has room for many 4-word blocks (e.g., block 10 holding a b c d, block 21 holding p q r s, block 30 holding w x y z).
What info does a cache need?
• Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
• You essentially allow a smaller region of memory to hold data from a larger region. It is not a 1-1 mapping.
• What kind of information do we need to keep?
– The actual data
– Where the data actually comes from (the tag)
– Whether the data is even considered valid (the valid bit)
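Taken together, a cache line can be modeled as a data block plus its bookkeeping. A minimal sketch in C (the 32-byte block size and the field widths are illustrative assumptions, not fixed by the slides):

#include <stdint.h>

#define BLOCK_SIZE 32   /* bytes per cache block (illustrative) */

/* One cache line: a valid bit, a tag recording where the block
 * came from in memory, and the cached data itself. */
struct cache_line {
    uint8_t  valid;              /* is this line holding real data?    */
    uint32_t tag;                /* high-order address bits of the block */
    uint8_t  data[BLOCK_SIZE];   /* the actual cached bytes            */
};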
Cache Organization
• Map each region of memory to a smaller region of cache
• Discard address bits:
– (a) discard lower order bits, or
– (b) discard higher order bits
• Suppose the cache address size is 4 bits and the memory address size is 8 bits
• In case (a): 0000xxxx is mapped to 0000 in the cache
• In case (b): xxxx0001 is mapped to 0001 in the cache
[Figure: memory regions mapped onto the smaller cache under schemes (a) and (b).]
Finding data in cache
• Part of the memory address is applied to the cache as an index
• The remaining bits are stored as a tag in the cache
• Lower order bits are discarded
• Example: need to check whether address 00010011 is cached
– Cache index is 0001
– Tag is 0011
• If the tag matches, it is a hit: use the data
• If there is no match, it is a miss: fetch the data from memory
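As a concrete sketch of the tag / set index / block offset split that the next slides formalize, here is how a lookup could decompose a 32-bit address in C. The 5/11 bit widths are assumptions that happen to match the 64 KB direct-mapped example later in the lecture:

#include <stdint.h>

#define B_BITS 5    /* b offset bits: 32-byte blocks (assumption) */
#define S_BITS 11   /* s index bits: 2048 sets (assumption)       */

/* <tag> <set index> <block offset>, most to least significant. */
static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << B_BITS) - 1);
}
static inline uint32_t set_index(uint32_t addr) {
    return (addr >> B_BITS) & ((1u << S_BITS) - 1);
}
static inline uint32_t tag_bits(uint32_t addr) {
    return addr >> (B_BITS + S_BITS);
}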
General Organization of a Cache Memory
• A cache is an array of S = 2^s sets.
• Each set contains E lines (one or more).
• Each line holds a block of B = 2^b data bytes, plus t tag bits per line and 1 valid bit per line.
• Cache size: C = B x E x S data bytes (tag and valid bits are overhead and do not count toward C).
[Figure: set 0 through set S-1, each with E lines; each line is <valid bit> <tag> <bytes 0 1 … B-1>.]
Addressing Caches
• An m-bit address A is divided into <tag> (t bits), <set index> (s bits), and <block offset> (b bits), from most to least significant.
• The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
• The word contents begin at offset <block offset> bytes from the beginning of the block.
[Figure: the set index bits of address A select one set; each set holds lines of <valid bit> <tag> <block>.]
Direct-Mapped Cache
• The simplest kind of cache
• Characterized by exactly one line per set (E = 1 lines per set).
[Figure: set 0, set 1, …, set S-1, each holding a single <valid, tag, cache block> line.]
Accessing Direct-Mapped Caches
• Set selection
– Use the set index bits to determine the set of interest.
[Figure: the s set-index bits of the address select one set; the t tag bits and b block-offset bits are used in the following steps.]
Accessing Direct-Mapped Caches
• Line matching and word selection
– Line matching: find a valid line in the selected set with a matching tag
– Word selection: then extract the word
• A hit requires that (1) the valid bit is set and (2) the tag bits in the cache line match the tag bits in the address. If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: address with tag 0110, set index i, offset 100 matches a line with valid bit 1 and tag 0110; the offset selects word w1 out of w0 w1 w2 w3.]
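Putting set selection, line matching, and word selection together, a direct-mapped lookup can be sketched in C. This is a teaching model under the bit-width assumptions introduced earlier, not a hardware-accurate implementation:

#include <stdint.h>
#include <string.h>

#define B_BITS 5
#define S_BITS 11
#define NSETS  (1u << S_BITS)
#define BSIZE  (1u << B_BITS)

struct line { uint8_t valid; uint32_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS];   /* direct-mapped: one line per set */

/* Returns 1 on hit and copies the word out; returns 0 on a miss.
 * Assumes a word-aligned offset. */
int lookup(uint32_t addr, uint32_t *word)
{
    uint32_t set = (addr >> B_BITS) & (NSETS - 1);  /* set selection  */
    uint32_t tag = addr >> (B_BITS + S_BITS);       /* line matching  */
    uint32_t off = addr & (BSIZE - 1);              /* word selection */
    struct line *ln = &cache[set];
    if (ln->valid && ln->tag == tag) {              /* (1) and (2)    */
        memcpy(word, &ln->data[off], sizeof *word);
        return 1;                                   /* cache hit      */
    }
    return 0;  /* miss: the block must be fetched from memory */
}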
Example: Direct-mapped cache
• 32-bit address, 64 KB cache, 32-byte blocks
• How many sets, how many bits for the tag, and how many bits for the offset?
[Figure: set 0 through set n-1, each with a single <valid, tag, cache block> line.]
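A worked answer to the slide's question (my arithmetic, following the C = B x E x S formula from the earlier slide, with E = 1 for a direct-mapped cache):

b = \log_2 32 = 5 \text{ offset bits}
S = \frac{64\,\text{KB}}{32\,\text{B} \times 1} = 2048 = 2^{11} \;\Rightarrow\; s = 11 \text{ set-index bits}
t = 32 - 11 - 5 = 16 \text{ tag bits}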
Write-through vs write-back
• What to do when an update occurs?
• Write-through: write to memory immediately
– Simple to implement, synchronous writes
– Uniform latency on misses
• Write-back: write to memory only when the block is replaced
– Requires an additional dirty bit or modified bit
– Asynchronous writes
– Non-uniform miss latency:
• Clean miss: read from the lower level
• Dirty miss: write to the lower level, then read (fill)
Writes and Cache
• Reading information from a cache is straightforward. What about writing?
– What if you're writing data that is already cached (write-hit)?
– What if the data is not in the cache (write-miss)?
• Dealing with a write-hit:
– Write-through: immediately write the data back to memory
– Write-back: defer the write to memory for as long as possible
• Dealing with a write-miss:
– Write-allocate: load the block into the cache and update it there
– No-write-allocate: write directly to memory
• Benefits? Disadvantages?
• Write-through caches are typically no-write-allocate.
• Write-back caches are typically write-allocate. (A sketch of both pairings follows.)
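To make the two common policy pairings concrete, here is a hedged sketch of a store operation in C. The memory helpers are stand-in names I introduce for illustration; none of them come from the slides:

#include <stdint.h>
#include <string.h>

#define B_BITS 5
#define S_BITS 11
#define NSETS  (1u << S_BITS)
#define BSIZE  (1u << B_BITS)

struct wline { uint8_t valid, dirty; uint32_t tag; uint8_t data[BSIZE]; };
static struct wline wcache[NSETS];

/* Stand-ins for the next level of the hierarchy (assumed helpers). */
static void mem_write_word(uint32_t addr, uint32_t w) { (void)addr; (void)w; }
static void mem_write_block(uint32_t addr, const uint8_t *b) { (void)addr; (void)b; }
static void mem_read_block(uint32_t addr, uint8_t *b) { (void)addr; memset(b, 0, BSIZE); }

static uint32_t set_of(uint32_t a)  { return (a >> B_BITS) & (NSETS - 1); }
static uint32_t tag_of(uint32_t a)  { return a >> (B_BITS + S_BITS); }
static uint32_t base_of(uint32_t a) { return a & ~(uint32_t)(BSIZE - 1); }

/* Write-back + write-allocate (one typical pairing). */
void store_writeback(uint32_t addr, uint32_t word)
{
    struct wline *ln = &wcache[set_of(addr)];
    if (!(ln->valid && ln->tag == tag_of(addr))) {        /* write-miss  */
        if (ln->valid && ln->dirty)                       /* dirty miss: */
            mem_write_block((ln->tag << (B_BITS + S_BITS)) |
                            (set_of(addr) << B_BITS), ln->data);
        mem_read_block(base_of(addr), ln->data);          /* allocate    */
        ln->valid = 1; ln->dirty = 0; ln->tag = tag_of(addr);
    }
    memcpy(&ln->data[addr & (BSIZE - 1)], &word, sizeof word);
    ln->dirty = 1;    /* defer the memory write until eviction */
}

/* Write-through + no-write-allocate (the other typical pairing). */
void store_writethrough(uint32_t addr, uint32_t word)
{
    struct wline *ln = &wcache[set_of(addr)];
    if (ln->valid && ln->tag == tag_of(addr))             /* write-hit   */
        memcpy(&ln->data[addr & (BSIZE - 1)], &word, sizeof word);
    mem_write_word(addr, word);   /* memory is updated immediately */
}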
Multi-Level Caches
• Options: separate data and instruction caches, or a unified cache
• Typical parameters at each level (larger, slower, cheaper per byte as you move away from the processor):

              size:         speed:   $/Mbyte:    line size:
Regs          200 B         3 ns     -           8 B
L1 d-cache,
L1 i-cache    8-64 KB       3 ns     -           32 B
Unified L2    1-4 MB SRAM   6 ns     $100/MB     32 B
Memory        128 MB DRAM   60 ns    $1.50/MB    8 KB
disk          30 GB         8 ms     $0.05/MB    -
Cache Performance Metrics
• Miss Rate
– Fraction of memory references not found in the cache (misses / references)
– Typical numbers:
• 3-10% for L1
• can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
– Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
– Typical numbers:
• 1 clock cycle for L1
• 3-8 clock cycles for L2
• Miss Penalty
– Additional time required because of a miss
• Typically 25-100 cycles for main memory
These three metrics combine as shown below.
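The metrics combine into the standard average memory access time (AMAT) formula from the Hennessy and Patterson text cited in the references. For example, with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty (numbers chosen from the typical ranges above):

\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty} = 1 + 0.05 \times 50 = 3.5 \text{ cycles}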
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples (assume a cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25% (stride-1: one miss per 4-word block)

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100% (stride-N: every access touches a new block for large N)
Matrix Multiplication Example
• Major cache effects to consider
– Total cache size
• Exploit temporal locality and blocking
– Block size
• Exploit spatial locality
• Description:
– Multiply N x N matrices
– O(N^3) total operations
– Accesses:
• N reads per source element
• N values summed per destination
– but may be able to hold in register

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;   /* variable sum held in register */
    }
}
Miss Rate Analysis for Matrix Multiply
• Assume:
– Line size = 32 bytes (big enough for four 64-bit words)
– Matrix dimension (N) is very large
• Approximate 1/N as 0.0
– Cache is not even big enough to hold multiple rows
• Analysis method:
– Look at the access pattern of the inner loop
Layout of C Arrays in Memory (review)
• C arrays are allocated in row-major order
– each row in contiguous memory locations
• Stepping through columns in one row:
– for (i = 0; i < N; i++) sum += a[0][i];
– accesses successive elements
– if block size (B) > 4 bytes, exploit spatial locality
• compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:
– for (i = 0; i < n; i++) sum += a[i][0];
– accesses distant elements
– no spatial locality!
– compulsory miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)
Matrix Multiplication (jik)
Matrix Multiplication (kij)
Matrix Multiplication (ikj)
Matrix Multiplication (jki)
Matrix Multiplication (kji)
[Per-variant inner-loop access-pattern figures omitted; the loop nests and miss rates are summarized on the next slide.]
Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
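The matrix multiplication slide above notes that blocking exploits temporal locality within the total cache size. As a hedged illustration (not from the original slides), here is a tiled version in C; the tile size BS = 32 and the requirement that c start zero-initialized are my assumptions, to be tuned per platform:

#define BS 32   /* tile size: chosen so three BSxBS tiles fit in cache (assumption) */

/* Blocked matrix multiply: work on BSxBS tiles so each tile of a, b,
 * and c is reused many times while it is resident in the cache.
 * Assumes c is zero-initialized before the call. */
void matmul_blocked(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                /* mini ijk multiply on one tile, with edge guards */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int j = jj; j < jj + BS && j < n; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BS && k < n; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}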
Outline
• Memory, locality of reference, and caching
• Cache coherence in shared memory systems
Shared memory systems
• All processes have access to the same address space
– e.g., a PC with more than one processor
• Data exchange between processes by writing/reading shared variables
– Shared memory systems are easy to program
– Current standard in scientific programming: OpenMP
• Two versions of shared memory systems are available today:
– Centralized shared memory architectures
– Distributed shared memory architectures
Centralized Shared Memory Architecture
• Also referred to as Symmetric Multi-Processors (SMP)
• All processors share the same physical main memory
• Memory bandwidth per processor is the limiting factor for this type of architecture
• Typical size: 2-32 processors
[Figure: several CPUs connected to a single shared Memory.]
Centralized shared memory system (I)
• Intel X7350 quad-core (Tigerton)
– Private L1 cache: 32 KB instruction, 32 KB data
– Shared L2 cache: 4 MB unified cache
[Figure: four cores in two pairs, each core with a private L1, each pair sharing an L2 cache, attached to a 1066 MHz FSB.]
Centralized shared memory systems (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets (Socket 0 to Socket 3) holding 16 cores C0-C15, with each pair of cores sharing an L2 cache; all sockets connect through the Memory Controller Hub (MCH) to four memory banks, with an 8 GB/s link per socket.]
Distributed Shared Memory Architectures
• Also referred to as Non-Uniform Memory Architectures (NUMA)
• Some memory is closer to a certain processor than other memory
– The whole memory is still addressable from all processors
– Depending on what data item a processor retrieves, the access time might vary strongly
[Figure: four nodes, each a pair of CPUs with its own local Memory, linked by an interconnect.]
NUMA architectures (II)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
– e.g., first touch policy: a data item will be located in the memory of the processor which uses that data item first
• To reduce the effects of non-uniform memory access, caches are often used
– ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors
Distributed Shared Memory Systems
Cache Coherence
• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU
[Figure: the original data item in Memory, with a copy of the data item in the cache of CPU 0 and another in the cache of CPU 1.]
Cache coherence (II)
• Typical solution:
– Caches keep track of whether a data item is shared between multiple processes
– Upon modification of a shared data item, a 'notification' to other caches has to occur
– Other caches will then have to reload the shared data item on the next access into their cache
• Cache coherence is only an issue when multiple tasks access the same item:
– Multiple threads
– Multiple processes have a joint shared memory segment
– A process is migrated from one CPU to another
Cache Coherence Protocols
• Snooping protocols
– Send all requests for data to all processors
– Processors snoop a bus to see if they have a copy and respond accordingly
– Requires broadcast, since caching information resides at the processors
– Works well with a bus (natural broadcast medium)
– Dominates for centralized shared memory machines
• Directory-based protocols
– Keep track of what is being shared in a centralized location
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via the network
– Scales better than snooping
– Commonly used for distributed shared memory machines
A sketch of the per-line state logic both protocol families maintain follows.
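As an illustration of the state tracking that invalidation-based protocols rely on, here is a hedged sketch of the classic MSI states in C. MSI itself is standard (see the Sorin/Hill/Wood primer in the references), but this simplified transition function is mine, not from the slides:

/* MSI: each cache line is Modified, Shared, or Invalid. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,   /* requests from this cache's own CPU   */
    BUS_READ, BUS_WRITE    /* requests snooped from other caches   */
} msi_event_t;

/* Next state for one cache line; a write by another CPU makes our
 * copy stale, which is exactly the 'notification' described above. */
msi_state_t msi_next(msi_state_t s, msi_event_t e)
{
    switch (e) {
    case CPU_READ:  return (s == INVALID) ? SHARED : s;   /* fetch on miss  */
    case CPU_WRITE: return MODIFIED;        /* gain exclusive ownership     */
    case BUS_READ:  return (s == MODIFIED) ? SHARED : s;  /* supply data    */
    case BUS_WRITE: return INVALID;         /* another CPU wrote: invalidate */
    }
    return s;
}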
Categories of cache misses
• Up to now:
– Compulsory misses: the first access to a block cannot be in the cache (cold start misses)
– Capacity misses: the cache cannot contain all blocks required for the execution
– Conflict misses: a cache block has to be discarded because of the block replacement strategy
• In multi-processor systems:
– Coherence misses: a cache block has to be discarded because another processor modified the content
• true sharing miss: another processor modified the content of the requested element
• false sharing miss: another processor invalidated the block, although the actual item of interest is unchanged.
Bus Snooping Topology
Larger Shared Memory Systems
• Typically distributed shared memory systems
• Local or remote memory access via a memory controller
• A directory tracks the state of every block in every cache
– Which caches have a copy of the block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
– PLUS: in memory => simpler protocol (centralized / one location)
– MINUS: in memory => directory is f(memory size) vs. f(cache size)
• Prevent the directory from becoming a bottleneck? Distribute directory entries with the memory, each keeping track of which processors have copies of their blocks
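A minimal sketch of what one such distributed directory entry might hold, in C. The 64-processor bound, the field names, and the send_invalidate stub are illustrative assumptions, not from the slides:

#include <stdint.h>

#define MAX_PROCS 64   /* illustrative bound; one sharer bit per processor */

/* Stand-in for a point-to-point network send (assumed, not a real API). */
static void send_invalidate(int proc) { (void)proc; }

/* One directory entry per memory block, kept with the block's home
 * memory: which processors cache the block, and whether one of them
 * holds it dirty (exclusive). */
struct dir_entry {
    uint64_t sharers;   /* bit i set => processor i has a copy          */
    uint8_t  dirty;     /* 1 => a single owner holds the block modified */
};

/* On a write by processor p, the home node sends point-to-point
 * invalidations to every other recorded sharer, then records p
 * as the sole (dirty) owner. */
void dir_on_write(struct dir_entry *e, int p)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (((e->sharers >> i) & 1) && i != p)
            send_invalidate(i);
    e->sharers = 1ull << p;
    e->dirty   = 1;
}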
Distributed Directory MPs
False Sharing in OpenMP
• False sharing
– When at least one thread writes to a cache line while others access it
• Thread 0: = A[1] (read)
• Thread 1: A[0] = … (write)
• Solution: use array padding

/* false sharing: all threads update adjacent elements of one line */
int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

/* padded: each thread's element sits on its own cache line
 * (padding by cache_line_size ints over-pads; a line's worth
 * of bytes per row is sufficient) */
int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;
False Sharing
(Source: Ruud van der Pas, "Getting OpenMP Up To Speed", tutorial, IWOMP 2010, CCS Univ. of Tsukuba, June 14, 2010)
• A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.
[Figure: threads T0 and T1 on different CPUs caching the same line holding array A; T0's store invalidates T1's copy.]
NUMA and First Touch Policy
• Data placement policy on NUMA architectures
• First Touch Policy
– The process that first touches a page of memory causes that page to be allocated in the node on which the process is running
NUMA First-touch placement/1
(From the same IWOMP 2010 tutorial: "A generic cc-NUMA architecture" figure omitted.)

int a[100];   /* only reserves the virtual memory addresses */

for (i = 0; i < 100; i++)
    a[i] = 0;

First touch: the loop runs in one thread, so all array elements a[0] through a[99] end up in the memory of the processor executing that thread.
NUMA First-touch placement/2

#pragma omp parallel for num_threads(2)
for (i = 0; i < 100; i++)
    a[i] = 0;

First touch: both memories each get "their half" of the array — a[0] through a[49] in one node, a[50] through a[99] in the other.
Work with First-Touch in OpenMP
• First-touch in practice
– Initialize data consistently with the computations

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);
#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
Concluding Observations
• The programmer can optimize for cache performance
– How data structures are organized
– How data are accessed
• Nested loop structure
• Blocking is a general technique
• All systems favor "cache friendly code"
– Getting absolute optimum performance is very platform specific
• Cache sizes, line sizes, associativities, etc.
– Can get most of the advantage with generic code
• Keep the working set reasonably small (temporal locality)
• Use small strides (spatial locality)
– Work with the cache coherence protocol and the NUMA first touch policy
References
• Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson.
• A Primer on Memory Consistency and Cache Coherence, Daniel J. Sorin, Mark D. Hill, and David A. Wood, Synthesis Lectures on Computer Architecture (Mark D. Hill, Series Editor), 2011.