Lecture 17: Memory Hierarchy and Cache Coherence
Concurrent and Multicore Programming
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan
Parallelism in Hardware
• Instruction-Level Parallelism
– Pipelining
– Out-of-order execution
– Superscalar
• Thread-Level Parallelism
– Chip multithreading, multicore
– Coarse-grained and fine-grained multithreading
– SMT (simultaneous multithreading)
• Data-Level Parallelism
– SIMD/Vector
– GPU/SIMT
Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson
Topics (Part 2)
• Parallel architectures and hardware
– Parallel computer architectures
– Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
– GPU architectures
– CUDA programming
– Introduction to the offloading model in OpenMP and OpenACC
• Programming on large scale systems (Chapter 6)
– MPI (point to point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Parallel algorithms (Chapters 8, 9 & 10)
– Dense matrix and sorting
Outline
• Memory, locality of reference, and caching
• Cache coherence in shared memory systems
Memory until now…
• We've relied on a very simple model of memory for most of this class
– Main memory is a linear array of bytes that can be accessed given a memory address
– We also used registers to store values
• Reality is more complex: there is an entire memory system.
– Different memories exist at different levels of the computer
– Each varies in speed, size, and cost
Random-Access Memory (RAM)
• Key features
– RAM is packaged as a chip.
– The basic storage unit is a cell (one bit per cell).
– Multiple RAM chips form a memory.
• Static RAM (SRAM)
– Each cell stores a bit with a six-transistor circuit.
– Retains its value indefinitely, as long as it is kept powered.
– Relatively insensitive to disturbances such as electrical noise.
– Faster and more expensive than DRAM.
Random-Access Memory (RAM)
• Dynamic RAM (DRAM)
– Each cell stores a bit with a capacitor and a transistor.
– The value must be refreshed every 10-100 ms.
– Sensitive to disturbances.
– Slower and cheaper than SRAM.
Memory Modules… real-life DRAM
• In reality, several DRAM chips are bundled into memory modules
• SIMM - Single Inline Memory Module
• DIMM - Dual Inline Memory Module
• DDR - Double Data Rate
– Reads twice every clock cycle
• Quad Pump: simultaneous R/W
Source for pictures: http://en.kioskea.net/contents/pc/ram.php3
SDR, DDR, Quad Pump
[Figure: timing diagrams comparing single data rate, double data rate, and quad-pumped transfers.]
Memory Speeds
• Processor speeds: a 1 GHz processor has a 1 nsec cycle time.
• Memory speeds: roughly 50 nsec per access
• Access speed gap: instructions that store to or load from memory stall while waiting; a 50 ns access costs about 50 processor cycles at 1 GHz.

DIMM Module   Chip Type   Clock Speed (MHz)   Bus Speed (MHz)   Transfer Rate (MB/s)
PC1600        DDR200      100                 200               1600
PC2100        DDR266      133                 266               2133
PC2400        DDR300      150                 300               2400
Memory Hierarchy (Review)
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom:
• L0: registers — CPU registers hold words retrieved from the L1 cache.
• L1: on-chip L1 cache (SRAM) — holds cache lines retrieved from the L2 cache.
• L2: off-chip L2 cache (SRAM) — holds cache lines retrieved from main memory.
• L3: main memory (DRAM) — holds disk blocks retrieved from local disks.
• L4: local secondary storage (local disks) — holds files retrieved from disks on remote network servers.
• L5: remote secondary storage (distributed file systems, Web servers)
Cache Memories (SRAM)
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
– They hold frequently accessed blocks of main memory.
• The CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
[Figure: on the processor chip, the register file and ALU connect to the L1 cache; a cache bus links to the off-chip L2 cache; the bus interface attaches to the system bus, which reaches main memory and I/O through the I/O bridge and memory bus.]
How to Exploit the Memory Hierarchy
• Availability of memory
– Cost, size, speed
• Principle of locality
– Memory references are bunched together
– A small portion of the address space is accessed at any given time
• Keep this space in high speed memory
– Problem: not all of it may fit
Types of locality
• Temporal locality
– Tendency to access locations recently referenced
• Spatial locality
– Tendency to reference locations around those recently referenced
– After location x, accesses to x-k or x+k are likely
[Figure: repeated references to location X clustered along a time axis t.]
Sources of locality
• Temporal locality
– Code within a loop
– The same instructions are fetched repeatedly
• Spatial locality
– Data arrays
– Local variables on the stack
– Data allocated in chunks (contiguous bytes)

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * a;
}
What does locality buy?
• It addresses the gap between CPU speed and RAM speed
• Spatial and temporal locality imply that a subset of instructions and data can fit in high speed memory from time to time
• The CPU can access instructions and data from this high speed memory
• A small high speed memory can make the computer faster and cheaper
• Speed of 1-20 nsec at a cost of $50 to $100 per Mbyte
• This is caching!!
Inserting an L1 Cache Between the CPU and Main Memory
• The tiny, very fast CPU register file has room for four 4-byte words. The transfer unit between the CPU register file and the cache is a 4-byte word.
• The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1). The transfer unit between the cache and main memory is a 4-word block (16 bytes).
• The big, slow main memory has room for many 4-word blocks (e.g., block 10 holding a b c d, block 21 holding p q r s, block 30 holding w x y z).
What info does a cache need?
• Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
• You essentially allow a smaller region of memory to hold data from a larger region. It is not a 1-1 mapping.
• What kind of information do we need to keep?
– The actual data
– Where the data actually comes from (the tag)
– Whether the data is even considered valid (the valid bit)
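Taken together, a cache line can be modeled as a data block plus its bookkeeping. A minimal sketch in C (the 32-byte block size and the field widths are illustrative assumptions, not fixed by the slides):

#include <stdint.h>

#define BLOCK_SIZE 32   /* bytes per cache block (illustrative) */

/* One cache line: a valid bit, a tag recording where the block
 * came from in memory, and the cached data itself. */
struct cache_line {
    uint8_t  valid;              /* is this line holding real data?    */
    uint32_t tag;                /* high-order address bits of the block */
    uint8_t  data[BLOCK_SIZE];   /* the actual cached bytes            */
};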
Cache Organization
• Map each region of memory to a smaller region of cache
• Discard address bits:
– (a) discard lower order bits, or
– (b) discard higher order bits
• Suppose the cache address size is 4 bits and the memory address size is 8 bits
• In case (a): 0000xxxx is mapped to 0000 in the cache
• In case (b): xxxx0001 is mapped to 0001 in the cache
[Figure: memory regions mapped onto the smaller cache under schemes (a) and (b).]
Finding data in cache
• Part of the memory address is applied to the cache as an index
• The remaining bits are stored as a tag in the cache
• Lower order bits are discarded
• Example: need to check whether address 00010011 is cached
– Cache index is 0001
– Tag is 0011
• If the tag matches, it is a hit: use the data
• If there is no match, it is a miss: fetch the data from memory
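As a concrete sketch of the tag / set index / block offset split that the next slides formalize, here is how a lookup could decompose a 32-bit address in C. The 5/11 bit widths are assumptions that happen to match the 64 KB direct-mapped example later in the lecture:

#include <stdint.h>

#define B_BITS 5    /* b offset bits: 32-byte blocks (assumption) */
#define S_BITS 11   /* s index bits: 2048 sets (assumption)       */

/* <tag> <set index> <block offset>, most to least significant. */
static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << B_BITS) - 1);
}
static inline uint32_t set_index(uint32_t addr) {
    return (addr >> B_BITS) & ((1u << S_BITS) - 1);
}
static inline uint32_t tag_bits(uint32_t addr) {
    return addr >> (B_BITS + S_BITS);
}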
General Organization of a Cache Memory
• A cache is an array of S = 2^s sets.
• Each set contains E lines (one or more).
• Each line holds a block of B = 2^b data bytes, plus t tag bits per line and 1 valid bit per line.
• Cache size: C = B x E x S data bytes (tag and valid bits are overhead and do not count toward C).
[Figure: set 0 through set S-1, each with E lines; each line is <valid bit> <tag> <bytes 0 1 … B-1>.]
Addressing Caches
• An m-bit address A is divided into <tag> (t bits), <set index> (s bits), and <block offset> (b bits), from most to least significant.
• The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
• The word contents begin at offset <block offset> bytes from the beginning of the block.
[Figure: the set index bits of address A select one set; each set holds lines of <valid bit> <tag> <block>.]
Direct-Mapped Cache
• The simplest kind of cache
• Characterized by exactly one line per set (E = 1 lines per set).
[Figure: set 0, set 1, …, set S-1, each holding a single <valid, tag, cache block> line.]
Accessing Direct-Mapped Caches
• Set selection
– Use the set index bits to determine the set of interest.
[Figure: the s set-index bits of the address select one set; the t tag bits and b block-offset bits are used in the following steps.]
Accessing Direct-Mapped Caches
• Line matching and word selection
– Line matching: find a valid line in the selected set with a matching tag
– Word selection: then extract the word
• A hit requires that (1) the valid bit is set and (2) the tag bits in the cache line match the tag bits in the address. If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: address with tag 0110, set index i, offset 100 matches a line with valid bit 1 and tag 0110; the offset selects word w1 out of w0 w1 w2 w3.]
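Putting set selection, line matching, and word selection together, a direct-mapped lookup can be sketched in C. This is a teaching model under the bit-width assumptions introduced earlier, not a hardware-accurate implementation:

#include <stdint.h>
#include <string.h>

#define B_BITS 5
#define S_BITS 11
#define NSETS  (1u << S_BITS)
#define BSIZE  (1u << B_BITS)

struct line { uint8_t valid; uint32_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS];   /* direct-mapped: one line per set */

/* Returns 1 on hit and copies the word out; returns 0 on a miss.
 * Assumes a word-aligned offset. */
int lookup(uint32_t addr, uint32_t *word)
{
    uint32_t set = (addr >> B_BITS) & (NSETS - 1);  /* set selection  */
    uint32_t tag = addr >> (B_BITS + S_BITS);       /* line matching  */
    uint32_t off = addr & (BSIZE - 1);              /* word selection */
    struct line *ln = &cache[set];
    if (ln->valid && ln->tag == tag) {              /* (1) and (2)    */
        memcpy(word, &ln->data[off], sizeof *word);
        return 1;                                   /* cache hit      */
    }
    return 0;  /* miss: the block must be fetched from memory */
}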
Example: Direct-mapped cache
• 32-bit address, 64 KB cache, 32-byte blocks
• How many sets, how many bits for the tag, and how many bits for the offset?
[Figure: set 0 through set n-1, each with a single <valid, tag, cache block> line.]
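A worked answer to the slide's question (my arithmetic, following the C = B x E x S formula from the earlier slide, with E = 1 for a direct-mapped cache):

b = \log_2 32 = 5 \text{ offset bits}
S = \frac{64\,\text{KB}}{32\,\text{B} \times 1} = 2048 = 2^{11} \;\Rightarrow\; s = 11 \text{ set-index bits}
t = 32 - 11 - 5 = 16 \text{ tag bits}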
Write-through vs write-back
• What to do when an update occurs?
• Write-through: write to memory immediately
– Simple to implement, synchronous writes
– Uniform latency on misses
• Write-back: write to memory only when the block is replaced
– Requires an additional dirty bit or modified bit
– Asynchronous writes
– Non-uniform miss latency:
• Clean miss: read from the lower level
• Dirty miss: write to the lower level, then read (fill)
Writes and Cache
• Reading information from a cache is straightforward. What about writing?
– What if you're writing data that is already cached (write-hit)?
– What if the data is not in the cache (write-miss)?
• Dealing with a write-hit:
– Write-through: immediately write the data back to memory
– Write-back: defer the write to memory for as long as possible
• Dealing with a write-miss:
– Write-allocate: load the block into the cache and update it there
– No-write-allocate: write directly to memory
• Benefits? Disadvantages?
• Write-through caches are typically no-write-allocate.
• Write-back caches are typically write-allocate. (A sketch of both pairings follows.)
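To make the two common policy pairings concrete, here is a hedged sketch of a store operation in C. The memory helpers are stand-in names I introduce for illustration; none of them come from the slides:

#include <stdint.h>
#include <string.h>

#define B_BITS 5
#define S_BITS 11
#define NSETS  (1u << S_BITS)
#define BSIZE  (1u << B_BITS)

struct wline { uint8_t valid, dirty; uint32_t tag; uint8_t data[BSIZE]; };
static struct wline wcache[NSETS];

/* Stand-ins for the next level of the hierarchy (assumed helpers). */
static void mem_write_word(uint32_t addr, uint32_t w) { (void)addr; (void)w; }
static void mem_write_block(uint32_t addr, const uint8_t *b) { (void)addr; (void)b; }
static void mem_read_block(uint32_t addr, uint8_t *b) { (void)addr; memset(b, 0, BSIZE); }

static uint32_t set_of(uint32_t a)  { return (a >> B_BITS) & (NSETS - 1); }
static uint32_t tag_of(uint32_t a)  { return a >> (B_BITS + S_BITS); }
static uint32_t base_of(uint32_t a) { return a & ~(uint32_t)(BSIZE - 1); }

/* Write-back + write-allocate (one typical pairing). */
void store_writeback(uint32_t addr, uint32_t word)
{
    struct wline *ln = &wcache[set_of(addr)];
    if (!(ln->valid && ln->tag == tag_of(addr))) {        /* write-miss  */
        if (ln->valid && ln->dirty)                       /* dirty miss: */
            mem_write_block((ln->tag << (B_BITS + S_BITS)) |
                            (set_of(addr) << B_BITS), ln->data);
        mem_read_block(base_of(addr), ln->data);          /* allocate    */
        ln->valid = 1; ln->dirty = 0; ln->tag = tag_of(addr);
    }
    memcpy(&ln->data[addr & (BSIZE - 1)], &word, sizeof word);
    ln->dirty = 1;    /* defer the memory write until eviction */
}

/* Write-through + no-write-allocate (the other typical pairing). */
void store_writethrough(uint32_t addr, uint32_t word)
{
    struct wline *ln = &wcache[set_of(addr)];
    if (ln->valid && ln->tag == tag_of(addr))             /* write-hit   */
        memcpy(&ln->data[addr & (BSIZE - 1)], &word, sizeof word);
    mem_write_word(addr, word);   /* memory is updated immediately */
}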
Multi-Level Caches
• Options: separate data and instruction caches, or a unified cache
• Typical parameters at each level (larger, slower, cheaper per byte as you move away from the processor):

              size:         speed:   $/Mbyte:    line size:
Regs          200 B         3 ns     -           8 B
L1 d-cache,
L1 i-cache    8-64 KB       3 ns     -           32 B
Unified L2    1-4 MB SRAM   6 ns     $100/MB     32 B
Memory        128 MB DRAM   60 ns    $1.50/MB    8 KB
disk          30 GB         8 ms     $0.05/MB    -
Cache Performance Metrics
• Miss Rate
– Fraction of memory references not found in the cache (misses / references)
– Typical numbers:
• 3-10% for L1
• can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
– Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
– Typical numbers:
• 1 clock cycle for L1
• 3-8 clock cycles for L2
• Miss Penalty
– Additional time required because of a miss
• Typically 25-100 cycles for main memory
These three metrics combine as shown below.
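The metrics combine into the standard average memory access time (AMAT) formula from the Hennessy and Patterson text cited in the references. For example, with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty (numbers chosen from the typical ranges above):

\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty} = 1 + 0.05 \times 50 = 3.5 \text{ cycles}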
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples (assume a cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25% (stride-1: one miss per 4-word block)

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100% (stride-N: every access touches a new block for large N)
Matrix Multiplication Example
• Major cache effects to consider
– Total cache size
• Exploit temporal locality and blocking
– Block size
• Exploit spatial locality
• Description:
– Multiply N x N matrices
– O(N^3) total operations
– Accesses:
• N reads per source element
• N values summed per destination
– but may be able to hold in register

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;   /* variable sum held in register */
    }
}
Miss Rate Analysis for Matrix Multiply
• Assume:
– Line size = 32 bytes (big enough for four 64-bit words)
– Matrix dimension (N) is very large
• Approximate 1/N as 0.0
– Cache is not even big enough to hold multiple rows
• Analysis method:
– Look at the access pattern of the inner loop
Layout of C Arrays in Memory (review)
• C arrays are allocated in row-major order
– each row in contiguous memory locations
• Stepping through columns in one row:
– for (i = 0; i < N; i++) sum += a[0][i];
– accesses successive elements
– if block size (B) > 4 bytes, exploit spatial locality
• compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:
– for (i = 0; i < n; i++) sum += a[i][0];
– accesses distant elements
– no spatial locality!
– compulsory miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)
Matrix Multiplication (jik)
Matrix Multiplication (kij)
Matrix Multiplication (ikj)
Matrix Multiplication (jki)
Matrix Multiplication (kji)
[Per-variant inner-loop access-pattern figures omitted; the loop nests and miss rates are summarized on the next slide.]
Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
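The matrix multiplication slide above notes that blocking exploits temporal locality within the total cache size. As a hedged illustration (not from the original slides), here is a tiled version in C; the tile size BS = 32 and the requirement that c start zero-initialized are my assumptions, to be tuned per platform:

#define BS 32   /* tile size: chosen so three BSxBS tiles fit in cache (assumption) */

/* Blocked matrix multiply: work on BSxBS tiles so each tile of a, b,
 * and c is reused many times while it is resident in the cache.
 * Assumes c is zero-initialized before the call. */
void matmul_blocked(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                /* mini ijk multiply on one tile, with edge guards */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int j = jj; j < jj + BS && j < n; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BS && k < n; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}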
Outline
• Memory, locality of reference, and caching
• Cache coherence in shared memory systems
Shared memory systems
• All processes have access to the same address space
– e.g., a PC with more than one processor
• Data exchange between processes by writing/reading shared variables
– Shared memory systems are easy to program
– Current standard in scientific programming: OpenMP
• Two versions of shared memory systems are available today:
– Centralized shared memory architectures
– Distributed shared memory architectures
Centralized Shared Memory Architecture
• Also referred to as Symmetric Multi-Processors (SMP)
• All processors share the same physical main memory
• Memory bandwidth per processor is the limiting factor for this type of architecture
• Typical size: 2-32 processors
[Figure: several CPUs connected to a single shared Memory.]
Centralized shared memory system (I)
• Intel X7350 quad-core (Tigerton)
– Private L1 cache: 32 KB instruction, 32 KB data
– Shared L2 cache: 4 MB unified cache
[Figure: four cores in two pairs, each core with a private L1, each pair sharing an L2 cache, attached to a 1066 MHz FSB.]
Centralized shared memory systems (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets (Socket 0 to Socket 3) holding 16 cores C0-C15, with each pair of cores sharing an L2 cache; all sockets connect through the Memory Controller Hub (MCH) to four memory banks, with an 8 GB/s link per socket.]
Distributed Shared Memory Architectures
• Also referred to as Non-Uniform Memory Architectures (NUMA)
• Some memory is closer to a certain processor than other memory
– The whole memory is still addressable from all processors
– Depending on what data item a processor retrieves, the access time might vary strongly
[Figure: four nodes, each a pair of CPUs with its own local Memory, linked by an interconnect.]
NUMA architectures (II)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
– e.g., first touch policy: a data item will be located in the memory of the processor which uses that data item first
• To reduce the effects of non-uniform memory access, caches are often used
– ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors
Distributed Shared Memory Systems
Cache Coherence
• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU
[Figure: the original data item in Memory, with a copy of the data item in the cache of CPU 0 and another in the cache of CPU 1.]
Cache coherence (II)
• Typical solution:
– Caches keep track of whether a data item is shared between multiple processes
– Upon modification of a shared data item, a 'notification' to other caches has to occur
– Other caches will then have to reload the shared data item on the next access into their cache
• Cache coherence is only an issue when multiple tasks access the same item:
– Multiple threads
– Multiple processes have a joint shared memory segment
– A process is migrated from one CPU to another
Cache Coherence Protocols
• Snooping protocols
– Send all requests for data to all processors
– Processors snoop a bus to see if they have a copy and respond accordingly
– Requires broadcast, since caching information resides at the processors
– Works well with a bus (natural broadcast medium)
– Dominates for centralized shared memory machines
• Directory-based protocols
– Keep track of what is being shared in a centralized location
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via the network
– Scales better than snooping
– Commonly used for distributed shared memory machines
A sketch of the per-line state logic both protocol families maintain follows.
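As an illustration of the state tracking that invalidation-based protocols rely on, here is a hedged sketch of the classic MSI states in C. MSI itself is standard (see the Sorin/Hill/Wood primer in the references), but this simplified transition function is mine, not from the slides:

/* MSI: each cache line is Modified, Shared, or Invalid. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,   /* requests from this cache's own CPU   */
    BUS_READ, BUS_WRITE    /* requests snooped from other caches   */
} msi_event_t;

/* Next state for one cache line; a write by another CPU makes our
 * copy stale, which is exactly the 'notification' described above. */
msi_state_t msi_next(msi_state_t s, msi_event_t e)
{
    switch (e) {
    case CPU_READ:  return (s == INVALID) ? SHARED : s;   /* fetch on miss  */
    case CPU_WRITE: return MODIFIED;        /* gain exclusive ownership     */
    case BUS_READ:  return (s == MODIFIED) ? SHARED : s;  /* supply data    */
    case BUS_WRITE: return INVALID;         /* another CPU wrote: invalidate */
    }
    return s;
}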
Categories of cache misses
• Up to now:
– Compulsory misses: the first access to a block cannot be in the cache (cold start misses)
– Capacity misses: the cache cannot contain all blocks required for the execution
– Conflict misses: a cache block has to be discarded because of the block replacement strategy
• In multi-processor systems:
– Coherence misses: a cache block has to be discarded because another processor modified the content
• true sharing miss: another processor modified the content of the requested element
• false sharing miss: another processor invalidated the block, although the actual item of interest is unchanged.
Bus Snooping Topology
Larger Shared Memory Systems
• Typically distributed shared memory systems
• Local or remote memory access via a memory controller
• A directory tracks the state of every block in every cache
– Which caches have a copy of the block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
– PLUS: in memory => simpler protocol (centralized / one location)
– MINUS: in memory => directory is f(memory size) vs. f(cache size)
• Prevent the directory from becoming a bottleneck? Distribute directory entries with the memory, each keeping track of which processors have copies of their blocks
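A minimal sketch of what one such distributed directory entry might hold, in C. The 64-processor bound, the field names, and the send_invalidate stub are illustrative assumptions, not from the slides:

#include <stdint.h>

#define MAX_PROCS 64   /* illustrative bound; one sharer bit per processor */

/* Stand-in for a point-to-point network send (assumed, not a real API). */
static void send_invalidate(int proc) { (void)proc; }

/* One directory entry per memory block, kept with the block's home
 * memory: which processors cache the block, and whether one of them
 * holds it dirty (exclusive). */
struct dir_entry {
    uint64_t sharers;   /* bit i set => processor i has a copy          */
    uint8_t  dirty;     /* 1 => a single owner holds the block modified */
};

/* On a write by processor p, the home node sends point-to-point
 * invalidations to every other recorded sharer, then records p
 * as the sole (dirty) owner. */
void dir_on_write(struct dir_entry *e, int p)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (((e->sharers >> i) & 1) && i != p)
            send_invalidate(i);
    e->sharers = 1ull << p;
    e->dirty   = 1;
}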
Distributed Directory MPs
False Sharing in OpenMP
• False sharing
– When at least one thread writes to a cache line while others access it
• Thread 0: = A[1] (read)
• Thread 1: A[0] = … (write)
• Solution: use array padding

/* false sharing: all threads update adjacent elements of one line */
int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

/* padded: each thread's element sits on its own cache line
 * (padding by cache_line_size ints over-pads; a line's worth
 * of bytes per row is sufficient) */
int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;
False Sharing
(Source: Ruud van der Pas, "Getting OpenMP Up To Speed", tutorial, IWOMP 2010, CCS Univ. of Tsukuba, June 14, 2010)
• A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.
[Figure: threads T0 and T1 on different CPUs caching the same line holding array A; T0's store invalidates T1's copy.]
NUMA and First Touch Policy
• Data placement policy on NUMA architectures
• First Touch Policy
– The process that first touches a page of memory causes that page to be allocated in the node on which the process is running
NUMA First-touch placement/1
(From the same IWOMP 2010 tutorial: "A generic cc-NUMA architecture" figure omitted.)

int a[100];   /* only reserves the virtual memory addresses */

for (i = 0; i < 100; i++)
    a[i] = 0;

First touch: the loop runs in one thread, so all array elements a[0] through a[99] end up in the memory of the processor executing that thread.
NUMA First-touch placement/2

#pragma omp parallel for num_threads(2)
for (i = 0; i < 100; i++)
    a[i] = 0;

First touch: both memories each get "their half" of the array — a[0] through a[49] in one node, a[50] through a[99] in the other.
Work with First-Touch in OpenMP
• First-touch in practice
– Initialize data consistently with the computations

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);
#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
Concluding Observations
• The programmer can optimize for cache performance
– How data structures are organized
– How data are accessed
• Nested loop structure
• Blocking is a general technique
• All systems favor "cache friendly code"
– Getting absolute optimum performance is very platform specific
• Cache sizes, line sizes, associativities, etc.
– Can get most of the advantage with generic code
• Keep the working set reasonably small (temporal locality)
• Use small strides (spatial locality)
– Work with the cache coherence protocol and the NUMA first touch policy
References
• Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 30, 2011, by John L. Hennessy and David A. Patterson.
• A Primer on Memory Consistency and Cache Coherence, Daniel J. Sorin, Mark D. Hill, and David A. Wood, Synthesis Lectures on Computer Architecture (Mark D. Hill, Series Editor), 2011.