WU UCB CS252 SP17 CS252 Spring 2017 Graduate Computer Architecture Lecture 11: Memory Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17


Page 1: CS252 Spring 2017 Graduate Computer Architecture Lecture ...inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec11.pdf · §Coincident current pulses on X and Y wires would write

WU UCB CS252 SP17

CS252 Spring 2017
Graduate Computer Architecture

Lecture 11: Memory

Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17

Logistics for the 15-min meeting next Tuesday

• Email me one email per project group by Sunday evening 2/26
• 2-5 sentences on what the project objective is (what problem are you solving and what solution are you proposing, very briefly)
• Bulleted list of what each person is responsible for
• The infrastructure you want to use to evaluate your solution

© Krste Asanovic, 2015. CS252, Fall 2015, Lecture 11

Last Time in Lecture 10

VLIW Machines
§ Compiler-controlled static scheduling
§ Loop unrolling
§ Software pipelining
§ Trace scheduling
§ Rotating register file
§ Predication
§ Limits of static scheduling

Early Read-Only Memory Technologies

§ Punched cards, from the early 1700s through the Jacquard loom, Babbage, and then IBM
§ Punched paper tape, instruction stream in Harvard Mk 1
§ IBM Card Capacitor ROS
§ IBM Balanced Capacitor ROS
§ Diode matrix, EDSAC-2 µcode store

Card Capacitor ROS for IBM System/360 Model 30

http://www.glennsmuseum.com/ibm/ibm.html

Early Read/Write Main Memory Technologies

§ Babbage, 1800s: digits stored on mechanical wheels
§ Williams Tube, Manchester Mark 1, 1947
§ Mercury Delay Line, Univac 1, 1951
§ Also, regenerative capacitor memory on the Atanasoff-Berry computer, and rotating magnetic drum memory on the IBM 650

MIT Whirlwind Core Memory

Core Memory

§ Core memory was the first large-scale reliable main memory
- invented by Forrester in late 40s/early 50s at MIT for the Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto a two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write a cell and also sense its original state (destructive reads)
§ Robust, non-volatile storage
§ Used on space shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1 µs

[Figure: DEC PDP-8/E board, 4K words x 12 bits (1968)]

Semiconductor Memory

§ Semiconductor memory began to be competitive in the early 1970s
- Intel formed to exploit the market for semiconductor memory
- Early semiconductor memory was Static RAM (SRAM). SRAM cell internals are similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was the Intel 1103
- 1 Kbit of storage on a single chip
- charge on a capacitor used to hold value
- Value has to be regularly read and written back, hence dynamic

Semiconductor memory quickly replaced core in the '70s

One-Transistor Dynamic RAM [Dennard, IBM]

[Figure: 1-T DRAM cell. Schematic labels: word line, bit line, access transistor, storage capacitor (FET gate, trench, stack), VREF. Cross-section labels: TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly word line, access transistor.]

Memory cells: using fewer transistors over time

http://www.intechopen.com/books/advances-in-solid-state-circuit-technologies/dimension-increase-in-metal-oxide-semiconductor-memories-and-transistors

Modern DRAM Cell Structure

[Samsung, sub-70nm DRAM, 2004]

DRAM Conceptual Architecture

[Figure: array of one-bit memory cells with 2^N rows and 2^M columns. N of the N+M address bits feed the row address decoder, which drives the word lines (Row 1 ... Row 2^N). The other M bits feed the column decoder and sense amplifiers on the bit lines (Col. 1 ... Col. 2^M), which deliver data D.]

§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
- each logical bank physically implemented as many smaller arrays
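As a rough illustration of the conceptual array, the sketch below splits an (N+M)-bit cell address into a row index and a column index. It assumes the row bits are the high-order bits, which the slide does not specify, and the names `cell_index_t` and `split_address` are hypothetical. Real parts also carry bank bits and multiplex the row and column halves over shared pins.

```c
#include <stdint.h>

/* Hypothetical illustration: split an (N+M)-bit cell address into the
 * N-bit row index and the M-bit column index.
 * Assumptions: row bits are the high-order bits, and both n_bits and
 * m_bits are in the range 1..31. */
typedef struct {
    uint32_t row;  /* selects one of 2^N word lines */
    uint32_t col;  /* selects one of 2^M bit lines  */
} cell_index_t;

static cell_index_t split_address(uint32_t addr, unsigned n_bits, unsigned m_bits)
{
    cell_index_t idx;
    idx.col = addr & ((1u << m_bits) - 1u);             /* low M bits  */
    idx.row = (addr >> m_bits) & ((1u << n_bits) - 1u); /* next N bits */
    return idx;
}
```

For example, with N = 4 and M = 6, address 0x2A5 falls in row 10, column 37.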

DRAM Physical Layout

II. DRAM TECHNOLOGY AND ARCHITECTURE

DRAMs are commoditized high volume products which need to have very low manufacturing costs. This puts significant constraints on the technology and architecture. The three most important factors for cost are the cost of a wafer, the yield and the die area. Cost of a wafer can be kept low if a simple transistor process and few metal levels are used. Yield can be optimized by process optimization and by optimizing the amount of redundancy. Die area optimization is achieved by keeping the array efficiency (ratio of cell area to total die area) as high as possible. The optimum approach changes very little even when the cell area is shrunk significantly over generations. DRAMs today use a transistor process with few junction optimizations, poly-Si gates and relatively high threshold voltage to suppress leakage. This process is much less expensive than a logic process but also much lower performance. It requires higher operating voltages than both high performance and low active power logic processes. Keeping array efficiency constant requires shrinking the area of the logic circuits on a DRAM at the same rate as the cell area. This is difficult as it is easier to shrink the very regular pattern of the cells than lithographically more complex circuitry. In addition the increasing complexity of the interface requires more circuit area.

Figure 1 shows the floorplan of a typical modern DDR2 or DDR3 DRAM and an enlargement of the cell array. The eight array blocks correspond to the eight banks of the DRAM. Row logic to decode the row address, implement row redundancy and drive the master wordlines is placed between the banks. At the other edge of the banks column logic includes column address decoding, column redundancy and drivers for the column select lines as well as the secondary sense-amplifiers which sense or drive the array master data lines. The center stripe contains all other logic: the data and control pads and interface, central control logic, voltage regulators and pumps of the power system and circuitry to support efficient manufacturing test. Circuitry and buses in the center stripe are usually shared between banks to save area. Concurrent operation of banks is therefore limited to that portion of an operation that takes place inside a bank. For example the delay between two activate commands to different banks is limited by the time it takes to decode commands and addresses and trigger the command at a bank. Interleaving of reads and writes from and to different banks is limited by data contention on the shared data bus in the center stripe. Operations inside different banks can take place concurrently; one bank can for example be activating or precharging a wordline while another bank is simultaneously streaming data.

The enlargement of a small part of an array block at the right side of Figure 1 shows the hierarchical structure of the array block. Hierarchical wordlines and array data lines which were first developed in the early 1990s [5], [6] are now used by all major DRAM vendors. Master wordlines, column select lines and master array data lines are the interface between the array block and the rest of the DRAM circuitry. Individual cells connect to local wordlines and bitlines, bitlines are sensed or driven by bitline sense-amplifiers which connect to column select lines and local array data lines. The circuitry making the connection between local lines and master lines is placed in the local wordline driver stripe and bitline sense-amplifier stripe

[Figure 1. Physical floorplan of a DRAM. A DRAM actually contains a very large number of small DRAMs called sub-arrays. Floorplan labels: row logic, column logic, center stripe, control logic, serializer and driver (begin of write data bus), buffer, 1:8, sub-array. Array block (bold line) enlargement labels: column select line (M3 - Al), master array data lines (M3 - Al), local array data lines, master wordline (M2 - Al), local wordline (gate poly), bitlines (M1 - W), local wordline driver stripe, bitline sense-amplifier stripe.]

[Vogelsang,MICRO-2010]

DRAM Packaging (Laptops/Desktops/Servers)

[Figure: DRAM chip with ~12 address lines (multiplexed row/column address), ~7 clock and control signals, and a data bus (4b, 8b, 16b, 32b)]

§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)

DRAM Packaging, Mobile Devices

[Figure: Apple A4 package on circuit board]

[Figure: Apple A4 package cross-section, iFixit 2010. Two stacked DRAM die; processor plus logic die]

DRAM Operation

§ Three steps in read/write access to a given bank
§ Row access (RAS)
- decode row address, enable addressed row (often multiple Kb in row)
- bitlines share charge with storage cell
- small change in voltage detected by sense amplifiers which latch whole row of bits
- sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
- decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
- on read, send latched bits out to chip pins
- on write, change sense amplifier latches which then charge storage cells to required value
- can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
- charges bit lines to known value, required before next row access
§ Each step has a latency of around 15-20 ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share the same core architecture
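The three steps suggest a simple latency model: how many of them an access pays for depends on what the bank's sense amplifiers currently hold. The sketch below is a minimal illustration under assumptions, not a timing model of any real part. The 15 ns figures are round numbers taken from the low end of the 15-20 ns range, and the names `row_state_t` and `access_latency_ns` are hypothetical.

```c
/* Hypothetical per-step latencies in nanoseconds (round figures chosen
 * from the 15-20 ns per-step range quoted in the lecture). */
enum { T_RAS_NS = 15, T_CAS_NS = 15, T_PRE_NS = 15 };

typedef enum {
    ROW_HIT,      /* requested row is already latched in the sense amps */
    BANK_IDLE,    /* bank is precharged and no row is open              */
    ROW_CONFLICT  /* a different row is open and must be closed first   */
} row_state_t;

/* Latency of one column access given the bank's row-buffer state. */
static int access_latency_ns(row_state_t state)
{
    switch (state) {
    case ROW_HIT:      return T_CAS_NS;                       /* CAS only        */
    case BANK_IDLE:    return T_RAS_NS + T_CAS_NS;            /* RAS, CAS        */
    case ROW_CONFLICT: return T_PRE_NS + T_RAS_NS + T_CAS_NS; /* PRE, RAS, CAS   */
    }
    return -1; /* not reached for valid inputs */
}
```

Under these assumed numbers a row hit costs 15 ns while a row conflict costs 45 ns, which is why burst accesses to an already open row are attractive.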

DRAM Read

IBM Application Note on Understanding DRAM Operation

DRAM Read

DRAM Write

IBM Application Note on Understanding DRAM Operation

DRAM Write

Memory Parameters

§ Latency
- Time from initiation to completion of one memory read (e.g., in nanoseconds, or in CPU or DRAM clock cycles)
§ Occupancy
- Time that a memory bank is busy with one request
- Usually the important parameter for a memory write
§ Bandwidth
- Rate at which requests can be processed (accesses/sec, or GB/s)
§ All can vary significantly for reads vs. writes, or address, or address history (e.g., open/close page on DRAM bank)
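Occupancy and bandwidth are linked: if each bank is busy for a fixed time per request, the number of banks divided by that time bounds the request rate. The sketch below computes that upper bound. It assumes perfectly interleaved banks and a data bus that is never the bottleneck, neither of which the lecture claims, and the function names and example numbers are hypothetical.

```c
#include <stdint.h>

/* Upper bound on request rate if every bank is kept busy back to back.
 * Assumptions (not from the lecture): requests are spread evenly over
 * the banks and the shared data bus never limits throughput. */
static uint64_t peak_accesses_per_sec(unsigned banks, unsigned occupancy_ns)
{
    return (uint64_t)banks * 1000000000ull / occupancy_ns;
}

/* The same bound expressed in bytes per second. */
static uint64_t peak_bytes_per_sec(unsigned banks, unsigned occupancy_ns,
                                   unsigned bytes_per_access)
{
    return peak_accesses_per_sec(banks, occupancy_ns) * bytes_per_access;
}
```

With 8 banks, 50 ns occupancy, and 64 bytes per access, the bound works out to 160 million accesses per second, or 10.24 GB/s.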

Processor-DRAM Gap (latency)

[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000. CPU (µProc) improves 60%/year; DRAM improves 7%/year. Processor-memory performance gap grows 50%/yr.]

A four-issue 3 GHz superscalar accessing 100 ns DRAM could execute 1,200 instructions during the time for one memory access!
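The 1,200 figure follows directly from the numbers quoted: a 100 ns access spans 300 cycles at 3 GHz, and a four-issue machine can issue up to four instructions in each of them. As a worked equation, assuming the peak issue rate is sustained for the whole access:

```latex
\[
4\ \frac{\text{instructions}}{\text{cycle}}
\times 3 \times 10^{9}\ \frac{\text{cycles}}{\text{s}}
\times 100 \times 10^{-9}\ \text{s}
= 4 \times 300
= 1200\ \text{instructions}
\]
```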

Physical Size Affects Latency

[Figure: a CPU attached to a small memory, compared with a CPU attached to a big memory. In the big memory:]

§ Signals have further to travel
§ Fan out to more locations

Two predictable properties of memory references:

§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
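Both properties show up in even a trivial loop nest. The sketch below sums the same row-major array in two traversal orders. The function names are hypothetical, and how large the performance difference is depends on the array size and the cache, which this example does not measure.

```c
#include <stddef.h>

/* Row-major walk over a rows x cols array stored row by row.
 * Consecutive iterations touch adjacent addresses (spatial locality),
 * and `sum` is reused on every iteration (temporal locality). */
static long sum_row_major(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Column-major walk over the same data. Each access lands
 * cols * sizeof(int) bytes after the previous one, so neighbouring
 * words in a cache line are not used before the walk moves on. */
static long sum_col_major(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

Both functions return the same value; only the order of the memory references, and therefore their locality, differs.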

Memory Reference Patterns

[Figure: memory address (one dot per access) versus time, with regions annotated as spatial locality and temporal locality]

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Memory Hierarchy

§ Small, fast memory near processor to buffer accesses to big, slow memory
- Make combination look like a big, fast memory
§ Keep recently accessed data in small fast memory closer to processor to exploit temporal locality
- Cache replacement policy favors recently accessed data
§ Fetch words around requested word to exploit spatial locality
- Use multiword cache lines, and prefetching

Management of Memory Hierarchy

§ Small/fast storage, e.g., registers
- Address usually specified in instruction
- Generally implemented directly as a register file
- but hardware might do things behind software's back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
- Address usually computed from values in register
- Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
- but software may provide "hints", e.g., don't cache or prefetch

Important Cache Parameters (Review)

§ Capacity (in bytes)
§ Associativity (from direct-mapped to fully associative)
§ Line size (bytes sharing a tag)
§ Write-back versus write-through
§ Write-allocate versus write no-allocate
§ Replacement policy (least recently used, random)

Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty

To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty

What is the best cache design for a 5-stage pipeline?

Biggest cache that doesn't increase hit time past 1 cycle (approx 8-32KB in modern technology)
[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
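The AMAT formula is easy to evaluate directly. The helper below is a minimal sketch with all times in cycles. The function name `amat` and the example figures (1-cycle hit, 5% miss rate, 100-cycle penalty) are illustrative assumptions, not measurements from the lecture.

```c
/* Average memory access time, in the same unit as hit_time and
 * miss_penalty (cycles here). miss_rate is a fraction in [0, 1]. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With the assumed figures, `amat(1.0, 0.05, 100.0)` gives 6 cycles. Halving the miss rate to 2.5% gives 3.5 cycles, while halving the hit time alone gives 5.5 cycles, which shows how the three terms trade off against each other.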

Causes of Cache Misses: The 3 C's

§ Compulsory: first reference to a line (a.k.a. cold start misses)
- misses that would occur even with infinite cache
§ Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under perfect replacement policy
§ Conflict: misses that occur because of collisions due to line-placement strategy
- misses that would not occur with ideal full associativity
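Conflict misses are easiest to see in a toy direct-mapped cache. The sketch below counts misses on an address trace. The geometry (4 sets of 64-byte lines) and the name `direct_mapped_misses` are hypothetical choices made only to keep the example small.

```c
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_SETS = 4 };  /* hypothetical toy geometry */

/* Count misses of a direct-mapped cache on a trace of byte addresses. */
static unsigned direct_mapped_misses(const uint32_t *trace, size_t n)
{
    uint32_t tags[NUM_SETS] = {0};
    int valid[NUM_SETS] = {0};
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t line = trace[i] / LINE_BYTES;
        uint32_t set  = line % NUM_SETS;   /* placement: one possible set */
        uint32_t tag  = line / NUM_SETS;
        if (!valid[set] || tags[set] != tag) {
            misses++;                      /* compulsory or conflict miss */
            valid[set] = 1;
            tags[set]  = tag;              /* evict whatever was there    */
        }
    }
    return misses;
}
```

On the trace 0, 256, 0, 256 both lines map to set 0, so all four accesses miss: two compulsory misses and two conflict misses. A fully associative cache of the same four-line capacity would take only the two compulsory misses, since both lines fit at once.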

Effect of Cache Parameters on Performance

§ Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
§ Higher associativity
+ reduces conflict misses
- may increase hit time
§ Larger line size
+ reduces compulsory misses
- increases conflict misses and miss penalty

Multilevel Caches

Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level

CPU <-> L1$ <-> L2$ <-> DRAM

Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
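The three metrics differ only in their denominator, which a small calculation for an L2 cache makes concrete. The sketch assumes every L1 miss produces exactly one L2 access (no prefetch or writeback traffic). The struct, function names, and counter values are hypothetical.

```c
/* Hypothetical event counters for a two-level hierarchy.
 * Assumption: each L1 miss causes exactly one L2 access. */
typedef struct {
    unsigned long cpu_accesses;  /* memory accesses issued by the CPU */
    unsigned long l1_misses;     /* equals the number of L2 accesses  */
    unsigned long l2_misses;
    unsigned long instructions;
} cache_counts_t;

/* L2 misses divided by accesses that actually reached L2. */
static double l2_local_miss_rate(const cache_counts_t *c)
{
    return (double)c->l2_misses / (double)c->l1_misses;
}

/* L2 misses divided by all CPU memory accesses. */
static double l2_global_miss_rate(const cache_counts_t *c)
{
    return (double)c->l2_misses / (double)c->cpu_accesses;
}

/* L2 misses divided by instructions executed. */
static double l2_misses_per_instruction(const cache_counts_t *c)
{
    return (double)c->l2_misses / (double)c->instructions;
}
```

For 1000 CPU accesses, 100 L1 misses, 20 L2 misses, and 500 instructions, the L2 local miss rate is 0.2, the global miss rate is 0.02, and misses per instruction is 0.04. The local rate looks high because L1 has already filtered out the easy hits.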

Presence of L2 influences L1 design

§ Use smaller L1 if there is also L2
- Trade increased L1 miss rate for reduced L1 hit time
- Backup L2 reduces L1 miss penalty
- Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
- Write-back L2 cache absorbs write traffic, doesn't go off-chip
- At most one L1 miss request per L1 access (no dirty victim writeback) simplifies pipeline control
- Simplifies coherence issues
- Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)

Inclusion Policy

§ Inclusive multilevel cache:
- Inner cache can only hold lines also present in outer cache
- External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
- Inner cache may hold lines not in outer cache
- Swap lines between inner/outer caches on miss
- Used in AMD Athlon with 64KB primary and 256KB secondary cache
§ Why choose one type or the other?

Power 7 On-Chip Caches [IBM 2009]

§ 32KB L1 I$/core, 32KB L1 D$/core, 3-cycle latency
§ 256KB Unified L2$/core, 8-cycle latency
§ 32MB Unified Shared L3$, Embedded DRAM (eDRAM), 25-cycle latency to local slice

Prefetching

§ Speculate on future instruction and data accesses and fetch them into cache(s)
- Instruction accesses easier to predict than data accesses
§ Varieties of prefetching
- Hardware prefetching
- Software prefetching
- Mixed schemes
§ What types of misses does prefetching affect?

Issues in Prefetching

§ Usefulness: should produce hits
§ Timeliness: not late and not too early
§ Cache and bandwidth pollution

[Figure: CPU with register file (RF), L1 Instruction and L1 Data caches, and Unified L2 Cache; prefetched data is brought into the caches]

Hardware Instruction Prefetching

§ Instruction prefetch in Alpha AXP 21064
- Fetch two lines on a miss; the requested line (i) and the next consecutive line (i+1)
- Requested line placed in cache, and next line in instruction stream buffer
- If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)

[Figure: CPU with register file (RF), L1 Instruction cache, stream buffer, and Unified L2 Cache; the requested line goes from L2 to the L1 Instruction cache and the prefetched instruction line goes to the stream buffer]
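The stream-buffer behaviour described for the Alpha 21064 can be sketched as a small state machine. This is a simplified illustration, not the actual hardware: it uses a single-entry stream buffer, a toy 8-line direct-mapped instruction cache, and assumes every prefetch arrives before it is needed. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { ICACHE_LINES = 8 };  /* hypothetical toy direct-mapped I-cache */

typedef struct {
    uint32_t line[ICACHE_LINES];
    bool     valid[ICACHE_LINES];
    uint32_t sb_line;   /* single-entry stream buffer */
    bool     sb_valid;
} ifetch_state_t;

/* Fetch one instruction line. Returns true when the request missed in
 * both the cache and the stream buffer, so it must wait for L2. */
static bool fetch(ifetch_state_t *s, uint32_t line)
{
    unsigned idx = line % ICACHE_LINES;
    if (s->valid[idx] && s->line[idx] == line)
        return false;                        /* cache hit */

    bool stream_hit = s->sb_valid && s->sb_line == line;
    s->line[idx]  = line;                    /* requested line enters the cache */
    s->valid[idx] = true;
    s->sb_line    = line + 1;                /* next line enters the stream buffer */
    s->sb_valid   = true;
    return !stream_hit;
}

/* Count how many fetches in a trace of line numbers wait for L2. */
static unsigned count_l2_stalls(const uint32_t *trace, size_t n)
{
    ifetch_state_t s = {0};
    unsigned stalls = 0;
    for (size_t i = 0; i < n; i++)
        if (fetch(&s, trace[i]))
            stalls++;
    return stalls;
}
```

On the sequential trace 10, 11, 12, 13 only the first fetch stalls, because each later line is already waiting in the stream buffer. On the non-sequential trace 10, 20, 30 every fetch stalls.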

Hardware Data Prefetching

§ Prefetch-on-miss:
- Prefetch b+1 upon miss on b
§ One-Block Lookahead (OBL) scheme
- Initiate prefetch for block b+1 when block b is accessed
- Why is this different from doubling block size?
- Can extend to N-block lookahead
§ Strided prefetch
- If observe sequence of accesses to line b, b+N, b+2N, then prefetch b+3N etc.
§ Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
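The strided scheme amounts to remembering the last line and the last stride, then issuing a prefetch once the same stride repeats. The sketch below tracks a single stream and prefetches one line ahead. Real prefetchers such as the Power 5 example track several streams and run further ahead, and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t last_line;  /* most recently accessed line          */
    int64_t stride;     /* difference between the last two lines */
    bool    seen;       /* false until the first access arrives  */
} stride_state_t;

/* Observe one access. Returns true and writes *prefetch_line when the
 * same non-zero stride has been seen twice in a row (b, b+N, b+2N),
 * in which case the target is the next line in the sequence (b+3N). */
static bool observe(stride_state_t *s, int64_t line, int64_t *prefetch_line)
{
    bool issue = false;
    if (s->seen) {
        int64_t d = line - s->last_line;
        if (d != 0 && d == s->stride) {
            *prefetch_line = line + d;
            issue = true;
        }
        s->stride = d;
    }
    s->last_line = line;
    s->seen = true;
    return issue;
}

/* Replay a trace and return the last prefetch target, or -1 if none. */
static int64_t last_prefetch(const int64_t *trace, size_t n)
{
    stride_state_t s = {0};
    int64_t target = -1, p = -1;
    for (size_t i = 0; i < n; i++)
        if (observe(&s, trace[i], &p))
            target = p;
    return target;
}
```

For the trace 100, 104, 108 the stride 4 is confirmed on the third access and line 112 is prefetched. For 100, 104, 109 the stride never repeats, so nothing is issued.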

Prefetch-On-Miss vs. Tagged Prefetch

“Data Prefetch Mechanisms”, S. P. VanderWiel and D. J. Lilja, University of Minnesota

Software Prefetching

for (i = 0; i < N; i++) {
    prefetch(&a[i + 1]);
    prefetch(&b[i + 1]);
    SUM = SUM + a[i] * b[i];
}

Software Prefetching Issues

§ Timing is the biggest issue, not predictability
- If you prefetch very close to when the data is required, you might be too late
- Prefetch too early, cause pollution
- Estimate how long it will take for the data to come into L1, so we can set P appropriately
- Why is this hard to do?

for (i = 0; i < N; i++) {
    prefetch(&a[i + P]);
    prefetch(&b[i + P]);
    SUM = SUM + a[i] * b[i];
}

Must consider cost of prefetch instructions
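The slide's `prefetch()` is pseudocode. One way to write a compilable version of the same loop is with the GCC/Clang builtin `__builtin_prefetch`, as sketched below. The distance of 16 elements is a placeholder assumption rather than a recommended value, and the name `dot_prefetch` is hypothetical. On other compilers the loop simply runs without prefetch hints.

```c
#include <stddef.h>

/* Hypothetical prefetch distance P, in elements. The right value depends
 * on memory latency and on how long one loop iteration takes, so it has
 * to be tuned per machine. */
#define PREFETCH_DISTANCE 16

/* Dot product with software prefetch P elements ahead of the current one. */
static double dot_prefetch(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Only form addresses that lie inside the arrays. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3); /* read, keep in cache */
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        sum += a[i] * b[i];
    }
    return sum;
}
```

The bounds check and the two extra instructions per iteration are part of the cost the slide warns about. Issuing one prefetch per cache line instead of one per element is a common way to reduce that overhead.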

Compiler Optimizations

§ Restructuring code affects the data access sequence
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality
§ Prevent data from entering the cache
- Useful for variables that will only be accessed once before being replaced
- Needs mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
§ Kill data that will never be used again
- Streaming data exploits spatial locality but not temporal locality
- Replace into dead cache locations
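Loop fusion is one example of re-ordering accesses to improve temporal locality. It is chosen here for illustration and is not named in the lecture. The sketch computes the same two statistics with two passes and with one fused pass. The type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    long sum;     /* sum of the elements        */
    long sum_sq;  /* sum of the squared elements */
} stats_t;

/* Two passes: if the array is larger than the cache, every element is
 * fetched from memory twice. */
static stats_t stats_two_pass(const int *a, size_t n)
{
    stats_t s = {0, 0};
    for (size_t i = 0; i < n; i++)
        s.sum += a[i];
    for (size_t i = 0; i < n; i++)
        s.sum_sq += (long)a[i] * a[i];
    return s;
}

/* Fused: both uses of a[i] happen back to back, while the value is
 * still in a register or in the L1 cache. */
static stats_t stats_fused(const int *a, size_t n)
{
    stats_t s = {0, 0};
    for (size_t i = 0; i < n; i++) {
        s.sum    += a[i];
        s.sum_sq += (long)a[i] * a[i];
    }
    return s;
}
```

Both versions return identical results. The fused loop roughly halves the memory traffic for arrays that do not fit in the cache, although the actual benefit depends on the machine and the compiler may perform this transformation on its own.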

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)