TRANSCRIPT
WU UCB CS252 SP17
CS252 Spring 2017 Graduate Computer Architecture
Lecture 11: Memory
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Logistics for the 15-min meeting next Tuesday
• Email me one email per project group by Sunday evening 2/26
• 2-5 sentences on what the project objective is (what problem are you solving and what solution are you proposing – very briefly)
• Bulleted list of what each person is responsible for
• The infrastructure you want to use to evaluate your solution
© Krste Asanovic, 2015. CS252, Fall 2015, Lecture 11
Last Time in Lecture 10
VLIW Machines
§ Compiler-controlled static scheduling
§ Loop unrolling
§ Software pipelining
§ Trace scheduling
§ Rotating register file
§ Predication
§ Limits of static scheduling
Early Read-Only Memory Technologies
Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
Punched paper tape, instruction stream in Harvard Mk 1
IBM Card Capacitor ROS
IBM Balanced Capacitor ROS
Diode Matrix, EDSAC-2 µcode store
Card Capacitor ROS for IBM System/360 Model 30
http://www.glennsmuseum.com/ibm/ibm.html
Early Read/Write Main Memory Technologies
Williams Tube, Manchester Mark 1, 1947
Babbage, 1800s: Digits stored on mechanical wheels
Mercury Delay Line, Univac 1, 1951
Also, regenerative capacitor memory on Atanasoff-Berry computer, and rotating magnetic drum memory on IBM 650
MIT Whirlwind Core Memory
Core Memory
§ Core memory was first large-scale reliable main memory
- invented by Forrester in late 40s/early 50s at MIT for Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write cell and also sense original state (destructive reads)
§ Robust, non-volatile storage
§ Used on Space Shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1 µs
DEC PDP-8/E Board, 4K words x 12 bits (1968)
Semiconductor Memory
§ Semiconductor memory began to be competitive in early 1970s
- Intel formed to exploit market for semiconductor memory
- Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was Intel 1103
- 1Kbit of storage on single chip
- charge on a capacitor used to hold value
- Value has to be regularly read and written back, hence dynamic
§ Semiconductor memory quickly replaced core in '70s
One-Transistor Dynamic RAM [Dennard, IBM]
[Figure: 1-T DRAM cell — word line gates an access transistor connecting the bit line to a storage capacitor (FET gate, trench, stack) referenced to VREF]
[Figure: cell cross-section — TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly wordline over the access transistor]
Memory cells: Using fewer transistors over time
http://www.intechopen.com/books/advances-in-solid-state-circuit-technologies/dimension-increase-in-metal-oxide-semiconductor-memories-and-transistors
Modern DRAM Cell Structure
[Samsung, sub-70nm DRAM, 2004]
DRAM Conceptual Architecture
[Figure: an N+M-bit address is split — N row bits feed a row address decoder selecting one of 2^N wordlines (Row 1 .. Row 2^N), M column bits feed a column decoder & sense amplifiers selecting among 2^M bitline columns (Col. 1 .. Col. 2^M); a memory cell (one bit) sits at each wordline/bitline intersection; selected data appears on D]
§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
- each logical bank physically implemented as many smaller arrays
DRAM Physical Layout
II. DRAM TECHNOLOGY AND ARCHITECTURE
DRAMs are commoditized high volume products which need to have very low manufacturing costs. This puts significant constraints on the technology and architecture. The three most important factors for cost are the cost of a wafer, the yield and the die area. Cost of a wafer can be kept low if a simple transistor process and few metal levels are used. Yield can be optimized by process optimization and by optimizing the amount of redundancy. Die area optimization is achieved by keeping the array efficiency (ratio of cell area to total die area) as high as possible. The optimum approach changes very little even when the cell area is shrunk significantly over generations. DRAMs today use a transistor process with few junction optimizations, poly-Si gates and relatively high threshold voltage to suppress leakage. This process is much less expensive than a logic process but also much lower performance. It requires higher operating voltages than both high performance and low active power logic processes. Keeping array efficiency constant requires shrinking the area of the logic circuits on a DRAM at the same rate as the cell area. This is difficult as it is easier to shrink the very regular pattern of the cells than lithographically more complex circuitry. In addition the increasing complexity of the interface requires more circuit area.
Figure 1 shows the floorplan of a typical modern DDR2 or DDR3 DRAM and an enlargement of the cell array. The eight array blocks correspond to the eight banks of the DRAM. Row logic to decode the row address, implement row redundancy and drive the master wordlines is placed between the banks. At the other edge of the banks column
logic includes column address decoding, column redundancy and drivers for the column select lines as well as the secondary sense-amplifiers which sense or drive the array master data lines. The center stripe contains all other logic: the data and control pads and interface, central control logic, voltage regulators and pumps of the power system and circuitry to support efficient manufacturing test. Circuitry and buses in the center stripe are usually shared between banks to save area. Concurrent operation of banks is therefore limited to that portion of an operation that takes place inside a bank. For example the delay between two activate commands to different banks is limited by the time it takes to decode commands and addresses and trigger the command at a bank. Interleaving of reads and writes from and to different banks is limited by data contention on the shared data bus in the center stripe. Operations inside different banks can take place concurrently; one bank can for example be activating or precharging a wordline while another bank is simultaneously streaming data.
The enlargement of a small part of an array block at the right side of Figure 1 shows the hierarchical structure of the array block. Hierarchical wordlines and array data lines which were first developed in the early 1990s [5], [6] are now used by all major DRAM vendors. Master wordlines, column select lines and master array data lines are the interface between the array block and the rest of the DRAM circuitry. Individual cells connect to local wordlines and bitlines; bitlines are sensed or driven by bitline sense-amplifiers which connect to column select lines and local array data lines. The circuitry making the connection between local lines and master lines is placed in the local wordline driver stripe and bitline sense-amplifier stripe.
[Figure 1. Physical floorplan of a DRAM, with enlargement of a sub-array: row logic and column logic at the edges of each array block (bold line); center stripe with buffers, 1:8 serializer and driver (begin of write data bus), and control logic; column select lines and master array data lines (M3 - Al), local array data lines, master wordlines (M2 - Al), local wordlines (gate poly), bitlines (M1 - W), local wordline driver stripe, bitline sense-amplifier stripe. A DRAM actually contains a very large number of small DRAMs called sub-arrays.]
[Vogelsang, MICRO-2010]
DRAM Packaging (Laptops/Desktops/Servers)
§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return wide word (e.g., 64-bit data bus using 16 x 4-bit parts)
[Figure: DRAM chip interface — ~12 address lines (multiplexed row/column address), ~7 clock and control signals, data bus (4b, 8b, 16b, 32b)]
DRAM Packaging, Mobile Devices
[Apple A4 package cross-section, iFixit 2010: two stacked DRAM die, processor plus logic die]
[Apple A4 package on circuit board]
DRAM Operation
§ Three steps in read/write access to a given bank
§ Row access (RAS)
- decode row address, enable addressed row (often multiple Kb in row)
- bitlines share charge with storage cell
- small change in voltage detected by sense amplifiers which latch whole row of bits
- sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
- decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
- on read, send latched bits out to chip pins
- on write, change sense amplifier latches which then charge storage cells to required value
- can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
- charges bitlines to known value, required before next row access
§ Each step has a latency of around 15-20 ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture
DRAM Read
[IBM Application Note on Understanding DRAM Operation]
DRAM Read
DRAM Write
[IBM Application Note on Understanding DRAM Operation]
DRAM Write
Memory Parameters
§ Latency
- Time from initiation to completion of one memory read (e.g., in nanoseconds, or in CPU or DRAM clock cycles)
§ Occupancy
- Time that a memory bank is busy with one request
- Usually the important parameter for a memory write
§ Bandwidth
- Rate at which requests can be processed (accesses/sec, or GB/s)
§ All can vary significantly for reads vs. writes, or address, or address history (e.g., open/close page on DRAM bank)
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. time, 1980-2000 — µProc improving 60%/year vs. DRAM 7%/year; Processor-Memory Performance Gap: (growing 50%/yr)]
Four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during time for one memory access!
Physical Size Affects Latency
[Figure: small memory close to CPU vs. big memory far from CPU]
§ Signals have further to travel
§ Fanout to more locations
Two predictable properties of memory references:
§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
Memory Reference Patterns
[Figure: memory address (one dot per access) vs. time, with regions illustrating spatial locality and temporal locality]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Memory Hierarchy
§ Small, fast memory near processor to buffer accesses to big, slow memory
- Make combination look like a big, fast memory
§ Keep recently accessed data in small fast memory closer to processor to exploit temporal locality
- Cache replacement policy favors recently accessed data
§ Fetch words around requested word to exploit spatial locality
- Use multiword cache lines, and prefetching
Management of Memory Hierarchy
§ Small/fast storage, e.g., registers
- Address usually specified in instruction
- Generally implemented directly as a register file
- but hardware might do things behind software's back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
- Address usually computed from values in register
- Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
- but software may provide "hints", e.g., don't cache or prefetch
Important Cache Parameters (Review)
§ Capacity (in bytes)
§ Associativity (from direct-mapped to fully associative)
§ Line size (bytes sharing a tag)
§ Write-back versus write-through
§ Write-allocate versus write no-allocate
§ Replacement policy (least recently used, random)
Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
What is best cache design for 5-stage pipeline?
Biggest cache that doesn't increase hit time past 1 cycle (approx 8-32KB in modern technology)
[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
Causes of Cache Misses: The 3 C's
§ Compulsory: first reference to a line (a.k.a. cold start misses)
- misses that would occur even with infinite cache
§ Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under perfect replacement policy
§ Conflict: misses that occur because of collisions due to line-placement strategy
- misses that would not occur with ideal full associativity
Effect of Cache Parameters on Performance
§ Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
§ Higher associativity
+ reduces conflict misses
- may increase hit time
§ Larger line size
+ reduces compulsory misses
- increases conflict misses and miss penalty
Multilevel Caches
Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level
CPU <-> L1$ <-> L2$ <-> DRAM
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
Presence of L2 influences L1 design
§ Use smaller L1 if there is also L2
- Trade increased L1 miss rate for reduced L1 hit time
- Backup L2 reduces L1 miss penalty
- Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
- Write-back L2 cache absorbs write traffic, doesn't go off-chip
- At most one L1 miss request per L1 access (no dirty victim writeback) simplifies pipeline control
- Simplifies coherence issues
- Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
Inclusion Policy
§ Inclusive multilevel cache:
- Inner cache can only hold lines also present in outer cache
- External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
- Inner cache may hold lines not in outer cache
- Swap lines between inner/outer caches on miss
- Used in AMD Athlon with 64KB primary and 256KB secondary cache
§ Why choose one type or the other?
Power7 On-Chip Caches [IBM 2009]
32KB L1 I$/core, 32KB L1 D$/core, 3-cycle latency
256KB Unified L2$/core, 8-cycle latency
32MB Unified Shared L3$, Embedded DRAM (eDRAM), 25-cycle latency to local slice
Prefetching
§ Speculate on future instruction and data accesses and fetch them into cache(s)
- Instruction accesses easier to predict than data accesses
§ Varieties of prefetching
- Hardware prefetching
- Software prefetching
- Mixed schemes
§ What types of misses does prefetching affect?
Issues in Prefetching
§ Usefulness – should produce hits
§ Timeliness – not late and not too early
§ Cache and bandwidth pollution
[Figure: CPU and RF with split L1 Instruction and L1 Data caches backed by a Unified L2 Cache; prefetched data shown entering the caches]
Hardware Instruction Prefetching
§ Instruction prefetch in Alpha AXP 21064
- Fetch two lines on a miss; the requested line (i) and the next consecutive line (i+1)
- Requested line placed in cache, and next line in instruction stream buffer
- If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)
[Figure: CPU and RF fetch from L1 Instruction cache; on a miss the requested line comes from the Unified L2 Cache while the prefetched instruction line goes into a Stream Buffer]
Hardware Data Prefetching
§ Prefetch-on-miss:
- Prefetch b+1 upon miss on b
§ One-Block Lookahead (OBL) scheme
- Initiate prefetch for block b+1 when block b is accessed
- Why is this different from doubling block size?
- Can extend to N-block lookahead
§ Strided prefetch
- If observe sequence of accesses to line b, b+N, b+2N, then prefetch b+3N etc.
§ Example: IBM Power5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
Prefetch-On-Miss vs. Tagged Prefetch
"Data Prefetch Mechanisms", S. P. VanderWiel and D. J. Lilja, University of Minnesota
Software Prefetching

for (i = 0; i < N; i++) {
    prefetch( &a[i + 1] );
    prefetch( &b[i + 1] );
    SUM = SUM + a[i] * b[i];
}
Software Prefetching Issues
§ Timing is the biggest issue, not predictability
- If you prefetch very close to when the data is required, you might be too late
- Prefetch too early, cause pollution
- Estimate how long it will take for the data to come into L1, so we can set P appropriately
- Why is this hard to do?

for (i = 0; i < N; i++) {
    prefetch( &a[i + P] );
    prefetch( &b[i + P] );
    SUM = SUM + a[i] * b[i];
}

Must consider cost of prefetch instructions
Compiler Optimizations
§ Restructuring code affects the data access sequence
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality
§ Prevent data from entering the cache
- Useful for variables that will only be accessed once before being replaced
- Needs mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
§ Kill data that will never be used again
- Streaming data exploits spatial locality but not temporal locality
- Replace into dead cache locations
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)