TRANSCRIPT
WU UCB CS252 SP17
CS252 Spring 2017 Graduate Computer Architecture
Lecture 11: Memory
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Logistics for the 15-min meeting next Tuesday
• Email me one email per project group by Sunday evening 2/26
• 2-5 sentences on what the project objective is (what problem are you solving and what solution are you proposing – very briefly)
• Bulleted list of what each person is responsible for
• The infrastructure you want to use to evaluate your solution
© Krste Asanovic, 2015. CS252, Fall 2015, Lecture 11
Last Time in Lecture 10
VLIW Machines
§ Compiler-controlled static scheduling
§ Loop unrolling
§ Software pipelining
§ Trace scheduling
§ Rotating register file
§ Predication
§ Limits of static scheduling
Early Read-Only Memory Technologies
Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
Punched paper tape, instruction stream in Harvard Mk 1
IBM Card Capacitor ROS
IBM Balanced Capacitor ROS
Diode Matrix, EDSAC-2 µcode store
Card Capacitor ROS for IBM System/360 Model 30
http://www.glennsmuseum.com/ibm/ibm.html
Early Read/Write Main Memory Technologies
Williams Tube, Manchester Mark 1, 1947
Babbage, 1800s: Digits stored on mechanical wheels
Mercury Delay Line, Univac 1, 1951
Also, regenerative capacitor memory on Atanasoff-Berry computer, and rotating magnetic drum memory on IBM 650
MIT Whirlwind Core Memory
Core Memory
§ Core memory was first large-scale reliable main memory
- invented by Forrester in late 40s/early 50s at MIT for Whirlwind project
§ Bits stored as magnetization polarity on small ferrite cores threaded onto two-dimensional grid of wires
§ Coincident current pulses on X and Y wires would write cell and also sense original state (destructive reads)
§ Robust, non-volatile storage
§ Used on Space Shuttle computers
§ Cores threaded onto wires by hand (25 billion a year at peak production)
§ Core access time ~1 µs
DEC PDP-8/E Board, 4K words x 12 bits (1968)
Semiconductor Memory
§ Semiconductor memory began to be competitive in early 1970s
- Intel formed to exploit market for semiconductor memory
- Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
§ First commercial Dynamic RAM (DRAM) was Intel 1103
- 1Kbit of storage on single chip
- charge on a capacitor used to hold value
- Value has to be regularly read and written back, hence dynamic
§ Semiconductor memory quickly replaced core in '70s
One-Transistor Dynamic RAM [Dennard, IBM]
[Figure: 1-T DRAM cell — word line gates an access transistor connecting the bit line to a storage capacitor (FET gate, trench, stack) referenced to VREF]
[Figure: cell cross-section — TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly wordline over the access transistor]
Memory cells: Using fewer transistors over time
http://www.intechopen.com/books/advances-in-solid-state-circuit-technologies/dimension-increase-in-metal-oxide-semiconductor-memories-and-transistors
Modern DRAM Cell Structure
[Samsung, sub-70nm DRAM, 2004]
DRAM Conceptual Architecture
[Figure: an N+M-bit address is split — N row bits feed a row address decoder selecting one of 2^N wordlines (Row 1 .. Row 2^N), M column bits feed a column decoder & sense amplifiers selecting among 2^M bitline columns (Col. 1 .. Col. 2^M); a memory cell (one bit) sits at each wordline/bitline intersection; selected data appears on D]
§ Bits stored in 2-dimensional arrays on chip
§ Modern chips have around 4-8 logical banks on each chip
- each logical bank physically implemented as many smaller arrays
DRAM Physical Layout
II. DRAM TECHNOLOGY AND ARCHITECTURE
DRAMs are commoditized high volume products which need to have very low manufacturing costs. This puts significant constraints on the technology and architecture. The three most important factors for cost are the cost of a wafer, the yield and the die area. Cost of a wafer can be kept low if a simple transistor process and few metal levels are used. Yield can be optimized by process optimization and by optimizing the amount of redundancy. Die area optimization is achieved by keeping the array efficiency (ratio of cell area to total die area) as high as possible. The optimum approach changes very little even when the cell area is shrunk significantly over generations. DRAMs today use a transistor process with few junction optimizations, poly-Si gates and relatively high threshold voltage to suppress leakage. This process is much less expensive than a logic process but also much lower performance. It requires higher operating voltages than both high performance and low active power logic processes. Keeping array efficiency constant requires shrinking the area of the logic circuits on a DRAM at the same rate as the cell area. This is difficult as it is easier to shrink the very regular pattern of the cells than lithographically more complex circuitry. In addition the increasing complexity of the interface requires more circuit area.
Figure 1 shows the floorplan of a typical modern DDR2 or DDR3 DRAM and an enlargement of the cell array. The eight array blocks correspond to the eight banks of the DRAM. Row logic to decode the row address, implement row redundancy and drive the master wordlines is placed between the banks. At the other edge of the banks column
logic includes column address decoding, column redundancy and drivers for the column select lines as well as the secondary sense-amplifiers which sense or drive the array master data lines. The center stripe contains all other logic: the data and control pads and interface, central control logic, voltage regulators and pumps of the power system and circuitry to support efficient manufacturing test. Circuitry and buses in the center stripe are usually shared between banks to save area. Concurrent operation of banks is therefore limited to that portion of an operation that takes place inside a bank. For example the delay between two activate commands to different banks is limited by the time it takes to decode commands and addresses and trigger the command at a bank. Interleaving of reads and writes from and to different banks is limited by data contention on the shared data bus in the center stripe. Operations inside different banks can take place concurrently; one bank can for example be activating or precharging a wordline while another bank is simultaneously streaming data.
The enlargement of a small part of an array block at the right side of Figure 1 shows the hierarchical structure of the array block. Hierarchical wordlines and array data lines which were first developed in the early 1990s [5], [6] are now used by all major DRAM vendors. Master wordlines, column select lines and master array data lines are the interface between the array block and the rest of the DRAM circuitry. Individual cells connect to local wordlines and bitlines; bitlines are sensed or driven by bitline sense-amplifiers which connect to column select lines and local array data lines. The circuitry making the connection between local lines and master lines is placed in the local wordline driver stripe and bitline sense-amplifier stripe.
[Figure 1. Physical floorplan of a DRAM, with enlargement of a sub-array: row logic and column logic at the edges of each array block (bold line); center stripe with buffers, 1:8 serializer and driver (begin of write data bus), and control logic; column select lines and master array data lines (M3 - Al), local array data lines, master wordlines (M2 - Al), local wordlines (gate poly), bitlines (M1 - W), local wordline driver stripe, bitline sense-amplifier stripe. A DRAM actually contains a very large number of small DRAMs called sub-arrays.]
[Vogelsang, MICRO-2010]
DRAM Packaging (Laptops/Desktops/Servers)
§ DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
§ Data pins work together to return wide word (e.g., 64-bit data bus using 16 x 4-bit parts)
[Figure: DRAM chip interface — ~12 address lines (multiplexed row/column address), ~7 clock and control signals, data bus (4b, 8b, 16b, 32b)]
DRAM Packaging, Mobile Devices
[Apple A4 package cross-section, iFixit 2010: two stacked DRAM die, processor plus logic die]
[Apple A4 package on circuit board]
DRAM Operation
§ Three steps in read/write access to a given bank
§ Row access (RAS)
- decode row address, enable addressed row (often multiple Kb in row)
- bitlines share charge with storage cell
- small change in voltage detected by sense amplifiers which latch whole row of bits
- sense amplifiers drive bitlines full rail to recharge storage cells
§ Column access (CAS)
- decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
- on read, send latched bits out to chip pins
- on write, change sense amplifier latches which then charge storage cells to required value
- can perform multiple column accesses on same row without another row access (burst mode)
§ Precharge
- charges bitlines to known value, required before next row access
§ Each step has a latency of around 15-20 ns in modern DRAMs
§ Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture
DRAM Read
[IBM Application Note on Understanding DRAM Operation]
DRAM Read
DRAM Write
[IBM Application Note on Understanding DRAM Operation]
DRAM Write
Memory Parameters
§ Latency
- Time from initiation to completion of one memory read (e.g., in nanoseconds, or in CPU or DRAM clock cycles)
§ Occupancy
- Time that a memory bank is busy with one request
- Usually the important parameter for a memory write
§ Bandwidth
- Rate at which requests can be processed (accesses/sec, or GB/s)
§ All can vary significantly for reads vs. writes, or address, or address history (e.g., open/close page on DRAM bank)
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. time, 1980-2000 — µProc improving 60%/year vs. DRAM 7%/year; Processor-Memory Performance Gap: (growing 50%/yr)]
Four-issue 3GHz superscalar accessing 100ns DRAM could execute 1,200 instructions during time for one memory access!
Physical Size Affects Latency
[Figure: small memory close to CPU vs. big memory far from CPU]
§ Signals have further to travel
§ Fanout to more locations
Two predictable properties of memory references:
§ Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§ Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
Memory Reference Patterns
[Figure: memory address (one dot per access) vs. time, with regions illustrating spatial locality and temporal locality]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Memory Hierarchy
§ Small, fast memory near processor to buffer accesses to big, slow memory
- Make combination look like a big, fast memory
§ Keep recently accessed data in small fast memory closer to processor to exploit temporal locality
- Cache replacement policy favors recently accessed data
§ Fetch words around requested word to exploit spatial locality
- Use multiword cache lines, and prefetching
Management of Memory Hierarchy
§ Small/fast storage, e.g., registers
- Address usually specified in instruction
- Generally implemented directly as a register file
- but hardware might do things behind software's back, e.g., stack management, register renaming
§ Larger/slower storage, e.g., main memory
- Address usually computed from values in register
- Generally implemented as a hardware-managed cache hierarchy (hardware decides what is kept in fast memory)
- but software may provide "hints", e.g., don't cache or prefetch
Important Cache Parameters (Review)
§ Capacity (in bytes)
§ Associativity (from direct-mapped to fully associative)
§ Line size (bytes sharing a tag)
§ Write-back versus write-through
§ Write-allocate versus write no-allocate
§ Replacement policy (least recently used, random)
Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
What is best cache design for 5-stage pipeline?
Biggest cache that doesn't increase hit time past 1 cycle (approx 8-32KB in modern technology)
[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
Causes of Cache Misses: The 3 C's
§ Compulsory: first reference to a line (a.k.a. cold start misses)
- misses that would occur even with infinite cache
§ Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under perfect replacement policy
§ Conflict: misses that occur because of collisions due to line-placement strategy
- misses that would not occur with ideal full associativity
Effect of Cache Parameters on Performance
§ Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
§ Higher associativity
+ reduces conflict misses
- may increase hit time
§ Larger line size
+ reduces compulsory misses
- increases conflict misses and miss penalty
Multilevel Caches
Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level
CPU <-> L1$ <-> L2$ <-> DRAM
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
Presence of L2 influences L1 design
§ Use smaller L1 if there is also L2
- Trade increased L1 miss rate for reduced L1 hit time
- Backup L2 reduces L1 miss penalty
- Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
- Write-back L2 cache absorbs write traffic, doesn't go off-chip
- At most one L1 miss request per L1 access (no dirty victim writeback) simplifies pipeline control
- Simplifies coherence issues
- Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
Inclusion Policy
§ Inclusive multilevel cache:
- Inner cache can only hold lines also present in outer cache
- External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
- Inner cache may hold lines not in outer cache
- Swap lines between inner/outer caches on miss
- Used in AMD Athlon with 64KB primary and 256KB secondary cache
§ Why choose one type or the other?
Power7 On-Chip Caches [IBM 2009]
32KB L1 I$/core, 32KB L1 D$/core, 3-cycle latency
256KB Unified L2$/core, 8-cycle latency
32MB Unified Shared L3$, Embedded DRAM (eDRAM), 25-cycle latency to local slice
Prefetching
§ Speculate on future instruction and data accesses and fetch them into cache(s)
- Instruction accesses easier to predict than data accesses
§ Varieties of prefetching
- Hardware prefetching
- Software prefetching
- Mixed schemes
§ What types of misses does prefetching affect?
Issues in Prefetching
§ Usefulness – should produce hits
§ Timeliness – not late and not too early
§ Cache and bandwidth pollution
[Figure: CPU and RF with split L1 Instruction and L1 Data caches backed by a Unified L2 Cache; prefetched data shown entering the caches]
Hardware Instruction Prefetching
§ Instruction prefetch in Alpha AXP 21064
- Fetch two lines on a miss; the requested line (i) and the next consecutive line (i+1)
- Requested line placed in cache, and next line in instruction stream buffer
- If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)
[Figure: CPU and RF fetch from L1 Instruction cache; on a miss the requested line comes from the Unified L2 Cache while the prefetched instruction line goes into a Stream Buffer]
Hardware Data Prefetching
§ Prefetch-on-miss:
- Prefetch b+1 upon miss on b
§ One-Block Lookahead (OBL) scheme
- Initiate prefetch for block b+1 when block b is accessed
- Why is this different from doubling block size?
- Can extend to N-block lookahead
§ Strided prefetch
- If observe sequence of accesses to line b, b+N, b+2N, then prefetch b+3N etc.
§ Example: IBM Power5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
Prefetch-On-Miss vs. Tagged Prefetch
"Data Prefetch Mechanisms", S. P. VanderWiel and D. J. Lilja, University of Minnesota
Software Prefetching

for (i = 0; i < N; i++) {
    prefetch( &a[i + 1] );
    prefetch( &b[i + 1] );
    SUM = SUM + a[i] * b[i];
}
Software Prefetching Issues
§ Timing is the biggest issue, not predictability
- If you prefetch very close to when the data is required, you might be too late
- Prefetch too early, cause pollution
- Estimate how long it will take for the data to come into L1, so we can set P appropriately
- Why is this hard to do?

for (i = 0; i < N; i++) {
    prefetch( &a[i + P] );
    prefetch( &b[i + P] );
    SUM = SUM + a[i] * b[i];
}

Must consider cost of prefetch instructions
Compiler Optimizations
§ Restructuring code affects the data access sequence
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality
§ Prevent data from entering the cache
- Useful for variables that will only be accessed once before being replaced
- Needs mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
§ Kill data that will never be used again
- Streaming data exploits spatial locality but not temporal locality
- Replace into dead cache locations
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)