Understanding DRAM Architecture
R. Govindarajan Computer Science & Automation
Supercomputer Edn. & Res. Centre
Indian Institute of Science, Bangalore
Why Study Memory System?
• Memory Wall [McKee’94]
– CPU-Memory speed disparity
– 100’s of cycles for off-chip access
[Figure: processor-memory performance gap over time. Processor performance doubles every 1.5 years (Moore's Law); DRAM performance doubles every 10 years; the gap grows ~50% per year.]
Memory Hierarchy: Recap
[Figure: CPU with MMU, split L1 I-cache and L1 D-cache, a unified L2 cache, and main memory.]
Memory Hierarchy in Multicore
[Figure: four cores (C0-C3), each with a private L1 cache, sharing an L2 cache and a memory controller in front of memory. Typical access latencies: 1-2 cycles for L1, 10-15 cycles for L2, 100-300 cycles for memory.]
[Figure: a sixteen-core processor (Core 0 ... Core 15), each core with private L1I/L1D and L2 caches, sharing an L3 cache (LLSC). On an L3 hit the data is returned on chip; on a miss the memory controller forwards the request to off-chip main memory.]
Memory Bandwidth Demand for Multicores
• Memory Wall [McKee’94]
– CPU-Memory speed disparity
– 100’s of cycles for off-chip access
• Bandwidth Wall [ISCA’09]
– More cores, but limited off-chip bandwidth
– Core counts double every 18 months
– Pin count grows by only 10%
Off-chip accesses are expensive! Memory system performance is critical.
Big Picture of Memory
[Figure: the memory controller issues Control and Address signals and exchanges Data with a DIMM for read and write operations; a DIMM is organized into ranks, and each rank into devices.]
Overview of a DRAM Memory Bank
[Figure: a DRAM bank is a 2-D array of cells organized into rows and columns, with bank logic and a row buffer at the edge of the array.]
Basic DRAM Operations
• ACTIVATE: bring one row of data from the DRAM core into the row buffer
• READ/WRITE: perform read/write operations on the contents of the row buffer
• PRECHARGE: write the row-buffer contents back to the DRAM core (ACTIVATE discharges the cell capacitors) and return the cells to a neutral voltage
[Figure: command timelines for a stream of load requests. A row-buffer miss requires PRECHARGE, ACTIVATE, then READ; a row-buffer hit needs only a READ.]
Row-buffer hits are faster and consume less power, as the latency sketch below illustrates.
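A minimal sketch of the hit/miss latency difference. The per-command cycle counts are illustrative assumptions (the slides give no concrete timings); only the command sequences come from the slide above.

```python
# Assumed latency (in DRAM cycles) per command -- illustrative, not vendor timings.
T_PRE, T_ACT, T_RD = 15, 15, 15

def access_latency(row_buffer_hit: bool) -> int:
    """Row-buffer hit: READ only. Miss: PRECHARGE + ACTIVATE + READ."""
    if row_buffer_hit:
        return T_RD
    return T_PRE + T_ACT + T_RD

print("hit :", access_latency(True))   # 15 cycles
print("miss:", access_latency(False))  # 45 cycles -- 3x the hit latency
```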
DRAM Bank Operation
(Slide Source: Onur Mutlu, CMU)
• Access (Row 0, Column 0): the row buffer is empty, so the row decoder is given row address 0 and Row 0 is activated into the row buffer; the column decoder then selects column address 0 and the data is driven out.
• Access (Row 0, Column 1): the row buffer already holds Row 0, a row-buffer HIT. Only a column access (column address 1) is needed.
• Access (Row 1, Column 0): the row buffer holds Row 0, a row-buffer CONFLICT! Row 0 must first be precharged back into the array, then row address 1 is activated into the row buffer, and only then can column address 0 be read.
A minimal open-row bank model that replays this sequence is sketched below.
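To make the walkthrough concrete, here is a minimal sketch of one bank under an open-row policy. The `Bank` class and its layout are illustrative assumptions; only the command names (ACT, RD, PRE) and the empty/hit/conflict cases come from the slides.

```python
class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row: int, col: int) -> list[str]:
        """Return the DRAM command sequence needed to serve this access."""
        if self.open_row is None:            # row buffer empty
            cmds = [f"ACT row {row}", f"RD col {col}"]
        elif self.open_row == row:           # row-buffer hit
            cmds = [f"RD col {col}"]
        else:                                # row-buffer conflict
            cmds = [f"PRE row {self.open_row}", f"ACT row {row}", f"RD col {col}"]
        self.open_row = row
        return cmds

bank = Bank()
for row, col in [(0, 0), (0, 1), (1, 0)]:
    print((row, col), "->", bank.access(row, col))
# (0, 0) -> ['ACT row 0', 'RD col 0']                (empty row buffer)
# (0, 1) -> ['RD col 1']                              (hit)
# (1, 0) -> ['PRE row 0', 'ACT row 1', 'RD col 0']    (conflict)
```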
DRAM Command Summary
Slide Source: S. Rixner
DRAM Memory Controller
• Frontend
– Request/Response Buffers
– Memory mapping
– Arbiter
• Controller Backend
– Command generator
– DRAM timing constraints must be obeyed
Slide Source: S. Rixner
Bank-Level Parallelism in DRAM
[Figure: the memory controller drives Control, Address, and Data buses shared by multiple banks. Loads Ld A0, Ld B1, Ld B2, and Ld C1 arrive in the memory request queue; their PRE, ACT, and RD command sequences target different banks and are issued back-to-back, overlapping in time, and a later access to an already-open row (Ld C1) needs only an RD.]
Bank-level parallelism:
• Improves performance through parallelism across banks and row-buffer hits
• Hurts performance due to bank-to-bank switch delays
A rough timing sketch of the overlap appears below.
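A crude sketch of why bank-level parallelism helps: command sequences to different banks can pipeline over the shared buses, while requests to a single conflicting bank serialize. The timings and the one-command-per-slot bus model are illustrative assumptions.

```python
T_CMD = 15          # assumed cycles per PRE/ACT/RD command
SEQ = 3 * T_CMD     # full PRE+ACT+RD sequence for one row-buffer miss

def serialized(n_requests: int) -> int:
    """All requests conflict in the same bank: sequences cannot overlap."""
    return n_requests * SEQ

def overlapped(n_requests: int, n_banks: int) -> int:
    """Requests spread across banks: after the first full sequence, each
    extra request only occupies one command slot on the shared bus."""
    if n_banks < n_requests:
        return serialized(n_requests)
    return SEQ + (n_requests - 1) * T_CMD

print(serialized(4))     # 180 cycles on one bank
print(overlapped(4, 4))  # 90 cycles spread over four banks
```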
DRAM Refresh
• Capacitors leak and lose charge, so the charge must be periodically restored
• JEDEC spec: at normal temperature, cell retention time is 64 ms; at high (extended) temperature, retention time halves to 32 ms
• The memory controller issues refresh operations periodically
[Figure: timeline of normal accesses interleaved with a refresh operation.]
Refresh overhead example:
• Assume 4 GB of DRAM with 2 KB pages, organized as 16 banks
• 2M pages total, 128K pages per bank
• Refreshing a page takes 20 ns (ACTIVATE + PRECHARGE)
• Refreshing all pages in a bank takes 128K × 20 ns ≈ 2.6 ms
• 2.6 ms out of every 64 ms: roughly 4% overhead!
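The slide's arithmetic, checked in Python; all numbers are from the slide.

```python
GiB = 2**30
capacity       = 4 * GiB      # 4 GB DRAM
page_size      = 2 * 2**10    # 2 KB pages (rows)
n_banks        = 16
t_refresh_page = 20e-9        # 20 ns per page (ACTIVATE + PRECHARGE)
t_retention    = 64e-3        # 64 ms retention at normal temperature

pages_total    = capacity // page_size            # 2M pages
pages_per_bank = pages_total // n_banks           # 128K pages per bank
t_bank_refresh = pages_per_bank * t_refresh_page  # ~2.6 ms
overhead       = t_bank_refresh / t_retention     # ~4%

print(f"{pages_total:,} pages, {pages_per_bank:,} per bank")
print(f"refresh time per bank: {t_bank_refresh * 1e3:.2f} ms")
print(f"overhead: {overhead:.1%}")
```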
Memory Access Scheduling: FCFS
(Slide Source: Onur Mutlu, CMU)
Request buffer, oldest first: Row 0;Col 0, Row 0;Col 1, Row 1;Col 0, Row 0;Col 4 (Row 0;Col 5 arrives later).
FCFS services requests strictly in arrival order:
• Row 0;Col 0: Row 0 is activated into the row buffer and column 0 is read
• Row 0;Col 1: row-buffer hit, column access only
• Row 1;Col 0: row-buffer conflict; Row 0 is precharged, Row 1 activated, column 0 read
• Row 0;Col 4 and Row 0;Col 5: another conflict; Row 1 must be closed and Row 0 re-activated
FCFS pays two row conflicts on this sequence.
Memory Access Scheduling: FR-FCFS
• A row-conflict memory access takes significantly longer than a row-hit access
• Current controllers take advantage of the row buffer
• Commonly used scheduling policy: FR-FCFS [Rixner, ISCA’00]
(1) Row-hit (column) first: service row-hit memory accesses first
(2) Oldest first: then service older accesses first
• This scheduling policy aims to maximize DRAM throughput
(Slide Source: Onur Mutlu, CMU)
A scheduler sketch contrasting FCFS and FR-FCFS follows.
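A minimal sketch of the two policies picking from a request buffer. The `(arrival_order, row, col)` tuple encoding and the loop driver are illustrative assumptions; the two selection rules are the ones named on the slide.

```python
def pick_fcfs(buffer, open_row):
    """FCFS: always service the oldest request."""
    return min(buffer)                     # smallest arrival order

def pick_fr_fcfs(buffer, open_row):
    """FR-FCFS: service the oldest row-hit if any, else the oldest request."""
    hits = [r for r in buffer if r[1] == open_row]
    return min(hits) if hits else min(buffer)

for pick in (pick_fcfs, pick_fr_fcfs):
    buf = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 0, 4)]
    open_row, order = 0, []
    while buf:
        req = pick(buf, open_row)
        buf.remove(req)
        open_row = req[1]                  # the row is (re)opened by this access
        order.append(req)
    print(pick.__name__, order)
# pick_fcfs    [(0,0,0), (1,0,1), (2,1,0), (3,0,4)]  -- two row conflicts
# pick_fr_fcfs [(0,0,0), (1,0,1), (3,0,4), (2,1,0)]  -- one row conflict
```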
Memory Access Scheduling: FR-FCFS
(Slide Source: Onur Mutlu, CMU)
Same request buffer as before: Row 0;Col 0, Row 0;Col 1, Row 1;Col 0, Row 0;Col 4 (Row 0;Col 5 arrives later).
FR-FCFS reorders around the open row:
• Row 0;Col 0, then Row 0;Col 1: serviced as row-buffer hits
• Row 0;Col 4, then Row 0;Col 5: also row hits, so they bypass the older Row 1;Col 0 request
• Row 1;Col 0: serviced last; Row 0 is precharged and Row 1 activated
Only one row conflict is paid, at the cost of delaying the Row 1 request.
Emerging Memory Technology
• Non-volatile memory technologies
– Phase Change Memory (PCM), Magnetic RAM (MRAM), Resistive RAM (RRAM), Spin Torque Transfer RAM (STT-RAM), …
(Slide Source: Moinuddin Qureshi, Georgia Tech.)
Emerging Memory Technology
• Phase Change Memory
– Data stored by changing the phase of a special material
– Data read by detecting the material’s resistance
– The phase change material (chalcogenide glass) exists in two states:
1. Amorphous: high resistivity (reset state, or 0)
2. Crystalline: low resistivity (set state, or 1)
– Non-volatility and low idle power (no refresh)
– Expected to scale (to 9 nm), denser than DRAM, and can store multiple bits/cell
– Higher write latency and write energy
– Endurance issues (a cell dies after ~10^8 writes)
(Slide Source: Onur Mutlu, CMU)
DRAM-PCM Hybrid Memory
• How should PCM-based (main) memory be organized?
• Hybrid PCM+DRAM:
– How to partition/migrate data between PCM and DRAM?
– Is DRAM a cache for PCM, or part of main memory?
– How to design the hardware and software?
(Slide Source: Onur Mutlu, CMU)
PCM-based Main Memory
• How should PCM-based (main) memory be organized?
• Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:
– How to redesign the entire hierarchy (and cores) to overcome PCM shortcomings?
(Slide Source: Onur Mutlu, CMU)
Expanding the Multicore Memory Hierarchy
[Figure: the four-core hierarchy from before (C0-C3 with private L1 caches and a shared L2), now with a DRAM cache inserted between the L2 and the memory controller in front of memory.]
Stacked DRAM
• DRAM vertically stacked over the processor die
• Stacked DRAMs offer:
– High bandwidth
– Large capacity
– Same or slightly lower latency
[Figure: two organizations, 3-D stacked DRAM and 2.5-D stacked DRAM.]
• Stacked DRAM can be used as a cache or as part of memory
Multicore with DRAM Cache
[Figure: processor with stacked DRAM. Core 0 ... Core N, each with private L1I/L1D, share an L2 (LLSC). A vertically stacked DRAM cache sits behind the L2: a hit is served from the DRAM cache; a miss goes through the memory controller to off-chip main memory.]
Overview of Different Designs
• A main-memory physical address PA_M maps to a DRAM-cache address: PA_DRAM$ = Hash(PA_M) + Offset
Problems in Architecting Large Caches
• Organizing at cache-line granularity (64 B) reduces wasted space and wasted bandwidth
• Problem: a cache of hundreds of MB needs a tag store of tens of MB
– E.g., a 256 MB DRAM cache needs a ~20 MB tag store (5 bytes/line); see the arithmetic below
• But big blocks have their own issues:
– Wasted off-chip bandwidth
– Wasted cache space
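Reproducing the tag-store arithmetic in Python; the 5 bytes of tag/metadata per line is the slide's estimate.

```python
MiB = 2**20
cache_size   = 256 * MiB
line_size    = 64        # bytes per cache line
tag_per_line = 5         # bytes of tag + metadata per line (slide's estimate)

n_lines  = cache_size // line_size   # 4M lines
tag_size = n_lines * tag_per_line    # 20 MB

print(f"{n_lines:,} lines -> {tag_size / MiB:.0f} MB tag store")
```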
• Option 1: SRAM tags. Fast, but impractical (not enough transistors).
• Option 2: Tags in DRAM. The naive design has 2x latency (two accesses: tag, then data).
Stacked DRAM Caches
Tags-On-DRAM:
• Cache tags held in the DRAM itself
• Typically 64 B blocks
• Due to the overhead of tag access from DRAM, requires some form of predictor/cache in SRAM
• Several recent proposals (Loh-Hill, AlloyCache, ATCache, Bi-Modal)
Tags-On-SRAM:
• Cache tags held in SRAM
• SRAM is expensive and the storage overhead is large
• So typically uses larger block sizes (~1 KB) to reduce overhead
• Off-chip bandwidth and cache utilization are concerns
• Several recent proposals (FootPrintCache, CHOP)
Stacked DRAM Cache Organization
[Figure: the processor-with-stacked-DRAM picture, shown for two metadata placements. (a) Metadata (tags) in SRAM alongside the L2. (b) Metadata in the DRAM cache itself, with a small SRAM tag predictor (Tag-Pred) consulted before the DRAM-cache access. In both, a DRAM-cache hit avoids the memory controller and off-chip main memory.]
Overview of Different Designs
• For DRAM caches, it is critical to optimize first for latency, then for hit rate
[Figure: design points mapping a main-memory address PA_M to a DRAM-cache address via PA_DRAM$ = Hash(PA_M) + Offset.]
A minimal sketch of this style of mapping follows.
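A minimal sketch of the PA_DRAM$ = Hash(PA_M) + Offset mapping: the block-aligned physical address is hashed to a DRAM-cache set, and the offset locates the byte within the block. The specific hash (a simple modulo) and the set count are illustrative assumptions, not what any one design uses.

```python
BLOCK  = 64      # bytes per cache block
N_SETS = 4096    # assumed number of DRAM-cache sets

def dram_cache_location(pa_m: int) -> tuple[int, int]:
    """Map a main-memory physical address to (set index, block offset)."""
    block_addr = pa_m // BLOCK
    offset     = pa_m % BLOCK
    set_index  = block_addr % N_SETS   # the Hash(PA_M) step
    return set_index, offset

print(dram_cache_location(0x1234_5678))
```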
Overview of Bi-Modal Cache
• Tags-On-DRAM organization
• With 3 new organizational features:
1) Cache sets are bi-modal: they can hold a combination of big (512 B) and small (64 B) blocks, which improves hit rate and reduces off-chip bandwidth
2) Parallel tag and data accesses, which reduce hit latency
3) Eliminating most tag accesses via a small SRAM-based way locator, which also reduces hit latency
Supporting Bi-Modal Block Sizes
• Each set can hold some big (512 B) and some small (64 B) blocks
• Block size predictor:
– Blocks with high spatial reuse fetch 512 B
– Blocks with little spatial reuse fetch 64 B
[Figure: on a DRAM-cache miss, the predictor chooses "big" (fetch a 512 B block) or "small" (fetch a 64 B block); a DRAM cache set holds a mix, e.g. 512B 512B 512B 64 64 64 …]
A predictor sketch appears below.
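A minimal sketch of a block-size predictor: track how many distinct 64 B lines of each 512 B region were touched, and fetch big blocks for regions with high spatial reuse. The table structure and threshold are illustrative assumptions, not the Bi-Modal Cache's exact mechanism.

```python
from collections import defaultdict

BIG, SMALL = 512, 64
THRESHOLD = 4                     # assumed: >= 4 distinct lines => big block

lines_touched = defaultdict(set)  # region address -> set of line offsets seen

def record_access(addr: int) -> None:
    region = addr // BIG
    lines_touched[region].add((addr % BIG) // SMALL)

def predict_fetch_size(addr: int) -> int:
    region = addr // BIG
    return BIG if len(lines_touched[region]) >= THRESHOLD else SMALL

for a in (0x1000, 0x1040, 0x1080, 0x10C0):   # four lines in one 512 B region
    record_access(a)
print(predict_fetch_size(0x1000))  # 512 -- high spatial reuse
print(predict_fetch_size(0x9000))  # 64  -- region never seen before
```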
Parallel Tag and Data Accesses
[Figure: each DRAM-cache channel interleaves data banks (D) with a dedicated metadata bank (M), so a tag access and a data access target different banks and proceed in parallel. A page in the DRAM cache holds a run of tags (T T T T ...) followed by data. The metadata bank enjoys a high row-buffer hit rate.]
Eliminating a Majority of Tag Accesses using the Way Locator
A small 2-way set-associative SRAM structure; each entry specifies a tag and the associated way (DRAM column) where the data is stored. The address selects a set.

        MRU: Tag and Way    MRU-1: Tag and Way
Set 0   Tag a1, Way 3       Tag a2, Way 1
Set 1   Tag a3, Way 2       Tag a4, Way 0
Set 2   Tag a5, Way 0       Tag a6, Way 3
…
Set N   Tag am, Way x       Tag an, Way y
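A minimal sketch of the way locator: entries map a block tag to the DRAM-cache way (column) holding its data, so a locator hit skips the in-DRAM tag lookup. The set count, entry layout, and MRU replacement details are illustrative assumptions.

```python
N_SETS = 4

locator = [[] for _ in range(N_SETS)]   # each set: up to 2 (tag, way) entries

def lookup(set_idx: int, tag: int):
    """Return the way if the tag is in the locator, else None."""
    for i, (t, way) in enumerate(locator[set_idx]):
        if t == tag:
            entry = locator[set_idx].pop(i)
            locator[set_idx].insert(0, entry)   # promote to MRU position
            return way
    return None

def insert(set_idx: int, tag: int, way: int) -> None:
    locator[set_idx].insert(0, (tag, way))      # insert at MRU position
    del locator[set_idx][2:]                    # keep only MRU and MRU-1

insert(0, tag=0xA1, way=3)
insert(0, tag=0xA2, way=1)
print(lookup(0, 0xA1))   # 3    -- locator hit: go straight to way 3's data
print(lookup(0, 0xB0))   # None -- fall back to the in-DRAM tag access
```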
Putting Them Together
On a last-level cache miss:
• Consult the way locator. On a locator hit, access the DRAM cache data directly: the lowest-latency path.
• On a locator miss, access the DRAM-cache tag and data in parallel. If the tag matches, it is a cache hit: return the data.
• If the tag does not match, it is a cache miss: the block size predictor chooses big or small, and a 512 B or a 64 B block is fetched.
Hit Latency Improvement
• The way-locator hit is the common case: over 85% of accesses