18-740/640
Computer Architecture
Lecture 9: Emerging Memory Technologies
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015, 9/30/2015
Required Readings
2
Required Reading Assignment:
• Lee et al., “Phase Change Technology and the Future of Main Memory,” IEEE Micro, Jan/Feb 2010.
Recommended References:
• M. Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
• H. Yoon et al., “Row buffer locality aware caching policies for hybrid memories,” ICCD 2012.
• J. Zhao et al., “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.
Self-Optimizing DRAM Controllers
Problem: DRAM controllers are difficult to design. It is hard for
human designers to devise a policy that adapts well to different workloads and different system conditions
Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning.
Observation: Reinforcement learning maps nicely to memory control.
Design: Memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy.
Ipek+, “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008.
Self-Optimizing DRAM Controllers
Dynamically adapt the memory scheduling policy via interaction with the system at runtime
Associate system states and actions (commands) with long term reward values: each action at a given state leads to a learned reward
Schedule command with highest estimated long-term reward value in each state
Continuously update reward values for <state, action> pairs based on feedback from system
4
Self-Optimizing DRAM Controllers
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.
5
States, Actions, Rewards
6
❖ Reward function
• +1 for scheduling Read and Write commands
• 0 at all other times
Goal is to maximize data bus utilization
❖ State attributes
• Number of reads, writes, and load misses in transaction queue
• Number of pending writes and ROB heads waiting for referenced row
• Request’s relative ROB order
❖ Actions
• Activate
• Write
• Read - load miss
• Read - store miss
• Precharge - pending
• Precharge - preemptive
• NOP
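To make this mapping concrete, the following is a minimal scheduling-loop sketch, not the ISCA 2008 implementation (the real design approximates the reward estimates in hardware rather than storing an explicit table); the states, command names, and hyperparameters are simplified stand-ins for the attributes listed above.

    import random
    from collections import defaultdict

    ACTIONS = ["activate", "write", "read_load_miss", "read_store_miss",
               "precharge_pending", "precharge_preemptive", "nop"]

    class RLScheduler:
        """Q-learning-style command scheduler (conceptual sketch only)."""
        def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.05):
            self.q = defaultdict(float)   # Q(state, action): estimated long-term reward
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def choose(self, state, legal_actions):
            # Mostly exploit: schedule the legal command with the highest estimate.
            if random.random() < self.epsilon:
                return random.choice(legal_actions)
            return max(legal_actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state, next_legal):
            # Continuously update the <state, action> reward estimate from system feedback.
            best_next = max(self.q[(next_state, a)] for a in next_legal)
            target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    def reward_of(action):
        # +1 only when data is actually transferred (a Read or Write is scheduled).
        return 1.0 if action in ("write", "read_load_miss", "read_store_miss") else 0.0

    sched = RLScheduler()
    state = (3, 1, 2)   # e.g., (#reads, #writes, #load misses) in the transaction queue
    a = sched.choose(state, ACTIONS)
    sched.update(state, a, reward_of(a), state, ACTIONS)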
Performance Results
7
Self Optimizing DRAM Controllers
Advantages
+ Adapts the scheduling policy dynamically to changing workload behavior and to maximize a long-term target
+ Reduces the designer’s burden in finding a good scheduling policy. Designer specifies:
1) What system variables might be useful
2) What target to optimize, but not how to optimize it
Disadvantages and Limitations
-- Black box: designer much less likely to implement what she cannot easily reason about
-- How to specify different reward functions that can achieve different objectives? (e.g., fairness, QoS)
-- Hardware complexity?
More on Self-Optimizing DRAM Controllers
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.
9
Evaluating New Ideas
for Future (Memory) Architectures
Simulation: The Field of Dreams
Dreaming and Reality
An architect is in part a dreamer, a creator
Simulation is a key tool of the architect
Simulation enables
The exploration of many dreams
A reality check of the dreams
Deciding which dream is better
Simulation also enables
The ability to fool yourself with false dreams
12
Why High-Level Simulation?
Problem: RTL simulation is intractable for design space exploration → too time consuming to design and evaluate
Especially over a large number of workloads
Especially if you want to predict the performance of a good chunk of a workload on a particular design
Especially if you want to consider many design choices
Cache size, associativity, block size, algorithms
Memory control and scheduling algorithms
In-order vs. out-of-order execution
Reservation station sizes, ld/st queue size, register file size, …
…
Goal: Explore design choices quickly to see their impact on the workloads we are designing the platform for
13
Different Goals in Simulation
Explore the design space quickly and see what you want to
potentially implement in a next-generation platform
propose as the next big idea to advance the state of the art
the goal is mainly to see relative effects of design decisions
Match the behavior of an existing system so that you can
debug and verify it at cycle-level accuracy
propose small tweaks to the design that can make a difference in performance or energy
the goal is very high accuracy
Other goals in-between:
Refine the explored design space without going into a full detailed, cycle-accurate design
Gain confidence in your design decisions made by higher-level design space exploration
14
Tradeoffs in Simulation
Three metrics to evaluate a simulator
Speed
Flexibility
Accuracy
Speed: How fast the simulator runs (e.g., simulated instructions or cycles per second)
Flexibility: How quickly one can modify the simulator to evaluate different algorithms and design choices
Accuracy: How close the performance (and energy) numbers the simulator generates are to those of a real design (simulation error)
The relative importance of these metrics varies depending on where you are in the design process
15
Trading Off Speed, Flexibility, Accuracy
Speed & flexibility affect:
How quickly you can make design tradeoffs
Accuracy affects:
How good your design tradeoffs may end up being
How fast you can build your simulator (simulator design time)
Flexibility also affects:
How much human effort you need to spend modifying the simulator
You can trade off between the three to achieve design exploration and decision goals
16
High-Level Simulation
Key Idea: Raise the abstraction level of modeling to give up some accuracy to enable speed & flexibility (and quick simulator design)
Advantage
+ Can still make the right tradeoffs, and can do it quickly
+ All you need is to model the key high-level factors; you can omit corner-case conditions
+ All you need is to get the “relative trends” accurately, not exact performance numbers
Disadvantage
-- Opens up the possibility of potentially wrong decisions
-- How do you ensure you get the “relative trends” accurately?
Simulation as Progressive Refinement
High-level models (Abstract, C)
…
Medium-level models (Less abstract)
…
Low-level models (RTL with everything modeled)
…
Real design
As you refine (go down the above list)
Abstraction level reduces
Accuracy (hopefully) increases (not necessarily, if not careful)
Speed and flexibility reduce
You can loop back and fix higher-level models
Making The Best of Architecture
A good architect is comfortable at all levels of refinement
Including the extremes
A good architect knows when to use what type of simulation
19
Ramulator: A Fast and Extensible
DRAM Simulator
[IEEE Comp Arch Letters’15]
20
Ramulator Motivation
DRAM and Memory Controller landscape is changing
Many new and upcoming standards
Many new controller designs
A fast and easy-to-extend simulator is very much needed
21
Ramulator
Provides out-of-the box support for many DRAM standards:
DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new proposals (SALP, AL-DRAM, TLDRAM, RowClone, and SARP)
~2.5X faster than the fastest open-source simulator
Modular and extensible to different standards
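The extensibility comes from describing each DRAM standard largely as data (commands, timing parameters, state rules) reused by shared controller logic. The sketch below only illustrates that idea; it is not Ramulator's actual C++ class structure, and the command names and timing values are placeholders.

    from dataclasses import dataclass, field

    @dataclass
    class DRAMSpec:
        """A DRAM standard described as data: timing parameters (in cycles) plus a rule
        for which command a request needs next, given the bank's open row."""
        name: str
        timing: dict = field(default_factory=dict)

        def next_command(self, open_row, req_row):
            if open_row is None:
                return "ACT"     # bank closed: activate the requested row
            if open_row != req_row:
                return "PRE"     # row conflict: precharge before a new activate
            return "RD"          # row hit: issue the column access directly

    # Swapping standards is then just swapping the spec object; the controller logic is shared.
    ddr3_like = DRAMSpec("DDR3-like", {"tRCD": 10, "tCL": 10, "tRP": 10})
    hbm_like  = DRAMSpec("HBM-like",  {"tRCD": 8,  "tCL": 8,  "tRP": 8})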
22
Case Study: Comparison of DRAM Standards
23
Across 22 workloads, simple CPU model
Ramulator Paper and Source Code
Yoongu Kim, Weikun Yang, and Onur Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator," IEEE Computer Architecture Letters (CAL), March 2015. [Source Code]
Source code is released under the liberal MIT License
https://github.com/CMU-SAFARI/ramulator
24
Extra Credit Assignment
Review the Ramulator paper
Send your reviews to me ([email protected])
Download and run Ramulator
Compare DDR3, DDR4, SALP, HBM for the libquantum benchmark (provided in the Ramulator repository)
Send your brief report to me
25
Emerging Memory Technologies
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
27
Major Trends Affecting Main Memory (I)
Need for main memory capacity and bandwidth increasing
Main memory energy/power is a key system design concern
DRAM technology scaling is ending
28
Trends: Problems with DRAM as Main Memory
Need for main memory capacity and bandwidth increasing
DRAM capacity hard to scale
Main memory energy/power is a key system design concern
DRAM consumes high power due to leakage and refresh
DRAM technology scaling is ending
DRAM capacity, cost, and energy/power hard to scale
29
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
30
Traditional
Enough capacity
Low cost
High system performance (high bandwidth, low latency)
New
Technology scalability: lower cost, higher capacity, lower energy
Energy (and power) efficiency
QoS support and configurability (for consolidation)
31
Requirements from an Ideal Memory System
Traditional
Higher capacity
Continuous low cost
High system performance (higher bandwidth, low latency)
New
Technology scalability: lower cost, higher capacity, lower energy
Energy (and power) efficiency
QoS support and configurability (for consolidation)
32
Requirements from an Ideal Memory System
Emerging, resistive memory technologies (NVM) can help
How Do We Solve The Memory Problem?
Fix it: Make DRAM and controllers more intelligent
New interfaces, functions, architectures: system-DRAM codesign
Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology
New technologies and system-wide rethinking of memory & storage
Embrace it: Design heterogeneous memories (none of which are perfect) and map data intelligently across them
New models for data management and maybe usage
…
33
Solutions (to memory scaling) require software/hardware/device cooperation
Microarchitecture
ISA
Programs
Algorithms
Problems
Logic
Devices
Runtime System
(VM, OS, MM)
User
Solution 2: Emerging Memory Technologies
Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
Example: Phase Change Memory
Expected to scale to 9nm (2022 [ITRS])
Expected to be denser than DRAM: can store multiple bits/cell
But, emerging technologies have shortcomings as well
Can they be enabled to replace/augment/surpass DRAM?
Lee+, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA’09, CACM’10, Micro’10.
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters 2012.
Yoon, Meza+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012.
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Lu+, “Loose Ordering Consistency for Persistent Memory,” ICCD 2014.
Zhao+, “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.
Yoon, Meza+, “Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories,” ACM TACO 2014.
Ren+, “Dual-Scheme Checkpointing: A Software-Transparent Mechanism for Supporting Crash Consistency in Persistent Memory Systems,” MICRO 2015.
34
Solution 3: Hybrid Memory Systems
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.
[Figure: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Technology X (e.g., PCM): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
36
The Promise of Emerging Technologies
Likely need to replace/augment DRAM with a technology that is
Technology scalable
And at least similarly efficient, high performance, and fault-tolerant
or can be architected to be so
Some emerging resistive memory technologies appear promising
Phase Change Memory (PCM)?
Spin Torque Transfer Magnetic Memory (STT-MRAM)?
Memristors? RRAM? ReRAM?
And, maybe there are other ones
Can they be enabled to replace/augment/surpass DRAM?
37
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
38
Charge vs. Resistive Memories
Charge Memory (e.g., DRAM, Flash)
Write data by capturing charge Q
Read data by detecting voltage V
Resistive Memory (e.g., PCM, STT-MRAM, memristors)
Write data by pulsing current dQ/dt
Read data by detecting resistance R
39
Limits of Charge Memory
Difficult charge placement and control
Flash: floating gate charge
DRAM: capacitor charge, transistor leakage
Reliable sensing becomes difficult as charge storage unit size reduces
40
Promising Resistive Memory Technologies
PCM
Inject current to change material phase
Resistance determined by phase
STT-MRAM
Inject current to change magnet polarity
Resistance determined by polarity
Memristors/RRAM/ReRAM
Inject current to change atomic structure
Resistance determined by atom distance
41
What is Phase Change Memory?
Phase change material (chalcogenide glass) exists in two states:
Amorphous: Low optical reflectivity and high electrical resistivity
Crystalline: High optical reflectivity and low electrical resistivity
42
PCM is resistive memory: High resistance (0), Low resistance (1)
PCM cell can be switched between states reliably and quickly
How Does PCM Work?
Write: change phase via current injection
SET: sustained current to heat cell above Tcryst
RESET: cell heated above Tmelt and quenched
Read: detect phase via material resistance
amorphous/crystalline
43
[Figure: PCM cell structure: an access device in series with the memory element. SET (crystalline): low resistance, ~10^3-10^4 Ω, reached with a smaller, sustained current. RESET (amorphous): high resistance, ~10^6-10^7 Ω, reached with a larger current pulse.]
Photo Courtesy: Bipin Rajendran, IBM. Slide Courtesy: Moinuddin Qureshi, IBM
Opportunity: PCM Advantages
Scales better than DRAM, Flash
Requires current pulses, which scale linearly with feature size
Expected to scale to 9nm (2022 [ITRS])
Prototyped at 20nm (Raoux+, IBM JRD 2008)
Can be denser than DRAM
Can store multiple bits per cell due to large resistance range
Prototypes with 2 bits/cell in ISSCC’08, 4 bits/cell by 2012
Non-volatile
Can retain data for >10 years at 85C
No refresh needed, low idle power
44
45
PCM Resistance → Value
[Figure: distribution of cell resistance; low resistance maps to cell value 1, high resistance to cell value 0.]
46
Multi-Level Cell PCM
Multi-level cell: more than 1 bit per cell
Further increases density by 2 to 4x [Lee+,ISCA'09]
But MLC-PCM also has drawbacks
Higher latency and energy than single-level cell PCM
47
MLC-PCM Resistance → Value
[Figure: distribution of cell resistance divided into four regions, each mapping to a 2-bit cell value (Bit 1, Bit 0).]
48
MLC-PCM Resistance → Value
[Figure: the same four-region resistance distribution, showing the reduced margin between adjacent values.]
Less margin between values → need more precise sensing/modification of cell contents → higher latency/energy (~2x for reads and 4x for writes)
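To make the sensing cost concrete, here is a minimal sketch (threshold values are arbitrary placeholders, not device data): a single-level cell needs one resistance threshold, while a 2-bit MLC needs three thresholds with narrower margins, which is why MLC reads and writes are slower and more energy-hungry.

    def sense_slc(resistance_ohms, threshold=1e5):
        # Single-level cell: one threshold separates the two states (high R = 0, low R = 1).
        return 0 if resistance_ohms > threshold else 1

    def sense_mlc_2bit(resistance_ohms, thresholds=(3e4, 1e5, 1e6)):
        # 2 bits/cell: three thresholds carve the resistance range into four bins.
        # Narrower margins between bins -> more precise (slower, costlier) sensing.
        t0, t1, t2 = sorted(thresholds)
        if resistance_ohms > t2:
            return 0b00     # most amorphous (highest resistance)
        if resistance_ohms > t1:
            return 0b01
        if resistance_ohms > t0:
            return 0b10
        return 0b11         # most crystalline (lowest resistance)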
Phase Change Memory Properties
Surveyed prototypes from 2003-2008 (ITRS, IEDM, VLSI, ISSCC)
Derived PCM parameters for F=90nm
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
49
50
Phase Change Memory Properties: Latency
Latency comparable to, but slower than DRAM
Read Latency
50ns: 4x DRAM, 10^-3x NAND Flash
Write Latency
150ns: 12x DRAM
Write Bandwidth
5-10 MB/s: 0.1x DRAM, 1x NAND Flash
51
Phase Change Memory Properties
Dynamic Energy
40 uA Rd, 150 uA Wr
2-43x DRAM, 1x NAND Flash
Endurance
Writes induce phase change at 650C
Contacts degrade from thermal expansion/contraction
10^8 writes per cell
10^-8x DRAM, 10^3x NAND Flash
Cell Size
9-12F^2 using BJT, single-level cells
1.5x DRAM, 2-3x NAND (will scale with feature size, MLC)
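A back-of-the-envelope check on what ~10^8 writes per cell means at the system level (a hedged sketch with assumed numbers: 16 GB capacity, 1 GB/s of sustained write traffic, and ideal wear leveling that spreads writes uniformly over all cells):

\[
\text{Lifetime} \approx \frac{\text{Capacity} \times \text{Endurance}}{\text{Write rate}}
= \frac{16\,\text{GB} \times 10^{8}\ \text{writes/cell}}{1\,\text{GB/s}}
\approx 1.6 \times 10^{9}\,\text{s} \approx 50\ \text{years}
\]

Real lifetimes are far shorter because writes concentrate on hot cells rather than spreading uniformly, which is why the unmanaged and architected lifetimes reported later in this lecture are in hours to years, not decades.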
52
Phase Change Memory: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher latencies: ~4-15x DRAM (especially write)
Higher active energy: ~2-50x DRAM (especially write)
Lower endurance (a cell dies after ~10^8 writes)
Challenges in enabling PCM as DRAM replacement/helper:
Mitigate PCM shortcomings
Find the right way to place PCM in the system
Ensure secure and fault-tolerant PCM operation
PCM-based Main Memory: Challenges
Where to place PCM in the memory hierarchy?
Hybrid OS controlled PCM-DRAM
Hybrid OS controlled PCM and hardware-controlled DRAM
Pure PCM main memory
How to mitigate shortcomings of PCM?
How to minimize amount of DRAM in the system?
How to take advantage of (byte-addressable and fast) non-volatile main memory?
Can we design techniques that are agnostic to the specific NVM technology?
PCM-based Main Memory (I)
How should PCM-based (main) memory be organized?
Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09, Meza+ IEEE CAL’12]:
How to partition/migrate data between PCM and DRAM
55
Hybrid Memory Systems: Challenges
Partitioning
Should DRAM be a cache or main memory, or configurable?
What fraction? How many controllers?
Data allocation/movement (energy, performance, lifetime)
Who manages allocation/movement?
What are good control algorithms?
How do we prevent degradation of service due to wearout?
Design of cache hierarchy, memory controllers, OS
Mitigate PCM shortcomings, exploit PCM advantages
Design of PCM/DRAM chips and modules
Rethink the design of PCM/DRAM with new requirements
56
PCM-based Main Memory (II)
How should PCM-based (main) memory be organized?
Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:
How to redesign entire hierarchy (and cores) to overcome PCM shortcomings
57
STT-MRAM as Main Memory
Magnetic Tunnel Junction (MTJ) device
Reference layer: Fixed magnetic orientation
Free layer: Parallel or anti-parallel
Magnetic orientation of the free layer determines logical state of device
High vs. low resistance
Write: Push large current through MTJ to change orientation of free layer
Read: Sense current flow
Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
[Figure: an MTJ consists of a reference layer, a barrier, and a free layer; the two free-layer orientations encode logical 0 and logical 1. A memory cell pairs the MTJ with an access transistor, connected to the word line, bit line, and sense line.]
Aside: STT MRAM: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher write latency
Higher write energy
Reliability?
Another level of freedom
Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)
59
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
60
An Initial Study: Replace DRAM with PCM
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC)
Derived “average” PCM parameters for F=90nm
61
Results: Naïve Replacement of DRAM with PCM
Replace DRAM with PCM in a 4-core, 4MB L2 system
PCM organized the same as DRAM: row buffers, banks, peripherals
1.6x delay, 2.2x energy, 500-hour average lifetime
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
62
Architecting PCM to Mitigate Shortcomings
Idea 1: Use multiple narrow row buffers in each PCM chip
Reduces array reads/writes → better endurance, latency, energy
Idea 2: Write into the array at cache-block or word granularity
Reduces unnecessary wear (a sketch of this partial-write idea follows below)
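A minimal sketch of the partial-write idea: track which words of the row buffer were actually modified and write back only those, instead of rewriting the whole row into the PCM array. Buffer size and word granularity are illustrative, not the paper's parameters.

    class PartialWriteRowBuffer:
        """Sketch of Idea 2: per-word dirty bits so a writeback only writes modified words."""
        def __init__(self, row_size_words=512):
            self.data = [0] * row_size_words
            self.dirty = [False] * row_size_words
            self.row_id = None

        def load_row(self, row_id, pcm_array):
            self.row_id = row_id
            self.data = list(pcm_array[row_id])      # read the row into the row buffer
            self.dirty = [False] * len(self.data)

        def write_word(self, idx, value):
            self.data[idx] = value
            self.dirty[idx] = True                   # remember exactly what changed

        def writeback(self, pcm_array):
            # Only dirty words are written into the PCM cells -> less wear, less write energy.
            writes = 0
            for idx, is_dirty in enumerate(self.dirty):
                if is_dirty:
                    pcm_array[self.row_id][idx] = self.data[idx]
                    writes += 1
            self.dirty = [False] * len(self.dirty)
            return writes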
63
Results: Architected PCM as Main Memory
1.2x delay, 1.0x energy, 5.6-year average lifetime
Scaling improves energy, endurance, density
Caveat 1: Worst-case lifetime is much shorter (no guarantees)
Caveat 2: Intensive applications see large performance and energy hits
Caveat 3: Optimistic PCM parameters?
More on PCM As Main Memory
Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)
65
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
66
Hybrid Memory Systems
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
[Figure: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
One Option: DRAM as a Cache for PCM
PCM is main memory; DRAM caches memory rows/blocks
Benefits: Reduced latency on DRAM cache hit; write filtering
Memory controller hardware manages the DRAM cache
Benefit: Eliminates system software overhead
Three issues:
What data should be placed in DRAM versus kept in PCM?
What is the granularity of data movement?
How to design a low-cost hardware-managed DRAM cache?
Two idea directions:
Locality-aware data placement [Yoon+ , ICCD 2012]
Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]
68
DRAM as a Cache for PCM
Goal: Achieve the best of both DRAM and PCM/NVM
Minimize amount of DRAM w/o sacrificing performance, endurance
DRAM as cache to tolerate PCM latency and write bandwidth
PCM as main memory to provide large capacity at good cost and power
69
[Figure: Processor with a DRAM buffer (data plus tag store, T = tag store) in front of PCM main memory and a PCM write queue, backed by Flash or HDD.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
Write Filtering Techniques
Lazy Write: Pages from disk installed only in DRAM, not PCM
Partial Writes: Only dirty lines from DRAM page written back
Page Bypass: Discard pages with poor reuse on DRAM eviction
Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
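A sketch of the three filters combined (simplified relative to Qureshi et al.; the structures and thresholds below are invented for illustration): a page fetched from disk is installed only in the DRAM buffer (Lazy Write), only dirty lines are written to PCM on eviction (Partial Writes), and pages with little reuse are dropped entirely (Page Bypass).

    from dataclasses import dataclass

    @dataclass
    class Page:
        id: int
        lines: list
        dirty: list
        access_count: int = 0

    def install_from_disk(page, dram_buffer):
        # Lazy Write: a page brought in from disk is installed only in DRAM, not PCM.
        dram_buffer[page.id] = page

    def evict_from_dram(page, pcm_lines, reuse_threshold=4):
        # Page Bypass: pages with poor reuse are simply discarded on eviction (no PCM writes).
        if page.access_count < reuse_threshold:
            return 0
        # Partial Writes: only the dirty lines of the page are written back to PCM.
        written = 0
        for i, line in enumerate(page.lines):
            if page.dirty[i]:
                pcm_lines[(page.id, i)] = line
                written += 1
        return written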
70
Results: DRAM as PCM Cache (I)
Simulation of a 16-core system: 8GB DRAM main memory with 320-cycle latency, HDD (2 ms) fronted by a Flash cache (32 us) with a 99% Flash hit rate
Assumption: PCM 4x denser, 4x slower than DRAM
DRAM block size = PCM page size (4kB)
71
[Figure: normalized execution time for db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and gmean under four configurations: 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
Results: DRAM as PCM Cache (II)
PCM-DRAM Hybrid performs similarly to similar-size DRAM
Significant energy savings with PCM-DRAM Hybrid
Average lifetime: 9.7 years (no guarantees)
72
[Figure: power, energy, and energy x delay, normalized to 8GB DRAM, for 8GB DRAM, Hybrid (32GB PCM + 1GB DRAM), and 32GB DRAM.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
More on DRAM-PCM Hybrid Memory
Moinuddin K. Qureshi, Viji Srinivasan, and Jude A. Rivers, "Scalable High-Performance Main Memory System Using Phase-Change Memory Technology," Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
73
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Row-Locality Aware Data Placement
Efficient DRAM (or Technology X) Caches
Conclusions
Discussion
74
Hybrid Memory
• Key question: How to place data between the heterogeneous memory devices?
75
[Figure: CPU with two memory controllers (MC), one to DRAM and one to PCM.]
Hybrid Memory: A Closer Look
76
[Figure: CPU connected via a memory channel and per-technology memory controllers (MC) to DRAM (small capacity cache) and PCM (large capacity store); each device has multiple banks, each with a row buffer.]
Row (buffer) hit: Access data from row buffer → fast
Row (buffer) miss: Access data from cell array → slow
Row Buffers and Latency
77
[Figure: a bank's cell array is indexed by a row address, and the selected row's data is placed in the row buffer. An access to a different row is a row buffer miss; an access to the already-open row is a row buffer hit.]
Key Observation
• Row buffers exist in both DRAM and PCM
– Row hit latency similar in DRAM & PCM [Lee+ ISCA’09]
– Row miss latency small in DRAM, large in PCM
• Place data in DRAM which
– is likely to miss in the row buffer (low row buffer locality) → miss penalty is smaller in DRAM
AND
– is reused many times → cache only the data worth the movement cost and DRAM space
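A minimal sketch of this row-buffer-locality-aware idea (simplified from the RBLA/RBLA-Dyn mechanisms of the ICCD 2012 paper; counters and thresholds are placeholders): track, per PCM row, how often accesses miss in the row buffer and how often the row is reused, and migrate to DRAM only rows that both miss frequently and are reused enough to pay back the migration cost.

    from collections import defaultdict

    class RBLAPlacer:
        def __init__(self, miss_threshold=8, reuse_threshold=16):
            self.rb_misses = defaultdict(int)   # row-buffer misses seen for each PCM row
            self.accesses  = defaultdict(int)   # total accesses to each PCM row
            self.miss_threshold  = miss_threshold
            self.reuse_threshold = reuse_threshold

        def record_access(self, row, row_buffer_hit):
            self.accesses[row] += 1
            if not row_buffer_hit:
                self.rb_misses[row] += 1

        def should_migrate_to_dram(self, row):
            # Low row buffer locality (many misses) AND enough reuse to amortize the move.
            return (self.rb_misses[row] >= self.miss_threshold and
                    self.accesses[row]  >= self.reuse_threshold)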
78
System Performance
79
[Figure: normalized weighted speedup for Server, Cloud, and Avg workloads under the FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn policies; callouts highlight improvements of 10%, 14%, and 17%.]
Benefit 1: Increased row buffer locality (RBL) in PCM by moving low-RBL data to DRAM
Benefit 2: Reduced memory bandwidth consumption due to stricter caching criteria
Benefit 3: Balanced memory request load between DRAM and PCM
Compared to All-PCM/DRAM
80
[Figure: weighted speedup, maximum slowdown, and performance per Watt, normalized, for 16GB PCM, RBLA-Dyn, and 16GB DRAM.]
Our mechanism achieves 31% better performance than all-PCM, within 29% of all-DRAM performance
For More on Hybrid Memory Data Placement
HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf)
81
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Row-Locality Aware Data Placement
Efficient DRAM (or Technology X) Caches
Conclusions
Discussion
82
The Problem with Large DRAM Caches
A large DRAM cache requires a large metadata (tag + block-based information) store
How do we design an efficient DRAM cache?
83
[Figure: CPU with memory controllers to DRAM (small, fast cache) and PCM (high capacity); a LOAD X first consults metadata to learn that X resides in DRAM, then accesses X there.]
Idea 1: Tags in Memory
Store tags in the same row as data in DRAM
Store metadata in same row as their data
Data and metadata can be accessed together
Benefit: No on-chip tag storage overhead
Downsides:
Cache hit determined only after a DRAM access
Cache hit requires two DRAM accesses
84
[Figure: a DRAM row holding Tag 0, Tag 1, Tag 2 alongside Cache block 0, Cache block 1, Cache block 2.]
Idea 2: Cache Tags in SRAM
Recall Idea 1: Store all metadata in DRAM
To reduce metadata storage overhead
Idea 2: Cache frequently-accessed metadata in on-chip SRAM
Cache only a small amount to keep SRAM size small
85
Idea 3: Dynamic Data Transfer Granularity
Some applications benefit from caching more data
They have good spatial locality
Others do not
Large granularity wastes bandwidth and reduces cache utilization
Idea 3: Simple dynamic caching granularity policy
Cost-benefit analysis to determine best DRAM cache block size
Group main memory into sets of rows
Some row sets follow a fixed caching granularity
The rest of main memory follows the best granularity
Cost–benefit analysis: access latency versus number of cachings
Performed every quantum
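A hedged sketch of the cost-benefit decision (the real policy in the CAL 2012 paper samples row sets and re-evaluates every quantum; the latency constants below are made up): weigh the caching cost (bytes moved over the channel) against the benefit (later accesses served at DRAM rather than PCM latency) and pick the granularity with the best net benefit.

    def net_benefit(block_bytes, reuses_at_this_granularity, bus_bytes_per_cycle=8,
                    pcm_access_cycles=400, dram_access_cycles=100):
        transfer_cycles = block_bytes / bus_bytes_per_cycle          # cost of caching the block
        saved_cycles = reuses_at_this_granularity * (pcm_access_cycles - dram_access_cycles)
        return saved_cycles - transfer_cycles                        # benefit minus cost

    def pick_granularity(measured_reuses):
        # measured_reuses: {block_bytes: avg accesses captured per cached block at that size},
        # e.g., gathered from the sampled row sets during the previous quantum.
        return max(measured_reuses, key=lambda b: net_benefit(b, measured_reuses[b]))

    # Good spatial locality favors large blocks; poor locality favors small ones.
    good_locality = {64: 2, 256: 8, 1024: 30, 4096: 100}
    poor_locality = {64: 2, 256: 2, 1024: 2,  4096: 2}
    print(pick_granularity(good_locality), pick_granularity(poor_locality))   # 4096 64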
86
TIMBER Tag Management
A Tag-In-Memory BuffER (TIMBER)
Stores recently-used tags in a small amount of SRAM
Benefits: If tag is cached:
no need to access DRAM twice
cache hit determined quickly
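A minimal sketch of the lookup path (names and sizes invented; not the exact TIMBER design): check the small SRAM tag buffer first; on a buffer hit the DRAM-cache hit/miss decision is made immediately, otherwise fetch the row's metadata from DRAM, install it in the buffer, and then decide.

    from collections import OrderedDict

    class TimberLikeTagBuffer:
        def __init__(self, capacity_rows=64):
            self.buf = OrderedDict()     # dram_row -> {block_addr: cached?}, kept in LRU order
            self.capacity = capacity_rows

        def lookup(self, block_addr, row_of, fetch_metadata_from_dram):
            row = row_of(block_addr)
            if row in self.buf:                       # tags cached in SRAM:
                self.buf.move_to_end(row)             #   hit/miss known without touching DRAM
                return self.buf[row].get(block_addr, False), "sram"
            metadata = fetch_metadata_from_dram(row)  # tags not cached: one extra DRAM access
            self.buf[row] = metadata
            if len(self.buf) > self.capacity:
                self.buf.popitem(last=False)          # evict the least-recently-used row's tags
            return metadata.get(block_addr, False), "dram"

    # Example wiring (2KB rows; the metadata fetch is a stub for illustration):
    tb = TimberLikeTagBuffer()
    hit, source = tb.lookup(0x1240, row_of=lambda a: a // 2048,
                            fetch_metadata_from_dram=lambda row: {0x1240: True})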
87
[Figure: TIMBER is a small SRAM structure that caches the tags (Tag 0, Tag 1, Tag 2) of recently accessed DRAM rows (e.g., Row 0, Row 27); the DRAM rows themselves still store tags next to their cache blocks.]
TIMBER Tag Management Example (I)
Case 1: TIMBER hit
88
[Figure: Case 1, TIMBER hit: for LOAD X, the buffered tags already indicate that X is in DRAM, so the memory controller accesses X directly with a single DRAM access.]
TIMBER Tag Management Example (II)
Case 2: TIMBER miss
89
[Figure: Case 2, TIMBER miss: for LOAD Y, the controller 1. accesses Y's metadata M(Y) in DRAM, 2. caches M(Y) in the TIMBER buffer, and 3. accesses Y itself, which is a row buffer hit because M(Y) sits in the same row as Y.]
90
Metadata Storage Performance
[Figure: normalized weighted speedup for the SRAM (ideal), Region, TIM, and TIMBER metadata storage options.]
91
Metadata Storage Performance
[Figure: same comparison; a 48% performance degradation is highlighted, caused by increased metadata lookup access latency.]
92
Metadata Storage Performance
[Figure: same comparison; a 36% improvement is highlighted: increased row locality reduces average memory access latency.]
93
Metadata Storage Performance
[Figure: same comparison; a 23% improvement is highlighted: data with locality can access metadata at SRAM latencies.]
94
Dynamic Granularity Performance
[Figure: normalized weighted speedup for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; a 10% improvement is highlighted for TIMBER-Dyn, from reduced channel contention and improved spatial locality.]
95
TIMBER Performance
[Figure: normalized weighted speedup for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; TIMBER-Dyn is highlighted as being within 6% of the ideal SRAM-based design, thanks to reduced channel contention and improved spatial locality.]
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
96
TIMBER Energy Efficiency
[Figure: normalized performance per Watt (for the memory system) for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; an 18% improvement is highlighted: fewer migrations reduce transmitted data and channel contention.]
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
More on Large DRAM Cache Design
Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," IEEE Computer Architecture Letters (CAL), February 2012.
97
STT-MRAM As Main Memory
STT-MRAM as Main Memory
Magnetic Tunnel Junction (MTJ) device
Reference layer: Fixed magnetic orientation
Free layer: Parallel or anti-parallel
Magnetic orientation of the free layer determines logical state of device
High vs. low resistance
Write: Push large current through MTJ to change orientation of free layer
Read: Sense current flow
Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
[Figure: an MTJ consists of a reference layer, a barrier, and a free layer; the two free-layer orientations encode logical 0 and logical 1. A memory cell pairs the MTJ with an access transistor, connected to the word line, bit line, and sense line.]
STT-MRAM: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher write latency
Higher write energy
Reliability?
Another level of freedom
Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)
100
Architected STT-MRAM as Main Memory
4-core, 4GB main memory, multiprogrammed workloads
~6% performance loss, ~60% energy savings vs. DRAM
101
[Figure: performance vs. DRAM for STT-RAM (base) and STT-RAM (opt), and energy vs. DRAM broken down into ACT+PRE, WB, and RB components.]
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
More on STT-MRAM as Main Memory
Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013. Slides (pptx) (pdf)
102
Other Opportunities with Emerging Technologies
Merging of memory and storage
e.g., a single interface to manage all data
New applications
e.g., ultra-fast checkpoint and restore
More robust system design
e.g., reducing data loss
Processing tightly-coupled with memory
e.g., enabling efficient search and filtering
103
Merging of Memory and Storage:
Persistent Memory Managers
Coordinated Memory and Storage with NVM (I)
The traditional two-level storage model is a bottleneck with NVM
Volatile data in memory → a load/store interface
Persistent data in storage → a file system interface
Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data becomes a performance and energy bottleneck with fast NVM stores
105
Two-Level Store
[Figure: the processor and caches reach main memory through virtual memory, address translation, and load/store instructions, and reach storage (SSD/HDD) through the operating system and file system (fopen, fread, fwrite, ...).]
Persistent (e.g., Phase-Change) Memory
Coordinated Memory and Storage with NVM (II)
Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
Improves both energy and performance
Simplifies programming model as well
106
Unified Memory/Storage
[Figure: the processor and caches issue loads/stores to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback to the system.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
The Persistent Memory Manager (PMM)
Exposes a load/store interface to access persistent data
Applications can directly access persistent memory → no conversion, translation, or location overhead for persistent data
Manages data placement, location, persistence, security
To get the best of multiple forms of storage
Manages metadata storage and retrieval
This can lead to overheads that need to be managed
Exposes hooks and interfaces for system software
To enable better data placement and management decisions
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
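A conceptual sketch of what such an interface might look like from the application's point of view (the API names are invented for illustration; WEED 2013 does not define this exact interface): the application allocates a named persistent region once and then uses ordinary loads and stores, while the PMM decides which device the data actually lives on.

    class PersistentMemoryManager:
        """Toy model: persistent objects are looked up by name and then accessed directly,
        with no file open/read/write path; placement across devices is the PMM's job."""
        def __init__(self):
            self.objects = {}   # name -> bytearray (stand-in for data on DRAM/PCM/Flash)
            self.hints = {}

        def alloc(self, name, size, hints=None):
            # Hints (e.g., access pattern, persistence needs) guide placement decisions.
            obj = self.objects.setdefault(name, bytearray(size))
            self.hints[name] = hints or {}
            return obj          # the application now just loads/stores into this buffer

    pmm = PersistentMemoryManager()
    log = pmm.alloc("app/log", 4096, hints={"persistent": True, "access": "append"})
    log[0:5] = b"hello"          # a plain store; no fopen/fread/fwrite involved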
107
The Persistent Memory Manager (PMM)
108
PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices
Persistent objects
Efficient Data Mapping among Heterogeneous Devices
A persistent memory exposes a large, persistent address space
But it may use many different devices to satisfy this goal
From fast, low-capacity volatile DRAM to slow, high-capacity non-volatile HDD or Flash
And other NVM devices in between
Performance and energy can benefit from good placement of data among these devices
Utilizing the strengths of each device and avoiding their weaknesses, if possible
For example, consider two important application characteristics: locality and persistence
109
110
Efficient Data Mapping among Heterogeneous Devices
111
Columns in a column store that are scanned through only infrequently → place on Flash
Efficient Data Mapping among Heterogeneous Devices
112
Columns in a column store that are scanned through only infrequently → place on Flash
Frequently-updated index for a Content Delivery Network (CDN) → place in DRAM
Efficient Data Mapping among Heterogeneous Devices
Applications or system software can provide hints for data placement
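A small sketch of how such hints might drive placement (the policy and device names are illustrative only, matching the two examples above: cold, rarely scanned data to Flash; hot, frequently updated data to DRAM):

    def place(hints):
        # Illustrative policy: hot data to DRAM, cold data to Flash, the rest to NVM (e.g., PCM).
        freq = hints.get("access_frequency", "medium")
        if freq == "high":
            return "DRAM"    # e.g., a frequently-updated CDN index
        if freq == "low":
            return "Flash"   # e.g., column-store columns scanned only infrequently
        return "PCM"

    print(place({"access_frequency": "high"}))   # -> DRAM
    print(place({"access_frequency": "low"}))    # -> Flash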
Performance Benefits of a Single-Level Store
113
[Figure: performance improvements of roughly 5X and 24X from the single-level store.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Energy Benefits of a Single-Level Store
114
[Figure: energy improvements of roughly 5X and 16X from the single-level store.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
More on Single-Level Stores
Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)
115
Enabling and Exploiting NVM: Issues
Many issues and ideas from technology layer to algorithms layer
Enabling NVM and hybrid memory
How to tolerate errors?
How to enable secure operation?
How to tolerate performance and power shortcomings?
How to minimize cost?
Exploiting emerging technologies
How to exploit non-volatility?
How to minimize energy consumption?
How to exploit NVM on chip?
Microarchitecture
ISA
Programs
Algorithms
Problems
Logic
Devices
Runtime System
(VM, OS, MM)
User
Security Challenges of Emerging Technologies
1. Limited endurance → Wearout attacks
2. Non-volatility → Data persists in memory after crash/poweroff
→ Inconsistency and retrieval of privileged information
3. Multiple bits per cell → Information leakage (via side channel)
117
Securing Emerging Memory Technologies
1. Limited endurance → Wearout attacks
Better architecting of memory chips to absorb writes
Hybrid memory system management
Online wearout attack detection (a simple sketch follows this slide)
2. Non-volatility → Data persists in memory after crash/poweroff
→ Inconsistency and retrieval of privileged information
Efficient encryption/decryption of whole main memory
Hybrid memory system management
3. Multiple bits per cell → Information leakage (via side channel)
System design to hide side channel information
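A hedged sketch of one way online wearout attack detection could work (the slide does not specify a mechanism; the epoch length and threshold below are invented): keep per-row write counters over an epoch and flag rows whose write rate far exceeds what benign workloads would produce, so the system can throttle, remap, or report them.

    from collections import defaultdict

    class WearoutAttackDetector:
        def __init__(self, epoch_accesses=1_000_000, writes_per_row_threshold=10_000):
            self.writes = defaultdict(int)
            self.seen = 0
            self.epoch = epoch_accesses
            self.threshold = writes_per_row_threshold

        def on_write(self, row):
            self.writes[row] += 1
            self.seen += 1
            suspects = []
            if self.seen >= self.epoch:
                # Rows written far more than a plausible benign rate are flagged for
                # throttling, remapping, or notification of the OS.
                suspects = [r for r, n in self.writes.items() if n >= self.threshold]
                self.writes.clear()
                self.seen = 0
            return suspects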
118
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
119
Summary: Memory Scaling (with NVM)
Main memory scaling problems are a critical bottleneck for system performance, efficiency, and usability
Solution 1: Tolerate DRAM
Solution 2: Enable emerging memory technologies
Replace DRAM with NVM by architecting NVM chips well
Hybrid memory systems with automatic data management
An exciting topic with many other solution directions & ideas
Hardware/software/device cooperation essential
Memory, storage, controller, software/app co-design needed
Coordinated management of persistent memory and storage
Application and hardware cooperative management of NVM
120
18-740/640
Computer Architecture
Lecture 9: Emerging Memory Technologies
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015, 9/30/2015