18-740/640
Computer Architecture
Lecture 9: Emerging Memory Technologies
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015, 9/30/2015
Required Readings
2
Required Reading Assignment:
• Lee et al., “Phase Change Technology and the Future of Main Memory,” IEEE Micro, Jan/Feb 2010.
Recommended References:
• M. Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
• H. Yoon et al., “Row buffer locality aware caching policies for hybrid memories,” ICCD 2012.
• J. Zhao et al., “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.
Self-Optimizing DRAM Controllers
Problem: DRAM controllers are difficult to design. It is hard for
human designers to devise a policy that adapts well to different workloads and different system conditions
Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning.
Observation: Reinforcement learning maps nicely to memory control.
Design: Memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy.
Ipek+, “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008.
Self-Optimizing DRAM Controllers
Dynamically adapt the memory scheduling policy via interaction with the system at runtime
Associate system states and actions (commands) with long term reward values: each action at a given state leads to a learned reward
Schedule command with highest estimated long-term reward value in each state
Continuously update reward values for <state, action> pairs based on feedback from system
4
Self-Optimizing DRAM Controllers
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.
5
States, Actions, Rewards
6
❖ Reward function
• +1 for scheduling Read and Write commands
• 0 at all other times
Goal is to maximize data bus utilization
❖ State attributes
• Number of reads, writes, and load misses in transaction queue
• Number of pending writes and ROB heads waiting for referenced row
• Request’s relative ROB order
❖ Actions
• Activate
• Write
• Read - load miss
• Read - store miss
• Precharge - pending
• Precharge - preemptive
• NOP
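To make this mapping concrete, the following is a minimal scheduling-loop sketch, not the ISCA 2008 implementation (the real design approximates the reward estimates in hardware rather than storing an explicit table); the states, command names, and hyperparameters are simplified stand-ins for the attributes listed above.

    import random
    from collections import defaultdict

    ACTIONS = ["activate", "write", "read_load_miss", "read_store_miss",
               "precharge_pending", "precharge_preemptive", "nop"]

    class RLScheduler:
        """Q-learning-style command scheduler (conceptual sketch only)."""
        def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.05):
            self.q = defaultdict(float)   # Q(state, action): estimated long-term reward
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def choose(self, state, legal_actions):
            # Mostly exploit: schedule the legal command with the highest estimate.
            if random.random() < self.epsilon:
                return random.choice(legal_actions)
            return max(legal_actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state, next_legal):
            # Continuously update the <state, action> reward estimate from system feedback.
            best_next = max(self.q[(next_state, a)] for a in next_legal)
            target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    def reward_of(action):
        # +1 only when data is actually transferred (a Read or Write is scheduled).
        return 1.0 if action in ("write", "read_load_miss", "read_store_miss") else 0.0

    sched = RLScheduler()
    state = (3, 1, 2)   # e.g., (#reads, #writes, #load misses) in the transaction queue
    a = sched.choose(state, ACTIONS)
    sched.update(state, a, reward_of(a), state, ACTIONS)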
Performance Results
7
Self Optimizing DRAM Controllers
Advantages
+ Adapts the scheduling policy dynamically to changing workload behavior and to maximize a long-term target
+ Reduces the designer’s burden in finding a good scheduling policy. Designer specifies:
1) What system variables might be useful
2) What target to optimize, but not how to optimize it
Disadvantages and Limitations
-- Black box: designer much less likely to implement what she cannot easily reason about
-- How to specify different reward functions that can achieve different objectives? (e.g., fairness, QoS)
-- Hardware complexity?
More on Self-Optimizing DRAM Controllers
Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.
9
Evaluating New Ideas
for Future (Memory) Architectures
Simulation: The Field of Dreams
Dreaming and Reality
An architect is in part a dreamer, a creator
Simulation is a key tool of the architect
Simulation enables
The exploration of many dreams
A reality check of the dreams
Deciding which dream is better
Simulation also enables
The ability to fool yourself with false dreams
12
Why High-Level Simulation?
Problem: RTL simulation is intractable for design space exploration → too time consuming to design and evaluate
Especially over a large number of workloads
Especially if you want to predict the performance of a good chunk of a workload on a particular design
Especially if you want to consider many design choices
Cache size, associativity, block size, algorithms
Memory control and scheduling algorithms
In-order vs. out-of-order execution
Reservation station sizes, ld/st queue size, register file size, …
…
Goal: Explore design choices quickly to see their impact on the workloads we are designing the platform for
13
Different Goals in Simulation
Explore the design space quickly and see what you want to
potentially implement in a next-generation platform
propose as the next big idea to advance the state of the art
the goal is mainly to see relative effects of design decisions
Match the behavior of an existing system so that you can
debug and verify it at cycle-level accuracy
propose small tweaks to the design that can make a difference in performance or energy
the goal is very high accuracy
Other goals in-between:
Refine the explored design space without going into a full detailed, cycle-accurate design
Gain confidence in your design decisions made by higher-level design space exploration
14
Tradeoffs in Simulation
Three metrics to evaluate a simulator
Speed
Flexibility
Accuracy
Speed: How fast the simulator runs (e.g., simulated instructions or cycles per second)
Flexibility: How quickly one can modify the simulator to evaluate different algorithms and design choices
Accuracy: How close the performance (and energy) numbers the simulator generates are to those of a real design (simulation error)
The relative importance of these metrics varies depending on where you are in the design process
15
Trading Off Speed, Flexibility, Accuracy
Speed & flexibility affect:
How quickly you can make design tradeoffs
Accuracy affects:
How good your design tradeoffs may end up being
How fast you can build your simulator (simulator design time)
Flexibility also affects:
How much human effort you need to spend modifying the simulator
You can trade off between the three to achieve design exploration and decision goals
16
High-Level Simulation
Key Idea: Raise the abstraction level of modeling to give up some accuracy to enable speed & flexibility (and quick simulator design)
Advantage
+ Can still make the right tradeoffs, and can do it quickly
+ All you need is to model the key high-level factors; you can omit corner-case conditions
+ All you need is to get the “relative trends” accurately, not exact performance numbers
Disadvantage
-- Opens up the possibility of potentially wrong decisions
-- How do you ensure you get the “relative trends” accurately?
Simulation as Progressive Refinement
High-level models (Abstract, C)
…
Medium-level models (Less abstract)
…
Low-level models (RTL with everything modeled)
…
Real design
As you refine (go down the above list)
Abstraction level reduces
Accuracy (hopefully) increases (not necessarily, if not careful)
Speed and flexibility reduce
You can loop back and fix higher-level models
Making The Best of Architecture
A good architect is comfortable at all levels of refinement
Including the extremes
A good architect knows when to use what type of simulation
19
Ramulator: A Fast and Extensible
DRAM Simulator
[IEEE Comp Arch Letters’15]
20
Ramulator Motivation
DRAM and Memory Controller landscape is changing
Many new and upcoming standards
Many new controller designs
A fast and easy-to-extend simulator is very much needed
21
Ramulator
Provides out-of-the box support for many DRAM standards:
DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new proposals (SALP, AL-DRAM, TLDRAM, RowClone, and SARP)
~2.5X faster than the fastest open-source simulator
Modular and extensible to different standards
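The extensibility comes from describing each DRAM standard largely as data (commands, timing parameters, state rules) reused by shared controller logic. The sketch below only illustrates that idea; it is not Ramulator's actual C++ class structure, and the command names and timing values are placeholders.

    from dataclasses import dataclass, field

    @dataclass
    class DRAMSpec:
        """A DRAM standard described as data: timing parameters (in cycles) plus a rule
        for which command a request needs next, given the bank's open row."""
        name: str
        timing: dict = field(default_factory=dict)

        def next_command(self, open_row, req_row):
            if open_row is None:
                return "ACT"     # bank closed: activate the requested row
            if open_row != req_row:
                return "PRE"     # row conflict: precharge before a new activate
            return "RD"          # row hit: issue the column access directly

    # Swapping standards is then just swapping the spec object; the controller logic is shared.
    ddr3_like = DRAMSpec("DDR3-like", {"tRCD": 10, "tCL": 10, "tRP": 10})
    hbm_like  = DRAMSpec("HBM-like",  {"tRCD": 8,  "tCL": 8,  "tRP": 8})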
22
Case Study: Comparison of DRAM Standards
23
Across 22 workloads, simple CPU model
Ramulator Paper and Source Code
Yoongu Kim, Weikun Yang, and Onur Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator," IEEE Computer Architecture Letters (CAL), March 2015. [Source Code]
Source code is released under the liberal MIT License
https://github.com/CMU-SAFARI/ramulator
24
Extra Credit Assignment
Review the Ramulator paper
Send your reviews to me ([email protected])
Download and run Ramulator
Compare DDR3, DDR4, SALP, HBM for the libquantum benchmark (provided in the Ramulator repository)
Send your brief report to me
25
Emerging Memory Technologies
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
27
Major Trends Affecting Main Memory (I)
Need for main memory capacity and bandwidth increasing
Main memory energy/power is a key system design concern
DRAM technology scaling is ending
28
Trends: Problems with DRAM as Main Memory
Need for main memory capacity and bandwidth increasing
DRAM capacity hard to scale
Main memory energy/power is a key system design concern
DRAM consumes high power due to leakage and refresh
DRAM technology scaling is ending
DRAM capacity, cost, and energy/power hard to scale
29
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
30
Traditional
Enough capacity
Low cost
High system performance (high bandwidth, low latency)
New
Technology scalability: lower cost, higher capacity, lower energy
Energy (and power) efficiency
QoS support and configurability (for consolidation)
31
Requirements from an Ideal Memory System
Traditional
Higher capacity
Continuous low cost
High system performance (higher bandwidth, low latency)
New
Technology scalability: lower cost, higher capacity, lower energy
Energy (and power) efficiency
QoS support and configurability (for consolidation)
32
Requirements from an Ideal Memory System
Emerging, resistive memory technologies (NVM) can help
How Do We Solve The Memory Problem?
Fix it: Make DRAM and controllers more intelligent
New interfaces, functions, architectures: system-DRAM codesign
Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology
New technologies and system-wide rethinking of memory & storage
Embrace it: Design heterogeneous memories (none of which are perfect) and map data intelligently across them
New models for data management and maybe usage
…
33
Solutions (to memory scaling) require software/hardware/device cooperation
Microarchitecture
ISA
Programs
Algorithms
Problems
Logic
Devices
Runtime System
(VM, OS, MM)
User
Solution 2: Emerging Memory Technologies
Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
Example: Phase Change Memory
Expected to scale to 9nm (2022 [ITRS])
Expected to be denser than DRAM: can store multiple bits/cell
But, emerging technologies have shortcomings as well
Can they be enabled to replace/augment/surpass DRAM?
Lee+, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA’09, CACM’10, Micro’10.
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters 2012.
Yoon, Meza+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012.
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Lu+, “Loose Ordering Consistency for Persistent Memory,” ICCD 2014.
Zhao+, “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.
Yoon, Meza+, “Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories,” ACM TACO 2014.
Ren+, “Dual-Scheme Checkpointing: A Software-Transparent Mechanism for Supporting Crash Consistency in Persistent Memory Systems,” MICRO 2015.
34
Solution 3: Hybrid Memory Systems
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.
[Figure: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Technology X (e.g., PCM): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Conclusions
Discussion
36
The Promise of Emerging Technologies
Likely need to replace/augment DRAM with a technology that is
Technology scalable
And at least similarly efficient, high performance, and fault-tolerant
or can be architected to be so
Some emerging resistive memory technologies appear promising
Phase Change Memory (PCM)?
Spin Torque Transfer Magnetic Memory (STT-MRAM)?
Memristors? RRAM? ReRAM?
And, maybe there are other ones
Can they be enabled to replace/augment/surpass DRAM?
37
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
38
Charge vs. Resistive Memories
Charge Memory (e.g., DRAM, Flash)
Write data by capturing charge Q
Read data by detecting voltage V
Resistive Memory (e.g., PCM, STT-MRAM, memristors)
Write data by pulsing current dQ/dt
Read data by detecting resistance R
39
Limits of Charge Memory
Difficult charge placement and control
Flash: floating gate charge
DRAM: capacitor charge, transistor leakage
Reliable sensing becomes difficult as charge storage unit size reduces
40
Promising Resistive Memory Technologies
PCM
Inject current to change material phase
Resistance determined by phase
STT-MRAM
Inject current to change magnet polarity
Resistance determined by polarity
Memristors/RRAM/ReRAM
Inject current to change atomic structure
Resistance determined by atom distance
41
What is Phase Change Memory?
Phase change material (chalcogenide glass) exists in two states:
Amorphous: Low optical reflectivity and high electrical resistivity
Crystalline: High optical reflectivity and low electrical resistivity
42
PCM is resistive memory: High resistance (0), Low resistance (1)
PCM cell can be switched between states reliably and quickly
How Does PCM Work?
Write: change phase via current injection
SET: sustained current to heat cell above Tcryst
RESET: cell heated above Tmelt and quenched
Read: detect phase via material resistance
amorphous/crystalline
43
[Figure: PCM cell structure: an access device in series with the memory element. SET (crystalline): low resistance, ~10^3-10^4 Ω, reached with a smaller, sustained current. RESET (amorphous): high resistance, ~10^6-10^7 Ω, reached with a larger current pulse.]
Photo Courtesy: Bipin Rajendran, IBM. Slide Courtesy: Moinuddin Qureshi, IBM
Opportunity: PCM Advantages
Scales better than DRAM, Flash
Requires current pulses, which scale linearly with feature size
Expected to scale to 9nm (2022 [ITRS])
Prototyped at 20nm (Raoux+, IBM JRD 2008)
Can be denser than DRAM
Can store multiple bits per cell due to large resistance range
Prototypes with 2 bits/cell in ISSCC’08, 4 bits/cell by 2012
Non-volatile
Can retain data for >10 years at 85C
No refresh needed, low idle power
44
45
PCM Resistance → Value
[Figure: distribution of cell resistance; low resistance maps to cell value 1, high resistance to cell value 0.]
46
Multi-Level Cell PCM
Multi-level cell: more than 1 bit per cell
Further increases density by 2 to 4x [Lee+,ISCA'09]
But MLC-PCM also has drawbacks
Higher latency and energy than single-level cell PCM
47
MLC-PCM Resistance → Value
[Figure: distribution of cell resistance divided into four regions, each mapping to a 2-bit cell value (Bit 1, Bit 0).]
48
MLC-PCM Resistance → Value
[Figure: the same four-region resistance distribution, showing the reduced margin between adjacent values.]
Less margin between values → need more precise sensing/modification of cell contents → higher latency/energy (~2x for reads and 4x for writes)
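To make the sensing cost concrete, here is a minimal sketch (threshold values are arbitrary placeholders, not device data): a single-level cell needs one resistance threshold, while a 2-bit MLC needs three thresholds with narrower margins, which is why MLC reads and writes are slower and more energy-hungry.

    def sense_slc(resistance_ohms, threshold=1e5):
        # Single-level cell: one threshold separates the two states (high R = 0, low R = 1).
        return 0 if resistance_ohms > threshold else 1

    def sense_mlc_2bit(resistance_ohms, thresholds=(3e4, 1e5, 1e6)):
        # 2 bits/cell: three thresholds carve the resistance range into four bins.
        # Narrower margins between bins -> more precise (slower, costlier) sensing.
        t0, t1, t2 = sorted(thresholds)
        if resistance_ohms > t2:
            return 0b00     # most amorphous (highest resistance)
        if resistance_ohms > t1:
            return 0b01
        if resistance_ohms > t0:
            return 0b10
        return 0b11         # most crystalline (lowest resistance)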
Phase Change Memory Properties
Surveyed prototypes from 2003-2008 (ITRS, IEDM, VLSI, ISSCC)
Derived PCM parameters for F=90nm
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
49
50
Phase Change Memory Properties: Latency
Latency comparable to, but slower than DRAM
Read Latency
50ns: 4x DRAM, 10^-3x NAND Flash
Write Latency
150ns: 12x DRAM
Write Bandwidth
5-10 MB/s: 0.1x DRAM, 1x NAND Flash
51
Phase Change Memory Properties
Dynamic Energy
40 uA Rd, 150 uA Wr
2-43x DRAM, 1x NAND Flash
Endurance
Writes induce phase change at 650C
Contacts degrade from thermal expansion/contraction
10^8 writes per cell
10^-8x DRAM, 10^3x NAND Flash
Cell Size
9-12F^2 using BJT, single-level cells
1.5x DRAM, 2-3x NAND (will scale with feature size, MLC)
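A back-of-the-envelope check on what ~10^8 writes per cell means at the system level (a hedged sketch with assumed numbers: 16 GB capacity, 1 GB/s of sustained write traffic, and ideal wear leveling that spreads writes uniformly over all cells):

\[
\text{Lifetime} \approx \frac{\text{Capacity} \times \text{Endurance}}{\text{Write rate}}
= \frac{16\,\text{GB} \times 10^{8}\ \text{writes/cell}}{1\,\text{GB/s}}
\approx 1.6 \times 10^{9}\,\text{s} \approx 50\ \text{years}
\]

Real lifetimes are far shorter because writes concentrate on hot cells rather than spreading uniformly, which is why the unmanaged and architected lifetimes reported later in this lecture are in hours to years, not decades.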
52
Phase Change Memory: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher latencies: ~4-15x DRAM (especially write)
Higher active energy: ~2-50x DRAM (especially write)
Lower endurance (a cell dies after ~10^8 writes)
Challenges in enabling PCM as DRAM replacement/helper:
Mitigate PCM shortcomings
Find the right way to place PCM in the system
Ensure secure and fault-tolerant PCM operation
PCM-based Main Memory: Challenges
Where to place PCM in the memory hierarchy?
Hybrid OS controlled PCM-DRAM
Hybrid OS controlled PCM and hardware-controlled DRAM
Pure PCM main memory
How to mitigate shortcomings of PCM?
How to minimize amount of DRAM in the system?
How to take advantage of (byte-addressable and fast) non-volatile main memory?
Can we design techniques that are agnostic to the specific NVM technology?
PCM-based Main Memory (I)
How should PCM-based (main) memory be organized?
Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09, Meza+ IEEE CAL’12]:
How to partition/migrate data between PCM and DRAM
55
Hybrid Memory Systems: Challenges
Partitioning
Should DRAM be a cache or main memory, or configurable?
What fraction? How many controllers?
Data allocation/movement (energy, performance, lifetime)
Who manages allocation/movement?
What are good control algorithms?
How do we prevent degradation of service due to wearout?
Design of cache hierarchy, memory controllers, OS
Mitigate PCM shortcomings, exploit PCM advantages
Design of PCM/DRAM chips and modules
Rethink the design of PCM/DRAM with new requirements
56
PCM-based Main Memory (II)
How should PCM-based (main) memory be organized?
Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:
How to redesign entire hierarchy (and cores) to overcome PCM shortcomings
57
STT-MRAM as Main Memory
Magnetic Tunnel Junction (MTJ) device
Reference layer: Fixed magnetic orientation
Free layer: Parallel or anti-parallel
Magnetic orientation of the free layer determines logical state of device
High vs. low resistance
Write: Push large current through MTJ to change orientation of free layer
Read: Sense current flow
Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
[Figure: an MTJ consists of a reference layer, a barrier, and a free layer; the two free-layer orientations encode logical 0 and logical 1. A memory cell pairs the MTJ with an access transistor, connected to the word line, bit line, and sense line.]
Aside: STT MRAM: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher write latency
Higher write energy
Reliability?
Another level of freedom
Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)
59
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
60
An Initial Study: Replace DRAM with PCM
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC)
Derived “average” PCM parameters for F=90nm
61
Results: Naïve Replacement of DRAM with PCM
Replace DRAM with PCM in a 4-core, 4MB L2 system
PCM organized the same as DRAM: row buffers, banks, peripherals
1.6x delay, 2.2x energy, 500-hour average lifetime
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
62
Architecting PCM to Mitigate Shortcomings
Idea 1: Use multiple narrow row buffers in each PCM chip
Reduces array reads/writes → better endurance, latency, energy
Idea 2: Write into the array at cache-block or word granularity
Reduces unnecessary wear (a sketch of this partial-write idea follows below)
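A minimal sketch of the partial-write idea: track which words of the row buffer were actually modified and write back only those, instead of rewriting the whole row into the PCM array. Buffer size and word granularity are illustrative, not the paper's parameters.

    class PartialWriteRowBuffer:
        """Sketch of Idea 2: per-word dirty bits so a writeback only writes modified words."""
        def __init__(self, row_size_words=512):
            self.data = [0] * row_size_words
            self.dirty = [False] * row_size_words
            self.row_id = None

        def load_row(self, row_id, pcm_array):
            self.row_id = row_id
            self.data = list(pcm_array[row_id])      # read the row into the row buffer
            self.dirty = [False] * len(self.data)

        def write_word(self, idx, value):
            self.data[idx] = value
            self.dirty[idx] = True                   # remember exactly what changed

        def writeback(self, pcm_array):
            # Only dirty words are written into the PCM cells -> less wear, less write energy.
            writes = 0
            for idx, is_dirty in enumerate(self.dirty):
                if is_dirty:
                    pcm_array[self.row_id][idx] = self.data[idx]
                    writes += 1
            self.dirty = [False] * len(self.dirty)
            return writes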
63
Results: Architected PCM as Main Memory
1.2x delay, 1.0x energy, 5.6-year average lifetime
Scaling improves energy, endurance, density
Caveat 1: Worst-case lifetime is much shorter (no guarantees)
Caveat 2: Intensive applications see large performance and energy hits
Caveat 3: Optimistic PCM parameters?
More on PCM As Main Memory
Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)
65
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
66
Hybrid Memory Systems
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
[Figure: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
One Option: DRAM as a Cache for PCM
PCM is main memory; DRAM caches memory rows/blocks
Benefits: Reduced latency on DRAM cache hit; write filtering
Memory controller hardware manages the DRAM cache
Benefit: Eliminates system software overhead
Three issues:
What data should be placed in DRAM versus kept in PCM?
What is the granularity of data movement?
How to design a low-cost hardware-managed DRAM cache?
Two idea directions:
Locality-aware data placement [Yoon+ , ICCD 2012]
Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]
68
DRAM as a Cache for PCM
Goal: Achieve the best of both DRAM and PCM/NVM
Minimize amount of DRAM w/o sacrificing performance, endurance
DRAM as cache to tolerate PCM latency and write bandwidth
PCM as main memory to provide large capacity at good cost and power
69
[Figure: Processor with a DRAM buffer (data plus tag store, T = tag store) in front of PCM main memory and a PCM write queue, backed by Flash or HDD.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
Write Filtering Techniques
Lazy Write: Pages from disk installed only in DRAM, not PCM
Partial Writes: Only dirty lines from DRAM page written back
Page Bypass: Discard pages with poor reuse on DRAM eviction
Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
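A sketch of the three filters combined (simplified relative to Qureshi et al.; the structures and thresholds below are invented for illustration): a page fetched from disk is installed only in the DRAM buffer (Lazy Write), only dirty lines are written to PCM on eviction (Partial Writes), and pages with little reuse are dropped entirely (Page Bypass).

    from dataclasses import dataclass

    @dataclass
    class Page:
        id: int
        lines: list
        dirty: list
        access_count: int = 0

    def install_from_disk(page, dram_buffer):
        # Lazy Write: a page brought in from disk is installed only in DRAM, not PCM.
        dram_buffer[page.id] = page

    def evict_from_dram(page, pcm_lines, reuse_threshold=4):
        # Page Bypass: pages with poor reuse are simply discarded on eviction (no PCM writes).
        if page.access_count < reuse_threshold:
            return 0
        # Partial Writes: only the dirty lines of the page are written back to PCM.
        written = 0
        for i, line in enumerate(page.lines):
            if page.dirty[i]:
                pcm_lines[(page.id, i)] = line
                written += 1
        return written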
70
Results: DRAM as PCM Cache (I)
Simulation of a 16-core system: 8GB DRAM main memory with 320-cycle latency, HDD (2 ms) fronted by a Flash cache (32 us) with a 99% Flash hit rate
Assumption: PCM 4x denser, 4x slower than DRAM
DRAM block size = PCM page size (4kB)
71
[Figure: normalized execution time for db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and gmean under four configurations: 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
Results: DRAM as PCM Cache (II)
PCM-DRAM Hybrid performs similarly to similar-size DRAM
Significant energy savings with PCM-DRAM Hybrid
Average lifetime: 9.7 years (no guarantees)
72
[Figure: power, energy, and energy x delay, normalized to 8GB DRAM, for 8GB DRAM, Hybrid (32GB PCM + 1GB DRAM), and 32GB DRAM.]
Qureshi+, “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009.
More on DRAM-PCM Hybrid Memory
Moinuddin K. Qureshi, Viji Srinivasan, and Jude A. Rivers, "Scalable High-Performance Main Memory System Using Phase-Change Memory Technology," Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
73
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Row-Locality Aware Data Placement
Efficient DRAM (or Technology X) Caches
Conclusions
Discussion
74
Hybrid Memory
• Key question: How to place data between the heterogeneous memory devices?
75
[Figure: CPU with two memory controllers (MC), one to DRAM and one to PCM.]
Hybrid Memory: A Closer Look
76
[Figure: CPU connected via a memory channel and per-technology memory controllers (MC) to DRAM (small capacity cache) and PCM (large capacity store); each device has multiple banks, each with a row buffer.]
Row (buffer) hit: Access data from row buffer → fast
Row (buffer) miss: Access data from cell array → slow
Row Buffers and Latency
77
[Figure: a bank's cell array is indexed by a row address, and the selected row's data is placed in the row buffer. An access to a different row is a row buffer miss; an access to the already-open row is a row buffer hit.]
Key Observation
• Row buffers exist in both DRAM and PCM
– Row hit latency similar in DRAM & PCM [Lee+ ISCA’09]
– Row miss latency small in DRAM, large in PCM
• Place data in DRAM which
– is likely to miss in the row buffer (low row buffer locality) → miss penalty is smaller in DRAM
AND
– is reused many times → cache only the data worth the movement cost and DRAM space
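A minimal sketch of this row-buffer-locality-aware idea (simplified from the RBLA/RBLA-Dyn mechanisms of the ICCD 2012 paper; counters and thresholds are placeholders): track, per PCM row, how often accesses miss in the row buffer and how often the row is reused, and migrate to DRAM only rows that both miss frequently and are reused enough to pay back the migration cost.

    from collections import defaultdict

    class RBLAPlacer:
        def __init__(self, miss_threshold=8, reuse_threshold=16):
            self.rb_misses = defaultdict(int)   # row-buffer misses seen for each PCM row
            self.accesses  = defaultdict(int)   # total accesses to each PCM row
            self.miss_threshold  = miss_threshold
            self.reuse_threshold = reuse_threshold

        def record_access(self, row, row_buffer_hit):
            self.accesses[row] += 1
            if not row_buffer_hit:
                self.rb_misses[row] += 1

        def should_migrate_to_dram(self, row):
            # Low row buffer locality (many misses) AND enough reuse to amortize the move.
            return (self.rb_misses[row] >= self.miss_threshold and
                    self.accesses[row]  >= self.reuse_threshold)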
78
System Performance
79
[Figure: normalized weighted speedup for Server, Cloud, and Avg workloads under the FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn policies; callouts highlight improvements of 10%, 14%, and 17%.]
Benefit 1: Increased row buffer locality (RBL) in PCM by moving low-RBL data to DRAM
Benefit 2: Reduced memory bandwidth consumption due to stricter caching criteria
Benefit 3: Balanced memory request load between DRAM and PCM
Compared to All-PCM/DRAM
80
[Figure: weighted speedup, maximum slowdown, and performance per Watt, normalized, for 16GB PCM, RBLA-Dyn, and 16GB DRAM.]
Our mechanism achieves 31% better performance than all-PCM, within 29% of all-DRAM performance
For More on Hybrid Memory Data Placement
HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf)
81
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Row-Locality Aware Data Placement
Efficient DRAM (or Technology X) Caches
Conclusions
Discussion
82
The Problem with Large DRAM Caches
A large DRAM cache requires a large metadata (tag + block-based information) store
How do we design an efficient DRAM cache?
83
[Figure: CPU with memory controllers to DRAM (small, fast cache) and PCM (high capacity); a LOAD X first consults metadata to learn that X resides in DRAM, then accesses X there.]
Idea 1: Tags in Memory
Store tags in the same row as data in DRAM
Store metadata in same row as their data
Data and metadata can be accessed together
Benefit: No on-chip tag storage overhead
Downsides:
Cache hit determined only after a DRAM access
Cache hit requires two DRAM accesses
84
[Figure: a DRAM row holding Tag 0, Tag 1, Tag 2 alongside Cache block 0, Cache block 1, Cache block 2.]
Idea 2: Cache Tags in SRAM
Recall Idea 1: Store all metadata in DRAM
To reduce metadata storage overhead
Idea 2: Cache frequently-accessed metadata in on-chip SRAM
Cache only a small amount to keep SRAM size small
85
Idea 3: Dynamic Data Transfer Granularity
Some applications benefit from caching more data
They have good spatial locality
Others do not
Large granularity wastes bandwidth and reduces cache utilization
Idea 3: Simple dynamic caching granularity policy
Cost-benefit analysis to determine best DRAM cache block size
Group main memory into sets of rows
Some row sets follow a fixed caching granularity
The rest of main memory follows the best granularity
Cost–benefit analysis: access latency versus number of cachings
Performed every quantum
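A hedged sketch of the cost-benefit decision (the real policy in the CAL 2012 paper samples row sets and re-evaluates every quantum; the latency constants below are made up): weigh the caching cost (bytes moved over the channel) against the benefit (later accesses served at DRAM rather than PCM latency) and pick the granularity with the best net benefit.

    def net_benefit(block_bytes, reuses_at_this_granularity, bus_bytes_per_cycle=8,
                    pcm_access_cycles=400, dram_access_cycles=100):
        transfer_cycles = block_bytes / bus_bytes_per_cycle          # cost of caching the block
        saved_cycles = reuses_at_this_granularity * (pcm_access_cycles - dram_access_cycles)
        return saved_cycles - transfer_cycles                        # benefit minus cost

    def pick_granularity(measured_reuses):
        # measured_reuses: {block_bytes: avg accesses captured per cached block at that size},
        # e.g., gathered from the sampled row sets during the previous quantum.
        return max(measured_reuses, key=lambda b: net_benefit(b, measured_reuses[b]))

    # Good spatial locality favors large blocks; poor locality favors small ones.
    good_locality = {64: 2, 256: 8, 1024: 30, 4096: 100}
    poor_locality = {64: 2, 256: 2, 1024: 2,  4096: 2}
    print(pick_granularity(good_locality), pick_granularity(poor_locality))   # 4096 64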
86
TIMBER Tag Management
A Tag-In-Memory BuffER (TIMBER)
Stores recently-used tags in a small amount of SRAM
Benefits: If tag is cached:
no need to access DRAM twice
cache hit determined quickly
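A minimal sketch of the lookup path (names and sizes invented; not the exact TIMBER design): check the small SRAM tag buffer first; on a buffer hit the DRAM-cache hit/miss decision is made immediately, otherwise fetch the row's metadata from DRAM, install it in the buffer, and then decide.

    from collections import OrderedDict

    class TimberLikeTagBuffer:
        def __init__(self, capacity_rows=64):
            self.buf = OrderedDict()     # dram_row -> {block_addr: cached?}, kept in LRU order
            self.capacity = capacity_rows

        def lookup(self, block_addr, row_of, fetch_metadata_from_dram):
            row = row_of(block_addr)
            if row in self.buf:                       # tags cached in SRAM:
                self.buf.move_to_end(row)             #   hit/miss known without touching DRAM
                return self.buf[row].get(block_addr, False), "sram"
            metadata = fetch_metadata_from_dram(row)  # tags not cached: one extra DRAM access
            self.buf[row] = metadata
            if len(self.buf) > self.capacity:
                self.buf.popitem(last=False)          # evict the least-recently-used row's tags
            return metadata.get(block_addr, False), "dram"

    # Example wiring (2KB rows; the metadata fetch is a stub for illustration):
    tb = TimberLikeTagBuffer()
    hit, source = tb.lookup(0x1240, row_of=lambda a: a // 2048,
                            fetch_metadata_from_dram=lambda row: {0x1240: True})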
87
[Figure: TIMBER is a small SRAM structure that caches the tags (Tag 0, Tag 1, Tag 2) of recently accessed DRAM rows (e.g., Row 0, Row 27); the DRAM rows themselves still store tags next to their cache blocks.]
TIMBER Tag Management Example (I)
Case 1: TIMBER hit
88
[Figure: Case 1, TIMBER hit: for LOAD X, the buffered tags already indicate that X is in DRAM, so the memory controller accesses X directly with a single DRAM access.]
TIMBER Tag Management Example (II)
Case 2: TIMBER miss
89
[Figure: Case 2, TIMBER miss: for LOAD Y, the controller 1. accesses Y's metadata M(Y) in DRAM, 2. caches M(Y) in the TIMBER buffer, and 3. accesses Y itself, which is a row buffer hit because M(Y) sits in the same row as Y.]
90
Metadata Storage Performance
[Figure: normalized weighted speedup for the SRAM (ideal), Region, TIM, and TIMBER metadata storage options.]
91
Metadata Storage Performance
[Figure: same comparison; a 48% performance degradation is highlighted, caused by increased metadata lookup access latency.]
92
Metadata Storage Performance
[Figure: same comparison; a 36% improvement is highlighted: increased row locality reduces average memory access latency.]
93
Metadata Storage Performance
[Figure: same comparison; a 23% improvement is highlighted: data with locality can access metadata at SRAM latencies.]
94
Dynamic Granularity Performance
[Figure: normalized weighted speedup for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; a 10% improvement is highlighted for TIMBER-Dyn, from reduced channel contention and improved spatial locality.]
95
TIMBER Performance
[Figure: normalized weighted speedup for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; TIMBER-Dyn is highlighted as being within 6% of the ideal SRAM-based design, thanks to reduced channel contention and improved spatial locality.]
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
96
TIMBER Energy Efficiency
[Figure: normalized performance per Watt (for the memory system) for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; an 18% improvement is highlighted: fewer migrations reduce transmitted data and channel contention.]
Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
More on Large DRAM Cache Design
Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," IEEE Computer Architecture Letters (CAL), February 2012.
97
STT-MRAM As Main Memory
STT-MRAM as Main Memory
Magnetic Tunnel Junction (MTJ) device
Reference layer: Fixed magnetic orientation
Free layer: Parallel or anti-parallel
Magnetic orientation of the free layer determines logical state of device
High vs. low resistance
Write: Push large current through MTJ to change orientation of free layer
Read: Sense current flow
Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
[Figure: an MTJ consists of a reference layer, a barrier, and a free layer; the two free-layer orientations encode logical 0 and logical 1. A memory cell pairs the MTJ with an access transistor, connected to the word line, bit line, and sense line.]
STT-MRAM: Pros and Cons
Pros over DRAM
Better technology scaling
Non volatility
Low idle power (no refresh)
Cons
Higher write latency
Higher write energy
Reliability?
Another level of freedom
Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)
100
Architected STT-MRAM as Main Memory
4-core, 4GB main memory, multiprogrammed workloads
~6% performance loss, ~60% energy savings vs. DRAM
101
[Figure: performance vs. DRAM for STT-RAM (base) and STT-RAM (opt), and energy vs. DRAM broken down into ACT+PRE, WB, and RB components.]
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
More on STT-MRAM as Main Memory
Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013. Slides (pptx) (pdf)
102
Other Opportunities with Emerging Technologies
Merging of memory and storage
e.g., a single interface to manage all data
New applications
e.g., ultra-fast checkpoint and restore
More robust system design
e.g., reducing data loss
Processing tightly-coupled with memory
e.g., enabling efficient search and filtering
103
Merging of Memory and Storage:
Persistent Memory Managers
Coordinated Memory and Storage with NVM (I)
The traditional two-level storage model is a bottleneck with NVM
Volatile data in memory → a load/store interface
Persistent data in storage → a file system interface
Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data becomes a performance and energy bottleneck with fast NVM stores
105
Two-Level Store
[Figure: the processor and caches reach main memory through virtual memory, address translation, and load/store instructions, and reach storage (SSD/HDD) through the operating system and file system (fopen, fread, fwrite, ...).]
Persistent (e.g., Phase-Change) Memory
Coordinated Memory and Storage with NVM (II)
Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
Improves both energy and performance
Simplifies programming model as well
106
Unified Memory/Storage
[Figure: the processor and caches issue loads/stores to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback to the system.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
The Persistent Memory Manager (PMM)
Exposes a load/store interface to access persistent data
Applications can directly access persistent memory → no conversion, translation, or location overhead for persistent data
Manages data placement, location, persistence, security
To get the best of multiple forms of storage
Manages metadata storage and retrieval
This can lead to overheads that need to be managed
Exposes hooks and interfaces for system software
To enable better data placement and management decisions
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
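A conceptual sketch of what such an interface might look like from the application's point of view (the API names are invented for illustration; WEED 2013 does not define this exact interface): the application allocates a named persistent region once and then uses ordinary loads and stores, while the PMM decides which device the data actually lives on.

    class PersistentMemoryManager:
        """Toy model: persistent objects are looked up by name and then accessed directly,
        with no file open/read/write path; placement across devices is the PMM's job."""
        def __init__(self):
            self.objects = {}   # name -> bytearray (stand-in for data on DRAM/PCM/Flash)
            self.hints = {}

        def alloc(self, name, size, hints=None):
            # Hints (e.g., access pattern, persistence needs) guide placement decisions.
            obj = self.objects.setdefault(name, bytearray(size))
            self.hints[name] = hints or {}
            return obj          # the application now just loads/stores into this buffer

    pmm = PersistentMemoryManager()
    log = pmm.alloc("app/log", 4096, hints={"persistent": True, "access": "append"})
    log[0:5] = b"hello"          # a plain store; no fopen/fread/fwrite involved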
107
The Persistent Memory Manager (PMM)
108
PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices
Persistent objects
Efficient Data Mapping among Heterogeneous Devices
A persistent memory exposes a large, persistent address space
But it may use many different devices to satisfy this goal
From fast, low-capacity volatile DRAM to slow, high-capacity non-volatile HDD or Flash
And other NVM devices in between
Performance and energy can benefit from good placement of data among these devices
Utilizing the strengths of each device and avoiding their weaknesses, if possible
For example, consider two important application characteristics: locality and persistence
109
110
Efficient Data Mapping among Heterogeneous Devices
111
Columns in a column store that are scanned through only infrequently → place on Flash
Efficient Data Mapping among Heterogeneous Devices
112
Columns in a column store that are scanned through only infrequently → place on Flash
Frequently-updated index for a Content Delivery Network (CDN) → place in DRAM
Efficient Data Mapping among Heterogeneous Devices
Applications or system software can provide hints for data placement
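A small sketch of how such hints might drive placement (the policy and device names are illustrative only, matching the two examples above: cold, rarely scanned data to Flash; hot, frequently updated data to DRAM):

    def place(hints):
        # Illustrative policy: hot data to DRAM, cold data to Flash, the rest to NVM (e.g., PCM).
        freq = hints.get("access_frequency", "medium")
        if freq == "high":
            return "DRAM"    # e.g., a frequently-updated CDN index
        if freq == "low":
            return "Flash"   # e.g., column-store columns scanned only infrequently
        return "PCM"

    print(place({"access_frequency": "high"}))   # -> DRAM
    print(place({"access_frequency": "low"}))    # -> Flash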
Performance Benefits of a Single-Level Store
113
[Figure: performance improvements of roughly 5X and 24X from the single-level store.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Energy Benefits of a Single-Level Store
114
[Figure: energy improvements of roughly 5X and 16X from the single-level store.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
More on Single-Level Stores
Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)
115
Enabling and Exploiting NVM: Issues
Many issues and ideas from technology layer to algorithms layer
Enabling NVM and hybrid memory
How to tolerate errors?
How to enable secure operation?
How to tolerate performance and power shortcomings?
How to minimize cost?
Exploiting emerging technologies
How to exploit non-volatility?
How to minimize energy consumption?
How to exploit NVM on chip?
Microarchitecture
ISA
Programs
Algorithms
Problems
Logic
Devices
Runtime System
(VM, OS, MM)
User
Security Challenges of Emerging Technologies
1. Limited endurance → Wearout attacks
2. Non-volatility → Data persists in memory after crash/poweroff
→ Inconsistency and retrieval of privileged information
3. Multiple bits per cell → Information leakage (via side channel)
117
Securing Emerging Memory Technologies
1. Limited endurance → Wearout attacks
Better architecting of memory chips to absorb writes
Hybrid memory system management
Online wearout attack detection (a simple sketch follows this slide)
2. Non-volatility → Data persists in memory after crash/poweroff
→ Inconsistency and retrieval of privileged information
Efficient encryption/decryption of whole main memory
Hybrid memory system management
3. Multiple bits per cell → Information leakage (via side channel)
System design to hide side channel information
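A hedged sketch of one way online wearout attack detection could work (the slide does not specify a mechanism; the epoch length and threshold below are invented): keep per-row write counters over an epoch and flag rows whose write rate far exceeds what benign workloads would produce, so the system can throttle, remap, or report them.

    from collections import defaultdict

    class WearoutAttackDetector:
        def __init__(self, epoch_accesses=1_000_000, writes_per_row_threshold=10_000):
            self.writes = defaultdict(int)
            self.seen = 0
            self.epoch = epoch_accesses
            self.threshold = writes_per_row_threshold

        def on_write(self, row):
            self.writes[row] += 1
            self.seen += 1
            suspects = []
            if self.seen >= self.epoch:
                # Rows written far more than a plausible benign rate are flagged for
                # throttling, remapping, or notification of the OS.
                suspects = [r for r, n in self.writes.items() if n >= self.threshold]
                self.writes.clear()
                self.seen = 0
            return suspects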
118
Agenda
Major Trends Affecting Main Memory
Requirements from an Ideal Main Memory System
Opportunity: Emerging Memory Technologies
Background
PCM (or Technology X) as DRAM Replacement
Hybrid Memory Systems
Conclusions
Discussion
119
Summary: Memory Scaling (with NVM)
Main memory scaling problems are a critical bottleneck for system performance, efficiency, and usability
Solution 1: Tolerate DRAM
Solution 2: Enable emerging memory technologies
Replace DRAM with NVM by architecting NVM chips well
Hybrid memory systems with automatic data management
An exciting topic with many other solution directions & ideas
Hardware/software/device cooperation essential
Memory, storage, controller, software/app co-design needed
Coordinated management of persistent memory and storage
Application and hardware cooperative management of NVM
120
18-740/640
Computer Architecture
Lecture 9: Emerging Memory Technologies
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2015, 9/30/2015