Persistent memory: new tier or storage replacement? · 2019-12-21
TRANSCRIPT
Persistent memory: new tier or storage replacement?
Kimberly Keeton and Susan Spence
[email protected] and [email protected]
SNIA Storage Developer Conference, September 2017
Outline
– Persistent memory blurs boundaries between memory and storage
– Memory-Driven Computing (MDC): context for our work with persistent memory
– Memory-Driven Computing software
– Challenges for persistent memory as storage
– Summary
©Copyright 2017 Hewlett Packard Enterprise Company 2
Memory + storage hierarchy technologies
[Figure: latency vs. capacity for the memory/storage hierarchy]
– SRAM (caches): 1-10ns, MBs
– DDR DRAM: 50-100ns, 10-100GBs
– SSDs: 1-10µs, 1-10TBs
– Disks: ms, 10-100TBs
©Copyright 2017 Hewlett Packard Enterprise Company 3
Memory + storage hierarchy technologies
Two new entries!
[Figure: latency vs. capacity for the memory/storage hierarchy, now with two new tiers]
– SRAM (caches): 1-10ns, MBs
– DDR DRAM: 50-100ns, 10-100GBs
– On-package DRAM: 50ns, plus massive bandwidth
– NVM: 200ns-1µs, 1TBs
– SSDs: 1-10µs, 1-10TBs
– Disks: ms, 10-100TBs
©Copyright 2017 Hewlett Packard Enterprise Company 4
Non-Volatile Memory (NVM)
– Persistently stores data
– Access latencies comparable to DRAM
– Byte addressable (load/store) rather than block addressable (read/write)
– Some NVM technologies more energy efficient and denser than DRAM
[Figure: latency spectrum (ns to µs) of NVM technologies: NVDIMM-N, Resistive RAM (Memristor), Phase-Change Memory, Spin-Transfer Torque MRAM, 3D Flash]
Haris Volos, et al. "Aerie: Flexible File-System Interfaces to Storage-Class Memory," Proc. EuroSys 2014.
©Copyright 2017 Hewlett Packard Enterprise Company 5
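Byte addressability changes the access model: an update becomes a CPU store to a mapped address plus a flush, rather than a block read/write through an I/O stack. The sketch below illustrates that load/store model with POSIX mmap over an ordinary file standing in for persistent memory; on real NVM, msync would be replaced by cache-line flushes and fences (e.g., via a persistent memory library), and the function name and path here are invented for the example.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Write through a plain store, flush, unmap, then remap and read back.
// Returns true if the data survived without any read()/write() syscalls.
bool nvm_store_roundtrip(const char* path) {
    const size_t kLen = 4096;
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, kLen) != 0) return false;
    char* pm = static_cast<char*>(
        mmap(nullptr, kLen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (pm == MAP_FAILED) return false;

    std::strcpy(pm, "persistent hello");  // byte-addressable store
    msync(pm, kLen, MS_SYNC);             // flush to media (flush + fence on real NVM)
    munmap(pm, kLen);
    close(fd);

    // Remap and load directly: no deserialization, no block reads.
    fd = open(path, O_RDONLY);
    char* pm2 = static_cast<char*>(
        mmap(nullptr, kLen, PROT_READ, MAP_SHARED, fd, 0));
    bool ok = (pm2 != MAP_FAILED) && std::strcmp(pm2, "persistent hello") == 0;
    if (pm2 != MAP_FAILED) munmap(pm2, kLen);
    close(fd);
    unlink(path);
    return ok;
}
```

The same pattern underlies direct-access (DAX) file systems: the mapping gives the application a pointer into the media itself, so persistence costs a flush, not a syscall per access.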
Gen-Z: open systems interconnect standard
http://www.genzconsortium.org
– Open standard for memory-semantic interconnect
– Members: 30+ companies covering SoC, memory, I/O, networking, mechanical, system software, etc.
– Motivation
  – Emergence of low-latency storage class memory
  – Demand for large capacity, rack-scale resource pools and multi-node architectures
– Memory semantics
  – All communication as memory operations (load/store, put/get, atomics)
– High performance
  – Tens to hundreds of GB/s bandwidth
  – Sub-microsecond load-to-use memory latency
– Draft spec available for public download
©Copyright 2017 Hewlett Packard Enterprise Company 6
[Figure: Gen-Z as an open standard connecting CPUs (SoC, ASIC, neuromorphic), accelerators (FPGA, GPU), I/O, network storage, and memory in a direct-attach, switched, or fabric topology]
From today's processor-centric computing to the future architecture: Memory-Driven Computing
[Figure: today, each SoC owns its local memory; in the Memory-Driven Computing architecture, heterogeneous compute elements (SoC, GPU, ASIC, quantum, and other accelerators) share a pool of memory technologies (NVM) over a memory fabric]
©Copyright 2017 Hewlett Packard Enterprise Company 7
Core Memory-Driven Computing components
– Fast, persistent memory
– Fast memory fabric
– Task-specific processing
– New and adapted software
©Copyright 2017 Hewlett Packard Enterprise Company 8
HPE introduces the world's largest single-memory computer
The prototype contains 160 terabytes of fabric-attached memory
– 160 TB of shared memory spread across 40 physical nodes, interconnected using a high-performance fabric protocol
– An optimized Linux-based operating system running on ThunderX2, Cavium's flagship second-generation, dual-socket-capable ARMv8-A workload-optimized System on a Chip
– Photonics/optical communication links, including the new X1 photonics module, online and operational
– Software programming tools designed to take advantage of an abundance of persistent memory
©Copyright 2017 Hewlett Packard Enterprise Company 9
Putting it all together: Memory-Driven Computing
– Converging memory and storage
  – Byte-addressable NVM replaces hard drives and SSDs
– Resource disaggregation leads to high-capacity shared memory pool
  – Fabric-attached memory pool is accessible by all compute resources
  – Low-diameter networks provide near-uniform low latency
– Local volatile memory provides lower-latency, high-performance tier
– Distributed heterogeneous compute resources
– Software
  – Memory-speed persistence
  – Direct, unmediated access to all fabric-attached NVM across the memory fabric
  – Non-coherent accesses between compute nodes
[Figure: SoCs with local DRAM joined by a network; a communications and memory fabric attaches every SoC to a fabric-attached NVM memory pool]
©Copyright 2017 Hewlett Packard Enterprise Company 10
Memory-Driven Computing in context
[Figure: shared nothing — SoCs, each with local DRAM and local NVM, communicate only over a network; shared everything — physical servers (SoC, local DRAM, local NVM) joined by a coherent interconnect, with servers connected by a network]
©Copyright 2017 Hewlett Packard Enterprise Company 11
Memory-Driven Computing in context
[Figure: shared nothing (SoCs with local DRAM and NVM connected only by a network) and shared everything (physical servers joined by a coherent interconnect), plus shared something — SoCs with local DRAM attached through a communications and memory fabric to a shared fabric-attached NVM memory pool, alongside a conventional network]
©Copyright 2017 Hewlett Packard Enterprise Company 12
Questions driving work on MDC software
How do I ...
− use persistent memory?
− take full advantage of byte-addressable persistent memory?
− share data safely in a large, globally-addressable pool of memory?
− ensure data consistency in the face of failures and errors?
− scale effectively to use our large-memory, many-core systems?
©Copyright 2017 Hewlett Packard Enterprise Company 13
Memory-Driven Computing delivered with systems software
– Memory is large
  – In-memory data structures
  – Large indexes reduce storage access
– Memory is shared non-coherently over fabric
  – Communicate through memory
  – Data sharing by reference
  – Easier load-balancing
– Memory is persistent
  – No storage overheads
  – Fast snapshots
  – Fast checkpointing
©Copyright 2017 Hewlett Packard Enterprise Company 14-17
Memory-Driven Computing Developer Toolkit
[Figure: toolkit projects mapped to the three memory properties (large; shared non-coherently over fabric; persistent): Shoveller, ALPS, MDS, MPGC, Atlas, Foedus, NV-WAL, CRIU-PMEM, NVthreads, NVMM, Radix Tree]
©Copyright 2017 Hewlett Packard Enterprise Company 18-19
Libnvwal: Write-Ahead Logging Library
– Key mechanism for atomicity and durability
  – Atomicity: record all modifications in WAL, and
  – Durability: persist WAL before applying
– Persistence latency critical to transaction performance
– Non-volatile memory DIMM (NVDIMM) enables low-latency logging

TRANSFER (from, to, amount)
transaction {
  from = from - amount
  to = to + amount
}
[Figure: account balances before, the WAL records written to disk/NVDIMM, and the balances after applying TRANSFER(from, to, 20)]
©Copyright 2017 Hewlett Packard Enterprise Company 20-21
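The log-first discipline above can be sketched in a few lines. The Ledger and LogRecord names are invented for illustration, and a std::vector stands in for a durably flushed log:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal write-ahead-logging sketch for the TRANSFER transaction.
// A real WAL would flush (fsync, or an NVDIMM persist barrier) after
// appending records and before touching the data itself.
struct LogRecord { std::string field; int old_val; int new_val; };

struct Ledger {
    std::vector<LogRecord> wal;  // stand-in for the durable log
    int from = 100, to = 200;

    void transfer(int amount) {
        // Atomicity: record all modifications in the WAL first...
        wal.push_back({"from", from, from - amount});
        wal.push_back({"to",   to,   to + amount});
        // Durability: persist the WAL here (flush omitted in this sketch)...
        // ...then apply the modifications to the data.
        from -= amount;
        to   += amount;
    }
};
```

On recovery, replaying (or undoing) the logged old/new values restores a consistent state even if a crash hits between the two balance updates.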
Libnvwal: Hybrid NVDIMM/Disk Write-Ahead Logging
– NVDIMM enables low-latency persistence
  – DRAM: best performance/lowest latency for fast data access
  – NAND Flash: persistent store for the NVDIMM
– But limited capacity (8GB NVDIMM today) requires truncating log
  – Requires checkpointing database
  – Precludes archiving log for disaster recovery
– Libnvwal
  – Stores log tail in NVDIMM for low latency and de-stages log to disk for capacity
  – Relieves application from implementing complex rotation logic
    – Non-trivial interaction between NVM buffers, file system, log metadata durability, and flushing threads
  – Improves MySQL OLTP performance by up to 106% (sysbench)
©Copyright 2017 Hewlett Packard Enterprise Company 22-23
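The tail-in-NVDIMM, body-on-disk split can be illustrated with a toy de-staging loop. HybridLog and its 4-entry capacity are invented for the sketch; in-memory containers stand in for the NVDIMM buffer and the disk-resident log files:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of libnvwal-style hybrid logging: appends land in a small
// fixed-capacity tail (the "NVDIMM"), which is de-staged to a growable
// store (the "disk") whenever it fills, truncating the tail.
class HybridLog {
    static constexpr std::size_t kTailCap = 4;  // tiny stand-in for 8GB
    std::array<int, kTailCap> tail_{};
    std::size_t tail_len_ = 0;
    std::vector<int> disk_;
public:
    void append(int rec) {
        if (tail_len_ == kTailCap) destage();   // make room first
        tail_[tail_len_++] = rec;               // low-latency persist path
    }
    void destage() {                            // normally a background thread
        disk_.insert(disk_.end(), tail_.begin(), tail_.begin() + tail_len_);
        tail_len_ = 0;                          // truncate the NVDIMM tail
    }
    std::size_t on_disk() const { return disk_.size(); }
    std::size_t in_tail() const { return tail_len_; }
};
```

The hard parts that libnvwal handles, and this sketch does not, are exactly the slide's list: ordering NVM buffer flushes against file-system writes, keeping log metadata durable, and coordinating the flushing threads.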
Fewer software layers
– Traditional database system: Application → Database client → Database server (Object → Relation) → Filesystem
– Managed Data Structures: Application → MDS Runtime
©Copyright 2017 Hewlett Packard Enterprise Company 24
Managed Data Structures (MDS)
Simplify programming on persistent in-memory data
– Ease of programming
  – Programmer manages only application-level data structures: Record, Array, Map, Set, Graph, List, Queue, …
  – MDS data structures are automatically persisted in NVM
  – APIs in multiple programming languages: Java, C++
    – Programmer access through references to data
    – Direct reads and writes
– Ease of data sharing
  – Just pass a reference
  – Each program treats the data as local to the program
– High-level concurrency controls
  – Ensure consistent data in the face of data sharing by multiple threads/processes
[Figure: processes 1 and 2 sharing a managed space in NVM]
©Copyright 2017 Hewlett Packard Enterprise Company 25
“Shared something” Key-Value Store (KVS)
Leverage large pool of fabric-attached memory and fabric atomics
[Figure: N nodes (each CPU + DRAM + I/O) connected over a memory fabric; data stored in fabric-attached NVM]
©Copyright 2017 Hewlett Packard Enterprise Company 26
High-level architecture for shared something KVS
– What's different in fabric-attached memory (FAM)?
  – Memory-semantic accesses (including atomics)
  – No hardware-based cache coherence between nodes
  – Separate fault domains between FAM and nodes
– Goal: multi-node, FAM-aware KVS
– How: exploit FAM and FAM atomics for
  – Concurrency control
  – Fault tolerance
  – Load balancing
– Open sourced components
  – Memory manager: MDC toolkit NVMM
  – Indexes: MDC toolkit radix tree
[Figure: KVS stack: API over indexes (hash table, radix tree, B+-tree), over the memory manager, over the Librarian File System (LFS)]
©Copyright 2017 Hewlett Packard Enterprise Company 27
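Without inter-node cache coherence, concurrency control falls back on atomic operations against the shared memory itself. The sketch below shows the idea with a fixed-size, insert-only, open-addressed table whose slots are claimed by compare-and-swap; std::atomic stands in for FAM atomics here, and the names (FixedKvs, Slot) are invented for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Each slot is a pair of shared-memory words; key 0 marks "empty".
struct Slot {
    std::atomic<uint64_t> key{0};
    std::atomic<uint64_t> val{0};
};

class FixedKvs {                         // open-addressed, insert-only table
    static constexpr std::size_t kSlots = 64;
    Slot slots_[kSlots];
public:
    bool put(uint64_t key, uint64_t val) {   // requires key != 0
        for (std::size_t i = 0; i < kSlots; ++i) {
            Slot& s = slots_[(key + i) % kSlots];
            uint64_t expected = 0;
            // Claim an empty slot with CAS, or reuse a slot we already own.
            if (s.key.compare_exchange_strong(expected, key) || expected == key) {
                s.val.store(val);
                return true;
            }
        }
        return false;                        // table full
    }
    std::optional<uint64_t> get(uint64_t key) const {
        for (std::size_t i = 0; i < kSlots; ++i) {
            const Slot& s = slots_[(key + i) % kSlots];
            uint64_t k = s.key.load();
            if (k == key) return s.val.load();
            if (k == 0) return std::nullopt; // hit the end of the probe chain
        }
        return std::nullopt;
    }
};
```

Two nodes racing to insert the same key both reach the CAS; exactly one wins the empty slot, and the loser observes the winner's key and proceeds safely, with no locks and no coherence traffic assumed.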
Librarian File System (LFS)
[Figure: the Librarian manages fabric-attached memory as “books” (8GB allocation units) grouped into “shelves” (logical allocations); filesystem, key-value store, and application frameworks sit on top]
©Copyright 2017 Hewlett Packard Enterprise Company 28
Non-volatile Memory Manager (NVMM)
– Goals
  – Provide finer-grained memory allocation than LFS
  – Enable sharing across different processes and nodes
  – Tolerate failures of compute nodes and processes
– Manage shelves provided by LFS
  – Allocate, free, and name shelves (multiples of 8GBs)
  – Expose a global address space: {shelf ID, shelf offset}
– Memory access abstractions
  – Region APIs for direct memory-map access
  – Heap APIs for alloc/free style access
    – Heap: allow any process from any node to allocate and free globally shared NVM transparently
[Figure: NVMM pools built on LFS shelves, e.g., a key-value store using shelves 5, 10, and 19; a heap pool offers alloc/free with internal bookkeeping and indexes, a region pool offers mmap access]
©Copyright 2017 Hewlett Packard Enterprise Company 29
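A {shelf ID, shelf offset} global address is naturally packed into a single 64-bit word so it can be stored in FAM and passed between nodes. The field split below (16 bits of shelf ID, 48 bits of offset) is an assumption chosen for illustration, not NVMM's actual layout:

```cpp
#include <cassert>
#include <cstdint>

// Pack a {shelf ID, shelf offset} pair into one 64-bit global pointer.
// Field widths are illustrative: 16 bits of shelf ID, 48 bits of offset.
struct GlobalPtr {
    static constexpr int kShelfBits = 16;
    static constexpr int kOffsetBits = 64 - kShelfBits;
    static constexpr uint64_t kOffsetMask = (uint64_t{1} << kOffsetBits) - 1;

    uint64_t raw;

    static GlobalPtr pack(uint64_t shelf, uint64_t offset) {
        return {(shelf << kOffsetBits) | (offset & kOffsetMask)};
    }
    uint64_t shelf() const  { return raw >> kOffsetBits; }
    uint64_t offset() const { return raw & kOffsetMask; }
};
```

With such a pointer, “just pass a reference” means handing another process the 64-bit value; each side resolves the shelf ID to wherever it has mapped that shelf locally.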
Radix trees
– Also known as compact prefix tries
  – “Compress” common prefixes to improve space efficiency
– Maintain keys in sorted order
  – Support range (multi-key) lookups, unlike hash tables
– More amenable to FAM than B+-trees
  – Require fewer structural changes (e.g., node splits) as tree grows
  – Atomic operations used to insert or delete key and leave tree in consistent state
[Figure: radix tree over keys romane (1), romanus (2), romulus (3), rubens (4), ruber (5), rubicon (6), rubicundus (7); inserting radix (0) splits the shared “r” prefix edge]
https://en.wikipedia.org/wiki/Radix_tree
©Copyright 2017 Hewlett Packard Enterprise Company 30-32
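The prefix compression shown in the figure, where inserting “radix” splits the shared edge, can be seen in a minimal in-DRAM radix tree. This is a plain single-threaded sketch, not the FAM-aware, atomics-based version in the toolkit:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Minimal radix (compact prefix) tree mapping strings to ints.
// Each edge carries a string label; shared prefixes are stored once.
class RadixTree {
    struct Node {
        std::map<char, std::pair<std::string, std::unique_ptr<Node>>> edges;
        std::optional<int> value;
    };
    Node root_;

    static void insert(Node* n, const std::string& key, int v) {
        if (key.empty()) { n->value = v; return; }
        auto it = n->edges.find(key[0]);
        if (it == n->edges.end()) {           // no shared prefix: new leaf edge
            auto leaf = std::make_unique<Node>();
            leaf->value = v;
            n->edges.emplace(key[0], std::make_pair(key, std::move(leaf)));
            return;
        }
        std::string& label = it->second.first;
        std::unique_ptr<Node>& child = it->second.second;
        std::size_t i = 0;                    // length of common prefix
        while (i < label.size() && i < key.size() && label[i] == key[i]) ++i;
        if (i < label.size()) {               // split the edge at position i
            auto mid = std::make_unique<Node>();
            mid->edges.emplace(label[i],
                std::make_pair(label.substr(i), std::move(child)));
            label.resize(i);
            child = std::move(mid);
        }
        insert(child.get(), key.substr(i), v);
    }

    static std::optional<int> find(const Node* n, const std::string& key) {
        if (key.empty()) return n->value;
        auto it = n->edges.find(key[0]);
        if (it == n->edges.end()) return std::nullopt;
        const std::string& label = it->second.first;
        if (key.compare(0, label.size(), label) != 0) return std::nullopt;
        return find(it->second.second.get(), key.substr(label.size()));
    }

public:
    void insert(const std::string& key, int v) { insert(&root_, key, v); }
    std::optional<int> find(const std::string& key) const {
        return find(&root_, key);
    }
};
```

The FAM-friendly property is visible in the split step: growing the tree touches one edge and one new node, which is why insertions can be published with a single atomic pointer update rather than the multi-node rebalancing a B+-tree needs.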
Memory-Driven Computing Developer Toolkit
Software already available to you
– Example applications
– Programming and analytics tools
– Operating system support
– Emulation/simulation tools
Get access to the toolkit: https://www.labs.hpe.com/the-machine/developer-toolkit
[Figure: toolkit software stack on Machine (prototype) hardware, with open source components throughout: Linux for Memory-Driven Computing and the node operating system, with the Librarian, Librarian File System (LFS), management, Persistent Memory Library (pmem.io), and fabric-attached memory atomics library; data management and programming frameworks (managed data structures, Sparkle, fault-tolerant programming, fast optimistic engine, persistent memory toolkit); example applications (image search, large-scale graph inference); emulation/simulation tools (performance emulation for NVM, fabric-attached memory emulation, x86 emulation on Superdome X, MC990x, ProLiant)]
©Copyright 2017 Hewlett Packard Enterprise Company 33
Persistent memory as storage?
– If persistent memory is the new storage… it must safely remember persistent data.
– Persistent data should be stored:
  – Reliably, in the face of failures
  – Securely, in the face of exploits
  – In a cost-effective manner
  – Using a data access API that's most natural for the device characteristics
©Copyright 2017 Hewlett Packard Enterprise Company 34
Storing data reliably, securely and cost-effectively: the problem
– Potential concerns about using persistent memory to safely store persistent data:
  – NVM failures may result in loss of persistent data
  – Persistent data may be stolen
– Time to revisit traditional storage services
  – Ex: replication, erasure codes, encryption, compression, deduplication, wear leveling, snapshots
– New challenges:
  – Need to operate at memory speeds, not storage speeds
  – Traditional solutions (e.g., encryption, compression) complicate direct access
  – Space-efficient redundancy for NVM
©Copyright 2017 Hewlett Packard Enterprise Company 35
Storing data reliably, securely and cost-effectively: potential solutions
– Software implementations can trade performance for reliability, security and cost-effectiveness
  – But will diminish benefits from faster technologies
– Memory-side hardware acceleration
  – Memory speeds may demand acceleration (e.g., DMA-style data movement, memset, encryption, compression)
  – What functions are ripe for memory-side acceleration?
– Wear leveling for fabric-attached non-volatile memory
  – Repeated NVM writes may exacerbate device wear issues
  – What's the right balance between hardware-assisted wear leveling and software techniques?
– Fabric-attached non-volatile memory diagnostics
  – What is the equivalent of Self-Monitoring, Analysis and Reporting Technology (SMART) for NVM?
©Copyright 2017 Hewlett Packard Enterprise Company 36
Memory + storage hierarchy technologies
[Figure: the latency vs. capacity hierarchy from slide 4 (SRAM, DDR DRAM, on-package DRAM, NVM, SSDs, disks)]
Load/store memory won't be the most cost-effective “forever” storage for cold data
©Copyright 2017 Hewlett Packard Enterprise Company 37
Memory + storage hierarchy – a usage view (ownership)
[Figure: the same latency/capacity hierarchy annotated by ownership (node local vs. shared) and data lifetime: scratch/ephemeral (seconds), persistent to failures (hours to days), durable (weeks to months), archive (years)]
How to manage multi-tiered hierarchy, to ensure data is in the “right” tier?
©Copyright 2017 Hewlett Packard Enterprise Company 38
The appropriate API for the technology
[Figure: the latency/capacity hierarchy split into two API regimes]
– Memory (load/store cachelines, atomics) API
  – Note: needs to include entire data path (i.e., load-to-use)
– I/O (read/write, get/put blocks) API
  – Need to amortize access times over larger data blocks
©Copyright 2017 Hewlett Packard Enterprise Company 39-41
How to get started with Memory-Driven Computing
• Latest updates on The Machine project and Memory-Driven Computing: www.hpe.com/themachine
• Join The Machine User Group at https://www.labs.hpe.com/the-machine/user-group
• Join community discussions on Slack: https://www.labs.hpe.com/slack #themachineusergroup
• Subscribe to “The Machine User Group” tab in the “Behind the scenes @ Labs” blog: https://community.hpe.com/t5/Behind-the-scenes-Labs/bg-p/BehindthescenesatLabs/label-name/The%20Machine%20User%20Group#.WXZGN4jyscE. Register and click “subscribe to this label”.
• Questions? Contact [email protected]
• Follow us on our Hewlett Packard Labs social handles:
  • Twitter: @HPE_Labs
  • LinkedIn: “Hewlett Packard Labs”
  • “Hewlett Packard Enterprise” YouTube page – The Machine and Hewlett Packard Labs channels
  • Instagram: HPE
  • Facebook: Hewlett Packard Enterprise
©Copyright 2017 Hewlett Packard Enterprise Company 42
Wrapping up
– Persistent memory blurs the line between memory and storage
– Gen-Z fabric enables us to shift system architecture towards Memory-Driven Computing
– Combination of technologies enables us to rethink the programming model
  – Simplify software stack
  – Operate directly on memory-format persistent data
– Many opportunities for software innovation
– How would you exploit Memory-Driven Computing?
https://www.labs.hpe.com/the-machine
https://www.labs.hpe.com/the-machine/developer-toolkit
https://www.labs.hpe.com/next-next/mdc
©Copyright 2017 Hewlett Packard Enterprise Company 43