Persistent memory: new tier or storage replacement? · 2019-12-21
TRANSCRIPT
Persistent memory: new tier or storage replacement?
Kimberly Keeton and Susan Spence
[email protected] and [email protected]
SNIA Storage Developer Conference, September 2017
Outline
– Persistent memory blurs boundaries between memory and storage
– Memory-Driven Computing (MDC): context for our work with persistent memory
– Memory-Driven Computing software
– Challenges for persistent memory as storage
– Summary
©Copyright 2017 Hewlett Packard Enterprise Company 2
Memory + storage hierarchy technologies
[Figure: latency vs. capacity for the memory/storage hierarchy]
– SRAM (caches): 1-10ns, MBs
– DDR DRAM: 50-100ns, 10-100GBs
– SSDs: 1-10µs, 1-10TBs
– Disks: ms, 10-100TBs
©Copyright 2017 Hewlett Packard Enterprise Company 3
Memory + storage hierarchy technologies
Two new entries!
[Figure: latency vs. capacity for the memory/storage hierarchy, now with two new tiers]
– SRAM (caches): 1-10ns, MBs
– DDR DRAM: 50-100ns, 10-100GBs
– On-package DRAM: 50ns, plus massive bandwidth
– NVM: 200ns-1µs, 1TBs
– SSDs: 1-10µs, 1-10TBs
– Disks: ms, 10-100TBs
©Copyright 2017 Hewlett Packard Enterprise Company 4
Non-Volatile Memory (NVM)
– Persistently stores data
– Access latencies comparable to DRAM
– Byte addressable (load/store) rather than block addressable (read/write)
– Some NVM technologies more energy efficient and denser than DRAM
[Figure: latency spectrum (ns to µs) of NVM technologies: NVDIMM-N, Resistive RAM (Memristor), Phase-Change Memory, Spin-Transfer Torque MRAM, 3D Flash]
Haris Volos, et al. "Aerie: Flexible File-System Interfaces to Storage-Class Memory," Proc. EuroSys 2014.
©Copyright 2017 Hewlett Packard Enterprise Company 5
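Byte addressability changes the access model: an update becomes a CPU store to a mapped address plus a flush, rather than a block read/write through an I/O stack. The sketch below illustrates that load/store model with POSIX mmap over an ordinary file standing in for persistent memory; on real NVM, msync would be replaced by cache-line flushes and fences (e.g., via a persistent memory library), and the function name and path here are invented for the example.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Write through a plain store, flush, unmap, then remap and read back.
// Returns true if the data survived without any read()/write() syscalls.
bool nvm_store_roundtrip(const char* path) {
    const size_t kLen = 4096;
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, kLen) != 0) return false;
    char* pm = static_cast<char*>(
        mmap(nullptr, kLen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (pm == MAP_FAILED) return false;

    std::strcpy(pm, "persistent hello");  // byte-addressable store
    msync(pm, kLen, MS_SYNC);             // flush to media (flush + fence on real NVM)
    munmap(pm, kLen);
    close(fd);

    // Remap and load directly: no deserialization, no block reads.
    fd = open(path, O_RDONLY);
    char* pm2 = static_cast<char*>(
        mmap(nullptr, kLen, PROT_READ, MAP_SHARED, fd, 0));
    bool ok = (pm2 != MAP_FAILED) && std::strcmp(pm2, "persistent hello") == 0;
    if (pm2 != MAP_FAILED) munmap(pm2, kLen);
    close(fd);
    unlink(path);
    return ok;
}
```

The same pattern underlies direct-access (DAX) file systems: the mapping gives the application a pointer into the media itself, so persistence costs a flush, not a syscall per access.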
Gen-Z: open systems interconnect standard
http://www.genzconsortium.org
– Open standard for memory-semantic interconnect
– Members: 30+ companies covering SoC, memory, I/O, networking, mechanical, system software, etc.
– Motivation
  – Emergence of low-latency storage class memory
  – Demand for large capacity, rack-scale resource pools and multi-node architectures
– Memory semantics
  – All communication as memory operations (load/store, put/get, atomics)
– High performance
  – Tens to hundreds of GB/s bandwidth
  – Sub-microsecond load-to-use memory latency
– Draft spec available for public download
©Copyright 2017 Hewlett Packard Enterprise Company 6
[Figure: Gen-Z as an open standard connecting CPUs (SoC, ASIC, neuromorphic), accelerators (FPGA, GPU), I/O, network storage, and memory in a direct-attach, switched, or fabric topology]
From today's processor-centric computing to the future architecture: Memory-Driven Computing
[Figure: today, each SoC owns its local memory; in the Memory-Driven Computing architecture, heterogeneous compute elements (SoC, GPU, ASIC, quantum, and other accelerators) share a pool of memory technologies (NVM) over a memory fabric]
©Copyright 2017 Hewlett Packard Enterprise Company 7
Core Memory-Driven Computing components
– Fast, persistent memory
– Fast memory fabric
– Task-specific processing
– New and adapted software
©Copyright 2017 Hewlett Packard Enterprise Company 8
HPE introduces the world's largest single-memory computer
The prototype contains 160 terabytes of fabric-attached memory
– 160 TB of shared memory spread across 40 physical nodes, interconnected using a high-performance fabric protocol
– An optimized Linux-based operating system running on ThunderX2, Cavium's flagship second-generation, dual-socket-capable ARMv8-A workload-optimized System on a Chip
– Photonics/optical communication links, including the new X1 photonics module, online and operational
– Software programming tools designed to take advantage of an abundance of persistent memory
©Copyright 2017 Hewlett Packard Enterprise Company 9
Putting it all together: Memory-Driven Computing
– Converging memory and storage
  – Byte-addressable NVM replaces hard drives and SSDs
– Resource disaggregation leads to high-capacity shared memory pool
  – Fabric-attached memory pool is accessible by all compute resources
  – Low-diameter networks provide near-uniform low latency
– Local volatile memory provides lower-latency, high-performance tier
– Distributed heterogeneous compute resources
– Software
  – Memory-speed persistence
  – Direct, unmediated access to all fabric-attached NVM across the memory fabric
  – Non-coherent accesses between compute nodes
[Figure: SoCs with local DRAM joined by a network; a communications and memory fabric attaches every SoC to a fabric-attached NVM memory pool]
©Copyright 2017 Hewlett Packard Enterprise Company 10
Memory-Driven Computing in context
[Figure: shared nothing — SoCs, each with local DRAM and local NVM, communicate only over a network; shared everything — physical servers (SoC, local DRAM, local NVM) joined by a coherent interconnect, with servers connected by a network]
©Copyright 2017 Hewlett Packard Enterprise Company 11
Memory-Driven Computing in context
[Figure: shared nothing (SoCs with local DRAM and NVM connected only by a network) and shared everything (physical servers joined by a coherent interconnect), plus shared something — SoCs with local DRAM attached through a communications and memory fabric to a shared fabric-attached NVM memory pool, alongside a conventional network]
©Copyright 2017 Hewlett Packard Enterprise Company 12
Questions driving work on MDC software
How do I ...
− use persistent memory?
− take full advantage of byte-addressable persistent memory?
− share data safely in a large, globally-addressable pool of memory?
− ensure data consistency in the face of failures and errors?
− scale effectively to use our large-memory, many-core systems?
©Copyright 2017 Hewlett Packard Enterprise Company 13
Memory-Driven Computing delivered with systems software
– Memory is large
  – In-memory data structures
  – Large indexes reduce storage access
– Memory is shared non-coherently over fabric
  – Communicate through memory
  – Data sharing by reference
  – Easier load-balancing
– Memory is persistent
  – No storage overheads
  – Fast snapshots
  – Fast checkpointing
©Copyright 2017 Hewlett Packard Enterprise Company 14-17
Memory-Driven Computing Developer Toolkit
[Figure: toolkit projects mapped to the three memory properties (large; shared non-coherently over fabric; persistent): Shoveller, ALPS, MDS, MPGC, Atlas, Foedus, NV-WAL, CRIU-PMEM, NVthreads, NVMM, Radix Tree]
©Copyright 2017 Hewlett Packard Enterprise Company 18-19
Libnvwal: Write-Ahead Logging Library
– Key mechanism for atomicity and durability
  – Atomicity: record all modifications in WAL, and
  – Durability: persist WAL before applying
– Persistence latency critical to transaction performance
– Non-volatile memory DIMM (NVDIMM) enables low-latency logging

TRANSFER (from, to, amount)
transaction {
  from = from - amount
  to = to + amount
}
[Figure: account balances before, the WAL records written to disk/NVDIMM, and the balances after applying TRANSFER(from, to, 20)]
©Copyright 2017 Hewlett Packard Enterprise Company 20-21
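The log-first discipline above can be sketched in a few lines. The Ledger and LogRecord names are invented for illustration, and a std::vector stands in for a durably flushed log:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal write-ahead-logging sketch for the TRANSFER transaction.
// A real WAL would flush (fsync, or an NVDIMM persist barrier) after
// appending records and before touching the data itself.
struct LogRecord { std::string field; int old_val; int new_val; };

struct Ledger {
    std::vector<LogRecord> wal;  // stand-in for the durable log
    int from = 100, to = 200;

    void transfer(int amount) {
        // Atomicity: record all modifications in the WAL first...
        wal.push_back({"from", from, from - amount});
        wal.push_back({"to",   to,   to + amount});
        // Durability: persist the WAL here (flush omitted in this sketch)...
        // ...then apply the modifications to the data.
        from -= amount;
        to   += amount;
    }
};
```

On recovery, replaying (or undoing) the logged old/new values restores a consistent state even if a crash hits between the two balance updates.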
Libnvwal: Hybrid NVDIMM/Disk Write-Ahead Logging
– NVDIMM enables low-latency persistence
  – DRAM: best performance/lowest latency for fast data access
  – NAND Flash: persistent store for the NVDIMM
– But limited capacity (8GB NVDIMM today) requires truncating log
  – Requires checkpointing database
  – Precludes archiving log for disaster recovery
– Libnvwal
  – Stores log tail in NVDIMM for low latency and de-stages log to disk for capacity
  – Relieves application from implementing complex rotation logic
    – Non-trivial interaction between NVM buffers, file system, log metadata durability, and flushing threads
  – Improves MySQL OLTP performance by up to 106% (sysbench)
©Copyright 2017 Hewlett Packard Enterprise Company 22-23
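The tail-in-NVDIMM, body-on-disk split can be illustrated with a toy de-staging loop. HybridLog and its 4-entry capacity are invented for the sketch; in-memory containers stand in for the NVDIMM buffer and the disk-resident log files:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of libnvwal-style hybrid logging: appends land in a small
// fixed-capacity tail (the "NVDIMM"), which is de-staged to a growable
// store (the "disk") whenever it fills, truncating the tail.
class HybridLog {
    static constexpr std::size_t kTailCap = 4;  // tiny stand-in for 8GB
    std::array<int, kTailCap> tail_{};
    std::size_t tail_len_ = 0;
    std::vector<int> disk_;
public:
    void append(int rec) {
        if (tail_len_ == kTailCap) destage();   // make room first
        tail_[tail_len_++] = rec;               // low-latency persist path
    }
    void destage() {                            // normally a background thread
        disk_.insert(disk_.end(), tail_.begin(), tail_.begin() + tail_len_);
        tail_len_ = 0;                          // truncate the NVDIMM tail
    }
    std::size_t on_disk() const { return disk_.size(); }
    std::size_t in_tail() const { return tail_len_; }
};
```

The hard parts that libnvwal handles, and this sketch does not, are exactly the slide's list: ordering NVM buffer flushes against file-system writes, keeping log metadata durable, and coordinating the flushing threads.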
Fewer software layers
– Traditional database system: Application → Database client → Database server (Object → Relation) → Filesystem
– Managed Data Structures: Application → MDS Runtime
©Copyright 2017 Hewlett Packard Enterprise Company 24
Managed Data Structures (MDS)
Simplify programming on persistent in-memory data
– Ease of programming
  – Programmer manages only application-level data structures: Record, Array, Map, Set, Graph, List, Queue, …
  – MDS data structures are automatically persisted in NVM
  – APIs in multiple programming languages: Java, C++
    – Programmer access through references to data
    – Direct reads and writes
– Ease of data sharing
  – Just pass a reference
  – Each program treats the data as local to the program
– High-level concurrency controls
  – Ensure consistent data in the face of data sharing by multiple threads/processes
[Figure: processes 1 and 2 sharing a managed space in NVM]
©Copyright 2017 Hewlett Packard Enterprise Company 25
“Shared something” Key-Value Store (KVS)
Leverage large pool of fabric-attached memory and fabric atomics
[Figure: N nodes (each CPU + DRAM + I/O) connected over a memory fabric; data stored in fabric-attached NVM]
©Copyright 2017 Hewlett Packard Enterprise Company 26
High-level architecture for shared something KVS
– What's different in fabric-attached memory (FAM)?
  – Memory-semantic accesses (including atomics)
  – No hardware-based cache coherence between nodes
  – Separate fault domains between FAM and nodes
– Goal: multi-node, FAM-aware KVS
– How: exploit FAM and FAM atomics for
  – Concurrency control
  – Fault tolerance
  – Load balancing
– Open sourced components
  – Memory manager: MDC toolkit NVMM
  – Indexes: MDC toolkit radix tree
[Figure: KVS stack: API over indexes (hash table, radix tree, B+-tree), over the memory manager, over the Librarian File System (LFS)]
©Copyright 2017 Hewlett Packard Enterprise Company 27
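Without inter-node cache coherence, concurrency control falls back on atomic operations against the shared memory itself. The sketch below shows the idea with a fixed-size, insert-only, open-addressed table whose slots are claimed by compare-and-swap; std::atomic stands in for FAM atomics here, and the names (FixedKvs, Slot) are invented for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Each slot is a pair of shared-memory words; key 0 marks "empty".
struct Slot {
    std::atomic<uint64_t> key{0};
    std::atomic<uint64_t> val{0};
};

class FixedKvs {                         // open-addressed, insert-only table
    static constexpr std::size_t kSlots = 64;
    Slot slots_[kSlots];
public:
    bool put(uint64_t key, uint64_t val) {   // requires key != 0
        for (std::size_t i = 0; i < kSlots; ++i) {
            Slot& s = slots_[(key + i) % kSlots];
            uint64_t expected = 0;
            // Claim an empty slot with CAS, or reuse a slot we already own.
            if (s.key.compare_exchange_strong(expected, key) || expected == key) {
                s.val.store(val);
                return true;
            }
        }
        return false;                        // table full
    }
    std::optional<uint64_t> get(uint64_t key) const {
        for (std::size_t i = 0; i < kSlots; ++i) {
            const Slot& s = slots_[(key + i) % kSlots];
            uint64_t k = s.key.load();
            if (k == key) return s.val.load();
            if (k == 0) return std::nullopt; // hit the end of the probe chain
        }
        return std::nullopt;
    }
};
```

Two nodes racing to insert the same key both reach the CAS; exactly one wins the empty slot, and the loser observes the winner's key and proceeds safely, with no locks and no coherence traffic assumed.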
Librarian File System (LFS)
[Figure: the Librarian manages fabric-attached memory as “books” (8GB allocation units) grouped into “shelves” (logical allocations); filesystem, key-value store, and application frameworks sit on top]
©Copyright 2017 Hewlett Packard Enterprise Company 28
Non-volatile Memory Manager (NVMM)
– Goals
  – Provide finer-grained memory allocation than LFS
  – Enable sharing across different processes and nodes
  – Tolerate failures of compute nodes and processes
– Manage shelves provided by LFS
  – Allocate, free, and name shelves (multiples of 8GBs)
  – Expose a global address space: {shelf ID, shelf offset}
– Memory access abstractions
  – Region APIs for direct memory-map access
  – Heap APIs for alloc/free style access
    – Heap: allow any process from any node to allocate and free globally shared NVM transparently
[Figure: NVMM pools built on LFS shelves, e.g., a key-value store using shelves 5, 10, and 19; a heap pool offers alloc/free with internal bookkeeping and indexes, a region pool offers mmap access]
©Copyright 2017 Hewlett Packard Enterprise Company 29
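A {shelf ID, shelf offset} global address is naturally packed into a single 64-bit word so it can be stored in FAM and passed between nodes. The field split below (16 bits of shelf ID, 48 bits of offset) is an assumption chosen for illustration, not NVMM's actual layout:

```cpp
#include <cassert>
#include <cstdint>

// Pack a {shelf ID, shelf offset} pair into one 64-bit global pointer.
// Field widths are illustrative: 16 bits of shelf ID, 48 bits of offset.
struct GlobalPtr {
    static constexpr int kShelfBits = 16;
    static constexpr int kOffsetBits = 64 - kShelfBits;
    static constexpr uint64_t kOffsetMask = (uint64_t{1} << kOffsetBits) - 1;

    uint64_t raw;

    static GlobalPtr pack(uint64_t shelf, uint64_t offset) {
        return {(shelf << kOffsetBits) | (offset & kOffsetMask)};
    }
    uint64_t shelf() const  { return raw >> kOffsetBits; }
    uint64_t offset() const { return raw & kOffsetMask; }
};
```

With such a pointer, “just pass a reference” means handing another process the 64-bit value; each side resolves the shelf ID to wherever it has mapped that shelf locally.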
Radix trees
– Also known as compact prefix tries
  – “Compress” common prefixes to improve space efficiency
– Maintain keys in sorted order
  – Support range (multi-key) lookups, unlike hash tables
– More amenable to FAM than B+-trees
  – Require fewer structural changes (e.g., node splits) as tree grows
  – Atomic operations used to insert or delete key and leave tree in consistent state
[Figure: radix tree over keys romane (1), romanus (2), romulus (3), rubens (4), ruber (5), rubicon (6), rubicundus (7); inserting radix (0) splits the shared “r” prefix edge]
https://en.wikipedia.org/wiki/Radix_tree
©Copyright 2017 Hewlett Packard Enterprise Company 30-32
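The prefix compression shown in the figure, where inserting “radix” splits the shared edge, can be seen in a minimal in-DRAM radix tree. This is a plain single-threaded sketch, not the FAM-aware, atomics-based version in the toolkit:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Minimal radix (compact prefix) tree mapping strings to ints.
// Each edge carries a string label; shared prefixes are stored once.
class RadixTree {
    struct Node {
        std::map<char, std::pair<std::string, std::unique_ptr<Node>>> edges;
        std::optional<int> value;
    };
    Node root_;

    static void insert(Node* n, const std::string& key, int v) {
        if (key.empty()) { n->value = v; return; }
        auto it = n->edges.find(key[0]);
        if (it == n->edges.end()) {           // no shared prefix: new leaf edge
            auto leaf = std::make_unique<Node>();
            leaf->value = v;
            n->edges.emplace(key[0], std::make_pair(key, std::move(leaf)));
            return;
        }
        std::string& label = it->second.first;
        std::unique_ptr<Node>& child = it->second.second;
        std::size_t i = 0;                    // length of common prefix
        while (i < label.size() && i < key.size() && label[i] == key[i]) ++i;
        if (i < label.size()) {               // split the edge at position i
            auto mid = std::make_unique<Node>();
            mid->edges.emplace(label[i],
                std::make_pair(label.substr(i), std::move(child)));
            label.resize(i);
            child = std::move(mid);
        }
        insert(child.get(), key.substr(i), v);
    }

    static std::optional<int> find(const Node* n, const std::string& key) {
        if (key.empty()) return n->value;
        auto it = n->edges.find(key[0]);
        if (it == n->edges.end()) return std::nullopt;
        const std::string& label = it->second.first;
        if (key.compare(0, label.size(), label) != 0) return std::nullopt;
        return find(it->second.second.get(), key.substr(label.size()));
    }

public:
    void insert(const std::string& key, int v) { insert(&root_, key, v); }
    std::optional<int> find(const std::string& key) const {
        return find(&root_, key);
    }
};
```

The FAM-friendly property is visible in the split step: growing the tree touches one edge and one new node, which is why insertions can be published with a single atomic pointer update rather than the multi-node rebalancing a B+-tree needs.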
Memory-Driven Computing Developer Toolkit
Software already available to you
– Example applications
– Programming and analytics tools
– Operating system support
– Emulation/simulation tools
Get access to the toolkit: https://www.labs.hpe.com/the-machine/developer-toolkit
[Figure: toolkit software stack on Machine (prototype) hardware, with open source components throughout: Linux for Memory-Driven Computing and the node operating system, with the Librarian, Librarian File System (LFS), management, Persistent Memory Library (pmem.io), and fabric-attached memory atomics library; data management and programming frameworks (managed data structures, Sparkle, fault-tolerant programming, fast optimistic engine, persistent memory toolkit); example applications (image search, large-scale graph inference); emulation/simulation tools (performance emulation for NVM, fabric-attached memory emulation, x86 emulation on Superdome X, MC990x, ProLiant)]
©Copyright 2017 Hewlett Packard Enterprise Company 33
Persistent memory as storage?
– If persistent memory is the new storage… it must safely remember persistent data.
– Persistent data should be stored:
  – Reliably, in the face of failures
  – Securely, in the face of exploits
  – In a cost-effective manner
  – Using a data access API that's most natural for the device characteristics
©Copyright 2017 Hewlett Packard Enterprise Company 34
Storing data reliably, securely and cost-effectively: the problem
– Potential concerns about using persistent memory to safely store persistent data:
  – NVM failures may result in loss of persistent data
  – Persistent data may be stolen
– Time to revisit traditional storage services
  – Ex: replication, erasure codes, encryption, compression, deduplication, wear leveling, snapshots
– New challenges:
  – Need to operate at memory speeds, not storage speeds
  – Traditional solutions (e.g., encryption, compression) complicate direct access
  – Space-efficient redundancy for NVM
©Copyright 2017 Hewlett Packard Enterprise Company 35
Storing data reliably, securely and cost-effectively: potential solutions
– Software implementations can trade performance for reliability, security and cost-effectiveness
  – But will diminish benefits from faster technologies
– Memory-side hardware acceleration
  – Memory speeds may demand acceleration (e.g., DMA-style data movement, memset, encryption, compression)
  – What functions are ripe for memory-side acceleration?
– Wear leveling for fabric-attached non-volatile memory
  – Repeated NVM writes may exacerbate device wear issues
  – What's the right balance between hardware-assisted wear leveling and software techniques?
– Fabric-attached non-volatile memory diagnostics
  – What is the equivalent of Self-Monitoring, Analysis and Reporting Technology (SMART) for NVM?
©Copyright 2017 Hewlett Packard Enterprise Company 36
Memory + storage hierarchy technologies
[Figure: the latency vs. capacity hierarchy from slide 4 (SRAM, DDR DRAM, on-package DRAM, NVM, SSDs, disks)]
Load/store memory won't be the most cost-effective “forever” storage for cold data
©Copyright 2017 Hewlett Packard Enterprise Company 37
Memory + storage hierarchy – a usage view (ownership)
[Figure: the same latency/capacity hierarchy annotated by ownership (node local vs. shared) and data lifetime: scratch/ephemeral (seconds), persistent to failures (hours to days), durable (weeks to months), archive (years)]
How to manage multi-tiered hierarchy, to ensure data is in the “right” tier?
©Copyright 2017 Hewlett Packard Enterprise Company 38
The appropriate API for the technology
[Figure: the latency/capacity hierarchy split into two API regimes]
– Memory (load/store cachelines, atomics) API
  – Note: needs to include entire data path (i.e., load-to-use)
– I/O (read/write, get/put blocks) API
  – Need to amortize access times over larger data blocks
©Copyright 2017 Hewlett Packard Enterprise Company 39-41
How to get started with Memory-Driven Computing
• Latest updates on The Machine project and Memory-Driven Computing: www.hpe.com/themachine
• Join The Machine User Group at https://www.labs.hpe.com/the-machine/user-group
• Join community discussions on Slack: https://www.labs.hpe.com/slack #themachineusergroup
• Subscribe to “The Machine User Group” tab in the “Behind the scenes @ Labs” blog: https://community.hpe.com/t5/Behind-the-scenes-Labs/bg-p/BehindthescenesatLabs/label-name/The%20Machine%20User%20Group#.WXZGN4jyscE. Register and click “subscribe to this label”.
• Questions? Contact [email protected]
• Follow us on our Hewlett Packard Labs social handles:
  • Twitter: @HPE_Labs
  • LinkedIn: “Hewlett Packard Labs”
  • “Hewlett Packard Enterprise” YouTube page – The Machine and Hewlett Packard Labs channels
  • Instagram: HPE
  • Facebook: Hewlett Packard Enterprise
©Copyright 2017 Hewlett Packard Enterprise Company 42
Wrapping up
– Persistent memory blurs the line between memory and storage
– Gen-Z fabric enables us to shift system architecture towards Memory-Driven Computing
– Combination of technologies enables us to rethink the programming model
  – Simplify software stack
  – Operate directly on memory-format persistent data
– Many opportunities for software innovation
– How would you exploit Memory-Driven Computing?
https://www.labs.hpe.com/the-machine
https://www.labs.hpe.com/the-machine/developer-toolkit
https://www.labs.hpe.com/next-next/mdc
©Copyright 2017 Hewlett Packard Enterprise Company 43