Copyright © Fusion-io, Inc. All rights reserved.
FLASH IN THE DATACENTER Nisha Talagala
NON VOLATILE MEMORY
Flash ▸ 100s GB to 10 TB per
PCIe device ▸ Media trend – increase
in density, reduction of write cycles, SLC/MLC/3BPC
▸ 100s of thousands to millions of IOPS, GB/s of bandwidth
PCM/MRAM/STT/Other NVMs ▸ Still in research ▸ Potential of extreme performance
increase
2 PDSW 8 – Supercomputing 13
HOW TO EFFECTIVELY USE FLASH?
Performance ▸ Closer to CPU – highest bandwidth, lowest latency ▸ Server (compute) side flash complements storage side flash
Hierarchy of DRAM, flash, disk Disk displacement usages
▸ Caches – server and storage side ▸ Scale out and cluster file systems
• flash in metadata server • storage server
▸ Staging, checkpoint
DRAM displacement usages ▸ Improved paging, semi-external memory
3 PDSW 8 – Supercomputing 13
NVM IN THE DATA CENTER TODAY
4
Web front Ends
Caching tiers
Database
and application tiers
Storage Flash
Flash Flash Flash DRAM and Flash
Flash Flash Flash DRAM, flash and disk
DRAM, flash and disk Flash
PDSW 8 – Supercomputing 13
5
Non-Volatile Memory
Non-Volatile Storage
Volatile Memory
Volatile-Storage
Flash – Storage or Memory? P
erfo
rman
ce
Persistence
DRAM
Disk, Tape
Flash and Other NVMs
Flash and Other NVMs
MEMORY STORAGE CONVERGENCE
PDSW 8 – Supercomputing 13
EVOLUTION OF ENTERPRISE FLASH
FLASH AS DISK
Application
Application source code converts native data structures into block I/O
Conventional I/O Access
Block I/O
NVM Devices/Media
F L A S H B E Y O N D D I S K
Application
Application source code does I/O with native data structures
Enhanced I/O
Atomic I/O Transaction
Key-Value Transaction
Native primitives
NVM Devices/Media
F L A S H A S M E M O R Y
Application
Application source code manipulates native memory data
structures
Memory Access
Extended Memory Persistent Memory
NVM Devices/Media
PDSW 8 – Supercomputing 13
EVOLUTION OF ENTERPRISE FLASH
FLASH AS DISK
Application
Application source code converts native data structures into block I/O
Conventional I/O Access
Block I/O
NVM Devices/Media
F L A S H B E Y O N D D I S K
Application
Application source code does I/O with native data structures
Enhanced I/O
Atomic I/O Transaction
Key-Value Transaction
Native primitives
NVM Devices/Media
F L A S H A S M E M O R Y
Application
Application source code manipulates native memory data
structures
Memory Access
Extended Memory Persistent Memory
NVM Devices/Media
PDSW 8 – Supercomputing 13
SC11 GENERAL PARALLEL FILE SYSTEM (GPFS) DEMO
8
▸ 24 uncompressed 1080p videos (up to 6 GB/s of data) ▸ Five Fusion Powered GPFS-based NSD servers ▸ Three visualization workstations
PDSW 8 – Supercomputing 13
DEMO SOFTWARE SPECS
9
▸ RedHat Enterprise Linux 6.1 - InfiniBand Software Stack ▸ IBM GPFS (General Parallel File System) 3.4.0.8 ▸ NVIDIA Linux Driver ▸ Fusion-io VSL 3
PDSW 8 – Supercomputing 13
DEMO HARDWARE SPECS
10
▸ Five 0.5U NSD Servers, each with six core dual socket CPUs,12 GB RAM, InfiniBand HCA, and an ioDrive2
▸ 3 Visualization workstations, with an InfiniBand HCA and an NVIDIA graphics card
▸ 36-port QDR InfiniBand switch
PDSW 8 – Supercomputing 13
November 16, 2013 Fusion-io Confidential 11
+ Software
Your Server Becomes a Shared Flash Storage Appliance
ION DATA ACCELERATOR – HIGH AVAILABILITY
12
LUN 0 LUN 0
LUN 1 LUN 1
LUN 0 LUN 1
40Gb
PDSW 8 – Supercomputing 13
ION – FREEDOM OF CHOICE
SOFTWARE
Leverage your buying power
FULLY INTEGRATED SOLUTION
▸ Leverage your buying power ▸ Integrate ▸ Support
• ION Software (via Fusion-io) • Server (via server OEM) • ioDrive (via your supplier)
▸ No hassles, partner integrated ▸ Support
• End-to-End Fusion-io Support
13 PDSW 8 – Supercomputing 13
EVOLUTION OF ENTERPRISE FLASH
FLASH AS DISK
Application
Application source code converts native data structures into block I/O
Conventional I/O Access
Block I/O
NVM Devices/Media
F L A S H B E Y O N D D I S K
Application
Application source code does I/O with native data structures
Enhanced I/O
Atomic I/O Transaction
Key-Value Transaction
Native primitives
NVM Devices/Media
F L A S H A S M E M O R Y
Application
Application source code manipulates native memory data
structures
Memory Access
Extended Memory Persistent Memory
NVM Devices/Media
PDSW 8 – Supercomputing 13
15
Area Hard Disk Drives Flash Devices
Logical to Physical Blocks
Nearly 1:1 Mapping Remapped at every write
Read/Write Performance
Largely symmetrical Heavily asymmetrical. Additional operation (erase)
Sequential vs Random Performance
100x difference. Elevator scheduling for disk arm
<10x difference. No disk arm – NAND die
Background operations Rarely impact foreground Regular occurrence. If unmanaged - can impact foreground
Wear out Largely unlimited writes Limited writes
IOPS 100s to 1000s 100Ks to Millions
Latency 10s ms 10s-100s us
NVM (FLASH, OTHER) IS DIFFERENT FROM DISK
PDSW 8 – Supercomputing 13
CONVENTIONAL I/O ACCESS
16
APPLICATION
Application source code
Simple Block
Network File
Simple Block
Proprietary Storage OS
Non Volatile Memory Media
Native Flash Translation Layer
Storage Media
Conventional I/O access
PDSW 8 – Supercomputing 13
MULTI-QUEUE I/O IN LINUX*
• Extending Linux block I/O to support NVM performance • Multi-queue
• Software queues, Hardware queues • Per CPU issue/completion, multi-socket scaling • Matches inherent parallelism in NVM devices and CPUs • Supports upcoming queue oriented standards models
• Performance • 3.5x – 10x increase in IOPS (from ~1M to 3.5-10M) • 10x – 38x reduction in I/O stack latency
*Linux Block I/O: Introducing Multiqueue SSD Access on Multicore Systems Bjorling M., Axboe J., Nellans D., Bonnett P. SYSTOR 2013 University of Copenhagen and Fusion-io
PDSW 8 – Supercomputing 13
DIRECT-ACCESS I/O THROUGH NATIVE INTERFACES
18
APPLICATION
Application source code
Transactional Block
Native File
Key-Value Object
Simple Block
Network File
Simple Block
Proprietary Storage OS
Non Volatile Memory Media
Native Flash Translation Layer
Storage Media
Conventional I/O access
Direct access I/O
PDSW 8 – Supercomputing 13
FLASH PRIMITIVES: SAMPLE USES AND BENEFITS
19
Databases Transactional Atomicity: Replace various workarounds implemented in database code to provide write atomicity (MySQL double-buffered writes, etc.)
Filesystems File Update Atomicity: Replace various workarounds implemented in filesystem code to provide file/directory update atomicity (journaling, etc.)
▸ 98% performance of raw writes Smarter media now natively understands atomic updates, with no additional metadata overhead.
▸ 2x longer flash media life Atomic Writes can increase the life of flash media up to 2x due to reduction in write-ahead-logging and double-write buffering.
▸ 50% less code in key modules Atomic operations dramatically reduce application logic, such as journaling, built as work-arounds.
PDSW 8 – Supercomputing 13
ATOMIC WRITES – MYSQL EXAMPLE
20
Traditional MySQL Writes MySQL with Atomic Writes
Page C Page
B
Page A
Buffer
DRAM Buffer
SSD (or HDD) Database
Database Server
Page C
Page B
Page A
Page C
Page B
Page A
Page C
Page B
Page A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to double-write buffer on the media.
3
Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.
4
ioMemory Database
Page C
Page B
Page A
DRAM Buffer
Page C
Page B
Page A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the (intelligent) device.
3
Database Server
Page C Page
B
Page A
PDSW 8 – Supercomputing 13
2-4x Latency Improvement on Percona Server
MYSQL EXAMPLE: LATENCY IMPROVEMENT
0
20
40
60
80
100
120
140
160
180
200
1 10
7 21
3 31
9 42
5 53
1 63
7 74
3 84
9 95
5 10
61
1167
12
73
1379
14
85
1591
16
97
1803
19
09
2015
21
21
2227
23
33
2439
25
45
2651
27
57
2863
29
69
3075
31
81
3287
33
93
3499
Mill
isec
onds
Seconds
Sysbench 99% Latency OLTP workload
XFS DoubleWrite DirectFS Atomic Atomic Writes
PDSW 8 – Supercomputing 13
70% Transactions/sec Improvement on MariaDB Server
MYSQL EXAMPLE: THROUGHPUT IMPROVEMENT
0
2000
4000
6000
8000
10000
12000
14000
16000
Tim
e 18
0 36
0 54
0 72
0 90
0 10
80
1260
14
40
1620
18
00
1980
21
60
2340
25
20
2700
28
80
3060
32
40
3420
36
00
3780
39
60
4140
43
20
4500
46
80
4860
50
40
5220
54
00
5580
57
60
5940
61
20
6300
64
80
6660
68
40
7020
New
Ord
erTX
N
10se
c
Seconds
XtraDB 5.5.30 - Atomics TPC-C - 2500 warehouses
230GB data - 50GB buffer pool
DirectFS/Atomic Ext4 No-DoubleWrite Ext4 DoubleWrite
Atomic Writes
PDSW 8 – Supercomputing 13
KEY-VALUE INTERFACE: SAMPLE USES AND BENEFITS
23
NoSQL Applications Increase performance by eliminating packing and unpacking blocks, defragmentation, and duplicate metadata at app layer. Reduce application I/O through batched operations. Reduce overprovisioning due to lack of coordination between two-layers of garbage collection (application-layer and flash-layer). Some top NoSQL applications recommend over-provisioning by 3x due to this.
▸ Near performance of raw device Smarter media now natively understands a key-value I/O interface with lock-free updates, crash recovery, and no additional metadata overhead.
▸ 3x throughput on same SSD Early benchmarks comparing against synchronous levelDB show over 3x improvement.
▸ Up to 3x capacity increase Dramatically reduces over-provisioning through coordinated garbage collection and automated key expiry.
PDSW 8 – Supercomputing 13
KEY-VALUE INTERFACE - PERFORMANCE
24
Key-Value get/put vs. Raw read/write vs. levelDB read/write
0
20000
40000
60000
80000
100000
120000
140000
160000
1 2 4 8 16
Ops
/s
Threads
GET/READ Performance
Leveldb-sync
NVMKV
Raw device
0
50000
100000
150000
200000
250000
300000
350000
400000
450000
1 2 4 8 16
Ops
/s
Threads
PUT/WRITE Performance
Leveldb-sync
NVMKV
FIO
PDSW 8 – Supercomputing 13
MEMORY-ACCESS THROUGH NATIVE INTERFACES
25
APPLICATION
Application source code
Extended (Volatile) Memory
Persistent Memory Transactional
Block Native File Key-Value Object
Simple Block
Network File
Simple Block
Proprietary Storage OS
Non Volatile Memory Media
Native Flash Translation Layer
Storage Media
Conventional I/O access
Memory access
Direct access I/O
PDSW 8 – Supercomputing 13
GRAPH500* AND DI-MMAP**
• Traversing massive graphs • "Using 2.56TB of Fusion-io NAND flash to access data using memory
semantics, LLNL's new Graph500 algorithm can process graphs 8x larger than before with only a 50% performance degradation compared to an all DRAM system.”
• Results: 55.6 MTEPS (Million Traversed Edges Per Second) 4 x 640GB Fusion-io MLC
• DI-MMAP: Accelerated mmap for highly concurrent apps • 3-5x improvement in mmap performance
* Graph500: Traversing massive graphs with NAND flash; Pearce, Gokhale, & Amato (LLNL) **DI-MMAP: A High Performance Memory Map Runtime for Data Intensive Applications; Van Essen, Hsieh, Ames, Gokhale (LLNL)
PDSW 8 – Supercomputing 13
IMPROVING LINUX SWAP (DEMAND-PAGING)
27
Originally designed as a last resort to prevent OOM (out-of-memory) failures • Never tuned for high-performance demand-paging • Never tuned for multi-threaded apps • Poor performance
Tuned for flash (leverages native characteristics) ▸ O(1) algorithm for swap_out – reduce algorithm time and leverage fast random I/O ▸ Per CPU reclaim – greater throughput for multi-threaded environments ▸ Intelligent read-ahead on swap-in – cut legacy, disk-era cruft for rotational latency
Disks
System Memory
Default Swap
ioMemory/Flash
System Memory
Optimized Swap
PDSW 8 – Supercomputing 13
FAST SWAP - PERFORMANCE
0
500000
1000000
1500000
2000000
2500000
0 100 200 300 400 500 600 700 800
Mem
ory
Ops
/s
Time
Default OS-Swap
Improved OS-Swap
~2x improvement in page-out rate
~3.5x improvement in page-in and out rate
~3x reduction in load completion time
3x reduction in load completion time with fast swap
PDSW 8 – Supercomputing 13
COMPARING I/O AND MEMORY ACCESS SEMANTICS
November 18, 2013 29
I/O I/O semantics examples:
• Open file descriptor – open(), read(), write(), seek(), close() • (New) Write multiple data blocks atomically, nvm_vectored_write() • (New) Open key-value store – nvm_kv_open(), kv_put(), kv_get(), kv_batch_*()
Memory Access (Volatile)
Volatile memory semantics example: • Allocate virtual memory, e.g. malloc() • memcpy/pointer dereference writes (or reads) to memory address • (Improved) Page-faulting transparently loads data from NVM into memory
Memory Access
(Non-Volatile)
Non-volatile memory semantics example: • (New) Allocate and manage persistent memory
PDSW 8 – Supercomputing 13
30
https://opennvm.github.io
http://www.opencompute.org/projects/storage/
PDSW 8 – Supercomputing 13
1ST CONTRIBUTION: FLASH PRIMITIVES
31 https://opennvm.github.io
On GitHub: • API specifications, such as:
• nvm_atomic_write() • nvm_batch_atomic_operations() • nvm_atomic_trim()
• Sample program code
PDSW 8 – Supercomputing 13
2ND CONTRIBUTION: LINUX FAST-SWAP
32 https://opennvm.github.io
On GitHub: • Documentation
• Experimental Linux kernel with
virtual memory swap patch (3.6 kernel)
• Benchmarking utility
PDSW 8 – Supercomputing 13
3RD CONTRIBUTION: KEY-VALUE INTERFACE
33 https://opennvm.github.io
On GitHub: • API specifications, such as:
nvm_kv_put() • nvm_kv_get() • nvm_kev_batch_put() • nvm_kv_set_global_expiry()
• Sample program code
• Benchmarking utility
• Source code for flash optimized key value store
PDSW 8 – Supercomputing 13
APPS USING OPENNVM TECHNOLOGY
November 18, 2013 34 https://opennvm.github.io
PDSW 8 – Supercomputing 13
OPENNVM, STANDARDS, AND CONSORTIUMS
November 16, 2013 35
▸ opennvm.github.io
• Primitives API specifications, sample code
• Linux swap kernel patch and benchmarking tools
• key-value interface API library and code, sample usage code, benchmark tools
▸ INCITS SCSI (T10) active standards proposals:
• SBC-4 SPC-5 Atomic-Write http://www.t10.org/cgi-bin/ac.pl?t=d&f=11-229r6.pdf
• SBC-4 SPC-5 Scattered writes, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r3.pdf
• SBC-4 SPC-5 Gathered reads, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r3.pdf
▸ SNIA NVM-Programming TWG draft guide: http://snia.org/forums/sssi/nvmp
JOIN US AT OPENNVM.GITHUB.IO
36 PDSW 8 – Supercomputing 13
T H A N K Y O U