1
Architecting Flash for Application Acceleration and Memory Efficiency
Nisha Talagala, Fellow, SanDisk
August 5, 2014
2
Forward-Looking Statements
During our meeting today we may make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry trends, future memory technology and product capabilities and performance. Information in this presentation may also include or be based upon information from third parties, which reflects their expectations and projections as of the date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.
3
Non-Volatile Memory (NVM)
Flash (NAND)
– 100s of GB to 10 TB per device
– Trends: higher capacity, lower cost/GB, fewer write cycles, SLC -> MLC -> 3BPC
– 100K to millions of IOPS, GB/s of bandwidth
PCM/MRAM/STT/Other NVMs
– Still in research
4
Why Flash? Perfect match for Databases
Low latency – good for low queue I/O
Handles mixed sequential and random I/O at various block sizes much better than disk
▸ Capacity: 4TB vs. 3TB
▸ IOPS: 15 vs. 200,000
▸ Cost/IOP: $$$$ vs. ¢¢¢¢
5
Evolution of Flash Usage
FLASH + DISK -> FLASH AS DISK -> FLASH BEYOND DISK -> FLASH AS MEMORY
Increased flash awareness
Better Transactions/$/Watt
6
Flash Beyond Disk: Not Just Fast Disk
Area | Hard Disk Drives | Flash Devices
Read/Write Performance | Largely symmetrical | Heavily asymmetrical. Additional operation (erase)
Sequential vs Random Performance | 100x difference. Elevator scheduling for disk arm | <10x difference. No disk arm – NAND die
Mapping and Background ops | Rare | Regular occurrence – like a log structured file system
Wear out | Largely unlimited | Limited writes
IOPS | 100s to 1,000s | 100Ks to Millions
Latency | 10s of ms | 10s-100s of us
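The latency and IOPS rows are linked: with a fixed number of requests in flight, sustained IOPS is roughly the queue depth divided by the per-request latency (Little's law). A minimal sketch, assuming illustrative latencies of 10 ms for disk and 100 us for flash; the function name and the queue depths are hypothetical and not from the slide:

```python
def iops(latency_seconds: float, queue_depth: int = 1) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    if latency_seconds <= 0 or queue_depth < 1:
        raise ValueError("latency must be positive and queue depth at least 1")
    return queue_depth / latency_seconds


# Assumed example latencies, picked from the ranges in the table above.
disk_qd1 = iops(10e-3)          # one outstanding request, 10 ms each
flash_qd1 = iops(100e-6)        # one outstanding request, 100 us each
flash_qd32 = iops(100e-6, 32)   # 32 outstanding requests, 100 us each

print(f"disk  QD1 : {disk_qd1:,.0f} IOPS")
print(f"flash QD1 : {flash_qd1:,.0f} IOPS")
print(f"flash QD32: {flash_qd32:,.0f} IOPS")
```

Under these assumptions a single outstanding request already gives flash about two orders of magnitude more IOPS than disk, which is why low latency matters for low-queue-depth database I/O.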
7
Flash-Awareness: Opportunities in the I/O Stack
Flash devices – I/O and Primitives (Atomics, PTRIM, etc.) – Memory, Persistent Memory
File System (XFS, Ext4, Btrfs, NVMFS)
Apps
APIs, Libraries
Transparent Optimization
Flash-Aware Optimization
8
Examples covered in this talk
– MySQL
– MongoDB
– LevelDB
Gigaspaces with ZetaScale™ Software – See OPEN Session 301-A: Flash Changes the Game in Application Performance and TCO (Enterprise Applications Track)
9
Example: MySQL
10
Flash as Disk: Tune for a Really Fast Disk
Has been going on for several years with great results
Targeted data placement, NOOP scheduler, seek-less optimizations, concurrency improvements, etc.
Make the block I/O stack faster
Find the fastest file-system
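One of the tunings listed above, the NOOP scheduler, can be checked on Linux through sysfs, where the active scheduler is shown in square brackets. A minimal sketch, assuming a Linux host; the helper names and the device name `sda` are hypothetical:

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the bracketed entry, e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some kernels print a single entry without brackets.
    return sysfs_text.strip()


def read_scheduler(device: str) -> str:
    """Read the active I/O scheduler for a block device such as 'sda'."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


print(active_scheduler("noop deadline [cfq]"))   # parsing only, no sysfs access
```

Calling `read_scheduler("sda")` on a real host would report which scheduler is in effect before and after tuning.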
11
Multi-Instance: Using all the IOPS
[Chart: Throughput (NOT/10sec) by number of instances. Fusion-io, 48 threads, 2400W – 64GB BP]
Instances | Throughput (NOT/10sec)
1 | 4810
2 | 8788
4 | 11952
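The chart values for this slide (4810, 8788, and 11952 NOT/10sec at 1, 2, and 4 instances) can be turned into speedup and per-instance efficiency figures. A small worked sketch; the function name is hypothetical and the metric definitions are the usual ones, not something stated on the slide:

```python
# Instances -> throughput in NOT/10sec, as read from the chart.
throughput = {1: 4810, 2: 8788, 4: 11952}


def scaling(results: dict) -> dict:
    """Map instance count -> (speedup vs. one instance, efficiency per instance)."""
    base = results[1]
    return {n: (t / base, t / (base * n)) for n, t in sorted(results.items())}


for instances, (speedup, efficiency) in scaling(throughput).items():
    print(f"{instances} instance(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

By this measure two instances give roughly 1.8x and four instances roughly 2.5x the single-instance throughput, so extra instances keep extracting IOPS from the device, with diminishing returns.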
12
Atomic Writes
Traditional MySQL Writes
[Diagram: Database Server with DRAM Buffer; SSD (or HDD) holding the double-write Buffer and the Database; pages A, B, and C]
1. Application initiates updates to pages A, B, and C.
2. MySQL copies updated pages to memory buffer.
3. MySQL writes to double-write buffer on the media.
4. Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.
MySQL with Atomic Writes
[Diagram: Database Server with DRAM Buffer; ioMemory holding the Database; pages A, B, and C]
1. Application initiates updates to pages A, B, and C.
2. MySQL copies updated pages to memory buffer.
3. MySQL writes to actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the (intelligent) device.
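The difference between the two write paths above can be counted in bytes sent to the device. A minimal sketch, assuming InnoDB's default 16 KB page size and ignoring redo-log writes and any write amplification inside the device; the function name is hypothetical:

```python
PAGE_SIZE = 16 * 1024  # InnoDB's default page size, in bytes


def bytes_written(num_pages: int, atomic_writes: bool) -> int:
    """Bytes sent to the device to flush num_pages dirty pages."""
    # Traditional path: every page goes to the double-write buffer first.
    doublewrite = 0 if atomic_writes else num_pages * PAGE_SIZE
    # Both paths: every page is written to the actual tablespace.
    tablespace = num_pages * PAGE_SIZE
    return doublewrite + tablespace


traditional = bytes_written(3, atomic_writes=False)   # pages A, B, and C
atomic = bytes_written(3, atomic_writes=True)
print(traditional, atomic, traditional / atomic)
```

In this simplified model the traditional path writes each page twice, so dropping the double-write step halves the data written for page flushes.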
13
Performance with Atomic Writes
[Chart: Without Atomic Writes (Double-write disabled – Non-ACID; Double-write – ACID) vs. With Atomic I/O (Atomic writes – ACID)]
Twice the performance AND with ACID properties
• Atomic writes at 99% of the performance of raw writes
• 2x flash device endurance improvement
14
Atomic Writes: Latency Improvement
2-4x Latency Improvement on Percona Server
[Chart: Sysbench 99% Latency, OLTP workload. Y-axis: Milliseconds (0-200); X-axis: Seconds. Series: XFS DoubleWrite, DirectFS Atomic]
15
NVM-Compression
Uses natural thin-provisioning of the underlying flash
No bit packing at MySQL, no rebalancing
Holes in data files are TRIMed/unmapped
Multi-threaded flush and atomics to reduce latency
Pluggable compression algorithms
[Diagram: software stack, top to bottom: MySQL, NVMFS, ioMemory-VSL]
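The idea on this slide, compressing each page in place and leaving the unused tail of its slot as a hole, can be sketched as follows. This is an illustration and not the NVM-Compression implementation: it assumes a 16 KB page, a 512-byte sector, and zlib as the pluggable algorithm, and the function name is hypothetical.

```python
import zlib

PAGE_SIZE = 16 * 1024   # InnoDB's default page size, in bytes
SECTOR_SIZE = 512       # assumed device sector size


def compress_in_place(page: bytes):
    """Return (payload, hole_bytes) for one page kept at its original offset.

    hole_bytes is the sector-aligned tail of the page slot that is left
    unwritten and could be unmapped (TRIMed) on the device.
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one full page")
    compressed = zlib.compress(page)
    if len(compressed) >= PAGE_SIZE:
        return page, 0  # incompressible: store the page as is, no hole
    # Round the compressed size up to a whole number of sectors.
    used = -(-len(compressed) // SECTOR_SIZE) * SECTOR_SIZE
    return compressed, PAGE_SIZE - used


payload, hole = compress_in_place(b"abcd" * 4096)
print(len(payload), hole)
```

Because every page keeps its logical offset, nothing has to be packed or rebalanced at the database level; the space saving comes from the holes that the thin-provisioned device never has to store.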
16
Performance Improvement (LinkBench)
[Chart: Transaction Rate and Compression Performance Penalty (0%-100%) for Uncompressed, Row Compression, and NVM-Compression. Data labels on the slide: 100%, 20%, 90%]
17
Performance Improvement: (TPCC-like)
TPC-C like workload, MariaDB 10
1,000 warehouses - 75GB DRAM
[Chart: New Order TX (0-30000) over Time. Series: MySQL uncompressed, MySQL compression, Fusion-io Compression]
18
Capacity Efficiency and Endurance
[Chart: % improvement vs. Uncompressed*. Row-comp: 49.0%; Page-comp: 58.5%]
*For LinkBench with lz77. Comparable results with lz4.
• Architecturally more storage efficient than row compression
• When combined with Atomic Writes, can improve flash endurance ~4x
19
Flash Aware API Integration for Database
20
Example: MongoDB
21
Reducing DRAM with Flash
[Chart: MongoDB Throughput (operations/second), 0-8000. Bars: 24 GB Node; 12 GB Node (2x DRAM Reduction, -26%); 8 GB Node (3x DRAM Reduction, -33%)]
22
Improving Performance Predictability while reducing DRAM
Improved Predictability with Less DRAM
[Chart: axis 0-9000. Series: 24G (24GB Baseline), 12G (2x DRAM Reduction), 8G (3x DRAM Reduction)]
23
Improving Performance Predictability while reducing DRAM
[Chart: Throughput Deviation (operations/second), 0-400. Bars: 24 GB Node; 12 GB Node (2x DRAM Reduction); 8 GB Node (3x DRAM Reduction). Callout: 3x Improvement]
24
Improving Performance Predictability with Less DRAM
[Chart: Latency Deviation (microseconds), 0-12. Groups: Update Latency Deviation, Read Latency Deviation. Series: 2x DRAM Reduction, 3x DRAM Reduction]
Tighter consistency even with 3x less DRAM
25
Example: LevelDB
26
LevelDB, NVMKV, and Raw Block Device
Fusion-io flash aware stack can bring full performance of flash to KV stores
27
Improving endurance and lifetime with Flash Awareness
[Chart legend]
RW: Random Async Write
RW-S: Random Sync Write
SW: Sequential Async Write
SW-S: Sequential Sync Write
Improving endurance by 2x-40x with flash aware key value storage
28
Performance Predictability
29
Transaction throughput (YCSB)
Throughput improvements of 50%-300% while using less DRAM
30
Availability
Atomic Writes
– MariaDB mainline >= 5.5.31
– Percona Server >= 5.5.31
– Oracle MySQL >= 5.7.4
NVM Compression
– MariaDB – 10.0.9
– Percona Server – 5.6
– Oracle MySQL – labs release (http://labs.mysql.com/)
NVMFS – available early access
MongoDB – works with existing MongoDB releases
NVMKV – available at opennvm.github.io
31
https://opennvm.github.io
http://www.opencompute.org/projects/storage
32
Thank You
33
Appendix
34
File System Support
POSIX Interface | Functionality
fallocate(offset, len) | Preallocation and extending files/table space
fallocate(PUNCH_HOLE) | Unmap/Punch hole operation (issues Persistent TRIMs)
io_submit() (AIO) | Transparent Atomic writes
NVM-Compression uses POSIX compliant interfaces
NVM-Compression speed comes from the NVMFS file system
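The two fallocate uses listed on this slide can be exercised from user space on Linux. A minimal sketch, assuming a 64-bit Linux host and a file system that supports hole punching; the helper names are hypothetical, and the raw `fallocate` call goes through ctypes because the Python standard library only exposes `os.posix_fallocate`:

```python
import ctypes
import os
import tempfile

# Mode flags from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def preallocate(fd: int, offset: int, length: int) -> None:
    """Reserve space for [offset, offset + length), extending the file if needed."""
    os.posix_fallocate(fd, offset, length)


def punch_hole(fd: int, offset: int, length: int) -> bool:
    """Ask the file system to unmap [offset, offset + length); file size is kept.

    Returns False when libc or the file system does not support the request.
    """
    try:
        fallocate = ctypes.CDLL(None, use_errno=True).fallocate
    except (OSError, AttributeError, TypeError):
        return False
    # Assumes a 64-bit Linux build where off_t is 64 bits wide.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


def demo():
    """Preallocate 1 MiB, punch a 64 KiB hole, and report (size, punched)."""
    with tempfile.TemporaryFile() as f:
        preallocate(f.fileno(), 0, 1 << 20)
        punched = punch_hole(f.fileno(), 0, 64 * 1024)
        return os.fstat(f.fileno()).st_size, punched


print(demo())
```

The file size stays at 1 MiB whether or not the hole is punched; on a flash-aware stack the punched range is what would be passed down as a TRIM.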
35
NVMFS: Non-Volatile Memory File System
A POSIX compliant filesystem, designed by Fusion-io
Strengths
– Efficient pre-allocation of large files
– No fragmentation/degradation with aging of filesystem
– Near raw flash speeds for I/O, atomic writes and TRIM
36
©2014 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. ZetaScale is a trademark of SanDisk Enterprise IP LLC. Fusion-IO, ioDrive and ioDrive Duo are trademarks of Fusion-IO, Inc. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their holder(s).