architecting flash for application acceleration and memory

1 c

Architecting Flash for Application Acceleration and Memory Efficiency

Fellow, SanDisk August 5, 2014

Nisha Talagala

2

Forward-Looking Statements

During our meeting today we may make forward-looking statements.

Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry trends, future memory technology and product capabilities and performance. Information in this presentation may also include or be based upon information from third parties, which reflects their expectations and projections as of the date of issuance.

Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.

We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or as of the date of issuance by a third party, as the case may be.

3

Non Volatile Memory (NVM) Flash (NAND)

– 100s GB to 10 TB per device

– Trends - Higher capacity, lower cost/GB, lower write cycles, SLC->MLC->3BPC

– 100K to millions of IOPS, GB/s of bandwidth

PCM/MRAM/STT/Other NVMs

– Still in research

4

Why Flash? Perfect match for

Databases

Low latency – good for low queue I/O

Handles mixed sequential and random I/O at various block sizes much better than disk

▸ Capacity ▸ IOPS ▸ Cost/IOP

4TB 3TB 15 200,000 $$$$ ¢¢¢¢

5

Evolution of Flash Usage FLASH + DISK FLASH AS DISK FLASH BEYOND DISK FLASH AS MEMORY

Increased flash awareness

Better Transactions/$/Watt

6

Flash Beyond Disk: Not Just Fast Disk Area Hard Disk Drives Flash Devices

Read/Write Performance Largely symmetrical Heavily asymmetrical. Additional operation (erase)

Sequential vs Random Performance 100x difference. Elevator scheduling for disk arm

<10x difference. No disk arm – NAND die

Mapping and Background ops Rare Regular occurrence – like a log structured file system

Wear out Largely unlimited Limited writes

IOPS 100s to 1,000s 100Ks to Millions

Latency 10s ms 10s-100s us

7

Flash-Awareness: Opportunities in the I/O Stack

Flash devices –I/O and Primitives (Atomics, PTRIM, etc) - Memory, Persistent Memory

File System (XFS, Ext4, Btrfs, NVMFS)

Apps

APIs, Libraries

Transparent Optimization

Flash-Aware Optimization

8

Examples covered in this talk - MySQL - MongoDB - LevelDB Gigaspaces with ZetaScale™ Software – See OPEN Session 301-A: Flash Changes the Game in Application Performance and TCO (Enterprise Applications Track)

9

Example: MySQL

10

Flash as Disk: Tune for a Really Fast Disk Has been going on now for

several years with great results

Targeted data placement, NOOP scheduler, seek-less optimizations, concurrency improvements, etc.

Make the block I/O stack faster

Find the fastest file-system

11

Multi-Instance: Using all the IOPS

Instances 1 2 4

Thro

ughp

ut, N

OT/1

0sec

0

4000

6000

8000

10000

12000

Fusion-io, 48 threads, 2400W – 64GB BP

4810

8788

11952

2000

12

Atomic Writes

Traditional MySQL Writes MySQL with Atomic Writes

Page C Page B

Page A

Buffer

DRAM Buffer

SSD (or HDD) Database

Database Server

Page C Page B Page A



Application initiates updates to pages A, B, and C.

1

MySQL copies updated pages to memory buffer.

2

MySQL writes to double-write buffer on the media.

3

Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.

4

ioMemory Database


DRAM Buffer


Application initiates updates to pages A, B, and C.

1

MySQL copies updated pages to memory buffer.

2

MySQL writes to actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the (intelligent) device.

3

Database Server

Page C Page B

Page A

13

Performance with Atomic Writes

Double-write disabled – Non-ACID

Double-write – ACID

Without Atomic Writes With Atomic I/O

Twice the performance AND with ACID properties

Atomic writes – ACID

• Atomic writes at 99% of the performance of raw writes • 2x flash device endurance improvement

14

0

20

40

60

80

100

120

140

160

180

200

111

122

133

144

155

166

177

188

199

111

0112

1113

2114

3115

4116

5117

6118

7119

8120

9122

0123

1124

2125

3126

4127

5128

6129

7130

8131

9133

0134

1135

21

Mill

isec

onds

Seconds

Sysbench 99% Latency OLTP workload

XFS DoubleWriteDirectFS Atomic

Atomic Writes: Latency Improvement 2-4x Latency Improvement on Percona Server

Atomic Writes

15

NVM-Compression Uses natural thin-provisioning of the

underlying flash

No bit packing at MySQL, no rebalancing

Holes in data files are TRIMed/unmapped

Multi-threaded flush and atomics to reduce latency

Pluggable compression algorithms

ioMemory-VSL

NVMFS

MySQL

16

Performance Improvement (LinkBench)

100%

20%

90%

0%

20%

40%

60%

80%

100%

Uncompressed Row Compression NVM-Compression

Transaction Rate Compression Performance Penalty

17

Performance Improvement: (TPCC-like)

0

5000

10000

15000

20000

25000

30000

Tim

e13

026

039

052

065

078

091

010

4011

7013

0014

3015

6016

9018

2019

5020

8022

1023

4024

7026

0027

3028

6029

9031

2032

5033

8035

10

New Order TX

TPC-C like workload MariaDB 10

1,000 warehouses - 75GB DRAM

MySQL uncompressed

MySQL compression

Fusion-io Compression

Time

18

Capacity Efficiency and Endurance

Row-comp Page-comp49.0% 58.5%

44.0%46.0%48.0%50.0%52.0%54.0%56.0%58.0%60.0%

% im

prov

emen

t

Vs. Uncompressed *

*For LinkBench with lz77. Comparable results with lz4.

• Architecturally more storage efficient than row compression

• When combined with

Atomic Writes, can improve flash endurance ~4x

19

Flash Aware API Integration for Database

20

Example: MongoDB

21

Reducing DRAM with Flash MongoDB Throughput (operations/second)

0

2000

4000

6000

8000

24 GB Node 12 GB Node 8 GB Node

-26% -33%

2x DRAM Reduction 3x DRAM Reduction

22

Improving Performance Predictability while reducing DRAM

12GB 2x DRAM Reduction

8GB 3x DRAM Reduction

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

Improved Predictability with Less DRAM

24G

12G

8G

24GB Baseline

23

Improving Performance Predictability while reducing DRAM

Throughput Deviation (operations/second)

0

100

200

300

400

24 GB Node 12 GB Node 8 GB Node


3x Improvement

24

Improving Performance Predictability with Less DRAM Latency Deviation (microsecond)

0

3

6

9

12

Update Latency Deviation Read Latency Deviation


Tighter consistency even with 3x less DRAM

25

Example: LevelDB

26

LevelDB, NVMKV, and Raw Block Device

Fusion-io flash aware stack can bring full performance of flash to KV stores

27

Improving endurance and lifetime with Flash Awareness

Fusion-io Confidential

RW: Random Async Write RW-S: Random Sync Write SW: Sequential Async Write SW-S: Sequential Sync Write

Improving endurance by 2x-40x with flash aware key value storage

28

Performance Predictability

29

Transaction throughput (YCSB)

Throughput improvements of 50%-300% while using less DRAM

30

Availability Atomic Writes

– MariaDB mainline >= 5.5.31 – Percona Server >= 5.5.31 – Oracle MySQL >= 5.7.4

NVM Compression – MariaDB – 10.0.9 – Percona Server – 5.6 – Oracle MySQL – labs release (http://labs.mysql.com/)

NVMFS available early access MongoDB – works with existing MongoDB releases NVMKV – available at opennvm.github.io

31

https://opennvm.github.io

http://www.opencompute.org/projects/storage

32

Thank You

33

Appendix

34

File System Support

POSIX Interface Functionality

fallocate(offset, len) Preallocation and extending files/table space

fallocate(PUNCH_HOLE) Unmap/Punch hole operation. (issue Persistent TRIMs)

io_submit() AIO Transparent Atomic writes

34

NVM-Compression uses POSIX compliant Interfaces

NVM-Compression speed from NVMFS file system

35

NVMFS Non Volatile Memory FileSystem A POSIX compliant filesystem, designed by Fusion-io

Strengths

– Efficient pre-allocation of large files – Pre-allocation of large files is efficient – No fragmentation/degradation with aging of filesystem – Near raw flash speeds for I/O, atomic writes and TRIM

36

©2014 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. ZetaScale is a trademark of SanDisk Enterprise IP LLC. Fusion-IO, ioDrive and ioDrive Duo are trademarks of Fusion-IO, Inc. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their holder(s).

architecting flash for application acceleration and memory

Documents