linux nvm-express - flash memory summit · • power management: suspend/resume ... crc t10 dif...

38
Linux NVM-Express Keith Busch Software Engineer Intel Flash Memory Summit 2014 Santa Clara, CA 1

Upload: dangthuan

Post on 08-May-2018

231 views

Category:

Documents


7 download

TRANSCRIPT

Page 1: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Linux NVM-Express

Keith Busch Software Engineer

Intel

Flash Memory Summit 2014 Santa Clara, CA

1

Page 2: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Agenda

Flash Memory Summit 2014 Santa Clara, CA

2

Driver Development History Current State Kernel Enhancements Future Work Getting Involved

Page 3: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

In the beginning …

Originally revealed with 1.0 spec, March 2011, contributed by Matthew Wilcox

Merge to mainline January, 2012 with 3.3. Since has seen over 150 commits from 25

individuals contributing bug fixes, features and enhancements.

Flash Memory Summit 2014 Santa Clara, CA

3

3.3 • Initial commit based on NVMe 1.0c

Page 4: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Getting new updates Upstream

Flash Memory Summit 2014 Santa Clara, CA

4

Maintainer Tree (infradead.org)

Linux Mainline (kernel.org)

Distros

merging appropriate changes back for ecosystem

copy/fork for product dev Developer Copy

Developer Copy

Page 5: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Driver Details: Queue Allocation

Ideal case: one SQ/CQ pair per cpu core MSI-x IRQ affinity assigned to CPU

associated Queue Flash Memory Summit 2014 Santa Clara, CA

5

Page 6: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Driver Details: Block Stack Anatomy

Flash Memory Summit 2014 Santa Clara, CA

6

Page 7: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Storage Stack Comparison

SAS vs. NVMe Device: Latency and CPU utilization

reduced by 50+%*: NVMe: 2.8us, 9,100 cycles SAS: 6.0us, 19,500 cycles

Flash Memory Summit 2014 Santa Clara, CA

7

* Measurement taken on Intel® Core™ i5-2500K 3.3GHz 6MB L3 Cache Quad-Core Desktop Processor using Linux

Page 8: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.6 …

Flash Memory Summit 2014 Santa Clara, CA

8

3.3 • Initial commit based on NVMe 1.0c

3.6 • Greater than 512 byte block support • Device capability constraints

Page 9: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.9 …

Flash Memory Summit 2014 Santa Clara, CA

9

3.3 • Initial commit based on NVMe 1.0c

3.6 • Greater than 512 byte block support • Support for devices with limited capabilities

3.9 • Discard/TRIM (NVMe Data-Set Mgmt) • Metadata passthrough commands • SG_IO SCSI-to-NVMe translation • Character device for management

Page 10: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.9: SCSI-NVMe SG_IO

• Read/Write 6, 10, 12, 16 • Inquiry • Mode Sense 10/16 • Mode Select 10/16 • Log Sense • Read Capacity 10/16 • Report LUNS

Flash Memory Summit 2013 Santa Clara, CA

10

• Request Sense • Security Protocol In/Out • Start Stop Unit • Test Unit Ready • Write Buffer • Unmap

Page 11: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.10 …

Flash Memory Summit 2014 Santa Clara, CA

11

3.3 • Initial commit based on NVMe 1.0c

3.6 • Greater than 512 byte block support • Support for devices with limited capabilities

3.9 • Discard/TRIM (NVMe Data-Set Mgmt) • Metadata passthrough commands • SG_IO SCSI-to-NVMe translation • Character device for management

3.10 • Multiple Message MSI • Disk stats / iostat • NVMe BIO splitting

Page 12: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.10: BIO Splitting

Not all I/O vectors can be mapped to an NVMe command’s PRP list

Requires virtually contiguous buffers

Flash Memory Summit 2014 Santa Clara, CA

12

Page 13: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.12

Flash Memory Summit 2014 Santa Clara, CA

13

3.12 • Power Management: Suspend/Resume

Page 14: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.14

Flash Memory Summit 2014 Santa Clara, CA

14

3.12 • Power Management: Suspend/Resume

3.14 • Dynamic Partitions • Surprise Removal (no I/O) • Command Abort Handling • Controller Failure and Recovery

Page 15: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.15

Flash Memory Summit 2014 Santa Clara, CA

15

3.12 • Power Management: Suspend/Resume

3.14 • Dynamic Partitions • Surprise Removal, no I/O • Command Abort Handling • Controller Failure and Recovery

3.15 • HDIO_GETGEO • Pre-CPU Queue optimizations • Hot plug CPU • Surprise Removal while running IO

Page 16: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.15: Disk Geometry

Prevent partitions that create this scenario:

Flash Memory Summit 2014 Santa Clara, CA

16

Page 17: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.15: Per-CPU Optimization

When more cores than queues, before:

Flash Memory Summit 2014 Santa Clara, CA

17

Page 18: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.15: Per-CPU Optimization

When more cores than queues, after:

Flash Memory Summit 2014 Santa Clara, CA

18

Page 19: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

3.15: Surprise Removal

Flash Memory Summit 2014 Santa Clara, CA

19

Additional synchronization and reference counting software need for controller+storage removal safe without sacrificing performance

Page 20: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.16

Flash Memory Summit 2014 Santa Clara, CA

20

3.12 • Power Management: Suspend/Resume

3.14 • Dynamic Partitions • Surprise Removal, no I/O • Command Abort Handling • Controller Failure and Recovery

3.15 • HDIO_GETGEO • Pre-CPU Queue optimizations • Hot plug CPU • Surprise Removal while running IO

3.16 • Flush • Tracepoints • Function Level Reset Notify

Page 21: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel 3.17 (expected)

Asynchronous Event Notification Mismatched Host-Device Pages PCI-e Advanced Error Reporting Write Zeroes command

Flash Memory Summit 2014 Santa Clara, CA

21

Page 22: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Kernel Enhancements & Future Work

Flash Memory Summit 2014 Santa Clara, CA

22

Page 23: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Block Multi-Queue

Removes request based block driver software queue contention bottleneck

Moves contention closer to the h/w, reducing latency on multi-threaded IO

Merged into Linux 3.13; virtio-blk only mainline driver Flash Memory Summit 2014

Santa Clara, CA

23

Page 24: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Block Multi-Queue

IO Merging and Scheduling Removes duplicated code in bio-based

drivers: Timeouts, Tracing, Tagging, Diskstats, Runtime

Power Management, Device Removal Latest nvme-mq tree based on 3.15, passes

stability testing. Mainline merge date uncertain

Flash Memory Summit 2014 Santa Clara, CA

24

Page 25: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Page IO

Reading/writing a page (swap example):

Multiple allocations, additional CPU and Memory resources required

… but NVMe is optimized to handle pages

Flash Memory Summit 2014 Santa Clara, CA

25

Page 26: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Page IO

New function to “bdev” ops with block layer read/write page functions:

Benchmarks performance increase by 20% No additional memory required!

Applied in Linux 3.16 TODO: Implement for NVMe

Flash Memory Summit 2014 Santa Clara, CA

26

Page 27: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

CRC T10 DIF/DIX

Meta-data Formats used with protection information:

Guard calculated as CRC-16, costly on CPU cycles.

HW Acceleration?

Flash Memory Summit 2014 Santa Clara, CA

27

Page 28: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

CRC T10 DIF/DIX

X86_64 HW acceleration: PCLMULQDQ Up to 8x calculation speed improvement 3.5x on SSD I/O throughput

Added in 3.11 through linux-crypto

Flash Memory Summit 2014 Santa Clara, CA

28

0 500 1000 1500 2000

4k

16k

64k

MB/s

Buf

fer S

ize

Throughput Comparison CRC T10 DIF

PCLMUL

Table

Page 29: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

CRC T10 DIF/DIX

Still not able to use blk-integrity extensions Ref Tag consistent only with 512-byte block formats

NVM-e supports all powers of 2 >= 512

… and 8 bytes separate meta-data NVM-e Metadata may be 8 to 65536 bytes

Flash Memory Summit 2014 Santa Clara, CA

29

Page 30: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Other Future Work

Block Polling NVMe 1.1 and 1.2 Support

SGL Subsystem Namespace List

Performance Tuning Aggregate Doorbell writes Asynchronous Start/Stop Runtime Power Management SR-IOV Anything else ???

Flash Memory Summit 2014 Santa Clara, CA

30

Page 31: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Getting Involved

Subscribe to Linux-NVMe Mailing List http://lists.infradead.org/mailman/listinfo/linux-nvme

Clone and enhance Linux-NVMe Repository: http://git.infradead.org/users/willy/linux-nvme.git

Read up on NVM-Express: http://www.nvmexpress.org/

Flash Memory Summit 2014 Santa Clara, CA

31

Page 32: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

FIN

Questions? [email protected]

Flash Memory Summit 2014 Santa Clara, CA

32

Page 33: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Backup

Flash Memory Summit 2014 Santa Clara, CA

33

Page 34: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Block Polling

IO Latency Sources:

Beyond NAND: For low-latency device, context switch and interrupt dominate observed latency.

Flash Memory Summit 2014 Santa Clara, CA

34

Page 35: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Flash Memory Summit 2014 Santa Clara, CA

‹#›

Page 36: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

User-space Tooling

Examples of user space tools for device management: http://git.infradead.org/users/kbusch/nvme-user.git

Provides examples using the NVMe IOCTL interface to send commands to controller and parse output

Flash Memory Summit 2014 Santa Clara, CA

36

Page 37: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

Emulation

Wanting to try nvme software, but lacking hardware: http://git.infradead.org/users/kbusch/qemu-

nvme.git Useful if you just want to test driver and user

space tools. Performs poorly on imitating nvme

performance characteristics, among other things. Flash Memory Summit 2014

Santa Clara, CA

37

Page 38: Linux NVM-Express - Flash Memory Summit · • Power Management: Suspend/Resume ...  CRC T10 DIF ... Linux NVM-Express Author:

References NVM-Express:

http://www.nvmexpress.org/

Linux-NVMe Mailing List http://lists.infradead.org/mailman/listinfo/linux-nvme

Linux-NVMe Repository: http://git.infradead.org/users/willy/linux-nvme.git

CRC T10 DIF https://lkml.org/lkml/2013/5/1/449

Page IO http://marc.info/?l=linux-kernel&m=139843448100786&w=2

Block poll http://lwn.net/SubscriberLink/556244/309ec42e8b9a4fcf/

Block Multi-queue http://kernel.dk/systor13-final18.pdf

SAS vs NVMe: http://www.nvmexpress.org/wp-content/uploads/2013/04/IDF-2012-NVM-Express-and-

the-PCI-Express-SSD-Revolution.pdf

Flash Memory Summit 2014 Santa Clara, CA

38