
Page 1: Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs

NVRAMOS 2019

Jinkyu Jeong, Sungkyunkwan University (SKKU)

Source: Gyusun Lee et al., Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs, USENIX ATC '19

Page 2: Storage Performance Trends

• Emerging ultra-low latency SSDs deliver I/Os in a few µs

[Figure: access latency (ns, log scale) by year, 1985–2017, for SRAM, DRAM, ULL-SSD (Samsung Z-SSD, Intel Optane SSD), SSD, and HDD]

Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015

Page 3: Overhead of Kernel I/O Stack

• Low-latency SSDs expose the overhead of the kernel I/O stack

[Figure: normalized I/O latency broken down into Device, Kernel, and User components for SATA SSD, NVMe SSD, Z-SSD, and Optane SSD, for Read and for Write (+fsync)]

Page 4: Synchronous I/O vs. Asynchronous I/O

Our idea: apply the asynchronous I/O concept to the I/O stack itself.

• Synchronous I/O: the CPU finishes computation A, then the device performs I/O B; the total latency is the sum of the two.
• Asynchronous I/O: A is split into A′ and A″, where A″ is independent of B, so the CPU overlaps A″ with the device I/O B; total latency shrinks and throughput improves.

(A user-level sketch of this overlap follows below.)
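The timeline above has a direct user-level analogue in POSIX AIO: start the I/O (B), run the computation that does not depend on it (A″), and only then wait. The sketch below is my illustration of that ordering, not code from the talk; the file path and the dummy loop are arbitrary placeholders (compile with gcc -lrt on older glibc).

    /* Overlap an independent computation (A'') with an in-flight read (B). */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/etc/hostname", O_RDONLY);      /* any readable file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }   /* B starts */

        unsigned long sum = 0;                         /* A'': independent of B */
        for (unsigned long i = 0; i < 10UL * 1000 * 1000; i++)
            sum += i;

        const struct aiocb *list[1] = { &cb };         /* only now wait for B */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);

        printf("read %zd bytes while computing sum=%lu\n", aio_return(&cb), sum);
        close(fd);
        return 0;
    }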

Page 5: Read Path Overview

• Vanilla read path: sys_read() performs I/O stack operations on the CPU, waits for the device I/O, performs more I/O stack operations, and then returns to user.
• Proposed read path: the I/O stack operations that are independent of the device I/O are executed asynchronously, overlapped with the device I/O, which reduces the end-to-end latency of sys_read().

Page 6: Write Path Overview

• Vanilla write path: sys_write() performs a buffered write on the CPU only (the data is staged in the page cache) and returns to user without any device I/O.

Page 7: Write Path Overview

• Vanilla fsync path: sys_fsync() runs I/O stack operations, issues a device I/O, waits, and repeats; three device I/Os are fully serialized with the CPU work between them before returning to user.
• Proposed fsync path: the I/O stack operations are overlapped with the outstanding device I/Os, so sys_fsync() returns to user earlier (latency reduction).

Page 8: Agenda

• Read path
  − Analysis of vanilla read path
  − Proposed read path
• Light-weight block I/O layer
• Write path
  − Analysis of vanilla write path
  − Proposed write path
• Evaluation
• Conclusion

Page 9: Analysis of Vanilla Read Path

Timeline of sys_read() through the page cache, file system, block layer, device driver, and interrupt handler. Total latency: 12.82 µs, of which the device I/O takes 7.26 µs. CPU-side breakdown:

• Page cache lookup: 0.30 µs
• Page allocation: 0.19 µs
• Page cache insertion: 0.33 µs
• LBA retrieval: 0.09 µs
• BIO submission: 0.72 µs
• DMA mapping: 0.29 µs
• NVMe I/O submission: 0.37 µs
• Context switch: 0.95 µs
• DMA unmapping: 0.23 µs
• Request completion: 0.81 µs
• Copy-to-user: 0.21 µs
• Context switch: 0.95 µs

Page 10: Page Allocation / DMA Mapping

On the vanilla submission path, page allocation (0.19 µs) and DMA mapping (0.29 µs) are performed synchronously before the device I/O (7.26 µs) can be issued.

Page 11: Asynchronous Page Allocation / DMA Mapping

• DMA-mapped page pool
  − Each core keeps a pool of 64 4KB pages that are already DMA-mapped, so neither page allocation (0.19 µs) nor DMA mapping (0.29 µs) has to run on the I/O submission path.

Page 12: Asynchronous Page Allocation / DMA Mapping

• DMA-mapped page pool
  − Taking a page from the per-core pool costs only 0.016 µs on the submission path, replacing the 0.19 µs page allocation and the 0.29 µs DMA mapping.
  − The pool is refilled with fresh 4KB DMA-mapped pages while the device I/O is in flight (a user-space sketch of the pool follows below).
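A rough user-space analogy of the per-core pool (my sketch, not the kernel implementation): posix_memalign() stands in for "allocate a page and DMA-map it", so only the pool/refill structure is illustrated.

    #include <stdio.h>
    #include <stdlib.h>

    #define POOL_SIZE 64               /* 64 pages per core, as on the slide */
    #define PAGE_SIZE 4096

    struct page_pool {
        void *pages[POOL_SIZE];
        int   count;                   /* number of ready-to-use pages */
    };

    /* Refill the pool; in the kernel this runs off the critical path,
     * e.g. while a device I/O is already in flight. */
    static void pool_refill(struct page_pool *pool)
    {
        while (pool->count < POOL_SIZE) {
            void *p;
            if (posix_memalign(&p, PAGE_SIZE, PAGE_SIZE) != 0)
                break;                          /* leave the pool short */
            pool->pages[pool->count++] = p;     /* "allocate + DMA-map" */
        }
    }

    /* Fast path: hand out a pre-mapped page; no allocation, no mapping. */
    static void *pool_get_page(struct page_pool *pool)
    {
        if (pool->count == 0)
            pool_refill(pool);         /* slow path only if the pool ran dry */
        return pool->count ? pool->pages[--pool->count] : NULL;
    }

    int main(void)
    {
        struct page_pool pool = { .count = 0 };
        pool_refill(&pool);                     /* pre-fill once, up front */

        void *page = pool_get_page(&pool);      /* submission path: just a pop */
        printf("got page %p, %d pages left\n", page, pool.count);
        return 0;
    }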

Page 13: Page Cache Insertion

In the vanilla read path, the newly allocated page is inserted into the page cache tree (0.33 µs) before the I/O request is made:

• Cache lookup misses → insert the page into the tree → make the I/O request.
• A concurrent reader of the same file index then hits in the cache and simply waits for the page read, so duplicated I/O requests for the same file index are prevented.
• The price is page cache lookup overhead and page cache tree extension overhead on the critical submission path.

Page 14: Lazy Page Cache Insertion

The proposed read path makes the I/O request first and inserts the page into the page cache lazily (0.35 µs), overlapped with the device I/O:

• Cache lookup misses → make the I/O request → attempt the lazy cache insertion.
• If two readers race on the same file index, both miss and both make an I/O request; the loser's lazy insertion fails, its page is freed, and the duplicated I/O is wasted. This happens at extremely low frequency.
• Page cache lookup overhead and tree extension overhead thus disappear from the critical submission path (the race handling is sketched in code below).
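The race handling can be sketched as a "publish the page if the slot is still empty" compare-and-swap after the I/O has been issued; this is my user-space illustration of the control flow, not the kernel page cache code.

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_SLOTS 1024              /* toy "page cache": one slot per file index */
    static void *cache[NR_SLOTS];      /* NULL means "not cached" */

    /* Returns whichever page ended up cached for this index. */
    static void *lazy_insert(unsigned idx, void *page)
    {
        void *expected = NULL;
        if (__atomic_compare_exchange_n(&cache[idx], &expected, page,
                                        0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return page;               /* lazy insertion succeeded */

        free(page);                    /* lost the race: drop our copy ...   */
        return expected;               /* ... and use the already-cached page */
    }

    int main(void)
    {
        /* Two readers of the same index both missed and both issued an I/O;
         * the loser's I/O is simply wasted (the extremely rare case above). */
        void *a = malloc(4096), *b = malloc(4096);
        void *first  = lazy_insert(7, a);
        void *second = lazy_insert(7, b);
        printf("both readers share one cached page: %s\n",
               first == second ? "yes" : "no");
        return 0;
    }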

Page 15: DMA Unmapping

On the completion path, DMA unmapping (0.23 µs) is performed after the device interrupt, before the request can complete.

Page 16: Lazy DMA Unmapping

• Implementation (sketched in code below)
  − Delays DMA unmapping until the system is idle or is waiting for another I/O request (lazy DMA unmapping: 0.35 µs, off the critical path)
  − An extended version of the deferred protection scheme in Linux [ASPLOS'16]
  − Can optionally be disabled for safety
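In plain C the deferral looks like the sketch below: the completion path only queues the buffer, and the real unmapping is done later while waiting for another I/O or when idle. free() stands in for the DMA unmap; this is an illustration, not the kernel code.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_DEFERRED 128

    static void *deferred[MAX_DEFERRED];
    static int   nr_deferred;

    /* Completion path: remember the buffer, O(1), no unmap work here. */
    static void lazy_unmap(void *dma_buf)
    {
        if (nr_deferred < MAX_DEFERRED)
            deferred[nr_deferred++] = dma_buf;
        else
            free(dma_buf);             /* queue full: unmap immediately */
    }

    /* Called when idle or while waiting for another I/O request. */
    static void drain_deferred_unmaps(void)
    {
        while (nr_deferred > 0)
            free(deferred[--nr_deferred]);     /* the real "unmap" happens here */
    }

    int main(void)
    {
        lazy_unmap(malloc(4096));      /* completion path stays cheap */
        lazy_unmap(malloc(4096));
        printf("%d unmaps deferred\n", nr_deferred);
        drain_deferred_unmaps();       /* paid for off the critical path */
        printf("%d unmaps pending after drain\n", nr_deferred);
        return 0;
    }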

Page 17: Remaining Overheads in the Proposed Read Path

After asynchronous page allocation, lazy cache insertion, and lazy DMA unmapping, the block-layer costs still sit on the critical path: BIO submission (0.72 µs), NVMe I/O submission (0.37 µs), and request completion (0.81 µs).

Page 18: Agenda (identical to Page 8)

Page 19: Linux Multi-queue Block I/O Layer

• Structure conversion
  − Merge a bio with a pending request via I/O merging
  − Otherwise assign a new tag & request and convert the bio into it
• Multi-queue structure
  − Software staging queue (SW queue): supports I/O scheduling and reordering
  − Hardware dispatch queue (HW queue): delivers the I/O request to the device driver
• Multiple dynamic memory allocations
  − bio (block layer)
  − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)

[Figure: submit_bio() builds a bio (LBA, length, pages), which passes through per-core SW queues and HW queues, is converted into a tagged request (length, bios), and is turned by the device driver into an NVMe command (iod, sg_list, prp_list) on the NVMe queue pairs]

*PRP: physical region page

Page 20: Linux Multi-queue Block I/O Layer

• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]; merging is a useful feature for low-performance storage devices
• Multi-queue structure
• Multiple dynamic memory allocations

[Same diagram as Page 19]

Page 21: Linux Multi-queue Block I/O Layer

• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]; useful only for low-performance storage devices
• Multi-queue structure
  − Inefficiency of I/O scheduling for low-latency SSDs [Saxena, ATC'10] [Xu, SYSTOR'15]; the default configuration is the noop scheduler
  − Prior work bypasses the multi-queue structure [Zhang, OSDI'18] or moves scheduling into the device [Peter, OSDI'14] [Joshi, HotStorage'17]
• Multiple dynamic memory allocations

[Same diagram as Page 19]

Page 22: Light-weight Block I/O Layer

• Light-weight bio (lbio) structure (sketched in code below)
  − Contains only the essential arguments needed to build an NVMe I/O request: LBA, length, prp_list, page(s), dma_addr(s)
  − Eliminates unnecessary structure conversions and allocations
• Per-CPU lbio pool
  − Supports lockless lbio object allocation
  − Supports the tagging function
• Single dynamic memory allocation
  − NVMe PRP* list (device driver)

[Figure: submit_lbio() takes an lbio from the per-CPU lbio pool, which also provides the tag, and builds the NVMe command directly for the NVMe queue pairs; data pages come from the per-core DMA-mapped page pool]

*PRP: physical region page
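A plain-C sketch of the lbio idea, with field names taken from the slide (LBA, length, prp_list, pages, dma_addrs); the layout, the pool size, and the index-as-tag scheme are my assumptions, not the published structure definition.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LBIO_POOL_SIZE 256         /* assumed per-CPU pool size */
    #define LBIO_MAX_PAGES 32          /* assumed: enough for a 128KB request */

    struct lbio {
        uint64_t lba;                        /* logical block address */
        uint32_t length;                     /* request length in bytes */
        void    *pages[LBIO_MAX_PAGES];      /* pages from the DMA-mapped pool */
        uint64_t dma_addrs[LBIO_MAX_PAGES];  /* their DMA addresses */
        uint64_t prp_list;                   /* the one dynamically allocated object */
        int      in_use;                     /* slot state; the index is the tag */
    };

    /* One pool per CPU in the kernel (lockless because each CPU only touches
     * its own array); a single array here for brevity. */
    static struct lbio lbio_pool[LBIO_POOL_SIZE];

    static struct lbio *lbio_alloc(int *tag)
    {
        for (int i = 0; i < LBIO_POOL_SIZE; i++) {
            if (!lbio_pool[i].in_use) {
                lbio_pool[i].in_use = 1;
                *tag = i;                    /* array index doubles as the tag */
                return &lbio_pool[i];
            }
        }
        return NULL;                         /* pool exhausted */
    }

    static void lbio_free(struct lbio *lbio)
    {
        memset(lbio, 0, sizeof(*lbio));
    }

    int main(void)
    {
        int tag;
        struct lbio *lbio = lbio_alloc(&tag);
        lbio->lba = 2048;
        lbio->length = 4096;
        printf("lbio ready: lba=%llu len=%u tag=%d\n",
               (unsigned long long)lbio->lba, lbio->length, tag);
        lbio_free(lbio);
        return 0;
    }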

Page 23: Read Path Comparison

• Proposed read path before applying the light-weight block I/O layer: BIO submission 0.72 µs and NVMe I/O submission 0.37 µs on the submission side, request completion 0.81 µs on the completion side (device I/O 7.26 µs).
• Proposed read path with the light-weight block I/O layer: LBIO submission 0.13 µs and LBIO completion 0.65 µs (device I/O 7.26 µs), a further latency reduction on both sides.

Page 24: Read Path Comparison

• Vanilla read path: total latency 12.82 µs (device I/O 7.26 µs).
• Proposed read path: total latency 10.10 µs (device I/O 7.26 µs); the saving comes from overlapping the I/O stack operations with the device I/O and from the lighter submission and completion paths.

Page 25: Agenda (identical to Page 8)

Page 26: Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)

Timeline of sys_fsync(), with the journaling work done by the jbd2 thread:

• Data writeback: 5.68 µs, then data block submit → data block I/O: 12.73 µs
• jbd2 call: 0.80 µs
• Journal block preparation: 5.55 µs (allocating buffer pages, allocating journal area blocks, checksum computation, …), then journal block submit → journal block I/O: 12.57 µs
• Commit block preparation: 2.15 µs, then flush & commit block submit → commit block I/O: 10.72 µs → return to user

Each device I/O waits for the CPU work that precedes it, so the three I/Os are fully serialized.

Page 27: Proposed Fsync Path (Ext4 Ordered Mode)

• Data writeback: 4.18 µs (reduced by applying the light-weight block I/O layer), then data block submit → data block I/O: 11.37 µs
• jbd2 call: 0.78 µs; jbd2 is woken before the caller waits for the data blocks
• Journal block preparation (5.21 µs) and commit block preparation (1.90 µs) run on the jbd2 thread, overlapped with the data block I/O, and the journal blocks are submitted while the data I/O is still in flight
• sys_fsync() then enters the jbd2 commit wait

Page 28: Proposed Fsync Path (Ext4 Ordered Mode)

• Data writeback: 4.18 µs (reduced by the light-weight block I/O layer), then data block submit → data block I/O: 11.37 µs
• jbd2 call: 0.78 µs; jbd2 is woken before the data block wait
• Journal block preparation (5.21 µs) and commit block preparation (1.90 µs) on the jbd2 thread overlap with the data block I/O; journal block submit → journal block I/O: 10.61 µs
• After the data & journal block I/O wait, the flush & commit block dispatch takes only 0.04 µs → commit block I/O: 10.44 µs → return to user

The overlap between jbd2's preparation work and the outstanding I/Os is what shortens the path (sketched in code below).
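The essential ordering change is that jbd2's CPU work starts before the caller blocks on the data-block I/O. The pthread sketch below models only that ordering (usleep() stands in for device I/O, with times scaled up); it is not ext4/jbd2 code. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *journal_prepare(void *arg)
    {
        (void)arg;
        volatile unsigned long work = 0;
        for (unsigned long i = 0; i < 5UL * 1000 * 1000; i++)
            work += i;                   /* journal + commit block preparation */
        puts("jbd2: journal blocks prepared and submitted");
        return NULL;
    }

    int main(void)
    {
        /* 1. Submit the data blocks (not shown), then wake jbd2 immediately,
         *    before waiting for the data-block I/O. */
        pthread_t jbd2;
        pthread_create(&jbd2, NULL, journal_prepare, NULL);

        /* 2. Wait for the data-block I/O while jbd2 runs in parallel. */
        usleep(11000);                   /* stand-in for the data block I/O */
        puts("fsync: data block I/O done");

        /* 3. Wait for the journal work, then dispatch flush & commit block. */
        pthread_join(jbd2, NULL);
        puts("fsync: flush & commit block dispatched");
        usleep(10000);                   /* stand-in for the commit block I/O */
        puts("fsync: return to user");
        return 0;
    }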

Page 29: Fsync Path Comparison (Ext4 Ordered Mode)

• Vanilla fsync path: total latency 53.94 µs; the data, journal, and commit block I/Os are serialized behind their CPU-side preparation.
• Proposed fsync path: total latency 38.03 µs; the journaling work overlaps with the outstanding device I/Os.

Page 30: Agenda (identical to Page 8)

Page 31: Experimental Setup

• Server: Dell R730
• OS: Ubuntu 16.04.4
• Base kernel: Linux 5.0.5
• CPU: Intel Xeon E5-2640 v3, 2.6 GHz, 8 cores
• Memory: DDR4 32GB
• Storage devices: Z-SSD (Samsung SZ985 800GB), Optane SSD (Intel Optane 905P 960GB)
• Workloads: synthetic micro-benchmark (FIO), real-world workload (RocksDB DBbench); an illustrative fio job follows below
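For reference, an illustrative fio job for the single-thread 4KB buffered random-read test on the next page; the file path, file size, and runtime are my assumptions, not the job file used in the talk. The write(+fsync) test would use rw=randwrite with fsync=1.

    [global]
    ioengine=psync
    direct=0
    bs=4k
    size=4g
    runtime=30
    time_based=1
    group_reporting=1

    [randread-1thread]
    rw=randread
    numjobs=1
    filename=/mnt/ullssd/testfile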

Page 32: FIO Performance (Random Read)

• Single thread, latency vs. block size: AIOS reduces read latency by up to 23% over vanilla, with a lowest highlighted latency of 7.6 µs; AIOS-poll is also compared.
• 4KB block size, IOPS (k) vs. number of threads (1–128): AIOS improves IOPS by up to 26% over vanilla.

Page 33: FIO Performance (Random Write+Fsync, Ext4 Ordered)

• Single thread, latency vs. block size: up to 26% latency reduction for AIOS over vanilla.
• 4KB block size, IOPS (k) vs. number of threads (1–16): up to 27% IOPS improvement.

Page 34: RocksDB Performance

• DBbench readrandom (64GB dataset), operations/s vs. number of threads (1–16): up to 27% performance improvement for AIOS over vanilla (per-thread gains of 10.7% to 27.2%).
• DBbench fillsync (16GB dataset), operations/s vs. number of threads (1–16): up to 44% performance improvement (per-thread gains of 23.6% to 44.3%).

Page 35: Conclusion

• Asynchronous I/O stack
  − Applies the asynchronous I/O concept to the kernel I/O stack itself
  − Overlaps computation with I/O to reduce total I/O latency
• Light-weight block I/O layer
  − Provides low-latency block I/O services for low-latency NVMe SSDs
• Performance evaluation
  − Achieves single-digit microsecond I/O latency on the Optane SSD
  − Achieves significant latency reduction and performance improvement on real-world workloads

Source code: https://github.com/skkucsl/aios

Page 36: Comparison with User-level Direct Access Approach

[Figure: random read latency (µs) for block sizes 4KB–128KB, broken down into User+Device, Copy-to-user, and Kernel time, compared against User+Device (SPDK), the user-level direct access approach]

Page 37: Q&A

• Thank you

Page 38: Acknowledgment

• This work is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2017R1C1B2007273).

Page 39: Extra Slides

Page 40: Preloading Extent Tree

• Extent status tree (in-memory structure)
  − Contains the LBA mapping information for file blocks (composed of extent objects)
  − An extent lookup miss incurs an additional I/O request to read the mapping block
• Control & data plane approach [OSDI'14]
  − Preloads the entire mapping information into memory in the control plane (e.g., at file open)
  − Selectively applied to files that require low-latency access
• Effect on the read path: LBA retrieval drops from 0.09 µs to 0.07 µs, and the mapping-block I/O on a lookup miss is avoided (sketched in code below).
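A small sketch of the preloading idea: read the file's extent mappings into a sorted in-memory array at open time, so LBA retrieval on the data path is a short binary search and never needs a mapping-block I/O. Types, names, and the sample extents are illustrative, not ext4's extent status tree.

    #include <stdint.h>
    #include <stdio.h>

    struct extent {
        uint64_t file_block;   /* first logical block covered by this extent */
        uint64_t disk_block;   /* corresponding on-disk LBA */
        uint32_t len;          /* number of blocks in the extent */
    };

    /* Filled once at open() time (control plane); sorted by file_block. */
    static const struct extent extents[] = {
        {  0, 10000,  64 },
        { 64, 20480,  32 },
        { 96, 55000, 128 },
    };
    static const int nr_extents = 3;

    /* Data-path lookup: binary search, no I/O. Returns 0 if unmapped. */
    static uint64_t file_block_to_lba(uint64_t fblk)
    {
        int lo = 0, hi = nr_extents - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            const struct extent *e = &extents[mid];
            if (fblk < e->file_block)
                hi = mid - 1;
            else if (fblk >= e->file_block + e->len)
                lo = mid + 1;
            else
                return e->disk_block + (fblk - e->file_block);
        }
        return 0;
    }

    int main(void)
    {
        printf("file block 70 -> LBA %llu\n",
               (unsigned long long)file_block_to_lba(70));
        return 0;
    }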

Page 41: CDF of the Block I/O Submission Latency

• Block I/O submission latency: the time from allocating a block I/O object (bio or lbio) to dispatching the I/O command to the device.

[Figure: CDFs of the block I/O submission latency for 4KB and 32KB random reads]

Page 42: Read Performance Analysis with Opt. Levels

[Figure: normalized read latency vs. block size (4KB–128KB) as the optimizations are applied cumulatively: Vanilla, + Preload, + MQ-bypass, + LBIO, + Async-page, + Async-DMA]

• Preload: preloading the extent tree
• MQ-bypass: bypassing the SW and HW queues
• LBIO: light-weight block I/O layer
• Async-page: asynchronous page allocation and lazy page cache insertion
• Async-DMA: asynchronous DMA mapping and lazy DMA unmapping

Page 43: Write(+fsync) Performance Analysis with Opt. Levels

[Figure: normalized write(+fsync) latency vs. block size (4KB–128KB) for Vanilla, + LBIO, and + AIOS]

• LBIO: light-weight block I/O layer with synchronous page allocation / DMA mapping
• AIOS: proposed fsync path

Page 44: CPU Usage Breakdown

• RocksDB DBbench readrandom / fillsync using 8 threads

[Figure: normalized CPU usage (Idle, I/O wait, Kernel, User) for Vanilla vs. AIOS on readrandom and fillsync]

• AIOS reduces I/O wait and kernel CPU time and shortens the overall runtime, thanks to effective overlapping between I/O and kernel operations.

Page 45: Comparison with User-level Direct Access Approach (identical to Page 36)

Page 46: Memory Overheads on Asynchronous I/O Stack

Object                                     Linux block layer   Light-weight block layer
Statically allocated objects (per-core)
  request pool                             412KB               -
  lbio array                               -                   192KB
  Free page pool                           -                   256KB
128KB block request
  (l)bio + (l)bio_vec                      648B                704B
  request                                  384B                -
  iod                                      974B                -
  prp_list                                 4096B               4096B
  Total                                    6104B               4800B

Page 47: Proposed Fsync Path (Ext4 Ordered Mode) (identical to Page 28)

Page 48: Lazy DMA Unmapping (identical to Page 16)