High Performance Storage Devices in the Linux Kernel
TRANSCRIPT
HIGH PERFORMANCE STORAGE DEVICES IN KERNEL
E8 Storage
By Evgeny Budilovsky (@budevg)
12 Jan 2016
STORAGE STACK 101
APPLICATION
The application invokes system calls (read, write, mmap) on files:
- Block device files - direct access to a block device
- Regular files - access through a specific file system (ext4, btrfs, etc.)
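The same system calls serve both paths; a minimal Python sketch against a regular temp file (a block device path such as /dev/sda would be opened the same way, but needs root, so it appears only in a comment):

```python
import os
import tempfile

# Regular file: the I/O goes through a file system (ext4, btrfs, ...).
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"hello", 0)      # write() system call
data = os.pread(fd, 5, 0)       # read() system call
os.close(fd)
os.unlink(path)
print(data)  # b'hello'

# Block device file: the same calls address the device directly, e.g.
#   fd = os.open("/dev/sda", os.O_RDWR)   # illustrative path, needs root
```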
FILE SYSTEM
- System calls go through the VFS layer
- Specific file system logic is applied
- Read/write operations are translated into reads/writes of memory pages
- The page cache is involved to reduce disk access
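The page cache behaviour is visible from user space: write() returns once the data is in the cache, and fsync() forces the dirty pages out to the device. A minimal sketch:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)   # data lands in the page cache; write() returns
os.fsync(fd)                # force this file's dirty pages out to the device
size = os.fstat(fd).st_size
os.close(fd)
os.unlink(path)
print(size)  # 4096
```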
BIO LAYER (STRUCT BIO)
- The file system constructs a BIO unit, which is the main unit of IO
- The bio is dispatched to the block layer, or to a bypass driver (make_request_fn driver)
BLOCK LAYER (STRUCT REQUEST)
- Set up a struct request and move it to the request queue (single queue per device)
- The IO scheduler can delay/merge requests
- Requests are dispatched:
  - to hardware drivers (request_fn driver)
  - to the SCSI layer (SCSI devices)
SCSI LAYER (STRUCT SCSI_CMND)
- Translate the request into a SCSI command
- Dispatch the command to low-level SCSI drivers
WRITE (O_DIRECT)

# Application
perf record -g dd if=/dev/zero of=1.bin bs=4k \
    count=1000000 oflag=direct
...
perf report --call-graph --stdio

# Submit into block queue (bypass page cache)
write → system_call_fastpath → sys_write → vfs_write → new_sync_write → ext4_file_write_iter → __generic_file_write_iter → generic_file_direct_write → ext4_direct_IO → ext4_ind_direct_IO → __blockdev_direct_IO → do_blockdev_direct_IO → submit_bio → generic_make_request
# After IO scheduling, submit to the SCSI layer and to the hardware
... → io_schedule → blk_flush_plug_list → queue_unplugged → __blk_run_queue → scsi_request_fn → scsi_dispatch_cmd → ata_scsi_queuecmd
WRITE (WITH PAGE CACHE)

# Application
perf record -g dd if=/dev/zero of=1.bin bs=4k \
    count=1000000
...
perf report --call-graph --stdio

# Write data into the page cache
write → system_call_fastpath → sys_write → vfs_write → new_sync_write → ext4_file_write_iter → __generic_file_write_iter → generic_perform_write → ext4_da_write_begin → grab_cache_page_write_begin → pagecache_get_page → __page_cache_alloc → alloc_pages_current
# Asynchronously flush dirty pages to disk
kthread → worker_thread → process_one_work → bdi_writeback_workfn → wb_writeback → __writeback_inodes_wb → writeback_sb_inodes → __writeback_single_inode → do_writepages → ext4_writepages → mpage_map_and_submit_buffers → mpage_submit_page → ext4_bio_write_page → ext4_io_submit → submit_bio → generic_make_request → blk_queue_bio
HIGH PERFORMANCE BLOCK DEVICES
CHANGES IN THE STORAGE WORLD
- Rotational devices (HDD): "hundreds" of IOPS, "tens" of milliseconds latency
- Today, flash based devices (SSD): "hundreds of thousands" of IOPS, "tens" of microseconds latency
- Large internal data parallelism
- Increase in cores and NUMA architecture
- New standardized storage interfaces (NVMe)
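Little's law (in-flight requests = throughput × latency) makes the parallelism point concrete; the IOPS and latency figures below are illustrative order-of-magnitude values matching the slide:

```python
def in_flight(iops, latency_us):
    # Little's law: requests that must be kept in flight to sustain the rate
    return iops * latency_us / 1_000_000

hdd = in_flight(200, 10_000)   # "hundreds" of IOPS at ~10 ms
ssd = in_flight(500_000, 20)   # "hundreds of thousands" of IOPS at ~20 us
print(hdd, ssd)  # 2.0 10.0
```

A spinning disk is saturated with a couple of outstanding requests, while an NVMe SSD needs many in flight at once, which is exactly the internal parallelism the new interfaces expose.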
NVM EXPRESS
HIGH PERFORMANCE STANDARD
- Standardized interface for PCIe SSDs
- Designed from the ground up to exploit:
  - Low latency of today's PCIe-based SSDs
  - Parallelism of today's CPUs
BEATS AHCI STANDARD FOR SATA HOSTS

|  | AHCI | NVMe |
|---|---|---|
| Maximum Queue Depth | 1 command queue, 32 commands per queue | 64K queues, 64K commands per queue |
| Un-cacheable register accesses (2K cycles each) | 6 per non-queued command, 9 per queued command | 2 per command |
| MSI-X and Interrupt Steering | Single interrupt; no steering | 2K MSI-X interrupts |
| Parallelism & Multiple Threads | Requires synchronization lock to issue command | No locking |
| Efficiency for 4KB Commands | Command parameters require two serialized host DRAM fetches | Command parameters in one 64B fetch |
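To put the register-access row in perspective: at roughly 2K cycles per un-cacheable access, the per-command MMIO overhead works out as below (illustrative arithmetic from the table's own figures, not a benchmark):

```python
CYCLES_PER_ACCESS = 2000  # "2K cycles" per un-cacheable register access

ahci_queued = 9 * CYCLES_PER_ACCESS  # AHCI, queued command
nvme = 2 * CYCLES_PER_ACCESS         # NVMe command
print(ahci_queued, nvme)  # 18000 4000
```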
IOPS
- Consumer grade NVMe SSDs (enterprise grade have much better performance)
- 100% writes are less impressive due to NAND limitations
BANDWIDTH
LATENCY
THE OLD STACK DOES NOT SCALE
THE NULL_BLK EXPERIMENT
- Jens Axboe (Facebook)
- null_blk configuration: queue_mode=1 (rq), completion_nsec=0, irqmode=0 (none)
- fio: each thread does pread(2), 4k, randomly, O_DIRECT
- Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
LIMITED PERFORMANCE
PERF
Spinlock contention
PERF
- 40% from sending requests (blk_queue_bio)
- 20% from completing requests (blk_end_bidi_request)
- 18% from sending the bio from the application to the bio layer (blk_flush_plug_list)
OLD STACK HAS SEVERE SCALING ISSUES
PROBLEMS
- Good scalability before the block layer (file system, page cache, bio)
- The single shared queue is a problem
- We can use a bypass mode driver, which works with bios without entering the shared queue
- Problem with a bypass driver: code duplication
BLOCK MULTI-QUEUE TO THE RESCUE
HISTORY
- Prototyped in 2011
- Paper in SYSTOR 2013
- Merged into Linux 3.13 (2014)
- A replacement for the old block layer, with a different driver API
- Drivers gradually converted to blk-mq (scsi-mq, nvme-core, virtio_blk)
ARCHITECTURE - 2 LAYERS OF QUEUES
- The application works with a per-CPU software queue
- Multiple software queues map onto hardware queues
- The number of HW queues is based on the number of HW contexts supported by the device
- Requests from the HW queue are submitted by the low level driver to the device
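The software-to-hardware queue mapping can be sketched as a round-robin over CPUs; this mirrors the spirit of blk-mq's simple fallback mapping, not the kernel's exact code:

```python
def map_sw_to_hw_queues(nr_cpus, nr_hw_queues):
    # One software queue per CPU, spread round-robin across HW queues.
    # With nr_hw_queues >= nr_cpus the mapping becomes one-to-one.
    return {cpu: cpu % nr_hw_queues for cpu in range(nr_cpus)}

mapping = map_sw_to_hw_queues(nr_cpus=8, nr_hw_queues=2)
print(mapping)  # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1}
```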
ARCHITECTURE - ALLOCATION AND TAGGING
- IO tag
  - An integer value that uniquely identifies an IO submitted to hardware
  - On completion we can use the tag to find out which IO was completed
  - Legacy drivers maintained their own implementation of tagging
- With blk-mq, requests are allocated at initialization time (based on queue depth)
  - Tag and request allocations are combined
  - Avoids per request allocations in the driver and tag maintenance
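The combined tag/request allocation can be sketched as a preallocated pool in which the tag doubles as the array index; the names below are illustrative, not kernel APIs:

```python
class TagSet:
    """Preallocated requests; a free-tag list is the whole allocator."""
    def __init__(self, queue_depth):
        self.requests = [{"tag": t, "bio": None} for t in range(queue_depth)]
        self.free_tags = list(range(queue_depth))

    def start_request(self, bio):
        tag = self.free_tags.pop()        # no per-request allocation here
        self.requests[tag]["bio"] = bio
        return tag                        # tag travels to HW with the command

    def complete(self, tag):
        bio = self.requests[tag]["bio"]   # tag -> request lookup on completion
        self.requests[tag]["bio"] = None
        self.free_tags.append(tag)
        return bio

tags = TagSet(queue_depth=4)
t = tags.start_request("read sector 0")
done = tags.complete(t)
print(t, done)
```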
ARCHITECTURE - I/O COMPLETIONS
- Generally we want completions to be as local as possible
- Use IPIs to complete requests on the submitting node
- The old block layer used software interrupts instead of IPIs
- Best case: an SQ/CQ pair for each core, with an MSI-X interrupt set up for each CQ, steered to the relevant core
- IPIs are used when there aren't enough interrupts/HW queues
THE NULL_BLK EXPERIMENT AGAIN
- null_blk configuration: queue_mode=2 (multiqueue), completion_nsec=0, irqmode=0 (none), submit_queues=32
- fio: each thread does pread(2), 4k, randomly, O_DIRECT
- Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
SUCCESS
WHAT ABOUT HARDWARE WITHOUT MULTI-QUEUE SUPPORT?
- Same null_blk setup, with 1/2/n HW queues in blk-mq
- mq-1 and mq-2 are so close because we have a 2 socket system
- NUMA issues are eliminated once we have a queue per NUMA node
CONVERSION PROGRESS
- mtip32xx (Micron SSD)
- NVMe
- virtio_blk, xen block driver
- rbd (ceph block)
- loop
- ubi
- SCSI (scsi-mq)
SUMMARY
- Storage device performance has accelerated from hundreds of IOPS to hundreds of thousands of IOPS
- Bottlenecks in software are gradually eliminated by exploiting concurrency and introducing lock-less architecture
- blk-mq is one example

Questions?
REFERENCES
1. Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
2. The Performance Impact of NVMe and NVMe over Fabrics
3. Null block device driver
4. blk-mq: new multi-queue block IO queueing mechanism
5. fio
6. perf
7. Solving the Linux storage scalability bottlenecks - by Jens Axboe