Data-Driven Development in OpenZFS
Adam Leventhal, CTO Delphix (@ahl)

DESCRIPTION

OpenZFS data-driven performance, presented at the first OpenZFS developer conference, 11/18/2013. Lots of DTrace examples and output.

TRANSCRIPT

Page 1: OpenZFS data-driven performance

Data-Driven Development in OpenZFS

Adam Leventhal, CTO Delphix (@ahl)

Page 2: OpenZFS data-driven performance

ZFS Was Slow, Is Faster

Adam Leventhal, CTO Delphix (@ahl)

Page 3: OpenZFS data-driven performance

My Version of ZFS History

• 2001-2005 The 1st age of ZFS: building the behemoth
  – Stability, reliability, features

• 2006-2008 The 2nd age of ZFS: appliance model and open source
  – Completing the picture; making it work as advertised; still more features

• 2008-2010 The 3rd age of ZFS: trial by fire
  – Stability in the face of real workloads
  – Performance in the face of real workloads

Page 4: OpenZFS data-driven performance

The 1st Age of OpenZFS

• All the stuff Matt talked about, yes:
  – Many platforms
  – Many companies
  – Many contributors

• Performance analysis on real and varied customer workloads

Page 5: OpenZFS data-driven performance

A note about the data

• The data you are about to see is real
• The names have been changed to protect the innocent (and guilty)
• It was mostly collected with DTrace
• We used some other tools as well: lockstat, mpstat
• You might wish I had more / different data – I do too

Page 6: OpenZFS data-driven performance

Writes Are Slow

Page 7: OpenZFS data-driven performance

NFS Sync Writes

  sync write microseconds
   value  ------------- Distribution ------------- count
       8 |  0
      16 |  149
      32 |@@@@@@@@@@@@@@@@@@@@@  8682
      64 |@@@@@  2226
     128 |@@@@  1743
     256 |@@  658
     512 |  95
    1024 |  20
    2048 |  19
    4096 |  122
    8192 |@@  744
   16384 |@@  865
   32768 |@@  625
   65536 |@  316
  131072 |  113
  262144 |  22
  524288 |  70
 1048576 |  94
 2097152 |  16
 4194304 |  0

Page 8: OpenZFS data-driven performance

IO Writes

  write microseconds
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  338
      64 |  490
     128 |  720
     256 |@@@@  15079
     512 |@@@@@  20342
    1024 |@@@@@@@  27807
    2048 |@@@@@@@@  28897
    4096 |@@@@@@@@  29910
    8192 |@@@@@  20605
   16384 |@  5081
   32768 |  1079
   65536 |  69
  131072 |  5
  262144 |  1
  524288 |  0

Page 9: OpenZFS data-driven performance

NFS Sync Writes: Even Worse

  sync write microseconds
   value  ------------- Distribution ------------- count
       8 |  0
      16 |@  9
      32 |@@@@@@@@@@  84
      64 |@@@@@@@@@@  85
     128 |@@@@  34
     256 |@  9
     512 |  0
    1024 |  1
    2048 |  2
    4096 |@  7
    8192 |@@  19
   16384 |@  7
   32768 |  2
   65536 |  2
  131072 |  0
  262144 |  0
  524288 |  0
 1048576 |@@  14
 2097152 |@@@@@@  51
 4194304 |@  7
 8388608 |  0

Page 10: OpenZFS data-driven performance

First Problem: The Write Throttle

Page 11: OpenZFS data-driven performance

How long is spa_sync() taking?

#!/usr/sbin/dtrace -s

/* Time each spa_sync() of the pool of interest and count space map loads. */
fbt::spa_sync:entry
/stringof(args[0]->spa_name) == "domain0"/
{
        self->ts = timestamp;
        loads = 0;
}

fbt::space_map_load:entry
/stringof(args[4]->os_spa->spa_name) == "domain0"/
{
        loads++;
}

fbt::spa_sync:return
/self->ts/
{
        @["microseconds", loads] = quantize((timestamp - self->ts) / 1000);
        self->ts = 0;
}

Page 12: OpenZFS data-driven performance

How long is spa_sync() taking?

# ./sync.d -c 'sleep 60'
dtrace: script './sync.d' matched 3 probes
dtrace: pid 20420 has exited

  microseconds                                        15
   value  ------------- Distribution ------------- count
  524288 |  0
 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1
 2097152 |  0

  microseconds                                        16
   value  ------------- Distribution ------------- count
  524288 |  0
 1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  20
 2097152 |@@@@@@@@@@  7
 4194304 |  0

Page 13: OpenZFS data-driven performance

Where is spa_sync() giving up the CPU?

#!/usr/sbin/dtrace -s

fbt::spa_sync:entry
{
        self->ts = timestamp;
}

/* Record each time the thread goes off-CPU while inside spa_sync()... */
sched:::off-cpu
/self->ts/
{
        self->off = timestamp;
}

/* ...and aggregate how long it stayed off, keyed by the stack where it resumes. */
sched:::on-cpu
/self->off/
{
        @s[stack()] = quantize((timestamp - self->off) / 1000);
        self->off = 0;
}

fbt::spa_sync:return
/self->ts/
{
        @t["microseconds", probefunc] = quantize((timestamp - self->ts) / 1000);
        self->ts = 0;
}

Page 14: OpenZFS data-driven performance

Where is spa_sync() giving up the CPU?

…
              genunix`cv_wait+0x61
              zfs`zio_wait+0x5d
              zfs`dsl_pool_sync+0xe1
              zfs`spa_sync+0x38d
              zfs`txg_sync_thread+0x247
              unix`thread_start+0x8

   value  ------------- Distribution ------------- count
     256 |  0
     512 |@@@@@@  4
    1024 |@@@@@@@@@@@@  8
    2048 |  0
    4096 |  0
    8192 |  0
   16384 |  0
   32768 |  0
   65536 |  0
  131072 |  0
  262144 |  0
  524288 |@@@@  3
 1048576 |@@@  2
 2097152 |@@@@@@@@@@@@@  9
 4194304 |@  1
 8388608 |  0

Page 15: OpenZFS data-driven performance

ZFS Write Throttle

• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay (sketched below)
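
In rough terms the old throttle behaved like the sketch below (illustrative C, not the actual ZFS code; write_limit, dirty_bytes, delay_ms, and wait_for_txg_sync are made-up stand-ins for the dsl_pool write limit, the open txg's dirty data, and the kernel delay/txg-wait primitives):

#include <stdint.h>

static uint64_t write_limit;    /* recomputed each spa_sync(): how much we think we can write per txg */
static uint64_t dirty_bytes;    /* dirty data accepted into the open txg; reset when it syncs out */

/* Hypothetical stand-ins for the kernel delay and txg-wait primitives. */
static void delay_ms(int ms) { (void) ms; }
static void wait_for_txg_sync(void) { }

/* Called as each write transaction is assigned to the open txg. */
static void
throttle_incoming_write(uint64_t nbytes)
{
        /* Past 7/8ths of the limit, every new write eats a fixed 10ms delay. */
        if (dirty_bytes > 7 * write_limit / 8)
                delay_ms(10);

        /* At the limit, stop accepting dirty data until the txg syncs out. */
        while (dirty_bytes + nbytes > write_limit)
                wait_for_txg_sync();

        dirty_bytes += nbytes;
}

The fixed 10ms penalty at 7/8ths is the cliff that shows up in the "delaying for 10ms" histogram two slides later.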

Page 16: OpenZFS data-driven performance

ZFS Write Throttle

• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay

WTF!?

Page 17: OpenZFS data-driven performance

7/8ths full delaying for 10ms

  async write microseconds
   value  ------------- Distribution ------------- count
      16 |  0
      32 |@@@@@@@@@@@@@  1549
      64 |@@@@@@@@@@@  1306
     128 |@@@@@@@@@  1049
     256 |@@  192
     512 |  34
    1024 |  23
    2048 |  47
    4096 |@  63
    8192 |@  153
   16384 |@  83
   32768 |  11
   65536 |  5
  131072 |  4
  262144 |  3
  524288 |@  102
 1048576 |@  106
 2097152 |@  69
 4194304 |  0

Page 18: OpenZFS data-driven performance

Observing the write throttle limit (second-by-second)

# dtrace -n 'BEGIN{ start = timestamp; }
    fbt::dsl_pool_sync:entry
    /stringof(args[0]->dp_spa->spa_name) == "domain0"/
    {
        @[(timestamp - start) / 1000000000] =
            min(args[0]->dp_write_limit / 1000000);
    }' -xaggsortkey -c 'sleep 600'
dtrace: description 'BEGIN' matched 2 probes
…
  9  470
 10  470
 11  487
 14  487
 15  515
 16  515
 17  557
 18  581
 19  581
 20  617
 21  617
 22  635
 23  663
 24  663
 25  673

Saw anywhere from 100 – 800 MB!

Page 19: OpenZFS data-driven performance

Second Problem: IO Queuing

Page 20: OpenZFS data-driven performance

Check out IO queue times

  microseconds   write sync
   value  ------------- Distribution ------------- count
       0 |  0
       1 |  2
       2 |@@@@@@@  51
       4 |@@@@@@  43
       8 |@  5
      16 |  3
      32 |@  6
      64 |@  10
     128 |@@  13
     256 |@@  18
     512 |@@@@@  38
    1024 |@@@@@@  44
    2048 |@@@@@  37
    4096 |@@@  24
    8192 |@  9
   16384 |  0

Page 21: OpenZFS data-driven performance

IO times with queue depth 10 (default)

  write microseconds
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  70
      64 |  170
     128 |  130
     256 |@@  1143
     512 |@@@  1762
    1024 |@@@@  2417
    2048 |@@@@@@@  4135
    4096 |@@@@@@@@  4816
    8192 |@@@@@@@  4132
   16384 |@@@@  2370
   32768 |@@@  1456
   65536 |  148
  131072 |  8
  262144 |  0

Page 22: OpenZFS data-driven performance

IO times with queue depth 20

  write microseconds
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  43
      64 |  137
     128 |@  243
     256 |@@@@@  2233
     512 |@@@@@  2238
    1024 |@@@@  1968
    2048 |@@@@@  2395
    4096 |@@@@@@  2660
    8192 |@@@@@@  2829
   16384 |@@@@@  2499
   32768 |@@@  1466
   65536 |@  296
  131072 |  0

Page 23: OpenZFS data-driven performance

IO times with queue depth 30

  write microseconds
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  82
      64 |  137
     128 |  230
     256 |@@@@  2195
     512 |@@@@  2589
    1024 |@@@@  2416
    2048 |@@@@@  2844
    4096 |@@@@@@  3330
    8192 |@@@@@@  3794
   16384 |@@@@@@  3306
   32768 |@@@  2008
   65536 |@  443
  131072 |  1
  262144 |  0

Page 24: OpenZFS data-driven performance

IO times with queue depth 64

  microseconds   write
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  345
      64 |@  697
     128 |  169
     256 |  60
     512 |  380
    1024 |@  1084
    2048 |@  1562
    4096 |@  1819
    8192 |@@@@  4974
   16384 |@@@@@@@@@  10683
   32768 |@@@@@@@@@@@@@  15637
   65536 |@@@@@@@@@  10608
  131072 |@  1050
  262144 |  0

        avg latency    iops    throughput
write      44557us    817/s     30300k/s

Page 25: OpenZFS data-driven performance

IO times with queue depth 128

  microseconds   write
   value  ------------- Distribution ------------- count
      16 |  0
      32 |  330
      64 |@  665
     128 |  228
     256 |  203
     512 |@  552
    1024 |@  1135
    2048 |@  1458
    4096 |@  1434
    8192 |@@  2049
   16384 |@@@@  4070
   32768 |@@@@@@@  7936
   65536 |@@@@@@@@@@@  11269
  131072 |@@@@@@@@@  9737
  262144 |@  1282
  524288 |  0

        avg latency    iops    throughput
write      88774us    705/s     38303k/s

Page 26: OpenZFS data-driven performance

IO Problems

• The choice of IO queue depth was crucial
  – Where did the default of 10 come from?!
  – Balance between latency and throughput

• Shared IO queue for reads and writes
  – Maybe this makes sense for disks… maybe…

• The wrong queue depth caused massive queuing within ZFS
  – “What do you mean my SAN is slow? It looks great to me!”

Page 27: OpenZFS data-driven performance

New IO Scheduler

• Choose a limit on the “dirty” (modified) data on the system
• As more accumulates, schedule more concurrent IOs
• Limits per IO type
• If we still can’t keep up, start to limit the rate of incoming data

• Chose defaults as close to the old behavior as possible
• Much more straightforward to measure and tune (see the sketch below)
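
A minimal sketch of that shape of scheduler (illustrative C; dirty_max, the min/max active tables, and the 60% delay threshold are assumptions for the example, not the actual OpenZFS tunables):

#include <stdint.h>

enum io_type { IO_SYNC_READ, IO_SYNC_WRITE, IO_ASYNC_READ, IO_ASYNC_WRITE, IO_NTYPES };

static uint64_t dirty_bytes;                    /* modified data not yet written out */
static uint64_t dirty_max = 4ULL << 30;         /* chosen limit on dirty data */

/* Per-IO-type bounds on how many IOs may be issued to a device at once. */
static const int min_active[IO_NTYPES] = { 10, 10, 1,  1 };
static const int max_active[IO_NTYPES] = { 10, 10, 3, 10 };

/*
 * Scale concurrent async writes with accumulated dirty data: a nearly idle
 * pool issues few IOs (low latency), a busy one issues many (throughput).
 */
static int
async_write_active(void)
{
        int span = max_active[IO_ASYNC_WRITE] - min_active[IO_ASYNC_WRITE];

        return (min_active[IO_ASYNC_WRITE] + (int)(dirty_bytes * span / dirty_max));
}

/*
 * If the pool still can't keep up, push back on incoming writes with a
 * delay that grows smoothly as dirty data approaches the limit, instead of
 * the old 10ms cliff at 7/8ths.  Assumes dirty_after_write <= dirty_max.
 */
static uint64_t
write_delay_ns(uint64_t dirty_after_write)
{
        uint64_t threshold = dirty_max * 60 / 100;      /* start delaying at 60% dirty */

        if (dirty_after_write <= threshold)
                return (0);
        return ((dirty_after_write - threshold) * 1000000 /
            (dirty_max - dirty_after_write + 1));
}

Both knobs are explicit functions of a single observable quantity, dirty data, which is what makes the behavior straightforward to measure and tune.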

Page 28: OpenZFS data-driven performance

Third Problem: Lock Contention

Page 29: OpenZFS data-driven performance

Looking at lockstat(1M) (1/3)

Count indv cuml rcnt     nsec Lock               Caller
167980  9%   9% 0.00    61747 0xffffff0d4aaa4818 taskq_thread+0x2a8

      nsec ------ Time Distribution ------ count    Stack
       512 |  3233      thread_start+0x8
      1024 |@  10651
      2048 |@@@@  26537
      4096 |@@@@@@@@@@  56854
      8192 |@@@@@  29262
     16384 |@  10577
     32768 |@  5703
     65536 |  5053
    131072 |  3555
    262144 |  5272
    524288 |  5400
   1048576 |  4186
   2097152 |  1487
   4194304 |  163
   8388608 |  17
  16777216 |  21
  33554432 |  7
  67108864 |  2

Page 30: OpenZFS data-driven performance

Looking at lockstat(1M) (2/3)

Count indv cuml rcnt     nsec Lock               Caller
166416  8%  17% 0.00    88424 0xffffff0d4aaa4818 cv_wait+0x69

      nsec ------ Time Distribution ------ count    Stack
       512 |@  7775      taskq_thread_wait+0x84
      1024 |@@  14577    taskq_thread+0x308
      2048 |@@@@@  31499    thread_start+0x8
      4096 |@@@@@@  36522
      8192 |@@@  19818
     16384 |@  11065
     32768 |@  7302
     65536 |@  7932
    131072 |  5537
    262144 |@  7992
    524288 |@  8003
   1048576 |@  6017
   2097152 |  2086
   4194304 |  198
   8388608 |  48
  16777216 |  37
  33554432 |  7
  67108864 |  1

Page 31: OpenZFS data-driven performance

Looking at lockstat(1M) (3/3)

Count indv cuml rcnt     nsec Lock               Caller
136877  7%  24% 0.00    19897 0xffffff0d4aaa4818 taskq_dispatch_ent+0x4a

      nsec ------ Time Distribution ------ count    Stack
       512 |  1798      zio_taskq_dispatch+0xb5
      1024 |  1575      zio_issue_async+0x19
      2048 |@  5593     zio_execute+0x8d
      4096 |@@@@@@@@@@@@@  61337
      8192 |@@@@  19408
     16384 |@@@  15724
     32768 |@@@  13923
     65536 |@@  9733
    131072 |  3564
    262144 |  3171
    524288 |  947
   1048576 |  84
   2097152 |  1
   4194304 |  0
   8388608 |  15
  16777216 |  1
  33554432 |  2
  67108864 |  1

Page 32: OpenZFS data-driven performance

Name that lock!

> 0xffffff0d4aaa4818::whatis
ffffff0d4aaa4818 is ffffff0d4aaa47fc+20, allocated from taskq_cache
> 0xffffff0d4aaa4818-20::taskq
ADDR             NAME                ACT/THDS   Q'ED   MAXQ  INST
ffffff0d4aaa47fc zio_write_issue        0/ 24      0  26977     -

Page 33: OpenZFS data-driven performance

Lock Breakup

• Broke up the taskq lock for write_issue
• Added multiple taskqs, randomly assigned (see the sketch below)
• Recently hit a similar problem for read_interrupt
• Same solution

• Worth investigating taskq stats
• A dynamic taskq might be an interesting experiment

• Other lock contention issues resolved
• Still more need additional attention
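
A minimal sketch of the shape of that fix (illustrative C using pthreads; tq_t, NUM_WRITE_ISSUE_TQS, and dispatch_write_issue are stand-ins for the example, not the actual OpenZFS taskq code):

#include <pthread.h>
#include <stdlib.h>

#define NUM_WRITE_ISSUE_TQS 8

typedef struct {
        pthread_mutex_t tq_lock;        /* the lock that was previously contended */
        /* ... queue of pending work, worker threads, etc. ... */
} tq_t;

static tq_t write_issue_tqs[NUM_WRITE_ISSUE_TQS];

/*
 * With a single taskq, every zio_write_issue dispatch and worker wakeup
 * fought over one tq_lock (the lock identified above).  Spreading
 * dispatches across several taskqs divides that contention; choosing the
 * queue at random avoids a shared round-robin counter, which would just
 * become a new contended cache line.
 */
static tq_t *
pick_write_issue_tq(void)
{
        return (&write_issue_tqs[rand() % NUM_WRITE_ISSUE_TQS]);
}

static void
dispatch_write_issue(void (*func)(void *), void *arg)
{
        tq_t *tq = pick_write_issue_tq();

        pthread_mutex_lock(&tq->tq_lock);
        /* enqueue (func, arg) for this taskq's workers */
        pthread_mutex_unlock(&tq->tq_lock);
        (void) func;
        (void) arg;
}

In the kernel the "random" pick would use something cheaper than rand(), such as low-order timestamp bits, but the effect is the same: dispatchers stop serializing on a single lock.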

Page 34: OpenZFS data-driven performance

Last Problem: Spacemap Shenanigans

Page 35: OpenZFS data-driven performance

Where does spa_sync() spend its time?

…
dsl_pool_sync_done            16us   ( 0%)
spa_config_exit               19us   ( 0%)
zio_root                      20us   ( 0%)
spa_config_enter              23us   ( 0%)
spa_errlog_sync               45us   ( 0%)
spa_update_dspace             49us   ( 0%)
zio_wait                      53us   ( 0%)
dmu_objset_is_dirty           66us   ( 0%)
spa_sync_config_object        75us   ( 0%)
spa_sync_aux_dev              79us   ( 0%)
list_is_empty                 86us   ( 0%)
dsl_scan_sync                124us   ( 0%)
ddt_sync                     201us   ( 0%)
txg_list_remove              519us   ( 0%)
vdev_config_sync            1830us   ( 0%)
bpobj_iterate               9939us   ( 0%)
vdev_sync                  27907us   ( 1%)
bplist_iterate             35301us   ( 1%)
vdev_sync_done            346336us   (16%)
dsl_pool_sync            1652050us   (79%)
spa_sync                 2077646us   (100%)

Page 36: OpenZFS data-driven performance

Where does spa_sync() spend its time?

…
dsl_pool_sync_done            16us   ( 0%)
spa_config_exit               19us   ( 0%)
zio_root                      20us   ( 0%)
spa_config_enter              23us   ( 0%)
spa_errlog_sync               45us   ( 0%)
spa_update_dspace             49us   ( 0%)
zio_wait                      53us   ( 0%)
dmu_objset_is_dirty           66us   ( 0%)
spa_sync_config_object        75us   ( 0%)
spa_sync_aux_dev              79us   ( 0%)
list_is_empty                 86us   ( 0%)
dsl_scan_sync                124us   ( 0%)
ddt_sync                     201us   ( 0%)
txg_list_remove              519us   ( 0%)
vdev_config_sync            1830us   ( 0%)
bpobj_iterate               9939us   ( 0%)
vdev_sync                  27907us   ( 1%)
bplist_iterate             35301us   ( 1%)
vdev_sync_done            346336us   (16%)
dsl_pool_sync            1652050us   (79%)
spa_sync                 2077646us   (100%)

This is expected; it means we’re writing

Page 37: OpenZFS data-driven performance

Where does spa_sync() spend its time?

…
dsl_pool_sync_done            16us   ( 0%)
spa_config_exit               19us   ( 0%)
zio_root                      20us   ( 0%)
spa_config_enter              23us   ( 0%)
spa_errlog_sync               45us   ( 0%)
spa_update_dspace             49us   ( 0%)
zio_wait                      53us   ( 0%)
dmu_objset_is_dirty           66us   ( 0%)
spa_sync_config_object        75us   ( 0%)
spa_sync_aux_dev              79us   ( 0%)
list_is_empty                 86us   ( 0%)
dsl_scan_sync                124us   ( 0%)
ddt_sync                     201us   ( 0%)
txg_list_remove              519us   ( 0%)
vdev_config_sync            1830us   ( 0%)
bpobj_iterate               9939us   ( 0%)
vdev_sync                  27907us   ( 1%)
bplist_iterate             35301us   ( 1%)
vdev_sync_done            346336us   (16%)
dsl_pool_sync            1652050us   (79%)
spa_sync                 2077646us   (100%)

What’s this?

Page 38: OpenZFS data-driven performance

What’s vdev_sync_done() doing?

txg_list_empty             0us   ( 0%)
txg_list_remove           15us   ( 0%)
metaslab_sync_done      8681us   (90%)
vdev_sync_done          9563us   (100%)

Page 39: OpenZFS data-driven performance

How about metaslab_sync_done()?

vdev_dirty                 3266us
vdev_space_update          5333us
space_map_load_wait        5758us
space_map_vacate          30455us
metaslab_weight           54507us
metaslab_group_sort       68445us
space_map_unload        1519906us
metaslab_sync_done      1630626us

Page 40: OpenZFS data-driven performance

What about all space_map_*() functions?

space_map_truncate            33 times       6ms   ( 0%)
space_map_load_wait         1721 times       7ms   ( 0%)
space_map_sync              3766 times     210ms   ( 0%)
space_map_unload             135 times    1268ms   ( 0%)
space_map_free             21694 times    4280ms   ( 1%)
space_map_vacate            3643 times   45891ms   (12%)
space_map_seg_compare   13124822 times   55423ms   (14%)
space_map_add             580809 times   79868ms   (21%)
space_map_remove          514181 times   81682ms   (21%)
space_map_walk              2081 times  120962ms   (32%)
spa_sync                       1 times  374818ms   (100%)

Page 41: OpenZFS data-driven performance

How about the CPU performance counters?

# dtrace -n 'cpc:::PAPI_tlb_dm-all-10000{ @[stack()] = count(); }' \
    -n END'{ trunc(@, 20); printa(@); }' -c 'sleep 100'
…
              zfs`metaslab_segsize_compare+0x1f
              genunix`avl_find+0x52
              genunix`avl_add+0x2d
              zfs`space_map_remove+0x170
              zfs`space_map_alloc+0x47
              zfs`metaslab_group_alloc+0x310
              zfs`metaslab_alloc_dva+0x2c1
              zfs`metaslab_alloc+0x9c
              zfs`zio_dva_allocate+0x8a
              zfs`zio_execute+0x8d
              genunix`taskq_thread+0x285
              unix`thread_start+0x8
             1550

              zfs`lzjb_decompress+0x89
              zfs`zio_decompress_data+0x53
              zfs`zio_decompress+0x56
              zfs`zio_pop_transforms+0x3d
              zfs`zio_done+0x26b
              zfs`zio_execute+0x8d
              zfs`zio_notify_parent+0xa6
              zfs`zio_done+0x4ea
              zfs`zio_execute+0x8d
              zfs`zio_notify_parent+0xa6
              zfs`zio_done+0x4ea
              zfs`zio_execute+0x8d
              genunix`taskq_thread+0x285
              unix`thread_start+0x8
             1712

Page 42: OpenZFS data-driven performance

Spacemaps and Metaslabs

• Two things going on here:
  – 30,000+ segments per spacemap (see the sketch below)
  – Building the perfect spacemap – close enough would work
  – Doing a bunch of work that we can clever our way out of

• Still much to be done:
  – Why 200 metaslabs per LUN?
  – Allocations can still be very painful
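
For context, a rough sketch of the data structure behind those numbers (illustrative C; seg_t and the comparator names are stand-ins, not the actual OpenZFS code). An in-core spacemap keeps its free segments in balanced trees, ordered by offset and, for allocation, by size, so every allocation or free is a comparator-driven lookup plus an insert/remove. With 30,000+ segments per spacemap, that is roughly the space_map_seg_compare, space_map_add, and space_map_remove time in the profiles above.

#include <stdint.h>

typedef struct seg {
        uint64_t s_start;       /* first byte of the free segment */
        uint64_t s_end;         /* last byte + 1 */
} seg_t;

/* Order segments by offset: used when ranges are added or removed. */
static int
seg_offset_compare(const void *a, const void *b)
{
        const seg_t *sa = a, *sb = b;

        if (sa->s_start < sb->s_start)
                return (-1);
        return (sa->s_start > sb->s_start);
}

/* Order segments by size: used when hunting for the best-fitting segment. */
static int
seg_size_compare(const void *a, const void *b)
{
        const seg_t *sa = a, *sb = b;
        uint64_t la = sa->s_end - sa->s_start;
        uint64_t lb = sb->s_end - sb->s_start;

        if (la != lb)
                return (la < lb ? -1 : 1);
        return (seg_offset_compare(a, b));      /* break ties by offset */
}

Every allocation touches both orderings, which is why "close enough" fits and coarser bookkeeping are the obvious ways to shed this work.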

Page 43: OpenZFS data-driven performance

The Next Age of OpenZFS

• General purpose and purpose-built OpenZFS products
• Put to varied and demanding uses
• Data-driven discoveries
  – Write throttle needed rethinking
  – Metaslabs / spacemaps / allocation is fertile ground
  – Performance nose-dives around 85% of pool capacity
  – Lock contention impacts high-performance workloads

• What’s next?
  – More workloads; more data!
  – Feedback on recent enhancements
  – Connect allocation / scrub to the new IO scheduler
  – Consider data-driven, adaptive algorithms within OpenZFS