Ceph Day Beijing: CeTune – A Framework to Profile and Tune Ceph Performance
TRANSCRIPT
CeTune – *Ceph Profiling and Tuning Framework
Chendi XUE, [email protected]
Software Engineer
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
2
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
3
Background
• What is the problem?
• End users face numerous challenges in driving the best performance
• Increasing requests from end users on:
• How to troubleshoot the *Ceph cluster?
• How to identify the best tuning knobs among many (500+) parameters?
• How to handle unexpected performance regressions between frequent releases?
• Solution?
• A toolkit/framework to:
• Easily deploy and benchmark a ceph cluster
• Provide an easy way to analyze performance results, system metrics, perfcounters and latency layout
• Shorten users’ landing time for *Ceph-based storage solutions
4
CeTune
5
What it is …
A toolkit to deploy, benchmark, profile and tune Ceph cluster performance
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
6
CeTune internal
CeTune Controller:
Provides a console interface for the user to monitor working progress and view performance result data.
Controls all other CeTune nodes to deploy, benchmark, and monitor system and ceph status.
CeTune Node:
Self-checks whether its work has completed successfully and responds to the controller.
7
CeTune configuration
Conf/all.conf
all.conf configures most of the information for CeTune, including deployment settings, benchmark settings, etc.

# deploy ceph
field | value | description
deploy_ceph_version | hammer | Ceph version
deploy_mon_servers | aceph01 | node to deploy mon on
deploy_osd_servers | aceph01,aceph02,aceph03,aceph04 | nodes to deploy osds on
deploy_rbd_nodes | client01,client02 | nodes to deploy rbd and rados on
aceph01 | /dev/sda1:/dev/sdb1,/dev/sdd1:/dev/sdb2, … | set osd and journal devices here
aceph02 | /dev/sda1:/dev/sdb1,/dev/sdd1:/dev/sdb2, … |
aceph03 | /dev/sda1:/dev/sdb1,/dev/sdd1:/dev/sdb2, … |
aceph04 | /dev/sda1:/dev/sdb1,/dev/sdd1:/dev/sdb2, … |
osd_partition_count | 1 | the script deploy/prepare-scripts/list_partitions.sh will do the partitioning for you
osd_partition_size | 2000G |
journal_partition_count | 5 |
journal_partition_size | 60G |
public_network | 10.10.5.0/24 | tell the ceph cluster to use the 10Gb NIC
cluster_network | 10.10.5.0/24 |

# benchmark
field | value | description
head | client01 | configuration for benchmark
tmp_dir | /opt/ |
user | root |
list_vclient | vclient01,vclient02,vclient03,vclient04, … |
list_client | client01,client02 |
list_ceph | aceph01,aceph02,aceph03,aceph04 |
list_mon | aceph01 |
volume_size | 40960 | rbd volumes will be created with this size
rbd_volume_count | 80 | total number of rbd volumes
run_vm_num | 80, 70, 60, 50, … | set this to run a vm/rbd loadline
run_file | /dev/vdb | when using vms, the rbd is mounted as /dev/vdb in the vm
run_size | 40g | fio run size; must be smaller than volume_size
run_io_pattern | seqwrite,seqread,randwrite,randread |
run_record_size | 64k,4k | fio block size
run_queue_depth | 64,8 |
run_warmup_time | 100 |
run_time | 300 |
dest_dir | /mnt/data/ | destination directory
dest_dir_remote_bak | 192.168.3.101:/data4/Chendi/ArborValley/v0.91/raw/ | remote backup destination directory
rbd_num_per_client | 40,40 | test 40 rbds on client01 and 40 rbds on client02 respectively
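Pulled together, a minimal all.conf for the cluster above might look like the sketch below. The values are taken from the fields in this slide, but the exact key=value syntax and separators are an assumption and should be checked against the CeTune repository:

```ini
# deploy ceph
deploy_ceph_version = hammer
deploy_mon_servers = aceph01
deploy_osd_servers = aceph01,aceph02,aceph03,aceph04
deploy_rbd_nodes = client01,client02
; osd_device:journal_device pairs per node
aceph01 = /dev/sda1:/dev/sdb1
public_network = 10.10.5.0/24
cluster_network = 10.10.5.0/24

# benchmark
head = client01
user = root
volume_size = 40960
rbd_volume_count = 80
run_io_pattern = seqwrite,seqread,randwrite,randread
run_record_size = 64k,4k
run_queue_depth = 64,8
run_warmup_time = 100
run_time = 300
dest_dir = /mnt/data/
```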
8
CeTune configuration
Conf/tuner.yaml
tuner.yaml configures a job worksheet. Each testjob can have different tunings applied.
testjob1:
workstages: ["deploy", "benchmark"]
benchmark_engine: "qemurbd"
version: 'hammer'
pool:
rbd:
size: 2
pg_num: 8192
disk:
read_ahead_kb: 2048
global:
debug_lockdep: 0/0
debug_context: 0/0
… …
mon_pg_warn_max_per_osd: 1000
ms_nocrc: true
throttler_perf_counter: false
osd:
osd_enable_op_tracker: false
osd_op_num_shards: 10
filestore_wbthrottle_enable: false
filestore_max_sync_interval: 10
filestore_max_inline_xattr_size: 254
filestore_max_inline_xattrs: 6
filestore_queue_committing_max_bytes: 1048576000
filestore_queue_committing_max_ops: 5000
filestore_queue_max_bytes: 1048576000
filestore_queue_max_ops: 500
journal_max_write_bytes: 1048576000
journal_max_write_entries: 1000
journal_queue_max_bytes: 1048576000
journal_queue_max_ops: 3000
testjob2:
… …
testjob3:
… …
9
Kickoff CeTune
root@client01:/root# cd /root/cetune/tuner
root@client01:/root/cetune/tuner# python tuner.py
[LOG]Check ceph version, reinstall ceph if necessary
[LOG]start to redeploy ceph
[LOG]ceph.conf file generated
[LOG]Shutting down mon daemon
[LOG]Shutting down osd daemon
[LOG]Clean mon dir
[LOG]Started to mkfs.xfs on osd devices
[LOG]mkfs.xfs for /dev/sda1 on aceph01
… …
[LOG]mkfs.xfs for /dev/sdf1 on aceph04
[LOG]Build osd.0 daemon on aceph01
… …
[LOG]Build osd.39 daemon on aceph01
[LOG]delete ceph pool rbd
[LOG]delete ceph pool data
[LOG]delete ceph pool metadata
[LOG]create ceph pool rbd, pg_num is 8192
[LOG]set ceph pool rbd size to 2
[WARNING]Applied tuning, waiting ceph to be healthy
[WARNING]Applied tuning, waiting ceph to be healthy
… …
[LOG]Tuning has been applied to ceph cluster, ceph is healthy now
RUNID: 36, Result dir: //mnt/data/36-80-seqwrite-4k-100-300-vdb
[LOG]Prerun_check: check if rbd volumes are initialized
[WARNING]Ceph cluster used data: 0.00KB, planed data: 3276800MB
[WARNING]rbd volume initialization not done
[LOG]80 RBD Images created
[LOG]create rbd volume vm attaching xml
[LOG]Distribute vdbs xml
[LOG]Attach rbd image to vclient1
… …
[LOG]Start to initialize rbd volumes
[LOG]FIO Jobs started on ['vclient01', 'vclient02', … 'vclient80']
[WARN]160 fio job still running
… …
[LOG]RBD initialization complete
[LOG]Prerun_check: check if fio installed in vclient
[LOG]Prerun_check: check if rbd volume attached
[LOG]Prerun_check: check if sysstat installed
[LOG]Prepare_run: distribute fio.conf to vclient
[LOG]Benchmark start
[LOG]FIO Jobs started on ['vclient01', 'vclient02', … 'vclient80']
[WARN]160 fio job still running
… …
[LOG]stop monitoring, and workload
[LOG]collecting data
[LOG]processing data
[LOG]creating html report
[LOG]scp to result backup server
10
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
11
Deploy
Configure:
1. all.conf
2. tuner.yaml
Preparation:
1. Connect to an apt/yum source
2. Auto ssh to each node
3. Disk partitioning
One click to start CeTune
Compare the current ceph version vs. the desired version, reinstall if necessary
Deploy to all nodes:
1. rbd
2. osd, mon, mds
3. Object workload generator
Apply the tuner.yaml tuning knobs to the ceph cluster
Wait for the ceph cluster to be healthy
During CeTune deployment phase:
root@client01:/root/cetune/tuner# python tuner.py
[LOG]Check ceph version, reinstall ceph if necessary
[LOG]start to redeploy ceph
[LOG]ceph.conf file generated
[LOG]Shutting down mon daemon
[LOG]Shutting down osd daemon
[LOG]Clean mon dir
[LOG]Started to mkfs.xfs on osd devices
[LOG]mkfs.xfs for /dev/sda1 on aceph01
… …
[LOG]mkfs.xfs for /dev/sdf1 on aceph04
[LOG]Build osd.0 daemon on aceph01
… …
[LOG]Build osd.39 daemon on aceph01
[LOG]delete ceph pool rbd
[LOG]delete ceph pool data
[LOG]delete ceph pool metadata
[LOG]create ceph pool rbd, pg_num is 8192
[LOG]set ceph pool rbd size to 2
[WARNING]Applied tuning, waiting ceph to be healthy
[WARNING]Applied tuning, waiting ceph to be healthy
… …
[LOG]Tuning has been applied to ceph cluster, ceph is healthy now
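The "wait for the ceph cluster to be healthy" step in the log above boils down to polling `ceph health` until it reports HEALTH_OK. A minimal sketch of that loop in Python (the function names and the retry interval are illustrative assumptions, not CeTune's actual code; `ceph health` itself is the standard Ceph CLI command):

```python
import subprocess
import time

def cluster_is_healthy(health_output: str) -> bool:
    """Return True when `ceph health` output reports HEALTH_OK."""
    return health_output.strip().startswith("HEALTH_OK")

def wait_for_healthy(timeout_sec: int = 600, interval_sec: int = 5) -> bool:
    """Poll `ceph health` until the cluster settles or we time out."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        out = subprocess.run(["ceph", "health"],
                             capture_output=True, text=True).stdout
        if cluster_is_healthy(out):
            return True
        # mirrors the repeated [WARNING] lines seen in the log above
        print("[WARNING]Applied tuning, waiting ceph to be healthy")
        time.sleep(interval_sec)
    return False
```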
12
Benchmark
Configure in tuner.yaml:
1. workload engine
2. tuning knobs
3. io pattern config
Preparation:
1. Prepare virtual machines
2. Install workload generators
• fio, cosbench, cephfs engines
One click to start CeTune
Compare the current ceph tuning vs. the desired tuning, re-apply if necessary
Prepare to benchmark:
1. Check workload and rbd volumes, create rbds if necessary
2. Initialize rbds if needed
During the benchmark phase:
1. Monitor system metrics data
2. Fetch perfcounter data
3. Fetch lttng data
4. Block to wait for the workload process to complete
Wait for the ceph cluster to be healthy
During CeTune benchmark phase:
* Cosbench is an open source benchmarking tool developed by Intel to measure Cloud Object Storage Service performance, which can act as an object workload under the CBT framework.
13
Analyzer
Processes sar data, iostat data, perfcounter data, lttng data (wip), blktrace (wip), and valgrind (wip)
Data archived to one folder
Process system metrics and perfcounter data:
1. node by node
2. result as one big json:
node → field name (iostat, perfcounter) → key (w/s, r_op_latency, …) → second count, data
Process lttng data (wip):
1. lttng data is traced following Google Dapper semantics, with one unified trace_id identifying the tracepoints of the same io
2. lttng data is sent to a zipkin-collector, where it can be viewed via the zipkin web UI
Process blktrace and valgrind data (wip)
Send to the visualizer module
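The node → field → key → data nesting described above can be sketched as a plain dict merge. The names below (`merge_metrics`, the sample lists) are illustrative, not CeTune's actual API:

```python
def merge_metrics(result: dict, node: str, field: str,
                  key: str, samples: list) -> dict:
    """Fold one node's per-second samples into the big result json:
    result[node][field][key] -> {"second_count": N, "data": [...]}."""
    entry = result.setdefault(node, {}).setdefault(field, {})
    entry[key] = {"second_count": len(samples), "data": samples}
    return result

# Example: per-node iostat and perfcounter series merged into one json
result = {}
merge_metrics(result, "aceph01", "iostat", "w/s", [120, 131, 127])
merge_metrics(result, "aceph01", "perfcounter", "r_op_latency", [15.9, 16.2])
```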
During CeTune analyze phase:
14
Tuner
Tuner extracts the ceph cluster configuration from all.conf and automatically generates a tuner.conf file with some tuning references.
The main usage of tuner:
• Users can test a ceph cluster across multiple versions with various tuning knobs.
• Users can define a batch of testjobs; each testjob can have multiple workstages, e.g. reinstall, then re-build, then start the benchmark test.
• So using tuner, CeTune can run ceph performance tests fully automatically.
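The testjob/workstage mechanics can be sketched as a loop over the parsed tuner.yaml (represented here as a plain dict; the function and handler names are illustrative, not CeTune's actual internals):

```python
def run_worksheet(worksheet: dict, handlers: dict) -> list:
    """Run each testjob's workstages in order; return the executed plan.
    `worksheet` mirrors tuner.yaml: {testjob: {"workstages": [...], ...}}."""
    plan = []
    for job_name, job in worksheet.items():
        for stage in job.get("workstages", []):
            handlers[stage](job)  # e.g. deploy(job), then benchmark(job)
            plan.append((job_name, stage))
    return plan

# Example: one testjob with the two stages from the tuner.yaml slide
worksheet = {"testjob1": {"workstages": ["deploy", "benchmark"],
                          "version": "hammer"}}
executed = run_worksheet(worksheet, {"deploy": lambda job: None,
                                     "benchmark": lambda job: None})
```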
[Chart: unTuned vs. Tuned performance for seqwrite/seqread/randwrite/randread across Firefly, Giant, and Hammer]
15
Visualizer
CeTune provides an html page to show the result data.
System Metrics View
Latency Layout View
16
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
17
Firefly Randread Case
runid | op_size | op_type | QD | engine | serverNum | clientNum | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency
Before tune | 4k | randread | qd8 | vdb | 4 | 2 | 40 | 401 sec | 3389.000 | 13.313 MB/s | 93.991 msec | 3729.249 | 16.798 MB/s | 15.996 msec
Before tune | 4k | randread | qd8 | vdb | 4 | 2 | 80 | 301 sec | 3693.000 | 14.577 MB/s | 172.485 msec | 3761.441 | 14.986 MB/s | 16.452 msec
Long frontend latency, but short backend latency.
Randread on 40 vms, each vm capped at 100 iops, yields only 3389 iops total??
Why is fio_latency 94ms while osd disk latency is only 16ms?
From the CeTune processed latency graph we get some hints: one osd's op_latency is as high as 1 sec, but its process_latency is only 25 msec.
This means ops are waiting in the osd queue to be processed. Should we add more osd_op_threads?
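The reasoning above (op_latency far above process_latency means time spent queued, not processing) is simple arithmetic on the two perfcounters. A small sketch, with the 2x ratio threshold being an illustrative assumption:

```python
def queue_wait_ms(op_latency_ms: float, process_latency_ms: float) -> float:
    """osd queue wait = total op latency minus actual processing time."""
    return op_latency_ms - process_latency_ms

def is_queue_bound(op_latency_ms: float, process_latency_ms: float,
                   ratio: float = 2.0) -> bool:
    """Flag ops that spend most of their latency queued, hinting that
    more osd_op_threads may help."""
    return queue_wait_ms(op_latency_ms, process_latency_ms) \
        > ratio * process_latency_ms

# The case above: op_latency 1 sec, process_latency 25 msec
print(queue_wait_ms(1000.0, 25.0))   # 975.0 ms waiting in the queue
print(is_queue_bound(1000.0, 25.0))  # True: queue-bound, tune osd_op_threads
print(is_queue_bound(40.0, 25.0))    # False: the healthy after-tune profile
```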
18
Firefly Randread Case
After adding more osd_op_threads, the problem was solved: op_r_latency matches op_r_process_latency.
Fio latency is back to 40ms, and the osd-side real op processing time is about 25-30ms, which makes more sense.
runid | op_size | op_type | QD | engine | serverNum | clientNum | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency
Before tune | 4k | randread | qd8 | vdb | 4 | 2 | 40 | 401 sec | 3389.000 | 13.313 MB/s | 93.991 msec | 3729.249 | 16.798 MB/s | 15.996 msec
Before tune | 4k | randread | qd8 | vdb | 4 | 2 | 80 | 301 sec | 3693.000 | 14.577 MB/s | 172.485 msec | 3761.441 | 14.986 MB/s | 16.452 msec
After tune | 4k | randread | qd8 | vdb | 4 | 2 | 40 | 400 sec | 3979.000 | 15.640 MB/s | 40.503 msec | 3943.347 | 15.913 MB/s | 21.804 msec
After tune | 4k | randread | qd8 | vdb | 4 | 2 | 80 | 400 sec | 7441.000 | 29.223 MB/s | 85.488 msec | 7295.486 | 28.611 MB/s | 57.085 msec
19
Agenda
• Background
• How to use CeTune
• CeTune modules
• How CeTune help to Tune
• Summary
20
Summary & next step
• Summary
• CeTune makes deploying and benchmarking ceph easy.
• CeTune helps a lot in identifying performance bottlenecks.
• CeTune can adopt more and more benchmark tools and analysis methodologies.
• Next step
• Tuner: more intelligence, generating reference tunings and even HW requirements from performance expectations as input.
• Deploy: adopt a web-based UI ( *VSM ) to deploy the ceph cluster.
• Analyzer: adopt more good, light-runtime-overhead analysis methodologies into CeTune.
• Benchmark: benchmark ceph performance on more aspects and scenarios.
*Cosbench: an open source Benchmarking tool developed by Intel to measure Cloud Object Storage Service performance, which can act as an object workload under the CBT framework.
*VSM: a web-based management application for Ceph storage systems. VSM creates, manages, and monitors a Ceph cluster. VSM simplifies the creation and day-to-day management of a Ceph cluster for cloud and data center storage administrators.
21
Q & A
22
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation. 23
Legal Information: Benchmark and Performance Claims Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.
Test and System Configurations: See Back up for details.
For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
24
Risk Factors
The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel’s products is highly variable and could differ from expectations due to factors including changes in the business and economic conditions; consumer confidence or income levels; customer acceptance of Intel’s and competitors’ products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel’s gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields.
Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel’s ability to respond quickly to technological developments and to introduce new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel’s stock repurchase program and dividend program could be affected by changes in Intel’s priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel’s cash flows and changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation. Intel’s results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. 
Intel’s results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.
Rev. 1/15/15 25
Backup
26
HW configuration
Ceph Nodes
CPU 1 x Intel Xeon E3-1275 V2 @ 3.5 GHz (4-core, 8 threads)
Chipset Intel C204 chipset
Memory 32 GB (4 x 8GB DDR3 @ 1600 MHz)
NIC 1 X 82599ES 10GbE SFP+, 4x 82574L 1GbE RJ45
HBA/C204 {SAS2008 PCI-Express Fusion-MPT SAS-2} / {6 Series/C200 Series Chipset Family SATA AHCI Controller}
Disks
1 x SSDSA2SH064G1GC 2.5’’ 64GB for OS
2 x Intel S3500 400GB SSD (Journal)
1 x Intel P3700 1.6TB PCI-E SSD (Cache Tier Storage)
10 x Seagate ST3000NM0033-9ZM 3.5’’ 3TB 7200rpm SATA HDD (Data)
Client Nodes
CPU 2 x Intel Xeon E5-2680 @ 2.8 GHz (20-core, 40 threads) (Qty: 3)
Memory 128 GB (16 x 8GB DDR3 @ 1333 MHz)
NIC 2x 10Gb 82599EB, ECMP (20Gb)
Disks 1 HDD for OS
Client VM
CPU 1 x vCPU (pinned via vcpupin)
Memory 512 MB
27
Test methodology
• Storage interface
• Use QemuRBD as storage interface
• Space allocation (per node)
• Data Drive:
– Sits on 10x 3TB HDD drives
• Journal:
– Sits on 2x S3500 400GB
– 5 journal partitions per data drive
– size: 60GB * 5
RBD volume:
– size: 60GB
– One RBD volume per VM
• Run rules
Drop osd page caches (echo "1" > /proc/sys/vm/drop_caches)
100 secs for warm up, 300 secs for data collection
Run 4KB/64KB tests under different # of rbds or VMs (40 VMs max)
Use “dd” to prepare data for R/W tests
• Use fio (ioengine=libaio, direct=1) to generate 4 IO patterns:
• 64KB sequential write/read,
• 4KB random write/read
• For capping tests, Seq Read/Write (60MB/s) and Rand Read/Write (100 iops)
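The fio side of these run rules might look like the job file below. This is a sketch assembled from the parameters above (libaio, direct=1, 100s warm-up, 300s run, /dev/vdb inside the VM); the section names are illustrative, and the capping options are shown commented out for the capping variant:

```ini
[global]
ioengine=libaio
direct=1
ramp_time=100          ; 100 secs for warm up
runtime=300            ; 300 secs for data collection
time_based=1

[randread-4k]
filename=/dev/vdb      ; rbd volume attached to the VM
rw=randread
bs=4k
iodepth=8
; for capping tests:
; rate_iops=100        ; rand read/write cap (100 iops)
; rate=60m             ; seq read/write cap (60MB/s)
```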
28
Intel Confidential — Do Not Forward
Q & A