![Page 1: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/1.jpg)
1
Improving Software-defined Storage Outcomes Through Telemetry Insights
SUP-1312
Lars Marowsky-Brée
Distinguished Engineer
![Page 2: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/2.jpg)
2
Agenda
1. Goals and Motivation
2. Data collection methodology
3. Scope and limitations
4. Exploratory Analysis
5. Pretty pictures
6. Q&A
![Page 3: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/3.jpg)
3
Goals And Motivation (Developer Side)
• Improve product/project decisions
• Understand actual deployments
• Detect anomalies and trends proactively
![Page 4: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/4.jpg)
4
Automated Telemetry Augments Support
• Support cases are only opened once an issue has escalated to human attention
• Data from support incidents is biased towards unhealthy environments
• We want to identify issues before they escalate to support incidents and better understand the impact of reported support incidents
![Page 5: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/5.jpg)
5
Goals And Motivation (User/Customer POV)
• Improve product/project decisions to reflect your usage
• Make sure developers understand your deployments
• Detect anomalies and trends proactively before they affect your systems
![Page 6: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/6.jpg)
6
Automated Telemetry Vs Surveys
• Surveys are limited in scope and depth
• Surveys provide qualitative data and human insights
• Telemetry is automated and delivers more frequent updates
• Telemetry has fewer typos :-)
• Automated telemetry + surveys: <3
![Page 7: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/7.jpg)
7
Sneak Peek: Community Survey’19
• 404 responses
• Total capacity reported: ~1184 PB
• Unclear, since obviously not all units were aligned
• 33% said they have enabled Telemetry already <3
• … does this match the reports?
• Full(er) analysis upcoming
![Page 8: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/8.jpg)
8
Why Users Have Not Enabled Telemetry
84 Weren’t aware the feature existed
74 Wish to understand data privacy better
54 Run Ceph versions that do not support it yet
33 Are in firewalled or airgapped environments
![Page 9: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/9.jpg)
9
Telemetry Methodology
• Ceph clusters report aggregate statistics
• Data is anonymized, no IP addresses/hostnames/... stored!
• “Upstream first” via the Ceph Foundation
• Community Data License Agreement – Sharing, Version 1.0
• Shared data corpus improves outcomes
• Opt-in, not (yet) enabled by default
• # ceph telemetry on
![Page 10: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/10.jpg)
10
Ceph Community Support For Telemetry
• Upstream support began in Ceph Mimic
• Significant enhancements in Nautilus
• SUSE backported to Luminous
• Supported in:
• SUSE Enterprise Storage 5.5 Maintenance Updates (upcoming)
• SUSE Enterprise Storage 6
![Page 11: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/11.jpg)
11
Examples Of Data Included With Telemetry
• Total aggregate for capacity and usage
• Number of OSDs, MONs, hosts
• Versions (Ceph, kernel, distribution) aggregates
• CephFS metrics, number of RBDs, pool data
• Crashes (can be disabled separately)
# ceph telemetry show
![Page 12: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/12.jpg)
12
Limitations – Caveat emptor
Biased sample!
• “Recent” versions only
• Not enabled by default, users need to actively enable
• Environments need access to Internet for upload
• Enterprise environments likely under-represented
Thus: not representative of whole population, treat with care!
Look at trends; don’t worry about exact numbers
![Page 13: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/13.jpg)
13
Exploratory Data Analysis
• Python (ipython, pandas)
• Data preparation – clean-up, flatten into table
• Resample to common intervals (daily, extrapolated)
• Start evaluating the data
• Find errors in data set, go back to 1
• Enjoyed SUSE’s HackWeek 2020 very much!
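The preparation steps above (flatten into a table, resample to daily intervals, carry values forward) can be sketched with pandas roughly as follows. This is a minimal illustration, not the analysis code behind the talk: the report layout and the field names `cluster_id`, `report_timestamp` and `osd.count` are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical, heavily simplified reports; real telemetry reports differ.
reports = [
    {"cluster_id": "a", "report_timestamp": "2020-02-01 03:00", "osd": {"count": 10}},
    {"cluster_id": "b", "report_timestamp": "2020-02-02 11:30", "osd": {"count": 5}},
    {"cluster_id": "a", "report_timestamp": "2020-02-03 04:10", "osd": {"count": 12}},
]

def daily_osd_counts(reports):
    """Flatten nested reports and resample them to one row per day."""
    # Nested dicts become dotted column names such as "osd.count".
    flat = pd.json_normalize(reports)
    flat["report_timestamp"] = pd.to_datetime(flat["report_timestamp"])
    # One column per cluster, one row per report timestamp.
    wide = flat.pivot_table(index="report_timestamp", columns="cluster_id",
                            values="osd.count", aggfunc="last")
    # Keep the last report of each day, then carry the previous value
    # forward over days on which a cluster did not report.
    return wide.resample("D").last().ffill()

daily = daily_osd_counts(reports)
print(daily)
```

Forward-filling is only one possible way to extrapolate; it keeps a cluster "alive" at its last known value, which can overstate totals if a cluster has stopped reporting for good.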
![Page 14: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/14.jpg)
14
Time For Pretty Pictures
• Overall trends
• Example of finding a bug
• Version and feature adoption
• Identifying most common practices
• Sizing in the real world
![Page 15: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/15.jpg)
15
How Many Clusters Are Reporting In?
![Page 16: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/16.jpg)
16
Total Capacity Reporting (Petabytes)
![Page 17: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/17.jpg)
17
Cross-checking This With The Survey Results:
In [183]: t_on = survey[
              survey['Is telemetry enabled in your cluster?'] == 'Yes']
In [184]: t_on['Total raw capacity'].agg('sum')/10**3
Out[184]: 280.126
In [185]: t_on['How many clusters ...'].agg('sum')
Out[185]: 308.0
![Page 18: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/18.jpg)
18
Major Ceph Versions In The Field
![Page 19: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/19.jpg)
19
Breakdown Of Ceph v14.x.y On OSDs In The Field
![Page 20: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/20.jpg)
20
v14.x.y Again, But Normalized
![Page 21: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/21.jpg)
21
When Do People Update?
• Important for staff planning etc.
• Compute rate of change per version for every day
• Excursion: total flow through versions
• Aggregate the absolute values per day for total rate of change
• Aggregate by day of week
… also a good example of the caveats to be mindful of:
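One way to read the steps above in pandas terms is sketched below. It assumes a daily table with one column per Ceph version and the number of OSDs on that version as values; that table layout, the function name and the sample numbers are invented for illustration.

```python
import pandas as pd

def update_activity_by_weekday(osds_per_version):
    """Sum the absolute day-to-day change across versions, per day of week."""
    # Rate of change per version for every day (first row has no predecessor).
    rate = osds_per_version.diff()
    # Total flow per day: an OSD that moves between versions is counted
    # twice, once leaving the old version and once arriving at the new one.
    total = rate.abs().sum(axis=1)
    # Aggregate by the name of the weekday.
    return total.groupby(total.index.day_name()).sum()

# Hypothetical sample: four consecutive days starting on a Monday.
days = pd.date_range("2020-03-02", periods=4, freq="D")
sample = pd.DataFrame({"14.2.7": [10, 8, 8, 3], "14.2.8": [0, 2, 2, 7]}, index=days)
by_weekday = update_activity_by_weekday(sample)
print(by_weekday)
```

The weekday a change is attributed to depends on when reports arrive and on how the data was resampled, so the result is better read as a rough pattern than as an exact schedule.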
![Page 22: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/22.jpg)
22
Version Changes Aggregated By Day Of Week
![Page 23: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/23.jpg)
23
Placement Groups: How Many Per Pool?
• Quite important for the even balancing of data
• Rule of thumb is to have ~100 PGs per OSD
• Should be rounded to a power of two
• Exact formula is a bit more difficult as it varies with the data
distribution between pools, pool “size”, ...
• What do users do?
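As a rough illustration of the rule of thumb, the sketch below computes a power-of-two PG count for the simplest case: a single pool holding all the data. Dividing by the pool "size" (replica count) and rounding up rather than to the nearest power of two are assumptions of this example; as noted above, the exact formula depends on how data is distributed between pools.

```python
import math

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for zero and everything else."""
    return n > 0 and (n & (n - 1)) == 0

def suggested_pg_num(num_osds, pool_size=3, target_pgs_per_osd=100):
    """Rule-of-thumb PG count for one pool that holds all data (assumed case)."""
    raw = num_osds * target_pgs_per_osd / pool_size
    if raw <= 1:
        return 1
    # Round up to the next power of two.
    return 1 << (math.ceil(raw) - 1).bit_length()

print(suggested_pg_num(10))    # 10 * 100 / 3 = 333.3 -> 512
print(suggested_pg_num(200))   # 200 * 100 / 3 = 6666.7 -> 8192
print(is_power_of_two(100))    # False
```

The `is_power_of_two` helper is the kind of check that can be applied to reported `pg_num` values to see how many pools follow the recommendation.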
![Page 24: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/24.jpg)
24
Top 20 pg_num Values Across All Pools …?!
![Page 25: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/25.jpg)
25
pg_num – Power Of Two Or Not
![Page 26: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/26.jpg)
26
How Did The Ceph Project Remedy This?
• Improve documentation, remove bad example, clarify impact
• Improve UI/UX
• Add HEALTH_WARN if this state is detected
• Introduce pg_autoscaler to fully automate this
• Available in SUSE Enterprise Storage 6 MU
https://ceph.io/community/the-first-telemetry-results-are-in/
![Page 27: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/27.jpg)
27
Adoption Of pg_autoscaler Functionality
![Page 28: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/28.jpg)
28
Power Of Two pg_num with pg_autoscaler On:
![Page 29: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/29.jpg)
29
Prioritization
• What is the actual usage pattern?
• How significant would an issue in a specific feature/area be?
• Focus QA and assess support incident impact
• But also: understand why some users are holding out on a “legacy” feature
• Are we ready to deprecate something?
![Page 30: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/30.jpg)
30
How Many OSDs Remain On FileStore?
![Page 31: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/31.jpg)
31
Number Of Pools: Replicated Vs Erasure Coding
![Page 32: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/32.jpg)
32
Number Of Clusters: Replicated Vs Erasure Coding
![Page 33: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/33.jpg)
33
Which Erasure Code Plugins are used?
![Page 34: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/34.jpg)
34
EC: Which k+m Values Are Chosen?
![Page 35: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/35.jpg)
35
What Defaults Do Users Most Frequently Change?
![Page 36: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/36.jpg)
36
Let’s Talk Real World Sizing
• Everyone wants to know what other people do
• Reflects market sweet spots
• Currently only a snapshot, not enough data to identify hardware trends
![Page 37: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/37.jpg)
37
Deployed Densities, Device Sizes (Quartiles)
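| Metric | 0.25 | 0.5 | 0.75 | 1.0 |
|---|---|---|---|---|
| OSD/host | 3 | 6 | 11 | 63 |
| OSD/host < 1PB | 3 | 5 | 9 | 63 |
| OSD/host > 1PB | 13 | 16 | 24 | 58 |
| TB/OSD | 1 | 4 | 7 | 14 |
| TB/OSD < 1PB | 1 | 3 | 5 | 14 |
| TB/OSD > 1PB | 6 | 10 | 11 | 12 |
| TB/host | 4 | 16 | 50 | 630 |
| TB/host < 1PB | 3 | 12 | 40 | 186 |
| TB/host > 1PB | 61 | 128 | 199 | 630 |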
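A table of this shape can be produced with `DataFrame.quantile`. The sketch below assumes one row per cluster with hypothetical columns `hosts`, `osds` and `capacity_tb`; the sample values are invented and are not the telemetry data.

```python
import pandas as pd

def density_quartiles(clusters):
    """Quartiles (0.25, 0.5, 0.75, 1.0) of per-cluster density ratios."""
    derived = pd.DataFrame({
        "OSD/host": clusters["osds"] / clusters["hosts"],
        "TB/OSD": clusters["capacity_tb"] / clusters["osds"],
        "TB/host": clusters["capacity_tb"] / clusters["hosts"],
    })
    # Quantile 1.0 is simply the maximum; transpose so metrics become rows.
    return derived.quantile([0.25, 0.5, 0.75, 1.0]).T

# Hypothetical sample of five clusters.
clusters = pd.DataFrame({
    "hosts": [1, 2, 4, 5, 10],
    "osds": [3, 12, 44, 50, 160],
    "capacity_tb": [3, 48, 176, 500, 1600],
})
quartiles = density_quartiles(clusters)
print(quartiles)
```

Rows restricted by cluster size can be obtained by filtering first, for example `density_quartiles(clusters[clusters["capacity_tb"] > 1000])` for clusters above 1 PB.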
![Page 38: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/38.jpg)
38
OSDs: Rotational Vs flash/SSD/NVMe
![Page 39: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/39.jpg)
39
OSDs: Rotational Vs flash/SSD/NVMe, >=1PB
![Page 40: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/40.jpg)
40
Future Enhancements
Support different telemetry transport methods (with registration?)
Include more relevant metrics as identified by yet unanswerable questions
• Performance metrics, OSD variance, per-pool capacity/usage, client versions/numbers …
• Device and fault data for predictive failure analysis
• Data mining crash data
Automated dashboards on Ceph site: https://telemetry-public.ceph.com/
Consider if/how to enable this by default once acceptance is up
![Page 41: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/41.jpg)
41
Questions? Answers!
# ceph telemetry on
Help us serve you better.
![Page 42: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/42.jpg)
42
Questions?
![Page 43: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/43.jpg)
43
General Disclaimer
This document is not to be construed as a promise by any participating company to
develop, deliver, or market a product. It is not a commitment to deliver any material,
code, or functionality, and should not be relied upon in making purchasing
decisions. SUSE makes no representations or warranties with respect to the contents of
this document, and specifically disclaims any express or implied warranties of
merchantability or fitness for any particular purpose. The development, release, and
timing of features or functionality described for SUSE products remains at the sole
discretion of SUSE. Further, SUSE reserves the right to revise this document and to
make changes to its content, at any time, without obligation to notify any person or entity
of such revisions or changes. All SUSE marks referenced in this presentation are
trademarks or registered trademarks of SUSE, LLC, Inc. in the United States and other
countries. All third-party trademarks are the property of their respective owners.
![Page 44: Improving Software-defined Storage Outcomes Through](https://reader034.vdocuments.mx/reader034/viewer/2022051906/628487e3f0cd344a69228007/html5/thumbnails/44.jpg)