Better service monitoring through histograms (Silicon Valley Perl, 09-01-2016)
![Page 1: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/1.jpg)
Better service monitoring through histograms
Fred Moyer - @phredmoyer
Silicon Valley Perl, 09-01-2016
![Page 2: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/2.jpg)
Who likes to wake up for false positives?
![Page 3: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/3.jpg)
Synthetics
Easy to set up, but not a real user
![Page 4: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/4.jpg)
Stephen Falken: Uh, uh, General, what you see on these screens up here is a fantasy; a computer-enhanced hallucination. Those blips are not real missiles. They're phantoms. (War Games, 1983)
![Page 5: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/5.jpg)
Real Users
![Page 6: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/6.jpg)
Real Users
![Page 7: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/7.jpg)
500 ms is really 2,000 ms
Spike Erosion
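A minimal sketch of how spike erosion can arise, assuming the graph averages adjacent samples when it is zoomed out. The readings and the `downsample_mean` helper are hypothetical, chosen so the arithmetic matches the slide.

```python
def downsample_mean(samples, factor):
    """Average each run of `factor` adjacent samples into one plotted point."""
    points = []
    for start in range(0, len(samples), factor):
        chunk = samples[start:start + factor]
        points.append(sum(chunk) / len(chunk))
    return points


# Hypothetical latency readings in ms with a single 2,000 ms spike.
latencies_ms = [0, 0, 2000, 0, 0, 0, 0, 0]

# Zooming out so that four readings share one plotted point.
zoomed_out = downsample_mean(latencies_ms, 4)

print(max(latencies_ms))  # 2000
print(max(zoomed_out))    # 500.0
```

The spike is still in the raw data, but the averaged graph shows a peak of only 500 ms.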
![Page 8: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/8.jpg)
Threshold Based Alerting
![Page 9: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/9.jpg)
“Alert if a request takes longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
![Page 10: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/10.jpg)
“Alert if request average over one minute is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting
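The two threshold rules can be sketched as follows, using the sample values from the slides. The function names and the `THRESHOLD_MS` constant are illustrative, not taken from any particular monitoring tool.

```python
THRESHOLD_MS = 200


def alert_on_any(samples, threshold=THRESHOLD_MS):
    """Rule 1: alert if any single request is slower than the threshold."""
    return any(sample > threshold for sample in samples)


def alert_on_mean(samples, threshold=THRESHOLD_MS):
    """Rule 2: alert if the arithmetic mean is above the threshold."""
    return sum(samples) / len(samples) > threshold


single_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
mostly_slow = [10, 10, 210, 210, 210, 210]

print(alert_on_any(single_outlier))  # True: one outlier in 10 pages someone
print(alert_on_mean(mostly_slow))    # False: mean is 860/6, about 143 ms
```

Rule 1 fires on a single outlier, while rule 2 stays silent even though four of the six requests are over 200 ms.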
![Page 11: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/11.jpg)
‘average’ eq ‘arithmetic mean’
A = S/N
A = average
N = the number of samples
S = the sum of the samples in the set
Math Refresher
![Page 12: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/12.jpg)
median = midpoint of data set
The 50th percentile is 555 - q(0.5)
| Sample # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Value | 111 | 222 | 333 | 444 | 555 | 666 | 777 | 888 | 999 |
Math Refresher
![Page 13: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/13.jpg)
90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
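| Sample # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 111 | 222 | 333 | 444 | 555 | 666 | 777 | 888 | 999 | 1,000 | 1,111 |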
Math Refresher
![Page 14: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/14.jpg)
100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
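| Sample # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 111 | 222 | 333 | 444 | 555 | 666 | 777 | 888 | 999 | 1,000 | 1,111 |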
Math Refresher
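The three percentile examples above can be reproduced with a nearest-rank definition of q. This is a minimal sketch: nearest rank is one of several common percentile definitions and is an assumption here, since the slides do not say which one they use.

```python
import math


def q(fraction, samples):
    """Nearest-rank quantile: the smallest sample with at least
    `fraction` of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank
    return ordered[rank - 1]


nine_samples = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven_samples = nine_samples + [1000, 1111]

print(q(0.5, nine_samples))    # 555  (median)
print(q(0.9, eleven_samples))  # 1000 (90th percentile)
print(q(1, eleven_samples))    # 1111 (100th percentile, the maximum)
```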
![Page 15: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/15.jpg)
[Figure: x-axis is sample value, y-axis is number of samples]
Histogram
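A histogram counts how many samples fall into each range of values. A minimal sketch with fixed-width buckets follows; the latency values and the 50 ms bucket width are made up for illustration.

```python
from collections import Counter


def histogram(samples, bucket_width):
    """Map each bucket's lower bound to the number of samples in it."""
    counts = Counter((sample // bucket_width) * bucket_width for sample in samples)
    return dict(sorted(counts.items()))


# Hypothetical request latencies in ms.
latencies_ms = [12, 18, 25, 31, 33, 38, 47, 95, 102, 480]

hist = histogram(latencies_ms, 50)
print(hist)  # {0: 7, 50: 1, 100: 1, 450: 1}

# Crude text rendering: one '#' per sample in the bucket.
for lower_bound, count in hist.items():
    print(f"{lower_bound:>4} ms  {'#' * count}")
```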
![Page 16: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/16.jpg)
[Figure: x-axis is sample value, y-axis is number of samples]
Normal Distribution
![Page 17: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/17.jpg)
[Figure: x-axis is sample value, y-axis is number of samples]
Normal Distribution
68% within one sigma (σ)
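The 68% figure can be checked with Python's standard library (`statistics.NormalDist`, available from Python 3.8). It holds only when the data really is normally distributed.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation (sigma) 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -1 sigma and +1 sigma.
within_one_sigma = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"{within_one_sigma:.4f}")  # 0.6827
```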
![Page 18: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/18.jpg)
[Figure: x-axis is sample value, y-axis is number of samples]
Non-Normal Distribution
![Page 19: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/19.jpg)
[Figure: x-axis is sample value, y-axis is number of samples]
Non-Normal Distribution
![Page 20: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/20.jpg)
Non-Normal Distribution
Operations data groups at different points
![Page 21: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/21.jpg)
Non-Normal Distribution
Users to the right of the red line are gone
![Page 22: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/22.jpg)
Request latency
“We keep hearing from people that the website is slow. But it is fine when we test it, and the request latency graph is constant”
You are only looking at part of the picture.
![Page 23: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/23.jpg)
Heat Map
Histograms over time windows
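A heat map can be thought of as one histogram per time window, stacked along the time axis. The sketch below builds that grid; the event data, the 60 second window, and the 100 ms bucket width are all hypothetical.

```python
from collections import Counter, defaultdict


def heat_map(events, window_s, bucket_ms):
    """Build {window_start_s: {bucket_lower_bound_ms: count}} from
    (timestamp_s, latency_ms) pairs."""
    windows = defaultdict(Counter)
    for timestamp_s, latency_ms in events:
        window_start = (timestamp_s // window_s) * window_s
        bucket = (latency_ms // bucket_ms) * bucket_ms
        windows[window_start][bucket] += 1
    return {
        window_start: dict(sorted(counts.items()))
        for window_start, counts in sorted(windows.items())
    }


# Hypothetical (timestamp in seconds, latency in ms) request log.
events = [(0, 12), (15, 18), (42, 250), (61, 14), (75, 16), (118, 900)]

grid = heat_map(events, window_s=60, bucket_ms=100)
print(grid)  # {0: {0: 2, 200: 1}, 60: {0: 2, 900: 1}}
```

Each cell's count would drive the color intensity in the rendered heat map.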
![Page 24: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/24.jpg)
Percentiles
![Page 25: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/25.jpg)
Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, e.g. 300 MB transferred over 5 minutes = 1 MB/s billing rate
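The billing steps above can be sketched as follows. The usage figures are invented, and real providers may round the number of discarded samples differently.

```python
INTERVAL_S = 300  # 5 minute usage intervals


def billable_rate_mb_per_s(interval_usage_mb):
    """95th percentile billing: sort the interval samples, throw out the
    highest 5%, and bill at the rate of the highest remaining sample."""
    ordered = sorted(interval_usage_mb)
    discard = len(ordered) * 5 // 100  # integer math, rounds down
    kept = ordered[:len(ordered) - discard]
    return kept[-1] / INTERVAL_S


# Hypothetical month compressed to 20 intervals: MB transferred per interval.
usage_mb = [100] * 18 + [300, 900]

print(billable_rate_mb_per_s(usage_mb))  # 1.0
```

The single 900 MB burst is discarded, so the bill is based on the 300 MB interval, which is 1 MB/s.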
![Page 26: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/26.jpg)
Practical Percentiles
If I measure 95th percentile per 5 minutes all month long,
I CANNOT calculate 95th percentile over the month.
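A small illustration of why per-window percentiles cannot be combined, and why histograms can. It assumes a nearest-rank percentile and, for simplicity, a histogram keyed by exact value; the two windows are made-up data.

```python
import math
from collections import Counter


def q(fraction, samples):
    """Nearest-rank quantile over raw samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(fraction * len(ordered)) - 1]


def q_from_histogram(fraction, hist):
    """Nearest-rank quantile over a {value: count} histogram."""
    target = math.ceil(fraction * sum(hist.values()))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= target:
            return value


quiet_window = [1000] * 10  # few requests, all slow
busy_window = [10] * 1000   # many requests, all fast

per_window_p95 = [q(0.95, quiet_window), q(0.95, busy_window)]
mean_of_p95 = sum(per_window_p95) / len(per_window_p95)
true_p95 = q(0.95, quiet_window + busy_window)

# Histograms merge by adding counts, so the overall percentile survives.
merged = Counter(quiet_window) + Counter(busy_window)
merged_p95 = q_from_histogram(0.95, merged)

print(per_window_p95)  # [1000, 10]
print(mean_of_p95)     # 505.0
print(true_p95)        # 10
print(merged_p95)      # 10
```

Averaging the two stored percentiles gives 505 ms, while the 95th percentile over all requests is 10 ms. Keeping the histograms and merging them recovers the right answer.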
![Page 27: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/27.jpg)
Angry users
How many users are you pissing off?
![Page 28: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/28.jpg)
Angry users
![Page 29: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/29.jpg)
“Alert me if request latency 90th percentile over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
![Page 30: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/30.jpg)
“Alert me if request latency 90th percentile over one minute is exceeded”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
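A sketch of a percentile based alert, assuming a nearest-rank percentile and the 200 ms threshold from the earlier threshold slides. The outlier case uses the ten-sample series from the threshold slide. Exact percentile values depend on the definition used: nearest rank returns 300 for the second series, where an interpolating definition gives a value closer to the slide's ~270. Both are above the threshold.

```python
import math

THRESHOLD_MS = 200  # assumed, carried over from the threshold alerting slides


def q(fraction, samples):
    """Nearest-rank quantile."""
    ordered = sorted(samples)
    return ordered[math.ceil(fraction * len(ordered)) - 1]


def should_alert(samples, fraction=0.9, threshold_ms=THRESHOLD_MS):
    """Alert when the chosen percentile of the window exceeds the threshold."""
    return q(fraction, samples) > threshold_ms


one_outlier = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
sustained_slowness = [10, 10, 10, 10, 10, 10, 250, 300]

print(q(0.9, one_outlier), should_alert(one_outlier))                # 10 False
print(q(0.9, sustained_slowness), should_alert(sustained_slowness))  # 300 True
```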
![Page 31: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/31.jpg)
Percentile based alerting
![Page 32: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/32.jpg)
Who’s using this approach?
Google.com - in-house monitoring systems
Circonus.com - hosted histogram monitoring
You? (I’ve written my own histograms but use Circonus for production systems)
![Page 33: Better service monitoring through histograms sv perl 09012016](https://reader035.vdocuments.mx/reader035/viewer/2022062903/58a0b0571a28ab75368b54bf/html5/thumbnails/33.jpg)
Questions?
Thanks to Circonus for tools and help with math
http://www.circonus.com/free-account/
Look for future monitoring talks here soon
http://meetup.com/monitorSF