How to do monitoring that won't make your engineers quit
TRANSCRIPT
![Page 1: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/1.jpg)
Relaxing picture of Yoga
![Page 3: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/3.jpg)
![Page 4: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/4.jpg)
hunt through logs for 2 hours
![Page 5: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/5.jpg)
![Page 6: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/6.jpg)
![Page 7: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/7.jpg)
![Page 8: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/8.jpg)
Monitoring that will make your engineers give up
Gil Zellner (CloudifyDev at Gigaspaces)
Twitter: @Heathenaspargus
![Page 9: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/9.jpg)
Who am I?
Now:
Past:
![Page 12: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/12.jpg)
The cost of hiring a new employee is 1.5-3x their monthly salary
![Page 14: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/14.jpg)
Easy (days):
- no changes to infrastructure
- just policy
Intermediate (months):
- small changes to apps
- logging
- light automation
Hard (years):
- design for better operability
- long term
![Page 19: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/19.jpg)
Frustration: I am unable to complete my task
![Page 20: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/20.jpg)
Time spent inefficiently
![Page 21: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/21.jpg)
Repetitive tasks
![Page 22: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/22.jpg)
Working Alone
![Page 23: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/23.jpg)
Yak Shaving
![Page 24: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/24.jpg)
https://www.ergoflex.co.uk/blog/category/sleep-research/sleeponomics-could-sleep-deprivation-be-the-real-reason-politicians-make-bad-decisions
![Page 25: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/25.jpg)
Mandatory half day off after a night-time production issue
![Page 26: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/26.jpg)
Allocate weekly time to resolve or automate issues that kept us up at night
![Page 27: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/27.jpg)
Wider rotation (more people do on-call)
![Page 28: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/28.jpg)
https://www.youtube.com/watch?v=IUoEiDT1nXY
Creating a DevOps Culture: Identifying a “Single Person of Failure”
![Page 29: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/29.jpg)
Knowledge Matrix
Deploy System Mobile Link Backend
Gil V V
Karen V V
Ari V V
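A knowledge matrix like the one above can be checked mechanically for a "single person of failure". The sketch below is only an illustration: the column alignment of the original table was lost, so the assignment of people to systems here is invented, and the system names are a guess at the column headers.

```python
from collections import defaultdict

# Hypothetical knowledge matrix: who can operate which system.
# The real check marks from the slide are not recoverable; these are invented.
KNOWLEDGE = {
    "Gil": {"Deploy System", "Backend"},
    "Karen": {"Mobile", "Backend"},
    "Ari": {"Mobile", "Backend"},
}

def single_points_of_failure(knowledge, min_people=2):
    """Return {system: [people]} for systems known by fewer than min_people."""
    owners = defaultdict(list)
    for person, systems in knowledge.items():
        for system in systems:
            owners[system].append(person)
    return {s: sorted(p) for s, p in owners.items() if len(p) < min_people}

# Systems listed here need knowledge sharing before the next rotation.
print(single_points_of_failure(KNOWLEDGE))
```

Any system that shows up in the result is a candidate for pairing or documentation work.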
![Page 32: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/32.jpg)
Easy (days):
- no changes to infrastructure
- just policy
Intermediate (months):
- small changes to apps
- logging
- light automation
Hard (years):
- design for better operability
- long term
![Page 33: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/33.jpg)
![Page 34: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/34.jpg)
![Page 36: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/36.jpg)
Solution: alert only on things that meet the following criteria:
1) Alert on symptoms, not suspected "causes"
2) Actionable
3) Business breaking
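One way to apply the three criteria above is to make them explicit fields on every alert definition and page a human only when all three hold. This is a minimal sketch; the `Alert` class, its field names, and the example alerts are assumptions made for illustration, not part of any monitoring product.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    is_symptom: bool            # a user-visible effect, not a suspected cause
    is_actionable: bool         # someone can do something about it right now
    is_business_breaking: bool  # the business is being hurt while it fires

def should_page(alert: Alert) -> bool:
    """Page a human only if the alert meets all three criteria."""
    return alert.is_symptom and alert.is_actionable and alert.is_business_breaking

# Hypothetical alert definitions.
alerts = [
    Alert("checkout error rate above 5%", True, True, True),
    Alert("CPU at 90% on one web node", False, False, False),
    Alert("disk 70% full", False, True, False),
]

for alert in alerts:
    print(alert.name, "->", "page" if should_page(alert) else "log only")
```

Alerts that fail the check are still recorded, they just do not wake anyone up.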
![Page 37: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/37.jpg)
General alert! ("Alerte générale!")
![Page 38: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/38.jpg)
Solution: direct alerts to relevant parties
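Directing alerts to the relevant parties can be as simple as a lookup from the affected service to the team that owns it. The sketch below assumes a hypothetical ownership map and team names; real alerting tools express the same idea in their own routing configuration.

```python
# Hypothetical mapping from service name to the owning on-call team.
OWNERS = {
    "payments": "payments-oncall",
    "mobile": "mobile-oncall",
}
DEFAULT_TEAM = "ops-oncall"  # fallback when nobody claims the service

def route(service: str) -> str:
    """Return the team that should receive alerts for this service."""
    return OWNERS.get(service, DEFAULT_TEAM)

for service in ("payments", "mobile", "search"):
    print(service, "->", route(service))
```

The fallback team matters: an alert with no owner should still reach someone, and each such case is a prompt to fill in the map.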
![Page 39: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/39.jpg)
Companies that are doing this as a service:
![Page 44: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/44.jpg)
Companies that are doing this as a service:
![Page 45: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/45.jpg)
Picking the right things to measure
![Page 47: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/47.jpg)
Netflix stream starts per second
![Page 48: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/48.jpg)
What are your KPIs?
- Stream starts per second
- Taxi orders per minute
- API calls per second
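A rate KPI like the ones above can be watched by comparing the current rate against a known-good baseline. The following is a rough sketch under stated assumptions: event timestamps are plain seconds, and the 50% drop threshold (`max_drop`) is an arbitrary value chosen for the example.

```python
def rate_per_second(event_times, window_start, window_end):
    """Events per second within [window_start, window_end)."""
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window must have positive length")
    count = sum(1 for t in event_times if window_start <= t < window_end)
    return count / duration

def kpi_is_healthy(current_rate, baseline_rate, max_drop=0.5):
    """True while the current rate stays above (1 - max_drop) of the baseline."""
    return current_rate >= baseline_rate * (1 - max_drop)

# Hypothetical event timestamps (seconds), e.g. stream starts.
events = [0.1, 0.5, 1.2, 1.9, 2.4, 3.3, 7.5, 9.0]

baseline = rate_per_second(events, 0, 4)   # 6 events over 4 seconds
recent = rate_per_second(events, 6, 10)    # 2 events over 4 seconds

if not kpi_is_healthy(recent, baseline):
    print(f"KPI dropped: {recent} per second vs baseline {baseline}")
```

A drop in the business KPI is a symptom by definition, so it is a natural thing to page on.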
![Page 49: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/49.jpg)
Companies that are doing this as a service:
![Page 51: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/51.jpg)
Make heal script
![Page 53: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/53.jpg)
Auto-remediation basics
1) Make remediation script
2) Make diagnosis script
3) Connect them
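The three steps above can be sketched as a diagnosis function, a table of remediation functions, and a loop that connects them. Everything here is hypothetical: the service is a plain dictionary and the "fixes" only mutate it, standing in for real restart or cleanup scripts.

```python
def diagnose(service):
    """Return a problem name, or None if the service looks healthy."""
    if not service["process_running"]:
        return "process_down"
    if service["disk_used_pct"] >= 90:
        return "disk_full"
    return None

def restart_process(service):
    service["process_running"] = True   # stand-in for a real restart script

def clean_tmp(service):
    service["disk_used_pct"] = 40       # stand-in for a real cleanup script

# The connection between diagnosis results and remediation scripts.
REMEDIATIONS = {"process_down": restart_process, "disk_full": clean_tmp}

def heal(service, max_attempts=3):
    """Diagnose, apply the matching fix, and re-check.

    Returns True if the service ends up healthy, False if a human is needed.
    """
    for _ in range(max_attempts):
        problem = diagnose(service)
        if problem is None:
            return True
        fix = REMEDIATIONS.get(problem)
        if fix is None:
            return False  # unknown problem: escalate instead of guessing
        fix(service)
    return diagnose(service) is None

service = {"process_running": False, "disk_used_pct": 95}
print("healed" if heal(service) else "escalate to on-call")
```

Capping the number of attempts keeps a failing fix from looping forever; when the cap is hit, the alert goes to a person.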
![Page 55: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/55.jpg)
Facebook Auto Remediation
https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920
![Page 56: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/56.jpg)
Heal Workflows - Cloudify
![Page 57: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/57.jpg)
Easy (days):
- no changes to infrastructure
- just policy
Intermediate (months):
- small changes to apps
- logging
- light automation
Hard (years):
- design for better operability
- long term
![Page 58: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/58.jpg)
Incentive for resilient architecture
0.99 uptime: 87.6 hours of downtime per year
0.999 uptime: 8.76 hours of downtime per year
0.9999 uptime: 52.6 minutes of downtime per year
0.99999 uptime: 5.3 minutes of downtime per year
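The figures above are the yearly downtime each uptime level allows, which is simply the unavailable fraction multiplied by the hours in a year. A short sketch of that arithmetic, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(uptime):
    """Allowed downtime in hours per year for an uptime fraction such as 0.999."""
    return (1 - uptime) * HOURS_PER_YEAR

for uptime in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}: {hours:.2f} hours = {hours * 60:.1f} minutes per year")
```

Each extra nine cuts the budget by a factor of ten, which is why the higher levels leave no room for a human to respond manually.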
![Page 59: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/59.jpg)
Automated failovers
@Heathenaspargus
![Page 60: How to do monitoring that won't make your engineers quit](https://reader031.vdocuments.mx/reader031/viewer/2022022201/5889a2651a28abf2038b4f5b/html5/thumbnails/60.jpg)
The AntiFragile organization
https://queue.acm.org/detail.cfm?id=2499552