life on-call, availa-liberty, and the pursuit of happiness
Life On-Call, Availa-liberty, & the Pursuit of Happiness
Runbooks
Dave Cliffe
@CliffeHangers
Incident #1: Oct 27, 2011
Incident #2: May 1, 2013
Incident #3: Nov 2, 2015
Collaboration/Resolution
MICROSERVICES
APPS & SERVICES
CONTAINERS
CLOUD
NETWORK
DATABASE
SERVERS
Developer
NOC
Helpdesk
IT Ops
System and User Efficiency
ALERT 1 ALERT 2 ALERT 3
Correlate, Cluster and Manage
EVENTS
People Data Process
Deployment Tools
Monitoring Tools
Ticketing Tools
APP
SYSTEM
LOG
WEB
MOBILE APP
Automatic Escalations
On-Call Scheduling
Your Fastest Path to Incident Resolution
Availability
Every software powered company experiences downtime
http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html
Cost of outages:
$7,400,000 annual cost @ 175 hours downtime (Gartner)
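To put the Gartner figure in per-hour terms, here is a small arithmetic sketch. It only divides the two numbers on the slide; the assumption of a 365-day year and the variable names are mine, not the deck's.

```python
# Figures taken from the slide (attributed there to Gartner).
ANNUAL_COST_USD = 7_400_000
DOWNTIME_HOURS = 175

# Assumption for illustration: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24

# Average cost of one hour of downtime.
cost_per_hour = ANNUAL_COST_USD / DOWNTIME_HOURS

# What 175 hours of downtime means as a yearly availability percentage.
availability_pct = 100 * (1 - DOWNTIME_HOURS / HOURS_PER_YEAR)

print(f"${cost_per_hour:,.0f} per hour of downtime")   # $42,286 per hour of downtime
print(f"{availability_pct:.2f}% yearly availability")  # 98.00% yearly availability
```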
“The most important ability is availability.”
All CEOs everywhere
Why is Availability a terrible metric?
The Tyranny of the SLA
credit: J. Paul Reed (@jpaulreed)
“System Availability” means the percentage of total time during which the Hosted Service network is available to Client and Client is able to access the Hosted Service system interface.
______ warrants the following minimum levels of Hosted Service System Availability during each calendar month: 99.95%
The following definitions will apply to the calculation of “availability”: “Hosted Service System Availability” means the percentage of total time during each calendar month during which the Hosted Service is available to Client, excluding Scheduled Downtime and Emergency Maintenance
An actual SaaS SLA
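A rough sketch of how an availability number like the one in this SLA could be computed, and how the exclusions change it. The function names, the 30-day month, and the sample outage figures are all hypothetical; the only values taken from the SLA text are the 99.95% target and the fact that scheduled downtime and emergency maintenance are excluded.

```python
def availability_pct(total_minutes, unplanned_down_minutes, excluded_minutes=0):
    """Availability over a period, after removing excluded maintenance time.

    excluded_minutes stands in for the SLA's "Scheduled Downtime and
    Emergency Maintenance" (hypothetical parameter name).
    """
    measured = total_minutes - excluded_minutes
    return 100.0 * (measured - unplanned_down_minutes) / measured


def downtime_budget_minutes(total_minutes, target_pct, excluded_minutes=0):
    """Minutes of unplanned downtime allowed before the target is missed."""
    measured = total_minutes - excluded_minutes
    return measured * (1 - target_pct / 100.0)


MINUTES_IN_MONTH = 30 * 24 * 60  # assumption: a 30-day calendar month
TARGET_PCT = 99.95               # from the SLA text

# With no exclusions, 99.95% leaves roughly 21.6 minutes per month.
budget = downtime_budget_minutes(MINUTES_IN_MONTH, TARGET_PCT)

# Made-up month: 240 minutes of excluded maintenance, 20 minutes of outage.
sla_view = availability_pct(MINUTES_IN_MONTH, 20, excluded_minutes=240)
client_view = availability_pct(MINUTES_IN_MONTH, 20 + 240)

print(f"budget: {budget:.1f} min")
print(f"SLA availability: {sla_view:.3f}%  meets target: {sla_view >= TARGET_PCT}")
print(f"client-experienced availability: {client_view:.3f}%")
```

In this made-up month the SLA is met even though the client could not use the service for 260 minutes, which is one way to read the question the deck asks about availability as a metric.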
Are you Available?
Happiness
Measuring (Un)Happiness
Responsiveness
Pain
Health Checks
https://labs.spotify.com/2014/09/16/squad-health-check-model/
Happiness++
http://www.activestate.com/blog/2014/01/devops-hero-culture
Beware the ‘Hero Culture’
Eliminate Single Points of Dependence
Reduce Alert Fatigue
https://www.pinterest.com/pin/497929302524908289
On a regular basis, For every alert, Ask …
1) Is it actionable?
2) Is it urgent?
3) Could we consolidate?
4) Did the right person get it?
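The four questions above could be recorded per alert during a regular review. The sketch below is one hypothetical way to do that; the `AlertReview` type, the `recommend` function, the order of the checks, and the suggested follow-ups are my assumptions, not something the deck prescribes.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers to the four review questions for one alert (hypothetical type)."""
    name: str
    actionable: bool
    urgent: bool
    could_consolidate: bool
    reached_right_person: bool


def recommend(review: AlertReview) -> str:
    """Suggest a follow-up; the wording of each suggestion is an assumption."""
    if not review.actionable:
        return "remove it or move it to a dashboard"
    if not review.urgent:
        return "send it to a ticket queue instead of paging"
    if review.could_consolidate:
        return "merge it with related alerts"
    if not review.reached_right_person:
        return "fix the routing or escalation policy"
    return "keep as is"


# Made-up alerts for illustration.
reviews = [
    AlertReview("disk 80% full", True, False, False, True),
    AlertReview("checkout errors spiking", True, True, False, True),
]
for r in reviews:
    print(f"{r.name}: {recommend(r)}")
```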
“The most important on-call responsibility is to understand customer impact.”
Anonymous Customer (who I didn’t verify I could quote)
Sharing Operational Responsibility
“Giving developers operational responsibilities has greatly enhanced the QUALITY of the services, both from a customer and a technology point of view. The TRADITIONAL model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. … You build it, you run it.”
-Werner Vogels, CTO Amazon
SHARED OPERATIONAL RESPONSIBILITY
“For developers to take responsibility for the systems they create, they need support from operations to understand how to build ‘reliable software that can be continuous deployed to an unreliable platform that scales horizontally’.”
-Jez Humble, quoting Jesse Robbins (Chef)
SHARED OPERATIONAL RESPONSIBILITY