Voxxed Days Thessaloniki 2016 - Herding Cats to a Firefight
TRANSCRIPT
HERDING CATS TO A FIREFIGHT
THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM
G. CHANG @GREYSCHALE
EURUKO 2016
THE YEAR 1 B.C. (BEFORE CATS)
In the beginning,
there was only
darkness.
But suddenly,
out of the darkness,
there came a sound...
(pager noises)
One person was on-call. All day. And night. Every day. Every week. Forever.
(not really, but close enough)
Why not start having a rotation?
"We don't need no stinkin' on-call rotation!"
Bullshit.
"Hi, sorry to be calling at this hour. I'm from Yammer, I work with _____. Can I please speak with him?"
Date: Friday, xxth of XXX, 2013
Time: 03:00 AM GMT -0800
THE YEAR 1 A.D. (AFTER DISASTER)
How to maths?!?!
• Given:
• Given:
• Given: (1 + 15 ± 5 ) * 2
• ((4 / 1) * ((1 + 15 ± 5) * 2)) = ???
Answer:
How to acronyms?!?!
MTBF MTTR AAR SLA
• MTBF: Mean Time Between Failures
• MTTR: Mean Time To Recovery
• SLA: Service Level Agreement
• AAR: After Action Review
• IR: Incident Report
• OMGWTFBBQAFK
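The first two acronyms are easy to compute once incidents are logged with start and recovery times. A minimal sketch, assuming a hypothetical incident list (the dates below are invented, not from the talk) and taking MTBF as the average uptime between one recovery and the next failure, which is one common definition among several:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage started, service restored).
incidents = [
    (datetime(2016, 1, 4, 3, 0), datetime(2016, 1, 4, 3, 45)),
    (datetime(2016, 1, 11, 14, 0), datetime(2016, 1, 11, 14, 15)),
    (datetime(2016, 1, 25, 22, 30), datetime(2016, 1, 25, 23, 30)),
]


def mttr(incidents):
    """Mean Time To Recovery: average outage duration."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents):
    """Mean Time Between Failures: average uptime between a recovery
    and the start of the next failure (needs at least two incidents)."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

Tracking both numbers over time shows whether a team is trading fewer failures for slower recovery, or the other way round.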
MTBF: less frequent
• requires more stable systems
• engineers interrupted less often
• possibly more disastrous issues
MTTR: faster recovery
• needs good response training
• engineers gain broad knowledge
• possibly more frequent issues
• Google Docs Forms ❎ (hard to read reports)
• Yammer Notes ❎ (hard to analyse)
• JIRA ✅ (not perfect...but sort of works)
Hey, we're starting to get this! ...
Actually, not yet.
• System grows faster than we can learn about it
• Silos appear when you don't share knowledge
• Who's cleaning up this mess, anyway?
• Burnout is real
THE RENAISSANCE (GROWING PAINS)
Do more by doing less
• Split responsibilities by stack
• Added London office for follow-the-sun coverage
• Onboard everybody to the process
• Practice, practice, practice
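Follow-the-sun coverage comes down to a rule for which office holds the pager at a given moment. A rough sketch, assuming two offices and hand-off times in UTC; the shift hours and the "HQ" label are assumptions for illustration, since the slides only name the London office:

```python
from datetime import datetime, timedelta, timezone

# Assumed hand-off boundaries in UTC; the talk does not give real shift hours.
SHIFTS = [
    (8, 20, "London"),  # 08:00-19:59 UTC
    (20, 8, "HQ"),      # 20:00-07:59 UTC, wraps past midnight (hypothetical label)
]


def office_on_call(moment):
    """Return the office covering a timezone-aware datetime."""
    hour = moment.astimezone(timezone.utc).hour
    for start, end, office in SHIFTS:
        if start <= end:
            covered = start <= hour < end
        else:
            # Shift wraps around midnight UTC.
            covered = hour >= start or hour < end
        if covered:
            return office
    raise ValueError("no office covers hour %d UTC" % hour)


# A 03:00 page at GMT-0800 is 11:00 UTC, which falls in the London shift.
page = datetime(2016, 1, 4, 3, 0, tzinfo=timezone(timedelta(hours=-8)))
print(office_on_call(page))
```

Fixed UTC boundaries ignore daylight saving changes, so a real schedule would need to decide whether hand-offs follow UTC or local office hours.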
All hands on deck
• Keep all alerts in a configuration repo
• Managers aren't doing anything, anyway -- make them Incident Managers!
• Runbooks, runbooks everywhere (and a unified one)
• Make the initial response as simple as possible
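Once alerts live in a repository as data, simple checks can run against them, for instance that every alert points to a runbook. A minimal sketch, assuming a hypothetical alert format; the field names, thresholds, and paths below are invented and are not the team's actual configuration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    condition: str
    runbook: str  # where the responder should look first


# Hypothetical alert definitions, as they might be kept under version control.
ALERTS = [
    Alert("high_error_rate", "api", "5xx > 1% for 5m", "runbooks/api.md#errors"),
    Alert("queue_backlog", "worker", "depth > 10000 for 10m", ""),
]


def missing_runbooks(alerts):
    """Return the names of alerts that do not link to a runbook."""
    return [a.name for a in alerts if not a.runbook.strip()]


print(missing_runbooks(ALERTS))
```

Running a check like this on every change keeps the initial response simple: the page always comes with a pointer to the first step.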
BACK TO THE FUTURE (THE PRESENT)
Combined schedules
• Fewer rotations
• Team is unified, so schedules should be too
Post-mortems and retrospectives
• What? Where? Who? Why? How?
• NO blame game
Weekly hand-overs and monthly reviews
• From the previous week's engineers to the current week's engineers
• Track top alerts and resolutions (or lack thereof)
• Focus on the noisiest services
• Timezones are hard
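Finding the top alerts and the noisiest services for a hand-over is a counting exercise. A small sketch, assuming a hypothetical export of one week of pages as (service, alert) pairs; the service and alert names are made up:

```python
from collections import Counter

# Hypothetical week of pages: (service, alert name).
pages = [
    ("api", "high_error_rate"),
    ("worker", "queue_backlog"),
    ("api", "high_error_rate"),
    ("search", "slow_queries"),
    ("api", "high_error_rate"),
    ("worker", "queue_backlog"),
]


def noisiest_services(pages, top=3):
    """Services ranked by how many pages they produced."""
    return Counter(service for service, _ in pages).most_common(top)


def top_alerts(pages, top=3):
    """Individual alerts ranked by how often they fired."""
    return Counter(pages).most_common(top)


print(noisiest_services(pages))
print(top_alerts(pages, top=1))
```

Comparing these rankings week over week shows whether the noisiest services are actually getting quieter.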
Bi-monthly surveys
• Summarise overall preparedness
• Make sure we're improving
• ...and that nobody is actually burned out
Fix ALL the alerts
• Noisy
• Flaky
• Real
WHERE ARE THE CATS NOW?!
The end game
• 1 alert per person per day
• Service owners are on-call for those services
• The world is full of kittens!
Isn't on-call just for Ops?
• No
• Responsibility for our code
• Pride in our code
• No pain, no gain
After all...
we are all cats being herded.
THANK YOU
@GREYSCHALE