DevOps incident handling - making friends, not enemies
DESCRIPTION
David Mytton, CEO of Server Density, presented this talk to the DevOps Meetup in London. It takes you through how to handle DevOps incidents, outages and downtime, and more specifically how to make friends, not enemies, in the process.
TRANSCRIPT
How to win friends when handling outages and downtime
David Mytton
London DevOps - Oct 2014
blog.serverdensity.com
David Mytton
Server monitoring, cloud management, dashboards and alerting
serverdensity.com
Slides: twitter.com/davidmytton
Let’s talk about downtime
2013 Spend: ~$5bn
2013 Spend: ~$6bn
2013 Spend: ~$4bn
You will have downtime
How much do you spend?
Preparation
Preparation - On Call
● Primary?
● Secondary?
● Reachability - Tube, 3G/4G (edge?!), Do Not Disturb mode, at the gym, family emergency, system updates
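A minimal sketch of the primary/secondary idea on this slide: page the primary first, and fall through to the secondary if the primary cannot be reached (on the Tube, in Do Not Disturb mode, and so on). The names `Responder`, `escalate`, `page` and `max_attempts` are hypothetical and not from the talk; `page` stands in for whatever paging tool is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Responder:
    name: str
    role: str  # e.g. "primary" or "secondary" (labels are assumptions)


def escalate(
    responders: Iterable[Responder],
    page: Callable[[Responder, int], bool],
    max_attempts: int = 2,
) -> Optional[Responder]:
    """Page each responder in order and return the first one who acknowledges.

    `page(responder, attempt)` must return True if the responder acknowledged.
    Returns None if nobody could be reached.
    """
    for responder in responders:
        for attempt in range(1, max_attempts + 1):
            if page(responder, attempt):
                return responder
    return None
```

Passing the on-call list in priority order keeps the policy (who is primary, who is secondary) separate from the paging mechanism.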
Preparation - On Call
● Off call
● Rotations
● Illness
● Work the next day?
Preparation - Documentation
● Searchable
● Easy to edit
● Independent of your infrastructure
● Up to date
Preparation - Key Info
● Team contacts
● Key vendor contacts
● Credentials to key systems
Unexpected failures
● Communication systems
● Network connectivity
● Access to support
ALERT!
1. Load up incident response checklist
2. Log incident in JIRA
3. Log into Ops War Room
4. Public status post
5. Initial investigation
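The numbered steps on this slide can be sketched as an ordered checklist that also keeps a timestamped log of what happened. This is only an illustration: the `IncidentLog` class and its methods are hypothetical, and nothing here talks to JIRA or a status page.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Steps taken from the slide, in numbered order.
CHECKLIST = [
    "Load up incident response checklist",
    "Log incident in JIRA",
    "Log into Ops War Room",
    "Public status post",
    "Initial investigation",
]


class IncidentLog:
    """Tracks checklist progress and a timestamped incident timeline."""

    def __init__(self, steps: List[str]) -> None:
        self.steps = list(steps)
        self.completed = 0  # number of checklist steps finished so far
        self.entries: List[Tuple[datetime, str]] = []

    def note(self, message: str) -> None:
        """Append a free-form, timestamped entry to the timeline."""
        self.entries.append((datetime.now(timezone.utc), message))

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if self.completed < len(self.steps):
            return self.steps[self.completed]
        return None

    def complete(self, step: str) -> None:
        """Mark the next step as done; refuse steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed += 1
        self.note(f"DONE: {step}")
```

Keeping the timeline alongside the checklist means the record needed later for the postmortem is produced as a side effect of following the steps.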
Key response principles
● Log everything
● Frequent public status updates
● Gather the team
● Escalate!
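As a rough illustration of "frequent public status updates", the check below flags when the next update is due. The 30-minute default is an assumption for the example, not a figure from the talk, and `status_update_overdue` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def status_update_overdue(
    last_update: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=30),  # assumed cadence
) -> bool:
    """Return True once at least `interval` has passed since the last post."""
    return now - last_update >= interval
```

Both timestamps should be either timezone-aware or naive; mixing the two raises a `TypeError` in Python.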
Postmortem
● Within a few days
● Tell the story
● Provide technical detail
● Explain what failed and why
Postmortem
● How it’s going to be fixed
stspg.io/ZDC
Summary
● Preparation
● Communication
● Checklists
● Documentation
● Postmortem
Thank you very much
@davidmytton
blog.serverdensity.com
www.serverdensity.com