flight training for devops & humanops - incontrodevops 2016

Jorge Salamero Sanz <[email protected]> IncontroDevOps 1 April 2016 War Games - Flight training for DevOps



How to Monitor MySQL

● Infrastructure automation

● Configuration automation

● Continuous testing

● Continuous deployment / delivery

● Monitoring

● Logs, error handling

● Feedback

● Human Ops

DevOps lifecycle

● Humans are part of any system

● Initial design, ongoing improvements

● Maintenance

● Upgrades

● Issues, Incident response

Humans in DevOps

● System issues = error rates + SLA + ...

● Human issues = alerts out of hours + interruptions + ...

● System issues = Human issues

Human issues = system issues

● System health impacts human health

● Human health impacts system health

Humans impact systems

● Downtime = loss of users, reputation, revenue

● Downtime caused by unreliable systems

● Unhealthy teams reduce reliability

● Unhealthy teams = loss of users, reputation, revenue

Humans impact business

● Slip

● Lapse

● Mistake

● Violation

● (Always, again, again)

Human risk

● Prepare and practice

● Respond

● Postmortem

Expect downtime

Real example

(small war story, won’t be long)

● Power failure to half of our servers

● Automated failover unavailable (known failure condition)

● Manual DNS switch required

● Expected impact: 20 min

● Actual impact: 43 min

Incident example
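The gap between the expected and actual impact can be made explicit with a little arithmetic. This is a minimal sketch using only the two figures from the slide; the variable names are illustrative.

```python
# Figures from the incident slide: planned vs. observed impact, in minutes.
expected_min = 20
actual_min = 43

overrun_min = actual_min - expected_min      # extra minutes of impact
overrun_ratio = actual_min / expected_min    # actual as a multiple of expected

print(f"Overrun: {overrun_min} min ({overrun_ratio:.2f}x the expected impact)")
```

The incident ran 23 minutes over, a little more than twice the expected impact.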

Lesson learned?

● Unfamiliarity with the process

● Pressure of a time-sensitive event (panic effect)

● Escalation introduces delays

The Human Factor

Handling the Human factor

● First responder, acknowledge alert

● Load incident response checklist

● Log into #ops-war-room in Slack

● Log incident into JIRA

● Begin investigation

General response process
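The response steps above could be kept as data and ticked off during an incident, so that each action is timestamped for the postmortem. This is a hedged sketch, not the tooling used in the talk: the `ChecklistRun` class and its method names are hypothetical, and the step wording is adapted from the slide.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Steps adapted from the "General response process" slide.
RESPONSE_STEPS = [
    "Acknowledge the alert (first responder)",
    "Load the incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log the incident in JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    """One pass through a checklist, recording when each step was ticked off."""

    steps: List[str]
    clock: Callable[[], float] = time.monotonic  # injectable so drills can fake time
    done: List[Tuple[str, float]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._started = self.clock()

    def tick(self, step: str) -> None:
        """Mark a step as done and store the seconds elapsed since the start."""
        if step not in self.steps:
            raise ValueError(f"not on the checklist: {step!r}")
        if any(name == step for name, _ in self.done):
            raise ValueError(f"already ticked: {step!r}")
        self.done.append((step, self.clock() - self._started))

    def remaining(self) -> List[str]:
        """Steps not yet ticked, in checklist order."""
        ticked = {name for name, _ in self.done}
        return [s for s in self.steps if s not in ticked]


run = ChecklistRun(RESPONSE_STEPS)
run.tick("Acknowledge the alert (first responder)")
print(run.remaining())
```

The sketch deliberately does not enforce step order: the responder can tick steps in whatever order the situation demands, and the log still shows what happened and when.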

1. Extended use of checklists

Documented procedures

● The “limits of human memory and attention”

  ○ Complexity

  ○ Stress and fatigue

  ○ Ego

● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)

Pre-flight checklists

1. Extended use of checklists

2. Not to be followed blindly; use knowledge and experience

3. Independent system

4. Searchable

5. List of known issues and documented workarounds/fixes

Documented procedures
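A searchable list of known issues and workarounds can be as small as a list of records plus a keyword filter. The sketch below is an assumption about how such a list might look, not the system described in the talk: `KnownIssue` and `search` are hypothetical names, the first entry paraphrases the incident from the earlier slide, and the second entry is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KnownIssue:
    title: str
    symptoms: str
    workaround: str


KNOWN_ISSUES = [
    KnownIssue(  # paraphrased from the incident example in this deck
        title="Power failure with automated failover unavailable",
        symptoms="half of the servers lose power; failover does not trigger",
        workaround="Manual DNS switch",
    ),
    KnownIssue(  # hypothetical entry, for illustration only
        title="Disk full on a log volume",
        symptoms="writes fail; service returns errors",
        workaround="Rotate or archive old logs, then restart the service",
    ),
]


def search(issues: List[KnownIssue], query: str) -> List[KnownIssue]:
    """Return issues whose text contains every word of the query (case-insensitive)."""
    terms = query.lower().split()

    def haystack(issue: KnownIssue) -> str:
        return f"{issue.title} {issue.symptoms} {issue.workaround}".lower()

    return [issue for issue in issues if all(t in haystack(issue) for t in terms)]


for hit in search(KNOWN_ISSUES, "failover power"):
    print(f"{hit.title} -> {hit.workaround}")
```

Keeping the list as plain data also makes it easy to host on a system independent of production, in line with point 3.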

● Replica environment

● or mock command line

● Record actions and timing

● Multiple failures

● Unexpected results

Realistic scenarios: War Games
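When a replica environment is not available, the "mock command line" option can be approximated with a small shell stand-in that returns canned output and records each action with its timing. This is only a sketch under stated assumptions: `MockShell`, the command names, and the canned outputs are all hypothetical and do not correspond to any real tool.

```python
import time
from typing import Callable, Dict, List, Tuple


class MockShell:
    """Stand-in for a command line: returns canned output and logs every action."""

    def __init__(self, canned: Dict[str, str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.canned = canned
        self.clock = clock
        self.started = clock()
        # (seconds since start, command, output) for the post-drill review
        self.transcript: List[Tuple[float, str, str]] = []

    def run(self, command: str) -> str:
        command = command.strip()
        output = self.canned.get(
            command, f"mock: no canned output for {command!r}"
        )
        self.transcript.append((self.clock() - self.started, command, output))
        return output


# Hypothetical scenario: commands and outputs are invented for the drill.
# The second entry models an "unexpected result" the responder must handle.
SCENARIO = {
    "status primary": "unreachable",
    "failover auto": "error: automated failover unavailable",
    "dns switch secondary": "ok: traffic now routed to secondary",
}

shell = MockShell(SCENARIO)
shell.run("status primary")
shell.run("failover auto")
shell.run("dns switch secondary")
for elapsed, command, output in shell.transcript:
    print(f"{elapsed:8.3f}s  $ {command}  ->  {output}")
```

The transcript gives the review session a timeline of what was tried and how long each decision took.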

Results

● Team and individual test of response

● Run real commands

● Training the people

● Training the procedures

● Training the tools

Results

● Increase confidence

● Reduce panic

● Better coordination

● Trust relationships

● Improves time to resolution

Humans results

● Review

● Suggestions for improvements

● Do it again

● Scenario evolves

● People forget

loop(): review and repeat

● On call rotation design

● Alert prioritization

● Notification optimization

What else?

Human Ops

1. Humans are part of the system

2. Humans impact systems

3. Humans impact business

4. Human issues count as system issues

Human Ops principles

meetup.com/humanops-london/

Human Ops Meetup

www.CloudStatusApp.com

Jorge Salamero Sanz

Chief Developer Advocate

@bencerillo

@serverdensity

Our DevOps stories: blog.serverdensity.com