flight training for devops & humanops - incontrodevops 2016
Jorge Salamero Sanz
IncontroDevOps 1 April 2016
War Games - Flight training for DevOps
● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle
● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, Incident response
Humans in DevOps
● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
Human issues = system issues
● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
Humans impact business
● Power failure to half of our servers
● Automated failover unavailable (known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
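The gap between the expected and actual impact above can be worked out as a small sketch. Only the two durations come from the slide; the variable names are illustrative.

```python
from datetime import timedelta

# Figures from the incident slide.
expected = timedelta(minutes=20)
actual = timedelta(minutes=43)

# How much longer the outage lasted than planned.
overrun = actual - expected

# Dividing one timedelta by another yields a plain float ratio.
ratio = actual / expected

print(f"Overrun: {overrun.total_seconds() / 60:.0f} min")
print(f"Actual vs expected: {ratio:.2f}x")
```

The outage ran 23 minutes over, a little more than twice the expected impact.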
● Unfamiliarity with the process
● Pressure of a time-sensitive event (panic effect)
● Escalation introduces delays
The Human Factor
● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
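One way to make the response process above trackable is a small checklist object. This is a minimal sketch, not the speaker's tooling: the `ChecklistRun` class and its methods are hypothetical names, and only the step texts come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ChecklistRun:
    """Tracks which steps of a response checklist are done, and when."""
    steps: List[str]
    completed: Dict[str, datetime] = field(default_factory=dict)

    def complete(self, step: str) -> None:
        # Reject typos so the record only contains real checklist steps.
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.completed[step] = datetime.now(timezone.utc)

    def remaining(self) -> List[str]:
        # Steps still open, in checklist order.
        return [s for s in self.steps if s not in self.completed]


# Step texts taken from the slide.
RESPONSE_STEPS = [
    "Acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]

run = ChecklistRun(RESPONSE_STEPS)
run.complete("Acknowledge alert")
run.complete("Log incident into JIRA")
print(run.remaining())
```

Steps can be completed in any order, so the responder is reminded of what is still open without being forced through a fixed sequence.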
● The “limits of human memory and attention”
  ○ Complexity
  ○ Stress and fatigue
  ○ Ego
● Pilots, doctors, divers: Bruce Willis Ruins All Films (BCD, weights, releases, air, final)
Pre-flight checklists
1. Extended use of checklists
2. Not to follow blindly, use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures
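A searchable list of known issues could look roughly like the sketch below. The `KnownIssue` type and `search` function are hypothetical. The first entry paraphrases the incident from the earlier slide, and the second is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownIssue:
    title: str
    symptoms: str
    workaround: str


def search(issues: List[KnownIssue], query: str) -> List[KnownIssue]:
    """Return issues whose text contains every word of the query (case-insensitive)."""
    terms = query.lower().split()
    results = []
    for issue in issues:
        text = f"{issue.title} {issue.symptoms} {issue.workaround}".lower()
        if all(term in text for term in terms):
            results.append(issue)
    return results


ISSUES = [
    # Paraphrased from the incident example slide.
    KnownIssue(
        title="Automated failover unavailable",
        symptoms="Power failure to half of our servers",
        workaround="Manual DNS switch required",
    ),
    # Invented entry, for illustration only.
    KnownIssue(
        title="Log host out of disk space",
        symptoms="Low disk space alerts on the log host",
        workaround="Rotate and compress old logs",
    ),
]

for issue in search(ISSUES, "failover dns"):
    print(f"{issue.title}: {issue.workaround}")
```

Because the slide asks for an independent system, a store like this would need to live somewhere that stays reachable when the main infrastructure is down.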
● Replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
Realistic scenarios: War Games
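The "mock command line" and "record actions and timing" ideas above can be combined in a few lines. This is a sketch under assumptions: `MockShell`, `Action`, and the canned outputs are all hypothetical, and nothing here touches a real system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    command: str
    output: str
    elapsed_s: float  # seconds since the drill started


class MockShell:
    """Mock command line: returns canned output and records every action."""

    def __init__(self, canned: Dict[str, str],
                 clock: Callable[[], float] = time.monotonic):
        self._canned = canned
        self._clock = clock
        self._start = clock()
        self.log: List[Action] = []

    def run(self, command: str) -> str:
        # Unknown commands get a generic error rather than raising,
        # so the drill keeps going and the mistake is still recorded.
        output = self._canned.get(command, "command not found")
        self.log.append(Action(command, output, self._clock() - self._start))
        return output


# Made-up scenario: the DNS lookup returns an unexpected, empty answer.
scenario = {
    "dig +short www.example.com": "",
    "ping -c 1 192.0.2.10": "1 packets transmitted, 0 received",
}

shell = MockShell(scenario)
shell.run("dig +short www.example.com")
shell.run("ping -c 1 192.0.2.10")

for action in shell.log:
    print(f"{action.elapsed_s:6.1f}s  {action.command!r} -> {action.output!r}")
```

The recorded log gives the review step something concrete to go through: which commands were tried, in what order, and how long each decision took.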
● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Results
● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Human results
● Review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat
1. Humans are part of the system
2. Humans impact systems
3. Humans impact business
4. Human issues count as system issues
Human Ops principles