does sfo 2016 - kevina finn-braun & j. paul reed - beyond the retrospective: embracing...
TRANSCRIPT
KEVINA FINN-BRAUN, INTUIT
J. PAUL REED, RELEASE ENGINEERING APPROACHES
DEVOPS ENTERPRISE SUMMIT, 2016
BEYOND THE RETROSPECTIVE: EMBRACING COMPLEXITY ON THE ROAD TOWARDS SERVICE OWNERSHIP
KEVINA FINN-BRAUN
• Director of Product Infrastructure Service Management at Intuit
• Director of Site Reliability Service Management at Salesforce; Business Continuity at Yahoo
• Geeks out on group dynamics and behavior
• @kfinnbraun on Twitter
J. PAUL REED
• @jpaulreed on Twitter
• @shipshowpodcast alum
• Managing Partner, Release Engineering Approaches
• A “DevOps Consultant™”
• Master’s Candidate in Human Factors & Systems Safety
@jpaulreed@kfinnbraun #DOES2016
A QUICK RECAP FROM LAST DOES
“The Blameless Cloud: Bringing Actionable Retrospectives to SFDC” (DOES 2015)
NEW MARCHING ORDERS
“SERVICE OWNERSHIP?”
IT’S JUST WHAT SFDC CALLED “DEVOPS”
(SSHHH, DON’T TELL ANYONE)
WHICH FLAVOR OF DEVOPS WOULD YOU LIKE?
“BUT HOW DO WE DO ‘THE DEVOPS?’”
• Learned helplessness?
• Uncontrollable bad event
• Perceived lack of control
• Generalized helpless behavior
• Actually: Structural blindness
MAKING SENSE OF SERVICE OWNERSHIP
WORKSHOP SURPRISES!
• Understanding teams’ local rationality is key
• Words have meaning; meanings are important; but they aren’t necessarily shared
• Teams must be given space to deliver on transformations
• Teams can be “retrospective blind”
DEVOPS & NUCLEAR MELTDOWNS?
A NEW ADVENTURE
QuickBooks
TurboTax
Mint
FY 2016: $4.7B revenue
8,000 employees worldwide
Founded: 1983
IPO: 1993
Improving the financial lives of over 45 million customers
SOME DIFFERENT CHALLENGES
• Intuit not “born in the cloud”
• “Incidents” meant something different
• No “Bermuda Blob”
• (No blob at all!)
• Different business lifecycle
BUT SIMILAR CHALLENGES, TOO
• Inconsistencies in operational responses
• Postmortems centered around “The Old View” of human error
• Some incidents & remediations got lost in the shuffle
• Surprising amount of (aggregated) service impact due to P3s/P4s
• “What, exactly, is an ‘incident’?”
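The point about aggregated impact from P3s/P4s can be checked with a few lines of Python. This is only a sketch: the incident records below are invented for illustration, and the function names and the choice of "minutes of impact" as the measure are assumptions, not something taken from the talk.

```python
from collections import defaultdict

# Hypothetical incident records: (priority, minutes of customer-facing impact).
# These numbers are invented for illustration only.
INCIDENTS = [
    ("P1", 90),
    ("P2", 45),
    ("P3", 20), ("P3", 25), ("P3", 30), ("P3", 15),
    ("P4", 10), ("P4", 12), ("P4", 8), ("P4", 10), ("P4", 15),
]


def impact_by_priority(incidents):
    """Sum impact minutes per priority level."""
    totals = defaultdict(int)
    for priority, minutes in incidents:
        totals[priority] += minutes
    return dict(totals)


def low_priority_share(incidents, low=("P3", "P4")):
    """Fraction of total impact minutes that came from low-priority incidents."""
    totals = impact_by_priority(incidents)
    overall = sum(totals.values())
    if overall == 0:
        return 0.0
    return sum(totals.get(p, 0) for p in low) / overall
```

With the made-up data above, the many small P3/P4 incidents together account for slightly more impact than the single P1 and P2 combined, which is the kind of surprise the slide describes.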
“BLAMELESS” “POSTMORTEMS”?
• Brené Brown, research sociologist, on vulnerability
• “Blame is a way to discharge pain and discomfort”
• Postmortem has a heavy connotation
• “Awesome postmortems?” Really?!
• More at: http://jpaulreed.com/blame-aware-postmortems
[Grid: columns Novice, Beginner, Competent, Proficient, Expert; rows Language, Behaviors]
THE INCIDENT LIFECYCLE
• Incident Detection
• Incident Response
• Incident Analysis
• Incident Remediation
• Incident Prevention*
INCIDENT DETECTION
Levels: Novice, Beginner, Competent, Proficient, Expert
Language:
• “Problems with our service are obvious; outages are obvious.”
• “Other teams will notify us of any problems.”
• “Most of the time, we’re the first to know when a service is impacted.”
• “We use historical data to guess at service level changes.”
• “We’ve detected service level transitions via monitoring and reduced MTTD.”
• “I know which specific code/infra change caused this service level change; here’s how I know…”
• “We prioritize feature requests and bug reports to monitoring hooks; monitoring is a 1st class citizen.”
• “We’ve decoupled code/infra deployment, because we can roll back/forward.”
• “We’re not paged anymore for changes automation can react to.”
Behaviors:
• Manual and/or external outage notifications.
• No baseline metrics/service levels are broadly bucketed.
• External monitoring is in place to detect real time service transitions.
• Notifications are automated.
• External infra/API endpoints/outward-facing interfaces monitored/recorded.
• Historical data exists and has been used to establish graduated service baselines.
• Application internals report data to the monitoring system.
• Monitoring systems employ deep statistical methods to (dis)prove service anomalies.
• Monitoring output is reincorporated into operational behavior in an automated fashion.
• Anomalies no longer result in defined “incidents.”
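The detection behaviors above mention historical baselines, statistical checks for anomalies, and MTTD. A minimal Python sketch of those ideas follows. It assumes a plain z-score test as a simple stand-in for the "deep statistical methods" on the slide; the function names, the threshold of 3.0, and the input shapes are illustrative assumptions, not part of the talk or the model.

```python
from datetime import datetime
from statistics import mean, stdev


def baseline(history):
    """Return (mean, sample standard deviation) of historical samples."""
    return mean(history), stdev(history)


def is_anomalous(value, history, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


def mean_time_to_detect(incidents):
    """Average seconds between impact start and detection.

    `incidents` is a list of (started, detected) datetime pairs.
    """
    gaps = [(detected - started).total_seconds() for started, detected in incidents]
    return mean(gaps)
```

For example, with a hypothetical latency history of `[100, 102, 98, 101, 99]`, a reading of 101 stays within the baseline while a reading of 110 is flagged. Real monitoring systems would usually account for seasonality and trends, which this sketch ignores.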
INCIDENT RESPONSE
Levels: Novice, Beginner, Competent, Proficient, Expert
Language:
• “Have you tried turning it off and turning it on again?”
• “Something is wrong with the X…”
• “I think X is familiar with Y; let’s find them.”
• “I think there’s a problem with the database, network, etc.”
• Standard Incident Management System language used.
• “The deployment caused the database to hang…”
• “The infrastructure on-calls: perform a system status & report back to the IC.”
• Entire team is familiar with standardized IMS language.
• Standardized IMS language is used/valued by the entire team.
• “What parts of the service did not ‘self-heal’ and need attention?”
Behaviors:
• Team is event-focused; the team is “alarmed” by incidents.
• Inconsistent response once incident has commenced.
• Response based on “tribal knowledge.”
• Team is area-focused.
• Team is action-focused.
• Team has identified incident “responders,” and those people know their duties.
• Team is technology-focused.
• Incident response is an aspect of org and team “culture.”
• Incidents are embraced, but outside-business hours or repeated incidents are considered inhumane.
• Team is systems-focused.
INCIDENT ANALYSIS
Levels: Novice, Beginner, Competent, Proficient, Expert
Language:
• Novice: “Incidents are bad; my job is on the line.” “I’m getting sent to the principal’s office because of this outage.”
• Beginner: “Let’s fix this as fast as possible.” “What’s the correct fix to avoid this specific issue in the future?”
• Competent: “Let’s review the timeline/incident report to answer that.” “We need to find the root cause of this incident.”
• Proficient: “Now that we’ve established what happened, how did it happen?” “How did these multiple factors influence our complex system?”
• Expert: “How does our team/system contribute to our successes?” “What can we incorporate from this incident to better respond next time?”
Behaviors:
• Novice: Completes the post-incident “paperwork.” No formal retrospective/hallway retrospectives.
• Beginner: Some information (inconsistently) recorded. Jumps to a focus on why.
• Competent: Follows the prescribed format for retrospectives. Possesses and incorporates complete dataset for the incident into the retrospective.
• Proficient: Identifies inherent bias in self and others. Perspectives solicited from all involved team members/functional groups.
• Expert: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias. Retrospective outcomes are fed back into the system and prioritized.
INCIDENT REMEDIATION
Levels: Novice, Beginner, Competent, Proficient, Expert
Language:
• “Let’s just file a ticket to track the issue.”
• “I am sure this is the issue; the fix will correct 100% of the occurrences.”
• “I’m pretty sure we already fixed this?”
• “We need an action plan to address the process gaps.”
• “This needs to be fixed in the next release and documented in our incident response docs.”
• “We need to look deeper than this specific incident to really address the problem.”
• “What can we learn from this incident?”
• “What other system aspects have we learned from this incident? How can we use that?”
• “While operating our system today, how did we actively create & sustain success?”
Behaviors:
• Remediation activities (or lack thereof) contribute to a “break-fix” cycle.
• Discussions of the incident are aggressive/blameful.
• “Low hanging fruit” may be fixed, but not documented or incorporated into team behavior.
• More processes, more procedures, more rules.
• Issues of all sizes are actively managed.
• Issues have a priority and teams have bandwidth to address them.
• Completed issue remediation is valued by the org.
• Bandwidth exists to discuss, design and implement resiliency improvements.
• Remediation is not regarded as a separate activity & is culturally integrated into work.
• Resilience is considered in the design phase for new infra/software.
INCIDENT PREVENTION*
Levels: Novice, Beginner, Competent, Proficient, Expert
Language:
• “Preventing future incidents is difficult because of lacking data.”
• “We can use predictive metrics to completely avoid future incidents.”
• “Our system has reasonable coverage of its metrics.”
• “We use metrics to inform attack/risk surface.”
• “We use trend analysis to raise ‘soft’ problems to operators.”
• “Old documentation is problematic and dealt with accordingly.”
• “When we started game days, it was a real mess.”
• “We now care less about specific incidents & more about crew formation.”
• “The team is excited about game days.”
• “Our crews care about their formation and dissolution.”
Behaviors:
• Prevention efforts include documentation, process design, metrics collection.
• Retrospective focus is on static causes/effects.
• Retrospectives include discussions of active operator behaviors.
• Docs, process, metrics established, but < 100%.
• Preventative focus is on reviewing docs+process+metrics collection, but in a day-to-day context.
• Retrospectives focus on the response of the team to an incident.
• We actively inject failure into our systems on a known schedule, to drill.
• We review our response to induced failures.
• The crew formation/dissolution process is considered our primary role+responsibility in addressing and preventing operational failure.
• We actively inject failure at random intervals.
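The prevention behaviors distinguish injecting failure "on a known schedule" from injecting it "at random intervals." The difference can be sketched in Python as two ways of generating drill times. Everything here is a hypothetical illustration: the function names, the use of days as the time unit, and the choice of exponentially distributed gaps for the random case are assumptions, not something the talk prescribes.

```python
import random


def fixed_schedule(start, interval, count):
    """Drill times on a known schedule: start, start + interval, ..."""
    return [start + i * interval for i in range(count)]


def random_schedule(start, mean_interval, count, rng=None):
    """Drill times separated by exponentially distributed random gaps.

    Pass a seeded `random.Random` as `rng` to make the schedule reproducible.
    """
    rng = rng or random.Random()
    times = []
    t = start
    for _ in range(count):
        # expovariate takes a rate (1 / mean gap) and returns a gap >= 0.
        t += rng.expovariate(1.0 / mean_interval)
        times.append(t)
    return times
```

A weekly game day would be `fixed_schedule(0, 7, 4)`; `random_schedule(0, 7, 4)` keeps the same average cadence but removes the team's ability to anticipate the next drill.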
HELP US MAKE IT BETTER!
https://github.com/preed/incident-lifecycle-model
FACILITATE TEAMS EXPLORING THEIR DISCRETIONARY SPACE
INCIDENT RESPONSE != INCIDENT MANAGEMENT
(YOUR INCIDENT VALUE STREAM MATTERS)
YOU ARE NEVER DONE.
YOU. ARE. NEVER. DONE.
AVENUES FOR COLLABORATION
• Take a look at the extended incident lifecycle model and your organization: see where it fits and doesn’t!
• (And then send us GitHub pull requests!)
• Compare your own (documented?) incident life cycle against your actual incident value stream; share what you find!
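One way to start comparing a documented incident life cycle against the actual incident value stream is to replay an incident's timeline and see where the time went. The Python sketch below assumes a hypothetical timeline format (a time-ordered list of phase names and start times ending in a closing marker); the phase names come from the lifecycle in this talk, while the function names and data shape are illustrative assumptions.

```python
from datetime import datetime

# The five phases named in the talk's incident lifecycle.
DOCUMENTED_PHASES = ["detection", "response", "analysis", "remediation", "prevention"]


def phase_durations(transitions):
    """Minutes spent in each phase.

    `transitions` is a time-ordered list of (phase, start_time) pairs whose
    final entry marks the end of the incident (for example ("closed", time)).
    """
    durations = {}
    for (phase, start), (_, end) in zip(transitions, transitions[1:]):
        minutes = (end - start).total_seconds() / 60
        durations[phase] = durations.get(phase, 0.0) + minutes
    return durations


def lifecycle_gaps(documented, transitions):
    """Phases that were documented but never observed, and the reverse."""
    observed = [phase for phase, _ in transitions[:-1]]
    return {
        "skipped": [p for p in documented if p not in observed],
        "undocumented": [p for p in observed if p not in documented],
    }
```

Undocumented steps with long durations (waiting on an approval, for instance) and documented phases that never show up in real timelines are both worth sharing.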
Kevina Finn-Braun: http://lnkdin.me/kevinafinnbraun
J. Paul Reed: http://jpaulreed.com