The Incident Lifecycle at New Relic
Step 1: Don’t Panic
Nate Heinrich, Product Manager
©2008-15 New Relic, Inc. All rights reserved.
This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.
Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import.
Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time.
Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.
Experience with un-amazing
Background: Web Development and IT Operations
Product Manager: Focus on New Relic’s operations capabilities, especially Alerts
Common un-amazing conversations
Sys admins:
▪ “Own” 24 apps they didn’t write
▪ Primary on-call
▪ Restart app
▪ Find authors or equivalent, create a phone bridge…
“I’m a software company?”:
▪ Technology is a cost center
▪ Speed (to deploy) and performance are undervalued
▪ Monitoring is a luxury
Three areas of investment to achieve an awesome incident lifecycle
Incident timeline
[Timeline figure: Root cause, Hits production, Detect, Escalate, Mitigate, Resolve, Action items, Retrospect]
[Incident duration: Detect, escalate; Resolve]
Three areas of investment
Pre-incident: Culture
Incident: Routines
Post-incident: Priority
Tangent: This is probably related…
Intentional: Software does what you tell it to
Immoral: Amoral servers aren’t running your code
Instantaneous: Slow degradation doesn’t get attention, only alerts affecting you now
Imminent: Works in production today, possible it will work tomorrow
Three areas of investment
Not all engineers have the same HA experience.
HA engineering is not trivial and can be difficult to approach.
Culture
Availability Medal Progression
Level 1 (bronze): Know Where You Are
Level 2 (silver): Keep Your Software Running
Level 3 (gold): Risks Are Fixed
Level 4 (platinum): Improve Availability Programmatically
Level 1 (bronze): Know Where You Are
▪ Basic monitoring
▪ Documented risk matrix
[Risk matrix figure: Likelihood (Low, Medium, High) against Impact (Low, Medium, High), with each risk marked by an X in its cell]
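A documented risk matrix can be kept as plain data so it is easy to review and re-render. The following is only a minimal sketch under stated assumptions: the `Rating`, `Risk`, `build_matrix`, and `render` names and the three sample risks are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: Rating
    impact: Rating


def build_matrix(risks):
    """Group risk names into cells keyed by (impact, likelihood)."""
    matrix = {(imp, lik): [] for imp in Rating for lik in Rating}
    for risk in risks:
        matrix[(risk.impact, risk.likelihood)].append(risk.name)
    return matrix


def render(matrix):
    """Return a text grid: impact rows (High first), likelihood columns."""
    lines = ["Impact | " + " | ".join(lik.name.title() for lik in Rating)]
    for imp in sorted(Rating, reverse=True):
        # One X per risk in the cell, "-" for an empty cell.
        cells = ["X" * len(matrix[(imp, lik)]) or "-" for lik in Rating]
        lines.append(f"{imp.name.title():<6} | " + " | ".join(cells))
    return "\n".join(lines)


# Hypothetical risks, invented for illustration only.
risks = [
    Risk("single database primary", Rating.LOW, Rating.HIGH),
    Risk("disk fills on log host", Rating.HIGH, Rating.LOW),
    Risk("expired TLS certificate", Rating.MEDIUM, Rating.MEDIUM),
]
matrix = build_matrix(risks)
print(render(matrix))
```

Keeping the matrix as data rather than a drawing means the cell for any (impact, likelihood) pair can be queried directly, for example to list everything currently rated high on both axes.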
Level 2 (silver): Keep Your Software Running
▪ Build a culture where service status is widely known
▪ Advanced monitoring (observe issues early)
▪ Engage early
▪ Actionable data
Level 3 (gold): Risks Are Fixed
▪ Zero “high highs”
▪ Recurring gamedays
▪ Upstream and downstream impacts known
Level 4 (platinum): Improve Availability Programmatically
▪ Programmatic mitigation
– Auto-scaling
– Auto app instance killing
– Retries & circuit breakers
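To make "retries & circuit breakers" concrete, here is a minimal sketch of one way the two can be combined. It is an illustration under assumptions, not the implementation used at New Relic: `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all thresholds and delays are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and a call is rejected untried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(breaker, func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep hitting a dependency the breaker gave up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retries absorb brief, transient failures; the breaker stops the retries from piling more load onto a dependency that is already down, then lets a single trial call through once the cool-down has passed.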
Outcome
▪ Clear path for teams
▪ Assistance along the way
▪ Teams know where they stand
▪ Aggregation across teams helps management
Nrrd Incident Orchestration
Communication frequency and consistency.
Clear roles and “torch passing”.
Moar ChatOps!
– Nrrd (Hubot)
– HipChat
– Incident / Retro tracking tool (internal)
Routines
Nrrd
▪ Democratize incident creation
▪ Managed role assignment
▪ Timed status updates
▪ Statuses saved as incident log for retros
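The Nrrd behaviors listed above (anyone can open an incident, roles are assigned and handed over, status updates are timed, and statuses become the retro log) can be sketched as a small data model. This is a hypothetical illustration, not Nrrd's actual code: the `Incident` class, its method names, the "incident commander" role, and the 15-minute interval are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    title: str
    opened_by: str  # anyone may open an incident
    opened_at: datetime
    update_interval: timedelta = timedelta(minutes=15)  # assumed cadence
    roles: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # (time, kind, note) entries

    def assign(self, role, person, at):
        """Assign or hand over a role and record the change in the log."""
        previous = self.roles.get(role)
        self.roles[role] = person
        if previous is None:
            note = f"{role}: {person}"
        else:
            note = f"{role}: {previous} -> {person}"
        self.log.append((at, "role", note))

    def post_status(self, author, text, at):
        """Record a status update; the log doubles as the retro timeline."""
        self.log.append((at, "status", f"{author}: {text}"))

    def status_overdue(self, now):
        """True when no status was posted within update_interval."""
        times = [at for at, kind, _ in self.log if kind == "status"]
        last = max(times, default=self.opened_at)
        return now - last >= self.update_interval


# Hypothetical usage with invented names and times.
t0 = datetime(2015, 1, 1, 12, 0)
inc = Incident("checkout errors", "alice", t0)
inc.assign("incident commander", "alice", t0)
inc.post_status("alice", "investigating", t0 + timedelta(minutes=5))
inc.assign("incident commander", "bob", t0 + timedelta(minutes=10))
print(inc.status_overdue(t0 + timedelta(minutes=25)))
```

A chat bot built on a model like this would poll `status_overdue` and prompt the current role holder in the channel, so the communication cadence does not depend on anyone remembering it mid-incident.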
Reliability PM & Unified Work Stream
Retro items all had the same priority.
Larger availability initiatives can’t compete.
Don’t Repeat Incidents Policy
Immediate Retro Actions: First merges to master post-incident
Longer Term Holistic Actions: Tracked in the same place as feature work
Things I didn’t talk about…
▪ Code, server and app ownership
▪ Disaster recovery exercises
▪ The written culture
▪ Chaos nerds
▪ Hiring & Incentives
▪ GameDays
▪ Deep dive on tooling
▪ Security and availability
▪ Side-kicking
Culture, Routines, Priorities
Investing in your culture, your routines, and how you prioritize is essential
Awesome!