Webinar - Data driven postmortems - Jason Yee

DATA-DRIVEN POSTMORTEMS JASON YEE, DATADOG @GITBISECT

Upload: codemotion

Post on 16-Mar-2018

TRANSCRIPT

Page 1: Webinar - Data driven postmortems - Jason Yee

DATA-DRIVEN POSTMORTEMS

JASON YEE, DATADOG @GITBISECT

Page 2: Webinar - Data driven postmortems - Jason Yee

“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.” - Henry Ford

TW: @gitbisect @datadoghq

Page 3: Webinar - Data driven postmortems - Jason Yee

@gitbisect

▸ Technical Writer/Evangelist

▸ “Docs & Talks”

▸ Travel Hacker & Whiskey Hunter

@datadoghq

▸ SaaS-based monitoring

▸ Trillions of data points per day

▸ http://jobs.datadoghq.com

Page 4: Webinar - Data driven postmortems - Jason Yee

“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”

Internal Datadog Developer Guide

Page 5: Webinar - Data driven postmortems - Jason Yee

BLAMELESS POSTMORTEMS

Page 6: Webinar - Data driven postmortems - Jason Yee

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

WHAT IS DEVOPS?

▸ Culture

▸ Automation

▸ Metrics

▸ Sharing

Page 9: Webinar - Data driven postmortems - Jason Yee

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

OUR FOCUS AREA

▸ Culture

▸ Sharing

Page 11: Webinar - Data driven postmortems - Jason Yee

CULTURE & SHARING RESOURCES

BLAMELESS POSTMORTEMS

▸ Blameless Postmortems by John Allspaw

http://bit.ly/etsy-blameless

▸ The Human Side of Postmortems by Dave Zwieback

http://bit.ly/human-postmortem

Page 12: Webinar - Data driven postmortems - Jason Yee

CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT METRICS?

Page 14: Webinar - Data driven postmortems - Jason Yee

COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE

SO INSTRUMENT ALL THE THINGS!

Page 15: Webinar - Data driven postmortems - Jason Yee

4 QUALITIES OF GOOD METRICS

NOT ALL METRICS ARE EQUAL

Page 16: Webinar - Data driven postmortems - Jason Yee

1. WELL UNDERSTOOD

Page 19: Webinar - Data driven postmortems - Jason Yee

2. SUFFICIENT GRANULARITY

Page 20: Webinar - Data driven postmortems - Jason Yee

▸ 1-second granularity: peak 46%

▸ 1-minute granularity: peak 36%

▸ 5-minute granularity: peak 12%
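
The figures above show the same spike looking smaller as the sampling interval widens. A minimal sketch of why, assuming the coarser views are plain averages of one-second samples (an assumption; the slides do not say how the data was rolled up). The CPU series and the `peak_at_granularity` helper are hypothetical and do not reproduce the slide's exact numbers.

```python
from statistics import mean

def peak_at_granularity(samples, window):
    """Average consecutive `window`-sized buckets and return the highest bucket average."""
    buckets = [samples[i:i + window] for i in range(0, len(samples), window)]
    return max(mean(bucket) for bucket in buckets)

# Hypothetical per-second CPU utilisation (%) over 5 minutes:
# a flat 10% baseline with one 20-second burst at 46%.
cpu = [10.0] * 300
cpu[100:120] = [46.0] * 20

peak_1s = peak_at_granularity(cpu, 1)    # raw one-second samples
peak_1m = peak_at_granularity(cpu, 60)   # one-minute averages
peak_5m = peak_at_granularity(cpu, 300)  # a single five-minute average

print(peak_1s, peak_1m, peak_5m)
```

With this synthetic series the burst is fully visible at one second, roughly halved at one minute, and barely above baseline at five minutes, which is the pattern the slide illustrates.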

Page 23: Webinar - Data driven postmortems - Jason Yee

3. TAGGED & FILTERABLE

Page 27: Webinar - Data driven postmortems - Jason Yee

Query Based Monitoring

“What’s the average throughput of application:nginx per version?”

“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”
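
Questions like the two above amount to filtering data points by tag and then grouping by another tag. A rough sketch in plain Python, assuming each data point carries a dictionary of key:value tags. The data, the metric name, and the `filter_points` and `average_by` helpers are hypothetical and are not Datadog's query API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points; every value and tag here is made up for illustration.
points = [
    {"metric": "requests_per_s", "value": 120.0,
     "tags": {"application": "nginx", "version": "1.12", "region": "us-west-1"}},
    {"metric": "requests_per_s", "value": 80.0,
     "tags": {"application": "nginx", "version": "1.12", "region": "us-east-1"}},
    {"metric": "requests_per_s", "value": 200.0,
     "tags": {"application": "nginx", "version": "1.13", "region": "us-west-1"}},
    {"metric": "requests_per_s", "value": 100.0,
     "tags": {"application": "nginx", "version": "1.13", "region": "us-east-1"}},
    {"metric": "requests_per_s", "value": 55.0,
     "tags": {"application": "postgresql", "version": "9.6", "region": "us-west-1"}},
]

def filter_points(points, **required_tags):
    """Keep only the points whose tags match every required key:value pair."""
    return [p for p in points
            if all(p["tags"].get(k) == v for k, v in required_tags.items())]

def average_by(points, tag_key):
    """Group points by the value of one tag and average each group."""
    groups = defaultdict(list)
    for p in points:
        groups[p["tags"].get(tag_key)].append(p["value"])
    return {key: mean(values) for key, values in groups.items()}

# "What's the average throughput of application:nginx per version?"
nginx = filter_points(points, application="nginx")
throughput_by_version = average_by(nginx, "version")
print(throughput_by_version)
```

The same two helpers answer the region comparison by filtering on `application="postgresql"` and grouping by `"region"`; without tags on the points, neither question can be asked after the fact.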

Page 28: Webinar - Data driven postmortems - Jason Yee

4. LONG-LIVED

Page 29: Webinar - Data driven postmortems - Jason Yee

METRICS 101

HOW LONG?

▸ AWS CloudWatch: Up to 15 months at 1h granularity

▸ MS Azure Monitoring Service: Up to 90d at 1d granularity

▸ Google Stackdriver: Up to 6 weeks at 1m granularity

▸ Datadog: Up to 15 months at 1s granularity

Page 33: Webinar - Data driven postmortems - Jason Yee

P.S. - June 1! Mark your calendar!

Page 34: Webinar - Data driven postmortems - Jason Yee

RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES

Page 35: Webinar - Data driven postmortems - Jason Yee

There is no singular “Root Cause”

Page 36: Webinar - Data driven postmortems - Jason Yee

HUMAN ELEMENT

TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES

Page 37: Webinar - Data driven postmortems - Jason Yee

IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

Page 38: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA COLLECTION: WHO?

▸ Everyone!

▸ Responders

▸ Identifiers

▸ Affected Users

Page 39: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA COLLECTION: WHAT?

▸ Their perspective

▸ What they did

▸ What they thought

▸ Why they thought/did it

Page 40: Webinar - Data driven postmortems - Jason Yee

“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”

RICHARD GUINDON

Page 41: Webinar - Data driven postmortems - Jason Yee

TELLING STORIES

“A PICTURE IS WORTH A THOUSAND WORDS” - ANCIENT PROVERB

Page 42: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA COLLECTION: WHEN?

▸ As soon as possible.

▸ Memory drops sharply within 20 minutes

▸ Susceptibility to “false memory” increases

▸ Get your project managers involved!

Page 43: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Stress

▸ Sleep deprivation

▸ Burnout

Page 44: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Blame

▸ Fear of punitive action

Page 45: Webinar - Data driven postmortems - Jason Yee

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Bias

▸ Anchoring

▸ Hindsight

▸ Outcome

▸ Availability (Recency)

▸ Bandwagon Effect

Page 46: Webinar - Data driven postmortems - Jason Yee

HOW WE DO POSTMORTEMS AT DATADOG

Page 47: Webinar - Data driven postmortems - Jason Yee

DATADOG POSTMORTEMS

A FEW NOTES

▸ Postmortems are emailed company-wide

▸ Scheduled recurring postmortem meetings

Page 48: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE (1/5)

SUMMARY: WHAT HAPPENED?

▸ Describe what happened here at a high level; think of it as an abstract in a scientific paper.

▸ What was the impact on customers?

▸ What was the severity of the outage?

▸ What components were affected?

▸ What ultimately resolved the outage?

Page 51: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE (2/5)

HOW WAS THE OUTAGE DETECTED?

▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.

▸ Did we have a metric that showed the outage?

▸ Was there a monitor on that metric?

▸ How long did it take for us to declare an outage?
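
The detection questions above come down to a few timestamps. A minimal sketch of turning them into numbers, assuming you can recover when the metric first looked abnormal, when a monitor alerted, and when the outage was declared. The field names and times are hypothetical.

```python
from datetime import datetime

def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60

# Hypothetical incident timestamps, read off graphs and chat logs after the fact.
incident = {
    "metric_first_abnormal": datetime(2024, 1, 1, 14, 2),
    "monitor_alerted": datetime(2024, 1, 1, 14, 9),
    "outage_declared": datetime(2024, 1, 1, 14, 20),
}

time_to_alert = minutes_between(
    incident["metric_first_abnormal"], incident["monitor_alerted"])
time_to_declare = minutes_between(
    incident["metric_first_abnormal"], incident["outage_declared"])

print(f"Alerted after {time_to_alert:.0f} min, declared after {time_to_declare:.0f} min")
```

If `monitor_alerted` cannot be filled in at all, that is itself an answer to "Was there a monitor on that metric?".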

Page 54: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE (3/5)

HOW DID WE RESPOND?

▸ Who was the incident owner & who else was involved?

▸ Slack archive links and timeline of events!

▸ What went well?

▸ What didn’t go so well?
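
A timeline of events can be assembled from a chat archive by sorting messages on their timestamps. The sketch below assumes a simple list of messages with `ts`, `user`, and `text` fields; that shape is hypothetical and is not the schema of any real Slack export, and the users and messages are invented.

```python
from datetime import datetime

# Hypothetical chat export, deliberately out of order.
messages = [
    {"ts": "2024-01-01T14:20:00", "user": "alice", "text": "Declaring an outage"},
    {"ts": "2024-01-01T14:09:00", "user": "monitor-bot", "text": "ALERT: error rate high"},
    {"ts": "2024-01-01T14:31:00", "user": "bob", "text": "Rolled back deploy"},
]

def build_timeline(messages):
    """Return one 'HH:MM user: text' line per message, oldest first."""
    ordered = sorted(messages, key=lambda m: datetime.fromisoformat(m["ts"]))
    lines = []
    for m in ordered:
        when = datetime.fromisoformat(m["ts"]).strftime("%H:%M")
        lines.append(f"{when} {m['user']}: {m['text']}")
    return lines

timeline = build_timeline(messages)
print("\n".join(timeline))
```

Because the archive records what people actually said at the time, a timeline built this way is less exposed to the memory and hindsight problems listed earlier than one written from recollection.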

Page 55: Webinar - Data driven postmortems - Jason Yee

*Names changed

Page 56: Webinar - Data driven postmortems - Jason Yee

CHATOPS ARCHIVES FTW!

*Names changed

Page 57: Webinar - Data driven postmortems - Jason Yee

*Names changed

TRACK LEARNINGS AS YOU GO

Page 58: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE (4/5)

WHY DID IT HAPPEN?

▸ Deep dive into the cause

▸ Examples from this incident:

▸ http://bit.ly/dd-statuspage

▸ http://bit.ly/alq-postmortem

Page 59: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE (5/5)

HOW DO WE PREVENT IT IN THE FUTURE?

▸ Link to GitHub issues and Trello cards

▸ Now?

▸ Next?

▸ Later?

▸ Follow up notes

Page 60: Webinar - Data driven postmortems - Jason Yee

*Names changed

Page 61: Webinar - Data driven postmortems - Jason Yee

DATADOG’S POSTMORTEM TEMPLATE

RECAP:

▸ What happened (summary)?

▸ How did we detect it?

▸ How did we respond?

▸ Why did it happen (deep dive)?

▸ Actionable next steps!
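
The five sections recapped above can be kept honest with a small structure that reports which ones are still empty. This is only a sketch: the `Postmortem` class, its field names, and `missing_sections` are hypothetical and are not taken from Datadog's actual template.

```python
from dataclasses import dataclass, fields

@dataclass
class Postmortem:
    """One text field per section of the recap, in the same order."""
    summary: str = ""
    detection: str = ""
    response: str = ""
    cause: str = ""
    next_steps: str = ""

def missing_sections(postmortem):
    """Names of the sections that are still blank or whitespace-only."""
    return [f.name for f in fields(postmortem)
            if not getattr(postmortem, f.name).strip()]

# Hypothetical draft with only the first two sections written.
draft = Postmortem(
    summary="Checkout API returned errors for part of one afternoon.",
    detection="An error-rate monitor alerted the on-call engineer.",
)
todo = missing_sections(draft)
print("Still to write:", ", ".join(todo))
```

A check like this could run before a postmortem is circulated, so that a draft missing its deep dive or next steps is caught early.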

Page 62: Webinar - Data driven postmortems - Jason Yee

KEEP LEARNING

MORE RESOURCES

▸ Postmortem Template

http://bit.ly/postmortem-template

▸ Post-Incident Reviews by Jason Hand

http://bit.ly/post-incident-review

Page 63: Webinar - Data driven postmortems - Jason Yee

QUESTIONS?

LET’S TALK!

@GITBISECT

@DATADOGHQ

Page 64: Webinar - Data driven postmortems - Jason Yee

SLIDES: bit.ly/cm-postmortems