Mean Time to Sleep: Quantifying the On-Call Experience

Upload: laurie-denness

Post on 21-Apr-2017

Category: Engineering

TRANSCRIPT

Page 1: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Mean Time to Sleep

Quantifying the on-call experience

Page 2: Mean Time to Sleep: Quantifying the On-Call Experience

Laurie Denness (@lozzd)

Ryan Frantz (@ryan_frantz)

Page 3: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Who is in an on-call rotation?

Page 4: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Who is on call right now?

Page 5: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Who feels like on-call sucks?

Page 6: Mean Time to Sleep: Quantifying the On-Call Experience
Page 7: Mean Time to Sleep: Quantifying the On-Call Experience

Welcome. How is on call?

Page 8: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Let’s help our people sleep

Page 9: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Make on-call more bearable

Page 10: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Incremental Changes

Page 11: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Email to Acknowledge

• Replying “ack” with some context makes it appear in IRC too

Page 19: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Email Only Alerts

• Do you care if RAID becomes degraded in the middle of the night?

• Do you care if one of your web/hadoop/X boxes dies in the middle of the night?

• Can it wait until the morning?

Page 22: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Added Context

• Previous service state

• Duration in that state

• Alert recipients

• Notes

• Link to runbook

Page 33: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Alert Storms

• Reduce noise when 200 things go wrong by aggregating

• Trigger an alert when the percentage of a pool in a bad state is over a threshold
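The pool-percentage idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the check Etsy runs: the `HostCheck` record, the `pool_alert` name, and the 20% default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HostCheck:
    # Hypothetical record: one check result per host in the pool.
    host: str
    ok: bool


def pool_alert(checks, threshold_pct=20.0):
    """Return one aggregated alert message, or None if the pool is healthy enough.

    Instead of one notification per failing host, a single message is
    produced once the failing share of the pool reaches threshold_pct.
    """
    if not checks:
        return None
    failing = sorted(c.host for c in checks if not c.ok)
    pct = 100.0 * len(failing) / len(checks)
    if pct < threshold_pct:
        return None
    # List only a few hosts so the message stays readable during a storm.
    sample = ", ".join(failing[:5])
    return f"{len(failing)}/{len(checks)} hosts failing ({pct:.0f}%): {sample}"


if __name__ == "__main__":
    pool = [HostCheck(f"web{i}", ok=i >= 3) for i in range(10)]
    print(pool_alert(pool))
```

With ten hosts and three failing, the sketch emits a single message rather than three separate pages; below the threshold it stays silent.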

Page 35: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Low friction downtime

• IRC commands to downtime hosts/sets of hosts
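One way a chat command could be turned into a downtime is by building a Nagios external command line. The sketch below assumes a chat syntax of `downtime <host> <minutes> [comment]`, which is invented for illustration; the slides do not show the actual IRC commands. The output follows the documented `SCHEDULE_HOST_DOWNTIME` external command format, and writing it to the Nagios command file is left out.

```python
import time


def downtime_command(message, author, now=None):
    """Turn 'downtime <host> <minutes> [comment]' into a Nagios external command.

    The chat syntax is a hypothetical example. The returned line uses the
    SCHEDULE_HOST_DOWNTIME format:
    [time] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
    """
    parts = message.split(maxsplit=3)
    if len(parts) < 3 or parts[0] != "downtime":
        raise ValueError("usage: downtime <host> <minutes> [comment]")
    host = parts[1]
    minutes = int(parts[2])
    comment = parts[3] if len(parts) > 3 else "downtimed from chat"
    start = int(now if now is not None else time.time())
    end = start + minutes * 60
    # fixed=1 and trigger_id=0; the duration field only matters for
    # flexible downtime but is filled in for completeness.
    return (
        f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"1;0;{minutes * 60};{author};{comment}"
    )


if __name__ == "__main__":
    print(downtime_command("downtime web01 30 kernel upgrade", "lozzd"))
```

A real bot would also need to reject comments containing semicolons or newlines, since those are field and command separators in the external command format.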

Page 37: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Downtime Reminders

• Help prevent false notifications

Page 39: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Event Handlers

• Teach Nagios to augment the team

• Restarting services (nscd)

• Re-running jobs (transient errors)

• Duplicate crons (Chef)
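As a rough sketch of the service-restart case: Nagios can pass the service state, state type, and attempt number to an event handler as arguments. The policy below (act on the third SOFT failure or on a HARD failure) mirrors the pattern in the Nagios event handler documentation; the function names and the `systemctl restart nscd` command are assumptions for illustration, not the handlers used at Etsy.

```python
import subprocess


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should act on this state change."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Act once before the problem goes HARD, so a human may never be paged.
    return state_type == "SOFT" and attempt == soft_attempt


def handle_event(state, state_type, attempt,
                 restart_cmd=("systemctl", "restart", "nscd"),
                 run=subprocess.run):
    """Run the restart command if the policy says so; return True if it ran.

    restart_cmd is a hypothetical default; `run` is injectable so the
    policy can be exercised without touching a real service.
    """
    if should_restart(state, state_type, int(attempt)):
        run(list(restart_cmd), check=False)
        return True
    return False


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    handle_event("CRITICAL", "SOFT", "3", run=lambda cmd, check: print(cmd))
```

Keeping the decision in a small pure function makes it easy to review exactly when the automation is allowed to act.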

Page 43: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Incremental Improvements?

• Maybe

• More ideas; hoped they’d stick

• We didn’t know because we didn’t measure

Page 46: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Measure Everything

• “You can’t manage what you can’t measure.” - Deming (not really)

• But, we weren’t measuring anything

Page 48: Mean Time to Sleep: Quantifying the On-Call Experience
Page 49: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

What should we measure?

• Volume of alerts (total, by severity)

• Alert categorization (actionable vs not)

• Alert times: Off-hours?

• Noisy hosts/services
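The four measurements on this slide can be computed from a plain list of alert records. The sketch below is not how Opsweekly stores or reports data; the record fields (`time`, `host`, `service`, `severity`, `actionable`) and the 09:00 to 18:00 weekday working window are assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime


def summarize(alerts, work_start=9, work_end=18):
    """Summarize alert volume, actionability, off-hours load and noisy checks.

    alerts: iterable of dicts with keys 'time' (datetime), 'host',
    'service', 'severity' and 'actionable' (bool). Field names are
    hypothetical.
    """
    by_severity = Counter()
    noisy = Counter()
    total = off_hours = actionable = 0
    for alert in alerts:
        total += 1
        by_severity[alert["severity"]] += 1
        noisy[(alert["host"], alert["service"])] += 1
        when = alert["time"]
        # Weekends and anything outside the working window count as off-hours.
        if when.weekday() >= 5 or not (work_start <= when.hour < work_end):
            off_hours += 1
        if alert["actionable"]:
            actionable += 1
    return {
        "total": total,
        "by_severity": dict(by_severity),
        "actionable": actionable,
        "off_hours": off_hours,
        "noisiest": noisy.most_common(3),
    }


if __name__ == "__main__":
    sample = [
        {"time": datetime(2014, 6, 2, 3, 0), "host": "db01", "service": "disk",
         "severity": "CRITICAL", "actionable": False},
        {"time": datetime(2014, 6, 2, 14, 0), "host": "db01", "service": "disk",
         "severity": "CRITICAL", "actionable": True},
    ]
    print(summarize(sample))
```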

Page 54: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Opsweekly

Page 55: Mean Time to Sleep: Quantifying the On-Call Experience
Page 56: Mean Time to Sleep: Quantifying the On-Call Experience
Page 57: Mean Time to Sleep: Quantifying the On-Call Experience
Page 58: Mean Time to Sleep: Quantifying the On-Call Experience
Page 59: Mean Time to Sleep: Quantifying the On-Call Experience
Page 60: Mean Time to Sleep: Quantifying the On-Call Experience
Page 61: Mean Time to Sleep: Quantifying the On-Call Experience
Page 62: Mean Time to Sleep: Quantifying the On-Call Experience
Page 63: Mean Time to Sleep: Quantifying the On-Call Experience
Page 64: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

We have data.

Page 65: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Aggregate alerts

1. Look at reports

2. Wow, look at all those alerts for the same thing

3. Aggregate alerts

4. Profit

Page 69: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Parent relationships

• Prevent alerts due to upstream issues (downed switch)

• Standard Nagios feature

• Computers can do this for us!

Page 72: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Parent relationships

• signalvnoise.com

• LLDP on host shows switch info

• Put switch info into Chef using ohai

• Create Nagios host configs based on data
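The last step, turning discovered switch data into Nagios host definitions, might look roughly like the sketch below. The `parents` directive is the standard Nagios way to declare an upstream device; everything else is assumed for illustration, including the input shape (a name, an address, and the switch name learned via LLDP) and the `generic-host` template name.

```python
def nagios_host(host_name, address, switch_name=None, template="generic-host"):
    """Render a Nagios host definition, adding `parents` when a switch is known.

    With a parent set, Nagios can mark the host UNREACHABLE instead of
    DOWN when the switch itself is down. The template name is an assumption.
    """
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
    ]
    if switch_name:
        lines.append(f"    parents    {switch_name}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical inventory, standing in for node data pulled from Chef.
    inventory = [
        {"name": "web01", "ip": "10.0.0.11", "switch": "sw-rack12"},
        {"name": "web02", "ip": "10.0.0.12", "switch": None},
    ]
    for node in inventory:
        print(nagios_host(node["name"], node["ip"], node["switch"]))
        print()
```

Generating the definitions from inventory data keeps the parent relationships current as hosts move between switches, which is the point of letting computers do it.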

Page 76: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Service Dependencies

• Hundreds of Graphite-sourced checks

• Created new template that sets a servicegroup that depends on the Graphite service

Page 78: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Keep on analyzing

• It’s okay to just identify and delete alerts that don’t mean anything!

• Or move them to email only

Page 80: Mean Time to Sleep: Quantifying the On-Call Experience
Page 81: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

More Quantification!

Page 82: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Reviewing the Year

• Use reports

• Use search

• Identify noisiest alerts

Page 85: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Reviewing the Year

[Yearly report screenshots]

Page 86: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Nagios Hack Day/Week

• Great time to look at this data and make improvements

• If disk space is the worst offender, can we rethink that check?

Page 88: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Outsource Your Alerts

• Etsy’s Search Team has its own on-call rotation

• A whole subset of alerts that don’t go to Ops

• More teams are starting this, but the Search Team is at 100%

Page 91: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Sleep Tracking

Page 92: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Page 93: Mean Time to Sleep: Quantifying the On-Call Experience

“Track your life!” - @ph

Page 94: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Page 95: Mean Time to Sleep: Quantifying the On-Call Experience
Page 96: Mean Time to Sleep: Quantifying the On-Call Experience
Page 97: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Page 98: Mean Time to Sleep: Quantifying the On-Call Experience
Page 99: Mean Time to Sleep: Quantifying the On-Call Experience
Page 100: Mean Time to Sleep: Quantifying the On-Call Experience
Page 101: Mean Time to Sleep: Quantifying the On-Call Experience
Page 102: Mean Time to Sleep: Quantifying the On-Call Experience
Page 103: Mean Time to Sleep: Quantifying the On-Call Experience
Page 104: Mean Time to Sleep: Quantifying the On-Call Experience
Page 105: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Page 106: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Page 107: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Did it work?

• Yes.

• Signal-to-noise ratio is much better

Page 112: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Did it work?

• Yes.

• Okay, so it’s a little more complicated than that

• Adding alerts all the time means new “annoying” things

• Keep monitoring

Page 116: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

What’s next?

Page 117: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

The Effect of Sleep

• We focus on people’s sleep

• But not the effect on the person when they come to work the next day

• How do we measure the impact of sleep loss/deprivation?

• Subjective: Pittsburgh Sleepiness Scale

• Objective: Psychomotor vigilance task (PVT) to measure alertness

Page 121: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Beyond Opsweekly

• Employee wellness program

• Security have started using past sleep data to check for weird logins to systems

Page 123: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

More context: nagios-herald

Page 124: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

More reports

• We have a bunch of data; we can build better reports and drill down to analyze alerting trends

• Can we attribute particular actions to reduced noise volume?

• Aggregate alerts

• Non-downtimed alerts

Page 126: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Thanks

Page 127: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Etsy Ops Team

Page 128: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

SewMona

Page 129: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Open Source/Links

• http://ryanfrantz.com/mtts

• https://github.com/etsy/opsweekly

• https://github.com/etsy/nagios-herald

• https://github.com/jonlives/jawboneup_to_graphite

• http://codeascraft.com

Page 130: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Questions?

Page 131: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Mean Time to Sleep

Quantifying the on-call experience