“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #monitoringlove - Tomas Doran, Yelp

Page 1: “Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #monitoringlove - PuppetConf 2014

Sensu and Sensibility

Tomas Doran @bobtfish 2014-09-23

Page 2

Sensu and Sensibility

I’m part of the SRE team at Yelp. One of my jobs is “don’t break the site, ever”. Another job is to enable developer productivity and fast innovation. These two things can be in conflict.

Page 3

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production

This talk is about one particular instance of this conflict: monitoring. We used Nagios. It sucked. This is half to do with Nagios, half to do with the way we used it.

Page 4

This leads to developers being separated from production. Pager details are out of date. Not all hosts running a service are monitored as services move. Permissions issues mean developers can’t ack alerts. There is no sane ack system.

Page 5

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems

Ops have a lot of pain too. Alerts are too noisy, and when they’re for services we can’t triage them. Host issues end up with ops sending email to developers@ and praying. Ops get alert fatigue, stuff gets missed, everything is terrible.

Page 6

If monitoring is an ‘ops problem’, everything looks on fire all the time. It’s very hard to know what’s actually broken. Lack of situational awareness and expecting broken windows stop people taking responsibility.

Page 7

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.

Both sides are actually being reasonable. This isn’t even a Hanlon’s razor situation - everyone is really trying.

Page 8

“Normality”

http://gunshowcomic.com/648

It’s just that the way we’ve built our monitoring system is killing us with a thousand cuts. And we’ve got Stockholm syndrome.

Page 9

“Normality”

http://gunshowcomic.com/648

This is dysfunctional

I’m painting a bleak picture here. I’m not actually saying that everything was _this_ bad in our organization, but these were the types of problems we identified.

Page 10

Sensibility

Monitoring is about enabling communication.

Page 11

Sensibility

One of our core competencies is getting monitoring right! So, we decided to change everything!!!!1111

Page 12

“51% viewed their ERP implementation as unsuccessful”

The Robbins-Gioia Survey (2001)

Why the hell would we do that? It’s clearly a massive project.

Page 13

“40% of the projects failed to achieve their business case within one year of going live”

The Conference Board Survey (2001)

And pretty high risk. If we screw the monitoring up, well, let’s just not do that?

Page 14

• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”

McKinsey & Company in conjunction with the University of Oxford (2012)

This is actually really scary.

Page 15

Failure is an option

blog.parasoft.com/single-greatest-barrier-with-sw-delivery

You’re not gonna get it right first time. Different teams want to work in different ways. Different environments are different. How do you test your monitoring system?

Page 16

Sensibility

Large team + many teams - decentralized (multiple time zones for some teams). Integration - we can’t pick a product off the shelf (and get the level of value we need).

Page 17

Sensibility

No big bang change, it has to be incremental. We don’t know what our requirements are (beyond that the current system doesn’t meet them). Iteration is absolutely key to project success.

Page 18

Why Sensu?

• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

So why did we choose Sensu - Nagios is workable, right? We want to work with the monitoring system to integrate it into our infra, not hack around it.

Page 19

‘industry standard’ ‘enterprise class’

So we do have / did have Nagios. It’s workable. In fact, it works fine, and scales pretty well (to a point). This is not a hate on Nagios. It _could_ do all the things I talk about here…

Page 20

Cheap shot

It’s ugly

Page 21

It tries to solve the full-stack monitoring problem. We’d already migrated most contacting to PagerDuty, with the rest to follow. Half the objects are useless to us. Monolithic.

Page 22

status.dat cmd.dat

The data formats are gross.

Page 23

cmd.dat

Page 24

Centralized

Ephemeral clients are a problem. Whitelisting (needing to explicitly add hosts/services) is a problem. Exported resources are horrible (slow + bad for ephemeral envs).

Page 25

To be fair, this diagram does Sensu no favors at all :)

Page 26

How we use Sensu

• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module

We don’t use it like this, much simpler model!

Page 27

Sensu data flow

• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, ha)
• Processes check results, invokes handlers
• Writes state to redis
• Redis + sentinel
• Read by API (2 instances)
• All layers behind haproxy

Page 28

Quis custodiet ipsos custodes?

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Nagios does all of these things itself, with no introspection: ‘how deep are my queues, why are things not getting scheduled?’

Page 29

Mutually assured monitoring

• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!

We have a big environment, so we run a Sensu per DC, and they can monitor each other.

Page 30

Machine readable config

• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!

One of (IMO) the nice decisions is the use of JSON for config. JSON is a terrible format for hand-edited config, but we deploy all the config by puppet.
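The "hash merge" bullet can be illustrated with a rough Python sketch: every JSON fragment under conf.d is parsed and folded into one configuration hash. The function names and the second fragment are hypothetical, and this only approximates the idea; it is not Sensu's actual loader.

```python
import json


def deep_merge(base, override):
    """Return a new dict: nested dicts merge recursively, other values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def merge_fragments(fragments):
    """Fold a list of JSON documents (one per config file) into one hash."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Two hypothetical fragments, as if read from two files in conf.d/checks/.
fragments = [
    '{"checks": {"disk_ro_mounts": {"interval": 60, "team": "operations"}}}',
    '{"checks": {"apache_external": {"interval": 300, "page": true}}}',
]
merged = merge_fragments(fragments)
print(sorted(merged["checks"]))
```

Because each check lives in its own file and the hashes are merged, a config management tool can add or remove one check without touching any other file.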

Page 31

monitoring_check

    monitoring_check { 'systems-apache-external':
      page          => true,
      command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
      check_every   => '5m',
      alert_after   => '30m',
      realert_every => 10,
      runbook       => 'y/apache',
    }

This is our interface to Sensu in puppet. It’s a custom define which applies our business rules.

Page 32

monitoring_check (same example as Page 31): the page parameter

We default to not paging people (for sanity), but you can turn that on easily. The check automatically uses the default team (whoever owns the box). This can be overridden.

Page 33

monitoring_check (same example as Page 31): check_every, alert_after, realert_every

We didn’t like Sensu’s alert scheduling logic. So we rewrote it :) (This is easy - just in the base class)
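A minimal sketch of what alert scheduling driven by alert_after and realert_every could look like, written in Python rather than as a Ruby handler. The assumptions are loud ones: `occurrences` is taken to be the count of consecutive failing results, all times are in seconds, realert_every counts check runs between repeat notifications, and treating -1 as exponential backoff is this sketch's choice, not something the slides spell out.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def should_notify(occurrences, interval, alert_after, realert_every):
    """Decide whether the Nth consecutive failure should produce a notification."""
    # Failures that happen inside the alert_after window are swallowed.
    grace_runs = alert_after // interval if interval > 0 else 0
    failed_after_grace = occurrences - grace_runs
    if failed_after_grace >= 1:
        if realert_every == -1:
            # Assumed meaning of -1: back off exponentially (1st, 2nd, 4th, 8th...).
            return is_power_of_two(failed_after_grace)
        # Otherwise notify on the first failure past the window, then every Nth run.
        return (failed_after_grace - 1) % max(realert_every, 1) == 0
    return False


# check_every 5m, alert_after 30m, realert_every 10, as in the Puppet example.
for n in (6, 7, 8, 17):
    print(n, should_notify(n, 300, 1800, 10))
```

Keeping the decision in one pure function makes it easy to unit test, which matters when the thing being tested decides who gets woken up.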

Page 34

monitoring_check (same example as Page 31): the runbook parameter

Mandatory documentation!

Page 35

sensu::check

• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe

We do use the official Sensu puppet module. “Comment safe” means that if you comment the puppet code out, the check goes away. We’re now working on auto-resolving checks that are deleted!

Page 36

    "disk_ro_mounts": {
      "standalone": true,
      "handlers": ["default"],
      "subscribers": [],
      "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
      "interval": 60,
      "alert_after": 0,
      "realert_every": "-1",
      "dependencies": [],
      "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
      "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
      "team": "operations",
      "irc_channels": "operations-notifications",
      "notification_email": "undef",
      "ticket": true,
      "project": "OPS",
      "page": false,
      "tip": false
    }

This is what an actual auto-generated check JSON looks like. BIG BLOB OF JSON! Don’t stress, we’ll work through it.

Page 37

disk_ro_mounts check JSON (as on Page 36): standalone, handlers, subscribers

This looks the same for all of our Sensu checks. This is using ‘simple mode’ and turning off half the features: servers can’t/don’t trigger checks on clients, it’s all client scheduled.

Page 38

disk_ro_mounts check JSON (as on Page 36): alert_after, realert_every

These are custom (in our base handler), as noted before in the define. Times are converted to seconds (in puppet) so that all time intervals in the JSON are seconds.
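The conversion from '5m' or '30m' to seconds happens in puppet; the same idea as a small Python sketch looks like the following. The set of accepted suffixes (s, m, h, d) and the function name are assumptions for illustration, not the define's documented behaviour.

```python
import re

# Assumed unit suffixes; a bare number is treated as seconds.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert 300, '300', '5m' or '1h' into a number of seconds."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)


print(to_seconds("5m"), to_seconds("30m"))
```

Normalising at config-generation time means handlers never have to parse units; they only ever see integers.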

Page 39

disk_ro_mounts check JSON (as on Page 36): runbook

Every check has to have a run book!

Page 40

disk_ro_mounts check JSON (as on Page 36): annotation

Generated by a custom function. It goes up the parser stack and finds where it was called from.

Page 41

disk_ro_mounts check JSON (as on Page 36): custom metadata

This stuff (more than half the check!) is the custom metadata. Every alert has a team owning it. We can report in irc, JIRA, email (why? but some people do want this) or page!
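To make the role of the custom metadata concrete, here is a hedged Python sketch of how a handler layer might turn those fields into notification targets. The routing rules and the function name are assumptions for illustration; the field names and values come from the disk_ro_mounts example.

```python
def notification_targets(check):
    """Return (channel, destination) pairs derived from a check's metadata."""
    targets = []
    if check.get("page"):
        targets.append(("pagerduty", check.get("team")))
    if check.get("ticket"):
        targets.append(("jira", check.get("project")))
    channels = check.get("irc_channels")
    # The generated JSON uses the string "undef" for unset values.
    if channels and channels != "undef":
        if isinstance(channels, str):
            channels = [channels]
        targets.extend(("irc", channel) for channel in channels)
    email = check.get("notification_email")
    if email and email != "undef":
        targets.append(("email", email))
    return targets


disk_ro_mounts = {
    "team": "operations",
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
    "ticket": True,
    "project": "OPS",
    "page": False,
}
print(notification_targets(disk_ro_mounts))
```

With this shape, the check definition alone says who hears about a failure and how loudly, so no separate contact configuration has to be kept in sync.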

Page 42

Check scripts

• Same as nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.

So, to recap - checks are scheduled and run on the client. It pushes the results to RabbitMQ, sending its results and definitions to the server. This is then all piped to the handler setup.
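A check script in the Nagios style is just a program that prints one line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). As a rough illustration, a read-only-mount check could look like this in Python; it is a sketch inspired by the check_ro_mounts name, not Yelp's plugin, and the "only /dev/ devices" filter is an assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_ro_mounts(mounts_text):
    """Return (exit_code, message) for text in /proc/mounts format."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            device, mount_point, _fs_type, options = fields[:4]
            # Assumption: only real block devices are interesting.
            if device.startswith("/dev/") and "ro" in options.split(","):
                read_only.append(mount_point)
    if read_only:
        return CRITICAL, "CRITICAL: read-only mounts: " + ", ".join(read_only)
    return OK, "OK: no read-only mounts"


SAMPLE = """/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 ro,relatime 0 0
proc /proc proc rw,nosuid 0 0"""

status, message = check_ro_mounts(SAMPLE)
print(status, message)
```

A real script would read /proc/mounts, print the message, and call sys.exit(status); the client then ships that output and exit code to the server together with the check definition.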

Page 43

Handlers

• base
• JIRA
• email
• irc
• pagerduty
• awsprune

Page 44

How do checks get run?

• Every machine runs the client.
• Client managed by puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp

Check scripts are simple (as per Nagios). You can write them in shell/ruby/python/whatever. More complex things can send data to the local socket. We have a python library for this (we also use the ruby libraries from the Sensu project).
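Sending a result to the local client socket needs nothing more than a JSON document with a name, a status and some output. The sketch below uses only the Python standard library and deliberately does not show the pysensu-yelp API; port 3030 is the Sensu client's default socket port, while the check name, the runbook value and the extra metadata fields are made up for illustration.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build the JSON document for one check result, with optional extra fields."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return json.dumps(result)


def send_result(payload, host="localhost", port=3030):
    """Write one result to the client socket (assumes a Sensu client is listening)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("utf-8"))


payload = build_result(
    "queue_depth",                      # hypothetical check name
    1,                                  # 1 == WARNING
    "WARNING: 1200 items queued",
    team="operations",
    runbook="y/queues",                 # hypothetical runbook shortlink
)
print(payload)
# send_result(payload)  # only works on a host running the Sensu client
```

This is what lets long-running jobs or application code report their own health without being wrapped in a scheduled check script.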

Page 45

Sensu servers know which machine is the master right now (they do their own leadership election). We deploy some checks to the Sensu servers (e.g. cloudwatch checks!) and run them on the master. Fake hostname!

Page 46

Situational awareness

Send alerts about dev box resource usage to the developers using that box. Why page OPS because a developer used 90% of the disk?

Page 47

Single source of truth

• DNS is canonical for sensu servers
• Configure things in one place!

One place can be DNS, or hiera, or whatever - but not multiple places. DNS AND hiera sucks.

Page 48

Single source of truth

• DNS is canonical for sensu servers
• Configure things in one place!

puppet-netstdlib structured facts

Page 49

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

There are a bunch of general patterns where you can automate monitoring. Who hates ‘cron spam’? We use a custom define which defaults to /dev/null. Check jobs completed successfully (with Sensu) - make JIRA tickets!
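One way to check that a cron job "completed successfully recently" is to compare the time of its last success against a staleness limit. The Python sketch below assumes the job records a success timestamp somewhere; how Yelp's cron::d define actually tracks success is not described in the slides, so treat every name here as hypothetical.

```python
import time

OK, CRITICAL = 0, 2


def check_last_success(last_success_epoch, max_age_seconds, now=None):
    """Return (exit_code, message) based on how stale the last success is."""
    now = time.time() if now is None else now
    age = int(now - last_success_epoch)
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL: last success {age}s ago (limit {max_age_seconds}s)"
    return OK, f"OK: last success {age}s ago"


# A job that last succeeded 1000 seconds ago, with a one hour limit.
print(check_last_success(1000, 3600, now=2000))
```

Since the limit can be derived from the cron schedule, a check like this can be generated for every job automatically instead of being written by hand.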

Page 50

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

Generic handling! Annotations!

Page 51

Generate monitoring_check

And under the hood this runs create_resources to generate monitoring_checks. create_resources is your friend!

Page 52

User specified monitoring

This is a cunning one. The check returns OK (assuming it can hit graphite), but also emits a bunch of additional check results to the local socket.
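The "one check that emits many results" trick can be sketched as a function that walks a list of alert rules and produces one result per rule. In this Python sketch the rule format (name, metric, critical_above) and the service name are invented, and the metric values are passed in as a dict instead of being fetched from graphite.

```python
def evaluate_alerts(service, alerts, metrics):
    """Return one check-result dict per alert rule."""
    results = []
    for alert in alerts:
        metric = alert["metric"]
        value = metrics.get(metric)
        if value is None:
            status, output = 3, f"UNKNOWN: no data for {metric}"
        elif value > alert["critical_above"]:
            status, output = 2, f"CRITICAL: {metric} = {value}"
        else:
            status, output = 0, f"OK: {metric} = {value}"
        results.append(
            {"name": f"{service}_{alert['name']}", "status": status, "output": output}
        )
    return results


# Hypothetical rules, as a developer might declare them in a service config.
alerts = [
    {"name": "error_rate", "metric": "errors_per_min", "critical_above": 50},
    {"name": "latency", "metric": "p99_ms", "critical_above": 800},
]
metrics = {"errors_per_min": 75, "p99_ms": 120}
for result in evaluate_alerts("search", alerts, metrics):
    print(result)
```

Each returned dict could then be written to the client's local socket as its own check result, so one scheduled check fans out into as many alerts as the service config declares.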

Page 53

User specified monitoring

• Data lives in the service config
• Next to the code to emit metrics!

This is awesome, as it reads our service configs. Developers can add their own alerts.

Page 54

User specified monitoring

• Simple checks for free!

This example is in ruby :)

Page 55

User specified monitoring

• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Developers can push without OPS

Allowing developers to add their own monitoring is awesome. Putting the config for the monitoring in their application codebase is awesome.

Page 56

Cluster checks

• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
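The cluster check idea, "assert some % of machines are healthy", reduces to a small aggregation over per-host results. This is a minimal Python sketch under the assumption that the latest status of each host is already available as a dict; the host names and the threshold are invented, and since the real implementation was still in progress at the time of the talk this is not a description of it.

```python
OK, CRITICAL, UNKNOWN = 0, 2, 3


def cluster_status(host_statuses, min_healthy_fraction):
    """Return (exit_code, message) for a mapping of host name to check status."""
    total = len(host_statuses)
    if total == 0:
        return UNKNOWN, "UNKNOWN: no hosts reported"
    healthy = sum(1 for status in host_statuses.values() if status == OK)
    if healthy / total >= min_healthy_fraction:
        return OK, f"OK: {healthy}/{total} hosts healthy"
    return CRITICAL, f"CRITICAL: {healthy}/{total} hosts healthy"


# One of four hosts is down: not worth a page if half the cluster is enough.
print(cluster_status({"web1": 0, "web2": 2, "web3": 0, "web4": 0}, 0.5))
```

The per-host check can then be set to open a ticket, while only the cluster-level result is allowed to page.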

Page 57

WIP

• This is all still a work in progress.
• We’ve not 100% migrated off of Nagios
• Open sourcing the pieces

Page 58

Thanks!

• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp