“sensu and sensibility” - the story of a journey from #monitoringsucks to #monitoringlove -...

Sensu and Sensibility

Tomas Doran @bobtfish 2014-09-23

Sensu and Sensibility

I’m part of the SRE team at Yelp. One of my jobs is “don’t break the site, ever”. Another job is to enable developer productivity and fast innovation. These two things can be in conflict.

Cycle of failure and disappointment

• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production

This talk is about one particular instance of this conflict - monitoring. We used Nagios. It sucked. This is half to do with Nagios, half to do with the way we used it.

This leads to developers being separated from production. Pager details out of date. Not all hosts running a service monitored as services move. Permissions issues so developers can’t ack alerts. No sane acks system.



• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems

Ops have a lot of pain too. Alerts are too noisy, and when they’re for services we can’t triage them. Host issues end up with ops sending email to developers@ and praying. Ops get alert fatigue, stuff gets missed, everything is terrible.

If monitoring is an ‘ops problem’, everything looks on fire all the time. It’s very hard to know what’s actually broken. There’s a lack of situational awareness, and expecting broken windows stops people taking responsibility.



• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.

Both sides are actually being reasonable. This isn’t even a Hanlon’s razor situation - everyone is really trying.

“Normality”

http://gunshowcomic.com/648

It’s just that the way we’ve built our monitoring system is killing us with a thousand cuts. And we’ve got Stockholm syndrome.

“Normality”

http://gunshowcomic.com/648

This is dysfunctional

I’m painting a bleak picture here - not actually saying that everything was _this_ bad in our organization. But these were the types of problems we identified.


Sensibility

Monitoring is about enabling communication.


Sensibility

One of our core competencies is getting monitoring right! So, we decided to change everything!!!!1111

“51% viewed their ERP implementation as unsuccessful”

The Robbins-Gioia Survey (2001)

Why the hell would we do that? It’s clearly a massive project

“40% of the projects failed to achieve their business case within one year of going live”

The Conference Board Survey (2001)

And pretty high risk. If we screw the monitoring up, well, let’s just not do that?

• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”

• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”


McKinsey & Company in conjunction with the University of Oxford (2012)

This is actually really scary.

Failure is an option

blog.parasoft.com/single-greatest-barrier-with-sw-delivery

You’re not gonna get it right first time. Different teams want to work in different ways. Different environments are different. How do you test your monitoring system?

Sensibility

Large team + many teams - decentralized (multiple time zones for some teams). Integration - we can’t pick a product off of a shelf (and get the level of value we need).

Sensibility

No big bang change, it has to be incremental. We don’t know what our requirements are (beyond that the current system doesn’t meet them). Iteration is absolutely key to project success.

Why Sensu?

• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

So why did we choose Sensu - Nagios is workable, right? We want to work with the monitoring system to integrate it into our infra, not hack around it.

‘industry standard’ ‘enterprise class’

So we do have / did have Nagios. It’s workable. In fact, it works fine, and scales pretty well (to a point). This is not a hate on Nagios. It _could_ do all the things I talk about here…

Cheap shot

It’s ugly

It tries to solve the full-stack monitoring problem. We’d already migrated most contacting to PagerDuty, with the rest to follow. Half the objects are useless to us. Monolithic.

status.dat cmd.dat


The data formats are gross.

cmd.dat


Centralized

Ephemeral clients are a problem. Whitelisting (needing to explicitly add hosts/services) is a problem. Exported resources are horrible (slow + bad for ephemeral envs).

To be fair, this diagram does Sensu no favors at all :)

How we use Sensu

• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module

We don’t use it like this, much simpler model!

Sensu data flow

• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind haproxy
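
The flow above can be sketched in a few lines of Python. This is a toy in-process model, not how Sensu itself is implemented: a `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and the names `run_client`, `run_server` and `log_handler` are made up for illustration.

```python
import queue
import time


def run_client(hostname, checks, transport):
    """Run each check locally and push the result onto the transport."""
    for name, check_fn in checks.items():
        status, output = check_fn()
        transport.put({
            "client": hostname,
            "check": {"name": name, "status": status, "output": output,
                      "executed": int(time.time())},
        })


def run_server(transport, state, handlers):
    """Drain the transport, record state, and invoke handlers on failures."""
    while True:
        try:
            result = transport.get_nowait()
        except queue.Empty:
            break
        key = (result["client"], result["check"]["name"])
        state[key] = result["check"]          # stand-in for the Redis write
        if result["check"]["status"] != 0:    # non-zero exit code = problem
            for handler in handlers:
                handler(result)


handled = []


def log_handler(event):
    """Hypothetical handler: just remember what it was asked to handle."""
    handled.append(f'{event["client"]}/{event["check"]["name"]}')


transport = queue.Queue()   # stand-in for RabbitMQ
state = {}                  # stand-in for Redis
checks = {
    "disk": lambda: (0, "OK: 40% used"),
    "apache": lambda: (2, "CRITICAL: connection refused"),
}
run_client("web1", checks, transport)
run_server(transport, state, [log_handler])
```

The point of the shape is that the client owns scheduling and execution, and the server only reacts to whatever arrives on the queue.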


Quis custodiet ipsos custodes?


“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Nagios does all of these things, itself. With no introspection - ‘how deep are my queues, why are things not getting scheduled’

Mutually assured monitoring

• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!

We have a big environment, we run a Sensu per DC, they can monitor each other.

Machine readable config

• /etc/sensu/conf.d/checks/check_name.json

• Extensible with arbitrary metadata

• Hash merge

• Never edit by hand!


One of (IMO) the nice decisions is the use of JSON for config. JSON is a terrible format for hand-edited config, but we deploy all the config by puppet.
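
To make the “hash merge” bullet concrete, here is a minimal Python sketch of merging two config fragments. It assumes nested hashes merge key by key and that a later fragment wins on conflicting scalar values; how Sensu treats arrays and conflicts in detail is not covered here, and `deep_merge` is an illustrative name, not a Sensu function.

```python
import json


def deep_merge(base, extra):
    """Return a new dict: nested dicts merge recursively, otherwise `extra` wins."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Two fragments, as they might appear in separate files under conf.d/checks/.
fragment_a = json.loads(
    '{"checks": {"disk": {"command": "check_disk", "interval": 60}}}')
fragment_b = json.loads(
    '{"checks": {"disk": {"team": "operations"}, "load": {"command": "check_load"}}}')

config = deep_merge(fragment_a, fragment_b)
```

This is why one file per check works: each file contributes its own key under `checks`, and extra metadata can be layered on without touching the original fragment.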

monitoring_check

monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}

This is our interface to Sensu in puppet. It’s a custom define which applies our business rules.

monitoring_check



Default to not paging people (for sanity), but turn that on easily. Automatically uses the default team (whoever owns the box). Can be overridden.

monitoring_check



We didn’t like Sensu’s alert scheduling logic. So we rewrote it :) (This is easy - just in the base class)
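
As a rough illustration of what alert scheduling in a base handler can look like, here is a Python sketch. The semantics are assumptions, not taken from the slides: `alert_after` is treated as seconds of failure to tolerate before telling anyone, `realert_every` as “notify on every Nth failing run after that”, and `-1` as exponential backoff. The function name `should_notify` is hypothetical.

```python
def should_notify(occurrences, interval, alert_after=0, realert_every=1):
    """Decide whether the Nth consecutive failure should trigger a notification.

    occurrences   -- consecutive failing runs so far (1 for the first failure)
    interval      -- seconds between check runs
    alert_after   -- seconds a check must be failing before anyone is told
    realert_every -- notify on every Nth run after that; -1 = exponential backoff
    """
    grace_runs = alert_after // interval if interval > 0 else 0
    failures_past_grace = occurrences - grace_runs
    if failures_past_grace < 1:
        return False                      # not failing for long enough yet
    if realert_every == -1:
        # Exponential backoff: notify on failures 1, 2, 4, 8, ...
        return (failures_past_grace & (failures_past_grace - 1)) == 0
    return (failures_past_grace - 1) % max(realert_every, 1) == 0


# The apache example: check every 5m, alert after 30m, realert every 10 runs.
notified = [n for n in range(1, 20) if should_notify(n, 300, 1800, 10)]
# Backoff example: check every 60s, no grace period, realert_every = -1.
backoff = [n for n in range(1, 10) if should_notify(n, 60, 0, -1)]
```

Under these assumptions the apache check stays quiet for its first six failing runs, notifies on the seventh, and then again every ten runs.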

monitoring_check



Mandatory documentation!

sensu::check

• monitoring_check wraps this

• Writes a JSON file for each check

• Comment safe


We do use the Sensu official puppet module. “Comment safe” - if you comment the puppet code out, the check goes away. Working on auto-resolving checks that are deleted now!

"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}

This is what an actual auto-generated check JSON looks like. BIG BLOB OF JSON! Don’t stress, we’ll work through it.

This looks the same for all of our Sensu checks. This is using ‘simple mode’ and turning off half the features - servers can’t/don’t trigger checks on clients, it’s all client scheduled.



These are custom (in our base handler) - as noted before in the define. Times are converted to seconds (in puppet) so that all time intervals in JSON are seconds.
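
The time conversion is done in Puppet in our setup; purely to illustrate the idea, here is the same normalisation sketched in Python. Only `'5m'` and `'30m'` appear in the example above, so the other suffixes (`s`, `h`, `d`) and the name `to_seconds` are assumptions.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert '30s', '5m', '2h' or '1d' (or a bare number) to integer seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", str(value).strip())
    if match is None:
        raise ValueError(f"unrecognised time interval: {value!r}")
    number, unit = match.groups()
    # A missing suffix means the value is already in seconds.
    return int(number) * _UNITS.get(unit, 1)
```

Doing this once, at config generation time, means every handler can treat `interval` and `alert_after` as plain seconds.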



Every check has to have a run book!



Generated by a custom function. Goes up the parser stack and finds where it was called from.


This stuff (more than half the check!) is the custom metadata. Every alert has a team owning it. We can report in IRC, JIRA, email (why? but some people do want this) or page!

Check scripts

• Same as Nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
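
A check in this style is tiny. The sketch below follows the usual Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one line of text); the disk check itself, its thresholds and the name `check_disk` are just an illustration, not one of our real checks.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0, usage_fn=shutil.disk_usage):
    """Return (exit_code, one_line_message) for disk usage at `path`."""
    try:
        usage = usage_fn(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= crit_pct:
        return CRITICAL, f"CRITICAL: {path} is {used_pct:.0f}% full"
    if used_pct >= warn_pct:
        return WARNING, f"WARNING: {path} is {used_pct:.0f}% full"
    return OK, f"OK: {path} is {used_pct:.0f}% full"
```

A real script would print the message and call `sys.exit()` with the code; the `usage_fn` parameter is only there so the logic can be exercised without touching a real filesystem.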

So, to recap - checks are scheduled and run on the client. It pushes its results, along with the check definitions, to the server via RabbitMQ. This is then all piped to the handlers we’ve set up.

Handlers

• base
• JIRA
• email
• irc
• pagerduty
• awsprune
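
To show how the custom metadata can drive notification routing, here is a small Python sketch. It is an illustration only: in practice each handler looks at the event itself, and the function `select_handlers` and its exact rules (for example treating the string `"undef"` as “no email”) are assumptions based on the JSON shown earlier.

```python
def select_handlers(check):
    """Pick notification channels from a check's custom metadata."""
    selected = []
    if check.get("page"):
        selected.append("pagerduty")
    if check.get("ticket"):
        selected.append("jira")
    if check.get("irc_channels"):
        selected.append("irc")
    email = check.get("notification_email")
    if email and email != "undef":      # "undef" is what the generated JSON contains
        selected.append("email")
    return selected


# Metadata from the disk_ro_mounts check shown earlier.
disk_ro_mounts = {
    "team": "operations",
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
    "ticket": True,
    "project": "OPS",
    "page": False,
}
routes = select_handlers(disk_ro_mounts)
```

With that metadata the check opens a ticket and posts to IRC, but does not page anyone.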


How do checks get run?

• Every machine runs the client.

• Client managed by puppet

• Client has a TCP socket you can send JSON to

• Custom checks + pysensu-yelp


Check scripts are simple (as per nagios). Can write them in shell/ruby/python/whatever. More complex things can send data to the local socket. We have a python library for this (also use the ruby libraries from the sensu project)
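
Sending a result to the local client socket can be sketched like this in Python. This is not the pysensu-yelp API, just a minimal stand-alone illustration; it assumes the Sensu client is listening on its usual local TCP port 3030, and the helper names `build_result` and `send_result` are made up.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result; extra keyword arguments ride along as metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the Sensu client's TCP socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

Something like `send_result(build_result("queue_depth", 1, "WARNING: 5000 jobs waiting", team="operations"))` would then show up on the server as if it were a normal check result, custom metadata included.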


Sensu servers know which machine is the master right now (their own leadership election). Deploy some checks to sensu servers (e.g. cloudwatch checks!), run on the master. Fake hostname!

Situational awareness

Send alerts about dev box resource usage to the developers using that box. Why page OPS because a developer used 90% of the disk?

Single source of truth

• DNS is canonical for sensu servers
• Configure things in one place!

One place can be DNS, or hiera, or whatever - but not multiple places. DNS AND hiera sucks.

Single source of truth

• DNS is canonical for sensu servers
• Configure things in one place!

puppet-netstdlib structured facts

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

There are a bunch of general patterns where you can automate monitoring. Who hates ‘cron spam’? We use a custom define which defaults to /dev/null. Check jobs completed successfully (with Sensu) - make JIRA tickets!
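
One way to check that a cron job “completed successfully recently” is a freshness check on a stamp file that the job touches when it succeeds. The Python sketch below assumes exactly that mechanism, which is a guess for illustration and not a description of what our cron::d define does; `check_cron_freshness` and the stamp-file idea are hypothetical.

```python
import os
import time

OK, CRITICAL = 0, 2


def check_cron_freshness(stamp_path, max_age_seconds, now=None,
                         mtime_fn=os.path.getmtime):
    """Return (exit_code, message): has the job touched its stamp file recently?"""
    now = time.time() if now is None else now
    try:
        age = now - mtime_fn(stamp_path)
    except OSError:
        return CRITICAL, f"CRITICAL: {stamp_path} missing, job has never succeeded"
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL: last success {age:.0f}s ago (limit {max_age_seconds}s)"
    return OK, f"OK: last success {age:.0f}s ago"
```

Because the check is generated alongside the cron job, the job's own schedule can be used to derive a sensible `max_age_seconds`.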

Automatic monitoring

• E.g. cron jobs - check successful recently!
• cron::d

Generic handling! Annotations!

Generate monitoring_check

And under the hood this runs create_resources to generate monitoring_checks. create_resources is your friend!

User specified monitoring


This is a cunning one. The check returns OK (assuming it can hit graphite), but also emits a bunch of additional check results to the local socket



• Data lives in the service config
• Next to the code to emit metrics!

This is awesome, as it reads our service configs. Developers can add their own alerts.

• Simple checks for free!



This example is in ruby :)


• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Developers can push without OPS

Allowing developers to add their own monitoring is awesome. Putting the config for the monitoring in their application codebase is awesome.

Cluster checks

• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.

• If a service becomes fully unavailable to clients, you want to page someone.

• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
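
The cluster check idea can be sketched as a small aggregation function in Python. Since this is still work in progress, everything here is an assumption for illustration: the 50% threshold, the name `cluster_action`, and the choice to page when no data is available at all.

```python
def cluster_action(host_statuses, min_healthy_pct=50.0):
    """Map per-host exit codes (0 = healthy) to 'ok', 'ticket', or 'page'."""
    if not host_statuses:
        return "page"                       # no data at all is treated as an outage
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    healthy_pct = 100.0 * healthy / len(host_statuses)
    if healthy == len(host_statuses):
        return "ok"
    if healthy_pct >= min_healthy_pct:
        return "ticket"                     # degraded: handle later
    return "page"                           # too few healthy hosts: wake someone
```

One dead machine out of four gives `"ticket"`; all four down gives `"page"`.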


WIP

• This is all still a work in progress.

• We’ve not 100% migrated off of Nagios

• Open sourcing the pieces


Thanks!

• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp
