wtf is sensu and monitoring

.WTF/is/sensu A DevOps guide to monitoring

Upload: toby-jackson

Post on 11-Apr-2017

TRANSCRIPT

Page 1: WTF is Sensu and Monitoring

.WTF/is/sensu

A DevOps guide to monitoring

Page 2: WTF is Sensu and Monitoring

.WTF/is/monitoring

A DevOps guide to monitoring

Page 3: WTF is Sensu and Monitoring

.WTF/whois

self:
  author: ‘Toby Jackson <[email protected]>’
  role: ‘Operations Engineer’
  twitter: ‘@warmfusion’
  github: ‘github.com/warmfusion’
  employer: ‘www.futureplc.com/yourfuturejob/’

Page 4: WTF is Sensu and Monitoring

.WTF/is/monitoring?experience

● Developer turned Engineer
● Implemented Sensu at Future PLC
  ○ 340+ hosts, VMs, switches etc.
● Helped shape our approach to monitoring

Page 5: WTF is Sensu and Monitoring

.WTF/is/monitoring?_index

Why do we monitor our systems?
What should we look for?
How can Sensu help us?
Questions…?

Page 6: WTF is Sensu and Monitoring

.WTF/is/monitoring?why

Part One - Why do we monitor our systems?

Page 7: WTF is Sensu and Monitoring

.WTF/is/monitoring?why

● Client - Are they down, or is it just me?
● CEO - Are we making money?
● Manager - Are we meeting SLA agreements?
● Engineer - Am I woken up for the right reasons?
● Developer - Did my deploy work?
● Everyone...
  ○ What’s happening in our environment?

Page 8: WTF is Sensu and Monitoring

.WTF/is/monitoring?why_tomorrow

● Client - Is maintenance going to happen soon?
● CEO - Are we going to keep making money?
● Manager - Can we meet new SLA agreements?
● Engineer - Why might I get woken up tonight?
● Developer - When do I need to optimise?
● Everyone...
  ○ What’s going to happen in our environment?

Page 9: WTF is Sensu and Monitoring

.WTF/is/monitoring?what

Part Two - What should we look for?

Page 10: WTF is Sensu and Monitoring

.WTF/is/monitoring?disclaimer

Some approaches work better than others; don’t be afraid to experiment.

Page 11: WTF is Sensu and Monitoring

.WTF/is/monitoring?principles

Focus on your customers
Use a couple of monitoring systems
De-couple your checks from your code
Remember workflow events
Many simple checks > Fewer clever checks
Don’t wake me up if it can wait

Page 12: WTF is Sensu and Monitoring

.WTF/is/monitoring?first_steps

● Look for the big impact entry points
● Review past incidents for danger zones
● Don’t be afraid to admit that risky code exists

Page 13: WTF is Sensu and Monitoring

.WTF/is/monitoring?common

● Disk, RAM, Load, Network
● Patches available
● Uptime
● Logged in users
● Config Management status

Page 14: WTF is Sensu and Monitoring

.WTF/is/monitoring?services

● Create HTTP status endpoints
● JSON is great
● 200 OK / 503 Service Unavailable
● Lightweight
● Downstream dependencies?
● Service metrics?
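A status endpoint along these lines can be sketched with the Python standard library. This is a minimal illustration, not the talk's implementation: the `/status` path, the port and the `check_dependencies` probe are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_status(dependencies):
    """Return (http_code, json_body) for a dict of dependency name -> healthy flag."""
    healthy = all(dependencies.values())
    body = json.dumps(
        {"status": "ok" if healthy else "unavailable", "dependencies": dependencies},
        sort_keys=True,
    )
    # 200 OK when everything is healthy, 503 Service Unavailable otherwise.
    return (200 if healthy else 503), body


def check_dependencies():
    # Hypothetical placeholder: swap in cheap real probes (database ping, cache ping).
    return {"database": True, "cache": True}


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":  # assumed endpoint path
            self.send_error(404)
            return
        code, body = build_status(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the status page until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping `build_status` separate from the HTTP handler keeps the page lightweight and lets the same logic be unit tested without a running server.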

Page 15: WTF is Sensu and Monitoring

.WTF/is/monitoring?clusters

● Aggregate checks
● Members don’t matter
● Deploys and maintenance are OK
● Avoid bypassing balancers
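The aggregate idea can be sketched as a function that judges the cluster by the share of healthy members rather than by any single one. The function name and the 75% / 50% thresholds are hypothetical; statuses follow the usual Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit statuses


def aggregate_status(member_statuses, warn_pct=75.0, crit_pct=50.0):
    """Judge a cluster by the percentage of members reporting OK.

    member_statuses maps a member name to its latest check status.
    The thresholds are illustrative defaults, not values from the talk.
    """
    if not member_statuses:
        return CRITICAL, "no members reported"
    ok = sum(1 for status in member_statuses.values() if status == OK)
    pct = 100.0 * ok / len(member_statuses)
    message = f"{ok}/{len(member_statuses)} members OK ({pct:.0f}%)"
    if pct < crit_pct:
        return CRITICAL, message
    if pct < warn_pct:
        return WARNING, message
    return OK, message
```

With these defaults a two-node cluster with one node down for a deploy yields a warning rather than a page, which is the point of "members don’t matter".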

Page 16: WTF is Sensu and Monitoring

.WTF/is/monitoring?company

● Programmatic goals can be monitored
● See if revenue, purchases or direct customer interactions can be watched
● Watch for social media mentions

Page 17: WTF is Sensu and Monitoring

.WTF/is/monitoring?practise_simple

● nginx & php running
● Balancer: 200 OK
● nginx: 200 OK
● Cron: ignore for now

[Diagram]
Web Load Balancer
  Web01: nginx, php, cron
  Web02: nginx, php

Page 18: WTF is Sensu and Monitoring

.WTF/is/monitoring?practise_adv

● Balancer: >50% backends up
● Nginx: <200ms response
● Cron: err log empty && <1hr old

[Diagram]
Web Load Balancer
  Web01: nginx, php, cron
  Web02: nginx, php
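The cron and nginx rules on this slide can be sketched as two small check functions. This assumes one reading of "err log empty && <1hr old": the cron job touches or truncates its error log on every run, so an empty file modified within the last hour means the job ran recently without errors. The function names, and how the response time is measured, are left as assumptions.

```python
import os
import time

OK, CRITICAL = 0, 2  # Nagios-style exit statuses


def check_cron_err_log(path, max_age_s=3600, now=None):
    """OK only if the error log exists, is empty, and was modified recently."""
    now = time.time() if now is None else now
    if not os.path.exists(path):
        return CRITICAL, f"CRITICAL: {path} is missing"
    if os.path.getsize(path) > 0:
        return CRITICAL, f"CRITICAL: {path} is not empty"
    age = now - os.path.getmtime(path)
    if age >= max_age_s:
        return CRITICAL, f"CRITICAL: {path} untouched for {age:.0f}s"
    return OK, f"OK: {path} is empty and fresh"


def check_response_time(elapsed_ms, limit_ms=200):
    """OK if a measured response time is under the limit (200ms on the slide)."""
    if elapsed_ms >= limit_ms:
        return CRITICAL, f"CRITICAL: response took {elapsed_ms:.0f}ms"
    return OK, f"OK: response took {elapsed_ms:.0f}ms"
```

Each function returns a status and a one-line message, so a thin wrapper can print the message and exit with the status.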

Page 19: WTF is Sensu and Monitoring

.WTF/is/monitoring?practise_clever

● Spike in traffic
● Failure counts above thresholds
● Response sizes are curiously large
● Lots of (valid) API Auth requests

[Diagram]
Web Load Balancer
  Web01: nginx, php, cron
  Web02: nginx, php

Page 20: WTF is Sensu and Monitoring

.WTF/is/monitoring?what

Your users matter
  Know when they’re in pain

Develop a standardised app status page
  Conventional checks are used more frequently

Check lots of small things
  Scales better and helps to isolate incidents quickly

Page 21: WTF is Sensu and Monitoring

.WTF/is/sensu

Part Three - How can Sensu help us?

Page 22: WTF is Sensu and Monitoring

.WTF/is/sensu?introduction

“New generation” of monitoring solutions
Open source, with a paid-for Enterprise edition

Site: sensuapp.org
GitHub: github.com/sensu
IRC: freenode - #sensu

Page 23: WTF is Sensu and Monitoring

.WTF/is/sensu?what

Consistent way to describe a service check

Executes those checks as required

Reliably handles events (and metrics)

Page 24: WTF is Sensu and Monitoring

.WTF/is/sensu?why

● Tries to do one thing well: handle events
● Compatible with existing check scripts
● Large active open-source community
● Scales effectively

Page 25: WTF is Sensu and Monitoring

.WTF/is/sensu?experience

● Replaced Nagios, crons etc.
● Raised visibility of monitoring
● Devolved control to development
● 340 (ish) hosts, VMs, switches, firewalls etc.
● Managed exclusively through Puppet
● Developed custom plugins and extensions

Page 26: WTF is Sensu and Monitoring

.WTF/is/sensu?architecture_simple

Page 27: WTF is Sensu and Monitoring

.WTF/is/sensu?how

The Sensu Standalone Check Process:

a. Sensu-Client runs a script with 1 line output and an exit code
b. Sensu-Client converts event into JSON and puts on RabbitMQ
c. Sensu-Server reads event and sends to handlers
d. Handlers process event, performing some action
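Steps a and b can be sketched in a few lines of Python: run a command, capture its output and exit code, and wrap both in JSON. This illustrates the idea and is not Sensu's own client code. The field layout is a simplified assumption, `run_check` and the `example-host` default are hypothetical names, and publishing to RabbitMQ is left out because it needs a client library.

```python
import json
import subprocess
import time


def run_check(name, command, client="example-host"):
    """Run a check command and wrap its output and exit code as a JSON result."""
    started = time.time()
    # Commands are shell strings, as in a check definition (POSIX shell assumed).
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    result = {
        "client": client,
        "check": {
            "name": name,
            "command": command,
            "output": proc.stdout.strip(),
            "status": proc.returncode,  # 0 OK, 1 WARNING, 2 CRITICAL, other UNKNOWN
            "executed": int(started),
            "duration": round(time.time() - started, 3),
        },
    }
    return json.dumps(result)
```

For example, `run_check("demo", "echo 'demo WARNING: 91% used'; exit 1")` yields a result with status 1 and the echoed line as output. Because only the output line and exit code matter, existing Nagios-style check scripts can be reused unchanged.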

Page 28: WTF is Sensu and Monitoring

.WTF/is/sensu?architecture_simple

You are here

Page 29: WTF is Sensu and Monitoring

.WTF/is/sensu?standalone_check

● Describes
  ○ what check to run
  ○ how to handle events
● Runs at a given interval (default 60s)
● sensu-client handles output and emits events over message brokers
● Can include custom configuration which is included in event sent to handlers

    sensu::checks:
      'sensu-server':
        command: 'check-procs.rb -p bin/sensu-server -c 1'
        handlers: ['high', 'pagerduty']
        custom:
          runbook: 'https://wiki.ftr.com/x/4oqq'
          tip: 'Check /var/log/sensu-server.log'
          slack:
            channels:
              - '#craggyisland'

Page 30: WTF is Sensu and Monitoring

.WTF/is/sensu?runbook

URI to a page summarising:

● Impacted services
● Troubleshooting
● Common problems
● How to fix
● Who to talk to
● References to other information

Page 31: WTF is Sensu and Monitoring

.WTF/is/sensu?tip

Tweet-length one-liner
Gets included in PagerDuty and Slack notices
Useful at 4am on a Sunday morning

Page 32: WTF is Sensu and Monitoring

.WTF/is/sensu?architecture_simple

You are here

Page 33: WTF is Sensu and Monitoring

.WTF/is/sensu?architecture_simple

You are here

Page 34: WTF is Sensu and Monitoring

.WTF/is/sensu?handler

● Process events
● Perform some (or no) action
● Typically used to send alerts or emails

    sensu::handler:
      slack:
        type: 'pipe'
        command: 'slack.rb'
        config:
          webhook_token: 'SECRET/KEY'
          bot_name: 'sensu'
          channel: '#alerts'
      pagerduty:
        type: 'pipe'
        command: 'pagerduty.rb'
        severities: ['ok', 'critical']
        config:
          api_key: SECRET_TOKEN_HERE
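A pipe handler is a command that receives the event as JSON on standard input. The sketch below formats an alert line and appends the custom `tip` and `runbook` fields when present. It assumes custom check attributes appear inside the event's `check` object and that `client` is an object with a `name`; `format_alert` is a hypothetical name, and the real `slack.rb` and `pagerduty.rb` handlers do considerably more.

```python
import json
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_alert(event):
    """Build a short, human-readable notice from an event dict."""
    check = event["check"]
    status = STATUS_NAMES.get(check.get("status"), "UNKNOWN")
    output = check.get("output", "").strip()
    lines = [f"{status}: {event['client']['name']}/{check['name']}: {output}"]
    # Custom attributes (assumed to sit alongside the standard check fields).
    if check.get("tip"):
        lines.append(f"Tip: {check['tip']}")
    if check.get("runbook"):
        lines.append(f"Runbook: {check['runbook']}")
    return "\n".join(lines)


def main(stream=sys.stdin):
    """Entry point for use as a pipe handler: read one event, print the notice."""
    print(format_alert(json.load(stream)))
```

Calling `main()` from a script turns this into something a `type: 'pipe'` handler could run. A real handler would post the text to a chat or paging service instead of printing it.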

Page 35: WTF is Sensu and Monitoring

.WTF/is/sensu?standalone_metrics

● The same as checks but...
● handlers: [‘metrics’]
  ○ A special handler for this kind of result
● type: metric
  ○ Tells sensu to always send the output to the handler

    sensu::checks:
      cpu-pcnt-usage-metrics:
        command: 'cpu-pcnt-usage-metrics.rb'
        handlers: ['metrics']
        type: metric

Page 36: WTF is Sensu and Monitoring

.WTF/is/sensu?metric_example

    Key                     Value  Timestamp
    ix-sensu01.cpu.user     70.92  1440425049
    ix-sensu01.cpu.nice      0.00  1440425049
    ix-sensu01.cpu.system    8.16  1440425049
    ix-sensu01.cpu.idle     19.90  1440425049
    ix-sensu01.cpu.iowait    0.00  1440425049
    ix-sensu01.cpu.irq       0.00  1440425049
    ix-sensu01.cpu.softirq   1.02  1440425049
    ix-sensu01.cpu.steal     0.00  1440425049
    ix-sensu01.cpu.guest     0.00  1440425049
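The key, value, timestamp layout matches the whitespace-separated form used by Graphite's plaintext protocol, so a metrics handler can parse each line with a simple split. The sketch below is illustrative; `Metric`, `parse_metric_line` and `format_metric` are names invented for the example.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    key: str        # dotted path, e.g. host.cpu.user
    value: float    # measured value
    timestamp: int  # Unix epoch seconds


def parse_metric_line(line):
    """Parse one 'key value timestamp' line into a Metric."""
    key, value, timestamp = line.split()
    return Metric(key, float(value), int(timestamp))


def format_metric(metric):
    """Render a Metric back into the one-line wire format."""
    return f"{metric.key} {metric.value:.2f} {metric.timestamp}"
```

A handler would typically apply `parse_metric_line` to each non-empty line of the check output and forward the results to a time-series store.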

Page 37: WTF is Sensu and Monitoring

.WTF/is/sensu?dashboards

● Uchiwa - github.com/sensu/uchiwa
● Mosaic - github.com/warmfusion/mosaic
● Sensu-Grid - github.com/alex-leonhardt/sensu-grid

Page 38: WTF is Sensu and Monitoring

.WTF/is/sensu?issues

● Uchiwa isn’t perfect
● Sensu-API can crash sometimes
● No maintained history (over 20 events)
● Check dependencies are handled on clients
● Redis for datastore
  ○ Redundancy is a little harder (for me at least)

Page 39: WTF is Sensu and Monitoring

.WTF/is/sensu?wins

● Alerts into Slack channels
● Handles network partitions really well
● Easy to create new checks and handlers

Page 40: WTF is Sensu and Monitoring

.WTF/is/monitoring?further_reading

Programmatic Alert Correlation - Elik Eizenberg
youtu.be/EXk19d09n54

Effective Incident Communication - Scott Klein
youtu.be/ySSdqfZlC7Y

Search for Operability 2015 on YouTube

Page 41: WTF is Sensu and Monitoring

.WTF/whois?q=self

author: ‘Toby Jackson <[email protected]>’
role: ‘Operations Engineer’
twitter: ‘@warmfusion’
github: ‘github.com/warmfusion’
employer: ‘www.futureplc.com/yourfuturejob/’

Any Questions…?