wtf is sensu and monitoring
.WTF/is/sensu
A DevOps guide to monitoring
.WTF/is/monitoring
A DevOps guide to monitoring
.WTF/whois
self:
  author: 'Toby Jackson <[email protected]>'
  role: 'Operations Engineer'
  twitter: '@warmfusion'
  github: 'github.com/warmfusion'
  employer: 'www.futureplc.com/yourfuturejob/'
.WTF/is/monitoring?experience
● Developer turned Engineer
● Implemented Sensu at Future PLC
  ○ 340+ hosts, VMs, switches etc.
● Helped shape our approach to monitoring
.WTF/is/monitoring?_index
Why do we monitor our systems?
What should we look for?
How can Sensu help us?
Questions…?
.WTF/is/monitoring?why
Part One - Why do we monitor our systems?
.WTF/is/monitoring?why
● Client - Are they down, or is it just me?
● CEO - Are we making money?
● Manager - Are we meeting our SLAs?
● Engineer - Am I being woken up for the right reasons?
● Developer - Did my deploy work?
● Everyone...
  ○ What’s happening in our environment?
.WTF/is/monitoring?why_tomorrow
● Client - Is maintenance going to happen soon?
● CEO - Are we going to keep making money?
● Manager - Can we meet new SLAs?
● Engineer - Why might I get woken up tonight?
● Developer - When do I need to optimise?
● Everyone...
  ○ What’s going to happen in our environment?
.WTF/is/monitoring?what
Part Two - What should we look for?
.WTF/is/monitoring?disclaimer
Some approaches work better than others; don’t be afraid to experiment.
.WTF/is/monitoring?principles
Focus on your customers
Use a couple of monitoring systems
De-couple your checks from your code
Remember workflow events
Many simple checks > Fewer clever checks
Don’t wake me up if it can wait
.WTF/is/monitoring?first_steps
● Look for the big impact entry points
● Review past incidents for danger zones
● Don’t be afraid to admit that risky code exists
.WTF/is/monitoring?common
● Disk, RAM, load, network
● Patches available
● Uptime
● Logged in users
● Config management status
.WTF/is/monitoring?services
● Create HTTP status endpoints
● JSON is great
● 200 OK / 503 Service Unavailable
● Lightweight
● Downstream dependencies?
● Service metrics?
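The status endpoint idea can be sketched with nothing but Python's standard library. This is an illustration only, not the author's implementation: the `/status` path, the `check_database` probe, the port and the JSON field names are all assumptions made up for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical downstream dependency probe; replace with a real one."""
    return True


# Hypothetical registry of dependency checks (name -> callable returning bool).
DEPENDENCY_CHECKS = {"database": check_database}


def build_status(checks):
    """Run each check and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    healthy = all(results.values())
    body = json.dumps({"status": "ok" if healthy else "unavailable",
                       "dependencies": results})
    return (200 if healthy else 503), body


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":  # assumed endpoint path
            self.send_error(404)
            return
        code, body = build_status(DEPENDENCY_CHECKS)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the status page until interrupted (the port is an assumption)."""
    HTTPServer(("", port), StatusHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A monitoring check then only needs to look at the HTTP status code, while a human can read the JSON body to see which dependency failed.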
.WTF/is/monitoring?clusters
● Aggregate checks
● Members don’t matter
● Deploys and maintenance are OK
● Avoid bypassing balancers
.WTF/is/monitoring?company
● Programmatic goals can be monitored
● See if revenue, purchases or direct customer interactions can be watched
● Watch for social media mentions
.WTF/is/monitoring?practise_simple
● nginx & php running
● Balancer: 200 OK
● nginx: 200 OK
● Cron: ignore for now
[Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
.WTF/is/monitoring?practise_adv
● Balancer: >50% backends up
● Nginx: <200ms response
● Cron: err log empty && <1hr old
[Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
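The three thresholds on this slide can be written as small pure functions. This is a sketch under assumptions: the function names and input shapes are invented for illustration, the backend states and timings would come from your own balancer and HTTP probes, and the cron rule is read here as "the error log is empty and was modified less than an hour ago".

```python
import os
import time

OK, CRITICAL = 0, 2  # Nagios-style exit codes, which Sensu checks also use


def balancer_status(backend_up, min_fraction=0.5):
    """CRITICAL unless strictly more than min_fraction of backends are up."""
    if not backend_up:
        return CRITICAL  # no backends at all is treated as a failure
    fraction = sum(1 for up in backend_up if up) / len(backend_up)
    return OK if fraction > min_fraction else CRITICAL


def response_time_status(elapsed_ms, limit_ms=200):
    """CRITICAL when a response took limit_ms or longer."""
    return OK if elapsed_ms < limit_ms else CRITICAL


def cron_log_status(err_log_path, max_age_s=3600, now=None):
    """OK only if the error log exists, is empty, and is newer than max_age_s."""
    now = time.time() if now is None else now
    try:
        info = os.stat(err_log_path)
    except FileNotFoundError:
        return CRITICAL
    fresh = (now - info.st_mtime) < max_age_s
    return OK if info.st_size == 0 and fresh else CRITICAL
```

Note that the balancer check never names an individual backend: one member being down for a deploy does not alert, which is the point of aggregating.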
.WTF/is/monitoring?practise_clever
● Spike in traffic
● Failure counts above thresholds
● Response sizes are curiously large
● Lots of (valid) API Auth requests
[Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
.WTF/is/monitoring?what
● Your users matter
  ○ Know when they’re in pain
● Develop a standardised app status page
  ○ Conventional checks are used more frequently
● Check lots of small things
  ○ Scales better and helps to isolate incidents quickly
.WTF/is/sensu
Part Three - How can Sensu help us?
.WTF/is/sensu?introduction
“New generation” of monitoring solutions
Open source, with a paid-for Enterprise edition
Site: sensuapp.org
GitHub: github.com/sensu
IRC: freenode - #sensu
.WTF/is/sensu?what
Consistent way to describe a service check
Executes those checks as required
Reliably handles events (and metrics)
.WTF/is/sensu?why
● Tries to do one thing well: handle events
● Compatible with existing check scripts
● Large active open-source community
● Scales effectively
.WTF/is/sensu?experience
● Replaced Nagios, crons etc.
● Raised visibility of monitoring
● Devolved control to development
● 340 (ish) hosts, VMs, switches, firewalls etc.
● Managed exclusively through Puppet
● Developed custom plugins and extensions
.WTF/is/sensu?architecture_simple
.WTF/is/sensu?how
The Sensu Standalone Check Process:
a. Sensu-Client runs a script that produces one line of output and an exit code
b. Sensu-Client converts the result into a JSON event and puts it on RabbitMQ
c. Sensu-Server reads the event and sends it to handlers
d. Handlers process the event, performing some action
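Step (a) is the only part you normally write yourself. A check is any executable that prints one line and exits with a Nagios-style status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is a hypothetical disk check written for illustration: the `CheckDisk` label, the thresholds and the default path are assumptions, not something taken from the talk.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, one_line_message)."""
    if used_pct >= crit:
        return CRITICAL, f"CheckDisk CRITICAL: {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"CheckDisk WARNING: {used_pct:.1f}% used"
    return OK, f"CheckDisk OK: {used_pct:.1f}% used"


def run_check(path="/"):
    """Print the single output line and return the exit code for the client."""
    try:
        usage = shutil.disk_usage(path)
        code, message = evaluate_disk(100.0 * usage.used / usage.total)
    except OSError as exc:
        code, message = UNKNOWN, f"CheckDisk UNKNOWN: {exc}"
    print(message)
    return code


# When saved as an executable script, end it with:
#     import sys; sys.exit(run_check())
```

Keeping the threshold logic in `evaluate_disk` separate from the I/O in `run_check` makes the check easy to unit test, and the same script works unchanged under Nagios.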
.WTF/is/sensu?architecture_simple
You are here
.WTF/is/sensu?standalone_check
● Describes
  ○ what check to run
  ○ how to handle events
● Runs at a given interval (default 60s)
● sensu-client handles the output and emits events over the message broker
● Can include custom configuration, which is passed along in the event sent to handlers
sensu::checks:
  'sensu-server':
    command: 'check-procs.rb -p bin/sensu-server -c 1'
    handlers: ['high', 'pagerduty']
    custom:
      runbook: 'https://wiki.ftr.com/x/4oqq'
      tip: 'Check /var/log/sensu-server.log'
      slack:
        channels:
          - '#craggyisland'
.WTF/is/sensu?runbook
URI to a page summary of:
● Impacted services
● Troubleshooting
● Common problems
● How to fix
● Who to talk to
● References to other information
.WTF/is/sensu?tip
Tweet-length one-liner
Gets included in PagerDuty and Slack notices
Useful at 4am on a Sunday morning
.WTF/is/sensu?architecture_simple
You are here
.WTF/is/sensu?handler
● Process events
● Perform some (or no) action
● Typically used to send alerts or emails
sensu::handler:
  slack:
    type: 'pipe'
    command: 'slack.rb'
    config:
      webhook_token: 'SECRET/KEY'
      bot_name: 'sensu'
      channel: '#alerts'
  pagerduty:
    type: 'pipe'
    command: 'pagerduty.rb'
    severities: ['ok', 'critical']
    config:
      api_key: SECRET_TOKEN_HERE
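A `pipe` handler is just a command that receives the event as JSON on standard input. The sketch below shows the general shape rather than the real `slack.rb` or `pagerduty.rb` plugins. It assumes the event carries `client` and `check` objects, and that custom attributes such as `tip` and `runbook` sit alongside the standard fields inside `check`. The function names are made up for the example.

```python
import json
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_alert(event):
    """Build a one-line alert from a Sensu event dictionary."""
    client = event.get("client", {}).get("name", "unknown-client")
    check = event.get("check", {})
    status = STATUS_NAMES.get(check.get("status"), "UNKNOWN")
    name = check.get("name", "unknown-check")
    output = str(check.get("output", "")).strip()
    line = f"{status}: {client}/{name}: {output}"
    # Assumed: custom check attributes appear inside event["check"].
    if check.get("tip"):
        line += f" | tip: {check['tip']}"
    if check.get("runbook"):
        line += f" | runbook: {check['runbook']}"
    return line


def handle(stream=None):
    """Read one JSON event from the stream (default STDIN) and print the alert."""
    stream = sys.stdin if stream is None else stream
    print(format_alert(json.load(stream)))
```

A real handler would replace the `print` with a call to a chat webhook or paging API. Carrying the tip and runbook through to the alert is what makes them useful at 4am.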
.WTF/is/sensu?standalone_metrics
● The same as checks but...
● handlers: ['metrics']
  ○ A special handler for this kind of result
● type: metric
  ○ Tells Sensu to always send the output to the handler
sensu::checks:
  cpu-pcnt-usage-metrics:
    command: 'cpu-pcnt-usage-metrics.rb'
    handlers: ['metrics']
    type: metric
.WTF/is/sensu?metric_example
Key                       Value   Timestamp
ix-sensu01.cpu.user       70.92   1440425049
ix-sensu01.cpu.nice       0.00    1440425049
ix-sensu01.cpu.system     8.16    1440425049
ix-sensu01.cpu.idle       19.90   1440425049
ix-sensu01.cpu.iowait     0.00    1440425049
ix-sensu01.cpu.irq        0.00    1440425049
ix-sensu01.cpu.softirq    1.02    1440425049
ix-sensu01.cpu.steal      0.00    1440425049
ix-sensu01.cpu.guest      0.00    1440425049
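The output above is one `key value timestamp` triple per line, the same layout as Graphite's plaintext format. A metric check only has to print lines in that shape. The helper names below are invented for illustration, and using the short hostname as the key prefix is an assumption based on the example output.

```python
import socket
import time


def graphite_lines(prefix, values, timestamp=None):
    """Format a dict of metric values as 'key value timestamp' lines."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{name} {value:.2f} {ts}"
            for name, value in sorted(values.items())]


def emit_cpu_metrics(values, host=None):
    """Print one metric per line, keyed by short hostname (scheme assumed)."""
    host = host or socket.gethostname().split(".")[0]
    for line in graphite_lines(f"{host}.cpu", values):
        print(line)
```

For example, `graphite_lines("ix-sensu01.cpu", {"user": 70.92}, 1440425049)` returns `["ix-sensu01.cpu.user 70.92 1440425049"]`.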
.WTF/is/sensu?dashboards
● Uchiwa - github.com/sensu/uchiwa
● Mosaic - github.com/warmfusion/mosaic
● Sensu-Grid - github.com/alex-leonhardt/sensu-grid
.WTF/is/sensu?issues
● Uchiwa isn’t perfect
● Sensu-API can crash sometimes
● No history is kept beyond the last 20 events
● Check dependencies are handled on clients
● Redis for datastore
  ○ Redundancy is a little harder (for me at least)
.WTF/is/sensu?wins
● Alerts into Slack channels
● Handles network partitions really well
● Easy to create new checks and handlers
.WTF/is/monitoring?further_reading
Programmatic Alert Correlation - Elik Eizenberg
youtu.be/EXk19d09n54
Effective Incident Communication - Scott Klein
youtu.be/ySSdqfZlC7Y
Search for Operability 2015 on YouTube
.WTF/whois?q=self
author: 'Toby Jackson <[email protected]>'
role: 'Operations Engineer'
twitter: '@warmfusion'
github: 'github.com/warmfusion'
employer: 'www.futureplc.com/yourfuturejob/'
Any Questions…?