practical monitoring techniques

Post on 13-Apr-2017


Practical Monitoring Techniques

Today's Talk

● Our Mission
● Current Tools
● Increasing Coverage
● PD Schedules
● Automatic Self Healing
● Bots And Alerts channels
● Events Dashboard
● Dashboard Accessibility
● Best Practices Summary

Our Mission

Back up culture with the proper tools to support it

Current Tools

● Metrics collection: Collectd, statsd, CloudWatch
● Monitoring: Sensu, NewRelic
● Alert channels: PagerDuty, email, Slack
● Dashboards: Grafana, CloudWatch, NewRelic
● Application testing: E2E Testing System
● Internal tools: Sensu mobile, events system, Sensu bar and more

Increasing Coverage

● Automatic collection of basic system and 3rd party metrics for new instances
● Alerts are added automatically for each new instance of an existing subscriber
● Each Developer / DevOps is responsible for monitoring their own application / infrastructure
● Easy method to add new alerts and dashboards
● Automatic events flow
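
The automatic alerts for new instances can be sketched as a subscription match: a check names the subscriptions it applies to, and any instance that joins one of them picks the check up with no per-host configuration. This is a minimal sketch of the idea rather than the team's actual Sensu setup; the `Check` and `Instance` types, the check commands and the host name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Check:
    name: str
    command: str            # hypothetical command line, for illustration only
    subscribers: frozenset  # subscriptions this check applies to


@dataclass
class Instance:
    hostname: str
    subscriptions: set = field(default_factory=set)


def checks_for(instance, checks):
    """Return the checks whose subscribers overlap the instance's subscriptions."""
    matching = [c for c in checks if c.subscribers & instance.subscriptions]
    return sorted(matching, key=lambda c: c.name)


# Hypothetical check catalogue, kept in one place so it can be reviewed like code.
CHECKS = [
    Check("cpu", "check_cpu --warn 80 --crit 95", frozenset({"base"})),
    Check("http", "check_http --url http://localhost/health", frozenset({"web"})),
    Check("replication", "check_replication --max-lag 30", frozenset({"db"})),
]

# A new web instance only declares its subscriptions; the checks follow.
new_instance = Instance("web-42", {"base", "web"})
assigned = [c.name for c in checks_for(new_instance, CHECKS)]
print(assigned)
```

Because the mapping lives on the check and not on the host, adding a host to an existing group needs no alerting change at all.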

Pager Schedules

● Divided into logical groups of ownership
● Schedule has escalation point
● On call should be able to connect and respond to issues in his area
● Easy method to override schedule
● Ability to contact relevant on call
● Ability to page relevant on call
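
One way to picture a schedule with an escalation point and an override is the sketch below. It is not how PagerDuty models schedules; the `Schedule` class, the day-index rotation and every name in it are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Schedule:
    group: str        # logical ownership group, e.g. "backend"
    rotation: list    # on-call people, one per day, repeating
    escalation: str   # who is paged if the on-call does not respond
    overrides: dict = field(default_factory=dict)  # day index -> replacement

    def on_call(self, day):
        """Return who is on call on a given day, honouring overrides."""
        if day in self.overrides:
            return self.overrides[day]
        return self.rotation[day % len(self.rotation)]

    def page_chain(self, day):
        """Return the order in which people are paged for this group."""
        return [self.on_call(day), self.escalation]


backend = Schedule("backend", ["dana", "lee"], escalation="team-lead")
backend.overrides[3] = "sam"   # a one-off swap, without editing the rotation
print(backend.page_chain(3))
```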

Automatic Self Healing

● Better MTTR
● Avoid waking On Call if possible
● Log activity to surface recurring issues
● Limit the healing to avoid restart loops
● Make sure to sync Healer Alert↔
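
A minimal sketch of limited, logged healing follows, assuming a restart is the remediation and that "limit" means a maximum number of attempts per time window. The `Healer` class, its thresholds and the `restart` callback are hypothetical and not the tooling described in the talk.

```python
import logging
import time
from collections import defaultdict, deque

log = logging.getLogger("healer")


class Healer:
    """Restart a failing service, but give up and escalate if it keeps failing."""

    def __init__(self, restart, max_attempts=3, window_seconds=600,
                 clock=time.monotonic):
        self.restart = restart              # callable that restarts one service
        self.max_attempts = max_attempts    # attempts allowed per window
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so the limit can be tested
        self.history = defaultdict(deque)   # service -> recent attempt times

    def handle(self, service):
        now = self.clock()
        attempts = self.history[service]
        # Forget attempts that have aged out of the window.
        while attempts and now - attempts[0] > self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            # Healing is not working: stop looping and hand over to a human.
            log.warning("%s: %d restarts in %ss, escalating",
                        service, len(attempts), self.window_seconds)
            return "escalate"
        attempts.append(now)
        # Every attempt is logged so recurring issues show up later.
        log.info("%s: restart attempt %d", service, len(attempts))
        self.restart(service)
        return "restarted"


restarted = []
fake_time = [0.0]
healer = Healer(restarted.append, max_attempts=2, window_seconds=60,
                clock=lambda: fake_time[0])
first = healer.handle("api")
fake_time[0] = 10.0
second = healer.handle("api")
fake_time[0] = 20.0
third = healer.handle("api")   # over the limit inside the window
print(first, second, third)
```

Returning "escalate" instead of restarting again is what keeps a crash-looping service from being restarted forever while nobody is told.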

Bots, Integrations and Alerts Channels

● Alerts channels: Emails, Slack, PD mobile, SMS, calls
● Integrations: Sensu to PD/Slack, CloudWatch to PD, 3rd party (e.g. CouchBase, NewRelic, etc.) to PD
● Slack Bot:

Events Dashboard

● Simple REST API for sending events
● Clean timeline view to spot production events
● Connections between events (“depends on” and “dependents”)
● Detailed view for each event
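
The "depends on" and "dependents" links can be illustrated with a small sketch: each event records what it depends on, and the reverse "dependents" view is derived from that. The field names, event IDs and timestamps are invented for the example, and the JSON body only suggests what a client might send to such an API; the real events system's schema is not given in the talk.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    id: str
    title: str
    timestamp: str   # ISO 8601 string, as a client might send it
    depends_on: list = field(default_factory=list)


def to_payload(event):
    """Serialize an event as the JSON body of a hypothetical POST request."""
    return json.dumps(asdict(event), sort_keys=True)


def dependents_index(events):
    """Derive the reverse links: for each event ID, which events depend on it."""
    index = {e.id: [] for e in events}
    for e in events:
        for parent in e.depends_on:
            index.setdefault(parent, []).append(e.id)
    return index


deploy = Event("deploy-1", "Deploy API", "2017-01-01T10:00:00Z")
alert = Event("alert-7", "API 5xx spike", "2017-01-01T10:05:00Z",
              depends_on=["deploy-1"])
index = dependents_index([deploy, alert])
print(to_payload(alert))
print(index)
```

Storing only one direction and deriving the other avoids the two lists drifting out of sync.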

Accessibility

● Available from everywhere by mobile
● Easy to ack, resolve, mute alerts
● Slack bots to reach help
● Automatically get graph with the alert
● Ability to search, edit, copy, etc. alerts
● Treat alerts management as code (SVC, DB, backups, etc)

Best Practices Summary

● Share the pain
● Automate base metrics
● Automate healing
● Make help reachable
● Make it easy to add alerts and dashboards
● Use warning levels as soft events to avoid phone calls at night
● Automate graphs in alerts
● Positive alerting system check each day
● Dependencies between alerts
● Postmortems
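
Two of these practices are easy to sketch: routing by severity, so that warnings stay in chat and only criticals page someone, and a daily positive check that raises an alarm when the expected test alert has not arrived. The channel names, the `route` and `heartbeat_overdue` helpers and the 24-hour threshold are assumptions for illustration, not the configuration used in the talk.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def route(severity):
    """Pick channels by severity: warnings are soft events and never page."""
    if severity is Severity.CRITICAL:
        return ["pagerduty", "slack"]
    if severity is Severity.WARNING:
        return ["slack"]
    return []


def heartbeat_overdue(last_seen, now, max_age=timedelta(days=1)):
    """True if the daily test alert was not seen within max_age.

    A missing heartbeat means the alerting pipeline itself may be broken.
    """
    return now - last_seen > max_age


print(route(Severity.WARNING))
print(heartbeat_overdue(datetime(2017, 1, 1, 9, 0), datetime(2017, 1, 2, 10, 0)))
```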

Questions
