Time to say goodbye to your Nagios based setup
DESCRIPTION
Time to say goodbye to your Nagios based setup. Discover the cool new tools out there for more efficient monitoring. A talk given at OSMC 2014. https://www.youtube.com/watch?v=_BAWi9Zhmic
TRANSCRIPT
So you want to switch off ?
Time to say goodbye to your Nagios based setup!
© 2014 - Olivier Jan - Check my Website - @olivjan
About me
❖ System admin and architect
❖ Co-founder of « Communauté Francophone de la Supervision Libre »
❖ Writer of the book « Nagios 3 au cœur de la supervision Open Source »
❖ Co-founder of Check my Website, a SaaS service for remote monitoring of
websites and applications (current)
Content
❖ Why switch off ? the good and maybe not so good reasons to do so !
❖ Which way to take ?
❖ Building a monitoring solution without Nagios :
❖ Tools available
❖ A personal work in progress
❖ Migrating from Nagios to this kind of solution
Some reasons to switch off…
❖ The godfather of OSS monitoring is dead as an
Open Source project ?
❖ Can’t do better with it
❖ Cool new kids out there
❖ Better « cloud » support
❖ Clear distinction between monitoring states,
metrics and messages
❖ Better charting solution
❖ Near realtime monitoring
❖ Routing, aggregation, correlation…
❖ YOUR reasons ;)
Which way to take ?
❖ The « 4 musketeers »
❖ Naemon
❖ Icinga 2
❖ Shinken
❖ Centreon
❖ Reboot from building blocks
❖ Collect
❖ Store
❖ Visualize
❖ Alert
Tools : Collecting metrics and messages
❖ Packetbeat (metrics & messages)
❖ Rsyslog, NXLog, Syslog-ng
(messages)
❖ sFlow Toolkit, Host sFlow
❖ Logstash-forwarder (messages)
❖ Collectd (metrics)
❖ Diamond (metrics)
❖ OSquery, WMI (metrics)
❖ Network level (sFlow)
❖ System Level
❖ Application Level
Tools : External collecting
❖ End user perspective
❖ Checks run as close as possible to the
end user
❖ Application behavior
❖ Real User Monitoring
❖ Webpagetest
❖ Selenium
❖ PhantomasJS
❖ Boomerang
❖ Bucky
Tools : Routing metrics and messages
❖ Messages : Logstash, Flume, Fluentd
❖ Metrics : StatsD
❖ Metrics : Carbon Relay NG
One or more messages can fire an event
Tools : Databases
❖ Graphite : The most used.
❖ OpenTSDB : HBase
❖ KairosDB : Cassandra
❖ InfluxDB : The most promising ?
❖ Elasticsearch : Index database
Tools : Visualizing metrics and messages
❖ Kibana
❖ Grafana
❖ Dashboards collection
Tools : Alerting
❖ Seyren : Alerting dashboard for
Graphite.
❖ Cabot : Get alerted when services
go down or metrics go crazy
❖ Bosun : An advanced, open-source
monitoring and alerting system
❖ Skyline : Real-time anomaly
detection system
❖ Oculus : Anomaly correlation
component of Etsy's Kale system
❖ Esper : Complex Event Processing
The French Monitoring Community Xperience
❖ Reboot from building blocks
❖ Collect
❖ Store
❖ Visualize
❖ Alert
The French Monitoring Community Xperience
Is it working ? What is not working ?
Collecting metrics : Collectd
❖ InfluxDB Collectd proxy
❖ Written in Go, like InfluxDB
❖ Temporary solution
❖ Native Collectd plugin
LoadPlugin network
<Plugin network>
  # proxy address
  Server "127.0.0.1" "8096"
</Plugin>
❖ PHP5-FPM metrics
❖ Nginx metrics
❖ MariaDB metrics
❖ System metrics
❖ <metricname>:<value>|<type>
Collecting messages : Rsyslog
❖ Nearly ready log consumption
❖ Native distribution package
❖ Nginx Log, MySQL slow query
log
template(name="ls_json"
         type="list" option.json="on") {
  constant(value="{")
  constant(value="\"@timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"@version\":\"1")
  constant(value="\",\"message\":\"") property(name="msg")
  constant(value="\",\"host\":\"") property(name="hostname")
  constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
  constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
  constant(value="\",\"programname\":\"") property(name="programname")
  constant(value="\",\"procid\":\"") property(name="procid")
  constant(value="\"}\n")
}
Collecting @ network level : Packetbeat
❖ Specific agent
❖ Collect traffic for
❖ HTTP
❖ MySQL
❖ PostgreSQL
❖ Redis
Routing messages : Logstash
❖ Inputs
❖ Codecs/filters
❖ Outputs
input {
  udp {
    port => 10514
    codec => "json"
    type => "syslog"
  }
}
filter {
  # This replaces the host field with the host that generated the message (sysloghost)
  if [sysloghost] {
    mutate {
      replace => [ "host", "%{sysloghost}" ]
      remove_field => "sysloghost"
    }
  }
}
output {
  elasticsearch { host => localhost }
}
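To make the filter stage concrete, the conditional above can be mirrored in a few lines of Python. This is only an illustrative sketch of what happens to one event, not how Logstash is implemented; the function name and the sample event are hypothetical.

```python
def replace_host_with_sysloghost(event: dict) -> dict:
    """Roughly mimic the filter: if sysloghost is set, copy it to host and drop it."""
    result = dict(event)  # work on a copy, leave the input event untouched
    value = result.get("sysloghost")
    # Assumption: the field counts as "set" when it exists and is neither null nor false.
    if value is not None and value is not False:
        result["host"] = value          # replace => [ "host", "%{sysloghost}" ]
        del result["sysloghost"]        # remove_field => "sysloghost"
    return result


if __name__ == "__main__":
    # Hypothetical event as it could arrive on the udp/json input.
    sample = {"message": "disk full", "host": "10.0.0.5", "sysloghost": "web01"}
    print(replace_host_with_sysloghost(sample))
```

Events without a sysloghost field pass through unchanged, which is why the conditional wraps the mutate block.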
Routing metrics : StatsD
❖ Now a de facto protocol, with client
libraries in most languages
❖ InfluxDB plugin
❖ Collectd can behave as a statsD
daemon (plugin)
❖ Very easy to push metrics
echo "foo:1|c" | nc -u -w0 127.0.0.1 8125
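The same push can be done from application code. Below is a minimal Python sketch that sends one `<metricname>:<value>|<type>` datagram over UDP, assuming a StatsD-compatible daemon listens on 127.0.0.1:8125 as in the command above. The helper names `format_statsd` and `send_statsd` are made up for this example.

```python
import socket


def format_statsd(name: str, value, metric_type: str) -> bytes:
    """Build one StatsD datagram: <metricname>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(name, value, metric_type="c", host="127.0.0.1", port=8125) -> bytes:
    """Fire-and-forget: send the metric over UDP and return the payload sent."""
    payload = format_statsd(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Equivalent of: echo "foo:1|c" | nc -u -w0 127.0.0.1 8125
    send_statsd("foo", 1, "c")
```

Because UDP is connectionless, the call does not block or fail when no daemon is listening, which is part of what makes pushing metrics this way so cheap.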
Storing metrics : InfluxDB
❖ Make it behave like Graphite
❖ graphite-api
❖ carbon-relay-ng
❖ graphite-influxdb
❖ Cluster, cluster, cluster
❖ Designed for events and metrics
Storing messages : Elasticsearch
❖ Index database
❖ Cluster, cluster, cluster
❖ Full text search
Visualizing @ network level : Packetbeat
❖ Modified version of Kibana 3
❖ Dashboards ready out
of the box
Visualizing metrics : Grafana
❖ Compatible with
❖ Graphite
❖ InfluxDB
❖ OpenTSDB
❖ Built on Kibana 3
Visualizing messages : Kibana 4
❖ Easy install
❖ Interactive dashboards
❖ Multiple indices
What's missing ? Wishes
❖ Alerting
❖ External monitoring
❖ Repository for dashboards…
❖ Making sense of metrics and
messages
Alerting reboot
❖ Alert only on end user problems from an end
user perspective
❖ IRC, Chat channel…
❖ Alert thresholds based on history vs static
thresholds
❖ Statistics functions
❖ Boolean conditions
❖ Dynamic thresholds
❖ Anomaly detection
❖ Standard deviation
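As an illustration of a threshold based on history rather than a static value, here is a small Python sketch that flags a sample when it deviates from the recent mean by more than a chosen number of standard deviations. The function name, the default of three sigmas and the sample values are assumptions for the example, not something prescribed by the talk or by any of the tools listed.

```python
from statistics import mean, stdev


def is_anomalous(history, current, n_sigma=3.0):
    """Return True if `current` is more than n_sigma standard deviations from the mean of `history`."""
    if len(history) < 2:
        # Not enough history to estimate a deviation: do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return abs(current - mu) > n_sigma * sigma


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    recent = [100, 102, 98, 101, 99]
    print(is_anomalous(recent, 103))  # small deviation
    print(is_anomalous(recent, 150))  # large deviation
```

The threshold moves with the data: a noisy metric tolerates larger swings than a flat one, which a static warning/critical pair cannot express.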
Coming from Nagios
❖ Graphios will inject perfdata into Graphite or InfluxDB
❖ check_graphite can query the Graphite API from Nagios to alert based on
history
❖ Logstash will send events to NSCA
❖ Nagios logs in Kibana with the Grok pattern %{NAGIOSLOGLINE}
❖ Keep Nagios for states ?
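To give an idea of what injecting perfdata into Graphite involves, the Python sketch below turns a Nagios performance data string (`label=value[UOM];warn;crit;min;max`) into lines of the Graphite plaintext protocol (`path value timestamp`). It is a simplified illustration and not Graphios' actual code: the regular expression, the function name and the metric prefix are assumptions, and values using exponents or a leading dot are not handled.

```python
import re

# label (optionally single-quoted), "=", numeric value, optional unit of measure.
PERFDATA_RE = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<plain>[^\s=']+))"
    r"=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s]*)"
)


def perfdata_to_graphite(prefix, perfdata, timestamp):
    """Convert a perfdata string into Graphite plaintext lines (thresholds are dropped)."""
    lines = []
    for match in PERFDATA_RE.finditer(perfdata):
        label = match.group("quoted") or match.group("plain")
        # Graphite uses dots as path separators, so sanitize the label.
        safe_label = re.sub(r"[^A-Za-z0-9_-]", "_", label)
        lines.append(f"{prefix}.{safe_label} {match.group('value')} {timestamp}")
    return lines


if __name__ == "__main__":
    # Hypothetical check_ping output and prefix.
    perf = "rta=0.123ms;100.000;500.000;0; pl=0%;20;60;;"
    for line in perfdata_to_graphite("nagios.web01.ping", perf, 1400000000):
        print(line)
```

Each resulting line could then be written, newline-terminated, to a Carbon plaintext listener (port 2003 by default).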