Staying Sane with Nagios
DESCRIPTION
From an invited talk I gave at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out. In the years since, I've automated more, but I still have the same mindset about inheritance and so on.
TRANSCRIPT
Staying Sane with Nagios
Matt Simmons
@standaloneSA
http://www.standalone-sysadmin.com
Introduction & Outline
Confessions:
- I am not actually a Nagios expert
- I do actually LIKE Nagios
Outline:
- Global Sanity
- Small & Medium Shops
- Large Scale Shops
- Add Ons
- Warnings
- Additional Resources
I know what you're thinking...
Nagios?
Sane???
Unlikely!!!
Serenity Now!!!
Nagios? SANE?!?
Serenity Now!!!
Global Sanity
- Universal advice
  - Affects installations of all sizes
- Documentation
- Centralized Authentication
- Plugin Development
Global Sanity: Documentation
- Read the documentation
  - Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
  - Use 3_0 when searching
  - Bookmark the good ones
- Nagiosbook.org will soon be coming out with 3.x docs
  - http://www.nagiosbook.org/
Global Sanity: Central Auth
- Centralized Authentication
  - LDAP / AD with Apache (I use Likewise Open)
- Domain users -> Nagios Contacts
  - [email protected]
- Access to CGI interface
Global Sanity: Do Not Reinvent the Wheel...
- Nagios Exchange: http://exchange.nagios.org/
- Pros:
  - Nearly 2000 listings
  - >1600 plugins
- Cons:
  - Varying quality and reliability
  - Old, unmaintained, code rot, etc.
Global Sanity: ...unless you have to
- Writing your own Nagios plugins
- Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
- Extended Output
- Huge Community
- Any language you want
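To show how little a plugin needs, here is a minimal sketch in Python. It relies on the plugin conventions of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN and a single output line with optional performance data after a "|". The script name check_example.py, the "example" label and the argument order are made up for illustration.

```python
import sys

# Standard plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Return the plugin state for a value where higher is worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, warn, crit):
    """Build the state and one output line: status text, '|', performance data."""
    state = evaluate(value, warn, crit)
    text = f"{label.upper()} {STATE_NAMES[state]} - {label} is {value}"
    perfdata = f"{label}={value};{warn};{crit}"
    return state, f"{text} | {perfdata}"


def main(argv):
    """Hypothetical usage: check_example.py VALUE WARN CRIT"""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments: report UNKNOWN rather than guess.
        print("EXAMPLE UNKNOWN - usage: check_example.py VALUE WARN CRIT")
        return UNKNOWN
    state, line = format_output("example", value, warn, crit)
    print(line)
    return state

# When installed as a plugin, finish with: sys.exit(main(sys.argv))
```

Nagios reads the exit code for the state and the printed line for the status text, so the real work is in whatever produces the value.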
Small & Medium Shops
Not exclusively small or medium, just a non-automatic way of doing things
For people who:
- Manually edit / create entries in config files
- Don't use extensive 3rd party management software
- Have a small team of responsible admins
- Don't require large distributed monitoring networks
Configuration Sanity
When:
- Creating new configs
- Working with existing configs
- Testing
- Responding to events
Syntax Highlighting
This?
Syntax Highlighting
Or this?
Config File Hierarchy
- Default config is stupid.
- cfg_dir directive is key
  - Picks up *.cfg, recursively
- Hierarchy should resemble “real life”
- Allows for additional “group” security
- Use what makes sense to you and document it
Config File Hierarchy: Example
Output of “tree -d” on my Nagios objects directory
|-- commands
|-- computers
|   |-- groups
|   |-- linux
|   |   `-- services
|   `-- windows
|-- misc
`-- network
    |-- firewalls
    |-- links
    |-- routers
    `-- switches
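Since cfg_dir picks up every *.cfg file recursively, it can help to preview what a given tree would load. This is a rough Python sketch of that walk, not Nagios's own logic; the function name and the example paths are made up.

```python
from pathlib import Path


def config_files(cfg_dir):
    """List every *.cfg file under cfg_dir, descending into subdirectories.

    Paths are returned relative to cfg_dir, sorted, with forward slashes.
    """
    root = Path(cfg_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*.cfg")
        if path.is_file()
    )
```

Running it against an objects directory laid out like the tree above would list files such as computers/linux/web01.cfg, while ignoring anything that does not end in .cfg.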
Regular Expressions
- Not all regexes are created equal
- use_regexp_matching
  - Only when object names contain: * ?
- use_true_regexp_matching
  - 'man regex'
  - All object names
  - Caution: Unintended consequences
Better Object Formatting
This?
Better Object Formatting
Or this?
Revision Control
- CVS/SVN/git(?)
- Simple, maintainable, recoverable
- Self-documenting (if done correctly)
(ab)Use Inheritance
- Templates
  - register = 0
- Multiple Inheritance
  - Beware the spaghetti code
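To see why multiple inheritance turns into spaghetti, it helps to trace where a value comes from. The Python sketch below is a simplified model of template resolution, assuming local directives override inherited ones and that the leftmost template in a "use" list wins conflicts. It ignores additive (+) and null inheritance, and the template names are invented.

```python
# Hypothetical templates, written as dicts instead of "define host{...}" blocks.
TEMPLATES = {
    "generic-host": {"name": "generic-host", "register": "0",
                     "check_interval": "5", "notification_period": "24x7"},
    "linux-host": {"name": "linux-host", "register": "0", "use": "generic-host",
                   "check_command": "check-host-alive"},
    "after-hours": {"name": "after-hours", "register": "0",
                    "notification_period": "nonworkhours"},
    "dev-linux": {"name": "dev-linux", "register": "0",
                  "use": "linux-host,after-hours", "check_interval": "15"},
}

# Directives that describe the template itself and are not passed down.
NOT_INHERITED = ("use", "name", "register")


def resolve(name, templates, _seen=()):
    """Return the effective directives for a template, following its 'use' chain."""
    if name in _seen:
        raise ValueError(f"inheritance loop through {name!r}")
    definition = templates[name]
    parents = [p.strip() for p in definition.get("use", "").split(",") if p.strip()]
    effective = {}
    # Apply parents right to left so the leftmost template wins conflicts.
    for parent in reversed(parents):
        effective.update(resolve(parent, templates, _seen + (name,)))
    # Directives set locally override anything inherited.
    effective.update({k: v for k, v in definition.items() if k not in NOT_INHERITED})
    return effective
```

Resolving "dev-linux" here gives a 24x7 notification period, because linux-host (and through it generic-host) is listed before after-hours. That is exactly the kind of surprise that is hard to spot by eye.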
Use Hostgroups
define service{
    service_description    SSH Service Check
    check_command          check_ssh
    host_name              linux01, linux02, linux03, ... linux50
}
Use Hostgroups
define hostgroup{
    hostgroup_name         linuxservers
}
define host{
    use                    generichost
    host_name              linux01
    address                192.168.0.10
    hostgroups             linuxservers
}
define service{
    service_description    SSH service check
    check_command          check_ssh
    hostgroup_name         linuxservers
}
Script / Automate
- Automate as much as possible
  - New Hosts
  - New Services
  - Commands
- mkhost.sh as a template
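The mkhost.sh script itself is not reproduced in these slides, so the following is only a guess at the idea, written in Python: fill a host definition from a few arguments so every new host comes out formatted the same way. The function name, the default template "generichost" and the layout are assumptions based on the hostgroup example above.

```python
def make_host(host_name, address, hostgroups, template="generichost"):
    """Render a Nagios host definition as text.

    hostgroups is a list of hostgroup names; template is the 'use' value.
    """
    if not host_name or any(ch.isspace() for ch in host_name):
        raise ValueError("host_name must be non-empty and contain no whitespace")
    directives = [
        ("use", template),
        ("host_name", host_name),
        ("address", address),
        ("hostgroups", ",".join(hostgroups)),
    ]
    lines = ["define host{"]
    # Pad directive names so values line up in a readable column.
    lines += [f"    {key:<12}{value}" for key, value in directives]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Writing the result to a file such as computers/linux/linux01.cfg (a hypothetical path) and committing it keeps new hosts consistent with the directory hierarchy and revision control advice above.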
Use alternate contacts file when testing new features
- Coworkers are under enough stress as it is
- No messy explanations
- Use symlinks to point to “real” contacts file
Plugin Sanity
Thoughts about writing, configuring, and using Nagios plugins
SNMP
Use it whenever possible. Really.
NRPE vs check_by_ssh
- Nagios Remote Plugin Executor
- Skip it when possible
  - Use SNMP
NRPE
When checking disk usage
- Do not specify the partitions to check
- Instead, specify the partitions to NOT check
  - Too easy to forget to add new partitions.
- If possible, use a plugin that produces statistics for graphing usage trends
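The exclude-list approach can be sketched in a few lines of Python. This is an illustration of the logic only, not any particular plugin: usage figures are passed in as a mapping, and the function name, thresholds and mount points are made up. Everything not explicitly excluded is checked and also emitted as performance data for graphing.

```python
STATE_NAMES = ["OK", "WARNING", "CRITICAL"]


def check_disks(usage, exclude, warn=80, crit=90):
    """Check every filesystem in usage except those listed in exclude.

    usage maps mount point -> percent used. Returns (exit_code, output_line).
    """
    checked = {mount: pct for mount, pct in usage.items() if mount not in exclude}
    worst = 0
    problems = []
    for mount, pct in sorted(checked.items()):
        state = 2 if pct >= crit else 1 if pct >= warn else 0
        if state:
            problems.append(f"{mount} {pct}%")
        worst = max(worst, state)
    summary = ", ".join(problems) if problems else f"{len(checked)} filesystems within limits"
    # Performance data: one label=value;warn;crit entry per checked filesystem.
    perfdata = " ".join(f"{m}={pct}%;{warn};{crit}" for m, pct in sorted(checked.items()))
    return worst, f"DISK {STATE_NAMES[worst]} - {summary} | {perfdata}"
```

A partition added later shows up in the results without anyone touching the check definition, which is the whole point of excluding rather than including.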
Notification Sanity
Notifications suck. Here are some ways to make them not suck as much.
Alternate Communication Method
- When the network is down, email is down too
- Have a non-email contact method
  - SMS, cell modem, smoke signals
- Test it occasionally
Use parents
- Establish a path FROM THE NAGIOS SERVER
- Failure will trigger “unreachable” states
  - “u” notification flag
- Only useful for non-local-subnet hosts typically
  - If the local switch dies, alerts don't go out anyway
  - Typically
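The parent logic boils down to one question: can Nagios still reach at least one parent? Here is a simplified Python model of that decision, assuming a failed host is DOWN when some parent is still up and UNREACHABLE when none is. The host names and topology are invented.

```python
def classify(host, is_up, parents):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for a host.

    is_up maps host -> bool (did its own check succeed?).
    parents maps host -> list of parent hosts on the path from the Nagios server;
    hosts without an entry sit on the Nagios server's own segment.
    """
    if is_up[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing between Nagios and this host, so the failure is its own.
        return "DOWN"
    if any(classify(parent, is_up, parents) == "UP" for parent in host_parents):
        return "DOWN"
    # Every path to the host is broken somewhere upstream.
    return "UNREACHABLE"
```

With a chain of switch1 -> router1 -> web01 and router1 failing, router1 comes out DOWN and web01 comes out UNREACHABLE, so contacts without the “u” flag hear about the router only.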
Use Dependencies
- Available for both hosts and services
  - The disks didn't blow up, SNMP crashed
  - What do you mean, the website is unavailable when the database crashes?
- Dependencies != parents
  - Parents establish a line between the host and Nagios
  - Dependencies establish logical object relationships
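As a rough illustration of what a dependency buys you, the Python sketch below suppresses a notification when something the service depends on is itself not OK. Real dependency definitions have finer-grained failure criteria. This model, its function name and the host/service pairs are simplifications I made up.

```python
def should_notify(service, states, depends_on):
    """Return True unless a service this one depends on is itself not OK.

    service is a (host, description) pair, states maps such pairs to a state
    string, and depends_on maps a pair to the list of pairs it depends on.
    """
    masters = depends_on.get(service, [])
    return all(states[master] == "OK" for master in masters)
```

If ("web01", "Website") depends on ("db01", "MySQL") and both are CRITICAL, only the database alert goes out, which points at the actual problem.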
Notifications are Commands
- Use them
- Execute what you need, when you need, where you need through extra-Nagios scripts
- Your imagination is the limit
  - Electrical relays? Flashing lights? HALON release?
  - Please don't.
Use Passive Checks (when necessary / appropriate)
- For “normal” passive checks, specify freshness checks
- Useful for SNMP traps
  - Combine with snmptrapd
- Distributed Monitoring
  - Use for capacity reasons
  - Physical separation calls for separate Nagios installs (in my opinion)
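For the passive checks mentioned above, a result is handed to Nagios as a PROCESS_SERVICE_CHECK_RESULT line written to the external command file. A small Python sketch of building and submitting that line follows. The function names are mine, and the command file path varies by install, so it is a required argument rather than a default.

```python
import time


def passive_result(host, service, code, output, now=None):
    """Format one passive service check result as an external command line."""
    timestamp = int(time.time() if now is None else now)
    # Keep the result on a single line; the command file is line-oriented.
    text = " ".join(str(output).splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{text}\n"


def submit(line, command_file):
    """Append a command line to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A trap handler called from snmptrapd could use passive_result("sw1", "Link Trap", 2, "linkDown on port 4") and pass the result to submit(). The host, service and message here are invented.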
Macros GOOD
- 60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html
- On Demand Macros
  - Specify “remote” macros from other hosts
  - $HOSTMACRO:SOMEHOST$
- Custom Variable Macros
  - _MACADDRESS 00:01:02:03:04:05
  - $_HOSTMACADDRESS$
- Available as environment variables in scripts
  - $NAGIOS_MACRONAME
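Here is a small Python sketch of a notification script that pulls what it needs from the environment instead of a long argument list. It assumes environment macros are enabled and uses the NAGIOS_ prefix described above. The custom _MACADDRESS host variable should appear as NAGIOS__HOSTMACADDRESS, with a double underscore, as far as I know. The function name and message format are made up.

```python
import os


def build_message(env=None):
    """Build a one-line notification message from NAGIOS_* environment macros."""
    env = os.environ if env is None else env
    kind = env.get("NAGIOS_NOTIFICATIONTYPE", "UNKNOWN")
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    service = env.get("NAGIOS_SERVICEDESC")
    if service:
        target = f"{host}/{service}"
        state = env.get("NAGIOS_SERVICESTATE", "UNKNOWN")
    else:
        target = host
        state = env.get("NAGIOS_HOSTSTATE", "UNKNOWN")
    message = f"{kind}: {target} is {state}"
    # Custom host variable _MACADDRESS, if the host defines one.
    mac = env.get("NAGIOS__HOSTMACADDRESS")
    if mac:
        message += f" (MAC {mac})"
    return message
```

The resulting string can then be handed to whatever delivery method the contact uses, such as an SMS gateway.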
Use Flap Detection
- Or not. Who wants a charged cellphone battery?
- Measures state changes:
  - Weighted measure of the last 21 checks
  - More recent counts higher
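The weighting can be sketched like this in Python. The 21 stored results give 20 possible state changes, and each change is weighted linearly by how recent it is. The endpoint weights of 0.75 (oldest) and 1.25 (newest) are my recollection of what Nagios 3 uses, so treat them as an assumption. The function name is made up.

```python
def percent_state_change(states, low=0.75, high=1.25):
    """Weighted percent state change over a history of check results.

    states is ordered oldest to newest. Each transition between neighbouring
    results counts for a weight that rises linearly from low to high.
    """
    transitions = len(states) - 1
    if transitions < 1:
        return 0.0
    total = 0.0
    for i in range(transitions):
        if states[i] != states[i + 1]:
            if transitions > 1:
                weight = low + (high - low) * i / (transitions - 1)
            else:
                weight = 1.0
            total += weight
    return 100.0 * total / transitions
```

The result is compared against the configured flap thresholds. One recent change among 21 results scores 6.25% here, while the same change at the old end of the history scores 3.75%.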
Large Shops
Too many nodes to easily configure by hand, or too many nodes to deal with using one server
- Scaling Nagios
- Centralized Management
- Web Configurators
Scaling Nagios
- large_installation_tweaks
  - No summary macros, memory handling is different, and processes fork() less
- Distributed monitoring
  - Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
- Check tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html
Centralized Management
- Puppet / chef / cfengine / whatever
  - Distribute nagios user's key if necessary
  - Install Nagios agents (NSCA / NRPE)
  - Automate configuration build
- Puppet's built-in Nagios types sound convenient...sort of
Nagios Web Configuration
- Dozens, if not hundreds
- I don't know of a great one.
- May be worth building or finding one that matches your inventory system
  - Don't double-up on data if you don't have to
Malproductive Practices
- Overreliance on Event Handlers
  - Please don't do anything terribly important.
  - Edge cases are scary.
- Overabuse of inheritance
  - Spaghetti code
  - Hard to trace
- Overcomplication
  - Simple is nearly always better
Learn More
- Mailing list: Nagios Users
  - https://lists.sourceforge.net/lists/listinfo/nagios-users
- LinkedIn: Nagios Users
  - http://www.linkedin.com/groupAnswers?viewQuestions=&gid=131532&forumID=3&sik=1272591931152