Staying Sane with Nagios

Matt Simmons

@standaloneSA

[email protected]

http://www.standalone-sysadmin.com

Introduction & Outline

Confessions:

- I am not actually a Nagios expert
- I do actually LIKE Nagios

Outline:

- Global Sanity
- Small & Medium Shops
- Large Scale Shops
- Add Ons
- Warnings
- Additional Resources

I know what you're thinking...

Nagios?

Sane???

Unlikely!!!

Serenity Now!!!

Global Sanity

Universal advice that affects installations of all sizes:

- Documentation
- Centralized Authentication
- Plugin Development

Global Sanity: Documentation

- Read the documentation
- Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
- Use 3_0 when searching
- Bookmark the good ones
- Nagiosbook.org will soon be coming out with 3.x docs: http://www.nagiosbook.org/

Global Sanity: Central Auth

- Centralized authentication
  - LDAP / AD with Apache (I use Likewise Open)
- Domain users -> Nagios contacts
  - [email protected]
- Access to CGI interface

Global Sanity: Do Not Reinvent the Wheel...

- Nagios Exchange: http://exchange.nagios.org/
- Pros:
  - Nearly 2000 listings
  - >1600 plugins
- Cons:
  - Varying quality and reliability
  - Old, unmaintained, code rot, etc.

Global Sanity: ...unless you have to

- Writing your own Nagios plugins
- Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
- Extended output
- Huge community
- Any language you want

Small & Medium Shops

Not exclusively small or medium, just a non-automatic way of doing things

For people who:

- Manually edit / create entries in config files
- Don't use extensive 3rd party management software
- Have a small team of responsible admins
- Don't require large distributed monitoring networks

Configuration Sanity

When:

- Creating new configs
- Working with existing configs
- Testing
- Responding to events

Syntax Highlighting

This? Or this? (comparison screenshots not reproduced)

Config File Hierarchy

- Default config is stupid.
- The cfg_dir directive is key
  - Picks up *.cfg files, recursively
- Hierarchy should resemble “real life”
- Allows for additional “group” security
- Use what makes sense to you and document it
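As a rough sketch of the hierarchy idea, the main config file can point at a few directories instead of listing individual files. Only the cfg_dir directive comes from the slide; the paths and the site/role layout below are made up for illustration.

```cfg
# nagios.cfg (main config file). Directory layout is an assumption.
# Every *.cfg file under each cfg_dir is loaded, subdirectories included.
cfg_dir=/etc/nagios/objects/templates
cfg_dir=/etc/nagios/objects/commands
cfg_dir=/etc/nagios/objects/contacts

# "Real life" layout: one subtree per site, then per role, e.g.
#   /etc/nagios/objects/sites/nyc/linux/linux01.cfg
#   /etc/nagios/objects/sites/nyc/network/core-switch.cfg
cfg_dir=/etc/nagios/objects/sites
```

With a layout like this, ordinary filesystem permissions on each subtree are one way to get the "group" security the slide mentions.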

Regular Expressions

- Not all regexes are created equal
- use_regexp_matching
  - Only when object names contain: * ?
- use_true_regexp_matching
  - 'man regex'
  - All object names
  - Caution: unintended consequences
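A minimal sketch of the safer of the two modes, assuming hosts named linux01, linux02, and so on. The generic-service template name is an assumption; the two main-config directives are the ones named on the slide.

```cfg
# nagios.cfg: enable regex matching, but only for names containing * or ?
use_regexp_matching=1
use_true_regexp_matching=0

# Object file: the host_name value contains '*', so it is treated as a regex.
define service{
    use                   generic-service   ; hypothetical template name
    host_name             ^linux.*
    service_description   SSH Service Check
    check_command         check_ssh
    }
```

With use_true_regexp_matching=1, every name in such a directive would be treated as a regex, so a plain name like linux1 would also match linux10. That is the kind of unintended consequence the slide warns about.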

Better Object Formatting

This? Or this? (comparison screenshots not reproduced)

Revision Control

- CVS/SVN/git(?)
- Simple, maintainable, recoverable
- Self-documenting (if done correctly)

(ab)Use Inheritance

- Templates
  - register 0
- Multiple inheritance
- Beware the spaghetti code
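To illustrate the template chain, here is a sketch with two levels of inheritance. The template names and values are assumptions, and the definitions are trimmed for space, so required directives such as contacts and time periods are left out.

```cfg
# Base template: register 0 means "template only, not a real host".
define host{
    name                    generichost
    check_command           check-host-alive
    max_check_attempts      5
    notification_interval   60
    register                0
    }

# Second-level template (hypothetical) that adds Linux-specific settings.
define host{
    name                    linux-host
    use                     generichost
    hostgroups              linuxservers
    register                0
    }

# Real host: inherits from linux-host, which inherits from generichost.
# Multiple inheritance would look like "use linux-host,dmz-host";
# when both templates set the same directive, the first one listed wins.
define host{
    use                     linux-host
    host_name               linux01
    address                 192.168.0.10
    }
```

Two levels are easy to trace. Each extra level or extra parent makes it harder to tell where a value came from, which is where the spaghetti starts.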

Use Hostgroups

define service{
    service_description   SSH Service Check
    check_command         check_ssh
    host_name             linux01, linux02, linux03, ... linux50
    }

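Use Hostgroups

define hostgroup{
    hostgroup_name   linuxservers
    alias            Linux Servers    ; alias is required; this value is a placeholder
    }

define host{
    use          generichost
    host_name    linux01
    address      192.168.0.10
    hostgroups   linuxservers
    }

; One service definition now covers every member of the hostgroup.
define service{
    service_description   SSH service check
    check_command         check_ssh
    hostgroup_name        linuxservers
    }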

Script / Automate

- Automate as much as possible
  - New hosts
  - New services
  - Commands
- mkhost.sh as a template
- Use alternate contacts file when testing new features
  - Coworkers are under enough stress as it is
  - No messy explanations
  - Use symlinks to point to “real” contacts file

Plugin Sanity

Thoughts about writing, configuring, and using Nagios plugins

SNMP

Use it whenever possible. Really.

NRPE vs check_by_ssh

- NRPE: Nagios Remote Plugin Executor
- Skip it when possible
  - Use SNMP

When checking disk usage

- Do not specify the partitions to check
- Instead, specify the partitions to NOT check
  - Too easy to forget to add new partitions.
- If possible, use a plugin that produces statistics for graphing usage trends
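One way to express "check everything except" is with the exclusion flags of the standard check_disk plugin. This is a sketch: the thresholds, the excluded filesystem types, and the /mnt/scratch path are assumptions, and the flags are worth confirming against check_disk --help on your version.

```cfg
# No -p options, so every mounted filesystem is checked unless excluded.
#   -l  local filesystems only
#   -X  exclude a filesystem type (may be repeated)
#   -x  exclude one specific path
define command{
    command_name    check_all_disks
    command_line    $USER1$/check_disk -w 10% -c 5% -l -X tmpfs -X devtmpfs -x /mnt/scratch
    }
```

A partition added next month is checked automatically, with no config change to forget.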

Notification Sanity

Notifications suck. Here are some ways to make them not suck as much.

Alternate Communication Method

- When the network is down, email is down too
- Have a non-email contact method
  - SMS, cell modem, smoke signals
- Test it occasionally

Use parents

- Establish a path FROM THE NAGIOS SERVER
- Failure will trigger “unreachable” states
  - “u” notification flag
- Only useful for non-local-subnet hosts, typically
- If the local switch dies, alerts don't go out anyway
  - Typically
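A sketch of a parent chain, assuming a topology of Nagios server -> core-switch -> branch-router -> branch-web01. All host names and addresses are hypothetical, and core-switch would need its own host definition.

```cfg
define host{
    use                    generichost
    host_name              branch-router
    address                10.0.1.1
    parents                core-switch     ; next hop TOWARD the Nagios server
    }

define host{
    use                    generichost
    host_name              branch-web01
    address                10.0.1.20
    parents                branch-router
    notification_options   d,r             ; no "u": stay quiet if only the path is down
    }
```

If branch-router goes down, branch-web01 becomes UNREACHABLE rather than DOWN. Leaving "u" out of its notification_options means only the router generates a page. Adding "u" back is the choice to make if you do want to hear about unreachable hosts.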

Use Dependencies

- Available for both hosts and services
- The disks didn't blow up, SNMP crashed
- What do you mean, the website is unavailable when the database crashes
- Dependencies != parents
  - Parents establish a line between the host and Nagios
  - Dependencies establish logical object relationships
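The website/database case could be sketched as a service dependency like the one below. The host names and service descriptions are hypothetical.

```cfg
define servicedependency{
    host_name                       db01       ; master: the thing that breaks first
    service_description             MySQL
    dependent_host_name             web01      ; dependent: the thing that breaks as a result
    dependent_service_description   Website
    execution_failure_criteria      n          ; keep checking the website regardless
    notification_failure_criteria   w,u,c      ; but suppress its alerts while MySQL is not OK
    }
```

The same pattern fits the SNMP case: make the disk checks depend on a basic "is SNMP answering" service, so a crashed agent produces one alert instead of one per disk.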

Notifications are Commands

- Use them
- Execute what you need, when you need, where you need through extra-Nagios scripts
- Your imagination is the limit
  - Electrical relays?
  - Flashing lights?
  - HALON release?
    - Please don't.
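Because a notification is just a command definition, hooking in your own script might look like this. The script path and the contact are hypothetical. The 24x7 time period and notify-host-by-email command are assumed to exist, as they do in the sample config shipped with Nagios.

```cfg
# Hand the interesting macros to an external script (hypothetical path).
define command{
    command_name    notify-service-by-script
    command_line    /usr/local/bin/alert-hook.sh "$NOTIFICATIONTYPE$" "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$"
    }

# A contact whose service notifications run the script instead of sending mail.
define contact{
    contact_name                    oncall-hook
    alias                           On-call alert hook
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-script
    }
```

What the script does with its arguments (SMS gateway, chat message, flashing light) is up to you.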

Use Passive Checks (when necessary / appropriate)

- For “normal” passive checks, specify freshness checks
- Useful for SNMP traps
  - Combine with snmptrapd
- Distributed monitoring
  - Use for capacity reasons
  - Physical separation calls for separate Nagios installs (in my opinion)
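For the freshness point, a passive service might be sketched as follows. The host, service, and 26-hour threshold are assumptions, as is the generic-service template. check_dummy is the standard plugin that simply returns the state it is given.

```cfg
define service{
    use                      generic-service   ; hypothetical template
    host_name                backup01
    service_description      Nightly Backup
    active_checks_enabled    0
    passive_checks_enabled   1
    check_freshness          1
    freshness_threshold      93600             ; seconds (26 hours)
    check_command            stale_result      ; only run when the last result is too old
    }

# Forces a CRITICAL (state 2) when nothing has reported in time.
define command{
    command_name    stale_result
    command_line    $USER1$/check_dummy 2 "No passive result received within freshness threshold"
    }
```

Without the freshness check, a backup job that silently stops reporting would stay green forever.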

Macros GOOD

- 60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html
- On-demand macros
  - Specify “remote” macros from other hosts
  - $HOSTMACRO:SOMEHOST$
- Custom variable macros
  - _MACADDRESS 00:01:02:03:04:05
  - $_HOSTMACADDRESS$
- Available as environment variables in scripts
  - $NAGIOS_MACRONAME
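A short sketch of both macro types in use. The check_mac.sh plugin and the gateway01 host are hypothetical; the custom variable and its macro name are the ones from the slide.

```cfg
# Custom variable on a host: defined with a leading underscore,
# referenced elsewhere as $_HOSTMACADDRESS$.
define host{
    use           generichost
    host_name     linux01
    address       192.168.0.10
    _MACADDRESS   00:01:02:03:04:05
    }

# check_mac.sh is a made-up plugin, used only to show the macro being passed.
define command{
    command_name    check_mac
    command_line    $USER1$/check_mac.sh -H $HOSTADDRESS$ -m $_HOSTMACADDRESS$
    }

# On-demand macro: the address of gateway01, whatever host the check runs for.
define command{
    command_name    check_via_gateway
    command_line    $USER1$/check_ping -H $HOSTADDRESS:gateway01$ -w 100.0,20% -c 500.0,60%
    }
```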

Use Flap Detection

- Or not. Who wants a charged cellphone battery?
- Measures state changes:
  - Weighted measure of the last 21 checks
  - More recent counts higher
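Per-service flap detection could be configured along these lines. The threshold values are illustrative, the generic-service template is an assumption, and flap detection also has to be switched on globally (enable_flap_detection=1 in nagios.cfg).

```cfg
define service{
    use                      generic-service   ; hypothetical template
    host_name                linux01
    service_description      SSH Service Check
    check_command            check_ssh
    flap_detection_enabled   1
    flap_detection_options   o,w,c,u    ; which states count toward the calculation
    low_flap_threshold       5.0        ; percent state change: flapping stops below this
    high_flap_threshold      20.0       ; percent state change: flapping starts at or above this
    }
```

While a service is flagged as flapping, its normal problem and recovery notifications are suppressed, which is what saves the phone battery.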

Large Shops

Too many nodes to easily configure by hand, or too many nodes to deal with using one server

- Scaling Nagios
- Centralized Management
- Web Configurators

Scaling Nagios

- use_large_installation_tweaks
  - No summary macros, memory handling is different, and processes fork() less
- Distributed monitoring
  - Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
- Check tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html
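The relevant main-config settings, sketched for a two-tier setup. The split into "worker" and "central" servers and the submit_check_result command name are assumptions; that command would be defined separately to forward results via NSCA.

```cfg
# nagios.cfg on any large install
use_large_installation_tweaks=1

# nagios.cfg on a distributed (worker) server:
# after every service check, run a command that forwards the result.
obsess_over_services=1
ocsp_command=submit_check_result

# nagios.cfg on the central server:
# accept the forwarded results and notice when a worker goes quiet.
accept_passive_service_checks=1
check_service_freshness=1
```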

Centralized Management

- Puppet / Chef / CFEngine / whatever
  - Distribute nagios user's key if necessary
  - Install Nagios agents (NSCA / NRPE)
  - Automate configuration build
- Puppet's built-in Nagios types sound convenient... sort of

Nagios Web Configuration

- Dozens, if not hundreds
- I don't know of a great one.
- May be worth building or finding one that matches your inventory system
  - Don't double up on data if you don't have to

Malproductive Practices

- Overreliance on event handlers
  - Please don't do anything terribly important.
  - Edge cases are scary.
- Overabuse of inheritance
  - Spaghetti code
  - Hard to trace
- Overcomplication
  - Simple is nearly always better

Learn More

- Mailing list: Nagios Users
  - https://lists.sourceforge.net/lists/listinfo/nagios-users
- LinkedIn: Nagios Users
  - http://www.linkedin.com/groupAnswers?viewQuestions=&gid=131532&forumID=3&sik=1272591931152
