nagios conference 2014 - andy brist - nagios xi failover and ha solutions

Failover and High Availability Solutions for Nagios XI

Andy Brist

[email protected]

Introduction

Who am I?

Nagios Support Team Manager

Team Lead for Nagios-Plugins (github.com/nagios-plugins)

Disclaimer

Every environment is different

Failover/HA is, by nature, a customized solution

My case studies are not your production environments

I know Nagios/XI, not your SLA

Test in a lab. First.

Agenda

Short overview of the different failback/failover solutions

Nagios XI Data Locations and other files/services relevant to failover scenarios.

Snapback

Failback

Failover

HA? Failover

Observations, Considerations

Backup (snapback)

Restore VM snapshot or spin up a new instance and restore a backup

Most common implementation

Easiest of all options

Most potential downtime of all scenarios

Maximum historical and configuration data lost = the interval between snapshots

Requires manual intervention

Automated XI Backups

XI provides a method for scheduled backups through the "Scheduled Backups Component":

ssh

ftp

local fs

Useful for remote backups or manual failback

Failback

Failback

Secondary is periodically updated from an XI backup.

The nagios process is started by hand when the master has an issue.

Cronjob on the secondary restores newest backup once a day.

If unconcerned with historical data and mrtg performance data, just push/restore the object configs and sql dumps (if not offloaded)

Not to be confused with snapback as this is a separate, different instance/image, not just a previous state of the failed instance.
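A minimal sketch of the daily cron step on the secondary: pick the newest backup file and build the restore command. The file pattern and the restore script path are assumptions, not something the slides specify, so pass in whatever your backup job and XI install actually use.

```python
from pathlib import Path
from typing import List, Optional


def newest_backup(backup_dir: str, pattern: str = "*.tar.gz") -> Optional[Path]:
    """Return the most recently modified backup in backup_dir, or None.

    The "*.tar.gz" pattern is an assumption about how backups are named.
    """
    candidates = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Modification time decides which backup counts as "newest".
    return max(candidates, key=lambda p: p.stat().st_mtime)


def restore_argv(backup: Path, restore_script: str) -> List[str]:
    """Build the restore command line; the caller decides whether to run it.

    restore_script is a hypothetical path supplied by the caller.
    """
    return [restore_script, str(backup)]
```

Returning the argument list instead of running it keeps the sketch safe to try in a lab: print the command first, then hand it to subprocess.run once it looks right.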

Additional Considerations

Easy to implement with the Scheduled Backups XI component.

Agents must maintain 2+ allowed hosts

SNMP traps must be configured to push to 2+ hosts

May experience substantial downtime if the backup is large and the primary fails during a data restore on the secondary.

Failover

Difficult to get right

Demanding on i/o resources and network speed

Very little to no loss of historical data

Minimal downtime

Fully automated

Can provide minimal clustering for XI services through High Availability

Failover

Nagios XI

Object Configuration

Check Status

Object State

Program State

Historical State Data

Performance Data

Nagios XI - Services

nagios: Monitoring engine

mysql: Object configuration and ndo historical data

ndo2db: Writes historical data to mysql database

postgresql: Nagios XI settings/user database

npcd: Performance data daemon

crond: Task scheduler

httpd: Web server

XI Data and Redundancy

Absolute minimum redundant data required for any failover scenario:

(Working) Object configuration

Mysql 'nagiosql' database

Postgresql 'nagiosxi' database

Full Check Redundancy

Additional requirements for full check redundancy:

mrtg config and RRDs (for bandwidth checks)

nagios libexec folder (plugins)

Any additional dependencies for plugins. For example:

VMWare SDK

Oracle Perl Library

Java JRE

Runtime State Redundancy

Additional requirements for runtime state redundancy:

retention.dat (state, runtime options, acknowledgments, notification depth)

NDO mysql database "nagios"

Historical Redundancy

Additional data required for complete historical redundancy:

nagios.log and archives directory

perfdata RRDs

mrtg config and RRDs

NDO mysql database "nagios"

XI Data Summary

Logs/archives

Perfdata

Mrtg/configs

Databases

Object configs

Plugins

XI Data Summary

/usr/local/nagios/var/nagios.log

/usr/local/nagios/var/archives/

/usr/local/nagios/share/perfdata/

/var/lib/mrtg/

/etc/mrtg/

/var/lib/pgsql/

/var/lib/mysql/

/usr/local/nagios/etc/

/usr/local/nagios/libexec/

/usr/local/nagiosxi/
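The data locations above can be sketched as a lookup table that produces one rsync command per path. The grouping of paths into categories is my reading of the two summary slides, and the target host name is hypothetical. Note that copying live database directories with rsync can give an inconsistent copy, so for /var/lib/mysql/ and /var/lib/pgsql/ a SQL dump is likely the safer route.

```python
from typing import Dict, Iterable, List

# Category -> paths. The paths come from the slide; the grouping is an assumption.
XI_DATA: Dict[str, List[str]] = {
    "logs/archives": [
        "/usr/local/nagios/var/nagios.log",
        "/usr/local/nagios/var/archives/",
    ],
    "perfdata": ["/usr/local/nagios/share/perfdata/"],
    "mrtg/configs": ["/var/lib/mrtg/", "/etc/mrtg/"],
    "databases": ["/var/lib/pgsql/", "/var/lib/mysql/"],
    "object configs": ["/usr/local/nagios/etc/"],
    "plugins": ["/usr/local/nagios/libexec/"],
    "nagios xi": ["/usr/local/nagiosxi/"],
}


def rsync_commands(target_host: str, categories: Iterable[str]) -> List[List[str]]:
    """Build one rsync argument list per path in the chosen categories."""
    commands = []
    for category in categories:
        for path in XI_DATA[category]:  # KeyError on an unknown category
            # -a is archive mode; --relative recreates the full source path
            # under "/" on the target, so files land in the same location.
            commands.append(["rsync", "-a", "--relative", path, f"{target_host}:/"])
    return commands
```

For example, rsync_commands("standby", ["object configs", "plugins"]) covers the plugin side of full check redundancy without touching historical data.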

High Availability?

1. Elimination of single points of failure.

2. Reliable crossover/failover.

3. Detection of failures as they occur.

High Availability?

Why would you need it?

Least amount of downtime

(limited) Service clustering

Shared volumes solve the issues with syncing historical data in redundant configurations

High Availability/Failover

Major components:

Shared storage

Virtual IP

Management applications/scripts

Shared Storage

DRBD: Block-level replication, part of the Linux kernel, well supported and understood. Works well for all XI data types (including RRDs/DBs).

NFS: Fine option, just make sure the NFS share does not have an i/o latency issue or your checks WILL get behind. Do not mount the volume on more than one server at a time, to avoid writing multiple checks in the case of a partial failover.

Replicated DBs: Fine solution, clusters well. Use DNS or virtual IPs to control access to the databases.

rsync: Not immediate replication, but close. Easy to implement.

GlusterFS: More problematic to set up, but good for offloaded mrtg/RRDs.

DRBD

Active/passive suggested

Low latency storage

Active mount should move with the vip

Refer to Jeremy Rust's presentation notes for more information

Virtual IP

pacemaker vip script

Custom ifconfig/ip shell scripts

uCarp Scripts

keepalived
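For the custom ip script option, a minimal sketch of the command a failover script would issue to claim or release the virtual IP. The address and device name are placeholders; the function only builds the `ip addr` argument list so it can be inspected before anything is run as root.

```python
import ipaddress
from typing import List


def vip_command(action: str, vip_cidr: str, device: str) -> List[str]:
    """Build an `ip addr add|del` command for a virtual IP.

    vip_cidr is an address with prefix length, e.g. "192.0.2.10/24".
    """
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    # ip_interface raises ValueError on a malformed address or prefix.
    interface = ipaddress.ip_interface(vip_cidr)
    return ["ip", "addr", action, str(interface), "dev", device]
```

The node taking over would run the "add" form and the node giving up the role would run the "del" form. Neighbours may also need their ARP caches refreshed after a move, which this sketch does not handle.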

HA Failover Management

Pacemaker/Heartbeat (the HA stack)

uCarp scripts

keepalived scripts

Custom Scripts:

nagios itself: Event handler driven

cron: Job that checks the master for connectivity. Reuse the check_icmp or check_http plugins for this purpose.
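A rough sketch of the cron approach: run a plugin against the master and read its exit code using the standard plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Treating WARNING as "still up" and counting consecutive failures before acting are my assumptions, added so that a single dropped packet does not trigger a failover.

```python
import subprocess
from typing import List


def master_is_up(plugin_argv: List[str], timeout: float = 30.0) -> bool:
    """Run a Nagios plugin and report whether the master looks reachable."""
    try:
        result = subprocess.run(
            plugin_argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing plugin or a hung check: treat as a failed check.
        return False
    # 0 = OK, 1 = WARNING; anything else counts as down in this sketch.
    return result.returncode in (0, 1)


def next_failure_count(previous: int, is_up: bool) -> int:
    """Consecutive-failure counter: reset on success, increment on failure."""
    return 0 if is_up else previous + 1
```

A cron job could call master_is_up(["/usr/local/nagios/libexec/check_icmp", "-H", master_address]), keep the counter in a small state file, and only start the failover once the count passes a threshold you have tested in the lab.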

Extra Considerations

STONITH

Clustering?

DRBD/Shared Storage

High Latency HA

NDO/Databases

Recovery

STONITH

(shoot the other node in the head)

Mechanism by which a failing server is guaranteed to be removed from the cluster

Not required, but advised

Hardware (including ups) and software (vmware stonith device and shell scripts)

Only failing over when the primary is unreachable is safest

Beware of overzealous failover conditions as they can lead to a . .

Deathmatch!

No, really. Stonith gives your servers the ability to KILL THEMSELVES and FRIENDS

Beware of services whose init actions/failures should not cause failover/stonith

Any actions requiring a shared volume in active/passive mode should not immediately cause failover due to potential latency during volume mounts

Test, test, test the disaster scenarios in a LAB first or the fragfest may include your job!
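The cautions above can be sketched as a small decision function. It is one possible policy, not a recommendation: fail over only when the primary is unreachable, wait out a grace period after a volume mount, and otherwise alert a human rather than shoot a node. The class, the field names and the 120 second default are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    primary_reachable: bool
    failed_services: Tuple[str, ...]
    seconds_since_mount: float  # time since the shared volume was mounted


def decide(obs: Observation, mount_grace_seconds: float = 120.0) -> str:
    """Return "failover", "wait", "alert" or "ok" for one observation."""
    if not obs.primary_reachable:
        # The only condition that triggers failover in this sketch.
        return "failover"
    if obs.failed_services and obs.seconds_since_mount < mount_grace_seconds:
        # Services may still be waiting on the volume; do not react yet.
        return "wait"
    if obs.failed_services:
        # Reachable but unhealthy: notify a human instead of using stonith.
        return "alert"
    return "ok"
```

Keeping the policy in one pure function makes it easy to run through every disaster scenario in the lab before any stonith device is wired to it.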

Clustering/Fencing

A number of portions of Nagios Core and Nagios XI are clusterable. Processes that can potentially be clustered:

offloaded postgresql

offloaded mysql/ndo2db

offloaded mrtg

Services that are dependent on the core monitoring engine and filesystem and should not be clustered:

nagios, npcd, cronjobs

httpd

snmptrapd, snmptt

DUAL DRBD Primary

Disconnecting from the master before mounting of the shared volume during failover is no longer needed.

Careful implementation allows multiple servers to concurrently access the shared volume. Potentially useful for ambitious clusters and shared historical records.

Slower, as the secondary can lock blocks.

More prone to split-brains

Usually requires clustered file systems

High Latency HA

Problematic if the HA solution was not designed for potential high latency

Will potentially cause i/o wait issues

It may be better to push checks to a central server(s) with NRDP/outbound checks/etc, keeping HA solutions local, or to pay for a faster pipe.

DRBD Proxy: A good solution if high latency HA is a must. It uses an asynchronous buffer for block writes to the secondary volumes (does not support dual primary).

NDO Considerations

Enforce single ndo instance access to mysql

If multiple ndo processes connecting to a single ndo db is required, consider using ndo db instances

You can control ndo's access to the mysql server through iptables and the vip.

Offload ndo2db to the offloaded mysql server

Configure ndomod to connect through a TCP socket. This can potentially decrease load on the nagios server.
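As an illustration of the TCP socket setup, a small helper that renders the relevant ndomod.cfg lines. The directive names (output_type, output, tcp_port) and the 5668 default port are as I recall them from the sample ndomod.cfg, so treat them as assumptions and check them against the config shipped with your NDOUtils version. The host value is a placeholder.

```python
from typing import List


def ndomod_tcp_settings(ndo2db_host: str, tcp_port: int = 5668) -> List[str]:
    """Render ndomod.cfg lines that point ndomod at a remote ndo2db.

    Directive names are assumed from the sample ndomod.cfg; verify locally.
    """
    if not 0 < tcp_port < 65536:
        raise ValueError("tcp_port must be between 1 and 65535")
    return [
        "output_type=tcpsocket",   # instead of a local unix socket
        f"output={ndo2db_host}",   # address of the offloaded ndo2db (or the vip)
        f"tcp_port={tcp_port}",
    ]
```

Pointing output at the virtual IP rather than a fixed address would let the same ndomod.cfg work on whichever node is active.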

Database Considerations

Initiating failover due to crashed DBs may cause a deathmatch as all nodes will fail (due to their shared nature)

Offload both postgresql and mysql databases. Requires a virtual ip or careful management of DNS.

XI has scripts to repair the databases, use them!

Recovering from Failover

Degraded ex-primaries should not be added back to the cluster automatically. Doing so may cause split brains.

Split brains REQUIRE manual intervention if preservation of historical data is desired.

Stonith deathmatches: Have a primary image/instance without stonith enabled for recovery.

Maintain an ultimate disaster recovery server instance/image outside of the cluster pool for when all else has failed.

A Plea from Nagios Support

Failover/HA != backups

Test, test, TEST! Use your lab please.

Document. Everything. The biggest barrier and largest hurdle for support are unknown, undocumented, non-standard configurations. Failover/HA deployments definitely qualify.

Final Comparisons

Snapback: Easy. Slow recovery. Requires manual intervention. Highest potential historical loss.

Failback: Intermediate. Moderate recovery. Can be automated. Less historical loss.

Failover: Difficult. Fast recovery. Fully automated. Nearly no historical loss.

High Availability: Difficult. Fast recovery. Automated. Redundancy across WAN links. Limited clustering. Least potential downtime. Multiple potential issues with split-brain, stonith/deathmatches and latency, so take care and test each scenario.

Food for thought . . . .

HA in a federated model . . . . . . . .

Final Questions For You

How much of Nagios XI, or Core, can truly be set up to be "HA"? Do you care? :P

Do you need HA/failover, or will failback/snapback suffice?

Is the time trade off in your environment worth it?

Questions for Me?

Any questions?

(common/critical answers noted below for the sake of efficiency)

11 meters/sec (unladen European swallow)

42

The Prime Directive

3 Times

The Categorical Imperative/Pragmatism (choose 1)

No.*

Evasive Subjunctive

. . . Yes?

The End

Andy Brist

[email protected]
