Extensible Monitoring with Nagios and Messaging Middleware

LISA 2012
Jonathan Reams <[email protected]>

Post on 25-Feb-2016





Symon Says Nagios Project
- Replace 12-year-old home grown monitoring system
- Very customized
- Very engineered
- Very unsupported
- ~17,000 checks
- Mandate to move to Nagios

False Start
- Installed Nagios
- Ported checks from old system to new
- Went out for coffee

Problems
- High check latency
- High load

Stock Nagios

Nagios Problems
- Trapped on one host:
  - Check results
  - Status data
  - Configuration data
- Nagios isn't a great executor
  - Forks 2 processes per check
  - Everything is basically synchronous; async is achieved with multiple processes
- Data format is simple but non-standard
- Implementation is all in C; hard to customize
- Can be I/O bound by reading/writing check result files
- Cannot query data from status file/configuration without reading/parsing all of it
- Input via FIFO gives no feedback and has a limited buffer size
- Communication is hard!

My Solution

NagMQ

A ZeroMQ-based API for Nagios

A Nagios Event Broker plugin that implements a ZeroMQ-based API for Nagios

Background on ZeroMQ
- Broker-less messaging kernel in a single library
- Emulates Berkeley socket API
- Supports IPC/TCP/multicast transports
- Fanout, pub/sub, pipeline, and request/reply messaging patterns
- All I/O is asynchronous after connections are established, with dedicated I/O threads
- Bindings available for a large number of operating systems and languages
- Agnostic of data being sent: no defined data format
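In the pub/sub pattern, a ZeroMQ subscriber selects messages by byte prefix of the first message frame. The sketch below imitates that matching rule in plain Python, without the ZeroMQ library, to show how a tool might pick out only some event types. The function name `matches_subscription` and the sample topics are illustrative assumptions, not part of NagMQ.

```python
from typing import Iterable


def matches_subscription(topic: bytes, subscriptions: Iterable[bytes]) -> bool:
    """Return True if the topic starts with any subscribed prefix.

    Imitates the prefix rule of a ZeroMQ SUB socket: an empty prefix
    matches every message, and having no subscriptions matches nothing.
    """
    return any(topic.startswith(prefix) for prefix in subscriptions)


# Hypothetical topic frames, used only for illustration.
topics = [
    b"host_check_processed localhost",
    b"service_check_processed localhost rotate-unix",
    b"program_status",
]

# A tool interested only in host check results and heartbeats.
subscriptions = [b"host_check_processed", b"program_status"]
selected = [t for t in topics if matches_subscription(t, subscriptions)]
```

Because matching is by prefix, subscribing to `b"host_check"` would also select any other topic that begins with those bytes.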


- All interfaces are optional
- Comes with message queue broker for advanced messaging patterns

Event Publisher & Commands

Host check result from publisher:

    host_check_processed localhost
    {
      "host_name": "localhost",
      "check_type": 0,
      "check_options": 0,
      "scheduled_check": 1,
      "reschedule_check": 1,
      "current_attempt": 1,
      "max_attempts": 1,
      "state": 0,
      "last_state": 0,
      "last_hard_state": 0,
      "last_check": 1354996955,
      "last_state_change": 1337098090,
      "latency": 1.63600,
      "timeout": 60,
      "type": "host_check_processed",
      "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
      "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
      "early_timeout": 0,
      "execution_time": 0.07324,
      "return_code": 0,
      "output": "Host up",
      "long_output": null,
      "perf_data": null,
      "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 }
    }

Command to add an acknowledgement to service problem:

    {'comment_data': 'Stop alerting me!!',
     'notify_contacts': False,
     'author_name': 'jreams',
     'persistent_comment': False,
     'host_name': 'localhost',
     'service_description': 'rotate-unix',
     'time_stamp': {'tv_sec': 1355074576},
     'type': 'acknowledgement'}
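The two messages above can be handled with nothing more than a JSON library. The following is a minimal sketch that decodes a published event and builds an acknowledgement command with the same fields as the slide. It assumes the event arrives as a topic string followed by a JSON payload, as the slide prints it. The helper names (`split_event`, `elapsed_seconds`, `make_acknowledgement`) are hypothetical and are not part of NagMQ.

```python
import json
import time


def split_event(topic_frame: str, payload_frame: str):
    """Split "type target" from the topic and decode the JSON payload."""
    event_type, _, target = topic_frame.partition(" ")
    return event_type, target, json.loads(payload_frame)


def elapsed_seconds(payload: dict) -> float:
    """Wall-clock span between the start_time and end_time timevals."""
    start, end = payload["start_time"], payload["end_time"]
    return (end["tv_sec"] - start["tv_sec"]) + (end["tv_usec"] - start["tv_usec"]) / 1e6


def make_acknowledgement(host, service, author, comment, now=None):
    """Build a command dict with the fields shown on the slide."""
    return {
        "type": "acknowledgement",
        "host_name": host,
        "service_description": service,
        "author_name": author,
        "comment_data": comment,
        "notify_contacts": False,
        "persistent_comment": False,
        "time_stamp": {"tv_sec": int(time.time() if now is None else now)},
    }


# Abbreviated copy of the host check result from the slide.
sample_payload = json.dumps({
    "host_name": "localhost",
    "type": "host_check_processed",
    "start_time": {"tv_sec": 1354996955, "tv_usec": 636453},
    "end_time": {"tv_sec": 1354996964, "tv_usec": 161965},
    "return_code": 0,
    "output": "Host up",
})

event_type, target, payload = split_event("host_check_processed localhost", sample_payload)
span = elapsed_seconds(payload)
ack = make_acknowledgement("localhost", "rotate-unix", "jreams",
                           "Stop alerting me!!", now=1355074576)
wire_form = json.dumps(ack)  # what a client would hand to its socket library
```

Keeping the payload as a plain dict until the last step means the same builder works from a command-line tool, a web handler, or a test.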

State Data

Request:

    {'keys': ['host_name', 'services', 'hosts', 'service_description',
              'current_state', 'members', 'type', 'name',
              'problem_has_been_acknowledged', 'plugin_output',
              'checks_enabled', 'notifications_enabled',
              'event_handler_enabled'],
     'include_services': True,
     'host_name': 'localhost'}

Response:

    [{'checks_enabled': True,
      'notifications_enabled': True,
      'current_state': 0,
      'plugin_output': 'Host up',
      'problem_has_been_acknowledged': 0,
      'event_handler_enabled': True,
      'host_name': 'localhost',
      'services': ['rotate-unix'],
      'type': 'host'},
     {'checks_enabled': False,
      'notifications_enabled': True,
      'current_state': 1,
      'plugin_output': 'You are now on call',
      'problem_has_been_acknowledged': False,
      'event_handler_enabled': True,
      'host_name': 'localhost',
      'service_description': 'rotate-unix',
      'type': 'service'}]

Some examples
- Distributed check execution (mqexec)
- Custom user interfaces (nag.py, etc.)
- High availability (haagent.py, halib.py)
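As a rough illustration of working with the state interface, the sketch below builds a request like the one shown under State Data and picks unacknowledged problems out of a response. It uses only the field names that appear on the slide. The helper names and the choice to serialize the request as JSON are assumptions made for the example.

```python
import json


def build_state_request(host_name, keys, include_services=True):
    """Build a state request dict limited to the listed keys."""
    return {
        "host_name": host_name,
        "include_services": include_services,
        "keys": list(keys),
    }


def unacknowledged_problems(response):
    """Return service records that are not OK and not yet acknowledged."""
    return [
        obj for obj in response
        if obj.get("type") == "service"
        and obj.get("current_state", 0) != 0
        and not obj.get("problem_has_been_acknowledged")
    ]


request = build_state_request(
    "localhost",
    ["host_name", "service_description", "current_state", "type",
     "problem_has_been_acknowledged", "plugin_output"],
)
request_json = json.dumps(request)

# Trimmed copy of the response shown on the slide.
response = [
    {"type": "host", "host_name": "localhost", "current_state": 0,
     "problem_has_been_acknowledged": 0, "plugin_output": "Host up"},
    {"type": "service", "host_name": "localhost",
     "service_description": "rotate-unix", "current_state": 1,
     "problem_has_been_acknowledged": False,
     "plugin_output": "You are now on call"},
]

problems = unacknowledged_problems(response)
```

Asking only for the keys a tool needs keeps responses small, which matters when a query covers thousands of services.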

mqexec
- Asynchronous command executor
- Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes the command line specified
- Can filter which commands to execute based on any attribute in the message
- Receives messages as:
  - Fair-queued worker pool (pull from MQ broker)
  - Individual worker (subscribe directly to NagMQ)
- Sends results back to the command interface of NagMQ
- In production, we have 2 hosts with 10 mqexec workers, both active at all times

Performance: Stock Nagios

Nagios was configured with 20,000 service checks across 2,000 hosts with a dummy check script. The check interval for services was every 3 minutes with a 1-minute retry interval; services were configured for 2 check attempts to determine a hard state, and hosts were only checked once. Each test ran for 20 minutes, and Nagios started each time with no state from any previous tests.
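Returning to the mqexec worker described above, its core loop (filter on message attributes, run the command line, build a result to send back) can be sketched in a few lines of Python. This is not the real mqexec, which is a separate program. The `command_line` field, the result `type` value, and the function names are assumptions made for illustration, and the transport is left out entirely.

```python
import shlex
import subprocess
import sys
import time


def wants(message: dict, filters: dict) -> bool:
    """True when every filter attribute equals the value in the message."""
    return all(message.get(key) == value for key, value in filters.items())


def execute_check(message: dict, timeout: float = 60.0) -> dict:
    """Run the message's command line and build a result dict."""
    start = time.time()
    try:
        proc = subprocess.run(
            shlex.split(message["command_line"]),
            capture_output=True, text=True, timeout=timeout,
        )
        return_code, output = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # Reporting a timeout as 3 (UNKNOWN) is a choice made for this sketch.
        return_code, output = 3, "(check timed out)"
    return {
        "type": "service_check_processed",  # assumed result type
        "host_name": message["host_name"],
        "service_description": message.get("service_description"),
        "return_code": return_code,
        "output": output,
        "start_time": {"tv_sec": int(start)},
        "end_time": {"tv_sec": int(time.time())},
    }


# Hypothetical initiate message with a dummy check that prints one line.
sample = {
    "type": "service_check_initiate",
    "host_name": "localhost",
    "service_description": "rotate-unix",
    "command_line": f"{shlex.quote(sys.executable)} -c \"print('OK - dummy check')\"",
}

result = execute_check(sample) if wants(sample, {"host_name": "localhost"}) else None
```

A worker built this way holds no state of its own, so adding capacity is a matter of starting more workers.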

Performance: NagMQ/mqexec

User Interfaces
- Command-line:

      $ nag.py -c 'Stop alerting me!!' add ack localhost
      [localhost]: No problem found
      [[email protected]]: Acknowledgement added

- Python/JavaScript/Twitter Bootstrap web interface using NagMQ (see demo)
- Interface to Twitter
- Ssnp web interface in a weekend
- Twitter interface in 20 minutes

High Availability - Stock Nagios

- High availability not built into the product

High Availability - NagMQ
- Use regular program_status to provide heartbeat
- Retrieve active state from state interface to bring passive node into sync with active node on startup
- Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface
- Passive host's mqexec(s) run checks for whatever host is active
- Use VIFs owned by the message broker to direct traffic to active host

Why not use one of these?
- LiveStatus: live state query module with check execution workers
- Mod_gearman: distributed check execution based on Gearman job queue
- Merlin: database/distributed backend for Nagios
- NDOUtils: database backend for Nagios
- NSCA: allows check/command submission over network
- NRPE: remote check executor

API not a product
- NagMQ is just an interface into Nagios, not a product
- Better communication with clients comes from the larger ZeroMQ project, leaving NagMQ to focus on Nagios
- Implement ad-hoc tools for Nagios without having to write any compiled code
- Doing expensive data processing of monitoring data doesn't have to create latency in the monitoring system
- Re-use one interface for many tools

Future Work
- Pluggable authentication/encryption for NagMQ
- Pluggable parser/emitter for custom data formats (XML, YAML, etc.)
- NDOUtils database replacement
- More user interfaces (Jabber, SMS, email gateway, REST API)
- Nagios 4

NagMQ
https://github.com/jbreams/nagmq
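The heartbeat idea from the High Availability slide, where a passive node watches for regular program_status messages from the active node, might look like the following in outline. This is a sketch under stated assumptions and not the logic of haagent.py. The class name, the timeout value, and the use of explicit timestamps are all illustrative.

```python
class HeartbeatMonitor:
    """Track when the active node was last heard from."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # seconds of silence tolerated
        self.last_seen = None   # time of the most recent heartbeat

    def record(self, now: float) -> None:
        """Call whenever a program_status message arrives."""
        self.last_seen = now

    def active_is_alive(self, now: float) -> bool:
        """False if no heartbeat was ever seen or the last one is too old."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout


# Timestamps are passed in explicitly so the behaviour is easy to follow.
monitor = HeartbeatMonitor(timeout=30.0)
before_any = monitor.active_is_alive(now=0.0)
monitor.record(now=100.0)
within_window = monitor.active_is_alive(now=120.0)
after_window = monitor.active_is_alive(now=131.0)
```

A real agent would likely wait for a grace period at startup before treating silence as a failure, and would take over only after syncing state as described on the slide.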

Jonathan [email protected]