Transcript
Page 1: Extensible Monitoring with  Nagios  and Messaging Middleware

Extensible Monitoring with Nagios and Messaging Middleware
LISA 2012
Jonathan Reams <[email protected]>

Page 2: Extensible Monitoring with  Nagios  and Messaging Middleware

Symon Says Nagios Project
• Replace 12-year-old home grown monitoring system
– Very customized
– Very engineered
– Very unsupported
• ~17,000 checks
• Mandate to move to Nagios

Page 3: Extensible Monitoring with  Nagios  and Messaging Middleware

False Start

1. Installed Nagios
2. Ported checks from old system to new
3. Went out for coffee
4. Problems
a. High check latency
b. High load

Page 4: Extensible Monitoring with  Nagios  and Messaging Middleware

Stock Nagios

Page 5: Extensible Monitoring with  Nagios  and Messaging Middleware

Nagios Problems
• Trapped on one host:
– Check results
– Status data
– Configuration data
• Nagios isn’t a great executor
– Forks 2 processes per check
– Everything is basically synchronous – async achieved with multiple processes
• Data format is simple but non-standard

Page 6: Extensible Monitoring with  Nagios  and Messaging Middleware

Nagios Problems

• Implementation is all in C – hard to customize
• Can be I/O bound by reading/writing check result files
• Cannot query data from status file/configuration without reading/parsing all of it
• Input via FIFO gives no feedback and has a limited buffer size

Page 7: Extensible Monitoring with  Nagios  and Messaging Middleware

Nagios Problems

Communication is hard!

Page 8: Extensible Monitoring with  Nagios  and Messaging Middleware

My Solution

NagMQ

A ZeroMQ-based API for Nagios

Page 9: Extensible Monitoring with  Nagios  and Messaging Middleware

Background on ZeroMQ

• Broker-less messaging kernel in a single library
• Emulates Berkeley socket API
• Supports IPC/TCP/Multicast transports
• Fanout, pub/sub, pipeline, and request/reply messaging patterns
• All I/O is asynchronous after connections are established with dedicated I/O threads
• Bindings available for a large number of operating systems and languages
• Agnostic of data being sent – no defined data format

Page 10: Extensible Monitoring with  Nagios  and Messaging Middleware

NagMQ

Page 11: Extensible Monitoring with  Nagios  and Messaging Middleware

Event Publisher & Commands

Host check result from publisher

host_check_processed localhost
{
  "host_name": "localhost",
  "check_type": 0,
  "check_options": 0,
  "scheduled_check": 1,
  "reschedule_check": 1,
  "current_attempt": 1,
  "max_attempts": 1,
  "state": 0,
  "last_state": 0,
  "last_hard_state": 0,
  "last_check": 1354996955,
  "last_state_change": 1337098090,
  "latency": 1.63600,
  "timeout": 60,
  "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "early_timeout": 0,
  "execution_time": 0.07324,
  "return_code": 0,
  "output": "Host up",
  "long_output": null,
  "perf_data": null,
  "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 }
}
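A minimal sketch of how a subscriber might handle the event above, assuming the topic line and the JSON payload arrive as two separate strings (the slide shows them on adjacent lines but does not say how they are framed). The helper names parse_event, timeval_to_seconds, and elapsed_seconds are made up for illustration and are not part of NagMQ.

```python
import json

# Trimmed copy of the slide's example: a topic line and a JSON payload.
raw_topic = "host_check_processed localhost"
raw_payload = """{
  "host_name": "localhost", "type": "host_check_processed",
  "state": 0, "return_code": 0, "output": "Host up",
  "start_time": {"tv_sec": 1354996955, "tv_usec": 636453},
  "end_time": {"tv_sec": 1354996964, "tv_usec": 161965}
}"""


def timeval_to_seconds(tv):
    """Convert a {"tv_sec": ..., "tv_usec": ...} object to float seconds."""
    return tv["tv_sec"] + tv.get("tv_usec", 0) / 1_000_000


def parse_event(topic, payload):
    """Split the topic into event type and target, and decode the payload.

    Assumption: the first word of the topic repeats the payload's "type".
    """
    event_type, _, target = topic.partition(" ")
    event = json.loads(payload)
    if event.get("type") != event_type:
        raise ValueError("topic and payload disagree on the event type")
    return event_type, target, event


def elapsed_seconds(event):
    """Wall-clock time between the start_time and end_time fields."""
    return timeval_to_seconds(event["end_time"]) - timeval_to_seconds(event["start_time"])


event_type, target, event = parse_event(raw_topic, raw_payload)
print(event_type, target, event["output"], round(elapsed_seconds(event), 3))
```

Because the payload is ordinary JSON, any language with a JSON parser could consume the same event without touching Nagios itself.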

Command to add an acknowledgement to service problem

{'comment_data': 'Stop alerting me!!',
 'notify_contacts': False,
 'author_name': 'jreams',
 'persistent_comment': False,
 'host_name': 'localhost',
 'service_description': 'rotate-unix',
 'time_stamp': {'tv_sec': 1355074576},
 'type': 'acknowledgement'}
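The acknowledgement command above could be assembled along these lines. This is only a sketch: build_ack is a hypothetical helper, and serializing to JSON is an assumption based on the event example (the slide shows the command as a Python dict and does not show the wire encoding).

```python
import json
import time


def build_ack(host_name, service_description, author_name, comment_data, now=None):
    """Build an acknowledgement command with the fields shown on the slide."""
    timestamp = int(time.time() if now is None else now)
    return {
        "type": "acknowledgement",
        "host_name": host_name,
        "service_description": service_description,
        "author_name": author_name,
        "comment_data": comment_data,
        "notify_contacts": False,
        "persistent_comment": False,
        "time_stamp": {"tv_sec": timestamp},
    }


# Reproduce the slide's command, pinning the timestamp so the output is stable.
command = build_ack("localhost", "rotate-unix", "jreams",
                    "Stop alerting me!!", now=1355074576)
wire = json.dumps(command)  # assumption: commands are sent as JSON text
print(wire)
```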

Page 12: Extensible Monitoring with  Nagios  and Messaging Middleware

State Data

Request

{'keys': ['host_name', 'services', 'hosts', 'service_description',
          'current_state', 'members', 'type', 'name',
          'problem_has_been_acknowledged', 'plugin_output',
          'checks_enabled', 'notifications_enabled',
          'event_handler_enabled'],
 'include_services': True,
 'host_name': 'localhost'}

Response

[{'checks_enabled': True,
  'notifications_enabled': True,
  'current_state': 0,
  'plugin_output': 'Host up',
  'problem_has_been_acknowledged': 0,
  'event_handler_enabled': True,
  'host_name': 'localhost',
  'services': ['rotate-unix'],
  'type': 'host'},
 {'checks_enabled': False,
  'notifications_enabled': True,
  'current_state': 1,
  'plugin_output': 'You are now on call',
  'problem_has_been_acknowledged': False,
  'event_handler_enabled': True,
  'host_name': 'localhost',
  'service_description': 'rotate-unix',
  'type': 'service'}]
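As a rough illustration of using the state interface, the sketch below builds a request like the one above and picks unacknowledged problems out of a response. The function names are hypothetical, the response is a trimmed copy of the slide's example, and the transport (sending the request and receiving the reply) is left out entirely.

```python
def build_state_request(host_name, keys, include_services=True):
    """Build a state query asking only for the listed keys."""
    return {"host_name": host_name,
            "include_services": include_services,
            "keys": list(keys)}


def label(record):
    """Name a record as service@host, or just the host for host records."""
    if record["type"] == "service":
        return f'{record["service_description"]}@{record["host_name"]}'
    return record["host_name"]


def unacknowledged_problems(records):
    """Records whose state is non-zero and that nobody has acknowledged."""
    return [r for r in records
            if r.get("current_state", 0) != 0
            and not r.get("problem_has_been_acknowledged")]


request = build_state_request(
    "localhost",
    ["host_name", "service_description", "current_state", "type",
     "problem_has_been_acknowledged", "plugin_output"])

# Trimmed copy of the slide's response.
response = [
    {"type": "host", "host_name": "localhost", "current_state": 0,
     "plugin_output": "Host up", "problem_has_been_acknowledged": 0},
    {"type": "service", "host_name": "localhost",
     "service_description": "rotate-unix", "current_state": 1,
     "plugin_output": "You are now on call",
     "problem_has_been_acknowledged": False},
]

for record in unacknowledged_problems(response):
    print(label(record), "-", record["plugin_output"])
```

Asking only for the keys a tool needs keeps replies small, which is the contrast with parsing the whole status file.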

Page 13: Extensible Monitoring with  Nagios  and Messaging Middleware

Some examples
• Distributed check execution (mqexec)
• Custom user interfaces (nag.py, etc)
• High availability (haagent.py, halib.py)

Page 14: Extensible Monitoring with  Nagios  and Messaging Middleware

mqexec

Page 15: Extensible Monitoring with  Nagios  and Messaging Middleware

mqexec
• Asynchronous command executor
• Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes the command line specified in each message
• Can filter which commands to execute based on any attribute in message
• Receives messages as
– Fair-queued worker pool (pull from MQ broker)
– Individual worker (subscribe directly to NagMQ)
• Sends results back to command interface of NagMQ
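The core loop described above (filter, execute, report) might look roughly like the sketch below. It is not mqexec's code: it runs one command synchronously, the "command_line" field name and the shape of the result are assumptions, and reporting UNKNOWN (3) on timeout is a choice made for this sketch. Receiving messages and sending the result to the command interface are omitted.

```python
import shlex
import subprocess
import time

HANDLED_TYPES = {"host_check_initiate", "service_check_initiate",
                 "event_handler_start"}


def matches(message, filters):
    """True when every filter key is in the message with the given value."""
    return all(message.get(key) == value for key, value in filters.items())


def to_timeval(seconds):
    """Convert float seconds to the tv_sec/tv_usec form used in the messages."""
    whole = int(seconds)
    return {"tv_sec": whole, "tv_usec": int((seconds - whole) * 1_000_000)}


def execute(message, filters=None, timeout=60):
    """Run the message's command line and return a result dict, or None if skipped."""
    if message.get("type") not in HANDLED_TYPES:
        return None
    if filters and not matches(message, filters):
        return None
    start = time.time()
    try:
        # Assumption: the command line arrives in a "command_line" field.
        proc = subprocess.run(shlex.split(message["command_line"]),
                              capture_output=True, text=True, timeout=timeout)
        return_code, output, early_timeout = proc.returncode, proc.stdout.strip(), 0
    except subprocess.TimeoutExpired:
        return_code, output, early_timeout = 3, "(check timed out)", 1
    result = {
        "host_name": message.get("host_name"),
        "return_code": return_code,
        "output": output,
        "early_timeout": early_timeout,
        "start_time": to_timeval(start),
        "end_time": to_timeval(time.time()),
    }
    if "service_description" in message:
        result["service_description"] = message["service_description"]
    return result
```

A filter such as {"host_name": "localhost"} would let one worker handle only the checks for a single host while other workers take the rest.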

Page 16: Extensible Monitoring with  Nagios  and Messaging Middleware

Performance: Stock Nagios

[Chart: latency in seconds (0 to 18) against time in minutes (1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

Page 17: Extensible Monitoring with  Nagios  and Messaging Middleware

Performance: NagMQ/mqexec

[Chart: latency in seconds (0 to 18) against time in minutes (1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

Page 18: Extensible Monitoring with  Nagios  and Messaging Middleware

User Interfaces
• Command-line
$ nag.py -c 'Stop alerting me!!' add ack localhost
[localhost]: No problem found
[uptime@localhost]: Acknowledgement added
• Python/Javascript/Twitter Bootstrap web interface using NagMQ (see demo)
• Interface to Twitter

Page 19: Extensible Monitoring with  Nagios  and Messaging Middleware

High Availability – Stock Nagios

Page 20: Extensible Monitoring with  Nagios  and Messaging Middleware

High Availability - NagMQ

Page 21: Extensible Monitoring with  Nagios  and Messaging Middleware

High Availability - NagMQ
• Use regular program_status to provide heartbeat
• Retrieve active state from state interface to bring passive node into sync with active node on startup
• Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface
• Passive host’s mqexec(s) run checks for whatever host is active
• Use VIFs owned by the message broker to direct traffic to active host
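The heartbeat idea in the first bullet can be sketched as a small timer on the passive node. This is an illustration only, not how haagent.py works: the HeartbeatMonitor class, the assumption that heartbeats arrive as messages with "type" set to "program_status", and the timeout value are all hypothetical.

```python
class HeartbeatMonitor:
    """Track program_status heartbeats from the active node."""

    def __init__(self, timeout_seconds, started_at):
        self.timeout_seconds = timeout_seconds
        self.started_at = started_at  # when the passive node began watching
        self.last_seen = None         # time of the most recent heartbeat

    def record(self, message, now):
        """Note a heartbeat; messages of other types are ignored."""
        if message.get("type") == "program_status":
            self.last_seen = now

    def should_take_over(self, now):
        """True once no heartbeat has arrived for longer than the timeout.

        Before the first heartbeat, the clock runs from startup so a
        freshly started passive node does not take over immediately.
        """
        reference = self.started_at if self.last_seen is None else self.last_seen
        return (now - reference) > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=30, started_at=1000)
monitor.record({"type": "program_status"}, now=1010)
print(monitor.should_take_over(now=1035))  # 25 s since the last heartbeat
print(monitor.should_take_over(now=1041))  # 31 s since the last heartbeat
```

Passing the clock in as an argument keeps the logic easy to exercise without waiting for real timeouts.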

Page 22: Extensible Monitoring with  Nagios  and Messaging Middleware

Why not use one of these?
• LiveStatus – live state query module with check execution workers
• Mod_gearman – distributed check execution based on gearman job queue
• Merlin – database/distributed backend for Nagios
• Ndoutils – database backend for Nagios
• NSCA – allows check/command submission over network
• NRPE – remote check executor

Page 23: Extensible Monitoring with  Nagios  and Messaging Middleware

API – not a product
• NagMQ is just an interface into Nagios, not a product
• Better communication with clients comes from larger ZeroMQ project – leaving NagMQ to focus on Nagios
• Implement ad-hoc tools for Nagios without having to write any compiled code
• Doing expensive data processing of monitoring data doesn’t have to create latency in monitoring system
• Re-use one interface for many tools

Page 24: Extensible Monitoring with  Nagios  and Messaging Middleware

Future Work
• Pluggable authentication/encryption for NagMQ
• Pluggable parser/emitter for custom data formats (XML, YAML, etc)
• NDOutils database replacement
• More user interfaces (Jabber, SMS, email gateway, REST API)
• Nagios 4

Page 25: Extensible Monitoring with  Nagios  and Messaging Middleware

NagMQ

https://github.com/jbreams/nagmq

Jonathan Reams <[email protected]>

