Monitoring a HPC Cluster with Nagios
Piero Calucci
Scuola Internazionale Superiore di Studi Avanzati, Trieste
2009-04-01 [1]
2009-04-03
[1] Try again... Fail better.
Outline
1 Nagios Concept
2 Nagios Web Interface
3 Nagios Installation for HPC Monitoring @SISSA
What is Nagios?
«Nagios® is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better».
— Nagios documentation
Architecture
[Diagram: the monitoring host, containing the plugins, the nagios daemon, the nagios CGI, and the web server]
All the hard work is done by plugins; the nagios daemon «only» schedules them to be executed at the right time with the right parameters and collects the results.
The cgi interface is entirely optional, but highly useful.
Nagios Operation
The nagios daemon
• schedules and executes active host and service checks
• accepts asynchronous passive checks
• sends out notifications on host or service state change
• executes event handlers on host or service state change
• writes and rotates log and state files
Nagios Operation
Local Plugin Execution
[Diagram: on the monitoring host, the nagios daemon runs a local plugin]
All active checks involve the local execution of some plugin.
Nagios Operation
Local Plugin Execution: Local Service Check
[Diagram: on the monitoring host, the nagios daemon runs a local plugin, which checks a local service]
Locally executed plugins can just check for some local service...
Nagios Operation
Local Plugin Execution: Remote Host Check
[Diagram: the nagios daemon runs a local plugin on the monitoring host; the plugin checks the monitored host]
...or they can go out over the network and check some remote host or service.
Nagios Operation
Nagios Remote Plugin Executor
[Diagram: on the monitoring host, the nagios daemon runs the check_nrpe plugin; on the monitored host, nrpe runs a plugin that checks the service]
NRPE allows execution of plugins on remote hosts. Remote plugin results are reported to nagios by the locally executed check_nrpe plugin.
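Nagios Operation
Nagios Service Check Acceptor
[Diagram: an asynchronous event on the monitored host triggers send_nsca, which reports to the nsca daemon on the monitoring host; the nsca daemon passes the result to the nagios daemon]
The NSCA daemon relays to Nagios the asynchronous notifications sent by send_nsca (this is how passive checks from remote hosts work).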
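As a rough illustration of what travels along this path: send_nsca reads check results on standard input, one per line, as tab-separated host name, service description, return code, and plugin output. The helper below only formats such a line. The function name `nsca_line`, the host `monitor.example`, and the config path in the usage comment are assumptions made for this sketch, not part of the setup described in these slides.

```shell
#!/bin/sh
# Format one passive service check result in the tab-separated form
# that send_nsca reads on standard input:
#   host_name <TAB> service_description <TAB> return_code <TAB> plugin_output
# nsca_line is a hypothetical helper name used only for this sketch.
nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Possible usage on a monitored host (host name and config path are assumptions):
#   nsca_line p001 pbs_mom 2 "pbs_mom not running" |
#       send_nsca -H monitor.example -c /etc/send_nsca.cfg
```

Keeping the formatting separate from the actual send makes it easy to inspect what would be submitted before wiring it to a real event.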
Host Checks
• a host is basically anything that can be given a name and an address
• hosts can be
  UP
  DOWN
  UNREACHABLE (the host may well be up and running, but something in the network in between is broken)
• host checks are executed
  • at regular intervals
  • on-demand when a service on the host changes state
  • on-demand when required by reachability or dependency logic
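The difference between DOWN and UNREACHABLE can be sketched as a small decision function. This is a simplified reading of the reachability logic, assuming a failed host is UNREACHABLE only when it has parent hosts defined and none of them is UP. The function name and its three arguments are hypothetical.

```shell
#!/bin/sh
# host_state CHECK_RC PARENTS_TOTAL PARENTS_UP
# Simplified sketch (an assumption, not Nagios source code):
#   - check succeeded                        -> UP
#   - check failed, parents exist, none UP   -> UNREACHABLE
#   - check failed otherwise                 -> DOWN
host_state() {
    check_rc=$1 parents_total=$2 parents_up=$3
    if [ "$check_rc" -eq 0 ]; then
        echo UP
    elif [ "$parents_total" -gt 0 ] && [ "$parents_up" -eq 0 ]; then
        echo UNREACHABLE
    else
        echo DOWN
    fi
}
```

Note that a host with no parents defined can never come out as UNREACHABLE in this sketch, which matches the situation of a flat network with no intermediate hosts to check.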
Service Checks
• a service is any monitored «thing» associated with a host
• service state can be
  OK
  WARNING
  CRITICAL
  UNKNOWN
• service checks are executed
  • at regular intervals
  • on-demand when required by dependency logic
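The four service states correspond to the plugin exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with one line of status text printed on standard output. A minimal sketch of that contract follows, assuming a plugin that compares an integer value against two thresholds. The function name `check_threshold` and the output wording are illustrative only.

```shell
#!/bin/sh
# check_threshold VALUE WARN CRIT
# Hypothetical plugin body: prints one status line and returns the
# conventional plugin exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
check_threshold() {
    value=$1 warn=$2 crit=$3
    # Anything that is not a non-negative integer cannot be judged.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - usage: check_threshold VALUE WARN CRIT"
                return 3
                ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value=$value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value=$value"
        return 1
    fi
    echo "OK - value=$value"
    return 0
}
```

A real plugin would end with `exit` instead of `return`; a function is used here so the logic can be exercised without terminating the calling shell.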
State Types
Each host or service state can be of SOFT or HARD type.
SOFT type states are considered «uncertain» or «transitioning» and are checked with a different (usually higher) frequency until a specified maximum retry count is reached – they then become HARD states.
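For a check that keeps failing, the rule above reduces to a comparison between the current attempt number and the maximum retry count. The sketch below assumes a non-OK result on every attempt and ignores recoveries. `state_type` is a hypothetical name, and the value 4 matches the `max_check_attempts` used in the pbs_mom example later in these slides.

```shell
#!/bin/sh
# state_type ATTEMPT MAX_CHECK_ATTEMPTS
# For a non-OK check result: SOFT while retries remain, HARD once the
# attempt number reaches the maximum. Recoveries are not modelled here.
state_type() {
    if [ "$1" -ge "$2" ]; then
        echo HARD
    else
        echo SOFT
    fi
}

# Walk through a service that fails four times in a row.
for attempt in 1 2 3 4; do
    echo "attempt $attempt/4: $(state_type "$attempt" 4)"
done
```

With a maximum of 4, attempts 1 to 3 are SOFT and attempt 4 is HARD.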
Flap Detection
• When a host or service changes its state «too frequently» it is detected as flapping
• Flapping hosts and services do not trigger notifications, in order to avoid filling up mailboxes
• Flap detection thresholds are configurable, and flap detection can be disabled entirely. However, the defaults are good enough
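To make «too frequently» concrete, the sketch below computes the percentage of state changes over a short history of check results. It is deliberately simplified: Nagios also weights recent changes more heavily and uses separate high and low thresholds, none of which is reproduced here. The function name is hypothetical.

```shell
#!/bin/sh
# percent_state_change STATE...
# Unweighted sketch: counts how many consecutive pairs of states differ
# and prints that as an integer percentage of all consecutive pairs.
percent_state_change() {
    changes=0
    count=0
    prev=
    for s in "$@"; do
        if [ -n "$prev" ] && [ "$s" != "$prev" ]; then
            changes=$((changes + 1))
        fi
        prev=$s
        count=$((count + 1))
    done
    # Fewer than two states means there is nothing to compare.
    if [ "$count" -le 1 ]; then
        echo 0
        return 0
    fi
    echo $((changes * 100 / (count - 1)))
}
```

A history that alternates on every check scores 100, a stable one scores 0, and a flap threshold is then just a cut-off on this number.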
Notifications
• notifications can be sent out whenever a HARD state transition occurs and when a host or service remains in a non-OK hard state for a specified time
• notifications can be enabled or disabled for each host or service
• notification times and contact groups can be set up so that only the right person is contacted, only when he is on duty
Event Handlers
• event handlers are executed (if defined) when a host or service changes state and for each retry in SOFT states
• they are given all state information: state, type, retry count
• they can do basically anything, as long as they are given sufficient permissions, including
  • restarting a failed service or host
  • changing nagios configuration by writing to the command pipe (adaptive monitoring)
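Writing to the command pipe means appending one line per command, in the form `[timestamp] COMMAND;arg1;arg2;...`. The sketch below only builds such a line. The helper name `external_command` and the pipe path in the usage comment are assumptions, while `CHANGE_NORMAL_SVC_CHECK_INTERVAL` is one of the external commands Nagios provides for adaptive monitoring.

```shell
#!/bin/sh
# external_command TIMESTAMP NAME [ARG...]
# Builds one external command line: "[TIMESTAMP] NAME;ARG;ARG;..."
# (hypothetical helper; it prints the line and does not touch the pipe).
external_command() {
    ts=$1
    name=$2
    shift 2
    line="[$ts] $name"
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line"
}

# Possible usage from an event handler (the pipe path is an assumption
# and depends on how Nagios was installed):
#   external_command "$(date +%s)" CHANGE_NORMAL_SVC_CHECK_INTERVAL \
#       p001 pbs_mom 1 > /usr/local/nagios/var/rw/nagios.cmd
```

Whatever process runs the handler needs write permission on the pipe for the redirection to succeed.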
Tactical Overview
[screenshot of the Nagios web interface]
Host State
[screenshot of the Nagios web interface]
Service Status Detail
[screenshot of the Nagios web interface]
Service Problems
[screenshot of the Nagios web interface]
What to Monitor
Hosts
• a masternode
• 160+ computing nodes
• several NFS servers
  • including a HA NFS cluster
• lustre servers
What to Monitor
Services
• «generic» services (SSH, NTP, ...)
• HPC-specific services (maui, pbs_server, pbs_mom)
• computing node health (load average, hardware errors, ...)
What to Do
• send notifications when things go unrecoverably bad
  avoid sending out notifications every hour (Nagios default) – how often is too often?
• restart services and hosts when possible
  • how much does it take to declare a service crashed or a computing node dead?
  • do we trust Nagios to detect correctly?
  • do we accept the risk of rebooting a node that was just responding late?
Open Issues
• «undead» nodes
  • what are we going to do with nodes that reply to ping but nothing else? (they are UP from Nagios PoV)
  • how are we going to reliably detect the undead state?
• host reachability
  • cluster network topology is too simple to make Nagios reachability logic useful (no pingable gateways)
  • we have no way to check a single switch port (do we?)
• multihomed nodes
  • Nagios has minimal support for multihomed hosts (or maybe I didn't understand it...)
  • we have no clear way to know all addresses associated with a node FQDN
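One possible starting point for the «undead» question, offered as a sketch and not as the approach taken at SISSA: combine the result of a ping with the result of a probe against a real service (SSH, for instance) and treat "ping answers, service does not" as its own case. The function name and the ALIVE/UNDEAD/DOWN labels are hypothetical and are not Nagios states.

```shell
#!/bin/sh
# classify_node PING_RC SERVICE_RC
# Both arguments are exit codes of probes run elsewhere (0 = success).
#   ping fails                 -> DOWN
#   ping ok, service fails     -> UNDEAD
#   both succeed               -> ALIVE
classify_node() {
    ping_rc=$1 service_rc=$2
    if [ "$ping_rc" -ne 0 ]; then
        echo DOWN
    elif [ "$service_rc" -ne 0 ]; then
        echo UNDEAD
    else
        echo ALIVE
    fi
}

# Possible usage (the probes are assumptions; ping flags vary by platform):
#   ping -c 1 "$host" >/dev/null 2>&1; p=$?
#   ssh -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; s=$?
#   classify_node "$p" "$s"
```

This does not settle how reliable the service probe is on a heavily loaded node, which is the open question raised above.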
Open Issues
High System CPU Time
After 3 weeks we are over 30% CPU usage on a 2x dual-core Opteron...
Open Issues
File Access Permissions Mess
On monitoring server:
• nagios needs to read its own configuration files and write log and status files
• web server needs to read some nagios config files and write to nagios command pipe
• nagios event handlers need to write to nagios command pipe
• nsca needs to read its own configuration and write to nagios command pipe
On monitored hosts:
• nrpe needs to read its own configuration and execute nagios plugins
• send_nsca needs to read its own configuration
Example
Check pbs_mom /1

define host {
    use        linux-server
    host_name  p001
    address    10.2.13.1
}

define hostgroup {
    hostgroup_name  p-nodes
    alias           planck nodes
    members         p001,p002,p003,...
}
Example
Check pbs_mom /2

define service {
    use                    generic-service
    hostgroup_name         m-nodes,c-nodes,p-nodes,...
    service_description    pbs_mom
    check_command          check_pbs_mom
    max_check_attempts     4
    notifications_enabled  1
    event_handler          restart_pbs_mom
    servicegroups          sg_node_batch
}
Example
Check pbs_mom /3

define command {
    command_name  check_pbs_mom
    command_line  $USER1$/check_tcp -H $HOSTADDRESS$ -p 15002
}

define command {
    command_name  restart_pbs_mom
    command_line  /.../eventhandlers/restart_pbs_mom.sh $HOSTNAME$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
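Before running a command, Nagios replaces the `$...$` macros in `command_line` with values for the host being checked. The sketch below imitates that substitution for the two macros used by check_pbs_mom so the resulting command line can be inspected by hand. The function name and the plugin directory used in the usage comment are assumptions; the address is the one given for p001 earlier in the example.

```shell
#!/bin/sh
# expand_macros TEMPLATE USER1_DIR HOST_ADDRESS
# Replaces $USER1$ and $HOSTADDRESS$ in TEMPLATE (hypothetical helper).
# The values must not contain '|', '&' or backslashes, which sed would
# interpret in the replacement text.
expand_macros() {
    printf '%s\n' "$1" |
        sed -e 's|\$USER1\$|'"$2"'|g' \
            -e 's|\$HOSTADDRESS\$|'"$3"'|g'
}

# Possible usage (the plugin directory is an assumption):
#   expand_macros '$USER1$/check_tcp -H $HOSTADDRESS$ -p 15002' \
#       /usr/lib/nagios/plugins 10.2.13.1
```

Running the expanded line from a shell on the monitoring host is a quick way to test a check before handing it to the scheduler.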
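Example
Check pbs_mom /4

restart_pbs_mom.sh is:

#!/bin/sh
# Arguments: $1 = host name, $2 = service state,
#            $3 = state type (SOFT or HARD), $4 = check attempt number

# Both variables hold a command plus its options, so they are expanded
# unquoted on purpose to let the shell split them into words.
SSH="ssh -i /var/lib/nagios/.ssh/id_dsa -t -t"
RESTART="sudo /sbin/service pbs_mom restart"

case "$2" in
CRITICAL)
    case "$3" in
    SOFT)
        # Try a restart on the third soft failure, before the state goes HARD.
        case "$4" in
        3)
            $SSH "$1" $RESTART
            ;;
        esac
        ;;
    HARD)
        # Try again once the failure is confirmed.
        $SSH "$1" $RESTART
        ;;
    esac
    ;;
esac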
That’s it
This slide intentionally left blank