Nagios on Tier1 farm Jonathan Wheeler RAL Tier1 Fabric Team 20th June 2008


Nagios on Tier1 farm

Jonathan Wheeler
RAL Tier1 Fabric Team

20th June 2008


Overview

• What we had before (Sure)
• Introduction to Nagios and how it is configured for the farm
• What might we do next


Sure monitoring - 1

• Consists of a server and clients
• Communication via sysreq command
• Required scripts set up for each client to run checks and report results to server


Sure monitoring - 2

3 main tasks:
a) check host alive
    • active, using ping
    • passive, accepting heartbeat messages
b) receive alarm messages
c) receive “backup started” and “backup finished” messages


Sure monitoring - 3

Problems:
• configuration not directly under Tier1 control
• requires locally written and locally maintained scripts
• limited view of farm alarms and state
• alarms only visible on server screen


Introduction to Nagios

• highly configurable
• under active development (Nagios 2.11 legacy, Nagios 3.0.2 latest stable)
• active user community (mailing list)
• some commercial offerings
• extensive documentation included with the installation
• allows local extensions


Introduction to Nagios – basics -1

Nagios:
• schedules test commands, for example: is the space used in the /var filesystem larger than the permitted limit?
• accepts results as a return code (0 = OK, 1 = warning, 2 = critical, 3/-1 = unknown) and a single-line message
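The /var example and the return codes listed above can be sketched as a small Python check. This is only an illustration, not one of the Tier1 plugins: the function names, the 80% and 90% thresholds, and the message wording are all assumptions.

```python
import shutil

# Plugin return codes as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (return code, single-line message).

    The thresholds are illustrative defaults, not values from the talk.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_filesystem(path="/var", warn=80.0, crit=90.0):
    """Check how full the filesystem holding `path` is."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, so the state is unknown.
        return UNKNOWN, f"DISK UNKNOWN - cannot check {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return classify(100.0 * usage.used / usage.total, warn, crit)
```

A real plugin would print the message on one line and exit with the code, for example `code, msg = check_filesystem(); print(msg); sys.exit(code)`.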


Introduction to Nagios – basics -2

Nagios (continued):
• displays via Web interface to authorised users
• sends notification via e-mail, SMS, RSS, Morse code, jungle drums etc.
• may run an event handler, e.g. if a test fails, then put this batch node offline


Introduction to Nagios – networked clients

• Nagios server can use check_nrpe command to run test on networked client

• client must be running nrpe client process to
    – accept and run check requests
    – accept results and return to server

• Nagios server can also use ssh or smtp to perform checks (little experience on Tier1)
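As a rough sketch of the check_nrpe step, the server-side call can be wrapped in Python as below. The `-H`, `-c` and `-t` options are standard check_nrpe options; the plugin path, the host name and the remote command name `check_var` are assumptions made for illustration.

```python
import subprocess

# Assumed install location of the plugin; adjust for the local system.
CHECK_NRPE = "/usr/lib/nagios/plugins/check_nrpe"


def build_nrpe_command(host, command, timeout=None):
    """Build the argument list asking `host` to run a named NRPE command."""
    argv = [CHECK_NRPE, "-H", host, "-c", command]
    if timeout is not None:
        argv += ["-t", str(timeout)]
    return argv


def run_nrpe_check(host, command, timeout=None):
    """Run the remote check; return (return code, first line of output)."""
    done = subprocess.run(
        build_nrpe_command(host, command, timeout),
        capture_output=True,
        text=True,
    )
    lines = done.stdout.splitlines()
    return done.returncode, (lines[0] if lines else "")
```

The command name passed with `-c` must match a command defined in the client's NRPE configuration; the client runs it and returns the code and message to the server.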


Single server, many clients

[Diagram: one Nagios server connected to four Nagios clients]


Introduction to Nagios – slave servers

• Running scheduled checks and web server puts heavy load on Nagios server

• Tier1 uses master and slave servers:
    – master keeps all results, runs web server and sends notifications
    – slaves schedule tests, run them and return results to master (using send_nsca command to nsca daemon)
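To illustrate what a slave hands to send_nsca, the sketch below formats one service result in the tab-separated form that send_nsca reads on standard input (host, service description, return code, plugin output). The function name and the example host and service names are hypothetical.

```python
def format_nsca_result(host, service, return_code, output):
    """Format one passive service result as a line for send_nsca's stdin."""
    # Tabs and newlines would break the record, so flatten them in the message.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{return_code}\t{clean}\n"
```

On a slave, a line like this would be piped into send_nsca, which forwards it to the nsca daemon on the master.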


Introduction to Nagios – “freshness”

If slave server has crashed:
• master server checks whether tests have been run to schedule (freshness checking)

• if test is stale (test results not returned to schedule), master will run test (force check)
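The freshness test itself is a simple comparison, sketched below. This is a simplified model rather than Nagios's own logic: the function names and the single shared threshold are assumptions.

```python
import time


def is_stale(last_result_time, freshness_threshold, now=None):
    """True if no result has arrived within `freshness_threshold` seconds."""
    if now is None:
        now = time.time()
    return (now - last_result_time) > freshness_threshold


def stale_services(last_results, freshness_threshold, now=None):
    """Return the (host, service) pairs the master would need to check itself.

    `last_results` maps (host, service) to the Unix time of the latest result.
    """
    if now is None:
        now = time.time()
    return sorted(
        key
        for key, stamp in last_results.items()
        if is_stale(stamp, freshness_threshold, now)
    )
```

In Nagios the behaviour is enabled per host or service with the `check_freshness` and `freshness_threshold` directives.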


Master and slave servers; many clients

[Diagram: one master server, three slave servers and nine clients]


Introduction to Nagios – clearing alarms

If check condition has been corrected and you want to clear alarm before the next scheduled test:
• can force check (from master or slave) by issuing appropriately formatted command to server

• scripts available to do this
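A minimal sketch of such a command, assuming the standard Nagios external command `SCHEDULE_FORCED_SVC_CHECK`, is shown below. It is not one of the Tier1 scripts; the function name and the example host and service are hypothetical.

```python
import time


def forced_check_command(host, service, when=None):
    """Format an external command that forces an immediate service check."""
    if when is None:
        when = int(time.time())
    # Format: [timestamp] COMMAND;host;service description;check time
    return f"[{when}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{when}\n"
```

The resulting line is written to the Nagios external command file (the `command_file` setting in the main configuration), which the server reads and acts on.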


Introduction to Nagios - configuration

In our configuration Nagios knows about:
– hosts
– host groups
– services (for checking)
– contacts and contact groups
– time periods (when tests are valid, when to send contact messages)


Introduction to Nagios - configuration

• Configuration is made simpler by extensive use of templates, for example:
    – define a template for a generic host
    – use it to define many other hosts, only changing parameters that are different (e.g. host name, address, group to which it belongs)
    – can be recursive (a template can itself use another template)
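How a chain of templates resolves can be modelled with dictionaries, as in the sketch below. It is a simplified illustration, not Nagios's parser: it assumes a single parent per `use` directive, and the function name is hypothetical.

```python
def resolve(definition, templates, _seen=()):
    """Return `definition` merged with everything inherited through `use`.

    `templates` maps a template name to its dictionary of directives.
    Values set locally always override inherited ones.
    """
    merged = {}
    parent_name = definition.get("use")
    if parent_name is not None:
        if parent_name in _seen:
            raise ValueError(f"template loop at {parent_name!r}")
        parent = resolve(templates[parent_name], templates, _seen + (parent_name,))
        # `name` and `register` describe the template itself and are not inherited.
        merged.update(
            {k: v for k, v in parent.items() if k not in ("name", "register")}
        )
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged
```

Each host then only needs to state what differs from its template, and a change to the template reaches every host that uses it.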


# Generic host definition template
define host{
    name                            generic-host     ; name of host template
    notifications_enabled           1                ; Host notifications are enabled
    event_handler_enabled           1                ; Host event handler is enabled
    flap_detection_enabled          1                ; Flap detection is enabled
    process_perf_data               1                ; Process performance data
    retain_status_information       1                ; Retain status information
    retain_nonstatus_information    1                ; Retain non-status information
    register                        0                ; Template definition
    check_command                   check-host-alive
    max_check_attempts              10
    notification_interval           720
    notification_period             24x7
    notification_options            d,u,r
}


define host{
    use                 generic-host
    host_name           ganglia0430
    parents             swt-5530-0
    alias               Ganglia Host
    hostgroups          aux-services
    contact_groups      thorne
    address             130.246.183.173
}

define host{
    use                 generic-host
    host_name           shelob
    parents             swt-4400-1
    alias               CSF Webserver
    ……………


Introduction to Nagios - plugins

• Test scripts are known as plugins
• Can be written in any suitable language: shell script, Perl, C, Pascal
• About 60 standard plugins (available by RPM from Dag Wieers’ repository)
• About 30+ locally written plugins
• plus 14+ specially written for Castor


Nagios links

• Nagios home page: http://www.nagios.org/

• For locally written plugins: http://cvs.gridpp.rl.ac.uk/viewcvs/viewcvs.cgi/nagios/plugins/

• For GridPP information about Nagios: http://www.gridpp.ac.uk/wiki/Nagios