INDIANA UNIVERSITY

Grid Monitoring from a GOC perspective

John Hicks, HPCC Engineer

Indiana University

October 27, 2002

Internet2 Fall Members meeting, HENP Working Group – Los Angeles

Overview

This presentation is concerned with work being done for the iVDGL/iGOC demonstration at SuperComputing 2002.

• Identifying the issues

• NOC vs. iGOC

• Getting information

• GOC tools

Web site: www.igoc.iu.edu

Identifying the issues

• What role should the GOC play in grid monitoring?

  • Should the GOC just collect and publish general information about the grid status?

  • Should the GOC collect information for troubleshooting problems?

  • Should the GOC try to direct traffic and identify potential problems, much as an air traffic controller does (suggested by Saul Youssef, Boston University)?

Identifying the issues (cont.)

• What are some of the potential problems the GOC can help solve?

  • Resource status and availability.

    • Computational node

    • Storage node

    • Network

    • Services (MDS)

• Resource availability can be determined with something as simple as a ping.

• Resource status depends on the measurement criteria.

  • What is the machine's current load?

  • How much disk space is available?

  • What is the measured network throughput between nodes?

  • Are LDAP services available on this machine?
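
The availability and status questions above can be sketched in a few lines of Python. This is only an illustration, not the iGOC tooling: the function names are made up for the example, a TCP connection attempt stands in for an ICMP ping (which normally needs raw-socket privileges), and the load average call is Unix-only.

```python
import os
import shutil
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Stand-in for a ping; it also answers "is a service listening here?",
    e.g. port 389 for a standard LDAP server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def local_status(path: str = "/") -> dict:
    """Collect the 1-minute load average and the free-space fraction for path."""
    load1, _, _ = os.getloadavg()  # Unix only
    usage = shutil.disk_usage(path)
    return {
        "load_1min": load1,
        "disk_free_fraction": usage.free / usage.total,
    }
```

Availability is a yes/no answer, while status only becomes meaningful once thresholds are chosen for values such as the load average or the free-space fraction.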

Identifying the issues (cont.)

• What information does the GOC need to help solve problems?

• What data needs to be gathered?

  • Grid centric (MDS).

  • OS centric (Ganglia, Nagios).

  • Network centric (SNMP, other network monitoring tools).

• How often does the data change, and how often must it be acquired?

  • Static (total number of nodes in a cluster).

  • Dynamic but infrequent (number of available nodes).

  • Dynamic and frequent (jobs running on a cluster).

  • Real-time (available network bandwidth).
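
One way to act on the four acquisition classes above is to attach a polling interval to each class and ask which metrics are due. The sketch below is hypothetical: the interval values, the `Metric` class, and the metric names are assumptions chosen for illustration, not figures from the iGOC.

```python
from dataclasses import dataclass

# Assumed polling intervals in seconds, one per class named on the slide.
POLL_INTERVALS = {
    "static": 24 * 3600,
    "dynamic_infrequent": 3600,
    "dynamic_frequent": 60,
    "realtime": 5,
}


@dataclass
class Metric:
    name: str
    category: str
    last_polled: float = float("-inf")  # never polled yet

    def is_due(self, now: float) -> bool:
        """True when at least one full interval has passed since the last poll."""
        return now - self.last_polled >= POLL_INTERVALS[self.category]


def due_metrics(metrics, now):
    """Return the names of the metrics that should be collected at time now."""
    return [m.name for m in metrics if m.is_due(now)]


metrics = [
    Metric("total_nodes", "static", last_polled=0.0),
    Metric("running_jobs", "dynamic_frequent", last_polled=0.0),
    Metric("available_bandwidth", "realtime", last_polled=95.0),
]
print(due_metrics(metrics, now=100.0))
```

Separating the classes this way keeps slowly changing data from being polled as often as fast-changing data.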

NOC vs. iGOC

• The Global NOC provides first-level support for network-related problems, typically on networks within its domain of control.

• The iGOC should provide first-level support for network-, facility-, and infrastructure-related problems that are not necessarily within its domain of control.

• The Global NOC has network engineers on staff.

• As far as I know, there is no such thing as a grid engineer.

• NOC performance monitoring usually has a demarcation point (e.g. wall jack, edge device) within a homogeneous network.

• GOC performance monitoring must measure end-to-end performance in a heterogeneous network and end-node environment.

• The GOC must use the NOC as a resource for solving problems.

Getting information

• A key component of a successful GOC is accurate contact information.

• In order to solve problems or monitor resources you have to know who to talk to.

• We are currently collecting the following contact information from each site on the grid:

  • High Performance Computing (HPC) contact.

  • Principal Investigator (PI).

  • Network person or local NOC contact.

  • Security.

  • Storage.

  • System administrator.
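
A contact list like the one above is easy to audit once it is stored as structured data. The following is a minimal sketch, assuming each site's contacts are kept in a dictionary keyed by role; the role keys, the helper name, and the example address are invented for the illustration.

```python
# Role keys are assumptions that mirror the six contact types on the slide.
REQUIRED_ROLES = ("hpc", "pi", "network", "security", "storage", "sysadmin")


def missing_roles(site_contacts: dict) -> list:
    """Return the required roles for which a site has no contact on file."""
    return [role for role in REQUIRED_ROLES if not site_contacts.get(role)]


# Hypothetical site record with only two roles filled in.
example_site = {"hpc": "hpc-contact@example.org", "pi": "pi@example.org"}
print(missing_roles(example_site))
```

Running such a check over every site shows at a glance where the GOC would not know whom to call.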

GOC tools

We are using and developing the following tools to meet the GOC monitoring requirements.

• Nagios

• Ganglia

• LDAP tools

• GOC and other tools

Nagios

• Nagios® is a host and service monitor designed to inform you of network problems and end system problems.

• Nagios provides simple ping availability of resources on the network.

• Nagios works with a set of “plugins” to provide local and remote host service status.

• Custom “plugins” are relatively easy to develop.

• Different methods are provided for remote resource discovery.

• Nagios is freely available from http://www.nagios.org.
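
To illustrate why custom plugins are easy to write: a Nagios plugin is any executable that prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below follows that convention in Python; the function names and the threshold defaults are assumptions for the example, and it is not one of the plugins used for the demo.

```python
import shutil
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_free_space(path, warn_pct=20.0, crit_pct=10.0):
    """Return (exit_code, status_line) based on the free space left on path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free on {path}"
    return OK, f"DISK OK - {free_pct:.1f}% free on {path}"


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_free_space(path)
    print(message)
    return code

# When saved as an executable script, finish with: sys.exit(main(sys.argv))
```

Nagios reads only the exit code and the printed line, so the check logic itself can be anything the site needs.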

Nagios (cont.)

• Currently using the following built-in Nagios plugins:

  • check_users

  • check_load

  • check_disk

  • check_procs

  • check_mem

• Current Nagios plugin development:

  • check_nagios (see if a remote Nagios is running).

  • check_aggregate (summarize and collect the status of a group of services).
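
The idea behind an aggregate check can be sketched as follows. This is not the check_aggregate plugin itself, whose design is not described here; it is a guess at the general approach, and the severity ordering (CRITICAL worst, then WARNING, then UNKNOWN, then OK) is an assumption.

```python
from collections import Counter

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Assumed ranking used to pick the "worst" state of the group.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def aggregate(results: dict):
    """Collapse {service_name: return_code} into one (code, status_line) pair."""
    if not results:
        return UNKNOWN, "AGGREGATE UNKNOWN - no services reported"
    worst = max(results.values(), key=SEVERITY.__getitem__)
    counts = Counter(LABELS[code] for code in results.values())
    summary = ", ".join(
        f"{counts[label]} {label}"
        for label in ("OK", "WARNING", "CRITICAL", "UNKNOWN")
        if counts[label]
    )
    return worst, f"AGGREGATE {LABELS[worst]} - {summary}"


print(aggregate({"check_load": OK, "check_disk": WARNING, "check_users": OK}))
```

The group reports the state of its worst member, while the counts in the status line preserve some of the detail.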

Nagios (cont.)

• There are different ways Nagios can get information from plugins.

  • nrpep (Perl version of nrpe).

  • check_by_ssh (passive).

  • check_by_ssh (active).

• Nagios remote plugin execution (Perl).

  • Easy to use once set up.

  • Uses MD5 and TripleDES.

  • Scales reasonably well for a large number of hosts.

  • Must have remote root access to set up.

Nagios (cont.)

• check_by_ssh (passive).

  • sshd is already running in most places.

  • Requires a crontab entry to push data to the server.

  • Scales reasonably well for a large number of hosts.

• check_by_ssh (active).

  • sshd is already running in most places.

  • Does not scale well for a large number of hosts.
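
In the passive case, the remote host runs the check itself and hands the result to the server. Nagios accepts such results as `PROCESS_SERVICE_CHECK_RESULT` lines written to its external command file. The sketch below only formats such a line; the function name is made up, and how the line travels to the server (for example, a cron job piping it over ssh) is an assumption not shown here.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format one passive service check result in Nagios external command syntax."""
    if timestamp is None:
        timestamp = int(time.time())
    # Keep the record on a single line; semicolons are replaced conservatively
    # because they separate the fields of the command.
    clean = output.replace("\n", " ").replace(";", ",")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}"
    )


# Hypothetical host and service names.
print(passive_result_line("node01", "check_load", 0, "OK - load 0.10",
                          timestamp=1035676800))
```

Because each host does its own work and the server only reads finished results, the server's load grows far more slowly than with active checks, where it must open an ssh session per check.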

Nagios (cont.)

• The current iVDGL Nagios implementation for the SuperComputing demo uses a star topology.

  • One Nagios server.

  • Using check_by_ssh (passive).

  • Does not scale well.

  • Quick and dirty demo installation.

datatag-nagios.pi.infn.it

• Proposed persistent GOC Nagios infrastructure.

  • Run a Nagios server at the gatekeeper of each cluster.

  • Gatekeeper Nagios only responsible for local site.

  • Aggregate summary information and send to regional Nagios server.

  • GOC maintains Meta Nagios with grid health status.
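
The proposed hierarchy can be pictured as a roll-up in which only summaries leave each tier. The sketch below is an assumption about how such a roll-up might look, not the proposed implementation: site and region names are invented, UNKNOWN results are left out for brevity, and every site is assumed to report at least one service.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def site_status(service_codes):
    """Gatekeeper tier: a site is as healthy as its worst local service."""
    return max(service_codes)


def regional_summary(sites):
    """Regional tier: count sites per state; per-service detail stays local."""
    summary = {"OK": 0, "WARNING": 0, "CRITICAL": 0}
    for codes in sites.values():
        summary[LABELS[site_status(codes)]] += 1
    return summary


def grid_health(regions):
    """GOC tier: one summary per region."""
    return {name: regional_summary(sites) for name, sites in regions.items()}


# Hypothetical regions and sites with their local service return codes.
regions = {
    "midwest": {"site-a": [OK, OK], "site-b": [OK, CRITICAL]},
    "east": {"site-c": [WARNING]},
}
print(grid_health(regions))
```

Each tier sees only what it needs: gatekeepers see services, regional servers see sites, and the GOC sees regions.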

Ganglia

• Ganglia provides a complete pseudo real-time monitoring and execution environment.

• Ganglia provides a mechanism for linking not only the nodes of a cluster but also entire clusters to one another.

• The Ganglia Monitoring Daemon (gmond) is a multithreaded daemon that runs on each node that you want to monitor.

• The Ganglia Meta Daemon (gmetad) allows you to monitor clusters.

• The Ganglia web front end uses PHP and RRDTool.

• Ganglia is freely available at http://ganglia.sourceforge.net
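
gmond publishes the cluster state as an XML document to any client that connects to its TCP port (8649 by default), which makes it simple to read from a script. The sketch below assumes the usual `CLUSTER` / `HOST` / `METRIC` element layout; the sample document is a hand-written illustration, not captured output, and the function names are made up.

```python
import socket
import xml.etree.ElementTree as ET

# Hand-written sample in the shape gmond is assumed to emit.
SAMPLE = (
    '<GANGLIA_XML VERSION="2.5.0" SOURCE="gmond">'
    '<CLUSTER NAME="demo">'
    '<HOST NAME="node01" IP="10.0.0.1">'
    '<METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>'
    "</HOST></CLUSTER></GANGLIA_XML>"
)


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the whole XML dump that gmond sends before closing the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def host_metrics(xml_text):
    """Return {host_name: {metric_name: value_string}} from a gmond XML dump."""
    root = ET.fromstring(xml_text)
    return {
        host.get("NAME"): {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        for host in root.iter("HOST")
    }


print(host_metrics(SAMPLE))
```

Against a live daemon, `host_metrics(fetch_gmond_xml("some-node"))` would be the equivalent call, with the host name supplied by the site.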

Ganglia (cont.)

• Ganglia has been modified to provide VO-centric reporting.

• Standard Ganglia does not provide layered reporting.

• VO-centric Ganglia has the following features:

  • Monitoring of host resources (processor load, memory load, disk load, etc.)

  • Simple plugin design that allows users to easily develop their own service checks (included from the standard version)

  • Grid and VO related sensors

  • Publishing/retrieving summary information to third parties

  • Optional SSL-enabled communication (meta-daemons and web interface)

  • MDS interface for collecting list of reporting nodes

  • Optional web interface for viewing current network status, notification and problem history, log file, etc.

  • Interface with Nagios(TM)

• Developed by Catalin Lucian, University of Chicago (http://people.cs.uchicago.edu/~cldumitr).

LDAP tools

• Grid centric information can be obtained from the MDS.

• There are a couple of good LDAP web interfaces.

  • LDAPExplorer, http://igloo.its.unimelb.edu.au/LDAPExplorer.

  • John’s LDAP Web interface, http://ldapweb.sourceforge.net.

• There are a number of Perl modules for LDAP:

  • http://perl-ldap.sourceforge.net

• The key to extracting information is understanding the schema.

• Find out who is responsible for the schema and take an active role in its development.

• Always build search tools that generate their queries dynamically.

• Learn to use ldapsearch and grid-info-search.
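
A dynamic query tool mostly comes down to building the LDAP search filter from data rather than hard-coding it. The sketch below builds a filter string with proper escaping and wraps it in an `ldapsearch` command line. The helper names are invented, and the object class, attribute name, server URI, and search base in the example are assumptions modeled on a typical MDS setup; check them against the schema actually in use.

```python
def escape_filter_value(value: str) -> str:
    """Escape the characters that are special inside an LDAP search filter."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def build_filter(conditions: dict) -> str:
    """AND together attribute=value tests; match everything if none are given."""
    parts = [f"({attr}={escape_filter_value(val)})" for attr, val in conditions.items()]
    if not parts:
        return "(objectClass=*)"
    if len(parts) == 1:
        return parts[0]
    return "(&" + "".join(parts) + ")"


def ldapsearch_argv(uri, base, conditions, attributes=()):
    """Return an argument list for the OpenLDAP ldapsearch client."""
    return ["ldapsearch", "-x", "-H", uri, "-b", base, "-s", "sub",
            build_filter(conditions), *attributes]


# Assumed MDS-style names, for illustration only.
print(ldapsearch_argv(
    "ldap://mds.example.org:2135",
    "Mds-Vo-name=local, o=grid",
    {"objectClass": "MdsHost", "Mds-Host-hn": "node01.example.org"},
))
```

Because the filter is assembled from a dictionary, the same tool keeps working when the schema gains attributes or the set of hosts changes.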

GOC and other tools

GOC staff are being presented with a new set of challenges. New tools are being developed to meet these challenges. A combination of new and old tools is required to monitor and troubleshoot grid issues. Future GOC staff and “Grid Engineers” will need a broad skill set in order to be effective.

• There are many other grid and cluster monitoring packages:

  • MonALISA, GOSSIP, Gridview, etc.

• There are many network monitoring packages:

  • MRTG.

  • SNAPP and other RRDTool collectors.

  • NetFlow tools.

  • Weather Map software.

  • OCxMON.

  • PingER.

Questions and discussion

John Hicks

Indiana University
