INDIANA UNIVERSITY
Grid Monitoring from a GOC perspective
John Hicks, HPCC Engineer
Indiana University
October 27, 2002
Internet2 Fall Members meeting, HENP Working Group – Los Angeles
This presentation is concerned with work being done for the iVDGL/iGOC demonstration at SuperComputing 2002.
• Identifying the issues
• NOC vs. iGOC
• Getting information
• GOC tools
Overview
Web site: www.igoc.iu.edu
• What role should the GOC play in grid monitoring?
• Should the GOC just collect and publish general information about the grid status?
• Should the GOC collect information for troubleshooting problems?
• Should the GOC try to direct traffic and identify potential problems, analogous to an air traffic controller (suggested by Saul Youssef, Boston University)?
Identifying the issues
• What are some of the potential problems the GOC can help solve?
• Resource status and availability.
• Computational node
• Storage node
• Network
• Services (MDS)
• Resource availability can be determined with something as simple as a ping.
• Resource status depends on the measurement criteria.
• What is the machine's current load?
• How much disk space is available?
• What is the measured network throughput between nodes?
• Are LDAP services available on this machine?
Identifying the issues (cont.)
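The availability and status questions above can be sketched in a few lines of Python. This is a minimal illustration, not the iGOC implementation: a TCP connection attempt stands in for an ICMP ping (which normally needs raw-socket privileges), and the function names `is_reachable` and `local_status` are hypothetical.

```python
import os
import shutil
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: a TCP connect is used as a stand-in for ping. Pointing it at
    a service port also answers "is this service available on this machine?".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def local_status(path: str = "/") -> dict:
    """Collect the current load and disk space of the local machine (Unix only)."""
    usage = shutil.disk_usage(path)
    load_1min, _load_5min, _load_15min = os.getloadavg()
    return {
        "load_1min": load_1min,
        "disk_free_bytes": usage.free,
        "disk_total_bytes": usage.total,
    }
```

For example, `is_reachable("ldap.example.org", 389)` would test whether an LDAP service answers on a hypothetical host; the host name and port are placeholders.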
• What information does the GOC need to help solve problems?
• What data needs to be gathered?
• Grid centric (MDS).
• OS centric (Ganglia, Nagios).
• Network centric (SNMP, other network monitoring tools).
• What kind of data is it, and how often must it be acquired?
• Static (total number of nodes in a cluster).
• Dynamic but infrequent (number of available nodes).
• Dynamic and frequent (jobs running on a cluster).
• Real-time (available network bandwidth).
Identifying the issues (cont.)
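One way to act on the four acquisition classes above is to give each class its own polling interval. The sketch below assumes hypothetical intervals and a hypothetical `Metric` record; the actual values would depend on the site and the tool.

```python
from dataclasses import dataclass

# Hypothetical polling intervals in seconds, one per acquisition class.
POLL_INTERVALS = {
    "static": 86400,      # e.g. total number of nodes in a cluster
    "infrequent": 3600,   # e.g. number of available nodes
    "frequent": 300,      # e.g. jobs running on a cluster
    "realtime": 10,       # e.g. available network bandwidth
}


@dataclass
class Metric:
    name: str
    category: str
    last_polled: float = float("-inf")  # never polled yet


def due_metrics(metrics, now: float):
    """Return the metrics whose polling interval has elapsed at time `now`."""
    return [m for m in metrics if now - m.last_polled >= POLL_INTERVALS[m.category]]
```

A collector loop would call `due_metrics` periodically, gather each returned metric, and then update its `last_polled` timestamp.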
• The Global NOC provides first-level support for network-related problems, typically on networks within its domain of control.
• The iGOC should provide first-level support for network-, facility-, and infrastructure-related problems that are not necessarily within its domain of control.
• The Global NOC has network engineers on staff.
• As far as I know, there is no such thing as a grid engineer.
• NOC performance monitoring usually has a demarcation point (e.g., wall jack, edge device) within a homogeneous network.
• GOC performance monitoring must measure end-to-end performance in a heterogeneous network and end-node environment.
• The GOC must use the NOC as a resource for solving problems.
NOC vs. iGOC
• A key component of a successful GOC is accurate contact information.
• In order to solve problems or monitor resources you have to know who to talk to.
• We are currently collecting the following contact information from each site on the grid.
• High Performance Computing (HPC) contact.
• Principal Investigator (PI).
• Network person or local NOC contact.
• Security.
• Storage.
• System administrator.
Getting information
We are using and developing the following tools to meet the GOC monitoring requirements.
• Nagios
• Ganglia
• LDAP tools
• GOC and other tools
GOC tools
• Nagios® is a host and service monitor designed to inform you of network problems and end system problems.
• Nagios provides simple ping availability of resources on the network.
• Nagios works with a set of “plugins” to provide local and remote host service status.
• Custom “plugins” are relatively easy to develop.
• Different methods are provided for remote resource discovery.
• Nagios is freely available from http://www.nagios.org.
Nagios
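To illustrate why custom plugins are relatively easy to develop: a Nagios plugin is any program that prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below is a hypothetical free-disk check written against that convention; the thresholds and function names are assumptions, and it is not one of the plugins shipped with Nagios.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a free-space percentage to a Nagios state using two thresholds."""
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


def check_disk_free(path: str = "/", warn_pct: float = 20.0, crit_pct: float = 10.0):
    """Return (exit_code, status_line) for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
        free_pct = 100.0 * usage.free / usage.total
    except (OSError, ZeroDivisionError) as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    code = classify_free_space(free_pct, warn_pct, crit_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, f"DISK {label} - {free_pct:.1f}% free on {path}"
```

Run as a plugin, the script would print the status line and pass the code to `sys.exit()`.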
• Currently using the following built-in Nagios plugins:
• check_users
• check_load
• check_disk
• check_procs
• check_mem
• Current Nagios plugin development:
• check_nagios (see if a remote Nagios is running).
• check_aggregate (summarize and collect the status of a group of services).
Nagios
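The idea behind an aggregate check can be sketched as follows. This is not the check_aggregate plugin under development, only an illustration of summarizing a group of service states into one result; the severity ranking (CRITICAL over WARNING over UNKNOWN over OK) and the output wording are assumptions.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed ranking used to pick the worst state in a group.
RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def aggregate(results: dict):
    """Collapse {service_name: exit_code} into one (exit_code, status_line)."""
    if not results:
        return UNKNOWN, "AGGREGATE UNKNOWN - no services reported"
    worst = max(results.values(), key=lambda code: RANK[code])
    bad = sorted(name for name, code in results.items() if code != OK)
    detail = f"{len(results) - len(bad)}/{len(results)} services OK"
    if bad:
        detail += "; not OK: " + ", ".join(bad)
    return worst, f"AGGREGATE {NAMES[worst]} - {detail}"
```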
• There are different ways Nagios can get information from plugins:
• nrpep (Perl version of nrpe).
• check_by_ssh (passive).
• check_by_ssh (active).
• nrpep: Nagios remote plugin execution (Perl).
• Easy to use once set up.
• Uses MD5 and TripleDES.
• Scales reasonably well for a large number of hosts.
• Requires remote root access to set up.
Nagios
• check_by_ssh (passive).
• Easy to use once set up.
• sshd is already running in most places.
• Requires a crontab entry to push data to the server.
• Scales reasonably well for a large number of hosts.
• check_by_ssh (active).
• Easy to use once set up.
• sshd is already running in most places.
• Does not scale well for a large number of hosts.
Nagios
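In the passive case, the remote host runs the check itself (from cron) and pushes the result to the server. A minimal sketch, assuming the pushed result ends up as a `PROCESS_SERVICE_CHECK_RESULT` line in the Nagios external command file; the helper name and the example host and service names are hypothetical.

```python
import time


def passive_result_line(host: str, service: str, code: int, output: str,
                        timestamp=None) -> str:
    """Format one passive service check result as a Nagios external command."""
    if code not in (0, 1, 2, 3):
        raise ValueError(f"invalid Nagios return code: {code}")
    if timestamp is None:
        timestamp = int(time.time())
    # The command is line-oriented, so plugin output must fit on one line.
    one_line = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{one_line}"
```

A cron job on each node could generate such a line for every local check and deliver it to the server over ssh, which is why this mode needs a crontab entry but spares the server from opening a connection per check.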
• The current iVDGL Nagios implementation for the SuperComputing demo consists of a star topology.
• One Nagios server.
• Using check_by_ssh (passive).
• Does not scale well.
• Quick and dirty demo installation.
tp://datatag-nagios.pi.infn.it
• Proposed persistent GOC Nagios infrastructure.
• Run a Nagios server at the gatekeeper of each cluster.
• Gatekeeper Nagios only responsible for local site.
• Aggregate summary information and send to regional Nagios server.
• GOC maintains Meta Nagios with grid health status.
Nagios
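The proposed hierarchy (gatekeeper, regional server, GOC meta server) amounts to passing summaries upward instead of raw check results. A minimal sketch of that roll-up, assuming each tier reports a simple count of services per state; the function names and data layout are hypothetical.

```python
from collections import Counter


def summarize_site(service_states: dict) -> Counter:
    """Gatekeeper tier: count local services per state, e.g. {'OK': 5, 'CRITICAL': 1}."""
    return Counter(service_states.values())


def roll_up(summaries) -> Counter:
    """Regional or meta tier: add up the summaries received from the tier below."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total
```

Each tier only handles the summaries of its children, so the meta server's load grows with the number of regions rather than the number of hosts.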
• Ganglia provides a complete pseudo real-time monitoring and execution environment.
• Ganglia provides a mechanism to link not only the nodes of a cluster but also entire clusters to one another.
• The Ganglia Monitoring Daemon (gmond) is a multithreaded daemon that runs on each node that you want to monitor.
• Ganglia Meta Daemon (gmetad) allows you to monitor clusters.
• The Ganglia web front end uses PHP and RRDTool.
• Ganglia is freely available at http://ganglia.sourceforge.net
Ganglia
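gmond publishes the cluster state as an XML document to any client that connects to its TCP port (8649 by default), which makes it straightforward to pull Ganglia data into other tools. The sketch below assumes the usual `HOST` and `METRIC` elements with `NAME` and `VAL` attributes; the function names are hypothetical and the host in the usage note is a placeholder.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host: str, port: int = 8649, timeout: float = 5.0) -> bytes:
    """Read the XML dump that gmond sends on connect, until it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def parse_gmond_xml(xml_data) -> dict:
    """Return {host_name: {metric_name: value_string}} from a gmond XML dump."""
    root = ET.fromstring(xml_data)
    hosts = {}
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.findall("METRIC")}
        hosts[host.get("NAME")] = metrics
    return hosts
```

For example, `parse_gmond_xml(fetch_gmond_xml("gatekeeper.example.org"))` would return the metrics of every node known to that gmond; values are left as strings because metric types vary.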
• Ganglia has been modified to provide VO-centric reporting.
• Standard Ganglia does not provide layered reporting.
• VO-centric Ganglia has the following features:
• Monitoring of host resources (processor load, memory load, disk load, etc.)
• Simple plugin design that allows users to easily develop their own service checks (included from the standard version)
• Grid and VO related sensors
• Publishing/Retrieving summary information to third parties
• Optional SSL-enabled communication (meta-daemons and web-interface)
• MDS interface for collecting list of reporting nodes
• Optional web interface for viewing current network status, notification and problem history, log file, etc.
• Interface with Nagios(TM)
• Developed by Catalin Lucian, University of Chicago (http://people.cs.uchicago.edu/~cldumitr).
Ganglia
• Grid centric information can be obtained from the MDS.
• There are a couple of good LDAP web interfaces.
• LDAPExplorer, http://igloo.its.unimelb.edu.au/LDAPExplorer.
• John’s LDAP Web interface, http://ldapweb.sourceforge.net.
• There are a number of Perl modules for LDAP:
• http://perl-ldap.sourceforge.net
• The key to extracting information is understanding the schema.
• Find out who is responsible for the schema and take an active role in its development.
• Always build dynamic search query tools.
• Learn to use ldapsearch and grid-info-search.
LDAP tools
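A dynamic search query tool boils down to composing LDAP filter strings from parameters rather than hard-coding them. The sketch below builds a filter with the standard LDAP filter escaping and assembles an `ldapsearch` command line using its `-x`, `-H`, and `-b` options; the helper names are hypothetical, and any host, port, base DN, or attribute name passed in must come from the actual MDS schema in use.

```python
def escape_filter_value(value: str) -> str:
    """Escape the characters that are special inside an LDAP filter value."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def build_filter(attrs: dict) -> str:
    """AND together attribute=value pairs; an empty dict matches every entry."""
    parts = [f"({name}={escape_filter_value(str(value))})" for name, value in attrs.items()]
    if not parts:
        return "(objectClass=*)"
    if len(parts) == 1:
        return parts[0]
    return "(&" + "".join(parts) + ")"


def ldapsearch_args(host: str, port: int, base_dn: str, attrs: dict) -> list:
    """Build an argument list suitable for subprocess.run (not executed here)."""
    return ["ldapsearch", "-x", "-H", f"ldap://{host}:{port}", "-b", base_dn,
            build_filter(attrs)]
```

Because the filter is generated from a dictionary, the same tool keeps working when the schema gains or renames attributes; only the caller's parameters change.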
GOC staff are being presented with a new set of challenges. New tools are being developed to meet these challenges. A combination of new and old tools is required to monitor and troubleshoot grid issues. Future GOC staff and “Grid Engineers” will need a broad skill set in order to be effective.
• There are many other grid and cluster monitoring packages:
• MonALISA, GOSSIP, Gridview, etc.
• There are many network monitoring packages:
• MRTG
• SNAPP and other RRDTool collectors.
• Netflow tools.
• Weather Map software.
• OCxMON.
• PingER.
GOC and other tools
Questions and discussion
John Hicks
Indiana University