Page 1: NGOP Overview

NGOP Overview

Jim Fromm
Farms and Clustered Systems Group
Computing Division
Fermilab

Page 2: NGOP Overview

November 2, 2000 http://www-isd.fnal.gov/ngop 2

People

Integrated Systems Development Department: Don Petravick, Krzysztof Genser, Jim Fromm, Tanya Levshina, Igor Mandrichenko, Terry Jones

Operating Systems Support Dept.: Troy Dawson, Lisa Giachetti, Ken Schumacher, Marc Mengel

Computing Services Dept.: Jeff Mack, Rick Thies, Rich Thompson

Page 3: NGOP Overview

Goals

The NGOP working group is charged with developing a Distributed Management System (DMS) that will scale to the anticipated requirements of the Run II farms.

The future size of the farms requires the DMS to be proactive: the system should take corrective action when possible.

- Must detect hardware, system, and application problems.
- Problem diagnostics should eliminate “noise”, or false alarms.
- Should provide tools for performance analysis.

Page 4: NGOP Overview

NGOP History

- Summer 1999: NGOP group created to gather requirements for a Distributed Management System capable of efficiently monitoring the Fermilab computing facility for Run II.
- Sept 1999: Requirement gathering completed.
- Dec 1999: Evaluation of available products presented.
- Jan 2000: Decision made to develop a custom DMS.
- Today: Development of the prototype is underway. Completion is expected before year end.

Page 5: NGOP Overview

We are not alone…

As computer farms get larger, other HEP sites are looking at a similar problem.

- March 2000: CERN and BNL visited Fermilab to exchange ideas and lessons learned. SLAC, JLAB, and IN2P3 participated via video conference.
- July 2000: Fermilab visited CERN to follow up on the March meetings.

Page 6: NGOP Overview

Some Terminology

A Monitored Object is one of the following:
- Host: A computer identified by its full domain name.
- Cluster: A collection of hosts.
- Component: An atomic element that has a well-defined behavior.
- System: A collection of components.

Condition: A pre-defined state of a Monitored Object.

Event: A description of a detected condition.

Action: An activity initiated by the NGOP system based on an event.

Alarm: An asynchronous indicator initiated by NGOP.

Status: Shows the level of the monitored element’s “functionality”.

Monitoring Agent: A software component that generates events based on conditions and performs actions.
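The terms above map naturally onto a small data model. The following is only an illustrative sketch in Python: the class and field names (`fqdn`, `source`, `condition`, `active`) and the example host names are assumptions made for this example, not part of the NGOP design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    """A computer identified by its full domain name."""
    fqdn: str


@dataclass
class Cluster:
    """A collection of hosts."""
    name: str
    hosts: List[Host] = field(default_factory=list)


@dataclass
class Event:
    """A description of a detected condition on a monitored object."""
    source: str          # name of the monitored object (hypothetical field)
    condition: str       # name of the pre-defined condition (hypothetical field)
    active: bool = True  # False once the condition is no longer satisfied


# Hypothetical example: a two-host cluster and one event raised on its first host.
farm = Cluster("farm-a", [Host("node1.example.org"), Host("node2.example.org")])
event = Event(source=farm.hosts[0].fqdn, condition="tmp_full")
```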

Page 7: NGOP Overview

NGOP Requirements

Essential Features:
- Should detect hardware, network, system, and application errors.
    - System daemon status (inetd, mbatchd)
    - Unreachable hosts
    - Security breaches
    - /tmp full
- Should run on all Fermilab-supported operating systems.
- Scalable to 1000s of hosts.
- Must be multi-user and must support different authorization levels.
- Provide an interface for user-written monitoring tools.
- Generate different levels of alarms (Warning, Info, etc.).
- Perform actions based on alarms and events (email, page, restart daemon).
- Provide a hierarchical view of the monitored system.
- Dynamic configuration.
- Provide monitoring capabilities via a web browser, GUI, and command line interface.
- Provide special states for monitored objects, such as “known bad”.

Desirable Features:
- Ability to have overlapping clusters.
- Ability to generate reports based on selection criteria.
- Implement step-by-step notification of performed actions.

Page 8: NGOP Overview

Products Evaluation

Some Evaluated Products:
- Patrol
    - Not scalable for centralized monitoring
    - One level of hierarchy
    - No overlapping clusters
    - No filtering of events
    - No GUI/UI
- Tkined/Scotty
    - Not scalable for multiple users
    - System monitored only while GUI running
    - Only one level of alarms
- Nocol
    - No notion of hierarchy or clusters
    - Web and “GUI” (curses) interfaces have limited customization
    - Very limited filtering of events
- Netlogger
    - Limited off-the-shelf functionality
    - No customization for monitoring agents
    - Very limited way to create hierarchy
    - Requires too much knowledge of the underlying system to detect a problem
- Misc Commercial Products
    - Complex
    - Did not meet requirements
    - Very expensive, both in terms of licensing and setup costs

Page 9: NGOP Overview

Product Evaluation Summary

Many commercial and open-source products try to solve the problem in many different ways.

None of the evaluated products met the basic requirements at Fermilab.

Discussions with others who chose the commercial route were not encouraging; many bad experiences were documented.

The decision was made to develop our own custom DMS.

Page 10: NGOP Overview

Design Summary – Key System Components

Monitoring Agent (MA): Monitors a monitored object and generates events based on certain conditions.

Sensor Agent: Similar to a monitoring agent, but this process collects performance data and generates events at a higher rate than a monitoring agent.

NGOP Central Server (NCS): The central daemon process that gathers events from MAs, provides users with requested information, and dumps persistent data into the Archive Server.

NGOP Configuration File Management Service: Provides a mechanism to centrally locate system configuration and rules. Allows for dynamic reconfiguration of the system.

Archive Server: A daemon that handles archive storage. Provides a means to write, read, and query the data.

Monitoring Client: Communicates with the NCS using an API to display system status in a meaningful manner.

Page 11: NGOP Overview

NGOP Architecture

[Figure: NGOP architecture diagram. Labels recoverable from the slide: Central Server; Configuration File Management Service; Persistent Config. Data; Archive Service; Archive; Performance Storage Service; Performance Data; Data Analyzer; Cluster A; Cluster B; Cluster B1; Cluster B2; monitoring agents (MA); sensor agents (s); Administrator; Monitor; Report Generator; Router; Action Client. Legend: Monitored Objects (Host, Element, Cluster, System); NGOP Components (Sensor Agent, Monitoring Agent, Server, Monitoring Clients, Data Storage); Connections (TCP connection; UDP between Monitored Element and MA; not implemented in prototype yet).]

Page 12: NGOP Overview

Monitoring Agents – The hook into NGOP

A monitoring agent (MA) is a process that monitors an object and generates an event when a condition is met. A message describing the event is sent to the NGOP Central Server (NCS).

NGOP defines the protocol used to exchange information with the central server.

A set of basic MAs will be deployed with the NGOP system; users are free to write their own.

An API (C, C++, Perl, Python) will be provided to allow for development of MAs.

An MA should send information to the NCS:
- When the current characteristics of a monitored object meet a condition.
- When the condition is no longer satisfied.
- Periodically, as heartbeat messages that let the NCS know the MA is still alive.

Examples:
- Monitor whether or not a batch system is running.
- Monitor the size of a file system, issuing alarms when it is 90% full.
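The three reporting cases and the file system example can be sketched as follows. This is not the NGOP API or wire protocol, which the slides do not show: the class `FileSystemAgent`, the message dictionaries, the `send` callback, and the 60-second heartbeat interval are all assumptions made for illustration. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
import time
from typing import Callable, Optional


def disk_fraction_used(path: str = "/") -> float:
    """Fraction of the file system at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


class FileSystemAgent:
    """Hypothetical MA: reports when a file system crosses a fullness threshold."""

    def __init__(self, send: Callable[[dict], None],
                 read_fraction: Callable[[], float] = disk_fraction_used,
                 threshold: float = 0.90,
                 heartbeat_interval: float = 60.0) -> None:
        self.send = send                    # stand-in for the link to the NCS
        self.read_fraction = read_fraction
        self.threshold = threshold
        self.heartbeat_interval = heartbeat_interval
        self._condition_met = False
        self._last_heartbeat: Optional[float] = None

    def poll(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        met = self.read_fraction() >= self.threshold
        # Send an event only on a transition: condition met, or no longer satisfied.
        if met != self._condition_met:
            self._condition_met = met
            self.send({"type": "event", "condition": "fs_full", "active": met})
        # Send a heartbeat on the first poll and then once per interval.
        if (self._last_heartbeat is None
                or now - self._last_heartbeat >= self.heartbeat_interval):
            self._last_heartbeat = now
            self.send({"type": "heartbeat", "time": now})


if __name__ == "__main__":
    agent = FileSystemAgent(send=print)
    agent.poll()  # prints a heartbeat, plus an event if "/" is at least 90% full
```

Reporting only transitions, rather than every poll result, keeps the message volume per agent low, which matters if one central server is expected to handle thousands of agents.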

Page 13: NGOP Overview

Sensor Agents

Sensor Agents send performance data to the Performance Storage Service.

The rate of this data is expected to be much higher than that of the MAs.

Examples:
- Monitor the temperature of a computer every second.
- Monitor the CPU utilization continuously.
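A fixed-rate sampling loop of the kind a sensor agent might run can be sketched as below. The function names `sample` and `load_average` are hypothetical, and the 1-minute load average is used only as a convenient stand-in for CPU utilization; `os.getloadavg` is a real call but is available on Unix only.

```python
import os
import time
from typing import Callable, List, Tuple


def sample(read: Callable[[], float], count: int, interval: float,
           sleep: Callable[[float], None] = time.sleep) -> List[Tuple[float, float]]:
    """Take `count` readings, `interval` seconds apart, as (timestamp, value) pairs."""
    samples = []
    for i in range(count):
        samples.append((time.time(), read()))
        if i < count - 1:
            sleep(interval)  # no sleep after the final reading
    return samples


def load_average() -> float:
    """1-minute load average (Unix only), a rough proxy for CPU utilization."""
    return os.getloadavg()[0]


# Example use on a Unix host: one reading per second for ten seconds.
# readings = sample(load_average, count=10, interval=1.0)
```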

Page 14: NGOP Overview

NGOP Central Server

The NCS is the process that receives messages sent from MAs, stores them via the Archive Server, and provides monitoring clients (a GUI, for example) with requested information.

- One instance of the NCS will be running in the system.
- The NCS must handle many (10,000+) MAs and ~50 clients.
- The NCS should:
    - Update object characteristics when an MA reports a change.
    - Determine if an MA is dead, and forward this information to the relevant monitoring client.
    - Forward event and action messages to the Archive Server.
    - Forward event messages to subscribed monitoring clients.
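The bookkeeping described above can be sketched in a few lines. This is a single-process illustration only, not the NCS implementation: the class `CentralServer`, its method names, the message dictionaries, the in-memory `archive` list standing in for the Archive Server, and the heartbeat timeout value are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class CentralServer:
    """Hypothetical sketch: track agent liveness and forward events to subscribers."""

    def __init__(self, heartbeat_timeout: float = 180.0) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen: Dict[str, float] = {}
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.archive: List[dict] = []  # stand-in for the Archive Server

    def subscribe(self, agent_id: str, callback: Callable[[dict], None]) -> None:
        """Register a monitoring client for events from one agent."""
        self.subscribers[agent_id].append(callback)

    def receive(self, agent_id: str, message: dict, now: float) -> None:
        """Handle a message from an MA; any message counts as a sign of life."""
        self.last_seen[agent_id] = now
        if message["type"] == "event":
            self.archive.append(message)
            for callback in self.subscribers.get(agent_id, []):
                callback(message)

    def dead_agents(self, now: float) -> List[str]:
        """Agents that have been silent for longer than the heartbeat timeout."""
        return sorted(agent for agent, seen in self.last_seen.items()
                      if now - seen > self.heartbeat_timeout)
```

A real server handling 10,000+ agents would also need non-blocking network I/O and persistent storage, which this sketch leaves out.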

Page 15: NGOP Overview

NGOP Configuration File Management Service

Responsible for providing a central repository for system configuration and monitoring rules.

- Allows for dynamic reconfiguration of the system.
- Configuration files are written in XML.
- The central repository is implemented using CVS in the prototype.
- Only authorized users can update it.
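The slides do not show the configuration schema, so the XML below is invented purely to illustrate how a client might read such a file with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`cluster`, `host`, `monitor`, `agent`, `threshold`) and the host names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real NGOP schema is not given in the slides.
CONFIG = """
<cluster name="farm-a">
  <host name="node1.example.org">
    <monitor agent="os_health" filesystem="/tmp" threshold="90"/>
    <monitor agent="ping"/>
  </host>
  <host name="node2.example.org">
    <monitor agent="ping"/>
  </host>
</cluster>
"""


def load_config(text: str) -> dict:
    """Map each host name to the list of agent names configured for it."""
    root = ET.fromstring(text.strip())
    return {
        host.get("name"): [m.get("agent") for m in host.findall("monitor")]
        for host in root.findall("host")
    }


config = load_config(CONFIG)
```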

Page 16: NGOP Overview

Rules

Rules define the status and the alarm level associated with monitored objects.

Rules describe the condition that must be satisfied for a monitored object to have a given status and alarm level.

Master rules are stored in the Configuration File Management Service (CFMS).

Users can create their own rules and store them locally. Users with permission can store these rules in the CFMS.

Dependency rules are a mechanism to filter out noise. For example, a batch system can be dependent on the power supply. If the power goes out on a machine, the fact that the batch system is down will not be raised as an alarm.

Alarm/Action rules define the condition that will cause an alarm/action to be performed.
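The dependency-rule idea can be illustrated with a small filter. This is a sketch under stated assumptions, not the NGOP rule engine: the function `filter_alarms`, the `depends_on` mapping (each component has at most one dependency), and the component names are hypothetical.

```python
from typing import Dict, Set


def filter_alarms(down: Set[str], depends_on: Dict[str, str]) -> Set[str]:
    """Keep only failures that are not explained by a failed dependency."""

    def suppressed(component: str) -> bool:
        visited = set()
        parent = depends_on.get(component)
        # Walk up the dependency chain; `visited` guards against cycles.
        while parent is not None and parent not in visited:
            if parent in down:
                return True
            visited.add(parent)
            parent = depends_on.get(parent)
        return False

    return {c for c in down if not suppressed(c)}


# The batch system depends on the power supply (example from the slide).
DEPENDS_ON = {"batch_system": "power_supply"}
```

With this mapping, a power failure that also takes down the batch system yields a single alarm for the power supply, while a batch system failure on its own is still reported.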

Page 17: NGOP Overview

Monitoring Clients

Monitoring clients will be developed with an API that allows determination of the status of each node in a hierarchy, based on rules and current information obtained from the NCS.

- Monitoring clients will initiate action requests.
- Monitoring clients determine the state of the system and monitored elements based on information gathered from the NCS.
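One simple way a client could derive the status of each node in a hierarchy is to propagate the worst child status upward. The slides do not state NGOP's actual rule, so both the worst-of-children policy and the severity ordering below are assumptions; the status names (Good, Warning, Bad, Undefined) are the ones shown in the NGOP Monitor.

```python
from typing import Any

# Assumed severity ordering; the slides only list the status names.
SEVERITY = {"Good": 0, "Undefined": 1, "Warning": 2, "Bad": 3}


def rollup(node: Any) -> str:
    """Status of a node: its own status if it is a leaf, else the worst child status."""
    if isinstance(node, str):
        return node              # leaf: a status string
    if not node:
        return "Undefined"       # container with no children
    return max((rollup(child) for child in node.values()),
               key=SEVERITY.__getitem__)


# Hypothetical hierarchy: nested dicts are clusters, strings are element statuses.
HIERARCHY = {
    "cluster_a": {"node4": "Good"},
    "cluster_b": {
        "b1": {"node1": "Good", "node2": "Warning"},
        "b2": {"node3": "Good"},
    },
}
```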

Page 18: NGOP Overview

Archiver/Performance Storage Service

The Archive/Performance Storage Service (PSS) is responsible for storing and retrieving messages generated by the NGOP system. These messages represent event, sensor, or action data.

Components:
- Archive Server
- Archive Retriever
- Performance Storage Subsystem (PSS)
- PSS Retriever
- Archive Database Interface
- Database (Oracle)
- DBArchiver

The PSS is simply another instance of the Archive Server. Performance data will need to be consolidated.
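The slides do not say how performance data will be consolidated. One common approach, shown here only as an assumed example, is to average the samples that fall within fixed time buckets; the function name `consolidate` and the (timestamp, value) sample format are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Tuple


def consolidate(samples: List[Tuple[float, float]],
                bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) samples within fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Key each sample by the start time of the bucket it falls into.
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

For example, four samples taken at 0, 30, 60, and 119 seconds with a 60-second bucket reduce to two averaged points.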

Page 19: NGOP Overview

NGOP Prototype

NGOP prototype development is currently underway. The prototype consists of the following modules:
- NGOP Central Server
- Configuration File Management Service
- Monitoring Agents:
    - OS Health: Monitors specific system daemons, file system existence and size, CPU load, and free memory.
    - Ping Agent: Monitors node reachability.
    - FBSNG Agent: Monitors the FBSNG batch system.
- NGOP Client API
    - Determines the status of each monitored element based on pre-defined rules and current information received from the NGOP Central Server.
- NGOP Monitor
    - Graphical representation of monitored element status.
    - Provides a means to see and acknowledge events and alarms that have occurred.
    - Provides limited configuration options.
- Archive Server
    - Stores event and action messages to local disk.
    - The Archive Database Interface moves the messages from local disk to an Oracle database.
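The two-stage archive path (local disk first, database second) can be sketched as follows. This is not the prototype's code: SQLite from the Python standard library is used here purely as a stand-in for the Oracle database, and the function names `spool` and `drain`, the JSON-lines spool format, and the `messages` table are assumptions.

```python
import json
import os
import sqlite3


def spool(path: str, message: dict) -> None:
    """Stage 1: append a message to a local spool file, one JSON document per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def drain(path: str, db: sqlite3.Connection) -> int:
    """Stage 2: move spooled messages into the database; return how many were moved."""
    db.execute("CREATE TABLE IF NOT EXISTS messages (kind TEXT, body TEXT)")
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        rows = [(json.loads(line)["type"], line.strip()) for line in f if line.strip()]
    db.executemany("INSERT INTO messages VALUES (?, ?)", rows)
    db.commit()
    os.remove(path)  # remove the spool file only after the commit succeeds
    return len(rows)
```

Writing to local disk first means the archive can keep accepting messages while the database is slow or unavailable.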

Page 20: NGOP Overview

NGOP Monitor

[Screenshot: NGOP Monitor. Callouts: status levels (Bad, Warning, Good, Undefined); event description; alarm.]

Page 21: NGOP Overview

NGOP Monitor (event acknowledgment, known-status modification…)

[Screenshot: NGOP Monitor. Callout: Monitored Element Info.]

Page 22: NGOP Overview

NGOP Monitor (Configuration Options)

[Screenshots: NGOP Monitor configuration dialogs. Callouts: default icons for known object types; selecting elements for top level display; default colors for status representation.]

Page 23: NGOP Overview

Summary

Building a DMS is a complex problem. Various commercial and open source systems were analyzed.

None met the basic requirements for the NGOP project at Fermilab.

A prototype system is under development. See http://www-isd.fnal.gov/ngop for project details.