a management interface for distributed fault tolerance corba services

10

Upload: independent

Post on 06-Mar-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

From: Proceedings of the Third IEEE International Workshop on Systems Management, Newport, RI, USA,April 1998.A Management Interface for Distributed Fault Tolerance CORBAServices�J�urgen Sch�onw�alder,y Sachin Garg,z Yennun Huang,zAad P. A. van Moorselz and Shalini Yajnikzy TU BraunschweigB�ultenweg 74/7538104 Braunschweig, [email protected]://www.ibr.cs.tu-bs.de/~schoenwz Bell Laboratories ResearchLucent Technologies600 Mountain Ave., Murray Hill, NJ 07974, USAsgarg,yen,aad,[email protected]://www.bell-labs.com/~sgarg,~yen,~aad,~shaliniABSTRACTCORBA is becoming an increasingly importantmiddleware platform for distributed software ap-plications in areas such as telecommunications.The DOORS fault-tolerance service is a CORBAservice that adds fault tolerance to CORBA ap-plications. In this paper we discuss the bene-�ts of adding a management interface to ser-vices like DOORS. We design this interface andprovide an implementation based on SNMP. Themanagement interface collects and displays dataabout DOORS, and feeds back information aboutthe underlying computing platform to DOORS,to improve its decision making. In the design,we clearly delineate DOORS and the managementcomponents, and provide an extended agent to al-low communication between CORBA and SNMP.We will discuss the interaction between CORBAand SNMP in detail, and will provide a MIB spec-ifying an information base for DOORS.�This work was carried out when J�urgen Sch�onw�alderwas consultant at Bell Labs Research in Murray Hill, sum-mer 1997.

I IntroductionAs applications are moving towards moreobject-oriented distributed platforms, distributedobject middleware technologies like the CommonObject Request Broker Architecture (CORBA)[8] are becoming increasingly popular. Sev-eral projects in �nancial, telecommunications andhealth industry currently take advantage of thedistributed object-oriented programming envi-ronment o�ered by the CORBA standards.Although the CORBA middleware eases thedevelopment of distributed applications, the cur-rent version of the CORBA standards do notaddress the reliability and availability require-ments found in many applications, especially inthe telecommunications world. In order to im-prove the reliability and availability of applica-tions, some researchers have implemented ObjectRequest Brokers (ORBs) based on the conceptof group communication and virtual synchrony[5, 6]. At Bell Laboratories we have taken a dif-ferent, service-based, approach by extending theexisting set of CORBA services with a fault toler-ance service, called Distributed Object-Oriented

Reliable Service (DOORS) [1]. The DOORS sys-tem is implemented as a collection of interactingCORBA objects, that detect CORBA object fail-ures and host failures, and recover CORBA ob-jects gracefully from such failures. An applicationdeveloper who wants to improve the reliability ofan application uses the DOORS service to imple-ment fault-tolerant CORBA objects.In this paper we introduce DoorMan, a man-agement interface to DOORS. The primary taskof DoorMan is to monitor DOORS as wellas its underlying computing system. Monitor-ing DOORS is important to visualize, under-stand and �ne-tune the functioning of DOORS,while monitoring of the underlying system al-lows DOORS to tolerate failures using the failure-prevention approach, e.g., it can take correctiveaction and migrate CORBA objects if it detectsthat an object's local host is suspected to crashsoon [3].In the design of DoorMan we carefully delin-eate DOORS and DoorMan, each having its ownfunctionality and software modules. DoorMandoes not change or take over any of the functionsof DOORS, nor does it assume any responsibil-ity about decisions related to the fault-tolerancemechanisms. As a consequence, DOORS per-forms its intended functions even in the absenceof DoorMan. Furthermore, the delineation allowsto implement DoorMan using o�-the-shelf man-agement components, which come with the addedadvantages of open and standardized technology.Our implementation is based on the Simple Net-work Management Protocol (SNMP) [12, 14].The main focus in the design of DoorMan liesin the interfacing between the CORBA world ofDOORS and the SNMP world of DoorMan. Tothat end, we develop a gateway formed by an ex-tended agent, that is able to bind with the `Repli-caManager.' (The ReplicaManager is one of theDOORS objects, as explained in Section II.B.) Inour design, adaptations to DOORS to facilitatethe management interface are restricted to theReplicaManager object. To specify the informa-tion that DoorMan can request from DOORS, wede�ne a DOORS management information base

(the DOORS MIB). Vice versa, for the informa-tion DoorMan feeds back to DOORS, we discussthe appropriate abstraction level to bridge thegap between the detailed view of the manager andthe focused view of DOORS.The proposed management interface may haveapplications beyond DOORS. In general, addingseparate management utilities to CORBASer-vices seems a natural, exible and potent ap-proach to improve the operation of the service.Doing this results in a three-step architecture:the base CORBA ORB, a CORBAService, andthe service management. This paper will try toillustrate the appropriateness of such a design forthe DOORS reliability service. It should be notedthat the approach we follow is fundamentally dif-ferent from the integration between CORBA andSNMP in the CORBA-SNMP gateway [7] andthat proposed in the Joint Inter-domain Manage-ment working group, a joint activity of The OpenGroup and the Network Management Forum [10].There, a mapping between CORBA and SNMPis de�ned that allows developers to implementSNMP agents and management applications inCORBA. In doing so, SNMP management appli-cations can be implemented using CORBA with-out much SNMP knowledge. In our approach,however, a CORBAService can be managed byan SNMP management application with minorknowledge about CORBA.This document looks at the management as-pects of the Distributed Object-Oriented Reli-able Service for CORBA. Section II gives a shortoverview of CORBA and the functionality andthe components that make up DOORS. SectionIII describes how DOORS can be extended sothat management systems can monitor its op-eration, as well as how the DOORS system cantake advantage of the information available froma management system. Section IV discusses im-plementation details and Section V states conclu-sions and lists some topics for future work.II CORBA and DOORSIn this section, we will give some basic con-cepts behind CORBA and then brie y describethe design and functionality of DOORS.

A CORBACORBA is an emerging object middlewaretechnology which supports the development ofdistributed object-oriented applications. It isan architectural standard proposed by a consor-tium of industries called the Object ManagementGroup (OMG) and its operational model is basedon RPC-style client-server communication. Thecore of the CORBA architecture is the ObjectRequest Broker (ORB). The ORB acts as the ob-ject bus which locates the server for a client re-quest and maintains the communication betweenthe server and the client. The ORB providesserver activation and server location transparencyto the client. Various vendors have implementedthe CORBA standards and there are a numberof CORBA products, e.g. IONA's Orbix [4],Visigenic's VisiBroker, in the market today. Inthis paper, we will not go into the details of theCORBA architecture. Interested readers are re-ferred to the vast literature present in the area[4, 8, 9].The CORBA standards also propose additionalservices, called CORBAServices [9], built andsupported on top of the ORB. These services,like Naming Service, Trader Service, LifeCycleservice, provide some useful functionality whichcan be used by all types of CORBA applica-tions. DOORS (Distributed Object-Oriented Re-liable Service) [1], developed at Bell Laboratories,is one such CORBA service.B DOORSDOORS provides a means for an applicationdeveloper to enhance the availability and reliabil-ity of his/her application built on top of CORBA.DOORS adds embedded fault-tolerance supportto CORBA objects in order to tolerate hostcrashes, object crashes and object hangs [1]. Ituses replication of objects as the mechanism tosupport high availability. DOORS provides fail-ure detection, failure recovery, and replica man-agement at the object level. An application de-veloper has the exibility to choose the degree ofreliability and availability by selecting from an ar-ray of choices for replication strategies, degree ofreplication, detection mechanisms, and recovery

strategies that are best suited to the applicationobject at hand. The DOORS interface providesmethods through which application objects canregister their reliability requirements and thenDOORS takes actions based on the options spec-i�ed during registration.There are three main modules in DOORS,which work together to provide the functional-ity described above - WatchDog, SuperWatchDogand the ReplicaManager. We discuss in brieftheir individual functionality and their interac-tions with each other. The modules are shown inFigure 1.� The WatchDog (WD) module runs oneach host in the network and detects objectcrashes and hangs for object servers runningon the local host. The WatchDog can usepolling to detect object crashes and heart-beats to detect object hangs. The Watchdogalso performs local recovery actions.� The SuperWatchDog (SWD) module iscentralized and is responsible for detectionof host crashes and hangs. It receives heart-beats at regular intervals from all the Watch-Dogs in a network domain. Host crashes aredetected if heartbeats from a previously reg-istered WatchDog do not arrive in a giventime interval.� The ReplicaManager (RM) module isalso centralized and as the name suggests, itis responsible for management of object repli-cas. An object can register with the Repli-caManager with a requested degree of repli-cation, a given replication style, and a listof possible locations for the replicas. TheReplicaManager manages the initial place-ment and activation of these object replicasand also controls the migration of the repli-cas during object failures. It keeps track ofhow many replicas of an object exist in a net-work domain, on which hosts they are run-ning, the status of each replica, and the num-ber of failures seen by the replica on a givenhost. This information is maintained in atable at the ReplicaManager.

WD

O1 O2 O1

WD SWD WD

O1 2 M1 M3O2 1 M1

M1 M2 M3

PollingHeartbeat

RM

Reporting, Registration

Figure 1: Structure of DOORS

Figure 1 shows how the the modules are usedto control two application objects. Object O1 onhost M1 registers with the ReplicaManager withreplication degree two and with possible hosts M1and M3. RM registers O1 with WDs on M1 andM3. The WD on M3 detects that there is no run-ning copy of O1 on the local host and starts acopy on M3. After activation, the WD watchesover the replica through polling. Similarly, a sin-gle replica of the object O2 is run on M1 andall the replicas are watched over by their localWDs through polling. As shown in the �gure,the WatchDogs on each host send heartbeats tothe central SuperWatchDog.The ReplicaManager module maintains all thestate of the DOORS system. The table stored inthe ReplicaManager has the information about allthe objects and their replicas in the system, theirregistration options, their location, their status,the number of times they have failed on a particu-lar host, etc. The information in this table will bemonitored by the management system that inter-faces with DOORS. To prevent loss of this statedue to failure of the ReplicaManager, this tableis periodically checkpointed.III DoorMan DesignIn the design of the management interface forDOORS we clearly delineate functionality thatbelongs with DOORS and functionality that be-longs with management. DoorMan operates insuch a way that DOORS does not depend in anyway on the presence of the management system.That is, DOORS is able to function properly evenif no manager is available although its e�ciencyin improving the reliability of distributed objectsmay be reduced. A clear delineation of the man-agement components also allows us to fully de-ploy existing management technology, since thistechnology will not interfere with DOORS. (Webene�t from this in our implementation by adopt-ing SNMP.) The main challenge in the design ofDoorMan then is how it interfaces with DOORS,a topic we will now discuss in detail.

A Interfacing between DOORS andDoorManDoorMan interactswith DOORS bi-directionally. First, DoorManobtains data about DOORS' operation, such asthe names, number and type of registered objects,and the location and status of the WatchDogs andSuperWatchdogs. The obtained data can be usedto analyze DOORS, as well as to display the sta-tus of DOORS in a graphical interface. Secondly,DoorMan collects additional information aboutthe underlying computing platform and feeds itback to DOORS. The data collected can, for in-stance, correspond to the status of the OS re-sources or the communication links.DoorMan interfaces with DOORS only throughthe ReplicaManager. We think this is a natu-ral decision, because, as we have seen in SectionII.B, the ReplicaManager contains a table withall data DoorMan might want to collect about theDOORS fault-tolerance mechanisms. In addition,in this way we localize the changes in DOORS to asingle module, the ReplicaManager. To establishthe information exchange from the ReplicaMan-ager to DoorMan, we add a single method to theReplicaManager's IDL. The method, when called,delivers all information in the ReplicaManager toDoorMan. In Section IV we discuss this in moredetail when introducing the SNMP implementa-tion of DoorMan.B Passing Information from DoorMan toDOORSThe other way around, DoorMan may wantto pass information back to DOORS. Again, in-formation is allowed to be passed only to theReplicaManager, not to any other component ofDOORS. Inevitably, we have to add methods tothe ReplicaManager IDL to be able to receive in-formation from DoorMan, and to act upon it ap-propriately.The ReplicaManager has a view which onlyincludes the components that are involved inDOORS, while the manager has a much more de-tailed view about underlying resources and thecommunication infrastructure. The consequenceof passing information from DoorMan to DOORS

is that the view of DOORS must be enhancedand the management information sent from themanager to DOORS should use an abstractionlevel that matches the decision algorithm usedby the ReplicaManager. We therefore introducea `health' status of resources, and the DoorManmanager abstracts the health status from the col-lected system data, and passes it to the Repli-caManager. (When DoorMan is not present,`health' corresponds to `up' or `down' of objectsand hosts.) The ReplicaManager then determineswhat to do with the information.DoorMan derives the higher-level health in-formation out of the very detailed informationsources. In related research we have identi�edwhich objects are statistically relevant for the`health' of resources [3]. Based on such results,one can select a suitable information base forDoorMan in order to improve the decision makingin DOORS.IV SNMP-Based DoorManImplementationBased on the design choices presented in Sec-tion III, we now discuss the SNMP-based im-plementation of DoorMan. Since the DoorManmanagement component is to a large extent inde-pendent from DOORS, we are able to use stan-dard technology to implement DoorMan. Choos-ing SNMP allows us to use existing agent imple-mentations, the sub-agent development kit avail-able for Solaris machines, as well as the Tnm Tclextension [13]. The DoorMan implementation isfor SUN Solaris, and DOORS is implemented us-ing Orbix (IONA's implementation of CORBA)for Solaris [4].As dictated by SNMP, we develop an agent anda management component. Most interesting isthe agent architecture, which runs on the samehost as the ReplicaManager. In its implementa-tion one must consider how to establish interac-tion between SNMP and CORBA, and what MIB[11] to de�ne for exporting a DOORS-dependentinformation base.A Agent ArchitectureFigure 2 shows the agent architecture for Door-Man. It consists of an SNMP agent, and a

DOORS subagent. The SNMP agent is the oneprovided by SUN for Solaris systems, and for thesubagent implementation (in C) we used the So-laris subagent development kit. The subagent op-erates as a CORBA client, and uses a sub-agentprotocol like AgentX [2] to present the manage-ment information as part of the SNMP manage-ment information base exported by the local sys-tem. Being a CORBA client means that the sub-agent is compiled with the CORBA client libraryand the ReplicaManager's stub, and is able toobtain a reference to the ReplicaManager. Theintroduction of a subagent simpli�es the imple-mentation of management applications, since itprovides transparent access to a DOORS MIBimplementation. In addition, subagents simplifythe security administration of the network man-agement subsystem considerably.The IDL interface of the ReplicaManager hasbeen extended to provide a method call to ex-tract information about registered client objectsand existing replicas. All this information is re-trieved using a single method call, in order to keepthe number of interactions low, and the changesto the ReplicaManager limited. To further limitthe overhead, the sub-agent caches the retrieveddata so that management requests can be satis-�ed from the cache if they come in sequence. Thisalso helps to enhance the consistency of the dataas seen by the manager.The ReplicaManager IDL must also allow thesubagent to feed back information to DOORS,and the ReplicaManager must use this informa-tion to enhance its decision making. The Door-Man manager processes the data collected, even-tually converting to numbers which represent thehealth of the various resources. It requires exper-imental investigation to determine the most suit-able way of converting the detailed data in healthindicators for the system components.B DOORS MIBThe management information exported toDoorMan is de�ned in the DOORS MIB (Fig-ure 3). The DOORS MIB contains a speci�ca-tion of all the information the ReplicaManagerhas about the functioning of DOORS. In Figure

O1 O2

WD

M1

O1

WD

M3

SWD

WD

DOORS SubAgent

SNMP Agent

O1 2 M1 M3O2 1 M1

RM

M2

DOORS Monitoring Interface

DOORS System

Figure 2: SNMP agent and DOORS subagent, inside the host running the ReplicaManager.

Figure 3: Structure of the DOORS MIB.3, the shaded boxes correspond to elements thatbelong to the indexing structure, while the trans-parent boxes correspond to the accessible entriesin the MIB.The DOORS MIB de�nes three tables. The�rst table (rosClientTable)1 lists the client objectsthat make use of DOORS services, and some ofits parameters, like name, replication style, num-ber of replications. As an example, in Figure 2rosClientName corresponds to O1 and O2, wherethe value of rosClientReplDegree for O1 is 2.The second table (rosObjectTable) lists the ob-jects and replicas under control of the Replica-Manager for each client, and contains entries forthe host name, object status, etc. In Figure 2,the value for rosObjectHost for the one instanceof O2 is M1.The third table (rosCompTable) lists all thecomponents that make up DOORS, together withtheir location. For instance, the entry rosComp-Type can take values 1 to 3 depending on whethera component is a WatchDog, a SuperWatchDog,or the ReplicaManager. This table is useful forthe manager to discover the structure of DOORS,1Note that the MIB uses the earlier acronym ROS (Re-liable Object Service) instead of DOORS.

so that it can adjust the monitoring activities tothe most critical components.C ManagerA manager has been implemented which dis-plays the DOORS MIB data, as well as othersystem data collected, in a graphical user inter-face (Figure 4). The implementation uses theTcl/Tk extension provided by the Tnm manage-ment package [13]. The Tnm Tcl extension allowsfast development of a management front-end, in-cluding a graphical interface. Figure 4 showsDOORS running for two CORBA clients, \grid"and \grid2". The top frame shows data collectedabout the clients, such as their name, and repli-cation and switch style. The frame below thatshows information about the object instances. Inthis case both clients have a single instance run-ning, and it can be seen that \grid2" experiencedone failure. Finally, the large frame in the man-ager interface continuously displays data that isbeing collected.The DoorMan manager uses a completelyevent driven approach, where low-level monitor-ing functions generate internal events that arepassed along the network topology and activateevent bindings. These bindings allow the man-

Figure 4: Screen dump of DoorMan manager application.

ager to analyze the event in the current contextand generate other events if some conditions aremet. Hence, higher level management informa-tion can be computed by passing events througha set of event bindings.V Conclusions and Topics for FurtherResearchThis paper discusses the design and implemen-tation of DoorMan, a management interface forDOORS. DoorMan allows for monitoring the op-eration of DOORS, as well as for feeding back toDOORS information abstracted from the under-lying computing platform. DoorMan is based onSNMP, which allows us to use o�-the-shelf agentand management application software.Most critical in the design of DoorMan is theinterface between SNMP and CORBA/ DOORS.We designed a subagent that can bind with theDOORS ReplicaManager, and we speci�ed a MIBfor DOORS. The ReplicaManager's IDL has beenextended to allow the sub-agent to collect data,and to allow it to pass information back.The current implementation of DoorMan al-lows future experimenting with di�erent manage-ment schemes. Di�erent approaches are possiblewhen determining which data to collect about theunderlying computing platform. To determinethis, we can make use of insights we obtainedfrom the statistical analysis of the correlation be-tween measured objects and system failures [3].Furthermore, it is of interest to determine howto extract information about the underlying com-puting system at the desired level of abstraction,and pass it back to DOORS. Di�erent algorithmscan then be developed that use this information.The extraction of information as well as the algo-rithms deal with the trade-o� between simplicityand quality of decision making, which needs fur-ther experimental research to be fully understood.REFERENCES[1] P. E. Chung, Y. Huang, S. Yajnik, D. Liang,and J. Shih, \DOORS: providing fault tolerancefor CORBA applications," Submitted for publi-cation, December 1997.[2] M. Daniele, B. Wijnen, and D. Francisco, \Agentextensibility (AgentX) protocol version 1," In-

ternet Draft <draft-ietf-agentx-ext-pro-04.txt>,Digital Equipment Corporation, IBM, Cisco Sys-tems, June 1997.[3] S. Garg, A. P. A. van Moorsel, K. S. Trivedi,and K. Vaidyanathan, \Age estimation and fail-ure forecasting in software systems using time-series analysis," Submitted for publication, De-cember 1997.[4] IONA Technologies Ltd., \Orbix referenceguide," Technical Report Release 2.1, October1995.[5] S. Ma�eis, \Piranha - a CORBA tool for highavailability," IEEE Computer, April 1997.[6] S. Ma�eis and D. C. Schmidt, \Constructing re-liable distributed communication systems withCORBA," IEEE Communications Magazine, vol.14, no. 2, , February 1997.[7] S. Mazumdar, \Inter-domainmanagement between CORBA and SNMP: Web-based management{CORBA/SNMP gateway ap-proach," in Seventh International Workshop onDistributed Systems: Operations & Management.IFIP/IEEE, October 1996.[8] OMG, \The Common Object Request Broker:Architecture and speci�cation," Technical Re-port Revision 2.0, Object Management Group,1995.[9] OMG, \CORBAServices: Common object ser-vices speci�cation," Technical report, ObjectManagement Group, March 1995.[10] Open Group, \Inter-domain management speci-�cations: Speci�cation translation," PreliminarySpeci�cation P509, Open Group, March 1997.[11] D. Perkins and E. McGinnis, UnderstandingSNMP MIBs, Prentice-Hall, Upper Saddle River,NJ, USA, 1997.[12] M. T. Rose, The Simple Book: An Introductionto Networking Management, Series in InnovativeTechnology, Prentice-Hall, Upper Saddle River,NJ, USA, 2nd edition, 1996.[13] J. Sch�onw�alder and H. Langend�orfer, \Tcl ex-tensions for network management applications,"in Proc. 3rd Tcl/Tk Workshop, pp. 279{288,Toronto (Canada), July 1995.[14] W. Stallings, SNMP, SNMPv2, and CMIP, ThePractical Guide to Network-Management Stan-dards, Addison-Wesley, Reading, MA, USA,1993.