RFDMon: A Real-Time and Fault-Tolerant Distributed System Monitoring Approach

Rajat Mehrotra (Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS)
Abhishek Dubey (Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN)
Jim Kwalkowski, Marc Paterno, Amitoj Singh, Randolph Herber (Fermi National Laboratory, Batavia, IL)
Sherif Abdelwahed (Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS)

Institute for Software Integrated Systems, Vanderbilt University, Nashville, Tennessee, 37203

TECHNICAL REPORT ISIS-11-107
October, 2011



Abstract

In this paper, a systematic distributed event based (DEB) system monitoring approach, "RFDMon", is presented for measuring system variables (CPU utilization, memory utilization, disk utilization, network utilization, etc.), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g. PBS information, MPI variables) accurately, with minimum latency, at a specified rate, and with minimal resource utilization. Additionally, "RFDMon" is fault tolerant (tolerates faults in the monitoring framework), self-configuring (can start and stop monitoring the nodes and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of the execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (computational resources can be limited for monitors), and expandable for adding extra monitors on the fly. Because this framework is built on Object Management Group (http://www.omg.org) Data Distribution Services (DDS) implementations, it can be deployed in systems with heterogeneous nodes. Additionally, it provides functionality to place a maximum cap on the resources consumed by monitoring processes, so that they do not affect the availability of resources for the applications.


1 Introduction

Currently, distributed systems are used for executing scientific applications, in enterprise domains for hosting e-commerce applications, and in wireless networks and storage solutions. Furthermore, distributed computing systems are an essential part of mission-critical infrastructure such as flight control systems, health care infrastructure, and industrial control systems. A typical distributed infrastructure contains multiple autonomous computing nodes that are deployed over a network and connected through a distributed middleware layer to exchange information and control commands [1]. This distributed middleware spans multiple machines and provides the same interface to all the applications executing on the computing nodes. The primary benefit of executing an application over a distributed infrastructure is that different instances of the application (or different applications) can communicate with each other over the network efficiently, while keeping the details of the underlying hardware and operating system hidden from the application behaviour code. Also, interaction in a distributed infrastructure is consistent irrespective of location and time. A distributed infrastructure is easy to expand, enables resource sharing among computing nodes, and provides the flexibility of running redundant nodes to improve the availability of the infrastructure.

Using a distributed infrastructure creates the added responsibility of ensuring consistency, synchronization, and security across multiple nodes. Furthermore, in the enterprise domain (or cloud services), there is tremendous pressure to achieve the system QoS objectives in all possible scenarios of system operation. To this end, an aggregate picture of the distributed infrastructure should always be available for analysis and for providing feedback to compute control commands if needed. The desired aggregate picture can be achieved through an infrastructure monitoring technique that is extensive enough to accommodate system parameters (CPU utilization, memory utilization, disk utilization, network utilization, etc.), application performance parameters (response time, queue size, and throughput), and scientific application data structures (e.g. Portable Batch System (PBS) information, Message Passing Interface (MPI) variables). Additionally, this monitoring technique should be customizable to tune for various types of applications and their wide range of parameters. Effective management of distributed systems requires an effective monitoring technique that can work in a distributed manner, similar to the underlying system, and report each event to the system administrator with maximum accuracy and minimum latency. Furthermore, the monitoring technique should not burden the performance of the computing system and should not affect the execution of its applications. Also, the monitoring technique should be able to identify faults in itself, isolate the faulty component, and correct it immediately for effective monitoring of the computing system. Distributed event based systems (DEBS) are recommended for monitoring and management of distributed computing systems for faults and system health notifications through system alarms. Moreover, DEBS based monitoring is able to provide a notion of system health in terms of fault/status signals, which helps in taking the appropriate control actions to keep the system within its safe boundary of operation.

Contribution: In this paper, an event based distributed monitoring approach, "RFDMon", is presented that utilizes the concepts of Data Distribution Services (DDS) for an effective exchange of monitoring measurements among the computing nodes. This monitoring framework is built upon the ACM ARINC-653 Component Framework [2] and an open source DDS implementation, OpenSplice [3]. The primary principles in the design of the proposed approach are static memory allocation for determinism; spatial and temporal isolation between the monitoring framework and real applications of different criticality; specification of and adherence to real time properties, such as periodicity and deadlines, in the monitoring framework; and well-defined compositional semantics.

"RFDMon" is fault tolerant (tolerates faults in the sensor framework), self-configuring (can start and stop monitoring the nodes and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of the execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (resources can be limited for monitors), expandable for adding extra monitors on the fly, and applicable to heterogeneous infrastructure. "RFDMon" has been developed on the existing OMG CCM standard because it is one of the most widely used component models and software developers are already familiar with it.

"RFDMon" is divided into two parts: (1) a distributed sensor framework that monitors the various system, application, and performance parameters of the computing system and reports all the measurements to a centralized database over an HTTP interface; and (2) a Ruby on Rails [4] based web service that shows the current state of the infrastructure by providing an interface to read the database and display its content in a graphical window in the web browser.

Outline: This paper is organized as follows. Preliminary concepts of the proposed approach are presented in section 2, and issues in monitoring of distributed systems are highlighted in section 3. Related distributed monitoring products are described in section 4, while a detailed description of the proposed approach, "RFDMon", is given in section 5. Details of each sensor are presented in section 6, while a set of experiments is described in section 7. Major benefits of the approach are highlighted in section 8, and applications of the proposed approach are mentioned in section 9. Future directions of the work are discussed in section 10, and conclusions are presented in section 11.

2 Preliminaries

The proposed monitoring approach "RFDMon" consists of two major modules: the Distributed Sensors Framework and the Infrastructure Monitoring Database. The Distributed Sensors Framework utilizes the Data Distribution Services (DDS) middleware standard for communication among the nodes of the distributed infrastructure; specifically, it uses the OpenSplice Community Edition [3]. It executes the DDS sensor framework on top of the ARINC Component Framework [5]. The Infrastructure Monitoring Database uses Ruby on Rails [4] to implement the web service that updates the database with monitoring data and displays the data in the administrator's web browser. In this section, the primary concepts of the publish-subscribe mechanism, OpenSplice DDS, ARINC-653, and Ruby on Rails are presented, which will be helpful in understanding the proposed monitoring framework.

2.1 Publish-Subscribe Mechanism

Publish-Subscribe is a typical communication model for a distributed infrastructure that hosts distributed applications. Various nodes can communicate with each other by sending (publishing) data and receiving (subscribing to) data anonymously through the communication channel specified in the infrastructure. A publisher or subscriber needs only the name and definition of the data in order to communicate. Publishers do not need any information about the location or identity of the subscribers,


Figure 1 Publish Subscribe Architecture

and vice versa. Publishers are responsible for collecting the data from the application, formatting it as per the data definition, and sending it out of the node to all registered subscribers over the publish-subscribe domain. Similarly, subscribers are responsible for receiving the data from the publish-subscribe domain, formatting it as per the data definition, and presenting the formatted data to the intended applications (see Figure 1). All communication is performed through a DDS domain. A domain can have multiple partitions, and each partition can contain multiple topics. Topics are published and subscribed to across the partitions. Partitioning is used to group similar types of topics together; it also provides flexibility to an application for receiving data from a set of data sources [6]. The publish-subscribe mechanism overcomes the typical shortcomings of the client-server model, where clients and servers are coupled together for the exchange of messages. In the client-server model, a client must have information regarding the location of the server to exchange messages; in the publish-subscribe mechanism, clients need to know only the type of the data and its definition.
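The decoupling described above can be illustrated with a minimal in-process sketch (plain Python, not the DDS API; `PubSubDomain` and its methods are invented for illustration):

```python
from collections import defaultdict

class PubSubDomain:
    """Toy model of a publish-subscribe domain.

    Topics live inside named partitions; subscribers register a callback
    for a (partition, topic) pair and receive every sample published
    there without knowing anything about the publisher."""

    def __init__(self):
        self._subs = defaultdict(list)  # (partition, topic) -> callbacks

    def subscribe(self, partition, topic, callback):
        self._subs[(partition, topic)].append(callback)

    def publish(self, partition, topic, sample):
        # Publisher and subscribers stay fully decoupled: delivery is
        # keyed only by the partition and topic names.
        for cb in self._subs[(partition, topic)]:
            cb(sample)

domain = PubSubDomain()
received = []
domain.subscribe("monitoring", "cpu_util", received.append)
domain.publish("monitoring", "cpu_util", {"node": "n1", "util": 0.42})
# received now holds the published sample
```

Note how neither side holds a reference to the other; a real DDS implementation adds network transport, discovery, and QoS on top of this pattern.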

2.2 OpenSplice DDS

OpenSplice DDS [3] is an open source community edition of the Data Distribution Service (DDS) specification defined by the Object Management Group (http://www.omg.org). These specifications are primarily used for the communication needs of distributed real time systems. Distributed applications use DDS as an interface for the "Data-Centric Publish-Subscribe" (DCPS) communication mechanism (see Figure 2). "Data-Centric" communication provides the flexibility to specify QoS parameters for the data depending upon its type, availability, and criticality.

Figure 2 Data Distribution Services (DDS) Architecture

These QoS parameters include the rate of publication, the rate of subscription, the data validity period, etc. DCPS gives developers the flexibility to define different QoS requirements for the data and take control of each message, so that they can concentrate on the handling of data instead of the transfer of data. Publishers and subscribers use the DDS framework to send and receive the data, respectively. DDS can be combined with any communication interface for communication among the distributed nodes of an application. The DDS framework handles all of the communication among publishers and subscribers as per the QoS specifications. In the DDS framework, each message is associated with a special data type called a topic. A subscriber registers to one or more topics of its interest. DDS guarantees that the subscriber will receive messages only from the topics that it subscribed to. In DDS, a host is allowed to act as a publisher for some topics and simultaneously as a subscriber for others (see Figure 3).

Benefits of DDS

The primary benefit of using the DDS framework for our proposed monitoring framework "RFDMon" is that DDS is based upon a publish-subscribe mechanism that decouples the sender and receiver of the data. There is no single point of bottleneck or failure in communication. Additionally, the DDS communication mechanism is highly scalable in the number of nodes and supports auto-discovery of nodes. DDS ensures data delivery with minimum overhead and efficient bandwidth utilization.


Figure 3 Data Distribution Services (DDS) Entities

Figure 4 ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real time operating systems (RTOS) that are used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. One or more partitions are grouped to form a module (i.e. a processor), while one or more modules form a system. These partitions are allocated predetermined chunks of memory.

Spatial partitioning [7] ensures exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in a partition cannot corrupt or destroy the data structures of processes executing in other partitions. This space partitioning is useful for separating the low-criticality vehicle management components from the safety-critical flight control components in avionics systems [2]. Memory management hardware ensures memory protection by allowing a process to access only the part of the memory that belongs to the partition hosting that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule for the use of processing resources by the different partitions. This fixed periodic schedule is generated or supplied to the RTOS in advance for sharing the resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the schedule. Additionally, it guarantees that the partition's execution will be interrupted once its execution interval has finished, that the partition will then be placed into a dormant state, and that the next partition in the schedule will be allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can only be created during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real time properties (priority, periodicity, duration, soft/hard deadline, etc.) of the processes and partitions during their creation. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message passing queue, and blackboards, which allow reading, writing, and clearing of a single data storage. Two different partitions communicate using inter-partition services, which use ports and channels for sampling and queueing of messages. Synchronization of processes belonging to the same partition is performed through semaphores and events [7].
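A rough sketch of the two intra-partition communication primitives, under the simplifying assumption of non-blocking calls (the real ARINC-653 services block or time out, and use different service names):

```python
from collections import deque

class Buffer:
    """Sketch of an ARINC-653-style buffer: a bounded FIFO queue
    for intra-partition message passing."""
    def __init__(self, max_messages):
        self._q, self._max = deque(), max_messages

    def send(self, msg):
        if len(self._q) >= self._max:
            raise OverflowError("buffer full")  # real service would block/time out
        self._q.append(msg)

    def receive(self):
        # FIFO delivery; returns None when empty instead of blocking
        return self._q.popleft() if self._q else None

class Blackboard:
    """Sketch of a blackboard: a single-message slot that writers
    overwrite, readers read without consuming, and anyone can clear."""
    def __init__(self):
        self._msg = None

    def display(self, msg):   # write (overwrites any previous message)
        self._msg = msg

    def read(self):           # non-destructive read
        return self._msg

    def clear(self):
        self._msg = None

buf = Buffer(max_messages=2)
buf.send("a"); buf.send("b")                          # FIFO: "a" comes out first
bb = Blackboard()
bb.display("state=OK"); bb.display("state=DEGRADED")  # latest value wins
```

The contrast is the point: a buffer queues every message in order, while a blackboard only ever holds the most recent one.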

2.4 ARINC-653 Emulation Library

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux based implementation of the ARINC-653 interface specifications for intra-partition process communication, including Blackboards and Buffers. Buffers provide a queue for passing messages, and Blackboards enable processes to read, write, and clear a single message.

Figure 5 Specification of Partition Scheduling in ACM Framework

Intra-partition process synchronization is supported through Semaphores and Events. This library also provides process and time management services as described in the ARINC-653 specification.

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies its rate of execution. The partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the CPU time allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module; in other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions, and its length is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.
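The hyperperiod computation described above is simply a least common multiple over the partition periods; a minimal sketch:

```python
from functools import reduce
from math import gcd

def hyperperiod(periods):
    """Major frame length: the least common multiple of all partition
    periods, so every partition runs a whole number of times per frame."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

# Example (assumed values): partitions with periods of 20, 50, and 100 ms
# share a 100 ms hyperperiod, running 5, 2, and 1 times per major frame.
print(hyperperiod([20, 50, 100]))  # -> 100
```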

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each of which is specified by an offset from the start of the hyperperiod and a duration.

Figure 6 Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e. that all scheduling windows within a hyperperiod can be executed without overlap.
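The validity check the module manager performs can be sketched as follows (a simplified model, not the library's actual code: each scheduling window is an (offset, duration) pair, and validity means containment within the hyperperiod and no overlap between minor frames):

```python
def schedule_is_valid(hyperperiod, windows):
    """Check a module schedule. windows is a list of (offset, duration)
    pairs, one per partition scheduling window."""
    windows = sorted(windows)
    for i, (off, dur) in enumerate(windows):
        # Every window must lie entirely inside the hyperperiod.
        if off < 0 or off + dur > hyperperiod:
            return False
        # After sorting by offset, each window must start at or after
        # the previous window's end, i.e. minor frames never overlap.
        if i > 0:
            prev_off, prev_dur = windows[i - 1]
            if off < prev_off + prev_dur:
                return False
    return True

print(schedule_is_valid(100, [(0, 20), (20, 30), (60, 40)]))  # True
print(schedule_is_valid(100, [(0, 20), (10, 30)]))            # False: overlap
```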

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components:
1. Model: contains the data definitions and manipulation rules for the data. The model maintains the state of the application by storing data in the database.
2. Views: a view contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model. A view never handles the data, but interacts with users for various ways of inputting the data.
3. Controller: a controller contains rules for processing the user input and for testing, updating, and maintaining the data. A user can change the data by using the controller, while views update the visualization of the data to reflect the changes.

Figure 6 presents the sequence of events while accessing a Rails application over the web interface. In a Rails application, an incoming client request is first sent to a router that finds the location of the application and parses the request to find the corresponding controller (a method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, interact with the model if needed, and invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the client's browser.

In the proposed monitoring framework "RFDMon", a web service is developed to display the monitoring data collected from the distributed infrastructure. This monitoring data includes information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. The schema of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an on-line manner, where an extensive list of parameters is monitored and the measurements are utilized by on-line control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behavior of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique can suffer from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues in distributed monitoring concerns understanding observations that are not directly related to the behavior of the system, and measurements that require a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any fault, instability, or illegal behavior in the system under observation due to its implementation, and will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) the monitoring sensors in case of faulty (or seemingly faulty) measurements, and it is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented along with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and openNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems that is easily scalable with the size of the cluster or the grid. Ganglia is built upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multicast based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy to maintain the existence (membership) of lower level modules at the higher level. Various nodes and applications multicast their monitored resources or metrics to all of the other nodes; in this way, each node has an approximate picture of the complete cluster at all instants of time. Aggregation of the lower level modules is done through polling of the child nodes in the tree, and the approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. Two daemons implement the Ganglia monitoring system. The Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report readings and respond to requests sent from a Ganglia client. The Ganglia Meta Daemon ("gmetad") executes one level higher than the local nodes and aggregates the state of the cluster; finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, an aggregate picture of the cluster at each node, sound design principles, easy portability, and easy manageability.


Figure 7 Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is based upon an agent/server architecture, where agents can report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules:
- Scheduler: the administrator component of "Nagios", which checks the plug-ins and takes corrective actions if needed.
- Plug-ins: small modules that are placed on the computing node, configured to monitor a resource, and then send the readings to the "Nagios" server module over the SNMP interface.
- GUI: a web based interface that presents the state of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure that offers a comprehensive and flexible approach to monitoring as per the requirements of the enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMware) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI interface for monitoring servers placed in different geographic locations, with a detailed geographical view of the infrastructure.


The Nimsoft Monitoring Solution [15] (NMS) offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among applications. Applications residing in the infrastructure publish data over the message bus, and subscriber applications for those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a hub) and are attached to the message bus. Monitoring actions are performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed on each managed device. NMS also provides an alarm server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or expandability of the monitoring approach with new parameters in an already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in the accurate monitoring of a computing system with periodic tasks, when the computing system is too busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating for timing jitter without any changes to the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
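One common way to keep a periodic monitor's jitter from accumulating is to release each cycle at an absolute time rather than sleeping a fixed interval after the work finishes. This is a simplified sketch of that idea, not the feedback controller of [18]:

```python
import time

def run_periodic(period_s, iterations, work):
    """Drift-free periodic loop: sleep until the next absolute release
    time (start + k*period). A late wakeup shortens the next sleep
    instead of shifting every subsequent cycle."""
    next_release = time.monotonic()
    for _ in range(iterations):
        work()
        next_release += period_s
        delay = next_release - time.monotonic()
        if delay > 0:  # if we overran the slot, start the next cycle at once
            time.sleep(delay)

stamps = []
run_periodic(0.01, 5, lambda: stamps.append(time.monotonic()))
# stamps are spaced close to the 10 ms period, regardless of work time
```

The key design choice is that `next_release` is advanced from its previous ideal value, never from the measured wakeup time, so the long-run sampling rate stays correct.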

The primary goal of this paper is to present an event based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the Data Distribution Service (DDS) methodology to report events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and a duration; these attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured with real time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. These are lightweight processes that monitor a device on the computing node and read it periodically or aperiodically to obtain measurements. The sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (structure definition) and the key values (keylist) that identify different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

1. MONITORING_INFO: System resource and hardware health monitoring sensors publish measurements under the MONITORING_INFO topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements fields in the message.

Figure 8: Architecture of Sensor Framework

2. HEARTBEAT: The Heartbeat Sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE_HEALTH_INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK_DOWN) of any node through a change in its heartbeat, it publishes a NODE_HEALTH_INFO topic to notify all of the other nodes regarding the change in status of the node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL_COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL_MEMBERSHIP_INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET_LEADER, LEADER_ACK, LEADER_EXISTS, LEADER_DEAD), region name, and leader node name fields in the message.

6. PROCESS_ACCOUNTING_INFO: The process accounting sensor reads records from the process accounting system and publishes the records under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI_PROCESS_INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process Ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB_APPLICATION_INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
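As an illustration, the topic structures above can be mirrored as plain record types. The field names below follow the descriptions in this section; the actual IDL definitions used by "RFDMon" may differ in types and naming:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical Python mirrors of the IDL topic structures described above.

@dataclass
class MonitoringInfo:
    node_name: str        # key field: identifies the reporting node
    sensor_name: str      # key field: identifies the sensor instance
    sequence_id: int
    measurements: List[float] = field(default_factory=list)

@dataclass
class Heartbeat:
    node_name: str        # key field
    sequence_id: int

@dataclass
class NodeHealthInfo:
    node_name: str
    severity_level: int
    node_state: str       # e.g. "UP", "DOWN", "FRAMEWORK_DOWN"
    transition_type: str

# A sample MONITORING_INFO message as a CPU sensor might publish it.
sample = MonitoringInfo("ddsnode1", "CPU Utilization", 42, [0.7])
```

In DDS, the key fields (here node and sensor names) are what distinguish different instances of the same topic, so each sensor on each node forms its own instance stream.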

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. The publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and they are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements to the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for recording the changes in state (UP, DOWN, FRAMEWORK_DOWN) of the various Local Manager nodes. Once the leader of a region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for the selection of a leader for that region among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with the node's region name. The Local Manager contacts the Global Membership Manager to obtain information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to other local nodes in the future, and it updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in a semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader's name. This enables the fault-tolerant nature of the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing systems, where a single node is chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "Leader Election Problem". Therefore, a leader election algorithm is required that can guarantee there will be only one unique designated leader of the group, and that every other node will know whether it is the leader or not [19]. Various leader election algorithms for distributed system infrastructures are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module for leader election.
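A minimal sketch of this default policy, including persistence of the region-to-leader map in the "RegionName;LeaderName" format described above, follows. The class and method names are illustrative; the real module additionally exchanges GET_LEADER, LEADER_ACK, LEADER_EXISTS, and LEADER_DEAD messages over DDS:

```python
import os

class GlobalMembershipManager:
    """Illustrative sketch of the default leader-selection policy."""

    def __init__(self, map_file="REGIONAL_LEADER_MAP.txt"):
        self.map_file = map_file
        self.leaders = {}                    # region name -> leader node
        if os.path.exists(map_file):         # recover state after a restart
            with open(map_file) as f:
                for entry in f.read().split():
                    region, leader = entry.split(";")
                    self.leaders[region] = leader

    def _persist(self):
        with open(self.map_file, "w") as f:
            f.write("\n".join(f"{r};{l}" for r, l in self.leaders.items()))

    def get_leader(self, region, requesting_node):
        # Default policy: first node to register for a region becomes leader.
        if region not in self.leaders:
            self.leaders[region] = requesting_node
            self._persist()
        return self.leaders[region]

    def leader_dead(self, region, reporting_node):
        # First node reporting the dead leader is assigned as the new leader.
        self.leaders[region] = reporting_node
        self._persist()
        return reporting_node

gmm = GlobalMembershipManager("demo_leader_map.txt")
leader = gmm.get_leader("fermi-cluster", "ddshost1")  # first registrant leads
```

Because the map is rewritten on every change and reloaded on startup, a restarted manager instance resumes with the last known leaders, which is the fault-tolerance property described in the next paragraph.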

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of the Global Membership Manager if the previous instance terminates due to some error. The new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This enables the fault-tolerant nature of the framework with respect to the global membership manager.

Figure 9: Class Diagram of the Sensor Framework

A UML class diagram showing the relations among the above modules is presented in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
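The CPU cap implied by such a static cyclic schedule follows directly from its parameters: a partition that executes for duration d in every hyperperiod H can never consume more than d/H of the CPU. A small sketch of this constraint (illustrative names, not the emulator's API):

```python
# Sketch of the static cyclic schedule constraint described above.
# Each partition window is (phase, duration), in the same time unit.

def utilization_cap(duration, hyperperiod):
    """Maximum CPU fraction a partition can consume under the schedule."""
    return duration / hyperperiod

def is_feasible(partitions, hyperperiod):
    """A static cyclic schedule is feasible if the partition windows fit
    inside the hyperperiod without overlapping each other."""
    windows = sorted((phase, phase + duration) for phase, duration in partitions)
    prev_end = 0
    for start, end in windows:
        if start < prev_end or end > hyperperiod:
            return False
        prev_end = end
    return True

# Example: a 10 ms window in a 100 ms hyperperiod caps the Local
# Manager's sensor partition at 10% of the CPU.
cap = utilization_cap(10, 100)               # 0.1
ok = is_feasible([(0, 10), (20, 30)], 100)   # True
```

This is why the framework can guarantee that monitoring never starves the applications: the cap is enforced by the schedule itself, not by voluntary throttling in the sensors.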

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name              Period       Description
CPU Utilization          30 seconds   Aggregate utilization of all CPU cores on the machine
Swap Utilization         30 seconds   Swap space usage on the machine
RAM Utilization          30 seconds   Memory usage on the machine
Hard Disk Utilization    30 seconds   Disk usage on the machine
CPU Fan Speed            30 seconds   Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed    10 seconds   Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature          10 seconds   Temperature of the processor on the machine
Motherboard Temperature  10 seconds   Temperature of the motherboard on the machine
CPU Voltage              10 seconds   Voltage of the processor on the machine
Motherboard Voltage      10 seconds   Voltage of the motherboard on the machine
Network Utilization      10 seconds   Bandwidth utilization of each network card
Network Connection       30 seconds   Number of TCP connections on the machine
Heartbeat                30 seconds   Periodic liveness messages
Process Accounting       30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info         -1           Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info     -1           Aperiodic sensor that reports the performance data of the web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements: HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Sensor priority indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changes only slightly.
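The dead-band rule can be sketched as a small filter in front of the publisher; this is an illustration of the behaviour described above, not the framework's actual code:

```python
# Sketch of dead-band filtering: a measurement is published only when it
# differs from the last published value by more than the dead band.

class DeadBandFilter:
    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        """Return True (and record the value) when it clears the band."""
        if self.last_published is None or \
           abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False

# With a dead band of 2.0, small fluctuations around 50 are suppressed.
f = DeadBandFilter(dead_band=2.0)
decisions = [f.should_publish(v) for v in [50.0, 51.0, 49.5, 53.0, 52.0]]
# → [True, False, False, True, False]
```

Note that the comparison is against the last *published* value, not the last sampled one, so a slow drift still gets reported once it accumulates past the band.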

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support commands for controlling publication of the data: START, STOP, and POLL. The START command starts an already initialized sensor so that it begins publishing sensor measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command retrieves the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.

System Resource Utilization Monitoring Sensors

A number of sensors have been developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, carry normal priority, and read system pseudo-files (e.g., "/proc/stat", "/proc/meminfo", etc.) to collect the measurements. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
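As an illustration of how such a sensor derives its measurement, the following sketch computes aggregate CPU utilization from two samples of the "cpu" line of /proc/stat, where utilization is the non-idle share of elapsed jiffies. The sample strings below are synthetic; a real sensor would read the file at each period:

```python
# Sketch of deriving aggregate CPU utilization from /proc/stat samples.
# The "cpu" line lists jiffies spent in user, nice, system, idle,
# iowait, and further states; idle time is taken as idle + iowait.

def parse_cpu_line(line):
    fields = [int(v) for v in line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    return idle, sum(fields)

def cpu_utilization(sample1, sample2):
    """Fraction of non-idle time between two /proc/stat 'cpu' lines."""
    idle1, total1 = parse_cpu_line(sample1)
    idle2, total2 = parse_cpu_line(sample2)
    dt = total2 - total1
    return 1.0 - (idle2 - idle1) / dt if dt else 0.0

# Example with two synthetic samples 100 jiffies apart: 60 of the 100
# elapsed jiffies were idle/iowait, so utilization is 40%.
u = cpu_utilization("cpu 100 0 50 800 50 0 0 0",
                    "cpu 130 0 60 850 60 0 0 0")
```

Two samples are needed because /proc/stat reports cumulative counters since boot; the sensor's 30-second period defines the window over which the utilization is averaged.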

Hardware Health Monitoring Sensors

In the proposed framework, various sensors have been developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, carry normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command on the Linux system to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command on the Linux system to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command on the Linux system to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command on the Linux system to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages, which are transmitted by each local node periodically. If a local node gets terminated, the leader of the region will update the monitoring database with the node state transition values and will notify the other nodes of the termination of the local node through the topic "NODE_HEALTH_INFO". If a leader node gets terminated, the other local nodes will notify the global membership manager of the termination of the leader for their region through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The global membership manager will elect a new leader for the region and notify the local nodes in the region in subsequent messages. Thus, the monitoring database will always contain updated data regarding the various resource utilization and performance features of the infrastructure.

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to update the same information in the centralized monitoring database. This sensor logs all the information in case of a state change (Started, Killed, Ended) of a process related to the scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that can execute the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master node of the cluster. The name of the scientific application and its other arguments are passed as arguments while executing the SciAppManager application (e.g., mpirun -np N -machinefile worker.config SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is described in Figure 10.
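The handoff between SciAppManager and the sensor can be sketched as follows. The real framework uses a per-node POSIX message queue; a standard in-process queue stands in for it here so the sketch stays self-contained, and the record fields are illustrative, following the MPI_PROCESS_INFO description:

```python
import json
import queue

def report_state_change(mq, app_name, pid, state):
    """Wrapper (SciAppManager) side: enqueue a state-change record for
    the child scientific application ("STARTED", "ENDED", or "KILLED")."""
    mq.put(json.dumps({"process": app_name, "pid": pid, "state": state}))

def drain_records(mq):
    """Sensor side: drain queued records; the real sensor would format
    each one as the DDS topic structure and publish it under the
    MPI_PROCESS_INFO topic."""
    records = []
    while not mq.empty():
        records.append(json.loads(mq.get()))
    return records

mq = queue.Queue()   # stand-in for the per-node POSIX message queue
report_state_change(mq, "SciAPP", 4242, "STARTED")
report_state_change(mq, "SciAPP", 4242, "ENDED")
records = drain_records(mq)   # two records, in arrival order
```

Decoupling the wrapper from the sensor through a queue is what lets the sensor remain aperiodic: it only wakes up when a state change is actually available, as described for aperiodic sensors above.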

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure for the performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from SciAppManager's) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, the fault-adaptive nature, and the responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once a Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that the CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes. Additionally, even in the case of spikes, the utilization stays under ten percent. Similarly, according to Figure 13, the RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach, "RFDMon", is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

Transitions of various nodes between the states UP and FRAMEWORK_DOWN are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore, its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK_DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore, its state was updated to UP at the same time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, and as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

On combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the regional leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults in the region (leader re-election and nodes going out of the framework). Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of regional leader re-election at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework is easily scalable with the number of nodes because it is based upon the data-centric publish-subscribe mechanism, which is extremely scalable in itself. Also, in the proposed framework, new sensors can easily be added to increase the number of monitoring parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outages in the network (if a Regional Leader node stops working). Also, this framework can self-configure (start, stop, and poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption by the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications in execution on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment. A hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct implementation of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing over the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi National Accelerator Laboratory, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and can easily be combined with a performance management system due to its flexible nature. New sensors can be started at any time in the monitoring framework, and new sets of publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", which is a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework, "RFDMon", can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by the Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press, accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In IEEE International Workshop on Engineering of Autonomic and Autonomous Systems, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss: the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: the ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang (ERLANG '05), pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic and Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor


RFDMon: A Real-Time and Fault-Tolerant Distributed System Monitoring Approach

December 6 2011

Abstract

In this paper, a systematic distributed event-based (DEB) system monitoring approach, "RFDMon", is presented for measuring system variables (CPU utilization, memory utilization, disk utilization, network utilization, etc.), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) accurately, with minimum latency, at a specified rate, and with minimal resource utilization. Additionally, "RFDMon" is fault tolerant (tolerates faults in the monitoring framework), self-configuring (can start and stop monitoring the nodes and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of the execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (computational resources can be limited for monitors), and expandable for adding extra monitors on the fly. Because this framework is built on Object Management Group (http://www.omg.org) Data Distribution Services (DDS) implementations, it can be deployed in systems with heterogeneous nodes. Additionally, it provides functionality to place a maximum cap on the resources consumed by monitoring processes, so that monitoring does not affect the availability of resources for the applications.

1 Introduction

Currently, distributed systems are used for executing scientific applications, in enterprise domains for hosting e-commerce applications, and in wireless networks and storage solutions. Furthermore, distributed computing systems are an essential part of mission-critical infrastructure such as flight control systems, health care infrastructure, and industrial control systems. A typical distributed infrastructure contains multiple autonomous computing nodes that are deployed over a network and connected through a distributed middleware layer to exchange information and control commands [1]. This distributed middleware spans multiple machines and provides the same interface to all the applications executing on the computing nodes. The primary benefit of executing an application over a distributed infrastructure is that different instances of the application (or different applications) can communicate with each other over the network efficiently, while keeping the details of the underlying hardware and operating system hidden from the application behaviour code. Also, interaction in a distributed infrastructure is consistent irrespective of location and time. A distributed infrastructure is easy to expand, enables resource sharing among computing nodes, and provides the flexibility of running redundant nodes to improve the availability of the infrastructure.

Using a distributed infrastructure creates the added responsibility of consistency, synchronization, and security across multiple nodes. Furthermore, in the enterprise domain (or cloud services), there is tremendous pressure to achieve the system QoS objectives in all possible scenarios of system operation. To this end, an aggregate picture of the distributed infrastructure should always be available to analyze and to provide feedback for computing control commands if needed. The desired aggregate picture can be achieved through an infrastructure monitoring technique that is extensive enough to accommodate system parameters (CPU utilization, memory utilization, disk utilization, network utilization, etc.), application performance parameters (response time, queue size, and throughput), and scientific application data structures (e.g., Portable Batch System (PBS) information, Message Passing Interface (MPI) variables). Additionally, this monitoring technique should be customizable to tune for various types of applications and their wide range of parameters. Effective management of distributed systems requires an effective monitoring technique which can work in a distributed manner, similar to the underlying system, and report each event to the system administrator with maximum accuracy and minimum latency. Furthermore, the monitoring technique should not be a burden on the performance of the computing system and should not affect the execution of its applications. Also, the monitoring technique should be able to identify faults in itself, so that the faulty component can be isolated and corrected immediately for effective monitoring of the computing system. Distributed event-based systems (DEBS) are recommended for monitoring and management of distributed computing systems for faults and system health notifications through system alarms. Moreover, DEBS-based monitoring is able to provide a notion of system health in terms of fault/status signals, which helps in taking the appropriate control actions to keep the system within a safe boundary of operation.

Contribution: In this paper, an event-based distributed monitoring approach, "RFDMon", is presented that utilizes the concepts of Data Distribution Services (DDS) for an effective exchange of monitoring measurements among the computing nodes. This monitoring framework is built upon the ACM ARINC-653 Component Framework [2] and an open source DDS implementation, OpenSplice [3]. The primary principles in the design of the proposed approach are static memory allocation for determinism; spatial and temporal isolation between the monitoring framework and real applications of different criticality; specification of, and adherence to, real-time properties such as periodicity and deadlines in the monitoring framework; and well-defined compositional semantics.

"RFDMon" is fault tolerant (for faults in the sensor framework), self-configuring (can start and stop monitoring the nodes, and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of the execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrained (resources can be limited for monitors), expandable for adding extra monitors on the fly, and applicable to heterogeneous infrastructure. "RFDMon" has been developed on the existing OMG CCM standard because it is one of the most widely used component models and software developers are already familiar with it.

"RFDMon" is divided into two parts: 1. a distributed sensor framework that monitors the various system, application, and performance parameters of the computing system and reports all measurements to a centralized database over an HTTP interface; 2. a Ruby on Rails [4] based web service that shows the current state of the infrastructure by providing an interface to read the database and display its content in a graphical window in the web browser.

Outline: This paper is organized as follows. Preliminary concepts of the proposed approach are presented in section 2, and issues in monitoring of distributed systems are highlighted in section 3. Related distributed monitoring products are described in section 4, while a detailed description of the proposed approach, "RFDMon", is given in section 5. Details of each sensor are presented in section 6, while a set of experiments is described in section 7. Applications of the proposed approach are mentioned in section 9, and major benefits of the approach are highlighted in section 8. Future directions of the work are discussed in section 10, and conclusions are presented in section 11.

2 Preliminaries

The proposed monitoring approach, "RFDMon", consists of two major modules: the Distributed Sensors Framework and the Infrastructure Monitoring Database. The Distributed Sensors Framework utilizes the Data Distribution Services (DDS) middleware standard for communication among nodes of the distributed infrastructure; specifically, it uses the OpenSplice Community Edition [3]. It executes the DDS sensor framework on top of the ARINC Component Framework [5]. The Infrastructure Monitoring Database uses Ruby on Rails [4] to implement the web service that updates the database with monitoring data and displays the data in the administrator's web browser. In this section, the primary concepts of the publish-subscribe mechanism, OpenSplice DDS, ARINC-653, and Ruby on Rails are presented, which will be helpful in understanding the proposed monitoring framework.

2.1 Publish-Subscribe Mechanism

Figure 1: Publish-Subscribe Architecture

Publish-subscribe is a typical communication model for distributed infrastructures that host distributed applications. Various nodes can communicate with each other by sending (publishing) data and receiving (subscribing to) data anonymously through the communication channel specified in the infrastructure. A publisher or subscriber needs only the name and definition of the data in order to communicate; publishers do not need any information about the location or identity of the subscribers, and vice versa. Publishers are responsible for collecting the data from the application, formatting it as per the data definition, and sending it out of the node to all registered subscribers over the publish-subscribe domain. Similarly, subscribers are responsible for receiving the data from the publish-subscribe domain, formatting it as per the data definition, and presenting the formatted data to the intended applications (see Figure 1). All communication is performed through a DDS domain. A domain can have multiple partitions, and each partition can contain multiple topics; topics are published and subscribed across the partitions. Partitioning is used to group similar types of topics together, and it also gives an application the flexibility to receive data from a set of data sources [6]. The publish-subscribe mechanism overcomes the typical shortcomings of the client-server model, where clients and servers are coupled together for the exchange of messages. In the client-server model, a client must know the location of the server in order to exchange messages; in the publish-subscribe mechanism, clients need to know only the type of the data and its definition.
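The decoupling described above can be illustrated with a minimal in-process sketch. This is plain Python, not an actual DDS API; the topic name and sample format are invented for illustration:

```python
from collections import defaultdict

class MiniBus:
    """Toy publish-subscribe bus: publishers and subscribers share only
    a topic name and data definition, never each other's identity."""
    def __init__(self):
        self._subs = defaultdict(list)   # topic name -> subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, sample):
        # Deliver the sample to every subscriber registered for this topic.
        for cb in self._subs[topic]:
            cb(sample)

bus = MiniBus()
received = []
bus.subscribe("cpu_util", received.append)            # knows only the topic
bus.publish("cpu_util", {"node": "n1", "util": 0.42}) # knows only the topic
print(received)  # [{'node': 'n1', 'util': 0.42}]
```

Note that neither side holds a reference to the other; swapping in a second subscriber or publisher requires no change to existing code, which is the property the client-server model lacks.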

2.2 OpenSplice DDS

Figure 2: Data Distribution Services (DDS) Architecture

OpenSplice DDS [3] is an open source community edition of the Data Distribution Service (DDS) specification defined by the Object Management Group (http://www.omg.org). These specifications are primarily used for the communication needs of distributed real-time systems. Distributed applications use DDS as an interface for the "Data-Centric Publish-Subscribe" (DCPS) communication mechanism (see Figure 2). "Data-centric" communication provides the flexibility to specify QoS parameters for the data depending upon its type, availability, and criticality; these QoS parameters include rate of publication, rate of subscription, data validity period, etc. DCPS gives developers the flexibility to define different QoS requirements for the data and take control of each message, so that they can concentrate on the handling of data instead of on the transfer of data. Publishers and subscribers use the DDS framework to send and receive the data, respectively. DDS can be combined with any communication interface for communication among the distributed nodes of an application, and the DDS framework handles all communication among publishers and subscribers as per the QoS specifications. In the DDS framework, each message is associated with a special data type called a topic. A subscriber registers to one or more topics of its interest, and DDS guarantees that the subscriber will receive messages only from the topics it subscribed to. In DDS, a host is allowed to act as a publisher for some topics and simultaneously as a subscriber for others (see Figure 3).

Benefits of DDS

The primary benefit of using the DDS framework for our proposed monitoring framework, "RFDMon", is that DDS is based upon a publish-subscribe mechanism that decouples the sender and receiver of the data; there is no single point of bottleneck or failure in communication. Additionally, the DDS communication mechanism is highly scalable in the number of nodes and supports auto-discovery of nodes. DDS ensures data delivery with minimum overhead and efficient bandwidth utilization.


Figure 3: Data Distribution Services (DDS) Entities

Figure 4: ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real-time operating systems (RTOS) that are used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among the various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. Multiple (or one) partitions are grouped to form a module (i.e., a processor), while one or more modules form a system. These partitions are allocated predetermined chunks of memory.

Spatial partitioning [7] ensures exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in a partition cannot corrupt or destroy the data structures of processes that are executing in other partitions. This space partitioning is useful for separating the low-criticality vehicle management components from the safety-critical flight control components in avionics systems [2]. Memory management hardware ensures memory protection by allowing a process to access only the part of the memory that belongs to the partition hosting that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule for the use of processing resources by the different partitions. This fixed periodic schedule is generated or supplied to the RTOS in advance for sharing of resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the scheduling scheme. Additionally, it guarantees that the partition's execution will be interrupted once the partition's execution interval has finished, that the partition will then be placed into a dormant state, and that the next partition in the scheduling scheme will be allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can be created only during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real-time properties (priority, periodicity, duration, soft/hard deadline, etc.) of the processes and partitions during their creation. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message-passing queue, and blackboards, which allow processes to read, write, and clear a single data storage. Two different partitions communicate using inter-partition services, which use ports and channels for sampling and queueing of messages. Synchronization of processes belonging to the same partition is performed through semaphores and events [7].
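As a rough illustration of the two intra-partition communication services mentioned above, the following sketch mimics their semantics: a buffer behaves as a FIFO queue, while a blackboard is a single overwritable slot whose reads do not consume the message. This is a plain-Python analogy, not the ARINC-653 API:

```python
from collections import deque

class Buffer:
    """Queue semantics: messages are consumed in FIFO order."""
    def __init__(self):
        self._q = deque()
    def send(self, msg):
        self._q.append(msg)
    def receive(self):
        return self._q.popleft()

class Blackboard:
    """Single-slot semantics: a write overwrites the previous message,
    and reads do not consume it."""
    def __init__(self):
        self._msg = None
    def display(self, msg):   # 'display' is the ARINC term for writing
        self._msg = msg
    def read(self):
        return self._msg
    def clear(self):
        self._msg = None

buf = Buffer()
buf.send("m1"); buf.send("m2")
bb = Blackboard()
bb.display("state=OK"); bb.display("state=DEGRADED")
print(buf.receive(), bb.read())  # m1 state=DEGRADED
```

The contrast is the point: a buffer preserves every message in order, whereas a blackboard always shows only the latest value, which suits status-style data.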

2.4 ARINC-653 Emulation Library

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux-based implementation of the ARINC-653 interface specifications for intra-partition process communication, including Blackboards and Buffers. Buffers provide a queue for passing messages, and Blackboards enable processes to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. The library also provides the process and time management services described in the ARINC-653 specification.

Figure 5: Specification of Partition Scheduling in the ACM Framework

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies its rate of execution; the partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the CPU time allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module; in other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions, and the length of the minor frame is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.

Figure 6: Rails and Model-View-Controller (MVC) Architecture Interaction

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each of which is specified by an offset from the start of the hyperperiod and a duration. The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.
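The hyperperiod computation and the validity check described above can be sketched as follows. This is a simplified stand-in for the module manager's check, with made-up partition periods and windows; times are in milliseconds:

```python
from math import gcd
from functools import reduce

def hyperperiod(periods):
    """Least common multiple of all partition periods in the module."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def schedule_is_valid(windows, hp):
    """windows: list of (offset, duration) minor frames within one
    hyperperiod. Valid iff every window fits inside the hyperperiod
    and no two windows overlap."""
    windows = sorted(windows)
    for i, (off, dur) in enumerate(windows):
        if off + dur > hp:
            return False                      # runs past the major frame
        if i + 1 < len(windows) and off + dur > windows[i + 1][0]:
            return False                      # overlaps the next window
    return True

hp = hyperperiod([50, 100, 200])              # major frame of 200 ms
ok = schedule_is_valid([(0, 20), (50, 20), (100, 20), (150, 20)], hp)
bad = schedule_is_valid([(0, 60), (50, 20)], hp)
print(hp, ok, bad)  # 200 True False
```

A partition with period 50 ms would appear in four of these minor frames per major frame; the second schedule is rejected because its first window (0-60 ms) overlaps the window starting at 50 ms.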

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model-View-Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components. 1. Model: contains the data definition and manipulation rules for the data; the model maintains the state of the application by storing data in the database. 2. View: contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model; the view never handles the data directly, but interacts with users for various ways of inputting the data. 3. Controller: contains rules for processing the user input, and for testing, updating, and maintaining the data. A user can change the data by using the controller, while views update the visualization of the data to reflect the changes. Figure 6 presents the sequence of events while accessing a Rails application over the web interface. In a Rails application, an incoming client request is first sent to a router, which finds the location of the application and parses the request to find the controller (and the method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, interact with the model if needed, and invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the browser of the client.

In the proposed monitoring framework, "RFDMon", a web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI- and PBS-related data for scientific applications, web application performance, and process accounting. The schema of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an online manner, where an extensive list of parameters is monitored and the measurements are utilized by online control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behaviour of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique suffers from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring includes the interpretation of observations that are not directly related to the behaviour of the system, and measurements that must be reported in a particular order [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any faults, instability, or illegal behaviour into the system under observation, and that it will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module thanks to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented along with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are among the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems which scales easily with the size of the cluster or grid. Ganglia is built upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multicast-based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy so that each higher level maintains the existence (membership) of the lower-level modules. Various nodes and applications multicast their monitored resources or metrics to all of the other nodes; in this way, each node has an approximate picture of the complete cluster at all instances of time. Aggregation of the lower-level modules is done through polling of the child nodes in the tree, and the approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. Two daemons implement the Ganglia monitoring system: the Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and respond to requests sent from a Ganglia client, while the Ganglia Meta Daemon ("gmetad") executes one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, each node containing the aggregate picture of the cluster, sound design principles, easy portability, and easy manageability.


Figure 7: Schema of the Monitoring Database (PK = primary key, FK = foreign key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is based on an agent/server architecture, where agents report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of three primary modules. Scheduler: the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-ins: small modules that are placed on a computing node, configured to monitor a resource, and then send the readings to the "Nagios" server module over the SNMP interface. GUI: a web-based interface that presents the state of the distributed monitoring system through various colourful buttons, sounds, and graphs. Events reported by the plug-ins are displayed on the GUI according to their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure with a comprehensive and flexible approach to monitoring as per the requirements of the enterprise. It is an agentless monitoring approach, in which a central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (ZenPacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI for monitoring servers placed in different geographic locations, giving a detailed geographical view of the infrastructure.


The Nimsoft Monitoring Solution (NMS) [15] offers a lightweight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among the applications: applications residing in the infrastructure publish data over the message bus, and subscriber applications automatically receive those messages. These applications (or components) are configured with the help of a software component (called a HUB) and are attached to the message bus. Monitoring actions are performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed over each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, they do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or expansion of the monitoring approach with new parameters in an already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS), which requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurate monitoring of a computing system with periodic tasks, particularly when the computing system is busy running other applications rather than the monitoring sensors. [18] presents a feedback-based approach for compensating for timing jitter without any change to the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
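A minimal sketch of such feedback-based compensation is given below. This is our illustration of the general idea, not the exact algorithm of [18]: each cycle's sleep time is corrected by the drift measured against an ideal timeline, so the error is fed back each period instead of accumulating:

```python
import time

def run_periodic(task, period_s, cycles):
    """Periodic loop that anchors each release to the ideal timeline:
    the sleep duration is the nominal period minus the measured drift
    (feedback), so jitter stays bounded instead of growing per cycle."""
    start = time.monotonic()
    for k in range(1, cycles + 1):
        task()
        next_release = start + k * period_s        # ideal k-th release instant
        delay = next_release - time.monotonic()    # nominal period minus drift
        if delay > 0:
            time.sleep(delay)

ticks = []
run_periodic(lambda: ticks.append(time.monotonic()), 0.01, 5)
print(len(ticks))  # 5
```

The key design choice is sleeping until `start + k * period` rather than sleeping for a fixed `period` after each task: with a fixed sleep, the task's own execution time and scheduling delay would add up cycle after cycle.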

The primary goal of this paper is to present an event based distributed monitoring framework "RFDMon" for distributed systems that utilizes the data distribution service (DDS) methodology to report the events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concepts of spatial and temporal locality as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary component of the framework. These are lightweight processes that monitor a device on the computing nodes and read it periodically or aperiodically to get the measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (structure definition) and the key values (keylist) that identify the different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of Sensor Framework

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under this topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

2. HEARTBEAT: The Heartbeat Sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence Id fields in the messages. All nodes which are listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes a NODE HEALTH INFO topic to notify all of the other nodes regarding the change in status of the node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for selection of a leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes the records under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process Ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing over the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
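The topic types above are defined in an IDL file for DDS. As an illustration only, the first two can be sketched as Python dataclasses; the field names here are paraphrased from the descriptions above, not taken from the actual IDL:

```python
from dataclasses import dataclass

# Hypothetical Python rendering of two of the topic types described above.
# DDS distinguishes topic *instances* by their key fields only, so each
# class exposes its key tuple explicitly.

@dataclass
class MonitoringInfo:
    node_name: str      # key field: identifies the publishing node
    sensor_name: str    # key field: identifies the sensor instance
    sequence_id: int    # monotonically increasing per sensor
    measurements: list  # the actual sensor readings

    def key(self):
        return (self.node_name, self.sensor_name)

@dataclass
class Heartbeat:
    node_name: str      # key field
    sequence_id: int

    def key(self):
        return (self.node_name,)

sample = MonitoringInfo("ddsnode1", "CPU Utilization", 42, [3.5])
print(sample.key())  # → ('ddsnode1', 'CPU Utilization')
```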

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. Publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.
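A minimal in-process sketch of what a Topic Manager does, with an ordinary callback registry standing in for the OpenSplice DDS publisher/subscriber machinery; the class and method names are illustrative, not the framework's:

```python
from collections import defaultdict

class TopicManager:
    """Creates a publisher or subscriber role for one predefined topic."""

    # topic name -> list of subscriber callbacks (stand-in for the DDS domain)
    _subscribers = defaultdict(list)

    def __init__(self, topic_name):
        self.topic_name = topic_name

    def subscribe(self, callback):
        # register the underlying application's handler for this topic
        TopicManager._subscribers[self.topic_name].append(callback)

    def publish(self, message):
        # deliver the message to every subscriber of this topic
        for callback in TopicManager._subscribers[self.topic_name]:
            callback(message)

received = []
mgr = TopicManager("MONITORING_INFO")
mgr.subscribe(received.append)
mgr.publish({"node": "ddsnode1", "sensor": "CPU Utilization"})
```

In the real framework the delivery happens across nodes through the DDS middleware rather than through a shared dictionary.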

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining the nodes in a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with sensor measurements from each Local Manager. The leader of the region is also responsible for updating the changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for the selection of a leader for that region among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with the node's region name. The Local Manager contacts the Global Membership Manager to get information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to other local nodes in the future, and updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in semicolon separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader name. This enables the fault tolerant nature of the framework with respect to the regional leader, ensuring periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing systems, where a single node must be chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "Leader Election Problem". Therefore, a leader election algorithm is required which can guarantee that there will be only one unique designated leader of the group, and that all other nodes will know whether they are the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module for leader election.
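The on-disk leader map and the default first-to-register choice can be sketched as follows. The function names are hypothetical, but the semicolon-separated (RegionName;LeaderName) file format and the assignment rule follow the description above:

```python
# Sketch of how the Global Membership Manager might maintain the
# REGIONAL_LEADER_MAP.txt entries and apply the default leader choice.

def parse_leader_map(text):
    """Parse 'Region;Leader' lines into a region -> leader dict."""
    leaders = {}
    for line in text.splitlines():
        if ";" in line:
            region, leader = line.strip().split(";", 1)
            leaders[region] = leader
    return leaders

def get_leader(leaders, region, requesting_node):
    """Default choice: the first node to register for a region becomes leader."""
    if region not in leaders:
        leaders[region] = requesting_node  # assign the new node as leader
    return leaders[region]

leaders = parse_leader_map("regionA;ddshost1\nregionB;ddsnode4\n")
print(get_leader(leaders, "regionA", "ddsnode2"))  # existing leader: ddshost1
print(get_leader(leaders, "regionC", "ddsnode2"))  # new region: ddsnode2
```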

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of the Global Membership Manager if the previous instance terminates due to some error. A new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This enables the fault tolerant nature of the framework with respect to the global membership manager.

Figure 9: Class Diagram of the Sensor Framework

A UML class diagram showing the relations among the above modules is shown in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI and PBS related information, and performance data of applications in execution on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, it limits the maximum CPU utilization of the Local Managers.
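The resulting CPU cap follows directly from the schedule parameters: a partition that runs for a fixed duration once per hyperperiod can never use more than duration/hyperperiod of the CPU. A toy illustration of that bound (the example numbers are made up, not taken from the deployment):

```python
def max_cpu_share(duration, hyperperiod):
    """Upper bound on CPU utilization of a cyclically scheduled partition.

    The partition executes for `duration` time units once every
    `hyperperiod` time units, so its CPU share cannot exceed the ratio.
    """
    if not 0 < duration <= hyperperiod:
        raise ValueError("duration must lie within the hyperperiod")
    return duration / hyperperiod

# e.g. 100 ms of execution inside a 2 s hyperperiod caps the
# Local Manager partition at 5% of one CPU
print(max_cpu_share(0.1, 2.0))  # → 0.05
```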

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g. UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g. "/proc/stat").

Sensor Name              | Period     | Description
CPU Utilization          | 30 seconds | Aggregate utilization of all CPU cores on the machines
Swap Utilization         | 30 seconds | Swap space usage on the machines
Ram Utilization          | 30 seconds | Memory usage on the machines
Hard Disk Utilization    | 30 seconds | Disk usage on the machine
CPU Fan Speed            | 30 seconds | Speed of CPU fan that helps keep the processor cool
Motherboard Fan Speed    | 10 seconds | Speed of motherboard fan that helps keep the motherboard cool
CPU Temperature          | 10 seconds | Temperature of the processor on the machines
Motherboard Temperature  | 10 seconds | Temperature of the motherboard on the machines
CPU Voltage              | 10 seconds | Voltage of the processor on the machines
Motherboard Voltage      | 10 seconds | Voltage of the motherboard on the machines
Network Utilization      | 10 seconds | Bandwidth utilization of each network card
Network Connection       | 30 seconds | Number of TCP connections on the machines
Heartbeat                | 30 seconds | Periodic liveness messages
Process Accounting       | 30 seconds | Periodic sensor that publishes the commands executed on the system
MPI Process Info         | -1         | Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info     | -1         | Aperiodic sensor that reports the performance data of a web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g. 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Sensor priority indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value becomes greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of a sensor measurement is unchanged or changing only slightly.
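The dead-band rule can be sketched as a small stateful filter; this is a hypothetical rendering of the behaviour described above, not the framework's code:

```python
class DeadBandFilter:
    """Publish a reading only when it moves more than `dead_band`
    away from the last published value."""

    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        if self.last_published is None or \
           abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False  # suppress: change is within the dead band

f = DeadBandFilter(dead_band=2.0)
print([f.should_publish(v) for v in [10.0, 11.5, 13.0, 20.0]])
# → [True, False, True, True]
```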

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically, as per the value of the sensor period. Sensors support commands for controlling publication of the data: START, STOP, and POLL. The START command starts an already initialized sensor so that it begins publishing the sensor measurements. The STOP command stops the sensor thread from publishing the measurement data. The POLL command tries to get the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g. MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, carry normal priority, and read system devices (e.g. /proc/stat, /proc/meminfo, etc.) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the linux file system to collect the measurements for CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the linux file system to collect the measurements for RAM utilization of the system.


Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command over the linux file system and processes the output to collect the measurements for disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the linux file system to collect the measurements for SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the linux file system to collect the measurements for network utilization of the system in bytes per second.
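As an illustration of what the CPU utilization sensor has to do with "/proc/stat", the sketch below derives aggregate utilization from two successive samples of the "cpu" line; the parsing is simplified and the helper names are invented:

```python
def parse_cpu_line(line):
    """Return (idle_time, total_time) from an aggregate 'cpu ...' line
    of /proc/stat. All fields are cumulative jiffy counters; the 4th
    field is idle time."""
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3]
    return idle, sum(fields)

def cpu_utilization(sample1, sample2):
    """Percent of CPU time spent busy between two /proc/stat samples."""
    idle1, total1 = parse_cpu_line(sample1)
    idle2, total2 = parse_cpu_line(sample2)
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / (total2 - total1)

# two synthetic samples: 100 of the 200 elapsed jiffies were idle
s1 = "cpu  100 0 100 800 0 0 0 0"
s2 = "cpu  150 0 150 900 0 0 0 0"
print(cpu_utilization(s1, s2))  # → 50.0
```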

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g. motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, carry normal priority, and read the measurements over the vendor supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the linux file system to collect the measurement of CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command over the linux file system to collect the measurement of CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command over the linux file system to collect the measurement of motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command over the linux file system to collect the measurement of CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command over the linux file system to collect the measurement of motherboard voltage.
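After such a command runs, the sensor still has to extract a numeric reading from the tool's text output. A hedged sketch of parsing a typical "name | value unit | status" line follows; the exact output format varies across vendors and ipmitool versions, so the sample line is an assumption for illustration:

```python
def parse_ipmi_reading(line):
    """Extract (sensor name, numeric reading, status) from a
    pipe-separated ipmitool output line such as 'fan1 | 5625 RPM | ok'."""
    name, value, status = [p.strip() for p in line.split("|")]
    reading = float(value.split()[0])  # drop the unit, e.g. 'RPM'
    return name, reading, status

print(parse_ipmi_reading("fan1 | 5625 RPM | ok"))
# → ('fan1', 5625.0, 'ok')
```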

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform the other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages, which are transmitted by each local node periodically. If a local node gets terminated, the leader of the region will update the monitoring database with node state transition values and will notify the other nodes regarding the termination of the local node through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes will notify the global membership manager regarding the termination of the leader for their region through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager will elect a new leader for the region and notify the local nodes in the region in subsequent messages. Thus, the monitoring database will always contain up to date data regarding the various resource utilization and performance features of the infrastructure.
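A sketch of how a leader could derive node-state transitions from missed HEARTBEAT messages; only the 30 second period comes from Table 1, while the two-missed-beats threshold and all names are illustrative assumptions:

```python
HEARTBEAT_PERIOD = 30.0  # seconds (see Table 1)

class LivenessTracker:
    """Flags nodes whose heartbeats have stopped arriving."""

    def __init__(self, missed_allowed=2):
        self.timeout = missed_allowed * HEARTBEAT_PERIOD
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def on_heartbeat(self, node, now):
        self.last_seen[node] = now

    def down_nodes(self, now):
        """Nodes whose heartbeat has not been seen within the timeout."""
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]

t = LivenessTracker()
t.on_heartbeat("ddsnode1", now=0.0)
t.on_heartbeat("ddsnode2", now=0.0)
t.on_heartbeat("ddsnode1", now=90.0)   # ddsnode2 stays silent
print(t.down_nodes(now=100.0))  # → ['ddsnode2']
```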

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to update the same information in the centralized monitoring database. This sensor logs all the information in case of a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that can execute the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master nodes in the cluster. The name of the scientific application and other arguments are passed as arguments while executing the SciAppManager application (e.g. mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
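The hand-off between SciAppManager and the sensor can be sketched as below, with Python's in-process queue standing in for the per-node POSIX message queue; the record fields paraphrase the MPI PROCESS INFO description, and the function names are illustrative:

```python
import json
import queue

mq = queue.Queue()  # stand-in for the per-node POSIX message queue

def sciapp_manager_report(state, pid, nprocs):
    """Wrapper side: write a state-change record for the child process."""
    mq.put(json.dumps({"state": state, "pid": pid, "nprocs": nprocs}))

def sensor_drain(node_name):
    """Sensor side: read queued records and wrap them in the fields the
    MPI PROCESS INFO topic carries (node name, sensor name, record)."""
    records = []
    while not mq.empty():
        rec = json.loads(mq.get())
        records.append({"node_name": node_name,
                        "sensor_name": "MPI Process Info",
                        "record": rec})
    return records

sciapp_manager_report("STARTED", pid=4242, nprocs=8)
msgs = sensor_drain("ddsnode1")
print(msgs[0]["record"]["state"])  # → STARTED
```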

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure of performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with a random time interval. Once all the nodes have started executing the monitoring framework, the Local Managers on a few nodes are killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "TOP" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes in the utilization. Additionally, even in the case of spikes, utilization is under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than even two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transition of various nodes between the states UP and FRAMEWORK DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed, therefore its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were also updated to FRAMEWORK DOWN once their Local Managers were killed, at time samples 390 and 410 respectively. The Local Manager at ddshost1 was started again at time sample 440, therefore its state was updated to UP at the same time. Figure 15 represents the nodes which were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, and as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

On combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the regional leader, a new leader is elected instantly without any further delay. This specific feature of the developed monitoring framework exhibits that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly, without delay.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced in this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the most ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (max) data samples are missing from the database. Hence, it is evident that faults in the framework have a minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. This framework is easily scalable with the number of nodes, because it is based upon a data centric publish-subscribe mechanism that is extremely scalable in itself. Also, in the proposed framework, new sensors can easily be added to increase the number of monitoring parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outages in the network (if a Regional Leader node stops working). This framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in a heterogeneous environment. A major benefit of using this sensor framework is that the total resource consumption by the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications in execution on the node. This monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct implementation of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the daytrader [26] web service executing over the IBM Websphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combining with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi Lab, Batavia, IL for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can be easily utilized to monitor a distributed system and can be easily combined with a performance management system due to its flexible nature. New sensors can be started at any time during the execution of the monitoring framework, and a new set of publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition to this, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", which is a real-time and fault-tolerant distributed system monitoring approach based upon the data centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g. PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor


• Introduction
• Preliminaries
  • Publish-Subscribe Mechanism
  • OpenSplice DDS
  • ARINC-653
  • ARINC-653 Emulation Library
  • Ruby on Rails
• Issues in Distributed System Monitoring
• Related Work
• Architecture of the framework
  • Sensors
  • Topics
  • Topic Managers
  • Region
  • Local Manager
  • Leader of the Region
  • Global Membership Manager
• Sensor Implementation
• Experiments
• Benefits of the approach
• Application of the Framework
• Future Work
• Conclusion
• Acknowledgement

RFDMon: A Real-Time and Fault-Tolerant Distributed System Monitoring Approach

December 6, 2011

Abstract

In this paper, a systematic distributed event based (DEB) system monitoring approach, "RFDMon", is presented for measuring system variables (CPU utilization, memory utilization, disk utilization, network utilization, etc.), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) accurately, with minimum latency, at a specified rate, and with minimal resource utilization. Additionally, "RFDMon" is fault tolerant (for faults in the monitoring framework), self-configuring (can start and stop monitoring the nodes and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (computational resources can be limited for monitors), and expandable for adding extra monitors on the fly. Because this framework is built on Object Management Group (http://www.omg.org) Data Distribution Services (DDS) implementations, it can be deployed in systems with heterogeneous nodes. Additionally, it provides functionality to set a maximum cap on the resources consumed by monitoring processes such that they do not affect the availability of resources for the applications.

1 Introduction

Currently, distributed systems are used for executing scientific applications, in enterprise domains for hosting e-commerce applications, and in wireless networks and storage solutions. Furthermore, distributed computing systems are an essential part of mission-critical infrastructure such as flight control systems, health care infrastructure, and industrial control systems. A typical distributed infrastructure contains multiple autonomous computing nodes that are deployed over a network and connected through a distributed middleware layer to exchange information and control commands [1]. This distributed middleware spans multiple machines and provides the same interface to all the applications executing on the various computing nodes. The primary benefit of executing an application over a distributed infrastructure is that different instances of the application (or different applications) can communicate with each other over the network efficiently, while keeping the details of the underlying hardware and operating system hidden from the application behaviour code. Also, interaction in a distributed infrastructure is consistent irrespective of location and time. A distributed infrastructure is easy to expand, enables resource sharing among computing nodes, and provides the flexibility of running redundant nodes to improve the availability of the infrastructure.

Using a distributed infrastructure creates added responsibility for consistency, synchronization, and security across multiple nodes. Furthermore, in the enterprise domain (or cloud services), there is tremendous pressure to achieve the system QoS objectives in all possible scenarios of system operation. To this end, an aggregate picture of the distributed infrastructure should always be available to analyze and to provide feedback for computing control commands if needed. The desired aggregate picture can be achieved through an infrastructure monitoring technique that is extensive enough to accommodate system parameters (CPU utilization, memory utilization, disk utilization, network utilization, etc.), application performance parameters (response time, queue size, and throughput), and scientific application data structures (e.g., Portable Batch System (PBS) information, Message Passing Interface (MPI) variables). Additionally, this monitoring technique should be customizable to tune for various types of applications and their wide range of parameters. Effective management of distributed systems requires an effective monitoring technique which can work in a distributed manner, similar to the underlying system, and report each event to the system administrator with maximum accuracy and minimum latency. Furthermore, the monitoring technique should not be a burden on the performance of the computing system and should not affect the execution of the applications on the computing system. Also, the monitoring technique should be able to identify faults in itself, so as to isolate the faulty component and correct it immediately for effective monitoring of the computing system. Distributed event based systems (DEBS) are recommended for monitoring and management of distributed computing systems for faults and system health notifications through system alarms. Moreover, DEBS based monitoring is able to provide a notion of system health in terms of fault/status signals, which helps to take the appropriate control actions to maintain the system within a safe boundary of operation.

Contribution: In this paper, an event based distributed monitoring approach, "RFDMon", is presented that utilizes the concepts of Data Distribution Services (DDS) for an effective exchange of monitoring measurements among the computing nodes. This monitoring framework is built upon the ACM ARINC-653 Component Framework [2] and an open source DDS implementation, OpenSplice [3]. The primary principles in the design of the proposed approach are: static memory allocation for determinism; spatial and temporal isolation between the monitoring framework and real applications of different criticality; specification of, and adherence to, real time properties such as periodicity and deadlines in the monitoring framework; and well-defined compositional semantics.

"RFDMon" is fault tolerant (for faults in the sensor framework), self-configuring (can start and stop monitoring the nodes, and configure monitors for threshold values/changes for publishing the measurements), self-aware (aware of execution of the framework on multiple nodes through HEARTBEAT message exchange), extensive (monitors multiple parameters through periodic and aperiodic sensors), resource constrainable (resources can be limited for monitors), expandable for adding extra monitors on the fly, and applicable to heterogeneous infrastructure. "RFDMon" has been developed on the existing OMG CCM standard because it is one of the widely used component models and software developers are already familiar with it.

"RFDMon" is divided into two parts: (1) a distributed sensor framework to monitor the various system, application, and performance parameters of the computing system, which reports all the measurements to a centralized database over an HTTP interface; and (2) a Ruby on Rails [4] based web service that shows the current state of the infrastructure by providing an interface to read the database and display its content in a graphical window in the web browser.

Outline: This paper is organized as follows. Preliminary concepts of the proposed approach are presented in Section 2, and issues in monitoring of distributed systems are highlighted in Section 3. Related distributed monitoring products are described in Section 4, while a detailed description of the proposed approach, "RFDMon", is given in Section 5. Details of each sensor are presented in Section 6, while a set of experiments is described in Section 7. Applications of the proposed approach are mentioned in Section 9, and major benefits of the approach are highlighted in Section 8. Future directions of the work are discussed in Section 10, and conclusions are presented in Section 11.

2 Preliminaries

The proposed monitoring approach "RFDMon" consists of two major modules: the Distributed Sensor Framework and the Infrastructure Monitoring Database. The Distributed Sensor Framework utilizes the Data Distribution Service (DDS) middleware standard for communication among nodes of the distributed infrastructure. Specifically, it uses the OpenSplice Community Edition [3], and it executes the DDS sensor framework on top of the ARINC Component Framework [5]. The Infrastructure Monitoring Database uses Ruby on Rails [4] to implement a web service that can be used to update the database with monitoring data and display the data in the administrator's web browser. In this section, the primary concepts of the publish-subscribe mechanism, OpenSplice DDS, ARINC-653, and Ruby on Rails are presented, which will be helpful in understanding the proposed monitoring framework.

2.1 Publish-Subscribe Mechanism

Publish-Subscribe is a typical communication model for a distributed infrastructure that hosts distributed applications. Various nodes can communicate with each other by sending (publishing) data and receiving (subscribing to) data anonymously through the communication channel specified in the infrastructure. A publisher or subscriber needs only the name and definition of the data in order to communicate. Publishers do not need any information about the location or identity of the subscribers,


Figure 1: Publish Subscribe Architecture

and vice versa. Publishers are responsible for collecting the data from the application, formatting it as per the data definition, and sending it out of the node to all registered subscribers over the publish-subscribe domain. Similarly, subscribers are responsible for receiving the data from the publish-subscribe domain, formatting it as per the data definition, and presenting the formatted data to the intended applications (see Figure 1). All communication is performed through a DDS domain. A domain can have multiple partitions, and each partition can contain multiple topics. Topics are published and subscribed to across the partitions. Partitioning is used to group similar types of topics together. It also provides flexibility to applications for receiving data from a set of data sources [6]. The publish-subscribe mechanism overcomes the typical shortcomings of the client-server model, where clients and servers are coupled together for the exchange of messages. In the client-server model, a client must have information regarding the location of the server in order to exchange messages, whereas in the publish-subscribe mechanism clients need to know only the type of the data and its definition.
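The decoupling idea can be illustrated with a minimal sketch (plain Python, not RFDMon's DDS code; the `Domain` class and topic name are illustrative stand-ins for a DDS domain and topic):

```python
from collections import defaultdict

class Domain:
    """Toy publish-subscribe domain: topics decouple publishers from subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        # A subscriber needs only the topic name and data definition,
        # not the identity or location of any publisher.
        self.subscribers[topic].append(callback)

    def publish(self, topic, sample):
        # The publisher likewise knows nothing about its subscribers.
        for callback in self.subscribers[topic]:
            callback(sample)

domain = Domain()
received = []
domain.subscribe("MONITORING_INFO", received.append)
domain.publish("MONITORING_INFO",
               {"node": "node1", "sensor": "cpu", "value": 42.0})
print(received)  # the published sample, delivered to the anonymous subscriber
```

Adding a second subscriber to the same topic requires no change to the publisher, which is the essential difference from the client-server model described above.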

2.2 OpenSplice DDS

OpenSplice DDS [3] is an open source community edition of the Data Distribution Service (DDS) specification defined by the Object Management Group (http://www.omg.org). These specifications are primarily used for the communication needs of distributed real time systems. Distributed applications use DDS as an interface for the "Data-Centric Publish-Subscribe" (DCPS) communication mechanism (see Figure 2). "Data-Centric" communication provides the flexibility to specify QoS parameters for the data depending upon the type, availability, and criticality of the data.

Figure 2: Data Distribution Services (DDS) Architecture

These QoS parameters include rate of publication, rate of subscription, data validity period, etc. DCPS gives developers the flexibility to define different QoS requirements for the data and to take control of each message, so that they can concentrate on the handling of data instead of the transfer of data. Publishers and subscribers use the DDS framework to send and receive data, respectively. DDS can be combined with any communication interface for communication among the distributed nodes of an application. The DDS framework handles all of the communication among publishers and subscribers as per the QoS specifications. In the DDS framework, each message is associated with a special data type called a topic. A subscriber registers to one or more topics of its interest. DDS guarantees that the subscriber will receive messages only from the topics that it subscribed to. In DDS, a host is allowed to act as a publisher for some topics and simultaneously act as a subscriber for others (see Figure 3).

Benefits of DDS

The primary benefit of using the DDS framework for our proposed monitoring framework "RFDMon" is that DDS is based upon a publish-subscribe mechanism that decouples the senders and receivers of the data. There is no single point of bottleneck or failure in communication. Additionally, the DDS communication mechanism is highly scalable in the number of nodes and supports auto-discovery of the nodes. DDS ensures data delivery with minimum overhead and efficient bandwidth utilization.


Figure 3: Data Distribution Services (DDS) Entities

Figure 4: ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real time operating systems (RTOS) that are used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. One or more partitions are grouped to form a module (i.e., a processor), while one or more modules form a system. These partitions are allocated predetermined chunks of memory.

Spatial partitioning [7] ensures the exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in a partition cannot corrupt or destroy the data structures of other processes that are executing in other partitions. This space partitioning is useful for separating the low-criticality vehicle management components from the safety-critical flight control components in avionics systems [2]. Memory management hardware ensures memory protection by allowing a process to access only the part of the memory that belongs to the partition hosting that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule for the use of the processing resources by the different partitions. This fixed periodic schedule is generated or supplied to the RTOS in advance for sharing the resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the schedule. Additionally, it guarantees that the partition's execution will be interrupted once the partition's execution interval has finished: the partition will be placed into a dormant state, and the next partition in the schedule will be allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can only be created during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real time properties (priority, periodicity, duration, soft/hard deadline, etc.) of the processes and partitions during their creation. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message passing queue, and blackboards, which allow a single data storage to be read, written, and cleared. Two different partitions communicate using inter-partition services that use ports and channels for sampling and queueing of messages. Synchronization of processes belonging to the same partition is performed through semaphores and events [7].
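The contrast between the two intra-partition services can be sketched as follows (plain Python standing in for the ARINC-653 API; class and method names are illustrative, not the APEX service names):

```python
import queue

class Blackboard:
    """Single-slot data store: a new write overwrites the previous message."""
    def __init__(self):
        self._msg = None

    def write(self, msg):
        self._msg = msg        # latest value replaces any earlier one

    def read(self):
        return self._msg

    def clear(self):
        self._msg = None

class Buffer:
    """FIFO message-passing queue between processes of one partition."""
    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, msg, timeout=None):
        self._q.put(msg, timeout=timeout)

    def receive(self, timeout=None):
        return self._q.get(timeout=timeout)

bb = Blackboard()
bb.write("cpu=42%")
print(bb.read())      # readers always see only the latest value

buf = Buffer(capacity=8)
buf.send("sample-1")
buf.send("sample-2")
print(buf.receive())  # FIFO order: sample-1 comes out first
```

A blackboard suits state that only the freshest value matters for (e.g. a current reading), while a buffer preserves every message in order.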

2.4 ARINC-653 Emulation Library

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux based implementation of the ARINC-653 interface specification for intra-partition process communication, which includes Blackboards and Buffers. Buffers provide a queue for passing messages, and Blackboards enable processes to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. The library also provides the process and time management services described in the ARINC-653 specification.

Figure 5: Specification of Partition Scheduling in the ACM Framework

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies its rate of execution. The partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the CPU time allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module. In other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions. The length of a minor frame is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each specified with an offset from the start of the hyperperiod and a duration.

Figure 6: Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.
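The hyperperiod computation and the overlap check described above can be sketched as follows (illustrative Python, not the module manager's actual code; the `(partition, offset, duration)` window format is an assumption):

```python
from math import lcm

def hyperperiod(periods):
    """Least common multiple of all partition periods in the module."""
    return lcm(*periods)

def schedule_is_valid(windows, hp):
    """Each window is (partition, offset, duration). The schedule is valid
    iff every window fits inside the hyperperiod and no two windows overlap."""
    windows = sorted(windows, key=lambda w: w[1])  # order by offset
    end_of_prev = 0
    for _, offset, duration in windows:
        if offset < end_of_prev or offset + duration > hp:
            return False
        end_of_prev = offset + duration
    return True

hp = hyperperiod([2, 3])   # partitions with periods 2s and 3s -> 6s major frame
ok = schedule_is_valid([("sensors", 0, 1), ("heartbeat", 2, 1),
                        ("sensors", 3, 1)], hp)
print(hp, ok)  # 6 True
```

Note that the 2-second partition appears in more than one minor frame, matching the statement that every partition executes one or more times within a hyperperiod.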

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components: (1) Model: contains the data definition and manipulation rules for the data; the model maintains the state of the application by storing data in the database. (2) View: contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model; a view never handles the data directly, but interacts with users for various ways of inputting the data. (3) Controller: contains rules for processing the user input, and for testing, updating, and maintaining the data. A user can change the data through a controller, while views update the visualization of the data to reflect the changes. Figure 6 presents the sequence of events when accessing a Rails application over the web interface. In a Rails application, an incoming client request is first sent to a router, which finds the location of the application and parses the request to find the corresponding controller (a method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, interact with the model if needed, and invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the client's browser.

In the proposed monitoring framework "RFDMon", a web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. The schema of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system so as to achieve pre-specified QoS objectives. This monitoring should be performed in an on-line manner, where an extensive list of parameters is monitored and the measurements are utilized by on-line control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behavior of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique can suffer from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring includes the understanding of observations that are not directly related to the behavior of the system, and measurements that require a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any faults, instability, or illegal behavior into the system under observation due to its implementation, and will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and is self-aware about execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems that scales easily with the size of the cluster or the grid. Ganglia is built upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multicast based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy so that the higher level can maintain the existence (membership) of the lower level modules. Various nodes and applications multicast their monitored resources or metrics to all of the other nodes. In this way, each node has an approximate picture of the complete cluster at all instants of time. Aggregation of the lower level modules is done through polling of the child nodes in the tree. The approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. Two daemons implement the Ganglia monitoring system: the Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and respond to the requests sent from a Ganglia client, while the Ganglia Meta Daemon ("gmetad") executes one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, an aggregate picture of the cluster available at each node, sound design principles, easy portability, and easy manageability.


Figure 7: Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is based on an agent/server architecture, where agents report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules: Scheduler, the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed; Plug-ins, small modules placed on a computing node, configured to monitor a resource and send the readings to the "Nagios" server module over the SNMP interface; and GUI, a web based interface that presents the state of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure that has a comprehensive and flexible approach to monitoring as per the requirements of an enterprise. It is an agentless monitoring approach, in which a central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMware) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (ZenPacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI for monitoring servers placed in different geographic locations, giving a detailed geographical view of the infrastructure.


The Nimsoft Monitoring Solution (NMS) [15] offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among the applications. Applications residing in the infrastructure publish data over the message bus, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a HUB) and are attached to the message bus. The monitoring action is performed by small probes, and the measurements are published to the message bus by robots, software components that are deployed on each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, they do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or expandability of the monitoring approach to new parameters in an already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurate monitoring of a computing system for periodic tasks, particularly when the computing system is busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating for timing jitter without any change to the operating system kernel. This approach has low overhead, is platform independent, and maintains totally bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
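A simplified sketch of the general idea (not the controller from [18]): computing each sleep against absolute release times feeds the measured lateness of one activation back into the next sleep, so the error does not accumulate.

```python
import time

def periodic(task, period_s, iterations):
    """Run task every period_s seconds with drift compensation:
    each sleep is derived from absolute release times, so lateness in
    one activation shortens the next sleep instead of accumulating."""
    start = time.monotonic()
    for k in range(1, iterations + 1):
        task()
        next_release = start + k * period_s      # absolute, not relative
        delay = next_release - time.monotonic()  # measured error shrinks the sleep
        if delay > 0:
            time.sleep(delay)

ticks = []
periodic(lambda: ticks.append(time.monotonic()), 0.01, 5)
print(len(ticks))  # 5
```

A naive `time.sleep(period_s)` after each activation would instead add the task's own execution time to every period, drifting without bound.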

The primary goal of this paper is to present an event based distributed monitoring framework, "RFDMon", for distributed systems, which utilizes the Data Distribution Service (DDS) methodology to report events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and a duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in the previous sections, the proposed monitoring framework is based upon a data centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concepts of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. They are lightweight processes that monitor a device on a computing node and read it periodically or aperiodically to obtain the measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding the individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (its structure definition) and the key values (keylist) used to identify different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in the topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of Sensor Framework

1. MONITORING_INFO: System resource and hardware health monitoring sensors publish measurements under the MONITORING_INFO topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurement information in the message.

2. HEARTBEAT: The heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence Id fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain.

3. NODE_HEALTH_INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK_DOWN) of any node through a change in its heartbeat, it publishes on the NODE_HEALTH_INFO topic to notify all of the other nodes regarding the change in status of that node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL_COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL_MEMBERSHIP_INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET_LEADER, LEADER_ACK, LEADER_EXISTS, LEADER_DEAD), region name, and leader node name fields in the message.

6. PROCESS_ACCOUNTING_INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI_PROCESS_INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB_APPLICATION_INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of Java threads, and pending requests inside the system.
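As noted above, topic structures and keys are declared in an IDL file. A hypothetical sketch for the MONITORING_INFO topic follows; the field names are taken from the description above, but the exact types and keylist in the RFDMon code base may differ.

```idl
// Hypothetical IDL for the MONITORING_INFO topic.
// Key fields identify distinct topic instances (per node, per sensor).
struct MonitoringInfo {
    string node_name;     // name of the publishing node
    string sensor_name;   // name of the publishing sensor
    long   sequence_id;   // monotonically increasing sample counter
    string measurement;   // encoded measurement payload
};
#pragma keylist MonitoringInfo node_name sensor_name
```

The `#pragma keylist` directive is the OpenSplice convention for declaring key fields of a DDS topic type.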

5.3 Topic Managers

Topic managers are classes that create a subscriber or publisher for a predefined topic. Publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that executes as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for recording changes in state (UP, DOWN, FRAMEWORK_DOWN) of the various Local Manager nodes. Once the leader of a region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for selecting a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with its region name. The Local Manager contacts the global membership manager to obtain information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager subsequently communicates this local node as leader to the other local nodes and updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the new leader's name. This enables the fault-tolerant nature of the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.
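The leader bookkeeping described above can be sketched as follows. This is a minimal illustration, not the RFDMon implementation: the function names are hypothetical, but the on-disk format (one RegionName;LeaderName pair per line) follows the text.

```python
# Hypothetical sketch of the Global Membership Manager's leader map.

def load_leader_map(path):
    """Read region -> leader pairs from the semicolon-separated map file."""
    leaders = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and ";" in line:
                    region, leader = line.split(";", 1)
                    leaders[region] = leader
    except FileNotFoundError:
        pass  # first start: no leaders assigned yet
    return leaders

def get_or_assign_leader(leaders, region, requesting_node):
    """Reply with the existing leader, or promote the requesting node."""
    if region not in leaders:
        leaders[region] = requesting_node  # first registrant becomes leader
    return leaders[region]
```

On a GET_LEADER request, the manager would call `get_or_assign_leader` and persist the updated map back to disk so that a restarted manager instance can recover the state.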

Leader election is a very common problem in distributed computing infrastructure, where a single node is chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "leader election problem". Therefore, a leader election algorithm is required that can guarantee that there will be only one unique designated leader of the group and that every node knows whether it is the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module.

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to some error. A new instance of the Global Membership Manager recovers its data from the "REGIONAL_LEADER_MAP.txt" file. This enables the fault-tolerant nature of the framework with respect to the global membership manager.

A UML class diagram showing the relations among the above modules is given in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that are triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name              Period      Description
CPU Utilization          30 seconds  Aggregate utilization of all CPU cores on the machine
Swap Utilization         30 seconds  Swap space usage on the machine
RAM Utilization          30 seconds  Memory usage on the machine
Hard Disk Utilization    30 seconds  Disk usage on the machine
CPU Fan Speed            30 seconds  Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed    10 seconds  Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature          10 seconds  Temperature of the processor on the machine
Motherboard Temperature  10 seconds  Temperature of the motherboard on the machine
CPU Voltage              10 seconds  Voltage of the processor on the machine
Motherboard Voltage      10 seconds  Voltage of the motherboard on the machine
Network Utilization      10 seconds  Bandwidth utilization of each network card
Network Connection       30 seconds  Number of TCP connections on the machine
Heartbeat                30 seconds  Periodic liveness messages
Process Accounting       30 seconds  Periodic sensor that publishes the commands executed on the system
MPI Process Info         -1          Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info     -1          Aperiodic sensor that reports the performance data of the web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, either HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changing only slightly.
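The dead-band property above amounts to a simple filter in front of the publish call. The following sketch is illustrative only (the class name and structure are not taken from the RFDMon code base):

```python
# Minimal sketch of a sensor dead-band filter: suppress publication
# when the value has not moved by more than the configured dead band.

class DeadBandFilter:
    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        """True if the change since the last published value exceeds
        the dead band (the first sample is always published)."""
        if self.last_published is None or \
           abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False
```

A CPU utilization sensor with a dead band of 2 percent, for example, would publish a reading of 13.5 after 10.0 but stay silent for 11.0.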

The life cycle and function of a typical sensor ("CPU utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support three types of commands for publishing the data: START, STOP, and POLL. The START command starts an already initialized sensor so that it begins publishing measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command retrieves the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.

System Resource Utilization Monitoring Sensors

A number of sensors have been developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, have normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command and processes its output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
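As an illustration of the first sensor above, aggregate CPU utilization can be derived from two successive samples of the "cpu" line in /proc/stat. This is a hedged sketch of the standard computation; the exact fields and smoothing used by the RFDMon sensor may differ.

```python
# Sketch: aggregate CPU utilization from two /proc/stat samples.
# The "cpu" line lists jiffies: user nice system idle iowait irq softirq ...

def parse_cpu_line(stat_text):
    """Return (busy, total) jiffy counts from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(v) for v in line.split()[1:]]
            # treat idle + iowait (field 5, if present) as non-busy time
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            total = sum(fields)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")

def cpu_utilization(prev, curr):
    """Percent utilization between two successive (busy, total) samples."""
    busy = curr[0] - prev[0]
    total = curr[1] - prev[1]
    return 100.0 * busy / total if total else 0.0
```

In a real sensor, `parse_cpu_line(open("/proc/stat").read())` would be called once per period and the deltas published.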

Hardware Health Monitoring Sensors

In the proposed framework, various sensors have been developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, have normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the measurement of the motherboard voltage.
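Each of these sensors wraps an ipmitool invocation and extracts the numeric reading from its output. The sketch below is hypothetical: it assumes SDR-style output lines such as "CPU1 Temp | 45 degrees C | ok", but actual ipmitool output varies by version and hardware, so a real sensor must be validated against the target platform.

```python
# Hypothetical wrapper around an ipmitool sensor query.
import subprocess

def parse_ipmi_value(output):
    """Extract the first numeric reading from pipe-separated SDR output,
    e.g. 'CPU1 Temp | 45 degrees C | ok' -> 45.0. Returns None if no
    numeric reading is found."""
    for line in output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 2 and parts[1]:
            token = parts[1].split()[0]
            try:
                return float(token)
            except ValueError:
                continue
    return None

def read_ipmi_sensor(name, sdr_cache="SDR"):
    """Run the ipmitool command form used in the text and parse its output."""
    out = subprocess.check_output(
        ["ipmitool", "-S", sdr_cache] + name.split(), text=True)
    return parse_ipmi_value(out)
```

Using a pre-built SDR cache file (the "-S SDR" option mirrored from the text) avoids repeatedly scanning the sensor repository at each 10-second period.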

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a heartbeat sensor that periodically publishes its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the topic "NODE_HEALTH_INFO". If a leader node gets terminated, the other local nodes notify the global membership manager of the termination of the leader for their region through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
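The liveness check implied above, by which missed heartbeats translate into a node-state transition, can be sketched as follows. The class name and the two-period timeout are illustrative assumptions, not taken from the RFDMon implementation.

```python
# Sketch of leader-side liveness tracking: a node whose HEARTBEAT has
# not been seen within the timeout is reported as FRAMEWORK_DOWN.

HEARTBEAT_PERIOD = 30            # seconds, per Table 1
TIMEOUT = 2 * HEARTBEAT_PERIOD   # assumed grace period of two periods

class LivenessTracker:
    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}      # node name -> timestamp of last heartbeat

    def on_heartbeat(self, node, now):
        """Record a HEARTBEAT sample for a node."""
        self.last_seen[node] = now

    def down_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout; these
        would be published under NODE_HEALTH_INFO as FRAMEWORK_DOWN."""
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]
```

The regional leader would run such a check once per period and publish a NODE_HEALTH_INFO message for each newly missing node.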

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to record the same information in the centralized monitoring database. This sensor logs all the information upon a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master nodes of the cluster. The name of the scientific application and other arguments are passed as arguments when executing SciAppManager (e.g., mpirun -np N -machinefile worker.config SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is described in Figure 10.

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure for the performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness to faults of the developed monitoring framework. During these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute the same version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Managers on a few nodes are killed through the kill system call. Once a Local Manager at a node is killed, the Regional Leader reports the change in the node's state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization remains under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transition of various nodes between the states UP and FRAMEWORK_DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore, its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK_DOWN once their Local Managers were killed, at time samples 390 and 410 respectively. The Local Manager at ddshost1 was started again at time sample 440; its state was therefore updated to UP at the same time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the regional leader, a new leader is elected instantly, without further delay. This feature of the developed monitoring framework demonstrates that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults into the framework on the monitoring capabilities of the framework. In the most ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults (leader re-election and nodes leaving the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of regional leader re-election at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of parameters it monitors: system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon the data-centric publish-subscribe mechanism, which is extremely scalable in itself. Also, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outage in the network (if a Regional Leader node stops working). The framework can also self-configure (start, stop, and poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory areas or data structures of applications executing on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment, where a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL, for monitoring their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time in the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. Accepted for publication (in press).

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community.

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: Required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf.

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In Engineering of Autonomic and Autonomous Systems (EASe), IEEE International Workshop on, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor



can work in a distributed manner similar to the underlying system and report each event to the system administrator with maximum accuracy and minimum latency. Furthermore, the monitoring technique should not burden the performance of the computing system and should not affect the execution of its applications. The monitoring technique should also be able to identify faults in itself, isolate the faulty component, and correct it immediately for effective monitoring of the computing system. Distributed event-based systems (DEBS) are recommended for monitoring and management of distributed computing systems, reporting faults and system health through system alarms. Moreover, DEBS-based monitoring can provide a notion of system health in terms of fault/status signals, which helps in taking appropriate control actions to keep the system within a safe boundary of operation.

Contribution: In this paper, an event-based distributed monitoring approach, "RFDMon", is presented that utilizes the concepts of Data Distribution Services (DDS) for an effective exchange of monitoring measurements among the computing nodes. This monitoring framework is built upon the ACM ARINC-653 Component Framework [2] and an open source DDS implementation, OpenSplice [3]. The primary principles in the design of the proposed approach are static memory allocation for determinism; spatial and temporal isolation between the monitoring framework and real applications of different criticality; specification of and adherence to real-time properties, such as periodicity and deadlines, in the monitoring framework; and well-defined compositional semantics.

"RFDMon" is fault tolerant (against faults in the sensor framework), self-configuring (it can start and stop monitoring the nodes, and configure monitors with threshold values/changes for publishing the measurements), self-aware (aware of the execution of the framework on multiple nodes through the exchange of HEARTBEAT messages), extensive (it monitors multiple parameters through periodic and aperiodic sensors), resource constrained (resources consumed by monitors can be limited), expandable (extra monitors can be added on the fly), and applicable to heterogeneous infrastructure. "RFDMon" has been developed on the existing OMG CCM standard because it is one of the most widely used component models and software developers are already familiar with it.

"RFDMon" is divided into two parts: 1. a distributed sensor framework that monitors the various system, application, and performance parameters of the computing system and reports all measurements to a centralized database over an HTTP interface; and 2. a Ruby on Rails [4] based web service that shows the current state of the infrastructure by providing an interface to read the database and display its content in a graphical window in the web browser.

Outline

This paper is organized as follows. Preliminary concepts of the proposed approach are presented in Section 2, and issues in monitoring of distributed systems are highlighted in Section 3. Related distributed monitoring products are described in Section 4, while a detailed description of the proposed approach "RFDMon" is given in Section 5. Details of each sensor are presented in Section 6, while a set of experiments is described in Section 7. Applications of the proposed approach are mentioned in Section 9, and the major benefits of the approach are highlighted in Section 8. Future directions of the work are discussed in Section 10, and conclusions are presented in Section 11.

2 Preliminaries

The proposed monitoring approach "RFDMon" consists of two major modules: the Distributed Sensors Framework and the Infrastructure Monitoring Database. The Distributed Sensors Framework uses the Data Distribution Services (DDS) middleware standard for communication among the nodes of the distributed infrastructure; specifically, it uses the OpenSplice Community Edition [3]. It executes the DDS sensor framework on top of the ARINC Component Framework [5]. The Infrastructure Monitoring Database uses Ruby on Rails [4] to implement a web service that updates the database with monitoring data and displays the data in the administrator's web browser. In this section, the primary concepts of the publish-subscribe mechanism, OpenSplice DDS, ARINC-653, and Ruby on Rails are presented, which will be helpful in understanding the proposed monitoring framework.

2.1 Publish-Subscribe Mechanism

Publish-subscribe is a typical communication model for distributed infrastructures that host distributed applications. Various nodes can communicate with each other by sending (publishing) data and receiving (subscribing to) data anonymously through the communication channel specified in the infrastructure. A publisher or subscriber needs only the name and definition of the data in order to communicate. Publishers do not need any information about the location or identity of the subscribers,


Figure 1: Publish-Subscribe Architecture

and vice versa. Publishers are responsible for collecting data from the application, formatting it as per the data definition, and sending it out of the node to all registered subscribers over the publish-subscribe domain. Similarly, subscribers are responsible for receiving data from the publish-subscribe domain, formatting it as per the data definition, and presenting the formatted data to the intended applications (see Figure 1). All communication is performed through a DDS domain. A domain can have multiple partitions, and each partition can contain multiple topics. Topics are published and subscribed to across the partitions. Partitioning is used to group similar types of topics together; it also gives an application the flexibility to receive data from a selected set of data sources [6]. The publish-subscribe mechanism overcomes the typical shortcomings of the client-server model, where clients and servers are coupled together for the exchange of messages. In the client-server model, a client must know the location of the server to exchange messages, whereas in the publish-subscribe mechanism, clients need to know only the type of the data and its definition.
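The decoupling described above can be sketched in a few lines of Python. The `Domain` class and topic name below are purely illustrative, not part of RFDMon or any DDS API: publishers and subscribers are matched only by topic name, never by identity or location.

```python
# Minimal sketch of the publish-subscribe pattern: a domain routes samples
# by topic name, so publishers never learn who (if anyone) receives them.
from collections import defaultdict

class Domain:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic name -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, sample):
        # Deliver the sample to every subscriber of this topic only.
        for callback in self._subscribers[topic]:
            callback(sample)

domain = Domain()
received = []
domain.subscribe("MONITORING_INFO", received.append)
domain.publish("MONITORING_INFO", {"node": "node1", "cpu": 42.0})
```

A publisher of "MONITORING_INFO" here needs no knowledge of the list of subscribers, which is exactly the decoupling that removes the single point of failure of the client-server model.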

2.2 OpenSplice DDS

OpenSplice DDS [3] is an open source community edition of the Data Distribution Service (DDS) specification defined by the Object Management Group (http://www.omg.org). This specification is primarily used for the communication needs of distributed real-time systems. Distributed applications use DDS as an interface for the "Data-Centric Publish-Subscribe" (DCPS) communication mechanism (see Figure 2). "Data-centric" communication provides the flexibility to specify QoS parameters for the data depending upon its type, availability, and criticality.

Figure 2: Data Distribution Services (DDS) Architecture

These QoS parameters include the rate of publication, rate of subscription, data validity period, etc. DCPS gives developers the flexibility to define different QoS requirements for the data and to take control of each message, so that they can concentrate on the handling of data instead of the transfer of data. Publishers and subscribers use the DDS framework to send and receive data, respectively. DDS can be combined with any communication interface for communication among the distributed nodes of an application. The DDS framework handles all communication among publishers and subscribers as per the QoS specifications. In the DDS framework, each message is associated with a special data type called a topic. A subscriber registers to one or more topics of its interest, and DDS guarantees that the subscriber will receive messages only from the topics it subscribed to. In DDS, a host is allowed to act as a publisher for some topics and simultaneously as a subscriber for others (see Figure 3).

Benefits of DDS

The primary benefit of using the DDS framework for our proposed monitoring framework "RFDMon" is that DDS is based upon a publish-subscribe mechanism that decouples the senders and receivers of the data, so there is no single point of bottleneck or failure in communication. Additionally, the DDS communication mechanism is highly scalable in the number of nodes and supports auto-discovery of nodes. DDS ensures data delivery with minimum overhead and efficient bandwidth utilization.


Figure 3: Data Distribution Services (DDS) Entities

Figure 4: ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real-time operating systems (RTOS) used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. One or more partitions are grouped to form a module (i.e., a processor), while one or more modules form a system. Each partition is allocated a predetermined chunk of memory.

Spatial partitioning [7] ensures exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in one partition cannot corrupt or destroy the data structures of processes executing in other partitions. This space partitioning is useful for separating low-criticality vehicle management components from safety-critical flight control components in avionics systems [2]. Memory management hardware ensures memory protection by allowing a process to access only the part of the memory that belongs to the partition hosting that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule by which the different partitions use the processing resources. This fixed periodic schedule is generated or supplied to the RTOS in advance for sharing the resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the schedule. Additionally, it guarantees that a partition's execution is interrupted once its execution interval has finished, that the partition is then placed into a dormant state, and that the next partition in the schedule is allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can be created only during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real-time properties (priority, periodicity, duration, soft/hard deadline, etc.) of processes and partitions during their creation. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message-passing queue, and blackboards, which allow reading, writing, and clearing a single data item. Two different partitions communicate using inter-partition services that use ports and channels for sampling and queueing of messages. Synchronization of processes within the same partition is performed through semaphores and events [7].
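As an illustration of the two intra-partition services just described, the Python sketch below mimics their semantics: a queue for Buffers and a single overwritable slot for Blackboards. This is not the ARINC-653 APEX API; the class and method names are chosen for clarity only.

```python
# Illustrative semantics of ARINC-653 intra-partition communication:
# a Buffer queues messages (FIFO), a Blackboard holds one message that
# can be read, written (overwritten), and cleared.
from collections import deque

class Buffer:
    def __init__(self):
        self._queue = deque()

    def send(self, msg):
        self._queue.append(msg)       # enqueue a message

    def receive(self):
        return self._queue.popleft()  # dequeue in FIFO order

class Blackboard:
    def __init__(self):
        self._msg = None

    def display(self, msg):
        self._msg = msg               # write replaces the single stored message

    def read(self):
        return self._msg

    def clear(self):
        self._msg = None
```

Note the key difference: every message sent to a Buffer is eventually received exactly once, while a Blackboard always exposes only the most recently written message.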

2.4 ARINC-653 Emulation Library

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux-based implementation of the ARINC-653 interface specifications for intra-partition process communication, including Blackboards and Buffers. Buffers provide a queue for passing messages, and Blackboards enable processes to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. This library also provides process and time management services as described in the ARINC-653 specification.

Figure 5: Specification of Partition Scheduling in ACM Framework

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies its rate of execution; the partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the CPU time allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module; in other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions, and the length of the minor frame is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.
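The hyperperiod computation described above is a least-common-multiple calculation, as the following sketch shows. The periods are example values (e.g. in milliseconds), not taken from the paper:

```python
# Hyperperiod = LCM of all partition periods, so every partition runs an
# integral number of times per hyperperiod.
from math import gcd
from functools import reduce

def hyperperiod(periods):
    # lcm(a, b) = a * b / gcd(a, b), folded over all partition periods
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

print(hyperperiod([50, 100, 200]))  # -> 200
```

With periods of 50, 100, and 200 ms, the hyperperiod is 200 ms, within which the first partition runs four times, the second twice, and the third once.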

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each specified by an offset from the start of the hyperperiod and a duration.

Figure 6: Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.
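The validity check described above can be sketched as follows. The `(offset, duration)` pair representation of a scheduling window is an assumption made for illustration, matching the description in the module configuration paragraph:

```python
# A schedule is valid when every partition window fits inside the
# hyperperiod and no two windows overlap.
def schedule_is_valid(hyperperiod, windows):
    # windows: list of (offset, duration) pairs, one per partition slot
    spans = sorted((off, off + dur) for off, dur in windows)
    if any(end > hyperperiod for _, end in spans):
        return False  # a window runs past the end of the hyperperiod
    # after sorting by offset, each window must end before the next begins
    return all(prev_end <= next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))

print(schedule_is_valid(200, [(0, 50), (50, 100), (150, 50)]))  # True
print(schedule_is_valid(200, [(0, 60), (50, 100)]))             # False (overlap)
```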

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing a web application among three components:

1. Model: contains the data definitions and manipulation rules for the data, and maintains the state of the application by storing data in the database.

2. View: contains the presentation rules for the data and handles the visualization requests made by the application as per the data present in the model. The view never handles the data itself but interacts with users for the various ways of inputting data.

3. Controller: contains the rules for processing user input and for testing, updating, and maintaining the data. A user can change the data through a controller, while views update the visualization of the data to reflect the changes.

Figure 6 presents the sequence of events when a Rails application is accessed over the web interface. In a Rails application, an incoming client request is first sent to a router, which finds the location of the application and parses the request to find the controller (and the method inside the controller) that will handle it. The method inside the controller can inspect the data of the request, interact with the model if needed, and invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the client's browser.

In the proposed monitoring framework "RFDMon", a


web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI- and PBS-related data for scientific applications, web application performance, and process accounting. The schema of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an online manner, where an extensive list of parameters is monitored and the measurements are utilized by online control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across multiple nodes to understand the current behavior of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique suffers from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues in distributed monitoring concerns understanding observations that are not directly related to the behavior of the system, and measurements that must be reported in a particular order [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any faults, instability, or illegal behavior in the system under observation, and will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed so that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented along with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems that scales easily with the size of the cluster or grid. Ganglia is built upon the concept of a hierarchical federation of clusters: multiple nodes are grouped as a cluster, which is attached to a module, and multiple clusters are again grouped under a monitoring module. Ganglia uses a multicast-based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy so that higher-level modules can maintain the existence (membership) of lower-level modules. Various nodes and applications multicast their monitored resources or metrics to all other nodes; in this way, each node has an approximate picture of the complete cluster at all instants of time. Aggregation of the lower-level modules is done through polling of the child nodes in the tree, and the approximate picture of the lower level is transferred to the higher level in the form of monitoring data over TCP connections. Two daemons implement the Ganglia monitoring system. The Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report its readings and respond to requests sent from a Ganglia client. The Ganglia Meta Daemon ("gmetad") executes one level above the local nodes and aggregates the state of the cluster. Finally, multiple gmetad daemons cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, each node containing the aggregate picture of the cluster, sound design principles, easy portability, and easy manageability.


Figure 7: Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is based upon an agent/server architecture, where agents report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-ins: small modules placed on the computing nodes, configured to monitor a resource and send the readings to the "Nagios" server module over the SNMP interface. GUI: a web-based interface that presents the situation of the distributed monitoring system through various colourful buttons, sounds, and graphs. Events reported by the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure with a comprehensive and flexible approach to monitoring as per enterprise requirements. It is an agentless monitoring approach in which the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (ZenPacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI for monitoring servers placed in different geographical locations, with a detailed geographical view of the infrastructure.


Nimsoft Monitoring Solution (NMS) [15] offers a lightweight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among applications. Applications residing in the infrastructure publish data over the message bus, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a hub) and are attached to the message bus. Monitoring actions are performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed on each managed device. NMS also provides an alarm server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, they do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or extension of the monitoring approach with new parameters while the framework is executing. In addition, these monitoring approaches are standalone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurate monitoring of a computing system with periodic tasks, especially when the computing system is busy running other applications instead of the monitoring sensors. [18] presents a feedback-based approach for compensating for timing jitter without any changes to the operating system kernel. This approach has low overhead, is platform independent, and maintains bounded total jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
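As a rough illustration of feedback-based jitter compensation, the sketch below sleeps until the next absolute deadline rather than for a fixed interval, so a late wake-up shortens the following sleep instead of accumulating drift. This is a simplification of the cited approach, not its implementation:

```python
# Periodic execution against absolute deadlines: the error from a late
# iteration feeds back into the next sleep, bounding accumulated jitter.
import time

def run_periodic(task, period_s, iterations):
    next_deadline = time.monotonic()
    for _ in range(iterations):
        task()
        next_deadline += period_s
        # Remaining time until the absolute deadline; a late iteration
        # yields a shorter (or zero) sleep rather than a shifted schedule.
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
```

Sleeping for a fixed `period_s` after each task instead would let every overrun push all later activations back permanently; anchoring to absolute deadlines is what keeps the jitter bounded.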

The primary goal of this paper is to present an event-based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the Data Distribution Service (DDS) methodology to report events and monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, each with a period and duration; these attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured with real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. They are lightweight processes that monitor a device on a computing node and read it periodically or aperiodically to obtain measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: system resource utilization monitoring sensors, hardware health monitoring sensors, scientific application health monitoring sensors, and web application performance monitoring sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary units of information exchange in the DDS domain. Details about the type of a topic (its structure definition) and the key values (keylist) used to identify the different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of Sensor Framework

1. MONITORING_INFO: System resource and hardware health monitoring sensors publish measurements under this topic. Its messages contain the node name, sensor name, sequence id, and measurement information.

2. HEARTBEAT: The heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. Its messages contain the node name and sequence id fields. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE_HEALTH_INFO: When the leader node (defined in Section 5.6) detects a change in the state (UP, DOWN, FRAMEWORK_DOWN) of any node through a change in its heartbeat, it publishes a NODE_HEALTH_INFO message to notify all other nodes of the change in status. Its messages contain the node name, severity level, node state, and transition type.

4. LOCAL_COMMAND: This topic is used by the leader to send control commands to the other local nodes to start, stop, or poll the monitoring sensors for measurements. Its messages contain the node name, sensor name, command, and sequence id.

5. GLOBAL_MEMBERSHIP_INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. Its messages contain the sender node name, target node name, message type (GET_LEADER, LEADER_ACK, LEADER_EXISTS, LEADER_DEAD), region name, and leader node name fields.

6. PROCESS_ACCOUNTING_INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. Its messages contain the node name, sensor name, and process accounting record.

7. MPI_PROCESS_INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and the MPI/PBS information variables of MPI processes running on the computing node.


These MPI or PBS variables contain the process ids, number of processes, and other environment variables. Messages of this topic contain the node name, sensor name, and MPI process record fields.

8. WEB_APPLICATION_INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of Java threads, and pending requests inside the system.
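In RFDMon the topic structures themselves are declared in an IDL file; the Python dataclass below merely illustrates the fields listed above for the MONITORING_INFO topic, with assumed field names and types:

```python
# Illustrative (non-IDL) rendering of the MONITORING_INFO topic structure:
# node name and sensor name identify the topic instance (key fields),
# the sequence id detects lost or out-of-order samples.
from dataclasses import dataclass

@dataclass
class MonitoringInfo:
    node_name: str      # key field: which node published the sample
    sensor_name: str    # key field: which sensor produced the sample
    sequence_id: int    # monotonically increasing per publisher
    measurements: str   # measurement payload, e.g. "cpu=42.0"

sample = MonitoringInfo("node1", "CPU_UTILIZATION", 7, "cpu=42.0")
```

In DDS terms, the key fields would be listed in the topic's keylist so that each (node, sensor) pair forms a distinct topic instance.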

5.3 Topic Managers

Topic managers are classes that create a subscriber or publisher for a predefined topic. Publishers publish the data received from various sensors under the topic name; subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous and are combined only logically; they can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The local manager is a module that executes as an agent on each computing node of the monitoring framework. Its primary responsibility is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple local manager nodes that belong to the same region, one local manager node is selected to update the centralized monitoring database with the sensor measurements from each local manager. The leader of the region is also responsible for updating the changes in state (UP, DOWN, FRAMEWORK_DOWN) of the various local manager nodes. Once the leader of a region dies, a new leader is selected for the region. Selection of the leader is done by the global membership manager module, as described in the next subsection.

5.7 Global Membership Manager

The global membership manager module is responsible for maintaining the membership of each node in its particular region. This module is also responsible for the selection of a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with its region name. The local manager contacts the global membership manager to obtain information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as the leader. The global membership manager communicates this local node as the leader to other local nodes in the future and updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in a semicolon-separated format (RegionName;LeaderName). When a local node reports to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader's name. This enables fault tolerance in the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.
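The bookkeeping described above can be sketched as follows, using the semicolon-separated RegionName;LeaderName format of "REGIONAL_LEADER_MAP.txt". The class is illustrative, not RFDMon source code, and shows the default policy described in the text (first registrant becomes leader, and the node reporting a dead leader is promoted):

```python
# Sketch of the global membership manager's leader bookkeeping per region.
class GlobalMembershipManager:
    def __init__(self):
        self.leaders = {}  # region name -> leader node name

    def get_leader(self, region, node):
        # If the region has no leader yet, the requesting node is assigned;
        # otherwise the existing leader's name is returned.
        return self.leaders.setdefault(region, node)

    def leader_dead(self, region, reporting_node):
        # Default choice: the node reporting the dead leader becomes leader.
        self.leaders[region] = reporting_node
        return reporting_node

    def dump(self):
        # Persisted format: one "RegionName;LeaderName" entry per line.
        return "\n".join(f"{r};{l}" for r, l in self.leaders.items())

gmm = GlobalMembershipManager()
print(gmm.get_leader("rackA", "node1"))   # node1 assigned as leader
print(gmm.get_leader("rackA", "node2"))   # node1 (leader already exists)
print(gmm.leader_dead("rackA", "node2"))  # node2 promoted to leader
```

Persisting `dump()` to disk is what lets a freshly restarted manager instance recover the leader map, as described below.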

Leader election is a very common problem in distributed computing infrastructure, where a single node is chosen from a group of competing nodes to perform a centralized task; this is often termed the "Leader Election Problem". A leader election algorithm is therefore required that can guarantee that there will be only one unique designated leader of the group and that every other node knows whether it is the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region or the first node notifying the Global Membership Manager about the termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the Global Membership Manager module for leader election.

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to an error. A new instance of the Global Membership Manager receives its data from the "REGIONAL LEADER MAP.txt" file. This enables the fault-tolerant nature of the framework with respect to the Global Membership Manager.

A UML class diagram showing the relations among the above modules is shown in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications in execution on the node. These sensors can be periodic or aperiodic depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar)

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat")

Sensor Name              Period      Description
CPU Utilization          30 seconds  Aggregate utilization of all CPU cores on the machine
Swap Utilization         30 seconds  Swap space usage on the machine
RAM Utilization          30 seconds  Memory usage on the machine
Hard Disk Utilization    30 seconds  Disk usage on the machine
CPU Fan Speed            30 seconds  Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed    10 seconds  Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature          10 seconds  Temperature of the processor on the machine
Motherboard Temperature  10 seconds  Temperature of the motherboard on the machine
CPU Voltage              10 seconds  Voltage of the processor on the machine
Motherboard Voltage      10 seconds  Voltage of the motherboard on the machine
Network Utilization      10 seconds  Bandwidth utilization of each network card
Network Connection       30 seconds  Number of TCP connections on the machine
Heartbeat                30 seconds  Periodic liveness messages
Process Accounting       30 seconds  Periodic sensor that publishes the commands executed on the system
MPI Process Info         -1          Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info     -1          Aperiodic sensor that reports the performance data of the web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors)

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the value of a sensor measurement is unchanged or changing only slightly.
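The dead-band check above amounts to a small stateful filter in front of each publish call. A minimal sketch (illustrative names, not the framework's code):

```python
class DeadBandFilter:
    """Suppress publications whose change from the last published
    value does not exceed the configured dead band."""

    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        # Always publish the first measurement; afterwards publish only
        # when the change exceeds the dead band, and remember that value.
        if self.last_published is None or abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False
```

Note that the comparison is against the last *published* value, not the last sampled one, so a slow drift still gets reported once it accumulates past the dead band.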

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support three types of commands: START, STOP, and POLL. The START command makes an already initialized sensor start publishing its measurements, the STOP command stops the sensor thread from publishing measurement data, and the POLL command retrieves the latest measurement from the sensor. These sensors publish the data under a predefined topic to the DDS domain (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.
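The START/STOP/POLL command handling of a periodic sensor can be sketched with a worker thread, as below. This is a stand-in, not the framework's implementation: the class name, the callback parameters, and the use of a Python thread in place of an ARINC-653 process are all assumptions.

```python
import threading

class PeriodicSensor:
    """Minimal sketch of a periodic sensor's command interface."""

    def __init__(self, name, period, read_fn, publish_fn):
        self.name, self.period = name, period
        self.read_fn = read_fn          # collects one measurement
        self.publish_fn = publish_fn    # stand-in for publishing to the DDS domain
        self._stop = threading.Event()
        self._thread = None
        self.latest = None

    def start(self):
        # START: begin publishing measurements at the configured period
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self.latest = self.read_fn()
            self.publish_fn(self.name, self.latest)
            self._stop.wait(self.period)   # sleep, but wake early on STOP

    def stop(self):
        # STOP: stop the sensor thread from publishing further measurements
        self._stop.set()
        if self._thread:
            self._thread.join()

    def poll(self):
        # POLL: return the most recent measurement on demand
        return self.latest
```

In the real framework the periodicity and deadline of this loop are enforced by the ARINC-653 emulator rather than by the sensor itself.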

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, carry normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of the MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux file system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
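As an illustration of the first sensor above, CPU utilization can be derived from two snapshots of the aggregate "cpu" line in /proc/stat. The paper does not give the exact computation, so the busy/idle split below (idle plus iowait counted as idle time) is an assumption based on the standard /proc/stat field layout:

```python
def parse_cpu_line(stat_text):
    """Extract the aggregate 'cpu' counters (in jiffies) from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate cpu line found")

def cpu_utilization(prev_text, curr_text):
    """Percent CPU busy between two /proc/stat snapshots.

    Field order is user, nice, system, idle, iowait, ...; everything
    except idle and iowait is counted as busy time here.
    """
    prev, curr = parse_cpu_line(prev_text), parse_cpu_line(curr_text)
    delta = [c - p for c, p in zip(curr, prev)]
    total = sum(delta)
    idle = delta[3] + (delta[4] if len(delta) > 4 else 0)
    return 100.0 * (total - idle) / total if total else 0.0
```

A real sensor would read the file twice, one sampling period apart, and publish the resulting percentage.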

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, carry normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes notify the Global Membership Manager of the termination of their region's leader through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The Global Membership Manager elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
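How HEARTBEAT messages translate into UP/FRAMEWORK DOWN transitions can be sketched as below. The 30-second heartbeat period comes from Table 1; the number of missed beats tolerated before declaring a node down is an assumption, as the paper does not specify it.

```python
HEARTBEAT_PERIOD = 30   # seconds, per Table 1
MISSED_LIMIT = 2        # assumption: beats missed before declaring a node down

class HeartbeatTracker:
    """Illustrative sketch of how a Regional Leader could derive node
    states from the timestamps of received HEARTBEAT messages."""

    def __init__(self):
        self.last_seen = {}   # node name -> timestamp of last heartbeat

    def on_heartbeat(self, node, now):
        self.last_seen[node] = now

    def states(self, now):
        # A node is UP if its last heartbeat is recent enough,
        # otherwise its state transitions to FRAMEWORK_DOWN.
        cutoff = MISSED_LIMIT * HEARTBEAT_PERIOD
        return {node: ("UP" if now - seen <= cutoff else "FRAMEWORK_DOWN")
                for node, seen in self.last_seen.items()}
```

The leader would write these transitions to the centralized database and announce them under the NODE HEALTH INFO topic.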

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update the same information in the centralized monitoring database. This sensor logs all the information on a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. The MPI run command should be issued to execute this SciAppManager application from the master nodes in the cluster. The name of the scientific application and its other arguments are passed as arguments while executing the SciAppManager application (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data per the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
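The hand-off between SciAppManager and the sensor can be sketched as below. Note the hedges: the real framework uses a per-node POSIX message queue, for which the Python standard library has no binding, so `queue.Queue` is used as a stand-in; the JSON record layout is an assumption, as the paper only names the states.

```python
import json, queue

# Stand-in for the per-node POSIX message queue used by the framework.
mq = queue.Queue()

def sci_app_manager_report(app_name, pid, state):
    """SciAppManager side: write a state-change record (Started, Killed,
    Ended) for the wrapped scientific application to the queue."""
    mq.put(json.dumps({"app": app_name, "pid": pid, "state": state}))

def mpi_process_sensor_drain(publish_fn):
    """Sensor side: log each queued message as soon as it arrives and
    publish it under the MPI_PROCESS_INFO topic (publish_fn is a stand-in
    for the DDS publish call)."""
    while not mq.empty():
        record = json.loads(mq.get())
        publish_fn("MPI_PROCESS_INFO", record)
```

Because the queue decouples the two processes, the aperiodic sensor only runs when a state change actually occurs, matching its -1 period in Table 1.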

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure of performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data per the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1 to ddsnode4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Managers on a few nodes are killed through the kill system call. Once a Local Manager at a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "TOP" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization stays under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Observations from Figures 14 and 15

The transitions of various nodes between the states UP and FRAMEWORK DOWN are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore, its state is updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 are also updated to FRAMEWORK DOWN once their Local Managers are killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 is started again at time sample 440; therefore, its state is updated to UP at the same time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 is killed at time sample 310 (see Figure 14), ddsnode4 is elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 is killed at time sample 520 (see Figure 14), ddshost1 is again elected as the leader of the region. Finally, when the Local Manager at ddshost1 is killed again at time sample 660 (see Figure 14), ddsnode2 is elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework exhibits that the proposed approach is robust with respect to failure of the Regional Leader and that it can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced on this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the most ideal scenario, all monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor on ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at Node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. Also, new sensors can easily be added to increase the number of monitoring parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial network outages (e.g., if a Regional Leader node stops working). The framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications in execution on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment. A hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL for monitoring their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and can easily be combined with a performance management system due to its flexible nature. New sensors can be started at any time in the monitoring framework, and new sets of publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia Monitoring System. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor



Figure 1: Publish-Subscribe Architecture

and vice versa. Publishers are responsible for collecting the data from the application, formatting it as per the data definition, and sending it out of the node to all registered subscribers over the publish-subscribe domain. Similarly, subscribers are responsible for receiving the data from the publish-subscribe domain, formatting it as per the data definition, and presenting the formatted data to the intended applications (see Figure 1). All communication is performed through the DDS domain. A domain can have multiple partitions, and each partition can contain multiple topics. Topics are published and subscribed to across the partitions. Partitioning is used to group similar types of topics together; it also provides flexibility for an application to receive data from a set of data sources [6]. The publish-subscribe mechanism overcomes the typical shortcomings of the client-server model, in which clients and servers are coupled together for the exchange of messages. In the client-server model, a client must have information regarding the location of the server in order to exchange messages, whereas in the publish-subscribe mechanism clients only need to know the type of the data and its definition.
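The decoupling described above can be illustrated with a minimal in-process topic registry. This is a teaching sketch of the publish-subscribe pattern, not the DDS API (which adds partitions, QoS policies, and network transport on top of this idea):

```python
from collections import defaultdict

class PubSubDomain:
    """Minimal in-process sketch of topic-based publish-subscribe."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # topic name -> callbacks

    def subscribe(self, topic, callback):
        # A subscriber only needs the topic name and data definition,
        # not the publisher's location.
        self.subscribers[topic].append(callback)

    def publish(self, topic, data):
        # The domain delivers the sample to every subscriber of the topic;
        # the publisher never addresses receivers directly.
        for callback in self.subscribers[topic]:
            callback(data)
```

Publishers and subscribers here never hold references to each other, which is exactly the property that removes the client-server coupling discussed above.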

2.2 OpenSplice DDS

Figure 2: Data Distribution Services (DDS) Architecture

OpenSplice DDS [3] is an open source community edition of the Data Distribution Service (DDS) specification defined by the Object Management Group (http://www.omg.org). These specifications are primarily used for the communication needs of distributed real-time systems. Distributed applications use DDS as an interface for the "Data-Centric Publish-Subscribe" (DCPS) communication mechanism (see Figure 2). "Data-centric" communication provides the flexibility to specify QoS parameters for the data depending upon its type, availability, and criticality. These QoS parameters include the rate of publication, rate of subscription, data validity period, etc. DCPS gives developers the flexibility to define different QoS requirements for the data to take control of each message, so they can concentrate on the handling of data instead of the transfer of data. Publishers and subscribers use the DDS framework to send and receive the data, respectively. DDS can be combined with any communication interface for communication among the distributed nodes of an application. The DDS framework handles all of the communication among publishers and subscribers as per the QoS specifications. In the DDS framework, each message is associated with a special data type called a topic. A subscriber registers to one or more topics of its interest, and DDS guarantees that the subscriber will receive messages only from the topics to which it subscribed. In DDS, a host is allowed to act as a publisher for some topics and simultaneously as a subscriber for others (see Figure 3).

Benefits of DDS

The primary benefit of using the DDS framework for our proposed monitoring framework "RFDMon" is that DDS is based upon a publish-subscribe mechanism that decouples the senders and receivers of the data, so there is no single point of bottleneck or failure in communication. Additionally, the DDS communication mechanism is highly scalable in the number of nodes and supports auto-discovery of the nodes. DDS ensures data delivery with minimum overhead and efficient bandwidth utilization.


Figure 3: Data Distribution Services (DDS) Entities

Figure 4: ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real-time operating systems (RTOS) that are used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. One or more partitions are grouped to form a module (i.e., a processor), while one or more modules form a system. These partitions are allocated predetermined chunks of memory.

Spatial partitioning [7] ensures exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in one partition cannot corrupt or destroy the data structures of processes executing in other partitions. This space partitioning is useful to separate the low-criticality vehicle management components from the safety-critical flight control components in avionics systems [2]. Memory management hardware ensures memory protection by allowing a process to access only the part of the memory that belongs to the partition hosting that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule for the use of processing resources by the different partitions. This fixed periodic schedule is generated or supplied to the RTOS in advance for the sharing of resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the scheduling scheme. Additionally, it guarantees that the partition's execution will be interrupted once the partition's execution interval has finished, that the partition will be placed into a dormant state, and that the next partition as per the scheduling scheme will be allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can only be created during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real-time properties (priority, periodicity, duration, soft/hard deadline, etc.) of the processes and partitions at creation time. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message passing queue, and blackboards, for reading, writing, and clearing a single data storage. Two different partitions communicate using inter-partition services that use ports and channels for sampling and queueing of messages. Synchronization of processes within the same partition is performed through semaphores and events [7].

2.4 ARINC-653 Emulation Library

Figure 5: Specification of Partition Scheduling in the ACM Framework

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux based implementation of the ARINC-653 interface specifications for intra-partition process communication, including Blackboards and Buffers. Buffers provide a queue for passing messages and Blackboards enable processes to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. This library also provides process and time management services as described in the ARINC-653 specification.

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies the rate of execution. The partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the time of the CPU allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module; in other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions, and its length is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.
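The hyperperiod computation described above (least common multiple of all partition periods) can be sketched as follows; the partition periods used in the example are hypothetical, not values from the framework:

```python
from math import gcd
from functools import reduce

def hyperperiod(periods):
    """Least common multiple of all partition periods: the major frame."""
    lcm = lambda a, b: a * b // gcd(a, b)
    return reduce(lcm, periods)

# Hypothetical partition periods in seconds; each partition then
# executes one or more times within every hyperperiod.
print(hyperperiod([2, 3, 4]))  # -> 12
```

With these example periods, partitions repeat 6, 4, and 3 times respectively within each 12-second major frame.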

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each of which is specified with an offset from the start of the hyperperiod and a duration.

Figure 6: Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components:

1. Model: contains the data definition and manipulation rules for the data. The model maintains the state of the application by storing data in the database.

2. View: contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model. The view never handles the data, but interacts with users for various ways of inputting the data.

3. Controller: contains rules for processing the user input, and for testing, updating, and maintaining the data. A user can change the data by using the controller, while views update the visualization of the data to reflect the changes.

Figure 6 presents the sequence of events while accessing a Rails application over the web interface. In a Rails application, an incoming client request is first sent to a router that finds the location of the application and parses the request to find the corresponding controller (and the method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, interact with the model if needed, and invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the browser of the client.

In the proposed monitoring framework "RFDMon", a web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. The schema of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an on-line manner, where an extensive list of parameters is monitored and the measurements are utilized by on-line control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across multiple nodes to understand the current behavior of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique suffers from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring concerns observations that are not directly related to the behavior of the system, and measurements that require a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any faults, instability, or illegal behavior into the system under observation, and that it will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented along with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and openNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems which is easily scalable with the size of the cluster or the grid. Ganglia is built upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multi-cast based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multi-cast addresses) within the hierarchy to maintain the existence (membership) of lower level modules by the higher level. Various nodes and applications multi-cast their monitored resources or metrics to all of the other nodes. In this way, each node has an approximate picture of the complete cluster at all instances of time. Aggregation of the lower level modules is done through polling of the child nodes in the tree. The approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. There are two daemons that implement the Ganglia monitoring system. The Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and to respond to the requests sent from the Ganglia client. The Ganglia Meta Daemon ("gmetad") executes at one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, the aggregate picture of the cluster held at each node, sound design principles, easy portability, and easy manageability.


Figure 7: Schema of the Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is based upon an agent/server architecture, where agents can report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-in: small modules placed on the computing nodes, configured to monitor a resource and then send the readings to the "Nagios" server module over the SNMP interface. GUI: a web based interface that presents the situation of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure that takes a comprehensive and flexible approach to monitoring as per the requirements of the enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI for monitoring servers placed in different geographic locations, offering a detailed geographical view of the infrastructure.


The Nimsoft Monitoring Solution [15] (NMS) offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among the applications. Applications residing in the infrastructure publish data over the message bus, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called HUB) and are attached to the message bus. The monitoring action is performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed on each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or extension of the monitoring approach with new parameters in an already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurate monitoring of a computing system for periodic tasks when the computing system is too busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating for timing jitter without any change to the operating system kernel. This approach has low overhead, is platform independent, and maintains total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.

The primary goal of this paper is to present an event based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the Data Distribution Service (DDS) methodology to report the events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the Framework

As described in previous sections, the proposed monitoring framework is based upon a data centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. These are lightweight processes that monitor a device on a computing node and read it periodically or aperiodically to get the measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (structure definition) and the key values (keylist) used to identify the different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of the Sensor Framework

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under the monitoring info topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

2. HEARTBEAT: The Heartbeat Sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes which are listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes a NODE HEALTH INFO message to notify all of the other nodes of the change in status of that node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for the selection of a leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and the MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process ids, the number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a pre-defined topic. These publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for recording the changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for the selection of a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with its region name. The Local Manager contacts the global membership manager to get the information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader exists already) or assigns the new node as leader. The global membership manager will then communicate the same local node as leader to other local nodes in the future, and it updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in a semicolon separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader name. This enables the fault tolerant nature of the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing systems, where a single node must be chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "Leader Election Problem". Therefore, a leader election algorithm is required which can guarantee that there will be only one unique designated leader of the group and that all other nodes will know whether they are the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region or the first node notifying the global membership manager about the termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module.

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to some error. The new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This enables the fault tolerant nature of the framework with respect to the global membership manager.

A UML class diagram showing the relations among the above modules is given in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI and PBS related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process states and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

Table 1: List of Monitoring Sensors

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machines
Swap Utilization          30 seconds   Swap space usage on the machines
RAM Utilization           30 seconds   Memory usage on the machines
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machines
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machines
CPU Voltage               10 seconds   Voltage of the processor on the machines
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machines
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machines
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports changes in the state of the MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of the web application

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changing only slightly.
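The dead-band rule described above can be sketched as a small filter; the class name and threshold are illustrative assumptions, not the framework's actual implementation:

```python
class DeadBandFilter:
    """Suppress sensor publications whose value has changed by less than
    the configured dead band since the last published value."""

    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        # Always publish the first reading; afterwards publish only when
        # the change since the last published value exceeds the dead band.
        if self.last_published is None or \
           abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False
```

For example, with a dead band of 5 on a CPU utilization stream, a jump from 50% to 52% would be suppressed while a jump to 60% would be published.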

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support three types of commands: START, STOP, and POLL. The START command makes an already initialized sensor start publishing its measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command tries to get the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.

System Resource Utilization Monitoring Sensors

A number of sensors have been developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, have normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo, etc.) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of the MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux file system and processes its output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
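The /proc based readings listed above can be sketched as follows. This is a minimal illustration, assuming a Linux host; the function names are hypothetical and the actual sensors may compute the percentages differently:

```python
def cpu_times():
    """Read the aggregate CPU counters from the 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(v) for v in fields]

def cpu_utilization(prev, curr):
    """Percent of non-idle time between two /proc/stat samples.
    Field 3 (0-indexed) of the 'cpu' line is the idle time."""
    idle = curr[3] - prev[3]
    total = sum(curr) - sum(prev)
    return 100.0 * (total - idle) / total if total else 0.0

def ram_utilization():
    """Percent of memory in use, derived from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return 100.0 * (info["MemTotal"] - info["MemFree"]) / info["MemTotal"]
```

A periodic CPU sensor would keep the previous `cpu_times()` sample and publish `cpu_utilization(prev, curr)` once per period.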

Hardware Health Monitoring Sensors

In the proposed framework, various sensors have been developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, have normal priority, and read the measurements over the vendor supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux file system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command on the Linux file system to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command on the Linux file system to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command on the Linux file system to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command on the Linux file system to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. The local nodes of a region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region will update the monitoring database with the node state transition values and will notify the other nodes of the termination of that local node through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes will notify the global membership manager of the termination of the leader for their region through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager will elect a new leader for the region and notify the local nodes in the region in subsequent messages. Thus, the monitoring database will always contain updated data regarding the various resource utilization and performance features of the infrastructure.
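The heartbeat-based liveness tracking described above can be sketched as a timeout check on the last HEARTBEAT seen per node; the class name and the 90-second timeout (three missed 30-second heartbeats) are hypothetical choices, not values taken from the framework:

```python
import time

class HeartbeatTracker:
    """Track the last time a HEARTBEAT message was heard from each node
    and report nodes whose heartbeat has lapsed beyond the timeout."""

    def __init__(self, timeout=90.0):
        self.timeout = timeout
        self.last_seen = {}

    def record(self, node, now=None):
        # Called whenever a HEARTBEAT message arrives for 'node'.
        self.last_seen[node] = time.time() if now is None else now

    def down_nodes(self, now=None):
        # Nodes that have missed heartbeats for longer than the timeout.
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]
```

A leader node could run such a check periodically and publish a NODE HEALTH INFO message for each node newly reported down.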

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update the same information in the centralized monitoring database. This sensor logs all the information on a state change (Started, Killed, Ended) of a process related to the scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) has been developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master node of the cluster, with the name of the scientific application and its other arguments passed as arguments (e.g., mpirun -np N -machinefile worker.config SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
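The hand-off between SciAppManager and the sensor can be sketched as follows. This is an illustrative sketch only: a thread-safe in-process queue stands in for the per-node POSIX message queue, the function names are hypothetical, and the JSON record layout is an assumption rather than the framework's actual message format.

```python
import json
import queue

# Stand-in for the per-node POSIX message queue shared between the
# SciAppManager wrapper and the scientific application sensor.
state_queue = queue.Queue()

def report_state(app_name, pid, state):
    """Wrapper side: called when the wrapped scientific application is
    STARTED, ENDED, or KILLED."""
    state_queue.put(json.dumps({"app": app_name, "pid": pid, "state": state}))

def drain_records():
    """Sensor side: log every queued record as soon as it is available;
    publishing under MPI PROCESS INFO is omitted from this sketch."""
    records = []
    while not state_queue.empty():
        records.append(json.loads(state_queue.get()))
    return records
```

In a real deployment, the sensor would block on the OS message queue rather than poll, and would forward each drained record to the DDS domain.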

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure of performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as a pre-specified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.
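The formatting step the sensor performs can be illustrated as below. The field names and the JSON transport are assumptions for the sketch; the paper only specifies the general categories of performance data.

```python
import json

# Assumed topic layout mirroring the generic log structure described above.
FIELDS = ("avg_response_time_ms", "heap_mb", "java_threads", "pending_requests")

def format_web_perf(raw):
    """Normalize a raw queue message into the topic field layout."""
    msg = json.loads(raw)
    return {f: msg[f] for f in FIELDS}
```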

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1 through ddsnode4) are started one by one with a random time interval. Once all the nodes have started executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "TOP" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes in the utilization. Additionally, even in the case of spikes, utilization is under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than even two percent,


Figure 12 CPU Utilization by the Sensor Framework at Each Node

Figure 13 RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Observations from Figures 14 and 15

The transition of various nodes between the states UP and FRAMEWORK_DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were also updated to FRAMEWORK_DOWN once their Local Managers were killed at time samples 390 and 410 respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore its state was updated to UP at the same time. Figure 15 presents the nodes that were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, while as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

On combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the

Figure 14 Transition in State of Nodes

Figure 15 Leaders of the Sensor Framework during the Experiment

framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework exhibits that the proposed approach is robust with respect to failure of the Regional Leader and that it can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute during the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the most ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of Regional Leader re-election at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the


Figure 16 CPU Utilization at node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have only minimal impact on the monitoring functionality of the framework.

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. This framework is easily scalable with the number of nodes because it is based upon a data centric publish-subscribe mechanism that is extremely scalable in itself. Also, in the proposed framework, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outages in the network (if a Regional Leader node stops working). Also, this framework can self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in a heterogeneous environment. The major benefit of using this sensor framework is that the total resource consumption by the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications executing on the node. This monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment. A hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct implementation of "RFDMon"

is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the daytrader [26] web service executing over the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi Lab, Batavia, IL for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and can easily be combined with a performance management system due to its flexible nature. New sensors can be started at any time in the monitoring framework, and a new set of publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition to this, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", which is a real-time and fault-tolerant distributed system monitoring approach based upon the data centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization,


memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by the Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community.

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf.

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: the ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] Daytrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11 Life Cycle of a CPU Utilization Sensor



Figure 3 Data Distribution Services (DDS) Entities

Figure 4 ARINC-653 Architecture

2.3 ARINC-653

The ARINC-653 software specification has been utilized in safety-critical real time operating systems (RTOS) that are used in avionics systems and recommended for space missions. The ARINC-653 specification presents a standard Application Executive (APEX) kernel and its associated services to ensure spatial and temporal separation among various applications and monitoring components in integrated modular avionics. ARINC-653 systems (see Figure 4) group multiple processes into spatially and temporally separated partitions. One or more partitions are grouped to form a module (i.e., a processor), while one or more modules form a system. These partitions are allocated predetermined chunks of memory.

Spatial partitioning [7] ensures exclusive use of a memory region by an ARINC partition. It guarantees that a faulty process in a partition cannot corrupt or destroy the data structures of other processes that are executing in other partitions. This space partitioning is useful to separate the low-criticality vehicle management components from the safety-critical flight control components in avionics systems [2]. Memory management hardware ensures

memory protection by allowing a process to access only the part of the memory that belongs to the partition which hosts that process.

Temporal partitioning [7] in ARINC-653 systems is ensured by a fixed periodic schedule for the use of processing resources by different partitions. This fixed periodic schedule is generated or supplied to the RTOS in advance for sharing of resources among the partitions (see Figure 5). This deterministic scheduling scheme ensures that each partition gains access to the computational resources within its execution interval as per the scheduling scheme. Additionally, it guarantees that the partition's execution will be interrupted once the partition's execution interval has finished, that the partition will be placed into a dormant state, and that the next partition as per the scheduling scheme will be allowed to access the computing resources. In this procedure, all shared hardware resources are managed by the partitioning OS to guarantee that the resources are freed as soon as the time slice for the partition expires.

The ARINC-653 architecture ensures fault containment through functional separation among applications and monitoring components. In this architecture, partitions and their processes can only be created during system initialization; dynamic creation of processes is not supported while the system is executing. Additionally, users can configure the real time properties (priority, periodicity, duration, soft/hard deadline, etc.) of the processes and partitions during their creation. These partitions and processes are scheduled and strictly monitored for possible deadline violations. Processes of the same partition share data and communicate using intra-partition services. Intra-partition communication is performed using buffers, which provide a message passing queue, and blackboards, which allow processes to read, write, and clear a single data storage. Two different partitions communicate using inter-partition services that use ports and channels for sampling and queueing of messages. Synchronization of processes belonging to the same partition is performed through semaphores and events [7].
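The blackboard service described above (single-slot write/read/clear) can be sketched as follows. This is a toy illustration of the semantics, not the emulation library's implementation, and it omits the blocking and timeout behaviour of real ARINC-653 services:

```python
import threading

class Blackboard:
    """Single-message storage: the latest write overwrites the slot."""
    def __init__(self):
        self._lock = threading.Lock()
        self._msg = None

    def display(self, msg):
        # Write: replaces whatever message is currently on the blackboard.
        with self._lock:
            self._msg = msg

    def read(self):
        # Non-destructive read: every reader sees the latest message.
        with self._lock:
            return self._msg

    def clear(self):
        with self._lock:
            self._msg = None
```

In contrast, a buffer would be a FIFO queue where each message is consumed by exactly one receive, which is why the specification provides both primitives.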

2.4 ARINC-653 Emulation Library

The ARINC-653 Emulation Library [2] (available for download from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page) provides a Linux based implementation of the ARINC-653 interface specifications for intra-partition process communication, which includes Blackboards and Buffers. Buffers provide a queue for passing messages and Blackboards enable processes to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. This library also provides process and time management services as described in the ARINC-653 specification.

Figure 5 Specification of Partition Scheduling in ACM Framework

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies its rate of execution. The partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the time of the CPU allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module. In other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions. The length of a minor frame is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.
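The hyperperiod computation described above is simply a least-common-multiple calculation over the partition periods, for example:

```python
from math import gcd
from functools import reduce

def hyperperiod(periods_ms):
    """Least common multiple of all partition periods in the module."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods_ms)
```

For partitions with periods of 4 ms and 6 ms, the hyperperiod is 12 ms, so the first partition runs three times and the second runs twice per major frame.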

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each of which is specified with an offset from the start of the hyperperiod and a duration.

Figure 6 Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.
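This validity check can be sketched as follows (an illustration of the stated rule, not the module manager's actual code): each scheduling window is an (offset, duration) pair that must lie inside the hyperperiod, and sorted windows may touch but never overlap.

```python
def schedule_is_valid(windows, hyperperiod):
    """windows: iterable of (offset, duration) pairs within one hyperperiod."""
    spans = sorted((off, off + dur) for off, dur in windows)
    # Every window must fit inside [0, hyperperiod].
    if any(start < 0 or end > hyperperiod for start, end in spans):
        return False
    # After sorting by offset, each window must end before the next begins.
    return all(prev_end <= next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))
```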

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components: (1) Model: it contains the data definitions and manipulation rules for the data. The model maintains the state of the application by storing data in the database. (2) Views: a view contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model. A view never handles the data, but interacts with users for various ways of inputting the data. (3) Controller: a controller contains rules for processing user input and for testing, updating, and maintaining the data. A user can change the data by using a controller, while views will update the visualization of the data to reflect the changes. Figure 6 presents the sequence of events while accessing a Rails application over a web interface. In a Rails application, an incoming client request is first sent to a router that finds the location of the application and parses the request to find the corresponding controller (method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, can interact with the model if needed, and can invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the browser of the client with the result.

In the proposed monitoring framework "RFDMon", a web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. Schema information of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This efficient and accurate monitoring should be performed in an on-line manner, where an extensive list of parameters is monitored and the measurements are utilized by on-line control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behaviour of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large amount of measurements, the non deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a monitoring technique suffers from high resource consumption by monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring includes the understanding of observations that are not directly related to the behaviour of the system, and of measurements that need a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any fault, instability, or illegal behaviour into the system under observation due to its implementation, and that it will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and it is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can be combined easily with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems which is easily scalable with the size of the cluster or the grid. Ganglia is developed upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multicast based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy to maintain the existence (membership) of lower level modules by the higher level. Various nodes and applications multicast their monitored resources or metrics to all of the other nodes. In this way, each node has an approximate picture of the complete cluster at all instances of time. Aggregation of the lower level modules is done through polling of the child nodes in the tree. An approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. There are two daemons that implement the Ganglia monitoring system. The Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and to respond to the requests sent from the Ganglia client. The Ganglia Meta Daemon ("gmetad") executes at one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of utilizing the Ganglia monitoring system are auto-discovery of added nodes, the aggregate picture of the cluster available at each node, clear design principles, easy portability, and easy manageability.


Figure 7 Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is developed based upon an agent/server architecture, where agents can report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: this is the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-in: these small modules are placed on the computing node, configured to monitor a resource, and send the reading to the "Nagios" server module over the SNMP interface. GUI: this is a web based interface that presents the situation of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution

for distributed infrastructure that has a comprehensive and flexible approach to monitoring as per the requirements of an enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI interface for monitoring servers placed in different geographical locations, offering a detailed geographical view of the infrastructure.


Nimsoft Monitoring Solution [15] (NMS) offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message BUS for the exchange of messages among the applications. Applications residing in the infrastructure publish data over the message BUS, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called HUB) and are attached to the message BUS. The monitoring action is performed by small probes, and the measurements are published to the message BUS by robots. Robots are software components that are deployed over each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in number of nodes, responsive to changes at the computational nodes, and comprehensive in number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or expandability of the monitoring approach to new parameters in an already executing framework. In addition to this, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other entity.

Timing jitter is one of the major difficulties for accurate monitoring of a computing system for periodic tasks, especially when the computing system is busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating for timing jitter without any change in the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
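A common way to bound drift in a periodic monitoring loop, consistent with the goal above (this generic sketch is not the feedback algorithm of [18]), is to schedule each sample against an absolute release time, so that per-iteration lateness does not accumulate:

```python
import time

def periodic_ticks(period, n, now=time.monotonic, sleep=time.sleep):
    """Run n iterations at the given period; return elapsed time per tick."""
    start = now()
    ticks = []
    for i in range(1, n + 1):
        # Absolute deadline: lateness in one iteration never shifts the next.
        deadline = start + i * period
        delay = deadline - now()
        if delay > 0:
            sleep(delay)
        ticks.append(now() - start)
    return ticks
```

Sleeping for a fixed `period` each iteration would instead add the loop's own processing time to every cycle, and the error would grow without bound.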

The primary goal of this paper is to present an event based distributed monitoring framework "RFDMon" for distributed systems that utilizes the data distribution service (DDS) methodology to report events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. These are lightweight processes that monitor a device on a computing node and read it periodically or aperiodically to obtain measurements. The sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding the individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (its structure definition) and the key values (keylist) that identify different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

1. MONITORING_INFO: System resource and hardware health monitoring sensors publish measurements under the monitoring info topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

Figure 8: Architecture of the Sensor Framework

2. HEARTBEAT: The heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of the other nodes in the DDS domain through this topic.

3. NODE_HEALTH_INFO: When the leader node (defined in Section 5.6) detects a change in the state (UP, DOWN, FRAMEWORK_DOWN) of any node through a change in its heartbeat, it publishes on the NODE_HEALTH_INFO topic to notify all of the other nodes of the change in status. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL_COMMAND: This topic is used by the leader to send control commands to the other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL_MEMBERSHIP_INFO: This topic is used for communication between the local nodes and the global membership manager for the selection of a leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET_LEADER, LEADER_ACK, LEADER_EXISTS, LEADER_DEAD), region name, and leader node name fields in the message.

6. PROCESS_ACCOUNTING_INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. This topic contains node name, sensor name, and the process accounting record in the message.

7. MPI_PROCESS_INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and the MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process IDs, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB_APPLICATION_INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. The performance measurement is a generic data structure that contains information logged from the web service, including average response time, heap memory usage, number of Java threads, and pending requests inside the system.
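As an illustration of the topic structures above, the MONITORING_INFO message can be mirrored as a plain data record. The Python dataclass below is only a sketch: in "RFDMon" the real definition lives in an IDL file, and the fields marked as keys would form the DDS keylist that distinguishes topic instances.

```python
from dataclasses import dataclass

# Illustrative mirror of the MONITORING_INFO topic described above.
# (node_name, sensor_name) play the role of the key fields that identify
# different instances of the topic in the DDS domain.
@dataclass(frozen=True)
class MonitoringInfo:
    node_name: str     # key field
    sensor_name: str   # key field
    sequence_id: int
    measurement: str

# Example instance, with a sensor name taken from the paper.
sample = MonitoringInfo("ddsnode1", "UtilizationAggregatecpuScalar", 42, "3.7")
```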

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a pre-defined topic. Publishers publish the data received from the various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and they are combined only logically. The nodes of a region can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that executes as an agent on each computing node of the monitoring framework. Its primary responsibility is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected to update the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for updating changes in the state (UP, DOWN, FRAMEWORK_DOWN) of the various Local Manager nodes. Once the leader of a region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node in a particular region. This module is also responsible for selecting a leader for each region from among its local nodes. When a local node comes alive, it first contacts the global membership manager with its region name. The Local Manager contacts the global membership manager to obtain information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to the other local nodes in the future, and it updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the new leader's name. This enables fault tolerance in the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing infrastructure, where a single node must be chosen from a group of competing nodes to perform a centralized task; this is often termed the "Leader Election Problem". A leader election algorithm is therefore required that can guarantee there will be only one unique designated leader of the group and that every node knows whether it is the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of a region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the global membership manager about the termination of the region's leader. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module.
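The default leader policy described above can be sketched as follows. The class and method names are illustrative, not the actual Global Membership Manager API; the serialized form mirrors the semicolon-separated format of "REGIONAL_LEADER_MAP.txt".

```python
class GlobalMembershipManagerSketch:
    """Sketch of the default policy: the first node to register for a region
    becomes its leader, later registrants are told the existing leader, and
    the node that reports a dead leader is promoted as the new leader."""

    def __init__(self):
        self.leaders = {}  # region name -> leader node name

    def get_leader(self, region, node):
        # First registrant becomes leader (LEADER_ACK); later nodes
        # receive the existing leader's name (LEADER_EXISTS).
        return self.leaders.setdefault(region, node)

    def leader_dead(self, region, reporting_node):
        # On a LEADER_DEAD message, promote the reporting node.
        self.leaders[region] = reporting_node
        return reporting_node

    def serialize(self):
        # Semicolon-separated persistence format: RegionName;LeaderName
        return "\n".join(f"{r};{l}" for r, l in self.leaders.items())
```

Persisting the map to disk is what lets a restarted manager instance recover the current leaders, as described for "GlobalMembershipManagerAPP" below.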

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to some error. The new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This enables fault tolerance in the framework with respect to the global membership manager.

Figure 9: Class Diagram of the Sensor Framework

A UML class diagram showing the relation among the above modules is shown in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic depending upon the implementation and the type of resource. Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that are triggered only when data becomes available.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
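The CPU cap imposed by the partition schedule follows directly from the duration/hyperperiod ratio. The sketch below shows the computation with illustrative numbers (the paper does not specify the actual schedule values).

```python
def partition_cpu_cap(duration_ms, hyperperiod_ms):
    """The Local Manager partition runs for `duration_ms` in every
    `hyperperiod_ms` window, so its CPU share can never exceed this ratio."""
    if not 0 < duration_ms <= hyperperiod_ms:
        raise ValueError("duration must be positive and fit in the hyperperiod")
    return duration_ms / hyperperiod_ms

# Illustrative values: a 100 ms slot in a 2000 ms hyperperiod caps the
# monitoring partition at 5% of the CPU, regardless of sensor load.
cap = partition_cpu_cap(100, 2000)
```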

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machine
Swap Utilization          30 seconds   Swap space usage on the machine
RAM Utilization           30 seconds   Memory usage on the machine
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machine
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machine
CPU Voltage               10 seconds   Voltage of the processor on the machine
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machine
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machine
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of a web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, the normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the value of a measurement is unchanged or changing only slightly.

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically according to the value of the sensor period. Sensors support three types of commands: START, STOP, and POLL. The START command causes an already initialized sensor to start publishing its measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command retrieves the latest measurement from the sensor. The sensors publish their data under the predefined topics in the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1, which are categorized in the following subsections based upon their type and functionality.
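The command and dead-band behaviour of a sensor can be sketched as follows. This is a simplified stand-in for the framework's ARINC-653 processes: `read` and `publish` are hypothetical placeholders for the device read and the DDS write, and the threading is only illustrative.

```python
import threading

class PeriodicSensorSketch:
    """Illustrative periodic sensor honoring START/STOP/POLL and a dead band."""

    def __init__(self, name, period_s, dead_band, read, publish):
        self.name, self.period_s, self.dead_band = name, period_s, dead_band
        self.read, self.publish = read, publish
        self._last_published = None
        self._stop = threading.Event()
        self._thread = None

    def _maybe_publish(self, value):
        # Dead band: publish only if the change since the last published
        # value exceeds the threshold.
        if (self._last_published is None
                or abs(value - self._last_published) > self.dead_band):
            self.publish(self.name, value)
            self._last_published = value

    def _run(self):
        # Publish once per period until STOP is issued.
        while not self._stop.wait(self.period_s):
            self._maybe_publish(self.read())

    def start(self):  # START command
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):   # STOP command
        self._stop.set()

    def poll(self):   # POLL command: force an immediate measurement
        self._maybe_publish(self.read())
```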

System Resource Utilization Monitoring Sensors

A number of sensors were developed to monitor the utilization of system resources. These sensors are periodic (30 seconds), follow SOFT deadlines, have normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of the MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
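A sketch of how the CPU utilization sensor might derive a utilization fraction from two /proc/stat samples. The exact fields used by the framework are not specified in the paper, so counting idle + iowait as idle time is our assumption.

```python
def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(x) for x in line.split()[1:]]
            # Assumption: field 3 is idle, field 4 (if present) is iowait.
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            return idle, sum(fields)
    raise ValueError("no aggregate cpu line found")

def cpu_utilization(sample_a, sample_b):
    """Busy fraction between two /proc/stat samples taken one period apart."""
    (idle_a, total_a), (idle_b, total_b) = sample_a, sample_b
    busy = (total_b - total_a) - (idle_b - idle_a)
    return busy / (total_b - total_a)
```

In the real sensor, two samples would be taken one sensor period apart and the resulting fraction published under MONITORING_INFO.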

Hardware Health Monitoring Sensors

In the proposed framework, various sensors were developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds), follow SOFT deadlines, have normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the CPU fan speed measurement.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command on the Linux system to collect the CPU temperature measurement.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command on the Linux system to collect the motherboard temperature measurement.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command on the Linux system to collect the CPU voltage measurement.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command on the Linux system to collect the motherboard voltage measurement.
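These readings could be collected with a small subprocess wrapper like the sketch below. The argument list mirrors the commands quoted above; parsing of the reading is vendor dependent, so only the raw output is returned, and the function degrades gracefully on hosts without IPMI support.

```python
import subprocess

def read_ipmi_sensor(sensor_id):
    """Run an "ipmitool -S SDR <sensor_id>"-style query (as quoted above)
    and return its raw text output, or None if IPMI is unavailable."""
    try:
        result = subprocess.run(
            ["ipmitool", "-S", "SDR", sensor_id],
            capture_output=True, text=True, timeout=5, check=True)
        return result.stdout
    except (FileNotFoundError, subprocess.SubprocessError):
        return None  # ipmitool missing, query failed, or it timed out
```

A real hardware health sensor would then extract the numeric reading from the output and publish it under MONITORING_INFO.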

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a heartbeat sensor that periodically publishes its own node name in the DDS domain under the topic "HEARTBEAT" to inform the other nodes of its existence in the framework. The local nodes of a region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node is terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the "NODE_HEALTH_INFO" topic. If a leader node is terminated, the other local nodes notify the global membership manager of the termination of their region's leader through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
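The heartbeat-based state tracking can be sketched as a simple timeout check: a node whose heartbeat has not been seen recently is considered down. The class name, timeout value, and state labels' exact semantics here are illustrative, not part of the framework's API.

```python
import time

class HeartbeatTracker:
    """Sketch of how a node might infer peer state from HEARTBEAT messages:
    a peer whose last heartbeat is older than `timeout_s` is considered
    FRAMEWORK_DOWN; a fresh heartbeat brings it back to UP."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable clock, for testing
        self.last_seen = {}           # node name -> time of last heartbeat

    def heartbeat(self, node):
        # Called whenever a HEARTBEAT message arrives for `node`.
        self.last_seen[node] = self.clock()

    def states(self):
        now = self.clock()
        return {n: ("UP" if now - t <= self.timeout_s else "FRAMEWORK_DOWN")
                for n, t in self.last_seen.items()}
```

In "RFDMon" it is the regional leader that would act on such transitions, writing them to the centralized database and publishing NODE_HEALTH_INFO.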

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update this information in the centralized monitoring database. The sensor logs all the information on a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued from the master nodes in the cluster to execute this SciAppManager application; the name of the scientific application and its other arguments are passed as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is described in Figure 10.
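The core of such a wrapper can be sketched as below. Here a plain callable stands in for the POSIX message queue (the Python standard library has no POSIX message queue binding), and the JSON record format is illustrative rather than the framework's actual wire format.

```python
import json
import subprocess
import sys

def run_and_report(app_argv, report):
    """Illustrative SciAppManager core: run the application as a child
    process and report STARTED and ENDED/KILLED state records through
    `report`, which stands in for the per-node POSIX message queue."""
    proc = subprocess.Popen(app_argv)
    report(json.dumps({"pid": proc.pid, "state": "STARTED"}))
    rc = proc.wait()
    # A negative return code means the child was terminated by a signal.
    state = "KILLED" if rc < 0 else "ENDED"
    report(json.dumps({"pid": proc.pid, "state": state, "returncode": rc}))
    return rc

if __name__ == "__main__":
    # Example: wrap a trivial child process and print the state records.
    records = []
    run_and_report([sys.executable, "-c", "pass"], records.append)
    print(records)
```

The scientific application sensor on the other side of the queue would consume these records and republish them under MPI_PROCESS_INFO.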

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of a web application executing on the node through the web server performance logs. A generic structure of the performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.

7 Experiments

A set of experiments was performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness to faults of the developed monitoring framework. In these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute the same version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1 through ddsnode4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During the experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that the CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, the utilization stays under ten percent. Similarly, according to Figure 13, the RAM utilization of the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transition of the various nodes between the states UP and FRAMEWORK_DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK_DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was restarted at time sample 440, so its state was updated to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the regional leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from time sample 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of faults introduced elsewhere in the framework on its monitoring capabilities. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults in the region (leader re-election and nodes leaving the framework). Figure 16 presents the CPU utilization of ddsnode1 recorded in the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor on ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of regional leader re-election at time samples 310 and 520 (see Figure 15), at most one or two data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on its monitoring functionality.

Figure 16: CPU Utilization at Node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. Also, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial network outages (e.g., if a Regional Leader node stops working). The framework can also self-configure (start, stop, and poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications executing on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment; a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL, for monitoring their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time within the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or analyze the current monitoring data. The framework can be applied to a wide variety of performance monitoring and management problems. It helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes; therefore, an administrator can easily find the location of a fault in the system and its possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press; accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community.

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: Required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf.

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11 Life Cycle of a CPU Utilization Sensor


  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
  • Issues in Distributed System Monitoring
  • Related Work
  • Architecture of the framework
    • Sensors
    • Topics
    • Topic Managers
    • Region
    • Local Manager
    • Leader of the Region
    • Global Membership Manager
  • Sensor Implementation
  • Experiments
  • Benefits of the approach
  • Application of the Framework
  • Future Work
  • Conclusion
  • Acknowledgement

Figure 5 Specification of Partition Scheduling in ACM Framework

cesses to read, write, and clear a single message. Intra-partition process synchronization is supported through Semaphores and Events. This library also provides process and time management services as described in the ARINC-653 specification.

The ARINC-653 Emulation Library is also responsible for providing temporal partitioning among partitions, which are implemented as Linux processes. Each partition inside a module is configured with an associated period that identifies the rate of execution. The partition properties also include the time duration of execution. The module manager is configured with a fixed cyclic schedule with a pre-determined hyperperiod (see Figure 5).

In order to provide periodic time slices for all partitions, the CPU time allocated to a module is divided into periodically repeating time slices called hyperperiods. The hyperperiod value is calculated as the least common multiple of the periods of all partitions in the module. In other words, every partition executes one or more times within a hyperperiod. The temporal interval associated with a hyperperiod is also known as the major frame. This major frame is then subdivided into several smaller time windows called minor frames. Each minor frame belongs exclusively to one of the partitions. The length of a minor frame is the same as the duration of the partition running in that frame. Note that the module manager allows execution of one and only one partition inside a given minor frame.

The module configuration specifies the hyperperiod value, the partition names, the partition executables, and their scheduling windows, each of which is specified with an offset from the start of the hyperperiod and a duration.

Figure 6 Rails and Model View Controller (MVC) Architecture Interaction

The module manager is responsible for checking that the schedule is valid before the system can be initialized, i.e., that all scheduling windows within a hyperperiod can be executed without overlap.
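The hyperperiod computation and the validity check described above can be sketched as follows (an illustrative Python sketch; function names and the time representation are assumptions, not the module manager's actual code):

```python
from math import gcd
from functools import reduce

def hyperperiod(periods):
    """Major frame length: least common multiple of all partition periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def schedule_is_valid(windows, major_frame):
    """windows: (offset, duration) pairs, one per scheduling window.
    Valid iff no two windows overlap and all fit inside the major frame."""
    spans = sorted(windows)
    for (off1, dur1), (off2, _) in zip(spans, spans[1:]):
        if off1 + dur1 > off2:          # previous window runs into the next
            return False
    last_off, last_dur = spans[-1]
    return last_off + last_dur <= major_frame

frame = hyperperiod([2, 3])                                # -> 6
print(schedule_is_valid([(0, 1), (2, 1), (4, 1)], frame))  # -> True
print(schedule_is_valid([(0, 2), (1, 1)], frame))          # -> False
```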

2.5 Ruby on Rails

Rails [4] is a web application development framework that uses the Ruby programming language [8]. Rails uses the Model View Controller (MVC) architecture for application development. MVC divides the responsibility of managing the web application development into three components:
1. Model: It contains the data definition and manipulation rules for the data. The model maintains the state of the application by storing data in the database.
2. View: It contains the presentation rules for the data and handles visualization requests made by the application as per the data present in the model. The view never handles the data directly but interacts with users for various ways of inputting the data.
3. Controller: It contains rules for processing the user input, testing, updating, and maintaining the data. A user can change the data by using the controller, while views will update the visualization of the data to reflect the changes.
Figure 6 presents the sequence of events while accessing a Rails application over the web interface. In a Rails application, an incoming client request is first sent to a router that finds the location of the application and parses the request to find the corresponding controller (method inside the controller) that will handle the incoming request. The method inside the controller can look into the data of the request, can interact with the model if needed, and can invoke other methods as per the nature of the request. Finally, the method sends information to the view, which renders the result in the browser of the client.
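The request sequence described above can be sketched schematically; the following is an illustrative Python analogue of the router-to-controller-to-view flow, not actual Rails internals, and all route, class, and method names are hypothetical:

```python
# Minimal sketch of the router -> controller -> model -> view sequence.
# All names here are hypothetical stand-ins.

ROUTES = {("GET", "/nodes"): ("NodesController", "index")}

class NodesController:
    def index(self, request):
        nodes = self.model_find_all()             # controller consults the model
        return self.render("nodes/index", nodes)  # view renders the result

    def model_find_all(self):
        return ["node1", "node2"]                 # stand-in for a database query

    def render(self, template, data):
        return f"{template}: {', '.join(data)}"

def dispatch(method, path, request=None):
    # the router parses the request and picks a controller action
    controller_name, action = ROUTES[(method, path)]
    controller = {"NodesController": NodesController}[controller_name]()
    return getattr(controller, action)(request)

print(dispatch("GET", "/nodes"))  # -> nodes/index: node1, node2
```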

In the proposed monitoring framework "RFDMon", a


web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. Schema information of the database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an on-line manner, where an extensive list of parameters is monitored and the measurements are utilized by on-line control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behavior of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delay, a large volume of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a technique suffers from high resource consumption by monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring includes the understanding of observations that are not directly related to the behavior of the system, and measurements that require a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any fault, instability, or illegal behavior in the system under observation due to its implementation, and will not interfere with the applications executing in the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) monitoring sensors in case of faulty (or seemingly faulty) measurements, and is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can easily be combined with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems which is easily scalable with the size of the cluster or the grid. Ganglia is developed upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multi-cast based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multi-cast addresses) within the hierarchy to maintain the existence (membership) of lower level modules by the higher level. Various nodes and applications multi-cast their monitored resources or metrics to all of the other nodes. In this way, each node has an approximate picture of the complete cluster at all instances of time. Aggregation of the lower level modules is done through polling of the child nodes in the tree. An approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. There are two daemons that implement the Ganglia monitoring system. The Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and respond to the requests sent from the Ganglia client. The Ganglia Meta Daemon ("gmetad") executes at one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of the federation of clusters. The primary advantages of the Ganglia monitoring system are auto-discovery of added nodes, each node containing the aggregate picture of the cluster, simple design principles, easy portability, and easy manageability.

8

Figure 7 Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is developed upon an agent/server architecture, where agents can report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: This is the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-in: These small modules are placed on the computing node, configured to monitor a resource, and then send the reading to the "Nagios" server module over the SNMP interface. GUI: This is a web based interface that presents the situation of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per the criticality level.

Zenoss [14] is a model-based monitoring solution

for distributed infrastructure that has a comprehensive and flexible approach to monitoring as per the requirements of the enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI interface for monitoring servers placed in different geographical locations, for a detailed geographical view of the infrastructure.

9

Nimsoft Monitoring Solution [15] (NMS) offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message BUS for the exchange of messages among the applications. Applications residing in the entire infrastructure publish the data over the message BUS, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called HUB) and are attached to the message BUS. The monitoring action is performed by small probes, and the measurements are published to the message BUS by robots. Robots are software components that are deployed over each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or expandability of the monitoring approach for new parameters in the already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other entity.

Timing jitter is one of the major difficulties for accurate monitoring of a computing system for periodic tasks when the computing system is too busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating timing jitter without any change in the operating system kernel. This approach has low overhead, is platform independent, and maintains total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
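The underlying idea can be illustrated with a periodic loop that anchors every deadline to the original start time, so lateness in one cycle does not accumulate (a simplified sketch; the actual feedback controller of [18] is more elaborate):

```python
import time

def run_periodic(task, period_s, iterations):
    """Fire `task` every `period_s` seconds, anchoring each deadline to the
    initial start time so that lateness (jitter) in one activation is
    absorbed rather than accumulated across activations."""
    next_deadline = time.monotonic()
    for _ in range(iterations):
        task()
        next_deadline += period_s
        remaining = next_deadline - time.monotonic()
        if remaining > 0:               # sleep only for the time actually left
            time.sleep(remaining)

run_periodic(lambda: None, 0.01, 3)     # three activations, ~10 ms apart
```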

The primary goal of this paper is to present an event based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the Data Distribution Service (DDS) methodology to report the events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 Emulator [2]. This enables the monitoring agents to be organized into one or more partitions, and each partition has a period and duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary component of the framework. These are lightweight processes that monitor a device on the computing nodes and read it periodically or aperiodically to get the measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (structure definition) and the key values (keylist) that identify the different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8 Architecture of Sensor Framework

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under the monitoring info topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

2. HEARTBEAT: The Heartbeat Sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes which are listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in heartbeat, it publishes a NODE HEALTH INFO message to notify all of the other nodes regarding the change in status of the node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for selection of a leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes the records under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process Ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing over the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
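For illustration, two of the topic structures above can be approximated as plain records; the actual definitions live in the framework's IDL file, and the field names below are assumptions:

```python
from dataclasses import dataclass

# Python approximations of two of the topic structures described above.
# In the real framework these are defined in an IDL file; the field
# names here are illustrative, with node/sensor names acting as keys.

@dataclass
class MonitoringInfo:
    node_name: str      # key field
    sensor_name: str    # key field
    sequence_id: int
    measurement: str

@dataclass
class Heartbeat:
    node_name: str      # key field
    sequence_id: int

m = MonitoringInfo("node1", "UtilizationAggregatecpuScalar", 42, "37.5")
print(m.sensor_name)  # -> UtilizationAggregatecpuScalar
```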

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a pre-defined topic. These publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous. Nodes are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with sensor measurements from each Local Manager. The leader of the region is also responsible for updating the changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for the selection of a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with the node's region name. The Local Manager contacts the global membership manager to get the information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader exists already) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to other local nodes in the future, and it updates the leader information in a file ("REGIONAL LEADER MAP.txt") on disk in semicolon separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader name. This enables the fault tolerant nature of the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.
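The get-or-assign leader logic and the semicolon-separated persistence format can be sketched as follows (illustrative Python; the underscored file name and the function names are assumptions based on the text):

```python
# Sketch of the leader bookkeeping described above: the first node asking
# for a region's leader becomes the leader, and the mapping is persisted
# as semicolon-separated RegionName;LeaderName lines.

MAP_FILE = "REGIONAL_LEADER_MAP.txt"   # file name per the text (format assumed)

def load_map(path=MAP_FILE):
    leaders = {}
    try:
        with open(path) as f:
            for line in f:
                region, leader = line.strip().split(";")
                leaders[region] = leader
    except FileNotFoundError:
        pass                            # no map yet: start empty
    return leaders

def save_map(leaders, path=MAP_FILE):
    with open(path, "w") as f:
        for region, leader in leaders.items():
            f.write(f"{region};{leader}\n")

def get_leader(leaders, region, requesting_node):
    """Reply with the region's leader, assigning the requester if none exists."""
    if region not in leaders:
        leaders[region] = requesting_node   # first registrant becomes leader
    return leaders[region]

def leader_dead(leaders, region, reporting_node):
    """On a LEADER_DEAD report, promote the reporting node to leader."""
    leaders[region] = reporting_node
    return reporting_node

leaders = {}
print(get_leader(leaders, "rack1", "node1"))   # -> node1 (assigned)
print(get_leader(leaders, "rack1", "node2"))   # -> node1 (leader exists)
print(leader_dead(leaders, "rack1", "node2"))  # -> node2 (new leader)
```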

Leader election is a very common problem in distributed computing systems, where a single node is chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "Leader Election Problem". Therefore, a leader election algorithm is required which can guarantee that there will be only one unique designated leader of the group and that all other nodes will know whether they are the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module for leader election.

Figure 9 Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of the Global Membership Manager if the previous instance terminates due to some error. The new instance of the Global Membership Manager receives its data from the "REGIONAL LEADER MAP.txt" file. This enables the fault tolerant nature of the framework with respect to the global membership manager.
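The wrapper's restart behaviour can be sketched as a small supervision loop (illustrative Python; the restart cap is an assumption added to keep the example finite, whereas the text implies the real wrapper restarts whenever the child terminates with an error):

```python
import subprocess
import sys

# Sketch of the "GlobalMembershipManagerAPP" behaviour described above:
# restart the child process whenever it exits abnormally.

def supervise(cmd, max_restarts=3):
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:      # clean exit: stop supervising
            return restarts
        if restarts == max_restarts:    # illustrative cap, not in the paper
            return restarts
        restarts += 1                   # crashed: start a fresh instance

# a child that exits cleanly is never restarted
print(supervise([sys.executable, "-c", "pass"]))  # -> 0
```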

A UML class diagram showing the relations among the above modules is shown in Figure 9. The next section describes the details of each sensor that executes on each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI and PBS related information, and performance data of applications in execution on the node. These sensors can be periodic or aperiodic depending upon the implementation and the type of resource. Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, it limits the maximum CPU utilization of the Local Managers.

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machines
Swap Utilization          30 seconds   Swap space usage on the machines
RAM Utilization           30 seconds   Memory usage on the machines
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machines
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machines
CPU Voltage               10 seconds   Voltage of the processor on the machines
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machines
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machines
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports the change in state of the MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of Web Applications

Table 1 List of Monitoring Sensors

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Sensor priority indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports the value only if the difference between the current value and the previously recorded value becomes greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changing only slightly.

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support three types of commands for publishing the data: START, STOP, and POLL. The START command starts the already initialized sensor so that it begins publishing the sensor measurements. The STOP command stops the sensor thread from publishing the measurement data. The POLL command tries to get the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the various sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.
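The command handling and dead-band filtering described above can be sketched as follows (an illustrative Python sketch; the class and method names are assumptions, not the framework's API):

```python
# Sketch of the START/STOP/POLL command handling and the dead-band
# filter described above. Names and the fixed reading are illustrative.

class Sensor:
    def __init__(self, name, dead_band=0.0):
        self.name = name
        self.dead_band = dead_band
        self.running = False
        self.last_published = None

    def handle_command(self, command):
        if command == "START":
            self.running = True         # begin periodic publishing
        elif command == "STOP":
            self.running = False        # stop publishing measurements
        elif command == "POLL":
            return self.read()          # latest measurement on demand
        return None

    def read(self):
        return 42.0                     # stand-in for a real device read

    def maybe_publish(self, value):
        """Publish only if the change exceeds the sensor dead band."""
        if (self.last_published is None
                or abs(value - self.last_published) > self.dead_band):
            self.last_published = value
            return True                 # would publish under MONITORING INFO
        return False

s = Sensor("cpu", dead_band=1.0)
s.handle_command("START")
print(s.maybe_publish(50.0))  # -> True  (first measurement)
print(s.maybe_publish(50.5))  # -> False (within dead band)
print(s.maybe_publish(52.0))  # -> True  (change > 1.0)
```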

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, have normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10 Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command over the Linux file system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
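As an example of how such a sensor can derive a reading, aggregate CPU utilization can be computed from two successive reads of the first line of /proc/stat (Linux only; the real sensor's parsing details may differ):

```python
def read_cpu_counters(path="/proc/stat"):
    """Return (idle, total) jiffy counters from the aggregate "cpu" line."""
    with open(path) as f:
        fields = f.readline().split()
    values = list(map(int, fields[1:]))
    return values[3], sum(values)       # 4th column is idle time

def cpu_utilization(before, after):
    """Percent busy between two (idle, total) samples."""
    idle0, total0 = before
    idle1, total1 = after
    busy = (total1 - total0) - (idle1 - idle0)
    return 100.0 * busy / (total1 - total0)

# synthetic counters: 100 total jiffies elapsed, 50 of them idle
print(cpu_utilization((100, 200), (150, 300)))  # -> 50.0
```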

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, have normal priority, and read the measurements over the vendor supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command over the Linux system to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command over the Linux system to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command over the Linux system to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command over the Linux system to collect the measurement of the motherboard voltage.
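A hardware health sensor of this kind can be sketched as a small wrapper that shells out to the ipmitool command quoted above and extracts the numeric reading; the "name | value units | status" output format assumed below is illustrative, since real ipmitool output varies by version and platform:

```python
import subprocess

# Sketch of an IPMI-based hardware sensor. The command line follows the
# text above; the parsed output format is an assumption for illustration.

def parse_ipmi_reading(output):
    parts = [p.strip() for p in output.split("|")]
    return float(parts[1].split()[0])    # e.g. "5000 RPM" -> 5000.0

def read_fan_speed():
    out = subprocess.run(["ipmitool", "-S", "SDR", "fan1"],
                         capture_output=True, text=True).stdout
    return parse_ipmi_reading(out)

print(parse_ipmi_reading("fan1 | 5000 RPM | ok"))  # -> 5000.0
```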

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of other local nodes in that region through the HEARTBEAT messages, which are transmitted periodically by each local node. If a local node gets terminated, the leader of the region will update the monitoring database with node state transition values and will notify other nodes of the termination of the local node through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes will notify the global membership manager of the termination of the leader for their region through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager will elect a new leader for the region and notify the local nodes in the region in subsequent messages. Thus, the monitoring database will always contain updated data regarding the various resource utilization and performance features of the infrastructure.

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to update this information in the centralized monitoring database. This sensor logs all of the information on a state change (Started, Killed, Ended) of the processes related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master nodes of the cluster. The name of the scientific application and its other arguments are passed as arguments while executing the SciAppManager application (e.g., mpirun -np N -machinefile worker.config SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as a pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
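The hand-off between SciAppManager and the sensor can be sketched as below. A plain in-process queue stands in for the per-node POSIX message queue, and the message layout (node, process, state) is an assumption; the report only fixes the state names (STARTED, ENDED, KILLED) and the MPI PROCESS INFO topic.

```python
import json
import queue

mq = queue.Queue()   # stand-in for the per-node POSIX message queue

def report_state(node, process, state):
    """SciAppManager side: record a state change of the wrapped application."""
    assert state in ("STARTED", "ENDED", "KILLED")
    mq.put(json.dumps({"node": node, "process": process, "state": state}))

def drain_and_format():
    """Sensor side: read queued records, format them as MPI_PROCESS_INFO samples."""
    samples = []
    while not mq.empty():
        rec = json.loads(mq.get())
        samples.append(("MPI_PROCESS_INFO", rec["node"], rec["process"], rec["state"]))
    return samples

report_state("ddsnode1", "SciAPP", "STARTED")
report_state("ddsnode1", "SciAPP", "ENDED")
samples = drain_and_format()
```

In the real framework the sensor would publish each formatted sample to the DDS domain instead of collecting it into a list.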

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure for performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as a pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.
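The generic performance-log structure described above might look like the following; the field names mirror the text, while the types, units, and sample values are assumptions.

```python
from dataclasses import dataclass, asdict

@dataclass
class WebAppPerfRecord:
    """One WEB_APPLICATION_INFO sample, mirroring the fields named in the text."""
    avg_response_time_ms: float   # average response time (units assumed)
    heap_memory_usage_mb: float   # heap memory usage (units assumed)
    java_thread_count: int        # number of Java threads
    pending_requests: int         # pending requests inside the web server

rec = WebAppPerfRecord(avg_response_time_ms=12.5, heap_memory_usage_mb=256.0,
                       java_thread_count=48, pending_requests=3)
payload = asdict(rec)   # dict form, ready to pack into a DDS topic sample
```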

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework was deployed in a Linux environment (2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all of the nodes have started executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that the CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes. Additionally, even in the case of spikes, the utilization remains under ten percent. Similarly, according to Figure 13, the RAM utilization by the monitoring framework is less than two percent,


Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Observations from Figures 14 and 15

The transition of the various nodes between the states UP and FRAMEWORK DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were also updated to FRAMEWORK DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore its state was updated to UP at the same time. Figure 15 presents the nodes that were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, but as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

On combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

framework related to the Regional Leader, a new leader is elected instantly. This specific feature of the developed monitoring framework exhibits that the proposed approach is robust with respect to failure of the Regional Leader and that it can adapt to faults in the framework without delay.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the ideal scenario, all of the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1, taken from the centralized database, during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor at ddsnode1. According to Figure 16, the monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the


Figure 16: CPU Utilization at node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

8 Benefits of the Approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. This framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. Also, in the proposed framework, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outages in the network (if a Regional Leader node stops working). This framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory areas or data structures of the applications executing on the node. The monitoring framework has a very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct implementation of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi National Accelerator Laboratory, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system, and due to its flexible nature it can easily be combined with a performance management system. New sensors can be started at any time in the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or to analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", which is a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community.

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf.

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1–11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1–9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Università degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss: the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: the ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35–39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231–236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603–618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84–101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor



A web service is developed to display the monitoring data collected from the distributed infrastructure. These monitoring data include information related to clusters, nodes in a cluster, node states, measurements from the various sensors on each node, MPI and PBS related data for scientific applications, web application performance, and process accounting. The schema of the monitoring database is shown in Figure 7.

3 Issues in Distributed System Monitoring

An efficient and accurate monitoring technique is the most basic requirement for tracking the behaviour of a computing system to achieve pre-specified QoS objectives. This monitoring should be performed in an online manner, where an extensive list of parameters is monitored and the measurements are utilized by online control methods to tune the system configuration, which in turn keeps the system within the desired QoS objectives [9]. When a monitoring technique is applied to a distributed infrastructure, it requires synchronization among the nodes, high scalability, and measurement correlation across the multiple nodes to understand the current behaviour of the distributed system accurately and within the time constraints. However, a typical distributed monitoring technique suffers from synchronization issues among the nodes, communication delays, a large amount of measurements, the non-deterministic nature of events, the asynchronous nature of measurements, and limited network bandwidth [10]. Additionally, such a monitoring technique suffers from high resource consumption by the monitoring units and from multiple sources and destinations for measurements. Another set of issues related to distributed monitoring includes the understanding of observations that are not directly related to the behaviour of the system, and of measurements that need a particular order in reporting [11]. Due to all of these issues, it is extremely difficult to create a consistent global view of the infrastructure.

Furthermore, it is expected that the monitoring technique will not introduce any faults, instability, or illegal behaviour into the system under observation due to its implementation, and that it will not interfere with the applications executing on the system. For this reason, "RFDMon" is designed in such a manner that it can self-configure (START, STOP, and POLL) the monitoring sensors in case of faulty (or seemingly faulty) measurements, and it is self-aware of the execution of the framework over multiple nodes through HEARTBEAT sensors. In the future, "RFDMon" can be combined easily with a fault diagnosis module due to its standard interfaces. In the next section, a few of the latest approaches to distributed system monitoring (enterprise and open source) are presented with their primary features and limitations.

4 Related Work

Various distributed monitoring systems have been developed by industry and research groups over the past many years. Ganglia [12], Nagios [13], Zenoss [14], Nimsoft [15], Zabbix [16], and OpenNMS [17] are a few of the most popular enterprise products developed for efficient and effective monitoring of distributed systems. A description of a few of these products is given in the following paragraphs.

According to [12], Ganglia is a distributed monitoring product for distributed computing systems that is easily scalable with the size of the cluster or the grid. Ganglia is developed upon the concept of a hierarchical federation of clusters. In this architecture, multiple nodes are grouped as a cluster, which is attached to a module, and then multiple clusters are again grouped under a monitoring module. Ganglia utilizes a multicast-based listen/announce protocol for communication among the various nodes and monitoring modules. It uses heartbeat messages (over multicast addresses) within the hierarchy to maintain the existence (membership) of lower-level modules at the higher level. The various nodes and applications multicast their monitored resources or metrics to all of the other nodes. In this way, each node has an approximate picture of the complete cluster at all instances of time. Aggregation of the lower-level modules is done through polling of the child nodes in the tree, and the approximate picture of the lower level is transferred to the higher level in the form of monitoring data through TCP connections. Two daemons together implement the Ganglia monitoring system: the Ganglia monitoring daemon ("gmond") runs on each node in the cluster to report the readings from the cluster and to respond to the requests sent from a Ganglia client, while the Ganglia Meta Daemon ("gmetad") executes at one level higher than the local nodes and aggregates the state of the cluster. Finally, multiple gmetad instances cooperate to create the aggregate picture of a federation of clusters. The primary advantages of utilizing the Ganglia monitoring system are auto-discovery of added nodes, each node containing the aggregate picture of the cluster, easy portability, and easy manageability.


Figure 7: Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is developed upon an agent/server architecture, where agents can report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: the administrator component of "Nagios" that checks the plug-ins and takes corrective actions if needed. Plug-ins: small modules that are placed on the computing nodes, configured to monitor a resource, and that then send the readings to the "Nagios" server module over the SNMP interface. GUI: a web-based interface that presents the situation of the distributed monitoring system through various colourful buttons, sounds, and graphs. The events reported from the plug-ins are displayed on the GUI as per their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure that has a comprehensive and flexible approach to monitoring as per the requirements of the enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMware) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI interface for monitoring servers placed in different geographic locations, giving a detailed geographical view of the infrastructure.


The Nimsoft Monitoring Solution (NMS) [15] offers a lightweight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among the applications. Applications residing in the infrastructure publish data over the message bus, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a hub) and are attached to the message bus. The monitoring action is performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed on each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring units, or extending the monitoring approach with new parameters in an already executing framework. In addition, these monitoring approaches are standalone and are not easily extensible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties for accurate monitoring of a computing system with periodic tasks, especially when the computing system is busy running other applications instead of the monitoring sensors. [18] presents a feedback-based approach for compensating for timing jitter without any change in the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.
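The spirit of bounded-jitter periodic scheduling can be illustrated with a drift-free timer loop: each wake-up time is computed from the original start time rather than from the previous (possibly late) wake-up, so lateness does not accumulate across periods. This is a simplified sketch, not the feedback controller from [18].

```python
def next_deadlines(start, period, n):
    """Absolute deadlines for a periodic sensor: start + k*period.
    Computing from `start` (not from the previous wake-up) keeps jitter bounded."""
    return [start + k * period for k in range(1, n + 1)]

def sleep_time(deadline, now):
    """How long to sleep until the next deadline; 0 if we are already late."""
    return max(0.0, deadline - now)

deadlines = next_deadlines(start=100.0, period=10.0, n=3)
```

A sensor that woke up late (e.g., at t=123.5 for the t=120 deadline) simply runs immediately and then sleeps until the next absolute deadline, so the error does not compound.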

The primary goal of this paper is to present an event-based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the Data Distribution Service (DDS) methodology to report events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and a duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.
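The period/duration attributes above define a cyclic schedule. As a hedged sketch (the emulator's real scheduler is more involved), the following computes which partition owns the CPU at a given offset into the major frame; the partition names and durations are illustrative.

```python
def active_partition(partitions, t):
    """partitions: list of (name, duration) executed back-to-back in a major frame.
    Returns the partition owning the CPU at offset t into the repeating frame."""
    frame = sum(d for _, d in partitions)      # major frame = sum of durations
    t = t % frame                              # the schedule repeats every frame
    elapsed = 0.0
    for name, duration in partitions:
        if t < elapsed + duration:
            return name
        elapsed += duration

# Illustrative schedule: a monitoring partition capped at 20% of each frame.
schedule = [("monitoring", 0.2), ("application", 0.8)]
```

This is also how the resource cap discussed later works in principle: the monitoring partition can never consume more than its configured duration per frame.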

5 Architecture of the Framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concepts of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary component of the framework. These are lightweight processes that monitor a device on a computing node and read it periodically or aperiodically to get the measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: system resource utilization monitoring sensors, hardware health monitoring sensors, scientific application health monitoring sensors, and web application performance monitoring sensors. Details regarding the individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (its structure definition) and the key values (keylist) used to identify the different instances of the topic are described in an interface definition language (idl) file. Keys can represent an arbitrary number of fields in the topic. There are various topics in the framework related to monitoring information, node heartbeats, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of Sensor Framework

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under this topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements fields in the message.

2. HEARTBEAT: The Heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence Id fields in the messages. All nodes that are listening to the HEARTBEAT topic can keep track of the existence of the other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in the state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes a NODE HEALTH INFO message to notify all of the other nodes regarding the change in status of the node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to the other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between the local nodes and the global membership manager, for the selection of a leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and the MPI/PBS information variables of the MPI processes running on a computing node. These MPI or PBS variables contain the process ids, the number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of Java threads, and pending requests inside the system.
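In the actual framework these topic structures are defined in idl. As an illustration, the MONITORING INFO message could be mirrored in code as below, with Node Name and Sensor Name acting as the key fields that distinguish topic instances; the field types and sample values are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringInfo:
    """Mirror of the MONITORING_INFO topic; node/sensor identify the instance."""
    node_name: str      # key field
    sensor_name: str    # key field
    sequence_id: int
    measurement: float

    def key(self):
        """Instance key, as an idl keylist would declare it."""
        return (self.node_name, self.sensor_name)

sample = MonitoringInfo("ddsnode1", "cpu_fan_speed", 42, 3870.0)
```

Two samples with the same key belong to the same topic instance, so subscribers can track each (node, sensor) stream independently.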

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. The publishers publish the data received from the various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.
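The role a Topic Manager plays can be shown with a small in-process stand-in (this is not the OpenSplice DDS API): publishers push samples under a topic name, and every subscriber registered for that name receives them.

```python
class TopicManager:
    """In-process stand-in for a DDS topic: fan out samples to subscribers."""

    def __init__(self, topic_name):
        self.topic_name = topic_name
        self.subscribers = []        # callbacks invoked for every published sample

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, sample):
        for callback in self.subscribers:
            callback(self.topic_name, sample)

received = []
mgr = TopicManager("MONITORING_INFO")
mgr.subscribe(lambda topic, sample: received.append((topic, sample)))
mgr.publish({"node": "ddsnode1", "cpu": 0.7})
```

The real DDS middleware adds discovery, QoS, and network transport on top of this basic fan-out pattern.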

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for updating the changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for selecting a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the Global Membership Manager with its region name. The Local Manager contacts the Global Membership Manager for information regarding the leader of its region; the Global Membership Manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The Global Membership Manager will communicate the same local node as leader to other local nodes in the future, and it updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the Global Membership Manager that its leader is dead, the Global Membership Manager selects a new leader for that region and replies to the local node with the leader's name. This enables the fault-tolerant nature of the framework with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing infrastructure, where a single node is chosen from a group of competing nodes to perform a centralized task; this problem is often termed the "Leader Election Problem". A leader election algorithm is therefore required which can guarantee that there will be only one unique designated leader of the group, and that every node knows whether or not it is the leader [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the Global Membership Manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the Global Membership Manager module for leader election.
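The default leader-selection policy described above can be sketched as follows. The class and method names are hypothetical (not from the framework's code); the serialized format mirrors the semicolon-separated RegionName;LeaderName entries mentioned in the text.

```python
# Sketch of the default policy: the first node to register for a region
# (or the first node to report a dead leader) becomes its leader.
from collections import defaultdict


class GlobalMembershipManager:
    def __init__(self):
        self.leaders = {}              # region name -> current leader node
        self.members = defaultdict(set)

    def register(self, region, node):
        """A node coming alive contacts the manager with its region name."""
        self.members[region].add(node)
        # If the region has no leader yet, the registering node becomes it.
        return self.leaders.setdefault(region, node)

    def leader_dead(self, region, reporter):
        """Default choice: the node reporting the dead leader takes over."""
        self.leaders[region] = reporter
        return reporter

    def serialize(self):
        # One RegionName;LeaderName entry per region, as in the map file.
        return "\n".join(f"{r};{l}" for r, l in sorted(self.leaders.items()))


gmm = GlobalMembershipManager()
first = gmm.register("rackA", "ddshost1")   # becomes leader
second = gmm.register("rackA", "ddsnode4")  # told the existing leader
```

More sophisticated election algorithms would replace `register` and `leader_dead` while keeping the same interface toward the Local Managers.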

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to some error. A new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This enables the fault-tolerant nature of the framework with respect to the Global Membership Manager.

A UML class diagram showing the relations among the above modules is given in Figure 9. The next section describes the details of each sensor that executes at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI and PBS related information, and performance data of applications in execution on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource. Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution within that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
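A minimal sketch of how a static cyclic schedule bounds the Local Manager's CPU share, assuming time in arbitrary units; this illustrates the hyperperiod/phase/duration idea, not the emulator's actual API.

```python
# Static cyclic scheduling sketch: a partition gets one window of length
# `duration`, starting at `phase`, in every hyperperiod.
def partition_windows(hyperperiod, phase, duration, horizon):
    """Return the CPU windows [(start, end), ...] a partition receives
    up to `horizon` time units."""
    windows = []
    start = phase
    while start < horizon:
        windows.append((start, start + duration))
        start += hyperperiod
    return windows


def max_cpu_utilization(hyperperiod, duration):
    # The partition can never use more than duration/hyperperiod of the CPU,
    # which is how the Local Manager's resource consumption is capped.
    return duration / hyperperiod


windows = partition_windows(hyperperiod=100, phase=10, duration=20, horizon=300)
cap = max_cpu_utilization(hyperperiod=100, duration=20)
```

With a 20-unit window per 100-unit hyperperiod, the sensor partition is capped at 20% of one CPU regardless of how many sensors it hosts.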

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name               | Period     | Description
CPU Utilization           | 30 seconds | Aggregate utilization of all CPU cores on the machine
Swap Utilization          | 30 seconds | Swap space usage on the machine
RAM Utilization           | 30 seconds | Memory usage on the machine
Hard Disk Utilization     | 30 seconds | Disk usage on the machine
CPU Fan Speed             | 30 seconds | Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     | 10 seconds | Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           | 10 seconds | Temperature of the processor on the machine
Motherboard Temperature   | 10 seconds | Temperature of the motherboard on the machine
CPU Voltage               | 10 seconds | Voltage of the processor on the machine
Motherboard Voltage       | 10 seconds | Voltage of the motherboard on the machine
Network Utilization       | 10 seconds | Bandwidth utilization of each network card
Network Connection        | 30 seconds | Number of TCP connections on the machine
Heartbeat                 | 30 seconds | Periodic liveness messages
Process Accounting        | 30 seconds | Periodic sensor that publishes the commands executed on the system
MPI Process Info          | -1         | Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info      | -1         | Aperiodic sensor that reports the performance data of the web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value becomes greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changing only slightly.
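The dead-band rule above can be sketched as a small filter; names are illustrative, not the framework's actual API.

```python
# Dead-band filter sketch: suppress a report unless the new value differs
# from the last reported value by more than the dead band.
def dead_band_filter(dead_band):
    last = None

    def should_report(value):
        nonlocal last
        if last is None or abs(value - last) > dead_band:
            last = value   # only reported values update the reference point
            return True
        return False

    return should_report


cpu_filter = dead_band_filter(2.0)  # e.g., report only on >2% change
```

Note that the reference point is the last *reported* value, so a slow drift is still eventually reported once it accumulates past the dead band.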

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically, as per the value of the sensor period. Sensors support commands for controlling the publishing of data: START, STOP, and POLL. The START command makes an already initialized sensor start publishing its measurements. The STOP command stops the sensor thread from publishing the measurement data. The POLL command tries to get the latest measurement from the sensor. The sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.
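A simplified sketch of a periodic sensor supporting START, STOP, and POLL, using a Python thread as a stand-in for an ARINC-653 process; the interface is illustrative, not the framework's actual sensor API.

```python
# Periodic sensor sketch: read() collects a measurement, publish() sends it
# to the (stand-in) DDS domain under the sensor's topic.
import threading
import time


class PeriodicSensor:
    def __init__(self, name, period, read, publish):
        self.name, self.period = name, period
        self.read, self.publish = read, publish
        self._latest = None
        self._running = threading.Event()

    def start(self):
        # START: begin publishing measurements every `period` seconds.
        self._running.set()
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self):
        # STOP: the publishing loop exits at its next iteration check.
        self._running.clear()

    def poll(self):
        # POLL: fetch the latest measurement on demand.
        self._latest = self.read()
        return self._latest

    def _loop(self):
        while self._running.is_set():
            self._latest = self.read()
            self.publish(self.name, self._latest)
            time.sleep(self.period)


published = []
sensor = PeriodicSensor("cpu", 0.005, lambda: 42,
                        lambda name, value: published.append(value))
sensor.start()
time.sleep(0.03)
sensor.stop()
```

In the real framework the periodicity and deadline enforcement come from the ARINC-653 emulator rather than `time.sleep`.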

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, have normal priority, and read system devices (e.g., "/proc/stat", "/proc/meminfo") to collect the measurements. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.

14

Figure 10: Architecture of the MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux file system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
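As an illustration of how such a sensor can derive aggregate CPU utilization from two successive samples of the first line of /proc/stat, the following sketch treats the line as the documented sequence of jiffy counters (user, nice, system, idle, iowait, ...); the exact field handling in "RFDMon" may differ.

```python
# Aggregate CPU utilization from two samples of the "cpu ..." line of
# /proc/stat: busy time is total delta minus (idle + iowait) delta.
def cpu_utilization(prev_line, curr_line):
    def split(line):
        fields = [int(x) for x in line.split()[1:]]
        # idle (4th field) plus iowait (5th field, when present)
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        return idle, sum(fields)

    idle0, total0 = split(prev_line)
    idle1, total1 = split(curr_line)
    busy = (total1 - total0) - (idle1 - idle0)
    return 100.0 * busy / (total1 - total0)


# Illustrative samples (jiffy counts are fabricated for the example):
prev = "cpu 100 0 100 800 0 0 0"
curr = "cpu 200 0 200 1400 0 0 0"
```

A real sensor would read the file twice, one sensor period apart, and publish the result under MONITORING_INFO.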

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, have normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination of that local node through the topic "NODE_HEALTH_INFO". If a leader node gets terminated, the other local nodes notify the Global Membership Manager of the termination of the leader for their region through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The Global Membership Manager elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains updated data regarding the various resource utilization and performance features of the infrastructure.
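The heartbeat-based liveness tracking described above can be sketched as follows; the timeout policy and names are assumptions made for illustration, not taken from the paper.

```python
# Liveness tracking sketch: a node whose HEARTBEAT is overdue by more than
# `timeout` is considered FRAMEWORK_DOWN by the tracking node.
class HeartbeatTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def down_nodes(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


tracker = HeartbeatTracker(timeout=60)   # seconds; illustrative value
tracker.heartbeat("ddsnode1", 0)
tracker.heartbeat("ddsnode2", 0)
tracker.heartbeat("ddsnode1", 50)        # ddsnode1 stays alive
```

In the framework, detecting such a transition is what triggers the leader's NODE_HEALTH_INFO update (or, for a dead leader, the LEADER_DEAD message).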

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to update the same information in the centralized monitoring database. This sensor logs all the information on a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute the SciAppManager application from the master nodes of the cluster; the name of the scientific application and its other arguments are passed as arguments when executing SciAppManager (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data as the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is described in Figure 10.

Web Application Performance Monitoring Sen-sor

This sensor keeps track of performance behaviour of theweb application executing over the node through the webserver performance logs A generic structure of perfor-mance logs is defined in the sensor framework that in-cludes average response time heap memory usage num-ber of JAVA threads and pending requests inside the webserver In proposed framework a web application logs itsperformance data in a POSIX message queue (differentfrom SciAppManager) that exists on each node Web ap-plication performance monitoring sensor will be listening

on that message queue and log the message as soon as itreceives This sensor formats the data as pre-specifiedDDS topic structure and publishes it to the DDS domainunder WEB APPLICATION INFO topic

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with a random time interval. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once a Local Manager at a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of a Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization stays under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach, "RFDMon", is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transition of the various nodes between the states UP and FRAMEWORK_DOWN is presented in Figure 14. According to the figure, ddshost1 started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK_DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440, so its state was updated to UP at the same time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed once more at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly, without delay.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced on this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the most ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of regional leader re-election at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at Node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework is easily scalable with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. Also, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outages in the network (if a Regional Leader node stops working). The framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption of the sensors can be capped by applying ARINC-653 scheduling policies, as described in the previous sections. In addition, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of the applications executing on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment. A hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi Lab, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and can easily be combined with a performance management system due to its flexible nature. New sensors can be started at any time during the execution of the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In future work, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system standard, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press, accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1 - required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In IEEE International Workshop on Engineering of Autonomic and Autonomous Systems, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor



Figure 7: Schema of Monitoring Database (PK = Primary Key, FK = Foreign Key)

A plug-in based distributed monitoring approach, "Nagios", is described in [13]. "Nagios" is built on an agent/server architecture, where agents report abnormal events from the computing nodes to the server node (administrators) through email, SMS, or instant messages. Current monitoring information and log history can be accessed by the administrator through a web interface. The "Nagios" monitoring system consists of the following three primary modules. Scheduler: the administrative component of "Nagios", which checks the plug-ins and takes corrective actions if needed. Plug-in: small modules placed on the computing nodes, configured to monitor a resource and send the readings to the "Nagios" server module over the SNMP interface. GUI: a web based interface that presents the state of the distributed monitoring system through various colourful buttons, sounds, and graphs. Events reported by the plug-ins are displayed on the GUI according to their criticality level.

Zenoss [14] is a model-based monitoring solution for distributed infrastructure that takes a comprehensive and flexible approach to monitoring as per the requirements of an enterprise. It is an agentless monitoring approach, where the central monitoring server gathers data from each node over the SNMP interface through ssh commands. In Zenoss, the computing nodes can be discovered automatically and can be specified with their type (Xen, VMWare) for a customized monitoring interface. The monitoring capability of Zenoss can be extended by adding small monitoring plug-ins (zenpacks) to the central server. For similar types of devices, it uses a similar monitoring scheme, which gives ease of configuration to the end user. Additionally, classification of the devices applies monitoring behaviours (monitoring templates) to ensure appropriate and complete monitoring, alerting, and reporting using defined performance templates, thresholds, and event rules. Furthermore, it provides an extremely rich GUI for monitoring servers placed in different geographic locations, giving a detailed geographical view of the infrastructure.


Nimsoft Monitoring Solution (NMS) [15] offers a lightweight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message BUS for the exchange of messages among the applications. Applications residing in the infrastructure publish data over the message BUS, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a HUB) and are attached to the message BUS. The monitoring action is performed by small probes, and the measurements are published to the message BUS by robots. Robots are software components that are deployed on each managed device. NMS also provides an Alarm Server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in the number of nodes, responsive to changes at the computational nodes, and comprehensive in the number of parameters. However, they do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or extension of the monitoring approach with new parameters in an already executing framework. In addition, these monitoring approaches are stand-alone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurately monitoring a computing system with periodic tasks, when the computing system is too busy running other applications instead of the monitoring sensors. [18] presents a feedback based approach for compensating for timing jitter without any change in the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.

The primary goal of this paper is to present an event-based distributed monitoring framework "RFDMon" for distributed systems that utilizes the data distribution service (DDS) methodology to report events or monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and a duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary component of the framework. They are lightweight processes that monitor a device on the computing node and read it periodically or aperiodically to obtain measurements. These sensors publish the measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: System Resource Utilization Monitoring Sensors, Hardware Health Monitoring Sensors, Scientific Application Health Monitoring Sensors, and Web Application Performance Monitoring Sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (structure definition) and the key values (keylist) used to identify different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in the topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.
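The role of key fields can be sketched in Python terms: two samples that agree on the key fields belong to the same topic instance, regardless of their other fields. The class below is an illustrative analogue of a topic type (field names are assumptions based on the MONITORING INFO description), not the framework's actual IDL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringInfo:
    """Illustrative Python analogue of the MONITORING INFO topic type;
    node_name and sensor_name are treated as the key fields here."""
    node_name: str
    sensor_name: str
    sequence_id: int
    measurement: str

def instance_key(sample):
    # DDS distinguishes topic *instances* by the key fields only;
    # successive samples with the same key update the same instance.
    return (sample.node_name, sample.sensor_name)
```

Two successive measurements from the same sensor on the same node thus map to the same instance key, while the same sensor on another node is a distinct instance.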

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under the monitoring info topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

Figure 8: Architecture of Sensor Framework

2. HEARTBEAT: The Heartbeat Sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes a NODE HEALTH INFO topic to notify all of the other nodes regarding the change in status of that node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes the records under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process Ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
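The liveness bookkeeping that the HEARTBEAT topic enables (item 2 above) can be sketched as follows. Class and method names are illustrative, not the framework's API, and the fixed-timeout policy is an assumption; the framework itself derives node state from missed heartbeats in a comparable way.

```python
class HeartbeatTracker:
    """Tracks the last-seen heartbeat time per node; a node whose
    heartbeat has not arrived within `timeout` time units is considered
    down (FRAMEWORK DOWN in the paper's terminology)."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last HEARTBEAT sample

    def on_heartbeat(self, node, now):
        # Called whenever a HEARTBEAT sample arrives for `node`.
        self.last_seen[node] = now

    def down_nodes(self, now):
        # Nodes that have been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```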

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. These publishers publish the data received from various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and are combined only logically; they can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with sensor measurements from each Local Manager. The leader of the region is also responsible for updating changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for the selection of a leader for that region among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with the node's region name. The Local Manager contacts the global membership manager to obtain information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader exists already) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to other local nodes in the future and update the leader information in a file ("REGIONAL LEADER MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader name. This enables the fault-tolerant nature of the framework with respect to the regional leader and ensures periodic update of the infrastructure monitoring database with measurements.
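The persistence step above can be sketched in a few lines. The semicolon-separated RegionName;LeaderName format is stated in the text; the one-record-per-line layout and the function names are our assumptions.

```python
def parse_leader_map(text):
    """Parse RegionName;LeaderName records (assumed one per line) into a
    dict, as a restarted global membership manager would on startup."""
    leaders = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        region, leader = line.split(";", 1)
        leaders[region] = leader
    return leaders

def replace_leader(leaders, region, new_leader):
    """Return the updated mapping and its serialized file contents after
    a LEADER DEAD report triggers re-election for `region`."""
    updated = dict(leaders)
    updated[region] = new_leader
    text = "\n".join(f"{r};{l}" for r, l in sorted(updated.items()))
    return updated, text
```

Because the file is rewritten on every leadership change, a fresh instance of the manager can recover the current leaders by parsing it, which is the fault-tolerance mechanism described below.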

Leader election is a very common problem in distributed computing systems, where a single node is chosen from a group of competing nodes to perform a centralized task. This problem is often termed the "Leader Election Problem". Therefore, a leader election algorithm is required that can guarantee that there will be only one unique designated leader of the group and that all other nodes will know whether they are the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module for leader election.

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of it if the previous instance terminates due to some error. A new instance of the Global Membership Manager receives its data from the "REGIONAL LEADER MAP.txt" file. This enables the fault-tolerant nature of the framework with respect to the global membership manager.

Figure 9: Class Diagram of the Sensor Framework

A UML class diagram showing the relations among the above modules is given in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI and PBS related information, and performance data of applications in execution on the node. These sensors can be periodic or aperiodic depending upon the implementation and type of resource. Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution in that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machines
Swap Utilization          30 seconds   Swap space usage on the machines
Ram Utilization           30 seconds   Memory usage on the machines
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machines
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machines
CPU Voltage               10 seconds   Voltage of the processor on the machines
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machines
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machines
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports the change in state of the MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of the web application

Table 1: List of Monitoring Sensors

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements: HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Sensor priority indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified sensor dead band. This option reduces the number of sensor measurements in the DDS domain when the value of the sensor measurement is unchanged or changing only slightly.
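The dead-band rule in the last property can be captured in a few lines; this is a minimal sketch (class name and interface are our own), assuming that only published values update the comparison reference.

```python
class DeadBandFilter:
    """Suppresses publications whose value differs from the last
    *published* value by no more than the configured dead band."""
    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last = None  # last published value; None until first sample

    def should_publish(self, value):
        if self.last is None or abs(value - self.last) > self.dead_band:
            self.last = value  # only published values move the reference
            return True
        return False
```

With a dead band of 2.0, a sensor that last published 10.0 stays silent at 11.5 but publishes again at 13.0, cutting DDS traffic for slowly drifting values.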

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically as per the value of the sensor period. Sensors support the following commands for publishing the data: START, STOP, and POLL. The START command makes an already initialized sensor start publishing its measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command tries to get the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.
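The START/STOP/POLL behaviour just described can be modelled as a small state machine; the sketch below is illustrative (callback-based interface and names are our own, not the framework's ARINC-653 process API).

```python
class SensorCommandHandler:
    """Minimal model of START/STOP/POLL command handling for a sensor."""
    def __init__(self, read_fn, publish_fn):
        self.read_fn = read_fn        # reads the monitored device
        self.publish_fn = publish_fn  # publishes a measurement to DDS
        self.running = False

    def handle(self, command):
        if command == "START":
            self.running = True
        elif command == "STOP":
            self.running = False
        elif command == "POLL":
            # POLL forces one publication regardless of the periodic cycle.
            self.publish_fn(self.read_fn())

    def periodic_tick(self):
        # Invoked once per sensor period; publishes only while started.
        if self.running:
            self.publish_fn(self.read_fn())
```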

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, carry normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.

Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux file system and processes the output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
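As an illustration of how a sensor like item 1 can derive a utilization percentage, the sketch below computes aggregate CPU utilization from two successive samples of the first "cpu" line of /proc/stat. Treating the counters as cumulative jiffies is how /proc/stat works; counting idle plus iowait as idle time is our simplifying assumption, not necessarily the sensor's exact formula.

```python
def cpu_utilization(prev_line, curr_line):
    """Aggregate CPU utilization (percent) between two samples of the
    first 'cpu' line of /proc/stat: the non-idle share of the delta."""
    def split(line):
        fields = [int(x) for x in line.split()[1:]]
        # fields: user nice system idle [iowait irq softirq steal ...]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        return sum(fields), idle
    total0, idle0 = split(prev_line)
    total1, idle1 = split(curr_line)
    dt = total1 - total0
    return 100.0 * (dt - (idle1 - idle0)) / dt if dt else 0.0
```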

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, carry normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command on the Linux system to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command on the Linux system to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command on the Linux system to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command on the Linux system to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically sends its own name to the DDS domain under the topic "HEARTBEAT" to inform other nodes of its existence in the framework. Local nodes of the region keep track of the state of other local nodes in that region through the HEARTBEAT messages that are transmitted by each local node periodically. If a local node gets terminated, the leader of the region will update the monitoring database with the node state transition values and will notify other nodes of the termination of that local node through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes will notify the global membership manager of the termination of the leader for their region through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager will elect a new leader for the region and notify the local nodes in the region in subsequent messages. Thus, the monitoring database will always contain updated data regarding the various resource utilization and performance features of the infrastructure.

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update the same information in the centralized monitoring database. This sensor logs all the information in case of a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master nodes in the cluster, with the name of the scientific application and other arguments passed as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data into the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
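The hand-off between SciAppManager and the sensor is a producer/consumer exchange of small state-change records over a message queue. The record format below is entirely hypothetical (the paper does not specify it); it only illustrates the kind of encode/decode pair the two sides must agree on.

```python
def format_state_record(node, app, pid, state):
    """Hypothetical wire format for a state-change record that a wrapper
    like SciAppManager could enqueue; the real framework's format is not
    specified in this paper."""
    assert state in ("STARTED", "ENDED", "KILLED")
    return f"{node}|{app}|{pid}|{state}"

def parse_state_record(record):
    """Inverse of format_state_record, as the listening sensor would
    apply before mapping fields onto the MPI PROCESS INFO topic."""
    node, app, pid, state = record.split("|")
    return {"node": node, "app": app, "pid": int(pid), "state": state}
```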

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure of performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data into the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1 through ddsnode4) are started one by one with random time intervals. Once all the nodes have started executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption by the Local Manager at each node is monitored through the "TOP" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization by the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that the CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization stays under ten percent. Similarly, according to Figure 13, the RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transition of various nodes between the states UP and FRAMEWORK DOWN is presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were also updated to FRAMEWORK DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore its state was updated to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, and as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework without delay.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor at ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of regional leader re-election at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of monitoring parameters: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. This framework is easily scalable with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. Also, in the proposed framework, new sensors can easily be added to increase the number of monitoring parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outage in the network (if a Regional Leader node stops working). This framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption by the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications in execution on the node. This monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment, where a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing over the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermi Lab, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and can easily be combined with a performance management system due to its flexible nature. New sensors can be started at any time during execution of the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on computing nodes; therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by the Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In IEEE International Workshop on Engineering of Autonomic and Autonomous Systems, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic and Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor



Nimsoft Monitoring Solution [15] (NMS) offers a light-weight, reliable, and extensive distributed monitoring technique that allows organizations to monitor their physical servers, applications, databases, public or private clouds, and networking services. NMS uses a message bus for the exchange of messages among applications. Applications residing anywhere in the infrastructure publish data over the message bus, and subscriber applications of those messages automatically receive the data. These applications (or components) are configured with the help of a software component (called a hub) and are attached to the message bus. Monitoring is performed by small probes, and the measurements are published to the message bus by robots, which are software components deployed on each managed device. NMS also provides an alarm server for alarm monitoring and a rich GUI portal to visualize a comprehensive view of the system.

These distributed monitoring approaches are significantly scalable in number of nodes, responsive to changes at the computational nodes, and comprehensive in number of parameters. However, these approaches do not support capping of the resources consumed by the monitoring framework, fault containment in the monitoring unit, or extension of the monitoring approach with new parameters while the framework is already executing. In addition, these monitoring approaches are standalone and are not easily extendible to associate with other modules that can perform fault diagnosis for the infrastructure at different granularities (application level, system level, and monitoring level). Furthermore, these monitoring approaches work in a server/client or host/agent manner (except NMS) that requires direct coupling of two entities, where one entity has to be aware of the location and identity of the other.

Timing jitter is one of the major difficulties in accurately monitoring a computing system with periodic tasks when the system is busy running other applications instead of the monitoring sensors. [18] presents a feedback-based approach for compensating for timing jitter without any change to the operating system kernel. This approach has low overhead, is platform independent, and maintains a total bounded jitter for the running scheduler or monitor. Our current work uses this approach to reduce the distributed jitter of sensors across all machines.

The primary goal of this paper is to present an event-based distributed monitoring framework, "RFDMon", for distributed systems that utilizes the data distribution service (DDS) methodology to report events and monitoring measurements. In "RFDMon", all monitoring sensors execute on the ARINC-653 emulator [2]. This enables the monitoring agents to be organized into one or more partitions, where each partition has a period and a duration. These attributes govern the periodicity with which the partition is given access to the CPU. The processes executing under each partition can be configured for real-time properties (priority, periodicity, duration, soft/hard deadline, etc.). The details of the approach are described in later sections of the paper.

5 Architecture of the framework

As described in previous sections, the proposed monitoring framework is based upon a data-centric publish-subscribe communication mechanism. Modules (or processes) in the framework are separated from each other through the concept of spatial and temporal locality, as described in Section 2.3. The proposed framework has the following key concepts and components to perform the monitoring.

5.1 Sensors

Sensors are the primary components of the framework. These are lightweight processes that monitor a device on the computing node, reading it periodically or aperiodically to take measurements. The sensors publish their measurements under a topic (described in the next subsection) to the DDS domain. There are various types of sensors in the framework: system resource utilization monitoring sensors, hardware health monitoring sensors, scientific application health monitoring sensors, and web application performance monitoring sensors. Details regarding individual sensors are provided in Section 6.

5.2 Topics

Topics are the primary unit of information exchange in the DDS domain. Details about the type of a topic (its structure definition) and the key values (keylist) used to identify different instances of the topic are described in an interface definition language (IDL) file. Keys can represent an arbitrary number of fields in a topic. There are various topics in the framework related to monitoring information, node heartbeat, control commands for sensors, node state information, and MPI process state information.

Figure 8: Architecture of Sensor Framework

1. MONITORING INFO: System resource and hardware health monitoring sensors publish measurements under the monitoring info topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

2. HEARTBEAT: The heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence Id fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes on the NODE HEALTH INFO topic to notify all of the other nodes of the change in status of that node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to the other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes the records under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
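To make the topic mechanism concrete, the sketch below shows how a topic such as MONITORING INFO could be declared in a DDS IDL file, with the keylist identifying topic instances. The field names, types, and module name here are illustrative assumptions, not RFDMon's actual definitions; the "#pragma keylist" form is the OpenSplice convention for declaring keys.

```idl
// Hypothetical sketch of a monitoring topic declaration; RFDMon's
// actual .idl file may use different names and types.
module RFDMon {
    struct MonitoringInfo {
        string node_name;    // key: identifies the reporting node
        string sensor_name;  // key: identifies the sensor instance
        long   sequence_id;  // monotonically increasing per sensor
        string measurements; // encoded measurement payload
    };
#pragma keylist MonitoringInfo node_name sensor_name
};
```

Because node_name and sensor_name form the key, each (node, sensor) pair is a distinct topic instance, which lets subscribers track the latest sample per sensor per node.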

5.3 Topic Managers

Topic managers are classes that create a subscriber or publisher for a predefined topic. The publishers publish the data received from the various sensors under the topic name. Subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.
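The topic manager role can be sketched as a thin per-topic wrapper that hides publisher/subscriber creation and hands received samples to the application through callbacks. This is an illustrative stand-in (class and method names are assumed, and a direct in-process dispatch replaces the DDS middleware):

```python
# Minimal sketch of a topic manager: one instance per topic, delivering
# published samples to registered application callbacks. In RFDMon the
# publish step would go through DDS instead of a local dispatch.
class TopicManager:
    def __init__(self, topic_name):
        self.topic_name = topic_name
        self.subscribers = []          # callbacks registered for this topic

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, sample):
        for deliver in self.subscribers:
            deliver(sample)

received = []
hb = TopicManager("HEARTBEAT")
hb.subscribe(received.append)
hb.publish({"node": "ddsnode1", "seq": 1})
# received now holds the published heartbeat sample.
```

The point of the wrapper is that sensors and applications only ever see topic names and samples, never raw DDS entities.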

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous and are combined only logically; they can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that executes as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for updating the changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node in a particular region. This module is also responsible for the selection of a leader for that region from among all of the local nodes. When a local node comes alive, it first contacts the global membership manager with its region name to get information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager will communicate that local node as leader to other local nodes in the future, and it updates the leader information in a file ("REGIONAL_LEADER_MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the new leader's name. This makes the framework fault tolerant with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing infrastructures, where a single node must be chosen from a group of competing nodes to perform a centralized task; this is often termed the "Leader Election Problem". A leader election algorithm is therefore required that can guarantee that there will be only one unique designated leader of the group and that every node knows whether or not it is the leader [19]. Various leader election algorithms for distributed system infrastructures are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the global membership manager about termination of the region's leader. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module.
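The default first-registrant policy and the persisted region-to-leader map described above can be sketched as follows. This is an illustrative reconstruction, not RFDMon's actual code; the class and method names are assumed, and the file format follows the semicolon-separated convention stated in the text.

```python
# Sketch of the Global Membership Manager's default leader choice: the
# first node to register for a region becomes its leader, and the
# region -> leader map is persisted in RegionName;LeaderName format.
class GlobalMembershipManager:
    def __init__(self, map_file="REGIONAL_LEADER_MAP.txt"):
        self.map_file = map_file
        self.leaders = {}                    # region name -> leader node name

    def get_leader(self, region, node):
        """Handle GET_LEADER: return the leader, or make the caller leader."""
        if region not in self.leaders:
            self.leaders[region] = node      # default choice: first registrant
            self._persist()
        return self.leaders[region]

    def leader_dead(self, region, reporting_node):
        """Handle LEADER_DEAD: the first node reporting the failure takes over."""
        self.leaders[region] = reporting_node
        self._persist()
        return reporting_node

    def _persist(self):
        with open(self.map_file, "w") as f:
            for region, leader in self.leaders.items():
                f.write(f"{region};{leader}\n")

gmm = GlobalMembershipManager(map_file="regional_leader_map_demo.txt")
first = gmm.get_leader("regionA", "node1")   # node1 becomes leader
second = gmm.get_leader("regionA", "node2")  # node2 learns node1 leads
```

A more sophisticated policy (e.g. one of the election algorithms cited above) would only replace the bodies of get_leader and leader_dead; the persistence step is what allows a restarted manager to recover the leader map.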

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to an error. The new instance of the Global Membership Manager recovers its data from the "REGIONAL_LEADER_MAP.txt" file. This makes the framework fault tolerant with respect to the global membership manager itself.

A UML class diagram showing the relations among the above modules is given in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that are triggered only when data becomes available.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution within that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
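The CPU cap implied by this static cyclic schedule follows directly from the (hyperperiod, phase, duration) triple: the partition can consume at most duration/hyperperiod of the CPU. A small sketch with assumed parameter values (these are not RFDMon's configured numbers) makes the arithmetic explicit:

```python
# Illustrative arithmetic for static cyclic partition scheduling: a
# partition that runs for `duration` time units starting at `phase`
# within every `hyperperiod` is capped at duration/hyperperiod CPU share.
def partition_cpu_cap(hyperperiod, duration):
    """Upper bound on the fraction of CPU the partition may consume."""
    return duration / hyperperiod

def partition_windows(hyperperiod, phase, duration, n_periods):
    """Absolute (start, end) execution windows over n hyperperiods."""
    return [(k * hyperperiod + phase, k * hyperperiod + phase + duration)
            for k in range(n_periods)]

cap = partition_cpu_cap(hyperperiod=10, duration=2)   # → 0.2, i.e. a 20% cap
windows = partition_windows(10, 1, 2, 3)
# → [(1, 3), (11, 13), (21, 23)]
```

Outside those windows the partition simply does not run, which is how the emulator bounds the monitoring overhead regardless of how much work the sensors would like to do.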

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware; a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor relative to other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the value of a sensor measurement is unchanged or changing only slightly.

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machine
Swap Utilization          30 seconds   Swap space usage on the machine
RAM Utilization           30 seconds   Memory usage on the machine
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machine
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machine
CPU Voltage               10 seconds   Voltage of the processor on the machine
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machine
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machine
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of web applications

Table 1: List of Monitoring Sensors
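The dead-band rule described above can be sketched in a few lines. This is an illustrative filter (names assumed), not RFDMon's implementation: a reading is published only when it differs from the last published value by more than the configured dead band.

```python
# Sketch of the sensor dead-band rule: suppress publication when the
# value has not moved beyond the dead band since the last publish.
class DeadBandFilter:
    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        if self.last_published is None or abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False

readings = [10.0, 10.2, 10.4, 11.5, 11.6, 13.0]
f = DeadBandFilter(dead_band=1.0)
published = [v for v in readings if f.should_publish(v)]
# → [10.0, 11.5, 13.0]: the small fluctuations are filtered out.
```

Note that the comparison is always against the last *published* value, not the last *observed* one; otherwise a slow drift of many sub-threshold steps would never be reported.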

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically, as per the value of the sensor period. Sensors support three types of commands for controlling publication: START, STOP, and POLL. The START command causes an already initialized sensor to start publishing its measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command retrieves the latest measurement from the sensor. These sensors publish the data under the predefined topic to the DDS domain (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1, which are categorized in the following subsections based upon their type and functionality.

System Resource Utilization Monitoring Sensors

A number of sensors have been developed to monitor the utilization of system resources. These sensors are periodic (30 seconds), follow SOFT deadlines, carry normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux system and processes its output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
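As an example of how such a sensor derives a measurement from a /proc file, the sketch below computes aggregate CPU utilization from two successive reads of the "cpu" line of /proc/stat. The function names are assumptions for illustration; RFDMon's actual sensor code may differ.

```python
# A periodic CPU utilization sensor can compute utilization as the share
# of non-idle jiffies between two successive samples of /proc/stat.
def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total)."""
    fields = [int(x) for x in stat_text.splitlines()[0].split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)

def cpu_utilization(prev, curr):
    """Percent CPU use between two (idle, total) samples."""
    d_idle, d_total = curr[0] - prev[0], curr[1] - prev[1]
    return 100.0 * (d_total - d_idle) / d_total if d_total else 0.0

# Two synthetic /proc/stat samples, 30 seconds apart:
sample1 = "cpu  100 0 100 800 0 0 0 0"
sample2 = "cpu  150 0 150 900 0 0 0 0"
util = cpu_utilization(read_cpu_times(sample1), read_cpu_times(sample2))
# → 50.0: of the 200 new jiffies, 100 were non-idle.
```

On a live system the sensor would read open("/proc/stat").read() at each period instead of the synthetic strings, and publish the result under MONITORING INFO.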

Hardware Health Monitoring Sensors

In the proposed framework, various sensors have been developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds), follow SOFT deadlines, carry normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command on the Linux system to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command on the Linux system to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command on the Linux system to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command on the Linux system to collect the measurement of the motherboard voltage.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a heartbeat sensor that periodically publishes its own name to the DDS domain under the topic "HEARTBEAT" to inform the other nodes of its existence in the framework. Local nodes of the region keep track of the state of the other local nodes in that region through the HEARTBEAT messages that each local node transmits periodically. If a local node is terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the topic "NODE HEALTH INFO". If a leader node is terminated, the other local nodes notify the global membership manager of the termination of their region's leader through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
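The liveness inference described above can be sketched as a heartbeat tracker: a node whose heartbeat has not been seen within a timeout is considered down. The class, timeout value, and state labels here are illustrative assumptions (RFDMon additionally distinguishes DOWN from FRAMEWORK DOWN, which this sketch does not).

```python
# Sketch of heartbeat-based liveness tracking: mark a node down when its
# last heartbeat is older than the allowed timeout.
class HeartbeatTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}                 # node name -> last heartbeat time

    def on_heartbeat(self, node, now):
        self.last_seen[node] = now

    def states(self, now):
        return {node: ("UP" if now - t <= self.timeout else "FRAMEWORK_DOWN")
                for node, t in self.last_seen.items()}

t = HeartbeatTracker(timeout=60)            # two missed 30-second heartbeats
t.on_heartbeat("ddsnode1", now=0)
t.on_heartbeat("ddsnode2", now=30)
states = t.states(now=90)
# → ddsnode1 has exceeded the timeout; ddsnode2 is still UP.
```

With a 30-second heartbeat period, a timeout of roughly two periods tolerates a single lost or jittered heartbeat without declaring a false failure.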

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to record this information in the centralized monitoring database. The sensor logs all information on a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued from the master node of the cluster to execute this SciAppManager application; the name of the scientific application and its other arguments are passed as arguments to SciAppManager (e.g., mpirun -np N -machinefile worker.config SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data into the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
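The wrapper pattern described above can be sketched as follows. This is an illustrative stand-in, not SciAppManager itself: the function name is assumed, and a plain in-process queue stands in for the POSIX message queue that the real sensor listens on.

```python
# Sketch of the SciAppManager pattern: run the real application as a
# child process and report its state transitions (STARTED / ENDED /
# KILLED) to a queue that the sensor side consumes.
import queue
import subprocess
import sys

def run_wrapped(cmd, state_queue):
    proc = subprocess.Popen(cmd)
    state_queue.put(("STARTED", proc.pid))
    rc = proc.wait()
    # A negative return code means the child was terminated by a signal.
    state_queue.put(("KILLED" if rc < 0 else "ENDED", proc.pid))
    return rc

q = queue.Queue()
run_wrapped([sys.executable, "-c", "pass"], q)   # any short-lived command
events = [q.get()[0], q.get()[0]]
# → ["STARTED", "ENDED"]
```

In the real framework the consumer side is the scientific application sensor, which reformats each queued event into the MPI PROCESS INFO topic structure before publishing it to DDS.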

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of a web application executing on the node through the web server performance logs. A generic structure for the performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data to a POSIX message queue (different from SciAppManager's) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data into the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness to faults of the developed monitoring framework. In these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). The Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. When a Local Manager at a node is killed, the Regional Leader reports the change in the node's state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader itself is killed, the monitoring framework elects a new leader for that region. During the experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization remains under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

The transitions of the various nodes between the states UP and FRAMEWORK DOWN are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore, its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore, its state was updated to UP at that time. Figure 15 presents the nodes that were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the regional leader, a new leader is elected instantly, without further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of faults introduced elsewhere in the framework on its monitoring capabilities. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults in the region (leader re-election and nodes leaving the framework). Figure 16 presents the CPU utilization of ddsnode1 from the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor at ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even during the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two data samples at most are missing from the database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the approach

"RFDMon" is comprehensive in the types of parameters it monitors: system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself. New sensors can also easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial network outages (e.g., if a Regional Leader node stops working). The framework can also self-configure (start, stop, and poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption by the sensors can be capped by applying ARINC-653 scheduling policies, as described in previous sections. Furthermore, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications executing on the node. The monitoring framework has a very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combining with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time within the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or to analyse the current monitoring data. The framework can be applied to a wide variety of performance monitoring and management problems. It helps in visualizing resource utilization, hardware health, and process state on the computing nodes; an administrator can therefore easily find the location of a fault in the system and its possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In future work, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper we have presented the detailed design ofldquoRFDMonrdquo which is a real-time and fault-tolerant dis-tributed system monitoring approach based upon datacentric publish-subscribe paradigm We have also de-scribed the concepts of OpenSplice DDS Arinc-653operating system and ruby on rails web developmentframework We have shown that the proposed distributedmonitoring framework ldquoRFDmonrdquo can efficiently and ac-curately monitor the system variables (CPU utilization

18

memory utilization disk utilization network utilization)system health (temperature and voltage of Motherboardand CPU) application performance variables (applicationresponse time queue size throughput) and scientificapplication data structures (eg PBS information MPIvariables) accurately with minimum latency The addedadvantages of using the proposed framework have alsobeen discussed in the paper

12 Acknowledgement

R Mehrotra and S Abdelwahed are supported for thiswork from the Qatar Foundation grant NPRP 09-778-2299 A Dubey is supported in part by Fermi NationalAccelerator Laboratory operated by Fermi Research Al-liance LLC under contract No DE-AC02-07CH11359with the United States Department of Energy (DoE)and by DoE SciDAC program under the contract NoDOE DE-FC02-06 ER41442 We are grateful to thehelp and guidance provided by T Bapty S Neema JKowalkowski J Simone D Holmgren A Singh NSeenu and R Herbers


Figure 11: Life Cycle of a CPU Utilization Sensor



Figure 8: Architecture of Sensor Framework

surements under the MONITORING INFO topic. This topic contains Node Name, Sensor Name, Sequence Id, and Measurements information in the message.

2. HEARTBEAT: The Heartbeat sensor uses this topic to publish its heartbeat in the DDS domain to notify the framework that it is alive. This topic contains Node Name and Sequence ID fields in the messages. All nodes listening to the HEARTBEAT topic can keep track of the existence of other nodes in the DDS domain through this topic.

3. NODE HEALTH INFO: When the leader node (defined in Section 5.6) detects a change in the state (UP, DOWN, FRAMEWORK DOWN) of any node through a change in its heartbeat, it publishes a NODE HEALTH INFO message to notify all of the other nodes regarding the change in status of that node. This topic contains node name, severity level, node state, and transition type in the message.

4. LOCAL COMMAND: This topic is used by the leader to send control commands to the other local nodes to start, stop, or poll the monitoring sensors for measurements. It contains node name, sensor name, command, and sequence id in the messages.

5. GLOBAL MEMBERSHIP INFO: This topic is used for communication between local nodes and the global membership manager for selection of the leader and for providing information related to the existence of the leader. This topic contains sender node name, target node name, message type (GET LEADER, LEADER ACK, LEADER EXISTS, LEADER DEAD), region name, and leader node name fields in the message.

6. PROCESS ACCOUNTING INFO: The process accounting sensor reads records from the process accounting system and publishes them under this topic name. This topic contains node name, sensor name, and process accounting record in the message.

7. MPI PROCESS INFO: This topic is used to publish the execution state (STARTED, ENDED, KILLED) and MPI/PBS information variables of MPI processes running on the computing node. These MPI or PBS variables contain the process Ids, number of processes, and other environment variables. Messages of this topic contain node name, sensor name, and MPI process record fields.

8. WEB APPLICATION INFO: This topic is used to publish performance measurements of a web application executing on the computing node. This performance measurement is a generic data structure that contains information logged from the web service related to average response time, heap memory usage, number of JAVA threads, and pending requests inside the system.
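The topic payloads listed above can be sketched as plain data records. In the real framework these are DDS data types; the field names below are assumptions taken directly from the field lists in the text, and only a few of the topics are shown.

```python
from dataclasses import dataclass

# Illustrative stand-ins for the DDS topic payloads described above.
# The actual framework defines these as DDS (IDL) types; field names
# here follow the field lists given in the text.

@dataclass
class MonitoringInfo:
    node_name: str
    sensor_name: str
    sequence_id: int
    measurement: float

@dataclass
class Heartbeat:
    node_name: str
    sequence_id: int

@dataclass
class NodeHealthInfo:
    node_name: str
    severity_level: str
    node_state: str       # e.g. "UP", "DOWN", "FRAMEWORK_DOWN"
    transition_type: str

@dataclass
class LocalCommand:
    node_name: str
    sensor_name: str
    command: str          # "START", "STOP", or "POLL"
    sequence_id: int
```

A subscriber for, say, MONITORING INFO would receive `MonitoringInfo`-shaped samples and hand them to the underlying application.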

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. The publishers publish the data received from the various sensors under the topic name; the subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous, and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a single region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that is executed as an agent on each computing node of the monitoring framework. The primary responsibility of the Local Manager is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the multiple Local Manager nodes that belong to the same region, one Local Manager node is selected for updating the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for updating changes in state (UP, DOWN, FRAMEWORK DOWN) of the various Local Manager nodes. Once the leader of a region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node for a particular region. This module is also responsible for selecting a leader for that region from among all of the local nodes. Once a local node comes alive, it first contacts the global membership manager with its region name. The Local Manager contacts the global membership manager for information regarding the leader of its region; the global membership manager replies with the name of the leader for that region (if a leader already exists) or assigns the new node as leader. The global membership manager will communicate the same local node as leader to other local nodes in future, and updates the leader information in a file ("REGIONAL LEADER MAP.txt") on disk in semicolon-separated format (RegionName;LeaderName). When a local node sends a message to the global membership manager that its leader is dead, the global membership manager selects a new leader for that region and replies to the local node with the leader's name. This enables fault tolerance in the framework with respect to the regional leader, and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing infrastructure, where a single node must be chosen from a group of competing nodes to perform a centralized task; this is often termed the "Leader Election Problem". A leader election algorithm is therefore required which can guarantee that there will be only one unique designated leader of the group and that all other nodes will know whether they are the leader or not [19]. Various leader election algorithms for distributed system infrastructure are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of the region based on the default choice available to the Global Membership Manager. This default choice can be the first node registering for a region, or the first node notifying the global membership manager about termination of the leader of the region. However, other more sophisticated algorithms can easily be plugged into the framework by modifying the global membership manager module for leader election.
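The default leader-selection bookkeeping described above can be sketched as follows. This is a simplified model, not the framework's implementation: the message names (GET LEADER, LEADER DEAD) come from the topic descriptions earlier, and the on-disk persistence of "REGIONAL LEADER MAP.txt" is reduced here to building the serialized string.

```python
# Minimal sketch of the Global Membership Manager's default leader
# choice: the first registrant for a region becomes its leader, and a
# LEADER_DEAD report promotes the reporting node. Persistence to
# "REGIONAL LEADER MAP.txt" is modeled by serialized_map().

class GlobalMembershipManager:
    def __init__(self):
        self.leaders = {}   # region name -> leader node name
        self.members = {}   # region name -> member node names

    def handle_get_leader(self, region, node):
        """A node registers with its region; the first registrant
        becomes the leader (the default choice described above)."""
        self.members.setdefault(region, []).append(node)
        if region not in self.leaders:
            self.leaders[region] = node
        return self.leaders[region]

    def handle_leader_dead(self, region, reporting_node):
        """Drop the dead leader and promote the reporting node
        (or any surviving member)."""
        dead = self.leaders.get(region)
        survivors = [n for n in self.members.get(region, []) if n != dead]
        self.members[region] = survivors
        if not survivors:
            self.leaders.pop(region, None)
            return None
        self.leaders[region] = (reporting_node if reporting_node in survivors
                                else survivors[0])
        return self.leaders[region]

    def serialized_map(self):
        """Region-to-leader map in 'RegionName;LeaderName' lines."""
        return "\n".join(f"{r};{l}" for r, l in sorted(self.leaders.items()))
```

A more sophisticated election algorithm would replace the two `handle_*` policies while keeping the same message interface.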

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of it if the previous instance terminates due to some error. A new instance of the Global Membership Manager receives its data from the "REGIONAL LEADER MAP.txt" file. This enables fault tolerance in the framework with respect to the global membership manager itself.

A UML class diagram showing the relations among the above modules is presented in Figure 9. The next section describes the details of each sensor executing on each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that get triggered only on availability of the data.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the ARINC-653 emulator. As discussed in Section 2.4, the emulator monitors the deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, the phase, and the duration of execution within that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
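The effect of this static cyclic schedule can be sketched as follows: within every hyperperiod, the Local Manager partition runs only from its phase offset for a fixed duration, which bounds its CPU share. The numbers in the example are illustrative, not taken from the paper.

```python
# Sketch of the static cyclic schedule applied to the Local Manager
# partition: hyperperiod, phase, and duration determine the execution
# windows, and duration/hyperperiod bounds the partition's CPU share.

def partition_windows(hyperperiod, phase, duration, horizon):
    """Execution windows (start, end) for the partition up to `horizon`."""
    windows = []
    t = phase
    while t < horizon:
        windows.append((t, t + duration))
        t += hyperperiod
    return windows

def cpu_cap(hyperperiod, duration):
    """Maximum CPU fraction the partition can consume."""
    return duration / hyperperiod

# For example (illustrative numbers), a 1000 ms hyperperiod with a
# 100 ms execution slot caps the Local Manager at 10% of one CPU.
```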

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor over other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value exceeds the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the measured value is unchanged or changing only slightly.

Sensor Name               Period      Description
CPU Utilization           30 seconds  Aggregate utilization of all CPU cores on the machine
Swap Utilization          30 seconds  Swap space usage on the machine
RAM Utilization           30 seconds  Memory usage on the machine
Hard Disk Utilization     30 seconds  Disk usage on the machine
CPU Fan Speed             30 seconds  Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds  Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds  Temperature of the processor on the machine
Motherboard Temperature   10 seconds  Temperature of the motherboard on the machine
CPU Voltage               10 seconds  Voltage of the processor on the machine
Motherboard Voltage       10 seconds  Voltage of the motherboard on the machine
Network Utilization       10 seconds  Bandwidth utilization of each network card
Network Connection        30 seconds  Number of TCP connections on the machine
Heartbeat                 30 seconds  Periodic liveness messages
Process Accounting        30 seconds  Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1          Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info      -1          Aperiodic sensor that reports performance data of the web application

Table 1: List of Monitoring Sensors
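The dead-band check described above can be sketched as a simple filter over successive measurements; only values that move outside the band around the last published value are forwarded to the DDS domain.

```python
# Sketch of the dead-band property: a measurement is published only
# when it differs from the last *published* value by more than the
# configured dead band.

def should_publish(current, last_published, dead_band):
    return last_published is None or abs(current - last_published) > dead_band

def filter_with_dead_band(samples, dead_band):
    """Return the subset of samples a sensor would actually publish."""
    published, last = [], None
    for s in samples:
        if should_publish(s, last, dead_band):
            published.append(s)
            last = s
    return published

# With a dead band of 5, small fluctuations are suppressed:
# filter_with_dead_band([10, 11, 13, 20, 21], 5) -> [10, 20]
```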

The life cycle and function of a typical sensor ("CPU utilization") is described in Figure 11. Periodic sensors publish data periodically, as per the value of the sensor period. Sensors support three types of commands for controlling publication: START, STOP, and POLL. The START command starts an already initialized sensor so that it begins publishing sensor measurements. The STOP command stops the sensor thread from publishing measurement data. The POLL command retrieves the latest measurement from the sensor. The sensors publish their data to the DDS domain under the predefined topics (e.g., MONITORING INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1. These sensors are categorized in the following subsections based upon their type and functionality.
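The command interface just described can be sketched as a small state machine. This is an illustrative model, not the framework's code: threading and DDS publishing are replaced by a scheduler-driven `tick()` and a publish callback so the control flow is visible.

```python
# Minimal sketch of the sensor life cycle: START begins periodic
# publishing, STOP halts it, and POLL forces an immediate reading.
# publish_fn stands in for the DDS topic manager.

class PeriodicSensor:
    def __init__(self, name, period, read_fn, publish_fn):
        self.name, self.period = name, period
        self.read_fn, self.publish_fn = read_fn, publish_fn
        self.running = False
        self.sequence_id = 0

    def handle_command(self, command):
        if command == "START":
            self.running = True
        elif command == "STOP":
            self.running = False
        elif command == "POLL":
            self._publish()          # latest measurement on demand

    def tick(self):
        """Called by the scheduler once per period."""
        if self.running:
            self._publish()

    def _publish(self):
        self.sequence_id += 1
        self.publish_fn(self.name, self.sequence_id, self.read_fn())
```

In the real framework each sensor runs as an ARINC-653 process, so the periodic `tick()` is driven by the emulator's scheduler rather than a plain loop.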

System Resource Utilization Monitoring Sensors

A number of sensors have been developed to monitor the utilization of system resources. These sensors are periodic (30 seconds) in nature, follow SOFT deadlines, have normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo, etc.) to collect the measurements. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Utilization: The CPU utilization sensor monitors the "/proc/stat" file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: The RAM utilization sensor monitors the "/proc/meminfo" file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of MPI Process Sensor

3. Disk Utilization: The disk utilization sensor executes the "df -P" command on the Linux file system and processes the output to collect measurements of the disk usage of the system.

4. SWAP Utilization: The SWAP utilization sensor monitors the "/proc/swaps" file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: The network utilization sensor monitors the "/proc/net/dev" file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, CPU fan speed, etc. These sensors are periodic (10 seconds) in nature, follow SOFT deadlines, have normal priority, and read the measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. These sensors publish the measurement data under the MONITORING INFO topic.

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" to collect the measurement of the CPU fan speed.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the measurement of the CPU temperature.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the measurement of the motherboard temperature.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the measurement of the CPU voltage.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the measurement of the motherboard voltage.
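A hardware-health sensor of this kind reduces to running the ipmitool command and extracting the numeric reading from its output. The commands quoted above come from the text; the pipe-separated "name | reading units | status" output layout parsed here is an assumption about the typical ipmitool output format.

```python
import subprocess

# Sketch of an IPMI health reading. parse_ipmi_reading() assumes the
# common pipe-separated ipmitool output layout; read_ipmi_sensor()
# wires it to the command form quoted in the text.

def parse_ipmi_reading(output):
    """Extract the numeric reading from one line of ipmitool output,
    e.g. 'fan1 | 7040 RPM | ok' -> 7040.0."""
    fields = [f.strip() for f in output.split("|")]
    return float(fields[1].split()[0])

def read_ipmi_sensor(sensor_id):
    """Run ipmitool for one sensor record and parse its first line."""
    out = subprocess.check_output(
        ["ipmitool", "-S", "SDR", sensor_id], text=True)
    return parse_ipmi_reading(out.splitlines()[0])
```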

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically publishes its own name to the DDS domain under the topic "HEARTBEAT" to inform the other nodes of its existence in the framework. The local nodes of a region keep track of the state of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the topic "NODE HEALTH INFO". If a leader node gets terminated, the other local nodes notify the global membership manager of the termination of their region's leader through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
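The failure-detection side of this mechanism can be sketched as a tracker that turns missing HEARTBEAT messages into state transitions. The 30-second heartbeat period comes from Table 1; the two-period grace window before declaring a node down is an assumption for illustration.

```python
# Sketch of heartbeat-based failure detection. A node whose last
# heartbeat is older than the grace window would be reported as
# FRAMEWORK_DOWN via the NODE HEALTH INFO topic.

HEARTBEAT_PERIOD = 30          # seconds (Table 1)
GRACE = 2 * HEARTBEAT_PERIOD   # missed-heartbeat tolerance (assumed)

class HeartbeatTracker:
    def __init__(self):
        self.last_seen = {}    # node name -> time of last heartbeat

    def on_heartbeat(self, node, now):
        self.last_seen[node] = now

    def down_nodes(self, now):
        """Nodes whose framework should be reported FRAMEWORK_DOWN."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > GRACE)
```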

Scientific Application Health Monitoring Sensor

The primary purpose of implementing this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update the same information in the centralized monitoring database. This sensor logs all the information in case of a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute this SciAppManager application from the master node of the cluster, with the name of the scientific application and the other arguments passed as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
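The handoff between SciAppManager and the sensor can be sketched as follows. The real implementation uses a POSIX message queue on each node; a `queue.Queue` stands in for it here so the message flow is runnable, and the record field names are assumptions based on the topic description.

```python
import json
import queue

# Sketch of the SciAppManager-to-sensor handoff. The wrapper reports
# state changes of the wrapped MPI process; the sensor drains the
# queue and formats MPI_PROCESS_INFO-style records for publication.

STATES = ("STARTED", "ENDED", "KILLED")

def report_state(mq, process_id, state, extra=None):
    """What SciAppManager writes when the child process changes state."""
    assert state in STATES
    mq.put(json.dumps({"process_id": process_id, "state": state,
                       "info": extra or {}}))

def drain_to_topic(mq, node_name):
    """What the scientific-application sensor would publish."""
    records = []
    while not mq.empty():
        rec = json.loads(mq.get())
        records.append({"node_name": node_name,
                        "sensor_name": "MpiProcessInfo",
                        "mpi_process_record": rec})
    return records
```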

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of a web application executing on the node through the web server performance logs. A generic structure for the performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of JAVA threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data into a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. This sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, the fault-adaptive nature, and the responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). The Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1 through ddsnode4) were started one by one with random time intervals. Once all the nodes were executing the monitoring framework, the Local Manager on a few nodes was killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager at each node was monitored through the "top" system command, and the various node state transitions were reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes in the utilization; even in the case of spikes, utilization stays under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent,


Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Observations from Figures 14 and 15

The transition of the various nodes between the states UP and FRAMEWORK DOWN is presented in Figure 14. According to the figure, ddshost1 was started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed, so its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK DOWN once their Local Managers were killed at time samples 390 and 410 respectively. The Local Manager at ddshost1 was restarted at time sample 440, so its state was updated to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as leader of the region.

Combined observation of Figures 14 and 15 makes it clearly evident that as soon as there is a fault in the

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced on this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the most ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults (leader re-election and nodes going out of the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 as recorded in the centralized database during the experiment. This data reflects the CPU monitoring data reported by the Regional Leader, collected through the CPU monitoring sensor from ddsnode1. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the


Figure 16: CPU Utilization at Node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have only a minimal impact on the monitoring functionality of the framework.

8 Benefits of the Approach

"RFDMon" is comprehensive in the types of monitoring parameters it supports: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism, which is extremely scalable in itself. Also, new sensors can easily be added to increase the number of monitoring parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial outage in the network (if a Regional Leader node stops working), and it can self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system. The framework can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in the previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory area or data structures of applications executing on the node. The monitoring framework has a very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct implementation of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combining with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time during execution of the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or to analyse the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes. Therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in a multi-tier web service environment. In the future, we will show that the proposed monitoring framework can be combined with the fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R Mehrotra and S Abdelwahed are supported for thiswork from the Qatar Foundation grant NPRP 09-778-2299 A Dubey is supported in part by Fermi NationalAccelerator Laboratory operated by Fermi Research Al-liance LLC under contract No DE-AC02-07CH11359with the United States Department of Energy (DoE)and by DoE SciDAC program under the contract NoDOE DE-FC02-06 ER41442 We are grateful to thehelp and guidance provided by T Bapty S Neema JKowalkowski J Simone D Holmgren A Singh NSeenu and R Herbers

References

[1] Andrew S Tanenbaum and Maarten Van SteenDistributed Systems Principles and ParadigmsPrentice Hall 2 edition October 2006

[2] Abhishek Dubey Gabor Karsai and NagabhushanMahadevan A component model for hard-real timesystems Ccm with arinc-653 Software Practiceand Experience (In press) 2011 accepted for pub-lication

[3] Opensplice dds community edition httpwwwprismtechcomopenspliceopensplice-dds-community

[4] Ruby on rails httprubyonrailsorg[Sep2011]

[5] Arinc specification 653-2 Avionics applicationsoftware standard interface part 1required servicesTechnical report Annapolis MD December 2005

[6] Bert Farabaugh Gerardo Pardo-Castellote andRick Warren An introduction to dds anddata-centric communications 2005 httpwwwomgorgnewswhitepapersIntro_To_DDSpdf

[7] A Goldberg and G Horvath Software fault protec-tion with arinc 653 In Aerospace Conference 2007IEEE pages 1 ndash11 march 2007

[8] Ruby httpwwwruby-langorgen[Sep2011]

[9] Rajat Mehrotra Abhishek Dubey Sherif Abdelwa-hed and Weston Monceaux Large scale monitor-ing and online analysis in a distributed virtualizedenvironment Engineering of Autonomic and Au-tonomous Systems IEEE International Workshopon 01ndash9 2011

[10] Lorenzo Falai Observing Monitoring and Evalu-ating Distributed Systems PhD thesis Universitadegli Studi di Firenze December 2007

[11] Monitoring in distributed systems httpwwwansacoukANSATech94Primary100801pdf[Sep2011]

[12] Ganglia monitoring system httpgangliasourceforgenet[Sep2011]

[13] Nagios httpwwwnagiosorg[Sep2011]

[14] Zenoss the cloud management company httpwwwzenosscom[Sep2011]

[15] Nimsoft unified manager httpwwwnimsoftcomsolutionsnimsoft-unified-manager[Nov2011]

[16] Zabbix The ultimate open source monitor-ing solution httpwwwzabbixcom[Nov2011]

[17] Opennms httpwwwopennmsorg[Sep2011]

[18] Abhishek Dubey et al Compensating for timing jit-ter in computing systems with general-purpose op-erating systems In ISROC Tokyo Japan 2009

[19] Hans Svensson and Thomas Arts A new leaderelection implementation In Proceedings of the2005 ACM SIGPLAN workshop on Erlang ER-LANG rsquo05 pages 35ndash39 New York NY USA2005 ACM

[20] G Singh Leader election in the presence of linkfailures Parallel and Distributed Systems IEEETransactions on 7(3)231 ndash236 mar 1996

19

[21] Scott Stoller Dept and Scott D Stoller Leaderelection in distributed systems with crash failuresTechnical report 1997

[22] Christof Fetzer and Flaviu Cristian A highly avail-able local leader election service IEEE Transac-tions on Software Engineering 25603ndash618 1999

[23] E Korach S Kutten and S Moran A modu-lar technique for the design of efficient distributedleader finding algorithms ACM Transactions onProgramming Languages and Systems 1284ndash1011990

[24] Intelligent platform management interface(ipmi) httpwwwintelcomdesignserversipmi[Sep2011]

[25] Pan Pan Abhishek Dubey and Luciano PiccoliDynamic workflow management and monitoringusing dds In 7th IEEE International Workshop onEngineering of Autonomic amp Autonomous Systems(EASe) 2010 under Review

[26] Daytrader httpcwikiapacheorgGMOxDOC20daytraderhtml[Nov2010]

[27] Websphere application server community editionhttpwww-01ibmcomsoftwarewebserversappservcommunity[Oct2011]

[28] Rajat Mehrotra Abhishek Dubey Sherif Abdelwa-hed and Asser Tantawi A Power-aware Modelingand Autonomic Management Framework for Dis-tributed Computing Systems CRC Press 2011

20

Figure 11 Life Cycle of a CPU Utilization Sensor

21

  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
      • Issues in Distributed System Monitoring
      • Related Work
      • Architecture of the framework
        • Sensors
        • Topics
        • Topic Managers
        • Region
        • Local Manager
        • Leader of the Region
        • Global Membership Manager
          • Sensor Implementation
          • Experiments
          • Benefits of the approach
          • Application of the Framework
          • Future Work
          • Conclusion
          • Acknowledgement

These MPI or PBS variables contain the process IDs, the number of processes, and other environment variables. Messages of this topic contain the node name, sensor name, and MPI process record fields.

8. WEB_APPLICATION_INFO: This topic is used to publish the performance measurements of a web application executing on the computing node. The measurement is a generic data structure that contains information logged from the web service: average response time, heap memory usage, number of Java threads, and pending requests inside the system.
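As an illustration, a WEB_APPLICATION_INFO sample might be modeled as below. This is a minimal sketch in Python; the field names are assumptions for illustration, not the framework's actual DDS topic definition.

```python
from dataclasses import dataclass, asdict

# Hypothetical shape of one WEB_APPLICATION_INFO sample; field names are
# illustrative only, not the framework's actual topic structure.
@dataclass
class WebApplicationInfo:
    node_name: str
    avg_response_time_ms: float   # average response time from the web server log
    heap_memory_usage_mb: float   # heap memory currently in use
    java_thread_count: int        # number of Java threads
    pending_requests: int         # requests queued inside the server

# A sample as a subscriber might receive it
sample = WebApplicationInfo("ddshost1", 12.5, 256.0, 42, 3)
```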

5.3 Topic Managers

Topic Managers are classes that create a subscriber or publisher for a predefined topic. Publishers publish the data received from various sensors under the topic name; subscribers receive data from the DDS domain under the topic name and deliver it to the underlying application for further processing.
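The publish-subscribe pattern behind the topic managers can be sketched in a few lines. This is an in-process stand-in, not the OpenSplice DDS API (which provides the same semantics over the network):

```python
from collections import defaultdict
from typing import Any, Callable

# Minimal in-process stand-in for a topic manager: publishers push samples
# under a topic name, and every subscriber registered for that topic
# receives each sample. DDS does this across nodes; this only mirrors the
# publish/subscribe pattern itself.
class TopicManager:
    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # topic name -> callbacks

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, sample: Any) -> None:
        for callback in self._subscribers[topic]:
            callback(sample)

# Usage: a subscriber for MONITORING_INFO collects published measurements.
manager = TopicManager()
received = []
manager.subscribe("MONITORING_INFO", received.append)
manager.publish("MONITORING_INFO", {"sensor": "CPU Utilization", "value": 0.7})
```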

5.4 Region

The proposed monitoring framework functions by organizing the nodes into regions (or clusters). Nodes can be homogeneous or heterogeneous and are combined only logically. These nodes can be located in a single server rack or on a single physical machine (in the case of virtualization). However, physical closeness is recommended when combining nodes into a region, to minimize unnecessary communication overhead in the network.

5.5 Local Manager

The Local Manager is a module that executes as an agent on each computing node of the monitoring framework. Its primary responsibility is to set up the sensor framework on the node and publish the measurements in the DDS domain.

5.6 Leader of the Region

Among the Local Manager nodes that belong to the same region, one is selected to update the centralized monitoring database with the sensor measurements from each Local Manager. The leader of the region is also responsible for recording changes in state (UP, DOWN, FRAMEWORK_DOWN) of the various Local Manager nodes. Once a leader of the region dies, a new leader is selected for the region. Selection of the leader is done by the Global Membership Manager module, as described in the next subsection.

5.7 Global Membership Manager

The Global Membership Manager module is responsible for maintaining the membership of each node in its region and for selecting a leader for that region from among the local nodes. Once a local node comes alive, it first contacts the Global Membership Manager with its region name. The Local Manager asks the Global Membership Manager for the leader of its region; the Global Membership Manager replies with the name of the leader (if one already exists) or assigns the new node as leader. The Global Membership Manager then announces this node as leader to the other local nodes and records the leader information in a file on disk ("REGIONAL_LEADER_MAP.txt") in semicolon-separated format (RegionName;LeaderName). When a local node reports that its leader is dead, the Global Membership Manager selects a new leader for that region and replies to the local node with the new leader's name. This provides fault tolerance with respect to the regional leader and ensures periodic updates of the infrastructure monitoring database with measurements.

Leader election is a very common problem in distributed computing, in which a single node must be chosen from a group of competing nodes to perform a centralized task. A leader election algorithm is therefore required that guarantees there will be exactly one designated leader of the group and that every node knows whether it is the leader [19]. Various leader election algorithms for distributed systems are presented in [19, 20, 21, 22, 23]. Currently, "RFDMon" selects the leader of a region based on the default choice available to the Global Membership Manager: the first node registering for a region, or the first node notifying the Global Membership Manager about termination of the region's leader. More sophisticated algorithms can be plugged into the framework by modifying the Global Membership Manager module.
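The default bookkeeping described above can be sketched as follows. This is an illustrative Python sketch, not the framework's code; the promotion rule (next registered member becomes leader) is an assumption consistent with the "default choice" described in the text, and the serialized format follows the RegionName;LeaderName convention.

```python
# Sketch of the Global Membership Manager's default leader bookkeeping:
# the first node registering for a region becomes its leader, and on a
# LEADER_DEAD report the next known member is promoted (promotion rule is
# an illustrative assumption). The on-disk format follows the described
# "Region;Leader" pairs, one per line.
class GlobalMembershipManager:
    def __init__(self) -> None:
        self.members = {}   # region -> list of member nodes
        self.leaders = {}   # region -> current leader

    def register(self, region: str, node: str) -> str:
        """Register a node; return the region's leader (the node itself if first)."""
        self.members.setdefault(region, []).append(node)
        if region not in self.leaders:
            self.leaders[region] = node
        return self.leaders[region]

    def leader_dead(self, region: str) -> str:
        """Drop the dead leader and promote the next registered member."""
        dead = self.leaders[region]
        self.members[region].remove(dead)
        self.leaders[region] = self.members[region][0]
        return self.leaders[region]

    def serialize(self) -> str:
        """Render the leader-map file content: Region;Leader per line."""
        return "\n".join(f"{r};{l}" for r, l in sorted(self.leaders.items()))

gmm = GlobalMembershipManager()
gmm.register("regionA", "ddshost1")   # first registrant becomes leader
gmm.register("regionA", "ddsnode4")
gmm.leader_dead("regionA")            # ddsnode4 is promoted
```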

Figure 9: Class Diagram of the Sensor Framework

The Global Membership Manager is executed through a wrapper executable, "GlobalMembershipManagerAPP", as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance if the previous instance terminates due to an error. The new instance recovers its data from the "REGIONAL_LEADER_MAP.txt" file. This provides fault tolerance with respect to the Global Membership Manager itself.

A UML class diagram showing the relations among the above modules is presented in Figure 9. The next section describes the sensors that execute on each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and performance data of applications executing on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource. Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that are triggered only when data becomes available.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in Section 2.4, the emulator monitors deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod, and the phase and duration of execution within that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Manager.
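The CPU cap implied by the static cyclic schedule is simply the ratio of the partition's execution duration to the hyperperiod. A small sketch of that bound; the numbers used are illustrative, not the deployed configuration:

```python
# Under a static cyclic schedule the partition runs only for its configured
# duration within each hyperperiod, so its CPU utilization can never exceed
# duration / hyperperiod.
def partition_cpu_cap(duration_ms: float, hyperperiod_ms: float) -> float:
    """Fraction of the CPU the partition can consume under the cyclic schedule."""
    if not 0 < duration_ms <= hyperperiod_ms:
        raise ValueError("duration must lie within the hyperperiod")
    return duration_ms / hyperperiod_ms

# e.g. a 50 ms window in a 1000 ms hyperperiod caps the sensors at 5% CPU
cap = partition_cpu_cap(50, 1000)
```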

Sensors are constructed primarily with the following properties:

• Sensor Name: Name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: Name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name | Period | Description
CPU Utilization | 30 seconds | Aggregate utilization of all CPU cores on the machine
Swap Utilization | 30 seconds | Swap space usage on the machine
RAM Utilization | 30 seconds | Memory usage on the machine
Hard Disk Utilization | 30 seconds | Disk usage on the machine
CPU Fan Speed | 30 seconds | Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed | 10 seconds | Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature | 10 seconds | Temperature of the processor on the machine
Motherboard Temperature | 10 seconds | Temperature of the motherboard on the machine
CPU Voltage | 10 seconds | Voltage of the processor on the machine
Motherboard Voltage | 10 seconds | Voltage of the motherboard on the machine
Network Utilization | 10 seconds | Bandwidth utilization of each network card
Network Connection | 30 seconds | Number of TCP connections on the machine
Heartbeat | 30 seconds | Periodic liveness messages
Process Accounting | 30 seconds | Periodic sensor that publishes the commands executed on the system
MPI Process Info | -1 | Aperiodic sensor that reports changes in state of the MPI processes
Web Application Info | -1 | Aperiodic sensor that reports the performance data of a web application

Table 1: List of Monitoring Sensors

• Sensor Period: Periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: Deadline of the sensor measurements, HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware; a SOFT deadline violation results in a warning.

• Sensor Priority: Indicates the priority of scheduling the sensor relative to other processes in the system. In general, normal (base) priority is assigned to the sensor.

• Sensor Dead Band: The sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the sensor's value is unchanged or changing only slightly.

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically according to the value of the sensor period. Sensors support commands for controlling publication: START, STOP, and POLL. The START command causes an already initialized sensor to begin publishing its measurements; the STOP command stops the sensor thread from publishing measurement data; the POLL command retrieves the latest measurement from the sensor. Sensors publish their data under predefined topics in the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The sensors provided by the proposed monitoring approach are listed in Table 1 and are categorized in the following subsections based upon their type and functionality.
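The command handling and dead-band suppression described above can be distilled into a short sketch. This is illustrative Python, not the framework's ARINC-653 sensor code; the class shape is an assumption:

```python
# Sketch of a periodic sensor honoring START/STOP/POLL commands and
# suppressing publications that fall within the dead band. The real sensors
# run as ARINC-653 processes; only the command and dead-band logic is shown.
class Sensor:
    def __init__(self, name: str, period: float, dead_band: float = 0.0):
        self.name = name
        self.period = period          # seconds; -1 marks an aperiodic sensor
        self.dead_band = dead_band
        self.running = False
        self.last_published = None

    def start(self) -> None:          # START command
        self.running = True

    def stop(self) -> None:           # STOP command
        self.running = False

    def poll(self, value: float):     # POLL command: publish unless suppressed
        if not self.running:
            return None
        if (self.last_published is not None
                and abs(value - self.last_published) <= self.dead_band):
            return None               # change too small: suppress the sample
        self.last_published = value
        return (self.name, value)

cpu = Sensor("CPU Utilization", period=30, dead_band=0.5)
cpu.start()
first = cpu.poll(10.0)     # published
second = cpu.poll(10.3)    # within dead band of 10.0: suppressed
third = cpu.poll(11.0)     # change of 1.0 > 0.5: published
```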

System Resource Utilization Monitoring Sensors

A number of sensors are provided to monitor the utilization of system resources. These sensors are periodic (30 seconds), follow SOFT deadlines, run at normal priority, and read system files (e.g., /proc/stat, /proc/meminfo) to collect their measurements. They publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: Monitors the /proc/stat file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: Monitors the /proc/meminfo file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10: Architecture of the MPI Process Sensor

3. Disk Utilization: Executes the "df -P" command and processes its output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: Monitors the /proc/swaps file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: Monitors the /proc/net/dev file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
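To make the /proc-based approach concrete, the CPU utilization computation can be sketched as a delta between two samples of the aggregate "cpu" line of /proc/stat. The helper below is illustrative (the sampled values are synthetic), but the field layout follows the standard /proc/stat format, where the fourth numeric field is idle jiffies:

```python
# The busy share between two /proc/stat samples is the change in non-idle
# jiffies over the change in total jiffies.
def cpu_utilization(prev_line: str, curr_line: str) -> float:
    """Percent CPU busy between two samples of the aggregate 'cpu' line."""
    prev = [int(x) for x in prev_line.split()[1:]]
    curr = [int(x) for x in curr_line.split()[1:]]
    idle_delta = curr[3] - prev[3]              # field 4 is idle jiffies
    total_delta = sum(curr) - sum(prev)
    return 100.0 * (total_delta - idle_delta) / total_delta

# Two synthetic samples: 80 busy jiffies out of 220 total elapsed
prev = "cpu 100 0 50 800 10 0 0 0"
curr = "cpu 160 0 70 940 10 0 0 0"
busy = cpu_utilization(prev, curr)
```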

Hardware Health Monitoring Sensors

In the proposed framework, various sensors monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds), follow SOFT deadlines, run at normal priority, and read their measurements over the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. They publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: Executes the IPMI command "ipmitool -S SDR fan1" to collect the CPU fan speed.

2. CPU Temperature: Executes the "ipmitool -S SDR CPU1 Temp" command to collect the CPU temperature.

3. Motherboard Temperature: Executes the "ipmitool -S SDR Sys Temp" command to collect the motherboard temperature.

4. CPU Voltage: Executes the "ipmitool -S SDR CPU1 Vcore" command to collect the CPU voltage.

5. Motherboard Voltage: Executes the "ipmitool -S SDR VBAT" command to collect the motherboard voltage.
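These sensors must parse the textual output of ipmitool. As a sketch, a typical SDR line of the form "name | reading units | status" could be parsed as below; the sample line and its exact layout are assumptions (ipmitool output varies across vendors and versions), so this is illustrative only:

```python
# Parse one hypothetical ipmitool SDR output line of the form
# "name | reading units | status" into structured fields.
def parse_ipmi_sdr_line(line: str):
    """Return (sensor_name, reading, units, status) from an SDR line."""
    name, reading, status = [field.strip() for field in line.split("|")]
    value, units = reading.split(None, 1)
    return name, float(value), units, status

record = parse_ipmi_sdr_line("fan1             | 3600 RPM          | ok")
```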

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically publishes its own name to the DDS domain under the topic HEARTBEAT to inform the other nodes of its existence in the framework. Local nodes of a region keep track of the state of the other local nodes in that region through these HEARTBEAT messages. If a local node terminates, the leader of the region updates the monitoring database with the node state transition and notifies the other nodes of the termination through the topic NODE_HEALTH_INFO. If a leader node terminates, the other local nodes notify the Global Membership Manager through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The Global Membership Manager elects a new leader for the region and announces it to the local nodes in subsequent messages. Thus, the monitoring database always contains up-to-date data on the various resource utilization and performance features of the infrastructure.
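One way a leader could derive node state transitions from heartbeats is a timeout rule: mark a node FRAMEWORK_DOWN if its last heartbeat is older than a grace window. The sketch below is an assumption about the detection policy (the grace window of two 30-second periods is illustrative), not the framework's documented rule:

```python
# Derive node states from the time of each node's last HEARTBEAT message:
# a node silent for longer than the grace window is marked FRAMEWORK_DOWN.
HEARTBEAT_PERIOD = 30.0   # seconds, per Table 1
GRACE_PERIODS = 2         # illustrative threshold

def node_states(last_heartbeat: dict, now: float) -> dict:
    limit = GRACE_PERIODS * HEARTBEAT_PERIOD
    return {
        node: "UP" if now - seen <= limit else "FRAMEWORK_DOWN"
        for node, seen in last_heartbeat.items()
    }

# ddsnode1 heartbeated 10 s ago; ddsnode2 has been silent for 90 s
states = node_states({"ddsnode1": 100.0, "ddsnode2": 20.0}, now=110.0)
```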

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables of scientific applications executing over multiple nodes and to record this information in the centralized monitoring database. The sensor logs the information whenever a process of the scientific application changes state (Started, Killed, Ended) and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is provided that executes the real scientific application internally as a child process. An MPI run command is issued to execute SciAppManager from the master nodes of the cluster, with the name of the scientific application and its arguments passed as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the prespecified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is depicted in Figure 10.
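The wrapper pattern itself is simple: run the application as a child process and emit a state record on entry and exit. The sketch below is illustrative Python, not SciAppManager's code, and a plain list stands in for the POSIX message queue the framework actually uses:

```python
import subprocess
import sys

# Run an application as a child process and report Started/Ended state
# records; 'queue' is a stand-in for the per-node POSIX message queue.
def run_and_report(argv, queue):
    queue.append(f"Started {argv[0]}")
    child = subprocess.run(argv)
    queue.append(f"Ended {argv[0]} rc={child.returncode}")
    return child.returncode

queue = []
# Use the Python interpreter as a trivially short-lived "application"
rc = run_and_report([sys.executable, "-c", "pass"], queue)
```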

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of a web application executing on the node through the web server's performance logs. A generic structure for the performance logs is defined in the sensor framework; it includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data to a POSIX message queue (different from SciAppManager's) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the prespecified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, the fault-adaptive nature, and the responsiveness to faults of the developed monitoring framework. In these experiments the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes run the same version of the Linux operating system (CentOS 5.6). The Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1–4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once a Local Manager on a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager on each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 show the CPU and RAM utilization of the monitoring framework (Local Manager) on each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even these spikes stay under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent,


Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach, "RFDMon", is extremely low.

Observations from Figures 14 and 15

The transitions of the various nodes between the states UP and FRAMEWORK_DOWN are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed, so its state is updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 are updated to FRAMEWORK_DOWN once their Local Managers are killed at time samples 390 and 410 respectively. The Local Manager on ddshost1 is restarted at time sample 440, so its state returns to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. Initially ddshost1 was the leader of the region; as soon as the Local Manager on ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected leader. Finally, when the Local Manager on ddshost1 was killed once more at time sample 660 (see Figure 14), ddsnode2 was elected leader of the region.

Combining the observations of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the

Figure 14: Transitions in the States of the Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

framework related to the Regional Leader, a new leader is elected instantly, without further delay. This feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework without delay.

Observations from Figure 16

The sensor framework on ddsnode1 was allowed to execute for the complete duration of the experiment (from time sample 80 to the end), and no fault or change was introduced on this node. The primary purpose of running this node continuously was to observe the impact of faults elsewhere in the framework on its monitoring capabilities. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without interruption, even in the case of faults in the region (leader re-election and nodes leaving the framework). Figure 16 presents the CPU utilization of ddsnode1 as recorded in the centralized database during the experiment, i.e., the data collected through the CPU monitoring sensor on ddsnode1 and reported by the Regional Leader. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), at most one or two data samples are missing from the


Figure 16: CPU Utilization at Node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have only a minimal impact on its monitoring functionality.

8 Benefits of the approach

"RFDMon" is comprehensive in the types of parameters it monitors: system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon the data-centric publish-subscribe mechanism, which is extremely scalable in itself. New sensors can also be easily added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself arising from partial network outages (e.g., if a Regional Leader node stops working), and it can self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system. The framework can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption of the sensors can be bounded by applying ARINC-653 scheduling policies, as described in the previous sections. Additionally, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory or data structures of applications executing on the node. The framework also has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL, for monitoring their scientific clusters.

10 Future Work

"RFDMon" can be easily utilized to monitor a distributed system, and owing to its flexible nature it can readily be combined with a performance management system. New sensors can be started at any time while the monitoring framework is running, and new publishers or subscribers can join the framework to publish new monitoring data or to analyse the current monitoring data. The framework can be applied to a wide variety of performance monitoring and management problems. It helps in visualizing resource utilization, hardware health, and process states on the computing nodes, so an administrator can easily locate faults in the system and identify their possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management respectively.

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework can efficiently and accurately monitor system variables (CPU utilization,


memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by the Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: Required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1–11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1–9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35–39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231–236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603–618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84–101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.

20

Figure 11 Life Cycle of a CPU Utilization Sensor

21

  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
      • Issues in Distributed System Monitoring
      • Related Work
      • Architecture of the framework
        • Sensors
        • Topics
        • Topic Managers
        • Region
        • Local Manager
        • Leader of the Region
        • Global Membership Manager
          • Sensor Implementation
          • Experiments
          • Benefits of the approach
          • Application of the Framework
          • Future Work
          • Conclusion
          • Acknowledgement

Figure 9 Class Diagram of the Sensor Framework

wrapper executable "GlobalMembershipManagerAPP" as a child process. "GlobalMembershipManagerAPP" keeps track of the execution state of the Global Membership Manager and starts a fresh instance of the Global Membership Manager if the previous instance terminates due to an error. The new instance of the Global Membership Manager receives its data from the "REGIONAL_LEADER_MAP.txt" file. This makes the framework fault tolerant with respect to the Global Membership Manager.

A UML class diagram showing the relationships among the above modules is given in Figure 9. The next section describes the details of each sensor executing at each local node.

6 Sensor Implementation

The proposed monitoring framework implements various software sensors to monitor system resources, network resources, node states, MPI- and PBS-related information, and the performance data of applications executing on the node. These sensors can be periodic or aperiodic, depending upon the implementation and the type of resource.

Periodic sensors are implemented for system resources and performance data, while aperiodic sensors are used for MPI process state and other system events that are triggered only when data becomes available.

These sensors are executed as ARINC-653 processes on top of the ARINC-653 emulator developed at Vanderbilt [2]. All sensors on a node are deployed in a single ARINC-653 partition on top of the emulator. As discussed in Section 2.4, the emulator monitors deadlines and schedules the sensors such that their periodicity is maintained. Furthermore, the emulator performs static cyclic scheduling of the ARINC-653 partition of the Local Manager. The schedule is specified in terms of a hyperperiod and the phase and duration of execution within that hyperperiod. Effectively, this limits the maximum CPU utilization of the Local Managers.
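The hyperperiod scheduling idea above can be made concrete with a small sketch; the partition names, window lengths, and hyperperiod below are illustrative assumptions, not values from the implementation:

```python
# Sketch of ARINC-653-style static cyclic scheduling: each partition gets a
# fixed window (phase, duration) inside a repeating hyperperiod, which caps
# the CPU share a partition (e.g., the Local Manager) can consume.
# All names and numbers here are illustrative assumptions.

def build_schedule(hyperperiod, windows):
    """windows: list of (partition, phase, duration); reject overlaps/overrun."""
    spans = sorted((phase, phase + dur, name) for name, phase, dur in windows)
    for (s1, e1, n1), (s2, e2, n2) in zip(spans, spans[1:]):
        if s2 < e1:
            raise ValueError(f"partitions {n1} and {n2} overlap")
    if spans and spans[-1][1] > hyperperiod:
        raise ValueError("schedule exceeds hyperperiod")
    return spans

def cpu_share(hyperperiod, windows, partition):
    """Maximum CPU fraction a partition can use under the schedule."""
    return sum(d for n, _, d in windows if n == partition) / hyperperiod

windows = [("LocalManager", 0, 100), ("SensorPartition", 100, 300)]
build_schedule(1000, windows)                     # raises on overlap/overrun
print(cpu_share(1000, windows, "LocalManager"))   # -> 0.1
```

Bounding the Local Manager's window this way is what gives the "maximum cap on resources consumed by monitoring processes" described in the abstract.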

Sensors are constructed primarily with the following properties:

• Sensor Name: name of the sensor (e.g., UtilizationAggregatecpuScalar).

• Sensor Source Device: name of the device to monitor for the measurements (e.g., "/proc/stat").

Sensor Name               Period       Description
CPU Utilization           30 seconds   Aggregate utilization of all CPU cores on the machine
Swap Utilization          30 seconds   Swap space usage on the machine
RAM Utilization           30 seconds   Memory usage on the machine
Hard Disk Utilization     30 seconds   Disk usage on the machine
CPU Fan Speed             30 seconds   Speed of the CPU fan that helps keep the processor cool
Motherboard Fan Speed     10 seconds   Speed of the motherboard fan that helps keep the motherboard cool
CPU Temperature           10 seconds   Temperature of the processor on the machine
Motherboard Temperature   10 seconds   Temperature of the motherboard on the machine
CPU Voltage               10 seconds   Voltage of the processor on the machine
Motherboard Voltage       10 seconds   Voltage of the motherboard on the machine
Network Utilization       10 seconds   Bandwidth utilization of each network card
Network Connection        30 seconds   Number of TCP connections on the machine
Heartbeat                 30 seconds   Periodic liveness messages
Process Accounting        30 seconds   Periodic sensor that publishes the commands executed on the system
MPI Process Info          -1           Aperiodic sensor that reports changes in the state of MPI processes
Web Application Info      -1           Aperiodic sensor that reports the performance data of web applications

Table 1 List of Monitoring Sensors

• Sensor Period: periodicity of the sensor (e.g., 10 seconds for periodic sensors and -1 for aperiodic sensors).

• Sensor Deadline: deadline of the sensor measurements, either HARD (strict) or SOFT (relatively lenient). A sensor has to finish its work within the specified deadline. A HARD deadline violation is an error that requires intervention from the underlying middleware, while a SOFT deadline violation results in a warning.

• Sensor Priority: indicates the scheduling priority of the sensor relative to other processes in the system. In general, the normal (base) priority is assigned to a sensor.

• Sensor Dead Band: the sensor reports a value only if the difference between the current value and the previously recorded value is greater than the specified dead band. This option reduces the number of sensor measurements in the DDS domain when the measured value is unchanged or changes only slightly.
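As an illustration of the dead-band property, the suppression logic can be sketched as follows (a simplified stand-in for the framework's actual filtering, with an assumed threshold):

```python
class DeadBandFilter:
    """Publish a reading only when it moves more than `dead_band` away from
    the last published value (sketch of the Sensor Dead Band property; the
    threshold and usage below are illustrative)."""

    def __init__(self, dead_band):
        self.dead_band = dead_band
        self.last_published = None

    def should_publish(self, value):
        if self.last_published is None or \
           abs(value - self.last_published) > self.dead_band:
            self.last_published = value
            return True
        return False   # suppressed: change is within the dead band

f = DeadBandFilter(dead_band=2.0)
print([f.should_publish(v) for v in [50.0, 51.0, 53.5, 53.0]])
# -> [True, False, True, False]
```

Only the first reading and the jump to 53.5 would reach the DDS domain; the small fluctuations are suppressed, which is exactly how the dead band reduces measurement traffic.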

The life cycle and function of a typical sensor ("CPU Utilization") is described in Figure 11. Periodic sensors publish data periodically, as per the value of the sensor period. Sensors support the following commands for controlling publication: START, STOP, and POLL. The START command directs an already initialized sensor to begin publishing its measurements, the STOP command stops the sensor thread from publishing measurement data, and the POLL command retrieves the latest measurement from the sensor. Sensors publish their data under predefined topics in the DDS domain (e.g., MONITORING_INFO or HEARTBEAT). The proposed monitoring approach is equipped with the sensors listed in Table 1; they are categorized in the following subsections based upon their type and functionality.
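The START/STOP/POLL life cycle can be sketched with a thread-based periodic sensor; a print statement stands in for the DDS writer, and the class and parameter names are assumptions, not the framework's API:

```python
import threading
import time

class PeriodicSensor:
    """Sketch of a periodic sensor supporting START/STOP/POLL
    (a print stand-in replaces the DDS publisher)."""

    def __init__(self, name, period, read_fn):
        self.name, self.period, self.read_fn = name, period, read_fn
        self._stop = threading.Event()
        self._thread = None
        self.last_value = None

    def start(self):                    # START: begin periodic publishing
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self.last_value = self.read_fn()
            print(f"{self.name} -> {self.last_value}")  # stand-in for DDS write
            self._stop.wait(self.period)

    def stop(self):                     # STOP: halt the sensor thread
        self._stop.set()
        if self._thread:
            self._thread.join()

    def poll(self):                     # POLL: latest measurement on demand
        return self.last_value

s = PeriodicSensor("CPU_Utilization", period=0.05, read_fn=lambda: 0.42)
s.start(); time.sleep(0.12); s.stop()
print(s.poll())  # -> 0.42
```

In the real framework the periodic loop is driven by the ARINC-653 emulator's scheduler rather than a free-running thread, so deadline violations can be detected by the middleware.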

System Resource Utilization Monitoring Sensors

A number of sensors are developed to monitor the utilization of system resources. These sensors are periodic (30 seconds), follow SOFT deadlines, run at normal priority, and read system devices (e.g., /proc/stat, /proc/meminfo) to collect the measurements. They publish the measurement data under the MONITORING_INFO topic.

1. CPU Utilization: the CPU utilization sensor monitors the /proc/stat file on the Linux file system to collect measurements of the CPU utilization of the system.

2. RAM Utilization: the RAM utilization sensor monitors the /proc/meminfo file on the Linux file system to collect measurements of the RAM utilization of the system.


Figure 10 Architecture of MPI Process Sensor

3. Disk Utilization: the disk utilization sensor executes the "df -P" command on the Linux file system and processes its output to collect measurements of the disk utilization of the system.

4. SWAP Utilization: the SWAP utilization sensor monitors the /proc/swaps file on the Linux file system to collect measurements of the SWAP utilization of the system.

5. Network Utilization: the network utilization sensor monitors the /proc/net/dev file on the Linux file system to collect measurements of the network utilization of the system in bytes per second.
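The computation behind the CPU utilization sensor (item 1 above) can be sketched from two consecutive samples of the aggregate "cpu" line of /proc/stat; the field layout follows the Linux proc documentation, and the sample values below are illustrative:

```python
def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    # field order: user nice system idle iowait irq softirq steal ...
    idle = fields[3] + fields[4]      # idle + iowait count as not busy
    return idle, sum(fields)

def cpu_utilization(sample1, sample2):
    """Percent of CPU time spent busy between two /proc/stat samples."""
    idle1, total1 = parse_cpu_line(sample1)
    idle2, total2 = parse_cpu_line(sample2)
    dt = total2 - total1
    return 100.0 * (dt - (idle2 - idle1)) / dt

s1 = "cpu  1000 0 500 8000 200 0 50 0"   # illustrative sample at time t
s2 = "cpu  1100 0 550 8700 250 0 60 0"   # illustrative sample at t + period
print(round(cpu_utilization(s1, s2), 1))  # -> 17.6
```

A real sensor would read /proc/stat once per period and compute the utilization from the delta against the previous read.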

Hardware Health Monitoring Sensors

In the proposed framework, various sensors are developed to monitor the health of hardware components in the system, e.g., motherboard voltage, motherboard fan speed, and CPU fan speed. These sensors are periodic (10 seconds), follow SOFT deadlines, run at normal priority, and read the measurements through the vendor-supplied Intelligent Platform Management Interface (IPMI) [24]. They publish the measurement data under the MONITORING_INFO topic.

1. CPU Fan Speed: the CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" on the Linux system to collect the CPU fan speed measurement.

2. CPU Temperature: the CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the CPU temperature measurement.

3. Motherboard Temperature: the motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the motherboard temperature measurement.

4. CPU Voltage: the CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the CPU voltage measurement.

5. Motherboard Voltage: the motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the motherboard voltage measurement.

Node Health Monitoring Sensors

Each Local Manager (on a node) executes a Heartbeat sensor that periodically publishes its own name to the DDS domain under the topic HEARTBEAT to inform the other nodes of its existence in the framework. The local nodes of a region keep track of the states of the other local nodes in that region through the HEARTBEAT messages transmitted periodically by each local node. If a local node gets terminated, the leader of the region will update the monitoring database with the node state transition values and will notify the other nodes of the termination of the local node through the topic NODE_HEALTH_INFO. If a


leader node gets terminated, the other local nodes will notify the Global Membership Manager of the termination of the leader of their region through a LEADER_DEAD message under the GLOBAL_MEMBERSHIP_INFO topic. The Global Membership Manager will elect a new leader for the region and notify the local nodes of the region in subsequent messages. Thus, the monitoring database always contains up-to-date data on the various resource utilization and performance features of the infrastructure.
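The HEARTBEAT-based liveness tracking described above can be sketched as follows; the timeout value and class name are illustrative assumptions, while the state names mirror the text:

```python
import time

class HeartbeatTracker:
    """Leader-side sketch: mark a node FRAMEWORK_DOWN when its periodic
    HEARTBEAT has not arrived within `timeout` seconds (timeout is an
    assumption, e.g., three missed 30-second beats)."""

    def __init__(self, timeout=90.0):
        self.timeout = timeout
        self.last_seen = {}            # node name -> time of last heartbeat

    def on_heartbeat(self, node, now=None):
        self.last_seen[node] = time.time() if now is None else now

    def node_states(self, now=None):
        now = time.time() if now is None else now
        return {node: ("UP" if now - t <= self.timeout else "FRAMEWORK_DOWN")
                for node, t in self.last_seen.items()}

t = HeartbeatTracker(timeout=90)
t.on_heartbeat("ddsnode1", now=0); t.on_heartbeat("ddsnode2", now=0)
t.on_heartbeat("ddsnode1", now=100)    # ddsnode2 goes silent
print(t.node_states(now=120))
# -> {'ddsnode1': 'UP', 'ddsnode2': 'FRAMEWORK_DOWN'}
```

In the framework, a state transition detected this way is what the Regional Leader writes to the monitoring database and announces under NODE_HEALTH_INFO.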

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes and to update this information in the centralized monitoring database. The sensor logs all the information upon a state change (Started, Killed, Ended) of a process related to a scientific application and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued to execute the SciAppManager application from the master nodes of the cluster; the name of the scientific application and its other arguments are passed as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application into a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to a pre-specified DDS topic structure and publishes it to the DDS domain under the MPI_PROCESS_INFO topic. The functioning of the scientific application sensor is described in Figure 10.
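The wrapper pattern can be sketched as below; a plain Python list stands in for the POSIX message queue, the child command stands in for the real MPI application, and the function name is an assumption:

```python
import subprocess
import sys

def run_sci_app(queue, app_name, argv):
    """Sketch of the SciAppManager wrapper: run the real scientific
    application as a child process and push state transitions into a
    queue (a list stands in for the POSIX message queue the sensor reads)."""
    queue.append(("Started", app_name))
    rc = subprocess.call(argv)
    # On POSIX, a negative return code means the child was killed by a signal.
    queue.append(("Killed" if rc < 0 else "Ended", app_name))
    return rc

mq = []
run_sci_app(mq, "SciAPP", [sys.executable, "-c", "pass"])  # stand-in child
print(mq)  # -> [('Started', 'SciAPP'), ('Ended', 'SciAPP')]
```

In the real deployment the wrapper itself is launched via mpirun as described above, and the sensor on each node drains the message queue and republishes each transition under MPI_PROCESS_INFO.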

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server performance logs. A generic structure for the performance logs is defined in the sensor framework; it includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data into a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens

on that message queue and logs each message as soon as it is received. The sensor formats the data according to a pre-specified DDS topic structure and publishes it to the DDS domain under the WEB_APPLICATION_INFO topic.

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness towards faults of the developed monitoring framework. During these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once the Local Manager at a node is killed, the Regional Leader reports the change in the node state to the centralized database as FRAMEWORK_DOWN. When the Local Manager of the Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even in the case of spikes, utilization stays under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent,


Figure 12 CPU Utilization by the Sensor Framework at Each Node

Figure 13 RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Observations from Figures 14 and 15

The transitions of the various nodes between the states UP and FRAMEWORK_DOWN are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state was updated to FRAMEWORK_DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK_DOWN once their Local Managers were killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 was started again at time sample 440; therefore its state was updated to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. According to the figure, initially ddshost1 was the leader of the region, and as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected as the leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected as the leader of the region.

From combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the

Figure 14 Transition in State of Nodes

Figure 15 Leaders of the Sensor Framework during the Experiment

framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework exhibits that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework instantly, without delay.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute during the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of executing this node continuously was to observe the impact of introducing faults in the framework on the monitoring capabilities of the framework. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults (leader re-election and nodes leaving the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 as recorded in the centralized database during the experiment. This data reflects the CPU monitoring data collected from ddsnode1 through the CPU monitoring sensor and reported by the Regional Leader. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the


Figure 16 CPU Utilization at node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have minimal impact on the monitoring functionality of the framework.

8 Benefits of the approach

"RFDMon" is comprehensive in the types of parameters it monitors: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon a data-centric publish-subscribe mechanism that is extremely scalable in itself, and new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself due to partial network outages (e.g., if a Regional Leader node stops working). The framework can also self-configure (Start, Stop, and Poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of using this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Also, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework will not corrupt the memory areas or data structures of applications executing on the node. The monitoring framework has very low overhead on the computational resources of the system, as shown in Figures 12 and 13.

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon"

is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage the multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time in the monitoring framework, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data, so the framework can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes; therefore, an administrator can easily find the location of a fault in the system and the possible causes of the fault. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In future work, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization,


memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC Specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: the ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11 Life Cycle of a CPU Utilization Sensor

21

  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
      • Issues in Distributed System Monitoring
      • Related Work
      • Architecture of the framework
        • Sensors
        • Topics
        • Topic Managers
        • Region
        • Local Manager
        • Leader of the Region
        • Global Membership Manager
          • Sensor Implementation
          • Experiments
          • Benefits of the approach
          • Application of the Framework
          • Future Work
          • Conclusion
          • Acknowledgement

Sensor Name Period DescriptionCPU Utilization 30 seconds Aggregate utilization of all CPU cores on the machinesSwap Utilization 30 seconds Swap space usage on the machinesRam Utilization 30 seconds Memory usage on the machines

Hard Disk Utilization 30 seconds Disk usage on the machineCPU Fan Speed 30 seconds Speed of CPU fan that helps keep the processor cool

Motherboard Fan Speed 10 seconds Speed of motherboard fan that helps keep the motherboard coolCPU Temperature 10 seconds Temperature of the processor on the machines

Motherboard Temperature 10 seconds Temperature of the motherboard on the machinesCPU Voltage 10 seconds Voltage of the processor on the machines

Motherboard Voltage 10 seconds Voltage of the motherboard on the machinesNetwork Utilization 10 seconds Bandwidth utilization of each network cardNetwork Connection 30 seconds Number of TCP connections on the machines

Heartbeat 30 seconds Periodic liveness messagesProcess Accounting 30 seconds Periodic sensor that publishes the commands executed on the systemMPI Process Info -1 Aperiodic sensor that reports the change in state of the MPI Processes

Web Application Info -1 Aperiodic sensor that reports the performance data of Web Application

Table 1 List of Monitoring Sensors

itor for the measurements (eg ldquoprocstatrdquo)

bull Sensor Period Periodicity of the sensor (eg 10seconds for periodic sensors and minus1 for aperiodicsensors)

bull Sensor Deadline Deadline of the sensor measure-ments HARD (strict) or SOFT (relatively lenient)A sensor has to finish its work within a specifieddeadline A HARD deadline violation is an errorthat requires intervention from the underlying mid-dleware A SOFT deadline violation results in awarning

bull Sensor Priority Sensor priority indicates the pri-ority of scheduling the sensor over other processesin to the system In general normal (base) priorityis assigned to the sensor

bull Sensor Dead Band Sensor reports the value onlyif the difference between current value and previousrecorded value becomes greater than the specifiedsensor dead band This option reduces the numberof sensor measurements in the DDS domain if valueof sensor measurement is unchanged or changingslightly

Life cycle and function of a typical sensor (ldquoCPU uti-lizationrdquo) is described in figure 11 Periodic sensors pub-lish data periodically as per the value of sensor periodSensors support four types of commands for publish-ing the data START STOP and POLL START com-

mand starts the already initialized sensor to start publish-ing the sensor measurements STOP command stops thesensor thread to stop publishing the measurement dataPOLL command tries to get the latest measurement fromthe sensor These sensors publish the data as per thepredefined topic to the DDS domain (eg MONITOR-ING INFO or HEARTBEAT) The proposed monitoringapproach is equipped with various sensors that are listedin table 1 These sensors are categorized in followingsubsections based upon their type and functionality

System Resource Utilization Monitoring Sensors

A numbers of sensors are developed to monitor the uti-lization of system resources These sensors are periodic(30 seconds) in nature follow SOFT deadlines containsnormal priority and read system devices (eg procstateprocmeminfo etc) to collect the measurements Thesesensors publish the measurement data under MONITOR-ING INFO topic

1 CPU Utilization CPU utilization sensor monitorsldquoprocstatrdquo file on linux file system to collect themeasurements for CPU utilization of the system

2 RAM Utilization RAM utilization sensor mon-itors ldquoprocmeminfordquo file on linux file system tocollect the measurements for RAM utilization of thesystem

14

Figure 10 Architecture of MPI Process Sensor

3 Disk Utilization Disk utilization sensor moni-tors executes ldquodf -Prdquo command over linux file sys-tem and processes the output to collect the measure-ments for disk utilization of the system

4 SWAP Utilization SWAP utilization sensor moni-tors ldquoprocswapsrdquo file on linux file system to collectthe measurements for SWAP utilization of the sys-tem

5 Network Utilization Network utilization sensormonitors ldquoprocnetdevrdquo file on linux file system tocollect the measurements for Network utilization ofthe system in bytes per second

Hardware Health Monitoring Sensors

In proposed framework various sensors are developedto monitor health of hardware components in the sys-tem eg Motherboard voltage Motherboard fan speedCPU fan speed etc These sensors are periodic (10 sec-onds) in nature follow soft deadlines contains normalpriority and read the measurements over vendor suppliedIntelligent Platform Management Interface (IPMI) inter-face [24] These sensors publish the measurement dataunder MONITORING INFO topic

1. CPU Fan Speed: The CPU fan speed sensor executes the IPMI command "ipmitool -S SDR fan1" to collect the CPU fan speed measurement.

2. CPU Temperature: The CPU temperature sensor executes the "ipmitool -S SDR CPU1 Temp" command to collect the CPU temperature measurement.

3. Motherboard Temperature: The motherboard temperature sensor executes the "ipmitool -S SDR Sys Temp" command to collect the motherboard temperature measurement.

4. CPU Voltage: The CPU voltage sensor executes the "ipmitool -S SDR CPU1 Vcore" command to collect the CPU voltage measurement.

5. Motherboard Voltage: The motherboard voltage sensor executes the "ipmitool -S SDR VBAT" command to collect the motherboard voltage measurement.
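A hardware-health sensor of this kind reduces to a thin wrapper that runs the quoted ipmitool command and parses its sensor-data-record output. The exact output format varies by BMC, so the '|'-separated layout assumed below (e.g. `fan1 | 7110 RPM | ok`) is illustrative only:

```python
import subprocess


def read_ipmi_sensor(name, sdr_cache="SDR"):
    """Run the ipmitool command form used by the sensors,
    e.g. 'ipmitool -S SDR fan1', and parse the result."""
    out = subprocess.run(["ipmitool", "-S", sdr_cache, name],
                         capture_output=True, text=True, check=True).stdout
    return parse_sdr_line(out)


def parse_sdr_line(line):
    """Extract (value, unit) from a '|'-separated SDR line such as
    'fan1 | 7110 RPM | ok'.  Assumed layout, for illustration only."""
    fields = [f.strip() for f in line.split("|")]
    value, unit = fields[1].split(None, 1)
    return float(value), unit
```

`read_ipmi_sensor("CPU1 Temp")` would then yield something like `(48.0, "degrees C")` on a host with a working BMC.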

Node Health Monitoring Sensors

Each Local Manager (one per node) executes a Heartbeat sensor that periodically publishes its own name to the DDS domain under the topic "HEARTBEAT" to inform the other nodes of its presence in the framework. The local nodes of a region track the state of the other local nodes in that region through these HEARTBEAT messages, which each local node transmits periodically. If a local node terminates, the leader of the region updates the monitoring database with the node state transition values and notifies the other nodes of the termination through the topic "NODE HEALTH INFO". If a


leader node terminates, the other local nodes notify the global membership manager of the leader's termination through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes of the region in subsequent messages. Thus, the monitoring database always contains up-to-date data on the resource utilization and performance characteristics of the infrastructure.
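The failure-detection logic described here — every node publishes HEARTBEAT periodically, and a peer is declared down after going silent too long — can be sketched as follows. The timeout policy (three missed periods) and the names are ours, not the framework's:

```python
import time


class HeartbeatTracker:
    """Declares a peer dead after `missed` heartbeat periods of silence."""

    def __init__(self, period_s=1.0, missed=3, clock=time.monotonic):
        self.timeout_s = period_s * missed
        self.clock = clock          # injectable for testing
        self.last_seen = {}

    def on_heartbeat(self, node_name):
        # Called whenever a HEARTBEAT sample arrives for node_name.
        self.last_seen[node_name] = self.clock()

    def dead_nodes(self):
        # Peers the leader should report as FRAMEWORK DOWN
        # (via the NODE HEALTH INFO topic in the actual framework).
        now = self.clock()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_s]
```

The Regional Leader would run one tracker over all HEARTBEAT samples of its region and push a state transition to the database for each name returned by `dead_nodes()`.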

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables of scientific applications executing over multiple nodes and to record the same information in the centralized monitoring database. The sensor logs the information whenever the state of a process belonging to a scientific application changes (Started, Killed, Ended) and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. An MPI run command should be issued from the master node of the cluster to execute the SciAppManager application; the name of the scientific application and its other arguments are passed along as arguments (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is depicted in Figure 10.
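The wrapper's job — run the real application as a child process and report Started/Killed/Ended transitions to a per-node queue — can be sketched as below. A plain callback substitutes for the POSIX message queue the framework actually uses, and the names are illustrative:

```python
import subprocess


def run_and_report(cmd, report):
    """Run `cmd` as a child process, reporting state changes via `report`.

    In the framework, `report` would enqueue a message on the per-node
    POSIX message queue that the scientific application sensor drains.
    """
    report(("Started", cmd[0]))
    proc = subprocess.Popen(cmd)
    rc = proc.wait()
    # Negative return codes on POSIX mean the child was killed by a signal.
    report(("Killed" if rc < 0 else "Ended", cmd[0]))
    return rc
```

Under mpirun, each rank would execute this wrapper around the real scientific application, so every node's sensor sees the state transitions of its local process.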

Web Application Performance Monitoring Sensor

This sensor tracks the performance behaviour of the web application executing on the node through the web server's performance logs. A generic structure for performance logs is defined in the sensor framework; it includes the average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data in a POSIX message queue (distinct from the SciAppManager queue) that exists on each node. The web application performance monitoring sensor listens

on that message queue and logs each message as soon as it is received. The sensor formats the data according to the pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.
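The generic performance-log structure mentioned above can be captured as a small record type. The field names mirror the metrics listed in the text, but the type itself is our sketch, not the framework's DDS topic definition:

```python
from dataclasses import dataclass, asdict


@dataclass
class WebPerfSample:
    """One sample of the generic web-server performance log."""
    avg_response_time_ms: float
    heap_memory_used_mb: float
    java_thread_count: int
    pending_requests: int

    def to_topic_dict(self):
        # Shape handed to the DDS writer for WEB APPLICATION INFO.
        return asdict(self)
```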

7 Experiments

A set of experiments was performed to exhibit the system resource overhead, fault-adaptive nature, and fault responsiveness of the developed monitoring framework. In these experiments, the monitoring framework was deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) consisting of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes run a similar version of the Linux operating system (CentOS 5.6). A Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one at random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Whenever the Local Manager on a node is killed, the Regional Leader reports the change in that node's state to the centralized database as FRAMEWORK DOWN. When the Local Manager of the Regional Leader itself is killed, the monitoring framework elects a new leader for that region. During the experiment, the CPU and RAM consumption of the Local Manager on each node is measured through the "top" system command, and the various node state transitions are reported to the centralized database.

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 show the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. Figure 12 makes clear that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes, and that even these spikes stay under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent,


Figure 12 CPU Utilization by the Sensor Framework at Each Node

Figure 13 RAM Utilization by the Sensor Framework at Each Node

which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach, "RFDMon", is extremely low.

Observations from Figures 14 and 15

The transitions of the various nodes between the UP and FRAMEWORK DOWN states are presented in Figure 14. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed, so its state was updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 were updated to FRAMEWORK DOWN when their Local Managers were killed at time samples 390 and 410 respectively. The Local Manager at ddshost1 was restarted at time sample 440, so its state was updated to UP at that time. Figure 15 shows which nodes were Regional Leaders during the experiment. Initially, ddshost1 was the leader of the region; as soon as the Local Manager at ddshost1 was killed at time sample 310 (see Figure 14), ddsnode4 was elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 was killed at time sample 520 (see Figure 14), ddshost1 was again elected leader of the region. Finally, when the Local Manager at ddshost1 was killed again at time sample 660 (see Figure 14), ddsnode2 was elected leader of the region.

Combined observation of Figures 14 and 15 makes clearly evident that as soon as there is a fault in the

Figure 14 Transition in State of Nodes

Figure 15 Leaders of the Sensor Framework during the Experiment

framework affecting the Regional Leader, a new leader is elected immediately, without further delay. This feature of the developed monitoring framework shows that the proposed approach is robust with respect to failures of the Regional Leader and can adapt to faults in the framework instantly.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced at this node. The primary purpose of running this node continuously was to observe the impact of the faults introduced elsewhere in the framework on its monitoring capabilities. In the ideal scenario, all of the monitoring data of ddsnode1 should be reported to the centralized database without interruption, even in the presence of faults (leader re-election and nodes leaving the framework) in the region. Figure 16 presents the CPU utilization of ddsnode1 as recorded in the centralized database during the experiment; this is the CPU monitoring data collected by the CPU monitoring sensor at ddsnode1 and reported by the Regional Leader. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the regional leader re-elections at time samples 310 and 520 (see Figure 15), only one or two (at most) data samples are missing from the


Figure 16 CPU Utilization at node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have minimal impact on its monitoring functionality.

8 Benefits of the Approach

"RFDMon" is comprehensive in the types of parameters it monitors: system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based on the data-centric publish-subscribe mechanism, which is extremely scalable in itself. In addition, new sensors can easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself caused by partial network outages (e.g., when a Regional Leader node stops working). The framework can also self-configure the sensors (Start, Stop, and Poll) according to the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption of the sensors can be capped by applying ARINC-653 scheduling policies, as described in the previous sections. Moreover, owing to the spatial isolation features of the ARINC-653 emulation, the monitoring framework cannot corrupt the memory or data structures of applications executing on the node. The monitoring framework imposes a very low overhead on the computational resources of the system, as shown in Figures 12 and 13.
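The resource-capping idea — give the monitoring partition a fixed CPU budget per repeating major frame, in the spirit of ARINC-653 scheduling — can be illustrated with a simple budget accountant. This is a toy model of the policy, not the emulation library's scheduler:

```python
class PartitionBudget:
    """Tracks CPU time used by a partition within a repeating major frame."""

    def __init__(self, frame_s, budget_s):
        self.frame_s = frame_s    # major frame length
        self.budget_s = budget_s  # CPU cap for the monitoring partition
        self.used_s = 0.0

    def charge(self, cpu_s):
        """Record cpu_s seconds of sensor work; return False if over budget."""
        if self.used_s + cpu_s > self.budget_s:
            return False          # sensor run must be deferred to next frame
        self.used_s += cpu_s
        return True

    def new_frame(self):
        self.used_s = 0.0         # budget replenishes each major frame
```

With, say, a 1-second frame and a 0.1-second budget, sensor work is bounded to at most ten percent of the CPU regardless of how many sensors are registered.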

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon"

is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides several key benefits that make it a strong candidate for integration with other autonomic performance management systems. Currently, "RFDMon" is used at Fermilab, Batavia, IL, to monitor their scientific clusters.

10 Future Work

"RFDMon" can easily be used to monitor a distributed system and, owing to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time while the monitoring framework is running, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. The framework can be applied to a wide variety of performance monitoring and management problems. It helps visualize resource utilization, hardware health, and process state on the computing nodes, so an administrator can easily locate a fault in the system and its possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In future work, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based on the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization,


memory utilization, disk utilization, network utilization), system health (temperature and voltage of the motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press, accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011]

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011]

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. In IEEE International Workshop on Engineering of Autonomic and Autonomous Systems, pages 1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011]

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011]

[13] Nagios. http://www.nagios.org [Sep 2011]

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011]

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011]

[16] Zabbix: the ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011]

[17] OpenNMS. http://www.opennms.org [Sep 2011]

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang (ERLANG '05), pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. IEEE Transactions on Parallel and Distributed Systems, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011]

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010]

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011]

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11 Life Cycle of a CPU Utilization Sensor




[6] Bert Farabaugh Gerardo Pardo-Castellote andRick Warren An introduction to dds anddata-centric communications 2005 httpwwwomgorgnewswhitepapersIntro_To_DDSpdf

[7] A Goldberg and G Horvath Software fault protec-tion with arinc 653 In Aerospace Conference 2007IEEE pages 1 ndash11 march 2007

[8] Ruby httpwwwruby-langorgen[Sep2011]

[9] Rajat Mehrotra Abhishek Dubey Sherif Abdelwa-hed and Weston Monceaux Large scale monitor-ing and online analysis in a distributed virtualizedenvironment Engineering of Autonomic and Au-tonomous Systems IEEE International Workshopon 01ndash9 2011

[10] Lorenzo Falai Observing Monitoring and Evalu-ating Distributed Systems PhD thesis Universitadegli Studi di Firenze December 2007

[11] Monitoring in distributed systems httpwwwansacoukANSATech94Primary100801pdf[Sep2011]

[12] Ganglia monitoring system httpgangliasourceforgenet[Sep2011]

[13] Nagios httpwwwnagiosorg[Sep2011]

[14] Zenoss the cloud management company httpwwwzenosscom[Sep2011]

[15] Nimsoft unified manager httpwwwnimsoftcomsolutionsnimsoft-unified-manager[Nov2011]

[16] Zabbix The ultimate open source monitor-ing solution httpwwwzabbixcom[Nov2011]

[17] Opennms httpwwwopennmsorg[Sep2011]

[18] Abhishek Dubey et al Compensating for timing jit-ter in computing systems with general-purpose op-erating systems In ISROC Tokyo Japan 2009

[19] Hans Svensson and Thomas Arts A new leaderelection implementation In Proceedings of the2005 ACM SIGPLAN workshop on Erlang ER-LANG rsquo05 pages 35ndash39 New York NY USA2005 ACM

[20] G Singh Leader election in the presence of linkfailures Parallel and Distributed Systems IEEETransactions on 7(3)231 ndash236 mar 1996

19

[21] Scott Stoller Dept and Scott D Stoller Leaderelection in distributed systems with crash failuresTechnical report 1997

[22] Christof Fetzer and Flaviu Cristian A highly avail-able local leader election service IEEE Transac-tions on Software Engineering 25603ndash618 1999

[23] E Korach S Kutten and S Moran A modu-lar technique for the design of efficient distributedleader finding algorithms ACM Transactions onProgramming Languages and Systems 1284ndash1011990

[24] Intelligent platform management interface(ipmi) httpwwwintelcomdesignserversipmi[Sep2011]

[25] Pan Pan Abhishek Dubey and Luciano PiccoliDynamic workflow management and monitoringusing dds In 7th IEEE International Workshop onEngineering of Autonomic amp Autonomous Systems(EASe) 2010 under Review

[26] Daytrader httpcwikiapacheorgGMOxDOC20daytraderhtml[Nov2010]

[27] Websphere application server community editionhttpwww-01ibmcomsoftwarewebserversappservcommunity[Oct2011]

[28] Rajat Mehrotra Abhishek Dubey Sherif Abdelwa-hed and Asser Tantawi A Power-aware Modelingand Autonomic Management Framework for Dis-tributed Computing Systems CRC Press 2011

20

Figure 11 Life Cycle of a CPU Utilization Sensor

21


If the leader node gets terminated, the other local nodes notify the global membership manager of the termination of the leader for their region through a LEADER DEAD message under the GLOBAL MEMBERSHIP INFO topic. The global membership manager then elects a new leader for the region and notifies the local nodes in the region in subsequent messages. Thus, the monitoring database always contains up-to-date data regarding the various resource utilization and performance features of the infrastructure.
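The failover step described above can be sketched as follows. The topic and message names mirror the text (GLOBAL MEMBERSHIP INFO, LEADER DEAD); the membership bookkeeping and the rule for choosing the new leader (oldest surviving member) are illustrative assumptions, not the framework's actual algorithm.

```python
# Sketch of the global membership manager's reaction to a LEADER_DEAD
# notification. Topic and message names follow the text; the data
# structures and the "oldest member wins" election rule are assumptions.
class GlobalMembershipManager:
    def __init__(self):
        self.regions = {}   # region name -> node ids, oldest first
        self.leaders = {}   # region name -> current leader node id

    def on_message(self, topic, msg):
        if topic == "GLOBAL_MEMBERSHIP_INFO" and msg["kind"] == "LEADER_DEAD":
            self.handle_leader_dead(msg["region"], msg["leader"])

    def handle_leader_dead(self, region, dead_leader):
        # Drop the dead leader from the region's membership list.
        members = [n for n in self.regions.get(region, []) if n != dead_leader]
        self.regions[region] = members
        if members:
            # Elect a new leader and announce it in a subsequent message.
            self.leaders[region] = members[0]
            self.publish("GLOBAL_MEMBERSHIP_INFO",
                         {"kind": "NEW_LEADER", "region": region,
                          "leader": members[0]})

    def publish(self, topic, msg):
        print(topic, msg)  # stand-in for a DDS write
```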

Scientific Application Health Monitoring Sensor

The primary purpose of this sensor is to monitor the state and information variables related to scientific applications executing over multiple nodes, and to update the same information in the centralized monitoring database. The sensor logs all the information whenever a process related to a scientific application changes state (Started, Killed, Ended) and reports the data to the administrator. In the proposed framework, a wrapper application (SciAppManager) is developed that executes the real scientific application internally as a child process. The mpirun command should be issued from the master nodes in the cluster to execute this SciAppManager application; the name of the scientific application and its other arguments are passed as arguments to SciAppManager (e.g., mpirun -np N -machinefile workerconfig SciAppManager SciAPP arg1 arg2 ... argN). SciAppManager writes the state information of the scientific application to a POSIX message queue that exists on each node. The scientific application sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to a pre-specified DDS topic structure and publishes it to the DDS domain under the MPI PROCESS INFO topic. The functioning of the scientific application sensor is described in Figure 10.
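As a rough sketch, the state-change records that SciAppManager writes to the per-node POSIX message queue, and that the scientific application sensor parses before publishing under the MPI PROCESS INFO topic, might be encoded as below. The JSON field layout is an assumption; the paper specifies only the state names (Started, Killed, Ended).

```python
import json
import time

# States named in the text; the record format itself is hypothetical.
VALID_STATES = {"Started", "Killed", "Ended"}

def encode_state_change(app_name, pid, state):
    """Serialize one state-change event for the POSIX message queue."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    return json.dumps({"app": app_name, "pid": pid,
                       "state": state, "ts": time.time()})

def decode_state_change(raw):
    """Parse a queue message back into the fields the sensor publishes."""
    msg = json.loads(raw)
    if msg["state"] not in VALID_STATES:
        raise ValueError(f"unknown state: {msg['state']}")
    return msg["app"], msg["pid"], msg["state"]
```

On the sensor side, each decoded record would then be mapped onto the pre-specified DDS topic structure before being written to the domain.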

Web Application Performance Monitoring Sensor

This sensor keeps track of the performance behaviour of the web application executing on the node through the web server's performance logs. A generic structure for performance logs is defined in the sensor framework that includes average response time, heap memory usage, number of Java threads, and pending requests inside the web server. In the proposed framework, a web application logs its performance data to a POSIX message queue (different from that of SciAppManager) that exists on each node. The web application performance monitoring sensor listens on that message queue and logs each message as soon as it is received. The sensor formats the data according to a pre-specified DDS topic structure and publishes it to the DDS domain under the WEB APPLICATION INFO topic.
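The generic performance-log structure described above could be modeled as a record like the following; the field names and units are assumptions chosen to match the four quantities the text lists.

```python
from dataclasses import dataclass, asdict

@dataclass
class WebAppPerfSample:
    """One sample of the generic web-server performance log."""
    avg_response_time_ms: float   # average response time
    heap_memory_bytes: int        # heap memory usage
    java_thread_count: int        # number of Java threads
    pending_requests: int         # pending requests in the web server

def to_topic_payload(sample: WebAppPerfSample) -> dict:
    # Stand-in for formatting the pre-specified DDS topic structure
    # published under WEB_APPLICATION_INFO.
    return asdict(sample)
```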

7 Experiments

A set of experiments has been performed to exhibit the system resource overhead, fault-adaptive nature, and responsiveness to faults of the developed monitoring framework. In these experiments, the monitoring framework is deployed in a Linux environment (kernel 2.6.18-274.7.1.el5xen) that consists of five nodes (ddshost1, ddsnode1, ddsnode2, ddsnode3, and ddsnode4). All of these nodes execute a similar version of the Linux operating system (CentOS 5.6). The Ruby on Rails based web service is hosted on the ddshost1 node. In the next subsections, one of these experiments is presented in detail.

Details of the Experiment

In this experiment, all of the nodes (ddshost1 and ddsnode1-4) are started one by one with random time intervals. Once all the nodes are executing the monitoring framework, the Local Manager on a few nodes is killed through the kill system call. Once a Local Manager at a node is killed, the Regional Leader reports the change in node state to the centralized database as FRAMEWORK DOWN. When the Local Manager of a Regional Leader is killed, the monitoring framework elects a new leader for that region. During this experiment, the CPU and RAM consumption of the Local Manager at each node is monitored through the "TOP" system command, and the various node state transitions are reported to the centralized database.
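The per-node measurement step can be approximated as follows. The experiment used the "TOP" command; this sketch substitutes a single "ps" query per sample, and the process name is hypothetical.

```python
import subprocess

def sample_usage(process_name="LocalManager"):
    """Return (%CPU, %MEM) for the first process matching process_name,
    or None if no such process is running (e.g. after it was killed)."""
    out = subprocess.run(
        ["ps", "-C", process_name, "-o", "%cpu=,%mem="],
        capture_output=True, text=True).stdout.strip()
    if not out:
        return None
    cpu, mem = out.split()[:2]
    return float(cpu), float(mem)
```

Calling this periodically and logging the tuples yields the time series plotted in Figures 12 and 13.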

Observations of the Experiment

Results from the experiment are plotted as time series and presented in Figures 12, 13, 14, 15, and 16.

Observations from Figures 12 and 13

Figures 12 and 13 describe the CPU and RAM utilization of the monitoring framework (Local Manager) at each node during the experiment. It is clear from Figure 12 that CPU utilization is mostly in the range of 0 to 1 percent, with occasional spikes; even these spikes stay under ten percent. Similarly, according to Figure 13, RAM utilization by the monitoring framework is less than two percent, which is very small. These results clearly indicate that the overall resource overhead of the developed monitoring approach "RFDMon" is extremely low.

Figure 12: CPU Utilization by the Sensor Framework at Each Node

Figure 13: RAM Utilization by the Sensor Framework at Each Node

Observations from Figures 14 and 15

Figure 14 presents the transitions of the various nodes between the UP and FRAMEWORK DOWN states. According to the figure, ddshost1 is started first, followed by ddsnode1, ddsnode2, ddsnode3, and ddsnode4. At time sample 310 (approximately), the Local Manager of host ddshost1 was killed; therefore its state is updated to FRAMEWORK DOWN. Similarly, the states of ddsnode2 and ddsnode3 are updated to FRAMEWORK DOWN once their Local Managers are killed at time samples 390 and 410, respectively. The Local Manager at ddshost1 is restarted at time sample 440; therefore its state is updated to UP at the same time. Figure 15 presents the nodes that were Regional Leaders during the experiment. According to the figure, ddshost1 was initially the leader of the region; as soon as the Local Manager at ddshost1 is killed at time sample 310 (see Figure 14), ddsnode4 is elected as the new leader of the region. Similarly, when the Local Manager of ddsnode4 is killed at time sample 520 (see Figure 14), ddshost1 is again elected as the leader of the region. Finally, when the Local Manager at ddshost1 is killed again at time sample 660 (see Figure 14), ddsnode2 is elected as the leader of the region.

Figure 14: Transition in State of Nodes

Figure 15: Leaders of the Sensor Framework during the Experiment

From the combined observation of Figures 14 and 15, it is clearly evident that as soon as there is a fault in the framework related to the Regional Leader, a new leader is elected instantly, without any further delay. This specific feature of the developed monitoring framework shows that the proposed approach is robust with respect to failure of the Regional Leader and can adapt to faults in the framework without delay.

Observations from Figure 16

The sensor framework at ddsnode1 was allowed to execute for the complete duration of the experiment (from sample time 80 to the end), and no fault or change was introduced on this node. The primary purpose of executing this node continuously was to observe the impact of the faults introduced in the framework on the monitoring capabilities of the framework. In the ideal scenario, all the monitoring data of ddsnode1 should be reported to the centralized database without any interruption, even in the case of faults in the region (leader re-election and nodes leaving the framework). Figure 16 presents the CPU utilization of ddsnode1 as recorded in the centralized database during the experiment; this reflects the data collected through the CPU monitoring sensor on ddsnode1 and reported by the Regional Leader. According to Figure 16, monitoring data from ddsnode1 was collected successfully during the entire experiment. Even in the case of the Regional Leader re-elections at time samples 310 and 520 (see Figure 15), only one or two data samples at most are missing from the database. Hence, it is evident that faults in the framework have minimal impact on its monitoring functionality.

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

8 Benefits of the Approach

"RFDMon" is comprehensive in the types of monitoring parameters it supports: it can monitor system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based on the data-centric publish-subscribe mechanism, which is extremely scalable in itself. New sensors can also easily be added to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself caused by partial network outages (e.g., if a Regional Leader node stops working), and it can self-configure (start, stop, and poll) the sensors as per the requirements of the system. The framework can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption of the sensors can be limited by applying ARINC-653 scheduling policies, as described in previous sections. Furthermore, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework cannot corrupt the memory areas or data structures of applications executing on the node. As shown in Figures 12 and 13, the monitoring framework imposes very low overhead on the computational resources of the system.
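The resource-capping argument can be made concrete with a toy ARINC-653-style major frame: each partition receives a fixed time window, so the monitoring partition's CPU share is bounded by construction. The window lengths below are assumed values for illustration, not the framework's actual schedule.

```python
def cpu_share(schedule, partition):
    """schedule: list of (partition_name, window_ms) pairs forming one
    major frame; returns the fraction of the frame given to partition."""
    frame = sum(window for _, window in schedule)
    return sum(window for name, window in schedule if name == partition) / frame

# Hypothetical frame: sensors are hard-capped at 10% of the CPU,
# and the remaining 90% is guaranteed to the applications.
major_frame = [("sensors", 50), ("applications", 450)]
```

Because the schedule repeats every major frame regardless of load, the sensors can never exceed their budget even under bursty monitoring activity.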

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL, for monitoring of their scientific clusters.

10 Future Work

"RFDMon" can easily be utilized to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time while the monitoring framework is running, and new publishers or subscribers can join the framework to publish new monitoring data or analyse the current monitoring data. The framework can be applied to a wide variety of performance monitoring and management problems. It helps in visualizing the various resource utilizations, hardware health, and process states on the computing nodes; therefore, an administrator can easily find the location of a fault in the system and its possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In the future, we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.

11 Conclusion

In this paper, we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based on the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported for this work by Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC, under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience (in press), 2011. Accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community.

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC specification 653-2: Avionics application software standard interface, part 1: required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf.

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor




20

Figure 11 Life Cycle of a CPU Utilization Sensor

21

  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
      • Issues in Distributed System Monitoring
      • Related Work
      • Architecture of the framework
        • Sensors
        • Topics
        • Topic Managers
        • Region
        • Local Manager
        • Leader of the Region
        • Global Membership Manager
          • Sensor Implementation
          • Experiments
          • Benefits of the approach
          • Application of the Framework
          • Future Work
          • Conclusion
          • Acknowledgement

Figure 16: CPU Utilization at node ddsnode1 during the Experiment

database. Hence, it is evident that faults in the framework have only a minimal impact on its monitoring functionality.

8 Benefits of the approach

"RFDMon" is comprehensive in the types of parameters it monitors: system resources, hardware health, computing node availability, MPI job state, and application performance data. The framework scales easily with the number of nodes because it is based upon the data-centric publish-subscribe mechanism, which is extremely scalable in itself. New sensors can also be added easily to increase the number of monitored parameters. "RFDMon" is fault tolerant with respect to faults in the framework itself caused by partial network outages (e.g., if a Regional Leader node stops working). The framework can self-configure (start, stop, and poll) the sensors as per the requirements of the system, and it can be applied in heterogeneous environments. A major benefit of this sensor framework is that the total resource consumption of the sensors can be capped by applying ARINC-653 scheduling policies, as described in previous sections. Furthermore, due to the spatial isolation features of the ARINC-653 emulation, the monitoring framework cannot corrupt the memory or data structures of applications executing on the node. The monitoring framework imposes very low overhead on the computational resources of the system, as shown in Figures 12 and 13.
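The ARINC-653 cap on monitoring overhead can be illustrated with a small sketch. This is a hypothetical, cooperative simplification, not the framework's actual emulation library (which enforces preemptive temporal and spatial partitioning): a major frame is divided into fixed partition windows, and sensor tasks may run only inside the monitoring partition's window, so their CPU share can never exceed that window's fraction of the frame.

```python
import time

class Partition:
    """A fixed window inside the major frame (ARINC-653 style)."""
    def __init__(self, name, duration_s, tasks):
        self.name = name            # e.g. "monitoring" or "application"
        self.duration_s = duration_s
        self.tasks = tasks          # callables run while the window is open

def run_major_frame(partitions):
    """Execute one major frame. Each partition gets only its own window,
    so monitoring tasks cannot exceed their configured CPU share.
    (Cooperative sketch: real ARINC-653 partitioning is preemptive.)"""
    executed = []
    for p in partitions:
        deadline = time.monotonic() + p.duration_s
        for task in p.tasks:
            if time.monotonic() >= deadline:
                break               # window exhausted; task waits for the next frame
            task()
        remaining = deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)   # idle out the window so the frame stays periodic
        executed.append(p.name)
    return executed
```

For example, a 10 ms monitoring window in a 50 ms major frame caps the sensors at 20% of the CPU regardless of how many sensors are registered in that partition.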

9 Application of the Framework

The initial version of the proposed approach was utilized in [25] to manage scientific workflows over a distributed environment: a hierarchical workflow management system was combined with the distributed monitoring framework to monitor the workflows for failure recovery. Another direct application of "RFDMon" is presented in [9], where virtual machine monitoring tools and web service performance monitors are combined with the monitoring framework to manage multi-dimensional QoS data for the DayTrader [26] web service executing on the IBM WebSphere Application Server [27] platform. "RFDMon" provides various key benefits that make it a strong candidate for combination with other autonomic performance management systems. Currently, "RFDMon" is utilized at Fermilab, Batavia, IL for monitoring their scientific clusters.

10 Future Work

"RFDMon" can easily be used to monitor a distributed system and, due to its flexible nature, can easily be combined with a performance management system. New sensors can be started at any time while the monitoring framework is running, and new publishers or subscribers can join the framework to publish new monitoring data or analyze the current monitoring data. It can be applied to a wide variety of performance monitoring and management problems. The proposed framework helps in visualizing resource utilization, hardware health, and process state on the computing nodes, so an administrator can easily locate a fault in the system and its possible causes. To make this fault identification and diagnosis procedure autonomic, we are developing a fault diagnosis module that can detect or predict faults in the infrastructure by observing and correlating the various sensor measurements. In addition, we are developing a self-configuring hierarchical control framework (an extension of our work in [28]) to manage multi-dimensional QoS parameters in multi-tier web service environments. In future work we will show that the proposed monitoring framework can be combined with fault diagnosis and performance management modules for fault prediction and QoS management, respectively.
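The fault diagnosis idea above — flagging a node only when several independent sensor measurements agree — can be sketched as follows. The node names, thresholds, and two-sensor rule are hypothetical illustrations; the module under development would correlate many more sensor streams.

```python
def detect_faults(samples, cpu_threshold=90.0, temp_threshold=70.0):
    """Correlate per-node sensor measurements: a node is flagged only when
    CPU utilization and CPU temperature are simultaneously above threshold,
    since either reading alone may be a benign spike."""
    suspects = []
    for node, readings in samples.items():
        if (readings["cpu_util"] > cpu_threshold
                and readings["cpu_temp"] > temp_threshold):
            suspects.append(node)
    return sorted(suspects)
```

Requiring agreement between independent measurements trades a little detection latency for far fewer false alarms than per-sensor thresholds alone.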

11 Conclusion

In this paper we have presented the detailed design of "RFDMon", a real-time and fault-tolerant distributed system monitoring approach based upon the data-centric publish-subscribe paradigm. We have also described the concepts of OpenSplice DDS, the ARINC-653 operating system specification, and the Ruby on Rails web development framework. We have shown that the proposed distributed monitoring framework "RFDMon" can efficiently and accurately monitor system variables (CPU utilization, memory utilization, disk utilization, network utilization), system health (temperature and voltage of motherboard and CPU), application performance variables (application response time, queue size, throughput), and scientific application data structures (e.g., PBS information, MPI variables) with minimum latency. The added advantages of using the proposed framework have also been discussed in the paper.

12 Acknowledgement

R. Mehrotra and S. Abdelwahed are supported in this work by the Qatar Foundation grant NPRP 09-778-2299. A. Dubey is supported in part by Fermi National Accelerator Laboratory, operated by Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy (DoE), and by the DoE SciDAC program under contract No. DOE DE-FC02-06 ER41442. We are grateful for the help and guidance provided by T. Bapty, S. Neema, J. Kowalkowski, J. Simone, D. Holmgren, A. Singh, N. Seenu, and R. Herbers.

References

[1] Andrew S. Tanenbaum and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, October 2006.

[2] Abhishek Dubey, Gabor Karsai, and Nagabhushan Mahadevan. A component model for hard real-time systems: CCM with ARINC-653. Software: Practice and Experience, 2011. In press; accepted for publication.

[3] OpenSplice DDS Community Edition. http://www.prismtech.com/opensplice/opensplice-dds-community

[4] Ruby on Rails. http://rubyonrails.org [Sep 2011].

[5] ARINC specification 653-2: Avionics application software standard interface, part 1 - required services. Technical report, Annapolis, MD, December 2005.

[6] Bert Farabaugh, Gerardo Pardo-Castellote, and Rick Warren. An introduction to DDS and data-centric communications, 2005. http://www.omg.org/news/whitepapers/Intro_To_DDS.pdf

[7] A. Goldberg and G. Horvath. Software fault protection with ARINC 653. In Aerospace Conference, 2007 IEEE, pages 1-11, March 2007.

[8] Ruby. http://www.ruby-lang.org/en [Sep 2011].

[9] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Weston Monceaux. Large scale monitoring and online analysis in a distributed virtualized environment. Engineering of Autonomic and Autonomous Systems, IEEE International Workshop on, 0:1-9, 2011.

[10] Lorenzo Falai. Observing, Monitoring and Evaluating Distributed Systems. PhD thesis, Universita degli Studi di Firenze, December 2007.

[11] Monitoring in distributed systems. http://www.ansa.co.uk/ANSATech/94/Primary/100801.pdf [Sep 2011].

[12] Ganglia monitoring system. http://ganglia.sourceforge.net [Sep 2011].

[13] Nagios. http://www.nagios.org [Sep 2011].

[14] Zenoss, the cloud management company. http://www.zenoss.com [Sep 2011].

[15] Nimsoft Unified Manager. http://www.nimsoft.com/solutions/nimsoft-unified-manager [Nov 2011].

[16] Zabbix: The ultimate open source monitoring solution. http://www.zabbix.com [Nov 2011].

[17] OpenNMS. http://www.opennms.org [Sep 2011].

[18] Abhishek Dubey et al. Compensating for timing jitter in computing systems with general-purpose operating systems. In ISORC, Tokyo, Japan, 2009.

[19] Hans Svensson and Thomas Arts. A new leader election implementation. In Proceedings of the 2005 ACM SIGPLAN Workshop on Erlang, ERLANG '05, pages 35-39, New York, NY, USA, 2005. ACM.

[20] G. Singh. Leader election in the presence of link failures. Parallel and Distributed Systems, IEEE Transactions on, 7(3):231-236, March 1996.

[21] Scott D. Stoller. Leader election in distributed systems with crash failures. Technical report, 1997.

[22] Christof Fetzer and Flaviu Cristian. A highly available local leader election service. IEEE Transactions on Software Engineering, 25:603-618, 1999.

[23] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12:84-101, 1990.

[24] Intelligent Platform Management Interface (IPMI). http://www.intel.com/design/servers/ipmi [Sep 2011].

[25] Pan Pan, Abhishek Dubey, and Luciano Piccoli. Dynamic workflow management and monitoring using DDS. In 7th IEEE International Workshop on Engineering of Autonomic & Autonomous Systems (EASe), 2010. Under review.

[26] DayTrader. http://cwiki.apache.org/GMOxDOC20/daytrader.html [Nov 2010].

[27] WebSphere Application Server Community Edition. http://www-01.ibm.com/software/webservers/appserv/community [Oct 2011].

[28] Rajat Mehrotra, Abhishek Dubey, Sherif Abdelwahed, and Asser Tantawi. A Power-aware Modeling and Autonomic Management Framework for Distributed Computing Systems. CRC Press, 2011.


Figure 11: Life Cycle of a CPU Utilization Sensor


  • Introduction
  • Preliminaries
    • Publish-Subscribe Mechanism
    • Open Splice DDS
    • ARINC-653
    • ARINC-653 Emulation Library
    • Ruby on Rails
  • Issues in Distributed System Monitoring
  • Related Work
  • Architecture of the framework
    • Sensors
    • Topics
    • Topic Managers
    • Region
    • Local Manager
    • Leader of the Region
    • Global Membership Manager
  • Sensor Implementation
  • Experiments
  • Benefits of the approach
  • Application of the Framework
  • Future Work
  • Conclusion
  • Acknowledgement
