

Modeling and Analyzing Server System with Rejuvenation through SysML and Stochastic Reward Nets

Ermeson C. Andrade∗‡, Fumio Machida†‡, Dong Seong Kim‡ and Kishor S. Trivedi‡

∗ Informatics Center, Federal University of Pernambuco (UFPE), Recife, PE, Brazil
Email: [email protected]

† Service Platforms Research Laboratories, NEC Corporation, Kawasaki, Japan
Email: [email protected]

‡ Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, United States
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—High-availability assurance of server systems is becoming an important issue, since many mission-critical applications are implemented on server systems. To achieve high availability, software rejuvenation is a practical technique for reducing unexpected downtime caused by software aging in applications running on server systems. Although analytic models of software rejuvenation are well studied, such analysis is not used in server system administration because of the complexity of modeling. In this paper, we present an availability modeling method for server systems with software rejuvenation based on SysML, which is used to describe system configurations and maintenance operations semi-formally. The proposed approach allows system administrators who do not have expertise in availability modeling to design and study the effects of different rejuvenation policies deployed in server systems. To show the applicability of the proposed modeling and evaluation process, a case study of a web application server is presented. We show the correctness of our modeling method by comparing its results with conventional models for condition-based and time-based software rejuvenation.

Keywords-Availability assessment; server system; software rejuvenation; stochastic reward nets; SysML.

I. INTRODUCTION

High-availability assurance of web application services on server systems has gained a lot of attention in recent years, since many organizations use web application services to implement their mission-critical applications. Techniques to achieve high availability from the hardware perspective are well studied. However, software remains the main bottleneck in achieving high availability of many software-based systems. To increase the time-to-failure (TTF) of software-based systems, proactive techniques such as server rejuvenation [1] can be used. Server rejuvenation is a preventive technique that involves occasionally stopping the running server, cleaning its internal state and restarting it. The consequences of sudden failures due to aging-related bugs [2] are postponed or prevented by such a proactive operation. The proactive operation can be scheduled and performed when the workload is low.

Although analytic models such as Markov chains are useful for studying the effectiveness of server rejuvenation, they are not easy to use for system administrators who do not have expertise in stochastic availability modeling. An important and challenging issue is to enable system administrators to develop analytic models representing the real server system configurations and behaviors. To address this issue, we advocate the use of a system specification language such as the Systems Modeling Language (SysML) [3] to generate analytic models. SysML is used for various engineering design purposes and supports friendlier and more intuitive notations.

Some literature addresses methods for translating system specification models written in SysML and UML into analytic models (e.g., stochastic Petri nets and continuous-time Markov chains) for quantitative [4][5] as well as qualitative [6][7] evaluation of information systems. However, to the best of our knowledge, only a few studies address availability assessment and management of web application services, and none of them allows system administrators to design and study different rejuvenation policies deployed in a server system.

In our previous study [8], we designed a component-based availability modeling framework which assists system administrators in constructing Stochastic Reward Nets (SRNs) [9] by translating SysML diagrams which represent system behavior and configurations. We named this framework Candy. In this paper, we use Candy to model and analyze software rejuvenation policies deployed in server systems. This approach allows system administrators to make a comparative study of different rejuvenation policies. As unnecessary rejuvenation causes additional downtime and cost, we study the optimal rejuvenation trigger interval in time-based rejuvenation and the effectiveness of condition-based rejuvenation.

The rest of this paper is organized as follows. Section II explains the overview of Candy. Section III presents the translation process from SysML diagrams to SRNs. Section IV details the generation of server rejuvenation models. Section V discusses the numerical results for the proposed rejuvenation policies. Section VI provides the conclusion of the paper and briefly discusses further work.

II. CANDY FRAMEWORK

Candy is a component-based availability modeling framework for assessing the availability of systems based on the design of the system infrastructure specified by SysML models [8]. System administrators can use Candy to: (i) design the system infrastructure using SysML diagrams, (ii) generate the availability model by the translation process, and (iii) study different design policies and maintenance operations. Candy allows system administrators to choose the service infrastructure that fits their budget or satisfies a given Service Level Agreement (SLA). The overview of Candy is shown in Figure 1.

Figure 1: Candy Framework.

The modeling and evaluation process in Candy is divided into three steps. First, the system configuration and maintenance operations are designed in SysML. Candy supports four different types of SysML diagrams as input: (i) Block Definition Diagram (BDD), (ii) Internal Block Diagram (IBD), (iii) State Machine Diagram (STM) and (iv) Activity Diagram (AD) (which becomes the Activity SRN, described in more detail below). SysML-BDD/IBD are used for representing the system's static configurations. A SysML-STM describes the state transitions of a specific system element; for example, a server failure-recovery behavior. A SysML-AD describes the process flow of administrative operations which may affect more than one element in the system. The dependencies among the SysML models are specified by the notation called SysML-allocation. Note that the SysML-BDD/IBD are not used in this work because we focus on the analysis of server rejuvenation in a single server.

Second, each element in the SysML diagrams is translated into a part of the availability model, and these parts are assembled and synchronized to construct the whole availability model in the form of SRNs. The translation process consists of three phases: (1) model translation, (2) model assembly and (3) model synchronization. In the first phase, Candy translates the SysML diagrams into parts of the availability model named model components. In the second phase, Candy assembles the model components generated from SysML-BDD/IBD and SysML-STM using the allocation notation in SysML. The resulting assembled model is named the System SRN; it represents the system configurations and state transitions of each system element. The System SRN can be affected by system maintenance operations described in SysML-ADs. In the third phase, the System SRN is synchronized with the Activity SRN (which is a model component generated from a SysML-AD). Model synchronization is performed by identifying the relationships between actions in the Activity SRN and transitions in the System SRN. This phase is described in the following subsections.

Third, various availability measures can be computed using software packages such as the Stochastic Petri Net Package (SPNP) [10]. In the evaluation step, the system administrator needs to define the reward function corresponding to the measure of interest (e.g., steady-state (un)availability, downtime), and she/he supplies input parameter values such as component failure rates, operation delays, and coverage factors. In this paper, we assume that system administrators belong to a specific system administration group and that they can design their administrative operations using SysML and resolve the dependencies among model components by defining guard functions for SRNs.

III. TRANSLATION FROM SYSML DIAGRAMS TO SRN

This section presents the translation from SysML diagrams to model components and the synchronization of those model components to obtain the entire availability model. Since all model components are based on SRNs, an overview of SRNs is given first.

A. Stochastic Reward Nets

A stochastic reward net (SRN) extends the Generalized Stochastic Petri Net (GSPN) by introducing reward functions and guard functions. A reward function defines the reward rate for each tangible marking. Various quantitative measures such as steady-state availability can be computed by defining the corresponding reward functions. A guard function assigned to a transition specifies a condition that enables or disables the transition, in addition to the constraints imposed by priority, input arcs, and inhibitor arcs. For more details, the reader should refer to [9].
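To make the roles of guard and reward functions more concrete, the following Python sketch treats a marking as a mapping from place names to token counts. It is only an illustration: the place names Pup and Pdown, the helper names, and the steady-state probabilities are assumptions chosen for this example and are not taken from SPNP or Candy.

```python
from typing import Callable, Dict, List, Tuple

Marking = Dict[str, int]  # place name -> number of tokens


def tokens(marking: Marking, place: str) -> int:
    """Number of tokens in a place (0 if the place is not listed)."""
    return marking.get(place, 0)


def guard_server_down(marking: Marking) -> bool:
    """Guard: the guarded transition is enabled only while the server is down."""
    return tokens(marking, "Pdown") == 1


def reward_available(marking: Marking) -> float:
    """Reward rate 1 for markings in which the server is up, 0 otherwise."""
    return 1.0 if tokens(marking, "Pup") == 1 else 0.0


def expected_reward(distribution: List[Tuple[Marking, float]],
                    reward: Callable[[Marking], float]) -> float:
    """Expected steady-state reward: sum of reward rate times marking probability."""
    return sum(prob * reward(marking) for marking, prob in distribution)


# Hypothetical steady-state probabilities of the two tangible markings.
steady_state = [({"Pup": 1, "Pdown": 0}, 0.999),
                ({"Pup": 0, "Pdown": 1}, 0.001)]
availability = expected_reward(steady_state, reward_available)
```

With a 0/1 reward function of this kind, the expected reward is simply the total probability of the markings regarded as available, which is how a steady-state availability measure is expressed in an SRN.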

B. Translation of SysML-STM

A SysML-STM describes the state transitions of a specific system element, such as a server failure-recovery behavior. Figure 2(a) presents an example of a simple failure and recovery behavior for a server. The server process starts in an initial state, represented by the closed circle, and the server enters the UP state (i.e., operational). From this state, the server can fail. If the server fails, the process enters the Down state. The process subsequently returns to the UP state after a recovery finishes. A SysML-STM can be translated into a model component by converting each state into a place and each transition into a timed transition with input/output arcs. Figure 2(b) shows the SRN model representing the failure and recovery behavior of the server shown in Figure 2(a). The model obtained from this translation is named the System SRN. The server starts in the UP state, indicated by a token in place Pup. The transition Tfail fires when the server goes down; the token in Pup is then removed and a token is deposited in Pdown. Trecv fires when the server is recovered; the token in Pdown is then removed and a token is deposited in Pup.
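The state-to-place and transition-to-timed-transition rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Candy's implementation: the function names, the tuple layout, and the naming scheme (prefix "P" or "T" plus the lower-cased state or event name) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, str, str]  # (transition name, input place, output place)


def translate_stm(states: List[str], initial: str,
                  edges: List[Tuple[str, str, str]]):
    """Turn each state into a place and each state transition into a timed transition."""
    marking: Dict[str, int] = {"P" + s.lower(): int(s == initial) for s in states}
    transitions: List[Transition] = [
        ("T" + name, "P" + src.lower(), "P" + dst.lower())
        for name, src, dst in edges
    ]
    return marking, transitions


def fire(marking: Dict[str, int], transitions: List[Transition], name: str):
    """Fire the named transition if enabled: move one token from input to output place."""
    for t_name, src, dst in transitions:
        if t_name == name and marking[src] > 0:
            updated = dict(marking)
            updated[src] -= 1
            updated[dst] += 1
            return updated
    raise ValueError(f"transition {name} is not enabled")


# The failure/recovery state machine of Figure 2(a): UP --fail--> Down --recv--> UP.
marking0, net = translate_stm(["UP", "Down"], "UP",
                              [("fail", "UP", "Down"), ("recv", "Down", "UP")])
marking1 = fire(marking0, net, "Tfail")   # server goes down
marking2 = fire(marking1, net, "Trecv")   # server is recovered
```

The sketch ignores firing delays and only plays the token game; in the SRN each timed transition would additionally carry a firing-time distribution.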

Figure 2: Translation of a SysML-STM.

C. Translation of SysML-AD

A SysML-AD describes the process flow of administrative operations which may affect the system state. It is composed of nodes and actions: initial node, final node, action node, decision node, accept event action, send signal action and wait time action, as shown in Figure 4. Initial and final nodes represent the starting point and final state of a SysML-AD, respectively. Action nodes represent the events performed. Decision nodes represent the different consequences that follow from an action and the system state; they are expressed as branching conditions. Accept event actions and send signal actions are used to represent the communication between activities. A wait time action represents the time that elapses until the next action occurs.

Figure 3(a) shows an example of a SysML-AD. The wait time action (shown as an hourglass shape) represents a 24-hour waiting period before the decision node is executed. The decision node judges whether the server is working or not, and the server status determines which branch is taken. The condition of the decision node is described using the stereotype «decisionInput». If the server is working properly ([yes]), the SysML-AD reaches its end without executing any further activity. Otherwise ([no]), the server is restarted according to the action node Server Restart and then the SysML-AD finishes its activities. To specify an action that affects a system state, we introduce a new stereotype «control» for the action node Server Restart. The server state in Figure 2 is changed from the Down state to the UP state by the execution of the action node Server Restart in Figure 3(a).

A SysML-AD can be translated into an SRN model by converting each node type into a set of transitions, places and guard functions. Figure 4 presents the translation rules for each node type of a SysML-AD. Figure 3(b) shows an example of the result of translating a SysML-AD into SRNs. There is a token in both places Pini and Pin_clock, and the deterministic transition TclockTg fires every d time units, depositing a token in place Pout_clock. If there is a token in both places Pwait and Pin_clock, the maintenance operation represented by the SysML-AD starts its activities by checking the server through the transition Tdecn. Note that there are two outgoing control flows of the decision node, and they are represented by the immediate transitions Tout1_dec and Tout2_dec. Each immediate transition has a guard function which corresponds to the decision condition. If the server is in the Down state, then the restart is performed (transition Tserverrest) and the activity diagram reaches its end. Otherwise, the SysML-AD reaches its end without executing any activity. The translations of the final nodes include outgoing arcs to the initial place, which means that the activity is triggered repeatedly in response to the timer event. Note that the transition Tserverrest is assigned the stereotype «control». It comes from the SysML-AD and means that the execution of this activity results in a change of the system state.

Figure 3: Translation of a SysML-AD.

Figure 4: Translation rules for activity nodes and actions.

D. Synchronization Process

There are bidirectional dependencies between the System SRN and the Activity SRN. On the one hand, an action in an Activity SRN may induce a state change in the System SRN. Such actions are identified by the stereotype «control». On the other hand, the behavior of maintenance operations modeled in an Activity SRN may change depending on the marking of the System SRN. These dependencies can be incorporated in SRN models by assigning guard functions to the associated transitions.

1) Synchronization of an action in Activity SRN: If an action represented in an Activity SRN affects a state transition in a System SRN, guard functions to enable the state transition are required. For each «control»-stereotyped transition in the Activity SRN, the system administrator needs to find the corresponding transitions in the System SRN. Consider the example in Figure 5. In order to synchronize the Activity SRN and the System SRN, the transition Ty is expanded with one immediate transition and one place, as shown in the right part of Figure 5. The immediate transition Tin_y and the place Pin_y are inserted before the transition Ty. They represent the action invocation and the action execution state, respectively. For all the related transitions, Candy generates four guard functions that enable the transitions in a consistent order. The first guard function Gin_y represents the trigger of the action. The second guard function Gin_x ensures the start of the state transition. The third guard function Gout_y represents the end of the action and the fourth guard function Gout_x ensures that the state transition is completed.

Figure 5: Synchronization of Tx in Activity SRN and Ty in System SRN.

Figure 6 shows an example of synchronization between transition Trecv of the System SRN (see Figure 2) and transition Tserverrest of the Activity SRN (see Figure 3(b)). The transition Trecv is expanded with an immediate transition Tin_recv and a place Pin_recv. The four guard functions are automatically generated according to the related transitions. Note that more than one transition may need to be synchronized with a single action.

Figure 6: An example of expanded System SRN.
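The structural part of this expansion, inserting an immediate transition and a place in front of the synchronized transition, can be sketched as follows. The arc-list representation and the function name are assumptions made for illustration, and the sketch does not generate the four guard functions, whose exact definitions depend on the figures.

```python
from typing import List, Tuple

Arc = Tuple[str, str]  # (source node, target node); nodes are places or transitions


def expand_transition(arcs: List[Arc], name: str) -> List[Arc]:
    """Insert immediate transition Tin_<name> and place Pin_<name> before T<name>."""
    target = "T" + name
    t_in, p_in = "Tin_" + name, "Pin_" + name
    expanded: List[Arc] = []
    for src, dst in arcs:
        # Input arcs of the original transition now feed the new immediate transition.
        expanded.append((src, t_in) if dst == target else (src, dst))
    # Chain: Tin_<name> -> Pin_<name> -> T<name>.
    expanded += [(t_in, p_in), (p_in, target)]
    return expanded


# System SRN of Figure 2(b), then expansion of Trecv as in the Figure 6 example.
system_arcs = [("Pup", "Tfail"), ("Tfail", "Pdown"),
               ("Pdown", "Trecv"), ("Trecv", "Pup")]
expanded_arcs = expand_transition(system_arcs, "recv")
```

After the expansion, a token leaving Pdown first passes through Tin_recv and rests in Pin_recv, which gives the guard functions a marking to test while the action of the Activity SRN is in progress.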

2) Synchronization of decision condition: If the outgoing control flows of the decision node in the SysML-AD depend on the system state, then guard functions for the Activity SRN are defined using the marking of the System SRN. Note that the system administrator needs to define the guard functions for «decisionInput» and the decision guards by translating the corresponding descriptions in the original SysML-AD. If the control flow does not depend on the state of the system, no guard functions need to be defined for the availability evaluation. For instance, Table I describes the guard functions for the transitions Tout1_dec and Tout2_dec (see Figure 3(b)) based on the number of tokens in the place Pup (see Figure 6).

Table I: An example of guard functions for the decision outputs.

Guard        Function
Gout1_dec    if (#Pup == 1) 1 else 0 end
Gout2_dec    if (#Pup == 1) 0 else 1 end
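The two decision guards of Table I are complementary, so exactly one outgoing branch of the decision node is enabled in any marking. A small Python sketch of that property is given below; the function names and the dictionary representation of a marking are illustrative assumptions, not SPNP syntax.

```python
def guard_out1(marking):
    """First decision output: 1 if place Pup holds exactly one token, else 0."""
    return 1 if marking.get("Pup", 0) == 1 else 0


def guard_out2(marking):
    """Second decision output: the complement of the first guard."""
    return 0 if marking.get("Pup", 0) == 1 else 1


# Exactly one branch is enabled whether the server is up or down.
server_up = {"Pup": 1, "Pdown": 0}
server_down = {"Pup": 0, "Pdown": 1}
branches_up = (guard_out1(server_up), guard_out2(server_up))
branches_down = (guard_out1(server_down), guard_out2(server_down))
```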

IV. GENERATING SERVER REJUVENATION MODELS

This section describes the design and translation of different rejuvenation policies in a web application service. The rejuvenation policies we take into account are (i) time-based and (ii) condition-based server rejuvenation [1]. In the time-based case, server rejuvenation is performed at fixed time intervals, while in the condition-based case, server rejuvenation is performed when the system state satisfies a specific condition.

A. Time-Based Rejuvenation

The SysML-STM and SysML-AD designed for a web application service on a server system are shown in Figures 7 and 8. The failure and recovery behavior of the web application service with time-based rejuvenation is presented in Figure 7 using a SysML-STM. The server process starts in the Up state. Once in this state, the server can either fail or suffer software aging. If the server fails during operation, the process enters the Failure state. The process moves to the Detected state when the failure is detected by a monitoring mechanism. The process subsequently enters the Recover state and returns to the Up state after the recovery is finished. If the server process enters the Failure-probable state due to aging, the server may fail, or it can be rejuvenated to clean up the aging state with probability c. If the rejuvenation is not successful, which happens with probability (1 − c), the server fails. Note that server rejuvenation is performed at deterministic times (every 24 hours) for a server in the Up or Failure-probable state, and it can succeed with probability c or fail with probability (1 − c). The decision node is used to represent the probabilities of success and failure of the server rejuvenation.

Figure 7: SysML-STM representing the failure and recovery of a server with time-based rejuvenation.

Figure 8: SysML-AD for time-based rejuvenation.

The SysML-AD for the maintenance operation with time-based rejuvenation is presented in Figure 8. The wait time action specifies that the action node Check Server Status is performed every 24 hours. If the server is working properly, the server is rejuvenated by the action node Server Rejuvenation. Otherwise, an alert message is issued to the system administrators, who are responsible for handling the alert and carrying out further manual maintenance operations. Because the server rejuvenation can fail, the server status is checked again after the server rejuvenation.

Figures 9 and 10 present the System SRN and the Activity SRN obtained by translating the SysML diagrams for the web application service with time-based rejuvenation. These models are obtained in the following steps. First, Candy translates both the SysML-STM and the SysML-AD into SRNs by using the translation rules described in Section III. Note that the decision node outputs of the SysML-STM are translated into two SRN transitions. For example, the decision node from the Failure-probable state is translated into the transitions Trejuv2 and Tfailrejuv2. They represent the success (with probability c) and the failure (with probability 1 − c) of server rejuvenation, respectively. Second, the System SRN is synchronized with the Activity SRN by defining additional guard functions. The guard functions are assigned to the transitions with the help of a system administrator. For the sake of simplicity, the action nodes which do not have the stereotype «control» assigned to them are translated into just a place and a timed transition.

Figure 9: System SRN for the server process of Figure 7.

Table II presents the guard functions for the synchronization between the Activity SRN and the System SRN. Note that the Activity SRN transition TSV_rej is synchronized with two System SRN transitions (Tin_rejuv2 and Tin_rejuv1), since the server rejuvenation can enable both of these transitions. As explained in Section III-D, the transitions Tin_rejuv2 and Tin_rejuv1 are expanded with an immediate transition and a place (see Figure 9). Candy then generates guard functions for all related transitions (Gin_rejuv1, Gout_rejuv1, Gin_rejuv2, Gout_rejuv2, Gin_SV_rej and Gout_SV_rej). As the server rejuvenation can be carried out from two places, Pin_rejuv1 and Pin_rejuv2, the guard functions Gin_SV_rej and Gout_SV_rej for the Activity SRN are modified to consider both places. Two guard functions, Gfailrejuv1 and Gfailrejuv2, are included in the synchronization, since the server rejuvenation can fail. Finally, four guard functions (Gout1_dec1, Gout2_dec1, Gout1_dec2 and Gout2_dec2) are defined to control the outgoing flows of the decision nodes.

Figure 10: Activity SRN for the time-based rejuvenation from Figure 8.

Table II: Guard functions for the synchronization between the Activity SRN and System SRN.

Guard                       Function
Gout1_dec1                  if (#('Pup')==1 or #('Pfprob')==1) 1 else 0
Gout2_dec1                  if (#('Pup')==1 or #('Pfprob')==1) 0 else 1
Gin_rejuv1, Gin_rejuv2      if (#('Pin_SV_rej')==1) 1 else 0
Gin_SV_rej                  if (#('Pin_rejuv1')==1 or #('Pin_rejuv2')==1) 1 else 0
Gout_rejuv1                 if (#('Pout_SV_rej')==1) 1 else 0
Gout_rejuv2                 if (#('Pout_SV_rej')==1) 1 else 0
Gout_SV_rej                 if (#('Pin_rejuv1')==0 and #('Pin_rejuv2')==0) 1 else 0
Gfailrejuv1, Gfailrejuv2    if (#('Pout_SV_rej')==1) 1 else 0
Gout1_dec2                  if (#('Pup')==1) 1 else 0
Gout2_dec2                  if (#('Pup')==1) 0 else 1

B. Condition-Based Rejuvenation

The SysML diagrams for a web application service on a server system with condition-based rejuvenation are presented in Figures 11 and 12. The failure-repair behavior of the server is the same as in the previous case. The only difference in the SysML-STM is that there is no self transition in the UP state (server rejuvenation), since we assume that the detection of the Failure-probable state is always accurate. Figure 12 presents the SysML-ADs for the maintenance operation with condition-based rejuvenation. This maintenance operation is performed by two distinct tasks. The first SysML-AD represents the detection of aging in the server system. The second SysML-AD represents the actions for server rejuvenation. The wait time action in the first SysML-AD specifies that the action node Check Server Status is performed every 5 minutes. If the server suffers from aging, a signal is generated and transmitted to the target object (accept event action). Otherwise, the server status is checked again every 5 minutes. If the signal is received by the accept event action, the server is rejuvenated by the action node Server Rejuvenation.

Applying the translation rules, we obtain the SRNs shown in Figures 13 and 14. The guard functions used to synchronize the dependencies between the SRN models are listed in Table III. It is necessary to define two guard functions, Gout_agconfir and Gout_agdetec, to represent the synchronization between the Activity SRN for aging detection and the Activity SRN for server rejuvenation.

Figure 11: SysML-STM representing the failure and recovery of a server with condition-based rejuvenation.

Figure 12: SysML-ADs for condition-based rejuvenation.

V. NUMERICAL ANALYSIS

The default parameter values used in the numerical analysis are summarized in Table IV. We use arbitrary (but reasonably chosen) input parameter values, since some parameter values are not available or are confidential. The distribution of each timed transition is also determined at evaluation time. In our case study, all timed transitions are assumed to be exponentially distributed except the deterministic transitions TclockTg and TclockIn. These transitions are approximated by a 10-stage Erlang distribution in the analysis.

Figure 13: Obtained System SRN for the server process.

Figure 14: Obtained Activity SRNs for the maintenance operation considering the condition-based rejuvenation.

Table III: Guard functions for the synchronization between the Activity SRN and System SRN.

Guard                       Function
Gout1_dec                   if (#('Pfprob')==1) 1 else 0
Gout2_dec                   if (#('Pfprob')==1) 0 else 1
Gin_rejuv1                  if (#('Pin_SV_rej')==1) 1 else 0
Gin_SV_rej                  if (#('Pin_rejuv1')==1) 1 else 0
Gout_rejuv1, Gfailrejuv1    if (#('Pout_SV_rej')==1) 1 else 0
Gout_SV_rej                 if (#('Pin_rejuv1')==0) 1 else 0
Gout1_dec2                  if (#('Pup')==1) 1 else 0
Gout2_dec2                  if (#('Pup')==1) 0 else 1
Gout_agconfir               if (#('Pin_agdetec') == 1) 1 else 0 end
Gout_agdetec                if (#('Pagconfir') == 0) 1 else 0 end
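As a rough guide to what the Erlang approximation of a deterministic delay implies, a k-stage Erlang distribution with mean d consists of k exponential stages of rate k/d each and has a coefficient of variation of 1/sqrt(k). The Python sketch below evaluates these two quantities for the 24-hour clock and 10 stages; the function names are assumptions introduced only for this illustration.

```python
import math


def erlang_stage_rate(mean_delay_hours: float, stages: int) -> float:
    """Rate (1/h) of each exponential stage so that the total mean equals the delay."""
    return stages / mean_delay_hours


def erlang_cv(stages: int) -> float:
    """Coefficient of variation of a k-stage Erlang distribution: 1 / sqrt(k)."""
    return 1.0 / math.sqrt(stages)


stage_rate_24h = erlang_stage_rate(24.0, 10)  # per-stage rate for a 24-hour clock
cv_10_stages = erlang_cv(10)                  # spread left by the approximation
```

A truly deterministic delay has a coefficient of variation of zero, so the residual spread of the 10-stage approximation (roughly 0.32) is one plausible source of small numerical differences from models that treat the clock exactly.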

The steady-state availability of the SRN models was computed using the SPNP software package [10]. The system administrator defines the reward functions according to the measures of interest. The reward function for steady-state availability used in this case study is shown in Table V. The results obtained for the web application with and without server rejuvenation are shown in Table VI.

Table IV: Description for input parameters and their default values.

Parameters                         Assigned transitions    Values [1/h]
Rejuvenation rate                  Treju1, Treju2          6
AP server failure rate             Tfail1                  0.00046296
AP server aging rate               Tfprob                  0.00297619
AP server aging failure rate       Tfail2                  0.00138889
Server failure detection rate      Tdetect                 12
Recovery operation start rate      Tarrival                2
AP server recovery rate            Trecover                1
Check status                       TCKsta1, TCKsta2        3600
Decision node                      Tdecn1, Tdecn2          3600
Server alert                       TSDalt1, TSDalt2        3600
Rejuvenation trigger interval      TclockTg                variable
Condition check interval           TclockIn                12
Coverage of server rejuvenation    c                       0.99
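Since Table IV lists rates in units of 1/h, it can help to read them as mean times. The following Python sketch inverts a few of the tabulated rates; the dictionary and variable names are illustrative assumptions. The results correspond to roughly 2160 hours (90 days) to failure, 336 hours (14 days) to aging, 720 hours (30 days) from aging to failure, and 5 minutes for failure detection and for the condition check.

```python
# Rates from Table IV in 1/h, keyed by the assigned transition.
rates_per_hour = {
    "Tfail1": 0.00046296,   # AP server failure rate
    "Tfprob": 0.00297619,   # AP server aging rate
    "Tfail2": 0.00138889,   # AP server aging failure rate
    "Tdetect": 12.0,        # server failure detection rate
    "Tarrival": 2.0,        # recovery operation start rate
    "Trecover": 1.0,        # AP server recovery rate
    "TclockIn": 12.0,       # condition check, i.e. one check every 1/12 h
}

# Mean time of each transition in hours is the reciprocal of its rate.
mean_hours = {name: 1.0 / rate for name, rate in rates_per_hour.items()}

# Short delays are easier to read in minutes.
detect_minutes = mean_hours["Tdetect"] * 60.0
check_minutes = mean_hours["TclockIn"] * 60.0
```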

Table V: Reward functions for steady-state availability.

Reward    Function
R1        if (#('Pup')==1 or #('Pfprob')==1) 1 else 0 end

Table VI: Steady-state availability and downtime.

Policy                              Availability    Downtime (hrs per year)
Time-based rejuvenation (24 hrs)    0.991609        73.50
Condition-based rejuvenation        0.998724        11.17
Without rejuvenation                0.998270        15.15
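The downtime column of Table VI follows from the availability column as (1 − availability) times the number of hours in a year. The Python sketch below recomputes it, assuming a year of 8760 hours (365 days); the names are illustrative. Because the tabulated availabilities are rounded to six digits, the recomputed downtimes agree with the reported ones only to within about 0.01 hours.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days of 24 hours


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by a steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# (availability, reported downtime in hours per year) from Table VI.
table_vi = {
    "Time-based rejuvenation (24 hrs)": (0.991609, 73.50),
    "Condition-based rejuvenation": (0.998724, 11.17),
    "Without rejuvenation": (0.998270, 15.15),
}

recomputed = {policy: downtime_hours_per_year(avail)
              for policy, (avail, _reported) in table_vi.items()}
```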

The downtime incurred by performing server rejuvenation every 24 hours (time-based rejuvenation) is much higher than in the case without server rejuvenation. We observe that triggering server rejuvenation every 24 hours is not effective, since rejuvenation at this frequency does not improve availability. Condition-based rejuvenation achieved better availability than time-based rejuvenation with the given input parameter values, provided that the detection of the failure-probable state in condition-based rejuvenation always succeeds. In future work, we will perform a more detailed sensitivity analysis including different detection coverage factors for condition-based rejuvenation.
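The figure for the case without rejuvenation can be cross-checked with a simple renewal argument. The sketch below assumes a cycle Up, Failure-probable, Failure, Detected, Recover and back to Up, with exponential delays at the Table IV rates and the server counted as available in the Up and Failure-probable states; this structure and the variable names are assumptions made for the check, not the SRN solved with SPNP.

```python
# Table IV rates in 1/h.
FAIL_RATE = 0.00046296        # Up -> Failure
AGING_RATE = 0.00297619       # Up -> Failure-probable
AGING_FAIL_RATE = 0.00138889  # Failure-probable -> Failure
DETECT_RATE = 12.0            # Failure -> Detected
ARRIVAL_RATE = 2.0            # Detected -> Recover
RECOVER_RATE = 1.0            # Recover -> Up

# Mean sojourn in Up and probability of leaving Up through aging.
mean_up = 1.0 / (FAIL_RATE + AGING_RATE)
prob_aging = AGING_RATE / (FAIL_RATE + AGING_RATE)

# Expected available time per cycle: Up, plus Failure-probable if aging occurs first.
uptime_per_cycle = mean_up + prob_aging / AGING_FAIL_RATE

# Every cycle ends with one failure: detection, recovery start, and recovery.
downtime_per_cycle = 1.0 / DETECT_RATE + 1.0 / ARRIVAL_RATE + 1.0 / RECOVER_RATE

availability = uptime_per_cycle / (uptime_per_cycle + downtime_per_cycle)
downtime_hours_per_year = (1.0 - availability) * 8760
```

Under these assumptions the availability comes out at about 0.99827 and the yearly downtime at about 15.15 hours, in line with the "Without rejuvenation" row of Table VI.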

Figure 15 shows the steady-state availability of the SRN model with the time-based rejuvenation policy as the rejuvenation trigger interval varies. If the rejuvenation trigger interval is close to zero, the server is rejuvenated more often than necessary, which yields low steady-state availability. As the interval increases, the availability rises to an optimum. The optimal rejuvenation trigger interval is 752 hours, where the downtime per year is 14.77 hours. Beyond the optimal value, the availability remains fairly stable but starts to drop, because server failure has more influence on the server availability than rejuvenation does.

Figure 15: Steady state availability. [Plot: steady-state availability (y-axis, 0.9977 to 0.9985) against rejuvenation trigger interval in hours (x-axis, 100 to 1000), with curves "with rejuvenation" and "without rejuvenation".]
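The trade-off behind the shape of this curve can be illustrated with a much smaller model than the generated SRN. The sketch below is a renewal-reward approximation and not the authors' model. It assumes a two-stage lifetime (aging with rate Tfprob, then failure with rate Tfail2), a fixed rejuvenation interval, and mean repair and rejuvenation times of 1/Trecover and 1/Treju. It ignores Tfail1, the coverage factor, failure detection and the recovery start delay, so it is not expected to reproduce the 752-hour optimum of the full model. All function names are illustrative.

```python
import math

# Rates in 1/h, taken from Table IV.
AGING_RATE = 0.00297619          # Tfprob: robust -> failure-probable
AGING_FAILURE_RATE = 0.00138889  # Tfail2: failure-probable -> failed
RECOVERY_RATE = 1.0              # Trecover
REJUVENATION_RATE = 6.0          # Treju1, Treju2


def failure_cdf(t: float) -> float:
    """P(aging failure occurs within t hours) for the two-stage lifetime."""
    a, b = AGING_RATE, AGING_FAILURE_RATE
    survival = (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)
    return 1.0 - survival


def mean_uptime(interval: float) -> float:
    """E[min(lifetime, interval)]: integral of the survival function."""
    a, b = AGING_RATE, AGING_FAILURE_RATE
    term_a = (b / a) * -math.expm1(-a * interval)
    term_b = (a / b) * -math.expm1(-b * interval)
    return (term_a - term_b) / (b - a)


def availability(interval: float) -> float:
    """Mean uptime per cycle divided by mean cycle length."""
    p_fail = failure_cdf(interval)
    up = mean_uptime(interval)
    # A cycle ends with a repair (after failure) or with a rejuvenation.
    down = p_fail / RECOVERY_RATE + (1.0 - p_fail) / REJUVENATION_RATE
    return up / (up + down)


def availability_without_rejuvenation() -> float:
    """Limit of the model when rejuvenation is never triggered."""
    mttf = 1.0 / AGING_RATE + 1.0 / AGING_FAILURE_RATE
    return mttf / (mttf + 1.0 / RECOVERY_RATE)


intervals = range(10, 2001, 10)  # candidate trigger intervals in hours
best = max(intervals, key=availability)
print(f"best interval on grid: {best} h, availability {availability(best):.6f}")
print(f"without rejuvenation:  {availability_without_rejuvenation():.6f}")
```

Even in this reduced form, very short intervals are dominated by rejuvenation downtime, while very long intervals approach the no-rejuvenation case, which is the qualitative behavior described for Figure 15. At a 24-hour interval the sketch evaluates to roughly 0.99306.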

We compare the results of the obtained models with the analytic models presented in previous work [11] and [12]. If we neglect the coverage factor of software rejuvenation (c) and the transitions for failure detection (Tdetect), recovery operation start (Tarrival) and failure without aging (Tfail1) in Figure 13, the model reduces to a simple condition-based software rejuvenation system, which was originally modeled by Huang [11] using a continuous-time Markov chain. We apply the same input parameter values to Huang’s model and compute the steady-state availability. The difference between our model and Huang’s model is minor, as shown in Table VII. Similarly, if we neglect the coverage factor (c) and the transitions Tdetect, Tarrival and Tfail1 in Figure 9, the model reduces to a simple time-based software rejuvenation system, which was originally modeled by Garg [12] using Markov regenerative stochastic Petri nets (MRSPN). We compute the steady-state availability of Garg’s model using the same parameter values; the results are shown in Table VII. Although the obtained result is close to ours, a small difference remains. It is caused by the approximation of the deterministic transition in our model and by some additional transitions generated from the activity diagram that are not included in Garg’s model. These two comparisons support the correctness and accuracy of our proposed ideas and model generation methods.
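The condition-based comparison can be retraced with a small continuous-time Markov chain. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses a four-state chain (robust, failure-probable, failed, rejuvenating) and maps Table IV values onto it, taking the rejuvenation trigger rate to be the condition check value of TclockIn (12 per hour). That mapping, the state names and the solver are illustrative choices.

```python
from typing import Dict, List, Tuple

ROBUST, FAILURE_PROBABLE, FAILED, REJUVENATING = range(4)

# Transition rates in 1/h; the mapping to Table IV is an assumption.
RATES: Dict[Tuple[int, int], float] = {
    (ROBUST, FAILURE_PROBABLE): 0.00297619,  # Tfprob
    (FAILURE_PROBABLE, FAILED): 0.00138889,  # Tfail2
    (FAILURE_PROBABLE, REJUVENATING): 12.0,  # TclockIn (assumed trigger rate)
    (FAILED, ROBUST): 1.0,                   # Trecover
    (REJUVENATING, ROBUST): 6.0,             # Treju1, Treju2
}


def steady_state(rates: Dict[Tuple[int, int], float], n: int) -> List[float]:
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    # Build the generator matrix Q.
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[src][dst] += rate
        q[src][src] -= rate
    # Augmented system for Q^T pi = 0; the last balance equation is
    # redundant and is replaced by the normalization condition.
    a = [[q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    a[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


pi = steady_state(RATES, 4)
# The server is available in the robust and failure-probable states.
huang_availability = pi[ROBUST] + pi[FAILURE_PROBABLE]
print(f"steady-state availability: {huang_availability:.6f}")
```

Under this assumed mapping the chain evaluates to roughly 0.999504, in line with the condition-based entries of Table VII.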

The original models capture only the essential part of software rejuvenation. They do not incorporate the potential effects of the coverage factor, failure detection rate and recovery operation start rate, which are studied in our model. Candy enables system administrators to study these effects easily by modifying SysML diagrams and generating the corresponding analytic models.

Table VII: Verification.

Policy                         Original models [11] [12]   Our models
Time-based rejuvenation        0.993061                    0.992993
Condition-based rejuvenation   0.999504                    0.999503

VI. CONCLUSION

We have presented a new approach that allows ordinary system administrators to study and design different maintenance operations using a component-based availability modeling framework named Candy. We have given an overview of Candy and the translation rules from SysML to System SRN and Activity SRN, and we have shown a case study of a server with time-based and condition-based software rejuvenation. The SRN models constructed by the proposed method were compared with Huang's model [11] (condition-based rejuvenation) and Garg's model [12] (time-based rejuvenation); the comparison showed the feasibility and correctness of our approach. In future work, we plan to apply the proposed approach in a wider context, such as cluster servers, virtualized data centers and cloud computing systems.

REFERENCES

[1] K. Trivedi and K. Vaidyanathan, “A measurement-based model for estimation of resource exhaustion in operational software systems,” in Proc. ISSRE 1999.

[2] M. Grottke and K. Trivedi, “Fighting bugs: Remove, retry, replicate, and rejuvenate,” IEEE Computer, 2007.

[3] S. Friedenthal, A. Moore, and R. Steiner, A Practical Guide to SysML: The Systems Modeling Language. Morgan Kaufmann, 2008.

[4] E. Andrade, P. Maciel, T. Falcao, B. Nogueira, C. Araujo, and G. Callou, “Performance and energy consumption estimation for commercial off-the-shelf component system design,” Innovations in Systems and Software Engineering, 2010.

[5] J. Merseguer and J. Campos, “Software performance modelling using UML and Petri nets,” LNCS, vol. 2965, 2004.

[6] L. Baresi and M. Pezze, “Improving UML with Petri nets,” in ENTCS, Elsevier Science, 2001.

[7] L. Baresi and M. Pezze, “On formalizing UML with high-level Petri nets,” Concurrent Object-Oriented Programming and Petri Nets, pp. 276–304, 2001.

[8] F. Machida, D. Kim, and K. Trivedi, “Component-Based Availability Modeling for Cloud Service Management,” in Proc. ISSRE 2010 (Industry session).

[9] K. Trivedi, Probability & Statistics with Reliability, Queuing and Computer Science Applications. 2nd Ed., John Wiley & Sons, New York, 2002.

[10] C. Hirel, B. Tuffin, and K. Trivedi, “SPNP: Stochastic Petri Nets. Version 6.0,” in Computer Performance Evaluation. Modelling Techniques and Tools, 2000.

[11] Y. Huang, C. Kintala, N. Kolettis, and N. Fulton, “Software rejuvenation: Analysis, module and applications,” in Proc. FTCS 1995.

[12] S. Garg, A. Puliafito, M. Telek, and K. Trivedi, “Analysis of software rejuvenation using Markov regenerative stochastic Petri net,” in Proc. ISSRE 1995.