availability modeling of energy management systems

17
Availability modeling of energy management systems Ricardo M. Fricks a , Kishor S. Trivedi b a The Meteorological System of Parana ´, Parana ´ State Power Company, Curitiba, PR 80420-170, Brazil b Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, U.S.A. Received 20 December 1996; in revised form 16 March 1998 Abstract Energy management system (EMS) computer architectures have changed significantly over the recent past increasing the diculty and the need for a priori assessment of system performance and dependability. The old practice based on measurements is no longer acceptable because of the flexibility accrued with the deployment of the new distributed computer-based systems. The number of ‘‘what if’’ questions increased since EMS systems are now implemented using multiple workstations that can be interconnected in various dierent ways. In this paper we show how alternative configurations can be modeled and analyzed, before proposing and purchasing any equipment, with the assistance of Markov reward models. We review the concept of Markov reward models and show how they can be applied in the availability analysis of SCADA/EMS computer systems. The paper also presents a software tool that facilitates automatic generation and solution of large Markov reward models. The input language of this modeling tool uses a variation of stochastic Petri nets called stochastic reward nets, which are also reviewed. We believe this is the first time a detailed quantitative model of a SCADA/EMS computer system is proposed and solved in the general literature. # 1998 Elsevier Science Ltd. All rights reserved. Keywords: Availability; SCADA/EMS; Sensitivity 1. Introduction Energy management systems (EMS) are real-time con- trol systems designed to automate operation of the generation/transmission system of utility companies [1]. The main objective of an EMS is to assist electric uti- lity companies in meeting their goals of supplying their consumers power demand at all times and under all conditions, while fulfilling established quality-of-service requirements and reducing operational costs. Physically, an EMS is composed of at least one main computer installation (normaly referred as master station) and hundreds of remote terminal units. The master station implements a supervisory control and data acquisition (SCADA), the real-time component of an EMS. Master station computer configurations rely on fault-tolerance techniques to provide high-availability for the SCADA/EMS applications. Typically, it is required that a SCADA/EMS master station should possess a 99.8 or 99.9% probability of performing cor- rectly at the instant the EMS is requested to do so [2, 3]. Under the assumption of statistical indepen- dence between hardware and software failure modes, a 99.9% figure for system availability accounts for a minimum availability of 99.995% for the computer hardware if we consider that the master station soft- ware will be operational at least for 99.9% of the specified system lifetime. A few years ago, the number of alternative SCADA/ EMS computer configurations was very limited. Mainframes or super-minicomputers operating in standby sparing redundancy schemes used to be the main approaches for providing the high-availability required to host SCADA/EMS applications [2, 3]. More complex designs would add front-end processors mainly to download communication tasks. Recently, however, radically new alternative EMS computer con- figurations were made possible [1, 4, 5]. The basic para- digm of the newest architecture is open distributed Microelectronics Reliability 38 (1998) 727–743 0026-2714/98/$19.00 # 1998 Elsevier Science Ltd. All rights reserved. PII: S0026-2714(98)00027-4 PERGAMON

Upload: duke

Post on 21-Nov-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Availability modeling of energy management systems

Ricardo M. Fricks a, Kishor S. Trivedi b

aThe Meteorological System of ParanaÂ, Parana State Power Company, Curitiba, PR 80420-170, BrazilbDepartment of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, U.S.A.

Received 20 December 1996; in revised form 16 March 1998

Abstract

Energy management system (EMS) computer architectures have changed signi®cantly over the recent past

increasing the di�culty and the need for a priori assessment of system performance and dependability. The oldpractice based on measurements is no longer acceptable because of the ¯exibility accrued with the deployment of thenew distributed computer-based systems. The number of ``what if'' questions increased since EMS systems are now

implemented using multiple workstations that can be interconnected in various di�erent ways.In this paper we show how alternative con®gurations can be modeled and analyzed, before proposing and

purchasing any equipment, with the assistance of Markov reward models. We review the concept of Markov reward

models and show how they can be applied in the availability analysis of SCADA/EMS computer systems. Thepaper also presents a software tool that facilitates automatic generation and solution of large Markov rewardmodels. The input language of this modeling tool uses a variation of stochastic Petri nets called stochastic reward

nets, which are also reviewed. We believe this is the ®rst time a detailed quantitative model of a SCADA/EMScomputer system is proposed and solved in the general literature. # 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Availability; SCADA/EMS; Sensitivity

1. Introduction

Energy management systems (EMS) are real-time con-

trol systems designed to automate operation of the

generation/transmission system of utility companies [1].

The main objective of an EMS is to assist electric uti-

lity companies in meeting their goals of supplying their

consumers power demand at all times and under all

conditions, while ful®lling established quality-of-service

requirements and reducing operational costs.

Physically, an EMS is composed of at least one main

computer installation (normaly referred as master

station) and hundreds of remote terminal units. The

master station implements a supervisory control and

data acquisition (SCADA), the real-time component of

an EMS.

Master station computer con®gurations rely on

fault-tolerance techniques to provide high-availability

for the SCADA/EMS applications. Typically, it is

required that a SCADA/EMS master station should

possess a 99.8 or 99.9% probability of performing cor-

rectly at the instant the EMS is requested to do

so [2, 3]. Under the assumption of statistical indepen-

dence between hardware and software failure modes, a

99.9% ®gure for system availability accounts for a

minimum availability of 99.995% for the computer

hardware if we consider that the master station soft-

ware will be operational at least for 99.9% of the

speci®ed system lifetime.

A few years ago, the number of alternative SCADA/

EMS computer con®gurations was very limited.

Mainframes or super-minicomputers operating in

standby sparing redundancy schemes used to be themain approaches for providing the high-availability

required to host SCADA/EMS applications [2, 3].

More complex designs would add front-end processors

mainly to download communication tasks. Recently,

however, radically new alternative EMS computer con-

®gurations were made possible [1, 4, 5]. The basic para-

digm of the newest architecture is open distributed

Microelectronics Reliability 38 (1998) 727±743

0026-2714/98/$19.00 # 1998 Elsevier Science Ltd. All rights reserved.

PII: S0026-2714(98 )00027-4

PERGAMON

systems, which implies the distribution of SCADA/

EMS functions among heterogeneous ``o�-the-shelf''

computers interconnected via redundant local area net-works (LANs). In the new systems, the responsibility

of providing the requested high-availability environ-

ment is distributed along with the SCADA/EMS func-

tions. A side e�ect of this evolutionary wave is theincreased di�cult in quantitatively judging among

alternative designs of master station computer con-

®gurations.

The purpose of this paper is to present one valuable

analytical modeling technique and associated tool that

can provide considerable help in the analysis ofdependability (availability/reliability), performance and

performability measures during the design/procurement

of new EMS computer systems. The technique is basedon the Markov reward models (MRMs) [6] and its ap-

plication is exempli®ed through availability modeling

of the computer hardware of an EMS/SCADA system

master station. The same technique may also beapplied to model Distribution Management Systems

and Power Plant SCADA computer con®gurations.

The associated tool is the stochastic Petri net package

(SPNP) [7] which permits high-level speci®cation andanalytic±numeric solution of very large Markov

reward models. SPNP models can be textually input

using a superset of the C programming language or

drawn using iSPN, the new graphical user interface forSPNP introduced in Ref. [8].

The remainder of this paper is organized as follows.In the next section we describe a hypothetical

SCADA/EMS distributed computer system architec-

ture. In Section 3 we review the concepts of Markov

reward modeling and show how availability measurescan be determined using MRMs. A Markov model of

the system availability when considering dedicated

repair facilities to groups of system components isdescribed in Section 4. Section 5 explores the modi®-

cations necessary to incorporate shared repair in theMarkov chain developed in previous section. Next, anstochastic reward net model isomorphic to the

extended Markov chain is constructed. The new modelis then numerically solved using SPNP. The analysisproceeds in Section 6 by considering the impact of

failed components at system startup. In Section 7, weillustrate how upgrading functions and parametric sen-sitivity analysis can be used to locate critical com-ponents with respect to availability optimization of the

SCADA/EMS system. Section 8 then concludes thepaper suggesting design problems where the presentedtechniques could be very useful and presenting some

research problems related to the material covered inthis paper.

2. SCADA/EMS computer system architecture

The most common design technique used to achievethe high availability requirement of master stations is

through the addition of resources beyond what isneeded for normal SCADA/EMS operation.Redundant hardware resources allow the masterstation to correctly perform its speci®ed tasks even in

the presence of faults (hardware failures and/or soft-ware errors) in their host computer systems. Active ordynamic redundancy is employed so that fault toler-

ance is achieved by detecting the existence of faultsand then performing automatic recon®guration toremove or isolate the faulty component from the sys-

tem.Fig. 1 presents a hypothetical master station archi-

tecture based on the works of Dy-Liacco [1]; Gaushell

Fig. 1. SCADA/EMS computer system architecture.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743728

and Darlington [2]; Horton and Gross [3]; Chainey

and Block [4]; Ockwell and Killian [9]; Podmore [10];

and Robinson [11]. The ®gure illustrates an open dis-

tributed system with the SCADA/EMS functions dis-

tributed among computer groups interconnected by a

redundant LAN. All subsystems shown in Fig. 1 are

considered critical for real-time operation and should

be available for the SCADA/EMS systems to be con-

sidered available. Development and operator training

subsystems are not represented in Fig. 1 because they

are not considered essential for real-time power system

operation.

In the hypothetical system we can identify ®ve dis-

tinct computer groups implementing the major func-

tions of a modern design:

1. Data acquisition and control: implements time-criti-

cal applications of the system, including data acqui-

sition, alarm processing, special RTU processing/

control, supervisory control and automatic gener-

ation control.

2. Storage: maintains the system real-time and histori-

cal data-bases, storing and distributing information

and reports for system operations and planning.

3. Applications: executes power system network analy-

sis and energy scheduling programs, such as state

estimator, load ¯ow, contingency analysis, load

forecasting, unit commitment, etc.

4. User interface: implements operator stations for

power system dispatchers, selectively retrieving real-

time data and other processed information, combin-

ing them and presenting to the operator through

user-friendly computer interfaces.

5. Communications: executes gateway functions, inter-

facing the SCADA/EMS system to coorporate net-

works and external computer systems (neighboring

utilities, pools, coordinating groups and regulatory

agencies).

The data acquisition and control (DAC) subsystem has

three computers and requires at least two of them to

be functional for the correct operation of the subsys-tem. The computers of the DAC subsystem interface

directly to the RTUs through independent connections

(not shown in Fig. 1). Like the DAC, the Applicationssubsystem operates in an 2-out-of-3 redundancy

scheme. The storage subsystem operates in a closed-

cluster con®guration (see Fig. 2) composed by two ®leservers, one star-coupler, two dual-disk controllers and

two disk units. The cluster con®guration is one of the

major techniques to provide database backup in thehigh-availability computer systems. In this scheme,

redundant ®le servers operate with parallel access dual

controllers that read/write data in two disk units simul-taneously.

From a dependability point of view, a computercluster is essentially a series±parallel con®guration con-

sisting of a parallel network of two or more processors

in series with a parallel network of disk controllers anda parallel network of disks [12]. The star coupler is

assumed to be 100% available since it is a passive con-

nector. The cluster is operational as long as at leastone ®le server, one disk controller, one disk unit and

the star coupler are available.

The user interface subsystem consists of three oper-

ator stations, each one corresponding to a group ofthree X-terminals connected to one workstation. The

whole user interface can be operated with only one dis-

play and workstation of any of the four operatorstations. The two computers of the Communications

subsystem operates in a parallel redundancy con®gur-

ation. The subsystem will fail only if both computer

components fail. The redundant LAN and all the inter-face cards necessary to hook the system components

into the LAN are supposed to be fault-proof, i.e. they

are 100% available.

Other modeling assumptions considered are: (i) allcomponent units are operational at the beginning of

Fig. 2. Physical con®guration of a closed-cluster con®guration.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 729

our analysis (i.e. at time 0); (ii) component group unitsare statistically identical; (iii) the time interval between

successive occurrences of a failure or an end-of-repairevent of component units are exponentially distributed;(iv) upon the occurrence of faults in any component

unit of any of the subsystems, the system is able todetect and successfully execute the necessary recon®-guration procedure (e.g. switch out the faulty com-

ponent and switch in any spare component to replacethe faulty one); (v) all failed components can berepaired, and the repair activity will restore failed com-

ponents to their best conditions; and (vi) all systemcomponents function independently of other com-ponents.

3. Markov reward modeling

SCADA/EMS computer con®gurations are designedso that brief interruptions in system operation can betolerated, but not signi®cant aggregate outage during a

certain time period. For such systems, the majordependability measure of interest is the steady-statesystem availability, ASS, or readiness for usage. ASS

represents the long-term probabibility that the system

is operating correctly and available to perform its func-tions.The steady-state availability de®nition lends itself

well to experimental evaluation and may be determinedin terms of measured system mean time to failure(MTTF) and system mean time to repair (MTTR) as

Ass � MTTF

MTTF�MTTR�MTTF

MTBF�1�

Eq. (1) is always valid if we implicitly assume that

after each repair the system is restored to its originalstate (for a proof of this statement see Ref. [13]).Under this circumstance, the limiting availability does

not depend on the nature of the lifetime and repair dis-tributions and the sum of the MTTF and MTTRde®nes what is called the mean time between failures

(MTBF) or expected time between failures of the sys-tem.Unfortunately, the experimental evaluation of the

availability is often not possible because of the time

and expense involved. The primary di�culty is thatmeasuring MTTF and MTTR requires a working sys-tem, which in the case of a SCADA/EMS computer

system implies a substantial amount of time since theprocurement of a new system may take severalyears [5]. Besides, using this technique we neglect any

eventual redundancies present in the system architec-ture. In addition, we would like to have some meansof estimating system availability before we actually

build or purchase the system so that availability con-

siderations can be factored into the design/bid process.For these reasons, we suggest the MRM modeling

approach.An MRM is a state-space based analytic model in

which a constant reward rate ri is associated witheach state i of a Markov chain. If the MRM spends

ti time units in state i, then a reward riti is accumu-lated. Similar to the underlying Markov chains,

MRMs form a special case of discrete-state stochas-tic processes in which the current state completely

captures the past history pertaining to the systemevolution.

Suppose a ®nite state, (time-)homogeneous MRMhas an underlying continuous-time Markov chain

(CTMC) with state space O. Let Q be the in®nitesimalgenerator matrix of this CTMC. The o� diagonal

entries of Q are given by qij, (i$ j, iUj $ O), represent-ing the transition rate from state i to state j. The main

diagonal elements of Q are given by qii=ÿSi$ j qij,iUj $ O. The time-dependent state probability vector

P(t) of the CTMC (and associated MRM) obeys theKolmogorov di�erential equation:

dP�t�dt� P�t�Q; �2�

given the initial state probability vector P(0). Thesteady-state probability vector p, assuming that it

exists and is unique, is given by:

pQ � 0; �3�subject to the condition Si $ Opi=1. Here pi is thesteady-state probability of the CTMC (or MRM)

being in state i, i.e. pi�d limt 4 1 Pi(t), 8i $ O. For a®nite-state, irreducible CTMC, as is the case for most

CTMCs underlying availability models, these limitsalways exist and are independent of the initial state

probabilities P(0) [14].

Availability analysis of a system can be carriedout using MRMs by simply assigning reward rate 1

to all functional states of the system and rewardrate 0 to all states in which the system is considered

failed. Given this reward assignment, the steady-stateavailability ASS is the expected reward rate in

steady-state:

Ass �Xi2O

ripi: �4�

With the same reward assignment, the instantaneousand interval availability can also be computed. The

instantaneous availability A(t) is the expected instan-taneous reward rate at time t:

A�t� �Xi2O

riPi�t�: �5�

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743730

4. Availability modeling considering dedicated repair

facilities to component groups

The ®rst step of our analysis is to split the hardware

units of the system into component groups. Units in a

component group are of the same type and execute the

same functions in the SCADA/EMS system. Initially,

we assume the allocation of independent repair facili-

ties to the various component groups, i.e. there is one

repair facility permanently available to each group

(e.g. real-time computers, ®le servers, etc.) that is

shared among its component units. A birth±death

process [14] with ®nite state space can then be con-

structed to capture the failure/repair behavior of each

component group (see Fig. 3). In this CTMC, we rep-

resent group i with ni component units and label states

after the number of failed units.

The transitions leaving to the right of any state in

the state transition diagram of Fig. 3 represent the rate

of occurrence of failures and the ones entering states

from the right, the rate of repair of failed component

units. The sum of component unit failure rates is

necessary because the time to the ®rst failure of a

group of n components with independent exponentially

distributed lifetimes with parameter li is itself exponen-tially distributed with parameter Si = 1

n li [14]; an un-

necessary procedure for the repair rates since we have

only one repair facility associated to each group. The

repair facility is assumed to be work conservative, i.e.

the facility will not stop while there are failed units

waiting for repair. All units of the group have the

same priority to undergo repairs and the queue waiting

for repair is served in a ®rst-come ®rst-served (FCFS)

order.

Steady-state availability evaluation, usually the main

dependability measure in a SCADA/ EMS system,

needs the steady-state probabilities pk, 0RkRn, of the

CTMC as determined by Eq. (3). For the CTMC of

Fig. 3, these probabilities can be obtained directly

using the equations of a regular machine repairman

model [14]:

p0 � 1Pnk�0

lm

� �kn!

�nÿ k�!; �6�

and

pk � p0lm

� �k n!

�nÿ k�! ; 1RkRn: �7�

Closed-form expressions for the steady-state avail-

ability of each component group can then be worked

out by combining the appropriate reward rate assign-

ment with the results of Eqs. (6) and (7) according to

Eq. (4). We proceed by ®nding expressions for each

component group, then for the subsystems, and ®nally

for the complete SCADA/EMS computer con®gur-

ation. A similar approach, but based on the time-

dependent state probabilities P(t) computed using

Eq. (2) and Eq. (5), solves for transient availability

measures related to the system.

All component groups in our system can be seen as

particular cases of a general m-out-of-n redundancy

scheme. Series systems (like the failure-behavior of a

display generator in one operator station) can be inter-

preted as an n-out-of-n scheme. Likewise, parallel sys-

tems (like the failure-behavior of the disk units in the

storage susbsystem) can be interpreted as an 1-out-of-n

scheme. In the generic m-out-of-n scheme, reward rate

1 is assigned to states 0, 1, . . . , niÿmi in Fig. 3 and

reward rate 0 is assigned to the remaining states (i.e.

states niÿmi+1, . . . , ni). Consequently, the steady-state

availability of a generic component group ASS(g)(mi, ni,

li, mi) is given by

A�g�ss �mi; ni; li; mi� �1Pni

l�0limi

� �l ni!�ni ÿ l�!

Xniÿmi

k�0

limi

� �kni!

�ni ÿ k�! ;

�8�with 1RmiRni. Due to the assumed statistical indepen-

dence in failure and end-of-repair events among com-

ponent groups (i.e. component units have s-

independent lifetime distributions and each component

group has its own dedicated repair facility), we can de-

rive the steady-state availability equations of the sub-

systems as

A�1�SS �A�g�SS �2; 3;lrt;mrt�;A�2�SS �A�g�SS �1; 2;lfs;mfs� � A

�g�SS �1; 2;lhsc;mhsc�

� A�g�SS�1; 2; ldk; mdk�;A�3�SS �A�g�SS �2; 3;lhw;mhw�;

A�4�SS �1ÿ �1ÿ A�g�SS �1; 1;llw;mlw� � A�g�SS �1; 3; lxt;mxt��s;A�5�SS �A�g�SS �1; 2;llw;mlw�;

where the superscripts 1±5 refer to the subsystems

DAC, storage, applications, user interface, and com-

munications in this order. The failure and repair rates

are subscripted according to the system component

types, and s is the number of operator stations in the

user interface subsystem. Finally, the steady-state

availability of the SCADA/EMS hardware shown in

Fig. 1 is given by

ASS �Y5i�1

A�i�SS �9�

Table 1 lists the MTTFs and MTTRs assumed for

each component unit type. Under the assumption of

exponential lifetime and time to repair distributions,

the failure rate li and repair rate mi of individual units

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 731

are given by the reciprocals of the MTTFs andMTTRs of each corresponding unit type.

Solving Eq. (9) using the parametric data listed inTable 1 we obtain the steady-state availability

measures listed in Table 2. In this table, notes I, II, III

and IV refer to di�erent numbers of operator stations

in the user interface subsystem. I refers to a single dis-patcher position, while IV refers to four operator

stations. II and III correspond to the intermediate situ-

ations. From the computed results, we can observe

that when we progress from case I to II, i.e. we add asecond operator station in the system, the availability

of the user interface subsystem increases 1.368%. The

addition of a third position however only increases

subsystem availability by 0.0002%. Further analysis

reveals that when we progress from the three operatorstations to the four positions the hardware availability

gain is in®nitesimal (2.6 � 10ÿ7%). We also observe in

Table 2 that under the assumptions and parameterslisted in Table 1, the bottleneck of the hypotheticalsystem in terms of availability is the storage subsys-tem.

With the results presented in Table 2 and Eq. (9) wecomputed the steady-state availability of the SCADA/EMS system hardware shown in Fig. 1. The results are

presented in Table 3 as a function of the number ofoperator stations in the system. The interpretation ofthe results is quite straightforward. We can verify that

a hardware con®guration with at least two operatorstations would be able to guarantee 99.9% of systemavailability if the software is operational for at least anequal percentage of the system lifetime. Another poss-

ible type of analysis based on Table 3 is estimating theaverage hardware downtime in a year. For example, inthe three-station case, the steady-state availability of

SCADA/EMS hardware is 0.99996471365301. Thismeans that the system hardware is not operational0.003528634699% of the time or an average of

18.5465 min in the course of a year.All results presented so far are computable in

closed-form. However, when dependencies should be

incorporated in the system model, such as the sharingof the repair facility among component groups, weneed to resort to a numerical solution, as we considernext. Similarly, if transient measures such as instan-

taneous availability are of interest, numerical solutionwill, in general, be required.

5. Availability modeling considering a shared repair

facility among all system components

Repair facility is not usually dedicated to individualcomponent groups in a SCADA/EMS system but,instead, is shared among all component units in the

Fig. 3. Component group i availability model considering individual repair.

Table 1

Failure/repair data for the hypothetical SCADA/EMS system

Subsystem Component type MTTF MTTR

DAC

Real-time

computer 8760 3

Storage File server 8760 12

Disk controller 8760 6

Disk unit 8760 24

Application

High-end

workstation 4380 6

User interface

Low-end

workstation 4380 6

X-terminal 8760 3

Communications

Low-end

workstation 4380 6

Table 2

Steady-state availability of subsystems considering dedicated

repair facilities to each component group

Subsystem Availability Note

DAC 0.999999296785 Ð

Storage 0.999980390150 Ð

Applications 0.999988771699 ±

User interface 0.998632010704 I

0.999998128605 II

0.999999997440 III

0.999999999997 IV

Communicatios 0.999996257219 Ð

Table 3

Steady-state availability of the SCADA/

EMS system considering dedicated repair

facilities to each component group

Operator stations Availability

1 0.99859677518477

2 0.99996284488395

3 0.99996471365301

4 0.99996471620992

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743732

system. The shared repair facility can be either allo-cated to the component units on an FCFS basis or

according to a di�erent scheduling policy. Consider thecase where the SCADA/EMS repair facility is allocated

among disputing components according to a preemp-

tive resume priority (PRP) repair discipline [15] toexemplify the modeling procedure. Fixed priorities are

assigned to the component groups and units are served

in order of priority. Units of the same group areserved according to an FCFS model as mentioned ear-

lier. Failed components of higher priority preempt the

repair of lower priority units. Repair of preemptedunits resumes immediately after completion of higher

priority repairs.

Although combinatorial modeling techniques cannotexactly capture this behavior [15], state space modeling

can still provide exact solution for the problem. The

accuracy in representing the PRP policy comes fromthe memoryless property of the exponential distri-

butions: after preemption, the residual time to com-plete repair of the preempted unit preserves its original

time-to-repair probability distribution.

The necessary extensions to the state transition dia-

gram in Fig. 3 to include the shared repair are shownin the Markov chain in Fig. 4. This CTMC models the

failure/repair behavior of two types of components

(group i with ni units and group j with nj units) sharinga single repair facility according to a PRP discipline.

States in the CTMC of Fig. 4 are labeled by two inte-gers representing the number of failed components of

group i and j, respectively. An underlying modelingassumption made is that components of type i have

higher repair priority than components of type j.

Having constructed the CTMC and having de®ned

the reward rates, Eqs. (4) and (5) can again be analyti-cally solved and theirs results combined according to

the measure sought. In spite of the symmetry of the

resulting Markov chain, closed-form solution for theSCADA/EMS availability model would be at least

cumbersome if not impossible. An easier approach isto numerically solve for the desired measures.

Besides being di�cult to solve in closed-form, the

Markov chain of realistic problems is also hard to

describe due to the largeness of their state spaces. Acharacteristic that is easily inferred from Fig. 4, where

to capture repair dependence, the overall model state

space is the cross-product of state spaces of theMarkov chains describing two component groups.

Likewise, the complete SCADA/EMS availabilitymodel will have a state space dimension that is the

cross-product of the state spaces of eight Markov

chains corresponding to the individual unit types listedin Table 1. Depending on the number of operator

stations included in the model, its state space size can

grow easily to the hundreds of thousands, as illustratedin Table 4. That is a commonality of practical pro-

Fig. 4. CTMC corresponding to the SRN template.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 733

blems since the state-space of such models can grow

exponentially with the number of components beingmodeled [16]. The direct construction of such Markovmodels is a formidable undertaking. Hence, we need toresort to more abstract description techniques to de®ne

such large state spaces. Queueing systems [17] providea powerful solution for performance models, but fordependability and performability models a better

option seems to be based on stochastic Petri nets.

5.1. Stochastic reward nets

Stochastic reward nets (SRNs) are extensions of gen-eralized stochastic Petri nets (GSPNs), which are in

turn extensions of Petri nets. A Petri net (PN) [18] is aversatile modeling tool that, among other applications,allows the analysis of discrete-event systems with con-

current or parallel events. The pictorial representationof a PN is a bipartite directed graph with places(drawn as circles) and transitions (drawn as bars) that

model the static properties of the system. Directed arcsin the graph connect places to transitions (called inputarcs) and transitions to places (called output arcs). Amultiplicity factor may be associated with these arcs to

simplify system description.Dynamic system properties are modeled by the ex-

ecution of the PN, which is controlled by the position

and movement of markers . These markers or tokens(drawn as solid dots) can only reside inside places andmove around the net by the ®ring of transitions.1

Firing rules govern the movement of tokens andenabling conditions for transitions (see Ref. [18] forthe basic ®ring rules). The ®ring rules de®ne enablingconditions for transitions. Enabled transitions may

®re, removing one or more tokens from their inputplaces (places linked to input arcs of the transition)and adding tokens to output places (places linked to

output arcs of the transition). The ®ring of a transitionis an atomic action possibly resulting in a new markingof the PN, enabling or disabling other transitions.

Each distinct marking obtained by the execution of aPN corresponds to an individual state of the PN.

After the original conception, several extensions ofPNs were proposed to increase the fundamental model-ing and/or decision power of ordinary Petri nets.2

Stochastic Petri nets (SPNs) resulted from the incor-poration of probabilistic and temporal extensions thatmade possible the mapping of PNs into stochastic pro-

cesses. Through the solution of the underlying stochas-tic process new quantitative measures could becomputed for the models described by PNs. GSPN is a

class of SPNs [19] that allows two types of transitions:immediate and timed. Once enabled, immediate tran-sitions (drawn as thin bars) ®re immediately. Timedtransitions (drawn as hollow rectangles) are transitions

with ®ring times governed by exponential distributions.These distributions govern the (random) time elapsedfrom the enabling of the transition to its e�ective ®r-

ing.We describe in this paper the modeling of complex

SCADA/EMS systems using stochastic reward nets.

The SRN concept [7] extends the GSPN proposal byassociating reward rates with PN markings to specifyoutput measures. The reward rate de®nitions are speci-

®ed at the net level as a function of net primitives likethe number of tokens in a place or the rate of a tran-sition and permit the automatic generation of MRMs.For each marking of the net, the reward rate function

is evaluated and the result is assigned as the rewardrate for that marking, transforming the underlyingMarkov model into an MRM. The de®nition of

reward rates is orthogonal to the analysis type that isused. Thus with the same reward de®nition we cancompute the steady-state availability, as well as instan-

taneous availability, and interval availability.

5.2. SRN model of the system

The failure/repair behavior of individual componentgroups is captured by the SRN template shown in

Fig. 5. Among units in a component group repair isshared on an FCFS basis, therefore there is no need todistinguish failed units in the template. Individual

EMS subsystems are then represented by combinationsof the basic template shown. For instance, the storagesubsystem is composed of three component groups: ®leservers, dual-disk controllers, and disk units.

Therefore, we need three templates like the one shownin Fig. 5 to model the subsystem. The union of thesubsystem models correspond to the SCADA/EMS

availability model. Additionally, a priority scheme issupperposed to the collective submodels to enforce theexistence of a single repair facility and respective sche-

duling policy. Without this priority enforcement, thesame collection of SRN submodels describes the inde-pendent repair case described in Section 4.

Table 4

State space size of the SCADA/EMS

Markov model

Operator stations CTMC states

1 10,368

2 82,944

3 663,552

4 5,308,416

1 A PN with tokens becomes a marked Petri net.2 Structural extensions will be introduced when necessary

for the illustration of the SRN models developed.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743734

In Fig. 5, tokens in place down represent failed units

of the component group waiting for or undergoing

repair. Timed transitions failure and repair represent,respectively, the time to the next failure of one of the

active units, and the time to complete repair of one ofthe disabled units. Following our assumptions, these

transitions ®re according to exponential distributions

and thus have time-independent failure and repairrates, respectively. If we call the time-independent fail-

ure rate li and the time-independent repair rate mi as-sociated with each component unit in group i, then thefailure rates associated with each of the timed tran-

sitions are given by the table also shown in Fig. 5. The

rate associated with the transition failure represents thefailure rate of the component group at any given

instant and is given by the product of the number of

active units with the failure rate li of each individualunit. The instantaneous number of active units is given

by the total number of units in the group minus the

number of tokens in place down, since each token inplace down represents a failed unit. The marking-

dependent failure rate of the group provides the

necessary parameter to describe the distribution of thenext time to failure among concurrent transitions

enabled with time to failure of each being exponen-tially distributed. The repair rate of units in a group as

shown in the table in Fig. 5 is time-independent and

equal to the repair rate mi of an individual componentunit due to the assumption of shared repair among

units of a particular group.

Apart from the marking dependence used to de®ne

the failure rate of transition failure, Fig. 5 showsanother important structural extension to PNs. The

multiple inhibitor arc connecting place down to tran-

sition failure prevents transition failure to ®re whenthere are ni tokens in place down. This inhibitor arc is

used to limit the number of possible failed units to an

upper bound established by the total number of unitsin component group i.

Despite sharing the same SRN template, component

groups of the SCADA/EMS system are distinguished

by the scheduling priority to use the repair resource,the type and number of units available, and the reward

rate assignment. The PRP policy is incorporated in the

model with the association of guards to the timed tran-sition repair associated with all component groups in

the system. Preemptive scheduling is naturally mapped

to the guard concept since once a unit with repair pri-

ority p is requested, the guards of all transitions repair

with priority lower than p immediately evaluate to

false, disabling the transitions. After repair completion,

the guards will evaluate true again allowing the restart

of repair of other units of priority lower than p.

In the hypothetical system, priorities were assigned

based on three guidelines: (i) subsystems are repaired

according to the availability results listed in Table 2:

the component group of the subsystem with the lowest

availability receives the highest priority and so on; (ii)

likewise, the unit type with the smallest MTTR in a

subsystem receives the highest repair priority in a sub-

system with more than one component group; and (iii)

when there are multiple operator consoles, they are

classi®ed and distinguished for repair: repairs of units

(display genarator and X-terminals) of the ®rst console

have higher priority than the others, and so on.

The number of units ni in the template in Fig. 5 is

determined based on the performance requirements of

software and fault-tolerant scheme adopted by each

subsystem. The redundancy strategy also de®nes the

reward assignment to the respective SRNs, which we

now detail. We start by considering the DAC subsys-

tem. Only real-time computers constitute this subsys-

tem, hence we only need one SRN template to

construct the DAC model. For this subsystem, ni=3 in

the template of Fig. 5 and the rates li and mi are com-

puted using the real-time computer parameters given

in Table 1. The redundancy strategy adopted in the

SCADA subsystem is 2-out-of-3, hence we associate

reward rate 1 (subsystem operational) to SRN mark-

ings with less than 2 tokens in place down, meaning

that the DAC subsystem is operational. Otherwise we

associate reward rate 0, meaning that the subsystem

has failed and is no longer available.

Likewise, the applications subsystem is modeled

using a single SRN template to capture the failure-

repair behavior of its high-end workstations. In the ap-

plications subsystem model, ni=3 and reward rates

should also represent the 2-out-of-3 redundancy

scheme: we give reward rate 1 as long as there is at

most one token in place down, otherwise we give

reward rate 0.

Fig. 5. Failure/repair SRN model of component group i.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 735

The storage subsystem is a bit more complex, sinceit has three types of components that are susceptible to

faults. For each component group (®le servers, dual-disk controllers, and disk units) we use one SRN likethe one shown in Fig. 5. Each one will have n = 2

since there are two units of each component type andfailure and repair rates computed using Table 1. Weassume that each component group has its own repair

facility, therefore the only connection among the threeSRNs of the storage subsystem is given by the rewardassignment. The closed-cluster redundancy scheme is

represented by given a reward rate 1 whenever there isat most one token in place down corresponding to eachcomponent group (®le servers, controllers and diskunits). Otherwise, we give reward rate 0 to the subsys-

tem.The user interface subsystem is composed of con-

soles accessed by the power system dispatchers in oper-

ator stations. A SCADA/EMS system usually hasthree or four operator stations that operate in parallel.These stations are composed of one display generator

(usually a low-end workstation) coupled with three X-terminals in our hypothetical subsystem. Thus, twocomponent groups are required to model each operator

station in the system. The ®rst SRN corresponds tothe display generator and should have ni=1 and ratescomputed using the low-end workstation parametersgiven in Table 1. The second SRN represents the X-

terminals and should have ni=3 and correspondingrates. Each operator station implements a series-paral-lel redundancy scheme where the station is considered

functional as long as the display generator and at leastone of the terminals are operational. Therefore, wegive reward rate 1 (meaning that the operator station

is functional) whenever there is no token in place downof the display generator SRN and at most two tokensin place down of the X-terminals SRN. Otherwise, wegive reward rate 0 meaning the operator position is

not available. The user interface subsystem is con-sidered available if at least one of its operator stationsis functional.

The communications subsystem has only one com-ponent type (i.e. low-end workstation) with two unitsthus is modeled with a single SRN template with

ni=2. The subsystem implements a simple parallelarrangement. Consequently, we assign reward rate 1whenever there are less than two tokens in place down

and reward rate 0 otherwise.

5.3. Numerical solution

The numerical solution of an SRN model begins bythe construction of a directed graph (called reachability

graph) showing all possible execution sequences of thePN starting from the initial marking. The nodes of thegraph are all reachable markings of the PN, and the

linking arcs between markings are labeled with the

transition that caused the state change. The reachabil-ity graph is then mapped into an isomorphic CTMC.With the sharing of repair facilities among component

groups, the template shown in Fig. 5 is mapped into aMarkov model of the type illustrated in Fig. 4.

Symbolic solution of Eq. (3) or (2) is possible onlyfor Markov chains with a small number of states or ifthe Markov chain has a very regular structure such as

the one shown in Fig. 3. Numerical solution tech-niques, on the other hand, can solve Markov models

with general structure with a large number of statessince they take advantage of the sparsity of the in®ni-tesimal generator matrix Q to reduce the space and

time required for solution and are numerically [6]. Inthis paper we use the stochastic Petri net package [7]

to numerically solve the developed SRN models.Markov models having hundreds of thousands ofstates can be readily speci®ed and solved using SPNP.

The analysis of an SRN starts by SPNP automati-cally mapping the reachability graph of the SRN into

the underlying Markov reward model. The constructedMRM is then solved to compute a variety of transient,steady-state, cumulative, and sensitivity measures. For

transient analysis, SPNP employs a highly accurateand e�cient version of uniformization (also called

randomization) [6]. Uniformization is a very useful nu-merical method to solve the system of Kolmogorovequations [Eq. (2)] and is able to handle even sti� pro-

blems. For steady-state solution, SPNP uses iterativemethods like Gauss±Seidel or near-optimal successive

overrelaxation (SOR) to solve the system of linearEq. (3) [7]. Both methods preserve the sparsity of theQ matrix and thus can be used to solve very large

models.Table 5 compares the steady-state availability of the

SCADA/EMS computer con®guration, as well as itssubsystems, for the con®guration with two operatorconsoles. Under this circumstance, the MRM iso-

morphic to the SRN model with shared repairs has82,944 states and 778,176 transitions. Observe that the

impact of sharing repair is very small when considering

Table 5

Steady-state availability of the SCADA/EMS system and re-

spective subsystems

Steady-state availability

Subsystem Independent repair Shared repair

DAC 0.999999296785 0.999999276988

Storage 0.999980390150 0.999980132430

Applications 0.999988771699 0.999988441728

User interface 0.999998128605 0.999996247920

Communications 0.999996257219 0.999996177583

SCADA/EMS

system 0.999962844884 0.999960278393

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743736

the failure/repair rates listed in Table 1. Therefore,further modeling could be executed carried out using

the independence repair assumption. However, withoutformulating and solving the second model it would beextremely di�cult for the modeler to make such a de-

cision. Besides, the same SRN model that solves forthe shared repair can be used for the independencerepair assumption, the only necessary modi®cation isto remove the guards from the transition repair of all

subsystem models.

6. Instantaneous availability

A common assumption in dependability modeling

(that we have also adopted so far) is that all com-ponent units are available at system startup, a con-dition not always veri®able in actual computer systems

since computer components may fail instantaneouslywhen systems are being energized or may even be con-nected to the system while faulty. However, throughsome modi®cations in our baseline SRN model and

the concept of instantaneous availability we can studythe consequences of this problem.The condition of failed component units at system

startup can be introduced with the assistance of ran-dom switches [19]. These constructs are stochasticextensions to the original PNs and allow, among other

things, the de®nition of initial state probabilities in anSRN. Using random switches makes possible the com-putation of system measures when there is a prob-

ability mass function de®ning the initial state of theCTMC in Fig. 3.

A random switch when modeling using SPNP is

de®ned on a set of immediate transitions, such that

when a subset of them is enabled, a probability of ®r-

ing (normalized so that they sum to 1) is de®ned for

each enabled immediate transition. The particular tran-

sition to ®re is chosen according to this discrete prob-

ability distribution.

The procedure is illustrated by the model of a com-

ponent group with three units shown in Fig. 6. This

new design presents a slight modi®cation3 of the SRN

template shown in Fig. 5 extended with a random

switch. The switch is implemented with the help of the

extra place cold units and transitions up selector and

down selector. Now instead of all component units

starting in the active state (i.e. all tokens in place up)

when the subsystem is energized we allow for some or

all units to be in the failed state, with probability p

that when ®rst energized a unit will work (transition

up selection will ®re) and with probability 1ÿ p the

unit will fail (transition down selection will ®re). Since

immediate transitions have priority to ®re over timed

transitions, this random switch will randomly de®ne

the distribution of tokens between places up and down

at time 0.

The lifetime of the units is de®ned by a mixed ran-

dom variable X with a mass at the origin. The distri-

bution function FX(t) of this random variable can be

written as

Fx�t� � p� �1ÿ p��1ÿ eÿlt�;where p as before is the probability that a unit will fail

immediately when the system is put into operation,

and l is the failure rate of the unit.

Fig. 7 displays the instantaneous availability of the

DAC subsystem when p is 0.1%. As a reference we

also plot the instantaneous availability computed using

the original subsystem (i.e. assuming p= 0%). Some

interesting aspects we can observe in the plot in Fig. 7

are:

Fig. 6. Component units can be failed at time 0.

3 The addition of place up to collect tokens representing

active units makes the concept of random switches more vis-

ible than when explained with the assistance of our basic

model. The ®ring rate associated with transition failure is now

given by ](up) � l.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 737

. There is a some availability growth during the ®rst

few hours of operation of the modi®ed subsystem.. The impact of the any failed unit is quickly dissi-pated because of the work conservative behavior we

assumed for the repair facilities. In practical termsthis e�ect would be present only during the factoryacceptance tests of the system.

. The limiting availability for both cases is the sameas expected, since in steady-state all e�ect of the in-itial probability vector has vanished.

7. Sensitivity analysis of the SCADA/EMS computer

system architecture

In order to maximize improvements on system

dependability we need criteria to identify availabilitybottlenecks in the system. One such criterion availableis based on quantitative measures provided by the

upgrading functions of importance theory (introducedin a seminal paper by Birnbaum in 1969 [20]). If cost,size or weight are not objectives when maximizing sys-

tem dependability,4 the importance ranks suggest com-

ponents to which system upgrading e�ort should be

directed ®rst. Otherwise, the importance measures o�ervalid weighting factors to the optimization process.

The upgrading function [21] is de®ned as the re-duction in system reliability when the failure rate lk of

component ck is reduced fractionally, i.e.

IUFk �t� �d lk � @R�t; lk�

@lk;

where @R(t; lk)/@lk is the parametric sensitivityfunction [22] of the measure R(t) with respect to devi-

ations on the parameter lk. A better expression for theupgrading function, more commonly used for yieldingnumbers closer to the unit, is the fractional reduction

IUFk �t� �

lkR�t� �

@R�t;lk�@lk

; �10�

which is the de®nition of IUFk (t) we consider in this

work. Naturally, the relative ranking of components is

not a�ected when the latter expression is used insteadof the former one.

An upgrading function analysis using MRMs beginsby determining the sensitivity functions of outputmeasures to a particular input parameter lk (e.g. the

failure rate of a real-time computer). Then, rates qijwhich are functions of lk are identi®ed in the in®nitesi-

mal generator matrix Q(lk) of the MRM.5 Afteridenti®cation, we construct the (absolute) sensitivityfunction @Q(lk)/@lk and compute the derivative of the

state probabilities (time-dependent and steady-state)with respect to lk.

Fig. 7. Instantaneous availability of DAC subsystem.

4 Assumption considered in this section.5 The in®nitesimal generator matrix with rates that are func-

tion of an input parameter lk is denoted by Q(lk) to stress its

functional dependence.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743738

If we let S(t; lk) be the row vector of the parametric

sensitivities @Pi(t; lk)/@lk, 8i $ O, then from Eq. (2) we

obtain

@S�t;lk�@t

� S�t; lk�Q�lk� � P�t;lk� @Q�lk�@lk

�11�

Assuming that the initial conditions do not depend on

lk, i.e.

S�0;lk� � @P�0;lk�@lk

� 0;

we can set the following system of di�erential

equations by combining Eqs. (2) and (11):

@P�t;lk�@t

@S�t; lk�@t

� ���P�t; lk� S�;lk��

� Q�lk� dQ�lk�dlk

0 Q�lk�

24 35; �12�

subject to the initial conditions

�P�0;lk�S�0;lk�� � �P00�:The system of Eq. (12) can then be solved using, for

instance, the methods discussed in Ref. [23].

The sensitivity of the reliability R(t) (or instan-

taneous availability A(t)) can be similarly derived:

@R�t;lk�@lk

�Xi2O

@ri�lk�@lk

�Pi�t;lk� �Xi2O

ri�lk�Si�t;lk�:

�13�Observe that in Eq. (13) even the reward rates attribu-

ted may be a functions of lk. However, this is hardly

ever the case for reliability measures since reward rate

functions are usually restricted to binary values

depending only on the operational status of the sys-

tem. Therefore, we can simplify Eq. (13):

@R�t;lk�@lk

�Xi2O

riSi�t;lk�: �14�

Substituting Eq. (14) in Eq. (10), we can write the

upgrading function of a component ck (due to vari-

ations in lk) as

IUFk �t� �

lkR�t�

Xi2O

riSi�t;lk�:

Since, for power systems control centers, steady-stateavailability is the fundamental measure of dependabil-

ity, we need to establish appropriate importancemeasure equations based on availability measures. We

then replace R(t) by A(t) in all equations and talk

about availability (or unavailability) importanceinstead of reliability (or unreliability) importance. The

formula to be used in the SCADA/EMS analysiscames from a slight modi®cation of Eq. (10):

IUFk �t� �

sk

A�t� �@A�t;sk�@sk

; �15�

where the reliability R(t) is replaced by the instan-taneous availability A(t), and lk was replaced by the

generic parameter sk, which may be either a com-ponent failure or repair rate. Observe that although

di�erent from Eq. (10), Eq. (15) is also mapped into

Eq. (13) when expressed in terms of reward rate func-tions. The steady-state counterpart of Eq. (15) is then

given by:

IUFk �1� �

sk

ASS� @ASS�sk�

@sk� lim

t41sk

A�t�Xi2O

riSi�t;sk�:

�16�Table 6 lists the parametric sensitivities computed for

the steady-state availability measure (i.e. @ASS(s)/@s).These results were computed considering the dedicated

repair con®guration of the SCADA/EMS system witha single operator station. We used the SPNP package

to compute the numerical results listed in Table 6. InSPNP, the rates and probabilities of the transitions

and their derivatives are speci®ed as functions of anindependent parameter s. SPNP automatically con-

structs the in®nitesimal generator matrix Q(s) as wellas its derivative @Q(s)/@s for the underlying MRM.

Before proceeding with the solution of Eq. (16) it is

interesting to analyze the signs of the results listed inTable 6. To interpret the sign we should always keep

in mind that sensitivity results are de®ned as deriva-tives, therefore a positive sign of @ASS(s)/@s indicates

that variations in the independent parameter s producevariations in ASS in the same direction. On the other

hand, negative sign indicates variations in oppositedirections. The sign behavior presented in Table 6 is

consistent with intuition: increases in failure ratesdecrease system availability (components fail more fre-

quently), while increases in repair rates improves avail-ability (component repairs are faster).

Using the results listed in Tables 1, 3 and 6 in

Eq. (16) we compute the importance measures pre-sented in Table 7. Note that the absolute values of the

Table 6

Parametric sensitivity of the SCADA/EMS steady-state avail-

ability

Sensitivity of ASS to variations in

Component type Failure rate (lk) Repair rate (mk)

Real-time computer ÿ0.012298826980 0.000004211927

File server ÿ0.065392055050 0.000089578158

Disk controller ÿ0.016381605561 0.000011220278

Disk unit ÿ0.260497683127 0.000713692283

Workstation ÿ6.114168343194 0.008375573073

X-terminal ÿ0.000006315724 0.000000002163

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 739

importance measures are identical for each component.

An interesting phenomena that happens in this case

due to the particular structure of the closed-form sol-

ution of the steady-state availability of a generic com-

ponent group in the system.

If we de®ne the auxiliary variable rk:

rk �dlkmk;

then the steady-state availability of a generic m-out-of-

n component group ASS(g)(mk, nk, lk, mk) given in Eq. (8)

can be rewritten:

A�g�SS �mk; nk;rk� �

1Pnklÿ0 r

lk

nk!�nk ÿ l�!

Xnkÿmk

i�0rik

nk!

�nk ÿ i�! :

�17�Accordingly, the SCADA/EMS system steady-state

availability is obtained by multiplying the availabilities

ASS(g) corresponding to all component groups in the sys-

tem [see Eq. (9)]. Therefore, the derivative of ASS with

respect to a given parameter sk (any component failure

or repair rate) can be expressed as

@ASS

@sk� @ASS

@rk� @rk@sk

; �18�

with

@rk@sk�

1mk

if sk � lk;

ÿ lkm2k

if sk � mk;

8<:Thus, from Eqs. (18) and (16) we have

@ASS�lk�@lk

@ASS�mk�@mk

� @ASS

@rk �1mk

@ASS@rk

� lkm2k

!� ÿ mk

lk;

and

IUFlk�1�

IUFmk�1� �

lkA SSmkASS

� ÿ mklk

� �� ÿ1: �19�

An important aspect to note is how misleading the sen-

sitivity values are when considered by themselves. An

analysis based only on data from Table 6 would indi-

cate that any attempt to vary the system availability

by modifying the mean time to repair of components

is highly ine�ective. A false statement according to

Table 7, which shows that in this particular system,

changes in the repair rate of a component are as e�ec-

tive as changes in the component failure rate. Table 8

lists the steady-state availability of the SCADA/EMS

system when either a component failure rate is reduced

(i.e. Dsk=ÿDlk) or the component repair rate is

increased (i.e. Dsk=Dmk). As can be seen in the table,

the results of both variations for any given component

are pratically identical (the identity is better as Dk40)

as expected due to Eq. (19) and the semantics of the

upgrade function measures (discussed in the previous

section).

The results in Table 7 identify the workstations as

the single component type that with additional devel-

opment (either reducing failure rate or improving

repair mechanisms) can provide the best improvement

of the overall system availability. In a general setting,

with proper scaling considering associated costs, the

upgrading function measures of Table 7 can be used to

guide system availability optimization along the pro-

cedures described in Ref. [24] or [23].

Another possible application of the upgrade

measures is the forecasting of system measures for

small parametric variations of individual components.

To illustrate this application, we constructed Table 9

forecasting the system steady-state availability when

component parameters are individually changed. Dskin Table 9 may either represent an increment on the

component repair rate or a reduction on the com-

ponent failure rate. The forecasted steady-state avail-

ability measures ASS' were computed using a formula

directly derived from Eq. (16):

DASS

ASS� IUF

sk�1� � Dsk

sk� o�Dsk� )

A 0SS � IUFsk�1� � Dsk

sk� 1

� �� o�Dsk�

�ASS � Dsk � @ASS�sk�@sk

� o�Dsk�;

where limDsk40 o(Dsk)/Dsk=0 and @ASS(sk)@sk is the

parametric sensitivity function of the measure ASS with

respect to deviations on the parameter sk. The quality

of the forecasted measures can be veri®ed by compar-

ing the results in Tables 8 and 9.

Table 7

Steady-state upgrade function measures of components in the

SCADA/EMS system

IkUF(1) considering variations in

Component type Failure rate (lk) Repair rate (mk)

Real-time computer ÿ1.405949 � 10ÿ6 1.405949 � 10ÿ6

File server ÿ7.475336 � 10ÿ6 7.475336 � 10ÿ6

Disk controller ÿ1.872674 � 10ÿ6 1.872674 � 10ÿ6

Disk unit ÿ2.977897 � 10ÿ5 2.977897 � 10ÿ5

Workstation ÿ1.397890 � 10ÿ3 1.397890 � 10ÿ3

X-terminal ÿ7.219862 � 10ÿ10 7.219862 � 10ÿ10

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743740

8. Conclusion

The new con®gurations bring ¯exibility and extra-life to SCADA/EMS implementations, eliminating the

need and frustation of total replacement of systems,with recurrent expenses involved. Concomitantly, the

new computer con®gurations bring increased di�cultyin quantitative analyses of alternative architectures. In

this paper we presented a technique based on the con-cept of Markov reward models that may be of con-

siderable help during the evaluation process thatprecedes any deployment of master station units.

The MRM concept was presented in the context ofavailability analysis but can easily solve problems in

other domains (e.g. performance and performability),and a comparison was established with the common

practice of availability of EMS/SCADA computer con-®gurations. We used both closed-form and numerical

methods for computing steady-state availability of thehardware of a hypothetical EMS/SCADA master

station. Failed units at the system startup and para-metric sensitivity analysis were also taken into account.

Parametric sensitivities increase the utility of Markovmodels at little additional cost. The results of para-

metric sensitivity analysis provide guidance for both

re®ning the model and improving the system being

modeled. Apart from providing means for the quanti-

tative comparison of various computer architectures,

many other applications of system availability models

like the ones developed in this paper can be

identi®ed [16]: (i) identi®cation of availability bottle-

necks that may require subsequent improvements in a

particular design; (ii) de®nition of design improvements

through sensitivity analysis of availability measures

with respect to input parameters (e.g. failure and repair

rates of components); (iii) identi®cation of critical

model input parameters to be estimated; and (iv) the

system abstraction modeled can be adapted to address

obtainable ®eld data (failure and repair rates), level of

detail, as well as resulting model size.

We also presented SPNP, a powerful stochastic

modeling package that allows the modeling of complex

system behavior. Its input language allows the descrip-

tion of large MRMs using simpler stochastic reward

nets that map better to the conceptual system abstrac-

tion being modeled. With the mathematical engine of

SPNP the solution of the various models used in this

paper took only few minutes each, including the ones

that requested parametric sensitivity analysis.

Table 8

E�ect of parametric changes in the steady-state availability of the SCADA/EMS system

ASS considering variations Dsk/sk of

Component type sk 1% 5% 10%

Real-time computer lrt 0.998596805 0.998596859 0.998596924

mrt 0.998596805 0.998596856 0.998596913

File server lfs 0.998596865 0.998597155 0.998597500

mfs 0.998596864 0.998597138 0.998597439

Disk controller lhsc 0.998596809 0.998596882 0.998596968

mhsc 0.998596809 0.998596878 0.998596953

Disk unit ldsk 0.998597087 0.998598241 0.998599617

mdsk 0.998597084 0.998598173 0.998599372

Workstation lwk 0.998610748 0.998666553 0.998736250

mwk 0.998610610 0.998663232 0.998723582

X-terminal lxt 0.998596791 0.998596791 0.998596791

mxt 0.998596791 0.998596791 0.998596791

Table 9

Forecasting system steady-state availability ASS' as a function of parametric changes

using the upgrade function measures

Dsk/skComponent type 1% 5% 10%

Real-time computer 0.998596805 0.998596861 0.998596931

File server 0.998596865 0.998597164 0.998597537

Disk controller 0.998596809 0.998596884 0.998596978

Disk unit 0.998597088 0.998598278 0.998599764

Workstation 0.998610750 0.998666587 0.998736384

X-terminal 0.998596791 0.998596791 0.998596791

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 741

Heimann et al. [25] state that, to fully analyze a can-didate computer system, there are four types of

dependability analyses: evaluation, speci®cation deter-mination, sensitivity analysis, and tradeo� analysis.Evaluation and speci®cation determination are necess-

ary during the analysis of candidate architectures of anSCADA/EMS system. Sensitivity analysis takes placeafter the system has been evaluated and allows the

assessment of the impact of changes in the con®gur-ation (before implementing them) of the chosenSCADA/EMS system during its whole lifetime.

Tradeo� analysis investigates compromises cost-avail-ability and is important along the complete lifecycle ofthe system. With the modeling technique and tooldescribed all areas mentioned can be explored.

The availability modeling of the example SCADA/EMS computer con®guration discussed in this papershows that the MRM approach has powerful modeling

capability and ¯exibility. An extension of the workpresented, currently under investigation, consists ofrelaxing the modeling assumptions stated in Subsection

3.2. Another direction of our work is the constructionof performability models appropriate for the real-timesystem.

Acknowledgements

This work was supported in part by the NationalScience Foundation under Grant EEC-9418765 and by

the Department of Defense as an Enhancement Projectto the Center for Advanced Computing andCommunication.

References

[1] Dy-Liacco TE. Modern control centers and computer

networking. IEEE Computer Applications in Power

1994;7(4):17±22.

[2] Gaushell DJ, Darlington HT. Supervisory control and

data acquisition. Proceedings of the IEEE

1987;75(12):1645±58.

[3] Horton JS, Gross DP. Computer con®gurations.

Proceedings of the IEEE 1987;75(12):1659±69.

[4] Chainey WE, Block WR. Recent advances in master

station architecture. IEEE Computer Applications in

Power 1994;7(2):24±9.

[5] Sasson AM. Open systems procurement: a migration

strategy. IEEE Transactions on Power Systems

1993;8(2):515±26.

[6] Muppala JK, Malhotra M, Trivedi KS. Markov depend-

ability models of complex systems: Analysis techniques.

In: S OÈ zekici, editors. Reliability and Maintenance of

Complex Systems. Springer, Berlin, 1996:442±486.

[7] Ciardo G, Blakemore A, Chimento PF, Muppala JK,

Trivedi KS. Automatic generation and analysis of mar-

kov reward models using stochastic reward nets. In:

Linear Algebra, Markov Chains, and Queueing Models,

IMA Volumes in Mathematics and its Applications.

Springer, Heidelberg, 1992; vol. 48.

[8] Hirel C, Wells S, Fricks R, Trivedi KS. iSPN: an inte-

grated environment for modeling using stochastic Petri

nets. In: Tools Demonstration Section at the Joint

Conference of the 7th International Workshop on Petri

Nets and Performance ModelsÐPNPM'97 and the 9th

International Conference on Modelling Techniques and

Tools for Computer Performance EvaluationÐ

PERFORMANCE'97. Saint±Malo, France, 1997:17±19.

[9] Ockwell G, Killian G. Con®guration management in an

open architecture system. IEEE Transactions on Power

Systems 1993;8(2):441±4.

[10] Podmore R. Criteria for evaluating open energy manage-

ment systems. IEEE Transactions on Power Systems

1993;8(2):466±71.

[11] Robinson JT. Inter-system communications/networking.

Proceedings of the IEEE 1987;75(12):1670±7.

[12] Ibe OC, Howe RC, Trivedi KS. Approximate availability

analysis of vaxcluster systems. IEEE Transactions on

Reliability 1989;38(1):146±52.

[13] Sahner R, Trivedi KS, Pulia®to A. Performance and

Reliability Analysis of Computer Systems: An Example-

Based Approach Using the SHARPE Software Package.

Kluwer Academic, Dordrecht, 1995.

[14] Trivedi KS. Probability & Statistics with Reliability,

Queueing, and Computer Science Applications. Prentice-

Hall, Englewood Cli�s, NJ, 1982.

[15] Malhotra M, Trivedi KS. Power-hierarchy of dependabil-

ity-model types. IEEE Transactions on Reliability

1994;43(3):493±502.

[16] Goyal A, Lavenberg SS. Modeling and analysis of com-

puter system availability. IBM Journal of Research and

Development 1987;31(6):651±64.

[17] Kleinrock L. Queueing Systems. vol. 2: Computer

Applications. Wiley, New York, 1976.

[18] Peterson JL. Petri nets. Computing Surveys

1977;9(3):223±52.

[19] Marsan MA, Conte G, Balbo G. A class of generalized

stochastic Petri net for the performance evaluation of

multiprocessor systems. ACM Transaction on Computer

Systems 1984;2(2):93±122.

[20] Birnbaum ZW. On the importance of di�erent com-

ponents in a multicomponent system. In: Multivariate

AnalysisÐII. Academic Press, New York, 1969:581±592.

[21] Henley EJ, Kumamoto H. Reliability Engineering and

Risk Assessment. Prentice-Hall, Englewood Cli�s, NJ,

1981.

[22] Frank PM. Introduction to System Sensitivity Theory.

Academic Press, New York, 1978.

[23] Blake JT, Reibman AL, Trivedi K. Sensitivity analysis of

reliability and performability measures for multiprocessor

systems. In: Proceedings of 1988 ACM SIGMETRICS

Conf. Measurement and Modeling of Computer Systems.

Santa Fe, NM, May 24-27, 1988:177±186.

[24] Lambert HE. Measures of importance of events and cut

sets in fault trees. In: RE Barlow, JB Fussell, ND

Singpurwalla, editors. Reliability and Fault Tree

Analysis. Philadelphia, PA, 1975:77±100.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743742

[25] Heimann DI, Mittal N, Trivedi KS. Availability and re-

liability modeling for computer systems. In: MC Yovits,

editors. Advances in Computers. Academic Press, San

Diego, CA, 1990;vol. 31:175±233.

R.M. Fricks, K.S. Trivedi /Microelectronics Reliability 38 (1998) 727±743 743