Availability modeling of energy management systems
Ricardo M. Fricks^a, Kishor S. Trivedi^b
^a The Meteorological System of Paraná, Parana State Power Company, Curitiba, PR 80420-170, Brazil
^b Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, U.S.A.
Received 20 December 1996; in revised form 16 March 1998
Abstract
Energy management system (EMS) computer architectures have changed significantly over the recent past, increasing the difficulty and the need for a priori assessment of system performance and dependability. The old practice based on measurements is no longer acceptable because of the flexibility accrued with the deployment of the new distributed computer-based systems. The number of "what if" questions increased since EMS systems are now implemented using multiple workstations that can be interconnected in various different ways. In this paper we show how alternative configurations can be modeled and analyzed, before proposing and purchasing any equipment, with the assistance of Markov reward models. We review the concept of Markov reward models and show how they can be applied in the availability analysis of SCADA/EMS computer systems. The paper also presents a software tool that facilitates automatic generation and solution of large Markov reward models. The input language of this modeling tool uses a variation of stochastic Petri nets called stochastic reward nets, which are also reviewed. We believe this is the first time a detailed quantitative model of a SCADA/EMS computer system is proposed and solved in the general literature. © 1998 Elsevier Science Ltd. All rights reserved.
Keywords: Availability; SCADA/EMS; Sensitivity
1. Introduction
Energy management systems (EMS) are real-time control systems designed to automate operation of the generation/transmission system of utility companies [1]. The main objective of an EMS is to assist electric utility companies in meeting their goals of supplying their consumers' power demand at all times and under all conditions, while fulfilling established quality-of-service requirements and reducing operational costs.
Physically, an EMS is composed of at least one main computer installation (normally referred to as the master station) and hundreds of remote terminal units (RTUs). The master station implements supervisory control and data acquisition (SCADA), the real-time component of an EMS.
Master station computer configurations rely on fault-tolerance techniques to provide high availability for the SCADA/EMS applications. Typically, it is required that a SCADA/EMS master station should possess a 99.8 or 99.9% probability of performing correctly at the instant the EMS is requested to do so [2, 3]. Under the assumption of statistical independence between hardware and software failure modes, a 99.9% figure for system availability accounts for a minimum availability of 99.995% for the computer hardware if we consider that the master station software will be operational at least for 99.9% of the specified system lifetime.
A few years ago, the number of alternative SCADA/EMS computer configurations was very limited. Mainframes or super-minicomputers operating in standby sparing redundancy schemes used to be the main approaches for providing the high availability required to host SCADA/EMS applications [2, 3]. More complex designs would add front-end processors, mainly to offload communication tasks. Recently, however, radically new alternative EMS computer configurations were made possible [1, 4, 5].

Microelectronics Reliability 38 (1998) 727–743
0026-2714/98/$19.00 © 1998 Elsevier Science Ltd. All rights reserved.
PII: S0026-2714(98)00027-4

The basic paradigm of the newest architecture is open distributed
systems, which implies the distribution of SCADA/EMS functions among heterogeneous "off-the-shelf" computers interconnected via redundant local area networks (LANs). In the new systems, the responsibility of providing the requested high-availability environment is distributed along with the SCADA/EMS functions. A side effect of this evolutionary wave is the increased difficulty in quantitatively judging among alternative designs of master station computer configurations.
The purpose of this paper is to present one valuable analytical modeling technique and associated tool that can provide considerable help in the analysis of dependability (availability/reliability), performance and performability measures during the design/procurement of new EMS computer systems. The technique is based on Markov reward models (MRMs) [6] and its application is exemplified through availability modeling of the computer hardware of an EMS/SCADA system master station. The same technique may also be applied to model distribution management systems and power plant SCADA computer configurations. The associated tool is the stochastic Petri net package (SPNP) [7], which permits high-level specification and analytic–numeric solution of very large Markov reward models. SPNP models can be textually input using a superset of the C programming language or drawn using iSPN, the new graphical user interface for SPNP introduced in Ref. [8].
The remainder of this paper is organized as follows. In the next section we describe a hypothetical SCADA/EMS distributed computer system architecture. In Section 3 we review the concepts of Markov reward modeling and show how availability measures can be determined using MRMs. A Markov model of the system availability when considering dedicated repair facilities for groups of system components is described in Section 4. Section 5 explores the modifications necessary to incorporate shared repair in the Markov chain developed in the previous section. Next, a stochastic reward net model isomorphic to the extended Markov chain is constructed. The new model is then numerically solved using SPNP. The analysis proceeds in Section 6 by considering the impact of failed components at system startup. In Section 7, we illustrate how upgrading functions and parametric sensitivity analysis can be used to locate critical components with respect to availability optimization of the SCADA/EMS system. Section 8 then concludes the paper, suggesting design problems where the presented techniques could be very useful and presenting some research problems related to the material covered in this paper.
2. SCADA/EMS computer system architecture
The most common design technique used to achieve the high availability requirement of master stations is through the addition of resources beyond what is needed for normal SCADA/EMS operation. Redundant hardware resources allow the master station to correctly perform its specified tasks even in the presence of faults (hardware failures and/or software errors) in their host computer systems. Active or dynamic redundancy is employed, so that fault tolerance is achieved by detecting the existence of faults and then performing automatic reconfiguration to remove or isolate the faulty component from the system.

Fig. 1 presents a hypothetical master station architecture based on the works of Dy-Liacco [1]; Gaushell
Fig. 1. SCADA/EMS computer system architecture.
and Darlington [2]; Horton and Gross [3]; Chainey and Block [4]; Ockwell and Killian [9]; Podmore [10]; and Robinson [11]. The figure illustrates an open distributed system with the SCADA/EMS functions distributed among computer groups interconnected by a redundant LAN. All subsystems shown in Fig. 1 are considered critical for real-time operation and should be available for the SCADA/EMS system to be considered available. Development and operator training subsystems are not represented in Fig. 1 because they are not considered essential for real-time power system operation.
In the hypothetical system we can identify five distinct computer groups implementing the major functions of a modern design:
1. Data acquisition and control: implements time-critical applications of the system, including data acquisition, alarm processing, special RTU processing/control, supervisory control and automatic generation control.
2. Storage: maintains the system real-time and historical databases, storing and distributing information and reports for system operations and planning.
3. Applications: executes power system network analysis and energy scheduling programs, such as state estimator, load flow, contingency analysis, load forecasting, unit commitment, etc.
4. User interface: implements operator stations for power system dispatchers, selectively retrieving real-time data and other processed information, combining them and presenting them to the operator through user-friendly computer interfaces.
5. Communications: executes gateway functions, interfacing the SCADA/EMS system to corporate networks and external computer systems (neighboring utilities, pools, coordinating groups and regulatory agencies).
The data acquisition and control (DAC) subsystem has three computers and requires at least two of them to be functional for the correct operation of the subsystem. The computers of the DAC subsystem interface directly to the RTUs through independent connections (not shown in Fig. 1). Like the DAC, the Applications subsystem operates in a 2-out-of-3 redundancy scheme. The storage subsystem operates in a closed-cluster configuration (see Fig. 2) composed of two file servers, one star coupler, two dual-disk controllers and two disk units. The cluster configuration is one of the major techniques to provide database backup in high-availability computer systems. In this scheme, redundant file servers operate with parallel-access dual controllers that read/write data in two disk units simultaneously.
From a dependability point of view, a computer cluster is essentially a series–parallel configuration consisting of a parallel network of two or more processors in series with a parallel network of disk controllers and a parallel network of disks [12]. The star coupler is assumed to be 100% available since it is a passive connector. The cluster is operational as long as at least one file server, one disk controller, one disk unit and the star coupler are available.
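The series–parallel view of the cluster can be sketched directly as a structure function. The snippet below is our illustration, not code from the paper, and the per-unit availabilities passed in at the end are arbitrary example values:

```python
# Sketch: availability of the series-parallel cluster structure
# described above (star coupler assumed 100% available).

def parallel(unit_avail: float, n: int) -> float:
    """Availability of n redundant units (1-out-of-n): the stage
    fails only if every unit is down."""
    return 1.0 - (1.0 - unit_avail) ** n

def cluster_availability(a_fs: float, a_dc: float, a_dk: float) -> float:
    """Series combination of three parallel stages: two file servers,
    two dual-disk controllers, two disk units."""
    return parallel(a_fs, 2) * parallel(a_dc, 2) * parallel(a_dk, 2)

# Hypothetical per-unit availabilities, for illustration only.
a_cluster = cluster_availability(0.999, 0.9995, 0.998)
```

This combinatorial view only holds under the independence assumptions listed below; Section 5 shows why it breaks down once repair is shared.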
The user interface subsystem consists of three operator stations, each one corresponding to a group of three X-terminals connected to one workstation. The whole user interface can be operated with only one display and workstation of any of the operator stations. The two computers of the Communications subsystem operate in a parallel redundancy configuration. The subsystem will fail only if both computer components fail. The redundant LAN and all the interface cards necessary to hook the system components into the LAN are supposed to be fault-proof, i.e. they are 100% available.
Other modeling assumptions considered are: (i) all component units are operational at the beginning of
Fig. 2. Physical configuration of a closed-cluster configuration.
our analysis (i.e. at time 0); (ii) component group units are statistically identical; (iii) the time intervals between successive occurrences of a failure or an end-of-repair event of component units are exponentially distributed; (iv) upon the occurrence of faults in any component unit of any of the subsystems, the system is able to detect and successfully execute the necessary reconfiguration procedure (e.g. switch out the faulty component and switch in any spare component to replace the faulty one); (v) all failed components can be repaired, and the repair activity will restore failed components to their best condition; and (vi) all system components function independently of other components.
3. Markov reward modeling
SCADA/EMS computer configurations are designed so that brief interruptions in system operation can be tolerated, but not significant aggregate outage during a certain time period. For such systems, the major dependability measure of interest is the steady-state system availability, A_SS, or readiness for usage. A_SS represents the long-term probability that the system is operating correctly and available to perform its functions.

The steady-state availability definition lends itself well to experimental evaluation and may be determined in terms of measured system mean time to failure (MTTF) and system mean time to repair (MTTR) as
$$A_{SS} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} \qquad (1)$$
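A quick numerical illustration of Eq. (1); the MTTF/MTTR values below are arbitrary examples, not measurements from the paper:

```python
# Illustration of Eq. (1): steady-state availability from measured
# MTTF and MTTR.
def steady_state_availability(mttf: float, mttr: float) -> float:
    """A_SS = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

# A unit that fails on average once a year (8760 h) and takes 3 h to
# repair is available about 99.966% of the time.
a = steady_state_availability(8760.0, 3.0)
```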
Eq. (1) is always valid if we implicitly assume that after each repair the system is restored to its original state (for a proof of this statement see Ref. [13]). Under this circumstance, the limiting availability does not depend on the nature of the lifetime and repair distributions, and the sum of the MTTF and MTTR defines what is called the mean time between failures (MTBF) or expected time between failures of the system.

Unfortunately, the experimental evaluation of the availability is often not possible because of the time and expense involved. The primary difficulty is that measuring MTTF and MTTR requires a working system, which in the case of a SCADA/EMS computer system implies a substantial amount of time since the procurement of a new system may take several years [5]. Besides, using this technique we neglect any eventual redundancies present in the system architecture. In addition, we would like to have some means of estimating system availability before we actually build or purchase the system so that availability considerations can be factored into the design/bid process. For these reasons, we suggest the MRM modeling approach.

An MRM is a state-space based analytic model in
which a constant reward rate r_i is associated with each state i of a Markov chain. If the MRM spends t_i time units in state i, then a reward r_i t_i is accumulated. Similar to the underlying Markov chains, MRMs form a special case of discrete-state stochastic processes in which the current state completely captures the past history pertaining to the system evolution.
Suppose a finite-state, (time-)homogeneous MRM has an underlying continuous-time Markov chain (CTMC) with state space Ω. Let Q be the infinitesimal generator matrix of this CTMC. The off-diagonal entries of Q are given by q_{ij} (i ≠ j; i, j ∈ Ω), representing the transition rate from state i to state j. The main diagonal elements of Q are given by q_{ii} = −Σ_{j≠i} q_{ij}, i, j ∈ Ω. The time-dependent state probability vector P(t) of the CTMC (and associated MRM) obeys the Kolmogorov differential equation:

$$\frac{dP(t)}{dt} = P(t)\,Q, \qquad (2)$$
given the initial state probability vector P(0). The steady-state probability vector π, assuming that it exists and is unique, is given by:

$$\pi Q = 0, \qquad (3)$$

subject to the condition Σ_{i∈Ω} π_i = 1. Here π_i is the steady-state probability of the CTMC (or MRM) being in state i, i.e. π_i = lim_{t→∞} P_i(t), ∀i ∈ Ω. For a finite-state, irreducible CTMC, as is the case for most CTMCs underlying availability models, these limits always exist and are independent of the initial state probabilities P(0) [14].
Availability analysis of a system can be carried out using MRMs by simply assigning reward rate 1 to all functional states of the system and reward rate 0 to all states in which the system is considered failed. Given this reward assignment, the steady-state availability A_SS is the expected reward rate in steady state:

$$A_{SS} = \sum_{i \in \Omega} r_i \pi_i. \qquad (4)$$

With the same reward assignment, the instantaneous and interval availability can also be computed. The instantaneous availability A(t) is the expected instantaneous reward rate at time t:

$$A(t) = \sum_{i \in \Omega} r_i P_i(t). \qquad (5)$$
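The numerical recipe behind Eqs. (3) and (4) can be sketched in a few lines of pure Python. The two-state up/down chain used as a check here is an illustrative example, not one of the paper's models:

```python
# Sketch: solve pi*Q = 0 with sum(pi) = 1 by Gaussian elimination,
# then take the expected reward rate of Eq. (4).

def steady_state(Q):
    """Steady-state vector of a small CTMC generator Q (Eq. (3))."""
    n = len(Q)
    # Solve Q^T pi^T = 0; replace the last equation with sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi

def expected_reward(pi, rewards):
    """Eq. (4): steady-state expected reward rate."""
    return sum(r * p for r, p in zip(rewards, pi))

# Check: two-state up/down chain with failure rate lam, repair rate mu;
# the known answer is pi_up = mu / (lam + mu).
lam, mu = 1.0 / 1000.0, 1.0 / 8.0
Q = [[-lam, lam],
     [mu, -mu]]
pi = steady_state(Q)
A_ss = expected_reward(pi, [1.0, 0.0])  # reward 1 in the up state
```

For the state spaces of Section 5, a tool such as SPNP with sparse numerical solvers is needed; this direct elimination is only workable for small chains.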
4. Availability modeling considering dedicated repair
facilities to component groups
The first step of our analysis is to split the hardware units of the system into component groups. Units in a component group are of the same type and execute the same functions in the SCADA/EMS system. Initially, we assume the allocation of independent repair facilities to the various component groups, i.e. there is one repair facility permanently available to each group (e.g. real-time computers, file servers, etc.) that is shared among its component units. A birth–death process [14] with finite state space can then be constructed to capture the failure/repair behavior of each component group (see Fig. 3). In this CTMC, we represent group i with n_i component units and label states after the number of failed units.
The transitions leaving to the right of any state in the state transition diagram of Fig. 3 represent the rate of occurrence of failures, and the ones entering states from the right, the rate of repair of failed component units. The sum of component unit failure rates is necessary because the time to the first failure of a group of n components with independent exponentially distributed lifetimes with parameter λ_i is itself exponentially distributed with parameter Σ_{i=1}^{n} λ_i [14]; an unnecessary procedure for the repair rates since we have only one repair facility associated with each group. The repair facility is assumed to be work conserving, i.e. the facility will not stop while there are failed units waiting for repair. All units of the group have the same priority to undergo repairs and the queue waiting for repair is served in first-come first-served (FCFS) order.
Steady-state availability evaluation, usually the main dependability measure in a SCADA/EMS system, needs the steady-state probabilities π_k, 0 ≤ k ≤ n, of the CTMC as determined by Eq. (3). For the CTMC of Fig. 3, these probabilities can be obtained directly using the equations of a regular machine repairman model [14]:

$$\pi_0 = \frac{1}{\displaystyle\sum_{k=0}^{n} \left(\frac{\lambda}{\mu}\right)^k \frac{n!}{(n-k)!}}, \qquad (6)$$

and

$$\pi_k = \pi_0 \left(\frac{\lambda}{\mu}\right)^k \frac{n!}{(n-k)!}, \quad 1 \le k \le n. \qquad (7)$$
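Eqs. (6) and (7) can be sketched directly in a few lines (function and variable names are ours, not from the paper; the parameters in the example are the DAC group's MTTF/MTTR values from Table 1):

```python
# Sketch of Eqs. (6)-(7): steady-state probabilities of the machine
# repairman model for one component group (n identical units, per-unit
# failure rate lam, one shared repair facility with rate mu).
from math import factorial

def repairman_probs(n, lam, mu):
    """Return [pi_0, ..., pi_n], where pi_k = P(k units failed)."""
    rho = lam / mu
    # Unnormalized weights (lam/mu)^k * n!/(n-k)! from Eq. (7).
    weights = [rho ** k * factorial(n) / factorial(n - k)
               for k in range(n + 1)]
    total = sum(weights)  # Eq. (6): pi_0 = 1/total
    return [w / total for w in weights]

# Example: the DAC group -- 3 real-time computers, MTTF 8760 h,
# MTTR 3 h (Table 1).
pi = repairman_probs(3, 1.0 / 8760.0, 1.0 / 3.0)
```

With reward rate 1 on states 0 and 1 (at least two computers up), the expected reward 1 − (pi[2] + pi[3]) reproduces the DAC availability reported in Table 2.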
Closed-form expressions for the steady-state availability of each component group can then be worked out by combining the appropriate reward rate assignment with the results of Eqs. (6) and (7) according to Eq. (4). We proceed by finding expressions for each component group, then for the subsystems, and finally for the complete SCADA/EMS computer configuration. A similar approach, but based on the time-dependent state probabilities P(t) computed using Eq. (2) and Eq. (5), solves for transient availability measures related to the system.

All component groups in our system can be seen as particular cases of a general m-out-of-n redundancy scheme. Series systems (like the failure behavior of a display generator in one operator station) can be interpreted as an n-out-of-n scheme. Likewise, parallel systems (like the failure behavior of the disk units in the storage subsystem) can be interpreted as a 1-out-of-n scheme. In the generic m-out-of-n scheme, reward rate 1 is assigned to states 0, 1, . . . , n_i − m_i in Fig. 3 and reward rate 0 is assigned to the remaining states (i.e. states n_i − m_i + 1, . . . , n_i). Consequently, the steady-state availability of a generic component group A_SS^{(g)}(m_i, n_i, λ_i, μ_i) is given by
$$A_{SS}^{(g)}(m_i, n_i, \lambda_i, \mu_i) = \frac{\displaystyle\sum_{k=0}^{n_i - m_i} \left(\frac{\lambda_i}{\mu_i}\right)^k \frac{n_i!}{(n_i - k)!}}{\displaystyle\sum_{l=0}^{n_i} \left(\frac{\lambda_i}{\mu_i}\right)^l \frac{n_i!}{(n_i - l)!}}, \qquad (8)$$

with 1 ≤ m_i ≤ n_i. Due to the assumed statistical independence in failure and end-of-repair events among component groups (i.e. component units have s-independent lifetime distributions and each component group has its own dedicated repair facility), we can derive the steady-state availability equations of the subsystems as
$$A_{SS}^{(1)} = A_{SS}^{(g)}(2, 3, \lambda_{rt}, \mu_{rt}),$$
$$A_{SS}^{(2)} = A_{SS}^{(g)}(1, 2, \lambda_{fs}, \mu_{fs}) \cdot A_{SS}^{(g)}(1, 2, \lambda_{dc}, \mu_{dc}) \cdot A_{SS}^{(g)}(1, 2, \lambda_{dk}, \mu_{dk}),$$
$$A_{SS}^{(3)} = A_{SS}^{(g)}(2, 3, \lambda_{hw}, \mu_{hw}),$$
$$A_{SS}^{(4)} = 1 - \left(1 - A_{SS}^{(g)}(1, 1, \lambda_{lw}, \mu_{lw}) \cdot A_{SS}^{(g)}(1, 3, \lambda_{xt}, \mu_{xt})\right)^s,$$
$$A_{SS}^{(5)} = A_{SS}^{(g)}(1, 2, \lambda_{lw}, \mu_{lw}),$$
where the superscripts 1–5 refer to the subsystems DAC, storage, applications, user interface, and communications, in this order. The failure and repair rates are subscripted according to the system component types, and s is the number of operator stations in the user interface subsystem. Finally, the steady-state availability of the SCADA/EMS hardware shown in Fig. 1 is given by

$$A_{SS} = \prod_{i=1}^{5} A_{SS}^{(i)}. \qquad (9)$$
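Eqs. (8) and (9) translate directly into code. The sketch below is our illustration (function and variable names are ours); the rates come from the MTTF/MTTR values of Table 1, taken in hours:

```python
# Sketch of Eqs. (8)-(9): closed-form steady-state availability of the
# SCADA/EMS hardware under dedicated repair facilities.
from math import factorial

def ass_group(m, n, lam, mu):
    """Eq. (8): availability of an m-out-of-n group, one repairman."""
    rho = lam / mu
    w = [rho ** k * factorial(n) / factorial(n - k) for k in range(n + 1)]
    return sum(w[: n - m + 1]) / sum(w)

def ass_system(s):
    """Eq. (9): product of the five subsystem availabilities,
    s = number of operator stations."""
    a1 = ass_group(2, 3, 1 / 8760, 1 / 3)                  # DAC
    a2 = (ass_group(1, 2, 1 / 8760, 1 / 12)                # file servers
          * ass_group(1, 2, 1 / 8760, 1 / 6)               # disk controllers
          * ass_group(1, 2, 1 / 8760, 1 / 24))             # disk units
    a3 = ass_group(2, 3, 1 / 4380, 1 / 6)                  # applications
    station = (ass_group(1, 1, 1 / 4380, 1 / 6)            # workstation
               * ass_group(1, 3, 1 / 8760, 1 / 3))         # 3 X-terminals
    a4 = 1 - (1 - station) ** s                            # user interface
    a5 = ass_group(1, 2, 1 / 4380, 1 / 6)                  # communications
    return a1 * a2 * a3 * a4 * a5
```

Evaluating `ass_system` for s = 1, …, 4 reproduces the system availabilities of Table 3.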
Table 1 lists the MTTFs and MTTRs assumed for each component unit type. Under the assumption of exponential lifetime and time-to-repair distributions, the failure rate λ_i and repair rate μ_i of individual units
are given by the reciprocals of the MTTFs and MTTRs of each corresponding unit type.
Solving Eq. (9) using the parametric data listed in Table 1, we obtain the steady-state availability measures listed in Table 2. In this table, notes I, II, III and IV refer to different numbers of operator stations in the user interface subsystem. I refers to a single dispatcher position, while IV refers to four operator stations. II and III correspond to the intermediate situations. From the computed results, we can observe that when we progress from case I to II, i.e. we add a second operator station to the system, the availability of the user interface subsystem increases by 0.1366%. The addition of a third position, however, only increases subsystem availability by 0.0002%. Further analysis reveals that when we progress from three operator stations to four positions, the hardware availability gain is infinitesimal (2.6 × 10⁻⁷%). We also observe in Table 2 that, under the assumptions and parameters listed in Table 1, the bottleneck of the hypothetical system in terms of availability is the storage subsystem.
With the results presented in Table 2 and Eq. (9) we computed the steady-state availability of the SCADA/EMS system hardware shown in Fig. 1. The results are presented in Table 3 as a function of the number of operator stations in the system. The interpretation of the results is quite straightforward. We can verify that a hardware configuration with at least two operator stations would be able to guarantee 99.9% system availability if the software is operational for at least an equal percentage of the system lifetime. Another possible type of analysis based on Table 3 is estimating the average hardware downtime in a year. For example, in the three-station case, the steady-state availability of the SCADA/EMS hardware is 0.99996471365301. This means that the system hardware is not operational 0.003528634699% of the time, or an average of 18.5465 min in the course of a year.

All results presented so far are computable in closed form. However, when dependencies should be incorporated in the system model, such as the sharing of the repair facility among component groups, we need to resort to a numerical solution, as we consider next. Similarly, if transient measures such as instantaneous availability are of interest, numerical solution will, in general, be required.
5. Availability modeling considering a shared repair
facility among all system components
The repair facility is not usually dedicated to individual component groups in a SCADA/EMS system but, instead, is shared among all component units in the
Fig. 3. Component group i availability model considering individual repair.
Table 1
Failure/repair data for the hypothetical SCADA/EMS system

Subsystem       Component type        MTTF (h)  MTTR (h)
DAC             Real-time computer    8760      3
Storage         File server           8760      12
                Disk controller       8760      6
                Disk unit             8760      24
Applications    High-end workstation  4380      6
User interface  Low-end workstation   4380      6
                X-terminal            8760      3
Communications  Low-end workstation   4380      6
Table 2
Steady-state availability of subsystems considering dedicated repair facilities to each component group

Subsystem       Availability    Note
DAC             0.999999296785  —
Storage         0.999980390150  —
Applications    0.999988771699  —
User interface  0.998632010704  I
                0.999998128605  II
                0.999999997440  III
                0.999999999997  IV
Communications  0.999996257219  —
Table 3
Steady-state availability of the SCADA/EMS system considering dedicated repair facilities to each component group

Operator stations  Availability
1                  0.99859677518477
2                  0.99996284488395
3                  0.99996471365301
4                  0.99996471620992
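The downtime estimate quoted in the text can be sketched as a one-line conversion; a 365-day year is assumed here, which matches the 18.5465 min figure:

```python
# Convert a steady-state availability from Table 3 into expected
# minutes of hardware downtime per (365-day) year.
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_min_per_year(availability):
    return (1.0 - availability) * MIN_PER_YEAR

# Three-station case from Table 3: about 18.55 minutes per year.
d = downtime_min_per_year(0.99996471365301)
```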
system. The shared repair facility can be either allocated to the component units on an FCFS basis or according to a different scheduling policy. Consider the case where the SCADA/EMS repair facility is allocated among disputing components according to a preemptive resume priority (PRP) repair discipline [15] to exemplify the modeling procedure. Fixed priorities are assigned to the component groups and units are served in order of priority. Units of the same group are served according to an FCFS model, as mentioned earlier. Failed components of higher priority preempt the repair of lower priority units. Repair of preempted units resumes immediately after completion of higher priority repairs.
Although combinatorial modeling techniques cannot exactly capture this behavior [15], state-space modeling can still provide an exact solution for the problem. The accuracy in representing the PRP policy comes from the memoryless property of the exponential distributions: after preemption, the residual time to complete repair of the preempted unit preserves its original time-to-repair probability distribution.
The necessary extensions to the state transition diagram in Fig. 3 to include the shared repair are shown in the Markov chain in Fig. 4. This CTMC models the failure/repair behavior of two types of components (group i with n_i units and group j with n_j units) sharing a single repair facility according to a PRP discipline. States in the CTMC of Fig. 4 are labeled by two integers representing the number of failed components of groups i and j, respectively. An underlying modeling assumption made is that components of type i have higher repair priority than components of type j.
Having constructed the CTMC and having defined the reward rates, Eqs. (4) and (5) can again be analytically solved and their results combined according to the measure sought. In spite of the symmetry of the resulting Markov chain, a closed-form solution for the SCADA/EMS availability model would be at least cumbersome, if not impossible. An easier approach is to numerically solve for the desired measures.
Besides being difficult to solve in closed form, the Markov chain of realistic problems is also hard to describe due to the largeness of its state space, a characteristic that is easily inferred from Fig. 4, where, to capture repair dependence, the overall model state space is the cross-product of the state spaces of the Markov chains describing the two component groups. Likewise, the complete SCADA/EMS availability model will have a state space dimension that is the cross-product of the state spaces of the eight Markov chains corresponding to the individual unit types listed in Table 1. Depending on the number of operator stations included in the model, its state space size can easily grow to the hundreds of thousands, as illustrated in Table 4. That is common in practical problems, since the state space of such models can grow exponentially with the number of components being modeled [16]. The direct construction of such Markov models is a formidable undertaking. Hence, we need to resort to more abstract description techniques to define such large state spaces. Queueing systems [17] provide a powerful solution for performance models, but for dependability and performability models a better option seems to be based on stochastic Petri nets.

Fig. 4. CTMC corresponding to the SRN template.
5.1. Stochastic reward nets
Stochastic reward nets (SRNs) are extensions of generalized stochastic Petri nets (GSPNs), which are in turn extensions of Petri nets. A Petri net (PN) [18] is a versatile modeling tool that, among other applications, allows the analysis of discrete-event systems with concurrent or parallel events. The pictorial representation of a PN is a bipartite directed graph with places (drawn as circles) and transitions (drawn as bars) that model the static properties of the system. Directed arcs in the graph connect places to transitions (called input arcs) and transitions to places (called output arcs). A multiplicity factor may be associated with these arcs to simplify system description.

Dynamic system properties are modeled by the execution of the PN, which is controlled by the position and movement of markers. These markers or tokens (drawn as solid dots) can only reside inside places and move around the net by the firing of transitions.¹ Firing rules govern the movement of tokens and define enabling conditions for transitions (see Ref. [18] for the basic firing rules). Enabled transitions may fire, removing one or more tokens from their input places (places linked to input arcs of the transition) and adding tokens to output places (places linked to output arcs of the transition). The firing of a transition is an atomic action, possibly resulting in a new marking of the PN, enabling or disabling other transitions. Each distinct marking obtained by the execution of a PN corresponds to an individual state of the PN.
After the original conception, several extensions of PNs were proposed to increase the fundamental modeling and/or decision power of ordinary Petri nets.² Stochastic Petri nets (SPNs) resulted from the incorporation of probabilistic and temporal extensions that made possible the mapping of PNs into stochastic processes. Through the solution of the underlying stochastic process, new quantitative measures could be computed for the models described by PNs. GSPN is a class of SPNs [19] that allows two types of transitions: immediate and timed. Once enabled, immediate transitions (drawn as thin bars) fire immediately. Timed transitions (drawn as hollow rectangles) are transitions with firing times governed by exponential distributions. These distributions govern the (random) time elapsed from the enabling of the transition to its effective firing.

We describe in this paper the modeling of complex
SCADA/EMS systems using stochastic reward nets. The SRN concept [7] extends the GSPN proposal by associating reward rates with PN markings to specify output measures. The reward rate definitions are specified at the net level as a function of net primitives, like the number of tokens in a place or the rate of a transition, and permit the automatic generation of MRMs. For each marking of the net, the reward rate function is evaluated and the result is assigned as the reward rate for that marking, transforming the underlying Markov model into an MRM. The definition of reward rates is orthogonal to the analysis type that is used. Thus, with the same reward definition we can compute the steady-state availability, as well as the instantaneous availability and interval availability.
5.2. SRN model of the system
The failure/repair behavior of individual component groups is captured by the SRN template shown in Fig. 5. Among units in a component group, repair is shared on an FCFS basis; therefore there is no need to distinguish failed units in the template. Individual EMS subsystems are then represented by combinations of the basic template shown. For instance, the storage subsystem is composed of three component groups: file servers, dual-disk controllers, and disk units. Therefore, we need three templates like the one shown in Fig. 5 to model the subsystem. The union of the subsystem models corresponds to the SCADA/EMS availability model. Additionally, a priority scheme is superposed on the collective submodels to enforce the existence of a single repair facility and the respective scheduling policy. Without this priority enforcement, the same collection of SRN submodels describes the independent repair case described in Section 4.
Table 4
State space size of the SCADA/EMS Markov model

Operator stations  CTMC states
1                  10,368
2                  82,944
3                  663,552
4                  5,308,416
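The cross-product argument behind Table 4 can be checked directly: each component group with n units contributes a birth–death chain with n + 1 states, and the overall state space size is the product over all groups. The group sizes below are read off the architecture of Section 2; the function is our illustration:

```python
# Sketch: state space size of the full SCADA/EMS Markov model as the
# cross-product of the per-group birth-death chains.
def state_space_size(stations):
    # Units per fixed group: DAC computers, file servers, disk
    # controllers, disk units, application workstations,
    # communications computers.
    fixed = [3, 2, 2, 2, 3, 2]
    size = 1
    for n in fixed:
        size *= n + 1  # a group with n units has n+1 states
    # Each operator station adds one workstation (2 states) and a
    # group of three X-terminals (4 states).
    for _ in range(stations):
        size *= 2 * 4
    return size
```

The values for one to four operator stations match Table 4 exactly, confirming the exponential growth that motivates the SRN description.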
¹ A PN with tokens becomes a marked Petri net.
² Structural extensions will be introduced when necessary for the illustration of the SRN models developed.
R.M. Fricks, K.S. Trivedi / Microelectronics Reliability 38 (1998) 727–743
In Fig. 5, tokens in place down represent failed units of the component group waiting for or undergoing repair. Timed transitions failure and repair represent, respectively, the time to the next failure of one of the active units and the time to complete repair of one of the disabled units. Following our assumptions, these transitions fire according to exponential distributions and thus have time-independent failure and repair rates, respectively. If we call λ_i the time-independent failure rate and μ_i the time-independent repair rate associated with each component unit in group i, then the rates associated with the timed transitions are given by the table also shown in Fig. 5. The rate associated with transition failure represents the failure rate of the component group at any given instant and is given by the product of the number of active units and the failure rate λ_i of each individual unit. The instantaneous number of active units is given by the total number of units in the group minus the number of tokens in place down, since each token in place down represents a failed unit. The marking-dependent failure rate of the group provides the necessary parameter to describe the distribution of the time to the next failure among concurrently enabled transitions, the time to failure of each being exponentially distributed. The repair rate of units in a group, as shown in the table in Fig. 5, is time-independent and equal to the repair rate μ_i of an individual component unit, due to the assumption of shared repair among units of a particular group.
Apart from the marking dependence used to define the failure rate of transition failure, Fig. 5 shows another important structural extension to PNs. The multiple inhibitor arc connecting place down to transition failure prevents transition failure from firing when there are n_i tokens in place down. This inhibitor arc limits the number of possible failed units to an upper bound established by the total number of units in component group i.
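Under these rules, the CTMC underlying one template is a birth-death chain on the number of failed units. A small Python sketch (with illustrative rates, not the paper's Table 1 values) builds the corresponding infinitesimal generator:

```python
# State k = number of tokens in place "down" (failed units), 0 <= k <= n.
# Transition "failure" fires at the marking-dependent rate (n - k) * lam;
# the inhibitor arc disables it at k = n. "repair" completes at rate mu
# (single shared facility). Rates here are illustrative.
def group_generator(n, lam, mu):
    Q = [[0.0] * (n + 1) for _ in range(n + 1)]
    for k in range(n + 1):
        if k < n:
            Q[k][k + 1] = (n - k) * lam   # one more unit fails
        if k > 0:
            Q[k][k - 1] = mu              # one repair completes
        Q[k][k] = -sum(Q[k])              # generator rows sum to zero
    return Q

Q = group_generator(3, lam=0.001, mu=0.5)
```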
Despite sharing the same SRN template, component groups of the SCADA/EMS system are distinguished by their scheduling priority for the repair resource, the type and number of units available, and the reward-rate assignment. The PRP policy is incorporated in the model by associating guards with the timed transition repair of all component groups in the system. Preemptive scheduling maps naturally to the guard concept: once a unit with repair priority p requests repair, the guards of all transitions repair with priority lower than p immediately evaluate to false, disabling those transitions. After repair completion, the guards evaluate to true again, allowing the repair of other units of priority lower than p to restart.
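A guard of this kind can be sketched as a predicate over the current marking; the group names and priority values below are illustrative, not the paper's actual assignment:

```python
# PRP via guards: the "repair" transition of a group is enabled only if
# no higher-priority group has tokens in its place "down".
priorities = {"workstation": 3, "disk_unit": 2, "x_terminal": 1}

def repair_guard(group, down_tokens):
    return all(down_tokens.get(other, 0) == 0
               for other, prio in priorities.items()
               if prio > priorities[group])

marking = {"workstation": 1, "disk_unit": 1, "x_terminal": 2}
enabled = {g: repair_guard(g, marking) for g in priorities}
# only the highest-priority failed group may hold the repair facility
```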
In the hypothetical system, priorities were assigned based on three guidelines: (i) subsystems are repaired according to the availability results listed in Table 2: the component group of the subsystem with the lowest availability receives the highest priority, and so on; (ii) likewise, the unit type with the smallest MTTR receives the highest repair priority in a subsystem with more than one component group; and (iii) when there are multiple operator consoles, they are classified and distinguished for repair: repairs of units (display generator and X-terminals) of the first console have higher priority than those of the others, and so on.
The number of units n_i in the template in Fig. 5 is determined based on the performance requirements of the software and the fault-tolerance scheme adopted by each subsystem. The redundancy strategy also defines the reward assignment of the respective SRNs, which we now detail. We start by considering the DAC subsystem. Only real-time computers constitute this subsystem; hence we need only one SRN template to construct the DAC model. For this subsystem, n_i = 3 in the template of Fig. 5, and the rates λ_i and μ_i are computed using the real-time computer parameters given in Table 1. The redundancy strategy adopted in the SCADA subsystem is 2-out-of-3; hence we associate reward rate 1 (subsystem operational) with SRN markings with fewer than 2 tokens in place down, meaning that the DAC subsystem is operational. Otherwise we associate reward rate 0, meaning that the subsystem has failed and is no longer available.
Likewise, the applications subsystem is modeled using a single SRN template to capture the failure/repair behavior of its high-end workstations. In the applications subsystem model, n_i = 3 and the reward rates should also represent the 2-out-of-3 redundancy scheme: we give reward rate 1 as long as there is at most one token in place down, otherwise we give reward rate 0.
Fig. 5. Failure/repair SRN model of component group i.
The storage subsystem is a bit more complex, since it has three types of components that are susceptible to faults. For each component group (file servers, dual-disk controllers, and disk units) we use one SRN like the one shown in Fig. 5. Each one will have n_i = 2, since there are two units of each component type, with failure and repair rates computed using Table 1. We assume that each component group has its own repair facility; therefore the only connection among the three SRNs of the storage subsystem is given by the reward assignment. The closed-cluster redundancy scheme is represented by giving reward rate 1 whenever there is at most one token in the place down corresponding to each component group (file servers, controllers and disk units). Otherwise, we give reward rate 0 to the subsystem.

The user interface subsystem is composed of consoles accessed by the power system dispatchers in operator stations. A SCADA/EMS system usually has three or four operator stations that operate in parallel. These stations are composed of one display generator (usually a low-end workstation) coupled with three X-terminals in our hypothetical subsystem. Thus, two component groups are required to model each operator station in the system. The first SRN corresponds to the display generator and should have n_i = 1 and rates computed using the low-end workstation parameters given in Table 1. The second SRN represents the X-terminals and should have n_i = 3 and corresponding rates. Each operator station implements a series-parallel redundancy scheme where the station is considered functional as long as the display generator and at least one of the terminals are operational. Therefore, we give reward rate 1 (meaning that the operator station is functional) whenever there is no token in place down of the display generator SRN and at most two tokens in place down of the X-terminals SRN. Otherwise, we give reward rate 0, meaning the operator position is not available. The user interface subsystem is considered available if at least one of its operator stations is functional.

The communications subsystem has only one component type (i.e. low-end workstation) with two units and is thus modeled with a single SRN template with n_i = 2. The subsystem implements a simple parallel arrangement. Consequently, we assign reward rate 1 whenever there are fewer than two tokens in place down and reward rate 0 otherwise.
5.3. Numerical solution
The numerical solution of an SRN model begins with the construction of a directed graph (called the reachability graph) showing all possible execution sequences of the PN starting from the initial marking. The nodes of the graph are all reachable markings of the PN, and the arcs between markings are labeled with the transition that caused the state change. The reachability graph is then mapped into an isomorphic CTMC. With the sharing of repair facilities among component groups, the template shown in Fig. 5 is mapped into a Markov model of the type illustrated in Fig. 4.
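The reachability-graph construction can be sketched for the single-group template as a breadth-first search over markings; encoding a marking as a (#up, #down) pair is our own simplification:

```python
from collections import deque

# Breadth-first generation of the reachability graph for one n-unit
# group: nodes are markings, arcs are labeled with the firing transition.
def reachability_graph(n):
    initial = (n, 0)                       # all units active at time 0
    seen, arcs, frontier = {initial}, [], deque([initial])
    while frontier:
        up, down = frontier.popleft()
        successors = []
        if down < n:                       # "failure" enabled (inhibitor arc off)
            successors.append(("failure", (up - 1, down + 1)))
        if down > 0:                       # "repair" enabled
            successors.append(("repair", (up + 1, down - 1)))
        for label, marking in successors:
            arcs.append(((up, down), label, marking))
            if marking not in seen:
                seen.add(marking)
                frontier.append(marking)
    return seen, arcs

markings, arcs = reachability_graph(3)
```

Labeling each arc with the (exponential) rate of its transition instead of its name yields the isomorphic CTMC directly.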
Symbolic solution of Eq. (3) or (2) is possible only for Markov chains with a small number of states, or if the Markov chain has a very regular structure such as the one shown in Fig. 3. Numerical solution techniques, on the other hand, can solve Markov models of general structure with a large number of states, since they take advantage of the sparsity of the infinitesimal generator matrix Q to reduce the space and time required for solution and are numerically stable [6]. In this paper we use the stochastic Petri net package (SPNP) [7] to numerically solve the developed SRN models. Markov models having hundreds of thousands of states can be readily specified and solved using SPNP.

The analysis of an SRN starts with SPNP automatically mapping the reachability graph of the SRN into the underlying Markov reward model. The constructed MRM is then solved to compute a variety of transient, steady-state, cumulative, and sensitivity measures. For transient analysis, SPNP employs a highly accurate and efficient version of uniformization (also called randomization) [6]. Uniformization is a very useful numerical method to solve the system of Kolmogorov equations [Eq. (2)] and is able to handle even stiff problems. For steady-state solution, SPNP uses iterative methods like Gauss-Seidel or near-optimal successive overrelaxation (SOR) to solve the system of linear equations (3) [7]. Both methods preserve the sparsity of the Q matrix and thus can be used to solve very large models.

Table 5 compares the steady-state availability of the SCADA/EMS computer configuration, as well as of its subsystems, for the configuration with two operator consoles. Under this circumstance, the MRM isomorphic to the SRN model with shared repairs has 82,944 states and 778,176 transitions. Observe that the impact of sharing repair is very small when considering
Table 5
Steady-state availability of the SCADA/EMS system and respective subsystems

                      Steady-state availability
Subsystem             Independent repair    Shared repair
DAC                   0.999999296785        0.999999276988
Storage               0.999980390150        0.999980132430
Applications          0.999988771699        0.999988441728
User interface        0.999998128605        0.999996247920
Communications        0.999996257219        0.999996177583
SCADA/EMS system      0.999962844884        0.999960278393
the failure/repair rates listed in Table 1. Therefore, further modeling could be carried out using the independent-repair assumption. However, without formulating and solving the second model it would be extremely difficult for the modeler to make such a decision. Besides, the same SRN model that solves the shared-repair case can be used for the independent-repair assumption; the only necessary modification is to remove the guards from the transition repair of all subsystem models.
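The uniformization step mentioned above for transient solutions can be sketched in a few lines for a toy two-state (up/down) model; this is a simplified illustration under assumed rates, not SPNP's actual implementation:

```python
from math import exp

# Uniformization: P(t) = sum_k Poisson_k(q*t) * P(0) M^k, with
# M = I + Q/q and q at least as large as the largest |q_ii|.
def uniformization(Q, p0, t, tol=1e-12):
    n = len(Q)
    q = 1.05 * max(-Q[i][i] for i in range(n))
    M = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = exp(-q * t)                  # Poisson(q*t) mass at k = 0
    v = list(p0)                          # v = P(0) M^k
    result = [weight * x for x in v]
    mass, k = weight, 0
    while mass < 1.0 - tol:               # stop once most Poisson mass is in
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        k += 1
        weight *= q * t / k
        mass += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

lam, mu = 0.001, 0.5                      # illustrative rates
P = uniformization([[-lam, lam], [mu, -mu]], [1.0, 0.0], t=10.0)
```

For this two-state chain the result can be checked against the closed form A(t) = μ/(λ+μ) + λ/(λ+μ)·e^(−(λ+μ)t).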
6. Instantaneous availability
A common assumption in dependability modeling (that we have also adopted so far) is that all component units are available at system startup, a condition not always verifiable in actual computer systems, since computer components may fail instantaneously when systems are being energized or may even be connected to the system while faulty. However, through some modifications in our baseline SRN model and the concept of instantaneous availability we can study the consequences of this problem.

The condition of failed component units at system startup can be introduced with the assistance of random switches [19]. These constructs are stochastic extensions to the original PNs and allow, among other things, the definition of initial state probabilities in an SRN. Using random switches makes possible the computation of system measures when there is a probability mass function defining the initial state of the CTMC in Fig. 3.
In SPNP, a random switch is defined on a set of immediate transitions, such that when a subset of them is enabled, a firing probability (normalized so that the probabilities sum to 1) is defined for each enabled immediate transition. The particular transition to fire is chosen according to this discrete probability distribution.
The procedure is illustrated by the model of a component group with three units shown in Fig. 6. This new design presents a slight modification3 of the SRN template shown in Fig. 5, extended with a random switch. The switch is implemented with the help of the extra place cold units and the transitions up selector and down selector. Now, instead of all component units starting in the active state (i.e. all tokens in place up) when the subsystem is energized, we allow for some or all units to be in the failed state: with probability 1 − p a unit will work when first energized (transition up selector will fire) and with probability p the unit will fail (transition down selector will fire). Since immediate transitions have priority to fire over timed transitions, this random switch will randomly define the distribution of tokens between places up and down at time 0.
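Because the switch resolves each of the n_i tokens independently with the same probabilities, the number of tokens placed in down at time 0 follows a binomial distribution; a short sketch, with p = 0.1% as in the experiment below:

```python
from math import comb

# Each unit is routed to place "down" with probability p and to "up"
# with probability 1 - p, so the initial number of failed units is
# binomially distributed over 0..n.
def initial_down_distribution(n, p):
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

dist = initial_down_distribution(3, 0.001)   # p = 0.1%, three units
# dist[k] = probability that k units are failed when the system is energized
```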
The lifetime of the units is defined by a mixed random variable X with a mass at the origin. The distribution function F_X(t) of this random variable can be written as

    F_X(t) = p + (1 − p)(1 − e^(−λt)),

where p, as before, is the probability that a unit will fail immediately when the system is put into operation, and λ is the failure rate of the unit.
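A direct check of this mixed distribution in Python, with illustrative parameter values:

```python
from math import exp

# Mixed lifetime CDF: probability mass p of failing at t = 0 plus an
# exponential(lam) tail for units that survive power-up.
def lifetime_cdf(t, p, lam):
    return p + (1.0 - p) * (1.0 - exp(-lam * t))

p, lam = 0.001, 0.01        # illustrative values
```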
Fig. 7 displays the instantaneous availability of the DAC subsystem when p is 0.1%. As a reference, we also plot the instantaneous availability computed using the original subsystem (i.e. assuming p = 0%). Some interesting aspects we can observe in the plot in Fig. 7 are:
Fig. 6. Component units can be failed at time 0.
3 The addition of place up to collect tokens representing active units makes the concept of random switches more visible than when explained with the assistance of our basic model. The firing rate associated with transition failure is now given by #(up) · λ.
• There is some availability growth during the first few hours of operation of the modified subsystem.
• The impact of any failed unit is quickly dissipated because of the work-conserving behavior we assumed for the repair facilities. In practical terms this effect would be present only during the factory acceptance tests of the system.
• The limiting availability in both cases is the same, as expected, since in steady state all effect of the initial probability vector has vanished.
7. Sensitivity analysis of the SCADA/EMS computer
system architecture
In order to maximize improvements in system dependability we need criteria to identify availability bottlenecks in the system. One such criterion is based on quantitative measures provided by the upgrading functions of importance theory (introduced in a seminal paper by Birnbaum in 1969 [20]). If cost, size or weight are not objectives when maximizing system dependability,4 the importance ranks suggest the components to which system upgrading effort should be directed first. Otherwise, the importance measures offer valid weighting factors for the optimization process.
The upgrading function [21] is defined as the reduction in system reliability when the failure rate λ_k of component c_k is reduced fractionally, i.e.

    I_k^UF(t) ≐ λ_k · ∂R(t; λ_k)/∂λ_k,

where ∂R(t; λ_k)/∂λ_k is the parametric sensitivity function [22] of the measure R(t) with respect to deviations in the parameter λ_k. A better expression for the upgrading function, more commonly used because it yields numbers closer to unity, is the fractional reduction

    I_k^UF(t) = (λ_k / R(t)) · ∂R(t; λ_k)/∂λ_k,   (10)

which is the definition of I_k^UF(t) we consider in this work. Naturally, the relative ranking of components is not affected when the latter expression is used instead of the former one.
An upgrading-function analysis using MRMs begins by determining the sensitivity functions of output measures to a particular input parameter λ_k (e.g. the failure rate of a real-time computer). Then, the rates q_ij which are functions of λ_k are identified in the infinitesimal generator matrix Q(λ_k) of the MRM.5 After identification, we construct the (absolute) sensitivity function ∂Q(λ_k)/∂λ_k and compute the derivative of the state probabilities (time-dependent and steady-state) with respect to λ_k.
Fig. 7. Instantaneous availability of DAC subsystem.
4 Assumption considered in this section.
5 The infinitesimal generator matrix with rates that are functions of an input parameter λ_k is denoted by Q(λ_k) to stress its functional dependence.
If we let S(t; λ_k) be the row vector of the parametric sensitivities ∂P_i(t; λ_k)/∂λ_k, ∀i ∈ Ω, then from Eq. (2) we obtain

    ∂S(t; λ_k)/∂t = S(t; λ_k) Q(λ_k) + P(t; λ_k) ∂Q(λ_k)/∂λ_k.   (11)

Assuming that the initial conditions do not depend on λ_k, i.e.

    S(0; λ_k) = ∂P(0; λ_k)/∂λ_k = 0,

we can set up the following system of differential equations by combining Eqs. (2) and (11):

    ∂/∂t [P(t; λ_k)  S(t; λ_k)] = [P(t; λ_k)  S(t; λ_k)] · [ Q(λ_k)   dQ(λ_k)/dλ_k ]
                                                            [   0        Q(λ_k)    ],   (12)

subject to the initial conditions

    [P(0; λ_k)  S(0; λ_k)] = [P_0  0].

The system of Eq. (12) can then be solved using, for instance, the methods discussed in Ref. [23].
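A sketch of this joint integration for a toy two-state availability model, using plain forward-Euler steps (Ref. [23] discusses better methods); the rates and step size are illustrative:

```python
# Integrate dP/dt = P Q and dS/dt = S Q + P dQ/dlam together (Eq. (12))
# for a single repairable unit: state 0 = up, state 1 = down.
lam, mu = 0.001, 0.5

def Q(l):
    return [[-l, l], [mu, -mu]]

dQ_dlam = [[-1.0, 1.0], [0.0, 0.0]]

def vec_mat(v, M):
    return [sum(v[i] * M[i][j] for i in range(2)) for j in range(2)]

def integrate(l, t_end=100.0, h=0.01):
    P, S = [1.0, 0.0], [0.0, 0.0]       # unit up at t = 0; S(0) = 0
    for _ in range(int(t_end / h)):
        dP = vec_mat(P, Q(l))
        dS = [a + b for a, b in zip(vec_mat(S, Q(l)), vec_mat(P, dQ_dlam))]
        P = [p + h * d for p, d in zip(P, dP)]
        S = [s + h * d for s, d in zip(S, dS)]
    return P, S

P, S = integrate(lam)
# with reward vector r = (1, 0): A(t) = P[0] and dA/dlam = S[0] (Eq. (14))
eps = 1e-6
P_eps, _ = integrate(lam + eps)
finite_diff = (P_eps[0] - P[0]) / eps   # finite-difference cross-check of S[0]
```

The sensitivity obtained from the augmented system agrees with the finite-difference estimate computed by re-running the integration with a perturbed λ.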
The sensitivity of the reliability R(t) (or of the instantaneous availability A(t)) can be similarly derived:

    ∂R(t; λ_k)/∂λ_k = Σ_{i∈Ω} [∂r_i(λ_k)/∂λ_k] · P_i(t; λ_k) + Σ_{i∈Ω} r_i(λ_k) · S_i(t; λ_k).   (13)

Observe that in Eq. (13) even the attributed reward rates may be functions of λ_k. However, this is hardly ever the case for reliability measures, since reward-rate functions are usually restricted to binary values depending only on the operational status of the system. Therefore, we can simplify Eq. (13):

    ∂R(t; λ_k)/∂λ_k = Σ_{i∈Ω} r_i · S_i(t; λ_k).   (14)

Substituting Eq. (14) into Eq. (10), we can write the upgrading function of a component c_k (due to variations in λ_k) as

    I_k^UF(t) = (λ_k / R(t)) · Σ_{i∈Ω} r_i · S_i(t; λ_k).
Since, for power systems control centers, steady-state availability is the fundamental measure of dependability, we need to establish appropriate importance-measure equations based on availability measures. We then replace R(t) by A(t) in all equations and speak of availability (or unavailability) importance instead of reliability (or unreliability) importance. The formula to be used in the SCADA/EMS analysis comes from a slight modification of Eq. (10):

    I_k^UF(t) = (σ_k / A(t)) · ∂A(t; σ_k)/∂σ_k,   (15)

where the reliability R(t) is replaced by the instantaneous availability A(t), and λ_k is replaced by the generic parameter σ_k, which may be either a component failure rate or a repair rate. Observe that although different from Eq. (10), Eq. (15) also maps into Eq. (13) when expressed in terms of reward-rate functions. The steady-state counterpart of Eq. (15) is then given by

    I_k^UF(∞) = (σ_k / A_SS) · ∂A_SS(σ_k)/∂σ_k = lim_{t→∞} (σ_k / A(t)) · Σ_{i∈Ω} r_i · S_i(t; σ_k).   (16)

Table 6 lists the parametric sensitivities computed for the steady-state availability measure (i.e. ∂A_SS(σ)/∂σ). These results were computed considering the dedicated-repair configuration of the SCADA/EMS system with a single operator station. We used the SPNP package to compute the numerical results listed in Table 6. In SPNP, the rates and probabilities of the transitions and their derivatives are specified as functions of an independent parameter σ. SPNP automatically constructs the infinitesimal generator matrix Q(σ) as well as its derivative ∂Q(σ)/∂σ for the underlying MRM.

Before proceeding with the solution of Eq. (16) it is interesting to analyze the signs of the results listed in Table 6. To interpret the sign we should always keep in mind that sensitivity results are defined as derivatives; therefore a positive sign of ∂A_SS(σ)/∂σ indicates that variations in the independent parameter σ produce variations in A_SS in the same direction. On the other hand, a negative sign indicates variations in opposite directions. The sign behavior presented in Table 6 is consistent with intuition: increases in failure rates decrease system availability (components fail more frequently), while increases in repair rates improve availability (component repairs are faster).

Using the results listed in Tables 1, 3 and 6 in Eq. (16), we compute the importance measures presented in Table 7. Note that the absolute values of the
Table 6
Parametric sensitivity of the SCADA/EMS steady-state availability

                          Sensitivity of A_SS to variations in
Component type            Failure rate (λ_k)    Repair rate (μ_k)
Real-time computer        −0.012298826980       0.000004211927
File server               −0.065392055050       0.000089578158
Disk controller           −0.016381605561       0.000011220278
Disk unit                 −0.260497683127       0.000713692283
Workstation               −6.114168343194       0.008375573073
X-terminal                −0.000006315724       0.000000002163
importance measures are identical for each component. This interesting phenomenon occurs in this case due to the particular structure of the closed-form solution of the steady-state availability of a generic component group in the system.
If we define the auxiliary variable ρ_k ≐ λ_k/μ_k, then the steady-state availability of a generic m-out-of-n component group A_SS^(g)(m_k, n_k, λ_k, μ_k) given in Eq. (8) can be rewritten as

    A_SS^(g)(m_k, n_k, ρ_k) = [ Σ_{i=0}^{n_k − m_k} ρ_k^i · n_k!/(n_k − i)! ] / [ Σ_{l=0}^{n_k} ρ_k^l · n_k!/(n_k − l)! ].   (17)

Accordingly, the SCADA/EMS system steady-state availability is obtained by multiplying the availabilities A_SS^(g) corresponding to all component groups in the system [see Eq. (9)]. Therefore, the derivative of A_SS with respect to a given parameter σ_k (any component failure or repair rate) can be expressed as

    ∂A_SS/∂σ_k = (∂A_SS/∂ρ_k) · (∂ρ_k/∂σ_k),   (18)

with

    ∂ρ_k/∂σ_k = 1/μ_k          if σ_k = λ_k,
    ∂ρ_k/∂σ_k = −λ_k/μ_k²      if σ_k = μ_k.

Thus, from Eqs. (18) and (16) we have

    [∂A_SS(λ_k)/∂λ_k] / [∂A_SS(μ_k)/∂μ_k] = [(∂A_SS/∂ρ_k) · (1/μ_k)] / [−(∂A_SS/∂ρ_k) · (λ_k/μ_k²)] = −μ_k/λ_k,

and

    I_{λ_k}^UF(∞) / I_{μ_k}^UF(∞) = [(λ_k/A_SS) / (μ_k/A_SS)] · (−μ_k/λ_k) = −1.   (19)
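Both Eq. (17) and the ratio of Eq. (19) are easy to check numerically; the sketch below uses illustrative rates for a 2-out-of-3 group and finite differences in place of the analytic derivative:

```python
from math import factorial

# Closed-form steady-state availability of an m-out-of-n group with a
# shared repair facility, Eq. (17), as a function of rho = lam / mu.
def a_ss_group(m, n, rho):
    num = sum(rho**i * factorial(n) / factorial(n - i) for i in range(n - m + 1))
    den = sum(rho**l * factorial(n) / factorial(n - l) for l in range(n + 1))
    return num / den

lam, mu, h = 0.001, 0.5, 1e-6           # illustrative rates, finite-diff step
A = a_ss_group(2, 3, lam / mu)
dA_dlam = (a_ss_group(2, 3, (lam + h) / mu) - A) / h
dA_dmu = (a_ss_group(2, 3, lam / (mu + h)) - A) / h
# Eq. (19): A_SS cancels in the ratio of the two upgrading functions
ratio = (lam * dA_dlam) / (mu * dA_dmu)
```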
An important aspect to note is how misleading the sensitivity values are when considered by themselves. An analysis based only on the data from Table 6 would indicate that any attempt to vary the system availability by modifying the mean time to repair of components is highly ineffective. This statement is false according to Table 7, which shows that in this particular system changes in the repair rate of a component are as effective as changes in the component failure rate. Table 8 lists the steady-state availability of the SCADA/EMS system when either a component failure rate is reduced (i.e. Δσ_k = −Δλ_k) or the component repair rate is increased (i.e. Δσ_k = Δμ_k). As can be seen in the table, the results of both variations for any given component are practically identical (the agreement improves as Δσ_k → 0), as expected due to Eq. (19) and the semantics of the upgrading-function measures (discussed in the previous section).
The results in Table 7 identify the workstations as the single component type where additional development (either reducing the failure rate or improving repair mechanisms) can provide the best improvement of the overall system availability. In a general setting, with proper scaling considering the associated costs, the upgrading-function measures of Table 7 can be used to guide system availability optimization along the procedures described in Ref. [24] or [23].
Another possible application of the upgrading measures is the forecasting of system measures under small parametric variations of individual components. To illustrate this application, we constructed Table 9, forecasting the system steady-state availability when component parameters are individually changed. Δσ_k in Table 9 may represent either an increment in the component repair rate or a reduction in the component failure rate. The forecast steady-state availability measures A′_SS were computed using a formula directly derived from Eq. (16):

    ΔA_SS/A_SS = I_{σ_k}^UF(∞) · (Δσ_k/σ_k) + o(Δσ_k)
    ⇒ A′_SS = A_SS · [I_{σ_k}^UF(∞) · (Δσ_k/σ_k) + 1] + o(Δσ_k)
            = A_SS + Δσ_k · ∂A_SS(σ_k)/∂σ_k + o(Δσ_k),

where lim_{Δσ_k→0} o(Δσ_k)/Δσ_k = 0 and ∂A_SS(σ_k)/∂σ_k is the parametric sensitivity function of the measure A_SS with respect to deviations in the parameter σ_k. The quality of the forecast measures can be verified by comparing the results in Tables 8 and 9.
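The forecast formula itself can be exercised on a single repairable unit, where A_SS = μ/(λ+μ) has an analytic derivative; the rates below are illustrative, not the paper's Table 1 values:

```python
# First-order forecast A'_SS = A_SS + d_sigma * dA_SS/d_sigma for a
# single unit with failure rate lam and repair rate mu.
lam, mu = 0.001, 0.5
A = mu / (lam + mu)
sens = -mu / (lam + mu) ** 2      # exact dA_SS/d(lam)

d_lam = -0.01 * lam               # forecast a 1% failure-rate reduction
forecast = A + d_lam * sens
exact = mu / (lam + d_lam + mu)   # availability actually achieved
```

The o(Δσ_k) remainder is the gap between forecast and exact values, and it shrinks quadratically as Δσ_k decreases, matching the agreement between Tables 8 and 9.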
Table 7
Steady-state upgrading function measures of components in the SCADA/EMS system

                          I_k^UF(∞) considering variations in
Component type            Failure rate (λ_k)       Repair rate (μ_k)
Real-time computer        −1.405949 × 10^−6        1.405949 × 10^−6
File server               −7.475336 × 10^−6        7.475336 × 10^−6
Disk controller           −1.872674 × 10^−6        1.872674 × 10^−6
Disk unit                 −2.977897 × 10^−5        2.977897 × 10^−5
Workstation               −1.397890 × 10^−3        1.397890 × 10^−3
X-terminal                −7.219862 × 10^−10       7.219862 × 10^−10
8. Conclusion
The new configurations bring flexibility and extra life to SCADA/EMS implementations, eliminating the need for, and frustration of, total replacement of systems, with the recurrent expenses involved. Concomitantly, the new computer configurations bring increased difficulty in the quantitative analysis of alternative architectures. In this paper we presented a technique based on the concept of Markov reward models that may be of considerable help during the evaluation process that precedes any deployment of master station units.

The MRM concept was presented in the context of availability analysis, but it can easily solve problems in other domains (e.g. performance and performability), and a comparison was established with the common practice of availability analysis of EMS/SCADA computer configurations. We used both closed-form and numerical methods for computing the steady-state availability of the hardware of a hypothetical EMS/SCADA master station. Failed units at system startup and parametric sensitivity analysis were also taken into account.
Parametric sensitivities increase the utility of Markov models at little additional cost. The results of parametric sensitivity analysis provide guidance for both refining the model and improving the system being modeled. Apart from providing means for the quantitative comparison of various computer architectures, many other applications of system availability models like the ones developed in this paper can be identified [16]: (i) identification of availability bottlenecks that may require subsequent improvements in a particular design; (ii) definition of design improvements through sensitivity analysis of availability measures with respect to input parameters (e.g. failure and repair rates of components); (iii) identification of critical model input parameters to be estimated; and (iv) adaptation of the modeled system abstraction to the obtainable field data (failure and repair rates), the level of detail, and the resulting model size.
We also presented SPNP, a powerful stochastic modeling package that allows the modeling of complex system behavior. Its input language allows the description of large MRMs using simpler stochastic reward nets that map better to the conceptual system abstraction being modeled. With the mathematical engine of SPNP, the solution of each of the various models used in this paper took only a few minutes, including the ones that required parametric sensitivity analysis.
Table 8
Effect of parametric changes on the steady-state availability of the SCADA/EMS system

                               A_SS considering variations Δσ_k/σ_k of
Component type       σ_k      1%            5%            10%
Real-time computer   λ_rt     0.998596805   0.998596859   0.998596924
                     μ_rt     0.998596805   0.998596856   0.998596913
File server          λ_fs     0.998596865   0.998597155   0.998597500
                     μ_fs     0.998596864   0.998597138   0.998597439
Disk controller      λ_hsc    0.998596809   0.998596882   0.998596968
                     μ_hsc    0.998596809   0.998596878   0.998596953
Disk unit            λ_dsk    0.998597087   0.998598241   0.998599617
                     μ_dsk    0.998597084   0.998598173   0.998599372
Workstation          λ_wk     0.998610748   0.998666553   0.998736250
                     μ_wk     0.998610610   0.998663232   0.998723582
X-terminal           λ_xt     0.998596791   0.998596791   0.998596791
                     μ_xt     0.998596791   0.998596791   0.998596791
Table 9
Forecasting system steady-state availability A′_SS as a function of parametric changes using the upgrading function measures

                               Δσ_k/σ_k
Component type                1%            5%            10%
Real-time computer            0.998596805   0.998596861   0.998596931
File server                   0.998596865   0.998597164   0.998597537
Disk controller               0.998596809   0.998596884   0.998596978
Disk unit                     0.998597088   0.998598278   0.998599764
Workstation                   0.998610750   0.998666587   0.998736384
X-terminal                    0.998596791   0.998596791   0.998596791
Heimann et al. [25] state that, to fully analyze a candidate computer system, four types of dependability analyses are needed: evaluation, specification determination, sensitivity analysis, and tradeoff analysis. Evaluation and specification determination are necessary during the analysis of candidate architectures of a SCADA/EMS system. Sensitivity analysis takes place after the system has been evaluated and allows the assessment of the impact of changes in the configuration of the chosen SCADA/EMS system (before implementing them) during its whole lifetime. Tradeoff analysis investigates cost-availability compromises and is important along the complete lifecycle of the system. With the modeling technique and tool described, all of the areas mentioned can be explored.

The availability modeling of the example SCADA/EMS computer configuration discussed in this paper shows that the MRM approach has powerful modeling capability and flexibility. An extension of the work presented, currently under investigation, consists of relaxing the modeling assumptions stated in Subsection 3.2. Another direction of our work is the construction of performability models appropriate for the real-time system.
Acknowledgements
This work was supported in part by the National Science Foundation under Grant EEC-9418765 and by the Department of Defense as an Enhancement Project to the Center for Advanced Computing and Communication.
References

[1] Dy-Liacco TE. Modern control centers and computer networking. IEEE Computer Applications in Power 1994;7(4):17–22.
[2] Gaushell DJ, Darlington HT. Supervisory control and data acquisition. Proceedings of the IEEE 1987;75(12):1645–58.
[3] Horton JS, Gross DP. Computer configurations. Proceedings of the IEEE 1987;75(12):1659–69.
[4] Chainey WE, Block WR. Recent advances in master station architecture. IEEE Computer Applications in Power 1994;7(2):24–9.
[5] Sasson AM. Open systems procurement: a migration strategy. IEEE Transactions on Power Systems 1993;8(2):515–26.
[6] Muppala JK, Malhotra M, Trivedi KS. Markov dependability models of complex systems: analysis techniques. In: Özekici S, editor. Reliability and Maintenance of Complex Systems. Springer, Berlin, 1996:442–486.
[7] Ciardo G, Blakemore A, Chimento PF, Muppala JK, Trivedi KS. Automatic generation and analysis of Markov reward models using stochastic reward nets. In: Linear Algebra, Markov Chains, and Queueing Models, IMA Volumes in Mathematics and its Applications, vol. 48. Springer, Heidelberg, 1992.
[8] Hirel C, Wells S, Fricks R, Trivedi KS. iSPN: an integrated environment for modeling using stochastic Petri nets. In: Tools Demonstration Section at the Joint Conference of the 7th International Workshop on Petri Nets and Performance Models (PNPM'97) and the 9th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation (PERFORMANCE'97). Saint-Malo, France, 1997:17–19.
[9] Ockwell G, Killian G. Configuration management in an open architecture system. IEEE Transactions on Power Systems 1993;8(2):441–4.
[10] Podmore R. Criteria for evaluating open energy management systems. IEEE Transactions on Power Systems 1993;8(2):466–71.
[11] Robinson JT. Inter-system communications/networking. Proceedings of the IEEE 1987;75(12):1670–7.
[12] Ibe OC, Howe RC, Trivedi KS. Approximate availability analysis of VAXcluster systems. IEEE Transactions on Reliability 1989;38(1):146–52.
[13] Sahner R, Trivedi KS, Puliafito A. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic, Dordrecht, 1995.
[14] Trivedi KS. Probability & Statistics with Reliability, Queueing, and Computer Science Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[15] Malhotra M, Trivedi KS. Power-hierarchy of dependability-model types. IEEE Transactions on Reliability 1994;43(3):493–502.
[16] Goyal A, Lavenberg SS. Modeling and analysis of computer system availability. IBM Journal of Research and Development 1987;31(6):651–64.
[17] Kleinrock L. Queueing Systems, vol. 2: Computer Applications. Wiley, New York, 1976.
[18] Peterson JL. Petri nets. Computing Surveys 1977;9(3):223–52.
[19] Marsan MA, Conte G, Balbo G. A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Transactions on Computer Systems 1984;2(2):93–122.
[20] Birnbaum ZW. On the importance of different components in a multicomponent system. In: Multivariate Analysis II. Academic Press, New York, 1969:581–592.
[21] Henley EJ, Kumamoto H. Reliability Engineering and Risk Assessment. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[22] Frank PM. Introduction to System Sensitivity Theory. Academic Press, New York, 1978.
[23] Blake JT, Reibman AL, Trivedi K. Sensitivity analysis of reliability and performability measures for multiprocessor systems. In: Proceedings of the 1988 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. Santa Fe, NM, May 24–27, 1988:177–186.
[24] Lambert HE. Measures of importance of events and cut sets in fault trees. In: Barlow RE, Fussell JB, Singpurwalla ND, editors. Reliability and Fault Tree Analysis. Philadelphia, PA, 1975:77–100.