
CHAPTER 11

Availability Modeling in Practice

KISHOR S. TRIVEDI,1 ARCHANA SATHAYE,2 and SRINIVASAN RAMANI3

1 Dept. of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291 (E-mail: [email protected]); 2 Dept. of Computer Science, San Jose State University, San Jose, CA 95192 (E-mail: [email protected]); 3 IBM Corporation, 3039 Cornwallis Rd., RTP, NC 27709 (E-mail: [email protected]).

As computer systems continue to be applied to mission-critical environments,

techniques to evaluate their dependability become more and more important.

Of the dependability measures used to characterize a system, availability is one of

the most important.

Techniques to evaluate a system's availability can be broadly categorized as measurement-based and model-based. Measurement-based evaluation is expensive because it requires building a real system, taking measurements, and then analyzing the data statistically. Model-based evaluation, on the other hand, is inexpensive and relatively easier to perform. In Chapter 11, we first look at some availability modeling techniques and take up a case study from an industrial setting to illustrate the application of the techniques to a real problem. Although easier to perform, model-based availability analysis poses problems such as the largeness and complexity of the models developed, which make the models difficult to solve. Chapter 11 also illustrates several techniques to deal with these largeness and complexity issues.

11.1 INTRODUCTION

Complex computer systems are widely used in different applications, ranging from flight control and command-and-control systems to commercial systems such as information and financial services. These applications demand high performance and

high availability. Availability evaluation addresses failure and recovery aspects of

a system, whereas performance evaluation addresses processing aspects and

assumes that the system components do not fail. For gracefully degrading systems,

a measure that combines system performance and availability aspects is more


meaningful than separate measures of performance and availability. These

composite measures are called performability measures. The two basic approaches to evaluating the availability/performance measures of a system are measurement-based and model-based. In measurement-based evaluation, the required measures are estimated from measured data using statistical inference techniques. The data are measured from a real system or its prototype. In the case of availability evaluation, measurements are not always feasible, either because the system has not been built yet or because it is too expensive to conduct experiments. For a highly available system, one would need to measure data from several systems to gather a good sample, and injecting faults into a system can be an expensive procedure. Model-based evaluation is the cost-effective solution because it allows system evaluation without having to build and measure a

system. In this chapter, we discuss availability modeling techniques and their

usage in practice. To emphasize the practicality of these techniques, we discuss their pros and cons with respect to a case study: the VAXcluster systems1 of Digital Equipment Corporation (DEC2).

1. Later known as TruClusters.
2. Now part of Hewlett-Packard Company.

We first discuss different availability modeling approaches in Section 11.2. In Section 11.3, we discuss the benefits of utilizing a composite availability and performance model in practice instead of a pure availability model. Our discussion emphasizes this point using a model developed for multiprocessors to determine the optimal number of processors in the system. In Section 11.4, we present a case study to demonstrate the utility of availability modeling in a corporate environment.

11.2 MODELING APPROACHES

Model-based evaluation can be through discrete-event simulation, analytical models, or hybrid models combining simulation and analytical methods. A discrete-event simulation model can depict detailed system behavior, as it is essentially a program whose execution simulates the dynamic behavior of the system and evaluates the required measures. An analytical model consists of a set of equations describing the system behavior. The required measures are obtained by solving these equations. In simple cases, closed-form solutions are obtained, but more frequently numerical solutions of the equations are necessary.

The main benefit of discrete-event simulation is the ability to depict detailed

system behavior in the models. Also, the flexibility of discrete-event simulation

allows its use in performance, availability, and performability modeling. The

main drawback of discrete-event simulation is the long execution time, particularly

when tight confidence bounds are required in the solutions obtained. Also, carrying

out a “what if” analysis requires rerunning the model for different input parameters.

Advances in simulation speed-up techniques, such as regenerative simulation,


importance sampling [27], importance splitting [26], and parallel and distributed

simulation, also need to be considered.

Analytical models [7] are more of an abstraction of the real system than

discrete-event simulation models. In general, analytical models tend to be easier to develop and faster to solve than simulation models. The main drawback is the set of assumptions that are often necessary to make analytical models tractable.

Recent advances in model generation and solution techniques as well as computing

power make analytical models more attractive. In this chapter, we discuss model-based evaluation using analytical techniques and how one can achieve results that are useful in practice.

11.2.1 Analytical Availability Modeling Approaches

A system modeler can choose either state space or non-state space analytical modeling techniques. The choice of an appropriate modeling technique to represent the system behavior is dictated by factors such as the measures of interest, the level of detailed system behavior to be represented and the capability of the model to represent it, the ease of construction, and the availability of software tools to specify and solve the model. In this section, we discuss several non-state space and state

space modeling techniques.

11.2.2 Non-state Space Models

Non-state space models can be solved without generating the underlying state space.

Practically speaking, these models can be easily used for solving systems with

hundreds of components because there are many relatively good algorithms

available for solving such models [22]. Non-state space models can be solved to

compute measures like system availability, reliability, and system mean time to failure (MTTF). The two main assumptions used by the efficient solution techniques are

statistically independent failures and independent repair units for components. The

non-state space model types used to evaluate system availability are “reliability

block diagrams” and “fault trees.” In gracefully degradable systems, knowledge

of the performance of the system is also essential. Non-state space model types

used to evaluate system performance are “product-form queuing models” and

“task-precedence graphs” [16].

11.2.2.1 Reliability Block Diagram In a reliability block diagram (RBD),

each component of the system is represented as a block [16, 25]. This representation

of the system as a collection of components is done at the level of granularity appropriate for the measures to be obtained. The blocks are then connected

in series and/or in parallel based on the operational dependency between the

components. If for the system to be up, every component of a set of components

needs to be operational, then those blocks are connected in series in the RBD. On

the other hand, if the system can survive with at least one component, then those

blocks are connected in parallel. An RBD can be used to easily compute availability

if the repair times (and failure times) are all independent. Figure 11.1(a) shows an RBD representing a multiprocessor availability model with n processors, where at least one processor is required for the system to be up. This RBD represents a simple parallel system. Given a failure rate γ and a repair rate τ (for the purpose of illustration, consider identical failure and repair rates for all processors), the availability of processor Proc_i is given by

A_i = \frac{\tau}{\gamma + \tau}.    (11.1)

The availability of the parallel system is then given by [16]

A = 1 - \prod_{i=1}^{n} (1 - A_i) = 1 - \left( \frac{\gamma}{\gamma + \tau} \right)^{n}.    (11.2)
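As a concrete illustration (not part of the original text), the following Python sketch evaluates Eqs. (11.1) and (11.2); the parameter values γ = 1/6,000 per hour and τ = 1 per hour are the illustrative ones used later in Section 11.3.1, and n is an arbitrary choice.

```python
# Sketch: availability of the parallel RBD of Fig. 11.1(a), Eqs. (11.1)-(11.2).
# Parameter values are illustrative, not prescriptive.
gamma = 1 / 6000.0   # processor failure rate (per hour)
tau = 1.0            # processor repair rate (per hour)
n = 4                # number of processors

A_i = tau / (gamma + tau)      # Eq. (11.1): single-processor availability
A_sys = 1 - (1 - A_i) ** n     # Eq. (11.2): up if at least one processor is up

print(f"A_i = {A_i:.6f}, parallel-system availability = {A_sys:.12f}")
```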

11.2.2.2 Fault Trees A fault tree [16], like an RBD, is useful for availability analysis. It is a pictorial representation of the sequence of events/conditions that must be satisfied for a failure to occur. A fault tree uses AND, OR, and k-of-n gates to represent this combination of events in a tree-like structure. To represent situations where one failure event propagates failure along multiple paths in the fault tree, fault trees can have repeated nodes. Several efficient algorithms for solving fault trees exist. These include algorithms for fault trees without repeated components [11], a multiple variable inversion (MVI) algorithm called the LT algorithm to obtain a sum of disjoint products (SDP) from the mincut set [15], and the factoring/conditioning algorithm that works by factoring a fault tree with repeated nodes into a set of fault trees without repeated nodes [19]. Binary decision diagram (BDD)-based algorithms can be used to solve very large fault trees [5, 6, 22, 28]. Figure 11.1(b) shows the fault tree model for our multiprocessor system. UA_i represents the unavailability of processor i. The AND gate indicates that the multiprocessor system fails when all n processors become unavailable. The output of the top gate of the fault tree represents failure of the multiprocessor system.

Figure 11.1 A multiprocessor system: (a) reliability block diagram, (b) fault tree.

11.2.3 State Space Models

RBDs and fault trees cannot easily handle more complex situations such as failure/repair dependencies and shared repair facilities. In such cases, more detailed models

such as state space models are required. Here, we discuss some Markovian state

space models [3, 25].

11.2.3.1 Markov Chains In this section, we consider homogeneous Markov

chains. A homogeneous continuous time Markov chain (CTMC) [25] is a state

space model, in which each state represents various conditions of the system. In

homogeneous CTMCs, transitions from one state to another occur after a time

that is exponentially distributed. The arcs representing a transition from one state

to another are labeled by the constant rate corresponding to the exponentially

distributed time of the transition. If a state in the CTMC has no transitions leaving

it, then that state is called an absorbing state, and a CTMC with one or more such

states is said to be an absorbing CTMC. For the multiprocessor example, we now

illustrate how a Markov chain can be developed to capture shared repair and

multiple failure modes.

The parameters associated with the system availability model that we will now develop for our multiprocessor system are the failure rate γ of each processor and the processor repair rate τ. A processor fault is covered with probability c and not covered with probability 1 − c. After a covered fault, the system is up in a degraded mode after a reconfiguration delay. On the other hand, an uncovered fault is followed by a longer delay imposed by a reboot action. The reconfiguration and reboot delays are assumed to be exponentially distributed with means 1/δ and 1/β, respectively. In practice, the reconfiguration and reboot times are extremely small compared with the times between failures and repairs; hence, we assume that failures and repairs do not occur during these actions. System availability can be modeled using the Markov chain shown in Fig. 11.2.

For the system to be up, at least one processor out of the n processors needs to be operational. The state i, 1 ≤ i ≤ n, in the Markov model represents that i processors are operational and that n − i processors are waiting for on-line repair. The states X_{n−i} and Y_{n−i}, for i = 0, …, n − 2, represent that the system is undergoing a reconfiguration or a reboot, respectively. The steady-state probabilities

Figure 11.2 Multiprocessor Markov chain model.

π_i for each state i can be computed as [24, 25]

\pi_{n-i} = \frac{n!}{(n-i)!} \left( \frac{\gamma}{\tau} \right)^{i} \pi_n, \quad i = 1, 2, \ldots, n

\pi_{X_{n-i}} = \frac{n!}{(n-i)!} \cdot \frac{(n-i)\gamma c}{\delta} \left( \frac{\gamma}{\tau} \right)^{i} \pi_n, \quad i = 0, 1, \ldots, n-2    (11.3)

\pi_{Y_{n-i}} = \frac{n!}{(n-i)!} \cdot \frac{(n-i)\gamma (1-c)}{\beta} \left( \frac{\gamma}{\tau} \right)^{i} \pi_n, \quad i = 0, 1, \ldots, n-2,

where

\pi_n = \left[ \sum_{i=0}^{n} \left( \frac{\gamma}{\tau} \right)^{i} \frac{n!}{(n-i)!} + \sum_{i=0}^{n-2} \left( \frac{\gamma}{\tau} \right)^{i} \frac{(n-i)\gamma c \, n!}{\delta (n-i)!} + \sum_{i=0}^{n-2} \left( \frac{\gamma}{\tau} \right)^{i} \frac{(n-i)\gamma (1-c) \, n!}{\beta (n-i)!} \right]^{-1}.    (11.4)

The steady-state availability, A(n), can be written as

A(n) = \sum_{i=0}^{n-1} \pi_{n-i} = \frac{\sum_{i=0}^{n-1} \theta^{i} / (n-i)!}{Q_1},    (11.5)

where θ = γ/τ and

Q_1 = \sum_{i=0}^{n} \frac{\theta^{i}}{(n-i)!} + \sum_{i=0}^{n-2} \frac{(n-i)\gamma \theta^{i}}{(n-i)!} \left( \frac{c}{\delta} + \frac{1-c}{\beta} \right).    (11.6)

The system unavailability, defined as a function of n, is then given by UA(n) = 1 - \sum_{i=1}^{n} \pi_i.
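To make Eqs. (11.3)–(11.6) concrete, here is a small Python sketch (our illustration, not the authors' code) that computes A(n) and the yearly downtime used later in Section 11.3.1; the default parameter values are the ones quoted there, with δ = 360 per hour corresponding to a 10-second mean reconfiguration delay.

```python
from math import factorial

# Sketch: steady-state availability of the CTMC of Fig. 11.2 via Eqs. (11.5)-(11.6).
# Defaults follow Section 11.3.1 (c = 0.9, gamma = 1/6000 per hour, beta = 12 per
# hour, tau = 1 per hour); treat them as illustrative assumptions.
def availability(n, gamma=1/6000.0, tau=1.0, c=0.9, delta=360.0, beta=12.0):
    theta = gamma / tau
    # Q1 per Eq. (11.6): normalizing sum over up, reconfiguration, and reboot states
    q1 = sum(theta**i / factorial(n - i) for i in range(n + 1))
    q1 += sum((n - i) * gamma * theta**i / factorial(n - i)
              * (c / delta + (1 - c) / beta) for i in range(n - 1))
    up = sum(theta**i / factorial(n - i) for i in range(n))  # states n down to 1
    return up / q1                                           # Eq. (11.5)

for n in range(1, 9):
    downtime = (1 - availability(n)) * 8760 * 60  # D(n) in minutes per year
    print(n, round(downtime, 2))
```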

11.2.3.2 Markov Reward Models A Markov reward model (MRM) is essentially a Markov chain with a "reward rate" r_i associated with each state i. This makes the specification of measures of interest more convenient. We can now denote the reward rate of the CTMC at time t as

Z(t) = r_{X(t)}.    (11.7)

The expected steady-state reward rate for an irreducible CTMC can be written as

E[Z] = \lim_{t \to \infty} E[Z(t)] = \sum_{i} r_i \pi_i.    (11.8)

By appropriately choosing reward rates (or weights) for each state, the appropriate measure can be obtained from the model at hand. For example, for the multiprocessor example of Fig. 11.2, if the reward rates are defined as r_i = 1 for states i = 1, …, n and r_i = 0 otherwise, the expected steady-state reward rate gives the steady-state availability.
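A minimal sketch of Eq. (11.8) in Python; the three-state probability vector below is hypothetical and serves only to show how different reward assignments yield availability or capacity-oriented measures.

```python
# Sketch: expected steady-state reward rate, Eq. (11.8).
# pi maps each state to its steady-state probability; reward maps each
# state to its reward rate r_i. With 0/1 rewards this reduces to availability.
def expected_reward(pi, reward):
    return sum(reward[s] * p for s, p in pi.items())

# Hypothetical 3-state example: states 2 and 1 are up, state 0 is down.
pi = {2: 0.9990, 1: 0.0009, 0: 0.0001}
availability = expected_reward(pi, {2: 1, 1: 1, 0: 0})       # r_i in {0, 1}
capacity_oriented = expected_reward(pi, {2: 2, 1: 1, 0: 0})  # r_i = i
```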

11.2.3.3 Stochastic Petri Nets and Stochastic Reward Nets A Petri net [16] is a more concise and intuitive way of representing the situation to be modeled. It is also useful for automating the generation of large state spaces. A Petri net consists of places, transitions, arcs, and tokens. Tokens move from one place to another along arcs through transitions. The number of tokens in the places represents the marking of a Petri net. If the transition firing times are random, the Petri net is called a stochastic Petri net (SPN). If the transition firing times are exponentially distributed, the underlying reachability graph, representing transitions from one marking to another, gives the underlying homogeneous CTMC for the situation being modeled.

When evaluating a system for its availability, it is often also necessary to consider

the performance of the system in each of its operational states. We will now develop

a performance model for the multiprocessor system, which we will use in a later

section. For the multiprocessor system, let us say we are interested in finding the

probability that an incoming task is turned away because all n processors are tied

up by other tasks being processed. The parameters associated with this pure performance model are the arrival rate of tasks, the service rate of tasks, the number of buffers, and a deadline on task response times. The performance model assumes that arriving tasks form a Poisson process with rate λ and that the service requirements of tasks are independent and identically distributed with an exponential distribution of mean 1/μ. A deadline d is associated with each task. Let us also take the number of buffers available for storing incoming tasks as b. We could use an

M/M/n/b queue, represented by the generalized stochastic Petri net (GSPN, which also allows immediate transitions) shown in Fig. 11.3, for our performance model.

Figure 11.3 GSPN model of the M/M/n/b queue.

Timed transitions are represented by rectangles and immediate transitions by thin lines. Place proc contains the number of processors available. Initially, there are n tokens here, representing the n processors. Note that b ≥ n, assuming that a buffer is taken up by each task being serviced by one of the n processors. When transition arr fires (this is enabled only if a buffer is available, represented in the GSPN model by the inhibitor arc from place buffer to the transition arr), a token is removed from proc and put in place serving via the immediate transition request, representing one less free processor. At the same time that this happens, a token is also removed

from place buffer, to denote that the task is not waiting. The immediate transition

request will be enabled if there is at least one token in both place proc and place

buffer. This enforces the condition that, when a task arrives, a processor must be

available. If not, the task waits, represented by the retention of the token in place

buffer. If there are more arrivals when all processors are busy, the buffers start to

fill up. Transition arr is disabled (indicated by the inhibitor arc from place buffer) when there are b − n tokens in place buffer, since there can only be b tasks in the system: n in place serving that are currently being serviced and b − n in place buffer. Therefore, the probability that an incoming task is rejected is the probability

buffer. Therefore, the probability that an incoming task is rejected is the probability

that transition arr is disabled. Transition service represents the service time for each

task and has a firing rate that depends on the number of tokens in place serving (indi-

cated by the notation μ#). That is, if place serving has p tokens, transition service has a firing rate of pμ. The expectation of the firing rate of transition service at steady

state gives the throughput of the multiprocessor system. Several tools such as SPNP

[4] are available for solving stochastic Petri net and stochastic reward net models.
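Since the GSPN of Fig. 11.3 has an M/M/n/b queue as its underlying model, its two measures of interest can also be computed directly from the standard birth-death solution. The sketch below is our illustration (it bypasses tools such as SPNP); the parameter values are placeholders.

```python
# Sketch: blocking probability and throughput of the M/M/n/b queue underlying
# the GSPN of Fig. 11.3. The loss probability is the probability that
# transition "arr" is disabled (all b buffers occupied).
def mmnb(lam, mu, n, b):
    # Unnormalized birth-death probabilities pi_k, k = 0..b.
    # Service rate in state k is min(k, n) * mu (marking-dependent rate).
    probs = [1.0]
    for k in range(1, b + 1):
        probs.append(probs[-1] * lam / (mu * min(k, n)))
    norm = sum(probs)
    pi = [p / norm for p in probs]
    loss = pi[b]                    # all b tasks in system: arrivals rejected
    throughput = lam * (1 - loss)   # accepted-arrival rate = completion rate
    return loss, throughput

print(mmnb(lam=80.0, mu=100.0, n=2, b=10))  # placeholder parameter values
```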

In practical system design, a pure availability model may not be enough for

systems such as gracefully degradable ones. In conjunction with availability, the

performance of the system as it degrades needs to be considered. This requires

a “performability” model [12, 21] that includes both performance and availability

measures. In the next section, we present an example of a system for which a

performability measure is needed.

11.3 COMPOSITE AVAILABILITY AND PERFORMANCE MODEL

Consider the multiprocessor system again, but with a varying number of processors, n, each with the same capacity. A key question asked in practice is the number of processors needed (sizing). As we discuss below, the optimal configuration in terms of the number of processors is a function of the chosen measure of

system effectiveness [24]. We begin our sizing based on measures from a “pure”

availability model. Next, we consider sizing based on system performance

measures. Last, we consider a composite of performance and availability measures

to capture the effect of system degradation.

11.3.1 Multiprocessor Sizing Based on Availability

A CTMC model for the failure/repair characteristics of the multiprocessor system is

shown in Fig. 11.2. The details of the model were discussed in the introduction to

Markov chains in the previous section. The downtime during an observation interval

of duration T is given by D(n) ¼ UA(n) � T. The results shown in Fig. 11.4 assume

T is 1 year, i.e., 8,760 hours. Hence D(n) ¼ UA(n) � 8,760 � 60 min per year.

In Fig. 11.4(a), we plot the downtime D(n) against n, the number of processors

for varying values of the mean reconfiguration delay using c ¼ 0.9, g ¼ 1/6,000

per hour, b ¼ 12 per hour, and t ¼ 1 per hour. In Fig. 11.4(b), we plot D(n) against


n for different coverage values, with a mean reconfiguration delay (1/δ) of 10 seconds. We conclude from these results that the availability benefits of multiprocessing (i.e., an increase in availability with an increase in the number of operational processors) are possible only if the coverage is near perfect and the reconfiguration delay is very small, or if most of the other processors are able to carry out useful work while a fault is being handled. We further observe that, for most practical parameter values, the optimal number of processors is 2 or 3. As the number of processors increases beyond 2, the downtime is mainly due to reconfigurations. Also, the number of reconfigurations increases almost linearly with the number of processors.

Figure 11.4 Downtime vs. number of processors for (a) different mean reconfiguration delays, (b) different coverage values.

In the next subsection, we consider a performance-based model for the multiprocessor sizing problem.

11.3.2 Multiprocessor Sizing Based on Performance

Along the lines of the GSPN example discussed in Section 11.2.3, we use an M/M/n/b queuing model with finite buffer capacity to compute the probability that a task is rejected because the buffer is full. In Fig. 11.3, this is the probability that transition arr is disabled. In Fig. 11.5, we plot the loss probability as a function of n for different values of the arrival rate.

We observe that the loss probability decreases as the number of processors increases. The conclusion from the performance model of the fault-free system is that the system improves as the number of processors is increased. The details of this model and results have been previously presented [24].

Figure 11.5 Loss probability vs. number of processors.

The above models point out the deficiency of simply considering a pure availability or performance measure. The pure availability measure ignores different

levels of performance at various system states, whereas the pure performance

measure ignores the failure/repair behavior of the system. The next section

considers combined measures of performance and availability.

11.3.3 Multiprocessor Sizing Based on Performance and Availability

Different levels of performance can be taken into account by attaching a reward rate r_i, corresponding to some measure of performance, to each state i of the failure/repair Markov model in Fig. 11.2. The resulting Markov reward model can then be analyzed for various combined measures of performance and availability. The simplest reward rate assignment is to let r_i = i for states with i operational processors


and r_i = 0 for down states. With this reward assignment, shown in Table 11.1, we can compute the capacity-oriented availability, COA(n), as the expected reward rate in the steady state. This can be written as COA(n) = \sum_{i=1}^{n} i \pi_i. COA(n) is an upper bound on system performance that equates performance with system capacity.

Next, let us consider a performance-oriented reward rate assignment.

When i processors are operational, we can use an M/M/i/b queuing model (such as the GSPN in Fig. 11.3) to describe the performance of the multiprocessor system. We then assign a reward rate of r_i = 0 for each down state i; for all other states, we assign a reward rate of r_i = T_b(i), the throughput for a system with i processors and b buffers (see Table 11.2). With this assignment, the expected reward rate at steady state computes the throughput-oriented availability, TOA(n), which can be written as TOA(n) = \sum_{i=1}^{n} T_b(i) \pi_i. In Fig. 11.6, we plot COA(n) and TOA(n) for different values of the arrival rate λ.
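A sketch of these two reward assignments in Python; pi_up (the steady-state probabilities of the operational states of the CTMC in Fig. 11.2) and the throughput function Tb are assumed inputs, for example obtained from the Markov-chain and M/M/i/b sketches given earlier.

```python
# Sketch: capacity- and throughput-oriented availability (Tables 11.1-11.2).
# pi_up maps i (number of operational processors) to its steady-state
# probability; down, reconfiguration, and reboot states earn reward 0.
def coa(pi_up):
    return sum(i * p for i, p in pi_up.items())     # reward r_i = i

def toa(pi_up, Tb):
    # Tb(i): throughput of the system with i processors and b buffers,
    # e.g., from an M/M/i/b model such as mmnb() above.
    return sum(Tb(i) * p for i, p in pi_up.items()) # reward r_i = T_b(i)
```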

These two measures show that, in order to process a heavier workload, more than two processors are needed. The measures COA and TOA are not adequate measures of system effectiveness because they obliterate the effects of failure/repair and merely show the effects of system capacity and load. Previously [24], a measure of system effectiveness, the total loss probability, was proposed that "equally" reflects fault-free behavior and behavior in the presence of faults. The total loss probability is defined as the sum of the rejection probability due to the system being down or full and the probability of a response-time deadline being violated. The total loss probability is computed by using the following reward rate assignments (also shown in Table 11.3):

r_i = 1 if i is a down state, and r_i = q_b(i) + [1 − q_b(i)] P(R_b(i) > d) if i is an operational state.

TABLE 11.1 Reward Rates for COA

State                                     Reward rate, r_i
0                                         0
1 ≤ i ≤ n                                 i
X_{n−i} and Y_{n−i}, i = 0, …, n − 2      0

TABLE 11.2 Reward Rates for TOA

State                                     Reward rate, r_i
0                                         0
1 ≤ i ≤ n                                 T_b(i), throughput of system with i processors and b buffers
X_{n−i} and Y_{n−i}, i = 0, …, n − 2      0


Here, q_b(i) is the probability of task rejection when the buffer is full for a system with i operational processors, R_b(i) is the response time for a system with i operational processors and b buffers, and d is the deadline on task response time. For the GSPN model in Fig. 11.3, after setting n = i, q_b(i) is the probability that transition arr is disabled. P(R_b(i) > d) can be calculated from the response-time distribution of an M/M/i/b queue.

With the reward rate assignment in Table 11.3, the expected reward rate in the steady state gives the total loss probability TLP(n, b, d), which can be written as

TLP(n, b, d) = \sum_{i=0}^{n-1} q_b(n-i) \, \pi_{n-i} + \sum_{i=0}^{n-1} \left[ 1 - q_b(n-i) \right] P\big(R_b(n-i) > d\big) \, \pi_{n-i} + \sum_{i=0}^{n-2} \left( \pi_{X_{n-i}} + \pi_{Y_{n-i}} \right) + \pi_0.
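The following Python fragment mirrors the TLP expression above; it assumes the per-state ingredients are available from the earlier models: pi_up[i] for the operational states, pi_down for the total probability of the down, reconfiguration, and reboot states, and hypothetical helpers qb(i) and resp_viol(i) standing in for q_b(i) and P(R_b(i) > d).

```python
# Sketch: total loss probability from the reward assignment of Table 11.3.
# pi_up: {i: steady-state probability of the state with i operational processors}
# pi_down: total probability of all states with reward 1 (down, reconfig, reboot)
# qb(i): M/M/i/b blocking probability; resp_viol(i): P(R_b(i) > d)
def total_loss_probability(pi_up, pi_down, qb, resp_viol):
    tlp = sum((qb(i) + (1 - qb(i)) * resp_viol(i)) * p
              for i, p in pi_up.items())  # operational states
    return tlp + pi_down                  # down states carry reward 1
```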

In Fig. 11.7, we plot the total loss probability TLP(n, b, d) as a function of n for different values of the task arrival rate λ. We have used b = 10, d = 8 × 10^{−5}, and μ = 100 per second.

Figure 11.6 COA(n) and TOA(n) for different arrival rates.

TABLE 11.3 Reward Rates for Total Loss Probability

State                                     Reward rate, r_i
0                                         1
1 ≤ i ≤ n                                 q_b(i) + (1 − q_b(i)) P(R_b(i) > d)
X_{n−i} and Y_{n−i}, i = 0, …, n − 2      1


We see that, as the value of λ increases, the optimal number of processors, n*, increases as well. For instance, n* = 4 for λ = 20, n* = 5 for λ = 40, and n* = 6 for λ = 60. (Note that n* = 2 for λ = 0 from Fig. 11.4.) The model can be used to show that the optimal number of processors increases with the task arrival rate, tighter deadlines, and smaller buffer spaces.

Figure 11.7 Total loss probability vs. number of processors for different task arrival rates.

11.4 DIGITAL EQUIPMENT CORPORATION CASE STUDY

In this section, we discuss a case study to demonstrate that, in practice, the choice of an appropriate model type is dictated by the availability measures of interest, the level of detailed system behavior to be represented, the ease of model specification and solution, the representation power of the model type selected, and access to suitable tools or toolkits that can automate model specification and solution.

In particular, we describe the availability models for Digital Equipment Corporation's (DEC) VAXcluster system. VAXclusters are used in different application environments; hence, several availability, reliability, and performability measures need to be computed. For VAXclusters used as commercial computer systems in a general data processing environment, we evaluate system availability and performability measures. For VAXclusters in highly critical applications, such as life support systems and financial data processing systems, we evaluate system reliability measures. These two classes of measures were not adequate for some financial-institution customers of VAXclusters. We therefore evaluated


task completion measures to compute the probability of application interruption during

its execution period.

VAXclusters are closely coupled systems that consist of two or more VAX

computers, one or more hierarchical storage controllers (HSCs), a set of disk

volumes and a star coupler [10]. The processor (VAX) subsystem and the storage

(HSC and disk volume) subsystem are connected through the star coupler by a CI

(Computer Interconnect) bus. The star coupler is omitted from the availability

models as it is assumed to be a passive connector and hence extremely reliable.

Figure 11.8 shows the hardware topology of a VAXcluster.

Our availability modeling approach considers a VAXcluster as two independent

subsystems, namely, the processing subsystem and the storage subsystem. Therefore, the availability of the VAXcluster is the product of the availability of each subsystem. In the following sections, we develop a sequence of increasingly powerful availability models, where the level of modeling power is directly proportional

to the level of complex behavior and characteristics of VAXclusters included in

the model.

Our discussion of the availability models developed for the VAXcluster system is organized as follows. In Sections 11.4.1 and 11.4.2, we develop models using non-state space techniques, namely reliability block diagrams and fault trees, respectively. In Section 11.4.3.1, we develop a CTMC, or rather a Markov reward model. The utility of this model in practice is limited, as the size of the model grows exponentially with the number of processors in the VAXcluster. Models that avoid largeness are discussed in Sections 11.4.3.2 and 11.4.4. The model in Section 11.4.3.2 uses a two-level decomposition for the processor subsystem of the VAXcluster [9]. On the other hand, for the model in Section 11.4.4, an iterative scheme is developed [17]. The approximate nature of these results prompted a largeness tolerance approach: in Section 11.4.5, a concise stochastic Petri net is developed for VAXclusters consisting of uniprocessors [8]. The final approach, in Section 11.4.6, addresses realistic heterogeneous configurations, where the VAXclusters consist of uniprocessors and/or multiprocessors [14, 18], using stochastic reward nets [1].

Figure 11.8 Architecture of the VAXcluster system.

11.4.1 Reliability Block Diagram (RBD) Model

The first model of VAXclusters uses a non-state space method, namely, the reliability block diagram. This approach was used in the availability model of VAXclusters by Balkovich et al. [2]. We use this approach to partition the VAXcluster along functional lines, which allows us to model each component type separately. In Fig. 11.9, the block diagram represents a VAXcluster configuration with n_P processors, n_H HSCs, and n_D disks.

We assume that the VAXcluster is down if all the components of any of the three subsystems are down. We assume that the times to failure of all components are mutually independent, exponentially distributed random variables. We also assume each component has an independent repair facility. The repair time here is a two-stage hypoexponentially distributed random variable, with the first phase being the travel time for the repairman to get to the field and the second phase being the actual repair time. Evaluating this model as a pure series-parallel availability model, the expression for the VAXcluster availability is given by:

A = \left[ 1 - \left( \frac{\lambda_P}{\lambda_P + 1/(1/\mu_P + 1/\mu_F)} \right)^{n_P} \right] \left[ 1 - \left( \frac{\lambda_H}{\lambda_H + 1/(1/\mu_H + 1/\mu_F)} \right)^{n_H} \right] \left[ 1 - \left( \frac{\lambda_D}{\lambda_D + 1/(1/\mu_D + 1/\mu_F)} \right)^{n_D} \right]    (11.9)

Here, 1/λ_P is the mean time between VAX processor failures, 1/λ_H is the mean time between HSC failures, 1/λ_D is the mean time between disk failures, 1/μ_F is the mean field service travel time, and 1/μ_P, 1/μ_H, and 1/μ_D are the mean times to repair a VAX processor, an HSC, and a disk, respectively.

Figure 11.9 Reliability block diagram model for the VAXcluster system.

The assumption that a VAXcluster is down only when all the components of any of the three subsystems are down is not in tune with reality. For a VAXcluster to be operational, the system must meet quorum, where quorum is the minimum number of VAXs required for the VAXcluster to function.

11.4.2 Fault Tree Model

In this section, we present a fault tree model for the VAXcluster configuration

discussed in Section 11.4.1.

Figure 11.10 is a model for the VAXcluster with n_P processors, n_H HSCs, and n_D disks. Observe that, in a block diagram model, the structure tells us when the system is functioning, whereas in a fault tree model, the structure tells us when the system has failed. In addition, we have extended the model to include the quorum required for operation. The cluster is operational as long as k_P out of n_P processors, k_H out of n_H HSCs, and k_D out of n_D disks are up. The negation of this operational condition is depicted in the fault tree as follows. The topmost node denotes "Cluster Failure," and the associated OR gate specifies that a cluster fails if (n_P − k_P + 1) processors, (n_H − k_H + 1) HSCs, or (n_D − k_D + 1) disks are down. The steady-state unavailability of the cluster, U_cluster, is given by

U_{cluster} = 1 - (1 - U_P)(1 - U_H)(1 - U_D),    (11.10)

Figure 11.10 Fault tree model for the VAXcluster system.

where

U_i = \sum_{J : |J| \ge n_i - k_i + 1} \left( \prod_{j \in J} U_{i_j} \right) \left( \prod_{j \notin J} \left( 1 - U_{i_j} \right) \right)

for i = P, H, or D (processors, HSCs, or disks). Here, U_{i_j} is the unavailability of the jth component of type i, and the sum ranges over all sets J of indices of failed components of type i.
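Under the additional simplifying assumption that the components within a subsystem have identical unavailability U, the subset sum above collapses to a binomial tail, as in this illustrative sketch (the U values and the k-of-n requirements are placeholders):

```python
from math import comb

# Sketch: subsystem unavailability when failure requires (n - k + 1) or more
# components down, assuming identical component unavailability U within a
# subsystem. This collapses the chapter's subset sum to a binomial tail.
def k_of_n_unavailability(n, k, U):
    return sum(comb(n, j) * U**j * (1 - U)**(n - j)
               for j in range(n - k + 1, n + 1))

# Eq. (11.10) with placeholder parameters for the three subsystems:
U_cluster = 1 - ((1 - k_of_n_unavailability(4, 3, 1e-3))    # processors
                 * (1 - k_of_n_unavailability(2, 1, 5e-4))  # HSCs
                 * (1 - k_of_n_unavailability(8, 6, 2e-4))) # disks
print(U_cluster)
```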

The RBD and fault tree VAXcluster availability models are very limited in their

depiction of the failure/recovery behavior of the VAXcluster. For example, they

assume that each component has its own repair facility and that there is only one

failure/recovery type. In fact, combinatorial models like RBDs, fault trees, and

reliability graphs require system components to behave in a stochastically independent manner. Dependencies of many different kinds exist in VAXclusters, and hence combinatorial models are not entirely satisfactory for such systems. State space

modeling techniques like Markovian models can include different kinds of dependencies. In the following sections, we develop state space models for the processing

subsystem and the storage subsystem separately and use a hierarchical technique to

combine the models of the two subsystems to obtain an overall system availability

model.

11.4.3 Availability Model for the VAXcluster Processing Subsystem

11.4.3.1 Continuous Time Markov Chain We now develop a more detailed continuous time Markov chain (CTMC) model for the VAXcluster, showing two types of failure and a coverage factor for each failure type. By using a state space model, we are also able to incorporate shared repair for the processors in a cluster. We assumed that the times between failures, the repair times, and other recovery times are exponentially distributed and developed an availability model for an n-processor (n ≥ 2) VAXcluster using a homogeneous continuous-time Markov chain. A CTMC for a two-processor VAXcluster developed previously [8] is shown in Fig. 11.11.

The following behavior is characterized in this CTMC. A covered processor failure causes a brief (on the order of seconds) cluster outage to reconfigure the failed processor out of the cluster, and back into the cluster after it is fixed (a permanent failure requires a physical repair, while an intermittent failure requires a processor reboot). Thus, if a quorum is still formed by the operational processors, a covered failure causes a small loss of system time. An uncovered failure causes a longer loss of system time because it requires a cluster reboot; that is, each processor is rebooted and the cluster is re-formed, even if the remaining operational processors still form a quorum. If the uncovered failure is permanent, the failed processor is mapped out of the cluster during the reboot and reconfigured back into the cluster after a physical repair. On the other hand, if the uncovered failure is intermittent, the cluster is operational after a cluster reboot. The mean permanent and intermittent failure times are 1/λ_P and 1/λ_I hours, respectively. The mean repair, mean processor reboot, and mean cluster reboot times are given by 1/μ_P, 1/μ_PB, and 1/μ_CB, respectively. Let c and k denote the coverage factors for permanent and intermittent failures, respectively. Realistically, the mean reconfiguration time (1/μ_IN) to map a processor into the cluster and the time (1/μ_OUT) to map it out of the cluster are different. In the CTMC of Fig. 11.11, we have assumed that μ_IN = μ_OUT = μ_T. The states of the Markov chain are represented by (abc, d), where:

a = number of processors down with a permanent failure;
b = number of processors down with an intermittent failure;
c = 0 if both processors are up, p if one processor is being repaired, b if one processor is being rebooted, c if the cluster is undergoing a reboot, t if the cluster is undergoing a reconfiguration, r if two processors are being rebooted, s if one processor is being rebooted and the other repaired;
d = 0 for a cluster up state, 1 for a cluster down state.

Figure 11.11 CTMC for the two-processor VAXcluster.

The steady-state availability of the cluster is given by

Availability = P_{000,0} + P_{10p,0} + P_{01b,0},    (11.11)

where P_{abc,d} denotes the steady-state probability that the process is in state (abc, d).

We computed the availability of the VAXcluster system by solving the above CTMC using SHARPE [16], a software package for availability/performability analysis. The main problem with this approach was that the size of the CTMC grew exponentially with the number of processors in the VAXcluster system. The largeness posed the following challenges: (1) the capability of the software to solve models with thousands of states for VAXclusters with n > 5 processors, and (2) the problem of actually generating the state space. In the next section, we address these two drawbacks.

11.4.3.2 Approximate Availability Model for the Processing Subsystem In this section, we discuss a VAXcluster availability model that avoids the largeness associated with a Markov model. To reduce the complexity of a

large system, Ibe et al. [9] developed a two-level hierarchical model. The bottom

level is a homogeneous CTMC, and the top level is a combinatorial model. This

top-level model was represented as a network of diodes (or three-state devices).

The approximate availability model developed for the analysis made the following assumptions [9]:

1. The behavior of each processor was modeled by a homogeneous CTMC, under the assumption that the processor does not break the quorum rule. This assumption is justified by the fact that the probability of VAXcluster failure due to loss of quorum is relatively low.

2. Each processor has an independent repairman. This assumption is justified because the authors observed that the MTBF was large compared with the MTTR.

These assumptions allowed the authors to decompose the n-processor VAXcluster

into n independent subsystems. Further, the states of the CTMC for the individual

processors were classified into the following three superstates:

1. X = the set of states in which the processor is up;
2. Y = the set of states in which the cluster is down due to a processor failure; and
3. Z = the set of states in which the processor is down but the cluster is up.

Ibe et al. [9] compared the superstates with the three states of a diode. The three

states X, Y, and Z represent the following states of the diode: the up state, the short-circuit state, and the open-circuit state, respectively. Then the availability, A, of the VAXcluster was defined as follows [9]:

A = P[at least one processor is in superstate X and none is in superstate Y].

Let n_X, n_Y, and n_Z denote the number of processors in superstates X, Y, and Z, respectively. Let P_X, P_Y, and P_Z denote the probabilities that a processor is in superstate X, Y, and Z, respectively. Ibe et al. [9] thus defined the availability A_n of an n-processor VAXcluster as follows:

A_n = \sum_{n_X = 1}^{n} \frac{n!}{n_X! \, 0! \, (n - n_X)!} P_X^{n_X} P_Y^{0} P_Z^{n - n_X} = \sum_{n_X = 0}^{n} \binom{n}{n_X} P_X^{n_X} P_Z^{n - n_X} - \frac{n!}{0!(n-0)!} P_X^{0} P_Z^{n} = (P_X + P_Z)^{n} - P_Z^{n}.    (11.12)
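Eq. (11.12) reduces the whole cluster computation to the three per-processor superstate probabilities, as this sketch shows; the probability values below are hypothetical.

```python
# Sketch: diode-superstate approximation of Eq. (11.12). P_X (processor up),
# P_Y (processor down and cluster down), and P_Z (processor down, cluster up)
# come from the per-processor CTMC and satisfy P_X + P_Y + P_Z = 1.
def cluster_availability(p_x, p_z, n):
    # No processor in superstate Y, and at least one in superstate X:
    return (p_x + p_z) ** n - p_z ** n

print(cluster_availability(p_x=0.9995, p_z=0.0004, n=4))  # hypothetical values
```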

By simply varying the number of processors n in the above equation, Ibe et al. could analyze different VAXcluster configurations. The main drawbacks of this approach are the approximate nature of the solution, as opposed to an exact solution, and the need to make simplifying assumptions, one of them being an independent repairman for each processor. In the next section, we illustrate another approximation technique to deal with large subsystems.

11.4.4 Availability Model for the VAXcluster Storage Subsystem

We next discuss a novel availability model for VAXclusters with large storage

subsystems. A fixed-point iteration scheme has been used [17] over a set of

CTMC submodels. The decomposition of the model into submodels controlled

the state space explosion, and the iteration modeled the repair priorities between

the different storage components. The model considered configurations with

shadowed (commonly known as mirrored) disks and characterized the system

along with application level recovery. In Fig. 11.12, we show the block diagram

of an example storage system configuration, which will be used to demonstrate

the technique.

The configuration shown consists of two HSCs and a set of disks. The disks are further classified into two system disks and two application disks. The operating system resides on the system disks, and the user accounts and other application software reside on the application disks. Furthermore, it is assumed that the disks are shadowed (commonly referred to as mirrored) and dual pathed and ported between the two HSCs [17]. A disk dual pathed between two HSCs can be accessed clusterwide in


a coordinated way through either HSC. In case one of the HSCs fails, a dual-ported disk can be accessed through the other HSC after a brief failover period.

Figure 11.12 Reliability block diagram for the storage system.

We now discuss the sequence of approximation models developed to compute the

availability of the storage system in Fig. 11.12. The first model assumed that each

component in the block diagram has its own repair facility. We assumed that the

repair time is a two-stage hypoexponentially distributed random variable with the

first phase being the travel time and the second phase being the actual repair

time. This model can be solved as a pure series-parallel availability model to

compute the availability of the storage system, similar to the solution of the RBD

in Section 11.4.1.

In the second, improved model, we removed the assumption of independent repair. Instead, it is assumed that a repair facility is shared within each subsystem. The storage system is now modeled as a two-level hierarchical model. The

bottom level consists of three independent CTMC models, namely, HSC, SDisk,

and ADisk, representing the HSC, system disk, and application disk subsystems,

respectively. The top level consists of a reliability block diagram representing a

series network of the three subsystems. In Fig. 11.13(a), (b), (c), we show the

CTMC models of the three subsystems. The reliability block diagram at the top

level is shown in Fig. 11.14.

The states of the CTMC models adopt the following convention:

- state nX represents that n components of the subsystem are operational, where n can take the values 0, 1, or 2; and
- state TnX represents that the field service has arrived and that (n − 1) components of the subsystem are operational, with the rest under repair.

In the above notation, the value of X is H, S, or A, where H is associated with the HSC subsystem model, S with the system disk subsystem model, and A with the application disk subsystem model. The steady-state availability of the storage

subsystem in Fig. 11.14 is given by

A = A_H \cdot A_S \cdot A_A.    (11.13)

Figure 11.13 CTMC models with shared repair within a subsystem: (a) HSC subsystem, (b) SDisk subsystem, (c) ADisk subsystem.

A_X is the availability of subsystem X and is given by

A_X = P_{2X} + P_{1X} + P_{T_{2X}},    (11.14)

where P_{iX} and P_{T_{iX}} are the steady-state probabilities that the Markov chain is in state iX and state T_{iX}, respectively.

In the third approximation, we took into account disk reload and system recovery.

When a disk subsystem experiences a failure, data on the disk may be corrupted or

lost. After the disk is repaired, the data are reloaded onto the disk from an external

source, such as a backup disk or tape. Although the reload is a local activity of a disk

subsystem, recovery is a global system-wide activity. This behavior is incorporated

in the Markov models of Fig. 11.15(a), (b), (c) as follows. The HSC Markov model

is enhanced by including application recovery states R2H and R1H after the loss of

both the HSCs in the HSC subsystem. The system disk Markov model is extended

by incorporating reload states L2S and L1S and application recovery states R2S and

R1S. The reload followed by application recovery starts immediately after the first disk is repaired.

Figure 11.14 Top-level reliability block diagram for the storage subsystem.

Figure 11.15 CTMC models: (a) system recovery included for the HSC subsystem, (b) disk reload and system recovery included for the SDisk subsystem, (c) disk reload and system recovery included for the ADisk subsystem.

We further assume that a component could suffer failures during a reload and/or recovery. The application disk Markov model is extended, similar to the system disk model, by including reload states L2A and L1A and recovery states R2A

and R1A. The expression for the steady-state availability of the storage subsystem is

similar to the expression obtained in the second approximation.

In the fourth approximation, the assumption of an independent repair facility for

each subsystem is eliminated. In this approximation, the repair facility is shared

between subsystems, and, when more than one component is down, the following

repair priority is assumed: (1) any subsystem with all failed components is repaired

first; otherwise, (2) an HSC is repaired first, system disk second, and application disk

third. This repair priority scheme does not change the Markov model for the HSC subsystem, but it changes the models for the system and application disk subsystems.

The system disk has the second-highest priority; hence, the system disk repair rate μ_D is slowed down by multiplying it by P_1, the probability that both HSCs are operational, given that field service is present and the system is not in a recovery mode. P_1 is given by

P_1 = \frac{P_{2H}}{P_{2H} + P_{T_{1H}} + P_{T_{2H}}}.    (11.15)

It was previously assumed that a component can be repaired during recovery [17]. The system disk repair rate μ_D from the recovery states is thus slowed down by multiplying it by P_2, where

P_2 = \frac{P_{R_{2H}}}{P_{R_{1H}} + P_{R_{2H}}}.    (11.16)

Here, P_{R_{nH}} (n = 1, 2) are the steady-state probabilities of the HSC recovery states.

The application disk has the lowest repair priority, which is enforced by probabilistically slowing down the repair rate. The repair rate from the nonrecovery states is slowed down by multiplying μ_D by P_3, where

P_3 = \frac{P_{2H}}{A} \cdot \frac{P_{2S}}{B}.    (11.17)

Here, A = P_{2H} + P_{T_{1H}} + P_{T_{2H}} and B = P_{2S} + P_{T_{1S}} + P_{T_{2S}} + P_{L_{1S}} + P_{L_{2S}}. P_3 thus expresses the probability that both HSCs are operational, given that the HSC subsystem is not in a recovery state or a state with fewer than two HSCs operational, and that both system disks are operational, given that the system disk subsystem is in a nonrecovery state or a state with more than one system disk up. The steady-state availability is computed as in the first approximation.

In the above approximations, we included the field service travel time for each subsystem. In the real world, if a field service person is already present and repairing a component in one subsystem, he or she would respond to a failure in another subsystem; thus, in this case, the travel time should not be included twice. Also, the field service would follow the repair priority described above. The Markov model for each


subsystem can be modified by iteratively checking for the presence of the field service person in the other two Markov models. The field service person is assumed to wait on site until reload and recovery are completed in the SDisk and ADisk subsystems and until recovery is completed in the HSC subsystem.

The HSC subsystem model is extended as follows. The rate of each transition due to a component failure is probabilistically split using the variable y_1 (or 1 − y_1). The probability that the field service is present, repairing a component in either of the two disk subsystems, is

y_1 = (1 - P_{2S} - P_{1S} - P_{0S}) + (1 - P_{2A} - P_{1A} - P_{0A}) - (1 - P_{2S} - P_{1S} - P_{0S})(1 - P_{2A} - P_{1A} - P_{0A}).    (11.18)

The initial value of y_1 is assumed to be 0 in the first iteration; the value of y_1 computed above is then used in the next iteration.

The system (application) disk subsystem model is extended as follows. The rate of every transition due to a component failure that occurs in the absence of the repair person in the system (application) disk subsystem is multiplied by y_2 (or 1 − y_2). The expression for y_2 is similar to the expression for y_1, except that S (A) is replaced by H. This takes into account that the field service may be present in the HSC and/or application (system) disk subsystem.
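Putting the pieces together, the fixed-point iteration can be sketched as follows; solve_hsc, solve_sdisk, and solve_adisk are hypothetical submodel solvers (returning steady-state probabilities under keys 'P2', 'P1', 'P0' and the subsystem availability under 'A'), and the coupling follows Eq. (11.18), with y_1 initialized to 0 as in the text.

```python
# Sketch: fixed-point iteration over the three CTMC submodels. The solver
# functions are hypothetical stand-ins for the submodels of [17].
def storage_availability(solve_hsc, solve_sdisk, solve_adisk,
                         tol=1e-9, max_iter=100):
    y1 = 0.0  # initial value per the text
    for _ in range(max_iter):
        hsc = solve_hsc(y1)              # HSC submodel, split by y1
        sdisk = solve_sdisk(hsc)         # disk submodels use y2 derived from HSC
        adisk = solve_adisk(hsc, sdisk)
        # Probability that field service is busy in either disk subsystem:
        fs_s = 1 - sdisk['P2'] - sdisk['P1'] - sdisk['P0']
        fs_a = 1 - adisk['P2'] - adisk['P1'] - adisk['P0']
        y1_new = fs_s + fs_a - fs_s * fs_a   # Eq. (11.18)
        if abs(y1_new - y1) < tol:
            break
        y1 = y1_new
    return hsc['A'] * sdisk['A'] * adisk['A']   # Eq. (11.13)
```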

In a similar manner, the next approximation refined the model by taking into

account the global nature of system recovery. That is, if a recovery is ongoing in

one subsystem, the other two subsystems are forced to go into recovery. The

approximated effect of global recovery is achieved with an iterative scheme that

allows for interaction between the submodels. The final approximation modified only the HSC subsystem model, to incorporate the effect of an HSC failover, as shown in Fig. 11.16. Failover is the procedure of switching to an alternate path or component after the failure of a path or a component [29]. During the HSC failover period, all the disks are switched over to the operational HSC.

In state 2H, instead of a single failure transition labeled 2λ_H, we now have three failure transitions. If the primary HSC fails, the model moves from state 2H to state PFAIL with rate λ_H, and PFAIL transitions to state 1H, after a failover to the secondary HSC, with rate μ_FD. In state 2H, if the failure of the secondary HSC is detected, which happens with probability P_det, a transition to state 1H occurs with rate P_det λ_H; if it is not detected, a transition occurs with rate (1 − P_det)λ_H to state SFAIL. The steady-state availability of the HSC subsystem is then given by

A_{HSC} = P_{2H} + P_{1H} + P_{T_{2H}} + P_{SFAIL}.    (11.19)

The steady-state availability of the storage subsystem is given by Equation (11.13). After various experiments, it was observed [17] that the storage downtime is more sensitive to the detection of a secondary HSC failure than to the average failover time.


11.4.5 SPN Availability Model

In this section, we discuss a VAXcluster availability model that tolerates largeness and automates the generation of large Markov models. Ibe et al. [8] used generalized stochastic Petri nets to model VAXclusters. These authors used the software tool SPNP [4] to generate and solve the Markov model underlying the SPN. In fact, the SPN model used in [8] allows extensions that permit specifications at the net level; hence, the resulting model is a stochastic reward net.

Figure 11.17 shows a partial SPN model of the VAXcluster system. The details of the entire model of Ibe et al. [8] are beyond the scope of this chapter. The place PUP with N tokens represents the initial condition that all N processors are up. The processors can suffer a permanent or an intermittent failure, represented by the timed transitions tPERM and tINT, respectively. The firing rates of the transitions tPERM and tINT are marking dependent. These rates are expressed as #(PUP, i)λ_P and #(PUP, i)λ_I, respectively, where #(PUP, i) represents the number of tokens in place PUP in any marking i. The place PPERM (PINT) represents that a permanent (intermittent) failure has occurred. When a permanent (intermittent) failure occurs, it will be covered with probability c (k) and uncovered with probability 1 − c (1 − k). Covered permanent and intermittent failures are represented by the immediate transitions tPC and tIC, respectively. Uncovered permanent and intermittent failures are represented by the immediate transitions tPU and tIU, respectively. A failure is



Figure 11.17 Partial SPN model of the VAXcluster system.

A failure is considered to be covered only if the number of operational processors is
at least l, the quorum. The input and output arcs with multiplicity l from and to the
place PUP ensure quorum maintenance. In Fig. 11.17, the block labeled "Cluster
Reconfiguration Block" represents a group of reconfiguration places in the complete
model [8]. In addition, the following behavior is represented by the SPN.

. A covered permanent failure is not possible while the cluster is being rebooted

after an uncovered failure (token in either PUIF or PUPF). This is represented by
inhibitor arcs from PUIF and PUPF to the immediate transition tPC.

. It is assumed that a failure does not occur while the cluster is being reconfi-

gured. This is represented by the inhibitor arcs from the “Cluster Reconfigura-

tion Block” to tPERM and tINT .

. A processor under reboot can suffer a permanent failure. This is represented by

the fact that when there is a token in PREB, both the transitions tREB and tIP are
enabled.

The steady-state availability is given by

    A = Σ_{i∈τ} r_i p_i,   (11.20)

where τ is the set of tangible markings, p_i is the steady-state probability of
marking i, and

    r_i = 1,  if (#(PUP, i) ≥ l) ∧ [#(cluster reboot places, i) < 1 ∨ #(PUIF, i) < 1]
              ∧ [#(Cluster Reconfiguration Block places, i) < 1];
    r_i = 0,  otherwise.
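A sketch of Equation (11.20) as a computation over tangible markings follows. The
dict-based marking representation and the aggregate names REBOOT and RECONFIG
(for the cluster-reboot places and the Cluster Reconfiguration Block places) are
assumptions for illustration; SPNP itself expresses r_i as a reward function in C:

    def reward(m, l):
        """Reward rate r_i for a tangible marking m (dict: place -> tokens)."""
        up = m["PUP"] >= l                        # quorum of processors up
        no_reboot = m.get("REBOOT", 0) < 1 or m.get("PUIF", 0) < 1
        no_reconfig = m.get("RECONFIG", 0) < 1    # no reconfiguration ongoing
        return 1.0 if up and no_reboot and no_reconfig else 0.0

    def availability(tangible, l):
        """tangible: iterable of (marking, steady-state probability) pairs."""
        return sum(reward(m, l) * p for m, p in tangible)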

11.4.6 SPN Availability Model for Heterogeneous VAXclusters

We now present an SPN model that considers realistic VAXcluster configurations
that include uniprocessors and multiprocessors [18]. The heterogeneity in the
model allowed each multiprocessor to contain a varying number of processors, and
each VAX in the VAXcluster to belong to a different VAX family. Henceforth,

we refer to the SPN model as the heterogeneous model. Throughout this section

we refer to a single multiprocessor system as a machine, which consists of two

components: one or more processors and a platform. The platform consists of the

memory, power, cooling, console interface module, and I/O channel adapters

[18]. As in the uniprocessor case in the above sections, we depict covered and

uncovered permanent and intermittent failures. In addition, we depict the following

failure/recovery behavior for a multiprocessor. A processor failure in a machine

requires a machine reboot to map the faulty processor offline. The entire machine

is not operational during the processor repair and the reboot following the repair.

Before and after every machine reboot, a cluster reconfiguration maps the machine


out and into the cluster. The platform components of a machine are single points of

failure for that machine [18]. In addition, the following features are included:

. The option of including the quorum disk: a quorum disk acts as a virtual node in

the VAXcluster, casting a “tie-breaker” vote.

. The option of including failure/recovery behavior like unsuccessful repair and

unsuccessful reboots: an unsuccessful repair is modeled by introducing a faulty

repair. Faulty repair means that diagnostics called out a wrong FRU (field

replaceable unit) or that the right FRU was requested but is DOA (dead on

arrival). Unsuccessful reboot means a processor reboot did not complete and

has to be repeated.

. The option of including detailed cluster reconfiguration information: for

example, if quorum has actually been lost, the time to form a cluster is
longer than the usual reconfiguration time.

The overall model structure is shown in Fig. 11.18. The VAXcluster is modeled

as a 1 × N array, where each plane represents a subnet of a machine consisting of Mi

processors. The place PUP in plane i contains Mi tokens and represents the initial

condition that the machine is up with Mi processors. The cluster reconfiguration

subnet consists of a single place clust_reconfig, which initially contains a single

token. Whenever a machine i initiates a reconfiguration, the subnet associated

with it flushes the token out of clust_reconfig and returns the token after the recon-

figuration. This ensures that all machines in the cluster participate in a cluster state

transition. Similarly, the quorum disk is treated as a separate subnet that interacts

with each of the N subnets. The number of processors in a machine is varied by vary-

ing the number of tokens Mi in each subnet.

Figure 11.18 Model structure for the VAXcluster system.

The model allows the machines in the cluster to belong to different VAX series,
because every subnet handles the failure/recovery behavior of a machine separately.
If Mi = 1 in a subnet, then the machine

follows the failure/recovery behavior of a uniprocessor. The detailed subnet associ-

ated with a machine is beyond the scope of this chapter and has been previously

presented [18].

The heterogeneous SPN model included various extensions like variable arcs,
enabling functions or guards, rate type functions, etc., and hence is a stochastic
reward net [1]. For example, when an intermittent covered failure transition
fires, the rate type function for the rate λIC is defined as

    if (Mach_UP + (mark(Pqdup) × NQV) > QU)
        λIC = λplatint + (mark(PUP) × λI)
    else
        λIC = k × (λplatint + (mark(PUP) × λI)),

where Mach_UP represents the number of machines up in the cluster, λplatint is the
platform intermittent failure rate, λI is the processor intermittent failure rate, and k
is the probability of a covered intermittent failure. A marking with mark(Pqdup) ≥ 1
implies the quorum disk is up, and NQV is the number of quorum votes assigned
to the quorum disk. The number of votes needed for the VAXcluster to be oper-
ational is given by QU. On the other hand, the rate type function for the intermittent
uncovered failure rate is λIU = (1 − k) × (λplatint + (mark(PUP) × λI)).
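A Python rendering of these rate type functions may make their structure clearer;
mark() is an assumed accessor into the current marking, since SPNP itself defines
rate functions in C:

    def rate_IC(mark, Mach_UP, NQV, QU, lam_platint, lam_I, k):
        """Marking-dependent rate of the covered intermittent failure transition."""
        base = lam_platint + mark("PUP") * lam_I
        votes = Mach_UP + mark("Pqdup") * NQV
        return base if votes > QU else k * base

    def rate_IU(mark, lam_platint, lam_I, k):
        """Marking-dependent rate of the uncovered intermittent failure transition."""
        return (1.0 - k) * (lam_platint + mark("PUP") * lam_I)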

The heterogeneous VAXcluster model is available if:

1. mark(clust_reconfig) = 1, that is, a cluster reconfiguration is not in progress, and

2. NU + (mark(Pqdup) × NQV) ≥ QU,

where NU is obtained using the following algorithm:

    NU = 0
    for i = 1, ..., N:
        if (mark(Platform_failure, i) = 0) AND (mark(PUP, i) > 0) AND
           (repair and reboot transitions for machine i are disabled) then
            NU = NU + 1

That is, NU counts the machines that are up: a machine is up if no platform failure
has occurred, at least one processor is up, and neither repair nor reboot is in progress.
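Combining the two availability conditions with the NU loop gives the following
sketch; mark(place[, i]) and repair_or_reboot_active(i) are assumed accessors into
the SRN marking, not SPNP calls:

    def cluster_available(mark, N, NQV, QU, repair_or_reboot_active):
        if mark("clust_reconfig") != 1:       # condition 1: reconfiguration ongoing
            return False
        NU = sum(1 for i in range(1, N + 1)
                 if mark("Platform_failure", i) == 0     # platform up
                 and mark("PUP", i) > 0                  # at least one processor up
                 and not repair_or_reboot_active(i))     # no repair or reboot
        return NU + mark("Pqdup") * NQV >= QU            # condition 2: quorum held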

This heterogeneous model was evaluated using the SPNP package [4], which solves
the SPN by analyzing the underlying CTMC. We resolved the largeness problem
encountered in the previous study [18] by using a technique based on truncation of
the state space [13]. The state space cardinality of the CTMC isomorphic with the
heterogeneous model increases with the number of machines in the VAXcluster, as
well as with the number of processors in each machine. The state-space reduction
technique is implemented by specifying a truncation level K for processor failures
in the model; the maximum value of K is M1 + M2 + ... + MN. The value K specifies
that the reachability graph, and hence the corresponding CTMC, be generated only
up to K processor failures. This is implemented in the model by means of an enabling
function associated with all the failure transitions. The enabling function disables all
the failure transitions if

    Σ_{i=1}^{N} (Mi − mark(PUP, i)) ≥ K

(a sketch of this guard follows the list below). This technique is justified as follows:

. In real systems, most of the time the system has a majority of its components

operational [13]. This means the probability mass is concentrated on a rela-

tively small number of states in comparison to the total number of states in

the model.

. We observed the impact of varying the truncation level on the availability

measures for an example heterogeneous cluster and concluded that the effect

was minimal.
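A sketch of the shared enabling function, with M the list [M1, ..., MN] and mark()
an assumed marking accessor:

    def failure_transitions_enabled(mark, M, K):
        """Shared guard: once K processor failures have accumulated across
        the cluster, all failure transitions are disabled, truncating the
        reachability graph (and hence the CTMC) at level K."""
        failed = sum(M[i - 1] - mark("PUP", i) for i in range(1, len(M) + 1))
        return failed < K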

We used the heterogeneous model to evaluate measures associated not only with
standard system availability but also with system reliability and task completion.
Among the system reliability measures, we evaluated the frequency of failures and
the frequency of disruptive outages, where a disruptive outage is any outage that
exceeds the user's specified tolerance limit. The task completion measures evaluated
the probability that an application is interrupted during its execution period.

In this chapter, we discuss an example measure from each of the three classes of

measures [18]:

. Mean Cluster Downtime D in minutes per year: This is a system availability
measure and represents the average amount of time the cluster is not operating
during a one-year observation period. In terms of the steady-state cluster
availability A, D = (1 − A) × 8760 × 60.

. Frequency of Disruptive Reconfigurations (FDR): This is a system reliability
measure and represents the mean number of reconfigurations that exceed the
specified tolerance duration during the one-year observation period. We
evaluate FDR as

    FDR = μrg × e^(−thresh × μrg) × Prg × 8760 × 60,

where μrg is the cluster reconfiguration (in, out, or formation) rate, thresh is the
specified tolerance duration on the reconfiguration times (in matching time
units), and Prg is the probability that a reconfiguration is in progress.

. Probability of Task Interruption under Pessimistic Assumption (Prob_Psm):
This is a task completion measure. It is the probability that a task that initially
finds the system available, and that needs x hours for execution, is interrupted
by any failure in the cluster. The assumption is pessimistic because the system
is not allowed to tolerate any interruption, including the brief reconfiguration
delays. The expression for Prob_Psm is

    Prob_Psm = (1/A) Σ_{j∈Upstate} [1.0 − e^(−x Σ_{k=1}^{N} (c_k,j (λp,k + λi,k) + (λplt,k + λplt_int,k)))] P_j,

where, for machine k in operational state j, c_k,j is the number of operational
processors, P_j is the probability of being in state j, λp,k (λi,k) is the processor
permanent (intermittent) failure rate, λplt,k (λplt_int,k) is the platform permanent
(intermittent) failure rate, and A is the cluster availability.
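The three measures can be computed directly from the model outputs, as in the
following sketch; the unit conventions and the (P_j, rate_j) representation of up
states are assumptions for illustration:

    import math

    MIN_PER_YEAR = 8760 * 60   # minutes in a one-year observation period

    def downtime(A):
        """Mean cluster downtime D in minutes per year."""
        return (1.0 - A) * MIN_PER_YEAR

    def fdr(mu_rg, thresh, P_rg):
        """Frequency of disruptive reconfigurations; mu_rg and thresh are
        assumed to be in per-minute and minute units, respectively."""
        return mu_rg * math.exp(-thresh * mu_rg) * P_rg * MIN_PER_YEAR

    def prob_psm(upstates, A, x):
        """upstates: assumed list of (P_j, rate_j) pairs, where rate_j is the
        sum over machines k of c_kj*(lp_k + li_k) + (lplt_k + lplt_int_k);
        x is the task execution time, in units matching the rates."""
        return sum(P_j * (1.0 - math.exp(-rate_j * x))
                   for P_j, rate_j in upstates) / A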

Previously [18], we used these three measures for a particular configuration to
study the impact of truncation. Table 11.4 presents the number of states and the
number of arcs of the underlying CTMC at each truncation level, together with the
three measures. From these results, we conclude that the state space of the SPN
model for the heterogeneous cluster can be truncated without impacting the results.

TABLE 11.4 Effect of State Truncation on Output Measures

Trunc.  No. of   No. of    Mean Cluster        Freq. of Disruptive        Prob. of Task
Level   States   Arcs      Downtime (min/yr)   Reconfig. (thresh = 10 s)  Interruption (t = 1,000 s)
1          348       948   12.91732078         9.96432050                 0.00032767
2        2,088     7,110   13.00257283         9.96751584                 0.00032767
3        6,394    26,686   13.00258549         9.96751604                 0.00032767
4       13,236    66,596   13.00258549         9.96751604                 0.00032767
5       20,728   122,746   13.00258549         9.96751604                 0.00032767

11.5 CONCLUSION

We started by briefly discussing various non-state space and state space availability

and performance modeling approaches. Using the problem of deciding the optimal

number of processors in an n-component parallel multiprocessor system, we showed

the limitations of a pure availability or pure performance model and emphasized the

need for a composite availability and performance model. Finally, we took a case

study from a corporate environment and demonstrated an application of the tech-

niques in a real situation. Several approximations and assumptions were made

and validated before use, in order to deal with the size and complexity of the

models encountered.

REFERENCES

1. D. R. Avresky. Hardware and Software Fault Tolerance in Parallel Computing Systems.

Ellis Horwood Ltd., New York, 1992.

2. E. Balkovich, P. Bhabhalia, W. Dunnington, and T. Weyant. VAXcluster availability
modeling. Digital Technical Journal, no. 5, pp. 69–79, September 1987.

3. G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov

Chains—Modeling and Performance Evaluation with Computer Science Applications.

John Wiley & Sons, 1998.



4. G. Ciardo, J. Muppala, and K. S. Trivedi. SPNP: stochastic Petri net package. Proc. Third

Int. Workshop on Petri Nets and Performance Models (PNPM89), Kyoto, Japan,

pp. 142–151, 1989.

5. S. A. Doyle and J. B. Dugan. Dependability assessment using binary decision diagrams.

Proc. 25th Intl. Symposium on Fault Tolerant Computing, pp. 249–258, 1995.

6. S. A. Doyle, J. B. Dugan, and M. Boyd. Combinatorial models and coverage: a binary

decision diagram (BDD) approach. Proc. Annual Reliability and Maintainability

Symposium, pp. 82–89, 1995.

7. K. Goseva-Popstojanova and K. S. Trivedi. Stochastic modeling formalisms for depend-

ability, performance and performability. Performance Evaluation—Origins and Direc-

tions, Lecture Notes in Computer Science (edited by G. Haring, C. Lindemann, and

M. Reiser). Springer Verlag, pp. 385–404, 2000.

8. O. Ibe, A. Sathaye, R. Howe, and K. S. Trivedi. Stochastic Petri net modeling of

VAXcluster system availability. Proc. Third International Workshop on Petri Nets and

Performance Models (PNPM89), Kyoto, Japan, pp. 112–121, 1989.

9. O. C. Ibe, R. C. Howe, and K. S. Trivedi. Approximate availability analysis of

VAXcluster systems. IEEE Transactions on Reliability, vol. 38, no. 1, pp. 146–152,

April 1989.

10. N. P. Kronenberg, H. M. Levy, W. D. Strecker, and R. J. Merwood. VAXclusters: a

closely coupled distributed system. ACM Trans. Computer Systems, vol. 4, pp. 130–146,

May 1986.

11. T. Luo and K. S. Trivedi. An improved algorithm for coherent-system reliability. IEEE
Transactions on Reliability, vol. 47, no. 1, pp. 73–78, 1998.

12. J. F. Meyer. On evaluating the performability of degradable computer systems. IEEE

Transactions on Computers, vol. 29, no. 8, pp. 720–731, 1980.

13. R. Muntz, E. de Souza e Silva, and A. Goyal. Bounding availability of repairable com-

puter systems. IEEE Trans. on Computers, vol. 38, no. 12, pp. 1714–1723, December

1989.

14. J. Muppala, A. Sathaye, R. Howe, and K. S. Trivedi. Dependability modeling of a hetero-

geneous VAXcluster system using stochastic reward nets. Hardware and Software Fault

Tolerance in Parallel Computing Systems (edited by D. Avresky). Ellis Horwood Ltd.,

pp. 33–59, 1992.

15. J. Muppala and K. S. Trivedi. Numerical transient solution of finite Markovian queueing

systems. Queueing and Related Models (edited by U. N. Bhat and I. V. Basawa). Oxford

University Press, pp. 262–284, 1992.

16. R. Sahner, A. Puliafito, and K. S. Trivedi. Performance and Reliability Analysis of Com-

puter Systems: An Example-Based Approach Using the SHARPE Software Package.

Boston: Kluwer Academic Publishers, 1995.

17. A. Sathaye, K. S. Trivedi, and D. Heimann. Approximate Availability Models of the

Storage Subsystem. Technical Report, DEC., September 1988.

18. A. Sathaye, K. S. Trivedi, and R. Howe. Availability modeling of heterogeneous

VAXcluster systems: a stochastic Petri net approach. Proc. of International Conference

on Fault-Tolerant Systems, Varna, January 1990.

19. A. Satyanarayana and A. Prabhakar. New topological formula and rapid algorithm for

reliability analysis of complex networks. IEEE Transactions on Reliability, vol. 27, pp.

82–100, 1978.


20. D. Siewiorek and R. Swarz. The Theory and Practice of Reliable System Design. Digital

Press, 1982.

21. R. M. Smith, K. S. Trivedi, and A. V. Ramesh. Performability analysis: measures, an

algorithm, and a case study. IEEE Transactions on Computers, vol. 37, no. 4, pp.

406–417, April 1988.

22. H. Sun, X. Zang, and K. S. Trivedi. A BDD-based algorithm for reliability analysis of

phased-mission systems. IEEE Transactions on Reliability, vol. 48, no. 1, pp. 50–60,

March 1999.

23. L. Tomek and K. S. Trivedi. Fixed point iteration in availability modeling. In Informatik-

Fachberichte, Fehlertolerierende Rechensysteme (edited by M. Dal Cin). Berlin:

Springer-Verlag, vol. 283, pp. 229–240, 1991.

24. K. S. Trivedi, A. Sathaye, O. Ibe, and R. Howe. Should I add a processor? Proc. 23rd

Annual Hawaii Conference on System Sciences, pp. 214–221, January 1990.

25. K. S. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science

Applications (2nd ed.). New York: John Wiley & Sons, 2001.

26. B. Tuffin and K. S. Trivedi. Implementation of importance splitting techniques in stochas-

tic Petri net package. Computer Performance Evaluation: Modelling Tools and Tech-

niques; 11th International Conference; TOOLS 2000, Schaumburg, IL, (edited by

B. Haverkort, H. Bohnenkamp, and C. Smith). Lecture Notes in Computer Science

1786. Springer Verlag, 2000.

27. B. Tuffin and K. S. Trivedi. Importance sampling for the simulation of stochastic

Petri nets and fluid stochastic Petri nets. Proceedings of High Performance Computing,

Seattle, WA, April 2001.

28. X. Zang, D. Wang, H. Sun, and K. S. Trivedi. A BDD-based algorithm for analysis

of multistate components. IEEE Transactions on Computers, vol. 52, no. 12,

pp. 1608–1618, December 2003.

29. Introduction to VAXcluster Application Design. Digital Equipment Corporation, 1984.
