analysis of fault-tolerant computer systems … · a state i may be classified as preemptive-resume...

AD-A178 383 QUEUEING ANALYSIS OF FAULT-TOLERANT COMPUTER SYSTEMS (U) NORTH CAROLINA UNIV AT CHAPEL HILL CURRICULUM IN OPERATIONS R. V F NICOLA ET AL. DEC 85 UNCLASSIFIED UNC/ORSA/TR-85-iS FOSR-TR-86-0 88 F/G 9/2


Page 1: ANALYSIS OF FAULT-TOLERANT COMPUTER SYSTEMS … · A state i may be classified as preemptive-resume (prs ) or preemptive-repeat-identical (pri) as follows: A state is said tc be prs


OPERATIONS RESEARCH AND SYSTEMS ANALYSIS

UNIVERSITY OF NORTH CAROLINA

AT CHAPEL HILL

Queueing Analysis of Fault-Tolerant Computer Systems*

V.F. Nicola

V.G. Kulkarni†

K.S. Trivedi‡

Technical Report No. UNC/ORSA/TR-85-10

December 1985

*This research was supported in part by the Air Force Office of Scientific Research under grants AFOSR-84-0132 and AFOSR-84-0140, by the Army Research Office under grant DAAG-29-84-K-0045 and by the National Science Foundation under grant NSF-MCS-830200.

‡Department of Computer Science, Duke University, Durham, NC.

†Curriculum in Operations Research and Systems Analysis, University of North Carolina, Chapel Hill, NC 27514.

DISTRIBUTION STATEMENT A: Approved for public release.

AIR FORCE OFFICE OF SCIENTIFIC RESEARCH (AFSC)

NOTICE OF TRANSMITTAL TO DTIC: This technical report has been reviewed and is approved for public release IAW AFR 190-12. Distribution is unlimited.

REPORT DOCUMENTATION PAGE (DD Form 1473)

Security classification of this page: Unclassified

4. Title (and Subtitle): Queueing Analysis of Fault-Tolerant Computer Systems

5. Type of Report: Technical Report

7. Author(s): V.F. Nicola, V.G. Kulkarni, K.S. Trivedi

8. Contract or Grant Number(s): AFOSR-84-0140

9. Performing Organization Name and Address: Curriculum in Operations Research & Systems Analysis, Smith Bldg. 128A, UNC, Chapel Hill, NC 27514

11. Controlling Office Name and Address: Air Force Office of Scientific Research, Bolling Air Force Base, DC 20332

12. Report Date: December 1985

16. Distribution Statement (of this Report): Approved for public release. Distribution of this document is unlimited.

19. Key Words: Fault-Tolerant Computing, Queueing Model, Markovian failure repair.

20. Abstract:

In this paper we analyze a fault-tolerant computer system. The failure/repair behavior of the system is modelled by an irreducible continuous-time Markov chain. Jobs arrive in a Poisson fashion to the system and are serviced according to an FCFS discipline. A failure may cause the loss of the work already done on the job in service, if any; in this case the interrupted job is repeated as soon as the system is ready to deliver service. In addition to the delays due to failures and repairs, jobs suffer delays due to queueing. We present a general queueing analysis of fault-tolerant systems and study the steady-state behavior of the number of jobs in the system. As a numerical example, we consider a system with two processors subject to failures and repairs.


1. Introduction

Queueing models provide a useful tool for predicting the performance of many service systems, including computer systems, telecommunication systems, computer/communication networks and flexible manufacturing systems. Traditional queueing models predict system performance under the assumption that all service facilities provide failure-free service [7]. It must, however, be acknowledged that service facilities do experience failures and that they get repaired. Failure/repair behavior of such systems is commonly modelled separately using techniques classified under reliability/availability modeling [3]. In recent years, it has been increasingly recognized that this separation of performance and reliability/availability models is no longer adequate [13].

Two distinct approaches towards combined modeling of performance and reliability/availability have been used. In the first approach, queueing models with server breakdowns and repairs are analyzed by means of generating functions [14], supplementary variables [2], imbedded Markov process and renewal theory [6], or probabilistic [17] techniques. These efforts generally carry out an exact steady-state queueing analysis of the system in the presence of breakdowns and repairs. A transient analysis of an M/G/1 queue with server breakdown, subject to a hard-deadline constraint on response time, has been considered recently [1]. The second approach is approximate: it is assumed that the time to reach the steady state is much smaller than the times to failures/repairs. Therefore, it is reasonable to associate a performance measure (reward) with each state of the underlying Markov (or semi-Markov) model describing the failure/repair behavior of the system. Each of these performance measures is obtained from the steady-state queueing analysis of the system in the corresponding state. The resulting reward model is then analyzed for the expected values or the distributions of interesting cumulative measures of system performance [8,9,10,13].

Since the job-oriented view of performance/reliability in fault-tolerant systems is particularly important, models have been developed to derive the distribution of job completion time in a failure-prone environment. In these models, we need to consider a possible loss of work due to the occurrence of a failure, i.e., the interrupted job may be resumed or restarted upon service resumption. We have recently considered models that take into account different types of interruptions in the analysis of the job completion time [8,9,10]. These earlier results are the basis for the queueing analysis presented in this paper. Note that the job completion time analysis includes the delays due to failures and repairs, but it does not account for queueing delays. The purpose of this paper is to extend our earlier analysis so as to account for the queueing delays. In effect, we consider an exact queueing analysis of the system in order to obtain the steady-state distribution and the mean of the number of jobs in the system.

We consider a queueing model of a computer system where the jobs arrive in a Poisson fashion with rate $\lambda$. The service requirements of the incoming jobs form a sequence of independent and identically distributed random variables with common cdf $G(\cdot)$. The computer system exists in one of $n$ possible states. The state of the computer system changes with time according to an independent continuous-time Markov chain. It is assumed that this chain is irreducible. When the computer system is in state $i$ it delivers service at rate $r_i \ge 0$. A state $i$ may be classified as preemptive-resume (prs) or preemptive-repeat-identical (pri) as follows: a state is said to be prs (pri) if, upon entering that state, the work done so far on the current job is preserved (lost), and the service is resumed (restarted) in the new state. Thus the actual time required to complete a job depends in a complex way upon the service requirement of the job and the evolution of the state of the computer system. It is assumed that there is an infinite waiting room for the jobs and that the service discipline is first come first served (FCFS). Note that even though the service requirements of jobs are independent and identically distributed, the actual times required to complete these jobs are neither independent nor identically distributed, and hence the model cannot be reduced to a standard M/G/1 queue [17].

When all the states of the model describing the computer system are prs, i.e., no work is ever lost, the problem described here can be analyzed as queues in random environments (e.g., see Purdue [18]). In the present paper we carry out the analysis with the possibility of work loss, i.e., when some of the states of the underlying Markov model are prs and the remaining are pri. As loss of work due to failures and interruptions is quite a common phenomenon in fault-tolerant computer systems, the model proposed here is of obvious interest. Many of the breakdown-repair queueing models studied in the literature are special cases of the model studied here (e.g., Mitrani [14], Nicola [17], Baccelli and Trivedi [2] and others).

In the next section we first study the situation with no queueing and state some results concerning the analysis of job completion time, which are direct extensions of those given in [9]. Using these results we set up the queueing model in Section 3 and show that it has the block M/G/1 structure. Queueing models with such a structure have been studied by Neuts [15], Neuts and Lucantoni [16], Ramaswami [19] and others. We demonstrate the usefulness of our approach by performing the numerical analysis for a particular example. This is done in Section 4. Finally, conclusions and some extensions are discussed in Section 5.

2. The Completion Time Analysis of a Single Job

Consider a single job with service requirement $B$ that starts getting served at time 0. $B$ is a random variable with distribution function $G(x) = P(B \le x)$ and LST $\tilde G(\cdot)$. Let $Z(t)$ be the state of the computer system at time $t$. It is assumed that $\{Z(t), t \ge 0\}$ is an irreducible continuous-time Markov chain on $\{1,2,\ldots,n\}$ with $n \times n$ generator matrix $Q = [q_{ij}]$, where $q_{ij}$ $(i \ne j)$ is the transition rate from state $i$ to state $j$, and $q_{ii} = -\sum_{j \ne i} q_{ij}$. Thus, each row of $Q$ sums to zero. Let $T$ be the time when this job completes its service. Define, for $1 \le i,j \le n$,

$$F_{ij}(t,x) = P(T \le t,\ Z(T)=j \mid Z(0)=i,\ B=x),$$

$$\tilde F_{ij}(s,x) = E\left(e^{-sT};\ Z(T)=j \mid Z(0)=i,\ B=x\right) = \int_0^\infty e^{-st}\, dF_{ij}(t,x),$$

$$\tilde F^{*}_{ij}(s,w) = \int_0^\infty e^{-wx}\, \tilde F_{ij}(s,x)\, dx.$$

Theorems 2.1 and 2.2 below are minor extensions of theorems 2 and 3 in [8]. They treat the cases where all the states are prs or all the states are pri, respectively. The proofs are similar to those in [8] and hence are omitted.

Theorem 2.1. Let all the states be of the prs type. The double transforms $\tilde F^{*}_{ij}(s,w)$, $1 \le i,j \le n$, are given by the unique solution to

$$\tilde F^{*}_{ij}(s,w) = \frac{\delta_{ij}\, r_i}{s + q_i + r_i w} + \sum_{k \ne i} \frac{q_{ik}}{s + q_i + r_i w}\, \tilde F^{*}_{kj}(s,w), \quad 1 \le i,j \le n \qquad (2.1)$$

where $q_i = -q_{ii}$, and $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.
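The following derivation sketch may help to see where an equation of the form (2.1) comes from. It is a reconstruction from the model assumptions (prs states, $r_i > 0$, exponential sojourn in state $i$ with rate $q_i$), not the proof given in [8]: condition on the first sojourn time in state $i$, then take the Laplace transform with respect to the work requirement $x$.

```latex
% First-step analysis: either the job finishes in state i (needs x/r_i time
% units without a transition), or the chain jumps to k after time h with
% the remaining work x - r_i h preserved.
\begin{align*}
\tilde F_{ij}(s,x)
  &= \delta_{ij}\, e^{-(s+q_i)x/r_i}
   + \sum_{k\ne i} q_{ik}\int_0^{x/r_i} e^{-(s+q_i)h}\,
     \tilde F_{kj}(s,\,x-r_i h)\,dh, \\
\int_0^\infty e^{-wx}\,\delta_{ij}\,e^{-(s+q_i)x/r_i}\,dx
  &= \frac{\delta_{ij}\,r_i}{s+q_i+r_i w}, \\
\int_0^\infty e^{-wx}\int_0^{x/r_i} q_{ik}\,e^{-(s+q_i)h}\,
     \tilde F_{kj}(s,\,x-r_i h)\,dh\,dx
  &= \frac{q_{ik}}{s+q_i+r_i w}\,\tilde F^{*}_{kj}(s,w).
\end{align*}
```

Adding the transformed terms over $k \ne i$ gives a linear system in the unknowns $\tilde F^{*}_{kj}(s,w)$ for each fixed $j$.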

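Theorem 2.2. Let all the states be of the pri type. The transforms $\tilde F_{ij}(s,x)$, $1 \le i,j \le n$, are given by the unique solution to

$$\tilde F_{ij}(s,x) = \delta_{ij}\, e^{-(s+q_i)x/r_i} + \sum_{k \ne i} \frac{q_{ik}}{s+q_i}\left(1 - e^{-(s+q_i)x/r_i}\right) \tilde F_{kj}(s,x), \quad 1 \le i,j \le n \qquad (2.2)$$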

Next we consider the mixed case. Let $S \subset \{1,2,\ldots,n\}$ be the set of pri states, and $\bar S = \{1,2,\ldots,n\} - S$ be the set of prs states. Suppose $Z(0) \in \bar S$ and let

$$U = \min\{t \ge 0 : Z(t) \in S\}.$$

Define

$$V = \min\{T, U\}.$$

For $i \in \bar S$, define

$$M_{ij}(t,x) = P(V \le t,\ Z(V)=j \mid Z(0)=i,\ B=x)$$

with the corresponding LST $\tilde M_{ij}(s,x)$, and the double transform

$$\tilde M^{*}_{ij}(s,w) = \int_0^\infty e^{-wx}\, \tilde M_{ij}(s,x)\, dx.$$

Note that

$$M_{ij}(t,x) = P(T \le t,\ T < U,\ Z(T)=j \mid Z(0)=i,\ B=x), \quad \text{if } j \in \bar S,$$

and

$$M_{ij}(t,x) = P(U \le t,\ U < T,\ Z(U)=j \mid Z(0)=i,\ B=x), \quad \text{if } j \in S.$$

The following theorem is an extension of propositions 5.1 and 5.2 in [9]; the proof follows along the same lines, and hence is omitted.

Theorem 2.3. Let $S$ be the set of pri states and $\bar S$ be the set of prs states. Then

(i) The double transforms $\tilde M^{*}_{ij}(s,w)$, $i,j \in \bar S$, satisfy the following equations

$$\tilde M^{*}_{ij}(s,w) = \frac{\delta_{ij}\, r_i}{s+q_i+r_i w} + \sum_{k \in \bar S,\, k \ne i} \frac{q_{ik}}{s+q_i+r_i w}\, \tilde M^{*}_{kj}(s,w). \qquad (2.3)$$

(ii) The double transforms $\tilde M^{*}_{ij}(s,w)$, $i \in \bar S$, $j \in S$, satisfy the following equations

$$\tilde M^{*}_{ij}(s,w) = \frac{q_{ij}}{w\,(s+q_i+r_i w)} + \sum_{k \in \bar S,\, k \ne i} \frac{q_{ik}}{s+q_i+r_i w}\, \tilde M^{*}_{kj}(s,w). \qquad (2.4)$$

Equations (2.3) and (2.4) have unique solutions.

The following theorem gives a method for determining $\tilde F_{ij}(s,x)$, $1 \le i,j \le n$, which is the main result of this section. The proof of this theorem, being similar to those of theorems 5.1 and 5.2 of [9], is omitted.

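Theorem 2.4. Let the states in $S$ be pri and those in $\bar S$ be prs. Then

(i) The LSTs $\tilde F_{ij}(s,x)$, $i \in S$, satisfy the following equations

$$\tilde F_{ij}(s,x) = g_{ij}(s,x) + \sum_{l \in S} h_{il}(s,x)\, \tilde F_{lj}(s,x), \quad i \in S,\ 1 \le j \le n \qquad (2.5)$$

where

$$g_{ij}(s,x) = \begin{cases} \delta_{ij}\, e^{-(s+q_i)x/r_i} + \displaystyle\sum_{k \in \bar S} \frac{q_{ik}}{r_i} \int_0^x e^{-(s+q_i)h/r_i}\, \tilde M_{kj}(s,x-h)\, dh, & \text{if } r_i > 0 \\[2ex] \displaystyle\sum_{k \in \bar S} \frac{q_{ik}}{s+q_i}\, \tilde M_{kj}(s,x), & \text{if } r_i = 0 \end{cases}$$

with the terms involving $\tilde M_{kj}$ present only for $j \in \bar S$, and

$$h_{il}(s,x) = \begin{cases} (1-\delta_{il})\dfrac{q_{il}}{s+q_i}\left(1 - e^{-(s+q_i)x/r_i}\right) + \displaystyle\sum_{k \in \bar S} \frac{q_{ik}}{r_i} \int_0^x e^{-(s+q_i)h/r_i}\, \tilde M_{kl}(s,x-h)\, dh, & \text{if } r_i > 0 \\[2ex] (1-\delta_{il})\dfrac{q_{il}}{s+q_i} + \displaystyle\sum_{k \in \bar S} \frac{q_{ik}}{s+q_i}\, \tilde M_{kl}(s,x), & \text{if } r_i = 0 \end{cases} \qquad l \in S.$$

(ii) The LSTs $\tilde F_{ij}(s,x)$, $i \in \bar S$, are given by

$$\tilde F_{ij}(s,x) = \tilde M_{ij}(s,x) + \sum_{l \in S} \tilde M_{il}(s,x)\, \tilde F_{lj}(s,x), \quad i \in \bar S,\ 1 \le j \le n, \qquad (2.6)$$

where the first term $\tilde M_{ij}(s,x)$ is present only for $j \in \bar S$.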

Thus, for the mixed case, the job completion time is completely described by the LSTs $\tilde F_{ij}(s,x)$ given by theorem 2.4. These expressions are essential for the queueing analysis in the next section.

3. The Queueing Model

In this section we perform the steady-state analysis of the queueing model described in the introduction. Let $X(t)$ be the number of jobs in the system (including any in service) at time $t$. Let $\tau_\nu$ be the time when the $\nu$-th job is completed. Assume that $\tau_0 = 0$ and a new job starts service at time 0. Let $X_\nu = X(\tau_\nu+)$ and $Z_\nu = Z(\tau_\nu+)$ be the number of jobs and the state of the system, respectively, immediately after the $\nu$-th job completion. Due to the Poisson arrivals and the Markov nature of $\{Z(t), t \ge 0\}$, it is clear that $\{(X_\nu, Z_\nu), \nu \ge 0\}$ is a discrete-time Markov chain with state space $\{0,1,2,\ldots\} \times \{1,2,\ldots,m\}$, where $m = |\{i : r_i > 0\}|$, $m \le n$. (Note that a job may complete in state $i$ only if $r_i > 0$.) In this section we study the limiting distribution of $\{(X_\nu, Z_\nu), \nu \ge 0\}$. The relevance of this limiting distribution follows from the fact that jobs arrive singly to the system and depart singly from the system, and that the arrival process is Poisson. Therefore,

$$\lim_{t \to \infty} P(X(t)=j) = \lim_{\nu \to \infty} P(X_\nu=j)$$

when the limits exist. This is a well-known theorem (see Cooper [5]).

Next we determine the one-step transition probability matrix of $\{(X_\nu, Z_\nu), \nu \ge 0\}$.

3.1. The Transition Probability Matrix

We first note that a job may start while the system is in any state, but it may complete only if the system is in a state with a positive service rate. From now on, we assume that $r_i > 0$ for $1 \le i \le m$ and $r_i = 0$ for $m+1 \le i \le n$. Assume that the unconditional LST

$$\tilde F_{ij}(s) = \int_0^\infty \tilde F_{ij}(s,x)\, dG(x), \quad 1 \le i \le n,\ 1 \le j \le m,$$

has been computed using the methods in Section 2. $F_{ij}(t)$ denotes the inverse of $\tilde F_{ij}(s)$, i.e.,

$$F_{ij}(t) = P(T \le t,\ Z(T)=j \mid Z(0)=i), \quad 1 \le i \le n,\ 1 \le j \le m.$$

Now let

$$a_{ij}(k) = P(Z(T)=j,\ \text{number of arrivals during } (0,T] = k \mid Z(0)=i) = \int_0^\infty e^{-\lambda t}\, \frac{(\lambda t)^k}{k!}\, dF_{ij}(t), \quad k \ge 0,\ 1 \le i \le n,\ 1 \le j \le m.$$

Define the $n \times m$ matrix $A^{*}(k) = [a_{ij}(k)]$, $k \ge 0$. Now, let $Y$ be an exponentially distributed random variable with parameter $\lambda$, which is independent of $\{Z(t), t \ge 0\}$. The following quantity will be needed in our analysis:

$$d_{ij} = P(Z(Y)=j \mid Z(0)=i) = \int_0^\infty \lambda e^{-\lambda t}\, P(Z(t)=j \mid Z(0)=i)\, dt, \quad 1 \le i,j \le n.$$

$d_{ij}$ is the probability that the system is in state $j$ at the time of the next arrival, given that the system was empty in state $i$. Recognizing the integral as a Laplace transform, we get the following formula for the $n \times n$ matrix $D = [d_{ij}]$:

$$D = \lambda(\lambda I - Q)^{-1},$$

where $Q$ is the generator matrix of $\{Z(t), t \ge 0\}$ as defined in section 2. Using the above notation we give the following theorem.

Theorem 3.1.

$$P(X_{\nu+1}=k',\ Z_{\nu+1}=j \mid X_\nu=k,\ Z_\nu=i) = \begin{cases} a_{ij}(k'-k+1), & \text{if } k' \ge k-1 \ge 0 \\[1ex] \displaystyle\sum_{l=1}^{n} d_{il}\, a_{lj}(k'), & \text{if } k' \ge k = 0 \\[1ex] 0, & \text{otherwise}, \end{cases} \qquad 1 \le i,j \le m.$$

Proof.

(i) Let $k' \ge k-1 \ge 0$. Then $X_{\nu+1} = X_\nu - 1 +$ number of arrivals during $(\tau_\nu, \tau_{\nu+1}]$. The service of the $(\nu+1)$-th customer starts at time $\tau_\nu$ with the system in state $i$ and ends with the system in state $j$. Hence the required conditional probability is $a_{ij}(k'-k+1)$.

(ii) Let $k' \ge k = 0$. Thus the system is empty when the $\nu$-th job completes and the system is in state $i$. There follows an idle period $Y$ of exponential duration with parameter $\lambda$, during which time the state of the system changes to $l$ with probability $d_{il}$. The service of the $(\nu+1)$-th job starts in state $l$ and $X_{\nu+1} =$ number of arrivals during this service time. Hence the required conditional probability is given by $\sum_{l=1}^{n} d_{il}\, a_{lj}(k')$, and is denoted by $b_{ij}(k')$, $1 \le i,j \le m$.

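Let $F(t) = [F_{ij}(t)]$ $(i=1,2,\ldots,n;\ j=1,2,\ldots,m)$ be an $n \times m$ matrix and $F_{(red)}(t)$ be the $m \times m$ submatrix of it obtained by taking its first $m$ rows. For $k \ge 0$, define the $n \times m$ matrices $A^{*}(k)$ and the $m \times m$ matrices $A(k)$ as follows:

$$A^{*}(k) = \int_0^\infty e^{-\lambda t}\, \frac{(\lambda t)^k}{k!}\, dF(t),$$

$$A(k) = \int_0^\infty e^{-\lambda t}\, \frac{(\lambda t)^k}{k!}\, dF_{(red)}(t).$$

Also define the $m \times m$ matrices $B(k)$, $k \ge 0$, as follows:

$$B(k) = D_{(red)}\, A^{*}(k),$$

where $D_{(red)}$ is the $m \times n$ submatrix of $D = \lambda(\lambda I - Q)^{-1}$ obtained by deleting its last $n-m$ rows.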
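As a small numerical illustration of $D = \lambda(\lambda I - Q)^{-1}$ and $D_{(red)}$, the sketch below uses a three-state generator of the kind that arises for a two-processor system (state 1: one processor up, state 2: both up, state 3: both down). The generator entries, the parameter values and the helper names `invert` and `idle_period_matrix` are illustrative assumptions, not taken from the report.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment [a | I] and reduce the left half to the identity.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def idle_period_matrix(q, lam):
    """Return D = lam * (lam*I - Q)^(-1); d_ij = P(Z(Y)=j | Z(0)=i), Y ~ Exp(lam)."""
    n = len(q)
    shifted = [[(lam if i == j else 0.0) - q[i][j] for j in range(n)] for i in range(n)]
    return [[lam * v for v in row] for row in invert(shifted)]


# Assumed illustrative parameters: failure rate, repair rate, arrival rate.
gamma, mu, lam = 0.1, 1.0, 0.5
# Assumed generator; states 1 and 2 have positive service rate (m = 2), state 3 has none.
Q = [[-(gamma + mu), mu, gamma],
     [2 * gamma, -2 * gamma, 0.0],
     [mu, 0.0, -mu]]

D = idle_period_matrix(Q, lam)
D_red = D[:2]  # keep the first m rows: an m x n matrix

for row in D:
    print([round(v, 4) for v in row], "row sum:", round(sum(row), 12))
```

Each row of `D` is a probability distribution over the state found by the next arriving job, which is why the rows sum to one.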

Define the macro state vector $\mathbf{i} = \{(i,1),(i,2),\ldots,(i,m)\}$, $i = 0,1,2,\ldots$. The macro state $\mathbf{i}$ means that there are $i$ jobs in the system upon an arbitrary job completion. Using the state space $\{\mathbf{i}, i \ge 0\}$, we can write the one-step transition probability matrix of $\{(X_\nu, Z_\nu), \nu \ge 0\}$ as

$$\begin{array}{c|ccccc} & \mathbf{0} & \mathbf{1} & \mathbf{2} & \mathbf{3} & \cdots \\ \hline \mathbf{0} & B(0) & B(1) & B(2) & B(3) & \cdots \\ \mathbf{1} & A(0) & A(1) & A(2) & A(3) & \cdots \\ \mathbf{2} & 0 & A(0) & A(1) & A(2) & \cdots \\ \mathbf{3} & 0 & 0 & A(0) & A(1) & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array}$$

Notice that $Z_\nu$ can be in state $i$ if and only if $r_i > 0$. Therefore, $A(k)$ and $B(k)$, $k \ge 0$, are all $m \times m$ matrices, and all elements of $A(k)$ and $B(k)$, $k \ge 0$, are strictly positive.

From the above matrix representation, it is obvious that our model has a block M/G/1 structure. As mentioned before, there are a large number of models that fall into this structure, and a general algorithm for the solution of this problem has been studied in detail by Lucantoni and Neuts [11], Lucantoni and Ramaswami [12] and Neuts [15]. Here, we use a modified version of the method of Lucantoni and Neuts [11].

Now let $A = \sum_{k=0}^{\infty} A(k) = \tilde F_{(red)}(0)$. Note that $A$ $(=[a_{ij}])$ is an irreducible stochastic matrix. Let $\pi$ be its invariant solution, i.e.,

$$\pi^T A = \pi^T, \quad \pi^T e = 1,$$

where $e$ is an $m$-dimensional column vector of 1s. We define

$$\beta = \sum_{k=1}^{\infty} k\, A(k)\, e = -\lambda\, \frac{d}{ds}\tilde F_{(red)}(s)\Big|_{s=0}\, e.$$

Note that $\pi^T \beta$ is the expected number of arrivals per departure at saturation (i.e., assuming that the system is never empty). Hence the condition of stability for the queueing system is given by (Neuts [15]):

$$\rho = \pi^T \beta < 1,$$

which can be rewritten as

$$\lambda < \lambda^{*} = \left[-\pi^T\, \frac{d}{ds}\tilde F_{(red)}(s)\Big|_{s=0}\, e\right]^{-1},$$

where $\lambda^{*}$ is the threshold value of the job arrival rate below which the queueing system will remain stable. We assume that the above condition is satisfied so that the Markov chain $\{(X_\nu, Z_\nu), \nu \ge 0\}$ is positive recurrent.

Now let

$$y(i,j) = \lim_{\nu \to \infty} P(X_\nu=i,\ Z_\nu=j), \quad i \ge 0,\ 1 \le j \le m.$$

The infinite probability vector $y^T$ is written as $(y_0^T, y_1^T, \ldots)$, where $y_i^T = (y(i,1), y(i,2), \ldots, y(i,m))$. Define $\phi_j(w) = \sum_{i=0}^{\infty} w^i\, y(i,j)$ $(j=1,2,\ldots,m)$ and let $\phi^T(w) = (\phi_1(w), \phi_2(w), \ldots, \phi_m(w))$. Then it can be easily shown that

$$\phi^T(w) = y_0^T\, [wB(w) - A(w)]\,[wI - A(w)]^{-1} \qquad (3.1)$$

where

$$B(w) = \sum_{k=0}^{\infty} w^k B(k) = D_{(red)}\, \tilde F(\lambda(1-w)),$$

$$A(w) = \sum_{k=0}^{\infty} w^k A(k) = \tilde F_{(red)}(\lambda(1-w)).$$

The standard procedure at this point is to determine $y_0$ by complex function theory arguments based upon the holomorphic nature of $\phi^T(w)$, but this procedure is numerically unstable. Lucantoni and Neuts [11] have developed a more stable procedure to obtain $y_0$. We use a modified version of this procedure, which is described here for completeness.

Let $V$ be the length of a busy period initiated by a single customer at time 0. Define

$$H_{ij} = P\{Z(V)=j \mid Z(0)=i\}, \quad i,j = 1,2,\ldots,m$$

to be the probability that the system state changes from $i$ to $j$ during a busy period. It is known that the matrix $H = [H_{ij}]$ is the smallest solution to the following nonlinear equation:

$$H = \sum_{k=0}^{\infty} A(k)\, H^k. \qquad (3.2)$$

Equation (3.2) can be solved by a straightforward iterative method:

$$H_{n+1} = \int_0^\infty dF_{(red)}(t)\, \exp\left(\lambda(H_n - I)t\right). \qquad (3.3)$$

In the limit as $n$ approaches infinity, $H_n$ of equation (3.3) approaches $H$, the solution to (3.2). Notice that when the matrix exponential in the above equation is computed by an eigenvalue technique, equation (3.3) gives $H_{n+1}$ in terms of $\tilde F_{ij}(\cdot)$ evaluated at the eigenvalues of $\lambda(I - H_n)$. This method is used in the example of the next section, and it obviates the need to compute $A(k)$ for all $k$. It should be noted that $H$ is a stochastic matrix.
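The fixed-point character of (3.2) can be illustrated with a toy computation. The sketch below applies successive substitution, starting from the zero matrix, to $H = \sum_k A(k) H^k$ for three made-up $2 \times 2$ blocks; the block values and the function name `solve_busy_period_matrix` are purely illustrative assumptions. It works with the explicit sum in (3.2) rather than the matrix-exponential form (3.3).

```python
def matmul(a, b):
    """Product of two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def solve_busy_period_matrix(blocks, max_iter=500, tol=1e-14):
    """Iterate H <- sum_k A(k) H^k from H = 0; 'blocks' holds A(0), A(1), ..."""
    m = len(blocks[0])
    h = [[0.0] * m for _ in range(m)]
    for _ in range(max_iter):
        power = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # H^0
        new = [[0.0] * m for _ in range(m)]
        for a_k in blocks:
            new = matadd(new, matmul(a_k, power))
            power = matmul(power, h)
        change = max(abs(new[i][j] - h[i][j]) for i in range(m) for j in range(m))
        h = new
        if change < tol:
            break
    return h


# Made-up blocks: their sum is stochastic and the mean number of arrivals per
# service is 0.5 from either state, so the toy chain is stable.
A_blocks = [
    [[0.40, 0.20], [0.30, 0.30]],   # A(0)
    [[0.10, 0.20], [0.20, 0.10]],   # A(1)
    [[0.05, 0.05], [0.05, 0.05]],   # A(2)
]

H = solve_busy_period_matrix(A_blocks)
print([[round(v, 6) for v in row] for row in H])
```

Starting from zero gives a monotonically increasing sequence of iterates; in this stable toy case the limit has unit row sums, consistent with the remark that $H$ is stochastic.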

It is known that $y_0^T$ is a solution to

$$y_0^T = y_0^T \left(\sum_{k=0}^{\infty} B(k)\, H^k\right) = y_0^T\, D_{(red)} \int_0^\infty dF(t)\, \exp\left(\lambda(H-I)t\right). \qquad (3.4)$$

The above equation determines $y_0$ up to a multiplicative constant, since $\sum_{k=0}^{\infty} B(k) H^k$ is a stochastic matrix, so that $I - \sum_{k=0}^{\infty} B(k) H^k$ has rank $m-1$. Again, the matrix $\sum_{k=0}^{\infty} B(k) H^k$ can be computed in terms of $\tilde F_{ij}(\cdot)$ evaluated at the eigenvalues of $\lambda(I - H)$.

At this point, Lucantoni and Neuts provide a rather formidable procedure to compute this multiplicative constant. Here we provide an alternative method, which is based upon the following equation:

$$\lim_{w \to 1} \phi^T(w)\, e = 1. \qquad (3.5)$$

Unfortunately, $wI - A(w)$ is singular in the limit as $w \to 1$, and hence we need to use L'Hospital's rule to compute the limit in (3.5). To do this, write

$$[wI - A(w)]^{-1} = R(w)/u(w),$$

where $R(w)$ is the adjoint of $wI - A(w)$ and $u(w)$ is the determinant of $wI - A(w)$. Then we get

$$u'(1) = y_0^T \left( (B(1) + B'(1) - A'(1))\, R(1) + (B(1) - A(1))\, R'(1) \right) e \qquad (3.6)$$

where

$$A(1) = \tilde F_{(red)}(0), \quad B(1) = D_{(red)}\, \tilde F(0),$$

$$A'(1) = -\lambda\, \frac{d}{ds}\tilde F_{(red)}(s)\Big|_{s=0}, \quad B'(1) = -\lambda\, D_{(red)}\, \frac{d}{ds}\tilde F(s)\Big|_{s=0}.$$

Equation (3.6) above provides the required independent equation to determine the multiplicative constant. Once $y_0$ is known, $\phi^T(w)$ is completely determined and one can compute moments by taking derivatives.
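The step from (3.5) to (3.6) can be spelled out as follows. This is a sketch of the L'Hospital argument under the notation $N(w) = wB(w) - A(w)$, a shorthand introduced here only for the derivation; it uses $u(1) = 0$ and the fact that the limit in (3.5) is finite, so the numerator must also vanish at $w = 1$.

```latex
\begin{align*}
\phi^T(w)\,e &= \frac{y_0^T\,N(w)\,R(w)\,e}{u(w)},
   \qquad N(w) = wB(w) - A(w), \\
N'(w) &= B(w) + wB'(w) - A'(w), \\
1 = \lim_{w\to 1}\phi^T(w)\,e
  &= \frac{y_0^T\,\bigl[\,N'(1)\,R(1) + N(1)\,R'(1)\,\bigr]\,e}{u'(1)} .
\end{align*}
```

Multiplying through by $u'(1)$ and substituting $N(1) = B(1) - A(1)$ and $N'(1) = B(1) + B'(1) - A'(1)$ gives an equation of the form (3.6).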

One seeming difficulty of this procedure is the apparent necessity of having to compute $R(w)$ and $u(w)$ algebraically. As we only need $u(1)$, $u'(1)$, $R(1)$, $R'(1)$, we can use the following theorems, which eliminate the necessity of computing $R(w)$ and $u(w)$ algebraically.

Theorem 3.1. Let $G(w) = [G_{ij}(w)]$ be an $m \times m$ matrix of differentiable functions. For $k = 1,2,\ldots,m$, define $m \times m$ matrices $G^{(k)}(w)$ as follows:

$$[G^{(k)}(w)]_{ij} = \begin{cases} G_{ij}(w) & \text{if } i \ne k \\ G'_{ij}(w) & \text{if } i = k. \end{cases}$$

Then

$$\frac{d}{dw} \det G(w) = \sum_{k=1}^{m} \det G^{(k)}(w).$$

Theorem 3.2. Let $G(w)$ and $G^{(k)}(w)$ be as defined in Theorem 3.1. For $k = 1,2,\ldots,m$, define the $m \times m$ matrices $G_{(k)}(w)$ as follows:

$$[G_{(k)}(w)]_{ij} = \begin{cases} G_{ij}(w) & \text{if } i \ne k \\ 0 & \text{if } i = k. \end{cases}$$

Then

$$\frac{d}{dw}\, \mathrm{Adj}\, G(w) = \sum_{k=1}^{m} \left[ \mathrm{Adj}\, G^{(k)}(w) - \mathrm{Adj}\, G_{(k)}(w) \right].$$

Proofs of both these theorems are straightforward. These theorems provide an $O(m^4)$ method of computing $u'(1)$ and $R'(1)$.
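The two differentiation rules can be checked numerically on a small example. The sketch below compares the row-by-row formulas for the derivative of a determinant and of an adjugate against central finite differences; the $3 \times 3$ test matrix `G`, its derivative `dG`, and all helper names are arbitrary choices made for illustration only.

```python
import math


def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by Laplace expansion along the first row (fine for small m)."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def adjugate(a):
    """Transpose of the cofactor matrix."""
    n = len(a)
    return [[(-1) ** (i + j) * det(minor(a, i, j)) for i in range(n)] for j in range(n)]


def G(w):
    return [[w * w, math.sin(w), 1.0],
            [2.0 * w, math.exp(w), w],
            [1.0, w ** 3, math.cos(w)]]


def dG(w):
    """Entrywise derivative of G."""
    return [[2.0 * w, math.cos(w), 0.0],
            [2.0, math.exp(w), 1.0],
            [0.0, 3.0 * w * w, -math.sin(w)]]


def with_row(a, k, new_row):
    return [new_row if i == k else row for i, row in enumerate(a)]


def det_derivative(w):
    """Sum over k of det of G with row k replaced by its derivative."""
    g, d = G(w), dG(w)
    return sum(det(with_row(g, k, d[k])) for k in range(len(g)))


def adj_derivative(w):
    """Sum over k of Adj(G with row k differentiated) minus Adj(G with row k zeroed)."""
    g, d = G(w), dG(w)
    n = len(g)
    total = [[0.0] * n for _ in range(n)]
    for k in range(n):
        plus = adjugate(with_row(g, k, d[k]))
        minus = adjugate(with_row(g, k, [0.0] * n))
        total = [[t + p - q for t, p, q in zip(tr, pr, qr)]
                 for tr, pr, qr in zip(total, plus, minus)]
    return total


w0, step = 0.7, 1e-6
numeric_det = (det(G(w0 + step)) - det(G(w0 - step))) / (2 * step)
up, down = adjugate(G(w0 + step)), adjugate(G(w0 - step))
numeric_adj = [[(u - v) / (2 * step) for u, v in zip(ur, vr)] for ur, vr in zip(up, down)]
print(det_derivative(w0), numeric_det)
```

Each formula needs $m$ determinants (or adjugates) of $m \times m$ matrices, so the matrix functions never have to be differentiated symbolically as a whole; only the individual entries do.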

An interesting feature of this queueing model is that the expected service time of an arbitrary job in steady state depends upon the load offered to the system, viz. $\lambda$. We can easily derive expressions for this quantity when $\lambda$ approaches 0 (a lightly loaded system) or as $\lambda$ approaches $\lambda^{*}$ (a heavily loaded system). Let $S_\nu$ be the service time of the $\nu$-th customer. When the arrival rate $\lambda \to 0$, every incoming job finds the system empty. Thus, in steady state, an incoming job finds the structure state process in state $i$ with probability $\theta_i = \lim_{t \to \infty} P(Z(t)=i)$. Hence

$$\lim_{\lambda \to 0}\, \lim_{\nu \to \infty} E(S_\nu) = \sum_{i=1}^{n} \theta_i\, E(T(x) \mid Z(0)=i)$$

where $\theta^T = (\theta_1, \theta_2, \ldots, \theta_n)$ is a solution to

$$\theta^T Q = 0, \quad \theta^T e = 1,$$

and

$$E(T(x) \mid Z(0)=i) = -\sum_{j=1}^{m} \frac{\partial}{\partial s}\tilde F_{ij}(s,x)\Big|_{s=0}.$$

When the arrival rate $\lambda \to \lambda^{*}$, the system is always busy. Hence, when a job comes up for service, the structure state process is in state $i$ with probability $\pi_i$, where $\pi^T = (\pi_1, \pi_2, \ldots, \pi_m)$ is the stationary probability vector of the matrix $A$ as defined before. Hence

$$\lim_{\lambda \to \lambda^{*}}\, \lim_{\nu \to \infty} E(S_\nu) = \sum_{i=1}^{m} \pi_i\, E(T(x) \mid Z(0)=i).$$

In the next section we tabulate $\lim_{\nu \to \infty} E(S_\nu)$ as a function of $\lambda$ for a two-processor fault-tolerant system. We also study the expected queue length in such a system as a function of $\lambda$.

4. An Example

In this section we consider an example to demonstrate the use of the techniques presented in section 3. We obtain the mean of the number of jobs in a fault-tolerant computer system in steady state. The system has two processor units subject to failures and repairs. The failure rate of a single processor is $\gamma$. The failure of one processor causes the preemption of the job being processed. The interrupted job is restarted and processed at a reduced service rate (service rate is assumed to be proportional to the number of operating processors). When both processors have failed, the interrupted job is restarted as soon as one of the processors is repaired and is processed at a reduced service rate. When the second processor is repaired the processing of the job is continued at the increased (normal) service rate. The failed processors are repaired one at a time with a rate $\mu$.

The behavior of the system can be described by a continuous-time Markov chain with the state-transition diagram shown in figure 1. Note that state 2 corresponds to the system with two operating processors, and is classified as a prs state. The service rate in state 2 is $r_2$ $(=2)$. State 1 corresponds to the system with one operating processor, and is classified as a pri state. The service rate in state 1 is $r_1$ $(=1)$. State 3 corresponds to the system with both processors failed, and is classified as a pri state. The service rate in state 3 is $r_3$ $(=0)$. Jobs arrive into the system according to a Poisson process at a rate $\lambda$. Each job has a deterministic work requirement, say $x$ units of work.

We now follow the procedure suggested in section 2 in order to compute $\tilde F_{ij}(s,x)$, $i=1,2,3$; $j=1,2$. (Note that a job may complete only in a state with nonzero service rate.) In our example there are two types of system states, the prs subset $\bar S = \{2\}$, and the pri subset $S = \{1,3\}$. From theorem 2.3, equations (2.3) and (2.4) yield

$$\tilde M^{*}_{22}(s,w) = \frac{2}{s+2\gamma+2w} \qquad (4.1)$$

and

$$\tilde M^{*}_{21}(s,w) = \frac{2\gamma}{w\,(s+2\gamma+2w)}. \qquad (4.2)$$

Inverting with respect to $w$, we get

$$\tilde M_{22}(s,x) = e^{-(s+2\gamma)x/2} \qquad (4.3)$$

and

$$\tilde M_{21}(s,x) = \frac{2\gamma}{s+2\gamma}\left(1 - e^{-(s+2\gamma)x/2}\right). \qquad (4.4)$$

The LSTs $\tilde F_{ij}(s,x)$, $i=1,3$; $j=1,2$, can be determined from theorem 2.4, equation (2.5), as follows:

$$\tilde F_{31}(s,x) = \frac{\mu}{s+\mu}\, \tilde F_{11}(s,x) \qquad (4.5)$$

$$\tilde F_{32}(s,x) = \frac{\mu}{s+\mu}\, \tilde F_{12}(s,x) \qquad (4.6)$$

where

$$\tilde F_{11}(s,x) = \frac{e^{-(s+\gamma+\mu)x}}{D(s,x)} \qquad (4.7)$$

and

$$\tilde F_{12}(s,x) = \frac{2\mu}{s+2\mu} \cdot \frac{e^{-(s+2\gamma)x/2} - e^{-(s+\gamma+\mu)x}}{D(s,x)} \qquad (4.8)$$

with

$$D(s,x) = 1 - \frac{\gamma\mu\,(3s+2\gamma+2\mu)\left(1 - e^{-(s+\gamma+\mu)x}\right)}{(s+\mu)(s+2\gamma)(s+\gamma+\mu)} + \frac{4\gamma\mu\left(e^{-(s+2\gamma)x/2} - e^{-(s+\gamma+\mu)x}\right)}{(s+2\gamma)(s+2\mu)}.$$

The LSTs $\tilde F_{ij}(s,x)$, $i=2$; $j=1,2$, follow from theorem 2.4, equation (2.6), as follows:

$$\tilde F_{21}(s,x) = \tilde M_{21}(s,x)\, \tilde F_{11}(s,x), \qquad (4.9)$$

$$\tilde F_{22}(s,x) = \tilde M_{22}(s,x) + \tilde M_{21}(s,x)\, \tilde F_{12}(s,x), \qquad (4.10)$$

with $\tilde M_{22}(s,x)$ and $\tilde M_{21}(s,x)$ as given by equations (4.3) and (4.4), respectively.

Now that we have evaluated the LSTs $F_{ij}(s,x)$, $i=1,2,3$; $j=1,2$, we can carry out the queueing analysis as suggested in section 3. The matrix $A$, given by

$$A = \sum_{k} A(k) = [F_{ij}(0,x)], \quad i,j=1,2,$$

can now be evaluated using equations (4.7)-(4.10). It follows that

T TA = (e-$X -e-('+u)) (le + }( + )

As noted earlier, $A$ is an irreducible stochastic matrix. Let $\pi$ be the solution to $\pi = \pi A$ and $\pi e = 1$; then

. "--"e -""Z -e -(-I+A)ziz.'7 1 --- , (4.11)

-1-e -(4.12)

The condition of stability for the present queueing system is given by $\pi\beta < 1$, with

$$\beta_i = \sum_{k} k\,\bigl(a_{i1}(k) + a_{i2}(k)\bigr) = -\lambda\bigl[F_{i1}'(0,x) + F_{i2}'(0,x)\bigr], \quad i=1,2.$$

Thus, we have, after evaluating $F_{ij}'(0,x)$, $i,j=1,2$, using equations (4.7)-(4.10),

2'T(v-,L)- (e ": -e - (l- e )i, (4.13)

and

'-2 )L (e +e-I+~ (_+u-+x -- 1)t-L (e-(.Y' " x-e '.. ~ ~2*"/ t )' (e "iY +'-U-p - 2" -y-a [.(.4

The condition of stability follows ($\pi_1\beta_1 + \pi_2\beta_2 < 1$),

<" x

.. ((.1+,)'+.-)(e'*I

Now we proceed to determine the mean number of jobs in the system in steady state. Let the irreducible stochastic matrix $H$ (defined in section 3) be given by

$$H = \begin{bmatrix} \delta & 1-\delta \\ 1-\theta & \theta \end{bmatrix}$$

It is determined as the smallest solution to

$$H = \int_0^\infty dF(t,x)\, e^{\lambda (H-I) t} \tag{4.16}$$

with $dF(t,x) = [dF_{ij}(t,x)]$, $i,j=1,2$.

It is easy to show that

$$e^{\lambda (H-I) t} = \frac{\lambda}{\sigma}\begin{bmatrix} (1-\theta)+(1-\delta)e^{-\sigma t} & (1-\delta)+(\delta-1)e^{-\sigma t} \\ (1-\theta)+(\theta-1)e^{-\sigma t} & (1-\delta)+(1-\theta)e^{-\sigma t} \end{bmatrix} \tag{4.17}$$

where $\sigma = \lambda(2-\delta-\theta)$. Hence, substituting (4.17) in (4.16), we get:

$$H = \frac{\lambda}{\sigma}\begin{bmatrix} (1-\theta)+(1-\delta)F_{11}(\sigma,x)+(\theta-1)F_{12}(\sigma,x) & (1-\delta)+(\delta-1)F_{11}(\sigma,x)+(1-\theta)F_{12}(\sigma,x) \\ (1-\theta)+(1-\delta)F_{21}(\sigma,x)+(\theta-1)F_{22}(\sigma,x) & (1-\delta)+(\delta-1)F_{21}(\sigma,x)+(1-\theta)F_{22}(\sigma,x) \end{bmatrix}$$

The unknowns $\delta$ and $\theta$ can be determined by solving the following two nonlinear equations:

$$\frac{\sigma}{\lambda}\,\delta = (1-\theta)+(1-\delta)F_{11}(\sigma,x)+(\theta-1)F_{12}(\sigma,x), \tag{4.18}$$

$$\frac{\sigma}{\lambda}\,\theta = (1-\delta)+(\delta-1)F_{21}(\sigma,x)+(1-\theta)F_{22}(\sigma,x) \tag{4.19}$$

with $F_{ij}(\sigma,x)$, $i,j=1,2$, from equations (4.7)-(4.10).
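The closed form in (4.17) can be checked numerically. The sketch below assumes $H$ has rows $(\delta, 1-\delta)$ and $(1-\theta, \theta)$ and that (4.17) gives $e^{\lambda(H-I)t}$ with $\sigma = \lambda(2-\delta-\theta)$; it compares that expression against a plain Taylor series for the matrix exponential. The helper names and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

def expm_series(M, terms=60):
    """Plain Taylor series for exp(M); adequate for the small 2x2 matrices used here."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def closed_form(delta, theta, lam, t):
    """Assumed reading of (4.17): exp(lam*(H - I)*t) for the 2x2 stochastic matrix H."""
    c = 2.0 - delta - theta            # sigma / lam
    e = np.exp(-lam * c * t)           # e^{-sigma t}
    return np.array([
        [(1 - theta) + (1 - delta) * e, (1 - delta) + (delta - 1) * e],
        [(1 - theta) + (theta - 1) * e, (1 - delta) + (1 - theta) * e],
    ]) / c

delta, theta, lam, t = 0.3, 0.6, 2.0, 0.75   # arbitrary test values, not from the report
H = np.array([[delta, 1 - delta], [1 - theta, theta]])
direct = expm_series(lam * (H - np.eye(2)) * t)
max_gap = np.max(np.abs(direct - closed_form(delta, theta, lam, t)))
row_sums = closed_form(delta, theta, lam, t).sum(axis=1)
```

The closed form follows because $(I-H)^2 = (2-\delta-\theta)(I-H)$, so the exponential series collapses to $I - (I-H)(1-e^{-\sigma t})\lambda/\sigma$; each row sums to one, as expected for a transition matrix.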

Equations (4.18) and (4.19) are solved using Broyden's method [4], which converges quickly for judiciously chosen initial solution vectors. Then the vector $\pi_0$ is solved for by using (3.4) and (3.6). Using $\pi_0$ we then compute $\pi(w)$ from (3.1). By the method described in section 3, we are then able to compute the expected number of jobs in the system in the steady state. We plot the expected number of jobs as a function of $\lambda$ in figure 2 for $x = 0.01$, $\mu = 1$ and three different values of the failure rate, $\gamma = 0.01, 0.05, 0.1$. As expected, increasing the failure rate $\gamma$ implies a substantial increase in system congestion.
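For readers unfamiliar with Broyden's method, the following is a minimal sketch of its first ("good") variant for a general system $f(v) = 0$. It is not the authors' implementation: the finite-difference initial Jacobian, the tolerances, and the toy test system are assumptions made for illustration, and the toy system is a stand-in rather than equations (4.18) and (4.19), which require the transforms $F_{ij}(\sigma,x)$.

```python
import numpy as np

def broyden(f, x0, tol=1e-10, max_iter=100, fd_step=1e-6):
    """Broyden's first method: quasi-Newton iteration with a rank-one secant update."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    n = x.size
    # Initial Jacobian estimate by forward differences (an implementation choice).
    B = np.empty((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = fd_step
        B[:, j] = (f(x + step) - fx) / fd_step
    for _ in range(max_iter):
        if np.linalg.norm(fx) < tol:
            break
        dx = np.linalg.solve(B, -fx)          # quasi-Newton step
        x_new = x + dx
        f_new = f(x_new)
        df = f_new - fx
        # Secant update: B <- B + (df - B dx) dx^T / (dx^T dx)
        B = B + np.outer(df - B @ dx, dx) / (dx @ dx)
        x, fx = x_new, f_new
    return x

def toy_system(v):
    """Hypothetical test problem: unit circle intersected with the line a = b."""
    a, b = v
    return np.array([a * a + b * b - 1.0, a - b])

root = broyden(toy_system, [1.0, 0.5])
residual = np.linalg.norm(toy_system(root))
```

To apply the same routine to (4.18) and (4.19), `f` would return the two residuals as functions of $(\delta, \theta)$; since $H$ is stochastic, starting values inside $(0,1)$ are a natural choice.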

In the following table, we give the expected service time in the steady state as a function of $\lambda/\lambda^*$ for $\gamma = 0.1$, $\mu = 1$ and $x = 0.01$. In this case the threshold arrival rate $\lambda^*$ is 180.24. The expected service time of an arbitrary job in steady state is denoted by $E(S)$.

    λ/λ*     E(S) × 10^3
    0.99     5.55
    0.90     5.60
    0.80     5.68
    0.70     5.77
    0.60     5.88
    0.50     5.98
    0.40     6.05
    0.30     6.14
    0.20     6.30
    0.10     6.71
    0.00     19.9

It is seen that the expected service time reduces from 0.0199 to 0.0055 as $\lambda$ increases from 0 to $0.99\lambda^*$. This seemingly non-intuitive result appears because, as $\lambda$ increases, the probability that a job will be taken up for service when both processors are down decreases.
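The tabulated values can be cross-checked against the threshold arrival rate. The sketch below reads the stability condition as $\lambda$ times the mean service time being less than one, so that near saturation $E(S)$ should approach $1/\lambda^*$; it also checks that no entry falls below $x/2$, the time needed when both processors stay up. The list transcribes the table, and the variable names are illustrative.

```python
# (lambda / lambda*, E(S) * 10^3) pairs transcribed from the table
table = [
    (0.99, 5.55), (0.90, 5.60), (0.80, 5.68), (0.70, 5.77),
    (0.60, 5.88), (0.50, 5.98), (0.40, 6.05), (0.30, 6.14),
    (0.20, 6.30), (0.10, 6.71), (0.00, 19.9),
]
lam_star = 180.24      # threshold arrival rate quoted in the text
x = 0.01               # work requirement per job

saturated_service = 1.0 / lam_star          # service time at which lam * E(S) = 1
lower_bound = x / 2.0                       # uninterrupted service at rate 2
es = [value * 1e-3 for _, value in table]   # E(S) in original units

# E(S) should grow as the load ratio falls (rows are ordered by decreasing load)
monotone = all(a <= b for a, b in zip(es, es[1:]))
gap_at_saturation = abs(es[0] - saturated_service)
print(f"1/lambda* = {saturated_service:.6f}, E(S) at 0.99 lambda* = {es[0]:.6f}")
```

Under this reading, the entry at $0.99\lambda^*$ agrees with $1/\lambda^*$ to within the table's rounding, which is consistent with the quoted threshold.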

References

[1] Baccelli, F. and Trivedi, K. S., "A Single Server Queue in a Hard-Real-Time Environment," Operations Research Letters, Vol. 4, No. 4, 1985.

[2] Baccelli, F. and Trivedi, K. S., "Analysis of M/G/2-Standby Redundant System," Proc. 9th IFIP International Symposium on Computer Performance Modeling, Performance '83, A. K. Agrawala and S. K. Tripathi (eds.), North Holland, Amsterdam, 1983.

[3] Barlow, R. E. and Proschan, F., "Statistical Theory of Reliability and Life Testing: Probability Models," Holt, Rinehart and Winston, New York, 1975.

[4] Broyden, C. G., "A Class of Methods for Solving Nonlinear Simultaneous Equations," Math. Comp., Vol. 19, pp. 577-593, 1965.

[5] Cooper, R. B., "Introduction to Queueing Theory," 2nd edition, North Holland, 1981.

[6] Gaver, D. P., "A Waiting Line with Interrupted Service, Including Priorities," J. Royal Statistical Society, Series B-24, 1962, pp. 73-90.

[7] Kleinrock, L., "Queueing Systems," Vol. I & II, John Wiley & Sons, 1976.

[8] Kulkarni, V. G., Nicola, V. F. and Trivedi, K. S., "On Modeling the Performance and Reliability of Multi-Mode Computer Systems," in M. Becker (ed.), Proc. Int. Workshop on Modeling and Performance Evaluation of Parallel Systems, North-Holland, 1984.

[9] Kulkarni, V. G., Nicola, V. F., Trivedi, K. S. and Smith, R. M., "A Unified Model for the Analysis of Job Completion Time and Performability Measures in Fault-Tolerant Systems," in review, JACM.

[10] Kulkarni, V. G., Nicola, V. F. and Trivedi, K. S., "The Completion Time of a Job on Multi-Mode Systems," in review, J. of Applied Probability.

[11] Lucantoni, D. M. and Neuts, M. F., "Numerical Methods for a Class of Markov Chains Arising in Queueing Theory," Tech. Report No. 78/10, Dept. of Statistics and Computer Science, University of Delaware, 1978.

[12] Lucantoni, D. M. and Ramaswami, V., "Efficient Algorithms for Solving the Non-Linear Matrix Equations Arising in Phase Type Queues," Stochastic Models, Vol. 1, No. 1, 1985, pp. 29-52.

[13] Meyer, J., "Closed-Form Solution of Performability," IEEE Transactions on Computers, Vol. C-31, No. 7, July 1982, pp. 648-657.

[14] Mitrani, I. L. and Avi-Itzhak, B., "A Many-Server Queue with Service Interruptions," Operations Research, Vol. 16, 1968, pp. 628-638.

[15] Neuts, M. F., "Matrix Analytic Methods in Queueing Theory," European Journal of Operations Research, Vol. 15, pp. 2-12, 1984.

[16] Neuts, M. F. and Lucantoni, D. M., "A Markovian Queue with N Servers Subject to Breakdowns and Repairs," Management Science, Vol. 25, No. 9, pp. 849-861, 1979.

[17] Nicola, V. F., "A Single Server Queue with Mixed Types of Interruptions," in review, Acta Informatica.

[18] Purdue, P., "The M/M/1 Queue in a Markovian Environment," Operations Research, Vol. 22, 1974, pp. 562-569.

[19] Ramaswami, V., "The N/G/1 Queue and its Detailed Analysis," Adv. Appl. Prob., Vol. 12, 1980, pp. 222-261.

Figure 1. Rate Diagram For The Two Processor System (state 2: prs, $r_2 = 2$; state 1: pri, $r_1 = 1$; state 3: pri, $r_3 = 0$)

Figure 2. Expected Queue Length