modeling user-perceived reliability based on user behavior graphs

April 19, 2009 23:38 WSPC/INSTRUCTION FILE service˙reliability

Modeling User-Perceived Reliability Based on User Behavior Graphs ∗

Dazhi Wang

Department of Computer Science, Duke University

Durham, NC 27708

[email protected]

Kishor S. Trivedi, Fellow IEEE

Department of Electrical and Computer Engineering, Duke University

Durham, NC 27708

[email protected]

Service reliability is an important consideration for new service deployment. Traditional system-oriented measures are no longer adequate to describe the reliability perceived by the user. In this paper we propose a general service reliability analysis approach based on user behavior, and derive formulas to compute service reliability from the user model and service models. The derived user-perceived service reliability incorporates the user task reliabilities, the dependencies of different user tasks, and various types of user behavior besides failure and recovery of system hardware and software components. This approach is applied to the service reliability computation for an example fault tolerant cluster hosting two services. Factors from both the system side and the user side are analyzed for the user-perceived service reliability of the cluster, and the results are compared with the system availability and service reliability lower bound.

Keywords: Service reliability, user behavior graph, Markov chain, service continuity, service accessibility

1. Introduction

Today the rapid development of new technologies has enabled a variety of new services

for voice, data and multimedia. Ensuring high service reliability is important

as users become more dependent on these services to conduct their everyday activ-

ities. For this purpose, systems that provide these services are often designed to be

fault tolerant. When a fault occurs, the system may enter degraded states in which

it cannot operate in full capacity, or partial failure states in which some services

are available while some other services are not. The existence of degraded states

and partial-failure states causes two difficulties for service reliability analysis: 1)

due to the degraded states, the computation for service completion probability and

service completion time becomes more difficult because the service rate is no longer

constant; 2) due to the partial failure states, it is not possible to statically define

∗This research was supported in part by the US National Science Foundation under grant NSF-CNS-08-31325


up and down states (which is required for dependability analysis) from the system’s

perspective because the user may require different services at different times. This

creates a dynamic view of system up and down states.

Because of the degraded or partial failure states that impact the normal ex-

ecution of services, traditional dependability measures such as system reliabil-

ity/availability, mean time to failure, etc. (whose definitions can be found in 2 16 21)

cannot be directly applied to service reliability quantification since they are primar-

ily from the system’s point of view. In contrast to the system dependability mea-

sures, the service dependability should be a function of not only system resource

availabilities (both hardware and software), but also the characteristics of trans-

actions being served (e.g., resource requirement of the transaction), and the user

behavior (e.g., access sequence & usage pattern of different services). To correctly

evaluate the service quality, we propose the quantification of service reliability from

the user’s perspective.

During the user interaction with the system, the user often submits a sequence of

requests to achieve certain goals, and each request may require a different set of

resources in the system. The characteristics for the interaction can be summarized

as follows:

• Not all system resources are required to be up over the entire user

interaction with the system. Networked/distributed systems that host multiple

services are large, and only a subset of the system resources is needed to

process a user task. We are therefore motivated to focus on service reliability

instead of the system reliability/availability.

• A resource is required to be up only during the time periods when the user

requests this resource. The resource unavailability will not affect the user-

perceived service reliability when the resource is not needed. This charac-

teristic determines that our service reliability analysis must be user-centric

as opposed to commonly carried out system-centric analysis.

• A resource may be requested multiple times during the user-system inter-

action. Due to this requirement, traditional point availability measures can

not be applied in service reliability analysis. Instead, joint availability 2 or

interval reliability 1 may be used as a basis to further develop the notion

of service reliability.

Based on these observations, the service reliability analysis needs to take into

account the details of user behavior, and it should adopt a dynamic view of system

up/down states (when needed, as long as needed, as many times as needed). We

interpret the user-perceived service reliability as follows:

During the user interaction (session) with the system, the user issues multiple

tasks (or requests) at different time points for different services in the system. The

user-perceived service reliability is the probability that all tasks in the user session

are successfully completed.



[Figure 1: three layers. User Model: S1(T1), S2(T2), S3(T3). Service Models: defined over subsets of the resources R1, R2, R3, R4. System Model: resources R1, R2, R3, ..., Rn.]

Fig. 1. User-based service reliability modeling

Figure 1 shows a high level view of service reliability from the user’s perspective.

It contains three levels: user model, a set of service models, and the system model.

The user model contains user behavior in a user session, such as issuing different

tasks/requests, thinking, or retrying after a failed task. The system model includes

the failure/recovery behavior of its resources. The service models are derived from

the system model and requirement of tasks (such as system resources needed by

the task, and the work requirement of the task). Task execution in the system

is captured by the service models. Using this approach, we not only model the

service reliability for individual tasks, but also take into consideration dependencies

between different user tasks, and the user interactions with the system.

Some research related to service reliability modeling has been conducted. One

class of such research is on the concepts and quantitative statements for service re-

liability engineering. In 29 the service reliability theory as well as the failure modes

and failure mechanisms for services are presented. In 13 the customer-oriented defi-

nitions of availability and reliability are developed for switched telecommunication

services. The second class of service reliability research is on the service availability

of various networks or computer systems, such as 9 17 30. The basic approach is

to map service availability to the availability of some system resources or certain

system configurations. In the context of the literature above, service availability

corresponds to the concept of ‘service accessibility’ described in 29, and the ser-

vice continuity for the transactions/tasks is not addressed. The third class focuses

on the modeling of service continuity. It is normally from the task’s perspective,

evaluating the reliability of the task or related measures such as task completion

time. Based on whether or not the system has degradable performance, it can be

further divided into task reliability evaluation without degraded system states such

as 10 11, and evaluation with degraded system states such as 19 27 28 3. The latter

is closely related to the distribution of performability 18 25, which characterizes

the system performance under failures. The last class studies the service reliabil-



ity by incorporating the user’s behavior. In 15 the user-perceived availability of a

web-based travel agency is evaluated where the user sends multiple requests during

the user session. In 31 the user think state is addressed when modeling the user-

perceived web-server availability. In 20 the various joint availabilities for multiple

user requests are considered.

The research work above focuses on various aspects of service reliability modeling;

however, it lacks a methodology to put all parts together to get the overall

service reliability from the user’s perspective. The main contributions of this paper

are summarized as follows:

• propose a general modeling framework for the user perceived service relia-

bility, which integrates both service accessibility and service continuity as

well as user behavior. In the framework, we take into consideration 1) task

execution under degradable systems, 2) dependencies between a sequence

of user tasks, 3) impact of user behavior on service reliability.

• derive equations to compute the service reliability for a single task with

deterministic work requirement, and develop an efficient computation method

for user-perceived service reliability for a sequence of user tasks by

incorporating the user behavior model and service models.

• apply the user-perceived service reliability modeling approach to an ex-

ample cluster system hosting two services. Various factors from both the

system side and the user side that impact the user-perceived service relia-

bility are analyzed.

The rest of the paper is organized as follows. In Section 2 we first present a simple

example to illustrate the basic idea of the service reliability quantification. In

Section 2.2 we use the technique in 4 to derive the service reliability for a single user

task in degradable systems with deterministic work requirement. In Section 2.3 the

approach in Section 2.2 is extended to model the user-perceived service reliability

with multiple user requests. In Section 2.5 possible extensions of the user model are

discussed to allow the modeling of more types of user behavior. Section 3 applies our

service reliability modeling technique to a fault tolerant computer cluster that hosts

two services, and gives the numerical results of the user-perceived service reliability

for the example system. Various factors that influence the user-perceived service

reliability are also analyzed. Section 4 presents the summary and conclusions.

2. Service Reliability Modeling

In this paper we study the following aspects of the user-perceived service relia-

bility: 1) service accessibility – the ability to initiate a transaction in the service

when desired; 2) service continuity – the ability to successfully serve an initiated

transaction till its completion; 3) service re-accessibility – the ability to re-access

the service multiple times after the first access. The concepts of 1) and 2) have

been introduced in 29, and concept 3) has been implied in 20. A service can be in-



terpreted as requiring certain hardware/software resources in the system to be up.

Given a CTMC model of the system in which the states characterize availabilities of

system resources, a service defines a set of up states of the system in which the ser-

vice is operational. The service accessibility can be captured by the instantaneous

availability defined on the CTMC, which is the probability that the system is in

one of the up states at time t (i.e., the time epoch when the service is accessed).

For systems without degraded states, the service continuity can be captured by the

system reliability, which is the probability that the system stays in up states during

[t, t + x] for a task requiring x units of service time. And the service re-accessibility

is analogous to the joint-availability which is the probability that the system is in

up states at times t1 and t2. However, for systems with degraded states and user

sessions containing multiple tasks or requests, the user-perceived service reliability

is no longer straightforward to compute.

2.1. Single Task Reliability on Two-state System Model

We first consider the service reliability of a single user task running on a two-state

system. The CTMC model of the system is shown in Figure 2(a), which has constant

failure rate λ and repair rate µ. U is an operational state in which the system can

accept and process tasks, D is a down state in which new tasks will be denied and

existing tasks in the system will be lost. The system dependability measures are

shown in Table 1. The DTMC model for the user is shown in Figure 2(b) which

contains only one task. P means the task is being processed and E means the task

has completed.

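[Figure 2: (a) system: two-state CTMC with a transition from U to D at rate λ and from D to U at rate µ; (b) user: DTMC with a transition from P to E with probability 1.]

Fig. 2. System and user models for single task reliability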

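Table 1. Dependability measures of the two-state system

Measure                   | Formula
reliability               | $R(t) = e^{-\lambda t}$
inst. availability        | $A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} \cdot e^{-(\lambda+\mu)t}$
steady-state availability | $A = \mu/(\lambda+\mu)$
interval reliability      | $R_I(t, \tau) = A(t) \cdot e^{-\lambda\tau}$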

Assume sa is the service accessibility for the task. If the system has reached

steady-state when the task is submitted, the service accessibility for the task sa is

the steady-state availability of the system:

sa = A = µ/(λ + µ).



If the task is submitted t time units after the system is started, then the service

accessibility corresponds with the point or instantaneous system availability.

$$s_a = A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} \cdot e^{-(\lambda+\mu)t}.$$
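As a quick numerical illustration of the two accessibility formulas above, the following Python sketch evaluates A(t) and its steady-state limit. The rate values are hypothetical and chosen only for illustration, and the function names are ours rather than the paper's.

```python
import math


def instantaneous_availability(lam, mu, t):
    """A(t) for the two-state model, assuming the system is up at time 0."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows: mu / (lam + mu)."""
    return mu / (lam + mu)


# Hypothetical rates (per hour): failure rate lam, repair rate mu.
lam, mu = 0.001, 0.5
for t in (0.0, 1.0, 10.0, 100.0):
    print(f"t = {t:6.1f}  s_a = {instantaneous_availability(lam, mu, t):.6f}")
print(f"steady state s_a = {steady_state_availability(lam, mu):.6f}")
```

With these rates A(t) starts at 1 and settles to the steady-state value within a few multiples of 1/(λ + µ).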

Given that the service can be accessed by the task, the state probability vector

of the system at the time of task submission is (1, 0), i.e., with probability 1 the

system is in state U , and 0 in state D.

Assume that if the system stays in operational state, the user task requires X

(which is a random variable) units of work to complete, and it fails if the system

goes down during task processing. Then the service continuity for the task is the

system reliability during [t, t + X ], where time t is when the task is issued to the

system, and as stated above, the system state probability vector at time t is (1, 0).

Assume the distribution of X is F(x), i.e., $F(x) = \Pr\{X \le x\}$, then the service

continuity can be written as

$$\begin{aligned}
s_c &= \int_{x=0}^{\infty} \Pr(\text{system is up during } [t, t+x]) \cdot \Pr(\text{task completes at time } t+x) \\
    &= \int_{0}^{\infty} R(x) \cdot dF(x) = \int_{0}^{\infty} e^{-\lambda x} \cdot dF(x)
\end{aligned}$$

The service reliability SR is the probability that the task can access the service

and be completed successfully. It can be written as

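$$SR = s_a \cdot s_c = \left( \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} \cdot e^{-(\lambda+\mu)t} \right) \cdot \int_{0}^{\infty} e^{-\lambda x} \cdot dF(x)$$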

If X is deterministic, i.e.,

$$F(x) = \begin{cases} 0, & x < \tau \\ 1, & x \ge \tau \end{cases}$$

then we have

$$SR = \left( \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} \cdot e^{-(\lambda+\mu)t} \right) \cdot e^{-\lambda\tau}$$

In the special case of deterministic work requirement, this depiction of service reli-

ability corresponds with the notion of interval reliability introduced by Barlow and

Proschan 1.
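The single-task formulas of this section can be sketched in a few lines of Python. The sketch below is illustrative only: the rates are hypothetical, the exponential work requirement is our own choice of an example distribution F (for which the integral has the closed form ν/(ν + λ)), and the midpoint-rule integrator is included only to check that closed form numerically.

```python
import math


def service_accessibility(lam, mu, t):
    """s_a = A(t) for the two-state system started in the up state."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def continuity_deterministic(lam, tau):
    """s_c when the work requirement is exactly tau: exp(-lam * tau)."""
    return math.exp(-lam * tau)


def continuity_exponential(lam, nu):
    """s_c when X is exponential with rate nu (an assumed example of F)."""
    return nu / (nu + lam)


def continuity_numeric(lam, nu, upper=200.0, steps=200000):
    """Midpoint-rule evaluation of the integral of exp(-lam x) dF(x)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-lam * x) * nu * math.exp(-nu * x) * h
    return total


def service_reliability(lam, mu, t, tau):
    """SR = s_a * s_c for a deterministic work requirement tau."""
    return service_accessibility(lam, mu, t) * continuity_deterministic(lam, tau)


lam, mu = 0.001, 0.5  # hypothetical failure and repair rates
print(service_reliability(lam, mu, t=10.0, tau=5.0))
print(continuity_exponential(lam, nu=0.2), continuity_numeric(lam, nu=0.2))
```

For deterministic work the result coincides with the interval reliability entry of Table 1, since both reduce to A(t) · e^{−λτ}.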

2.2. Single Task Reliability on Degradable Systems

Most large scale systems are fault tolerant and contain multiple degradable states,

in which they cannot operate with full capacity due to partial failures, but tasks

can still be processed with degraded service rate, or resumed after the failure is

recovered. In this section we show how to compute the single task reliability on

degradable systems.

We describe the degradable system using a continuous time Markov chain with

state space $\Omega = \{1, 2, \ldots, n\}$ and generator matrix $Q = [q_{ij}]_{n \times n}$. Given a task running



on the system, Ω is divided into three sets: fully operational states U , in which the

task is processed by the system with full capacity; degraded states O, in which the

task is processed with less service rate but the work done so far will not be lost;

down states D, in which the work is lost and the task is considered failed. Note that

the task execution policy we considered in this paper is prs (preemptive resume) 7,

while the pri (preemptive repeat identical) and prd (preemptive repeat different)

policies 7 are beyond the scope of this paper. Assume there are a total of m states in

either U or O, and therefore n − m states in D. Each state i is assigned a reward rate

$r_i$ denoting the production capacity in that state, i.e., if the system stays in state

i for t time units, then the amount of work done in that state is $r_i t$. For simplicity,

we assume $r_i = 1$ for i ∈ U, $0 \le r_i < 1$ for i ∈ O, and $r_i = 0$ for i ∈ D. Let $\pi_0$

be the state probability vector of the system when the task is initially submitted.

Assume the task requires a fixed τ units of work (which means it will be com-

pleted in τ time units if the system stays in fully operational states), then the task

reliability is the probability that the accumulated reward is greater than or equal to

τ before the system enters one of the down states 18, and it can be quantified using

exact solution methods such as 22 24 27. However these methods are complicated

and in most cases do not scale to large systems.

In this paper we adopt a different approach which uses Erlang distribution to

approximate deterministic distribution to compute the task reliability. As shown

in 23 and 26, a deterministic distribution with parameter τ can be approximated

using a k-stage Erlang distribution with pdf $f(t; k, \lambda) = \lambda^k t^{k-1} e^{-\lambda t}/(k-1)!$ where

λ = k/τ is the rate parameter. It approaches the deterministic distribution as

k → ∞. Figure 3 shows the Markov chain with state space $\Gamma = \{1, 2, \ldots, k+1\}$

[Figure 3: chain 1 → 2 → ... → k → k+1, each transition at rate λ.]

Fig. 3. CTMC of k-stage Erlang distribution

where k + 1 is the absorbing state and the distribution of the time to absorption is

k-stage Erlang if the initial state is 1. Its generator matrix A is

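$$A = [a_{ij}]_{(k+1)\times(k+1)} =
\begin{matrix} 1 \\ \vdots \\ k \\ k+1 \end{matrix}
\begin{bmatrix}
-\lambda & \lambda & & 0 \\
 & \ddots & \ddots & \\
 & & -\lambda & \lambda \\
0 & \cdots & 0 & 0
\end{bmatrix}_{(k+1)\times(k+1)}$$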
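The structure of A, and the reason a larger k gives a better approximation of a fixed work requirement, can be sketched as follows. This is a minimal illustration: the function names and the value τ = 5 are assumptions made for the example, and the mean and variance used are the standard ones for an Erlang distribution (k/λ and k/λ²).

```python
def erlang_generator(k, tau):
    """Generator matrix A of the (k+1)-state chain; state k+1 is absorbing."""
    lam = k / tau                      # rate parameter, so the mean stays tau
    size = k + 1
    A = [[0.0] * size for _ in range(size)]
    for i in range(k):                 # transient stages 1..k (0-based 0..k-1)
        A[i][i] = -lam
        A[i][i + 1] = lam
    return A                           # last row stays zero (absorbing state)


def erlang_mean_variance(k, tau):
    """Mean and variance of the k-stage Erlang time to absorption."""
    lam = k / tau
    return k / lam, k / lam ** 2


for k in (1, 10, 100):
    mean, var = erlang_mean_variance(k, tau=5.0)
    print(k, mean, var)                # the variance shrinks as tau**2 / k
```

The mean stays at τ for every k while the variance decreases as τ²/k, which is why the distribution concentrates around τ as k grows.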

We use the k-stage Erlang distribution with rate parameter λ = k/τ to approximate

the fixed work requirement τ of the task. In 4 it is stated that if the work requirement

of the task is a PH random variable and the system model is a CTMC, the task

completion time is also PH and can be described by a CTMC. In this paper we

extend the CTMC model in 4 by adding the absorbing down states into it and we



call it the service model from now on. For better presentation of the service model, we

first reorder the states in Ω of the system model by moving the m states in U ∪ O to

the front. Assume the new ordered state space is $\Omega = \{s_1, s_2, \ldots, s_m, s_{m+1}, \ldots, s_n\}$,

where the first m states are either fully operational states or degraded states, and

the last n − m states are down states. $(s_1, s_2, \ldots, s_n)$ is a permutation of $(1, 2, \ldots, n)$.

Assume the permutation matrix is E, then

$$(1, 2, \ldots, n) = (s_1, s_2, \ldots, s_n)\,E, \quad \text{and} \quad (s_1, s_2, \ldots, s_n) = (1, 2, \ldots, n)\,E^T$$

where ET is the transpose of E. After reordering the states in Ω, the generator

matrix of the system model is

$$Q = E \cdot Q \cdot E^T = \begin{bmatrix} Q_1 & Q_2 \\ Q_3 & Q_4 \end{bmatrix}_{n \times n}$$

where $Q_1$ is an m × m sub-matrix that contains transition rates within states in

U ∪ O, and $Q_2$ is an m × (n − m) sub-matrix that contains transition rates from

U ∪ O to D. Define the diagonal matrix $R = \mathrm{diag}[r_{s_1}, r_{s_2}, \ldots, r_{s_m}]$ where $r_{s_i}$ is the

reward rate assigned to state $s_i$.

Based on the results in 4, the service model can be constructed from A and Q

as follows: each not-down state $s_i$ (1 ≤ i ≤ m) in the system model is expanded

into a cluster of k + 1 states $\{(s_i, j) \mid 1 \le j \le k+1\}$, where $(s_i, j)$ means the

system is in state $s_i$ and the task has reached the jth stage. The service model also

contains n − m down states $\{s_{m+1}, \ldots, s_n\}$, thus having mk + n states in total. Of the

mk + n states, $\{(s_i, j) \mid 1 \le i \le m, 1 \le j \le k\}$ are transient states, $\{(s_i, k+1) \mid 1 \le i \le m\}$

are absorbing states representing task completion, and $\{s_{m+1}, \ldots, s_n\}$ are

absorbing states denoting task failure. If we number each expanded state $(s_i, j)$ to

be m(j − 1) + i, and each down state $s_i \in \{s_{m+1}, \ldots, s_n\}$ to be mk + i, the generator

matrix C for the service model is

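$$C =
\begin{bmatrix}
Q_1 - \lambda R & \lambda R & 0 & \cdots & 0 & 0 & Q_2 \\
0 & Q_1 - \lambda R & \lambda R & \cdots & 0 & 0 & Q_2 \\
\vdots & & \ddots & \ddots & & \vdots & \vdots \\
0 & \cdots & 0 & Q_1 - \lambda R & \lambda R & 0 & Q_2 \\
0 & \cdots & 0 & 0 & Q_1 - \lambda R & \lambda R & Q_2 \\
0 & \cdots & 0 & 0 & 0 & 0 & 0 \\
0 & \cdots & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
C_{TT} & C_{TE} & C_{TD} \\
0 & 0 & 0
\end{bmatrix}$$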



where

$$C_{TT} =
\begin{bmatrix}
Q_1 - \lambda R & \lambda R & 0 & \cdots & 0 \\
0 & Q_1 - \lambda R & \lambda R & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & Q_1 - \lambda R & \lambda R \\
0 & \cdots & 0 & 0 & Q_1 - \lambda R
\end{bmatrix}$$

is a km × km sub-matrix of C and contains transition rates between transient states;

$$C_{TE} = [\,0 \;\; 0 \;\; \cdots \;\; 0 \;\; \lambda R\,]^T$$

is a km × m sub-matrix of C and contains transition rates from transient states to

absorbing states denoting successful task completion;

$$C_{TD} = [\,Q_2 \;\; Q_2 \;\; \cdots \;\; Q_2\,]^T$$

is a km × (n − m) sub-matrix of C and contains transition rates from transient

states to absorbing states denoting task failure.
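The block structure of C can be assembled mechanically from Q1, Q2 and the reward rates. The sketch below is one possible way to do so in plain Python, under the assumption that the states are already ordered with the m not-down states first; the function name and the numeric rates are hypothetical. It uses the state numbering given above (0-based in the code).

```python
def build_service_model(Q1, Q2, rewards, k, tau):
    """Assemble C for the ordering (s_i, j) -> m(j-1)+i, then the down states."""
    m, d = len(Q1), len(Q2[0])         # m not-down states, d = n - m down states
    lam = k / tau                      # Erlang rate parameter
    size = m * (k + 1) + d             # equals mk + n
    C = [[0.0] * size for _ in range(size)]
    down = m * (k + 1)                 # 0-based index of the first down state
    for stage in range(k):             # transient stages j = 1..k
        base = stage * m
        for i in range(m):
            row = base + i
            for j in range(m):
                C[row][base + j] = Q1[i][j]          # Q1 block
            C[row][base + i] -= lam * rewards[i]     # minus lam * R on the diagonal
            C[row][base + m + i] = lam * rewards[i]  # lam * R into the next stage
            for j in range(d):
                C[row][down + j] = Q2[i][j]          # Q2 block (task failure)
    return C                           # rows of absorbing states stay zero


# A two-node system with hypothetical failure rate gamma and repair rate mu.
gamma, mu = 0.001, 0.5
Q1 = [[-2 * gamma, 2 * gamma], [mu, -(mu + gamma)]]
Q2 = [[0.0], [gamma]]
C = build_service_model(Q1, Q2, rewards=[1.0, 0.5], k=3, tau=6.0)
print(len(C))                          # mk + n = 2 * 3 + 3 states
```

Every row of the assembled matrix sums to zero, which is a convenient sanity check that the three blocks C_TT, C_TE and C_TD were placed consistently.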

As an example, we consider a two-node parallel system shown in Figure 4(a).

Each node in the system has failure rate γ, and there is a single repair unit with

repair rate µ. The number within each state represents the number of failed nodes.

The task can be processed in parallel by the two nodes, each keeping in memory

a replication of task data that can be copied by the other. In Figure 4(a), state 0

is fully operational state and the reward rate is 1, state 1 has one failed node but

the task can still be processed with half of system capacity, thus the reward rate is

0.5, in state 2 both nodes are down and the task cannot be resumed after recovery,

thus the reward rate is 0. The service model is in Figure 4(b), where the repair

transition from the down state is removed. The states are numbered in the same

way as mentioned earlier in this section. States 2k + 1 and 2k + 2 represent task

success, while state 2k + 3 represents task failure.

Given the state probability vector π0 of the system model when the task is

initiated, the initial state probability vector for the service model is

$$u_0 = (x, 0, \ldots, 0 \mid 0 \mid y) = (u_T \mid u_E \mid u_D),$$

where

$$x = \pi_0 E^T \begin{bmatrix} I_{m \times m} \\ 0_{(n-m) \times m} \end{bmatrix}$$

is the initial probability vector for the service model in $((s_1, 1), (s_2, 1), \ldots, (s_m, 1))$,

$$y = u_D = \pi_0 E^T \begin{bmatrix} 0_{m \times (n-m)} \\ I_{(n-m) \times (n-m)} \end{bmatrix}$$

is the initial probability vector for the service model in down states, the 0s in $u_0$

are probability vectors for states $\{((s_1, j), \ldots, (s_m, j)) \mid j \in [2..k+1]\}$. $u_T$, $u_E$ and $u_D$



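[Figure 4: (a) two-node cluster: CTMC over states 0, 1, 2 (number of failed nodes) with failure transitions 0 → 1 at rate 2γ and 1 → 2 at rate γ, and repair transitions at rate µ; (b) service model on the cluster: transient states (0, j) and (1, j) for j = 1, ..., k, numbered 1 to 2k; task success states (0, k+1) and (1, k+1), numbered 2k+1 and 2k+2; task failure state 2, numbered 2k+3. Within a stage the transitions are (0, j) → (1, j) at rate 2γ, (1, j) → (0, j) at rate µ and (1, j) → 2 at rate γ; the stage advances at rate λ from (0, j) and at rate λ/2 from (1, j).]

Fig. 4. The two-node cluster and the service model with Erlang work requirement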

correspond to initial probability vectors for transient states, task completion states,

and task failure states, respectively. From 8, the probability vector for absorption

into task completion states is

$$v_E = -u_T \cdot C_{TT}^{-1} \cdot C_{TE} \tag{1}$$

The ith element in $v_E$ is the probability that the service model is absorbed into

state $(s_i, k+1)$, i.e., it is the probability that the task is successfully completed and

the system is in state $s_i$ at the time of task completion. Similarly, the probability

vector for absorption into task failure states is

$$v_D = -u_T \cdot C_{TT}^{-1} \cdot C_{TD} + u_D \tag{2}$$

where the ith element of $v_D$ is the probability that the service model is absorbed

into the task failure state $s_{m+i}$.

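Let $L = (Q_1 - \lambda R)^{-1}(-\lambda R)$; then the inverse of $C_{TT}$ is

$$C_{TT}^{-1} =
\begin{bmatrix}
(Q_1 - \lambda R)^{-1} & L(Q_1 - \lambda R)^{-1} & \cdots & L^{k-1}(Q_1 - \lambda R)^{-1} \\
 & (Q_1 - \lambda R)^{-1} & \cdots & L^{k-2}(Q_1 - \lambda R)^{-1} \\
 & & \ddots & \vdots \\
 & & (Q_1 - \lambda R)^{-1} & L(Q_1 - \lambda R)^{-1} \\
0 & & & (Q_1 - \lambda R)^{-1}
\end{bmatrix}$$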

Therefore vE can be computed as

$$v_E = -[\,x \;\; 0 \;\; \cdots \;\; 0\,] \cdot C_{TT}^{-1} \cdot [\,0 \;\; \cdots \;\; 0 \;\; \lambda R\,]^T = x \cdot L^k \tag{3}$$



Similarly, vD can be computed as

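$$\begin{aligned}
v_D &= -[\,x \;\; 0 \;\; \cdots \;\; 0\,] \cdot C_{TT}^{-1} \cdot [\,Q_2 \;\; Q_2 \;\; \cdots \;\; Q_2\,]^T + u_D \\
    &= x \cdot \left[ -\left( \sum_{i=0}^{k-1} L^i \right) \cdot (Q_1 - \lambda R)^{-1} \cdot Q_2 \right] + y
\end{aligned} \tag{4}$$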

If we choose k = 1 for the Erlang approximation, then $C_{TT} = Q_1 - \lambda R$,

$C_{TE} = \lambda R$, $u_T = x$. From Equation 1,

$$v_E = x \cdot [(Q_1 - \lambda R)^{-1}(-\lambda R)] = x \cdot L \tag{5}$$

Comparing the equation above with Equation 3, it can be seen that the solution

for service model with k-stage Erlang approximation can be acquired by solving

the service model with 1-stage Erlang k times, using vE computed from the

previous run as the initial probability vector for the next run.
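The repeated 1-stage solution described above can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' code: the helper names are ours, the linear systems are solved with a small Gaussian elimination routine so that no inverse is formed explicitly, and all numeric rates are hypothetical. The single-state check at the end uses the two-state system of Section 2.1, for which the exact continuity is e^{−λτ}.

```python
import math


def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = s / aug[i][i]
    return w


def erlang_stage(v, Q1, rewards, lam):
    """Apply L = (Q1 - lam R)^{-1} (-lam R) once to the row vector v."""
    m = len(Q1)
    # w = v (Q1 - lam R)^{-1} is obtained from (Q1 - lam R)^T w^T = v^T.
    bt = [[Q1[j][i] - (lam * rewards[i] if i == j else 0.0) for j in range(m)]
          for i in range(m)]
    w = solve(bt, v)
    return [-lam * rewards[i] * w[i] for i in range(m)]


def task_completion_vector(x, Q1, rewards, k, tau):
    """v_E = x L^k, computed as k successive 1-stage solutions."""
    lam = k / tau
    v = list(x)
    for _ in range(k):
        v = erlang_stage(v, Q1, rewards, lam)
    return v


# Single up state with hypothetical failure rate 0.01 and work requirement 5.
approx = task_completion_vector([1.0], [[-0.01]], [1.0], k=2000, tau=5.0)[0]
print(approx, math.exp(-0.01 * 5.0))
```

Each stage costs one m × m linear solve, so the total work grows linearly in k rather than with the size mk + n of the full service model.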

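Given $v_E$ in Equation 3 and $v_D$ in Equation 4 that correspond to ordering

$(s_1, s_2, \ldots, s_n)$ of the system states, we can get the vectors $\pi_E$ and $\pi_D$ corresponding

to state ordering (1, 2, ..., n) as

$$\begin{aligned}
\pi_E &= v_E \cdot [\,I_{m \times m} \;\; 0_{m \times (n-m)}\,] \cdot E \\
      &= \pi_0 \cdot \left[ E^T \begin{bmatrix} I_{m \times m} \\ 0 \end{bmatrix} \cdot L^k \cdot [\,I_{m \times m} \;\; 0_{m \times (n-m)}\,]\, E \right] \\
      &= \pi_0 \cdot K
\end{aligned} \tag{6}$$

$$\begin{aligned}
\pi_D &= v_D \cdot [\,0_{(n-m) \times m} \;\; I_{(n-m) \times (n-m)}\,] \cdot E \\
      &= \pi_0 \cdot \left[ E^T \cdot \left( J + \begin{bmatrix} 0_{m \times m} & 0_{m \times (n-m)} \\ 0_{(n-m) \times m} & I_{(n-m) \times (n-m)} \end{bmatrix} \right) \cdot E \right] \\
      &= \pi_0 \cdot N
\end{aligned} \tag{7}$$

where

$$J = \begin{bmatrix} I_{m \times m} \\ 0_{(n-m) \times m} \end{bmatrix} \cdot \left[ -\left( \sum_{i=0}^{k-1} L^i \right) \cdot (Q_1 - \lambda R)^{-1} \cdot Q_2 \right] \cdot [\,0_{(n-m) \times m} \;\; I_{(n-m) \times (n-m)}\,].$$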

The ith element of πE is the probability that the task is completed successfully

and the system is in state i at the time of task completion, and the ith element of

πD is the probability that the task fails and the system is in state i at the time

of task failure. The (i, j)th element of matrix K is the conditional probability that

the task succeeds and the system is in state j at the time of task completion, given

the system is in state i when the task is initially sent to the system. Similarly, the

(i, j)th element of matrix N is the conditional probability that the task fails and

the system is in state j at the time of task failure, given the system is in state i

when the task is initially sent to the system.

The service reliability for the single task can be computed as

$$SR = \pi_E \cdot 1_{n \times 1}$$



The service accessibility is

$$s_a = x \cdot 1_{m \times 1} = \pi_0 \cdot E^T \cdot \begin{bmatrix} 1_{m \times 1} \\ 0_{(n-m) \times 1} \end{bmatrix}$$

which is the probability that the system is not in one of the down states when the

task is submitted.
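As a small worked illustration of the two quantities above, the sketch below computes s_a and SR for a two-node system of the kind shown in Figure 4(a). It assumes the states are already ordered with the not-down states first (so E is the identity), uses the steady-state distribution as π0, and takes hypothetical values for γ, µ, k and τ; none of these numbers come from the paper.

```python
def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = s / aug[i][i]
    return w


def apply_L(v, Q1, rewards, lam):
    """Return v (Q1 - lam R)^{-1} (-lam R) for a row vector v."""
    m = len(Q1)
    bt = [[Q1[j][i] - (lam * rewards[i] if i == j else 0.0) for j in range(m)]
          for i in range(m)]
    w = solve(bt, v)
    return [-lam * rewards[i] * w[i] for i in range(m)]


def accessibility_and_reliability(pi0, Q1, rewards, k, tau):
    """Return (s_a, SR) assuming the first m entries of pi0 are not-down states."""
    m, lam = len(Q1), k / tau
    v = list(pi0[:m])                  # x: initial mass on the not-down states
    s_a = sum(v)                       # s_a = x * 1
    for _ in range(k):
        v = apply_L(v, Q1, rewards, lam)
    return s_a, sum(v)                 # SR = x * L^k * 1


gamma, mu = 0.001, 0.5                 # hypothetical failure and repair rates
Q1 = [[-2 * gamma, 2 * gamma], [mu, -(mu + gamma)]]
norm = 1 + 2 * gamma / mu + 2 * gamma ** 2 / mu ** 2
pi0 = [1 / norm, (2 * gamma / mu) / norm, (2 * gamma ** 2 / mu ** 2) / norm]
print(accessibility_and_reliability(pi0, Q1, [1.0, 0.5], k=50, tau=5.0))
```

SR can never exceed s_a here, since a task that cannot be initiated cannot complete.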

As a special case, we consider the task with 0 processing time (which means

the task processing time can be neglected compared with failure/recovery times of

system resources). In this case λ in Figure 3 is ∞. Then $L = (Q_1 - \lambda R)^{-1}(-\lambda R) = I$.

From Equation 6,

$$\pi_E = \pi_0 \cdot E^T \begin{bmatrix} I_{m \times m} & 0 \\ 0 & 0 \end{bmatrix} \cdot E, \quad K = E^T \begin{bmatrix} I_{m \times m} & 0 \\ 0 & 0 \end{bmatrix} \cdot E \tag{8}$$

i.e., πE is the system state probability vector for successful task initiation. And the

task’s service reliability is simply its service accessibility.

2.3. User-perceived Service Reliability

In the previous section we introduced the service reliability for a single task. However,

during a user session the user often issues multiple requests/tasks each of which

may require different system resources. The user-perceived service reliability is the

probability that all requests are successfully satisfied during the user session. The

success of the user session depends not only on the system state, but also on the user

behavior such as the number of requests during the session, user think times, session

duration, etc. Before presenting the method of computing the user-perceived ser-

vice reliability, we first introduce the user behavior graph that captures the user’s

activity during a session.

2.3.1. User Behavior Graph

The user behavior graph (UBG) 5 6 is often used in performance benchmarks

to model the user behavior. In this paper we borrow the same concept for user-

perceived service reliability modeling purpose. For a user session, we define the user

behavior graph to consist of a set of nodes and arcs. There is one absorbing node E

that represents the end of user session. Each node other than E indicates a certain

task type. Each task type defines its own sets of fully operational states, degraded

states and down states (depending on the system resources the task needs), as

well as the work requirement for it to finish. An arc from node i to node j means

that a task of type i is completed and a new task of type j is submitted to the

system, while an arc from node i to E denotes the session ends after the task of

type i is completed. We assume the user behavior graph is a DTMC, and hence

each arc is labeled with a transition probability. The sojourn time in each node is the

task completion time; together with these sojourn times the UBG becomes a

semi-Markov chain. Figure 5(a) shows an example user behavior graph with two types



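[Figure 5: (a) original: nodes 1, 2 and E with arcs labeled p11, p12, p1E, p21, p22, p2E; (b) extended: an added failure node F reached from nodes 1 and 2 with probabilities p1F and p2F, and the remaining arcs relabeled (1-p1F)p11, (1-p1F)p12, (1-p1F)p1E, (1-p2F)p21, (1-p2F)p22, (1-p2F)p2E.]

Fig. 5. Example user behavior graph with two types of requests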

of requests 1 and 2. Let $p_{ij}$ (i, j ∈ {1, 2}) be the probability that the user will issue

a request of type j after the current request of type i. Let $p_{iE}$ be the probability

that the user will end the session after issuing a request of type i.

To model service failure, we add another state F into the user behavior graph

which indicates the failure of the user session. The extended user behavior graph is

shown in Figure 5(b). For each node i, assume the failure probability for executing

the task of type i is piF . Then the transition probability from node i to node j is

changed to (1 − piF ) · pij .
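The extension of the user behavior graph with the failure state F amounts to rescaling each row and appending one column. The sketch below shows this for a two-task graph; the probabilities are hypothetical, and each failure probability is treated as a fixed number purely for illustration.

```python
def extend_ubg(P, pE, pF):
    """Extended UBG transition matrix over the states (1..m, E, F)."""
    m = len(P)
    ext = []
    for i in range(m):
        keep = 1.0 - pF[i]             # probability that the task of type i succeeds
        row = [keep * P[i][j] for j in range(m)]
        row += [keep * pE[i], pF[i]]   # columns for E and F
        ext.append(row)
    ext.append([0.0] * m + [1.0, 0.0])  # E is absorbing
    ext.append([0.0] * m + [0.0, 1.0])  # F is absorbing
    return ext


# Hypothetical two-task user behavior graph.
P = [[0.2, 0.3], [0.4, 0.1]]           # p_ij
pE = [0.5, 0.5]                        # p_iE
pF = [0.01, 0.02]                      # illustrative task failure probabilities
for row in extend_ubg(P, pE, pF):
    print(row)
```

Each row of the result still sums to 1, because the original row (the p_ij together with p_iE) sums to 1 and is scaled by 1 − p_iF before p_iF is added.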

Given the extended user behavior graph, the user-perceived service reliability

is defined as the probability that the model enters the absorbing state E. Because

successful completion of a task requires certain system resources to be up during

task execution and the availabilities of system resources vary with time, the value of

piF depends on the system state at the time when the task is initiated and may not

be a constant. Therefore the extended user behavior graph is not a homogeneous

DTMC. In the next section we show how to compute the user-perceived service

reliability (i.e., the model enters the absorbing state E) by combining the user

behavior graph and the service models.

2.4. User-perceived Service Reliability Derivation

To derive the formulas for user-perceived service reliability computation, we make

the following definitions:

• The original UBG (without adding the F state) is a DTMC with m + 1

states, where the (m + 1)th state is the absorbing session end state E,

each of the other states represents a different type of task. The state space

$\Omega_U = \{1, 2, \ldots, m, E\}$. Let

$$P = \begin{bmatrix} P & p_E \\ 0 & 1 \end{bmatrix}$$

be the (m + 1) × (m + 1) transition probability matrix, where $p_E$ is a



column vector whose ith element is $p_{iE}$. $P = (p_1, p_2, \ldots, p_m)$ is the m × m

transition probability sub-matrix for the first m states, and $p_i$, i ∈ [1..m],

are column vectors. The initial probability vector when the session begins

is $v_0 = (v_1, v_2, \ldots, v_m, v_E)$.

• The system model is a CTMC with state space $\Omega_S = \{1, 2, \ldots, n\}$ and

generator matrix Q. The initial probability vector at the beginning of the

session is $\pi_0 = (\pi_1, \pi_2, \ldots, \pi_n)$.

As defined in Section 2.2, each task type i in the user model has its own sets of fully

operational states $U_i$, degraded states $O_i$ and down states $D_i$ in the system model,

as well as the reward rates for each system state. Therefore as shown in Equation

6, each state i of the user model corresponds to a matrix $K_i$, in which the (j, k)th

element of $K_i$ is the probability that the task of type i completes successfully and

the system is in state k at the time the task ends, given that the system is in state

j when the task is initiated.

Given the user behavior graph and matrices Ki (i ∈ [1..m]) for each user state,

we build the DTMC that can be used to compute the user-perceived service relia-

bility. The DTMC has one absorbing state E that represents user session success,

and one absorbing state F representing user session failure. For any other state, we

label it using a 2-tuple (i, j) which means the system is in state j ∈ [1..n] when a

task of type i ∈ [1..m] is initiated in the system. Then we have the following state

transition probabilities:

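$$\begin{aligned}
(i, j) \to (x, y) &: \; k^{(i)}_{jy} \cdot p_{ix} \\
(i, j) \to E &: \; \sum_{y=1}^{n} k^{(i)}_{jy} \cdot p_{iE} \\
(i, j) \to F &: \; 1 - \sum_{y=1}^{n} k^{(i)}_{jy}
\end{aligned}$$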

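The initial probability for state $(i, j)$ is $v_i \cdot \pi_j$. If we number state $(i, j)$ as $(i - 1) \cdot n + j$, $E$ as $mn + 1$ and $F$ as $mn + 2$, then the transition probability matrix is

\[
M = \begin{bmatrix}
p_{11}K_1 & p_{12}K_1 & \cdots & p_{1m}K_1 & p_{1E}K_1 \cdot \mathbf{1} & \mathbf{1} - K_1 \cdot \mathbf{1} \\
p_{21}K_2 & p_{22}K_2 & \cdots & p_{2m}K_2 & p_{2E}K_2 \cdot \mathbf{1} & \mathbf{1} - K_2 \cdot \mathbf{1} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
p_{m1}K_m & p_{m2}K_m & \cdots & p_{mm}K_m & p_{mE}K_m \cdot \mathbf{1} & \mathbf{1} - K_m \cdot \mathbf{1} \\
0_{1 \times n} & 0_{1 \times n} & \cdots & 0_{1 \times n} & 1 & 0 \\
0_{1 \times n} & 0_{1 \times n} & \cdots & 0_{1 \times n} & 0 & 1
\end{bmatrix}
= \begin{bmatrix}
S_{mn \times mn} & \alpha_{mn \times 1} & \beta_{mn \times 1} \\
0_{1 \times mn} & 1 & 0 \\
0_{1 \times mn} & 0 & 1
\end{bmatrix}
\]

where $\mathbf{1}$ denotes a column vector of ones, and the initial probability vector is

\[
u(0) = (u_T(0), u_E(0), u_F(0)) = (u_T(0), 0, 0)
\]

where $u_T(0) = (u_1(0), u_2(0), \cdots, u_m(0))$ and $u_j(0) = (v_j\pi_1, v_j\pi_2, \ldots, v_j\pi_n)$ is the initial probability vector for states $(j, 1), (j, 2), \cdots, (j, n)$.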
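The block structure of M can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_session_dtmc`, the 0-based state numbering and the toy values of `P`, `p_E` and `K` are all assumptions made here, and plain lists are used so that no external library is required.

```python
def build_session_dtmc(P, p_E, K):
    """Assemble the transition matrix M of the composite DTMC.

    P   : m x m user transition sub-matrix (list of rows)
    p_E : p_E[i] = probability that the session ends after task type i
    K   : K[i][j][y] = P(task i succeeds and the system ends in state y,
          given that the system is in state j when the task starts)
    State (i, j) is mapped to index i*n + j (0-based); E is m*n, F is m*n + 1.
    """
    m, n = len(P), len(K[0])
    E, F = m * n, m * n + 1
    M = [[0.0] * (m * n + 2) for _ in range(m * n + 2)]
    for i in range(m):
        for j in range(n):
            row = M[i * n + j]
            success = sum(K[i][j])            # j-th entry of K_i . 1
            for x in range(m):
                for y in range(n):
                    row[x * n + y] = P[i][x] * K[i][j][y]
            row[E] = p_E[i] * success         # entry of alpha
            row[F] = 1.0 - success            # entry of beta
    M[E][E] = 1.0                             # E and F are absorbing
    M[F][F] = 1.0
    return M


# Hypothetical toy inputs: m = 2 task types, n = 2 system states.
P = [[0.2, 0.5], [0.6, 0.1]]
p_E = [0.3, 0.3]
K = [
    [[0.98, 0.01], [0.00, 0.00]],   # task type 1 fails if the system starts in state 2
    [[0.97, 0.02], [0.10, 0.85]],
]
M = build_session_dtmc(P, p_E, K)
```

Each row of M sums to 1 whenever each row of the user matrix (P together with p_E) sums to 1, which is a quick sanity check on the inputs.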

If we use $u(k)$ to represent the state probability vector after the $k$th transition, we have

\[
u(k) = u(k - 1) \cdot M \tag{9}
\]

Therefore

\[
u_T(k) = u_T(k - 1) \cdot S, \quad \text{and} \quad u_j(k) = \sum_{i=1}^{m} p_{ij} \cdot u_i(k - 1) K_i
\]

We can rewrite $u_T(k)$ in matrix form as follows:

\[
u_T(k) = \begin{bmatrix} u_1(k) \\ u_2(k) \\ \vdots \\ u_m(k) \end{bmatrix}
= P^T \cdot \begin{bmatrix} u_1(k - 1) \cdot K_1 \\ u_2(k - 1) \cdot K_2 \\ \vdots \\ u_m(k - 1) \cdot K_m \end{bmatrix} \tag{10}
\]

The equation above takes $O(m^2 n + m n^2)$ time to compute.

From Equation 9, $u_E(k) = u_E(k - 1) + u_T(k - 1) \cdot \alpha$, and the user-perceived service reliability is

\[
SR = u_E(\infty) = \sum_{k=0}^{\infty} u_T(k) \cdot \alpha \tag{11}
\]
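The iteration in Equations 10 and 11 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `service_reliability`, the stopping tolerance `tol` and the toy inputs are introduced here. Stopping once the remaining transient probability mass falls below `tol` truncates the infinite sum in Equation 11 with an error of at most that mass.

```python
def service_reliability(P, p_E, K, v, pi, tol=1e-12, max_steps=100000):
    """Accumulate SR = sum_k u_T(k) . alpha using the update of Equation 10.

    P, p_E, K follow the definitions of Section 2.4; v and pi are the initial
    user and system probability vectors (v excludes the E state).
    """
    m, n = len(P), len(pi)
    # u[i][j] = probability that a task of type i starts with the system in state j
    u = [[v[i] * pi[j] for j in range(n)] for i in range(m)]
    sr = 0.0
    for _ in range(max_steps):
        # w[i] = u_i(k-1) . K_i, costing O(m n^2) per step
        w = [[sum(u[i][j] * K[i][j][y] for j in range(n)) for y in range(n)]
             for i in range(m)]
        # u_T(k-1) . alpha = sum_i p_iE * (u_i(k-1) K_i 1)
        sr += sum(p_E[i] * sum(w[i]) for i in range(m))
        # u_x(k) = sum_i p_ix * w[i], costing O(m^2 n) per step
        u = [[sum(P[i][x] * w[i][y] for i in range(m)) for y in range(n)]
             for x in range(m)]
        if sum(map(sum, u)) < tol:       # remaining transient mass
            break
    return sr


# Hypothetical check 1: one task type, one system state, success probability 0.9
# per task and session end probability 0.5 after each success.
sr_single = service_reliability([[0.5]], [0.5], [[[0.9]]], v=[1.0], pi=[1.0])

# Hypothetical check 2: tasks never fail (K_i = I), so every session succeeds.
I2 = [[1.0, 0.0], [0.0, 1.0]]
sr_perfect = service_reliability([[0.4, 0.3], [0.4, 0.3]], [0.3, 0.3],
                                 [I2, I2], v=[1.0, 0.0], pi=[0.5, 0.5])
```

For the first check the sum is geometric, so the result should agree with $0.45 / (1 - 0.45)$ up to the tolerance.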

2.5. Modeling More General User Behavior

In Section 2.4 we derived the user-perceived service reliability by assuming the

work requirement of each task is deterministic and each state in the user model

corresponds to a task type. In this section we discuss some extensions of the models

used in the derivation to allow the modeling of more general user behavior.

First, the work requirement of each task does not need to be deterministic; it can follow any PH-type distribution, so that the techniques in Ref. 4 can be applied to construct the service model, and Equation 1 can still be used to compute the transition probability matrix K. The difference between Equation 1 and Equation 6 (which is for the case of the Erlang distribution) is that the former requires solving a larger service model and thus manipulating a larger matrix.

Second, in Section 2 we assumed for simplicity that each state i of the UBG represents the user sending a task of type i. However, a state can also represent other user behavior, as long as we can obtain the transition probability matrix Ki, in which the (j, k)th element is the probability that the system is in state k when the UBG leaves state i, given that the system is in state j when the UBG enters state i.



As an example, a UBG state $i$ can represent a user think state, in which all states in the system model are fully operational states. If the thinking time distribution is $\Theta(t)$, the transition probability matrix $K_i$ is

\[
K_i = \int_{t=0}^{\infty} e^{Qt} \, d\Theta(t)
\]

where $Q$ is the generator matrix of the system model. Specifically, if the user think time is deterministic with value $\theta$, then

\[
K_i = e^{Q\theta} \tag{12}
\]

If $\Theta(t)$ is a PH distribution, the method in Section 2.2 can be used to compute $K_i$. In particular, if the think time is exponentially distributed with rate $\lambda$, then

\[
K_i = \int_{t=0}^{\infty} e^{Qt} \, d(1 - e^{-\lambda t}) = \left(I - \frac{Q}{\lambda}\right)^{-1} \tag{13}
\]
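Both think-state matrices can be computed numerically, as the following sketch suggests. It is one possible approach and not necessarily the authors' method: the matrix exponential is evaluated by uniformization (randomization) and the inverse by Gauss-Jordan elimination. The helper names, the two-state generator and the rates `a`, `b`, `theta` and `lam` are assumptions chosen for illustration.

```python
import math


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def expm_uniformization(Q, t, tol=1e-12):
    """e^{Qt} for a CTMC generator Q via uniformization."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n)) or 1.0       # uniformization rate
    Pt = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
          for i in range(n)]
    weight = math.exp(-q * t)                        # Poisson weight for k = 0
    term = identity(n)                               # Pt^k
    result = [[weight * x for x in row] for row in term]
    acc, k = weight, 0
    while 1.0 - acc > tol and k < 10000:
        k += 1
        weight *= q * t / k
        term = mat_mul(term, Pt)
        result = [[r + weight * x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
        acc += weight
    return result


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    aug = [[float(x) for x in row] + identity(n)[i] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


# Hypothetical two-state system: failure rate a, repair rate b (per hour).
a, b = 1.0 / 168.0, 2.0
Q = [[-a, a], [b, -b]]
theta = 5.0 / 60.0      # deterministic think time (hours), Equation 12
lam = 12.0              # exponential think-time rate (per hour), Equation 13

K_det = expm_uniformization(Q, theta)
K_exp = inverse([[(1.0 if i == j else 0.0) - Q[i][j] / lam for j in range(2)]
                 for i in range(2)])
```

Because every system state is operational in a think state, each row of both matrices should sum to 1.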

The user retry behavior can also be included in each UBG state $i$ that represents sending a task of type $i$. As shown in Figure 6, the shaded rectangle represents state $i$ of the UBG. Inside it, the sub-state $i_S$ represents the first try. If it is successful, the UBG leaves state $i$ and the transition probability matrix is $K_i$, which corresponds to $K$ in Equation 6. If it is unsuccessful, the model enters sub-state $i_W$ with transition probability matrix $N_i$, which corresponds to $N$ in Equation 7. Assume the user sends the retry after $\omega$ time units; then the transition probability matrix from $i_W$ to $i_R$ is $W_i = e^{Q\omega}$, where $i_R$ is the retry sub-state, and the transition probability matrix for retry success is again $K_i$.

Fig. 6. User retry behavior (state i contains sub-states iS, iW and iR; iS and iR exit on success with matrix Ki, iS moves to iW with matrix Ni, iW moves to iR with matrix Wi, and a failed retry leads to F)

Therefore, the overall transition probability matrix associated with state $i$, written $\tilde{K}_i$ here to distinguish it from the single-try matrix $K_i$, is

\[
\tilde{K}_i = K_i + N_i W_i K_i = (I + N_i W_i) \cdot K_i
\]

If the maximum number of retries is $r$, then $\tilde{K}_i$ becomes

\[
\tilde{K}_i = \left[ \sum_{j=0}^{r} (N_i W_i)^j \right] \cdot K_i
\]
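The retry formula is a truncated geometric series in the matrix product of Ni and Wi, which the following sketch accumulates term by term. The function name `retry_matrix` and the 2x2 matrices are hypothetical values chosen only to make the arithmetic easy to follow; they are not derived from Equations 6 and 7.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def retry_matrix(K, N, W, r):
    """Return [sum_{j=0}^{r} (N W)^j] K for at most r retries."""
    n = len(K)
    NW = mat_mul(N, W)
    power = identity(n)          # (N W)^0
    total = identity(n)          # running sum of powers
    for _ in range(r):
        power = mat_mul(power, NW)
        total = mat_add(total, power)
    return mat_mul(total, K)


# Hypothetical matrices: a try succeeds only if the system is in state 1,
# and a failed try leaves the system in state 2.
K = [[1.0, 0.0], [0.0, 0.0]]
N = [[0.0, 0.0], [0.0, 1.0]]
W = [[0.9, 0.1], [0.6, 0.4]]     # stands in for e^{Q omega}

no_retry = retry_matrix(K, N, W, 0)
one_retry = retry_matrix(K, N, W, 1)
two_retries = retry_matrix(K, N, W, 2)
```

With these numbers, a session that starts with the system in state 2 succeeds with probability 0.6 after one retry and 0.6 + 0.4 · 0.6 = 0.84 after two, which is what the second row of the resulting matrices should show.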

As a special case of the user-perceived service reliability, we consider the random-request availability for a perfect system in Ref. 20. Given a two-state system with constant failure rate $\alpha$ and repair rate $\beta$, and a sequence of $k$ tasks, the random-request availability for a perfect system is defined as the probability that the system is up at every task arrival time $t_1, t_2, \ldots, t_k$, given that the system is up at time $t_0$. This can be viewed as a user sending $k$ requests, with the random-request availability being the user-perceived service reliability. The system model and the user model are shown in Figure 7(a) and Figure 7(b), respectively. The number of system states is $n = 2$ and the number of user states is $2k + 1$, where $m = 2k$.

Fig. 7. System model and the user model: (a) system, with states U and D, failure rate α and repair rate β; (b) user, the chain T0, P1, T1, P2, ..., Pk, E with every transition probability equal to 1

In Figure 7(b), the $P_i$ $(i = 1..k)$ states represent request processing; for the purpose of random-request availability computation, the work requirement of each request is set to 0. The $T_i$ $(i = 0..k-1)$ states are user think states with duration $t_{i+1} - t_i$. From Equation 8, the K matrices for the $P_i$ $(i = 1..k)$ states are

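\[
K_{2i} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
\]

From Equation 12, the K matrices for the $T_i$ $(i = 0..k-1)$ states are

\[
K_{2i+1} = \frac{1}{\alpha + \beta}
\begin{bmatrix}
\beta + \alpha \cdot e^{-(\alpha+\beta)(t_{i+1} - t_i)} & \alpha \cdot [1 - e^{-(\alpha+\beta)(t_{i+1} - t_i)}] \\
\beta \cdot [1 - e^{-(\alpha+\beta)(t_{i+1} - t_i)}] & \alpha + \beta \cdot e^{-(\alpha+\beta)(t_{i+1} - t_i)}
\end{bmatrix}
\]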

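For the user model, the transition probability matrix $P$ is

\[
P = \begin{bmatrix}
0 & 1 & & \\
 & 0 & \ddots & \\
 & & \ddots & 1 \\
 & & & 0
\end{bmatrix}_{2k \times 2k}
\]

and $p_{1E} = p_{2E} = \ldots = p_{2k-1,E} = 0$, $p_{2k,E} = 1$.

The initial system probability vector is $\pi_0 = (1, 0)$, and the initial probability vector for the user model is $(1, 0, \ldots, 0)$. Then we have $u_T(0) = ((1, 0), (0, 0), \ldots, (0, 0))$. From Equations 10 and 11, we have

\[
SR = \prod_{i=1}^{k} \left( \frac{\beta}{\alpha + \beta} + \frac{\alpha}{\alpha + \beta} \cdot e^{-(\alpha+\beta)(t_i - t_{i-1})} \right)
\]

This is consistent with the result in Ref. 20.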
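The closed-form product can be cross-checked numerically by pushing the state probability vector through the alternating think and request states. The sketch below does this for the two-state system; the function names, the rates `a` and `b` (standing for the failure and repair rates) and the arrival times are assumptions introduced here.

```python
import math


def think_matrix(a, b, dt):
    """Closed-form e^{Q dt} for the two-state system (failure rate a, repair rate b)."""
    s, e = a + b, math.exp(-(a + b) * dt)
    return [[(b + a * e) / s, a * (1.0 - e) / s],
            [b * (1.0 - e) / s, (a + b * e) / s]]


def sr_chain(a, b, times):
    """Propagate through T_0, P_1, T_1, ..., P_k with times = [t_0, ..., t_k]."""
    u = [1.0, 0.0]                                   # system is up at t_0
    for t_prev, t_next in zip(times, times[1:]):
        K = think_matrix(a, b, t_next - t_prev)
        u = [u[0] * K[0][0] + u[1] * K[1][0],        # think state
             u[0] * K[0][1] + u[1] * K[1][1]]
        u = [u[0], 0.0]                              # request state, K = [[1, 0], [0, 0]]
    return u[0]


def sr_product(a, b, times):
    """Closed-form product over the k inter-arrival intervals."""
    s, result = a + b, 1.0
    for t_prev, t_next in zip(times, times[1:]):
        result *= b / s + a / s * math.exp(-s * (t_next - t_prev))
    return result


# Hypothetical rates and arrival times t_0, t_1, t_2, t_3.
chain_value = sr_chain(0.01, 1.0, [0.0, 1.0, 3.0, 3.5])
product_value = sr_product(0.01, 1.0, [0.0, 1.0, 3.0, 3.5])
```

The two values should agree to rounding error, since each request state discards the probability mass in the down state and each think state contributes one factor of the product.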



3. A Numerical Example

In this section we apply the service reliability modeling technique to a cluster

hosting two stateless services, which could be web services, SOA services, database

services, etc.

3.1. System and User models

The basic architecture is shown in Figure 8. It has two nodes NodeA and NodeB.

Fig. 8. Architecture for a computer cluster (Node A and Node B each run one process for Service 1 and one process for Service 2)

Process replication is adopted as the mechanism to provide application level soft-

ware fault tolerance, i.e., the software processes for each service are replicated on

both nodes. As examples of such replicated software services we point out Avaya’s
SwiFT (Ref. 12) and IBM’s HA WebSphere (Ref. 14). Both the software processes and the node

hardware may fail in the cluster. The failure rate for service 1 process is λ1, while

for service 2 process the failure rate is λ2. Each node has hardware failure rate

λh and when a node has a hardware failure, both software processes running on

that node also become inaccessible. The cluster has a single repair unit. The repair

rates for service 1 process, service 2 process and node hardware are µ1, µ2 and µh,

respectively. Service 1 is up if at least one of its two processes is up, and so is service

2. In the case of multiple failures, the repair unit first repairs the service with both

of its processes down.

The system model for the computer cluster is shown in Figure 9(a). To reduce

the number of states in the system model, we assume that the maximum number of

coincident failures is two. Each state is labeled as (x, y, z) where x is the number of

operational service 1 processes, y is the number of operational service 2 processes,

and z is the number of operational nodes.

The user behavior graph is shown in Figure 9(b). During a user session, the user may send multiple requests for either service 1 or service 2 in the cluster. State iP (i = 1, 2) is the request processing state, which requires service i to be up; after the request is completed successfully, the model enters the user think state iT, in which neither service is required. When the user leaves state iT, with probability pij (j = 1, 2) he/she sends another request for service j and the model enters state jP; with probability piE the user ends the session without additional requests. This applies to both state 1T and state 2T. When in state iP (i = 1, 2), if the request cannot be completed, the model enters the F state representing user session failure.

Fig. 9. System model and the user model: (a) system model, a CTMC over the states (2,2,2), (1,2,2), (2,1,2), (1,1,1), (0,2,2), (1,1,2), (0,1,1), (2,0,2), (1,0,1) and (0,0,0), with failure rates λ1, λ2, λh and repair rates µ1, µ2, µh; (b) user model, with states 1P, 1T, 2P, 2T, E and F and transition probabilities p11, p12, p21, p22, p1E and p2E

Given the system model and the user model, the system up states for user state 1P are $\{(x, y, z) \mid x \geq 1\}$, and those for user state 2P are $\{(x, y, z) \mid y \geq 1\}$. For user states 1T and 2T, because neither service is required, every system state is an up state.

We assume that each user request is served very quickly so that the processing time can be neglected, so Equation 8 is used to compute the K matrices for user states 1P and 2P. The thinking times in state 1T and state 2T are T1 and T2, respectively. We also assume that the first request of the user session is of type 1, so the user model is initially in state 1P, and that the system model is in steady state when the first user request is sent.

3.2. Numerical Results

In this section we present the numerical results for the user-perceived service reliability of the example in the previous section. The default parameter values used in the example are shown in Table 2. For comparison purposes, we also draw the curves for steady-state system availability and the lower bound of user-perceived service reliability (plotted as system unavailability and as the upper bound of service unreliability, respectively). The former is computed as the steady-state probability that both services are up. The latter is computed using the steady-state availability for service i (i = 1, 2) as the service accessibility for each user request of type i, and can be viewed as the user-perceived service reliability with infinite user think time. It is computed by hierarchically solving the user model and the system model: first use the system model to compute si (i = 1, 2), the steady-state probability that service i is up; then assign 1 − si as the transition probability from iP to F (and si as that from iP to iT), and solve the user model in Figure 9(b) to get the absorption probability in state E.
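The second step of this hierarchical solution reduces to a 2x2 linear system. Let $x_i$ be the probability of eventually reaching E from state iP; then $x_i = s_i (p_{iE} + p_{i1} x_1 + p_{i2} x_2)$, and the bound is $x_1$ because the session starts in 1P. The sketch below solves this system by Cramer's rule. The function name and the availability values `s1` and `s2` are placeholders assumed here (the paper obtains si from the system model); the transition probabilities are the defaults from Table 2.

```python
def session_success_bound(s1, s2, p):
    """Absorption probabilities in E starting from 1P and 2P.

    s1, s2 : steady-state probability that service 1 / service 2 is up
    p      : dict mapping (i, j) to p_ij for i in {1, 2}, j in {1, 2, 'E'}
    """
    # Rearranged system:
    #   (1 - s1*p11) x1 -      s1*p12  x2 = s1*p1E
    #       -s2*p21  x1 + (1 - s2*p22) x2 = s2*p2E
    a11 = 1.0 - s1 * p[(1, 1)]
    a12 = -s1 * p[(1, 2)]
    a21 = -s2 * p[(2, 1)]
    a22 = 1.0 - s2 * p[(2, 2)]
    b1 = s1 * p[(1, 'E')]
    b2 = s2 * p[(2, 'E')]
    det = a11 * a22 - a12 * a21
    x1 = (b1 * a22 - a12 * b2) / det     # Cramer's rule
    x2 = (a11 * b2 - a21 * b1) / det
    return x1, x2


# Default user transition probabilities from Table 2.
p_default = {(1, 1): 0.4, (1, 2): 0.3, (1, 'E'): 0.3,
             (2, 1): 0.4, (2, 2): 0.3, (2, 'E'): 0.3}

# Placeholder service availabilities, not values computed from Figure 9(a).
bound_from_1P, bound_from_2P = session_success_bound(0.9999, 0.9998, p_default)
always_up = session_success_bound(1.0, 1.0, p_default)
```

When both services are always up, the system returns 1 for both starting states, which is a convenient check on the coefficients.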

Table 2. Default parameters used in the example

Parameter   Default value   Description
1/λ1        168 hours       mean time to failure for service 1 process
1/λ2        168 hours       mean time to failure for service 2 process
1/λh        200 hours       mean time to failure for node hardware
1/µ1        30 minutes      mean time to repair service 1 process failure
1/µ2        30 minutes      mean time to repair service 2 process failure
1/µh        1 hour          mean time to repair node hardware failure
T1          5 minutes       user think time in state 1T
T2          5 minutes       user think time in state 2T
p11         0.4             transition probability from 1T to 1P
p12         0.3             transition probability from 1T to 2P
p1E         0.3             transition probability from 1T to E
p21         0.4             transition probability from 2T to 1P
p22         0.3             transition probability from 2T to 2P
p2E         0.3             transition probability from 2T to E

3.2.1. User-perceived service unreliability vs. MTTF and MTTR

We first study the impact of system MTTF and MTTR on user-perceived service reliability. In Figure 10, the curve ‘sys-unavail’ is the system unavailability, which is defined at the beginning of Section 3.2. Curves ‘su-2min’, ‘su-30min’ and ‘su-1hour’ are the user-perceived service unreliability with user think times of 2 minutes, 30 minutes and 1 hour, respectively. Curve ‘su-bound’ is the upper bound of user-perceived service unreliability.

Figure 10(a) shows how the user-perceived service unreliability changes with the mean time to failure (MTTF) parameters 1/λ1, 1/λ2 and 1/λh. Their default values in Table 2 are multiplied by a factor γ ranging from 0.5 to 3. The x-axis shows the values of γ, and the y-axis shows the user-perceived service unreliability. As shown in the figure, the user-perceived service unreliability decreases as the values of the MTTF parameters increase.

Similar to Figure 10(a), Figure 10(b) shows how the user-perceived service unreliability changes with the mean time to repair (MTTR) parameters 1/µ1, 1/µ2 and 1/µh. Their default values in Table 2 are multiplied by a factor γ ranging from 0.3 to 3. As the values of the MTTR parameters increase, the user-perceived service unreliability also increases.

In Figure 10(c) both MTTF and MTTR parameters are multiplied by a factor

γ ranging from 0.1 to 10, therefore the steady-state system unavailability, as well as

the service unavailability, will not change with γ. However as seen from the figure,

longer MTTF and MTTR parameters actually improve the user-perceived service

reliability. This can be explained by the fact that the system with longer MTTF

has higher reliability than the one with shorter MTTF, so it is more likely to stay

operational during the whole user session and serve all requests successfully.
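The effect can be reproduced on a much simpler model than the cluster. The sketch below assumes a two-state system (up/down) that starts in steady state and a session with a fixed number of zero-work requests separated by a constant think time. These are simplifying assumptions made here, and the function name and parameter values are hypothetical. Scaling MTTF and MTTR by the same γ leaves the steady-state availability unchanged but slows the decay term, so the system is more likely to stay up between consecutive requests.

```python
import math


def session_unreliability(mttf, mttr, think, requests, gamma=1.0):
    """Probability that at least one of `requests` arrivals finds the system down."""
    a = 1.0 / (gamma * mttf)             # failure rate after scaling
    b = 1.0 / (gamma * mttr)             # repair rate after scaling
    avail = b / (a + b)                  # steady-state availability, independent of gamma
    # P(up at t + think | up at t) for the two-state Markov model
    stay_up = avail + (1.0 - avail) * math.exp(-(a + b) * think)
    return 1.0 - avail * stay_up ** (requests - 1)


# Hypothetical values: MTTF 168 h, MTTR 0.5 h, 5-minute think time, 3 requests.
su_short = session_unreliability(168.0, 0.5, 5.0 / 60.0, 3, gamma=0.1)
su_base = session_unreliability(168.0, 0.5, 5.0 / 60.0, 3, gamma=1.0)
su_long = session_unreliability(168.0, 0.5, 5.0 / 60.0, 3, gamma=10.0)
```

In this toy setting the unreliability decreases as γ grows while the availability stays fixed, which mirrors the qualitative trend described for Figure 10(c).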


Fig. 10. Service unreliability under different MTTFs and MTTRs: (a) SU vs. MTTF; (b) SU vs. MTTR; (c) SU vs. MTTR and MTTF. In each panel the x-axis is γ and the y-axis is service unreliability; the curves are sys-unavail, su-2min, su-30min, su-1hour and su-bound.

3.2.2. User-perceived service unreliability vs. user think time

Figure 11 shows the user-perceived service unreliability under different user think times S, where S is varied from 1 minute to 20 hours. The x-axis is the user think time, and the y-axis is the user-perceived service unreliability. As shown in the figure, the user-perceived service unreliability increases with the user think time. When S goes to infinity, the user-perceived service unreliability reaches its upper bound. This can be explained by the concept of joint availability Aj(t, x), which is defined as the probability that the system is operational at time t and again at time t + x. The user-perceived service reliability is similar to the joint availability in that it requires the corresponding services to be up at the times when the requests are sent, and the user think time between two requests is analogous to the variable x in Aj(t, x). For Markov models, Aj(t, x) is monotonically decreasing with x. Therefore, when the user think time is infinite, the user-perceived service unreliability reaches its upper bound.

Fig. 11. SU vs. user think time (x-axis: user thinking time in hours; y-axis: service unreliability; curves: sys-unavail, s1-unavail, s2-unavail, service unrel, su-bound)

Fig. 12. SU vs. session end probability (x-axis: pE; y-axis: service unreliability; curves: su-2min, su-30min, su-1hour, su-bound)
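As a hedged illustration of the joint availability argument, consider the two-state system of Section 2.5 (failure rate α, repair rate β) observed in steady state. This is a simplification assumed here and is not the cluster model of Figure 9(a). Writing A for the steady-state availability, the joint availability is

```latex
\[
A_j(\infty, x) = A \left[ A + (1 - A)\, e^{-(\alpha+\beta)x} \right],
\qquad A = \frac{\beta}{\alpha + \beta}
\]
```

The bracketed factor is the probability that the system is up at time t + x given that it is up at time t. It decreases monotonically from 1 at x = 0 to A as x grows, so the joint availability falls from A to A², the value obtained when the two requests see the system independently.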

3.2.3. User-perceived service unreliability vs. session end prob.

Figure 12 shows the impact of session end probability on the user-perceived service unreliability. In this figure we set p1E = p2E and vary them from 0.05 to 0.95, adjusting p11, p12, p21 and p22 accordingly such that $\sum_{j \in \{1,2,E\}} p_{ij} = 1$ for i = 1, 2. As shown in the figure, the user-perceived service unreliability decreases with the session end probability. This is because the number of requests is reduced as the session end probability increases, and the probability that all requests are successful increases when the number of requests decreases. By comparing curves ‘su-2min’, ‘su-30min’ and ‘su-1hour’, it can also be seen that the session end probability has less impact on user sessions with shorter user think times. This can be explained by the fact that shorter user think times result in shorter session durations, during which the system is more likely to stay up and serve all the requests successfully.

4. Summary and Conclusions

In this paper we have evaluated user-perceived service reliability for systems with

degraded and/or partial failure states, taking into consideration both the user task

requirements and the user behavior. Using the Erlang approximation of the deter-

ministic distribution, we built service models and from them derived the completion

probability for user tasks with deterministic work requirement under degradable

system states. This approach is extended to the modeling of user-perceived service

reliability with multiple tasks in the user session, and the dependencies of different

user tasks are addressed. The user model is further extended to allow more user



behavior such as user thinking and user retries.

We applied the user-perceived service reliability modeling approach to an ex-

ample fault tolerant cluster hosting two services. Various factors from both system

and user sides, including system mean time to failure (MTTF), mean time to repair

(MTTR), user think time and user session end probability, are identified and an-

alyzed for their impact on user-perceived service reliability. The results for service

reliability under these factors are compared to the system availability and service

reliability lower bound. The numerical results show that while keeping the system

availability constant, longer MTTF and MTTR can increase the user-perceived ser-

vice reliability in the context of the example. On the other hand longer user think

time can decrease the service reliability.

References

1. R. E. Barlow and F. Proschan. Mathematical Theory of Reliability. New York: John Wiley and Sons, 1965.

2. A. Birolini. Quality and Reliability of Technical Systems: Theory-Practice-Management. Springer-Verlag, Berlin, 1998.

3. A. Bobbio and M. Telek. The task completion time in degradable systems. In B. R. Haverkort, R. Marie, G. Rubino, and K. S. Trivedi, editors, Performability Modelling: Techniques and Tools, pages 139–161. Wiley, 2001.

4. A. Bobbio and K. S. Trivedi. Computation of the distribution of the completion time when the work requirement is a ph random variable. Stochastic Models, 6(1):133–150, 1990.

5. M. Calzarossa and D. Ferrari. A sensitivity study of the clustering approach to workload modeling. Performance Evaluation, 6:25–33, 1986.

6. M. Calzarossa, R. A. Marie, and K. S. Trivedi. System performance with user behavior graphs. Performance Evaluation, 11(3):155–164, 1990.

7. P. F. Chimento and K. S. Trivedi. The completion time of programs on processors subject to failure and repair. IEEE Transactions on Computers, 42(10):1184–1194, Oct 1993.

8. H. Choi, W. Wang, and K. S. Trivedi. Analysis of conditional MTTF of fault-tolerant systems. Microelectronics Reliability, 38(3):393–401, 1998.

9. L. K. Chu, S. S. Chu, and D. Sculli. Service availability of a radio access telecommunications network. Quality & Reliability Engineering International, 14:365–370, 1998.

10. T. Dahlberg and D. P. Agrawal. Task based reliability for large systems: A hierarchical modeling approach. In The 22nd International Conference on Parallel Processing, Volume III Algorithms & Applications, pages 284–287, Chicago, IL, August 16-20, 1993.

11. C. R. Das and J. Kim. A unified task-based dependability model for hypercube computers. IEEE Transactions on Parallel Distributed Systems, 3(3):312–324, 1992.

12. S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yagnik. Performance and reliability evaluation of passive replication schemes in application level fault tolerance. In Fault Tolerant Computing Symposium (FTCS 1999), pages 322–329, June 1999.

13. K. C. Glossbrenner. Availability and reliability of switched services. IEEE Communications Magazine, pages 28–32, June 1993.

14. http://www.ibm.com/websphere/.

15. M. Kaaniche, K. Kanoun, and M. Martinello. A user-perceived availability evaluation of a web based travel agency. In International Conference on Dependable Systems and Networks (DSN’03), pages 709–718, San Francisco, California, June, 2003.

16. K. C. Kapur and L. R. Lamberson. Reliability in Engineering Design. John Wiley & Sons, New York, 1977.

17. R. Keralapura, C.-N. Chuah, G. Iannaccone, and S. Bhattacharyya. Service availability: a new approach to characterize ip backbone topologies. In The Twelfth IEEE International Workshop on Quality of Service, pages 232–241, June 2004.

18. V. Kulkarni, V. Nicola, R. Smith, and K. S. Trivedi. Numerical evaluation of performability measures and job completion time in a repairable fault-tolerant system. In Sixteenth International Symposium on Fault-Tolerant Computing, Vienna, Austria, July 1986.

19. V. Kulkarni, V. Nicola, and K. S. Trivedi. The completion time of a job on multi-mode systems. Advances in Applied Probability, 19(4):932–954, December 1987.

20. K. W. Lee. Stochastic models for random-request availability. IEEE Transactions on Reliability, 49(1):80–84, March 2000.

21. E. E. Lewis. Introduction to Reliability Engineering. John Wiley & Sons, New York, 1987.

22. C. Lindemann. An improved numerical algorithm for calculating steady-state solutions of deterministic and stochastic petri net models. Performance Evaluation, 8(1):79–95, 1993.

23. M. Malhotra and A. L. Reibman. Selecting and implementing phase approximations for semi-Markov models. Stochastic Models, 9(4):473–506, 1993.

24. M. A. Marsan and G. Chiola. On petri nets with deterministic and exponentially distributed firing times. In Advances in Petri Nets 1987, Lecture Notes in Computer Science, volume 266, pages 132–145. Springer Verlag, London, 1986.

25. J. F. Meyer. On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 22:720–731, Aug 1980.

26. M. F. Neuts. Renewal process of phase type. Naval Research Logistics Quarterly, 25(3):445–454, 1978.

27. E. De Souza E Silva and H. R. Gail. Calculating availability and performability measures of repairable computer systems using randomization. Journal of the Association for Computing Machinery, 36(1):171–193, January 1989.

28. E. De Souza E Silva, H. R. Gail, and R. V. Campos. Calculating transient distributions of cumulative reward. In Performance’95 and 1995 ACM SIGMETRICS Conference, pages 231–240, 1995.

29. M. Tortorella. Service reliability theory and engineering, i: Foundations. Quality Technology and Quantitative Management, 2(1):1–16, 2005.

30. M. R. Wilson. The quantitative impact of survivable network architectures on service availability. IEEE Communications Magazine, pages 122–126, May 1998.

31. W. Xie, H. Sun, Y. Cao, and K. S. Trivedi. Modeling of user perceived webserver availability. In IEEE International Conference on Communications (ICC 2003), Anchorage, Alaska, May 11-15, 2003.
