
Software Reliability

... Yes sometimes the system fails ...

© G. Antoniol 2012 LOG6305 2

Motivations

◆  Software is an essential component of many safety-critical systems
◆  These systems depend on the reliable operation of software components
◆  How can we decide whether the software is ready to ship after system test?
◆  How can we decide whether an acquired or developed component is acceptable?

© G. Antoniol 2012 LOG6305 3

What is Reliability?

◆  IEEE-Std-729-1991: “Software reliability is defined as the probability of failure-free operation for a specified period of time in a specified environment”

◆  ISO9126: “Reliability is the capability of the software product to maintain a specified level of performance when used under specified conditions”

© G. Antoniol 2012 LOG6305 4

Reliability Informal

◆  Probability of failure-free operation for a specified time in a specified environment for a given purpose

◆  This means quite different things depending on the system and the users of that system

◆  Informally, reliability is a measure of how well system users think it provides the services they require

◆  Reliability is a measure of how well the software provides the services expected by the customer.

◆  Quantification: Number of failures, severity

© G. Antoniol 2012 LOG6305 5

Software Reliability

◆  Cannot be defined objectively
•  Reliability measurements quoted out of context are not meaningful

◆  Requires an operational profile for its definition
•  The operational profile defines the expected pattern of software usage

◆  Must consider fault consequences
•  Not all faults are equally serious; a system is perceived as more unreliable if it has more serious faults

© G. Antoniol 2012 LOG6305 6

Failure vs. Fault

◆  A failure corresponds to unexpected run-time behavior observed (by a user)

◆  A fault is a static software characteristic which causes a failure to occur

◆  Faults need not necessarily cause failures.
◆  If a user does not notice a failure, is it a failure?
➫  Remember: most users don't know the software specification

© G. Antoniol 2012 LOG6305 7

Input/Output Mapping

[Figure: a program maps the input set I to the output set O; the subset Ie of inputs causing erroneous outputs maps to the subset Oe of erroneous outputs.]

© G. Antoniol 2012 LOG6305 8

How Can We Measure and Model Reliability?

◆  The cumulative number of failures by time t
◆  Failure intensity: the number of failures per time unit at time t
◆  Mean Time To Failure (MTTF): the average value of the inter-failure time interval
◆  We do not know for sure when the software is going to fail next => model failure behavior as a random process

© G. Antoniol 2012 LOG6305 9

Modeling Cumulative # Failures

[Figure: timeline with observed failures 1-4 marked at increasing times, and a question mark for the unknown time of the next failure.]

© G. Antoniol 2012 LOG6305 10

Reliability Growth

◆  Reliability is improved when software faults which occur in the most frequently used parts of the software are removed

◆  Removing x% of software faults will not necessarily lead to an x% reliability improvement

◆  In a study, removing 60% of software defects actually led to a 3% reliability improvement

◆  Removing faults with serious consequences is the most important objective

© G. Antoniol 2012 LOG6305 11

Problems with Quantification?

◆  US standard for equipment on civil aircraft:

➫  [A critical failure must be] so unlikely that [it] need not be considered ever to occur, unless engineering judgment would require [its] consideration.

◆  Typical requirement for critical flight systems: 10^-9 probability of failure per flying hour, based on a mean flight duration of 10 hours.

◆  Problem 1: such a level of reliability cannot be demonstrated – would require too much testing time

◆  Problem 2: Reliability is relative to the usage of the system

© G. Antoniol 2012 LOG6305 12

Usage Patterns

[Figure: the set of possible inputs, with the erroneous inputs as a subset; users 1, 2, and 3 each exercise different regions of the input space and so encounter different erroneous inputs.]

© G. Antoniol 2012 LOG6305 13

Reliability and Quality

◆  Reliability is usually more important than other quality aspects

◆  Unreliable software isn't used
◆  It is hard to improve the quality of unreliable systems
◆  Software failure costs often far exceed system costs
◆  Costs of data loss are very high

© G. Antoniol 2012 LOG6305 14

Reliability Metrics

◆  Probability of failure on demand (POFOD)
•  A measure of the likelihood that the system will fail when a service request is made
•  POFOD = 0.001 means 1 out of 1000 service requests results in failure
•  Relevant for safety-critical or non-stop systems

◆  Rate of occurrence of failures (ROCOF)
•  The frequency of occurrence of unexpected behavior
•  ROCOF of 0.02 means 2 failures are likely in each 100 operational time units
•  Relevant for operating systems, transaction processing systems

© G. Antoniol 2012 LOG6305 15

Reliability Metrics

◆  Mean time to failure (MTTF)
•  A measure of the time between observed failures
•  MTTF of 500 means that the average time between failures is 500 time units
•  Relevant for systems with long transactions, e.g. CAD systems

◆  Availability (AVAIL)
•  A measure of how likely the system is to be available for use; takes repair/restart time into account
•  Availability of 0.998 means the software is available for 998 out of every 1000 time units
•  Relevant for continuously running systems, e.g. telephone switching systems

© G. Antoniol 2012 LOG6305 16

Reliability Measurement

◆  Measure the number of system failures for a given number of system inputs
•  Used to compute POFOD

◆  Measure the time (or number of transactions) between system failures
•  Used to compute ROCOF and MTTF

◆  Measure the time to restart after failure
•  Used to compute AVAIL
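As a minimal sketch of how these measurements turn into the metrics of the previous slides, assuming a hypothetical failure log (every number below is illustrative, not from the slides):

```python
# Minimal sketch: POFOD, ROCOF, MTTF and AVAIL from hypothetical
# measurement data (all values are illustrative).

failures = 3                              # observed system failures
requests = 2500                           # service requests during observation
failure_times = [120.0, 480.0, 900.0]     # failure times (time units)
observation_time = 1000.0                 # total observation period
restart_times = [4.0, 6.0, 5.0]           # restart time after each failure

pofod = failures / requests               # failures per service request
rocof = failures / observation_time       # failures per time unit

# MTTF: average interval between successive failures
intervals = [t2 - t1 for t1, t2 in zip([0.0] + failure_times, failure_times)]
mttf = sum(intervals) / len(intervals)

# AVAIL: fraction of time the system is up (down only while restarting)
avail = (observation_time - sum(restart_times)) / observation_time

print(f"POFOD = {pofod:.4f}, ROCOF = {rocof:.4f}")
print(f"MTTF  = {mttf:.1f} time units, AVAIL = {avail:.4f}")
```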

© G. Antoniol 2012 LOG6305 17

Time Units

◆  Time units in reliability measurement must be carefully selected; they are not the same for all systems
◆  Raw execution time (for non-stop systems)
◆  Calendar time (for systems with a regular usage pattern, e.g. systems that are always run once per day)
◆  Number of transactions (for systems which are used on demand)

© G. Antoniol 2012 LOG6305 18

Failure Consequences

◆  Reliability measurements do NOT take the consequences of failure into account

◆  Transient faults may have no real consequences but other faults may cause data loss or corruption and loss of system service

◆  May be necessary to identify different failure classes and use different measurements for each of these

© G. Antoniol 2012 LOG6305 19

Specification

◆  Reliability requirements are only rarely expressed in a quantitative, verifiable way.

◆  To verify reliability metrics, an operational profile must be specified as part of the test plan.

◆  Reliability is dynamic - reliability specifications related to the source code are meaningless.
•  e.g. "No more than N faults per 1000 lines"
•  This is only useful for a post-delivery process analysis.

© G. Antoniol 2012 LOG6305 20

A Classification

Failure class    Description
Transient        Occurs only with certain inputs
Permanent        Occurs with all inputs
Recoverable      System can recover without operator intervention
Unrecoverable    Operator intervention needed to recover from failure
Non-corrupting   Failure does not corrupt system state or data
Corrupting       Failure corrupts system state or data

© G. Antoniol 2012 LOG6305 21

Specification Validation

◆  It is impossible to empirically validate very high reliability specifications

◆  No database corruptions means POFOD of less than 1 in 200 million

◆  If a transaction takes 1 second, then simulating one day’s transactions takes 3.5 days

◆  It would take longer than the system’s lifetime to test it for reliability

© G. Antoniol 2012 LOG6305 22

Compromises

◆  Because of the very high costs of reliability achievement, it may be more cost effective to accept unreliability (and pay for the failure costs)
◆  Social, business and political factors:
➫  A reputation for unreliable products may lose future business
◆  Depends on the system type - for business systems in particular, modest reliability may be adequate

© G. Antoniol 2012 LOG6305 23

Costs

[Figure: cost of achieving reliability rises steeply as the required level moves from low through medium, high, and very high to ultra-high.]

© G. Antoniol 2012 LOG6305 24

Statistical Testing

◆  Testing software for reliability rather than fault detection

◆  Test data selection should follow the predicted usage profile for the software

◆  Measuring the number of errors allows the reliability of the software to be predicted

◆  An acceptable level of reliability should be specified and the software tested and amended until that level of reliability is reached

© G. Antoniol 2012 LOG6305 25

Testing Procedure

◆  Determine the operational profile of the software
◆  Generate a set of test data corresponding to this profile
◆  Apply the tests, measuring the amount of execution time between each failure
◆  After a statistically valid number of tests have been executed, reliability can be measured

© G. Antoniol 2012 LOG6305 26

Difficulties

◆  Uncertainty in the operational profile
•  A particular problem for new systems with no operational history; less of a problem for replacement systems

◆  High costs of generating the operational profile
•  Costs depend heavily on what usage information is collected by the organisation which requires the profile

◆  Statistical uncertainty when high reliability is specified
•  Difficult to estimate the level of confidence in the operational profile
•  The usage pattern of the software may change with time

© G. Antoniol 2012 LOG6305 27

Reliability Growth Model

◆  Growth model is a mathematical model of the system reliability change as it is tested and faults are removed

◆  Used as a means of reliability prediction by extrapolating from current data

◆  Depends on the use of statistical testing to measure the reliability of a system version

© G. Antoniol 2012 LOG6305 28

Reliability Models

◆  Many different reliability growth models have been proposed
◆  There is no universally applicable growth model
◆  Reliability should be measured and the observed data fitted to several models
◆  The best-fit model should be used for reliability prediction

© G. Antoniol 2012 LOG6305 29

Models and Prediction

[Figure: measured reliability points over time, a fitted reliability model curve, the required reliability level, and the estimated time of reliability achievement where the fitted curve reaches that level.]

© G. Antoniol 2012 LOG6305 30

Hardware Reliability

◆  Well-developed reliability theories exist
◆  A hardware component may fail due to a design error, poor fabrication quality, momentary overload, deterioration, etc.
◆  The random nature of system failure enables the estimation of system reliability using probability theory
◆  Can it be adapted for software? There is no physical deterioration, most programs always provide the same answer to the same input, the "fabrication" of code is trivial, etc.

© G. Antoniol 2012 LOG6305 31

Software Reliability Engineering

◆  No method of development can guarantee totally reliable software => an important field in practice!
◆  A set of statistical modeling techniques
◆  Enables the achieved reliability to be assessed or predicted, quantitatively and objectively
◆  Based on the observation of system failures during system testing and operational use
◆  Uses general reliability theory, but is much more than that

© G. Antoniol 2012 LOG6305 32

Applications

◆  How reliable is the program/component now?
◆  Based on the current reliability of a purchased/reused component: can we accept it or should we reject it?
◆  Based on the current reliability of the system: can we stop testing and start shipping?
◆  How reliable will the system be if we continue testing for some time?
◆  When will the reliability objective be achieved?
◆  How many failures will occur in the field (and when) - to help plan resources for customer support?

© G. Antoniol 2012 LOG6305 33

Random Variable

◆  A random variable X is a function defined over a sample space S
◆  X associates a real number x with each possible outcome e in S: X(e) = x
◆  If S is finite or countably infinite, we say that X is discrete
◆  If X can take any value in a given interval, we say that X is a continuous random variable

© G. Antoniol 2012 LOG6305 34

Discrete Random Variable

◆  If S is countable, X(e) is countable
◆  Event probability: $P\{e : X(e) = x,\ e \in S\}$
◆  Probability density function (pdf): $p(x_i) \ge 0$, $\sum_i p(x_i) = 1$
◆  Cumulative distribution function (cdf): $P\{X \le x\} = \sum_{x_i \le x} p(x_i)$

© G. Antoniol 2012 LOG6305 35

Mean and Variance

$E[X] = \sum_x x\,p(x)$

$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$

© G. Antoniol 2012 LOG6305 36

Continuous Random Variables

The pdf of a continuous random variable X is a function $f(x)$.

Notice: $\forall x: f(x) \ge 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

Cumulative distribution: $F(x) = \int_{-\infty}^{x} f(y)\,dy$

© G. Antoniol 2012 LOG6305 37

Mean and Variance

$E[X] = \int_{-\infty}^{+\infty} x\,f(x)\,dx$

$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$

© G. Antoniol 2012 LOG6305 38

Conditional Probabilities

We are interested in a subset B of the universe S; in other words, the sample space is essentially B. Let P(B) > 0 and let A be a subset of S. We define the probability of A relative to the new sample space B as:

$P(A \mid B) = P(A \cap B) / P(B)$

© G. Antoniol 2012 LOG6305 39

Conditional Distributions

Discrete:

$p(x_i \mid x_j) = P(X = x_i \mid X = x_j) = \dfrac{P(X = x_i,\ X = x_j)}{P(X = x_j)}$

Continuous:

$f(x \mid y) = \dfrac{f(x, y)}{f(y)}$

Notice: $f(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy$

© G. Antoniol 2012 LOG6305 40

Stochastic Processes

A stochastic process is a collection of random variables:

$X_t$ or $X(t)$

where t ranges over a suitable index set. If t is a discrete time unit, the index set is $T = \{0, 1, 2, 3, \ldots\}$; if t is a point in a continuous time interval, the index set is $T = [0, \infty)$.

© G. Antoniol 2012 LOG6305 41

Reliability

The random variable of interest is the time to failure T. We focus on the probability that the time to failure falls in some interval:

$T \in (t, t + \Delta t)$

In other words:

$P(t \le T \le t + \Delta t) = f(t)\,\Delta t = F(t + \Delta t) - F(t)$

The pdf f is also called the failure density function.

© G. Antoniol 2012 LOG6305 42

Reliability Function R(t)

Since T is defined only for positive values:

$F(t) = P(0 \le T \le t) = \int_0^t f(y)\,dy$

is the probability of observing a failure before t. The probability of success at time t, R(t), i.e. that the failure time is larger than t, is:

$R(t) = P(T > t) = 1 - F(t) = \int_t^{+\infty} f(y)\,dy$

© G. Antoniol 2012 LOG6305 43

Failure Rate

The failure rate is the probability that a failure occurs per unit time in an interval, given that a failure has not occurred before t:

$\dfrac{P(t \le T < t + \Delta t \mid T > t)}{\Delta t} = \dfrac{P(t \le T < t + \Delta t)}{\Delta t\; P(T > t)} = \dfrac{F(t + \Delta t) - F(t)}{\Delta t\; R(t)}$

© G. Antoniol 2012 LOG6305 44

Hazard Rate

The hazard rate is the limit of the failure rate as the interval approaches zero:

$z(t) = \lim_{\Delta t \to 0} \dfrac{F(t + \Delta t) - F(t)}{\Delta t\; R(t)} = \dfrac{f(t)}{R(t)}$

Given that the system survived up to t, z(t) is the instantaneous failure rate, i.e. z(t)dt is the probability that a system of age t will fail in the small interval (t, t + dt).

© G. Antoniol 2012 LOG6305 45

R(t) and z(t)

By definition:

$z(t) = \dfrac{dF(t)}{dt}\,\dfrac{1}{R(t)}$

Since $R(t) = 1 - F(t)$, which means

$\dfrac{dR(t)}{dt} = -\dfrac{dF(t)}{dt}$

it can be derived that:

$\ln R(t) = -\int_0^t z(x)\,dx + c$

Since the system is usually good at t = 0, R(0) = 1, so:

$R(t) = \exp\left[-\int_0^t z(x)\,dx\right]$

© G. Antoniol 2012 LOG6305 46

R(t) z(t) f(t)

$R(t) = \exp\left[-\int_0^t z(x)\,dx\right]$

$f(t) = z(t)\,\exp\left[-\int_0^t z(x)\,dx\right]$

© G. Antoniol 2012 LOG6305 47

Hazard Rate Over Time

Typical hazard rate: the bathtub curve

Region I: early failure (debugging phase)
Region II: random failure (normal operating phase)
Region III: wear-out failure (does not apply to software)

[Figure: bathtub curve of hazard rate z(t) over time - high and decreasing in Region I, flat in Region II, rising in Region III.]

© G. Antoniol 2012 LOG6305 48

Example

Constant hazard rate:

$z(t) = \lambda \qquad f(t) = \lambda\, e^{-\lambda t} \qquad R(t) = e^{-\lambda t}$

Linearly increasing hazard rate:

$z(t) = Kt \qquad f(t) = Kt\, e^{-Kt^2/2} \qquad R(t) = e^{-Kt^2/2}$
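These closed forms can be checked against the general relation R(t) = exp[-∫ z(x) dx] from the previous slides; a minimal sketch by numerical integration, with illustrative values for λ and K:

```python
import math

def reliability(z, t, steps=10000):
    """R(t) = exp(-integral_0^t z(x) dx), via the trapezoidal rule."""
    h = t / steps
    xs = [i * h for i in range(steps + 1)]
    integral = h * (sum(z(x) for x in xs) - 0.5 * (z(xs[0]) + z(xs[-1])))
    return math.exp(-integral)

lam, K, t = 0.5, 0.3, 2.0  # illustrative parameters and time point

# Constant hazard rate z(t) = lambda  ->  R(t) = exp(-lambda t)
print(reliability(lambda x: lam, t), math.exp(-lam * t))

# Linearly increasing hazard z(t) = K t  ->  R(t) = exp(-K t^2 / 2)
print(reliability(lambda x: K * x, t), math.exp(-K * t ** 2 / 2))
```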

© G. Antoniol 2012 LOG6305 49

Expected Life

It is often referred to as the Mean Time To Failure (MTTF):

$\mathrm{MTTF} = E[T] = \int_0^\infty t\,f(t)\,dt$

In terms of reliability:

$\mathrm{MTTF} = \int_0^\infty R(t)\,dt$

© G. Antoniol 2012 LOG6305 50

MTTF Examples

Constant hazard rate:

$\mathrm{MTTF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \dfrac{1}{\lambda}$

Linearly increasing hazard rate:

$\mathrm{MTTF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-Kt^2/2}\,dt = \dfrac{\Gamma(1/2)}{\sqrt{2K}} = \sqrt{\dfrac{\pi}{2K}}$
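And a quick numerical check of the two MTTF results, again with illustrative parameter values (the integral is truncated, which is harmless here because R(t) decays fast):

```python
import math

def mttf(R, upper=200.0, steps=200000):
    """MTTF = integral_0^inf R(t) dt, truncated at `upper` (trapezoidal)."""
    h = upper / steps
    vals = [R(i * h) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

lam, K = 0.5, 0.3  # illustrative parameters

# Constant hazard rate: MTTF = 1/lambda
print(mttf(lambda t: math.exp(-lam * t)), 1 / lam)

# Linearly increasing hazard rate: MTTF = sqrt(pi / (2K))
print(mttf(lambda t: math.exp(-K * t ** 2 / 2)), math.sqrt(math.pi / (2 * K)))
```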

© G. Antoniol 2012 LOG6305 51

Another Point of View

Instead of the random variable T, the time to the next failure, we can model the number of failures a system has experienced by time t.

Clearly the number of failures experienced by time t is a random variable; in fact it is a random process, since for each fixed time the number of experienced failures is a random variable.

© G. Antoniol 2012 LOG6305 52

Failure Intensity

Let M(t) be a random process representing the cumulative number of failures by time t; its mean value is:

$\mu(t) = E[M(t)]$

The failure intensity function is:

$\lambda(t) = \dfrac{d\mu(t)}{dt} = \dfrac{d}{dt}\,E[M(t)]$

© G. Antoniol 2012 LOG6305 53

Failure Intensity

◆  μ(t) = E[M(t)], where M(t) is the random process denoting the cumulative number of failures by time t; μ(t) is its mean value function
◆  λ(t): failure intensity, the average number of failures per unit of time at time t; the instantaneous rate of change of the expected number of failures μ(t)
◆  Reliability growth implies: dλ(t)/dt < 0

© G. Antoniol 2012 LOG6305 54

Graphical Representation

© G. Antoniol 2012 LOG6305 55

Failure Intensity Properties

The failure intensity represents the expected number of failures per unit of time.

The number of failures that will occur between t and t + Δt can be approximated by $\lambda(t)\,\Delta t$.

While μ(t) is a non-decreasing function (it is the cumulative number of failures at t), the failure intensity, its derivative, must decrease in order to obtain an increase in R(t).

© G. Antoniol 2012 LOG6305 56

Reliability Models Classification

◆  Time domain: wall clock versus execution time
◆  Category: the total number of failures that can be experienced in infinite time (finite or infinite)
◆  Type: the distribution of the number of failures experienced by time t (Poisson or Binomial)

© G. Antoniol 2012 LOG6305 57

Poisson Type

The distribution of the failures experienced by time t is a Poisson process. In other words, if

$t_0 = 0, t_1, t_2, \ldots, t_n = t$

is a partition of the interval [0, t], we have a Poisson process if the numbers of faults detected in the intervals, $f_i$ for $i = 1, 2, \ldots, n$, are independent Poisson variables with

$E[f_i] = \mu(t_i) - \mu(t_{i-1})$

© G. Antoniol 2012 LOG6305 58

Classification

If $\mu(t)$ is a linear function of time, we say that M(t) is a Homogeneous Poisson Process (HPP).

If $\mu(t)$ is non-linear, we say that M(t) is a Non-Homogeneous Poisson Process (NHPP).

© G. Antoniol 2012 LOG6305 59

Binomial Processes

◆  There is a fixed number of faults (N) in the software at the beginning of the period in which the software is observed
◆  When a fault is detected it is removed immediately
◆  If $T_a$ is the random variable denoting the time to failure of fault a, then the $T_a$, for a = 1, 2, ..., N, are independent and identically distributed random variables (as are those of the remaining faults!)

© G. Antoniol 2012 LOG6305 60

Software Reliability Engineering Process

[Figure: process flow - define reliability objective → model expected system usage → prepare test cases → execute tests and collect failure data; perform reliability certification and monitor reliability growth across the Requirements, Design/Code, and Test phases.]

© G. Antoniol 2012 LOG6305 61


Model Expected System Usage

◆  The definition of reliability assumes a specified environment
➫  To make statements about field reliability during system test, we must test in conditions that are "similar to field conditions"
◆  Model how users will employ the software: environment, type of installation, distribution of inputs over the input space
◆  According to the usage model, test cases are selected randomly
◆  One example of a usage model:
➫  Operational Profile (Musa): the set of system operations and their probabilities of occurrence

© G. Antoniol 2012 LOG6305 63

Operational Profile

[Figure: bar chart of an example operational profile - four operations with occurrence probabilities of 10%, 20%, 30%, and 40%.]

© G. Antoniol 2012 LOG6305 64

Operations

◆  A major system logical task of short duration, which returns control to the system when complete and whose processing is substantially different from other operations
➫  major: related to a functional requirement (similar to use cases)
➫  logical: not bound to software, hardware, users, machines
➫  short: 100s-1000s of operations per hour under normal load conditions
➫  different: likely to contain different faults
◆  In OO systems, an operation ~ a use case

© G. Antoniol 2012 LOG6305 65

Examples

◆  A command executed by a user
◆  A response to an input from an external system or device, e.g., processing a transaction, processing an event (alarm)
◆  Routine housekeeping, e.g., file backup, database cleanup

© G. Antoniol 2012 LOG6305 66

Develop Operational Profiles

◆  Identify who/what can initiate operations
➫  Users (of different types), external systems and devices, the system itself
◆  Create a list of operations for each operation initiator and consolidate the results
➫  Sources: requirements, draft user manuals, prototypes, previous program versions, discussions with expected users
➫  20 to several hundred operations are typical
◆  Determine the occurrence rates (per hour) of the individual operations
➫  existing field data, recorded field operations, simulation, estimates
◆  Derive occurrence probabilities

© G. Antoniol 2012 LOG6305 67

Example: Fone Follower (FF)

◆  Requirements
➫  Forward incoming phone calls (voice, fax) anywhere
➫  A subscriber calls FF and enters the phone numbers for where he plans to be as a function of time
➫  FF forwards incoming calls from the network (voice, fax) to the subscriber as per program. If there is no response to a voice call, the subscriber is paged (if the subscriber has a pager). If there is no response or no pager, voice calls are forwarded to voice mail
➫  Subscribers view the service as standard telephone service combined with FF
➫  FF uses a vendor-supplied operating system of unknown reliability

© G. Antoniol 2012 LOG6305 68

Initiators of Operations

◆  Event driven systems often have many external systems that can initiate operations in them

◆  Typically, the system under study may itself initiate administrative and maintenance operations

◆  FF:

➫ User types: subscribers, system administrators

➫ External system: Telephone network

➫ FF (audits, backups)

© G. Antoniol 2012 LOG6305 69

FF Operations List

◆  Subscriber
–  Phone number entry
◆  System Administrator
–  Add subscriber
–  Delete subscriber
◆  Network
–  Proc. voice call, no pager, answer
–  Proc. voice call, no pager, no answer
–  Proc. voice call, pager, answer
–  Proc. voice call, pager, ans. on page
–  Proc. voice call, pager, no ans. on page
–  Proc. fax call
◆  FF
–  Audit section of phone number database
–  Recover from hardware failure

© G. Antoniol 2012 LOG6305 70

FF Operational Profile

10.000 50 50

18.000 17.000 17.000 12.000 10.000 15.000

900 0,1

Phone number entry Add subscriber Delete subscriber Proc. voice call, no pager, answer Proc. voice call, no pager, no answer Proc. voice call, pager, answer Proc. voice call, pager, ans. on page Proc. voice call, pager, no ans. on page Proc. fax call Audit section of phone number database Recover from hardware failure

0,10 0,0005 0,0005

0,18 0,17 0,17 0,12 0,10 0,15

0,009 0,000001

Operation Occ.Rate (per hr) Occ.Prob.
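To drive statistical testing from such a profile, test cases are selected at random according to these probabilities; a minimal sketch over the table above (the sampling code is an illustration, not part of Musa's method):

```python
import random

# Fone Follower operational profile: operation -> occurrence probability
profile = {
    "Phone number entry": 0.10,
    "Add subscriber": 0.0005,
    "Delete subscriber": 0.0005,
    "Proc. voice call, no pager, answer": 0.18,
    "Proc. voice call, no pager, no answer": 0.17,
    "Proc. voice call, pager, answer": 0.17,
    "Proc. voice call, pager, ans. on page": 0.12,
    "Proc. voice call, pager, no ans. on page": 0.10,
    "Proc. fax call": 0.15,
    "Audit section of phone number database": 0.009,
    "Recover from hardware failure": 0.000001,
}

ops, probs = zip(*profile.items())

# Draw 1000 operations according to the profile; each drawn operation
# would then be instantiated as a concrete test case.
random.seed(42)
sample = random.choices(ops, weights=probs, k=1000)

for op in ops:
    print(f"{sample.count(op):4d}  {op}")
```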

© G. Antoniol 2012 LOG6305 71

Statistical Testing

◆  Testing based on operational profiles is referred to as statistical testing
◆  This form of testing has the advantage of testing most intensively the system functions that will be used the most
◆  Good for reliability estimation, but not very effective in terms of finding defects
◆  Hence we differentiate testing that aims at finding defects (verification) from testing whose purpose is reliability assessment (validation)
◆  There exists research on techniques that combine white-box testing and statistical testing ...

© G. Antoniol 2012 LOG6305 72

Software Reliability Engineering Process

[Figure: process flow - define reliability objective → model expected system usage → prepare test cases → execute tests and collect failure data; perform reliability certification and monitor reliability growth across the Requirements, Design/Code, and Test phases.]

© G. Antoniol 2012 LOG6305 73

Can We Accept a Component?

◆  Certification Testing: show that an (acquired or developed) component satisfies a given reliability objective (e.g., a failure intensity)

◆  Generate test data randomly according to usage model (e.g., operational profile)

◆  Record all failures (also multiple ones) but do not correct

© G. Antoniol 2012 LOG6305 74

Procedure

◆  Use a hypothesis-testing control chart to show that the reliability objective is/is not satisfied
➫  Reliability Demonstration Chart
»  Collect the times at which failures occurred
»  Normalize the data by multiplying by the failure intensity objective (using the same units!)
»  Plot each failure in the chart
»  Based on the region in which a failure falls, accept or reject the component

© G. Antoniol 2012 LOG6305 75

Reliability Demonstration Chart

(John Musa, Software Reliability Engineering, 1998)

Failure intensity objective: 4 failures/Mcalls

Fail. No.   Mcalls at failure   Normalized measure
1           0.1875              0.75
2           0.3125              1.25
3           1.25                5.0

=> accept

[Figure: demonstration chart - failure number (y axis, 0-16) versus normalized failure time (failure time x failure intensity objective, x axis, 0-14), divided into Reject (upper left), Continue (middle), and Accept (lower right) regions; the three example failures fall in Continue, Continue, and Accept.]

© G. Antoniol 2012 LOG6305 76

Creating Demonstration Charts

◆  Select the discrimination ratio γ (the acceptable factor of error in estimating failure intensity)
◆  Select the consumer risk α (the probability of accepting a system that does not satisfy the failure intensity objective)
◆  Select the supplier risk β (the probability of rejecting a system that does satisfy the failure intensity objective)
◆  Recommended defaults: γ = 2, α = 0.1, β = 0.1
➫  10% risk of wrongly accepting the component when the failure intensity is actually >= 2 x the failure intensity objective
➫  10% risk of wrongly rejecting the component when the failure intensity is actually <= 1/2 x the failure intensity objective

© G. Antoniol 2012 LOG6305 77

Boundary Lines

The next step is to construct the boundary lines (between Reject and Continue, and between Continue and Accept) according to the following formulae:

$T_N = \dfrac{A - n\,\ln\gamma}{1 - \gamma} \qquad\qquad T_N = \dfrac{B - n\,\ln\gamma}{1 - \gamma}$

$A = \ln\dfrac{1 - \beta}{\alpha} \qquad\qquad B = \ln\dfrac{\beta}{1 - \alpha}$

where $T_N$ = boundary for the normalized failure time (x axis) and n = failure number (y axis); the A line separates Reject from Continue and the B line separates Continue from Accept.
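A minimal sketch of the resulting decision rule, with the default γ = 2, α = β = 0.1 and the three failures from the example chart. Which boundary uses A and which uses B is inferred from the worked example (failure 3 at normalized time 5.0 must fall in the Accept region):

```python
import math

def boundaries(n, gamma=2.0, alpha=0.1, beta=0.1):
    """Normalized-time boundaries at failure number n."""
    A = math.log((1 - beta) / alpha)   # used for the Reject/Continue line
    B = math.log(beta / (1 - alpha))   # used for the Continue/Accept line
    reject = (A - n * math.log(gamma)) / (1 - gamma)
    accept = (B - n * math.log(gamma)) / (1 - gamma)
    return reject, accept

def decide(n, t_norm):
    """Classify the n-th failure occurring at normalized time t_norm."""
    reject, accept = boundaries(n)
    if t_norm >= accept:
        return "accept"
    if t_norm <= reject:
        return "reject"
    return "continue"

# Example from the chart: failure intensity objective = 4 failures/Mcall
objective = 4.0
for n, mcalls in enumerate([0.1875, 0.3125, 1.25], start=1):
    t_norm = mcalls * objective           # normalize the failure time
    print(n, t_norm, decide(n, t_norm))   # -> continue, continue, accept
```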

© G. Antoniol 2012 LOG6305 78

Software Reliability Engineering Process

[Figure: process flow - define reliability objective → model expected system usage → prepare test cases → execute tests and collect failure data; perform reliability certification and monitor reliability growth across the Requirements, Design/Code, and Test phases.]

© G. Antoniol 2012 LOG6305 79

Failure Data: Interfailure times

© G. Antoniol 2012 LOG6305 80

Can We Stop Testing? (Monitor: Reliability Growth)

[Figure: monitoring loop - execute tests and collect failure data → select appropriate reliability models → use the models to compute current reliability → if the reliability objective is not met, keep testing; if it is met, deliver. Inset plot: measured failure intensity with fitted reliability model curves declining toward the reliability objective, whose intersection gives the estimated time of reliability achievement.]

© G. Antoniol 2012 LOG6305 81

MUSA Model

◆  Developed by John Musa at AT&T Bell Labs
◆  One of the first models proposed and still widely applied
◆  It is based on execution time
◆  Execution time can be converted into calendar time (in a second part of the model)

© G. Antoniol 2012 LOG6305 82

Basic Assumptions

◆  The rate of fault detection is proportional to the current number of faults in the code
◆  The fault detection rate remains constant over the interval between fault occurrences
◆  A fault is corrected instantaneously, without introducing new faults into the code
◆  The failures, when the faults are detected, are independent
◆  The software is operated in a manner similar to that for which reliability predictions are made
◆  Every fault has the same chance of being encountered within a severity class as any other fault in that class

© G. Antoniol 2012 LOG6305 83

MUSA Assumptions

◆  Basic assumptions
◆  The cumulative number of failures by time t, M(t), follows a Poisson process with mean value function:

$\mu(t) = \beta_0\left[1 - e^{-\beta_1 t}\right], \qquad \beta_0, \beta_1 > 0$

◆  The execution times between failures are piecewise exponentially distributed, i.e. the hazard rate is constant between failures

© G. Antoniol 2012 LOG6305 84

MUSA Assumptions ...

◆  The resources available (testing personnel, debuggers, programmers) are constant over the segment of time for which the system is observed
◆  Fault-identification personnel can be fully utilized, and computer utilization is constant
◆  Fault-correction personnel utilization is established by the limitation of the fault queue length for any fault-correction person

© G. Antoniol 2012 LOG6305 85

MUSA Required Data

◆  To apply the MUSA model we need
➫  the actual times at which the software failed, or
➫  the elapsed times between failures

© G. Antoniol 2012 LOG6305 86

MUSA Model Form

Given the assumptions:

$\mu(t) = \beta_0\left[1 - e^{-\beta_1 t}\right], \qquad \beta_0, \beta_1 > 0$

$\lambda(t) = \beta_0\,\beta_1\, e^{-\beta_1 t}$

This is a finite-failures model; the total number of faults is $\beta_0$.

© G. Antoniol 2012 LOG6305 87

Maximum Likelihood Estimation

MLE chooses an estimator such that the observed sample is the most likely to occur among all possible samples. Assuming X has a pdf $f(x \mid \theta)$, where $\theta$ is unknown, the joint pdf of the sample $\{X_1, \ldots, X_n\}$ is called the likelihood function:

$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$

© G. Antoniol 2012 LOG6305 88

MUSA MLE

Suppose we observed n failures at times $t_1, t_2, \ldots, t_n$, and that since the last failure $t_n$ an additional time x has elapsed; the total time the system has been observed from the beginning is therefore $t_n + x$, with $x \ge 0$.

The likelihood function is:

$L(\beta_0, \beta_1) = \left[\prod_{i=1}^{n} \beta_0\,\beta_1\, e^{-\beta_1 t_i}\right] \exp\left[-\beta_0\left(1 - e^{-\beta_1 (t_n + x)}\right)\right]$

© G. Antoniol 2012 LOG6305 89

MUSA MLE Solution

The parameter estimates for the betas satisfy:

$\hat{\beta}_0 = \dfrac{n}{1 - \exp\left[-\hat{\beta}_1\,(t_n + x)\right]}$

$\dfrac{n}{\hat{\beta}_1} - \dfrac{n\,(t_n + x)}{\exp\left[\hat{\beta}_1\,(t_n + x)\right] - 1} - \sum_{i=1}^{n} t_i = 0$

Substitute the beta estimates into the model equations to obtain estimates of $R(t)$, $\mu(t)$, $\lambda(t)$, etc.

© G. Antoniol 2012 LOG6305 90

Example

Suppose the following failure times were observed: 10, 18, 32, 49, 64, 86, 105, 132, 167, 207 CPU hours, followed by an additional 15 CPU hours with no failures (so n = 10, total observation time 222 hours, and $\sum t_i = 870$):

$\hat{\beta}_0 = \dfrac{10}{1 - \exp(-222\,\hat{\beta}_1)}$

$\dfrac{10}{\hat{\beta}_1} - \dfrac{2220}{\exp(222\,\hat{\beta}_1) - 1} - 870 = 0$

with the result: $\hat{\beta}_0 = 13.6$ and $\hat{\beta}_1 = 0.006$.
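A minimal sketch that solves the two MLE equations numerically for this data set, using plain bisection so it needs only the standard library; it reproduces β̂1 ≈ 0.006 and β̂0 ≈ 13.6:

```python
import math

times = [10, 18, 32, 49, 64, 86, 105, 132, 167, 207]  # failure times (CPU h)
x = 15.0                            # failure-free time after the last failure
n, T, S = len(times), times[-1] + x, sum(times)  # n = 10, T = 222, S = 870

def g(b1):
    """MLE equation for beta_1: n/b1 - n T/(exp(b1 T) - 1) - sum(t_i) = 0."""
    return n / b1 - n * T / math.expm1(b1 * T) - S

# Bisection on a bracket where g changes sign
lo, hi = 1e-6, 1.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
b1 = 0.5 * (lo + hi)
b0 = n / (1 - math.exp(-b1 * T))

print(f"beta1 = {b1:.4f}, beta0 = {b0:.1f}")   # ~0.006 and ~13.6
```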

© G. Antoniol 2012 LOG6305 91

Parameters Interpretation

By making the correspondence:

$\beta_1 = B\,\phi \qquad \beta_0 = \nu_0$

we represent the constant per-fault hazard rate φ and the fault reduction factor B. B models the imperfect bug-fixing process; military standards suggest B = 0.955, close to 1 but not 1, meaning that roughly 5% of the time, when fixing a defect, we actually introduce a new one.

© G. Antoniol 2012 LOG6305 92

MUSA Simplest Solution

Consider the failure intensity as a function of the mean number of failures:

$\lambda(\mu) = \lambda_0\left(1 - \dfrac{\mu}{\nu_0}\right) \quad\text{or}\quad \lambda(\mu) = \beta_0\,\beta_1 - \beta_1\,\mu$

and find the least-squares solution of this line with the actual data. Once the parameters are available we can decide how much effort is still required. Let $\lambda_F$, $\lambda_P$ be the final and present failure intensities:

$\Delta\mu = \dfrac{\nu_0}{\lambda_0}\left(\lambda_P - \lambda_F\right)$

$\Delta\tau = \dfrac{\nu_0}{\lambda_0}\,\ln\dfrac{\lambda_P}{\lambda_F}$

© G. Antoniol 2012 LOG6305 93

Grouping of Data

◆  Choose the number of failures you wish to consider for each sub-interval, say k (suggested: k = 5)
◆  Divide the data set into p independent sub-sets, each containing exactly k failures (p is the smallest integer greater than or equal to the total number of failures divided by k)
◆  Estimate the failure intensity as the ratio between the number of failures in the interval and the interval duration

Example

For example, with k = 3: first interval 2 failures, second interval 3, third 3, ...

© G. Antoniol 2012 LOG6305 95

Equations

$r_l = \dfrac{k}{t_{lk} - t_{(l-1)k}}, \qquad l = 1, 2, \ldots, p-1$

$r_p = \dfrac{m_e - k\,(p-1)}{t_e - t_{k(p-1)}}$

where:
$m_e$ = total number of failures
$t_e$ = time when observations are stopped
$m_l = k\,(l-1)$ = cumulative number of failures up to the l-th interval

Build the pairs $(r_l, m_l)$ and fit them with the model $\lambda(\mu) = \lambda_0\left(1 - \dfrac{\mu}{\nu_0}\right)$

Line fitting

Remember: for a data set $(x_i, y_i)$ and an equation $y = \hat{b}x + \hat{a}$ we have:

$\hat{b} = \dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$

$\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
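Putting the grouping and the line fit together: a sketch that builds the (r_l, m_l) pairs from raw failure times (reusing the earlier example data purely as illustration), fits r = b m + a by least squares, and reads off λ0 = a and ν0 = -a/b for the basic model:

```python
# Sketch: grouped failure-intensity estimates plus the least-squares line
# fit for the basic model lambda(mu) = lambda0 (1 - mu/nu0).
# Failure times reused from the earlier example, purely as illustration.

times = [10, 18, 32, 49, 64, 86, 105, 132, 167, 207]
t_e, k = 222.0, 5                  # observation end time, failures per group
p = -(-len(times) // k)            # ceil(total failures / k)

r, m, prev = [], [], 0.0
for l in range(1, p + 1):
    last = min(l * k, len(times))               # last failure in this group
    end = t_e if l == p else times[last - 1]    # end time of the group
    r.append((last - (l - 1) * k) / (end - prev))  # intensity estimate r_l
    m.append((l - 1) * k)                       # cumulative failures m_l
    prev = end

# Least-squares fit of r = b*m + a over the p pairs
q = len(r)
sx, sy = sum(m), sum(r)
sxx, sxy = sum(v * v for v in m), sum(u * v for u, v in zip(m, r))
b = (q * sxy - sx * sy) / (q * sxx - sx * sx)
a = sy / q - b * sx / q

lambda0, nu0 = a, -a / b           # intercept and x-intercept of the line
print(f"lambda0 = {lambda0:.4f}, nu0 = {nu0:.1f}")
```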

© G. Antoniol 2012 LOG6305 97

Remark

The roles of $(r_l, m_l)$ are those of $(\lambda, \mu)$ in:

$\lambda(\mu) = \lambda_0\left(1 - \dfrac{\mu}{\nu_0}\right) \quad\text{or}\quad \lambda(\mu) = \beta_0\,\beta_1 - \beta_1\,\mu$

But this fit gives λ as a function of μ; that is to say, to predict the number of failures we need to invert the relation:

$\mu(\lambda) = -\dfrac{\nu_0}{\lambda_0}\,\lambda + \nu_0$

© G. Antoniol 2012 LOG6305 98

Geometric Model

$\lambda(\mu) = \beta_0\, e^{-\beta_1 \mu}$

$\mu(t) = \dfrac{1}{\beta_1}\,\ln\left(\beta_0\,\beta_1\, t + 1\right)$

$\lambda(t) = \dfrac{\beta_0}{\beta_0\,\beta_1\, t + 1}$

Fault detection follows a geometric series, with a constant rate between two faults. There are infinitely many faults in the system, and the time between two faults follows an exponential distribution.

© G. Antoniol 2012 LOG6305 99

Closed form from Geometric Model

Fitting the geometric model in logarithmic form, $\ln r_l = \ln\beta_0 - \beta_1 m_l$, least squares over the p pairs $(m_l, r_l)$ gives the closed-form estimates:

$\hat{\beta}_1 = \dfrac{\sum_{l=1}^{p} m_l \sum_{l=1}^{p} \ln r_l - p\sum_{l=1}^{p} m_l \ln r_l}{p\sum_{l=1}^{p} m_l^2 - \left(\sum_{l=1}^{p} m_l\right)^2}$

$\hat{\beta}_0 = \exp\left[\dfrac{1}{p}\left(\sum_{l=1}^{p} \ln r_l + \hat{\beta}_1 \sum_{l=1}^{p} m_l\right)\right]$

© G. Antoniol 2012 LOG6305 100

The Musa-Okumoto Model

◆  General assumptions:
➫  The system test reflects the intended usage of the system
➫  The failures are independent
➫  No new faults are introduced during corrections
◆  Model-specific assumptions:
➫  The cumulative number of failures by time t follows a Poisson process (NHPP)
➫  Failure intensity decreases exponentially with the expected number of failures experienced μ: $\lambda(\mu) = \lambda_0\, e^{-\theta\mu}$, where θ > 0 is the failure intensity decay parameter and λ0 is the initial failure intensity
◆  Data requirements:
➫  Actual times when failures occurred, or failure time intervals (execution time)

© G. Antoniol 2012 LOG6305 101

Poisson Processes

[Figure: failures marked as x along the time axis 1, 2, ..., k; m(t) counts them.]

$P_k(t) = \dfrac{(\alpha t)^k\, e^{-\alpha t}}{k!}$

© G. Antoniol 2012 LOG6305 102

The Musa-Okumoto Model (2)

Failure intensity versus failures experienced:

$\lambda(\mu) = \lambda_0\, e^{-\theta\mu}$

[Figure: plot of y = 10 exp(-0.02 x) - failure intensity λ (0 to 10) versus mean failures experienced μ (0 to 150), decaying exponentially.]

© G. Antoniol 2012 LOG6305 103

Deriving μ(τ) and λ(τ)

$\lambda(\mu) = \lambda_0\, e^{-\theta\mu}$

$\lambda(\tau) = \dfrac{d\mu(\tau)}{d\tau} = \lambda_0\, e^{-\theta\mu(\tau)} \quad\Rightarrow\quad \lambda(\tau) = \dfrac{\lambda_0}{\lambda_0\,\theta\,\tau + 1}$

$\mu(\tau) = \dfrac{1}{\theta}\,\ln\left(\lambda_0\,\theta\,\tau + 1\right)$

© G. Antoniol 2012 LOG6305 104

The Musa-Okumoto Model (3)

Failures experienced versus execution time τ:

$\mu(\tau) = \dfrac{1}{\theta}\,\ln\left(\lambda_0\,\theta\,\tau + 1\right)$

[Figure: plot of y = (1/0.02) ln(10·0.02·x + 1) - mean failures experienced μ (0 to 200) versus execution time τ (0 to 150), growing logarithmically.]

© G. Antoniol 2012 LOG6305 105

The Musa-Okumoto Model (4)

Failure intensity versus execution time τ:

$\lambda(\tau) = \dfrac{\lambda_0}{\lambda_0\,\theta\,\tau + 1}$

[Figure: plot of y = 10/(10·0.02·x + 1) - failure intensity λ (0 to 10) versus execution time τ (0 to 150), decaying hyperbolically.]

◆  Collect failure data: the times when failures occurred
◆  Use tools to determine the model parameters λ0 and θ
◆  Determine whether the reliability objective is met, or how long it might take to reach it

© G. Antoniol 2012 LOG6305 106

The Musa-Okumoto Model (5)

◆  Once the parameters λ0 and θ are estimated, the model can be used to predict failure intensity in the future, not only to estimate its current value. From this, we can plan how much additional testing is likely to be needed
◆  The model allows for the realistic situation where fault correction is not perfect (an infinite number of failures at infinite time)
◆  When faults stop being corrected, the model reduces to a homogeneous Poisson process with failure intensity λ as the parameter - the number of failures expected in a time interval then follows a Poisson distribution

© G. Antoniol 2012 LOG6305 107

Reliability Estimation

◆  Probability of 0 failures in a time frame of length τ (reliability), at the present failure intensity λP:

$P(0, \tau) = e^{-\lambda_P \tau}$

◆  Additional execution time Δτ and additional failures Δμ needed to reach, from the present failure intensity λP, the required failure intensity λF:

$\Delta\tau = \dfrac{1}{\theta}\left(\dfrac{1}{\lambda_F} - \dfrac{1}{\lambda_P}\right)$

$\Delta\mu = \dfrac{1}{\theta}\,\ln\dfrac{\lambda_P}{\lambda_F}$
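A minimal sketch of these planning formulas; θ, λP and λF below are illustrative values, not estimates from real data:

```python
import math

theta = 0.02   # failure intensity decay parameter (illustrative)
lam_P = 0.5    # present failure intensity (illustrative)
lam_F = 0.1    # failure intensity objective (illustrative)

# Additional execution time and additional expected failures needed
d_tau = (1 / theta) * (1 / lam_F - 1 / lam_P)
d_mu = (1 / theta) * math.log(lam_P / lam_F)

# Reliability: probability of zero failures in a frame of length tau
# at the present intensity
tau = 1.0
R = math.exp(-lam_P * tau)

print(f"extra time = {d_tau:.0f}, extra failures = {d_mu:.1f}, R = {R:.3f}")
```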

© G. Antoniol 2012 LOG6305 108

Conceptual View of Prediction

© G. Antoniol 2012 LOG6305 109

Time Measurement

◆  Execution (CPU) time: may be difficult to collect, but accurate
◆  Calendar time: easy to collect, but makes numerous assumptions (equivalent testing intensity across time periods)
◆  Testing effort: same as calendar time, but it is easier to verify the assumptions
◆  A specific time measurement may often be devised: #calls, lines of code compiled, #transactions

© G. Antoniol 2012 LOG6305 110

Selection of Models

◆  Several models exist (ca. 40 published) => Which one to select? No single model is consistently the best.
◆  Assumptions of models:
➫  Definition of the testing process
➫  Finite or infinite number of failures?
➫  No faults introduced during debugging?
➫  Distribution of the data (Poisson, Binomial)?
➫  Data requirements (inter-failure data, failure count data)?
◆  Assessing the goodness-of-fit
➫  Kolmogorov-Smirnov, Chi-Square
◆  Trends in the data (prior to model application)
➫  Usually, reliability models assume reliability growth. The Laplace test can be used to check whether we actually experience growth; see the sketch below.
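The Laplace test itself is a short computation; a sketch of one common form for time-truncated data (the formula is standard but not given on the slides, and the failure times are illustrative):

```python
import math

def laplace_factor(times, T):
    """Laplace trend factor, one common form for time-truncated data:
    u = (mean(t_i) - T/2) / (T * sqrt(1/(12 n))).
    u << 0 suggests reliability growth; u >> 0 suggests decay."""
    n = len(times)
    return (sum(times) / n - T / 2) / (T * math.sqrt(1 / (12 * n)))

# Illustrative failure times: inter-failure gaps are lengthening
times = [10, 18, 32, 49, 64, 86, 105, 132, 167, 207]
print(f"u = {laplace_factor(times, T=222.0):.2f}")  # negative -> growth
```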

© G. Antoniol 2012 LOG6305 111

Motorola Zero-Failure Model

◆  Model derived from Brettschneider '89; the failure rate is assumed to decay as $a\,e^{-bt}$
◆  It gives the number of test hours with zero failures required to claim the target:

$\text{zero-failure test hours} = \dfrac{\ln\left[fd / (0.5 + fd)\right]}{\ln\left[(0.5 + fd) / (tf + fd)\right]} \times \text{hours-to-last-failure}$

➫  fd: failure density needed
➫  tf: total failures observed
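A minimal sketch of the computation, with illustrative inputs; how the failure density fd is expressed (per KLOC, per hour, ...) depends on the project:

```python
import math

def zero_failure_hours(fd, tf, hours_to_last_failure):
    """Test hours with zero failures needed to claim the failure density
    target fd, given tf total failures so far, per the formula above."""
    return (math.log(fd / (0.5 + fd)) /
            math.log((0.5 + fd) / (tf + fd))) * hours_to_last_failure

# Illustrative: target fd = 0.03, 15 failures observed so far,
# last failure seen at 500 test hours.
print(f"{zero_failure_hours(0.03, 15, 500.0):.0f} failure-free test hours")
```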

© G. Antoniol 2012 LOG6305 112

Pro's and Con's

Pro's:
·  Reliability can be specified
·  Objective and direct assessment of reliability
·  Prediction of time to delivery

Con's:
·  The usage model may be difficult to devise
·  Selection of the reliability growth model is difficult
·  Measurement of time is crucial but may be difficult in some contexts
·  Not applicable to safety-critical systems, as very high reliability requirements would take years of testing to verify