
The Institute for Integrating Statistics in Decision Sciences

Technical Report TR-2011-1

January 24, 2011

Imperfect Debugging in Software Reliability: A Bayesian Approach

Tevfik Aktekin

Department of Decision Sciences

University of New Hampshire

Toros Caglar


The George Washington University



Imperfect Debugging in Software Reliability: A Bayesian Approach

Tevfik Aktekin



Toros Caglar



January 19, 2011

Abstract

The objective of studying software reliability is to help software engineers better understand the probabilistic nature of software failures during the debugging stages and to construct reliability models. In this paper, we consider Bayesian modeling of inter-failure times whose parameter space evolves stochastically over time, as is typically observed in software reliability applications. In doing so, we focus on modeling relevant parameters such as the fault detection rate per fault and the total number of faults, both of which are latent (unobservable) factors. We consider a model that can take into account imperfect debugging behavior under certain conditions. Furthermore, we investigate the existence of perfect vs. imperfect debugging in the light of data. To show how the proposed model works, we use real data, based on which we obtain the predictive reliability functions after each testing stage, carry out inference on relevant model parameters and present additional insights from Bayesian analysis.


1 Introduction and Overview

Over the last two decades, more than a hundred software reliability models have been proposed by researchers, as pointed out by both Singpurwalla and Wilson (1994) and Kuo and Yang (1996). Software reliability can be defined as the probability of not observing a failure for a specified time interval under certain conditions. Here, not observing a failure means that the software runs without any problems (which does not necessarily mean that the code contains no more bugs). Software failures mainly happen due to problems in the code, which can be attributed to human error. In software reliability research, hard data are based on the failures, not on the errors left (bugs), since the former are observable quantities whereas the latter are not. Before the final release, the software passes through several stages of testing in which debugging is performed and its reliability is assessed after each stage. As pointed out by Singpurwalla and Wilson (1994), the testing stages are considered to be the most expensive step of software development. When the software's reliability is found to be adequate, the software is released. The question of when to release software given its reliability function is of utmost importance to software practitioners and requires proper modeling of the relevant uncertain quantities that determine the software reliability function.

The purpose of software testing is to detect software faults (bugs) inherent in the software code. A software fault is an error in the source code which can cause the software to fail when the program is executed. The testing stage consists of several consecutive program executions. Whenever a failure occurs, the software engineer attempts to fix the problem. When the problem is fixed and no new errors are introduced during debugging, perfect debugging is said to have occurred; when new errors are introduced during debugging, imperfect debugging is said to have occurred. We note that when imperfect debugging occurs, software reliability deteriorates, whereas with perfect debugging the software becomes more reliable because fewer bugs are left in it. There are several reliability models that assume either perfect or imperfect debugging, which we briefly summarize in the sequel.

Most software reliability models can be classified into one of three categories: models based on the failure rates of the inter-failure times, models based on the number of failures, and models based on the actual inter-failure times. The earliest software reliability model is due to Jelinski and Moranda (1972) (JM), where the focus is on modeling failure rates. Although the model is very simple and rests on several questionable assumptions, it is the building block for most modern software reliability models in the literature. The JM model assumes that the software contains an unknown number of bugs and that upon software failure a bug is detected and corrected entirely, i.e. perfect debugging. The JM model also assumes that every fault contributes equally to the failure rate at any stage of testing and that each fault is removed permanently upon failure. Straightforward maximum likelihood estimation is used to estimate the model parameters. Another model based on failure rates is the one introduced by Littlewood and Verrall (1973) (LV), in which faults do not contribute equally to the failure rate at each stage of software testing, although each fault is still assumed to be removed permanently upon failure. For parameter estimation, the LV model uses a combination of maximum likelihood and Bayes estimation methods. Bayesian extensions of both the JM and LV models have been considered by Kuo and Yang (1995), and of the LV model by Soyer and Mazzuchi (1988). More recently, Washburn (2006) introduced a generalization of the Jelinski-Moranda model by considering a negative binomial prior for the number of faults left in the software, where perfect debugging is assumed.

The earliest failure count model is due to Goel and Okumoto (1979), where a non-homogeneous Poisson process with mean value function $m(t) = a(1 - e^{-bt})$ (equivalently, intensity function $ab\,e^{-bt}$) is introduced. This model is considered to have given rise to most count models in software reliability, also referred to as NHPP models. Later, Musa and Okumoto (1984) proposed another failure count model based on a logarithmic Poisson execution time approach. This model suggests that the rate at which failures occur decreases exponentially with the expected number of failures. Bayesian analysis of failure count models has been the subject of many studies. Kuo and Yang (1996) propose a unified approach to non-homogeneous Poisson process models and carry out Bayesian inference via Markov chain Monte Carlo methods. In doing so, they model the epochs of failures via a general order statistics model or a record value statistics model. Campodonico and Singpurwalla (1995) discuss the incorporation of expert knowledge into the count processes typically used in software reliability, such as the non-homogeneous Poisson process.

Software reliability models based on the actual inter-failure times are scarcer in the literature. One such model is due to Singpurwalla and Soyer (1992), where the relationship between the inter-failure times is modeled via a power law, which reduces to a first-order autoregressive linear model in logarithms. Coupled with an evolution equation on the parameter space, the authors obtain a Gaussian Kalman filter model. Dalal and Mallows (1988) investigate optimal stopping during the testing stage of software development, and Morali and Soyer (2003) propose a Bayesian state space model with analytically tractable properties for optimal stopping. There is also a considerable amount of work in software reliability from an operational profile perspective, where the operational profile is defined as the set of all operations that a piece of software is designed to perform together with the occurrence probabilities of these operations. Ozekici et al. (2000), Ozekici and Soyer (2001) and Ozekici and Soyer (2003) study such operational profiles and develop optimal release strategies. In addition, thorough reviews of most software reliability models can be found in Xie (1991), Singpurwalla (1995) and Singpurwalla and Wilson (1999).

In this paper we consider Bayesian modeling of inter-failure times whose parameter space evolves stochastically over the testing stages. Since we describe our uncertainty about these unknown parameters via probabilities, a Bayesian approach is the natural choice. This approach allows us to sequentially update the relevant model parameters as well as their predictive distributions, such as the predictive reliability function, which can be used in determining the optimal time to release software. In doing so, we focus on modeling relevant parameters such as the fault detection rate per fault and the total number of faults, both of which are latent (unobservable) factors. Coupled with an exponential likelihood on the inter-failure times, we consider a model which can take into account imperfect debugging behavior under certain conditions; such behavior is typically observed in software reliability problems but is quite scarce in the software reliability literature (see Ruggeri et al. (2010) for a recent study). Furthermore, we investigate the existence of perfect vs. imperfect debugging by comparing the in-sample fit of both models using methods detailed in the sequel. In estimating model parameters, we use Markov chain Monte Carlo methods such as the Metropolis-Hastings algorithm, the Gibbs sampler or a combination of both; see Smith and Gelman (1992) and Chib and Greenberg (1995) for a detailed review of these methods. To show how the proposed model works, we use real software inter-failure time data, with which we predict the reliability function after each testing stage and carry out inference on the number of bugs left in the system and on the fault detection rate per fault after each stage of testing.

There are many features of the Bayesian approach that would be of interest to software reliability practitioners. As pointed out by Lindley (1990), the Bayesian approach provides a coherent framework for making decisions under uncertainty and thus enables us to develop optimal strategies for releasing software. One attractive feature of the Bayesian approach is its ability to incorporate expert knowledge via the prior distributions of relevant model parameters; see Campodonico and Singpurwalla (1995) for instance. Another important property is the straightforward manner in which it handles sequential updating of model parameters as one goes through the testing stages in the light of new information regarding failures.

A synopsis of our study is as follows. In Section 2, we introduce a Bayesian model which can take into account the imperfect debugging phenomenon under certain conditions and discuss a method to compare models. An illustration of the proposed models is carried out in Section 3 via real software failure data, where we discuss in-sample fit, convergence of the estimation method and inference on the relevant model parameters. Finally, in Section 4 we conclude with a summary of our findings and suggestions for future work.

2 Proposed Model

The modeling approach that we develop in the sequel is based on the Jelinski-Moranda (JM) model, as is the case for most subsequent work in the software reliability literature. Two of the main assumptions of the JM model are that every fault contributes equally to the failure rate at any stage of testing and that each fault is removed permanently upon failure. We will consider the case where each fault is removed permanently upon failure, along with the possibility of introducing new faults into the source code while debugging. In other words, we consider scenarios where so-called imperfect debugging can occur, a setting which is scarce in the software reliability literature.

Let us first introduce a set of relevant parameters whose inferential implications will be discussed in the sequel. Let $\lambda_i$, for $i = 1, \ldots, N$, be the fault detection rate per fault during the $i$th stage of testing, where $N$ represents the last stage of testing. In addition, let $\mu_i$, for $i = 1, \ldots, N$, represent the number of faults present in the software code during the $i$th stage of testing. Thus, the product $\lambda_i\mu_i$ represents the failure rate of the software during the $i$th stage of testing. We note that both quantities are latent, i.e. unobservable, and their uncertainty will be described via probability distributions. Both $\lambda_i$ and $\mu_i$ are indexed by $i$; in other words, they evolve stochastically from testing stage to testing stage.


2.1 Likelihood

As is the case with most software reliability models, we assume that the inter-failure times $t_i$, for $i = 1, \ldots, N$, which are observable quantities, are exponentially distributed. Furthermore, given $\lambda_i$ and $\mu_i$, the $t_i$ are assumed to be conditionally independent. Therefore the likelihood function can be obtained as

$$
L(\boldsymbol{\lambda}, \boldsymbol{\mu}; D) = \prod_{i=1}^{N} \lambda_i\mu_i \exp\{-t_i\lambda_i\mu_i\}, \tag{2.1}
$$

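where $\boldsymbol{\lambda} = \{\lambda_1, \ldots, \lambda_N\}$, $\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_N\}$ and $D = \{t_1, \ldots, t_N\}$. Therefore the joint posterior of $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ is given by

$$
p(\boldsymbol{\lambda}, \boldsymbol{\mu} \mid D) \propto \prod_{i=1}^{N} \lambda_i\mu_i \exp\{-t_i\lambda_i\mu_i\}\, p(\boldsymbol{\lambda})\, p(\boldsymbol{\mu}), \tag{2.2}
$$

which will not be analytically available for any reasonable joint prior choice of $p(\boldsymbol{\lambda})$ and $p(\boldsymbol{\mu})$.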

Therefore, one can use Markov chain Monte Carlo (MCMC) methods such as the Metropolis-Hastings algorithm or the Gibbs sampler to obtain the posterior distributions of $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$. We discuss the implementation and implications of such MCMC methods in our numerical example section. Next we introduce modeling strategies for the $\lambda_i$s and the $\mu_i$s.
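As a concrete illustration of (2.1), the following Python sketch evaluates the log-likelihood for given stage-wise values of the fault detection rate per fault ($\lambda_i$) and the number of faults ($\mu_i$). The function name and toy inputs are hypothetical and not part of the original study; working on the log scale simply avoids numerical underflow of the product when $N$ is large.

```python
import math
from typing import Sequence


def log_likelihood(t: Sequence[float], lam: Sequence[float], mu: Sequence[float]) -> float:
    """Logarithm of the likelihood (2.1).

    t[k]   : observed inter-failure time at stage k + 1
    lam[k] : fault detection rate per fault at stage k + 1
    mu[k]  : number of faults present at stage k + 1
    """
    if not (len(t) == len(lam) == len(mu)):
        raise ValueError("t, lam and mu must all have length N")
    total = 0.0
    for t_i, lam_i, mu_i in zip(t, lam, mu):
        rate = lam_i * mu_i            # exponential failure rate at this stage
        if rate <= 0:
            return -math.inf           # an observed failure has zero density
        total += math.log(rate) - t_i * rate
    return total


print(log_likelihood([1.0], [0.5], [2]))  # -1.0
```

Each stage contributes $\log(\lambda_i\mu_i) - t_i\lambda_i\mu_i$, so the same function can be reused inside an MCMC loop or for the model comparison in Section 2.5.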

2.2 Modeling the fault detection rate per fault, $\lambda_i$

We believe that the fault detection rate per fault during the $i$th stage will be highly dependent on what happened during the $(i-1)$th debugging stage. Let the dependence structure of the $\lambda_i$s be given by the following power law relationship

$$
\lambda_i = \lambda_{i-1}^{\beta}\,\epsilon_i, \quad i = 1, \ldots, N, \tag{2.3}
$$

where $\epsilon_i$ is lognormally distributed, $\epsilon_i \sim LN(0, \sigma^2)$. By taking the logarithms of (2.3), we obtain the following linear model in logarithms

$$
\log(\lambda_i) = \beta\log(\lambda_{i-1}) + \omega_i, \quad i = 1, \ldots, N, \tag{2.4}
$$


where $\omega_i = \log(\epsilon_i)$. Equation (2.4) is a first-order autoregressive process of the latent fault detection rates per fault on the log scale. Therefore, using the Markov property, the conditional distributions of the $\log(\lambda_i)$s can be written as

$$
\log(\lambda_i) \mid \log(\lambda_{i-1}), \beta \sim N\big(\beta\log(\lambda_{i-1}), \sigma^2\big), \quad i = 1, \ldots, N, \tag{2.5}
$$

where $\beta$ acts like a first-order autoregressive coefficient and is assumed to be $U(a, b)$. The availability of the conditional distributions given by (2.5) is an attractive feature since, via the chain rule, their product can be used in (2.2) in lieu of $p(\boldsymbol{\lambda})$. The lognormal is a natural choice due to its well-known properties and the availability of the full conditionals $\log(\lambda_i) \mid \log(\lambda_{i-1})$ for $i = 1, \ldots, N$.
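To make the state evolution concrete, the sketch below simulates one path of the fault detection rates per fault from the log-scale AR(1) form (2.4), which is equivalent to the power law (2.3). The starting value `lam0`, the noise standard deviation `sigma` (the square root of the lognormal variance parameter) and the seed are illustrative assumptions, not values from the paper.

```python
import math
import random


def simulate_lambda_path(lam0: float, beta: float, sigma: float, n: int,
                         rng: random.Random) -> list[float]:
    """Simulate lambda_1..lambda_n from log(lam_i) = beta*log(lam_{i-1}) + omega_i.

    omega_i ~ N(0, sigma^2), so eps_i = exp(omega_i) ~ LN(0, sigma^2) and
    lam_i = lam_{i-1}**beta * eps_i, as in the power law (2.3).
    """
    path = []
    log_prev = math.log(lam0)
    for _ in range(n):
        log_cur = beta * log_prev + rng.gauss(0.0, sigma)  # AR(1) step in logs
        path.append(math.exp(log_cur))
        log_prev = log_cur
    return path


rng = random.Random(1)
lams = simulate_lambda_path(lam0=0.01, beta=0.96, sigma=0.1, n=31, rng=rng)
```

With `sigma=0` the recursion becomes deterministic, $\lambda_i = \lambda_0^{\beta^i}$, which makes the role of $\beta$ as an autoregressive coefficient on the log scale easy to see.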

The relationship implied by (2.3) also dictates the type of debugging that occurs during the $i$th debugging stage. If $\lambda_i < \lambda_{i-1}$, perfect debugging is said to have occurred, whereas when $\lambda_i > \lambda_{i-1}$, imperfect debugging is said to have occurred. In the latter case, the fault that caused the failure at the $(i-1)$th failure epoch has been detected and repaired, but a new fault was introduced during the same debugging stage. A priori, we assume that the $\lambda_i$s are lognormally distributed and that their dependence structure is given via the power law introduced in (2.3), where $\beta$ determines on average how the fault detection rate per fault changes from stage to stage. For instance, when $0 < \lambda_{i-1} < 1$ and $\beta > 1$, then $\lambda_{i-1}^{\beta} < \lambda_{i-1}$ and perfect debugging tends to occur; conversely, when $0 < \beta < 1$, imperfect debugging tends to occur. Thus, inference on $\beta$, and the ability to make probabilistic statements about it, would be of interest to software engineers in assessing the overall performance of the testing stage. For instance, one can determine the probability of imperfect debugging via the posterior distribution of $\beta$ as $P(0 < \beta < 1 \mid D)$ when $0 < \lambda_{i-1} < 1$. We discuss the implications of such probabilistic statements about $\beta$ in our numerical application.
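The sketch below illustrates both points: the power-law step of (2.3) with the lognormal noise at its median ($\epsilon_i = 1$), the classification of a stage as perfect or imperfect debugging, and a Monte Carlo estimate of $P(0 < \beta < 1 \mid D)$ as the fraction of posterior draws of $\beta$ in $(0, 1)$. The function names are hypothetical and the draws are placeholders, not output from the paper's sampler.

```python
def median_step(lam_prev: float, beta: float) -> float:
    """Value of lam_i from (2.3) when the lognormal noise is at its median (eps_i = 1)."""
    return lam_prev ** beta


def debugging_type(lam_prev: float, lam_cur: float) -> str:
    """Classify a stage as in Section 2.2: a drop in lambda means perfect debugging."""
    if lam_cur < lam_prev:
        return "perfect"
    if lam_cur > lam_prev:
        return "imperfect"
    return "tie"


def prob_beta_in_unit_interval(beta_draws: list[float]) -> float:
    """Monte Carlo estimate of P(0 < beta < 1 | D) from posterior draws of beta."""
    return sum(1 for b in beta_draws if 0.0 < b < 1.0) / len(beta_draws)


# With 0 < lam_prev < 1, beta < 1 raises lambda (imperfect), beta > 1 lowers it (perfect).
print(debugging_type(0.01, median_step(0.01, 0.96)))  # imperfect
print(debugging_type(0.01, median_step(0.01, 1.05)))  # perfect
print(prob_beta_in_unit_interval([0.93, 0.95, 0.97, 0.99, 1.01]))  # 0.8
```

For example, $0.01^{0.96} \approx 0.012 > 0.01$, whereas $0.01^{1.05} \approx 0.0079 < 0.01$.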

2.3 Modeling the total number of faults, $\mu_i$

Another quantity of interest is the inherent total number of faults during the $i$th stage of debugging, denoted by $\mu_i$ for $i = 1, \ldots, N$. We note that the $\mu_i$s are latent factors whose inference will be conditional on the fault detection rate at each stage of testing. Given the dependence structure previously introduced for the $\lambda_i$s, it is natural to assume that the total number of faults left in the software code after the $i$th debugging, $\mu_i$, depends on the previous stage. Conditional on whether perfect or imperfect debugging has occurred during the previous debugging stage, the total number of faults left in the software code is assumed to have the following structure

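$$
\mu_i = \mu_{i-1} - \delta_i, \quad i = 1, \ldots, N, \tag{2.6}
$$

where

$$
\delta_i = \begin{cases} 1, & \text{with probability } p(\lambda_i < \lambda_{i-1}), \\ 0, & \text{with probability } p(\lambda_i > \lambda_{i-1}), \end{cases} \quad i = 1, \ldots, N.
$$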

In (2.6), $\delta_i$ is a Bernoulli process whose probability of success is the probability of perfect debugging, i.e. $p(\lambda_i < \lambda_{i-1})$. This structure ensures that when perfect debugging occurs during the $i$th debugging stage, $\mu_i$ goes down by one unit, since the fault that caused the failure has been found and fixed. When imperfect debugging occurs, $\mu_i$ stays the same, since the fault that caused the failure has been found and fixed but a new fault has been introduced while fixing it. An important assumption regarding the behavior of the $\mu_i$s is that imperfect debugging introduces only a single fault and perfect debugging removes a single fault. This may or may not be the case in real-life software debugging: more than one fault can be introduced during an imperfect debugging, and more than one fault can be repaired during a perfect debugging. Our model assumes that a single fault is repaired or introduced during each debugging operation. An alternative that handles such cases is to assume a discrete Markov chain structure on $\mu_i$. Such a setup is not addressed in this study, since it would further complicate the estimation procedure; it can be addressed in future work as a new problem.
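A minimal sketch of the fault-count recursion (2.6), given a path of fault detection rates. Here $\delta_i$ is taken to be the indicator that $\lambda_i < \lambda_{i-1}$; when the $\lambda$ path is itself random, this indicator is a Bernoulli variable with success probability $p(\lambda_i < \lambda_{i-1})$, which is one way (our assumption) to realize the structure above. The function name and toy path are hypothetical, and, mirroring (2.6), no floor at zero is imposed on the fault count.

```python
def fault_count_path(mu0: int, lam: list[float], lam0: float) -> tuple[list[int], list[int]]:
    """Apply mu_i = mu_{i-1} - delta_i for i = 1..N.

    lam  : lambda_1..lambda_N (one value per testing stage)
    lam0 : lambda_0, needed to decide delta_1
    Returns (delta, mu), where delta_i = 1 if lambda drops (perfect debugging)
    and 0 otherwise (imperfect debugging).
    """
    delta, mu = [], []
    prev_lam, prev_mu = lam0, mu0
    for lam_i in lam:
        d = 1 if lam_i < prev_lam else 0  # perfect debugging removes one fault
        prev_mu -= d                      # d = 0: one fault fixed, one introduced
        delta.append(d)
        mu.append(prev_mu)
        prev_lam = lam_i
    return delta, mu


delta, mu = fault_count_path(mu0=20, lam=[0.012, 0.011, 0.013, 0.010], lam0=0.010)
print(delta)  # [0, 1, 0, 1]
print(mu)     # [20, 19, 19, 18]
```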

Furthermore, the initial number of inherent faults, denoted by $\mu_0$, is also unknown and therefore needs to be described via a probability distribution. We assume that $\mu_0$ follows a Poisson distribution with mean $\theta$. In addition, we introduce an extra level of hierarchy on $\theta$, which can be interpreted as the expected initial number of faults prior to testing. Typically, in software reliability applications the initial number of inherent faults is estimated via deterministic functions based on the total number of lines of source code. Thus, based on the structures introduced on the $\lambda_i$s and $\mu_i$s and the data at hand, we learn more about $\theta$ by carrying out inference via its posterior distribution. We assume a Gamma prior on $\theta$ and thus summarize the hierarchy on the inherent number of faults prior to testing as follows

$$
\mu_0 \sim \text{Poisson}(\theta) \tag{2.7}
$$

and

$$
\theta \sim \text{Gamma}(c, d). \tag{2.8}
$$

In our numerical example, we assume a flat Gamma prior for $\theta$ and point out how sensitive the model estimation is to changes in the hyper-parameters $c$ and $d$.
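To see what the hierarchy (2.7) and (2.8) implies before any data are seen, the sketch below draws $\theta$ from a Gamma distribution and then $\mu_0$ from a Poisson distribution with mean $\theta$. We assume that $\text{Gamma}(c, d)$ has shape $c$ and rate $d$ (the convention of WinBUGS' `dgamma`), so that $E(\theta) = c/d$ and hence $E(\mu_0) = c/d$; Python's `random.gammavariate` takes a scale, hence the `1.0 / d`. The Poisson sampler is Knuth's multiplication method, and the hyper-parameter values are illustrative only.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for moderate means (say below 500)."""
    limit = math.exp(-mean)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1


def sample_initial_faults(c: float, d: float, rng: random.Random) -> tuple[float, int]:
    """One prior draw of (theta, mu_0): theta ~ Gamma(shape=c, rate=d), mu_0 ~ Poisson(theta)."""
    theta = rng.gammavariate(c, 1.0 / d)  # second argument of gammavariate is a scale
    return theta, sample_poisson(theta, rng)


rng = random.Random(42)
draws = [sample_initial_faults(c=2.0, d=0.05, rng=rng)[1] for _ in range(20000)]
print(sum(draws) / len(draws))  # close to c / d = 40
```

Marginally, this Gamma-Poisson mixture makes $\mu_0$ negative binomial, so the prior on the initial number of faults is more dispersed than a Poisson with a fixed mean.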

2.4 Predictive reliability function estimation

As pointed out previously, prior to the final release the software passes through several stages of testing, and its reliability is assessed after each stage. When the reliability of a piece of software is found to be adequate, the software is released. Therefore, the optimal release of software is determined via the estimated predictive reliability function after the $(i-1)$th testing stage. Given model parameters such as $\lambda_i$ and $\mu_i$, the predictive reliability function during the $i$th testing stage (given that the software has gone through $i-1$ stages of testing) can be obtained via

$$
R(t_i \mid D^{(i-1)}) = \int\!\!\int R(t_i \mid \lambda_i, \mu_i, D^{(i-1)})\, p(\lambda_i, \mu_i \mid D^{(i-1)})\, d\lambda_i\, d\mu_i, \tag{2.9}
$$

where $D^{(i-1)} = \{t_1, \ldots, t_{i-1}\}$. Since Markov chain Monte Carlo methods will be used to generate samples of $\lambda_i$ and $\mu_i$, (2.9) can also be computed via

$$
R(t_i \mid D^{(i-1)}) \approx 1 - \frac{1}{S}\sum_{j=1}^{S} F\big(t_i \mid \lambda_i^{(j)}, \mu_i^{(j)}, D^{(i-1)}\big), \tag{2.10}
$$

where $S$ is the number of generated posterior samples, $\lambda_i^{(j)}$ and $\mu_i^{(j)}$ are the $j$th posterior samples given $D^{(i-1)}$, and $F(t \mid \lambda_i, \mu_i) = 1 - \exp\{-t\lambda_i\mu_i\}$ is the exponential distribution function implied by (2.1). We note that, given $D^{(i-1)}$, only posterior samples of $\lambda_{i-1}^{(j)}$ and $\mu_{i-1}^{(j)}$ would be available. However, one can use straight Monte Carlo in (2.3) to generate the $\lambda_i^{(j)}$s and in (2.6) to generate the $\mu_i^{(j)}$s. Once (2.10) is estimated, it can easily be used as part of a software reliability optimal release scheme.
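The computation in (2.9) and (2.10) can be sketched as follows. Each posterior draw $j$ of $(\lambda_{i-1}, \mu_{i-1}, \beta, \sigma)$ is pushed one stage forward by straight Monte Carlo through (2.3) and (2.6), and the survival probabilities $\exp\{-t\lambda_i\mu_i\} = 1 - F(t \mid \lambda_i, \mu_i)$ are then averaged. The input lists stand in for MCMC output; their names, the use of a per-draw $\sigma$ and the indicator form of $\delta_i$ are assumptions made for illustration.

```python
import math
import random


def predictive_reliability(t_grid: list[float], lam_prev: list[float], mu_prev: list[int],
                           beta: list[float], sigma: list[float],
                           rng: random.Random) -> list[float]:
    """Monte Carlo estimate of R(t | D^(i-1)) for every t in t_grid.

    lam_prev, mu_prev, beta, sigma hold the j-th posterior draws of
    lambda_{i-1}, mu_{i-1}, beta and sigma given D^(i-1) (equal lengths).
    """
    lam_i, mu_i = [], []
    for lp, mp, b, s in zip(lam_prev, mu_prev, beta, sigma):
        lam_new = math.exp(b * math.log(lp) + rng.gauss(0.0, s))  # one step of (2.3)
        delta = 1 if lam_new < lp else 0                           # perfect debugging?
        lam_i.append(lam_new)
        mu_i.append(mp - delta)                                    # one step of (2.6)
    n_draws = len(lam_i)
    # (2.10): 1 - mean of F(t | draw j) equals the mean of exp(-t * lambda * mu)
    return [sum(math.exp(-t * lam_j * mu_j) for lam_j, mu_j in zip(lam_i, mu_i)) / n_draws
            for t in t_grid]


rng = random.Random(0)
S = 1000  # placeholder draws, not the paper's posterior output
R = predictive_reliability([0.0, 10.0, 50.0], [0.01] * S, [10] * S, [0.96] * S, [0.1] * S, rng)
```

Evaluating the function on a grid of $t$ values gives a whole predictive reliability curve, which is the form needed for a release rule such as "release once $R(t_0 \mid D^{(i-1)})$ exceeds a target".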


2.5 Model comparison of perfect vs. imperfect debugging models

In addition to the imperfect debugging model developed previously, we have also estimated, using a similar structure, a model which assumes perfect debugging, namely a modified Bayesian version of the Jelinski and Moranda (1972) JM model. In doing so, we have assumed that all $\lambda_i$s follow a common Gamma prior and adjusted the structure on the $\mu_i$s in (2.6) such that the term $\delta_i$ is replaced by 1, i.e. $\mu_i = \mu_{i-1} - 1$ for $i = 1, \ldots, N$. Thus, the perfect debugging model can be used as a benchmark for comparison purposes.

In comparing the in-sample performance of the proposed models, we use one of the most commonly used approximations of the marginal likelihood $p(D)$ (and hence of the Bayes factor) from MCMC output, known as the harmonic mean estimator. As summarized by Kass and Raftery (1995), the harmonic mean estimator can be obtained as

$$
p(D) = \left\{\frac{1}{S}\sum_{j=1}^{S} p\big(D \mid \Theta^{(j)}\big)^{-1}\right\}^{-1}, \tag{2.11}
$$

where $S$ is the number of iterations and $\Theta^{(j)}$ is the $j$th generated posterior sample vector. Equation (2.11) is the harmonic mean of the likelihood values and can be used to compare the in-sample fit of the proposed models. For the proposed models, (2.11) can be computed as follows

$$
p(D) = \left\{\frac{1}{S}\sum_{j=1}^{S}\left[\prod_{i=1}^{N} p\big(t_i \mid \lambda_i^{(j)}, \mu_i^{(j)}\big)\right]^{-1}\right\}^{-1}, \tag{2.12}
$$

where $p(t_i \mid \lambda_i^{(j)}, \mu_i^{(j)}) = \lambda_i^{(j)}\mu_i^{(j)}\exp\{-t_i\lambda_i^{(j)}\mu_i^{(j)}\}$ as in (2.1). Therefore, in comparing two models, a higher $p(D)$ value indicates a better fit. In our numerical example, $p(D)$ is computed on the log scale.
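Because the likelihood values in (2.12) are extremely small, $p(D)$ is best computed on the log scale with the log-sum-exp shift, as in the sketch below. It takes one log-likelihood value $\log p(D \mid \Theta^{(j)})$ per posterior draw (for example, the log of (2.1) evaluated at each draw); the function name is hypothetical. The harmonic mean estimator is also known to be numerically unstable, with potentially very large variance, so values from long runs are preferable.

```python
import math
from typing import Sequence


def log_harmonic_mean(loglik_draws: Sequence[float]) -> float:
    """log p(D) from (2.11), given log p(D | Theta^(j)) for each posterior draw j.

    Computes -log((1/S) * sum_j exp(-loglik_j)) with the log-sum-exp shift,
    so that exp(-loglik_j) never overflows even when loglik_j is very negative.
    """
    S = len(loglik_draws)
    neg = [-ll for ll in loglik_draws]
    m = max(neg)
    log_mean_inverse = m + math.log(sum(math.exp(x - m) for x in neg)) - math.log(S)
    return -log_mean_inverse


print(log_harmonic_mean([-2000.0, -2001.0, -1999.5]))  # finite; a naive exp(2000) would overflow
```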

2.6 Summary of the imperfect debugging model

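Below is a concise summary of the material developed in Sections 2.1, 2.2 and 2.3, which we refer to as the imperfect debugging model in the sequel. For $i = 1, \ldots, N$, the likelihood term for the inter-failure times is given by

$$
t_i \sim \text{Exp}(\lambda_i\mu_i).
$$

The state evolution equation of the fault detection rate per fault, $\lambda_i$, is given by

$$
\lambda_i = \lambda_{i-1}^{\beta}\,\epsilon_i,
$$

where $\epsilon_i \sim LN(0, \sigma^2)$ and $\beta \sim U(a, b)$.

The state evolution equation of the inherent number of faults, $\mu_i$, is given by

$$
\mu_i = \mu_{i-1} - \delta_i,
$$

where

$$
\delta_i = \begin{cases} 1, & \text{with probability } p(\lambda_i < \lambda_{i-1}), \\ 0, & \text{with probability } p(\lambda_i > \lambda_{i-1}), \end{cases}
$$

with $\mu_0 \sim \text{Poisson}(\theta)$ and $\theta \sim \text{Gamma}(c, d)$.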

3 Numerical Example

How well our software reliability model performs on actual data is crucial in evaluating its validity. The details of the dataset, the estimation implications and a summary of our findings are discussed in the sequel.

3.1 The dataset

The numerical application of our model is carried out on the well-known dataset first reported in Jelinski and Moranda (1972), referred to as the JM data in the sequel. The dataset consists of 31 software inter-failure times, 26 of which were obtained during the production stage of debugging and the remaining 5 during the remainder of the testing stage. In our example, all 31 inter-failure times were used. Soyer and Mazzuchi (1988) and Kuo and Yang (1996) also use the same dataset to illustrate their respective models from a Bayesian point of view.


3.2 Markov chain Monte Carlo implementation and convergence

To obtain posterior samples of the model parameters, a combination of Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling and the Metropolis-Hastings algorithm, was used (see Smith and Gelman (1992) and Chib and Greenberg (1995) for a summary of common MCMC practice). To generate posterior samples for the proposed models (both the imperfect and perfect debugging models), the WinBUGS software was used; the code is available from the authors via email upon request. We have assumed flat priors for model parameters where required. Three parallel chains with different initial points were used. The chains were run for 5,000 iterations as the burn-in period, and 20,000 samples were collected with a thinning interval of 2. Below, we present the posterior sample output for the relevant parameters and discuss further implications of the findings. To save space, we omit a detailed summary of convergence for all parameters of the imperfect debugging model; we present results for only some of the parameters and note that similar results were obtained for the rest.

Figure 1: Trace plots for $\beta$, $\theta$, $\lambda_1$ and $\mu_1$ for the imperfect debugging model (x-axis: iterations)

In obtaining the posterior samples, we did not encounter convergence problems in either model. This can be observed informally from the example trace plots in Figure 1.

A more formal assessment of convergence can be carried out using the Brooks and Gelman plots and the shrink factor, as discussed in Brooks and Gelman (1998). If the shrink factor, also referred to as the scale reduction point estimate, is close to 1, convergence is said to have been attained. The Brooks and Gelman plots are shown in Figure 2, where the shrink factor approaches 1 as the number of iterations increases for the parameters $\beta$, $\theta$, $\lambda_1$ and $\mu_1$. The estimated shrink factor was 1 for $\beta$ and 1.01 for $\theta$. The individual shrink factors for the $\lambda_i$s were within a range of 1.00 to 1.07, with a multivariate shrink factor of 1.02. Similarly, the shrink factors for the $\mu_i$s were all around 1.02, with a multivariate shrink factor of 1.01. Similar results were obtained for the other parameters, so their convergence details are omitted to save space. Thus, we conclude that we did not encounter any convergence issues with either model.

Figure 2: Brooks and Gelman plots for $\beta$, $\theta$, $\lambda_1$ and $\mu_1$ for the imperfect debugging model (x-axis: last iteration in chain; y-axis: shrink factor; curves: median and 97.5% quantile)
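For readers who want to reproduce this kind of diagnostic outside WinBUGS, the sketch below computes the basic form of the potential scale reduction factor, $\sqrt{\hat V / W}$, for one scalar parameter from $m$ parallel chains of equal length $n$. It omits the degrees-of-freedom and sampling-variability corrections used by Brooks and Gelman (1998) and common software, so its values can differ slightly from the shrink factors reported above; the function name and simulated chains are illustrative only.

```python
import math
import random
from statistics import mean, variance


def psrf(chains: list[list[float]]) -> float:
    """Basic potential scale reduction factor sqrt(V_hat / W) for one parameter."""
    m, n = len(chains), len(chains[0])
    if m < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(c) for c in chains]
    W = mean(variance(c) for c in chains)   # average within-chain variance
    B = n * variance(chain_means)           # between-chain variance
    V_hat = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return math.sqrt(V_hat / W)


rng = random.Random(0)
mixed = [[rng.gauss(0.96, 0.03) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(mu, 0.03) for _ in range(2000)] for mu in (0.90, 0.96, 1.02)]
print(round(psrf(mixed), 3))  # close to 1
print(round(psrf(stuck), 3))  # well above 1
```

Chains that explore the same distribution give a factor near 1, while chains stuck around different values inflate the between-chain variance and push the factor well above 1.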

3.3 Summary of findings

Next we discuss the implications of the posterior distributions obtained for the relevant model parameters. Boxplots based on the posterior samples of the fault detection rate per fault during each debugging stage, i.e. the $\lambda_i$s, are shown in Figure 3. As introduced previously, when $\lambda_i < \lambda_{i-1}$, perfect debugging is said to have occurred. Figure 3 shows how the fault detection rate per fault changes over the debugging stages. It can fairly be argued that during the first 15 to 16 testing stages the $\lambda_i$s tend to go up, namely it becomes easier to find faults, and that they tend to go down towards the end of testing, in other words it becomes harder to find faults as more faults are discovered. Superimposed on this overall pattern, the fault detection rate per fault goes up or down as imperfect or perfect debugging occurs over the testing horizon. Another finding is that the uncertainty about the fault detection rate, based on the upper and lower quantiles of the $\lambda_i$s, seems to be higher around the 15th and 16th testing stages and lower at the beginning and at the end of testing. One of the advantages of the Bayesian approach is that it allows probabilistic statements to be made about model parameters; for instance, based on the samples behind Figure 3, it would be straightforward to compute posterior probabilities such as $P(\lambda_i < \lambda_{i-1} \mid D)$ for $i = 2, \ldots, 31$, which give the probability of perfect debugging from stage to stage.
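Given $S$ posterior draws of the whole path $(\lambda_1, \ldots, \lambda_N)$, such probabilities are simply the fraction of draws in which $\lambda_i$ falls below $\lambda_{i-1}$. The sketch below assumes the draws are stored as one list per MCMC iteration; this layout, the function name and the toy numbers are hypothetical. Averaging the same indicators across draws is also how a posterior mean of $\delta_i$ in (2.6) would be estimated.

```python
def prob_perfect_debugging(lam_draws: list[list[float]]) -> list[float]:
    """Estimate P(lambda_i < lambda_{i-1} | D) for i = 2..N.

    lam_draws[j][k] is the j-th posterior draw of lambda_{k+1}.
    The first returned entry corresponds to stage i = 2.
    """
    S, N = len(lam_draws), len(lam_draws[0])
    probs = []
    for k in range(1, N):
        drops = sum(1 for draw in lam_draws if draw[k] < draw[k - 1])
        probs.append(drops / S)
    return probs


toy_draws = [[0.010, 0.012, 0.011],
             [0.010, 0.009, 0.012],
             [0.011, 0.013, 0.012],
             [0.009, 0.010, 0.008]]
print(prob_perfect_debugging(toy_draws))  # [0.25, 0.75]
```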

Figure 3: Boxplots of $\lambda_i$ for $i = 1, \ldots, 31$ (x-axis: debugging stages)

The right panel of Figure 4 shows the posterior distribution of $\lambda_1$, the fault detection rate per fault during the first debugging stage (similar distributions were obtained for $\lambda_i$, $i = 2, \ldots, 31$). The left panel of Figure 4 shows the posterior distribution of $\beta$ as introduced in (2.3), whose posterior mean $E(\beta \mid D)$ was 0.96 with a standard deviation of 0.028. Since the posterior values of the $\lambda_i$s lie well below 1 (Figure 3), a value of $\beta$ below 1 raises $\lambda_i$ on average; this indicates that, on average, imperfect debugging tends to occur during the 31 stages of testing.

The posterior probability of perfect debugging, $P(\lambda_i < \lambda_{i-1} \mid D)$, determines the probability of success of the Bernoulli process $\delta_i$ for $i = 1, \ldots, 31$, as introduced in (2.6). Figure 5 shows the behavior of the posterior mean of the $\delta_i$s over the testing stages, i.e. the posterior probability of perfect debugging. When this probability is greater than 0.5, perfect debugging is more likely to occur than imperfect debugging. In Figure 5, cases located above 0.5 can be categorized as the most likely perfect debugging scenarios, whereas cases below 0.5 can be categorized as the most likely imperfect debugging scenarios. This provides informal evidence in favor of potential imperfect debugging behavior in the numerical example studied in this section.

Figure 4: Posterior distribution plots for $\beta$ (left) and $\lambda_1$ (right) for the imperfect debugging model

Figure 5: Probability of perfect debugging vs. debugging stages

As a function of the $\delta_i$s, the number of errors in the system (the $\mu_i$s) is determined according to the structure introduced in (2.6). Figure 6 shows boxplots of $\mu_i$ for $i = 1, \ldots, 31$. The expected number of errors in the system after the first debugging stage, $E(\mu_1 \mid D)$, was estimated to be 18.142, and the expected number of errors after the thirty-first debugging stage, $E(\mu_{31} \mid D)$, was estimated to be 4.063. This once more provides evidence in favor of the presence of imperfect debugging in our numerical example: if perfect debugging had occurred at every stage, the number of faults would have gone down by 31 units over 31 stages of debugging, which was not found to be the case in our example.

Figure 6: Boxplots of $\mu_i$ for $i = 1, \ldots, 31$ (x-axis: debugging stages)
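The arithmetic behind this argument can be made explicit using only the reported posterior means of the number of faults after the first and thirty-first stages, $E(\mu_1 \mid D) = 18.142$ and $E(\mu_{31} \mid D) = 4.063$, together with the recursion (2.6), under which $\mu_1 - \mu_{31} = \sum_{i=2}^{31}\delta_i$. Taking posterior expectations gives

$$
\sum_{i=2}^{31} E(\delta_i \mid D) = E(\mu_1 \mid D) - E(\mu_{31} \mid D) = 18.142 - 4.063 = 14.079.
$$

Under perfect debugging at every one of these 30 stages the sum would equal 30, so the reported means correspond to an average posterior probability of perfect debugging of roughly $14.079 / 30 \approx 0.47$ per stage between stages 2 and 31.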

The left panel of Figure 7 shows the posterior distribution of the latent factor $\theta$, i.e. the expected number of faults in the software prior to testing, as introduced in (2.8). The posterior mean of $\theta$, $E(\theta \mid D)$, was 28.01 with a standard deviation of 9.023. In estimating the hyper-parameter $\theta$, we have used a flat Gamma prior, with initial values of 50, 75 and 100 for the three chains of the MCMC algorithm. We note that the inference on $\theta$ was not sensitive to the choice of initial values. The right panel of Figure 7 shows the posterior distribution of $\mu_0$, the inherent number of faults in the software prior to testing, which is determined conditional on the hyper-parameter $\theta$.

Figure 7: Posterior distribution plots for $\theta$ (left) and $\mu_0$ (right) for the imperfect debugging model

3.4 Additional insights from Bayesian analysis

To infer whether perfect or imperfect debugging occurs in our numerical example, we used the Bayes factor calculation introduced in (2.12). Table 1 shows the log marginal likelihoods, log{p(D)}, of the two candidate models discussed in Section 2.5: the imperfect debugging model and the perfect debugging model. The imperfect debugging model has the higher value. This implies a Bayes factor BF = p(D|Imperfect Debug)/p(D|Perfect Debug) well above 100, which according to Kass and Raftery (1995) constitutes decisive support in its favor. Thus, the data support the presence of imperfect debugging, as described in this study, in our numerical example.

Model               log{p(D)}
Perfect Debug       -191.0651
Imperfect Debug     -108.58

Table 1: log{p(D)} under each model
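The Bayes factor follows directly from Table 1 as the difference of the two log marginal likelihoods. Exponentiating that difference gives a number of order 10^36, so the Python sketch below stays on the log scale. It converts to log base 10, since a Bayes factor above 100 corresponds to log10 BF > 2, the "decisive" category of Jeffreys' scale as tabulated by Kass and Raftery (1995). Only the two values from Table 1 are used.

```python
import math

# Log marginal likelihoods log{p(D)} from Table 1.
log_p_perfect = -191.0651
log_p_imperfect = -108.58

# log BF = log p(D | Imperfect Debug) - log p(D | Perfect Debug).
log_bf = log_p_imperfect - log_p_perfect
# Convert from natural log to log base 10.
log10_bf = log_bf / math.log(10)

print(f"log BF   = {log_bf:.4f}")    # 82.4851
print(f"log10 BF = {log10_bf:.2f}")  # 35.82
print("decisive (BF > 100)" if log10_bf > 2 else "not decisive")
```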

Another attractive feature of the Bayesian approach is the computation of the predictive reliability function. We obtained the predictive reliability functions for the last three debugging stages (29, 30 and 31), shown in Figure 8, using

$$R(t_i \mid D^{(i-1)}) = \frac{1}{S}\sum_{j=1}^{S} \exp\left\{-t_i \, \big[\text{fault detection rate per fault}\big]_i^{(j)} \, \big[\text{number of faults}\big]_i^{(j)}\right\}, \qquad (3.1)$$

for i = 29, . . . , 31, where the superscript (j) indexes the S posterior samples. The posterior samples of the two stage-i factors are obtained using straight Monte Carlo from their evolution equations, starting from the corresponding (j)th samples at stage i-1. These are (2.3) for the fault detection rate per fault and (2.6) for the number of faults. These predictive reliability functions can easily be used as part of a larger optimal software release problem.
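Equation (3.1) is a plain Monte Carlo average over posterior draws. The Python sketch below shows that computation. It assumes that the S paired posterior samples of the fault detection rate per fault and of the number of faults at stage i are already available as lists, obtained via (2.3) and (2.6). The function name and the toy draws are hypothetical. The Gamma and uniform draws only stand in for real MCMC output and are not the paper's posterior.

```python
import math
import random


def predictive_reliability(t, rate_per_fault_draws, n_faults_draws):
    """Monte Carlo estimate of R(t | D) in the form of (3.1).

    Averages exp(-t * rate * n) over paired posterior draws, where rate is a
    draw of the fault detection rate per fault and n a draw of the number of
    faults for the stage being predicted.
    """
    if len(rate_per_fault_draws) != len(n_faults_draws):
        raise ValueError("draws must be paired")
    s = len(rate_per_fault_draws)
    total = sum(math.exp(-t * r * n)
                for r, n in zip(rate_per_fault_draws, n_faults_draws))
    return total / s


# Toy draws for illustration only (not taken from the paper's posterior).
rng = random.Random(2011)
rate_draws = [rng.gammavariate(2.0, 0.002) for _ in range(5000)]
n_draws = [rng.randint(0, 10) for _ in range(5000)]

for t in (0, 10, 25, 50):
    print(t, round(predictive_reliability(t, rate_draws, n_draws), 3))
```

Each posterior draw contributes an exponential survival curve, so the estimate is a mixture of exponentials. This is why the predictive reliability function can decay more slowly in the tail than any single exponential fitted at the posterior mean.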

Figure 8: Predictive reliability functions R(t|D), plotted for t from 0 to 50, for i = 29, . . . , 31 for the imperfect debugging model (first panel: Predictive Reliability Function After 28 Stages of Testing)

4 Concluding Remarks

In this study, we considered Bayesian analysis of the inter-failure times typically observed in software reliability. We described our uncertainty about the relevant model parameters via probabilities, following a Bayesian point of view. This allowed us to sequentially update the model parameters as well as relevant predictive distributions, such as the predictive reliability function, which can be used to determine the optimal time to release software. We focused on modeling latent factors such as the fault detection rate per fault and the total number of faults. We introduced a model that can account for imperfect debugging behavior under certain conditions. This phenomenon is typically observed in software reliability applications; however, analyses of it are scarce in the literature. Furthermore, we investigated the existence of perfect vs. imperfect debugging by comparing the in-sample fit of both models, and found decisive evidence in favor of the imperfect debugging model.

The Bayesian approach has many features that would be of interest to software reliability practitioners. It provides a coherent framework for making decisions under uncertainty and can therefore easily be incorporated into optimal software release strategies. It also allows expert knowledge to be incorporated via the prior distributions of the relevant model parameters, as emphasized by Washburn (2006). In addition, practitioners would benefit from the straightforward way it handles sequential updating of model parameters as testing proceeds and new information about software failures becomes available.

We believe that our proposed model could be extended in a couple of areas. One of its main assumptions is that only one fault can be removed or introduced during each testing stage. This can be relaxed by replacing the current Bernoulli setup with a Markov chain type of structure on the number of bugs repaired or introduced during the debugging stage. Unfortunately, such a structure would considerably complicate the estimation procedure, and it is left for a future companion study. Another potential extension is to introduce a state space evolution of the coefficient in the power law relationship between the inter-failure times. In other words, the coefficient would follow a time-dependent dynamic structure that is sequentially updated after each testing stage as new inter-failure time data are observed.


References

Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434-455.

Campodonico, S. and Singpurwalla, N. D. (1995). Inference and predictions from Poisson point processes incorporating expert knowledge. Journal of the American Statistical Association, 90(429):220-226.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327-335.

Dalal, S. R. and Mallows, C. L. (1988). When should one stop testing software? Journal of the American Statistical Association, 83(403):872-879.

Goel, A. L. and Okumoto, K. (1979). Time-dependent error detection rate model for software reliability and other performance measures. Reliability, IEEE Transactions on, 28:206-211.

Jelinski, Z. and Moranda, P. (1972). Software reliability research. Statistical Computer Performance Evaluation, pages 465-484.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773-795.

Kuo, L. and Yang, T. Y. (1995). Bayesian computation of software reliability. Journal of Computational and Graphical Statistics, 4(1):65-82.

Kuo, L. and Yang, T. Y. (1996). Bayesian computation for nonhomogeneous Poisson processes in software reliability. Journal of the American Statistical Association, 91(434):763-773.

Lindley, D. V. (1990). The present position in Bayesian statistics. Statistical Science, 5(1):44-89.

Littlewood, B. and Verrall, J. L. (1973). A Bayesian reliability growth model for computer software. Applied Statistics, 22(3):332-346.

Morali, N. and Soyer, R. (2003). Optimal stopping in software reliability. Naval Research Logistics, 50(1):88-104.


Musa, J. D. and Okumoto, K. (1984). A logarithmic Poisson execution time model for software reliability measurement. Proceedings of the 7th International Conference on Software Engineering, Orlando, Florida, pages 230-238.

Ozekici, S., Altinel, I. K., and Ozcelikyurek, S. (2000). Testing software with an operational profile. Naval Research Logistics, 47:620-634.

Ozekici, S. and Soyer, R. (2001). Bayesian testing strategies for software with an operational profile. Naval Research Logistics, 48:747-763.

Ozekici, S. and Soyer, R. (2003). Reliability of software with an operational profile. European Journal of Operational Research, 149:459-474.

Ruggeri, F., Pievatolo, A., and Soyer, R. (2010). A Bayesian hidden Markov model for imperfect debugging. Under review.

Singpurwalla, N. (1995). Survival in dynamic environments. Statistical Science, 10(1):86-103.

Singpurwalla, N. and Soyer, R. (1992). Non-homogeneous autoregressive processes for tracking (software) reliability growth, and their Bayesian analysis. Journal of the Royal Statistical Society: Series B (Methodological), 54(1):145-156.

Singpurwalla, N. and Wilson, S. (1994). Software reliability modeling. International Statistical Review, 62(3):289-317.

Singpurwalla, N. and Wilson, S. (1999). Statistical Methods in Software Engineering: Reliability and Risk. Springer.

Smith, A. F. M. and Gelfand, A. E. (1992). Bayesian statistics without tears: A sampling-resampling perspective. The American Statistician, 46(2):84-88.

Soyer, R. and Mazzuchi, T. A. (1988). A Bayes empirical Bayes model for software reliability. Reliability, IEEE Transactions on, 37(2):248-254.

Washburn, A. (2006). A sequential Bayesian generalization of the Jelinski-Moranda software reliability model. Naval Research Logistics, 53(4):354-362.

Xie, M. (1991). Software Reliability Modeling. World Scientific Publisher, Singapore.
