Statistical analysis of nonstationary software metrics



Information and Software Technology 39 (1997) 363-373

K. Pillai*, V.S.S. Nair

Department of Computer Science and Engineering, SIC, Southern Methodist University, 6425 Airline Road, Dallas, TX 75275, USA

* Corresponding author. E-mail: [email protected]

Received 24 July 1996; accepted 5 October 1996

Abstract

Prediction, estimation, and assessment of software process attributes form an integral part of process management. Process modeling is a quantitative and systematic approach to gauging such critical project parameters. However, process modeling relies heavily on idealizations. Mathematical compromises necessitate continuous calibration of such models to maintain some level of accuracy. This is because of the nature of software measurement data, most of which is nonstationary and thus requires special treatment. A methodology for analyzing such data is presented in this paper. It is shown that measures based on time averages are insufficient for representing data of this nature. The improved representativeness of ensemble-based measures over those based on time is demonstrated. A model for LOC generation, based on ensemble averaging, is presented in this paper. However, the process of validating such a model involves testing the model against a range of unique inputs, and exhaustive testing of process models is generally severely constrained by the lack of sufficient input data sets generated under the conditions imposed by the modeling approach. A method employing the random walk, by which an ensemble can be generated from a single time record on the basis of certain invariants, is provided. This method of generating an ensemble is shown to be useful, especially in situations where certain properties of the proprietary data are known, but sufficient quantities of the data are inaccessible to the analyst. The ensemble computed in this manner is then used to derive a time-dependent model that is more representative of the real-life entity being modeled. © 1997 Elsevier Science B.V.

Keywords: Metrics; Ensemble; Nonstationary

1. Introduction

Software configuration management (CM), or the art of managing the evolution of software, has conventionally been the exclusive domain of the expert. Until now, controlling the rate of development of a product has been solely a function of the competence of the person responsible, and some managers simply handled it better than others. But in recent years, considerable progress has been made towards measuring, analyzing, controlling and optimizing the various attributes that contribute to successful CM. An important preliminary step to CM is reliable process modeling (PM). A faithful model of the developmental activity is an important prerequisite to effectively tracking its growth, or predicting its eventual ruin. Even though PM cannot replace the expert as such, it does form an important supplement to expert opinion. However, it is not always possible for a model to be a faithful replica of the real-life process, simply because of the sheer complexity of the problem being modeled. The common work-around is to use idealizations, and then design the model to have the innate ability to compensate for such theoretical compromises through some sort of feedback mechanism. The accuracy of a model is thus considerably improved by a continual process of tuning or calibration.

The method of calibrating a process can be broken down into two essential steps:

1. the validation and verification of the nature of input data to the model,

2. recalibration of the coefficients of the model to track the changing environment.

The first step is essential for the proper mathematical analysis of the model. Mathematical rigor helps to avoid the generation of misleading information and to gauge the goodness of a model. Conventional statistical methods of analysis do yield considerable information on the nature of input data [4]. However, simple statistical measures that fail to capture the innate stochasticity of the process do not contribute significant insights into the dynamic behavior of the model. If a process has to be time-tracked as it evolves, an analytic time-domain representation of the input data becomes a necessity. Furthermore, built-in calibration mechanisms raise issues of stability and convergence of the model. These issues can be addressed easily with a model that represents the instantaneous state of the process. Thus, the first step to being able to generate an adaptive model that dynamically represents an ambient environment is to mathematically represent, classify, and analyze the inputs to the model.

1.1. Motivation

Cost and schedule estimation tools used in industry most often fail to live up to expected accuracy levels. In spite of this, such tools are extensively used for the lack of a better approach to solving the estimation problem. However, studies have shown [12] that generic cost estimation models can be useful only if they are calibrated to maintain some level of accuracy. It is clear from efforts at calibrating such estimation tools [9] that a one-shot calibration of a model for an application domain does not suffice. This is mainly because models are usually over-simplified representations of the real-life problem. Several approaches have been adopted in the past to handle the nonlinearities present in the real-life entity being modeled. For example, software maintenance task effort prediction models have been implemented using methods such as pattern recognition, regression, and neural networks, with varying levels of success [10].

It is a well-known fact that almost every function encountered in the real world can be expressed as a discrete or continuous sum of a variety of exponential functions. The exponential function has the advantage of not changing its form under integration or differentiation. Intuitively, for a linear time-invariant system, the response to an exponential function $e^{st}$ is also an exponential function of sorts. In other words, "the eigenfunctions of linear time-invariant operators are exponentials" [13]. This is one reason why most empirical models, such as COCOMO [2] or the Putnam model [15], can approximate complex relations and functions using simple exponential terms over certain ranges. But this approximation does not offer a close fit for an expansive domain of input values.

Studies have shown that the effort estimates computed using such empirical models are usually off the mark by 40-70% [6]. The models are naturally idealizations, and real-life nonlinearity can make such models yield highly inaccurate estimates over time. Recalibration of the model as and when data becomes available is of course a solution, and it involves visualizing the model as a cybernetic one [18]. The process of calibration requires a corrective feedback of the system performance, as shown in Fig. 1.

Fig. 1. The cybernetic model (a generic model tuned by an adaptation algorithm driven by performance evaluation of the output and other indicators).

PM is further complicated by the fact that it deals with a class of problems that change with time. Closed-loop adaptation [19] offers a highly effective solution for models that represent nonlinear or time-variable systems. In situations where model coefficients are variable or inaccurately known, closed-loop adaptation has the ability to find the best coefficient values for the model. However, the nature of the data input to the model strictly governs the choice of the algorithm that drives the adaptation loop. The statistical nature of the inputs of a model further governs the manner in which it can be mathematically analyzed. As such, classification of measurement data is an important and necessary prerequisite to successfully deriving a model that can be tuned at will. The more representative the model, the simpler the tuning mechanism would be.

A classification of software measurement data, based on its mathematical behavior, is provided in Section 2. A hypothetical model for code generation and the preliminary steps involved in generating a mathematical model are provided to illustrate the use of such methods in this paper. A methodology for generating an ensemble from a minimal data set is provided in Section 3. Section 4 addresses future directions and applications in this area.

2. Characterization of software measurement data

The attributes that one wishes to measure in software engineering can be broadly classified [5] into three groups:

• Products, or tangibles that are generated as the result of an activity.

• Processes, or time-dependent attributes of the developmental activity.

• Resources, or the inputs to the process under analysis.

A conventional classification of measurable data is shown in Fig. 2. All physical phenomena can be classified [1] into one or the other of these mutually exclusive classes for the purpose of analysis.

Fig. 2. Classification of measured data.

A process of elimination can be used to ascertain the general category to which such software measures belong. To begin with, the domain of interest excludes measures of a periodic nature, since such measures are totally predictable, and hence easily determinable without the need for a complex model. The measures of interest are therefore of a nonperiodic nature. In addition to being nonperiodic, most measures of human effort and productivity are also random in nature.

Axiom 1. If a situation that generates the data set in question can be re-instantiated an infinite number of times to generate the same data, then the data can be considered to be deterministic. If the data set changes with different instantiations of the situation, then it should be considered random in nature.

Software measurement data cannot be easily defined by a simple mathematical relationship because each observed phenomenon is, to a certain degree, unique. In other words, one out of several possible outcomes occurs each time a measurement is conducted. This obviously puts software measurement data in the category of random data. Random data can, however, be further categorized, based on its statistical behavior, as being stationary or otherwise.

Certain terms used in conventional mathematics are explained in the following subsection. This terminology serves well to describe the attributes of random data.

2.1. Basic statistical functions

Each measurement of random data yields a sample record when carried out over a finite interval of time. The function that represents the random phenomenon in time is termed a sample function. Finally, a random process or stochastic process is defined as the set of all possible sample functions that the observed phenomenon can yield over a period of time.

It is possible to represent the properties of such stochastic processes in terms of their statistics, evaluated at some point in time. Consider a hypothetical situation where a specification for a program is made available to a (highly productive) development team, working toward realizing programs that are functionally complete. Consider lines of code (LOCs) generated per day as the metric under observation. Suppose the product is delivered in 100 days. A plot of three possible code generation profiles for the team is shown in Fig. 3. It is clear that a constant expected value of the measured data does not capture the variations of the average value with time. But there does exist a mathematical entity that can represent the time-dependent expected value of the data set; let it be denoted by $\mu_x(t)$. A collection of sample functions such as those shown in Fig. 3 is termed an ensemble and is denoted by $\{x(t)\}$. An ensemble, in turn, can be represented as $\{x_1(t), x_2(t), \ldots, x_i(t), \ldots, x_N(t)\}$.

Fig. 3. Typical software measurement data (LOCs per day over 100 days for three sample functions).

A commonly used statistic to represent any random process is the first moment, or mean value. The mean value can be computed as the time average of a large number of observations made over an extended period of time. This yields a constant number, independent of time. The mean calculated in this manner is not represented as a function of time, and hence is not suitable for tracking the progress of a project. The ensemble average, on the other hand, is evaluated from observations made over similar systems at the same instant. The ensemble average is a time-dependent function and can be used as an effective mathematical model of the process. Generally, the ensemble average is not the same as the time average. They are interchangeable if and only if the data set under analysis satisfies certain mathematical conditions [1], which will be discussed shortly. The ensemble average is calculated across a set of sample functions; for an ensemble consisting of N sample functions, the first moment can be stated as:

$$\mu_x(t_1) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} x_k(t_1). \qquad (1)$$

Yet another important parameter that yields considerable information on the nature of the data being observed is the autocorrelation function. This is calculated by computing the ensemble average of the product of the measure at two different instants, in this case $t_1$ and $t_1 + \Delta$. The autocorrelation function is given by eqn (2):

$$R_x(t_1, t_1 + \Delta) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} x_k(t_1)\, x_k(t_1 + \Delta). \qquad (2)$$

The autocorrelation function gives a measure of the statistical dependence between measurements made at $t_1$ and those made at $t_1 + \Delta$. The importance of the autocorrelation function and the ensemble average is that they provide insights into two important properties of random data, namely ergodicity and stationarity. Ensemble averages and time averages are interchangeable only if the random process satisfies both these properties. Measures of human effort and productivity are nonstationary in nature and as such cannot be represented by time averages. In addition, most models assume stochastic independence between collected data points for the sake of simplicity. Computing the autocorrelation helps in getting a handle on the degree to which this assumption is valid for that data set.
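As an illustration of how eqns (1) and (2) are evaluated in practice, the following sketch computes finite-N estimates of the ensemble mean and autocorrelation from an ensemble stored as an (N, T) array. The synthetic profiles, array names, and lag value are illustrative assumptions, not data or code from the paper.

```python
import numpy as np

def ensemble_mean(ensemble: np.ndarray) -> np.ndarray:
    """Finite-N version of eqn (1): average across the sample functions
    at each instant. `ensemble` has shape (N, T): N sample functions,
    T time points."""
    return ensemble.mean(axis=0)

def ensemble_autocorrelation(ensemble: np.ndarray, lag: int) -> np.ndarray:
    """Finite-N version of eqn (2): average of x_k(t) * x_k(t + lag)
    across the N sample functions, for every t where t + lag exists."""
    n_funcs, n_times = ensemble.shape
    products = ensemble[:, : n_times - lag] * ensemble[:, lag:]
    return products.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical ensemble: 10 noisy LOC-per-day profiles over 100 days.
    base = 40 + 30 * np.sin(np.linspace(0, np.pi, 100))
    ensemble = base + rng.normal(scale=10, size=(10, 100))
    mu_hat = ensemble_mean(ensemble)                # time-dependent mean, eqn (1)
    r_hat = ensemble_autocorrelation(ensemble, 5)   # R_x(t, t + 5), eqn (2)
    print(mu_hat[:5], r_hat[:5])
```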

2.2. Considerations of ergodicity

Consider a development team generating code over the duration of a project (Fig. 3). In this example, an average code generation rate for the team can be easily ascertained from accumulated data. But this measure cannot be applied to the process of development in a generic manner, across different instances, unless the requirements for ergodicity are met. The mean value and autocorrelation for the kth sample function are given by:

$$\mu_x(k) = \lim_{T \to \infty} \frac{1}{T} \int_0^T x_k(t)\, dt, \qquad (3)$$

$$R_x(k, \Delta) = \lim_{T \to \infty} \frac{1}{T} \int_0^T x_k(t)\, x_k(t + \Delta)\, dt. \qquad (4)$$

The measurement data is said to be ergodic if the mean value and autocorrelation functions do not differ across the various sample functions in the hypothetical experiment. A large class of software measurement data is generally nonergodic. As a result, models that make use of time-based statistical estimates to simulate varying data will deviate from the actual trend that the project takes. Recalibration becomes a necessity due to the lack of ergodicity of the measured data.
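A rough, illustrative way to probe the ergodicity requirement is to evaluate the time average of eqn (3) separately for each sample function and examine how much the per-function averages disagree; a large spread suggests nonergodic data for which ensemble-based measures are needed. The ensemble layout, tolerance, and test data below are assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def time_averages(ensemble: np.ndarray) -> np.ndarray:
    """Discrete analogue of eqn (3): one time average per sample function."""
    return ensemble.mean(axis=1)

def looks_ergodic(ensemble: np.ndarray, tolerance: float) -> bool:
    """Crude check: ergodicity requires the per-function time averages to
    agree with each other (and with the ensemble mean) to within tolerance."""
    averages = time_averages(ensemble)
    return np.ptp(averages) < tolerance   # peak-to-peak spread of the averages

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    drifting = np.cumsum(rng.normal(size=(10, 100)), axis=1)  # nonstationary walks
    print(time_averages(drifting))
    print(looks_ergodic(drifting, tolerance=1.0))  # typically False
```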

2.3. Considerations of stationarity

A lack of ergodicity further implies an absence of stationarity. A random process is said to be nonstationary if its mean value and autocorrelation functions change with time. It is unlikely that a programmer would consistently generate code at a constant rate, unlike a thermostat that would constantly maintain its set temperature in spite of random variations in the manner in which the ambient temperature is controlled. The thermostat action is a stationary process, as opposed to the activity of a programmer, which is highly unstable. The average productivity of a programmer varies with time, thus making it nonstationary, and thereby nonergodic in the general case. In other words, the thermostat acts to maintain a constant average temperature, while the programmer tries to solve the problem at hand optimally, with little consideration toward maintaining a constant rate of coding. Stationarity and ergodicity can usually be forced on the productivity measures of a development team only through the imposition of extreme, and usually detrimental, measures. Most software measurement data would be reduced to being stationary and ergodic only if the measures of first moment and autocorrelation computed for the data set could be forced to be time invariant. The data set is termed weakly stationary if both the expected value and the autocorrelation function are independent of $t$, the instant in time when the analysis is done. The autocorrelation would, however, be a function of $\Delta$, the time displacement, for a weakly stationary process.

Nonstationary data can be categorized as having one or both of the following properties:

1. average value of the data set changes with time,

2. mean square value changes with time.

The choice of the method of analysis will depend on the category to which the data set belongs. Analysis methodologies generally exploit functions such as the probability density, the autocorrelation, and the power spectral density for ascertaining the statistical behavior of the measured data [1].
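As a minimal sketch of how the two categories can be told apart in practice, the mean and mean square value of a single record can be estimated over consecutive windows and inspected for drift. The window length, the synthetic record, and the Poisson assumption are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def windowed_statistics(record: np.ndarray, window: int):
    """Mean and mean square value of `record` over consecutive,
    non-overlapping windows. A clear trend in either sequence suggests
    the corresponding kind of nonstationarity."""
    usable = len(record) - len(record) % window
    blocks = record[:usable].reshape(-1, window)
    return blocks.mean(axis=1), (blocks ** 2).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # LOC-per-day record whose average rises over the project.
    record = rng.poisson(lam=np.linspace(10, 60, 100)).astype(float)
    means, mean_squares = windowed_statistics(record, window=20)
    print(means)         # drifts upward -> time-varying average value
    print(mean_squares)  # drifts upward -> time-varying mean square value
```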

2.4. Non-Markovian nature of software models

It has been pointed out in the previous section that software measurement data is nonergodic and nonstationary in the general case. A data set can be further thought of as a chain of events, with the outcome of each event depending on the state, or states, that lead to it. But unlike problems in classical mechanics, most software development activities are not just event dependent. The manner in which an event is attained tends to influence the outcome of the next event. For illustrative purposes, consider a snapshot in time of a software development activity being monitored for a simple metric such as LOCs. The rate at which code will be generated, and the time of completion of the project, will be a function of the complexity of the coding process, the clarity of the design, and the suitability of the developmental paradigm. In other words, the manner in which a milestone is achieved is as important a factor as the act of achieving it. The process under consideration is a hereditary process [3], where the whole or part of the past history of the system influences its future state. By definition, Markov chains can be applied only to processes where independent systems subject to the same transition probabilities, if in the same state, will have identical future development. From experience it is well known that this is not the case with software development, or most human endeavors for that matter.

3. Analysis methodology

A mathematical model that faithfully represents an ensemble should be defined in terms of a measure that captures the nonstationary nature of the measured data. The more representative the model, the lesser the need to recalibrate it. Lines of code, for example, is a parameter that has neither a time-invariant mean nor a constant mean square value. Modeling a system that has underlying nonstationarity such as this requires the use of coefficients that go beyond representing a single sample function.

The concept of ensemble averaging can be applied to compute a mathematical model for the KLOCs (thousands of LOCs) generated by a development team. The objective of the model is to represent the expected value of KLOC for the team as a function of time. An ensemble can be generated in two ways. Measurements can either be conducted over similar teams working under conditions identical to some degree, or measurements can be conducted for one team over a long period of time; this large time record can then be broken up to yield an ensemble. However, the issues that accompany data collection and analysis for systems such as this do not generally allow such manipulations, for the following reasons:

• It is not generally feasible to have identical teams active at the same time, working on the same problem, subject to similar financial, temporal, and spatial constraints.

• A single project is an evolving entity and as such the development environment goes through a pattern of changes. Breaking up data collected over the duration of an evolving project into an ensemble would thus violate the requirement that conditions be maintained identical from one sample function to another.

A theoretical approach that generates an ensemble from a single sample function is presented in the next section. Assumptions were made expressly with the intention of neutralizing the two constraints mentioned above. A theoretical validation [11], as opposed to an empirical one, is adopted in this paper. An ensemble is generated from available data, given that the behavior of certain attributes of the measured process is known to the analyst.

3.1. Ensemble generation from a sample function

It is a fact that the work environment as well as developer productivity are highly complex functions that cannot be easily replicated. The basic assumption associated with an ensemble is that the environment hardly changes from one sample function to another.

Given a productivity profile, measured in KLOCs (or any other metric), a mathematical expression can be computed to represent the data set by fitting a curve to the given profile. But generally, such profiles vary from project to project, making the mathematical representation highly restrictive. In other words, since each sample datum in the profile is a random variable capable of taking a range of values, if the data were to be collected again under similar conditions, the resulting profile would be different. However, it is possible in most cases to quote the degree of variation of each data point within a profile. This information can be extracted from empirical data available from previous projects, or gathered from individual developers based on their experience. It basically follows that, in most cases, each data point in the given profile has a certain probability of varying within a certain range. It can now be argued that if the same problem were assigned to N almost identical teams, with the constraint that they should all complete the development process at the same time, the KLOC generation profiles of the teams would, to some extent, show similar trends. Two profiles taken from this set of N would be related to each other, since one can be derived from the other by perturbing each sample datum on the basis of a random process. In other words, given a profile, each data point of the profile can either be increased or decreased by a discrete value, in accordance with some invariant, to yield a series of related profiles. This provides a means of generating N possible unique profiles from a given data set.

Empirical models used by researchers in the past have relied on such simulation techniques for performance validation. One interesting case in point is the approach used by Putnam [16] in capturing the uncertainty associated with applied parameters. The Putnam model uses a Rayleigh curve to describe the variation of the staffing requirements of a project with time. The model accommodates uncertainties or 'risks' in estimates by representing each point on the manpower curve by a tuple composed of its mean and standard deviation, estimated from historical data. Monte Carlo techniques are then used to generate a sequence of random values that satisfy the tuple descriptor, based on a probability distribution function of choice. Management parameters that depend on the manpower curve are then computed using this computer-generated sequence of probable manpower values. The final model represents an average, each point on the curve being the average of a sequence of random numbers drawn from a distribution that has the prescribed mean and standard deviation.
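The following sketch illustrates the kind of per-point Monte Carlo averaging described above. The Rayleigh-shaped curve, the normal sampling distribution, and all parameter values are assumptions made for the example; they are not taken from Putnam's model or from this paper.

```python
import numpy as np

def rayleigh_staffing(t: np.ndarray, scale: float, t_peak: float) -> np.ndarray:
    """Illustrative Rayleigh-shaped manpower curve that peaks at t_peak."""
    return scale * (t / t_peak**2) * np.exp(-t**2 / (2 * t_peak**2))

def monte_carlo_curve(means, stds, draws=1000, rng=None):
    """For each point, draw `draws` values from a distribution with the
    prescribed mean and standard deviation (normal assumed here) and
    average them -- in effect an ensemble average per time point."""
    rng = rng or np.random.default_rng(3)
    samples = rng.normal(loc=means, scale=stds, size=(draws, len(means)))
    return samples.mean(axis=0)

if __name__ == "__main__":
    t = np.arange(1, 101, dtype=float)
    means = rayleigh_staffing(t, scale=2000.0, t_peak=40.0)
    stds = 0.15 * means + 1.0   # assumed per-point uncertainty ("historical data")
    print(monte_carlo_curve(means, stds)[:5])
```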

Conventional Monte Carlo techniques of random number generation yield a random data sequence that is fairly stationary. However, most metrics, such as LOC, are essentially nonstationary, their statistical attributes being time variant. Simulated data should represent this nonstationarity, at least to some extent, for the sake of realism. Generally, Monte Carlo random-variate generation based on a specific cumulative distribution function (CDF) [7] is achieved by first generating a sequence of pseudorandom numbers drawn from a uniform distribution. The inverse function of the CDF corresponding to the chosen numbers is then computed. The CDF chosen can be one of several distributions, whichever is most representative of the process being analyzed. Functions such as the Beta, Poisson, Binomial, Chi-square, or Uniform distributions closely model most natural phenomena. The choice of a distribution is based on experience gained with empirical data. For example, the Beta distribution has been used effectively in the past for modeling job completion times in PERT/CPM. It should, however, be noted that points thus generated are independent of each other. In other words, there is no memory of the previous choices made during a particular selection process. However, it is interesting to note that this averaging process [17] used in the Putnam model is in effect an implementation of ensemble averaging. The resulting profile is the most representative of a group of probable profiles in a least mean square sense, as will be illustrated in Theorem 1 elsewhere in this article.

The highly unstable nature of the measured data sample can be modeled more realistically using a stable stochastic process. A better approximation of the nature of such data can be achieved to a certain extent by allowing subsequently generated sample points to execute an unrestricted one-dimensional random walk [3] around the measured data value. Random walks, unlike conventional Monte Carlo, are generated on the basis of Bernoulli processes. They tend to be highly unstable over short ranges and provide a good simulation of the nonstationary nature of metrics encountered in real life. If disjoint time intervals are taken from a simple random walk sequence, they are seen to have the same identical distribution. This implies stationarity, but the parameters associated with the random walk can be easily modified to generate a sequence whose distribution varies with time. Stationary random walks average out over long periods of time, but they can be used to generate highly unstable sequences by forcing their mean values to drift in a controlled manner. The choice of the probability p of change, the drift, and the amount by which it changes would decide the degree to which computationally generated new profiles vary from the original profile.

Consider a single data sample acquired at instant T. Fig. 4 shows the variation of this single data point across profiles, subject to a simulated one-dimensional unrestricted random walk. The value of each point $x_i(t)$ at instant T, across profiles (a range of i), is calculated by either adding one or subtracting one randomly from its previous value. It can be seen that the resulting variation simulates a highly unstable process. Each point in the original data set can be subjected to similar mathematical manipulation to yield any number of related profiles.


Fig. 4. Random walk with zero mean (value of a single data point plotted against profile number).

A well-designed CM system [14] can easily acquire the lines of code generated by a team, or even individual developers, over the complete duration of a project. A typical profile consisting of LOCs collected over a period of 100 calendar days is shown in Fig. 5 ($x_1(t)$). Consider each data point of the profile $x_1(t)$ to be the outcome of a trial. Let p be the probability that l lines of code are added, and (1 - p) the chance of l lines being removed, at the end of the day. The parameter p can be denoted simply by:

$$p = \frac{y}{n}, \qquad (5)$$

where n is the total number of observed events, and y the number of outcomes where the LOC increased by l.

Fig. 5. The ensemble average (LOCs versus calendar time in days).

The data points $x_i(t)$ corresponding to each sample function can now be thought of as the outcome of a sequence of Bernoulli trials, superposed on $x_{i-1}(t)$. If the addition of l lines of code is denoted by a Boolean k = 1, and the removal of l lines by k = 0, the outcome of each trial can be represented by the relation:

$$x_i(t) = x_{i-1}(t) + 2lk - l, \qquad i = 2, \ldots, N. \qquad (6)$$

It is interesting to note that eqn (6) generates a process that seems highly unstable, in spite of the fact that it is generated by means of a stable Bernoulli process. The resulting process is naturally a Markov process, since the future values of the generated samples, given their present values, are independent of the past. Thus, a fairly realistic ensemble can be statistically generated from a single data set, provided some attributes of the team and the work environment are known.

Fig. 5 shows a measured data record $x_1(t)$. The other nine sample functions were generated using a random walk executed about $x_1(t)$, based on Bernoulli trials. The step variation parameter l was arbitrarily chosen to be 5.

The example shown above generates a sequence that is stationary and serves to illustrate the basic methodology involved. By controlling the drift value of the mean, and by changing the probabilities involved, a wide range of profiles can be represented using this approach. Furthermore, the random walk approach to generating a sequence of numbers is computationally less complex than the Monte Carlo techniques used in previous works [16].
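A minimal sketch of the ensemble-generation step of eqns (5) and (6) is given below. The measured profile is synthetic, and the choices p = 0.5 and l = 5 mirror the zero-drift example above; none of the numbers are data from the paper.

```python
import numpy as np

def generate_ensemble(measured: np.ndarray, n_profiles: int,
                      p: float = 0.5, step: int = 5, rng=None) -> np.ndarray:
    """Generate an ensemble from a single measured record using eqn (6):
    x_i(t) = x_{i-1}(t) + 2*l*k - l, with k ~ Bernoulli(p) and step l.
    Row 0 is the measured record; rows 1..N-1 are random-walk perturbations.
    In practice p could be estimated from historical data as y/n, eqn (5)."""
    rng = rng or np.random.default_rng(4)
    ensemble = np.empty((n_profiles, len(measured)), dtype=float)
    ensemble[0] = measured
    for i in range(1, n_profiles):
        k = rng.random(len(measured)) < p          # one Bernoulli trial per day
        ensemble[i] = ensemble[i - 1] + 2 * step * k - step
    return ensemble

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    measured = rng.poisson(lam=40, size=100).astype(float)   # x_1(t): LOC per day
    ensemble = generate_ensemble(measured, n_profiles=10)
    print(ensemble.mean(axis=0)[:5])   # time-dependent average of the profiles
```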

3.2. A model for the expected value

A model for the expected value of the metric can now be derived for the general case. Assume that N - 1 data records were generated statistically from a single measured data record for the project in question. Let each LOC measurement be denoted by $x_i(t)$, as a function of time. The next step is to find the average of $x_i(t)$ over the N different sample functions at the same instant t. The ensemble average (Fig. 5) can be stated as:

$$\hat{\mu}_x(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t). \qquad (7)$$

If an ensemble of a very large number ($\gg N$) of sample functions is considered, then the ensemble average will depend on the choice of the N sample functions from the set $\{x_i(t)\}$. If the estimate varies with the choice of the sample functions, then it becomes necessary to ascertain the expected value of the ensemble average. Since the sample function under consideration is nonstationary, it has a time-varying mean $\mu_x(t)$. The expected value is given by:

$$E[\hat{\mu}_x(t)] = \frac{1}{N} \sum_{i=1}^{N} E[x_i(t)] = \mu_x(t), \qquad (8)$$

where

$$\mu_x(t) = E[x_i(t)]. \qquad (9)$$

From the above equations, the expected value of $\hat{\mu}_x(t)$ is an estimate of $\mu_x(t)$, and it is independent of N. For a given profile, the least mean square error results if deviations are calculated about its time average for each data point. But for a nonstationary process such as software productivity, subsequently acquired profiles will have different average values. As a result, the mean of one profile may not offer the least mean square error for another profile. The ensemble average, on the other hand, compensates for the unstable nature of the mean of a set of profiles by taking more than one profile into account. It can be shown that the ensemble average offers the minimum variance for any given set of profiles.

Theorem 1. The profile represented by the ensemble average of a set of profiles {s} will always offer the minimum variance from the members of the set {s} at any instant $\tau$, irrespective of the difference between pairs of profiles of the set {s}.

Proof. Consider an ensemble denoted by {s}, of cardinality N, such that its member profiles were generated using eqn (6) with an arbitrary value of l. Let $\lambda(t)$ denote the most representative profile of the ensemble, such that it offers the minimum error from any member of the set {s}. The term error in this context can be defined as the Euclidean distance between a given point on any profile and the corresponding point on $\lambda(t)$ at any given instant. Mathematically, this error is represented by the square root of the variance at any instant.

Consider a point $x_i(\tau)$, sampled at instant $\tau$, from a profile denoted by $s_i$. Then $\{x_1(\tau), x_2(\tau), \ldots, x_i(\tau), \ldots, x_N(\tau)\}$ would represent the set of sample data points at instant $\tau$, acquired across the N profiles $\{s_1, s_2, \ldots, s_N\}$. Let $\hat{\mu}_x(\tau)$ be the ensemble average computed using eqn (7) at instant $\tau$. The variance $\sigma^2$, at any instant $\tau$, of the set about any arbitrary constant $\lambda$ can be expressed as:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left( x_i(\tau) - \lambda(\tau) \right)^2. \qquad (10)$$

The absolute minimum of $\sigma^2$ will be given by the value of $\lambda(\tau)$ for which the derivative of eqn (10) is zero, with a positive second derivative. Differentiating eqn (10) with respect to $\lambda(\tau)$ and setting the left hand side to zero, we get:

$$0 = \frac{1}{N} \left[ \sum_{i=1}^{N} 2\lambda(\tau) - \sum_{i=1}^{N} 2 x_i(\tau) \right]. \qquad (11)$$

Solving for $\lambda(\tau)$, we get:

$$\lambda(\tau) = \frac{1}{N} \sum_{i=1}^{N} x_i(\tau) = \hat{\mu}_x(\tau). \qquad (12)$$

The second derivative of eqn (10) is always a positive non-zero constant, which indicates that the turning value is a minimum. Since the ensemble average offers the minimum variance at any instant $\tau$, it follows that the ensemble average is the most representative curve for a given set of profiles. □
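The minimum-variance property established by Theorem 1 is easy to confirm numerically: at a fixed instant, the mean squared deviation of the sample points from a constant is smallest when that constant is the ensemble average. The synthetic values below are purely illustrative.

```python
import numpy as np

def mean_squared_deviation(values_at_tau: np.ndarray, lam: float) -> float:
    """Eqn (10): variance of the points at instant tau about a constant lam."""
    return float(np.mean((values_at_tau - lam) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    values_at_tau = rng.normal(loc=55.0, scale=12.0, size=1000)   # x_i(tau)
    ensemble_avg = values_at_tau.mean()                           # eqn (12)
    best = mean_squared_deviation(values_at_tau, ensemble_avg)
    # Any other candidate constant gives a strictly larger deviation:
    for offset in (-5.0, -1.0, 1.0, 5.0):
        assert mean_squared_deviation(values_at_tau, ensemble_avg + offset) > best
    print("ensemble average minimizes the variance:", best)
```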

It is obvious that the ensemble average provides a better statistic than the time average for the following reasons:

• The measured data, being nonstationary, has a time-variant average. The time average calculated from one profile may not offer the least variance for subsequent measurements of the same attribute, owing to its nonstationary nature.

• The ensemble average, on the other hand, takes into account a range of possible outcomes, as a result of which it offers a minimum variance even for subsequent measurements of the attribute.

Most statistics that are defined for a sample function can be evaluated for an ensemble. The next step is to arrive at a fairly reliable representative measure of the mean square value of the ensemble under analysis. The approach is similar to the one adopted for ascertaining the mean. The mean square measurement for the ensemble is given by the relationship:

$$\hat{\Psi}_x(t) = \frac{1}{N} \sum_{i=1}^{N} x_i^2(t). \qquad (13)$$

As in the case of the ensemble average, let the time-variant mean square value of the software entity being measured be denoted by $\Psi_x(t)$. The ensemble attribute $\hat{\Psi}_x(t)$ is also an unbiased estimate of the mean square value of the nonstationary process under analysis. The expected value at any time t is given by:

$$E[\hat{\Psi}_x(t)] = \frac{1}{N} \sum_{i=1}^{N} E[x_i^2(t)] = \Psi_x(t). \qquad (14)$$

The ensemble can now be truly represented [1] in terms of $\hat{\mu}_x(t)$ and $\hat{\Psi}_x(t)$ as a deviation measure $y_i(t)$, given by:

$$y_i(t) = \frac{x_i(t) - \hat{\mu}_x(t)}{\sqrt{\hat{\Psi}_x(t) - \hat{\mu}_x^2(t)}}, \qquad i = 1, 2, \ldots, N. \qquad (15)$$

The ensemble $\{x_i(t)\}$ can now be represented by transposing eqn (15) to obtain a representative mathematical model:

$$x_i(t) = \sqrt{\hat{\Psi}_x(t) - \hat{\mu}_x^2(t)}\; y_i(t) + \hat{\mu}_x(t). \qquad (16)$$

Eqn (16) is a function of time and can serve as the basic foundation for generating a cybernetic model (Fig. 1) of the development process. Similar models are used extensively in the field of random signal processing. The model captures both the time dependence and the inherent stochasticity of the process. The ultimate aim is to arrive at a modeling approach that is absolutely generic. This implies the portability of results across time. Productivity measures, for example, usually treat time as the independent variable. Thus, the manner in which time is sampled becomes an important factor.
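The quantities in eqns (13)-(16) can be computed directly from an (N, T) ensemble array. The sketch below follows the reconstructed forms given above, normalizing the deviation measure by the ensemble standard deviation, and should be read as an illustration of the idea rather than a definitive transcription of the original equations.

```python
import numpy as np

def ensemble_model(ensemble: np.ndarray):
    """Return (mu_hat, psi_hat, y) for an (N, T) ensemble:
    mu_hat[t]  -- ensemble average, eqn (7)
    psi_hat[t] -- ensemble mean square value, eqn (13)
    y[i, t]    -- deviation measure, eqn (15):
                  (x_i(t) - mu_hat(t)) / sqrt(psi_hat(t) - mu_hat(t)**2)"""
    mu_hat = ensemble.mean(axis=0)
    psi_hat = (ensemble ** 2).mean(axis=0)
    spread = np.sqrt(psi_hat - mu_hat ** 2)
    y = (ensemble - mu_hat) / spread
    return mu_hat, psi_hat, y

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    ensemble = 40 + 10 * rng.standard_normal((10, 100))
    mu_hat, psi_hat, y = ensemble_model(ensemble)
    # Eqn (16): each sample function is recovered from the deviation measure.
    reconstructed = np.sqrt(psi_hat - mu_hat ** 2) * y + mu_hat
    print(np.allclose(reconstructed, ensemble))   # True
```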

3.3. Measurement of time

Most software measurement data, such as measures of productivity, show obvious time-dependent cycles. Translating such measures across projects requires caution, since the effect of such time-dependent cycles can force results that cannot be applied across projects that have unique time scales. The example shown in Fig. 5 depicts LOCs plotted versus calendar time. Calendar time, however, has little bearing on the actual programming time of a developer. Typically, data collected on the basis of calendar time goes through distinctive cycles, with productivity troughs perhaps coincident with weekends or other holidays. A profile such as this will have to be synchronized in time with a new project before it can be used as an effective fine-grained time-domain model for that project. But the start time of a project is ruled by constraints much more rigid than the express necessity for modeling. This makes a profile that is unique in time highly unsuitable for application to any project. As a result, the scope of the model becomes highly restrictive if calendar time is the basis for collecting data.

An alternative approach would be to collect data based on the actual time the developer spends on generating code. Such active-time data can be gathered by recording the time actually spent by the developer using his or her editor or CASE tool, subject to additional constraints on idle time. The result would be a profile that is fairly independent of calendar time. Such active-time based data can be applied to various projects without having to take timing variations into consideration.

Fig. 6. (a) Typical measured data; (b), (c) profiles generated by perturbing (a) with the random walk.


A typical profile collected on the basis of active time would have fewer fluctuations, as shown in Fig. 6(a). Start-off is generally slow-paced, but the rate of code generation speeds up with time and then tapers off as the developer completes his or her assignment. The programmer spends more time contemplating his or her approach during the initial phase of the implementation process. Coding then increases, peaks, and tails off when the programmer finally gets around to validating the implementation.


3.4. A model for the cumulative ensemble average

Each data point in the profile shown in Fig. 6(a) was subjected to a random walk perturbation. One thousand profiles were generated using eqn (6), with an l value of 5. Two such profiles are shown in Fig. 6(b) and Fig. 6(c). The cumulative LOC graph (ogive) was then generated for the ensemble average.

Fig. 7 shows the cumulative $\hat{\mu}_x$, with the cumulative LOCs for profile 1 and profile 2 superimposed. It can be seen that the ensemble average tracks the profile average closely, as opposed to the time average. Time average calculations assume that the average value is distributed uniformly across the domain. As a result, the cumulative graph yields a simple straight line, which hardly captures the time-domain variation of the average value. The ensemble average, on the other hand, strikes a mean between different profiles, providing a higher degree of definition.

Standard curve fitting techniques can now be applied to the cumulative ensemble curve. Fig. 8 shows a fifth degree polynomial which was fitted to the cumulative LOC curve. Calculating the cumulative curve has the effect of removing fast variations from the initial profile, thereby making it easier to model the curve with a higher confidence factor. The model for the curve is shown in eqn (17). It is seen that a fifth degree polynomial offers a very close fit to the data:

$$y(x) = -3.0248 \times 10^{-7} x^5 + 4.4569 \times 10^{-5} x^4 - 0.0012 x^3 + 0.0139 x^2 + 4.1206 x - 3.0180. \qquad (17)$$

Fig. 7. Comparison of cumulative graphs for different profiles (cumulative LOC versus active time in days).

Fig. 8. Data fitted with a fifth degree polynomial.

Attributes of the curve, such as its slope or its second derivative, can be used to track the progress of the project as it evolves. However, the search space for the optimum value of the coefficients will have as many dimensions as there are parameters to be tuned. This makes repeated curve fitting, as and when the confidence factor goes beyond the tolerance limits, a cumbersome process. This computationally intensive process of tuning the weights can be automated by making the model adaptive. The adaptive model has a definite advantage in that it can be based on a core model that is relatively trivial. Yet the model retains a high degree of fidelity, since it has the ability to mutate itself with the environment.
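The curve-fitting step reduces to a standard least-squares polynomial fit on the cumulative ensemble average. The sketch below uses a synthetic ensemble, so the fitted coefficients will not reproduce eqn (17); the degree-5 choice follows Fig. 8, and everything else is an illustrative assumption.

```python
import numpy as np

def fit_cumulative_model(ensemble: np.ndarray, degree: int = 5) -> np.poly1d:
    """Fit a polynomial (degree 5, as in Fig. 8) to the cumulative
    ensemble-average LOC curve and return it as a callable model."""
    mu_hat = ensemble.mean(axis=0)          # ensemble average per day
    cumulative = np.cumsum(mu_hat)          # cumulative LOC (ogive)
    days = np.arange(1, len(cumulative) + 1, dtype=float)
    coefficients = np.polyfit(days, cumulative, deg=degree)
    return np.poly1d(coefficients)

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    ensemble = rng.poisson(lam=np.linspace(2, 12, 100), size=(1000, 100)).astype(float)
    model = fit_cumulative_model(ensemble)
    print(model(50.0))           # predicted cumulative LOC at day 50
    print(model.deriv()(50.0))   # slope: predicted LOC rate at day 50
```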

4. Conclusion

Nonstationary software measurement data cannot be measured and analyzed using conventional statistics. The implication is that models exploiting conventional statistical measures are inherently inaccurate. This paper stresses the necessity of ascertaining the nature of the data acquired before any processing can even be done on it. Properties such as ergodicity and stationarity are critical parameters to be considered in the choice of the analysis methodology. The elegance of ensemble-based measures over time-based statistics has been demonstrated. Assumptions of stochastic independence between observations help in ameliorating most of the complexities involved with modeling; however, autocorrelation is recommended as a criterion for quantifying the validity of that assumption.

A major impediment in software measurement and modeling is the unavailability of sufficient data to validate the metric or the model. However, a random walk perturbation can be applied to available data to generate possible additional data at virtually no extra expense. The validity of the methodology is governed by the accuracy of the attributes used in perturbing the original data.

4.1. Future directions

Simple models such as the ones derived in this paper can be used as the basic building blocks of more complex models employing multiple such terms. Variations from observed data can then be corrected using simple control mechanisms. Treating software measurement data as nonstationary enables the application of statistical signal processing techniques [8]. Adaptive techniques can then be used to enhance the predictive power and genericity of such models.

References

[1] J.S. Bendat and A.G. Piersol, Measurement and Analysis of Random Data, John Wiley, 1966.
[2] B.W. Boehm, Software Engineering Economics, Prentice Hall, 1981.
[3] W. Feller, An Introduction to Probability Theory, John Wiley, New York, 1988.
[4] N.E. Fenton, ed., Software Metrics, Chapman & Hall, London, 1991, pp. 89-110.
[5] N.E. Fenton, ed., Software Metrics, Chapman & Hall, London, 1991, pp. 42-43.
[6] N.E. Fenton, R. Whitty and Y. Izuka, eds., Software Quality Assurance and Measurement, International Thomson Press, 1995.
[7] D. Gross and C.M. Harris, Fundamentals of Queueing Theory, John Wiley, 1974, pp. 381-400.
[8] W.S. Humphrey and N.D. Singpurwalla, Predicting software productivity, IEEE Transactions on Software Engineering, 17 (1991).
[9] D.R. Jeffery and G. Low, Calibrating estimation tools for software development, Software Engineering Journal, 5 (4) (1990) 215-221.
[10] M. Jorgensen, Experience with the accuracy of software maintenance task effort prediction models, IEEE Transactions on Software Engineering, 21 (8) (1995) 674-681.
[11] B.A. Kitchenham, S.L. Pfleeger and N.E. Fenton, Towards a framework for software measurement validation, IEEE Transactions on Software Engineering, 21 (12) (1995) 929-944.
[12] B.A. Kitchenham and N.R. Taylor, Software project development cost estimation, Journal of Systems and Software, 5 (4) (1985) 267-278.
[13] A. Papoulis, The Fourier Integral and Its Applications, McGraw-Hill, 1962.
[14] K. Pillai, V.S.S. Nair, H. Cropper and R. Zajac, A configuration management system with evolutionary prototyping, in: Yolanda Martinez Treviño (ed.), Proc. 4th International Symposium on Applied Corporate Computing, ITESM, Campus Monterrey, Monterrey, Mexico, November 1996.
[15] L.H. Putnam, A general empirical solution to the macro software sizing and estimation problem, IEEE Transactions on Software Engineering, SE-4 (4) (1978) 345-361.
[16] L.H. Putnam, Tutorial: Software Cost Estimating and Life-Cycle Control, IEEE, New York, 1980.
[17] L.H. Putnam and W. Myers, Measures For Excellence, Prentice-Hall, 1992.
[18] G.M. Weinberg, Quality Software Management, Vol. 2, Dorset House Publishing, New York, 1993.
[19] B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.