An effort prediction framework for software defect correction

Alaa Hassouna, Ladan Tahvildari *

E&CE Department, University of Waterloo, Waterloo, ON, Canada N2L 3G1

Article info

Article history: Received 13 January 2009; Received in revised form 29 September 2009; Accepted 11 October 2009

Keywords: Software effort prediction; Case-based reasoning; Clustering; Software defect correction; Issue tracking system

* Corresponding author. E-mail addresses: [email protected] (A. Hassouna), [email protected] (L. Tahvildari).

Abstract

This article tackles the problem of predicting effort (in person-hours) required to fix a software defect posted on an Issue Tracking System. The proposed method is inspired by the Nearest Neighbour Approach presented by the pioneering work of Weiss et al. (2007) [1]. We propose four enhancements to Weiss et al. (2007) [1]: Data Enrichment, Majority Voting, Adaptive Threshold and Binary Clustering. Data Enrichment infuses additional issue information into the similarity-scoring procedure, aiming to increase the accuracy of similarity scores. Majority Voting exploits the fact that many of the similar historical issues have repeating effort values, which are close to the actual. Adaptive Threshold automatically adjusts the similarity threshold to ensure that we obtain only the most similar matches. We use Binary Clustering if the similarity scores are very low, which might result in misleading predictions. This uses common properties of issues to form clusters (independent of the similarity scores) which are then used to produce the predictions. Numerical results are presented showing a noticeable improvement over the method proposed in Weiss et al. (2007) [1].

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The maintenance phase typically constitutes about 70% of the software development life cycle [2]. It involves making changes to various software modules, documentation and sometimes even hardware to improve the performance. To ensure such modifications do not disrupt operations or the integrity of the system, organizations employ appropriate change management procedures. Effort prediction is one of the important tools which can help in planning and executing these procedures effectively.

In this article, we address the problem of predicting effort for entries in Issue Tracking Systems early in their lifetime. These systems are used to manage the different issues that arise during the maintenance phase (an issue could be a bug, a feature request or a task)¹ [1]. Such early estimates could be used by testers to set schedules, and by managers to plan costs and provide earlier feedback to customers about future releases.

¹ We use the term "issue" in this article as it is also used in the Weiss et al. work on which we have based our framework. Note that feature requests or tasks could also be considered as defects, in the sense that these are modifications that are required, but not yet present in the product.

Predicting defect-correction effort is a more challenging task than predicting software development effort. While software development is a construction process, defect-correction is mainly a search process possibly involving all of the source code [1]. Furthermore, testers cannot trust the original developers' assumptions and conditions [1]. This may require them to explore areas of the code that they are not familiar with, adding to the complexity of the process.

The heart of the Weiss et al. approach is an Instance-Based Reasoning [3] method called the Nearest Neighbour Approach [4]. This approach leverages experience gained from resolved issues to predict correction-effort for similar emergent issues early in their lifetime. We implement four key enhancements to [1]: Data Enrichment, Majority Voting, Adaptive Threshold and Binary Clustering. Reference [1] computes the similarity scores using a text-similarity search engine. Data Enrichment injects additional issue information, as compared to [1], into the query submitted to this engine, aiming to improve the accuracy of similarity scores. The mean prediction method used in [1] is replaced by Majority Voting. Since effort values are usually taken from a limited set, Majority Voting capitalizes on the fact that often certain values, close to the actual effort, appear more frequently in similar historical matches. Adaptive Threshold allows us to compute estimates by considering higher scoring matches first, which should yield better results. If no matches are found, the threshold is systematically decreased until the required number of matches is reached. In some cases, similarity scores could be considered too low, at which point it might be misleading to use them [1]. Binary Clustering alleviates this problem by using common properties of issues to form clusters (independent of the similarity scores), which are then used to produce the predictions.

Most of the existing work addressing defect-correction effort prediction uses a single approach to generate predictions. For example, to the best of our knowledge, no work has used both the Nearest Neighbour Approach with Clustering, or Clustering with Regression. We propose using complementary approaches that help address the weaknesses of one another. In addition to employing the Nearest Neighbour Approach, we use Binary Clustering to address cases where the Nearest Neighbour Approach would produce misleading results. In this context, one of our goals is to show how such a composite approach can be used to produce accurate predictions.

The remainder of the article is organized as follows: Section 2 outlines some of the related literature. Section 3 describes our proposed approach (named hereafter Effort Prediction Framework or EPF), along with a process model, the enhancement techniques/methodologies and an algorithm combining them. Section 4 describes the datasets that have been used to evaluate our approach along with the corresponding experimental setup. Section 5 shows the results of the evaluation experiments. Finally, Section 6 presents concluding remarks and future research directions.

2. Related work

The area of defect-correction effort prediction has begun to attract attention only recently. Although the number of papers addressing this area is limited, they are nothing short of pioneering works.

In [5], Zeng and Rine use a self-organizing neural network approach to predict defect-correction effort. First, they cluster defects from a training set, and compute the probability distributions of effort from the resulting clusters. These distributions are then compared to the defects of each instance in the test set to derive the prediction error. They use the NASA KC1 dataset [6] to evaluate their approach, but unfortunately they only use the Magnitude of Relative Error (MRE) to measure performance. In the literature, this metric is believed to be asymmetric [7,8], and the use of supporting measures is usually desired. Additionally, when they apply their prediction method to an external dataset (a dataset not used in the training), their Mean Magnitude of Relative Error (MMRE) reaches scores of up to 159%, with a maximum MRE of about 373%. This shows that while their technique performs well when applied to the original calibration dataset (with MMRE <30%), it is not reliably extendable to other datasets.

Song et al. proposed the use of association rule mining to categorize effort into intervals [9]. To evaluate their approach, they use NASA's well known SEL defect dataset. According to their findings, their technique outperformed other methods such as C4.5 [10], Naïve Bayes [11] and Program Assessment Rating Tool (PART) [12].

Another interesting approach has been proposed by Evanco in [13]. Using explanatory variables such as software complexity metrics, fault spread measures and the type of testing conducted, [13] develops a statistical model that can be used in conjunction with defect prediction methods to produce estimated defect fix effort with a certain probability measure. Ref. [14] proposes a related model using Poisson and multivariate regression analysis to model the effort.

In [1], Weiss et al. propose a novel approach using an Instance-Based Reasoning methodology (a type of nonparametric learning method) called the "Nearest-Neighbor Approach" [3,4]. They use a text-similarity approach to identifying nearest neighbors, which they use to generate effort predictions. They test their approach on realistic data extracted from an Issue Tracking System. Their approach shows promising results, with Average Absolute Residuals reduced to about 6 h.

The above defect-correction effort prediction studies present various attempts to provide accurate predictions. However, they do not extend their experimental evaluation to additional datasets (with the exception of [5], which has produced less than impressive results). This raises questions about the external validity of the proposed approaches. Many approaches can be calibrated to fit a specific dataset, but only a small number can be successfully extended to additional datasets. To extend the validity of our proposed method, we evaluate its performance on an additional dataset. In addition, we use multiple performance metrics to assess the performance of the proposed framework. Many performance metrics are considered biased [7]; therefore we rely on complementary metrics that show the distribution of error for our predictions in a more comprehensive way.

One can divide the literature into four main categories according to their method of using historical similarities to compute an estimate: Clustering, Top-K Regression, Top-K Mean and Top-K Majority Voting.

Clustering is an unsupervised learning technique which aims to find a structure in a set of unlabeled data. It is the process of arranging or dividing objects into groups whose members are similar in some sense. Phansalkar et al. use K-means clustering (as described in [15]) to predict a program's performance. They find the optimal number of clusters by using the Bayesian Information Criterion as described in [15]. Once they identify what cluster the target program belongs to, they use the performance measure of the closest program to the center of the cluster as the prediction.

Regression analysis tries to model a dependent variable (e.g. effort) as a function of independent variables (e.g. program size), a set of model parameters and an error term. The parameters are computed to give a "best fit" of the data to a training set. The idea behind Top-K Regression is based on limiting the training dataset to the most similar instances only. This helps to alleviate the effect of outliers which may affect the model's accuracy. On the other hand, we need to compute a different equation for every target instance (an instance for which we need a prediction), since the corresponding set of Top-K candidates is different. Iwata et al. propose a top-30 multiple regression method to predict effort for embedded software development [16]. They use a collaborative filtering system to generate the similarity scores between projects. However, of the 73 projects they use, 53 of them have missing measurements for some of the metrics. They mitigate this problem by computing the missing values based on project similarities using a weighted mean of the similar projects' metric values. Then, they choose the top 30 most similar projects according to score and scale/size, and use them to generate the regression model and the prediction. They use 5 performance metrics for evaluation: Mean Absolute Error, Variance of Absolute Error, Mean Relative Error, Variance of Relative Error and Rratio, which they define as the inverse of PRED(15). They compare their work with the regular multiple regression analysis and with the collaborative filtering, concluding that their approach performs favorably.

Top-K Mean simply computes the mean of the measure (e.g. Effort) describing the Top-K candidates. Weiss et al. use Top-K Mean to predict the time needed to fix a particular software maintenance issue [1]. Top-K Weighted Mean is also a common variant of this method, where a weight is assigned to each instance in the candidate set (a candidate issue) based on its similarity score. Phansalkar et al. use Top-K Weighted Mean as an alternative to the clustering used in [17]. They define the weights as the reciprocal of the distance to each of the programs. The program with the highest similarity (lowest distance) to the user's application gets the highest weight. They experiment with three different weighted means: geometric mean, harmonic mean and arithmetic mean. After examining the average error of each, they conclude that the weighted harmonic mean performs the best. Since Top-K Mean is the simplest approach for prediction, it is usually preferred by practitioners. However, since the mean is greatly affected by outliers, this method should be used with caution, and data must sometimes be pruned or manipulated to reduce their impact on the accuracy of the predictions. Top-K Weighted Mean gives the user more control over the behavior of the predictor, but the chosen weight computation method is a crucial factor in the performance.

Majority Voting counts the repeating occurrence of a certain value in the history as a vote towards its use as the estimate, where the value with the most votes wins. Majority Voting can only be applied to discrete values (it cannot be used if the effort values come from a continuous distribution). Top-K Majority Voting simply limits the history set to the Top-K candidates. Top-K Majority Voting performs better than Top-K Mean when dealing with outliers and extreme values; however, one needs to use an alternative method in cases of indecision (no majority vote). In [18], Sun et al. propose the use of Top-K Majority Voting to predict the financial distress of companies. In their work, the similarity scores are computed using fuzzy logic. Ramirez et al. use Top-K Majority Voting to predict stellar atmospheric parameters [19]. They use Genetic Algorithms to reduce the size of the dataset in order to decrease the computation overhead needed to generate the similarity scores.

The above-mentioned works lack a strategy to handle cases of majority vote indecision. While this may describe a small percentage of cases, the way in which these are handled can improve the overall performance.

3. Proposed approach

To begin, we briefly describe the method proposed by Weiss et al. in [1], which we call the base approach hereafter. We discuss the essence of [1] to the extent that is essential to understand our proposed framework and the implementation of the base approach. Weiss et al. use a Nearest Neighbour Approach (kNN), which produces a prediction as follows: the target issue (the issue for which we are required to make an effort prediction) is compared to the resolved ones in the Issue Tracking System. A distance metric is defined to produce a similarity score between the target issue and every resolved issue. The k most similar issues (candidates) are selected, and the mean of their reported effort is used as the prediction. Weiss et al. also define a variant of kNN called the Nearest Neighbor with Thresholds (α-kNN). α-kNN applies a lower bound on the acceptable scores; i.e. if α = 0.5, then only issues with scores ≥ 0.5 are allowed as candidates. This implies that for relatively high values of α, α-kNN may not choose any candidates and will return "Unknown" rather than a prediction. Our proposed approach resolves the problem of "Unknown" predictions.
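To make the base approach concrete, the sketch below shows how an α-kNN prediction could be computed once similarity scores are available. It is a minimal Python illustration, assuming a hypothetical list of (score, effort) pairs already produced by the text-similarity engine; it is not the implementation of [1].

def alpha_knn_predict(scored_history, k, alpha):
    """Predict effort as the mean of the k most similar resolved issues
    whose similarity score is at least alpha (base approach of [1])."""
    # Keep only issues that clear the similarity threshold.
    candidates = [(s, e) for s, e in scored_history if s >= alpha]
    if not candidates:
        return None  # corresponds to the "Unknown" prediction of alpha-kNN
    # Sort by similarity score (descending) and keep the top k.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    top_k = candidates[:k]
    return sum(effort for _, effort in top_k) / len(top_k)

# Example with made-up scores (effort in person-hours):
history = [(0.92, 4.0), (0.81, 4.0), (0.55, 16.0), (0.20, 40.0)]
print(alpha_knn_predict(history, k=3, alpha=0.5))  # mean of 4, 4, 16 -> 8.0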

The distance function in [1] has been defined using text similarity between issues. This is extracted using an open source text similarity measuring engine called Lucene, developed by the Apache Foundation [20], which returns a score between 0 and 1 (0 = No Similarity, 1 = Perfect Match). They use the Title and the Description of each issue as search criteria, giving each an equal weight. Our proposed approach uses additional issue information to improve the accuracy of similarity scores. We use a multi-step Top-K Majority Voting approach (compared to Top-K Mean in [1]), which aims to make better use of the discrete nature of the effort data.

Our Effort Prediction Framework (EPF) could be described as a set of enhancements to [1]. To help understand our approach, we divide the four proposed enhancements into two categories: Similarity-Score Dependent Enhancements (SSDE), which include Data Enrichment, Majority Voting and Adaptive Threshold, and Similarity-Score Independent Enhancements (SSIE), which include Binary Clustering, as depicted in Fig. 1. It can be observed that SSDE is used when the similarity scores are high, and SSIE is used otherwise. This is justified by relying on the studies showing that prediction methods which use historical similarity can be more effective in comparison to those which do not (such as COCOMO or regression models) [21,22]. However, at lower scores, SSDE could negatively affect the accuracy of the model [1]. In such cases, we fall back on the SSIE, relying only on issue-related information.

As shown in Fig. 1, Target Issue Information is first extracted from the Issue Tracking System. Then, a match query is formed through Data Enrichment, where we include a number of issue properties. We feed the query into the text-similarity engine and obtain the similarity scores with resolved issues. Then, α-kNN uses the scores and the similarity threshold (α, provided by the user) to find the similarity matches. If no matches are found for the given α, Adaptive Threshold decrements α, if possible. If we find matches, Majority Voting is used to compute the prediction; otherwise, Binary Clustering uses its own criteria to cluster related issues and produce the prediction.

The remainder of this section is organized as follows: in Section 3.1, we describe some weight computation techniques needed when Majority Voting switches to Weighted Mean prediction, as shown in Fig. 1. Section 3.2 describes the proposed enhancements and Section 3.3 shows the pseudo-code describing the combination of these enhancements.

3.1. Weight computation techniques

In using Majority Voting, we may encounter cases where a majority vote is not reached. In cases where we have multiple similarity matches in the candidate set, but no majority vote, we rely on a Weighted Mean. It expresses the prediction as a weighted sum of the effort values corresponding to the issues in the candidate set. We experiment with two techniques to compute the weights, as will be explained next. The two particular techniques we have chosen in our study are meant to shed light on the effect of Similarity-Score Dependent and Similarity-Score Independent weight computation. These techniques show similar performance, and a small but noticeable improvement over simply using the mean (equal weights). The aim of this part of our study is to simply explore the effect of these methods on the prediction accuracy, and not to prove the superiority of one over the other.

3.1.1. Least squares learning

This technique computes the weights based on minimizing the sum of the squared residuals (a residual is the difference between the actual and the predicted effort). This technique is Similarity-Score Dependent since it relies on similarity scores to sort and filter issues. To avoid confusion, we will describe some of the entities we will be using to present this approach. The target issue is the issue for which we are trying to compute the prediction. A target issue has a candidate set associated with it, which contains the kNN (k Nearest Neighbors), called the candidate issues, that are taken from the set of historical issues (available from the issue database). The training set (composed of training issues) contains the remaining historical issues excluding the target issue. To find the weights, we compute a candidate set for each training issue, which we will call the training candidate set (the same is done for the target issue). The key to generating accurate weights is to choose the training candidate sets properly. Our technique relies on the fact that the similarity scores are a good indicator of similarity [1]. Least Squares Learning chooses the training candidate set for each training issue in the same way the candidate set is chosen for the target issue; i.e. the k Nearest Neighbors according to the similarity scores. The number of training candidate issues selected per training issue is configurable; however, we chose to set it as the same number of candidate issues selected for the target issue, which seemed to yield the best performance. The group of training candidate sets along with the training issues are then used to minimize the Mean Square Error (MSE). The following is a formalization of how Least Squares Learning is used to compute the weights:

Fig. 1. Process model of the effort prediction framework.

\hat{E}_i = \sum_{j=1}^{M} w_j E_{ij}     (1)

MSE = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \hat{E}_i \right)^2     (2)

    = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \sum_{j=1}^{M} w_j E_{ij} \right)^2     (3)

where M is the number of training candidate issues, w_j represents the weight coefficient for the jth training candidate issue, E_{ij} represents the actual effort of the jth training candidate issue of the ith training issue, \hat{E}_i is the predicted effort for the ith issue, N is the number of training issues and E_i represents the actual effort for the ith issue. In order to minimize the Mean Square Error (MSE), we differentiate with respect to w_q for 1 \le q \le M and set \partial MSE / \partial w_q = 0, resulting in:

\sum_{j=1}^{M} \sum_{i=1}^{N} E_{ij} E_{iq} w_j = \sum_{i=1}^{N} E_i E_{iq}     (4)

By computing (4) for all q, we obtain a system of linear equations for finding the weights w_j (1 \le j \le M).
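As a computational illustration of Eq. (4), the following sketch assembles and solves the resulting linear system with NumPy. It assumes the training candidate efforts are arranged as an N × M matrix (one row per training issue), which is a simplification made for illustration and not the paper's MATLAB implementation.

import numpy as np

def least_squares_weights(E_candidates, E_actual):
    """Solve Eq. (4): sum_j (sum_i E_ij E_iq) w_j = sum_i E_i E_iq for all q.

    E_candidates: N x M array, row i holds the efforts E_ij of the M
                  training candidate issues of training issue i.
    E_actual:     length-N array of actual efforts E_i.
    Returns the weight vector w of length M."""
    A = E_candidates.T @ E_candidates   # entry (q, j) = sum_i E_iq * E_ij
    b = E_candidates.T @ E_actual       # entry q      = sum_i E_i  * E_iq
    # lstsq is used instead of solve() in case the system is ill-conditioned.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Toy example: 4 training issues, 2 candidate efforts each (made-up values).
E_cand = np.array([[2.0, 4.0], [4.0, 8.0], [1.0, 2.0], [8.0, 8.0]])
E_act = np.array([3.0, 6.0, 1.5, 8.0])
w = least_squares_weights(E_cand, E_act)
prediction = np.array([2.0, 6.0]) @ w   # Eq. (1) applied to a target candidate set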

3.1.2. Historical value frequency

This technique benefits from the fact that the recorded effort values are discrete; i.e. effort is rounded to the nearest 15 min. To compute the weights, this technique counts the number of occurrences of each effort value in the history, and computes the normalized relative weights (estimates of probabilities) accordingly. This is equivalent to minimizing the mean squared residuals (by computing the statistical expectation conditioned on the candidate set) by solely relying on the history to estimate the probability values (ignoring the similarity scores and, consequently, the order in the candidate set). The following is an example of how these weights are computed:

We are given the Candidate Set = [1, 2, 4], which contains the effort values for the target issue candidates, and History = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5], a vector of all effort values in history (resolved issues). First, we compute the frequency of each effort value in the Candidate Set as it appears in History; this gives us the Candidate Set Historical Frequencies (F) = [count(1), count(2), count(4)] = [3, 5, 2]. Then, to compute the normalized weights, we divide the resulting frequencies by the sum of the frequencies in the Candidate Set Historical Frequencies vector, as shown in Candidate Set Weights = [3/sum(F), 5/sum(F), 2/sum(F)] = [3/10, 5/10, 2/10] = [0.3, 0.5, 0.2]. Now that we have computed the weights, we can apply them to the Candidate Set by multiplying each entry accordingly, and the final Weighted Candidate Set = [(0.3 × 1), (0.5 × 2), (0.2 × 4)] = [0.3, 1, 0.8]. Finally, to produce the prediction, we compute the sum of the Weighted Candidate Set.

3.2. Proposed enhancements

We apply the following enhancements to the Weiss et al. approach (refer to Fig. 1 to see where each fits in the process):

3.2.1. Data enrichment

The Weiss et al. approach uses the Title and Description in Lucene's search query to generate the similarity scores. While this has generated promising results, if we enrich the query by adding more relevant criteria, the end result can potentially improve. However, there are two unfavorable side effects (caused by the more specific nature of the query): (i) in general there will be fewer matching issues, (ii) the number of high scoring matches will decrease. This implies that, while yielding more accurate results, using the α-kNN method will have lower FEEDBACK values for high α's, where FEEDBACK is a performance metric that measures the percentage of issues for which the model makes a prediction.

To implement this enhancement, we perform a two-step analysis to determine which issue properties to include. First, we perform a correlation analysis between each issue property available from the Issue Tracking System and the actual effort. Then, we choose the properties with the highest correlation values for which the p-values² are ≤0.15 as the base set. Finally, we test all the possible combinations of the elements in the base set until we arrive at the set of properties that produce the most accurate results. This is done experimentally, by applying Data Enrichment to produce the predictions for each possible combination and comparing the corresponding performance metrics. During the testing we also take into consideration the FEEDBACK value. As mentioned earlier, the FEEDBACK decreases as we introduce more criteria into the query. In general, we try to avoid cases where FEEDBACK reaches 0%. Intuitively, we would like it to be as high as possible; at the same time, we want the accuracy of the scoring mechanism to be high as well. One issue property that we did not use in our study is the associated files changed. The main reason is that this information is not usually available in the initial stages of the issue's life. In fact, it is usually the end result of the issue resolution cycle. Additionally, issues that did have such information constituted only about 12% of the dataset, not providing a significant ratio to work with.

² The p-value associated with a correlation value measured as C represents the probability of obtaining C when the true correlation is zero (a smaller p-value signifies a more meaningful correlation value) [23].
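The property-selection step could be sketched as follows, using the Pearson correlation and its p-value to form the base set. The 0.15 cut-off follows the text, while the data layout (a dictionary of numerically encoded property vectors) and the function names are assumptions made for illustration.

from scipy.stats import pearsonr

def select_base_set(property_values, actual_effort, p_cutoff=0.15):
    """Return the issue properties whose correlation with the actual effort
    has a p-value <= p_cutoff, sorted by |correlation| (highest first).

    property_values: dict mapping a property name to a list of numerically
                     encoded values, one per issue.
    actual_effort:   list of actual effort values, in the same order."""
    base_set = []
    for name, values in property_values.items():
        corr, p_value = pearsonr(values, actual_effort)
        if p_value <= p_cutoff:
            base_set.append((name, corr))
    base_set.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in base_set]

# All combinations of the base set would then be evaluated experimentally,
# comparing the resulting performance metrics (and FEEDBACK) for each query.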

3.2.2. Adaptive threshold

As mentioned before, introducing Data Enrichment can negatively influence FEEDBACK. Adaptive Threshold is used to compensate for this negative effect for α-kNN. If for some α, α-kNN does not find any candidates, we automatically decrease α. We can also set certain match-goals to trigger such an event; for example, decrement α until we reach a required FEEDBACK percentage, or have a majority vote, or simply receive a match (which is the approach we have used). Moreover, we can control how fast α decreases, giving us greater control over the range of the score spectrum we would like to include (we used 0.1 as the decrement value, similar to [1]).

We take a number of steps to implement this enhancement. Since this enhancement is Similarity-Score Dependent, it is implemented as a LOOP structure containing the SSDE. The user inputs the initial α value into the algorithm. If we can find a match (or matches) for the given α value, we submit them to the prediction module. If no match is found, we decrement α by a certain decrement-value and repeat the search for the new α. On the other hand, if α is too low to grant a reliable prediction, Adaptive Threshold terminates the Similarity-Score Dependent Enhancements LOOP, and hands the prediction over to the Similarity-Score Independent Enhancements. As we can see, this approach is dependent on two main variables that must be calibrated per dataset, namely Threshold Decrement and Threshold Limit. Threshold Decrement determines the value by which we decrement α at each interval or step during the Adaptive Threshold adjustment procedure. If Threshold Decrement is set to a low value, the threshold is decremented at a finer level. This forces the algorithm to obtain fewer but more relevant matches as it decrements; i.e. if we set the Threshold Decrement to 0.01 as opposed to 0.1, then the algorithm will try to obtain matches at finer intervals (1.00, 0.99, 0.98, 0.97, ... rather than stepping to 0.9 directly). While finer intervals give more relevant matches a higher priority, the fact that we may obtain fewer matches means that any error in the scoring process could be magnified if the given matches are misleading. This is the reason we revert to using 0.1 as our preferred decrement value, as also supported experimentally. Also, as explained earlier, when applying Data Enrichment, we obtain lower scoring matches; therefore finer Threshold Decrement values at high α's will also magnify this effect. We also determine Threshold Limit experimentally. Threshold Limit defines the point at which we consider the scores misleading; i.e. predictions based on Similarity-Score Dependent Enhancements are considered inaccurate. We compare different limits by assessing the performance of our approach beyond the given threshold. If the Similarity-Score Independent Enhancements perform better beyond the limit, we consider it a cut-off point for the SSDE. We concluded that a Threshold Limit of 0.1 was appropriate for both datasets in our study.

Adaptive Threshold, applied to α-kNN, can also be used as a superior technique in comparison to the simple kNN. For each target issue, we only consider the highest scoring matches to compute the prediction. For example, for some target issue the highest scoring match might be at α = 0.9, while another might be at α = 0.6. If we set our match-goal to a single match, starting at α = 1 and decrementing until we receive at least one match, we ensure that we are using only the most similar matches to compute the prediction.
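A minimal sketch of the threshold-adjustment loop is shown below, with the decrement and limit values from the text (0.1 for both) and the match-goal of simply receiving at least one match; the data layout of scored issues is the same hypothetical one used in the earlier α-kNN sketch, and the code is illustrative rather than the actual implementation.

def adaptive_threshold_candidates(scored_history, k, start_alpha=1.0,
                                  decrement=0.1, limit=0.1):
    """Lower the similarity threshold in steps until at least one candidate
    is found, or the threshold limit is reached (returning None so that the
    similarity-score independent path, Binary Clustering, takes over)."""
    alpha = start_alpha
    while alpha >= limit:
        candidates = [(s, e) for s, e in scored_history if s >= alpha]
        if candidates:
            candidates.sort(key=lambda pair: pair[0], reverse=True)
            return candidates[:k]
        alpha -= decrement
    return None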

3.2.3. Majority voting

In [1], the authors use Top-K Mean to predict effort. After studying the data, we observed that about 80% of the issues have repeating values in their candidate sets, and many of those values were close to the actual effort. We also noticed that, often, there are outliers that could greatly skew the mean. These observations led us to using Majority Voting instead of the mean, as it is better suited to deal with outliers, and makes better use of the discrete nature of the effort values.

We also observed that about 91% of the issues have effort <40 h, 78% <16 h and 64% <8 h. This indicates that the data is skewed towards lower values, and the majority of the issues have effort less than 5 work days. However, this also means that a larger number of the other 9% of effort values could be extreme outliers, which is the case in our datasets. This can greatly skew the predictions made by Top-K Mean, giving greater confidence in adopting our proposed Majority Voting.

When a majority vote cannot be reached, we employ a Least Variable Window method. It is a technique that narrows down the candidate set to the group of issues that has the least variation (Standard Deviation) in their effort values. In our implementation, we use a fixed window-size, for which we compute the standard deviation of the elements within the window (see Fig. 2). In other words, given a candidate set, we sort the issues according to their effort values; then, using a fixed window-size (determined experimentally, refer to Section 3.3 for more details), we slide the window over the sorted array and compute the standard deviation of the elements within the window until we reach the end of the array. Once we obtain the list of standard deviations computed for each window, we choose the window with the lowest value. Least Variable Window also takes the similarity scores into consideration; in other words, if two sets (i.e. two possible sets of issues to be chosen for a particular window size) have the same standard deviation, the one with the higher mean score is chosen. If there is only one issue in the windowed candidate set, we use the corresponding effort as the prediction. Otherwise, the prediction is computed using a Weighted Mean approach. In this case, we can use Least Squares Learning or Historical Value Frequency to compute the weights and produce the prediction. In Section 5, we will show the relative merit of using each of these techniques with respect to the base method of Weiss et al.

As mentioned in [1], predictions based on low similarity matches could in fact be misleading. For these cases, we implement Binary Clustering, a Similarity-Score Independent Enhancement (SSIE), to produce reasonably accurate predictions, as will be explained next.
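The window selection described above could look like the following sketch, which slides a fixed-size window over the sorted effort values and keeps the window with the smallest standard deviation. The tie-break on mean similarity score used in the framework is omitted here for brevity; the code is an illustration, not the original implementation.

import statistics

def least_variable_window(efforts, window_size):
    """Return the contiguous window of sorted effort values with the lowest
    standard deviation (the similarity-score tie-break is omitted here)."""
    values = sorted(efforts)
    if len(values) <= window_size:
        return values
    best_window, best_std = None, float("inf")
    for start in range(len(values) - window_size + 1):
        window = values[start:start + window_size]
        std = statistics.pstdev(window)
        if std < best_std:
            best_window, best_std = window, std
    return best_window

print(least_variable_window([2, 4, 4, 16, 40], window_size=3))  # [2, 4, 4]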

Fig. 2. Visualization of the least variable window method.

3.2.4. Binary clustering

This technique groups similar issues into clusters. We use a simplified binary distance function, which returns "0" if the issue has the same properties as the cluster, or "1" otherwise. Only issues with a distance of "0" are accepted into the cluster. Cluster properties are chosen using the information we extract from the Issue Tracking System. For example, properties like project name and issue type could be used as clustering criteria. The target issue defines the values of the properties for a cluster; i.e. if we use project name as the cluster property and the project name of the target issue is "FOO", then only historical issues with the project name "FOO" are accepted into the cluster.

To decide which properties to use for clustering, we perform correlation analysis between the issue properties and the actual effort. If more than one property is found to be highly correlated (i.e., the properties with the highest correlation where the p-values are ≤0.15), then multiple properties are used (issues have to match all those properties to enter a cluster). Once we decide on the clustering properties, we populate the cluster by applying the binary distance function between the target issue and each issue in the history. Now, we have a vector of related issues, and their corresponding effort values. We use the Least Variable Window method (as described before in Majority Voting) to limit the variability of the resulting set, and then compute the mean of the window.

We can also take a multiple step approach when producing predictions using SSIE. For example, in our approach, we have used multiple steps of Binary Clustering depending on the nature of the dataset. In some cases, clustering using certain criteria like Project Name does not produce predictions 100% of the time (i.e. some projects only have 1 issue listed under them). Therefore, we implement a second Binary Clustering step using a different criterion or set of criteria (e.g. Issue Type), where cases that are not predicted by the first step of Binary Clustering could be predicted by the second step.
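A minimal sketch of the binary distance function and the multi-step clustering fallback is given below, reusing the least_variable_window sketch from the previous subsection. Each issue is assumed to be a dictionary of properties plus its recorded effort, and the criteria lists mirror the examples in the text; both are assumptions made for illustration, not the original code.

def binary_distance(target, candidate, properties):
    """Return 0 if the candidate matches the target on all clustering
    properties, 1 otherwise (only distance-0 issues enter the cluster)."""
    return 0 if all(candidate[p] == target[p] for p in properties) else 1

def cluster_prediction(target, history, criteria_steps, window_size=5):
    """Try each clustering criterion in turn until a non-empty cluster is
    found, then predict with the mean of its least variable window."""
    for properties in criteria_steps:
        cluster = [issue["effort"] for issue in history
                   if binary_distance(target, issue, properties) == 0]
        if cluster:
            window = least_variable_window(cluster, window_size)
            return sum(window) / len(window)
    return None

# e.g. criteria_steps = [["project_name"], ["type"]] for the Codehaus setup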

3.3. Framework implementation

The following algorithm shows a simplification of the final composition of our proposed approach, incorporating all of the above enhancements. First, we define some variable names: threshold_decrement specifies the granularity at which to decrement the α-kNN threshold, lsl_coefficients holds the weights computed by Least Squares Learning, hvf_coefficients holds the weights computed by Historical Value Frequency, wm_window_size and bc_window_size specify the Least Variable Window size for the Weighted Mean alternative in Majority Voting and for Binary Clustering, respectively, and issue_history contains the resolved issues that are older than the target issue (including the similarity scores).

Lines 2–7 show the variable initializations. Lines 9–28 implement the SSDE: lines 9 and 11–14 implement the Adaptive Threshold enhancement, line 10 represents the α-kNN method, line 15 performs the Majority Voting, lines 16–24 are executed if no majority vote can be reached (at which point either a single match is available or multiple matches are found), and if there is a majority vote (line 26) the prediction is the vote result. Lines 30–34 implement the SSIE; if no matches were found from the similarity-score dependent method, we execute these lines: at line 31 we perform the Binary Clustering, at line 32 we find the Least Variable Window for the related issues obtained from the cluster, and the prediction is then computed at line 33.

To determine values of the underlying variables, we perform an exhaustive search over a discretized range for each variable under consideration. This range is experimentally selected such that expanding it does not have a noticeable impact on the final result. We first set the variable to its initial value and produce the predictions for all the available test issues, while collecting performance metrics per issue such as Absolute Residual and Relative Error. Using the collected metrics, we compute AAR (Average Absolute Residual) and PRED(x) (percentage of predictions that fall within ±x% of the actual effort). Then, we perform the same procedure after setting the variable to the next value within the range. Once we have a set of measured performance metrics and their corresponding variable values (e.g. a number of window sizes, each with its corresponding performance metrics' measurements), we choose the variable value that minimizes AAR and maximizes PRED(x), giving precedence to PRED(x). For example, if a particular window size gave better performance in AAR and not in PRED(x), and another gave better performance in PRED(x) and not in AAR, then we would choose the latter. However, this situation rarely occurs, as an improvement in prediction quality is usually reflected in both performance metrics.

1:  {—Variables Initialization—}
2:  prediction ← null
3:  threshold ← 1
4:  threshold_decrement ← 0.1
5:  threshold_limit ← 0.1
6:  wm_window_size ← 3
7:  bc_window_size ← 5
8:  {—SSDE Implementation—}
9:  while prediction = null & threshold ≥ threshold_limit do  {Adaptive Threshold}
10:   topk ← choose_topk(issue_history, k, threshold)  {α-kNN}
11:   if size(topk) = 0 then
12:     threshold ← threshold − threshold_decrement
13:     continue
14:   end if
15:   [vote, frequency] ← majority_vote(topk)  {Majority Voting}
16:   if frequency < 2 then  {No Majority Vote}
17:     if size(topk) = 1 then  {Single Match}
18:       prediction ← topk
19:     else  {Multiple Matches}
20:       topk_window ← least_variable_window(wm_window_size, topk)
21:       lsl_coefficients ← least_squares_learning(issue_history)
22:       hvf_coefficients ← historical_value_frequency(topk, issue_history)
23:       prediction ← sum(topk_window × (lsl_coefficients or hvf_coefficients))
24:     end if
25:   else  {We have a Majority Vote}
26:     prediction ← vote
27:   end if
28: end while
29: {—SSIE Implementation—}
30: if prediction = null then
31:   related_issues ← binary_clustering(target_issue, issue_history)  {Binary Clustering}
32:   related_issues_window ← least_variable_window(bc_window_size, related_issues)
33:   prediction ← mean(related_issues_window)
34: end if

Table 1. Summary of statistics describing case studies.

JIRA Branch   Projects   Issues   Bugs     Feature requests   Tasks
JBoss         85         35,000   15,000   7000               8000
Codehaus      >100       56,000   28,000   5000               5000

Both bc_window_size and wm_window_size are determined experimentally. We have observed that a window size >5 makes the results worse for Binary Clustering, and >3 worse for the Weighted Mean alternative in the Majority Voting. The threshold_decrement variable is also determined experimentally, with the best results produced using a value of 0.1. Any similarity score below the threshold_limit is considered too low, so if there are no matches with scores ≥0.1 (determined experimentally for each dataset), we use Binary Clustering. As for the SSIE, we can specify as many filters/steps as required to predict 100% of the issues. In some cases, certain criteria form empty clusters for certain target issues, yielding no prediction. To resolve this, we apply multiple steps of Binary Clustering, each using a different criterion, until we obtain a prediction. For example, if we specify our clustering criteria to include Issue Type and Issue Priority, and we do not obtain matches for a particular issue, then we can relax the criteria to include only Issue Type.
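The calibration procedure described in this section amounts to a one-dimensional grid search per variable. The sketch below assumes a hypothetical evaluate_with(value) helper that runs the framework with the candidate setting and returns the pair (AAR, PRED(25)); it illustrates the selection rule only, not the actual experimental scripts.

def calibrate(candidate_values, evaluate_with):
    """Pick the value with the best PRED(25); break ties by the lower AAR,
    mirroring the precedence given to PRED(x) described above."""
    results = [(value, *evaluate_with(value)) for value in candidate_values]
    # Each entry is (value, aar, pred25); sort by PRED(25) desc, then AAR asc.
    results.sort(key=lambda r: (-r[2], r[1]))
    return results[0][0]

# e.g. wm_window_size = calibrate([2, 3, 4, 5], evaluate_with)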

4. Experiments

This section describes the experimental setup, tools and metrics we have used to implement our effort prediction framework and assess its performance. Section 4.1 describes the case studies we use to perform our experimental studies, in addition to the Issue Tracking System from which they are extracted. Then, Section 4.2 describes the implementation tools and experiment setup. Finally, Section 4.3 presents the evaluation method along with the performance metrics we use to assess our proposed approach.

4.1. Case studies

Many Issue Tracking Systems currently exist in the software community. Systems like Bugzilla [24], Mantis [25], DevTrack [26] and JIRA [27] all track bugs and issues effectively. However, only a limited number of them provide us with effort information. Since effort data is not directly related to the issue being tracked, most repositories do not keep a record of it. In addition, it takes extra effort on the part of the issue tester to keep track of the effort. Another common problem with testers recording effort is the tendency for humans to overstate their own effort, rendering the data inaccurate.

JIRA is a project management and issue tracking system which companies can use to manage bugs, features, tasks, improvements or any issue related to their projects. What makes JIRA particularly useful is the fact that it keeps track of the actual effort spent on an issue. While JIRA provides the utility to record the effort, it is not mandatory. Therefore, we need to filter the available projects to obtain the ones which do record the effort values. A number of open source projects are currently being tracked through JIRA, such as Apache [28], Spring [29], Flex [30], Codehaus [30], and JBoss [31]. For our purpose, the only projects that contain enough issues with enough recorded actual effort values are JBoss and Codehaus. Weiss et al. [1] use the JBoss dataset (with issues tracked until 05/05/2006) to evaluate their approach. In order to provide a basis for comparison, we use the same dataset for the evaluation of our approach. In addition, we use the Codehaus dataset (with issues tracked until 01/03/2008) to evaluate our final composite approach, which we also compare with that of [1].

JBoss is a division of RedHat that develops and maintains a range of certified open source middleware projects based on the J2EE platform. The JBoss community projects sit between application code and the operating system to provide services such as persistence, transactions, messaging and clustering. The JIRA JBoss issue tracking branch currently tracks 85 different projects and just under 35,000 issues. The issues can be broken up into about 15,000 Bugs, 7000 Feature Requests, 8000 Tasks, and the rest are assorted issues. Codehaus is a project that provides a collaborative environment for other open source projects needed to meet real world needs. The JIRA Codehaus issue tracking branch tracks more than 100 different projects and about 56,000 issues. About 28,000 are Bugs, 5000 are Feature Requests and another 5000 are Tasks. Table 1 summarizes the above data describing each case study.

The JBoss dataset comprises about 600 issues, and the Codehaus dataset comprises about 500 issues. Although we rely on the same evaluation method and performance measures as described in [1], we use a smaller set of testing issues. Since the issue time-stamp determines the size of the training set, some test issues will have a smaller training set than others. For this reason, we limit our testing set to the most recent 300 issues. This makes the size of the training sets more uniform across all test issues. In practice, this should not be a problem unless the project is in its initial phases.

Fig. 3. Implementation rules to realize the proposed framework.

JIRA provides the following easily accessible issue information (properties), namely: Project Name, Title, Description, Summary, Type, Priority, State, Resolution, Reporter, Assignee. Conforming to the criteria set by Weiss et al. in [1], we consider a limited number of categories for each property. Priority has 5 categories: Blocker, Critical, Major, Minor, Optional; State has 3: Closed, Resolved, Reopened; Resolution has 3: Done, Out of Date, Unresolved; and finally, Type has 4: Bug, Task, Sub-Task, Feature Request.

We use these properties for the Data Enrichment enhancement as query criteria and for the Binary Clustering enhancement as clustering criteria. Using correlation analysis (as described in Section 3.2), we found that the following setup provides the best results:

• JBoss Dataset:
  – Data Enrichment: (Title, Description, Project Name, Type, Priority, State).
  – Binary Clustering: (Type).
• Codehaus Dataset:
  – Data Enrichment: (Title, Description, Project Name).
  – Binary Clustering: We use two-step clustering, first using (Project Name), then for issues that do not receive a prediction, we cluster using (Type). This was due to the fact that some projects had only 1 issue.

4.2. Implementation tools

To extract the issue information from JIRA [27], we developed PHP (v5.2.6) [32] scripts to crawl, collect and filter the information. Only issues that contained actual effort information were retained in our MySQL (MySQL Community Server v5.0) [33] database. We also used PHP scripts to populate, extract and format information from the database. PHP simplifies string manipulation, in addition to database interaction, which was the reason we chose this scripting language.

We use the Java Programming Language (Java SE v6) [34] to generate the similarity scores, mainly due to the fact that Lucene [35] is developed in Java. We used version 2.3 of Lucene, which was the most up to date version available on the Apache Foundation web site. We used the Multi-Field Query provided by Lucene to query the different issue information, in addition to the "time" filter (to eliminate issues newer than the target issue) and the "stop-word" filter (to eliminate common English stop-words such as "the", "a", "are", etc.). Stop-words are usually removed from search queries and indexed material as they do not provide any significant information (indeed, in many cases they provide a vast amount of irrelevant information).

Finally, we implemented the proposed approach and performed the data analysis using MATLAB (vR2007b) [23]. The way MATLAB handles vectors and large datasets simplifies the implementation of our approach. In addition, it includes some useful embedded statistical analysis tools. To exchange the required data between the different phases described above, we use ".csv" files.

Fig. 3 describes the general implementation architecture we have used to evaluate the proposed framework. It highlights the high level process architecture, grouping the different components by their corresponding implementation language. It also illustrates the interactions between the different processes and the data output of each. The "Filter Issues" process filters out any issues that do not conform to the categories described earlier in Section 4.1, and those that do not have a record of the actual effort. To keep the illustration simple and easy to understand, we omitted some details of intermediate steps that we took to format the data into a readable format for MATLAB. More specifically, the "Issue" data that MATLAB uses from the "Issue Database" goes through a PHP script that formats it into ".csv" files that are easily readable in MATLAB. The "Similarity Score Matrix" is also outputted in ".csv" form; however, it is directly formatted in Java.

4.3. Performance metrics

To evaluate the performance of the proposed method, we define the residual (or error) r_i as the absolute difference between the predicted effort \hat{e}_i and the actual effort e_i reported for an issue i. Clearly, the lower the residual, the more accurate is the prediction.

r_i = |e_i - \hat{e}_i| = |\text{Actual Effort} - \text{Estimated Effort}|

One then needs appropriate performance metrics to measure and compare the performance of the different approaches. To provide complementary views, we have used three performance metrics, as explained in the following:

• Percentage of predictions within ±x%: PRED(x) is a commonly used performance metric which measures the percentage of predictions that lie within ±x% of the actual effort value e_i, i.e.,

  PRED(x) = \frac{|\{ i \mid r_i / e_i \le x/100 \}|}{n}

  PRED(x) is widely used as a useful metric for determining the quality of the predictions. Also, the fact that it is used often in the industry, and more importantly in the paper on which we have based our framework, has motivated us to use it for comparing the results. Specifically, we use PRED(25) and PRED(50) in our experiments.

• Average Absolute Residual (also known as the Mean Absolute Error), which is defined as

  AAR = \frac{1}{n} \sum_{i=1}^{n} r_i

  is another commonly used performance metric. Interpreting AAR values is simple; a larger number of good predictions results in a smaller AAR and vice versa. However, AAR is easily influenced by outliers; i.e. very large residuals can lead to misinterpretations of the performance. This does not mean that AAR is a flawed metric. Some may argue that because it is influenced by those large residuals, it would provide a more risk-aversive effort estimate. On the one hand, we do not want to over-compensate in our effort predictions and report higher numbers to management than what would commonly be a less expensive issue to resolve. On the other hand, we also do not want to give the false impression that an issue is easier to resolve than it would actually be. This is a common problem in this field, and a delicate balance is necessary when considering the accuracy and quality of such predictions. For this reason, it is important to provide additional metrics that demonstrate risk, projected accuracy of prediction and distribution of historical data on which the prediction is based. Hence, we also report the PRED(25) and PRED(50) measures, to give a better picture of the residuals' distribution. Higher PRED(x) values mean a better quality of predictions.

• As a third performance metric, we report FEEDBACK, which measures the percentage of issues for which the approach provides a prediction (in some cases, α-kNN reports "Unknown" rather than making a prediction). For kNN and α-kNN with α = 0, FEEDBACK is 100%. A short sketch computing these three metrics is given below.
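As a concrete illustration of the three metrics, the following minimal MATLAB sketch computes AAR, PRED(25), PRED(50) and FEEDBACK from vectors of actual and predicted effort. The sample data are hypothetical, NaN marks issues for which α-kNN returns "Unknown", and n in PRED(x) is taken here as the number of issues that actually received a prediction (an assumption on our part).

% Minimal sketch with hypothetical data: actual and predicted effort (hours);
% NaN marks issues for which alpha-kNN returned "Unknown".
actual    = [10  4  25  8  2  16];
predicted = [12  3  NaN 8  5  NaN];

answered = ~isnan(predicted);            % issues that received a prediction
feedback = 100 * mean(answered);         % FEEDBACK in percent

r   = abs(actual(answered) - predicted(answered));   % residuals r_i
aar = mean(r);                                       % Average Absolute Residual

% PRED(x): share of predictions within +/- x% of the actual effort
pred = @(x) 100 * mean(r ./ actual(answered) <= x/100);
fprintf('AAR = %.1f h, PRED(25) = %.1f%%, PRED(50) = %.1f%%, FEEDBACK = %.1f%%\n', ...
        aar, pred(25), pred(50), feedback);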

To perform the actual prediction, we retrace the history of the Issue Tracking System. In other words, we consider each issue in the dataset as a target issue, but its training set contains only issues that were submitted before it; the training set is then used to search for the nearest neighbors. This means that for the first submitted issue we cannot make a prediction. For this reason, we limit our testing set to the most recent 300 issues.
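A minimal sketch of this history-retracing loop follows; the variable names (submitDates, actualEfforts) and the predictFor helper are placeholders we introduce for illustration, and we assume the dataset contains at least 300 issues.

% Minimal sketch: retrace the Issue Tracking System history. Each of the most
% recent 300 issues is a target; its training set contains only issues that
% were submitted before it. (submitDates, actualEfforts and predictFor are
% hypothetical placeholders.)
[~, order] = sort(submitDates);           % chronological order of submission
efforts    = actualEfforts(order);
n          = numel(efforts);

testIdx     = n-299 : n;                  % the most recent 300 issues
predictions = nan(size(testIdx));
for j = 1:numel(testIdx)
    t = testIdx(j);
    trainingEfforts = efforts(1:t-1);     % only issues submitted before the target
    predictions(j)  = predictFor(t, trainingEfforts);
end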


5. Discussions on obtained results

The following subsections describe the results obtained from our experiments. Section 5.1 presents the results of applying the individual enhancements to the Base Approach, which helps the reader differentiate the relative impact of each of the proposed enhancements. Section 5.2 then provides a performance comparison between the effort prediction framework (including all the proposed enhancements) and the Base Approach [1]. We also present the results of applying the proposed framework to the Codehaus dataset to further study and extend the validity of our work.

Fig. 5. Performance metrics (AAR in hours; PRED(25), PRED(50) and FEEDBACK in percent) vs. α for the base approach [1] with Data Enrichment (JBoss Dataset).

Fig. 6. Performance metrics (AAR in hours; PRED(25), PRED(50) and FEEDBACK in percent) vs. α for the base approach [1] with Majority Voting (JBoss Dataset).

Fig. 7. Performance metrics (AAR in hours; PRED(25), PRED(50) and FEEDBACK in percent) vs. α for the base approach [1] with Data Enrichment, Majority Voting and Adaptive Threshold (JBoss Dataset).

5.1. Enhancement evaluation

The following charts illustrate only the α-kNN approach of [1]. Although the enhancements do in fact improve the kNN approach of [1], our focus is to show improvements with respect to the best result of [1], which corresponds to their α-kNN approach. Figs. 4–7 show the performance of our approach using Historical Value Frequency for weight computation (based on the most recent 300 issues).

Fig. 4 shows the performance results for the Base Approach, using the Nearest Neighbors with Thresholds method (α-kNN) with k = ∞. In this chart, when α = 0, the prediction is basically the mean of the entire training set. We can see that α-kNN performs best at α = 1, with AAR ≈ 6 h, PRED(25) ≈ 30% and PRED(50) ≈ 45%; however, FEEDBACK drops as low as 14%. α-kNN shows a noticeable improvement as α increases: starting at α = 0.1, with AAR ≈ 14 h, PRED(25) ≈ 15% and PRED(50) ≈ 25%, we obtain at α = 1 approximately a 15% improvement for PRED(25) and a 20% improvement for PRED(50) in comparison to the results produced at α = 0.1. The most significant improvement is in AAR, where we see an improvement of about 8 h.
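For reference, the following minimal MATLAB sketch captures our reading of the α-kNN prediction used above for k = ∞: all training issues whose similarity score reaches the threshold α contribute, their mean effort is the prediction, and "Unknown" (NaN here) is returned when no issue qualifies. The selection rule (similarities >= alpha, with scores normalized to [0, 1]) is our interpretation of the Base Approach, not code from [1].

% Minimal sketch of alpha-kNN with k = inf (our interpretation of [1]):
function e = alphaKnnPredict(similarities, trainingEfforts, alpha)
    idx = similarities >= alpha;          % neighbours at or above the threshold
    if any(idx)
        e = mean(trainingEfforts(idx));   % mean effort of qualifying neighbours
    else
        e = NaN;                          % "Unknown": no prediction is made
    end
end

With alpha = 0 every training issue qualifies, which reproduces the mean-of-the-training-set behavior noted above.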

From Fig. 5, we can see that introducing Data Enrichment does in fact improve upon the Base Approach. At its best point, Data Enrichment achieves AAR = 2 h, PRED(25) = 35% and PRED(50) = 55%, compared to the Base Approach (shown in Fig. 4) with AAR = 6 h, PRED(25) = 30% and PRED(50) = 45%. This translates into an improvement of 4 h for AAR, 5% for PRED(25) and 10% for PRED(50). The overall FEEDBACK is lower than that of the Base Approach (as expected and explained in Section 3.2). We notice some unexpected behavior for α > 0.7; the FEEDBACK beyond that point is very low (<10%). We attribute the change in behavior of the metrics to this low FEEDBACK: when it is this low, any errors in the scoring process are magnified and can greatly influence the results. Since Data Enrichment is aimed at improving the accuracy of the similarity scores, we can see that reflected in the rapid drop in AAR even for low values of α (<0.4): AAR is reduced from about 14 h to less than 4 h at α = 0.3.

Fig. 4. Performance metrics (AAR in hours; PRED(25), PRED(50) and FEEDBACK in percent) vs. α for the base approach [1] (JBoss Dataset).

Fig. 6 illustrates that introducing Majority Voting produces a more consistent improvement over all values of α. It outperforms the Base Approach in terms of AAR for α < 0.4, but then levels off to match it for higher values of α. The PRED measures, however, show consistent improvement over all values of α, with results reaching a high of about 55% for PRED(50), up to 35% for PRED(25), and a minimum of about 6 h for AAR. We observe at least a 5% improvement overall due to introducing Majority Voting in place of the mean used by the Base Approach. The FEEDBACK, however, is the same as that of the Base Approach, since Majority Voting does not modify the percentage of issues for which we make predictions; rather, it only changes the method used for prediction, for the same issue similarity scores.


Table 3
Percentage of issues predicted by each method (JBoss Dataset). MV, majority voting; SM, single match; MM, multiple matches; BC, binary clustering.

FEEDBACK (%)   MV (%)   SM (No MV) (%)   MM (No MV) (%)   BC (%)
100            4.3      15.3             1.7              78.7
15             26.7     66.7             6.7              N/A
21             20.3     71.9             7.8              N/A


As shown in Fig. 7, applying the three SSDE (Data Enrichment, Majority Voting and Adaptive Threshold) to the Base Approach shows significant improvement; at the best point we obtain AAR = 3 h, PRED(25) = 38% and PRED(50) = 60%. Compared to the best Base Approach results (AAR ≈ 6 h, PRED(25) ≈ 30% and PRED(50) ≈ 45%), this is about a 3 h improvement in AAR and about a 10% improvement for both of the PRED measures. Note that with the introduction of Adaptive Threshold, the FEEDBACK deterioration caused by the introduction of Data Enrichment (shown in Fig. 5) is resolved, by automatically decrementing α to include the most relevant matches for the given value of FEEDBACK.
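The following minimal sketch illustrates the Adaptive Threshold idea as described above; the fixed step size of 0.1 and the stopping rule (at least one qualifying match) are assumptions made for illustration rather than the exact parameters of the framework.

% Minimal sketch: start from the strictest threshold and automatically
% decrement alpha until at least one sufficiently similar match remains.
alpha = 1.0;
step  = 0.1;
matches = find(similarities >= alpha);
while isempty(matches) && alpha > 0
    alpha   = alpha - step;               % relax the threshold
    matches = find(similarities >= alpha);
end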

One of the goals of our study is to understand how each variable, and in particular α, influences the quality of the predictions. However, it is difficult to present a formal sensitivity analysis, as the value of α simultaneously affects all the performance metrics, and in particular the value of FEEDBACK. As seen from Figs. 4–7, changing α in the range of its lower values (FEEDBACK close to 100%) results in significant variations in all the metrics, and in particular in the FEEDBACK itself, showing high sensitivity. However, as FEEDBACK decreases, we reach a point where all the performance metrics follow a flat curve, showing very low sensitivity to the value of α.

5.2. Framework evaluation

The following will present a comparison between the performance of our proposed method (EPF) and the Base Approach. We will apply the evaluation to both the JBoss and Codehaus datasets to extend the validity of our framework.

The proposed Effort Prediction Framework always produces results with 100% FEEDBACK. Therefore, to provide a basis for comparison with the α-kNN of the Base Approach, we disable the Binary Clustering enhancement. This allows us to control the FEEDBACK of the proposed approach using the SSDE: we set the threshold limit to a point where FEEDBACK is comparable with that returned by the Base Approach. As shown in Table 2, there is significant improvement for all FEEDBACK values, even in comparison to the best results produced by the Base Approach (α-kNN, α = 1). The proposed composite Effort Prediction Framework shows an improvement of more than 3 h in AAR for the 100% FEEDBACK, and about a 10% improvement for both of the PRED measures. For the other two FEEDBACK cases (15% and 21%), we see about 2 h of improvement in AAR and about a 10% improvement in PRED(25). As for PRED(50), we see approximately a 15% improvement for the 15% FEEDBACK and about a 10% improvement for the 21% FEEDBACK. The other metrics (RE Max, RE Mean and RE StdDev) are mainly displayed to show the performance difference between the two weight computation methods. For this dataset, we do not see a large difference in performance, since the percentage of issues that use a Weighted Mean is very small, as shown in Table 3.

Table 2
Comparison between the proposed effort prediction framework and the base approach [1] (JBoss Dataset). EPF, effort prediction framework; BA, base approach; LSL, least squares learning; HVF, historical value frequency; RE, relative error.

Method                AAR (h)   PRED(25)   PRED(50)   RE Max    RE Mean   RE StdDev   FEEDBACK (%)
EPF LSL               8.9       23.3%      41.3%      79        1.77      5.48        100
EPF HVF               8.9       23.6%      41.6%      79        1.75      5.49
BA (kNN, k = 1)       13.8      15.0%      31.7%      191.00    5.37      19.03
EPF LSL               3.3       35.6%      57.8%      6.2       0.82      1.13        15
EPF HVF               3.2       37.8%      60.0%      6.2       0.72      1.07
BA (α-kNN, α = 1)     5.8       28.6%      45.2%      12        1.48      2.45
EPF LSL               4.9       34.4%      53.1%      9.5       1.03      1.67        21
EPF HVF               4.8       35.9%      54.7%      9.5       0.97      1.65
BA (α-kNN, α = 0.9)   6.2       26.6%      45.3%      12        1.17      2.03

Table 3 provides a better perspective on the behavior of the proposed Effort Prediction Framework as described in Table 2, by showing the percentage of issues predicted by each method (JBoss Dataset). The methods shown in Table 3 refer to the ones described in Section 3.2 and depicted in the process model of Fig. 1. We can see four different ways to produce the effort prediction: (1) Majority Voting, (2) Single Similarity Match (when no majority vote is reached), (3) Multiple Similarity Matches (when no majority vote is reached) and (4) Binary Clustering; a simplified sketch of this decision flow is given below. As reflected in Table 3, we can see why the effect of LSL vs. HVF is not significant for the JBoss Dataset: the percentage of issues predicted using these methods (the case of multiple matches when no majority vote can be reached) is less than 10% of predictions for all FEEDBACK cases, and as low as about 2% for the 100% FEEDBACK. In other words, the percentage of issues predicted using a Weighted Mean (LSL or HVF) is relatively low in comparison to the other methods.
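The following simplified MATLAB sketch shows this decision flow as we read it from Section 3.2 and Fig. 1; the strict-majority rule, the scoresTooLow test and the binaryClusterPredict helper are assumptions introduced for illustration only, not the exact rules of the framework.

% Simplified sketch of the four prediction paths (assumptions noted above).
function e = predictEffort(neighbourEfforts, weights, issue, scoresTooLow)
    if scoresTooLow
        e = binaryClusterPredict(issue);                 % (4) Binary Clustering (hypothetical helper)
    elseif numel(neighbourEfforts) == 1
        e = neighbourEfforts;                            % (2) Single Similarity Match
    else
        [vals, ~, idx] = unique(neighbourEfforts);       % count repeated effort values
        counts = accumarray(idx(:), 1);
        [cmax, imax] = max(counts);
        if cmax > numel(neighbourEfforts) / 2
            e = vals(imax);                              % (1) Majority Voting
        else
            e = sum(weights .* neighbourEfforts) / sum(weights);  % (3) Weighted Mean (LSL or HVF)
        end
    end
end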

To extend the validity of the proposed Effort Prediction Framework, we apply it to an alternate dataset (Codehaus). From Table 4, we can again see clear improvements for all FEEDBACK cases. We would like to point out that, although we used the same Lucene setup, the scoring was particularly misleading for this dataset. This is reflected in the results for the α-kNN of the Base Approach (α = 1), as the cases using the Least Squares Learning weight computation perform worse (since it depends on similarity scores). Although the PRED measures do show improvement as α increases, the AAR metric is affected negatively. This may be explained by a different style of issue documentation in the Codehaus project. We use a subjective method of determining similarities between issues, which depends significantly on issue descriptions and documentation styles. This makes it difficult to track and pinpoint the exact reasons why the effort prediction quality differs from one dataset to another; it also greatly depends on the method and quality of results produced by the semantic search engine. The Codehaus project is less regulated than Redhat JBoss, so we would expect the descriptions and issue updates to be less rigorous and informative. However, as outlined earlier in Section 4.1, we have chosen the Codehaus Dataset because it is one of the very few that provides enough effort data to grant significance in our study. In this case, Binary Clustering greatly helped in improving the results due to its independence of the similarity scores.



Table 4
Comparison between the effort prediction framework and the base approach [1] (Codehaus Dataset).

Method                 AAR (h)   PRED(25) (%)   PRED(50) (%)   RE Max    RE Mean   RE StdDev   FEEDBACK (%)
EPF LSL                5.5       20.0           41.0           140.75    1.96      8.71        100
EPF HVF                5.4       19.3           40.7           140.75    1.92      8.68
BA (kNN, k = 2)        8.8       18.0           35.0           261.50    6.34      25.97
EPF LSL                8.9       27.8           57.4           11.34     1.23      2.13        18
EPF HVF                8.2       24.1           55.6           7.08      1.04      1.50
BA (α-kNN, α = 1)      12.1      22.2           53.7           37.40     2.03      5.44
EPF LSL                11.2      15.39          38.5           191.00    4.44      19.25       48
EPF HVF                8.8       12.6           37.8           265.25    5.06      27.27
BA (α-kNN, α = 0.35)   8.5       14.7           35.7           261.50    4.52      22.31

Table 5
Percentage of issues predicted by each method (Codehaus Dataset).

FEEDBACK (%)   MV (%)   SM (No MV) (%)   MM (No MV) (%)   BC (%)
100            0.7      11.3             6.0              82.0
18             3.7      63.0             33.3             N/A
48             2.1      66.4             31.5             N/A


We can see approximately a 3 h improvement in AAR for the 100% FEEDBACK case, and about a 1% and 5% improvement for the PRED(25) and PRED(50) measures, respectively. For the 18% FEEDBACK case, we see about a 4 h improvement in AAR and approximately a 2% improvement for both of the PRED measures. Finally, for the 48% FEEDBACK, we only see a minor but apparent improvement for PRED(50) (about 2%); however, AAR and PRED(25) suffer slightly (by about 0.3 h and 1%, respectively).

Table 5 shows the percentage of issues predicted by each method for the Codehaus Dataset. In contrast to the JBoss Dataset percentages shown in Table 3, we can see that the percentage of issues predicted by MM (No MV) is much higher, especially in the 18% and 48% FEEDBACK cases, where it reaches up to 33% and 31%, respectively. This is evident when observing the effect of LSL and HVF on the performance of the predictions in Table 4, as compared to Table 2 for the JBoss Dataset. Another interesting observation is that the percentage of issues predicted by Majority Voting for the Codehaus Dataset is much lower than that of the JBoss Dataset: Codehaus reaches a high of about 4% and JBoss a high of about 27%. This shows the different properties of datasets that the approach must handle. However, the percentage of issues predicted by Binary Clustering is very close for both datasets (about 80%).

This section summarized the findings of our research, and outlined the performance evaluations we conducted to assess the effectiveness of our approach. We used the same dataset (JBoss) used by Weiss et al. (Base Approach) to be able to assess the relative performance improvements. Whenever possible, we compared our framework to the Base Approach by showing the effect of adding each of the proposed enhancements independently, and in all cases our approach produced competitive results. Finally, we compared our proposed composite Effort Prediction Framework to the best results produced by the Base Approach, while showing the effect of using each weight computation technique independently. We also applied the same comparison to a different dataset to extend the validity of our results. In both case studies, our approach performed better. The results demonstrate that our composite Effort Prediction Framework performs better than a comparable single-method approach: while the application of each enhancement individually improved the performance of the predictions, combining the different methods improved it even further.

The goal of our study is to show how a composite framework can improve results. We would like to emphasize that the field of effort prediction poses a very difficult problem that has been under scrutiny over the past decade. Referring to the literature and existing research results in this area, no single study has shown the superiority of one method over another for all possible situations and for different datasets. In our study of the related work, the description of the proposed components of the framework and the explanations of the results, we have outlined the limitations of each approach.

6. Conclusion and future work

We have shown that using a composite framework (combining multiple complementary prediction methods) to predict effort is indeed an effective strategy, provided that enough issues exist from which we can construct a historical reference to produce accurate predictions. It is also important to understand how the search engine ranks the matching results in order to apply our methods and obtain more accurate results. For example, in the case of Lucene, there are multiple ways that query terms can be combined which affect the results returned by the search engine, such as the Multi-Field Query which we have used in our approach. Intrinsically, issues have different properties and features that we should exploit to adapt the best approach for prediction. An important feature of our approach is that it can be applied to any issue database that records effort information; it can leverage any information available to build similarities and predict effort. With the power of multiple prediction methods, the weaknesses of each individual approach are remedied and the approach becomes more adaptive. Adaptive approaches are the next step in effort prediction, and we believe more research should be done in this area.

Better methods need to be implemented to identify similarity among issues/bugs in order to utilize the full potential of the SSDE. As for Lucene in particular, more research can go into better ways to combine the different criteria into the search query. Moreover, better training-issue filtering algorithms need to be implemented in order to make better use of Least Squares Learning in computing weights. Exploring alternate SSIE will also be beneficial, as we have seen that any errors in the scoring process can have a significant negative impact on the final results. In general, the area of defect effort prediction is relatively new, and many breakthroughs are awaited before full trust can be placed in such systems.

Overall, in spite of its significance, there has unfortunately been limited progress in this area of research, largely because the problem is inherently complex. Another challenge is to provide a comprehensive comparison covering a larger number of related works, as most of these are based on entirely different frameworks, making such comparisons meaningless, if not impossible. The current article shows a noticeable improvement over one of the most promising results reported in the literature, and also provides new directions for further research in this emerging area.


Acknowledgments

The authors thank Dr. Zimmermann from Microsoft for providing us with the JBoss dataset. The authors also thank the anonymous reviewers for their insightful comments.

References

[1] C. Weiss, R. Premraj, T. Zimmermann, A. Zeller, How long will it take to fix this bug? in: Proceedings of the Fourth International Workshop on Mining Software Repositories, 2007, pp. 1–10.

[2] B. Boehm, V. Basili, Software defect reduction top 10 list, Computer 34 (1) (2001) 135–137.

[3] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, 2003.

[4] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2001.

[5] H. Zeng, D. Rine, Estimation of software defects fix effort using neural networks, in: Proceedings of the International Computer Software and Applications Conference – Workshops and Fast Abstracts, 2004, pp. 20–21.

[6] T. Menzies, NASA KC1 software defect prediction dataset, <http://promise.site.uottawa.ca/SERepository/datasets/kc1.arff>, December 2004.

[7] B. Kitchenham, L. Pickard, S. MacDonell, M. Shepperd, What accuracy statistics really measure, IEE Proceedings – Software 148 (3) (2001) 81–85.

[8] T. Foss, E. Stensrud, B. Kitchenham, I. Myrtveit, A simulation study of the model evaluation criterion MMRE, IEEE Transactions on Software Engineering 29 (11) (2003) 985–995.

[9] Q. Song, M. Shepperd, M. Cartwright, C. Mair, Software defect association mining and defect correction effort prediction, IEEE Transactions on Software Engineering 32 (2) (2006) 69–82.

[10] J. Quinlan, C4.5 Programs for Machine Learning, Morgan Kaufmann, 1993.

[11] E. Frank, L. Trigg, G. Holmes, I. Witten, Naive Bayes for regression, Machine Learning 41 (1) (2000) 5–25.

[12] E. Frank, I. Witten, Generating accurate rule sets without global optimization, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 144–151.

[13] W. Evanco, Prediction models for software fault correction effort, in: Proceedings of the European Conference on Software Maintenance and Reengineering, 2001, p. 114.

[14] W. Evanco, Modeling the effort to correct faults, in: Selected Papers of the Sixth Annual Oregon Workshop on Software Metrics, 1995, pp. 75–84.

[15] T. Sherwood, E. Perelman, B. Calder, Basic block distribution analysis to find periodic behavior and simulation points in applications, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2001, pp. 3–14.

[16] K. Iwata, Y. Anan, T. Nakashima, N. Ishii, Effort prediction model using similarity for embedded software development, in: Proceedings of the International Conference on Computational Science and Applications, 2006, pp. 40–48.

[17] A. Phansalkar, L. John, Performance prediction using program similarity, in: Proceedings of the Standard Performance Evaluation Corporation Benchmark Workshop, 2006.

[18] J. Sun, X. Hui, Financial distress prediction based on similarity weighted voting CBR, Lecture Notes in Computer Science, vol. 4093/2006, Springer, 2006, pp. 947–958.

[19] F. Ramirez, O. Fuentes, R. Gulati, Prediction of stellar atmospheric parameters using instance-based machine learning and genetic algorithms, Experimental Astronomy 12 (3) (2001) 163–178.

[20] E. Hatcher, O. Gospodnetic, Lucene in Action (In Action series), Manning Publications Co., 2004.

[21] M. Shepperd, C. Schofield, Estimating software project effort using analogies, IEEE Transactions on Software Engineering 23 (11) (1997) 736–743.

[22] M. Jørgensen, U. Indahl, D. Sjoberg, Software effort estimation by analogy and regression toward the mean, Journal of Systems and Software 68 (3) (2003) 253–262.

[23] The MathWorks Inc., MATLAB – the language of technical computing, <http://www.mathworks.com/products/matlab>, June 2008.

[24] The Mozilla Organization, Bugzilla, <http://www.bugzilla.org>, June 2008.

[25] Sourceforge, Mantis bug tracker, <http://www.mantisbt.org>, June 2008.

[26] TechExcel, DevTrack – defect tracking, issue tracking, bug tracking, <http://www.techexcel.com/products/devsuite/devtrack.html>, June 2008.

[27] Atlassian, JIRA – bug tracking, issue tracking and project management software, <http://www.atlassian.com/software/jira>, June 2008.

[28] Atlassian, Apache Software Foundation – JIRA system dashboard, <http://issues.apache.org/jira/secure/Dashboard.jspa>, June 2008.

[29] Atlassian, Spring projects issue tracker, <http://jira.springframework.org/secure/Dashboard.jspa>, June 2008.

[30] Atlassian, Flex bug and issue management system, <http://bugs.adobe.com/jira/secure/Dashboard.jspa>, June 2008.

[31] Atlassian, JBoss – JIRA system dashboard, <http://jira.jboss.org/jira/secure/Dashboard.jspa>, June 2008.

[32] The PHP Group, PHP: hypertext preprocessor, <http://www.php.net>, June 2008.

[33] MySQL AB, MySQL: the world's most popular open source database, <http://www.mysql.com>, June 2008.

[34] Sun Microsystems Inc., Developer resources for Java technology, <http://java.sun.com>, June 2008.

[35] The Apache Software Foundation, Welcome to Lucene, <http://lucene.apache.org>, June 2008.