

On Incomplete Bug Fixes in Eclipse and Programmers’ Intuition on These

Sabih Agbaria, Joseph (Yossi) Gil

Department of Computer Science

The Technion - Israel Institute of Technology

Technion City, Haifa 32000, Israel

{ sabih | yogi } @CS.Technion.AC.IL

Abstract

Recent studies indicate that multiple patches to software are found in a hefty portion of resolved bugs. It is also known that bugs that require multiple patches take longer to resolve, that their severity tends to be higher than average, and that they induce programmers to engage more in bug discussions.

This work is concerned with the ability of programmers to predict, at the time a bug is fixed for the first time, that it will be of this sort, and in particular that it may require future patches and greater refixing effort.

A mathematical model is developed for a retrospective analysis of bug maintenance history. In this model we compute the impact of an array of bug properties on the likelihood that a specific bug is chosen, among all open bugs, to receive its first fix. The studies we conduct on a sizable portion of the history of the Eclipse code base indicate that programmers tend to attend first to bugs which are easier to fix.

The results further suggest that among the criteria that programmers apply (probably unknowingly) to determine whether a bug is easy to fix are the number of future patches it will require and the amount of work involved in these patches. This is despite the fact that this information is not supposed to be available to the programmers at the time the first fix is made.

It is anticipated that the method of analysis introduced in this work will have other applications, both in software engineering and outside of computer science.

1. Introduction

The fact that the Merriam-Webster online dictionary recognizes neither the word “refix”¹ nor the word “remend”² may be the result of a very natural and simple logic: if an object is fixed, it ceases to be broken and cannot be fixed again. Software engineers, however, are well familiar with tenacious bugs: those bugs which seem to recur again and again after they have been supposedly fixed, refixed and refixed again.

1. http://www.merriam-webster.com/dictionary/refix

2. http://www.merriam-webster.com/dictionary/remend

Recent studies [12], [14] indicate that multiple patches to software are found in a hefty portion of resolved bugs (17%–45% in the projects studied by Nguyen et al. [12] and 22%–33% in the projects studied by Park et al. [14]).

It was also found [14] that these bugs take longer to resolve, that their severity tends to be higher than average, and that they induce programmers to engage more in bug discussions.

The research community has devoted considerable effort to the study of the prevalence of bug refixing [7], [12], [15], [23] and to the development of tools and methods for dealing with this predicament [1], [8], [9], [12], [16], [21]–[23] (see reference [14] for a survey).

The question that motivated this work is whether programmers are aware that certain bug fixes that they commit are incomplete and that the bug may require additional future fixes, perhaps even to source files which were not included in the fix.

Our hypotheses were that programmers will usually prefer to fix the easy bugs first while delaying the handling of the more problematic bugs, and that, based on intuition or other factors, programmers will classify bugs that require multiple fixes as problematic and hence tend to postpone their fix.

1.1. Method and results

To test these hypotheses we develop a mathematical model for a retrospective analysis of bug maintenance history. The model computes the dependency of the “preference value” of bug fixing on other bug properties. (This value can be thought of as the relative likelihood that a bug will be selected for fixing from among all bugs open at the time.)


We applied this model to a sizable portion of the Eclipse code and bug histories. The results largely confirm these hypotheses, showing that (i) programmers are indeed more likely to attend first to bugs whose first fix requires smaller modifications to fewer source files; and that (ii) the relative likelihood that a bug will receive its first fix is negatively associated with its future history, i.e., the first fix of a bug is likely to be put off if its future will unfold multiple or sizable refixes.

The findings are not to be taken as implying that programmers have supernatural means for predicting the future, and that they employ these means for putting off “trouble”, but rather as suggesting that there are some characteristics of the bug which, perhaps taken together with intuition, generate “bug smells” that distract programmers.

Our work eliminates some obvious candidates for these characteristics, such as first fix size and bug severity; there is exciting work ahead in identifying the predictors of bugs’ tenacity.

1.2. Other applications

One other application of the technique developed here is in the analysis of the bug triaging policy as actually applied by software developers. Despite the use of tools for bug triaging and tracking, developers, particularly in open-source projects but also in commercial projects, may deviate from the declared policy. The staggering popularity of the “release early, release often” slogan may aggravate the problem, since the short user feedback loops generate a stream of bugs which is never exhausted.

Indeed, we found that developers in the Eclipse project do not always prefer bugs of higher priority and severity.

The history analysis techniques introduced and exemplified in this work are expected to be useful in other contexts. For example, we anticipate applications in the analysis of the diverse phenomena which are thought to follow preferential attachment (see Newman’s survey [11]). Interestingly, some of the phenomena that are likely to be explained by preferential attachment and the ensuing power-law are found in the software domain [4], [17], [20]. Other applications may include analysis of medical triage, hidden discrimination against minorities in governmental services, etc.

1.3. Related work

A line of work related to ours began with the work of Kim, Whitehead and James [10], who argued in favor of the study of the time it takes to fix reported bugs, claiming that this metric plays a major role when measuring the software quality of a system, just as the more traditional bug count metric does.

Studies followed suit: Panjer [13] used data mining techniques to predict the time needed to fix bugs with an accuracy of up to 34.9%. Weiss et al. [19] used the average time it took to fix past, textually similar bug-reports to predict how long it will take to fix a newly reported bug.

Later, Giger et al. [6] applied decision tree analysis to investigate the relationships between bug report attributes and the time required for fixing the bug; they found, e.g., that the precision of the prediction could be improved using post-submission data, such as the assignee. In a sense, we continue this work by studying one important factor that contributes to the total time required to fix a bug, namely, when programmers first attend to it.

Bettenburg et al. [3] conducted a survey among developers to find what information available in a bug-report they find the most useful in fixing a bug quickly. The study finds a mismatch between what developers need and what users provide in the bug-report. We study whether this mismatch might lead the developers to rely on other factors. Our work complements theirs in determining these factors not by asking the programmers, but by analyzing their behavior.

Outline. The remainder of this article is organized as follows. The data set used in our measurements and the way it was extracted are described in Section 2. Section 3 defines a metric suite for estimating the effort invested in bug fixing and refixing. The mathematical model behind the history analysis is sketched in Section 4. The results are presented in Section 5: Section 5.1 first employs this model to study the criteria that software developers apply in selecting the next bug to fix. Section 5.2 furthers this study by examining whether the “elusiveness” of a bug plays a role in this selection. Finally, Section 6 concludes with a brief discussion of the results and a highlight of directions for future research.

2. Data set

We focused on changes to the Eclipse source code repository that occurred between April 3rd, 2005 and May 2nd, 2012. This period represents about 60% of the Eclipse development history.

Changes to non-JAVA [2] files, including, e.g., XML information, documentation, icons, images, .jar files and other binary files, were eliminated from our study. In total, the code history tracked involved some three hundred thousand changes conducted by over 150 developers to circa sixty thousand source files.


Recall that Eclipse version management is carried out by CVS, which tracks changes to individual files and not at the transaction level [24]. To make it possible to identify changes, and in particular bug fixes, which pertain to multiple files, we employed the cvs2svn³ tool to convert the raw data into Subversion format, in which co-occurring changes are consolidated into a single commit operation. In total, we identified 58,985 Subversion revisions in this seven-year period, which amounts to about 23 revisions per day on average.

The bug (or bugs) fixed in each commit was then identified (as is standard in the literature [5], [10], [18]) by pattern matching. More information on the bug, most importantly the time it was opened, was then extracted from Eclipse’s BugZilla defect management system. Overall, in the said time period we found 26,689 unique bugs which were fixed at least once by a modification to a JAVA source file.
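To illustrate the pattern-matching step, here is a minimal Python sketch that pulls bug identifiers out of a commit message. The regular expression, the name `extract_bug_ids`, and the restriction to identifiers of four to six digits are assumptions made for the example only; the exact pattern used in the study is not stated in the text.

```python
import re

# Hypothetical pattern (an assumption, not the study's): matches forms such as
# "bug 12345", "Bug #12345", "fixes #12345" and "fix for 12345".
BUG_ID_PATTERN = re.compile(
    r"\b(?:bug|fix(?:ed|es)?(?:\s+for)?)\s*#?\s*(\d{4,6})\b",
    re.IGNORECASE,
)

def extract_bug_ids(commit_message: str) -> list[int]:
    """Return the distinct bug ids mentioned in a commit message, in order of appearance."""
    seen: list[int] = []
    for match in BUG_ID_PATTERN.finditer(commit_message):
        bug_id = int(match.group(1))
        if bug_id not in seen:
            seen.append(bug_id)
    return seen

if __name__ == "__main__":
    print(extract_bug_ids("Bug 123456 - NPE in editor; also fixes #98765"))
```

A commit that names several bugs is attributed to each of them, which matches the "bug (or bugs)" wording above; a message with no recognizable identifier yields an empty list and would simply not count as a bug fix.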

3. Metrics of effort invested in bug fixing and refixing

Let us distinguish between the first fix (or fix for short) of the bug, i.e., the first revision checked into the source control system which is tagged as fixing the bug, and the refix, which consists of zero or more revisions checked in after the first fix which are still tagged as addressing this bug.

We shall measure the amount of effort invested in bug fixing and refixing by the following metrics:

• File count, or $F$, which is the number of source files that were modified in the relevant revision or revisions (a file modified in $r$ different revisions of the refix contributes $r$ to $F$); and,

• Token change, or $T$, defined as the number of JAVA tokens added to or removed from the affected source files (if $t_1$ tokens were added to a certain source file in one revision of a refix and $t_2$ tokens were removed from it in another such revision, then the contribution of this file to $T$ is $t_1 + t_2$; similarly, if $t_3$ tokens were removed from a certain file and $t_4$ tokens were added to another file in the same revision, then these two files contribute $t_3 + t_4$ to $T$).

Two additional metrics pertain to the refix of bugs (but not to their first fix):

• Revision count, or $R$, which is simply the number of revisions that constitute the refix; and,

• Spread, or $S$, defined as the number of distinct files modified in any of the refixing revisions, but left intact during the first fix (the contribution to $S$ of a file modified multiple times during the refix is 1 if the file was not modified in the fix, and 0 otherwise).

3. http://cvs2svn.tigris.org/
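The four definitions above can be made concrete with a short Python sketch. The data layout is an assumption made for illustration: a bug's history is a list of revisions, the first being the fix and the rest the refix, and each revision maps a file name to a pair (tokens added, tokens removed). The function names are hypothetical as well.

```python
from typing import Dict, List, Tuple

# Assumed layout: a revision maps each modified file to (tokens_added, tokens_removed).
Revision = Dict[str, Tuple[int, int]]

def file_count(revisions: List[Revision]) -> int:
    """F: a file modified in r of the given revisions contributes r."""
    return sum(len(rev) for rev in revisions)

def token_change(revisions: List[Revision]) -> int:
    """T: tokens added plus tokens removed, summed over all files and revisions."""
    return sum(added + removed for rev in revisions for added, removed in rev.values())

def effort_metrics(history: List[Revision]) -> Dict[str, int]:
    """Compute the six metrics for one bug; history[0] is the first fix."""
    fix, refix = history[:1], history[1:]
    fixed_files = set(fix[0])
    refixed_files = set().union(*refix)  # distinct files touched by any refix revision
    return {
        "FIX^F": file_count(fix),
        "FIX^T": token_change(fix),
        "REFIX^F": file_count(refix),
        "REFIX^T": token_change(refix),
        "REFIX^R": len(refix),
        "REFIX^S": len(refixed_files - fixed_files),
    }

if __name__ == "__main__":
    history = [
        {"A.java": (10, 2), "B.java": (0, 3)},  # first fix
        {"A.java": (4, 0)},                     # refix, revision 1
        {"A.java": (0, 1), "C.java": (5, 5)},   # refix, revision 2
    ]
    print(effort_metrics(history))
```

In this example A.java is modified in both refix revisions, so it contributes 2 to the refix file count but nothing to the spread, since it was already touched by the first fix; only C.java counts toward the spread.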

We apply superscript notation to these, and will be discussing the quantities FIX^F, FIX^T, REFIX^F, REFIX^T, REFIX^R, and REFIX^S in the sequel. In extracting the data set we compute the values of all of these metrics for each of the bugs, and associate their values with each revision in which the bug was fixed.⁴

Figure 3.1 depicts, on a doubly-logarithmic scale, the complementary cumulative distribution function (CCDF) of the total number of revisions required to fix a bug, that is, REFIX^R + 1.

Figure 3.1. Complementary cumulative distribution of REFIX^R + 1 (the total number of revisions required to fix a bug)

We see that at least a quarter of the bugs were not fixed at the first revision in which they were attended to, and that the fix of about one percent of the bugs required five revisions or more. The CCDF seems to follow a straight line in the logarithmic plane, which suggests that the metric REFIX^R + 1 obeys a power-law.
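For readers who wish to reproduce such a plot, the following Python sketch computes an empirical CCDF from a list of observed values. It takes the CCDF at v to be the fraction of observations that are at least v; this convention, the function name `ccdf`, and the sample data are assumptions made for the example.

```python
from collections import Counter

def ccdf(values):
    """Return sorted (v, fraction of observations >= v) pairs for the given sample."""
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations that are >= the current value
    points = []
    for v in sorted(counts):
        points.append((v, remaining / n))
        remaining -= counts[v]
    return points

if __name__ == "__main__":
    # Hypothetical sample of REFIX^R + 1 values for eight bugs.
    total_revisions = [1, 1, 1, 2, 2, 5, 1, 1]
    for v, p in ccdf(total_revisions):
        print(v, p)
```

Plotting these pairs with both axes logarithmic gives a picture of the kind shown in Figure 3.1; a power-law shows up there as an approximately straight line.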

What is the distribution of bug fixing effort in the first attempt? Figure 3.2 and Figure 3.3 show, respectively, the CCDF of FIX^F and FIX^T.

We see that FIX^F is heavy tailed and seems to follow a power-law. On the other hand, FIX^T exhibits faster than polynomial decay. In fact, Figure 3.4, which redraws Figure 3.3 on a semi-logarithmic scale, indicates that FIX^T decays exponentially in the majority of the range it spans.

To conserve space we state, without attaching figures, that the distribution of the other file-based metrics (REFIX^F and REFIX^S) is power-law like. Similarly, we state that the decay of REFIX^T resembles an exponential.

4. Note that the metrics we use are somewhat rough; a more accurate measure of the effort would be the Levenshtein distance between the sources, possibly accounting for renaming. Also, one can argue that producing a revision that applies a major modification to one file and a tiny modification to another requires less effort than a revision in which the amount of work is balanced, etc. The weeks of processing time required to process our huge data set made such extensions infeasible.

We expect that the metrics FIX^F and FIX^T will play a role in the decision to make the first fix to a bug. What is more interesting is the impact on this decision that the metrics REFIX^F, REFIX^T, REFIX^R, and REFIX^S may have. The reason is that the values of these metrics can only be computed by examining the bug’s future. If such an impact is demonstrated, then we may venture saying that programmers have some means for estimating that some bugs would require more refixes than others.

Figure 3.2. Complementary cumulative distribution of FIX^F (number of files changed in the first revision fixing a bug)

Figure 3.3. Complementary cumulative distribution of FIX^T (number of tokens changed in the first revision fixing a bug)

Figure 3.4. A redraw of Figure 3.3 on a semi-logarithmic scale

An immediate concern is that this estimation is a result of correlation between the effort required for the first fix of a bug, and the number of refixes and the effort required for these. In judging this threat to validity, we computed the Pearson correlation of all pairs of our effort metrics. The results are tabulated in Table 1.

          FIX^F  FIX^T  REFIX^F  REFIX^T  REFIX^R  REFIX^S
FIX^F      1.00   0.15     0.15     0.02     0.05     0.06
FIX^T      0.15   1.00     0.06     0.01     0.03     0.01
REFIX^F    0.15   0.06     1.00     0.07     0.15     0.39
REFIX^T    0.02   0.01     0.07     1.00     0.35     0.41
REFIX^R    0.05   0.03     0.15     0.35     1.00     0.33
REFIX^S    0.06   0.01     0.39     0.41     0.33     1.00

Table 1. Pearson correlation of metrics of bug fixing effort

Examining the correlation values between metrics of the first fix and metrics of the refix, we see that they are all very low: none exceeds 0.15.

4. Mathematical model for preference

Our analysis employs an abstract model of preference in which the behavior of a chooser (which can be an individual, an organization, or even a non-human actor) is described as a sequence of $n$ random selections, each made in its own probability space.

At any point in this sequence, the chooser is presented with a universe $\mathbf{u}$ of entities, each characterized by a value of some property. The values of this property are not necessarily numerical; they may be ordered pairs, triples, tuples of any size, or encode the values of multiple properties in any other way. The chooser then selects precisely one of the elements in $\mathbf{u}$. The crucial point is that the selection is made at random, where the underlying distribution depends only on the values of this property.

In our case, the entities are open bugs, while the chooser may be thought of as a software manager presented with the current set of bugs and asked to decide which of these to deal with. While there are many factors that may influence this decision, our abstract model concentrates on the values of a certain measurable property (e.g., the effort required to fix a bug), or even a combination of such properties, and models the influence of the remaining factors on the selection as being random on average.

Assume that $V$, the set of values of the said property, is finite, i.e.,

\[ V = \{v_1, \ldots, v_\ell\}. \]

For $v \in V$, let $u_v$ be the number of entities in $\mathbf{u}$ whose value is $v$. Also, let $x_v \ge 0$ be the preference value of $v$, i.e., the weight assigned to entities with value $v$ in the random selection process. Then, the probability of selecting an element from $\mathbf{u}$ whose property value is $v$ is simply $u_v x_v$, the weight of elements with value $v$, divided by $\sum_{v' \in V} x_{v'} u_{v'}$, the total weight of all elements, that is,

\[ \frac{u_v x_v}{\sum_{v' \in V} x_{v'} u_{v'}}. \tag{1} \]

The challenge at hand is determining $\mathbf{x} = (x_1, \ldots, x_\ell)$, the vector of unknown preference values. (Clearly, this vector can only be determined up to an arbitrary scaling factor.)
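In the forward direction, i.e., when the preference values are given, expression (1) is straightforward to evaluate. The Python sketch below does so for a single universe; the function name and the numbers are hypothetical and serve only to illustrate the formula.

```python
def selection_probabilities(u, x):
    """Expression (1): u[j] is the number of entities with value v_j, x[j] its preference value.

    Returns, for each value v_j, the probability that the selected entity has that value.
    Assumes at least one entity has positive weight.
    """
    weights = [u_j * x_j for u_j, x_j in zip(u, x)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Hypothetical universe: 10 entities of the first value, 5 of the second, 1 of the third.
    u = [10, 5, 1]
    x = [1.0, 2.0, 10.0]
    print(selection_probabilities(u, x))
```

With these numbers the three values are equally likely to be selected even though their preference values differ by a factor of ten, because the abundance of the low-preference entities offsets their low weight. Multiplying all entries of `x` by the same constant leaves the result unchanged, which is the scaling freedom noted above.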

The relative magnitude of the preference values is telling of the extent of dependency of the selection on the property. If all the values in $\mathbf{x}$ are equal, then the selection process does not depend at all on the property under inspection. The greater the variety of these values, the greater the dependency of the selection on the values the property assumes.

Note that preferential attachment is a special case of our model, in which $V$ is a set of non-negative integers and we expect to be able to show that $x_v \propto (v + b)$ for some positive constant $b$.

The difficulty in determining $\mathbf{x}$ is that a single random choice is useless for inferring anything about the underlying probability space, in the same way that the result of a single coin flip tells nothing of the coin’s bias. Moreover, unlike the coin flip example, our model does not permit repetitions of the experiment under equal conditions.

4.1. A system of equations for the unknown preference values

What we shall do instead is consider together the universes $\mathbf{u}_1, \ldots, \mathbf{u}_n$ that occur in the sequence and use these to write and solve a system of equations for $\mathbf{x}$. We write universe $\mathbf{u}_i$ as an $\ell$-vector $\mathbf{u}_i = (u_{i,v_1}, \ldots, u_{i,v_\ell})$, or, for brevity, as $\mathbf{u}_i = (u_{i,1}, \ldots, u_{i,\ell})$, and define an $n \times \ell$ matrix $U$ in which the $i$th row is the vector $\mathbf{u}_i$:

\[ U = \begin{pmatrix} \mathbf{u}_1 \\ \vdots \\ \mathbf{u}_n \end{pmatrix}. \tag{2} \]

Also, let the $n$-vectors $\mathbf{w}_j$, $j = 1, \ldots, \ell$, be the columns of $U$, i.e., $U = (\mathbf{w}_1, \cdots, \mathbf{w}_\ell)$.

For $i = 1, \ldots, n$ and $j = 1, \ldots, \ell$, let $\delta_{i,j}$ be the binary random variable assuming the value 1 if the chooser selects at step $i$ an entity with value $v_j$, and 0 otherwise. Then, the expected value of $\delta_{i,j}$ is simply $u_{i,j} x_j / (\mathbf{u}_i \cdot \mathbf{x})$ (where “$\cdot$” denotes the dot product of vectors). Also, for $j = 1, \ldots, \ell$, define the counting random variable

\[ t_j = \sum_{i=1}^{n} \delta_{i,j}, \tag{3} \]

that is, $t_j$ is the total number of times a value $v_j$ was selected. By linearity of expectation,

\[ E[t_j] = \sum_{i=1}^{n} E[\delta_{i,j}] = \sum_{i=1}^{n} \frac{u_{i,j} x_j}{\mathbf{u}_i \cdot \mathbf{x}} = \left( \sum_{i=1}^{n} \frac{u_{i,j}}{\mathbf{u}_i \cdot \mathbf{x}} \right) x_j. \tag{4} \]

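We can even write

\[ E[t_j] = (\mathbf{w}_j \cdot \mathbf{y})\, x_j \tag{5} \]

where

\[ \mathbf{y} = \left( \frac{1}{\mathbf{u}_1 \cdot \mathbf{x}}, \cdots, \frac{1}{\mathbf{u}_n \cdot \mathbf{x}} \right), \tag{6} \]

or, by agreeing that division of vectors is taken pointwise, $\mathbf{y} = \mathbf{1} / (U \cdot \mathbf{x})$ (where $\mathbf{1}$ is an $n$-vector whose entries are all 1, and where “$\cdot$” denotes the usual multiplication of a matrix by a vector).

Now all $\ell$ equations (5) can be written concisely as

\[ E[\mathbf{t}] = \left( U^{\mathrm{tr}} \cdot \mathbf{y} \right) \circ \mathbf{x} = \left( U^{\mathrm{tr}} \cdot \frac{\mathbf{1}}{U \cdot \mathbf{x}} \right) \circ \mathbf{x}, \tag{7} \]

where “$\circ$” denotes pointwise multiplication of vectors.

The above system has $\ell$ equations for the $\ell$ unknowns $\mathbf{x}$, where matrix $U$ is given by the problem specification. Vector $\mathbf{t}$ is also known, since each entry in it represents the number of times an entity with a specific property value was selected. We have no practical means for computing the expectation of $\mathbf{t}$, though; but if the individual values in it are not too small, we shall use the approximation $\mathbf{t} \approx E[\mathbf{t}]$.

4.2. A fixed point numerical solution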

The high degree of the polynomials in the system (7) precludes an analytical solution. A numerical solution based on the fixed point method is feasible, though. Rewrite (7) in the form

\[ \mathbf{x} = E[\mathbf{t}] \Big/ \left( U^{\mathrm{tr}} \cdot \frac{\mathbf{1}}{U \cdot \mathbf{x}} \right) \]

(with division again taken pointwise) and define $\mathbf{x}_{k+1}$, the $(k+1)$th approximation of $\mathbf{x}$, based on $\mathbf{x}_k$, the previous such approximation:

\[ \begin{aligned} \mathbf{x}_{k+\frac{1}{2}} &= E[\mathbf{t}] \Big/ \left( U^{\mathrm{tr}} \cdot \frac{\mathbf{1}}{U \cdot \mathbf{x}_k} \right), \\ \mathbf{x}_{k+1} &= \mathbf{x}_{k+\frac{1}{2}} \big/ \left| \mathbf{x}_{k+\frac{1}{2}} \right|. \end{aligned} \tag{8} \]

We applied (8) in this study with excellent results, finding a solution to (7) in our domain with error not exceeding $10^{-15}$ after a few dozen iterations whose total CPU time is measured in seconds. In fact, Amnon J. Meir and Irad Yavneh (personal communication) proved that convergence to the unique solution is guaranteed under broad conditions. An inevitable error, though, is due to the substitution (common to all statistical analyses of this sort) of $E[\mathbf{t}]$ by $\mathbf{t}$.
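The iteration (8) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions and not the code used in the study: the function names are hypothetical, the norm in the second line of (8) is taken to be Euclidean (the text does not say which norm is meant), every universe is assumed non-empty, and every property value is assumed to occur in at least one universe.

```python
import math

def expected_counts(U, x):
    """Right-hand side of (7): E[t_j] = x_j * sum_i u_ij / (u_i . x)."""
    y = [1.0 / sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]
    return [x[j] * sum(row[j] * y_i for row, y_i in zip(U, y)) for j in range(len(x))]

def solve_preferences(U, t, iterations=200):
    """Fixed-point iteration (8).

    U: list of n universes, each a list of ell counts (u_i1, ..., u_i_ell).
    t: list of ell selection counts, used in place of E[t].
    Returns the preference vector scaled to unit Euclidean norm (an assumed choice of norm).
    """
    n, ell = len(U), len(t)
    x = [1.0 / math.sqrt(ell)] * ell  # start from equal preference values
    for _ in range(iterations):
        # y_i = 1 / (u_i . x)
        y = [1.0 / sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]
        # (U^tr . y)_j = w_j . y
        denom = [sum(U[i][j] * y[i] for i in range(n)) for j in range(ell)]
        half = [t[j] / denom[j] for j in range(ell)]      # x_{k+1/2}
        norm = math.sqrt(sum(h * h for h in half))
        x = [h / norm for h in half]                      # x_{k+1}
    return x

if __name__ == "__main__":
    # Hypothetical data: three universes over two property values, true preferences 1 : 3.
    U = [[10, 5], [3, 8], [6, 6]]
    t = expected_counts(U, [1.0, 3.0])  # use exact expectations to isolate the solver
    x = solve_preferences(U, t)
    print(x[1] / x[0])
```

Feeding the solver the exact expectations, as in the usage example, separates the numerical error of the iteration from the statistical error of replacing $E[\mathbf{t}]$ by $\mathbf{t}$; with observed counts only the latter remains, as noted above. The fixed iteration count is a simplification; a production version would stop once successive approximations agree to the desired tolerance.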


5. Results

In this section, we study what we shall call the preference value of bugs, defined as the relative likelihood of a specific open bug that has never been fixed before to be fixed in the next revision. The study will employ the model described in the previous section to compute the dependency of the preference value (denoted $\mathbf{x}$) on the user-defined severity and priority labels and on the fixing effort metrics defined above in Section 3.

5.1. Impact of bug properties and the first fix effort on the selection of the next bug to fix

Having established the mathematical model for the analysis of histories, we are ready to present our empirical results. Our interest lies in the criteria that programmers apply in choosing which bug to fix. In the terminology of the previous section, a universe is the set of open bugs which have never been fixed. The selection points are check-in operations (commits) of a revision which fixes a bug for the first time. We would like to determine which factors are involved in selecting, from among all bugs open at the time, the next bug to fix.

To this end, we fix a property and classify the open bugs by this property, and then compute the vector $\mathbf{x}$ of what might be called bug preference values. Figure 5.1 depicts the preference values when the distinguishing property is bug severity.

Figure 5.1. Preference value vs. severity

Vector $\mathbf{x}$ is portrayed here and henceforth by scaling it so that its maximal value is 100%. In this figure, this maximum is (unsurprisingly) obtained for bugs of severity blocker.

In addition, each column in the figure is adorned by the value of $t_j$ (see previous section) that was used in computing its preference value. In Figure 5.1, we see for example that blocker bugs were fixed 216 times. Recall that although the computation of the preference value is numerically very accurate, it may introduce errors due to the substitution of $t_j$ for $E[t_j]$. The relative error of this substitution is in the order of $1/\sqrt{t_j}$.
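To get a feel for the size of this error, take the blocker column of Figure 5.1, for which $t_j = 216$:

\[ \frac{1}{\sqrt{t_j}} = \frac{1}{\sqrt{216}} \approx \frac{1}{14.7} \approx 0.068, \]

i.e., a relative error of roughly 7% for that column. Columns with a smaller $t_j$ carry a correspondingly larger uncertainty.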

Note the Kendall τ rank correlation coefficient⁵ and its significance value⁶ depicted in the figure. We see that overall, more severe bugs tend to be attended to first, and that this tendency is statistically significant at the 95% level.
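For concreteness, the Python sketch below computes a Kendall τ coefficient by counting concordant and discordant pairs, and a significance value by exhaustive enumeration of permutations, as is feasible for small sets. It is a simplified illustration: it uses the tie-free variant of the coefficient (often called τ-a), the p-value is two-sided, and the function names and sample numbers are hypothetical; the text does not specify these choices.

```python
from itertools import permutations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant pairs - discordant pairs) / (number of pairs)."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (s > 0) - (s < 0)  # +1 if concordant, -1 if discordant, 0 if tied
    return score / (n * (n - 1) / 2)

def exact_p_value(xs, ys):
    """Two-sided p-value: fraction of permutations of ys whose |tau| is at least the observed one.

    Enumerates all len(ys)! permutations, so it is only practical for small sets.
    """
    observed = abs(kendall_tau(xs, ys))
    total = extreme = 0
    for perm in permutations(ys):
        total += 1
        if abs(kendall_tau(xs, perm)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

if __name__ == "__main__":
    # Hypothetical data: severity rank vs. preference value (in percent).
    severity_rank = [1, 2, 3, 4, 5, 6, 7]
    preference = [12.0, 20.0, 41.0, 38.0, 44.0, 47.0, 100.0]
    print(kendall_tau(severity_rank, preference), exact_p_value(severity_rank, preference))
```

A perfectly increasing sequence yields τ = 1 and a perfectly decreasing one yields τ = −1. For seven categories, such as the BugZilla severity levels, the exhaustive count visits 5040 permutations, which is instantaneous.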

Studying the actual preference values, we learn the somewhat surprising fact that the preference values of critical, major, normal and trivial bugs are quite similar. The preference values of minor and enhancement bugs are, however, significantly smaller than these.

Preference values vs. bug priority are depicted in Figure 5.2.

Figure 5.2. Preference value vs. priority

Next we study the impact of bug-fix effort on the likelihood of the bug being selected to be fixed. Figure 5.3 compares the preference values of bugs which required modification of different numbers of source files. We see that overall, programmers prefer to fix first bugs which require fewer files to fix (τ = −0.64); however, the dependency of the preference values on the number of files is not too strong: the lowest preference is about 60% of the highest.

The same phenomenon is found when we measure the effort by tokens rather than files, as can be seen in Figure 5.4. (Observe that since FIX^T spans quite a large range and any given value of FIX^T is typically assumed by a very small number of bugs, preference values were computed with respect to a binned value of FIX^T. Since the decay of FIX^T is close to exponential (see Figure 3.4), binning of this metric was by its logarithm.)
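Logarithmic binning of this kind can be sketched as follows. The base of the logarithm and the treatment of a zero token change are not stated in the text; the sketch assumes base 2 and places zero in a separate bin, and the function names are hypothetical.

```python
from collections import Counter

def log2_bin(tokens: int) -> int:
    """Bin index floor(log2(tokens)) for tokens >= 1; tokens == 0 falls in bin -1.

    Uses int.bit_length, which is exact for integers and avoids floating-point
    rounding at powers of two.
    """
    return tokens.bit_length() - 1

def bin_counts(token_changes):
    """Number of bugs falling in each logarithmic bin of the token-change metric."""
    return Counter(log2_bin(t) for t in token_changes)

if __name__ == "__main__":
    # Hypothetical FIX^T values for eight bugs.
    print(sorted(bin_counts([0, 1, 2, 3, 4, 7, 8, 1000]).items()))
```

Each bin then plays the role of one property value $v_j$ in the model of Section 4, and the number of first fixes falling in the bin is the corresponding $t_j$.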

5. Recall that this coefficient is a non-parametric statistic of a set of value pairs; the coefficient ranges between −1, in the case that the set represents a monotonically decreasing function, and 1, in the case that this function is monotonically increasing.

6. Computed by exhaustive counting of all permutations for small sets, and by the standard approximation for large sets.


Figure 5.3. Preference value vs. FIX^F (number of files changed in the first revision fixing the bug)

Figure 5.4. Preference value vs. FIX^T (number of tokens added in the first revision fixing the bug)

The figure tells us that programmers tend to attend first to bugs which can be corrected by fewer additions: the highest preference value this time is about three times greater than the lowest; the absolute value of the Kendall τ coefficient is greater; and it is even more statistically significant.

5.2. Impact of future refixing effort on the selection of the next bug to fix

We now turn to the study of the impact of metrics of the refixing effort on the preference value. We continue to concentrate on the preference of the first fix of a bug.

Figure 5.5 depicts this value vs. REFIX^R. The figure demonstrates that programmers generally prefer fixing those bugs which require fewer future refixing revisions, and the statistical significance of this trend is at least 95%. The relative difference of the preference levels is not meager: bugs with no further refixes are about two times more likely to be fixed than bugs with three future refixes.

Figure 5.5. Preference value vs. REFIX^R (number of future refixing revisions)

Figure 5.6 depicts the dependency of the preference value of a bug on REFIX^S, the number of files modified in refixes but not in the original fix. Although the figure may visually suggest that programmers tend to prefer bugs whose REFIX^S value is smaller, we cannot ascribe statistical significance to this claim.

Figure 5.6. Preference value vs. REFIX^S (number of files modified during refixes, but left intact during initial fix)

The plot of the preference value against REFIX^F (Figure 5.7) does not reveal that programmers tend to prefer fixing bugs which will require fewer future file modifications. In fact, the close to zero Kendall coefficient (τ = −0.05) suggests that programmers are not able to distinguish between bugs based on this criterion.

Figure 5.7. Preference value vs. REFIX^F (number of file modification operations during refixes)

The impact of REFIX^T, the fourth metric of refixing effort, on the preference value for bug fixing is shown in Figure 5.8.

Figure 5.8. Preference value vs. REFIX^T (number of tokens added during refixes)

Evidently, the impact of this metric is major: the Kendall τ coefficient of −0.56 is significant at the 99.9% level, and the preference values span a large multiplicative range.

We conclude this section by drawing attention toan interesting phenomenon: All figures presented here,and to a limited extent, also Figure 5.3 and Figure 5.4in the previous section, exhibit an increase of the pref-erence value at the extreme right of the figure, i.e., itseems as if bugs whose fixing and refixing effort is thehighest are more likely to be fixed than what we wouldexpect by the preference value of bugs with high, buta bit more moderate effort. Admittedly, the preferencevalue is never fully monotonic, and sporadically thereare “spikes” which may require deeper research, e.g.,the high preference value in FIXT = 2 (Figure 5.4).Still, the increase in the preference value at the rightend of the spectrum seems to follow a rather consistentpattern.

We were unable to explain or even pinpoint this phenomenon. Further research is called for here to study the nature of the class of bugs which require high refixing effort yet are quite attractive to programmers in the first fix. The small number of bugs (examine the tj values) in the right-most column of the figures suggests that this research can make use of manual inspection.

6. Conclusions, discussion, and directions for further research

We now further the discussion of our findings on the preference value, draw conclusions, and point out directions for further research.

First, note that due to scaling, any single x that is computed with respect to some fixed property is meaningless when considered in isolation. What is important is whether this value is large or small in comparison with the other values computed with respect to this property.

We have seen, for example, that programmers’ preference for first fixing blocker bugs is at least twice that of all other severity levels (Figure 5.1). This finding does not mean that bugs of lower severity go unattended whenever blocker bugs are open. The reason is that in Eclipse, as in any other large project, there are many individuals engaged in maintenance and in further development of the project.

The large preference value of blocker bugs is to be interpreted with respect to the team of developers, rather than to any individual in it. The team, as a whole, is twice as likely to fix blocker bugs as other bugs. The team’s preference does not, however, preclude individual members from investing their resources in other bugs and in product enhancement.

It is interesting to see that the impact of the other severity levels on the preference value is quite minimal. We leave for further research the verification of the obvious conjecture that in the eyes of software developers (at least those involved in the Eclipse project) the seven different severity levels defined by BugZilla collapse into three: blocker, ordinary, and enhancement/minor.

A similar conjecture is raised by Figure 5.2. It seems as if there are only two priority levels that matter: ordinary priority, corresponding to levels P1, P2, and P3, and low priority, corresponding to levels P4 and P5.

Observe that Figure 5.1 exhibits an inversion, in that bugs of minor severity receive about two thirds of the attention of bugs marked trivial, despite the “official” definition of severity levels in BugZilla, in which minor bugs are more severe than trivial bugs.

This anomaly may suggest that bugs are marked trivial not because they are of trivial severity (as they should be), but rather because they require a trivial amount of work (as the name suggests). We found that, on average, FIXT of minor bugs was twice as high as that of trivial bugs, but clearly more research is required to evaluate the quality of severity and priority judgments of bugs.

Second, recall that the computed x values include a statistical error. This means that minute differences in x values, e.g., the difference between REFIXR = 4 and REFIXR = 5 (Figure 5.5), are to be disregarded. Such small changes are likely to be the fruit of a statistical fluctuation.

Our extensive use of the Kendall τ rank correlation coefficient is designed to overcome statistical errors of this sort. In all of our findings τ was negative, and values closer to −1 represent a more consistent decrease of the x value with the increase of the property under inspection. The statistical significance level of the τ value tells us how likely it is that this trend is the result of mere coincidence. What matters more than any single x value is the trend that the x values exhibit over the entire spectrum of the inspected property.
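To make the role of the rank correlation concrete, the following is a minimal sketch of the Kendall τ computation (the tau-b variant, which tolerates ties). It is not the analysis code used in this study: the function name `kendall_tau_b` and the sample values are hypothetical and were chosen only to show how a mostly decreasing, non-monotonic sequence of preference values still yields a clearly negative τ.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b rank correlation of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both variables move in the same direction
        else:
            discordant += 1  # the variables move in opposite directions
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x) * (untied + ties_y))
    if denominator == 0:
        raise ValueError("tau is undefined when a variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical data (NOT taken from the figures): a property value per
# column and the preference value computed for that column.
property_value = [0, 1, 2, 3, 4, 5]
preference_value = [1.9, 1.4, 1.5, 0.9, 0.6, 0.8]

tau = kendall_tau_b(property_value, preference_value)
print(f"{tau:.3f}")  # prints -0.733
```

Two of the fifteen pairs in this toy data are "spikes" that go against the trend, yet τ remains strongly negative, which is why a rank statistic is less sensitive to small local fluctuations than a comparison of adjacent x values.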

By examining these Kendall τ values in Figure 5.3 and Figure 5.4, we can say with good certainty that the development team generally prefers dealing first with bugs that require less effort (i.e., lower FIXF and FIXT), as in the famous shortest-processing-time-first scheduling rule. Moreover, our findings also suggest that programmers are able to appreciate the programming effort (the flow of incoming bugs in our data set seems steady).
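For readers unfamiliar with the scheduling rule mentioned above, the sketch below illustrates it on a single worker: processing jobs in order of increasing duration minimizes the mean completion time. The job durations and the function names are hypothetical, and the sketch says nothing about how Eclipse developers actually choose bugs; it only shows the rule that the observed preference resembles.

```python
from itertools import permutations


def mean_completion_time(processing_times):
    """Mean completion time when jobs run back to back in the given order."""
    elapsed = 0
    total = 0
    for duration in processing_times:
        elapsed += duration  # moment at which this job completes
        total += elapsed
    return total / len(processing_times)


def spt_order(processing_times):
    """Shortest-processing-time-first order: simply sort by duration."""
    return sorted(processing_times)


# Hypothetical effort estimates for four open bugs (arbitrary units).
jobs = [5, 1, 3, 2]

print(mean_completion_time(jobs))             # prints 7.75
print(mean_completion_time(spt_order(jobs)))  # prints 5.25

# Brute-force check over all orders, feasible only for tiny inputs.
best = min(mean_completion_time(order) for order in permutations(jobs))
```

The brute-force minimum over all 24 orders equals the value obtained by sorting, as the rule predicts.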

More importantly, the statistically significant τ value in Figure 5.8 tells us that the development team, as a whole, has a good appreciation of the amount of future programming effort for refixing the bugs, even at the time the bug is supposedly fixed, and that the team tends to delay fixing bugs which are likely to require more refixing effort.

Further, the statistically significant τ value in Figure 5.5 tells us that the development team possesses an ability to detect recalcitrant bugs, i.e., bugs that would require more refixing iterations (high REFIXR), and that the team has a tendency to put off the initial fix of these bugs.

We know from Table 1 that effort spent in the first fix is a poor indicator of future fixing effort. Further, previous research tells us that bugs that require refixes are more likely to be of higher severity. But since bugs of higher severity are more likely to be treated first (Figure 5.1), whereas bugs with high REFIXR tend to be put off, we can conclude that severity cannot be the predictor of REFIXR on which programmers rely. More research is therefore called for to identify the means by which programmers can estimate the amount of refixing effort and the number of refixing iterations.

This research direction may even lead to a translation of the programmers’ ability to detect incomplete bug fixes into automatic tools that would point out such fixes to software developers, and perhaps even propose ideas on how to make the fix more complete.

Third, we ask whether it is possible to extract useful information from Figure 5.6 and Figure 5.7 (representing, respectively, the dependency of x on REFIXS and REFIXF) despite the Kendall τ values being statistically insignificant.

Recall that the relative error in the tj value is proportional to 1/√tj. Re-examining the figures, we see that the tj value in the first three columns of these is no less than 859, hence the error in E[t] is in the order of 3%. It makes sense to assume that the error in the corresponding x value is similarly small. (In fact, the iterative method of finding the x values, i.e., equations (8), suggests that the error in xj is linear in the error in tj.)
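The arithmetic behind these error estimates is easy to reproduce. The sketch below assumes a proportionality constant of exactly 1, i.e., that the relative error of a count tj is 1/√tj, which is the usual rule of thumb for counting statistics; the text only states proportionality, so the constant is an assumption. The function name `relative_count_error` is hypothetical. The two inputs are the tj values quoted in this section: 859 for the first three columns, and 33 for the REFIXT = 1 column discussed with the “spikes”.

```python
from math import sqrt


def relative_count_error(t_j):
    """Relative error of a count t_j, ASSUMING it equals 1 / sqrt(t_j)."""
    if t_j <= 0:
        raise ValueError("t_j must be a positive count")
    return 1 / sqrt(t_j)


for t_j in (859, 33):
    print(f"t_j = {t_j}: relative error of about {relative_count_error(t_j):.1%}")
# prints:
# t_j = 859: relative error of about 3.4%
# t_j = 33: relative error of about 17.4%
```

Both results agree with the rounded figures of 3% and 17% quoted in the text.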

We see that the difference in the respective x values is not meager. The first three columns of the figures therefore suggest that the development team has some rough estimate of REFIXF and REFIXS.

The crude estimate that REFIXF > 0 is not interesting, since it can be explained by the estimate of REFIXR that we have already demonstrated. However, we believe it is remarkable that programmers have the intuition that their bug fixes may be incomplete and require touching files not included in the first fix.

Fourth, we should mention the “spikes” seen occasionally in the figures, the most prominent of which are the high values of x in Figure 5.8 for REFIXT = 1 and REFIXT = 4. These may be attributed to two factors:

• individual statistical error, which is larger for small values of tj, e.g., in the case REFIXT = 1, we have tj = 33 and the error is in the order of 17%; and,

• the fact that together the figures represent a large number of computed x values, which increases the probability that one or two of these would exhibit a statistically significant deviation from its expectation.

It should be assumed therefore that re-doing the analysis on a number of major code bases and their accompanying bug databases is likely to remove these spikes.
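The second factor above can be quantified with a back-of-the-envelope calculation. The sketch below assumes that the computed x values deviate independently of one another and that each has the same small chance of showing a seemingly significant deviation; both assumptions are simplifications, and the numbers used (40 values, a 5% chance each) are purely hypothetical rather than counts taken from the figures.

```python
def prob_at_least_one_deviation(num_values, alpha):
    """Chance that at least one of num_values independent quantities shows
    a deviation that, taken alone, would occur with probability alpha."""
    if num_values < 0 or not 0 <= alpha <= 1:
        raise ValueError("need num_values >= 0 and 0 <= alpha <= 1")
    return 1 - (1 - alpha) ** num_values


# Hypothetical setting: 40 computed x values, 5% chance of a spike in each.
p = prob_at_least_one_deviation(num_values=40, alpha=0.05)
print(f"{p:.2f}")  # prints 0.87
```

Under these assumptions, a spike somewhere in the figures is far more likely than not, even if no individual x value is systematically off.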

There remains an anomaly in the data which is not likely to be attributed to the above two factors: Figure 5.2 suggests that bugs at the ordinary priority level (P3) tend to receive (slightly) greater attention than bugs at the second highest priority level (P2). Although the difference is not huge, further scrutiny may be in order here.

Acknowledgments. We are grateful to Ahmed E. Hassan and Tom Zimmermann for providing tips on the analysis of the Eclipse code and bugs databases. Inspiring correspondence with Daniel Bilar is gratefully acknowledged.

References

[1] J. Andersen and J. Lawall. Generic patch inference. In Proc. of the 23rd IEEE/ACM International Conference on Automated Software Engineering (ASE’08), pages 337–346. IEEE Computer Society, Sept. 2008.

[2] K. Arnold and J. Gosling. The Java Programming Language. The Java Series. Addison-Wesley, Reading, Massachusetts, 1996.


[3] N. Bettenburg, S. Just, A. Schroter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In Proc. of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT ’08/FSE-16, pages 308–318, Atlanta, Georgia, 2008. ACM Press.

[4] T. Cohen and J. Gil. Self-calibration of metrics of Java methods. In Proc. of the 37th Int. Conf. on Technology of OO Lang. and Sys. (TOOLS’00 Pacific), pages 94–106, Sydney, Australia, Nov. 20-23 2000. Prentice-Hall.

[5] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proc. of the International Conference on Software Maintenance, ICSM’03, pages 23–32, Amsterdam, Netherlands, 2003. IEEE Comput. Soc.

[6] E. Giger, M. Pinzger, and H. Gall. Predicting the fix time of bugs. In Proc. of the 2nd International Workshop on Recommendation Systems for Software Engineering, RSSE ’10, pages 52–56, Cape Town, South Africa, 2010. ACM Press.

[7] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? In the 32nd ACM/IEEE International Conference on Software Engineering, volume 1 of ICSE’10, pages 55–64. ACM Press, 2010.

[8] A. E. Hassan and R. C. Holt. Predicting change propagation in software systems. In the 20th IEEE International Conference on Software Maintenance, ICSM’04, pages 284–293. IEEE Computer Society, 2004.

[9] M. Kim, S. Sinha, C. Görg, H. Shah, M. Harrold, and M. Nanda. Automated bug neighborhood analysis for identifying incomplete bug fixes. In the 3rd International Conference on Software Testing, Verification and Validation (ICST’10), pages 383–392. IEEE Computer Society, Apr. 2010.

[10] S. Kim and E. J. Whitehead, Jr. How long did it take to fix bugs? In Proc. of the 3rd International Workshop on Mining Software Repositories, MSR’06, pages 173–174, Shanghai, China, 2006. ACM Press.

[11] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 2005.

[12] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. Al-Kofahi, and T. N. Nguyen. Recurring bug fixes in object-oriented programs. In Proc. of the 32nd ACM/IEEE International Conference on Software Engineering, volume 1 of ICSE ’10, pages 315–324, Cape Town, South Africa, 2010. ACM Press.

[13] L. D. Panjer. Predicting Eclipse bug lifetimes. In Proc. of the 4th International Workshop on Mining Software Repositories, MSR’07, pages 29–32, Washington, DC, USA, 2007. IEEE Computer Society.

[14] J. Park, M. Kim, B. Ray, and D.-H. Bae. An empirical study of supplementary bug fixes. In Proc. of the 9th Working Conference on Mining Software Repositories, MSR’12, pages 40–49, Zurich, Switzerland, June 2012. IEEE.

[15] R. Purushothaman and D. E. Perry. Toward understanding the rhetoric of small source code changes. IEEE Transactions on Software Engineering, 31(6):511–526, 2005.

[16] M. P. Robillard. Automatic generation of suggestions for program investigation. In the 10th European Software Engineering Conference (ESEC’05), pages 11–20. ACM Press, 2005.

[17] S. Valverde, R. F. Cancho, and R. V. Sole. Scale-free networks from optimal design. EPL (Europhysics Letters), 2002.

[18] D. Cubranic and G. C. Murphy. Hipikat: recommending pertinent software development artifacts. In Proc. of the 25th International Conference on Software Engineering, ICSE ’03, pages 408–418, Portland, OR, USA, 2003. IEEE Computer Society.

[19] C. Weiss, R. Premraj, T. Zimmermann, and A. Zeller. How long will it take to fix this bug? In Proc. of the 4th International Workshop on Mining Software Repositories, MSR’07, pages 1–8, Washington, DC, USA, 2007. IEEE Computer Society.

[20] R. Wheeldon and S. Counsell. Power law distributions in class relationships. In Proceedings of the Third IEEE International Workshop on Source Code Analysis and Manipulation, 2003. IEEE Computer Society, 2003.

[21] D. L. X. Wang, J. Cheng, L. Zhang, H. Mei, and J. X. Yu. Matching dependence-related queries in the system dependence graph. In the IEEE/ACM International Conference on Automated Software Engineering (ASE’10), pages 457–466. ACM Press, 2010.

[22] J. L. Y. Padioleau, R. R. Hansen, and G. Muller. Documenting and automating collateral evolutions in Linux device drivers. In the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems (EUROSYS’08), pages 247–260. ACM Press, 2008.

[23] X. Zhang, S. Tallam, N. Gupta, and R. Gupta. Towards locating execution omission errors. In the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’07), pages 415–424. ACM Press, 2007.

[24] T. Zimmermann and P. Weissgerber. Preprocessing CVS data for fine-grained analysis. In Proc. of the 1st International Workshop on Mining Software Repositories, MSR’04, 2004.