Automated Duplicate Detection for Bug Tracking Systems

Nicholas Jalbert
University of Virginia
Charlottesville, Virginia 22904
[email protected]

Westley Weimer
University of Virginia
Charlottesville, Virginia 22904
[email protected]

Abstract

Bug tracking systems are important tools that guide the maintenance activities of software developers. The utility of these systems is hampered by an excessive number of duplicate bug reports; in some projects as many as a quarter of all reports are duplicates. Developers must manually identify duplicate bug reports, but this identification process is time-consuming and exacerbates the already high cost of software maintenance. We propose a system that automatically classifies duplicate bug reports as they arrive to save developer time. This system uses surface features, textual semantics, and graph clustering to predict duplicate status. Using a dataset of 29,000 bug reports from the Mozilla project, we perform experiments that include a simulation of a real-time bug reporting environment. Our system is able to reduce development cost by filtering out 8% of duplicate bug reports while allowing at least one report for each real defect to reach developers.

1. Introduction

As software projects become increasingly large and complex, it becomes more difficult to properly verify code before shipping. Maintenance activities [12] account for over two thirds of the life cycle cost of a software system [3], summing up to a total of $70 billion per year in the United States [16]. Software maintenance is critical to software dependability, and defect reporting is critical to modern software maintenance.

Many software projects rely on bug reports to direct corrective maintenance activity [1]. In open source software projects, bug reports are often submitted by software users or developers and collected in a database by one of several bug tracking tools. Allowing users to report and potentially help fix bugs is assumed to improve software quality overall [13]. Bug tracking systems allow users to report, describe, track, classify and comment on bug reports and feature requests.

Bugzilla is a particularly popular open source bug tracking software system [14] that is used by large projects such as Mozilla and Eclipse. Bugzilla bug reports come with a number of pre-defined fields, including categorical information such as the relevant product, version, operating system and self-reported incident severity, as well as free-form text fields such as defect title and description. In addition, users and developers can leave comments and submit attachments, such as patches or screenshots.

The number of defect reports typically exceeds the resources available to address them. Mature software projects are forced to ship with both known and unknown bugs; they lack the development resources to deal with every defect. For example, in 2005, one Mozilla developer claimed that, "everyday, almost 300 bugs appear that need triaging. This is far too much for only the Mozilla programmers to handle" [2, p. 363].

A significant fraction of submitted bug reports are spurious duplicates that describe already-reported defects. Previous studies report that as many as 36% of bug reports were duplicates or otherwise invalid [2]. Of the 29,000 bug reports used in the experiments in this paper, 25.9% were identified as duplicates by the project developers.

Developer time and effort are consumed by the triage work required to evaluate bug reports [14], and the time spent fixing bugs has been reported as a useful software quality metric [8]. Modern software engineering for large projects includes bug report triage and duplicate identification as a major component.

We propose a technique to reduce bug report triage cost by detecting duplicate bug reports as they are reported. We build a classifier for incoming bug reports that combines the surface features of the report [6], textual similarity metrics [15], and graph clustering algorithms [10] to identify duplicates. We attempt to predict whether manual triage efforts would eventually resolve the defect report as a duplicate or not. This prediction can serve as a filter between developers and arriving defect reports: a report predicted to be a duplicate is filed, for future reference, with the bug reports it is likely to be a duplicate of, but is not otherwise presented to developers. As a result, no direct triage effort is spent on it.


Our classifier is based on a model that takes into account easily-gathered surface features of a report as well as historical context information about previous reports.

In our experiments we apply our technique to over 29,000 bug reports from the Mozilla project and experimentally validate its predictive power. We measure our approach's efficacy as a filter, its ability to locate the likely original for a duplicate bug report, and the relative power of the key features it uses to make decisions.

The Mozilla project already has over 407,000 existing reports, so naïve approaches that explicitly compare each incoming bug report to all previous ones will not scale. We train our model on historical information in long (e.g., four-month) batches, periodically regenerating it to ensure it remains accurate.

The main contributions of this paper are:

• A classifier that predicts bug report duplicate status based on an underlying model of surface features and textual similarity. This classifier has reasonable predictive power over our dataset, correctly identifying 8% of the duplicate reports while allowing at least one report for each real defect to reach developers.

• A discussion of the relative predictive power of the features in the model and an explanation of why certain measures of word frequency are not helpful in this domain.

The structure of this paper is as follows. Section 2 presents a motivating example that traces the history of several duplicate bug reports. We compare our approach to others in Section 3. In Section 4, we formalize our model, paying careful attention to textual semantics in Section 4.1 and surface features in Section 4.2. We present our experimental framework and our experimental results in Section 5. We conclude in Section 6.

2. Motivating Example

Duplicate bug reports are such a problem in practice that many projects have special guidelines and websites devoted to them. The "Most Frequently Reported Bugs" page of the Mozilla Project's Bugzilla bug tracking system is one such example. This webpage tracks the number of bug reports with known duplicates and displays the most commonly reported bugs. Ten bug equivalence classes have over 100 known duplicates and over 900 other equivalence classes have more than 10 known duplicates each. All of these duplicates had to be identified by hand and represent time developers spent administering the bug report database and performing triage rather than actually addressing defects.

Bug report #340535 is indicative of the problems involved; we will consider it and three of its duplicates.

The body of bug report #340535, submitted on June 6, 2006, includes the text, "when I click OK the updater starts again and tries to do the same thing again and again. It never stops. So I have to kill the task." It was reported with severity "normal" on Windows XP and included a logfile.

Bug report #344134 was submitted on July 10, 2006 and includes the description, "I got a software update of Minefield, but it failed and I got in an endless loop." It was also reported with severity "normal" on Windows XP, but included no screenshots or logfiles. On August 29, 2006, the report was identified as a duplicate of #340535.

Later, on September 17, 2006, bug report #353052 was submitted: "...[Thunderbird] says that the previous update failed to complete, try again get the same message cannot start thunderbird at all, continous loop." This report had a severity of "critical" on Windows XP, and included no attachments. Thirteen hours later, it was marked as a duplicate in the same equivalence class as #340535.

A fourth report, #372699, was filed on March 5, 2007. It included the title, "autoupdate runs...and keeps doing this in an infinite loop until you kill thunderbird." It had a severity of "major" on Windows XP and included additional form text. It was marked as a duplicate within four days.

When these four example reports are presented in succession their similarities are evident, but in reality they were separated by up to nine months and over thirty thousand intervening defect reports. All in all, 42 defect reports were submitted describing this same bug. In commercial development processes with a team of bug triagers and software maintainers, it is not reasonable to expect any single developer to have thirty thousand defects from the past nine months memorized for the purposes of rapid triage. Developers must thus be wary, and the cost of checking for duplicates is paid not merely for actual duplicates, but also for every non-duplicate submission; developers must treat every report as a potential duplicate.

However, we can gain insight from this example. While self-reported severity was not indicative, some defect report features such as platform were common to all of the duplicates. More tellingly, however, the duplicate reports often used similar language. For example, each report mentioned above included some form of the word "update", and included "never stops", "endless loop", "continous loop", or "infinite loop", and three had "tries again", "keeps trying", or "keeps doing this".

We propose a formal model for reasoning about duplicate bug reports. The model identifies equivalence classes based predominantly on textual similarity, relegating surface features to a supporting role.


3. Related Work

In previous work we presented a model of defect report quality based only on surface features [6]. That model predicted whether a bug would be triaged within a given amount of time. This paper adopts a more semantically-rich model, including textual information and machine learning approaches, and is concerned with detecting duplicates rather than predicting the final status of non-duplicate reports. In addition, our previous work suffered from false positives and would occasionally filter away all reports for a given defect. The technique presented here suffers from no such false positives in practice on a larger dataset.

Anvik et al. present a system that automatically assigns bug reports to an appropriate human developer using text categorization and support vector machines. They claim that their system could aid a human triager by recommending a set of developers for each incoming bug report [2]. Their method correctly suggests appropriate developers with 64% precision for Firefox, although their datasets were smaller than ours (e.g., 10,000 Firefox reports) and their learned model did not generalize well to other projects. They build on previous approaches to automated bug assignment with lower precision levels [4, 5]. Our approach is orthogonal to theirs and both might be gainfully employed together: first our technique filters out potential duplicates, and then the remaining real bug reports are assigned to developers using their technique. Anvik et al. [1] also report preliminary results for duplicate detection using a combination of cosine similarity and top lists; their method requires human intervention and incorrectly filtered out 10% of non-duplicate bugs on their dataset.

Weiß et al. predict the "fixing effort", or person-hours spent addressing a defect [17]. They leverage existing bug databases and historical context information. To make their prediction, they use pre-recorded development cost data from the existing bug report databases. Both of our approaches use textual similarity to find closely related defect reports. Their technique employs the k-nearest neighbor machine learning algorithm. Their experimental validation involved 600 defect reports for the JBoss project. Our approach is orthogonal to theirs, and a project might employ our technique to weed out spurious duplicates and then employ their technique on the remaining real defect reports to prioritize based on predicted effort.

Kim and Whitehead claim that the time it takes to fix a bug is a useful software quality measure [8]. They measure the time taken to fix bugs in two software projects. We predict whether a bug will eventually be resolved as a duplicate and are not focused on particular resolution times or the total lifetime of real bugs.

Our work is most similar to that of Runeson et al. [15], in which textual similarity is used to analyze known-duplicate bug reports.

In their experiments, bug reports that are known to be duplicates are analyzed along with a set of historical bug reports with the goal of generating a list of candidate originals for that duplicate. In Section 5.2 we show that our technique is no worse than theirs at that task. However, our main focus is on using our model as a filter to detect unknown duplicates, rather than correctly binning known duplicates.

4. Modeling Duplicate Defect Reports

Our goal is to develop a model of bug report similarity that uses easy-to-gather surface features and textual semantics to predict if a newly-submitted report is likely to be a duplicate of a previous report. Since many defect reports are duplicates (e.g., 25.9% in our dataset), automating this part of the bug triage process would free up time for developers to focus on other tasks, such as addressing defects and improving software dependability.

Our formal model is the backbone of our bug report filtering system. We extract certain features from each bug report in a bug tracker. When a new bug report arrives, our model uses the values of those features to predict the eventual duplicate status of that new report. Duplicate bugs are not directly presented to developers, to save triage costs.

We employ a linear regression over properties of bug reports as the basis for our classifier. Linear regression offers the advantages of (1) having off-the-shelf software support, decreasing the barrier to entry for using our system; (2) supporting rapid classifications, allowing us to add textual semantic information and still perform real-time identification; and (3) easy component examination, allowing for a qualitative analysis of the features in the model. Linear regression produces continuous output values as a function of continuously-valued features; to make a binary classifier we need to specify those features and an output value cutoff that distinguishes between duplicate and non-duplicate status.
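
As an illustration (not part of the original implementation), the following Python sketch shows how a regression-plus-cutoff classifier of this general shape can be assembled with off-the-shelf tools. The feature values, labels, and cutoff below are placeholders, not the paper's actual parameters.

```python
# Sketch: a linear regression fit off-line, then used as a binary
# duplicate classifier via an output-value cutoff. All numbers are
# illustrative placeholders.
import numpy as np

def fit_linear_model(features, labels):
    """Least-squares fit; `features` is (n_reports x n_features),
    `labels` is 1.0 for duplicates and 0.0 otherwise."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add intercept column
    coeffs, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return coeffs

def predict_duplicate(coeffs, feature_vector, cutoff):
    """Compare the continuous regression output against a cutoff."""
    x = np.append(feature_vector, 1.0)
    return float(x @ coeffs) >= cutoff

# Toy usage with made-up feature vectors.
train_X = np.array([[0.9, 0.8, 1.0], [0.1, 0.2, 0.0], [0.7, 0.6, 1.0]])
train_y = np.array([1.0, 0.0, 1.0])
model = fit_linear_model(train_X, train_y)
print(predict_duplicate(model, np.array([0.85, 0.7, 1.0]), cutoff=0.5))
```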

We base our features not only on the newly-submitted bug report under consideration, but also on a corpus of previously-submitted bug reports. A key assumption of our technique is that these features will be sufficient to separate duplicates from non-duplicates. In essence, we claim that there are ways to tell duplicate bug reports from non-duplicates just by examining them and the corpus of earlier reports.

We cannot update the context information used by our model after every new bug report; the overhead would be too high. Our linear regression model speeds up processing incoming reports because coefficients can be calculated ahead of time using historic bug data. A new report requires only feature extraction, multiplication, and a test against a cutoff.


However, as more and more reports are submitted, the original historic corpus becomes less and less relevant to predicting future duplicates. We thus propose a system in which the basic model is periodically (e.g., yearly) regenerated, recalculating the coefficients and cutoff based on an updated corpus of reports.

We now discuss the derivation of the important features used in our classifier model.

4.1 Textual Analysis

Bug reports include free-form textual descriptions and titles, and most duplicate bug reports share many of the same words. Our first step is to define a textual distance metric for use on titles and descriptions. We use this metric as a key component in our identification of duplicates.

We adopt a "bag of words" approach when defining similarity between textual data. Each text is treated as a set of words and their frequency: positional information is not retained. Since orderings are not preserved, some potentially-important semantic information is not available for later use. The benefit gained is that the size of the representation grows at most linearly with the size of the description. This reduces processing load and is thus desirable for a real-time system.

We treat bug report titles and bug report descriptions as separate corpora. We hypothesize that the title and description have different levels of importance when used to classify duplicates. In our experience, bug report titles are written more succinctly than general descriptions and thus are more likely to be similar for duplicate bug reports. We would therefore lose some information if we combined titles and descriptions together and treated them as one corpus. Previous work presents some evidence for this phenomenon: experiments which double the weighting of the title result in better performance [15].

We pre-process raw textual data before analyzing it, tokenizing the text into words and reducing those words to their stems. We use the MontyLingua tool [9] as well as some basic scripting to obtain tokenized, stemmed word lists of description and title text from raw defect reports. Tokenization strips punctuation, capitalization, numbers, and other non-alphabetic constructs. Stemming removes inflections (e.g., "scrolls" and "scrolling" both reduce to "scroll"). Stemming allows for a more precise comparison between bug reports by creating a more normalized corpus; our experiments used the common Porter stemming algorithm (e.g., [7]).

We then filter each sequence against a stoplist of common words. Stoplists remove words such as "a" and "and" that are present in text but contribute little to its comparative meaning. If such words were allowed to remain, they would artificially inflate the perceived similarity of defect reports with long descriptions. We used an open source stoplist of roughly 430 words associated with the ReqSimile tool [11].

Finally, we do not consider submission-related information, such as the version of the browser used by the reporter to submit the defect report via a web form, to be part of the description text. Such information is typically colocated with the description in bug databases, but we include only textual information explicitly entered by the reporter.
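
The following sketch illustrates this preprocessing pipeline. The paper uses MontyLingua and the ReqSimile stoplist; the sketch substitutes NLTK's Porter stemmer (assuming NLTK is installed) and a tiny illustrative stoplist.

```python
# Sketch of the preprocessing pipeline: tokenize, strip non-alphabetic
# tokens, stem, and drop stopwords. Stand-ins: NLTK's PorterStemmer for
# MontyLingua, and a tiny inline stoplist for the ~430-word ReqSimile list.
import re
from nltk.stem import PorterStemmer

STOPLIST = {"a", "an", "and", "the", "to", "of", "in", "is", "it"}  # illustrative only
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())    # drop punctuation, numbers, case
    stems = [stemmer.stem(t) for t in tokens]       # "scrolling" -> "scroll"
    return [s for s in stems if s not in STOPLIST]  # stoplist filtering

print(preprocess("The updater starts again and tries the same thing again."))
```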

4.1.1 Document Similarity

We are interested in measuring the similarity between two documents within the same corpus; in our experiments all of the descriptions form one corpus and all of the titles form another. All of the documents in a corpus taken together contain a set of n unique words. We represent each document in that corpus by a vector v of size n, with v[i] related to the total number of times that word i occurs in that document. The particular value at position v[i] is obtained from a formula that can involve the number of times word i appears in that document, the number of times it appears in the corpus, the length of the document, and the size of the corpus.

Once we have obtained the vectors v1 and v2 for two documents in the same corpus, we can compute their similarity using the following formula, in which v1 • v2 represents the dot product:

similarity = cos(θ) = (v1 • v2) / (|v1| × |v2|)

That is, the closer two vectors are to colinear, the more weighted words the corresponding documents share and thus, we assume, the more similar the meanings of the two documents. Given this cosine similarity, the efficacy of our distance metric is determined by how we populate the vectors, and in particular how we weight word frequencies.
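
For concreteness, a minimal sketch of the cosine computation over sparse word-weight vectors (represented here as dictionaries) looks like this; the representation is illustrative, not prescribed by the paper.

```python
# Sketch: cosine similarity between two sparse weighted word vectors,
# each represented as a {word: weight} dictionary.
import math

def cosine_similarity(v1, v2):
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```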

4.1.2 Weighting for Duplicate Defect Detection

Inverse document frequency, which incorporates corpus-wide information about word counts, is commonly used in natural language processing to identify and appropriately weight important words. It is based on the assumption that important words are distinguished not only by the number of times they appear in a certain text, but also by the inverse of the ratio of the documents in which they appear in the corpus. Thus a word like "the" may appear multiple times in a single document, but will not be heavily weighted if it also appears in most other documents. The popular TF/IDF weighting includes both normal term frequency within a single document as well as inverse document frequency over an entire corpus.

In Section 5 we present experimental evidence that inverse document frequency is not effective at distinguishing duplicate bug reports. In our dataset, duplicate bug reports of the same underlying defect are no more likely to share "rare" words than are otherwise-similar unrelated pairs of bug reports.


We thus do not include a weighting factor corresponding to inverse document frequency. Our weighting equation for textual similarity is:

wi = 3 + 2 log2 (count of word i in document)

Every position i in the representative vector v of a bug report is determined based upon the frequency of term i and the constant scaling factors present in the equation. Intuitively, the weight of a word that occurs many times grows logarithmically, rather than linearly. The constant factors were empirically derived in an exhaustive optimization related to our dataset, which ranges over all of the subprojects under the Mozilla umbrella. Once we have each document represented as a weighted vector v, we can use cosine similarity to obtain a distance between two documents.
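
A short sketch of this weighting, feeding into the cosine similarity sketched above; the token lists are placeholders coming from the preprocessing step.

```python
# Sketch of the paper's term weighting: each word's vector entry is
# 3 + 2*log2(count of the word in the document), with no inverse
# document frequency factor.
import math
from collections import Counter

def weight_vector(tokens):
    counts = Counter(tokens)  # tokens produced by the preprocessing step
    return {w: 3.0 + 2.0 * math.log2(c) for w, c in counts.items()}

# Two toy documents; cosine_similarity is the function sketched earlier.
d1 = weight_vector(["updat", "loop", "loop", "kill"])
d2 = weight_vector(["updat", "endless", "loop"])
```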

A true distance metric is symmetric. However, we use a non-symmetric "similarity function" for our training data: textual distance is used as defined above in general, but as a special case the one-directional similarity of an original to its duplicates is set to zero. We hypothesize that duplicates will generally be more similar to the original bug report than to unrelated reports. Because we are predicting if a report is a duplicate and only one part of a duplicate-original pair has that feature, textual similarity would be somewhat less predictive if it were symmetric.

4.1.3 Clustering

We use our textual similarity metric to induce a graph in which the nodes are defect reports and edges link reports with similar text. We then apply a clustering algorithm to this graph to obtain a set of clustered reports. Many common clustering algorithms require either that the number of clusters be known in advance, or that clusters be completely disjoint, or that every element end up in a non-trivial cluster. Instead, we chose to apply a graph clustering algorithm designed for social networks to the problem of detecting duplicate defect reports.

The graph clustering algorithm of Mishra et al. produces a set of possibly-overlapping clusters given a graph with unweighted, undirected edges [10]. Every cluster discovered is internally dense, in that nodes within a cluster have a high fraction of edges to other nodes within the cluster, and also externally sparse, in that nodes within a cluster have a low fraction of edges to nodes not in the cluster. The algorithm is designed with scalability in mind, and has been used to cluster graphs with over 500,000 nodes. We selected it because it does not require foreknowledge of the number of clusters, does not require that every node be in a non-trivial cluster, and is efficient in practice.

In addition, the algorithm produces a "champion", or exemplary node within the cluster that has many neighbors within the cluster and few outside of it.

In our experiments in Section 5 we measure our predictive power as a duplicate classifier. In practice our distance metric and the champions of the relevant clusters can also be used to determine which bug from an equivalence class of duplicates should be presented to developers first.

We obtain the required graph by choosing a cutoff value. Nodes with similarity above the cutoff value are connected by an edge. The cutoff similarity and the clustering parameters used in our experiments were empirically determined.
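
The sketch below shows the graph construction step only. The paper clusters the resulting graph with the algorithm of Mishra et al. [10]; as a deliberately simple stand-in, the sketch groups nodes by connected components, which neither produces overlapping clusters nor champions.

```python
# Sketch: build the similarity graph by thresholding pairwise similarity.
# Clustering stand-in: plain connected components (NOT the Mishra et al.
# algorithm used in the paper).
from collections import defaultdict

def build_graph(similarities, cutoff):
    """`similarities` maps (report_a, report_b) pairs to a similarity score."""
    graph = defaultdict(set)
    for (a, b), sim in similarities.items():
        if sim >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n] - component)
        seen |= component
        clusters.append(component)
    return clusters
```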

4.2 Model Features

We use textual similarity and the results of clustering as features for a linear model. We keep description similarity and title similarity separate. For the incoming bug report under consideration, we determine both the highest title similarity and the highest description similarity it shares with a report in our historical data. Intuitively, if both of those values are low then the incoming bug report is not textually similar to any known bug report and is therefore unlikely to be a duplicate.

We also use the clusters from Section 4.1.3 to define a feature that notes whether or not a report was included in a cluster. Intuitively, a report left alone as a singleton by the clustering algorithm is less likely to be a duplicate. It is common for a given bug to have multiple duplicates, and we hope to tease out this structure using the graph clustering.

Finally, we complete our model with easily-obtained surface features from the bug report. These features include the self-reported severity, the relevant operating system, and the number of associated patches or screenshots [6]. These features are neither as semantically-rich nor as predictive as textual similarity. Categorical features, such as relevant operating system, were modeled using a one-hot encoding.
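
As an illustration of how these features could be assembled into one vector, the sketch below reuses the earlier cosine-similarity sketch; the dictionary keys, category list, and exact ordering are assumptions for this example, not the paper's encoding.

```python
# Sketch of one report's feature vector: maximum title and description
# similarity against the historical corpus, cluster membership, and
# one-hot-encoded surface features. Field names are illustrative.
KNOWN_OS = ["Windows XP", "Mac OS X", "Linux", "Other"]  # illustrative categories

def features_for(report, history, in_cluster):
    max_title = max((cosine_similarity(report["title_vec"], h["title_vec"])
                     for h in history), default=0.0)
    max_desc = max((cosine_similarity(report["desc_vec"], h["desc_vec"])
                    for h in history), default=0.0)
    os_onehot = [1.0 if report["os"] == name else 0.0 for name in KNOWN_OS]
    return [max_title, max_desc,
            1.0 if in_cluster else 0.0,
            float(report["severity_major"]),
            float(report["num_attachments"])] + os_onehot
```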

5. Experiments

All of our experiments are based on 29,000 bug reports from the Mozilla project. The reports span an eight-month period from February 2005 to October 2005. The Mozilla project encompasses many different programs including web browsers, mail clients, calendar applications, and issue tracking systems. All Mozilla subprojects share the same bug tracking system; our dataset includes reports from all of them. We chose to include all of the subprojects to enhance the generality of our results.

Mozilla has been under active development since 2002. Bug reports from the very beginning of a project may not be representative of a "typical" or "steady state" bug report. We selected a time period when a reasonable number of bug reports were resolved while avoiding start-up corner cases. In our dataset, 17% of defect reports had not attained some sort of resolution.


Because we use developer resolutions as a ground truth to measure performance, reports without resolutions are not used. Finally, when considering duplicates in this dataset we restrict attention to those for which the original bug is also in the dataset.

We conducted four empirical evaluations:

Text. Our first experiment demonstrates the lack of correlation between sharing "rare" words and duplicate status. In our dataset, two bug reports describing the same bug were no more likely to share "rare" words than were two non-duplicate bug reports. This finding motivates the form of the textual similarity metric used by our algorithm.

Recall. Our second experiment was a direct comparison with the previous work of Runeson et al. [15]. In this experiment, each algorithm is presented with a known-duplicate bug report and a set of historical bug reports and is asked to generate a list of candidate originals for the duplicate. If the actual original is on the list, the algorithm succeeds. We perform no worse than the current state of the art.

Filtering. Our third and primary experiment involved on-line duplicate detection. We tested the feasibility and effectiveness of using our duplicate classifier as an on-line filter. We trained our algorithm on the first half of the defect reports and tested it on the second half. Testing proceeded chronologically through the held-out bug reports, predicting the duplicate status of each. We measured both the time to process an incoming defect report as well as the expected savings and cost of such a filter. We measured cost and benefit in terms of the number of real defects mistakenly filtered as well as the number of duplicates correctly filtered.

Features. Finally, we applied a leave-one-out analysis and a principal component analysis to the features used by our model. These analyses address the relative predictive power and potential overlap of the features we selected.

5.1 Textual Similarity

Our notion of textual similarity uses the cosine distance between weighted word vectors derived from documents. The formula used for weighting words thus induces the similarity measure. In this experiment we investigate the use of inverse document frequency as a word-weighting factor.

Inverse document frequency is a contextual measure of the importance of a word. It is based on the assumption that important words will appear frequently in some documents and infrequently across the whole of the corpus. Inverse document frequency scales the weight of each word in our vector representation by the following factor:

IDF(w) = log( total documents / documents in which word w appears )
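
A direct sketch of this factor, where the corpus is assumed to be a list of token lists, one per document:

```python
# Sketch of the inverse document frequency factor under test.
import math

def idf(word, corpus):
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing) if containing else 0.0
```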

[Figure 1. Distribution of shared-word frequencies between duplicate-original pairs (light) and close duplicate-unrelated pairs (dark). X-axis: shared-word frequency (0 to 7.2); y-axis: frequency, in bug report pairs (0 to 300).]

While inverse document frequency is popular in general natural language processing and information retrieval tasks, employing it for the task of bug report duplicate identification resulted in strictly worse performance than a baseline using only non-contextual term frequency.

We performed a statistical analysis over the reports in our dataset to determine why the inclusion of inverse document frequency was not helpful in identifying duplicate bug reports. First, we define the shared-word frequency between two documents with respect to a corpus. The shared-word frequency between two documents is the sum of the inverse document frequencies for all words that the two documents have in common, divided by the number of words shared:

shared-word frequency(d1, d2) = ( Σ_{w ∈ d1 ∩ d2} IDF(w) ) / |d1 ∩ d2|

This shared-word frequency gives insight into the effect of inverse document frequency on word weighting.
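
A sketch of this statistic, reusing the `idf` sketch above; d1 and d2 are assumed to be collections of preprocessed tokens.

```python
# Sketch of the shared-word frequency statistic: the mean IDF of the
# words two documents have in common.
def shared_word_frequency(d1, d2, corpus):
    shared = set(d1) & set(d2)
    if not shared:
        return 0.0
    return sum(idf(w, corpus) for w in shared) / len(shared)
```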

We then considered every duplicate bug report and its associated original bug report in turn and calculated the shared-word frequency for the titles and descriptions of that pair. We also calculated the shared-word frequency between each duplicate bug report and the closest non-original report, with "closest" determined by TF/IDF. This measurement demonstrates that over our dataset, inverse document frequency is just as likely to relate duplicate-original report pairs as it is to relate unlinked report pairs.

We performed a Wilcoxon rank-sum test to determine whether the distribution of shared-word frequency between duplicate-original pairs was significantly different than that of close duplicate-unique pairs. Figure 1 shows the two distributions. The distribution of duplicate-unique pair values falls to the right of the distribution of duplicate-original pair values (with a statistically-significant p-value of 4 × 10^-7).
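
A minimal sketch of this comparison using SciPy; the input lists below are placeholders for the per-pair shared-word frequency values computed above, not real data.

```python
# Sketch: Wilcoxon rank-sum test between the two distributions of
# shared-word frequencies. Values shown are illustrative placeholders.
from scipy.stats import ranksums

dup_original_values = [0.9, 1.4, 2.1, 1.1]
dup_unrelated_values = [1.8, 2.5, 3.0, 2.2]

statistic, p_value = ranksums(dup_original_values, dup_unrelated_values)
print(statistic, p_value)
```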


[Figure 2. Recall rate of various algorithms on the Runeson task as a function of top list size. Our algorithm performs up to 1% better than that of Runeson et al., which in turn performs better than a direct application of inverse document frequency (TF/IDF). X-axis: top list size (1 to 19); y-axis: recall rate (0.22 to 0.52); series: Our Algorithm, Runeson, TF/IDF, Runeson No Stoplist.]

In essence, in the domain of bug report classification, shared-word frequency is more likely to increase the similarity of unrelated pairs than of duplicate-original pairs. Thus, we were able to dismiss shared-word frequency from consideration as a useful factor to distinguish duplicates from non-duplicates. This study motivates the form of the weighting presented in Section 4.1.2 and used by our algorithm.

5.2 Recall Rate

Our second experiment applies our algorithm to a duplicate-detection recall task proposed by Runeson et al. [15]. Their experiment consists of examining a corpus of bug reports that includes identified duplicate reports. For each duplicate report d, the algorithm under consideration generates an ordered top list containing the reports judged most likely to be d's original. This top list could then be used by the triager to aid the search for duplicate bug reports.

Results are quantified using a recall rate metric. If the true original bug report associated with d appears anywhere in the top list produced by the algorithm when given d, the algorithm is judged to have correctly recalled the instance d. The total recall rate is the fraction of instances for which this occurs. The longer the top list is allowed to be, the easier this task becomes. For every duplicate bug report, our algorithm considers every other bug report and calculates its distance to d using the word-vector representation and cosine similarity. The reports are then sorted and the closest reports become the top list.

The algorithm of Runeson et al. is conceptually similar but uses a similarity metric with a different word weighting equation.
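
The recall-rate measurement itself can be sketched as follows; the data structures (`originals`, `corpus`, `similarity`) are assumptions for illustration.

```python
# Sketch of the recall-rate task: for each known duplicate, rank all other
# reports by similarity, keep the top-k, and check whether the true
# original appears in that top list.
def recall_rate(duplicates, originals, corpus, similarity, k):
    """`originals[d]` is the id of d's true original, `corpus` is every
    report id, and `similarity(a, b)` is a pairwise score."""
    hits = 0
    for d in duplicates:
        top = sorted((r for r in corpus if r != d),
                     key=lambda r: similarity(d, r), reverse=True)[:k]
        hits += originals[d] in top
    return hits / len(duplicates)
```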

We tested four algorithms using all of the duplicate bug reports in our dataset as the corpus. Figure 2 shows the recall rates obtained as a function of the top list size. The TF/IDF algorithm uses the TF/IDF weighting that takes inverse document frequency into account; it is a popular default choice and serves as a baseline. As presaged by the experiment in Section 5.1, TF/IDF fared poorly.

We considered the Runeson et al. algorithm with and without stoplist word filtering for common words. Unsurprisingly, stoplists improved performance. It is interesting to note that while the TF/IDF algorithm is strictly worse at this task than either algorithm using stoplisting, it is better than the approach without stoplisting. Thus, an inverse document frequency approach might be useful in situations in which it is infeasible to create a stop word list.

In this experiment, our approach is up to 1% better than the previously-published state of the art on a larger dataset. We use our approach as the weighting equation used to calculate similarities in Section 5.3. However, the difference between the two approaches is not statistically significant at the p ≤ .05 level; we conclude only that our technique is no worse than the state of the art at this recall task.

5.3 On-Line Filtering Simulation

The recall rate experiment corresponds to a system that still has a human in the loop to look at the list and make a judgment about potential duplicates; it does not speak to an algorithm's automatic performance or performance on non-duplicate reports. To measure our algorithm's utility as an on-line filtering system, we propose a more complex experiment. Our algorithm is first given a contiguous prefix of our dataset of bug reports as an initial training history. It is then presented with each subsequent bug report in order and asked to make a classification of that report's duplicate status. Reports flagged as duplicates are filtered out and are assumed not to be shown to developers directly. Such filtered reports might instead be linked to similar bug reports posited as their originals (e.g., using top lists as in Section 5.2) or stored in a rarely-visited "spam folder". Reports not flagged as duplicates are presented normally.

We first define a metric to quantify the performance of such a filter. We define two symbolic costs: Triage and Miss. Triage represents a fixed cost associated with triaging a bug report. For each bug report that our algorithm does not filter out, the total cost it incurs is incremented by Triage. Miss denotes the cost for ignoring an eventually-fixed bug report and all of its duplicates. In other words, we penalize the algorithm only if it erroneously filters out an entire group of duplicates and their original, and if that original is eventually marked as "fixed" by the developers (and not "invalid" or "works-for-me", etc.).


If two bug reports both describe the same defect, the algorithm can save development triage effort by setting aside one of them. We assume the classifications provided by developers for the bug reports in our dataset are the ground truth for measuring performance.

We can then assign our algorithm a total cost of the form a × Triage + b × Miss. We evaluate our performance relative to the cost of triaging every bug report (i.e., 11,340 × Triage, since 11,340 reports are processed). The comparative cost of our model is thus (11,340 − a) × Triage + b × Miss.

In this experiment we simulated a deployment of our algorithm "in the wild" by training on the chronological first half of our dataset and then testing on the second half. We test and train on two different subsets of bug reports both to mitigate the threat of overfitting our linear regression model, and also because information about the future would not be available in a real deployment.

The algorithm pieces we have previously described are used to train the model that underlies our classifier. First, we extract features from bug reports. Second, we perform the textual preprocessing described in Section 4.1. Third, we perform separate pairwise similarity calculations of both title and description text, as in Section 4.1.1. This step is the most time intensive, but it is only performed during the initial training stage and when the model is regenerated.

Once we have calculated the similarity, we generate the graph structure for the clustering algorithm, as in Section 4.1.3. We use a heuristic to determine the best similarity cutoff for an edge to exist between two bug report nodes. For each possible cutoff, we build the graph and then examine how many edges exist between bug reports in an equivalence class and bug reports in different classes. We choose the cutoff that maximizes the ratio of the former to the latter. Given the cutoffs we can run the clustering algorithm and turn its output into a feature usable by our linear model.
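
A sketch of this edge-cutoff heuristic, reusing the `build_graph` sketch from Section 4.1.3; `class_of`, which maps a report to its developer-assigned equivalence class in the training data, is an assumed helper.

```python
# Sketch: for each candidate cutoff, build the graph and count edges that
# join reports in the same known equivalence class ("intra") versus
# different classes ("inter"); keep the cutoff maximizing intra/inter.
def choose_edge_cutoff(similarities, class_of, candidate_cutoffs):
    best_cutoff, best_ratio = None, -1.0
    for cutoff in candidate_cutoffs:
        graph = build_graph(similarities, cutoff)
        intra = inter = 0
        for a, neighbors in graph.items():
            for b in neighbors:
                if a < b:  # count each undirected edge once
                    if class_of[a] == class_of[b]:
                        intra += 1
                    else:
                        inter += 1
        ratio = intra / inter if inter else float("inf")
        if ratio > best_ratio:
            best_cutoff, best_ratio = cutoff, ratio
    return best_cutoff
```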

We then run a linear regression to obtain coefficients for each feature. After the regression, we determine a cutoff to identify a bug as a duplicate. We do this by finding the cutoff that minimizes the Triage/Miss cost of applying that model as a filter on all of the training bug reports. In Section 5.3.1 we investigate how cutoff choices influence our algorithm's performance; in this section we present the results for the best cutoff.
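
The classification-cutoff search can be sketched as below. The cost values and the helper `count_missed_classes`, which would count fixed equivalence classes whose every report was filtered, are assumptions for illustration; `score(report)` stands for the linear model's continuous output.

```python
# Sketch: sweep candidate cutoffs and keep the one minimizing the
# Triage/Miss cost of filtering the training reports.
def pick_classifier_cutoff(train_reports, score, count_missed_classes,
                           candidates, triage_cost=30.0, miss_cost=1000.0):
    best_cutoff, best_cost = None, float("inf")
    for cutoff in candidates:
        filtered = {r for r in train_reports if score(r) >= cutoff}
        shown = len(train_reports) - len(filtered)
        # a "miss" is a fixed equivalence class filtered out in its entirety
        misses = count_missed_classes(filtered)
        cost = shown * triage_cost + misses * miss_cost
        if cost < best_cost:
            best_cutoff, best_cost = cutoff, cost
    return best_cutoff
```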

Bug reports from the testing set are then presented to the algorithm one at a time. This is a quick process, requiring only feature recovery, the calculation of a total score based on the precalculated coefficients, and a comparison to a cutoff. The most expensive feature is the comparison to other reports for the purposes of the textual similarity features. The running time of this step grows linearly with the size of the historical set.

[Figure 3. Cumulative number of duplicate bug reports filtered out as a function of the number of weeks of simulated filtering. Our algorithm never filtered out equivalence classes of reports that were resolved as "fixed" by developers. X-axis: simulated weeks of filtering (1 to 16); y-axis: duplicate bug reports (1 to 10,000, log scale); series: Ideal, Our Algorithm.]

In our experiments, the average total time to process an incoming bug report was under 20 seconds on a 3 GHz Intel Xeon. In 2005, the Mozilla project received just over 45,000 bug reports, for an average of one bug report every 12 minutes. Our timing results suggest our algorithm could reasonably be implemented online.

Figure 3 shows the cumulative number of duplicate bug reports correctly filtered by our algorithm as a function of time. For comparison, the ideal maximum number of filterable bug reports is also shown. We correctly filtered 10% of all possible duplicates initially, trailing off to 8% as our historical information became more outdated. In this experiment we never regenerated the historical context information; this represents a worst-case scenario for our technique. At the end of the simulation our algorithm had correctly filtered 8% of possible duplicates without incorrectly denying access to any real defects.

Whether our algorithm's use as a filter yields a net savings of development resources generally depends on an organization's particular values for Miss and Triage: on other datasets our filter might incorrectly rule out access to a legitimate defect. Companies typically do not release maintenance cost figures, but conventional wisdom places Miss at least one order of magnitude above Triage. In certain domains, such as safety-critical computing, Miss might be much higher, making this approach a poor choice. There are other scenarios, for example systems that feature automatic remote software updates, in which Miss might be lower. For Triage, Runeson et al. report on a circumstance in which "30 minutes were spent on average to analyze a [defect report] that is submitted" [15]. Using that figure, our filter would have saved 1.5 developer-weeks of triage effort over sixteen weeks of filtering. We intentionally do not suggest any particular values for Miss and Triage.


[Figure 4. Receiver operating characteristic (ROC) curve for our algorithm on this experiment. A false positive occurs when all bug reports in an equivalence class that represents a real defect are filtered out. A true positive occurs when we filter one or more spurious reports while allowing at least one report through. Each point represents the results of our algorithm trained with a different cutoff. X-axis: false positive rate; y-axis: true positive rate; series: Our Algorithm, Line of No Discrimination.]


5.3.1 False Positives

In our analysis, we define a false positive as occurring when a fixed bug report is filtered along with all of its duplicates (i.e., when an entire equivalence class of valid bug reports is hidden from developers). We define a true positive to be when we filter one or more bugs from a fixed equivalence class while allowing at least one report from that class to reach developers. A false positive corresponds to the concept of Miss, and an abundance of them would negatively impact the savings that result from using this system, while a true positive corresponds to our system doing useful work.

In our experiment over 11,340 bug reports spanning four months, our algorithm did not completely eliminate any bug equivalence class which included a fixed bug. Hence, we had zero false positives, and we feel our system is likely to reduce development and triage costs in practice. On this dataset, just over 90% of the reports we filtered were resolved as duplicates (i.e., and not "works-for-me" or "fixed"). While the actual filtering performed is perhaps modest, our results suggest this technique can be applied safely with little fear that important defects will be mistakenly filtered out.

We also analyze the trade-off between the true positive rate and the false positive rate in Figure 4.

[Figure 5. A leave-one-out analysis of the features in our model. For each feature we trained the model and ran a portion of the experiment without that feature. The y-axis shows the percentage of the savings (using Triage and Miss) lost by not considering that feature (0 to 120). Features shown: title similarity, description similarity, every OS, reported day 4, Mac OS X only, severity: major, clustering, title length, has patch, reported day 7. Note that without the title textual similarity feature our model is not profitable in this simulation.]

When considering both safety and potential cost savings, we are most interested in the values that lie along the y-axis and thus correspond to having no false positives. However, this analysis suggests that our classifier offers a favorable tradeoff between the true positive rate and false positive rate at more aggressive thresholds.

5.4 Analysis of Features

To test our hypothesis that textual similarity is of prime importance to this task, we performed two experiments to measure the relative importance of individual features. We used a leave-one-out analysis to identify features particularly important in the performance of the filter. We also used a principal components analysis to identify overlap and correlation between features.

We perform the leave-one-out analysis using data saved from our modeling simulation. To measure the importance of each feature used in our filter, we reran the experiment without it. This included recalculating the linear model and retraining the relevant cutoff values, then re-considering each bug report, determining if the restricted model would filter it, and calculating our performance metric based on those decisions.

Once we obtain a performance metric for a restricted model, we can calculate the percent of savings lost as compared to the full model. The restricted models sometimes filter out legitimate bug reports; the Miss cost must therefore be considered.


We chose to illustrate this analysis with the values Triage = 30 and Miss = 1000. Figure 5 shows the change in performance due to the absence of each feature. Only features that resulted in more than a 1% performance decrease are shown.

The biggest change resulted from the feature corresponding to the similarity between a bug report's title and the title of the report to which it was most similar. The second most important feature was the description similarity feature. This supports our hypothesis that semantically-rich textual information is more important than surface features for detecting duplicate defect reports. Description similarity had half the importance of the title, which is perhaps the result of the increased incidence of less-useful clutter in the description. Surface-level features, such as the relevant operating system or the conditions of the initial report, have less of an impact.

If these features are independent, a larger change in performance corresponds to a feature more important to the model. We performed a principal component analysis to measure feature overlap. This analysis is essentially a dimensionality reduction. For example, the features "height in inches" and "height in centimeters" are not independent, and a leave-one-out analysis would underrate the impact of the single underlying factor they both represent (i.e., since leaving out the inches still leaves the centimeters available to the model). For our features on the training dataset there were 10 principal components that each contributed at least 5% to the variance, with a contribution of 10% from the first principal component; it is not the case that our features strongly intercorrelate.
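
A minimal sketch of such a variance breakdown with scikit-learn; the feature matrix below is a random placeholder, not our data.

```python
# Sketch: principal component analysis over the training feature matrix;
# explained_variance_ratio_ reports the fraction of variance captured by
# each component.
import numpy as np
from sklearn.decomposition import PCA

feature_matrix = np.random.rand(200, 12)  # placeholder for the real training features
pca = PCA()
pca.fit(feature_matrix)
print(pca.explained_variance_ratio_)
```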

6. Conclusion

We propose a system that automatically classifies duplicate bug reports as they arrive to save developer time. This system uses surface features, textual semantics, and graph clustering to predict duplicate status. We empirically evaluated our approach using a dataset of 29,000 bug reports from the Mozilla project, a larger dataset than has generally been reported previously. We show that inverse document frequency is not useful in this task, and we simulate using our model as a filter in a real-time bug reporting environment. Our system is able to reduce development cost by filtering out 8% of duplicate bug reports. It still allows at least one report for each real defect to reach developers, and spends only 20 seconds per incoming bug report to make a classification. Thus, a system based upon our approach could realistically be implemented in a production environment with little additional effort and a possible non-trivial payoff.

References

[1] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In OOPSLA Workshop on Eclipse Technology eXchange, pages 35–39, 2005.

[2] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In International Conference on Software Engineering (ICSE), pages 361–370, 2006.

[3] B. Boehm and V. Basili. Software defect reduction. IEEE Computer: Innovative Technology for Computer Professionals, 34(1):135–137, January 2001.

[4] G. Canfora and L. Cerulo. How software repositories can help in resolving a new change request. In Workshop on Empirical Studies in Reverse Engineering, 2005.

[5] D. Cubranic and G. C. Murphy. Automatic bug triage using text categorization. In Software Engineering & Knowledge Engineering (SEKE), pages 92–97, 2004.

[6] P. Hooimeijer and W. Weimer. Modeling bug report quality. In Automated Software Engineering, pages 34–43, 2007.

[7] M. Kantrowitz, B. Mohit, and V. Mittal. Stemming and its effects on TFIDF ranking. In Conference on Research and Development in Information Retrieval, pages 357–359, 2000.

[8] S. Kim and E. J. Whitehead, Jr. How long did it take to fix bugs? In International Workshop on Mining Software Repositories, pages 173–174, 2006.

[9] H. Liu. MontyLingua: an end-to-end natural language processor with common sense. Technical report, http://web.media.mit.edu/~hugo/montylingua, 2004.

[10] N. Mishra, R. Schreiber, I. Stanton, and R. E. Tarjan. Clustering social networks. In Workshop on Algorithms and Models for the Web-Graph (WAW 2007), pages 56–67, 2007.

[11] J. Natt och Dag, V. Gervasi, S. Brinkkemper, and B. Regnell. Speeding up requirements management in a product software company: Linking customer wishes to product requirements through linguistic engineering. In Conference on Requirements Engineering, pages 283–294, 2004.

[12] C. V. Ramamoorthy and W.-T. Tsai. Advances in software engineering. IEEE Computer, 29(10):47–58, 1996.

[13] E. S. Raymond. The cathedral and the bazaar: musings on Linux and open source by an accidental revolutionary. Inf. Res., 6(4), 2001.

[14] C. R. Reis and R. P. de Mattos Fortes. An overview of the software engineering process and tools in the Mozilla project. In Open Source Software Development Workshop, pages 155–175, 2002.

[15] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of duplicate defect reports using natural language processing. In International Conference on Software Engineering (ICSE), pages 499–510, 2007.

[16] J. Sutherland. Business objects in corporate information systems. ACM Comput. Surv., 27(2):274–276, 1995.

[17] C. Weiß, R. Premraj, T. Zimmermann, and A. Zeller. How long will it take to fix this bug? In Workshop on Mining Software Repositories, May 2007.
