
Page 1: Potential Biases in Bug Localization: Do They Matter?

Pavneet Singh Kochhar, Yuan Tian, David Lo
Singapore Management University
{kochharps.2012, yuan.tian.2012, davidlo}@smu.edu.sg

Page 2: Issue Tracking

• Projects use issue tracking systems like JIRA

• Well-known projects receive a large number of issue reports

• The large number of bug reports can overwhelm developers

• Mozilla developer - “Everyday, almost 300 bugs appear that need triaging. This is far too much for only the Mozilla programmers to handle” *

What have researchers proposed to overcome this issue?

* J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in ETX, pp. 35–39, 2005

2/25

Page 3: Bug Localization

GOAL: Given thousands of source code files, find the buggy ones

3/25

Page 4: How Bug Localization Works

• Uses fixed/closed bug reports

• Uses standard information retrieval (IR) techniques such as the vector space model (VSM)

• Computes similarity between bug reports & source code

• Returns a ranked list of potentially buggy source code files

• The returned list is compared with the actual buggy files to compute accuracy
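The pipeline above can be sketched as follows. This is a minimal illustration (pure-Python TF-IDF with cosine similarity), not the exact implementation evaluated in the paper; all function names are my own:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters
    # (a simplification of real IR preprocessing).
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: tf-idf weight} dict per doc.
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def localize(bug_report, files):
    # files: {filename: source text}; returns filenames ranked by
    # similarity to the bug report (most suspicious first).
    names = list(files)
    vecs = tfidf_vectors([tokenize(bug_report)] +
                         [tokenize(files[n]) for n in names])
    query = vecs[0]
    scores = {n: cosine(query, v) for n, v in zip(names, vecs[1:])}
    return sorted(names, key=lambda n: -scores[n])
```

A real tool would add source-code-aware tokenization (camelCase splitting, stop-word removal); the ranking idea is the same.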

4/25

Page 5: Issues in Bug Localization

HOWEVER

What if bug localization results are biased?

• A past study* shows that up to 80% of bug reports can be localized by inspecting only 5 source code files

• These results are promising

* R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, “Improving bug localization using structured information retrieval,” ASE 2013

5/25

Page 6: Our Study

Potential Biases in Bug Localization

1. Wrongly Classified Reports (Herzig et al.*: 1/3 of reports marked as bugs are not bugs)

2. Already Localized Reports

3. Incorrect Ground Truth Files (Kawrykow et al.+: many changes are non-essential)

* K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction,” ICSE 2013
+ D. Kawrykow and M. P. Robillard, “Non-essential changes in version histories,” ICSE 2011

6/25


Page 8: Dataset

Projects     Organization  Tracker  Number of Issue Reports
HTTPClient   Apache        JIRA      746
Jackrabbit   Apache        JIRA     2402
Lucene-Java  Apache        JIRA     2443

Total = 5591 Issue Reports *

* K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction,” ICSE 2013

8/25

Page 9: Evaluation Metric

Average Precision (AP) – the average of the precision values computed at the rank of each buggy file in a returned list

Mean Average Precision (MAP) – the mean of the average precisions over all ranked lists (one per bug report)
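As a concrete illustration, the metric can be computed like this (a sketch under the usual IR definitions; the function names are my own):

```python
def average_precision(ranked, relevant):
    # ranked: ordered list of files returned for one bug report.
    # relevant: set of the actual buggy files (the ground truth).
    hits, precisions = 0, []
    for i, f in enumerate(ranked, start=1):
        if f in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at each buggy-file rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    # results: list of (ranked list, relevant set) pairs, one per bug report.
    aps = [average_precision(r, rel) for r, rel in results]
    return sum(aps) / len(aps)
```

For example, if the buggy files {a, c} are returned at ranks 1 and 3, AP = (1/1 + 2/3) / 2.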

9/25

Page 10: BIAS 1 – Report Misclassification

Mean Average Precision (MAP) Scores

Projects     Reported  Actual  Difference  Cohen's d
HTTPClient   0.429     0.419   -2.33%      0.13
Jackrabbit   0.302     0.339   12.25%*     0.06
Lucene-Java  0.301     0.322    6.98%      0.04

Differences between MAP scores range from -2.33% to 12.25%
* Statistically significant difference (Mann-Whitney-Wilcoxon test)
Effect sizes are trivial (d < 0.2)

10/25

Page 11: BIAS 1 – Report Misclassification

Mean Average Precision (MAP) Scores

Actual to Reported    HC     JB     LJ     Overall
None                  0.429  0.302  0.301  0.312
RFE to BUG            0.427  0.303  0.304  0.313
DOCUMENTATION to BUG  0.430  0.304  0.305  0.315
IMPROVEMENT to BUG    0.416  0.299  0.295  0.307
REFACTORING to BUG    0.428  0.301  0.301  0.311
BACKPORT to BUG       0.430  0.303  0.300  0.313
CLEANUP to BUG        0.429  0.303  0.303  0.314
SPEC to BUG           0.435  0.302  0.301  0.312
TASK to BUG           0.432  0.302  0.301  0.312
TEST to BUG           0.429  0.328  0.313  0.334
BUILD_SYSTEM to BUG   0.429  0.306  0.303  0.315
DESIGN_DEFECT to BUG  0.424  0.301  0.301  0.311
OTHERS to BUG         0.439  0.303  0.301  0.313

* HC – HTTPClient, JB – Jackrabbit, LJ – Lucene-Java

11/25

Page 12: BIAS 1 – Report Misclassification

Results:
• Significantly impacts bug localization results for 1 out of 3 projects
• However, effect sizes are negligible, i.e., < 0.2

12/25

Page 13: BIAS 2 – Localized Bug Reports

Categories

Category   Description
Fully      All the buggy files are mentioned in the bug report
Partially  Some of the buggy files are mentioned in the bug report
Not        The bug report does not mention any buggy files

Fully Localized Report (Example)

Summary      DecompressingEntity not calling close on InputStream retrieved by getContent
Description  The method DecompressingEntity.writeTo(OutputStream outstream) does not close the InputStream retrieved by getContent().
Buggy Files  DecompressingEntity.java

13/25

Page 14: BIAS 2 – Localized Bug Reports

Manually Identifying Localized Reports

1. Start from 5591 issue reports
2. Keep the 1191 reports that are actual bug reports (Herzig et al.*)
3. Randomly select 350 of them
4. Inspect the files changed by the fix and the report's summary & description
5. Classify the bug reports into the three categories

* K. Herzig, S. Just, and A. Zeller, “It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction,” ICSE 2013

14/25

Page 15: BIAS 2 – Localized Bug Reports

Automatically Identifying Localized Reports

Based on the manual investigation, build an algorithm to automatically classify bug reports:
• Input – summary/description of a bug report & the files changed to fix the bug
• Output – the bug report classified into one of the 3 categories
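One plausible way to sketch such a classifier. This is an assumption on my part: the rule below checks whether the changed files are named in the report text, which may differ from the paper's actual algorithm; `classify_report` is a hypothetical helper:

```python
def classify_report(summary, description, changed_files):
    # changed_files: paths of the files changed by the fix.
    # A file counts as "mentioned" if its base name (without extension)
    # appears in the report's summary or description.
    text = (summary + " " + description).lower()
    mentioned = [f for f in changed_files
                 if f.rsplit("/", 1)[-1].rsplit(".", 1)[0].lower() in text]
    if changed_files and len(mentioned) == len(changed_files):
        return "fully"      # all buggy files are named in the report
    if mentioned:
        return "partially"  # only some buggy files are named
    return "not"            # no buggy files are named
```

For the fully localized example on page 13, the single changed file DecompressingEntity.java is named in the summary, so the report would be classified as fully localized.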

15/25

Page 16: BIAS 2 – Localized Bug Reports

Number/Proportion of Localized Reports

Project      Category   Number  Proportion
HTTPClient   Fully        36     3.02%
             Partially    28     2.35%
             Not          35     2.93%
Jackrabbit   Fully       299    25.10%
             Partially   132    11.08%
             Not         402    33.75%
Lucene-Java  Fully        63     5.28%
             Partially    87     7.30%
             Not         109     9.15%

Overall, 33.41% of the reports are fully localized
More than 50% are fully or partially localized

16/25

Page 17: BIAS 2 – Localized Bug Reports

Mean Average Precision (MAP) Scores

Projects     Fully  Partially  Not
HTTPClient   0.615  0.349      0.250
Jackrabbit   0.560  0.373      0.187
Lucene-Java  0.527  0.338      0.197

Difference between Fully & Not: HTTPClient – 84.39%, Jackrabbit – 99.86%, Lucene-Java – 91.16%

17/25

Page 18: BIAS 2 – Localized Bug Reports

Comparison – Fully vs. Partially vs. Not

             Fully-Partially       Partially-Not         Fully-Not
Projects     p-value  d   Effect   p-value  d   Effect   p-value  d   Effect
HTTPClient   *        0.94  L      *        0.53  M      *        1.27  L
Jackrabbit   *        0.56  M      *        0.55  M      *        1.14  L
Lucene-Java  *        0.53  M      *        0.41  S      *        1.04  L

* Significant differences (p-value < 0.05)
Effect sizes between Fully & Not are LARGE

18/25

Page 19: BIAS 2 – Localized Bug Reports

Best & Worst Bug Reports

Project             Fully  Partially  Not  p-value
HTTPClient   Upper   16      5         4   0.0041*
             Lower    6      4        15
Jackrabbit   Upper   35      9         6   2.807e-13*
             Lower    7      1        42
Lucene-Java  Upper   22     18        10   8.724e-05*
             Lower    5     18        27

* Significant differences (p-value < 0.05)

19/25

Page 20: BIAS 2 – Localized Bug Reports

Results:
• More than 50% of the bugs are either fully or partially localized
• MAP scores for fully & partially localized reports are much higher than for not localized ones
• Effect sizes between fully & not localized are LARGE

20/25

Page 21: BIAS 3 – Non-Buggy Files

Manual Investigation

1. Randomly select 100 not-localized bug reports
2. Collect the files changed to fix these bugs
3. Diff the original & modified version of each file
4. Mark files with only cosmetic changes, refactorings, etc. as non-buggy
5. Obtain clean GROUND TRUTH files
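A rough sketch of the filtering idea, under my own simplification (not the paper's actual procedure): treat a changed file as non-buggy when its before/after versions differ only in whitespace or pure `//` comment lines:

```python
def is_essential_change(before, after):
    # Normalize each line: drop all whitespace and blank out pure
    # '//' comment lines, then compare the two versions.
    def normalize(src):
        lines = []
        for line in src.splitlines():
            s = "".join(line.split())
            lines.append("" if s.startswith("//") else s)
        return [l for l in lines if l]  # ignore blank/comment-only lines
    return normalize(before) != normalize(after)
```

Real non-essential-change detection (as in Kawrykow et al.) also handles renames, block comments, and trivial refactorings; this only captures the simplest cases.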

21/25

Page 22: BIAS 3 – Non-Buggy Files

Example

22/25

Page 23: BIAS 3 – Non-Buggy Files

Mean Average Precision (MAP) Scores

Projects     Dirty  Clean  Difference  Cohen's d
HTTPClient   0.207  0.171  0.036       0.08
Jackrabbit   0.115  0.115  0.000       0.08
Lucene-Java  0.271  0.239  0.032       0.17

Differences are not significant
Effect sizes are trivial (d < 0.2)

23/25

Page 24: BIAS 3 – Non-Buggy Files

Results:
• 28.11% of the files in the ground truth are non-buggy
• Differences between MAP scores are not significant
• Effect sizes are negligible, i.e., < 0.2

24/25

Page 25: Conclusion

BIAS 1 – Wrongly classified issue reports: NOT statistically significant, NO substantial impact

BIAS 2 – Localized bug reports: statistically significant, substantial impact

BIAS 3 – Non-buggy files: NOT statistically significant, NO substantial impact

25/25

Page 26: Thank You!

Email: [email protected]

Page 27: Other Evaluation Metrics

HIT@N – percentage of bug reports with at least one buggy file in the top N ranked results

Mean Reciprocal Rank (MRR) – the reciprocal rank is the inverse of the rank of the first buggy file; MRR is the average of the reciprocal ranks over all bug reports
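The two metrics can be sketched as follows (function names are my own; `results` pairs each report's ranked list with its set of buggy files):

```python
def hit_at_n(results, n):
    # Fraction of bug reports with at least one buggy file in the top n.
    hits = sum(1 for ranked, relevant in results
               if any(f in relevant for f in ranked[:n]))
    return hits / len(results)

def mean_reciprocal_rank(results):
    # Average, over all reports, of 1 / rank of the first buggy file
    # (0 if no buggy file is returned at all).
    rrs = []
    for ranked, relevant in results:
        rr = 0.0
        for i, f in enumerate(ranked, start=1):
            if f in relevant:
                rr = 1.0 / i  # reciprocal rank of the first buggy file
                break
        rrs.append(rr)
    return sum(rrs) / len(rrs)
```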

Page 28: BIAS 1 – Report Misclassification

Page 29: BIAS 2 – Localized Bug Reports

Page 30: BIAS 3 – Non-Buggy Files

Page 31: BIAS 1, BIAS 2 & BIAS 3

Mean Reciprocal Rank (MRR) Scores

Page 32: Appendix (Statistical Analysis)

• Mann-Whitney-Wilcoxon (MWW) test: given a significance level α = 0.05, if the p-value < α, then the test rejects the null hypothesis
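For illustration, a self-contained version of the test (normal approximation with average ranks for ties; a real analysis would use a statistics library such as R or SciPy):

```python
import math

def mann_whitney_u(xs, ys):
    # Rank all values together (average ranks for tied values), compute
    # the U statistic for xs, and a two-sided p-value via the normal
    # approximation (tie correction to the variance is omitted).
    combined = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank for the tied block
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(xs), len(ys)
    r1 = sum(ranks[:n1])                 # rank sum of the first sample
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p
```

With α = 0.05, the null hypothesis is rejected when the returned p-value is below 0.05.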

Page 33: Appendix (BIAS-2 Results)

Actual to Reported    HC     JB     LJ     Overall
None                  0.429  0.302  0.301  0.312
RFE to BUG            0.427  0.303  0.304  0.313
DOCUMENTATION to BUG  0.430  0.304  0.305  0.315
IMPROVEMENT to BUG    0.416  0.299  0.295  0.307
REFACTORING to BUG    0.428  0.301  0.301  0.311
BACKPORT to BUG       0.430  0.303  0.300  0.313
CLEANUP to BUG        0.429  0.303  0.303  0.314
SPEC to BUG           0.435  0.302  0.301  0.312
TASK to BUG           0.432  0.302  0.301  0.312
TEST to BUG           0.429  0.328  0.313  0.334
BUILD_SYSTEM to BUG   0.429  0.306  0.303  0.315
DESIGN_DEFECT to BUG  0.424  0.301  0.301  0.311
OTHERS to BUG         0.439  0.303  0.301  0.313