
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in Large Systems

Pavneet Singh Kochhar, Ferdian Thung, David Lo

Singapore Management University

{kochharps.2012,ferdiant.2013,davidlo}@smu.edu.sg

International Conference on Software Analysis, Evolution, and Reengineering (SANER’15)

Software Testing, Why Bother?


Functionality -- Requirements

Bugs -- Software reliability

Costs -- Late bugs cost more

Software Testing, Why Bother?

• Horgan and Mathur [1]
– Adequate testing is critical to developing reliable software
• Tassey [2]
– Inadequate testing costs the US economy 59 billion dollars annually

[1] J. R. Horgan and A. P. Mathur, “Software testing and reliability,” McGraw-Hill, Inc., 1996.
[2] G. Tassey, “The economic impacts of inadequate infrastructure for software testing,” National Institute of Standards and Technology, 2002.

Previous Studies

• Gopinath et al. [1]
– Analyze hundreds of open-source projects to measure the quality of test suites
– Projects used are small, i.e., 10 LOC to 10,000 LOC
• Inozemtseva and Holmes [2]
– Analyze the relationship between test suite size, coverage and effectiveness
– Five large software systems

Both of these studies use mutants, i.e., artificially injected bugs.

[1] R. Gopinath, C. Jensen, and A. Groce, “Code coverage for suite evaluation by developers,” ICSE, 2014.
[2] L. Inozemtseva and R. Holmes, “Coverage is not strongly correlated with test suite effectiveness,” ICSE, 2014.

Code Coverage


• Percentage of the code executed by test cases
• Used as a proxy for adequacy of testing
• Types:
– Statement coverage
– Branch coverage
• We measure coverage using Cobertura*

*http://cobertura.github.io/cobertura/
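The two coverage types above can be illustrated with a minimal sketch. It assumes we already have raw counts of executed statements and branch outcomes. The `CoverageCounts` type and its field names are hypothetical and are not part of Cobertura.

```python
from dataclasses import dataclass


@dataclass
class CoverageCounts:
    """Hypothetical raw counts collected while running a test suite."""
    statements_total: int
    statements_executed: int
    branches_total: int
    branches_executed: int


def percent(covered: int, total: int) -> float:
    """Return covered/total as a percentage; an empty denominator counts as 0%."""
    return 100.0 * covered / total if total else 0.0


def statement_coverage(counts: CoverageCounts) -> float:
    """Percentage of statements executed by at least one test case."""
    return percent(counts.statements_executed, counts.statements_total)


def branch_coverage(counts: CoverageCounts) -> float:
    """Percentage of branch outcomes taken by at least one test case."""
    return percent(counts.branches_executed, counts.branches_total)


# Illustrative numbers only, not taken from the study.
counts = CoverageCounts(statements_total=200, statements_executed=75,
                        branches_total=40, branches_executed=9)
print(f"statement: {statement_coverage(counts):.1f}%")  # statement: 37.5%
print(f"branch: {branch_coverage(counts):.1f}%")        # branch: 22.5%
```

Branch coverage is usually lower than statement coverage for the same suite, because executing a conditional statement does not imply taking both of its outcomes.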

Study Goals

To understand the correlation between the test suite size, coverage and effectiveness.


Is code coverage effective in killing real bugs?

Outline

• Motivation and Goals
• Overall Process
• Dataset
• Empirical Results
• Conclusion and Future Work


Overall Process



Dataset


Project      Lines of Code   Number of Bugs*
HTTPClient   122,288         67
Rhino        116,065         92

Project          HTTPClient                                   Rhino
Description      Java library for client-side HTTP services   JavaScript engine
Developed by     Apache                                       Mozilla
Build Tool       Maven                                        Ant
Issue Tracking   JIRA                                         Bugzilla

* K. Herzig, S. Just, and A. Zeller, “It’s not a bug, it’s a feature: How misclassification impacts bug prediction,” ICSE, 2013.

Test Suite Size & Coverage


We use the Randoop tool to generate JUnit tests for 5 minutes.

             % of Original Test Suite Size
Project      0.2    0.5     1       5        10       100
HTTPClient   7.43   15.62   39.13   197.82   396.17   3967.00
Rhino        7.64   16.01   40.10   202.52   405.46   4059.28

                        % of Original Test Suite Size
Project      Coverage   0.2   0.5    1      5      10     100
HTTPClient   Line       7.5   11.0   17.2   28.0   31.8   37.4
             Branch     2.8   4.4    7.6    14.4   17.2   22.5
Rhino        Line       6.4   8.7    11.6   17.0   19.4   27.1
             Branch     3.0   4.2    5.8    9.0    10.5   16.5
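The tables report suites at several percentages of the original suite size. A minimal sketch of building such reduced suites follows. It assumes they are drawn by uniform random sampling without replacement, which the slides do not state. The function name, the test identifiers, and the pool size of 1000 are hypothetical.

```python
import random
from typing import List, Sequence

# Percentages of the original suite size listed in the tables above.
PERCENTAGES = [0.2, 0.5, 1, 5, 10, 100]


def sample_suite(all_tests: Sequence[str], percent: float,
                 rng: random.Random) -> List[str]:
    """Draw a random sub-suite holding `percent`% of the original tests.

    Assumption: uniform sampling without replacement, keeping at least one test.
    """
    size = max(1, round(len(all_tests) * percent / 100))
    return rng.sample(list(all_tests), size)


# Hypothetical pool of generated test identifiers.
pool = [f"RandoopTest{i}" for i in range(1000)]
rng = random.Random(42)  # fixed seed so the sketch is repeatable
for p in PERCENTAGES:
    suite = sample_suite(pool, p, rng)
    print(f"{p}% -> {len(suite)} tests")
```

Repeating the draw many times per percentage and averaging would give non-integer mean sizes like those in the first table, though whether the study did this is also an assumption.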

Test Suite Effectiveness


A test suite kills a bug if it runs successfully on the non-buggy version (i.e., all test cases pass) and fails on the buggy version (i.e., at least one test case fails).
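The definition above can be written as a small predicate. This is a sketch only. It assumes each run is summarized as a mapping from test name to pass/fail, and the function and variable names are hypothetical.

```python
from typing import Dict


def kills_bug(results_fixed: Dict[str, bool],
              results_buggy: Dict[str, bool]) -> bool:
    """Return True if the suite kills the bug.

    Each dict maps a test name to True (passed) or False (failed).
    The suite must pass entirely on the non-buggy version and have
    at least one failing test on the buggy version.
    """
    passes_on_fixed = all(results_fixed.values())
    fails_on_buggy = not all(results_buggy.values())
    return passes_on_fixed and fails_on_buggy


# Illustrative runs: test_b fails only on the buggy version.
fixed_run = {"test_a": True, "test_b": True}
buggy_run = {"test_a": True, "test_b": False}
print(kills_bug(fixed_run, buggy_run))  # True
```

Requiring a clean run on the non-buggy version filters out tests that fail for reasons unrelated to the bug. The result for each (suite, bug) pair is a 0/1 value.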

Point Biserial Correlation


• Measures the correlation between two variables when one of them is naturally dichotomous, i.e., it takes a value of either 0 or 1.

• Pett [1]

Value Range              Correlation
r_pb² ≥ 0.81             Very strong
0.49 ≤ r_pb² < 0.81      Strong
0.25 ≤ r_pb² < 0.49      Moderate
0.09 ≤ r_pb² < 0.25      Weak
0.00 ≤ r_pb² < 0.09      Very weak

[1] M. A. Pett, “Nonparametric Statistics for Health Care Research: Statistics for Small Samples and Unusual Distributions,” Sage Publications, Inc., 1997.
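A minimal sketch of the computation, assuming the textbook formula r_pb = (M1 − M0) / s_n · sqrt(n1·n0 / n²), where M1 and M0 are the means of the continuous variable in the two groups and s_n is the population standard deviation of all values. The function names are hypothetical, and the thresholds in `strength` are copied from the table above.

```python
import math
from typing import Sequence


def point_biserial(binary: Sequence[int], values: Sequence[float]) -> float:
    """Point biserial correlation between a 0/1 variable and a continuous one."""
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    if any(b not in (0, 1) for b in binary):
        raise ValueError("binary variable must contain only 0 and 1")
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    if n1 == 0 or n0 == 0:
        raise ValueError("both groups must be non-empty")
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation of the continuous variable.
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std_all == 0:
        raise ValueError("continuous variable has zero variance")
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / (n * n))


def strength(r_squared: float) -> str:
    """Map a squared correlation to the label in the table above."""
    if r_squared >= 0.81:
        return "Very strong"
    if r_squared >= 0.49:
        return "Strong"
    if r_squared >= 0.25:
        return "Moderate"
    if r_squared >= 0.09:
        return "Weak"
    return "Very weak"


# Illustrative data only: 1 = bug killed, paired with the suite's coverage (%).
killed = [0, 0, 1, 0, 1, 1]
coverage = [7.5, 11.0, 17.2, 28.0, 31.8, 37.4]
r = point_biserial(killed, coverage)
print(f"r_pb = {r:.3f}, r_pb^2 = {r * r:.3f}, {strength(r * r)}")
```

Here the dichotomous variable would be whether a suite kills a given bug, and the continuous variable would be the suite's size or coverage.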


Research Questions


RQ1: Is there a correlation between a test suite’s size and its effectiveness?

RQ2: Is there a correlation between a test suite’s coverage and its effectiveness?


RQ1: Size vs Effectiveness


Test suite size is weakly to strongly correlated with test suite effectiveness.

Point Biserial Correlation

           HTTPClient   Rhino
r_pb²      0.49         0.14
p-value    *            *

* Statistically significant


RQ2: Coverage vs Effectiveness


Code coverage of a test suite is moderately to strongly correlated with its effectiveness.

Point Biserial Correlation

           Statement               Branch
           HTTPClient   Rhino      HTTPClient   Rhino
r_pb²      0.33         0.59       0.36         0.55
p-value    *            *          *            *

* Statistically significant

Conclusion & Future Work

Using real bugs, we find that
• Test suite size is weakly to strongly correlated with test suite effectiveness.
• Code coverage is moderately or strongly correlated to the effectiveness of a test suite.

Future Work:
• Expand the study to include more projects
– Address threats to external validity
• Use human generated test cases


Thank you!

Questions? Comments? Advice?

{kochharps.2012,ferdiant.2013,davidlo}@smu.edu.sg


Threats to Validity

• Internal validity:
– We link bug reports to commits using bug ids
– We use Randoop for 5 minutes
• External validity:
– Only analyze 2 large software systems
• Construct validity:
– We use point biserial correlation


Related Work

• Empirical studies on testing and coverage
– Gligoric et al. show that branch coverage is the best measure for test suite quality [1]
– Namin et al. show that test suite size and coverage are correlated with test suite effectiveness [2]
– Gopinath et al. investigate the correlation between coverage and a test suite’s effectiveness in killing mutants [3]

[1] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov, “Comparing non-adequate test suites using coverage criteria,” ISSTA, 2013.
[2] A. S. Namin and J. H. Andrews, “The influence of size and coverage on test suite effectiveness,” ISSTA, 2009.
[3] R. Gopinath, C. Jensen, and A. Groce, “Code coverage for suite evaluation by developers,” ICSE, 2014.