diversity-aware mutation adequacy criterion for improving fault detection capability
TRANSCRIPT
Diversity-Aware Mutation Adequacy Criterion for Improving Fault Detection Capability
Donghwan Shin, Shin Yoo, and Doo-Hwan Bae
KAIST, South Korea
@ Mutation 2016
ππ
ππ₯ ππ¦
kill
distinguish
diversity
kill
Β© Donghwan Shin, 2016
What do you think about this?
2
1 = killed ππ ππ
π 1 1
Assume: we have two mutants, and one test kills both.
Happy because it forms a mutation-adequate test suite.
Remove one of the mutants because they are redundant.
Consider more tests to discover the diversity of the mutants.
It increases the fault detection capability with little effort!
Β© Donghwan Shin, 2016
The concept of diversity has received much attention in software testing
3
Increase diversity of
tests
Explore wider range of program behaviors
Improve fault
detection capability
Adaptive Random Testing
Clustering-based test selection and prioritization
Information theoretic measures of diversity for test set selection
Β© Donghwan Shin, 2016
Diversity has received little attention in mutation
4
Generate diverse mutants
Just count the number
of killed mutants
May lose the diversity
of tests
Β© Donghwan Shin, 2016
Considering the diversity of mutants for a set of tests
5
1 = killed ππ ππ Idea: mutant distinguishment
β’ Two mutants are distinguished from each other by the set of tests that kill them (i.e., using kill patterns).
β’ No extra information is required. ππ 0 1
ππ 1 1
distinguished
undistinguished
Β© Donghwan Shin, 2016
Overview of empirical evaluation results
6
76.8The amount of the most increased fault detection rate (pp) by the π -criterion in compared to the π-criterion.
ZeroThere is no fault whose fault detection rate of the π -criterion is statistically decreased in compared to the π-criterion.
1.67The average size ratio of the test suites adequate to the π -criterion of the test suites adequate to the π-criterion.
* π-criterion: kill-only mutation adequacy criterionπ -criterion: diversity-aware mutation adequacy criterion
Β© Donghwan Shin, 2016
Outline
7
Formalize the basic concepts
Extend the mutation adequacy criterion
Evaluate fault
detection capability
Β© Donghwan Shin, 2016
Formalize the basic concepts: Kill and distinguish
8
Formalize the basic concepts
Extend the mutation adequacy criterion
Evaluate fault
detection capability
Β© Donghwan Shin, 2016
Test differentiator*: Simple notation for the notion of difference
9
* D. Shin and D. H. Bae, βA theoretical framework for understanding mutation-based testing methods,β in Proceedings of the IEEE International Conference on Software Testing, Verification and Validation (ICST 2016). IEEE, to appear.
βA test kills a mutant.β π π, ππ,π = π
βA test suite kills all mutants.β βπ β π΄, βπ β π»πΊ, π π, ππ,π = π
= 1 π‘ππ’π0 (ππππ π)
, if the behaviors of ππ and ππ for π are different
, otherwise
NotationsConcepts
= Existing mutation adequacy criterion (π-criterion)
π π, ππ, ππ
Β© Donghwan Shin, 2016
Mutant distinguishment: Two mutants are distinguished from each other by a test
10
A test π distinguishes two mutants ππ and ππ:
π π, ππ,ππ β π (π, ππ,ππ)
A set of tests π»πΊ distinguishes two mutants ππ and ππ:
βπ β π»πΊ, π π, ππ,ππ β π (π, ππ,ππ)
Β© Donghwan Shin, 2016
Extend the mutation adequacy criterion from βkill-onlyβ to βdiversity-awareβ
11
Formalize the basic concepts
Extend the mutation adequacy criterion
Evaluate fault
detection capability
Β© Donghwan Shin, 2016
Distinguishing mutation adequacy criterion: extended mutation adequacy considering diversity of mutants
12
Interesting properties
β’ Formally subsumes the π-criterion (thanks to π΄β²).
β’ Need to compare only at most π΄ mutants, not |π΄|π
β π΄ π.
A set of tests π»πΊ is distinguishing mutation-adequate:
βππ,ππ β π΄β², βπ β π»πΊ, π π, ππ,ππ β π π, ππ,ππ
where ππ β ππ,π΄β² = π΄βͺ ππ , and ππ = ππ
Β© Donghwan Shin, 2016
Working example: achieving π -adequate test suites
13
π (ππ, ππ,ππ) ππ ππ ππ ππ
ππ 1 1 0 0
ππ 0 0 1 1
ππ 1 0 1 0
Test suite Distinguished mutant setsNumber of mutant pairs
to be compared
{} {ππ,ππ,ππ,ππ,ππ} 4 (0-1,0-2,0-3,0-4)
{ππ} {ππ,ππ, ππ}, {ππ,ππ} 3 (0-3,0-4,1-2)
{ππ, ππ} {ππ}, {ππ,ππ}, {ππ, ππ} 2 (1-2, 3-4)
{ππ, ππ, ππ} {ππ}, {ππ}, {ππ}, {ππ}, {ππ} 0
Β© Donghwan Shin, 2016
Evaluate the fault detection capabilities of mutation adequacy criteria
14
Formalize the basic concepts
Extend the mutation adequacy criterion
Evaluate fault
detection capability
Β© Donghwan Shin, 2016
Research questions
RQ1: Are there rooms for improvement of the fault detection capability by distinguishing more mutants?
RQ2: Is the π -criterion more likely to detect faults than the π-criterion?
RQ3: How many tests are needed to be a mutation- adequate test suite?
(RQ4: How much time takes for selecting a mutation- adequate test suite?)
15
Β© Donghwan Shin, 2016
Evaluation setup
16
Defects4j[1]: Real fault/fix
Major[3]: Mutation
analysis tool
Fault/fix
Fault/fix Test pool
Mutantβs kill info.
π -suite
π-suite
[1] Just, RenΓ©, Darioush Jalali, and Michael D. Ernst. "Defects4J: A Database of existing faults to enable controlled testing studies for Java programs." Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 2014.
[2] Pacheco, Carlos, and Michael D. Ernst. "Randoop: feedback-directed random testing for Java." Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. ACM, 2007.
[3] Just, RenΓ©. "The Major mutation framework: Efficient and scalable mutation analysis for Java." Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 2014.
Test suite generator
Randoop[2]: Random test
data generator
Β© Donghwan Shin, 2016
Fault detection capability of an adequacy criterion
For each fault,
Fault detection capability of the π -criterion (or π-criterion)
= Fault detection rate of the π -suite (or π-suite)
= % fault-detecting test suites in the π -suite (or π-suite)
17
= 500 independent π-criterion adequate test suites
= 500 independent π -criterion adequate test suites
π-suite
π -suite
Β© Donghwan Shin, 2016
Subject faults, tests, and mutants
18
Total 45 isolated real faults from five applications
At most 10,000 tests for each fault as a random test pool
Generate mutants using the five commonly used mutation operators[1][2]
[1] A. Siami Namin, J. H. Andrews, and D. J. Murdoch. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the International Conference on Software Engineering (ICSE), pages 351β360, 2008.
[2] A.J. Offutt, A. Lee, G. Rothermel, R.H. Untch, and C. Zapf, βAn Experimental Determination of Sufficient Mutation Operators,β ACM Trans. Software Eng. and Methodology, vol. 5, no. 2, pp. 99-118, 1996.
Β© Donghwan Shin, 2016
RQ1: Are there rooms for improvement of fault detection capability by distinguishing more mutants?
19
Fault detection rate of π-suite < 1
Fault detection rate of π-suite = 1
Distinguished mutants rate of π-suite < 1
Type1: the room for improvement
Type3: the strong point of the k-criterion
Distinguished mutants rate of π-suite = 1
Type2: the weak point of the π -criterion
Type4: implicitly considered diversity of mutants
0% 20% 40% 60% 80% 100%
Type1 Type2 Type3 Type4
11.1% 51.1%11.1% 26.7%
Β© Donghwan Shin, 2016
Statisticallysignificant
Not statisticallysignificant
RQ2: Is the π -criterion more likely to detect faults than the π-criterion?
Yes, for all type1 faults!
β’ Statistically significant (πΆ=0.05, non-parametric prop. test)
β’ The effect size is also significant (increased at most 76.8 pp)
Never worse!
β’ The π -criterion is as effective as the π-criterion in the worst case.
20
0.00
0.00
0.60
0.60
2.00
2.20
6.60
10.40
17.60
22.00
76.80
0 50 100
Chart-11
β¦
Time-9
Chart-5
Math-14
Math-35
Math-68
Math-9
Math-27
Math-6
Math-90
Chart-12
Increased fault detection rate (pp)
Β© Donghwan Shin, 2016
RQ3: How many tests are needed to be a mutation-adequate test suite?
21
1.00
1.00
1.00
1.03
1.04
1.67
2.21
2.35
2.58
3.02
3.48
1 2 3 4
Closure-107
Math-14
Math-68
Math-4
Math-66
Average
Math-3
Math-103
Math-60
Lang-56
Math-104
Test suite size ratio
In average, the d-criterion needs 1.67 times more tests than the k-criterion.
Β© Donghwan Shin, 2016
Closer look: why the π -criterion is superior than the π-criterion in some cases?
Test suite size effect? NO!
22
One possible explanation
ππππ ππ
π»(ππ) π»(ππ)
π» π = set of tests that kills π
Commonly touched by π -criterionbut rarely touched by π-criterion
0.20%
80.80%
0%
20%
40%
60%
80%
100%
RND EXT
Fau
lt d
ete
ctio
n r
ate
(%
)
Test adequacy criterion
Chart 12
π -suiteπ-suite
Β© Donghwan Shin, 2016
Discussion: undistinguished or redundant?
23
For the two mutants βThey are undistinguishedβ βThey are redundantβ
IntentionTry to add more tests to
distinguish mutantsTry to remove mutants to
minimize the set of mutants
Basic ideaDistinguishing mutants
improves the fault detectionMutation score is inflated as a measure of the fault detection
Balance between TS and M
Improve TS for M Reduce M for TS
Representative paper This workAmmann et al.*
(ICST 2014)
* Ammann, Paul, Marcio E. Delamaro, and Jeff Offutt. "Establishing theoretical minimal sets of mutants." Software Testing, Verification and Validation (ICST), 2014 IEEE Seventh International Conference on. IEEE, 2014.
Assume: we have only two mutants, and one test kills both.
Β© Donghwan Shin, 2016
In summary, considering the diversity of mutants improves the fault detection capability
24
ππ
ππ₯ ππ¦
kill
distinguish
diversity
kill
βAll mutants are killed by a test suite.β
βAll pair of mutants are distinguished by a test suite.β
better than
in terms of fault detection capability
Questions?
Β© Donghwan Shin, 2016
Appendix: π-criterion formally subsumes π-criterion
25
βππ,ππ β π΄β², βπ β π»πΊ, π π, ππ,ππ β π (π, ππ,ππ)
βππ, ππ β π΄β², βπ β π»πΊ, π π, ππ,ππ β π (π, ππ, ππ)
(limit the scope of mutants) Let ππ = ππ β π΄β:
βππ, ππ β π΄β², βπ β π»πΊ, π π, ππ,ππ β π
βππ, ππ β π΄β², βπ β π»πΊ, π π, ππ,ππ = π
βπ β π΄, βπ β π»πΊ, π π, ππ,π = π
Start from the definition of π -criterion:
This is π-criterion!
Β© Donghwan Shin, 2016
Essentially the same as the equivalent mutant detection problem
βπ β π», π π, ππ,π = π
Appendix: Universally undistinguishable mutant detection problem
26
[1] L. Madeyski, W. Orzeszyna, R. Torkar, and M. Jo Μzala, βOvercoming the equivalent mutant problem: A systematic literature review and a comparative experiment of second order mutation,β IEEE Transactions on Software Engineering (TSE), vol. 40, no. 1, pp. 23β42, 2014. [2] M. Papadakis, Y. Jia, M. Harman, and Y. Le Traon, βTrivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique,β in Proceedings of the IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), vol. 1. IEEE, 2015, pp. 936β946. [3] M. Kintis and N. Malevris, βMedic: A static analysis framework for equivalent mutant identification,β Information and Software Technology, vol. 68, pp. 1β17, 2015.
πΏ = (ππ,ππ)|βπ β π», π π, ππ,ππ = π π, ππ,ππ πΏ
ππ = (ππ,ππ)|βπ β π», π π,ππ,ππ = π
Two mutants ππ and ππ are universally undistinguishable
mutants when βπ β π», π π, ππ,ππ = π π, ππ,ππ .
Many practical approximations[1][2][3]