TRANSCRIPT
What we currently know about software fault prediction:
A systematic review of the fault prediction literature
Presentation by Sarah Beecham
27 January 2010 – 3rd CREST Open Workshop – King’s College London
Talk will cover
1. Introduction – why are we interested in fault prediction?
2. Our research questions
3. Methodology – how we derived our 111 studies
4. The results – the context (e.g. type of system, programming language) and emerging themes (qualitative analysis)
5. Quality assessment & Conclusions
6. Future work – unresolved questions
Why do we need to know about Fault Prediction?
• Fixing faults is a major cost driver in software development (Tomaszewski et al. 2006).
• Preventing and removing faults costs $50–$78 billion per year in the US alone (Levinson 2001; Runeson and Andrews 2003).
• Software testing and debugging phases take up most of this spend (Di Fatta et al. 2006).
• Identifying faults early in the lifecycle with a high degree of accuracy results in:
– higher quality software
– better use of resources
(Koru and Liu 2005; Bezerra et al. 2007; Oral and Bener 2007)
Our research questions
RQ1: What is the context of the fault prediction model?
RQ2: What variables have been used in fault prediction models?
2.1: predictor or independent variables, e.g. size; McCabe
2.2: dependent variables, e.g. faults
RQ3: What modelling approaches have been used in the development of fault prediction models?
RQ4: How do studies measure performance of their models?
RQ5: How well do fault prediction models predict faults in code?
Methodology (1) – Search terms, resources and selection criteria
Search terms and string: (Fault* OR bug* OR corrections OR corrective OR fix* OR defect*) in title AND (Software) anywhere
Sources/databases: ACM Digital Library; IEEE Xplore; key conferences, e.g. ICSM, SCAM, ISSTA; key authors; …
Include study if: empirical; focused on fault prediction; predictor variable (input) linked to code; faults in code are the main output (dependent variable)
Exclude study if about: testing, fault injection, inspections, reliability modelling, aspects, effort estimation, nano-computing, … (list extended while conducting the review)
Methodology (2) – Study acceptance: 111 papers
Selection process (number of papers):
• Papers extracted from databases, conferences and author names: 1,316
• Sift based on title and abstract: 1,154 rejected
• Papers considered for review (full papers downloaded and reviewed): 162 primary, 80 secondary
• Papers accepted for the review (qualitative meta-analysis performed): 111
• Each stage independently validated – 3 researchers involved
• Meta-analysis/themes underpinned by previous SLRs
Components of a fault prediction approach:
Context: e.g. type of system (NASA project, telecomms, …); programming language (C++, Java, …); development process (waterfall, agile, …); programmer expertise.
Input variable (independent variable): e.g. complexity metrics, module size, even faults themselves.
Modelling method: e.g. regression, machine learning, correlation, BBN.
Output variable (dependent variable): fault-prone (fp) / not fault-prone (nfp) unit, OR number of faults, fault density, ranking.
A model predicts where faults will occur in the code with a given % accuracy. However, this alone has little meaning: we also need to know how often the model predicts a fault that is NOT there (a false positive), and how often it fails to pick up a fault that IS there (a false negative).
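As a toy illustration of this point (all numbers invented for the sketch): on an imbalanced dataset, a "model" that never flags a fault still scores high accuracy while missing every real fault.

```python
# Toy sketch: accuracy alone hides missed faults (all data invented).
# A degenerate "model" that labels every module not-fault-prone still
# looks accurate on imbalanced data, yet its recall is zero.

actual = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 1 = module really contains a fault
predicted = [0] * 10                      # model predicts "no fault" everywhere

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)             # 0.9 -- looks impressive
recall = tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 -- every real fault missed
print(f"accuracy={accuracy:.1f}, recall={recall:.1f}")
```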
Results … context first (RQ1)
Origin of data (a large proportion comes from the NASA PROMISE repository):
• Industrial: 44%
• OSS: 27%
• NASA: 24%
• Combined (OSS/NASA and industrial): 5%
Context (2) – Programming language:
• C/C++: 54%
• Java: 26%
• Not given: 9%
• Other: 7% ('Other' languages include Assembly, Fortran and Protel)
• Various: 4%
Independent or ‘predictor’ Variables (RQ2.1)
• Static code metrics: 43%
• Change data: 20%
• Previous fault information: 19%
• Miscellaneous: 16% (includes code patterns, testing and inspection metrics, qualitative data, and function points)
• Dynamic code metrics: 2%
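Since static code metrics dominate as predictors, a minimal sketch of how one such input might be harvested is shown below; the "src" directory, the comment convention and the LOC definition are all assumptions for illustration, not taken from any reviewed study.

```python
# Minimal sketch: extracting a crude static size metric (LOC per file).
# Real studies typically use tools that also compute McCabe/Halstead metrics.
from pathlib import Path

def loc(path: Path) -> int:
    """Count non-blank, non-comment lines as a rough size metric."""
    lines = path.read_text(errors="ignore").splitlines()
    return sum(1 for line in lines
               if line.strip() and not line.strip().startswith("#"))

# One (file -> metric) row per unit; rows like these become model inputs.
metrics = {p.name: loc(p) for p in Path("src").glob("*.py")}
```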
Code granularity
[Bar chart: number of studies per unit of output modelled – module, class, component/unit of code, file, other, system, class/module/file (combination), project]
Dependent Variables – Faults (RQ2.2)
• Categorical (fp or nfp): 37%
• Continuous (# faults per unit, density, rank): 34%
• Categorical and continuous: 25%
• Unclear: 4%
Modelling approaches used to develop fault prediction models (RQ3)
• Statistics: 47%
• Machine Learning (ML): 34%
• Bayes (BBN): 11%
• Other: 5%
• Fault Localization (FL): 3%
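A minimal sketch of the largest category, a statistical model (logistic regression) mapping code metrics to fault-proneness, is shown below. It assumes scikit-learn and fabricates a synthetic metrics matrix; it reproduces no reviewed study, only the general model shape.

```python
# Sketch of the "statistics" family: logistic regression mapping metrics
# to fault-proneness. Data is synthetic; scikit-learn is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # columns could stand for LOC, McCabe, churn
# Synthetic ground truth: the first two "metrics" drive fault-proneness.
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 1).astype(int)

model = LogisticRegression().fit(X, y)
p_fault_prone = model.predict_proba(X)[:, 1]  # probability a unit is fp
```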
Positive Performances (RQ5)
Table 18: Positive performances (reported in 86 out of 107 studies)

                                        Type of fault modelled
Performance level              Cont & Cat  Continuous only  Categorical only  Total     %
90% or more                         3             8                 2           13     15%
80-89%                              3             6                 2           11     13%
70-79%                              2             2                 5            9     10%
50-69% (incl. where a
concentration of faults
can be identified)                 12             9                 9           30     35%
Variable positive 50-100%           5             6                12           23     27%
Total                              25            31                30           86    100%
Quality Assessment
• Does the study report whether results are significant?
• Does the study recognise/address the imbalanced dataset problem?
• How has the performance of the model been assessed?
• How has the model been validated?
Significance test results
• Of the 111 studies, 47% conducted significance tests
• Some selected metrics based on their significance
• 53% of studies did not report such a test
• There is no certainty that model outputs are statistically significant in over half the papers in this study
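As a hedged sketch of the kind of test the 47% report: assuming scipy, a Mann-Whitney U test can ask whether a candidate metric (module size here) differs significantly between faulty and fault-free modules. The data below is synthetic.

```python
# Sketch of a significance test on a candidate predictor (synthetic data).
# Mann-Whitney U: does module size differ between faulty and clean modules?
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
size_faulty = rng.lognormal(mean=5.2, sigma=0.5, size=60)   # LOC, faulty modules
size_clean = rng.lognormal(mean=5.0, sigma=0.5, size=240)   # LOC, clean modules

stat, p_value = mannwhitneyu(size_faulty, size_clean)
print(f"p = {p_value:.4f}")  # p < 0.05 would support size as a predictor
```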
Few studies balance data
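The sketch below shows the simplest possible balancing step, random oversampling of the minority (faulty) class before training. It is illustrative only; studies that do balance their data may instead use undersampling or synthetic techniques such as SMOTE.

```python
# Sketch: random oversampling of the minority (faulty) class so the
# training set is roughly 50/50. Records here are placeholders.
import random

random.seed(0)
faulty = [("module", 1)] * 20   # minority class: fault-prone units
clean = [("module", 0)] * 180   # majority class: not fault-prone units

oversampled = faulty + random.choices(faulty, k=len(clean) - len(faulty))
balanced = clean + oversampled  # 180 clean + 180 faulty
random.shuffle(balanced)
```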
Performance measurements – measure constructs and definitions
Confusion matrix constructs: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Type I error, Type II error
Confusion matrix composite measures: Recall, Precision, Specificity, Accuracy, Sensitivity, Balance, F-measure
Both are based on the concept of code being faulty or not faulty.
Other measures: descriptive statistics (e.g. frequencies, means, ratios); regression coefficients; significance tests
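The composite measures above all derive from the four confusion-matrix counts; a short sketch with invented counts:

```python
# Composite measures from confusion-matrix counts (counts invented).
tp, fp, tn, fn = 30, 10, 150, 10

recall = tp / (tp + fn)                      # a.k.a. sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
f_measure = 2 * precision * recall / (precision + recall)
```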
Performance indicator frequency
Performance indicator                                Total obs.    %
Confusion matrix and related composite measures          41        23
Confusion matrix constructs                              19        11
Descriptive statistics                                   34        19
Error rates                                              25        14
Best R / R²                                              20        11
Correlation                                              14         8
Compare means / significance tests                       10         6
None                                                      6         3
Other                                                     6         3
Generalisability of the models – Validation
Type of validation performed                                                            #     %
Cross validation (e.g. 10-fold, actual vs predicted, data splitting within study)      54    29%
Across systems and organisations                                                       33    18%
Across methods/techniques                                                              23    12%
Across releases, builds or versions (temporal)                                         22    12%
Across projects                                                                        16     8%
Across project components                                                               8     4%
Baseline benchmarking, thresholds                                                       6     3%
Across studies (replication)                                                            5     3%
Across processes (e.g. developers or development groups, documents, parts of lifecycle) 5    3%
Across processed/synthetic/fault-injected data                                          4     2%
Across programs (incl. plug-ins)                                                        4     2%
Other (walkthrough, semantic validation, preliminary and specific evaluation, Schneidewind) 4 2%
Manual intervention/inspection                                                          3     2%
Across experiments                                                                      2     1%
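A minimal sketch of the most common entry in the table, k-fold cross validation, is given below; it assumes scikit-learn and synthetic data, and individual studies of course differ in folds, stratification and scoring.

```python
# Sketch: 10-fold cross validation of a fault-proneness classifier.
# scikit-learn and synthetic data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X.sum(axis=1) + rng.normal(size=300) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="recall")
print(scores.mean())  # mean recall over the 10 held-out folds
```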
Features identified as helpful to fault prediction…
Table 12: Features identified as helpful to model development

Features/predictors that enhance model performance                          Total
Process metrics (external to actual code)                                     20
Size metrics (e.g. larger files correlate to higher fault density)            15
Distribution of faults (Pareto effect)                                        14
LOC metrics                                                                   10
OO metrics (response for class, coupling between object classes,
  weighted methods per class)                                                  8
Structural/architectural complexity metrics                                    8
Fault persistence (temporal, e.g. over releases)                               7
Historic fault data (e.g. # of faults, type of fault)                          7
Fault severity                                                                 6
Age of file (where new creations are correlated to higher fault density)       3
Final Remarks…
To allow cross-comparison of models we need:
• A standard format to report model performance
• To report false positives and false negatives
• To understand that reporting one measure can be misleading – there is a trade-off between accuracy, precision and recall (composite measures)
• Access to reliable, quality data (next talk…)
LERO© 200622
THE IRISH SOFTWARE ENGINEERING RESEARCH CENTRELERO© 200623
Thank you
REFERENCES

Studies used for classifications/themes: (MacDonell and Shepperd 2007) predictor variables and model validation; (Fenton and Neil 1999) modelling techniques; (Runeson et al. 2001) classification of fault-prone components.

Bezerra, M. E. R., A. L. I. Oliveira and S. R. L. Meira (2007). A constructive RBF neural network for estimating the probability of defects in software modules. IJCNN 2007, International Joint Conference on Neural Networks, 2869-2874.
Di Fatta, G., S. Leue and E. Stegantova (2006). Discriminative pattern mining in software fault detection. Proceedings of the 3rd International Workshop on Software Quality Assurance, Portland, Oregon, ACM, 62-69.
Fenton, N. E. and M. Neil (1999). "A critique of software defect prediction models." IEEE Transactions on Software Engineering 25(5): 675-689.
Koru, A. G. and H. Liu (2005a). "Building defect prediction models in practice." IEEE Software 22(6): 23-29.
Koru, A. G. and H. Liu (2005b). An investigation of the effect of module size on defect prediction using static measures. Proceedings of the 2005 Workshop on Predictor Models in Software Engineering, St. Louis, Missouri, ACM, 1-5.
MacDonell, S. G. and M. J. Shepperd (2007). Comparing local and global software effort estimation models – reflections on a systematic review. ESEM 2007, First International Symposium on Empirical Software Engineering and Measurement, 401-409.
Oral, A. D. and A. B. Bener (2007). Defect prediction for embedded software. ISCIS 2007, 22nd International Symposium on Computer and Information Sciences, 1-6.
Runeson, P. and A. Andrews (2003). Detection or isolation of defects? An experimental comparison of unit testing and code inspection. ISSRE 2003, 14th International Symposium on Software Reliability Engineering, 3-13.
Runeson, P., M. C. Ohlsson and C. Wohlin (2001). A classification scheme for studies on fault-prone components. PROFES 2001, 341-355.
Tomaszewski, P., H. Grahn and L. Lundberg (2006). A method for an accurate early prediction of faults in modified classes. ICSM '06, 22nd IEEE International Conference on Software Maintenance, 487-496.