TRANSCRIPT
What we currently know about software fault prediction:
A systematic review of the fault prediction literature
Presentation by Sarah Beecham
27 January 2010 – 3rd CREST Open Workshop – King’s College London
Talk will cover
1. Introduction – why are we interested in fault prediction?
2. Our research questions
3. Methodology – how we derived our 111 studies
4. The results – the context (e.g. type of system, programming language) and emerging themes (qualitative analysis)
5. Quality assessment & Conclusions
6. Future work – unresolved questions
Why do we need to know about Fault Prediction?
• Fixing faults is a major cost driver in software development (Tomaszewski et al. 2006).
• Preventing and removing faults costs $50–$78 billion per year in the US alone (Levinson 2001; Runeson and Andrews 2003).
• Software testing and debugging phases take up most of this spend (Di Fatta et al. 2006).
• Identifying faults early in the lifecycle with a high degree of accuracy results in:
– higher quality software
– better use of resources
(Koru and Liu 2005; Bezerra et al. 2007; Oral and Bener 2007)
Our research questions
RQ1: What is the context of the fault prediction model?
RQ2: What variables have been used in fault prediction models?
2.1: predictor or independent variables, e.g. size; McCabe
2.2: dependent variables, e.g. faults
RQ3: What modelling approaches have been used in the development of fault prediction models?
RQ4: How do studies measure performance of their models?
RQ5: How well do fault prediction models predict faults in code?
Methodology (1) – Search terms, resources and selection criteria
Search terms and string: (Fault* OR bug* OR corrections OR corrective OR fix* OR defect*) in title AND (Software) anywhere
Sources/databases: ACM Digital Library; IEEE Xplore; key conferences, e.g. ICSM, SCAM, ISSTA; key authors; …
Include study if: empirical; focused on fault prediction; predictor variable (input) linked to code; faults in code are the main output (dependent variable)
Exclude study if about: testing, fault injection, inspections, reliability modelling, aspects, effort estimation, nano-computing, … (list extended while conducting the review)
Methodology (2) – Study acceptance: 111 papers
Selection process (number of papers):
• Papers extracted from databases, conferences and author names: 1,316
• Sift based on title and abstract: 1,154 rejected
• Papers considered for review (full papers downloaded and reviewed): 162 primary, 80 secondary
• Papers accepted for the review (qualitative meta-analysis performed): 111
• Each stage independently validated – 3 researchers involved
• Meta-analysis/themes underpinned by previous SLRs
Components of a fault prediction approach:
Context: e.g. type of system (NASA project, telecomms, …); programming language (C++, Java, …); development process (waterfall, agile, …); programmer expertise.
Input variable (independent variable): e.g. complexity metrics, module size, even faults themselves.
Modelling method: e.g. regression, machine learning, correlation, BBN.
Output variable (dependent variable): fault-prone (fp) / not fault-prone (nfp) unit, OR number of faults, fault density, ranking.
A model predicts where faults will occur in the code with a given % accuracy. However, this alone has little meaning: we also need to know how often the model predicts a fault that is NOT there (a false positive), and how often it fails to pick up a fault that IS there (a false negative).
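As a toy illustration of this point (all numbers invented for the sketch): on an imbalanced dataset, a "model" that never flags a fault still scores high accuracy while missing every real fault.

```python
# Toy sketch: accuracy alone hides missed faults (all data invented).
# A degenerate "model" that labels every module not-fault-prone still
# looks accurate on imbalanced data, yet its recall is zero.

actual = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 1 = module really contains a fault
predicted = [0] * 10                      # model predicts "no fault" everywhere

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)             # 0.9 -- looks impressive
recall = tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 -- every real fault missed
print(f"accuracy={accuracy:.1f}, recall={recall:.1f}")
```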
Results … context first (RQ1)
Origin of data (a large proportion comes from the NASA PROMISE repository):
• Industrial: 44%
• OSS: 27%
• NASA: 24%
• Combined (OSS/NASA and industrial): 5%
Context (2) – Programming language:
• C/C++: 54%
• Java: 26%
• Not given: 9%
• Other: 7% ('Other' languages include Assembly, Fortran and Protel)
• Various: 4%
Independent or ‘predictor’ Variables (RQ2.1)
• Static code metrics: 43%
• Change data: 20%
• Previous fault information: 19%
• Miscellaneous: 16% (includes code patterns, testing and inspection metrics, qualitative data, and function points)
• Dynamic code metrics: 2%
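Since static code metrics dominate as predictors, a minimal sketch of how one such input might be harvested is shown below; the "src" directory, the comment convention and the LOC definition are all assumptions for illustration, not taken from any reviewed study.

```python
# Minimal sketch: extracting a crude static size metric (LOC per file).
# Real studies typically use tools that also compute McCabe/Halstead metrics.
from pathlib import Path

def loc(path: Path) -> int:
    """Count non-blank, non-comment lines as a rough size metric."""
    lines = path.read_text(errors="ignore").splitlines()
    return sum(1 for line in lines
               if line.strip() and not line.strip().startswith("#"))

# One (file -> metric) row per unit; rows like these become model inputs.
metrics = {p.name: loc(p) for p in Path("src").glob("*.py")}
```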
Code granularity
[Bar chart: number of studies per unit of output modelled – module, class, component/unit of code, file, other, system, class/module/file (combination), project]
Dependent Variables – Faults (RQ2.2)
• Categorical (fp or nfp): 37%
• Continuous (# faults per unit, density, rank): 34%
• Categorical and continuous: 25%
• Unclear: 4%
Modelling approaches used to develop fault prediction models (RQ3)
• Statistics: 47%
• Machine Learning (ML): 34%
• Bayes (BBN): 11%
• Other: 5%
• Fault Localization (FL): 3%
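A minimal sketch of the largest category, a statistical model (logistic regression) mapping code metrics to fault-proneness, is shown below. It assumes scikit-learn and fabricates a synthetic metrics matrix; it reproduces no reviewed study, only the general model shape.

```python
# Sketch of the "statistics" family: logistic regression mapping metrics
# to fault-proneness. Data is synthetic; scikit-learn is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # columns could stand for LOC, McCabe, churn
# Synthetic ground truth: the first two "metrics" drive fault-proneness.
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 1).astype(int)

model = LogisticRegression().fit(X, y)
p_fault_prone = model.predict_proba(X)[:, 1]  # probability a unit is fp
```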
Positive Performances (RQ5)
Table 18: Positive performances (reported in 86 out of 107 studies)

                                        Type of fault modelled
Performance level              Cont & Cat  Continuous only  Categorical only  Total     %
90% or more                         3             8                 2           13     15%
80-89%                              3             6                 2           11     13%
70-79%                              2             2                 5            9     10%
50-69% (incl. where a
concentration of faults
can be identified)                 12             9                 9           30     35%
Variable positive 50-100%           5             6                12           23     27%
Total                              25            31                30           86    100%
Quality Assessment
• Does the study report whether results are significant?
• Does the study recognise/address the imbalanced dataset problem?
• How has the performance of the model been assessed?
• How has the model been validated?
Significance test results
• Of the 111 studies, 47% conducted significance tests
• Some selected metrics based on their significance
• 53% of studies did not report such a test
• There is no certainty that model outputs are statistically significant in over half the papers in this study
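As a hedged sketch of the kind of test the 47% report: assuming scipy, a Mann-Whitney U test can ask whether a candidate metric (module size here) differs significantly between faulty and fault-free modules. The data below is synthetic.

```python
# Sketch of a significance test on a candidate predictor (synthetic data).
# Mann-Whitney U: does module size differ between faulty and clean modules?
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
size_faulty = rng.lognormal(mean=5.2, sigma=0.5, size=60)   # LOC, faulty modules
size_clean = rng.lognormal(mean=5.0, sigma=0.5, size=240)   # LOC, clean modules

stat, p_value = mannwhitneyu(size_faulty, size_clean)
print(f"p = {p_value:.4f}")  # p < 0.05 would support size as a predictor
```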
Few studies balance data
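The sketch below shows the simplest possible balancing step, random oversampling of the minority (faulty) class before training. It is illustrative only; studies that do balance their data may instead use undersampling or synthetic techniques such as SMOTE.

```python
# Sketch: random oversampling of the minority (faulty) class so the
# training set is roughly 50/50. Records here are placeholders.
import random

random.seed(0)
faulty = [("module", 1)] * 20   # minority class: fault-prone units
clean = [("module", 0)] * 180   # majority class: not fault-prone units

oversampled = faulty + random.choices(faulty, k=len(clean) - len(faulty))
balanced = clean + oversampled  # 180 clean + 180 faulty
random.shuffle(balanced)
```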
Performance measurements – measure constructs and definitions
Confusion matrix constructs: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Type I error, Type II error
Confusion matrix composite measures: Recall, Precision, Specificity, Accuracy, Sensitivity, Balance, F-measure
Both are based on the concept of code being faulty or not faulty.
Other measures: descriptive statistics (e.g. frequencies, means, ratios); regression coefficients; significance tests
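The composite measures above all derive from the four confusion-matrix counts; a short sketch with invented counts:

```python
# Composite measures from confusion-matrix counts (counts invented).
tp, fp, tn, fn = 30, 10, 150, 10

recall = tp / (tp + fn)                      # a.k.a. sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
f_measure = 2 * precision * recall / (precision + recall)
```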
Performance indicator frequency
Performance indicator                                Total obs.    %
Confusion matrix and related composite measures          41        23
Confusion matrix constructs                              19        11
Descriptive statistics                                   34        19
Error rates                                              25        14
Best R / R²                                              20        11
Correlation                                              14         8
Compare means / significance tests                       10         6
None                                                      6         3
Other                                                     6         3
Generalisability of the models – Validation
Type of validation performed                                                            #     %
Cross validation (e.g. 10-fold, actual vs predicted, data splitting within study)      54    29%
Across systems and organisations                                                       33    18%
Across methods/techniques                                                              23    12%
Across releases, builds or versions (temporal)                                         22    12%
Across projects                                                                        16     8%
Across project components                                                               8     4%
Baseline benchmarking, thresholds                                                       6     3%
Across studies (replication)                                                            5     3%
Across processes (e.g. developers or development groups, documents, parts of lifecycle) 5    3%
Across processed/synthetic/fault-injected data                                          4     2%
Across programs (incl. plug-ins)                                                        4     2%
Other (walkthrough, semantic validation, preliminary and specific evaluation, Schneidewind) 4 2%
Manual intervention/inspection                                                          3     2%
Across experiments                                                                      2     1%
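A minimal sketch of the most common entry in the table, k-fold cross validation, is given below; it assumes scikit-learn and synthetic data, and individual studies of course differ in folds, stratification and scoring.

```python
# Sketch: 10-fold cross validation of a fault-proneness classifier.
# scikit-learn and synthetic data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X.sum(axis=1) + rng.normal(size=300) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="recall")
print(scores.mean())  # mean recall over the 10 held-out folds
```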
Features identified as helpful to fault prediction…
Table 12: Features identified as helpful to model development

Features/predictors that enhance model performance                          Total
Process metrics (external to actual code)                                     20
Size metrics (e.g. larger files correlate to higher fault density)            15
Distribution of faults (Pareto effect)                                        14
LOC metrics                                                                   10
OO metrics (response for class, coupling between object classes,
  weighted methods per class)                                                  8
Structural/architectural complexity metrics                                    8
Fault persistence (temporal, e.g. over releases)                               7
Historic fault data (e.g. # of faults, type of fault)                          7
Fault severity                                                                 6
Age of file (where new creations are correlated to higher fault density)       3
Final Remarks…
To allow cross-comparison of models we need:
• A standard format to report model performance
• To report false positives and false negatives
• To understand that reporting one measure can be misleading – there is a trade-off between accuracy, precision and recall (composite measures)
• Access to reliable, quality data (next talk…)
LERO© 200622
THE IRISH SOFTWARE ENGINEERING RESEARCH CENTRELERO© 200623
Thank you
REFERENCES

Studies used for classifications/themes: (MacDonell and Shepperd 2007) predictor variables and model validation; (Fenton and Neil 1999) modelling techniques; (Runeson et al. 2001) classification of fault-prone components.

Bezerra, M. E. R., A. L. I. Oliveira and S. R. L. Meira (2007). A constructive RBF neural network for estimating the probability of defects in software modules. IJCNN 2007, International Joint Conference on Neural Networks, 2869-2874.
Di Fatta, G., S. Leue and E. Stegantova (2006). Discriminative pattern mining in software fault detection. Proceedings of the 3rd International Workshop on Software Quality Assurance, Portland, Oregon, ACM, 62-69.
Fenton, N. E. and M. Neil (1999). "A critique of software defect prediction models." IEEE Transactions on Software Engineering 25(5): 675-689.
Koru, A. G. and H. Liu (2005a). "Building defect prediction models in practice." IEEE Software 22(6): 23-29.
Koru, A. G. and H. Liu (2005b). An investigation of the effect of module size on defect prediction using static measures. Proceedings of the 2005 Workshop on Predictor Models in Software Engineering, St. Louis, Missouri, ACM, 1-5.
MacDonell, S. G. and M. J. Shepperd (2007). Comparing local and global software effort estimation models – reflections on a systematic review. ESEM 2007, First International Symposium on Empirical Software Engineering and Measurement, 401-409.
Oral, A. D. and A. B. Bener (2007). Defect prediction for embedded software. ISCIS 2007, 22nd International Symposium on Computer and Information Sciences, 1-6.
Runeson, P. and A. Andrews (2003). Detection or isolation of defects? An experimental comparison of unit testing and code inspection. ISSRE 2003, 14th International Symposium on Software Reliability Engineering, 3-13.
Runeson, P., M. C. Ohlsson and C. Wohlin (2001). A classification scheme for studies on fault-prone components. PROFES 2001, 341-355.
Tomaszewski, P., H. Grahn and L. Lundberg (2006). A method for an accurate early prediction of faults in modified classes. ICSM '06, 22nd IEEE International Conference on Software Maintenance, 487-496.