PROMISE 2011: "An Iterative Semi-supervised Approach to Software Fault Prediction"

DESCRIPTION

Promise 2011:"An Iterative Semi-supervised Approach to Software Fault Prediction"Huihua Lu, Bojan Cukic and Mark Culp.

TRANSCRIPT

Page 1:

An Iterative Semi-supervised Approach to Software Fault Prediction

Huihua Lu, Bojan Cukic, Mark Culp

Lane Department of Computer Science and Electrical Engineering / Department of Statistics

West Virginia University, Morgantown, WV

September 2011

Page 2:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 3:

Introduction

• Software Quality Assurance
  – Identify where faults hide, subject to V&V
  – Without automation, costly and time-consuming

• Software Fault Prediction
  – Software metrics: code metrics, complexity metrics, etc.
  – Software fault prediction models identify faulty modules

• Supervised learning algorithms are the norm

• Practical Problem
  – For one-of-a-kind or new systems, ground truth data may be sparse

• The Goal of Our Study
  – Evaluate the performance of semi-supervised learning approaches

Page 4:

Goal of the Study

Can we match the performance of supervised learning fault prediction models from a smaller set of labeled modules?

Consequence: if very few modules are labeled (a very real scenario), include unlabeled modules in training. Most published studies use 50% or more of the software modules for model training, which is not practical for new projects.

Page 5:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 6:

Semi-Supervised Learning-1

• Supervised Learning
  – Train a model from labeled (training) data only
  – Labeled data could be expensive to create
    • Modules receive labels through detailed V&V

• Semi-Supervised Learning
  – Train a model from both the labeled data and the unlabeled data
    • Include new modules as they become available in a version control system
  – Unlabeled data are the modules with unknown fault content

Page 7:

Semi-Supervised Learning-2

• Traditional Semi-supervised Learning algorithms
  – Co-training
    • Assumption: features can be separated into two sets
  – Generative Learning (EM algorithm)
    • Assumption: requires knowledge of the distribution of the data
  – Self-training
    • Assumption: none

Page 8:

Related Work

• In software fault prediction
  – Khoshgoftaar: Inductive semi-supervised learning
    • Data from one project separated into labeled and unlabeled sets; performance is evaluated on a different project
    • Achieved better performance than a tree-based supervised algorithm (C4.5)
  – Khoshgoftaar: Clustering-based semi-supervised learning
    • Extends unsupervised learning into semi-supervised learning
    • Better partitioning than unsupervised learning
    • Assumes that human domain experts participate in classifying modules into fault-prone and not fault-prone
  – Many supervised learning modeling approaches

Page 9:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 10:

Methodology-1

• Fitting the Fits (FTF) semi-supervised algorithm
  – A variant of self-training [3]
  – Idea: reduce the semi-supervised problem to some form of a supervised problem
  – The algorithm (a code sketch follows the algorithm below):

    Initialize:  Ŷ^0_L = Y_L ,  Ŷ^0_U = φ(D(X_U); D(X_L), Y_L)      [initialize the labels for U]
    Repeat the following:
      (1) Ŷ^k_L = Y_L                                                [reset the labels for L]
      (2) Ŷ^(k+1) = φ(D(X); D(X), Ŷ^k)                               [fit the labels for U + L]
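For concreteness, here is a minimal Python sketch of the loop above. It assumes scikit-learn's RandomForestClassifier stands in for the base learner φ (Random Forests is the base learner named on the next slide), but everything else (the function name ftf, the 100-tree forest, and thresholding the current estimates at 0.5 so a classifier can be refit on hard labels) is an illustrative assumption rather than the authors' implementation:

    # Illustrative sketch of the FTF self-training loop above; not the authors' code.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def ftf(X_L, y_L, X_U, n_iter=50):
        # Base learner phi: a Random Forest, as suggested on the next slide
        phi = RandomForestClassifier(n_estimators=100, random_state=0)
        # Initialize: fit phi on the labeled data L and produce Y^0_U for U
        phi.fit(X_L, y_L)
        y_U_prob = phi.predict_proba(X_U)[:, 1]
        X_all = np.vstack([X_L, X_U])

        for _ in range(n_iter):
            # (1) Reset the labels for L to their known values Y_L; the current
            #     estimates for U are thresholded at 0.5 (a simplification so the
            #     same classifier can be refit on hard labels)
            y_all = np.concatenate([y_L, (y_U_prob >= 0.5).astype(int)])
            # (2) Fit the labels for U + L: refit phi on all modules and update the estimates
            phi.fit(X_all, y_all)
            y_U_prob = phi.predict_proba(X_all)[:, 1][len(y_L):]
        return phi, y_U_prob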

Page 11:

Methodology-2

• The base learner φ:
  – Initializes the labels for the unlabeled data, i.e., Ŷ^0_U
  – "Improves" the labels of the unlabeled data in iterations, i.e., produces Ŷ^(k+1) from Ŷ^k
  – May lead to global convergence
• Random Forests
  – A good choice in this domain based on previous work
  – Robust to noise

Page 12:

Software Data Sets

• These are large NASA MDP projects (> 1,000 modules)

Page 13:

Performance Measures

• Labels in the binary classification problem:
  – 1: fault prone module
  – 0: not fault prone module
  – For each module, estimate the probability Pr(Y = 1); a module is classified as fault prone when Pr(Y = 1) exceeds a threshold c ∈ {0.1, 0.5, 0.75}

• Area under the ROC curve (AUC) and Probability of Detection (PD) used for performance comparison
  – PD: the fraction of the fault prone modules in the unlabeled set U that are correctly predicted as fault prone
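As a concrete illustration of the two measures, here is a minimal sketch assuming scikit-learn's roc_auc_score; the helper name auc_and_pd and the array names y_true and y_prob are illustrative, not from the slides:

    # Illustrative computation of AUC and PD over the unlabeled set U.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_and_pd(y_true, y_prob, c=0.5):
        y_true = np.asarray(y_true)
        y_prob = np.asarray(y_prob)
        auc = roc_auc_score(y_true, y_prob)                # area under the ROC curve
        y_pred = (y_prob > c).astype(int)                  # fault prone if Pr(Y = 1) > c
        detected = np.sum((y_pred == 1) & (y_true == 1))   # fault prone modules correctly flagged
        pd = detected / np.sum(y_true == 1)                # probability of detection (recall)
        return auc, pd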

Page 14:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 15:

Experiments

• FTF with Random Forests as the base learner vs. a supervised Random Forest

• Does FTF outperform supervised learning with the same set of labeled modules? (a sketch of this protocol follows this slide)
  – Size of the labeled data: 2%, 5%, 10%, 25%, 50%
  – Stop the FTF algorithm after 50 iterations

• Is the behavior and performance of FTF consistent over different software projects?
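A minimal sketch of this protocol, reusing the illustrative ftf() and auc_and_pd() helpers sketched earlier; only the labeled fractions (2% to 50%) and the 50-iteration cap come from the slides, while the function name run_experiment, the random split, and the 100-tree supervised baseline are assumptions:

    # Illustrative protocol: compare FTF against a supervised Random Forest
    # trained on the same labeled subset; assumes the labeled sample contains
    # both classes (fault prone and not fault prone).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def run_experiment(X, y, fraction, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        labeled = rng.random(len(y)) < fraction            # e.g. 0.02, 0.05, 0.10, 0.25, 0.50
        X_L, y_L = X[labeled], y[labeled]
        X_U, y_U = X[~labeled], y[~labeled]

        # Semi-supervised: FTF with Random Forests as the base learner
        _, y_prob_ftf = ftf(X_L, y_L, X_U, n_iter=n_iter)

        # Supervised baseline: Random Forest on the labeled subset only
        rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_L, y_L)
        y_prob_rf = rf.predict_proba(X_U)[:, 1]

        return auc_and_pd(y_U, y_prob_ftf), auc_and_pd(y_U, y_prob_rf)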

Page 16:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 17:

Results: PC3

Page 18:

Results at threshold 0.5

Page 19:

At threshold 0.1

Page 20:

Overall Comparison

Page 21:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 22:

Summary

• Does FTF with Random Forests as the base learner outperform supervised learning with a Random Forest?

– Yes, in most cases
– The improvement is modest and not statistically significant

• How small can the size of the labeled data set be for the FTF to start outperforming supervised learning?

– When 5% or more of the modules are labeled, the semi-supervised approach seems a promising direction

– Performance improves in comparison to supervised learning trained on the same labeled modules

• Is the behavior and performance of FTF consistent over different data sets?

– Yes

Page 23:

Future Work

• Try out different base learners with FTF
  – The base learner in FTF has dramatic effects; RF was used because it performs well in software fault modeling
  – RF does not converge; other base learners might
  – Analyze robustness to noise

• Expand to projects of different sizes or from different domains

• Introduce more sophisticated semi-supervised algorithms

Page 24:

Questions

• Please direct questions to

Bojan Cukic: [email protected]

Huihua Lu: [email protected]