PROMISE 2011: "An Iterative Semi-supervised Approach to Software Fault Prediction"

DESCRIPTION

Promise 2011:"An Iterative Semi-supervised Approach to Software Fault Prediction"Huihua Lu, Bojan Cukic and Mark Culp.

TRANSCRIPT

Page 1:

An Iterative Semi-supervised Approach to Software Fault Prediction

Huihua Lu, Bojan Cukic, Mark Culp

Lane Department of Computer Science and Electrical Engineering / Department of Statistics

West Virginia University, Morgantown, WV

September 2011

Page 2:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 3:

Introduction

• Software Quality Assurance
  – Identify where faults hide, subject to V&V
  – Without automation, costly and time-consuming

• Software Fault Prediction
  – Software metrics: code metrics, complexity metrics, etc.
  – Software fault prediction models identify faulty modules

• Supervised learning algorithms are the norm

• Practical Problem
  – For one-of-a-kind or new systems, ground truth data may be sparse

• The Goal of Our Study
  – Evaluate the performance of semi-supervised learning approaches

Page 4:

Goal of the Study

Can we match the performance of supervised learning fault prediction models from a smaller set of labeled modules?

Consequence: if very few modules are labeled (a very real scenario), include unlabeled modules in training. Most published studies use 50% or more of the software modules for model training, which is not practical for new projects.

Page 5:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 6:

Semi-Supervised Learning-1

• Supervised Learning
  – Train a model from labeled (training) data only
  – Labeled data could be expensive to create
    • Modules receive labels through detailed V&V

• Semi-Supervised Learning
  – Train a model from both the labeled data and the unlabeled data
    • Include new modules as they become available in a version control system
  – Unlabeled data are the modules with unknown fault content

Page 7:

Semi-Supervised Learning-2

• Traditional Semi-supervised Learning algorithms
  – Co-training
    • Assumption: features can be separated into two sets
  – Generative Learning (EM algorithm)
    • Assumption: requires knowledge of the distribution of the data
  – Self-training
    • Assumption: none

Page 8:

Related Work

• In software fault prediction
  – Khoshgoftaar: Inductive semi-supervised learning
    • Data from one project separated into labeled and unlabeled sets; performance is evaluated on a different project
    • Achieved better performance than a tree-based supervised algorithm (C4.5)
  – Khoshgoftaar: Clustering-based semi-supervised learning
    • Extends unsupervised learning into semi-supervised learning
    • Better partitioning than unsupervised learning
    • Assumes that human domain experts participate in classifying modules into fault-prone and not fault-prone
  – Many supervised learning modeling approaches

Page 9:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 10:

Methodology-1

• Fitting the Fits (FTF) semi-supervised algorithm
  – A variant of self-training [3]
  – Idea: reduce the semi-supervised problem to some form of a supervised problem
  – The algorithm (a code sketch follows the algorithm below):

    Initialize:  Ŷ^0_L = Y_L ,  Ŷ^0_U = φ(D(X_U); D(X_L), Y_L)      [initialize the labels for U]
    Repeat the following:
      (1) Ŷ^k_L = Y_L                                                [reset the labels for L]
      (2) Ŷ^(k+1) = φ(D(X); D(X), Ŷ^k)                               [fit the labels for U + L]
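For concreteness, here is a minimal Python sketch of the loop above. It assumes scikit-learn's RandomForestClassifier stands in for the base learner φ (Random Forests is the base learner named on the next slide), but everything else (the function name ftf, the 100-tree forest, and thresholding the current estimates at 0.5 so a classifier can be refit on hard labels) is an illustrative assumption rather than the authors' implementation:

    # Illustrative sketch of the FTF self-training loop above; not the authors' code.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def ftf(X_L, y_L, X_U, n_iter=50):
        # Base learner phi: a Random Forest, as suggested on the next slide
        phi = RandomForestClassifier(n_estimators=100, random_state=0)
        # Initialize: fit phi on the labeled data L and produce Y^0_U for U
        phi.fit(X_L, y_L)
        y_U_prob = phi.predict_proba(X_U)[:, 1]
        X_all = np.vstack([X_L, X_U])

        for _ in range(n_iter):
            # (1) Reset the labels for L to their known values Y_L; the current
            #     estimates for U are thresholded at 0.5 (a simplification so the
            #     same classifier can be refit on hard labels)
            y_all = np.concatenate([y_L, (y_U_prob >= 0.5).astype(int)])
            # (2) Fit the labels for U + L: refit phi on all modules and update the estimates
            phi.fit(X_all, y_all)
            y_U_prob = phi.predict_proba(X_all)[:, 1][len(y_L):]
        return phi, y_U_prob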

Page 11:

Methodology-2

• The base learner φ:
  – Initializes the labels for the unlabeled data, i.e., Ŷ^0_U
  – "Improves" the labels of the unlabeled data in iterations, i.e., produces Ŷ^(k+1) from Ŷ^k
  – May lead to global convergence
• Random Forests
  – A good choice in this domain based on previous work
  – Robust to noise

Page 12:

Software Data Sets

• These are large NASA MDP projects (> 1,000 modules)

Page 13:

Performance Measures

• Labels in the binary classification problem:
  – 1: fault prone module
  – 0: not fault prone module
  – For each module, estimate the probability Pr(Y = 1); a module is classified as fault prone when Pr(Y = 1) exceeds a threshold c ∈ {0.1, 0.5, 0.75}

• Area under the ROC curve (AUC) and Probability of Detection (PD) used for performance comparison
  – PD: the fraction of the fault prone modules in the unlabeled set U that are correctly predicted as fault prone
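As a concrete illustration of the two measures, here is a minimal sketch assuming scikit-learn's roc_auc_score; the helper name auc_and_pd and the array names y_true and y_prob are illustrative, not from the slides:

    # Illustrative computation of AUC and PD over the unlabeled set U.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_and_pd(y_true, y_prob, c=0.5):
        y_true = np.asarray(y_true)
        y_prob = np.asarray(y_prob)
        auc = roc_auc_score(y_true, y_prob)                # area under the ROC curve
        y_pred = (y_prob > c).astype(int)                  # fault prone if Pr(Y = 1) > c
        detected = np.sum((y_pred == 1) & (y_true == 1))   # fault prone modules correctly flagged
        pd = detected / np.sum(y_true == 1)                # probability of detection (recall)
        return auc, pd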

Page 14:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 15:

Experiments

• FTF with Random Forests as the base learner vs. a supervised Random Forest

• Does FTF outperform supervised learning with the same set of labeled modules? (a sketch of this protocol follows this slide)
  – Size of the labeled data: 2%, 5%, 10%, 25%, 50%
  – Stop the FTF algorithm after 50 iterations

• Is the behavior and performance of FTF consistent over different software projects?
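A minimal sketch of this protocol, reusing the illustrative ftf() and auc_and_pd() helpers sketched earlier; only the labeled fractions (2% to 50%) and the 50-iteration cap come from the slides, while the function name run_experiment, the random split, and the 100-tree supervised baseline are assumptions:

    # Illustrative protocol: compare FTF against a supervised Random Forest
    # trained on the same labeled subset; assumes the labeled sample contains
    # both classes (fault prone and not fault prone).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def run_experiment(X, y, fraction, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        labeled = rng.random(len(y)) < fraction            # e.g. 0.02, 0.05, 0.10, 0.25, 0.50
        X_L, y_L = X[labeled], y[labeled]
        X_U, y_U = X[~labeled], y[~labeled]

        # Semi-supervised: FTF with Random Forests as the base learner
        _, y_prob_ftf = ftf(X_L, y_L, X_U, n_iter=n_iter)

        # Supervised baseline: Random Forest on the labeled subset only
        rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_L, y_L)
        y_prob_rf = rf.predict_proba(X_U)[:, 1]

        return auc_and_pd(y_U, y_prob_ftf), auc_and_pd(y_U, y_prob_rf)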

Page 16:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 17:

Results: PC3

Page 18:

Results at threshold 0.5

Page 19:

At threshold 0.1

Page 20:

Overall Comparison

Page 21:

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Page 22:

Summary

• Does FTF with Random Forests as the base learner outperform supervised learning with a Random Forest?

– Yes, in most cases
– The improvement is modest and not statistically significant

• How small can the size of the labeled data set be for the FTF to start outperforming supervised learning?

– When 5% or more of the modules are labeled, the semi-supervised approach seems a promising direction

– Performance improves in comparison to supervised learning trained on the same labeled modules

• Is the behavior and performance of FTF consistent over different data sets?

– Yes

Page 23:

Future Work

• Try out different base learners with FTF
  – The base learner in FTF has dramatic effects; RF was used because it performs well in software fault modeling
  – RF does not converge; other base learners might
  – Analyze robustness to noise

• Expand to projects of different sizes or from different domains

• Introduce more sophisticated semi-supervised algorithms

Page 24:

Questions

• Please direct questions to

Bojan Cukic: [email protected]

Huihua Lu: [email protected]