
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Erika Camargo and Ochimizu Koichiro

Japan Institute of Science and Technology

ESEM 2009

Contents

1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work

Abstract

Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.

First attempt at a solution: simple log data transformations

[Figure: P(y = 1), the probability of a fault-prone class, plotted against X, a design-complexity metric]
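The relation between P(y = 1) and the metric X has the standard logistic form in a univariate LR model. A minimal sketch follows, assuming hypothetical coefficients `B0` and `B1` that are not taken from the talk and serve only to show the shape of the curve.

```python
import math


def fault_probability(x, b0, b1):
    """P(y = 1 | x) for a univariate logistic regression model."""
    z = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (assumption, not values from the study).
B0, B1 = -3.0, 0.25

for metric_value in (0, 12, 24):
    # The probability rises monotonically with the metric when B1 > 0.
    print(metric_value, round(fault_probability(metric_value, B0, B1), 3))
```

With a positive slope, a class with a larger metric value is always rated as more likely to be fault-prone, which is why the scale of the metric matters when the model is reused on another project.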


Background

• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models

• Among these metrics are the Chidamber & Kemerer (CK) metrics

– The 80th and 20th percentiles of their distributions can be used to determine high and low values

– Their thresholds cannot be determined before their use and should be derived and used locally
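Deriving the high and low cut-offs locally can be sketched as follows. This is an illustration only: the function name `local_thresholds` and both sample data sets are hypothetical, and the inclusive percentile method is an assumption, since the talk does not say how the percentiles were computed.

```python
import statistics


def local_thresholds(values):
    """Return (low, high) cut-offs at the 20th and 80th percentiles."""
    # quantiles(n=10) returns the 9 decile cut points; index 1 is the
    # 20th percentile and index 7 is the 80th percentile.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return deciles[1], deciles[7]


# Hypothetical WMC values for two projects (not data from the study).
small_project_wmc = [1, 2, 2, 3, 4, 4, 5, 6, 7, 8, 9]
large_project_wmc = [3, 5, 8, 10, 12, 15, 20, 25, 30, 40, 60]

print("small:", local_thresholds(small_project_wmc))
print("large:", local_thresholds(large_project_wmc))
```

The same metric value can fall above the "high" cut-off in one project and below the "low" cut-off in another, which is the reason the thresholds are derived per project.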


Problem Analysis

Can an LR model built with this kind of metric work efficiently across different software projects?

[Figure: P(y = 1) plotted against X = number of methods, with the least faulty and most faulty ranges marked for a small software project and a large software project; axis marks at 10 and 20]

Case Study

1. Data analysis of 7 different projects and application of simple log data transformations.

2. Construction of 3 univariate LR models using a large open source project (first release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC (one per model)
– Dependent variable: defects (from Bugzilla and CVS)

3. Testing of these models on 2 other, smaller projects (with 11 and 13 Java classes).
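Step 2 above, building one univariate LR model per metric, can be sketched in plain Python. This is not the authors' tooling, which the slides do not name: the fitting routine is a generic Newton-Raphson maximum-likelihood fit, and `CBO_VALUES` and `FAULTY` are hypothetical sample data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_univariate_lr(xs, ys, iterations=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood (g) and negated Hessian (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        # Solve the 2x2 system h * step = g and apply the step.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical training data: one metric value and one 0/1 fault label per class.
CBO_VALUES = [1, 2, 3, 4, 5, 6, 7, 8]
FAULTY = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_univariate_lr(CBO_VALUES, FAULTY)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
```

Testing on another project (step 3) then amounts to evaluating `sigmoid(b0 + b1 * x)` on that project's metric values and comparing the result with a cut-off, which is where differences in metric distributions between projects start to matter.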


Challenge

[...] produce biased regression estimates and reduce the predictive power of regression models

BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphic Modeling Framework **
MYL: Mylyn system **

(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.

(**) Eclipse project



The RFC data of BNS is more spread out than the RFC data of MYL.

Case Study

Solution: a simple data transformation using log10.

Example:

• The number of outliers is lower
• The data spread is more uniform

LCBO = log10(CBO + 1)
LTCBO = log10(CBO + 1) + dm

where dm is the difference between the CBO medians of the Mylyn system and the system whose data is being transformed.
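The two transformations can be sketched as follows. The helper `median_shift` is hypothetical: it assumes that dm is the Mylyn median minus the target system's median, both taken on the log10 scale, which the slide does not state explicitly. The sample values are invented.

```python
import math
import statistics


def log_transform(values):
    """LCBO = log10(CBO + 1), applied to each class."""
    return [math.log10(v + 1) for v in values]


def shifted_log_transform(values, dm):
    """LTCBO = log10(CBO + 1) + dm, applied to each class."""
    return [math.log10(v + 1) + dm for v in values]


def median_shift(reference_values, target_values):
    # Assumption: dm = median(reference) - median(target), with both
    # medians computed after the log10 transformation.
    return (statistics.median(log_transform(reference_values))
            - statistics.median(log_transform(target_values)))


# Hypothetical CBO values (not data from the study).
mylyn_cbo = [0, 1, 2, 4, 6, 9, 15, 30]
target_cbo = [1, 2, 2, 3, 5]

dm = median_shift(mylyn_cbo, target_cbo)
print("dm:", round(dm, 3))
print("LTCBO:", [round(v, 3) for v in shifted_log_transform(target_cbo, dm)])
```

Under this reading, the shift aligns the median of the transformed target data with the median of the transformed Mylyn data, so the model sees values centred where its training data was centred.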

Results

Effects of the log data transformations:

• Elimination of a great number of outliers

• Overall goodness of fit of the 3 models is better

• Discrimination (most faulty / least faulty)

– All models discriminate well between the most faulty and least faulty classes of the Mylyn system

– What about using different projects?
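The effect on outliers can be illustrated with a small sketch. It assumes the common boxplot rule (points beyond 1.5 times the interquartile range from the quartiles), which the slides do not confirm as the definition used in the study, and the metric values are hypothetical.

```python
import math
import statistics


def count_outliers(values):
    """Count points outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high)


# Hypothetical right-skewed metric values (not data from the study).
raw = [2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 45]
logged = [math.log10(v + 1) for v in raw]

print("outliers in raw data:", count_outliers(raw))
print("outliers after log10:", count_outliers(logged))
```

Because log10 compresses large values more than small ones, a long right tail is pulled toward the bulk of the data, so fewer points fall outside the whiskers.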


Results

BANKING SYSTEM (MF: Most Faulty, LF: Least Faulty)

Group              Model  Correct classification  Correct classification  Effect
                          (raw data)              (log-transformed data)
MF (6 classes)     CBO    2                       5                       improved
MF (6 classes)     RFC    5                       5                       =
MF (6 classes)     WMC    6                       6                       =
LF (5 classes)     CBO    5                       5                       =
LF (5 classes)     RFC    3                       3                       =
LF (5 classes)     WMC    4                       4                       =
BOTH (11 classes)  CBO    7                       10                      improved
BOTH (11 classes)  RFC    8                       8                       =
BOTH (11 classes)  WMC    10                      10                      =

Results

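E-COMMERCE SYSTEM (MF: Most Faulty, LF: Least Faulty)

Group              Model  Correct classification  Correct classification  Effect
                          (raw data)              (log-transformed data)
MF (9 classes)     CBO    3                       7                       improved
MF (9 classes)     RFC    9                       8                       worse
MF (9 classes)     WMC    7                       6                       worse
LF (4 classes)     CBO    4                       4                       =
LF (4 classes)     RFC    0                       3                       improved
LF (4 classes)     WMC    0                       4                       improved
BOTH (13 classes)  CBO    7                       11                      improved
BOTH (13 classes)  RFC    9                       11                      improved
BOTH (13 classes)  WMC    7                       10                      improved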
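The "BOTH" rows of the banking and e-commerce tables can be turned into classification rates for easier comparison. The counts below are copied from those rows; the dictionary layout and the `accuracy` helper are illustrative names only.

```python
def accuracy(correct, total):
    """Fraction of classes classified correctly."""
    return correct / total


# Total number of classes per test system.
TOTALS = {"Banking system": 11, "E-commerce system": 13}

# (correct with raw data, correct with log-transformed data) per model.
BOTH = {
    "Banking system": {"CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)},
    "E-commerce system": {"CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)},
}

for system, models in BOTH.items():
    total = TOTALS[system]
    for model, (raw_correct, log_correct) in models.items():
        print(f"{system:18} {model}: "
              f"raw {accuracy(raw_correct, total):.0%}, "
              f"log {accuracy(log_correct, total):.0%}")
```

Read this way, the overall rate never drops after the transformation in either system, although two of the per-group counts for the e-commerce system (RFC and WMC on the most faulty classes) go down by one class.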

Conclusions and Future work

• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects

• Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model

• Future work: further data exploration and study of data transformations


Thank you!

Questions, comments …

contact: [email protected]
