Applied Soft Computing 21 (2014) 286–297


Comparative analysis of statistical and machine learning methods for predicting faulty modules

Ruchika Malhotra
Department of Software Engineering, Delhi Technological University, Bawana Road, Delhi 110042, India

Article info

Article history:
Received 25 October 2011
Received in revised form 26 January 2014
Accepted 22 March 2014
Available online 31 March 2014

Keywords:
Software quality
Static code metrics
Logistic regression
Machine learning
Receiver Operating Characteristic (ROC) curve

Abstract

The demand for development of good quality software has seen rapid growth in the last few years. This is leading to an increase in the use of machine learning methods for analyzing and assessing public domain data sets. These methods can be used in developing models for estimating software quality attributes such as fault proneness, maintenance effort and testing effort. Software fault prediction in the early phases of software development can help and guide software practitioners to focus the available testing resources on the weaker areas during software development. This paper analyses and compares one statistical method and six machine learning methods for fault prediction. These methods (Decision Tree, Artificial Neural Network, Cascade Correlation Network, Support Vector Machine, Group Method of Data Handling Method, and Gene Expression Programming) are empirically validated to find the relationship between the static code metrics and the fault proneness of a module. In order to assess and compare the models predicted using the regression and the machine learning methods, we used two publicly available data sets, AR1 and AR6. We compared the predictive capability of the models using the Area Under the Curve (measured from Receiver Operating Characteristic (ROC) analysis). The study confirms the predictive capability of the machine learning methods for software fault prediction. The results show that the Area Under the Curve of the model predicted using the Decision Tree method is 0.8 and 0.9 (for the AR1 and AR6 data sets, respectively), making it a better model than those predicted using logistic regression and the other machine learning methods.

© 2014 Published by Elsevier B.V.

1. Introduction

As the size and complexity of software increase day by day, it is usual to produce software with faults. The identification of faults in a timely manner is essential, as the cost of correcting these faults increases exponentially in the later phases of the software development life cycle. Testing is a very expensive process, and one-third to one-half of the overall cost is allocated to software testing activities [47]. Several approaches have been proposed to detect faults in the early phases of software development [47]. This motivates the development of fault prediction models that can be used to classify a module as fault prone or not fault prone.

Several static code metrics have been proposed in the past to capture various aspects of the design and source code, for example [17,22,35]. These metrics can be used in developing fault prediction models.


The prediction models can then be used by software organizations during the early phases of software development to identify faulty modules. Software organizations can use this subset of metrics from amongst the available large set of software metrics. The metrics are computed and fed into the proposed model to obtain information about the quality of the software, and they can be assessed in the early stages of software development. These quality models will allow researchers and software practitioners to focus the available testing resources on the faulty areas of the software, which will help in producing improved quality, low cost and maintainable software. The static code metrics have been advocated as widely used measures in the literature [9,36,37].

Machine learning (ML) methods have been successfully applied over the years to a range of problem domains such as medicine, engineering, physics, finance and geology. These methods have also started being used in solving classification and control problems [1,11–13,34,56]. In the existing literature, researchers have used various methods to establish the relationship between the static code metrics and fault prediction. These methods include traditional statistical methods such as logistic regression (LR) [8,26,32,39] and machine learning (ML) methods such as decision trees [16,25,27,29,30,33,42,46,49,52], Naïve

Bayes [6,9,10,38,41,49], Support Vector Machines [52,54], and Artificial Neural Networks [28,33,38]. However, few studies compare the LR models with the ML models for software fault prediction using the static code metrics. Hall et al. [18] concluded that more quality studies should be conducted for software fault prediction using ML methods, and Menzies et al. [36] advocated the use of public data sets for software fault prediction. It is natural for software practitioners and potential users to wonder, "Which ML method is best?", or more realistically, "Is the predictive capability of the ML methods comparable to or better than the traditional LR methods?". For this reason, this paper compares the model predicted using the LR method with the models predicted using widely used ML methods and rarely used ML methods (Cascade Correlation Network (CCN), Group Method of Data Handling Polynomial Method (GMDH), and Gene Expression Programming (GEP)). The evidence obtained from data based empirical studies can give software practitioners and researchers the most powerful support for accepting or rejecting a given hypothesis [2]. Hence, conducting empirical studies to compare the models predicted using the LR and the ML methods is important to develop an adequate body of knowledge so that well formed and widely accepted theories can be produced.

The main motivation of the paper is threefold: (1) to show the performance of the ML methods such as Support Vector Machines (SVM), Decision Tree (DT), CCN, GMDH and GEP for software fault prediction; (2) to compare and assess the predictive performance of the models predicted using the ML methods with the model predicted using the LR method; and (3) to evaluate the performance capability of the ML methods using public data sets, i.e. across two systems.

Thus, in this work we (1) build fault proneness models and (2) empirically compare the results of the LR and the ML methods. In this paper, we investigate the following issues:

1. How accurately and precisely do the static code metrics predict faulty modules?
2. Is the performance of the ML methods better than the LR method?

The validation of the models predicted using the LR and the ML methods is carried out using Receiver Operating Characteristic (ROC) analysis. We use the ROC curves to obtain the optimal cut off point that provides a balance between the faulty and non faulty modules. The performance of the models constructed to predict faulty or non faulty modules is evaluated using the Area Under the Curve (AUC) obtained from the ROC analysis [11]. In order to perform the analysis we validate the performance of these methods using the public domain AR1 and AR6 data sets. The AR1 and AR6 data sets consist of 121 and 101 modules, respectively. These data sets were developed using the C language [35]. The data are obtained from the Promise data repository [35] and collected by the Software Research Laboratory (Softlab), Bogazici University, Istanbul, Turkey [19]. Thus, the main contributions of the paper are: first, we performed a comparative analysis of the models predicted using the LR method with the models predicted using the ML methods for prediction of faulty modules. Second, we analyze public domain and industrial data sets, hence analysing valuable data in an important area. Third, we analyze six ML methods and apply ROC analysis to determine their effectiveness.

The paper is organized as follows: Section 2 presents the research background that summarizes the static code metrics included and describes the sources from which the data is collected. Section 3 describes the descriptive statistics and the performance measures used for model evaluation. The results of model prediction are presented in Section 4 and the models are validated


in Section 5. Section 6 summarizes the threats to validity of the models, and the conclusions of the work are given in Section 7.

2. Research background

In this section we present the dependent and independent variables used in this paper (Section 2.1). We also describe the data collection procedure in Section 2.2.

2.1. Dependent and independent variables

Fault proneness is the binary dependent variable in this work. Fault proneness is defined as the probability of fault detection in a module [2,4,7]. We use the LR method, which is based on probability prediction. The binary dependent variable is based on the faults that are found during the software development life cycle.

For this study, we predict fault prone modules from the static code metrics defined by Halstead [17] and McCabe [35]. The software metrics selected in this paper are procedural and module based metrics, where a module is defined as the smallest individual unit of functionality. We find the relationship of the static code metrics with fault proneness since they are "useful", "easy to use", and "widely used" metrics [36].

First, the static code metrics are known to be useful as they have shown a high probability of fault detection in the past [36]. The results in this study also show that the correctly predicted percentage of faulty modules was high. Second, metrics like lines of code and the metrics given by Halstead and McCabe [17,35] can be computed easily and at low cost, even for very large systems [30]. Third, many researchers have used the static code metrics in the literature [8,25–30,32,33,38–42,46,49,52–54]. It has been stated in [36] that "Verification and Validation textbooks advise using static code complexity metrics to decide which modules are worthy of manual inspections". Table 1 presents the static code metrics chosen in this study.

2.2. Empirical data collection

This study makes use of two public domain data sets, AR1 and AR6, available in the Promise data repository [45] and donated by the Software Research Laboratory (Softlab), Bogazici University, Istanbul, Turkey [23]. The data in AR1 and AR6 are collected from embedded software in a white-goods product. The data was collected and validated by the Prest Metrics Extraction and Analysis Tool [44], available at http://softlab.boun.edu.tr/?q=resources&i=tools. The AR1 and AR6 systems were implemented in the C programming language. Both data sets were collected in 2008 and donated by Softlab in 2009. The AR1 system consists of 121 modules (9 faulty/112 non faulty). The AR6 system consists of 101 modules (15 faulty/86 non faulty). Both data sets comprise 29 static code attributes (McCabe, Halstead and LOC measures) and one fault attribute (false/true). Table 2 summarizes the distribution of faulty modules in the AR1 and AR6 data sets. The table shows that 7.44% of modules were faulty in the AR1 data set and 14.85% of modules were faulty in the AR6 data set.

3. Research methodology

In this section, the steps taken to analyze the static code metrics are described.

3.1. Descriptive statistics and outlier analysis

Before further analysis can be carried out, the data set must be suitably reduced by analysing it and then drawing meaningful conclusions from it.


Table 1
Software metrics.

McCabe [35]
Cyclomatic Complexity
Design Complexity
Essential Complexity

Halstead [17]
Num Operands (N1)
Num Operators (N2)
Num Unique Operands (n1)
Num Unique Operators (n2)
Error Estimate
Length: N = N1 + N2
Volume: V = N log2(n1 + n2)
Level: L = V*/V, where V* = (2 + n2*) log2(2 + n2*)
Content: I = L̄ · V, where L̄ = (2/n1) · (n2/N2)
Difficulty: D = 1/L
Effort: E = V/L̄
Program time: T = E/β

Lines of code (LOC)
Blank
Code and Comments
Comments
Executable
Total

Miscellaneous
Branch Count
Call Pairs
Condition Count
Cyclomatic Density
Decision Count
Decision Density
Design Density
Edge Count
Essential Density
Maintenance Severity
Modified Condition Count
Multiple Condition Count
Node Count
Normalized Cyclomatic Complexity
Parameter Count
Percent Comments

Table 2
Data used in the study.

System                                              Language    Total LOC    Total modules    % Faulty
Embedded software from white-goods product (AR1)    C           2467         121              7.44
Embedded software from white-goods product (AR6)    C           2078         101              14.85

Descriptive statistics such as the minimum, maximum, mean and standard deviation can be effectively used as measures for comparing different experimental studies. An independent variable having low variance will not differentiate modules very well and hence is unlikely to be useful. Therefore, such variables are excluded from the analysis.

Outliers are defined as data points that are located away from the sample space. Outlier analysis is essential for identifying those data points that are over-influential and must be analyzed for exclusion from the data set. In this study we identified univariate and multivariate outliers. We used the Mahalanobis Jackknife distance to identify the multivariate outliers. The details on outlier analysis can be found in [3].

.2. Evaluating the performance of the models

The performance measures used to assess the quality of the predicted model in this work are described in this section.

3.2.1. Sensitivity and specificity
The correctness of the LR and the ML methods is analyzed using performance measures such as sensitivity and specificity.


The sensitivity of the model is defined as the percentage of the faulty modules correctly predicted, and the specificity of the model is defined as the percentage of the non faulty modules correctly predicted. In the ideal situation, both the sensitivity and the specificity should be high. The sensitivity and the specificity as determined by the ROC analysis (see Section 3.2.3 for details) are reported for each model predicted in this paper.

3.2.2. Precision
The precision is defined as the ratio of the number of modules correctly classified as faulty/non faulty to the total number of modules.

3.2.3. Receiver operating characteristic curve analysis
The performance of the models predicted in this work is analyzed using ROC analysis [21]. During the development of the ROC curves, many cutoff points between 0 and 1 are selected, and the sensitivity and specificity at each cut off point are calculated.

The Area Under the ROC Curve (AUC) is a combined measure of the sensitivity and specificity [14]. The AUC is used for comparing the predictive capability of the models predicted using the LR and the ML methods.
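To make these measures concrete, the following is a minimal sketch (not code from the paper) of how the sensitivity, the specificity, the precision of Section 3.2.2, the AUC and a balanced cut off point can be computed from a model's predicted fault probabilities. It assumes scikit-learn and NumPy are available; the names y_true, y_prob and evaluate are hypothetical.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate(y_true, y_prob):
    # ROC curve over many cut off points between 0 and 1
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    auc = roc_auc_score(y_true, y_prob)
    # choose the cut off point where sensitivity and specificity are balanced
    sens, spec = tpr, 1.0 - fpr
    cutoff = thresholds[np.argmin(np.abs(sens - spec))]
    y_pred = (y_prob >= cutoff).astype(int)
    sensitivity = (y_pred[y_true == 1] == 1).mean()  # % faulty modules correctly predicted
    specificity = (y_pred[y_true == 0] == 0).mean()  # % non faulty modules correctly predicted
    precision = (y_pred == y_true).mean()            # "precision" as defined in Section 3.2.2
    return sensitivity, specificity, precision, auc, cutoff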

3.2.4. Cross validation
The prediction accuracy of the models can be assessed by applying them to different data sets. Hence, we performed k-cross validation of the models developed [51]. In this method the data set is randomly divided into k subsets. Each time, k−1 subsets are used as the training set and one of the k subsets is used as the validation set. Finally, we get the probability of occurrence of faults for each of the modules. The results were validated with the value of k equal to 10.

4. Analysis results

In this section we describe the analyses performed to find the effect of the static code metrics on the fault proneness of the modules. We first employed the LR method [24], which is the most widely used statistical method, to construct the fault proneness models. We then employed six machine learning methods (ANN, DT, SVM, CCN, GMDH, and GEP) to predict the fault proneness of the modules. Some of these methods have rarely been applied in predicting fault prone/not fault prone modules.

In these methods we applied multivariate analysis. Multivariate regression analysis is used to determine the combined effect of the static code metrics on fault proneness [1].

During the multivariate analysis we did not find any of the outliers to be influential. In order to validate our results we used two data sets, AR1 and AR6.

4.1. Descriptive statistics

Tables 3 and 4 show the "min", "max", "mean", "std dev", "variance", "25% quartile" and "75% quartile" values for the metrics considered in this study. The data was prepared by removing from the analysis the outliers and the metrics which did not have at least six values.

4.2. Logistic regression (LR) analysis

The research methodology and the results of analysing the relationship between the static code metrics and fault proneness using the LR method are presented in this section.

4.2.1. Logistic regression (LR) modeling
The LR method is the most widely used statistical technique in the literature [4] for predicting a categorical dependent variable from a set of independent variables (a detailed description is given by [4,24]).


Table 3
Descriptive statistics for static code metrics (AR1 data set).

Metric                              Minimum    Maximum      Mean       Std. deviation
total loc                           2          95           20.388     19.601
blank loc                           0          9            0.206      1.040
comment loc                         0          30           4.760      6.114
code and comment loc                0          2            0.074      0.293
executable loc                      2          82           15.421     15.262
unique operands                     1          47           12.454     10
unique operators                    2          19           8.330      4.156
total operands                      1          118          23.743     24.096
total operators                     3          184          36.876     36.343
halstead vocabulary                 3          62           20.785     13.453
halstead length                     4          302          60.619     60.326
halstead volume                     4          1210         205.132    240.713
halstead level                      0.03       1            0.257      0.245
halstead difficulty                 1          33.33        7.798      6.377
halstead effort                     4          40,333.33    2861.796   5926.754
halstead error                      0          0.4          0.068      0.08043
halstead time                       0.22       2240.74      158.988    329.264
branch count                        0          76           9.603      13.09611
decision count                      0          38           4.802      6.548
call pairs                          0          19           2.239      3.413
condition count                     0          38           4.396      6.527
multiple condition count            0          10           0.992      1.604
cyclomatic complexity               1          28           4.570      4.857
cyclomatic density                  0.13       0.6          0.312      0.104
decision density                    0          2            0.683      0.582
design complexity                   0          19           2.239      3.413
design density                      0          6            0.6818     1.206
normalized cyclomatic complexity    0.03       0.5          0.254      0.104
formal parameters                   0          3            0.331      0.568

Table 4
Descriptive statistics for static code metrics (AR6 data set).

Metric                              Minimum    Maximum      Mean       Std. deviation
total loc                           1          98           20.574     19.8889
blank loc                           0          48           1.079      6.4026
comment loc                         0          31           5.128      6.486
code and comment loc                0          11           0.277      1.429
executable loc                      0          89           15.257     15.313
unique operands                     1          50           13.525     10.931
unique operators                    2          21           8.584      4.421
total operands                      1          109          24.405     22.906
total operators                     3          160          37.465     34.849
halstead vocabulary                 3          97           22.821     16.935
halstead length                     4          267          61         57.380
halstead volume                     4          1101         207.43     231.414
halstead level                      0.03       591          11.248     78.0381
halstead difficulty                 0.04       33.33        7.701      5.956
halstead effort                     4          18,666.67    2561.309   4013.737
halstead error                      0          14,775       275.563    1951.752
halstead time                       0.17       1037.04      142.270    223.001
branch count                        0          820.83       22.929     107.864
decision count                      0          27           4.208      5.757
call pairs                          0          20           2.643      3.598
condition count                     0          24           3.277      5.069
multiple condition count            0          9            0.950      1.802
cyclomatic complexity               1          19           3.841      3.732
cyclomatic density                  0.09       10           0.448      1.299
decision density                    0          3            0.671      0.688
design complexity                   0          20           2.470      3.443
design density                      0          7            0.851      1.421
normalized cyclomatic complexity    0          0.38         0.197      0.079
formal parameters                   0          2            0.261      0.460

In this study, the independent variables are the static code metrics and the dependent variable is fault proneness. The LR method is divided into two categories: (i) univariate LR and (ii) multivariate LR.

The univariate LR is a statistical method that determines the relationship between each static code metric and fault proneness. The multivariate LR is used to construct a model by using all the static

code metrics in combination for predicting faulty/non faulty modules. In the LR method there are two stepwise selection methods: forward selection and backward elimination [4]. In the forward stepwise method, the entry of each variable is examined at each step. The backward elimination method enters all the independent variables into the model at once. The metrics are then removed one at a time from the model until a stopping criterion is satisfied. In this work we have applied the backward

Table 5
Multivariate analysis for LR model (AR1 data set).

Variable    Halstead difficulty    Constant
B           0.140                  −3.997
SE          0.044                  0.686
p-Value     0.002                  0.000

−2 log likelihood: 53.645
R2 statistic: 0.37

Table 6
Multivariate analysis for LR model (AR6 data set).

Variable    Halstead effort    Decision count    Constant
B           0.000              0.388             −2.702
SE          0.000              0.121             0.466
p-Value     0.036              0.001             0.000

−2 log likelihood: 62.025
R2 statistic: 0.46

elimination method on the static code metrics selected as significant in the univariate analysis.
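For illustration, backward elimination for the LR model can be sketched as below. This assumes statsmodels and pandas; the significance threshold of 0.05 is an assumption, since the paper does not report the exact stopping criterion.

import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):       # X: DataFrame of metrics, y: fault labels
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:                 # stop: every remaining metric is significant
            return model, cols
        cols.remove(worst)                        # drop the least significant metric and refit
    return None, []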

For the model predicted, the coefficients (Ai's), the statistical significance (p-value), the Maximum Likelihood Estimation (MLE), and the R2 statistic are reported. A higher value of the R2 statistic indicates a greater effect of the independent variables on the dependent variable and a higher prediction accuracy of the model. However, as stated in [6,55], "we should not interpret the value of R2 in logistic regression using the usual heuristics for linear regression R2s since they are built upon very different formulas". Hence, high values of R2 rarely occur in LR analysis. The details can be found in [2].

A test of multicollinearity was performed on the models predicted in this paper. The presence of multicollinearity makes the interpretation of the model difficult. Let the covariates of the model predicted be Y1, Y2, ..., Yn. The maximum eigenvalue and the minimum eigenvalue (emax and emin, respectively) are calculated using the principal component analysis method. The conditional number is defined as λ = √(emax/emin). The multicollinearity of the model is not tolerable if the value of the conditional number is greater than 30 [5,50].
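A sketch of this test, assuming NumPy and an n_modules x n_metrics array X of the covariates, is shown below.

import numpy as np

def conditional_number(X):
    # eigenvalues of the covariance matrix of the covariates (principal components)
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return np.sqrt(eigvals.max() / eigvals.min())

# multicollinearity is treated as not tolerable when conditional_number(X) > 30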

4.2.2. Multivariate LR analysis
In this subsection we find the effect of the independent variables (static code metrics) on the dependent variable (fault proneness). The multicollinearity of the predicted models is acceptable, as the conditional number for the models is below 30 (see Section 4.2.1). In Tables 5 and 6, we summarize the coefficient (B), statistical significance (p-value), R2, and correctness. One metric was selected in the model predicted using the AR1 data set (see Table 5). Table 6 shows that two of the selected static code metrics are included in the model predicted using the AR6 data set. The value of the R2 statistic is 0.46.

Table 7 presents the predicted accuracy of the model containing the static code metrics, using the forward selection method. Two metrics, Halstead effort and decision count, are included in the model. During univariate analysis, all the static code metrics chosen in this work were found to be significant.

Table 7
Predicted correctness of LR model.

Observed    AR1 predicted (cutoff point = 0.06)         AR6 predicted (cutoff point = 0.18)
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        77      35      68.70%                      65      21      75.58%
1.00        2       7       77.80%                      3       12      80.00%


The model is applied to all the system modules to compare the accuracy of predicting faulty and non faulty software modules. The cut off point is selected using the ROC analysis to avoid the selection of an arbitrary cut off point in the analysis [21]. This provides a balance between the values of the sensitivity and the specificity. The results of the model predicted using the AR1 data set show that the sensitivity and specificity are 77.8% and 68.7%, respectively. In case of the AR6 data set, out of 15 modules actually fault prone, 12 modules were predicted to be fault prone. The sensitivity of the model is 80.00% (see Table 7). Similarly, 65 out of 86 modules were predicted to be not fault prone. Thus, the specificity of the model is 75.58%. This shows that the model accuracy is not low.

4.3. Machine learning methods

In this section, we present the results of the models predicted using the six ML methods.

4.3.1. Artificial Neural Network (ANN)
4.3.1.1. Architecture. We use a Multilayer Feed Forward network and the back propagation algorithm to train the network in the ANN method. The network constructed using the ANN method consists of three layers: an input layer with I nodes, a hidden layer with H nodes, and an output layer with O nodes. The nodes in the input layer are connected to each node in the hidden layer, and there is no direct connection between the input and the output nodes. The ANN adjusts the weights of the connections between nodes repetitively by computing the difference between the predicted output and the actual output. The network learns by searching for the connection weights that minimize the error between the actual and the predicted output on the training data set [46]. In order to perform the ANN analysis, min–max normalization was used. The min–max normalization transforms the data into the range 0–1 [20,50] using the following formula:

v′ = (v − minM) / (maxM − minM)      (1)

where minM and maxM are the minimum and maximum values of a metric M.

The ANN model was trained with a learning rate of 0.005, with training continuing until the minimum sum of squared errors was reached.

Principal components (P.C.) analysis was applied on the normalized metrics to produce domain metrics. The aim of the P.C. analysis is to transform the raw static code metrics into measures which are not correlated to each other [31]. Given an a × b matrix of multivariate training data, where a represents the system modules and b represents the static code metrics, the P.C. analysis reduces the b columns to p columns (an a × p matrix, with p < b).
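A sketch of this preprocessing pipeline is given below, assuming scikit-learn. The number of retained components is an assumption (the paper reports nine domain metrics for AR1 and seven for AR6).

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def to_domain_metrics(X, n_components=0.95):
    X_norm = MinMaxScaler().fit_transform(X)                  # Eq. (1): scale each metric to [0, 1]
    pca = PCA(n_components=n_components, svd_solver="full")   # keep components explaining 95% of variance
    return pca.fit_transform(X_norm)                          # a x p matrix of uncorrelated domain metrics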

The following domain metrics were given as input to the ANN model using the AR1 data set:

• P1: total operators, halstead length, halstead volume, halstead error, total operands
• P2: total operators, halstead length, halstead volume, halstead error, total operands
• P3: design density, normalized cyclomatic complexity, cyclomatic density, blank loc, design complexity
• P4: halstead level, decision density, unique operators, normalized cyclomatic complexity, call pairs, code and comment loc, cyclomatic density, decision density, formal parameters, comment loc
• P5: formal parameters, code and comment loc, decision density, normalized cyclomatic complexity, condition count
• P6: blank loc, code and comment loc, decision density, halstead level, formal parameters


Table 8
ANN summary for the AR1 and AR6 data sets.

Architecture          AR1                 AR6
Layers                3                   3
Input units           7                   9
Hidden units          4                   4
Output units          2                   2
Training algorithm    Back propagation    Back propagation

• P7: cyclomatic density, design density, blank loc, halstead difficulty, code and comment loc
• P8: multiple condition count, code and comment loc, decision density, unique operands, comment loc
• P9: decision density, halstead difficulty, formal parameters, halstead level, blank loc

The following domain metrics were given as input to the ANN model using the AR6 data set:

• P1: total operators, halstead volume, halstead length, total operands, unique operands
• P2: halstead level, halstead error, cyclomatic density, blank LOC, branch count
• P3: design density, design complexity, call pairs, normalized cyclomatic complexity, comment LOC
• P4: design density, unique operators, halstead difficulty, condition count, multiple condition count
• P5: formal parameters, normalized cyclomatic complexity, condition count, cyclomatic complexity
• P6: formal parameters, normalized cyclomatic complexity, decision density, design complexity, call pairs
• P7: normalized cyclomatic complexity, design density, halstead difficulty, comment LOC, unique operands

Table 8 presents the architecture of the ANN models for the AR1 and AR6 data sets. The ANN method is non-linear in nature, hence the traditional statistical tests applied in the LR method are not applicable here. Instead, we used the AUC produced by the ROC analysis to heuristically analyze and assess the importance of the static code metrics (input variables) for model prediction purposes.

4.3.1.2. Artificial Neural Network (ANN) analysis. In this section, we present and assess the results produced by applying the ANN method for fault prediction. A module with one or more faults was considered to be faulty. The inputs to the model predicted using the ANN method were the P.C.s given above.

Table 9 presents the accuracy of the model predicted using the ANN method. The cut off points for the AR1 and AR6 models are 0.05 and 0.11, respectively. In case of the AR1 data set, the sensitivity and specificity of the model are 77.8% and 68.7%, respectively: out of 9 modules actually fault prone, 7 were predicted to be fault prone. In case of the AR6 data set, 13 out of 15 fault prone modules were predicted to be fault prone, giving a sensitivity of 86.50%, and 81 out of 86 non fault prone modules were correctly predicted, giving a specificity of 94.20%. This shows that the model accuracy is high.

Table 9
Predicted correctness of ANN model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        77      35      68.70%                      81      5       94.20%
1.00        2       7       77.80%                      2       13      86.50%

Fig. 1. Architecture of Support Vector Machines.

4.3.2. Support Vector Machine modeling
The SVM is a useful technique for classification of data and has been successfully applied in various modeling applications such as text classification, image analysis and face identification [56]. SVMs are capable of handling complex and large problems. Using the SVM method, an N-dimensional hyperplane is constructed that separates the input domain into binary or multi-class categories. This method constructs an optimal hyperplane which separates the categories of the dependent variable on each side of the plane [48]. The support vectors lie close to the hyperplane. Non linear relationships can be handled using a kernel function. The Radial Basis Function (RBF) maps the non linear relationships into a higher dimensional space so that the categories of the dependent variable can be separated. k-means clustering is used to find the centers of the RBF network [43]. Fig. 1 depicts the architecture of the RBF kernel in the SVM method. The circles depict one category of the dependent variable and the rectangles depict the other category. The support vectors are depicted by the shaded portions of the rectangles and circles.

Given a set of training samples (a1, b1), ..., (am, bm) with bi ∈ {−1, +1}, let αi (i = 1, ..., m) be the Lagrangian multipliers, K(·, ·) a kernel function and z a bias. The discriminant function DF of the two-class SVM is given below [56]:

DF(a) = Σ (i = 1 to m) bi αi K(ai, a) + z      (2)

Then an input pattern a is classified as [56]:

a is assigned to class +1 if DF(a) > 0, and to class −1 if DF(a) < 0      (3)

4.3.2.1. Support Vector Machine analysis results. In this section, we present the results of employing the SVM method to predict the fault proneness of a module. In case of the AR1 model, out of 9 modules actually fault prone, 7 modules were predicted to be fault prone (see Table 10). The sensitivity of the model is 77.78%. Similarly, the specificity of the model is 66.07%. In case of the AR6 model, the sensitivity and specificity of the model are 73.33% and 73.26%, respectively. This shows that the model accuracy is lower compared to the models predicted using the LR and ANN approaches.

Table 10
Predicted correctness of SVM model.

Observed    AR1 predicted (cutoff point = 0.05)         AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        28      84      25.00%                      63      23      73.33%
1.00        0       9       100%                        4       11      73.26%


Table 11
Metrics included in DT models.

AR1: cyclomatic complexity, branch count, decision count, condition count, halstead effort, halstead time, comment loc, executable loc, unique operators, multiple condition count, halstead level, halstead difficulty, total operands, total loc, total operators, call pairs, design complexity, halstead length, halstead volume, halstead error, decision density, blank loc, code and comment loc, unique operands, normalized cyclomatic complexity, cyclomatic density.

AR6: cyclomatic complexity, decision density, design complexity, code and comment loc, call pairs, halstead error, halstead level, halstead difficulty, halstead time, multiple condition count, cyclomatic density, normalized cyclomatic complexity, formal parameters, total loc, blank loc, executable loc, unique operators, total operators, unique operands, comment loc, halstead effort, total operands, halstead volume, and halstead length metrics.


Table 13
Predicted correctness of CCN model.

Observed    AR1 predicted (cutoff point = 0.10)         AR6 predicted (cutoff point = 0.18)
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        86      26      76.79%                      64      22      74.42%
1.00        1       8       88.89%                      3       12      80.00%

4.3.3. Decision Tree (DT) method
In this section, we present the results for predicting faulty and non faulty modules using the DT method.

4.3.3.1. Decision Tree architecture. In the DT method [42], each internal node in the tree represents the value of an independent variable and each terminal node represents the value of the dependent variable. The Classification and Regression Tree (CRT) algorithm splits the data into partitions based on the independent variables. At each step in the CRT algorithm, the independent variable having the strongest relationship with the dependent variable is selected. In the CRT method, each parent node is split into only two child nodes. The aim of the CRT method is to maximize the within-node homogeneity.
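A sketch of a CRT-style tree for this task is given below, assuming scikit-learn. The impurity criterion and pruning settings are assumptions, as the paper does not report them.

from sklearn.tree import DecisionTreeClassifier

# binary splits on the metric with the strongest relationship to fault proneness at each step
dt_model = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5)
# dt_model.fit(X_train, y_train)
# fault_probability = dt_model.predict_proba(X_test)[:, 1]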

4.3.3.2. Decision Tree model results. Table 11 shows the subset of the metrics included in the models predicted using the AR1 and AR6 data sets. Table 12 presents the predicted correctness of the model developed using the DT method.

The sensitivity and specificity of the model predicted with respect to the AR1 data set are 77.7% and 89.3%, respectively. Similarly, the sensitivity and specificity of the model predicted with respect to the AR6 data set are 93.3% and 81.4%, respectively. The results show that the accuracy of the DT model is high as compared to the models predicted using the LR, SVM and ANN methods.

4.3.4. Cascade Correlation Network (CCN)
We used the correlation based feature selection (CFS) technique to select the uncorrelated and best predictors out of the set of independent variables present in the training data set using a correlation based heuristic [19]. All possible combinations of the independent variables were searched to find the best combination of independent variables. The CFS method ranks metric subsets rather than individual metrics. CFS is a heuristic that evaluates the predictive ability of the individual variables (static code metrics in our case) for predicting the dependent variable, along with the correlation among the independent variables. In case of the AR1 data set, the predicted model shows better accuracy when the CFS technique was applied.

Table 12
Predicted correctness of DT model.

Observed    AR1 predicted (cutoff point = 0.08)         AR6 predicted (cutoff point = 0.18)
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        70      16      89.30%                      70      16      81.40%
1.00        14      1       77.70%                      14      1       93.30%


However, in case of the AR6 data set, the predicted model showed lower accuracy when a subset of metrics was selected using CFS. Hence, we did not use the CFS method when predicting the model using the AR6 data set.
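For illustration, the CFS idea can be sketched as a merit score for a candidate subset of metrics, following Hall [19]: a subset scores well if its metrics correlate strongly with the fault label and weakly with each other. The sketch below uses Pearson correlation in place of the symmetrical-uncertainty measure of the original CFS and does not reproduce the subset search used in the paper.

import numpy as np
import pandas as pd

def cfs_merit(X: pd.DataFrame, y: pd.Series, subset):
    k = len(subset)
    r_cf = np.mean([abs(X[m].corr(y)) for m in subset])   # metric-to-class correlation
    r_ff = 0.0
    if k > 1:                                              # average metric-to-metric correlation
        r_ff = np.mean([abs(X[a].corr(X[b])) for a in subset for b in subset if a != b])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)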

The correctness of the fault proneness model predicted using the CCN approach is shown in Table 13. The sensitivity and specificity of the model predicted with respect to the AR1 data set are 88.89% and 76.79%, respectively. Similarly, the sensitivity and specificity of the model predicted with respect to the AR6 data set are 80% and 74.42%, respectively.

4.3.5. GMDH polynomial network
GMDH networks are known as self organizing networks, which means that the connections between the neurons are not fixed during the training process. In a GMDH network, the number of layers is decided automatically so that a network with maximum prediction accuracy and without overfitting is constructed [48].
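The building block of such a network is a neuron that fits a two-input quadratic polynomial by least squares, of the same form as the N(j) expressions shown in Table 14. A minimal sketch, assuming NumPy and omitting the layer-growing and neuron-selection logic, is given below.

import numpy as np

def fit_gmdh_neuron(x1, x2, target):
    # N = c0 + c1*x1 + c2*x1^2 + c3*x2 + c4*x2^2 + c5*x1*x2
    A = np.column_stack([np.ones_like(x1), x1, x1**2, x2, x2**2, x1 * x2])
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coeffs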

Table 14 shows the output generated for the GMDH polynomial model. The output from neuron j is shown as N(j). The final line shows the output of the network; in this case, the probability of the fault label being true is predicted.

The correctness of the fault proneness model predicted using the GMDH polynomial network approach is shown in Table 15. The sensitivity and specificity of the model predicted with respect to the AR1 data set are 88.89% and 73.21%, respectively. Similarly, the sensitivity and specificity of the model predicted with respect to the AR6 data set are 80% and 74.42%, respectively.

4.3.6. GEP analysis
Genetic algorithms (GA) are search based algorithms that have seen an explosion of interest since the 1980s. They are based on natural biological evolution and are much faster than exhaustive search procedures. GEP was developed by Candida Ferreira in 1996 and is 100 to 60,000 times faster than GA [15]. GEP mutates valid expressions quickly, as it uses the Karva language to encode the expressions. It uses genes that have two sections, the head and the tail. The head consists of functions, variables and constants, and the tail consists of variables and constants. The terminals in the tail are used if there are not enough terminals in the head. The following are the major steps for training a data set with GEP:

(1) Creation of the initial population.
(2) Use of evolution to create well fitted individuals using mutation, transposition, inversion and recombination steps.
(3) Finding simpler functions.

In Table 16, we summarize the parameters provided to and determined by the GEP method. We used 4 genes per chromosome and the addition function to link the genes.

The fitness function is the number of faulty and non faulty modules predicted correctly, with a penalty, and is used for analysing the performance of the model. The aim of the GEP method is to maximize the value of the fitness function until the stopping criterion is met:

Fitness = (TFP + TNFP) / (TFP + FNFP + FFP + TNFP)

where TFP and TNFP are the numbers of modules correctly predicted as fault prone and non fault prone, and FFP and FNFP are the numbers of modules incorrectly predicted as fault prone and non fault prone, respectively.
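At a given cut off point the fitness can be computed directly from the four counts; a minimal sketch (the penalty term applied by the tool is not reproduced) is shown below, assuming NumPy arrays of observed and predicted labels.

import numpy as np

def gep_fitness(y_true, y_pred):
    tfp = int(((y_pred == 1) & (y_true == 1)).sum())    # true fault prone
    tnfp = int(((y_pred == 0) & (y_true == 0)).sum())   # true non fault prone
    ffp = int(((y_pred == 1) & (y_true == 0)).sum())    # false fault prone
    fnfp = int(((y_pred == 0) & (y_true == 1)).sum())   # false non fault prone
    return (tfp + tnfp) / (tfp + fnfp + ffp + tnfp)

For example, with TFP = 8, TNFP = 84, FFP = 2 and FNFP = 7 (the AR6 counts in Table 17), the fitness is 92/101 ≈ 0.91.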


Table 14
GMDH polynomial output.

AR1:
defects{true} = 0.029694 - 1.4561*halstead error - 6.685482*halstead error^2 - 0.006916*unique operators + 0.001303*unique operators^2 + 0.188474*halstead error*unique operators

AR6:
N(5) = 0.044327 + 0.114179*multiple condition count - 0.025803*multiple condition count^2 + 0.034002*call pairs - 0.004033*call pairs^2 + 0.018979*multiple condition count*call pairs
N(3) = -0.019685 + 0.002822*halstead difficulty - 0.000023*halstead difficulty^2 + 1.369638*N(5) + 0.199649*N(5)^2 - 0.04608*halstead difficulty*N(5)
N(10) = 0.056173 - 0.012912*cyclomatic complexity + 0.004474*cyclomatic complexity^2 + 0.086395*multiple condition count + 0.003522*multiple condition count^2 - 0.012155*cyclomatic complexity*multiple condition count
N(8) = 0.00462 - 0.007771*total loc - 0.000003*total loc^2 + 2.216319*N(10) - 1.249802*N(10)^2 + 0.008383*total loc*N(10)
N(2) = -0.034398 + 0.690112*N(3) + 0.691399*N(3)^2 + 0.506948*N(8) + 0.505133*N(8)^2 - 1.36183*N(3)*N(8)
faults{true} = -0.005575 + 0.00025*halstead volume - 9.274278e-007*halstead volume^2 + 1.025279*N(2) - 0.10709*N(2)^2 + 0.00078*halstead volume*N(2)

Table 15
Predicted correctness of GMDH model.

Observed    AR1 predicted (cutoff point = 0.06)         AR6 predicted (cutoff point = 0.16)
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        82      30      73.21%                      64      22      74.42%
1.00        1       8       88.89%                      3       12      80.00%

Table 16
GEP parameters.

Parameter                                   AR1         AR6
Population size                             50          50
Genes per chromosome                        4           4
Gene head length                            8           8
Generations required to train the model     349         442
Generations required for simplification     35          52
Linking function                            Addition    Addition
Fitness function                            Number of correct predictions with penalty

Table 17
Predicted correctness of GEP model.

Observed    AR1 predicted (cutoff point = 0.5)          AR6 predicted (cutoff point = 0.13)
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        111     1       99.11%                      84      2       97.67%
1.00        6       3       33.33%                      7       8       53.33%

Table 18
Results of validation of LR model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        77      35      68.70%                      65      21      53.50%
1.00        2       7       77.80%                      3       12      53.50%

Table 19
Results of validation of ANN model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        78      34      66.67%                      58      28      67.44%
1.00        3       6       69.64%                      4       11      73.33%

Table 20
Results of validation of SVM model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        35      77      31.25%                      59      27      68.60%
1.00        1       8       88.89%                      5       10      66.67%

Table 21
Results of validation of DT model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        100     12      89.30%                      70      16      81.40%
1.00        2       7       77.70%                      1       14      93.30%

Table 22
Results of validation of CCN model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        85      27      75.89%                      63      23      73.25%
1.00        2       7       77.79%                      4       11      73.33%

Table 23
Results of validation of GMDH model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        77      35      68.70%                      58      28      67.40%
1.00        2       7       77.80%                      5       10      66.70%

The correctness of the fault proneness model predicted using the GEP method is shown in Table 17. The sensitivity and specificity of the model predicted with respect to the AR1 data set are 33.33% and 99.11%, respectively. Similarly, the sensitivity and specificity of the model predicted with respect to the AR6 data set are 53.33% and 97.67%, respectively.

5. Model evaluation

The models predicted in the previous section were applied on the same data set from which they were derived. The predicted models should be applied on different data sets to gain insight into their prediction accuracy. Hence, we used the 10-cross validation method on the LR, ANN, SVM, DT, CCN, GMDH, and GEP models, following the procedure given in Section 3. For performing the 10-cross validation, the software modules were randomly divided into 10 partitions of approximately equal size (AR1: 9 partitions of 12 data points each and 1 partition of 13 data points; AR6: 9 partitions of ten data points each and 1 partition of 11 data points).

Fig. 2. ROC curve for AR1 data set: (a) LR, (b) ANN, (c) SVM, (d) DT, (e) CCN, (f) GMDH, (g) GEP models.

Tables 18–24 show the accuracy of the models predicted using the ML methods. We present the results of the cross validation of the predicted models via the LR, ANN, SVM, DT, CCN, GMDH, and GEP approaches in Tables 25 and 26.

As shown in Tables 25 and 26, the cross validation results of the ML models were better as compared to the cross validation results of the LR model. The basis of comparison is the value of the AUC computed using the ROC analysis. The ROC analysis can be used to obtain the optimal cut off point that provides the balance between the number of faulty and non faulty modules. In Figs. 2 and 3, the ROC curves for the LR, ANN, SVM, DT, CCN, GMDH, and GEP models are presented.

In case of the AR1 data set, the AUC of the model predicted using the LR method was 0.494, which is much lower than that of the models predicted using the machine learning methods. Similar results were observed when the model was predicted using the AR6 data set. The AUC of the model predicted using the DT method is 0.865 in case of the AR1 data set and 0.948 in case of the AR6 data set.
Page 10: Comparative analysis of statistical and machine learning methods for predicting faulty modules

R. Malhotra / Applied Soft Computing 21 (2014) 286–297 295

N (c)

iDmbsisa

Fig. 3. ROC curve for AR6 data set (a) LR (b) AN

n case of the AR6 data set. Thus, the model predicted using theT method shows the highest values of AUC. In line with the otherodels predicted in literature, the findings of this study may also

e validated externally. The models predicted using the DT method

howed the best results. Although the LR models are widely usedn construction of software quality models, but the results in thistudy show that the ML models show better predictive capabilitys compared to the LR model. Therefore, it appears that the model

SVM (d) DT (e) CCN (f) GMDH (h) GEP models.

predicted using the ML methods might lead to construction of theoptimum prediction models for developing fault prone models.

6. Threats to validity

Empirical studies suffer from similar types of limitations, which are summarized below:


Table 24
Results of validation of GEP model.

Observed    AR1 predicted                               AR6 predicted
            0.00    1.00    Percent correct             0.00    1.00    Percent correct
0.00        110     2       99.11%                      84      2       97.67%
1.00        8       1       11.11%                      9       6       40.00%

Table 25
Results of 10-cross validation of AR1 models.

Technique    Sensitivity    Specificity    Proportion correct    AUC      Cutoff point
LR           55.60          56.20          55.00                 0.494    0.05
ANN          75.70          61.60          69.42                 0.711    0.05
SVM          88.89          31.25          35.53                 0.717    0.05
DT           89.30          77.70          88.84                 0.865    0.08
CCN          77.89          75.89          76.03                 0.786    0.10
GMDH         69.64          66.67          69.42                 0.744    0.06
GEP          11.11          99.11          91.73                 0.547    0.05

Table 26
Results of 10-cross validation of AR6 models.

Technique    Sensitivity    Specificity    Proportion correct    AUC      Cutoff point
LR           53.30          53.50          53.00                 0.538    0.08
ANN          66.70          67.40          67.32                 0.774    0.06
SVM          66.67          68.60          68.32                 0.721    0.13
DT           93.30          81.40          83.16                 0.948    0.11
CCN          73.33          73.26          73.27                 0.758    0.10
GMDH         73.33          67.44          68.31                 0.702    0.10
GEP          40.00          97.67          89.10                 0.688    0.10



• The degree to which the results can be generalized to other research cannot be determined, as we have used medium sized software in this paper.
• The study does not take into account the level of severity of faults; hence the results cannot be used to determine the severity of the modules predicted to be faulty [14].
• The results in this study cannot be applied to studies having a different dependent variable, such as maintenance effort, testing effort, etc.

Hence, the results in this study can be used as future guidance for predicting faulty modules using the static code metrics. However, the results of this study need to be externally validated.

7. Conclusion and future work

The main goal of our study was to examine the regression (LR) and the ML methods (ANN, SVM, DT, CCN, GMDH and GEP) in order to find the combined impact of the static code metrics on fault proneness. Thus, we employed six ML methods (ANN, SVM, DT, CCN, GMDH and GEP) to assess the relationship between the static code metrics and fault proneness. The primary contribution of this work is to assess the predictive capability of the ML methods. We also compare and analyze the performance of the models predicted using the LR method with the models predicted using the ML methods. To validate the results we used two public domain data sets, AR1 and AR6. The AUC was used to determine the performance of the ML methods.

The AUC for the LR model was 0.494 and 0.538 for the models predicted using the AR1 and AR6 data sets, respectively. The AUC for the DT model was 0.865 and 0.948, respectively. The models predicted using the ML methods outperformed the models predicted using the LR method for both the AR1 and AR6 data sets. Thus, the models predicted using the DT method showed the best results. This paper confirms that the models using ML methods such as ANN, SVM, DT, CCN, GMDH and GEP have predictive ability for predicting faulty and non faulty modules.

Hence, researchers and software practitioners can use the models predicted in this paper to predict faulty or non faulty modules in the early phases of software development. Software practitioners can focus the available testing resources on the faulty portions of the software.

Replicated studies with large sized software should be carried out so that generalized results can be obtained. We may use evolutionary or hybrid evolutionary algorithms for fault prediction in future work.

References

[1] K.K. Aggarwal, Y. Singh, A. Kaur, R. Malhotra, Application of artificial neural network for predicting fault proneness models, in: International Conference on Information Systems, Technology and Management (ICISTM 2007), New Delhi, India, March 12–13, 2007.
[2] K.K. Aggarwal, Y. Singh, A. Kaur, R. Malhotra, Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study, Softw. Process Improv. Pract. 16 (1) (2009) 39–62 (John Wiley & Sons).
[3] V. Barnett, T. Price, Outliers in Statistical Data, John Wiley & Sons, 1995.
[4] V. Basili, L. Briand, W. Melo, A validation of object-oriented design metrics as quality indicators, IEEE Trans. Softw. Eng. 22 (10) (1996) 751–761.
[5] D. Belsley, E. Kuh, R. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley and Sons, New York, 1980.
[6] A. Bener, B. Turhan, Analysis of Naive Bayes' assumptions on software fault data: an empirical study, Data Knowl. Eng. 68 (2009) 278–290.
[7] L. Briand, W. Daly, J. Wust, Exploring the relationships between design measures and software quality, J. Syst. Softw. 51 (3) (2000) 245–273.
[8] M. Chapman, D. Solomon, The relationship of cyclomatic complexity, essential complexity and error rates, in: Proc. NASA Software Assurance Symp., 2002. http://www.ivv.nasa.gov/business/research/osmasas/conclusion2002/MikeChapman The Relationship of Cyclomatic Complexity Essential Complexity and Error Rates.ppt
[9] K. Dejaeger, T. Verbraken, B. Baesens, Prediction Models Using Bayesian Network Classifiers, 39, 2013, pp. 237–257.
[10] B. Diri, C. Catal, U. Sevim, Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm, Expert Syst. Appl. 38 (2011) 2347–2353.
[11] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, J. Biomed. Inform. 35 (2002) 352–359.
[12] E. Duman, Comparison of decision tree algorithms in identifying bank customers who are likely to buy credit cards, in: Seventh International Baltic Conference on Databases and Information Systems, Kaunas, Lithuania, July 3–6, 2006.
[13] B. Eftekhar, K. Mohammad, H. Ardebili, M. Ghodsi, E. Ketabchi, Comparison of artificial neural network and logistic regression models for prediction of mortality in head trauma based on initial clinical data, BMC Med. Inform. Decis. Mak. (2005).
[14] K. El Emam, S. Benlarbi, N. Goel, S. Rai, A Validation of Object-Oriented Metrics, Technical Report ERB-1063, NRC, 1999.
[15] C. Ferreira, Gene expression programming: a new adaptive algorithm for solving problems, Complex Syst. 13 (2001) 87–129.
[16] L. Guo, Y. Ma, B. Cukic, H.S.H. Singh, Robust prediction of fault-proneness by random forests, in: 15th Int. Symp. Softw. Reliab. Eng., 2004.
[17] M. Halstead, Elements of Software Science, Elsevier, 1977.
[18] T. Hall, D. Bowes, The state of machine learning methodology in software fault prediction, in: 11th International Conference on Machine Learning and Applications, 2012.
[19] M. Hall, Correlation-based feature selection for discrete and numeric class machine learning, in: Proceedings of the 17th International Conference on Machine Learning, 2007, pp. 359–366.
[20] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Harchort India Private Limited, 2001.
[21] J. Hanley, B.J. McNeil, The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve, Radiology 143 (1982) 29–36.
[22] S. Henry, D. Kafura, Software structure metrics based on information flow, IEEE Trans. Softw. Eng. 7 (5) (1981) 510–518.
[23] http://softlab.boun.edu.tr
[24] D. Hosmer, S. Lemeshow, Applied Logistic Regression, John Wiley and Sons, 1989.
[25] A. Kaur, R. Malhotra, Application of random forest for predicting fault prone classes, in: International Conference on Advanced Computer Theory and Engineering, Thailand, December 20–22, 2008.
[26] T. Khoshgoftaar, An application of zero-inflated Poisson regression for software fault prediction, in: Proc. 12th Intl. Symp. Software Reliability Eng., November, 2001, pp. 66–73.


[27] T. Khoshgoftaar, E. Allen, Model software quality with classification trees, Recent Adv. Reliab. Qual. Eng. (2001) 247–270.
[28] T. Khoshgoftaar, E.D. Allen, J.P. Hudepohl, S.J. Aud, Application of neural networks to software quality modeling of a very large telecommunications system, IEEE Trans. Neural Netw. 8 (4) (1997) 902–909.
[29] T. Khoshgoftaar, N. Seliya, Fault prediction modeling for software quality estimation: comparing commonly used techniques, Empirical Softw. Eng. 8 (3) (2003) 255–283.
[30] A.G. Koru, H. Liu, Building effective defect-prediction models in practice, IEEE Softw. 22 (2005).
[31] C.R. Kothari, Research Methodology: Methods and Techniques, New Age International Limited, New Delhi, 2004.
[32] S. Lessmann, B. Baesens, C. Mues, S. Pietsch, Benchmarking classification models for software defect prediction: a proposed framework and novel findings, IEEE Trans. Softw. Eng. 34 (2008) 485–496.
[33] R. Malhotra, Prediction of high, medium, and low severity faults using software metrics, Softw. Qual. Prof. (2013).
[34] F. Marini, R. Bucci, A.L. Magri, A.D. Magri, Artificial neural networks in chemometrics: history, examples and perspectives, Microchem. J. 88 (2) (2008) 178–185.
[35] T.J. McCabe, A complexity measure, IEEE Trans. Softw. Eng. 2 (4) (1976) 308–320.
[36] T. Menzies, J. DiStefano, A. Orrego, R. Chapman, Data mining static code attributes to learn defect predictors, IEEE Trans. Softw. Eng. 32 (11) (2007) 1–12.
[37] T. Menzies, J. Stefano, M. Chapman, Learning early lifecycle IV and V quality indicators, in: Proc. IEEE Software Metrics Symp., 2003.
[38] A.T. Mısırlı, A.B. Bener, B. Turhan, An industrial case study of classifier ensembles for locating software defects, Softw. Qual. J. 19 (2011) 515–536.
[39] N. Nagappan, T. Ball, Static analysis tools as early indicators of pre-release defect density, in: Proc. Intl. Conf. Software Eng., 2005, pp. 580–586.
[40] A. Nikora, J. Munson, Developing fault predictors for evolving software systems, in: Proc. Ninth Intl. Software Metrics Symp. (METRICS'03), 2003.
[41] A. Okutan, O.T. Yıldız, Software defect prediction using Bayesian networks, Empirical Softw. Eng. (2012).
[42] A. Porter, R. Selby, Empirically guided software development using metric-based classification trees, IEEE Softw. (1990) 46–54.
[43] Polyspace Verifier, 2005, http://www.polyspace.com/
[44] Prest Metrics Extraction and Analysis Tool, available from: http://softlab.boun.edu.tr/?q=resources&i=tools
[45] Promise, http://promisedata.org/repository/
[46] Y. Singh, A. Kaur, R. Malhotra, Application of decision trees for predicting fault proneness, in: International Conference on Information Systems, Technology and Management-Information Technology, Ghaziabad, India, 2009.
[47] Y. Singh, R. Malhotra, Object Oriented Software Engineering, PHI Learning, 2012.
[48] P. Sherrod, DTreg Predictive Modeling Software, 2003.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, J. Liu, A general software defect-proneness prediction framework, IEEE Trans. Softw. Eng. 37 (2011) 356–370.
[50] K. Srinivasan, D. Fisher, Machine learning approaches to estimating software development effort, IEEE Trans. Softw. Eng. (1995) 126–137.
[51] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. Royal Stat. Soc. 36 (1974) 111–147.
[52] B. Twala, Software faults prediction using multiple classifiers, in: 2011 3rd Int. Conf. Comput. Res. Dev., 4, 2011, pp. 504–510.
[53] X. Wang, D. Bi, S. Wang, Fault recognition with labeled multi-category, in: Third Conference on Natural Computation, Haikou, China, 2007.
[54] L. Yu, An evolutionary programming based asymmetric weighted least squares support vector machine ensemble learning methodology for software repository mining, Inf. Sci. 191 (2012) 31–46.
[55] Y. Zhou, H. Leung, Empirical analysis of object-oriented design metrics for predicting high severity faults, IEEE Trans. Softw. Eng. 32 (10) (2006) 771–784.
[56] L. Zhao, N. Takagi, An application of support vector machines to Chinese character classification problem, in: IEEE International Conference on Systems, Man and Cybernetics, Montreal, 2007.

Ruchika Malhotra is an assistant professor at the Department of Software Engineering, Delhi Technological University (formerly Delhi College of Engineering), Delhi, India. She was an assistant professor at the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. Prior to joining the school, she worked as a full-time research scholar and received a doctoral research fellowship from the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. She received her master's and doctorate degrees in software engineering from the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. She received the IBM Best Faculty Award in 2013. She is executive editor of Software Engineering: An International Journal. She is coauthor of a book on Object Oriented Software Engineering published by PHI Learning. Her research interests are in software testing, improving software quality, statistical and adaptive prediction models, software metrics, neural nets modeling, and the definition and validation of software metrics. She has published more than 80 research papers in international journals and conferences. Malhotra can be contacted by e-mail at [email protected].