


Application of Machine Learning Methods for Software Effort Prediction

Ruchika Malhotra

University School of Information Technology

Guru Gobind Singh Indraprastha University, Kashmere Gate, Delhi-110403, India

e-mail: [email protected]

Arvinder Kaur

University School of Information Technology

Guru Gobind Singh Indraprastha University, Kashmere Gate, Delhi-110403, India

e-mail: [email protected]

Yogesh Singh

University School of Information Technology

Guru Gobind Singh Indraprastha University, Kashmere Gate, Delhi-110403, India

e-mail: [email protected]

Abstract

Software effort estimation is an important area in the field of software engineering. If the software development effort is overestimated, too many resources may be allocated to the project; in contrast, if it is underestimated, time schedules become tight and the quality and testing of the software may be compromised. Many models have been proposed in the literature for estimating software effort. In this paper, we analyze machine learning methods in order to develop models to predict software development effort, using the Maxwell data set consisting of 63 projects. The results show that linear regression, M5P, and M5Rules are effective methods for predicting software development effort.

Introduction

Accurate estimation of resources, costs, manpower, etc., is critical in order to be able to monitor and control project completion within the time schedule. Software effort estimation can provide a key input to planning processes [7]. The overestimation of software development effort may lead to the risk of too many resources being allocated to the project. On the other hand, the underestimation of software development effort may lead to tight schedules; thus, software containing faults may be delivered to the customer. These faults then need to be fixed in the maintenance phase of the software life cycle, in which the cost of their removal increases drastically [7].

The importance of software development effort estimation has motivated the construction of models to predict software development effort in recent years. Software effort may be estimated using the function point method [1] and the COCOMO method [4]. These methods estimate effort using pre-specified formulas that are developed using past experience and historical data. Machine learning methods apply learning algorithms to training data in order to construct "rules" that fit the data, and these rules may be used to fit new data samples in a reasonable manner as well.

In this paper, we analyze machine learning methods in order to predict software development effort. We use eight methods: Linear Regression, Least Median Squared Regression, Pace Regression, Support Vector Machine (SVM), Radial Basis Function (RBF) Network, M5P, REPTree, and M5Rules. The results are obtained using the Maxwell data set consisting of 63 projects.

The paper is organized as follows: Section 2 summarizes the related work. Section 3 summarizes the metrics studied and describes the sources from which the data is collected. Section 4 presents the research methodology followed in this paper. The results of the models predicted for software development effort estimation are given in Section 5. Finally, conclusions of the research are presented in Section 6.

Related Work

Shepperd, Schofield, and Kitchenham described an automated environment that supported the collection, storage, and identification of various projects in order to predict effort for a new project [11]. Krishnamurthy and Fisher compared the CARTX and backpropagation methods with statistical methods for software effort estimation using COCOMO data. The results showed that the machine learning methods were competitive [7].

Sentas et al. estimated effort/productivity using ordinal


ACM SIGSOFT Software Engineering Notes Page 1 May 2010 Volume 35 Number 3

DOI: 10.1145/1764810.1764825 http://doi.acm.org/10.1145/1764810.1764825


regression method. The method was applied to three data sets. The results were comparable to the results of the most significant research in the field [10].

Recently, Li et al. proposed a mutual information based feature selection method in order to select features for software cost estimation. A case based reasoning technique was used to predict the software effort [8].

Research Background

In this section we present a summary of the independent and dependent variables and the empirical data collection.

Independent and Dependent Variables

Software development effort is used as the dependent variable in this study. Software development effort is defined as the work carried out by the software supplier from specification until delivery, measured in hours.

Most of the independent variables in the Maxwell [9] data set are categorical. The independent variables are summarized in Table 1. To account for the correlation among independent variables, a correlation based feature selection (CFS) technique is applied to select the best predictors out of the independent variables in the data set [6].

The best combination of independent variables was searched through all possible combinations of variables. CFS evaluates the best subset of variables by considering the individual predictive ability of each feature along with the degree of redundancy between them.
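As a rough illustration of how CFS scores a candidate subset, the merit function from Hall [6] can be sketched in Python; the correlation values below are made up for illustration, not taken from the Maxwell data.

```python
import numpy as np

def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a feature subset (Hall, 2000).

    feature_class_corr: 1-D array of |correlation| between each selected
    feature and the class (here, effort).
    feature_feature_corr: 2-D array of |correlation| between the selected
    features themselves.
    """
    k = len(feature_class_corr)
    r_cf = np.mean(feature_class_corr)  # mean feature-class correlation
    # mean feature-feature correlation, excluding the diagonal
    off_diag = feature_feature_corr[~np.eye(k, dtype=bool)]
    r_ff = np.mean(off_diag) if k > 1 else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Illustrative (made-up) correlations: two features that each correlate
# with effort but little with each other score higher than one alone.
single = cfs_merit(np.array([0.6]), np.array([[1.0]]))
pair = cfs_merit(np.array([0.6, 0.5]),
                 np.array([[1.0, 0.1], [0.1, 1.0]]))
print(single, pair)
```

A search strategy (here, exhaustive; WEKA also offers best-first) then keeps the subset with the highest merit.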

Empirical Data Collection

The project data was collected from one of the biggest commercial banks in Finland. Most features in this data set are categorical. The categorical features can be further classified into ordinal and nominal features. The data consists of 63 projects; however, 62 projects were chosen for analysis, as one project was identified as an outlier and is missing from the data.

Table 1: Collected Metrics

Attribute (type): Description

- Application type (nominal): 1 = Information/on-line service, 2 = Transaction control, logistics, order processing, 3 = Customer service, 4 = Production control, logistics, order processing, 5 = Management information system (MIS)
- Hardware platform (nominal): 1 = Personal computer (PC), 2 = Mainframe (Mainfrm), 3 = Multi-platform (Multi), 4 = Mini computer (Mini), 5 = Networked (Network)
- Database (nominal): 1 = Relational (Relatnl), 2 = Sequential (Sequentl), 3 = Other, 4 = None
- User interface (nominal): 1 = Graphical user interface, 2 = Text user interface
- Where developed (nominal): 1 = In-house (Inhouse), 2 = Outsourced (Outsrced)
- Telon use (nominal): 0 = No, 1 = Yes
- Number of different development languages used (ordinal): 1 = one language used, 2 = two languages used, 3 = three languages used, 4 = four languages used
- The following ordinal attributes are all rated on the same five-point scale (1 = Very low, 2 = Low, 3 = Nominal, 4 = High, 5 = Very high): Customer participation, Development environment adequacy, Staff availability, Standards use, Methods use, Tools use, Software's logical complexity, Requirements volatility, Quality requirements, Efficiency requirements, Installation requirements, Staff analysis skills, Staff application knowledge, Staff tool skills, Staff team skills
- Size (numerical): Application size in function points, measured using the Experience method
- Effort (numerical): Work carried out by the software supplier from specification until delivery, measured in hours

Research Methodology

Descriptive Statistics and Outlier Analysis

The role of statistics is to function as a tool in analyzing research data and drawing conclusions from it. The research data must be suitably reduced so that it can be read easily and used for further analysis. Descriptive statistics concern the development of certain indices or measures to summarize data. The important statistical measures used for comparing different case studies include mean, median, and standard deviation. Data points that are located in an empty part of the sample space are called outliers. Outlier analysis is done to find data points that are over-influential; removing them is essential. Univariate and multivariate outliers are found in our study. To identify multivariate outliers, we calculate for each data point the Mahalanobis Jackknife distance. The Mahalanobis Jackknife is a measure of the distance in multidimensional space of each observation from the mean center of the observations [2, 5].
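A minimal sketch of the multivariate-outlier step, computing each observation's squared Mahalanobis distance from the sample mean with NumPy. The data here is synthetic, not the Maxwell set, and the jackknife refinement (recomputing mean and covariance with the point held out) is omitted for brevity.

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = X - mean
    # d2[i] = diff[i] @ inv_cov @ diff[i]
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Illustrative data: a tight cluster plus one far-away point.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X = np.vstack([X, [10.0, 10.0, 10.0]])  # obvious multivariate outlier
d2 = mahalanobis_distances(X)
print(int(np.argmax(d2)))  # index of the most extreme point
```

Points whose distance exceeds a chosen cutoff (e.g., a chi-square quantile) would then be examined for influence as described below.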

The influence of univariate and multivariate outliers was tested. If removing a univariate outlier changes the significance (see Section 3.4) of a metric, i.e., the effect of that metric on effort changes, then the outlier is to be removed. Similarly, if the significance of one or more independent variables in the model depends on the presence or absence of the outlier, then that outlier is to be removed. Details on outlier analysis can be found in [3].

Machine Learning Parameter Initialization

The following machine learning algorithms were used (mostly with the default parameter settings) in the WEKA tool [14] in order to build the effort prediction models:

• Linear Regression: Linear regression is a method of estimating the conditional expected value of one variable y given the values of one or more other variables x.

• Least Median Squared: Implements least median squared linear regression, utilizing existing linear regression functions to form predictions. The least squares regression with the lowest median squared error is chosen as the final model.

• Pace Regression: Pace regression is provably optimal when the number of coefficients tends to infinity. It consists of a group of estimators that are either overall optimal or optimal under certain conditions. The estimator was set to empirical Bayes.

• Radial Basis Function (RBF) Network: Implements a Gaussian radial basis function network. The centers and widths of the hidden units are derived using k-means clustering, and the outputs of the hidden layer are combined using linear regression. The number of clusters was set to 2.

• Support Vector Machine (SVM): SVM constructs an N-dimensional hyperplane that optimally separates the data set into two categories. The purpose of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the dependent variable fall on one side of the plane and cases with the other category fall on the other side [12]. The regularization parameter (C) was set to 1. The kernel function used was the polynomial kernel (we also tried the radial basis function kernel, but the results obtained were not as good).

• M5P: Implements base routines for generating M5 model trees and rules.

• REPTree: A fast decision tree learner. It builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It sorts values for numeric attributes only once. Missing values are dealt with by splitting the corresponding instances into pieces (i.e., as in C4.5).

• M5Rules: Generates a decision list for regression problems using separate-and-conquer. In each iteration it builds a model tree using M5 and makes the "best" leaf into a rule.
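The study itself used the WEKA implementations listed above. As a rough illustration of the simplest of them, the linear regression baseline can be sketched as a plain least-squares fit in NumPy; the size/effort data below is made up, with effort growing roughly linearly with size in function points.

```python
import numpy as np

# Hypothetical toy data: effort (hours) grows roughly linearly with
# size (function points), plus noise. A column of ones gives an intercept.
rng = np.random.default_rng(42)
size_fp = rng.uniform(50, 1000, size=40)
effort = 30.0 + 8.0 * size_fp + rng.normal(0, 50, size=40)

X = np.column_stack([np.ones_like(size_fp), size_fp])
coef, *_ = np.linalg.lstsq(X, effort, rcond=None)  # least-squares fit
intercept, slope = coef
predicted = X @ coef
print(slope)  # close to the true slope of 8.0
```

The tree- and rule-based learners (M5P, REPTree, M5Rules) replace this single global fit with piecewise models over partitions of the attribute space.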

Evaluating the Performance of the Models

In this study, the main measures used for evaluating model performance are the relative absolute error (RAE) and the root relative squared error (RRSE). The relative absolute error compares the total absolute error of the model with that of simply predicting the mean of the actual values:

RAE = [ Σ_{i=1..n} |estimate_i − actual_i| / Σ_{i=1..n} |actual_i − mean(actual)| ] × 100

The root relative squared error is calculated analogously, using squared errors:

RRSE = sqrt[ Σ_{i=1..n} (estimate_i − actual_i)² / Σ_{i=1..n} (actual_i − mean(actual))² ] × 100

where estimate_i is the output predicted by the model for observation i, actual_i is the actual output for observation i, and n is the number of observations.

Correlation analysis studies the variation of two or more variables in order to determine the amount of correlation between them. In order to analyze the relationship between predicted and actual effort, we use Pearson's coefficient of correlation.
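Assuming the standard definitions of RAE and RRSE given above, the three evaluation measures can be computed as follows; the effort values here are hypothetical.

```python
import numpy as np

def rae(actual, estimate):
    """Relative absolute error (%): total absolute error of the model
    relative to that of always predicting the mean actual value."""
    actual, estimate = np.asarray(actual, float), np.asarray(estimate, float)
    return 100.0 * (np.abs(estimate - actual).sum()
                    / np.abs(actual - actual.mean()).sum())

def rrse(actual, estimate):
    """Root relative squared error (%): like RAE, but with squared errors."""
    actual, estimate = np.asarray(actual, float), np.asarray(estimate, float)
    return 100.0 * np.sqrt(((estimate - actual) ** 2).sum()
                           / ((actual - actual.mean()) ** 2).sum())

# Hypothetical actual and predicted efforts (hours) for four projects.
actual = [100, 200, 300, 400]
estimate = [110, 190, 320, 390]
print(round(rae(actual, estimate), 1),    # → 12.5
      round(rrse(actual, estimate), 1))   # → 11.8
print(round(np.corrcoef(actual, estimate)[0, 1], 3))  # Pearson's r → 0.993
```

Lower RAE/RRSE and higher r indicate better predictive performance, which is how Table 3 is read.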

Holdout cross-validation

Holdout cross-validation is the simplest kind of cross-validation. Observations are chosen randomly from the initial sample to form the validation data, and the remaining observations are retained as the training data. Normally, less than a third of the initial sample is used for validation. Thus, the data set is randomly divided into training and validation data sets in the ratio 70:30.

Table 2: Descriptive Statistics

Metric     Mean   Median  Std.Dev.  Min  Max  25%   75%
Syear      89.58  90      2.13      85   93   88    91
App        2.35   2       0.99      1    5    2     3
Dba        1.03   1       0.44      0    4    1     1
Ifc        1.93   2       0.24      1    2    2     2
Source     1.87   2       0.33      1    2    2     2
Telonuse   0.24   0       0.43      0    1    0     0.25
Nlan       2.54   3       1.01      1    4    2     3
T01        3.04   3       0.99      1    5    2     4
T02        3.04   3       0.71      1    5    3     3
T03        3.03   3       0.88      2    5    2     4
T04        3.19   3       0.69      2    5    3     4
T05        3.04   3       0.71      1    5    3     3
T06        2.90   3       0.69      1    4    3     3
T07        3.24   3       0.89      1    5    3     4
T08        3.80   4       0.95      2    5    3     5
T09        4.06   4       0.74      2    5    4     5
T10        3.61   4       0.89      2    5    3     4
T11        3.41   3       0.98      2    5    3     4
T12        3.82   4       0.69      2    5    3.75  4
T13        3.06   3       0.95      1    5    2     4
T14        3.25   3       1.00      1    5    2.75  4
T15        3.33   3       0.74      1    5    3     4
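A minimal sketch of the holdout procedure, assuming a simple random 70:30 partition of project indices (the seed and the rounding-up of the training cut are illustrative choices, not taken from the paper):

```python
import math
import random

def holdout_split(n_projects, train_fraction=0.7, seed=1):
    """Randomly partition project indices into training and validation sets."""
    indices = list(range(n_projects))
    random.Random(seed).shuffle(indices)
    cut = math.ceil(n_projects * train_fraction)
    return indices[:cut], indices[cut:]

# With the 62 usable Maxwell projects, a 70:30 split yields the
# 44 training and 18 validation projects reported in the study.
train_idx, valid_idx = holdout_split(62)
print(len(train_idx), len(valid_idx))  # → 44 18
```

Each model is then fit on the training projects only and scored on the held-out validation projects.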

Results

This section presents the analysis results, following the procedure described in the previous section.

Descriptive Statistics

Table 2 shows the minimum, maximum, mean, median, standard deviation, and 25% and 75% quartiles for all metrics considered in this study.

Model Prediction

In this section, the comparative results of the eight methods chosen in this work are summarized on the Maxwell data set. The criteria of comparison are the performance measures given in Section 3. The subset of attributes was selected using the CFS method described in Section 3.3. App, T02, T03, T07, T08, T09, T10, T11, T13, T14, and Size were selected from the



Table 3: Validation Results of Models Predicted

Method                  Relative        Root relative   Correlation
                        absolute error  squared error   coefficient (r)
Least Median Squared    52.19           56.10           0.87
Linear Regression       41.00           38.16           0.91
Pace Regression         45.64           41.63           0.91
M5P                     41.00           38.16           0.91
REPTree                 40.65           46.67           0.88
SVM                     52.44           49.99           0.904
RBF                     56.86           67.48           0.711
M5Rules                 41.00           38.16           0.91

Figure 1: Validation Results

set of chosen metrics. Table 3 shows the results for each prediction model. The correlation between the predicted effort and the observed effort is represented by the coefficient of correlation (r).

The results shown in Table 3 are obtained by dividing the data set into two parts using the holdout method of cross-validation: 44 data points are used for training and 18 data points for validation of the model. The results show that Linear Regression, M5P, and M5Rules perform better than the other models. Figure 1 shows the relative absolute error and root relative squared error corresponding to each method.

Concluding remarks and future work

This empirical study presents the prediction of effort using eight statistical and machine learning methods. The independent variables were eleven metrics selected by the CFS method. The results are based on the Maxwell data set. The models predicted using all the selected methods were validated using the holdout cross-validation method. The results presented above show that these independent variables appear to be useful in predicting effort. The relative absolute error and root relative squared error of linear regression, M5P, and M5Rules are 41 and 38 percent, respectively. Thus the performance of the M5P and M5Rules methods is competitive with the linear regression method.

More studies of this type must be carried out with large data sets to get an accurate measure of performance outside the development population. We plan to replicate our study on large data sets and industrial object-oriented software systems. We further plan to replicate our study to predict models based on early analysis and design artifacts.

References

[1] Albrecht A.J., and Gaffney J.R. (1983): Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering, 9(6), pp. 639-648.

[2] Aggarwal K.K., Singh Y., Kaur A., Malhotra R. (2008): Empirical Analysis for Investigating the Effect of Object-Oriented Metrics on Fault Proneness: A Replicated Case Study. Forthcoming in Software Process Improvement and Practice, Wiley.

[3] Barnett V., Price T. (1995): Outliers in Statistical Data. John Wiley & Sons.

[4] Boehm B.W. (1981): Software Engineering Economics. Prentice-Hall.

[5] Hair J., Anderson R., Tatham R., Black W. (2006): Multivariate Data Analysis. Pearson Education.

[6] Hall M. (2000): Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning. In: Proceedings of the 17th International Conference on Machine Learning, pp. 359-366.

[7] Krishnamurthy S. and Fisher D. (1995): Machine Learning Approaches to Estimating Software Development Effort. IEEE Transactions on Software Engineering, 21(2), pp. 126-137.

[8] Li Y.F., Xie M., and Goh T.N. (2009): A Study of Mutual Information Based Feature Selection for Case Based Reasoning in Software Cost Estimation. Expert Systems with Applications, 36, pp. 5921-5931.

[9] Promise Software Engineering Repository. http://promisedata.org/repository/



[10] Sentas P., Angelis L., Stamelos I., and Bleris G. (2005): Software Productivity and Effort Prediction with Ordinal Regression. Information and Software Technology, 47, pp. 17-29.

[11] Shepperd M., Schofield C., and Kitchenham B. (1996): Effort Estimation Using Analogy. International Conference on Software Engineering.

[12] Sherrod P. (2003): DTreg Predictive Modeling Software.

[13] Stone M. (1974): Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, 36, pp. 111-147.

[14] Weka. http://www.cs.waikato.ac.nz/ml/weka/
