
A Comparison of Machine Learning Methods for Software Effort Estimation

Aydın Göze Polat, METU Computer Engineering Department, Ankara, Turkey

Sevgi Yiğit, METU Computer Engineering Department, Ankara, Turkey

Abstract— In this study we aimed to draw a broad comparative picture of state-of-the-art machine learning approaches to the software effort estimation problem. For this purpose, several datasets obtained from the Promise data repository were used to test various machine learning techniques. The results showed that decision tree or rule induction based learners (e.g. M5P trees) gave particularly good results for more than one dataset. For certain datasets, however, the best results were achieved by other types of learners such as K*. Meta-classifiers such as Additive Regression, when combined with M5P trees, gave the best results in our tests.

Keywords—effort estimation, machine learning methods, NASA projects, CHINA projects, PROMISE projects, WEKA.

I. INTRODUCTION

In today's competitive software industry, estimation of non-functional properties such as effort, reliability and performance is a critical part of the business. This paper focuses on effort estimation. Effort is the allocated time and resources (i.e. person months) for the development or maintenance of a software product. Effort can be forecast in a quantified form. To prevent project overruns, good effort estimates are critical, because a good effort estimate often converts into a good schedule and cost estimate [1].

There are various top-down and bottom-up strategies for effort estimation. While traditional (and still most commonly used) strategies depend heavily on top-down expert judgment [2], there are also machine learning techniques and other model based alternatives that may incorporate empirical, mathematical, algorithmic or analogy based approaches, which may require a bottom-up analysis [9, 11, 12, 15-17].

A. Expert Judgment vs Model Based Effort Estimation

Although there is strong evidence suggesting that it is flawed, expert judgment is still the dominant method in the industry [3]. Moreover, there is no substantial evidence suggesting that expert judgment is better than model based estimation. In fact, Standish Group surveys report only a 30% chance for a project to be delivered within the expert's estimate [4]. Similarly, T. Capers Jones' research on 50 projects indicates that only 8% were delivered within 10% of the predicted values [5]. To overcome this problem, Jorgensen suggests that researchers focus on improving the techniques that can refine expert judgment [6]. Jorgensen's argument that no single model can optimally answer all effort estimation problems is theoretically sound¹, but this does not necessarily mean that researchers should discard the benefits of model based approaches or ignore the potential of new models. In particular, meta-models used in machine learning can give good estimates by minimizing the number of assumptions made by the modeler and by quickly generating new estimation models that can be fine-tuned to the characteristics of the project.

Comparison between expert judgment and model based estimation reveals that expert judgment is no better than model based judgment when the models incorporate critical organization and domain specific knowledge. Moreover “expert adjusted model estimates”, where experts modify the model based estimates according to criteria not taken into account in the models, seem to be often more accurate than pure expert judgment [2].

Expert judgment is error prone and often overly optimistic. For example, Teigen et al.'s study shows that even when experts say they are at least 90% confident, on average only about 60-70% of their estimates turn out to be correct [1]. Overly optimistic estimates result, in the best case, in tight schedules and extra working hours for the developers, which in turn reduce the maintainability, reliability and overall quality of the product. Without model or tool support, developers and managers have nothing but expert intuition to rely on: they have neither supporting evidence nor a systematic means for rejecting unrealistic budget or schedule proposals. To overcome these problems, researchers either try to improve the estimation techniques that experts use or develop improved computer models for better and more reliable model or tool support.

B. No Free Lunch Theorem

Because of the no free lunch theorem, we know that there will never be an estimation model that performs best on all estimation problems, since estimation problems are intrinsically optimization problems [7].

A plausible scenario is that, as software companies achieve maturity in their domains and model based estimation techniques produce robust and capable models that can incorporate domain specific parameters, “expert adjustment” will replace expert judgment. Therefore, there will always be a need for experts to make domain specific “adjustments” (e.g. calibration of the raw model, domain specific simplifications, organization/event specific changes etc.).

1 See the no free lunch theorem in B.

Well known machine learning techniques can be used for generating such raw models, which can then be “fine tuned” with domain and organization specific knowledge.

II. BACKGROUND

Walston-Felix Model: The model was developed by Walston and Felix from a database of 60 projects at the IBM Federal Systems Division. It takes into account factors such as participation, customer-oriented changes and memory constraints [28], and formulates the relation between effort and delivered lines of source code:

$$\mathrm{Effort} = 5.2 \cdot (\mathrm{LOC})^{0.91} \qquad (1)$$

where LOC is the number of developed lines of source code, including comments.

Bailey-Basili Model: This model builds on the earlier work of Walston and Felix. It expresses the size of a project through measures such as lines of code, executable statements, machine instructions and number of modules, and uses a baseline equation to relate this size to effort [29]. The equation is given in [30] as follows:

$$\mathrm{Effort} = 0.73 \cdot (\mathrm{LOC})^{1.16} + 3.5 \qquad (2)$$

Doty Model: This model was published by Doty in 1977. Doty defines the relationship between effort and delivered lines of source code with the following equation:

$$\mathrm{Effort} = 5.288 \cdot (\mathrm{LOC})^{1.047} \qquad (3)$$
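As an illustration, equations (1)-(3) can be written as three small Python functions. This is only a sketch: the function names are ours, and the unit of the size argument (lines or thousands of lines) is assumed to be whatever unit the published coefficients were calibrated for, which the text above does not state.

```python
def walston_felix(loc: float) -> float:
    """Equation (1): Effort = 5.2 * LOC^0.91."""
    return 5.2 * loc ** 0.91


def bailey_basili(loc: float) -> float:
    """Equation (2): Effort = 0.73 * LOC^1.16 + 3.5."""
    return 0.73 * loc ** 1.16 + 3.5


def doty(loc: float) -> float:
    """Equation (3): Effort = 5.288 * LOC^1.047."""
    return 5.288 * loc ** 1.047


if __name__ == "__main__":
    size = 10.0  # hypothetical project size, in the unit the models assume
    for model in (walston_felix, bailey_basili, doty):
        print(f"{model.__name__}: {model(size):.2f}")
```

All three models share the same power-law shape and differ only in their coefficients, which is why their estimates for the same project can diverge widely.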

Machine Learning Techniques

Artificial neural networks (ANN) are powerful non-linear mathematical data modeling tools. The power of neural networks comes from their ability to learn from experience and from properties such as parallel processing, self-organization, adaptability, fault tolerance and real-time operation [8].

ANNs are inspired by features of the human brain and its learning process. An ANN consists of simple interconnected units called artificial neurons. Each neuron has weighted inputs, a summation function, an activation function and an output. It computes the net input by multiplying the inputs by their weights and summing the products, and then passes the net input through the activation function to generate an output.
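The computation performed by a single artificial neuron can be sketched as follows. The sigmoid activation and the optional bias term are assumptions made for this example; the description above does not fix a particular activation function.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic activation function (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs: Sequence[float],
                  weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Weighted sum of the inputs followed by the activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Summation function: net input = sum of weight * input (+ bias).
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation function turns the net input into the neuron's output.
    return sigmoid(net)
```

A multilayer network repeats this computation layer by layer, feeding the outputs of one layer to the next as inputs.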

The multilayer perceptron (MLP) is a simple, commonly used feed-forward ANN. In effort estimation, MLPs trained with backpropagation are used in several studies [9-13].

Genetic programming (GP), which is inspired by evolution, can be used to find programs that achieve a predefined task. This task can be extremely specific. For example, Burgess et al. use GP to improve effort estimation [14].

Linear regression tries to find a linear relation between a dependent variable and one or more explanatory variables. It is mostly suitable for capturing simple relationships between variables or for estimating the conditional value of the dependent variable. It can serve as a baseline for comparison with other models, especially when those models do not yield good results. Linear regression is used in various studies [15, 16]. There are also other types of regression, such as SMO regression (which uses support vector machines), additive regression (which can wrap any base learner and is used in this study as a meta-model for M5Rules) and least squares regression, which is used in [17].

Finnie et al. point to the potential of case based reasoning (CBR). CBR can take advantage of old problems by reusing their solutions for similar or identical (sub)problems. Moreover, Finnie et al. emphasize that CBR is robust and dynamic enough to readjust itself according to new project data [18, 19].

Researchers may also use instance based classifiers such as K*, where an analogy between instances in the dataset is established via a similarity function. Such analogies prove useful for datasets that consist of similar entries (e.g. in this study K* gives good results for NASA93).
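The idea of estimating by analogy can be sketched with a plain nearest-neighbour estimator. Note that this is not K*, which uses an entropy-based distance; the Euclidean distance, the averaging of the k most similar projects and all names below are assumptions chosen to keep the example short.

```python
import math
from typing import List, Sequence, Tuple

# A past project: (feature vector, known effort). Hypothetical layout.
Project = Tuple[Sequence[float], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_by_analogy(history: List[Project],
                        query: Sequence[float],
                        k: int = 2) -> float:
    """Average the effort of the k past projects most similar to the query."""
    if not history or k < 1:
        raise ValueError("need at least one past project and k >= 1")
    nearest = sorted(history, key=lambda p: euclidean(p[0], query))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)
```

In practice the features would need to be scaled to comparable ranges first, since otherwise the attribute with the largest numeric range dominates the distance.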

Decision trees and rule induction algorithms are also popular methods. They achieve nonlinear behavior through rules that recursively divide the data into finer partitions and then classify it. For example, M5Rules, which is available in Weka, generates a decision list for regression problems using separate-and-conquer [20]. M5Rules is used in this study as well as in other studies such as [15] and [9]. Other rule induction systems can use decision tables and conjunctive rules. These techniques are also used in the literature for effort estimation [21].

Support vector machines (SVM) can divide the data into several sections according to the hyperplane they build from the input space [22]. They use the hyperplane to handle nonlinear relationships (data that is not separable in the input space is separated by the hyperplane). SVMs are used in this study as well as in [15].

Bootstrap aggregating, or bagging, is a meta-technique that reduces the variance of the selected learner by training it on bootstrap samples of the training set, i.e. samples drawn with replacement that may contain repeated entries, and combining the resulting models [23]. Malhotra et al. use bagging in their study [15].
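A minimal sketch of bagging for a regression task is given below. The base learner (a one-variable least squares line), the number of models and the fixed random seed are assumptions made for illustration; any learner could be plugged in instead.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]       # (size, effort), hypothetical layout
Model = Callable[[float], float]   # maps a size to an effort estimate


def fit_line(data: Sequence[Sample]) -> Model:
    """Least squares line through the samples (base learner of this sketch)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    if sxx == 0.0:
        # All sizes identical: fall back to predicting the mean effort.
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def bagging(data: List[Sample], n_models: int = 25, seed: int = 0) -> Model:
    """Train base learners on bootstrap samples and average their outputs."""
    rng = random.Random(seed)
    models: List[Model] = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        bootstrap = [rng.choice(data) for _ in data]
        models.append(fit_line(bootstrap))
    return lambda x: sum(m(x) for m in models) / len(models)
```

Averaging over many bootstrap models smooths out the influence of individual training entries, which is where the variance reduction comes from.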

ABC-algorithm model: The artificial bee colony (ABC) algorithm, one of the swarm intelligence based, bio-inspired optimization algorithms, was proposed by Karaboğa et al. [26]. It mimics the intelligent foraging behavior of honeybees to evolve optimal solutions to problems. The ABC algorithm is reported to outperform other evolutionary algorithms such as the genetic algorithm (GA), particle swarm optimization (PSO), differential evolution (DE) and simulated annealing (SA) in terms of solution quality and computational efficiency [27]. This model estimates the parameters a and b in equation (4) using the ABC algorithm and then calculates the effort.

$$\mathrm{Effort} = a \cdot (\mathrm{LOC})^{b} \qquad (4)$$
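To show what estimating a and b in equation (4) involves, the sketch below fits the two parameters with ordinary least squares on log-transformed data. This is a deliberately simpler stand-in and not the ABC algorithm: taking logarithms turns equation (4) into the linear relation log(Effort) = log(a) + b·log(LOC). The function name is ours.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(loc: Sequence[float],
                  effort: Sequence[float]) -> Tuple[float, float]:
    """Return (a, b) of Effort = a * LOC^b via least squares in log space."""
    if len(loc) != len(effort) or len(loc) < 2:
        raise ValueError("need at least two (LOC, effort) pairs")
    xs = [math.log(v) for v in loc]      # log(LOC)
    ys = [math.log(v) for v in effort]   # log(Effort)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least two distinct LOC values")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

A search method such as ABC can instead minimize an error measure defined directly on the original scale, which in general yields different values of a and b than this log-space fit.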

III. PROPOSED METHODOLOGY

A. Data Collection

The Promise data repository contains an abundance of data for effort estimation [24]. In this study, several datasets are used: NASA, COCOMO81, MAXWELL, COCOMO_NASA2 (NASA93) and CHINA. Each dataset has a different number of attributes. The first dataset, named NASA, is taken from [10] for comparison purposes.

TABLE I. VARIOUS DATASETS USED FOR COMPARISON

| Dataset | NASA | COCOMO81 | MAXWELL | NASA93 | CHINA |
| --- | --- | --- | --- | --- | --- |
| #Entries | 18 | 61 | 62 | 93 | 499 |
| #Attributes | 3 | 18 | 27 | 23 | 19 |

B. Model Comparison Using Weka

The Promise data repository uses the ARFF file format, which can be processed with Weka. Weka is a machine learning tool that incorporates a wide spectrum of machine learning algorithms [20]. We used Weka to compare the error rates of various learners: M5Rules, ANN, K*, Conjunctive Rule, Decision Table, Additive Regression, Support Vector Machine for Regression (SMOreg), Bagging, Linear Regression, SVM, Classification And Regression Trees, Least Squares Regression and Radial Basis Function (RBF).

IV. COMPARISON OF VARIOUS MODELS

In this study, various modeling approaches were applied to benchmark datasets. Table II lists these approaches for the benchmark datasets together with their test results: Correlation Coefficient, Relative Absolute Error (RAE), Root Relative Squared Error (RRSE), Mean Magnitude of Relative Error (MMRE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and PRED. Although several machine learning approaches (almost all applicable supervised learning approaches) were tested for each dataset, only the best and second best models are shown in the table below. (For other results: http://www.metu.edu.tr/~163109/all_results.rar)
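For reference, the error measures named above can be sketched in Python as follows. The formulas are the definitions commonly used in the effort estimation literature and are an assumption here, since the paper does not spell them out; in particular, RAE and RRSE are computed relative to a predictor that always returns the mean of the actual values, and a given tool may define that baseline differently. Relative measures are returned as percentages.

```python
import math
from typing import List, Sequence


def _mre(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Magnitude of relative error per project: |actual - predicted| / actual."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def mmre(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean magnitude of relative error, in percent."""
    values = _mre(actual, predicted)
    return 100.0 * sum(values) / len(values)


def pred(actual: Sequence[float], predicted: Sequence[float],
         level: float = 0.25) -> float:
    """PRED(level): percentage of projects whose MRE is at most `level`."""
    values = _mre(actual, predicted)
    return 100.0 * sum(1 for v in values if v <= level) / len(values)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Relative absolute error versus the mean predictor, in percent."""
    mean_a = sum(actual) / len(actual)
    baseline = sum(abs(a - mean_a) for a in actual)
    return 100.0 * sum(abs(a - p) for a, p in zip(actual, predicted)) / baseline


def rrse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root relative squared error versus the mean predictor, in percent."""
    mean_a = sum(actual) / len(actual)
    baseline = sum((a - mean_a) ** 2 for a in actual)
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 100.0 * math.sqrt(squared / baseline)
```

Under these definitions lower values are better for every measure except PRED, where a higher percentage means that more projects were estimated within the chosen tolerance.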

TABLE II. COMPARISON RESULTS OF VARIOUS MACHINE LEARNING APPROACHES ON BENCHMARK DATASETS

| Article | Dataset | Method | Correlation Coefficient | RAE | RRSE | MMRE | RMSE | MAE | PRED(0.25) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| proposed | NASA18 | ABC | 0.996 | 8 | 10 | 9 | 46508 | 41309 | 100 |
| | | M5Rules | 0.99 | 13 | 16 | 14.47 | 26512 | 44348 | 100 |
| [1] | | ANN | 0.97 | 18 | 34 | 12 | 17.44 | 36373 | 80 |
| | | Halstead | 0.99 | 395 | 594 | 175.65 | 308709 | 194.47 | 20 |
| | | Walston-Felix | 0.99 | 183 | 237 | 155.55 | 123.45 | 90.49 | 0 |
| | | Bailey-Basili | 0.99 | 32 | 48 | 20.29 | 41330 | 15.98 | 80 |
| | | Doty | 0.99 | 425 | 575 | 302.5 | 299.47 | 209.49 | 0 |
| [2] | COCOMO81 | Halstead | - | - | 887.78 | 26575 | - | - | - |
| | | Walston-Felix | - | - | 83584 | 1880.9 | - | - | - |
| | | Bailey-Basili | - | - | 60.45 | 1691.6 | - | - | - |
| | | Halstead | - | - | 125.75 | 1382.1 | - | - | - |
| proposed | Maxwell62-- | M5Rules | 0.95 | 28.55 | 32.42 | 58.22 | 2739.5 | 1857.07 | 42.85 |
| | | K* | 0.83 | 55.83 | 56.74 | 111.88 | 4794.68 | 3631.8 | 33.33 |
| | NASA93-- | K* | 0.82 | 25.72 | 53.11 | 80.11 | 322.02 | 129.76 | 56.25 |
| | | M5Rules | 0.8 | 34.56 | 56.76 | 168.75 | 344.16 | 174.38 | 46.87 |
| [3] | NASA93 | Conjunctive Rule | - | - | - | 1246.63 | 695.31 | - | - |
| | | Decision Table | - | - | - | 1127.37 | 536.26 | - | - |
| | | M5Rules | - | - | - | 801.09 | 377.35 | - | - |
| | | Halstead | - | - | - | 18963 | 6814 | - | - |
| | | Walston-Felix | - | - | - | 1244.3 | 583 | - | - |
| | | Bailey-Basili | - | - | - | 1097.2 | 472.2 | - | - |
| | | Doty | - | - | - | 954.37 | 416.99 | - | - |
| proposed | CHINA-- | AdditiveReg (M5Rules) | 0.98 | 13.69 | 19.91 | 16.83 | 777.81 | 415559 | 80.59 |
| | | AdditiveReg (SMOreg) | 0.93 | 16.32 | 38.33 | 41502 | 1497.58 | 495.14 | 86.47 |
| [4] | | M5Rules | - | - | - | 41442 | - | - | 52 |
| | | Bagging | - | - | - | 74.23 | - | - | 34.66 |
| | | Linear Regression | - | - | - | 17.97 | - | - | 36 |
| | | SVM | - | - | - | 25.63 | - | - | 38.66 |
| | | ANN | - | - | - | 143.79 | - | - | 11.33 |
| [5] | CHINA | ANN | - | - | - | 90 | - | - | 22 |
| | | Classification And Regression Trees | - | - | - | 77 | - | - | 26 |
| | | Least Square Regression | - | - | - | 72 | - | - | 33 |
| | | Adjusted Analogy Based Estimate (Euclidean Distance) | - | - | - | 38 | - | - | 57 |
| | | Adjusted Analogy Based Estimate (Minkowski Distance) | - | - | - | 43 | - | - | 61 |
| [6] | | Augmented COCOMO | - | - | - | 65 | - | - | 31.67 (Pred(20)) |
| [6] | | Parsimonious COCOMO | - | - | - | 64 | - | - | 30.4 |
| [7] | | Clustering | - | - | - | 1.03 | - | - | 35.6 (Pred(30)) |
| [8] | | Regressive | - | - | - | 62.3 | - | - | - |
| | | ANN | - | - | - | 35.2 | - | - | - |
| | | Case Based Reasoning | - | - | - | 36.2 | - | - | - |
| [9] | CHINA | MART (Multiple Additive Regression Trees) | - | - | - | 8.97 | - | - | 88.89 |
| | | RBF (Radial Basis Function) | - | - | - | 19.07 | - | - | 72.22 |
| | | SVR_Linear | - | - | - | 17.4 | - | - | 88.89 |
| | | SVR_RBF | - | - | - | 17.8 | - | - | 83.33 |
| | | Linear Regression | - | - | - | 23.3 | - | - | 72.22 |
| [10] | CHINA | Genetic Programming | - | - | - | 44.55 | - | - | 23.5 |
| | | ANN | - | - | - | 60.63 | - | - | 56 |
| [11] | CHINA | ANN | - | - | - | 17 | - | - | - |
| | | Case Based Reasoning | - | - | - | 42 | - | - | - |
| [12] | CHINA | Linear Regression | - | - | - | 23.3 | - | - | 72.22 |
| | | RBF | - | - | - | 19.07 | - | - | 72.22 |

# A single minus sign (-) means that the error values are not given in the respective article.

## A double minus sign (--) means that some of the attributes were discarded in the preprocessing phase to achieve better cross-validation results.

As Table II shows, ANN has the smallest correlation coefficient on NASA18, while the ABC model performs better on the other error metrics. Since the NASA dataset, which consists of 18 projects, is small, ABC gets better results. ABC uses only the number of kilo-lines of code (KLOC) to build the model, so the resulting model was applied only to NASA18.

In Table II, the Conjunctive Rule, Decision Table, M5Rules, Halstead, Walston-Felix, Bailey-Basili and Doty models are built on the NASA dataset that contains 93 projects. These models are evaluated by MMRE and RMSE. Walston-Felix has the lowest RMSE, but Doty has the lowest MMRE.

In the literature, there are many models that were developed using different approaches on the CHINA dataset. The performance evaluation of these models is generally based on MMRE and PRED. A good estimation model is expected to return high PRED and correlation coefficient values and low error values. Looking at the error rates (MMRE and RMSE in particular are widely used) and the PRED values, one can deduce that the Additive Regression meta-classifier combined with M5Rules or SMOreg gives relatively good results that are close to the best results, which are obtained with MART. Moreover, the type of dataset affects which model gives the best results. For example, for NASA93, K* gives much better results than the other models.

V. DISCUSSION

Because of the no free lunch theorem, as long as there is new data, there will always be a design issue for the models that we seek to optimize for effort estimation problems. The issue arises from the fact that, when designing a new model for effort estimation, the modeler needs to make certain assumptions. For example, when designing an ANN model, it is generally the modeler's task to choose the number of nodes and hidden layers according to the problem. For instance, Sing et al.'s design for a multilayer perceptron with backpropagation uses 3 input nodes, one hidden layer with 5 nodes and 1 output node [25]. A comparison between the relative errors of several models may be considered an indicator of the limitations or potential of a model. However, unless there are severe theoretical or practical limitations in the framework, bad results do not necessarily mean that the framework chosen by the modeler is the wrong one, since bad results can easily be obtained with bad design choices. Therefore, a modeler can never be too careful when making assumptions and design choices. One way to partly overcome this problem is to minimize the number of such assumptions via another layer (e.g. a meta-modeling layer such as additive regression) that intrinsically searches for optimal or near-optimal choices and reduces the chance of an erroneous design.

As can be seen from the comparison table, our best two results on the CHINA dataset used additive regression as a meta-layer. This allowed us to leave the parameters of M5Rules and SMOreg (support vector machine for regression) untouched, since additive regression would enhance their performance without our intervention². Since the modeler made a minimal number of choices, the results were reliable. Also note that multiple additive regression trees (MART) was the only model that gave better MMRE and PRED values. MART simply adds one more layer to additive regression trees to achieve even better results through greater expressive power, still with minimal modeler intervention.
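The principle behind additive regression as a meta-layer can be sketched as follows: each stage fits a base learner to the residuals left by the previous stages, and the final estimate is the sum of all stages. This is an illustrative sketch and not Weka's implementation; the one-split regression stump used as base learner (in place of M5Rules or SMOreg), the starting point at the mean, the shrinkage value and the number of rounds are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Model = Callable[[float], float]


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Model:
    """One-split regression stump that minimises the squared error."""
    mean_all = sum(targets) / len(targets)
    best_sse = float("inf")
    best: Optional[Tuple[float, float, float]] = None
    cut_points = sorted(set(xs))
    for low, high in zip(cut_points, cut_points[1:]):
        threshold = (low + high) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best_sse:
            best_sse, best = sse, (threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: predict the overall mean.
        return lambda x: mean_all
    threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def additive_regression(xs: Sequence[float], ys: Sequence[float],
                        n_rounds: int = 50, shrinkage: float = 0.5) -> Model:
    """Stage-wise fit: every round models the residuals of the rounds before."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stages: List[Model] = []
    for _ in range(n_rounds):
        stage = fit_stump(xs, residuals)
        stages.append(stage)
        # Remove the (shrunken) contribution of this stage from the residuals.
        residuals = [r - shrinkage * stage(x) for x, r in zip(xs, residuals)]
    return lambda x: base + shrinkage * sum(s(x) for s in stages)
```

Even with a base learner as weak as a single split, the summed stages approximate the training data far better than one stump alone, which illustrates why the meta-layer can compensate for untuned base learner parameters.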

Our test using K* on NASA93 gave good results. K* makes use of analogies between entries, and NASA93 contains attributes that indicate whether projects were developed in the same NASA center, in the same development mode, for the same system, and so on. This may explain the increased success of K*.

VI. CONCLUSION

In this study, we first discussed why researchers should be interested in machine learning techniques and then compared several machine learning alternatives on well known datasets from the Promise repository. The aim was to obtain a broad comparative picture of the most successful³ machine learning approaches for effort estimation on datasets that describe various properties of projects at various levels of detail. M5P decision trees, K* and SMOreg generally gave good results in our tests. Other good results from the literature were obtained with MART, clustering, RBF and SVR_Linear. The ABC model achieved the best results on the NASA18 dataset. Because its effort calculation formula is insufficient for the other datasets, the ABC model was not applied to them. Applying the ABC model to other datasets with an improved formula is left as future work.

ACKNOWLEDGMENT

We sincerely appreciate the guidance and encouragement that came from our professor Ali Doğru.

REFERENCES

[1] M. Jørgensen, K.H. Teigen, K. Ribu. Better sure than safe? Over-confidence in judgement based software development effort prediction intervals in Journal of Systems and Software 70 (1–2), Feb 2004, pp 79–93

[2] M. Jørgensen. A Review of Studies on Expert Estimation of Software Development Effort in Journal of Systems and Software 70(1-2), 2004, pp 37--60

[3] C. Jarabek, Expert Judgement in Software Effort Estimation

[4] J. Johnson, My Life Is Failure, Standish Group Int’l, 2006.

[5] T.C. Jones, Estimating Software Costs, McGraw-Hill, 1998

2 Additive regression was also tested on the previous datasets; however, it never returned better results, which may be considered supporting evidence that our designs were already giving satisfactory results.

3 Since we have only selected models that were first tested with cross-validation, our comparison table does not contain entries that hide low cross-validation results.

[6] M. Jørgensen and B. Boehm, Software Development Effort Estimation: Formal Models or Expert Judgment? in IEEE Software, 2008

[7] D.H. Wolpert, W.G. Macready, No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation 1, 67, 1997, available at http://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Boston, 2000.

[9] N.H. Chiu and S.J. Huang, The adjusted analogy-based software effort estimation based on similarity distances, The Journal of Systems and Software, vol. 80, pp. 628-640, 2007

[10] J. Kaur, S. Singh, and K. Singh Kahlon, Comparative Analysis of the Software Effort Estimation Models in World Academy of Science, Engineering and Technology 22, 2008

[11] G.R. Finnie and G.E. Wittig, A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case-Based Reasoning and Regression Models, Journal of Systems and Software, vol. 39, pp. 281-289, 1997

[12] C.J. Burgess and M. Lefley, Can genetic programming improve software effort estimation? A comparative evaluation, Information and Software Technology, vol. 43, pp. 863-873, 2001.

[13] G.R. Finnie and G.E. Wittig, AI Tools for Software Development Effort Estimation, in Proc. SEEP '96, 1996, International Conference on Software Engineering: Education and Practice (SE:EP '96).

[14] C.J. Burgess and M. Lefley, Can genetic programming improve software effort estimation? A comparative evaluation, Information and Software Technology, vol. 43, pp. 863-873, 2001

[15] R. Malhotra, A. Jain, Software Effort Prediction using Statistical and Machine Learning Methods in (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 1, January 2011

[16] M.O. Elish, Improved estimation of software project effort using multiple additive regression trees, Expert Systems with Applications, vol. 36, pp. 10774-10778, 2009.

[17] N.H. Chiu and S.J. Huang, The adjusted analogy-based software effort estimation based on similarity distances, The Journal of Systems and Software, vol. 80, pp. 628-640, 2007.

[18] G.R. Finnie and G.E. Wittig, A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case-Based Reasoning and Regression Models, Journal of Systems and Software, vol. 39, pp. 281-289, 1997

[19] G.R. Finnie and G.E. Wittig, AI Tools for Software Development Effort Estimation, in Proc. SEEP '96, 1996, International Conference on Software Engineering: Education and Practice (SE:EP '96).

[20] Weka. Available: http://www.cs.waikato.ac.nz/ml/weka/

[21] H. Duggal, P. Singh, Study the Performance of M5-Rules Algorithm and Decision Table Majority Classifier for Modeling of Effort Estimation of Software Projects

[22] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy, Improvements to the SMO Algorithm for SVM Regression, IEEE Transactions on Neural Networks, vol. 13, March 2001.

[23] L. Breiman, Bagging predictors, Machine Learning, vol. 24, pp. 123-140, Aug. 1996

[24] D. of USA, “Parametric cost estimating handbook, second edition,” 1999. J. Sayyad Shirabad and T. Menzies, “The PROMISE Repository of Software Engineering Databases.” School of Information Technology and Engineering, University of Ottawa, Canada, 2005. Available from http://promise.site.uottawa.ca/SERepository.

[25] J. Sing, B. Sahoo, Application of Artificial Neural Network for Procedure and Object Oriented Software Effort Estimation

[26] D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: artificial bee colony algorithm, Journal of Global Optimization, Vol. 39, pp. 459-471, 2007.

[27] D. Karaboga, B. Basturk, On the performance of artificial bee colony (ABC) algorithm, Applied Soft Computing, Vol. 8, pp. 687-697, 2008.

[28] D.K. Srivastava, D.S. Chauhan and R. Singh, VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System in International Journal of Software Engineering Vol. 5, pp 27-46, 2012

[29] S. A. Abbas, X. Liao, A. Rehman, A. Azam, M.I, Cost Estimation: A Survey of Well-known Historic Cost Estimation Techniques, in Journal of Emerging Trends in Computing and Information Sciences, Vol 4, 1, pp. 612-636, 2012

[30] J. J. Bailey and V. R. Basili, A meta-model for software development resource expenditures, 1981
