A Comparison of Machine Learning Methods for
Software Effort Estimation
Aydın Göze Polat, METU Computer Engineering
Department, Ankara, Turkey
Sevgi Yiğit, METU Computer Engineering
Department, Ankara, Turkey
Abstract— In this study we aimed to draw a big comparative
picture of state-of-the-art machine learning approaches to
the software effort estimation problem. For this purpose, several
datasets obtained from the PROMISE data repository were
used to test various machine learning techniques. The results
showed that decision tree or rule induction based classifiers (e.g.
M5P trees) gave particularly good results for more than one
dataset. For certain datasets, however, the best results were
achieved by other types of classifiers such as K*. Meta-classifiers
such as Additive Regression, when combined with M5P trees,
gave the best results in our tests.
Keywords—effort estimation, machine learning methods, NASA
projects, CHINA projects, PROMISE projects, WEKA.
I. INTRODUCTION
In today's competitive software industry, estimation of non-functional properties such as effort, reliability and performance is a critical part of the business. This paper focuses on effort estimation. Effort is the allocated time and resources (i.e. person months) for the development or maintenance of a software product. Effort can be forecast in a quantified form. To prevent project overruns, good effort estimates are critical, because a good effort estimate often converts into a good schedule and cost estimate [1].
There are various top-down and bottom-up strategies for effort
estimation. While traditional (and still most commonly used)
strategies depend heavily on top-down expert judgment [2],
there are also machine learning techniques and other
model based alternatives that may incorporate empirical,
mathematical, algorithmic or analogy based approaches, which
may require a bottom-up analysis [9, 11, 12, 15-17].
A. Expert Judgment vs Model Based Effort Estimation
Although there is strong evidence that it is flawed, expert judgment is still the dominant method in the industry [3]. Moreover, there is no substantial evidence that expert judgment is better than model based estimation. In fact, Standish Group surveys indicate only a 30% chance that a project is delivered within the expert's estimate [4], and T. Capers Jones' research on 50 projects indicates that only 8% were delivered within 10% of the predictions [5]. To overcome this problem, Jørgensen suggests that researchers focus on improving the techniques that can refine expert judgment [6]. Although Jørgensen's argument that no single model can optimally answer all effort estimation problems is theoretically sound¹, this does not necessarily mean that researchers should discard the benefits of model based approaches or ignore the potential of new models. In particular, meta-models used in machine learning can give good estimates by minimizing the number of assumptions made by the modeler and by quickly generating new estimation models that can be fine-tuned to the characteristics of the project.
Comparison between expert judgment and model based estimation reveals that expert judgment is no better than model based judgment when the models incorporate critical organization and domain specific knowledge. Moreover “expert adjusted model estimates”, where experts modify the model based estimates according to criteria not taken into account in the models, seem to be often more accurate than pure expert judgment [2].
Expert judgment is error prone and often overly optimistic.
For example, in Teigen et al.'s study it turns out that even when
experts say they are at least 90% confident, on average only
about 60-70% of their estimates turn out to be correct [1]. Overly
optimistic estimates, in the best scenario, result in tight
schedules and extra working hours for the developers, which
in turn reduce the maintainability, reliability and overall
quality of the product. Without model or tool support, aside
from expert intuition, neither developers nor managers have
supporting evidence or a systematic means for rejecting
unrealistic budget or schedule propositions. To overcome
these problems, researchers either try to improve the
estimation techniques that experts use or develop
improved computer models for better and more reliable model
or tool support.
B. No Free Lunch Theorem
Because of the no free lunch theorem, we know that there will
never be an estimation model that can achieve perfect results
for all estimation problems, since estimation problems are
intrinsically optimization problems [7].
A plausible scenario is that as software companies achieve
maturity in their domains and model based estimation
techniques produce robust and capable models that can
incorporate domain specific parameters, “expert adjustment”
will replace expert judgment. Therefore, there will always be
a need for experts to make domain specific “adjustments” (i.e.
calibration of the raw model, domain specific simplifications,
organization/event specific changes etc.).
¹ See the no free lunch theorem in Section I-B.
Well known machine learning techniques can be used for
generating such raw models that can be “fine tuned” with
domain and organization specific knowledge.
II. BACKGROUND
Walston-Felix Model: This model was developed by Walston
and Felix from a database of 60 projects at the IBM
Federal Systems Division. It takes into account factors such as participation,
customer-oriented changes and memory constraints [28] and
formulates the relation between effort and delivered lines of
source code as:
$$\mathrm{Effort} = 5.2 \cdot (\mathrm{LOC})^{0.91} \qquad (1)$$
where LOC is the number of developed lines of source code, including comments.
Bailey-Basili Model: This model builds on the early work of Walston and Felix. It expresses the size of a project through measures such as lines of code, executable statements, machine instructions and number of modules, and uses a baseline equation to relate this size to effort [29]. The equation is given in [30] as follows:
$$\mathrm{Effort} = 0.73 \cdot (\mathrm{LOC})^{1.16} + 3.5 \qquad (2)$$
Doty Model: This model was published by Doty in 1977. Doty defines the relationship between effort and delivered lines of source code as in the following equation:
$$\mathrm{Effort} = 5.288 \cdot (\mathrm{LOC})^{1.047} \qquad (3)$$
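As a minimal sketch, equations (1) to (3) can be evaluated directly. The function names and the sample size value below are illustrative assumptions. The text writes the size term as LOC, while these classic models are often quoted with size in thousands of lines, so the unit of the input should be checked against the original sources before the numbers are interpreted.

```python
def walston_felix(size: float) -> float:
    """Equation (1): Effort = 5.2 * size^0.91."""
    return 5.2 * size ** 0.91


def bailey_basili(size: float) -> float:
    """Equation (2): Effort = 0.73 * size^1.16 + 3.5."""
    return 0.73 * size ** 1.16 + 3.5


def doty(size: float) -> float:
    """Equation (3): Effort = 5.288 * size^1.047."""
    return 5.288 * size ** 1.047


if __name__ == "__main__":
    size = 10.0  # hypothetical project size; unit is an assumption (see lead-in)
    for model in (walston_felix, bailey_basili, doty):
        print(f"{model.__name__}: {model(size):.2f}")
```

All three models share the same power-law shape and differ only in their constants, which is why they can be compared on the same size input.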
Machine Learning Techniques
Artificial neural networks (ANN) are powerful non-linear mathematical data modeling tools. The power of neural networks comes from their ability to learn from experience and from their parallel processing, self-organization, adaptability, fault tolerance and real-time operation properties [8].
ANNs are inspired by the features of the human brain and its learning process. An ANN consists of simple interconnected units called artificial neurons. Each neuron has weighted inputs, a summation function, an activation function and an output. It computes the net input by multiplying the inputs by their weights and summing the products, and then applies the activation function to the net input to generate an output.
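The computation of a single artificial neuron can be sketched as follows. The logistic (sigmoid) activation and the bias term are assumptions made for illustration; the text does not name a specific activation function.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by an activation function."""
    # Summation function: net input = sum of weight * input (plus an assumed bias).
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation function: logistic sigmoid, chosen here only as an example.
    return 1.0 / (1.0 + math.exp(-net))


if __name__ == "__main__":
    print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```

A multilayer network chains many such units, feeding the outputs of one layer into the inputs of the next.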
Multilayer perceptron (MLP) is a simple feed forward ANN that is commonly used. In effort estimation, MLP with backpropagation is used in several studies [9-13].
Genetic programming (GP), which is inspired by evolution, can be used to find programs that achieve a predefined task. This predefined task can be extremely specific. For
example, Burgess et al. use GP to improve effort estimation [14].
Linear regression tries to find a linear relation between a dependent variable and one or more explanatory variables. It is mostly suitable for capturing simple relationships between variables or estimating the conditional value of the dependent variable. It can serve as a baseline for comparison with other models, especially when they do not yield good results. Linear regression is used in various studies [15, 16]. There are also other types of regression such as SMO regression (which uses support vector machines), additive regression (which can wrap any base classifier and is used in this study as a meta-model for M5Rules) and least squares regression, which is used in [17].
Finnie et al. point to the potential of case based reasoning (CBR). CBR takes advantage of previously solved problems by reusing their solutions for the same or similar (sub)problems. Moreover, Finnie et al. emphasize that CBR is robust and dynamic enough to readjust itself according to new project data [18, 19].
Researchers may also use instance based classifiers such as K*, where an analogy between instances in the dataset is established via a similarity function. Such analogies prove useful for datasets that consist of similar entries (e.g. in this study K* gives good results for NASA93).
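The idea of estimating by analogy can be sketched with a plain nearest-neighbor estimator. This is not K* itself: K* uses an entropy-based distance, whereas the sketch below substitutes Euclidean distance, and the function name, the parameter k and the data are hypothetical.

```python
import math


def analogy_estimate(history, query, k=2):
    """Mean effort of the k past projects most similar to the query.

    history: list of (feature_vector, effort) pairs from finished projects.
    query:   feature vector of the project to estimate.
    """
    # Rank past projects by Euclidean distance to the query (a stand-in
    # for the similarity function of a real instance based learner).
    ranked = sorted(history, key=lambda item: math.dist(item[0], query))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


if __name__ == "__main__":
    past = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]
    print(analogy_estimate(past, (1.5,), k=2))
```

Such an estimator works well only when the dataset actually contains projects similar to the one being estimated, which matches the observation about NASA93.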
Decision trees and rule induction algorithms are also popular methods. They achieve nonlinear behavior through rules that recursively divide the data into finer partitions and predict a value for each partition. For example, M5Rules, which is available in Weka, generates a decision list for regression problems using separate-and-conquer [20]. M5Rules is used in this study as well as in other studies such as [15] and [9]. Other rule induction systems can use decision tables and conjunctive rules. These techniques are also used in the literature for effort estimation [21].
Support vector machines (SVM) build a hyperplane in a
feature space derived from the input space [22]. Nonlinear
relationships are handled by mapping the inputs, through a
kernel function, into a higher-dimensional space in which data
that are not linearly separable in the input space can be
separated by a hyperplane. SVMs are used in this study as well
as in [15].
Bootstrap aggregating, or bagging, is a meta-technique that
reduces the variance of the selected classifier. It trains the
classifier on several bootstrap samples, which are drawn from
the training set with replacement and may therefore contain
repeated instances, and aggregates the resulting predictions
[23]. Malhotra et al. use bagging in their study [15].
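A minimal sketch of bagging for regression is given below. The base learner is a deliberately trivial one that predicts the mean effort of its sample; it is a hypothetical placeholder for whatever learner is being bagged, and the function names and default parameters are assumptions.

```python
import random
from statistics import mean


def fit_mean_model(sample):
    """Toy base learner: always predicts the mean effort of its sample."""
    avg = mean(effort for _, effort in sample)
    return lambda features: avg


def bagging_fit(data, fit=fit_mean_model, n_models=25, seed=0):
    """Train one model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement,
        # so some instances repeat and others are left out.
        sample = [rng.choice(data) for _ in data]
        models.append(fit(sample))
    return models


def bagging_predict(models, features):
    """Aggregate by averaging the predictions of all models."""
    return mean(model(features) for model in models)


if __name__ == "__main__":
    data = [((1,), 4.0), ((2,), 6.0), ((3,), 8.0)]
    models = bagging_fit(data)
    print(bagging_predict(models, (2,)))
```

Averaging over models trained on different resamples is what smooths out the sensitivity of a single model to the particular training set.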
ABC-algorithm model: The artificial bee colony (ABC) algorithm, one of the swarm intelligence based bio-inspired optimization algorithms, was proposed by Karaboğa et al. [26]. It mimics the intelligent foraging behavior of honeybees to evolve optimal solutions to problems. The ABC algorithm is reported to outperform other evolutionary algorithms such as the genetic algorithm (GA), particle swarm optimization (PSO), differential evolution (DE) and simulated annealing (SA) in terms of solution quality and computational efficiency [27]. This model estimates the parameters a and b in equation (4) using the ABC algorithm and then calculates the effort.
$$\mathrm{Effort} = a \cdot (\mathrm{LOC})^{b} \qquad (4)$$
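To illustrate what estimating a and b in equation (4) involves, the sketch below fits them by ordinary least squares on the log-transformed data. This is a swapped-in technique chosen for brevity, not the ABC algorithm, and the function name and sample data are hypothetical.

```python
import math


def fit_power_law(sizes, efforts):
    """Fit Effort = a * size^b by least squares in log-log space.

    Taking logs gives log(Effort) = log(a) + b * log(size), a straight line.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the fitted line
    a = math.exp(y_bar - b * x_bar)     # intercept, transformed back
    return a, b


if __name__ == "__main__":
    sizes = [1.0, 2.0, 4.0, 8.0]
    efforts = [3.0 * s ** 1.1 for s in sizes]  # synthetic data
    print(fit_power_law(sizes, efforts))
```

A search-based optimizer such as ABC can minimize an error measure on the original scale directly, whereas this closed-form fit minimizes squared error on the log scale, so the two can yield different parameters on noisy data.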
III. PROPOSED METHODOLOGY
A. Data Collection
The PROMISE data repository contains an abundance of data for effort estimation [24]. In this study, several datasets are used: NASA, COCOMO81, MAXWELL, COCOMO_NASA2 (NASA93) and CHINA. Each dataset has a different number of attributes. The first dataset, named NASA, is taken from [10] for comparison purposes.
TABLE I. VARIOUS DATASETS USED FOR COMPARISON
Dataset       NASA   COCOMO81   MAXWELL   NASA93   CHINA
#Entries      18     61         62        93       499
#Attributes   3      18         27        23       19
B. Model Comparison Using Weka
The PROMISE data repository uses the ARFF file format, which can be processed with Weka. Weka is a machine learning tool that incorporates a wide spectrum of machine learning algorithms [20]. We used Weka to compare the error rates of various classifiers: M5Rules, ANN, K*, Conjunctive Rule, Decision Table, Additive Regression, Support Vector Machine for Regression (SMOreg), Bagging, Linear Regression, SVM, Classification And Regression Trees, Least Square Regression and Radial Basis Function (RBF).
IV. COMPARISON OF VARIOUS MODELS
In this study, various modeling approaches were applied to
benchmark datasets. Table II lists these approaches for
several benchmark datasets together with their test results:
Correlation Coefficient, Relative Absolute Error (RAE), Root
Relative Squared Error (RRSE), Mean Magnitude of Relative
Error (MMRE), Root Mean Square Error (RMSE), Mean
Absolute Error (MAE) and PRED. Although several machine
learning approaches (almost all applicable supervised learning
approaches) were tested for each dataset, only the best
and second best models are shown in the table below. (For
other results: http://www.metu.edu.tr/~163109/all_results.rar)
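The error measures named above can be computed as in the following sketch. It assumes the usual definitions: the magnitude of relative error of a project is |actual - predicted| / actual, MMRE is its mean, and PRED(0.25) is the share of projects whose relative error is at most 0.25. Reporting MMRE, PRED, RAE and RRSE as percentages and using the mean of the supplied actual values as the baseline for RAE and RRSE are simplifying assumptions; a tool such as Weka may compute the baseline from the training data instead.

```python
import math


def effort_metrics(actual, predicted, level=0.25):
    """Return common effort estimation error measures as a dict."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mre = [abs(e) / a for e, a in zip(errors, actual)]
    mean_actual = sum(actual) / n
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MMRE": 100.0 * sum(mre) / n,
        "PRED": 100.0 * sum(1 for m in mre if m <= level) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        # Errors relative to always predicting the mean actual effort.
        "RAE": 100.0 * sum(abs(e) for e in errors) / abs_dev,
        "RRSE": 100.0 * math.sqrt(sum(e * e for e in errors) / sq_dev),
    }


if __name__ == "__main__":
    actual = [100.0, 200.0, 400.0, 50.0]       # hypothetical efforts
    predicted = [110.0, 160.0, 400.0, 100.0]   # hypothetical estimates
    for name, value in effort_metrics(actual, predicted).items():
        print(f"{name}: {value:.2f}")
```

Lower values are better for every measure except PRED, where a higher percentage means more estimates fall within the tolerance.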
TABLE II. COMPARISON RESULTS OF VARIOUS MACHINE LEARNING APPROACHES ON BENCHMARK DATASETS
Article  | Dataset      | Method                                               | Corr. Coef. | RAE   | RRSE   | MMRE    | RMSE    | MAE     | PRED(0.25)
proposed | NASA18       | ABC                                                  | 0.996       | 8     | 10     | 9       | 46508   | 41309   | 100
         |              | M5Rules                                              | 0.99        | 13    | 16     | 14.47   | 26512   | 44348   | 100
[1]      |              | ANN                                                  | 0.97        | 18    | 34     | 12      | 17.44   | 36373   | 80
         |              | Halstead                                             | 0.99        | 395   | 594    | 175.65  | 308709  | 194.47  | 20
         |              | Walston-Felix                                        | 0.99        | 183   | 237    | 155.55  | 123.45  | 90.49   | 0
         |              | Bailey-Basili                                        | 0.99        | 32    | 48     | 20.29   | 41330   | 15.98   | 80
         |              | Doty                                                 | 0.99        | 425   | 575    | 302.5   | 299.47  | 209.49  | 0
[2]      | COCOMO81     | Halstead                                             | -           | -     | 887.78 | 26575   | -       | -       | -
         |              | Walston-Felix                                        | -           | -     | 83584  | 1880.9  | -       | -       | -
         |              | Bailey-Basili                                        | -           | -     | 60.45  | 1691.6  | -       | -       | -
         |              | Halstead                                             | -           | -     | 125.75 | 1382.1  | -       | -       | -
proposed | Maxwell62 -- | M5Rules                                              | 0.95        | 28.55 | 32.42  | 58.22   | 2739.5  | 1857.07 | 42.85
         |              | K*                                                   | 0.83        | 55.83 | 56.74  | 111.88  | 4794.68 | 3631.8  | 33.33
         | NASA93 --    | K*                                                   | 0.82        | 25.72 | 53.11  | 80.11   | 322.02  | 129.76  | 56.25
         |              | M5Rules                                              | 0.8         | 34.56 | 56.76  | 168.75  | 344.16  | 174.38  | 46.87
[3]      | NASA93       | Conjunctive Rule                                     | -           | -     | -      | 1246.63 | 695.31  | -       | -
         |              | Decision Table                                       | -           | -     | -      | 1127.37 | 536.26  | -       | -
         |              | M5Rules                                              | -           | -     | -      | 801.09  | 377.35  | -       | -
         |              | Halstead                                             | -           | -     | -      | 18963   | 6814    | -       |
         |              | Walston-Felix                                        | -           | -     | -      | 1244.3  | 583     | -       | -
         |              | Bailey-Basili                                        | -           | -     | -      | 1097.2  | 472.2   | -       | -
         |              | Doty                                                 | -           | -     | -      | 954.37  | 416.99  | -       | -
proposed | CHINA --     | AdditiveReg (M5Rules)                                | 0.98        | 13.69 | 19.91  | 16.83   | 777.81  | 415559  | 80.59
         |              | AdditiveReg (SMOreg)                                 | 0.93        | 16.32 | 38.33  | 41502   | 1497.58 | 495.14  | 86.47
[4]      |              | M5Rules                                              | -           | -     | -      | 41442   | -       | -       | 52
         |              | Bagging                                              | -           | -     | -      | 74.23   | -       | -       | 34.66
         |              | Linear Regression                                    | -           | -     | -      | 17.97   | -       | -       | 36
         |              | SVM                                                  | -           | -     | -      | 25.63   | -       | -       | 38.66
         |              | ANN                                                  | -           | -     | -      | 143.79  | -       | -       | 11.33
[5]      | CHINA        | ANN                                                  | -           | -     | -      | 90      | -       | -       | 22
         |              | Classification And Regression Trees                  | -           | -     | -      | 77      | -       | -       | 26
         |              | Least Square Regression                              | -           | -     | -      | 72      | -       | -       | 33
         |              | Adjusted Analogy Based Estimate (Euclidean Distance) | -           | -     | -      | 38      | -       | -       | 57
         |              | Adjusted Analogy Based Estimate (Minkowski Distance) | -           | -     | -      | 43      | -       | -       | 61
[6]      |              | Augmented COCOMO                                     | -           | -     | -      | 65      | -       | -       | Pred(20) 31.67
[6]      |              | Parsimonious COCOMO                                  | -           | -     | -      | 64      | -       | -       | 30.4
[7]      |              | Clustering                                           | -           | -     | -      | 1.03    | -       | -       | Pred(30) 35.6
[8]      |              | Regressive                                           | -           | -     | -      | 62.3    | -       | -       | -
         |              | ANN                                                  | -           | -     | -      | 35.2    | -       | -       | -
         |              | Case Based Reasoning                                 | -           | -     | -      | 36.2    | -       | -       | -
[9]      | CHINA        | MART (Multiple Additive Regression Trees)            | -           | -     | -      | 8.97    | -       | -       | 88.89
         |              | RBF (Radial Basis Function)                          | -           | -     | -      | 19.07   | -       | -       | 72.22
         |              | SVR_Linear                                           | -           | -     | -      | 17.4    | -       | -       | 88.89
         |              | SVR_RBF                                              | -           | -     | -      | 17.8    | -       | -       | 83.33
         |              | Linear Regression                                    | -           | -     | -      | 23.3    | -       | -       | 72.22
[10]     | CHINA        | Genetic Programming                                  | -           | -     | -      | 44.55   | -       | -       | 23.5
         |              | ANN                                                  | -           | -     | -      | 60.63   | -       | -       | 56
[11]     | CHINA        | ANN                                                  | -           | -     | -      | 17      | -       | -       | -
         |              | Case Based Reasoning                                 | -           | -     | -      | 42      | -       | -       | -
[12]     | CHINA        | Linear Regression                                    | -           | -     | -      | 23.3    | -       | -       | 72.22
         |              | RBF                                                  | -           | -     | -      | 19.07   | -       | -       | 72.22
# A single minus sign (-) means the error values are not given in the respective article.
## A double minus sign (--) means some of the attributes were discarded in the preprocessing phase to achieve better cross-validation results.
As Table II shows, on NASA18 the ANN has the smallest
correlation coefficient, while the ABC model gives better results
on the other error metrics. Since the NASA dataset, which
consists of 18 projects, is small, ABC gets better results. ABC
used only the number of kilo-lines of code (KLOC) to build the
model, so the resulting model was applied only to NASA18.
In Table II, the Conjunctive Rule, Decision Table, M5Rules,
Halstead, Walston-Felix, Bailey-Basili and Doty models are
generated on the NASA dataset that contains 93 projects. These
models are evaluated by MMRE and RMSE. According to the
values in the table, M5Rules has the lowest MMRE and RMSE
among these models, and Doty has the lowest values of both
among the four traditional models.
In the literature, many models have been developed
using different approaches on the CHINA dataset. The
performance evaluation of these models is generally based on
MMRE and PRED. A good estimation model is expected to
return high PRED and correlation coefficient values and low
error values. Looking at the error rates (MMRE and RMSE are
especially widely used) and PRED values, one can deduce that
the Additive Regression meta-classifier combined with M5Rules
or SMOreg gives relatively good results that are close to the
best results, which are obtained by MART. Moreover, the type of
dataset affects which model gives the best results. For example,
for NASA93, K* gives much better results than the other models.
V. DISCUSSION
Because of the no free lunch theorem, as long as there is new data, there will always be a design issue for models that we seek to optimize for effort estimation problems. The issue arises from the fact that when designing a new model for effort estimation, the modeler needs to make certain assumptions. For example, when designing an ANN model, it is generally the modeler's task to decide on the number of nodes and hidden layers according to the problem. For instance, Sing et al.'s design for a multilayer perceptron with backpropagation uses 3 input nodes, one hidden layer with 5 nodes and 1 output node [25]. Although a comparison between the relative errors of several models may be considered an indicator of the limitation or potential of such a model, unless there are severe theoretical (and/or practical) limitations in the framework, bad results do not necessarily mean that the framework chosen by the modeler is the wrong one, since bad results can easily be obtained with bad design choices. Therefore, when making assumptions and design choices, a modeler can never be too careful. A way to partly overcome this problem is to minimize the number of such assumptions via another layer (e.g. a meta-modeling layer such as additive regression) that intrinsically searches for optimal or near optimal choices and reduces the chances of erroneous design.
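The meta-layer idea can be sketched as stage-wise additive regression: each stage fits a base learner to the residuals left by the previous stages, and its contribution is scaled by a shrinkage factor. The base learner below is a one-split regression stump on a single numeric feature, a hypothetical stand-in for learners such as M5Rules or SMOreg; the function names, the number of stages and the shrinkage value are assumptions.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Base learner: best single-threshold split on one numeric feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # all feature values identical: predict a constant
        const = mean(ys)
        return lambda x: const
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean


def additive_regression(xs, ys, n_stages=100, shrinkage=0.5):
    """Fit base learners stage by stage on the remaining residuals."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stages = []
    for _ in range(n_stages):
        learner = fit_stump(xs, residuals)
        stages.append(learner)
        # Remove the (shrunken) part of the residual this stage explains.
        residuals = [r - shrinkage * learner(x) for x, r in zip(xs, residuals)]
    return lambda x: base + shrinkage * sum(stage(x) for stage in stages)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [10.0, 20.0, 30.0, 40.0]  # synthetic data
    model = additive_regression(xs, ys)
    print([round(model(x), 2) for x in xs])
```

Because later stages correct the errors of earlier ones, even a weak base learner with untuned parameters can be combined into a reasonable model, which is the sense in which the meta-layer reduces the number of design choices left to the modeler.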
As can be seen from the comparison table, our best two results on the CHINA dataset used additive regression as a meta-layer. This allowed us to leave the parameters of M5Rules and SMOreg (support vector machine for regression) untuned, since additive regression would enhance their performance without our intervention². Since the modeler had to make only a minimal number of choices, the results were reliable. Also note that multiple additive regression trees (MART) was the only model that gave better MMRE and PRED values. MART simply adds one more layer to additive regression trees to achieve even better results through greater expressive power, still with minimal modeler intervention.
Our test using K* on NASA93 gave good results. K* makes use of analogies between entries, and NASA93 contains attributes that indicate whether projects were developed in the same NASA center, in the same development mode, for the same system and so on. This may explain the increased success of K*.
VI. CONCLUSION
In this study, we first discussed why researchers should
be interested in machine learning techniques and then
compared several machine learning alternatives on well
known datasets from the PROMISE repository to obtain a big
comparative picture of the most successful³ machine learning
approaches for effort estimation on datasets that describe
various properties of the projects in varying detail. M5P
decision trees, K* and SMOreg methods generally gave good
results in our tests. Other good results from the literature were
obtained with MART, clustering, RBF and SVR_Linear. The ABC
model achieved the best results on the NASA18 dataset. Because
its effort calculation formula is insufficient, the ABC model was
not applied to the other datasets. Applying the ABC model to
other datasets with an improved formula is left as future work.
ACKNOWLEDGMENT
We sincerely appreciate the guidance and encouragement that came from our professor Ali Doğru.
² Additive regression was also tested on the previous datasets; however, it never returned better results, which may be considered supporting evidence that our designs were already giving satisfactory results.
³ Since we only selected models that were first tested with cross-validation, our comparison table does not contain entries that hide low cross-validation results.
REFERENCES
[1] M. Jørgensen, K.H. Teigen, K. Ribu, Better sure than safe? Over-confidence in judgement based software development effort prediction intervals, Journal of Systems and Software, vol. 70 (1-2), pp. 79-93, Feb. 2004.
[2] M. Jørgensen, A Review of Studies on Expert Estimation of Software Development Effort, Journal of Systems and Software, vol. 70 (1-2), pp. 37-60, 2004.
[3] C. Jarabek, Expert Judgement in Software Effort Estimation.
[4] J. Johnson, My Life Is Failure, Standish Group Int'l, 2006.
[5] T.C. Jones, Estimating Software Costs, McGraw-Hill, 1998.
[6] M. Jørgensen and B. Boehm, Software Development Effort Estimation: Formal Models or Expert Judgment?, IEEE Software, 2008.
[7] D.H. Wolpert, W.G. Macready, No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, vol. 1, p. 67, 1997. Available: http://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Boston, 2000.
[9] N.H. Chiu and S.J. Huang, The adjusted analogy-based software effort estimation based on similarity distances, The Journal of Systems and Software, vol. 80, pp. 628-640, 2007.
[10] J. Kaur, S. Singh, and K. Singh Kahlon, Comparative Analysis of the Software Effort Estimation Models, World Academy of Science, Engineering and Technology, 22, 2008.
[11] G.R. Finnie and G.E. Wittig, A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case-Based Reasoning and Regression Models, Journal of Systems and Software, vol. 39, pp. 281-289, 1997.
[12] C.J. Burgess and M. Lefley, Can genetic programming improve software effort estimation? A comparative evaluation, Information and Software Technology, vol. 43, pp. 863-873, 2001.
[13] G.R. Finnie and G.E. Wittig, AI Tools for Software Development Effort Estimation, in Proc. SEEP '96, International Conference on Software Engineering: Education and Practice (SE:EP '96), 1996.
[14] C.J. Burgess and M. Lefley, Can genetic programming improve software effort estimation? A comparative evaluation, Information and Software Technology, vol. 43, pp. 863-873, 2001.
[15] R. Malhotra, A. Jain, Software Effort Prediction using Statistical and Machine Learning Methods, (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 2, no. 1, January 2011.
[16] M.O. Elish, Improved estimation of software project effort using multiple additive regression trees, Expert Systems with Applications, vol. 36, pp. 10774-10778, 2009.
[17] N.H. Chiu and S.J. Huang, The adjusted analogy-based software effort estimation based on similarity distances, The Journal of Systems and Software, vol. 80, pp. 628-640, 2007.
[18] G.R. Finnie and G.E. Wittig, A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case-Based Reasoning and Regression Models, Journal of Systems and Software, vol. 39, pp. 281-289, 1997.
[19] G.R. Finnie and G.E. Wittig, AI Tools for Software Development Effort Estimation, in Proc. SEEP '96, International Conference on Software Engineering: Education and Practice (SE:EP '96), 1996.
[20] Weka. Available: http://www.cs.waikato.ac.nz/ml/weka/
[21] H. Duggal, P. Singh, Study the Performance of M5-Rules Algorithm and Decision Table Majority Classifier for Modeling of Effort Estimation of Software Projects.
[22] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy, Improvements to the SMO Algorithm for SVM Regression, IEEE Transactions on Neural Networks, vol. 13, March 2001.
[23] L. Breiman, Bagging predictors, Machine Learning, vol. 24, pp. 123-140, Aug. 1996.
[24] D. of USA, “Parametric cost estimating handbook, second edition,” 1999. J. Sayyad Shirabad and T. Menzies, “The PROMISE Repository of Software Engineering Databases,” School of Information Technology and Engineering, University of Ottawa, Canada, 2005. Available: http://promise.site.uottawa.ca/SERepository
[25] J. Sing, B. Sahoo, Application of Artificial Neural Network for Procedure and Object Oriented Software Effort Estimation.
[26] D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: artificial bee colony algorithm, Journal of Global Optimization, vol. 39, pp. 459-471, 2007.
[27] D. Karaboga, B. Basturk, On the performance of artificial bee colony (ABC) algorithm, Applied Soft Computing, vol. 8, pp. 687-697, 2008.
[28] D.K. Srivastava, D.S. Chauhan and R. Singh, VRS Model: A Model for Estimation of Efforts and Time Duration in Development of IVR Software System, International Journal of Software Engineering, vol. 5, pp. 27-46, 2012.
[29] S. A. Abbas, X. Liao, A. Rehman, A. Azam, M.I, Cost Estimation: A Survey of Well-known Historic Cost Estimation Techniques, Journal of Emerging Trends in Computing and Information Sciences, vol. 4, 1, pp. 612-636, 2012.
[30] J. J. Bailey and V. R. Basili, A meta-model for software development resource expenditures, 1981.