Machine learning approaches for predicting software maintainability: a fuzzy-based transparent model




Published in IET Software
Received on 25th February 2013
Revised on 17th May 2013
Accepted on 9th June 2013
doi: 10.1049/iet-sen.2013.0046
© The Institution of Engineering and Technology 2013

Special Issue on Empirical Studies in Software Engineering

IET Softw., 2013, Vol. 7, Iss. 6, pp. 317–326, doi: 10.1049/iet-sen.2013.0046

ISSN 1751-8806

Machine learning approaches for predicting software maintainability: a fuzzy-based transparent model

Moataz A. Ahmed, Hamdi A. Al-Jamimi

Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

E-mail: [email protected]

Abstract: Software quality is one of the most important factors for assessing the global competitive position of any software company. Thus, quantifying the quality parameters and integrating them into quality models is essential. Many attempts have been made to precisely quantify the software quality parameters using various models such as Boehm's Model, McCall's Model and the ISO/IEC 9126 Quality Model. A major challenge, however, is that effective quality models should consider two types of knowledge: imprecise linguistic knowledge from the experts and precise numerical knowledge from historical data. Incorporating the experts' knowledge poses a constraint on the quality model: the model has to be transparent. In this study, the authors propose a process for developing fuzzy logic-based transparent quality prediction models. They applied the process to a case study where a Mamdani fuzzy inference engine is used to predict software maintainability. They compared the Mamdani-based model with other machine learning approaches. The results show that the Mamdani-based model is superior to all.

1 Introduction

Software quality not only reflects how well a system performs, but also how well the system is built [1]. According to the ISO 9126 standard definition, software quality is 'the totality of features and attributes of a software product that bear on its ability to satisfy stated or implied needs' [2]. Software features and attributes can be categorised into two different groups: internal and external. External attributes, for example, maintainability, are visible to the stakeholders (e.g. customers, users and development project managers) of the product; internal attributes, for example, class cohesion, concern the developer of the product. Internal quality attributes are typically directly measurable during different stages of the software lifecycle. In contrast, external quality attributes are typically indirectly predictable depending on a number of internal quality attributes [3].

A 'software quality model' is meant to define the different external attributes that are of interest to the stakeholders along with their level of importance. The model also defines the functional dependencies between the external attributes of interest and the corresponding internal attributes which we can measure. Such internal attributes serve as predictors to predict future external quality attributes at the early stages during the development lifecycle.

Different prominent generic software quality models were proposed by various researchers, such as Boehm's Model [4], the ISO/IEC 9126 Model [2], McCall's Model [5] and Dromey's Model [6]. These models allow us to define and measure software quality from different perspectives by describing and evaluating a set of software attributes. However, these prominent models provide frameworks for assessing and predicting software quality more than concrete models ready for practitioners to apply. Guided by these prominent models, many attempts [7-10] have been proposed in the literature, mainly addressing the functional dependencies, which are typically more complex than a simple linear relationship. A recent comprehensive survey reveals that those attempts suffer from shortcomings; 'model transparency' was a shortcoming common to all [11].

Model transparency refers to the model property where the functional dependencies are interpretable by humans; it allows incorporating two kinds of knowledge: expert knowledge and knowledge manifested in historical data. Transparency would allow the experts to directly manipulate the model structure for any necessary additions or modifications [12-14]. This is critical for the quality model to be effective and to benefit from the experts' knowledge. Historical data provide numerical quantitative measurements from past projects regarding the internal and external quality attributes. Human experts use their experience to provide qualitative descriptions of the correlation between the internal and external quality attributes. Similarly, when using the model for actual prediction there are two sources of information: inspection data and practitioners. Inspection data are measurements on the predictors; for example, a cohesion measurement using the LCOM5 metric. Practitioners provide judgment on cohesion as well; an expert may suggest a level of cohesion for a given class. Experts/practitioners use linguistic values, for example, high, medium and low, which are imprecise by nature. The same applies to the dependencies suggested by the experts; they are also imprecise. Transparency is also crucial to make the model more practical for use, as it allows the practitioners to provide their own judgment on the predictors in linguistic terms. This is in contrast to the more theoretical black-box prediction models, for example, neural network (NN)-based models, which neither allow the incorporation of the expert's knowledge in building the model, nor allow the use of linguistic predictors' measurements.

Fuzzy logic (FL) offers significant advantages over other approaches because of its ability to naturally represent the human-provided qualitative linguistic knowledge with regard to the quality relationships and apply flexible inference rules [15]. As for transparency, an implicit assumption states that fuzzy rules are by nature easy to interpret. However, this could be wrong when dealing with complex multivariable systems or when the generated partitioning is meaningless for the experts [12]. In this paper, we study the effect of automatic rule generation and structure optimisation on model transparency, as presented later in Section 6. We, consequently, propose a post-training procedure to enforce model transparency.

We conducted a case study using maintainability as an external attribute of interest. In this context, a quality prediction model is manifested in the form of a set of fuzzy rules that relate the internal attributes to the external ones. It is worth noting here that there are two popular fuzzy inference engines for building fuzzy systems, namely the Mamdani method [16] and the Takagi-Sugeno (T-S) method [17]. Mamdani's engine offers very relevant capabilities when it comes to transparency and accommodating imprecise data. We did compare its performance with other machine learning (ML) techniques' performance with regard to accuracy to assess the trade-offs, if any. We mainly compared the performance of Mamdani's engine to the T-S engine, support vector machines (SVM) [18], Bayesian networks (BN) [19], radial basis function (RBF) networks [20] and probabilistic neural networks (PNN) [21]. Owing to the space limitation of the paper we leave out the details of these techniques; interested readers can consult the corresponding references. The Mamdani-based model seemed to be superior to all. Furthermore, we applied a post-training process to enforce the transparency of the trained model; the experiment shows that the accuracy of the transparent model is comparable with the original one and suggests that the approach is promising.

The remainder of the paper is structured as follows. Section 2 discusses some prominent related work. Section 3 describes our generic model development process. Section 4 describes the experimental setup. Experimental evaluation is discussed in Section 5; this includes comparing the performance of the fuzzy models against the ML models. Section 6 presents an experiment to assess and improve the transparency of the Mamdani-based model. Section 7 discusses the validity and the limitations of this work. Section 8 concludes the paper along with recommendations for future work.

2 Related work

ML techniques have been applied in different domains, including software engineering. ML techniques offer algorithms that have the ability to enhance their performance automatically through experience [22]. Al-Jamimi and Ahmed provide a recent comprehensive critical analysis of the many studies available in the literature for the development of software quality prediction models using ML techniques [11]. Owing to the space limitation of this paper, we provide here some highlights on only the literature we believe is the most prominent and representative.

BN has been applied by many researchers for quality prediction [11]. van Koten and Gray present a good representation of the approach [23]. They used BN to predict maintainability for object-oriented (OO) software systems. The model is constructed using Li and Henry's metric data, which were collected from two different OO systems [24]. The prediction accuracy of their model was shown to be better than, or at least competitive against, the common regression-based models. Although the study showed that BN is a useful modelling technique for software maintainability prediction, further studies are still required to realise its full potential as well as its limitations.

FL has also been applied extensively to the quality prediction problem [11]. For instance, Jeet and Dhir [25] were able to show that their proposed fuzzy prediction model offers improvements over a comparable Bayesian approach because of the advantageous accuracy of the fuzzy over the crisp approach.

As it offers more powerful regression capabilities, SVM has also caught researchers' attention in developing prediction models [11]. Malhotra et al. [26] applied SVM in modelling the relationship between the OO metrics and fault proneness to help plan and perform testing by focusing resources on the fault-prone parts of the design and code. The study shows that the SVM method may also be used in constructing software quality models. However, similar types of studies are required to be conducted in order to establish the acceptability of the model.

The work of Thwin and Quah [27] represents the popular approach of applying NN in software quality prediction. In their work, they predict the number of defects in a class and also predict the number of lines changed per class. Two NN models are used: Ward NN and general regression NN (GRNN). The GRNN model is found to predict more accurately than the Ward network model.

Other researchers considered different approaches to predict software maintainability. Examples of these approaches are multivariate adaptive regression splines (MARS) [28], artificial NN (ANN) [29], projection pursuit regression [30] and case-based reasoning [31].

In conclusion, none of the attempts discussed above considered model transparency as an objective. To the best of our knowledge, previously proposed models were not transparent enough for humans to incorporate their knowledge. None of the previous studies predicts maintainability in the context of the popular Li and Henry [24] datasets using FL. Moreover, the performance of the FL-based models was not compared with the other ML techniques. These observations motivated us to conduct this research.

3 Quality prediction model development process

Prediction models are very crucial for practical software engineering since in many circumstances we would like to predict an attribute of some entity that does not yet exist. For example, suppose a software product must be highly maintainable. The software construction may take some time, and we want to provide early assurance that the system will meet the maintainability targets. However, maintainability is defined in terms of maintenance effort, something we clearly cannot measure before the product is finished. To provide maintainability indicators before the product is complete, we can build a model of the factors that affect maintainability, and then predict the likely maintainability based on our understanding of the system while it is still under development [3].

To solve a prediction problem, we need (i) a 'prediction model' that represents the functional dependency between the dependent variable (e.g. the external attribute of interest) and the independent variables (e.g. the internal attributes of impact); (ii) 'an inference procedure' to compute the dependent variable given values for the independent variables; and (iii) a 'prediction procedure' that combines the model and inference procedure to make predictions about future values.

In this paper, we use FL as the underlying representation of the prediction model. We base the inference procedure on Mamdani's engine. The prediction model takes the form of fuzzy rules. For example, considering maintainability as the external attribute of interest and size and cohesion as the internal attributes of impact, the model would appear in a form similar to the following:

IF cohesion is low AND size is large THEN maintainability is low
IF cohesion is moderate AND size is medium THEN maintainability is moderate
IF cohesion is high AND size is small THEN maintainability is high
IF cohesion is moderate AND size is large THEN maintainability is moderate
…
IF cohesion is high AND size is moderate THEN maintainability is high
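A rule base of this shape can be exercised with a minimal Mamdani-style inference sketch. Everything below is illustrative: the membership means and sigmas, the rule list and the weighted-average defuzzification are assumptions for demonstration, not the paper's trained model (which uses data-driven fuzzy sets and full Mamdani defuzzification).

```python
import math

def gauss(x, mean, sigma):
    """Gaussian membership degree of x in a fuzzy set."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

# Hypothetical fuzzy sets: (mean, sigma) per linguistic value.
COHESION = {"low": (0.2, 0.15), "moderate": (0.5, 0.15), "high": (0.8, 0.15)}
SIZE = {"small": (100, 80), "medium": (300, 80), "large": (500, 80)}
MAINT = {"low": 0.2, "moderate": 0.5, "high": 0.8}  # output-set centres

RULES = [  # (cohesion value, size value, maintainability value)
    ("low", "large", "low"),
    ("moderate", "medium", "moderate"),
    ("high", "small", "high"),
    ("moderate", "large", "moderate"),
    ("high", "medium", "high"),
]

def predict_maintainability(cohesion, size):
    """Fire every rule (AND = min of memberships) and defuzzify with a
    weighted average of the output-set centres, a common simplification
    of centroid defuzzification."""
    num = den = 0.0
    for c_val, s_val, m_val in RULES:
        strength = min(gauss(cohesion, *COHESION[c_val]),
                       gauss(size, *SIZE[s_val]))
        num += strength * MAINT[m_val]
        den += strength
    return num / den if den else None
```

With these hypothetical sets, a highly cohesive small class (e.g. `predict_maintainability(0.8, 120)`) lands near the "high" centre, while a low-cohesion large class lands near "low".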

Fig. 1 Fuzzy sets for an input variable: size

Fig. 2 Fuzzy sets for hypothetical software maintainability


Examples of fuzzy membership functions for the linguistic values (e.g. high, low and medium), shown in Figs. 1 and 2, are trained to improve their prediction quality (i.e. a measure of the prediction accuracy). In this example, each input variable (i.e. size and cohesion) as well as the output variable (i.e. maintainability) is represented by three linguistic values. Each linguistic value is represented by a fuzzy set that has a mean value; the corresponding standard deviation is computed to maximise overlap with the neighbouring linguistic values. Applicable rules are applied according to the corresponding membership strength.

Fig. 3 illustrates our generic process for building fuzzy-based quality prediction models. The process is meant to serve as a quality assessment framework that practitioners can use according to the data available. The process is meant to be applied on the available repositories of historical projects. It also considers experts' knowledge regarding functional dependencies among quality attributes. We discuss the different components of the process in the sequel:

1. Analyse data: analyse the available historical data to build the model. By assessing the target external quality attribute of interest, this task is meant to measure the correlation between each available internal measurement and the target attribute. Accordingly, internal attributes with high correlation with the target external attribute are selected as the predictors. It is worth noting here that although the individual correlation between some internal attributes and the target external attribute might not be significant, it is still possible that the combination of these internal attributes can have a strong correlation with the target external attribute. Owing to the limited size of this paper, we defer the analysis and discussion of this possibility to another publication.
2. Build the model structure: structure the model based on the selected independent variables and the target dependent variable. Ideally, the experts provide the initial set of rules. Experts provide fuzzy sets for the linguistic values for each variable used.
3. Train the model: adapt the initial model to reflect the dependencies inherent in the historical data. A larger percentage, typically two-thirds, of the available dataset is used for training, where the data inputs and the expected output are presented to the model to improve prediction accuracy.
4. Enforce transparency: reconstruct the linguistic values of each variable to make sure that the number of linguistic values stays the same. This step is crucial from the practicality perspective. Training may change the number of linguistic values possible for each variable; typically, training increases the number of linguistic values for each variable. This makes it harder, and maybe even impractical, for the practitioners to use during the prediction procedure. Enforcing transparency makes sure that the set of rules is meaningful to the experts.
5. Analyse the model's accuracy: validate the model's accuracy against unseen data. The model's accuracy is mainly measured as how close the model's prediction is to the actual value of the dependent variable using the validation data. Based on the obtained accuracy, the model can be considered acceptable such that it can be utilised to predict the intended quality attribute. In case of non-acceptable accuracy, the model is restructured and, in turn, the previous three stages are repeated till the appropriate model is obtained.

Fig. 3 Software quality assessment framework

4 Case study: maintainability prediction

Given some available data, we conducted a case study applying our process to the software maintainability prediction problem. We evaluated the capability of the Mamdani-based FL in predicting software maintainability using some OO measurements published by Li and Henry [24]. Li and Henry's datasets consist of measurements using five Chidamber and Kemerer metrics: DIT, NOC, RFC, LCOM and WMC; and four Li and Henry metrics: MPC, DAC, NOM and SIZE2; as well as SIZE1, see Table 1. We measured maintainability by using the CHANGE metric, counting the number of lines in the code which have been changed during a 3-year maintenance period [24, 32]. Measurements were collected from a total of 110 classes in two OO software systems; the User Interface Management System (UIMS) contains 39 classes, whereas the Quality Evaluation System (QUES) contains 71 classes. The first column of Table 1 lists all the OO metrics we considered. It is not feasible to show the statistics and all the measurements per class because of the limited space; however, interested researchers can consult [24]. We consider a third dataset, BOTH, by merging the two available datasets QUES and UIMS. The rationale behind BOTH is 2-fold. First, BOTH is to help overcome the data scarcity problem when validating our process; it allows for one more experiment that is richer with regard to the number of classes considered. Second, BOTH contains classes of heterogeneous characteristics coming from different projects. This is actually closer to the real-world practice where a 'universal' model should be trained using a number of datasets which would typically be heterogeneous. This is also important in practice since the practitioners typically use a prediction model against new development that is not necessarily homogeneous with the historical data. BOTH tests the generalisation ability of our process for building a kind of universal model for the practitioners to use, subject to some calibration based on expert's input.

Table 1 Correlations between CHANGE and OO metrics

                                        Pearson's correlation coefficient
Metric                                  QUES dataset   UIMS dataset   BOTH datasets
DIT (depth of the inheritance tree)        -0.09          -0.43          -0.25
NOC (number of children)                    NA             0.56           0.24
MPC (message-passing coupling)              0.46           0.45           0.45
RFC (response for a class)                  0.38           0.64           0.51
LCOM (lack of cohesion of methods)          0.05           0.57           0.30
DAC (data abstraction coupling)             0.08           0.63           0.37
WMC (weighted method per class)             0.43           0.65           0.67
NOM (number of methods)                     0.14           0.64           0.38
SIZE1 (lines of code)                       0.64           0.63           0.65
SIZE2 (number of properties)                0.14           0.67           0.41

To obtain the independent variables most relevant to the dependent variable (i.e. CHANGE), we used Pearson's correlation coefficients. The correlation, often measured as a correlation coefficient, indicates the strength and direction of a linear relationship between two variables. Table 1 shows the Pearson's correlation coefficients between CHANGE and each of the OO metrics. The levels of correlation vary from one dataset to another, as shown in Table 1. Thus, this paper regards the QUES, UIMS and BOTH datasets as being heterogeneous and constructs a separate maintainability prediction model for each. We will comment on the uncertainty that results from this discrepancy in Section 7.

The numbers in Table 1 show the correlation between each metric and the dependent variable (i.e. CHANGE). The values that show strong and moderate correlation were shown in bold in the original table. For each dataset, we only use the metrics with strong and moderate correlation to build the corresponding model.
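The predictor-selection step just described can be sketched as follows. The toy data and the 0.4 cut-off are assumptions for illustration; the paper selects metrics by inspecting the correlations in Table 1 rather than by a fixed threshold.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_predictors(metrics, change, threshold=0.4):
    """Keep only the metrics whose |r| against CHANGE reaches the threshold.
    The 0.4 threshold is a hypothetical stand-in for the paper's
    'strong and moderate correlation' judgment."""
    return {name: round(pearson(values, change), 2)
            for name, values in metrics.items()
            if abs(pearson(values, change)) >= threshold}
```

For example, given per-class measurements for SIZE1 and DIT alongside CHANGE, only the metric that tracks CHANGE linearly survives the cut.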

5 Experiments and results

Given the case study of the maintainability prediction problem, we conducted a set of experiments using different ML models. We considered the three datasets discussed above in each experiment. Each dataset was divided into a training set and a validation set with a ratio of 2:1. Owing to the non-deterministic nature of the training procedure, we ran each experiment ten times with different training datasets and considered the averages and standard deviations. The performance of the various models was compared according to their prediction accuracy. The term prediction accuracy in this paper refers to how well a predictive model constructed using known data can predict the outcomes of unseen data. Different prediction accuracy measures are used in the literature to quantitatively evaluate OO software maintainability prediction models [33-35]. These measures include the magnitude of relative error (MRE), the normalised root-mean-square error (NRMSE) and the mean MRE (MMRE). Another measure is Pred(q), which refers to the percentage of predicted values with MRE less than or equal to a specified value q.
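These accuracy measures are straightforward to compute; a minimal sketch follows. The range-based normalisation in NRMSE is an assumption, since the paper does not state which normalisation convention it uses.

```python
import math

def mre(actual, predicted):
    """Magnitude of relative error for one observation."""
    return abs(actual - predicted) / abs(actual)

def mmre(actuals, predicteds):
    """Mean MRE over a validation set."""
    return sum(mre(a, p) for a, p in zip(actuals, predicteds)) / len(actuals)

def nrmse(actuals, predicteds):
    """Root-mean-square error normalised by the range of the actual values
    (one common convention; other normalisations divide by the mean)."""
    mse = sum((a - p) ** 2 for a, p in zip(actuals, predicteds)) / len(actuals)
    return math.sqrt(mse) / (max(actuals) - min(actuals))

def pred(actuals, predicteds, q=0.25):
    """Pred(q): fraction of predictions with MRE <= q."""
    hits = sum(1 for a, p in zip(actuals, predicteds) if mre(a, p) <= q)
    return hits / len(actuals)
```

For instance, actuals [100, 200] against predictions [110, 150] give MREs of 0.10 and 0.25, so MMRE = 0.175, Pred(0.25) = 1.0 and Pred(0.20) = 0.5.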

5.1 Fuzzy logic-based models

In this first experiment, three prediction models are constructed based on Mamdani's inference engine by utilising the type-1 FL system introduced by Mendel [36]. In the context of the three datasets, each input variable is represented by three membership functions of Gaussian type, as shown in Fig. 1. This experiment considered the two popular fuzzy inference engines: Mamdani and T-S. The performance of the Mamdani-based models is shown in Fig. 4. Fig. 4a demonstrates the performance of the QUES-based model over ten different runs. Each run uses 100 epochs for the training in the context of the QUES dataset. Similarly, the same work is repeated by using the UIMS and BOTH datasets, as shown in Figs. 4b and c.

Table 2 shows the ten-run average prediction accuracy achieved by the Mamdani-based models. The performance varies from one dataset to another. It is obvious that the QUES-based model offers the best accuracy. This could be because QUES has a large number of cases (i.e. 71 classes, compared with 39 classes in the case of UIMS); this helps in identifying the dependency patterns more accurately, if any. Clearly, even though the dataset composed of both datasets has more cases, the accuracy is not better than that of QUES only. This could be because of the different characteristics and heterogeneity of the merged datasets.

To compare the performance, we developed another set of models based on the T-S engine. We used the adaptive-network-based fuzzy-inference system (ANFIS) implemented in the Fuzzy Logic Toolbox of MATLAB to develop the models [37]. ANFIS is an implementation of the T-S engine. It is worth noting here that ANFIS requires that the number of rules is equal to the number of all possible combinations, m^n, where m is the number of membership functions for each independent variable and n is the number of the independent variables, assuming that all the variables have the same number of membership functions. For instance, the QUES dataset has four input independent variables; therefore, the needed number of rules when using the T-S-based model is 81 rules, but when using the Mamdani-based model the number can be reduced to any arbitrary number, for example, 14 rules. Given this constraint, it was impractical to apply ANFIS to the UIMS dataset since the set of independent variables includes all the metrics in Table 1; all the metrics demonstrated significant correlation to the output variable. The same applies to the BOTH dataset. The corresponding number of rules would be too large.

Fig. 4 Mamdani models' performance during ten runs with 100 epochs for each dataset
a Mamdani models' performance using the QUES dataset
b Mamdani models' performance using the UIMS dataset
c Mamdani models' performance using the BOTH dataset

Table 2 Accuracy measures of the Mamdani-based models (QUES, UIMS and BOTH datasets)

              Training            Testing
Dataset     NRMSE    MMRE     NRMSE    MMRE    Pred(0.25)   Pred(0.30)
QUES         0.11     0.24     0.16     0.27      0.52         0.62
UIMS         0.13     0.52     0.21     0.53      0.30         0.35
BOTH         0.13     0.41     0.18     0.45      0.34         0.40
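The grid-partition constraint just described can be made concrete: with m membership functions per variable and n input variables, ANFIS needs m^n rules. A two-line check (the UIMS count assumes all ten Table 1 metrics are used, as stated above):

```python
def ts_rule_count(m, n):
    """ANFIS grid partitioning needs one rule per combination of
    membership functions: m ** n rules for n inputs with m sets each."""
    return m ** n

# QUES: 4 selected inputs, 3 Gaussian sets each -> 81 rules
# (vs. the 14 rules the paper reports for the Mamdani model).
# UIMS: all 10 Table 1 metrics -> 59049 rules, hence impractical for ANFIS.
```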


Table 3 Comparison between the Mamdani-based and T-S-based models (QUES dataset)

                           NRMSE                MMRE           Number of used
Model                 Training  Testing   Training  Testing    fuzzy rules
Mamdani-based model    0.11      0.16      0.24      0.27          14
T-S-based model        0.0074    0.183     0.16      0.9           81

Table 5 NRMSE measures of SVM-based models (QUES, UIMS and BOTH datasets)

Error measure    QUES     UIMS     BOTH
training         0.409    0.156    0.58
testing          0.886    0.302    0.63

Table 6 MMRE measures of FL and BN-based models (QUES and UIMS datasets)

                          QUES                            UIMS
Used model       MMRE  Pred(0.25)  Pred(0.30)    MMRE  Pred(0.25)  Pred(0.30)
FL model         0.27     0.52        0.62       0.53     0.30        0.35
BN model [23]    0.45     0.39        0.43       0.97     0.45        0.47
MARS model [28]  0.32     0.48        0.59       1.86     0.28        0.28

www.ietdl.org

To build a comparable T-S-based model, we used the same initial configuration used for Mamdani's; the two models differed in the number of rules, though, as ANFIS had more rules. A comparison between the accuracy of the two engines is presented in Table 3. Even though ANFIS requires many more rules, the Mamdani-based model offers better accuracy during testing than the model that uses the T-S engine. ANFIS shows better accuracy during training; this could be a manifestation of overfitting given the large number of rules.

In conclusion, the Mamdani-based model has shown better performance than the T-S-based model. Thus, in the rest of this paper we only consider the Mamdani-based model (as the FL model) to be compared with the other ML-based models we developed, as presented in the sequel.
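For readers unfamiliar with the engine, one Mamdani inference cycle (fuzzify, AND by minimum, clip, aggregate by maximum, defuzzify by centroid) can be sketched as follows. This is a toy Python illustration, not the authors' MATLAB implementation; the single-input rules, means and unit Gaussian widths are invented for the example:

```python
import math

def gauss(x, mean, sigma):
    """Gaussian membership degree of x in the fuzzy set (mean, sigma)."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

def mamdani_predict(rules, inputs, y_grid, sigma=1.0):
    """One Mamdani cycle: fuzzify, AND by min, clip each consequent set,
    aggregate by max, defuzzify by centroid over a sampled output universe."""
    agg = [0.0] * len(y_grid)
    for in_means, out_mean in rules:
        # firing strength: min of the input membership degrees
        strength = min(gauss(inputs[k], m, sigma) for k, m in in_means.items())
        for i, y in enumerate(y_grid):
            agg[i] = max(agg[i], min(strength, gauss(y, out_mean, sigma)))
    num = sum(y * mu for y, mu in zip(y_grid, agg))
    den = sum(agg)
    return num / den if den else float('nan')

# Two invented rules over a single input 'x':
rules = [({'x': 2.0}, 10.0),   # if x is 'low'  then y is 'low'
         ({'x': 8.0}, 30.0)]   # if x is 'high' then y is 'high'
y_grid = [i * 0.5 for i in range(81)]  # output universe 0..40
print(round(mamdani_predict(rules, {'x': 5.0}, y_grid), 2))  # 20.0 (both rules fire equally)
```

The key contrast with the T-S engine is that the consequents here are fuzzy sets defuzzified at the end, and any subset of rules may be supplied, whereas ANFIS consequents are functions of the inputs and the rule base must cover all m^n combinations.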

5.2 ANN-based models

In this experiment, two types of ANN have been applied to our datasets: PNN and RBF. Table 4 shows the error measures of the PNN-based models and the RBF-based models. The PNN model was based on one hundred neurons with a Gaussian kernel function and a sigma for each variable. NRMSE is used in this experiment to show the error measures. We use the results in Table 4 in our comparison in Section 5.4.

5.3 SVM-based models

For each dataset, we developed an SVM model. The corresponding NRMSE values are presented in Table 5. The same independent and dependent variables used in the previous experiments are used in this experiment. We considered the Epsilon-SVR type of SVM with a sigmoid kernel function. We use the results in Table 5 in our comparison in Section 5.4.

5.4 Comparison of the FL models' performance against the BN, MARS, NN and SVM models

van Koten and Gray [23] presented a BN maintainability prediction model for an OO software system. The model was constructed using Li and Henry's datasets and the same independent variables used in this paper. MMRE was used as an accuracy measure; thus we use the same measure to compare with our models. Similarly, Zhou and Leung [28]

Table 4 NRMSE measures of PNN and RBF-based models (QUES, UIMS and BOTH datasets)

NN model: QUES (training / testing), UIMS (training / testing), BOTH (training / testing)
PNN model: 0.037 / 0.372, 0.009 / 0.551, 0.12 / 0.48
RBF model: 0.136 / 0.727, 0.016 / 0.264, 0.24 / 0.52


employed MARS to build software maintainability prediction models using the QUES and UIMS datasets. In the sequel, we compare the FL prediction models with the BN models [23] and the MARS models [28]. Table 6 compares the accuracy of our Mamdani-based FL model against the BN and MARS models using the QUES and UIMS datasets. Numbers in bold indicate the best performance.

NRMSE accuracy values of the FL models compared with the SVM, PNN and RBF models are presented in Table 7. The FL model shows superior accuracy over the other models in the context of the three datasets.

5.5 Comments on the accuracy

As van Koten and Gray [23] pointed out, for an effort prediction model to be considered accurate, MMRE ≤ 0.25 and/or either Pred(0.25) ≥ 0.75 or Pred(0.30) ≥ 0.70 are suggested to be acceptable. However, De Lucia et al. [38] reported earlier that the prediction accuracies of software maintenance effort prediction models are often low and, thus, it is very difficult to satisfy these criteria. The models of van Koten and Gray [23], and Zhou and Leung [28] confirmed this observation of De Lucia et al. Clearly, none




Table 7 Comparison between FL, SVM, PNN and RBF models using the NRMSE measure

Model: QUES (training / testing), UIMS (training / testing), BOTH (training / testing)
FL model: 0.11 / 0.16, 0.13 / 0.21, 0.13 / 0.18
SVM model: 0.409 / 0.886, 0.156 / 0.302, 0.58 / 0.63
PNN model: 0.037 / 0.372, 0.009 / 0.551, 0.12 / 0.48
RBF model: 0.136 / 0.727, 0.016 / 0.264, 0.24 / 0.52

Table 8 Exact three means for each variable (initial rules)

Rule # — MPC, RFC, WMC, SIZE1, CHANGE
1: 150, 30.5, 648, 8, 22
2: 22, 1, 122, 82.5, 22
3: 150, 30.5, 385, 82.5, 22
4: 86, 1, 122, 8, 42
5: 150, 30.5, 648, 8, 42
6: 22, 1, 385, 82.5, 2
7: 86, 1, 122, 82.5, 22
8: 22, 30.5, 385, 82.5, 2
9: 86, 60, 385, 157, 22
10: 22, 30.5, 385, 82.5, 22
11: 86, 30.5, 385, 82.5, 2
12: 22, 30.5, 648, 82.5, 22
13: 86, 30.5, 385, 82.5, 22
14: 86, 30.5, 385, 82.5, 42


of the maintainability prediction models constructed in this paper satisfy these criteria of an accurate prediction model. Please note that, to be consistent, we use the Li and Henry [24] datasets which are used by van Koten and Gray, and Zhou and Leung; we also use the same error measure (MMRE) as they did. Hence, the low prediction accuracy that is common to all could actually be because of the characteristics of the datasets used; for example, not enough cases, missing measurements of important factors that affect maintainability, and so on. However, this needs more investigation with other datasets. The other possibility could be the use of the MMRE measure itself. It is worth noting here also that Shepperd and MacDonell [39] raise concerns regarding the MMRE being biased and not always reliable as a prediction accuracy measure. Accordingly, we use NRMSE in addition to MMRE in this paper. The NRMSE measurements suggest that the models presented in this paper can predict the maintainability of the OO software systems in the Li and Henry datasets reasonably well.

Clearly, the accuracy measures need further investigation.
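The three measures used throughout this comparison are easy to operationalise. A small Python sketch of their common definitions follows; note that the paper does not spell out its NRMSE normalisation, so RMSE divided by the range of the actuals is assumed here, and the sample values are invented:

```python
import math

def mmre(actual, predicted):
    """Mean magnitude of relative error: mean(|a - p| / a)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def pred(actual, predicted, q):
    """Pred(q): fraction of cases whose relative error is <= q."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= q)
    return hits / len(actual)

def nrmse(actual, predicted):
    """RMSE normalised by the range of the actuals (assumed definition)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (max(actual) - min(actual))

actual = [10.0, 20.0, 40.0, 80.0]     # invented CHANGE values
predicted = [12.0, 18.0, 44.0, 60.0]  # invented model outputs
print(mmre(actual, predicted))        # mean of 0.2, 0.1, 0.1 and 0.25
print(pred(actual, predicted, 0.25))  # all four within 25% -> 1.0
```

Under the acceptability thresholds quoted above, a model would need MMRE ≤ 0.25 and/or Pred(0.25) ≥ 0.75 or Pred(0.30) ≥ 0.70 on such computed values.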

However, it is important to note here that the Mamdani-based maintainability prediction model has been able to achieve better prediction accuracy than all the other ML and regression models.

It is expected that the accuracy of the prediction models studied in this paper can be improved given more training cases. However, obtaining more training cases is a challenge in its own right because of the data scarcity problem common to all software engineering research. This is where Mamdani's model will have an even further significant advantage over the others: 'transparency'. Using transparent models, experts' inputs can be used to improve the accuracy of the models significantly. We discuss the transparency of Mamdani's models in detail in the sequel.

6 Transparency enforcement

The previous experiments show that the Mamdani-based model offers the best accuracy among the alternatives. In this section, we study the transparency of the Mamdani-based model. Model transparency refers to the capability of a model to express the system behaviour in a form interpretable by practitioners. It allows the practitioner to directly modify the model structure as perceived appropriate. The term transparency appears in the literature in different forms such as interpretability, compactness, completeness or consistency [24]. The fundamental structures behind the different ML techniques applied in the previous section suggest that, among all, only FL offers 'potentially' transparent models. We emphasise 'potentially' here because an implicit assumption states that fuzzy rules are by nature easy to interpret. This could be wrong when dealing with complex multivariable systems or when the generated number of fuzzy sets for each variable, which are interpreted as linguistic labels, is meaningless for the practitioners [12]. Moreover, those fuzzy sets should be shared across all the rules. However, this is not the case, as most training procedures result in different fuzzy sets for different rules. Typically, the number of linguistic values (i.e. fuzzy sets) possible for each linguistic variable increases significantly as a result of training. Consequently, the practitioners find it hard to comprehend and manage so many linguistic values; the model turns into a black-box prediction model. This transparency issue discourages the incorporation of the practitioner's knowledge in building the model as well as the use of the practitioner's input with regard to the predictors' assessment. To get an idea of the issue, let us get an insight into the physical representation of a fuzzy model. Given that the membership functions are Gaussian, each linguistic value is represented by its mean value. The rules represent the different linguistic values for each variable, where each combination of the linguistic values has its own response. For example, in our experiments, three different linguistic values have been defined for each input variable. In the case of the QUES dataset, Table 8 shows the initial Mamdani-based model rules based on four input variables with three means representing the three fuzzy sets for each variable. The number of rules is set to 14, where the rows show the rules, the first four columns give the input variables and the last column shows the response for the given inputs. Note that under each column there are only three distinct values to represent the means of the three fuzzy sets. Typically, the mean related to each fuzzy set has its own interpretation and is labelled accordingly with the corresponding linguistic value. For instance, the first rule in Table 8 could have been provided/interpreted by an expert as

If (MPC is ‘high’) and (RFC is ‘medium’) and (WMC is‘high’) and (SIZE1 is ‘low’) Then CHANGE is ‘medium’.
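The mapping from the numeric means in Table 8 to such linguistic labels follows directly from ranking the three distinct means of each variable. A small illustrative Python sketch, using the Table 8 values:

```python
def label_means(distinct_means):
    """Rank a variable's three distinct means as low / medium / high."""
    low, medium, high = sorted(distinct_means)
    return {low: 'low', medium: 'medium', high: 'high'}

# Distinct means per variable, taken from Table 8 (QUES dataset):
labels = {
    'MPC':    label_means([22, 86, 150]),
    'RFC':    label_means([1, 30.5, 60]),
    'WMC':    label_means([122, 385, 648]),
    'SIZE1':  label_means([8, 82.5, 157]),
    'CHANGE': label_means([2, 22, 42]),
}

# Rule 1 of Table 8: MPC=150, RFC=30.5, WMC=648, SIZE1=8 -> CHANGE=22
rule1 = {'MPC': 150, 'RFC': 30.5, 'WMC': 648, 'SIZE1': 8}
cond = ' and '.join(f"({v} is '{labels[v][m]}')" for v, m in rule1.items())
print(f"If {cond} Then CHANGE is '{labels['CHANGE'][22]}'")
# -> If (MPC is 'high') and (RFC is 'medium') and (WMC is 'high')
#    and (SIZE1 is 'low') Then CHANGE is 'medium'
```

Running this reproduces the expert reading of rule 1 given above, which is exactly what makes the initial rule base transparent.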



Table 9 Different means for each variable (trained rules)

Rule # — MPC, RFC, WMC, SIZE1, CHANGE
1: 20.65, 150.35, 23.62, 648.11, 9.65
2: 16.26, 21.79, 0.00, 122.76, 71.25
3: 22.81, 149.98, 32.78, 385.51, 82.44
4: 25.80, 80.47, 15.51, 119.26, 1.14
5: 43.86, 150.07, 29.81, 648.02, 8.26
6: 0.00, 21.38, 60.01, 386.11, 77.18
7: 27.02, 85.41, 0.77, 123.66, 81.99
8: 0.31, 21.52, 60.16, 385.34, 82.37
9: 22.17, 85.78, 62.88, 384.92, 156.51
10: 20.10, 22.23, 24.82, 384.94, 82.51
11: 0.01, 86.71, 60.86, 385.13, 80.44
12: 22.23, 22.01, 29.44, 648.02, 81.61
13: 27.94, 86.71, 5.87, 385.33, 77.73
14: 40.04, 83.56, 24.21, 384.72, 86.80

Table 10 Trained rules after clustering the different means

Rule # — MPC, RFC, WMC, SIZE1, CHANGE
1: 24, 150, 20, 648, 6
2: 24, 22, 2, 122, 80
3: 24, 150, 20, 385, 80
4: 24, 90, 20, 122, 6
5: 42, 150, 20, 648, 6
6: 2, 22, 60, 385, 80
7: 24, 90, 2, 122, 80
8: 2, 22, 60, 385, 80
9: 24, 90, 60, 385, 157
10: 24, 22, 20, 385, 80
11: 2, 90, 60, 385, 80
12: 24, 22, 20, 648, 80
13: 24, 90, 2, 385, 80
14: 42, 90, 20, 385, 80

Table 11 Performance of the Mamdani model before and after clustering the rules (QUES dataset)

Error measure: trained rules (training / testing), clustered rules (training / testing)
NRMSE: 0.15 / 0.201, 0.167 / 0.21
MMRE: 0.22 / 0.295, 0.29 / 0.35
Pred(0.25): 0.62 / 0.52, 0.54 / 0.52
Pred(0.30): 0.62 / 0.57, 0.59 / 0.57


Training the model resulted in a set of 14 rules where each rule has its own linguistic values (manifested by the means of the fuzzy sets). Table 9 shows the resultant set of rules. For instance, the MPC column shows 14 different means for the first input variable instead of three means. This observation suggests that the practitioner has to deal with 14 linguistic values for each linguistic variable (e.g. predictor) rather than just the three he/she originally specified. This confirms that the assumption, usually made by researchers, that just using FL will always result in a transparent model is wrong; the training procedure typically increases the number of linguistic values. This clearly impedes the practitioner's ability to comprehend and tune the behaviour of the model as perceived appropriate. Transparency requires that the number of linguistic values for each variable (predictor/response) stays the same, and that these linguistic values themselves are shared across the rules.

Ideally, training should preserve the number of linguistic values for each variable, but only tune the set of means and corresponding standard deviations to minimise the error when validated against the historical data. This way it would incorporate human experts with historical data while building the model. Experts can even tune the resultant model further if needed. In contrast, as the number of linguistic values significantly increases as a result of training, experts may not be able to tune the model, as they may not be able to comprehend the effect of the changes on the behaviour. This would be similar to the weights of a NN; it is hard for humans to comprehend the effect of manually changing such weights; that is why NNs are considered black boxes by practitioners.

To the best of our knowledge, addressing model transparency in software quality prediction has not caught the researchers' attention so far. Previous works have addressed the problem of the loss of interpretability in fuzzy modelling mainly in the process simulation and control domain [12, 13].

The above discussion motivated us to conduct this experiment to consider the transparency of our Mamdani-based maintainability prediction model. The objective is to reduce the number of linguistic values to their original settings and assess accuracy accordingly. Our approach is based on clustering the linguistic values into a number of clusters equal to the original number of linguistic values for each variable. The rationale is to use the mean of the cluster as the mean of a representative linguistic value for the values in the cluster. We replace any rule's linguistic value by its cluster's representative value. This manipulation results in a set of rules with the number of linguistic values reduced to the original number; transparent and interpretable by humans. The resultant transparent model is then validated against the dataset to assess accuracy. It is worth noting here that we validate the model against both the training and test datasets. We re-validate against the training dataset since we indeed changed the model by clustering the values; we need to make sure that the model can still predict the training dataset as well as the testing dataset with acceptable accuracy. If there is acceptable accuracy, the model is accepted; otherwise, the model development process is repeated as discussed above.

Considering our Mamdani model for the QUES dataset, Table 10 shows that, after clustering, each variable in the fuzzy model has three different means. Of course, they differ from the initial values; they also differ from the trained values. This set incorporates both expert and historical knowledge. This is expected to be interpretable by humans.

We validated this new model against the same training and testing data used before. The obtained accuracy measures in the two cases, before and after clustering the means of the trained rules, are presented in Table 11. We conclude that the trained rules may contain a number of membership functions different from the original number of membership functions for each variable. Nevertheless, it is possible to cluster the linguistic values which result from training to preserve the original number of membership functions without significantly affecting the overall accuracy.

In summary, the conducted experiments suggest that our process was successful in developing a promising Mamdani-based maintainability prediction model. Our model showed superiority to the other models in two aspects: accuracy and transparency. With regard to


accuracy, our model offered better MMRE and NRMSE than the other models when considering testing data. This reflects the learning and generalisation ability of our model. It is worth noting here that in some cases the differences in accuracy between our model and the others are not highly significant. However, the big advantage of our model is that it offers high transparency compared with the other models, which are more or less black boxes.
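The clustering step used in Section 6 to restore transparency can be sketched with a simple one-dimensional k-means (illustrative Python; the paper does not name its clustering algorithm, so plain k-means with three clusters is assumed). Feeding it the 14 trained MPC means from Table 9 yields centres close to the three MPC values of Table 10:

```python
def kmeans_1d(values, iters=50):
    """Plain 1-D k-means with k = 3; centres start at min, midpoint, max."""
    lo, hi = min(values), max(values)
    centres = [lo, (lo + hi) / 2, hi]
    for _ in range(iters):
        clusters = [[] for _ in range(3)]
        for v in values:
            nearest = min(range(3), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        new = [sum(c) / len(c) if c else centres[i]
               for i, c in enumerate(clusters)]
        if new == centres:  # assignments stable -> converged
            break
        centres = new
    return centres

# Trained MPC means of the 14 rules (Table 9):
mpc = [20.65, 16.26, 22.81, 25.80, 43.86, 0.00, 27.02,
       0.31, 22.17, 20.10, 0.01, 22.23, 27.94, 40.04]
print([round(c, 1) for c in kmeans_1d(mpc)])  # centres near 0, 23 and 42
```

Each rule's trained mean is then replaced by its cluster centre, shrinking the 14 per-variable linguistic values back to three shared ones; the centres here sit broadly, though not exactly, where Table 10 places the clustered MPC values (2, 24, 42), consistent with the paper using its own clustering settings.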

7 Threats to validity and limitations

Data scarcity poses a major threat to the validity of the results of this study. Clearly, the two datasets that have been used in the study are not enough to draw strong conclusions. Data scarcity is a common problem for software engineering research, though. A corresponding limitation in our case is that we cannot claim that any of the three Mamdani models we developed is generic enough to be used as is by practitioners. More calibration, given data from the specific practitioner environment, would be needed. Future effort will try to make contact with some potential software houses to allow using their repositories and expertise in evolving a generic model. It is worth emphasising here, though, that the process presented in this paper can be used by practitioners to develop their own models using their repositories.

8 Conclusion and future research

There are two types of knowledge for building models: imprecise linguistic human knowledge and precise numerical historical data. The prerequisite to effective combination of the two knowledge types is the transparency of the model. In this paper, we showed that the training procedures of FL systems do not necessarily produce transparent models, as wrongly assumed by many researchers. Accordingly, we developed a process for developing transparent FL-based quality models. We applied our process to a case study to predict software maintainability through some software measures; we used two common and publicly available datasets.

We compared the Mamdani-based models with T-S-based, SVM, PNN, RBF, BN and MARS models. The Mamdani-based model offered the best accuracy among all. The model accuracy measured in NRMSE was acceptable. However, that was not the case considering the MMRE criteria. Future work will try to apply the process and build models using other datasets and/or the experts' input for better assessment.

Furthermore, we assessed the Mamdani-based model's transparency after training. We also conducted an experiment to enforce the transparency of the trained model. The results show that the approach is promising; the original number of linguistic values for each linguistic variable (e.g. predictor) can be preserved without impairing the accuracy of the model.

Recall that our experiments on the two different datasets along with the third combined dataset have shown different correlations to the available measures. This raises an issue with regard to practical cases where a practitioner develops a model using a given dataset (e.g. QUES) whereas a new project (e.g. UIMS-like) poses different characteristics which suggest that a different model should be used. In this case, developing the prediction model should consider the uncertainty resulting from ignoring some internal attributes, for example, SIZE2, which is ignored in QUES but actually needed in UIMS. It is 'uncertainty' in the sense that some internal attributes have an effect on some projects but not on all. Future work will consider this issue of uncertainty to investigate applying type-2 FL to address imprecision and uncertainty simultaneously [36, 40].

9 Acknowledgment

The authors wish to acknowledge King Fahd University of Petroleum and Minerals (KFUPM) for providing the various facilities utilised in carrying out this research.

10 References

1 Gillies, C.A.: 'Software quality: theory and management' (Lexington, Ky., 2011, 3rd edn.)

2 Standard II: ISO-9126 Software Product Evaluation – Quality Characteristics and Guidelines for Their Use, 1991

3 Fenton, N.E., Pfleeger, S.L.: 'Software metrics: a rigorous and practical approach' (PWS Publishing Co., Boston, MA, USA, 1998)

4 Boehm, B.W., Brown, J.R., Kaspar, H., Lipow, M., MacLeod, G.J., Merritt, M.J.: 'Characteristics of software quality' (North Holland Publishing Company, 1978)

5 McCall, J.A., Richards, P.K., Walters, G.F.: 'Factors in software quality'. Technical Report Volume I, NTIS, Springfield, VA, NTIS AD/A-049 014, 1977

6 Dromey, R.G.: 'A model for software product quality', IEEE Trans. Softw. Eng., 1995, 21, (2), pp. 146–162

7 Challa, J.S., Paul, A., Dada, Y., Nerella, V., Srivastava, P.R., Singh, A.P.: 'Integrated software quality evaluation: a fuzzy multi-criteria approach', J. Inf. Process. Syst., 2011, 7, (3), pp. 473–518

8 Sharma, A., Kumar, R., Grover, P.S.: 'Estimation of quality for software components – an empirical approach'. Proc. SIGSOFT Software Engineering Notes, 2008, pp. 1–10

9 Lamouchi, O., Cherif, A.R., Lévy, N.: 'A framework based measurements for evaluating an IS quality'. Proc. Fifth Asia-Pacific Conf. on Conceptual Modelling, Wollongong, NSW, Australia, 2008

10 Srivastava, P.R., Kumar, K.: 'An approach towards software quality assessment', Commun. Comput. Inf. Syst. Series, 2009, 31, (6), pp. 345–346

11 Al-Jamimi, H.A., Ahmed, M.: 'Machine learning-based software quality prediction models: state of the art'. Proc. Fourth Int. Conf. on Information Science and Applications, Pattaya, Thailand, 2013

12 Guillaume, S.: 'Designing fuzzy inference systems from data: an interpretability-oriented review', IEEE Trans. Fuzzy Syst., 2001, 9, (3), pp. 426–443

13 Paiva, R.P., Dourado, A.: 'Interpretability and learning in neuro-fuzzy systems', Fuzzy Sets Syst., 2004, 147, (1), pp. 17–38

14 Gacto, M.J., Alcalá, R., Herrera, F.: 'Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures', Inf. Sci., 2011, 181, (20), pp. 4340–4360

15 Zadeh, L.A.: 'From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions', Int. J. Appl. Math. Comput. Sci., 2002, 12, (3), pp. 307–324

16 Mamdani, E.H., Assilian, S.: 'An experiment in linguistic synthesis with a fuzzy logic controller', Int. J. Man-Mach. Stud., 1975, 7, (1), pp. 1–13

17 Takagi, T., Sugeno, M.: 'Fuzzy identification of systems and its applications to modelling and control', IEEE Trans. Syst. Man Cybern., 1985, 15, (1), pp. 116–131

18 Hamel, L.H.: 'Knowledge discovery with support vector machines' (Wiley, 2009)

19 Koski, T., Noble, J.M.: 'Bayesian networks. An introduction' (John Wiley and Sons, Ltd., 2009)

20 Buhmann, M.D.: 'Radial basis functions: theory and implementations' (Cambridge University Press, 2003)

21 Specht, D.F.: 'Probabilistic neural networks', Neural Netw., 1990, 3, (1), pp. 109–118

22 Zhang, D., Tsai, J.J.P.: 'Machine learning applications in software engineering' (World Scientific Inc., 2005)

23 van Koten, C., Gray, A.R.: 'An application of Bayesian network for predicting object-oriented software maintainability', Inf. Softw. Technol., 2006, 48, (1), pp. 59–67

24 Li, W., Henry, S.: 'Object oriented metrics that predict maintainability', J. Syst. Softw., 1993, 23, (2), pp. 111–122


25 Jeet, K., Dhir, R.: 'Bayesian and fuzzy approach to assess and predict the maintainability of software: a comparative study' (International Scholarly Research Network, ISRN Software Engineering, 2012)

26 Malhotra, R., Kaur, A., Singh, Y.: 'Empirical validation of object-oriented metrics for predicting fault proneness at different severity levels using support vector machines', Int. J. Syst. Assurance Eng. Manage., 2010, 1, (3), pp. 269–281

27 Thwin, M.M.T., Quah, T.-S.: 'Application of neural networks for software quality prediction using object-oriented metrics', J. Syst. Softw., 2005, 76, (2), pp. 147–156

28 Zhou, Y., Leung, H.: 'Predicting object-oriented software maintainability using multivariate adaptive regression splines', J. Syst. Softw., 2007, 80, (8), pp. 1349–1361

29 Dash, Y., Dubey, S.K., Rana, A.: 'Maintainability prediction of object oriented software system by using artificial neural network approach', Int. J. Soft Comput. Eng., 2012, 2, (2), pp. 420–423

30 Li-jin, W., Xin-xin, H., Zheng-yuan, N., Wen-hua, K.: 'Predicting object-oriented software maintainability using projection pursuit regression'. Proc. First Int. Conf. on Information Science and Engineering (ICISE 2009), 2009, pp. 3827–3830

31 Haiquan, Y., Gaoliang, P., Wenjian, L.: 'An application of case-based reasoning to predict structure maintainability'. Proc. Int. Conf. on Computational Intelligence and Software Engineering, 2009, pp. 1–5


32 Nguyen, V., Boehm, B., Danphitsanuphan, P.: 'A controlled experiment in assessing and estimating software maintenance tasks', Inf. Softw. Technol., 2011, 53, (6), pp. 682–691

33 Prabhakar, Dutta, M.: 'Prediction of software effort using artificial neural network and support vector machine', Int. J. Adv. Res. Comput. Sci. Softw. Eng., 2013, 3, (3), pp. 40–46

34 Al-Jamimi, H.A., Ahmed, M.: 'Prediction of software maintainability using fuzzy logic'. Proc. IEEE Third Int. Conf. on Software Engineering and Service Science (ICSESS), 2012, pp. 702–705

35 Mohanty, R., Ravi, V., Patra, M.R.: 'Hybrid intelligent systems for predicting software reliability', Appl. Soft Comput., 2013, 13, (1), pp. 189–200

36 Mendel Software: http://www.sipi.usc.edu/~mendel/software/

37 Jang, J.-S.R.: 'ANFIS: adaptive-network-based fuzzy inference system', IEEE Trans. Syst. Man Cybern., 1993, 23, (3), pp. 665–685

38 De Lucia, A., Pompella, E., Stefanucci, S.: 'Assessing effort estimation models for corrective maintenance through empirical studies', Inf. Softw. Technol., 2005, 47, (1), pp. 3–15

39 Shepperd, M., MacDonell, S.: 'Evaluating prediction systems in software project estimation', Inf. Softw. Technol., 2012, 54, (8), pp. 820–827

40 Ahmed, M., Muzaffar, Z.: 'Handling imprecision and uncertainty in software development effort prediction: a type-2 fuzzy logic based framework', Inf. Softw. Technol., 2009, 51, (3), pp. 640–654

IET Softw., 2013, Vol. 7, Iss. 6, pp. 317–326, doi: 10.1049/iet-sen.2013.0046