
Enhanced Quality Metrics for Identifying Class Complexity and Predicting Faulty Modules Using Ensemble Classifiers

C. NEELAMEGAM and Dr. M. PUNITHAVALLI

Abstract — The software industry increasingly relies on software metrics to determine the extent of software quality through defect identification. Detecting faults early helps to reduce the time and cost spent during every phase of the software life cycle. Identifying quality metrics is a challenging task due to ever-changing software requirements and the increasing complexity and size of software applications. Using quality metrics for fault detection is a two-step process: the first step measures the complexity of a software module, and the second step uses that measure to predict faulty modules. In this paper, four new metrics are proposed alongside the traditional object-oriented metrics. An ensemble classification model that combines three classifiers is proposed to predict faulty modules in C++ projects. The performance of the proposed system is analyzed in terms of accuracy, precision, recall and F-measure. The experimental results showed a positive improvement in prediction performance with the inclusion of the proposed metrics and the ensemble classifier.

Index Terms — Class Complexity, Defect Detection, Ensemble Classification, Object Oriented Software, Quality Metrics, Software Quality.

    1 INTRODUCTION

Software quality metrics are methods that quantitatively determine the extent to which a software process, product or project possesses a certain quality attribute. They are used to measure software engineering products (design, source code, etc.), processes (analysis, design, coding, testing, etc.) and professionals (the efficiency or productivity of an individual designer). Techniques and methods that identify and predict faults using these quality metrics have gained wide acceptance in the past few decades (Catal and Diri, 2009; Chowdhury and Zulkernine, 2011), as they have a direct impact on a software product's time, cost and scope. The heavy usage of software systems imposes high quality demands from users, which results in increased software complexity. Fault prediction is a strategy to identify the faulty parts of a program so that the testing process can concentrate only on those regions. This improves the testing process and indirectly helps to reduce the development life cycle, project risks, and resource and infrastructure costs. Fault prediction models can be either process oriented (development and maintenance) or product oriented (design and usability). The use of software metrics to evaluate the quality of software design has attracted software industries, as metrics help to assess large software systems quickly and at low cost. Recent software products show increased use of the Object Oriented (OO) paradigm, which has increased the need for new quality metrics to be devised.

-----------------------------------------
C. Neelamegam is with Computer Applications at Sri Venkatesara College of Computer Applications and Management, Coimbatore, Tamilnadu, India - 641 112.
M. Punithavalli is with Computer Applications at Sri Ramakrishna College of Engineering, Coimbatore, India.

Existing metrics for fault module detection include the CK metrics and the MOOD metrics, along with traditional general metrics such as simple metrics and program complexity measures. Traditional metrics do not consider OO paradigms such as inheritance, encapsulation and message passing, and therefore do not perform well for fault prediction. The OO metrics were developed specifically to analyze the performance of OO systems. However, the growth in software complexity and size is increasing the demand for new metrics that identify flaws in the design and code of software systems. This demand has led researchers to focus on adopting new metrics for which established practices have yet to be developed. This paper addresses such needs through the development of four metrics for OO design. In particular, this work analyzes metrics for measuring class complexity that can be used as a medium to identify design defects. For this purpose, four metrics based on information flow, friend classes/functions, inheritance and cohesion are proposed.

Several studies have focused on evaluating the usefulness of software metrics to predict software design faults (Damm and Lundberg, 2007). These techniques can be loosely categorized as statistical techniques, structural-pattern-based techniques, metrics-based techniques, formal/relational concept analysis, and software inconsistency management techniques. Classification, a frequently used data mining technique, has found wide usage in a range of problem domains such as finance, medicine, engineering, geology and physics. Combining software metrics and single classifiers is a methodology that has gained attention recently. This study proposes a methodology that combines software metrics and a suite of classifiers (ensembling) to design a fault prediction model. The rest of the paper is organized as follows. Section 2 presents the four proposed metrics for calculating the complexity of a class.


The methodology of the proposed ensemble classification model, which uses existing and proposed metrics to predict faulty classes, is presented in Section 3. Several experiments were conducted to analyze the performance of the proposed metrics and the ensemble classifier in predicting faulty modules; the results are presented and discussed in Section 4. Section 5 concludes the work with future research directions.

    2 PROPOSED CLASS COMPLEXITY METRICS

This section presents the four proposed complexity metrics. For all the metrics, a high value denotes high functional complexity and points towards serious design flaws that require extensive testing and redesigning.

2.1 Class Method Flow Complexity Measure (CMFCM)

Two well-known metrics, cyclomatic complexity and structural fan-in/fan-out, are concerned with the control flow of a program and ignore data or information flow complexity. Two measures used for information flow complexity are Fan-In and Fan-Out. Fan-In measures the information flow into a procedure: it is the sum of the number of parameters passed to a module from outside and the global variables read by that module. Fan-Out, on the other hand, is the sum of the number of return values of a module and the global variables written by that module. According to Henry and Kafura (1981), the module complexity can be calculated as in Equation (1).

Complexity = (Fan-In * Fan-Out)^2 + Code Length (1)

It is a known fact that in object-oriented systems the private (internal) data of an object cannot be directly accessed by other objects, so programmers use parameter passing and return values. The Fan-In (FI) and Fan-Out (FO) measures for a method m should take these values into consideration and can be calculated using Equations (2) and (3).

FI = 1 + N_m1 + (N_IP + N_PV + N_PU + N_LV + N_GVR) + f() (2)

FO = 1 + N_m2 + (N_OP + N_GVW) + f() (3)

where N_m1 is the number of objects called, N_m2 is the number of objects that call this method, N_IP is the number of input parameters, N_PV is the number of protected variables, N_PU is the number of public variables, N_LV is the number of local variables, N_OP is the number of parameters written to, N_GVR and N_GVW are the numbers of global variables read and written, and f() is a function that returns 1 if method m returns a value and zero otherwise.

Another property that has to be considered in OO systems is the coupling among entities. The Coupling Among Entities (CAE) is calculated as the sum of a direct coupling metric and an indirect coupling metric (Equation 4).

    CAE = DCM + IDCM (4)

    where

DCM = (No. of methods in m * No. of parameters in m) / (No. of methods in C * No. of parameters in C)

and IDCM is the product of the DCM values of all methods along a path of length two between the entities, where C is the class and m is the method. Equation (1) can now be rewritten as

    CMFCM = (FI + FO) * CAE * MCL (5)

Here, the multiplicative operator in the traditional complexity measure is replaced by an additive one. This modification was made to accommodate the coupling-among-entities computation, and it has the added advantage of reducing computation complexity. In the equation, MCL is the module code length, calculated using Equation (6).

MCL = LOC + MLOC + CLOC + (CL * j) + BL (6)

where LOC is the number of lines of code including comments and blank lines, MLOC is the multi-line code count, calculated as LOC * the number of separate statements on the same line, and CLOC is the count of lines of code that contain comments, calculated as the sum of LOC and the number of comment lines. The expression CL * j denotes the number of lines that contain more than one comment statement, and BL denotes the blank lines. The proposed CMFCM metric is a method-level metric.
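As an illustration, the sketch below assembles CMFCM from Equations (2), (3) and (5). It assumes the raw per-method counts have already been extracted by some static-analysis pass over the C++ source; the class and field names are illustrative only, and the CAE and MCL inputs are taken as precomputed values from Equations (4) and (6).

```python
from dataclasses import dataclass

@dataclass
class MethodCounts:
    n_called: int           # Nm1: objects called by this method
    n_callers: int          # Nm2: objects that call this method
    n_input_params: int     # NIP
    n_protected_vars: int   # NPV
    n_public_vars: int      # NPU
    n_local_vars: int       # NLV
    n_globals_read: int     # NGVR
    n_out_params: int       # NOP: parameters written to
    n_globals_written: int  # NGVW
    returns_value: bool     # f() = 1 if the method returns a value, else 0

def fan_in(m: MethodCounts) -> int:
    # Equation (2)
    return 1 + m.n_called + (m.n_input_params + m.n_protected_vars +
                             m.n_public_vars + m.n_local_vars +
                             m.n_globals_read) + int(m.returns_value)

def fan_out(m: MethodCounts) -> int:
    # Equation (3)
    return 1 + m.n_callers + (m.n_out_params +
                              m.n_globals_written) + int(m.returns_value)

def cmfcm(m: MethodCounts, cae: float, mcl: int) -> float:
    # Equation (5); CAE comes from Equation (4) and MCL from Equation (6)
    return (fan_in(m) + fan_out(m)) * cae * mcl
```

A counting pass of the same shape would supply the CAE and MCL arguments.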

2.2 Friend Class Complexity Metric (FCCM)

A friend class is a class whose methods can access all private and protected members of the class to which it is declared as a friend. When considering a complexity measure for friend classes, the following characteristics have to be noted:

1. On declaration of a friend class, all member functions of the friend class become friends of the class in which the friend class is declared.
2. A friend class cannot be inherited, and every friendship has to be explicitly declared.
3. The friendship relation is not symmetric.

In the field of OO metrics for fault detection, studies on friend classes are scarce, in spite of their extensive usage (Counsell and Newson, 2000; Counsell et al., 2004). Friend constructs violate encapsulation and complicate a program, which in turn makes debugging more difficult. Moreover, the task of tracking polymorphism also becomes more complex when friend classes are used.

According to Chidamber and Kemerer's principle, only those methods which require additional design effort should be counted for metric measurement; inherited methods and methods from friend classes are not defined in the class and therefore need not be included. However, it has been shown that the coupling that exists between friend classes increases the fault proneness of a class (Briand et al., 1997). Existing methods consider the relationship and type of association between class attributes and methods, but do not consider the relationship between friend attributes and external attributes. This section proposes a modified version which considers this relationship and extends the coupling metrics to use these friend metrics. Using these metrics, a new coupling measure to determine the class complexity is proposed. The coupling measure can be either Direct Coupling (DC) or Indirect Coupling (IDC). DC here refers to the normal


coupling factor, while IDC refers to the coupling that arises when friend functions or classes are used. Thus, the new coupling factor is defined as the sum of DC and IDC (Equation 7).

CF_New = DC + IDC (7)

DC is calculated using the method specified in the MOOD metric suite. The IDC of a class is calculated as the average of the Method IDC (MIDC) factors of its methods. The MIDC is modified to identify a factor called actual friend methods, which is introduced because a friend class declaration generally grants access to all methods of a class, while in reality only a few of these methods are actually called by other classes. The MIDC combined with this factor is calculated using Equation (8).

MIDC = [ Σ_{i=1}^{N_MC} (N_GVR,i + N_GVW,i + N_GF,i + N_PC,i + N_V,i) ] / N_MC (8)

where N_GF is the number of global functions, N_PC is the number of messages to other classes, N_V is the number of references to instance variables of other classes, and N_MC is the number of actual methods in the class, calculated as the difference between the number of methods (N_M) and the number of hidden methods (N_HM) in the class (Equation 9). Hidden methods are methods that cannot be called on an object of the class without the use of the friend construct.

N_MC = N_M - N_HM (9)

The number of hidden methods is calculated from the methods in a class that access hidden members of classes which declare the current class as a friend (English et al., 2005). N_HM is calculated as the sum of two measures. The first is the number of hidden methods belonging to other classes that are accessed by the class, referred to in this study as the Number of EXternal Hidden Methods (N_HME). The second is the number of hidden methods of the class that are invoked by other classes, referred to as the Number of Internal Hidden Methods (N_HMI). Thus N_HM is calculated as

N_HM = N_HME + N_HMI (10)

Using the above metric, the complexity measure can be calculated by modifying Equation (5) as given below:

FCCM = (FI + FO) * CF_New * MCL (11)

Again, this metric is a method-level metric, where a high value indicates design flaws.
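A hedged sketch of the FCCM bookkeeping follows. The per-method counts of Equation (8) are assumed to come from a hypothetical analysis pass, DC is taken as a precomputed MOOD-style coupling value, and the names are illustrative rather than fixed by the paper.

```python
def n_mc(n_methods: int, n_hme: int, n_hmi: int) -> int:
    # Equations (9)-(10): actual methods = all methods minus hidden methods,
    # where hidden methods split into external (NHME) and internal (NHMI)
    return n_methods - (n_hme + n_hmi)

def midc(per_method_counts: list) -> float:
    # Equation (8): average over the NMC actual methods of the tuple
    # (NGVR, NGVW, NGF, NPC, NV) counted for each method i
    nmc = len(per_method_counts)
    return sum(sum(c) for c in per_method_counts) / nmc if nmc else 0.0

def fccm(fi: int, fo: int, dc: float, midc_value: float, mcl: int) -> float:
    cf_new = dc + midc_value           # Equation (7), with IDC from MIDC
    return (fi + fo) * cf_new * mcl    # Equation (11)
```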

    2.3 Class Complexity from Inheritance (CCI)

Inheritance, a powerful mechanism in OO programming, provides a method for reusing the code of existing objects, for establishing a subtype from an existing object, or both. Inheritance metrics analyze various aspects of a program in terms of the depth and breadth of a hierarchy and its overriding complexity. Inheritance can be used to measure class complexity as a measure of the data and methods shared from ancestor classes. The class complexity under inheritance depends mainly on the inheritance type (single, multiple or multi-level inheritance). Apart from this, the complexity imposed by inherited methods and inherited attributes should also be considered. Thus the proposed CCI metric combines the individual complexity of a class taking the properties of inheritance into consideration (ICC), the inherited method complexity (IMC) and the inherited attribute complexity (IAC), and is calculated using Equation (12).

    CCI = ICC + IMC + IAC (12)

where the ICC of a class i is calculated as

ICC = N_A + [ Σ_{i=1}^{N_A} ICC_i ] / (2 * N_A) (13)

where N_A is the number of parents of the class and ICC_i is the ICC of the i-th parent.

Here, the ICC of the root of the inheritance tree is zero, as it has no parent. The ICC measure thus takes into consideration the depth of the class in the inheritance hierarchy, the number of parents of the class and their depth in the hierarchy, along with the type of inheritance. The IMC is calculated as

IMC = (N_PD * 1) + (N_DD * 2) + (N_UD * 3) (14)

where N_PD is the number of primary data variables, N_DD is the number of derived data variables and N_UD is the number of user-defined data type variables. The classification of data types is similar to the one proposed by Arockiam and Aloysius (2011), who defined PD as built-in data types like int, float and char, DD as built-in structures like arrays, and UD as user-designed structures formed by combining PD and DD; examples of UD include structures, unions and classes. As suggested by the same authors, cognitive weights of 1, 2 and 3 are used with N_PD, N_DD and N_UD respectively. These cognitive weights are assigned according to the cognitive phenomenon suggested by Wang (2002), which assigns weights PD = 1, DD = 2 and UD = 3. Finally, IAC is calculated by again assigning cognitive weights, this time to the control structures in the method. The control structures considered are sequence statements, branching statements, iterative statements and call statements; as suggested by Wang (2002), values of 1, 2, 3 and 2 are assigned to these statements respectively.
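The following sketch combines Equation (12) with Equation (14) and the cognitive-weight scheme described above. The ICC input is assumed to be computed recursively over the class's parents per Equation (13), and all function names are illustrative.

```python
# Cognitive weights from Wang (2002): PD=1, DD=2, UD=3 for data types,
# and 1/2/3/2 for sequence, branch, iteration and call statements.
DATA_WEIGHTS = {"PD": 1, "DD": 2, "UD": 3}
CONTROL_WEIGHTS = {"sequence": 1, "branch": 2, "iteration": 3, "call": 2}

def imc(n_pd: int, n_dd: int, n_ud: int) -> int:
    # Equation (14): weighted counts of primary, derived, user-defined data
    return (n_pd * DATA_WEIGHTS["PD"] + n_dd * DATA_WEIGHTS["DD"] +
            n_ud * DATA_WEIGHTS["UD"])

def iac(control_counts: dict) -> int:
    # Cognitive weights summed over the four control-structure kinds above
    return sum(CONTROL_WEIGHTS[kind] * n for kind, n in control_counts.items())

def cci(icc: float, n_pd: int, n_dd: int, n_ud: int,
        control_counts: dict) -> float:
    # Equation (12); ICC is obtained recursively over the parents (Eq. 13)
    return icc + imc(n_pd, n_dd, n_ud) + iac(control_counts)

# e.g. a class with ICC 2.5, two ints, one array and one nested class,
# containing three sequence statements and one loop:
print(cci(2.5, 2, 1, 1, {"sequence": 3, "iteration": 1}))  # -> 15.5
```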

2.4 Class Complexity from Cohesion Measure (CCCM)

The cohesion of a class describes how the methods of a class are related to each other. In general, high cohesion is desirable as it promotes encapsulation, while low cohesion indicates a high likelihood of errors, design change and high class complexity. This section presents a metric to calculate class complexity through a cohesion measure. Four cohesion measures are used, namely:


Cohesion Among Attributes in a class Measure (CAA), Ratio of Cohesion Interactions (RCI), Cohesion Among Methods in a Class (CAMC) and the Normalized Hamming Distance (NHD) metric. Here RCI, CAMC and NHD are calculated using the steps provided by Briand et al. (1999), Bansiya et al. (1999) and Counsell et al. (2006). The RCI considers the data-to-method relationship, while the CAMC considers method-method interactions. The CCCM metric is included in this study to measure the degree of similarity among methods while considering attribute usage. The CCCM is calculated using the procedure given below.

CCCM = (CAA + RCI + CAMC) / 3 (15)

1) Determine the set of methods in the class, M = {m1, m2, ...}.
2) Determine the set of instance variables used in each method, V_i = {v1, v2, v3, ...}, i ∈ M.
3) For each instance variable, calculate the number of methods using it, N_Vi, as M_Vi - 1, where M_Vi is the count of methods that use variable V_i. The value 1 is subtracted to remove the attribute-similarity dependency on the method in which the variable is declared. CAA is then calculated as

CAA = [ Σ_{i=1}^{V} N_Vi ] / ((M - 1) * V) (16)

where M is the number of methods and V is the number of instance variables in the class.
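A sketch of the CAA step follows, representing each method by the set of instance variables it uses; RCI and CAMC are taken as precomputed inputs from the cited literature, and the reconstruction of the CAA formula above (Equation 16) is assumed.

```python
def caa(methods: list) -> float:
    # Steps 1-3 above: each element of `methods` is the set of instance
    # variables a method uses; for each variable, count the methods using
    # it, subtract 1, then normalize by (M - 1) * V.
    variables = set().union(*methods) if methods else set()
    m, v = len(methods), len(variables)
    if m < 2 or v == 0:
        return 0.0
    usage = sum(sum(1 for mv in methods if var in mv) - 1
                for var in variables)
    return usage / ((m - 1) * v)

def cccm(methods: list, rci: float, camc: float) -> float:
    # Equation (15)
    return (caa(methods) + rci + camc) / 3

# e.g. a class whose three methods all touch the same two attributes is
# maximally cohesive under CAA:
print(caa([{"x", "y"}, {"x", "y"}, {"x", "y"}]))  # -> 1.0
```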

3 FAULT PREDICTION USING OBJECT ORIENTED METRICS

The present study proposes the use of machine learning algorithms to analyze the performance of the proposed metrics in predicting design flaws in OO programs. The proposed method consists of four steps: (i) selection of metrics, (ii) dimensionality reduction, (iii) normalization of the metric values, and (iv) implementation of the prediction model. Here the prediction model is cast as an ensemble binary classification task, where a module is predicted as either faulty (complex) or not-faulty (normal).

3.1 SELECTION OF METRICS

The four proposed metrics are combined with twenty existing metrics (Table 1) during fault prediction. The existing metrics were chosen because of their wide usage in fault detection.

TABLE 1: LIST OF SELECTED EXISTING METRICS

A. Simple Metrics
1) LOC (Total number of lines)
2) BR (Number of methods)
3) NOP (Total number of unique operators)
4) NOPE (Total number of unique operands)
5) RE (Readability with comment percentage)
6) VO (Volume)

B. Mood Metrics
1) MHF (Method hiding factor)
2) AHF (Attribute hiding factor)
3) MIF (Method inheritance factor)
4) AIF (Attribute inheritance factor)
5) PF (Polymorphism factor)
6) CF (Coupling factor)

C. Chidamber & Kemerer's Metrics
1) WMC (Weighted Methods per Class)
2) DIT (Depth of Inheritance Tree)
3) NC (Number of Children)
4) COC (Coupling between Object Classes)
5) RC (Response for a Class)
6) LCM (Lack of Cohesion in Methods)

D. Program Complexity Measures
1) CC (Cyclomatic Complexity)
2) FI-FO (Fan-In / Fan-Out)

3.2 DIMENSIONALITY REDUCTION

A vital step in designing a classification model is the selection of the set of input metrics; unless they are selected carefully, the result is the curse of dimensionality. This phenomenon can be avoided by a dimensionality reduction procedure, which aims to reduce the number of input variables by removing irrelevant data and retaining only the most discriminating data. In the present study, sensitivity analysis of the data is used for this purpose. Sensitivity analysis examines the importance of each input in relation to a particular model and estimates the rate of change of the output as a result of varying the input values. The resulting estimates can be used to determine the importance of each input variable (Saltelli et al., 2000). This study adopts the Sensitivity Causal Index (SCI) proposed by Goh and Wong (1991).
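The exact SCI formulation is due to Goh and Wong (1991) and is not reproduced here; as a stand-in, the sketch below ranks features by a generic perturbation-based sensitivity: each normalized input is perturbed by a small delta and the average change in the model output is recorded. The model.predict interface is an assumption, not part of the paper.

```python
import numpy as np

def sensitivity_ranking(model, X: np.ndarray, delta: float = 0.1) -> list:
    # Baseline predictions on the normalized metric matrix X
    base = model.predict(X).astype(float)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += delta  # perturb feature j only
        scores.append(np.mean(np.abs(model.predict(Xp).astype(float) - base)))
    # Feature indices, most influential first; keep the top-k as the
    # reduced metric set
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])
```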

3.3 NORMALIZATION

This step normalizes each input to the same range, makes sure the initial default parameter values are appropriate, and gives every input equal importance at the start. Further, normalization improves the training process of the classifier. Normalization is performed by estimating the upper and lower bounds of each metric value and then scaling the values using Equation (17).

V'_j = (V_j - min(V_j)) / (max(V_j) - min(V_j)) (17)

where V'_j is the normalized (scaled) value, and min(V_j) and max(V_j) are the minimum and maximum bounds of metric j over the n observations.


The result of normalization thus maps each input value to the closed interval [0, 1].
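Equation (17) as code: a column-wise min-max scaling of the metric matrix into [0, 1]. The zero-range guard is an addition the paper does not discuss.

```python
import numpy as np

def min_max_normalize(V: np.ndarray) -> np.ndarray:
    # Column-wise bounds over the n observations of each metric j
    lo, hi = V.min(axis=0), V.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (V - lo) / span
```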

3.4 PREDICTION MODEL

The steps to build the prediction model are given below.

Step 1: Identify a classifier.
Step 2: Identify the feature vector to be used as input.
Step 3: Choose a partitioning method (training and testing sets).
Step 4: Train and test the classifier using the n-fold cross-validation method.

The present study uses an ensemble classifier for fault prediction. Three classifiers, namely Feed Forward Back Propagation Artificial Neural Network (BPNN), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN), are considered during ensembling. The input feature vector is created by combining the proposed metrics (Section 2) with traditional metrics, and the salient data is identified using the procedures described in Sections 3.1 and 3.2. The resultant dataset is then partitioned into training and testing sets using the holdout method, which randomly partitions the dataset into two independent sets. Generally, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The method is pessimistic because only a portion of the initial data is used to derive the model. Thus, given the input data (combined proposed and traditional metrics), the ensemble classifier (BPNN + SVM + KNN) marks each input as belonging to one of two categories (faulty or not-faulty).
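The holdout partitioning described above, as a small sketch; the two-thirds/one-third ratio is the stated default but is left as a parameter here.

```python
import numpy as np

def holdout_split(X: np.ndarray, y: np.ndarray,
                  train_frac: float = 2 / 3, seed: int = 0):
    # Randomly permute module indices, then cut into train/test partitions
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```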

The results of the classifiers in the ensemble model are combined using a combination of majority voting and a weighting scheme. The modified majority vote scheme that incorporates the weighting is explained below. Let the decision of the t-th classifier be d_{t,j} ∈ {0, 1}, t = 1, ..., T and j = 1, ..., C, where T is the number of classifiers and C is the number of classes. If the t-th classifier chooses class j, then d_{t,j} = 1, and 0 otherwise. In the weighted majority voting scheme, a class J is chosen if

Σ_{t=1}^{T} w_t * d_{t,J} = max_{j=1..C} Σ_{t=1}^{T} w_t * d_{t,j} (18)

Here w_t, the weight assigned to classifier t, is calculated using Kuncheva's (2004) method (Equation 19).

w_t = log( p_t / (1 - p_t) ) (19)

where p_t is the estimated accuracy of classifier t.
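Equations (18) and (19) together amount to a weighted majority vote, sketched below; preds holds each classifier's predicted class and acc its estimated accuracy p_t, assumed to lie strictly between 0 and 1.

```python
import math

def ensemble_vote(preds: list, acc: list, n_classes: int = 2) -> int:
    # Equation (19): Kuncheva's log-odds weight per classifier
    weights = [math.log(p / (1.0 - p)) for p in acc]
    support = [0.0] * n_classes
    for t, j in enumerate(preds):   # d_{t,j} = 1 only for the chosen class
        support[j] += weights[t]
    # Equation (18): pick the class with the largest weighted support
    return max(range(n_classes), key=lambda j: support[j])

# e.g. two weaker classifiers voting 'faulty' outweigh one stronger one:
print(ensemble_vote([1, 1, 0], [0.80, 0.80, 0.90]))  # -> 1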

    4 EXPERIMENTAL RESULTS

The proposed fault-detection classifier system using software metrics was developed in MATLAB 2009, and all experiments were conducted on a Pentium IV machine with 4 GB RAM. The NASA IV & V Facility MDP data (http://mdp.ivv.nasa.gov/repository.html) consists of error data from several projects. This study uses the KC1 project, which consists of records related to a real-time project written in C++ comprising 43,000 LOC. The dataset has a total of 1571 modules, of which 319 are faulty and 1252 are non-faulty. The feature vector created has 24 dimensions, each representing one selected metric. This vector was first normalized to the interval [0, 1] to ensure that all 24 values have equal importance. Dimensionality reduction was then performed on this set to select discriminating metrics by calculating the SCI of each input dimension over the entire normalized dataset with a perturbation value of 0.1. After calculation of the SCI, the metrics were arranged in descending order of SCI and the top 19 metrics were selected. The resultant feature vector after dimensionality reduction consists of CMFCM, FCCM, CCI, CCCM, LOC, BR, RE, WMC, DIT, NC, COC, RC, LCM, MHF, AHF, MIF, AIF, PF and CF. It can be seen that the reduced dataset consists only of metrics which have an impact on the complexity measure, and that all four proposed metrics were selected. This indicates that the proposed metrics can serve as good fault indicators. The reduced dataset with 19 metrics was then divided into training (943 modules) and testing (628 modules) datasets. To evaluate the effect of the proposed metrics, a 15-metric feature set without the proposed metrics was also considered.

Four classification performance measures were used during evaluation: accuracy, precision, recall and F-measure, all derived from the confusion matrix. A 10-fold cross-validation method was used in all experiments. The performance of the single classifiers was compared with that of the ensemble classifiers. For the SVM classifier, the regularization parameter was set to 1, the kernel function used was Gaussian, and the bandwidth of the kernel was set to 0.5. For the KNN classifier, k was set to 3. For the BPNN classifier, 2 hidden nodes with a learning rate of 0.2 were used. A t-test was performed at the 95% confidence level (0.05 level) to analyze the significance of the differences between SVM and BPNN, and between SVM and KNN. The t-test method adopted was proposed by Nadeau and Bengio (2003). This method was chosen because it is better suited to classifiers evaluated with 10-fold cross-validation (Dietterich, 1998); the traditional Student's t-test produces more false significant differences due to the dependencies that exist in the estimates. Further, the effect of the proposed metrics on classification performance was ascertained by running the experiments with the full metric set containing 24 metrics and analyzing the classification accuracy. From the three single classifiers, three 2-classifier PEMs (BPNN + KNN, BPNN + SVM, KNN + SVM) and one 3-classifier PEM (BPNN + KNN + SVM) were built.
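For reference, here is a sketch of the corrected resampled t-test of Nadeau and Bengio (2003) applied to paired per-fold scores; the variance-correction factor (1/k + n_test/n_train) is the standard form of that test, with n_test/n_train = 1/9 under 10-fold cross-validation. The resulting statistic is compared against a Student's t distribution with k - 1 degrees of freedom.

```python
import math

def corrected_t_statistic(scores_a: list, scores_b: list,
                          test_train_ratio: float = 1 / 9) -> float:
    # Paired per-fold differences between two classifiers' scores
    k = len(scores_a)
    d = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(d) / k
    var = sum((x - mean) ** 2 for x in d) / (k - 1)  # sample variance
    # Inflate the variance to account for overlapping training sets
    return mean / math.sqrt((1.0 / k + test_train_ratio) * var)
```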

Tables 3 to 5 show the 1-classifier, 2-classifier and 3-classifier PEM performance of the proposed BPNN, KNN and SVM based ensemble predictors in terms of accuracy, precision, recall and F-measure. To analyze the advantage obtained by the proposed predictors, the proposed models are compared with their traditional single-classifier counterparts. In these tables, SD denotes the standard deviation and the column Sig denotes the status of significance.


In the Sig column, Yes denotes that there is a significant performance difference between the single prediction model and the corresponding ensemble prediction model, while No represents an insignificant difference. A + sign at the end denotes that the ensemble prediction model outperformed the corresponding single prediction model, while a - sign denotes the opposite.

TABLE 3: PERFORMANCE OF BPNN BASED ENSEMBLE PREDICTION MODELS

Feature Set | Model | Accuracy Mean (SD) Sig | Precision Mean (SD) Sig | Recall Mean (SD) Sig | F Measure Mean (SD) Sig
19-Metric | BPNN | 77.38 (3.562) | 80.12 (2.981) | 84.01 (3.015) | 82.02 (3.298)
19-Metric | BPNN+SVM | 84.26 (2.100) Yes(+) | 85.74 (2.441) Yes(+) | 89.87 (2.640) Yes(+) | 87.76 (2.221) Yes(+)
19-Metric | BPNN+KNN | 81.92 (0.960) Yes(+) | 84.11 (1.569) No(-) | 88.14 (1.010) Yes(+) | 86.08 (0.674) Yes(+)
19-Metric | BPNN+SVM+KNN | 82.74 (1.703) Yes(+) | 85.18 (2.258) No(-) | 88.57 (1.270) Yes(+) | 86.84 (1.188) Yes(+)
24-Metric | BPNN | 82.16 (1.201) Yes(+) | 84.76 (2.697) No(-) | 88.22 (1.180) Yes(+) | 86.46 (1.047) Yes(+)
24-Metric | BPNN+SVM | 89.91 (1.236) Yes(+) | 97.36 (0.899) Yes(+) | 93.44 (0.587) Yes(+) | 95.36 (0.745) Yes(+)
24-Metric | BPNN+KNN | 94.55 (1.579) Yes(+) | 98.93 (0.371) Yes(+) | 92.94 (1.574) Yes(+) | 95.84 (0.361) Yes(+)
24-Metric | BPNN+SVM+KNN | 96.17 (1.314) Yes(+) | 99.94 (0.012) Yes(+) | 94.16 (1.122) Yes(+) | 96.96 (0.202) Yes(+)

TABLE 4: PERFORMANCE OF KNN BASED ENSEMBLE PREDICTION MODELS

Feature Set | Model | Accuracy Mean (SD) Sig | Precision Mean (SD) Sig | Recall Mean (SD) Sig | F Measure Mean (SD) Sig
19-Metric | BPNN | 84.98 (2.416) | 89.72 (0.126) | 95.42 (0.124) | 92.48 (0.397)
19-Metric | BPNN+SVM | 89.26 (1.841) Yes(+) | 91.76 (0.441) Yes(+) | 96.42 (0.441) Yes(+) | 94.03 (0.241) Yes(+)
19-Metric | BPNN+KNN | 87.89 (0.306) Yes(+) | 89.97 (0.314) Yes(+) | 95.89 (0.467) No(-) | 92.84 (0.978) Yes(+)
19-Metric | BPNN+SVM+KNN | 88.98 (0.566) Yes(+) | 91.12 (0.876) Yes(+) | 96.16 (0.978) No(-) | 93.57 (0.618) Yes(+)
24-Metric | BPNN | 87.81 (0.382) Yes(+) | 90.76 (0.924) Yes(+) | 96.02 (0.997) No(-) | 93.32 (0.344) Yes(+)
24-Metric | BPNN+SVM | 89.91 (1.236) Yes(+) | 97.36 (0.899) Yes(+) | 93.44 (0.587) Yes(+) | 95.36 (0.745) Yes(+)
24-Metric | BPNN+KNN | 90.26 (1.077) Yes(+) | 97.94 (0.821) Yes(+) | 92.67 (0.687) Yes(+) | 95.23 (0.798) Yes(+)
24-Metric | BPNN+SVM+KNN | 96.17 (1.314) Yes(+) | 99.94 (0.012) Yes(+) | 94.16 (1.122) Yes(+) | 96.96 (0.202) Yes(+)

TABLE 5: PERFORMANCE OF SVM BASED ENSEMBLE PREDICTION MODELS

Feature Set | Model | Accuracy Mean (SD) Sig | Precision Mean (SD) Sig | Recall Mean (SD) Sig | F Measure Mean (SD) Sig
19-Metric | BPNN | 90.62 (1.161) | 90.34 (0.040) | 98.43 (0.068) | 94.21 (1.014)
19-Metric | BPNN+SVM | 93.99 (1.991) Yes(+) | 92.34 (1.461) Yes(+) | 98.77 (0.241) Yes(+) | 95.45 (0.166) Yes(+)
19-Metric | BPNN+KNN | 92.96 (0.989) Yes(+) | 91.27 (0.785) Yes(+) | 98.01 (0.114) No(-) | 94.52 (0.045) Yes(+)
19-Metric | BPNN+SVM+KNN | 93.41 (1.562) Yes(+) | 92.08 (1.318) Yes(+) | 98.54 (0.981) No(-) | 95.20 (0.681) Yes(+)
24-Metric | BPNN | 93.16 (1.199) Yes(+) | 91.76 (0.978) Yes(+) | 98.12 (0.457) No(-) | 94.83 (0.457) Yes(+)
24-Metric | BPNN+SVM | 90.26 (1.077) Yes(+) | 97.94 (0.821) Yes(+) | 92.67 (0.687) Yes(+) | 95.23 (0.798) Yes(+)
24-Metric | BPNN+KNN | 94.55 (1.579) Yes(+) | 98.93 (0.371) Yes(+) | 92.94 (1.574) Yes(+) | 95.84 (0.361) Yes(+)
24-Metric | BPNN+SVM+KNN | 96.17 (1.314) Yes(+) | 99.94 (0.012) Yes(+) | 94.16 (1.122) Yes(+) | 96.96 (0.202) Yes(+)

From the results, it can be seen that the inclusion of the proposed metrics increased the efficiency of the ensemble classification model on all performance measures. Further, the application of the ensembling concept also improved fault prediction performance; this is evident from the significant differences obtained in comparison with the single-classifier models. Comparing the three classifiers, the SVM-based prediction models perform better than the BPNN and KNN based models. Considering the number of classifiers, the 3-classifier ensemble model ranked first among all the models. The best performance was produced by the model that combines the BPNN, KNN and SVM classifiers with the 19-metric feature set.

5 CONCLUSION

This paper proposed four new metrics for evaluating the complexity of object-oriented software products. Further, the use of these metrics for OO faulty module detection was analyzed using ensemble prediction classifiers. For this purpose, 24 metrics related to the complexity of a system were selected. A sensitivity index was used to select the relevant metrics for classification after normalization. Three classifiers, namely BPNN, SVM and KNN, were used to generate the ensemble classifiers; the single classifiers are termed 1-classifier ensemble prediction models, and the three classifiers were also grouped to form further ensemble models (identified as 2-classifier and 3-classifier models). The performance was analyzed using accuracy, precision, recall and F-measure. When compared with the single-classifier systems, all the proposed models produced improved classification performance, and the 3-classifier model that combined BPNN, SVM and KNN produced the best


results. Moreover, the results also show that the proposed complexity metrics improve faulty module prediction irrespective of the classifier used. Future research is planned in the direction of developing new design metrics and using them with the proposed classifiers.

6 REFERENCES

[1] Arockiam, L. and Aloysius, A. (2011) Attribute Weighted Class Complexity: A New Metric for Measuring Cognitive Complexity of OO Systems, World Academy of Science, Engineering and Technology, Vol. 58, Pp. 808-813.
[2] Bansiya, J., Etzkorn, L., Davis, C. and Li, W. (1999) A class cohesion metric for object-oriented designs, Journal of Object-Oriented Programming, Vol. 11, No. 8, Pp. 47-52.
[3] Briand, L.C., Devanbu, P.T. and Melo, W.L. (1997) An Investigation into Coupling Measures for C++, International Conference on Software Engineering, Pp. 412-421.
[4] Briand, L.C., Morasca, S. and Basili, V.R. (1999) Defining and validating measures for object-based high-level design, IEEE Transactions on Software Engineering, Vol. 25, No. 5, Pp. 722-743.
[5] Catal, C. and Diri, B. (2009) Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem, Information Sciences, Elsevier, Vol. 179, Pp. 1040-1058.
[6] Chowdhury, I. and Zulkernine, M. (2011) Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities, Journal of Systems Architecture, Elsevier, Vol. 57, Pp. 294-313.
[7] Counsell, S. and Newson, P. (2000) Use of Friends in C++ Software: An Empirical Investigation, Journal of Systems and Software, Vol. 53, No. 1, Pp. 15-21.
[8] Counsell, S., Newson, P. and Mendes, E. (2004) Design Level Hypothesis Testing Through Reverse Engineering of Object-Oriented Software, International Journal of Software Engineering, Vol. 14, No. 2, Pp. 207-220.
[9] Counsell, S., Swift, S. and Crampton, J. (2006) The interpretation and utility of three cohesion metrics for object-oriented design, ACM Transactions on Software Engineering and Methodology (TOSEM), Vol. 15, No. 2, Pp. 123-149.
[10] Damm, L.O. and Lundberg, L. (2007) Company-Wide Implementation of Metrics for Early Software Fault Detection, Proceedings of the 29th International Conference on Software Engineering (ICSE '07), IEEE Computer Society, USA, Pp. 560-570.
[11] Dietterich, T. (1998) Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, Vol. 10, Pp. 1895-1924.
[12] English, M., Buckley, J., Cahill, T. and Lynch, K. (2005) An Empirical Study of the Use of Friends in C++ Software, International Workshop on Program Comprehension, Pp. 329-332.
[13] Goh, T.H. and Wong, F. (1991) Semantic extraction using neural network modeling and sensitivity analysis, Proceedings of the IEEE International Joint Conference on Neural Networks, Pp. 1821.
[14] Henry, S.M. and Kafura, D. (1981) Software structure metrics based on information flow, IEEE Transactions on Software Engineering, Vol. SE-7, Pp. 510-518.
[15] NASA IV & V Facility MDP repository, http://mdp.ivv.nasa.gov/repository.html
[16] Kuncheva, L.I. (2004) Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, New Jersey.
[17] Nadeau, C. and Bengio, Y. (2003) Inference for the generalization error, Machine Learning, Vol. 52, Pp. 239-281.
[18] Saltelli, A., Chan, K. and Scott, E.M. (2000) Sensitivity Analysis, John Wiley & Sons.
[19] Wang, Y. (2002) On Cognitive Informatics, IEEE International Conference on Cognitive Informatics, Pp. 69-74.


C. Neelamegam received the M.Sc. degree from Bharathidasan University, Trichy, India. He is the head of the Computer Applications department at Sri Venkatesara College of Computer Applications and Management, Coimbatore, India. His research work has appeared in the Global Journal of Computer Science and Technology, the International Journal of Computer Applications, and the Journal of Computer Science.

M. Punithavalli holds a Ph.D. degree in Computer Applications from Alagappa University, Tamilnadu, India. She is the Director of the Computer Applications Department of Sri Ramakrishna College of Engineering, Coimbatore, India. She has guided several Ph.D. scholars, serves as an editorial board member for various publications, and has published many research papers in different journals.
