
An Improved Similarity Measure for Binary Features in Software Clustering

Rashid Naseem∗, Onaiza Maqbool†, Siraj Muhammad‡
∗† Dept. of Computer Science, Quaid-I-Azam University, Islamabad
‡ Elixir Technology Pakistan (PVT) LTD
Email: ∗[email protected], †[email protected], ‡[email protected]

Abstract—In recent years, there has been increasing interest in exploring clustering as a technique to recover the architecture of software systems. The efficacy of clustering depends not only on the clustering algorithm, but also on the choice of entities, features and similarity measures used during clustering. It is also important to understand characteristics of the domain in which clustering is being applied, since the performance of different measures and algorithms may vary depending on these characteristics. In the software domain, the Jaccard similarity measure gives better results as compared to other similarity measures for binary features. In this paper, we highlight cases where the Jaccard measure may fail to capture similarity between entities appropriately. We propose a new similarity measure which overcomes these deficiencies. Our experimental results indicate the better performance of the new similarity measure for software systems exhibiting the defined characteristics.

Keywords-Software Clustering, Binary Features, Jaccard-NM Measure, Jaccard Measure, Arbitrary decisions

I. INTRODUCTION

Clustering is the process of finding similar groups in data, and has been applied in diverse disciplines. In the software domain, an important application of cluster analysis is in automating the recovery of the high level architecture of software systems from the source code. Recovering the architecture is important when the architectural documentation does not exist, or when the maintainers of large software systems change the system structure without updating documentation. In this situation, attempts must be made to recover the architecture from the source code. This recovery can be very useful for software understanding and re-modularization.

Given the importance of architectural level understanding of large software systems, different techniques have been used to automatically extract the architecture from the source code. Besides clustering [1], [2], these techniques include concept analysis [3], association rule mining [4], and graphical visualization [5].

The first step in the clustering process includes selection of entities to be clustered, and their features. The second step is to apply a similarity measure to determine which entities are most similar based on their features. Next, a clustering algorithm is applied to group together similar entities. Finally, the results are evaluated. Results of clustering depend on the nature of the data, the features selected, the similarity measure used and on the clustering algorithm. It has also been seen that the results vary depending on the domain in which clustering is applied. For example, experimental results show that for software, the Jaccard similarity measure gives better results than other measures for binary features [6], [7].

In this paper, we highlight deficiencies of the Jaccard measure which may deteriorate results. For example, one case is when the Jaccard measure takes arbitrary decisions. We call a decision arbitrary when more than two entities are equally similar; in such a case an algorithm selects two entities to be clustered arbitrarily. It has been shown that these arbitrary decisions are problematic [1]. We introduce a new similarity measure, "Jaccard-NM", and illustrate how it overcomes deficiencies in the Jaccard measure. Our experimental results show that in general, our similarity measure produces better results than the Jaccard measure for software clustering.

The organization of this paper is as follows. In Section 2 we present the literature survey. An overview of our clustering approach is given in Section 3. We discuss our new measure in detail in Section 4. Section 5 discusses our experimental setup. In Section 6, we present and analyze the experimental results. Finally, in Section 7, we present the conclusions and future work.

II. RELATED WORK

According to Jackson et al. [8], selecting an appropriate similarity measure is more important than the choice of a clustering algorithm. Therefore, to determine the usefulness of particular measures, researchers have conducted studies evaluating similarity measures for software clustering.

In [6], Davey and Burd evaluated four similarity measures for binary features: Jaccard, Sorensen Dice, Canberra and the Correlation coefficient. They concluded that the Jaccard and Sorensen Dice measures perform identically. They recommended the Jaccard similarity measure because it is simple and easy to use.

In 1999, Anquetil and Lethbridge [9] evaluated different similarity measures and algorithms. They evaluated the Jaccard coefficient, Sorensen Dice coefficient, Simple Matching coefficient, Correlation coefficient and Taxonomic distance on size and design criteria. From experimental results they concluded that Jaccard and Sorensen Dice give good results because these measures do not consider absence of features (zeros). Simple Matching produces black hole clusters (clusters whose size is


very large as compared to other clusters) because it considers absent features, and therefore does not produce good results.

Saeed et al. [7] compared the Jaccard, Correlation, Simple Matching and Sorensen Dice measures. From experimental results they concluded that the Jaccard and Sorensen Dice similarity measures show identical results. The behavior of the Correlation coefficient is similar to Jaccard in the software domain due to the high number of absent features. Their conclusion regarding the Simple Matching coefficient is the same as presented in [9], i.e. the weight given to absent features in the Simple Matching coefficient deteriorates results.

Maqbool and Babri developed a new software clustering algorithm called the weighted combined algorithm [10]. Since the features become non-binary when this algorithm is used, they introduced the Unbiased Ellenberg measure for non-binary features. They evaluated the weighted combined algorithm and five similarity measures: Jaccard, Ellenberg, Unbiased Ellenberg, Euclidean distance and the Pearson correlation coefficient. From experimental results they concluded that the weighted combined algorithm with the Unbiased Ellenberg measure gives better results than the combined and complete linkage algorithms.

A comprehensive study of similarity measures has been carried out in [11]. In this paper, nine similarity measures are compared: Jaccard, Sorensen Dice, Simple Matching, Ochiai, Faith, Baroni-Urbani & Buser, Kulczynski, Russel & Rao, and Roger & Tanimoto. From experimental results the authors conclude that for binary data, four measures (Jaccard, Kulczynski, Ochiai and Sorensen) perform better because they do not employ zero-zero matches (joint absences). For fuzzy set ordination (FSO) they suggest using one of the five similarity measures which perform better: Baroni-Urbani & Buser, Jaccard, Kulczynski, Ochiai and Sorensen.

III. CLUSTERING

Clustering is the process of grouping together similar entities based on their features. For object-oriented software systems an entity is usually a class [12], whereas for structured systems, functions or files have been selected as entities [9]. Before carrying out clustering, preprocessing must be performed on a software system to extract entities and relevant features. As a result, an N x P feature matrix is produced, where N represents the number of entities and P represents the number of features. Features may be characteristics of an entity, e.g. the variables or user defined types used by it, or the relationships between entities, e.g. the inheritance or containment relation in object-oriented systems. For software, features are usually binary, i.e. they indicate the presence or absence of a characteristic or relationship. A feature matrix for a small system containing 3 entities and 5 relationships is presented in Table I.

Table I
FEATURE (N x P) MATRIX FOR A SMALL SYSTEM

          f1   f2   f3   f4   f5
E1        0    1    0    1    1
E2        1    1    0    0    0
E3        1    0    1    0    1
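To make the construction of such a matrix concrete, the following minimal Python sketch (illustrative only, not the authors' extraction tool; the facts mapping is invented example data) builds the binary rows of Table I from a mapping of each entity to the set of features it exhibits.

# Build a binary feature matrix like Table I from entity -> feature-set facts.
facts = {
    "E1": {"f2", "f4", "f5"},
    "E2": {"f1", "f2"},
    "E3": {"f1", "f3", "f5"},
}
features = ["f1", "f2", "f3", "f4", "f5"]

# One binary row per entity: 1 if the entity exhibits the feature, else 0.
matrix = {e: [1 if f in fs else 0 for f in features] for e, fs in facts.items()}
print(matrix["E1"])  # [0, 1, 0, 1, 1] -- row E1 of Table I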

After entity and feature selection, in the second step a measure is applied to calculate similarity, which results in a matrix indicating the similarity between every pair of entities in the system. In the next step, similar entities are placed in clusters using a clustering algorithm. In the last step, the automatically produced results are evaluated using some assessment measure.

A. Similarity Metrics

Some of the well known similarity measures for binary features are listed in Table II.

Table II
SIMILARITY MEASURES FOR MODULARIZATION

S. No   Name                    Mathematical representation
1       Jaccard                 a / (a + b + c)
2       Simple Matching         (a + d) / (a + b + c + d)
3       Sorensen Dice           2a / (2a + b + c)
4       Sokal Sneath            a / (a + 2(b + c))
5       Bray Curtis distance    (b + c) / (2a + b + c)
6       Rogers-Tanimoto         (a + d) / (a + 2(b + c) + d)
7       Gower-Legendre          (a + d) / (a + 0.5(b + c) + d)

These similarity measures are used to calculate the similarity between two entities (or singleton clusters). For these measures, a, b, c and d can be determined using the contingency table shown in Table III. Suppose we have two entities X and Y; then a is the number of features that are present (1) in both entities, b and c represent the number of features that are present (1) in one entity and absent (0) in the other, and d represents the number of features that are absent (0) in both entities. n is the total number of features.

Table III
CONTINGENCY TABLE

                     Y
               1        0        Sum
X      1       a        b        a + b
       0       c        d        c + d
       Sum     a + c    b + d    n

For software, a measure which does not consider d (absence of features) has been shown to perform better than measures which do [7], [13]. In software clustering, measures considering d perform poorly because absence of features does not indicate similarity between two entities; e.g. if two entities do not use the same variable, this does not show that they are similar. As discussed in Section II, the Jaccard measure gives better results than other measures.
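As an illustration of how a, b, c and d are obtained from two binary feature vectors, the following is a minimal sketch under our own naming; it simply counts the four cells of Table III and evaluates the Jaccard entry of Table II.

# Count a, b, c, d for two equal-length binary feature vectors (Table III)
# and compute the Jaccard similarity from Table II.
def contingency(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # present in both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # present in X only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # present in Y only
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # absent in both
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0

E1 = [0, 1, 0, 1, 1]  # rows of Table I
E2 = [1, 1, 0, 0, 0]
print(contingency(E1, E2))  # (1, 2, 1, 1)
print(jaccard(E1, E2))      # 0.25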

B. Agglomerative Hierarchical Clustering Algorithms

Agglomerative hierarchical clustering algorithms employ a bottom-up clustering approach. They treat each entity as a singleton cluster, and merge the two most similar clusters at each step until all entities are merged into a single cluster.



The widely used basic agglomerative hierarchical clustering algorithms are Complete Linkage (CL), Single Linkage (SL), Weighted Average (WA) and Unweighted Average (UWA). When two entities are merged into a cluster, the similarity between the newly formed cluster and other clusters/entities is calculated differently by these algorithms. Suppose we have three entities E1, E2 and E3. Using these algorithms, the similarity between E1 and the newly formed cluster E23 is calculated as follows [13] (a small code sketch of these update rules is given after the list):

• Complete Linkage:
  Similarity(E1, E23) = min(Similarity(E1, E2), Similarity(E1, E3))

• Single Linkage:
  Similarity(E1, E23) = max(Similarity(E1, E2), Similarity(E1, E3))

• Weighted Average:
  Similarity(E1, E23) = 1/2 * Similarity(E1, E2) + 1/2 * Similarity(E1, E3)

• Unweighted Average:
  Similarity(E1, E23) = (Similarity(E1, E2) * size(E2) + Similarity(E1, E3) * size(E3)) / (size(E2) + size(E3))
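The update rules above can be summarized in a short sketch (the function name and signature are ours, for illustration; sim_e1_e2 and sim_e1_e3 denote Similarity(E1, E2) and Similarity(E1, E3)).

# Similarity between E1 and the merged cluster E23 under each linkage rule.
def update_similarity(sim_e1_e2, sim_e1_e3, size_e2, size_e3, method):
    if method == "complete":    # Complete Linkage: least similar pair
        return min(sim_e1_e2, sim_e1_e3)
    if method == "single":      # Single Linkage: most similar pair
        return max(sim_e1_e2, sim_e1_e3)
    if method == "weighted":    # Weighted Average: equal weights
        return 0.5 * sim_e1_e2 + 0.5 * sim_e1_e3
    if method == "unweighted":  # Unweighted Average: weighted by cluster sizes
        return (sim_e1_e2 * size_e2 + sim_e1_e3 * size_e3) / (size_e2 + size_e3)
    raise ValueError("unknown linkage method: " + method)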

C. Quality Assessment

There are two types of methods to assess the quality of results: internal and external. External assessment involves an expert or authoritative decomposition, which is a view of the software's architecture prepared by a human expert. To compare the expert decomposition with the one produced by clustering, an assessment measure is used. MoJoFM is an external assessment measure which calculates the percentage of Move and Join operations required to transform the decomposition produced by a clustering algorithm into an expert decomposition [14]. To compare the result A of our algorithm to the expert decomposition B, we have:

MoJoFM(A, B) = (1 − mno(A, B) / max(∀ mno(A, B))) × 100    (1)

where mno represents the minimum number of Move and Join operations required to transform an automatically created decomposition into the expert decomposition. A higher MoJoFM value denotes greater correspondence between the two decompositions and hence better results.
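As a small illustration of Eq. (1), the following sketch uses our own naming and assumes the mno values have already been computed by a MoJo implementation.

# MoJoFM from precomputed mno values, as in Eq. (1).
def mojofm(mno_ab, max_mno_b):
    # mno_ab: minimum Move/Join operations to turn decomposition A into B.
    # max_mno_b: maximum of mno over all possible decompositions of the entities.
    return (1 - mno_ab / max_mno_b) * 100

print(mojofm(12, 60))  # 80.0 -> close correspondence with the expert decomposition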

Internal assessment measures internal software characteristics, for example cohesion or coupling. Another assessment measure is the number of clusters. The absolute number of clusters may not be an indicator of software quality, but a relatively large number of clusters suggests that the clusters are cohesive and small in size, while a relatively small number of clusters indicates that clusters are large [6]. Thus this measure can be used for a rough assessment of software quality.

IV. THE JACCARD-NM SIMILARITY MEASURE

In this section, we first present two cases which highlight some deficiencies of the Jaccard measure. We then introduce a new Jaccard-like similarity measure which overcomes these deficiencies.

A. Cases

Case 1: The number of a's is different among entities, but the similarity is the same.

An example feature matrix with six entities (E1-E6) and five features (f1-f5) for this case is presented in Table IV. The corresponding similarity matrix using the Jaccard measure is given in Table V. It can be seen from Table V that the Jaccard measure finds entities E1 and E2, E3 and E4, and E5 and E6 to be equally similar although they have a different number of a's. From Table IV it can be seen that for the entities E5 and E6, a = 5, for E3 and E4, a = 4, and finally for E1 and E2, a = 2. Thus any algorithm which uses the Jaccard measure for calculating similarity will take an arbitrary decision. In this case, it may be more appropriate to cluster the entities sharing the largest number of features, i.e. having the greater number of a's.

Table IV
SOFTWARE SYSTEM A

Entities   f1   f2   f3   f4   f5
E1         1    1    0    0    0
E2         1    1    0    0    0
E3         1    1    1    1    0
E4         1    1    1    1    0
E5         1    1    1    1    1
E6         1    1    1    1    1

Table V
SIMILARITY MATRIX USING JACCARD FOR SYSTEM A

Entities   E1     E2     E3     E4     E5     E6
E1         0
E2         1      0
E3         0.5    0.5    0
E4         0.5    0.5    1      0
E5         0.4    0.4    0.8    0.8    0
E6         0.4    0.4    0.8    0.8    1      0

Case 2: The number of a's is very high for two entities, but they are not completely similar.


An example feature matrix with four entities (E1-E4) and ten features (f1-f10) for this case is presented in Table VI. The corresponding similarity matrix using the Jaccard measure is given in Table VII. It can be seen from Table VII that Jaccard finds entities E1 and E2 to be most similar, since they share one feature, f1. On the other hand, E3 and E4 are found to be less similar although they share 8 features, because of feature f9, which is accessed by E4 but not by E3. The feature vectors of E3 and E4 indicate common functionality. In this case, it may be more useful to cluster the entities sharing a larger number of features (greater number of a's) even if there are a few b's and c's indicating differences.

Table VI
SOFTWARE SYSTEM B

Entities   f1   f2   f3   f4   f5   f6   f7   f8   f9   f10
E1         1    0    0    0    0    0    0    0    0    0
E2         1    0    0    0    0    0    0    0    0    0
E3         1    1    1    1    1    1    1    1    0    0
E4         1    1    1    1    1    1    1    1    1    0

Table VII
SIMILARITY MATRIX USING JACCARD FOR SYSTEM B

        E1      E2      E3      E4
E1      0
E2      1       0
E3      0.125   0.125   0
E4      0.111   0.111   0.888   0

B. Derivation of Jaccard-NM

The two cases presented above illustrate that the problem with the Jaccard measure arises because it does not consider the proportion of common features relative to the total number of features. To solve this problem, we take the Jaccard coefficient and add the total number of features, n, to the denominator. Our new measure is defined as follows:

Jaccard-NM = a / (a + b + c + n)    (2)

where n is the total number of features:

n = a + b + c + d    (3)

Substituting (3) into (2) gives:

Jaccard-NM = a / (a + b + c + (a + b + c + d))    (4)
           = a / (2(a + b + c) + d)    (5)

It is interesting to note that this measure considers d, i.e. the absence of features. Previous research indicates that measures considering d do not give good results, because absence of features does not indicate similarity [7], [9]. It can be seen from Table II that the similarity measures containing d consider it in the numerator (as a sign of similarity) as well as in the denominator. In our case, however, we do not consider d to be a sign of similarity between two entities; rather, it is included in the denominator only, to determine the proportion of common features relative to the total number of features.
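A small sketch of the measure (our own illustrative code, reusing the entities of Table VI) shows how Jaccard-NM reverses the ranking that Jaccard produces in Case 2.

# Jaccard vs. Jaccard-NM (Eq. 5) on the System B entities of Table VI.
def counts(x, y):
    a = sum(xi & yi for xi, yi in zip(x, y))              # present in both
    b = sum(xi & (1 - yi) for xi, yi in zip(x, y))        # present in x only
    c = sum((1 - xi) & yi for xi, yi in zip(x, y))        # present in y only
    d = sum((1 - xi) & (1 - yi) for xi, yi in zip(x, y))  # absent in both
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0

def jaccard_nm(x, y):
    a, b, c, d = counts(x, y)
    n = a + b + c + d           # total number of features, Eq. (3)
    return a / (a + b + c + n)  # equivalently a / (2(a + b + c) + d)

E1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
E2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
E3 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
E4 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(jaccard(E1, E2), jaccard(E3, E4))        # 1.0  0.888...  -> E1/E2 ranked first
print(jaccard_nm(E1, E2), jaccard_nm(E3, E4))  # 0.0909...  0.421...  -> E3/E4 ranked first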

We now apply Jaccard-NM to the two cases described in Table IV and Table VI. Table VIII presents the result of Jaccard-NM for Case 1. We can see from Table VIII that Jaccard-NM prioritizes the similarity between entities E5 and E6, then E3 and E4, and then E1 and E2. Thus the decision to cluster entities is no longer arbitrary; E5 and E6 are most similar and will be grouped first.

Table VIII
SIMILARITY MATRIX USING JACCARD-NM FOR SYSTEM A

Entities   E1     E2     E3     E4     E5     E6
E1         0
E2         0.28   0
E3         0.22   0.22   0
E4         0.22   0.22   0.44   0
E5         0.2    0.2    0.4    0.4    0
E6         0.2    0.2    0.4    0.4    0.5    0

Table IX gives the result of applying Jaccard-NM to Case 2 (System B). Since E3 and E4 are now most similar, a clustering algorithm will cluster E3 and E4 first, as was suggested, because E3 and E4 share more features (a higher number of a's) than E1 and E2.

Table IX
SIMILARITY MATRIX USING JACCARD-NM FOR SYSTEM B

        E1      E2      E3      E4
E1      0
E2      0.09    0
E3      0.055   0.055   0
E4      0.052   0.052   0.42    0

V. EXPERIMENTAL SETUP

In this section, we describe the test systems and the clustering setup for our experiments.

Table X
BRIEF DESCRIPTION OF DATA SETS

S. No.   Description                                           PLC     FES     SAVT
1        Total number of source code lines                     51768   10402   27311
2        Total number of header (.h) files                     27      39      70
3        Total number of implementation (.cpp, .cxx) files     27      76      37
4        Total number of classes                               69      47      97


Table XI
RELATIONSHIPS BETWEEN CLASSES THAT WERE USED FOR EXPERIMENTS

Name                         Description
Same Inheritance Hierarchy   The relationship between classes that are derived from the same class
Same Class Containment       Represents that classes contain objects of the same class
Same Class in Methods        Represents classes containing objects of the same class declared in a method, locally or as a parameter
Same Generic Class           Represents that two classes are used as instantiating parameters to the same generic class
Same Generic Parameter       The relationship between two generic classes which have the same class as their parameter
Same File                    Indicates that the source code of both classes is written in the same file
Same Folder                  Represents that the files containing the source code of the two classes reside in the same folder

Table XII
STATISTICS OF RELATIONSHIPS AMONG ENTITIES

                                    Count
Relationship Type            PLC     FES    SAVT
Same Inheritance Hierarchy   26      166    986
Same Class Containment       58      56     1032
Same Class in Methods        162     384    1900
Same Generic Class           465     91     49
Same Generic Parameter       0       4      0
Same File                    1812    42     264
Same Folder                  1812    42     264

A. Selection of Data Sets

For our experiments, we selected three object-oriented software systems developed in Visual C++. These are proprietary software systems. Printer Language Converter (PLC) is part of a larger software system and converts intermediate data structures to printer language. Fact Extractor System (FES) is a software system which reads source code, parses it, and finds entities and their different relationships in object-oriented systems. Statistical Analysis Visualization Tool (SAVT) is an application which provides functionality related to statistical data and result visualization. A brief description of these data sets is given in Table X.

B. Fact Extraction

We used the FES (Fact Extractor System) to extract detailed design information, i.e. entities and relationships, from the source code of the Visual C++ systems.

C. Entities and Features

Since a class is the basic unit in object-oriented systems, we selected the class as an entity. Of the different relationships that exist between classes, we selected the seven sibling (indirect) relationships [12] listed in Table XI. We used these relationships because they are well known, and occur frequently within object-oriented systems. Moreover, the similarity measures listed in Table II can only be applied to indirect relationships. Information about relationships among entities in the test systems is given in Table XII.

D. Similarity Measures

To find the similarity between entities we selected the Jaccard similarity measure, which has been shown to give better results than other measures for binary features in the case of software clustering. We compare its results with our new measure, Jaccard-NM.

E. Algorithms

To cluster the most similar entities we selected the well known agglomerative clustering algorithms Complete Linkage, Weighted Average and Unweighted Average, described in Section III-B.
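As an indication of how such a setup can be reproduced, the following sketch clusters a binary feature matrix with SciPy's agglomerative algorithms using 1 - Jaccard-NM as the distance. This is our own illustration, not the authors' tool chain; the example matrix is System A from Table IV.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def jaccard_nm_distance(x, y):
    # 1 - Jaccard-NM, so that more similar entities are closer together.
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    n = len(x)  # total number of features
    return 1.0 - a / (a + b + c + n)

# N x P binary feature matrix (rows = entities, columns = features), Table IV.
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1]])

dists = pdist(X, metric=jaccard_nm_distance)        # condensed distance matrix
for method in ("complete", "weighted", "average"):  # CL, WA (WPGMA), UWA (UPGMA)
    Z = linkage(dists, method=method)
    print(method, fcluster(Z, t=3, criterion="maxclust"))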

F. Assessment

We obtained expert decompositions for each test system by asking the designers of the systems to prepare the decompositions. For assessment of results, we compare the automatically produced clustering results with the expert decompositions at each step of hierarchical clustering using the latest version of MoJo, i.e. MoJoFM [14]. Results are reported by selecting the maximum MoJoFM value obtained during the clustering process.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

Experimental results of the clustering algorithms for the three test systems using MoJoFM are given in Table XIII and Figure 1.

Table XIII
EXPERIMENTAL RESULTS OF JACCARD AND JACCARD-NM USING MOJOFM.
JACC STANDS FOR JACCARD, NM STANDS FOR JACCARD-NM

          PLC           FES           SAVT
          Jacc   NM     Jacc   NM     Jacc   NM
CL        42     74     36     50     53     53
WA        48     76     43     48     56     44
UWA       42     71     50     55     53     49
Average   44     74     43     51     54     49

We can see from Table XIII and Figure 1 that for PLC and FES, our similarity measure produces significantly better results than the Jaccard similarity measure. For SAVT, Jaccard produces better results than the Jaccard-NM measure for two out of the three algorithms.

102115115

Figure 1. Experimental results for Complete Linkage, Weighted Average and Unweighted Average using MoJoFM

An analysis of the FES and PLC data sets shows that the two cases discussed in Section IV occur in these data sets, while they do not occur in the SAVT data set.

Thus our experimental results indicate that the Jaccard-NM measure performs significantly better than the Jaccard measure for systems that exhibit the characteristics described in Table IV and Table VI. Hence, Jaccard-NM can be used to improve clustering results for software systems, resulting in an architecture that is closer to an expert's view.

VII. CONCLUSION

In this paper, we proposed a new similarity measure for software clustering when the features are binary. This measure, called Jaccard-NM, overcomes some deficiencies in the Jaccard measure, which has shown better results for software clustering as compared to other similarity measures. We presented cases where the Jaccard measure may not give appropriate results, and then demonstrated the better results of Jaccard-NM for these cases.

We conducted experiments on proprietary software systems using well known clustering algorithms with both the Jaccard measure and Jaccard-NM. The results clearly show the better performance of our new measure. We conclude from our experimental work that in general, Jaccard-NM produces better results than Jaccard.

In the future, we intend to test the new measure using more data sets. We would also like to investigate how it can be tailored for non-binary features.

REFERENCES

[1] O. Maqbool and H. A. Babri, "Hierarchical clustering for software architecture recovery," IEEE Trans. Software Eng., vol. 33, no. 11, pp. 759–780, November 2007.

[2] P. Andritsos and V. Tzerpos, "Information theoretic software clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150–165, February 2005.

[3] P. Tonella, "Concept analysis for module restructuring," IEEE Trans. Software Eng., vol. 27, pp. 351–363, April 2001.

[4] C. Tjortjis, L. Sinos, and P. Layzell, "Facilitating program comprehension by mining association rules from source code," Proc. Int'l Workshop Program Comprehension, pp. 125–132, May 2003.

[5] M. Consens, A. Mendelzon, and A. Ryman, "Visualizing and querying software structures," Proc. Int'l Conference on Software Engineering (ICSE), pp. 138–156, May 1992.

[6] J. Davey and E. Burd, "Evaluating the suitability of data clustering for software remodularisation," Proc. Working Conf. Reverse Eng., pp. 268–276, November 2000.

[7] M. Saeed, O. Maqbool, H. A. Babri, S. Hassan, and S. Sarwar, "Software clustering techniques and the use of combined algorithm," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 301–306, March 2003.

[8] D. A. Jackson, K. M. Somer, and H. H. Harvey, "Similarity coefficients: measures of co-occurrence and association or simply measures of occurrence?" The American Naturalist, vol. 133, pp. 436–453, March 1989.

[9] N. Anquetil and T. C. Lethbridge, "Experiments with clustering as a software remodularization method," Proc. Working Conf. Reverse Eng., pp. 235–255, 1999.

[10] O. Maqbool and H. A. Babri, "The weighted combined algorithm: a linkage algorithm for software clustering," Proc. Int'l Conf. Software Maintenance and Reeng., pp. 15–24, 2004.

[11] R. L. Boyce and P. C. Ellison, "Choosing the best similarity index when performing fuzzy set ordination on binary data," Journal of Vegetation Science, vol. 12, pp. 711–720, 2001.

[12] A. Q. Abbasi, "Application of appropriate machine learning techniques for automatic modularization of software systems," MPhil thesis, Quaid-e-Azam University, Islamabad, 2008.

[13] N. Anquetil, C. Fourier, and T. C. Lethbridge, "Experiments with hierarchical clustering algorithms as software remodularization methods," Proc. Working Conf. Reverse Eng., 1999.

[14] Z. Wen and V. Tzerpos, "An effectiveness measure for software clustering algorithms," Proc. Int'l Workshop Program Comprehension, pp. 194–203, June 2004.
