![Page 1: [IEEE 2010 International Conference on Emerging Technologies (ICET) - Islamabad, Pakistan (2010.10.18-2010.10.19)] 2010 6th International Conference on Emerging Technologies (ICET)](https://reader036.vdocuments.mx/reader036/viewer/2022082721/5750ab971a28abcf0ce0a5de/html5/thumbnails/1.jpg)
Role of Relationships during Clustering of Object-Oriented Software Systems
Siraj Muhammad∗, Onaiza Maqbool†, Abdul Qudus Abbasi‡
∗†Dept. of Computer Science, Quaid-e-Azam University, Islamabad
‡Elixir Technology Pakistan (Pvt) Ltd
Email: ∗[email protected],†[email protected],†[email protected]
Abstract—Clustering has been applied by researchers for the architecture recovery of software systems. Clustering algorithms form clusters of similar entities, where similarity is determined by the characteristics of an entity or the relationships that exist between entities. Thus selecting appropriate relationships is important for improving cluster quality. As compared to structured systems, for which relationships have been evaluated, relatively little work has been done for object-oriented software systems to determine which relationships produce better clustering results. In this paper, we divide relationships within object-oriented systems into different categories and evaluate them. We conduct experiments on three test systems using well known hierarchical clustering algorithms. Our experimental results indicate the relationships that improve the quality of clustering results.
I. INTRODUCTION
It is a well known fact that software maintenance is an
expensive activity. According to Glass [1], maintenance is the
single most expensive software activity, and hence perhaps
also the most important. Studies report that system understand-
ing takes up 47% or more of the software maintenance effort
[2]. These facts have led researchers to explore techniques for
easing software system understanding.
System understanding may be gained at the detailed (pro-
gram) level, or at the high (architectural) level. Architec-
ture, which refers to the structure of a system, “comprises
software elements, the externally visible properties of those
elements, and relationships among them” [3]. Architectural
level understanding is important for many reasons including
determining whether a system has the ability to fulfill its
requirements, to adapt to changing requirements, and also for
enabling reuse of components [4]. A widely used technique for
gaining architectural level understanding of software systems
is clustering, which refers to the process of grouping similar
entities together. Researchers have employed this technique for
gaining architectural understanding [5], [6], for architecture
recovery [4], [7], and have also developed new clustering
algorithms for this purpose [8], [9].
When the clustering process is applied, the first step is
to identify the entities that are to be clustered, and their
features (characteristics) or relationships on the basis of which
similarity between entities will be determined. Since the results
of clustering depend on identifying appropriate relationships,
there is a need to evaluate which relationships are more
important and provide relatively better results.
For structured systems, researchers have evaluated individ-
ual relationships [8], [10]. Moreover, they have categorized
relationships into direct and indirect, and have evaluated
them. Direct relationships represent an immediate connection
between two entities (e.g. if function f1 calls another function
f2, then f1 and f2 are directly related), whereas indirect relationships
represent the proportion of common features that
two entities share (e.g. if functions f1 and f2 both call a
function f3, then f1 and f2 are indirectly related). Although
researchers have worked on clustering object-oriented software
systems, there has been little focus on identifying relationships
that may be more useful, and may lead to better architectural
understanding. Also, there has been no categorization of
relationships in object-oriented systems into direct/indirect. It
is relevant to note that the number of relationships in object-oriented
systems is larger than in structured systems due to
features such as inheritance; evaluating relationships is therefore
of greater significance.
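
The distinction between direct and indirect relationships can be illustrated with a minimal sketch. The call graph below and the function names `f1` to `f4` are hypothetical; counting shared callees is only one simple way to express "common features" and is an assumption of this example, not a definition taken from the cited work.

```python
from itertools import combinations

# Hypothetical call graph: caller -> set of callees (illustrative names only).
calls = {
    "f1": {"f2", "f3"},
    "f2": {"f3"},
    "f4": set(),
}

def directly_related(a, b):
    """True if one function calls the other (an immediate connection)."""
    return b in calls.get(a, set()) or a in calls.get(b, set())

def shared_callees(a, b):
    """Callees common to both functions (the basis of an indirect relationship)."""
    return calls.get(a, set()) & calls.get(b, set())

for a, b in combinations(sorted(calls), 2):
    print(a, b, "direct:", directly_related(a, b),
          "shared callees:", sorted(shared_callees(a, b)))
```

Here `f1` and `f2` are both directly related (`f1` calls `f2`) and indirectly related (both call `f3`), while `f4` is related to neither.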
In this paper, we identify important relationships that may
exist between entities within object-oriented systems. We place
these relationships into direct and indirect categories and then
evaluate these categories experimentally by clustering three
real life software systems. For clustering, we use well known
clustering algorithms. Thus this paper addresses the important
issue of identifying (through experiments) the relationship
categories that produce better clustering results for object-
oriented systems, and thus produce better architectural under-
standing.
This paper is organized as follows. Section II describes
related work. Section III discusses our clustering approach.
Section IV describes the experimental setup, results and anal-
ysis. Section V presents conclusions and future work.
II. RELATED WORK
For recovering the architecture of software systems, archi-
tectural and non-architectural inputs can be used [11]. Non-
architectural inputs include static, dynamic, formal and non-
formal relationships, whereas architectural inputs are architec-
tural styles and viewpoints.
Static relationships are extracted from the source code, and
directly impact the software behavior. They have been used
by researchers for architecture recovery of structured as well
as object-oriented systems. Koschke [4] used a number of
relationships, e.g. function calls, global variable access and
function parameters, for architectural component recovery in
structured systems. These relationships are also used in the
Bunch tool [5]. For object-oriented systems, static relationships
including inheritance and containment have been explored in [12].

978-1-4244-8058-6/10/$26.00 ©2010 IEEE

TABLE I
CATEGORIES OF RELATIONSHIPS BETWEEN ENTITIES IN AN OBJECT-ORIENTED SYSTEM

| Name | Description | Direct/Indirect |
|---|---|---|
| Inheritance based (IC) | | |
| Inheritance | The inheritance relationship between a base class and a derived class | D |
| Same Inheritance Hierarchy | The relationship between classes that are derived from same class | I |
| Inheritance Type | The inheritance type that exists between classes i.e. private, protected or public | D |
| Virtual Method Override | Represents that virtual methods written in a base class are overridden in derived class | D |
| Base Class Variable Access | Represents that at least one member variable of base class is accessed by derived class | D |
| Base Class Method Access | Represents that at least one method of base class is accessed by derived class | D |
| Containment based (CC) | | |
| Containment as Object | The relationship is formed by declaring an object of a class in another class (container) | D |
| Same Class Containment | Represents that classes contain objects of same class | I |
| Variable Access | Represents that at least one public data member of contained class is accessed by container class | D |
| Method Access | Represents that at least one method of contained class is accessed by container class | D |
| Association based (AC) | | |
| Maintaining Pointer | The relationship where an address variable of one class is declared in another class | D |
| Maintaining Reference | The relationship formed by declaring a reference variable of one class in another class | D |
| Method Parameter | Represents that a method of a class takes an object/pointer/ref. of another class as its parameter | D |
| Method Local | Represents that a method of a class declares an object/pointer of another class as its local member | D |
| Same Class in Methods | Represents classes containing objects of same class declared in a method locally or as parameter | I |
| Files based (FC) | | |
| Same File | This relationship indicates that source code of both classes is written in same file | I |
| Include Source File | Represents that source file of one class includes source file of other class using include statement | D |
| Same Folder | Represents that files containing source code of two classes reside in same folder | I |
Dynamic relationships are determined during program ex-
ecution. Dynamic information supplements the information
obtained statically and thus may also be useful in architecture
recovery, as shown by results in [13].
Non-formal relationships do not have direct impact on the
behavior of software. For example, file name and directory
structure may be helpful in gaining architectural understand-
ing. Researchers have used different non-formal relationships
in their experiments [8], [10].
For structured systems, researchers have divided relation-
ships into direct and indirect and have evaluated these cat-
egories separately [14], [15]. Results indicate that indirect
relationships produce better results [15].
For object-oriented systems, Abbasi used twenty-six relationships
for architecture recovery [16]. He used the relationships
together and did not evaluate them individually.
III. OUR CLUSTERING APPROACH
As described in Section I, clustering has been used by many
researchers for software architecture recovery of structured [7],
[10] as well as object-oriented systems [12], [17]. Clustering is
concerned with grouping entities based on their characteristics
(features) and/or relationships. For software, an entity may be
a file, a function or a class. Relationships represent the dependencies
between entities. In the first step during clustering, a
feature matrix, i.e. an m × n matrix, is formed, where m is the
number of entities and n is the number of features representing
the relationships between entities. For our experiments we
selected classes as entities. We used a subset of the 26 different
relationships used by Abbasi [16] to find similar entities. The
relationships we used are given in Table I. ’D’ represents a
direct relationship and ’I’ represents an indirect relationship.
We selected this set because it represents commonly used
relationships within object-oriented systems.
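
As an illustration of the feature matrix described above, the following minimal sketch builds a binary m × n matrix with one row per class and one column per feature. The class names and the `relationship:target` encoding of features are hypothetical and chosen only for this example; the paper does not specify how features are encoded.

```python
# Entities (rows of the matrix); hypothetical class names.
classes = ["Parser", "Lexer", "Report"]

# Hypothetical features observed for each class, written as "relationship:target".
observed = {
    "Parser": {"contains:Token", "inherits:Component"},
    "Lexer": {"contains:Token"},
    "Report": {"inherits:Component", "pointer:Parser"},
}

# Columns of the matrix: every distinct feature, in a fixed (sorted) order.
features = sorted(set().union(*observed.values()))

# matrix[r][c] is 1 if class r has feature c, otherwise 0.
matrix = [[1 if f in observed[c] else 0 for f in features] for c in classes]

for name, row in zip(classes, matrix):
    print(f"{name:8s}", row)
```

With three classes and three distinct features this yields a 3 × 3 matrix; two classes that share a column value of 1 (such as `Parser` and `Lexer` for `contains:Token`) have a feature in common.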
To find the similarity between entities, different similarity
measures are available, e.g. Euclidean distance and the Jaccard
coefficient [14]. To evaluate both direct and indirect relationships
using the same similarity measure, we used an objective
function which is a count of the number of relationships
that exist between entities. A greater number of relationships
between two entities indicates higher similarity between them.
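
The count-based similarity can be sketched as follows. This is a minimal illustration, assuming that extracted relationships are available as (class, class, relationship name) facts; the class names and the list of facts are hypothetical.

```python
from collections import Counter

# Hypothetical relationship facts: (class_a, class_b, relationship name).
facts = [
    ("Shape", "Circle", "inheritance"),
    ("Shape", "Circle", "virtual method override"),
    ("Shape", "Square", "inheritance"),
    ("Circle", "Square", "same inheritance hierarchy"),
    ("Canvas", "Shape", "maintaining pointer"),
]

def similarity_counts(facts):
    """Count the relationships that exist between each unordered pair of classes."""
    counts = Counter()
    for a, b, _name in facts:
        counts[frozenset((a, b))] += 1
    return counts

def similarity(counts, a, b):
    """Similarity of two classes = number of relationships between them (0 if none)."""
    return counts[frozenset((a, b))]

counts = similarity_counts(facts)
print(similarity(counts, "Circle", "Shape"))   # 2
print(similarity(counts, "Canvas", "Square"))  # 0
```

Using an unordered pair as the key makes the resulting similarity symmetric, which is what a similarity matrix for clustering requires.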
After producing the similarity matrix, a clustering algorithm
is applied to cluster the similar entities. Hierarchical agglomer-
ative clustering algorithms, which cluster the two most similar
entities at every step, are commonly used because they are capable
of representing the hierarchical structure of a software
system's architecture. The clustering algorithm runs until all
the entities are in a single cluster or the specified number of
clusters has been formed. We used three well known hierarchical
agglomerative clustering algorithms, namely the Complete linkage,
Weighted average and Unweighted average algorithms [18].
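
The agglomerative procedure can be sketched in a few lines. This is a simplified illustration written for this text, not the implementation used in the experiments: it works directly on a similarity matrix (higher means more similar), so Complete linkage takes the minimum similarity between clusters, Weighted average takes the plain mean of the two merged clusters' similarities, and Unweighted average weights that mean by cluster size. The function name `agglomerate`, its parameters and the example matrix are assumptions of this sketch.

```python
def agglomerate(names, sim, linkage="complete", target=1):
    """Merge the two most similar clusters until `target` clusters remain.

    names: entity names; sim: symmetric matrix of pairwise similarities.
    Returns the clusters as sorted lists of names.
    """
    clusters = {i: [n] for i, n in enumerate(names)}
    # s[(i, j)] with i < j is the current similarity between clusters i and j.
    s = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}

    def pair(a, b):
        return (a, b) if a < b else (b, a)

    while len(clusters) > target:
        i, j = max(s, key=s.get)  # most similar pair of clusters
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            a, b = s[pair(i, k)], s[pair(j, k)]
            if linkage == "complete":      # least similar pair of members
                new = min(a, b)
            elif linkage == "weighted":    # plain mean of the two clusters
                new = (a + b) / 2
            elif linkage == "unweighted":  # mean weighted by cluster size
                new = (ni * a + nj * b) / (ni + nj)
            else:
                raise ValueError(f"unknown linkage: {linkage}")
            s[pair(i, k)] = new            # merged cluster keeps index i
            del s[pair(j, k)]
        del s[(i, j)]
        clusters[i] = clusters[i] + clusters.pop(j)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical similarity counts between four classes A, B, C and D.
names = ["A", "B", "C", "D"]
sim = [
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 2],
    [0, 0, 2, 0],
]
for method in ("complete", "weighted", "unweighted"):
    print(method, agglomerate(names, sim, linkage=method, target=2))
```

On this small example all three linkage rules first merge A with B and then C with D; on real systems the rules generally diverge once clusters of different sizes are compared.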
To assess the results produced by the clustering algorithms,
we compared them with the architecture produced manually
by human experts. To reduce bias, three expert decompositions
were prepared for each test system. The results of the algo-
rithms were compared with the expert decompositions using
the MoJoFM assessment measure [19], which is the latest
version of the MoJo measure [20]. The value of MoJoFM
lies between 0 and 100, where a 0 indicates no similarity be-
tween the decompositions being compared, and 100 indicates
total similarity. The decomposition produced automatically
was compared at every step with each of the three expert
decompositions, and the MoJoFM values thus obtained were
then averaged.
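
For reference, the measure and the averaging step can be written out as below. The first formula is a restatement of the MoJoFM definition as it is commonly given for [19]; the symbols A (the automatically produced decomposition), E_k (the k-th expert decomposition) and mno (the minimum number of Move and Join operations needed to transform one decomposition into another) are notation assumed here rather than taken from this paper.

```latex
\[
\mathrm{MoJoFM}(A, E_k) =
  \left( 1 - \frac{\mathrm{mno}(A, E_k)}{\max_{X}\, \mathrm{mno}(X, E_k)} \right) \times 100
\]
\[
\overline{\mathrm{MoJoFM}}(A) = \frac{1}{3} \sum_{k=1}^{3} \mathrm{MoJoFM}(A, E_k)
\]
```

The maximum in the denominator ranges over all decompositions X of the same set of entities, so a value of 100 is reached only when no Move or Join operations are needed.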
IV. EXPERIMENTAL SETUP AND RESULTS
A. DataSet Description
For our experiments we selected three software systems
which are developed in Visual C++. These systems are pro-
prietary systems developed by a software company which
has a fairly large customer base. SAVT helps in analysis
and visualization of statistical data. FES is a fact extractor.
It reads C++ source code files, extracts entities, and finds
relationships among the entities. PEDS is related to electrical
power systems. It solves the economic power dispatch problem
using conventional and evolutionary computing techniques.
An overview of these systems is provided in Table II and Table
III.
TABLE II
TEST SOFTWARE SYSTEMS

| S. No. | Description | SAVT | FES | PEDS |
|---|---|---|---|---|
| 1 | Total number of source code lines | 27311 | 10402 | 16360 |
| 2 | Total number of header (.h) files | 70 | 39 | 31 |
| 3 | Total number of implementation (.cpp, .cxx) files | 76 | 37 | 27 |
| 4 | Total number of classes | 97 | 47 | 41 |
TABLE III
RELATIONSHIPS WITHIN TEST SYSTEMS. D - DIRECT RELATIONSHIPS, I - INDIRECT RELATIONSHIPS

| Relationship Specification | SAVT | FES | PEDS |
|---|---|---|---|
| Total Relationships Among Classes | 5201 | 1229 | 473 |
| Inheritance based (IC) | 1242 | 365 | 180 |
| Inheritance depth (D) | 26 | 54 | 13 |
| Same inheritance hierarchy (I) | 986 | 166 | 70 |
| Inheritance type (D) | 26 | 54 | 13 |
| Virtual method override (D) | 21 | 12 | 6 |
| Base class variable access (D) | 100 | 33 | 43 |
| Base class method access (D) | 83 | 46 | 35 |
| Containment based (CC) | 1199 | 143 | 61 |
| Containment as object (D) | 41 | 26 | 12 |
| Same class containment (I) | 1032 | 56 | 12 |
| Variable access (D) | 49 | 23 | 20 |
| Method access (D) | 77 | 38 | 17 |
| Association based (AC) | 2171 | 514 | 136 |
| Maintaining pointer (D) | 41 | 11 | 9 |
| Maintaining reference (D) | 0 | 0 | 0 |
| Method parameter (D) | 77 | 63 | 22 |
| Method local (D) | 153 | 56 | 29 |
| Same class in methods (I) | 1900 | 384 | 76 |
| File and folder based (FFC) | 528 | 84 | 72 |
| Same file (I) | 264 | 42 | 36 |
| Include source file (D) | 264 | 42 | 36 |
| Same folder (I) | 0 | 0 | 0 |
B. Experimental Results
We conducted different sets of experiments to evaluate
relationship categories. These are described in the following
sections.
1) Direct and Indirect relationships: As described in Sec-
tion II, indirect relationships have shown better results than
direct relationships for structured systems. To evaluate the per-
formance of indirect relationships for object-oriented systems,
in our first set of experiments we combined all indirect rela-
tionships in Table III and clustered the software systems. The
results were compared with those obtained by combining all
direct relationships. The experimental results using MoJoFM
are given in Table IV. “All” indicates combined direct and
indirect relationships.
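TABLE IV
RESULTS OF DIRECT, INDIRECT AND ALL RELATIONSHIPS USING MOJOFM (D - DIRECT, I - INDIRECT, CL - COMPLETE LINKAGE, WA - WEIGHTED AVERAGE, UWA - UNWEIGHTED AVERAGE)

| System | CL D | CL I | CL All | WA D | WA I | WA All | UWA D | UWA I | UWA All |
|---|---|---|---|---|---|---|---|---|---|
| SAVT | 32 | 43 | 46 | 32 | 43 | 42 | **35** | **30** | 45 |
| FES | 30 | 42 | 37 | 45 | 47 | 44 | 41 | 50 | 41 |
| PEDS | 35 | 55 | 32 | 34 | 52 | 34 | 34 | 48 | 36 |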
Fig. 1. Experimental Results for Direct, Indirect and All Relationships
It can be seen from Table IV and Figure 1 that indirect
relationships produce better results than direct relationships
for all algorithms and test systems, except in one case (SAVT
with the Unweighted average algorithm, highlighted in Table IV,
where direct relationships score 35 and indirect relationships 30).
This suggests that the information contained within indirect
relationships is more meaningful for architectural understanding,
since the higher MoJoFM values indicate that it is closer to the
way human experts view the system.
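
The comparisons in Table IV can be tallied with a short script. The values are transcribed from Table IV; the dictionary layout and variable names are choices made for this sketch.

```python
# MoJoFM values from Table IV: (direct, indirect, all) per system and algorithm.
table_iv = {
    "SAVT": {"CL": (32, 43, 46), "WA": (32, 43, 42), "UWA": (35, 30, 45)},
    "FES":  {"CL": (30, 42, 37), "WA": (45, 47, 44), "UWA": (41, 50, 41)},
    "PEDS": {"CL": (35, 55, 32), "WA": (34, 52, 34), "UWA": (34, 48, 36)},
}

# Flatten to one record per (system, algorithm) case.
cases = [(system, algo, *values)
         for system, row in table_iv.items()
         for algo, values in row.items()]

# Cases where indirect relationships score strictly higher.
beats_direct = [(s, a) for s, a, d, i, al in cases if i > d]
beats_all = [(s, a) for s, a, d, i, al in cases if i > al]
direct_wins = [(s, a) for s, a, d, i, al in cases if d > i]

print(f"indirect > direct: {len(beats_direct)} of {len(cases)} cases")
print(f"indirect > all:    {len(beats_all)} of {len(cases)} cases")
print("direct > indirect in:", direct_wins)
```

Indirect relationships score higher than direct ones in 8 of the 9 cases (the exception is SAVT with Unweighted average) and higher than all relationships combined in 7 of the 9 cases.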
An interesting observation is that indirect relationships also
produce better results than All relationships in seven of the
nine cases in Table IV. This is contrary to results obtained
earlier for structured systems, where experimental results show
that including a larger number of relationships improves results
[14]. Our results indicate that combining direct and indirect
relationships may deteriorate results; it is therefore important
to differentiate between them, especially for object-oriented
systems.

Fig. 2. Experimental results of category-wise direct and indirect relationships using MoJoFM
2) Category-wise direct and indirect relationships: To gain
further insight, we conducted experiments by dividing the
relationships into the four categories (inheritance, containment,
association, and file and folder based) presented in Table III.
Within each category, we divided relationships into direct and
indirect. Experimental results are given in Table V.
It can be seen from Table V and Figure 2 that in general,
results of indirect relationships are better than those of direct
relationships for each of the categories. For FES, results for
indirect relationships are better than or the same as those for
direct relationships, except for Complete linkage in the
containment category. For SAVT, the only exception is Unweighted
average in the association category. For PEDS, the same holds for
the inheritance, association, and file and folder categories,
where results for direct relationships are slightly better than
for indirect ones only for Unweighted average in the association
category. However, all three algorithms produce better
results for direct relationships in the containment category
for PEDS. This may be due to the small number of indirect
relationships in the containment based category (12 out of 61, as
can be seen from Table III). Due to this small number, the
information provided may be insufficient for an algorithm to
produce meaningful clusters.
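TABLE V
CATEGORY-WISE COMPARISON BETWEEN CLUSTERING ALGORITHM AND EXPERT DECOMPOSITION RESULTS USING MOJOFM. (D - DIRECT, I - INDIRECT, CL - COMPLETE LINKAGE, WA - WEIGHTED AVERAGE, UWA - UNWEIGHTED AVERAGE)

| Category | System | CL D | CL I | WA D | WA I | UWA D | UWA I |
|---|---|---|---|---|---|---|---|
| Inheritance (IC) | SAVT | 27 | 37 | 28 | 37 | 28 | 42 |
| Inheritance (IC) | FES | 31 | 38 | 31 | 37 | 33 | 45 |
| Inheritance (IC) | PEDS | 24 | 35 | 25 | 43 | 25 | 49 |
| Containment (CC) | SAVT | 30 | 32 | 37 | 42 | 33 | 49 |
| Containment (CC) | FES | 33 | 32 | 37 | 42 | 45 | 49 |
| Containment (CC) | PEDS | 29 | 26 | 36 | 26 | 43 | 26 |
| Association (AC) | SAVT | 29 | 29 | 34 | 35 | 39 | 31 |
| Association (AC) | FES | 32 | 34 | 29 | 37 | 30 | 34 |
| Association (AC) | PEDS | 26 | 27 | 28 | 28 | 41 | 40 |
| File and Folder (FFC) | SAVT | 28 | 28 | 31 | 31 | 34 | 34 |
| File and Folder (FFC) | FES | 32 | 32 | 37 | 37 | 36 | 36 |
| File and Folder (FFC) | PEDS | 28 | 28 | 38 | 38 | 38 | 38 |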
Figure 2 shows that, for the file and folder category, direct and
indirect relationships give the same results for all algorithms
and test systems. This is because the number of direct and
indirect relationships is the same in this category (Table III).
Moreover, the relationships exist between the same entities,
thus leading to the same clustering results.
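
The exceptions discussed for Table V can be listed mechanically. The values below are transcribed from Table V as (direct, indirect) pairs in the order CL, WA, UWA; the data layout is an assumption of this sketch.

```python
algos = ("CL", "WA", "UWA")

# (direct, indirect) MoJoFM pairs from Table V, ordered CL, WA, UWA.
table_v = {
    "Inheritance": {
        "SAVT": [(27, 37), (28, 37), (28, 42)],
        "FES":  [(31, 38), (31, 37), (33, 45)],
        "PEDS": [(24, 35), (25, 43), (25, 49)],
    },
    "Containment": {
        "SAVT": [(30, 32), (37, 42), (33, 49)],
        "FES":  [(33, 32), (37, 42), (45, 49)],
        "PEDS": [(29, 26), (36, 26), (43, 26)],
    },
    "Association": {
        "SAVT": [(29, 29), (34, 35), (39, 31)],
        "FES":  [(32, 34), (29, 37), (30, 34)],
        "PEDS": [(26, 27), (28, 28), (41, 40)],
    },
    "File and folder": {
        "SAVT": [(28, 28), (31, 31), (34, 34)],
        "FES":  [(32, 32), (37, 37), (36, 36)],
        "PEDS": [(28, 28), (38, 38), (38, 38)],
    },
}

# Cases where direct relationships score strictly higher than indirect ones.
exceptions = [
    (category, system, algo)
    for category, rows in table_v.items()
    for system, pairs in rows.items()
    for algo, (direct, indirect) in zip(algos, pairs)
    if direct > indirect
]

for item in exceptions:
    print(item)
print(len(exceptions), "of 36 cases favour direct relationships")
```

Direct relationships score higher in 6 of the 36 cases: three of them are the containment results for PEDS, and the file and folder category contributes none because its direct and indirect scores are identical.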
Fig. 3. Category wise results using MoJoFM
3) Category-wise relationships: Table VI and Figure 3
present category-wise results for the inheritance, containment,
association, and file and folder based categories for the three
clustering algorithms. It can be seen from Table VI and Figure
3 that no single category produces better results for all test
systems. Based on results of two out of three algorithms, the
inheritance category performs better for SAVT, containment
category performs better for FES and association category
performs better for PEDS. From the experimental results, it
appears that results of individual categories are dependent
on system structure. For example, for SAVT, the inheritance
relationship has played a major role in arriving at a sub-system
structure. Thus even though the results do not indicate better
performance of one category, they are useful because they may
be used to gain insight into the design of a system.
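TABLE VI
CATEGORY WISE RESULTS FOR COMPLETE, WEIGHTED AND UNWEIGHTED AVERAGE USING MOJOFM

| Algorithm | System | IC | CC | AC | FC |
|---|---|---|---|---|---|
| CL | SAVT | 37 | 33 | 29 | 28 |
| CL | FES | 36 | 34 | 36 | 32 |
| CL | PEDS | 27 | 26 | 26 | 28 |
| WA | SAVT | 37 | 37 | 31 | 31 |
| WA | FES | 41 | 50 | 34 | 37 |
| WA | PEDS | 28 | 31 | 43 | 38 |
| UWA | SAVT | 34 | 37 | 32 | 34 |
| UWA | FES | 35 | 45 | 38 | 36 |
| UWA | PEDS | 28 | 42 | 43 | 38 |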
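
The best-performing category for each algorithm and test system can be read off Table VI with a small script. The values are transcribed from Table VI; the helper name `best_categories` and the handling of ties (all tied categories are reported) are choices made for this sketch.

```python
categories = ("IC", "CC", "AC", "FC")

# MoJoFM values from Table VI, in the order IC, CC, AC, FC.
table_vi = {
    "CL":  {"SAVT": (37, 33, 29, 28), "FES": (36, 34, 36, 32), "PEDS": (27, 26, 26, 28)},
    "WA":  {"SAVT": (37, 37, 31, 31), "FES": (41, 50, 34, 37), "PEDS": (28, 31, 43, 38)},
    "UWA": {"SAVT": (34, 37, 32, 34), "FES": (35, 45, 38, 36), "PEDS": (28, 42, 43, 38)},
}

def best_categories(scores):
    """Return every category that attains the highest score (ties included)."""
    top = max(scores)
    return [c for c, v in zip(categories, scores) if v == top]

best = {
    algo: {system: best_categories(scores) for system, scores in rows.items()}
    for algo, rows in table_vi.items()
}

for algo, rows in best.items():
    for system, winners in rows.items():
        print(algo, system, winners)
```

Containment is the single best category for FES under Weighted and Unweighted average, and association is best for PEDS under the same two algorithms. For SAVT, inheritance is best under Complete linkage and tied with containment under Weighted average.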
V. CONCLUSION AND FUTURE WORK
Clustering has been applied by various researchers for gain-
ing architectural understanding and recovering the architecture
of software systems. Relationships play a very important role
during clustering, since they are used to determine similarity
between entities to be clustered. For structured systems, rela-
tionships have been evaluated and they have been categorized
into direct/indirect. For object-oriented systems the number
of relationships is larger than for structured systems due to
features such as inheritance, thus it is important to evaluate which
of the relationships are more useful. However, no attempt has
been made to evaluate the usefulness of these relationships.
In this paper we divided relationships into direct and indirect
for object-oriented systems and evaluated different categories
of relationships. From our experimental results, we conclude
that, in general, indirect relationships produce better results
than direct relationships. An evaluation of various categories
including inheritance, containment, association, and file and
folder based reveals that no single category produces better
results for all datasets. Thus the clustering results depend upon
the structure of software systems.
In the future, we intend to evaluate different combinations
of relationships. The relationships may also be evaluated for
other datasets and algorithms, and the role of relationships
may be explored for refactoring.
REFERENCES
[1] R. L. Glass, “Frequently forgotten fundamental facts about software engineering,” IEEE Software, vol. 18, no. 3, pp. 111–112, May/Jun 2001.
[2] R. Hall, “Seven ways to cut software maintenance costs,” Datamation, vol. 33, no. 14, pp. 81–83, 1987.
[3] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Second ed. Pearson Education, 2004.
[4] R. Koschke, “Atomic architectural component recovery for program understanding and evolution,” Ph.D. dissertation, Institut für Informatik, Universität Stuttgart, 2000.
[5] B. S. Mitchell and S. Mancoridis, “On the automatic modularization of software systems using the Bunch tool,” IEEE Trans. Software Eng., vol. 32, no. 3, pp. 193–208, March 2006.
[6] V. Tzerpos, “Comprehension driven software clustering,” Ph.D. dissertation, Graduate Department of Computer Science, University of Toronto, 2001.
[7] O. Maqbool and H. A. Babri, “Hierarchical clustering for software architecture recovery,” IEEE Trans. Software Eng., vol. 33, no. 11, pp. 759–780, November 2007.
[8] P. Andritsos and V. Tzerpos, “Information theoretic software clustering,” IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150–165, February 2005.
[9] O. Maqbool and H. A. Babri, “The weighted combined algorithm: a linkage algorithm for software clustering,” Proc. Int’l Conf. Software Maintenance and Reeng., pp. 15–24, 2004.
[10] N. Anquetil and T. C. Lethbridge, “Recovering software architecture from the names of source files,” Journal of Software Maintenance: Research and Practice, vol. 11, pp. 201–221, December 1999.
[11] S. Ducasse and D. Pollet, “Software architecture reconstruction: A process-oriented taxonomy,” IEEE Trans. Software Eng., vol. 35, no. 4, pp. 573–591, July-Aug 2009.
[12] M. Trifu, “Architecture-aware, adaptive clustering of object-oriented systems,” Master’s thesis, Forschungszentrum Informatik Karlsruhe, 2003.
[13] C. Xiao and V. Tzerpos, “Software clustering based on dynamic dependencies,” Proc. Int’l Conf. Software Maintenance and Reeng., pp. 124–133, 2005.
[14] J. Davey and E. Burd, “Evaluating the suitability of data clustering for software remodularisation,” Proc. Working Conf. Reverse Eng., pp. 268–276, November 2000.
[15] N. Anquetil and T. C. Lethbridge, “Experiments with clustering as a software remodularization method,” Proc. Working Conf. Reverse Eng., pp. 235–255, 1999.
[16] A. Q. Abbasi, “Application of appropriate machine learning techniques for automatic modularization of software systems,” MPhil. thesis, Quaid-e-Azam University Islamabad, 2008.
[17] T. Systa, “Static and dynamic reverse engineering techniques for Java software systems,” Ph.D. dissertation, University of Tampere, 2000.
[18] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[19] Z. Wen and V. Tzerpos, “An effectiveness measure for software clustering algorithms,” Proc. Int’l Workshop Program Comprehension, pp. 194–203, June 2004.
[20] V. Tzerpos and R. C. Holt, “MoJo: A distance metric for software clusterings,” Proc. Working Conf. Reverse Eng., pp. 187–193, October 1999.