
1 of 63

WCRE 1999 / 2009

Experiments with clustering as a software remodularization method

Nicolas Anquetil, Timothy C. Lethbridge

2 of 63

Forewarning

Nicolas: After this research I became suspicious of the usefulness of clustering for remodularization.

I still am.

3 of 63

You have been warned

(although note that Tim has a less gloomy view)

4 of 63

Agenda

Background of the research

Overview of the paper

From then until now

And now what?

An analogy

Another analogy

5 of 63

Background of the research

Context:

KBRE group, U. of Ottawa, Canada

CSER project (Consortium for Software Engineering Research)

Pairs: university/company (U. of Ottawa/Telecom. company)

Focus on real problems and/or real situations

6 of 63


The project: One company's PBX

2+ MLOC

2+ K files

10+ possible configurations

10+ years old (in 1999)

2 proprietary languages

1 directory

0 packages

7 of 63


Company situation:

High turnover (18 months)

High entry barrier (6+ months to be productive)

Aging software (and languages)

Configuration management difficulties

8 of 63


9 of 63

Overview of the paper

“providing solutions to help software engineers understand, restructure or migrate old software towards more modern architecture and/or languages”

10 of 63


Possible solution:

“Clustering is used to gather software components into modules significant to the software engineers.”

11 of 63

Overview of the paper

Seminal paper by Theo Wiggerts, “Using Clustering Algorithms in Legacy Systems Remodularization”, WCRE'97

Summary of the literature on clustering

Lists all the possible choices

Lists some advantages and drawbacks of these choices

12 of 63


“Clustering is a sophisticated research domain with many methods [...] Reverse engineering is a young domain [...] Clustering has been used with no deep understanding of all the issues involved.”

13 of 63


“Conclusions of Wiggerts' paper are those of the literature which may not entirely hold for reverse engineering.”

14 of 63

Overview of the paper

For example:

Living things naturally fit in an evolution tree (more or less)

Not so with software modularization

This must impact the tools we use and how we use them

15 of 63

Overview of the paper

Three issues

What clustering algorithms to use?

How to compute cohesion? How to describe entities? How to evaluate the results?

16 of 63

Overview of the paper

Algorithms

We tested mainly hierarchical agglomerative algorithms

Some tests with hill-climbing algorithms (”Bunch” tool: Mancoridis)
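To make "hierarchical agglomerative" concrete, here is a minimal sketch, not the implementation used in the paper. It assumes files are described by sets of referenced names, uses Jaccard similarity with complete linkage, and stops at an arbitrary threshold; the file names, routine names and the threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two feature sets: shared features / features present in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(features, threshold):
    """Repeatedly merge the two most similar clusters until no pair reaches `threshold`.

    `features` maps a file name to the set of names it references.
    Cluster similarity is complete linkage: the lowest similarity between
    any file of one cluster and any file of the other.
    """
    clusters = [frozenset([name]) for name in sorted(features)]

    def linkage(c1, c2):
        return min(jaccard(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best_pair = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best_pair) < threshold:
            break
        clusters = [c for c in clusters if c not in best_pair]
        clusters.append(best_pair[0] | best_pair[1])
    return sorted(sorted(c) for c in clusters)


# Hypothetical files and the routines they reference.
files = {
    "call_setup.src": {"alloc_line", "ring", "log"},
    "call_teardown.src": {"free_line", "ring", "alloc_line"},
    "billing.src": {"open_db", "write_record"},
    "report.src": {"open_db", "write_record", "format"},
}
result = agglomerate(files, threshold=0.3)
print(result)
```

Changing the linkage rule or the threshold changes where the merging stops, which is one of the "possible choices" the talk refers to.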

17 of 63

Overview of the paper

Entities

We clustered files (into packages)

Description

Elements contained in the files: types, variables, routines, macros, comments, identifiers

18 of 63


Reminder:

“Clustering algorithms do not discover some hidden structure in a system, but impose a structure on the set of entities they are given.”

19 of 63

Overview of the paper

Some results

Redundancies among description schemes:

File, routine, variable, macro, type

Comments, identifiers

20 of 63


Combining features (routine + variable + ...) improves the results

21 of 63


Direct/sibling links

Sibling links are used more often and give better results
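The sketch below illustrates one reading of the two kinds of link, which is an assumption rather than the paper's exact definition: a direct link exists when one file references something the other defines, while a sibling link measures how much two files resemble each other through the names they both reference. All file and routine names are hypothetical.

```python
def direct_link(a, b, defines, references):
    """True when one of the two files references a name the other defines."""
    return bool(references[a] & defines[b]) or bool(references[b] & defines[a])


def sibling_similarity(a, b, references):
    """Share of referenced names the two files have in common (Jaccard)."""
    union = references[a] | references[b]
    return len(references[a] & references[b]) / len(union) if union else 0.0


# Hypothetical description of three files.
defines = {
    "parser.src": {"parse"},
    "lexer.src": {"next_token"},
    "printer.src": {"print_tree"},
}
references = {
    "parser.src": {"next_token", "error"},
    "lexer.src": {"error", "read_char"},
    "printer.src": {"error", "write"},
}

parser_lexer_direct = direct_link("parser.src", "lexer.src", defines, references)
lexer_printer_direct = direct_link("lexer.src", "printer.src", defines, references)
lexer_printer_sibling = sibling_similarity("lexer.src", "printer.src", references)
```

Here the lexer and the printer never reference each other, so they have no direct link, yet they still get a non-zero sibling similarity because both use `error`.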

22 of 63


Avoid “sparse” descriptive features

Avoid similarity metrics that consider absence of a feature as significant
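A small sketch of why counting absent features is a problem. It compares the Jaccard coefficient, which ignores features absent from both entities, with the simple matching coefficient, which counts joint absence as agreement. The 100 routine names and the two files are hypothetical.

```python
def jaccard(a, b):
    """Shared features / features present in at least one entity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simple_matching(a, b, all_features):
    """Fraction of features on which both entities agree, absence included."""
    agree = sum(1 for f in all_features if (f in a) == (f in b))
    return agree / len(all_features)


# Hypothetical system with 100 routines; each file references only two of them.
all_features = {f"routine_{i}" for i in range(100)}
file_a = {"routine_0", "routine_1"}
file_b = {"routine_2", "routine_3"}

jac = jaccard(file_a, file_b)
smc = simple_matching(file_a, file_b, all_features)
print(jac, smc)
```

The two files share nothing, yet simple matching rates them as almost identical because both lack the same 96 routines. With sparse descriptions nearly every pair of files looks similar under such a metric.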

23 of 63


24 of 63

From then until now

Raw numbers

What extensions?

25 of 63

From then until now

References (volume)

[Chart: number of references per year, 1999 to 2009; vertical axis from 0 to 18. Data from Google Scholar.]

26 of 63

From then until now

References (authors)

P.Tonella(8), F.Ricca(7), C.Girardi(5), E.Pianta(5)

O.Maqbool(7), HA.Babri(6), C.Tjortjis(5), N.Anquetil(5), S.Ducasse(5), K.Sartipi(4)


27 of 63

From then until now

References (venue)

Thesis = 11

CSMR = 6

IWPC = 6

WCRE = 5

J.Soft.Maint.Evol. = 4

J.Syst.Soft. = 4

ICSM = 3

ICSE = 2

Trans.Syst.Eng. = 2


28 of 63

From then until now

Some extensions

Clustering, how?

New/improved algorithms

New/improved distance metrics

Clustering what?

New entities (and/or description)

Clustering, why?

Other extensions

29 of 63

From then until now

New algorithm

Genetic algorithm [Mahdavi]

“Combined algorithm” [Saeed, Maqbool, Babri, Hassan, Sarwar]

30 of 63

From then until now

New distance metric

Minimization of information loss [Andritsos, Tzerpos]

31 of 63

From then until now

New entities

Static web pages [Di Lucca, Fasolino, Tramontana], [Tonella, Ricca, Pianta, Girardi]

Association rules [Maqbool, Babri]

Data vs. control [Davey, Burd], [Sartipi, Kontogiannis]

Dynamic data [Stroulia, Systä]

Co-change records

32 of 63

From then until now

Other extensions

Evaluations / comparisons [Tonella], [Wu, Holt], [Parsa, Bushehrian]

Framework

33 of 63

From then until now

Other extensions

Needs of maintainers? [Tjortjis, Layzell]

Input for visualization tools [Ducasse]

Naming clusters [Tzerpos], [Maqbool, Babri]

34 of 63


35 of 63

And now what?

Back to paper's results

Wild ideas in clustering

Related topics

36 of 63

And now what?

Paper's results

Choice of (traditional) algorithm matters little

It will give a result

Not significantly better or worse than the others

37 of 63


Choice of similarity metric matters little

As long as they don't consider absence of a feature as a sign of similarity

38 of 63


Choice of description scheme for entity matters a bit more

May be a source of short-term progress?

Using dynamic information?

39 of 63

And now what?

Wild ideas

Consider new entities?

Individual instructions?

Non-code: requirements, model elements, tests, ...?

Process-wise modularization?

Clustering requirements, model elements, ...

40 of 63

And now what?

Related topics

Problem without solution?

Software modularization is highly subjective

Packages are not mutually exclusive

Decisions must be made that are always wrong (and always correct)

41 of 63


Modularization is a logical (virtual) decomposition based on semantics

High cohesion, low coupling may only be an (imperfect) by-product of pre-chosen modularization

Cohesion/coupling not a driving force but a secondary goal?

Other forces, e.g. packages of “comparable” sizes
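For reference, a minimal sketch of how cohesion and coupling are often counted for a given modularization: dependencies that stay inside a module versus dependencies that cross module boundaries. This is one simple definition among many, not the one used in the paper, and the entities, modules and dependencies are hypothetical.

```python
def cohesion_and_coupling(dependencies, module_of):
    """Count dependencies inside modules (cohesion) and across modules (coupling)."""
    intra = inter = 0
    for source, target in dependencies:
        if module_of[source] == module_of[target]:
            intra += 1
        else:
            inter += 1
    return intra, inter


# Hypothetical entities, their modules, and "uses" dependencies between them.
module_of = {"a": "m1", "b": "m1", "c": "m1", "d": "m2", "e": "m2", "util": "utils"}
dependencies = [("a", "b"), ("b", "c"), ("c", "util"), ("d", "util"), ("d", "e")]

intra, inter = cohesion_and_coupling(dependencies, module_of)
print(intra, inter)
```

Every dependency on `util` counts as coupling and the `utils` module has no internal dependency at all, even though grouping shared helpers in one place is a deliberate design decision.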

42 of 63


Typical example: Utility package

Low cohesion, high coupling

java.util

BitSet, Calendar, Currency, Dictionary, EventListenerProxy, Formatter, Observable, Random, ResourceBundle, Scanner, UUID, TimeZone, ...

43 of 63


How to evaluate results?

Open question in the paper

Cohesion/coupling

Normally useless because it is the function optimized by the algorithms

Gold standard

Manually: expensive, not precise

Automatically: biased
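One common way to compare a clustering against a gold standard is precision and recall over pairs of entities placed together. The sketch below assumes that measure; the two partitions are hypothetical, and the slide itself does not prescribe a formula.

```python
from itertools import combinations


def intra_pairs(partition):
    """All unordered pairs of entities that share a cluster."""
    return {
        frozenset(pair)
        for cluster in partition
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall(candidate, gold):
    """Pairs grouped by the candidate that the gold standard also groups, and vice versa."""
    found, expected = intra_pairs(candidate), intra_pairs(gold)
    correct = found & expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical gold standard (e.g. an expert decomposition) and a tool's output.
gold = [{"a", "b", "c"}, {"d", "e"}]
candidate = [{"a", "b"}, {"c", "d", "e"}]

precision, recall = precision_recall(candidate, gold)
print(precision, recall)
```

The score is only as trustworthy as the gold standard, which is the difficulty the slide points out: a manual one is expensive and imprecise, an automatic one is biased.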

44 of 63


How to evaluate results?

Other metrics, e.g. Stability, Non-extremity [Wu]

45 of 63


46 of 63


“The fact that all six algorithms are ranked low on authoritativeness suggests that they may not be mature enough for use in production on large systems undergoing evolutionary change. However ...”

[Wu, Holt, 2005]

47 of 63

An analogy

A short story of Belo Horizonte:

In 1893 a new capital is planned in the state of Minas Gerais (Brazil)

The architects/urbanists get inspiration from Washington D.C.

48 of 63

An analogy

The initial architecture:

Planned Belo Horizonte

49 of 63

An analogy

The city grew (2.5 million inhabitants, area=5.1 Mh.)

50 of 63

An analogy

The city grew (2.5 million inhabitants)

51 of 63

An analogy

Could we remodularize that?

52 of 63

An analogy

Could we remodularize that?

53 of 63

An analogy

Analogy with software clustering:

Initial architecture is completely lost in the overall city

Regularities would only allow finding small “clusters”

There are large “empty” parts difficult to (automatically) cluster

A division into districts would necessarily be subjective

54 of 63


55 of 63

Another analogy

You are a 21-year-old leaving university

You buy a large house because you have a good job

You are not well organized

You have a general concept that “food goes in the kitchen and clothes go in the bedroom”

But much of your stuff is strewn around

56 of 63

Another analogy

Initially you do not have many things, so the disorganization doesn't matter

After a while, you accumulate very many worldly goods

You constantly can't find things

Your new partner starts complaining

57 of 63

Another analogy

You realize it is time to organize things better

You are a computer scientist so you want to apply a clustering algorithm

58 of 63

Another analogy

But what criteria to use?

Things made in the same country go together?

Oops, the 'China' cluster is too big

Temporal cohesion?

Things used in the morning in one place, things used in the evening in another place?

– Where does 'toothbrush' go?

59 of 63

Another analogy

Functional cohesion

Everything for each recipe I make is kept together

But utilities (things used commonly) are separately organized as a cluster

Too awkward

60 of 63

Another analogy

In the end, your approach is pragmatic:

1. You decide from general experience on a set of general categories and storage locations

2. You spend a weekend moving things into these locations (yes there are thousands of things)

61 of 63

Another analogy

3. As you proceed, you notice

Some things do not fit in any categories

Some categories are not so well chosen

Some categories overlap

4. You refactor the categories a bit and move things around

62 of 63

How can this be applied to software?

Use a clustering tool mainly to give you a sense of the possibilities

Combine with other RE tools to learn about the functionality of each module as well as other properties

But also apply general wisdom about good software design

63 of 63

How can this be applied to software?

Play with the parameters of the clustering tool and other RE tools, refactoring until you have achieved a remodularization that you understand

Ideally, tools would allow instant adjustment with good visualization

Retain documents describing the resulting design
