Transcript
Page 1: Improving Software Maintenance using Unsupervised Machine Learning techniques

Dottorato in Scienze Computazionali e Informatiche, XXV Ciclo

Ph.D. Candidate: Valerio Maggio

Thesis Advisors: Dr. Sergio Di MartinoDr. Anna Corazza

June 5th, 2013

UNSUPERVISEDMACHINE LEARNING FOR SOFTWARE MAINTENANCE

Page 2: Improving Software Maintenance using Unsupervised Machine Learning techniques

THESISG O A L

UNSUPERVISEDMACHINE LEARNING FOR SOFTWARE MAINTENANCE

These solutions exploit (unsupervised) machine learning techniques to mine information from the source code

Define and experimentally evaluate solutions (techniques and prototype tools) for automatic software analysis to support

software maintenance activities.

Page 3: Improving Software Maintenance using Unsupervised Machine Learning techniques

There exist different types of Software Maintenance.

SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)

Page 4: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)

Software Maintenance represents the most expensive, time consuming and challenging

phase of the whole development process.

Page 5: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)

Software Maintenance represents the most expensive, time consuming and challenging

phase of the whole development process.

Software Maintenance could account up to the 85-90% of the total software costs.

Page 6: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUESSOFTWARE MAINTENANCE

“Software Maintenance is about change!” (cit. S. Jarzabek)

Change Analysis

Page 7: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUESSOFTWARE MAINTENANCE

“Software Maintenance is about change!” (cit. S. Jarzabek)

Change AnalysisProgram

Comprehension

The documentation is usually scarce or not up to date!

Page 8: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUESSOFTWARE MAINTENANCE

The source code (usually) represents the most reliable source of information about the system

“Software Maintenance is about change!” (cit. S. Jarzabek)

Change AnalysisProgram

Comprehension

The documentation is usually scarce or not up to date!

Page 9: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUESSOFTWARE MAINTENANCE

The source code (usually) represents the most reliable source of information about the system

“Software Maintenance is about change!” (cit. S. Jarzabek)

Change AnalysisProgram

ComprehensionReverse Engineering

The documentation is usually scarce or not up to date!

Page 10: Improving Software Maintenance using Unsupervised Machine Learning techniques

REVERSEE N G I N E

E R I N G

• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document

• Goal: To aid the comprehension of the system

Page 11: Improving Software Maintenance using Unsupervised Machine Learning techniques

REVERSEE N G I N E

E R I N G

• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document

• Goal: To aid the comprehension of the system

STATIC ANALYSIS

DYNAMIC ANALYSIS

Page 12: Improving Software Maintenance using Unsupervised Machine Learning techniques

REVERSEE N G I N E

E R I N G

• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document

• Goal: To aid the comprehension of the system

STATIC ANALYSIS DYNAMIC

ANALYSIS

Page 13: Improving Software Maintenance using Unsupervised Machine Learning techniques

MACHINEL E A R N

I N G

• Provides computational effective solutions to analyze large data sets

Page 14: Improving Software Maintenance using Unsupervised Machine Learning techniques

MACHINEL E A R N

I N G

• Provides computational effective solutions to analyze large data sets

• Provides solutions that can be tailored to different tasks/domains

Page 15: Improving Software Maintenance using Unsupervised Machine Learning techniques

MACHINEL E A R N

I N G

• Provides computational effective solutions to analyze large data sets

• Provides solutions that can be tailored to different tasks/domains

• Requires many efforts in:

• the definition of the relevant information best suited for the specific task/domain

Page 16: Improving Software Maintenance using Unsupervised Machine Learning techniques

MACHINEL E A R N

I N G

• Provides computational effective solutions to analyze large data sets

• Provides solutions that can be tailored to different tasks/domains

• Requires many efforts in:

• the definition of the relevant information best suited for the specific task/domain

• the application of the learning algorithms to the considered data

Page 17: Improving Software Maintenance using Unsupervised Machine Learning techniques

UN

SUPE

RVIS

ED L

EARN

ING

• Supervised Learning: • Learn from labelled samples

• Unsupervised Learning:• Learn (directly) from the data

Learn by examples

Page 18: Improving Software Maintenance using Unsupervised Machine Learning techniques

UN

SUPE

RVIS

ED L

EARN

ING

• Supervised Learning: • Learn from labelled samples

• Unsupervised Learning:• Learn (directly) from the data

Learn by examples

Page 19: Improving Software Maintenance using Unsupervised Machine Learning techniques

UN

SUPE

RVIS

ED L

EARN

ING

• Supervised Learning: • Learn from labelled samples

• Unsupervised Learning:• Learn (directly) from the data

Learn by examples

(+) No cost of labeling samples

(-) Trade-off imposed on the quality of the data

Page 20: Improving Software Maintenance using Unsupervised Machine Learning techniques

THESISC O N T RI B U T I

O N S

&

Page 21: Improving Software Maintenance using Unsupervised Machine Learning techniques

THESISC O N T RI B U T I

O N S

Contributions to three relevant and related open issues in Software Maintenance

&

Page 22: Improving Software Maintenance using Unsupervised Machine Learning techniques

THESISC O N T RI B U T I

O N S

Contributions to three relevant and related open issues in Software Maintenance

1. SOFTWARE RE-MODULARIZATION

3. CLONE DETECTION

2. SOURCE CODE NORMALIZATION

&

Page 23: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUE#1

SOFTWARERE-MODULARIZATION

Page 24: Improving Software Maintenance using Unsupervised Machine Learning techniques

Software Classes

UI Process Components

UI Components

Data Access Components

Data Helpers / Utilities

Security

Operational Management

Communications

Business Components

Application Facade

Buisiness Workflows

Messages InterfacesService Interfaces

Re-modularization provides a way to support software maintainers by automat ica l ly grouping together (clustering) “related” software classes

SOFTWARE RE-MODULARIZATION

PROBLEM

S T A T EM E N T

Page 25: Improving Software Maintenance using Unsupervised Machine Learning techniques

Re-modularization provides a way to support software maintainers by automat ica l ly grouping together (clustering) “related” software classes

SOFTWARE RE-MODULARIZATION

External Systems

Service Consumers

Service

s

Service Interfaces

Messages Interfaces

Cross Cutting

Security

Operational M

anagement

Comm

unications

Data Data Access

ComponentsData Helpers /

Utilities

Pres

entatio

n UI Components

UI Process Components

Busin

ess

Application Facade

Buisiness Workflows

Business Components

Clusters of Software Classes

PROBLEM

S T A T EM E N T

Page 26: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

Page 27: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

Page 28: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

(IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.

Page 29: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

(IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.

State of the Art: Information Retrieval (IR) based approaches.

Page 30: Improving Software Maintenance using Unsupervised Machine Learning techniques

IR IN

DEXI

NG

Page 31: Improving Software Maintenance using Unsupervised Machine Learning techniques

Implicit assumption: The “same” words are used whenever a particular concept occurs

IR IN

DEXI

NG

Page 32: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

Implicit assumption: The “same” words are used whenever a particular concept occurs

IR IN

DEXI

NG

Page 33: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Page 34: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Class Names

Page 35: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Class Names

Attribute Names

Page 36: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Class Names

Attribute Names

Method Names

Method Names

Page 37: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Class Names

Attribute Names

Method Names

Parameter NamesMethod Names

Parameter Names

Page 38: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Class Names

Attribute Names

Method Names

Parameter Names

Comments

Comments

Method Names

Parameter Names

Comments

Page 39: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE Z

ON

ESLE

XIC

AL

INFO

RMAT

ION

Source Code

Class Names

Attribute Names

Method Names

Parameter Names

Comments

Comments

Method Names

Parameter NamesSource Code

Comments

Page 40: Improving Software Maintenance using Unsupervised Machine Learning techniques

ZON

E IN

DEXI

NG

Page 41: Improving Software Maintenance using Unsupervised Machine Learning techniques

ZON

E IN

DEXI

NG

RQ1: Do terms in different Zones provide different contributions?

Page 42: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization

ZON

E IN

DEXI

NG

Draws, the, are, NullHandle, box, r, draw, g, Graphics, color, displayBox, ...

draw, the, are, null, handl, box, r, draw, g, graphic, color, display, box, ...

RQ1: Do terms in different Zones provide different contributions?

Page 43: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE LEX

ICON JFreeChart JEdit

JUnit Xerces

Page 44: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE L

EXIC

ON JFreeChart

• Good lexicon in every Zone!

Page 45: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE L

EXIC

ON JEdit

• Very good in Comments • Very poor in Method and Parameter names

Page 46: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE L

EXIC

ON JUnit

• Good in Class and Attribute names• No Comments at all

Page 47: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE L

EXIC

ON Xerces

• Poor in Method, Parameter names and Comments• Good in Method Code and Variable names

Page 48: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DE L

EXIC

ON JFreeChart JEdit

JUnit XercesRQ2: How to automatically weight the different contributions ?

Page 49: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION

SOFTWARE RE-MODULARIZATION

• We propose a source code Zone Indexing

Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0

Page 50: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION

SOFTWARE RE-MODULARIZATION

• We propose a source code Zone Indexing

• An approach to automatically weight the contribution of different terms in Zones

• Maximum-Likelihood Estimation (MLE) Approach

Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0

Page 51: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION

SOFTWARE RE-MODULARIZATION

• We propose a source code Zone Indexing

• An approach to automatically weight the contribution of different terms in Zones

• Maximum-Likelihood Estimation (MLE) Approach

• A variant of the K-medoid clustering algorithm

• Tailored to the software clustering task

Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0

Page 52: Improving Software Maintenance using Unsupervised Machine Learning techniques

StackedVerticalBarRenderer

LineAndShapeRenderer

K-M

EDO

ID V

ARIA

NTArrayHolder

Plot

LinearPlotFitAlgorithm

LinearPlotFit

HorizontalNumberAxis

NumberAxisPropertyEditPanel

PlotFitAlgorithm

ChartFrame

GraphicsHolder

PropertyEditPanel

DateAxisPropertyEditPanel

StandardLegend

LegendItem

FigureLegend

VerticalNumberAxis

VerticalXYDataPlot

DataPlot

SampleXYDatasetSampleXYDatasetThread

VerticalPlot

FitAlgorithm

Page 53: Improving Software Maintenance using Unsupervised Machine Learning techniques

StackedVerticalBarRenderer

LineAndShapeRenderer

K-M

EDO

ID V

ARIA

NTArrayHolder

Plot

LinearPlotFitAlgorithm

LinearPlotFit

HorizontalNumberAxis

NumberAxisPropertyEditPanel

PlotFitAlgorithm

ChartFrame

GraphicsHolder

PropertyEditPanel

DateAxisPropertyEditPanel

StandardLegend

LegendItem

FigureLegend

VerticalNumberAxis

VerticalXYDataPlot

DataPlot

SampleXYDatasetSampleXYDatasetThread

Non-extreme Distribution property

VerticalPlot

FitAlgorithm

Page 54: Improving Software Maintenance using Unsupervised Machine Learning techniques

StackedVerticalBarRenderer

LineAndShapeRenderer

K-M

EDO

ID V

ARIA

NTArrayHolder

Plot

LinearPlotFitAlgorithm

LinearPlotFit

HorizontalNumberAxis

NumberAxisPropertyEditPanel

PlotFitAlgorithm

ChartFrame

GraphicsHolder

PropertyEditPanel

DateAxisPropertyEditPanel

StandardLegend

LegendItem

FigureLegend

VerticalNumberAxis

VerticalXYDataPlot

DataPlot

SampleXYDatasetSampleXYDatasetThread

Non-extreme Distribution property

VerticalPlot

FitAlgorithm

Page 55: Improving Software Maintenance using Unsupervised Machine Learning techniques

StackedVerticalBarRenderer

LineAndShapeRenderer

K-M

EDO

ID V

ARIA

NTArrayHolder

Plot

LinearPlotFitAlgorithm

LinearPlotFit

HorizontalNumberAxis

NumberAxisPropertyEditPanel

PlotFitAlgorithm

ChartFrame

GraphicsHolder

PropertyEditPanel

DateAxisPropertyEditPanel

StandardLegend

LegendItem

FigureLegend

VerticalNumberAxis

VerticalXYDataPlot

DataPlot

SampleXYDatasetSampleXYDatasetThread

Non-extreme Distribution property

VerticalPlot

FitAlgorithm

ExtremeExtreme

Page 56: Improving Software Maintenance using Unsupervised Machine Learning techniques

StackedVerticalBarRenderer

LineAndShapeRenderer

K-M

EDO

ID V

ARIA

NTArrayHolder

Plot

LinearPlotFitAlgorithm

LinearPlotFit

HorizontalNumberAxis

NumberAxisPropertyEditPanel

PlotFitAlgorithm

ChartFrame

GraphicsHolder

PropertyEditPanel

DateAxisPropertyEditPanel

StandardLegend

LegendItem

FigureLegend

VerticalNumberAxis

VerticalXYDataPlot

DataPlot

SampleXYDatasetSampleXYDatasetThread

Non-extreme Distribution property

VerticalPlot

FitAlgorithm

Page 57: Improving Software Maintenance using Unsupervised Machine Learning techniques

AUTHORITATIVENESSRE

SULTS

Comparison of Clustering Results with an Authoritative Target Partition

Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0

0.2

0.4

0.6

0.8

1.0

19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART

FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

Page 58: Improving Software Maintenance using Unsupervised Machine Learning techniques

AUTHORITATIVENESSRE

SULTS

Comparison of Clustering Results with an Authoritative Target Partition

Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0

0.2

0.4

0.6

0.8

1.0

19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART

FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

Page 59: Improving Software Maintenance using Unsupervised Machine Learning techniques

AUTHORITATIVENESSRE

SULTS

Comparison of Clustering Results with an Authoritative Target Partition

Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0

0.2

0.4

0.6

0.8

1.0

19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART

FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

Page 60: Improving Software Maintenance using Unsupervised Machine Learning techniques

AUTHORITATIVENESSRE

SULTS

Comparison of Clustering Results with an Authoritative Target Partition

19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART

0

0.2

0.4

0.6

0.8

1.0

Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces

FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

Page 61: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUE#2

SOURCE CODENORMALIZATION

Page 62: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

Implicit assumption: The “same” words are used whenever a particular concept occurs

Page 63: Improving Software Maintenance using Unsupervised Machine Learning techniques

SOUR

CE

CO

DELE

XIC

AL

INFO

RMAT

ION

Implicit assumption: The “same” words are used whenever a particular concept occurs

Page 64: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization1.5 Identifier Splitting

• Splitting algorithms based on naming conventions

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• NullHandle ==> Null | Handle

• displayBox ==> display | Box

draw, the, are, box, r, rectangl, g, graphic, color, ....

STAT

E O

F TH

E A

RT

TOO

LS

display, box

null, handl

Page 65: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization1.5 Identifier Splitting

• Splitting algorithms based on naming conventions

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• NullHandle ==> Null | Handle

• displayBox ==> display | Box

draw, the, are, box, r, rectangl, g, graphic, color, ....

STAT

E O

F TH

E A

RT

TOO

LS

display, box

null, handl

Page 66: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization1.5 Identifier Splitting

• Splitting algorithms based on naming conventions

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• NullHandle ==> Null | Handle

draw, the, are, box, r, rectangl, g, graphic, color, ....

STAT

E O

F TH

E A

RT

TOO

LS

display, box

null, handl

Page 67: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

2. Normalization1.5 Identifier Splitting

• Splitting algorithms based on naming conventions

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• NullHandle ==> Null | Handle

• displayBox ==> display | Box

draw, the, are, box, r, rectangl, g, graphic, color, ....

STAT

E O

F TH

E A

RT

TOO

LS

null, handl

display, box

Page 68: Improving Software Maintenance using Unsupervised Machine Learning techniques

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

IDEN

TIFIE

RS S

PLIT

TIN

G

Page 69: Improving Software Maintenance using Unsupervised Machine Learning techniques

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• drawXORRect ==> drawXOR | Rect

IDEN

TIFIE

RS S

PLIT

TIN

G

Page 70: Improving Software Maintenance using Unsupervised Machine Learning techniques

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• drawXORRect ==> drawXOR | Rect

• drawxorrect ==> NO SPLIT

Splitting algorithms based on naming conventions

are not robust enough

IDEN

TIFIE

RS S

PLIT

TIN

G

Page 71: Improving Software Maintenance using Unsupervised Machine Learning techniques

Splitting algorithms based on naming conventions

are not robust enough

ABBR

EVIAT

IONS

EXP

AN

SIO

N

Page 72: Improving Software Maintenance using Unsupervised Machine Learning techniques

Splitting algorithms based on naming conventions

are not robust enough

• Heavy use of Abbreviations in the source code

• r as for Rectangle

• rect as for Rectangle

ABBR

EVIAT

IONS

EXP

AN

SIO

N

Page 73: Improving Software Maintenance using Unsupervised Machine Learning techniques

Splitting algorithms based on naming conventions

are not robust enough

• Heavy use of Abbreviations in the source code

• r as for Rectangle

• rect as for Rectangle

ABBR

EVIAT

IONS

EXP

AN

SIO

N

Page 74: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

CO

DE N

OR

MA

LIZA

TIO

N

2. Normalization1.5 Code Normalization

• AMAP (Hill and Pollock, 2008)

• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)

• GenTest+Normalize (Lawrie and Binkley, 2011)

• LINSEN (Corazza, Di Martino and Maggio, 2012)

draw, the, are, null, handl, box, rectangl, graphic, color, display, box,

rectangl,

draw, xor,rectangl

Page 75: Improving Software Maintenance using Unsupervised Machine Learning techniques

1. Tokenization

CO

DE N

OR

MA

LIZA

TIO

N

2. Normalization1.5 Code Normalization

• AMAP (Hill and Pollock, 2008)

• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)

• GenTest+Normalize (Lawrie and Binkley, 2011)

• LINSEN (Corazza, Di Martino and Maggio, 2012)

draw, the, are, null, handl, box, rectangl, graphic, color, display, box,

rectangl,

draw,xor,rectangl

Page 76: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION

SOURCE CODE NORMALIZATION

• LINSEN: novel technique for code normalization

• Able to both Split Identifiers and Expand possible occurring abbreviations

Corazza, A., Di Martino, S., Maggio, V.LINSEN: An efficient approach to split identifiers and expand abbreviations(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3

Page 77: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION

SOURCE CODE NORMALIZATION

• LINSEN: novel technique for code normalization

• Able to both Split Identifiers and Expand possible occurring abbreviations

• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)

• Exploit different Sources of Information to find the matching words

Corazza, A., Di Martino, S., Maggio, V.LINSEN: An efficient approach to split identifiers and expand abbreviations(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3

Page 78: Improving Software Maintenance using Unsupervised Machine Learning techniques

SKET

CH

OF

THE

ALG

OR

ITH

M

• NODES correspond to characters of the current identifier

Model: Weighted Directed GraphExample: drawXORRect identifier

0

11

1

9

4

5

7

8

2 63

10

Page 79: Improving Software Maintenance using Unsupervised Machine Learning techniques

SKET

CH

OF

THE

ALG

OR

ITH

M

• ARCS corresponds to matchings between identifier substrings and dictionary words

• NODES correspond to characters of the current identifier

Model: Weighted Directed GraphExample: drawXORRect identifier

0

11

1

9

4

5

7

8

2 63

10

Page 80: Improving Software Maintenance using Unsupervised Machine Learning techniques

SKET

CH

OF

THE

ALG

OR

ITH

M

• ARCS corresponds to matchings between identifier substrings and dictionary words

• Padding Arcs to ensure the Graph always connected

• NODES correspond to characters of the current identifier

Model: Weighted Directed GraphExample: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 81: Improving Software Maintenance using Unsupervised Machine Learning techniques

SKET

CH

OF

THE

ALG

OR

ITH

M

• Every Arc is Labelled with the corresponding dictionary word

• Weights represent the “cost” of each matching

• Cost function [c(“word”)] favors longest words and domain-

related information

Model: Weighted Directed GraphExample: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 82: Improving Software Maintenance using Unsupervised Machine Learning techniques

SKET

CH

OF

THE

ALG

OR

ITH

M

• The final Mapping Solution corresponds to the sequence of labels

in the path with the minimum cost (Djikstra Algorithm)

Model: Weighted Directed GraphExample: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 83: Improving Software Maintenance using Unsupervised Machine Learning techniques

SPLITTING

Accuracy Rates for the comparison withDTW (Madani et. al 2010)

0

0.25

0.5

0.75

1

JhotDraw 5.1 Lynx 2.8.5DTW LINSEN DTW LINSEN

DTW LINSEN

RESULTS

# 1

Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011)

0

0.25

0.5

0.75

1

which 2.20 a2ps 4.14

GenTest LINSEN

DTW O(n3)

Page 84: Improving Software Maintenance using Unsupervised Machine Learning techniques

Accuracy Rates for the comparison withAMAP (Hill and Pollock, 2008)

0

0.25

0.5

0.75

1

CW DL OO AC PR SL TOTAL (Avg)

AMAPLINSEN

EXPANSIONRESULTS

# 2

CW: Combination WordsDL: Dropped LettersOO: Others

AC: AcronymsPR: PrefixSL: Single Letters

Page 85: Improving Software Maintenance using Unsupervised Machine Learning techniques

ISSUE#3

CLONEDETECTION

Page 86: Improving Software Maintenance using Unsupervised Machine Learning techniques

PROBLEM

S T A T EM E N T

CLONE DETECTION

Page 87: Improving Software Maintenance using Unsupervised Machine Learning techniques

PROBLEM

S T A T EM E N T

CLONE DETECTION

Clones Textual Similarity

Page 88: Improving Software Maintenance using Unsupervised Machine Learning techniques

PROBLEM

S T A T EM E N T

CLONE DETECTION

Clones Functional Similarity

Page 89: Improving Software Maintenance using Unsupervised Machine Learning techniques

PROBLEM

S T A T EM E N T

CLONE DETECTION

Clones affect the reliability of the system!

Sneaky Bug!

Page 90: Improving Software Maintenance using Unsupervised Machine Learning techniques

Duplix

Scorpio

PMD

CCFinder

Dup CPD

Duplix

Shinobi

Clone Detective

Gemini

iClones

KClone

ConQAT

Deckard

Clone Digger

JCCD

CloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

STAT

E O

F TH

E A

RT

TOO

LS

Page 91: Improving Software Maintenance using Unsupervised Machine Learning techniques

Duplix

Scorpio

PMD

CCFinder

Dup CPD

Duplix

Shinobi

Clone Detective

Gemini

iClones

KClone

ConQAT

Deckard

Clone Digger

JCCD

CloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDDText Based Tools:Text is compared line by line

STAT

E O

F TH

E A

RT

TOO

LS

Page 92: Improving Software Maintenance using Unsupervised Machine Learning techniques

Duplix

Scorpio

PMD

CCFinder

Dup CPD

Duplix

Shinobi

Clone Detective

Gemini

iClones

KClone

ConQAT

Deckard

Clone Digger

JCCD

CloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

Token Based Tools:Token sequences are compared to sequences

STAT

E O

F TH

E A

RT

TOO

LS

Page 93: Improving Software Maintenance using Unsupervised Machine Learning techniques

Duplix

Scorpio

PMD

CCFinder

Dup CPD

Duplix

Shinobi

Clone Detective

Gemini

iClones

KClone

ConQAT

Deckard

Clone Digger

JCCD

CloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDDSyntax Based Tools:Syntax subtrees are compared to each other

STAT

E O

F TH

E A

RT

TOO

LS

Page 94: Improving Software Maintenance using Unsupervised Machine Learning techniques

Duplix

Scorpio

PMD

CCFinder

Dup CPD

Duplix

Shinobi

Clone Detective

Gemini

iClones

KClone

ConQAT

Deckard

Clone Digger

JCCD

CloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

Graph Based Tools:(sub) graphs are compared to each other

STAT

E O

F TH

E A

RT

TOO

LS

Page 95: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION CLONE DETECTION

• Combining different sources of information to improve the effectiveness of the detection process

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8

Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press

Page 96: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION CLONE DETECTION

• Combining different sources of information to improve the effectiveness of the detection process

• Structural and Lexical Information

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8

Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press

Page 97: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION CLONE DETECTION

• Combining different sources of information to improve the effectiveness of the detection process

• Structural and Lexical Information

• Application of Kernel Methods to Source Code Structures

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8

Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press

Page 98: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONTRIBUTION CLONE DETECTION

• Combining different sources of information to improve the effectiveness of the detection process

• Structural and Lexical Information

• Application of Kernel Methods to Source Code Structures

• Typically applied in other field such as NLP or Bioinformatics

Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8

Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press

Page 99: Improving Software Maintenance using Unsupervised Machine Learning techniques

CODESTRUCTURES

KER

NEL

S FO

R ST

RUC

TURE

SComputation of the dot product between (Graph) Structures

K( ),

Page 100: Improving Software Maintenance using Unsupervised Machine Learning techniques

CODESTRUCTURES

KER

NEL

S FO

R ST

RUC

TURE

S

Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function)

Program Dependencies Graph (PDG)(Directed) Graph structure representing the relationship among the different statement of a program

Computation of the dot product between (Graph) Structures

K( ),

Page 101: Improving Software Maintenance using Unsupervised Machine Learning techniques

CODESTRUCTURES

KER

NEL

S FO

R ST

RUC

TURE

S

Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function)

Program Dependencies Graph (PDG)(Directed) Graph structure representing the relationship among the different statement of a program

Computation of the dot product between (Graph) Structures

K( ),

Page 102: Improving Software Maintenance using Unsupervised Machine Learning techniques

CODEKE

RNEL

FO

R C

LONE

S

Page 103: Improving Software Maintenance using Unsupervised Machine Learning techniques

<

x y = =

x +

x 1

y -

y 1

while

block

while

block

block

if

>

b a = =

a +

a 1

b -

b 1

>

b 0 =

c 3

CODE ASTKE

RNEL

FO

R C

LONE

S

Page 104: Improving Software Maintenance using Unsupervised Machine Learning techniques

<

x y = =

x +

x 1

y -

y 1

while

block

while

block

block

if

>

b a = =

a +

a 1

b -

b 1

>

b 0 =

c 3

CODE AST AST KERNELKE

RNEL

FO

R C

LONE

S

<

block

while

= =

block

=

y -

=

x +

+

x 1

-

y 1

<

x y

>

b 0 =

c 3

if

block

>

b a

-

b 1

<block

while

+

a 1

=

b -

=

a +

Page 105: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES

Page 106: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES Instruction Class (IC)

i.e., LOOP, CALL,CONDITIONAL_STATEMENT

Page 107: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES Instruction Class (IC)

i.e., LOOP, CALL,CONDITIONAL_STATEMENT

Instruction (I)i.e., FOR, IF, WHILE, RETURN

Page 108: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES Instruction Class (IC)

i.e., LOOP, CALL,CONDITIONAL_STATEMENT

Instruction (I)i.e., FOR, IF, WHILE, RETURN

Context (C)i.e., Instruction Class of the closer statement node

Page 109: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES Instruction Class (IC)

i.e., LOOP, CALL,CONDITIONAL_STATEMENT

Instruction (I)i.e., FOR, IF, WHILE, RETURN

Context (C)i.e., Instruction Class of the closer statement node

Lexemes (Ls)Lexical information gathered (recursively) from leaves

Page 110: Improving Software Maintenance using Unsupervised Machine Learning techniques

while

block<

x y

KERNELSFOR CODE

STRUCTURES:AST

KERN

EL F

EATU

RES

IC = Conditional-ExprI = Less-operatorC = LoopLs= [x,y]

IC = LoopI = while-loopC = Function-BodyLs= [x, y]

Instruction Class (IC)i.e., LOOP, CALL,CONDITIONAL_STATEMENT

Instruction (I)i.e., FOR, IF, WHILE, RETURN

Context (C)i.e., Instruction Class of the closer statement node

Lexemes (Ls)Lexical information gathered (recursively) from leaves

IC = BlockI = while-bodyC = LoopLs= [ x ]

Page 111: Improving Software Maintenance using Unsupervised Machine Learning techniques

CLONE DETECTION• Comparison with another (pure) AST-based clone detector• Comparison on a system with randomly seeded clones

0

0.25

0.5

0.75

1

Precision Recall F-measure

CloneDigger Tree Kernel Tool

RESULTS

Results refer to clones where code

fragments have been modified by adding/

removing or changing code statements

Page 112: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONS

Page 113: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONS

Page 114: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONS

Page 115: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONS

Page 116: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONS

Page 117: Improving Software Maintenance using Unsupervised Machine Learning techniques

CONCLUSIONSFUTURE RESEARCH DIRECTIONS

• Re-modularization:

• Integrate code normalization algorithm (LINSEN) to lexical analysis

• Code Normalization:

• Investigate the contributions of different sources of information used for disambiguations

• Clone Detection:

• Combine Static (and Kernel methods) with Dynamic Analysis

Page 118: Improving Software Maintenance using Unsupervised Machine Learning techniques

THANK YOU

June 5th, 2013 - Ph.D. Defense

Valerio MaggioPh.D. Student, University of Naples “Federico II”

[email protected]


Top Related