LINSEN: AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS

Anna Corazza, Sergio Di Martino, Valerio Maggio

Università di Napoli “Federico II”

26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy

MOTIVATIONS IR FOR SE

IR FOR NATURAL LANGUAGE

1. Tokenization

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

2. Normalization

   1. Change to lower case

   draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...

   2. Remove stop words

   3. Apply stemming

   draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...

   4. ...

Implicit assumption: the “same” words are used whenever a particular concept is described
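The normalization steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny made-up sample, and `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer (for instance Porter's), chosen so that the example needs no external library.

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"the", "are", "a", "an", "is", "of"}


def toy_stem(token):
    """Rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]          # draws -> draw, graphics -> graphic
    if len(token) > 3 and token.endswith("e"):
        token = token[:-1]          # rectangle -> rectangl
    return token


def normalize(tokens):
    """Lower-case, drop stop words, then stem each remaining token."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


TOKENS = ["Draws", "the", "are", "NullHandle", "box", "r",
          "Rectangle", "g", "Graphics", "box", "displayBox"]
NORMALIZED = normalize(TOKENS)
```

Note that compound identifiers such as `NullHandle` and `displayBox` pass through as single terms (`nullhandl`, `displaybox`), which is the problem the following slides address.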

IR FOR SOURCE CODE

1. Tokenization

1.5 Identifier Splitting

2. Normalization

• snake_case Splitter: r'(?<=\w)_'

• display_box ==> display | box

• camelCase/PascalCase Splitter: r'(?<!^)([A-Z][a-z]+)'

• displayBox ==> display | Box

draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

• camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'

• drawXORRect ==> drawXOR | Rect

• drawxorrect ==> NO SPLIT


Splitting algorithms based on naming conventions are not robust enough
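A minimal sketch of the two convention-based splitters, using the regular expressions shown on the slides. The function name `split_identifier` is hypothetical. The last two calls reproduce the cases where the conventions are not enough.

```python
import re

SNAKE = re.compile(r'(?<=\w)_')               # underscore preceded by a word character
CAMEL = re.compile(r'(?<!^)([A-Z][a-z]+)')    # capitalized word, not at the very start


def split_identifier(identifier):
    """Split on underscores first, then on camelCase/PascalCase boundaries."""
    parts = []
    for chunk in SNAKE.split(identifier):
        # re.split keeps the captured group and may leave empty strings; drop those.
        parts.extend(piece for piece in CAMEL.split(chunk) if piece)
    return parts


SNAKE_CASE = split_identifier("display_box")    # clean split
CAMEL_CASE = split_identifier("displayBox")     # clean split
MIXED_CASE = split_identifier("drawXORRect")    # "XOR" stays glued to "draw"
NO_CASE = split_identifier("drawxorrect")       # no convention to exploit
```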


• Heavy use of abbreviations in the source code

• rect stands for Rectangle

• r stands for Rectangle

IDENTIFIER MAPPING

1. Tokenization

1.5 Identifier Mapping

2. Normalization

• SAMURAI (Enslen et al., 2011)

• TIDIER (Guerrouj et al., 2011)

• GenTest+Normalize (Lawrie and Binkley, 2011)

• AMAP (Hill and Pollock, 2008)

• ...

• LINSEN

draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

LINSEN ALGORITHM

CONTRIBUTION

• Novel technique for Identifier Mapping

• Based on an efficient String Matching technique: the Baeza-Yates and Perleberg Algorithm (BYP)

• Applied on a Graph-based model

• Able to both split identifiers and expand any abbreviations they contain

THE AMBIGUITY PROBLEM

There may be multiple, equally correct splitting or expansion solutions

• r stands for Rectangle OR red

• nsISupports ==> ns|IS|up|ports OR ==> ns|I|Supports

DICTIONARIES

Application-aware Dictionaries

(22,940 Entries)

(108,315 Entries)

(588 Entries)

GRAPH MODEL

Model: Weighted Directed Graph

Example: drawXORRect identifier

• NODES correspond to characters of the current identifier

[Figure: graph for the identifier drawXORRect, with nodes numbered 0 to 11]

• ARCS correspond to matchings between identifier substrings and dictionary words

[Figure: arcs labelled “draw”,c(“draw”); “raw”,c(“raw”); “xor”,c(“xor”); “or”,c(“or”); “rectangle”,c(“rectangle”)]

• Application of the String Matching Algorithm (BYP)

• Padding Arcs to ensure the Graph is always connected

[Figure: one padding arc per identifier character, each with cost C-MAX: “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”]

• Every Arc is labelled with the corresponding dictionary word

• Weights represent the “cost” of each matching

• Cost function [c(“word”)] favors longer words and words coming from the application-aware dictionaries

• The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Dijkstra Algorithm)
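The graph model can be sketched as follows for the drawXORRect example. Everything numeric here is an assumption made for illustration: the value of `C_MAX` and the arc costs are invented, since the slides only state that the cost function favors longer words and application-aware dictionaries. The function name `shortest_mapping` is hypothetical.

```python
import heapq

C_MAX = 100.0  # assumed cost of a padding arc (a single unmatched character)


def shortest_mapping(identifier, arcs):
    """Return (labels, cost) of the minimum-cost path from node 0 to node len(identifier).

    arcs is a list of (start, end, label, cost) tuples for dictionary matchings.
    """
    n = len(identifier)
    graph = {i: [] for i in range(n + 1)}
    for i, ch in enumerate(identifier):
        graph[i].append((i + 1, ch, C_MAX))       # padding arcs keep the graph connected
    for start, end, label, cost in arcs:
        graph[start].append((end, label, cost))

    dist, prev, heap = {0: 0.0}, {}, [(0.0, 0)]   # Dijkstra with a binary heap
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                               # stale queue entry
        for nxt, label, cost in graph[node]:
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                prev[nxt] = (node, label)
                heapq.heappush(heap, (d + cost, nxt))

    labels, node = [], n
    while node != 0:                               # walk back from the last node
        node, label = prev[node]
        labels.append(label)
    return labels[::-1], dist[n]


# Arcs from the slide figure; the costs are made up for this example.
ARCS = [(0, 4, "draw", 1.0), (1, 4, "raw", 2.0), (4, 7, "xor", 1.0),
        (5, 7, "or", 3.0), (7, 11, "rectangle", 1.0)]
MAPPING, MAPPING_COST = shortest_mapping("drawXORRect", ARCS)
```

With these costs the cheapest path reads draw, xor, rectangle. With no dictionary arcs at all, the padding arcs alone still yield a path, one character at a time.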

STRING MATCHING

• Application of the Baeza-Yates and Perleberg (BYP) Algorithm

• Signature: BYP(identifier, word, φ(word))

• identifier: target string

• word: string to match

• φ(·): Tolerance (Error) function

• Bounds the length of acceptable matchings

Advantage: the same algorithm serves both the splitting and the expansion steps, each with its own Tolerance function

BYP FOR SPLITTING

• BYP(identifier, word, φSplit(word))

• φSplit : Exact Matching (i.e., No Errors allowed)

Identifier: drawXORRect

DFile: ... draw, the, are, null, handle, box, red, rectangle, ...

DCOMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...

DEnglish: ... abort absolute abstract ... or ... raw ...

[Figure: arcs added to the graph one dictionary at a time: “draw”,c(“draw”) from DFile; “xor”,c(“xor”) from DCOMPUTER-SCIENCE; “raw”,c(“raw”) and “or”,c(“or”) from DEnglish]
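The splitting step can be sketched with a naive matcher that follows the BYP(identifier, word, φ(word)) signature. This is not the BYP algorithm itself: `byp_naive` is a hypothetical brute-force stand-in that assumes the tolerance counts character mismatches, and the dictionaries contain only the sample words visible on the slides.

```python
def byp_naive(identifier, word, tolerance):
    """Start offsets where word matches identifier with at most `tolerance` mismatches."""
    text, pattern = identifier.lower(), word.lower()
    if not pattern:
        return []
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= tolerance:
            hits.append(start)
    return hits


def phi_split(word):
    """Tolerance for the splitting step: exact matching, no errors allowed."""
    return 0


def split_arcs(identifier, dictionaries):
    """Collect (start, end, word, dictionary) arcs for every exact match."""
    arcs = []
    for name, words in dictionaries.items():
        for word in words:
            for start in byp_naive(identifier, word, phi_split(word)):
                arcs.append((start, start + len(word), word, name))
    return sorted(arcs)


DICTIONARIES = {
    "DFile": ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"],
    "DCOMPUTER-SCIENCE": ["echo", "testing", "threading", "xpm", "xor"],
    "DEnglish": ["abort", "absolute", "abstract", "or", "raw"],
}
SPLIT_ARCS = split_arcs("drawXORRect", DICTIONARIES)
```

Passing a non-zero tolerance relaxes the match: `byp_naive("drawXORRect", "red", 1)` reports a hit at offset 7, where "rec" differs from "red" in one character.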


BYP FOR EXPANSION

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

[Figure: expansion arcs added from DFile (... draw, the, are, null, handle, box, red, rectangle, ...): “red”,c(“red”) and “rectangle”,c(“rectangle”)]
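The slides do not spell out φExp, so the sketch below does not attempt to reproduce it. It only illustrates what an expansion candidate is, under a simple assumed rule: a fragment may abbreviate a longer word that starts with the same letter and contains the fragment's letters in order (which covers prefixes and dropped letters). The function names are hypothetical.

```python
def is_subsequence(abbrev, word):
    """True if the letters of abbrev appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbrev)


def expansion_candidates(fragment, dictionary):
    """Dictionary words that the fragment could abbreviate (assumed rule, not phi_Exp)."""
    frag = fragment.lower()
    if not frag:
        return []
    candidates = []
    for word in dictionary:
        w = word.lower()
        if len(w) > len(frag) and w[0] == frag[0] and is_subsequence(frag, w):
            candidates.append(word)
    return candidates


D_FILE = ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]
FOR_R = expansion_candidates("r", D_FILE)        # ambiguous: two candidates
FOR_RECT = expansion_candidates("Rect", D_FILE)  # a single candidate
```

The single letter r yields both red and rectangle, which is the ambiguity the cost function and the minimum-cost path are meant to resolve.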


EMPIRICAL EVALUATION

RESEARCH QUESTIONS

• RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?

• RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?

• RQ3: How well does LINSEN deal with different types of abbreviations?

CASE STUDIES

• DTW (Madani et al., 2010): RQ1 only

• GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2

• LUDISO Dataset (2012): 15 out of 750 software systems, covering 58% of total identifiers

• AMAP (Hill and Pollock, 2008): RQ3 only

EVALUATION METRICS

Comparability of Results: Accuracy rate

• Identifier Level evaluation: Each mapping result must be completely correct

• Soft-word Level evaluation: “Partial credit” given to each word correctly mapped

Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]

As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
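The two accuracy levels can be made concrete with a small sketch. The exact scoring rules are not given on the slides, so this assumes that "partial credit" means counting, per identifier, how many gold soft words also appear in the predicted mapping. The function names and the sample data are hypothetical.

```python
from collections import Counter


def identifier_level_accuracy(predicted, gold):
    """Fraction of identifiers whose whole mapping equals the gold mapping."""
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def soft_word_level_accuracy(predicted, gold):
    """Fraction of gold soft words recovered, summed over all identifiers."""
    matched = total = 0
    for p, g in zip(predicted, gold):
        overlap = Counter(p) & Counter(g)   # multiset intersection
        matched += sum(overlap.values())
        total += len(g)
    if total == 0:
        raise ValueError("gold contains no soft words")
    return matched / total


GOLD = [["draw", "xor", "rectangle"], ["display", "box"]]
PREDICTED = [["draw", "xor", "rect"], ["display", "box"]]
ID_LEVEL = identifier_level_accuracy(PREDICTED, GOLD)
SOFT_WORD_LEVEL = soft_word_level_accuracy(PREDICTED, GOLD)
```

Here one of two identifiers is fully correct (0.5 at the identifier level), while four of five soft words are correct (0.8 at the soft-word level).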

RESULTS

RQ1: SPLITTING

[Figure: accuracy rates for the comparison with DTW (Madani et al., 2010) on JHotDraw 5.1 and Lynx 2.8.5; bars for DTW and LINSEN]

[Figure: accuracy rates for the comparison with GenTest (Lawrie and Binkley, 2011) on which 2.20 and a2ps 4.14; panels: Identifier Level, Soft-word Level; bars for GenTest and LINSEN]

RQ2: MAPPING

[Figure: accuracy rates for the comparison with Normalize (Lawrie and Binkley, 2011); panels: Identifier Level, Soft-word Level; bars for Normalize and LINSEN]

RQ3: EXPANSION

[Figure: accuracy rates for the comparison with AMAP (Hill and Pollock, 2008) by abbreviation type (CW, DL, OO, AC, PR, SL); bars for AMAP and LINSEN]

• CW: Combination Words

• DL: Dropped Letters

• OO: Others

• AC: Acronyms

• PR: Prefix

• SL: Single Letters

CONCLUSIONS

FUTURE WORKS

• Evaluation of the impact of each adopted dictionary on the performance

• Improve, change, or add dictionaries

• Improve the implementation of the prototype to speed up the computation

• Make use of parallel computation to process each identifier in isolation

THANK YOU


Valerio Maggio

Ph.D. Student, University of Naples “Federico II”
