linsen an efficient approach to split identifiers and expand abbreviations

75
LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS Anna Corazza, Sergio Di Martino, Valerio Maggio Università di Napoli “Federico II” 26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy

Upload: valerio-maggio

Post on 29-Jun-2015

130 views

Category:

Technology


0 download

DESCRIPTION

"Linsen an efficient approach to split identifiers and expand abbreviations" Slides presented at the International Conference of Software Maintenance (ICSM) 2012, Riva del Garda (TN), Italy

TRANSCRIPT

Page 1: LINSEN an efficient approach to split identifiers and expand abbreviations

LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS

Anna Corazza, Sergio Di Martino, Valerio MaggioUniversità di Napoli “Federico II”

26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy

Page 2: LINSEN an efficient approach to split identifiers and expand abbreviations

MOTIVATIONS IR FOR SE

Page 3: LINSEN an efficient approach to split identifiers and expand abbreviations

MOTIVATIONS IR FOR SE

Page 4: LINSEN an efficient approach to split identifiers and expand abbreviations

IR F

OR

NATU

RAL

LAN

GU

AG

E

Page 5: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Page 6: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

Page 7: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

Page 8: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

1. Change to Lower case

draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...

Page 9: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

1. Change to Lower case

draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...2.Remove StopWords

Page 10: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

1. Change to Lower case

draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords

3.Apply Stemming4. ...

Page 11: LINSEN an efficient approach to split identifiers and expand abbreviations

Implicit assumption: The “same” words are used whenever a particular concept is described

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

1. Change to Lower case

draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords

3.Apply Stemming4. ...

Page 12: LINSEN an efficient approach to split identifiers and expand abbreviations

Implicit assumption: The “same” words are used whenever a particular concept is described

1. Tokenization

2. Normalization

IR F

OR

NATU

RAL

LAN

GU

AG

E

Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...

1. Change to Lower case

draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords

3.Apply Stemming4. ...

Page 13: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

IR F

OR

SOUR

CE

CO

DE

2. Normalization1.5 Identifier Splitting

Page 14: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

IR F

OR

SOUR

CE

CO

DE

2. Normalization1.5 Identifier Splitting

• snake_case Splitter: r’(?<=\w)_’

• display_box ==> display | box

• camelCase/PascalCase Splitter: r’(?<!^)([A-Z][a-z]+)’

• displayBox ==> display | Box

draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

Page 15: LINSEN an efficient approach to split identifiers and expand abbreviations

IR F

OR

SOUR

CE

CO

DE

Page 16: LINSEN an efficient approach to split identifiers and expand abbreviations

IR F

OR

SOUR

CE

CO

DE

Page 17: LINSEN an efficient approach to split identifiers and expand abbreviations

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• drawXORRect ==> drawXOR | Rect

• drawxorrect ==> NO SPLIT

IR F

OR

SOUR

CE

CO

DE

Page 18: LINSEN an efficient approach to split identifiers and expand abbreviations

• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’

• drawXORRect ==> drawXOR | Rect

• drawxorrect ==> NO SPLIT

IR F

OR

SOUR

CE

CO

DE

Splitting algorithms based on naming conventions are not robust enough

Page 19: LINSEN an efficient approach to split identifiers and expand abbreviations

Splitting algorithms based on naming conventions are not robust enough

• Heavy use of Abbreviations in the source code

IR F

OR

SOUR

CE

CO

DE

Page 20: LINSEN an efficient approach to split identifiers and expand abbreviations

Splitting algorithms based on naming conventions are not robust enough

• Heavy use of Abbreviations in the source code

• rect as for Rectangle

• r as for Rectangle

IR F

OR

SOUR

CE

CO

DE

Page 21: LINSEN an efficient approach to split identifiers and expand abbreviations

Splitting algorithms based on naming conventions are not robust enough

• Heavy use of Abbreviations in the source code

• rect as for Rectangle

• r as for Rectangle

IR F

OR

SOUR

CE

CO

DE

Page 22: LINSEN an efficient approach to split identifiers and expand abbreviations

1. Tokenization

IDEN

TIFIE

R M

AP

PIN

G

2. Normalization1.5 Identifier Mapping

• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)

• GenTest+Normalize (Lawrie and Binkley, 2011)

• AMAP (Hill and Pollock, 2008)

• ...• LINSEN

draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

Page 23: LINSEN an efficient approach to split identifiers and expand abbreviations

LINSENA L G O

R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping

Page 24: LINSEN an efficient approach to split identifiers and expand abbreviations

LINSENA L G O

R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping

• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)

Page 25: LINSEN an efficient approach to split identifiers and expand abbreviations

LINSENA L G O

R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping

• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)

• Applied on a Graph-based model

Page 26: LINSEN an efficient approach to split identifiers and expand abbreviations

LINSENA L G O

R I T H M CONTRIBUTION• Novel technique for the Identifier Mapping

• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)

• Applied on a Graph-based model

• Able to both Split Identifiers and Expand possible occurring abbreviations

Page 27: LINSEN an efficient approach to split identifiers and expand abbreviations

There could be multiple and equally correct splitting or expansion solutions

THE

AM

BIG

UIT

Y P

ROBL

EM

Page 28: LINSEN an efficient approach to split identifiers and expand abbreviations

There could be multiple and equally correct splitting or expansion solutions

• r as for Rectangle OR red

THE

AM

BIG

UIT

Y P

ROBL

EM

Page 29: LINSEN an efficient approach to split identifiers and expand abbreviations

There could be multiple and equally correct splitting or expansion solutions

• r as for Rectangle OR red

THE

AM

BIG

UIT

Y P

ROBL

EM

•nsISupport ==> ns|IS|up|ports OR

==> ns|I|Supports

Page 30: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Page 31: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Page 32: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Application-aware Dictionaries

Page 33: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Application-aware Dictionaries(108,315 Entries)

Page 34: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Application-aware Dictionaries

(22,940 Entries)

(108,315 Entries)

Page 35: LINSEN an efficient approach to split identifiers and expand abbreviations

DICTIONARIES

Application-aware Dictionaries

(22,940 Entries)

(108,315 Entries)

(588 Entries)

Page 36: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L Model: Weighted Directed Graph Example: drawXORRect identifier

Page 37: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• NODES correspond to characters of the current identifier

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

5

7

8

2 63

10

Page 38: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• ARCS corresponds to matchings between identifier substrings and dictionary words

• NODES correspond to characters of the current identifier

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

5

7

8

2 63

10

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

Page 39: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• ARCS corresponds to matchings between identifier substrings and dictionary words

• Application of the String Matching Algorithm (BYP)

• NODES correspond to characters of the current identifier

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

5

7

8

2 63

10

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

Page 40: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• ARCS corresponds to matchings between identifier substrings and dictionary words

• Application of the String Matching Algorithm (BYP)

• Padding Arcs to ensure the Graph always connected

• NODES correspond to characters of the current identifier

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 41: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• Every Arc is Labelled with the corresponding dictionary word

• Weights represent the “cost” of each matching

• Cost function [c(“word”)] favors longest words and words coming from the application-aware dictionaries

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 42: LINSEN an efficient approach to split identifiers and expand abbreviations

GR

AP

H M

ODE

L

• The final Mapping Solution corresponds to the sequence of labels in the

path with the minimum cost (Djikstra Algorithm)

Model: Weighted Directed Graph Example: drawXORRect identifier

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3

“w”,C-MAX “X”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“rectangle”,c(“rectangle”)

“R”,C-MAX

Page 43: LINSEN an efficient approach to split identifiers and expand abbreviations

STRING MATCHING• Application of the Baeza-Yates and Perlberg (BYP) Algorithm

• Signature: BYP(identifier, word, φ(word))

• identifier: target string

• word: string to match

• φ(·): Tolerance (Error) function

• Bounds the length of acceptable matchings

Advantage: Use the same algorithm for both the splitting and the expansion step with different input Tolerance function

Page 44: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

.. draw, the, are, null, handle, box, red, rectangle, ...

... echo, testing, threading, xpm, xor, ....

... abort absolute abstract ... or ... raw ...

DFile DCOMPUTER-SCIENCE DEnglish

Identifier:drawXORRect

Page 45: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

.. draw, the, are, null, handle, box, red, rectangle, ...

... echo, testing, threading, xpm, xor, ....

... abort absolute abstract ... or ... raw ...

0

11

1

9

4

5

7

8

2 63

10

DFile DCOMPUTER-SCIENCE DEnglish

Identifier:drawXORRect

Page 46: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

.. draw, the, are, null, handle, box, red, rectangle, ...

... echo, testing, threading, xpm, xor, ....

... abort absolute abstract ... or ... raw ...

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX “R”,C-MAX

DFile DCOMPUTER-SCIENCE DEnglish

Identifier:drawXORRect

Page 47: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

... echo, testing, threading, xpm, xor, ....

... abort absolute abstract ... or ... raw ...

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

DFile DCOMPUTER-SCIENCE DEnglish

Identifier:drawXORRect

Page 48: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

... abort absolute abstract ... or ... raw ...

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

DFile ... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE DEnglish

Identifier:drawXORRect

Page 49: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

DFile ... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ... or ...

raw ...

DEnglish

Identifier:drawXORRect

Page 50: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

DFile ... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglish

Identifier:drawXORRect

Page 51: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φSplit(word))

φSplit : Exact Matching (i.e., No Errors allowed)B

YP

FO

R SP

LITT

ING

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

DFile ... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglish

Identifier:drawXORRect

Page 52: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

BY

P F

OR

EXPA

NSI

ON

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“raw”,c(“raw”)“draw”,c(“draw”)

“or”,c(“or”)

“xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglishDFile

Identifier:drawXORRect

Page 53: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

BY

P F

OR

EXPA

NSI

ON

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

..

draw, the, are, null, handle, box, red, rectangle, ...

... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglishDFile

Identifier:drawXORRect

Page 54: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

BY

P F

OR

EXPA

NSI

ON

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglish

“red”,c(“red”)

..

draw, the, are, null, handle, box,

red, rectangle, ...

DFile

Identifier:drawXORRect

Page 55: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

BY

P F

OR

EXPA

NSI

ON

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglish

“red”,c(“red”)“rectangle”,c(“

rectangle”)

..

draw, the, are, null, handle, box,

red,

rectangle, ...

DFile

Identifier:drawXORRect

Page 56: LINSEN an efficient approach to split identifiers and expand abbreviations

• BYP(identifier, word, φExp(word))

• φExp : Approximate Matching

BY

P F

OR

EXPA

NSI

ON

0

11

1

9

4

“r”,C-MAX

“d”,C-MAX

5

7

8

2

“R”,

C-MA

X

6“a”,C-MAX 3“w”,C-MAX “X

”,C-MAX

“O”,C-MAX

“e”,C-MAX10

“c”,C-MAX“t”,C-MAX

“draw”,c(“draw”) “xor”,c(“xor”)

“R”,C-MAX

... echo, testing, threading, xpm,

xor, ....

DCOMPUTER-SCIENCE ... abort absolute abstract ...

or ...

raw ...

DEnglish

“red”,c(“red”)“rectangle”,c(“

rectangle”)

..

draw, the, are, null, handle, box,

red,

rectangle, ...

DFile

Identifier:drawXORRect

Page 57: LINSEN an efficient approach to split identifiers and expand abbreviations

EMPIRICAL

E V A L UA T I O N

RESEARCH QUESTIONS

• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?

Page 58: LINSEN an efficient approach to split identifiers and expand abbreviations

EMPIRICAL

E V A L UA T I O N

RESEARCH QUESTIONS

• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?

• RQ2:How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?

Page 59: LINSEN an efficient approach to split identifiers and expand abbreviations

EMPIRICAL

E V A L UA T I O N

RESEARCH QUESTIONS

• RQ1:How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?

• RQ2:How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?

• RQ3:What is the ability of the LINSEN approach in dealing with different types of abbreviations?

Page 60: LINSEN an efficient approach to split identifiers and expand abbreviations

CASE STUDIESDTW (Madani et. al 2010)

GenTest+Normalize (Lawrie and Binkley, 2011)

RQ1 andRQ2}

RQ1 only

LUDISO Dataset (2012)

AMAP (Hill and Pollock, 2008)RQ3 only

15 out of 750 software systemsCovering the 58% of total identifiers

EMPIRICAL

E V A L UA T I O N

Page 61: LINSEN an efficient approach to split identifiers and expand abbreviations

EVALUATION METRICS

Comparability of Results: Accuracy rate

Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]

EMPIRICAL

E V A L UA T I O N

Page 62: LINSEN an efficient approach to split identifiers and expand abbreviations

EVALUATION METRICS

Comparability of Results: Accuracy rate

Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]

EMPIRICAL

E V A L UA T I O N

Page 63: LINSEN an efficient approach to split identifiers and expand abbreviations

EVALUATION METRICS

Comparability of Results: Accuracy rate

• Identifier Level evaluation: Each mapping result must be completely correct

• Soft-word Level evaluation: “Partial credit” given to each word correctly mapped

Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011]

As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)

EMPIRICAL

E V A L UA T I O N

Page 64: LINSEN an efficient approach to split identifiers and expand abbreviations

RQ1: SPLITTING

Accuracy Rates for the comparison withDTW (Madani et. al 2010)

0

0.25

0.5

0.75

1

JhotDraw 5.1 Lynx 2.8.5DTW LINSEN DTW LINSEN

DTWLINSEN

RESULTS

R Q 1

Page 65: LINSEN an efficient approach to split identifiers and expand abbreviations

Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011)

0

0.175

0.35

0.525

0.7

which 2.20 a2ps 4.14

Identifier Level

0

0.2

0.4

0.6

0.8

which 2.20 a2ps 4.14

Soft-word Level

GenTestLINSEN

GenTestLINSEN

RESULTS

R Q 1 RQ1: SPLITTING

Page 66: LINSEN an efficient approach to split identifiers and expand abbreviations

Accuracy Rates for the comparison withDTW (Madani et. al 2010)

0

0.25

0.5

0.75

1

JhotDraw 5.1 Lynx 2.8.5

DTWLINSEN

RQ2: MAPPINGRE

SULTSR Q 2

Page 67: LINSEN an efficient approach to split identifiers and expand abbreviations

Accuracy Rates for the comparison with Normalize (Lawrie and Binkley, 2011)

0

0.15

0.3

0.45

0.6

which 2.20 a2ps 4.14

Identifier Level

0

0.225

0.45

0.675

0.9

which 2.20 a2ps 4.14

Soft-word Level

NormalizeLINSENNormalize

LINSEN

RESULTS

R Q 2 RQ2: MAPPING

Page 68: LINSEN an efficient approach to split identifiers and expand abbreviations

Accuracy Rates for the comparison withAMAP (Hill and Pollock, 2008)

0

0.225

0.45

0.675

0.9

CW DL OO AC PR SL

AMAPLINSEN

RQ3: EXPANSIONRE

SULTSR Q 3

CW: Combination WordsDL: Dropped LettersOO: Others

AC: AcronymsPR: PrefixSL: Single Letters

Page 69: LINSEN an efficient approach to split identifiers and expand abbreviations

CONCLUSIONS

Page 70: LINSEN an efficient approach to split identifiers and expand abbreviations

CONCLUSIONS

Page 71: LINSEN an efficient approach to split identifiers and expand abbreviations

CONCLUSIONS

Page 72: LINSEN an efficient approach to split identifiers and expand abbreviations

CONCLUSIONS

Page 73: LINSEN an efficient approach to split identifiers and expand abbreviations

CONCLUSIONS

Page 74: LINSEN an efficient approach to split identifiers and expand abbreviations

FUTURE WORKS• Evaluation of the impact of each adopted dictionary on

the performance

• Improve or change or add dictionaries

• Improve the implementation of the prototype to speed up the computation

• Make use of parallel computation to process each identifier in isolation

Page 75: LINSEN an efficient approach to split identifiers and expand abbreviations

THANK YOU

26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy

Valerio MaggioPh.D. Student, University of Naples “Federico II”[email protected]