clone detection in python

61
DATE: May 13, 2013 Florence, Italy Clone Detection in Python Valerio Maggio ([email protected])

Upload: valerio-maggio

Post on 24-Jun-2015

732 views

Category:

Technology


0 download

DESCRIPTION

"Clone detection in Python": Slides presented at EuroPython 2012 Clone Detection in Python highlights the topic of code duplication detection using Machine Learning techniques. Some examples on Python code duplications and C-Python implementation duplications are reported as well.

TRANSCRIPT

Page 1: Clone detection in Python

DATE: May 13, 2013Florence, Italy

Clone Detection in Python

Valerio Maggio ([email protected])

Page 2: Clone detection in Python

Introduction

Duplicated Code

Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.

2

Page 3: Clone detection in Python

Introduction

The Python Way

3

Page 4: Clone detection in Python

Introduction

The Python Way

3

Page 5: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 6: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 7: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 8: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 9: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 10: Clone detection in Python

Introduction

NLTK (tree.py)

4

Page 11: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

5

Page 12: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

‣ Is often created during development

5

Page 13: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

‣ Is often created during development

• due to time pressure for an upcoming deadline

5

Page 14: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

‣ Is often created during development

• due to time pressure for an upcoming deadline

• to overcome limitations of the programming language

5

Page 15: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

‣ Is often created during development

• due to time pressure for an upcoming deadline

• to overcome limitations of the programming language

‣ Three Public Enemies:

5

Page 16: Clone detection in Python

Introduction

Duplicated Code‣ Exists: 5% to 30% of code is similar

• In extreme cases, even up to 50%

- This is the case of Payroll, a COBOL system

‣ Is often created during development

• due to time pressure for an upcoming deadline

• to overcome limitations of the programming language

‣ Three Public Enemies:

• Copy, Paste and Modify

5

Page 17: Clone detection in Python

DATE: May 13, 2013Part I: Clone Detection

Clone Detection in Python

Page 18: Clone detection in Python

DATE: May 13, 2013Part I: Clone Detection

Clone Detection in Python

Page 19: Clone detection in Python

Part I: Clone Detection

Code Clones

‣ There can be different definitions of similarity, based on:

• Program Text (text, syntax)• Semantics

7

(Def.) “Software Clones are segments of code that are similar according to some definition of similarity” (I.D. Baxter, 1998)

Page 20: Clone detection in Python

Part I: Clone Detection

Code Clones

‣ There can be different definitions of similarity, based on:

• Program Text (text, syntax)• Semantics

‣ Four Different Types of Clones

7

(Def.) “Software Clones are segments of code that are similar according to some definition of similarity” (I.D. Baxter, 1998)

Page 21: Clone detection in Python

Part I: Clone Detection

The original one

8

# Original Fragmentdef do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines

Page 22: Clone detection in Python

Part I: Clone Detection

Type 1: Exact Copy‣ Identical code segments except for differences in layout, whitespace,

and comments

9

Page 23: Clone detection in Python

Part I: Clone Detection

Type 1: Exact Copy‣ Identical code segments except for differences in layout, whitespace,

and comments

9

# Original Fragmentdef do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines

def do_something_cool_in_Python (filepath, marker='---end---'): lines = list() # This list is initially empty

with open(filepath) as report: for l in report: # It goes through the lines of the file if l.endswith(marker): lines.append(l) return lines

Page 24: Clone detection in Python

Part I: Clone Detection

Type 2: Parameter Substituted Clones‣ Structurally identical segments except for differences in identifiers,

literals, layout, whitespace, and comments

10

Page 25: Clone detection in Python

Part I: Clone Detection

Type 2: Parameter Substituted Clones‣ Structurally identical segments except for differences in identifiers,

literals, layout, whitespace, and comments

10

# Type 2 Clonedef do_something_cool_in_Python(path, end='---end---'): targets = list() with open(path) as data_file: for t in data_file: if l.endswith(end): targets.append(t) # Stores only lines that ends with "marker" #Return the list of different lines return targets

# Original Fragmentdef do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines

Page 26: Clone detection in Python

Part I: Clone Detection

Type 3: Structure Substituted Clones‣ Similar segments with further modifications such as changed, added (or deleted)

statements, in additions to variations in identifiers, literals, layout and comments

11

Page 27: Clone detection in Python

Part I: Clone Detection

Type 3: Structure Substituted Clones‣ Similar segments with further modifications such as changed, added (or deleted)

statements, in additions to variations in identifiers, literals, layout and comments

11

import osdef do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones

Page 28: Clone detection in Python

Part I: Clone Detection

Type 3: Structure Substituted Clones‣ Similar segments with further modifications such as changed, added (or deleted)

statements, in additions to variations in identifiers, literals, layout and comments

11

import osdef do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones

Page 29: Clone detection in Python

Part I: Clone Detection

Type 3: Structure Substituted Clones‣ Similar segments with further modifications such as changed, added (or deleted)

statements, in additions to variations in identifiers, literals, layout and comments

11

import osdef do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones

Page 30: Clone detection in Python

Part I: Clone Detection

Type 3: Structure Substituted Clones‣ Similar segments with further modifications such as changed, added (or deleted)

statements, in additions to variations in identifiers, literals, layout and comments

11

import osdef do_something_with(path, marker='---end---'): # Check if the input path corresponds to a file if not os.path.isfile(path): return None bad_ones = list() good_ones = list() with open(path) as report: for line in report: line = line.strip() if line.endswith(marker): good_ones.append(line) else: bad_ones.append(line) #Return the lists of different lines return good_ones, bad_ones

Page 31: Clone detection in Python

Part I: Clone Detection

Type 4: “Semantic” Clones‣ Semantically equivalent segments that perform the same

computation but are implemented by different syntactic variants

12

Page 32: Clone detection in Python

Part I: Clone Detection

Type 4: “Semantic” Clones‣ Semantically equivalent segments that perform the same

computation but are implemented by different syntactic variants

12

# Original Fragmentdef do_something_cool_in_Python(filepath, marker='---end---'): lines = list() with open(filepath) as report: for l in report: if l.endswith(marker): lines.append(l) # Stores only lines that ends with "marker" return lines #Return the list of different lines

def do_always_the_same_stuff(filepath, marker='---end---'): report = open(filepath) file_lines = report.readlines() report.close() #Filters only the lines ending with marker return filter(lambda l: len(l) and l.endswith(marker), file_lines)

Page 33: Clone detection in Python

Part I: Clone Detection

What are the consequences?‣ Do clones increase the maintenance effort?‣ Hypothesis:

• Cloned code increases code size

• A fix to a clone must be applied to all similar fragments

• Bugs are duplicated together with their clones

‣ However: it is not always possible to remove clones

• Removal of Clones is harder if variations exist.

13

Page 34: Clone detection in Python

Part I: Clone Detection 14

Duplix

Scorpio

PMD

CCFinder

Dup CPDDuplix

ShinobiClone Detective

Gemini

iClones

KClone

ConQAT

DeckardClone Digger

JCCDCloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

Clone Detection Tools

Page 35: Clone detection in Python

Part I: Clone Detection 14

Duplix

Scorpio

PMD

CCFinder

Dup CPDDuplix

ShinobiClone Detective

Gemini

iClones

KClone

ConQAT

DeckardClone Digger

JCCDCloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

‣ Text Based Tools:• Lines are compared to other

lines

Clone Detection Tools

Page 36: Clone detection in Python

Part I: Clone Detection 14

Duplix

Scorpio

PMD

CCFinder

Dup CPDDuplix

ShinobiClone Detective

Gemini

iClones

KClone

ConQAT

DeckardClone Digger

JCCDCloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

‣ Token Based Tools:• Token sequences are compared to

sequences

Clone Detection Tools

Page 37: Clone detection in Python

Part I: Clone Detection 14

Duplix

Scorpio

PMD

CCFinder

Dup CPDDuplix

ShinobiClone Detective

Gemini

iClones

KClone

ConQAT

DeckardClone Digger

JCCDCloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

‣Syntax Based Tools:• Syntax subtrees are

compared to each other

Clone Detection Tools

Page 38: Clone detection in Python

Part I: Clone Detection 14

Duplix

Scorpio

PMD

CCFinder

Dup CPDDuplix

ShinobiClone Detective

Gemini

iClones

KClone

ConQAT

DeckardClone Digger

JCCDCloneDr SimScan

CLICS

NiCAD

Simian

Duploc

Dude

SDD

‣Graph Based Tools:• (sub) graphs are compared to each

other

Clone Detection Tools

Page 39: Clone detection in Python

Part I: Clone Detection

Clone Detection Techniques

15

‣ String/Token based Techiniques:

• Pros: Run very fast

• Cons: Too many false clones‣ Syntax based (AST) Techniques:

• Pros: Well suited to detect structural similarities

• Cons: Not Properly suited to detect Type 3 Clones

‣ Graph based Techniques:

• Pros: The only one able to deal with Type 4 Clones

• Cons: Performance Issues

Page 40: Clone detection in Python

Part I: Clone Detection

The idea: Use Machine Learning, Luke

‣ Use Machine Learning Techniques to compute similarity of fragments by exploiting specific features of the code.

‣ Combine different sources of Information

• Structural Information: ASTs, PDGs

• Lexical Information: Program Text

16

Page 41: Clone detection in Python

Part I: Clone Detection

Kernel Methods for Structured Data

‣ Well-grounded on solid and awful Math

‣ Based on the idea that objects can be described in terms of their constituent Parts

‣ Can be easily tailored to specific domains

• Tree Kernels• Graph Kernels• ....

17

Page 42: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

18

Page 43: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

The definition of a new Kernel for a Structured Object requires the definition of:

18

Page 44: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

The definition of a new Kernel for a Structured Object requires the definition of:

‣ Set of features to annotate each part of the object

18

Page 45: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

The definition of a new Kernel for a Structured Object requires the definition of:

‣ Set of features to annotate each part of the object

‣ A Kernel function to measure the similarity on the smallest part of the object

18

Page 46: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

The definition of a new Kernel for a Structured Object requires the definition of:

‣ Set of features to annotate each part of the object

‣ A Kernel function to measure the similarity on the smallest part of the object

• e.g., Nodes for AST and Graphs

18

Page 47: Clone detection in Python

Part I: Clone Detection

Defining a Kernel for Structured Data

The definition of a new Kernel for a Structured Object requires the definition of:

‣ Set of features to annotate each part of the object

‣ A Kernel function to measure the similarity on the smallest part of the object

• e.g., Nodes for AST and Graphs

‣ A Kernel function to apply the computation on the different (sub)parts of the structured object

18

Page 48: Clone detection in Python

Part I: Clone Detection

Kernel Methods for Clones: Tree Kernels Example on AST

‣ Features: We annotate each node by a set of 4 features

• Instruction Class

- i.e., LOOP, CONDITIONAL_STATEMENT, CALL

• Instruction

- i.e., FOR, IF, WHILE, RETURN

• Context

- i.e. Instruction Class of the closer statement node

• Lexemes

- Lexical information gathered (recursively) from leaves

- i.e., Lexical Information

19

FOR

Page 49: Clone detection in Python

Part I: Clone Detection

Kernel Methods for Clones: Tree Kernels Example on AST

‣ Kernel Function:

• Aims at identify the maximum isomorphic Tree/Subtree

20

K(T1, T2) =X

n2T1

X

n02T2

�(n, n0) ·Ksubt(n, n0)

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

Ksubt(n, n0) = �sim(n, n0) + (1� �)

X

(n1,n2)2Ch(n,n0)

k(n1, n2)

Page 50: Clone detection in Python

DATE: May 13, 2013Part II: Clones and Python

Clone Detection in Python

Page 51: Clone detection in Python

DATE: May 13, 2013Part II: Clones and Python

Clone Detection in Python

Page 52: Clone detection in Python

Part II: In Python

The Overall Process Sketch22

1. Pre Processing

Page 53: Clone detection in Python

Part II: In Python

The Overall Process Sketch22

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

1. Pre Processing 2. Extraction

Page 54: Clone detection in Python

Part II: In Python

The Overall Process Sketch22

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

1. Pre Processing 2. Extraction

3. Detection

Page 55: Clone detection in Python

Part II: In Python

The Overall Process Sketch22

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

block

print

p0.0s

=

1.0

=

p

s

f

block

print

y1.0x

=

x

=

y

x

f

1. Pre Processing 2. Extraction

3. Detection 4. Aggregation

Page 56: Clone detection in Python

Part II: In Python

Detection Process23

Page 57: Clone detection in Python

Part II: In Python

Empirical Evaluation‣ Comparison with another (pure) AST-based: Clone Digger

• It has been the first Clone detector for and in Python :-)

• Presented at EuroPython 2006‣ Comparison on a system with randomly seeded clones

24

‣ Results refer only to Type 3 Clones

‣ On Type 1 and Type 2 we got the same results

Page 58: Clone detection in Python

Part II: In Python

Precision/Recall Plot25

0

0.25

0.50

0.75

1.00

0.6 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98

Precision, Recall and F-Measure

Precision Recall F1

Precision: How accurate are the obtained results? (Altern.) How many errors do they contain?

Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?

Page 59: Clone detection in Python

Part II: In Python

Is Python less clone prone?26

Roy et. al., IWSC, 2010

Page 60: Clone detection in Python

Part II: In Python

Clones in CPython 2.5.1

27

Page 61: Clone detection in Python

DATE: May 13, 2013Florence, Italy

Thank you!