Post on 31-Dec-2015


Mining and Analysis of Control Structure Variant Clones

Guo Qiao

Outline

• Clones and Control Structure Variant Clones
• Research Motivation
• Approach for mining control structure variant clones
• Evaluation of precision and recall
• Case study of control structure variant clones
• Refactorability evaluation

2

• Clones are common in software systems. The percentage of clones in systems varied from 6.5% to 59.5%, with an average of 14.6% (Chen et al., 2014).

Code duplication (Software Clone)

3

Concordia University

• Clones are harmful
  – Identified as the worst code smell (Rahman, 2010)
  – Indication of poor software maintainability (Mondal, 2011)
  – Cause system design quality to degrade

Why are clones a problem?

Clone refactoring can eliminate these harmful effects.

4

• Type-1: Identical code fragments except for variations in whitespace, layout and comments. (Clear)

• Type-2: Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments. (Clear)

• Type-3: Copied fragments with further modifications such as changed, added or removed statements, in addition to Type-1 variations.

• Type-4: Two or more code fragments that perform the same computation but are implemented with different syntax.

Clone Categorization

The most widely accepted definition is from Roy (2009).

5

• Type-4 clones can be divided into subcategories.

Dispute about Type-4 Clones

• Type-4 clones are syntactically different semantic clones and are still undecidable.
• Type-4 clones are behaviorally similar code fragments with respect to their input/output.

6

Definition

• Control structure variant clones (CSVC) are clones that use different control structures to implement the same functionality.

Control Structure Variant Clone?
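To make the definition concrete, here is a minimal sketch of what such a clone pair could look like. The class and method names are hypothetical and are not taken from the studied systems; the two methods differ only in the loop construct they use.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CsvcExample {
    // Clone 1: enhanced for loop.
    static int totalLengthEnhancedFor(List<String> names) {
        int total = 0;
        for (String name : names) {
            total += name.length();
        }
        return total;
    }

    // Clone 2: iterator-based while loop with the same functionality.
    static int totalLengthIteratorWhile(List<String> names) {
        int total = 0;
        Iterator<String> it = names.iterator();
        while (it.hasNext()) {
            String name = it.next();
            total += name.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("clone", "loop", "java");
        // Both calls print the same value for the same input.
        System.out.println(totalLengthEnhancedFor(names));
        System.out.println(totalLengthIteratorWhile(names));
    }
}
```

A token-based or line-based comparison sees different statements here, although the input/output behavior is identical.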

7

From the perspective of clone refactoring, a different strategy is required to refactor control structure variant clones.

Extract common code fragment

Analysis of code functionality

Motivation

8

Jürgens et al. (2010), in their study of clones beyond copy-paste, revealed:

– The state-of-the-art clone detectors did not achieve a recall of more than 10%.

– Of 52 manually checked methods, 32 were behaviorally similar to but syntactically different from other methods.

No approach tailored to find these clones

Motivation

9

Propose an approach to mine control structure variant clones accurately.

The mining process should take into account:
1. Control structure matching
2. Functional similarity evaluation

Goal

10

Overall Approach

[Figure: a code example and its control dependency tree]

Phase 1: Control Structure Matching

12

• Loop variants
  – Enhanced for loop
  – Iterator-based for or while loop
  – Index-based for or while loop
  – Do-while loop

• Conditional variants
  – If-else statement
  – Conditional expression (ternary operator ?:)
  – Switch statement

Common Control Structures in Java

13

Loop variable:
• Start index
• End index
• Step

We consider two loops L1 and L2 functionally equivalent if their loop variables take the same values, that is, the same start index, end index, and step.

Unified Representation of Loops
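The unified representation can be sketched as a small value class. This is an illustration under stated assumptions, not the tool's implementation: the class name `LoopRange`, the use of plain `int` values, and the normalization of an inclusive bound (`i <= end`) to an exclusive one (`i < end + 1`) are all hypothetical choices made for the example.

```java
import java.util.Objects;

public class LoopRange {
    final int start;        // first value taken by the loop variable
    final int endExclusive; // the loop stops before reaching this value
    final int step;         // amount added to the loop variable per iteration

    LoopRange(int start, int endExclusive, int step) {
        this.start = start;
        this.endExclusive = endExclusive;
        this.step = step;
    }

    // Assumption: a condition written as `i <= end` is normalized to `i < end + 1`.
    static LoopRange withInclusiveEnd(int start, int endInclusive, int step) {
        return new LoopRange(start, endInclusive + 1, step);
    }

    // Two loops match when start, end, and step all agree.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof LoopRange)) {
            return false;
        }
        LoopRange that = (LoopRange) other;
        return start == that.start
                && endExclusive == that.endExclusive
                && step == that.step;
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, endExclusive, step);
    }
}
```

With this sketch, `for (int i = 0; i < 10; i++)` and a while loop that runs `i` from 0 while `i <= 9` with `i += 1` both map to the same descriptor and are treated as equivalent loops.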

14

Control Structure Equivalents

15

• Start index

Control Structure Matching

16

• End index

Control Structure Matching

17

• Update Step

Control Structure Matching

18

Conditional Variant Equivalents
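As an illustration of conditional variants, the sketch below writes one decision in the three forms listed earlier: if-else, conditional expression, and switch. The class, method names, and labels are hypothetical.

```java
public class ConditionalVariants {
    // Variant 1: if-else statement.
    static String labelIfElse(int level) {
        if (level == 0) {
            return "low";
        } else if (level == 1) {
            return "medium";
        } else {
            return "high";
        }
    }

    // Variant 2: conditional (ternary) expression.
    static String labelTernary(int level) {
        return level == 0 ? "low" : level == 1 ? "medium" : "high";
    }

    // Variant 3: switch statement.
    static String labelSwitch(int level) {
        switch (level) {
            case 0:
                return "low";
            case 1:
                return "medium";
            default:
                return "high";
        }
    }
}
```

All three methods return the same label for every input, so any pair of them would count as a conditional variant of the other.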

19

Java binding: a unique string representing a variable, object type, or method invocation.

IBinding:
• IMethodBinding
• ITypeBinding
• IVariableBinding (excluded)

Phase 2: Function Similarity Evaluation

20

IMethodBinding represents method signatures.

ITypeBinding represents the Java types.

Binding Information

21

1. All Collection subtypes are generalized to java.util.Collection.

Post-processing of Bindings
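The generalization step could look roughly like the following. To keep the sketch self-contained, it uses `java.lang.Class` objects as a stand-in for type bindings; the class and method names are hypothetical, and the real approach works on binding information rather than on reflection.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;

public class BindingGeneralizer {
    // Maps every subtype of java.util.Collection to the Collection name itself
    // and leaves all other types unchanged.
    static String generalize(Class<?> type) {
        if (Collection.class.isAssignableFrom(type)) {
            return Collection.class.getName();
        }
        return type.getName();
    }

    public static void main(String[] args) {
        System.out.println(generalize(ArrayList.class)); // a Collection subtype
        System.out.println(generalize(HashMap.class));   // Map is not a Collection
        System.out.println(generalize(String.class));    // unrelated type
    }
}
```

After this step, two clones that differ only in the concrete collection type they iterate over contribute the same type name to the comparison.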

22

2. Ignore the binding keys of the methods which access the next element.

Post-processing of Bindings
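A rough sketch of this filtering step follows. The key strings are simplified, human-readable stand-ins rather than real binding keys, and the choice of `Iterator.next()` and `List.get(int)` as the "next element" accessors is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class NextElementFilter {
    // Hypothetical keys of methods that only fetch the next element of a traversal.
    static final Set<String> NEXT_ELEMENT_ACCESSORS = new HashSet<>(Arrays.asList(
            "java.util.Iterator.next()",
            "java.util.List.get(int)"));

    // Returns the keys that remain after dropping the next-element accessors.
    static Set<String> withoutNextElementAccessors(Collection<String> keys) {
        Set<String> kept = new LinkedHashSet<>(keys);
        kept.removeAll(NEXT_ELEMENT_ACCESSORS);
        return kept;
    }
}
```

Dropping these keys keeps an iterator-based loop and an enhanced for loop from looking dissimilar only because one of them calls `next()` explicitly.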

23

• Jaccard Similarity Coefficient

• Specify the threshold Φ

Quantify Functional Similarity
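The Jaccard similarity coefficient of two sets is the size of their intersection divided by the size of their union. Below is a minimal sketch of applying it to two sets of binding keys. The class and method names are hypothetical, the keys are placeholder strings, and both the `>=` comparison against the threshold and the treatment of two empty sets as fully similar are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BindingSimilarity {
    // Jaccard coefficient: |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets count as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Assumption: a pair is accepted when the similarity reaches the threshold.
    static boolean isSimilar(Set<String> a, Set<String> b, double threshold) {
        return jaccard(a, b) >= threshold;
    }

    public static void main(String[] args) {
        Set<String> clone1 = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> clone2 = new HashSet<>(Arrays.asList("B", "C", "D"));
        // Intersection {B, C} has 2 elements, union {A, B, C, D} has 4.
        System.out.println(jaccard(clone1, clone2));
    }
}
```

In this example the two sets share 2 of 4 distinct keys, giving a coefficient of 0.5.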

24

Study Setup
• Select projects.
• Select clone detection tool.
• Investigate the results.

Evaluation

25

• 6 open-source systems from different domains, varying in size and history.

Selection of Projects

26

• Three criteria for tool selection:
  1. Able to detect clones with control structure variations.
  2. Available for download.
  3. Takes a reasonable time to detect clones.

• Tried five different clone detection tools:
  – CCFinder: not able to find semantic clones
  – JSCtracker: not able to finish the detection process
  – NiCad: returns abnormal clone groups
  – Deckard: not able to finish detection
  – Sebyte: works well for our experiment

Selection of Detection Tool

27

• Trade-off between precision and recall
• Identified 285 true positives (TP) and 475 false positives (FP)

Best Threshold

28

A threshold value of 0.5 achieved a precision of 0.64 and a recall of 0.91.

Best Threshold

29

On average, 8.8 milliseconds per clone pair

Execution Time

30

Q1: Which variation occurs most frequently?

Q2: Does the evolution of a programming language affect the introduction of control structure variant clones?

Case Study

31

• The 6 different loop types form 15 pairwise combinations, 7 of which have instances.

Case Study
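The count of 15 follows from choosing unordered pairs among 6 loop forms. Assuming the 6 forms are the ones obtained by counting the for and while versions separately (enhanced for, iterator-based for, iterator-based while, index-based for, index-based while, do-while), the arithmetic is:

```latex
\binom{6}{2} = \frac{6 \cdot 5}{2} = 15
```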

32

Fact: The largest category is enhanced for loop vs. iterator-based while loop, which has 109 instances.

Answer to Q1: The enhanced for loop and the iterator-based while loop appear most often.

Case Study

33

Fact: The enhanced for loop is involved in all of the top 3 categories, which together contain 209 clone pairs and account for 73% of the clone pairs.

Answer to Q2: The enhanced for loop, introduced in Java 5, significantly affects the introduction of control structure variant clones.

Case Study

34

State-of-the-art refactoring tool: JDeodorant

Clone Refactoring Evaluation

35

Initialization of arrays from collections

Variations Hindering Refactoring

36

Clone 1

Clone 2

Temporary variables

Variations Hindering Refactoring

37

Clone 1

Clone 2

Exchange of method invocation expressions

Variations Hindering Refactoring

38

Clone 1

Clone 2

A B

B A

Alternative branching statements

Variations Hindering Refactoring

39

Clone 1

Clone 2

Conclusion

• Control structure variant clones do exist in systems

• They are introduced because the language evolves, e.g., through new features such as the enhanced for loop

• 42% of the clones we found are refactorable

40

• Improve the approach to convert one data structure to another to refactor an additional 19% of the control structure variant clones.

Future Work

41

• Develop code to unify different control structures and perform the refactoring.

Thanks!

42

Visit our benchmark of control structure variant clones at http://users.encs.concordia.ca/~nikolaos/IWSC_2015/