Patricia Deshane Ph.D. Defense April 14, 2010

Page 1: PhD Dissertation

Patricia Deshane

Ph.D. Defense

April 14, 2010

Page 2: PhD Dissertation

Introduction

Perspectives on copy-and-paste and code cloning

Problems of cloning and possible solutions

Dimensions for Tool Development

How related clone tracking tools define clone properties and provide clone lifecycle support

Evaluation

Prevalence of clones, renaming, and errors

User study on clone visualization and renaming

User study on clone comparison (forthcoming)

Conclusion

Page 4: PhD Dissertation

Copy and paste – love it or hate it?

Short-term benefits

Copy/paste variable, type, or method names

Save typing

Remember a name’s spelling

Copy/paste blocks, methods, or classes

A similar solution exists…so why write from scratch?

Learn from past projects and examples

Different from plagiarism: this is software reuse!

Libraries & frameworks are designed well for reuse

But, the resulting clones need to be modified

(software maintenance … software quality)

Page 5: PhD Dissertation

Clones as a software maintenance problem

Clone location & relationship forgotten over time

Page 7: PhD Dissertation

Clones as a software evolution problem

Clones also naturally evolve over time

Long-term or short-term…still need maintenance

Clones as a software quality problem

Software bugs and inconsistencies

(can be made for various reasons)

The addition of a new feature – apply update to all?

A bug is propagated & fixed – can become a new bug!

A clone is modified to fit its task – inconsistent rename

A single clone (parameterized clone) is modified, usually only identifier names and literal constants (numerical, character, boolean, or string values)

Page 8: PhD Dissertation

Clones as an aesthetic or design problem

How does the source code look and smell?

Look

Code clones – an artificial increase in # of LOC

(duplication adds “unnecessary” lines of code)

Clones can make code more complex, less readable

Smell

Code smell – a hint that something could be wrong

(abstraction should be used whenever possible)

Design decision:

Create abstractions from the beginning, later on, or not at all? If not from the beginning, cloning is done

Page 9: PhD Dissertation

Clone Detection

Algorithms and tools to detect code clones (exact duplicates & “near-miss clones”) in pre-existing, legacy source code

Retroactive – clone detection gives false positives & false negatives…humans need to verify results

Clone Prevention?

Clone detection in real-time, disable copy/paste?

Clone Removal

Remove clones from system ASAP (refactoring)

When? Once and Only Once vs. Rule of Three

Page 10: PhD Dissertation

Clones can be reasonable, beneficial, or necessary

Clones can keep code clean & understandable

(GUI code, procedure with too many parameters)

Programming language can have limitations

(lack of expressiveness, no abstraction support)

Clones should be kept in the source code

Is it worth refactoring? Clone genealogies study:

Short-lived clones may diverge soon

Long-living clones are due to shortcomings of language

Making changes to clones is risky for companies

Page 11: PhD Dissertation

Clones exist (the cloning problems do, too)

It may not be desirable or possible to refactor the clones; they need to be managed instead

Tool support is needed for all stages of the clone lifecycle from clone creation to clone extinction (which includes clone editing)

CnP – suite of Eclipse plug-ins for proactive copy-and-paste-induced clone management

CReN – consistent renaming of identifiers

LexId – inferring lexical patterns in identifiers

CSeR – code segment reuse

[Figures: CnP Clone Visualization; CSeR Diff-Visualization]

Page 12: PhD Dissertation

Clone tracking tools with a focus on editing

[CnP, Clonescape, CPC] – proactive

[Codelink, LAPIS, CloneTracker] - retroactive

Definitions of clone properties

Clone similarity

Clone model

Clone visualization

Clone persistence

Clone documentation and clone attributes

Clone lifecycle support

4 lifecycle stages: clone creation, clone capture, clone editing, and clone extinction

Page 15: PhD Dissertation

Defining clones (similarity when captured)

Retroactive clone tracking tools rely on clone detection tools or programmer’s selection of clones – can yield inaccurate clones

Proactive clone tracking tools (including CnP) capture copy/paste – 100% accurate, identical

Managing clones (similarity when edited)

Some corresponding code between related clones

(identifiers, substrings, fields/methods, etc.)

Longest-common subsequence (LCS) algorithm

Levenshtein distance (LD) - the edit distance
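
The two similarity measures named above can be sketched as follows. This is a minimal illustration using the textbook dynamic-programming formulations on whole strings; the class and method names are hypothetical, and the slides do not say how any particular tool applies these measures internally.

```java
// Hypothetical helper class; not taken from CnP or any tool named on the slides.
final class CloneSimilarity {

    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Length of the longest common subsequence (LCS) of a and b.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }
}
```

For two identifiers that differ in one character, such as rSlider and bSlider, the edit distance is 1 and the LCS is the shared substring Slider.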

Page 16: PhD Dissertation

Clone location

Character offset and length in a file

Copied and pasted source code is mapped to the largest contiguous set of abstract syntax tree (AST) nodes within the range

Copied and pasted source code that is only partially contained within an AST node is not captured

File name plus line range

Clone region descriptor (CRD)

Tells of the clone’s relative location using syntactic, structural, and lexical information (for example, the clone’s alignment with code blocks)
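
The offset-and-length rule above (keep nodes that lie fully inside the copied range, drop nodes that are only partially covered) can be sketched as below. This is an illustration only: Span is a hypothetical stand-in for a syntax node's source position and is not the Eclipse AST API, and the sibling list is assumed to be given in source order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Span stands in for a syntax node's position in the file.
final class RangeCapture {

    static final class Span {
        final int offset; // starting character position in the file
        final int length; // length in characters

        Span(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        int end() {
            return offset + length; // exclusive end position
        }
    }

    // Returns the sibling nodes that lie completely inside the selection
    // [selOffset, selOffset + selLength). Partially covered nodes are dropped.
    static List<Span> fullyContained(List<Span> siblings, int selOffset, int selLength) {
        int selEnd = selOffset + selLength;
        List<Span> captured = new ArrayList<>();
        for (Span node : siblings) {
            if (node.offset >= selOffset && node.end() <= selEnd) {
                captured.add(node);
            }
        }
        return captured;
    }
}
```

With sibling nodes at (0, 10), (12, 8), and (22, 15) and a selection starting at offset 5 with length 20, only the node at offset 12 is captured; the other two are only partially covered.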

Page 17: PhD Dissertation

Clone relationship

Clone group – related clones are viewed at the same level of group membership symmetrically (also called: region set, clone class, etc.)

Knowing the origin can be useful for clone comparison and clone visualization (separate from the model)

Clone family – distinguishes between the original code (the parent) and the duplicated copies (children, which are siblings to each other)

Page 18: PhD Dissertation

Clone groups (related clones)

Clone group #1

Clone group #2

Page 19: PhD Dissertation

Markers – colored bars and highlights

CnP clone visualization – shows clone locations, clone groups, clone origin and subsequent pastes

CSeR diff-visualization – highlights user edits

Warnings – error prevention or detection

CnP – warnings about external identifier scoping

Alerts – clone modification notification

Alert the programmer when clones are edited

Views and graphs

Views – show lists of clones, clone groups, etc.

Graphs – can be complicated to understand

Page 20: PhD Dissertation

Pasted code

Original code

CnP Clone Visualization

Page 21: PhD Dissertation

CSeR Diff-Visualization

Page 22: PhD Dissertation

A flat database (text file)

CnP – stores each clone’s location (file name, clone’s starting character position & length in # of characters) within each clone group

CReN – stores each identifier’s location (identifier’s starting character position & length in # of characters) within each identifier group

CSeR – stores character positions & change info.

XML file, SQL database, file meta-data

Can store clone information for tagged clones, including links between clones, copy/paste activity, and clone modification history
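
As an illustration of the flat text-file option, the sketch below serializes one clone location (file name, starting character position, length) together with its clone group. The tab-separated layout, the integer group id, and the class name are assumptions made for this example; the slides do not specify CnP's actual file format.

```java
// Hypothetical record layout: groupId <TAB> fileName <TAB> offset <TAB> length
final class CloneRecord {
    final int groupId;     // assumed numeric id of the clone group
    final String fileName; // file that contains the clone
    final int offset;      // starting character position
    final int length;      // length in number of characters

    CloneRecord(int groupId, String fileName, int offset, int length) {
        this.groupId = groupId;
        this.fileName = fileName;
        this.offset = offset;
        this.length = length;
    }

    // One clone per line of the text file.
    String toLine() {
        return groupId + "\t" + fileName + "\t" + offset + "\t" + length;
    }

    static CloneRecord parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
        }
        return new CloneRecord(
                Integer.parseInt(fields[0]),
                fields[1],
                Integer.parseInt(fields[2]),
                Integer.parseInt(fields[3]));
    }
}
```

Character offsets stored this way go stale when the file is edited above the clone, so a tool using such a format would have to update the records as the document changes.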

Page 23: PhD Dissertation

Additional information about clones provided by the programmer

The reason why the code was duplicated

Only one form of clone classification

Whether the clone should be removed from the system (clone severity)

CnP does not have this feature (yet)

Page 25: PhD Dissertation

How were the clones created?

Copy/paste

Other – manual typing, cut/paste/paste, automatic code generation, etc.

Why were the clones created?

Intentional clones – code that the programmer intended to reuse

Accidental clones – code that is similar due to a protocol requirement

Page 26: PhD Dissertation

Tracking copy-and-paste actions (proactive)

Detects the creation of new clones that are made via copying and pasting

Listens to document activity in Eclipse’s Java editor & makes correspondences when identical

CnP tracks only “significant” clones that contain:

More than two statements, or

At least one conditional statement, loop statement, or method, or

A type definition (class or interface)

Other tools’ policies:

At least 30 tokens, specified minimum clone length
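
The “significant clone” rule listed above can be written as a single predicate. This is a sketch: the three inputs are assumed to have been computed already from the pasted code's syntax tree, and the class and parameter names are hypothetical.

```java
// Hypothetical encoding of the tracking policy stated on the slide.
final class SignificancePolicy {

    // A pasted fragment is tracked if it contains more than two statements,
    // or at least one conditional statement, loop statement, or method,
    // or a type definition (class or interface).
    static boolean isSignificant(int statementCount,
                                 boolean hasConditionalLoopOrMethod,
                                 boolean hasTypeDefinition) {
        return statementCount > 2 || hasConditionalLoopOrMethod || hasTypeDefinition;
    }
}
```

Under this rule a pasted pair of assignments is ignored, while a pasted loop or a block of three statements is tracked.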

Page 27: PhD Dissertation

Importing from clone detection tools (retroactive)

Complements proactive clone tracking

Clone detection tool results are listed, programmer selects which of the reported clone groups to import, start tracking these clones

Programmer selection may not be required

Selecting clones (retroactive)

Clones are just selected manually by programmer

Programmer needs to know which clones to select and where they are in the system

Page 28: PhD Dissertation

Inter-clone editing (between clones)

Same physical change is needed between all related clones such as a new feature or bug fix

In related work, this is called linked editing, synchronous editing, and simultaneous editing

Update in one place like with an abstraction

But with inter-clone editing, clones remain in system

CnP does not have this feature (yet)

Page 29: PhD Dissertation

Intra-clone editing (within clones)

Only the relationship is the same between the clones, not the physical change itself

CReN – consistent renaming of identifiers within clones

[Figure: copied code and pasted code]

Page 30: PhD Dissertation

for(i = 1; i < size; i++)
{
    if(array[i] < low) {
        low = array[i];
    }
}
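
Consistent renaming within one clone, applied to a loop like the one above, can be approximated as below. This is a deliberately simplified sketch: it matches whole words with a regular expression, so unlike a syntax-tree-based tool it would also rename matches inside comments and string literals. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical text-based approximation of consistent renaming within a clone.
final class ConsistentRename {

    // Renames every whole-word occurrence of oldName inside the clone's text.
    static String renameInClone(String cloneText, String oldName, String newName) {
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return wholeWord.matcher(cloneText).replaceAll(Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        String clone = "if(array[i] < low) { low = array[i]; }";
        System.out.println(renameInClone(clone, "low", "high"));
    }
}
```

Renaming low to high here changes both occurrences at once, which is the kind of edit where a manual rename can easily miss one instance.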

Page 31: PhD Dissertation

Intra-clone editing (within clones)

Only the relationship is the same between the clones, not the physical change itself

LexId – consistent renaming of substrings within clones

[Figure: copied code and pasted code]
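
The idea of renaming a substring consistently across identifiers can be illustrated as follows. This sketch is not LexId's inference algorithm, which the slides do not describe; it assumes the pattern is simply the part of one identifier that changed (found by stripping the common prefix and suffix of the old and new names) and applies it to other identifiers that have the same substring at the same position. All names are hypothetical.

```java
// Hypothetical sketch of propagating a substring rename to another identifier.
final class SubstringRename {

    // Given one observed rename (oldId -> newId), applies the same substring
    // change to other if it contains the changed part at the same position.
    static String applyPattern(String oldId, String newId, String other) {
        int max = Math.min(oldId.length(), newId.length());
        int prefix = 0;
        while (prefix < max && oldId.charAt(prefix) == newId.charAt(prefix)) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < max - prefix
                && oldId.charAt(oldId.length() - 1 - suffix)
                   == newId.charAt(newId.length() - 1 - suffix)) {
            suffix++;
        }
        String oldPart = oldId.substring(prefix, oldId.length() - suffix);
        String newPart = newId.substring(prefix, newId.length() - suffix);
        if (oldPart.isEmpty() || !other.startsWith(oldPart, prefix)) {
            return other; // pattern does not apply to this identifier
        }
        return other.substring(0, prefix) + newPart + other.substring(prefix + oldPart.length());
    }
}
```

After the programmer renames rPanel to gPanel, this sketch would turn rSlider into gSlider and leave an unrelated identifier such as array unchanged.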

Page 33: PhD Dissertation

Refactoring

Actually a form of clone editing

CnP does not have this feature (yet)

Clone group #1

Clone group #2

Abstraction #2

Abstraction #1

Page 34: PhD Dissertation

Clone divergence (loss of similarity)

Clones may naturally separate from one another

But if copied and pasted, likely to retain similarity

Unlike with refactoring, with clone divergence the cloning relationship is removed

Tools allow the programmer to remove a clone from a clone group (for tracking)

Programmer has full control over the clones that are considered related (similar) to one another

Clones can be “linked” and “unlinked” (for inter-clone editing)

Page 36: PhD Dissertation

There is significant code reuse in commercial and open source software

Clone detection tools find clones during tests

Case study with CCFinderX and SimScan clone detection tools on SCL and Eclipse JDT UI plug-in source code

For SCL, SimScan found 102 clone groups, 70 of which were intentional and useful clone groups

50 out of the 70 intentional, useful clone groups consisted of clones that were likely copy/pasted

These 50 groups could have been supported with CnP

Page 37: PhD Dissertation

Most (65-67%) copied-and-pasted code fragments require renaming at least one identifier [CP-Miner]

Difficult to tell retroactively whether code was actually copy/pasted or renamed

Some tools look at the correspondence between identifiers over software versions (in version control systems), which can be used to determine renaming inconsistencies [Clever, Vaci]

Page 38: PhD Dissertation

[CP-Miner]

Page 39: PhD Dissertation

[Bug Isolation]

Page 40: PhD Dissertation

[DECKARD-based]

Page 42: PhD Dissertation

Subject Characteristics

14 male subjects – 8 undergraduate, 6 graduate students from Clarkson MCS and ECE departments

Knowledge of Java/Swing required, IDE optional

Study Procedure

Subjects came one at a time to a user study lab

Background about the problems of copy/paste clones and the three features was presented

The source code and graphical Paint program that were used for the tasks were shown to them

Subjects were recorded with video/audio

Page 43: PhD Dissertation

Annotated screenshot of the Paint program

Page 44: PhD Dissertation

Debugging tasks

Task 1: Moving the blue slider does not change the pixel color.

rSlider should be bSlider (on line 120)

Page 45: PhD Dissertation

Debugging tasks

Task 2: Moving the thickness slider does not change the pixel thickness.

colorChangeListener should be thicknessChangeListener (on line 142)

Page 46: PhD Dissertation

Modification tasks

Task 3: Add a titled border to colorPanel and to thicknessPanel.

Page 47: PhD Dissertation

Task 3

Page 48: PhD Dissertation

Modification tasks

Task 4: Add color to the label of each color slider – red, green, and blue.

Page 49: PhD Dissertation

Task 4

Page 50: PhD Dissertation

Renaming tasks (with CReN)

Task 5: Rename colorPanel to thicknessPanel and rPanel to tPanel within the clone.

Page 51: PhD Dissertation

Renaming tasks (with CReN)

Task 6: Rename toolPanel to clearUndoPanel, pencilButton to clearButton, and eraserButton to undoButton within the clone.

Page 52: PhD Dissertation

Renaming tasks (with LexId)

Task 7: Rename rPanel to gPanel and rSlider to gSlider in the green slider clone (shown), and rPanel to bPanel and rSlider to bSlider in the blue slider clone.

Page 53: PhD Dissertation

Renaming tasks (with LexId)

Task 8: Rename bPanel to tPanel and bSlider to tSlider in the thickness slider clone.

Page 54: PhD Dissertation

Results - Time per Task

The time (in minutes) to complete each pair of tasks.

Page 55: PhD Dissertation

Results – Time per Task

Statistical hypothesis testing on the paired time data.

Page 56: PhD Dissertation

Results – Solution Correctness

Correct states when running the program or when finished.

Page 57: PhD Dissertation

Results – Method of Completion

Number of subjects who used each location and inspection method for debugging and modification tasks.

Page 58: PhD Dissertation

Results – Method of Completion

Number of times each renaming method was used for renaming tasks.

Page 59: PhD Dissertation

Discussion

Confounding factors for clone visualization

Clone visualization is not forced on the user

Subjects would have produced correct solutions to Task 3 if they had made use of cloning information

Varying levels of subjects’ prior experience

Threats to validity

Some subjects had more prior knowledge/experience

Tasks fairly close to real-world GUI programming tasks

Tool design

Need to further improve the clone visualization

Need to tell programmers exactly what was renamed

Page 61: PhD Dissertation

Research contributions

The copy-and-paste (CnP) tool

Proactive tracking

Intra-clone editing

AST-based

Dimensions of clone tracking tool development

Definition of the clone lifecycle

Realization about clone visualization

Future work

Theory about copy-and-paste and abstractions

Other applications of this research

Page 63: PhD Dissertation

CnP user study paper published in ICPC 2010

Papers went through a rigorous reviewing process

Only 15/76 (< 20%) accepted as a full paper

CReN paper published in ETX 2007

This is a young topic that is getting recognition

Cited 7 times according to ACM Digital Library

2 additional citations are reported on CiteSeerX

39 downloads in the last year (from ACM website)

4 downloads in the last week (from ACM website)

Page 64: PhD Dissertation

The position of the source code characters as represented in an ASTNode.

Page 65: PhD Dissertation

The three cases when capturing a range of source code using the Eclipse AST API.

Page 66: PhD Dissertation

Identifier Matching

Page 67: PhD Dissertation

Identifier Partitioning
