phd dissertation

Post on 06-Jul-2015

Patricia Deshane

Ph.D. Defense

April 14, 2010

Introduction

Perspectives on copy-and-paste and code cloning

Problems of cloning and possible solutions

Dimensions for Tool Development

How related clone tracking tools define clone

properties and provide clone lifecycle support

Evaluation

Prevalence of clones, renaming, and errors

User study on clone visualization and renaming

User study on clone comparison (forthcoming)

Conclusion

Copy and paste – love it or hate it?

Short-term benefits

Copy/paste variable, type, or method names

Save typing

Remember a name’s spelling

Copy/paste blocks, methods, or classes

A similar solution exists…so why write from scratch?

Learn from past projects and examples

Unlike plagiarism, this is software reuse!

Libraries & frameworks are designed well for reuse

But, the resulting clones need to be modified

(software maintenance … software quality)

Clones as a software maintenance problem

Clone location & relationship forgotten over time

Clones as a software evolution problem

Clones also naturally evolve over time

Long-term or short-term…still need maintenance

Clones as a software quality problem

Software bugs and inconsistencies

(can be made for various reasons)

The addition of a new feature – apply update to all?

A bug is propagated & fixed – can become a new bug!

A clone is modified to fit its task – inconsistent rename

A single clone (parameterized clone) is modified,

usually only identifier names and literal constants

(numerical, character, boolean, or string values)

Clones as an aesthetic or design problem

How does the source code look and smell?

Look

Code clones – an artificial increase in # of LOC

(duplication adds “unnecessary” lines of code)

Clones can make code more complex, less readable

Smell

Code smell – a hint that something could be wrong

(abstraction should be used whenever possible)

Design decision:

Create abstractions from the beginning, later on, or

not at all? If not from the beginning, cloning occurs

Clone Detection

Algorithms and tools to detect code clones (exact

duplicates & “near-miss clones”) in pre-existing,

legacy source code

Retroactive – clone detection gives false positives

& false negatives…humans need to verify results

Clone Prevention?

Clone detection in real-time, disable copy/paste?

Clone Removal

Remove clones from system ASAP (refactoring)

When? Once and Only Once vs. Rule of Three

Clones can be reasonable, beneficial, or

necessary

Clones can keep code clean & understandable

(GUI code, procedure with too many parameters)

Programming language can have limitations

(lack of expressiveness, no abstraction support)

Clones should be kept in the source code

Is it worth refactoring? Clone genealogies study:

Short-lived clones may diverge soon

Long-lived clones are due to shortcomings of language

Making changes to clones is risky for companies

Clones exist (the cloning problems do, too)

May not be desirable or possible to refactor

the clones; they need to be managed instead

Tool support is needed for all stages of the

clone lifecycle from clone creation to clone

extinction (which includes clone editing)

CnP – suite of Eclipse plug-ins for proactive

copy-and-paste-induced clone management

CReN - consistent renaming of identifiers

LexId – inferring lexical patterns in identifiers

CSeR – code segment reuse

CnP Clone Visualization

CSeR Diff-Visualization

Clone tracking tools with a focus on editing

[CnP, Clonescape, CPC] – proactive

[Codelink, LAPIS, CloneTracker] - retroactive

Definitions of clone properties

Clone similarity

Clone model

Clone visualization

Clone persistence

Clone documentation and clone attributes

Clone lifecycle support

4 lifecycle stages: clone creation, clone capture,

clone editing, and clone extinction

Defining clones (similarity when captured)

Retroactive clone tracking tools rely on clone

detection tools or programmer’s selection of

clones – can yield inaccurate clones

Proactive clone tracking tools (including CnP)

capture copy/paste – 100% accurate, identical

Managing clones (similarity when edited)

Some corresponding code between related clones

(identifiers, substrings, fields/methods, etc.)

Longest common subsequence (LCS) algorithm

Levenshtein distance (LD) - the edit distance
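
The two similarity measures named above can be sketched as follows. This is a minimal illustration of the textbook algorithms only; the slides do not show how any of the tools apply them, and the class and method names here are hypothetical.

```java
// Illustrative sketch: standard dynamic-programming versions of LCS length
// and Levenshtein distance. Names are hypothetical, not taken from any tool.
final class CloneSimilarity {

    /** Length of the longest common subsequence of a and b. */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For two identifiers that differ in a single character, such as rSlider and bSlider from the user study, the edit distance is 1 and the LCS covers all the remaining characters.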

Clone location

Character offset and length in a file

Copied and pasted source code is mapped to the

largest contiguous set of abstract syntax tree (AST)

nodes within the range

Copied and pasted source code that is only partially

contained within an AST node is not captured

File name plus line range

Clone region descriptor (CRD)

Tells of the clone’s relative location using syntactic,

structural, and lexical information

(for example, the clone’s alignment with code blocks)
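
The offset-and-length location and the "fully contained nodes only" rule can be sketched as below. This is a simplified stand-in that treats AST nodes as plain character ranges; it does not use the Eclipse AST API, and the names RangeCapture, Range, and capture are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: AST nodes are modelled as character ranges only.
final class RangeCapture {

    /** A source range: starting character offset plus length in characters. */
    static final class Range {
        final int offset;
        final int length;

        Range(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        int end() {
            return offset + length;
        }

        /** True when other lies entirely inside this range. */
        boolean contains(Range other) {
            return offset <= other.offset && other.end() <= end();
        }
    }

    /**
     * Returns the sibling nodes that lie entirely inside the selection.
     * Nodes that are only partially covered are not captured. Assumes the
     * siblings are given in source order, so the result is contiguous.
     */
    static List<Range> capture(Range selection, List<Range> siblingNodes) {
        List<Range> captured = new ArrayList<>();
        for (Range node : siblingNodes) {
            if (selection.contains(node)) {
                captured.add(node);
            }
        }
        return captured;
    }
}
```

With sibling statements at offsets 0 to 10, 10 to 25, and 25 to 30, a selection covering 5 to 30 captures only the second and third statements, because the first is cut by the selection boundary.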

Clone relationship

Clone group – related clones are viewed at the

same level of group membership symmetrically

(also called: region set, clone class, etc.)

Knowing the origin can be useful for clone comparison

and clone visualization (separate from the model)

Clone family – distinguishes between the original

code (the parent) and the duplicated copies

(children, which are siblings to each other)
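
The difference between the two relationship models can be sketched as data structures. This is an assumption-laden illustration, not the model of any particular tool; clones are identified by plain strings and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the symmetric (group) and asymmetric (family) models.
final class CloneModels {

    /** Symmetric model: every member is an equal peer. */
    static final class CloneGroup {
        final List<String> members = new ArrayList<>();

        void add(String cloneId) {
            members.add(cloneId);
        }
    }

    /** Asymmetric model: remembers which clone was the copy source. */
    static final class CloneFamily {
        final String parent;
        final List<String> children = new ArrayList<>();

        CloneFamily(String parent) {
            this.parent = parent;
        }

        void addPaste(String cloneId) {
            children.add(cloneId);
        }

        boolean isOrigin(String cloneId) {
            return parent.equals(cloneId);
        }

        /** The siblings of a child are all the other children. */
        List<String> siblingsOf(String child) {
            List<String> siblings = new ArrayList<>(children);
            siblings.remove(child);
            return siblings;
        }

        /** Flattening to a group is always possible but discards the origin. */
        CloneGroup toGroup() {
            CloneGroup group = new CloneGroup();
            group.add(parent);
            for (String child : children) {
                group.add(child);
            }
            return group;
        }
    }
}
```

A family can be flattened into a group, but a group cannot be turned back into a family without extra information, which is why the origin has to be recorded separately when the model is a clone group.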

Clone groups (related clones)

Clone group #1

Clone group #2

Markers – colored bars and highlights

CnP clone visualization – shows clone locations,

clone groups, clone origin and subsequent pastes

CSeR diff-visualization - highlights user edits

Warnings – error prevention or detection

CnP – warnings about external identifier scoping

Alerts – clone modification notification

Alert the programmer when clones are edited

Views and graphs

Views – show lists of clones, clone groups, etc.

Graphs – can be complicated to understand

Pasted code

Original code

CnP Clone Visualization

CSeR Diff-Visualization

A flat database (text file)

CnP - stores each clone’s location (file name,

clone’s starting character position & length in #

of characters) within each clone group

CReN - stores each identifier’s location

(identifier’s starting character position & length

in # of characters) within each identifier group

CSeR - stores character positions & change info.

XML file, SQL database, file meta-data

Can store clone information for tagged clones,

including links between clones, copy/paste

activity, and clone modification history
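
A flat text store of clone locations could look like the sketch below. The slides only say that the file name, starting character position, and length are stored per clone within each clone group; the tab-separated line format, the numeric group id, and all names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: one clone per line, written as
// groupId <TAB> fileName <TAB> offset <TAB> length (an assumed format).
final class FlatCloneStore {

    static final class CloneRecord {
        final int groupId;
        final String fileName;
        final int offset;
        final int length;

        CloneRecord(int groupId, String fileName, int offset, int length) {
            this.groupId = groupId;
            this.fileName = fileName;
            this.offset = offset;
            this.length = length;
        }

        String toLine() {
            return groupId + "\t" + fileName + "\t" + offset + "\t" + length;
        }

        static CloneRecord fromLine(String line) {
            String[] parts = line.split("\t");
            if (parts.length != 4) {
                throw new IllegalArgumentException("Expected 4 tab-separated fields: " + line);
            }
            return new CloneRecord(Integer.parseInt(parts[0]), parts[1],
                                   Integer.parseInt(parts[2]), Integer.parseInt(parts[3]));
        }
    }

    static String serialize(List<CloneRecord> records) {
        StringBuilder sb = new StringBuilder();
        for (CloneRecord r : records) {
            sb.append(r.toLine()).append('\n');
        }
        return sb.toString();
    }

    static List<CloneRecord> parse(String text) {
        List<CloneRecord> records = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                records.add(CloneRecord.fromLine(line));
            }
        }
        return records;
    }
}
```

Storing raw character positions keeps the file simple, but the positions must be updated whenever text is inserted or deleted before a clone in the same file.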

Additional information about clones provided

by the programmer

The reason why the code was duplicated

Only one form of clone classification

Whether the clone should be removed from the

system (clone severity)

CnP does not have this feature (yet)

How were the clones created?

Copy/paste

Other – manual typing, cut/paste/paste,

automatic code generation, etc.

Why were the clones created?

Intentional clones – code that the programmer

intended to reuse

Accidental clones – code that is similar due to a

protocol requirement

Tracking copy-and-paste actions (proactive)

Detects the creation of new clones that are made

via copying and pasting

Listens to document activity in Eclipse’s Java

editor & links copied and pasted code when identical

CnP tracks only “significant” clones that contain:

More than two statements, or

At least one conditional statement, loop statement, or

method, or

A type definition (class or interface)

Other tools’ policies:

At least 30 tokens, specified minimum clone length
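
The "significant clone" rule listed above for CnP can be written as a single predicate. This is a sketch under the assumption that a summary of the pasted fragment is already available; FragmentSummary and its fields are hypothetical and do not come from the CnP source.

```java
// Illustrative sketch of the significance policy stated on the slide.
final class SignificancePolicy {

    /** Hypothetical summary of what a pasted fragment contains. */
    static final class FragmentSummary {
        final int statementCount;
        final boolean hasConditional;
        final boolean hasLoop;
        final boolean hasMethod;
        final boolean hasTypeDefinition; // class or interface

        FragmentSummary(int statementCount, boolean hasConditional, boolean hasLoop,
                        boolean hasMethod, boolean hasTypeDefinition) {
            this.statementCount = statementCount;
            this.hasConditional = hasConditional;
            this.hasLoop = hasLoop;
            this.hasMethod = hasMethod;
            this.hasTypeDefinition = hasTypeDefinition;
        }
    }

    /** Tracked when the fragment meets any one of the listed criteria. */
    static boolean isSignificant(FragmentSummary f) {
        return f.statementCount > 2
            || f.hasConditional
            || f.hasLoop
            || f.hasMethod
            || f.hasTypeDefinition;
    }
}
```

Under this rule a pasted fragment of two plain statements is ignored, while a single pasted if statement is tracked.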

Importing from clone detection tools

(retroactive)

Complements proactive clone tracking

Clone detection tool results are listed,

programmer selects which of the reported clone

groups to import, start tracking these clones

Programmer selection may not be required

Selecting clones (retroactive)

Clones are just selected manually by programmer

Programmer needs to know which clones to

select and where they are in the system

Inter-clone editing (between clones)

Same physical change is needed between all

related clones such as a new feature or bug fix

In related work, this is called linked editing,

synchronous editing, and simultaneous editing

Update in one place like with an abstraction

But with inter-clone editing, clones remain in system

CnP does not have this feature (yet)

Intra-clone editing (within clones)

Only the relationship is the same between the

clones, not the physical change itself

CReN – consistent renaming

of identifiers within clones

Copied code

Pasted code

for (i = 1; i < size; i++) {
    if (array[i] < low) {
        low = array[i];
    }
}
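
The idea behind consistent renaming within a clone can be sketched as follows. CReN itself is described in this talk as AST-based; the version below is a much cruder text-based approximation using whole-word matching, and the class name, method name, and the target name high are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: rename every whole-word occurrence of an identifier
// inside one clone's text. A real tool would use bindings from the AST so
// that unrelated identifiers with the same spelling are left alone.
final class ConsistentRename {

    static String renameInClone(String cloneText, String oldName, String newName) {
        // \b keeps "low" from matching inside longer names such as "lower".
        Pattern wholeIdentifier = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return wholeIdentifier.matcher(cloneText)
                              .replaceAll(Matcher.quoteReplacement(newName));
    }
}
```

Applied to the pasted loop above, renaming low to high changes both the comparison and the assignment in one step, which avoids the inconsistent-rename errors described earlier.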

Intra-clone editing (within clones)

Only the relationship is the same between the

clones, not the physical change itself

LexId – consistent renaming

of substrings within clones

Copied code

Pasted code
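
Substring-level renaming can be sketched by inferring which part of an identifier changed and applying the same change to other identifiers. The slides do not describe LexId's actual inference, so the approach below (strip the common prefix and suffix of one old/new pair, then substitute at the same position) is an assumption, and all names are hypothetical.

```java
// Illustrative sketch: infer a substring change from one rename and apply it
// to other identifiers that carry the same substring at the same position.
final class LexicalPattern {
    final int position;   // index where the changed substring starts
    final String oldPart; // substring that was replaced
    final String newPart; // substring that replaced it

    private LexicalPattern(int position, String oldPart, String newPart) {
        this.position = position;
        this.oldPart = oldPart;
        this.newPart = newPart;
    }

    static LexicalPattern infer(String oldName, String newName) {
        int max = Math.min(oldName.length(), newName.length());
        int prefix = 0;
        while (prefix < max && oldName.charAt(prefix) == newName.charAt(prefix)) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < max - prefix
                && oldName.charAt(oldName.length() - 1 - suffix)
                   == newName.charAt(newName.length() - 1 - suffix)) {
            suffix++;
        }
        return new LexicalPattern(prefix,
                oldName.substring(prefix, oldName.length() - suffix),
                newName.substring(prefix, newName.length() - suffix));
    }

    /** Returns the identifier unchanged when the pattern does not apply. */
    String apply(String identifier) {
        if (oldPart.isEmpty() || !identifier.startsWith(oldPart, position)) {
            return identifier;
        }
        return identifier.substring(0, position) + newPart
             + identifier.substring(position + oldPart.length());
    }
}
```

Using the identifiers from the user study, inferring the pattern from rPanel to gPanel and applying it to rSlider yields gSlider, while an identifier such as colorPanel is left unchanged.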

Refactoring

Actually a form of clone editing

CnP does not have this feature (yet)

Clone group #1

Clone group #2

Abstraction #2

Abstraction #1

Clone divergence (loss of similarity)

Clones may naturally separate from one another

But if copied and pasted, likely to retain similarity

Unlike with refactoring, with clone divergence

the cloning relationship is removed

Tools allow the programmer to remove a

clone from a clone group (for tracking)

Programmer has full control over the clones that

are considered related (similar) to one another

Clones can be “linked” and “unlinked”

(for inter-clone editing)

There is significant code reuse in commercial

and open source software

Clone detection tools find clones during tests

Case study with CCFinderX and SimScan

clone detection tools on SCL and Eclipse JDT

UI plug-in source code

For SCL, SimScan found 102 clone groups, 70 of

which were intentional and useful clone groups

50 out of the 70 intentional, useful clone groups

consisted of clones that were likely copy/pasted

These 50 groups could have been supported with CnP

Most (65-67%) copied-and-pasted code

fragments require renaming at least one

identifier [CP-Miner]

Difficult to tell retroactively whether code

was actually copy/pasted or renamed

Some tools look at the correspondence

between identifiers over software versions

(in version control systems), which can be

used to determine renaming inconsistencies

[Clever, Vaci]

[CP-Miner]

[Bug Isolation]

[DECKARD-based]

Subject Characteristics

14 male subjects - 8 undergraduate, 6 graduate

students from Clarkson MCS and ECE departments

Knowledge of Java/Swing required, IDE optional

Study Procedure

Subjects came one at a time to a user study lab

Background about the problems of copy/paste

clones and the three features were presented

The source code and graphical Paint program

that were used for the tasks were shown to them

Subjects were recorded with video/audio

Annotated screenshot of the Paint program

Debugging tasks

Task 1: Moving the blue slider does not change

the pixel color.

rSlider should be bSlider (on line 120)

Debugging tasks

Task 2: Moving the thickness slider does not

change the pixel thickness.

colorChangeListener should be

thicknessChangeListener (on line 142)

Modification tasks

Task 3: Add a titled border to colorPanel and to

thicknessPanel.

Task 3

Modification tasks

Task 4: Add color to the label of each color slider

- red, green, and blue.

Task 4

Renaming tasks (with CReN)

Task 5: Rename colorPanel to thicknessPanel and

rPanel to tPanel within the clone.

Renaming tasks (with CReN)

Task 6: Rename toolPanel to clearUndoPanel,

pencilButton to clearButton, and eraserButton to

undoButton within the clone.

Renaming tasks (with LexId)

Task 7: Rename rPanel to gPanel and rSlider to

gSlider in the green slider clone (shown), and

rPanel to bPanel and rSlider to bSlider in the blue

slider clone.

Renaming tasks (with LexId)

Task 8: Rename bPanel to tPanel and bSlider to

tSlider in the thickness slider clone.

Results – Time per Task

The time (in minutes) to complete each pair of tasks.

Results – Time per Task

Statistical hypothesis testing on the paired time data.

Results – Solution Correctness

Correct states when running the program or when finished.

Results – Method of Completion

Number of subjects who used each location and inspection

method for debugging and modification tasks.

Results – Method of Completion

Number of times each renaming method was used for

renaming tasks.

Discussion

Confounding factors for clone visualization

Clone visualization is not forced on the user

Subjects would have produced correct solutions to

Task 3 if they had made use of cloning information

Varying levels of subjects’ prior experience

Threats to validity

Some subjects had more prior knowledge/experience

Tasks fairly close to real-world GUI programming tasks

Tool design

Need to further improve the clone visualization

Need to tell programmers exactly what was renamed

Research contributions

The copy-and-paste (CnP) tool

Proactive tracking

Intra-clone editing

AST-based

Dimensions of clone tracking tool development

Definition of the clone lifecycle

Realization about clone visualization

Future work

Theory about copy-and-paste and abstractions

Other applications of this research

CnP user study paper published in ICPC 2010

Papers went through a rigorous reviewing process

Only 15/76 (< 20%) accepted as a full paper

CReN paper published in ETX 2007

This is a young topic that is getting recognition

Cited 7 times according to ACM Digital Library

2 additional citations are reported on CiteSeerX

39 downloads in the last year (from ACM website)

4 downloads in the last week (from ACM website)

The position of the source code characters as represented in

an ASTNode.

The three cases when capturing a range of source

code using the Eclipse AST API.

Identifier Matching

Identifier Partitioning
