phd dissertation
Patricia Deshane
Ph.D. Defense
April 14, 2010
Introduction
Perspectives on copy-and-paste and code cloning
Problems of cloning and possible solutions
Dimensions for Tool Development
How related clone tracking tools define clone
properties and provide clone lifecycle support
Evaluation
Prevalence of clones, renaming, and errors
User study on clone visualization and renaming
User study on clone comparison (forthcoming)
Conclusion
Copy and paste – love it or hate it?
Short-term benefits
Copy/paste variable, type, or method names
Save typing
Remember a name’s spelling
Copy/paste blocks, methods, or classes
A similar solution exists…so why write from scratch?
Learn from past projects and examples
Different from plagiarism: this is software reuse!
Libraries & frameworks are designed well for reuse
But, the resulting clones need to be modified
(software maintenance … software quality)
Clones as a software maintenance problem
Clone location & relationship forgotten over time
Clones as a software evolution problem
Clones also naturally evolve over time
Long-term or short-term…still need maintenance
Clones as a software quality problem
Software bugs and inconsistencies
(can be made for various reasons)
The addition of a new feature – apply update to all?
A bug is propagated & fixed – can become a new bug!
A clone is modified to fit its task – inconsistent rename
A single clone (parameterized clone) is modified,
usually only in its identifier names and literal constants
(numerical, character, boolean, or string values)
Clones as an aesthetic or design problem
How does the source code look and smell?
Look
Code clones – an artificial increase in # of LOC
(duplication adds “unnecessary” lines of code)
Clones can make code more complex, less readable
Smell
Code smell – a hint that something could be wrong
(abstraction should be used whenever possible)
Design decision:
Create abstractions from the beginning, later on, or
not at all? If not from the beginning, cloning is done
Clone Detection
Algorithms and tools to detect code clones (exact
duplicates & “near-miss clones”) in pre-existing,
legacy source code
Retroactive – clone detection gives false positives
& false negatives…humans need to verify results
Clone Prevention?
Clone detection in real-time, disable copy/paste?
Clone Removal
Remove clones from system ASAP (refactoring)
When? Once and Only Once vs. Rule of Three
Clones can be reasonable, beneficial, or
necessary
Clones can keep code clean & understandable
(GUI code, procedure with too many parameters)
Programming language can have limitations
(lack of expressiveness, no abstraction support)
Clones should be kept in the source code
Is it worth refactoring? Clone genealogies study:
Short-lived clones may diverge soon
Long-living clones are due to shortcomings of language
Making changes to clones is risky for companies
Clones exist (the cloning problems do, too)
It may not be desirable or possible to refactor
the clones; they need to be managed instead
Tool support is needed for all stages of the
clone lifecycle from clone creation to clone
extinction (which includes clone editing)
CnP – suite of Eclipse plug-ins for proactive
copy-and-paste-induced clone management
CReN - consistent renaming of identifiers
LexId – inferring lexical patterns in identifiers
CSeR – code segment reuse
CnP Clone Visualization
CSeR Diff-Visualization
Clone tracking tools with a focus on editing
[CnP, Clonescape, CPC] – proactive
[Codelink, LAPIS, CloneTracker] - retroactive
Definitions of clone properties
Clone similarity
Clone model
Clone visualization
Clone persistence
Clone documentation and clone attributes
Clone lifecycle support
4 lifecycle stages: clone creation, clone capture,
clone editing, and clone extinction
Defining clones (similarity when captured)
Retroactive clone tracking tools rely on clone
detection tools or programmer’s selection of
clones – can yield inaccurate clones
Proactive clone tracking tools (including CnP)
capture copy/paste – 100% accurate, identical
Managing clones (similarity when edited)
Some corresponding code between related clones
(identifiers, substrings, fields/methods, etc.)
Longest common subsequence (LCS) algorithm
Levenshtein distance (LD) – the edit distance
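As a rough illustration of the two similarity measures named above, the following Java sketch computes the LCS length and the Levenshtein distance between two strings using the textbook dynamic-programming formulations. The class and method names are hypothetical and are not taken from any of the tools discussed.

```java
// Hypothetical helper for illustration only; not part of CnP or any related tool.
final class CloneSimilarity {

    // Length of the longest common subsequence of a and b.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // Levenshtein distance: minimum number of single-character
    // insertions, deletions, and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            dp[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            dp[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int viaEdit = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1);
                dp[i][j] = Math.min(viaEdit, dp[i - 1][j - 1] + substitution);
            }
        }
        return dp[a.length()][b.length()];
    }
}
```

For two identifiers that differ in one character, such as rSlider and bSlider, the edit distance is 1 and the common subsequence is "Slider".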
Clone location
Character offset and length in a file
Copied and pasted source code is mapped to the
largest contiguous set of abstract syntax tree (AST)
nodes within the range
Copied and pasted source code that is only partially
contained within an AST node is not captured
File name plus line range
Clone region descriptor (CRD)
Describes the clone’s relative location using syntactic,
structural, and lexical information
(for example, the clone’s alignment with code blocks)
Clone relationship
Clone group – related clones are viewed at the
same level of group membership symmetrically
(also called: region set, clone class, etc.)
Knowing the origin can be useful for clone comparison
and clone visualization (separate from the model)
Clone family – distinguishes between the original
code (the parent) and the duplicated copies
(children, which are siblings to each other)
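The difference between the two relationship models above can be sketched in Java as follows. This is a minimal sketch with hypothetical types (CloneGroup, CloneFamily) in which a clone is identified by a plain string; it is not the data model of any particular tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration; a clone is identified by a plain string here.

// Clone group: all members sit at the same level, with no recorded origin.
final class CloneGroup {
    private final List<String> members = new ArrayList<>();

    void add(String clone) {
        members.add(clone);
    }

    List<String> members() {
        return Collections.unmodifiableList(members);
    }
}

// Clone family: remembers which clone is the original (parent)
// and which clones were pasted from it (children).
final class CloneFamily {
    private final String parent;
    private final List<String> children = new ArrayList<>();

    CloneFamily(String parent) {
        this.parent = parent;
    }

    void addChild(String pastedClone) {
        children.add(pastedClone);
    }

    String parent() {
        return parent;
    }

    List<String> children() {
        return Collections.unmodifiableList(children);
    }

    // Flattening a family into a group drops the origin information.
    CloneGroup toGroup() {
        CloneGroup group = new CloneGroup();
        group.add(parent);
        for (String child : children) {
            group.add(child);
        }
        return group;
    }
}
```

A family can always be flattened into a group, but not the reverse, which is why the origin has to be captured at copy/paste time if it is wanted later.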
Clone groups (related clones)
Clone group #1
Clone group #2
Markers – colored bars and highlights
CnP clone visualization – shows clone locations,
clone groups, clone origin and subsequent pastes
CSeR diff-visualization - highlights user edits
Warnings – error prevention or detection
CnP – warnings about external identifier scoping
Alerts – clone modification notification
Alert the programmer when clones are edited
Views and graphs
Views – show lists of clones, clone groups, etc.
Graphs – can be complicated to understand
Pasted code
Original code
CnP Clone Visualization
CSeR Diff-Visualization
A flat database (text file)
CnP - stores each clone’s location (file name,
clone’s starting character position & length in #
of characters) within each clone group
CReN - stores each identifier’s location
(identifier’s starting character position & length
in # of characters) within each identifier group
CSeR - stores character positions & change info.
XML file, SQL database, file meta-data
Can store clone information for tagged clones,
including links between clones, copy/paste
activity, and clone modification history
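The flat text-file option described above can be sketched in Java as one line per clone, holding the clone group, file name, starting character position, and length. The tab-separated layout and the names CloneRecord and CloneStore are assumptions made for this example; the actual CnP file format is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record: one tracked clone inside a numbered clone group.
final class CloneRecord {
    final int groupId;
    final String fileName;
    final int offset;  // starting character position in the file
    final int length;  // length in number of characters

    CloneRecord(int groupId, String fileName, int offset, int length) {
        this.groupId = groupId;
        this.fileName = fileName;
        this.offset = offset;
        this.length = length;
    }

    // Assumed layout: groupId <TAB> fileName <TAB> offset <TAB> length
    String toLine() {
        return groupId + "\t" + fileName + "\t" + offset + "\t" + length;
    }

    static CloneRecord parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Malformed clone record: " + line);
        }
        return new CloneRecord(Integer.parseInt(fields[0]), fields[1],
                Integer.parseInt(fields[2]), Integer.parseInt(fields[3]));
    }
}

final class CloneStore {
    // Serializes all records into the text of a flat file.
    static String write(List<CloneRecord> records) {
        StringBuilder text = new StringBuilder();
        for (CloneRecord record : records) {
            text.append(record.toLine()).append('\n');
        }
        return text.toString();
    }

    // Parses the text of a flat file back into records, skipping blank lines.
    static List<CloneRecord> read(String text) {
        List<CloneRecord> records = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                records.add(CloneRecord.parse(line));
            }
        }
        return records;
    }
}
```

This sketch assumes file names contain no tab characters; a real store would need to escape them or use a structured format such as the XML or SQL options listed above.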
Additional information about clones provided
by the programmer
The reason why the code was duplicated
Only one form of clone classification
Whether the clone should be removed from the
system (clone severity)
CnP does not have this feature (yet)
How were the clones created?
Copy/paste
Other – manual typing, cut/paste/paste,
automatic code generation, etc.
Why were the clones created?
Intentional clones – code that the programmer
intended to reuse
Accidental clones – code that is similar due to a
protocol requirement
Tracking copy-and-paste actions (proactive)
Detects the creation of new clones that are made
via copying and pasting
Listens to document activity in Eclipse’s Java
editor & records a correspondence when the
pasted text is identical to the copied text
CnP tracks only “significant” clones that contain:
More than two statements, or
At least one conditional statement, loop statement, or
method, or
A type definition (class or interface)
Other tools’ policies:
At least 30 tokens, specified minimum clone length
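CnP's "significant clone" policy listed above can be sketched in Java as a simple predicate. In this sketch the copied fragment is assumed to be already summarized as a list of top-level node kinds; the NodeKind enum and the method name are hypothetical, and how the kinds are obtained from the AST is left out.

```java
import java.util.List;

// Hypothetical classification of the top-level AST nodes in a copied fragment.
enum NodeKind { STATEMENT, CONDITIONAL, LOOP, METHOD, TYPE }

final class SignificancePolicy {

    // A fragment is tracked if it contains more than two statements,
    // or at least one conditional, loop, or method,
    // or a type definition (class or interface).
    static boolean isSignificant(List<NodeKind> topLevelNodes) {
        int statements = 0;
        for (NodeKind kind : topLevelNodes) {
            switch (kind) {
                case CONDITIONAL:
                case LOOP:
                case METHOD:
                case TYPE:
                    return true;
                case STATEMENT:
                    statements++;
                    break;
            }
        }
        return statements > 2;
    }
}
```

A token-count policy such as the 30-token minimum used by other tools would replace the predicate body with a single length comparison.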
Importing from clone detection tools
(retroactive)
Complements proactive clone tracking
Clone detection tool results are listed,
programmer selects which of the reported clone
groups to import, start tracking these clones
Programmer selection may not be required
Selecting clones (retroactive)
Clones are simply selected manually by the programmer
Programmer needs to know which clones to
select and where they are in the system
Inter-clone editing (between clones)
Same physical change is needed between all
related clones such as a new feature or bug fix
In related work, this is called linked editing,
synchronous editing, and simultaneous editing
Update in one place like with an abstraction
But with inter-clone editing, clones remain in system
CnP does not have this feature (yet)
Intra-clone editing (within clones)
Only the relationship is the same between the
clones, not the physical change itself
CReN – consistent renaming
of identifiers within clones
Copied code
Pasted code
for (i = 1; i < size; i++)
{
    if (array[i] < low) {
        low = array[i];
    }
}
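The idea of consistent renaming within a clone can be sketched in Java using the loop shown above: when the programmer renames low inside the pasted clone, every other instance of low inside that clone follows, while code outside the clone is left alone. This sketch swaps in plain whole-word text matching over a character range, which is a simplification; it does not resolve bindings the way an AST-based tool would, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a text-based stand-in for AST-based consistent renaming.
final class ConsistentRename {

    // Renames every whole-word occurrence of oldName that lies inside the
    // clone's range [start, start + length) and leaves the rest of the source as is.
    static String renameInClone(String source, int start, int length,
                                String oldName, String newName) {
        String before = source.substring(0, start);
        String clone = source.substring(start, start + length);
        String after = source.substring(start + length);

        // Lookarounds keep "low" from matching inside longer names such as "lower".
        Pattern wholeWord = Pattern.compile(
                "(?<![A-Za-z0-9_$])" + Pattern.quote(oldName) + "(?![A-Za-z0-9_$])");
        String renamed = wholeWord.matcher(clone)
                .replaceAll(Matcher.quoteReplacement(newName));

        return before + renamed + after;
    }
}
```

Because the match is purely textual, this sketch would also rename a different variable that happens to share the name within the clone, which is the kind of mistake that binding-aware renaming avoids.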
Intra-clone editing (within clones)
Only the relationship is the same between the
clones, not the physical change itself
LexId – consistent renaming
of substrings within clones
Copied
code
Pasted
code
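Renaming substrings consistently can be sketched in Java as follows: from one rename (for example rPanel to gPanel), infer which substring changed, then apply the same change to other identifiers in the clone that carry that substring at the same position (rSlider becomes gSlider). The inference used here, stripping the common prefix and suffix, is an assumption chosen for brevity and is not claimed to be LexId's actual pattern inference; the class name is hypothetical.

```java
// Hypothetical sketch of substring-level renaming; not LexId's actual algorithm.
final class LexicalPattern {
    final int position;    // index at which the changed substring starts
    final String oldPart;  // substring that was replaced
    final String newPart;  // substring that replaced it

    private LexicalPattern(int position, String oldPart, String newPart) {
        this.position = position;
        this.oldPart = oldPart;
        this.newPart = newPart;
    }

    // Infers the changed substring by stripping the common prefix and suffix.
    static LexicalPattern infer(String oldId, String newId) {
        int max = Math.min(oldId.length(), newId.length());
        int prefix = 0;
        while (prefix < max && oldId.charAt(prefix) == newId.charAt(prefix)) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < max - prefix
                && oldId.charAt(oldId.length() - 1 - suffix)
                   == newId.charAt(newId.length() - 1 - suffix)) {
            suffix++;
        }
        return new LexicalPattern(prefix,
                oldId.substring(prefix, oldId.length() - suffix),
                newId.substring(prefix, newId.length() - suffix));
    }

    // Applies the change to another identifier if it has oldPart at the same position;
    // otherwise returns the identifier unchanged.
    String applyTo(String identifier) {
        if (oldPart.isEmpty() || !identifier.startsWith(oldPart, position)) {
            return identifier;
        }
        return identifier.substring(0, position) + newPart
                + identifier.substring(position + oldPart.length());
    }
}
```

With this sketch, the rename colorPanel to thicknessPanel yields the pattern color to thickness, which also turns colorChangeListener into thicknessChangeListener.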
Refactoring
Actually a form of clone editing
CnP does not have this feature (yet)
Clone group #1
Clone group #2
Abstraction #2
Abstraction #1
Clone divergence (loss of similarity)
Clones may naturally separate from one another
But if copied and pasted, likely to retain similarity
Unlike with refactoring, with clone divergence
the cloning relationship is removed
Tools allow the programmer to remove a
clone from a clone group (for tracking)
Programmer has full control over the clones that
are considered related (similar) to one another
Clones can be “linked” and “unlinked”
(for inter-clone editing)
There is significant code reuse in commercial
and open source software
Clone detection tools find clones during tests
Case study with CCFinderX and SimScan
clone detection tools on SCL and Eclipse JDT
UI plug-in source code
For SCL, SimScan found 102 clone groups, 70 of
which were intentional and useful clone groups
50 out of the 70 intentional, useful clone groups
consisted of clones that were likely copy/pasted
These 50 groups could have been supported with CnP
Most (65-67%) copied-and-pasted code
fragments require renaming at least one
identifier [CP-Miner]
Difficult to tell retroactively whether code
was actually copy/pasted or renamed
Some tools look at the correspondence
between identifiers over software versions
(in version control systems), which can be
used to determine renaming inconsistencies
[Clever, Vaci]
[CP-Miner]
[Bug Isolation]
[DECKARD-based]
Subject Characteristics
14 male subjects - 8 undergraduate, 6 graduate
students from Clarkson MCS and ECE departments
Knowledge of Java/Swing required, IDE optional
Study Procedure
Subjects came one at a time to a user study lab
Background on the problems of copy/paste
clones and on the three features was presented
The source code and graphical Paint program
that were used for the tasks were shown to them
Subjects were recorded with video/audio
Annotated screenshot of the Paint program
Debugging tasks
Task 1: Moving the blue slider does not change
the pixel color.
rSlider should be bSlider (on line 120)
Debugging tasks
Task 2: Moving the thickness slider does not
change the pixel thickness.
colorChangeListener should be
thicknessChangeListener (on line 142)
Modification tasks
Task 3: Add a titled border to colorPanel and to
thicknessPanel.
Task 3
Modification tasks
Task 4: Add color to the label of each color slider
- red, green, and blue.
Task 4
Renaming tasks (with CReN)
Task 5: Rename colorPanel to thicknessPanel and
rPanel to tPanel within the clone.
Renaming tasks (with CReN)
Task 6: Rename toolPanel to clearUndoPanel,
pencilButton to clearButton, and eraserButton to
undoButton within the clone.
Renaming tasks (with LexId)
Task 7: Rename rPanel to gPanel and rSlider to
gSlider in the green slider clone (shown), and
rPanel to bPanel and rSlider to bSlider in the blue
slider clone.
Renaming tasks (with LexId)
Task 8: Rename bPanel to tPanel and bSlider to
tSlider in the thickness slider clone.
Results – Time per Task
The time (in minutes) to complete each pair of tasks.
Results – Time per Task
Statistical hypothesis testing on the paired time data.
Results – Solution Correctness
Correct states when running the program or when finished.
Results – Method of Completion
Number of subjects who used each location and inspection
method for debugging and modification tasks.
Results – Method of Completion
Number of times each renaming method was used for
renaming tasks.
Discussion
Confounding factors for clone visualization
Clone visualization is not forced on the user
Subjects would have produced correct solutions to
Task 3 if they had made use of cloning information
Varying levels of subjects’ prior experience
Threats to validity
Some subjects had more prior knowledge/experience
Tasks fairly close to real-world GUI programming tasks
Tool design
Need to further improve the clone visualization
Need to tell programmers exactly what was renamed
Research contributions
The copy-and-paste (CnP) tool
Proactive tracking
Intra-clone editing
AST-based
Dimensions of clone tracking tool development
Definition of the clone lifecycle
Realization about clone visualization
Future work
Theory about copy-and-paste and abstractions
Other applications of this research
CnP user study paper published in ICPC 2010
Papers went through a rigorous reviewing process
Only 15/76 (< 20%) were accepted as full papers
CReN paper published in ETX 2007
This is a young topic that is getting recognition
Cited 7 times according to ACM Digital Library
2 additional citations are reported on CiteSeerX
39 downloads in the last year (from ACM website)
4 downloads in the last week (from ACM website)
The position of the source code characters as represented in
an ASTNode.
The three cases when capturing a range of source
code using the Eclipse AST API.
Identifier Matching
Identifier Partitioning