Seminar on the IEEE paper titled "An Approach to Source Code Plagiarism Detection and Investigation Using Latent Semantic Analysis"
AN APPROACH TO SOURCE CODE
PLAGIARISM DETECTION AND
INVESTIGATION USING LATENT SEMANTIC
ANALYSIS
Authors: Georgina Cosma and Mike Joy
Presented by
Varsha Bhat K (1DS09CS105)
INTRODUCTION
Source code plagiarism: reusing source code authored by someone else and failing to acknowledge the fact adequately
It may occur intentionally or unintentionally
Used by higher education academics
CHALLENGES INVOLVED
Detecting similar file pairs
Investigating the similar source code fragments within the detected files
Determining whether the similarity is suspicious or innocent
Burden of proof:
“Not only do we need to detect instances of
plagiarism, we must also be able to demonstrate
beyond reasonable doubt that those instances are
not chance similarities.”
EXISTING TOOLS
Categories of tools, as identified by Mozgovoy:
Fingerprint-based systems
String matching systems
Parameterized matching systems
FINGERPRINT-BASED SYSTEMS
Create a fingerprint for each file
Each fingerprint contains statistical information about the file
Various metrics are used for detecting plagiarism, e.g. Halstead's metrics
An example of such a system is ITPAD
STRING MATCHING APPROACH
The steps involved are:
Stage 1 is tokenization
The source code is then written as a series of token strings
The token strings are compared to check for similarity
Example tools are:
MOSS
YAP3
JPlag
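The tokenize-then-compare steps can be sketched roughly as below. This is a minimal illustration, not how MOSS, YAP3 or JPlag are implemented: the keyword set, the `ID`/`NUM` token names and the use of Python's `difflib.SequenceMatcher` as the comparison step are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny keyword set for a Java-like language.
KEYWORDS = {"int", "for", "if", "else", "while", "return", "class", "void"}
# Identifiers/keywords, integer literals, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def tokenize(source: str) -> list[str]:
    """Turn source text into a token string.

    Keywords and punctuation are kept, every identifier becomes ID and
    every number becomes NUM, so renaming variables does not change the result.
    """
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the token strings of two fragments."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


original = "int total = 0; for (int i = 0; i < n; i++) { total = total + i; }"
renamed = "int sum = 0; for (int k = 0; k < len; k++) { sum = sum + k; }"
different = "if (x > 0) { return x; } else { return 0; }"

print(token_similarity(original, renamed))    # identical token strings
print(token_similarity(original, different))  # lower score
```

Because identifiers collapse to a single token, the renamed fragment produces exactly the same token string as the original, which is the point of tokenizing before matching.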
PARAMETERIZED MATCHING SYSTEMS
Detect identical and near-duplicate sections of source code
Achieved by matching source code sections whose identifiers have been substituted systematically
Example: the Dup tool
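A rough sketch of what "systematic substitution" means is given below. It is not the Dup algorithm: it only compares two whole fragments, and the placeholder naming (`p0`, `p1`, ...) and the keyword set are assumptions made for the example.

```python
import re

# Hypothetical keyword set; anything else that looks like a name is an identifier.
KEYWORDS = {"int", "for", "if", "else", "while", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def parameterize(source: str) -> list[str]:
    """Replace each distinct identifier by a placeholder numbered by first use."""
    names: dict[str, str] = {}
    out = []
    for lexeme in TOKEN_RE.findall(source):
        is_name = lexeme[0].isalpha() or lexeme[0] == "_"
        if is_name and lexeme not in KEYWORDS:
            # The same identifier always maps to the same placeholder.
            out.append(names.setdefault(lexeme, f"p{len(names)}"))
        else:
            out.append(lexeme)
    return out


def is_parameterized_match(a: str, b: str) -> bool:
    """True when b can be obtained from a by a consistent renaming of identifiers."""
    return parameterize(a) == parameterize(b)


a = "total = total + price * count;"
b = "sum = sum + cost * qty;"      # every identifier renamed consistently
c = "sum = cost + sum * qty;"      # same names, but used in different positions

print(is_parameterized_match(a, b))  # True
print(is_parameterized_match(a, c))  # False
```

Unlike collapsing every identifier to one token, this keeps track of which identifier is which, so only a consistent renaming counts as a match.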
INFORMATION RETRIEVAL METHODS
Represent each program as an indexed set of keywords
Compute the frequency of these keywords
Then compute the pairwise similarity between programs
Example: PDetect
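The keyword-frequency idea can be sketched as follows. The keyword set, the file names and the choice of cosine similarity as the pairwise measure are assumptions for illustration; the slide does not say which measure PDetect uses.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword set used to index each program.
KEYWORDS = {"int", "for", "if", "else", "while", "return"}


def keyword_profile(source: str) -> Counter:
    """Frequency of each keyword that occurs in the program."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    return Counter(w for w in words if w in KEYWORDS)


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two keyword-frequency profiles."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Made-up programs, for illustration only.
programs = {
    "A.java": "for (int i = 0; i < n; i++) { if (a[i] > 0) { return i; } }",
    "B.java": "for (int j = 0; j < m; j++) { if (b[j] > 0) { return j; } }",
    "C.java": "while (x > 0) { x = x - 1; } return x;",
}
profiles = {name: keyword_profile(src) for name, src in programs.items()}
pairwise = {
    (a, b): cosine(profiles[a], profiles[b])
    for a, b in combinations(sorted(programs), 2)
}
for pair, sim in pairwise.items():
    print(pair, round(sim, 3))
```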
PLAGATE
Detects similar source code files
Investigates the similar code fragments within them
The aim of the investigation is to gather evidence of plagiarism by indicating the contribution level of each fragment
Enhances the detection performance of existing algorithms
Uses Latent Semantic Analysis to achieve this
LATENT SEMANTIC ANALYSIS
An information retrieval technique
The text collection is preprocessed
It is represented as a term-by-file matrix
A matrix transformation is applied
Singular value decomposition (SVD) is performed
This uncovers latent relationships
Derives the meaning of terms by using SVD to approximate the structure of term usage among documents
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$
ADVANTAGES
Can detect transitive relationships, unlike traditional text retrieval systems
Helps reduce noise in the data
Overcomes problems of synonymy and polysemy
Changes to document structure do not affect detection
Language independent
DISADVANTAGES
Gives relatively high similarity values for non-copied programs as well
SIMILARITY IN SOURCE CODE FILES
Key factors for judging similarity in files are
Nature of programming language and the problem
Variance in solution
Supporting source code already given
Assignment requirements
Fragments under investigation must not be
Short
Simple
Standard
Trivial
Limited functionality
Frequently published
SIMILARITY CATEGORIES
Source code fragments contribute to the evidence for plagiarism to varying degrees
Hence the need for criteria to identify each fragment's contribution
Contribution levels
1. Contribution level 0: no contribution
2. Contribution level 1: low contribution
3. Contribution level 2: high contribution
Similarity levels
1. Level 0: innocent
2. Level 1: suspicious
PLAGATE SYSTEM
Aim: enhance the process of plagiarism detection
and investigation
It is integrated with external detection tools as an
enhancer
Components
1. PlaGate Detection tool (PGDT)
2. PlaGate Query tool (PGQT)
FUNCTIONALITY
SYSTEM REPRESENTATION
File corpus C
C = {F1, F2, ..., Fn}
Source code fragment 's' from source code file F
F ∈ C
F = {s1, s2, ..., sp}
Set of source code fragments S
File length 'lf'
Source code fragment length 'ls'
LSA PROCESS IN PLAGATE
Preprocess the files
Transform the corpus of files into an m x n term-by-file matrix A = [aij]
A term weighting algorithm is applied to A to give the value of each term in each file
SVD is performed on the weighted matrix A
Dimensionality reduction
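The steps above can be sketched with NumPy as follows. The toy matrix, the term labels, the value of k and the weighting scheme (raw term frequency, a "normal" global weight and cosine normalisation of each file) are assumptions for illustration, since the weighting formula on the slide did not survive in this transcript.

```python
import numpy as np

# Hypothetical term-by-file matrix: rows are terms, columns are files,
# and each entry is the frequency of a term in a file.
terms = ["total", "sum", "price", "loop", "print"]
A = np.array(
    [
        [2, 0, 2, 0],
        [0, 2, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 2],
        [0, 0, 1, 2],
    ],
    dtype=float,
)


def weight(A: np.ndarray) -> np.ndarray:
    """Apply an assumed weighting: tf * normal global weight, then unit-length files."""
    global_w = 1.0 / np.sqrt((A**2).sum(axis=1, keepdims=True))  # one weight per term
    W = A * global_w
    return W / np.linalg.norm(W, axis=0, keepdims=True)  # normalise each column


def lsa(A: np.ndarray, k: int):
    """SVD of the weighted matrix, truncated to the k largest singular values."""
    U, s, Vt = np.linalg.svd(weight(A), full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


U_k, s_k, Vt_k = lsa(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of the weighted matrix

print(U_k.shape, s_k.shape, Vt_k.shape)
print(np.round(A_k, 2))
```

Each column of `Vt_k` is the k-dimensional representation of one file; choosing k is the dimensionality reduction step.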
DETECTION AND CLASSIFICATION PROCESS IN
PLAGATE
The PGQT component transforms the input file or fragment into a query vector q
q is then projected onto the k-dimensional space, giving the projected query vector Q
The similarity between Q and every source code file in the corpus is then computed using a similarity measure
Cosine similarity is the most popular such measure
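A minimal sketch of the query step is given below. It assumes the standard LSA fold-in, Q = qᵀ U_k Σ_k⁻¹, compared against the rows of V_k; the projection formula on the slide was lost in this transcript, so this may differ in detail from what PlaGate does. The matrix is a made-up example and term weighting is omitted for brevity.

```python
import numpy as np

# Hypothetical term-by-file matrix (rows: terms, columns: files).
A = np.array(
    [
        [2, 0, 2, 0],
        [0, 2, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 2],
        [0, 0, 1, 2],
    ],
    dtype=float,
)
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
files_k = Vt[:k, :].T  # one row per file in the k-dimensional space


def project(q: np.ndarray) -> np.ndarray:
    """Fold a term-frequency query vector into the k-dimensional space."""
    return (q @ U_k) / s_k  # assumed form: Q = q^T U_k Sigma_k^{-1}


def cosine_to_files(Q: np.ndarray) -> np.ndarray:
    """Cosine similarity between the projected query and every file."""
    num = files_k @ Q
    den = np.linalg.norm(files_k, axis=1) * np.linalg.norm(Q)
    return num / den


# Use the first file itself as the query: it should match itself perfectly.
q = A[:, 0].copy()
sims = cosine_to_files(project(q))
print(np.round(sims, 3))
```

Files are then ranked by their similarity to Q, and those above a chosen threshold are reported.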
EXPERIMENTATION
Four corpora consisting of Java source code files
The corpora were produced by undergraduate students at the University of Warwick
Students were given simple skeleton code to start with
The Data Sets
PERFORMANCE EVALUATION MEASURES
sim(Fa,Fb) gives the similarity of two files and is computed using a similarity measure
Recall and Precision are the two most commonly used measures for information retrieval systems
A threshold φ is selected
Files that have sim(Fa,Fb) ≥ φ are detected
Overall performance is evaluated by combining both measures into a single value F
The closer the value of F is to 1.00, the better the detection performance
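The thresholding and evaluation can be sketched as below. Two things here are assumptions: the measures are computed over file pairs, and F is taken to be the harmonic mean of Precision and Recall (the usual F1). The slide does not preserve the formula for F, so the paper may combine the two measures differently. The similarity values and the set of truly similar pairs are made up.

```python
def evaluate(similarities: dict, relevant_pairs: set, threshold: float):
    """Return (precision, recall, F) for the pairs with sim >= threshold."""
    detected = {pair for pair, sim in similarities.items() if sim >= threshold}
    hits = detected & relevant_pairs
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(relevant_pairs) if relevant_pairs else 0.0
    # Assumed combination: harmonic mean of precision and recall.
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def pair(a: str, b: str) -> frozenset:
    """Unordered file pair, so (F1, F2) and (F2, F1) are the same key."""
    return frozenset((a, b))


# Hypothetical sim(Fa, Fb) values and hand-labelled similar pairs.
similarities = {
    pair("F1", "F2"): 0.92,
    pair("F1", "F3"): 0.75,
    pair("F2", "F3"): 0.40,
    pair("F3", "F4"): 0.81,
}
relevant = {pair("F1", "F2"), pair("F2", "F3")}

precision, recall, f = evaluate(similarities, relevant, threshold=0.70)
print(round(precision, 3), round(recall, 3), round(f, 3))  # 0.333 0.5 0.4
```

Lowering the threshold detects more pairs, which tends to raise Recall and lower Precision.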
PLAGATE VS JPLAG AND SHERLOCK
Performance is evaluated both when the tools function alone and when they are integrated with PlaGate
Lists of suspicious files are created
Results
Recall increases after integration with PGDT
This consistent increase indicates that PGDT and the external tools complement each other
A further increase is seen when both PGDT and PGQT are integrated, but at the cost of Precision
JPlag alone had high Precision and low Recall in all data sets
Sherlock and JPlag, both string matching algorithms, vary significantly in detection performance
Similarity often occurs in groups containing more than 2 files
JPlag and Sherlock fail to detect some suspicious files due to local confusion
Local confusion occurs when code segments shorter than the minimum match length have been shuffled within the files; string matching algorithms are sensitive to this
PlaGate does not suffer from this sort of local confusion as it does not depend on the structure of the code
CONCLUSION
An LSA-based technique for plagiarism detection and investigation that acts as an enhancer to existing tools
Detects source code files missed by current plagiarism detection tools
Integration with PlaGate increases Recall at the cost of Precision
PlaGate classifies similarity into contribution levels
PlaGate is language independent
Unlike other tools, which find the similarity of two files, PlaGate finds the relative similarity between files
FUTURE WORK
Automating dimensionality reduction is still an open problem
Misclassification of source code fragments
PlaGate's behavior is not as stable as that of the string matching algorithms