an approach to source code plagiarism

AN APPROACH TO SOURCE CODE PLAGIARISM DETECTION AND INVESTIGATION USING LATENT SEMANTIC ANALYSIS

Authors: Georgina Cosma and Mike Joy

Presented by

Varsha Bhat K (1DS09CS105)

INTRODUCTION

Source code plagiarism: reusing source code authored by someone else and failing to adequately acknowledge the fact

It may occur intentionally or unintentionally

Plagiarism detection tools are used by higher education academics

CHALLENGES INVOLVED

Detect similar file pairs

Investigate the similar source code fragments within the detected files

Determine if the similarity is suspicious or innocent

Burden of proof:

“Not only do we need to detect instances of plagiarism, we must also be able to demonstrate beyond reasonable doubt that those instances are not chance similarities.”

EXISTING TOOLS

Categories of tools:

Fingerprint based systems

String matching systems

Parameterized matching systems

These were identified by Mozgovoy

FINGERPRINT BASED SYSTEMS

Create a fingerprint for each file

The fingerprint contains statistical information about the file

Various metrics are used for detecting plagiarism. Ex: Halstead's metrics

An example of such a system is ITPAD

STRING MATCHING APPROACH

The steps involved are:

The first stage is tokenization

The source code is then written as a series of token strings

The token strings are compared to check for similarity

Example tools are:

MOSS

YAP3

JPlag
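The stages above can be sketched as follows. This is a minimal illustration only: the keyword set, the `ID`/`NUM` token names, and the use of Python's `difflib.SequenceMatcher` for the comparison stage are assumptions made here, and it is not the matching algorithm used by MOSS, YAP3, or JPlag.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately small keyword set (illustration only).
KEYWORDS = {"int", "for", "while", "if", "else", "return", "class", "void"}

# An identifier/keyword, a run of digits, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> list[str]:
    """Turn source text into a token string: identifiers -> ID, numbers -> NUM."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def token_similarity(a: str, b: str) -> float:
    """Compare two files by their token strings (ratio in [0, 1])."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


original = "int total = 0; for (int i = 0; i < n; i++) total += i;"
renamed = "int sum = 0; for (int k = 0; k < n; k++) sum += k;"
print(token_similarity(original, renamed))
```

Because every identifier maps to the same `ID` token, renaming variables leaves the token string unchanged, which is why tokenization makes simple disguises ineffective.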

PARAMETERIZED MATCHING SYSTEMS

Detect identical and near-duplicate sections of source code

Achieved by matching source code sections whose identifiers have been systematically substituted

Ex: DUP tool
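Systematic substitution can be illustrated by replacing each identifier with a placeholder numbered by first occurrence, so two sections match only if their identifiers were renamed consistently. This is a simplified sketch: the function names and keyword set are hypothetical, and it compares whole fragments rather than reproducing the algorithm inside the Dup tool.

```python
import re

# Hypothetical keyword set; keywords are never treated as parameters.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def parameterize(source: str) -> list[str]:
    """Replace each identifier by a placeholder numbered by first occurrence."""
    names: dict[str, str] = {}
    out = []
    for lexeme in TOKEN_RE.findall(source):
        is_name = lexeme[0].isalpha() or lexeme[0] == "_"
        if is_name and lexeme not in KEYWORDS:
            # The same identifier always receives the same placeholder.
            out.append(names.setdefault(lexeme, f"p{len(names)}"))
        else:
            out.append(lexeme)
    return out


def p_match(a: str, b: str) -> bool:
    """True if b equals a up to a consistent renaming of identifiers."""
    return parameterize(a) == parameterize(b)


print(p_match("x = x + y;", "a = a + b;"))  # consistent renaming
print(p_match("x = x + y;", "a = b + b;"))  # inconsistent renaming
```

Unlike mapping every identifier to one generic token, this keeps track of which identifier is used where, so the second pair above is not reported as a match.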

INFORMATION RETRIEVAL METHODS

Represent each program as an indexed set of keywords

Compute the frequency of these keywords

Then compute the pairwise similarity

Ex: PDetect
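The three steps above can be sketched with keyword counts and a pairwise comparison. The slide does not say which similarity measure PDetect uses, so the cosine measure here is an assumption, as are the function names and the sample files.

```python
import math
import re
from collections import Counter
from itertools import combinations


def keyword_counts(source: str) -> Counter:
    """Index a program as a bag of keywords with their frequencies."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def pairwise_similarity(files: dict[str, str]) -> dict[tuple[str, str], float]:
    """Similarity of every unordered pair of files."""
    vectors = {name: keyword_counts(text) for name, text in files.items()}
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


files = {
    "A.java": "int total = total + price;",
    "B.java": "int total = price + total;",
    "C.java": "print(message);",
}
for pair, sim in pairwise_similarity(files).items():
    print(pair, round(sim, 3))
```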

PLAGATE

Detects similar source code files

Investigates the similar code fragments within them

The aim of the investigation is to gather evidence of plagiarism by indicating the contribution levels of fragments

This enhances the detection performance of existing algorithms

Uses Latent Semantic Analysis (LSA) to achieve this

LATENT SEMANTIC ANALYSIS

It is an information retrieval technique

The text collection is preprocessed

It is represented as a term-by-file matrix

A matrix transformation is applied

Singular value decomposition (SVD) is performed

This uncovers latent relationships

Derives the meaning of terms by approximating the structure of term usage among documents using SVD

[Figure: term-by-file matrix with entries a11, a12, a21, a22]
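For reference, the decomposition step is usually written as below. This is the standard LSA formulation rather than an equation preserved from the slides; the symbols U, Σ, V and the retained dimension k are assumptions introduced here, with A the m × n term-by-file matrix.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \quad k \le \min(m, n)
```

Here U_k and V_k hold the first k columns of U and V, and Σ_k holds the k largest singular values. Keeping only these k dimensions is what discards noise and exposes the latent relationships between terms and files.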

ADVANTAGES

Can detect transitive relationships, unlike traditional text retrieval systems

Helps reduce noise in the data

Overcomes problems of synonymy and polysemy

Changes to document structure do not affect detection

Language independent

DISADVANTAGES

Also gives relatively high similarity values for non-copied programs

SIMILARITY IN SOURCE CODE FILES

Key factors for judging similarity in files are:

Nature of programming language and the problem

Variance in solution

Supporting source code already given

Assignment requirements

Fragments under investigation must not be:

Short

Simple

Standard

Trivial

Of limited functionality

Frequently published

SIMILARITY CATEGORIES

Source code fragments contribute to varying degrees to the evidence for plagiarism

Hence the need for criteria to identify these contributions

Contribution levels

1. Contribution level 0: no contribution

2. Contribution level 1: low contribution

3. Contribution level 2: high contribution

Similarity levels

1. Level 0: innocent

2. Level 1: suspicious

PLAGATE SYSTEM

Aim: enhance the process of plagiarism detection and investigation

It is integrated with external detection tools as an enhancer

Components

1. PlaGate Detection tool (PGDT)

2. PlaGate Query tool (PGQT)

FUNCTIONALITY

SYSTEM REPRESENTATION

File corpus C

C = {F1, F2, ..., Fn}

Source code fragment s from source code file F

F ∈ C

F = {s1, s2, ..., sp}

Set of source code fragments S

File length ‘lf’ where

Source code fragment length ‘ls’

LSA PROCESS IN PLAGATE

Preprocess the files

Transform the corpus of files into an m × n term-by-file matrix

A = [a_ij]

Term weighting algorithms are applied to the matrix entries

value of term in file:

SVD is performed on the weighted matrix A

Dimensionality reduction
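The matrix construction and weighting steps can be sketched as follows. The slide's weighting formula is not preserved, so the scheme used here (raw term frequency followed by cosine normalization of each file column) is an assumption, as are the function names. The SVD step is omitted because it needs a linear algebra library.

```python
import math
import re


def term_by_file_matrix(files: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (terms, A) where A[i][j] is the frequency of term i in file j."""
    tokenized = [re.findall(r"[A-Za-z_]\w*", text.lower()) for text in files]
    terms = sorted({t for tokens in tokenized for t in tokens})
    A = [[float(tokens.count(term)) for tokens in tokenized] for term in terms]
    return terms, A


def normalize_columns(A: list[list[float]]) -> list[list[float]]:
    """Scale each file (column) to unit length so long files do not dominate."""
    if not A:
        return []
    m, n = len(A), len(A[0])
    norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
    return [
        [A[i][j] / norms[j] if norms[j] else 0.0 for j in range(n)]
        for i in range(m)
    ]


terms, A = term_by_file_matrix(["int a = a + b", "int b"])
W = normalize_columns(A)
for term, row in zip(terms, W):
    print(term, [round(x, 3) for x in row])
```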

DETECTION AND CLASSIFICATION PROCESS IN PLAGATE

The PGQT component transforms the input file or fragment into a query vector q

Then q is projected onto the k-dimensional space, giving the projected query vector Q

The similarity between Q and every source code file in the corpus is then measured using a similarity measure

The cosine similarity measure is the most popular
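In standard LSA the projection and the cosine measure are written as below. The slide's own equation is not preserved, so this is the textbook form rather than a quotation; U_k and Σ_k are assumed to come from the rank-k truncated SVD of the weighted term-by-file matrix, and f_j is assumed to denote the vector of file Fj in the k-dimensional space.

```latex
Q = q^{T} U_k \Sigma_k^{-1}, \qquad
\cos(Q, \mathbf{f}_j) = \frac{Q \cdot \mathbf{f}_j}{\lVert Q \rVert \, \lVert \mathbf{f}_j \rVert}
```

A cosine close to 1 means the query and the file point in nearly the same direction in the reduced space, regardless of their lengths.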

EXPERIMENTATION

Four corpora consisting of Java source code files

The corpora were produced by undergraduate students at the University of Warwick

Students were given simple skeleton code to start with

The Data Sets

PERFORMANCE EVALUATION MEASURES

sim(Fa, Fb) gives the similarity of two files and is computed using a similarity measure

Recall and Precision are the two most commonly used measures for information retrieval systems

A threshold Ø is selected

File pairs with sim(Fa, Fb) ≥ Ø are detected

Overall performance is evaluated by combining both measures into a single value, F

The closer the value of F is to 1.00, the better the detection performance
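The evaluation can be sketched as below. The formulas on the original slides are not preserved, so the standard information retrieval definitions are assumed: Precision is the fraction of detected pairs that are relevant, Recall is the fraction of relevant pairs that are detected, and F is taken to be their harmonic mean. The function name and sample data are hypothetical.

```python
def evaluate(similarities, relevant, threshold):
    """Return (precision, recall, F) for pairs with similarity >= threshold."""
    detected = {pair for pair, sim in similarities.items() if sim >= threshold}
    hits = detected & relevant
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Hypothetical similarity values and ground truth, for illustration only.
similarities = {
    ("F1", "F2"): 0.90,
    ("F1", "F3"): 0.80,
    ("F2", "F3"): 0.30,
    ("F3", "F4"): 0.75,
}
relevant = {("F1", "F2"), ("F3", "F4")}

precision, recall, f = evaluate(similarities, relevant, threshold=0.7)
print(round(precision, 3), round(recall, 3), round(f, 3))
```

Lowering the threshold detects more pairs, which tends to raise Recall and lower Precision; F summarizes that trade-off in one number.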

PLAGATE VS JPLAG AND SHERLOCK

Performance is evaluated when the tools function alone and when integrated with PlaGate

Lists of suspicious files are created

Results

Recall increases after integration with PGDT

This consistent increase indicates that PGDT and the external tools complement each other

A further increase is seen when both PGDT and PGQT are integrated, but at the cost of Precision

JPlag alone had high Precision and low Recall in all data sets

Sherlock and JPlag, both string matching algorithms, vary significantly in detection performance

Similarity often occurs in groups containing more than 2 files

JPlag and Sherlock fail to detect some suspicious files due to local confusion

Local confusion occurs when code segments shorter than the minimum match length have been shuffled within files; string matching algorithms cannot match such segments

PlaGate does not suffer from this sort of local confusion as it does not depend on the structure of the code
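Local confusion can be illustrated with a toy example. The minimum match length, the token sequences, and the helper function below are hypothetical, and neither the matcher nor the term counts reproduce what JPlag, Sherlock, or PlaGate actually compute. The point is only that reordering short segments breaks long contiguous matches while leaving term frequencies untouched.

```python
from collections import Counter

MIN_MATCH_LENGTH = 5  # hypothetical minimum match length, in tokens


def longest_common_run(a: list[str], b: list[str]) -> int:
    """Length of the longest contiguous token run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous token of both.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


original = "i ++ ; j -- ; k ++ ;".split()
shuffled = "j -- ; i ++ ; k ++ ;".split()  # same statements, reordered

# A string matcher that ignores runs below the minimum length finds nothing.
print(longest_common_run(original, shuffled) < MIN_MATCH_LENGTH)

# The term frequencies are identical, so an order-independent
# representation still sees the two files as the same.
print(Counter(original) == Counter(shuffled))
```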

Example

CONCLUSION

An LSA-based technique for plagiarism detection and investigation that acts as an enhancer for existing tools

Detects source code files missed by current plagiarism detection tools

Integration with PlaGate increases Recall at the cost of Precision

PlaGate classifies similarity into contribution levels

PlaGate is language independent

Unlike other tools that find the similarity of two files, PlaGate finds the relative similarity

FUTURE WORK

Automating dimensionality reduction is still a problem

Misclassification of source code fragments

PlaGate's behavior is not as stable as that of the string matching algorithms
