
2014 INTERNATIONAL CONFERENCE ON COMPUTATION OF POWER, ENERGY, INFORMATION AND COMMUNICATION (ICCPEIC)

Distance Indices for the Detection of Similarity in C programs

Julie Baby, Kannan T, Vinod P, Viji Gopal Department of Computer Science & Engineering

SCMS School of Engineering and Technology, Ernakulam, Kerala, India {jbnellickal, kannanO 16, pvinod21}@gmail.com, [email protected]

Abstract: The use of plagiarized articles and source code has proliferated among the student and research community. This paper focuses on an efficient method that can differentiate between plagiarized and non-plagiarized programs. Similarity/distance measurement techniques are used to classify the test file. Thirty-six distance metrics are used to determine intra-class and inter-class proximity. Unseen files, which are not used for frequency extraction, are predicted with high accuracy. This shows that our proposed model, which uses intra/inter-family thresholds, can be implemented to identify plagiarized programs with a good detection rate.

Keywords- Plagiarism, Family, Distance, Similarity and Attributes.

I. INTRODUCTION

The word plagiarism has its roots in two Latin words: plagiarius, an abductor, and plagiare, to steal [12]. The evolution of technology, especially the Internet, has changed the way people think. Plagiarism can be defined as the theft of another person's work, words or ideas and the presentation of them as one's own work. There are many possible motivations for students to engage in plagiarism: poor time management skills, the cost of failure in terms of time or financial resources, group pressure, values or expectations, friendship or the desire to help a classmate, a negative perception of the teacher, and inadequate hardware, software, library facilities or access to teachers and staff. To some extent, plagiarism can be avoided by acknowledging that the material is taken from another source and providing the necessary information regarding that source.

Plagiarism can be categorized as copy-paste plagiarism, paraphrasing, translated plagiarism, artistic plagiarism, idea plagiarism and code plagiarism. Here we are mainly concerned with source code plagiarism, especially in C programs. It is the process of reusing the source code of a program to create another program that appears visually different, using a small number of routine transformations that do not require a detailed understanding of the program. With a few editor operations, a plagiarized program can be created [2]. The increasing number of plagiarized documents calls for a system that can detect plagiarism.

This paper focuses on an efficient method that can differentiate between plagiarized and non-plagiarized programs using distance indices. Thirty-six different distance measures are used to find the distance between the files of each family by comparing their frequency vector tables. A base file is created for each family using each distance measure. The file that has the minimum distance to the other files of the family is known as the base file/center file. A threshold range is determined by calculating the distance score between the base file and the other files of the family and nonfamily. A test file is assigned to a family if more than 50% of the distance measures fall within the threshold of that family.

This paper is organized as follows. Section II describes the related work. Section III discusses the proposed methodology. Section IV presents the experimental results and findings. In Section V, we discuss the conclusion and future work.

II. RELATED WORK

In [1], the author discusses various distance/similarity measures. These similarity measures are categorized based on three aspects: (a) syntactic similarity, (b) implementation caveats and (c) semantics. In [2], the authors survey different plagiarism detection systems. The survey is divided into four categories, namely plagiarism in documents, plagiarism in code, plagiarism techniques and algorithms used for plagiarism. The authors also propose a system that can detect different plagiarism attempts.

In [5], the authors propose an inter-lingual plagiarism detection tool. It compares the intermediate code produced by the compiler suite for the same programs in two languages. The proposed methodology finds plagiarized programs that are written in the C language. In [3], the authors review academicians' viewpoints on source-code plagiarism in an undergraduate student context. They consider differences in opinion on source-code issues and provide a definition of source-code plagiarism. In [4], the authors investigate textual plagiarism.

In [7], the authors developed a source code plagiarism detector called Deimos, which can be extended to handle plagiarism in other programming languages by implementing new scanners and parsers. The result of this detector cannot be used as a final decision on whether a document is plagiarized; rather, it can be used as an input or suggestion for manual detection. In [8], the authors review source code plagiarism. Methods are divided into (a) those that detect plagiarism and (b) those that prevent plagiarism. This paper also developed automatic tools that assist in detecting plagiarism.

In [9], the authors introduce the BUAA_AntiPlagiarism system, which detects source code plagiarism through analysis of a program's syntax structure. The output of the system is a group of clusters of all suspicious plagiarized programs, obtained after calculating the pairwise similarities.

978-1-4799-3826-1/14/$31.00 ©2014 IEEE


In [10], the authors introduce a tool called PlaGate that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The tool also supports the investigation of similarity between source-code files. In [11], the authors describe a source code plagiarism detection method that identifies plagiarism accurately even when the position of a function is changed by the plagiarist.

In [13], the authors introduce a method that detects plagiarized documents among source code files. This method considers documents to be plagiarized if they are more similar than the average similarity between the documents. The method is divided into six stages: (a) pre-filtering, (b) segmentation and similarity measurement, (c) segment matching, (d) post-filtering, (e) document-wise distance evaluation and (f) corpus analysis presentation. In [14], the authors describe a technique that detects plagiarism in source code. The technique transforms the code into a reusable index. This index is queried against a set of input files to find the plagiarized code. In [15], the author discusses a method that identifies similar program code using maximal similar subgraphs. This approach considers both the syntactic structure and the data flow within the program.

III. PROPOSED METHOD

In this section, we propose a method that uses distance measures to find plagiarism in C programs. Similarity/distance is defined as a quantitative degree that enumerates the logical separation of two objects represented by a set of measurable attributes [6]. We detect code plagiarism by calculating a similarity score between the programs to be compared. Thirty-six different similarity/distance measures are used. Figure 1 shows the proposed method and Table I shows the distance measures.

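Figure 1. Proposed Method

TABLE I. DISTANCE MEASURES

Distance name                      Equation

Euclidean Distance                 $d = \sqrt{\sum_{i=1}^{d} |P_i - Q_i|^2}$
CityBlock Distance                 $d = \sum_{i=1}^{d} |P_i - Q_i|$
Chebyshev Distance                 $d = \max_i |P_i - Q_i|$
Sorensen Distance                  $d = \frac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} (P_i + Q_i)}$
Gower Distance                     $d = \frac{1}{d} \sum_{i=1}^{d} |P_i - Q_i|$
Soergel Distance                   $d = \frac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} \max(P_i, Q_i)}$
Kulczynski Distance                $d = \frac{\sum_{i=1}^{d} |P_i - Q_i|}{\sum_{i=1}^{d} \min(P_i, Q_i)}$
Canberra Distance                  $d = \sum_{i=1}^{d} \frac{|P_i - Q_i|}{P_i + Q_i}$
Lorentzian Distance                $d = \sum_{i=1}^{d} \ln(1 + |P_i - Q_i|)$
Intersection Distance              $d = \sum_{i=1}^{d} \min(P_i, Q_i)$
Ruzicka Distance                   $d = \frac{\sum_{i=1}^{d} \min(P_i, Q_i)}{\sum_{i=1}^{d} \max(P_i, Q_i)}$
Tanimoto Distance                  $d = \frac{\sum_{i=1}^{d} \max(P_i, Q_i) - \sum_{i=1}^{d} \min(P_i, Q_i)}{\sum_{i=1}^{d} \max(P_i, Q_i)}$
Czekanowski Distance               $d = \frac{2 \sum_{i=1}^{d} \min(P_i, Q_i)}{\sum_{i=1}^{d} (P_i + Q_i)}$
Harmonic mean Distance             $d = 2 \sum_{i=1}^{d} \frac{P_i Q_i}{P_i + Q_i}$
Dice Distance                      $d = \frac{2 \sum_{i=1}^{d} P_i Q_i}{\sum_{i=1}^{d} P_i^2 + \sum_{i=1}^{d} Q_i^2}$
Bhattacharyya Distance             $d = -\ln \sum_{i=1}^{d} \sqrt{P_i Q_i}$
Hellinger Distance                 $d = 2 \sqrt{1 - \sum_{i=1}^{d} \sqrt{P_i Q_i}}$
Matusita Distance                  $d = \sqrt{2 - 2 \sum_{i=1}^{d} \sqrt{P_i Q_i}}$
Squared-chord Distance             $d = \sum_{i=1}^{d} (\sqrt{P_i} - \sqrt{Q_i})^2$
Squared Euclidean Distance         $d = \sum_{i=1}^{d} (P_i - Q_i)^2$
Neyman Distance                    $d = \sum_{i=1}^{d} \frac{(P_i - Q_i)^2}{P_i}$
Squared Distance                   $d = \sum_{i=1}^{d} \frac{(P_i - Q_i)^2}{P_i + Q_i}$
Probabilistic symmetric Distance   $d = 2 \sum_{i=1}^{d} \frac{(P_i - Q_i)^2}{P_i + Q_i}$
Divergence Distance                $d = 2 \sum_{i=1}^{d} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$
Clark Distance                     $d = \sqrt{\sum_{i=1}^{d} \left( \frac{|P_i - Q_i|}{P_i + Q_i} \right)^2}$
Kullback-Leibler Distance          $d = \sum_{i=1}^{d} P_i \ln \frac{P_i}{Q_i}$
Jeffreys Distance                  $d = \sum_{i=1}^{d} (P_i - Q_i) \ln \frac{P_i}{Q_i}$
K divergence Distance              $d = \sum_{i=1}^{d} P_i \ln \frac{2 P_i}{P_i + Q_i}$
Jensen-Shannon Distance            $d = \frac{1}{2} \left[ \sum_{i=1}^{d} P_i \ln \frac{2 P_i}{P_i + Q_i} + \sum_{i=1}^{d} Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right]$
Jensen Difference Distance         $d = \sum_{i=1}^{d} \left[ \frac{P_i \ln P_i + Q_i \ln Q_i}{2} - \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2} \right]$
Topsoe Distance                    $d = \sum_{i=1}^{d} \left[ P_i \ln \frac{2 P_i}{P_i + Q_i} + Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right]$
Taneja Distance                    $d = \sum_{i=1}^{d} \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2 \sqrt{P_i Q_i}}$
Kumar Johnson Distance             $d = \sum_{i=1}^{d} \frac{(P_i^2 - Q_i^2)^2}{2 (P_i Q_i)^{3/2}}$
Average Distance                   $d = \frac{\sum_{i=1}^{d} |P_i - Q_i| + \max_i |P_i - Q_i|}{2}$
Wave Hedges Distance               $d = \sum_{i=1}^{d} \frac{|P_i - Q_i|}{\max(P_i, Q_i)}$
Additive Symmetric Distance        $d = \sum_{i=1}^{d} \frac{(P_i - Q_i)^2 (P_i + Q_i)}{P_i Q_i}$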
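A few of the measures listed in Table I can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the DISTANCES dictionary and the handling of zero denominators (terms with a zero denominator are skipped or the result is set to 0) are assumptions, since the paper does not state how such cases are treated.

```python
import math

# p and q are frequency vectors of equal length (one entry per attribute).

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def sorensen(p, q):
    den = sum(a + b for a, b in zip(p, q))
    # Assumption: return 0 when both vectors are all zeros.
    return city_block(p, q) / den if den else 0.0

def canberra(p, q):
    # Assumption: terms where P_i + Q_i == 0 are skipped.
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b)

# Hypothetical registry so that later steps can loop over all measures.
DISTANCES = {
    "euclidean": euclidean,
    "city_block": city_block,
    "chebyshev": chebyshev,
    "sorensen": sorensen,
    "canberra": canberra,
}
```

The remaining measures would follow the same pattern, each taking two frequency vectors and returning a single number.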

The entire data set is divided into two parts: a training set and a test set. Files are tokenized and each token is mapped to an equivalent integer code. Prominent attributes are then selected for each family, since dominant attributes are needed to represent the birthmark of a family. Reserved words that are present in more than 40% of the files of a family are considered prominent attributes for that family. A Frequency Vector Table (FVT) is created for each family from the prominent attributes and their frequencies. Using this FVT, the distance between the files of a family is computed (intra-family distance). The file with the minimum distance is selected as the base file/center file; it is the file most similar to all other files in that family. A base file is computed for a specific family with each of the 36 distance measures, so the same file or different files may be obtained as the base file across these measures. The distances between the base file and all other files in the family/nonfamily are computed to fix the threshold. To classify a test file, the distance between the base file and the test file is evaluated; if the distance value lies in the threshold range for more than 50% of the distance measures, the file is said to belong to the corresponding family. The proposed method can be divided into six modules as follows:

1. Data set Preparation
2. Convert
3. Extract prominent attributes
4. Compute distance
5. Compute Threshold
6. Classify test files

A. Data set Preparation

Separating data into training and testing is a crucial part of evaluating data mining models. Using similar data for training and testing can minimize the effect of data discrepancies and helps to better understand the characteristics of the model. The model can be evaluated by measuring the predictions against the test set.

B. Convert

A list of tokens is created, consisting of the built-in functions and operators used in the C language. The list is arranged in alphabetical order and each predefined word is assigned a unique integer code (refer to Table II).

TABLE II. TOKEN LIST

Tokens   Equivalent codes
char     148
do       185
int      348
%        645
++       646
--       648

In the convert module, each C file is first tokenized and then converted into its equivalent integer codes. These tokens can be identifiers, keywords, constants, operators, labels, etc. Consider the mapping of a statement to unique codes as shown in Table III.

TABLE III. CONVERTING STATEMENTS TO INTEGER CODES

Statement        Tokenized output   Corresponding codes
int a, b=10;     int                348
                 a                  1000
                 ,                  679
                 b                  1000
                 =                  663
                 10                 1001
                 ;                  685
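The convert step can be illustrated with a minimal Python sketch that maps the statement of Table III to integer codes. It is not the authors' tokenizer: the regular expression, the function name convert and the choice to skip tokens without an entry are assumptions. The codes for `,`, `=` and `;` are assumed by reading the codes of Table III in statement order, and the dictionary holds only the handful of codes shown in Tables II and III, not the full token list.

```python
import re

# Partial token list taken from Tables II and III (the full list is not given).
TOKEN_CODES = {
    "char": 148, "do": 185, "int": 348,
    "%": 645, "++": 646, "--": 648,
    ",": 679, "=": 663, ";": 685,   # assumed from the order of codes in Table III
}
IDENTIFIER_CODE = 1000  # Table III maps both a and b to 1000
CONSTANT_CODE = 1001    # Table III maps the constant 10 to 1001

# Identifiers/keywords, integer constants, two-character operators, then any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[^\s\w]")

def convert(source):
    """Return the list of integer codes for a fragment of C source text."""
    codes = []
    for tok in TOKEN_RE.findall(source):
        if tok in TOKEN_CODES:
            codes.append(TOKEN_CODES[tok])
        elif tok[0].isdigit():
            codes.append(CONSTANT_CODE)
        elif tok[0].isalpha() or tok[0] == "_":
            codes.append(IDENTIFIER_CODE)
        # Assumption: symbols without an entry in the partial list are skipped.
    return codes

print(convert("int a, b=10;"))
```

For the statement `int a, b=10;` this yields the seven codes listed in Table III, in statement order.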

C. Extracting Prominent Attributes

Not every attribute is required to represent the features of a family. Attributes that are present in at least 40% of the files in a family are selected as prominent attributes, and the others are rejected.

Initially, the total number of programs in which the attribute is present and the frequency of the attribute are computed. If the program count is greater than 40% of the total files in a family, the attribute is selected as a candidate attribute for that family. The frequencies of these attributes are then used to find the distance between the files in a family. Consider Table IV, which represents the frequency of attributes in a family consisting of a total of 26 programs.

TABLE IV. EXAMPLE OF FREQUENCY LIST

Attribute   Frequency
138         26
144         26
148         2
164         3
178         19
..          ..
1005        26

Here attribute 138 is present in 26 files, attribute 148 is present in 2 files, and so on. In this case, attributes with a frequency greater than 40% of 26 (i.e. 10) are selected as prominent attributes, i.e. 138, 144, 178, ..., 1005 are selected. Refer to Table V.

TABLE V. EXAMPLE OF PROMINENT ATTRIBUTE FREQUENCY LIST

Attribute Frequency

138 26

144 26

178 19

1005 26
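The selection of prominent attributes and the construction of a frequency vector can be sketched as below. This is an illustrative sketch under stated assumptions: each file is assumed to be already converted to a list of integer codes, and the function names prominent_attributes and frequency_vector are hypothetical. The strict "greater than 40%" comparison follows the wording of the second paragraph of this subsection.

```python
def prominent_attributes(files, min_share=0.40):
    """files: list of integer-code lists, one per program of a family.

    Returns the sorted attributes present in more than min_share of the files.
    """
    program_count = {}
    for codes in files:
        for attr in set(codes):          # count each attribute once per file
            program_count[attr] = program_count.get(attr, 0) + 1
    cutoff = min_share * len(files)      # e.g. 0.40 * 26 = 10.4
    return sorted(a for a, n in program_count.items() if n > cutoff)

def frequency_vector(codes, attributes):
    """Frequency of each prominent attribute in one converted file."""
    return [codes.count(a) for a in attributes]
```

With a family of 26 files in which attributes 148 and 164 occur in only 2 and 3 files, those two attributes fall below the cutoff and are dropped, as in Table V.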

D. Compute distance between files in a family

Distances between the files in a family are computed to find their similarity. The smaller the distance, the more similar the files, and vice versa. A frequency vector table representing the frequencies of the prominent attributes is used for ascertaining distance. Here, 36 distance measures are used for verifying similarity between files. For each distance measure an n×n matrix is generated, where n is the total number of programs in the family. If files F1, F2, F3, ..., Fn belong to a family, then the frequency vector table of the family may be represented as shown in Table VI.

TABLE VI. FREQUENCY VECTOR TABLE

      A1   A2   A3   A4   ..   Ad
F1    P1   P2   P3   P4   ..   Pd
F2    Q1   Q2   Q3   Q4   ..   Qd
..
Fn    --   --   --   --   ..   --

Here A1, A2, ..., Ad represent the prominent attributes of the family. P1, P2, ..., Pd and Q1, Q2, Q3, ..., Qd are the frequencies of the prominent attributes in files F1 and F2.

Using the 36 different distance measures, the distances between the files are calculated. The distance matrix for Euclidean distance is represented as shown in Table VII, where Fi indicates a file and dEuc_ij is the Euclidean distance between program i and program j.

TABLE VII. DISTANCE MATRIX FOR EUCLIDEAN DISTANCE

      F1         F2         ..   Fn
F1    0          dEuc_12    ..   dEuc_1n
F2    dEuc_21    0          ..   dEuc_2n
..    ..         ..         ..   ..
Fn    dEuc_n1    dEuc_n2    ..   0


The average of the values in each row is subsequently computed. The average values are used to generate another matrix of size n×36, i.e. 36 distance values for each file, which is represented as shown in Table VIII.

TABLE VIII. AVERAGE DISTANCE MATRIX

      Euclidean   City_Block   ..   Additive Symmetric
F1    dEuc_1      dCB_1        ..   dAS_1
F2    dEuc_2      dCB_2        ..   dAS_2
F3    dEuc_3      dCB_3        ..   dAS_3
..    ..          ..           ..   ..
Fn    dEuc_n      dCB_n        ..   dAS_n

The minimum distance for each distance index is computed. The file with the minimum average distance value under a distance measure is selected as the base file for that particular measure. The base file can be the same or different across the 36 distance measures.
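The base file selection for one distance measure can be sketched as follows. The sketch assumes that the frequency vector table is held as a dictionary from file name to frequency vector, and it uses Euclidean distance as the example measure; the function name base_file is hypothetical. Following the text, each row average is taken over the whole row of the n×n matrix, including the zero on the diagonal.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def base_file(fvt, distance):
    """fvt: dict mapping file name -> frequency vector of prominent attributes.

    Returns (name, average distance) of the file whose row of the
    distance matrix has the smallest average.
    """
    names = list(fvt)
    best_name, best_avg = None, math.inf
    for a in names:
        row = [distance(fvt[a], fvt[b]) for b in names]  # one row of the n x n matrix
        avg = sum(row) / len(row)
        if avg < best_avg:
            best_name, best_avg = a, avg
    return best_name, best_avg

family = {"F1": [0, 0], "F2": [3, 4], "F3": [6, 8]}
print(base_file(family, euclidean))
```

In this toy family F2 lies between F1 and F3, so it has the smallest average distance and is chosen as the base file. Repeating the call with each of the 36 measures would give one base file per measure.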

E. Threshold Computation

A test file whose distance to the base file of a family lies in the threshold range for more than 50% of the distance measures is said to belong to that family. To generate the threshold for each distance measure in a family, the distance between the base file of that distance measure and the files of the family is computed.

If the number of files with a distance value less than the average distance is greater than the number of files with a distance greater than the average distance, the threshold range is selected as minimum distance to average distance; otherwise the range is selected as average distance to maximum distance.
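The threshold rule can be written as a short Python sketch. It assumes that the input is the list of distances between the base file of one measure and the files of the family; the function name threshold_range is hypothetical, and ties (equal counts below and above the average) fall into the second branch, as the wording of the rule implies.

```python
def threshold_range(distances):
    """distances: distances from the base file to the files of a family
    under one distance measure. Returns the (low, high) threshold range."""
    avg = sum(distances) / len(distances)
    below = sum(1 for d in distances if d < avg)
    above = sum(1 for d in distances if d > avg)
    if below > above:
        return (min(distances), avg)   # most files are closer than average
    return (avg, max(distances))       # otherwise use the upper part of the range

print(threshold_range([1, 2, 3, 10]))   # (1, 4.0)
print(threshold_range([1, 8, 9, 10]))   # (7.0, 10)
```

One such range would be computed per distance measure, giving 36 ranges per family.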

F. Classify test files (Unseen files)

Initially, test files are converted to equivalent integer codes. In order to check whether a file belongs to a family, the distance between the test file and the base file of the family is computed. For the distance calculation, the frequency vector table of the test file is generated using the prominent attributes of the family. A test file belongs to a family only if more than 50% of the distance measures assign the test file to that family.
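The majority rule for one family can be sketched as below. The data layout is an assumption: one dictionary holds the distance from the test file to the base file for each measure, and another holds the threshold range of the family for each measure. The function name classify and the inclusive range check are also assumptions, since the paper does not say whether the range bounds are inclusive.

```python
def classify(test_distances, thresholds):
    """test_distances: dict measure -> distance between test file and base file.
    thresholds: dict measure -> (low, high) threshold range of the family.

    Returns (belongs, share) where share is the fraction of measures
    whose threshold range contains the test distance.
    """
    votes = sum(
        1 for m, (low, high) in thresholds.items()
        if low <= test_distances[m] <= high   # assumption: bounds are inclusive
    )
    share = votes / len(thresholds)
    return share > 0.5, share

thresholds = {"euclidean": (0, 5), "city_block": (0, 7),
              "chebyshev": (0, 4), "canberra": (0, 2)}
test = {"euclidean": 3, "city_block": 9, "chebyshev": 2, "canberra": 1}
print(classify(test, thresholds))   # (True, 0.75)
```

A file that does not pass this rule for any family would be reported as nonfamily.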

IV. EXPERIMENTAL RESULTS

The database was collected from the programs of 26 students. Four families of C programs were taken for training: operations on binary trees, double-ended queues (Dequeue), circular queues (Cqueue) and matrices. Each family consists of 26 programs prepared by students. 24 files belonging to different families were considered for testing. All test files are classified correctly into their respective families (refer to Table IX).

TABLE IX. ACCURACY

Files          Original class   Predicted class   TD (%)   FD (%)
BTREE_2.txt    Btree            Btree             97.22    2.78
BTREE_3.txt    Btree            Btree             97.22    2.78
BTREE_4.txt    Btree            Btree             77.78    22.22
BTREE_5.txt    Btree            Btree             97.22    2.78
BTREE_6.txt    Btree            Btree             97.22    2.78
BTREE_7.txt    Btree            Btree             97.22    2.78
cq28.txt       Crque            Crque             80.56    19.44
cq29.txt       Crque            Crque             80.56    19.44
cq30.txt       Crque            Crque             83.33    16.67
dq29.txt       Dqueue           Dqueue            72.22    27.78
dq30.txt       Dqueue           Dqueue            77.78    22.22
dqq2.txt       Dqueue           Dqueue            91.67    8.33
MATRX_3.txt    Matrix           Matrix            97.22    2.78

Here, True Detection (TD) represents the percentage of distance measures that classify a test file T1 into its family F1, and False Detection (FD) represents the percentage of distance measures that fail to classify T1 into family F1. The remaining programs, which do not belong to any of the families, are predicted as nonfamily (refer to Table X).
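Since TD and FD are percentages over the 36 distance measures, the values in Table IX are consistent with whole numbers of agreeing measures; for example, 97.22% corresponds to 35 of 36 measures and 77.78% to 28 of 36. The paper does not report these counts, so the counts below are inferred from the arithmetic only, and the function name detection_rates is hypothetical.

```python
def detection_rates(agreeing, total=36):
    """Return (TD, FD) in percent, rounded to two decimals,
    for a given number of measures that assign the file to its family."""
    td = round(100 * agreeing / total, 2)
    fd = round(100 * (total - agreeing) / total, 2)
    return td, fd

print(detection_rates(35))   # (97.22, 2.78)
print(detection_rates(28))   # (77.78, 22.22)
```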

TABLE X. FILES CLASSIFIED AS NONFAMILY

Files          Original class   Predicted class
LEXical.txt    Token            Nonfamily
POLY_13.txt    Polynomial       Nonfamily
POLY_14.txt    Polynomial       Nonfamily
POLY_15.txt    Polynomial       Nonfamily
POLY_16.txt    Polynomial       Nonfamily
POLY_17.txt    Polynomial       Nonfamily
s2.txt         Sparse Matrix    Nonfamily





V. CONCLUSION

A model was developed to classify C source code. Four families of programs were taken, each containing 26 program files. Twenty-four files of various families were considered for testing. The dataset is divided into two parts: (a) a training set and (b) a test set. The C program files are converted into integer codes and prominent attributes are extracted. The frequency of each attribute is computed and a frequency vector table is derived using the frequencies of each selected attribute. Thirty-six distance measures are used to show the similarity between files. For every distance measure, a base file, which is the file with the minimum distance, is determined. A threshold range is generated for all 36 distance measures. The distance between the base files of a family and the test files is calculated. Among the 36 distance measures, 97.22% correctly classify the test files. Hence we conclude that source code plagiarism can be detected using similarity measurement techniques. The frequency space used for classification can be further reduced by extracting a signature for a family using Multiple Sequence Alignment techniques.

VI. REFERENCES

[1] Cha, Sung-Hyuk, Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions, In the Proc. of International Journal of Mathematical Models and Methods in Applied Sciences, Volume 1, 2007, pp: 300-307.

[2] A. S. Bin-Habtoor, M. A. Zaher, A Survey on Plagiarism Detection Systems, International Journal of Computer Theory and Engineering, Volume 4, 2012.

[3] Cosma, Georgina and Joy, Mike, Towards a Definition of Source-Code Plagiarism, IEEE Trans. Education, Volume 51, 2008, pp: 195-200.

[4] H. Maurer, F. Kappe, and B. Zaka, "Plagiarism - A Survey," Journal of Universal Computer Science, vol. 12, 2006, pp: 1050-1084.

[5] Arwin, Christian and Tahaghoghi, S. M. M., Plagiarism Detection Across Programming Languages, In the Proc. of the 29th Australasian Computer Science Conference, Volume 48, ACSC '06, 2006, pp: 277-286.

[6] Doreswamy, M. G. Manohar, K. S. Hemanth, A Study on Similarity Measure Functions on Engineering Materials Selection, In the Proc. of the 1st International Conference on Artificial Intelligence, Soft Computing and Applications, Volume 1, AIAA-2011, 2011, pp: 157-168.

[7] Kustanto, Cynthia and Liem, Inggriani, Automatic Source Code Plagiarism Detection, In the Proc. of IEEE Computer Society, SNPD '09, 2009, pp: 481-486.

[8] Wolfgang Granzer, Friedrich Praus, Peter Balog, "Source Code Plagiarism in Computer Engineering Courses," Journal on Systemics, Cybernetics and Informatics, Volume 11, 2013.

[9] Hao Xiong, Haihua Yan, Zhoujun Li, Hu Li, "BUAA_AntiPlagiarism: A System To Detect Plagiarism For C Source Code," International Conference on Computational Intelligence and Software Engineering, 2009, pp: 1-5.

[10] Georgina Cosma and Mike Joy, "An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis," IEEE Transactions on Computers, vol. 61, no. 3, March 2012.

[11] Li, Xiao and Zhong, Xiao Jing, The Source Code Plagiarism Detection Using AST, In the Proc. of the 2010 International Symposium on Intelligence Information Processing and Trusted Computing, 2010, pp: 406-408.

[12] http://www.historians.org/about-aha-and-membership/governance/policies-and-documents/statement-on-plagiarism.

[13] Brixtel, Romain and Fontaine, Mathieu and Lesner, Boris and Bazin, Cyril and Robbes, Romain, Language-Independent Clone Detection Applied to Plagiarism Detection, SCAM '10, IEEE Computer Society, 2010, pp: 77-86.

[14] Muddu, Basavaraju and Asadullah, Allahbaksh M. and Bhat, Vasudev D., CPDP: A robust technique for plagiarism detection in source code, IEEE, IWSC '13, 2013, pp: 39-45.

[15] Krinke, Jens, Identifying Similar Code with Program Dependence Graphs, Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE'01), 2001, pp: 301--

AUTHOR'S BIOGRAPHY

Julie Baby completed her B.Tech in CSE from RCET, Kerala. She is currently pursuing an M.Tech in Computer Science with specialization in Information Systems at SCMS School of Engineering and Technology, Kerala. Her area of interest is Information Security.

Kannan T completed his B.Tech in CSE from SCMS School of Engineering and Technology, Kerala. He is currently working at Aricent Technologies (Holdings) Ltd, Gurgaon.

Vinod P. completed his B.Tech in CSE from RGPV, Madhya Pradesh, his M.Tech in Information Technology from RGPV, Madhya Pradesh, and his Ph.D in Computer Engineering from NIT Jaipur. He has 2 journal papers, 4 book chapters and 30 international papers to his credit. He has executed a project entitled PROSIM: Probabilistic Signature for metamorphic malware detection, funded by MCIT, New Delhi. He is presently working as Associate Professor in the Department of CSE, SSET, Karukutty. His areas of interest are Information Security, Pattern Analysis and Cryptography.

Viji Gopal did her B.Tech in Computer Science and Engineering at College of Engineering, Karunagappally, and her M.E. in Computer Science and Engineering at RMK College of Engineering, Chennai. She is currently working as Assistant Professor in the Dept of Computer Science and Engineering at SCMS School of Engineering and Technology, Karukutty. She has a total of 8 years of experience in teaching. Her areas of interest are networking and data mining. She has 3 international publications and 4 national publications.
