2014 INTERNATIONAL CONFERENCE ON COMPUTATION OF POWER, ENERGY, INFORMATION AND COMMUNICATION (ICCPEIC)
Distance Indices for the Detection of Similarity in C programs
Julie Baby, Kannan T, Vinod P, Viji Gopal Department of Computer Science & Engineering
SCMS School of Engineering and Technology, Ernakulam, Kerala, India {jbnellickal, kannanO 16, pvinod21}@gmail.com, [email protected]
Abstract: Plagiarized articles and source code have proliferated among the student and research community. This paper focuses on an efficient method that can differentiate between plagiarized and nonplagiarized programs. Similarity/distance measurement techniques are used to classify a test file. Thirty-six distance metrics are used to determine intra-class and inter-class proximity. Unseen files, not used for frequency extraction, are predicted with high accuracy. This shows that the proposed model, which uses intra/inter-family thresholds, can be implemented to identify plagiarized programs with a better detection rate.
Keywords- Plagiarism, Family, Distance, Similarity and Attributes.
I. INTRODUCTION
The word plagiarism has its roots in two Latin words: plagiarius, an abductor, and plagiare, to steal [12]. The evolution of technology, especially the Internet, has changed the way people think. Plagiarism can be defined as the theft of another person's work, words or ideas and the presentation of them as one's own. Students have many possible motivations for plagiarism: poor time management skills, the cost of failure in time or financial resources, group pressure, values or expectations, friendship or the desire to help a classmate, a negative perception of the teacher, and inadequate hardware, software, library resources or access to teachers and staff. To some extent, plagiarism can be avoided by acknowledging that the material is taken from another source and providing the necessary information about that source.
Plagiarism can be categorized as copy-paste plagiarism, paraphrasing, translated plagiarism, artistic plagiarism, idea plagiarism and code plagiarism. Here we are mainly concerned with source code plagiarism, especially in C programs. Source code plagiarism is the process of reusing the source code of a program to create another program that appears visually different, using a small number of routine transformations that do not require a detailed understanding of the program. A plagiarized program can be created with a few editor operations [2]. The increasing number of plagiarized documents calls for a system that can detect plagiarism.
This paper focuses on an efficient method that can differentiate between plagiarized and nonplagiarized programs using distance indices. Thirty-six different distance measures are used to find the distance between the files of each family by comparing their frequency vector tables. A base file is created for each family using each distance measure. The file that has the minimum distance to the other files of the family is known as the base file (center file). A threshold range is determined by calculating the distance score between the base file and the other files of the family and nonfamily. A test file is assigned to a family if more than 50% of the distance measures fall within the threshold of that family.
This paper is organized as follows. Section II describes the related work. Section III discusses the proposed methodology. Section IV presents the experimental results and findings. Section V gives the conclusion and future work.
II. RELATED WORK
In [1], the author discusses various distance/similarity measures. These measures are categorized based on three aspects: (a) syntactic similarity, (b) implementation caveats and (c) semantics. In [2], the authors survey different plagiarism detection systems. The survey is divided into four categories, namely plagiarism in documents, plagiarism in code, plagiarism techniques and algorithms used for plagiarism detection. The authors also propose a system that can detect different plagiarism attempts.
In [5], the authors propose an inter-lingual plagiarism detection tool. It compares the intermediate code produced by the compiler suite for the same programs in two languages. The proposed methodology finds plagiarized programs written in the C language. In [3], the authors review academics' viewpoints on source-code plagiarism in an undergraduate student context. They consider differences in opinion on source-code issues and provide a definition of source-code plagiarism. In [4], the authors investigate textual plagiarism.
In [7], the authors developed a source code plagiarism detector called Deimos, which can be extended to handle plagiarism in other programming languages by implementing new scanners and parsers. The result of this detector cannot be used as a final decision on whether a document is plagiarized; rather, it can be used as an input or suggestion for manual detection. In [8], the authors review source code plagiarism. Methods are divided into (a) those that detect plagiarism and (b) those that prevent plagiarism. The paper also describes automatic tools that assist in detecting plagiarism.
In [9], the authors introduce the BUAA_AntiPlagiarism system, which detects source code plagiarism through the analysis of program syntax structure. After calculating the pairwise similarities, the system outputs a group of clusters of all suspected plagiarized programs.
978-1-4799-3826-1/14/$31.00 ©2014 IEEE
In [10], the authors introduce a tool called PlaGate that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The tool also supports investigation of the similarity between source-code files. In [11], the authors describe a source code plagiarism detection method that identifies plagiarism accurately even when the plagiarist changes the position of a function.
In [13], the authors introduce a method that detects plagiarized documents among source code files. This method considers documents to be plagiarized if they are more similar than the average similarity between the documents. The method is divided into six stages: (a) pre-filtering, (b) segmentation and similarity measurement, (c) segment matching, (d) post-filtering, (e) document-wise distance evaluation and (f) corpus analysis presentation. In [14], the authors describe a technique that detects plagiarism in source code. The technique transforms the code into a reusable index, which is queried against a set of input files to find the plagiarized code. In [15], the author discusses a method that identifies similar program code using maximal similar subgraphs. This approach considers both the syntactic structure and the data flow within the program.
III. PROPOSED METHOD
In this section, we propose a method that uses distance measures to find plagiarism in C programs. Similarity/distance is defined as a quantitative degree that enumerates the logical separation of two objects represented by a set of measurable attributes [6]. We detect code plagiarism by calculating a similarity score between the programs to be compared. Thirty-six different similarity/distance measures are used. Figure 1 shows the proposed method and Table I shows the distance measures.
Figure 1. Proposed Method
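TABLE I. DISTANCE MEASURES
In every equation, $P_i$ and $Q_i$ are the frequencies of the $i$-th prominent attribute in the two files being compared, and $N$ is the number of prominent attributes.
Distance name    Equation
Euclidean Distance    $d = \sqrt{\sum_{i=1}^{N} |P_i - Q_i|^2}$
CityBlock Distance    $d = \sum_{i=1}^{N} |P_i - Q_i|$
Chebyshev Distance    $d = \max_i |P_i - Q_i|$
Sorensen Distance    $d = \frac{\sum_{i=1}^{N} |P_i - Q_i|}{\sum_{i=1}^{N} (P_i + Q_i)}$
Gower Distance    $d = \frac{1}{N} \sum_{i=1}^{N} |P_i - Q_i|$
Soergel Distance    $d = \frac{\sum_{i=1}^{N} |P_i - Q_i|}{\sum_{i=1}^{N} \max(P_i, Q_i)}$
Kulczynski Distance    $d = \frac{\sum_{i=1}^{N} |P_i - Q_i|}{\sum_{i=1}^{N} \min(P_i, Q_i)}$
Canberra Distance    $d = \sum_{i=1}^{N} \frac{|P_i - Q_i|}{P_i + Q_i}$
Lorentzian Distance    $d = \sum_{i=1}^{N} \ln(1 + |P_i - Q_i|)$
Intersection Distance    $d = \sum_{i=1}^{N} \min(P_i, Q_i)$
Ruzicka Distance    $d = \frac{\sum_{i=1}^{N} \min(P_i, Q_i)}{\sum_{i=1}^{N} \max(P_i, Q_i)}$
Tanimoto Distance    $d = \frac{\sum_{i=1}^{N} \max(P_i, Q_i) - \sum_{i=1}^{N} \min(P_i, Q_i)}{\sum_{i=1}^{N} \max(P_i, Q_i)}$
Czekanowski Distance    $d = \frac{2 \sum_{i=1}^{N} \min(P_i, Q_i)}{\sum_{i=1}^{N} (P_i + Q_i)}$
Harmonic mean Distance    $d = 2 \sum_{i=1}^{N} \frac{P_i Q_i}{P_i + Q_i}$
Dice Distance    $d = \frac{2 \sum_{i=1}^{N} P_i Q_i}{\sum_{i=1}^{N} P_i^2 + \sum_{i=1}^{N} Q_i^2}$
Bhattacharyya Distance    $d = -\ln \sum_{i=1}^{N} \sqrt{P_i Q_i}$
Hellinger Distance    $d = 2 \sqrt{1 - \sum_{i=1}^{N} \sqrt{P_i Q_i}}$
Matusita Distance    $d = \sqrt{2 - 2 \sum_{i=1}^{N} \sqrt{P_i Q_i}}$
Squared-chord Distance    $d = \sum_{i=1}^{N} (\sqrt{P_i} - \sqrt{Q_i})^2$
Squared Euclidean Distance    $d = \sum_{i=1}^{N} (P_i - Q_i)^2$
Neyman Distance    $d = \sum_{i=1}^{N} \frac{(P_i - Q_i)^2}{P_i}$
Squared Distance    $d = \sum_{i=1}^{N} \frac{(P_i - Q_i)^2}{P_i + Q_i}$
Probabilistic symmetric Distance    $d = 2 \sum_{i=1}^{N} \frac{(P_i - Q_i)^2}{P_i + Q_i}$
Divergence Distance    $d = 2 \sum_{i=1}^{N} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$
Clark Distance    $d = \sqrt{\sum_{i=1}^{N} \left( \frac{|P_i - Q_i|}{P_i + Q_i} \right)^2}$
Kullback-Leibler Distance    $d = \sum_{i=1}^{N} P_i \ln \frac{P_i}{Q_i}$
Jeffreys Distance    $d = \sum_{i=1}^{N} (P_i - Q_i) \ln \frac{P_i}{Q_i}$
K divergence Distance    $d = \sum_{i=1}^{N} P_i \ln \frac{2 P_i}{P_i + Q_i}$
Jensen-Shannon Distance    $d = \frac{1}{2} \left[ \sum_{i=1}^{N} P_i \ln \frac{2 P_i}{P_i + Q_i} + \sum_{i=1}^{N} Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right]$
Jensen Difference Distance    $d = \sum_{i=1}^{N} \left[ \frac{P_i \ln P_i + Q_i \ln Q_i}{2} - \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2} \right]$
Topsoe Distance    $d = \sum_{i=1}^{N} \left[ P_i \ln \frac{2 P_i}{P_i + Q_i} + Q_i \ln \frac{2 Q_i}{P_i + Q_i} \right]$
Taneja Distance    $d = \sum_{i=1}^{N} \frac{P_i + Q_i}{2} \ln \frac{P_i + Q_i}{2 \sqrt{P_i Q_i}}$
Kumar Johnson Distance    $d = \sum_{i=1}^{N} \frac{(P_i^2 - Q_i^2)^2}{2 (P_i Q_i)^{3/2}}$
Average Distance    $d = \frac{\sum_{i=1}^{N} |P_i - Q_i| + \max_i |P_i - Q_i|}{2}$
Wave Hedges Distance    $d = \sum_{i=1}^{N} \frac{|P_i - Q_i|}{\max(P_i, Q_i)}$
Additive Symmetric Distance    $d = \sum_{i=1}^{N} \frac{(P_i - Q_i)^2 (P_i + Q_i)}{P_i Q_i}$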
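As an illustration of how a few of the measures in Table I can be evaluated on two attribute-frequency vectors, the following is a minimal Python sketch. The function names, the sample vectors `p` and `q`, and the handling of zero denominators are assumptions made for this example; the paper does not describe its implementation.

```python
import math

def euclidean(p, q):
    # Square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    # Sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    # Largest absolute difference over all attributes.
    return max(abs(a - b) for a, b in zip(p, q))

def sorensen(p, q):
    # City block distance normalized by the total frequency mass.
    # Assumption: return 0.0 when both vectors are all zeros.
    total = sum(a + b for a, b in zip(p, q))
    return city_block(p, q) / total if total else 0.0

def canberra(p, q):
    # Per-attribute normalized differences.
    # Assumption: attributes absent from both files contribute nothing.
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b != 0)

# Hypothetical frequency vectors of two files over four prominent attributes.
p = [3, 0, 5, 2]
q = [1, 4, 5, 0]

for measure in (euclidean, city_block, chebyshev, sorensen, canberra):
    print(measure.__name__, measure(p, q))
```

The remaining measures follow the same pattern; those involving logarithms or division by $P_i$, $Q_i$ or $P_i Q_i$ would additionally need a convention for zero frequencies, which the paper does not specify.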
The entire data set is divided into a training set and a test set. Files are tokenized and each token is mapped to an equivalent integer code. Prominent attributes are then selected for each family; these dominant attributes serve as the birthmark of the family. Reserved words that are present in more than 40% of the files of a family are considered prominent attributes for that family. A Frequency Vector Table (FVT) is created for each family from the prominent attributes and their frequencies. Using this FVT, the distance between the files of a family (the intra-family distance) is computed. The file with the minimum distance is selected as the base file (center file); it is the file most similar to all other files in that family. The base file of a family is computed separately with each of the 36 distance measures, so the same file or different files may be obtained as base file across measures. Distances between the base file and all other files, both in the family and outside it, are computed to fix the threshold. To classify a test file, the distance between the base file and the test file is evaluated; if the distance lies within the threshold range for more than 50% of the distance measures, the file is said to belong to the corresponding family. The proposed method can be divided into six modules as follows:
1. Data set Preparation
2. Convert
3. Extract prominent attributes
4. Compute distance
5. Compute Threshold
6. Classify test files
A. Data set Preparation
Separating data into training and test sets is a crucial part of evaluating data mining models. Using similar data for training and testing can minimize the effect of data discrepancies and help to better understand the characteristics of the model. The model can be evaluated by measuring its predictions against the test set.
B. Convert
A list of tokens is created, consisting of the built-in functions and operators used in the C language. The list is arranged in alphabetical order and each predefined word is assigned a unique integer code (refer to Table II).
TABLE II. TOKEN LIST
Tokens    Equivalent codes
char      148
do        185
int       348
%         645
++        646
--        648
In the convert module, each C file is first tokenized and then converted into its equivalent integer codes. The tokens can be identifiers, keywords, constants, operators, labels, etc. Consider the mapping of a statement to its unique codes shown in Table III.
TABLE III. CONVERTING STATEMENTS TO INTEGER CODES
Statement       Tokenized output    Corresponding codes
int a, b=10;    int                 348
                a                   1000
                ,                   679
                b                   1000
                =                   663
                10                  1001
                ;                   685
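The conversion illustrated in Table III can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the regular expression, the function names `tokenize` and `to_codes`, and the rule that every identifier maps to 1000 and every numeric constant to 1001 are assumptions inferred from the codes that appear in Tables II and III, and the token dictionary contains only the entries visible in those tables.

```python
import re

# Codes as they appear in Tables II and III; the full token list is not given in the paper.
TOKEN_CODES = {
    "char": 148, "do": 185, "int": 348,
    "%": 645, "++": 646, "--": 648,
    "=": 663, ",": 679, ";": 685,
}
IDENTIFIER_CODE = 1000  # assumption: all identifiers share one code
CONSTANT_CODE = 1001    # assumption: all numeric constants share one code

# Words, integer literals, the two-character operators ++ and --, then any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[^\s\w]")

def tokenize(source):
    """Split a C statement into a flat list of token strings."""
    return TOKEN_RE.findall(source)

def to_codes(source):
    """Map each token of a C statement to its integer code."""
    codes = []
    for tok in tokenize(source):
        if tok in TOKEN_CODES:
            codes.append(TOKEN_CODES[tok])
        elif tok[0].isdigit():
            codes.append(CONSTANT_CODE)
        elif tok[0].isalpha() or tok[0] == "_":
            codes.append(IDENTIFIER_CODE)
        else:
            # Symbols outside the small illustrative table are rejected.
            raise ValueError(f"no code known for token {tok!r}")
    return codes

print(to_codes("int a, b=10;"))
```

A real converter would need the complete token list and would also have to handle comments, string literals and multi-character operators beyond `++` and `--`.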
C. Extracting Prominent Attributes
Not every attribute is required to represent the features of a family. Attributes that are present in at least 40% of the files in a family are selected as prominent attributes, and the others are rejected.
Initially, the number of programs in which each attribute is present and the frequency of the attribute are computed. If the program count is greater than 40% of the total files in a family, the attribute is selected as a candidate attribute for that family. The frequencies of these attributes are then used to find the distance between the files in the family. Consider Table IV, which represents the frequency of attributes in a family consisting of 26 programs in total.
TABLE IV. EXAMPLE OF FREQUENCY LIST
Attribute    Frequency
138          26
144          26
148          2
164          3
178          19
..           ..
1005         26
Here attribute 138 is present in 26 files, attribute 148 is present in 2 files, and so on. In this case, attributes with a frequency greater than 40% of 26 (i.e., about 10) are selected as prominent attributes, so 138, 144, 178, ..., 1005 are selected. Refer to Table V.
TABLE V. EXAMPLE OF PROMINENT ATTRIBUTE FREQUENCY LIST
Attribute Frequency
138 26
144 26
178 19
1005 26
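The selection step shown in Tables IV and V can be sketched in Python as below. The function names and the synthetic `files` list are hypothetical; the list is built only so that the per-attribute file counts match the visible rows of Table IV. The sketch uses a strict "greater than 40%" cutoff, which is one of the two phrasings used in the text.

```python
from collections import Counter

def prominent_attributes(files, min_fraction=0.4):
    """Return the attribute codes present in more than min_fraction of the files."""
    file_count = Counter()
    for codes in files:
        file_count.update(set(codes))  # count each attribute at most once per file
    cutoff = min_fraction * len(files)
    return sorted(attr for attr, count in file_count.items() if count > cutoff)

def frequency_vector(codes, attributes):
    """Count how often each prominent attribute occurs in one file."""
    counts = Counter(codes)
    return [counts[attr] for attr in attributes]

# Synthetic family of 26 files whose per-attribute file counts match Table IV.
files = []
for i in range(26):
    codes = [138, 144, 1005]
    if i < 19:
        codes.append(178)
    if i < 3:
        codes.append(164)
    if i < 2:
        codes.append(148)
    files.append(codes)

selected = prominent_attributes(files)
print(selected)                               # attributes kept for the family
print(frequency_vector(files[0], selected))   # one row of the frequency vector table
```

With 26 files the cutoff is 10.4, so attributes 148 and 164 (present in 2 and 3 files) are rejected, matching Table V.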
D. Compute distance between files in a family
Distances between the files in a family are computed to find their similarity. The smaller the distance, the more similar the files, and vice versa. A frequency vector table representing the frequencies of the prominent attributes is used for ascertaining distance. Here 36 distance measures are used for verifying similarity between files. For each distance measure an n x n matrix is generated, where n is the total number of programs in the family. If files F1, F2, F3, ..., Fn belong to a family, then the frequency vector table of the family may be represented as shown in Table VI.
TABLE VI. FREQUENCY VECTOR TABLE
      A1   A2   A3   A4   ..   Ad
F1    P1   P2   P3   P4   ..   Pd
F2    Q1   Q2   Q3   Q4   ..   Qd
..    ..   ..   ..   ..   ..   ..
Fn    --   --   --   --   ..   --
Here A1, A2, ..., Ad represent the prominent attributes of the family. P1, P2, ..., Pd and Q1, Q2, Q3, ..., Qd are the frequencies of the prominent attributes in files F1 and F2, respectively.
Using the 36 different distance measures, distances between the files are calculated. The distance matrix for Euclidean distance is represented as shown in Table VII, where Fi indicates a file and dEucij is the Euclidean distance between program i and program j.
TABLE VII. DISTANCE MATRIX FOR EUCLIDEAN DISTANCE
      F1        F2        ..   Fn
F1    0         dEuc12    ..   dEuc1n
F2    dEuc21    0         ..   dEuc2n
..    ..        ..        ..   ..
Fn    dEucn1    dEucn2    ..   0
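A pairwise distance matrix of the kind shown in Table VII can be built from a frequency vector table as in the following Python sketch. The helper names and the three-file table `fvt` are hypothetical; only the Euclidean measure is shown, but any function taking two vectors could be passed in its place.

```python
import math

def euclidean(p, q):
    # Euclidean distance between two frequency vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(fvt, dist):
    """Return the n x n matrix of distances between all rows of the table."""
    n = len(fvt)
    return [[dist(fvt[i], fvt[j]) for j in range(n)] for i in range(n)]

# Hypothetical frequency vector table: 3 files, 3 prominent attributes.
fvt = [
    [3, 4, 0],
    [0, 0, 0],
    [3, 0, 0],
]

matrix = distance_matrix(fvt, euclidean)
for row in matrix:
    print(row)
```

The matrix is symmetric with a zero diagonal for Euclidean distance. Several measures in Table I (for example Neyman or Kullback-Leibler) are not symmetric, so the full matrix is computed rather than only its upper triangle.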
The average of the values in each row is subsequently computed. These average values are used to generate another matrix of size n x 36, i.e., 36 distance values for each file, as shown in Table VIII.
TABLE VIII. AVERAGE DISTANCE MATRIX
      Euclidean   City_Block   ..   Additive Symmetric
F1    dEuc1       dCB1         ..   dAS1
F2    dEuc2       dCB2         ..   dAS2
F3    dEuc3       dCB3         ..   dAS3
..    ..          ..           ..   ..
Fn    dEucn       dCBn         ..   dASn
The minimum average distance for each distance index is then computed. The file with the minimum value for a distance measure is selected as the base file for that measure. The base file can be the same or different across the 36 distance measures.
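Base file selection for a single distance measure can be sketched as follows. The function names and the sample matrix are hypothetical. The row average here includes the zero self-distance on the diagonal, which is an assumption; excluding it would scale every average by the same factor and would not change which file is selected.

```python
def row_averages(matrix):
    """Average distance from each file to the files of its family."""
    return [sum(row) / len(row) for row in matrix]

def base_file_index(matrix):
    """Index of the file with the smallest average distance (the base file)."""
    averages = row_averages(matrix)
    return min(range(len(averages)), key=averages.__getitem__)

# Hypothetical 3 x 3 distance matrix for one measure.
matrix = [
    [0.0, 5.0, 4.0],
    [5.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]

print(row_averages(matrix))
print(base_file_index(matrix))
```

Repeating this for each of the 36 matrices gives one base file per measure, and one column of the average distance matrix per measure.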
E. Threshold Computation
Any test file whose distance to the base file of a family lies within the threshold range for more than 50% of the distance measures is said to belong to that family. To generate the threshold for each distance measure in a family, the distance between the base file of that measure and the files of the family is computed.
If the number of files with a distance less than the average distance is greater than the number of files with a distance greater than the average distance, the threshold range is taken from the minimum distance to the average distance; otherwise the range is taken from the average distance to the maximum distance.
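The threshold rule above can be sketched in Python for a single distance measure. The function name and the sample distance lists are hypothetical, and the sketch considers only the distances from the base file to the other files of its own family; how the nonfamily distances enter the threshold is not detailed in the text.

```python
def threshold_range(distances):
    """Return (low, high) for one measure from base-file-to-family distances."""
    average = sum(distances) / len(distances)
    below = sum(1 for d in distances if d < average)
    above = sum(1 for d in distances if d > average)
    if below > above:
        # Most files are closer than average: minimum up to the average.
        return (min(distances), average)
    # Otherwise: average up to the maximum.
    return (average, max(distances))

# Hypothetical distances from a base file to the other family members.
print(threshold_range([1, 2, 2, 3, 12]))    # most files lie below the average
print(threshold_range([1, 9, 10, 10, 10]))  # most files lie above the average
```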
F. Classify test files (Unseen files)
Initially, test files are converted to their equivalent integer codes. To check whether a file belongs to a family, the distance between the test file and the base file of the family is computed. For the distance calculation, the frequency vector of the test file is generated using the prominent attributes of the family. A test file belongs to a family only if more than 50% of the distance measures assign it to that family.
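The majority vote over distance measures can be sketched as follows. The function name, the three measure names and all numeric values are hypothetical, and treating the threshold bounds as inclusive is an assumption; the paper uses 36 measures rather than three.

```python
def belongs_to_family(test_distances, thresholds):
    """True if the test file falls inside the threshold for more than half the measures."""
    votes = 0
    for name, (low, high) in thresholds.items():
        if low <= test_distances[name] <= high:
            votes += 1
    return votes > 0.5 * len(thresholds)

# Hypothetical thresholds of one family for three measures.
thresholds = {
    "euclidean": (1.0, 4.0),
    "city_block": (2.0, 9.0),
    "chebyshev": (1.0, 3.0),
}

# Distances from two hypothetical test files to the family's base files.
inside = {"euclidean": 3.5, "city_block": 8.0, "chebyshev": 5.0}   # 2 of 3 within range
outside = {"euclidean": 6.0, "city_block": 8.0, "chebyshev": 5.0}  # 1 of 3 within range

print(belongs_to_family(inside, thresholds))
print(belongs_to_family(outside, thresholds))
```

A file that fails this test for every family would be reported as nonfamily.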
IV. EXPERIMENTAL RESULTS
The database was collected from programs written by 26 students. Four families of C programs were taken for training: operations on a binary tree, a double-ended queue (Dequeue), a circular queue (Cqueue) and a matrix. Each family consists of 26 programs prepared by students. Twenty-four files belonging to different families were considered for testing. All test files were classified correctly into their respective families (refer to Table IX).
TABLE IX. ACCURACY
Files          Original Class   TD (%)   FD (%)   Predicted Class
BTREE_2.txt    Btree            97.22    2.78     Btree
BTREE_3.txt    Btree            97.22    2.78     Btree
BTREE_4.txt    Btree            77.78    22.22    Btree
BTREE_5.txt    Btree            97.22    2.78     Btree
BTREE_6.txt    Btree            97.22    2.78     Btree
BTREE_7.txt    Btree            97.22    2.78     Btree
cq28.txt       Crque            80.56    19.44    Crque
cq29.txt       Crque            80.56    19.44    Crque
cq30.txt       Crque            83.33    16.67    Crque
dq29.txt       Dqueue           72.22    27.78    Dqueue
dq30.txt       Dqueue           77.78    22.22    Dqueue
dqq2.txt       Dqueue           91.67    8.33     Dqueue
MATRX_3.txt    Matrix           97.22    2.78     Matrix
Here, for a test file T1 that belongs to family F1, True Detection (TD) is the percentage of distance measures that classify T1 into F1, and False Detection (FD) is the percentage of distance measures that fail to do so. The remaining programs, which do not belong to any of the families, are predicted as nonfamily (refer to Table X).
TABLE X. FILES CLASSIFIED AS NONFAMILY
Files          Original class    Predicted class
LEXical.txt    Token             Nonfamily
POLY_13.txt    Polynomial        Nonfamily
POLY_14.txt    Polynomial        Nonfamily
POLY_15.txt    Polynomial        Nonfamily
POLY_16.txt    Polynomial        Nonfamily
POLY_17.txt    Polynomial        Nonfamily
s2.txt         Sparse Matrix     Nonfamily
s3.txt         Sparse Matrix     Nonfamily
s4.txt         Sparse Matrix     Nonfamily
s5.txt         Sparse Matrix     Nonfamily
s6.txt         Sparse Matrix     Nonfamily
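The TD and FD percentages in Table IX can be related to vote counts over the 36 distance measures, as in the small Python sketch below. The function name is hypothetical, and the vote counts are inferred from the reported percentages (for example, 97.22% is consistent with 35 of 36 measures); the paper does not list the counts themselves.

```python
def detection_rates(votes_for_true_family, total_measures=36):
    """Return (TD, FD) in percent, rounded to two decimals."""
    td = 100.0 * votes_for_true_family / total_measures
    return round(td, 2), round(100.0 - td, 2)

# Vote counts that reproduce the distinct TD/FD pairs reported in Table IX.
for votes in (35, 33, 30, 29, 28, 26):
    print(votes, detection_rates(votes))
```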
V. CONCLUSION
A model was developed to classify C source code. Four families of programs were taken, each containing 26 program files. Twenty-four files of various families were considered for testing. The dataset is divided into (a) a training set and (b) a test set. The C program files are converted into integer codes and prominent attributes are extracted. The frequency of each attribute is computed, and a frequency vector table is derived from the frequencies of the selected attributes. Thirty-six distance measures are used to quantify the similarity between files. For every distance measure, a base file, i.e., the file with the minimum distance, is determined. A threshold range is generated for all 36 distance measures. The distance between the base files of a family and the test files is then calculated. Among the 36 distance measures, 97.22% correctly classify the test files. Hence we conclude that source code plagiarism can be detected using similarity measurement techniques. The frequency space used for classification can be further reduced by extracting a signature for each family using Multiple Sequence Alignment techniques.
VI. REFERENCES
[1] Cha, Sung-Hyuk, Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions, In the Proc. of International Journal of Mathematical Models and Methods in Applied Sciences, Volume 1, 2007, pp: 300-307.
[2] A. S. Bin-Habtoor, M. A. Zaher, A Survey on Plagiarism Detection Systems, International Journal of Computer Theory and Engineering, Volume 4, 2012.
[3] Cosma, Georgina and Joy, Mike, Towards a Definition of Source-Code Plagiarism, IEEE Trans. Education, Volume 51, 2008, pp: 195-200.
[4] H. Maurer, F. Kappe, and B. Zaka, Plagiarism - A Survey, Journal of Universal Computer Science, vol. 12, 2006, pp: 1050-1084.
[5] Arwin, Christian and Tahaghoghi, S. M. M., Plagiarism Detection Across Programming Languages, In the Proc. of the 29th Australasian Computer Science Conference, Volume 48, ACSC '06, 2006, pp: 277-286.
[6] Doreswamy, M. G. Manohar, K. S. Hemanth, A Study on Similarity Measure Functions on Engineering Materials Selection, In the Proc. of the 1st International Conference on Artificial Intelligence, Soft Computing and Applications, Volume 1, AIAA-2011, 2011, pp: 157-168.
[7] Kustanto, Cynthia and Liem, Inggriani, Automatic Source Code Plagiarism Detection, In the Proc. of IEEE Computer Society, SNPD '09, 2009, pp: 481-486.
[8] Wolfgang Granzer, Friedrich Praus, Peter Balog, Source Code Plagiarism in Computer Engineering Courses, Journal on Systemics, Cybernetics and Informatics, Volume 11, 2013.
[9] Hao Xiong, Haihua Yan, Zhoujun Li, Hu Li, BUAA_AntiPlagiarism: A System To Detect Plagiarism For C Source Code, International Conference on Computational Intelligence and Software Engineering, 2009, pp: 1-5.
[10] Georgina Cosma and Mike Joy, An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis, IEEE Transactions on Computers, vol. 61, no. 3, March 2012.
[11] Li, Xiao and Zhong, Xiao Jing, The Source Code Plagiarism Detection Using AST, In the Proc. of the 2010 International Symposium on Intelligence Information Processing and Trusted Computing, 2010, pp: 406-408.
[12] http://www.historians.org/about-aha-and-membership/governance/policies-and-documents/statement-on-plagiarism.
[13] Brixtel, Romain and Fontaine, Mathieu and Lesner, Boris and Bazin, Cyril and Robbes, Romain, Language-Independent Clone Detection Applied to Plagiarism Detection, SCAM '10, IEEE Computer Society, 2010, pp: 77-86.
[14] Muddu, Basavaraju and Asadullah, Allahbaksh M. and Bhat, Vasudev D., CPDP: A robust technique for plagiarism detection in source code, IEEE, IWSC '13, 2013, pp: 39-45.
[15] Krinke, Jens, Identifying Similar Code with Program Dependence Graphs, Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE'01), 2001, pp: 301--
AUTHORS' BIOGRAPHY
Julie Baby completed her B.Tech in CSE from RCET, Kerala. She is currently pursuing an M.Tech in Computer Science with specialization in Information System at SCMS School of Engineering and Technology, Kerala. Her area of interest is Information Security.
Kannan T completed his B.Tech in CSE from SCMS School of Engineering and Technology, Kerala. He is currently working at Aricent Technologies (Holdings) Ltd, Gurgaon.
Vinod P. completed his B.Tech in CSE from RGPV, Madhya Pradesh, his M.Tech in Information Technology from RGPV, Madhya Pradesh, and his Ph.D in Computer Engineering from NIT Jaipur. He has 2 journal papers, 4 book chapters and 30 international papers to his credit. He has executed a project entitled PROSIM: Probabilistic Signature for metamorphic malware detection, funded by MCIT, New Delhi. He is presently working as Associate Professor in the Department of CSE, SSET, Karukutty. His areas of interest are Information Security, Pattern Analysis and Cryptography.
Viji Gopal did her B.Tech in Computer Science and Engineering from College of Engineering, Karunagappally and her M.E. in Computer Science and Engineering from RMK College of Engineering, Chennai. She is currently working as Assistant Professor in the Dept of Computer Science and Engineering at SCMS School of Engineering and Technology, Karukutty. She has a total of 8 years of teaching experience. Her areas of interest are networking and data mining. She has 3 international publications and 4 national publications.