Plagiarism Detector using cosine similarity - Text mining

Upload: sakthi-dasans

Post on 16-Apr-2017

An experimental venture to detect plagiarism among text documents using cosine similarity

To build a plagiarism model, create a Term Document Matrix (TDM) from the corpus, which is then passed to Singular Value Decomposition (SVD) to obtain the U, S, and V matrices.

Steps to create the TDM

1. Remove stop words (such as is, of, be, to, the, etc.) from the corpus.

2. Apply a stemmer to each token in the corpus to get rid of inflections.

3. Construct a count matrix.

4. Reweight the count matrix with TF-IDF (Term Frequency, Inverse Document Frequency).

The resulting reweighted matrix is your TDM.
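
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the author's implementation: the stop-word list, the `naive_stem` suffix stripper (a stand-in for a real stemmer such as Porter's), and the TF-IDF variant (raw count times log(N/df)) are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"is", "of", "be", "to", "the", "a", "an", "and", "in", "on"}


def naive_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Steps 1 and 2: lowercase, trim punctuation, drop stop words, stem."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if word and word not in STOP_WORDS:
            tokens.append(naive_stem(word))
    return tokens


def build_tdm(documents):
    """Steps 3 and 4: return (terms, matrix), where matrix[i][j] is the
    TF-IDF weight of term i in document j (terms as rows, documents as columns)."""
    counts = [Counter(tokenize(doc)) for doc in documents]  # per-document term counts
    terms = sorted(set().union(*counts))
    n_docs = len(documents)
    matrix = []
    for term in terms:
        df = sum(1 for c in counts if term in c)  # number of documents containing the term
        idf = math.log(n_docs / df)               # assumed IDF variant: log(N / df)
        matrix.append([c[term] * idf for c in counts])
    return terms, matrix


docs = [
    "The cat is sitting on the mat",
    "The dog is sitting on the log",
    "Cats and dogs",
]
terms, tdm = build_tdm(docs)
for term, row in zip(terms, tdm):
    print(term, [round(w, 3) for w in row])
```

Each row of `tdm` is a term and each column is a document, which matches the orientation in which the Vt matrix from SVD describes the documents.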

Once we have built our TDM, we call upon a powerful but little-known technique called Singular Value Decomposition (SVD) to analyse the matrix for us. We pass the TDM to SVD and end up with three simpler matrices named U, S, and Vt.

Since our interest is in finding document plagiarism, we have to compute document-to-document similarity. The Vt matrix corresponds to the document vector coordinates. We compute the matrix S*Vt and apply cosine similarity to its columns, one column per document.
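
A short derivation may help explain why S*Vt is a sensible place to compare documents. It assumes the TDM, written A here, has terms as rows and documents as columns, and that U has orthonormal columns (as in the standard SVD); the symbols A, d_i and θ_ij are notation introduced for this sketch only.

```latex
\begin{align}
A &= U S V^{T}, \qquad U^{T}U = I \\
A^{T}A &= (U S V^{T})^{T}(U S V^{T}) = V S^{T} U^{T} U S V^{T} = (S V^{T})^{T}(S V^{T}) \\
\cos\theta_{ij} &= \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert},
\qquad d_j = \text{column } j \text{ of } S V^{T}
\end{align}
```

The second line says that dot products between the columns of S*Vt equal dot products between the document columns of A, so when all singular values are kept, the cosine between two columns of S*Vt is the same as the cosine between the two original documents in the TDM.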

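If the resulting angle between two document vectors is between 0° and 90° (that is, the cosine is positive), then there exists a relationship (some similarity) between the two vectors; the smaller the angle, the stronger the similarity.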
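
As a minimal sketch of this check, the functions below compute the cosine similarity of two vectors and the corresponding angle in degrees. The function names, and the choice to treat a zero vector as having no similarity, are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as having no similarity
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between two vectors in degrees, derived from the cosine."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))


doc_a = [1.0, 0.0]
doc_b = [1.0, 1.0]
print(cosine_similarity(doc_a, doc_b), angle_degrees(doc_a, doc_b))
```

In the pipeline described here, `u` and `v` would be two document columns taken from S*Vt; an angle close to 0° suggests near-identical documents, while an angle of 90° means the documents share no weighted terms.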

Cosine similarity is a measure of similarity between two vectors.

The technique is also used to compare documents in text mining. In addition, it is used to measure cohesion within clusters in the field of data mining.

Conclusion

We trained the model with 1038 documents.

Our testing set included 256 documents, of which 235 were correctly predicted by the model.

Accuracy of the model: 92% (235/256 ≈ 91.8%)

Usage

This type of application can be used in any online portal, such as a question-and-answer site (forum) or a blog, to check whether a post already exists.

Sometimes this kind of application is also used in universities to detect plagiarised assignments or works that may be submitted by smart students.

Thank you

If you like the presentation, I would greatly appreciate it if you could endorse me on my LinkedIn profile page for Data mining, Text mining, Big Data, Machine Learning, Algorithms, and MongoDB. http://www.linkedin.com/profile/view?id=48289105

For more details on the plagiarism detector: http://shakthydoss.com/plagiarism-detector/

Sakthi dasan, http://Shakthydoss.com, Twitter - @shakthydoss