
Post on 27-Sep-2020


Dr. Rao Muhammad Adeel Nawab


How to Read a Research Paper?

Session V

Making Summary and Documenting a Paper


How to Work

"You alone we worship, and You alone we ask for help." (Qur'an 1:5)

Power of Dua

Make this dua daily: "Guide us to the straight path, the path of those You have blessed." (Qur'an 1:6-7)

Power of Neyat

The Messenger of Allah (peace be upon him) said: "Actions are judged by intentions."

With the intention of serving Allah's creation, reading and teaching is our worship.

Dua - Take Help from Allah before starting any Task

Balanced Life is Ideal Life

Get Excellence in five things

A Journey from BEGINNER to EXCELLENCE

You must have a combination of five things with different variations. However, the aggregate will be the same.

Excellence – FRIENDS

Have a DADDU YAR in life to drain out on a daily basis

Excellence – FAMILY

Earn the duas of parents and elders through khidmat (service) and adab (respect)

Your wife/husband must be your best friend

Be humble and kind to kids, subordinates and poor people

Outline

- Understand Order and Flow
- Template-based Approach to Read a Paper

Understand Order and Flow

Order and Flow

Document (or Paper) Level

Connection between Sections

Section Level

Connection between Paragraphs

Paragraph Level

Connection between Sentences

Sentence Level

Connection between Words / Phrases

Paper Outline

Abstract

Introduction

Related Work

Corpus Generation
- Extract sentence/passage pairs
- Annotation guidelines
- Annotations
- Corpus statistics
- Examples from the corpus
- Linguistic analysis of the transformations

Paper Outline (cont.)

Text reuse detection experiments
- Translation plus mono-lingual analysis
- Experimental setup

Results and analysis

Conclusions and future work

Acknowledgements

References

Template-based Approach to Read a Paper

Reading Abstract

Read sentence by sentence and interpret each sentence.

Template
- Problem (or Research Problem)
- Importance of Problem
- Application(s) of Problem
- Summary of Existing Literature
- Research Gap
- Proposed Solution
- Characteristics of Proposed Solution
- Results and Main Findings

Abstract

Sentence 1: Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages.

Interpretation: Cross-lingual text reuse detection is a challenging task

Insights
- Research Problem: Cross-lingual text reuse detection
- Importance: It is a widespread problem and also difficult to detect

Abstract

Sentence 2: The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale.

Interpretation: Reason(s) for the rise in cross-lingual text reuse

Insights: Justification of why cross-lingual text reuse detection is an important problem to address, why it has generic applications, and why it is on the rise

Abstract

Sentence 3: Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases.

Interpretation: Summary of existing literature and Research Gap

Insights
- Summary of existing literature: Researchers have proposed methods to detect it
- Research gap: Unavailability of large-scale gold standard evaluation resources built on real cases

Abstract

Sentence 4: To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair.

Interpretation: Proposed Solution

Insights: The proposed solution needs to be stated very specifically
- Cross-language sentence/passage level text reuse corpus
- English-Urdu language pair

Abstract

Sentence 5: The Cross-Language English-Urdu Corpus (CLEU) has source text in English while the derived text is in Urdu.

Interpretation: Brief details of Main Contribution, i.e., proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)

Insights
- Brief detail of Proposed Solution: source text in English while the derived text is in Urdu
- Note that this is the Selling Point of the Paper

Abstract

Sentence 6: It contains in total 3,235 sentence/passage pairs manually tagged into three categories, i.e., near copy, paraphrased copy, and independently written.

Interpretation: Main characteristic of the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)

Insights
- Total 3,235 sentence/passage pairs
- Manually tagged
- Three categories, i.e., near copy, paraphrased copy, and independently written

Abstract

Sentence 7: Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness.

Interpretation: Brief details of Secondary contribution and applications of the proposed solution

Insights
- Technique – translation plus mono-lingual analysis
- Experiments – 3
- Evaluation – comparison of various techniques on the same dataset (proposed in this study)
- Note that it is not the selling point of this research work
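The slide only names Translation plus Mono-lingual Analysis, so the following is a minimal sketch of the general idea rather than the paper's implementation: translate the derived (Urdu) text into the source language, then compare the two texts with an ordinary mono-lingual similarity measure. The word-by-word `TOY_LEXICON`, the romanized Urdu tokens, and the `containment` measure are all assumptions made for illustration; a real system would use a machine translation service and a stronger similarity measure.

```python
def translate_to_english(urdu_tokens, lexicon):
    """Toy word-by-word 'translation' with a hypothetical bilingual lexicon.

    Unknown words are passed through unchanged.
    """
    return [lexicon.get(token, token) for token in urdu_tokens]


def containment(source_tokens, derived_tokens):
    """Fraction of distinct derived-text words that also occur in the source."""
    derived = set(derived_tokens)
    if not derived:
        return 0.0
    return len(derived & set(source_tokens)) / len(derived)


# Hypothetical mini-lexicon (romanized Urdu -> English), for illustration only.
TOY_LEXICON = {"kitab": "book", "mez": "table", "par": "on", "hai": "is"}

source = "the book is on the table".split()
derived_urdu = "kitab mez par hai".split()

# Step 1: translation. Step 2: mono-lingual analysis on the translated text.
derived_en = translate_to_english(derived_urdu, TOY_LEXICON)
score = containment(source, derived_en)
print(derived_en, score)
```

A high score suggests the derived text reuses the source; where to place the decision threshold is a separate choice that would have to be tuned on labelled data.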

Abstract

Sentence 8: Evaluation results (F1 = 0.732 binary, F1 = 0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts.

Interpretation: Types of classification, best results and main finding

Insights
- Ternary Classification – Verbatim, Paraphrased and Independently Written
- Binary Classification – Derived vs Non-Derived
- Results – F1 = 0.732 (binary), F1 = 0.552 (ternary classification)
- Main Findings – It is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts
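To make the two evaluation settings concrete, here is a small sketch of how ternary labels might be collapsed into binary ones and how an F1 score is computed for one class. The mapping of near copy (NC) and paraphrased copy (PC) to "Derived" is an assumption based on the category names; the slides do not say how the paper averaged F1 across classes, and the gold and predicted labels below are invented toy data.

```python
def to_binary(label):
    """Collapse ternary labels: NC and PC count as Derived (assumed mapping)."""
    return "Derived" if label in {"NC", "PC"} else "Non-Derived"


def f1_for_class(gold, predicted, positive):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
gold = ["NC", "PC", "IW", "PC"]
pred = ["NC", "IW", "IW", "PC"]

# Ternary setting: score one of the three classes.
f1_pc = f1_for_class(gold, pred, "PC")

# Binary setting: collapse the labels first, then score the Derived class.
gold_bin = [to_binary(label) for label in gold]
pred_bin = [to_binary(label) for label in pred]
f1_derived = f1_for_class(gold_bin, pred_bin, "Derived")

print(round(f1_pc, 3), round(f1_derived, 3))
```

The binary task is easier in this toy case for the same reason it is in general: confusing two reuse categories with each other costs nothing once both are merged into one class.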

Abstract

Sentence 9: The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.

Interpretation: Strengths and applications of proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)

Abstract: Overall Interpretations

1. Cross-lingual text reuse detection is a challenging task
2. Reasons for rise in cross-lingual text reuse (Importance and application)
3. Summary of existing literature and Research Gap
4. Proposed Solution
5. Brief details of Main Contribution, i.e., proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
6. Main characteristic of proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
7. Brief details of secondary contribution
8. Types of classification, best results and main finding
9. Strengths and applications of proposed solution
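One way to apply the abstract template in practice is to record, for each sentence, which template slots it fills, and then check that no slot is left empty. The sketch below does this for the nine-sentence abstract; the slot assignments are one reading of the interpretations in these slides, not something the slides state in this form, and the names `TEMPLATE`, `SENTENCE_ROLES`, and `missing_slots` are hypothetical.

```python
TEMPLATE = [
    "Problem",
    "Importance of Problem",
    "Application(s) of Problem",
    "Summary of Existing Literature",
    "Research Gap",
    "Proposed Solution",
    "Characteristics of Proposed Solution",
    "Results and Main Findings",
]

# Sentence number -> template slots it fills (an assumed reading of the slides).
SENTENCE_ROLES = {
    1: ["Problem", "Importance of Problem"],
    2: ["Importance of Problem", "Application(s) of Problem"],
    3: ["Summary of Existing Literature", "Research Gap"],
    4: ["Proposed Solution"],
    5: ["Proposed Solution"],
    6: ["Characteristics of Proposed Solution"],
    7: [],  # secondary contribution: has no slot of its own in the template
    8: ["Results and Main Findings"],
    9: ["Characteristics of Proposed Solution"],  # strengths and applications
}


def missing_slots(template, sentence_roles):
    """Return the template slots that no sentence has been assigned to."""
    covered = {slot for slots in sentence_roles.values() for slot in slots}
    return [slot for slot in template if slot not in covered]


print(missing_slots(TEMPLATE, SENTENCE_ROLES))
```

An empty result means every slot is accounted for; a non-empty one shows which part of the template to look for on a second read.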

Reading Introduction

Summarize each paragraph into a single sentence

See the order and flow of paragraphs

Introduction: Passage 1

Text reuse, the process of creating new texts using existing ones, has become very common because of free, readily available, and large digital repositories. In addition, state-of-the-art text processing applications have made it very simple to copy-paste text and give it a new identity. Text borrowed from such sources can be reused verbatim (copy-paste) or rewritten (paraphrased). If the rewriting process involves complex editing operations (e.g., lexical substitution, changes in syntax, summarization, synonym replacement, altering word order, or verb or noun nominalization) then the borrowed text transforms into an independently written piece (Clough, Gaizauskas, Piao, & Wilks, 2002; Maurer, Kappe, & Zaka, 2006). Moreover, new text can be created using text from one or more sources and the amount of reused text varies from local text reuse (such as a single word, small chunks, or sentences) to global text reuse (i.e., an entire document; Mittelbach, Lehmann, Rensing, & Steinmetz, 2010; Seo & Croft, 2008).

Introduction - Passage 1 - Summary

Definition of Text Reuse

Importance of Text Reuse

Levels of Text Reuse

Verbatim

Paraphrased

Independently Written

Types of Text Reuse (Local vs Global)

Introduction: Passage 2

Unlike academic plagiarism (the unacknowledged reuse of text), text reuse is a common practice in journalism. Newspapers pay news agencies for their text(s) (here termed source text) to generate news stories (termed derived text). The text purchased from a news agency can be reused “verbatim” or “paraphrased” to create the newspaper story. However, at times the newspaper story might also be independently written without using any news agency text (Clough, 2010).

Introduction - Passage 2 - Summary

Definition of Plagiarism

Process of text reuse in Journalism

Introduction: Passage 3

Text reuse can either be mono-lingual (when the source and derived text share the same language) or cross-lingual (when the source text is in one language and the derived text is in another). Mono-lingual text reuse detection has been a subject undergoing intense study for the research community for some time, but recently the focus has shifted towards detecting text reuse across languages (Ceska, Toman, & Jezek, 2008; Franco-Salvador, Gupta, Rosso, & Banchs, 2016; Gupta, Barrón-Cedeño, & Rosso, 2012; Potthast, Barrón-Cedeño, Stein, & Rosso, 2011). A recent study suggested that the scale of cross-language text reuse and plagiarism is increasing (Barrón-Cedeño, Gupta, & Rosso, 2013). This is because of the following reasons: (a) users of under-resourced languages, which are very large in number, commonly use text(s) from resource-rich languages, (b) speakers of one language staying in a country other than their own can consult the text(s) in their native language, and (c) often speakers of one language are keen to write in a foreign language. Likewise, the recent rise in multi-linguality, freely available machine translation systems, and intelligent word processors are contributing to an environment where it is easy to reuse text across languages, but with a perception of being harder to detect such reuse (Somers, Gaspari, & Niño, 2006). Therefore, there is an ever-increasing necessity to develop standard evaluation resources and methods to detect cross-language text reuse for the various language pairs.

Introduction - Passage 3 - Summary

Mono-lingual vs Cross-lingual text reuse

Three main reasons for rise in cross-lingual text reuse

Introduction: Passage 4

To develop, evaluate, and analyze methods for cross-language text reuse (either local or global), gold standard benchmark corpora are needed. These corpora can be generated in three ways: (a) artificial - using an automatic text altering tool, (b) simulated - humans are asked to rewrite source text to create new text, and (c) real - a news agency's text is reused by journalists to create the newspaper story. It seems likely that cross-language text reuse detection methods which are trained on real examples are more likely to give realistic performance, which we investigate further in our paper.

Introduction - Passage 4 - Summary

Why is it important to develop a cross-lingual text reuse corpus?

Three ways to generate a corpus

artificial

simulated

real

Introduction: Passage 5

This study aims to develop a publicly available large-scale benchmark corpus that contains real examples of cross-language text reuse at sentence/passage level for the English-Urdu language pair. Urdu belongs to the Indo-Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, 2015). Moreover, it has a strong Perso-Arabic influence in its vocabulary and is written in a Perso-Arabic script from right to left. It is also spoken world-wide because of the South Asian Diaspora (with large populations in the Middle East, United States, UK, Norway, and Canada etc.; Daud, Khan, & Che, 2016). Despite that, for the English-Urdu language pair, there are no publicly available cross-language text reuse detection datasets known to us. Moreover, previous research has tended to focus more on European languages. The corpus developed as an outcome of this study contains 3,235 pairs of real examples of cross-language text reuse at sentence/passage level (the source text is in English whereas derived text is in Urdu). Each sentence/passage pair is categorised as i) Near Copy (NC; 751 pairs), ii) Paraphrased Copy (PC; 1751 pairs), or iii) Independently Written (IW; 733 pairs). The corpus is representative enough to serve as a benchmark dataset for: (a) developing and evaluating techniques for cross-language text reuse detection for the English-Urdu language pair, (b) obtaining an insight into what edit operations are likely used by journalists in reusing text, and (c) to foster text reuse detection research in the English-Urdu language pair.
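When a paper reports corpus statistics, a quick arithmetic check is a useful reading habit. The sketch below confirms that the three category counts quoted in the passage add up to the stated total of 3,235 pairs and works out each category's share; the variable names are illustrative only.

```python
# Category counts as quoted in the passage above.
counts = {"NC": 751, "PC": 1751, "IW": 733}
stated_total = 3235

total = sum(counts.values())

# Share of each category as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total == stated_total)
print(shares)
```

The shares show that paraphrased copies make up a little over half of the corpus, so the three classes are imbalanced, which is worth keeping in mind when reading the ternary classification results.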

Introduction - Passage 5 - Summary

Main aim of this study

Importance of Urdu

Summary of Literature Review

Research Gap

Proposed Solution (or Corpus)

Characteristics and applications of proposed solution

Introduction: Passage 6

The remainder of this article is organized as follows. We first review previously developed cross-lingual text reuse or plagiarism detection corpora. Then we present a detailed discussion on the CLEU corpus construction, its statistics, characteristics, linguistic analysis, and example cases. This is followed by the explanation of cross-language text reuse detection experiments that we performed on our corpus to highlight its strengths and its utility for evaluation purposes. Finally, we present the results and their analysis and then conclude the article.

Introduction - Passage 6 - Summary

Organization of Paper

Introduction: Overall Interpretations

1. Definition of Text Reuse, Importance of Text Reuse, Levels of Text Reuse (Verbatim, Paraphrased, Independently Written) and Types of Text Reuse (Local vs Global)
2. Definition of Plagiarism and Process of text reuse in Journalism
3. Mono-lingual vs Cross-lingual text reuse and Three main reasons for rise in cross-lingual text reuse
4. Why is it important to develop a cross-lingual text reuse corpus? and Three ways to generate a corpus (artificial, simulated and real)
5. Main aim of this study, Importance of Urdu, Summary of Literature Review, Research Gap, Proposed Corpus (or Solution), its characteristics and applications
6. Organization of Paper

Reading - Related Work

Summarize each paragraph into a single sentence

See the order and flow of paragraphs

Related Work: Passage 1

In the previous literature, efforts have been made to develop standard evaluation resources for measuring cross-language text reuse (and plagiarism) for different language pairs. For example, PAN authors have developed a series of corpora with artificial and simulated examples of plagiarism at document level (Potthast, Barrón-Cedeño, Eiselt, Stein, & Rosso, 2010; Potthast, Eiselt, Barrón-Cedeño, Stein, & Rosso, 2011; Potthast et al., 2012–2014; Stein, Rosso, Stamatatos, Koppel, & Agirre, 2009). The majority (90%) of the text plagiarism cases in these corpora are mono-lingual; however, there exists a small portion (10%) of cross-lingual plagiarism cases too. These cross-language plagiarism cases are for the English-German and English-Spanish language pairs. Most of these cases are artificial (created using an automatic MT [Machine Translation] system, that is, Google Translate) but a small number of them are created manually (i.e., translated by humans). These corpora have been used to evaluate text plagiarism detection methods in the competitions held annually.

Related Work: Passage 1 - Summary

Opening sentence

Summary of PAN text reuse corpora

Related Work: Passage 2

The CL!TR (Cross-Language Indian Text Reuse) corpus is the first of its kind developed specifically for the analysis of cross-language text reuse detection in the Hindi-English language pair at document level (Barrón-Cedeño, Rosso, Devi, Clough, & Stevenson, 2013). The suspicious documents it contains are in Hindi and the source documents in English language. The training set includes 198 suspicious (Hindi) and 5,032 source (English) documents, whereas the test set has 190 suspicious (Hindi) and 5,032 source (English) documents. The CL!TR corpus contains simulated cases of text reuse. The volunteers involved in the study were asked to answer a set of 10 questions, related to the tourism and computer science domains, to create suspicious documents. It contains three types of revisions, categorized by the amount of obfuscation used, namely “Exact” (without any modifications, translation only), “Light” (very few modifications, translation, and manual correction), and “Heavy” (detailed modifications, translation, and manual correction). The corpus also contains “Original” (independently written) documents which were generated without referring to the source documents but using the learning material provided.

Related Work: Passage 2 - Summary

English-Hindi CL!TR Cross-Lingual Text Reuse Corpus

Related Work: Passage 3

Another cross-language corpus of 110 documents (55 source in English and 55 plagiarized in Bangla) contains simulated plagiarism cases and was built using students' reports from a university (Arefin, Morimoto, & Sharif, 2013). Two groups of 55 students each were asked to write a report on a given topic. 50 reports are used as training set whereas the remaining 10 as test set. Plagiarism cases were obfuscated by replacing contents with several plagiarized fragments of different lengths. However, the corpus is not available to download.

Related Work: Passage 3 - Summary

English-Bangla Cross-Lingual Text Reuse Corpus

Related Work: Passage 4

Recently, a cross-language (Urdu-English language pair) document level plagiarism detection corpus was submitted for the PAN 2016 shared task (Hanif et al., 2015). The corpus is divided in two sets, 500 source (Urdu) and 500 suspicious (English) documents, and contains only simulated examples of plagiarism. The source documents are Wikipedia excerpts whereas the plagiarized documents were manually created by university students. The students were asked to plagiarize 270 documents on three levels of obfuscation (“Near Copy,” “Light Revision,” and “Heavy Revision”), whereas 230 documents in the corpus are “Nonplagiarized.” Moreover, the plagiarism cases inserted in the suspicious documents are of various lengths, that is, small (< 50 tokens), medium (50–100 tokens), and large (100–200 tokens). The corpus is the first cross-language (Urdu-English pair) dataset created for plagiarism detection research at the document level.
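The length categories in this passage can be written as a small bucketing function, which also exposes a detail the passage leaves open: the medium and large ranges both include 100 tokens. The sketch below assumes that exactly 100 tokens counts as medium and that anything above 200 falls outside the described ranges; both choices, and the function name `length_bucket`, are assumptions for illustration. It also checks that the 270 plagiarized and 230 nonplagiarized documents account for the 500 suspicious documents.

```python
def length_bucket(n_tokens):
    """Map a plagiarism case's token count to small / medium / large.

    Boundary handling is assumed: 100 tokens is treated as medium, and
    counts above 200 are reported as out of range.
    """
    if n_tokens < 0:
        raise ValueError("token count cannot be negative")
    if n_tokens < 50:
        return "small"
    if n_tokens <= 100:
        return "medium"
    if n_tokens <= 200:
        return "large"
    return "out of range"


# Document counts quoted in the passage.
plagiarized, nonplagiarized, suspicious_total = 270, 230, 500
print(plagiarized + nonplagiarized == suspicious_total)

for n in (12, 50, 100, 101, 200):
    print(n, length_bucket(n))
```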

Related Work: Passage 4 - Summary

English-Urdu CLUE Cross-Lingual Text Reuse Corpus

Related Work: Passage 5

CLiPA (Cross-Language Plagiarism Analysis) is a publicly available fragment or sentence level corpus containing five source sentences (in English) which were used to generate plagiarized cases (in Spanish and Italian) using both machine translation (artificial) and manual translation (simulated; Barrón-Cedeño, Rosso, Pinto, & Juan, 2008). The machine translation cases were generated using five different services to have variations whereas for manually (human) simulated plagiarism cases, nine volunteers were asked to plagiarize each of the five source fragments. They were further requested to generate the same number of nonplagiarized cases as well. The corpus was used in experiments on text plagiarism detection research in the English-Spanish and English-Italian language pairs.

Related Work: Passage 5 - Summary

English-Spanish and English-Italian CLiPA Cross-Lingual Text Reuse Corpus

Related Work: Passage 6

In summary, the corpora discussed above either contain artificial or simulated examples of cross-language text reuse (or plagiarism). Cross-language text reuse detection methods developed using these non-real types of text reuse are unlikely to perform well on real cases of text reuse that occur in real world scenarios (e.g., academia, journalism; Weber-Wulff, 2010).

Related Work: Passage 6 - Summary

Summary of existing literature and research gap (unavailability of real examples of text reuse).

Related Work: Passage 7

Moreover, the simulated cases created in a controlled environment using crowd-sourcing do not represent the strategies used by humans when rewriting text in real life. Because cross-language text reuse is increasing day by day, first, there is an urgent need to develop text reuse detection corpora with real examples of text reuse. Second, the available corpora for research are created at document level and there are no corpora available at sentence/passage level for the English-Urdu language pair. Last, the corpora listed above are not large enough to generate robust results. This is not surprising because it takes a lot of manual effort to create corpora with simulated examples of text reuse or plagiarism.

Related Work: Passage 7 - Summary

Limitations of existing work (or corpora)

Related Work: Passage 8

To develop and evaluate cross-language text reuse detection methods for the real-world scenario, we need to create corpora with real examples of text reuse. To fill this gap, our research work proposes a large-scale gold standard benchmark corpus containing real examples to measure cross-language text reuse at sentence/passage level for the English-Urdu language pair. The next section describes the corpus generation process in detail.

Related Work: Passage 8 - Summary

Justification for the need of a new corpus, our contribution in developing a new corpus, and connection with the next section

Related Work: Overall Interpretations

1. Opening sentence and summary of PAN text reuse corpora

2. English-Hindi CLITRA Cross-Lingual Text Reuse Corpus

3. English-Bangle Cross-Lingual Text Reuse Corpus

4. English-Urdu CLUE Cross-Lingual Text Reuse Corpus

5. English-Spanish and English-Italian CLiPA Cross-Lingual Text Reuse Corpus

6. Summary of existing literature and research gap (unavailability of real examples of text reuse).

7. Limitations of existing work (or corpora)

8. Justification for the need of a new corpus, our contribution in developing a new corpus, and connection with the next section

Reading - Corpus Generation

Purpose of Corpus

Cross-lingual text reuse detection

Corpus Generation Process

1. Extracting sentence/passage pairs

2. Preparation of annotation guidelines

3. Annotation of text by three annotators

4. Computing inter-annotator agreement
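The last step of the process, computing inter-annotator agreement, can be illustrated with a small sketch. The slides do not say which agreement measure the paper uses; Fleiss' kappa is assumed here only because it is a common choice when a fixed number of annotators (three in this case) label every item. The function name, the short labels V/P/I, and the toy annotations are all hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of lists, one inner list of labels per item.
    categories: all labels that may occur.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(s * s for s in shares)

    if p_chance == 1.0:  # every rating is the same label
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy annotations (hypothetical): V = verbatim, P = paraphrased,
# I = independently written; three annotators per sentence pair.
toy = [["V", "V", "P"], ["P", "P", "P"], ["I", "I", "V"], ["I", "I", "I"]]
print(round(fleiss_kappa(toy, ["V", "P", "I"]), 3))
```

A value close to 1 indicates strong agreement among the annotators, while a value near 0 indicates agreement no better than chance.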

Corpus Characteristics

Language Pair

English-Urdu

Levels of Text Reuse

Verbatim – 751. Paraphrased – 1751. Independently Written – 733.

Standardization

XML format

Global or Local

Local (Sentence / Passage level)

Size of Corpus

Total 3235 pairs

Reading – Text Reuse Detection Experiments

Techniques: Translation plus Monolingual analysis

1. N-gram Overlap

2. Greedy String Tiling

3. Longest Common Subsequence

For each technique, note four things:

1. In which category does it fall?

2. How does it work?

3. What are its strengths and weaknesses?

4. In which previous studies has it been used?
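To make the "How does it work?" question concrete for two of the listed techniques, here is a minimal sketch of word n-gram overlap and Longest Common Subsequence similarity. It assumes the Urdu text has already been translated into English (the translation step is not shown) and that both texts are tokenized on whitespace. The containment and LCS normalizations, the function names, and the example sentences are illustrative assumptions and may differ from the paper's exact setup.

```python
def word_ngrams(tokens, n):
    """Set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_containment(suspicious, source, n=1):
    """Share of the suspicious text's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a, b):
    """LCS length normalized by the length of the shorter text."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0

# Hypothetical example: an English source sentence and the English
# translation of a possibly derived Urdu sentence.
source = "the minister announced a new education policy on monday".split()
derived = "the minister announced a new policy for schools".split()

print(ngram_containment(derived, source, n=1))
print(ngram_containment(derived, source, n=2))
print(lcs_similarity(derived, source))
```

N-gram overlap ignores word order beyond the n-gram window, whereas LCS rewards words that appear in the same order even when other words are inserted between them.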

Evaluation Methodology

Binary classification

derived (verbatim + paraphrased) vs non-derived (independently written).

Ternary classification

verbatim vs paraphrased vs independently written.

Supervised text classification task
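The relation between the two classification settings can be sketched as a simple label mapping: the binary task merges the verbatim and paraphrased classes into a single derived class. The label strings and the toy label list below are assumptions made for illustration.

```python
from collections import Counter

TERNARY_LABELS = ("verbatim", "paraphrased", "independently written")

def to_binary(label):
    """Collapse a ternary label into the binary derived / non-derived scheme."""
    if label in ("verbatim", "paraphrased"):
        return "derived"
    if label == "independently written":
        return "non-derived"
    raise ValueError(f"unknown label: {label!r}")

# Toy gold labels (hypothetical), not taken from the corpus.
gold = ["verbatim", "paraphrased", "paraphrased", "independently written"]
print(Counter(gold))
print(Counter(to_binary(label) for label in gold))
```

The same feature vectors can therefore be reused for both tasks; only the target labels change.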

Evaluation Measures

Precision, Recall, F1

Machine Learning algorithms

J48, Random Forest, SMO

Machine Learning Toolkit

WEKA
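The three evaluation measures can be computed from gold and predicted labels as in the sketch below. It scores a single class treated as positive; how the paper averages scores across classes is not stated in the slides, so no averaging is shown. The function name and the toy labels are hypothetical.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels (hypothetical) for the binary task.
gold = ["derived", "derived", "derived", "non-derived", "non-derived"]
pred = ["derived", "derived", "non-derived", "derived", "non-derived"]
print(precision_recall_f1(gold, pred, positive="derived"))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.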

Reading – Results and Analysis

Explain “Terms” in the Table.

Explain “Overall” best results.

Explain results with individual techniques.

Conclude your results, e.g., the proposed approach outperforms the baseline approach.

Summarize and Document Paper in Tabular Format

Sr no.: 1

Year: 2018

Paper Title: CLEU - A Cross-Language English-Urdu Corpus and Benchmark For Text Reuse Experiments

Authors: Iqra Muneer, Muhammad Sharjeel, Muntaha Iqbal, Rao M. Adeel Nawab, Paul Rayson

Conference / Journal: Journal of the Association for Information Science and Technology (JASIST)

Publisher: John Wiley & Sons

Problem: Cross-Lingual Text Reuse Detection

Importance of Problem: The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale.

Applications of Problem:

1. Cross-lingual plagiarism detection

2. Duplicate content removal from the Web

Summary of Literature Review: Cross-lingual text reuse detection corpora have been developed for various language pairs, including English-Urdu, English-Hindi, and English-Spanish.

Research Gap: One major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases for cross-lingual text reuse detection, particularly for the English-Urdu language pair.

Proposed Solution: A cross-language sentence/passage level text reuse corpus for the English-Urdu language pair

Corpus / Dataset

Purpose of Corpus: Develop systems to detect cross-lingual text reuse for the English-Urdu language pair

Corpus Generation Process:

1. Data collection from news articles

2. Related pairs extraction

3. Annotation guidelines

Corpus / Dataset

Corpus Characteristics:

Number of documents: 900

Levels of text reuse:

1. Exact Copy

2. Paraphrase Copy

3. Independently Written

Language: English-Urdu

License: Creative (Open access), publicly available

Technique: Translation + Mono-lingual Analysis

1. Longest Common Subsequence

2. N-gram Overlap

3. Greedy String Tiling

Toolkit: WEKA

Evaluation Measures:

1. Precision

2. Recall

3. F1-measure

Evaluation Methodology:

1. Supervised document classification task

2. Ten-fold cross validation
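Ten-fold cross validation, listed under the evaluation methodology, can be sketched as follows: the data is split into ten parts, and each part serves once as the test set while the remaining nine are used for training. This is a plain, unshuffled split written for illustration; toolkits usually also shuffle and may stratify by class, which is not shown here. The function name is hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Item i goes to fold i % k, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical usage with 25 items: every item is tested exactly once.
for fold_no, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_no, len(train), len(test))
```

The scores of the ten runs are then averaged to give a single estimate for each measure.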

Classifiers:

1. Random Forest

2. Naive Bayes

3. J48

4. SMO

Results:

Binary classification: F1 = 0.735 using GST-mml1

Ternary classification: F1 = 0.549 using GST-mml1

Main Finding(s): It is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts.
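The best results above are reported with GST-mml1, which presumably denotes Greedy String Tiling with a minimum match length of 1 token; that reading is an assumption, since the slides do not expand the abbreviation. The sketch below is a simplified, quadratic version of the tiling idea without the usual speed-ups, and the similarity normalization is one common choice, not necessarily the one used in the paper. All names and the example sentences are hypothetical.

```python
def greedy_string_tiling(a, b, min_match_length=1):
    """Return tiles (start_a, start_b, length): non-overlapping common token runs."""
    if min_match_length < 1:
        raise ValueError("min_match_length must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        longest = min_match_length
        matches = []
        # Find the longest runs of equal, still unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > longest:
                    matches, longest = [(i, j, k)], k
                elif k == longest:
                    matches.append((i, j, k))
        if not matches:
            break
        # Turn the longest matches into tiles, skipping any that now overlap.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles

def gst_similarity(a, b, min_match_length=1):
    """Share of tokens covered by tiles, in the range 0 to 1."""
    if not a and not b:
        return 0.0
    covered = sum(k for _, _, k in greedy_string_tiling(a, b, min_match_length))
    return 2 * covered / (len(a) + len(b))

# Hypothetical example: same words, reordered in blocks.
x = "the cat sat on the mat".split()
y = "on the mat the cat sat".split()
print(greedy_string_tiling(x, y, min_match_length=1))
print(gst_similarity(x, y, min_match_length=1))
print(gst_similarity(x, y, min_match_length=4))
```

A small minimum match length lets even single shared words count as tiles, which makes the measure tolerant of heavy rewording, while a larger value only rewards longer copied runs.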

Future Work: Improve results by developing a new technique / algorithm.

Any Remarks: -

Source Code URL?: Not available publicly


Physical Health

Mental Health

Social Health

Key to Success

7-9 hours sleep per night

3 healthy meals daily

30 minutes brisk walk or running or exercise

Offer 5 Namaaz daily with Jamaat

Help at least one person daily for the pleasure of Allah

Practice Six Things on Daily Basis to Become a Great Human Being (Insha Allah)

Recite Durood Sharif daily (Min: 100 – Max: 125K)


BECOME A VOLUNTEER

MAKE A DIFFERENCE


Jazak Allah Khair

Dr. Rao Muhammad Adeel Nawab
Email: [email protected]