
    ANALYSING LANGUAGE-SPECIFIC DIFFERENCES IN MULTILINGUAL WIKIPEDIA

    Fakultät für Elektrotechnik und Informatik der Gottfried Wilhelm Leibniz Universität Hannover

    Thesis submitted for the attainment of the degree Master of Science (M. Sc.)

    by Simon Gottschalk

    First examiner: Prof. Dr. techn. Wolfgang Nejdl
    Second examiner: Prof. Dr. Robert Jäschke
    Supervisor: Dr. Elena Demidova

    2015

    ABSTRACT

    Wikipedia is a free encyclopedia that has editions in more than 280 languages. While Wikipedia articles referring to the same entity often co-exist in many Wikipedia language editions, such articles evolve independently and often contain complementary information or represent a community-specific point of view on the entity under consideration. In this thesis we analyse features that make it possible to uncover such edition-specific aspects within Wikipedia articles, in order to provide users with an overview of the overlapping and complementary information available for an entity in different language editions.

    In this thesis we compare Wikipedia articles at different levels of granularity: First, we identify similar sentences. Then, these sentences are merged to align similar paragraphs. Finally, a similarity score at the article level is computed. To align sentences, we employ syntactic and semantic features including cosine similarity, links to other Wikipedia articles and time expressions. We evaluated the sentence alignment function on a dataset containing 1155 sentence pairs extracted from 59 articles in the German and English Wikipedia that had been annotated during a user study. Our evaluation results demonstrated that the inclusion of semantic features can lead to an improvement of the break-even point from 70.95% to 77.52% on this dataset.

    Given the sentence alignment function, we developed an algorithm to build similar paragraphs starting from the sentences that have been aligned before. We implemented a visualisation of the algorithm results that enables users to obtain an overview of the similarities and differences in the articles by looking at the paragraphs aligned using the proposed algorithm and the other paragraphs, whose contents are unique to an article in a specific language edition. To further support this comparison, we defined an overall article similarity score and applied this score to illustrate temporal differences between article editions. Finally, we created a Web-based application presenting our results and visualising all the aspects described above.

    In future work, the algorithms developed in this thesis can be directly applied to help Wikipedia authors by providing an overview of the entity representation across Wikipedia language editions. These algorithms can also build a basis for cultural research towards a better understanding of the language-specific similarities and differences in multilingual Wikipedia.

    Contents

    Table of Contents iii

    List of Figures vii

    List of Tables ix

    List of Algorithms xi

    1 Introduction 1

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2 Background on Multilingual Wikipedia 7

    2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.2 Wikipedia Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.2.1 Translations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.2.2 Neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.3 Linguistic Point of View . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.4 Reasons for Multilingual Differences . . . . . . . . . . . . . . . . . . . 12

    2.5 Wikipedia Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    3 Background on Multilingual Text Processing 17

    3.1 NLP for Multilingual Text . . . . . . . . . . . . . . . . . . . . . . . . 17


    3.1.1 Machine Translation . . . . . . . . . . . . . . . . . . . . . . . 17

    3.1.2 Textual Features . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.1.3 Topic Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.1.4 Sentence Splitting . . . . . . . . . . . . . . . . . . . . . . . . . 21

    3.1.5 Other NLP techniques . . . . . . . . . . . . . . . . . . . . . . 21

    3.2 Aligning Multilingual Text . . . . . . . . . . . . . . . . . . . . . . . . 23

    3.2.1 Comparable Corpora . . . . . . . . . . . . . . . . . . . . . . . 23

    3.2.2 Plagiarism Detection in Multilingual Text . . . . . . . . . . . 24

    4 Approach Overview 27

    5 Feature Selection and Extraction 31

    5.1 Syntactic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    5.2 Evaluation on Sentence Similarity of Parallel Corpus . . . . . . . . . 33

    5.3 Semantic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    5.4 Evaluation of Entity Extraction Tools . . . . . . . . . . . . . . . . . . 39

    5.4.1 Aim and NER tools . . . . . . . . . . . . . . . . . . . . . . . . 40

    5.4.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    5.4.3 Entity Extraction and Comparison . . . . . . . . . . . . . . . 41

    5.4.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    5.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    6 Sentence Alignment and Evaluation 47

    6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    6.2 Pre-Selection of Sentence Pairs . . . . . . . . . . . . . . . . . . . . . 49

    6.3 Selection of Sentence Pairs for Evaluation . . . . . . . . . . . . . . . 52

    6.4 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    6.5 Judgement of Similarity Measures . . . . . . . . . . . . . . . . . . . . 54

    6.6 Second Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    6.7 Pre-Selection and Creation of Similarity Function . . . . . . . . . . . 59

    6.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    7 Paragraph Alignment and Article Comparison 69

    7.1 Finding Similar Paragraphs . . . . . . . . . . . . . . . . . . . . . . . 69

    7.1.1 Aggregation of Neighboured Sentences . . . . . . . . . . . . . 71

    7.1.2 Aggregation of Proximate Sentence Pairs . . . . . . . . . . . . 72

    7.1.3 Paragraph Aligning Algorithm . . . . . . . . . . . . . . . . . . 74


    7.2 Similarity on Article Level . . . . . . . . . . . . . . . . . . . . . . . . 75

    7.2.1 Text Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    7.2.2 Feature Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 76

    7.2.3 Overall Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 78

    7.3 Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

    8 Implementation 85

    8.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    8.2 Comparison Extracting . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    8.3 Preprocessing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    8.4 Text Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

    8.5 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    9 Discussion and Future Work 93

    9.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    9.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 94

    Bibliography 97

    List of Figures

    1.1 Text Comparison Example . . . . . . . . . . . . . . . . . . . . . . . . 4

    2.1 English Wikipedia Article ”Großer Wannsee” . . . . . . . . . . . . . 8

    2.2 Interlanguage links for the English article ”Pfaueninsel” . . . . . . . . 10

    3.1 First Paragraphs of the Wikipedia Article ”Berlin” . . . . . . . . . . 22

    4.1 Process of Article Comparison . . . . . . . . . . . . . . . . . . . . . . 27

    5.1 Precision Recall Graphs for Textual Features with Break-Even Points 35

    5.2 Box Plots for Textual Features . . . . . . . . . . . . . . . . . . . . . . 36

    6.1 Screenshot of User Study on Similar Sentences . . . . . . . . . . . . . 54

    6.2 Correlation of Syntactic Features for First Data Set . . . . . . . . . . 56

    6.3 Correlation of Text Length Similarity . . . . . . . . . . . . . . . . . . 57

    6.4 Correlation of External Links Similarity . . . . . . . . . . . . . . . . 57

    6.5 Correlation of Time and Entity Similarity for the first Dataset . . . . 58

    6.6 Iteration to Create Similarity Functions . . . . . . . . . . . . . . . . . 60

    6.7 Precision-recall Diagram of Sentences with Overlapping Facts . . . . 64

    6.8 Precision-recall Diagram of Sentences with the Same Facts . . . . . . 65

    6.9 Precision-recall Diagram of Sentences with the Same Facts (Adjusted Similarity Functions) . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    7.1 Paragraph Construction Example (Step 1) . . . . . . . . . . . . . . . 70

    7.2 Paragraph Construction Example (Steps 2 and 3) . . . . . . . . . . . 70


    7.3 Paragraph Construction Example (Steps 4 and 5) . . . . . . . . . . . 71

    7.4 Comparison of the English and German article on ”Knipp” . . . . . . 79

    7.5 Website Example: Text . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    7.6 Website Example: Links . . . . . . . . . . . . . . . . . . . . . . . . . 81

    7.7 Website Example: Images . . . . . . . . . . . . . . . . . . . . . . . . 82

    7.8 Website Example: Authors . . . . . . . . . . . . . . . . . . . . . . . . 82

    7.9 Website Example: Overall Similarity . . . . . . . . . . . . . . . . . . 83

    8.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

    8.2 Preprocessing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    List of Tables

    2.1 Statistics on Wikipedias in Different Languages . . . . . . . . . . . . 9

    3.1 Machine Translation Example . . . . . . . . . . . . . . . . . . . . . . 18

    5.1 Example Sentence Pairs for Time Similarity . . . . . . . . . . . . . . 37

    5.2 Statistics of the N3 Dataset . . . . . . . . . . . . . . . . . . . . . . . 41

    5.3 Number of Entities Extracted from English Texts . . . . . . . . . . . 42

    5.4 Number of Entities Extracted from German Texts . . . . . . . . . . . 42

    5.5 Results of Entity Extraction . . . . . . . . . . . . . . . . . . . . . . . 44

    6.1 Wikipedia Articles Used in the User Study . . . . . . . . . . . . . . . 49

    6.2 Feature Combination Distribution in 14 Wikipedia Articles . . . . . . 50

    6.3 Weights of Similarity Functions for Pre-Selection . . . . . . . . . . . . 51

    6.4 Feature combination distribution in pre-selected Sentence Pairs . . . 52

    6.5 Feature Distribution in the Dataset for the First Round of Evaluation 53

    6.6 Correlation Coefficients for Similarity Measures . . . . . . . . . . . . 55

    6.7 Dataset Evaluated in the Second Round . . . . . . . . . . . . . . . . 62

    6.8 Retrieved Sentence Pairs per Article Pair . . . . . . . . . . . . . . . . 67

    7.1 Composition of Overall Similarity . . . . . . . . . . . . . . . . . . . . 78

    7.2 60 Wikipedia Article Pairs Ordered by Overall Similarity . . . . . . . 84

    8.1 Example of Revisions of an Article in Different Languages . . . . . . 88

    8.2 Example of Revision Triples . . . . . . . . . . . . . . . . . . . . . . . 89


    List of Algorithms

    5.1 Computation of TP, FP and FN for the Evaluation of Entity Extraction . . . 43
    6.1 Identification of Candidates for Similar Sentences . . . 51
    7.1 Extension of Sentence Pairs with Neighbours . . . 72
    7.2 Extension of a Sentence with its Neighbours . . . 72
    7.3 Aggregation of Sentence Pairs . . . 73
    7.4 Paragraph Alignment . . . 74


    1 Introduction

    Wikipedia1 is a user-generated online encyclopaedia that is available in more than 280 languages and is widely used: Currently it counts more than 24 million registered users in the English Wikipedia alone, and the 12 most populated of the available language editions contain more than a million articles each2. Wikipedia articles describing real-world entities, topics, events and concepts evolve independently in different language editions. Up to now, there have been only insufficient possibilities to benefit from the knowledge that can be gained from these differences, although this could be useful for social research purposes or to extend Wikipedia articles with content from other language versions. Therefore, in this thesis we propose methods to automate a detailed comparison of Wikipedia articles that describe the same entities in different languages and create an example application that presents the findings to human users.

    Wikipedia articles can be compared at different levels of granularity. In this work we focus on three levels: the sentence level, the paragraph level and the article level. They are presented in a bottom-up order: Similar sentences are identified and merged to find similar paragraphs. The fraction of overlapping paragraphs is then used as an important component for the similarity score at the article level.

    At first, we develop methods to identify and align similar sentences in the articles. To do so, we analyse the effectiveness of several syntactic and semantic features extracted from the texts. Moreover, we go further than related studies in this field by aligning not only the sentences with the same facts, but also the sentences with partly overlapping contents. As this step builds a foundation for the paragraph alignment and the article comparison, we perform an extensive user study to evaluate and fine-tune our proposed similarity functions. In the second step, we use the resulting sentence alignment to develop algorithms for the alignment of similar paragraphs. This paragraph alignment method contributes to the improved visualisation of the textual comparison

    1 http://www.wikipedia.org/
    2 http://meta.wikimedia.org/wiki/List_of_Wikipedias


    by creating bigger paragraphs from the sentence pairs that were assigned in the previous step. Finally, as Wikipedia articles contain much more information than the raw texts (images, authors, links, etc.), we define further similarity measures that are applied at the article level to compute an overall similarity value for two articles in different languages.

    With these approaches to find similarities and differences across article pairs that describe the same entity in different languages, there are many possibilities to investigate cross-lingual differences: Amongst others, we implement applications that illustrate the development of the article similarity over time, rank article pairs by their similarity and place the article texts in different languages side by side to visualise common paragraphs. These applications can support Wikipedia editors and researchers by providing an overview of the similarities and differences of the articles and their temporal development.

    1.1 Motivation

    While collaboration is an indispensable part of Wikipedia editing within one language edition, it becomes a problem across languages: Apart from the language links interlinking articles on the same entities, multilingual coordination is difficult across Wikipedia – each language edition even has a separate set of user accounts3. Therefore, a tool that compares articles across languages can be a way to bridge this gap. Further aspects that our research aims at are listed below:

    Social and cultural research: As Wikipedia articles are continuously written over a long period of time by a large number of editors, a study of Wikipedia articles can always be seen as an investigation of the users as well.

    Help for Wikipedia authors: When a Wikipedia author wants to add something to an article, it is very probable that they will find additional information in an article in another language. If we provide a means to visualise the text passages or concepts that do not occur in the version in the author's language, they can quickly get an idea of which information is worth adding to the article.

    Trustworthiness of Wikipedia: Wikipedia is part of many investigations and programs – both for direct human interaction and for indirect information collection by automated systems. Given this importance of Wikipedia as an information resource, there have been many discussions on the reliability of Wikipedia4.

    Taking into account not just one Wikipedia edition, but extracting the information of more than one language version, it becomes possible to collect

    3 http://en.wikipedia.org/wiki/Wikipedia:Multilingual_coordination
    4 http://en.wikipedia.org/wiki/Reliability_of_Wikipedia


    information from independent groups of authors5 and either to further expand the knowledge with language-exclusive content or to discover language-specific differences. This allows for a better estimation of how reliable the texts are.

    Statistics: Many different statistics and tools about Wikipedia are accessible, mostly about the development of page views and edits6. This shows that there is considerable interest in automatically deriving interesting information from Wikipedia. For multilingual comparisons, there is the website www.manypedia.com, which is similar to our approach but does not go deeper into textual similarity.

    Existence of Neutrality across languages: Finally, the question arises whether it is possible to maintain the idea of a neutral point of view across languages, which also means across cultures. However, this can be seen as a question that is out of the scope of this thesis and rather touches topics of sociology.

    1.2 Problem Definition

    The comparison of Wikipedia articles that describe the same entity in different languages can be split into two tasks: The first task solely refers to the texts of the articles and takes place on the sentence and paragraph level. Here, the goal is to link similar text parts. The second task is done on the article level and takes additional information into account, for example the authors and the external links mentioned in footnotes.

    Text Comparison

    The text comparison is done to get precise information on how similar the texts are and where their similarities and differences lie. Figure 1.1 shows what the text comparison should result in (with shortened versions of the English and German abstracts of the Wikipedia article about the General Post Office): The English text is shown on the left and the German one on the right. The parts that are identified as similar are linked by green lines.

    In this example, two subtopics are found that occur similarly in both languages: The first is a general description of the General Post Office, its founding and its establishment as state postal system and telecommunications carrier. The second common fact is about the office of Postmaster General created in 1661. The black parts without links contain information that is unique to the respective language.

    5 As shown in [Digital Methods], this does not hold completely, as some authors contribute to multiple Wikipedia editions.

    6 http://en.wikipedia.org/wiki/Wikipedia:Statistics



    Figure 1.1 Text Comparison Example

    This kind of comparison reveals several possibilities for a human reader: At first glance, you can see which text parts are similar across languages. In contrast, unmarked text parts contain contents exclusive to the respective language version. So, if you are interested in discussing the linguistic point of view (as defined in [17]) of the articles, you can investigate which and how many parts are similar and also compare the text structures that way. As a Wikipedia author, you can look at the text in the other language and search for unmarked passages that represent facts not appearing in the article of your language.

    To do the paragraph alignment, there are two premises that differ from related studies:

    Sentences can be aligned if they share only some of their facts.

    A sentence in one language can be assigned to more than one sentence in the other language.

    This is mainly done to support the paragraph construction that is performed after identifying the (partially) overlapping sentences.

    On the article level, the paragraph alignment allows for deriving numerical values to judge the semantic similarity of cross-lingual texts: The fraction of text whose contents are found in the other text as well and a simple comparison of the text lengths are examples of such measures.
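
    The following is an illustrative sketch of two such article-level signals (not the exact measures defined later in this thesis): the fraction of an article's characters covered by aligned paragraphs, and a simple length comparison. The paragraph lists are assumed to be plain strings.

        def aligned_fraction(aligned_paragraphs, all_paragraphs):
            # Share of an article's characters contained in paragraphs that
            # have a counterpart in the other language version.
            total = sum(len(p) for p in all_paragraphs)
            return sum(len(p) for p in aligned_paragraphs) / total if total else 0.0

        def length_similarity(text_a, text_b):
            # Simple comparison of the two article lengths (1.0 = equal length).
            return min(len(text_a), len(text_b)) / max(len(text_a), len(text_b), 1)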

    Article Comparison

    Another way of comparing articles also considers aspects in their surroundings that are not directly part of the texts, like external links and mentioned entities. Although not directly connected to the articles' texts, this also includes a comparison of the authors and their locations. The paragraph alignment is included in the article comparison by computing the fraction of paragraphs that have counterparts in the other article.


    Revision History

    In the context of investigating the history of the web7, we also aim at inspecting the similarity of articles over the course of time. Two things are needed for this goal: For an article, several revisions (articles at a specific point of time) have to be collected over time, and for each of the revision pairs found that way an overall similarity has to be defined that is derived from both the text and the revision comparison values.

    1.3 Overview

    In the next two chapters, background information and related work on the special characteristics of multilingual Wikipedia (Chapter 2) and on multilingual text processing (Chapter 3) – including related approaches like plagiarism detection – are described to give a first overview of the challenges and possible solutions.

    Chapter 4 gives a clear idea of how we tackle our research aim and contains a sketch of the procedure that is applied to reach our goals.

    To identify similar sentences across articles, it is necessary to extract additional information from the sentences. Its collection and usage is explained in Chapter 5; in Chapter 6 its effectiveness with regard to sentence alignment is investigated and evaluated by a user study. Having constructed a sentence alignment function, similar sentences can then be joined by the algorithms given in Chapter 7. Aside from pure text comparison, this chapter also contains information about similarity measures on the article level and screenshots of our example application.

    The realisation of the information extraction process (including a preprocessing pipeline) is described in Chapter 8.

    Finally, we discuss our results in Chapter 9.

    7 as it is done, for example, in https://www.l3s.de/en/projects/iai/~/alexandria/


    2 Background on Multilingual Wikipedia

    Wikipedia describes itself as follows1:

    "Wikipedia is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Anyone who can access the site can edit almost any of its articles. Wikipedia is the sixth-most popular website2 and constitutes the Internet's largest and most popular general reference work.

    Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. [...] Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices."

    A Wikipedia article describes one real-world entity, topic, event or concept, for example "Barack Obama", "Politics" or "United States elections, 2012". Each article has once been created by a user, and users can extend and edit it afterwards. Due to this, a Wikipedia article never reaches a final state, but rather develops over time. The state of an article at a specified point of time is called a revision3.

    Wikipedia articles are written in a markup language called Wiki markup, allowing for image inclusion, hyperlinking and more. There is even structured data available for Wikipedia articles that is part of some articles in the form of an infobox and can be accessed via the DBpedia dataset4. Figure 2.1 shows an example of an English Wikipedia article about the German lake "Großer Wannsee", which for example has an infobox on the right, a photo gallery and external links at the bottom.

    1 http://en.wikipedia.org/wiki/Wikipedia
    2 http://www.alexa.com/siteinfo/wikipedia.org
    3 However, when speaking about an article in this thesis, we often mean the most current revision or the represented object.
    4 http://dbpedia.org/


    Figure 2.1 English Wikipedia Article ”Großer Wannsee”

    The following kinds of information are given by the Wikipedia markup and used in our research:

    Images: Throughout the article, images can be displayed. These are stored on the Wikipedia servers and are mostly available as a smaller thumbnail and as the original file.

    Internal links: Words that are hyperlinked within the text and refer to other Wikipedia pages.

    External links: Some words or sentences are assigned to one or more footnotes. The footnotes may contain links to external websites.

    2.1 Overview

    In March 2015, there were 288 languages for which an own Wikipedia existed5. These include languages with more than a million articles, like English and German, but also Greek, Afrikaans and Greenlandic with fewer articles. Among these language versions, there are even some non-official languages like Simple English

    5 http://meta.wikimedia.org/wiki/List_of_Wikipedias


    and Esperanto or regional varieties like Bavarian. Table 2.1 shows the number of articles and authors for a few Wikipedias6.

         Language    Articles    Edits        Users       Active Users
    1    English     4,636,933   741,415,363  22,986,467  133,327
    2    Swedish     1,946,828   28,362,530   403,378     2,884
    3    Dutch       1,794,646   43,378,141   639,436     4,136
    4    German      1,771,852   141,065,828  1,996,778   19,583
    ...
    79   Afrikaans   33,412      1,357,691    62,903      126
    ...
    125  Bavarian    10,467      435,892      26,447      71

    Table 2.1 Statistics on Wikipedias in Different Languages

    Interlanguage links

    Articles that represent the same real-world entity are interlinked by interlanguage links, which are collected in a central database by the Wikidata project7,8.

    As an example, for the Wikipedia article about "Berlin", there are 221 interlanguage links. For the "Pfaueninsel" – an island in Berlin – there are 11 links, as seen in Figure 2.2 (ten outgoing interlanguage links plus the English article itself).
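
    As a minimal sketch, such interlanguage links can be retrieved through the standard MediaWiki API; the endpoint and the langlinks property exist, but the exact response layout assumed in the comments should be checked against the current API documentation.

        import requests

        def interlanguage_links(title, lang="en"):
            """Return a dict {language code: article title} for one article."""
            url = f"https://{lang}.wikipedia.org/w/api.php"
            params = {
                "action": "query",
                "prop": "langlinks",
                "titles": title,
                "lllimit": "max",
                "format": "json",
            }
            data = requests.get(url, params=params).json()
            links = {}
            for page in data["query"]["pages"].values():
                for ll in page.get("langlinks", []):
                    links[ll["lang"]] = ll["*"]  # "*" holds the linked title
            return links

        print(interlanguage_links("Pfaueninsel"))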

    2.2 Wikipedia Guidelines

    To support its authors and to reach the aim of being a "free, reliable encyclopedia"9, Wikipedia has introduced different policies and guidelines. Two of them are presented in the following, because they are of major relevance for our investigation of multilingual articles.

    2.2.1 Translations

    The contents of the different Wikipedias evolve independently from each other, which results in different numbers of articles and varying levels of detail (as shown in Section 2.4).

    6 as of November 2, 2014
    7 http://en.wikipedia.org/wiki/Wikipedia:Wikidata
    8 There have been recent changes in the handling of interlanguage links: In Wikidata, all language links for one article are stored compactly together. Before that, each language version of an article had its own list of language links.
    9 http://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines


    Figure 2.2 Interlanguage links for the English article ”Pfaueninsel”

    Obviously, this leads to the question whether it is reasonable to translate Wikipedia articles (or parts of them) from one language into another10. The Wikipedia policies regarding this question are stated as follows11:

    "Articles on the same subject in different languages can be edited independently; they do not have to be translations of one another or correspond closely in form, style or content. Still, translation is often useful to spread information between articles in different languages. [...] Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing."

    According to this quote, (parts of) articles in different languages can be divided into three groups:

    Independently evolved text: In the natural case, Wikipedia authors edit articles on their own without adopting every piece of information from the article in other languages.

    Human translations: Especially when an article does not exist in an author's language, they may want to copy e.g. the English one. To do so, they adopt all the information and translate it manually. This approach is therefore useful to spread information without additional research. To indicate that an article has been translated from another one or is still in the process of being translated, Wikipedia gives some advice, including Wiki markup techniques.

    10 As seen in Section 2.1, the English Wikipedia has by far the most articles, so this is a frequent example where translations can be used to spread information.

    11 http://en.wikipedia.org/wiki/Wikipedia:Translation


    Machine translation: Instead of translating an article manually, machine translation techniques could be used to save effort. However, this is not allowed because of low-quality results and the fact that each user can access machine translation tools on their own.

    For our research, we consider both of the first two approaches. As texts may contain the same information even though there was no human translation, and different parts of an article can behave differently, the borders between the two cases become blurred anyway.

    2.2.2 Neutrality

    A big concern for Wikipedia is the neutrality of its articles. To emphasize this, the "core content policy" neutral point of view (NPOV) was introduced and defined as follows12:

    "Editing from a neutral point of view (NPOV) means representing fairly, proportionately, and, as far as possible, without bias, all of the significant views that have been published by reliable sources on a topic. All Wikipedia articles and other encyclopedic content must be written from a neutral point of view. NPOV is a fundamental principle of Wikipedia and of other Wikimedia projects. This policy is nonnegotiable and all editors and articles must follow it."

    As the following section will show, studies on multilingual Wikipedia make it reasonable to state that the NPOV is not fulfilled across different languages.

    2.3 Linguistic Point of View

    There has been some research aimed at judging whether the NPOV can hold in the cross-lingual context. In [12], the authors describe the global consensus hypothesis, which says "that encyclopedic world knowledge is largely consistent across cultures and languages". As they find that "knowledge diversity across Wikipedias is large", they discard this hypothesis and emphasize that this has an important impact on the many technologies using Wikipedia data that are said to rely on that hypothesis.

    To distinguish this phenomenon from the NPOV, the concept of a Linguistic Point of View (LPOV) is introduced in [17], motivated by the question "will relatively isolated language communities of Wikipedia develop their divergent representations for topics?".

    12 http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view


    2.4 Reasons for Multilingual Differences

    Assuming the NPOV and an equal workload of Wikipedia authors, the translated content of Wikipedia articles would be the same in different languages. This is not the case for many reasons, which are described in the following.

    Different number of authors and different interests

    As shown in Section 2.1, the different Wikipedias vary in the number of authors – and therefore in the number of articles as well13.

    In [8] there is a more detailed investigation not only of the number of pages in different languages, but also of their lengths. The authors have analyzed the Wikipedia entries about 48 persons and found out that all of them have an English version, but only 26 persons have articles in more than 20 languages – including the former Secretary-General of the United Nations, Kofi Annan, who had the most – 86 – language links at the time of the study.

    This might lead to the belief that the English Wikipedia is some kind of superset of the other Wikipedias, which is proven to be wrong in that investigation. For the 43 remaining persons with at least one non-English entry, they compare the number of sentences in the different languages' articles. The result shows that for 17 of the persons there exists at least one language version with an article that has more – for some persons more than twice as many – sentences than the English one.

    Given these results, the authors conclude that "despite the fact that English has descriptions for the most number of Wikipedia entries across all languages, English descriptions can not always be considered as the most detailed descriptions" and that "Multilingual Wikipedia is full of information asymmetries".

    As one of the main sources of this asymmetry, they describe that many persons, locations and events are only important within smaller communities that speak the same language. To illustrate this, they give the examples of a Mexican singer with a Spanish entry only and a Greek mountain with four entries in different languages.

    Cultural Reasons

    There are some topics which are of interest for many people, but the perception of them varies across different cultural communities. As a result, there are articles that exist in many languages and may even have a similar number of sentences, but their contents differ a lot.

    13 Looking at the Swedish Wikipedia, the number of authors and the number of articles need not be directly related: In 2013, nearly half of the Swedish articles were automatically created by bots, which led to various debates (http://blog.wikimedia.org/2013/06/17/swedish-wikipedia-1-million-articles/).


    [26] describes this aspect in detail with a very extensive comparison of the Wikipedia entries for the Srebrenica massacre in the language versions of countries that were directly affected by the events of 1995, namely the Serbian, Bosnian, Dutch, Croatian, Serbo-Croatian and English versions.

    They find a lot of differences that can easily be explained by cultural biases. Among these differences are the following three:

    There have been many discussions about whether to title the article "Srebrenica Massacre" or "Srebrenica Genocide".

    The victim counts differ across the language versions.

    The people who are blamed for the events are not named in a consistent way: In the Serbian article, they avoid calling them "Bosnian Serb forces" and prefer to say "Army of the Republika Srpska".

    Beyond this inspection of the texts themselves, the authors use some other methods that are described in Section 2.5 – including the comparison of images and links used in the articles. Moreover, they take a look at the locations of the Wikipedia authors who contributed to the pages14 and emphasize the role of power editors who are responsible for a major part of an article.

    At the end of their study, they conclude that the NPOV does not hold, as it is not possible to find any "universality" between the different articles about the same topic.

    Advertisement

    Probably the most critical aspect concerning the NPOV occurs when Wikipedia pages are edited or created by people who believe that they will profit from those changes. This kind of advertisement is – of course – in stark contrast to the idea of neutrality.

    There have been many investigations to find such non-neutral editing, which led to the banning of more than 250 user accounts in October 2013 whose owners "have been paid to write articles on Wikipedia promoting organizations or products, and have been violating numerous site policies and guidelines"15.

    Furthermore, there are more detailed reports of companies that have directly influenced articles which affect them. [16] states that every third German company listed on the stock exchange has behaved in that way. This includes small changes such as replacing the text "one of the leading companies" with "the leading company" and bigger changes where the content of a company's press release was inserted into an article without further adjustments.

    14 For non-registered users, the IP address is visible and allows conclusions about the user's location.

    15 http://blog.wikimedia.org/2013/10/21/sue-gardner-response-paid-advocacy-editing/


    Political Reasons

    Similar to the manipulations for advertisement, which are done by the respective companies, there is also manipulation with a political background. For example, there have been reports claiming that the Russian government has edited Wikipedia pages on flight MH17 to blame Ukraine for having shot down that flight16. Although this was quite a naive attempt at manipulation and may just have been done by an individual without any official instruction, it gives an insight into how powerful political manipulation may be.

    Wikipedia edits that were made by people connected to governmental institutions can be found by exploiting the circumstance that edits of unregistered Wikipedia users are annotated with the IP address of their author. Given a list of IP addresses used in state institutions, it becomes a rather easy task to generate lists of suspicious edits. This has been done (the edit on the MH17 page was found this way) and resulted in a collection of Twitter accounts sending messages as soon as such an edit has been made and recognized17.

    To spread the manipulated texts, it is especially interesting for governmental sources to change not only the articles of the Wikipedia in their own language, but also those of other languages. For example, there are more than 3000 edits18 from the German parliament and government within the English Wikipedia.

    2.5 Wikipedia Studies

    As already mentioned, there exist some tools that make use of multilingual Wikipedia. We will show three of them in this section.

    Concept Similarity

    In [17], the website www.manypedia.com is presented. On that interface, you can enter an article name and two languages. After a short calculation time, both articles are shown side by side and some information is highlighted: The images of each article are shown compactly together at the top, there are some statistics about the edit history (amongst others, the number of edits and the top authors) and – as the most important aspect – a concept similarity, which is the overlap of common Wikipedia links mentioned in the articles. Additionally, you can translate non-English articles into English by machine translation.

    16 http://www.wired.co.uk/news/archive/2014-07/18/russia-edits-mh17-wikipedia-article
    17 https://jarib.github.io/anon-history/
    18 Taking a closer look at the edits, the biggest part of them can be identified as small and non-manipulative corrections, though.


    Social Research

    In [26], many aspects were manually investigated and visualised to compare the articles about the massacre of Srebrenica, including the following:

    Authors: The locations of anonymous editors per language version are shown by pie charts.

    Table of contents: The table of contents is manually aligned to mark passages that appear in more than one language version.

    External link hosts: From the external URLs mentioned in an article, only the first part (the host name) is considered; the host names are contrasted in a table where common ones are marked. The host name is used instead of the complete URL because many multilingual websites – and these are the ones that are probably mentioned in articles in different languages – can be united this way.

    Images: There is also a table for common images. Here, even similar images are aligned.

    Edit History

    There are some tools to visualise the number of edits over time19, 20. Although these tools do not focus on the comparison of multilingual articles, it is no problem to do this kind of visualisation for more than one language at once. The first tool even contains a world map where the locations of editors are marked.

    19 http://sonetlab.fbk.eu/wikitrip/
    20 http://sergionunes.com/p/wikichanges/


    3 Background on Multilingual Text Processing

    Our aim of analyzing multilingual Wikipedia articles written in different languages touches several research topics that will be addressed in this section. On the one hand, there are topics that cover the inspection of multilingual text documents of different kinds – for instance scientific papers or accurate translations, but not necessarily Wikipedia or even web-specific texts. On the other hand, there are a lot of Wikipedia-specific investigations.

    To structure this, we will give an overview of related work in the following topics:

    Multilingual Natural Language Processing: To be able to find similarities between texts using automatic procedures, it is mandatory to extract information from them. To further allow the comparison of multilingual texts, the extraction has to be applicable in different languages.

    Text Aligning / Parallel Corpora: Many machine translation programs require a collection of aligned texts in multiple languages. To build such a multilingual corpus, it is necessary to find similar passages in texts.

    Plagiarism Detection: Plagiarism can be committed not only by copying texts that are written in the same language, but also by adopting texts from another language. Identifying such plagiarism touches our research topic and will be discussed in Section 3.2.2.

    3.1 NLP for Multilingual Text

    3.1.1 Machine Translation

    To conduct a comparison of texts that is solely based on the texts themselves and does not have any further extracted information available, it is necessary to have both texts


    written in the same language. Because of their size, number and frequent changes, it is not possible to translate Wikipedia articles manually. However, an evaluation will later show that it is not possible to get good results for text comparison without using translations (see the Wikipedia baseline described in Chapter 6). That is the reason why it is essential for this study to use machine translation.

    There exists a large number of machine translators that can be queried using web applications, such as the Bing Translator1 or the Google Translator2. For some translators, there exists a web API to simplify access for programmers (for example, the Bing Translator via the Microsoft Translator API3).

    There are different approaches to machine translation, including rule-based, dictionary-based and statistical translations4. The statistical approach (which is used by the Microsoft Translator5) is based on so-called multilingual parallel corpora. These are texts in at least two languages where the corresponding parts are connected to each other. From these corpora, statistical features can be extracted which allow the translations to be performed.

    As already shown in Section 2.2.1 and by the example in Table 3.1, machine-translated texts do not reach the quality of human translations. Nevertheless, they can be used for the multilingual text comparison, as the translated text is not meant to be presented to a human reader, but is used, for example, to compare single words.

    German sentence: Maschinelle Übersetzung, auch automatische Übersetzung, bezeichnet die Übersetzung von Texten aus der Quellsprache in eine Zielsprache mit Hilfe eines Computerprogramms.

    Human translation (English): Machine Translation, also automatic translation, is the translation of texts from a source language in a source language with the help of a computer program.

    Machine translation (English): Machine translation, automatic translation means the translation of texts from the source language into a target language with the help of a computer program.

    Table 3.1 Example of Machine Translation: Human versus Machine Translation

    1 http://www.bing.com/translator/
    2 https://translate.google.de/
    3 http://www.microsoft.com/translator/default.aspx
    4 http://en.wikipedia.org/wiki/Machine_translation
    5 http://www.microsoft.com/translator/automatic-translation.aspx


    3.1.2 Textual Features

    To support the comparison of texts, there exist methods that are well known in the field of Information Retrieval:

    N-grams

    An n-gram is a sequence of tokens of length n in a text. For example, the sentence "An n-gram is a sequence" can be split up into the character 5-grams (n = 5) "An n-", "n n-g", " n-gr" and so on. On the word level, this sentence consists of the following word bigrams (n = 2): "An n-gram", "n-gram is", "is a" and "a sequence".
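
    A minimal sketch of both kinds of n-gram extraction, reproducing the examples above:

        def char_ngrams(text, n):
            # All overlapping character sequences of length n.
            return [text[i:i + n] for i in range(len(text) - n + 1)]

        def word_ngrams(text, n):
            # All overlapping sequences of n consecutive words.
            words = text.split()
            return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

        sentence = "An n-gram is a sequence"
        print(char_ngrams(sentence, 5)[:3])  # ['An n-', 'n n-g', ' n-gr']
        print(word_ngrams(sentence, 2))      # ['An n-gram', 'n-gram is', 'is a', 'a sequence']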

    Stop Word Removal

    Natural-language texts often contain words that are just part of the text structure but add nothing when comparing texts. These words are removed from the texts by using a blacklist of so-called stop words. For example, the sentence "Stop words are the words that are filtered out" becomes "Stop words words filtered" after stop word removal with a common blacklist.

    Stemming

    For text comparison, it does not matter how words are declined or conjugated. Therefore, it helps to reduce each word to its word stem. This is done in a process called stemming, which can, for example, be implemented by cutting words off after applying specific rules. In the example sentence from before, the resulting sentence is "stop word filter" after stop word removal followed by stemming.
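
    A minimal sketch of both steps using NLTK (an assumed choice of library; the stop word list must be downloaded once via nltk.download('stopwords'), and the exact output depends on the blacklist used):

        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer

        stop_words = set(stopwords.words("english"))
        stemmer = PorterStemmer()

        sentence = "Stop words are the words that are filtered out"
        filtered = [w for w in sentence.lower().split() if w not in stop_words]
        stems = [stemmer.stem(w) for w in filtered]
        print(filtered)  # ['stop', 'words', 'words', 'filtered']
        print(stems)     # ['stop', 'word', 'word', 'filter']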

    3.1.3 Topic Extraction

    One of the tasks in NLP that focuses on the semantic features of a text is the extraction of elements that are useful for information collection and belong to the same concepts. Below, two types of such elements and their extraction are shown: named entities and dates. Both types can be used to match elements with each other for texts in different languages.

    Named Entity Recognition

    Named entities are often divided into three categories: people, organisations and locations. The following line gives an example of these categories:


    Since the United States [Location] presidential election of 2008, held on Tuesday, November 4, 2008, Barack Obama [Person] from the Democratic Party [Organisation] is the 44th U.S. [Location] president.

    [22] gives an overview of techniques for Named Entity Recognition (NER); [9] gives a more detailed idea of the implementation of the Stanford Named Entity Recognizer6, which does not only use local information but profits from building long-distance dependency models. NER programs usually use language-specific models that are trained with text collections where each named entity is annotated.

    Although named entities seem to be a good means for comparing texts, they suffer from the disadvantage of not being unique. For example, the location entity "U.S." from the example sentence above can be named "United States" as well. The differences can be even bigger for comparisons with texts in other languages: "United States" is "Vereinigte Staaten" in German and "Соединённые Штаты" in Russian. It is possible to overcome this problem by defining specific similarity measures [20] or by doing transliteration and normalisation [28].

    The problem of finding canonical unambiguous references for the same entities is known as named entity normalization (NEN) [13] and can be attacked by using another kind of NER called entity linking: Each entity is not identified by its name, but by a unique resource identifier. Given its large number of articles – and therefore unique entities – Wikipedia (and DBpedia respectively, as they share their entities) is often used as a resource for such entities [18][21][7].

    The following sentence illustrates the Wikipedia entity linking for the example from above:

    Since the United States presidential election of 2008 [http://en.wikipedia.org/wiki/United_States_presidential_election,_2008], held on Tuesday, November 4, 2008, Barack Obama [http://en.wikipedia.org/wiki/Barack_Obama] from the Democratic Party [http://en.wikipedia.org/wiki/Democratic_Party_(United_States)] is the 44th U.S. president.

    Cross-lingual comparisons of entities can be done using the language links provided by Wikipedia (see Section 2.1). In Section 5.4, there is a detailed evaluation of Wikify and DBpedia Spotlight for entity extraction.
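
    As a minimal sketch of such a cross-lingual comparison, each extracted entity can be mapped to a language-independent identifier via the interlanguage links, so that "United States" and "Vereinigte Staaten" count as the same entity. The small mapping table below is a hand-made stand-in for the full link database.

        # Hypothetical excerpt of an interlanguage link table:
        # (language, local article title) -> language-independent identifier
        INTERLANGUAGE = {
            ("en", "United States"): "United States",
            ("de", "Vereinigte Staaten"): "United States",
            ("en", "Barack Obama"): "Barack Obama",
            ("de", "Barack Obama"): "Barack Obama",
        }

        def canonical_entities(entities, lang):
            return {INTERLANGUAGE.get((lang, e), f"{lang}:{e}") for e in entities}

        en_entities = canonical_entities({"United States", "Barack Obama"}, "en")
        de_entities = canonical_entities({"Vereinigte Staaten", "Barack Obama"}, "de")
        print(en_entities & de_entities)  # entities shared by both articles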

    In this thesis, NER is only performed on Wikipedia texts. Given their internal links, the question arises whether an additional entity extraction adds any information. The Wikipedia Manual of Style/Linking7 lists the rule "Generally, a link should appear only once in an article" (followed by some exceptions: for example, links can be attached to the same entity name in both the infobox and the article). Instead of the entity linking approach, [1] first builds a bilingual dictionary of the

    6 http://nlp.stanford.edu/software/CRF-NER.shtml
    7 http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style


    whole Wikipedia article collection and then searches for occurrences of the link text in an n-gram based approach. While this method promises a higher confidence in the correctness of found words, it fails to detect inflected entity mentions8.

    Time Extraction

    Another type of named entities are time annotations that represent dates, time ranges or even sets of them. Their extraction uses language-specific rules like string matching for month names or date expression matching (by searching for substrings of the type "d m yyyy", with "d" representing a day number, "m" a month number and "yyyy" the four digits of a year). In contrast to other entities, the comparison of time annotations should not rely on simple string equality. For example, a day may be matched with its containing week.
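
    A minimal sketch of such rule-based date extraction and a tolerant comparison; the numeric pattern and the week-based matching are illustrative assumptions, and real extractors use much richer language-specific rules:

        import re
        from datetime import date

        # Matches numeric date expressions such as "9.11.1989" or "9 11 1989".
        DATE_PATTERN = re.compile(r"\b(\d{1,2})[./ ](\d{1,2})[./ ](\d{4})\b")

        def extract_dates(text):
            return [date(int(y), int(m), int(d)) for d, m, y in DATE_PATTERN.findall(text)]

        def same_week(d1, d2):
            # Tolerant comparison: two dates match if they fall into the same ISO week.
            return d1.isocalendar()[:2] == d2.isocalendar()[:2]

        print(extract_dates("Die Mauer fiel am 9.11.1989."))    # [datetime.date(1989, 11, 9)]
        print(same_week(date(1989, 11, 9), date(1989, 11, 6)))  # True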

    3.1.4 Sentence Splitting

    With the aim not only of calculating a single similarity value for two articles, but also of demonstrating similarities and differences in smaller parts of the articles, the text must somehow be split into smaller parts. As suggested in [3], sentence-based splitting is a good approach and is part of many multilingual NLP tools9, 10.

    When splitting a text into sentences, there are several difficulties which make it obvious that it does not suffice to split the text at every punctuation mark. [4] contains two example sentences for this behaviour:

    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

    Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

    Instead of splitting at every punctuation mark, a model must be trained beforehand by learning from a language-specific corpus with isolated sentences.
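
    A minimal sketch using NLTK's pre-trained Punkt model (an assumed choice of tool; it requires nltk.download('punkt') once), which learns abbreviation statistics from a corpus instead of splitting at every period:

        from nltk.tokenize import sent_tokenize

        text = ("Pierre Vinken, 61 years old, will join the board as a nonexecutive "
                "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
                "publishing group.")
        for sentence in sent_tokenize(text):
            print(sentence)
        # Abbreviations such as "Mr." and "N.V." should not trigger sentence breaks.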

    3.1.5 Other NLP techniques

    Multilingual NLP contains many other topics, of which two important ones will be described in this section – together with the reasons why they are not of primary interest for this research.

    8 DBpedia Spotlight (http://dbpedia-spotlight.github.io/demo/) detects the correct "Australia" entity in the phrase "an Australian woman".

    9 http://nlp.stanford.edu/software/corenlp.shtml
    10 http://opennlp.apache.org/index.html


    Figure 3.1 First Paragraphs of the Wikipedia Article ”Berlin”

    Text chunk extraction

    Until now, the identification of text parts that belong together was limited to thesentence level. To split the text into bigger parts, sliding window techniques weredeveloped that are used to subdivide a text into paragraphs that cover the samesubtopics. One of them is the TextTiling algorithm [11] that makes use of lexicalco-occurence and distribution of words to find gaps between sentences that indicatea change in topic.

    Wikipedia articles already consist of many rather small paragraphs11, which can easily be extracted12 and can be connected in a bottom-up manner to create bigger paragraphs. Figure 3.1 gives an example of the paragraphs (framed in black) in the top section of the English "Berlin" article.
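
    A minimal sketch of this extraction step, following the footnote that paragraphs sit inside <p> tags of the rendered article HTML (the URL and the parsing choices are illustrative assumptions):

        import requests
        from bs4 import BeautifulSoup

        html = requests.get("https://en.wikipedia.org/wiki/Berlin").text
        soup = BeautifulSoup(html, "html.parser")

        # Every <p> element of the rendered article becomes one raw paragraph.
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        print(len(paragraphs), "paragraphs extracted")
        print(paragraphs[0][:120])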

Language identification

There exist NLP tools like NGramJ13 that identify the language of a text by building an n-gram profile of the text and comparing it with statistical features of the n-gram profiles of different languages. For this thesis, it is assumed that every article from the Wikipedia version in a language L only consists of sentences written in L14. As every sentence can thus be assigned to the language L of its Wikipedia version, language identification is not needed.

11In a test set of 14 long articles, there were approximately 3.606 sentences per paragraph.
12This is done during the HTML parsing: paragraphs are within <p> tags.
13http://ngramj.sourceforge.net/
14This assumption does not hold for quotes in the original language.


    3.2 Aligning Multilingual Text

The alignment of text is the task of identifying corresponding sentences within two texts in different languages. Typically, this is done to train machine translators, and the input text is given as a parallel corpus. As Wikipedia cannot be seen as a parallel corpus [1], this section describes approaches to text alignment for comparable corpora.

3.2.1 Comparable Corpora

In [1], a first approach to align similar sentences in multilingual Wikipedia was presented. The authors describe the difficulty of applying sentence alignment functions to Wikipedia texts: for some article pairs, one article may be a translation of the other, while for others the articles may be very different.

    They propose two methods to compute similarity values for sentence pairs:

Machine translation based approach: One of the two texts (the Dutch one in their study) is translated into English, such that both articles are available in the same language. Then, the texts are split into sentences. After stop word removal, the sentence similarity is computed by Jaccard word overlap (which will be explained in Section 5.1).

Link based approach: In the beginning, a bilingual lexicon is created that maps article names to a unique language-independent representation of the article (using the language links described in Section 2.1). Then, a set of n-grams of the sizes n = 1, n = 2, n = 3 and n = 4 is created for each sentence. For each of the n-grams created this way, the lexicon is queried. If it returns a Wikipedia article, the Wikipedia term is added to the n-gram. Finally, the similarity is computed by Jaccard overlap as well.

This approach is mainly intended to find out whether sentence alignment can be done without translations. This would be appropriate, as the main goal is to use the findings for machine translation; otherwise, there would be a kind of causality dilemma.

To perform the sentence alignment with the help of these similarity measures, the following method is applied:

1. All the sentence pairs are ranked by their similarity score (computed with one of the two methods above).

2. The sentence pair with the highest similarity is chosen and put in the list of similar sentences.

3. All sentence pairs where one of the sentences is contained in the chosen sentence pair are removed from the ranked list.

4. The previous two steps are repeated until the list is empty.

The filtering in the third step ensures compliance with the following assumption: each sentence may be aligned with at most one other sentence in the other article. We call this the 1:1 assumption.
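The following sketch illustrates this greedy procedure under the 1:1 assumption. The word overlap similarity and the threshold parameter (which anticipates the refinement of [19] described below) are placeholders assumed for this example.

    from itertools import product

    def align_sentences(sentences_a, sentences_b, similarity, threshold=0.0):
        """Greedy 1:1 alignment: repeatedly take the highest-scoring remaining pair
        and skip every other pair that reuses one of its sentences."""
        ranked = sorted(
            ((similarity(a, b), i, j)
             for (i, a), (j, b) in product(enumerate(sentences_a), enumerate(sentences_b))),
            reverse=True)
        aligned, used_a, used_b = [], set(), set()
        for score, i, j in ranked:
            if score < threshold:
                break                      # everything below is even less similar
            if i in used_a or j in used_b:
                continue                   # enforces the 1:1 assumption
            aligned.append((i, j, score))
            used_a.add(i)
            used_b.add(j)
        return aligned

    def word_overlap(a, b):                # placeholder similarity for the demo
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    english = ["The cat sleeps on the sofa.", "Dogs bark loudly."]
    translated = ["The cat sleeps on the sofa a lot.", "Birds sing in the morning."]
    print(align_sentences(english, translated, word_overlap, threshold=0.02))
    # [(0, 0, 0.5)]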

In [19], the second, link based approach and a similar alignment method are used, but results are improved by including thresholds for length similarity and sentence similarity. Referring to [10], the authors find that sentences that differ a lot in their lengths are mostly not similar. Therefore, only sentences that reach a character length correlation of at least 0.5 and only sentence pairs that exceed a similarity score of 0.02 are added to the ranked list.

For a data set of 30 article pairs, they reach a precision of 21% compared to 10% when using the original methods of [1].

    3.2.2 Plagiarism Detection in Multilingual Text

[3] gives an overview of how to define different types of plagiarism (the adoption of text without reference to the source) and how to detect them. For this purpose, they classify plagiarism detection into monolingual and cross-lingual. The latter class is the one of interest for this thesis' research, as it refers to the identification of similar subsections within texts in different languages. The requirements are not the same as those given in the problem of this thesis, as the following review of the plagiarism detection methods will show. Besides the actual procedure of text processing, we review the techniques that are used for the syntactic and semantic comparison of text passages.

[3] provides an abstract design for cross-lingual plagiarism detection, of which [2] shows a concrete example implementation. The procedure consists of the following steps, given an input document $d_q$ in a language $L_q$ and a document collection $D$:

1. Reduce the complexity of $d_q$ and then translate it using machine translation. This results in a new document representation $d'_q$.

2. By using $d'_q$, reduce $D$ to a smaller set of documents $D_x$ that are good candidates to have been plagiarised by $d_q$.

3. For each document $d_x \in D_x$: perform a pair-wise comparison with $d_q$ to find parts $s_q$ from $d_q$ and $s_x$ from $d_x$ that are similar.

    4. Merge found pairs if they are within a short distance.


For the first step, [2] uses a summarization strategy that extracts the $n$ most important sentences from $d_q$. The pairwise comparison in the third step is performed by computing the longest common subsequence similarity15 between translated sentences.

In this thesis, the plagiarism detection process does not have to be executed completely in this way: the first two steps are irrelevant, because the document collection $D$ initially only consists of a single article (for each language of interest), namely the Wikipedia article in another language found by using the language link.

Nevertheless, the second step gives some ideas with regard to the comparison of articles on the "document level", without splitting the text into smaller passages. For that case, it is worthwhile to take a look at different features that are used to calculate similarity values between smaller text fragments.

[3] gives an overview of textual features for cross-lingual plagiarism detection. They distinguish between syntactic, semantic and statistical features. The syntactic features aim at splitting the text into smaller passages: words, sentences or chunks. This can be done by the methods described in Sections 3.1.4 and 3.1.5.

Due to the inaccuracy of purely lexical (e.g. character n-grams) or syntactic features on cross-lingual texts, it is of major importance to combine those measures with semantic and statistical features. For the extraction of semantic features, synonyms of the words occurring in the text are collected to improve the identification of corresponding words.

In [2], a single feature is used, namely the longest common subsequence of the compared sentences. The similarity value derived from this feature, which is described later in Section 5.1, must exceed 0.65 for a sentence pair to be classified as plagiarised.

As the identification of similar sentences alone is not enough to detect plagiarism, a post-processing step is applied to the detected sentence pairs in which pairs that have no more than 10 characters between them are merged. This can be used for our aim of paragraph alignment as well.
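A minimal sketch of such a merging step, assuming the detected passages are given as character offset spans in one of the texts; this representation is an assumption made for the example.

    def merge_adjacent_matches(matches, max_gap=10):
        """Merge detected spans that are no more than `max_gap` characters apart."""
        merged = []
        for start, end in sorted(matches):
            if merged and start - merged[-1][1] <= max_gap:
                merged[-1] = (merged[-1][0], max(end, merged[-1][1]))  # extend previous span
            else:
                merged.append((start, end))
        return merged

    # Spans 0-40 and 45-90 are only 5 characters apart, so they become one passage.
    print(merge_adjacent_matches([(0, 40), (45, 90), (200, 230)]))
    # [(0, 90), (200, 230)]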

15http://www.cs.umd.edu/~meesh/351/mount/lectures/lect25-longest-common-subseq.pdf


4 Approach Overview

With the knowledge of the previous chapters (mainly the characteristics of multilingual Wikipedia and several techniques for multilingual text processing), the comparison of Wikipedia articles becomes possible. This chapter gives an overview of the steps that are needed to do so. This includes the data extraction, the comparison of features in this data and finally the identification of similar sentences, and later paragraphs, within the texts. To reason about the results, some aspects will be evaluated, e.g. by a user study.

Figure 4.1 gives a rough overview of what is done during the process of article comparison for a single article pair. That means there is one entity for which an English article and an article in another language exist.

Figure 4.1 Process of Article Comparison (steps: Find Revisions → Extract Sentences and Features with Pre-processing Pipeline → Store in Database → Find Similar Sentences → Find Similar Paragraphs → Compute Revision Similarity)

The first three steps are part of the data collection and are described in more detail later in Section 8 when talking about the implementation: To observe the similarity development, we choose several revision pairs of the chosen article pair for different points of time in the best possible equidistant manner. In the preprocessing step, each revision runs through a pipeline where the following things are done by HTML parsing and by using the Wikipedia API and other external tools:

Images, internal and external links (items) that are mentioned in the revision are collected.

The raw text that is the essential part of the article (no tables, external link lists etc.) is extracted and split into sentences.

    When possible, the extracted items are assigned to sentences.

For the non-English revision, the sentences are translated by using a machine translation API.

    For each sentence, more entity links are extracted using NER tools.

    For each sentence, time annotations are extracted using NLP tools.

As this preprocessing consists of many extensive steps and has to be done for every revision, it is not possible to follow an ad hoc approach where the user can enter an arbitrary article and immediately gets the results. Therefore, the collected data is stored in a database.

The similarity calculations are done ad hoc1. That means, if an article comparison is requested, the respective data for both revisions is loaded from the database. From then on, things are done in the bottom-up manner that was already described in Section 1.2: from similar sentences, we construct similar paragraphs and finally define an overall revision similarity.

The problem of finding similar sentences is called sentence alignment. To do this, we must define a set of similarity measures that are given as computation rules returning values in the range [0, 1] by comparing syntactic and semantic features. This is described in the following Chapter 5. In that chapter, there are two steps of evaluation: in the first, we compute the textual similarity values for sentence pairs of a parallel English/German corpus and try to find out which of them performs best. Secondly, we compare NER tools to choose the best performing one for our preprocessing.

In order to align sentences, we need a sentence alignment function. Given two sentences in different languages as input, this function calculates a similarity value that should be the higher the more similar the sentences are, from "different" over "partially overlapping" to sentences that contain the same facts. To create and evaluate this similarity function, a user study is conducted that consists of two steps: in the first, the set of similarity measures is reduced to those that have the biggest impact on the similarity. In the second step, the similarity function is composed of them. The study is described in Chapter 6.

1Apart from the overall similarity values for all revision pairs that are part of the history chart.

In Chapter 7, an algorithm is explained that forms paragraphs from the previously identified matching sentence pairs. To compute the similarity of a paragraph, we use the alignment function from Chapter 6 and add small penalties for unfitting sentences.

The fraction of overlapping paragraphs is part of the revision similarity, which is a single score for a whole revision pair. Other components are very similar to those used on the sentence level (text overlap, common entities, ...). Nevertheless, we give a concrete overview of all its similarity measures in Section 7.2 to distinguish it from the sentence similarity. There are even some measures, like the author similarity, that can only be used on the article level.

5 Feature Selection and Extraction

It is necessary to extract comparable information from text parts in order to estimate their similarity. Especially for multilingual comparisons, it is important not only to rely on purely syntactic textual information, but also on the semantics. In this chapter we address this task of extracting information from text and finally review how big the influence of these different kinds of information (features) is.

There are different types of features: some can be used on the sentence level (e.g. textual features), others are used for the revision comparison (e.g. authors).

At this stage, we only focus on the comparison of two sentences in different languages (sentence pairs). The similarity between two sentences will later help us to realise a bottom-up approach that finds similar paragraphs by combining proximate sentences. Similarity measures on the revision level are partially similar and will be described in Section 7.2.

Textual features were already introduced in Section 3.1.2; other features needed for the revision similarity are motivated by the findings in Section 2.4. In this section, each feature is described together with calculation rules that yield similarity values with respect to the feature.

    5.1 Syntactic Features

The syntactic features of a sentence are based on its textual content. For the textual comparison, non-English sentences are machine-translated into English. Furthermore, basic NLP techniques like n-grams are used. In the following, five syntactic features and their application for the computation of similarity values in the range [0, 1] are described.

    Text Overlap

The text overlap similarity (TO) is a simple application of the Jaccard coefficient to the words appearing in both sentences after stemming and stop word removal. The computation is given in the following equation, with $W_{s_i}$ being the set of stemmed and stop word free words of sentence $i$:

\[
sim_{TO}(s_1, s_2) = \frac{|W_{s_1} \cap W_{s_2}|}{|W_{s_1} \cup W_{s_2}|}. \qquad (5.1)
\]
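A small sketch of this measure; the whitespace tokenisation, the toy stop word list and the omission of a real stemmer are simplifications assumed for this example.

    STOP_WORDS = {"the", "is", "a", "of", "and", "in"}   # illustrative stop word list

    def word_set(sentence):
        # Stand-in for stemming plus stop word removal.
        return {w.lower().strip(".,") for w in sentence.split()} - STOP_WORDS

    def sim_text_overlap(s1, s2):
        w1, w2 = word_set(s1), word_set(s2)
        if not w1 or not w2:
            return 0.0
        return len(w1 & w2) / len(w1 | w2)               # Jaccard coefficient (Eq. 5.1)

    print(sim_text_overlap("Berlin is the capital of Germany.",
                           "Berlin is Germany's largest city."))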

    Bigram Overlap

A method proposed in [24] uses the Jaccard coefficient as well, but not on the single words (unigrams). Instead, bigrams of the text are identified. By computing the overlap this way, the order of the terms is considered, which is not the case for the unigrams.
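A corresponding sketch for the bigram variant, again with simplified tokenisation assumed for the example.

    def bigrams(sentence):
        words = sentence.lower().split()
        return set(zip(words, words[1:]))                # consecutive word pairs

    def sim_bigram_overlap(s1, s2):
        b1, b2 = bigrams(s1), bigrams(s2)
        if not b1 or not b2:
            return 0.0
        return len(b1 & b2) / len(b1 | b2)               # Jaccard coefficient on bigrams

    # Same words in a different order: unigram overlap would be 1.0, bigram overlap is not.
    print(sim_bigram_overlap("the city grew fast", "fast grew the city"))
    # 0.2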

    Word Cosine

The word cosine is often used in the context of information retrieval and takes into account the selectivity of terms: the more frequent a term is, the less important it is for the similarity computation. This is done by defining tf-idf weights that indicate the selectivity of terms. The tf-idf concept is transferred to our requirements by treating sentences as documents and thus having all the sentences from both articles as the document collection $D$. A term $t$ corresponds to an entity name. To compare two sentences, the following two term vectors are created beforehand:

term frequency (tf) for each sentence: number of occurrences of each term in this sentence

inverse document frequency (idf): inverted and logarithmized number of sentences in which each term appears, computed as $idf_{t,d} = \log\frac{|D|}{|\{d' \in D \mid t \in d'\}|}$.

The cosine similarity of two sentences is then computed as follows, with $w_{t,d} = tf_{t,d} \cdot idf_{t,d}$ (the tf-idf weight of the term $t$ in the sentence $d$):

\[
sim_{Co}(s_1, s_2) = \frac{\sum_{i=1}^{N} w_{i,s_1}\, w_{i,s_2}}{\sqrt{\sum_{i=1}^{N} w_{i,s_1}^2}\,\sqrt{\sum_{i=1}^{N} w_{i,s_2}^2}}. \qquad (5.2)
\]
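The following sketch illustrates this computation, treating each sentence as a document as described above; the whitespace tokenisation and the tiny document collection are assumptions made for the example.

    import math
    from collections import Counter

    def tf_idf_vectors(sentences):
        """Treat every sentence as a document and return one tf-idf weight dict per sentence."""
        docs = [Counter(s.lower().split()) for s in sentences]
        df = Counter(term for doc in docs for term in doc)   # number of sentences containing the term
        idf = {t: math.log(len(docs) / df[t]) for t in df}
        return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

    def sim_cosine(v1, v2):
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        norm = (math.sqrt(sum(w * w for w in v1.values()))
                * math.sqrt(sum(w * w for w in v2.values())))
        return dot / norm if norm else 0.0

    sentences = ["berlin is the capital of germany",
                 "berlin is germany s largest city",
                 "the weather in berlin was fine"]
    vectors = tf_idf_vectors(sentences)
    # "berlin" occurs in every sentence, so its idf weight is 0 and it adds nothing.
    print(round(sim_cosine(vectors[0], vectors[1]), 3))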

    Longest Common Subsequence

In the context of cross-lingual plagiarism detection, texts are compared using their longest common subsequence (LCS) [2]. This is the longest sequence of characters that appears in that order in both text strings. For example, the LCS of the words "sentence" and "subsequence" is "seence". Similarly to the bigram overlap, the order of words matters in this case as well.

To obtain a similarity value between 0 and 1 from the LCS, the similarity measure is defined as follows, with $|s_1|$ denoting the number of characters in the first sentence:

\[
sim_{LCS}(s_1, s_2) = \frac{|LCS(s_1, s_2)|}{\max(|s_1|, |s_2|)}. \qquad (5.3)
\]
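A sketch of the LCS-based similarity using the standard dynamic programming recurrence; the implementation details are illustrative and not taken from [2].

    def lcs_length(a, b):
        """Longest common subsequence length via row-wise dynamic programming."""
        prev = [0] * (len(b) + 1)
        for ch_a in a:
            curr = [0]
            for j, ch_b in enumerate(b, start=1):
                curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[len(b)]

    def sim_lcs(s1, s2):
        if not s1 or not s2:
            return 0.0
        return lcs_length(s1, s2) / max(len(s1), len(s2))    # Eq. 5.3

    print(lcs_length("sentence", "subsequence"))   # 6, corresponding to "seence"
    print(round(sim_lcs("sentence", "subsequence"), 3))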

    Text Length Similarity

Different from the previous similarity measures, a comparison of the sentence lengths (in terms of characters of the original, untranslated text) obviously does not suffice as the only syntactic feature, because sentences with the same length can have completely different contents.

To take into account the varying length of the same contents in different languages, the text length can be normalized such that the average number of characters of texts in that language is considered. To do so, we computed the ratio of characters in the Europarl corpus [14] (described in more detail in the following Section 5.2) for the respective language pairs and use this value for normalization.

5.2 Evaluation on Sentence Similarity of Parallel Corpus

To be able to compare the textual features (except the text length similarity) and finally decide for the best one, we applied them to a parallel corpus with aligned German and English sentences. The German texts were machine-translated into English, which allows comparing the sentences using the different textual features. The first goal is to judge the features' quality, which is done by plotting their precision and recall values. The second goal is to find good threshold values that can be used to classify unknown sentence pairs as parallel or not.

As our final goal is not just to find identical sentences, but also partially overlapping sentence pairs, this evaluation is only reasonable under the assumption that the textual similarity of sentences with the same contents highly correlates with the similarity of partially overlapping ones. For the Jaccard overlap, this seems correct, as identical sentences reach an overlap of 100%, while this drops to 50% if the first sentence only contains half of the content of the second one.

  • 34 Chapter 5 Feature Selection and Extraction

    Data

The Europarl corpus [14] is a sentence-aligned parallel corpus extracted from the proceedings of the European Parliament. It contains 20 versions, each with a document of English sentences and one in another language like German, Dutch or Portuguese. For example, the German corpus has 2,176,537 sentences. The primary goal of this corpus is to support machine translation systems1.

As we are mainly focussing on German and English Wikipedia articles, we chose the first 500 lines of the English/German corpus and translated the German sentences into English using the Bing Translator2. This leads to a data set of 500 human-written English sentences and 500 machine-translated English sentences that were originally German.

    Approach

To judge the similarity measures, it does not suffice to solely compare the correctly aligned sentence pairs, because it is also important to know how they behave on wrongly aligned sentence pairs. Therefore, all four measures are applied to each possible combination of sentences, which amounts to 500 · 500 = 250,000 sentence pairs in total.

In the first part of this evaluation, the features are compared in terms of precision and recall. That means, for each feature, the set of 250,000 sentence pairs is sorted in descending order by similarity value. After that, the top k sentence pairs are taken from the sorted set to compute precision (the fraction of correctly aligned sentence pairs among the k pairs) and recall (the fraction of all 500 correct sentence pairs that is returned). This is done stepwise from k = 0 until k = 250,000.
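A small sketch of this evaluation loop, with a toy ranked list and gold alignment assumed for the example; the break-even point is approximated as the precision at the first cut-off where precision no longer exceeds recall.

    def precision_recall_curve(ranked_pairs, gold_pairs):
        """ranked_pairs: (i, j) pairs sorted by descending similarity."""
        curve, hits = [], 0
        for k, pair in enumerate(ranked_pairs, start=1):
            hits += pair in gold_pairs
            curve.append((hits / k, hits / len(gold_pairs)))   # (precision, recall) at k
        return curve

    def break_even_point(curve):
        for precision, recall in curve:
            if precision <= recall:
                return precision
        return curve[-1][0]

    # Toy usage: 4 gold pairs hidden in a ranked list of 8 candidate pairs.
    gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
    ranked = [(0, 0), (1, 1), (2, 5), (2, 2), (3, 3), (0, 4), (1, 3), (3, 1)]
    print(break_even_point(precision_recall_curve(ranked, gold)))
    # 0.75: after the top 4 pairs, three of the four gold pairs have been found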

In the second part, the meaning of the similarity value is investigated. When later applying the features, it is necessary to define a threshold value. If it is exceeded by the similarity of a sentence pair, its sentences will be aligned. To find out how this threshold should be set for the various measures, a box plot3 is created for each of them. To do so, the similarity between each of the 500 correctly aligned sentence pairs is computed. Having a set of 500 similarity values, it then becomes possible to identify their median and their quartiles.

    Results

Figure 5.1 shows the result of the computation of precision and recall for the four features. For an easy comparison, the break-even points (BEPs) are marked. These are the points where precision and recall are equal.

1http://www.statmt.org/europarl/
2http://www.bing.com/translator/
3http://en.wikipedia.org/wiki/Box_plot


Obviously, the bigram similarity performs worst for our goals: while all measures can be used in the area of high precision and low recall, the recall for bigram similarity is much lower for a precision below 90%. One reason for this is that machine translation cannot guarantee to be order-preserving for word pairs.

The other three features perform similarly well and are well-suited to our goals. For a precision of 80%, they still return just slightly less than 80% of all parallel sentence pairs. While it is not possible to say whether Cosine or Text Overlap similarity performs better on this data (their BEPs are within one percent), the LCS similarity is a bit worse compared to them for a recall bigger than 80%. As a result, we will only consider Text Overlap and Cosine similarity in the following sections.

Figure 5.1 Precision Recall Graphs for Textual Features with Break-Even Points (precision in % over recall in %; BEPs: Text Overlap 78.6, Cosine 77.9, LCS 75.8, Bigram 58.8)

Figure 5.2 contains four box plots. The results make clear why an optimal precision cannot be reached: some sentence pairs are assigned very low similarity values; there are even 7 pairs whose terms do not overlap at all. To put it another way, there are only 21 pairs (4.2%) that completely overlap. For the longest common subsequence, the similarity is always above 0, because all sentence pairs share at least some letters.

As a starting point for the further investigation in Chapter 6, we will choose the similarity values of the first quartile as thresholds (meaning that 75% of the pairs in this data set were returned), which is 0.39 for cosine and 0.3 for text overlap.

  • 36 Chapter 5 Feature Selection and Extraction

Figure 5.2 Box Plots for Textual Features (similarity distributions per similarity type: Cosine, Text Overlap, Bigram, LCS)

    5.3 Semantic Features

Semantic features are additional information from the texts, either extracted with external tools or given by the Wikipedia formatting. Due to their special characteristics, it is often more difficult to derive similarity measures from them.

    Time Similarity

Time annotations can be extracted from sentences in forms such as "2013-04-17", "2014" or even "19XX" (representing a whole century). That means each time annotation represents a time range that can be a single day or even whole centuries. If two sentences contain overlapping time annotations, this is a sign that they represent events that happened at the same time and hence probably describe the same facts.

An important aspect is that the presence of the same time ranges is not always a good hint for a high probability that the sentences represent the same contents: the wider the range of a time period, the less important is its meaning for the time similarity. This is demonstrated by the examples in Table 5.1, where the first sentence pair talks about different facts that happened in the same century. The second sentence pair states the same facts, and the same concrete date is mentioned in both sentences. Another issue is that some sentences contain more than one time annotation, which makes a trivial computation of overlapping days difficult.

  • 5.3 Semantic Features 37

1. English sentence: Emigration from Europe began with Spanish and Portuguese settlers in the 16th century [...].
   German sentence: Die Reformation im 16. Jahrhundert spaltete die westliche Kirche [...] in einen katholischen und evangelischen Teil.
   German sentence (manually translated): The Reformation in the 16th century split the Western church into a catholic and an evangelic part.
   Time annotations: 16th century (ca. 36,524 days)

2. English sentence: On 1 January 2007, Romania and Bulgaria became EU members.
   German sentence: Am 1. Januar 2007 wurden als 26. und 27. Mitgliedstaat Rumänien und Bulgarien in die Union aufgenommen.
   German sentence (manually translated): On 1 January 2007, Romania and Bulgaria were included into the Union as 26th and 27th member.
   Time annotations: 1 January 2007 (1 day)

Table 5.1 Example Sentence Pairs for Time Similarity

To overcome the first issue, we assign relevance values to the time intervals according to their length: the longer the time interval, the smaller the relevance value. We set the weight of a time interval to $w(t_i) = 1$ if $t_i$ represents a particular date, $w(t_i) = 0.85$ for a month and $w(t_i) = 0.6$ for a year.

To compute the time-based similarity $Sim_{Time}$ between two sentences $s_1$, $s_2$, we align each time annotation $t_i$ with its best matching counterpart $t_j$ (if any) in these sentences and sum up the minimum relevance weights of the aligned annotations to get a time overlap value $tovl$:

\[
tovl(s_1, s_2) = \sum_{t_i \in s_1} \sum_{t_j \in s_2}
\begin{cases}
\min(w(t_i), w(t_j)) & \text{if } t_i \text{ and } t_j \text{ overlap}^{*} \\
0 & \text{otherwise}
\end{cases}
\qquad (5.4)
\]

$^{*}$ $t_i$ and $t_j$ overlap if they refer to an overlapping time interval and there is no other overlapping $t_{j'} \in s_2$ with a higher weight for $\min(w(t_i), w(t_{j'}))$.

If, for example, the annotations "2011/03/20" and "2011/03" are aligned, the relevance weight for a month is taken.

The time overlap is computed for both directions, summed up and then normalized by the total number of time annotations in the sentences, $|s_1|$ and $|s_2|$:

\[
Sim_{Time}(s_1, s_2) = \frac{tovl(s_1, s_2) + tovl(s_2, s_1)}{|s_1| + |s_2|}. \qquad (5.5)
\]

For example, if sentence $s_1$ contains the time annotations "2011/03/20" and "2011" and sentence $s_2$ contains "2011/03", the similarity is calculated as

\[
Sim_{Time}(s_1, s_2) = \frac{(0.85 + 0.6) + 0.85}{2 + 1} \approx 0.767.
\]
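A minimal sketch of Equations 5.4 and 5.5, assuming annotations are simple strings of day, month or year granularity; the overlap test by string prefix is a simplification made for this example.

    # Weights follow the thesis: 1.0 for a day, 0.85 for a month, 0.6 for a year.
    WEIGHTS = {3: 1.0, 2: 0.85, 1: 0.6}      # number of date components -> weight

    def weight(annotation):
        return WEIGHTS[len(annotation.split("/"))]

    def overlaps(a, b):
        """Two annotations overlap if one interval is a prefix of the other."""
        return a.startswith(b) or b.startswith(a)

    def tovl(s1, s2):
        total = 0.0
        for t_i in s1:
            candidates = [min(weight(t_i), weight(t_j)) for t_j in s2 if overlaps(t_i, t_j)]
            if candidates:
                total += max(candidates)     # best matching counterpart only
        return total

    def sim_time(s1, s2):
        if not s1 and not s2:
            return 0.0
        return (tovl(s1, s2) + tovl(s2, s1)) / (len(s1) + len(s2))

    # Reproduces the worked example above:
    print(round(sim_time(["2011/03/20", "2011"], ["2011/03"]), 3))   # ~0.767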

    Entity Similarity

Sentences speaking about the same facts probably mention the same entities. Therefore, we define an entity similarity measure. Entities are found by uniting the given internal Wikipedia links and other links extracted by DBpedia Spotlight. As entities always refer to Wikipedia pages, we also call this feature Wikipedia annotations.

It is important to account for the selectivity of entities. However, we observed that due to the sparsity and the distribution of the Wikipedia annotations in sentences, cosine similarity measures directly applied to the Wikipedia annotations do not lead to a very precise sentence alignment. This can be shown by the following example with sentences from the Wikipedia articles about the German capital Berlin:

English sentence: Berlin is Germany's largest city.

German sentence: Zudem ist Berlin der bedeutendste Verlagsstandort Deutschlands.

German sentence (human-translated): Besides, Berlin is the major publishing center in Germany.

Both sentences contain the single entity "Germany", which can be found more than 50 times in both articles. Therefore, it should add nearly nothing to the similarity of the sentences. However, both the text overlap and the cosine similarity fail in this case: the Jaccard overlap obviously returns a similarity value of 1, but so does the cosine similarity because of its normalization factor in the denominator. This is illustrated in the following cosine calculation (with $w$ defined as in Section 5.1), where terms other than the shared entity are not shown:

\[
Sim_{Entity\,Cosine}(s_1, s_2) = \frac{w_{Berlin,s_1}\, w_{Berlin,s_2}}{\sqrt{w_{Berlin,s_1}^2}\,\sqrt{w_{Berlin,s_2}^2}} = 1. \qquad (5.6)
\]

Due to the sparsity and the distribution of the Wikipedia annotations in sentences, it is necessary to change this behaviour by adding a smoothing factor $\vec{n}$ to the cosine similarity computation. We create a vector $\vec{n}$, where $n_i$ is the weight of the annotation $i$, computed as:

\[
n_i = \max\left(0.1,\ 1 - \frac{df_i - 2}{\alpha}\right)^{\beta}, \qquad (5.7)
\]

where $df_i$ denotes the number of sentences with an aligned annotation $i$. The weights computed by Equation 5.7 are in the interval $[0.1^{\beta}, 1]$, with the lower weights corresponding to the more common annotations.

For the most selective annotations that appear in two sentences ($df_i = 2$)4, the cosine sentence similarity computation remains unchanged. For less selective annotations the similarity is reduced faster compared to the plain cosine, using two factors: $\beta$, which controls the degree of the similarity decrease, and $\alpha$, which limits the maximal number of occurrences up to which an annotation is considered as relevant.

When calculating the similarity $Sim_{Entity}$ of two sentences $s_1$ and $s_2$ based on the Wikipedia annotations, the tf-idf weights of the annotations are adjusted by $\vec{n}$:

\[
Sim_{Entity}(s_1, s_2) = \frac{\sum_{i=1}^{N} w_{i,s_1}\, w_{i,s_2}\, n_i}{\sqrt{\sum_{i=1}^{N} w_{i,s_1}^2}\,\sqrt{\sum_{i=1}^{N} w_{i,s_2}^2}}, \qquad (5.8)
\]

where $w_{i,s_j}$ is the tf-idf weight of the annotation $i$ in the sentence $s_j$ and $N$ is the number of distinct aligned annotations in the sentences of both articles. We experimentally set $\alpha = 25$ and $\beta = 5$.
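A sketch of the smoothed entity cosine (Equations 5.7 and 5.8); the input representation with precomputed tf-idf weights and sentence frequencies is an assumption made for this example.

    import math

    ALPHA, BETA = 25, 5                      # values set experimentally in the thesis

    def smoothing_weight(df_i):
        """Equation 5.7: down-weight annotations that occur in many sentences."""
        return max(0.1, 1 - (df_i - 2) / ALPHA) ** BETA

    def sim_entity(tfidf_1, tfidf_2, df):
        """Equation 5.8 on two annotation weight dicts.
        df[i]: number of sentences containing annotation i."""
        shared = set(tfidf_1) & set(tfidf_2)
        dot = sum(tfidf_1[i] * tfidf_2[i] * smoothing_weight(df[i]) for i in shared)
        norm = (math.sqrt(sum(w * w for w in tfidf_1.values()))
                * math.sqrt(sum(w * w for w in tfidf_2.values())))
        return dot / norm if norm else 0.0

    # Toy usage: the only shared annotation occurs in 60 sentences, so the smoothed
    # similarity stays far below the plain cosine value of 1.0 from Equation 5.6.
    print(round(sim_entity({"Germany": 1.2}, {"Germany": 1.2}, {"Germany": 60}), 7))
    # 1e-05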

    External Link Similarity

Common external links can lead to a higher similarity of sentences as well. Similarly to the entity similarity, the selectivity of external links has to be taken into account. Therefore, the same calculation is used (with $\alpha = 5$ and $\beta = 2$, because links are much rarer).

    External Link Hosts Similarity

The external link similarity can be split up into two parts: the comparison of the full URLs and of the host names only (with smoothing as well). We weighted the external link similarity with 25% and the host similarity with 75% to compute a single similarity score from both measures.
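A minimal sketch of this weighted combination; for brevity it uses plain Jaccard overlap instead of the smoothed cosine described above, which is a simplification made for the example.

    from urllib.parse import urlparse

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def sim_external_links(links_1, links_2, w_full=0.25, w_host=0.75):
        """Weighted combination of full-URL overlap and host-name overlap."""
        full = jaccard(set(links_1), set(links_2))
        hosts = jaccard({urlparse(u).netloc for u in links_1},
                        {urlparse(u).netloc for u in links_2})
        return w_full * full + w_host * hosts

    print(sim_external_links(
        ["http://www.statmt.org/europarl/", "http://nlp.stanford.edu/software/corenlp.shtml"],
        ["http://nlp.stanford.edu/software/CRF-NER.shtml"]))
    # 0.375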

    5.4 Evaluation of Entity Extraction Tools

As described in Section 3.1.3, one of the semantic features used to identify similar sentences is the recognition of named entities (NER). To get an idea of the quality of NER tools and to rate their performance on texts in different languages, we ran an evaluation of two NER tools that were applied to manually annotated texts in English and German.

4If the annotation appears in just one sentence, it can never occur in both sentences of a sentence pair.

5.4.1 Aim and NER tools

    To do the NER on the Wikipedia, i