
Cross-lingual Entity Matching and Infobox Alignment in Wikipedia

Daniel Rinser, Dustin Lange, Felix Naumann

Hasso Plattner Institute, Potsdam, Germany

Abstract

Wikipedia has grown to a huge, multi-lingual source of encyclopedic knowledge. Apart from textual content, a large and ever-increasing number of articles feature so-called infoboxes, which provide factual information about the articles’ subjects. As the different language versions evolve independently, they provide different information on the same topics. Correspondences between infobox attributes in different language editions can be leveraged for several use cases, such as automatic detection and resolution of inconsistencies in infobox data across language versions, or the automatic augmentation of infoboxes in one language with data from other language versions.

We present an instance-based schema matching technique that exploits information overlap in infoboxes across different language editions. As a prerequisite we present a graph-based approach to identify articles in different languages representing the same real-world entity using (and correcting) the interlanguage links in Wikipedia. To account for the untyped nature of infobox schemas, we present a robust similarity measure that can reliably quantify the similarity of strings with mixed types of data. The qualitative evaluation on the basis of manually labeled attribute correspondences between infoboxes in four of the largest Wikipedia editions demonstrates the effectiveness of the proposed approach.

1. Entity and Attribute Matching across Wikipedia Languages

Wikipedia is a well-known public encyclopedia. While most of the information contained in Wikipedia is in textual form, the so-called infoboxes provide semi-structured, factual information. They are displayed as tables in many Wikipedia articles and state basic facts about the subject. There are different templates for infoboxes, each targeting a specific category of articles and providing fields for properties that are relevant for the respective subject type. For example, in the English Wikipedia, there is a class of infoboxes about companies, one to describe the fundamental facts about countries (such as their capital and population), one for musical artists, etc. However, each of the currently 281 language versions¹ defines and maintains its own set of infobox classes with their own set of properties, as well as providing sometimes different values for corresponding attributes.

Figure 1 shows extracts of the English and German infoboxes for the city of Berlin. The arrows indicate matches between properties. It is already apparent that matching purely based on property names is futile: The terms Population density and Bevölkerungsdichte or Governing parties and Reg. Parteien have no textual similarity. However, their property values are more revealing: <3,857.6/km²> and <3.875 Einw. je km²> or <SPD/Die Linke> and <SPD und Die Linke> have a high textual similarity, respectively.

Email addresses: [email protected] (Daniel Rinser), [email protected] (Dustin Lange), [email protected] (Felix Naumann)

¹ As of March 2011.

Our overall goal is to automatically find a mapping between attributes of infobox templates across different language versions. Such a mapping can be valuable for several different use cases: First, it can be used to increase the information quality and quantity in Wikipedia infoboxes, or at least help the Wikipedia communities to do so. Inconsistencies among the data provided by different editions for corresponding attributes could be detected automatically. For example, the infobox in the English article about Germany claims that the population is 81,799,600, while the German article specifies a value of 81,768,000 for the same country. Detecting such conflicts can help the Wikipedia communities to increase consistency and information quality across language versions. Further, the detected inconsistencies could be resolved automatically by fusing the data in infoboxes, as proposed by [1]. Finally, the coverage of information in infoboxes could be increased significantly by completing missing attribute values in one Wikipedia edition with data found in other editions.

An infobox template does not describe a strict schema, so we need to collect the infobox template attributes from the template instances. For the purpose of this paper, an infobox template is determined by the set of attributes that are mentioned in any article that references the template.

The task of matching attributes of corresponding infoboxes across language versions is a specific application of schema matching. Automatic schema matching is a highly researched topic, and numerous different approaches have been developed for this task, as surveyed in [2] and [3].

Among these, schema-level matchers exploit attribute labels, schema constraints, and structural similarities of the schemas.


Figure 1: A mapping between the English and German infoboxes for Berlin

However, in the setting of Wikipedia infoboxes these techniques are not useful, because infobox definitions only describe a rather loose list of supported properties, as opposed to a strict relational or XML schema. Attribute names in infoboxes are not always sound, often cryptic or abbreviated, and the exact semantics of the attributes are not always clear from their names alone. Moreover, due to our multi-lingual scenario, attributes are labeled in different natural languages. This latter problem might be tackled by employing bilingual dictionaries, if the previously mentioned issues were solved. Due to the flat nature of infoboxes and their lack of constraints or types, other constraint-based matching approaches must fail.

On the other hand, there are instance-based matching approaches, which leverage instance data of multiple data sources. Here, the basic assumption is that the similarity of the instances of the attributes reflects the similarity of the attributes. To assess this similarity, instance-based approaches usually analyze the attributes of each schema individually, collecting information about value patterns and ranges, amongst others, such as in [4]. A different, duplicate-based approach exploits information overlap across data sources [5]. The idea there is to find two representations of the same real-world objects (duplicates) and then suggest mappings between attributes that have the same or similar values. This approach has one important requirement: The data sources need to share a sufficient amount of common instances (or tuples, in a relational setting), i.e., instances describing the same real-world entity. Furthermore, the duplicates either have to be known in advance or have to be discovered despite a lack of knowledge of corresponding attributes.

The approach presented in this article is based on such duplicate-based matching. Our approach consists of three steps: entity matching, template matching, and attribute matching.

The process is visualized in Fig. 2. (1) Entity matching: First, we find articles in different language versions that describe the same real-world entity. In particular, we make use of the cross-language links that are present for most Wikipedia articles and provide links between the same entities across different language versions. We present a graph-based approach to resolve conflicts in the linking information. (2) Template matching: We determine a cross-lingual mapping between infobox templates by analyzing template co-occurrences in the language versions. (3) Attribute matching: The infobox attribute values of the corresponding articles are compared to identify matching attributes across the language versions, assuming that the values of corresponding attributes are highly similar for the majority of article pairs.

As a first step, we analyze the quality of Wikipedia’s interlanguage links in Sec. 2. We show how to use those links to create clusters of semantically equivalent entities with only one entity from each language in Sec. 3. This entity matching approach is evaluated in Sec. 4. In Sec. 5, we show how a cross-lingual mapping between infobox templates can be established. The infobox attribute matching approach is described in Sec. 6 and in turn evaluated in Sec. 7. Related work in the areas of ILLs, concept identification, and infobox attribute matching is discussed in Sec. 8. Finally, Sec. 9 draws conclusions and discusses future work.

2. Interlanguage Links

Our basic assumption is that there is a considerable amount of information overlap across the different Wikipedia language editions. Our infobox matching approach presented later requires mappings between articles in different language editions describing the same real-world entity in order to compare their attribute values.


[Figure 2 sketches the pipeline: Wikipedia articles in languages 1..n pass through (1) entity matching, yielding inter-language entity mappings (e.g., en: Berlin ↔ de: Berlin, en: Cologne ↔ de: Köln); (2) template matching, yielding inter-language infobox template mappings (e.g., en: Infobox City ↔ de: Infobox Stadt); and (3) attribute matching, yielding inter-language infobox template attribute mappings (e.g., en: Postal code ↔ de: Postleitzahlen, en: Area code ↔ de: Vorwahl).]

Figure 2: Overview of our approach

In this section we establish such a mapping, i.e., we identify groups of articles in different Wikipedia language editions that describe the same real-world entity. This problem is significantly easier than the general entity matching problem, due to the existence of so-called interlanguage links (ILLs) [6]. These links are community-maintained and provide a unidirectional mapping between pairs of articles in different language editions. However, several characteristics of these ILLs, such as their unidirectional nature and the existence of conflicts, entail challenges to unambiguously identify articles describing the same entity.

2.1. Interlanguage links in Wikipedia

Wikipedia defines ILLs as links between “nearly equivalent or exactly equivalent” pages in different languages [6]. They are mostly manually created by the authors of Wikipedia articles and are displayed in the sidebar accompanying every article. The main purpose of these links is to support the navigation between different language versions of articles for human readers; for example, a local topic might be covered in more detail and accuracy in the corresponding local Wikipedia edition than in other language editions. Any article in any language edition can have a list of such links, but can link to at most one article of every other language edition. Each Wikipedia edition maintains ILLs autonomously, and thus back-links are not automatically inferred (though it is generally encouraged by the Wikipedia guidelines to add such back-links manually).

ILLs might be incorrect or unsuitable to identify matchingentities for several reasons:

• Vague definition of equivalence: Each Wikipedia community has implemented its own set of best practices and guidelines regulating structure, content, and organization of articles. Thus, in practice, the required equivalence of articles is often softened to similarity, simply because there is no “nearly equivalent or exactly equivalent” article in the other language.

• Different article granularities: In one edition, several related topics or entities may be covered by a corresponding number of different articles, while in other editions this set of topics might be covered by only a single article. Should all finer-grained articles link to the same general article in the other language? To which of the finer-grained articles should the general article link? For instance, Fig. 5 shows the German de: Alt, which had a choice of being linked to the more fine-grained concepts en: Alto or en: Contralto (and is linked to the latter).

• Homonyms are the source of erroneous ILLs, because authors often make linking decisions without regarding the actual article content, and are thus misled by syntactically similar article titles.

• Cluster size and consistency: We expect each ILL-connected subset of entities to form a cluster no larger than the number of languages considered. As we show in the next section, the transitive closure of ILLs in fact yields many clusters of much larger size.

Ultimately, many of the problems of ILLs for the task of automated entity matching are rooted in the purpose of those links, which is to aid navigation for human users rather than to provide an unambiguous mapping for data mining tasks. In fact, the currently implemented system for ILLs is not without controversy even within the Wikipedia community [7], and there are several proposals for improvements [8, 9, 10, 11]. Nevertheless, we show that ILLs indeed help in the task of entity matching using intelligent filtering for entity clusters.

2.2. Analysis of the Wikipedia interlanguage links

In this section we examine the structure of ILLs from two angles: First we classify and quantify different linkage situations, and second we regard the overall topology of ILLs. Our analysis is based on the official langlinks.sql.gz MySQL dumps of the English (en), German (de), French (fr), Italian (it), Dutch (nl), and Spanish (es) Wikipedia language editions [12]. The raw data have been slightly preprocessed in order to overcome some inconveniences (a sketch of this cleaning follows the list):

• Page redirects have been resolved, that is, ILLs pointing to a redirect page were dereferenced.

• ILLs originating from or targeting a page that is not in the main article namespace [13] of the respective Wikipedia edition were discarded.

• Only links among the six language editions mentioned above have been considered.
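A minimal sketch of this preprocessing, assuming hypothetical helpers (per-language redirect dicts and a namespace_of lookup); the actual dump parsing is omitted:

MAIN_NS = 0                                   # main article namespace id
LANGS = {"en", "de", "fr", "it", "nl", "es"}

def preprocess_ills(raw_links, redirects, namespace_of):
    """raw_links: iterable of (src_lang, src, dst_lang, dst) tuples;
    redirects[lang]: dict mapping a redirect page to its target;
    namespace_of(lang, page): namespace id of a page (both hypothetical)."""
    cleaned = set()
    for src_lang, src, dst_lang, dst in raw_links:
        if src_lang not in LANGS or dst_lang not in LANGS:
            continue                          # keep only the six editions
        dst = redirects[dst_lang].get(dst, dst)   # dereference redirects
        if (namespace_of(src_lang, src) != MAIN_NS
                or namespace_of(dst_lang, dst) != MAIN_NS):
            continue                          # drop non-article pages
        cleaned.add((src_lang, src, dst_lang, dst))
    return cleaned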


2.2.1. Linkage situations

Let A and B be the sets of all articles in two different Wikipedia language editions. We define L_AB to be the set of ILLs from articles in A to articles in B. Since neither symmetry nor transitivity are technically enforced, we identify three possible situations, visualized in Fig. 3:

Bi-directional links: All links ⟨a, b⟩ ∈ L_AB for which a back-link ⟨b, a⟩ ∈ L_BA exists:

    B = {⟨a, b⟩ ∈ L_AB | ⟨b, a⟩ ∈ L_BA}

Uni-directional links: All links ⟨a, b⟩ ∈ L_AB for which there is no link ⟨b, a′⟩ ∈ L_BA. That is, the target article of the link does not link to any article in the source language:

    U_AB = {⟨a, b⟩ ∈ L_AB | ¬∃a′ ∈ A : ⟨b, a′⟩ ∈ L_BA}

Conflicts: All links ⟨a, b⟩ ∈ L_AB for which the back-link ⟨b, a′⟩ ∈ L_BA targets a different article a′ ∈ A:

    C_AB = {⟨a, b⟩ ∈ L_AB | ∃a′ ∈ A : a′ ≠ a ∧ ⟨b, a′⟩ ∈ L_BA}
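These three definitions translate directly into code. A minimal sketch, assuming the ILLs of each direction are given as dicts mapping a source article to its (at most one) target article in the other edition:

def classify_links(L_ab, L_ba):
    """Split the ILLs from edition A to edition B into the three
    constellations B, U_AB, and C_AB defined above."""
    bi, uni, conflicts = set(), set(), set()
    for a, b in L_ab.items():
        back = L_ba.get(b)
        if back is None:
            uni.add((a, b))        # target has no link back to language A
        elif back == a:
            bi.add((a, b))         # reciprocated link
        else:
            conflicts.add((a, b))  # back-link targets a different article
    return bi, uni, conflicts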

Figure 3: Different constellations of ILLs between language pairs: (a) bi-directional (a ↔ b), (b) uni-directional (a → b), (c) conflict (a1 → b, b → a2)

Table 1 quantifies the degree to which each of the described link constellations occurs among all 15 pairs of the six Wikipedia language editions considered.

There are some interesting observations to be made in this data: The percentage of bi-directional links is surprisingly high throughout all considered language pairs, ranging from 97.26% (en↔de) to 98.26% (fr↔nl). Of the 9,010,048 ILLs among all six languages, there are 4,405,202 bi-directional link pairs (97.78% of all links). This suggests an overall high quality of interlanguage links, since bi-directional links are an indication of consensus between two language editions. However, in absolute numbers there is a considerable amount of conflicts, namely 115,605 conflicts between all 15 language pairs (the sum of all values in rows C_AB and C_BA in Tab. 1). In the next section we propose an entity matching method to resolve these conflicts by choosing the most likely correct links.

3. A Graph-based Approach for Entity Matching

Given the six language editions mentioned above, our goal is to create clusters of entities in which there is at most one entity from each language. To this end, we first analyze the existing link topology and then suggest filter conditions to remove links that conflict with our goal and retain clusters of the same entities.

3.1. Analysis of the ILL Topology

The interlanguage links in Wikipedia form a large directed graph, with articles from different language versions as vertices and ILLs between them as directed edges. One interesting aspect to investigate is the connectivity properties of this graph, or, more specifically, the number, sizes, and properties of the connected components in the graph. Apart from the mere sizes of the components, a key property to analyze is whether a component contains more than one article in any given language. We call such components incoherent (cf. [14]), and the property itself will be referred to as a conflict in a component.

Obviously, all components with a size larger than the number of languages considered (in this case 6) must be incoherent, but conflicts can also occur in smaller components with a size of at least 3 (because the source and target Wikipedia editions of an ILL must differ). For the six languages included in this analysis, the graph consists of 3,402,643 vertices (articles) and 9,010,048 edges – we consider only articles that are either a source or a target of at least one ILL.

Since the graph is a directed (possibly cyclic) graph, there are two commonly used definitions of connectivity for graphs or subgraphs:

Weak connectivity: A directed graph is weakly connected, iff for every pair ⟨v, w⟩ of vertices there is a path from v to w in the underlying undirected graph.

Strong connectivity: A directed graph is strongly connected, iff for every pair ⟨v, w⟩ of vertices there is a directed path from v to w and a directed path from w to v.

Due to the high degree of bi-directional links in the interlanguage link graph, we introduce a third definition of connectivity:

Bi-directional connectivity: A directed graph is bi-directionally connected, iff the undirected graph of bi-directional edges is connected. In other words, the graph is bi-directionally connected, if after removing all non-bi-directional edges and replacing the bi-directional edges with undirected edges, the resulting undirected graph is connected. Note that a bi-directionally connected graph is also strongly connected.

Based on the different connectivity definitions, we discuss three approaches for decomposing a directed graph into subgraphs. Weakly connected components (WCCs) are the maximal weakly connected subgraphs of a digraph; strongly connected components (SCCs) are the maximal strongly connected subgraphs; and bi-directionally connected components (BCCs) are the maximal bi-directionally connected subgraphs. Figure 4 illustrates this: The original graph (Fig. 4a) is decomposed into WCCs (Fig. 4b), SCCs (Fig. 4c), and BCCs (Fig. 4d). To find all three types of components we employ standard O(|L_AB| + |L_BA|) algorithms.
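A sketch of the three decompositions using networkx (assuming the ILL graph is already loaded as a DiGraph); the BCC case reduces to ordinary connected components once only reciprocated edges are kept:

import networkx as nx

def decompose(ill_graph: nx.DiGraph):
    """Return the WCCs, SCCs, and BCCs of the ILL digraph as lists of
    vertex sets."""
    wccs = list(nx.weakly_connected_components(ill_graph))
    sccs = list(nx.strongly_connected_components(ill_graph))
    # BCCs: connected components of the undirected graph formed by the
    # bi-directional (reciprocated) edges only.
    bi_edges = [(u, v) for u, v in ill_graph.edges() if ill_graph.has_edge(v, u)]
    bccs = list(nx.connected_components(nx.Graph(bi_edges)))
    return wccs, sccs, bccs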

The example demonstrates that the graph has fewer, but larger, weakly connected components than strongly connected components (by definition, every strongly connected component is also weakly connected).


                             en↔de     en↔fr     en↔it     en↔nl     en↔es
Total links L_AB            516,019   532,694   409,671   379,393   338,144
Total links L_BA            515,789   535,574   413,615   378,907   341,361
Bi-directional links B      501,781   521,942   402,662   372,117   330,978
Bi-directional links B (%)   97.26%    97.72%    97.82%    98.15%    97.42%
Uni-directional links U_AB    3,444     2,374     1,457     2,027     1,691
Uni-directional links U_BA    5,886     5,986     6,051     3,238     5,232
Conflicts C_AB               10,794     8,378     5,552     5,249     5,475
Conflicts C_BA                8,122     7,646     4,902     3,552     5,151

                             de↔fr     de↔it     de↔nl     de↔es     fr↔it
Total links L_AB            309,398   238,048   238,354   189,102   275,885
Total links L_BA            311,025   240,485   238,292   191,286   277,462
Bi-directional links B      303,515   233,907   233,702   185,220   271,056
Bi-directional links B (%)   97.84%    97.76%    98.06%    97.38%    97.97%
Uni-directional links U_AB    2,180     1,628     1,975     1,536     2,198
Uni-directional links U_BA    3,096     3,762     2,160     3,117     3,666
Conflicts C_AB                3,703     2,513     2,677     2,346     2,631
Conflicts C_BA                4,414     2,816     2,430     2,949     2,740

                             fr↔nl     fr↔es     it↔nl     it↔es     nl↔es
Total links L_AB            249,591   230,133   221,094   200,567   168,298
Total links L_BA            248,784   231,305   219,522   200,510   169,740
Bi-directional links B      244,863   225,475   216,458   196,033   165,493
Bi-directional links B (%)   98.26%    97.73%    98.25%    97.75%    97.91%
Uni-directional links U_AB    2,118     1,927     2,744     2,461     1,396
Uni-directional links U_BA    1,861     2,924     1,320     2,281     2,303
Conflicts C_AB                2,610     2,731     1,892     2,073     1,409
Conflicts C_BA                2,060     2,906     1,744     2,196     1,944

Table 1: Number of occurrences of each link constellation between all 15 language pairs. The basis for the percentage of bi-directional links in the fourth row is the average number of interlanguage links L_AB, L_BA among the two languages: 2|B| / (|L_AB| + |L_BA|).
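As a worked check of this percentage formula for en↔de: 2 · 501,781 / (516,019 + 515,789) = 1,003,562 / 1,031,808 ≈ 97.26%, matching the table.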

Furthermore, since every bi-directionally connected component is also strongly connected (but not vice versa), the third approach partitions the graph even more, resulting in more and smaller components.

Table 2 shows statistics for the three graph decomposition approaches. As expected, there are fewer weakly connected components than strongly connected components, and fewer strongly connected components than bi-directionally connected components. The more interesting numbers, however, are the sizes of the largest components and the number of incoherent components. The largest weakly connected component contains no less than 108 articles², which describe clearly distinct entities, ranging from en: Joint stock company via en: German Student Corps to en: Uncle and en: Sister. This size, and the fact that in total there are 40,590 WCCs containing multiple articles for at least one of the languages (though most of them are not nearly as large), shows that there is a large amount of inaccurate (or at least imprecise) links that collectively establish paths between clearly unrelated entities.

² More specifically, it contains 26 English, 26 German, 21 French, 13 Italian, 13 Dutch, and 9 Spanish articles.

While the situation improves dramatically when requiring strong or even bi-directional connectivity between articles, many incoherent components remain. Nevertheless, a closer look at the larger incoherent SCCs and BCCs reveals that in almost all cases, strongly or bi-directionally connected articles center around roughly the same topic, as shown in the incoherent SCC of Fig. 5. The example illustrates different article granularities as a common source of conflicts: The German article covers both alto and contralto, while the other languages cover them in separate articles.

Note also that the number of Wikipedia editions taken into account in this analysis (six) is rather small. While this obviously does not affect the results of the analysis of links between language pairs, it has a massive impact on the accumulation of conflicts: Analyzing the Wikipedia data from August 2008, Bolikowski has shown that the largest weakly connected component in the complete ILL graph (covering about 250 languages) consists of 72,284 articles [14].

The amount of conflicts and the surprisingly complex structure of the ILL graph show that the identification of conflict-free sets of articles describing the same real-world entity is not as trivial as it might have appeared at first glance. However, the findings of this analysis help to design an appropriate approach, presented next.


[Figure 4 shows an example digraph G over vertices a–i and its decomposition under each connectivity definition: (a) the original example digraph G; (b) weakly connected components in G (WCC 1, WCC 2); (c) strongly connected components in G (SCC 1, SCC 2); (d) bi-directionally connected components in G (BCC 1, BCC 2, BCC 3).]

Figure 4: Example digraph with connected components highlighted based on different definitions of connectivity

                                     WCCs        SCCs        BCCs
Number of components            1,062,641   1,067,753   1,068,192
Total pages in components       3,402,643   3,319,822   3,319,055
Average component size              3.202       3.109       3.107
Largest component                     108          17          15
Number of incoherent components    40,590       3,469       2,980

Table 2: Numbers of connected components in the ILL graph. Note: Only components with size > 1 are considered – that is also why the numbers of pages in the components differ across the approaches.

[Figure 5 links the articles en: Contralto, fr: Contralto, it: Contralto, es: Contralto, en: Alto, de: Alt (Stimmlage), fr: Alto (voix), nl: Alt (zangstem), and es: Alto into a single strongly connected component.]

Figure 5: Real-world example of an incoherent strongly connected component

3.2. Whittling Down Large Clusters

We describe a multi-step, graph-based approach that leverages ILLs to identify articles describing the same real-world entity. We use the term cluster to describe a set of articles. There are two requirements for our entity matching algorithm, which both aim to prohibit ambiguity in the resulting clusters (a validation sketch follows the list):

1. The clusters have to be disjoint: an article may only appear in one cluster.

2. The clusters have to be coherent: a cluster may contain at most one article from each language.
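Both requirements can be stated as simple checks over candidate clusters. A minimal sketch, representing each article as a (language, title) pair:

def is_coherent(cluster):
    """Requirement 2: at most one article per language in a cluster."""
    languages = [lang for lang, _title in cluster]
    return len(languages) == len(set(languages))

def satisfies_requirements(clusters):
    """Requirement 1 (disjointness) and requirement 2 (coherence)."""
    seen = set()
    for cluster in clusters:
        if not is_coherent(cluster) or not seen.isdisjoint(cluster):
            return False
        seen.update(cluster)
    return True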

Obviously, there is a straightforward way to build such clusters that obeys the requirements: Decompose the interlanguage link graph into connected components (according to one of the above described definitions of connectivity) and discard all incoherent components, as proposed by Adar et al. [15]. However, throwing away those components means a large loss of potential concepts, especially because components containing a conflict tend to be much larger than coherent components. It would be desirable to find an approach that decomposes incoherent components into smaller, coherent components.

The basic idea of our approach is to decompose the ILL graph into strongly connected components (SCCs), and to resolve conflicts by partitioning incoherent SCCs further using two even stricter connectivity measures. Next, each step of the approach and the underlying motivation is explained in detail; an overview of the steps is shown in Fig. 8, which also includes the exact number of components after each step.

Step 1: Decompose ILL graph into SCCs. In general, all connectivity definitions described earlier (weak, strong, and bi-directional) form a reasonable basis to decompose the ILL graph into subgraphs (components). However, weak connectivity is the loosest form, and can lead to quite large components spanning several completely unrelated topics. While in some cases it might be advantageous to incorporate sparsely linked articles into a component (e.g., in the case of new articles that are not yet fully integrated into the network of ILLs), the fact that an article (or a group of articles) is attached to other articles by only a single, uni-directional link is often an indication of this link being incorrect or at least imprecise.

On the other hand, bi-directional connectivity is a rather strong requirement for clusters. Figure 6 shows two real-world examples illustrating that requiring bi-directional connectivity is sometimes too strict. Especially in the first example, there is a very strong linkage between all articles in the component. Nevertheless, in both examples, the components (which are strongly connected) would each be split into two separate bi-directionally connected components.

Decomposing the graph into strongly connected components, in contrast, is a good compromise with regard to strictness. All coherent SCCs are added to the final set of clusters, while the rest (3,469 incoherent components) is processed further in the following steps.


[Figure 6 shows two components: (a) en: Privilege, de: Privileg, it: Privilegio, nl: Privilege, es: Privilegio, fr: Privilège; (b) en: Electric car, de: Elektroauto, es: Vehículo eléctrico de batería, fr: Véhicule électrique.]

Figure 6: Two real-world examples illustrating that bi-directional connectivity is sometimes too strict.


Step 2: Decompose incoherent SCCs into bi-directionally connected components. Figure 7 shows an example of one of the largest incoherent SCCs (subsequently referred to as the “Corporation” example). Decomposing this component into bi-directionally connected components leaves us with two such components. The first component on the left is coherent and is thus added to the set of final clusters. The second component, however, is still quite large and contains two articles in each language. Therefore, it is processed further in Step 3.

Overall, from the 3,469 incoherent components we generated 4,241 BCCs, of which 2,980 are again incoherent and are processed in the next step.

Step 3: Decompose incoherent BCCs into bi-connected components (2CC). To further decompose incoherent BCCs, an even stricter connectivity constraint is applied: Bi-connectivity (also known as 2-connectivity) is defined for undirected graphs and requires that each pair of vertices ⟨v, w⟩ is connected through at least two vertex-independent paths. In other words, a graph is bi-connected, iff it has no cut vertices – vertices whose removal would disconnect the graph. For the directed graph of ILLs, we define bi-connectivity on the underlying undirected graph that is formed by retaining only the bi-directional edges of the original graph as undirected edges.

Analogous to the definition of other connected components, a bi-connected component (2CC) is a maximal bi-connected subgraph of the original graph. One characteristic of bi-connected components is that they are edge-disjoint, but not necessarily vertex-disjoint. More precisely, all cut vertices of the graph are contained in at least two 2CCs. This is a problem with respect to the first requirement (disjoint clusters) and entails that we cannot simply adopt all 2CCs as clusters. We tackle this problem by finding a (preferably maximal) subset of 2CCs that are mutually vertex-disjoint. This is accomplished by selecting the combination of 2CCs with the maximal number of total vertices from the set of all possible 2CC combinations (that is, all combinations of 2CCs that are vertex-disjoint).
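Since the incoherent components are small (the largest SCC has 17 articles, cf. Table 2), the vertex-disjoint combination with the most vertices can be found by brute-force enumeration; the paper states only the selection criterion, so the following is one plausible sketch:

from itertools import combinations

def best_disjoint_subset(two_ccs):
    """Among all mutually vertex-disjoint combinations of 2CCs (given as
    vertex sets), return one covering the maximal number of vertices."""
    best, best_size = [], 0
    for r in range(1, len(two_ccs) + 1):
        for combo in combinations(two_ccs, r):
            vertices = [v for component in combo for v in component]
            if len(vertices) == len(set(vertices)):   # vertex-disjoint?
                if len(vertices) > best_size:
                    best, best_size = list(combo), len(vertices)
    return best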

For the “Corporation” example, Fig. 7 shows the BCC A and the three 2CCs B, C, and D – note that en: Corporation and it: Società per azioni are cut vertices and as such each belongs to two components. There are four possible combinations of 2CCs that do not involve shared vertices: {B}, {C}, {D}, and {B, C}. Of these combinations, the latter incorporates considerably more pages than the other three, hence this subset of 2CCs is selected and the contained components (B and C) are added to the final set of clusters. The resulting, now coherent components are thus A, B, and C.

A schematic summary of the complete algorithm, with focus on the flow of components, is presented in Fig. 8. The numbers in parentheses denote the aggregated number of components of each type. These numbers show that the 3,469 incoherent SCCs after Step 1, rather than being discarded, are finally decomposed into 5,664 coherent components – 894 in Step 2 and another 4,770 in Step 3.

4. Evaluation of Entity Matching

A representative qualitative evaluation of the clusters produced by our approach would require the manual validation of a significant portion of the clusters, which is very labor-intensive, because the article contents have to be read and evaluated (a simple comparison of the article titles is not sufficient). Thus, we perform a structural and theoretical evaluation of the results.

As we describe in Sec. 8, most prior approaches rely solely on the extraction of weakly connected components to identify clusters. Only Adar et al. acknowledge incoherent components (by completely discarding them) [15], while others do not tackle this problem at all [16, 17]. Either alternative bears its problems: Discarding all incoherent components means a large loss of potential clusters, while retaining them unchanged leads to ambiguities in the clusters and huge groups of supposedly related articles.


[Figure 7 shows the “Corporation” example: a coherent BCC A (en: S.A. (corporation), de: Société Anonyme, fr: Société anonyme, it: Società anonima) and a second, incoherent BCC spanning en: Corporation, de: Großunternehmen, fr: Entreprise, es: Corporación, nl: Bedrijf, it: Società (diritto), it: Società per azioni, de: Aktiengesellschaft, fr: Société par actions, en: Joint stock company, nl: Naamloze vennootschap, and es: Sociedad anónima, which decomposes into the 2CCs B, C, and D.]

Figure 7: The “Corporation” example demonstrates the step-wise decomposition of conflicted graphs.

Kinzler, in contrast, presents the most innovative and promising approach so far [18] (see Sec. 8 for details). Therefore, we compare the clusters produced by our algorithm with those of a re-implementation of Kinzler’s algorithm fed with the same ILL data. Table 3 shows basic statistics of the two approaches.

                       Our approach     Kinzler
# concepts                1,069,948   1,069,576
# pages in concepts       3,316,247   3,317,752
Average concept size          3.099       3.102
# complete cliques        1,055,859   1,054,099

Table 3: Statistics comparing our approach and Kinzler’s algorithm

The numbers reveal that both approaches are very similar with respect to the structural properties of their resulting clusters. Both approaches overlap in 1,066,249 entities, leaving 3,327 entities exclusively identified by our approach and 3,716 exclusively by Kinzler’s. Kinzler’s algorithm incorporates slightly more articles in slightly fewer clusters, hence producing marginally larger clusters.

As Kinzler himself has observed, the result of his iterative approach depends strongly on the order in which clusters are merged. Lacking obvious choices for the order, he does not suggest any specific order, and we likewise do not impose one. This can often lead to non-optimal results, especially if there are several article candidates for the same language. In this scenario, if the “wrong” alternative is picked first, it “poisons” the cluster and prevents the better alternative of the same language from being included into the cluster. Our approach, in contrast, works top-down, taking a more holistic view on the graph, and is thus not dependent on some merge order.

To demonstrate this difference, we have applied both entity matching algorithms to the “Corporation” example from Fig. 7. Both algorithms resolve the incoherent linkage situation by producing several smaller clusters. However, the two approaches group articles differently: In contrast to our approach (with resulting clusters A, B, and C shown in Fig. 7), Kinzler incorporates the article it: Società per azioni instead of it: Società (diritto) into cluster B, although the latter is much more interlinked with the other articles in this cluster. This has an impact on cluster C, too, because the article it: Società per azioni (which would fit very well into this cluster) is already assigned to cluster B. Moreover, it: Società (diritto) is not included in any cluster, because cluster B already contains an Italian article and the articles in cluster C do not directly link to this article. The reason for this result is the merge order: The algorithm has merged en: Corporation and it: Società per azioni first, thus preventing a segmentation that would be superior with regard to the overall interlinkage of the articles. Our graph-based approach does not suffer from this issue, as it correctly recognizes the clusters A, B, and C.

In summary, the high degree of consensus between the two algorithms, despite their contrary approaches (top-down versus bottom-up), suggests an overall good quality of both. The main difference is their handling of conflicts: Kinzler’s algorithm tries to maximize the cluster sizes and sometimes groups sparsely interlinked articles together due to an unfavorable merge order. On the other hand, our algorithm tries to find the most strongly interlinked sub-components, sometimes at the expense of cluster sizes. Both approaches run in under one minute.


[Figure 8: The interlanguage link graph is first decomposed into SCCs (1,067,753). Conflict detection splits these into coherent SCCs (1,064,284) and incoherent SCCs (3,469). Each incoherent SCC is decomposed into bi-directionally connected components (4,241 BCCs); 894 coherent BCCs are added to the clusters, while the 2,980 incoherent BCCs are decomposed further into bi-connected components (8,828 2CCs, 8,689 of them coherent), from which maximal subsets of vertex-disjoint 2CCs are selected (4,770 final 2CCs). Together with the coherent SCCs and BCCs these form the final 1,069,948 clusters.]

Figure 8: Schematic summary of the graph-based entity matching algorithm. The numbers in parentheses denote the aggregated number of components of each type.

Note that both approaches rely on the ILL structure in Wikipedia, and thus their quality is directly dependent on the quality of the interlanguage links. Though individual errors in the link structure are often reliably recognized and hence do not have a strong impact on the clusters, both approaches require an overall good ILL graph: Related articles need to be interlinked more strongly than unrelated articles. Enriching ILLs in Wikipedia by analyzing the actual content of multilingual articles has already been approached by other researchers [19].

5. Infobox Template Matching

After having successfully identified articles describing the same real-world entity, we can now proceed with the actual infobox attribute matching approach. The goal of this approach is to compute a mapping between corresponding attributes for pairs of corresponding infobox templates in different languages. First we take a closer look at infoboxes and their attributes, as well as at the challenges that infobox data poses. In this section, we show how a cross-lingual mapping between infobox templates can be established, before the actual infobox attribute matching approach is presented in detail in Sec. 6.

5.1. A closer look at infoboxes

Since infobox data differs in many important aspects from data stored in structured data sources, such as relational or XML databases, it is imperative to be aware of the consequences of these differences for extensional schema matching. In this section, we clarify the main terminology concerning infoboxes used in the remainder of this article. We also analyze infoboxes, their attributes, and their usage in articles from a statistical point of view.

Figure 9 shows the upper part of the rendered en: Infobox company in the en: BMW article in the English Wikipedia and the corresponding portion of the infobox template invocation in the article’s wikitext. As we can see in the wikitext, infobox instances state the name of the invoked template (Infobox company) and contain a set of attributes, each consisting of a name (for example foundation) and an associated value (1916). The attribute values are not typed and can in fact contain arbitrary portions of wikitext, such as wiki links ([[Automotive industry]]), image references ([[Image:BMW Logo.svg|160px]]), invocations of other templates ({{FWB|BMW}}), text, numbers, dates, or any combination of these types. The example also demonstrates that the attribute names do not necessarily correspond with the display labels in the rendered infobox. In the majority of infobox templates, many attributes are optional.

{{Infobox company
|company_name = Bayerische Motoren Werke AG
|company_logo = [[Image:BMW Logo.svg|160px]]
|company_type = [[Aktiengesellschaft]] ({{FWB|BMW}})
|industry     = [[Automotive industry]]
|foundation   = 1916
|founder      = [[Franz Josef Popp]]
}}

Figure 9: Example wikitext of an infobox template
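A deliberately naive parsing sketch for such invocations (not the paper's extractor): real attribute values may contain nested templates and links with '|' characters, which the plain split below mishandles, so this is illustrative only:

import re

# Matches '{{Infobox <name> | ... }}' non-greedily; a nested template such
# as {{FWB|BMW}} inside a value will cut the match short.
INFOBOX_RE = re.compile(r"\{\{\s*(Infobox[^|}]*)\|(.*?)\}\}", re.S | re.I)

def parse_infoboxes(wikitext):
    """Yield (template_name, {attribute: value}) pairs found in wikitext."""
    for match in INFOBOX_RE.finditer(wikitext):
        name = match.group(1).strip()
        attributes = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                attributes[key.strip()] = value.strip()
        yield name, attributes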

An infobox template defines the structure and layout for a specific type of infoboxes. Hence, it establishes a class of infoboxes of the same type. Infobox templates are usually used by several different articles, while each article defines different values for the template’s attributes. For example, there is an en: Infobox company template in the English Wikipedia, which is included in several thousand articles about individual companies, each of them providing appropriate values for the attributes. Such an inclusion of an infobox template in an article, with specific values for a template’s attributes, will be denoted as an infobox instance. Thus, the example shown in Fig. 9 is an instance of the en: Infobox company template. Similarly, infobox attributes are the formal attributes that an infobox template defines (industry or founder in the example). Infobox instances assign values to attributes, thus forming attribute instances (for example, foundation = 1916 is an attribute instance).


                                               en          de          fr         nl
Articles (no redirect pages)            3,431,632   1,100,058   1,077,530    661,561
Articles with infobox                   1,021,951     205,439     253,514    206,042
Articles with infobox (%)                  29.78%      18.68%      23.53%     31.14%
Infobox templates                           2,247         675         806        959
Articles per infobox template (average)       455         304         315        215
Articles per infobox template (median)         14          31          23         15

Table 4: Some statistics about infobox templates and their usage in the four language editions.


Naming conventions allow the identification of infobox templates in a wiki page’s source code. We ignore the rare cases in which known infoboxes do not follow these conventions. In our own experience, this is the case for less than 1% of the infobox templates.

The purpose of Wikipedia infobox templates is not to provide a structured, machine-readable representation of the data; rather, it is purely based on reuse and layouting considerations. This has implications on the design of infobox templates and thus on their suitability for extensional schema matching.

A major challenge is that infobox templates define only a rather loose schema: The set of allowed attributes is only implicitly given by the template implementation; in many cases this implementation allows alternative names for attributes. Moreover, attributes are untyped and unconstrained. In fact, attribute values can contain any type of wikitext, and very often they contain a mixture of different types of data, such as text, numbers, dates, wiki links, and even style markup (for example: operating income = 289 million <small>(2009)</small>). In our specific setting of cross-lingual matching of infoboxes, another challenge arises: The data sources whose schemas are to be matched are composed in different (natural) languages and use different formatting conventions (numbers, dates, units).

Earlier, we have seen that different article granularity is an issue for entity matching. A similar problem arises in the context of infobox attribute matching: The attributes in different language editions often expose a very different granularity. For instance, in the context of countries, the English Wikipedia distinguishes between population estimate, population estimate year, and population estimate rank, while the German Wikipedia encodes all this information in a single attribute EINWOHNER. There may also be hidden dependencies between different attributes, for example between the population of a country and the year from which this number originates. Hence, both combinations, 2,014,345 (2006) and 2,143,271 (2009), for the same subject may be absolutely correct and not constitute a conflict.

Another problem arises when lists of multiple values for one property are split across multiple separate attributes, for example ruling party1, ruling party2, ruling party3. This is obviously a problem for schema matching, because the order of the values is not necessarily consistent across languages.

Table 4 shows some general statistics about the number of articles and infoboxes in the four Wikipedia editions, as well as about the number of distinct infobox templates and their inclusion in articles (lower part).

The striking divergence between the average and the median number of articles per infobox template indicates a long tail of infobox templates. In fact, the number of articles per infobox template is inversely proportional to its statistical rank; thus, the distribution follows a power law. Many infobox templates are used only once, caused by infobox templates that are too specific, abandoned, or new.

Another interesting aspect is the number and distribution of used attributes across infoboxes. Table 5 presents the total number of attribute instances in the four language editions, as well as the average and maximum numbers of attributes per infobox instance.

             en          de          fr          nl
Sum  26,106,574   4,460,167   5,109,543   4,019,835
Avg          26          22          20          20
Max         246         119         220          98

Table 5: Basic statistics about infobox attribute-value instances in the four language editions.

5.2. Template matching approach

Due to the autonomy of Wikipedia editions, each has its completely independent set of (infobox) templates. Thus, before we can match infobox attributes across language versions, we need a mapping between corresponding infobox templates for all language pairs, i.e., pairs of infobox templates that describe similar properties about the entity. For instance, it is not useful to match the attributes of en: Infobox mountain and de: Infobox Album.

Similar to articles, the granularity of infobox templates is not consistent across language editions; there often is no clear one-to-one mapping between those templates. For example, the English Wikipedia uses en: Infobox settlement for practically all cities in the world. Other Wikipedias, however, often have customized infobox templates for cities in different countries (for example, de: Infobox Ort in Schweden, de: Infobox Ort in Japan, etc. in the German Wikipedia, and nl: Infobox plaats in Australië, nl: Infobox gemeente Colombia, etc. in the Dutch Wikipedia).

From a technical point of view, template definitions are ordinary wiki pages that reside in a separate template namespace.³


Being ordinary pages, they can themselves contain interlanguage links. However, the degree of interlinkage of templates via ILLs is lower and varies strongly across languages. A much simpler and more reliable approach to find corresponding templates is to count co-occurrences of infobox templates, leveraging the entity matching of the previous sections. More precisely, we define two infobox templates T_a in language version W_a and T_b in language version W_b to co-occur, if:

• T_a is used in an article A_a in language version W_a, and

• T_b is used in an article A_b in language version W_b, and

• A_a and A_b refer to the same entity as determined by entity matching.

For instance, the English article en: Allianz uses en: Infobox company and the German article de: Allianz SE uses de: Infobox Unternehmen, and both articles describe the same entity; hence both infobox templates co-occur in this case. We count such co-occurrences for each pair of infobox templates. Table 6 shows the top 20 co-occurring infobox templates between the English and German language editions along with the respective co-occurrence count.
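A counting sketch, assuming the entity clusters from Sec. 3 are available and a hypothetical templates_of lookup returns the infobox templates used by an article:

from collections import Counter

def template_cooccurrences(entity_clusters, templates_of, lang_a, lang_b):
    """Count co-occurring infobox template pairs for one language pair.
    entity_clusters: coherent sets of (language, article) pairs;
    templates_of(lang, article): infobox templates used in that article
    (hypothetical helper; template redirects already resolved)."""
    counts = Counter()
    for cluster in entity_clusters:
        articles = dict(cluster)   # language -> article; unique by coherence
        if lang_a in articles and lang_b in articles:
            for t_a in templates_of(lang_a, articles[lang_a]):
                for t_b in templates_of(lang_b, articles[lang_b]):
                    counts[(t_a, t_b)] += 1
    return counts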

English Infobox         German Infobox                     count
German location         Gemeinde in Deutschland           11,948
French commune          Gemeinde in Frankreich             6,238
film                    Film                               6,041
settlement              Ort in den Vereinigten Staaten     4,551
Italian comune          Gemeinde in Italien                4,153
musical artist          Band                               3,483
football biography      Fußballspieler                     3,128
ice hockey player       Eishockeyspieler                   2,478
settlement              Ort in Polen                       2,035
settlement              Ort in Tschechien                  1,923
album                   Musikalbum                         1,897
Ort in Österreich       Gemeinde in Österreich             1,568
mountain                Berg                               1,550
planet                  Asteroid                           1,415
football biography 2    Fußballspieler                     1,339
Aircraft Begin          Flugzeug                           1,227
airport                 Flughafen                          1,146
U.S. County             County (Vereinigte Staaten)        1,073
Automobile              PKW-Modell                           915
VG                      Computer- und Videospiel             912

Table 6: Top 20 infobox co-occurrences between English and German Wikipedia

Just like regular Wikipedia pages, templates can also be redirected. For instance, there is an en: Infobox company template, and there also is an en: Infobox Company template (note the capitalization), which redirects to en: Infobox company. Prior to building the infobox template mapping, template redirects are resolved by recursively translating all redirecting templates to their redirect target.

As with the distribution of infobox templates within individual Wikipedia editions, the noise ratio increases significantly towards the lower end of the co-occurrence distributions.

³ http://en.wikipedia.org/wiki/Wikipedia:Template_namespace

This is mainly due to different semantic interpretations of entities (for example, the English en: PostScript article uses en: Infobox programming language, while the German de: PostScript article includes de: Infobox Dateiformat (file format)). Noise is also caused by incorrect entity mappings, articles with multiple infoboxes, and simple errors or typos in the infobox name. To filter out those noisy infobox template pairs, we apply two thresholds (a filtering sketch follows the list):

1. An absolute threshold of at least 5 co-occurrences

2. A relative threshold of at least 20% co-occurrences:

    cofreq(T_a, T_b) / min{freq_a(T_a), freq_b(T_b)} ≥ 0.2    (1)

where cofreq(T_a, T_b) is the co-occurrence frequency of templates T_a and T_b, and freq_a(T_a) and freq_b(T_b) are the individual frequencies of those templates that co-occur with at least one other template.
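Applying both thresholds is then a simple filter over the counts; a sketch with the values from above as defaults:

def filter_template_pairs(cofreq, freq_a, freq_b, min_count=5, min_ratio=0.2):
    """Keep template pairs satisfying both thresholds of Eq. (1).
    cofreq: Counter over (T_a, T_b) pairs; freq_a, freq_b: per-template
    frequencies, restricted to templates that co-occur at all."""
    return {
        (t_a, t_b): count
        for (t_a, t_b), count in cofreq.items()
        if count >= min_count
        and count / min(freq_a[t_a], freq_b[t_b]) >= min_ratio
    }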

Please note that even after these filtering steps, there is an m : n relationship between corresponding templates. E.g., a template en: Infobox company might correspond both to de: Infobox Organisation and de: Infobox Firma, while the latter might also correspond to en: Infobox corporation. This is intended; after all, we are interested in all attribute mappings between all templates that are somehow related.

Table 7 shows some basic statistics about the co-occurrences of infobox templates between the six language pairs, after resolving template redirects and applying the thresholds described above. Apart from quantifying the degree of infobox co-occurrence between the different language pairs, the numbers allow for another interesting observation: Similar to the distribution of infobox templates within individual Wikipedia editions, also for infobox co-occurrences there is a large divergence between the average and the median number of article pairs per infobox template pair (and this is after filtering out the most infrequent pairs via the thresholds). This imbalance poses a challenge for the attribute matching algorithm, because it entails that for the majority of infobox template pairs only few article pairs can serve as instances.

6. Infobox Attribute Matching

As described above, the main idea of the attribute matching approach is to analyze and compare the values of attributes across infobox template pairs in different language editions. The attribute matching is done separately for each pair of matched infobox templates, and the goal is to produce a 1:1 attribute mapping between corresponding attributes for each of these template pairs. Of course, an individual attribute might be mapped to different attributes in different templates.

6.1. Process overview

The input to our attribute matching system is a set of co-occurring infobox instance pairs for a given template pair. Consider the pair of corresponding infobox templates en: Infobox country and de: Infobox Staat. To perform attribute matching, we are interested in all article pairs 〈A_en, A_de〉 where A_en uses en: Infobox country, A_de uses de: Infobox Staat, and both articles describe the same entity. One example of such an article pair is 〈en: Argentina, de: Argentinien〉. Given that both articles describe the same real-world entity, we can assume that the values of corresponding attributes are often the same or similar. Even if this is not the case for this particular article pair, we assume that it is the case for a sufficient number of other pairs.
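A sketch of this selection step, assuming the entity mapping is available as pairs of article titles and the infobox usage as simple dictionaries (all identifiers here are illustrative, not taken from the actual system):

def article_pairs_for(template_1, template_2,
                      entity_mapping, infobox_usage_1, infobox_usage_2):
    """Yield article pairs that describe the same entity and use the
    given pair of infobox templates in the two language editions."""
    for a1, a2 in entity_mapping:  # e.g. ('Argentina', 'Argentinien')
        if (infobox_usage_1.get(a1) == template_1
                and infobox_usage_2.get(a2) == template_2):
            yield a1, a2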

Figure 10 shows an overview of our matching approach. After selecting the pairs of corresponding articles as described above, we preprocess all infobox attribute values in all article pairs to remove noise, such as formatting instructions, comments, whitespace, links, or footnotes.

[Figure 10: Overview of the matching process — starting from the infobox occurrences, entity mappings, and infobox data, article pairs are selected and preprocessed; similarities are then calculated between all pairs of attribute instances, aggregated into similarity scores for all attribute pairs, and passed to the global matcher, which produces the final attribute mapping.]

Next, we compute pair-wise similarities for all attribute combinations for each of the infobox instance pairs. Figure 11 shows a fictitious example of two infobox instances with similarity scores for all pairs of attribute instances. These pair-wise instance similarities are computed for all article pairs in which the infobox templates under consideration co-occur. The similarity measure used to compare two attribute instances is described in Sec. 6.2.

Having determined the pair-wise similarities of attribute instances, we aggregate the obtained similarity scores for each attribute pair of corresponding templates. The result is a similarity score for each pair of attributes of the two infobox templates, as outlined in Fig. 12. The details of this aggregation, as well as further aspects that influence the attribute-level similarity scores, are explained in Sec. 6.3.

[Figure 11: Attribute instance similarities — two fictitious infobox instances with attributes a = "500 foos", b = "Foo Bar", c = "May 17, 2006" and x = "Bar Foo", y = "400 as of 2006", annotated with similarity scores for all pairs of attribute instances.]

[Figure 12: Attribute-level similarities — the aggregated similarity scores between attributes a, b, c of the first template and x, y of the second template.]

In a last step, the attribute-level similarity scores are used to derive correspondences between the attributes, that is, a one-to-one mapping of corresponding attributes in the two infobox templates is generated (Fig. 13). An in-depth explanation of this global matching is presented in Sec. 6.4.

[Figure 13: Final mapping of attributes — the one-to-one correspondences derived between attributes a, b, c and x, y.]

6.2. Similarity of attribute values

A core task of the matcher is to determine how similar two attribute instances are. Since we do not want to rely on mere equality of attribute values, a suitable measure that quantifies the degree of similarity has to be found. First, we identify the challenge caused by the untyped nature of infobox attributes and propose a solution that involves the separation of the (implicit) data types contained in attribute values. These different data types are then examined individually to determine the similarity between attribute values. Next, we identify wiki links and external links in attribute values as a further valuable feature. Finally, all individual similarity features are combined into one overall similarity score for a pair of attribute instances.

6.2.1. Separating data types

Wikipedia infobox attribute values do not have an explicit data type and often contain a mixture of numbers, dates, and text. This heterogeneity makes an effective determination of similarity between these values considerably more difficult. There are many established string similarity measures, such as the Levenshtein distance [20] or the Jaro similarity measure [21] (for a comprehensive overview and evaluation of string similarity measures, see [22]). While these measures produce good results for approximate string matching, they are not suited for comparing other types of data, especially numeric values (19 vs. 20) and dates ("30.12.1979" vs. "01.01.1980"). Moreover, both numbers and dates are often formatted differently depending on the local context. Thus, the similarity of "123,456" appearing in the English Wikipedia and the same string in the German Wikipedia should be quite low (due to the different representations of thousands separators and decimal points). On the other hand, "January 25, 1980" is semantically equivalent to "25.01.1980", which should be reflected by a maximum similarity score.

To tackle these problems, one could try to detect the predominant data type for each attribute value and adapt the computation of similarity depending on this data type. However, since infobox attribute values frequently feature a combination of data types, it is often not feasible to find one single data type that fits the attribute value well. Instead, we decompose the attribute value strings into their number, date, and string portions, and apply different similarity measures for each type of data. Afterwards, the individual similarity scores for each data type are averaged, weighted with the character fraction that the instances of the respective data type occupy in the original string.

To demonstrate this method, Listing 1 shows an example of an attribute value with different intermingled data types. First, we detect all substrings that might represent a date and try to parse them with a very robust heuristic-driven parser, which does not rely on fixed date formats.4 The successfully parsed dates are stored, and the original substring representing the date is removed from the attribute value. The same is then done for numbers. Here, of course, we consider the locale of the respective Wikipedia language for parsing the number strings. Moreover, common exponential suffixes, such as "millions" or "bn" (and the corresponding variations in other languages), are reliably detected and resolved (that is, "3.4 bn" is interpreted as 3.4 × 10^9). The remaining part of the attribute value is the text portion. Table 8 shows the resulting data for our example attribute value, with the fraction of characters of each type as its weight.

4 Our implementation uses the parser of the DateTime class in the POJava library, see http://pojava.org/

June 14, 1777 (original 13-star version) July 4, 1960 (current 50-stars version)

Listing 1: Example attribute value (taken from the Adoption attribute of the English en: Flag of the United States article) with intermingled data types

Data                                       Weight
Dates     Sat Jun 14 00:00:00 CET 1777      25/76
          Mon Jul 04 00:00:00 CET 1960
Numbers   13.0                               4/76
          50.0
Text      (original -star version)          47/76
          (current -stars version)

Table 8: Separation of different data types (dates, numbers, text) for the example attribute value shown in Listing 1.
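The following Python sketch mimics this decomposition. It stands in for the far more robust machinery described above: the heuristic date parser and the locale-aware number parsing (including exponential suffixes) are replaced here by toy regular expressions that only recognize English "Month DD, YYYY" dates and plain decimal numbers.

import re

# Toy stand-ins for the robust, locale-aware parsers described in the text.
DATE_RE = re.compile(r'(January|February|March|April|May|June|July|August|'
                     r'September|October|November|December)\s+\d{1,2},\s+\d{4}')
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')

def separate_data_types(value):
    """Split an attribute value into date, number, and text portions and
    compute the character fraction of each data type as its weight."""
    total = max(len(value), 1)

    dates, date_chars = [], 0
    for m in DATE_RE.finditer(value):
        dates.append(m.group(0))
        date_chars += len(m.group(0))
    value = DATE_RE.sub('', value)  # remove dates before detecting numbers

    numbers, num_chars = [], 0
    for m in NUMBER_RE.finditer(value):
        numbers.append(float(m.group(0)))
        num_chars += len(m.group(0))
    text = NUMBER_RE.sub('', value)  # the remainder is the text portion

    weights = {'date': date_chars / total,
               'number': num_chars / total,
               'text': (total - date_chars - num_chars) / total}
    return dates, numbers, text, weights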

6.2.2. A data type-aware similarity measure

Having separated the different data types, we can now calculate the similarity between pairs of attribute instances by determining individual scores and aggregating them to an overall score.

Similarity of the string portions. Much research has been conducted on determining approximate string similarities. The most commonly used measures can be classified into three categories: character-based, token-based, and hybrid approaches [22]. Many infobox attribute values are rather short strings, but a good portion of them is longer and contains continuous text rather than single words. For this reason, token-based methods outperform character-based similarity measures in our scenario. However, the most important characteristic of our use case is its cross-lingual nature: The strings we need to compare are composed in different natural languages. Thus, using words as tokens is not useful, and we use n-gram tokens instead, as they are much less sensitive to language differences (at least within the Indo-European language family considered here).

As a result of these considerations, we employ the Jaccard similarity with character-level n-grams (n = 2..4) to compare the string portions of attribute values with token sets T_1 and T_2:

$$sim_{string}(s_1, s_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \quad (2)$$

Similarity of the numeric portions. A simple and, in our experiments, effective approach is to calculate the ratio between the absolute values of the numbers. To guarantee that the resulting value is in the range [0, 1], we divide the smaller number by the larger number.

However, since numbers often describe hard facts where there is no room for deviations, we want to "reward" two numbers for being exactly equal. For example, it is unlikely that an attribute describing the number of doors of a car with a value of 3 relates to an attribute in the other language with a value of 4, although the ratio between those numbers is quite high (0.75). Thus, we introduce a 50% penalty for numbers that are not exactly equal. Hence, the similarity function for two numbers n_1 and n_2 is defined as:

$$sim_{num}(n_1, n_2) = \begin{cases} 1 & \text{if } n_1 = n_2 \\ \frac{1}{2} \cdot \frac{\min\{|n_1|, |n_2|\}}{\max\{|n_1|, |n_2|\}} & \text{otherwise} \end{cases} \quad (3)$$

Since we extract not only one number per attribute, but rather all contained numbers, we need to compare two sets of numbers, N_1 and N_2. To use sim_num we first have to select adequate pairs of numbers 〈n_1, n_2〉 ∈ N_1 × N_2 from the two values. We do this by finding a one-to-one mapping between the numbers in both sets, such that the total similarity between the pairs is maximal (greedily approximated due to the computational complexity of this assignment problem). Our algorithm is based on the token assignment approach of the TagLink similarity measure ([23], referred to there as Algorithm 1): First, we compute the pairwise similarity sim_num for all pairs of numbers 〈n_1, n_2〉 and sort the number pairs by their similarity in descending order. Then we select the pair 〈b_1, b_2〉 with the highest similarity and remove all other pairs {〈n_1, n_2〉 | n_1 = b_1 ∨ n_2 = b_2} from the list of pairs. This is repeated until there are no pairs left in the list. The resulting set of matching pairs is M_n. This method is also referred to as the greedy algorithm for maximum weight matching [24].

Given the mapping of numbers, the most intuitive way to compute the overall similarity of the two sets of numbers is to simply calculate the arithmetic mean of the similarities of matched numbers. This, however, does not account for unmatched numbers in either of the sets. If, for example, N_1 = {3} and N_2 = {1, 2, 3, 4, 5}, we would find one perfect match 〈3, 3〉. Concluding that the similarity of N_1 and N_2 is 1.0, however, would be unnatural, because there are four unmatched numbers in N_2. Thus, we define the similarity between two sets of numbers as:

$$sim_{nums}(N_1, N_2) = \frac{\sum_{\langle n_1, n_2 \rangle \in M_n} sim_{num}(n_1, n_2)}{\max\{|N_1|, |N_2|\}} \quad (4)$$
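A compact Python rendering of Equations (3) and (4) together with the greedy assignment described above; the function names are ours:

def sim_num(n1, n2):
    """Equation (3): ratio of absolute values with a 50% inequality penalty."""
    if n1 == n2:
        return 1.0
    if max(abs(n1), abs(n2)) == 0:
        return 0.0
    return 0.5 * min(abs(n1), abs(n2)) / max(abs(n1), abs(n2))

def greedy_match(xs, ys, sim):
    """Greedy one-to-one assignment in descending order of similarity
    (the TagLink-style procedure described above); returns the scores
    of the matched pairs."""
    pairs = sorted(((sim(x, y), i, j)
                    for i, x in enumerate(xs)
                    for j, y in enumerate(ys)), reverse=True)
    used_x, used_y, scores = set(), set(), []
    for s, i, j in pairs:
        if i not in used_x and j not in used_y:
            used_x.add(i)
            used_y.add(j)
            scores.append(s)
    return scores

def sim_nums(ns1, ns2):
    """Equation (4): matched similarities normalized by the larger set."""
    if not ns1 or not ns2:
        return 0.0
    return sum(greedy_match(ns1, ns2, sim_num)) / max(len(ns1), len(ns2))

For the example above, sim_nums([3], [1, 2, 3, 4, 5]) yields 1.0/5 = 0.2 rather than 1.0, reflecting the four unmatched numbers.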

Similarity of the date portions. We define the similarity of two dates as their absolute difference (in days) relative to the maximum range of dates in the two attributes:

$$sim_{date}(d_1, d_2) = 1 - \frac{|d_1 - d_2|}{maxDate - minDate} \quad (5)$$

where minDate is the earliest date found in any attribute instance of either of the two attributes that are being compared, and maxDate is the latest such date. The difference between them, again, is expressed in days.

To compare two sets of dates D_1 and D_2, the same approach as for numbers is used. We first create a one-to-one mapping M_d between the dates in the two sets and calculate the final similarity as

$$sim_{dates}(D_1, D_2) = \frac{\sum_{\langle d_1, d_2 \rangle \in M_d} sim_{date}(d_1, d_2)}{\max\{|D_1|, |D_2|\}} \quad (6)$$
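Equations (5) and (6) follow the same pattern; a sketch reusing the greedy_match helper from the previous listing (the handling of a zero date range is our own addition, not specified above):

def sim_date(d1, d2, min_date, max_date):
    """Equation (5): day difference relative to the attributes' date range.
    d1, d2 are datetime.date objects; min_date/max_date span all instances
    of the two compared attributes."""
    span = (max_date - min_date).days
    if span == 0:  # degenerate range, not covered by the formula above
        return 1.0 if d1 == d2 else 0.0
    return 1.0 - abs((d1 - d2).days) / span

def sim_dates(ds1, ds2, min_date, max_date):
    """Equation (6): greedy one-to-one matching, as for number sets."""
    if not ds1 or not ds2:
        return 0.0
    scores = greedy_match(ds1, ds2,
                          lambda a, b: sim_date(a, b, min_date, max_date))
    return sum(scores) / max(len(ds1), len(ds2))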

6.2.3. Wiki links and external links for similarity

In many cases an attribute value contains a link to another Wikipedia page: Barack Obama was born in [[Honolulu]], which denotes a link to http://en.wikipedia.org/wiki/Honolulu. The basic idea is to consider the set of articles targeted by wiki links in both attribute instances. With the help of the entity mapping we can easily determine which of these articles describe the same entity, and thus, how much overlap there is between the two sets of targeted articles. To relate the number of overlapping wiki links to the total number of wiki links contained in each of the two attribute instances, we use the Dice coefficient [25], which is in [0, 1]:

$$sim_{wikilinks}(W_1, W_2) = \frac{2 \cdot |W_1 \cap W_2|}{|W_1| + |W_2|} \quad (7)$$

where W_1 ∩ W_2 is the set of language-independent entities that are targeted by wiki links in both attribute instances.

The same is done for external links, although for them no entity matching is necessary. We simply relate the number of overlapping (i.e., identical) external links to the number of external links in both attribute instances, again using the Dice coefficient:

$$sim_{externallinks}(E_1, E_2) = \frac{2 \cdot |E_1 \cap E_2|}{|E_1| + |E_2|} \quad (8)$$
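Both link features reduce to a Dice coefficient over sets; for wiki links, the link targets must first be translated into language-independent entity identifiers via the entity mapping. A brief sketch, where entity_of is an assumed lookup table and not part of the described system:

def dice(s1, s2):
    """Equations (7) and (8): Dice coefficient of two link sets."""
    if not s1 and not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

# sim_wikilinks: map wiki link targets to language-independent entities first.
# sim_wl = dice({entity_of[w] for w in links1}, {entity_of[w] for w in links2})
# sim_externallinks: external links are compared directly.
# sim_el = dice(set(ext_links1), set(ext_links2))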

6.2.4. Combining the individual similarities

The individual similarities need to be combined into an overall similarity score for a pair of attribute instances. As described before, we have not only separated the different data types of the attribute values, but have also determined the shares of the respective data types in the original attribute value strings. We use these fractions to weight the individual data-type-specific similarities. Let f_s1 and f_s2 be the fractions of characters assigned to the string portions of the values of two attribute instances, and len_1, len_2 the string lengths of the original attribute values. Then the average string fraction f_s, weighted with respect to the lengths of the attribute values, is:

$$f_s = \frac{len_1 \cdot f_{s1} + len_2 \cdot f_{s2}}{len_1 + len_2}$$

The weighted average number and date fractions f_n and f_d are calculated analogously:

$$f_n = \frac{len_1 \cdot f_{n1} + len_2 \cdot f_{n2}}{len_1 + len_2} \qquad f_d = \frac{len_1 \cdot f_{d1} + len_2 \cdot f_{d2}}{len_1 + len_2}$$

Given the average fractions of each data type, we can now compute the similarity sim_val of the values of two attribute instances a_1 and a_2 by combining the similarities of the string, number, and date portions, weighting them by their fractions as well as by additional weights w_s, w_n, and w_d. For the sake of clarity, let sim_s be the similarity sim_string of the string portions s_1 and s_2 of the attributes, sim_n the similarity of their number portions, and sim_d the similarity of the date portions.

$$sim_{val}(a_1, a_2) = \frac{w_s \cdot f_s \cdot sim_s + w_n \cdot f_n \cdot sim_n + w_d \cdot f_d \cdot sim_d}{w_s \cdot f_s + w_n \cdot f_n + w_d \cdot f_d}$$

The concrete weights used in our experiments are w_s = 0.11, w_n = 0.44, and w_d = 0.44. These weights have been largely determined empirically, but there is a good reason for the low weight of the string portion: The information density is much higher for numbers and dates than for text. For the two strings "ca. 4" and "ca. 9" we calculate f_s = 0.75 and f_n = 0.25. Without further weights, the perfect similarity of the string portion ("ca.") would outweigh the rather poor number similarity, leading to a quite high overall score of 0.86. Weighting the number and date portions higher than the string portion adjusts this imbalance.

Next, the two link overlap ratios are considered, leading to the final similarity score sim_instance between two attribute instances a_1 and a_2. Again for clarity, let sim_w and sim_e be the similarities between the sets of wiki links and external links, respectively.

$$sim_{instance}(a_1, a_2) = \frac{w_v \cdot sim_{val}(a_1, a_2) + w_w \cdot sim_w + w_e \cdot sim_e}{w_v + w_w + w_e}$$

Here, the concrete weights in our experiments are w_v = 0.3, w_w = 0.6, and w_e = 0.1. Wiki link overlap is weighted very high, because two attribute instances containing wiki links that target the same entity are a quite reliable indicator for the equivalence of these two attributes.

The wiki link overlap is only considered if both attribute instances contain wiki links; otherwise w_w is set to 0. The reason for requiring both attribute instances to contain the links is simple: While in one language edition parts of the attribute value may be linked to the corresponding article, this may not be the case in other Wikipedia editions. The same applies to external links: If not both attribute instances contain external links, we set w_e = 0.
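Putting the pieces of Sec. 6.2.4 together, a sketch of the two weighted combinations with the empirically chosen weights; the boolean flags encode the rule that link weights are zeroed unless both instances contain the respective kind of link:

W_S, W_N, W_D = 0.11, 0.44, 0.44  # string / number / date weights
W_V, W_W, W_E = 0.3, 0.6, 0.1     # value / wiki link / external link weights

def sim_val(sim_s, sim_n, sim_d, f_s, f_n, f_d):
    """Combine data-type similarities, weighted by the length-weighted
    average character fractions f_s, f_n, f_d and the static weights."""
    denom = W_S * f_s + W_N * f_n + W_D * f_d
    return 0.0 if denom == 0 else (
        (W_S * f_s * sim_s + W_N * f_n * sim_n + W_D * f_d * sim_d) / denom)

def sim_instance(simval, sim_w, sim_e, both_have_wikilinks, both_have_extlinks):
    """Final attribute instance similarity; link weights only apply if
    both instances actually contain the respective kind of link."""
    w_w = W_W if both_have_wikilinks else 0.0
    w_e = W_E if both_have_extlinks else 0.0
    return (W_V * simval + w_w * sim_w + w_e * sim_e) / (W_V + w_w + w_e)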

6.3. Similarity of attribute pairs

Having designed a similarity measure for pairs of individual attribute instances, we now derive the similarity between pairs of attributes. As described in Sec. 6.1, all co-occurring attribute instances for each matched template pair are compared with each other. More precisely, the attribute instance similarity sim_instance is calculated for all attribute combinations for each pair of articles that describe the same entity and use the respective pair of infobox templates. This way, we obtain a list of instance-level similarities for each pair of attributes. The total similarity between two attributes can now be defined as the arithmetic mean of all instance-level similarities:

$$sim_{attr}(a_1, a_2) = \frac{\sum_{\langle a_{i_1}, a_{i_2} \rangle \in A_{a_1,a_2}} sim_{instance}(a_{i_1}, a_{i_2})}{|A_{a_1,a_2}|} \quad (9)$$

where A_{a_1,a_2} is the set of co-occurring attribute instance pairs for the two attributes a_1 and a_2.

Due to alternative attribute names (or typos in attribute names), there may be multiple attributes in one infobox template that correspond to an attribute in the other infobox. For example, the template en: Infobox software uses two spelling variants of the programming language attribute interchangeably, which are both equally good match candidates for the German Programmiersprache attribute. Since for now we allow only one-to-one matches, we have to choose one of these attributes. In such a case, it is reasonable to pick the alternative that is used more frequently. However, the approach so far only considers the average similarity of attribute pairs, but disregards attribute frequencies. Therefore, we present two additional methods to incorporate attribute frequencies into the matching:

1. To filter out very infrequent attributes, a threshold on attribute frequencies is applied. Let freq(a_n) be the number of occurrences of attribute a_n in the set of compared infobox instance pairs I. The threshold condition is then:

$$\frac{freq(a_1)}{|I|} > \frac{1}{30} \;\wedge\; \frac{freq(a_2)}{|I|} > \frac{1}{30}$$

The similarity score of attribute pairs that do not satisfy this condition is set to 0, hence excluding them from the further matching process. These attribute pairs often originate from typos in one of the attribute names, so this filter eliminates some false positives.

2. We incorporate the number of co-occurrences cofreq(a_1, a_2) of two attributes a_1 and a_2 into the attribute similarity score, forming the adjusted attribute similarity sim_adj(a_1, a_2):

$$sim_{adj}(a_1, a_2) = sim_{attr}(a_1, a_2) \cdot \log_{10}(cofreq(a_1, a_2))$$

In this way, in addition to the raw similarity score, we consider two attributes to be a better match if they co-occur more frequently. In the case of the two programming language variants, for example, one is used much more frequently together with the German Programmiersprache attribute than the other (139 versus 2 co-occurrences). Incorporating the number of co-occurrences into the score ensures that, despite a slightly better raw similarity score for the rarer variant, the more frequent alternative is chosen. Scaling the co-occurrences logarithmically proved to be a good compromise, as it penalizes very low frequencies but does not overvalue high frequencies.

Due to the additional factor, the final similarity score sim_adj does not yield results in the range [0, 1] anymore. This, however, is not a problem, because the scores of all attribute pairs are scaled alike.
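In combination, the frequency filter and the logarithmic co-occurrence boost can be sketched as follows (a straightforward reading of the two rules, with our own function name):

import math

def adjusted_similarity(sim_attr_score, cofreq, freq1, freq2, num_instance_pairs):
    """Apply the 1/30 frequency filter, then the log-scaled co-occurrence
    boost of Sec. 6.3 to a raw attribute similarity score."""
    if (freq1 / num_instance_pairs <= 1 / 30
            or freq2 / num_instance_pairs <= 1 / 30):
        return 0.0  # attribute too infrequent, likely a typo variant
    return sim_attr_score * math.log10(cofreq)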

6.4. Global matching

Given the similarity scores for each pair of co-occurring attributes, we now have to derive an overall schema mapping. Here, we confine ourselves to finding one-to-one mappings between attributes, although this does not always precisely represent reality. Determining one-to-many or even m:n mappings together with appropriate concatenation operators is left for future work (see Sec. 9).

Though the degree of information overlap in corresponding infobox templates is mostly high, the mapping of attributes between two concrete infoboxes is usually not complete. This means that often many attributes in one infobox do not have a corresponding partner in the second infobox, and vice versa. It is thus imperative for the global matching algorithm not to enforce a complete mapping among attributes.

Commonly used bipartite matching algorithms, such as algorithms to achieve a stable marriage [26] or the Hungarian algorithm [27], always enforce a complete mapping. To avoid false positives, we choose another straightforward and quite strict approach to global matching, requiring both attributes to mutually prefer each other: First, for each attribute in both infobox templates, the match with the best score is determined. Then, for each attribute a of the first infobox template, the attribute's best match b is adopted to the final mapping only if a is also b's best match.

Quantifying confidence. It is desirable for an automatic schema matching tool to express its confidence in a found match. In this way, the obtained mapping can be filtered further, with the aim to increase precision (usually at the expense of recall). We calculate two such measures for each reported match:

Absolute confidence is an absolute measure of confidence and is defined as the similarity score sim_attr of the attribute match:

$$c_{abs}(a, b) = sim_{attr}(a, b)$$

Relative confidence is the confidence that the match is better than other possible matches for the participating attributes. It is calculated as one minus the ratio of the average score of both attributes' second-best matches sbm to the match's similarity score:

$$c_{rel}(a, b) = 1 - \frac{sim_{attr}(a, sbm_a) + sim_{attr}(sbm_b, b)}{2 \cdot sim_{attr}(a, b)}$$

As we show in the evaluation, both confidence measures together are a quite good discriminator to filter out false positives and thus to increase precision.
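A sketch of the mutual-best-match selection and the relative confidence measure, operating on a dictionary of adjusted similarity scores (data layout and names are our own, not from the actual implementation):

def global_match(scores):
    """Mutual best match: adopt (a, b) only if b is a's best match
    and a is b's best match.  scores: dict (a, b) -> similarity."""
    best_for_a, best_for_b = {}, {}
    for (a, b), s in scores.items():
        if s > best_for_a.get(a, ('', -1.0))[1]:
            best_for_a[a] = (b, s)
        if s > best_for_b.get(b, ('', -1.0))[1]:
            best_for_b[b] = (a, s)
    return [(a, b) for a, (b, _) in best_for_a.items()
            if best_for_b[b][0] == a]

def relative_confidence(scores, a, b):
    """c_rel: one minus the average second-best score of a and b, relative
    to the score of the match itself; c_abs is simply scores[(a, b)]."""
    second_a = max((s for (x, y), s in scores.items()
                    if x == a and y != b), default=0.0)
    second_b = max((s for (x, y), s in scores.items()
                    if y == b and x != a), default=0.0)
    return 1.0 - (second_a + second_b) / (2.0 * scores[(a, b)])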

7. Evaluation of Attribute Matching

We now proceed to evaluate the attribute matching method described above. Due to poor infobox naming conventions in the Italian Wikipedia (it is not possible to automatically distinguish infoboxes from other templates) and buggy XML dumps of the Spanish Wikipedia at the time of experimenting,5 we restrict our experimental evaluation to the four Wikipedia language editions English, German, French, and Dutch. Before evaluating the most important aspect of our matching system, its effectiveness, we briefly present runtime results.

7.1. Efficiency

en↔de   222m 02s
en↔fr   203m 39s
en↔nl   156m 21s
de↔fr    93m 31s
de↔nl    72m 59s
fr↔nl   116m 16s

Table 10: Runtime of the matcher in total CPU time.

Table 10 shows the runtime of the matcher to find attribute correspondences between all infobox template pairs for the six language pairs. The times denote total CPU time on the test system, powered by two quad-core Intel Xeon X5355 processors and 16 GB main memory.

5 The bug has been reported at https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

Due to the multi-threaded implementation, the actual wall-clock runtime is much lower on a multi-core system. As demonstrated by the numbers, the runtime scales roughly linearly with the number of infobox instance pairs (cf. Tab. 7 in Sec. 5.2). In summary, the computational performance is sufficiently good and scales well enough to process larger amounts of infobox template or language pairs.

7.2. Evaluation of matching effectiveness

To evaluate the overall performance of our matching approach, we have manually (and carefully) created the expected attribute correspondences for several infobox template pairs across all six language pairs. In total, these hand-crafted evaluation mappings cover 96 infobox template pairs (3.44% of all co-occurring infobox template pairs in the considered language pairs) with a total of 1,417 expected attribute pairs. The infobox template pairs have been manually selected to cover a diverse range of templates (e.g., templates with few or many instances) and matching problems (e.g., templates where attributes in the different languages use different units).

Table 9 shows the evaluation results of our approach, broken down by language pair. The upper half of the table presents the absolute numbers of relevant/expected attribute pairs (as defined by the hand-crafted mappings), actually matched attribute pairs, and true positives (correctly matched pairs). The lower half shows the three metrics commonly used to evaluate results of information retrieval and schema matching approaches:

$$Precision = \frac{|R \cap M|}{|M|} \qquad Recall = \frac{|R \cap M|}{|R|} \qquad F_1\ measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

             en↔de    en↔fr    en↔nl    de↔fr    de↔nl    fr↔nl    Total
|R|            292      284      269      214      177      181    1,417
|M|            299      298      268      217      180      179    1,441
|R ∩ M|        275      275      255      197      165      168    1,335
Precision   91.97%   92.28%   95.15%   90.78%   91.67%   93.85%   92.64%
Recall      94.17%   96.83%   94.80%   92.06%   93.22%   92.82%   94.21%
F1 measure  93.06%   94.50%   94.97%   91.42%   92.44%   93.33%   93.42%

Table 9: Evaluation of the computed attribute mapping. |R| is the number of relevant attribute pairs, |M| the number of matched attribute pairs, and thus |R ∩ M| is the number of true positives.

Of the total of 1,417 expected attribute pairs, we correctly identify 1,335 pairs, resulting in 82 false negatives and a recall of 94.21%. The total number of matched attribute pairs is 1,441, thus there are 106 false positives, leading to an overall precision of 92.64%. Combining these two metrics, we obtain a satisfyingly high F1 measure of 93.42%. Breaking down these numbers to the different language pairs, we still see consistently high precision and recall. Thus, we conclude that the different combinations of natural languages do not have a high impact on the matcher's performance (though this probably does not hold when matching infoboxes originating from very dissimilar languages, especially if these languages do not share the same script).

7.3. Applying confidence thresholds

In Sec. 6.4 we introduced two types of confidence measures: absolute and relative confidence. Assuming that both of them correlate roughly with the quality of a match, it should be possible to control the precision/recall ratio by introducing a threshold on either of the confidence measures or on a reasonable combination of both. To test this assumption, Fig. 14 shows the impact of different thresholds θ on

• absolute confidence: c_abs ≥ θ (Fig. 14a),


• relative confidence: c_rel ≥ θ (Fig. 14b),

• harmonic mean of absolute and relative confidence: $\frac{2 \cdot c_{abs} \cdot c_{rel}}{c_{abs} + c_{rel}} \geq \theta$ (Fig. 14c).

[Figure 14: Impact of different thresholds (sampled at 0.05 intervals) on precision and recall, for (a) absolute confidence, (b) relative confidence, and (c) the harmonic mean of both. Note the different scales for precision and recall.]

The plots show that both confidence measures taken separately can be used to increase matching precision (at the expense of recall). However, combining both confidence measures by taking their harmonic mean is a much better filtering criterion than either of them alone: We can increase precision from originally 92.6% to 96.9% while retaining a recall of 74.4%. With either of the confidence measures alone, this level of precision is only reached at much lower values of recall: 61.8% (c_abs) and 52.5% (c_rel). This is actually not too surprising: Only few false positives have both a high absolute and a high relative confidence.

Since we are now able to influence precision and recall with the help of the confidence threshold, we can plot a standard precision-recall diagram by sampling precision and recall at different confidence thresholds (see Fig. 15). Note that by merely adjusting the threshold, we can only filter out false positives; recall is bound by the recall obtained without applying any threshold (94.21%).

[Figure 15: Precision-recall diagram for the complete set of evaluation mappings (all language pairs), obtained by sampling the combined confidence threshold at 0.02 intervals. Note that the range of the precision axis starts at 0.9.]

8. Related work

Research related to this work stems from several areas. First, the properties of interlanguage links (ILLs) have been analyzed from different points of view. Second, ILLs have been exploited for entity matching in several other works. Finally, the problem of Wikipedia infobox attribute matching has been explored in different ways by at least two independent research groups, thus highlighting the potential of and interest in finding correspondences between infobox attributes.

8.1. Analysis of interlanguage links

Bolikowski provides a very detailed analysis of the topology of the complete Wikipedia ILL graph spanning all 250+ languages [14]. In this graph, vertices represent articles and directed edges represent the interlanguage links connecting them. With respect to (weakly) connected components in the graph, Bolikowski distinguishes coherent and incoherent components. A component is denoted as coherent iff no two articles in this component belong to the same Wikipedia language edition. The main focus of the analysis is on the distributions of component sizes, node degrees, and clustering coefficients, each separately determined for coherent and incoherent components.


The results of this analysis largely coincide with our observations, despite the lower number of languages considered. Bolikowski concludes that instead of consisting of only isolated cliques (which would indicate perfectly coherent and completely intra-linked components), the interlanguage link topology can be "informally described as a scale-free network of loosely connected near-cliques".

Further analysis of ILLs, on the example of the Chinese, Japanese, Korean, and English Wikipedia editions, has been conducted by Arai et al. [28]. Rather than examining the topology of the graph, the authors focus on specific link patterns, such as mutual, inconsistent, and unidirectional links, between pairs of languages. They discover that between 88.7% and 93.8% of all ILLs between the four considered languages are bidirectional.

Finally, Hammwöhner analyzes the quality of different aspects of Wikipedia, including that of ILLs [29]. Without concrete measurements, the author lists several observed phenomena that may lead to inconsistencies among ILLs. These phenomena are mainly related to problems with disambiguation pages, links to wrong homonyms, and different granularities of articles in different languages.

8.2. Entity Matching in Wikipedia

The task of entity matching in multi-lingual Wikipedia is a prerequisite for several research areas, especially for cross-lingual information retrieval and the building of multi-lingual lexical resources. Hence, this problem has been tackled (more or less profoundly) by different researchers.

Adar et al. [15] have approached the problem from a graph-theoretical point of view by identifying weakly connected components in the graph of ILLs between the English, French, Spanish, and German Wikipedia editions. To eliminate conflicts, they simply discarded all components containing more than one article in any given language. A specific evaluation of their entity matching approach has not been conducted.

Hassan and Mihalcea [16] complete the ILLs between the English, Spanish, Arabic, and Romanian Wikipedia editions by computing the transitive and symmetric6 closures, which amounts to clustering weakly connected articles.

6 In their work, symmetry is incorrectly referred to as "reflectivity".

Hecht and Gergle [17] also propose a graph-based view on ILLs. To complete missing links between 25 language editions, they perform a breadth-first search on the graph, ignoring edge direction, and thus, again, find weakly connected components. While the algorithm's positive characteristic as a "missing link finder" is highlighted, incoherency in the identified components and the discovery of erroneous ILLs are not discussed. A rather narrow, and thus not very representative, human evaluation was performed: Based on the analysis of 75 articles per language, the percentage of missing links varies between 2% and 8% (depending on the language pair), while the evaluation of merely 25 bilingual article pairs did not reveal any incorrect links. In another context, the same authors have described a different two-pass approach, however only very briefly and without evaluation [30].

Bolikowski has not only analyzed ILLs from a graph-theoretical point of view, but also briefly proposes a method to approximately identify articles representing the same topic or entity [14]. The proposed approach, however, depends on a reference language edition. Nodes not connected to any node of the reference language are not considered by this approach. Thus, this method is not well suited for the task of entity matching across many languages.

A bottom-up approach to identify clusters in Wikipedia is presented by Kinzler [18]: Initially, a cluster is created for every single article in any language. Afterwards, clusters are iteratively merged until no more merges are possible. The key is the merge condition: Two clusters are merged if (1) there is an ILL from any article in the first cluster to any article in the second cluster, and vice versa; and (2) the language sets of the articles in the clusters are disjoint (to avoid the formation of incoherent components). A manual qualitative evaluation of 250 (out of 2.8 million) clusters spanning five languages (English, German, French, Dutch, and Norwegian) resulted in no erroneously merged articles and six missing articles. Further details on this algorithm and a comprehensive comparison to our approach are given in Sec. 4.

The MENTA (Multilingual Entity Taxonomy) project integrates entities from different language editions of Wikipedia and WordNet [31]. Similar to our approach, the authors define a graph on the cross-lingual links and then remove conflicting links. In their work, they define the optimization problem of removing as few links as possible to achieve a coherent graph. They apply an algorithm based on a linear program that runs with a logarithmic approximation guarantee [32].

8.3. Infobox attribute matching

The task of infobox attribute matching falls into the broader area of schema matching, and there is a multitude of further relevant work. A general survey of common schema matching approaches is presented by Rahm and Bernstein [2]. They describe and compare different methods and establish a taxonomy that is often used to classify schema matching approaches. However, many of the traditionally used techniques leverage the information of (mostly relational) schema definitions, such as column data types and constraints, and are thus not applicable to the task at hand. Our discussion of related work therefore focuses on approaches for infobox attribute matching. Yet, our approach is inspired by a (traditional) instance-based schema matching solution, namely duplicate-based schema matching [5]. While other instance-based matching approaches usually analyze higher-level characteristics of attribute instances within individual schemas, duplicate-based schema matching relies on a priori identified instances representing the same real-world objects. To accommodate the multi-lingual nature of the infobox alignment task, as well as motivated by the unique characteristics of infobox attributes, that matching process has been adapted in several important aspects.

To predict the matching probability of pairs of infobox attribute instances across different language versions, Adar et al. employ self-supervised machine learning with a logistic regression classifier using a broad range of features, such as equality and n-gram/word overlap of attribute keys and values, wiki link overlap, correlation of numerical attributes, and translation-based features [15]. To generate a labeled training set, they first find attribute pairs that have exactly equal values for many article pairs. Then they adopt all attribute instance pairs for these attributes (not only the equal ones) as positive training data. Negative training data is derived from the positive matches, assuming that each attribute in one infobox template corresponds to only one attribute in the other infobox template. After training, the classifier is used to predict whether two attribute instances match (which is modeled as a binary decision). The likelihood that two attributes are a match is then simply defined as the ratio between the number of matched and total instance pairs. Their approach was evaluated on different levels: The classifier reaches a precision of 90.7% (tested using 10-fold cross-validation), which is not to be confused with the precision of the final attribute mapping. In order to quantify the precision of the attribute mapping, they manually verified 200 attribute pairs, of which 86% were classified as correct matches. However, since they only verified found matches, the recall of their approach remains unknown.

Our approach shares some ideas with the method presented by Adar et al. However, our approach puts a much stronger emphasis on the similarity measure for attribute instances, especially with respect to the handling of different data types. We do not use machine learning techniques, but rather rely on static weighting of different similarity features. Altogether, these decisions seem to pay off, as demonstrated by the high precision of 92.6% of our approach, at an even higher recall of 94.2% (see Sec. 7).

Bouma et al. also perform a matching of infobox attributes based on instance data [33]. They first normalize all infobox attribute values, especially with respect to number and date formats and some units. They then test these normalized attribute values for exact equality across different Wikipedia language versions. The authors themselves state that their results are not directly comparable to those in [15], but concede that their recall and precision are lower (and thus also lower than our results). In particular, their limited set of normalizations and reliance on strict value equality have a negative impact on recall. Further, formatting irregularities and other effects negatively affect precision.

9. Conclusions

To find an effective approach for the identification of groups of Wikipedia articles describing the same real-world entities, we have analyzed the interlanguage linkage situation in Wikipedia. While we found that the majority of the interlanguage links (ILLs) are of good quality, the analysis revealed that there are many conflicting links. In extreme cases, the accumulation of such erroneous or imprecise links results in large groups of articles that are connected by ILLs but describe entirely unrelated topics. We have shown how the ILLs in Wikipedia can be leveraged to identify disjoint sets of articles representing the same entity. The presented top-down approach operates on the graph of ILLs and decomposes this graph into coherent components by successively applying stricter connectivity measures.

Using this entity mapping, we have demonstrated that an automatic instance-based matching of Wikipedia infobox attributes is feasible and quite successful, despite the challenging characteristics of infobox data. The fundamental assumption on which this approach is based held true: Attribute pairs that often contain highly similar values in different language editions are likely to correspond. The good evaluation results demonstrate that putting a strong focus on a robust similarity measure, capable of quantifying the similarity of strings with different types of data such as numbers and dates, is a successful strategy.

Future Work. There still remains, however, room for improvement. Currently, our approach is limited to one-to-one mappings between attributes. During the work with the infobox data, we have found at least two scenarios that cannot be represented with a one-to-one mapping: First, there exist alternative names for attributes that are used interchangeably (see Sec. 6.3). With the current implementation, we have to choose between such alternative attributes. It would, however, be preferable to find and report these alternatives. The second scenario is that of one-to-many matches, resulting from different attribute specificity. Further, depending on the application of the attribute mapping, it is not always sufficient to merely relate a set of attributes to an attribute in the other template. We would rather like to determine the correct order and concatenation operators for the attributes on the "many" side, too [34].

Moreover, further processing and normalization of the data, such as detection and conversion of units, could yield better results. It would also be interesting to evaluate whether the incorporation of numeric correlation features (e.g., with the help of the Pearson correlation coefficient) can be used to reliably detect corresponding numeric values, even if they are represented in different units.

With respect to entity identification, one interesting approach could be to feed the found infobox attribute mapping back to the entity matching stage. It could then be used to resolve conflicts in components by assessing the similarity of infobox data between multiple competing articles of one language version and the other articles in the component (of course, only if all involved articles contain infoboxes).

Applications. Several possible applications that can take advantage of aligned infobox schemas have already been suggested in Section 1. One very interesting such application is conflict detection. Given an attribute mapping between a pair of infobox templates, one can analyze all corresponding attribute instance pairs and try to detect inconsistencies between the attribute values. While it is trivial to detect whether attribute values differ from each other (based on strict string equality), the main challenge here is to decide whether differing values really constitute a semantic conflict (as opposed to formatting, unit, and language differences). Having detected potential conflicts, these inconsistencies can either be corrected automatically, or tools can be designed to support human Wikipedia authors with the correction of the data. Automatic conflict resolution is most probably only feasible for a limited set of obvious inconsistencies, because it involves not only deciding which of the conflicting values is correct, but also transforming the value to the expected format in the other language editions (which is not clearly defined due to the loose schema). This transformation includes the adaptation of number and date formats, but also the translation of textual elements in the attribute values. Providing tool support for authors, in contrast, is a much more straightforward approach. Such a tool could present potential inconsistencies to Wikipedia authors, along with proposed resolutions, but leave the final decision and execution to a human author. Though not fully automatic, the benefit of such a tool would still be high, because without tool support, inconsistencies in attribute values are only rarely recognized.

References

[1] E. Tacchini, A. Schultz, C. Bizer, Experiments with Wikipedia cross-language data fusion, in: Proceedings of the Workshop on Scripting and Development for the Semantic Web (SFSW), Crete, Greece, 2009.
[2] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, VLDB Journal 10 (4) (2001) 334–350.
[3] J. Euzenat, P. Shvaiko, Ontology Matching, Springer Verlag, Berlin – Heidelberg – New York, 2007.
[4] H. H. Do, E. Rahm, COMA – a system for flexible combination of schema matching approaches, in: Proceedings of the International Conference on Very Large Databases (VLDB), 2002, pp. 610–621.
[5] A. Bilke, F. Naumann, Schema matching using duplicates, in: Proceedings of the International Conference on Data Engineering (ICDE), Washington, DC, 2005, pp. 69–80.
[6] Wikipedia, Help: Interlanguage links (2010). URL http://en.wikipedia.org/wiki/Help:Interlanguage_links
[7] Serious problems with interlanguage links – wikien-l mailing list (2008). URL http://www.mail-archive.com/wikien-l@lists.wikimedia.org/msg00371.html
[8] D. Kinzler, Ideas for a smarter inter-language link system (2008). URL http://brightbyte.de/page/Ideas_for_a_smarter_inter-language_link_system
[9] Wikimedia Meta wiki, Fine interwiki (2010). URL http://meta.wikimedia.org/wiki/Fine_interwiki
[10] Wikimedia Meta wiki, Interlanguage use case (2010). URL http://meta.wikimedia.org/wiki/Interlanguage_use_case
[11] Wikimedia Meta wiki, A newer look at the interlanguage link (2010). URL http://meta.wikimedia.org/wiki/A_newer_look_at_the_interlanguage_link
[12] Wikimedia, Database backup dumps (2010). URL http://dumps.wikimedia.org/backup-index.html
[13] Wikipedia, Namespace (2010). URL http://en.wikipedia.org/wiki/Wikipedia:Namespace
[14] Ł. Bolikowski, Scale-free topology of the interlanguage links in Wikipedia, arXiv preprint. URL http://arxiv.org/abs/0904.0564v2
[15] E. Adar, M. Skinner, D. S. Weld, Information arbitrage across multi-lingual Wikipedia, in: Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2009, pp. 94–103.
[16] S. Hassan, R. Mihalcea, Cross-lingual semantic relatedness using encyclopedic knowledge, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Morristown, NJ, USA, 2009, pp. 1192–1201.
[17] B. Hecht, D. Gergle, The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context, in: Proceedings of the International Conference on Human Factors in Computing Systems (CHI), New York, NY, USA, 2010, pp. 291–300.
[18] D. Kinzler, Building language-independent concepts from Wikipedia, presented at WikiSym 2008 – the International Symposium on Wikis (2008). URL http://brightbyte.de/repos/papers/2008/LangLinks-paper.pdf
[19] P. Sorg, P. Cimiano, Enriching the crosslingual link structure of Wikipedia – a classification-based approach, in: Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence, 2008.
[20] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10 (8) (1966) 707–710.
[21] M. A. Jaro, Advances in record linking methodology as applied to the 1985 census of Tampa, Florida, Journal of the American Statistical Society 64 (1989) 1183–1210.
[22] W. W. Cohen, P. Ravikumar, S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: Proceedings of the IJCAI Workshop on Information Integration, 2003, pp. 73–78.
[23] A. Salhi, H. Camacho, A string metric based on a one-to-one greedy matching algorithm, Research in Computing Science: Advances in Computer Science and Engineering 19 (2006) 171–182.
[24] D. E. Drake, S. Hougardy, A simple approximation algorithm for the weighted matching problem, Information Processing Letters 85 (2003) 211–213.
[25] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (3) (1945) 297–302.
[26] D. Gale, L. S. Shapley, College admissions and the stability of marriage, The American Mathematical Monthly 69 (1) (1962) 9–15.
[27] H. W. Kuhn, The Hungarian Method for the assignment problem, Naval Research Logistics Quarterly 2 (1-2) (1955) 83–97.
[28] Y. Arai, T. Fukuhara, H. Masuda, H. Nakagawa, Analyzing interlanguage links of Wikipedias, in: Proceedings of the Wikimania Conference, 2008.
[29] R. Hammwöhner, Interlingual aspects of Wikipedia's quality, in: Proceedings of the International Conference on Information Quality (IQ), Boston, MA, 2007, pp. 39–49.
[30] B. Hecht, D. Gergle, Measuring self-focus bias in community-maintained knowledge repositories, in: Proceedings of the International Conference on Communities and Technologies (C&T), University Park, PA, USA, 2009, pp. 11–20.
[31] G. de Melo, G. Weikum, MENTA: Inducing multilingual taxonomies from Wikipedia, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2010, pp. 1099–1108.
[32] G. de Melo, G. Weikum, Untangling the cross-lingual link structure of Wikipedia, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2010, pp. 844–853.
[33] G. Bouma, S. Duarte, Z. Islam, Cross-lingual alignment and completion of Wikipedia templates, in: Proceedings of the International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, 2009, pp. 21–29.
[34] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: Discovering complex semantic matches between database schemas, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2004, pp. 383–394.
