Page 1: LoCloud - D3.3: Metadata Enrichment services

Local content in a Europeana cloud

D3.3: Metadata Enrichment services

Authors: Aitor Soroa (EHU), Arantxa Otegi (EHU), Eneko Agirre (EHU), Rodrigo Agerri (EHU)

LoCloud is funded by the European Commission’s ICT Policy Support Programme

Version: 1.0


| Version | Date | Author | Organisation | Description |
|---|---|---|---|---|
| 0.1 | April 2014 | Rodrigo Agerri | EHU | First version of NED systems |
| 0.2 | June 2014 | Arantxa Otegi | EHU | Add section on evaluation |
| 0.3 | August 2014 | Aitor Soroa | EHU | First complete draft |
| 0.4 | September 2014 | Aitor Soroa | EHU | Final version |

Statement of originality: This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.


Contents

1 Introduction

2 Background Link micro-service
2.1 Named Entity Disambiguation
2.2 Data Sources for NED
2.3 Tools for NED
2.4 Evaluation
2.4.1 DBpedia Spotlight
2.4.2 Wikipedia Miner
2.5 Choosing a NED tool for the background link micro-service

3 Vocabulary matching micro-service
3.1 Retrieving the vocabularies
3.2 Matching CH items

4 Implementation and examples of use
4.1 Background link micro-service

5 Vocabulary matching micro-service

6 API reference
6.1 Background Link micro-service
6.2 Vocabulary Matching micro-service
6.3 HTML Status Codes

7 Conclusion


Executive summary

Metadata enrichment services automatically link Cultural Heritage (CH) items with external information such as DBpedia pages or vocabulary concepts. Metadata enrichment assists users by providing contextual information about a particular item. It also helps to categorize CH items with relevant terms and concepts from a particular vocabulary. Metadata enrichment comprises two different micro-services:

• Background link micro-service, which automatically links CH items to background information contained in pages from an external resource like Wikipedia or DBpedia.

• Vocabulary matching micro-service, which automatically links metadata to relevant classes in a provided vocabulary.

The background link micro-service relies on DBpedia Spotlight, a state-of-the-art tool for performing Named Entity Disambiguation (NED). DBpedia Spotlight was chosen because it performed best in our experiments, as shown in this deliverable.

The vocabulary matching micro-service was developed from scratch for the project. It synchronizes with the vocabulary service [1], a collaborative platform to explore the potential of crowdsourcing as a way of developing multilingual, semantic thesauri for local heritage content.

All micro-services are implemented as REST services and are deployed in virtual machines (VMs). Documentation on the service, including the full API specification, is available at this address:

http://support.locloud.eu/LoCloud%20Enrichment%20Microservice

The metadata enrichment micro-services are used automatically in the various enrichment workflows through the LoCloud Generic Enrichment Service, which allows various REST micro-services to be orchestrated into complex enrichment workflows.

The purpose of this deliverable is to describe each metadata enrichment micro-service and to specify their corresponding APIs.

[1] Task 3.4 of the LoCloud project.


1 Introduction

This document describes the services developed within the LoCloud project with the aim of enriching the metadata associated with Cultural Heritage (CH) items. Metadata enrichment assists users by providing contextual information about a particular item. It also helps to categorize CH items with relevant terms and concepts from a particular vocabulary. Metadata enrichment comprises two different micro-services:

• Background link micro-service, which automatically links CH items to background information contained in pages from an external resource like Wikipedia or DBpedia.

• Vocabulary matching micro-service, which automatically links metadata to relevant classes in a provided vocabulary.

Figure 1 shows an example of a vocabulary match and background link for a particular Europeana CH item. The bottom of the figure shows the CH item describing "The Major Oak" as displayed by the Europeana portal [2]. The item has been enriched and linked to two different elements: the vocabulary concept "Oak" as defined by the Library of Congress Subject Headings (LCSH) vocabulary (left part of the figure), and a Wikipedia page describing the same tree (right part).

Figure 1: Example of a vocabulary match and background link for the Europeana item describing "The Major Oak" (at the bottom of the figure). The left part shows a match to a concept from the LCSH vocabulary. The right part shows a background link to a Wikipedia page.

We have implemented the following three micro-services for metadata enrichment:

• A background link micro-service for English.

• A background link micro-service for Spanish.

• A multi-lingual vocabulary matching micro-service.

[2] http://www.europeana.eu/portal/record/2022323/C052AA1727D9C258801CF676473953A0861A47C0.html

All micro-services are implemented as REST services and are deployed in virtual machines (VMs). The micro-services are used automatically in the various enrichment workflows through the LoCloud Generic Enrichment Service. The generic enrichment service allows various REST micro-services to be orchestrated into complex enrichment workflows. Users can create a workflow by selecting and combining the micro-services they want. More specifically, for metadata enrichment, the textual descriptions (e.g. a description element) are posted to the background link and vocabulary matching micro-services, and the results are returned and integrated into an enriched version of EDM (Doerr et al., 2010).

The background link problem is closely related to the so-called Named Entity Disambiguation (NED) task. Many systems for performing NED exist nowadays, and therefore the micro-service has been implemented using such a NED tool. In this deliverable we survey the state of the art in NED tools and datasets, and present an evaluation of the chosen tool. The task of linking CH items with relevant vocabulary concepts has also gained much attention in recent years. However, to our knowledge there are neither tools for this task nor standard datasets to test them. We thus designed and implemented the vocabulary matching micro-service from scratch.

The document is structured as follows. Section 2 describes in detail the background link micro-service. Sections 2.2 and 2.3 provide an in-depth analysis of current state-of-the-art systems for NED, including testing datasets. Section 2.4 reports the results of evaluating two of the best systems for English and Spanish. Section 3 describes the design of the vocabulary matching micro-service. Section 4 explains the actual implementation of both micro-services, and shows some examples of use. Section 6 describes the APIs for using the services. Finally, Section 7 draws some conclusions.


2 Background Link micro-service

Current efforts for the digitisation of Cultural Heritage are providing common users with access to vast amounts of material. Sometimes, though, this quantity comes at the cost of a restricted amount of metadata, with many items having very short descriptions and a lack of rich contextual information. For instance, Figure 2 shows the "Mona Lisa" CH item as displayed in Europeana. The record provides little information about the artwork. Wikipedia, in contrast, offers in-depth descriptions and links to related articles for many CH items, as shown in Figure 3. Therefore, Wikipedia is a natural target for automatic enrichment of CH items.

Figure 2: Europeana CH item describing "Mona Lisa". Note that the description field does not provide any information about the artwork.

Enriching CH items with information from Wikipedia or other external resources is not novel. In Haslhofer et al. (2010), for instance, the authors also acknowledge the interest of enriching CH items. They present the LEMMO framework, a tool to help users annotate Europeana items with external resources (i.e. Web pages, DBpedia entries, etc.), thus extending Europeana items with user-contributed annotations.

The goal of the background link micro-service implemented in LoCloud is to automatically enrich CH items with external elements. This task is closely related to the "Named Entity Disambiguation" (NED) problem. As will be shown below, many wide-coverage NED tools exist nowadays which can be used to implement the background link micro-service. Unfortunately, to the best of our knowledge, none of the existing NED solutions focuses on the CH domain. However, almost all existing NED tools link the text with encyclopedic resources such as Wikipedia or DBpedia, which cover many topics including CH. For instance, Wikipedia contains thousands of elements from the CH domain, including paintings, archeological items, artists, etc. Therefore, we decided to use such a general-purpose NED tool for LoCloud.


Figure 3: Wikipedia entry for "Mona Lisa".

In this section we analyze various systems and data sources for NED. We start with an in-depth survey of the current state of the art, data sources, tools and technology related to NED. We then choose two open source NED systems, and evaluate them on standard data sources in both English and Spanish. As a result of this evaluation, the best tool for NED is chosen, so that it becomes the core element of the background link annotation micro-service.

2.1 Named Entity Disambiguation

The NED task focuses on recognizing, classifying and linking mentions of specific named entities in text. NED systems deal with two basic problems: name variation and name disambiguation. The former arises because a named entity can be mentioned using a great variety of surface forms ("Mona Lisa", "La Gioconda", "Gioconda", etc.); the latter arises because the same surface form can refer to a variety of named entities: for example, the form "san juan" can refer to dozens of toponyms, persons, a saint, etc. (see, e.g., http://en.wikipedia.org/wiki/San_Juan ). Therefore, in order to provide an adequate and comprehensive account of named entities in text it is necessary to recognize the mention of a named entity, classify it as a type (e.g., person, location, painting, etc.), link or disambiguate it to a specific entity, and resolve every mention that refers to the same entity in a text.
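The two problems can be pictured as the two directions of a many-to-many mapping between surface forms and entities. The following is a minimal sketch in Python, not part of any LoCloud service: the alias pairs and the entity identifiers (such as `San_Juan_(city)`) are hypothetical and chosen only to mirror the examples in the text.

```python
from collections import defaultdict

# Hypothetical (surface form, entity id) pairs, made up for illustration only.
ALIASES = [
    ("Mona Lisa", "Mona_Lisa"),
    ("La Gioconda", "Mona_Lisa"),
    ("Gioconda", "Mona_Lisa"),
    ("San Juan", "San_Juan_(city)"),
    ("San Juan", "San_Juan_(province)"),
    ("San Juan", "San_Juan_(saint)"),
]


def build_indexes(pairs):
    """Index the pairs in both directions, lower-casing the surface forms."""
    forms_by_entity = defaultdict(set)
    entities_by_form = defaultdict(set)
    for form, entity in pairs:
        key = form.lower()
        forms_by_entity[entity].add(key)
        entities_by_form[key].add(entity)
    return forms_by_entity, entities_by_form


forms_by_entity, entities_by_form = build_indexes(ALIASES)

# Name variation: one entity is reachable through several surface forms.
variation = sorted(forms_by_entity["Mona_Lisa"])

# Name ambiguity: one surface form leads to several candidate entities.
ambiguity = sorted(entities_by_form["san juan"])

print(variation)
print(ambiguity)
```

A real system would derive such an alias table from Wikipedia titles, redirects and link anchors rather than from a hand-written list.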

The first task of any NED system is to detect and identify specific entities in running text, a task called Named Entity Recognition and Classification (NERC). Current state-of-the-art processors achieve high performance in the recognition and classification of general categories such as people, places, dates or organisations (Nadeau and Sekine, 2007), e.g. the OpenCalais service for English [3].

Once the named entities are recognised they can be identified with respect to an existing catalogue. Wikipedia has become the de facto standard for such a named entity catalogue. Entity Linking is the process of automatically linking the named entities occurring in free text to their corresponding Wikipedia articles. This task is typically regarded as a Word Sense Disambiguation (WSD) problem (Agirre and Edmonds, 2007), where Wikipedia provides both the dictionary and training examples. Public demos of systems which exploit Wikification (only for English) are Spotlight [4], CiceroLite from LCC [5], Zemanta [6], TAGME [7] and The Wiki Machine [8].

[3] http://www.opencalais.com/

Automatic text Wikification implies solutions for NED (Mihalcea and Csomai, 2007). Unambiguous terms pose no problem, but in other cases word sense disambiguation must be performed. For example, the Wikipedia disambiguation page lists many different articles that the term oak tree might refer to (a tree, a village in England, an artwork by Michael Craig-Martin, etc. [9]).

The following sentence provides an example of a CH description with the corresponding Wikipedia links:

The largest Oak tree [10] in England [11], perhaps in the world, this famous tree has withstood lightning.

The named entity ambiguity problem has been formulated in different ways. Within computational linguistics, the problem was first conceptualised as an extension of the coreference resolution problem (Bagga and Baldwin, 1998). The Wikification approach later used Wikipedia as a word sense disambiguation data set by attempting to reproduce the links between pages, as linked text is often ambiguous (Mihalcea and Csomai, 2007). Finally, using Wikipedia as in the Wikification approach, NERC was included as a preprocessing step and a link or NIL was required for all identified mentions (Bunescu and Pasca, 2006). This means that, as opposed to Wikification, links were to be provided only for named entities. The resulting terminology of these various approaches is cross-document coreference resolution, Wikification, and Named Entity Linking (NEL) [12]. The term Named Entity Disambiguation will be used to refer to any of these three tasks interchangeably (Hachey et al., 2012).

Different approaches have been proposed for NEL. A system typically searches for candidate entities and then disambiguates them, returning either the best candidate or NIL. Thus, although both WSD and NEL address ambiguity and synonymy in natural language, there is an important difference. In NEL, the knowledge base (KB) is not assumed to be complete, as it is in WSD with respect to WordNet. Named entity mentions in the text which have no reference in the KB must therefore be marked as NIL. Moreover, the variation of named entity mentions is higher than that of lexical mentions in WSD, which makes the disambiguation process much noisier.

The rise to prominence of Wikipedia has allowed the development of wide-coverage NEL systems. The most popular task, for which both datasets and NEL systems can be found, is the Knowledge Base Population (KBP) task at the NIST Text Analysis Conference (TAC). The goal of KBP is to promote research in automated systems that discover information about named entities as found in a large corpus and incorporate this information into a knowledge base. TAC 2012 fielded tasks in three areas, all aimed at improving the ability to automatically populate knowledge bases from text. For our purposes the Entity-Linking task is the most relevant:

"Given a name (of a Person, Organization, or Geopolitical Entity) and a document containing that name, determine the KB node for the named entity, adding a new node for the entity if it is not already in the KB. The reference KB is derived from English Wikipedia, while source documents come from a variety of languages, including English, Chinese, and Spanish". [13]

The popularity of the KBP task has led to a huge number of NEL systems, although given that every participant was aiming at obtaining the highest accuracy, most of the systems differ in so many dimensions that it is rather unclear which aspects of the systems are actually necessary for good performance and which aspects are harming it (Hachey et al., 2012).

[4] http://DBpedia-spotlight.github.io/demo/index.html
[5] http://demo.languagecomputer.com/cicerolite/
[6] http://www.zemanta.com/
[7] http://tagme.di.unipi.it/
[8] http://thewikimachine.fbk.eu/html/index.html
[9] http://en.wikipedia.org/wiki/Oak_Tree_(disambiguation)
[10] http://en.wikipedia.org/wiki/Major_Oak
[11] http://en.wikipedia.org/wiki/England
[12] Throughout this report we use the terms Entity Linking (EL) and Named Entity Linking (NEL) interchangeably.
[13] http://www.nist.gov/tac/2012/KBP/index.html

The first large set of manually annotated named entity linking data was prepared for the KBP 2009 edition (McNamee et al., 2010). In the KBP 2012 edition, the reference KB is derived from English Wikipedia, while source documents come from a variety of languages, including English, Chinese, and Spanish. Other NEL evaluation datasets have been compiled independently of the TAC KBP shared tasks. These, together with some other resources based on Linked Data which can be used for NED, are listed and described in Section 2.2.

NED allows direct references to people, locations, organizations, etc. to be computed. For example, in the financial domain NED can be used to link textual information about organizations to financial data, and in the tourism domain NED can link information about hotels or destinations to particular opinions and/or facts about them.

Major components of a Named Entity Disambiguation System

NED systems currently use a wide variety of techniques and algorithms. However, almost all NED systems can be described as performing three major steps:

• Identification of mention strings: given a text document, the first step is to identify which pieces of text can be linked to external entities. For instance, given the example above, the first step would be to identify "Oak Tree" and "England" as entity mentions. Some systems (particularly DBpedia Spotlight) use the term "spotting" to refer to the mention string identification step.

• Candidate selection: once entity mentions are identified, the next step is to generate the candidate entities for each particular mention string. Titles and other Wikipedia-derived aliases can be leveraged at this stage to capture synonyms. An ideal searcher should balance precision and recall to capture the correct entity while maintaining a small set of candidates. This reduces the computation required for disambiguation.

• Disambiguation: the last step involves choosing the best candidate among the possible candidates for each mention string.
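The three steps above can be sketched as a toy pipeline. This is an illustrative sketch only and does not reproduce the algorithm of any of the tools surveyed here: the alias table, the prior values, the context words and the scoring rule (prior multiplied by context overlap) are all assumptions invented for the example. A mention with no candidate in the table is mapped to NIL.

```python
import re

# Hypothetical alias table: lower-cased surface form -> {entity id: prior}.
ALIAS_TABLE = {
    "oak tree": {"Major_Oak": 0.2, "Oak": 0.7, "Oak_Tree_(village)": 0.1},
    "england": {"England": 1.0},
}

# Hypothetical context words associated with each entity.
ENTITY_CONTEXT = {
    "Major_Oak": {"largest", "famous", "england", "sherwood"},
    "Oak": {"genus", "species", "acorn"},
    "Oak_Tree_(village)": {"village", "parish"},
    "England": {"country", "kingdom"},
}


def spot(text):
    """Step 1: find mention strings by looking up known surface forms."""
    lowered = text.lower()
    mentions = []
    for form in ALIAS_TABLE:
        for match in re.finditer(r"\b" + re.escape(form) + r"\b", lowered):
            mentions.append((match.start(), form))
    return sorted(mentions)


def candidates(form):
    """Step 2: return the candidate entities (with priors) for a mention."""
    return ALIAS_TABLE.get(form, {})


def disambiguate(form, text):
    """Step 3: pick the best-scoring candidate, or NIL if there is none."""
    words = set(re.findall(r"\w+", text.lower()))
    best, best_score = "NIL", 0.0
    for entity, prior in candidates(form).items():
        overlap = len(words & ENTITY_CONTEXT.get(entity, set()))
        score = prior * (1 + overlap)  # toy score: prior boosted by context
        if score > best_score:
            best, best_score = entity, score
    return best


def link(text):
    """Run the three steps and return (mention, entity) pairs in text order."""
    return [(form, disambiguate(form, text)) for _, form in spot(text)]


TEXT = ("The largest Oak tree in England, perhaps in the world, "
        "this famous tree has withstood lightning.")
links = link(TEXT)
print(links)
```

In this toy setting the context words "largest", "famous" and "england" outweigh the higher prior of the generic entity `Oak`, which illustrates why disambiguation needs more than the most frequent sense.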

2.2 Data Sources for NED

This section describes some of the relevant data sources for NED. The data sources are mainly either text corpora developed for NLP applications or Linked Data published as part of the Linked Data [14] initiative. Most of the research on NED systems has been undertaken on text corpora, although some systems are already using Linked Data datasets such as DBpedia [15].

The data sources and systems described in this section are those relevant to Wikification and NEL [16]. The datasets are not focused on any particular domain and, in particular, they do not belong to the CH domain. However, these are the standard datasets used to evaluate and compare NED systems, and thus they are useful for drawing conclusions on the benefits of each of the systems.

With the rise to prominence of Wikipedia, the Wikification task was proposed (Mihalcea and Csomai, 2007). Instead of clustering entities, as in Cross-document Coreference Resolution, mentions of important concepts in the text were to be linked to their corresponding Wikipedia articles. Crucially, the Wikification task differs from NEL in that the concepts to be disambiguated are not necessarily named entities and in assuming that the knowledge base is complete.

As mentioned before, the first large datasets on NEL were created by the Text Analysis Conference (TAC) for the Knowledge Base Population (KBP) track. There have been four editions since 2009. Originally, the datasets were only available for English but the 2012 edition includes documents in Spanish. In addition to the KBP datasets, several others have been created (Cucerzan, 2007; Fader et al., 2009). Furthermore, there is some work on integrating NEL annotation with existing NERC datasets such as the CoNLL 2003 datasets (Hoffart et al., 2011).

[14] http://linkeddata.org/
[15] http://DBpedia.org/About
[16] The term NED will be used to refer to either of these two tasks interchangeably (Hachey et al., 2012).

Other valuable datasets for NED, also listed in Table 1, are those related to Linked Data. Linked Data is defined as being "about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods". More specifically, Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information and knowledge on the Semantic Web using URIs and RDF". Of course, the data to be linked can be any type of named entity currently available on the Web. Well-known and large linked data resources in the NLP community are DBpedia, Freebase [17] and Yago [18], but there are many others, including those supported by large organizations such as the BBC, the British Government, NASA, CIA, Yahoo, etc. The current count in the list of Linked Data datasets is more than 300.

[17] http://www.freebase.com/
[18] http://www.mpi-inf.mpg.de/yago-naga/yago/

| Data Entity | Type of data | Provision | Storage | Amount | Language | License | Website |
|---|---|---|---|---|---|---|---|
| KBP 2009 | Newswire | Text corpora available from Linguistic Data Consortium (LDC) | Annotated files for development and evaluation | 3904 instances | English | Private | http://apl.jhu.edu/~paulmac/kbp.html |
| KBP 2010 | News, Blogs, Web data | Datasets available from LDC | Annotated files for development and evaluation | 3750 | English | Private | https://www.ldc.upenn.edu |
| KBP 2011 | News, Web data | Datasets available from LDC | Annotated files for development and evaluation | ~6000 instances for development, training and evaluation | English | Private | https://www.ldc.upenn.edu/ |
| AIDA CoNLL YAGO | Newswire | Available as text corpora | Annotated files | 34596 mentions | English | CC-BY-3.0 license, PU | http://www.mpi-inf.mpg.de/yago-naga/aida/downloads.html |
| Wikipedia Miner | Wikipedia, news | Text corpora | Annotated files | 727 instances | English | Public | http://www.nzdl.org/wikification |
| DBpedia | Wikipedia articles | API, dump | Linked Data | ~3.77 million named entities | English, Spanish, German, Dutch, Italian, French | CC-BY-SA license | http://DBpedia.org/ |

Table 1: Data Sources for Named Entity Disambiguation


KBP at TAC

The TAC KBP 2009 edition distributed a knowledge base extracted from a 2008 dump of Wikipedia and a test set of 3,904 queries. Each query consisted of an ID that identified a document within a set of Reuters news articles, a mention string that occurred at least once within that document, and a node ID within the knowledge base. Each knowledge base node contained the Wikipedia article title, Wikipedia article text, a predicted entity type (person, organization, location or misc), and a key-value list of information extracted from the article's infobox. Only articles with infoboxes that were predicted to correspond to a named entity were included in the knowledge base. The annotators favoured mentions that were likely to be ambiguous, in order to provide a more challenging evaluation. If the entity referred to did not occur in the knowledge base, it was labelled NIL. A high percentage of queries in the 2009 test set did not map to any node in the knowledge base: the gold standard answer for 2,229 of the 3,904 queries was NIL.

The 2010 challenge used the same configuration and the same knowledge base as the 2009 challenge. In this edition, however, a training set of 1,500 queries was provided, with a test set of 2,250 queries. In the 2010 training set, only 28.4% of the queries were NIL, compared to 57.1% in the 2009 test data and 54.6% in the 2010 test data. This mismatch between the training and test data showed the importance of the NIL queries, and it is argued that it may have harmed performance for some systems. This is because it can be quite difficult to determine whether a candidate that seems to weakly match the query should be discarded in favour of guessing NIL. The most successful strategy to deal with this issue in the 2009 challenge was augmenting the knowledge base with extra articles from a recent Wikipedia dump. If a strong match was obtained against an article that did not have any corresponding node in the knowledge base, then NIL was returned for that query.

In the KBP 2012 edition, the reference KB is derived from English Wikipedia, while source documents come from a variety of languages, including English, Chinese, and Spanish.
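The NIL figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the helper name `percentage` is ours.

```python
def percentage(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# KBP 2009 test set: 2,229 NIL answers out of 3,904 queries.
nil_rate_2009 = percentage(2229, 3904)

# KBP 2010: training and test queries add up to the size listed in Table 1.
total_queries_2010 = 1500 + 2250

# Gap (in percentage points) between the 2010 test and training NIL rates.
nil_gap_2010 = round(54.6 - 28.4, 1)

print(nil_rate_2009, total_queries_2010, nil_gap_2010)
```

The 2009 count reproduces the 57.1% quoted in the text, and the 2010 training and test sets together give the 3750 instances listed for KBP 2010 in Table 1.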

AIDA CoNLL Yago

This corpus contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task [19]. The entities are identified by YAGO2 [20] entity name, by Wikipedia URL [21], or by Freebase [22]. The CoNLL 2003 dataset is required to create the corpus.

Wikipedia Miner

The Wikipedia Miner system was mainly tested on Wikipedia articles, by taking the links out and trying to put them back in automatically. In addition, the system was also tested on news stories from the AQUAINT corpus, to see if it would work as well "in the wild" as it did on Wikipedia. The stories were automatically wikified, and then inspected by human evaluators. This dataset contains the news stories of the AQUAINT corpus.

DBpedia

DBpedia is the Linked Data version of Wikipedia. The DBpedia data set currently provides information about more than 1.95 million "things", including at least 80,000 persons, 70,000 places, 35,000 music albums and 12,000 films, classified in a consistent ontology. In total, it contains almost 4 million entities. It also provides descriptions in 12 different languages. Altogether, the DBpedia data set consists of more than 103 million RDF triples.

The data set is interlinked with many other data sources from various domains (life sciences, media, geographic, government, publications, etc.), including the aforementioned Freebase and YAGO, among many others [23].

[19] http://www.cnts.ua.ac.be/conll2003/ner/
[20] http://www.mpi-inf.mpg.de/yago-naga/yago/
[21] http://en.wikipedia.org/wiki/Main_Page
[22] http://wiki.freebase.com/wiki/Machine_ID
[23] http://wiki.DBpedia.org/Datasets

2.3 Tools for NED

Most of the currently available systems have been developed as a result of the popularity of the Wikification and KBP tasks introduced in Section 2. Furthermore, the rise of Linked Data datasets has also contributed to the development of industrial NED systems. Most systems perform either Wikification (every concept is linked) or NEL (only named entities are disambiguated), and many systems now exist for each. In this study, we focus on systems that comply with the following requirements:

• They can be used as standalone programs or libraries. This requirement excludes many systems which only provide web access.

• They have free licenses, so that the systems can be distributed.

Table 2 lists the systems which fulfill these requirements; some details of each system are provided thereafter.


| System / Service | Languages | Sources availability | Provision | Programming Language | License | Website URL |
|---|---|---|---|---|---|---|
| The Wiki Machine | English | Yes | Library | | | http://thewikimachine.fbk.eu/ |
| Illinois Wikifier | English | Yes | Jar, Library | Java | Public | http://cogcomp.cs.illinois.edu/page/software_view/Wikifier |
| DBpedia Spotlight | Dutch, English, German, Portuguese, Spanish | Yes | API, library, source code | Java | Apache 2.0, part of the code uses LingPipe Royalty Free license | http://DBpedia-spotlight.github.com/ |
| WikiMiner | English, Spanish | Yes | Jar, library | Java | GNU GPLv3 | http://wikipedia-miner.cms.waikato.ac.nz/ |
| AIDA | English | Yes | Jar, Restful API | Java | CC-BY-NC-SA license | http://www.mpi-inf.mpg.de/yago-naga/aida/index.html |

Table 2: Systems and Services for Named Entity Disambiguation


The Wiki Machine

The Wiki Machine is a Wikification system developed at the FBK in Trento, Italy. In addition to machine learning techniques, it uses Linked Data to offer multilingual Wikification via DBpedia and Freebase. A publicly available demo compares its results with those of AlchemyAPI, Zemanta and OpenCalais.

Illinois Wikifier

The Illinois Wikifier system is developed at the Cognitive Computation Group at the University of Illinois at Urbana-Champaign [24]. It is a Wikification system (Ratinov and Roth, 2009) that uses both local and global features. The reported results claim to outperform previous systems (Milne and Witten, 2008). It should be noted, however, that few approaches to NED have been evaluated on the same datasets, the KBP participants being the general exception. A newer version of the tool exists (Cheng and Roth, 2013).

DBpedia Spotlight

DBpedia Spotlight is a Wikification tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia (Mendes et al., 2011; Daiber et al., 2013). DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. "Michael Jordan"), and subsequently matches these names to unique identifiers (e.g. DBpedia:Michael_I._Jordan [25], the machine learning professor, or DBpedia:Michael_Jordan [26], the basketball player).

DBpedia Spotlight can be used through its Web Application or Web Service endpoints. The Web Application is a user interface that allows text to be entered in a form and generates an HTML annotated version of the text with links to DBpedia. The Web Service endpoints provide programmatic access to the demo, allowing retrieval of data also in XML or JSON. DBpedia Spotlight is released under the Apache License 2.0.

DBpedia Spotlight offers two approaches. The original DBpedia Spotlight implementation uses Apache Lucene for disambiguation and LingPipe for spotting. Pre-built indexes and spotter models are available for English. The second approach is a statistical approach.
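As an illustration of the Web Service access described above, the following sketch builds (but does not send) an annotation request and extracts links from a decoded JSON response. Several things here are assumptions rather than facts from this deliverable: the endpoint URL is the public DBpedia Spotlight demo address as we understand it and may differ from the deployment used in LoCloud, the `text` and `confidence` parameters and the `Resources`, `@surfaceForm` and `@URI` keys follow the publicly documented Spotlight interface, and the sample response is hand-written for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: public demo endpoint, not taken from this deliverable.
# Replace it with the URL of your own Spotlight deployment.
SPOTLIGHT_ANNOTATE_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_request(text, confidence=0.5, base_url=SPOTLIGHT_ANNOTATE_URL):
    """Build a GET request asking for JSON annotations of the given text."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(base_url + "?" + query, headers={"Accept": "application/json"})


def extract_links(response_json):
    """Return (surface form, DBpedia URI) pairs from a decoded JSON response.

    Assumes the response lists annotations under a "Resources" key.
    """
    resources = response_json.get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]


request = build_annotate_request("The largest Oak tree in England")

# Hand-written sample response, for illustration only.
sample_response = {
    "Resources": [
        {"@surfaceForm": "England", "@URI": "http://dbpedia.org/resource/England"},
    ]
}
sample_links = extract_links(sample_response)
print(request.full_url)
print(sample_links)
```

Sending the request (for example with `urllib.request.urlopen`) and decoding the body with `json.loads` is left out so that the sketch runs without network access.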

Wikipedia Miner

Wikipedia Miner is a Wikification system developed by the University of Waikato, New Zealand (Milne and Witten, 2008). Wikipedia Miner can be used as a Web service or as a library via a Java API. The system uses machine learning and graph-based approaches to detect, disambiguate and link terms in running text to their Wikipedia articles. The system was the first publicly available tool for Wikification and many works still use it as a reference when evaluating their performance. Wikipedia Miner provided several benefits over previous Wikification work (Mihalcea and Csomai, 2007), by: (i) identifying in the input text a set C of so-called context pages, namely pages linked by spots that are not ambiguous because they only link to one article; (ii) calculating a relatedness measure between two articles based on the overlap between their in-linking pages in Wikipedia; and (iii) defining a notion of coherence with the other context pages in the set C. These three main components of the system allowed the authors to obtain around 75% F-measure over long and richly linked Wikipedia articles.
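The relatedness measure in (ii) can be sketched as follows. This deliverable does not give the formula, so the version below is an assumption: it follows the normalized-distance formulation commonly attributed to Milne and Witten (2008), where A and B are the sets of pages linking to each article and W is the total number of articles, and it converts the distance into a relatedness score clipped to the range [0, 1].

```python
import math


def relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness of two articles from their in-link sets.

    Sketch of a normalized link distance turned into a similarity in [0, 1];
    the exact form is an assumption, not taken from this deliverable.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared in-links: treat the articles as unrelated
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the in-link set sizes")
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_articles) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


# Hypothetical in-link sets (page ids) for illustration.
identical = relatedness({1, 2, 3}, {1, 2, 3}, total_articles=1000)
partial = relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=1000)
disjoint = relatedness({1, 2}, {3, 4}, total_articles=1000)
print(identical, partial, disjoint)
```

Articles with identical in-link sets get the maximum score, disjoint sets get zero, and partial overlap falls in between, which is the behaviour the coherence step in (iii) relies on when comparing a candidate with the context pages.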

AIDA

AIDA is a framework and online tool for entity detection and disambiguation (Hoffart et al., 2011). Given a natural-language text or a Web table, it maps mentions of ambiguous names onto canonical entities (e.g., individual people or places) registered in the YAGO2 knowledge base [27].

[24] http://cogcomp.cs.illinois.edu/
[25] http://DBpedia.org/page/Michael_I._Jordan
[26] http://DBpedia.org/page/Michael_Jordan
[27] http://www.mpi-inf.mpg.de/yago-naga/yago/


             Dataset    Instances
Development  TAC 2010   1,675
Test         TAC 2011   1,114
             AIDA       5,508

Table 3: Size of the datasets used in the evaluation

2.4 Evaluation

In the previous section we described the most relevant systems for state-of-the-art NED. An exhaustive evaluation of every system is clearly out of the scope of this document, and therefore we narrow our selection by refining the requirements presented in section 2.3 and adding further requirements for the NED tools:

• The software is easy to download, build and install.

• The software is up to date and, if possible, there is an active community behind it.

• The system is multilingual or, at least, able to perform NED for English and Spanish.

Among the systems described in the previous section, only DBpedia Spotlight and Wikipedia Miner fulfill these additional requirements. Both are mature systems that are easy to build and install, and both have a community of users and developers willing to help in case of problems. We thus focus the evaluation on those systems only. To draw better conclusions from this evaluation exercise, we compare the results obtained using the DBpedia Spotlight probabilistic models with those of Wikipedia Miner. The evaluation consists of running both NED systems on a selection of the standard datasets described in section 2.2, assessing their overall performance, and testing different aspects such as memory and disk usage. This section presents the results of this evaluation.

Both DBpedia Spotlight and Wikipedia Miner depend on a set of parameters which greatly affect the overall behavior of the systems. Given that we wanted to compare both systems in a fair way, we decided to optimize the parameters of each system on one single dataset (the development dataset), and then try the best parameter combination on a different dataset (the test dataset). Table 3 shows the datasets used for development and testing.
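The development/test protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_best`, the `evaluate` callback and the precision-first tie-breaking are ours, not part of either tool; the scores are the development figures reported in Table 4.

```python
def select_best(candidates, evaluate):
    """Return the candidate with the highest (precision, recall) score.

    `candidates` is a list of parameter dicts; `evaluate` is a hypothetical
    callback that runs the NED system on the development dataset and returns
    a (precision, recall) tuple. Ties keep the earliest candidate.
    """
    best_params, best_score = None, None
    for params in candidates:
        score = evaluate(params)
        # Tuples compare element by element: precision first, then recall.
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Development scores copied from Table 4, keyed by (Support, Confidence, Coref.).
TABLE_4 = {
    (0, 0.0, False): (89.23, 73.92),
    (10, 0.1, False): (89.15, 73.33),
    (10, 0.1, True): (89.23, 73.92),
}
candidates = [
    {"support": s, "confidence": c, "coref": k} for (s, c, k) in TABLE_4
]
best, score = select_best(
    candidates,
    lambda p: TABLE_4[(p["support"], p["confidence"], p["coref"])],
)
print(best, score)
```

The selected combination would then be run once, unchanged, on the test datasets, so that the test figures are not influenced by the tuning.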

The performance of the systems is measured using the standard precision and recall metrics. Precision is the number of correctly assigned instances divided by the total number of instances returned by the system. Recall is the number of correctly assigned instances divided by the number of instances in the gold standard dataset. We also report coverage when required, that is, the number of returned instances divided by the number of instances in the gold standard.
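As a minimal sketch of these definitions, assuming that system output and gold standard are both available as dictionaries from a mention identifier to an entity URI (the function name and data layout are hypothetical, and the system is assumed to return only mentions that appear in the gold standard):

```python
def ned_scores(system_links, gold_links):
    """Compute (precision, recall, coverage) as defined in the text.

    Both arguments map a mention identifier to an entity URI. Mentions the
    system left unlinked are simply absent from `system_links`.
    """
    returned = len(system_links)
    gold = len(gold_links)
    correct = sum(
        1 for mention, entity in system_links.items()
        if gold_links.get(mention) == entity
    )
    precision = correct / returned if returned else 0.0
    recall = correct / gold if gold else 0.0
    coverage = returned / gold if gold else 0.0
    return precision, recall, coverage


# Toy example: four gold mentions, three returned, two of them correct.
gold = {"m1": "A", "m2": "B", "m3": "C", "m4": "D"}
system = {"m1": "A", "m2": "B", "m3": "X"}
p, r, c = ned_scores(system, gold)
print(p, r, c)
```

Under these definitions recall equals precision multiplied by coverage, which can serve as a quick consistency check on reported figures.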

In our particular setting we seek to maximize precision, that is, we care more about returning correct links to DBpedia entities than trying to link all possible mentions in the input text. Because we focus our study on NED systems, we discard the so-called NIL instances (instances for which no correct entity exists in the Reference Knowledge Base) from the datasets.

2.4.1 DBpedia Spotlight

For this evaluation, we have used the so-called statistical backend of DBpedia Spotlight [28], publicly available in their repository [29]. DBpedia Spotlight has the following parameters:

[28] Available here: https://github.com/DBpedia-spotlight/DBpedia-spotlight/wiki/Statistical-implementation
[29] https://github.com/DBpedia-spotlight/DBpedia-spotlight/wiki/Statistical-implementation


Support  Confidence  Coref.  P      R
0        0           False   89.23  73.92
10       0.1         False   89.15  73.33
10       0.1         True    89.23  73.92

Table 4: Results of DBpedia Spotlight on the development dataset for different values of the Confidence, Support and Coref. parameters. P stands for precision and R stands for recall.

Mentions  Context  P      R
Entities  Window   89.23  73.92
Entities  Full     87.62  61.37
Any       Window   90.02  54.80
Any       Full     87.68  61.37

Table 5: DBpedia Spotlight results on the development dataset. The column Mentions describes whether all text is disambiguated ("Any") as opposed to evaluating just the named entities ("Entities"). The column Context distinguishes between using the whole document as context ("Full") or using a window of 50 words surrounding the target mention ("Window").

• Spotter: this parameter is used to specify which mentions should be disambiguated. Note that setting this parameter bypasses the spotter algorithm of Spotlight, as it assumes that another tool has performed spotting instead.

• CorefResolution: coreference resolution deals with the problem of identifying two or more expressions in a text which refer to the same entity. This parameter activates or deactivates the coreference module within Spotlight. It should be noted that, despite the name given, this is not what we understand in NLP by "coreference resolution". This module simply tries to say that Michael Jordan and Jordan are mentions of the same entity, but it does not deal with many other forms of nominal coreference resolution.

• Confidence: disambiguate only when the selected entities have a confidence value above this parameter.

• Support: this is a disambiguation threshold. According to the documentation, Spotlight will disambiguate a mention only when the selected entities have a support value above this parameter. The documentation does not specify how the support value is calculated.

Table 4 shows the results for several parameter values of DBpedia Spotlight on the development dataset (TAC 2010). For these experiments we provided the tool with the mentions (spots), as we are mostly interested in analyzing the disambiguator performance for different parameters. The results show that the DBpedia Spotlight disambiguator is very competitive and that its performance is very close to that of the best systems for this dataset. The table also shows that the system is robust with respect to the different parameter values, with no significant differences among them. Notably, the coreference resolution module does not improve the overall performance.

Based on the table above we fixed the best parameters [30], and then we tried two additional settings:

• Whether to link any mention in the text that the system considers relevant, or just the named entities.

• Whether to use the whole text or just a window of words around the target mentions as disambiguation context.

[30] Support: 0, Confidence: 0, CorefResolution: False


                   TAC 2011                   AIDA
Mentions  Context  P      R      mention/sec  P      R      mention/sec
Entities  Window   79.77  60.68  33           79.67  75.94  17
Any       Window   87.31  61.83  14           81.94  71.12  5

Table 6: DBpedia Spotlight results on the test dataset.

           TAC 2011                AIDA
threshold  P      R      Coverage  P      R      Coverage
default    87.31  61.83  70.82     81.94  71.12  86.80
1.0        85.73  60.94  71.09     79.34  72.67  91.59

Table 7: Results on the test dataset for different values of spotter_threshold.

Table 5 shows the results on the development dataset. Using a window of context instead of the full document improves the results on this dataset. Note that the TAC challenge involves linking a single mention string from a possibly large document; the results suggest that using the whole document as context introduces too much noise into the system, probably coming from words that are very distant from the target mention string. The results also show that disambiguating only Named Entities leads to a higher recall, but lower precision.

With these results in hand, we decided to select the best setting [31] for the rest of the experiments. Table 6 shows the results of evaluating DBpedia Spotlight on the TAC 2011 and AIDA datasets using those parameters. The table also shows the mentions per second the tool is able to disambiguate on each dataset [32].

The results show that DBpedia Spotlight achieves state-of-the-art results on these datasets. They also show that NED is a bit harder than Wikification, and that Wikification obtains better overall performance than NED, although with lower recall.

We investigated further in order to increase the overall recall of the system by tweaking an additional parameter called spotter_threshold, which regulates the conditions a string has to meet to be considered as a mention. Table 7 shows the results on the test dataset for setting this parameter to its default value and to 1.0. The coverage column describes the percentage of mentions that are detected by the system. The results show that setting the threshold to 1.0 does increase coverage, but at a loss of precision on both datasets and of recall on TAC 2011 (recall rises slightly on AIDA). In other words, the system is able to detect more mentions, but those new mentions are, in general, incorrectly disambiguated.

Spanish dataset: TAC 2012

DBpedia Spotlight provides pre-trained models for Spanish, extracted from the Spanish Wikipedia. Therefore, running Spotlight for Spanish is just a matter of choosing the Spanish model instead of the English one. We used the TAC 2012 Spanish dataset for testing Spotlight in Spanish. Starting from 2012, the TAC/KBP conference includes a task on Cross-lingual Entity Linking for Spanish and Chinese. In this setting, systems are provided with a document in one language (Spanish or Chinese), and they have to link the mentions to entities belonging to an English Knowledge Base.

To evaluate the system we first ran DBpedia Spotlight with the Spanish model over the TAC 2012 Spanish dataset, which outputs entities from the Spanish DBpedia. We then mapped those entities to their corresponding English counterparts using the interlanguage links from Wikipedia [33].
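The mapping step can be pictured as a simple dictionary lookup. The sketch below is illustrative only: the function name, the in-memory table and the decision to drop entities without an English counterpart are assumptions, and the two example links are hypothetical entries rather than data from the experiments.

```python
def map_to_english(spanish_links, interlanguage_links):
    """Translate Spanish DBpedia URIs into English ones.

    `spanish_links` maps a mention identifier to a Spanish DBpedia URI;
    `interlanguage_links` maps Spanish URIs to English URIs. Mentions whose
    entity has no English counterpart are dropped (an assumption here).
    """
    mapped = {}
    for mention, es_uri in spanish_links.items():
        en_uri = interlanguage_links.get(es_uri)
        if en_uri is not None:
            mapped[mention] = en_uri
    return mapped


# Hypothetical interlanguage table with two entries.
INTERLANGUAGE = {
    "http://es.DBpedia.org/resource/París": "http://DBpedia.org/resource/Paris",
    "http://es.DBpedia.org/resource/Pablo_Picasso": "http://DBpedia.org/resource/Pablo_Picasso",
}
spanish = {
    "m1": "http://es.DBpedia.org/resource/París",
    "m2": "http://es.DBpedia.org/resource/Entidad_sin_enlace",
}
english = map_to_english(spanish, INTERLANGUAGE)
print(english)
```

Dropping unmapped entities lowers coverage rather than precision, which fits the precision-oriented setting described earlier.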

[31] Mentions: Any and Context: Window
[32] The experiments were performed on a server with one 2.1 GHz processor and 64 GB RAM.
[33] http://www.mediawiki.org/wiki/Interlanguage_links


Model  threshold  P      R      Cov
ES     default    78.15  55.80  71.40
ES     1.0        75.43  57.20  75.84
EN     default    78.92  58.40  74.00
EN     1.0        75.24  59.26  78.76

Table 8: Spotlight results on Spanish TAC2012 Dataset.

We tried the same set of parameters as used for the English experiments on the test dataset. In addition, we tried two further settings:

• Whether to use the Spanish model or the English model.

• Changing the spotter_threshold parameter to increase coverage.

Table 8 shows the results for this dataset. Surprisingly, the English model performs better than the Spanish model on this dataset. This result might be partially explained by the fact that the Spanish Wikipedia is smaller, so fewer context words might be available for each entity. The English Wikipedia has more words and some of those words are entity names, which often have the same spelling across languages. In other words, it could be that, overall, the English context is still richer than the Spanish one.

All in all, the table shows that Spotlight lacks recall, that is, it is not able to correctly link all the mentions appearing in the document. However, when the system links a mention, it does so correctly over 75% of the time. Increasing the spotter_threshold increases coverage and also yields slightly better recall figures.

2.4.2 Wikipedia Miner

We used version 1.2.0 of Wikipedia Miner, trained with a dump of the 2011 English Wikipedia. Again, we used the development dataset to tune the parameters of the system. Wikipedia Miner has two main parameters:

• minProbability: the system calculates a probability for each topic of whether a Wikipedian would consider it interesting enough to link to. This parameter specifies the minimum probability a topic must have before it will be linked.

• disambiguationPolicy: whether each term should be disambiguated to a single interpretation or to multiple ones.

We tried different values of the parameters on the development dataset. Being a Wikification tool, Wikipedia Miner always tries to disambiguate as many mentions in the text as possible. Therefore, unlike DBpedia Spotlight [34], we were not able to test the performance of Wikipedia Miner on a NED setting. In any case, the performance of Wikipedia Miner is much worse than that of DBpedia Spotlight: the best parameters yielded a precision of 57.07 and a recall of 63.43, a drop of 10 points in recall and almost 25 points in precision compared to the results of DBpedia Spotlight on the same dataset.

Table 9 confirms the previous results. The table shows the results of the best parameter combination for Wikipedia Miner when applied to the test dataset. Once again, Wikipedia Miner is outperformed by a large margin by DBpedia Spotlight on this dataset. We therefore discarded using Wikipedia Miner for the English background link micro-service.

[34] As described in the previous sections, DBpedia Spotlight is also a Wikification tool, but it is easily customizable, as it allows the spotting to be provided by third-party tools.


          P      R
TAC 2011  55.29  55.60
AIDA      74.80  58.49

Table 9: Wikipedia Miner results on the test dataset.

Regarding Spanish, we tried to run Wikipedia Miner on the TAC Spanish dataset with no success. Wikipedia Miner does not provide the Spanish models anymore (there are Spanish models available for previous versions of Wikipedia Miner, but not for the 1.2.0 version we were testing), so the only way to run Wikipedia Miner is by building the Spanish models again. Unfortunately, model building with Wikipedia Miner turned out to be a very complex task full of subtle details and corner cases. In summary, we were not able to build the models and, thus, to properly test Wikipedia Miner on the Spanish dataset. However, Wikipedia Miner performs significantly worse than DBpedia Spotlight on the English dataset and we have no reason to think that it would be different on the Spanish dataset.

2.5 Choosing a NED tool for the background link micro-service

Nowadays there are many tools which link pieces of text with appropriate entities belonging to a particular knowledge base such as Wikipedia or DBpedia. Although those systems are not focused on the CH domain, the fact that they link text to broad resources like Wikipedia or DBpedia makes them appropriate for the background linking micro-service.

We have compared the results of two of the most widely used systems for NED, namely DBpedia Spotlight and Wikipedia Miner, for both English and Spanish. The experiments clearly suggest using DBpedia Spotlight for the background link micro-service in both languages: DBpedia Spotlight performed better than Wikipedia Miner in our experiments, is easier to install and deploy on virtual machines, and is fast enough to allow an interactive micro-service to be built on top of it.

DBpedia Spotlight has a set of parameters that tune its behavior. We have analyzed many variants of such parameters and measured the performance of the system on a development dataset. Overall, DBpedia Spotlight is very robust and has shown only very slight variations in performance when changing those parameters.

As in many areas, there is a fundamental trade-off in NED between precision and recall, i.e., some systems try to disambiguate as many strings as possible whereas others only do so when they are confident enough. For the purpose of this project, we have decided to maximize precision at the expense of not disambiguating all potential strings of the text.


3 Vocabulary matching micro-service

Cultural Heritage institutions categorize their content by linking CH items with one or more relevant concepts from a controlled vocabulary. With the advent of information technology and the desire to make CH resources available to the general public, there is an increasing need to facilitate interoperability across these different contexts. There are many situations in which users may benefit from having the items organized into subject categories for browsing and exploration, for example when users do not have clearly defined information needs (White et al., 2006), when attempting complex search tasks (Singer et al., 2012) or when they want to gain an overview of a collection (Hornbæk and Hertzum, 2011). In such cases the provision of only a simple search box is insufficient (Marchionini, 2006; Pirolli, 2009). This is particularly relevant to digital libraries, where rich user/information interaction is common and requires alternative methods to support users (Rao et al., 1995). The provision of browsing functionalities through thesaurus-based search enhancements has been shown to improve the search experience.

The aim of the vocabulary matching micro-service is to automatically assign relevant concepts and terms from selected vocabularies to CH items as provided by cultural institutions. The service receives metadata information as input, and returns the items enriched with links to one or more relevant concepts from a list of vocabularies.

The vocabulary matching micro-service requires a set of vocabularies to match the CH items against. Many schemes have been proposed to describe and manage cultural heritage data. Those schemes are usually found in the form of classification schemes, subject heading lists, etc., and they are often focused on a particular field, institution or even collection. Within the LoCloud project, Task 3.4 provides the vocabulary service [35], a collaborative platform to explore the potential of crowdsourcing as a way of developing multilingual, semantic thesauri for local heritage content. In summary, the vocabularies used in the vocabulary matching micro-service are those developed by the community using the vocabulary service of Task 3.4.

The vocabulary matching micro-service consists of two modules. One module, called the vocabulary retriever, retrieves the vocabularies from the vocabulary service and creates an internal database with the concepts and their lexicalizations in several languages. This module is executed on a regular basis so that the local database is synchronized and up to date with the vocabularies from the vocabulary service, whose concepts and terms are updated in a continuous fashion. Once the vocabularies are retrieved and stored in the internal database, a second module, called the matching module, annotates CH items with appropriate concepts and terms as found in the vocabularies.

3.1 Retrieving the vocabularies

The vocabulary matching micro-service depends on a set of vocabularies, each one comprising a set of concepts and their lexicalizations in several languages. The vocabulary retriever module gathers all the content from the vocabularies found in the vocabulary server. The first step is to retrieve the list of vocabularies, as well as the URL of each vocabulary. Once the vocabulary list is retrieved from the vocabulary server, each vocabulary is downloaded as a SKOS document (Miles et al., 2005). SKOS is a well-known specification, built upon RDF and RDFS, for expressing the basic structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary). It is published and maintained by the W3C Semantic Web Best Practices and Deployment Working Group. Figure 4 shows an excerpt of the "Genres" vocabulary following SKOS. The excerpt describes one concept of the vocabulary, whose preferred name in English is "Ballads" (<skos:prefLabel> element). The figure also shows the lexicalizations of the concept in German ("Balladen") and French ("ballades").

The SKOS vocabulary is parsed and the required information is extracted, namely, the vocabulary name, and a mapping between strings (the lexicalizations, which vary among languages) and the concepts they may refer to.

[35] Not to be confused with the vocabulary matching service, which is described here.


    <skos:Concept rdf:about="http://test113.ait.co.at/tematres/vocab/?tema=501">
      <skos:prefLabel xml:lang="en">Ballads</skos:prefLabel>
      <skos:inScheme rdf:resource="http://test113.ait.co.at/tematres/vocab/"/>
      <skos:broader rdf:resource="http://test113.ait.co.at/tematres/vocab/?tema=1"/>
      <skos:broader rdf:resource="http://test113.ait.co.at/tematres/vocab/?tema=3"/>
      <skos:broader rdf:resource="http://test113.ait.co.at/tematres/vocab/?tema=2"/>
      <skos:broader rdf:resource="http://test113.ait.co.at/tematres/vocab/?tema=4"/>
      <skos:exactMatch>
        <skos:Concept rdf:about="http://test113.ait.co.at/tematres/dmGenresDE/?tema=15">
          <skos:prefLabel xml:lang="de">Balladen</skos:prefLabel>
        </skos:Concept>
      </skos:exactMatch>
      <skos:exactMatch>
        <skos:Concept rdf:about="http://test113.ait.co.at/tematres/dmGenresFR/?tema=15">
          <skos:prefLabel xml:lang="fr">ballades</skos:prefLabel>
        </skos:Concept>
      </skos:exactMatch>
      <dct:created>2013-12-03 11:42:15</dct:created>
    </skos:Concept>

Figure 4: Excerpt of SKOS document for one terminology concept.

Note that relations among concepts, such as those described by the SKOS <skos:broader> relations, are not used. In other words, the vocabulary matching micro-service only requires a list of terms linked to the relevant concepts (a gazetteer). The parser would extract the following mappings from the excerpt shown in Figure 4. Note that mentions are lowercased, and white spaces are replaced by underscores.

String    Language  Concept
ballads   en        http://test113.ait.co.at/tematres/vocab/?tema=501
balladen  de        http://test113.ait.co.at/tematres/dmGenresDE/?tema=15
ballades  fr        http://test113.ait.co.at/tematres/dmGenresFR/?tema=15
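The extraction of these mappings can be sketched with Python's standard `xml.etree.ElementTree` module. This is not the project's parser: the function names are ours, only `skos:prefLabel` elements are handled, and the sample wraps a shortened version of the Figure 4 excerpt in an `rdf:RDF` root with the standard SKOS and RDF namespace declarations, which the excerpt itself omits.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def normalize(label):
    """Lowercase a label and replace runs of white space by underscores."""
    return "_".join(label.lower().split())


def extract_mappings(skos_xml):
    """Return (string, language, concept URI) rows for every prefLabel."""
    root = ET.fromstring(skos_xml)
    rows = []
    # iter() also visits concepts nested inside skos:exactMatch elements.
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        # findall() without a path prefix only looks at direct children.
        for label in concept.findall(f"{{{SKOS}}}prefLabel"):
            rows.append((normalize(label.text), label.get(XML_LANG), uri))
    return rows


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://test113.ait.co.at/tematres/vocab/?tema=501">
    <skos:prefLabel xml:lang="en">Ballads</skos:prefLabel>
    <skos:exactMatch>
      <skos:Concept rdf:about="http://test113.ait.co.at/tematres/dmGenresDE/?tema=15">
        <skos:prefLabel xml:lang="de">Balladen</skos:prefLabel>
      </skos:Concept>
    </skos:exactMatch>
  </skos:Concept>
</rdf:RDF>"""

rows = extract_mappings(SAMPLE)
for row in rows:
    print(row)
```

Because relations such as `skos:broader` are ignored, a flat list of rows like this is all the matching module needs.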

Table 10 shows the list of vocabularies gathered by the retriever [36], including their sizes (number of concepts and terms) and supported languages. In total there are 29 vocabularies in three languages (English, German and French), which comprise circa 40,000 concepts and 56,000 different terms.

Synchronization issues

The vocabulary server implemented in Task 3.4 is meant to be a collaborative platform where users can continually create new terms and concepts or refine existing ones. The list of vocabularies, concepts and terms is thus dynamic in nature, and the vocabulary matching micro-service has to be synchronized with the vocabulary server to guarantee that the current version of the vocabularies is used.

The vocabulary retriever module is executed on a regular basis [37] and reconstructs an internal vocabulary database with the terms and concepts retrieved from the vocabulary server.

3.2 Matching CH items

The second module of the vocabulary matching micro-service is the so-called matching module. This module receives CH items and matches the textual information with the appropriate terms from the vocabularies. The matching module is the service exported to the users as a REST service, which can be called to enrich the CH items.

[36] Note that this list may change as more and more vocabularies are integrated into the vocabulary server.
[37] At the time being the retriever is executed once per day, but the synchronization period can be changed easily.

The service can be used to enrich any CH field, but the following fields are particularly relevant:

• The description field, a textual description of the CH item.

• The subject fields, so that the keywords described there can be linked to vocabulary terms.

• The title field.

The matching module analyzes the textual information, scans the tokens from left to right and considers the longest possible span which has a vocabulary entry as a candidate mention. Mentions are linked with the appropriate vocabulary terms for a particular language.
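A left-to-right, longest-match scan of this kind can be sketched as follows. The tokenizer, the `max_span` limit, the function name and the two-entry gazetteer with example.org URIs are all assumptions made for illustration; the keys follow the lowercased, underscore-joined convention described for the retriever.

```python
import re


def match_terms(text, gazetteer, max_span=6):
    """Greedy left-to-right longest-match lookup against a gazetteer.

    `gazetteer` maps normalized strings (lowercase, words joined by
    underscores) to concept URIs for one language. `max_span` bounds the
    number of tokens in a candidate mention.
    """
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest span first and shrink until an entry is found.
        for n in range(min(max_span, len(tokens) - i), 0, -1):
            key = "_".join(tokens[i:i + n])
            if key in gazetteer:
                matches.append((key, gazetteer[key]))
                i += n  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical gazetteer entries.
GAZETTEER = {
    "tower": "http://example.org/vocab/tower",
    "tower_house": "http://example.org/vocab/tower_house",
}
found = match_terms("Ruined tower house, County Clare", GAZETTEER)
print(found)
```

Preferring the longest span means that "tower house" is linked as one concept instead of being reduced to the more general "tower".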


Concepts  Labels  Lang         Vocabulary
210       1285    en           Alexandria Digital Library Feature Type Thesaurus
213       263     en           Archeological Objects Thesaurus Scotland
132       160     en           Archeological Sciences Thesaurus
569       1009    en           Building Materials Thesaurus
1435      1650    en           Components Thesaurus
115       158     en           Event Type Thesaurus
80        120     en           Evidence Thesaurus
2279      2865    en           FISH Archeological Objects Thesaurus
2011      2011    en           General Multilingual Environmental Thesaurus GEMET
1412      1987    en           General Subject headings for Film Archives
1760      1760    en           Irish Monuments
27        27      en           Irish Periods
1649      2100    en           MDA Archaeological Objects Thesaurus
56        61      en           Maritime Craft Thesaurus Scotland
316       415     en           Maritime Craft Type Thesaurus
1859      2571    en           Monument Thesaurus Wales
1335      1705    en           Monument Type Thesaurus Scotland
6497      9580    en           Monument Type Thesaurus
15        16      en           Period Thesaurus Wales
30        30      en           Period Thesaurus
46        78      en           Relator Terms for Use in Rare Book and Special Collections Cataloguing
1593      1593    en           Tesauro de Ciencias de la Documentación
902       1038    en           Thesaurus PICO 4.1
2437      4508    en           Thesaurus for Graphic Materials 1: Subject Terms
655       1198    en           Thesaurus for Graphic Materials 2: Genre and Physical Characteristic Terms
1223      1715    en           UK Archival Thesaurus (UKAT)
8605      12794   en,es        UNESCO thesaurus
1772      2688    fr,de,en     dm:Genres
39,233    55,385  en,fr,de,es  Total

Table 10: Vocabularies, including the number of concepts and terms.


4 Implementation and examples of use

In this section we briefly explain the implementation of both the background link and vocabulary matching micro-services. We also provide the actual Internet addresses of our entry points, as well as some examples of use.

4.1 Background link micro-service

The background link micro-service is built upon the DBpedia Spotlight package, version 0.6, which is publicly available [38]. We used two instances of DBpedia Spotlight, one with the models trained for English and one with the models trained for Spanish, for the English and Spanish background link micro-services, respectively [39]. Each DBpedia Spotlight instance is installed on a different virtual machine. The addresses of the virtual machines are the following:

URL                                   Service
http://test183.ait.co.at/rest/bglink  English language
http://lc013.ait.co.at/rest/bglink    Spanish language

Regarding parameters, we used the set of parameters that performed best on the datasets, as described in section 2.4.1. These are the actual parameters:

• Spotter: Use Spotlight Spotter.

• CorefResolution: False.

• Confidence: 0.

• Support: 0.

• threshold: 1.0.

We implemented a PHP wrapper that acts as the endpoint of the REST service. The wrapper listens on a particular port on the VM, reads the input as described in the API documentation, runs DBpedia Spotlight, and transforms the output to the required format, which is sent back to the caller.

Example of use

The following command sends the string "Guernica was painted by Picasso while he was in Paris." to the English background link micro-service [40].

    curl -H "Accept: application/json" \
         -d "text=Guernica was painted by Picasso while he was in Paris." \
         http://test183.ait.co.at/rest/bglink

The response from the service is the following [41]:

[38] https://github.com/DBpedia-spotlight/DBpedia-spotlight/wiki/Statistical-implementation
[39] DBpedia Spotlight models are available here: http://spotlight.sztaki.hu/downloads/
[40] This example uses the curl application, which mimics a Web browser and can be used on a terminal. The same query can be made directly using any web browser, just by typing the address http://test183.ait.co.at/rest/bglink?text=Guernica%20was%20painted%20by%20Picasso%20while%20he%20was%20in%20Paris.
[41] The JSON format of the response is described in section 6.1


    {"Status":200,
     "Status_message":"Success",
     "data":{
       "Resources":[
         {"URI":"http:\/\/DBpedia.org\/resource\/Guernica_(painting)",
          "similarityScore":"0.5724195306431602",
          "surfaceForm":"Guernica",
          "offset":"0"},
         {"URI":"http:\/\/DBpedia.org\/resource\/Pablo_Picasso",
          "similarityScore":"0.9996545655239845",
          "surfaceForm":"Picasso",
          "offset":"24"},
         {"URI":"http:\/\/DBpedia.org\/resource\/Paris",
          "similarityScore":"0.9998111742947978",
          "surfaceForm":"Paris",
          "offset":"48"}]}}

The background link micro-service has identified three entities and has linked them to the appropriate DBpedia pages. The Spanish background link micro-service is very similar, the only difference being the actual address of the entry point. The following command sends the string "Picasso pintó el Guernica mientras vivía en París.":

    curl -H "Accept: application/json" \
         -d "text=Picasso pintó el Guernica mientras vivía en París." \
         http://lc013.ait.co.at/rest/bglink

which produces the following output:

    {"Status":200,
     "Status_message":"Success",
     "data":{
       "Resources":[
         {"URI":"http:\/\/es.DBpedia.org\/resource\/Pablo_Picasso",
          "similarityScore":"0.9993425694476822",
          "surfaceForm":"Picasso",
          "offset":"0"},
         {"URI":"http:\/\/es.DBpedia.org\/resource\/Guernica_y_Luno",
          "similarityScore":"0.7089876977711708",
          "surfaceForm":"Guernica",
          "offset":"17"},
         {"URI":"http:\/\/es.DBpedia.org\/resource\/Par\u00eds",
          "similarityScore":"0.9999418831220671",
          "surfaceForm":"Par\u00eds",
          "offset":"44"}
       ]}}
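A client would typically decode such a response before using it. The following sketch is one possible way to do so with Python's standard `json` module; the function name and the returned dictionary keys are our own choices, and treating a missing `Status` field as success is an assumption. The sample string is cut down from the English response shown earlier.

```python
import json


def parse_bglink_response(body):
    """Decode a background link response into a list of plain dicts.

    Numeric fields are converted with float()/int(), so they may arrive
    either as JSON strings (as in the responses above) or as JSON numbers.
    """
    doc = json.loads(body)
    status = doc.get("Status", 200)  # assumption: absent means success
    if status != 200:
        raise ValueError(f"service returned status {status}")
    links = []
    for res in doc["data"]["Resources"]:
        links.append({
            "uri": res["URI"],
            "score": float(res["similarityScore"]),
            "surface_form": res["surfaceForm"],
            "offset": int(res["offset"]),
        })
    return links


# Raw string so that the escaped slashes reach the JSON decoder untouched.
SAMPLE = r'''{"Status":200,"Status_message":"Success","data":{"Resources":[
  {"URI":"http:\/\/DBpedia.org\/resource\/Pablo_Picasso",
   "similarityScore":"0.9996545655239845",
   "surfaceForm":"Picasso",
   "offset":"24"}]}}'''

links = parse_bglink_response(SAMPLE)
print(links)
```

The JSON decoder turns the escaped `\/` sequences back into plain slashes, so the resulting URIs can be used directly.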

5 Vocabulary matching micro-service

Section 3 explains the different modules implemented for the vocabulary matching micro-service, namely, the vocabulary retriever and the matching module. Both modules have been designed and implemented for LoCloud. In addition, we built another PHP wrapper which actually implements the REST service, much like in the background link micro-service. The address of the vocabulary matching micro-service is the following:


URL                                   Service
http://test183.ait.co.at/rest/vmatch  Vocabulary match

Example of use

The following command sends the text "Stembridge Windmill, High Ham, Somerset" to the vocabulary matching micro-service:

    curl -H "Accept: application/json" \
         -d "text=Stembridge Windmill, High Ham, Somerset" \
         -d "lang=en" \
         http://test183.ait.co.at/rest/vmatch

The response is the following:

    {"Status":200,
     "Status_message":"Success",
     "data":{
       "Resources":[
         {"URI":"http://test113.ait.co.at/tematres/Irish_Monuments/?tema=285",
          "vocab":"Irish Monuments"}
       ]}}
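The same call can be prepared from Python with the standard `urllib` modules. The sketch below only builds the request object and does not send it, since the availability of the service cannot be assumed; the function name is hypothetical, and URL-encoding the form fields is a choice made here rather than something the service documentation prescribes.

```python
from urllib.parse import urlencode
from urllib.request import Request

VMATCH_URL = "http://test183.ait.co.at/rest/vmatch"


def build_vmatch_request(text, lang="en"):
    """Build (but do not send) a POST request for the vmatch endpoint."""
    # urlencode() percent-encodes the fields; passing `data` makes it a POST.
    data = urlencode({"text": text, "lang": lang}).encode("utf-8")
    return Request(VMATCH_URL, data=data, headers={"Accept": "application/json"})


req = build_vmatch_request("Stembridge Windmill, High Ham, Somerset")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual call and return the JSON document shown above.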


6 API reference

This section describes the complete API for both micro-services. We first describe the background link service API, and then the vocabulary matching API. Finally, we show the error codes for the REST services, which are the same for both services.

6.1 Background Link micro-service

The background link micro-service receives raw UTF-8 text as input and produces the background link annotations.The output of the service is a JSON document with the annotated elements. This is its particular API:

� endpoint: The background link micro-service has two endpoints, depending on the language in which theCH items are described:

◦ http://test183.ait.co.at/rest/bglink for the English background link service.

◦ http://lc013.ait.co.at/rest/bglink for the Spanish background link service.

� parameters: just one parameter, �text�, the text to annotate. The text has to be UTF-8 encoded.

� example call:

curl -H "Accept: application/json" \

-d "text=Guernica" \

http://test183.ait.co.at/rest/bglink

� example output:

{ "data" : {

"Resources":

[

{"URI":"http://live.DBpedia.org/page/Guernica",

"similarityScore":0.5,

"surfaceForm":"Guernica",

"offset":0},

{"URI":"http://live.DBpedia.org/page/Guernica_(painting)",

"similarityScore":0.5,

"surfaceForm":"Guernica",

"offset":0}

]}

}

� Supported output types (POST/GET): application/json

The service produces a JSON document with an object of name Resources. The object contains a list of objects(one object per association), and each object of the list has the following elements:

• URI: the URI of a DBpedia page.

• similarityScore: a confidence value of the association.

• surfaceForm: the original text snippet from which the association was derived.

• offset: the starting position of the surface form in the text. Useful if the original text contains many occurrences of the same mention string.
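The call and the output fields described above can be sketched in Python as follows. This is only an illustration under our own assumptions: the helper names `build_bglink_request` and `group_candidates` are hypothetical, the request is built but not sent (the test hosts may no longer be reachable), and ordering the candidates by `similarityScore` is a client-side choice, not something the service prescribes.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

BGLINK_EN = "http://test183.ait.co.at/rest/bglink"  # English endpoint from the text

# Example output copied from the text above.
SAMPLE_OUTPUT = """
{ "data" : { "Resources": [
  {"URI":"http://live.DBpedia.org/page/Guernica",
   "similarityScore":0.5, "surfaceForm":"Guernica", "offset":0},
  {"URI":"http://live.DBpedia.org/page/Guernica_(painting)",
   "similarityScore":0.5, "surfaceForm":"Guernica", "offset":0}
]}}
"""


def build_bglink_request(text: str, endpoint: str = BGLINK_EN) -> urllib.request.Request:
    """Build (but do not send) a POST request similar to the curl call above."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


def group_candidates(payload: str) -> dict:
    """Group candidate URIs by mention, i.e. by (offset, surfaceForm).

    Each group is a list of (similarityScore, URI) pairs, highest score first.
    Ties keep the order in which the service returned them.
    """
    groups = defaultdict(list)
    for res in json.loads(payload)["data"]["Resources"]:
        key = (res["offset"], res["surfaceForm"])
        groups[key].append((res["similarityScore"], res["URI"]))
    return {key: sorted(vals, key=lambda pair: -pair[0]) for key, vals in groups.items()}


request = build_bglink_request("Guernica")
candidates = group_candidates(SAMPLE_OUTPUT)
# To actually call the service one would pass `request` to urllib.request.urlopen.
```

Grouping by `offset` and `surfaceForm` shows why the offset is useful: both candidates in the example belong to the same mention at position 0, so a client can treat them as alternatives for one mention.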


Request

Method   URL                                    Comment
POST     http://test183.ait.co.at/rest/bglink   English language
POST     http://lc013.ait.co.at/rest/bglink     Spanish language

Parameter   Datatype   Description
text        String     UTF-8 text to annotate.

Response

Status   Response
200      JSON record with the matched URIs. For example:

         {"URI":"http://live.DBpedia.org/page/Guernica",
          "similarityScore":0.5,
          "surfaceForm":"Guernica",
          "offset":0},
         {"URI":"http://live.DBpedia.org/page/Guernica_(painting)",
          "similarityScore":0.5,
          "surfaceForm":"Guernica",
          "offset":0}

Parameter         Description
URI               URI of a DBpedia page.
similarityScore   Confidence value of the association.
surfaceForm       Original text snippet from which the association was derived.
offset            Starting position of the surface form in the text.

6.2 Vocabulary Matching micro-service

The vocabulary matching micro-service receives raw UTF-8 text as input and produces the annotations to vocabulary concepts. The output of the service is a JSON document with the annotated elements. Its API specification is as follows:

• endpoint: The endpoint for this service is http://test183.ait.co.at/rest/vmatch

• parameters:

◦ text: the text to annotate. The text has to be UTF-8 encoded.

◦ lang: a language tag following the xml:lang format. If omitted, the default "en" value will be used.

• example call:

curl -H "Accept: application/json" \
     -d "text=Stembridge Windmill, High Ham, Somerset" \
     -d "lang=en" \
     http://test183.ait.co.at/rest/vmatch

• example output:


{ "data" : {
    "Resources":
    [
      {"URI":"http://test113.ait.co.at/tematres/Irish_Monuments/?tema=285",
       "vocab":"Irish Monuments"}
    ]}
}

• Supported output types (POST/GET): application/json

The service produces a JSON document with an object named Resources. The object contains a list of objects (one object per association), and each object in the list has the following elements:

• URI: the URI of a vocabulary concept.

• vocab: the name of the vocabulary used for matching.
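The two parameters and the output fields can be illustrated with a short Python sketch. The helper names `vmatch_form_body` and `uris_by_vocabulary` are hypothetical. Note that `urlencode` percent-encodes the text, whereas `curl -d` sends it as written; we assume the service accepts standard form encoding.

```python
import json
import urllib.parse

# Example output copied from the text above.
SAMPLE_OUTPUT = """
{ "data" : { "Resources": [
  {"URI":"http://test113.ait.co.at/tematres/Irish_Monuments/?tema=285",
   "vocab":"Irish Monuments"}
]}}
"""


def vmatch_form_body(text: str, lang: str = "en") -> bytes:
    """Encode the "text" and "lang" parameters as a form body.

    The default "en" mirrors the documented default of the service.
    """
    return urllib.parse.urlencode({"text": text, "lang": lang}).encode("utf-8")


def uris_by_vocabulary(payload: str) -> dict:
    """Map each vocabulary name to the list of concept URIs matched in it."""
    result = {}
    for res in json.loads(payload)["data"]["Resources"]:
        result.setdefault(res["vocab"], []).append(res["URI"])
    return result


body = vmatch_form_body("Stembridge Windmill, High Ham, Somerset")
matches = uris_by_vocabulary(SAMPLE_OUTPUT)
```

Indexing the matches by vocabulary name is a convenience for clients that only care about some vocabularies; the service itself returns a flat list.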

Request

Method   URL
POST     http://test183.ait.co.at/rest/vmatch

Parameter   Datatype   Description
text        String     UTF-8 text to annotate.
lang        String     Two-letter language code.

Response

Status   Response
200      JSON record with the matched URIs. For example:

         {"URI":"http://test113.ait.co.at/tematres/Irish_Monuments/?tema=285",
          "vocab":"Irish Monuments"}

Parameter   Description
URI         URI of the vocabulary concept.
vocab       Vocabulary name.


6.3 HTTP Status Codes

All status codes are standard HTTP status codes. The ones listed below are used by both APIs.

Status Code   Description
200           OK
201           Created
202           Accepted (request accepted and queued for execution)
400           Bad request
401           Authentication failure
403           Forbidden
404           Resource not found
405           Method not allowed
409           Conflict
412           Precondition failed
413           Request entity too large
500           Internal server error
501           Not implemented
503           Service unavailable
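A client may want to react differently to these codes. The following sketch groups them into coarse categories. The function name `classify_status` and the category names are hypothetical, and treating 503 as the only retryable code is our assumption, not a documented policy of the services.

```python
# Status codes listed in the table above.
DOCUMENTED_CODES = {200, 201, 202, 400, 401, 403, 404, 405, 409, 412, 413, 500, 501, 503}


def classify_status(code: int) -> str:
    """Return a coarse handling category for an HTTP status code."""
    if code not in DOCUMENTED_CODES:
        return "undocumented"
    if 200 <= code < 300:
        return "success"
    if code == 503:
        # Assumption: a temporarily unavailable service is worth retrying later.
        return "retry"
    if 400 <= code < 500:
        return "client_error"
    return "server_error"


categories = {code: classify_status(code) for code in sorted(DOCUMENTED_CODES)}
```

With this grouping, 202 counts as a success even though the request has only been queued, so a caller that needs the result must still wait for it.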


7 Conclusion

This deliverable describes the metadata enrichment services as implemented and deployed in the LoCloud project. The enrichment services comprise the background link micro-service (for English and Spanish), which links CH items to DBpedia elements, and the vocabulary matching micro-service, which maps CH items to relevant vocabulary concepts.

The background link micro-service relies on DBpedia Spotlight, a state-of-the-art tool for performing Named Entity Disambiguation (NED). DBpedia Spotlight was chosen because it performed best in our experiments, as explained in the deliverable.

The vocabulary matching micro-service was developed from scratch for the project. It synchronizes with the vocabulary service[42], a collaborative platform to explore the potential of crowdsourcing as a way of developing multilingual, semantic thesauri for local heritage content.

All micro-services are implemented as REST services and are deployed on virtual machines (VMs). Documentation on the services, including the full API specification, is available at this address:

http://support.locloud.eu/LoCloud%20Enrichment%20Microservice

The metadata enrichment micro-services are used automatically in the various enrichment workflows through the LoCloud Generic Enrichment Service, which orchestrates various REST micro-services into complex enrichment workflows.

[42] Task 3.4 of the LoCloud project.

