Proceedings of the 2nd Workshop on Collaboratively Constructed Semantic Resources, Coling 2010, pages 19–27, Beijing, August 2010
Extending English ACE 2005 Corpus Annotation with Ground-truth Links to Wikipedia
This paper describes an on-going annotation effort which aims at adding a manual annotation layer connecting an existing annotated corpus such as the English ACE-2005 Corpus to Wikipedia. The annotation layer is intended for the evaluation of accuracy of linking to Wikipedia in the framework of a coreference resolution system.
1 Introduction
Collaboratively Constructed Resources (CCR) such as Wikipedia are starting to be used for a number of semantic processing tasks that up to a few years ago could only rely on a few manually constructed resources such as WordNet and SemCor (Fellbaum, 1998). The impact of the new resources can be multiplied by connecting them to other existing datasets, e.g. reference corpora. In this paper we illustrate an on-going annotation effort which aims at adding a manual annotation layer connecting an existing annotated corpus such as the English ACE-2005 dataset¹ to a CCR such as Wikipedia. This effort will produce a new integrated resource which can be useful for the coreference resolution task.
Coreference resolution is the task of identifying which mentions, i.e. individual textual descriptions usually realized as noun phrases or pronouns, refer to the same entity. To solve this task, especially in the case of non-pronominal coreference, researchers have recently started to exploit semantic knowledge, e.g. trying to calculate the semantic similarity of mentions (Ponzetto and Strube, 2006) or their semantic classes (Ng, 2007; Soon et al., 2001). Up to now, WordNet has been one of the most frequently used sources of semantic knowledge for the coreference resolution task (Soon et al., 2001; Ng and Cardie, 2002). Researchers have shown, however, that WordNet has some limitations. On the one hand, although WordNet has broad coverage of English common nouns, it still has limited coverage of proper nouns (e.g. Barack Obama is not available in the on-line version) and entity descriptions (e.g. president of India). On the other hand, the WordNet sense inventory is considered too fine-grained (Ponzetto and Strube, 2006; Mihalcea and Moldovan, 2001). As an alternative, it has recently been shown that Wikipedia can be a promising source of semantic knowledge for coreference resolution between nominals (Ponzetto and Strube, 2006).
Consider some possible uses of Wikipedia. For example, knowing that the entity mention Obama is described on the Wikipedia page Barack_Obama², one can benefit from the Wikipedia category structure. Categories assigned to the Barack_Obama page can be used as semantic classes, e.g. 21st-century presidents of the United States. Another useful Wikipedia feature is the set of links between Wikipedia pages. For instance, some Wikipedia pages contain links to the Barack_Obama page. Anchor texts of these links can provide alternative names of this entity, e.g. Barack Hussein Obama or Barack Obama Junior.
²The links to Wikipedia pages are given displaying only the last part of the link, which corresponds to the title of the page. The complete link can be obtained by appending this part to http://en.wikipedia.org/wiki/.
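The convention of footnote 2 (complete link = fixed prefix + page title) can be illustrated with a minimal Python sketch. The function name `wikipedia_url`, the space-to-underscore normalization and the percent-encoding step are assumptions made for illustration; the paper only states that the title is appended to the prefix.

```python
from urllib.parse import quote

# Prefix given in footnote 2 of the paper.
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def wikipedia_url(page_title: str) -> str:
    """Build a complete link from the last part of the link (the page title).

    Hypothetical helper: the normalization below is an assumption,
    not something specified in the paper.
    """
    # Wikipedia page titles use underscores where a reader would type spaces.
    normalized = page_title.strip().replace(" ", "_")
    # Percent-encode characters that are unsafe in a URL path,
    # keeping punctuation that commonly appears in page titles.
    return WIKI_PREFIX + quote(normalized, safe="_()',:")
```

For instance, `wikipedia_url("Barack_Obama")` yields the prefix followed by `Barack_Obama`.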
Naturally, in order to obtain semantic knowledge about an entity mention from Wikipedia, one should link this mention to an appropriate Wikipedia page, i.e. disambiguate it using Wikipedia as a sense inventory. The accuracy of linking entity mentions to Wikipedia is a very important issue. For example, such linking is a step of the approach to coreference resolution described in (Bryl et al., 2010). In order to evaluate this accuracy in the framework of a coreference resolution system, a corpus of documents in which entity mentions are annotated with ground-truth links to Wikipedia is required.
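As a rough illustration of how ground-truth links could be used for such an evaluation, the sketch below computes the fraction of annotated mentions for which a system predicts the same page. This definition of accuracy, the function name `linking_accuracy` and the mention identifiers are assumptions for the sake of the example; the paper does not fix a formula here.

```python
from typing import Mapping


def linking_accuracy(gold: Mapping[str, str], predicted: Mapping[str, str]) -> float:
    """Share of gold-annotated mentions whose predicted page matches the gold page.

    Both mappings go from a (hypothetical) mention ID to a Wikipedia page title.
    A mention with no prediction counts as an error.
    """
    if not gold:
        raise ValueError("gold annotation is empty")
    correct = sum(
        1 for mention_id, title in gold.items()
        if predicted.get(mention_id) == title
    )
    return correct / len(gold)
```

Counting missing predictions as errors is one possible design choice; an evaluation that separates coverage from precision would need two numbers instead of one.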
A possible solution to this problem is to extend the annotation of entity mentions in a coreference resolution corpus. In recent years, coreference resolution systems have been evaluated on various versions of the English Automatic Content Extraction (ACE) corpus (Ponzetto and Strube, 2006; Versley et al., 2008; Ng, 2007; Culotta et al., 2007; Bryl et al., 2010). The latest publicly available version is ACE 2005³.
In this paper we present an extension of ACE 2005 non-pronominal entity mention annotations with ground-truth links to Wikipedia. This extension is intended for the evaluation of accuracy of linking entity mentions to Wikipedia pages. The annotation is currently in progress. At the moment of writing we have completed around 55% of the work. The extension can be exploited by coreference resolution systems which already use the ACE 2005 corpus for development and testing purposes, e.g. (Bryl et al., 2010). Moreover, the English ACE 2005 corpus is multi-purpose and can be used in other information extraction (IE) tasks as well, e.g. relation extraction. Therefore, we believe that our extension might also be useful for other IE tasks which exploit semantic knowledge.
In the following we start by providing a brief overview of the existing corpora annotated with links to Wikipedia. In Section 3 we describe some characteristics of the English ACE 2005 corpus which are relevant to the creation of the extension. Next, we describe the general annotation principles and the procedure adopted to carry out the annotation. In Section 4 we present some analyses of the annotation and statistics about Inter-Annotator Agreement.
2 Related work
Recent approaches to linking terms to Wikipedia pages (Cucerzan, 2007; Csomai and Mihalcea, 2008; Milne and Witten, 2008; Kulkarni et al., 2009) have used two kinds of corpora for evaluation of accuracy: (i) sets of Wikipedia pages and (ii) manually annotated corpora. In Wikipedia pages links are added to terms only where they are relevant to the context⁴. Therefore, Wikipedia pages do not contain the full annotation of all entity mentions. This observation applies equally to the corpus used by (Milne and Witten, 2008), which includes 50 documents from the AQUAINT corpus annotated following the same strategy⁵. The corpus created by (Cucerzan, 2007) contains annotation of named entities only⁶. It contains 756 annotations, therefore for our purposes it is limited in terms of size.
Kulkarni et al. (2009) have annotated 109 documents collected from homepages of various sites with as many links as possible⁷. Their annotation is too extensive for our purposes, since they do not limit annotation to entity mentions. To tackle this issue, one can use an automatic entity mention detector; however, it is likely to introduce noise.
3 Creating the extension
The task consists of manually annotating the non-pronominal mentions contained in the English ACE 2005 corpus with links to appropriate Wikipedia articles. The objective of the work is to create an extension of ACE 2005 in which all the non-pronominal mentions contained in the ACE 2005 corpus are disambiguated using Wikipedia as a sense repository to point to. The extension is intended for the evaluation of accuracy of linking to Wikipedia in the framework of a coreference resolution system.
3.1 The English ACE 2005 Corpus
The English ACE 2005 corpus is composed of 599 articles assembled from a variety of sources selected from broadcast news programs, newspapers, newswire reports, internet sources and transcribed audio. It contains the annotation of a series of entities (person, location, organization) for a total of 15,382 different entities and 43,624 mentions of these entities. A mention is an instance of a textual reference to an object, which can be either named (e.g. Barack Obama), nominal (e.g. the president), or pronominal (e.g. he, his, it). An entity is an aggregate of all the mentions which refer to one conceptual entity. Beyond the annotation of entities and mentions, ACE 05 also contains the annotation of local coreference for the entities; this means that mentions which refer to the same entity in a document have been marked with the same ID.
3.2 Annotating ACE 05 with Wikipedia Pages
For the purpose of our task, not all the ACE 05 mentions are annotated, but only the named (henceforth NAM) and nominal (henceforth NOM) mentions. The resulting additional annotation layer will contain a total of 29,300 mentions linked to Wikipedia pages. As regards the annotation of NAM mentions specifically, information about local coreference contained in ACE 05 has been exploited in order to speed up the annotation process. In fact, only the first occurrence of the NAM mentions in each document has been annotated, and the annotation is then propagated to all the other coreferring NAM mentions in the document.
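The propagation step for NAM mentions can be sketched as follows. This is only an illustration under stated assumptions: the `Mention` record, its field names and the function `propagate_nam_links` are hypothetical and do not reflect the actual ACE file format or the authors' tooling. The sketch assumes the mentions of a single document are given in document order and that the first NAM mention of each entity already carries its manual link.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mention:
    """Hypothetical in-memory record for one ACE mention."""
    mention_id: str
    entity_id: str      # shared by coreferring mentions within a document
    mention_type: str   # "NAM", "NOM" or "PRO" (labels assumed for this sketch)
    head: str           # syntactic head of the mention
    wiki_title: Optional[str] = None  # manual link, if already annotated


def propagate_nam_links(mentions: List[Mention]) -> None:
    """Copy the link of the first annotated NAM mention of each entity
    to the later, still unannotated NAM mentions of the same entity.

    `mentions` must hold the mentions of ONE document in document order.
    NOM and pronominal mentions are left untouched.
    """
    first_link: Dict[str, str] = {}
    for mention in mentions:
        if mention.mention_type != "NAM":
            continue
        if mention.wiki_title is not None:
            # Remember only the first manual annotation seen for this entity.
            first_link.setdefault(mention.entity_id, mention.wiki_title)
        elif mention.entity_id in first_link:
            mention.wiki_title = first_link[mention.entity_id]
```

Because the entity IDs are local to a document, the function is meant to be called once per document rather than over the whole corpus.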
Finally, it must be noted that in ACE 05, given a complex entity description, both the full extent of the mention (e.g. president of the United States) and its syntactic head (e.g. president) are marked. In our Wikipedia extension only the head of the mention is annotated, while the full extent of the mention is available from the original ACE 05 corpus.