Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Aasish Pappu

Language Technologies Institute

Carnegie Mellon University

PACLIC 2009

Outline

1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion

1 Introduction

• A structured, organized encyclopedic corpus is a suitable training corpus.
  – covers a wide range of topics
  – provides hyperlinks

• In this paper, we:
  1) discuss the usability of Wikipedia
  2) incorporate the WordNet and Wikipedia domain taxonomies into the feature space
  3) use Maximum Entropy and SVM classifiers



2 Related Work

• Kazama and Torisawa (2007)
  – extracted gloss text
• Dakka and Cucerzan (2008)
  – tagged the Wikipedia data
• Bunescu and Pasca (2006)
  – built a disambiguation system



3 Corpus Creation

• English version of Wikipedia, dated 10-18-2007
• 2 million articles
• 292,384 categories
• a taxonomy with a depth of about 10
  – 5882 Wikipedia Stub categories
  – 105 domains

3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure

3.1 Categories in Wikipedia

• The taxonomy
  – is constituted by categories
  – links categories to other categories across depth and breadth
• It contains cycles
  – tackled by Zesch and Gurevych (2007)
• The Wikipedia taxonomy is not a tree
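Because the category graph contains cycles and is not a tree, any traversal has to remember which categories it has already reached. The following is a minimal sketch of a cycle-safe breadth-first walk. It is not the method of Zesch and Gurevych, and the function name `category_depths` and the toy graph are assumptions made for illustration only.

```python
from collections import deque

def category_depths(subcats, root):
    """Return the minimum depth of every category reachable from `root`.

    `subcats` maps a category name to a list of its subcategories.
    The `depth` dict doubles as a visited set, so a cycle cannot cause
    an infinite loop.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in depth:  # skip categories already reached
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth

# Toy graph (hypothetical data): "Physics" links back to "Science" (a cycle),
# and "Physicists" has two parents (so the graph is not a tree).
toy = {
    "Science": ["Physics", "Scientists"],
    "Physics": ["Physicists", "Science"],
    "Scientists": ["Physicists"],
}
depths = category_depths(toy, "Science")
```

Breadth-first order means each category is assigned the shortest distance from the root, which is one reasonable way to talk about the "depth" of a taxonomy that is really a graph.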



3.2 Named entity categories

• The domain hierarchy
  – 17 basic domains
  – 88 sub-domains

• Goal: avoid bias towards any particular domain
• Rules for choosing the set of categories
  – ensure diversity in the categorization task
  – ensure that the selected categories are balanced
  – within each domain, prefer the category whose parameters are each closest to the mean value for that domain
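The "closest to the mean" rule can be sketched as follows. The slides do not name the parameters or say how several parameters are combined, so the parameter names (`n_articles`, `n_subcats`), the summed relative deviation, and the function name `pick_balanced_category` are all assumptions for illustration.

```python
from statistics import mean

def pick_balanced_category(candidates):
    """Pick the category whose parameters lie closest to the domain mean.

    `candidates` maps a category name to a dict of numeric parameters.
    Distance is the sum of relative deviations from the per-parameter
    mean (an assumed way of combining several parameters).
    """
    params = sorted(next(iter(candidates.values())))
    means = {p: mean(c[p] for c in candidates.values()) for p in params}

    def distance(stats):
        return sum(abs(stats[p] - means[p]) / means[p]
                   for p in params if means[p])

    return min(candidates, key=lambda name: distance(candidates[name]))

# Hypothetical candidate categories within one domain.
domain = {
    "Cat A": {"n_articles": 20, "n_subcats": 1},
    "Cat B": {"n_articles": 100, "n_subcats": 5},
    "Cat C": {"n_articles": 210, "n_subcats": 9},
}
chosen = pick_balanced_category(domain)  # "Cat B" sits nearest both means
```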



3.3 Procedure

• Extract named entity phrases
  – using the Stanford POS tagger
• Extract typed dependency relationships
• Extract the content words around a named entity
  – collect the NPs (noun phrases) and VPs (verb phrases)

1) First, look for redirect and disambiguation article titles that match the first name of the named entity.

2) If there is more than one such title, choose the target title using the minimum edit distance metric.

3) Pick all articles that fall under the same category as the target article.

4) Among these, look for the articles that fall under the special categories chosen for the classification task.

5) Find the article that shares the maximum number of categories with the target article, and label the target article with that article's special category.
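Steps 2 to 5 above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the candidate titles of step 1 are taken as given, categories are modeled as Python sets, the names `edit_distance` and `label_entity` are hypothetical, and the toy articles and categories are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def label_entity(entity, candidate_titles, article_cats, special):
    """Return (target title, special-category label or None)."""
    # Step 2: the candidate title with the minimum edit distance wins.
    target = min(candidate_titles, key=lambda t: edit_distance(entity, t))
    target_cats = article_cats[target]
    best, best_overlap = None, 0
    for title, cats in article_cats.items():
        # Steps 3-4: keep only articles that share a category with the
        # target and that belong to one of the special categories.
        if title == target or not (cats & special):
            continue
        overlap = len(cats & target_cats)
        if overlap > best_overlap:
            best, best_overlap = title, overlap
    if best is None:
        return target, None
    # Step 5: borrow the special category of the best-matching article.
    # If it has several, the alphabetically first is used (an assumption).
    return target, sorted(article_cats[best] & special)[0]

# Invented toy data.
article_cats = {
    "Alan Turing": {"Logicians", "Cryptographers"},
    "Alan Turing Institute": {"Research institutes"},
    "Toy Article 1": {"Logicians", "Cryptographers", "Scientists"},
    "Toy Article 2": {"Cryptographers", "Military personnel"},
}
special = {"Scientists", "Military personnel"}
result = label_entity("Alan M. Turing",
                      ["Alan Turing Institute", "Alan Turing"],
                      article_cats, special)
```

Here "Toy Article 1" shares two categories with the target and "Toy Article 2" only one, so the target inherits the label of the former.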

• About 10,000 samples
  – Training: 75%
  – Testing: 25%



4 Features

• Four types of feature sets
  – one syntactic feature set
  – three semantic feature sets

4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features

4.1 Typed Dependency Feature

• Phrase structure parse
  – nesting of multi-word constituents
• Dependency parse
  – dependencies between individual words
• Dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
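As a rough illustration of how typed dependencies could become classifier features, the sketch below turns hand-written (relation, governor, dependent) triples into strings for one named entity. No parser is called, and the feature encoding, the function name `dependency_features`, and the example triples are assumptions rather than the encoding used in the paper.

```python
def dependency_features(triples, entity):
    """Collect one feature string per dependency that touches `entity`.

    Each triple is (relation, governor, dependent). The feature records
    the relation, the role the entity plays in it, and the other word.
    """
    feats = []
    for rel, gov, dep in triples:
        if dep == entity:
            feats.append(f"{rel}:gov={gov}")
        elif gov == entity:
            feats.append(f"{rel}:dep={dep}")
    return feats

# Hand-written triples for "Bill Gates founded Microsoft".
triples = [
    ("nsubj", "founded", "Gates"),
    ("dobj", "founded", "Microsoft"),
    ("nn", "Gates", "Bill"),
]
features = dependency_features(triples, "Gates")
# features == ['nsubj:gov=founded', 'nn:dep=Bill']
```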



4.2 Hypernyms

• A semantically specific hypernym feature is preferred
  – the hypernyms of all synsets are ordered inversely by their depth in the hypernym tree
  – the deepest hypernym of the lot is chosen as the target feature for that content word
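The selection of the deepest hypernym can be sketched on a toy hierarchy. The hierarchy below is invented and is not real WordNet data, only direct hypernyms are considered (an assumption, since the slide does not say), and the names `depth` and `deepest_hypernym` are hypothetical.

```python
def depth(node, parent):
    """Number of edges from `node` up to the root of the hierarchy."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

def deepest_hypernym(synsets, parent):
    """Return the direct hypernym that lies deepest in the hierarchy."""
    hypernyms = [parent[s] for s in synsets if s in parent]
    # Order the hypernyms from deepest to shallowest and take the first.
    ranked = sorted(hypernyms, key=lambda h: depth(h, parent), reverse=True)
    return ranked[0] if ranked else None

# Toy child -> parent map (hypothetical labels).
parent = {
    "bank.institution": "financial_institution",
    "financial_institution": "institution",
    "institution": "organization",
    "organization": "entity",
    "bank.slope": "slope",
    "slope": "entity",
}
feature = deepest_hypernym(["bank.institution", "bank.slope"], parent)
```

In this toy example "financial_institution" is three edges below the root while "slope" is one edge below it, so the more specific hypernym is selected.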



4.3.1 WordNet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs. Wikipedia Domain System

4.3.1 WordNet domains

• Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).

• There are 5 top-level domains and 46 basic domains in WDH.




4.3.2 Wikipedia domains

• Wikipedia is indexed.
• Each content word is searched in the index to find the categories with the largest number of pages containing that word.
• Pages in which the word appears as a hyperlink are weighted twice as much as pages that contain the word without a hyperlink.
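The weighting rule above can be sketched with an in-memory list of pages standing in for the index. The page representation (dicts with `categories`, `words`, and `links` keys), the function name `score_categories`, and the toy pages are assumptions for illustration; the slides do not describe the actual index.

```python
from collections import Counter

def score_categories(word, pages):
    """Score each category by the pages that contain `word`.

    A page counts 2 if the word appears as a hyperlink on it,
    1 if it appears only as plain text, and 0 otherwise.
    """
    scores = Counter()
    for page in pages:
        if word in page["links"]:
            weight = 2
        elif word in page["words"]:
            weight = 1
        else:
            continue
        for cat in page["categories"]:
            scores[cat] += weight
    return scores

# Invented toy pages.
pages = [
    {"categories": ["Music"], "words": {"guitar", "band"}, "links": {"guitar"}},
    {"categories": ["Music", "Film"], "words": {"guitar"}, "links": set()},
    {"categories": ["Film"], "words": {"guitar", "actor"}, "links": set()},
    {"categories": ["Sports"], "words": {"goal"}, "links": set()},
]
scores = score_categories("guitar", pages)
top_category = scores.most_common(1)[0][0]
```

With these pages "Music" scores 3 (one linked page plus one plain page) and "Film" scores 2, so "Music" would be the domain feature for "guitar".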




4.3.3 WDH vs. Wikipedia Domain System



5 Experiments

5.1 Experiment 1: Feature wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis



6 Conclusion

• Presented a named entity categorization system
  – employs Wikipedia categories as classes
• Adapted the hierarchical categorization of Wikipedia
  – to mine relations among named entities
