Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC 2009


TRANSCRIPT

Page 1

Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Aasish Pappu
Language Technologies Institute
Carnegie Mellon University

PACLIC 2009

Page 2

Outline

1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion

Page 3

1 Introduction

• A structured and organized encyclopedic corpus is a suitable training corpus.
  – covers a wide range of topics
  – provides hyperlinks

Page 4

1 Introduction

• In this paper
  1) Discuss the usability of Wikipedia
  2) Induce the WordNet and Wikipedia domain taxonomies into the feature space
  3) Use Maximum Entropy and SVM classifiers

Page 6

2 Related Work

• Kazama and Torisawa (2007)
  – extracted gloss text
• Dakka and Cucerzan (2008)
  – tagged the Wikipedia data
• Bunescu and Pasca (2006)
  – built a disambiguation system

Page 8

3 Corpus Creation

• 10-18-2007 English version of Wikipedia
• 2 million articles
• 292,384 categories
• a taxonomy with a depth of about 10
  – 5882 Wikipedia Stub categories
  – 105 domains

Page 9

3 Corpus Creation

3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure

Page 10

3.1 Categories in Wikipedia

• taxonomy
  – constituted by categories
  – linked to other categories across depth and breadth
• contains cycles
  – tackled by Zesch and Gurevych (2007)
• the Wikipedia taxonomy is not a tree
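
Because categories can link back to their own ancestors, any code that walks the category graph needs a cycle guard. The following is a minimal sketch of a generic depth-first cycle check, not the method of Zesch and Gurevych; the dict-of-lists graph and the `toy` data are assumptions made up for illustration.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    graph: dict mapping a category to a list of its subcategories
    (a hypothetical in-memory representation).
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for child in graph.get(node, ()):
            state = color.get(child, WHITE)
            if state == GREY:             # back edge: child is on the current path
                return path[path.index(child):] + [child]
            if state == WHITE:
                found = visit(child)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Toy category graph (invented): "Mechanics" links back up to "Science".
toy = {"Science": ["Physics"], "Physics": ["Mechanics"], "Mechanics": ["Science"]}
print(find_cycle(toy))  # ['Science', 'Physics', 'Mechanics', 'Science']
```

A category reachable through two different parents is not reported as a cycle; it only shows that the structure is a graph rather than a tree.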

Page 12

3.2 Named entity categories

• the domain hierarchy
  – 17 basic domains
  – 88 sub-domains

Page 13

3.2 Named entity categories

• goal: avoid bias towards any particular domain
• rules for choosing the set of categories
  – ensure diversity in the categorization task
  – ensure the selected categories are balanced
  – consider the category whose parameters are each closest to the mean value under that domain
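
The "closest to the mean" rule can be sketched as follows. The slides do not name the parameters or the distance measure, so the parameter names (`articles`, `subcategories`), the summed relative deviation, and the sample numbers are all assumptions for illustration.

```python
from statistics import mean


def pick_balanced_category(categories):
    """Pick the category whose parameters lie closest to the domain means.

    categories: dict mapping category name -> dict of numeric parameters
    (every category is assumed to carry the same parameter keys).
    """
    params = list(next(iter(categories.values())).keys())
    means = {p: mean(c[p] for c in categories.values()) for p in params}

    def deviation(stats):
        # Relative deviation keeps parameters on different scales comparable.
        return sum(abs(stats[p] - means[p]) / means[p] for p in params if means[p])

    return min(categories, key=lambda name: deviation(categories[name]))


# Invented numbers for one domain.
domain = {
    "A": {"articles": 100, "subcategories": 2},
    "B": {"articles": 500, "subcategories": 10},
    "C": {"articles": 900, "subcategories": 18},
}
print(pick_balanced_category(domain))  # B
```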

Page 15

3.3 Procedure

• extract named entity phrases
  – using the Stanford POS tagger
• extract typed dependency relationships
• extract the content words around a named entity
  – collect the NPs (noun phrases) and VPs (verb phrases)
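
As an illustration of the first step, the sketch below groups consecutive proper-noun tokens into candidate named entity phrases. It assumes the text has already been tagged (for example by the Stanford POS tagger) into `(token, tag)` pairs with Penn Treebank tags; treating maximal runs of `NNP`/`NNPS` tokens as phrases is an assumption, since the slides do not state the exact rule.

```python
def proper_noun_phrases(tagged):
    """Group maximal runs of proper-noun tokens into phrases.

    tagged: list of (token, tag) pairs with Penn Treebank POS tags.
    """
    phrases, current = [], []
    for token, tag in tagged:
        if tag in ("NNP", "NNPS"):
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:                      # flush a phrase that ends the sentence
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged example sentence (tags written by hand, not tagger output).
tagged = [
    ("Carnegie", "NNP"), ("Mellon", "NNP"), ("University", "NNP"),
    ("is", "VBZ"), ("in", "IN"), ("Pittsburgh", "NNP"), (".", "."),
]
print(proper_noun_phrases(tagged))  # ['Carnegie Mellon University', 'Pittsburgh']
```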

Page 16

3.3 Procedure

1) First, look for redirected and disambiguated article titles that match the first name of the named entity.
2) If there is more than one such title, choose the target title using the minimum edit distance metric.
3) Pick all articles that fall under the same category as the target article.
4) Among those, look for the articles that fall under the special categories chosen for the classification task.
5) Find the article that shares the maximum number of categories with the target article, and label the target article with that article's special category.
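
Steps 2 to 5 can be sketched as below. This is a rough reading of the procedure, not the author's code: step 1 is assumed to have already produced `candidate_titles`, the article-to-category map is a hypothetical in-memory dict, ties between several special categories are broken alphabetically, and all titles and category names are invented toy data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def label_entity(name, candidate_titles, article_categories, special_categories):
    # Step 2: choose the candidate title closest to the entity name.
    target = min(candidate_titles,
                 key=lambda t: edit_distance(name.lower(), t.lower()))
    target_cats = article_categories[target]
    best_label, best_overlap = None, 0
    for title, cats in article_categories.items():
        if title == target:
            continue
        shared = target_cats & cats            # step 3: shares a category
        special = cats & special_categories    # step 4: has a special category
        # Step 5: keep the article with the largest category overlap.
        if shared and special and len(shared) > best_overlap:
            best_overlap = len(shared)
            best_label = sorted(special)[0]
    return target, best_label


article_categories = {
    "Alex Stone": {"City Hawks players", "National league guards"},
    "Alex R. Stone": {"Fictional university faculty"},
    "Sam Reed": {"City Hawks players", "National league guards", "Basketball players"},
    "Pat Lane": {"National league guards", "Basketball players"},
}
special = {"Basketball players", "Computer scientists"}
print(label_entity("Alex Stone", ["Alex Stone", "Alex R. Stone"],
                   article_categories, special))
# ('Alex Stone', 'Basketball players')
```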

Page 17

3.3 Procedure

• About 10,000 samples
  – Training: 75%
  – Testing: 25%
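
A 75/25 split of this kind can be sketched as follows. The slides do not say how the split was drawn, so the seeded random shuffle and the function name `split_samples` are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.75, seed=0):
    """Shuffle the samples reproducibly and cut them into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_samples(range(10000))
print(len(train), len(test))  # 7500 2500
```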

Page 20

4 Features

• four types of feature sets
  – a syntactic feature set
  – three semantic features

Page 21

4 Features

4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features

Page 22

4.1 Typed Dependency Feature

• phrase structure parse
  – nesting of multi-word constituents
• dependency parse
  – dependencies between individual words
• dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
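
One way to turn typed dependencies into features is sketched below. It assumes the parser output is available as `(relation, governor, dependent)` triples; the feature string format and the hand-written triples are assumptions, not the feature encoding used in the paper.

```python
def dependency_features(triples, entity_tokens):
    """Build features from dependencies that connect the entity to other words.

    triples: list of (relation, governor, dependent) word triples.
    entity_tokens: the tokens that make up the named entity.
    """
    entity = {t.lower() for t in entity_tokens}
    features = []
    for relation, governor, dependent in triples:
        gov, dep = governor.lower(), dependent.lower()
        if dep in entity and gov not in entity:
            features.append(f"{relation}:gov={gov}")   # entity depends on gov
        elif gov in entity and dep not in entity:
            features.append(f"{relation}:dep={dep}")   # dep modifies the entity
    return features


# Hand-written triples for "Alex Stone plays basketball".
triples = [
    ("nsubj", "plays", "Stone"),
    ("nn", "Stone", "Alex"),
    ("dobj", "plays", "basketball"),
]
print(dependency_features(triples, ["Alex", "Stone"]))  # ['nsubj:gov=plays']
```

Relations internal to the entity (here `nn`) and relations that do not touch it are skipped, so only the words that stand in a direct relation to the entity become features.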

Page 24

4.2 Hypernyms

• a semantically specific hypernym feature is preferred
  – the hypernyms of all synsets are ordered inversely by their depth in the hypernym tree
  – the deepest hypernym in the lot is chosen as the target feature for that content word
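
The selection of the deepest hypernym can be sketched on a toy hierarchy. The `parent_of` map, the sense identifiers (`player#1`, `player#2`), and the tree itself are invented stand-ins for WordNet, used only to show the ordering by depth.

```python
def depth(node, parent_of):
    """Number of hypernym links between a node and the root."""
    d = 0
    while node in parent_of:
        node = parent_of[node]
        d += 1
    return d


def deepest_hypernym(word, senses, parent_of):
    """Return the most specific (deepest) hypernym over all senses of a word."""
    hypernyms = [parent_of[s] for s in senses.get(word, []) if s in parent_of]
    ranked = sorted(hypernyms, key=lambda h: depth(h, parent_of), reverse=True)
    return ranked[0] if ranked else None


# Toy hypernym tree: child -> parent ("entity" is the root).
parent_of = {
    "physical_entity": "entity",
    "object": "physical_entity",
    "organism": "object",
    "person": "organism",
    "athlete": "person",
    "abstraction": "entity",
    "player#1": "athlete",       # hypothetical sense: sports player
    "player#2": "abstraction",   # hypothetical sense: abstract participant
}
senses = {"player": ["player#1", "player#2"]}
print(deepest_hypernym("player", senses, parent_of))  # athlete
```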

Page 26

4 Features

4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features
  4.3.1 WordNet domains
  4.3.2 Wikipedia domains
  4.3.3 WDH vs Wikipedia Domain System

Page 27

4.3.1 WordNet domains

• Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).
• There are 5 top-level domains and 46 basic domains in WDH.

Page 29

4.3.2 Wikipedia domains

• indexed Wikipedia
• search the index for each content word to find the categories with the largest number of pages containing that word
• pages in which the word appears as a hyperlink are weighted double the pages that contain the word without a hyperlink.
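
The weighting scheme can be illustrated with a small scoring function. The page records below are a hypothetical stand-in for the Wikipedia index (each page lists its categories, its plain words, and its hyperlinked words); only the 2:1 weighting of linked versus unlinked occurrences comes from the slide.

```python
from collections import Counter


def rank_categories(word, pages):
    """Score categories by the pages that mention a content word.

    A page where the word is hyperlinked counts 2, a page where it
    only appears as plain text counts 1.
    """
    scores = Counter()
    for page in pages:
        if word in page["linked_words"]:
            weight = 2
        elif word in page["words"]:
            weight = 1
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    return scores.most_common()


# Invented mini-index of three pages.
pages = [
    {"categories": ["Sports"], "words": {"guard"}, "linked_words": {"basketball"}},
    {"categories": ["Sports", "Culture"], "words": {"basketball"}, "linked_words": set()},
    {"categories": ["Culture"], "words": {"basketball", "film"}, "linked_words": set()},
]
print(rank_categories("basketball", pages))  # [('Sports', 3), ('Culture', 2)]
```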

Page 31

4.3.3 WDH vs Wikipedia Domain System

Page 33

5 Experiments

Page 34

Outline

1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
  5.1 Experiment 1: Feature wise model
  5.2 Experiment 2: Feature combination model
  5.3 Experiment 3: Error analysis

Page 43

6 Conclusion

• presented a named entity categorization system
  – employs Wikipedia categories as classes
• adapted the hierarchical categorization of Wikipedia
  – mine relations among named entities