Using Wikipedia for Hierarchical Finer Categorization of Named Entities. Aasish Pappu, Language Technologies Institute, Carnegie Mellon University. PACLIC 2009.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities
Aasish Pappu
Language Technologies Institute
Carnegie Mellon University
PACLIC 2009
Outline
1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion
1 Introduction
• A structured and organized encyclopedic corpus is a suitable training corpus.
  – covers a wide range of topics
  – provides hyperlinks
• In this paper, we:
  1) Discuss the usability of Wikipedia
  2) Induce the WordNet and Wikipedia domain taxonomies into the feature space
  3) Use Maximum Entropy and SVM classifiers
2 Related Work
• Kazama and Torisawa (2007)
  – extracted gloss text
• Dakka and Cucerzan (2008)
  – tagging the Wikipedia data
• Bunescu and Pasca (2006)
  – built a disambiguation system
3 Corpus Creation
• 10-18-2007 English version of Wikipedia
• 2 million articles
• 292,384 categories
• a taxonomy with a depth of about 10
  – 5882 Wikipedia Stub categories
  – 105 domains
3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure
3.1 Categories in Wikipedia
• taxonomy
  – constituted by categories
  – linked to other categories across depth and breadth
• contains cycles
  – tackled by Zesch and Gurevych (2007)
• the Wikipedia taxonomy is not a tree
3.2 Named entity categories
• the domain hierarchy
  – 17 basic domains
  – 88 sub-domains
• to avoid bias towards any particular domain
• rules for choosing the set of categories
  – to ensure diversity in the categorization task
  – to ensure we select balanced categories
  – consider the category whose parameters are each closest to the mean value under that domain
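The "closest to the mean" rule can be illustrated with a small sketch. The slides do not say which parameters are measured per category or how closeness is computed, so the parameter tuple (number of articles, depth), the category names, and the standardized squared distance below are all assumptions made for illustration only.

```python
from statistics import mean, pstdev

def pick_balanced_category(categories):
    """Return the category whose parameters lie closest to the domain mean.

    `categories` maps a category name to a tuple of numeric parameters.
    Which parameters to use is an assumption; the slides do not list them.
    """
    names = list(categories)
    columns = list(zip(*(categories[name] for name in names)))
    means = [mean(column) for column in columns]
    # Scale each parameter by its spread so large-valued parameters do not
    # dominate; fall back to 1.0 when a parameter has no spread at all.
    spreads = [pstdev(column) or 1.0 for column in columns]

    def distance(name):
        return sum(((value - m) / s) ** 2
                   for value, m, s in zip(categories[name], means, spreads))

    return min(names, key=distance)

# Hypothetical categories of one domain: (number of articles, taxonomy depth).
domain = {
    "Category A": (120, 3),
    "Category B": (480, 5),
    "Category C": (900, 7),
}
chosen = pick_balanced_category(domain)
print(chosen)
```

Running the selection once per domain would yield one representative category per domain, which is one way to keep the class set diverse and balanced.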
3.3 Procedure
• extract named entity phrases
  – using the Stanford POS tagger
• extract typed dependency relationships
• extract the content words around a named entity
  – collect the NPs (noun phrases) and VPs (verb phrases)
1) First, look for redirected and disambiguated article titles that match the first name of the named entity.
2) If there is more than one such title, choose the target title using the minimum edit distance metric.
3) Pick all articles that fall under the same category as the target article.
4) Look for those articles that fall under the special categories chosen for the classification task.
5) Find the article that shares the maximum number of categories with the target article, and label the target article with its special category.
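Steps 2 to 5 above can be sketched roughly as follows. This is a minimal illustration on toy data, not the author's implementation: the function names, the article titles, the category sets, and the tie-breaking (first best match wins, alphabetically first special category) are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def choose_target_title(entity, candidate_titles):
    """Step 2: among several matching titles, take the closest one."""
    return min(candidate_titles,
               key=lambda title: edit_distance(entity.lower(), title.lower()))

def label_target(target, article_categories, special_categories):
    """Steps 3-5: label the target with the special category of the article
    that shares the most categories with it."""
    target_cats = article_categories[target]
    best_label, best_overlap = None, 0
    for article, cats in article_categories.items():
        if article == target:
            continue
        overlap = len(target_cats & cats)          # step 3: shared categories
        special = sorted(cats & special_categories)  # step 4: special ones
        if special and overlap > best_overlap:     # step 5: maximum overlap
            best_label, best_overlap = special[0], overlap
    return best_label

# Toy data, invented for illustration.
article_categories = {
    "Paris": {"Capitals in Europe", "Cities in France"},
    "Lyon": {"Cities in France", "Cities"},
    "Berlin": {"Capitals in Europe", "Cities"},
    "Paris Hilton": {"American socialites", "People"},
}
special_categories = {"Cities", "People"}

target = choose_target_title("Pariss", ["Paris", "Paris Hilton"])
label = label_target(target, article_categories, special_categories)
print(target, label)
```

An article that shares no category with the target never wins, so the function returns `None` when no labelled neighbour exists.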
• About 10,000 samples
  – Training: 75%
  – Testing: 25%
4 Features
• four types of feature sets
  – one syntactic feature set
  – three semantic feature sets
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features
4.1 Typed Dependency Feature
• phrase structure parse
  – nesting of multi-word constituents
• dependency parse
  – dependencies between individual words
• dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
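To make the idea concrete, here is a small sketch of turning typed dependencies into features for a named entity. It assumes the parser output is already available as (relation, governor, dependent) triples; the feature string format and the example sentence are invented for illustration and are not taken from the paper.

```python
def dependency_features(entity_tokens, dependencies):
    """Collect typed-dependency features that link the entity to its context.

    `dependencies` is a list of (relation, governor, dependent) triples,
    assumed to come from an external dependency parser.
    """
    entity = {token.lower() for token in entity_tokens}
    features = []
    for relation, governor, dependent in dependencies:
        gov_in = governor.lower() in entity
        dep_in = dependent.lower() in entity
        if gov_in and dep_in:
            continue  # relation internal to the entity phrase, no context clue
        if dep_in:
            features.append(f"{relation}:gov={governor.lower()}")
        elif gov_in:
            features.append(f"{relation}:dep={dependent.lower()}")
    return features

# Hand-written triples for "Albert Einstein developed relativity".
parsed = [
    ("nn", "Einstein", "Albert"),
    ("nsubj", "developed", "Einstein"),
    ("dobj", "developed", "relativity"),
]
features = dependency_features(["Albert", "Einstein"], parsed)
print(features)
```

Here the entity being the subject of "developed" is the kind of clue the slide refers to: it hints at an agent, such as a person or an organization.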
4.2 Hypernyms
• a semantically specific hypernym feature is preferred
  – the hypernyms of all synsets are ordered inversely by their depth in the hypernym tree
  – the deepest hypernym in the lot is chosen as the target feature for that content word
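The deepest-hypernym selection can be sketched on a toy hierarchy. The hand-written `HYPERNYM_OF` and `SENSE_HYPERNYMS` tables below stand in for WordNet and are assumptions, as is the alphabetical tie-break between hypernyms of equal depth.

```python
# Toy hypernym tree (child -> parent); "entity" is the root.
HYPERNYM_OF = {
    "entity": None,
    "object": "entity",
    "artifact": "object",
    "instrument": "artifact",
    "keyboard_instrument": "instrument",
    "body_part": "object",
}

# Direct hypernym of each sense of a content word (hypothetical data).
SENSE_HYPERNYMS = {"organ": ["keyboard_instrument", "body_part"]}

def chain(node):
    """Yield the node followed by all of its ancestors up to the root."""
    while node is not None:
        yield node
        node = HYPERNYM_OF[node]

def depth(node):
    """Number of edges between the node and the root."""
    return sum(1 for _ in chain(node)) - 1

def deepest_hypernym(word):
    """Pool the hypernyms of every sense and return the deepest one."""
    candidates = set()
    for hypernym in SENSE_HYPERNYMS[word]:
        candidates.update(chain(hypernym))
    # Sort deepest first; ties are broken alphabetically (an assumption).
    ranked = sorted(candidates, key=lambda n: (-depth(n), n))
    return ranked[0]

feature = deepest_hypernym("organ")
print(feature)
```

Choosing the deepest node favours specific labels such as "keyboard_instrument" over generic ones such as "object", which matches the stated preference for a semantically specific feature.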
4.3 Domain based features
4.3.1 WordNet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs. Wikipedia Domain System
4.3.1 WordNet domains
• Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).
• There are 5 top-level domains and 46 basic domains in WDH.
4.3.2 Wikipedia domains
• indexed Wikipedia
• search the index for each content word to find the categories with the largest number of pages containing that word
• in particular, pages where the word appears as a hyperlink are weighted double relative to pages that contain the word without a hyperlink.
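The weighting scheme above might look like the following sketch. The in-memory `PAGES` list stands in for the Wikipedia index, and its structure, the example pages, and the function name are assumptions; only the 2:1 weighting of linked versus plain occurrences comes from the slide.

```python
from collections import Counter

# Toy stand-in for the Wikipedia index (hypothetical pages and categories).
PAGES = [
    {"categories": ["Physics"],
     "text_words": {"energy", "mass"}, "linked_words": {"relativity"}},
    {"categories": ["Physics", "Astronomy"],
     "text_words": {"relativity", "star"}, "linked_words": set()},
    {"categories": ["Philosophy"],
     "text_words": {"relativity"}, "linked_words": set()},
]

def category_scores(word, pages, link_weight=2, plain_weight=1):
    """Score each category by the pages that mention `word`.

    A page where the word is hyperlinked counts double a page where the
    word only appears as plain text.
    """
    scores = Counter()
    for page in pages:
        if word in page["linked_words"]:
            weight = link_weight
        elif word in page["text_words"]:
            weight = plain_weight
        else:
            continue  # the page does not mention the word
        for category in page["categories"]:
            scores[category] += weight
    return scores

scores = category_scores("relativity", PAGES)
best_category = scores.most_common(1)[0][0]
print(best_category, dict(scores))
```

The top-scoring category would then serve as the Wikipedia domain feature for that content word.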
4.3.3 WDH vs. Wikipedia Domain System
5 Experiments
5.1 Experiment 1: Feature-wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis
6 Conclusion
• presented a named entity categorization system
  – employs Wikipedia categories as classes
• adapted the hierarchical categorization of Wikipedia
  – mine relations among named entities