semi-automatic methods for wordnet construction german rigau i claramunt rigau talp research center...
TRANSCRIPT
Semi-automatic methods for WordNet construction
German Rigau i Claramunthttp://www.lsi.upc.es/~rigauTALP Research CenterUniversitat Politècnica de Catalunya
Eneko Agirrehttp://www.ji.si.upc.es/users/enekoIxA NLP GroupUniversity of the Basque Country
2002 International WordNet Conference
Semi-automatic methods for WN construction 2002 International WordNet Conference - 2
• NLP and the Lexicon
Theoretical: WG, GPSG, HPSG. Practical: realistic complexity and coverage
• Lexical bottleneck (Briscoe 91)
Even worse for languages other than English
Setting
Semi-automatic methods for WN construction 2002 International WordNet Conference - 3
• Which LK is needed by a concrete NLP system?
• Where is this LK located?
• Which procedures can be applied?
Setting
Semi-automatic methods for WN construction 2002 International WordNet Conference - 4
• Which LK is needed by a concrete NLP system?
Phonology: phonemes, stress, etc. Morphology: POS, etc. Syntactic: category, subcat., etc. Semantic: class, SRs, etc. Pragmatic: usage, registers, TDs, etc. Translations: translation links
Setting
Semi-automatic methods for WN construction 2002 International WordNet Conference - 5
• Where is this LK located?
Human brain Structured Lexical Resources:
Monolingual and bilingual MRDs Thesauri
Unstructured Lexical Resources: Monolingual and bilingual Corpora
Mixing resources
Setting
Semi-automatic methods for WN construction 2002 International WordNet Conference - 6
• Which procedures can be applied?
Prescriptive approach Machine-aided manual construction
Descriptive approach Automatic acquisition from pre-existing
Lexical Resources
Mixed approach
Setting
Semi-automatic methods for WN construction 2002 International WordNet Conference - 7
• Setting• Words and Works • Merge approach
Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs
• Expand approach Translation of synsets: bilingual MRDs
• Interface for manual revision• Conclusions
Outline
Semi-automatic methods for WN construction 2002 International WordNet Conference - 8
Words and WorksWhere is this Lexical Knowledge located?
• Human brain: Linguistic String Project (Fox et al. 88)
Lexical Information for 10,000 entries
WordNet (Miller et al. 90) Semantic Information v1.6 with 99,642 synsets
Comlex (Grishman et al. 94) Syntactic information 38,000 English words
CYC Ontology (Lenat 95) a person-century of effort to produce 100,000
terms
LDOCE3-NLP dictionary with 80,000 senses
Semi-automatic methods for WN construction 2002 International WordNet Conference - 9
• Structured Lexical Resources Monolingual MRDs:
LDOCE• learner’s dictionary• 35,956 entries and 76,059 definitions• 86% semantic and 44% pragmatic codes• controlled vocabulary of 2,000 words• (Boguraev & Briscoe 89)• (Vossen & Serail 90)• (Bruce & Guthrie 92), (Wilks et al. 93)• (Dolan et al. 93), (Richardson 97)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 10
• Structured Lexical Resources Other Monolingual MRDs:
Webster’s (Jensen & Ravin 87) LPPL (Artola 93) DGILE (Castellón 93), (Taulé 95), (Rigau 98) CIDE (Harley & Glennon 97) AHD (Richardson 97) WordNet (Harabagiu 98)
Bilingual MRDs Collins Spanish/English (Knigth & Luk 94) Vox/Harrap’s Spanish/English (Rigau 98)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 11
• Structured Lexical Resources Thesauri:
Roget’s Thesaurus • 60,071 words in 1,000 categories• (Yarowsky 92), (Grefenstette 93), (Resnik 95)
Roget’s II and The New Collins Thesaurus
• (Byrd 89)
Macquarie’s thesaurus• (Grefenstette 93)
Bunrui Goi Hyou Japanese thesaurus• (Utsuro et al. 93)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 12
• Structured Lexical Resources Encyclopaedia
Grolier’s Encyclopaedia (Yarowsky 92) Encarta (Richardson et al. 98)
Others Telephonic Guides
Mixing structured lexical resources Roget’s Thesaurus and Grolier’s (Yarowsky 92) LDOCE, WN, Collins, ONTOS, UM (Knight & Luk
94) Japanese MRD to WN (Okumura & Hovy 94) LLOCE, LDOCE (Chen & Chang 98)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 13
• Unstructured Lexical Resources Corpora:
WSJ, Brown Corpus (SemCor), Hansard Proper Nouns (Hearst & Schütze 95) Idiosyncratic Collocations (Church et al. 91) Preposition preferences (Resnik and Hearst 93) Subcategorization structures (Briscoe and Carroll
97) Selectional restrictions (Resnik 93), (Ribas 95) Thematic structure (Basili et al. 92) Word semantic classes (Dagan et al. 94) Bilingual Lexicons for MT (Fung 95)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 14
• Using both structured and non-structured Lexical Resources MRDs and Corpora
(Liddy & Paik 92) (Klavans & Tzoukermann 96)
WordNet and Corpora (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy
01)
Words and WorksWhere is this Lexical Knowledge located?
Semi-automatic methods for WN construction 2002 International WordNet Conference - 15
• Japanese Projects EDR (Yokoi 95)
Nine years project oriented to MT Bilingual Corpora with 250,000 words Monolingual, bilingual and coocurrence
dictionaries 200,000 general vocabulary 100,000 technical terminology 400,000 concepts
Words and WorksInternational Projects on Lexical Acquisition
Semi-automatic methods for WN construction 2002 International WordNet Conference - 16
• American Projects Comlex (Grishman et al. 94)
Syntactic information for 38,000 words
WordNet (Miller 90) Semantic Information more than 123,000 words organised in 99,000 synsets more than 116,000 relations between synsets
Pangloss (Knight & Luk 94) PUM, ONTOS, LDOCE semantic categories, WordNet
Cyc (Lenat 95) common-sense knowledge 100,000 concepts and 1,000,000 axioms
Words and Works International Projects on Lexical Acquisition
Semi-automatic methods for WN construction 2002 International WordNet Conference - 17
• European Projects Acquilex I and II
LA from monolingual and bilingual MRDs and corpora
LE-Parole Large-scale harmonised set of corpora and
lexicons for all the EU languages
EuroWordNet Multilingual WordNet for several European
Languages
Meaning Large-scale of LK from the web Large-scale WSD
Words and Works International Projects on Lexical Acquisition
Semi-automatic methods for WN construction 2002 International WordNet Conference - 18
• Syntactic Disambiguation (Dolan et al. 93)• Semantic Processing (Vanderwende 95)• WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98)• IR (Krovetz & Croft 92)• MT (Knight and Luk 94), (Tanaka & Umemura 94)• Semantically enriching MRDs
(Yarowsky 92), (Knight 93), (Chen & Chan 98)
• Building LKBs (Bruce & Guthrie 92) (Dolan et al. 93) (Artola 93) (Castellón 93), (Taulé 95), (Rigau 98)
Words and WorksLexical Acquisition from MRDs
Semi-automatic methods for WN construction 2002 International WordNet Conference - 19
• This tutorial focus on: the massive acquisition of LK from MRDs (conventional, in any language) using (semi) automatic methodologies
• Why MRDs?
The conventional dictionaries for human use usually “contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propiety; ethimological, syntactic and semantic information about the most basic units of the language” (Amsler 81)
Words and WorksAcquisition of LK from MRDs
Semi-automatic methods for WN construction 2002 International WordNet Conference - 20
• Conventional dictionaries are not systematic
• Dictionaries are built for human use
• Implicit Knowledge
words are described/translated in terms of words
Words and WorksMain Problems of MRDs
Semi-automatic methods for WN construction 2002 International WordNet Conference - 21
jardín_1_1 Terreno donde se cultivan plantas y flores ornamentales.
florero_1_4 Maceta con flores.ramo_1_3 Conjunto natural o artificial de flores, ramas o
hierbas.pétalo_1_1 Hoja que forma la corola de la flor. tálamo_1_3 Receptáculo de la flor. miel_1_1 Substancia viscosa y muy dulce que elaboran las abejas, en
una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales.
florería_1_1 Floristería; tienda o puesto donde se venden flores. florista_1_1 Persona que tiene por oficio hacer o vender flores.camelia_1_1 Arbusto cameliáceo de jardín, originario de Oriente,
de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica).
camelia_1_2 Flor de este arbusto. rosa_1_1 Flor del rosal.
Words and WorksMRDs and Semantic Knowledge
Semi-automatic methods for WN construction 2002 International WordNet Conference - 22
• Setting• Words and Works• Merge approach
Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs
• Expand approach Translation of synsets: bilingual MRDs
• Interface for manual revision• Conclusions
Outline
Semi-automatic methods for WN construction 2002 International WordNet Conference - 23
MRD1
MRDn
LDB1 Tax1
LDBn Taxn
MLKB... ... ...
LKB1
LKBn
...
Merge approachMain Methodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 24
• Taxonomy construction: (Rigau et al. 98, 97) monolingual MRDs Step 1: Selection of the main top beginners
for a semantic primitive Step 2: Exploiting genus, construction of
taxonomies for each semantic primitive
• Mapping taxonomies: (Daudé et al. 99) bilingual MRDs Step 3: Creation of translation links
Merge approachMain Methodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 25
• Problems following a pure descriptive approachCircularityErrors and inconsistenciesDefinitions with omitted genus
• Top dictionary senses do not usually represent useful knowledge for the LKBToo generalToo specific
Merge approach: Taxonomy ConstructionMethodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 26
Mixed Methodology
Prescriptive approach Manual construction of the Top Structure
Merge approach: Taxonomy ConstructionMethodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 27
Mixed Methodology
Descriptive approach Acquiring implicit information from MRDs
Prescriptive approach Manual construction of the Top Structure
Merge approach: Taxonomy Construction Methodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 28
Mixed Methodology
Descriptive approach Acquiring implicit information from MRDs
Prescriptive approach Manual construction of the Top Structure
Merge approach: Taxonomy Construction Methodology
Semi-automatic methods for WN construction 2002 International WordNet Conference - 29
Word sense: zumo_1_1 Attached-to: c_art_subst type.Definition: líquido que se extrae de las flores,
hierbas, frutos, etc. (liquid extracted from
flowers, herbs, fruits, etc).
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 30
• A) Attaching DGILE senses to semantic primitives 1) First labelling:
Conceptual Distance (Rigau 94)
2) Second labelling: Salient Words (Yarowsky 92)
• B) Filtering Process
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 31
• A.1) First labelling: Conceptual Distance (Agirre et al. 94)
length of the shortest path specificity of the concepts
)c,path(cc kwcwc
21
i2i1k2i2
1i1 )depth(c
1min)w,dist(w
using WordNet Bilingual dictionary
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 32
<object, ...>
<artifact, artefact>
<entity><entity>
<structure, construction>
<building, edifice>
<place of worship, ...>
<church, church building>
<abbey>
<house, lodging>
abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
<religious residence, cloiser>
<monastery>
<abbey>
<convent>
<abbey>
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 33
<object, ...>
<artifact, artefact>
<entity>
<structure, construction>
<building, edifice>
<place of worship, ...>
<church, church building>
<abbey> 06 ARTIFACT
<house, lodging>
abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
<religious residence, cloiser>
<monastery>
<abbey>
<convent>
<abbey>
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 34
• A.1) First labelling (Results)
29,205 labelled definitions (31%)61% accuracy at a sense level64% accuracy at a file level
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 35
• A.2) Second labelling: Salient Words (Yarowsky 92)
Pr(w)
SC)|Pr(wlogSC)|Pr(wSC)AR(w, 2
Importance local frequency appears more significantly more often in
the corpus of a semantic category than at other points in the whole corpus
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 36
• A.2) Second labelling (Results):
86,759 labelled definitions (93%)80% accuracy at a file level
biberón_1_1 ARTIFACT 4.8399 Frasco de cristal ... (glass flask ...)
biberón_1_2 FOOD 7.4443 Leche que contiene este frasco ... (milk contained in that flask ...)
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 37
• B) Filtering process (FOODs)
removes all genus terms
FILTER 1: not FOODs by the bilingual mapping
FILTER 2: appear more often as genus in other Semantic Primitive
FILTER 3: with a low frequency
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 38
• B) Filtering process (FOOD Results)
FILTER 1 FILTER 2LABEL2 #GT Accuracy #GT AccuracyLABEL2+F3>9 31 94% 31 100%LABEL2+F3>8 35 95% 35 100%LABEL2+F3>7 37 91% 37 95%LABEL2+F3>6 43 92% 41 94%LABEL2+F3>5 49 92% 47 92%LABEL2+F3>4 55 91% 56 91%LABEL2+F3>3 64 85% 65 87%LABEL2+F3>2 85 82% 82 83%LABEL2+F3>1 125 78% 123 82%
Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners
Semi-automatic methods for WN construction 2002 International WordNet Conference - 39
Word sense: vino_1_1 Hypernym: zumo_1_1.Definition: zumo de uvas fermentado.
(fermented juice of grapes).
Word sense: rueda_2_1 Hypernym: vino_1_1.Definition: vino procedente de la región de
Rueda (Valladolid). (wine from the region of Rueda).
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 40
• Genus Sense Identification97% accuracy for nouns
• Genus Sense DisambiguationUnrestricted WSD (coverage 100%)Knowledge-based WSD (not supervised)Eight Heuristics (McRoy 92)
Combining several lexical resources Combining several methods
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 41
Results:Results:
Polysemous OverallPrec. Cov. Prec. Cov.
Heuristic 1: Monosemous Genus Term - - 100% 16%Heuristic 2: Entry Sense Ordering 70% 100% 75% 100%Heuristic 3: Explicit Semantic Domain 100% 1% 100% 2%Heuristic 4: Word Matching 72% 61% 79% 56%Heuristic 5: Simple Concordance 57% 100% 65% 95%Heuristic 6: Cooccurrence Vectors 60% 100% 66% 97%Heuristic 7: Semantic Vectors 58% 99% 63% 94%Heuristic 8: Conceptual Distance 49% 95% 57% 89%Sum 79% 100% 83% 100%
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 42
Knowledge provided by each heuristic:Knowledge provided by each heuristic:
OverallPrec. Cov.
- Heuristic 1: Monosemous Genus Term 79% 100%- Heuristic 2: Entry Sense Ordering 72% 100%- Heuristic 3: Explicit Semantic Domain 82% 98%- Heuristic 4: Word Matching 81% 100%- Heuristic 5: Simple Concordance 81% 100%- Heuristic 6: Cooccurrence Vectors 81% 100%- Heuristic 7: Semantic Vectors 81% 100%- Heuristic 8: Conceptual Distance 77% 100%Sum 83% 100%
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 43
F2+F3>9: 35,099 definitionsF2+F3>9: 35,099 definitions
F2+F3>4: 40,754 definitionsF2+F3>4: 40,754 definitions
No filters: 111,624 definitionsNo filters: 111,624 definitions
FOOD [Castellón 93] F2+F3>9 F2+F3>4Genus terms 63 33 68Dictionary senses 392 952 1,242Levels 6 5 6Senses in level 1 2 18 48Senses in level 2 67 490 604Senses in level 3 88 379 452Senses in level 4 67 44 65Senses in level 5 87 21 60Senses in level 6 6 0 13
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 44
...zumo_1_1 vino_1_1 quianti_1_1 zumo_1_1 vino_1_1 raya_1_8 zumo_1_1 vino_1_1 requena_1_1 zumo_1_1 vino_1_1 reserva_1_12 zumo_1_1 vino_1_1 ribeiro_1_1 zumo_1_1 vino_1_1 rioja_1_1 zumo_1_1 vino_1_1 roete_1_1 zumo_1_1 vino_1_1 rosado_1_3 zumo_1_1 vino_1_1 rueda_2_1 zumo_1_1 vino_1_1 sherry_1_1 zumo_1_1 vino_1_1 tarragona_1_1 zumo_1_1 vino_1_1 tintilla_1_1 zumo_1_1 vino_1_1 tintorro_1_1 zumo_1_1 vino_1_1 toro_3_1...
Merge approach: Taxonomy Construction Step 2: Exploiting Genus
Semi-automatic methods for WN construction 2002 International WordNet Conference - 45
C1C1
C2C2
C3C3
C5C5
C6C6
C4C4
Merge approach: Mapping Taxonomies Step 3: Creation of translation links
Semi-automatic methods for WN construction 2002 International WordNet Conference - 46
C1C1
C2C2
C3C3
C5C5
C6C6
C4C4
Merge approach: Mapping Taxonomies Step 3: Creation of translation links
Semi-automatic methods for WN construction 2002 International WordNet Conference - 47
• Connecting already existing HierarchiesRelaxation labelling AlgorithmConstraints
• BetweenSpanish taxonomy automatically
derived from an MRD (Rigau et al. 98)
WordNetusing a bilingual MRD
Merge approach: Mapping Taxonomies Step 3: Creation of translation links
Semi-automatic methods for WN construction 2002 International WordNet Conference - 48
animal
ave
rapaz
faisán
(Tops <animal, animate_being, ...>)
(person <beast, brute, ...>)(person <dunce, blockhead, ...>)
(animal <bird>)
(artifact <bird, shuttle, ...>)
(food <fowl, poultry, ...>)
(person <dame, doll, ...>)
(animal <pheasant>)
(food <pheasant>)
(animal <bird>)
(artifact <bird, shuttle, ...>)
(food <fowl, poultry, ...>)
(person <dame, doll, ...>)
Merge approach: Mapping Taxonomies Step 3: Creation of translation links
Semi-automatic methods for WN construction 2002 International WordNet Conference - 49
• Iterative algorithm for function optimisation based on local information
• it can deal with any kind of constraints variables (senses of the taxonomy) labels (synsets)
• Finds a weight assignment for each possible label for each variable weights for the labels of the same variable
add up to one weight assignation satisfies -to the
maximum possible extent- the set of constraints
Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm
Semi-automatic methods for WN construction 2002 International WordNet Conference - 50
1) Start with a random weight assignment
2) Compute the support value for each label of each variable (according to the constraints)
3) Increase the weights of the labels more compatible with context and decrease those and decrease those of the less compatible labels.
4) If a stopping/convergence is satisfied, stop,otherwise go to step 2.
Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm
Semi-automatic methods for WN construction 2002 International WordNet Conference - 51
• Rely on the taxonomy structure• Coded with three characters
X: Spanish Taxonomy, I (immediate), Y: English Taxonomy, A (ancestor) X: Relation, E (hypernym), O (hyponym), B (both)
• Examples:
IIEIIE AABAAB
++ ++
++++
Merge approach: Mapping Taxonomies Step 3: Constraints
Semi-automatic methods for WN construction 2002 International WordNet Conference - 52
Poly TOK, FOK TOK, FNOK total
animal 279 (90%) 30 (91%) 209 (90%)food 166 (94%) 3 (100%) 169 (94%)cognition 198 (67%) 27 (90%) 225 (69%)communication 533 (77%) 40 (97%) 573 (78%)
all TOK, FOK TOK, FNOK total
animal 424 (93%) 62 (95%) 486 (90%)food 166 (94%) 83 (100%) 249 (96%)cognition 200 (67%) 245 (90%) 445 (82%)communication 536 (77%) 234 (97%) 760 (81%)
Merge approach: Mapping Taxonomies Step 3: Results
Semi-automatic methods for WN construction 2002 International WordNet Conference - 53
piel
visón
marta
(substance <skin, fur, peel>)
(substance <sable, marte, coal_back>)
(substance <mink, mink_coat>)
Merge approach: Mapping Taxonomies Step 3: Example
Semi-automatic methods for WN construction 2002 International WordNet Conference - 54
• Setting• Words and Works • Merge approach
Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs
• Expand approach Translation of synsets: bilingual MRDs
• Interface for manual revision• Conclusions
Outline
Semi-automatic methods for WN construction 2002 International WordNet Conference - 55
Expand approach
• Take one WordNet as starting point• Translate synsets:
English: <car, automobile> Basque: <auto, berebil>
• We obtain a structurally similar WordNet in another language, but some of the synsets will be missing
• Use bilingual dictionarymaintien n.m. (attitude) bearing; (conservation) maintenance1. Keep bilingual senses (Agirre & Rigau 95)
maintien1: (attitude) bearing maintien2: (conservation) maintenance
2. Produce all translation pairs (Atserias et al. 97)maintien - bearing
maintien - maintenance
Semi-automatic methods for WN construction 2002 International WordNet Conference - 56
Expand approach - produce all pairings
• Used to produce the first version of the nominal part of the Spanish WordNet
• Based on WN 1.5• Both directions in bilingual dictionary merged
Spanish/English: 19,443 translation pairs English/Spanish: 16,324 translation pairs Harmonized bilingual: 28,131 translation pairs Overlap with WordNet: 12,665 nouns (14%)
• Two methods: class methods: consider only pairings conceptual distance methods: consider similarity of
synsets
Semi-automatic methods for WN construction 2002 International WordNet Conference - 57
Expand approach - produce all pairings
• Ten class methods Four monosemic criteria Four polysemic criteria Two hybrid criteria
• Three conceptual distance methods CD1: using pairwise word coocurrences CD2: using headword and genus CD3: using bilingual Spanish entries with multiple
translations
Semi-automatic methods for WN construction 2002 International WordNet Conference - 58
SW EW
SW EW
EW
SW EW
EWSW
SW EW
SW
Expand approach - produce all pairings
Class methods:• Four possible configurations for pairs which either
share an English word or an Spanish word: connected graph.
Semi-automatic methods for WN construction 2002 International WordNet Conference - 59
SW EW
SW EW
EW
Synset
Synset
Synset
Synset
SW EW
EWSW
Synset
Synset
SW EW
SW
Expand approach - produce all pairings
4 monosemous class methods:• All English words involved are monosemous in
WNM1
M2
M3
M4
Semi-automatic methods for WN construction 2002 International WordNet Conference - 60
SW EW
SW EW
EW
Synset+
Synset+
Synset+
Synset+
SW EW
EWSW
Synset+
Synset+
SW EW
SW
Expand approach - produce all pairings
4 polysemous class methods:• At least 1 English word involved is polysemous
P1
P2
P3
P4
Semi-automatic methods for WN construction 2002 International WordNet Conference - 61
<..., EW, ..., EW, ...>
SW
<..., headword-EW, ..., Ind-EW, ...>
SW
Expand approach - produce all pairings
2 other class methods• Variant criterion:
two synonyms share a single SW
• Field criterion:use field indicators in bilingual entry when available
VC
FC
Semi-automatic methods for WN construction 2002 International WordNet Conference - 62
Criterion #links #synsets #words %okmono1 3697 3583 3697 92mono2 935 929 661 89mono3 1863 1158 1863 89mono4 2688 1328 2063 85poly1 5121 4887 1992 80poly2 1450 1426 449 75poly3 11687 6611 3165 58poly4 40298 9400 3754 61Variant 3164 2195 2261 85Field 510 379 421 78
Expand approach - produce all pairings
Ten class methods (results)
Semi-automatic methods for WN construction 2002 International WordNet Conference - 63
)c,path(cc kwcwc
21
i2i1k2i2
1i1 )depth(c
1min)w,dist(w
• Using WordNet • Bilingual dictionary
Expand approach - produce all pairings
Conceptual Distance Methods (Agirre et al. 94)• length of the shortest path• specificity of the concepts
Semi-automatic methods for WN construction 2002 International WordNet Conference - 64
Expand approach - produce all pairings
• Three conceptual distance methods CD1: using pairwise word coocurrences from
monolingual dict. CD2: using headword and genus from monolingual def. CD3: using bilingual Spanish entries with multiple
translations
Semi-automatic methods for WN construction 2002 International WordNet Conference - 65
<object, ...>
<artifact, artefact>
<entity>
<structure, construction>
<building, edifice>
<place of worship, ...>
<church, church building>
<abbey>
<house, lodging>
abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
<religious residence, cloiser>
<monastery>
<abbey>
<convent>
<abbey>
Expand approach - produce all pairings
CD2
Semi-automatic methods for WN construction 2002 International WordNet Conference - 66
Criter. #links #synsets #words %okCD - 1 23,828 11,269 7,283 56CD - 2 24,739 12,709 10,300 61CD - 3 4,567 3,089 2,313 75
Expand approach - produce all pairings
Three conceptual distance methods
Semi-automatic methods for WN construction 2002 International WordNet Conference - 67
Expand approach - produce all pairings
• Keep SW-synset pairs produced by methods with precision above 85%
mono1 mono2 mono3 mono4 variant
• But, if two different methods propose the same SW-synset pair, it could get better confidence
try pairwise combinations of methods
Semi-automatic methods for WN construction 2002 International WordNet Conference - 68
method2method1 cd2 cd3 p1 p2 p3 p4cd1 size 15736 1849 2076 556 3146 15105
%ok 79 85 86 86 72 64cd2 size 0 2401 2536 592 3777 13246
%ok 0 86 88 86 75 67cd3 size 0 0 205 180 215 3114
%ok 0 0 95 95 100 77p1 size 0 0 0 0 77 178
%ok 0 0 0 0 100 88p2 size 0 0 0 0 28 78
%ok 0 0 0 0 77 96
Expand approach - produce all pairings
Combinations of methods: higher precision in some cases
Semi-automatic methods for WN construction 2002 International WordNet Conference - 69
Expand approach - produce all pairings
Results• SpWN v 0.1• BasqueWN v 0.1:
2 bilingual dictionaries apply first 8 class methods only
WNs #links #synsets #word #CS #poly links SpWN v0.0 10,982 7,131 8,396 87.4 1,777 Combination 7,244 5,852 3,939 85.6 2,075 SpWN v0.1 15,535 10,786 9,986 86.4 3,373 BasqueWN v0.1 41,107 23,486 22,166 >80.0 -
Semi-automatic methods for WN construction 2002 International WordNet Conference - 70
Expand approach - bilingual senses
• Smaller experiment with French bilingual dictionary• Based on WN 1.5• Keep structure of bilingual dictionary: bilingual
senses 21322 entries, 31502 subentries (senses) 16917 nominal subentries
• Disambiguation is possible: 1) one of the translation words is monosemous in WordNet.2) the translation is given by a list of words.3) a cue in French is provided alongside the translation.4) a semantic field is provided.
folie 1: n.f. madness provision 1: n.f. supply, storetrésor 2: n.m. (ressources) (comm.) finances
Semi-automatic methods for WN construction 2002 International WordNet Conference - 71
translation not in WordNet 891 5%unique translation, n senses 6,440 38%any combination of cases 1,2,3,4 9,586 57%total 16,917 100%
case 1; 1 sense 5,119 30%case 2; more than one translation 958 6%case 3; cue in French 3,702 22%case 4; semantic field 1,365 8%
Expand approach - bilingual senses
• Possible disambiguation case by case
Semi-automatic methods for WN construction 2002 International WordNet Conference - 72
Expand approach - bilingual senses
Disambiguation: Conceptual Density [Agirre & Rigau 95]:
• The relatedness of a certain word-sense to the words in the context (cue, other translations and/or semantic field) allows us to select that sense over the others
• Bilingual dictionary + English WordNetno result 8,311 53%result obtained 7,241 47%case 1; 1 sense 5,119 33%case 2; >1 trans 723 5%case 3; cue 1,399 9%total 15,552 100%
Semi-automatic methods for WN construction 2002 International WordNet Conference - 73
Expand approach - summary
• all pairings coverage and precision produce a good starting point for manual revision
• bilingual senses keeping bilingual sense might help precision very low coverage
Semi-automatic methods for WN construction 2002 International WordNet Conference - 74
• Setting• Words and Works • Merge approach
Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs
• Expand approach Translation of synsets: bilingual MRDs
• Interface for manual revision• Conclusions
Outline
Semi-automatic methods for WN construction 2002 International WordNet Conference - 75
Interface for manual revision
Semi-automatic methods for WN construction 2002 International WordNet Conference - 76
Interface for manual revision
Semi-automatic methods for WN construction 2002 International WordNet Conference - 77
Interface for manual revision
• Client/Server achitecture• Data base:
EWN design implemented on SQL tables English, Spanish, Catalan and Basque
• Interface: Perl CGI’s that access the data bases
Semi-automatic methods for WN construction 2002 International WordNet Conference - 78
• Setting• Words and Works • Merge approach
Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs
• Expand approach Translation of synsets: bilingual MRDs
• Interface for manual revision• Conclusions
Outline
Semi-automatic methods for WN construction 2002 International WordNet Conference - 79
Conclusions
• methods to automatically produce preliminary versions
• methods mainly for nouns• need to manually revise• merge approach
method to produce native hierarchies and word senses trust lexicographer’s hierarchies need to map to ILI in independent process
• expand approach method to translate English WN’s synsets trusts WN’s hierarchies, sense distinctions mapping to ILI for free
Semi-automatic methods for WN construction 2002 International WordNet Conference - 80
Conclusions
• merge approach manual work:
revising and re-organizing the automatic hierarchies (hard)
revising automatic mapping (very hard) allows for integration of data from monolingual
dictionary definition text itself lexico-semantic relations from definitions
• expand approach manual work:
revise proposed translations (fast) review the rest of the synsets (many) include glosses
Semi-automatic methods for WN construction 2002 International WordNet Conference - 81
Conclusions
• Interface to speed up manual work• Downloadable soon:
WN 1.5 in data-base format Interface
• WordNets can be checked at: http://www.lsi.upc.es/~nlp http://ixa.si.ehu.es/wei3.html
• This slides will (shortly) be available at : http:// ... http://www.ji.si.ehu.es/users/eneko
Semi-automatic methods for WN construction 2002 International WordNet Conference - 82
Bibliography
Semi-automatic methods for WordNet construction
German Rigau i Claramunthttp://www.lsi.upc.es/~rigauTALP Research CenterUniversitat Politècnica de Catalunya
Eneko Agirrehttp://www.ji.si.upc.es/users/enekoIxA NLP GroupUniversity of the Basque Country
2002 International WordNet Conference