semi-automatic methods for wordnet construction german rigau i claramunt rigau talp research center...

83
Semi-automatic methods for WordNet construction German Rigau i Claramunt http://www.lsi.upc.es/~rigau TALP Research Center Universitat Politècnica de Catalunya Eneko Agirre http://www.ji.si.upc.es/users/eneko IxA NLP Group University of the Basque Country 2002 International WordNet Conference

Upload: fabiola-tranter

Post on 28-Mar-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WordNet construction

German Rigau i Claramunthttp://www.lsi.upc.es/~rigauTALP Research CenterUniversitat Politècnica de Catalunya

Eneko Agirrehttp://www.ji.si.upc.es/users/enekoIxA NLP GroupUniversity of the Basque Country

2002 International WordNet Conference

Page 2: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 2

• NLP and the Lexicon

Theoretical: WG, GPSG, HPSG. Practical: realistic complexity and coverage

• Lexical bottleneck (Briscoe 91)

Even worse for languages other than English

Setting

Page 3: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 3

• Which LK is needed by a concrete NLP system?

• Where is this LK located?

• Which procedures can be applied?

Setting

Page 4: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 4

• Which LK is needed by a concrete NLP system?

Phonology: phonemes, stress, etc. Morphology: POS, etc. Syntactic: category, subcat., etc. Semantic: class, SRs, etc. Pragmatic: usage, registers, TDs, etc. Translations: translation links

Setting

Page 5: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 5

• Where is this LK located?

Human brain Structured Lexical Resources:

Monolingual and bilingual MRDs Thesauri

Unstructured Lexical Resources: Monolingual and bilingual Corpora

Mixing resources

Setting

Page 6: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 6

• Which procedures can be applied?

Prescriptive approach Machine-aided manual construction

Descriptive approach Automatic acquisition from pre-existing

Lexical Resources

Mixed approach

Setting

Page 7: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 7

• Setting• Words and Works • Merge approach

Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs

• Expand approach Translation of synsets: bilingual MRDs

• Interface for manual revision• Conclusions

Outline

Page 8: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 8

Words and WorksWhere is this Lexical Knowledge located?

• Human brain: Linguistic String Project (Fox et al. 88)

Lexical Information for 10,000 entries

WordNet (Miller et al. 90) Semantic Information v1.6 with 99,642 synsets

Comlex (Grishman et al. 94) Syntactic information 38,000 English words

CYC Ontology (Lenat 95) a person-century of effort to produce 100,000

terms

LDOCE3-NLP dictionary with 80,000 senses

Page 9: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 9

• Structured Lexical Resources Monolingual MRDs:

LDOCE• learner’s dictionary• 35,956 entries and 76,059 definitions• 86% semantic and 44% pragmatic codes• controlled vocabulary of 2,000 words• (Boguraev & Briscoe 89)• (Vossen & Serail 90)• (Bruce & Guthrie 92), (Wilks et al. 93)• (Dolan et al. 93), (Richardson 97)

Words and WorksWhere is this Lexical Knowledge located?

Page 10: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 10

• Structured Lexical Resources Other Monolingual MRDs:

Webster’s (Jensen & Ravin 87) LPPL (Artola 93) DGILE (Castellón 93), (Taulé 95), (Rigau 98) CIDE (Harley & Glennon 97) AHD (Richardson 97) WordNet (Harabagiu 98)

Bilingual MRDs Collins Spanish/English (Knigth & Luk 94) Vox/Harrap’s Spanish/English (Rigau 98)

Words and WorksWhere is this Lexical Knowledge located?

Page 11: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 11

• Structured Lexical Resources Thesauri:

Roget’s Thesaurus • 60,071 words in 1,000 categories• (Yarowsky 92), (Grefenstette 93), (Resnik 95)

Roget’s II and The New Collins Thesaurus

• (Byrd 89)

Macquarie’s thesaurus• (Grefenstette 93)

Bunrui Goi Hyou Japanese thesaurus• (Utsuro et al. 93)

Words and WorksWhere is this Lexical Knowledge located?

Page 12: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 12

• Structured Lexical Resources Encyclopaedia

Grolier’s Encyclopaedia (Yarowsky 92) Encarta (Richardson et al. 98)

Others Telephonic Guides

Mixing structured lexical resources Roget’s Thesaurus and Grolier’s (Yarowsky 92) LDOCE, WN, Collins, ONTOS, UM (Knight & Luk

94) Japanese MRD to WN (Okumura & Hovy 94) LLOCE, LDOCE (Chen & Chang 98)

Words and WorksWhere is this Lexical Knowledge located?

Page 13: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 13

• Unstructured Lexical Resources Corpora:

WSJ, Brown Corpus (SemCor), Hansard Proper Nouns (Hearst & Schütze 95) Idiosyncratic Collocations (Church et al. 91) Preposition preferences (Resnik and Hearst 93) Subcategorization structures (Briscoe and Carroll

97) Selectional restrictions (Resnik 93), (Ribas 95) Thematic structure (Basili et al. 92) Word semantic classes (Dagan et al. 94) Bilingual Lexicons for MT (Fung 95)

Words and WorksWhere is this Lexical Knowledge located?

Page 14: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 14

• Using both structured and non-structured Lexical Resources MRDs and Corpora

(Liddy & Paik 92) (Klavans & Tzoukermann 96)

WordNet and Corpora (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy

01)

Words and WorksWhere is this Lexical Knowledge located?

Page 15: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 15

• Japanese Projects EDR (Yokoi 95)

Nine years project oriented to MT Bilingual Corpora with 250,000 words Monolingual, bilingual and coocurrence

dictionaries 200,000 general vocabulary 100,000 technical terminology 400,000 concepts

Words and WorksInternational Projects on Lexical Acquisition

Page 16: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 16

• American Projects Comlex (Grishman et al. 94)

Syntactic information for 38,000 words

WordNet (Miller 90) Semantic Information more than 123,000 words organised in 99,000 synsets more than 116,000 relations between synsets

Pangloss (Knight & Luk 94) PUM, ONTOS, LDOCE semantic categories, WordNet

Cyc (Lenat 95) common-sense knowledge 100,000 concepts and 1,000,000 axioms

Words and Works International Projects on Lexical Acquisition

Page 17: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 17

• European Projects Acquilex I and II

LA from monolingual and bilingual MRDs and corpora

LE-Parole Large-scale harmonised set of corpora and

lexicons for all the EU languages

EuroWordNet Multilingual WordNet for several European

Languages

Meaning Large-scale of LK from the web Large-scale WSD

Words and Works International Projects on Lexical Acquisition

Page 18: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 18

• Syntactic Disambiguation (Dolan et al. 93)• Semantic Processing (Vanderwende 95)• WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98)• IR (Krovetz & Croft 92)• MT (Knight and Luk 94), (Tanaka & Umemura 94)• Semantically enriching MRDs

(Yarowsky 92), (Knight 93), (Chen & Chan 98)

• Building LKBs (Bruce & Guthrie 92) (Dolan et al. 93) (Artola 93) (Castellón 93), (Taulé 95), (Rigau 98)

Words and WorksLexical Acquisition from MRDs

Page 19: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 19

• This tutorial focus on: the massive acquisition of LK from MRDs (conventional, in any language) using (semi) automatic methodologies

• Why MRDs?

The conventional dictionaries for human use usually “contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propiety; ethimological, syntactic and semantic information about the most basic units of the language” (Amsler 81)

Words and WorksAcquisition of LK from MRDs

Page 20: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 20

• Conventional dictionaries are not systematic

• Dictionaries are built for human use

• Implicit Knowledge

words are described/translated in terms of words

Words and WorksMain Problems of MRDs

Page 21: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 21

jardín_1_1 Terreno donde se cultivan plantas y flores ornamentales.

florero_1_4 Maceta con flores.ramo_1_3 Conjunto natural o artificial de flores, ramas o

hierbas.pétalo_1_1 Hoja que forma la corola de la flor. tálamo_1_3 Receptáculo de la flor. miel_1_1 Substancia viscosa y muy dulce que elaboran las abejas, en

una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales.

florería_1_1 Floristería; tienda o puesto donde se venden flores. florista_1_1 Persona que tiene por oficio hacer o vender flores.camelia_1_1 Arbusto cameliáceo de jardín, originario de Oriente,

de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica).

camelia_1_2 Flor de este arbusto. rosa_1_1 Flor del rosal.

Words and WorksMRDs and Semantic Knowledge

Page 22: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 22

• Setting• Words and Works• Merge approach

Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs

• Expand approach Translation of synsets: bilingual MRDs

• Interface for manual revision• Conclusions

Outline

Page 23: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 23

MRD1

MRDn

LDB1 Tax1

LDBn Taxn

MLKB... ... ...

LKB1

LKBn

...

Merge approachMain Methodology

Page 24: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 24

• Taxonomy construction: (Rigau et al. 98, 97) monolingual MRDs Step 1: Selection of the main top beginners

for a semantic primitive Step 2: Exploiting genus, construction of

taxonomies for each semantic primitive

• Mapping taxonomies: (Daudé et al. 99) bilingual MRDs Step 3: Creation of translation links

Merge approachMain Methodology

Page 25: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 25

• Problems following a pure descriptive approachCircularityErrors and inconsistenciesDefinitions with omitted genus

• Top dictionary senses do not usually represent useful knowledge for the LKBToo generalToo specific

Merge approach: Taxonomy ConstructionMethodology

Page 26: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 26

Mixed Methodology

Prescriptive approach Manual construction of the Top Structure

Merge approach: Taxonomy ConstructionMethodology

Page 27: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 27

Mixed Methodology

Descriptive approach Acquiring implicit information from MRDs

Prescriptive approach Manual construction of the Top Structure

Merge approach: Taxonomy Construction Methodology

Page 28: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 28

Mixed Methodology

Descriptive approach Acquiring implicit information from MRDs

Prescriptive approach Manual construction of the Top Structure

Merge approach: Taxonomy Construction Methodology

Page 29: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 29

Word sense: zumo_1_1 Attached-to: c_art_subst type.Definition: líquido que se extrae de las flores,

hierbas, frutos, etc. (liquid extracted from

flowers, herbs, fruits, etc).

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 30: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 30

• A) Attaching DGILE senses to semantic primitives 1) First labelling:

Conceptual Distance (Rigau 94)

2) Second labelling: Salient Words (Yarowsky 92)

• B) Filtering Process

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 31: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 31

• A.1) First labelling: Conceptual Distance (Agirre et al. 94)

length of the shortest path specificity of the concepts

)c,path(cc kwcwc

21

i2i1k2i2

1i1 )depth(c

1min)w,dist(w

using WordNet Bilingual dictionary

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 32: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 32

<object, ...>

<artifact, artefact>

<entity><entity>

<structure, construction>

<building, edifice>

<place of worship, ...>

<church, church building>

<abbey>

<house, lodging>

abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)

<religious residence, cloiser>

<monastery>

<abbey>

<convent>

<abbey>

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 33: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 33

<object, ...>

<artifact, artefact>

<entity>

<structure, construction>

<building, edifice>

<place of worship, ...>

<church, church building>

<abbey> 06 ARTIFACT

<house, lodging>

abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)

<religious residence, cloiser>

<monastery>

<abbey>

<convent>

<abbey>

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 34: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 34

• A.1) First labelling (Results)

29,205 labelled definitions (31%)61% accuracy at a sense level64% accuracy at a file level

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 35: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 35

• A.2) Second labelling: Salient Words (Yarowsky 92)

Pr(w)

SC)|Pr(wlogSC)|Pr(wSC)AR(w, 2

Importance local frequency appears more significantly more often in

the corpus of a semantic category than at other points in the whole corpus

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 36: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 36

• A.2) Second labelling (Results):

86,759 labelled definitions (93%)80% accuracy at a file level

biberón_1_1 ARTIFACT 4.8399 Frasco de cristal ... (glass flask ...)

biberón_1_2 FOOD 7.4443 Leche que contiene este frasco ... (milk contained in that flask ...)

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 37: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 37

• B) Filtering process (FOODs)

removes all genus terms

FILTER 1: not FOODs by the bilingual mapping

FILTER 2: appear more often as genus in other Semantic Primitive

FILTER 3: with a low frequency

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 38: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 38

• B) Filtering process (FOOD Results)

FILTER 1 FILTER 2LABEL2 #GT Accuracy #GT AccuracyLABEL2+F3>9 31 94% 31 100%LABEL2+F3>8 35 95% 35 100%LABEL2+F3>7 37 91% 37 95%LABEL2+F3>6 43 92% 41 94%LABEL2+F3>5 49 92% 47 92%LABEL2+F3>4 55 91% 56 91%LABEL2+F3>3 64 85% 65 87%LABEL2+F3>2 85 82% 82 83%LABEL2+F3>1 125 78% 123 82%

Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners

Page 39: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 39

Word sense: vino_1_1 Hypernym: zumo_1_1.Definition: zumo de uvas fermentado.

(fermented juice of grapes).

Word sense: rueda_2_1 Hypernym: vino_1_1.Definition: vino procedente de la región de

Rueda (Valladolid). (wine from the region of Rueda).

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 40: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 40

• Genus Sense Identification97% accuracy for nouns

• Genus Sense DisambiguationUnrestricted WSD (coverage 100%)Knowledge-based WSD (not supervised)Eight Heuristics (McRoy 92)

Combining several lexical resources Combining several methods

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 41: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 41

Results:Results:

Polysemous OverallPrec. Cov. Prec. Cov.

Heuristic 1: Monosemous Genus Term - - 100% 16%Heuristic 2: Entry Sense Ordering 70% 100% 75% 100%Heuristic 3: Explicit Semantic Domain 100% 1% 100% 2%Heuristic 4: Word Matching 72% 61% 79% 56%Heuristic 5: Simple Concordance 57% 100% 65% 95%Heuristic 6: Cooccurrence Vectors 60% 100% 66% 97%Heuristic 7: Semantic Vectors 58% 99% 63% 94%Heuristic 8: Conceptual Distance 49% 95% 57% 89%Sum 79% 100% 83% 100%

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 42: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 42

Knowledge provided by each heuristic:Knowledge provided by each heuristic:

OverallPrec. Cov.

- Heuristic 1: Monosemous Genus Term 79% 100%- Heuristic 2: Entry Sense Ordering 72% 100%- Heuristic 3: Explicit Semantic Domain 82% 98%- Heuristic 4: Word Matching 81% 100%- Heuristic 5: Simple Concordance 81% 100%- Heuristic 6: Cooccurrence Vectors 81% 100%- Heuristic 7: Semantic Vectors 81% 100%- Heuristic 8: Conceptual Distance 77% 100%Sum 83% 100%

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 43: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 43

F2+F3>9: 35,099 definitionsF2+F3>9: 35,099 definitions

F2+F3>4: 40,754 definitionsF2+F3>4: 40,754 definitions

No filters: 111,624 definitionsNo filters: 111,624 definitions

FOOD [Castellón 93] F2+F3>9 F2+F3>4Genus terms 63 33 68Dictionary senses 392 952 1,242Levels 6 5 6Senses in level 1 2 18 48Senses in level 2 67 490 604Senses in level 3 88 379 452Senses in level 4 67 44 65Senses in level 5 87 21 60Senses in level 6 6 0 13

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 44: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 44

...zumo_1_1 vino_1_1 quianti_1_1 zumo_1_1 vino_1_1 raya_1_8 zumo_1_1 vino_1_1 requena_1_1 zumo_1_1 vino_1_1 reserva_1_12 zumo_1_1 vino_1_1 ribeiro_1_1 zumo_1_1 vino_1_1 rioja_1_1 zumo_1_1 vino_1_1 roete_1_1 zumo_1_1 vino_1_1 rosado_1_3 zumo_1_1 vino_1_1 rueda_2_1 zumo_1_1 vino_1_1 sherry_1_1 zumo_1_1 vino_1_1 tarragona_1_1 zumo_1_1 vino_1_1 tintilla_1_1 zumo_1_1 vino_1_1 tintorro_1_1 zumo_1_1 vino_1_1 toro_3_1...

Merge approach: Taxonomy Construction Step 2: Exploiting Genus

Page 45: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 45

C1C1

C2C2

C3C3

C5C5

C6C6

C4C4

Merge approach: Mapping Taxonomies Step 3: Creation of translation links

Page 46: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 46

C1C1

C2C2

C3C3

C5C5

C6C6

C4C4

Merge approach: Mapping Taxonomies Step 3: Creation of translation links

Page 47: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 47

• Connecting already existing HierarchiesRelaxation labelling AlgorithmConstraints

• BetweenSpanish taxonomy automatically

derived from an MRD (Rigau et al. 98)

WordNetusing a bilingual MRD

Merge approach: Mapping Taxonomies Step 3: Creation of translation links

Page 48: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 48

animal

ave

rapaz

faisán

(Tops <animal, animate_being, ...>)

(person <beast, brute, ...>)(person <dunce, blockhead, ...>)

(animal <bird>)

(artifact <bird, shuttle, ...>)

(food <fowl, poultry, ...>)

(person <dame, doll, ...>)

(animal <pheasant>)

(food <pheasant>)

(animal <bird>)

(artifact <bird, shuttle, ...>)

(food <fowl, poultry, ...>)

(person <dame, doll, ...>)

Merge approach: Mapping Taxonomies Step 3: Creation of translation links

Page 49: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 49

• Iterative algorithm for function optimisation based on local information

• it can deal with any kind of constraints variables (senses of the taxonomy) labels (synsets)

• Finds a weight assignment for each possible label for each variable weights for the labels of the same variable

add up to one weight assignation satisfies -to the

maximum possible extent- the set of constraints

Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm

Page 50: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 50

1) Start with a random weight assignment

2) Compute the support value for each label of each variable (according to the constraints)

3) Increase the weights of the labels more compatible with context and decrease those and decrease those of the less compatible labels.

4) If a stopping/convergence is satisfied, stop,otherwise go to step 2.

Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm

Page 51: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 51

• Rely on the taxonomy structure• Coded with three characters

X: Spanish Taxonomy, I (immediate), Y: English Taxonomy, A (ancestor) X: Relation, E (hypernym), O (hyponym), B (both)

• Examples:

IIEIIE AABAAB

++ ++

++++

Merge approach: Mapping Taxonomies Step 3: Constraints

Page 52: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 52

Poly TOK, FOK TOK, FNOK total

animal 279 (90%) 30 (91%) 209 (90%)food 166 (94%) 3 (100%) 169 (94%)cognition 198 (67%) 27 (90%) 225 (69%)communication 533 (77%) 40 (97%) 573 (78%)

all TOK, FOK TOK, FNOK total

animal 424 (93%) 62 (95%) 486 (90%)food 166 (94%) 83 (100%) 249 (96%)cognition 200 (67%) 245 (90%) 445 (82%)communication 536 (77%) 234 (97%) 760 (81%)

Merge approach: Mapping Taxonomies Step 3: Results

Page 53: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 53

piel

visón

marta

(substance <skin, fur, peel>)

(substance <sable, marte, coal_back>)

(substance <mink, mink_coat>)

Merge approach: Mapping Taxonomies Step 3: Example

Page 54: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 54

• Setting• Words and Works • Merge approach

Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs

• Expand approach Translation of synsets: bilingual MRDs

• Interface for manual revision• Conclusions

Outline

Page 55: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 55

Expand approach

• Take one WordNet as starting point• Translate synsets:

English: <car, automobile> Basque: <auto, berebil>

• We obtain a structurally similar WordNet in another language, but some of the synsets will be missing

• Use bilingual dictionarymaintien n.m. (attitude) bearing; (conservation) maintenance1. Keep bilingual senses (Agirre & Rigau 95)

maintien1: (attitude) bearing maintien2: (conservation) maintenance

2. Produce all translation pairs (Atserias et al. 97)maintien - bearing

maintien - maintenance

Page 56: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 56

Expand approach - produce all pairings

• Used to produce the first version of the nominal part of the Spanish WordNet

• Based on WN 1.5• Both directions in bilingual dictionary merged

Spanish/English: 19,443 translation pairs English/Spanish: 16,324 translation pairs Harmonized bilingual: 28,131 translation pairs Overlap with WordNet: 12,665 nouns (14%)

• Two methods: class methods: consider only pairings conceptual distance methods: consider similarity of

synsets

Page 57: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 57

Expand approach - produce all pairings

• Ten class methods Four monosemic criteria Four polysemic criteria Two hybrid criteria

• Three conceptual distance methods CD1: using pairwise word coocurrences CD2: using headword and genus CD3: using bilingual Spanish entries with multiple

translations

Page 58: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 58

SW EW

SW EW

EW

SW EW

EWSW

SW EW

SW

Expand approach - produce all pairings

Class methods:• Four possible configurations for pairs which either

share an English word or an Spanish word: connected graph.

Page 59: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 59

SW EW

SW EW

EW

Synset

Synset

Synset

Synset

SW EW

EWSW

Synset

Synset

SW EW

SW

Expand approach - produce all pairings

4 monosemous class methods:• All English words involved are monosemous in

WNM1

M2

M3

M4

Page 60: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 60

SW EW

SW EW

EW

Synset+

Synset+

Synset+

Synset+

SW EW

EWSW

Synset+

Synset+

SW EW

SW

Expand approach - produce all pairings

4 polysemous class methods:• At least 1 English word involved is polysemous

P1

P2

P3

P4

Page 61: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 61

<..., EW, ..., EW, ...>

SW

<..., headword-EW, ..., Ind-EW, ...>

SW

Expand approach - produce all pairings

2 other class methods• Variant criterion:

two synonyms share a single SW

• Field criterion:use field indicators in bilingual entry when available

VC

FC

Page 62: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 62

Criterion #links #synsets #words %okmono1 3697 3583 3697 92mono2 935 929 661 89mono3 1863 1158 1863 89mono4 2688 1328 2063 85poly1 5121 4887 1992 80poly2 1450 1426 449 75poly3 11687 6611 3165 58poly4 40298 9400 3754 61Variant 3164 2195 2261 85Field 510 379 421 78

Expand approach - produce all pairings

Ten class methods (results)

Page 63: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 63

)c,path(cc kwcwc

21

i2i1k2i2

1i1 )depth(c

1min)w,dist(w

• Using WordNet • Bilingual dictionary

Expand approach - produce all pairings

Conceptual Distance Methods (Agirre et al. 94)• length of the shortest path• specificity of the concepts

Page 64: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 64

Expand approach - produce all pairings

• Three conceptual distance methods CD1: using pairwise word coocurrences from

monolingual dict. CD2: using headword and genus from monolingual def. CD3: using bilingual Spanish entries with multiple

translations

Page 65: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 65

<object, ...>

<artifact, artefact>

<entity>

<structure, construction>

<building, edifice>

<place of worship, ...>

<church, church building>

<abbey>

<house, lodging>

abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)

<religious residence, cloiser>

<monastery>

<abbey>

<convent>

<abbey>

Expand approach - produce all pairings

CD2

Page 66: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 66

Criter. #links #synsets #words %okCD - 1 23,828 11,269 7,283 56CD - 2 24,739 12,709 10,300 61CD - 3 4,567 3,089 2,313 75

Expand approach - produce all pairings

Three conceptual distance methods

Page 67: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 67

Expand approach - produce all pairings

• Keep SW-synset pairs produced by methods with precision above 85%

mono1 mono2 mono3 mono4 variant

• But, if two different methods propose the same SW-synset pair, it could get better confidence

try pairwise combinations of methods

Page 68: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 68

method2method1 cd2 cd3 p1 p2 p3 p4cd1 size 15736 1849 2076 556 3146 15105

%ok 79 85 86 86 72 64cd2 size 0 2401 2536 592 3777 13246

%ok 0 86 88 86 75 67cd3 size 0 0 205 180 215 3114

%ok 0 0 95 95 100 77p1 size 0 0 0 0 77 178

%ok 0 0 0 0 100 88p2 size 0 0 0 0 28 78

%ok 0 0 0 0 77 96

Expand approach - produce all pairings

Combinations of methods: higher precision in some cases

Page 69: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 69

Expand approach - produce all pairings

Results• SpWN v 0.1• BasqueWN v 0.1:

2 bilingual dictionaries apply first 8 class methods only

WNs #links #synsets #word #CS #poly links SpWN v0.0 10,982 7,131 8,396 87.4 1,777 Combination 7,244 5,852 3,939 85.6 2,075 SpWN v0.1 15,535 10,786 9,986 86.4 3,373 BasqueWN v0.1 41,107 23,486 22,166 >80.0 -

Page 70: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 70

Expand approach - bilingual senses

• Smaller experiment with French bilingual dictionary• Based on WN 1.5• Keep structure of bilingual dictionary: bilingual

senses 21322 entries, 31502 subentries (senses) 16917 nominal subentries

• Disambiguation is possible: 1) one of the translation words is monosemous in WordNet.2) the translation is given by a list of words.3) a cue in French is provided alongside the translation.4) a semantic field is provided.

folie 1: n.f. madness provision 1: n.f. supply, storetrésor 2: n.m. (ressources) (comm.) finances

Page 71: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 71

translation not in WordNet 891 5%unique translation, n senses 6,440 38%any combination of cases 1,2,3,4 9,586 57%total 16,917 100%

case 1; 1 sense 5,119 30%case 2; more than one translation 958 6%case 3; cue in French 3,702 22%case 4; semantic field 1,365 8%

Expand approach - bilingual senses

• Possible disambiguation case by case

Page 72: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 72

Expand approach - bilingual senses

Disambiguation: Conceptual Density [Agirre & Rigau 95]:

• The relatedness of a certain word-sense to the words in the context (cue, other translations and/or semantic field) allows us to select that sense over the others

• Bilingual dictionary + English WordNetno result 8,311 53%result obtained 7,241 47%case 1; 1 sense 5,119 33%case 2; >1 trans 723 5%case 3; cue 1,399 9%total 15,552 100%

Page 73: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 73

Expand approach - summary

• all pairings coverage and precision produce a good starting point for manual revision

• bilingual senses keeping bilingual sense might help precision very low coverage

Page 74: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 74

• Setting• Words and Works • Merge approach

Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs

• Expand approach Translation of synsets: bilingual MRDs

• Interface for manual revision• Conclusions

Outline

Page 75: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 75

Interface for manual revision

Page 76: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 76

Interface for manual revision

Page 77: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 77

Interface for manual revision

• Client/Server achitecture• Data base:

EWN design implemented on SQL tables English, Spanish, Catalan and Basque

• Interface: Perl CGI’s that access the data bases

Page 78: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 78

• Setting• Words and Works • Merge approach

Taxonomy construction: monolingual MRDs Mapping taxonomies: bilingual MRDs

• Expand approach Translation of synsets: bilingual MRDs

• Interface for manual revision• Conclusions

Outline

Page 79: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 79

Conclusions

• methods to automatically produce preliminary versions

• methods mainly for nouns• need to manually revise• merge approach

method to produce native hierarchies and word senses trust lexicographer’s hierarchies need to map to ILI in independent process

• expand approach method to translate English WN’s synsets trusts WN’s hierarchies, sense distinctions mapping to ILI for free

Page 80: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 80

Conclusions

• merge approach manual work:

revising and re-organizing the automatic hierarchies (hard)

revising automatic mapping (very hard) allows for integration of data from monolingual

dictionary definition text itself lexico-semantic relations from definitions

• expand approach manual work:

revise proposed translations (fast) review the rest of the synsets (many) include glosses

Page 81: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 81

Conclusions

• Interface to speed up manual work• Downloadable soon:

WN 1.5 in data-base format Interface

• WordNets can be checked at: http://www.lsi.upc.es/~nlp http://ixa.si.ehu.es/wei3.html

• This slides will (shortly) be available at : http:// ... http://www.ji.si.ehu.es/users/eneko

Page 82: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WN construction 2002 International WordNet Conference - 82

Bibliography

Page 83: Semi-automatic methods for WordNet construction German Rigau i Claramunt rigau TALP Research Center Universitat Politècnica de Catalunya

Semi-automatic methods for WordNet construction

German Rigau i Claramunthttp://www.lsi.upc.es/~rigauTALP Research CenterUniversitat Politècnica de Catalunya

Eneko Agirrehttp://www.ji.si.upc.es/users/enekoIxA NLP GroupUniversity of the Basque Country

2002 International WordNet Conference