2nd GWC, January 20th-23rd 2004 - Brno
Extending WordNet
with syntagmatic information
Luisa Bentivogli, Emanuele Pianta
ITC-irst
Overview
• WordNet: paradigmatic vs syntagmatic information
• Recurrent Free Phrases
• Encoding RFP through Phrasets and Syntagmatic Relations
• Getting RFPs in bilingual dictionaries and corpora
• Conclusions
Paradigmatic vs Syntagmatic
An international conference took place in Brno
national symposium
meeting
Prague
Czech Republic
free phrase
multiword expression
semantic restriction
Paradigmatic relations (in absentia)
Syntagmatic relations (in presentia)
Why is syntagmatic info useful
• From a lexicographic point of view
  – See examples of usage in dictionaries (and WN itself)
  – Often a very short phrase
  – Sometimes more useful than definitions
• From a computational point of view
  – statistics oriented, corpus based methods
  – crucial role of co-occurrence information
  – co-occurrence of words vs meanings
Lexical units in WordNet
• Criterion for inclusion in synsets: only lexicalized concepts
• What counts as a lexical unit
  – Simple words: {tree}
  – Idioms
    • non-compositional meaning
    • {rollercoaster, big_dipper, ...}
  – Restricted collocations
    • compositional, reduced substitution, no literal translation
    • {criminal_record, record} (Italian: precedenti penali)
  – Named entities: {Praha, capital_of_the_Czech_Republic, …}
Problems with inclusion criteria - 1
• Artificial nodes: synsets with no lexical unit
  – {social_group}
  – {gruppo_sociale}
  – Free combinations of words (Benson et al., 1986)
    • DEF: a combination of words following only the general rules of syntax
• Restricted collocations:
  – reduced substitution, no literal translation, but compositional
  – ex: circulatory system (*blood system, *circulation system)
  – are they lexical units?
  – should we include them in synsets?
• Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?
Problems with inclusion criteria - 2
• A considerable number of expressions which are systematically used to express a concept are excluded from (Multi)WordNet as they are not lexical units
• Ex: “andare in bicicletta” [to bike]
  – andare: to move by walking or using a means of locomotion
  – in bicicletta: by bike
• Ex: “punta di freccia” [arrowhead]
Introducing Recurrent Free Phrases
• Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept
• 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)
• 2. High frequency (“governo italiano”, Italian government)
• 3. High degree of association (“prima volta”, first time)
• 4. Salience:
  – intuition of the native speaker lexicographer that a certain expression picks up a concept which is perceived as relevant and somehow unitary
  – not necessarily related to frequency and word association
    • “vertice internazionale”, international summit (high salience)
    • “coscia destra”, right thigh
The salience criterion
• Hypothesis:
  – Related to the amount of world knowledge that is attached to a certain phrase
  – Such knowledge cannot be inferred from the meanings composing the phrase
• Example:
  – right hand (more salient)
  – right thigh
Recurrent Free Phrases for NLP
• Knowledge-based word alignment of parallel corpora
  – EX: cornfield ~ campo di grano
• Word Sense Disambiguation
  – campo: 12 senses in MWN
  – grano: 9 senses
  – both unambiguous in “campo di grano”
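The disambiguation idea above can be sketched as a simple phrase lookup. This is only an illustration under stated assumptions: the lexicon `RFP_SENSES`, the function `disambiguate`, and sense labels such as `campo#field` are hypothetical and are not MultiWordNet identifiers or part of the original work.

```python
# Hypothetical RFP lexicon: phrase -> sense label for each content word.
# The sense labels are illustrative placeholders, not real MultiWordNet IDs.
RFP_SENSES = {
    ("campo", "di", "grano"): {"campo": "campo#field", "grano": "grano#corn"},
}


def disambiguate(tokens):
    """Return {token index: sense label} for words covered by a known RFP."""
    senses = {}
    for phrase, labels in RFP_SENSES.items():
        n = len(phrase)
        # Slide a window of the phrase length over the token sequence.
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                for offset, word in enumerate(phrase):
                    if word in labels:
                        senses[i + offset] = labels[word]
    return senses


result = disambiguate(["un", "campo", "di", "grano", "maturo"])
```

Words outside any listed phrase are left untouched, so such a lookup would complement a general WSD method rather than replace it.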
Criteria for RFP selection
• RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another language (lexical gaps)
  – EX: andare in bicicletta [to bike]
• RFPs synonymous with a lexical unit in the same language
  – EX: strofinaccio dei piatti / canovaccio [dishcloth]
• RFPs that are frequent, cohesive and salient within a corpus taken as the reference corpus
  – EX: vertice internazionale [international summit]
• RFPs whose components are highly polysemous
  – EX: campo di grano [cornfield]
MultiWordNet
• MultiWordNet: Italian/English lexical database
• Princeton WordNet building criteria
• Strict alignment (see expand model)
• Explicit treatment of lexical gaps
• Italian (44,000 words) and
  – Hebrew (University of Haifa, just started)
  – Cf. Spanish WordNet (EuroWordNet)
Introducing Phrasets
• Phraset: a set of synonymous recurrent free phrases
ENG-synset {cornfield}
ITA-synset {GAP}
ITA-phraset {campo_di_grano}

ENG-synset {toilet_roll}
ITA-synset {GAP}
ITA-phraset {rotolo_di_carta_igienica}

ENG-synset {dishcloth}
ITA-synset {canovaccio}
ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}
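One way to picture the alignment above is as a record holding an English synset, an Italian synset that may be empty (a gap), and an Italian phraset. The sketch below is a hypothetical data layout chosen for illustration; the class name `AlignedConcept` and its fields are assumptions, not the MultiWordNet schema.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedConcept:
    """One concept aligned across English and Italian (hypothetical layout)."""
    eng_synset: set
    ita_synset: set = field(default_factory=set)   # empty set models a GAP
    ita_phraset: set = field(default_factory=set)  # synonymous RFPs

    def is_gap(self):
        # A lexical gap: no Italian lexical unit expresses the concept.
        return not self.ita_synset

    def italian_expressions(self):
        # Lexical units first, then recurrent free phrases.
        return sorted(self.ita_synset) + sorted(self.ita_phraset)


cornfield = AlignedConcept({"cornfield"}, ita_phraset={"campo_di_grano"})
dishcloth = AlignedConcept(
    {"dishcloth"},
    ita_synset={"canovaccio"},
    ita_phraset={"strofinaccio_dei_piatti", "strofinaccio_da_cucina"},
)
```

Keeping the phraset separate from the synset means the synset can stay restricted to lexical units while the free phrases remain retrievable for the same concept.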
RFPs vs definitions
RFPs are not definitions
E-synset {tree -- a tall perennial woody plant having a main trunk …}
I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato [any perennial plant with a branched woody stem]}
I-phraset {}

E-synset {paperboy}
I-synset {GAP – ragazzo che recapita i giornali [boy who delivers newspapers]}
I-phraset {ragazzo_dei_giornali}

E-synset {straphanger}
I-synset {GAP – chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno [someone who travels standing on public transport, holding on to a support]}
I-phraset {}
Synsets vs Phrasets
• Synsets: simple words, idioms, restricted collocations, named entities
• Phrasets: recurrent free phrases
• Free combinations of words
Syntagmatic Relations in WN
• MEANING project: using the involve semantic relation to encode deep selectional restrictions
• Can RFP be encoded through semantic relations?
Encoding “campagna antifumo” -1
Through phrasets:
Synset: {campagna} Phraset: {} [campaign]
Synset: {GAP} Phraset: {campagna_antifumo} [campaign against smoking]
Relation: {campagna} is the hypernym of the gap synset carrying {campagna_antifumo}
Encoding “campagna antifumo” - 2
Through a semantic relation:
Synset: {campagna} and Synset: {antifumo}, linked by has_constraint [campaign against smoking]
Pros and cons of using semantic rels for encoding RFPs
• Smart and concise
but what about
• trigram RFPs?
• synonymous RFPs?
• RFPs that are translation equivalents of lexical units?
• Restrictions on word order and word morphology?
Taking the best of both encodings
• Phrasets and lexical syntagmatic relations
[Diagram]
Nodes: GAP with phraset "campo di grano" (cornfield); {frumento, grano} (corn); {campo} (field); {cereale} (cereal); {appezzamento} (parcel)
Relations from the phraset: composed-of(campo), composed-of(grano)
Relations between synsets: hypernym (three links)
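The combined encoding can be sketched as a small set of labeled edges. The edge directions below are an assumption read off the slide layout (cornfield under field, field under parcel, corn under cereal); the names `RELATIONS`, `targets`, and `hypernym_chain` are hypothetical helpers introduced only for this illustration.

```python
# Relations as (source, relation, target) triples.
# Edge directions are an assumed reading of the slide's diagram.
RELATIONS = [
    ("campo_di_grano", "composed-of", "campo"),
    ("campo_di_grano", "composed-of", "grano"),
    ("campo_di_grano", "hypernym", "campo"),
    ("grano", "hypernym", "cereale"),
    ("campo", "hypernym", "appezzamento"),
]


def targets(node, relation):
    """All nodes reachable from `node` through one edge labeled `relation`."""
    return [t for s, r, t in RELATIONS if s == node and r == relation]


def hypernym_chain(node):
    """Follow the first hypernym link repeatedly until none is left."""
    chain = []
    current = node
    while True:
        parents = targets(current, "hypernym")
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)


components = targets("campo_di_grano", "composed-of")
chain = hypernym_chain("campo_di_grano")
```

The composed-of edges record which word senses make up the phrase, while the hypernym edges place the phraset's concept in the hierarchy, so both kinds of information survive.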
RFP in Bilingual Dictionaries
• Collins bilingual dictionary (medium size)
• Italian Translation Equivalents (Bentivogli and Pianta, 2000)
  – 92.2% correspond to lexical units
  – 7.8% correspond to free combinations of words (lexical gaps)
• Manual check of 300 lexical gaps
  – 67% correspond to RFPs
=> More than half of the synsets which are gaps in Italian potentially have an associated phraset
RFPs in corpora
• Correlation between RFPs and frequency?
• Analysis of a 32M word corpus (Repubblica, 2000-2001)
• Standard n-gram analysis package (NSP)
• All bigrams including at least one stopword excluded
• 118,464 bigrams occurring more than 3 times
• Highest rank: 5,914 occurrences (“New York”)
• Rank 4: 31,453 bigrams
• 497 distinct ranks (frequency classes)
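The counting step described above can be approximated in a few lines. The study used the NSP package; the sketch below does not use NSP and relies only on Python's standard `collections.Counter`. The stopword list, the function name `count_bigrams`, and the toy token sequence are illustrative assumptions.

```python
from collections import Counter

# Illustrative stopword list; not the list used in the study.
STOPWORDS = {"di", "il", "la", "in", "e", "un", "a"}


def count_bigrams(tokens, stopwords=STOPWORDS, min_count=4):
    """Count adjacent word pairs, skip pairs containing a stopword,
    and keep pairs seen at least `min_count` times (i.e. more than 3)."""
    counts = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1 not in stopwords and w2 not in stopwords
    )
    return {bigram: c for bigram, c in counts.items() if c >= min_count}


# Toy corpus: one sentence repeated four times plus one extra phrase.
tokens = ("il governo italiano e la prima volta " * 4).split()
tokens += ["governo", "italiano"]
frequent = count_bigrams(tokens)
```

On the toy input only “governo italiano” and “prima volta” survive the filters; pairs such as “volta governo” fall below the frequency threshold.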
RFPs in corpora cont.
• Lower ranks are systematically and densely populated
• Higher ranks are sparsely and poorly populated
• Rank groups
  – A: 5,914-509 (100 bigrams)
  – B: 505-257 (257)
  – C: 256-129 (731)
  – D: 128-65 (1,965)
  – E: 64-33 (4,525)
  – F: 32-17 (10,477)
  – G: 16-9 (22,167)
  – H: 8-5 (46,798)
  – I: 4 (31,453)
• Manual check of 100 random bigrams from each rank group
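The grouping and sampling procedure can be sketched as follows. The lower bounds are taken from the rank groups listed above; the function names, the fixed random seed, and the toy counts are assumptions made for the example, and the original sampling procedure is not documented beyond "100 random bigrams from each rank group".

```python
import random

# (group, lowest frequency in the group), from the rank groups on the slide.
# Frequencies 506-508 are not listed on the slide; this sketch puts them in B.
GROUP_LOWER_BOUNDS = [
    ("A", 509), ("B", 257), ("C", 129), ("D", 65), ("E", 33),
    ("F", 17), ("G", 9), ("H", 5), ("I", 4),
]


def rank_group(frequency):
    """Map a bigram frequency to its rank group, or None if below 4."""
    for name, lower in GROUP_LOWER_BOUNDS:
        if frequency >= lower:
            return name
    return None


def sample_per_group(bigram_counts, k=100, seed=0):
    """Draw up to k random bigrams from each rank group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = {}
    for bigram, freq in bigram_counts.items():
        group = rank_group(freq)
        if group is not None:
            groups.setdefault(group, []).append(bigram)
    return {
        g: rng.sample(sorted(items), min(k, len(items)))
        for g, items in groups.items()
    }


toy_counts = {("new", "york"): 5914, ("prima", "volta"): 300, ("x", "y"): 3}
samples = sample_per_group(toy_counts, k=1)
```

Sampling the same number of items from every group keeps the sparsely populated high-frequency groups comparable with the densely populated low-frequency ones.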
RFPs in corpora cont.
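Manual check of 100 random bigrams from each rank group (column headers give the rank group and its highest frequency)

| | A (5,914) | B (505) | C (256) | D (128) | E (64) | F (32) | G (16) | H (8) | I (4) |
|---|---|---|---|---|---|---|---|---|---|
| Lexical units | 82 | 79 | 74 | 65 | 58 | 55 | 42 | 35 | 28 |
| Recurrent free phrases | 14 | 4 | 9 | 14 | 17 | 4 | 15 | 3 | 15 |
| Other | 4 | 17 | 17 | 21 | 25 | 41 | 43 | 58 | 57 |

NB: similar results on trigrams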
Correlation between num. of RFPs and frequency in a reference corpus
[Chart: counts (y-axis, 0 to 90) of Lex Unit, R.F.P. and Other per rank group A to I (x-axis)]
Future work
• Better characterization and classification
• Correlation with association measures
• Evaluating RFPs for WSD
Conclusions
• WordNet is poor in syntagmatic information
• We introduced Recurrent Free Phrases, Phrasets, and syntagmatic lexical relations
• RFP: a free combination of words recurrently used to express a concept
• Criteria for their selection
• Bilingual dictionaries contain many RFPs
• Corpora: no clear correlation with frequency
• Useful for:
  – lexicographic work
  – Word Sense Disambiguation