Guest lecture, Language Engineering Applications, February, 26th 2009, Leuven
From WordNet, to EuroWordNet,
to the Global Wordnet Grid: anchoring languages to universal meaning
Piek Vossen
VU University Amsterdam
Overview
• Wordnet
• EuroWordNet
• Global Wordnet Grid
• Stevin project Cornetto
• 7th Framework project KYOTO
WordNet
http://wordnet.princeton.edu/
• Lexical semantic database for English
• Developed by George Miller and his team at Princeton University, as the implementation of a mental model of the lexicon
• Organized around the notion of a synset: a set of synonyms in a language that represent a single concept
• Semantic relations between concepts (synsets) and not between words
• Currently covers over 117,000 concepts (synsets) and over 150,000 English words
Relational model of meaning
[Figure: the words man, woman, boy, girl, cat, kitten, dog, puppy and animal, arranged as a network of related meanings]
Wordnet: a network of semantically related words
{car; auto; automobile; machine; motorcar}
{conveyance;transport}
{vehicle}
{motor vehicle; automotive vehicle}
{cruiser; squad car; patrol car; police car; prowl car}
{cab; taxi; hack; taxicab}
{bumper}
{car door}
{car window}
{car mirror} {armrest}
{doorlock}
{hinge; flexible joint}
Relations shown in the figure: hyper(o)nym, hyponym, meronyms
Hyponymy and meronymy relations are:
• transitive
• directed
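A minimal sketch of what "transitive" and "directed" mean for hyponymy. The chain below is an assumption reconstructed from the synsets named on the slide (one word per synset); it is not a dump of WordNet, and the function names are hypothetical.

```python
# Toy hypernym links (child -> parent). The exact chain is an assumption
# reconstructed from the synsets on the slide, not real WordNet data.
HYPERNYM = {
    "taxi": "car",
    "police car": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
}

def hypernym_chain(word):
    """Follow hypernym links upward and return every ancestor in order."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_a(word, ancestor):
    """Transitive: any ancestor counts. Directed: the reverse does not hold."""
    return ancestor in hypernym_chain(word)

print(hypernym_chain("taxi"))
print(is_a("taxi", "vehicle"), is_a("vehicle", "taxi"))
```

A taxi is a vehicle because the links compose, but a vehicle is not a taxi because the links only run one way.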
Wordnet Semantic Relations
WN 1.5 starting point
The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
Relations between synsets:
Relation    Type                    Example
HYPONYMY    noun-to-noun            car / vehicle
            verb-to-verb            walk / move
MERONYMY    noun-to-noun            head / nose
ANTONYMY    adjective-to-adjective  good / bad
            verb-to-verb            open / close
ENTAILMENT  verb-to-verb            buy / pay
CAUSE       verb-to-verb            kill / die
Wordnet Data Model
[Figure: three columns, Vocabulary of a language, Concepts, Relations]
Vocabulary of a language: bank; fiddle, violin; violist, fiddler; string
Concepts:
rec: 12345 - financial institute
rec: 54321 - side of a river
rec: 9876 - small string instrument
rec: 65438 - musician playing violin
rec: 42654 - musician
rec: 25876 - string instrument
rec: 35576 - string of instrument
rec: 29551 - underwear
Relations: type-of, type-of, part-of
Words are linked to concept records by numbered senses (1, 2); the links are labelled polysemy, polysemy & synonymy, and polysemy.
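The data model can be sketched as a many-to-many table between words and concept records. Which word points at which record is an assumption read off the slide's layout, and the helper names are hypothetical.

```python
from collections import defaultdict

# (word, concept record) pairs. The pairing is an assumption based on the
# slide; record numbers are the ones shown there.
WORD_SENSES = [
    ("bank", 12345), ("bank", 54321),
    ("fiddle", 9876), ("violin", 9876),
    ("violist", 65438), ("fiddler", 65438),
    ("string", 35576), ("string", 29551),
]

senses_of = defaultdict(set)   # word -> concept records
words_of = defaultdict(set)    # concept record -> words (the synset)
for word, record in WORD_SENSES:
    senses_of[word].add(record)
    words_of[record].add(word)

def is_polysemous(word):
    """A word is polysemous when it points at more than one concept."""
    return len(senses_of[word]) > 1

def synonyms(word):
    """Words sharing at least one concept record with the given word."""
    result = set()
    for record in senses_of[word]:
        result |= words_of[record]
    return result - {word}

print(is_polysemous("bank"), synonyms("fiddle"))
```

Polysemy is one word with several records; synonymy is one record with several words.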
Some observations on Wordnet
• synsets are more compact representations for concepts than word meanings in traditional lexicons
• synonyms and hypernyms are substitutional variants:
– begin – commence
– I once had a canary. The bird got sick. The poor animal died.
• hyponymy and meronymy chains are important transitive relations for predicting properties and explaining textual properties:
object -> artifact -> vehicle -> 4-wheeled vehicle -> car
• strict separation of part of speech although concepts are closely related (bed – sleep) and are similar (dead – death)
• lexicalization patterns reveal important mental structures
Lexicalization patterns
[Figure: hypernym chains running from the 25 unique beginners down to specific words. Nodes shown: entity, object, organism, animal, bird, canary, common canary, crocodile, dog, plant, tree, flower, rose, artifact, building, church, abbey, garbage, waste, threat]
basic level concepts:
• balance of two principles:
– predict most features
– apply to most subclasses
• where most concepts are created
• amalgamate most parts
• most abstract level to draw a picture
Wordnet top level
Meronymy & pictures
[Figure: picture with the parts beak, tail and leg labelled]
Wordnet 3.0 statistics
POS        Unique Strings  Synsets  Total Word-Sense Pairs
Noun              117,798   82,115                 146,312
Verb               11,529   13,767                  25,047
Adjective          21,479   18,156                  30,002
Adverb              4,481    3,621                   5,580
Totals            155,287  117,659                 206,941
Wordnet 3.0 statistics
POS        Monosemous Words and Senses  Polysemous Words  Polysemous Senses
Noun                           101,863            15,935             44,449
Verb                             6,277             5,252             18,770
Adjective                       16,503             4,976             14,399
Adverb                           3,748               733              1,832
Totals                         128,391            26,896             79,450
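A small worked calculation on these figures, as a sketch: monosemous plus polysemous words gives the number of unique strings per part of speech, and polysemous senses divided by polysemous words gives the average ambiguity of an ambiguous word. The function names are hypothetical; the numbers are copied from the slide.

```python
# (monosemous words, polysemous words, polysemous senses) per part of speech,
# copied from the WordNet 3.0 statistics slide.
STATS = {
    "Noun": (101_863, 15_935, 44_449),
    "Verb": (6_277, 5_252, 18_770),
    "Adjective": (16_503, 4_976, 14_399),
    "Adverb": (3_748, 733, 1_832),
}

def unique_strings(pos):
    """Every word form is either monosemous or polysemous."""
    mono, poly_words, _ = STATS[pos]
    return mono + poly_words

def senses_per_polysemous_word(pos):
    """Average number of senses carried by an ambiguous word."""
    _, poly_words, poly_senses = STATS[pos]
    return poly_senses / poly_words

for pos in STATS:
    print(pos, unique_strings(pos), round(senses_per_polysemous_word(pos), 2))
```

On these numbers, ambiguous verbs carry more senses on average than ambiguous nouns.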
http://www.visuwords.com
Usage of Wordnet
• Most widely used database in language technology
• Enormous impact on language technology development
• Large
• Free and downloadable
• English
Usage of Wordnet
• Improve recall of text-based analysis:
– Query -> Index
  • Synonyms: commence – begin
  • Hypernyms: taxi -> car
  • Hyponyms: car -> taxi
  • Meronyms: trunk -> elephant
  • Lexical entailments: gun -> shoot
• Inferencing:
– what things can burn?
• Expression in language generation and translation:
– alternative words and paraphrases
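The Query -> Index step can be sketched as query expansion. The relation tables below are hand-made from the slide's own examples; a real system would read them from a wordnet, and the function and flag names are hypothetical.

```python
# Hand-made relation tables using the examples on the slide.
SYNONYMS = {"commence": {"begin"}, "begin": {"commence"}}
HYPERNYMS = {"taxi": {"car"}}
HYPONYMS = {"car": {"taxi"}}

def expand_query(terms, use_hypernyms=False, use_hyponyms=True):
    """Add related words so the query matches more index entries."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
        if use_hyponyms:
            expanded |= HYPONYMS.get(term, set())
        if use_hypernyms:
            expanded |= HYPERNYMS.get(term, set())
    return expanded

print(expand_query(["car"]))
print(expand_query(["commence"]))
```

Hypernym expansion is off by default in this sketch because a more general term tends to pull in unrelated documents; that default is a design assumption, not something the slide prescribes.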
Improve recall
• Information retrieval:
– effective on small databases without redundancy, e.g. image captions, video text
• Text classification:
– expand small training sets
– reduce training effort
• Question & Answer systems
– question classification: who, where, what, when
– match answers to question types
Improve recall
• Anaphora resolution:
– The girl fell off the table. She....
– The glass fell off the table. It...
• Coreference resolution:
– When he moved the furniture, the antique table got damaged.
• Information extraction (unstructured text to structured databases):
– generic forms or patterns "vehicle" -> text with specific cases "car"
Improve recall
• Summarizers:
– Sentence selection based on word counts -> concept counts
– Avoid repetition in summary -> language generation, pick out another synonym or hypernym
• Limited inferencing: detect locations, people, organisations, etc.
Enabling technologies
• Semantic similarity: what sentences or expressions are semantically similar?
• Semantic relatedness and textual entailment: smoke entails fire, fire entails damage
• Word-sense disambiguation
• Erwin Marsi, University of Tilburg, http://daeso.uvt.nl/demos/index.html
Recall & Precision
query: “cell”
[Figure: Venn diagram of the found documents, the relevant documents and their intersection. Document terms shown: “cellphone”, “mobilephones”, “nerve cell”, “police cell”, “jail”, “neuron”]
recall = intersection / relevant
precision = intersection / found
Recall < 20% for basic search engines! (Blair & Maron 1985)
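The two ratios can be sketched with plain sets. The document ids below are hypothetical and only illustrate the arithmetic.

```python
def recall_precision(found, relevant):
    """recall = |found ∩ relevant| / |relevant|; precision = |found ∩ relevant| / |found|."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(found) if found else 0.0
    return recall, precision

# Hypothetical ids: four documents are relevant, the engine returns three.
relevant = {"d1", "d2", "d3", "d4"}
found = {"d1", "d2", "d9"}
print(recall_precision(found, relevant))
```

Here two of the four relevant documents are found (recall 0.5), and two of the three returned documents are relevant (precision 2/3).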
Many others
• Data sparseness for machine learning: hapaxes can be replaced by semantic classes that match classes from the training set
• Use redundancy for more robustness: spelling correction and speech recognition can build semantic expectations using Wordnet and make better choices
• Sentiment and opinion mining
• Natural language learning
EuroWordNet
• The development of a multilingual database with wordnets for several European languages
• Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 Million EURO.
• http://www.hum.uva.nl/~ewn
• http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html
EuroWordNet
• Languages covered:
– EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
– EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.
• Size of vocabulary:
– EuroWordNet-1: 30,000 concepts - 50,000 word meanings.
– EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
• Type of vocabulary:
– the most frequent words of the languages
– all concepts needed to relate more specific concepts
EuroWordNet Model
[Figure: architecture of the EuroWordNet database]
Link types:
I = Language Independent link
II = Link from Language Specific to Inter-Lingual-Index
III = Language Dependent Link
Inter-Lingual-Index: ILI-record {drive}
Ontology: 2OrderEntity; Location; Dynamic
Domains: Traffic; Air; Road
Lexical Items Tables (one per language):
Italian: cavalcare; andare; muoversi; guidare
Dutch: bewegen; gaan; rijden; berijden
English: drive; ride; move; go
Spanish: cabalgar; jinetear; conducir; mover; transitar
Differences in relations between EuroWordNet and WordNet
• Added Features to relations
• Cross-Part-Of-Speech relations
• New relations to differentiate shallow hierarchies
• New interpretations of relations
EWN Relationship Labels
{airplane}
HAS_MERO_PART: conj1 {door}
HAS_MERO_PART: conj2 disj1 {jet engine}
HAS_MERO_PART: conj2 disj2 {propeller}
{door}
HAS_HOLO_PART: disj1 {car}
HAS_HOLO_PART: disj2 {room}
HAS_HOLO_PART: disj3 {entrance}
Default Interpretation: non-exclusive disjunction
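One way to read the conj/disj labels, sketched under the assumption that relations sharing a conjunction index are alternatives and that all conjunction groups must hold: an airplane has a door and (a jet engine or a propeller). The tuple layout and function name are hypothetical.

```python
# (part, conjunction index, disjunction index or None), as labelled on the slide.
AIRPLANE_PARTS = [
    ("door", 1, None),
    ("jet engine", 2, 1),
    ("propeller", 2, 2),
]

def satisfies(parts_present, labelled_parts):
    """Every conjunct must be met; a conjunct with several disjuncts is met
    by any of them (non-exclusive, so having both is also fine)."""
    conjuncts = {}
    for part, conj, _disj in labelled_parts:
        conjuncts.setdefault(conj, set()).add(part)
    present = set(parts_present)
    return all(options & present for options in conjuncts.values())

print(satisfies({"door", "propeller"}, AIRPLANE_PARTS))
print(satisfies({"propeller"}, AIRPLANE_PARTS))
```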
Overview of the Language Internal relations in EuroWordnet
Same Part of Speech relations:
HYPERONYMY/HYPONYMY       car - vehicle
ANTONYMY                  open - close
HOLONYMY/MERONYMY         head – nose
NEAR_SYNONYMY             apparatus - machine
Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY        dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY  to love - emotion
XPOS_ANTONYMY             to live - dead
CAUSE                     die - death
SUBEVENT                  buy - pay; sleep - snore
ROLE/INVOLVED             write - pencil; hammer - hammer
STATE                     the poor - poor
MANNER                    to slurp - noisily
BELONG_TO_CLASS           Rome - city
Co_Role relations
criminal CO_AGENT_PATIENT victim
novel writer/ poet CO_AGENT_RESULT novel/ poem
dough CO_PATIENT_RESULT pastry/ bread
photographic camera CO_INSTRUMENT_RESULT photo
guitar player
HAS_HYPERONYM player
CO_AGENT_INSTRUMENT guitar
player
HAS_HYPERONYM person
ROLE_AGENT to play music
CO_AGENT_INSTRUMENT musical instrument
to play music
HAS_HYPERONYM to make
ROLE_INSTRUMENT musical instrument
guitar
HAS_HYPERONYM musical instrument
CO_INSTRUMENT_AGENT guitar player
Horizontal & vertical semantic relations
[Figure: network around the synset patient]
Vertical relations (HYPONYM): chronic patient; mental patient; child; child doctor
Horizontal relations (labels ρ-PROCEDURE, ρ-LOCATION, ρ-CAUSE, ρ-PATIENT, ρ-AGENT, STATE, co-ρ-AGENT-PATIENT) connect patient with:
cure; treat; doctor
disease; disorder (stomach disease, kidney disorder, ...)
physiotherapy, medicine, etc.
hospital, etc.
The Multilingual Design
• Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages;
• Index-records are mainly based on WordNet synsets and consist of synonyms, glosses and source references;
• Various types of complex equivalence relations are distinguished;
• Equivalence relations from synsets to index records: not on a word-to-word basis;
• Indirect matching of synsets linked to the same index items;
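Indirect matching through the index can be sketched as follows. The record id "ili:drive" and the choice of which words link to it are assumptions based on the {drive} example from the EuroWordNet model slide; synsets are reduced to a single word for brevity.

```python
from collections import defaultdict

# (language, synset, ILI record). Which synsets link to the {drive} record
# is an assumption for illustration.
EQ_LINKS = [
    ("en", "drive", "ili:drive"),
    ("nl", "rijden", "ili:drive"),
    ("it", "guidare", "ili:drive"),
    ("es", "conducir", "ili:drive"),
]

def build_index(links):
    """Index the equivalence links in both directions."""
    by_ili = defaultdict(set)
    by_synset = defaultdict(set)
    for lang, synset, ili in links:
        by_ili[ili].add((lang, synset))
        by_synset[(lang, synset)].add(ili)
    return by_ili, by_synset

def translate(lang, synset, target_lang, by_ili, by_synset):
    """Match synsets indirectly: source synset -> ILI record -> target synsets."""
    matches = set()
    for ili in by_synset.get((lang, synset), ()):
        matches |= {s for (l, s) in by_ili[ili] if l == target_lang}
    return matches

by_ili, by_synset = build_index(EQ_LINKS)
print(translate("nl", "rijden", "es", by_ili, by_synset))
```

No language pair is linked directly: adding a new language only requires links from its synsets to the index.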
Equivalent Near Synonym
1. Multiple Targets (1:many)
Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5:
• make clean by removing dirt, filth, or unwanted substances from
• remove unwanted substances from, such as feathers or pits, as of chickens or fruit
• remove in making clean; "Clean the spots off the rug"
• remove unwanted substances from - (as in chemistry)
2. Multiple Sources (many:1)
Dutch wordnet: versiersel near_synonym versiering
ILI-Record: decoration.
3. Multiple Targets and Sources (many:many)
Dutch wordnet: toestel near_synonym apparaat
ILI-records: machine; device; apparatus; tool
Equivalent Hyperonymy
Typically used for gaps in English WordNet:
• genuine, cultural gaps for things not known in English culture:
– Dutch: klunen, to walk on skates over land from one frozen body of water to another
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English:
– Dutch: kunststof = artifact substance <=> artifact object
EuroWordNet statistics
Language  Synsets  No. of senses  Sens./syns.  Entries  Sens./entry  LIRels.  LIRels/syns  EQRels-ILI  EQRels/syn  Synsets without ILI
Dutch       44015          70201         1.59    56283         1.25   111639         2.54       53448        1.21                 7203
Spanish     23370          50526         2.16    27933         1.81    55163         2.36       21236        0.91                    0
Italian     40428          48499         1.20    32978         1.47   117068         2.90       71789        1.78                 1561
French      22745          32809         1.44    18777         1.75    49494         2.18       22730        1.00                   20
German      15132          20453         1.35    17098         1.20    34818         2.30       16347        1.08                    0
Czech       12824          19949         1.56    12283         1.62    26259         2.05       12824        1.00                    0
Estonian     7678          13839         1.80    10961         1.26    16318         2.13        9004        1.17                    0
English     16361          40588         2.48    17320         2.34    42140         2.58        n.a.        n.a.                 n.a.
WN15        94515         187602         1.98   126617         1.48   211375         2.24        n.a.        n.a.                 n.a.
Wordnets as semantic structures
• Wordnets are unique language-specific structures:
– same organizational principles: synset structure and same set of semantic relations.
– different lexicalizations
– differences in synonymy and homonymy:
  • "decoration" in English versus "versiersel/versiering" in Dutch
  • "bank" in English (money/river) versus "bank" in Dutch (money/furniture)
• BUT also different relations for similar synsets
Autonomous & Language-Specific
[Figure: two hierarchies side by side]
Dutch Wordnet nodes: voorwerp {object}; lepel {spoon}; werktuig {tool}; tas {bag}; bak {box}; blok {block}; lichaam {body}
Wordnet1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box
Linguistic versus Artificial Ontologies
Artificial ontology:
• better control or performance, or a more compact and coherent structure.
• introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool),
• neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic versus Artificial Ontologies
Linguistic ontology:
• Exactly reflects the relations between all the lexicalized words and expressions in a language.
• Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery,
Wordnets versus ontologies
• Wordnets:
– autonomous language-specific lexicalization patterns in a relational network.
– Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation.
• Ontologies:
– data structure with formally defined concepts.
– Usage: making semantic inferences.
From EuroWordNet to Global WordNet
• EuroWordNet ended in 1999
• Global Wordnet Association was founded in 2000 to maintain the framework: http://www.globalwordnet.org
• Currently, wordnets exist for more than 50 languages, including:
– Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu...
• Many languages are genetically and typologically unrelated
Some downsides of the EuroWordNet model
• Construction is not done uniformly
• Coverage differs
• Not all wordnets can communicate with one another, since they are linked to different versions of the English wordnet
• Proprietary rights restrict free access and usage
• A lot of semantics is duplicated
• Complex and obscure equivalence relations due to linguistic differences between English and other languages
Next step: Global WordNet Grid
[Figure: language-specific word lists linked to a shared Inter-Lingual Ontology]
Inter-Lingual Ontology: Object; Device; TransportDevice
English Words: vehicle; car; train
Czech Words: dopravní prostředník; auto; vlak
French Words: véhicule; voiture; train
Estonian Words: liiklusvahend; auto; killavoor
German Words: Fahrzeug; Auto; Zug
Spanish Words: vehículo; auto; tren
Italian Words: veicolo; auto; treno
Dutch Words: voertuig; auto; trein
GWNG: Main Features
• Construct separate wordnets for each Grid language
• Contributors from each language encode the same core set of concepts plus culture/language-specific ones
• Synsets (concepts) are mapped crosslinguistically via an ontology instead of just the English Wordnet
The Ontology: Main Features
• List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations
• Ontology contains only upper and mid-level concepts
• Concepts are related in a type hierarchy
• Concepts are defined with axioms
The Ontology: Main Features
• Minimal set of concepts (Reductionist view):
– to express equivalence across languages
– to support inferencing
• Ontology need not and cannot provide a concept for all concepts found in the Grid languages
– Lexicalization in a language is not sufficient to warrant inclusion in the ontology
– Lexicalization in all or many languages may be sufficient
• Ontological observations will be used to define the concepts in the ontology
• Ontological framework still must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
• Additional lexicalized concepts are related to the ontology through complex relations
Ontological observations
• Identity criteria as used in OntoClean (Guarino & Welty 2002):
– rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while.
– essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of.
– unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.
Type-role distinction
Current WordNet treatment, hyponyms of dog:
• lapdog:1 # toy dog:1, toy:4 # hunting dog:1 # working dog:1, etc.
• dalmatian:2, coach dog:1, carriage dog:1 # Leonberg:1 # Newfoundland:1 # poodle:1, poodle dog:1, etc.
(1) a husky is a kind of dog (type)
(2) a husky is a kind of working dog (role)
• What’s wrong? (2) is defeasible, (1) is not:
*This husky is not a dog
This husky is not a working dog
Ontology and lexicon
• Hierarchy of disjunct types:
Canine -> PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky
• Lexicon:
– NAMES for TYPES:
{poodle}EN, {poedel}NL, {pudoru}JP
(instance x Poodle)
– LABELS for ROLES:
{watchdog}EN, {waakhond}NL, {banken}JP
((instance x Canine) and (role x GuardingProcess))
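A minimal sketch of a lexicon that keeps names for types apart from labels for roles. The tuple layout and function names are illustrative assumptions; the type name PoodleDog follows the hierarchy line on the slide.

```python
# (word, language) -> (ontology type, role process or None).
# The layout is an assumption for illustration only.
LEXICON = {
    ("poodle", "EN"): ("PoodleDog", None),
    ("poedel", "NL"): ("PoodleDog", None),
    ("pudoru", "JP"): ("PoodleDog", None),
    ("watchdog", "EN"): ("Canine", "GuardingProcess"),
    ("waakhond", "NL"): ("Canine", "GuardingProcess"),
    ("banken", "JP"): ("Canine", "GuardingProcess"),
}

def names_a_type(word, lang):
    """True for rigid type names, False for role labels."""
    return LEXICON[(word, lang)][1] is None

def equivalents(word, lang):
    """Words in any language that map to the same ontological expression."""
    meaning = LEXICON[(word, lang)]
    return {key for key, value in LEXICON.items()
            if value == meaning and key != (word, lang)}

print(names_a_type("poodle", "EN"), names_a_type("watchdog", "EN"))
print(equivalents("watchdog", "EN"))
```

Cross-lingual equivalence falls out of sharing the same expression, so no word-to-word links are stored.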
Ontology and lexicon
• Hierarchy of disjunct types:
River; Clay; etc…
• Lexicon:
– NAMES for TYPES:
{river}EN, {rivier, stroom}NL
(instance x River)
– LABELS for dependent concepts:
{rivierwater}NL (water from a river => water is not a unit)
((instance x water) and (instance y River) and (portion x y))
{kleibrok}NL (irregularly shaped piece of clay => non-essential)
((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))
KIF expression for gender marking
• {teacher}EN
((instance x Human) and (agent x TeachingProcess))
• {Lehrer}DE
((instance x Man) and (agent x TeachingProcess))
• {Lehrerin}DE
((instance x Woman) and (agent x TeachingProcess))
KIF expression for perspective
sell: subj(x), direct obj(z), indirect obj(y)
versus
buy: subj(y), direct obj(z), indirect obj(x)
(and (instance x Human) (instance y Human) (instance z Entity) (instance e FinancialTransaction) (source x e) (destination y e) (patient z e))
The same process but a different perspective by subject and object realization: marry is expressed by two verbs in Russian, and apprendre in French can mean both teach and learn
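The perspective idea can be sketched by mapping both verbs onto one event record. The class and function names are hypothetical; only the role assignment (seller as source, buyer as destination) comes from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transaction:
    source: str       # the party giving up the goods
    destination: str  # the party receiving the goods
    patient: str      # what changes hands

def from_sell(subj, direct_obj, indirect_obj):
    """sell: the subject is the source, the indirect object the destination."""
    return Transaction(source=subj, destination=indirect_obj, patient=direct_obj)

def from_buy(subj, direct_obj, indirect_obj):
    """buy: the subject is the destination, the indirect object the source."""
    return Transaction(source=indirect_obj, destination=subj, patient=direct_obj)

# "Ann sells a car to Bob" and "Bob buys a car from Ann" describe one event.
print(from_sell("Ann", "car", "Bob") == from_buy("Bob", "car", "Ann"))
```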
Advantages of the Global Wordnet Grid
• Shared and uniform world knowledge:
– universal inferencing
– uniform text analysis and interpretation
• More compact and less redundant databases
• Clearer notion of how languages map to the knowledge
– better criteria for expressing knowledge
– better criteria for understanding variation
CORNETTO (STEVIN TENDER)
Combinatorial and Relational Network as Toolkit for Dutch Language Technology
http://www2.let.vu.nl/oz/cornetto
Goals of the Cornetto project
• Goal: to develop a lexical semantic database for Dutch:
– 40K Entries: generic and central part of the language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information
• Method: merge data from Dutch Wordnet (DWN) and Referentie Bestand Nederlands (RBN)
• April 2006-March 2008, extended to July 2008
• The final results of the Cornetto project are available through the TST-centrale of the Nederlandse Taalunie (free for research).
Database
• Collections:
▪ Lexical Units (LU): mainly derived from the RBN
▪ Synsets (SY): mainly derived from DWN
▪ Terms (TE) and axioms: mainly derived from SUMO and MILO
▪ Domains (DM): based on Wordnet domains
• Mappings:
▪ LU <-> SY
▪ SY <-> SY (within Dutch and from Dutch to English)
▪ SY <-> TE
▪ SY <-> DM
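The collections and mappings can be sketched as two record types. This is not the Cornetto schema: the field names and the ids "lu-1" and "sy-1" are hypothetical, and the cirkel/Circle pair is borrowed from the SUMO mapping example later in the lecture.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalUnit:
    lu_id: str
    lemma: str
    synset_id: str                                       # LU <-> SY

@dataclass
class Synset:
    sy_id: str
    lexical_units: list = field(default_factory=list)    # LU <-> SY
    related_synsets: dict = field(default_factory=dict)  # SY <-> SY
    terms: list = field(default_factory=list)            # SY <-> TE
    domains: list = field(default_factory=list)          # SY <-> DM

def attach(lu, synsets):
    """Register a lexical unit with the synset it belongs to."""
    synsets[lu.synset_id].lexical_units.append(lu.lu_id)

synsets = {"sy-1": Synset("sy-1", terms=["Circle"])}
attach(LexicalUnit("lu-1", "cirkel", "sy-1"), synsets)
print(synsets["sy-1"])
```

Keeping lexical units and synsets as separate records mirrors the fact that they come from different source databases (RBN and DWN).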
Data Organization
[Figure: the Cornetto data collections and their links]
Lexical Unit (LU): corresponds to a word-meaning pair; holds form, morphology, syntax, semantics, pragmatics, usage examples
Synset: models meaning relations; holds synonyms and internal relations
SUMO / MILO: collection of Terms and Axioms
Other linked resources: Princeton Wordnet; Wordnet Domains; Spanish Wordnet; Czech Wordnet; German Wordnet; French Wordnet; Korean Wordnet; Arabic Wordnet
Database
• Implemented in DebVisDic:
– http://deb.fi.muni.cz/index.php
• Demo version available:
– http://www2.let.vu.nl/oz/cornetto/demo.html
Overview of results
                         ALL    NOUNS    VERBS      ADJ      ADV   OTHERS
Synsets               70,371   52,847    9,017    7,689      220      598
Lexical Units        119,108   85,449   17,314   15,712      475      158
Lemmas (form+pos)     92,686   70,315    9,051   12,288    1,032     n.a.
Synonyms in synsets  103,762   75,476   14,138   12,914      408      826
CID records          104,556   76,537   14,214   13,132      483      190
Synonym per synset      1.47     1.43     1.57     1.68     1.85     1.38
Senses per lemma        1.29     1.22     1.91     1.28     0.46     n.a.
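The two ratio rows appear to follow from the count rows. A quick check on the ALL column, assuming "Synonym per synset" is synonyms divided by synsets and "Senses per lemma" is lexical units divided by lemmas (the slide does not state the formulas):

```python
# Counts from the ALL column of the results table.
RESULTS = {
    "synsets": 70_371,
    "lexical_units": 119_108,
    "lemmas": 92_686,
    "synonyms": 103_762,
}

def synonyms_per_synset(r):
    return r["synonyms"] / r["synsets"]

def senses_per_lemma(r):
    # Assumption: every lexical unit counts as one sense of its lemma.
    return r["lexical_units"] / r["lemmas"]

print(round(synonyms_per_synset(RESULTS), 2), round(senses_per_lemma(RESULTS), 2))
```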
Mapping relations
No status value 55976 53.54%
Status value 48580 46.46%
manual 10108 9.67%
B-95 4944 4.73%
BM-90 4215 4.03%
D-55 adjectives 171 0.16%
D-58 verbs 774 0.74%
D-75 nouns 2085 1.99%
M-97 25236 24.14%
RESUME-75 1047 1.00%
TOTAL 104556
DWN and RBN matches 35,289 37.74%
LUs only in DWN 54,983 58.81%
LUs only in RBN 3,223 3.45%
Total 93,495
Overview of synset data
Synsets 70371
Synonyms 103762
InternalRelations 153370
EquivalenceRelations 86830
Definitions 35620
WordNet Domains mappings 93822
Sumo mappings 70654
Base Level Concepts 8828
English Wordnet to SUMO mapping through two-place relations
• = the synset is equivalent to the SUMO concept, circle (= Circle)
• + the synset is subsumed by the SUMO concept, branch (+ PlantBranch)
• @ the synset is an instance of the SUMO concept, Amsterdam (@ City)
Cornetto SUMO Mappings through triplets
• Equality:
– cirkel: (=, 0, Circle) or (=, , Circle)
• Subsumption:
– tak: (+, 0, PlantBranch) or (+, , PlantBranch)
• Related:
– blad: (part, 0, PlantBranch) or (part, , PlantBranch)
• Axiomatized:
– theewater:
(instance, 0, Water) (instance, 1, Making) (instance, 2, Tea) (resource, 0, 1) (result, 2, 1)
OR
(instance, , Water) (instance, 1, Making) (instance, 2, Tea) (resource, , 1) (result, 2, 1)
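A sketch of how such triplets could be turned into a single KIF-like expression. It assumes that integers are variable indices with 0 standing for the referent of the synset itself, and the ?x0, ?x1 naming is my own convention rather than Cornetto's output format.

```python
# theewater (tea water): water that is the resource of making tea.
THEEWATER = [
    ("instance", 0, "Water"),
    ("instance", 1, "Making"),
    ("instance", 2, "Tea"),
    ("resource", 0, 1),
    ("result", 2, 1),
]

def to_kif(triples):
    """Render (relation, variable, target) triplets as one conjunction.
    Integers become variables; strings are treated as ontology terms."""
    parts = []
    for relation, var, target in triples:
        right = f"?x{target}" if isinstance(target, int) else target
        parts.append(f"({relation} ?x{var} {right})")
    return "(and " + " ".join(parts) + ")"

print(to_kif(THEEWATER))
```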
Ontology mapping: female/male variants
teacher (a person whose occupation is teaching)
SUMO: equivalent to Teacher
In Dutch: no neutral form
leraar (male teacher): (+, , Teacher), (+, , Man)
lerares (female teacher): (+, , Teacher), (+, , Woman)
KYOTO (ICT-211423)
Knowledge Yielding Ontologies for Transition-Based Organization
FP7: Intelligent Content and Semantics
http://www.kyoto-project.eu/
KYOTO (ICT-211423) Overview
• Title: Knowledge Yielding Ontologies for Transition-Based Organization
• Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
– Taiwan and Japan funded by national grants
• Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries.
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
• URL: http://www.kyoto-project.eu/
• Duration:
– March 2008 – March 2011
• Effort:
– 364 person months of work.
KYOTO cycle
[Figure: the KYOTO cycle, with numbered steps 1 to 6. Labels in the figure:]
Environmental organizations: distributed, diverse & dynamic data
Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
Tybot: term yielding robot (example term: CO2 emission)
Wikyoto: maintain terms & concepts
Kybot: knowledge yielding robot
Index facts: Process: Increase; Involves: CO2 emission; When: 2008; Where: Europe
Text & Fact Index
Semantic Search
Users: Citizens; Governments; Companies
Wordnets
Domain: CO2 Emission; H2O Pollution; Greenhouse Gas
Ontology: Top; Middle; Abstract; Process; Physical; Substance; H2O; CO2
Term examples: frog; endemic frogs; common frog; poison frog; Golden poison frog; gopher frog; Dusky gopher frog; forest frog
Text example: Garden ponds are havens for wildlife. They provide food and shelter for frogs, newts and aquatic insects, including damselflies and dragonflies,
Relation examples:
(garden pond, haven, wild life)
(garden pond, has_food, frog)
(garden pond, has_food, newt)
(garden pond, has_food, aquatic insect)
(garden pond, is_shelter, frog)
(garden pond, is_shelter, newt)
(garden pond, is_shelter, aquatic insect)
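The "Index facts" and "Semantic Search" steps can be sketched as a list of structured records queried by field. The record fields mirror the slide's example (Process, Involves, When, Where); the class and function names are hypothetical and this is not the KYOTO implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    process: str
    involves: str
    when: str
    where: str

# The one fact shown on the slide, extracted from
# "Sudden increase of CO2 emissions in 2008 in Europe".
FACT_INDEX = [Fact("Increase", "CO2 emission", "2008", "Europe")]

def search(index, **criteria):
    """Return the facts whose fields match every given criterion."""
    return [fact for fact in index
            if all(getattr(fact, key) == value for key, value in criteria.items())]

print(search(FACT_INDEX, involves="CO2 emission", where="Europe"))
```

Searching on fields rather than on words is what lets the same query work over text in any language, once terms are mapped to shared concepts.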
Kyoto main application
• Wikyoto (Wiki platform)
– Connects people with shared interest as a community
– Upload documents and sources
– View and edit terms and concepts learned from these documents
– Combines concepts with other taxonomies
– Discuss and agree with others in the community, different languages, regions and cultures
Kyoto main application
• Tybots
– Learns terms and concepts from document collection
– Organizes terms as a hierarchy
– Connects terms to other hierarchies
– Defines:
  • definitions
  • relations to other terms
  • properties and criteria for terms
Kyoto main application
• Kybot:
– Detects facts of interest in text and combines these in a comprehensive overview
– Uses knowledge represented for terms to detect facts in any document, regardless of language
– Allows you to specify any collection of types of knowledge of your interest
Kyoto databases
• Database of users that forms the community
• Database of sources and documents provided by the users
• Database of terms, presented as a domain wordnet in each language
• Database of concepts (so-called ontology) that connects the terms of the different languages
• Databases of facts derived from various document and source collections provided by the user
Thank you for your attention