2004/05 Modelli simulativi 1
Simulation Models in the Cognitive Sciences (Modelli simulativi nelle Scienze Cognitive)
The lexicon: linguistic models, WordNet, lexical acquisition
Massimo Poesio

PART I: LEXICON AND LEXICAL SEMANTICS
WORDNET

What’s in a lexicon
• A lexicon is a repository of lexical knowledge.
• The simplest form of lexicon is a list of words.
• But even for English, let alone languages with a more complex morphology such as Italian, it makes sense to separate WORD FORMS from LEXICAL ENTRIES, or LEXEMES:
  LEXEME  BANK
    POS: N
  WORD  BANKS
    LEXEME: BANK
    SYN: NUM: PLUR
• Lexical knowledge also includes information about the MEANING of words.

Meaning…
• Characterizing the meaning of words is not easy.
• Most of the methods considered in these lectures characterize the meaning of a word by stating its relations with other words.
• This method, however, doesn’t say much about what the word ACTUALLY means (e.g., what you can do with a car).

An example lexical entry: VICINO (from it.wiktionary.org)
vicino, noun, m (vicina f, vicini pl m, vicine pl f)
  1. Someone who lives next door. (“I miei vicini vengono da Frosinone”: “My neighbours come from Frosinone”)
vicino, adjective, m (vicina f, vicini pl m, vicine pl f) (“La più vicina stella a neutroni è RX J185635-3754”: “The nearest neutron star is RX J185635-3754”)
vicino, adverb (invariable) (“Itunes visto da vicino”: “Itunes seen up close”)

Lexical resources for computers: MACHINE READABLE DICTIONARIES
• A traditional DICTIONARY is a database containing information about:
  - the PRONUNCIATION of a certain word
  - its possible PARTS OF SPEECH
  - its possible SENSES (or MEANINGS)
• In recent years, most dictionaries have appeared in machine-readable form (MRD):
  - English: Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English (LDOCE)
  - Italian: Garzanti, Zanichelli, Paravia, it.wiktionary.org

An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE
0100 a supply (of something) for use: a good stock of food
0200 goods for sale: Some of the stock is being taken without being paid for
0300 the thick part of a tree trunk
0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
0600 a group of animals used for breeding
0700 farm animals usu. cattle; LIVESTOCK
0800 a family line, esp. of the stated character
0900 money lent to a government at a fixed rate of interest
1000 the money (CAPITAL) owned by a company, divided into SHAREs
1100 a type of garden flower with a sweet smell
1200 a liquid made from the juices of meat, bones, etc., used in cooking
…..

Homonymy
• Word-strings like STOCK are used to express apparently unrelated senses / meanings, even in contexts in which their part of speech has been determined.
• Other well-known examples: BANK, LIME, RIGHT, SET, SCALE. Italian: CALCIO, OBBIETTIVO.
• An example of the problems homonymy may cause for IR systems: search for 'West Bank' with Google.

CALCIO, from “Il grande dizionario Garzanti”
calcio1 [càl-cio] n.m.
  1. a blow given with the foot or the paw; a kick: dare, assestare, ricevere un _ (to give, land, receive a kick)
  2. (sport) a game played between two teams of eleven players each …
  3. in football, a kick given to the ball: _ di punizione (free kick) … _ di rigore (penalty kick) … _ d’angolo (corner kick) … _ piazzato (place kick)
calcio2 the lower part of the stock of a rifle … derived from Lat. calx, calcis …
calcio3 the chemical element whose symbol is Ca; an alkaline earth metal …

Homonymy in an MRD for Italian (ItalWordNet)
obbiettivo, Noun
[1] - the goal of a military operation (obbiettivo [1], obiettivo [1])
[2] - the target in artillery fire (obbiettivo [2], obiettivo [2])
[4] - a system of lenses for projecting the real image of an object (obbiettivo [4], obiettivo [4])

Meaning in MRDs, 2: SYNONYMY
• Two words are SYNONYMS if they have the same meaning, at least in some contexts
  - E.g., PRICE and FARE; CHEAP and INEXPENSIVE; LAPTOP and NOTEBOOK; HOME and HOUSE
  - I’m looking for a CHEAP FLIGHT / INEXPENSIVE FLIGHT
• From Roget’s Thesaurus: OBLITERATION, erasure, cancellation, deletion
• But few words are truly synonymous in ALL contexts:
  - I wanna go HOME / ?? I wanna go HOUSE
  - The flight was CANCELLED / ?? OBLITERATED / ??? DELETED
• Knowing about synonyms may help in IR:
  - NOTEBOOK (get LAPTOPs as well)
  - CHEAP PRICE (get INEXPENSIVE FARE)
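
The IR use of synonyms can be illustrated with a small query-expansion sketch. This is only an illustration under stated assumptions: the synonym table is hand-built from the word pairs on the slide, and the names SYNONYMS and expand_query are hypothetical, not part of any real IR system.

```python
from itertools import product

# Hand-built synonym table using the pairs from the slide (an assumption:
# a real system would draw these from a thesaurus or from WordNet).
SYNONYMS = {
    "notebook": {"laptop"},
    "laptop": {"notebook"},
    "cheap": {"inexpensive"},
    "inexpensive": {"cheap"},
    "price": {"fare"},
    "fare": {"price"},
}

def expand_query(query):
    """Return every variant of the query obtained by swapping in synonyms."""
    alternatives = [
        sorted({word} | SYNONYMS.get(word, set()))
        for word in query.lower().split()
    ]
    return [" ".join(choice) for choice in product(*alternatives)]
```

Note that blind expansion ignores the caveat above: few words are synonymous in ALL contexts, so some expanded queries may be odd.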

Synonymy in Italian
scorza, Noun
[1] - (corteccia [1], scorza [1])
[2] - outer part, covering of fruits (buccia [1], scorza [2])
[4] - (scorza [4]) "sotto la sua scorza scortese si nasconde un animo nobile" (“beneath his rude exterior hides a noble soul”)

Problems and limitations of MRDs
• Identifying distinct senses is always difficult
  - Sense distinctions are often subjective
• Definitions are often circular
• Very limited characterization of the meaning of words

POLYSEMY vs HOMONYMY
• In cases like BANK, it’s fairly easy to identify two distinct senses (the etymology is also different). But in other cases the distinctions are more questionable
  - E.g., senses 0100 and 0200 of STOCK are clearly related, as are 0600 and 0700, or 0900 and 1000
• In some cases, syntactic tests may help. E.g., KEEP (Hirst, 1987):
  - Ross KEPT staring at Nadia’s decolletage
  - Nadia KEPT calm and made a cutting remark
  - Ross wrote of his embarrassment in the diary that he KEPT.
• POLYSEMOUS WORDS: the meanings are related to each other
  - Cf. a human’s foot vs. a mountain’s foot
• In general, the distinction between HOMONYMY and POLYSEMY is not always easy to draw (especially with VERBS)

Other aspects of lexical meaning not captured by MRDs
• Other semantic relations:
  - HYPONYMY
  - ANTONYMY
• A lot of other information typically considered part of ENCYCLOPEDIAs:
  - Trees grow bark and twigs
  - Adult trees are much taller than human beings

Hyponymy and Hypernymy
• HYPONYMY is the relation between a subclass and a superclass:
  - CAR and VEHICLE
  - DOG and ANIMAL
  - BUNGALOW and HOUSE
• Generally speaking, a hyponymy relation holds between X and Y whenever it is possible to substitute Y for X:
  - That is an X -> That is a Y
  - E.g., That is a CAR -> That is a VEHICLE.
• HYPERNYMY is the opposite relation
• Knowledge about TAXONOMIES is useful to classify web pages
  - E.g., Semantic Web
  - Automatically (e.g., Udo Kruschwitz’s system)
• This information is not generally contained in MRDs
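
The substitution test can be sketched as a walk up a small taxonomy. This is a minimal illustration, assuming each word has at most one direct hypernym (real taxonomies such as WordNet allow several); the link poodle -> dog is a hypothetical addition to show transitivity, and the names HYPERNYM_OF, hypernym_chain and is_hyponym_of are illustrative.

```python
# Direct hypernym (superclass) links. The first three come from the slide;
# "poodle" is a hypothetical extra entry used to show transitivity.
HYPERNYM_OF = {
    "car": "vehicle",
    "dog": "animal",
    "bungalow": "house",
    "poodle": "dog",
}

def hypernym_chain(word):
    """Follow hypernym links upward and return all superclasses of `word`."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain

def is_hyponym_of(x, y):
    """True if 'That is an X' licenses 'That is a Y' in this toy taxonomy."""
    return y in hypernym_chain(x)
```

The relation is directional: is_hyponym_of("car", "vehicle") holds, while the reverse does not.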

The organization of the lexicon
WORD-FORMS                         LEXEMES       SENSES
“eat”, “eats”, “ate”, “eaten”  ->  EAT-LEX-1  -> eat0600, eat0700

The organization of the lexicon
WORD-STRINGS    LEXEMES         SENSES
“stock”     ->  STOCK-LEX-1 ->  stock0100, stock0200
            ->  STOCK-LEX-2 ->  stock0600, stock0700
            ->  STOCK-LEX-3 ->  stock0900, stock1000
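
The three-level organization (word-strings, lexemes, senses) can be sketched as a simple data structure. The identifiers follow the slides, but the grouping of STOCK senses under each lexeme is an assumption based on which senses are described as related; the names Lexeme, WORD_STRINGS and senses_of are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Lexeme:
    name: str                                   # e.g. "STOCK-LEX-1"
    senses: list = field(default_factory=list)  # sense identifiers

# Assumed grouping: related senses are placed under the same lexeme.
STOCK_1 = Lexeme("STOCK-LEX-1", ["stock0100", "stock0200"])
STOCK_2 = Lexeme("STOCK-LEX-2", ["stock0600", "stock0700"])
STOCK_3 = Lexeme("STOCK-LEX-3", ["stock0900", "stock1000"])
EAT_1 = Lexeme("EAT-LEX-1", ["eat0600", "eat0700"])

# Several word forms may share a lexeme (eat); one word-string may
# correspond to several lexemes (stock, a homonym).
WORD_STRINGS = {
    "stock": [STOCK_1, STOCK_2, STOCK_3],
    "eat": [EAT_1],
    "eats": [EAT_1],
    "ate": [EAT_1],
    "eaten": [EAT_1],
}

def senses_of(word_string):
    """All senses reachable from a word-string via its lexemes."""
    return [s for lex in WORD_STRINGS.get(word_string, []) for s in lex.senses]
```

In this picture homonymy shows up as one word-string pointing to several lexemes, and polysemy as one lexeme carrying several senses.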

Synonymy
[Diagram: word-strings linked to lexemes, and lexemes to senses]
WORD-STRINGS: “cheap”, “inexpensive”
LEXEMES: CHEAP-LEX-1, CHEAP-LEX-2, INEXP-LEX-3
SENSES: cheap0100, …, cheapXXXX, inexp0900, inexpYYYY

A more advanced lexical resource: WordNet
• A lexical database created at Princeton
• Freely available for research from the Princeton site: http://www.cogsci.princeton.edu/~wn/
• Information about a variety of SEMANTIC RELATIONS
• Three sub-databases (supported by psychological research as early as Fillenbaum and Jones, 1965):
  - NOUNS
  - VERBS
  - ADJECTIVES and ADVERBS
• Each database is organized around SYNSETS

The noun database
• About 90,000 forms, 116,000 senses
• Relations:
  hypernym     breakfast -> meal
  hyponym      meal -> lunch
  has-member   faculty -> professor
  member-of    copilot -> crew
  has-part     table -> leg
  part-of      course -> meal
  antonym      leader -> follower

Synsets
• Senses (or ‘lexicalized concepts’) are represented in WordNet by the set of words that can be used in AT LEAST ONE CONTEXT to express that sense / lexicalized concept: the SYNSET
• E.g., {chump, fish, fool, gull, mark, patsy, fall guy, sucker, shlemiel, soft touch, mug}
  (gloss: person who is gullible and easy to take advantage of)

Hypernyms
2 senses of robin
Sense 1
robin, redbreast, robin redbreast, Old World robin, Erithacus rubecola -- (small Old World songbird with a reddish breast)
  => thrush -- (songbirds characteristically having brownish upper plumage with a spotted breast)
    => oscine, oscine bird -- (passerine bird having specialized vocal apparatus)
      => passerine, passeriform bird -- (perching birds mostly small and living near the ground with feet having 4 toes arranged to allow for gripping the perch; most are songbirds; hatchlings are helpless)
        => bird -- (warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings)
          => vertebrate, craniate -- (animals having a bony or cartilaginous skeleton with a segmented spinal column and a large brain enclosed in a skull or cranium)
            => chordate -- (any animal of the phylum Chordata having a notochord or spinal column)
              => animal, animate being, beast, brute, creature, fauna -- (a living organism characterized by voluntary movement)
                => organism, being -- (a living thing that has (or can develop) the ability to act or function independently)
                  => living thing, animate thing -- (a living (or once living) entity)
                    => object, physical object --
                      => entity, physical thing --

Meronymy
wn beak -holon
Holonyms of noun beak
1 of 3 senses of beak
Sense 2
beak, bill, neb, nib
    PART OF: bird

The verb database
• About 10,000 forms, 20,000 senses
• Relations between verb meanings:
  hypernym   fly -> travel
  troponym   walk -> stroll
  entails    snore -> sleep
  antonym    increase -> decrease

Relations between verbal meanings
• V1 ENTAILS V2 when “Someone V1” (logically) entails “Someone V2”
  - e.g., snore entails sleep
• V1 is a TROPONYM of V2 when to do V1 is to do V2 in some manner
  - e.g., limp is a troponym of walk

The adjective and adverb database
• About 20,000 adjective forms, 30,000 senses
• 4,000 adverbs, 5,600 senses
• Relations:
  antonym (adjective)   heavy <-> light
  antonym (adverb)      quickly <-> slowly

How to use
• Online: http://cogsci.princeton.edu/cgi-bin/webwn
• Command line:
  - Get synonyms: wn bank -synsn
  - Get hypernyms: wn robin -hypen
  - Get antonyms (also for adjectives and verbs): wn right -antsa

ItalWordNet (a local production)
• EuroWordNet: created by a European consortium
• ItalWordNet: created by ITC
  http://www.ilc.cnr.it/iwndb_php/

Other machine-readable lexical resources
• Machine-readable dictionaries: LDOCE
• Roget’s Thesaurus
• The biggest encyclopedia: CYC
• Italian: http://multiwordnet.itc.it/ (IRST)

Readings
• WordNet online manuals
• C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, The MIT Press

PART II: VECTOR-BASED MODELS OF THE LEXICON AND LEXICAL ACQUISITION

VECTOR-BASED LEXICAL MODELS
• Both in Linguistics and in Psychology, researchers have developed theories of the lexicon in which concepts are characterized in terms of FEATURES
  - E.g., Smith and Medin, 1981; Sartori and Job, 1988
• This type of approach leads to a ‘geometrical’ view of lexical entries as points, or VECTORS, in FEATURE SPACE
• This type of model can account for which words ‘mean the same’
• A particularly simple version of this theory is the one in which the ‘features’ are simply other words
• Vector-space models have been shown to correlate well with the results of psychological experiments, particularly about SEMANTIC PRIMING

VECTOR-BASED MODELS AND LEXICAL ACQUISITION
• Vector-based models (both the feature-based and the word-based variety) are also interesting because they can serve as the basis for models of lexical acquisition
• These models are interesting:
  - From a psychological point of view, to explain how concepts are stored in memory
  - In neuroscience, where they are being used to investigate SEMANTIC CATEGORY DEFICITS (e.g., Caramazza, Tyler et al., Vigliocco et al.)
  - From a linguistic point of view, because they can address the problems encountered by lexicographers when trying to specify word senses
  - From a practical point of view: most MRDs these days contain at least some information derived by computational means

Feature-based lexical semantics
• Very old idea in Linguistics: the meaning of a word can be specified in terms of the values of certain ‘features’ (‘DECOMPOSITIONAL SEMANTICS’)
  dog   : ANIMATE = +, EAT = MEAT,  SOCIAL = +
  horse : ANIMATE = +, EAT = GRASS, SOCIAL = +
  cat   : ANIMATE = +, EAT = MEAT,  SOCIAL = -
• E.g., Katz and Fodor, 1968
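
The feature specifications above can be written down directly and compared. This is a minimal sketch: the feature values are those on the slide, while the overlap measure (fraction of features with equal values) is an assumption chosen for illustration, not a measure taken from the literature cited.

```python
# Feature values copied from the slide.
FEATURES = {
    "dog":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "+"},
    "horse": {"ANIMATE": "+", "EAT": "GRASS", "SOCIAL": "+"},
    "cat":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "-"},
}

def shared_features(a, b):
    """Names of the features on which words a and b take the same value."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sorted(name for name in fa if fa[name] == fb.get(name))

def feature_overlap(a, b):
    """Fraction of a's features on which b agrees (an assumed, simple measure)."""
    return len(shared_features(a, b)) / len(FEATURES[a])
```

With these three features, dog agrees with cat on ANIMATE and EAT, and with horse on ANIMATE and SOCIAL, so both pairs get the same overlap.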

PSYCHOLOGY: THE FUSS MODEL (Vinson and Vigliocco, 2002, 2003)

Vector-based lexical semantics
[Figure: DOG, CAT and HORSE shown as vectors]

WORD-BASED VECTOR-SPACE LEXICAL MODELS, I
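
Measures of semantic similarity
• Euclidean distance:
  \[ d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
• Cosine (of the angle \alpha between the vectors x and y):
  \[ \cos(\alpha) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \]
• Manhattan metric:
  \[ d = \sum_{i=1}^{n} |x_i - y_i| \]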
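
The three measures can be sketched in a few lines of plain Python. The function names are illustrative; vectors are assumed to be equal-length sequences of numbers, and returning 0.0 for a zero vector in the cosine is a convention adopted here, not something stated on the slide.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (0.0 if either vector is all zeros)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0
```

Note the difference in kind: the two distances grow as vectors move apart, whereas the cosine is a similarity that depends only on direction, so vectors of different length pointing the same way count as maximally similar.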

Concept clustering (aka: automatic taxonomy discovery)
[Figure: words grouped into clusters]
  Time:    Day, Month, Year
  Vehicle: Car, Airplane, Van
  Feeling: Joy, Love, Fear

Some psychological evidence for vector-space representations
• Burgess and Lund (1996, 1997): the clusters found with HAL correlate well with those observed using semantic priming experiments.
• Landauer, Foltz, and Laham (1997): scores overlap with those of humans on standard vocabulary and topic tests; mimic human scores on category judgments; etc.
• Evidence about ‘prototype theory’ (Rosch et al., 1976):
  - Posner and Keele, 1968: subjects were presented with patterns of dots that had been obtained by variations from a single pattern (‘prototype’). Later, they recalled prototypes better than samples they had actually seen.
  - Rosch et al., 1976: ‘basic level’ categories (apple, orange, potato, carrot) have higher ‘cue validity’ than elements higher in the hierarchy (fruit, vegetable) or lower (red delicious, cox)

General characterization of vector-based semantics (from Charniak)
• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
  1. Define the properties one cares about, and give values to each property (generally, numerical)
  2. Create a vector of length n for each item to be classified
  3. Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
• What changes between models:
  1. The properties used in the vector
  2. The distance metric used to decide if two points are ‘close’
  3. The algorithm used to cluster
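
The clustering recipe can be sketched as follows. This is one possible instantiation only: the distance metric is Euclidean and the algorithm is single-link agglomerative clustering with a distance threshold, both chosen here for brevity rather than taken from the slide; the function names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster(points, threshold, distance=euclidean):
    """Single-link agglomerative clustering.

    `points` maps item names to vectors. The two closest clusters are merged
    repeatedly until no pair of clusters is within `threshold`.
    """
    clusters = [[name] for name in points]

    def link(c1, c2):
        # Single link: distance between the closest members of two clusters.
        return min(distance(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [
            (link(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        ]
        d, i, j = min(pairs)
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Swapping in a different `distance` function or a different merge criterion changes items 2 and 3 of the "what changes between models" list without touching the rest.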

Using words as features in a vector-based semantics
• The old decompositional semantics approach requires:
  i. Specifying the features
  ii. Characterizing the value of these features for each lexeme
• Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry
  - Intuition: “You can tell a word’s meaning from the company it keeps”
• More specifically, you can use as ‘values’ of these features:
  - The FREQUENCIES with which these words occur near the words whose meaning we are defining
  - Or perhaps the PROBABILITIES that these words occur next to each other
• Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)

Using neighboring words to specify the meaning of words
• Take, e.g., the following corpus:
  1. John ate a banana.
  2. John ate an apple.
  3. John drove a lorry.
• We can extract the following co-occurrence matrix:

          john  ate  drove  banana  apple  lorry
  john      0    2     1      1       1      1
  ate       2    0     0      1       1      0
  drove     1    0     0      0       0      1
  banana    1    1     0      0       0      0
  apple     1    1     0      0       0      0
  lorry     1    0     1      0       0      0
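
The matrix can be reproduced with a short script. Two assumptions are made to match the counts shown: two words co-occur when they appear in the same sentence, and the articles "a" and "an" are ignored. The names CORPUS, STOP_WORDS, VOCAB and MATRIX are illustrative.

```python
from collections import Counter
from itertools import permutations

CORPUS = ["John ate a banana.", "John ate an apple.", "John drove a lorry."]
STOP_WORDS = {"a", "an"}  # assumed: the matrix on the slide omits the articles

def cooccurrence_counts(sentences):
    """Count, for each ordered pair of words, the sentences they share."""
    counts = Counter()
    for sentence in sentences:
        words = [
            w for w in sentence.lower().rstrip(".").split()
            if w not in STOP_WORDS
        ]
        for w, v in permutations(words, 2):
            counts[w, v] += 1
    return counts

COUNTS = cooccurrence_counts(CORPUS)
VOCAB = ["john", "ate", "drove", "banana", "apple", "lorry"]
# Row w, column v: number of sentences in which w and v both occur.
MATRIX = [[COUNTS[w, v] for v in VOCAB] for w in VOCAB]
```

The rows for banana and apple come out identical, which is the sense in which the two words have the same distribution in this tiny corpus.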

Acquiring lexical vectors from a corpus (Schuetze, 1991; Burgess and Lund, 1997)
• To construct vectors C(w) for each word w:
  1. Scan a text
  2. Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size
• Differences among methods:
  - Size of window
  - Weighted or not
  - Whether every word in the vocabulary counts as a dimension (including function words such as the or and) or whether instead only some specially chosen words are used (typically, the m most common content words in the corpus; or perhaps modifiers only). The words chosen as dimensions are often called CONTEXT WORDS
  - Whether dimensionality reduction methods are applied
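
The scanning procedure can be sketched as below. This is an unweighted version with a symmetric window; the parameter names window and context_words are illustrative, and no particular published model is being reproduced.

```python
from collections import Counter, defaultdict

def build_vectors(tokens, window=2, context_words=None):
    """Return C(w) for each word w as a Counter over neighbouring words v.

    `window` is the number of positions considered on each side of w.
    If `context_words` is given, only those words count as dimensions.
    """
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own neighbour
            v = tokens[j]
            if context_words is None or v in context_words:
                vectors[w][v] += 1
    return vectors
```

A weighted variant would add a value that decreases with the distance |i - j| instead of 1; that corresponds to the "weighted or not" choice in the list of differences.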

The HAL model (Burgess and Lund, 1995, 1997)
• A 160-million-word corpus of articles extracted from all newsgroups containing English dialogue
• Context words: the 70,000 most frequently occurring symbols within the corpus
• Window size: 10 words to the left and the right of the word
• Measure of similarity: cosine

Latent Semantic Analysis (LSA) (Landauer et al., 1997)
• Goal: extract relations of expected contextual usage from passages
• Three steps:
  1. Build a word / document co-occurrence matrix
  2. ‘Weigh’ each cell
  3. Perform a DIMENSIONALITY REDUCTION
• Argued to correlate well with humans on a number of tests
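
In LSA the dimensionality reduction step is carried out with a truncated singular value decomposition. As a sketch, with notation introduced here rather than taken from the slide: let X be the weighted word / document matrix and k the number of dimensions kept.

```latex
X = U \Sigma V^{\top},
\qquad
X_k = U_k \Sigma_k V_k^{\top},
\quad k \ll \operatorname{rank}(X)
```

Here \Sigma_k keeps only the k largest singular values, and U_k, V_k the corresponding columns. X_k is the best rank-k approximation of X in the least-squares sense, and words are then compared through their rows in the reduced space.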

Topic correlations in ‘raw’ and ‘reconstructed’ data

SEXTANT (Grefenstette, 1992)
It was concluded that the carcinoembryonic antigens represent cellular constituents which are repressed during the course of differentiation of the normal digestive system epithelium and reappear in the corresponding malignant cells by a process of derepressive dedifferentiation

  antigen       carcinoembryonic-ADJ
  antigen       repress-DOBJ
  antigen       represent-SUBJ
  constituent   cellular-ADJ
  constituent   represent-DOBJ
  course        repress-IOBJ
  ……..

SEXTANT: Similarity measure
  DOG: pet-DOBJ, eat-SUBJ, shaggy-ADJ, brown-ADJ, leash-NN
  CAT: pet-DOBJ, pet-DOBJ, hairy-ADJ, leash-NN

Jaccard:
\[ J(A, B) = \frac{\mathrm{Count}(\text{attributes shared by } A \text{ and } B)}{\mathrm{Count}(\text{unique attributes possessed by } A \text{ or } B)} \]
\[ J(\mathrm{DOG}, \mathrm{CAT}) = \frac{\mathrm{Count}\{\text{leash-NN}, \text{pet-DOBJ}\}}{\mathrm{Count}\{\text{brown-ADJ}, \text{eat-SUBJ}, \text{hairy-ADJ}, \text{leash-NN}, \text{pet-DOBJ}, \text{shaggy-ADJ}\}} = \frac{2}{6} \]
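
The Jaccard computation for DOG and CAT can be checked with a few lines of Python. This is a minimal sketch of the plain set-based measure shown on the slide; the function name jaccard is illustrative, and returning 0.0 for two empty attribute sets is a convention adopted here.

```python
def jaccard(a, b):
    """Shared attributes divided by all distinct attributes of a and b."""
    a, b = set(a), set(b)  # duplicates such as the repeated pet-DOBJ collapse
    union = a | b
    return len(a & b) / len(union) if union else 0.0

DOG = ["pet-DOBJ", "eat-SUBJ", "shaggy-ADJ", "brown-ADJ", "leash-NN"]
CAT = ["pet-DOBJ", "pet-DOBJ", "hairy-ADJ", "leash-NN"]
```

Because the attributes are turned into sets, the repeated pet-DOBJ for CAT counts once, which is why the denominator is 6 rather than 7.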

Some caveats
• Two senses of ‘similarity’:
  - Schuetze: two words are similar if one can replace the other
  - Brown et al.: two words are similar if they occur in similar contexts
• What notion of ‘meaning’ is learned here?
  - “One might consider LSA’s maximal knowledge of the world to be analogous to a well-read nun’s knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young” (Landauer et al., 1997)
• Can one do semantics with these representations?
  - Our own experience: using HAL-style vectors for resolving bridging references
  - Very limited success
  - Applying dimensionality reduction didn’t seem to help

Applications of these techniques: Information Retrieval

        cosmonaut  astronaut  moon  car  truck
  d1        1          0        1    1     0
  d2        0          1        1    0     0
  d3        1          0        0    0     0
  d4        0          0        0    1     1
  d5        0          0        0    1     0
  d6        0          0        0    0     1
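
A retrieval step over this document / term matrix can be sketched as follows. The slide shows only the matrix, so the ranking scheme is an assumption: the query is turned into a binary term vector and documents are ranked by cosine. The names TERMS, DOCS and search are illustrative.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]
DOCS = {
    "d1": [1, 0, 1, 1, 0],
    "d2": [0, 1, 1, 0, 0],
    "d3": [1, 0, 0, 0, 0],
    "d4": [0, 0, 0, 1, 1],
    "d5": [0, 0, 0, 1, 0],
    "d6": [0, 0, 0, 0, 1],
}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def search(query_terms):
    """Rank documents by cosine with the query; drop documents scoring 0."""
    q = [1 if t in query_terms else 0 for t in TERMS]
    scored = [(cosine(q, vec), name) for name, vec in DOCS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]
```

With raw term vectors a query for astronaut retrieves only d2, and misses d1 and d3, which mention cosmonaut instead: the synonymy problem for IR in miniature.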

Readings
• Jurafsky and Martin, chapter 17.3
• Also useful:
  - Manning and Schuetze, chapter 8
  - Charniak, chapters 9-10
• Some papers:
  - HAL: see the Higher Dimensional Space page
  - LSA: various papers on the Colorado site
• Good reference: Landauer, Foltz, and Laham (1997). Introduction to Latent Semantic Analysis. Discourse Processes.