special topics on information retrieval

Special Topics on Information Retrieval Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ [email protected]


Page 1: Special Topics on Information Retrieval

Special Topics on Information Retrieval

Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/

[email protected]

Page 2: Special Topics on Information Retrieval

Beyond word-based representations

Page 3: Special Topics on Information Retrieval

Content of the section

• Language ambiguity and IR
• Indexing with parts of speech
  – POS tagging
• Indexing with senses
  – Approaches for word sense disambiguation
• Concept indexing
  – DOR and TCOR representations
  – Random indexing

Page 4: Special Topics on Information Retrieval

Language ambiguity

• Ambiguity is a condition where information can be understood or interpreted in more than one way.

• Context may play a role in resolving ambiguity.
• Different kinds of ambiguity:
  – Lexical: words may have different meanings.
  – Syntactic: a sentence can be parsed in more than one way (or words can have two parts of speech).
  – Semantic: words or concepts have an inherently diffuse meaning based on informal usage.

Page 5: Special Topics on Information Retrieval

Ambiguity and IR – looking for what?

• “Paris Hilton”
  – Really interested in the Hilton hotel in Paris?
• “Tiger Woods”
  – Searching for something about wildlife or the famous golf player?
• Conclusion: simple word matching fails.

Page 6: Special Topics on Information Retrieval

Examples of ambiguity

• Lexical:
  – “Plants/N need light and water” vs. “Each one plant/V one”
  – “The fisherman jumped off the bank and into the water” vs. “The bank down the street was robbed!”
• Syntactic:
  – “He ate the cookies on the couch”
    • Was he seated on the couch, or were the cookies there?

Page 7: Special Topics on Information Retrieval

Ambiguity and IR – two problems

• Most IR models represent documents as a “bag of words”
  – There is no information on the words’ positions.
• Two main problems:
  – Synonymy: many ways to refer to the same object, e.g. car and automobile
    • leads to poor recall
  – Polysemy: most words have more than one distinct meaning, e.g. model, bank, chip
    • leads to poor precision

Page 8: Special Topics on Information Retrieval

Example: Vector Space Model (taken from Lillian Lee)

• Document 1: auto, engine, bonnet, tyres, lorry, boot
• Document 2: car, emissions, hood, make, model, trunk
• Document 3: make, hidden, Markov, model, emissions, normalize
• Synonymy: documents 1 and 2 will have a small cosine but are related.
• Polysemy: documents 2 and 3 will have a large cosine but are not truly related.
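The effect can be checked numerically. The following is a minimal sketch, assuming the words of the example form three toy documents (auto/engine/bonnet/tyres/lorry/boot, car/emissions/hood/make/model/trunk, make/hidden/Markov/model/emissions/normalize) and that raw term counts are used as weights; the function name `cosine` is illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "auto engine bonnet tyres lorry boot".split()
doc2 = "car emissions hood make model trunk".split()
doc3 = "make hidden Markov model emissions normalize".split()

# Synonymy: doc1 and doc2 describe the same topic but share no word.
print(cosine(doc1, doc2))
# Polysemy: doc2 and doc3 share "make", "model" and "emissions" (about 0.5).
print(cosine(doc2, doc3))
```

With word matching alone, the two car documents look unrelated while the car document and the statistics document look similar.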

Page 9: Special Topics on Information Retrieval

First idea: indexing with POS tags

[Figure: document–term matrix. Rows are the documents d1 … dm; columns are the whole vocabulary of the collection with POS tags (w1|t1, w1|t2, …, Plant|NN, Plant|VB, …, wn|tm). Each cell wi,j is a weight indicating the contribution of term-POS j in document i.]

• Simple and nice idea, but how to determine the POS tag of each word of a given document?

Page 10: Special Topics on Information Retrieval

Part-Of-Speech tagging (based on material from Dana S. Nau of University of Maryland and Huong LeThanh of the Dresden University of Technology)

• Part of Speech (POS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence.
  – Input: a string of words + a tag set
  – Output: a single best tag for each word
• Example (from Penn Treebank):
  – The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Page 11: Special Topics on Information Retrieval

Brown/Penn Treebank tags


Page 12: Special Topics on Information Retrieval

Main approaches

• Rule-based POS tagging
  – e.g., ENGTWOL [Voutilainen, 1995]
• Transformation-based tagging
  – e.g., Brill’s tagger [Brill, 1995]
• Stochastic (probabilistic) tagging
  – e.g., TnT [Brants, 2000]
    • Requires a training corpus (the Brown Corpus)
    • Based on the probability of a certain tag occurring, given information from the word and the previous tags.

Page 13: Special Topics on Information Retrieval

Very first approach

• Assign each word its most likely POS tag
  – If w has tags t1, …, tk, then we can use
    $P(t_i \mid w) = \dfrac{c(w,t_i)}{c(w,t_1) + \dots + c(w,t_k)}$, where
    • c(w,ti) = number of times w/ti appears in the corpus
• Success: 91% for English
  – For instance, heat is used more often as a noun than as a verb.
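The baseline above can be sketched in a few lines. This is an illustrative sketch only: the tiny tagged corpus is made up for the example, and the function name `train_most_likely_tag` is an assumption, not something from the slides.

```python
from collections import Counter, defaultdict

def train_most_likely_tag(tagged_corpus):
    """Map each word to (most frequent tag, P(tag | word)) using corpus counts."""
    counts = defaultdict(Counter)          # counts[w][t] = c(w, t)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    model = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())   # c(w, t1) + ... + c(w, tk)
        tag, c = tag_counts.most_common(1)[0]
        model[word] = (tag, c / total)
    return model

# Hypothetical toy corpus: "heat" is tagged NN twice and VB once.
corpus = [
    ("the", "DT"), ("heat", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("heat", "NN"), ("heat", "VB"), ("the", "DT"), ("water", "NN"),
]
model = train_most_likely_tag(corpus)
print(model["heat"])   # tag NN with probability 2/3
```

The tagger ignores context entirely, which is why it serves as a baseline for the approaches that follow.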


Page 14: Special Topics on Information Retrieval

HMM tagging

• An HMM simplifying assumption: the tagging problem can be solved by looking at nearby words and tags.

$$\hat{t}_i = \arg\max_{t_j} \; P(t_j \mid t_{i-1}) \, P(w_i \mid t_j)$$

  – $P(t_j \mid t_{i-1})$: previous tag sequence (tag co-occurrence)
  – $P(w_i \mid t_j)$: word (lexical) likelihood

Page 15: Special Topics on Information Retrieval

Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

• Suppose we have tagged all but race
  – Look at just the preceding word (bigram):
  – to/TO race/??? NN or VB?
• Choose the tag with the greater of the two probabilities:
  – P(VB|TO)P(race|VB) or P(NN|TO)P(race|NN)
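The comparison can be written out as a short sketch. The probability values below are illustrative assumptions chosen for the example (the slide gives no numbers), and `best_tag` is a hypothetical helper name.

```python
def best_tag(word, prev_tag, candidates, trans, emit):
    """Pick the tag t maximising P(t | prev_tag) * P(word | t)."""
    return max(
        candidates,
        key=lambda t: trans.get((prev_tag, t), 0.0) * emit.get((word, t), 0.0),
    )

# Assumed, illustrative probabilities (not taken from the slides).
trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}            # P(tag | previous tag)
emit = {("race", "VB"): 0.00012, ("race", "NN"): 0.00057}      # P(word | tag)

print(best_tag("race", "TO", ["NN", "VB"], trans, emit))
```

Under these assumed values the tag transition from TO dominates, so the verb reading wins even though "race" is slightly more likely as a noun in isolation.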


Page 16: Special Topics on Information Retrieval

Does indexing with POS work?

• Improves precision but reduces recall.
• Conclusion: annotating POS does not seem worthwhile as a standalone indexing strategy, even if tagging is performed manually.
• Example:
  – Query: “talented baseball player”
  – Document: “is one of the top talents of the time”

Page 17: Special Topics on Information Retrieval

Second idea: motivation

• Using single words as index terms generally has good exhaustivity, but poor specificity due to word ambiguity.

• Some word associations have a meaning totally different from the “sum” of the meanings of the words that compose them.
  – Hot + dog ≠ “hot dog”
• To remedy this problem: use index terms more complex than single words, such as phrases.
  – Distinguish the two meanings by using phrasal index terms such as “bank of the Seine” and “bank of Japan”

Page 18: Special Topics on Information Retrieval

Second idea: indexing with phrases

[Figure: document–phrase matrix. Rows are the documents d1 … dm; columns are the phrases extracted from the collection (p1, p2, …, “Information retrieval”, “Manuel Montes”, “Brown sugar”, …, pn). Each cell wi,j is a weight indicating the contribution of phrase j in document i.]

• The questions here are: which kinds of word sequences are relevant phrases, and how can they be extracted?

Page 19: Special Topics on Information Retrieval

Syntactical phrases as index terms

This apple pie looks good and is a real treat

• adjective-noun relation (real-treat)
• noun-noun relation (apple-pie)
• subject-verb relation (pie-looks)
• verb-object relation (is-treat)
• The complication is that they are extracted from the POS-tagged text or from the syntactic tree.

Page 20: Special Topics on Information Retrieval

Named entities as index terms

• Proper names in texts
  – Three universally accepted categories: person, location and organisation
  – Other categories: date/time expressions, measures (percent, money, weight, etc.), email addresses, etc.
• One problem: they can also be ambiguous!
  – George Bush: person or location?
  – Mexico: geo-political organization or location?
• How to detect named entities?

Page 21: Special Topics on Information Retrieval

Named entity recognition

• Two tasks: identification and classification
• Two main approaches:
  – Knowledge-based
    • Rule-based; developed by experienced language engineers; makes use of human intuition
    • Names often have internal structure and style.
  – Learning-based
    • Uses statistics or machine learning methods
    • Requires large amounts of labeled documents
    • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence, and the words themselves.

Page 22: Special Topics on Information Retrieval

N-grams as index terms

• An n-gram is a subsequence of n items from a given sequence.
• N-grams are easily computed.
• Combining n-grams of different sizes produces great flexibility at search time.
• The main problem is the high dimensionality.

How to reduce dimensionality? How to select only the most useful n-grams?
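A minimal sketch of how cheaply word n-grams can be computed, and of how quickly the number of index terms grows when sizes are combined. Whitespace tokenisation and the helper names `word_ngrams` and `ngram_index_terms` are assumptions made for the illustration.

```python
def word_ngrams(tokens, n):
    """All contiguous subsequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_index_terms(text, sizes=(1, 2, 3)):
    """Index terms obtained by combining n-grams of several sizes."""
    tokens = text.lower().split()
    return [" ".join(g) for n in sizes for g in word_ngrams(tokens, n)]

sentence = "This apple pie looks good and is a real treat"
terms = ngram_index_terms(sentence)
# 10 unigrams + 9 bigrams + 8 trigrams for a single 10-word sentence.
print(len(terms))
print(word_ngrams(sentence.lower().split(), 2)[:3])
```

Even for one sentence the combined vocabulary is almost three times the number of words, which illustrates the dimensionality problem raised above.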


Page 23: Special Topics on Information Retrieval

Maximal Frequent Sequences as index terms

• Sequences of words that are frequent in the document collection and that are not contained in any other longer frequent sequence.
  – A sequence is considered to be frequent if it appears in at least σ documents.
• Their main strength is that they form a very compact index
  – Avoids storing the numerous least significant phrases
• The extraction of MFS is commonly based on a combination of bottom-up and greedy methods
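The definition can be illustrated with a brute-force sketch. Two simplifying assumptions are made here that are not in the slides: sequences are restricted to contiguous word runs, and every candidate is enumerated exhaustively instead of using the bottom-up/greedy extraction mentioned above, so this only scales to toy collections. All names are illustrative.

```python
from collections import Counter

def contiguous_ngrams(tokens):
    """Set of all contiguous word sequences occurring in one document."""
    return {
        tuple(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }

def is_contained(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def maximal_frequent_sequences(docs, sigma):
    """Sequences found in at least sigma documents and in no longer frequent one."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(contiguous_ngrams(doc.lower().split()))
    frequent = [s for s, c in doc_freq.items() if c >= sigma]
    return sorted(
        s for s in frequent
        if not any(len(o) > len(s) and is_contained(s, o) for o in frequent)
    )

docs = [
    "the bank of japan raised rates",
    "rates set by the bank of japan",
    "a river bank",
]
mfs = maximal_frequent_sequences(docs, sigma=2)
print(mfs)
```

Although "bank", "bank of", "of japan" and the other sub-sequences are all frequent, only the longest one is kept, which is what makes the resulting index compact.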


Page 24: Special Topics on Information Retrieval

Does indexing with phrases work?

• Early results were very promising. However, the constant growth of test collections caused a drastic fall in the quality of the results.

• A conclusion of research work is that phrases improve results at low levels of recall.
• The recommendation is to consider phrases as supplementary terms of the vector space
  – Terms + phrases as index terms

Page 25: Special Topics on Information Retrieval

Third idea: motivation

• Traditional IR approaches are highly dependent on term-matching

• Term matching is affected by the synonymy and polysemy phenomena.

• Need to capture the concepts instead of only the words

• Solution: indexing by senses!


Page 26: Special Topics on Information Retrieval

What is word sense?

• A word sense is one of the meanings of a word.
• Words have different meanings depending on the context in which they occur.
• Example:
  – We went to see a play at the theater
  – The children went out to play in the park

A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.


Page 27: Special Topics on Information Retrieval

Third idea: indexing by senses

• How to construct this index? How to determine the sense of each word from the document collection?

[Figure: document–sense matrix. Rows are the documents d1 … dm; columns are all the different word senses from the target collection (w11, w12, …, Bank (institution), Bank (hill), …, pn1, …, pnm). Each cell wi,j is a weight indicating the contribution of the word-sense j in document i.]

Page 28: Special Topics on Information Retrieval

Word sense disambiguation

• The task of selecting a sense for a word from a set of predefined possibilities.
  – The sense inventory usually comes from a dictionary or thesaurus.
• A related task is word sense discrimination: the task of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.

Page 29: Special Topics on Information Retrieval

The WSD process

1. Choose a sense inventory
  – Dictionary or thesaurus where word senses are explicitly indicated.
2. Design/apply a disambiguation procedure
  – Two main approaches: knowledge-based and machine learning
3. Evaluate the performance of the procedure
  – Using a manually labeled corpus
  – Using the most frequent sense as a baseline

Page 30: Special Topics on Information Retrieval

Approaches for WSD

• Knowledge-based approaches
  – Rely on knowledge resources like WordNet, thesauri, etc.
  – May use grammar rules for disambiguation.
  – May use hand-coded rules for disambiguation.
• Machine-learning-based approaches
  – Rely on corpus evidence.
  – Train a model using a tagged (and untagged) corpus.
  – Probabilistic/statistical models.

Page 31: Special Topics on Information Retrieval

Knowledge resources

• Dictionaries in machine-readable form (MRD)
  – Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English
• Thesauri – add synonymy information
  – Roget’s Thesaurus
• Semantic networks – add more relations
  – WordNet, EuroWordNet

Page 32: Special Topics on Information Retrieval

Henrik Bulskov, November 3rd, 2006

WordNet

• A large lexical database organized in terms of meanings.
  – Includes nouns, adjectives, adverbs, and verbs
  – Synonymous words are grouped into synsets
• Example:
  – {car, auto, automobile, machine, motorcar}

Page 33: Special Topics on Information Retrieval

WordNet example

Page 34: Special Topics on Information Retrieval

Lesk algorithm

• Identify senses of words in a context using definition overlap.
  – Simultaneously identify the correct senses for all words in the context
• Algorithm:
  1. Retrieve from the MRD all sense definitions of the words to be disambiguated
  2. Determine the definition overlap for all possible sense combinations
  3. Choose the senses that lead to the highest overlap

Page 35: Special Topics on Information Retrieval

Example

• Disambiguate “PINE CONE”
  – PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness
  – CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

Definition overlap for each sense combination:

            Cone#1   Cone#2   Cone#3
  Pine#1    0        1        2
  Pine#2    0        0        0
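The three steps of the algorithm can be sketched on the PINE CONE example. This is a rough illustration under stated assumptions: the stopword list and the crude plural stripping are choices made here, not part of the slides, so individual overlap counts may differ from the figures above (for instance, this tokenisation finds no overlap between Pine#1 and Cone#2). The winning combination is the same.

```python
from itertools import product

# Assumed stopword list, just large enough for these definitions.
STOPWORDS = {"of", "with", "to", "a", "which", "this", "or", "through", "whether"}

def content_words(definition):
    """Lower-cased content words with a crude plural-stripping normalisation."""
    words = set()
    for w in definition.lower().split():
        if w in STOPWORDS:
            continue
        if w.endswith("s") and len(w) > 3:
            w = w[:-1]
        words.add(w)
    return words

def lesk(senses_by_word):
    """Return the sense combination (1-based) with the highest total overlap."""
    words = list(senses_by_word)
    best_combo, best_score = None, -1
    # Step 2: evaluate every possible combination of senses.
    for combo in product(*(range(len(senses_by_word[w])) for w in words)):
        bags = [content_words(senses_by_word[w][i]) for w, i in zip(words, combo)]
        score = sum(
            len(bags[a] & bags[b])
            for a in range(len(bags))
            for b in range(a + 1, len(bags))
        )
        # Step 3: keep the combination with the highest overlap.
        if score > best_score:
            best_combo, best_score = combo, score
    return {w: i + 1 for w, i in zip(words, best_combo)}, best_score

# Step 1: sense definitions as given in the example.
senses = {
    "pine": [
        "kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}
choice, overlap = lesk(senses)
print(choice, overlap)
```

The loop over `product(...)` makes the combinatorial cost explicit: the number of combinations is the product of the sense counts of all words in the context.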

Page 36: Special Topics on Information Retrieval

Disadvantages of Lesk algorithm

• Too many combinations need to be evaluated; this is a problem with long sentences.
  – A simplified version compares the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
• Not enough overlapping words between definitions
  – Extend the definitions with information such as synonyms, different derivatives, or words from the definitions of the words in the definitions.

Page 37: Special Topics on Information Retrieval

WSD using the conceptual density

• Select a sense based on the relatedness of that word-sense to the context.
  – Relatedness is measured in terms of conceptual density (in a structured hierarchical semantic net)
• Idea: if all words in the context are strong indicators of a particular concept then that concept will have a higher density.

Page 38: Special Topics on Information Retrieval

Example of the conceptual density

• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.

• The CD formula will yield the highest density for the sub-hierarchy containing the most senses.

• The sense of W contained in the sub-hierarchy with the highest CD will be chosen.

Page 39: Special Topics on Information Retrieval

Supervised approach for WSD

• Induces a classifier from manually sense-tagged text using machine learning techniques.

• Resources:
  – Sense-tagged text
  – Dictionary (implicit source of sense inventory)
  – Syntactic analysis (POS tagger, chunker, parser, …)
• Reduces WSD to a classification problem
  – A target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs

Page 40: Special Topics on Information Retrieval

Supervised methodology

1. Create a sample of training data where a given target word is manually annotated with its senses.
2. Select a set of features with which to represent context information.
3. Convert sense-tagged training instances to feature vectors.
4. Apply a machine learning algorithm to induce a classifier.
5. Convert a held-out sample of test data into feature vectors.
6. Apply the classifier to test instances to assign a sense tag.
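The six steps can be sketched end to end for a single target word. This is a minimal illustration, assuming bag-of-context-words features and a hand-written Naive Bayes classifier with add-one smoothing; the slides do not prescribe a particular learner, and the sense labels and training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict

def features(context, target):
    """Steps 2-3 and 5: represent a context by its words, minus the target."""
    return [w for w in context.lower().split() if w != target]

def train(examples, target):
    """Step 4: collect the counts a Naive Bayes classifier needs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for w in features(context, target):
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def classify(context, target, model):
    """Step 6: assign the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())

    def log_score(sense):
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(sense_counts[sense] / total)
        for w in features(context, target):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        return score

    return max(sense_counts, key=log_score)

# Step 1: hypothetical sense-tagged training sample for the word "play".
training = [
    ("we went to see a play at the theater", "drama"),
    ("the play had three acts at the theater", "drama"),
    ("the children went out to play in the park", "recreation"),
    ("kids play games in the park", "recreation"),
]
model = train(training, "play")
print(classify("a new play opens at the theater tonight", "play", model))
print(classify("they play in the park", "play", model))
```

A separate model of this kind is needed for every ambiguous word, which is one reason the approach requires large amounts of sense-tagged text.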


Page 41: Special Topics on Information Retrieval

Some interesting data

• High polysemy: especially verbs.

• Imbalanced training sets: Most examples are from the first sense.

• Current methods: explore semi-supervised machine learning approaches.

  Sense   Nouns   Average number of examples
  1       9082    13.51
  2       1368    4.61
  3       544     3.68
  4       228     3.55
  5       117     3.24
  6       59      2.74
  7       43      3.52
  8       22      3.13
  9       8       3.17
  10      4       2.33
  >10     11      1.75

Page 42: Special Topics on Information Retrieval

Does indexing with senses work?

• How much can WSD help improve IR effectiveness? Open question
  – Weiss: 1%; Voorhees’ method: negative
  – Krovetz and Croft, Sanderson: only useful for short queries
  – Schütze and Pedersen’s approaches and Gonzalo’s experiment: positive result
• WSD must be accurate to be useful for IR
• It seems that it can be more useful as a visualization strategy.

Page 43: Special Topics on Information Retrieval

Fourth idea: motivation

• The bag-of-words representation ignores all semantic or conceptual information.
  – It simply looks at the surface word forms
• Words (forms) are very ambiguous.
  – Polysemy and synonymy are big problems
• It is necessary to have representations at the concept level.
  – “Concept” is related to “sense”, but from a practical (usage) point of view.

Page 44: Special Topics on Information Retrieval

Fourth idea: concept-based representations

• In IR, documents are represented by the words occurring in them.
  – The semantics of a document is conveyed by the words that occur in it.
• Can the semantics of a word be conveyed by the documents in which it occurs?
• Basis of a representation called:
  – Document Occurrence Representation (DOR)

Page 45: Special Topics on Information Retrieval

Document Occurrence Representation

• Intuitions about the weights:
  – The more frequently ti occurs in dj, the more important dj is for characterizing the semantics of ti.
  – The more distinct words dj contains, the smaller its contribution to characterizing the semantics of ti.

[Figure: word–document matrix. Rows are all words from the collection (t1 … tm); columns are all documents from the collection (d1 … dn). Each cell wi,j is a weight indicating the contribution of document j to the semantics of term i.]

Page 46: Special Topics on Information Retrieval

Representing documents by DOR

• DOR is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.

[Figure: the word representation (word–document matrix, rows t1 … tm, columns d1 … dn) is summed (SUM) over the words of each document to give the index for IR (document–document matrix, rows d1 … dn, columns d1 … dn).]
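A minimal sketch of DOR and of the summation step. The slides state only the intuitions behind the weights, so the concrete weighting used here, (1 + log tf) · log(|V| / number of distinct words in the document), is an assumption chosen to match those intuitions; the function names are illustrative.

```python
import math
from collections import Counter

def dor_vectors(docs):
    """One vector per word, with one dimension per document of the collection."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = {t: [0.0] * len(docs) for t in vocab}
    for j, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Documents with many distinct words contribute less (assumed form).
        doc_factor = math.log(len(vocab) / len(tf))
        for t, c in tf.items():
            # More occurrences of t in document j give a larger weight.
            vectors[t][j] = (1 + math.log(c)) * doc_factor
    return vectors

def represent(text, vectors, dims):
    """Document or query representation: the sum of the vectors of its words."""
    total = [0.0] * dims
    for w in text.lower().split():
        for k, x in enumerate(vectors.get(w, [0.0] * dims)):
            total[k] += x
    return total

docs = ["car engine car", "automobile engine", "bank river"]
vectors = dor_vectors(docs)
doc0 = represent(docs[0], vectors, len(docs))
query = represent("automobile", vectors, len(docs))
dot = sum(a * b for a, b in zip(doc0, query))
print(dot > 0)   # the query matches docs[0] although they share no word
```

The query "automobile" obtains a non-zero match with the first document because "engine" occurs in both the first and second documents, which is how the representation gets around pure word matching.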

Page 47: Special Topics on Information Retrieval

Alternative representation

• In WSD, words are represented by the terms occurring in their context.
  – The semantics (meaning) of a word is conveyed by the words commonly co-occurring with it.
• Basis of a representation called:
  – Term Co-Occurrence Representation (TCOR)

Page 48: Special Topics on Information Retrieval

Term Co-Occurrence Representation

• Intuitions about the weights:
  – The more often ti and tj co-occur, the more important tj is for characterizing the semantics of ti.
  – The more distinct words tj co-occurs with, the smaller its contribution to characterizing the semantics of ti.

[Figure: word–word matrix. Rows and columns are all words from the collection (t1 … tm). Each cell wi,j is a weight indicating the co-occurrence of words i and j.]

Page 49: Special Topics on Information Retrieval

Representing documents by TCOR

• TCOR, like DOR, is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.

[Figure: the word representation (word–word matrix, rows t1 … tm, columns t1 … tm) is summed (SUM) over the words of each document to give the index for IR (document–word matrix, rows d1 … dn, columns t1 … tm).]
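A minimal sketch of TCOR. As with DOR, the slides give only the intuitions, so the weighting below, (1 + log of the number of documents where ti and tj co-occur) · log(|V| / number of distinct words tj co-occurs with), is an assumption; co-occurrence is counted at document level and the names are illustrative.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def tcor_vectors(docs):
    """One vector per word, with one dimension per word of the vocabulary."""
    tokenized = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    cooc = defaultdict(Counter)            # cooc[ti][tj] = documents containing both
    for toks in tokenized:
        for ti, tj in permutations(toks, 2):
            cooc[ti][tj] += 1
    index = {t: k for k, t in enumerate(vocab)}
    vectors = {t: [0.0] * len(vocab) for t in vocab}
    for ti in vocab:
        for tj, c in cooc[ti].items():
            # Words that co-occur with many distinct words contribute less.
            spread = math.log(len(vocab) / len(cooc[tj]))
            vectors[ti][index[tj]] = (1 + math.log(c)) * spread
    return vocab, vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["car engine wheel", "automobile engine wheel", "bank river"]
vocab, vectors = tcor_vectors(docs)
print(cosine(vectors["car"], vectors["automobile"]))
print(cosine(vectors["car"], vectors["bank"]))
```

"car" and "automobile" never appear together, yet their vectors are close because both co-occur with "engine" and "wheel". Document vectors are then obtained by summing these word vectors, as described above.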

Page 50: Special Topics on Information Retrieval

Other bag-of-concepts representations

• Standard BoW representations are usually refined before being used:
  – Feature selection: remove some words based on statistical measures
  – Feature extraction: artificial features are created from the originals using distributional clustering of words or factor analytic methods.
• The problem with these approaches is that they are computationally expensive.
  – Random indexing is a simple approach to generating BoC representations

Page 51: Special Topics on Information Retrieval

Random indexing

• Random Indexing is a vector space methodology that accumulates context vectors for words based on co-occurrence data
  – First step: a unique random representation known as an “index vector” is assigned to each context (document, paragraph or sentence)

[Figure: each document D1 … Dn is assigned an index vector (IV) with positions 0 … k, where k << n; each index vector contains a few +1 and −1 entries.]

Page 52: Special Topics on Information Retrieval

Random Indexing (2)

  – Second step: index vectors are used to produce context vectors by scanning through the text

    D1: Towards an Automata Theory of Brain
    D2: From Automata Theory to Brain Theory

    [Figure: the index vectors of D1 and D2 (positions 0 … k, entries +1 and −1) are added to give the context vector for “brain”: 1 1 -1 -1.]

  – Third step: build document vectors from their words’ context vectors.

    di: “From Automata Theory to Brain Theory” → CV1 CV2 CV3 CV2

    di will be represented as the weighted sum of these vectors:

    a1·CV1 + a2·CV2 + a3·CV3 + a2·CV2, where a1, a2, a3 are idf values
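The three steps can be sketched as follows. This is an illustrative sketch under stated assumptions: documents are the contexts, index vectors have dimension k = 8 with four non-zero entries, the idf is log(N / df), and no stopwords are removed. None of these parameter choices come from the slides, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def index_vector(k, nonzeros, rng):
    """Step 1: sparse random vector with an equal number of +1 and -1 entries."""
    v = [0] * k
    for n, pos in enumerate(rng.sample(range(k), nonzeros)):
        v[pos] = 1 if n % 2 == 0 else -1
    return v

def random_indexing(docs, k=8, nonzeros=4, seed=0):
    rng = random.Random(seed)
    tokenized = [d.lower().split() for d in docs]
    ivs = [index_vector(k, nonzeros, rng) for _ in docs]
    # Step 2: every occurrence of a word adds the index vector of its document.
    context = {}
    for toks, iv in zip(tokenized, ivs):
        for w in toks:
            cv = context.setdefault(w, [0] * k)
            for i, x in enumerate(iv):
                cv[i] += x
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    return ivs, context, idf

def document_vector(text, context, idf, k):
    """Step 3: idf-weighted sum of the context vectors of the document's words."""
    vec = [0.0] * k
    for w in text.lower().split():
        if w in context:
            for i, x in enumerate(context[w]):
                vec[i] += idf.get(w, 0.0) * x
    return vec

docs = [
    "Towards an Automata Theory of Brain",
    "From Automata Theory to Brain Theory",
]
ivs, context, idf = random_indexing(docs)
print(context["brain"])                      # sum of the two index vectors
print(document_vector(docs[1], context, idf, k=8))
```

The dimensionality k is fixed in advance and does not grow with the collection, which is what makes the method cheap compared with factor-analytic approaches.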

Page 53: Special Topics on Information Retrieval

Do concept-based representations work?

• Useful solutions for a number of conceptual matching problems
  – Capture key relationship information, including causal, goal-oriented, and taxonomic information.
• Not too much work in IR
  – Recent experiments demonstrate that TCOR, DOR and random indexing outperform the traditional VSM; on CLEF collections the improvement has been around 7%.
• The most widely used approach is the one based on Latent Semantic Indexing
  – But it is computationally expensive