Special Topics on Information Retrieval

Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/

Beyond word-based representations

Content of the section

• Language ambiguity and IR
• Indexing with parts of speech
  – POS tagging
• Indexing with senses
  – Approaches for word sense disambiguation
• Concept indexing
  – DOR and TCOR representations
  – Random indexing

Language ambiguity

• Ambiguity is a condition where information can be understood or interpreted in more than one way.

• Context may play a role in resolving ambiguity.
• Different kinds of ambiguity:
  – Lexical: words may have different meanings
  – Syntactic: a sentence can be parsed in more than one way (or words can have two parts of speech).
  – Semantic: words or concepts have an inherently diffuse meaning based on informal usage

Ambiguity and IR – looking for what?

• “Paris Hilton”
  – Really interested in the Hilton Hotel in Paris?
• “Tiger Woods”
  – Searching for something about wildlife or the famous golf player?
• Conclusion: “simple word matching fails”.

Examples of ambiguity

• Lexical:
  – “Plants/N need light and water” vs. “Each one plant/V one”
  – “The fisherman jumped off the bank and into the water” vs. “The bank down the street was robbed!”
• Syntactic:
  – He ate the cookies on the couch
    • Was he seated on the couch, or were the cookies there?


Ambiguity and IR – two problems

• Most IR models represent documents as a “bag of words”
  – There is no information on the words’ positions.
• Two main problems:
  – Synonymy: many ways to refer to the same object, e.g. car and automobile
    • leads to poor recall
  – Polysemy: most words have more than one distinct meaning, e.g. model, bank, chip
    • leads to poor precision

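Example: Vector Space Model (taken from Lillian Lee)

[Figure: three documents represented by their index terms.
Document 1: auto, engine, bonnet, tyres, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize]

• Synonymy: documents 1 and 2 will have a small cosine but are related
• Polysemy: documents 2 and 3 will have a large cosine but are not truly related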

First idea: indexing with POS tags


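[Figure: document–term matrix. Columns w1t1, w1t2, …, Plant|NN, Plant|VB, …, wntm are the whole vocabulary of the collection with POS tags; rows d1, d2, …, dm are the documents; each cell wi,j is the weight indicating the contribution of term-POS j in document i.]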

• Simple and nice idea, but how to determine the POS tag of each word of a given document?

Part-Of-Speech tagging
(based on material from Dana S. Nau of the University of Maryland and Huong LeThanh of the Dresden University of Technology)

• Part of Speech (POS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence.
  – Input: a string of words + a tag set
  – Output: a single best tag for each word
• Example (from Penn Treebank):
  – The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.


Brown/Penn Treebank tags


Main approaches

• Rule-based POS tagging
  – e.g., ENGTWOL [Voutilainen, 1995]
• Transformation-based tagging
  – e.g., Brill’s tagger [Brill, 1995]
• Stochastic (probabilistic) tagging
  – e.g., TNT [Brants, 2000]
    • Requires a training corpus (e.g., the Brown Corpus)
    • Based on the probability of a certain tag occurring, given information from the word and the previous tags.


Very first approach

• Assign each word its most likely POS tag
  – If w has tags t1, …, tk, then we can use
  – $P(t_i \mid w) = \dfrac{c(w,t_i)}{c(w,t_1) + \dots + c(w,t_k)}$, where
    • $c(w,t_i)$ = number of times w/ti appears in the corpus
• Accuracy: about 91% for English
  – For instance, heat is used more often as a noun than as a verb.
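
The most-likely-tag baseline above can be sketched in a few lines. This is a minimal illustration, not a reference tagger: the toy tagged corpus and the function names `train_counts`, `tag_probability` and `most_likely_tag` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_counts(tagged_corpus):
    """Collect c(w, t): how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts

def tag_probability(counts, word, tag):
    """P(t | w) = c(w, t) / sum over all tags t' of c(w, t')."""
    word_counts = counts[word.lower()]
    total = sum(word_counts.values())
    return word_counts[tag] / total if total else 0.0

def most_likely_tag(counts, word):
    """Pick the tag with the highest P(t | w) for a known word."""
    return counts[word.lower()].most_common(1)[0][0]

# Toy corpus (hypothetical): "heat" is seen twice as a noun, once as a verb.
corpus = [("heat", "NN"), ("the", "DT"), ("heat", "NN"), ("heat", "VB")]
counts = train_counts(corpus)
```

With this toy corpus, `most_likely_tag(counts, "heat")` picks the noun tag, since P(NN | heat) = 2/3. Unknown words are not handled here; a real tagger needs a fallback for them.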


HMM tagging

• An HMM simplifying assumption: the tagging problem can be solved by looking at nearby words and tags.

$t_i = \arg\max_j P(t_j \mid t_{i-1}) \, P(w_i \mid t_j)$

  – $P(t_j \mid t_{i-1})$: previous tag sequence (tag co-occurrence)
  – $P(w_i \mid t_j)$: word (lexical) likelihood

Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

• Suppose we have tagged all but race
  – Look at just the preceding word (bigram):
  – to/TO race/??? NN or VB?
• Choose the tag with the greater of the two probabilities:
  – P(VB|TO)P(race|VB) or P(NN|TO)P(race|NN)
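
The bigram decision for "race" can be written out directly. In this sketch the probability values are illustrative assumptions, not figures from the slides; in practice they would be estimated from a tagged training corpus. The function name `best_tag` is also hypothetical.

```python
def best_tag(prev_tag, word, candidates, transition, emission):
    """Return the candidate tag t maximizing P(t | prev_tag) * P(word | t)."""
    def score(tag):
        return transition.get((prev_tag, tag), 0.0) * emission.get((word, tag), 0.0)
    return max(candidates, key=score)

# Assumed probabilities, for illustration only.
transition = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}      # P(tag | previous tag)
emission = {("race", "VB"): 0.00012, ("race", "NN"): 0.00057}  # P(word | tag)

chosen = best_tag("TO", "race", ["NN", "VB"], transition, emission)
```

Under these assumed values the verb reading wins: although "race" is more likely given NN than given VB, a verb is far more likely than a noun after TO. A full HMM tagger would apply the same scoring over the whole sentence (for example with the Viterbi algorithm) rather than to a single word.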


Does indexing with POS work?

• Improves precision but reduces recall.
• Conclusion: annotating POS does not seem worthwhile as a standalone indexing strategy, even if tagging is performed manually.
• Example:
  – Query: “talented baseball player”
  – Document: “is one of the top talents of the time”


Second idea: motivation

• Using single words as index terms generally has good exhaustivity, but poor specificity due to word ambiguity.
• Some word associations have a meaning totally different from the “sum” of the meanings of the words that compose them.
  – Hot + dog ≠ “hot dog”
• To remedy this problem: use index terms more complex than single words, such as phrases.
  – Distinguish the two meanings by using phrasal index terms such as “bank of the Seine” and “bank of Japan”


Second idea: indexing with phrases


[Figure: document–phrase matrix. Columns p1, p2, …, pn are the phrases extracted from the collection (e.g., “information retrieval”, “Manuel Montes”, “brown sugar”); rows d1, d2, …, dm are the documents; each cell wi,j is the weight indicating the contribution of phrase j in document i.]

• Here the questions are: which kinds of word sequences are relevant phrases, and how can they be extracted?

Syntactical phrases as index terms

This apple pie looks good and is a real treat

• adjective-noun relation (real-treat)
• noun-noun relation (apple-pie)
• subject-verb relation (pie-looks)
• verb-object relation (is-treat)
• The complication is that they must be extracted from the POS-tagged text or from the syntactic tree.


Named entities as index terms

• Proper names in texts
  – Three universally accepted categories: person, location and organisation
  – Other categories: date/time expressions, measures (percent, money, weight, etc.), email addresses, etc.
• One problem: they can also be ambiguous!
  – George Bush: person or location?
  – Mexico: geo-political organization or location?
• How to detect named entities?


Named entity recognition

• Two tasks: identification and classification
• Two main approaches:
  – Knowledge-based
    • Rule-based; developed by experienced language engineers; makes use of human intuition
    • Names often have internal structure and style.
  – Learning-based
    • Uses statistics or machine learning methods
    • Requires large amounts of labeled documents
    • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence and the words.


N-grams as index terms

• An n-gram is a subsequence of n items from a given sequence
• N-grams are easily computed
• Combining n-grams of different sizes produces great flexibility at search time.
• The main problem is the high dimensionality.

How to reduce dimensionality? How to select only the most useful n-grams?
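
A minimal sketch of word n-gram extraction, including the combination of several sizes mentioned above. Whitespace tokenization and the function names are assumptions made for the example.

```python
def word_ngrams(tokens, n):
    """All contiguous subsequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_index_terms(text, sizes=(1, 2, 3)):
    """Index terms obtained by combining n-grams of several sizes."""
    tokens = text.lower().split()
    terms = []
    for n in sizes:
        terms.extend(word_ngrams(tokens, n))
    return terms

terms = ngram_index_terms("special topics on information retrieval")
```

Even this five-word text yields 5 + 4 + 3 = 12 index terms, which illustrates why dimensionality grows quickly when n-grams of several sizes are combined over a whole collection.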


Maximal Frequent Sequences as index terms

• Sequences of words that are frequent in the document collection and that are not contained in any other longer frequent sequence.
  – A sequence is considered to be frequent if it appears in at least σ documents.
• Its main strength is that it forms a very compact index
  – Avoids storing the numerous least significant phrases
• The extraction of MFS is commonly based on a combination of bottom-up and greedy methods
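
The definition of a maximal frequent sequence can be checked on a toy collection with a brute-force sketch. Two simplifying assumptions are made here and are not taken from the slides: only contiguous word sequences up to `max_len` words are considered, and every candidate is enumerated instead of using the bottom-up and greedy strategies that practical extractors rely on. All names are hypothetical.

```python
from collections import Counter

def contiguous_sequences(tokens, max_len):
    """Set of all contiguous word sequences of length 1..max_len in one document."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)}

def is_contained(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def maximal_frequent_sequences(docs, sigma, max_len=5):
    # Document frequency: each sequence is counted once per document.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(contiguous_sequences(doc.lower().split(), max_len))
    frequent = [seq for seq, df in doc_freq.items() if df >= sigma]
    # Keep only sequences not contained in a longer frequent sequence.
    return sorted(seq for seq in frequent
                  if not any(len(other) > len(seq) and is_contained(seq, other)
                             for other in frequent))

docs = ["information retrieval is fun",
        "modern information retrieval systems",
        "brown sugar is sweet"]
mfs = maximal_frequent_sequences(docs, sigma=2)
```

With σ = 2, the single words "information" and "retrieval" are frequent but are absorbed by the longer frequent sequence "information retrieval", so the resulting index keeps only that phrase and the word "is".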


Does indexing with phrases work?

• Early results were very promising. However, the constant growth of test collections caused a drastic fall in the quality of the results.

• A conclusion of research works is that phrases improve results in low levels of recall.

• The recommendation is to consider phrases as supplementary terms of the vector space
  – Terms + phrases as index terms


Third idea: motivation

• Traditional IR approaches are highly dependent on term-matching

• Term matching is affected by the synonymy and polysemy phenomena.

• Need to capture the concepts instead of only the words

• Solution: indexing by senses!


What is word sense?

• A word sense is one of the meanings of a word.
• Words have different meanings depending on their context.
• Example:
  – We went to see a play at the theater
  – The children went out to play in the park

A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.


Third idea: indexing by senses

• How to construct this index? How to determine the sense of each word from the document collection?


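[Figure: document–sense matrix. Columns w11, w12, …, Bank (institution), Bank (hill), …, pn1, …, pnm are all the different word senses from the target collection; rows d1, d2, …, dm are the documents; each cell wi,j is the weight indicating the contribution of word-sense j in document i.]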

Word sense disambiguation


• The task of selecting a sense for a word from a set of predefined possibilities.
  – The sense inventory usually comes from a dictionary or thesaurus.
• A related task is word sense discrimination: the task of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.

The WSD process

• Choose a sense inventory
  – Dictionary or thesaurus where word senses are explicitly indicated.
• Design/apply a disambiguation procedure
  – Two main approaches: Knowledge-Based and Machine Learning
• Evaluate the performance of the procedure
  – Using a manually labeled corpus
  – Using the most frequent sense as baseline


Approaches for WSD

• Knowledge-Based Approaches
  – Rely on knowledge resources like WordNet, thesauri, etc.
  – May use grammar rules for disambiguation.
  – May use hand-coded rules for disambiguation.
• Machine Learning Based Approaches
  – Rely on corpus evidence.
  – Train a model using a tagged (and untagged) corpus.
  – Probabilistic/statistical models.


Knowledge resources

• Dictionaries in machine-readable form (MRD)
  – Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English
• Thesauri – add synonymy information
  – Roget’s Thesaurus
• Semantic networks – add more relations
  – WordNet, EuroWordNet



WordNet

• A large lexical database organized in terms of meanings.
  – Includes nouns, adjectives, adverbs, and verbs
  – Synonymous words are grouped into synsets
• Example:
  – {car, auto, automobile, machine, motorcar}

WordNet example


Lesk algorithm

• Identify senses of words in a context using definition overlap.
  – Identify simultaneously the correct senses for all words in context
• Algorithm:
  1. Retrieve from the MRD all sense definitions of the words to be disambiguated
  2. Determine the definition overlap for all possible sense combinations
  3. Choose the senses that lead to the highest overlap


Example

• Disambiguate “PINE CONE”
  – PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness
  – CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

Definition overlap for each sense combination:

            Cone#1  Cone#2  Cone#3
Pine#1        0       1       2
Pine#2        0       0       0
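
The three steps of the Lesk algorithm can be sketched on the PINE CONE example. The stopword list and the very crude suffix stripper `stem` are assumptions introduced so that "tree"/"trees" and "shape"/"shaped" count as the same word; a real system would use a proper stemmer or lemmatizer and a machine-readable dictionary instead of the hand-typed definitions below.

```python
import re
from itertools import product

STOPWORDS = {"a", "of", "or", "the", "this", "through", "to", "whether", "which", "with"}

def stem(word):
    """Crude suffix stripping (assumption): drop -ing/-ed/-s, then a final -e."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word[:-1] if word.endswith("e") and len(word) > 3 else word

def signature(definition):
    """Set of stemmed content words of a sense definition."""
    return {stem(w) for w in re.findall(r"[a-z]+", definition.lower())
            if w not in STOPWORDS}

def overlap(definition_a, definition_b):
    return len(signature(definition_a) & signature(definition_b))

def lesk(senses_by_word):
    """Score every sense combination and return the best one (0-based sense indexes)."""
    words = list(senses_by_word)
    best_combo, best_score = None, -1
    for combo in product(*(range(len(senses_by_word[w])) for w in words)):
        definitions = [senses_by_word[w][i] for w, i in zip(words, combo)]
        score = sum(overlap(a, b) for k, a in enumerate(definitions)
                    for b in definitions[k + 1:])
        if score > best_score:
            best_combo, best_score = combo, score
    return dict(zip(words, best_combo)), best_score

senses = {
    "pine": ["kinds of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen trees"],
}
chosen, score = lesk(senses)
```

On these definitions the sketch selects the first sense of PINE and the third sense of CONE with an overlap of 2 ("evergreen" and "tree"). Because every combination is scored, the cost grows multiplicatively with the number of words and senses.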

Disadvantages of Lesk algorithm

• Too many combinations need to be evaluated; this is a problem with long sentences.
  – A simplified version compares the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
• Not enough overlapping words between definitions
  – Extend the definitions using information such as synonyms, different derivatives, or words from the definitions of the words that appear in the definitions.


WSD using the conceptual density

• Select a sense based on the relatedness of that word-sense to the context.
  – Relatedness is measured in terms of conceptual density (in a structured hierarchical semantic net)

• Idea: if all words in the context are strong indicators of a particular concept then that concept will have a higher density.


Example of the conceptual density

• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.

• The CD formula will yield highest density for the sub-hierarchy containing more senses.

• The sense of W contained in the sub-hierarchy with the highest CD will be chosen.


Supervised approach for WSD

• Induces a classifier from manually sense-tagged text using machine learning techniques.

• Resources:
  – Sense-tagged text
  – Dictionary (implicit source of sense inventory)
  – Syntactic analysis (POS tagger, chunker, parser, …)
• Reduces WSD to a classification problem
  – A target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs


Supervised methodology

1. Create a sample of training data where a given target word is manually annotated with its senses.
2. Select a set of features with which to represent context information.
3. Convert sense-tagged training instances to feature vectors.
4. Apply a machine learning algorithm to induce a classifier.
5. Convert a held-out sample of test data into feature vectors.
6. Apply the classifier to test instances to assign a sense tag.
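
The six steps above can be sketched end to end for the target word "bank". Several choices here are assumptions made for illustration, since the slides do not prescribe them: bag-of-words context features, a Naive Bayes classifier with add-one smoothing, and a tiny hand-made training sample with the sense labels "institution" and "hill".

```python
import math
from collections import Counter, defaultdict

def features(context):
    """Steps 2, 3 and 5: represent a context by its lowercased words."""
    return context.lower().split()

class NaiveBayesWSD:
    def fit(self, examples):
        """Step 4: estimate sense priors and per-sense word counts."""
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for context, sense in examples:
            self.sense_counts[sense] += 1
            for word in features(context):
                self.word_counts[sense][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, context):
        """Step 6: choose the sense with the highest log-probability."""
        total = sum(self.sense_counts.values())

        def log_score(sense):
            n_words = sum(self.word_counts[sense].values())
            score = math.log(self.sense_counts[sense] / total)
            for word in features(context):
                if word in self.vocab:  # unseen words are ignored
                    count = self.word_counts[sense][word] + 1  # add-one smoothing
                    score += math.log(count / (n_words + len(self.vocab)))
            return score

        return max(self.sense_counts, key=log_score)

# Step 1: a (hypothetical) manually sense-tagged sample for "bank".
training = [
    ("deposit money in the bank account", "institution"),
    ("the bank approved the loan", "institution"),
    ("fishing on the river bank", "hill"),
    ("the bank of the river was muddy", "hill"),
]
model = NaiveBayesWSD().fit(training)
```

On held-out contexts such as "we walked along the river bank", the classifier relies on the surrounding words ("river") to pick the sense. With so few examples the result is only illustrative; real systems need far more annotated data and richer features.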


Some interesting data

• High polysemy: especially verbs.

• Imbalanced training sets: Most examples are from the first sense.

• Current methods: explore semi-supervised machine learning approaches.


Sense (n-semic)   Nouns   Average number of examples
1                 9082    13.51
2                 1368     4.61
3                  544     3.68
4                  228     3.55
5                  117     3.24
6                   59     2.74
7                   43     3.52
8                   22     3.13
9                    8     3.17
10                   4     2.33
>10                 11     1.75

Does indexing with senses work?

• How much can WSD help improve IR effectiveness? Open question
  – Weiss: 1%; Voorhees’ method: negative
  – Krovetz and Croft, Sanderson: only useful for short queries
  – Schütze and Pedersen’s approaches and Gonzalo’s experiment: positive result
• WSD must be accurate to be useful for IR
• It seems that it can be more useful as a visualization strategy.

Fourth idea: motivation

• The bag-of-words representation ignores all semantic or conceptual information.
  – It simply looks at the surface word forms
• Words (forms) are very ambiguous.
  – Polysemy and synonymy are big problems
• It is necessary to have representations at the concept level.
  – “Concept” is related to “sense”, but from a practical (usage) point of view.


Fourth idea: concept-based representations

• In IR, documents are represented by the words occurring in them.
  – The semantics of a document is conveyed by the words that occur in it.
• Can the semantics of a word be conveyed by the documents in which it occurs?
• Basis of a representation called:
  – Document Occurrence Representation (DOR)


Document Occurrence Representation

• Intuitions about the weights:
  – The more frequently ti occurs in dj, the more important dj is for characterizing the semantics of ti
  – The more distinct words dj contains, the smaller its contribution to characterizing the semantics of ti.

[Figure: word–document matrix. Rows t1, t2, …, tm are all words from the collection; columns d1, d2, …, dn are all documents from the collection; each cell wi,j is the weight indicating the contribution of document j for the semantics of term i.]

Representing documents by DOR

• DOR is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.

[Figure: the word–document matrix (word representation; rows t1 … tm, columns d1 … dn) is summed over the words of each document to give the document–document matrix (index for IR; rows d1 … dn, columns d1 … dn).]
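
A minimal sketch of DOR follows. The slides give only the two intuitions about the weights, so the concrete weighting used here is an assumption: a dampened term frequency, (1 + log tf), multiplied by log(|V| / N_j), where |V| is the vocabulary size and N_j is the number of distinct words in document j. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def dor_vectors(docs):
    """Word vectors over documents: vectors[t][j] is the weight of document j for word t."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = {term: [0.0] * len(docs) for term in vocab}
    for j, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        # Documents with many distinct words contribute less (assumed formula).
        doc_factor = math.log(len(vocab) / len(counts))
        for term, tf in counts.items():
            vectors[term][j] = (1 + math.log(tf)) * doc_factor
    return vectors

def represent(text, vectors, dim):
    """Document or query vector: sum of the DOR vectors of its (known) words."""
    total = [0.0] * dim
    for word in text.lower().split():
        for k, value in enumerate(vectors.get(word, [])):
            total[k] += value
    return total

def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

docs = ["car engine car", "automobile engine", "bank loan money bank"]
vectors = dor_vectors(docs)
query = represent("automobile", vectors, len(docs))
doc_vectors = [represent(doc, vectors, len(docs)) for doc in docs]
```

In this toy collection the query "automobile" gets a non-zero cosine with the first document even though that document never contains the word: both are linked through "engine", which occurs in the first two documents. The third document stays at zero similarity.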

Alternative representation

• In WSD, words are represented by the terms occurring in their context.
  – The semantics (meaning) of a word is conveyed by the words commonly co-occurring with it.
• Basis of a representation called:
  – Term Co-Occurrence Representation (TCOR)


Term Co-Occurrence Representation

• Intuitions about the weights:
  – The more often ti and tj co-occur, the more important tj is for characterizing the semantics of ti
  – The more distinct words tj co-occurs with, the smaller its contribution to characterizing the semantics of ti.

[Figure: word–word matrix. Rows and columns t1, t2, …, tm are all words from the collection; each cell wi,j is the weight indicating the co-occurrence of words i and j.]

Representing documents by TCOR

• TCOR, like DOR, is a word representation, not a document representation.
• The representation of a document is obtained by summing the vectors of its words.
  – Queries are represented in the same way: the sum of the vectors of their words.

[Figure: the word–word matrix (word representation; rows t1 … tm, columns t1 … tm) is summed over the words of each document to give the document–word matrix (index for IR; rows d1 … dn, columns t1 … tm).]
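
A short sketch of how TCOR word vectors could be built. As with DOR, the exact weighting is an assumption, since only the intuitions are given: co-occurrence is counted at document level, dampened as (1 + log n), and multiplied by log(|V| / N_j), where N_j is the number of distinct words that tj co-occurs with. Words that never co-occur with anything are left out of this toy version.

```python
import math
from collections import defaultdict

def tcor_vectors(docs):
    """Sparse word vectors over words: result[ti][tj] is the weight of tj for ti."""
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        words = set(doc.lower().split())
        for a in words:
            for b in words:
                if a != b:
                    cooccurrence[a][b] += 1  # number of documents shared by a and b
    vocab_size = len(cooccurrence)
    return {
        ti: {tj: (1 + math.log(n)) * math.log(vocab_size / len(cooccurrence[tj]))
             for tj, n in row.items()}
        for ti, row in cooccurrence.items()
    }

word_vectors = tcor_vectors(["car engine", "automobile engine", "bank money"])
```

Here "car" and "automobile" never appear together, yet they receive identical vectors because both co-occur only with "engine". This is how the representation can bring synonyms closer. Document and query vectors are then obtained by summing word vectors, exactly as for DOR.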

Other bag-of-concepts representations

• Standard BoW representations are usually refined before being used:
  – Feature selection: remove some words based on statistical measures
  – Feature extraction: artificial features are created from the original ones using distributional clustering of words or factor analytic methods.
• The problem with these approaches is that they are computationally expensive.
  – Random indexing is a simple approach to generate BoC representations


Random indexing

• Random Indexing is a vector space methodology that accumulates context vectors for words based on co-occurrence data
  – First step: a unique random representation, known as an “index vector”, is assigned to each context (document, paragraph or sentence)

[Figure: documents D1, D2, …, Dn, each assigned a sparse index vector (IV) of dimension k, with k << n, whose few non-zero entries are +1 or -1.]

Random Indexing (2)

  – Second step: index vectors are used to produce context vectors by scanning through the text

[Figure: D1: “Towards an Automata Theory of Brain” and D2: “From Automata Theory to Brain Theory”, each with its index vector; the context vector for “brain” is the sum of the index vectors of D1 and D2, the documents in which it occurs.]

  – Third step: build document vectors from their words’ context vectors.

di: “From Automata Theory to Brain Theory”, with context vectors CV1 (Automata), CV2 (Theory), CV3 (Brain), CV2 (Theory)

di will be represented as the weighted sum of these vectors:

a1CV1 + a2CV2 + a3CV3 + a2CV2, where a1, a2, a3 are idf values
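
The three steps of random indexing can be sketched in plain Python. The parameters are illustrative assumptions: dimension `k = 20` with 4 non-zero entries per index vector, whole documents as contexts, no stopword removal, and idf computed as log(N / df). In a realistic setting k would be much smaller than the number of documents, unlike in this two-document toy example.

```python
import math
import random

def make_index_vector(k, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` random positions set to +1 or -1."""
    vector = [0] * k
    for position in rng.sample(range(k), nonzeros):
        vector[position] = rng.choice((1, -1))
    return vector

def add_scaled(target, source, scale=1.0):
    """In-place target += scale * source."""
    for i, value in enumerate(source):
        target[i] += scale * value

def random_indexing(docs, k=20, nonzeros=4, seed=0):
    rng = random.Random(seed)
    tokenized = [doc.lower().split() for doc in docs]
    # Step 1: one random index vector per context (here, per document).
    index_vectors = [make_index_vector(k, nonzeros, rng) for _ in tokenized]
    # Step 2: a word's context vector accumulates the index vector of the
    # document at each of its occurrences.
    context_vectors = {}
    for tokens, index_vector in zip(tokenized, index_vectors):
        for word in tokens:
            add_scaled(context_vectors.setdefault(word, [0] * k), index_vector)
    # Step 3: a document vector is the idf-weighted sum of its words' context vectors.
    n_docs = len(tokenized)
    doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in context_vectors}
    document_vectors = []
    for tokens in tokenized:
        vector = [0.0] * k
        for word in tokens:
            add_scaled(vector, context_vectors[word], math.log(n_docs / doc_freq[word]))
        document_vectors.append(vector)
    return index_vectors, context_vectors, document_vectors

docs = ["Towards an Automata Theory of Brain",
        "From Automata Theory to Brain Theory"]
index_vectors, context_vectors, document_vectors = random_indexing(docs)
```

As in the slide example, the context vector for "brain" equals the sum of the index vectors of D1 and D2. Because nothing beyond simple additions is needed, the approach avoids the costly matrix factorizations of other feature extraction methods.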

Do concept-based representations work?

• Useful solutions for a number of conceptual matching problems
  – Capture key relationship information, including causal, goal-oriented, and taxonomic information.
• Not too much work in IR
  – Recent experiments demonstrate that TCOR, DOR and random indexing outperform the traditional VSM; in CLEF collections the improvement has been around 7%.
• The most widely used approach is the one based on Latent Semantic Indexing
  – But it is computationally expensive

