Towards Semantics for IR
Eugene Agichtein, Emory University
Fall 2006, CS 584: Information Retrieval. Math & Computer Science Department, Emory University

Acknowledgements: A bunch of slides in this talk are adapted from lots of people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
Who is this guy?
Sept 2006 - present: Assistant Professor in the Math & CS department at Emory.
2004 to 2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond.
2004: Ph.D. in Computer Science from Columbia University: dissertation on extracting structured relations from large unstructured text databases
1998: B.S. in Engineering from The Cooper Union.
Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events in text (my research area)
Information Retrieval From Text
[Diagram: a query string and a document corpus go into the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Was that the whole story in IR?
[Diagram: the search process: resource → source selection → query formulation → query → search → ranked list → selection → documents → examination → documents → delivery; feedback loops support query reformulation, vocabulary learning, relevance feedback, and source reselection.]
Supporting the Search Process
[Diagram: the same search process as above (source selection → query formulation → search → selection → examination → delivery), now with the system side shown: acquisition builds the collection, and indexing builds the index that search runs against.]
Example: Query
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But this is slow (for large corpora); NOT Calpurnia requires egrep; and other operations (e.g., finding the word Romans near countrymen, or the top-K scenes "most about" ...) are not feasible.
Term-document incidence
1 if play contains word, 0 otherwise:

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1              0          0       0        1
Brutus              1                  1              0          1       0        0
Caesar              1                  1              0          1       1        1
Calpurnia           0                  1              0          0       0        0
Cleopatra           1                  0              0          0       0        0
mercy               1                  0              1          1       1        1
worser              1                  0              1          1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors
So we have a 0/1 vector for each term.
Boolean model: to answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
110100 AND 110111 AND 101111 = 100100
Vector-space model: compute query-document similarity as the dot product/cosine between the query and document vectors, and rank by similarity.
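A minimal Python sketch of both models, using the incidence table above (numpy assumed; the vectors follow the play order of the table):

```python
import numpy as np

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
brutus    = np.array([1, 1, 0, 1, 0, 0])
caesar    = np.array([1, 1, 0, 1, 1, 1])
calpurnia = np.array([0, 1, 0, 0, 0, 0])

# Boolean model: bitwise AND with the complemented Calpurnia vector.
hits = brutus & caesar & (1 - calpurnia)        # -> 1 0 0 1 0 0
print([p for p, h in zip(plays, hits) if h])    # Antony and Cleopatra, Hamlet

# Vector-space model: cosine similarity between query and document vectors.
docs = np.stack([brutus, caesar, calpurnia], axis=1)   # one row per play
query = np.array([1, 1, 0])                            # "Brutus Caesar"
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query) + 1e-12)
for play, score in sorted(zip(plays, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {play}")
```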
Answers to query
Antony and Cleopatra, Act III, Scene ii. Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii. Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Modern Search Engines in 1 Minute
Crawl time:
- "Inverted list": terms → doc IDs
- "Content chunks" (doc copies)
Query time:
- Look up query terms in the inverted list → the "filter set"
- Get content chunks for the doc IDs
- Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
- Retrieve the top-K documents for the query (K < 100 << |filter set|)
[Figure: inverted index example. Each term points to a postings list of (doc ID, position) pairs:
  angina    (5 postings): (11001, 44), (99875, 14), (11222, 11), (40942, 24), (92739, 3)
  treatment (4 postings): (32145, 44), (34266, 14), (11222, 17), (40942, 59)
Content chunks store document copies keyed by doc ID, e.g. 11222 → "A myocardial infarction is ..."]
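A minimal sketch of the crawl-time/query-time split described above: build an inverted list of (doc ID, position) postings, then intersect postings for a conjunctive query. The doc IDs are taken from the figure; the document texts are invented for illustration.

```python
from collections import defaultdict

docs = {
    11222: "angina treatment options include medication",
    40942: "exercise after angina diagnosis treatment plan",
    92739: "angina symptoms overview",
}

# Crawl time: index each term with (doc_id, position) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

# Query time: the "filter set" is the set of docs containing every query term.
def filter_set(query):
    doc_sets = [{d for d, _ in index[t]} for t in query.split()]
    return set.intersection(*doc_sets) if doc_sets else set()

print(sorted(filter_set("angina treatment")))   # [11222, 40942]
```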
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
The Central Problem in IR
[Diagram: the information seeker and the author each start from concepts; the seeker expresses them as query terms, the author as document terms.]
Do these represent the same concepts?
Noisy-Channel Model of IR
The user has an information need, "thinks" of a relevant document, and writes down some query. The task of information retrieval: given the query, figure out which document in the collection d1, d2, ..., dn it came from.
How is this a noisy channel?
No one seriously claims that this is actually what's going on, but this view is mathematically convenient!
[Diagram: the classic noisy channel (source → transmitter → channel (+ noise) → receiver → destination) mapped onto IR: information need → encoder (the query formulation process) → channel → query terms → decoder → destination.]
Problems with term-based retrieval
- Synonymy: "power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"
Polysemy and Context
Document similarity on the single-word level runs into polysemy and context.
[Figure: the word "saturn" with two context neighborhoods: meaning 1 (ring, jupiter, space, voyager, planet, ...) and meaning 2 (car, company, dodge, ford, ...). A context word contributes to similarity if the word is used in the 1st meaning, but not if it is used in the 2nd.]
Ambiguity
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Indexing Word Synsets/Senses
How does indexing word senses solve the synonymy/polysemy problem?
Okay, so where do we get the word senses?
- WordNet: a lexical database for standard English (http://wordnet.princeton.edu/)
- Automatically find "clusters" of words that describe the same concepts
{dog, canine, doggy, puppy, etc.} → concept 112986
"I deposited my check in the bank." → bank = concept 76529
"I saw the sailboat from the bank." → bank = concept 53107
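One way to look up synsets in practice is NLTK's WordNet interface; a small sketch (assumes nltk is installed and the 'wordnet' corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

# Each synset is one word sense, with its own ID and gloss.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Synonyms of the first sense of "dog": one {dog, domestic_dog, ...} concept.
print(wn.synsets("dog")[0].lemma_names())
```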
Example: Contextual Word Similarity
Dagan et al., Computer Speech & Language, 1995.
Use mutual information between words, estimated from co-occurrence counts: I(w1, w2) = log [ P(w1, w2) / (P(w1) P(w2)) ]
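A minimal sketch of pointwise mutual information estimated from co-occurrence counts; the toy corpus and the sentence-level "window" are illustrative assumptions, not the setup used by Dagan et al.:

```python
import math
from collections import Counter

sentences = [
    "power law distribution of word frequencies",
    "zipf law distribution of word frequencies",
    "power law and zipf distribution",
]

word_count, pair_count, total = Counter(), Counter(), 0
for s in sentences:
    words = s.split()
    word_count.update(words)
    total += len(words)
    # Co-occurrence within the same sentence (a crude context window).
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            pair_count[(w1, w2)] += 1

def pmi(w1, w2):
    p_xy = pair_count[(w1, w2)] / total
    p_x, p_y = word_count[w1] / total, word_count[w2] / total
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(round(pmi("power", "law"), 2))   # high PMI: they co-occur often
```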
Word Sense Disambiguation
Given a word in context, automatically determine the sense (concept). This is the Word Sense Disambiguation (WSD) problem.
Context is the key:
- For each ambiguous word, note the surrounding words
- "Learn" a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents
bank + {river, sailboat, water, etc.} → side of a river
bank + {check, money, account, etc.} → financial institution
Example: Unsupervised WSD
Hypothesis: the same sense of a word will have similar neighboring words.
Disambiguation algorithm:
- Identify context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Assign a sense to each such region
"Sit on a chair" / "Take a seat on this chair" vs. "The chair of the Math Department" / "The chair of the meeting"
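A minimal sketch of this idea: represent each occurrence of "chair" by a bag-of-words context vector and partition the occurrences into dense regions, here approximated with k-means (scikit-learn assumed; the tiny corpus is illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

contexts = [
    "sit on a chair",
    "take a seat on this chair",
    "the chair of the math department",
    "the chair of the meeting",
]

# Bag-of-words context vectors for every occurrence of the target word.
X = CountVectorizer(stop_words="english").fit_transform(contexts)

# Partition the occurrences into 2 regions; each cluster = one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # occurrences with the same label get the same sense
```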
Does it help retrieval? Not really… Examples of limited success:
- Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
- Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
- Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
- Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
- And others…
Why Disambiguation Can Hurt
- Bag-of-words techniques already disambiguate: context for each term is established in the query
- Heuristics (e.g., always picking the most frequent sense) work better
- WSD is hard! Many words are highly polysemous (e.g., interest); the granularity of senses is often domain/application specific; queries are short, leaving too little context for accurate WSD
- WSD tries to improve precision, but incorrect sense assignments hurt recall, and slight gains in precision do not offset large drops in recall
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
Latent Semantic Analysis
Perform a low-rank approximation of the document-term matrix (typical rank: 100-300).
General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (the latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space.
Goals:
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
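A minimal numpy sketch of the idea: truncated SVD of a toy term-document matrix, then cosine similarity in the latent space. All counts are illustrative; terms are rows, documents are columns.

```python
import numpy as np

#             d0  d1  d2  d3
A = np.array([[2,  0,  0,  0],    # "car"        (only in d0)
              [0,  0,  2,  0],    # "automobile" (only in d2)
              [1,  0,  1,  0],    # "engine"     (bridges d0 and d2)
              [0,  3,  0,  1],    # "planet"
              [0,  1,  0,  2]],   # "saturn"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the top-k latent dimensions
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T    # documents in rank-k space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# d0 and d2 share only "engine" in raw term space (cosine 0.2),
# but become nearly identical after rank-2 reduction.
print(round(cos(A[:, 0], A[:, 2]), 2), round(cos(docs_latent[0], docs_latent[2]), 2))
```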
Latent Semantic Analysis
[Figure: the latent semantic space, an illustrating example; courtesy of Susan Dumais.]
[Figure: simplistic picture of the latent space, with documents grouped into Topic 1, Topic 2, and Topic 3.]
Some (old) empirical evidence
- Precision at or above the median TREC precision; top scorer on almost 20% of TREC 1, 2, and 3 topics (c.f. 1990)
- Slightly better on average than the original vector space model
Effect of dimensionality:

  Dimensions   Precision
  250          0.367
  300          0.371
  346          0.374
Problems with term-based retrieval
- Synonymy: "power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"
Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
IR based on Language Model (LM)
[Diagram: the information need generates the query; each document d1, d2, ..., dn in the collection induces a model Md1, Md2, ..., Mdn, and documents are ranked by the generation probability P(Q | Md).]
A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
The LM approach directly exploits that idea!
Formal Language (Model)
Traditional generative model: generates strings. Finite state machines or regular grammars, etc.
Example:
  I wish
  I wish I wish
  I wish I wish I wish
  ...
Cannot generate: *wish I wish
Stochastic Language Models
Model the probability of generating strings in the language (commonly all strings over the alphabet ∑).

Model M:
  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...

P("the man likes the woman" | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 (multiply the per-word probabilities) = 0.00000008
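A minimal sketch of scoring a string under the unigram model M above:

```python
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
     "said": 0.03, "likes": 0.02}

def p_string(s, model):
    prob = 1.0
    for w in s.split():
        prob *= model.get(w, 0.0)   # unseen words get probability 0 (hence smoothing, later)
    return prob

print(p_string("the man likes the woman", M))   # ~8e-08, as on the slide
```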
Stochastic Language Models
Model the probability of generating any string.

  word       Model M1   Model M2
  the        0.2        0.2
  class      0.01       0.0001
  sayst      0.0001     0.03
  pleaseth   0.0001     0.02
  yon        0.0001     0.1
  maiden     0.0005     0.01
  woman      0.01       0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)
Stochastic Language Models
A statistical model for generating text: a probability distribution over strings in a given language.
For a model M, the probability of a string decomposes by the chain rule:
P(w1 w2 w3 w4 | M) = P(w1 | M) × P(w2 | M, w1) × P(w3 | M, w1 w2) × P(w4 | M, w1 w2 w3)
Unigram and higher-order models
Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
- Unigram language models: P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
- Other language models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.
Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
  - P(q) is the same for all documents, so ignore it
  - P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre
  - P(q | d) is the probability of q given d's model
- A very general formal approach
The fundamental problem of LMs
Usually we don't know the model M, but we have a sample of text representative of that model. So: estimate a language model from the sample, then compute the observation probability P(new text | M(sample)).
Language Models for IR
Language modeling approaches attempt to model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (a multinomial approach).
Retrieval based on probabilistic LM
Treat the generation of queries as a random process. Approach:
- Infer a language model for each document.
- Estimate the probability of generating the query according to each of these models.
- Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used; there is some work on bigrams, paralleling van Rijsbergen.
Retrieval based on probabilistic LM: Intuition
Users:
- Have a reasonable idea of the terms that are likely to occur in documents of interest.
- Will choose query terms that distinguish these documents from others in the collection.
Collection statistics:
- Are integral parts of the language model.
- Are not used heuristically, as in many other approaches. (In theory. In practice, there's usually some wiggle room for empirically set parameters.)
Query generation probability
Ranking formula:
  p(Q, d) = p(d) p(Q | d) ≈ p(d) p(Q | Md)
The probability of producing the query given the language model of document d, using MLE, is:
  p̂(Q | Md) = ∏_{t ∈ Q} p̂_ml(t | Md) = ∏_{t ∈ Q} tf(t, d) / dl(d)
where:
  Md       : language model of document d
  tf(t, d) : raw tf of term t in document d
  dl(d)    : total number of tokens in document d
Unigram assumption: given a particular language model, the query terms occur independently.
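A minimal sketch of the MLE estimate above: p(Q | Md) is the product, over query terms, of tf(t, d) / dl(d).

```python
from collections import Counter

def p_query_mle(query, doc):
    tokens = doc.lower().split()
    tf, dl = Counter(tokens), len(tokens)
    prob = 1.0
    for t in query.lower().split():
        prob *= tf[t] / dl      # zero whenever a query term is absent from d
    return prob

d1 = "Xerox reports a profit but revenue is down"
print(p_query_mle("revenue down", d1))   # (1/8) * (1/8) = 0.015625
```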
Smoothing (continued)
There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 98]. Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution.
Smoothing: Mixture model
  P(w | d) = λ Pmle(w | Md) + (1 − λ) Pmle(w | Mc)
This mixes the probability from the document with the general collection frequency of the word. Correctly setting λ is very important:
- A high value of λ makes the search "conjunctive-like", suitable for short queries
- A low value is more suitable for long queries
- λ can be tuned to optimize performance, perhaps making it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
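A minimal sketch of the mixture for a single word (lam plays the role of λ; the tokenized documents are taken from the example that follows):

```python
def mixture_prob(w, doc_tokens, coll_tokens, lam=0.5):
    p_doc = doc_tokens.count(w) / len(doc_tokens)      # Pmle(w | Md)
    p_coll = coll_tokens.count(w) / len(coll_tokens)   # Pmle(w | Mc)
    return lam * p_doc + (1 - lam) * p_coll

d = "xerox reports a profit but revenue is down".split()
coll = d + "lucent narrows quarter loss but revenue decreases further".split()
# "down" occurs once in d (1/8) and once in the 16-token collection (1/16):
print(mixture_prob("down", d, coll))   # 0.5*(1/8) + 0.5*(1/16) = 0.09375
```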
Basic mixture model summary
General formulation of the LM for IR:
  p(Q, d) = p(d) ∏_{t ∈ Q} ((1 − λ) p(t) + λ p(t | Md))
where p(t) is the general language model and p(t | Md) is the individual-document model.
The user has a document in mind and generates the query from this document; the equation represents the probability that the document the user had in mind was in fact this one.
Example
Document collection (2 documents):
  d1: Xerox reports a profit but revenue is down
  d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½. Query: revenue down
  P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
  P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
Ranking: d1 > d2
A component of the model p(Q, d) = p(d) p(Q | d) is missing… what is it, and why?
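A minimal sketch reproducing the worked example above (lam = ½, query "revenue down"):

```python
def p_query(query, doc, coll, lam=0.5):
    prob = 1.0
    for w in query.split():
        p_d = doc.count(w) / len(doc)       # document MLE
        p_c = coll.count(w) / len(coll)     # collection MLE
        prob *= lam * p_d + (1 - lam) * p_c
    return prob

d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
coll = d1 + d2
print(p_query("revenue down", d1, coll))   # 3/256 ~ 0.0117
print(p_query("revenue down", d2, coll))   # 1/256 ~ 0.0039
```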
Language Models for IR Tasks
- Cross-lingual IR
- Distributed IR
- Structured document retrieval
- Personalization
- Modelling redundancy
- Predicting query difficulty
- Predicting information extraction accuracy
- PLSI
Standard Probabilistic IR
[Diagram: the information need is matched against each document d1, d2, ..., dn in the collection; documents are ranked by P(R | Q, d), the probability of relevance given the query and document.]
IR based on Language Model (LM)
[Diagram: as before, the information need generates the query; each document di induces a model Mdi, and documents are ranked by the generation probability P(Q | Md).]
A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good! The LM approach directly exploits that idea!
Collection-Topic-Document Model
[Diagram: the query is now generated from a combination of models: a collection model MC, topic models MT1, ..., MTm, and document models Md1, ..., Mdn. Documents are ranked by P(Q | MC, MT, Md), with P(Q | MC, MT) as the topic-level generation probability.]
Collection-Topic-Document model
3-level model:
1. Whole-collection model (MC)
2. Specific-topic model; relevant-documents model (MT)
3. Individual-document model (Md)
Relevance hypothesis:
- A request (query; topic) is generated from a specific-topic model {MC, MT}.
- Iff a document is relevant to the topic, the same model will apply to the document: it will replace part of the individual-document model in explaining the document.
The probability of relevance of a document is the probability that this model explains part of the document: the probability that the {MC, MT, Md} combination is better than the {MC, Md} combination.
Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
Probabilistic LSI
- Uses the LSI idea, but grounded in probability theory; comes from the statistical aspect [language] model
- Generates a co-occurrence model based on a non-observed (latent) class
- This is a mixture model: it models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions: observed pairs (doc, word) are generated randomly; conditional independence: conditioned on the latent class, words are generated independently of the document
Aspect Model
Generation process:
- Choose a doc d with probability P(d); there are N d's
- Choose a latent class z with (generated) probability P(z | d); there are K z's, and K << N
- Generate a word w with (generated) probability P(w | z)
This creates the pair (d, w), without direct concern for z. Joining the probabilities gives:
  P(d, w) = P(d) ∑_z P(z | d) P(w | z)
Remember: P(z | d) means "the probability of z, given d". K is chosen in advance (how many topics are in the collection???).
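A minimal sketch of the joint probability above, with toy parameters (N = 2 docs, K = 2 topics, V = 2 words; all numbers are illustrative):

```python
import numpy as np

P_d = np.array([0.5, 0.5])                 # P(d)
P_z_given_d = np.array([[0.9, 0.1],        # P(z|d): rows = docs, cols = topics
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.7, 0.3],        # P(w|z): rows = topics, cols = words
                        [0.1, 0.9]])

# Joint distribution P(d, w) as an N x V matrix; entries sum to 1.
joint = P_d[:, None] * (P_z_given_d @ P_w_given_z)
print(joint, joint.sum())
```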
Aspect Model (2)
Applying Bayes' theorem yields the symmetric parameterization:
  P(d, w) = ∑_z P(z) P(d | z) P(w | z)
This is conceptually different from LSI: the word distribution P(w | d) is based on a combination of specific classes/factors/aspects, P(w | z).
Detour: EM Algorithm
- Tune the parameters of distributions with missing/hidden data (here, topics are the hidden classes)
- Extremely useful, general technique
Expectation Maximization
Sketch of an EM algorithm for PLSI:
- E-step: calculate the posterior probabilities of z based on the current parameter estimates
- M-step: update the parameter estimates based on the calculated probabilities
Problem: overfitting.
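A minimal sketch of this EM loop on a toy count matrix n(d, w); the counts, K, and iteration budget are illustrative, and this plain EM (no tempering) will happily overfit, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.array([[4, 1, 0], [0, 2, 5]], dtype=float)   # n(d, w): docs x words
N, V, K = n.shape[0], n.shape[1], 2

P_z_d = rng.dirichlet(np.ones(K), size=N)    # P(z|d), rows sum to 1
P_w_z = rng.dirichlet(np.ones(V), size=K)    # P(w|z), rows sum to 1

for _ in range(50):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z).
    post = P_z_d[:, :, None] * P_w_z[None, :, :]        # shape (N, K, V)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from posterior-weighted counts.
    expected = n[:, None, :] * post                     # shape (N, K, V)
    P_w_z = expected.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = expected.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

print(np.round(P_w_z, 2))   # the K induced "topics" over the V words
```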
Similarities: LSI and PLSI
- Both use intermediate, latent, non-observed data for classification (hence the "L")
- Can compose a joint probability similar to the LSI SVD:
  Û = P(di | zk), V̂ = P(wj | zk), Σ̂ = diag(P(zk))k
  JP = Û Σ̂ V̂ᵀ
- JP is similar to the SVD of the term-document matrix N, but the values are calculated probabilistically
Differences: LSI and PLSI
Basis:
- LSI: term frequencies (usually); performs dimension reduction via projection or by zeroing out weaker components
- PLSI: statistical; generates a model of the probabilistic relation between W, D, and Z, and refines it until an effective model is produced
Experiment: 128-factor decomposition
Experiments
pLSI Improves on LSI
- Consistently better accuracy curves than LSI; TEM (tempered EM) is comparable to SVD computationally
- Better in a modeling sense: uses the likelihood of the sample and aims to maximize it, whereas SVD uses the L2 norm (or other), an implicit Gaussian noise assumption
- More intuitive: polysemy is recognizable (by viewing P(w | z)); similar handling of synonymy
LSA, pLSA in Practice?
- Only rumors (about Teoma using it)
- Both LSA and pLSA are VERY expensive
- LSA: running times of ~ one day on ~10K docs
- pLSA: M. Federico (ICASSP 2002, [hermes.itc.it]) uses a corpus of 1.2 million newspaper articles with a vocabulary of 200K words, approximating pLSA using Non-negative Matrix Factorization (NMF): 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)
- Do we need (P)LSI for web search?
Did we solve our problem?
Ambiguity
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
What we need
Detect and exploit semantic relations between entities. That's a whole other lecture…