Towards Semantics for IR



DESCRIPTION

Towards Semantics for IR. Eugene Agichtein, Emory University. Acknowledgements: a bunch of slides in this talk are adapted from lots of people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.

TRANSCRIPT

Page 1: Towards Semantics for IR

Eugene Agichtein, Emory University

Fall 2006, CS 584: Information Retrieval. Math & Computer Science Department, Emory University

Acknowledgements: a bunch of slides in this talk are adapted from lots of people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.

Page 2: Towards Semantics for IR


Who is this guy?

Sept 2006-: Assistant Professor in the Math & CS department at Emory.

2004 to 2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond.

2004: Ph.D. in Computer Science from Columbia University; dissertation on extracting structured relations from large unstructured text databases.

1998: B.S. in Engineering from The Cooper Union.

Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.

Page 3: Towards Semantics for IR


Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: Entities, Relations, Facts, Events in Text (my research area)

Page 4: Towards Semantics for IR


Information Retrieval From Text

(Diagram: a Query String and a Document corpus are fed into an IR System, which returns Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...)

Page 5: Towards Semantics for IR


Was that the whole story in IR?

(Diagram of the full search process: Resource -> Source Selection -> Query Formulation -> Query -> Search -> Ranked List -> Selection -> Documents -> Examination -> Documents -> Delivery, with feedback loops for query reformulation, vocabulary learning, relevance feedback, and source reselection.)

Page 6: Towards Semantics for IR


Supporting the Search Process

(Same search-process diagram, now showing the system side that supports it: Acquisition -> Collection -> Indexing -> Index, which the Search step runs against.)

Page 7: Towards Semantics for IR


Example: Query

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
- Slow (for large corpora)
- NOT Calpurnia requires egrep
- Other operations (e.g., finding the word Romans near countrymen, or the top-K scenes "most about" a topic) are not feasible

Page 8: Towards Semantics for IR


Term-document incidence (1 if the play contains the word, 0 otherwise):

             Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony                1                   1              0           0        0         1
Brutus                1                   1              0           1        0         0
Caesar                1                   1              0           1        1         1
Calpurnia             0                   1              0           0        0         0
Cleopatra             1                   0              0           0        0         0
mercy                 1                   0              1           1        1         1
worser                1                   0              1           1        1         0

Query: Brutus AND Caesar but NOT Calpurnia

Page 9: Towards Semantics for IR


Incidence vectors: so we have a 0/1 vector for each term.

Boolean model: to answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented), and bitwise AND them: 110100 AND 110111 AND 101111 = 100100.

Vector-space model: compute query-document similarity as the dot product / cosine between the query and document vectors, and rank by similarity.
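To make the Boolean step concrete, here is a minimal sketch (mine, not from the slides) that answers the Brutus AND Caesar AND NOT Calpurnia query over incidence rows copied from the table above:

# Boolean retrieval over the term-document incidence matrix above.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {                       # 0/1 rows copied from the table
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def AND(a, b):
    return [x & y for x, y in zip(a, b)]

def NOT(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia  ->  110100 AND 110111 AND 101111 = 100100
answer = AND(AND(incidence["Brutus"], incidence["Caesar"]), NOT(incidence["Calpurnia"]))
print([play for play, bit in zip(plays, answer) if bit])
# -> ['Antony and Cleopatra', 'Hamlet']

The same 0/1 rows, read as vectors, are what a vector-space model would feed into a dot-product or cosine score.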

Page 10: Towards Semantics for IR


Answers to query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Page 11: Towards Semantics for IR


Modern Search Engines in 1 Minute

Crawl time:
- "Inverted list": term -> doc IDs (and positions)
- "Content chunks": copies of the documents

Query time:
- Look up the query terms in the inverted list -> "filter set"
- Get the content chunks for those doc IDs
- Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
- Retrieve the top K documents for the query (K < 100 << |filter set|)

(Index illustration: postings of (doc ID, position) pairs for the terms "angina" and "treatment"; both terms occur in doc 11222, whose content chunk begins "A myocardial infarction is ...".)
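As a toy illustration of the crawl-time / query-time split (made-up documents and doc IDs; real engines use far more elaborate data structures and ranking features):

# Toy inverted index: term -> [(doc_id, position)], plus a trivial ranking step.
from collections import defaultdict

docs = {   # made-up content chunks
    11222: "angina treatment with nitroglycerin a myocardial infarction is",
    40942: "exercise angina symptoms angina and treatment options",
    92739: "angina pectoris overview",
}

# Crawl time: build the inverted list.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((doc_id, pos))

def search(query, k=10):
    terms = query.split()
    # Filter set: docs containing every query term.
    filter_set = set.intersection(*({d for d, _ in inverted[t]} for t in terms))
    # Toy "ranking" feature: total query-term frequency in the doc.
    scores = {d: sum(1 for t in terms for dd, _ in inverted[t] if dd == d)
              for d in filter_set}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(search("angina treatment"))   # -> [40942, 11222]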

Page 12: Towards Semantics for IR


Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: Entities, Relations, Facts, Events

Page 13: Towards Semantics for IR


The Central Problem in IR

The information seeker has concepts in mind and expresses them as query terms; authors have concepts in mind and express them as document terms.

Do these represent the same concepts?

Page 14: Towards Semantics for IR


Noisy-Channel Model of IR

The user has an information need, "thinks" of a relevant document, and writes down some queries.

Task of information retrieval: given the query, figure out which document (d1, d2, ..., dn in the collection) it came from.

Page 15: Towards Semantics for IR


How is this a noisy channel?

No one seriously claims that this is actually what's going on… But this view is mathematically convenient!

(Classic channel: Source -> Transmitter -> channel (+ noise) -> Receiver -> Destination, carrying a message.
IR analogue: Information need -> Encoder -> channel -> Decoder -> query terms, where the query formulation process plays the role of the noisy channel.)

Page 16: Towards Semantics for IR


Problems with term-based retrieval
- Synonymy: "Power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"

Page 17: Towards Semantics for IR


Polysemy and Context

Document similarity at the single-word level runs into polysemy and context.

(Illustration: the word "saturn" contributes to similarity with words like ring, jupiter, space, voyager, planet under meaning 1 (the planet), and with words like car, company, dodge, ford under meaning 2 (the car brand); a shared word contributes to similarity only if it is used in the same meaning in both documents.)

Page 18: Towards Semantics for IR


Ambiguity: different documents with the same keywords may have different meanings…

What do frogs eat?

(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.

(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.

(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

keywords: frogs, eat

What is the largest volcano in the Solar System?

(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.

(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.

(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

keywords: largest, volcano, solar, system

Page 19: Towards Semantics for IR


Indexing Word Synsets/Senses

How does indexing word senses solve the synonymy/polysemy problem?

Okay, so where do we get the word senses?
- WordNet: a lexical database for standard English
- Automatically find "clusters" of words that describe the same concepts

{dog, canine, doggy, puppy, etc.} -> concept 112986

"I deposited my check in the bank."  -> bank = concept 76529
"I saw the sailboat from the bank."  -> bank = concept 53107

http://wordnet.princeton.edu/

Page 20: Towards Semantics for IR


Example: Contextual Word Similarity

Dagan et al, Computer Speech & Language, 1995

Use Mutual Information: the (pointwise) mutual information of two words is I(w1, w2) = log [ P(w1, w2) / ( P(w1) P(w2) ) ].

Page 21: Towards Semantics for IR


Word Sense Disambiguation

Given a word in context, automatically determine its sense (concept). This is the Word Sense Disambiguation (WSD) problem.

Context is the key:
- For each ambiguous word, note the surrounding words
- "Learn" a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents

bank + {river, sailboat, water, etc.} -> side of a river
bank + {check, money, account, etc.} -> financial institution

Page 22: Towards Semantics for IR


Example: Unsupervised WSD

Hypothesis: the same sense of a word will have similar neighboring words.

Disambiguation algorithm (a sketch in code follows below):
- Identify context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Assign a sense to each such region

"Sit on a chair"
"Take a seat on this chair"
"The chair of the Math Department"
"The chair of the meeting"
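A minimal sketch of that clustering idea, assuming scikit-learn is available; the four "chair" snippets above serve as the occurrences, CountVectorizer builds the context vectors, and k-means stands in for "partition into regions of high density":

# Unsupervised WSD sketch: cluster bag-of-context-words vectors for "chair".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

occurrences = [
    "sit on a chair",
    "take a seat on this chair",
    "the chair of the math department",
    "the chair of the meeting",
]

# Context vectors: the surrounding words, i.e. the snippet minus the target word.
contexts = [" ".join(w for w in occ.split() if w != "chair") for occ in occurrences]
X = CountVectorizer().fit_transform(contexts)

# Approximate "regions of high density" with k-means, k = 2 senses.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for occ, sense in zip(occurrences, labels):
    print(f"sense {sense}: {occ}")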

Page 23: Towards Semantics for IR


Does it help retrieval? Not really…

Examples of limited success….

Ellen M. Voorhees (1993). Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.

Mark Sanderson (1994). Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.

And others…

Hinrich Schütze and Jan O. Pedersen (1995). Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.

Rada Mihalcea and Dan Moldovan (2000). Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.

Page 24: Towards Semantics for IR


Why Disambiguation Can Hurt

- Bag-of-words techniques already disambiguate: context for each term is established by the other terms in the query
- Heuristics (e.g., always choosing the most frequent sense) work better
- WSD is hard! Many words are highly polysemous (e.g., "interest"); the granularity of senses is often domain/application specific; and queries are short, leaving too little context for accurate WSD
- WSD tries to improve precision, but incorrect sense assignments hurt recall, and slight gains in precision do not offset large drops in recall

Page 25: Towards Semantics for IR


Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: Synonymy, Polysemy, Ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: Entities, Relations, Facts, Events

Page 26: Towards Semantics for IR


Latent Semantic Analysis

Perform a low-rank approximation of the document-term matrix (typical rank 100-300).

General idea:
- Map documents (and terms) to a low-dimensional representation
- Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
- Compute document similarity based on the inner product in this latent semantic space

Goals:
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimensionality reduction
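A small numerical sketch of the idea (a made-up 4x4 term-document count matrix and k = 2; real systems use tf-idf weighting and the 100-300 dimensions mentioned above):

# LSA sketch: truncated SVD of a tiny term-document matrix, then cosine
# similarity between documents in the latent space.
import numpy as np

# rows = terms, columns = documents (toy counts)
A = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "automobile"
    [0, 0, 2, 1],   # "planet"
    [0, 0, 1, 2],   # "saturn"
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_latent = (np.diag(s[:k]) @ Vt[:k]).T      # documents in the k-dim latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(doc_latent[0], doc_latent[1]))   # two "car" documents: high similarity
print(cos(doc_latent[0], doc_latent[2]))   # "car" doc vs. "planet" doc: low similarity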

Page 27: Towards Semantics for IR


Latent Semantic Analysis

Latent semantic space: illustrating example

courtesy of Susan Dumais

Page 28: Towards Semantics for IR


Page 29: Towards Semantics for IR


Simplistic picture: (figure showing documents grouped around three directions, Topic 1, Topic 2, and Topic 3, in the latent space)

Pages 30-34: Towards Semantics for IR (no text extracted from these slides)

Page 35: Towards Semantics for IR


Some (old) empirical evidence

- Precision at or above median TREC precision; top scorer on almost 20% of TREC 1, 2, 3 topics (c. 1990)
- Slightly better on average than the original vector space model

Effect of dimensionality:

Dimensions   Precision
   250         0.367
   300         0.371
   346         0.374

Page 36: Towards Semantics for IR


Page 37: Towards Semantics for IR


Problems with term-based retrieval
- Synonymy: "Power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"

Page 38: Towards Semantics for IR


Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: Entities, Relations, Facts, Events

Page 39: Towards Semantics for IR


IR based on Language Model (LM)

(Diagram: the information need generates a query; each document d1, d2, ..., dn in the collection has an associated language model Md1, Md2, ..., Mdn, and documents are ranked by P(Q | Md).)

A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

The LM approach directly exploits that idea!

Page 40: Towards Semantics for IR


Formal Language (Model)

Traditional generative model: generates strings. Finite state machines or regular grammars, etc.

Example: a model that generates "I wish", "I wish I wish", "I wish I wish I wish", "I wish I wish I wish I wish", …

but cannot generate: *"wish I wish"

Page 41: Towards Semantics for IR


Stochastic Language Models

Model the probability of generating strings in the language (commonly, all strings over the alphabet ∑).

Model M:
  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02

For s = "the man likes the woman", multiply the per-word probabilities:
  P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008
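The same computation as a few lines of code (a sketch; real systems work in log space to avoid underflow):

# Score a string under the unigram model M defined above.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(s, model):
    p = 1.0
    for w in s.split():
        p *= model.get(w, 0.0)   # unseen words get probability 0 (hence smoothing, later)
    return p

print(p_string("the man likes the woman", M))   # 8e-08, as on the slide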

Page 42: Towards Semantics for IR


Stochastic Language Models

Model the probability of generating any string. Two models:

  word        P(word | M1)   P(word | M2)
  the             0.2            0.2
  class           0.01           0.0001
  sayst           0.0001         0.03
  pleaseth        0.0001         0.02
  yon             0.0001         0.1
  maiden          0.0005         0.01
  woman           0.01           0.0001

Scoring s = "the class pleaseth yon maiden":
  under M1: 0.2 x 0.01 x 0.0001 x 0.0001 x 0.0005
  under M2: 0.2 x 0.0001 x 0.02 x 0.1 x 0.01

so P(s | M2) > P(s | M1).

Page 43: Towards Semantics for IR


Stochastic Language Models

A statistical model for generating text: a probability distribution over strings in a given language.

For a model M and a string w1 w2 w3 w4:
  P(w1 w2 w3 w4 | M) = P(w1 | M) x P(w2 | M, w1) x P(w3 | M, w1 w2) x P(w4 | M, w1 w2 w3)

Page 44: Towards Semantics for IR


Unigram and higher-order models

Chain rule (exact):
  P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

Unigram language models:
  P(w1 w2 w3 w4) ≈ P(w1) P(w2) P(w3) P(w4)

Bigram (generally, n-gram) language models:
  P(w1 w2 w3 w4) ≈ P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)

Other language models: grammar-based models (PCFGs), etc.; probably not the first thing to try in IR.

Unigrams: easy, and effective!

Page 45: Towards Semantics for IR


Using Language Models in IR

- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) x P(d) / P(q)
  - P(q) is the same for all documents, so it can be ignored
  - P(d) (the prior) is often treated as the same for all d, but we could use criteria like authority, length, genre
  - P(q | d) is the probability of q under d's model
- A very general formal approach

Page 46: Towards Semantics for IR


The fundamental problem of LMs

Usually we don't know the model M, but we have a sample of text representative of that model.

So: estimate a language model from the sample, then compute the observation probability of the query under the estimated model, P(Q | M(sample)).

Page 47: Towards Semantics for IR


Language Models for IR

Language modeling approaches attempt to model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (a multinomial approach).

Page 48: Towards Semantics for IR


Retrieval based on probabilistic LM

Treat the generation of queries as a random process. Approach:
- Infer a language model for each document
- Estimate the probability of generating the query according to each of these models
- Rank the documents according to these probabilities
- Usually a unigram estimate of words is used; there is some work on bigrams, paralleling van Rijsbergen

Page 49: Towards Semantics for IR


Retrieval based on probabilistic LM: intuition

Users:
- Have a reasonable idea of the terms that are likely to occur in documents of interest
- Will choose query terms that distinguish these documents from others in the collection

Collection statistics:
- Are integral parts of the language model
- Are not used heuristically as in many other approaches
- In theory. In practice, there's usually some wiggle room for empirically set parameters

Pages 50-51: Towards Semantics for IR (no text extracted from these slides)

Page 52: Towards Semantics for IR


Query generation probability

Ranking formula:

  p(Q, d) = p(d) p(Q | d) ≈ p(d) p(Q | Md)

The probability of producing the query given the language model of document d, using MLE, is:

  p(Q | Md) = prod over t in Q of p_mle(t | Md) = prod over t in Q of tf(t, d) / dl(d)

where
  Md       : the language model of document d
  tf(t, d) : the raw term frequency of term t in document d
  dl(d)    : the total number of tokens in document d

Unigram assumption: given a particular language model, the query terms occur independently.
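The MLE estimate above, written as a short sketch; the example document is the one used in the worked example a few slides later:

# p(Q | Md) = product over query terms t of tf(t, d) / dl(d)   (MLE, no smoothing)
def p_query_mle(query, doc_text):
    doc = doc_text.lower().split()
    p = 1.0
    for t in query.split():
        p *= doc.count(t) / len(doc)    # zero if t never occurs in d (hence the smoothing below)
    return p

d1 = "Xerox reports a profit but revenue is down"
print(p_query_mle("revenue down", d1))   # (1/8) * (1/8) = 0.015625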

Pages 53-55: Towards Semantics for IR (no text extracted from these slides)

Page 56: Towards Semantics for IR


Smoothing (continued)

There’s a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½ or to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 98]

Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution

Page 57: Towards Semantics for IR


Smoothing: Mixture model

  P(w | d) = λ Pmle(w | Md) + (1 – λ) Pmle(w | Mc)

Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important
- A high value of λ makes the search "conjunctive-like" (suitable for short queries); a low value is more suitable for long queries
- Can tune λ to optimize performance; perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
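A minimal sketch of this mixture (Jelinek-Mercer style smoothing with a made-up two-document mini-corpus; log probabilities would be used in practice):

# P(w|d) = lam * Pmle(w|Md) + (1 - lam) * Pmle(w|Mc), multiplied over query terms.
def lm_score(query, doc_tokens, coll_tokens, lam=0.5):
    p = 1.0
    for w in query.split():
        p_doc = doc_tokens.count(w) / len(doc_tokens)
        p_coll = coll_tokens.count(w) / len(coll_tokens)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc = "adult frogs eat mainly insects and other small animals".split()
coll = doc + "alligators eat many kinds of small animals including frogs".split()
print(lm_score("frogs eat", doc, coll))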

Page 58: Towards Semantics for IR


Basic mixture model summary

General formulation of the LM for IR: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one:

  p(Q, d) = p(d) * prod over t in Q of ( (1 – λ) p(t) + λ p(t | Md) )

where p(t) is the general (collection) language model and p(t | Md) is the individual-document model.

Page 59: Towards Semantics for IR


Example

Document collection (2 documents):
  d1: Xerox reports a profit but revenue is down
  d2: Lucent narrows quarter loss but revenue decreases further

Model: MLE unigram from documents; λ = ½. Query: revenue down

  p(Q, d) = p(d) p(Q | d)

  P(Q | d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] = 1/8 x 3/32 = 3/256
  P(Q | d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] = 1/8 x 1/32 = 1/256

A component of the model is missing… what is it, and why?

Ranking: d1 > d2
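Checking the example's arithmetic with exact fractions (a small sketch; λ = ½, MLE unigram estimates):

# Verify P(Q|d1) = 3/256 and P(Q|d2) = 1/256 for the example above.
from fractions import Fraction as F

d1 = "Xerox reports a profit but revenue is down".lower().split()
d2 = "Lucent narrows quarter loss but revenue decreases further".lower().split()
coll = d1 + d2

def p_mix(w, doc, lam=F(1, 2)):
    return lam * F(doc.count(w), len(doc)) + (1 - lam) * F(coll.count(w), len(coll))

for name, d in (("d1", d1), ("d2", d2)):
    print(name, p_mix("revenue", d) * p_mix("down", d))
# d1 3/256, d2 1/256  ->  ranking: d1 > d2, as on the slide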

Page 60: Towards Semantics for IR


Language Models for IR Tasks: cross-lingual IR, distributed IR, structured document retrieval, personalization, modelling redundancy, predicting query difficulty, predicting information extraction accuracy, PLSI.

Page 61: Towards Semantics for IR


Standard Probabilistic IR

(Diagram: the information need generates a query, which is matched against each document d1, d2, ..., dn in the collection; documents are ranked by P(R | Q, d), the probability of relevance given the query and the document.)

Page 62: Towards Semantics for IR


IR based on Language Model (LM)

(Same diagram as before: the information need generates a query; each document di has a language model Mdi, and documents are ranked by P(Q | Md).)

A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

The LM approach directly exploits that idea!

Page 63: Towards Semantics for IR


Collection-Topic-Document Model

(Diagram: the information need generates a query; besides the per-document models Md1, ..., Mdn there are a whole-collection model MC and topic models MT1, ..., MTm. Documents are ranked by P(Q | MC, MT, Md), compared against P(Q | MC, MT).)

Page 64: Towards Semantics for IR


Collection-Topic-Document model: a 3-level model

1. Whole-collection model (MC)
2. Specific-topic model, i.e. relevant-documents model (MT)
3. Individual-document model (Md)

Relevance hypothesis:
- A request (query; topic) is generated from a specific-topic model {MC, MT}.
- Iff a document is relevant to the topic, the same model will apply to the document: it will replace part of the individual-document model in explaining the document.
- The probability of relevance of a document is the probability that this model explains part of the document, i.e. the probability that the {MC, MT, Md} combination is better than the {MC, Md} combination.

Page 65: Towards Semantics for IR


Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: Synonymy, Polysemy, Ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: Entities, Relations, Facts, Events

Page 66: Towards Semantics for IR


Probabilistic LSI

- Uses the LSI idea, but based in probability theory
- Comes from the statistical Aspect (Language) Model: generate a co-occurrence model based on a non-observed (latent) class
- This is a mixture model: it models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions:
  - Observed pairs (doc, word) are generated randomly
  - Conditional independence: conditioned on the latent class, words are generated independently of the document

Page 67: Towards Semantics for IR


Aspect Model

Generation process (a code sketch follows below):
- Choose a doc d with probability P(d); there are N d's
- Choose a latent class z with probability P(z | d); there are K z's, and K << N
- Generate a word w with probability P(w | z)

This creates the pair (d, w), without direct concern for z. Joining the probabilities gives you:

  P(d, w) = P(d) * sum over z of P(z | d) P(w | z)

Remember: P(z | d) means "probability of z, given d". K is chosen in advance (how many topics are in the collection???).
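A toy sketch of this generative process with made-up distributions (two documents, two latent classes); only the (d, w) pairs would ever be observed:

# Aspect model generation: d ~ P(d), then z ~ P(z|d), then w ~ P(w|z).
import random

P_d = {"d1": 0.5, "d2": 0.5}
P_z_given_d = {"d1": {"cars": 0.9, "space": 0.1},
               "d2": {"cars": 0.2, "space": 0.8}}
P_w_given_z = {"cars":  {"saturn": 0.3, "dodge": 0.4, "ford": 0.3},
               "space": {"saturn": 0.4, "jupiter": 0.3, "voyager": 0.3}}

def draw(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def generate_pair():
    d = draw(P_d)
    z = draw(P_z_given_d[d])        # latent class: never observed directly
    w = draw(P_w_given_z[z])
    return d, w                     # only (d, w) is observed

print([generate_pair() for _ in range(5)])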

Page 68: Towards Semantics for IR


Aspect Model (2)

Applying Bayes' theorem gives the symmetric parameterization:

  P(d, w) = sum over z of P(z) P(d | z) P(w | z)

This is conceptually different from LSI: the word distribution P(w | d) is based on a combination of specific classes/factors/aspects, P(w | z).

Page 69: Towards Semantics for IR


Detour: EM Algorithm

Tune parameters of distributions with missing/hidden data

Topics = hidden classes

Extremely useful, general technique

Pages 70-75: Towards Semantics for IR (no text extracted from these slides)

Page 76: Towards Semantics for IR


Expectation Maximization

Sketch of an EM algorithm for PLSI (a code sketch follows below):
- E-step: calculate the posterior probabilities of the latent classes z based on the current parameter estimates
- M-step: update the parameter estimates based on the calculated probabilities

Problem: overfitting
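A compact sketch of such an EM loop for PLSA on a random toy count matrix (illustrative only; no tempering, stopping criteria, or held-out evaluation):

# EM for PLSA: E-step computes P(z|d,w); M-step re-estimates P(w|z) and P(z|d).
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, K, iters = 4, 6, 2, 50
counts = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)   # toy data

P_w_z = rng.random((K, n_words)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)
P_z_d = rng.random((n_docs, K));  P_z_d /= P_z_d.sum(axis=1, keepdims=True)

for _ in range(iters):
    # E-step: P(z | d, w) proportional to P(z | d) * P(w | z)
    post = P_z_d[:, :, None] * P_w_z[None, :, :]          # shape (docs, K, words)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts n(d, w) * P(z | d, w)
    expected = counts[:, None, :] * post
    P_w_z = expected.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True) + 1e-12
    P_z_d = expected.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True) + 1e-12

print(np.round(P_w_z, 2))   # per-topic word distributions P(w | z)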

Page 77: Towards Semantics for IR


Similarities: LSI and PLSI

Both use intermediate, latent, non-observed data for classification (hence the "L").

PLSI can compose a joint probability analogous to the LSI SVD:
  U -> U_hat, with entries P(di | zk)
  V -> V_hat, with entries P(wj | zk)
  Σ -> Σ_hat = diag(P(zk))_k

  JP = U_hat * Σ_hat * V_hat^T

JP is similar to the SVD-reconstructed term-document matrix N, but its values, P(di | zk), P(wj | zk), and diag(P(zk))_k, are calculated probabilistically.

Page 78: Towards Semantics for IR


Differences: LSI and PLSI

Basis:
- LSI: term frequencies (usually), with dimension reduction via projection or zeroing out the weaker components
- PLSI: statistical; generate a model of the probabilistic relation between W, D, and Z, and refine it until an effective model is produced

Page 79: Towards Semantics for IR


Experiment: 128-factor decomposition

Page 80: Towards Semantics for IR


Experiments

Page 81: Towards Semantics for IR


pLSI Improves on LSI

- Consistently better accuracy curves than LSI
- Computationally: TEM (tempered EM) vs. SVD
- Better from a modeling sense: it uses the likelihood of the sample and aims for its maximization, while SVD's L2-norm (or other) objective makes an implicit Gaussian noise assumption
- More intuitive: polysemy is recognizable by viewing P(w | z), and synonymy gets similar handling

Page 82: Towards Semantics for IR


LSA, pLSA in Practice?

Only rumors (about Teoma using it).

Both LSA and pLSA are VERY expensive:
- LSA: running times of ~one day on ~10K docs
- pLSA: M. Federico (ICASSP 2002, [hermes.itc.it]) uses a corpus of 1.2 million newspaper articles with a vocabulary of 200K words, approximating pLSA with Non-negative Matrix Factorization (NMF): 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)

Do we need (P)LSI for web search?

Page 83: Towards Semantics for IR


Did we solve our problem?

Page 84: Towards Semantics for IR


Ambiguity: different documents with the same keywords may have different meanings…

What do frogs eat?

(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.

(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.

(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

keywords: frogs, eat

What is the largest volcano in the Solar System?

(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.

(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.

(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

keywords: largest, volcano, solar, system

Page 85: Towards Semantics for IR


What we need

Detect and exploit semantic relations between entities

Whole other lecture