Towards Semantics for IR
Eugene Agichtein, Emory University
Fall 2006, CS 584: Information Retrieval. Math & Computer Science Department, Emory University

Acknowledgements: A bunch of slides in this talk are adapted from lots of people, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
Who is this guy?
Sept 2006 - present: Assistant Professor in the Math & CS department at Emory.
2004 to 2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond.
2004: Ph.D. in Computer Science from Columbia University: dissertation on extracting structured relations from large unstructured text databases
1998: B.S. in Engineering from The Cooper Union.
Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events in text (my research area)
Information Retrieval From Text
[Diagram: a query string and a document corpus go into the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Was that the whole story in IR?
[Diagram: the search process: resource → source selection → query formulation → query → search → ranked list → selection → documents → examination → documents → delivery; feedback loops support query reformulation, vocabulary learning, relevance feedback, and source reselection.]
Supporting the Search Process
[Diagram: the same search process as above (source selection → query formulation → search → selection → examination → delivery), now with the system side shown: acquisition builds the collection, and indexing builds the index that search runs against.]
Example: Query
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But this is slow (for large corpora); NOT Calpurnia requires egrep; and other operations (e.g., finding the word Romans near countrymen, or the top-K scenes "most about" ...) are not feasible.
Term-document incidence
1 if play contains word, 0 otherwise:

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1              0          0       0        1
Brutus              1                  1              0          1       0        0
Caesar              1                  1              0          1       1        1
Calpurnia           0                  1              0          0       0        0
Cleopatra           1                  0              0          0       0        0
mercy               1                  0              1          1       1        1
worser              1                  0              1          1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors
So we have a 0/1 vector for each term.
Boolean model: to answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
110100 AND 110111 AND 101111 = 100100
Vector-space model: compute query-document similarity as the dot product/cosine between the query and document vectors, and rank by similarity.
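A minimal Python sketch of both models, using the incidence table above (numpy assumed; the vectors follow the play order of the table):

```python
import numpy as np

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
brutus    = np.array([1, 1, 0, 1, 0, 0])
caesar    = np.array([1, 1, 0, 1, 1, 1])
calpurnia = np.array([0, 1, 0, 0, 0, 0])

# Boolean model: bitwise AND with the complemented Calpurnia vector.
hits = brutus & caesar & (1 - calpurnia)        # -> 1 0 0 1 0 0
print([p for p, h in zip(plays, hits) if h])    # Antony and Cleopatra, Hamlet

# Vector-space model: cosine similarity between query and document vectors.
docs = np.stack([brutus, caesar, calpurnia], axis=1)   # one row per play
query = np.array([1, 1, 0])                            # "Brutus Caesar"
scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query) + 1e-12)
for play, score in sorted(zip(plays, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {play}")
```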
Answers to query
Antony and Cleopatra, Act III, Scene ii. Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii. Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Modern Search Engines in 1 Minute
Crawl time:
- "Inverted list": terms → doc IDs
- "Content chunks" (doc copies)
Query time:
- Look up query terms in the inverted list → the "filter set"
- Get content chunks for the doc IDs
- Rank documents using hundreds of features (e.g., term weights, web topology, proximity, position)
- Retrieve the top-K documents for the query (K < 100 << |filter set|)
[Figure: inverted index example. Each term points to a postings list of (doc ID, position) pairs:
  angina    (5 postings): (11001, 44), (99875, 14), (11222, 11), (40942, 24), (92739, 3)
  treatment (4 postings): (32145, 44), (34266, 14), (11222, 17), (40942, 59)
Content chunks store document copies keyed by doc ID, e.g. 11222 → "A myocardial infarction is ..."]
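A minimal sketch of the crawl-time/query-time split described above: build an inverted list of (doc ID, position) postings, then intersect postings for a conjunctive query. The doc IDs are taken from the figure; the document texts are invented for illustration.

```python
from collections import defaultdict

docs = {
    11222: "angina treatment options include medication",
    40942: "exercise after angina diagnosis treatment plan",
    92739: "angina symptoms overview",
}

# Crawl time: index each term with (doc_id, position) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

# Query time: the "filter set" is the set of docs containing every query term.
def filter_set(query):
    doc_sets = [{d for d, _ in index[t]} for t in query.split()]
    return set.intersection(*doc_sets) if doc_sets else set()

print(sorted(filter_set("angina treatment")))   # [11222, 40942]
```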
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
The Central Problem in IR
[Diagram: the information seeker and the author each start from concepts; the seeker expresses them as query terms, the author as document terms.]
Do these represent the same concepts?
Noisy-Channel Model of IR
The user has an information need, "thinks" of a relevant document, and writes down some query. The task of information retrieval: given the query, figure out which document in the collection d1, d2, ..., dn it came from.
How is this a noisy channel?
No one seriously claims that this is actually what's going on, but this view is mathematically convenient!
[Diagram: the classic noisy channel (source → transmitter → channel (+ noise) → receiver → destination) mapped onto IR: information need → encoder (the query formulation process) → channel → query terms → decoder → destination.]
Problems with term-based retrieval
- Synonymy: "power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"
Polysemy and Context
Document similarity on the single-word level runs into polysemy and context.
[Figure: the word "saturn" with two context neighborhoods: meaning 1 (ring, jupiter, space, voyager, planet, ...) and meaning 2 (car, company, dodge, ford, ...). A context word contributes to similarity if the word is used in the 1st meaning, but not if it is used in the 2nd.]
Ambiguity
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Indexing Word Synsets/Senses
How does indexing word senses solve the synonymy/polysemy problem?
Okay, so where do we get the word senses?
- WordNet: a lexical database for standard English (http://wordnet.princeton.edu/)
- Automatically find "clusters" of words that describe the same concepts
{dog, canine, doggy, puppy, etc.} → concept 112986
"I deposited my check in the bank." → bank = concept 76529
"I saw the sailboat from the bank." → bank = concept 53107
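One way to look up synsets in practice is NLTK's WordNet interface; a small sketch (assumes nltk is installed and the 'wordnet' corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

# Each synset is one word sense, with its own ID and gloss.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Synonyms of the first sense of "dog": one {dog, domestic_dog, ...} concept.
print(wn.synsets("dog")[0].lemma_names())
```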
Example: Contextual Word Similarity
Dagan et al., Computer Speech & Language, 1995.
Use mutual information between words, estimated from co-occurrence counts: I(w1, w2) = log [ P(w1, w2) / (P(w1) P(w2)) ]
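A minimal sketch of pointwise mutual information estimated from co-occurrence counts; the toy corpus and the sentence-level "window" are illustrative assumptions, not the setup used by Dagan et al.:

```python
import math
from collections import Counter

sentences = [
    "power law distribution of word frequencies",
    "zipf law distribution of word frequencies",
    "power law and zipf distribution",
]

word_count, pair_count, total = Counter(), Counter(), 0
for s in sentences:
    words = s.split()
    word_count.update(words)
    total += len(words)
    # Co-occurrence within the same sentence (a crude context window).
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            pair_count[(w1, w2)] += 1

def pmi(w1, w2):
    p_xy = pair_count[(w1, w2)] / total
    p_x, p_y = word_count[w1] / total, word_count[w2] / total
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(round(pmi("power", "law"), 2))   # high PMI: they co-occur often
```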
Word Sense Disambiguation
Given a word in context, automatically determine the sense (concept). This is the Word Sense Disambiguation (WSD) problem.
Context is the key:
- For each ambiguous word, note the surrounding words
- "Learn" a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents
bank + {river, sailboat, water, etc.} → side of a river
bank + {check, money, account, etc.} → financial institution
Example: Unsupervised WSD
Hypothesis: the same sense of a word will have similar neighboring words.
Disambiguation algorithm:
- Identify context vectors corresponding to all occurrences of a particular word
- Partition them into regions of high density
- Assign a sense to each such region
"Sit on a chair" / "Take a seat on this chair" vs. "The chair of the Math Department" / "The chair of the meeting"
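A minimal sketch of this idea: represent each occurrence of "chair" by a bag-of-words context vector and partition the occurrences into dense regions, here approximated with k-means (scikit-learn assumed; the tiny corpus is illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

contexts = [
    "sit on a chair",
    "take a seat on this chair",
    "the chair of the math department",
    "the chair of the meeting",
]

# Bag-of-words context vectors for every occurrence of the target word.
X = CountVectorizer(stop_words="english").fit_transform(contexts)

# Partition the occurrences into 2 regions; each cluster = one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # occurrences with the same label get the same sense
```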
Does it help retrieval? Not really… Examples of limited success:
- Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
- Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
- Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
- Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
- And others…
Why Disambiguation Can Hurt
- Bag-of-words techniques already disambiguate: context for each term is established in the query
- Heuristics (e.g., always picking the most frequent sense) work better
- WSD is hard! Many words are highly polysemous (e.g., interest); the granularity of senses is often domain/application specific; queries are short, leaving too little context for accurate WSD
- WSD tries to improve precision, but incorrect sense assignments hurt recall, and slight gains in precision do not offset large drops in recall
Outline
- Text Information Retrieval: 10-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: word synsets, WSD
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
Latent Semantic Analysis
Perform a low-rank approximation of the document-term matrix (typical rank: 100-300).
General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (the latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space.
Goals:
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
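A minimal numpy sketch of the idea: truncated SVD of a toy term-document matrix, then cosine similarity in the latent space. All counts are illustrative; terms are rows, documents are columns.

```python
import numpy as np

#             d0  d1  d2  d3
A = np.array([[2,  0,  0,  0],    # "car"        (only in d0)
              [0,  0,  2,  0],    # "automobile" (only in d2)
              [1,  0,  1,  0],    # "engine"     (bridges d0 and d2)
              [0,  3,  0,  1],    # "planet"
              [0,  1,  0,  2]],   # "saturn"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the top-k latent dimensions
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T    # documents in rank-k space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# d0 and d2 share only "engine" in raw term space (cosine 0.2),
# but become nearly identical after rank-2 reduction.
print(round(cos(A[:, 0], A[:, 2]), 2), round(cos(docs_latent[0], docs_latent[2]), 2))
```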
Latent Semantic Analysis
[Figure: the latent semantic space, an illustrating example; courtesy of Susan Dumais.]
[Figure: simplistic picture of the latent space, with documents grouped into Topic 1, Topic 2, and Topic 3.]
Some (old) empirical evidence
- Precision at or above the median TREC precision; top scorer on almost 20% of TREC 1, 2, and 3 topics (c.f. 1990)
- Slightly better on average than the original vector space model
Effect of dimensionality:

  Dimensions   Precision
  250          0.367
  300          0.371
  346          0.374
Problems with term-based retrieval
- Synonymy: "power law" vs. "Zipf distribution"
- Polysemy: "Saturn"
- Ambiguity: "What do frogs eat?"
Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
IR based on Language Model (LM)
[Diagram: the information need generates the query; each document d1, d2, ..., dn in the collection induces a model Md1, Md2, ..., Mdn, and documents are ranked by the generation probability P(Q | Md).]
A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
The LM approach directly exploits that idea!
Formal Language (Model)
Traditional generative model: generates strings. Finite state machines or regular grammars, etc.
Example:
  I wish
  I wish I wish
  I wish I wish I wish
  ...
Cannot generate: *wish I wish
Stochastic Language Models
Model the probability of generating strings in the language (commonly all strings over the alphabet ∑).

Model M:
  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...

P("the man likes the woman" | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 (multiply the per-word probabilities) = 0.00000008
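A minimal sketch of scoring a string under the unigram model M above:

```python
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
     "said": 0.03, "likes": 0.02}

def p_string(s, model):
    prob = 1.0
    for w in s.split():
        prob *= model.get(w, 0.0)   # unseen words get probability 0 (hence smoothing, later)
    return prob

print(p_string("the man likes the woman", M))   # ~8e-08, as on the slide
```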
Stochastic Language Models
Model the probability of generating any string.

  word       Model M1   Model M2
  the        0.2        0.2
  class      0.01       0.0001
  sayst      0.0001     0.03
  pleaseth   0.0001     0.02
  yon        0.0001     0.1
  maiden     0.0005     0.01
  woman      0.01       0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)
Stochastic Language Models
A statistical model for generating text: a probability distribution over strings in a given language.
For a model M, the probability of a string decomposes by the chain rule:
P(w1 w2 w3 w4 | M) = P(w1 | M) × P(w2 | M, w1) × P(w3 | M, w1 w2) × P(w4 | M, w1 w2 w3)
Unigram and higher-order models
Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
- Unigram language models: P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
- Other language models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.
Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics)
- Rank document d based on P(d | q)
- P(d | q) = P(q | d) × P(d) / P(q)
  - P(q) is the same for all documents, so ignore it
  - P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre
  - P(q | d) is the probability of q given d's model
- A very general formal approach
The fundamental problem of LMs
Usually we don't know the model M, but we have a sample of text representative of that model. So: estimate a language model from the sample, then compute the observation probability P(new text | M(sample)).
Language Models for IR
Language modeling approaches attempt to model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (a multinomial approach).
Retrieval based on probabilistic LM
Treat the generation of queries as a random process. Approach:
- Infer a language model for each document.
- Estimate the probability of generating the query according to each of these models.
- Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used; there is some work on bigrams, paralleling van Rijsbergen.
Retrieval based on probabilistic LM: Intuition
Users:
- Have a reasonable idea of the terms that are likely to occur in documents of interest.
- Will choose query terms that distinguish these documents from others in the collection.
Collection statistics:
- Are integral parts of the language model.
- Are not used heuristically, as in many other approaches. (In theory. In practice, there's usually some wiggle room for empirically set parameters.)
Query generation probability
Ranking formula:
  p(Q, d) = p(d) p(Q | d) ≈ p(d) p(Q | Md)
The probability of producing the query given the language model of document d, using MLE, is:
  p̂(Q | Md) = ∏_{t ∈ Q} p̂_ml(t | Md) = ∏_{t ∈ Q} tf(t, d) / dl(d)
where:
  Md       : language model of document d
  tf(t, d) : raw tf of term t in document d
  dl(d)    : total number of tokens in document d
Unigram assumption: given a particular language model, the query terms occur independently.
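A minimal sketch of the MLE estimate above: p(Q | Md) is the product, over query terms, of tf(t, d) / dl(d).

```python
from collections import Counter

def p_query_mle(query, doc):
    tokens = doc.lower().split()
    tf, dl = Counter(tokens), len(tokens)
    prob = 1.0
    for t in query.lower().split():
        prob *= tf[t] / dl      # zero whenever a query term is absent from d
    return prob

d1 = "Xerox reports a profit but revenue is down"
print(p_query_mle("revenue down", d1))   # (1/8) * (1/8) = 0.015625
```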
Smoothing (continued)
There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½, or ε to counts, Dirichlet priors, discounting, and interpolation [Chen and Goodman, 98]. Another simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution.
Smoothing: Mixture model
  P(w | d) = λ Pmle(w | Md) + (1 − λ) Pmle(w | Mc)
This mixes the probability from the document with the general collection frequency of the word. Correctly setting λ is very important:
- A high value of λ makes the search "conjunctive-like", suitable for short queries
- A low value is more suitable for long queries
- λ can be tuned to optimize performance, perhaps making it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
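A minimal sketch of the mixture for a single word (lam plays the role of λ; the tokenized documents are taken from the example that follows):

```python
def mixture_prob(w, doc_tokens, coll_tokens, lam=0.5):
    p_doc = doc_tokens.count(w) / len(doc_tokens)      # Pmle(w | Md)
    p_coll = coll_tokens.count(w) / len(coll_tokens)   # Pmle(w | Mc)
    return lam * p_doc + (1 - lam) * p_coll

d = "xerox reports a profit but revenue is down".split()
coll = d + "lucent narrows quarter loss but revenue decreases further".split()
# "down" occurs once in d (1/8) and once in the 16-token collection (1/16):
print(mixture_prob("down", d, coll))   # 0.5*(1/8) + 0.5*(1/16) = 0.09375
```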
Basic mixture model summary
General formulation of the LM for IR:
  p(Q, d) = p(d) ∏_{t ∈ Q} ((1 − λ) p(t) + λ p(t | Md))
where p(t) is the general language model and p(t | Md) is the individual-document model.
The user has a document in mind and generates the query from this document; the equation represents the probability that the document the user had in mind was in fact this one.
Example
Document collection (2 documents):
  d1: Xerox reports a profit but revenue is down
  d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½. Query: revenue down
  P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
  P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
Ranking: d1 > d2
A component of the model p(Q, d) = p(d) p(Q | d) is missing… what is it, and why?
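A minimal sketch reproducing the worked example above (lam = ½, query "revenue down"):

```python
def p_query(query, doc, coll, lam=0.5):
    prob = 1.0
    for w in query.split():
        p_d = doc.count(w) / len(doc)       # document MLE
        p_c = coll.count(w) / len(coll)     # collection MLE
        prob *= lam * p_d + (1 - lam) * p_c
    return prob

d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
coll = d1 + d2
print(p_query("revenue down", d1, coll))   # 3/256 ~ 0.0117
print(p_query("revenue down", d2, coll))   # 1/256 ~ 0.0039
```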
Language Models for IR Tasks
- Cross-lingual IR
- Distributed IR
- Structured document retrieval
- Personalization
- Modelling redundancy
- Predicting query difficulty
- Predicting information extraction accuracy
- PLSI
Standard Probabilistic IR
[Diagram: the information need is matched against each document d1, d2, ..., dn in the collection; documents are ranked by P(R | Q, d), the probability of relevance given the query and document.]
IR based on Language Model (LM)
[Diagram: as before, the information need generates the query; each document di induces a model Mdi, and documents are ranked by the generation probability P(Q | Md).]
A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good! The LM approach directly exploits that idea!
Collection-Topic-Document Model
[Diagram: the query is now generated from a combination of models: a collection model MC, topic models MT1, ..., MTm, and document models Md1, ..., Mdn. Documents are ranked by P(Q | MC, MT, Md), with P(Q | MC, MT) as the topic-level generation probability.]
Collection-Topic-Document model
3-level model:
1. Whole-collection model (MC)
2. Specific-topic model; relevant-documents model (MT)
3. Individual-document model (Md)
Relevance hypothesis:
- A request (query; topic) is generated from a specific-topic model {MC, MT}.
- Iff a document is relevant to the topic, the same model will apply to the document: it will replace part of the individual-document model in explaining the document.
The probability of relevance of a document is the probability that this model explains part of the document: the probability that the {MC, MT, Md} combination is better than the {MC, Md} combination.
Outline
- Text Information Retrieval: 5-minute overview
- Problems with lexical retrieval: synonymy, polysemy, ambiguity
- A partial solution: synonym lookup
- Towards concept retrieval: LSI, Language Models for IR, PLSI
- Towards real semantic search: entities, relations, facts, events
Probabilistic LSI
- Uses the LSI idea, but grounded in probability theory; comes from the statistical aspect [language] model
- Generates a co-occurrence model based on a non-observed (latent) class
- This is a mixture model: it models a distribution through a mixture (weighted sum) of other distributions
- Independence assumptions: observed pairs (doc, word) are generated randomly; conditional independence: conditioned on the latent class, words are generated independently of the document
Aspect Model
Generation process:
- Choose a doc d with probability P(d); there are N d's
- Choose a latent class z with (generated) probability P(z | d); there are K z's, and K << N
- Generate a word w with (generated) probability P(w | z)
This creates the pair (d, w), without direct concern for z. Joining the probabilities gives:
  P(d, w) = P(d) ∑_z P(z | d) P(w | z)
Remember: P(z | d) means "the probability of z, given d". K is chosen in advance (how many topics are in the collection???).
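A minimal sketch of the joint probability above, with toy parameters (N = 2 docs, K = 2 topics, V = 2 words; all numbers are illustrative):

```python
import numpy as np

P_d = np.array([0.5, 0.5])                 # P(d)
P_z_given_d = np.array([[0.9, 0.1],        # P(z|d): rows = docs, cols = topics
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.7, 0.3],        # P(w|z): rows = topics, cols = words
                        [0.1, 0.9]])

# Joint distribution P(d, w) as an N x V matrix; entries sum to 1.
joint = P_d[:, None] * (P_z_given_d @ P_w_given_z)
print(joint, joint.sum())
```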
Aspect Model (2)
Applying Bayes' theorem yields the symmetric parameterization:
  P(d, w) = ∑_z P(z) P(d | z) P(w | z)
This is conceptually different from LSI: the word distribution P(w | d) is based on a combination of specific classes/factors/aspects, P(w | z).
Detour: EM Algorithm
- Tune the parameters of distributions with missing/hidden data (here, topics are the hidden classes)
- Extremely useful, general technique
Expectation Maximization
Sketch of an EM algorithm for PLSI:
- E-step: calculate the posterior probabilities of z based on the current parameter estimates
- M-step: update the parameter estimates based on the calculated probabilities
Problem: overfitting.
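A minimal sketch of this EM loop on a toy count matrix n(d, w); the counts, K, and iteration budget are illustrative, and this plain EM (no tempering) will happily overfit, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.array([[4, 1, 0], [0, 2, 5]], dtype=float)   # n(d, w): docs x words
N, V, K = n.shape[0], n.shape[1], 2

P_z_d = rng.dirichlet(np.ones(K), size=N)    # P(z|d), rows sum to 1
P_w_z = rng.dirichlet(np.ones(V), size=K)    # P(w|z), rows sum to 1

for _ in range(50):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z).
    post = P_z_d[:, :, None] * P_w_z[None, :, :]        # shape (N, K, V)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from posterior-weighted counts.
    expected = n[:, None, :] * post                     # shape (N, K, V)
    P_w_z = expected.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = expected.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

print(np.round(P_w_z, 2))   # the K induced "topics" over the V words
```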
Similarities: LSI and PLSI
- Both use intermediate, latent, non-observed data for classification (hence the "L")
- Can compose a joint probability similar to the LSI SVD:
  Û = P(di | zk), V̂ = P(wj | zk), Σ̂ = diag(P(zk))k
  JP = Û Σ̂ V̂ᵀ
- JP is similar to the SVD of the term-document matrix N, but the values are calculated probabilistically
Differences: LSI and PLSI
Basis:
- LSI: term frequencies (usually); performs dimension reduction via projection or by zeroing out weaker components
- PLSI: statistical; generates a model of the probabilistic relation between W, D, and Z, and refines it until an effective model is produced
Experiment: 128-factor decomposition
Experiments
pLSI Improves on LSI
- Consistently better accuracy curves than LSI; TEM (tempered EM) is comparable to SVD computationally
- Better in a modeling sense: uses the likelihood of the sample and aims to maximize it, whereas SVD uses the L2 norm (or other), an implicit Gaussian noise assumption
- More intuitive: polysemy is recognizable (by viewing P(w | z)); similar handling of synonymy
LSA, pLSA in Practice?
- Only rumors (about Teoma using it)
- Both LSA and pLSA are VERY expensive
- LSA: running times of ~ one day on ~10K docs
- pLSA: M. Federico (ICASSP 2002, [hermes.itc.it]) uses a corpus of 1.2 million newspaper articles with a vocabulary of 200K words, approximating pLSA using Non-negative Matrix Factorization (NMF): 612 hours of CPU time (7 processors, 2.5 hours/iteration, 35 iterations)
- Do we need (P)LSI for web search?
Did we solve our problem?
Ambiguity
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
What we need
Detect and exploit semantic relations between entities. That's a whole other lecture…