Probabilistic Models in Information Retrieval
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Most slides are from Prof. ChengXiang Zhai’s lectures
2010 © University of Michigan
Essential Probability & Statistics
Prob/Statistics & Text Management
• Probability & statistics provide a principled way to quantify the uncertainties associated with natural language
• Allow us to answer questions like:– Given that we observe “baseball” three times and “game” once in a
news article, how likely is it about “sports”?
(text categorization, information retrieval)– Given that a user is interested in sports news, how likely would the
user use “baseball” in a query?
(information retrieval)
Basic Concepts in Probability
• Random experiment: an experiment with uncertain outcome (e.g., flipping a coin, picking a word from text)
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, S = {HH, HT, TH, TT}
• Event: a subspace of the sample space
– E ⊆ S; E happens iff the outcome is in E, e.g.,
– E = {HH} (all heads)
– E = {HH, TT} (same face)
– Impossible event ({}), certain event (S)
• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
– P(S) = 1 (outcome always in S)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A = same face, B = different face)
Example: Flip a Coin
• Sample space: {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to include a word in a document
– Sample space = {appear, absence}
Example: Toss a Die
• Sample space: S = {1, 2, 3, 4, 5, 6}
• Fair die:
– p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die, e.g.: p(1) = 0.3, p(2) = 0.2, ...
• N-sided die:
– S = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to write in the next position
– Sample space = {cat, dog, tiger, …}
Basic Concepts of Prob. (cont.)
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional probability: P(B|A) = P(A ∩ B)/P(A)
– P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
– For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)
• Total probability: If A1, …, An form a partition of S, then
– P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
– So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)]
– This allows us to compute P(Ai|B) based on P(B|Ai)
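To make the arithmetic concrete, here is a small Python sketch of Bayes’ rule with the total-probability expansion (the spam-filter setting and all numbers are hypothetical, chosen only for illustration):

```python
# Hypothetical setting: H1 = "spam", H2 = "ham" partition the sample space;
# E = the word "free" appears in a message.
p_spam, p_ham = 0.2, 0.8        # priors P(H1), P(H2)
p_free_given_spam = 0.30        # likelihood P(E|H1)
p_free_given_ham = 0.02         # likelihood P(E|H2)

# Total probability: P(E) = P(E|H1)P(H1) + P(E|H2)P(H2)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' rule: P(H1|E) = P(E|H1)P(H1) / P(E)
print(p_free_given_spam * p_spam / p_free)  # ~0.789
```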
Revisit: Bayes’ Rule
Hypothesis space: H = {H1, …, Hn}; Evidence: E

    P(Hi|E) = P(E|Hi) P(Hi) / P(E)

– P(Hi|E): posterior probability of Hi
– P(Hi): prior probability of Hi
– P(E|Hi): likelihood of the data/evidence if Hi is true

If we want to pick the most likely hypothesis H*, we can drop P(E):

    P(Hi|E) ∝ P(E|Hi) P(Hi)
In text classification: H: class space; E: data (features)
Random Variable
• X: S → ℝ (a “measure” of the outcome)
– E.g., number of heads; all same face?; …
• Events can be defined according to X
– E(X = a) = {si | X(si) = a}
– E(X ≥ a) = {si | X(si) ≥ a}
• So, probabilities can be defined on X
– P(X = a) = P(E(X = a))
– P(X ≥ a) = P(E(X ≥ a))
• Discrete vs. continuous random variable (think of “partitioning the sample space”)
Getting to Statistics ...
• We are flipping an unfair coin, but P(Head) = ? (parameter estimation)
– If we see the results of a huge number of random experiments, then

    P̂(Head) = count(Heads) / count(Flips) = count(Heads) / [count(Heads) + count(Tails)]

– But, what if we only see a small sample (e.g., 2)? Is this estimate still reliable? If we flip twice and get two tails, does that mean P(Head) = 0?
• In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
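A minimal sketch of this estimator in Python, assuming a small list of illustrative flips:

```python
from collections import Counter

flips = ["H", "T", "H", "H", "T", "H", "T", "H"]  # hypothetical observations
counts = Counter(flips)

# ML estimate: P^(Head) = count(Heads) / count(Flips)
print(counts["H"] / len(flips))      # 0.625

# The small-sample problem: two flips, both tails
print(Counter(["T", "T"])["H"] / 2)  # 0.0 -- surely P(Head) isn't really 0?
```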
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that governs the random experiment
– The model gives a probability of any data, p(D|θ), that depends on the parameter θ
– Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”:

    θ̂ = argmax_θ P(X|θ)

– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining the data well:

    θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)

– Problem: how to define the prior?
Illustration of Bayesian Estimation
Prior: p(θ)
Likelihood: p(X|θ), where X = (x1, …, xN)
Posterior: p(θ|X) ∝ p(X|θ) p(θ)
(The original figure plots these three curves over θ, marking the prior mode, the ML estimate θml, and the posterior mode, which lies between the other two.)
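A numeric sketch of the contrast for the coin example, assuming a Beta(5, 5) prior (the prior and data are illustrative, not from the slides):

```python
# Two flips, both tails (the small sample from the previous slide).
heads, tails = 0, 2
a, b = 5, 5  # Beta prior pseudo-counts encoding "the coin is roughly fair"

theta_ml = heads / (heads + tails)                         # ML estimate: 0.0
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode
print(theta_ml, theta_map)  # 0.0 vs 0.4 -- the prior pulls the estimate off 0
```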
Maximum Likelihood Estimate
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)

    p(wi) = c(wi) / |d|

Problem: a document is short. If c(wi) = 0, does this mean p(wi) = 0?
If you want to know how we get it…
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {θi = p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)

The multinomial likelihood is

    p(d|M) = [ |d|! / (c(w1)! … c(wN)!) ] · ∏_{i=1..N} θi^c(wi),  where θi = p(wi) and Σ_{i=1..N} θi = 1

We’ll tune the θi to maximize l(d|M); the multinomial coefficient does not depend on θ, so

    l(d|M) = log p(d|M) = Σ_{i=1..N} c(wi) log θi + const.

Use the Lagrange multiplier approach to enforce the constraint Σ θi = 1:

    l'(d|M) = Σ_{i=1..N} c(wi) log θi + λ (Σ_{i=1..N} θi − 1)

Set the partial derivatives to zero:

    ∂l'/∂θi = c(wi)/θi + λ = 0   ⇒   θi = −c(wi)/λ

Since Σ_{i=1..N} θi = 1 and Σ_{i=1..N} c(wi) = |d|, we get λ = −|d|, so the ML estimate is

    p(wi) = c(wi) / |d|
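The resulting estimator is one line of Python; here is a sketch over a toy document (illustrative text, not from the slides):

```python
from collections import Counter

doc = "text mining text retrieval text models".split()  # toy document d
counts = Counter(doc)

# ML estimate of the unigram model: p(w) = c(w, d) / |d|
p = {w: c / len(doc) for w, c in counts.items()}
print(p["text"])                # 3/6 = 0.5
print(p.get("algorithm", 0.0))  # unseen word -> 0.0: the zero-count problem
```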
Statistical Language Models
What is a Statistical Language Model?
• A probability distribution over word sequences, e.g.,
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
Why is a Language Model Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)
Source-Channel Framework (Communication System)

Source --(X)--> Transmitter (encoder) --> Noisy Channel P(Y|X) --(Y)--> Receiver (decoder) --(X')--> Destination

The source emits X with probability P(X); the receiver must recover X from the noisy observation Y: P(X|Y) = ?

    X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)

When X is text, p(X) is a language model
Speech Recognition

Speaker --(words W)--> Noisy Channel P(A|W) --(acoustic signal A)--> Recognizer --> recognized words

Given acoustic signal A, find the word sequence W: P(W|A) = ?

    Ŵ = argmax_W p(W|A) = argmax_W p(A|W) p(W)

where p(A|W) is the acoustic model and p(W) is the language model
Machine Translation

English speaker --(English words E)--> Noisy Channel P(C|E) --(Chinese words C)--> Translator --> English translation

Given Chinese sentence C, find its English translation E: P(E|C) = ?

    Ê = argmax_E p(E|C) = argmax_E p(C|E) p(E)

where p(C|E) is the English→Chinese translation model and p(E) is the English language model
Spelling/OCR Error Correction

Original text --(original words O)--> Noisy Channel P(E|O) --(“erroneous” words E)--> Corrector --> corrected text

Given corrupted text E, find the original text O: P(O|E) = ?

    Ô = argmax_O p(O|E) = argmax_O p(E|O) p(O)

where p(E|O) is the spelling/OCR error model and p(O) is the “normal” language model
Basic Issues
• Define the probabilistic model
– Event, random variables, joint/conditional probabilities
– P(w1 w2 ... wn) = f(θ1, θ2, …, θm)
• Estimate model parameters
– Tune the model to best fit the data and our prior knowledge: θi = ?
• Apply the model to a particular task
– Many applications
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Modeling Text As Tossing a Die
• Sample space: S = {w1, w2, w3, …, wN}
• The die is unfair:
– e.g., P(cat) = 0.3, P(dog) = 0.2, ...
– Toss the die to decide which word to write in the next position
• E.g., sample space = {cat, dog, tiger}, with P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2
• Text = “cat cat dog”
• P(Text) = P(cat) × P(cat) × P(dog) = 0.3 × 0.3 × 0.5 = 0.045
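The same computation as a short Python sketch of the slide’s cat/dog/tiger example:

```python
import math

model = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}  # unigram LM from the slide

def p_text(words, lm):
    # p(w1 w2 ... wn) = p(w1) * p(w2) * ... * p(wn) under a unigram model
    return math.prod(lm[w] for w in words)

print(p_text(["cat", "cat", "dog"], model))  # 0.3 * 0.3 * 0.5 = 0.045
```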
Text Generation with Unigram LM
(Unigram) Language Model p(w|θ)

Topic 1: Text mining
– text 0.2
– mining 0.1
– association 0.01
– clustering 0.02
– …
– food 0.00001
– …

Topic 2: Health
– food 0.25
– nutrition 0.1
– healthy 0.05
– diet 0.02
– …

Sampling from each model generates a document: the text-mining model tends to generate a text mining paper, and the health model a food nutrition paper.
Estimation of Unigram LM

(Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
– text 10
– mining 5
– association 3
– database 3
– algorithm 2
– …
– query 1
– efficient 1

Estimation: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
Empirical distribution of words
• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur rarely. E.g., in news articles:
– Top 4 words: 10~15% of word occurrences
– Top 50 words: 35~40% of word occurrences
• The most frequent word in one corpus may be rare in another
Revisit: Zipf’s Law
• rank × frequency ≈ constant:

    F(w) = C / r(w)^α,  with α ≈ 1, C ≈ 0.1

where F(w) is the relative frequency of word w and r(w) its rank by frequency. In the frequency-vs-rank plot, the biggest data structure (stop words) sits at the head, the most useful words (Luhn 57) lie in the middle, and the long tail raises the question: is “too rare” a problem?
• Generalized Zipf’s law, applicable in many domains:

    F(w) = C / [r(w) + B]^α
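A quick way to eyeball the law on any corpus is to check whether rank × frequency stays roughly constant; a sketch (the corpus path is a placeholder):

```python
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:  # any large plain-text file
    words = f.read().lower().split()

total = len(words)
# Under Zipf's law, rank * count / total should hover near C (~0.1).
for rank, (word, count) in enumerate(Counter(words).most_common(10), start=1):
    print(f"{rank:2d} {word:15s} {rank * count / total:.3f}")
```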
Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)

The figure plots P(w) against the words w. The maximum likelihood estimate,

    p_ML(w) = count of w / count of all words,

drops to zero for unseen words; the smoothed LM lies slightly below the ML estimate on seen words and assigns the freed probability mass to the unseen ones.
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• Method 1 (Additive smoothing): add a constant to the counts of each word; adding one gives “add one” (Laplace) smoothing:

    p(w|d) = [c(w,d) + 1] / [|d| + |V|]

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
• Problems?
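A minimal sketch of add-one smoothing (the toy document and vocabulary are assumptions for illustration):

```python
from collections import Counter

doc = "text mining text retrieval".split()
vocab = {"text", "mining", "retrieval", "algorithm", "query"}  # assumed V
counts = Counter(doc)

def p_add_one(w):
    # p(w|d) = (c(w,d) + 1) / (|d| + |V|)
    return (counts[w] + 1) / (len(doc) + len(vocab))

print(p_add_one("text"))       # (2+1)/(4+5) = 0.333...
print(p_add_one("algorithm"))  # (0+1)/(4+5) = 0.111... -- non-zero for unseen
```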
How to Smooth? (cont.)
• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

    p(w|d) = p_seen(w|d)     if w is seen in d
           = α_d p(w|REF)    otherwise

where p_seen(w|d) is the discounted ML estimate, p(w|REF) is a reference language model, and the normalizer is

    α_d = [1 − Σ_{w seen} p_seen(w|d)] / [Σ_{w unseen} p(w|REF)]
Other Smoothing Methods
• Method 2 (Absolute discounting): subtract a constant δ from the counts of each word:

    p(w|d) = [max(c(w;d) − δ, 0) + δ|d|_u p(w|REF)] / |d|,  i.e., α_d = δ|d|_u / |d|

where |d|_u is the number of unique words in d
• Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” uniformly toward p(w|REF):

    p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF),  i.e., α_d = λ

where c(w,d)/|d| is the ML estimate and λ is a parameter
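A sketch of the Jelinek-Mercer interpolation (λ and the reference model below are illustrative assumptions; in practice p(w|REF) is estimated from a large corpus and λ is tuned):

```python
from collections import Counter

doc = "text mining text retrieval".split()
counts, doc_len = Counter(doc), len(doc)

p_ref = {"text": 0.1, "mining": 0.05, "retrieval": 0.05, "algorithm": 0.01}
lam = 0.3  # interpolation weight

def p_jm(w):
    # p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)
    return (1 - lam) * counts[w] / doc_len + lam * p_ref.get(w, 0.0)

print(p_jm("text"))       # 0.7 * 0.5 + 0.3 * 0.10 = 0.38
print(p_jm("algorithm"))  # 0.7 * 0.0 + 0.3 * 0.01 = 0.003
```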
Other Smoothing Methods (cont.)
• Method 4 (Dirichlet prior/Bayesian): assume μ pseudo counts distributed according to p(w|REF):

    p(w|d) = [c(w;d) + μ p(w|REF)] / [|d| + μ]
           = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF),  i.e., α_d = μ/(|d|+μ)

where μ is a parameter
• Method 5 (Good-Turing): assume the total # of unseen events to be n1 (# of singletons), and adjust the counts of seen events in the same way:

    p(w|d) = c*(w,d) / |d|,  where c*(w,d) = (r+1) n_{r+1}/n_r and r = c(w,d)

so 0* = n1/n0, 1* = 2n2/n1, …
– What if n_{r+1} = 0? What about p(w|REF)?
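A matching sketch of Dirichlet-prior smoothing (μ and the reference model are illustrative; in IR, μ is typically tuned and often falls in the hundreds to thousands):

```python
from collections import Counter

doc = "text mining text retrieval".split()
counts, doc_len = Counter(doc), len(doc)

p_ref = {"text": 0.1, "mining": 0.05, "retrieval": 0.05, "algorithm": 0.01}
mu = 2000.0  # Dirichlet pseudo-count mass

def p_dirichlet(w):
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    return (counts[w] + mu * p_ref.get(w, 0.0)) / (doc_len + mu)

print(p_dirichlet("text"))       # (2 + 200) / 2004
print(p_dirichlet("algorithm"))  # (0 + 20) / 2004 -- unseen but non-zero
```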
So, which method is the best?
It depends on the data and the task!
Many other sophisticated smoothing methods have been proposed…
Cross validation is generally used to choose the best method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Smoothing will be discussed further in the course…
Language Models for IR
Text Generation with Unigram LM
(Unigram) Language Model p(w|θ)

Topic 1: Text mining
– text 0.2
– mining 0.1
– association 0.01
– clustering 0.02
– …
– food 0.00001
– …

Topic 2: Health
– food 0.25
– nutrition 0.1
– healthy 0.05
– diet 0.02
– …

Sampling from each model generates a document: the text-mining model tends to generate a text mining paper, and the health model a food nutrition paper.
Estimation of Unigram LM

(Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
– text 10
– mining 5
– association 3
– database 3
– algorithm 2
– …
– query 1
– efficient 1

Estimation: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
Language Models for Retrieval (Ponte & Croft 98)

Estimate a language model for each document:
– Text mining paper → text ?, mining ?, association ?, clustering ?, …, food ?, …
– Food nutrition paper → food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

? Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

Each document di gets its own doc LM θ_di; given query q, documents are ranked by the likelihood of the query under each doc LM:

d1 → θ_d1 → p(q|θ_d1)
d2 → θ_d2 → p(q|θ_d2)
…
dN → θ_dN → p(q|θ_dN)
Retrieval as Language Model Estimation
• Document ranking based on query likelihood:

    log p(q|d) = Σ_i log p(wi|d),  where q = w1 w2 … wn

and p(wi|d) is the document language model
• Retrieval problem ⇒ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|d)?
• Simplest solution: the maximum likelihood estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words:

    p(w|d) = p_seen(w|d)    if w is seen in d
           = α_d p(w|C)     otherwise

where p_seen(w|d) is the discounted ML estimate and p(w|C) is the collection language model
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

    log p(q|d) = Σ_{wi ∈ d, wi ∈ q} log[ p_seen(wi|d) / (α_d p(wi|C)) ] + n log α_d + Σ_i log p(wi|C)

– the first sum behaves like TF weighting, and dividing by p(wi|C) gives IDF weighting
– n log α_d gives doc length normalization (a long doc is expected to have a smaller α_d)
– the last sum is the same for all documents, so we can ignore it for ranking
• Smoothing with p(w|C) ⇒ TF-IDF + length normalization
Three Smoothing Methods (Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C):

    p_λ(w|d) = (1 − λ) p_ml(w|d) + λ p(w|C)

• Dirichlet prior (Bayesian): assume μ pseudo counts distributed according to p(w|C):

    p_μ(w|d) = [c(w;d) + μ p(w|C)] / [|d| + μ] = |d|/(|d|+μ) · p_ml(w|d) + μ/(|d|+μ) · p(w|C)

• Absolute discounting: subtract a constant δ:

    p_δ(w|d) = [max(c(w;d) − δ, 0) + δ|d|_u p(w|C)] / |d|
Assume We Use Dirichlet Smoothing

    p(w|d) = [c(w;d) + μ p(w|C)] / [|d| + μ],  with α_d = μ/(|d| + μ)

Plugging this into the query likelihood and dropping document-independent terms gives

    Score(q,d) = Σ_{w ∈ q∩d} c(w,q) log[1 + c(w,d)/(μ p(w|C))] + |q| log[μ/(|d| + μ)]

– only words in the query are considered
– c(w,q) is the query term frequency; c(w,d) is the doc term frequency
– the 1/p(w|C) factor is similar to IDF; the last term accounts for doc length
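A sketch of this Dirichlet query-likelihood scorer over a two-document toy collection (all data and the μ value are assumptions for illustration):

```python
import math
from collections import Counter

docs = {
    "d1": "text mining text models algorithm".split(),
    "d2": "food nutrition healthy diet".split(),
}
# Collection LM p(w|C): pool all documents together.
all_words = [w for d in docs.values() for w in d]
p_c = {w: c / len(all_words) for w, c in Counter(all_words).items()}
mu = 100.0

def score(query, doc):
    q, d = Counter(query), Counter(doc)
    s = sum(cwq * math.log(1 + d[w] / (mu * p_c[w]))  # query terms in the doc
            for w, cwq in q.items() if w in d)
    s += sum(q.values()) * math.log(mu / (len(doc) + mu))  # doc length term
    return s

query = "data mining algorithms".split()
for name, doc in docs.items():
    print(name, round(score(query, doc), 4))  # d1 outscores d2
```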
The Notion of Relevance

Relevance has been formalized in three main ways:
• Similarity between representations, (Rep(q), Rep(d)), with different reps & similarity measures:
– vector space model (Salton et al., 75)
– prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
– generative models via doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– generative models via query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– regression model (Fox 83)
• Probabilistic inference, P(d → q) or P(q → d), with different inference systems:
– prob. concept space model (Wong & Yao, 95)
– inference network model (Turtle & Croft, 91)
Query Generation and Document Generation
• Document ranking based on the conditional probability of relevance P(R=1|D,Q), with R ∈ {0,1}
• Ranking based on the odds:

    O(R=1|D,Q) = P(R=1|D,Q) / P(R=0|D,Q)

• By Bayes’ rule, P(R|D,Q) can be computed from P(D,Q|R), which can be factored in two ways:
– document generation, P(D|Q,R): leads to Okapi/BM25
– query generation, P(Q|D,R): leads to language models
The Notion of Relevance
(Same taxonomy as above; the document-generation branch, the classical probabilistic model of Robertson & Sparck Jones (76), leads to the Okapi/BM25 formula on the next slide.)
The Okapi/BM25 Retrieval Formula
• Classical probabilistic model -- among the best-performing retrieval formulas:

    Score(q,d) = Σ_{w ∈ q∩d} log[ (N − df(w) + 0.5) / (df(w) + 0.5) ]
                 × [ (k1 + 1) c(w,d) ] / [ k1 ((1 − b) + b |d|/avdl) + c(w,d) ]
                 × [ (k3 + 1) c(w,q) ] / [ k3 + c(w,q) ]

– only words in the query are considered
– the first factor is the IDF component (N: # docs; df(w): doc frequency of w)
– the second factor is the doc term frequency c(w,d), normalized by doc length |d| against the average avdl
– the third factor is the query term frequency c(w,q)
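A compact Python sketch of this scoring function (k1 = 1.2, b = 0.75, k3 = 1000 are commonly used defaults, assumed here; the toy statistics are illustrative):

```python
import math
from collections import Counter

def bm25_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    q, d = Counter(query), Counter(doc)
    score = 0.0
    for w, cwq in q.items():
        if w not in d:
            continue  # only words in both query and document contribute
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        tf = (k1 + 1) * d[w] / (k1 * ((1 - b) + b * len(doc) / avdl) + d[w])
        qtf = (k3 + 1) * cwq / (k3 + cwq)
        score += idf * tf * qtf
    return score

df = {"mining": 2, "algorithms": 1}                 # hypothetical doc freqs
doc = "text mining text models algorithm".split()
print(bm25_score("data mining algorithms".split(), doc, df,
                 n_docs=10, avdl=5.0))
```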
What are the Essential Features in a Retrieval Formula?
• Document TF – how many occurrences of each query term we observe in a document
• Query TF – how many occurrences of each term we observe in a query
• IDF – how distinctive/important each query term is (a term with a higher IDF is more distinctive)
• Document length – how long the document is (we want to penalize long documents, but don’t want to over-penalize them!)
• Others?