Probabilistic Models in Information Retrieval
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Most slides are from Prof. ChengXiang Zhai’s lectures
2010 © University of Michigan
Essential Probability & Statistics
Prob/Statistics & Text Management
• Probability & statistics provide a principled way to quantify the uncertainties associated with natural language
• Allow us to answer questions like:– Given that we observe “baseball” three times and “game” once in a
news article, how likely is it about “sports”?
(text categorization, information retrieval)– Given that a user is interested in sports news, how likely would the
user use “baseball” in a query?
(information retrieval)
Basic Concepts in Probability
• Random experiment: an experiment with uncertain outcome (e.g., flipping a coin, picking a word from text)
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, S = {HH, HT, TH, TT}
• Event: a subspace of the sample space
– E ⊆ S; E happens iff the outcome is in E, e.g.,
– E = {HH} (all heads)
– E = {HH, TT} (same face)
– Impossible event ({}), certain event (S)
• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
– P(S) = 1 (outcome always in S)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A = same face, B = different face)
Example: Flip a Coin
• Sample space: {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to include a word in a document
– Sample space = {appear, absence}
Example: Toss a Die
• Sample space: S = {1, 2, 3, 4, 5, 6}
• Fair die:
– p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die, e.g.: p(1) = 0.3, p(2) = 0.2, ...
• N-sided die:
– S = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to write in the next position
– Sample space = {cat, dog, tiger, …}
Basic Concepts of Prob. (cont.)
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional probability: P(B|A) = P(A ∩ B)/P(A)
– P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
– For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)
• Total probability: If A1, …, An form a partition of S, then
– P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
– So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)]
– This allows us to compute P(Ai|B) based on P(B|Ai)
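To make the arithmetic concrete, here is a small Python sketch of Bayes’ rule with the total-probability expansion (the spam-filter setting and all numbers are hypothetical, chosen only for illustration):

```python
# Hypothetical setting: H1 = "spam", H2 = "ham" partition the sample space;
# E = the word "free" appears in a message.
p_spam, p_ham = 0.2, 0.8        # priors P(H1), P(H2)
p_free_given_spam = 0.30        # likelihood P(E|H1)
p_free_given_ham = 0.02         # likelihood P(E|H2)

# Total probability: P(E) = P(E|H1)P(H1) + P(E|H2)P(H2)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' rule: P(H1|E) = P(E|H1)P(H1) / P(E)
print(p_free_given_spam * p_spam / p_free)  # ~0.789
```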
Revisit: Bayes’ Rule
Hypothesis space: H = {H1, …, Hn}; Evidence: E

    P(Hi|E) = P(E|Hi) P(Hi) / P(E)

– P(Hi|E): posterior probability of Hi
– P(Hi): prior probability of Hi
– P(E|Hi): likelihood of the data/evidence if Hi is true

If we want to pick the most likely hypothesis H*, we can drop P(E):

    P(Hi|E) ∝ P(E|Hi) P(Hi)
In text classification: H: class space; E: data (features)
Random Variable
• X: S → ℝ (a “measure” of the outcome)
– E.g., number of heads; all same face?; …
• Events can be defined according to X
– E(X = a) = {si | X(si) = a}
– E(X ≥ a) = {si | X(si) ≥ a}
• So, probabilities can be defined on X
– P(X = a) = P(E(X = a))
– P(X ≥ a) = P(E(X ≥ a))
• Discrete vs. continuous random variable (think of “partitioning the sample space”)
Getting to Statistics ...
• We are flipping an unfair coin, but P(Head) = ? (parameter estimation)
– If we see the results of a huge number of random experiments, then

    P̂(Head) = count(Heads) / count(Flips) = count(Heads) / [count(Heads) + count(Tails)]

– But, what if we only see a small sample (e.g., 2)? Is this estimate still reliable? If we flip twice and get two tails, does that mean P(Head) = 0?
• In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
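A minimal sketch of this estimator in Python, assuming a small list of illustrative flips:

```python
from collections import Counter

flips = ["H", "T", "H", "H", "T", "H", "T", "H"]  # hypothetical observations
counts = Counter(flips)

# ML estimate: P^(Head) = count(Heads) / count(Flips)
print(counts["H"] / len(flips))      # 0.625

# The small-sample problem: two flips, both tails
print(Counter(["T", "T"])["H"] / 2)  # 0.0 -- surely P(Head) isn't really 0?
```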
Parameter Estimation
• General setting:
– Given a (hypothesized & probabilistic) model that governs the random experiment
– The model gives a probability of any data, p(D|θ), that depends on the parameter θ
– Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”:

    θ̂ = argmax_θ P(X|θ)

– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining the data well:

    θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)

– Problem: how to define the prior?
Illustration of Bayesian Estimation
Prior: p(θ)
Likelihood: p(X|θ), where X = (x1, …, xN)
Posterior: p(θ|X) ∝ p(X|θ) p(θ)
(The original figure plots these three curves over θ, marking the prior mode, the ML estimate θml, and the posterior mode, which lies between the other two.)
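A numeric sketch of the contrast for the coin example, assuming a Beta(5, 5) prior (the prior and data are illustrative, not from the slides):

```python
# Two flips, both tails (the small sample from the previous slide).
heads, tails = 0, 2
a, b = 5, 5  # Beta prior pseudo-counts encoding "the coin is roughly fair"

theta_ml = heads / (heads + tails)                         # ML estimate: 0.0
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode
print(theta_ml, theta_map)  # 0.0 vs 0.4 -- the prior pulls the estimate off 0
```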
Maximum Likelihood Estimate
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)

    p(wi) = c(wi) / |d|

Problem: a document is short. If c(wi) = 0, does this mean p(wi) = 0?
If you want to know how we get it…
Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {θi = p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)

The multinomial likelihood is

    p(d|M) = [ |d|! / (c(w1)! … c(wN)!) ] · ∏_{i=1..N} θi^c(wi),  where θi = p(wi) and Σ_{i=1..N} θi = 1

We’ll tune the θi to maximize l(d|M); the multinomial coefficient does not depend on θ, so

    l(d|M) = log p(d|M) = Σ_{i=1..N} c(wi) log θi + const.

Use the Lagrange multiplier approach to enforce the constraint Σ θi = 1:

    l'(d|M) = Σ_{i=1..N} c(wi) log θi + λ (Σ_{i=1..N} θi − 1)

Set the partial derivatives to zero:

    ∂l'/∂θi = c(wi)/θi + λ = 0   ⇒   θi = −c(wi)/λ

Since Σ_{i=1..N} θi = 1 and Σ_{i=1..N} c(wi) = |d|, we get λ = −|d|, so the ML estimate is

    p(wi) = c(wi) / |d|
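The resulting estimator is one line of Python; here is a sketch over a toy document (illustrative text, not from the slides):

```python
from collections import Counter

doc = "text mining text retrieval text models".split()  # toy document d
counts = Counter(doc)

# ML estimate of the unigram model: p(w) = c(w, d) / |d|
p = {w: c / len(doc) for w, c in counts.items()}
print(p["text"])                # 3/6 = 0.5
print(p.get("algorithm", 0.0))  # unseen word -> 0.0: the zero-count problem
```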
Statistical Language Models
What is a Statistical Language Model?
• A probability distribution over word sequences, e.g.,
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
Why is a Language Model Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)
Source-Channel Framework (Communication System)

Source --(X)--> Transmitter (encoder) --> Noisy Channel P(Y|X) --(Y)--> Receiver (decoder) --(X')--> Destination

The source emits X with probability P(X); the receiver must recover X from the noisy observation Y: P(X|Y) = ?

    X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)

When X is text, p(X) is a language model
Speech Recognition

Speaker --(words W)--> Noisy Channel P(A|W) --(acoustic signal A)--> Recognizer --> recognized words

Given acoustic signal A, find the word sequence W: P(W|A) = ?

    Ŵ = argmax_W p(W|A) = argmax_W p(A|W) p(W)

where p(A|W) is the acoustic model and p(W) is the language model
Machine Translation

English speaker --(English words E)--> Noisy Channel P(C|E) --(Chinese words C)--> Translator --> English translation

Given Chinese sentence C, find its English translation E: P(E|C) = ?

    Ê = argmax_E p(E|C) = argmax_E p(C|E) p(E)

where p(C|E) is the English→Chinese translation model and p(E) is the English language model
Spelling/OCR Error Correction

Original text --(original words O)--> Noisy Channel P(E|O) --(“erroneous” words E)--> Corrector --> corrected text

Given corrupted text E, find the original text O: P(O|E) = ?

    Ô = argmax_O p(O|E) = argmax_O p(E|O) p(O)

where p(E|O) is the spelling/OCR error model and p(O) is the “normal” language model
Basic Issues
• Define the probabilistic model
– Event, random variables, joint/conditional probabilities
– P(w1 w2 ... wn) = f(θ1, θ2, …, θm)
• Estimate model parameters
– Tune the model to best fit the data and our prior knowledge: θi = ?
• Apply the model to a particular task
– Many applications
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Modeling Text As Tossing a Die
• Sample space: S = {w1, w2, w3, …, wN}
• The die is unfair:
– e.g., P(cat) = 0.3, P(dog) = 0.2, ...
– Toss the die to decide which word to write in the next position
• E.g., sample space = {cat, dog, tiger}, with P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2
• Text = “cat cat dog”
• P(Text) = P(cat) × P(cat) × P(dog) = 0.3 × 0.3 × 0.5 = 0.045
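The same computation as a short Python sketch of the slide’s cat/dog/tiger example:

```python
import math

model = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}  # unigram LM from the slide

def p_text(words, lm):
    # p(w1 w2 ... wn) = p(w1) * p(w2) * ... * p(wn) under a unigram model
    return math.prod(lm[w] for w in words)

print(p_text(["cat", "cat", "dog"], model))  # 0.3 * 0.3 * 0.5 = 0.045
```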
Text Generation with Unigram LM
(Unigram) Language Model p(w|θ)

Topic 1: Text mining
– text 0.2
– mining 0.1
– association 0.01
– clustering 0.02
– …
– food 0.00001
– …

Topic 2: Health
– food 0.25
– nutrition 0.1
– healthy 0.05
– diet 0.02
– …

Sampling from each model generates a document: the text-mining model tends to generate a text mining paper, and the health model a food nutrition paper.
Estimation of Unigram LM

(Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
– text 10
– mining 5
– association 3
– database 3
– algorithm 2
– …
– query 1
– efficient 1

Estimation: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
Empirical distribution of words
• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur rarely. E.g., in news articles:
– Top 4 words: 10~15% of word occurrences
– Top 50 words: 35~40% of word occurrences
• The most frequent word in one corpus may be rare in another
Revisit: Zipf’s Law
• rank × frequency ≈ constant:

    F(w) = C / r(w)^α,  with α ≈ 1, C ≈ 0.1

where F(w) is the relative frequency of word w and r(w) its rank by frequency. In the frequency-vs-rank plot, the biggest data structure (stop words) sits at the head, the most useful words (Luhn 57) lie in the middle, and the long tail raises the question: is “too rare” a problem?
• Generalized Zipf’s law, applicable in many domains:

    F(w) = C / [r(w) + B]^α
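A quick way to eyeball the law on any corpus is to check whether rank × frequency stays roughly constant; a sketch (the corpus path is a placeholder):

```python
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:  # any large plain-text file
    words = f.read().lower().split()

total = len(words)
# Under Zipf's law, rank * count / total should hover near C (~0.1).
for rank, (word, count) in enumerate(Counter(words).most_common(10), start=1):
    print(f"{rank:2d} {word:15s} {rank * count / total:.3f}")
```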
Problem with the ML Estimator
• What if a word doesn’t appear in the text?
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)

The figure plots P(w) against the words w. The maximum likelihood estimate,

    p_ML(w) = count of w / count of all words,

drops to zero for unseen words; the smoothed LM lies slightly below the ML estimate on seen words and assigns the freed probability mass to the unseen ones.
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• Method 1 (Additive smoothing): add a constant to the counts of each word; adding one gives “add one” (Laplace) smoothing:

    p(w|d) = [c(w,d) + 1] / [|d| + |V|]

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
• Problems?
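A minimal sketch of add-one smoothing (the toy document and vocabulary are assumptions for illustration):

```python
from collections import Counter

doc = "text mining text retrieval".split()
vocab = {"text", "mining", "retrieval", "algorithm", "query"}  # assumed V
counts = Counter(doc)

def p_add_one(w):
    # p(w|d) = (c(w,d) + 1) / (|d| + |V|)
    return (counts[w] + 1) / (len(doc) + len(vocab))

print(p_add_one("text"))       # (2+1)/(4+5) = 0.333...
print(p_add_one("algorithm"))  # (0+1)/(4+5) = 0.111... -- non-zero for unseen
```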
How to Smooth? (cont.)
• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

    p(w|d) = p_seen(w|d)     if w is seen in d
           = α_d p(w|REF)    otherwise

where p_seen(w|d) is the discounted ML estimate, p(w|REF) is a reference language model, and the normalizer is

    α_d = [1 − Σ_{w seen} p_seen(w|d)] / [Σ_{w unseen} p(w|REF)]
Other Smoothing Methods
• Method 2 (Absolute discounting): subtract a constant δ from the counts of each word:

    p(w|d) = [max(c(w;d) − δ, 0) + δ|d|_u p(w|REF)] / |d|,  i.e., α_d = δ|d|_u / |d|

where |d|_u is the number of unique words in d
• Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” uniformly toward p(w|REF):

    p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF),  i.e., α_d = λ

where c(w,d)/|d| is the ML estimate and λ is a parameter
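A sketch of the Jelinek-Mercer interpolation (λ and the reference model below are illustrative assumptions; in practice p(w|REF) is estimated from a large corpus and λ is tuned):

```python
from collections import Counter

doc = "text mining text retrieval".split()
counts, doc_len = Counter(doc), len(doc)

p_ref = {"text": 0.1, "mining": 0.05, "retrieval": 0.05, "algorithm": 0.01}
lam = 0.3  # interpolation weight

def p_jm(w):
    # p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)
    return (1 - lam) * counts[w] / doc_len + lam * p_ref.get(w, 0.0)

print(p_jm("text"))       # 0.7 * 0.5 + 0.3 * 0.10 = 0.38
print(p_jm("algorithm"))  # 0.7 * 0.0 + 0.3 * 0.01 = 0.003
```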
Other Smoothing Methods (cont.)
• Method 4 (Dirichlet prior/Bayesian): assume μ pseudo counts distributed according to p(w|REF):

    p(w|d) = [c(w;d) + μ p(w|REF)] / [|d| + μ]
           = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF),  i.e., α_d = μ/(|d|+μ)

where μ is a parameter
• Method 5 (Good-Turing): assume the total # of unseen events to be n1 (# of singletons), and adjust the counts of seen events in the same way:

    p(w|d) = c*(w,d) / |d|,  where c*(w,d) = (r+1) n_{r+1}/n_r and r = c(w,d)

so 0* = n1/n0, 1* = 2n2/n1, …
– What if n_{r+1} = 0? What about p(w|REF)?
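A matching sketch of Dirichlet-prior smoothing (μ and the reference model are illustrative; in IR, μ is typically tuned and often falls in the hundreds to thousands):

```python
from collections import Counter

doc = "text mining text retrieval".split()
counts, doc_len = Counter(doc), len(doc)

p_ref = {"text": 0.1, "mining": 0.05, "retrieval": 0.05, "algorithm": 0.01}
mu = 2000.0  # Dirichlet pseudo-count mass

def p_dirichlet(w):
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    return (counts[w] + mu * p_ref.get(w, 0.0)) / (doc_len + mu)

print(p_dirichlet("text"))       # (2 + 200) / 2004
print(p_dirichlet("algorithm"))  # (0 + 20) / 2004 -- unseen but non-zero
```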
So, which method is the best?
It depends on the data and the task!
Many other sophisticated smoothing methods have been proposed…
Cross validation is generally used to choose the best method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Smoothing will be discussed further in the course…
Language Models for IR
Text Generation with Unigram LM
(Unigram) Language Model p(w|θ)

Topic 1: Text mining
– text 0.2
– mining 0.1
– association 0.01
– clustering 0.02
– …
– food 0.00001
– …

Topic 2: Health
– food 0.25
– nutrition 0.1
– healthy 0.05
– diet 0.02
– …

Sampling from each model generates a document: the text-mining model tends to generate a text mining paper, and the health model a food nutrition paper.
Estimation of Unigram LM

(Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
– text 10
– mining 5
– association 3
– database 3
– algorithm 2
– …
– query 1
– efficient 1

Estimation: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, …, p(query) = 1/100, …
Language Models for Retrieval (Ponte & Croft 98)

Estimate a language model for each document:
– Text mining paper → text ?, mining ?, association ?, clustering ?, …, food ?, …
– Food nutrition paper → food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

? Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

Each document di gets its own doc LM θ_di; given query q, documents are ranked by the likelihood of the query under each doc LM:

d1 → θ_d1 → p(q|θ_d1)
d2 → θ_d2 → p(q|θ_d2)
…
dN → θ_dN → p(q|θ_dN)
Retrieval as Language Model Estimation
• Document ranking based on query likelihood:

    log p(q|d) = Σ_i log p(wi|d),  where q = w1 w2 … wn

and p(wi|d) is the document language model
• Retrieval problem ⇒ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|d)?
• Simplest solution: the maximum likelihood estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words:

    p(w|d) = p_seen(w|d)    if w is seen in d
           = α_d p(w|C)     otherwise

where p_seen(w|d) is the discounted ML estimate and p(w|C) is the collection language model
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

    log p(q|d) = Σ_{wi ∈ d, wi ∈ q} log[ p_seen(wi|d) / (α_d p(wi|C)) ] + n log α_d + Σ_i log p(wi|C)

– the first sum behaves like TF weighting, and dividing by p(wi|C) gives IDF weighting
– n log α_d gives doc length normalization (a long doc is expected to have a smaller α_d)
– the last sum is the same for all documents, so we can ignore it for ranking
• Smoothing with p(w|C) ⇒ TF-IDF + length normalization
Three Smoothing Methods (Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C):

    p_λ(w|d) = (1 − λ) p_ml(w|d) + λ p(w|C)

• Dirichlet prior (Bayesian): assume μ pseudo counts distributed according to p(w|C):

    p_μ(w|d) = [c(w;d) + μ p(w|C)] / [|d| + μ] = |d|/(|d|+μ) · p_ml(w|d) + μ/(|d|+μ) · p(w|C)

• Absolute discounting: subtract a constant δ:

    p_δ(w|d) = [max(c(w;d) − δ, 0) + δ|d|_u p(w|C)] / |d|
Assume We Use Dirichlet Smoothing

    p(w|d) = [c(w;d) + μ p(w|C)] / [|d| + μ],  with α_d = μ/(|d| + μ)

Plugging this into the query likelihood and dropping document-independent terms gives

    Score(q,d) = Σ_{w ∈ q∩d} c(w,q) log[1 + c(w,d)/(μ p(w|C))] + |q| log[μ/(|d| + μ)]

– only words in the query are considered
– c(w,q) is the query term frequency; c(w,d) is the doc term frequency
– the 1/p(w|C) factor is similar to IDF; the last term accounts for doc length
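A sketch of this Dirichlet query-likelihood scorer over a two-document toy collection (all data and the μ value are assumptions for illustration):

```python
import math
from collections import Counter

docs = {
    "d1": "text mining text models algorithm".split(),
    "d2": "food nutrition healthy diet".split(),
}
# Collection LM p(w|C): pool all documents together.
all_words = [w for d in docs.values() for w in d]
p_c = {w: c / len(all_words) for w, c in Counter(all_words).items()}
mu = 100.0

def score(query, doc):
    q, d = Counter(query), Counter(doc)
    s = sum(cwq * math.log(1 + d[w] / (mu * p_c[w]))  # query terms in the doc
            for w, cwq in q.items() if w in d)
    s += sum(q.values()) * math.log(mu / (len(doc) + mu))  # doc length term
    return s

query = "data mining algorithms".split()
for name, doc in docs.items():
    print(name, round(score(query, doc), 4))  # d1 outscores d2
```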
The Notion of Relevance

Relevance has been formalized in three main ways:
• Similarity between representations, (Rep(q), Rep(d)), with different reps & similarity measures:
– vector space model (Salton et al., 75)
– prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
– generative models via doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– generative models via query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– regression model (Fox 83)
• Probabilistic inference, P(d → q) or P(q → d), with different inference systems:
– prob. concept space model (Wong & Yao, 95)
– inference network model (Turtle & Croft, 91)
Query Generation and Document Generation
• Document ranking based on the conditional probability of relevance P(R=1|D,Q), with R ∈ {0,1}
• Ranking based on the odds:

    O(R=1|D,Q) = P(R=1|D,Q) / P(R=0|D,Q)

• By Bayes’ rule, P(R|D,Q) can be computed from P(D,Q|R), which can be factored in two ways:
– document generation, P(D|Q,R): leads to Okapi/BM25
– query generation, P(Q|D,R): leads to language models
The Notion of Relevance
(Same taxonomy as above; the document-generation branch, the classical probabilistic model of Robertson & Sparck Jones (76), leads to the Okapi/BM25 formula on the next slide.)
The Okapi/BM25 Retrieval Formula
• Classical probabilistic model -- among the best-performing retrieval formulas:

    Score(q,d) = Σ_{w ∈ q∩d} log[ (N − df(w) + 0.5) / (df(w) + 0.5) ]
                 × [ (k1 + 1) c(w,d) ] / [ k1 ((1 − b) + b |d|/avdl) + c(w,d) ]
                 × [ (k3 + 1) c(w,q) ] / [ k3 + c(w,q) ]

– only words in the query are considered
– the first factor is the IDF component (N: # docs; df(w): doc frequency of w)
– the second factor is the doc term frequency c(w,d), normalized by doc length |d| against the average avdl
– the third factor is the query term frequency c(w,q)
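A compact Python sketch of this scoring function (k1 = 1.2, b = 0.75, k3 = 1000 are commonly used defaults, assumed here; the toy statistics are illustrative):

```python
import math
from collections import Counter

def bm25_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    q, d = Counter(query), Counter(doc)
    score = 0.0
    for w, cwq in q.items():
        if w not in d:
            continue  # only words in both query and document contribute
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        tf = (k1 + 1) * d[w] / (k1 * ((1 - b) + b * len(doc) / avdl) + d[w])
        qtf = (k3 + 1) * cwq / (k3 + cwq)
        score += idf * tf * qtf
    return score

df = {"mining": 2, "algorithms": 1}                 # hypothetical doc freqs
doc = "text mining text models algorithm".split()
print(bm25_score("data mining algorithms".split(), doc, df,
                 n_docs=10, avdl=5.0))
```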
What are the Essential Features in a Retrieval Formula?
• Document TF – how many occurrences of each query term we observe in a document
• Query TF – how many occurrences of each term we observe in a query
• IDF – how distinctive/important each query term is (a term with a higher IDF is more distinctive)
• Document length – how long the document is (we want to penalize long documents, but don’t want to over-penalize them!)
• Others?