
Page 1: Probabilistic Models in Information Retrieval SI650: Information Retrieval

2010 © University of Michigan

Probabilistic Models in Information Retrieval

SI650: Information Retrieval

Winter 2010

School of Information

University of Michigan


Most slides are from Prof. ChengXiang Zhai’s lectures

Page 2: Essential Probability & Statistics

Page 3: Prob/Statistics & Text Management

• Probability & statistics provide a principled way to quantify the uncertainties associated with natural language

• Allow us to answer questions like:
– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

Page 4: Basic Concepts in Probability

• Random experiment: an experiment with an uncertain outcome (e.g., flipping a coin, picking a word from text)

• Sample space: all possible outcomes, e.g., tossing 2 fair coins, S = {HH, HT, TH, TT}

• Event: a subset of the sample space
– E ⊆ S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face)
– Impossible event ({}), certain event (S)

• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
– P(S) = 1 (outcome always in S)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A = same face, B = different face)

Page 5: Example: Flip a Coin

• Sample space: {Head, Tail}
• Fair coin: p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.: p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins: sample space {HH, HT, TH, TT}
• Example in modeling text: flip a coin to decide whether or not to include a word in a document; sample space = {appear, absence}

Page 6: Example: Toss a Die

• Sample space: S = {1, 2, 3, 4, 5, 6}
• Fair die: p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die: p(1) = 0.3, p(2) = 0.2, ...
• N-sided die: S = {1, 2, 3, 4, …, N}
• Example in modeling text: toss a die to decide which word to write in the next position; sample space = {cat, dog, tiger, …}

Page 7: Basic Concepts of Prob. (cont.)

• Joint probability: P(A ∩ B), also written as P(A, B)

• Conditional probability: P(B|A) = P(A ∩ B)/P(A)
– P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
– For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)

• Total probability: if A1, …, An form a partition of S, then
– P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
– So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)]
– This allows us to compute P(Ai|B) based on P(B|Ai)

Page 8: Revisit: Bayes’ Rule

Hypothesis space: H = {H1, …, Hn}; evidence: E

P(Hi|E) = P(E|Hi) P(Hi) / P(E)

– P(Hi|E): posterior probability of Hi
– P(Hi): prior probability of Hi
– P(E|Hi): likelihood of the data/evidence if Hi is true

If we want to pick the most likely hypothesis H*, we can drop P(E):

P(Hi|E) ∝ P(E|Hi) P(Hi)

In text classification: H: class space; E: data (features)
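The rule can be sketched numerically; the priors and likelihoods below are made-up values for a two-hypothesis text-classification example (they are illustrative assumptions, not from the slides):

```python
# Bayes' rule sketch: H1 = "sports", H2 = "politics"; E = observing "baseball".
# All numbers are hypothetical.

priors = {"sports": 0.5, "politics": 0.5}          # P(Hi), assumed
likelihoods = {"sports": 0.09, "politics": 0.001}  # P(E|Hi), assumed

# Unnormalized posteriors: P(Hi|E) is proportional to P(E|Hi) P(Hi)
unnorm = {h: likelihoods[h] * priors[h] for h in priors}

# Normalize by P(E) = sum_i P(E|Hi) P(Hi)  (total probability)
p_e = sum(unnorm.values())
posteriors = {h: v / p_e for h, v in unnorm.items()}

# The most likely hypothesis H* can be picked without ever computing P(E)
h_star = max(unnorm, key=unnorm.get)
```

Note that `h_star` comes from the unnormalized scores, which is exactly why P(E) can be dropped when only the argmax matters.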

Page 9: Random Variable

• A random variable X: S → ℝ (a “measure” of the outcome), e.g., number of heads, all same face?, …

• Events can be defined according to X
– E(X=a) = {si | X(si) = a}
– E(X≥a) = {si | X(si) ≥ a}

• So, probabilities can be defined on X
– P(X=a) = P(E(X=a))
– P(X≥a) = P(E(X≥a))

• Discrete vs. continuous random variables (think of “partitioning the sample space”)

Page 10: Getting to Statistics ...

• We are flipping an unfair coin, but P(Head) = ? (parameter estimation)
– If we see the results of a huge number of random experiments, then

p̂(Head) = count(Heads) / count(Flips) = count(Heads) / (count(Heads) + count(Tails))

– But, what if we only see a small sample (e.g., 2)? Is this estimate still reliable? If we flip twice and get two tails, does it mean P(Head) = 0?

• In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
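The estimator above amounts to a relative frequency; a minimal sketch with toy flip sequences (assumed for illustration):

```python
# Maximum likelihood estimate of P(Head): p_hat = count(Heads) / count(Flips).

def ml_estimate(flips):
    """Relative frequency of 'H' in a sequence of 'H'/'T' outcomes."""
    return flips.count("H") / len(flips)

large_sample = ["H"] * 300 + ["T"] * 700   # many flips: the estimate is reliable
small_sample = ["T", "T"]                  # only two flips, both tails

p_large = ml_estimate(large_sample)  # 0.3
p_small = ml_estimate(small_sample)  # 0.0 -- surely the coin isn't P(Head)=0
```

The small-sample case is exactly the problem the slide raises: the estimator is correct in the limit but untrustworthy for tiny samples.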

Page 11: Parameter Estimation

• General setting:
– Given a (hypothesized & probabilistic) model that governs the random experiment
– The model gives a probability of any data p(D|θ) that depends on the parameter θ
– Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?

• Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data”

• Generally an optimization problem

Page 12: Maximum Likelihood vs. Bayesian

• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”:

θ̂ = argmax_θ P(X|θ)

– Problem: small sample

• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining the data well:

θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)

– Problem: how to define the prior?

Page 13: Illustration of Bayesian Estimation

Prior: p(θ)
Likelihood: p(X|θ), with X = (x1, …, xN)
Posterior: p(θ|X) ∝ p(X|θ) p(θ)

[Plot over θ: the prior p(θ) peaks at the prior mode θ0, the likelihood p(X|θ) peaks at the ML estimate θml, and the posterior p(θ|X) peaks at the posterior mode, between the two]

Page 14: Maximum Likelihood Estimate

Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M̂ = argmax_M p(d|M)

Result (derived on the next slide): p(wi) = c(wi) / |d|

Problem: a document is short; if c(wi) = 0, does this mean p(wi) = 0?
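The ML estimate is just normalized counts; a minimal sketch with toy word counts (assumed for illustration):

```python
# ML estimate for the unigram (multinomial) model: p(wi) = c(wi) / |d|.
# The word counts below are toy values.

counts = {"text": 10, "mining": 5, "association": 3, "query": 1}
doc_len = sum(counts.values())                 # |d| = 19 here

p_ml = {w: c / doc_len for w, c in counts.items()}

# Any word with c(w) = 0 gets probability 0 under the ML estimate --
# the zero-probability problem that motivates smoothing later on
p_unseen = p_ml.get("retrieval", 0.0)
```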

Page 15: If you want to know how we get it…

Data: a document d with counts c(w1), …, c(wN), and length |d|
Model: multinomial distribution M with parameters {θi = p(wi)}
Maximum likelihood estimator: M̂ = argmax_M p(d|M)

p(d|M) = (|d|! / (c(w1)! … c(wN)!)) ∏_{i=1}^{N} θi^{c(wi)} ∝ ∏_{i=1}^{N} θi^{c(wi)}

We’ll tune the θi = p(wi) to maximize the log-likelihood:

l(d|M) = log p(d|M) = Σ_{i=1}^{N} c(wi) log θi

Use the Lagrange multiplier approach, with the constraint Σ_{i=1}^{N} θi = 1:

l'(d|M) = Σ_{i=1}^{N} c(wi) log θi + λ (Σ_{i=1}^{N} θi − 1)

Set the partial derivatives to zero:

∂l'/∂θi = c(wi)/θi + λ = 0  →  θi = −c(wi)/λ

Since Σ_{i=1}^{N} θi = 1 and Σ_{i=1}^{N} c(wi) = |d|, we get λ = −|d|. So the ML estimate is

p(wi) = c(wi) / |d|

Page 16: Statistical Language Models

Page 17: What is a Statistical Language Model?

• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001

• Context-dependent!

• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

Page 18: Why is a Language Model Useful?

• Provides a principled way to quantify the uncertainties associated with natural language

• Allows us to answer questions like:

– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)

– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)

– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

Page 19: Source-Channel Framework (Communication System)

[Diagram: Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination; X → Y → X', with P(X) at the source and P(Y|X) over the channel]

Decoding: P(X|Y) = ?

X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)

When X is text, p(X) is a language model

Page 20: Speech Recognition

[Diagram: Speaker → Noisy Channel → Recognizer; words W → acoustic signal A → recognized words, with P(W) and P(A|W)]

Given acoustic signal A, find the word sequence W: P(W|A) = ?

Ŵ = argmax_W p(W|A) = argmax_W p(A|W) p(W)

where p(A|W) is the acoustic model and p(W) is the language model

Page 21: Machine Translation

[Diagram: English Speaker → Noisy Channel (Translator); English words E → Chinese words C → English translation, with P(E) and P(C|E)]

Given Chinese sentence C, find its English translation E: P(E|C) = ?

Ê = argmax_E p(E|C) = argmax_E p(C|E) p(E)

where p(C|E) is the English→Chinese translation model and p(E) is the English language model

Page 22: Spelling/OCR Error Correction

[Diagram: Original text → Noisy Channel → Corrector; original words O → “erroneous” words E → corrected text, with P(O) and P(E|O)]

Given corrupted text E, find the original text O: P(O|E) = ?

Ô = argmax_O p(O|E) = argmax_O p(E|O) p(O)

where p(E|O) is the spelling/OCR error model and p(O) is the “normal” language model

Page 23: Basic Issues

• Define the probabilistic model
– Event, random variables, joint/conditional probabilities
– P(w1 w2 ... wn) = f(θ1, θ2, …, θm)

• Estimate model parameters
– Tune the model to best fit the data and our prior knowledge: θi = ?

• Apply the model to a particular task
– Many applications

Page 24: The Simplest Language Model (Unigram Model)

• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution

Page 25: Modeling Text As Tossing a Die

• Sample space: S = {w1, w2, w3, …, wN}
• The die is unfair, e.g., P(cat) = 0.3, P(dog) = 0.2, ...
• Toss a die to decide which word to write in the next position
• E.g., sample space = {cat, dog, tiger}, with P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2
• Text = “cat cat dog”
• P(Text) = P(cat) * P(cat) * P(dog) = 0.3 * 0.3 * 0.5 = 0.045
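The computation above can be sketched directly, reproducing the “cat cat dog” example:

```python
# P(Text) under the unigram model: the product of the word probabilities.

model = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}

def unigram_prob(words, p):
    prob = 1.0
    for w in words:
        prob *= p[w]   # words are generated independently
    return prob

p_text = unigram_prob(["cat", "cat", "dog"], model)  # 0.3 * 0.3 * 0.5
```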


Page 26: Text Generation with Unigram LM

A (unigram) language model p(w|θ) generates documents by sampling:

– Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001 → a “text mining paper”
– Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 → a “food nutrition paper”

Page 27: Estimation of Unigram LM

Given a document (a “text mining paper”, total #words = 100) with counts:
text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

estimate the (unigram) language model p(w|θ) = ?

Estimation: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100

Page 28: Empirical distribution of words

• There are stable language-independent patterns in how people use natural languages

• A few words occur very frequently; most occur rarely. E.g., in news articles:
– Top 4 words: 10~15% of word occurrences
– Top 50 words: 35~40% of word occurrences

• The most frequent word in one corpus may be rare in another

Page 29: Revisit: Zipf’s Law

• rank * frequency ≈ constant

F(w) = C / r(w)^α,  α ≈ 1,  C ≈ 0.1

[Plot: word frequency vs. word rank (by frequency). The head of the curve (stop words) forms the biggest data structure; the middle region holds the most useful words (Luhn 57); for the tail, is “too rare” a problem?]

Generalized Zipf’s law: F(w) = C / (r(w) + B)^α, applicable in many domains
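The functional form can be sketched to make the “rank * frequency ≈ constant” claim concrete; C and α below are the slide’s rough values, not fitted to any corpus:

```python
# Zipf's law sketch: with F(w) = C / r(w)**alpha and alpha = 1,
# the product rank * frequency equals C for every rank.

C, alpha = 0.1, 1.0

def zipf_freq(rank):
    return C / rank ** alpha

products = [rank * zipf_freq(rank) for rank in range(1, 51)]
```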

Page 30: Problem with the ML Estimator

• What if a word doesn’t appear in the text?
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …

Page 31: Language Model Smoothing (Illustration)

[Plot: P(w) vs. w, comparing the maximum likelihood estimate with a smoothed LM]

p_ML(w) = count of w / count of all words

Page 32: How to Smooth?

• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count

• Method 1 (Additive smoothing): add a constant to the count of each word (“add one”, Laplace smoothing):

p(w|d) = (c(w,d) + 1) / (|d| + |V|)

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size.

• Problems?
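A minimal sketch of add-one smoothing; the document and vocabulary below are toy values:

```python
# Additive (Laplace, "add one") smoothing:
# p(w|d) = (c(w,d) + 1) / (|d| + |V|).

vocab = ["text", "mining", "association", "query", "retrieval"]
counts = {"text": 3, "mining": 1}     # c(w,d); the other vocab words are unseen
doc_len = sum(counts.values())        # |d| = 4

def add_one(w):
    return (counts.get(w, 0) + 1) / (doc_len + len(vocab))

p = {w: add_one(w) for w in vocab}
# Unseen words now get non-zero probability (1/9 each here)
```

One visible problem: every unseen word gets the same probability, no matter how implausible it is, which motivates the reference-model methods on the next slide.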

Page 33: How to Smooth? (cont.)

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

p(w|d) = p_seen(w|d), if w is seen in d
p(w|d) = α_d p(w|REF), otherwise

where p_seen(w|d) is the discounted ML estimate, p(w|REF) is the reference language model, and

α_d = (1 − Σ_{w seen} p_seen(w|d)) / (Σ_{w unseen} p(w|REF))

Page 34: Other Smoothing Methods

• Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word:

p(w|d) = (max(c(w;d) − δ, 0) + δ |d|_u p(w|REF)) / |d|,  with α_d = δ |d|_u / |d|

where |d|_u is the number of unique words in d.

• Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” uniformly toward p(w|REF):

p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF),  with α_d = λ

where c(w,d)/|d| is the ML estimate and λ is a parameter.
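Both methods can be sketched side by side; the document counts and reference model below are toy values:

```python
# Jelinek-Mercer (linear interpolation) and absolute discounting,
# following the formulas on the slide. Toy data throughout.

counts = {"text": 3, "mining": 1}
doc_len = sum(counts.values())                          # |d| = 4
p_ref = {"text": 0.2, "mining": 0.1, "retrieval": 0.7}  # p(w|REF), assumed

def jelinek_mercer(w, lam=0.5):
    """(1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)"""
    return (1 - lam) * counts.get(w, 0) / doc_len + lam * p_ref[w]

def absolute_discount(w, delta=0.5):
    """max(c(w,d) - delta, 0)/|d| + delta * |d|_u / |d| * p(w|REF)"""
    d_u = len(counts)  # number of unique words seen in d
    return (max(counts.get(w, 0) - delta, 0) / doc_len
            + delta * d_u / doc_len * p_ref[w])
```

Both distributions still sum to 1 over the vocabulary, and both give unseen words (like "retrieval" here) a probability proportional to p(w|REF).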

Page 35: Other Smoothing Methods (cont.)

• Method 4 (Dirichlet prior/Bayesian): assume μ pseudo counts distributed according to p(w|REF):

p(w|d) = (c(w;d) + μ p(w|REF)) / (|d| + μ),  with α_d = μ / (|d| + μ)

where μ is a parameter.

• Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way:

p(w|d) = c*(w,d) / |d|,  where c*(w,d) = (r + 1) n_{r+1} / n_r and r = c(w,d)

e.g., 0* = n1/n0, 1* = 2 n2/n1, …
What if n_{r+1} = 0? What about p(w|REF)?
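The Dirichlet prior method can be sketched as follows; the counts, reference model, and the (unrealistically small) μ are toy values:

```python
# Dirichlet prior smoothing: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu),
# i.e. mu pseudo-counts spread according to the reference model.
# In practice mu is typically in the hundreds or thousands.

counts = {"text": 3, "mining": 1}
doc_len = sum(counts.values())                          # |d| = 4
p_ref = {"text": 0.2, "mining": 0.1, "retrieval": 0.7}
mu = 2.0

def dirichlet(w):
    return (counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu)

alpha_d = mu / (doc_len + mu)  # share of mass controlled by the reference model
```

Note how α_d shrinks as |d| grows: longer documents trust their own counts more and the reference model less.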

Page 36: So, which method is the best?

It depends on the data and the task!

Many other sophisticated smoothing methods have been proposed…

Cross validation is generally used to choose the best method and/or set the smoothing parameters…

For retrieval, Dirichlet prior performs well…

Smoothing will be discussed further in the course…

Page 37: Language Models for IR


Page 40: Language Models for Retrieval (Ponte & Croft 98)

Document language models:
– “text mining paper”: text ?, mining ?, association ?, clustering ?, …, food ?
– “food nutrition paper”: food ?, nutrition ?, healthy ?, diet ?

Query = “data mining algorithms”

? Which model would most likely have generated this query?

Page 41: Ranking Docs by Query Likelihood

[Diagram: documents d1, d2, …, dN, each with its own doc LM; the query q is scored against each document by its query likelihood p(q|d1), p(q|d2), …, p(q|dN)]

Page 42: Retrieval as Language Model Estimation

• Document ranking based on query likelihood:

log p(q|d) = Σ_i log p(wi|d),  where q = w1 w2 … wn

and p(wi|d) is the document language model.

• Retrieval problem → estimation of p(wi|d)

• Smoothing is an important issue, and distinguishes different approaches
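The ranking idea can be sketched directly; the smoothed document language models below are hypothetical, standing in for the estimation and smoothing steps discussed above:

```python
import math

# Ranking documents by query likelihood: log p(q|d) = sum_i log p(wi|d).
# The per-document word probabilities are made-up (already-smoothed) values.

doc_models = {
    "d1": {"data": 0.10, "mining": 0.10, "algorithms": 0.01},
    "d2": {"data": 0.02, "mining": 0.01, "algorithms": 0.02},
}

def log_query_likelihood(query, model):
    # Log probabilities are summed instead of multiplying raw probabilities,
    # which avoids underflow for long queries
    return sum(math.log(model[w]) for w in query)

query = ["data", "mining", "algorithms"]
ranked = sorted(doc_models,
                key=lambda d: log_query_likelihood(query, doc_models[d]),
                reverse=True)
```

This also shows why smoothing is mandatory: a single query word with p(w|d) = 0 would send the log-likelihood to −∞ and knock the document out of the ranking entirely.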

Page 43: How to Estimate p(w|d)?

• Simplest solution: maximum likelihood estimator
– p(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? Then p(w|d) = 0

• In general, what probability should we give a word that has not been observed?

• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words

• This is what “smoothing” is about …

Page 44: A General Smoothing Scheme

• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability

• Most use a reference model (collection language model) to discriminate unseen words:

p(w|d) = p_seen(w|d), if w is seen in d
p(w|d) = α_d p(w|C), otherwise

where p_seen(w|d) is the discounted ML estimate and p(w|C) is the collection language model.

Page 45: Smoothing & TF-IDF Weighting

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain:

log p(q|d) = Σ_{wi ∈ d ∩ q} log [ p_seen(wi|d) / (α_d p(wi|C)) ] + n log α_d + Σ_i log p(wi|C)

– The first sum combines TF weighting (through p_seen(wi|d)) with IDF weighting (through dividing by p(wi|C))
– n log α_d acts as doc length normalization (a long doc is expected to have a smaller α_d)
– The last term does not depend on d and can be ignored for ranking

• Smoothing with p(w|C) → TF-IDF + length normalization

Page 46: Three Smoothing Methods (Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C):

p_λ(w|d) = (1 − λ) p_ml(w|d) + λ p(w|C)

• Dirichlet prior (Bayesian): assume μ pseudo counts p(w|C):

p_μ(w|d) = (c(w;d) + μ p(w|C)) / (|d| + μ)

• Absolute discounting: subtract a constant δ:

p_δ(w|d) = (max(c(w;d) − δ, 0) + δ |d|_u p(w|C)) / |d|

Page 47: Assume We Use Dirichlet Smoothing

With p(w|d) = (c(w;d) + μ p(w|C)) / (|d| + μ) and α_d = μ / (|d| + μ), the query likelihood log p(q|d) is rank-equivalent to

Score(q,d) = Σ_{w ∈ q ∩ d} c(w,q) log(1 + c(w,d) / (μ p(w|C))) + |q| log(μ / (|d| + μ))

– Only words in the query are considered
– c(w,q) is the query term frequency; c(w,d) is the doc term frequency
– The log(1 + c(w,d)/(μ p(w|C))) factor behaves similarly to IDF; the last term penalizes doc length |d|
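The scoring formula can be sketched directly; all counts and the collection model p(w|C) below are toy values:

```python
import math

# Query-likelihood score under Dirichlet smoothing, in the rank-equivalent
# form on the slide:
#   Score(q,d) = sum over w in both q and d of
#                c(w,q) * log(1 + c(w,d) / (mu * p(w|C)))
#                + |q| * log(mu / (|d| + mu))

def dirichlet_score(q_counts, d_counts, p_c, doc_len, mu=2000.0):
    q_len = sum(q_counts.values())
    matched = sum(cq * math.log(1 + d_counts.get(w, 0) / (mu * p_c[w]))
                  for w, cq in q_counts.items() if w in d_counts)
    return matched + q_len * math.log(mu / (doc_len + mu))

# A document containing the query words should outscore one that doesn't
score = dirichlet_score({"data": 1, "mining": 1},
                        {"data": 3, "mining": 2},
                        {"data": 0.001, "mining": 0.0005},
                        doc_len=100)
```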

Page 48: The Notion of Relevance

Relevance can be modeled in three broad ways:

• (Rep(q), Rep(d)) similarity — different representations & similarity measures:
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)

• P(r=1|q,d), r ∈ {0,1} — probability of relevance:
– Regression model (Fox 83)
– Generative models:
   · Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
   · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)

• P(d→q) or P(q→d) — probabilistic inference, with different inference systems:
– Inference network model (Turtle & Croft, 91)
– Prob. concept space model (Wong & Yao, 95)

Page 49: Query Generation and Document Generation

• Document ranking based on the conditional probability of relevance P(R=1|D,Q), with R ∈ {0,1}

• Ranking based on the odds ratio: O(D,Q) = P(R=1|D,Q) / P(R=0|D,Q)

• P(R|D,Q) is computed through P(D,Q|R), which can be decomposed in two ways:
– Document generation P(D|Q,R) → classical probabilistic model, Okapi/BM25
– Query generation P(Q|D,R) → language models


Page 51: The Okapi/BM25 Retrieval Formula

Score(q,d) = Σ_{w ∈ q ∩ d} log((N − df(w) + 0.5) / (df(w) + 0.5)) × ((k1 + 1) c(w,d)) / (k1 ((1 − b) + b |d|/avdl) + c(w,d)) × ((k3 + 1) c(w,q)) / (k3 + c(w,q))

– Only words in the query are considered
– log((N − df(w) + 0.5)/(df(w) + 0.5)) is the IDF component (N documents in the collection, df(w) of them containing w)
– The c(w,d) factor is the doc term frequency, normalized by doc length |d| against the average doc length avdl
– The c(w,q) factor is the query term frequency

• Classical probabilistic model – among the best performance
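The formula can be sketched as a function; all collection statistics (N, df) and counts below are toy values, and k1, b, k3 are set to commonly used defaults:

```python
import math

# Okapi/BM25 following the formula on the slide.

def bm25_score(q_counts, d_counts, df, N, doc_len, avdl,
               k1=1.2, b=0.75, k3=8.0):
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd == 0:
            continue  # only words in both the query and the document count
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf_d = (k1 + 1) * cd / (k1 * ((1 - b) + b * doc_len / avdl) + cd)
        tf_q = (k3 + 1) * cq / (k3 + cq)
        score += idf * tf_d * tf_q
    return score

score = bm25_score({"data": 1, "mining": 1},    # query term counts c(w,q)
                   {"data": 3, "mining": 2},    # doc term counts c(w,d)
                   {"data": 50, "mining": 10},  # document frequencies df(w)
                   N=1000, doc_len=100, avdl=100)
```

The b parameter controls how strongly doc length is normalized (b = 0 disables it), which connects directly to the length-normalization discussion on the next slide.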

Page 52: What are the Essential Features in a Retrieval Formula?

• Document TF – how many occurrences of each query term we observe in a document

• Query TF – how many occurrences of each term we observe in a query

• IDF – how distinctive/important each query term is (a term with a higher IDF is more distinctive)

• Document length – how long the document is (we want to penalize long documents, but don’t want to over-penalize them!)

• Others?