2004.09.23 - SLIDE 1IS 202 – FALL 2004
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/
SIMS 202: Information Organization and Retrieval
Lecture 8: Probabilistic IR and Relevance Feedback
2004.09.23 - SLIDE 2IS 202 – FALL 2004
Lecture Overview
• Review
– Vector Representation
– Term Weights
– Vector Matching
– Clustering
• Probabilistic Models of IR
• Relevance Feedback
Credit for some of the slides in this lecture goes to Marti Hearst
2004.09.23 - SLIDE 4IS 202 – FALL 2004
Document Vectors
ID  nova  galaxy  heat  h'wood  film  role  diet  fur
(Each row lists only the document's nonzero term weights, left to right.)
A   10  5  3
B   5  10
C   10  8  7
D   9  10  5
E   10  10
F   9  10
G   5  7  9
H   6  10  2  8
I   7  5  1  3
2004.09.23 - SLIDE 5IS 202 – FALL 2004
Vector Space Documents and Queries
docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   0   0   1   3
Q     1   2   3
      q1  q2  q3
[Figure: documents D1–D11 plotted in the space defined by axes t1, t2 and t3]
Boolean term combinations
Q is a query – also represented as a vector
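The RSV column can be recomputed as a plain inner product of the query vector with each document vector. The following is a minimal sketch using the binary vectors from the table; the names `docs`, `query`, `rsv`, `scores` and `ranking` are illustrative choices, not part of the lecture.

```python
# Binary document vectors over terms (t1, t2, t3), taken from the slide's table.
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (0, 0, 1),
}
query = (1, 2, 3)  # query term weights q1, q2, q3


def rsv(q, d):
    """Retrieval status value: inner product of query and document vectors."""
    return sum(qk * dk for qk, dk in zip(q, d))


scores = {name: rsv(query, vec) for name, vec in docs.items()}
# Document names ordered from highest to lowest RSV.
ranking = sorted(scores, key=scores.get, reverse=True)
```

D5 contains all three query terms and therefore ranks first; D9 and D11 have the same vector (0, 0, 1) and so receive the same score.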
2004.09.23 - SLIDE 6IS 202 – FALL 2004
Documents in Vector Space
[Figure: documents D1–D11 positioned in a three-dimensional space with axes t1, t2 and t3]
2004.09.23 - SLIDE 7IS 202 – FALL 2004
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   0   0   1   3
Q     1   2   3
      q1  q2  q3
2004.09.23 - SLIDE 8IS 202 – FALL 2004
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3  RSV=Q.Di
D1    2   0   3   11
D2    1   0   0   1
D3    0   4   7   29
D4    3   0   0   3
D5    1   6   3   22
D6    3   5   0   13
D7    0   8   0   16
D8    0   10  0   20
D9    0   0   1   3
D10   0   3   5   21
D11   0   0   1   3
Q     1   2   3
      q1  q2  q3
2004.09.23 - SLIDE 9IS 202 – FALL 2004
tf*idf weights
$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$
where
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$$idf_k = \log(N / n_k)$$
2004.09.23 - SLIDE 10IS 202 – FALL 2004
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (N = 10000):
$$\log(10000 / 1) = 4$$
$$\log(10000 / 20) = 2.699$$
$$\log(10000 / 5000) = 0.301$$
$$\log(10000 / 10000) = 0$$
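The four values on this slide can be checked directly. This is a small sketch that assumes base-10 logarithms, which is what the slide's numbers imply; the function name `idf` and the dictionary `idf_values` are illustrative.

```python
import math

N = 10_000  # collection size used on the slide


def idf(n_k, total=N):
    """Inverse document frequency log10(N / n_k) for a term found in n_k documents."""
    return math.log10(total / n_k)


# IDF for terms occurring in 1, 20, 5000 and all 10000 documents.
idf_values = {n_k: round(idf(n_k), 3) for n_k in (1, 20, 5000, 10000)}
```

A term that appears in every document gets a weight of zero, so it cannot help distinguish one document from another.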
2004.09.23 - SLIDE 11IS 202 – FALL 2004
tf*idf Normalization
• Normalize the term weights (so longer vectors are not unfairly given more weight)
– Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive
$$w_{ik} = \frac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 [\log(N / n_k)]^2}}$$
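As a rough illustration of the normalized weighting, the sketch below computes raw tf*idf weights for one document and divides each by the length of the resulting vector. The term and document frequencies are hypothetical, base-10 logs are an assumption, and `tfidf_vector` is an illustrative name.

```python
import math


def tfidf_vector(tf, df, n_docs):
    """Length-normalized tf*idf weights for one document.

    tf: raw term frequencies tf_ik for the document, one per term
    df: document frequencies n_k for the same terms
    n_docs: collection size N
    """
    raw = [f * math.log10(n_docs / n) for f, n in zip(tf, df)]
    norm = math.sqrt(sum(w * w for w in raw))
    # Guard against an all-zero vector (no terms, or only terms found everywhere).
    return [w / norm if norm else 0.0 for w in raw]


# Hypothetical document with three terms in a 10000-document collection.
weights = tfidf_vector(tf=[3, 1, 2], df=[20, 5000, 1], n_docs=10_000)
```

After normalization the squared weights sum to 1, so every weight lies between 0 and 1 regardless of document length.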
2004.09.23 - SLIDE 12IS 202 – FALL 2004
Vector Space Similarity
• Now, the similarity of two documents is:
$$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}$$
• This is also called the cosine, or normalized inner product
– The normalization was done when weighting the terms
– Note that the wik weights can be stored in the vectors/inverted files for the documents
2004.09.23 - SLIDE 13IS 202 – FALL 2004
Vector Space Matching
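[Figure: Q, D1 and D2 drawn as vectors against Term A and Term B, each axis running from 0 to 1.0, with the angles between Q and each document marked]
$$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$$
$$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$$
$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$
$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$$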
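The worked example can be reproduced with a few lines of code. This sketch uses the slide's three vectors; `cosine`, `sim_d1` and `sim_d2` are illustrative names. Without the slide's intermediate rounding, the D1 value comes out slightly lower, at about 0.73.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm if norm else 0.0


Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine(Q, D1)
sim_d2 = cosine(Q, D2)
```

D2 points in nearly the same direction as Q, so it is ranked ahead of D1 even though D1 is the longer vector.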
2004.09.23 - SLIDE 14IS 202 – FALL 2004
Vector Space Visualization
2004.09.23 - SLIDE 15IS 202 – FALL 2004
Document/Document Matrix
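$$\begin{array}{c|cccc}
 & D_1 & D_2 & \cdots & D_n \\
\hline
D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
D_n & d_{n1} & d_{n2} & \cdots & d_{nn}
\end{array}$$
$d_{ij}$ = similarity of $D_i$ to $D_j$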
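A document/document matrix of this kind can be built by applying a similarity function to every pair of document vectors. The sketch below assumes cosine similarity and three hypothetical documents; `similarity_matrix` is an illustrative name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """n x n matrix whose entry [i][j] is the similarity of document i to document j."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Three hypothetical documents over three terms.
vectors = [(1, 0, 1), (1, 1, 0), (0, 1, 1)]
matrix = similarity_matrix(vectors)
```

The matrix is symmetric with ones on the diagonal, so in practice only the upper triangle needs to be computed and stored; clustering algorithms typically start from this structure.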
2004.09.23 - SLIDE 16IS 202 – FALL 2004
Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
[Figure: documents plotted as points against axes Term 1 and Term 2]
2004.09.23 - SLIDE 18IS 202 – FALL 2004
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
• Retrieval efficiency vs. indexing and update efficiency for stored pre-calculated weights
2004.09.23 - SLIDE 20IS 202 – FALL 2004
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities
2004.09.23 - SLIDE 21IS 202 – FALL 2004
Probability Ranking Principle
• “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”
Stephen E. Robertson, J. Documentation 1977
2004.09.23 - SLIDE 22IS 202 – FALL 2004
Model 1 – Maron and Kuhns
• Concerned with estimating probabilities of relevance at the point of indexing:
– If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?
2004.09.23 - SLIDE 23IS 202 – FALL 2004
Model 1
• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.
Robertson, Maron & Cooper, 1982
2004.09.23 - SLIDE 24IS 202 – FALL 2004
Model 1 – Bayes
• A is the class of events of using the library
• Di is the class of events of Document i being judged relevant
• Ij is the class of queries consisting of the single term Ij
• P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved
$$P(D_i | A, I_j) = \frac{P(D_i | A) \cdot P(I_j | A, D_i)}{P(I_j | A)}$$
2004.09.23 - SLIDE 25IS 202 – FALL 2004
Model 2
• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.
Robertson, Maron & Cooper, 1982
2004.09.23 - SLIDE 26IS 202 – FALL 2004
Model 2 – Robertson & Sparck Jones
Given a term t and a query q

                          Document Relevance
                          +        -
Document Indexing   +     r        n-r        n
                    -     R-r      N-n-R+r    N-n
                          R        N-R        N
2004.09.23 - SLIDE 27IS 202 – FALL 2004
Robertson-Sparck Jones Weights
• Retrospective formulation
$$w = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$
2004.09.23 - SLIDE 28IS 202 – FALL 2004
Robertson-Sparck Jones Weights
• Predictive formulation
$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
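The predictive weight is straightforward to compute from the four counts in the contingency table. The following sketch assumes natural logarithms (the slide does not state a base) and uses hypothetical counts; `rsj_weight` is an illustrative name.

```python
import math


def rsj_weight(r, n, R, N):
    """Predictive Robertson-Sparck Jones weight w(1) for one term.

    r: relevant documents containing the term
    n: documents containing the term
    R: relevant documents
    N: documents in the collection
    """
    numerator = (r + 0.5) / (R - r + 0.5)
    denominator = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical counts: the term occurs in 8 of 10 relevant documents
# but in only 50 of 1000 documents overall.
w_good = rsj_weight(r=8, n=50, R=10, N=1000)
# With no relevance information (r = R = 0) the weight behaves like an idf.
w_prior = rsj_weight(r=0, n=50, R=0, N=1000)
```

The 0.5 terms keep the ratio defined when any cell of the table is zero, which is why this form can be used before any relevance judgments exist, unlike the retrospective formulation.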
2004.09.23 - SLIDE 29IS 202 – FALL 2004
Probabilistic Models: Some Unifying Notation
• D = All present and future documents
• Q = All present and future queries
• (Di,Qj) = A document query pair
• x = class of similar documents, $x \subseteq D$
• y = class of similar queries, $y \subseteq Q$
• Relevance (R) is a relation:
$$R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$$
2004.09.23 - SLIDE 30IS 202 – FALL 2004
Probabilistic Models
• Model 1 -- Probabilistic Indexing, P(R|y,Di)
• Model 2 -- Probabilistic Querying, P(R|Qj,x)
• Model 3 -- Merged Model, P(R|Qj,Di)
• Model 0 -- P(R|y,x)
• Probabilities are estimated based on prior usage or relevance estimation
2004.09.23 - SLIDE 31IS 202 – FALL 2004
Probabilistic Models
[Figure: the document space D, containing class x and document Di, alongside the query space Q, containing class y and query Qj]
Logistic Regression
• Another approach to estimating probability of relevance
• Based on work by William Cooper, Fred Gey and Daniel Dabney
• Builds a regression model for relevance prediction based on a set of training data
• Uses less restrictive independence assumptions than Model 2
– Linked Dependence
2004.09.23 - SLIDE 33IS 202 – FALL 2004
So What’s Regression?
• A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion
• The most common type of regression is linear regression
2004.09.23 - SLIDE 34IS 202 – FALL 2004
What’s Regression?
• Least Squares Fitting is a mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve
• The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity
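For the simplest case, a straight line, least squares fitting has a closed-form solution. The sketch below fits a line to a handful of hypothetical points; the function names and the data are illustrative only.

```python
def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical offsets."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """The quantity that least squares fitting minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical points that lie close to the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = least_squares_line(xs, ys)
```

The closed form comes from setting the derivatives of the squared-residual sum to zero, which is possible precisely because that sum is differentiable; any other line, such as y = 2x + 1 itself, yields a residual sum at least as large on these points.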
2004.09.23 - SLIDE 35IS 202 – FALL 2004
Logistic Regression
[Figure: fitted curve of Relevance (vertical axis, 0 to 100) against Term Frequency in Document (horizontal axis, 0 to 60)]
2004.09.23 - SLIDE 36IS 202 – FALL 2004
Probabilistic Models: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables
Log odds of relevance is a linear function of attributes:
$$\log O(R | q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \ldots + c_n v_n$$
Term contributions summed:
$$\log O(R | q_i, d_j) = \sum_{k=1}^{m} [\log O(R | q_i, d_j, t_k) - \log O(R)]$$
Probability of Relevance is inverse of log odds:
$$P(R | q_i, d_j) = \frac{1}{1 + e^{-\log O(R | q_i, d_j)}}$$
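The step from log odds to probability is the logistic transform. The following sketch shows the linear log-odds model and the transform together; the coefficients and attribute values are hypothetical, not fitted values from any system, and the function names are illustrative.

```python
import math


def log_odds(coefficients, values):
    """Linear log-odds model: c0 + c1*v1 + ... + cn*vn."""
    c0, rest = coefficients[0], coefficients[1:]
    return c0 + sum(c * v for c, v in zip(rest, values))


def probability_from_log_odds(log_o):
    """Logistic transform P = 1 / (1 + exp(-log O))."""
    return 1.0 / (1.0 + math.exp(-log_o))


# Hypothetical coefficients and attribute values, for illustration only.
coefficients = [-3.5, 1.2, 0.8]
values = [2.0, 1.5]
p = probability_from_log_odds(log_odds(coefficients, values))
```

Log odds of zero correspond to a probability of exactly 0.5, and the transform is monotonic, so ranking by log odds and ranking by probability give the same order.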
2004.09.23 - SLIDE 37IS 202 – FALL 2004
Logistic Regression Attributes
$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$
Average Absolute Query Frequency
$$X_2 = \sqrt{QL}$$
Query Length
$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$
Average Absolute Document Frequency
$$X_4 = \sqrt{DL}$$
Document Length
$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$
Average Inverse Document Frequency
$$IDF = \frac{N - n_{t_j}}{n_{t_j}}$$
Inverse Document Frequency
$$X_6 = \log M$$
Number of Terms in common between query and document -- logged
2004.09.23 - SLIDE 38IS 202 – FALL 2004
Logistic Regression
• Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients
• At retrieval the probability estimate is obtained by:
$$\log O(R | Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i, \qquad P(R | Q, D) = \frac{1}{1 + e^{-\log O(R | Q, D)}}$$
• For the 6 X attribute measures shown previously
2004.09.23 - SLIDE 39IS 202 – FALL 2004
Probabilistic Models
Advantages
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector
Disadvantages
• Relevance information is required -- or is “guestimated”
• Important indicators of relevance may not be terms -- though terms only are usually used
• Optimally requires on-going collection of relevance information
2004.09.23 - SLIDE 40IS 202 – FALL 2004
Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
– Vector assumes relevance
– Probabilistic relies on relevance judgments or estimates
2004.09.23 - SLIDE 41IS 202 – FALL 2004
Current Use of Probabilistic Models
• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…
$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
2004.09.23 - SLIDE 42IS 202 – FALL 2004
Okapi BM25
$$\sum_{T \in Q} w^{(1)} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$
• Where:
• Q is a query containing terms T
• K is k1((1-b) + b.dl/avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length measured in some convenient unit
• w(1) is the Robertson-Sparck Jones weight
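To make the pieces of the BM25 formula concrete, here is a minimal sketch that scores one document against one query. It assumes natural logs, no relevance information (r = R = 0 in the Robertson-Sparck Jones weight), k3 = 7, and hypothetical collection statistics; the function and variable names are illustrative, not taken from Okapi.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1); r = R = 0 means no relevance data."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))


def bm25(query_tf, doc_tf, doc_freq, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """BM25 score of one document for one query.

    query_tf: {term: qtf}   doc_tf: {term: tf}   doc_freq: {term: n}
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        w1 = rsj_weight(doc_freq[term], N)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


# Hypothetical collection statistics and documents.
N = 1000
doc_freq = {"probabilistic": 30, "retrieval": 200}
query = {"probabilistic": 1, "retrieval": 1}
matching_tf = {"probabilistic": 3, "retrieval": 2}
short_doc = bm25(query, matching_tf, doc_freq, N, dl=100, avdl=200)
long_doc = bm25(query, matching_tf, doc_freq, N, dl=400, avdl=200)
no_match = bm25(query, {"feedback": 5}, doc_freq, N, dl=100, avdl=200)
```

With identical term frequencies, the shorter document scores higher because K grows with dl/avdl, which is how the formula discounts matches in long documents.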
2004.09.23 - SLIDE 43IS 202 – FALL 2004
Language Models
• A recent addition to the probabilistic models is “language modeling” that estimates the probability that a query could have been produced by a given document.
• This is a slight variation on the other probabilistic models that has led to some modest improvements in performance
2004.09.23 - SLIDE 44IS 202 – FALL 2004
Logistic Regression and Cheshire II
• The Cheshire II system (see readings) uses Logistic Regression equations estimated from TREC full-text data
• Used for a number of production level systems here and in the U.K.
2004.09.23 - SLIDE 46IS 202 – FALL 2004
Querying in IR System
[Figure: Information Storage and Retrieval System. On one side, interest profiles & queries are formulated in terms of descriptors and stored (Store1: Profiles/Search requests). On the other side, documents & data are indexed (descriptive and subject) and stored (Store2: Document representations). Both sides follow the "rules of the game" = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Comparison/matching across the storage line yields potentially relevant documents.]
2004.09.23 - SLIDE 47IS 202 – FALL 2004
Relevance Feedback in an IR System
[Figure: the Information Storage and Retrieval System diagram from the previous slide (interest profiles & queries, documents & data, rules of the game, Store1 and Store2, comparison/matching, potentially relevant documents), with an added feedback path labeled "Selected relevant docs"]
2004.09.23 - SLIDE 48IS 202 – FALL 2004
Query Modification
• Problem: How to reformulate the query?
– Thesaurus expansion:
  • Suggest terms similar to query terms
– Relevance feedback:
  • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
2004.09.23 - SLIDE 49IS 202 – FALL 2004
Relevance Feedback
• Main Idea:
– Modify existing query based on relevance judgements
  • Extract terms from relevant documents and add them to the query
  • And/or re-weight the terms already in the query
– Two main approaches:
  • Automatic (pseudo-relevance feedback)
  • Users select relevant documents
    – Users/system select terms from an automatically-generated list
2004.09.23 - SLIDE 50IS 202 – FALL 2004
Relevance Feedback
• Usually do both:
– Expand query with new terms
– Re-weight terms in query
• There are many variations
– Usually positive weights for terms from relevant docs
– Sometimes negative weights for terms from non-relevant docs
– Remove terms ONLY in non-relevant documents
2004.09.23 - SLIDE 51IS 202 – FALL 2004
Rocchio Method
$$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$
where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
2004.09.23 - SLIDE 52IS 202 – FALL 2004
Rocchio/Vector Illustration
[Figure: Q0, D1, D2 and the modified queries Q’ and Q” plotted against the axes Retrieval and Information, each running from 0 to 1.0]
Q0 = retrieval of information = (0.7,0.3)
D1 = information science = (0.2,0.8)
D2 = retrieval systems = (0.9,0.1)
Q’ = ½*Q0 + ½*D1 = (0.45,0.55)
Q” = ½*Q0 + ½*D2 = (0.80,0.20)
2004.09.23 - SLIDE 53IS 202 – FALL 2004
Example Rocchio Calculation
Relevant docs:
R1 = (.030, 0.00, 0.00, .025, .025, .050, 0.00, 0.00, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)
Non-rel doc:
S1 = (.030, .010, .020, 0.00, .005, .025, 0.00, .020, 0.00)
Original Query:
Q = (0.00, 0.00, 0.00, 0.00, .500, 0.00, .450, 0.00, .950)
Constants:
$\beta = 0.75$, $\gamma = 0.25$
Rocchio Calculation:
$$Q_{new} = Q + \frac{0.75}{2}(R_1 + R_2) - \frac{0.25}{1} S_1$$
Resulting feedback query:
$$Q_{new} = (0.011, 0.000875, 0.002, 0.01, 0.527, 0.022, 0.488, 0.033, 1.04)$$
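The example calculation can be replayed in code. This sketch implements the Rocchio update with beta = 0.75 and gamma = 0.25 and applies it to the four vectors of the example (two relevant documents, one non-relevant document, and the original query); the function name `rocchio` and the variable names are illustrative.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Rocchio update: Q1 = Q0 + beta * mean(relevant) - gamma * mean(nonrelevant)."""
    def mean(vectors, k):
        # Average of component k over the given vectors (0 if there are none).
        return sum(v[k] for v in vectors) / len(vectors) if vectors else 0.0

    return [q0[k] + beta * mean(relevant, k) - gamma * mean(nonrelevant, k)
            for k in range(len(q0))]


R1 = (.030, 0.00, 0.00, .025, .025, .050, 0.00, 0.00, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)
S1 = (.030, .010, .020, 0.00, .005, .025, 0.00, .020, 0.00)
Q = (0.00, 0.00, 0.00, 0.00, .500, 0.00, .450, 0.00, .950)

q_new = rocchio(Q, [R1, R2], [S1])
```

The result agrees with the slide's feedback query to about three decimal places. Terms that were zero in the original query but occur in the relevant documents pick up small positive weights, which is the query expansion effect.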
2004.09.23 - SLIDE 54IS 202 – FALL 2004
Rocchio Method
• Rocchio automatically
– Re-weights terms
– Adds in new terms (from relevant docs)
  • Have to be careful when using negative terms
  • Rocchio is not a machine learning algorithm
• Most methods perform similarly
– Results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio
2004.09.23 - SLIDE 55IS 202 – FALL 2004
Probabilistic Relevance Feedback
Given a query term t

                          Document Relevance
                          +        -
Document Indexing   +     r        n-r        n
                    -     R-r      N-n-R+r    N-n
                          R        N-R        N

Where N is the number of documents seen
2004.09.23 - SLIDE 56IS 202 – FALL 2004
Robertson-Sparck Jones Weights
• Retrospective formulation
$$w_t^{new} = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$
2004.09.23 - SLIDE 57IS 202 – FALL 2004
Using Relevance Feedback
• Known to improve results
– In TREC-like conditions (no user involved)
• What about with a user in the loop?
– How might you measure this?
2004.09.23 - SLIDE 58IS 202 – FALL 2004
Relevance Feedback Summary
• Iterative query modification can improve precision and recall for a standing query
• In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them (Koenemann & Belkin)
2004.09.23 - SLIDE 59IS 202 – FALL 2004
Alternative Notions of Relevance Feedback
• Find people whose taste is “similar” to yours
– Will you like what they like?
• Follow a user’s actions in the background
– Can this be used to predict what the user will want to see next?
• Track what lots of people are doing
– Does this implicitly indicate what they think is good and not good?
2004.09.23 - SLIDE 60IS 202 – FALL 2004
Alternative Notions of Relevance Feedback
• Several different criteria to consider:
– Implicit vs. Explicit judgements
– Individual vs. Group judgements
– Standing vs. Dynamic topics
– Similarity of the items being judged vs. similarity of the judges themselves
2004.09.23 - SLIDE 61
Collaborative Filtering (Social Filtering)
• If Pam liked the paper, I’ll like the paper
• If you liked Star Wars, you’ll like Independence Day
• Rating based on ratings of similar people
– Ignores the text, so works on text, sound, pictures, etc.
– But: Initial users can bias ratings of future users

                   Sally  Bob  Chris  Lynn  Karen
Star Wars            7     7     3     4      7
Jurassic Park        6     4     7     4      4
Terminator II        3     4     7     6      3
Independence Day     7     7     2     2      ?
2004.09.23 - SLIDE 62
Ringo Collaborative Filtering
• Users rate musical artists from like to dislike
– 1 = detest, 7 = can’t live without, 4 = ambivalent
– There is a normal distribution around 4
– However, what matters are the extremes
• Nearest Neighbors Strategy: Find similar users and predict a (weighted) average of user ratings
• Pearson r algorithm: weight by degree of correlation between user U and user J
– 1 means very similar, 0 means no correlation, -1 dissimilar
– Works better to compare against the ambivalent rating (4), rather than the individual’s average score
$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \cdot \sum (J - \bar{J})^2}}$$
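As an illustration of the Pearson r approach, the sketch below uses the movie ratings from the collaborative filtering table to estimate Karen's missing rating for Independence Day. Following the note above, deviations are measured from the ambivalent rating 4 rather than from each user's mean. The prediction rule (a correlation-weighted average of the other users' deviations) is one common choice and an assumption here, not necessarily what Ringo used; all names are illustrative.

```python
import math

ratings = {
    "Sally": {"Star Wars": 7, "Jurassic Park": 6, "Terminator II": 3, "Independence Day": 7},
    "Bob":   {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 4, "Independence Day": 7},
    "Chris": {"Star Wars": 3, "Jurassic Park": 7, "Terminator II": 7, "Independence Day": 2},
    "Lynn":  {"Star Wars": 4, "Jurassic Park": 4, "Terminator II": 6, "Independence Day": 2},
    "Karen": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 3},
}
AMBIVALENT = 4  # midpoint of the 1-7 scale


def pearson(u, j, center=AMBIVALENT):
    """Correlation of two users over co-rated items, measured around `center`."""
    shared = [item for item in u if item in j]
    num = sum((u[i] - center) * (j[i] - center) for i in shared)
    den = math.sqrt(sum((u[i] - center) ** 2 for i in shared) *
                    sum((j[i] - center) ** 2 for i in shared))
    return num / den if den else 0.0


def predict(user, item):
    """Correlation-weighted average of other users' deviations from `center`."""
    pairs = [(pearson(ratings[user], other), other[item])
             for name, other in ratings.items()
             if name != user and item in other]
    total = sum(abs(r) for r, _ in pairs)
    if total == 0:
        return float(AMBIVALENT)
    return AMBIVALENT + sum(r * (x - AMBIVALENT) for r, x in pairs) / total


karen_prediction = predict("Karen", "Independence Day")
```

Karen correlates positively with Sally and Bob and negatively with Chris and Lynn, so both the high ratings from the first pair and the low ratings from the second pair push her predicted rating above the midpoint.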
2004.09.23 - SLIDE 63IS 202 – FALL 2004
Social Filtering
• Ignores the content, only looks at who judges things similarly
• Works well on data relating to “taste”
– something that people are good at predicting about each other too
• Does it work for topic?
– GroupLens results suggest otherwise (preliminary)
– Perhaps for quality assessments
– What about for assessing if a document is about a topic?
2004.09.23 - SLIDE 64IS 202 – FALL 2004
Summary
• Relevance feedback is an effective means for user-directed query modification
• Modification can be done with either direct or indirect user input
• Modification can be done based on an individual’s or a group’s past input
2004.09.23 - SLIDE 65IS 202 – FALL 2004
David Hong on Cheshire
• Cheshire II provided the paradigm of a fully standards-based IR system (SGML and Z39.50 Protocol). While there are both benefits and drawbacks to implementing standards-based technologies, what can other IR systems gain from being standards-compliant and how could this model make other IR systems more flexible?
• Cheshire II's interface allows users to specify conventional Boolean matching and probabilistic search. How would you infer this level of granularity in the form of a natural language query?
• What would be some of the potential benefits of doing feedback searching with multiple records in a large Internet search engine?
• What are the potential barriers in implementing this feature?
2004.09.23 - SLIDE 66IS 202 – FALL 2004
Next Time
• Information Retrieval Evaluation & more on collaborative filtering
• Readings for next time
– An Evaluation of Retrieval Effectiveness (Blair & Maron)
– Rave Reviews: Acquiring Relevance Assessments from Multiple Users (Belew)
– A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness (Koenemann & Belkin)
– Work Tasks and Socio-Cognitive Relevance: A Specific Example (Hjorland & Christensen)
– Social Information Filtering: Algorithms for Automating "Word of Mouth" (Shardanand & Maes)