1
Relevance Ranking
2
Introduction
In relational algebra, the response to a query is always an unordered set of qualifying tuples.
Keyword queries are not precise. Instead, rate each document for how likely it is to satisfy the user's information need, and present the results in a ranked list.
3
Relevance ranking
Recall and Precision
The Vector-Space Model
A broad class of ranking algorithms based on this model
Relevance Feedback and Rocchio's Method
Probabilistic Relevance Feedback Models
Advanced Issues
4
Measures for a search engine
How fast does it index? Number of documents/hour
How fast does it search? Latency as a function of index size
Expressiveness of query language: ability to express complex information needs
Uncluttered UI
Is it free?
5
Measures for a search engine
All of the preceding criteria are measurable. The key measure: user happiness.
What is this? Speed of response and size of index are factors, but blindingly fast, useless answers won't make a user happy.
We need a way of quantifying user happiness.
6
Happiness: elusive to measure
Most common proxy: relevance of search results
But how do you measure relevance? Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A (usually binary) assessment of either Relevant or Nonrelevant for each query and each document
There is some work on more-than-binary assessments, but binary is the standard.
7
Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it contains these words.
8
Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant.
9
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                 | Relevant            | Nonrelevant
   Retrieved     | tp (true positive)  | fp (false positive)
   Not Retrieved | fn (false negative) | tn (true negative)
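The two formulas above can be sketched directly from the contingency counts; the example counts below are made up purely for illustration.

```python
def precision(tp, fp):
    # Fraction of retrieved docs that are relevant: P = tp / (tp + fp)
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant docs that are retrieved: R = tp / (tp + fn)
    return tp / (tp + fn)

# Hypothetical counts: 30 relevant docs retrieved, 70 irrelevant retrieved,
# 10 relevant docs missed.
print(precision(30, 70))  # 0.3
print(recall(30, 10))     # 0.75
```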
10
Should we instead use the accuracy measure for evaluation?
Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant".
The accuracy of an engine: the fraction of these classifications that are correct.
Accuracy is a commonly used evaluation measure in machine learning classification work.
Why is this not a very useful evaluation measure in IR?
11
Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: return nothing. Almost every document is nonrelevant to any given query, so an engine that retrieves nothing is almost always "correct".
People doing information retrieval want to find something, and have a certain tolerance for junk.
Search for:
0 matching results found.
12
Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
In a good system, precision decreases as either the number of docs retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation.
13
Evaluating ranked results: Model
For each query q, an exhaustive set of relevant documents Dq ⊆ D is identified.
A query q is submitted to the query system. A ranked list of documents (d1, d2, …, dn) is returned.
Corresponding to the list, we can compute a 0/1 relevance list (r1, r2, …, rn): ri = 1 if di ∈ Dq, and ri = 0 if di ∉ Dq.
The recall at rank k ≥ 1 is defined as

    recall(k) = (1/|Dq|) · Σ_{i=1..k} r_i

The precision at rank k is defined as

    precision(k) = (1/k) · Σ_{i=1..k} r_i
14
Evaluating ranked results: Example
Dq = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Then the relevance list: (1, 1, 0, 1, 0, 1, 0, 0)
Recall(2) = (1/4)·(1+1) = 0.5; Recall(5) = (1/4)·(1+1+0+1+0) = 0.75; Recall(6) = (1/4)·(1+1+0+1+0+1) = 1
Precision(2) = (1/2)·(1+1) = 1; Precision(5) = (1/5)·(1+1+0+1+0) = 0.6; Precision(6) = (1/6)·(1+1+0+1+0+1) = 2/3
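A minimal sketch of recall@k and precision@k over a 0/1 relevance list, using the example's numbers:

```python
def precision_at(k, rel):
    # rel is the 0/1 relevance list (r1, ..., rn) of the ranked results
    return sum(rel[:k]) / k

def recall_at(k, rel, n_relevant):
    # n_relevant = |Dq|, the total number of relevant documents
    return sum(rel[:k]) / n_relevant

rel = [1, 1, 0, 1, 0, 1, 0, 0]   # relevance list from the example
print(recall_at(2, rel, 4))      # 0.5
print(precision_at(5, rel))      # 0.6
print(recall_at(6, rel, 4))      # 1.0
```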
15
Another figure of merit: Average precision
The sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents:

    average precision = (1/|Dq|) · Σ_{k=1..|D|} r_k · precision(k)

The average precision is 1 only if the engine retrieves all relevant documents and ranks them ahead of any irrelevant document.
16
Average precision
Example: Dq = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Relevance list: (1, 1, 0, 1, 0, 1, 0, 0)
Average precision = (1/4)·(1 + 1 + 3/4 + 4/6) ≈ 0.854
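A short sketch of the same computation: accumulate precision at each relevant rank and divide by |Dq|.

```python
def average_precision(rel, n_relevant):
    # Sum precision@k at each relevant rank k, then divide by |Dq|
    score, hits = 0.0, 0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / k        # precision at this relevant rank
    return score / n_relevant

rel = [1, 1, 0, 1, 0, 1, 0, 0]
print(average_precision(rel, 4))     # (1 + 1 + 3/4 + 4/6) / 4 ≈ 0.854
```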
17
Evaluation at large search engines
For a large corpus in rapid flux, such as the web, it is impossible to determine Dq, so recall is difficult to measure.
Search engines often use precision at top k, e.g., k = 10,
. . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.
18
Evaluation at large search engines
Search engines also use non-relevance-based measures, e.g., clickthrough on first result.
Not very reliable if you look at a single clickthrough, but pretty reliable in the aggregate.
A/B testing
19
A/B testing
Purpose: Test a single innovation.
Prerequisite: You have a large search engine up and running.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" measure like clickthrough on first result.
Now we can directly see whether the innovation improves user happiness.
This is probably the evaluation methodology that large search engines trust most.
20
Vector space model
21
Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
It takes skill to come up with a query that produces a manageable number of hits.
With a ranked list of documents it does not matter how large the retrieved set is.
22
Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document.
This score measures how well document and query "match".
23
Term-document count matrices
Consider the number of occurrences of a term in a document: each document is a count vector in ℕ^V, a column below.

              Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
   Antony     157                   73             0            0       0        0
   Brutus     4                     157            0            1       0        0
   Caesar     232                   227            0            2       1        1
   Calpurnia  0                     10             0            0       0        0
   Cleopatra  57                    0              0            0       0        0
   mercy      2                     0              3            5       5        1
   worser     2                     0              1            1       1        0
24
The vector space model
Documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term. The coordinate of document d in the direction corresponding to term t is determined by two quantities: term frequency and inverse document frequency.
25
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
But not 10 times more relevant. Relevance does not increase proportionally with term frequency.
26
Term frequency tf
Normalization is needed! There are many normalization methods.
Normalize using the sum of term counts:

    tf_{t,d} = n(d,t) / Σ_τ n(d,τ)

Log frequency weight:

    tf_{t,d} = 1 + log10 n(d,t)   if n(d,t) > 0
             = 0                  otherwise

The Cornell SMART system uses the following equation to normalize tf:

    tf_{t,d} = 1 + log(1 + log(n(d,t)))   if n(d,t) > 0
             = 0                          otherwise
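The two damped weights above can be sketched as follows; base-10 logs are assumed throughout, which matches the numbers in the worked example later in the deck.

```python
import math

def tf_log(n):
    # Log frequency weight: 1 + log10 n(d,t) if n(d,t) > 0, else 0
    return 1 + math.log10(n) if n > 0 else 0.0

def tf_smart(n):
    # SMART (Cornell) weight: 1 + log(1 + log(n(d,t))) if n(d,t) > 0, else 0
    return 1 + math.log10(1 + math.log10(n)) if n > 0 else 0.0

print(tf_log(10))     # 2.0
print(tf_smart(100))  # 1 + log10(1 + 2) ≈ 1.477
```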
27
Document frequency
Rare terms are more informative than frequent terms (recall stop words).
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.
28
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example: which word is a better search term (and should get a higher weight)?

   Word       Collection frequency  Document frequency
   insurance  10440                 3997
   try        10422                 8760
29
Document frequency
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency (df) to capture this in the score.
df (≤ N) is the number of documents that contain the term.
30
Inverse document frequency
df_t is the document frequency of t: the number of documents that contain t. df is a measure of the informativeness of t.
We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
Again, the idf used by the SMART system:

    idf_t = log((1 + N) / df_t)
31
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = tf_{t,d} × idf_t.
Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
32
example
Raw term counts (collection size N = 9):

        t1   t2   t3
   d1   100  10   0
   d2   0    5    10
   d3   0    10   0
   d4   0    0    100
   d5   0    0    10
   d6   0    0    8
   d7   0    0    15
   d8   0    7    10
   d9   0    100  0

idf_t1 = log((1+9)/1) = 1
idf_t2 = log((1+9)/5) = 0.301
idf_t3 = log((1+9)/6) = 0.222
tf_{t1,d1} = 1 + log(1 + log 100) = 1.477
tf_{t2,d1} = 1 + log(1 + log 10) = 1.301
Therefore, d1 = (1.477 × 1, 1.301 × 0.301, 0 × 0.222)
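The example can be reproduced end to end with the SMART tf and idf variants used on these slides (base-10 logs, idf = log((1+N)/df)):

```python
import math

# Raw counts from the example (rows: documents d1..d9, columns: t1, t2, t3)
counts = {
    "d1": (100, 10, 0), "d2": (0, 5, 10), "d3": (0, 10, 0),
    "d4": (0, 0, 100),  "d5": (0, 0, 10), "d6": (0, 0, 8),
    "d7": (0, 0, 15),   "d8": (0, 7, 10), "d9": (0, 100, 0),
}
N = len(counts)                                    # collection size = 9

def smart_tf(n):
    # SMART tf: 1 + log(1 + log n) for n > 0, else 0
    return 1 + math.log10(1 + math.log10(n)) if n > 0 else 0.0

def smart_idf(df):
    # idf variant used in the example: log((1 + N) / df)
    return math.log10((1 + N) / df)

df = [sum(1 for c in counts.values() if c[i] > 0) for i in range(3)]
idf = [smart_idf(d) for d in df]                   # ≈ [1, 0.301, 0.222]

d1 = [smart_tf(n) * w for n, w in zip(counts["d1"], idf)]
print([round(x, 3) for x in d1])                   # [1.477, 0.392, 0.0]
```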
33
cosine(query, document)

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1..V} q_i d_i / ( sqrt(Σ_{i=1..V} q_i²) · sqrt(Σ_{i=1..V} d_i²) )

The numerator is the dot product; dividing by the vector lengths turns q and d into unit vectors.
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document. cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
This is the proximity measure between query and documents.
34
Summary – vector space ranking
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user
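The steps above can be sketched as a small ranking function; the document vectors are made-up tf-idf weights purely for illustration.

```python
import math

def cosine(q, d):
    # cos(q, d) = q . d / (|q| |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs, k=10):
    # Score every document vector against the query and return the top k
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return scored[:k]

# Toy tf-idf vectors over a 3-term vocabulary (illustrative values only)
docs = {"d1": [1.0, 0.5, 0.0], "d2": [0.0, 0.2, 0.9], "d3": [0.3, 0.3, 0.3]}
top = rank([1.0, 0.0, 0.0], docs, k=2)
print([name for name, _ in top])   # ['d1', 'd3']
```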
35
Relevance Feedback
The initial response from a search engine may not satisfy the user's information need. The average Web query is only two words long, and users can rarely express their information need within two words.
However, if the response list has at least some relevant documents, sophisticated users can learn how to modify their queries by adding or negating additional keywords.
Relevance feedback automates this query refinement process.
36
Relevance Feedback: basic idea
Relevance feedback: user feedback on the relevance of docs in the initial set of results.
User issues a (short, simple) query.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.
Relevance feedback can go through one or more iterations.
37
Similar pages
38
Relevance Feedback: Example
Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
39
Results for Initial Query
40
Relevance Feedback
41
Results after Relevance Feedback
42
Key concept: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional space.
Definition:

    μ(C) = (1/|C|) · Σ_{d ∈ C} d

where C is a set of documents.
43
Rocchio Algorithm
The Rocchio algorithm uses the vector space model to pick a relevance-feedback query.
Rocchio seeks the query q_opt that maximizes the separation between docs marked relevant and non-relevant:

    q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

which leads to

    q_opt = (1/|C_r|) · Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) · Σ_{d_j ∈ C_nr} d_j
44
The Theoretically Best Query
[Figure: documents plotted in vector space; x = non-relevant documents, o = relevant documents. The optimal query points toward the cluster of relevant documents and away from the non-relevant ones.]
45
Rocchio 1971 Algorithm (SMART)
Used in practice:

    q_m = α q_0 + β (1/|D_r|) · Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) · Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors (different from C_r and C_nr).
q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically).
The new query moves toward relevant documents and away from irrelevant documents.
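A minimal sketch of the SMART update, with vectors as plain Python lists; the default α, β, γ values are common textbook choices, not prescribed by the slides.

```python
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr),
    # with negative term weights clipped to 0 as the slides suggest.
    dims = len(q0)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    cr, cnr = centroid(rel_docs), centroid(nonrel_docs)
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dims)]
    return [max(0.0, w) for w in qm]   # ignore negative term weights

q0 = [1.0, 0.0, 0.0]
print(rocchio(q0, [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]))  # [1.0, 0.75, 0.0]
```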
46
Relevance feedback on initial query
[Figure: x = known non-relevant documents, o = known relevant documents. The revised query moves from the initial query toward the cluster of known relevant documents.]
47
Subtleties to note
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want higher β and γ.
γ is commonly set to zero:
It's harder for the user to give negative feedback.
It's also harder to use, since relevant documents often form a tight cluster, but non-relevant documents rarely do.
Some weights in the query vector can go negative; negative term weights are ignored (set to 0).
48
Relevance Feedback in vector spaces
We can modify the query based on relevance feedback and apply standard vector space model.
Relevance feedback can improve recall and precision.
Relevance feedback is most useful for increasing recall in situations where recall is important: users can be expected to review results and to take time to iterate.
49
Evaluation of relevance feedback strategies
Use q_0 and compute a precision-recall graph.
Use q_m and compute a precision-recall graph.
Assessing on all documents in the collection gives spectacular improvements, but it's cheating! The gain is partly due to the known relevant documents being ranked higher. We must evaluate with respect to documents not seen by the user.
Use documents in the residual collection (the set of documents minus those assessed relevant).
Measures are usually then lower than for the original query, but this is a more realistic evaluation, and relative performance can be validly compared.
50
Evaluation of relevance feedback Most satisfactory – use two collections each
with their own relevance assessments q0 and user feedback from first collection
qm run on second collection and measured
Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.
51
Evaluation: Caveat True evaluation of usefulness must compare
to other methods taking the same amount of time.
Alternative to relevance feedback: User revises and resubmits query.
Users may prefer revision/resubmission to having to judge relevance of documents.
There is no clear evidence that relevance feedback is the “best use” of the user’s time.
52
Pseudo relevance feedback
Pseudo-relevance feedback automates the "manual" part of true relevance feedback.
Pseudo-relevance algorithm:
Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).
Works very well on average, but can go horribly wrong for some queries.
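The loop above can be sketched as a Rocchio update with γ = 0, treating the top k ranked vectors as the relevant set; the vectors and the β value here are illustrative assumptions.

```python
def pseudo_feedback(query_vec, ranked_doc_vecs, k=3, alpha=1.0, beta=0.75):
    # Assume the top k retrieved documents are relevant and move the
    # query toward their centroid (Rocchio with gamma = 0).
    top = ranked_doc_vecs[:k]
    dims = len(query_vec)
    centroid = [sum(d[i] for d in top) / len(top) for i in range(dims)]
    return [alpha * query_vec[i] + beta * centroid[i] for i in range(dims)]

# Toy ranked results (best first); the expanded query gains weight on term 2
expanded = pseudo_feedback([1.0, 0.0],
                           [[1.0, 0.4], [0.8, 0.6], [0.9, 0.2], [0.0, 0.9]],
                           k=3)
print(expanded)   # ≈ [1.675, 0.3]
```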
53
Relevance Feedback on the Web
Some search engines offer a similar/related pages feature (a trivial form of relevance feedback): Google (link-based), Stanford WebBase.
But some don't, because it's hard to explain to the average user: Alltheweb, Yahoo.
Explicit relevance feedback is not a commonly available feature on Web search engines: users are not patient enough to give their feedback to the system!
54
Probabilistic Relevance Feedback Models
The vector space model is operational:
It gives a precise recipe for computing the relevance of a document with regard to the query.
It does not attempt to justify why relevance should be defined that way.
Probabilistic models can be used to understand the behavior of practical IR systems.
55
Probabilistic Relevance Feedback Models
Let R be a Boolean random variable that represents the relevance of document d with regard to query q.
A reasonable order for ranking documents is their odds ratio for relevance:

    odds(R | d, q) = P(R = 1 | d, q) / P(R = 0 | d, q)
56
Probabilistic Relevance Feedback Models
We assume that term occurrences are independent given the query and the value of R.
Let {t} be the universe of terms, and let x_{d,t} ∈ {0,1} reflect whether term t appears in document d or not.
We get

    odds(R | d, q) = ( P(R = 1 | q) / P(R = 0 | q) ) · Π_t P(x_{d,t} | R = 1, q) / P(x_{d,t} | R = 0, q)
57
Probabilistic Relevance Feedback Models
58
Advanced Issues
Spamming
In classical IR corpora, document authors were largely truthful about the purpose of the document: terms found in the document were broadly indicative of content.
However, economic reality on the web and the need to capture eyeballs have led to quite a different culture.
59
Spamming
Spammers: those who add popular query terms to pages unrelated to those terms.
For example, adding the terms "Hawaii vacation rental" to a page while making the font color the same as the background color.
60
Spamming
In the early days of the Web:
Search engines used many clues, such as font color and position, to judge a page.
They guarded these secrets zealously, for knowing them would enable spammers to beat the system again easily enough.
With the invention of hyperlink-based ranking techniques, spammers went through a setback phase: the number and popularity of sites that cited a page started to matter quite strongly in determining the rank of that page in response to a query.
61
Advanced Issues
Titles, headings, metatags and anchor text
The standard tf-idf framework treats all terms uniformly. On the Web, valuable information may be lost this way!
Different idioms warrant different weights:
Titles: <title>…</title>
Headings: <h2>…</h2>
Font modifiers: <strong>…</strong>, <font …>…</font>
Most search engines respond to these different idioms by assigning different weights.
62
Titles, headings, metatags and anchor text
Consider the Web's rich hyperlink structure:
Succinct descriptions of the content of a page v may often be found in the text of pages u that link to v.
The text in and near the "anchor" construct may be especially significant.
In the World Wide Web Worm system, McBryan built an index where a page v that had not been fetched by the crawler would be indexed using anchor text on pages u that link to v. Lycos and Google adopted the same approach, increasing their reach by almost a factor of two.
63
Titles, headings, metatags and anchor text
(Near-)anchor text on u offers valuable editorial judgment about v as well.
Agreement among many authors of pages like u about what terms to use as (near-)anchor text is valuable in fighting spam and returning better-ranked responses.
64
Advanced Issues
Ranking for complex queries including phrases
Many search systems permit the query to contain single words, phrases, and word inclusions and exclusions.
Explicit inclusions and exclusions are hard Boolean predicates: all responses must satisfy them.
With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space.
65
Ranking for complex queries including phrases
Basic approaches:
A positional index
Construct two separate indices: one for single terms and the other for phrases
Regard the phrases as new dimensions added to the vector space
The problem is how to define a "phrase":
Manually
Biword indexes
Derived from the corpus itself using statistical techniques
66
Biword indexes
Index every consecutive pair of terms in the text as a phrase.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query-processing is now immediate.
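A minimal sketch of biword indexing, assuming simple whitespace tokenization with punctuation stripped:

```python
from collections import defaultdict

def index_biwords(docs):
    # Map each consecutive word pair (biword) to the set of doc ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().replace(",", "").split()
        for w1, w2 in zip(words, words[1:]):
            index[f"{w1} {w2}"].add(doc_id)
    return index

idx = index_biwords({1: "Friends, Romans, Countrymen"})
print(sorted(idx))   # ['friends romans', 'romans countrymen']
```

A two-word phrase query is then a single dictionary lookup in this index.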
67
Longer phrase queries
Longer phrases are divided into several biwords: "stanford university palo alto" can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
Without the docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase.
Can have false positives!
68
Extended biwords
Parse the indexed text and perform part-of-speech tagging (POST).
Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
Now deem any string of terms of the form N X* N to be an extended biword. Each such extended biword is now made a term in the dictionary.
Example: catcher in the rye → N X X N
Query processing: parse the query into N's and X's, segment it into enhanced biwords, and look them up in the index.
69
Statistical techniques
To decide whether "t1 t2" occurs more often as a phrase than the individual rates of occurrence of t1 and t2 would suggest,
a likelihood ratio test can be used.
Null hypothesis: t1 and t2 are independent.
Alternative hypothesis: t1 and t2 are dependent.
70
Likelihood ratio test
Ω is the entire parameter space; Ω0 is the parameter space corresponding to the null hypothesis; Ω0 ⊆ Ω implies λ ≤ 1:

    λ = max_{p ∈ Ω0} H(p; k) / max_{p ∈ Ω} H(p; k)

It is known that −2 log λ is asymptotically χ²-distributed with degrees of freedom equal to the difference in the dimensions of Ω0 and Ω.
If occurrences of t1 and t2 are independent, λ is almost 1, and −2 log λ is almost 0. The larger the value of −2 log λ, the stronger the dependence.
71
Likelihood ratio test
In the Bernoulli trial setting, let k_{xy} count the text windows in which t1 does (x = 1) or does not (x = 0) occur and t2 does (y = 1) or does not (y = 0) occur.
For the null hypothesis (independence, with marginal probabilities p1 and p2):

    H(p1, p2; k00, k01, k10, k11) = ((1−p1)(1−p2))^k00 · ((1−p1)p2)^k01 · (p1(1−p2))^k10 · (p1 p2)^k11

For the alternative hypothesis (one free parameter per cell):

    H(p00, p01, p10, p11; k00, k01, k10, k11) = p00^k00 · p01^k01 · p10^k10 · p11^k11
72
Likelihood ratio test
The maximum likelihood estimators:
For the null hypothesis:
p1 = (k10 + k11) / (k00 + k01 + k10 + k11)
p2 = (k01 + k11) / (k00 + k01 + k10 + k11)
For the alternative hypothesis:
p00 = k00 / (k00 + k01 + k10 + k11), and the other parameters are obtained similarly.
73
Likelihood ratio test
We can sort the entries by −2 log λ and compare the top rankers against a χ² table to note the confidence with which the null hypothesis is violated.
Dunning reports the following top candidates from some common English text:

   −2 log λ   phrase
   271        the swiss
   264        can be
   257        previous year
   167        mineral water
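Putting the MLEs and the two hypotheses together, −2 log λ can be sketched as below; the contingency counts in the example calls are hypothetical, chosen only to contrast an independent pair with a strongly co-occurring one.

```python
import math

def llr(k00, k01, k10, k11):
    # -2 log(lambda) for the independence test of a term pair (t1, t2).
    # k10 + k11 = windows containing t1; k01 + k11 = windows containing t2.
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n                  # MLE of P(t1 occurs) under H0
    p2 = (k01 + k11) / n                  # MLE of P(t2 occurs) under H0
    def term(k, p):
        return k * math.log(p) if k > 0 else 0.0
    log_null = (term(k00, (1 - p1) * (1 - p2)) + term(k01, (1 - p1) * p2) +
                term(k10, p1 * (1 - p2)) + term(k11, p1 * p2))
    log_alt = sum(term(k, k / n) for k in (k00, k01, k10, k11))  # MLE per cell
    return 2 * (log_alt - log_null)

# Perfectly independent counts give ~0; strong co-occurrence gives a large value
print(llr(8100, 900, 900, 100))   # ~0
print(llr(9000, 500, 400, 100))   # large positive value
```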
74
Advanced Issues
Approximate string matching
Even though a large fraction of web pages are written in English, many other languages are also used.
Dialects of English, or transliteration from other languages, may make many word spellings nonuniform.
An exact character-by-character match between the query terms entered by the user and the keys in the inverted index may miss relevant matches!
75
Approximate string matching
Two ways to reduce the problem:
Soundex: use an aggressive conflation mechanism that collapses variant spellings into the same token.
E.g., bradley, bartle, bartlett and brodley all share the soundex code B634.
n-grams: decompose terms into sequences of n-grams.
76
Soundex
A class of heuristics to expand a query into phonetic equivalents.
Language specific – mainly for names. E.g., Herman ≈ Hermann.
Invented for the U.S. census … in 1918.
77
Soundex – typical algorithm
Turn every token to be indexed into a 4-character reduced form.
Do the same with query terms.
Build and search an index on the reduced forms (when the query calls for a soundex match).
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
78
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   B, F, P, V → 1
   C, G, J, K, Q, S, X, Z → 2
   D, T → 3
   L → 4
   M, N → 5
   R → 6
79
Soundex continued
4. Remove all pairs of consecutive digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.Will hermann generate the same code?
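The six steps transcribe directly into code (assuming plain alphabetic input), and running it answers the question: hermann does generate the same code.

```python
def soundex(term):
    # Steps follow the slides: keep the first letter, map letters to digits,
    # collapse pairs of identical consecutive digits, drop zeros, pad to 4.
    term = term.upper()
    code_of = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in letters:
            code_of[c] = digit
    digits = [code_of[c] for c in term if c in code_of]
    collapsed = []
    for d in digits:                      # step 4: remove consecutive duplicates
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    rest = [d for d in collapsed[1:] if d != "0"]   # step 5: remove zeros
    return (term[0] + "".join(rest) + "000")[:4]    # step 6: pad and truncate

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 -- the same code
print(soundex("bradley"), soundex("bartle"), soundex("bartlett"))  # B634 B634 B634
```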
80
Soundex
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …).
How useful is soundex? Not very – for information retrieval.
Okay for "high recall" tasks (e.g., Interpol), though biased toward names of certain nationalities.
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR.
81
n-gram
Enumerate all the n-grams in the query string as well as in the lexicon.
Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams.
Threshold by the number of matching n-grams.
82
Example with trigrams
Suppose the text is november: trigrams are nov, ove, vem, emb, mbe, ber.
The query is december: trigrams are dec, ece, cem, emb, mbe, ber.
So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?
83
One option – Jaccard coefficient
A commonly-used measure of overlap. Let X and Y be two sets; then the Jaccard coefficient is

    J(X, Y) = |X ∩ Y| / |X ∪ Y|

Equals 1 when X and Y have the same elements and zero when they are disjoint.
X and Y don't have to be of the same size; the coefficient always assigns a number between 0 and 1.
Now threshold to decide if you have a match: e.g., if J(X, Y) > 0.8, declare a match.
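The november/december example works out as follows; the trigram overlap of 3 out of a union of 9 gives a Jaccard coefficient of 1/3.

```python
def ngrams(term, n=3):
    # Set of all character n-grams of the term
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    # |X intersect Y| / |X union Y|: 1 when identical, 0 when disjoint
    return len(x & y) / len(x | y)

t1, t2 = ngrams("november"), ngrams("december")
print(sorted(t1 & t2))             # ['ber', 'emb', 'mbe']
print(round(jaccard(t1, t2), 3))   # 0.333
```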
84
N-gram
Looking up the inverted index now becomes a two-stage affair:
First, an index of n-grams is consulted to expand each query term into a set of slightly distorted query terms.
Then, all these terms are submitted to the regular index.