1
Relevance Ranking
2
Introduction
In relational algebra, the response to a query is always an unordered set of qualifying tuples.
Keyword queries are not precise. Instead, rate each document for how likely it is to satisfy the user's information need, and present the results in a ranked list.
3
Relevance ranking
Recall and Precision
The Vector-Space Model
A broad class of ranking algorithms based on this model
Relevance Feedback and Rocchio's Method
Probabilistic Relevance Feedback Models
Advanced Issues
4
Measures for a search engine
How fast does it index? Number of documents/hour
How fast does it search? Latency as a function of index size
Expressiveness of query language: ability to express complex information needs
Uncluttered UI
Is it free?
5
Measures for a search engine
All of the preceding criteria are measurable. The key measure: user happiness.
What is this? Speed of response and size of index are factors, but blindingly fast, useless answers won't make a user happy.
We need a way of quantifying user happiness.
6
Happiness: elusive to measure
Most common proxy: relevance of search results
But how do you measure relevance? Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A (usually binary) assessment of either Relevant or Nonrelevant for each query and each document
There is some work on more-than-binary assessments, but binary is the standard.
7
Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it contains these words.
8
Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant.
9
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                 | Relevant            | Nonrelevant
   Retrieved     | tp (true positive)  | fp (false positive)
   Not Retrieved | fn (false negative) | tn (true negative)
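The two formulas above can be sketched directly from the contingency counts; the example counts below are made up purely for illustration.

```python
def precision(tp, fp):
    # Fraction of retrieved docs that are relevant: P = tp / (tp + fp)
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant docs that are retrieved: R = tp / (tp + fn)
    return tp / (tp + fn)

# Hypothetical counts: 30 relevant docs retrieved, 70 irrelevant retrieved,
# 10 relevant docs missed.
print(precision(30, 70))  # 0.3
print(recall(30, 10))     # 0.75
```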
10
Should we instead use the accuracy measure for evaluation?
Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant".
The accuracy of an engine: the fraction of these classifications that are correct.
Accuracy is a commonly used evaluation measure in machine learning classification work.
Why is this not a very useful evaluation measure in IR?
11
Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: return nothing. Almost every document is nonrelevant to any given query, so an engine that retrieves nothing is almost always "correct".
People doing information retrieval want to find something, and have a certain tolerance for junk.
Search for:
0 matching results found.
12
Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
In a good system, precision decreases as either the number of docs retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation.
13
Evaluating ranked results: Model
For each query q, an exhaustive set of relevant documents Dq ⊆ D is identified.
A query q is submitted to the query system. A ranked list of documents (d1, d2, …, dn) is returned.
Corresponding to the list, we can compute a 0/1 relevance list (r1, r2, …, rn): ri = 1 if di ∈ Dq, and ri = 0 if di ∉ Dq.
The recall at rank k ≥ 1 is defined as

    recall(k) = (1/|Dq|) · Σ_{i=1..k} r_i

The precision at rank k is defined as

    precision(k) = (1/k) · Σ_{i=1..k} r_i
14
Evaluating ranked results: Example
Dq = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Then the relevance list: (1, 1, 0, 1, 0, 1, 0, 0)
Recall(2) = (1/4)·(1+1) = 0.5; Recall(5) = (1/4)·(1+1+0+1+0) = 0.75; Recall(6) = (1/4)·(1+1+0+1+0+1) = 1
Precision(2) = (1/2)·(1+1) = 1; Precision(5) = (1/5)·(1+1+0+1+0) = 0.6; Precision(6) = (1/6)·(1+1+0+1+0+1) = 2/3
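A minimal sketch of recall@k and precision@k over a 0/1 relevance list, using the example's numbers:

```python
def precision_at(k, rel):
    # rel is the 0/1 relevance list (r1, ..., rn) of the ranked results
    return sum(rel[:k]) / k

def recall_at(k, rel, n_relevant):
    # n_relevant = |Dq|, the total number of relevant documents
    return sum(rel[:k]) / n_relevant

rel = [1, 1, 0, 1, 0, 1, 0, 0]   # relevance list from the example
print(recall_at(2, rel, 4))      # 0.5
print(precision_at(5, rel))      # 0.6
print(recall_at(6, rel, 4))      # 1.0
```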
15
Another figure of merit: Average precision
The sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents:

    average precision = (1/|Dq|) · Σ_{k=1..|D|} r_k · precision(k)

The average precision is 1 only if the engine retrieves all relevant documents and ranks them ahead of any irrelevant document.
16
Average precision
Example: Dq = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Relevance list: (1, 1, 0, 1, 0, 1, 0, 0)
Average precision = (1/4)·(1 + 1 + 3/4 + 4/6) ≈ 0.854
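A short sketch of the same computation: accumulate precision at each relevant rank and divide by |Dq|.

```python
def average_precision(rel, n_relevant):
    # Sum precision@k at each relevant rank k, then divide by |Dq|
    score, hits = 0.0, 0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / k        # precision at this relevant rank
    return score / n_relevant

rel = [1, 1, 0, 1, 0, 1, 0, 0]
print(average_precision(rel, 4))     # (1 + 1 + 3/4 + 4/6) / 4 ≈ 0.854
```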
17
Evaluation at large search engines
For a large corpus in rapid flux, such as the web, it is impossible to determine Dq, so recall is difficult to measure.
Search engines often use precision at top k, e.g., k = 10,
. . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.
18
Evaluation at large search engines
Search engines also use non-relevance-based measures, e.g., clickthrough on first result.
Not very reliable if you look at a single clickthrough, but pretty reliable in the aggregate.
A/B testing
19
A/B testing
Purpose: Test a single innovation.
Prerequisite: You have a large search engine up and running.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" measure like clickthrough on first result.
Now we can directly see whether the innovation improves user happiness.
This is probably the evaluation methodology that large search engines trust most.
20
Vector space model
21
Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
It takes skill to come up with a query that produces a manageable number of hits.
With a ranked list of documents it does not matter how large the retrieved set is.
22
Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score – say in [0, 1] – to each document.
This score measures how well document and query "match".
23
Term-document count matrices
Consider the number of occurrences of a term in a document: each document is a count vector in ℕ^V, a column below.

              Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
   Antony     157                   73             0            0       0        0
   Brutus     4                     157            0            1       0        0
   Caesar     232                   227            0            2       1        1
   Calpurnia  0                     10             0            0       0        0
   Cleopatra  57                    0              0            0       0        0
   mercy      2                     0              3            5       5        1
   worser     2                     0              1            1       1        0
24
The vector space model
Documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term. The coordinate of document d in the direction corresponding to term t is determined by two quantities: term frequency and inverse document frequency.
25
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
But not 10 times more relevant. Relevance does not increase proportionally with term frequency.
26
Term frequency tf
Normalization is needed! There are many normalization methods.
Normalize using the sum of term counts:

    tf_{t,d} = n(d,t) / Σ_τ n(d,τ)

Log frequency weight:

    tf_{t,d} = 1 + log10 n(d,t)   if n(d,t) > 0
             = 0                  otherwise

The Cornell SMART system uses the following equation to normalize tf:

    tf_{t,d} = 1 + log(1 + log(n(d,t)))   if n(d,t) > 0
             = 0                          otherwise
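The two damped weights above can be sketched as follows; base-10 logs are assumed throughout, which matches the numbers in the worked example later in the deck.

```python
import math

def tf_log(n):
    # Log frequency weight: 1 + log10 n(d,t) if n(d,t) > 0, else 0
    return 1 + math.log10(n) if n > 0 else 0.0

def tf_smart(n):
    # SMART (Cornell) weight: 1 + log(1 + log(n(d,t))) if n(d,t) > 0, else 0
    return 1 + math.log10(1 + math.log10(n)) if n > 0 else 0.0

print(tf_log(10))     # 2.0
print(tf_smart(100))  # 1 + log10(1 + 2) ≈ 1.477
```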
27
Document frequency
Rare terms are more informative than frequent terms (recall stop words).
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.
28
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example: which word is a better search term (and should get a higher weight)?

   Word       Collection frequency  Document frequency
   insurance  10440                 3997
   try        10422                 8760
29
Document frequency
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency (df) to capture this in the score.
df (≤ N) is the number of documents that contain the term.
30
Inverse document frequency
df_t is the document frequency of t: the number of documents that contain t. df is a measure of the informativeness of t.
We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
Again, the idf used by the SMART system:

    idf_t = log((1 + N) / df_t)
31
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = tf_{t,d} × idf_t.
Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
32
example
Raw term counts (collection size N = 9):

        t1   t2   t3
   d1   100  10   0
   d2   0    5    10
   d3   0    10   0
   d4   0    0    100
   d5   0    0    10
   d6   0    0    8
   d7   0    0    15
   d8   0    7    10
   d9   0    100  0

idf_t1 = log((1+9)/1) = 1
idf_t2 = log((1+9)/5) = 0.301
idf_t3 = log((1+9)/6) = 0.222
tf_{t1,d1} = 1 + log(1 + log 100) = 1.477
tf_{t2,d1} = 1 + log(1 + log 10) = 1.301
Therefore, d1 = (1.477 × 1, 1.301 × 0.301, 0 × 0.222)
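The example can be reproduced end to end with the SMART tf and idf variants used on these slides (base-10 logs, idf = log((1+N)/df)):

```python
import math

# Raw counts from the example (rows: documents d1..d9, columns: t1, t2, t3)
counts = {
    "d1": (100, 10, 0), "d2": (0, 5, 10), "d3": (0, 10, 0),
    "d4": (0, 0, 100),  "d5": (0, 0, 10), "d6": (0, 0, 8),
    "d7": (0, 0, 15),   "d8": (0, 7, 10), "d9": (0, 100, 0),
}
N = len(counts)                                    # collection size = 9

def smart_tf(n):
    # SMART tf: 1 + log(1 + log n) for n > 0, else 0
    return 1 + math.log10(1 + math.log10(n)) if n > 0 else 0.0

def smart_idf(df):
    # idf variant used in the example: log((1 + N) / df)
    return math.log10((1 + N) / df)

df = [sum(1 for c in counts.values() if c[i] > 0) for i in range(3)]
idf = [smart_idf(d) for d in df]                   # ≈ [1, 0.301, 0.222]

d1 = [smart_tf(n) * w for n, w in zip(counts["d1"], idf)]
print([round(x, 3) for x in d1])                   # [1.477, 0.392, 0.0]
```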
33
cosine(query, document)

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1..V} q_i d_i / ( sqrt(Σ_{i=1..V} q_i²) · sqrt(Σ_{i=1..V} d_i²) )

The numerator is the dot product; dividing by the vector lengths turns q and d into unit vectors.
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document. cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
This is the proximity measure between query and documents.
34
Summary – vector space ranking
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user
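The steps above can be sketched as a small ranking function; the document vectors are made-up tf-idf weights purely for illustration.

```python
import math

def cosine(q, d):
    # cos(q, d) = q . d / (|q| |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs, k=10):
    # Score every document vector against the query and return the top k
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return scored[:k]

# Toy tf-idf vectors over a 3-term vocabulary (illustrative values only)
docs = {"d1": [1.0, 0.5, 0.0], "d2": [0.0, 0.2, 0.9], "d3": [0.3, 0.3, 0.3]}
top = rank([1.0, 0.0, 0.0], docs, k=2)
print([name for name, _ in top])   # ['d1', 'd3']
```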
35
Relevance Feedback
The initial response from a search engine may not satisfy the user's information need. The average Web query is only two words long, and users can rarely express their information need within two words.
However, if the response list has at least some relevant documents, sophisticated users can learn how to modify their queries by adding or negating additional keywords.
Relevance feedback automates this query refinement process.
36
Relevance Feedback: basic idea
Relevance feedback: user feedback on the relevance of docs in the initial set of results.
User issues a (short, simple) query.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.
Relevance feedback can go through one or more iterations.
37
Similar pages
38
Relevance Feedback: Example
Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
39
Results for Initial Query
40
Relevance Feedback
41
Results after Relevance Feedback
42
Key concept: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional space.
Definition:

    μ(C) = (1/|C|) · Σ_{d ∈ C} d

where C is a set of documents.
43
Rocchio Algorithm
The Rocchio algorithm uses the vector space model to pick a relevance-feedback query.
Rocchio seeks the query q_opt that maximizes the separation between docs marked relevant and non-relevant:

    q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

which leads to

    q_opt = (1/|C_r|) · Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) · Σ_{d_j ∈ C_nr} d_j
44
The Theoretically Best Query
[Figure: documents plotted in vector space; x = non-relevant documents, o = relevant documents. The optimal query points toward the cluster of relevant documents and away from the non-relevant ones.]
45
Rocchio 1971 Algorithm (SMART)
Used in practice:

    q_m = α q_0 + β (1/|D_r|) · Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) · Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors (different from C_r and C_nr).
q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically).
The new query moves toward relevant documents and away from irrelevant documents.
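A minimal sketch of the SMART update, with vectors as plain Python lists; the default α, β, γ values are common textbook choices, not prescribed by the slides.

```python
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr),
    # with negative term weights clipped to 0 as the slides suggest.
    dims = len(q0)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    cr, cnr = centroid(rel_docs), centroid(nonrel_docs)
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dims)]
    return [max(0.0, w) for w in qm]   # ignore negative term weights

q0 = [1.0, 0.0, 0.0]
print(rocchio(q0, [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]))  # [1.0, 0.75, 0.0]
```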
46
Relevance feedback on initial query
[Figure: x = known non-relevant documents, o = known relevant documents. The revised query moves from the initial query toward the cluster of known relevant documents.]
47
Subtleties to note
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want higher β and γ.
γ is commonly set to zero:
It's harder for the user to give negative feedback.
It's also harder to use, since relevant documents often form a tight cluster, but non-relevant documents rarely do.
Some weights in the query vector can go negative; negative term weights are ignored (set to 0).
48
Relevance Feedback in vector spaces
We can modify the query based on relevance feedback and apply standard vector space model.
Relevance feedback can improve recall and precision.
Relevance feedback is most useful for increasing recall in situations where recall is important: users can be expected to review results and to take time to iterate.
49
Evaluation of relevance feedback strategies
Use q_0 and compute a precision-recall graph.
Use q_m and compute a precision-recall graph.
Assessing on all documents in the collection gives spectacular improvements, but it's cheating! The gain is partly due to the known relevant documents being ranked higher. We must evaluate with respect to documents not seen by the user.
Use documents in the residual collection (the set of documents minus those assessed relevant).
Measures are usually then lower than for the original query, but this is a more realistic evaluation, and relative performance can be validly compared.
50
Evaluation of relevance feedback Most satisfactory – use two collections each
with their own relevance assessments q0 and user feedback from first collection
qm run on second collection and measured
Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.
51
Evaluation: Caveat True evaluation of usefulness must compare
to other methods taking the same amount of time.
Alternative to relevance feedback: User revises and resubmits query.
Users may prefer revision/resubmission to having to judge relevance of documents.
There is no clear evidence that relevance feedback is the “best use” of the user’s time.
52
Pseudo relevance feedback
Pseudo-relevance feedback automates the "manual" part of true relevance feedback.
Pseudo-relevance algorithm:
Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).
Works very well on average, but can go horribly wrong for some queries.
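The loop above can be sketched as a Rocchio update with γ = 0, treating the top k ranked vectors as the relevant set; the vectors and the β value here are illustrative assumptions.

```python
def pseudo_feedback(query_vec, ranked_doc_vecs, k=3, alpha=1.0, beta=0.75):
    # Assume the top k retrieved documents are relevant and move the
    # query toward their centroid (Rocchio with gamma = 0).
    top = ranked_doc_vecs[:k]
    dims = len(query_vec)
    centroid = [sum(d[i] for d in top) / len(top) for i in range(dims)]
    return [alpha * query_vec[i] + beta * centroid[i] for i in range(dims)]

# Toy ranked results (best first); the expanded query gains weight on term 2
expanded = pseudo_feedback([1.0, 0.0],
                           [[1.0, 0.4], [0.8, 0.6], [0.9, 0.2], [0.0, 0.9]],
                           k=3)
print(expanded)   # ≈ [1.675, 0.3]
```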
53
Relevance Feedback on the Web
Some search engines offer a similar/related pages feature (a trivial form of relevance feedback): Google (link-based), Stanford WebBase.
But some don't, because it's hard to explain to the average user: Alltheweb, Yahoo.
Explicit relevance feedback is not a commonly available feature on Web search engines: users are not patient enough to give their feedback to the system!
54
Probabilistic Relevance Feedback Models
The vector space model is operational:
It gives a precise recipe for computing the relevance of a document with regard to the query.
It does not attempt to justify why relevance should be defined that way.
Probabilistic models can be used to understand the behavior of practical IR systems.
55
Probabilistic Relevance Feedback Models
Let R be a Boolean random variable that represents the relevance of document d with regard to query q.
A reasonable order for ranking documents is their odds ratio for relevance:

    odds(R | d, q) = P(R = 1 | d, q) / P(R = 0 | d, q)
56
Probabilistic Relevance Feedback Models
We assume that term occurrences are independent given the query and the value of R.
Let {t} be the universe of terms, and let x_{d,t} ∈ {0,1} reflect whether term t appears in document d or not.
We get

    odds(R | d, q) = ( P(R = 1 | q) / P(R = 0 | q) ) · Π_t P(x_{d,t} | R = 1, q) / P(x_{d,t} | R = 0, q)
57
Probabilistic Relevance Feedback Models
58
Advanced Issues
Spamming
In classical IR corpora, document authors were largely truthful about the purpose of the document: terms found in the document were broadly indicative of content.
However, economic reality on the web and the need to capture eyeballs have led to quite a different culture.
59
Spamming
Spammers: those who add popular query terms to pages unrelated to those terms.
For example, adding the terms "Hawaii vacation rental" to a page while making the font color the same as the background color.
60
Spamming
In the early days of the Web:
Search engines used many clues, such as font color and position, to judge a page.
They guarded these secrets zealously, for knowing them would enable spammers to beat the system again easily enough.
With the invention of hyperlink-based ranking techniques, spammers went through a setback phase: the number and popularity of sites that cited a page started to matter quite strongly in determining the rank of that page in response to a query.
61
Advanced Issues
Titles, headings, metatags and anchor text
The standard tf-idf framework treats all terms uniformly. On the Web, valuable information may be lost this way!
Different idioms warrant different weights:
Titles: <title>…</title>
Headings: <h2>…</h2>
Font modifiers: <strong>…</strong>, <font …>…</font>
Most search engines respond to these different idioms by assigning different weights.
62
Titles, headings, metatags and anchor text
Consider the Web's rich hyperlink structure:
Succinct descriptions of the content of a page v may often be found in the text of pages u that link to v.
The text in and near the "anchor" construct may be especially significant.
In the World Wide Web Worm system, McBryan built an index where a page v that had not been fetched by the crawler would be indexed using anchor text on pages u that link to v. Lycos and Google adopted the same approach, increasing their reach by almost a factor of two.
63
Titles, headings, metatags and anchor text
(Near-)anchor text on u offers valuable editorial judgment about v as well.
Agreement among many authors of pages like u about what terms to use as (near-)anchor text is valuable in fighting spam and returning better-ranked responses.
64
Advanced Issues
Ranking for complex queries including phrases
Many search systems permit the query to contain single words, phrases, and word inclusions and exclusions.
Explicit inclusions and exclusions are hard Boolean predicates: all responses must satisfy them.
With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space.
65
Ranking for complex queries including phrases
Basic approaches:
A positional index
Construct two separate indices: one for single terms and the other for phrases
Regard the phrases as new dimensions added to the vector space
The problem is how to define a "phrase":
Manually
Biword indexes
Derived from the corpus itself using statistical techniques
66
Biword indexes
Index every consecutive pair of terms in the text as a phrase.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query-processing is now immediate.
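A minimal sketch of biword indexing, assuming simple whitespace tokenization with punctuation stripped:

```python
from collections import defaultdict

def index_biwords(docs):
    # Map each consecutive word pair (biword) to the set of doc ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().replace(",", "").split()
        for w1, w2 in zip(words, words[1:]):
            index[f"{w1} {w2}"].add(doc_id)
    return index

idx = index_biwords({1: "Friends, Romans, Countrymen"})
print(sorted(idx))   # ['friends romans', 'romans countrymen']
```

A two-word phrase query is then a single dictionary lookup in this index.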
67
Longer phrase queries
Longer phrases are divided into several biwords: "stanford university palo alto" can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
Without the docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase.
Can have false positives!
68
Extended biwords
Parse the indexed text and perform part-of-speech tagging (POST).
Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
Now deem any string of terms of the form N X* N to be an extended biword. Each such extended biword is now made a term in the dictionary.
Example: catcher in the rye → N X X N
Query processing: parse the query into N's and X's, segment it into enhanced biwords, and look them up in the index.
69
Statistical techniques
To decide whether "t1 t2" occurs more often as a phrase than the individual rates of occurrence of t1 and t2 would suggest,
a likelihood ratio test can be used.
Null hypothesis: t1 and t2 are independent.
Alternative hypothesis: t1 and t2 are dependent.
70
Likelihood ratio test
Ω is the entire parameter space; Ω0 is the parameter space corresponding to the null hypothesis; Ω0 ⊆ Ω implies λ ≤ 1:

    λ = max_{p ∈ Ω0} H(p; k) / max_{p ∈ Ω} H(p; k)

It is known that −2 log λ is asymptotically χ²-distributed with degrees of freedom equal to the difference in the dimensions of Ω0 and Ω.
If occurrences of t1 and t2 are independent, λ is almost 1, and −2 log λ is almost 0. The larger the value of −2 log λ, the stronger the dependence.
71
Likelihood ratio test
In the Bernoulli trial setting, let k_{xy} count the text windows in which t1 does (x = 1) or does not (x = 0) occur and t2 does (y = 1) or does not (y = 0) occur.
For the null hypothesis (independence, with marginal probabilities p1 and p2):

    H(p1, p2; k00, k01, k10, k11) = ((1−p1)(1−p2))^k00 · ((1−p1)p2)^k01 · (p1(1−p2))^k10 · (p1 p2)^k11

For the alternative hypothesis (one free parameter per cell):

    H(p00, p01, p10, p11; k00, k01, k10, k11) = p00^k00 · p01^k01 · p10^k10 · p11^k11
72
Likelihood ratio test
The maximum likelihood estimators:
For the null hypothesis:
p1 = (k10 + k11) / (k00 + k01 + k10 + k11)
p2 = (k01 + k11) / (k00 + k01 + k10 + k11)
For the alternative hypothesis:
p00 = k00 / (k00 + k01 + k10 + k11), and the other parameters are obtained similarly.
73
Likelihood ratio test
We can sort the entries by −2 log λ and compare the top rankers against a χ² table to note the confidence with which the null hypothesis is violated.
Dunning reports the following top candidates from some common English text:

   −2 log λ   phrase
   271        the swiss
   264        can be
   257        previous year
   167        mineral water
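Putting the MLEs and the two hypotheses together, −2 log λ can be sketched as below; the contingency counts in the example calls are hypothetical, chosen only to contrast an independent pair with a strongly co-occurring one.

```python
import math

def llr(k00, k01, k10, k11):
    # -2 log(lambda) for the independence test of a term pair (t1, t2).
    # k10 + k11 = windows containing t1; k01 + k11 = windows containing t2.
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n                  # MLE of P(t1 occurs) under H0
    p2 = (k01 + k11) / n                  # MLE of P(t2 occurs) under H0
    def term(k, p):
        return k * math.log(p) if k > 0 else 0.0
    log_null = (term(k00, (1 - p1) * (1 - p2)) + term(k01, (1 - p1) * p2) +
                term(k10, p1 * (1 - p2)) + term(k11, p1 * p2))
    log_alt = sum(term(k, k / n) for k in (k00, k01, k10, k11))  # MLE per cell
    return 2 * (log_alt - log_null)

# Perfectly independent counts give ~0; strong co-occurrence gives a large value
print(llr(8100, 900, 900, 100))   # ~0
print(llr(9000, 500, 400, 100))   # large positive value
```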
74
Advanced Issues
Approximate string matching
Even though a large fraction of web pages are written in English, many other languages are also used.
Dialects of English, or transliteration from other languages, may make many word spellings nonuniform.
An exact character-by-character match between the query terms entered by the user and the keys in the inverted index may miss relevant matches!
75
Approximate string matching
Two ways to reduce the problem:
Soundex: use an aggressive conflation mechanism that collapses variant spellings into the same token.
E.g., bradley, bartle, bartlett and brodley all share the soundex code B634.
n-grams: decompose terms into sequences of n-grams.
76
Soundex
A class of heuristics to expand a query into phonetic equivalents.
Language specific – mainly for names. E.g., Herman ≈ Hermann.
Invented for the U.S. census … in 1918.
77
Soundex – typical algorithm
Turn every token to be indexed into a 4-character reduced form.
Do the same with query terms.
Build and search an index on the reduced forms (when the query calls for a soundex match).
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
78
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   B, F, P, V → 1
   C, G, J, K, Q, S, X, Z → 2
   D, T → 3
   L → 4
   M, N → 5
   R → 6
79
Soundex continued
4. Remove all pairs of consecutive digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.Will hermann generate the same code?
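The six steps transcribe directly into code (assuming plain alphabetic input), and running it answers the question: hermann does generate the same code.

```python
def soundex(term):
    # Steps follow the slides: keep the first letter, map letters to digits,
    # collapse pairs of identical consecutive digits, drop zeros, pad to 4.
    term = term.upper()
    code_of = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in letters:
            code_of[c] = digit
    digits = [code_of[c] for c in term if c in code_of]
    collapsed = []
    for d in digits:                      # step 4: remove consecutive duplicates
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    rest = [d for d in collapsed[1:] if d != "0"]   # step 5: remove zeros
    return (term[0] + "".join(rest) + "000")[:4]    # step 6: pad and truncate

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 -- the same code
print(soundex("bradley"), soundex("bartle"), soundex("bartlett"))  # B634 B634 B634
```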
80
Soundex
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …).
How useful is soundex? Not very – for information retrieval.
Okay for "high recall" tasks (e.g., Interpol), though biased toward names of certain nationalities.
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR.
81
n-gram
Enumerate all the n-grams in the query string as well as in the lexicon.
Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams.
Threshold by the number of matching n-grams.
82
Example with trigrams
Suppose the text is november: trigrams are nov, ove, vem, emb, mbe, ber.
The query is december: trigrams are dec, ece, cem, emb, mbe, ber.
So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?
83
One option – Jaccard coefficient
A commonly-used measure of overlap. Let X and Y be two sets; then the Jaccard coefficient is

    J(X, Y) = |X ∩ Y| / |X ∪ Y|

Equals 1 when X and Y have the same elements and zero when they are disjoint.
X and Y don't have to be of the same size; the coefficient always assigns a number between 0 and 1.
Now threshold to decide if you have a match: e.g., if J(X, Y) > 0.8, declare a match.
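The november/december example works out as follows; the trigram overlap of 3 out of a union of 9 gives a Jaccard coefficient of 1/3.

```python
def ngrams(term, n=3):
    # Set of all character n-grams of the term
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    # |X intersect Y| / |X union Y|: 1 when identical, 0 when disjoint
    return len(x & y) / len(x | y)

t1, t2 = ngrams("november"), ngrams("december")
print(sorted(t1 & t2))             # ['ber', 'emb', 'mbe']
print(round(jaccard(t1, t2), 3))   # 0.333
```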
84
N-gram
Looking up the inverted index now becomes a two-stage affair:
First, an index of n-grams is consulted to expand each query term into a set of slightly distorted query terms.
Then, all these terms are submitted to the regular index.