
INFSCI 2140 Information Storage and Retrieval
Lecture 4: Retrieval Evaluation

Peter Brusilovsky
http://www2.sis.pitt.edu/~peterb/2140-051

The issue of evaluation: TREC

Text REtrieval Conferences organized by NIST
TREC-9 was held in 2000
• http://trec.nist.gov/presentations/TREC9/intro/
TREC IR “competitions”
– Standard document sets
– Standard queries and “topics”

Evaluation: Macro view

[Diagram: Life Cycle of an Information System — Design, Work, Evaluation, then Redesign or Death]

Evaluation: Micro view

[Diagram: Information Need, Query, Browsing, Results, Evaluation, Improve Query?]

Effectiveness

The effectiveness of a retrieval system is related to user satisfaction
– i.e. it is related to the ectosystem

[Diagram: a user with an information need formulates a query; the IR system returns results to the user]

How Good is the Model?

Was the query language powerful enough to represent the need?
Were we able to use query syntax to express what we need?
– Operators
– Weights
Were the words from the limited vocabulary expressive enough?

What can we say about a document?

Matching to the need, question, query
Relevance:
– how well the document responds to the query
Pertinence:
– how well the document matches the information need
Usefulness vs. relevance

Relevance and Pertinence

Relevance:
– how well the documents respond to the query
Pertinence:
– how well the documents respond to the information need
Usefulness (vs. relevance)
– useful but not relevant
– relevant but useless

How can we measure it?

Binary measure (yes/no)
N-ary measure:
– 3 very relevant
– 2 relevant
– 1 barely relevant
– 0 not relevant
N = ?: consistency vs. expressiveness

Precision and Recall

[Diagram: the collection split into four regions (1–4) by Relevant vs. not relevant and Retrieved vs. Not Retrieved]

Precision and Recall

               Relevant   Not relevant
Retrieved         w            y
Not retrieved     x            z

Relevant  = w + x
Retrieved = w + y

Precision: P = w / Retrieved
Recall:    R = w / Relevant

Precision and Recall

Precision = w / n2 = w / (w + y)
Recall    = w / n1 = w / (w + x)

where
– w  = number of retrieved documents that are relevant
– n2 = w + y = number of retrieved documents
– n1 = w + x = number of relevant documents
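A minimal sketch of these two formulas in Python, assuming the retrieved result and the relevance judgments are given as sets of document IDs (the IDs and numbers below are illustrative, not from the slides):

# Precision and recall from sets of document IDs (illustrative data).
def precision_recall(retrieved, relevant):
    w = len(retrieved & relevant)                          # retrieved AND relevant
    precision = w / len(retrieved) if retrieved else 0.0   # w / n2
    recall = w / len(relevant) if relevant else 0.0        # w / n1
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}        # hypothetical result list
relevant = {"d1", "d3", "d7", "d9", "d12"}  # hypothetical relevance judgments
p, r = precision_recall(retrieved, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")   # Precision = 0.50, Recall = 0.40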

How are they related?

Suppose that the system is run in response to a query, and recall and precision are measured as an increasing number of documents is retrieved.
– At the beginning, imagine that only one document is retrieved and that it is relevant:

Precision = 1
Recall = 1 / n1   (very low)

How are they related ?

– On the other extreme suppose that everydocument in the database is retrieved:

Precision = n2N

Recall = 1

Very low

Totalnumber ofdocument

in thecollection

Allrelevant

documentare

retrieved

How are they related?

Precision falls and recall rises as the number of documents retrieved in response to a query is increased.
The number of returned documents can be considered a search parameter.
By changing it we can build precision/recall graphs.

Precision-Recall Graph

• Imagine that a query is submitted to the system.
• 14 documents are retrieved.
• 5 of them are relevant.
• These 5 are also the total number of relevant documents in the collection.

 i   Rel?   Precision   Recall
 1    1      1.000       0.2
 2    1      1.000       0.4
 3    0      0.667       0.4
 4    1      0.750       0.6
 5    0      0.600       0.6
 6    1      0.667       0.8
 7    0      0.571       0.8
 8    0      0.500       0.8
 9    0      0.444       0.8
10    0      0.400       0.8
11    0      0.364       0.8
12    0      0.333       0.8
13    1      0.385       1.0
14    0      0.357       1.0
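A short sketch that rebuilds the table above from the 0/1 relevance judgments of the 14 retrieved documents:

# Precision and recall after each retrieved document, as in the table above.
relevance = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # 14 retrieved docs, in rank order
total_relevant = 5                                        # all relevant docs in the collection

relevant_so_far = 0
for i, rel in enumerate(relevance, start=1):
    relevant_so_far += rel
    precision = relevant_so_far / i
    recall = relevant_so_far / total_relevant
    print(f"{i:2d}  rel={rel}  P={precision:.3f}  R={recall:.1f}")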

Precision and Recall Graphs

[Chart: precision and recall plotted against the i-th document retrieved, for the table above]

Precision Graph

Precision when more and more documents are retrieved.
Note the sawtooth shape!

Recall Graph

Recall when more and more documents are retrieved.
Note the terraced shape!

Precision-Recall Graph

Sequence of points (p, r)
Similar to y = 1 / x:
– inversely proportional!
Sawtooth shape
Use smoothed graphs
How can we compare different IR systems using precision-recall graphs?

Precision-Recall Graph

[Chart: precision (y-axis) vs. recall (x-axis) for the table above]

Precision-Recall Graph

System a has the best performance, but what about systems b and c: which one is the best?

[Graph: precision vs. recall curves for three systems a, b and c]

Fallout

The proportion of not-relevant documents that are retrieved (it should be low for a good IR system).
Fallout measures how well the system filters out not-relevant documents.

F = y / (N − n1)

where
– y = number of not-relevant documents that are retrieved
– N − n1 = total number of not-relevant documents

Generality

The proportion of relevant documents in the collection. It is more related to the query than to the retrieval process.

G = n1 / N

where
– n1 = number of relevant documents in the collection
– N = total number of documents
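A small sketch computing fallout and generality (together with precision and recall) from the four contingency counts w, x, y, z defined earlier; the counts below are illustrative:

# w = relevant & retrieved, x = relevant & not retrieved,
# y = not relevant & retrieved, z = not relevant & not retrieved
def measures(w, x, y, z):
    N = w + x + y + z                 # collection size
    n1 = w + x                        # relevant documents
    n2 = w + y                        # retrieved documents
    return {
        "precision": w / n2,
        "recall": w / n1,
        "fallout": y / (N - n1),      # F = y / (N - n1)
        "generality": n1 / N,         # G = n1 / N
    }

print(measures(w=3, x=2, y=7, z=88))  # illustrative counts, N = 100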

Exercise 1

Imagine that an IR system retrieved 10 documents in answer to a query, but only documents number 1, 3, 5, 7 are relevant.

Calculate precision, recall and fallout, considering that there are 6 other relevant documents that were not retrieved and that the total number of documents in the collection is 100 (including the 10 retrieved).

Problems of recall & precision

Hard to find recall
Neither shows effectiveness
– Comparing the graphs
– F-measure
– Relative performance as another single measure
Recall & precision may not be important for the user

Problems with Recall

Precision can be determined exactly.
Recall cannot be determined exactly, because it requires knowledge of all the relevant documents in the collection. Recall can only be estimated.

Precision = (# of relevant docs retrieved) / (# of retrieved docs)
Recall = (# of relevant docs retrieved) / (# of relevant docs)

The Need for a Single Measure

To compare two IR systems it would be nice to use just one number, and precision and recall are
– related to each other
– each an incomplete picture of the system

F-Measure (not fallout!)
– F = 2 * (recall * precision) / (recall + precision)
– combines recall and precision in a single measure (the harmonic mean of precision and recall)
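A one-function sketch of the F-measure formula above (the input values are illustrative):

# F-measure: harmonic mean of precision and recall.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * (recall * precision) / (recall + precision)

print(f_measure(precision=0.5, recall=0.4))   # ~0.444, illustrative values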

Relative Performance

R / F = [P / (1 − P)] / [G / (1 − G)]

P / (1 − P) — relevant to non-relevant retrieved
G / (1 − G) — relevant to non-relevant in the collection
R / F — relative performance

Relative Performance

Relative performance should be greater than one if we want the system to do better at locating relevant documents than at rejecting not-relevant ones.

R / F = [P / (1 − P)] / [G / (1 − G)] > 1
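A brief sketch, with illustrative counts, that computes R/F directly and via the P and G form above, to show that the two agree:

# Relative performance R/F, computed two ways from the same counts
# (w, x, y, z as in the earlier contingency table; values are illustrative).
w, x, y, z = 3, 2, 7, 88
N, n1, n2 = w + x + y + z, w + x, w + y

R = w / n1                     # recall
F = y / (N - n1)               # fallout
P = w / n2                     # precision
G = n1 / N                     # generality

direct = R / F
via_p_and_g = (P / (1 - P)) / (G / (1 - G))
print(direct, via_p_and_g)     # both agree (up to rounding), and are > 1 here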

Precision and Recall: User View

It is not clear how important they are for the users:
– Precision is usually more important than recall, because users appreciate outputs that do not contain not-relevant documents.
– This, of course, depends on the kind of user: high recall is important for an attorney who needs to determine all the legal precedents for a case.

What does the user want? Restaurant case

The user wants to find a restaurant serving Sashimi. She can use 2 IR systems. How can we say which one is better?

User-oriented measures

Coverage ratio:
– known_relevant_retrieved / known_relevant
Novelty ratio:
– new_relevant / relevant_retrieved
Relative recall:
– relevant_retrieved / wants_to_examine
Recall effort:
– wants_to_examine / had_to_examine

Coverage and Novelty

Coverage ratio: the proportion of relevant documents known to the user that are actually retrieved.
– A high coverage ratio gives the user some confidence that the system is locating all the relevant documents.

Novelty ratio: the proportion of relevant retrieved documents that were unknown to the user.
– A high novelty ratio suggests that the system is effective in locating documents previously unknown to the user.

Coverage and Novelty

For example, if the user knows of 16 relevant documents (but these are not all the relevant documents) and the system retrieves 10 relevant documents, 4 of which are among those the user knows, we have:

Coverage ratio = 4 / 16 = 0.25
Novelty ratio = 6 / 10 = 0.6

The user may expect 40 relevant documents in total.
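A small sketch of the two ratios, reproducing the 4/16 and 6/10 example above:

# Coverage and novelty ratios from the example above.
def coverage_ratio(known_relevant_retrieved, known_relevant):
    return known_relevant_retrieved / known_relevant

def novelty_ratio(new_relevant_retrieved, relevant_retrieved):
    return new_relevant_retrieved / relevant_retrieved

known_relevant = 16          # relevant documents the user already knows about
relevant_retrieved = 10      # relevant documents the system retrieved
known_among_retrieved = 4    # retrieved relevant docs the user already knew

print(coverage_ratio(known_among_retrieved, known_relevant))                  # 0.25
print(novelty_ratio(relevant_retrieved - known_among_retrieved,
                    relevant_retrieved))                                      # 0.6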

Relative Recall and Effort

Relative recall: the ratio of relevant retrieved documents examined by the user to the number of documents the user would have liked to examine.
– If the system has retrieved 5 relevant documents among 20, how large is the relative recall?

Relative effort: the ratio of the number of relevant documents desired to the number of documents examined by the user to find that number of relevant documents.
– This ratio goes to 1 if the relevant docs are the first ones examined, and toward 0 if the user would need to examine hundreds of documents to find the desired few.

What happens when we increase the number of documents retrieved?

At low retrieval volumes, when we increase the number of documents retrieved, the number of relevant documents increases more rapidly than the number of not-relevant documents.

[Diagram: the retrieved set overlapping the relevant set]

What happens when we increase the number of documents retrieved?

At high retrieval volumes, when we increase the number of documents retrieved, the situation is reversed.

[Diagram: the retrieved set overlapping the relevant set]

From Query to System Performance

Precision and recall change with the number of documents retrieved.
Averaging the values obtained might provide an adequate measure of the effectiveness of the system.
To evaluate system performance we compute average precision and recall.

Three Points Average

Fix recall and count precision!
For a given query, the three-point average precision is computed by averaging the precision of the retrieval system at three recall levels, typically:
– 0.25, 0.5, 0.75
or
– 0.2, 0.5, 0.8
The same can be done for recall.

Other Averages

For a given query, the eleven-point average precision is computed by averaging the precision of the retrieval system at eleven recall levels:
0.0, 0.1, 0.2, …, 0.9, 1.0

If finding exact recall points is hard, it is done at different levels of document retrieval
– 10, 20, 30, 40, 50… relevant retrieved documents

Expected search length

Definition
– a way to estimate the number of documents that a user has to read in order to find the desired number of relevant documents
– M documents to examine to find N relevant
Calculation
Graphing
Average search length
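A simplified sketch of the search length idea in the definition above: walk down the ranked list and count how many documents must be examined to find the desired number of relevant ones. (The full expected search length measure also averages over ties in the ranking, which this sketch ignores; the example list is made up.)

# Simplified search length: documents examined to find n_wanted relevant ones.
def search_length(relevance, n_wanted):
    found = 0
    for examined, rel in enumerate(relevance, start=1):
        found += rel
        if found == n_wanted:
            return examined
    return None   # fewer than n_wanted relevant documents in the list

ranked = [0, 1, 0, 0, 1, 1, 0, 1]          # hypothetical 0/1 relevance in rank order
print(search_length(ranked, n_wanted=3))   # 6 documents must be examined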

Taking the Order into Account

The result of a search is not a set, but a sequence.
Recall and precision fail to take into account the sequential effect of presenting the retrieval results.
Two documents that contain the same information can be judged by the system in different ways:
– the first in the list is considered relevant
– the second one, maybe separated from the first by many other documents, is considered much less relevant

Frustration

Two systems can give a very different perception if they just organize the same documents in a different way:
– all the relevant documents in the first positions, vs.
– relevant documents scattered in the list or at the end of the list

Normalized Recall

To take this effect into account, the normalized recall was introduced.
Imagine that we know all the relevant documents
– an ideal system will present all the relevant documents before the not-relevant ones.
Suppose that the relevant ones are 1, 2, 4, 5, 13 in a list of 14 documents. The graph obtained is:

Normalized Recall

[Graph: recall (0 to 1) as a function of document rank 1–14, plotted for the ideal system and for our real system]

The area between the two graphs (the black one) is a measure of the effectiveness of the system. This measure is always reduced to a value between 0 and 1: 1 for the ideal system and 0 for the system that presents all the relevant documents at the end.

Normalized Recall

[Graph repeated: recall vs. document rank for the real and ideal systems]

Normalized recall = 1 − Difference / (Relevant × (N − Relevant))

where Difference is the area between the two graphs, Relevant is the number of relevant documents, and N is the number of documents in the list.
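A minimal sketch of this formula: the difference is the gap between the ranks at which the real system returns the relevant documents and the ranks an ideal system would use (1, 2, 3, …). For the example above (relevant documents at positions 1, 2, 4, 5, 13 in a list of 14) it gives about 0.78.

# Normalized recall: 1 - Difference / (Relevant * (N - Relevant))
def normalized_recall(relevant_ranks, n_documents):
    n_rel = len(relevant_ranks)
    ideal_ranks = range(1, n_rel + 1)                       # ideal system: relevant docs first
    difference = sum(relevant_ranks) - sum(ideal_ranks)     # area between the two recall graphs
    return 1 - difference / (n_rel * (n_documents - n_rel))

print(normalized_recall([1, 2, 4, 5, 13], n_documents=14))  # ~0.778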

Sliding Ratio

The sliding ratio is a measure that takes into account the weight (the relevance value) of the documents retrieved and does not need knowledge of all the relevant documents.
Assume that we retrieve N = 5 documents that are ranked by the system. Then assume that the user assigns a relevance value to these documents.

Sliding Ratio

Real system: documents in the order ranked by the system, with the user-assigned relevance w_i and the sum of the weights so far:

 n   doc   relevance w_i   Σ w_i
 1   d1        7.0           7.0
 2   d2        5.0          12.0
 3   d3        0.0          12.0
 4   d4        2.5          14.5
 5   d5        8.2          22.7

Sliding Ratio

The perfect system would rank the documents in the same way as the user: the documents are rearranged in decreasing order of relevance.

 doc   relevance w_i   Σ W_i
 d5        8.2           8.2
 d1        7.0          15.2
 d2        5.0          20.2
 d4        2.5          22.7
 d3        0.0          22.7


Sliding Ratio

The sliding ratio is the ratio of the last two columns of the real and ideal tables:

SR_i = (Σ w_k for k ≤ i, real ranking) / (Σ W_k for k ≤ i, ideal ranking)

SR_1 =  7.0 /  8.2 = 0.85
SR_2 = 12.0 / 15.2 = 0.789
SR_3 = 12.0 / 20.2 = 0.594
SR_4 = 14.5 / 22.7 = 0.639
SR_5 = 22.7 / 22.7 = 1.00

Sliding Ratio

If the number of retrieved documents N is large enough, then SR gives a reasonably accurate picture of the retrieval system's performance.
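A small sketch that reproduces the sliding ratio sequence from the example above, given the user-assigned weights in the order the system ranked the documents:

from itertools import accumulate

# Sliding ratio: cumulative weight of the system's ranking divided by the
# cumulative weight of the ideal ranking (same documents sorted by weight).
def sliding_ratio(weights_in_system_order):
    real = list(accumulate(weights_in_system_order))
    ideal = list(accumulate(sorted(weights_in_system_order, reverse=True)))
    return [r / i for r, i in zip(real, ideal)]

weights = [7.0, 5.0, 0.0, 2.5, 8.2]       # d1..d5 from the example above
for i, sr in enumerate(sliding_ratio(weights), start=1):
    print(f"SR_{i} = {sr:.3f}")           # 0.854, 0.789, 0.594, 0.639, 1.000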

Homework 1

Imagine that an IR system retrieved 20 documents in answer to a query, but only documents number 1, 3, 8, 9, 13, 15, and 20 are relevant.

Calculate precision, recall, fallout and the ratio recall/fallout, considering that there are 5 other relevant documents that were not retrieved and that the total number of documents in the collection is 100 (including the 20 retrieved).

Explore this problem using the graphing applet.

Homework 2

Imagine that a pool of users assigns relevance weights to the relevant documents. Calculate the sliding ratio column.

 Doc.    Rel = 1 /     Relevance
 number  not rel = 0   weight
  1          1           0.1
  2          0           0
  3          1           0.5
  4          0           0
  5          0           0
  6          0           0
  7          0           0
  8          1           0.9
  9          1           0.5
 10          0           0
 11          0           0
 12          0           0
 13          1           1
 14          0           0
 15          1           1
 16          0           0
 17          0           0
 18          0           0
 19          0           0
 20          1           0.2