
Page 1:

CS 430 / INFO 430 Information Retrieval

Lecture 8

Evaluation of Retrieval Effectiveness 1

Page 2:

Course administration

Change of Office Hours

Office hours are now:

Tuesday: 9:30 to 10:30
Thursday: 9:30 to 10:30

Page 3:

Course administration

Discussion Class 4

Check the Web site.

(a) It is not necessary to study the entire paper in detail.

(b) The PDF version of the file is damaged. Use the PostScript version.

Page 4:

Retrieval Effectiveness

Designing an information retrieval system involves many decisions:

Manual or automatic indexing?
Natural language or controlled vocabulary?
What stoplists?
What stemming methods?
What query syntax?
etc.

How do we know which of these methods are most effective?

Is everything a matter of judgment?

Page 5:

From Lecture 1: Evaluation

To place information retrieval on a systematic basis, we need repeatable criteria to evaluate how effective a system is in meeting the information needs of the user of the system.

This proves to be very difficult with a human in the loop. It is hard to define:

• the task that the human is attempting

• the criteria to measure success

Page 6:

Relevance as a set comparison

D = set of documents

A = set of documents that satisfy some user-based criterion

B = set of documents identified by the search system

Page 7:

Measures based on relevance

recall = retrieved relevant / relevant = |A ∩ B| / |A|

precision = retrieved relevant / retrieved = |A ∩ B| / |B|

fallout = retrieved not-relevant / not-relevant = |B - A ∩ B| / |D - A|
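As an illustration (not part of the original lecture), these three measures can be computed from sets of document identifiers with a few lines of Python; the document IDs in the example are invented.

```python
# Minimal sketch: recall, precision, and fallout from sets of document IDs.
# D = all documents, A = relevant documents, B = retrieved documents.

def recall(A, B):
    return len(A & B) / len(A)        # |A ∩ B| / |A|

def precision(A, B):
    return len(A & B) / len(B)        # |A ∩ B| / |B|

def fallout(A, B, D):
    return len(B - A) / len(D - A)    # |B - A ∩ B| / |D - A|

# Invented example: 20 documents, 5 relevant, 4 retrieved.
D = set(range(1, 21))
A = {1, 2, 3, 4, 5}
B = {1, 2, 6, 7}
print(recall(A, B), precision(A, B), fallout(A, B, D))   # 0.4 0.5 0.1333...
```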

Page 8:

Relevance

Recall and precision depend on the concept of relevance.

Relevance is a context- and task-dependent property of documents.

"Relevance is the correspondence in context between an information requirement statement ... and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement."

F. W. Lancaster, 1979

Page 9:

Relevance

How stable are relevance judgments?

• For textual documents, knowledgeable users have good agreement in deciding whether a document is relevant to an information requirement.

• There is less consistency with non-textual documents, e.g., a photograph.

• Attempts to have users give a level of relevance, e.g., on a five-point scale, are inconsistent.

Page 10:

Studies of Retrieval Effectiveness

• The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968

• SMART System, Gerald Salton, Cornell University, 1964-1988

• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992-

Page 11:

Cranfield Experiments (Example)

Comparative efficiency of indexing systems:

(Universal Decimal Classification, alphabetical subject index, a special facet classification, Uniterm system of co-ordinate indexing)

Four indexes prepared manually for each document in three batches of 6,000 documents -- total 18,000 documents, each indexed four times. The documents were reports and papers in aeronautics.

Indexes for testing were prepared on index cards and other cards.

Very careful control of indexing procedures.

Page 12:

Cranfield Experiments (continued)

Searching:

• 1,200 test questions, each satisfied by at least one document

• Reviewed by expert panel

• Searches carried out by 3 expert librarians

• Two rounds of searching to develop testing methodology

• Subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University

Page 13:

The Cranfield Data

The Cranfield data was made widely available and used by other researchers.

• Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing

• Spärck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.

Page 14:

Cranfield Experiments -- Measures of Effectiveness for Matching Methods

Cleverdon's work was applied to matching methods. He made extensive use of recall and precision, based on the concept of relevance.

[Scatter plot of precision (%) against recall (%). Each x represents one search. The graph illustrates the trade-off between precision and recall.]

Page 15:

Typical precision-recall graph for different queries

[Typical precision-recall graph: precision against recall on axes from 0 to 1.0, with one curve for a broad, general query and one for a narrow, specific query, using Boolean-type queries.]

Page 16:

Some Cranfield Results

• The various manual indexing systems have similar retrieval efficiency

• Retrieval effectiveness using automatic indexing can be at least as effective as manual indexing with controlled vocabularies

-> original results from the Cranfield + SMART experiments (published in 1967)

-> considered counter-intuitive at the time, but other results since then have supported this conclusion

Page 17:

Precision and Recall with Ranked Results

Precision and recall are defined for a fixed set of hits, e.g., Boolean retrieval.

Their use needs to be modified for a ranked list of results.

Page 18:

Ranked retrieval: Recall and precision after retrieval of n documents

n    relevant   recall   precision
1    yes        0.2      1.0
2    yes        0.4      1.0
3    no         0.4      0.67
4    yes        0.6      0.75
5    no         0.6      0.60
6    yes        0.8      0.67
7    no         0.8      0.57
8    no         0.8      0.50
9    no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36

SMART system using Cranfield data, 200 documents in aeronautics of which 5 are relevant
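The table can be reproduced with a short Python sketch (an illustration, not code from the lecture); the flags below encode the yes/no column of the table.

```python
# Compute recall and precision after each retrieved document in a ranked list.
def recall_precision_at_ranks(relevance_flags, total_relevant):
    results = []
    relevant_so_far = 0
    for n, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
        results.append((n, relevant_so_far / total_relevant, relevant_so_far / n))
    return results

# Relevance judgments for the first 14 documents of the example run (5 relevant in total).
flags = [True, True, False, True, False, True, False, False,
         False, False, False, False, True, False]
for n, r, p in recall_precision_at_ranks(flags, total_relevant=5):
    print(f"{n:2d}  {r:.1f}  {p:.2f}")
```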

Page 19:

Precision-recall graph

[Precision-recall graph for the ranked results above: precision against recall on axes from 0 to 1.0, with points labeled by rank (1, 2, 3, 4, 5, 6, ..., 12, 13, ..., 200).]

Note: Some authors plot recall against precision.

Page 20:

11-Point Precision (Recall Cut-Off)

p(n) is precision at that point where recall has first reached n

Define 11 standard recall points p(r0), p(r1), ... p(r10),

where p(rj) = p(j/10)

Note: if p(rj) is not an exact data point, use interpolation

Page 21:

Recall cutoff graph: choice of interpolation points

[Recall cutoff graph over the same precision-recall points, axes from 0 to 1.0, showing the choice of interpolation points. The blue line is the recall cutoff graph.]

Page 22:

Example: SMART System on Cranfield Data

Recall   Precision
0.0      1.0
0.1      1.0
0.2      1.0
0.3      1.0
0.4      1.0
0.5      0.75
0.6      0.75
0.7      0.67
0.8      0.67
0.9      0.38
1.0      0.38

Precision values at recall 0.2, 0.4, 0.6, 0.8, and 1.0 (shown in blue on the slide) are actual data.

Precision values at the other recall levels (shown in red) are by interpolation (by convention equal to the next actual data value).
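The 11 values can be derived from the actual data points with a small sketch (an illustration, assuming the interpolation convention stated above: take the precision at the next actual data point).

```python
# 11-point precision with recall cutoff interpolation.
def eleven_point_precision(observed):
    """observed: (recall, precision) pairs at the points where recall first
    reaches each new value, sorted by increasing recall."""
    points = []
    for j in range(11):
        r = j / 10
        # precision at the first observed recall level >= r
        later = [p for (rec, p) in observed if rec >= r - 1e-9]
        points.append(later[0] if later else 0.0)
    return points

# Actual data points from the example run.
observed = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
print(eleven_point_precision(observed))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38]
```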

Page 23:

Average precision

Average precision for a single topic is the mean of the precision values obtained after each relevant document is retrieved.

Example:

p = (1.0 + 1.0 + 0.75 + 0.67 + 0.38) / 5

  = 0.76

Mean average precision for a run consisting of many topics is the mean of the average precision scores for each individual topic in the run.

Definitions from TREC-8
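A minimal sketch of these two definitions (an illustration, not TREC's own code); applied to the example run above it gives 0.76.

```python
# Average precision for one topic, and mean average precision over several topics.
def average_precision(relevance_flags, total_relevant):
    relevant_so_far, precisions = 0, []
    for n, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / n)
    # Relevant documents never retrieved contribute 0 to the sum.
    return sum(precisions) / total_relevant

def mean_average_precision(topics):
    """topics: list of (relevance_flags, total_relevant), one entry per topic."""
    return sum(average_precision(f, t) for f, t in topics) / len(topics)

flags = [True, True, False, True, False, True, False, False,
         False, False, False, False, True, False]
print(round(average_precision(flags, 5), 2))   # 0.76
```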

Page 24:

Normalized recall measure

[Normalized recall graph: recall plotted against the ranks of the retrieved documents (5, 10, 15, ..., 195, 200), with curves for the ideal ranks, the actual ranks, and the worst ranks.]

Page 25:

Normalized recall

Normalized recall = (area between actual and worst) / (area between best and worst)

After some mathematical manipulation:

R_norm = 1 - ( Σ_{i=1}^{n} r_i - Σ_{i=1}^{n} i ) / ( n(N - n) )

where r_i is the rank at which the i-th relevant document is retrieved, n is the number of relevant documents, and N is the total number of documents.
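A short sketch of this formula (an illustration): in the example run, the 5 relevant documents are retrieved at ranks 1, 2, 4, 6, and 13 out of N = 200.

```python
# Normalized recall from the ranks of the relevant documents.
def normalized_recall(relevant_ranks, N):
    n = len(relevant_ranks)
    ideal_ranks = range(1, n + 1)                       # best case: ranks 1..n
    return 1 - (sum(relevant_ranks) - sum(ideal_ranks)) / (n * (N - n))

print(normalized_recall([1, 2, 4, 6, 13], N=200))       # about 0.989
```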

Page 26:

Combining Recall and Precision: Normalized Symmetric Difference

[Venn diagram: D = set of documents, A = relevant documents, B = retrieved documents.]

Symmetric difference: S = (A ∪ B) - (A ∩ B)

Normalized symmetric difference = |S| / (|A| + |B|)

                                = 1 - 1 / {½ (1/recall + 1/precision)}

Symmetric Difference: The set of elements belonging to one but not both of two given sets.
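A minimal sketch (with invented sets) checking that the set form and the recall/precision form agree:

```python
# Normalized symmetric difference, computed two ways.
def normalized_symmetric_difference(A, B):
    S = (A | B) - (A & B)                   # symmetric difference
    return len(S) / (len(A) + len(B))

def from_recall_precision(recall, precision):
    return 1 - 1 / (0.5 * (1 / recall + 1 / precision))

A = {1, 2, 3, 4, 5}                         # relevant
B = {1, 2, 6, 7}                            # retrieved
print(normalized_symmetric_difference(A, B))   # 0.555...
print(from_recall_precision(2 / 5, 2 / 4))     # 0.555...
```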

Page 27:

Statistical tests

Suppose that a search is carried out on systems i and j.

System i is superior to system j if, for all test cases,

recall(i) >= recall(j)
precision(i) >= precision(j)

In practice, we have data from a limited number of test cases. What conclusions can we draw?

Page 28:

Recall-precision graph

[Recall-precision graph on axes from 0 to 1.0 comparing two systems, one plotted in red and one in black.]

The red system appears better than the black, but is the difference statistically significant?

Page 29:

Statistical tests

• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data.

• The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.

• The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
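As an illustration of how these tests might be applied (assuming SciPy is available; the per-query scores below are invented), each system is scored per query and the paired scores are compared:

```python
from scipy import stats

# Invented per-query average precision scores for two systems.
system_i = [0.45, 0.60, 0.30, 0.75, 0.50, 0.62, 0.40, 0.55]
system_j = [0.40, 0.58, 0.35, 0.70, 0.45, 0.60, 0.33, 0.50]

# Paired t-test: assumes normally distributed differences.
print(stats.ttest_rel(system_i, system_j))

# Wilcoxon signed-rank test: uses ranks of the differences, no normality assumption.
print(stats.wilcoxon(system_i, system_j))

# Sign test: uses only the sign of each difference (binomial test on the wins).
wins = sum(a > b for a, b in zip(system_i, system_j))
ties = sum(a == b for a, b in zip(system_i, system_j))
print(stats.binomtest(wins, n=len(system_i) - ties, p=0.5))
```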