INFSCI 2140 Information Storage and Retrieval
Lecture 4: Retrieval Evaluation
Peter Brusilovsky, http://www2.sis.pitt.edu/~peterb/2140-051
The issue of evaluation: TREC
Text REtrieval Conferences organized by NIST
TREC-9 was held in 2000
• http://trec.nist.gov/presentations/TREC9/intro/
TREC IR "competitions"
– Standard document sets
– Standard queries and "topics"
Evaluation: Macro view
[Diagram: Life Cycle of an Information System: Design → Work → Evaluation → Redesign (back to Design) or Death]

Evaluation: Micro view
[Diagram: Information Need → Query → Results → Evaluation → Improve Query?, with Browsing as an alternative path]
Effectiveness
The effectiveness of a retrieval system is related to user satisfaction
– i.e., it is related to the ectosystem
[Diagram: a User with an Information need issues a Query to the IR system and receives Results]
How Good is the Model?
Was the query language powerful enough to represent the information need?
Were we able to use query syntax to express what we need?
– Operators
– Weights
Were the words from the limited vocabulary expressive enough?
What can we say about a document?
Matching to the need, question, query
Relevance:
– how well the document responds to the query
Pertinence:
– how well the document matches the information need
Usefulness vs. relevance

Relevance and Pertinence
Relevance:
– how well the documents respond to the query
Pertinence:
– how well the documents respond to the information need
Usefulness (vs. relevance):
– Useful but not relevant
– Relevant but useless
How can we measure it?
Binary measure (yes/no)
N-ary measure:
– 3 very relevant
– 2 relevant
– 1 barely relevant
– 0 not relevant
N = ?: consistency vs. expressiveness (a larger scale is more expressive, but judgments become less consistent)
Precision and Recall
[Diagram: the collection divided by Relevant and Retrieved into four regions: relevant retrieved, relevant not retrieved, not-relevant retrieved, not-relevant not retrieved]
Precision and Recall

                Retrieved   Not retrieved
Relevant           w             x
Not relevant       y             z

Relevant = w + x
Retrieved = w + y
Precision: P = w / Retrieved
Recall: R = w / Relevant
Precision and Recall

Precision = w / n2 = w / (w + y)
Recall = w / n1 = w / (w + x)

where w = the number of retrieved documents that are relevant, n2 = the number of retrieved documents, and n1 = the number of relevant documents
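As a minimal sketch (not part of the original slides), the two formulas translate directly into Python; the counts w, x, y follow the table above, and the example numbers are made up:

```python
def precision(w, y):
    """Fraction of retrieved documents that are relevant: w / (w + y)."""
    return w / (w + y)

def recall(w, x):
    """Fraction of relevant documents that are retrieved: w / (w + x)."""
    return w / (w + x)

# Hypothetical counts: 4 relevant retrieved, 6 relevant missed,
# 6 non-relevant retrieved.
w, x, y = 4, 6, 6
print(precision(w, y))  # 0.4
print(recall(w, x))     # 0.4
```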
How are they related?
Suppose that the system is run in response to a query, and Recall and Precision are measured as an increasing number of documents is retrieved.
– At the beginning, imagine that only one document is retrieved and that it is relevant:
Precision = 1
Recall = 1 / n1 (very low)
How are they related?
– At the other extreme, suppose that every document in the database is retrieved:
Precision = n1 / N (very low), where N is the total number of documents in the collection
Recall = 1 (all relevant documents are retrieved)
How are they related?
Precision falls and recall rises as the number of documents retrieved in response to a query is increased
The number of returned documents can be considered as a search parameter
By changing it we can build a precision/recall graph
Precision-Recall Graph
• Imagine that a query is submitted to the system.
• 14 documents are retrieved
• 5 of them are relevant
• These 5 are also the total number of relevant documents in the collection

Rank  Rel./not Rel.  Precision  Recall
 1         1           1.000      0.2
 2         1           1.000      0.4
 3         0           0.667      0.4
 4         1           0.750      0.6
 5         0           0.600      0.6
 6         1           0.667      0.8
 7         0           0.571      0.8
 8         0           0.500      0.8
 9         0           0.444      0.8
10         0           0.400      0.8
11         0           0.364      0.8
12         0           0.333      0.8
13         1           0.385      1.0
14         0           0.357      1.0
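The table can be reproduced with a short Python loop over the ranked relevance judgments (1 = relevant, 0 = not relevant), a sketch using the example's own data:

```python
# Relevance judgments for the 14 retrieved documents, in rank order.
judgments = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
total_relevant = 5  # all relevant documents in the collection

relevant_so_far = 0
for i, rel in enumerate(judgments, start=1):
    relevant_so_far += rel
    precision = relevant_so_far / i               # relevant seen / retrieved so far
    recall = relevant_so_far / total_relevant     # relevant seen / all relevant
    print(f"{i:2d}  rel={rel}  P={precision:.3f}  R={recall:.1f}")
```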
Precision and Recall Graphs
[Chart: Precision and Recall plotted against the ith document retrieved (1 to 14), from the table above]
Precision Graph
Precision when more and more documents are retrieved. Note the sawtooth shape!

Recall Graph
Recall when more and more documents are retrieved. Note the terraced shape!
Precision-Recall Graph
Sequence of points (p, r)
Similar to y = 1 / x: inversely proportional!
Sawtooth shape
Use smoothed graphs
How can we compare different IR systems using precision-recall graphs?
Precision-Recall Graph
[Chart: Precision vs. Recall for the 14-document example, plotted from the table above]
Precision-Recall Graph
System a has the best performance, but what about systems b and c: which one is the best?
[Chart: smoothed precision-recall curves for three systems a, b, and c]
Fallout
The proportion of not-relevant documents that are retrieved (it should be low for a good IR system)
Fallout measures how well the system filters out not-relevant documents

F = y / (N - n1)

where y = the number of not-relevant documents that are retrieved, and N - n1 = the total number of not-relevant documents
Generality
The proportion of relevant documents in the collection. It is related more to the query than to the retrieval process

G = n1 / N

where n1 = the number of relevant documents in the collection, and N = the total number of documents
Exercise 1
Imagine that an IR system retrieved 10 documents in answer to a query, but only documents number 1, 3, 5, 7 are relevant.
Calculate Precision, Recall and Fallout considering that there are 6 other relevant documents that were not retrieved and that the total number of documents in the collection is 100 (including the 10 retrieved).
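A sketch of how the exercise numbers plug into the formulas above (variable names are illustrative, not from the slides):

```python
retrieved = 10
relevant_retrieved = 4          # documents 1, 3, 5, 7
relevant_total = 4 + 6          # 6 more relevant documents were not retrieved
N = 100                         # collection size, including the 10 retrieved

precision = relevant_retrieved / retrieved                 # 0.4
recall = relevant_retrieved / relevant_total               # 0.4
not_relevant_retrieved = retrieved - relevant_retrieved    # y = 6
fallout = not_relevant_retrieved / (N - relevant_total)    # 6/90 ≈ 0.067
print(precision, recall, round(fallout, 3))
```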
Problems of recall & precision
Recall is hard to find
Neither alone shows effectiveness:
– Comparing the graphs
– F-measure
– Relative performance as another single measure
Recall & precision may not be important for the user
Problems with Recall
Precision can be determined exactly
Recall cannot be determined exactly because it requires knowledge of all the relevant documents in the collection. Recall can only be estimated

Precision = # of relevant docs retrieved / # of retrieved docs
Recall = # of relevant docs retrieved / # of relevant docs
The Need for a Single Measure
To compare two IR systems it would be nice to use just one number, and precision and recall
– are related to each other
– give an incomplete picture of the system
F-Measure (not fallout!)
– F = 2 * (recall * precision) / (recall + precision)
– combines recall and precision in a single efficiency measure (the harmonic mean of precision and recall)
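A minimal Python sketch of the F-measure, useful for sanity-checking its behavior at the extremes:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.4, 0.4))  # 0.4 -- equal P and R give F = P = R
print(f_measure(1.0, 0.2))  # 0.333... -- the harmonic mean punishes imbalance
```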
Relative Performance

R / F = [P / (1 - P)] / [G / (1 - G)]

P / (1 - P): the ratio of relevant to non-relevant among retrieved documents
G / (1 - G): the ratio of relevant to non-relevant in the collection
R / F: relative performance

Relative Performance
Relative performance should be greater than one if we want the system to do better at locating relevant documents than it does at rejecting not-relevant ones:

R / F = [P / (1 - P)] / [G / (1 - G)] > 1
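A small sketch that checks this identity on made-up contingency-table counts (w, x, y, z as defined earlier):

```python
w, x, y, z = 4, 6, 6, 84          # hypothetical contingency-table cells
N, n1 = w + x + y + z, w + x      # collection size and relevant count

P = w / (w + y)                   # precision
R = w / n1                        # recall
F = y / (N - n1)                  # fallout
G = n1 / N                        # generality

print(R / F)                              # relative performance: 6.0
print((P / (1 - P)) / (G / (1 - G)))      # same value via the identity: 6.0
```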
Precision and Recall: User View
It is not clear how important they are for the users:
– Precision is usually more important than recall, because users appreciate output that does not contain non-relevant documents
– This, of course, depends on the kind of user: high recall is important for an attorney who needs to determine all the legal precedents for a case.

What does the user want? The restaurant case
The user wants to find a restaurant serving Sashimi. She can use 2 IR systems. How can we say which one is better?
User-oriented measures
Coverage ratio:
– known_relevant_retrieved / known_relevant
Novelty ratio:
– new_relevant / relevant_retrieved
Relative recall:
– relevant_retrieved / wants_to_examine
Recall effort:
– wants_to_examine / had_to_examine
Coverage and Novelty
Coverage ratio: the proportion of relevant documents known to the user that are actually retrieved
– A high coverage ratio gives the user some confidence that the system is locating all the relevant documents
Novelty ratio: the proportion of relevant retrieved documents that were unknown to the user
– A high novelty ratio suggests that the system is effective in locating documents previously unknown to the user
Coverage and Novelty
For example, if the user knows of 16 relevant documents (but these are not all the relevant documents) and the system retrieves 10 relevant documents, including 4 of those the user knows, we have:

Coverage ratio = 4 / 16
Novelty ratio = 6 / 10

The user may expect 40 relevant documents in total
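Worked out in a short sketch; the 40-document estimate assumes the known documents are a representative sample, which is how the slide's proportion works:

```python
known_relevant = 16
relevant_retrieved = 10
known_relevant_retrieved = 4

coverage = known_relevant_retrieved / known_relevant           # 4/16 = 0.25
new_relevant = relevant_retrieved - known_relevant_retrieved   # 6
novelty = new_relevant / relevant_retrieved                    # 6/10 = 0.6

# The system found 4 of the 16 known documents, so by proportionality the
# 10 retrieved relevant documents suggest 16 * 10/4 = 40 relevant in total.
estimated_total = known_relevant * relevant_retrieved / known_relevant_retrieved
print(coverage, novelty, estimated_total)  # 0.25 0.6 40.0
```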
Relative Recall and Effort
Relative recall: the ratio of relevant retrieved documents examined by the user to the number of documents the user would have liked to examine
– If the system has retrieved 5 relevant documents among 20, how large is the relative recall?
Recall effort: the ratio of the number of relevant documents desired to the number of documents the user had to examine to find that many relevant documents
– this ratio goes to 1 if the relevant docs are the first examined, and toward 0 if the user would need to examine hundreds of documents to find the desired few.
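Under one reading of the question above (the user wanted to examine 20 relevant documents and the system delivered 5), the computation is a one-liner; the recall-effort numbers below are hypothetical:

```python
# Relative recall: relevant retrieved and examined / relevant the user wanted.
relative_recall = 5 / 20     # 0.25

# Recall effort: relevant documents desired / documents the user had to read.
# Hypothetical: the user wanted 5 relevant documents and had to read 50.
recall_effort = 5 / 50       # 0.1
print(relative_recall, recall_effort)
```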
What happens when we increase the number of documents retrieved?
At low retrieval volumes, when we increase the number of documents retrieved, the number of relevant documents increases more rapidly than the number of not-relevant documents
[Diagram: Venn diagram of Relevant vs. Retrieved / Not Retrieved]

What happens when we increase the number of documents retrieved?
At high retrieval volumes, when we increase the number of documents retrieved, the situation is reversed
[Diagram: Venn diagram of Relevant vs. Retrieved / Not Retrieved]
From Query to System Performance
Precision and Recall change with the retrieval volume
Averaging the values obtained might provide an adequate measure of the effectiveness of the system
To evaluate system performance we compute average precision and recall
Three Points Average
Fix recall and count precision!
For a given query, three-point average precision is computed by averaging the precision of the retrieval system at three recall levels, typically:
0.25 0.5 0.75
or
0.2 0.5 0.8
The same can be done for recall
Other Averages
For a given query, eleven-point average precision is computed by averaging the precision of the retrieval system at eleven recall levels:
0.0 0.1 0.2 … 0.9 1.0
If finding exact recall points is hard, it is done at different levels of document retrieval
– 10, 20, 30, 40, 50… relevant retrieved documents
Expected search length
Definition:
– a way to estimate the number of documents that a user has to read in order to find the desired number of relevant documents
– M to examine to find N relevant
Calculation
Graphing
Average search length
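A sketch under one common definition, which is an assumption here since the slide leaves the calculation open: the search length is the number of non-relevant documents the user must read before seeing the desired number of relevant ones:

```python
def search_length(ranked_relevance, n_wanted):
    """Non-relevant documents read before finding n_wanted relevant ones."""
    hits = misses = 0
    for rel in ranked_relevance:
        if rel:
            hits += 1
            if hits == n_wanted:
                return misses
        else:
            misses += 1
    return None  # fewer than n_wanted relevant documents in the list

ranking = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(search_length(ranking, 3))  # 1 non-relevant doc before the 3rd relevant
```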
Taking the Order into Account
The result of a search is not a set but a sequence
Recall and Precision fail to take into account the sequential presentation of the retrieval results
Two documents that contain the same information can be judged by the system in different ways:
– the first in the list is considered relevant
– the second one, maybe separated from the first by many other documents, is considered much less relevant
Frustration
Two systems can give a very different perception if they just organize the same documents in a different way:
– all the relevant documents in the first positions, vs.
– relevant documents scattered at the end of the list
Normalized Recall
To take this effect into account, normalized recall was introduced.
Imagine that we know all the relevant documents
– an ideal system will present all the relevant documents before the not-relevant ones.
Suppose that the relevant ones are 1, 2, 4, 5, 13 in a list of 14 documents. The graph obtained is:

Normalized Recall
[Chart: number of relevant documents found (0 to 5) vs. documents examined (1 to 14), one step curve for the ideal system and one for our system]
Normalized Recall
The area between the two graphs is a measure of the effectiveness of the system. This measure is always reduced to a value between 0 and 1: 1 for the ideal system and 0 for the system that presents all the relevant documents at the end.

Normalized recall = 1 - Difference / (Relevant * (N - Relevant))

where Difference is the area between the two graphs (the sum of the actual ranks of the relevant documents minus the sum of their ideal ranks), Relevant is the number of relevant documents, and N is the collection size.
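A minimal sketch of this formula on the slide's example (relevant documents at ranks 1, 2, 4, 5, 13 in a list of N = 14):

```python
def normalized_recall(relevant_ranks, N):
    """1 - (sum of actual ranks - sum of ideal ranks) / (n * (N - n))."""
    n = len(relevant_ranks)
    ideal_ranks = range(1, n + 1)      # ideal system: relevant docs come first
    difference = sum(relevant_ranks) - sum(ideal_ranks)
    return 1 - difference / (n * (N - n))

print(normalized_recall([1, 2, 4, 5, 13], 14))  # 1 - 10/45 = 0.777...
```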
Sliding Ratio
The sliding ratio is a measure that takes into account the weight (the relevance value) of the retrieved documents and does not need knowledge of all the relevant documents.
Assume that we retrieve N = 5 documents that are ranked by the system. Then assume that the user assigns a relevance value to these documents
Sliding Ratio

doc   relevance w(i)   sum of the weights so far
d1         7.0                 7.0
d2         5.0                12.0
d3         0.0                12.0
d4         2.5                14.5
d5         8.2                22.7
Sliding Ratio
The perfect system would rank the documents in the same order as the user. Documents are rearranged:

doc   relevance w(i)   sum of the weights so far
d5         8.2                 8.2
d1         7.0                15.2
d2         5.0                20.2
d4         2.5                22.7
d3         0.0                22.7
Sliding Ratio
The sliding ratio SR(i) is the ratio of the last two columns (the real system's cumulative weight over the ideal system's cumulative weight) at each rank i:

SR(1) =  7.0 /  8.2 = 0.85
SR(2) = 12.0 / 15.2 = 0.789
SR(3) = 12.0 / 20.2 = 0.594
SR(4) = 14.5 / 22.7 = 0.639
SR(5) = 22.7 / 22.7 = 1.00
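A sketch reproducing the SR column from the user-assigned weights; accumulate builds the cumulative sums for the system's order and for the ideal (descending-weight) order:

```python
from itertools import accumulate

weights = [7.0, 5.0, 0.0, 2.5, 8.2]   # user relevance values for d1..d5, system order
real = list(accumulate(weights))                          # 7.0, 12.0, 12.0, 14.5, 22.7
ideal = list(accumulate(sorted(weights, reverse=True)))   # 8.2, 15.2, 20.2, 22.7, 22.7

for i, (r, s) in enumerate(zip(real, ideal), start=1):
    print(f"rank {i}: SR = {r}/{s} = {r / s:.3f}")
# rank 1: 0.854, rank 2: 0.789, rank 3: 0.594, rank 4: 0.639, rank 5: 1.000
```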
Sliding Ratio
If the number of retrieved documents N is large enough, then SR gives a reasonably accurate picture of the retrieval system's performance
Homework 1
Imagine that an IR system retrieved 20 documents in answer to a query, but only documents number 1, 3, 8, 9, 13, 15, and 20 are relevant.
Calculate Precision, Recall, Fallout and the ratio Recall/Fallout considering that there are 5 other relevant documents that were not retrieved and that the total number of documents in the collection is 100 (including the 20 retrieved).
Explore this problem using the graphing applet