Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation
Guillaume Cabanac, Gilles Hubert,
Mohand Boughanem, Claude Chrisment
CLEF’10: Conference on Multilingual and Multimodal
Information Access Evaluation, September 20-23, Padua, Italy
2
Outline
1. Motivation: A tale about two TREC participants
2. Context: IRS effectiveness evaluation
   Issue: Tie-breaking bias effects
3. Contribution: Reordering strategies
4. Experiments: Impact of the tie-breaking bias
5. Conclusion and Future Work
Effect of the Tie-Breaking Bias G. Cabanac et al.
4
A tale about two TREC participants (1/2)
1. Motivation Tie-breaking bias illustration G. Cabanac et al.
Topic 031 “satellite launch contracts”: 5 relevant documents
Chris (unlucky): C = ( , 0.8), ( , 0.8), ( , 0.5)
Ellen (lucky): E = ( , 0.8), ( , 0.8), ( , 0.5)
One single difference. Why such a huge difference?
5
A tale about two TREC participants (2/2)
Chris: C = ( , 0.8), ( , 0.8), ( , 0.5)
Ellen: E = ( , 0.8), ( , 0.8), ( , 0.5)
After 15 days of hard work, the only difference is the name of one document.
7
Measuring the effectiveness of IRSs
User-centered vs. system-focused [Spärck Jones & Willett, 1997]
Evaluation campaigns:
1958 Cranfield (UK)
1992 TREC: Text Retrieval Conference (USA)
1999 NTCIR: NII Test Collection for IR Systems (Japan)
2001 CLEF: Cross-Language Evaluation Forum (Europe)
…
“Cranfield” methodology: task, test collection (corpus, topics, qrels),
measures (MAP, P@X, …) computed using trec_eval
2. Context & issue Tie-breaking bias G. Cabanac et al.
[Voorhees, 2007]
8
Runs are reordered prior to their evaluation
Qrels = (qid, iter, docno, rel)    Run = (qid, iter, docno, rank, sim, run_id)
( , 0.8), ( , 0.8), ( , 0.5)
Reordering by trec_eval: qid asc, sim desc, docno desc
( , 0.8), ( , 0.8), ( , 0.5)
Effectiveness measure = f(intrinsic_quality, luck): MAP, P@X, MRR…
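The conventional reordering can be sketched in Python (an illustration of the sort keys only, not trec_eval’s actual code; the `(qid, docno, sim)` tuple layout is an assumption):

```python
# Illustrative sketch of trec_eval's conventional reordering
# (qid asc, sim desc, docno desc) -- not trec_eval's actual source.
# A run is assumed to be a list of (qid, docno, sim) tuples.
def conventional_reorder(run):
    # Python's sort is stable, so apply keys from least to most significant.
    run = sorted(run, key=lambda r: r[1], reverse=True)  # docno desc (breaks ties)
    run = sorted(run, key=lambda r: r[2], reverse=True)  # sim desc
    return sorted(run, key=lambda r: r[0])               # qid asc
```

Among documents with equal sim, the one with the lexicographically greatest docno is ranked first: this is exactly where luck enters the measured effectiveness.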
10
Consequences of run reordering
Measures of effectiveness for an IRS s:
RR(s,t): 1/rank of the 1st relevant document, for topic t
P(s,t,d): precision at document d, for topic t
AP(s,t): average precision for topic t
MAP(s): mean average precision
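For reference, the standard definitions of RR and AP can be sketched as follows (a from-scratch sketch, not trec_eval’s code; the names `ranking` and `relevant` are hypothetical):

```python
# Standard definitions of Reciprocal Rank and Average Precision
# (illustrative sketch; `ranking` is an ordered list of docnos,
# `relevant` a set of relevant docnos for the topic).
def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0
```

Both depend only on the ranks of the relevant documents, which is why tie-breaking, by shuffling documents within a tied group, changes their values.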
Tie-breaking bias: is the Wall Street Journal collection more relevant than Associated Press?
Problem 1: comparing 2 systems, AP(s1, t) vs. AP(s2, t)
Problem 2: comparing 2 topics, AP(s, t1) vs. AP(s, t2)
3. Contribution Reordering strategies G. Cabanac et al.
Sensitive to document rank
11
Alternative unbiased reordering strategies
Conventional reordering (TREC): ties sorted Z to A (qid asc, sim desc, docno desc)
Realistic reordering: relevant docs last (qid asc, sim desc, rel asc, docno desc)
Optimistic reordering: relevant docs first (qid asc, sim desc, rel desc, docno desc)
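The three strategies differ only in whether, and in which direction, relevance breaks ties among equal scores. A minimal sketch, assuming a run is a list of `(qid, docno, sim)` tuples and `qrels` maps `(qid, docno)` to a binary judgment (illustrative code, not the authors’ implementation):

```python
# Illustrative sketch of the three reordering strategies.
def reorder(run, qrels, strategy="conventional"):
    rel = lambda r: qrels.get((r[0], r[1]), 0)
    # Stable sorts, applied from least to most significant key,
    # reproduce the multi-key orders listed above.
    run = sorted(run, key=lambda r: r[1], reverse=True)   # docno desc
    if strategy == "realistic":
        run = sorted(run, key=rel)                        # rel asc: relevant last
    elif strategy == "optimistic":
        run = sorted(run, key=rel, reverse=True)          # rel desc: relevant first
    run = sorted(run, key=lambda r: r[2], reverse=True)   # sim desc
    return sorted(run, key=lambda r: r[0])                # qid asc
```

Realistic and optimistic reordering only move documents within a tied group, so they bound the measure values obtainable from the same run.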
13
Effect of the tie-breaking bias
Study of 4 TREC tasks
22 editions
1360 runs
Assessing the effect of tie-breaking:
Proportion of document ties: how frequent is the bias?
Effect on measure values: top 3 observed differences, observed difference in %
Significance of the observed difference: Student’s t-test (paired, one-tailed)
[Timeline: adhoc, routing, web, and filtering tasks, 1993–2009]
3 GB of data from trec.nist.gov
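The significance test named above compares per-topic measure values under two reorderings. A minimal from-scratch sketch of the paired t-statistic (illustrative; the authors’ actual tooling is not specified):

```python
import math

# Paired t-statistic over per-topic measure values (e.g., AP under
# conventional vs. realistic reordering). Illustrative sketch only;
# the resulting t is compared against Student's distribution with
# n-1 degrees of freedom for a one-tailed test.
def paired_t_statistic(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```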
4. Experiments Impact of the tie-breaking bias G. Cabanac et al.
14
Ties demographics
89.6% of the runs comprise ties
Ties are present all along the runs
15
Proportion of tied documents in submitted runs
On average, 10.6 docs per tied group; on average, 25.2% of a result list consists of tied documents
16
Effect on Reciprocal Rank (RR)
17
Effect on Average Precision (AP)
18
Effect on Mean Average Precision (MAP)
Differences in system rankings computed on MAP are not significant (Kendall’s τ)
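The rank correlation referred to here can be sketched from scratch (illustrative code; `rank_a` and `rank_b` are hypothetical dicts mapping each system to its rank under the two reorderings):

```python
from itertools import combinations

# Kendall's tau between two system rankings: the normalized difference
# between concordant and discordant system pairs (illustrative sketch,
# without tie correction).
def kendall_tau(rank_a, rank_b):
    systems = list(rank_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        agree = (rank_a[s1] - rank_a[s2]) * (rank_b[s1] - rank_b[s2])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs
```

A τ close to 1 means the two reorderings rank the systems nearly identically, even when the underlying MAP values differ.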
19
What we learnt: Beware of tie-breaking for AP
Small effect on MAP, larger effect on AP
Measure bounds: AP(realistic) ≤ AP(conventional) ≤ AP(optimistic)
Failure analysis for the ranking process: error bar = element of chance = potential for improvement
padre1, adhoc’94
20
Related works in IR evaluation
[Voorhees, 2007]
Topics reliability? [Buckley & Voorhees, 2000] 25; [Voorhees & Buckley, 2002] error rate; [Voorhees, 2009] n collections
Qrels reliability? [Voorhees, 1998] quality; [Al-Maskari et al., 2008] TREC vs. TREC
Measures reliability? [Buckley & Voorhees, 2000] MAP; [Sakai, 2008] ‘system bias’; [Moffat & Zobel, 2008] new measures; [Raghavan et al., 1989] Precall; [McSherry & Najork, 2008] tied scores
Pooling reliability? [Zobel, 1998] approximation; [Sanderson & Joho, 2004] manual; [Buckley et al., 2007] size adaptation; [Cabanac et al., 2010] tie-breaking bias
22
Conclusion and future work
Context: IR evaluation
TREC and other campaigns based on trec_eval
Contributions:
Measure = f(intrinsic_quality, luck): the tie-breaking bias
Measure bounds (realistic ≤ conventional ≤ optimistic)
Study of the tie-breaking bias effect (conventional vs. realistic) for RR, AP and MAP
Strong correlation, yet significant difference
No difference in system rankings (based on MAP)
Future work:
Study of other / more recent evaluation campaigns
Reordering-free measures
Finer-grained analyses: finding vs. ranking
Impact of the “tie-breaking bias” on IR evaluations G. Cabanac et al.
Thank you
CLEF’10: Conference on Multilingual and Multimodal
Information Access Evaluation, September 20-23, Padua, Italy