Transcript
Page 1: On the Measurement of Test Collection Reliability

SIGIR 2013 Dublin, Ireland · July 30th Picture by Philip Milne

On the Measurement of Test Collection Reliability

@julian_urbano University Carlos III of Madrid

Mónica Marrero University Carlos III of Madrid

Diego Martín Technical University of Madrid

Page 2: On the Measurement of Test Collection Reliability

Gratefully supported by Student Travel Grant

Page 3: On the Measurement of Test Collection Reliability

Is System A More Effective than System B?

[Figure: Δeffectiveness axis from -1 to 1, with 0 and 𝑑 marked]

Page 4: On the Measurement of Test Collection Reliability

Is System A More Effective than System B?

Get a test collection and evaluate

Measure the average difference 𝒅

and conclude which one is better

Page 5: On the Measurement of Test Collection Reliability

Samples

Test collections are samples from a larger, possibly infinite, population

Documents, queries and assessors

𝒅 is only an estimate

How reliable is our conclusion?

Page 6: On the Measurement of Test Collection Reliability

Reliability vs. Cost

Building reliable collections is easy…

Just use more documents, more queries, more assessors

…but it is prohibitively expensive

Our best bet is to increase query set size

Page 7: On the Measurement of Test Collection Reliability

Data-based approach

1. Randomly split the query set

2. Compute indicators of reliability based on those two subsets

3. Extrapolate to larger query sets

...with some variations

Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09
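The split-half procedure above can be sketched as follows. This is a minimal illustration, not the code used in the cited studies: the function names, the toy score matrix, and the choice of Kendall tau-a (ties counted as neither concordant nor discordant) are all assumptions made for the example.

```python
import random

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def split_half_tau(scores, rng):
    """scores[s][q] is the effectiveness of system s on query q.
    Randomly split the queries into two halves, average each system's
    scores per half, and correlate the two resulting system rankings."""
    queries = list(range(len(scores[0])))
    rng.shuffle(queries)
    half = len(queries) // 2
    first, second = queries[:half], queries[half:2 * half]
    mean_first = [sum(row[q] for q in first) / half for row in scores]
    mean_second = [sum(row[q] for q in second) / half for row in scores]
    return kendall_tau(mean_first, mean_second)

# Hypothetical toy data: 4 well-separated systems x 6 queries.
scores = [
    [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    [0.6, 0.5, 0.6, 0.5, 0.6, 0.5],
    [0.3, 0.4, 0.3, 0.4, 0.3, 0.4],
    [0.1, 0.2, 0.1, 0.2, 0.1, 0.2],
]
tau = split_half_tau(scores, random.Random(0))
```

Repeating the split many times and averaging the resulting correlations gives one data-based indicator; extrapolating to larger query sets is a separate modelling step not shown here.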

Page 8: On the Measurement of Test Collection Reliability

Data-based Reliability Indicators (based on results with two collections)

Kendall $\tau$ correlation: stability of the ranking of systems

$\tau_{AP}$ correlation: adds a top-heaviness component

Absolute sensitivity: minimum absolute 𝒅 s.t. swaps < 5%

Relative sensitivity: minimum relative 𝒅 s.t. swaps < 5%

Page 9: On the Measurement of Test Collection Reliability

Data-based Reliability Indicators (based on results with two collections)

Power ratio: statistically significant results

Minor conflict ratio: statistically non-significant swap

Major conflict ratio: statistically significant swap

RMSE: differences in 𝒅

Page 10: On the Measurement of Test Collection Reliability

Generalizability Theory

Directly address variability of scores

G-study: estimate variance components from previous, representative data

D-study: estimate reliability based on the estimated variance components

Page 11: On the Measurement of Test Collection Reliability

G-study

$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$$

Estimated using Analysis of Variance

From previous data, usually an existing test collection

Page 12: On the Measurement of Test Collection Reliability

G-study

$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$$

Estimated using Analysis of Variance

From previous data, usually an existing test collection

$\sigma_s^2$: system differences, our goal!

Page 13: On the Measurement of Test Collection Reliability

G-study

$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$$

Estimated using Analysis of Variance

From previous data, usually an existing test collection

$\sigma_s^2$: system differences, our goal!

$\sigma_q^2$: query difficulty

Page 14: On the Measurement of Test Collection Reliability

G-study

$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$$

Estimated using Analysis of Variance

From previous data, usually an existing test collection

$\sigma_s^2$: system differences, our goal!

$\sigma_q^2$: query difficulty

$\sigma_{s:q}^2$: some systems better for some queries
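A minimal sketch of a G-study, assuming a fully crossed systems × queries design with one score per cell, in which case the interaction component is confounded with residual error. The function name and the toy data are hypothetical; the estimators are the usual ANOVA expected-mean-square solutions, with negative estimates truncated to zero.

```python
def g_study(scores):
    """Estimate (var_s, var_q, var_sq) from scores[s][q]."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[q] for row in scores) / n_s for q in range(n_q)]

    # Sums of squares for the two main effects and the residual.
    ss_s = n_q * sum((m - grand) ** 2 for m in sys_means)
    ss_q = n_s * sum((m - grand) ** 2 for m in qry_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_s - ss_q

    # Mean squares.
    ms_s = ss_s / (n_s - 1)
    ms_q = ss_q / (n_q - 1)
    ms_res = ss_res / ((n_s - 1) * (n_q - 1))

    # Solve the expected mean squares for the variance components.
    var_sq = ms_res
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    var_q = max((ms_q - ms_res) / n_s, 0.0)
    return var_s, var_q, var_sq

# Hypothetical toy data: 3 systems x 4 queries.
toy = [
    [0.30, 0.45, 0.20, 0.55],
    [0.25, 0.50, 0.10, 0.40],
    [0.10, 0.30, 0.15, 0.35],
]
var_s, var_q, var_sq = g_study(toy)
```

With purely additive scores (system effect plus query effect, no noise) the interaction estimate is zero and the other two components reduce to the sample variances of the effects, which is a convenient sanity check.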

Page 15: On the Measurement of Test Collection Reliability

D-study

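Relative stability

$$E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_{s:q}^2}{n_q'}}$$

Absolute stability

$$\Phi = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_q^2 + \sigma_{s:q}^2}{n_q'}}$$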

Easy to estimate how many queries we need for a certain stability level
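A short sketch of the D-study computations, assuming the variance components are already estimated. Solving the two stability formulas for the number of queries gives the required query set size for a target level. The function names and the numeric variance components below are hypothetical, chosen only for illustration.

```python
import math

def e_rho2(var_s, var_sq, n_q):
    """Relative stability for a query set of size n_q."""
    return var_s / (var_s + var_sq / n_q)

def phi(var_s, var_q, var_sq, n_q):
    """Absolute stability for a query set of size n_q."""
    return var_s / (var_s + (var_q + var_sq) / n_q)

def queries_for_e_rho2(var_s, var_sq, target):
    """Smallest n_q such that e_rho2 >= target (solve the formula for n_q)."""
    return math.ceil(target * var_sq / (var_s * (1 - target)))

def queries_for_phi(var_s, var_q, var_sq, target):
    """Smallest n_q such that phi >= target."""
    return math.ceil(target * (var_q + var_sq) / (var_s * (1 - target)))

# Hypothetical variance components, not taken from any TREC collection.
var_s, var_q, var_sq = 0.01, 0.03, 0.021
n_rel = queries_for_e_rho2(var_s, var_sq, 0.95)      # 40
n_abs = queries_for_phi(var_s, var_q, var_sq, 0.95)  # 97
```

Absolute stability always requires at least as many queries as relative stability, because the query difficulty component enters its error term.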

Page 16: On the Measurement of Test Collection Reliability

Generalizability Theory

Proposed by Bodoff’07

Kanoulas & Aslam’09 derive optimal gain & discount in nDCG

TREC Million Query Track

≈80 queries sufficient for stable rankings

≈130 queries for stable absolute scores

Page 17: On the Measurement of Test Collection Reliability

In this Paper / Talk

How sensitive is the D-study to the initial data used in the G-study?

How to interpret G-theory in practice: why $E\rho^2 = 0.95$ and $\Phi = 0.95$?

From the above two, review the reliability of >40 TREC test collections

Page 18: On the Measurement of Test Collection Reliability

variability of G-theory indicators of reliability

Page 19: On the Measurement of Test Collection Reliability

Data

43 TREC collections from TREC-3 to TREC 2011

12 tasks across 10 tracks

Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog

Page 20: On the Measurement of Test Collection Reliability

Experiment

Vary number of queries in G-study from $n_q = 5$ to full set

Use all runs available

Run D-study

Compute $E\rho^2$ and $\Phi$

Compute $n_q'$ to reach 0.95 stability

200 random trials
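The experiment can be sketched roughly as below: draw random query subsets, run the G-study on each, and look at the spread of the resulting D-study estimate. This is an illustration on synthetic scores, not the procedure or data from the paper; the function names, the Gaussian score model, and the seed are assumptions.

```python
import random

def variance_components(scores):
    """Return (var_s, var_sq) for a crossed systems x queries design."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(row[q] for row in scores) / n_s for q in range(n_q)]
    ss_s = n_q * sum((m - grand) ** 2 for m in sys_means)
    ss_q = n_s * sum((m - grand) ** 2 for m in qry_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_s = ss_s / (n_s - 1)
    ms_res = (ss_total - ss_s - ss_q) / ((n_s - 1) * (n_q - 1))
    return max((ms_s - ms_res) / n_q, 0.0), ms_res

def e_rho2_range(scores, subset_size, n_q_target, trials, rng):
    """Min and max relative stability (for n_q_target queries) over
    G-studies run on random query subsets of the given size."""
    all_queries = range(len(scores[0]))
    estimates = []
    for _ in range(trials):
        subset = rng.sample(all_queries, subset_size)
        sub = [[row[q] for q in subset] for row in scores]
        var_s, var_sq = variance_components(sub)
        estimates.append(var_s / (var_s + var_sq / n_q_target))
    return min(estimates), max(estimates)

# Synthetic scores (assumption): 20 systems x 50 queries,
# system effect + query effect + noise.
rng = random.Random(2013)
sys_effect = [rng.gauss(0.3, 0.05) for _ in range(20)]
qry_effect = [rng.gauss(0.0, 0.1) for _ in range(50)]
scores = [[s + q + rng.gauss(0.0, 0.1) for q in qry_effect] for s in sys_effect]

low, high = e_rho2_range(scores, subset_size=10, n_q_target=50, trials=200, rng=rng)
```

The gap between `low` and `high` shows how much the estimate depends on which queries happen to be in the initial data.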

Page 21: On the Measurement of Test Collection Reliability

Variability due to queries

Page 22: On the Measurement of Test Collection Reliability

Variability due to queries

We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on which 10 queries we use

Page 23: On the Measurement of Test Collection Reliability

Experiment (II)

The same, but vary number of systems from $n_s = 5$ to full set

Use all queries available

200 random trials

Page 24: On the Measurement of Test Collection Reliability

Variability due to systems

Page 25: On the Measurement of Test Collection Reliability

Variability due to systems

We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on which 20 systems we use

Page 26: On the Measurement of Test Collection Reliability

Results

G-Theory is very sensitive to initial data

Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1

Number of queries for $E\rho^2 = 0.95$ may vary by orders of magnitude

Microblog2011 (all 184 systems and 30 queries): need 63 to 133 queries

Medical2011 (all 34 queries and 40 systems): need 109 to 566 queries

Page 27: On the Measurement of Test Collection Reliability

Use Confidence Intervals

Bodoff’08: confidence intervals in G-study

But what about the D-study? Feldt’65 and Arteaga et al.’82

Work reasonably well even when assumptions are violated (Brennan’01)

Page 28: On the Measurement of Test Collection Reliability

Example

Page 29: On the Measurement of Test Collection Reliability

Example

Page 30: On the Measurement of Test Collection Reliability

Example

Account for variability in initial data

Page 31: On the Measurement of Test Collection Reliability

Example

Required number of queries to reach the

lower end of the interval

Page 32: On the Measurement of Test Collection Reliability

Summary in TREC (that is, the 43 collections we study here)

$E\rho^2$: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long

$\Phi$: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long

Page 33: On the Measurement of Test Collection Reliability

interpretation of G-Theory indicators of reliability

Page 34: On the Measurement of Test Collection Reliability

Experiment

Split query set in 2 subsets, from $n_q = 10$ to full set / 2

Use all runs available

Run D-study

Compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc.

50 random trials

>28,000 datapoints

Page 35: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉

*All mappings in the paper

Page 36: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉

𝐸𝜌2 = 0.95 → 𝜏 ≈ 0.85

*All mappings in the paper

Page 37: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉

𝜏 = 0.9 → 𝐸𝜌2 ≈ 0.97

*All mappings in the paper

Page 38: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉

Million Query 2007

Million Query 2008

*All mappings in the paper

Page 39: On the Measurement of Test Collection Reliability

Future Predictions

Allows us to make more informed decisions within a collection

What about a new collection?

Fit a single model for each mapping with 90% and 95% prediction intervals

Assess whether a larger collection

is really worth the effort

Page 40: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉

*All mappings in the paper

Page 41: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉 current collection

*All mappings in the paper

Page 42: On the Measurement of Test Collection Reliability

Example: 𝑬𝝆𝟐 → 𝝉 current collection target

*All mappings in the paper

Page 43: On the Measurement of Test Collection Reliability

Example: $\Phi$ → rel. sensitivity

Page 44: On the Measurement of Test Collection Reliability

Example: $\Phi$ → rel. sensitivity

Page 45: On the Measurement of Test Collection Reliability

review of TREC collections

Page 46: On the Measurement of Test Collection Reliability

Outline

Estimate $E\rho^2$ and $\Phi$, with 95% confidence intervals, using the full query set

Map onto $\tau$, sensitivity, power, conflicts, etc.

Results within task offer historical perspective since 1994

Page 47: On the Measurement of Test Collection Reliability

Example: Ad Hoc 3-8

$E\rho^2 \in [0.86, 0.93]$ → $\tau \in [0.65, 0.81]$

minor conflicts $\in [0.6, 8.2]\%$

major conflicts $\in [0.02, 1.38]\%$

Queries to get $E\rho^2 = 0.95$: $[37, 233]$

Queries to get $\Phi = 0.95$: $[116, 999]$

50 queries were used

*All collections and mappings in the paper

Page 48: On the Measurement of Test Collection Reliability

Example: Web Ad Hoc

TREC-8 to TREC-2001: WT2g and WT10g

$E\rho^2 \in [0.86, 0.93]$ → $\tau \in [0.65, 0.81]$

Queries to get $E\rho^2 = 0.95$: $[40, 220]$

TREC-2009 to TREC-2011: ClueWeb09

$E\rho^2 \in [0.8, 0.83]$ → $\tau \in [0.53, 0.59]$

Queries to get $E\rho^2 = 0.95$: $[107, 438]$

50 queries were used

Page 49: On the Measurement of Test Collection Reliability

Historical Trend

Decreasing within and across tracks?

Page 50: On the Measurement of Test Collection Reliability

Historical Trend

Systems getting better for specific problems?

Page 51: On the Measurement of Test Collection Reliability

Historical Trend

Increasing task-specificity in queries?

Page 52: On the Measurement of Test Collection Reliability

summing up

Page 53: On the Measurement of Test Collection Reliability

Generalizability Theory

Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability

Very sensitive to the initial data used to estimate variance components

Almost impossible to interpret in practical terms

Page 54: On the Measurement of Test Collection Reliability

Sensitivity of G-Theory

About 50 queries and 50 systems are needed for robust estimates

Caution if building a new collection

Can always use confidence intervals

Page 55: On the Measurement of Test Collection Reliability

Interpretation of G-Theory

Empirical mapping onto traditional indicators of reliability like $\tau$ correlation

$\tau = 0.9$ → $E\rho^2 \approx 0.97$

$E\rho^2 = 0.95$ → $\tau \approx 0.85$

Page 56: On the Measurement of Test Collection Reliability

Historical Reliability in TREC

On average, $E\rho^2 = 0.88$ → $\tau \approx 0.7$

Some collections clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011

50 queries not enough for stable rankings; about 200 are needed

Page 57: On the Measurement of Test Collection Reliability

Implications

Fixing a minimum number of queries across tracks is unrealistic

Not even across editions of the same task

Need to analyze on a case-by-case basis, while building the collections

Page 58: On the Measurement of Test Collection Reliability

to be continued…

Page 59: On the Measurement of Test Collection Reliability

Future Work

Study assessor effect

Study document-collection effect

Better models to map G-Theory onto data-based indicators

We fitted theoretically correct(-ish) models, but in practice theory does not hold

Methods to reliably measure reliability while building the collection

Page 60: On the Measurement of Test Collection Reliability

Source Code Online

Code for R stats software

G-study and D-study

Required number of queries

Map onto data-based indicators

Confidence intervals

...in two simple steps

Page 61: On the Measurement of Test Collection Reliability

G-Theory too sensitive to initial data

Questionable with small collections

Compute confidence intervals

Need $E\rho^2 \approx 0.97$ for $\tau = 0.9$

50 queries not enough for stable rankings

Fixing a minimum number of queries across tasks is unrealistic

Need to analyze on a case-by-case basis

