On the Measurement of Test Collection Reliability


Post on 13-Dec-2014






The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these indicators are hard to interpret in practice because they do not correspond to well-known indicators like the Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
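The data-based approach mentioned in the abstract can be illustrated with a minimal Python sketch: split the query set at random into two halves, rank the systems on each half, and average the Kendall tau correlation between the two rankings over many trials. This is only an illustration under stated assumptions, not the authors' code: the `scores[s][q]` layout, the function names, and the use of tau-a (ties contribute zero) are all hypothetical choices, and the extrapolation step to larger query sets is omitted.

```python
import random
from itertools import combinations
from statistics import mean


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs contribute zero (an assumption of this sketch).
    """
    pairs = list(combinations(range(len(x)), 2))
    balance = 0  # concordant pairs minus discordant pairs
    for i, j in pairs:
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            balance += 1
        elif sign < 0:
            balance -= 1
    return balance / len(pairs)


def split_half_tau(scores, trials=200, seed=0):
    """Average tau between system rankings on two random query halves.

    scores[s][q] is the effectiveness of system s on query q
    (hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    queries = list(range(len(scores[0])))
    half = len(queries) // 2
    taus = []
    for _ in range(trials):
        rng.shuffle(queries)
        first, second = queries[:half], queries[half:2 * half]
        mean_first = [mean(row[q] for q in first) for row in scores]
        mean_second = [mean(row[q] for q in second) for row in scores]
        taus.append(kendall_tau(mean_first, mean_second))
    return mean(taus)


if __name__ == "__main__":
    # Toy data: 5 systems, 8 queries, no system-query interaction.
    toy = [[10 * s + q for q in range(8)] for s in range(5)]
    print(split_half_tau(toy, trials=20))
```

With real data, a value close to 1 would suggest that the system ranking is stable across query samples of half the original size.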


  • 1. SIGIR 2013, Dublin, Ireland, July 30th. On the Measurement of Test Collection Reliability. @julian_urbano, University Carlos III of Madrid; Mónica Marrero, University Carlos III of Madrid; Diego Martín, Technical University of Madrid. Picture by Philip Milne.
  • 2. Gratefully supported by Student Travel Grant
  • 3. Is System A More Effective than System B? [Figure: difference in effectiveness, on a scale from -1 to 1]
  • 4. Is System A More Effective than System B? Get a test collection and evaluate. Measure the average difference and conclude which one is better.
  • 5. Samples: Test collections are samples from a larger, possibly infinite, population of documents, queries and assessors. The measured difference is only an estimate. How reliable is our conclusion?
  • 6. Reliability vs. Cost: Building reliable collections is easy: just use more documents, more queries, more assessors. But it is prohibitively expensive. Our best bet is to increase query set size.
  • 7. Data-based approach: (1) Randomly split the query set. (2) Compute indicators of reliability based on those two subsets. (3) Extrapolate to larger query sets. With some variations: Voorhees98, Zobel98, Buckley & Voorhees00, Voorhees & Buckley02, Sanderson & Zobel05, Sakai07, Voorhees09.
  • 8. Data-based Reliability: Indicators based on results with two collections. Kendall τ correlation: stability of the ranking of systems. τAP correlation: adds a top-heaviness component. Absolute sensitivity: minimum absolute difference such that swaps stay below a fixed rate. 28,000 datapoints.
  • 35. Example: *All mappings in the paper
  • 36. Example: Eρ² = 0.95 → τ ≈ 0.85 *All mappings in the paper
  • 37. Example: τ = 0.9 → Eρ² ≈ 0.97 *All mappings in the paper
  • 38. Example: Million Query 2007 and Million Query 2008 *All mappings in the paper
  • 39. Future Predictions: Allows us to make more informed decisions within a collection. What about a new collection? Fit a single model for each mapping, with 90% and 95% prediction intervals. Assess whether a larger collection is really worth the effort.
  • 40. Example: *All mappings in the paper
  • 41. Example: current collection *All mappings in the paper
  • 42. Example: current collection target *All mappings in the paper
  • 45. review of TREC collections
  • 46. Outline: Estimate Eρ² and Φ, with 95% confidence intervals, using the full query set. Map onto τ, sensitivity, power, conflicts, etc. Results within each task offer a historical perspective since 1994.
  • 47. Example: Ad Hoc 3-8. Point estimates with confidence intervals, and the queries needed to reach the target reliability levels. 50 queries were used. *All collections and mappings in the paper
  • 48. Example: Web Ad Hoc. TREC-8 to TREC-2001: WT2g and WT10g. TREC-2009 to TREC-2011: ClueWeb09. Point estimates with confidence intervals, and the queries needed to reach the target reliability level, for each period. 50 queries were used.
  • 49. Historical Trend: Decreasing within and across tracks?
  • 50. Historical Trend: Systems getting better for specific problems?
  • 51. Historical Trend: Increasing task-specificity in queries?
  • 52. summing up
  • 53. Generalizability Theory: Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability. Very sensitive to the initial data used to estimate variance components. Almost impossible to interpret in practical terms.
  • 54. Sensitivity of G-Theory: About 50 queries and 50 systems are needed for robust estimates. Caution if building a new collection. Can always use confidence intervals.
  • 55. Interpretation of G-Theory: Empirical mapping onto traditional indicators of reliability like τ correlation. Eρ² = 0.95 → τ ≈ 0.85. τ = 0.9 → Eρ² ≈ 0.97.
  • 56. Historical Reliability in TREC: On average, = . . Some collections are clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011. 50 queries are not enough for stable rankings; about 200 are needed.
  • 57. Implications: Fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task. Need to analyze on a case-by-case basis, while building the collections.
  • 58. to be continued
  • 59. Future Work: Study the assessor effect. Study the document-collection effect. Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice theory does not hold. Methods to reliably measure reliability while building the collection.
  • 60. Source Code Online: Code for the R statistical software. G-study and D-study. Required number of queries. Map onto data-based indicators. Confidence intervals. All in two simple steps.
  • 61. G-Theory is too sensitive to initial data. Questionable with small collections. Compute confidence intervals. Need Eρ² ≈ 0.97 for τ = 0.9. 50 queries are not enough for stable rankings. Fixing a minimum number of queries across tasks is unrealistic. Need to analyze on a case-by-case basis.
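The G-study and D-study steps named in the slides can be sketched in a few lines. The slides point to R code; the Python below is not that code, only a minimal illustration that assumes the textbook Generalizability Theory formulas for a fully crossed systems × queries design with one score per cell. All function names and the `scores[s][q]` layout are hypothetical, and negative variance estimates are clipped to zero as a simplifying assumption.

```python
import math
from statistics import mean


def g_study(scores):
    """Estimate variance components from a systems x queries score matrix.

    scores[s][q] is the effectiveness of system s on query q
    (hypothetical layout). Returns (var_system, var_query, var_residual).
    """
    n_s, n_q = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_means = [mean(row) for row in scores]
    qry_means = [mean(scores[s][q] for s in range(n_s)) for q in range(n_q)]

    # Mean squares of the two-way ANOVA without replication.
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ss_e = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s)
        for q in range(n_q)
    )
    ms_e = ss_e / ((n_s - 1) * (n_q - 1))

    var_e = ms_e
    var_s = max((ms_s - ms_e) / n_q, 0.0)  # clipped at zero (assumption)
    var_q = max((ms_q - ms_e) / n_s, 0.0)  # clipped at zero (assumption)
    return var_s, var_q, var_e


def d_study(var_s, var_q, var_e, n_queries):
    """Generalizability (E rho^2) and dependability (Phi) for n_queries."""
    e_rho2 = var_s / (var_s + var_e / n_queries)
    phi = var_s / (var_s + (var_q + var_e) / n_queries)
    return e_rho2, phi


def queries_needed(var_s, var_e, target):
    """Smallest query set size whose E rho^2 reaches the target level."""
    return math.ceil(target * var_e / ((1 - target) * var_s))


if __name__ == "__main__":
    toy = [[1, 3], [2, 2], [5, 5]]  # 3 systems, 2 queries (made-up data)
    var_s, var_q, var_e = g_study(toy)
    print(d_study(var_s, var_q, var_e, n_queries=2))
    print(queries_needed(var_s, var_e, target=0.95))
```

The point estimates produced this way depend heavily on the initial sample of systems and queries, which is why the slides recommend reporting confidence intervals alongside them; interval computation is not covered by this sketch.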

