MediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
TRANSCRIPT
Change the search algorithm.
How can we know whether we made the users happier?
Different approaches to evaluation
• User studies
• In-situ evaluation
  – A/B Testing
  – Interleaving
• Collection-based evaluation
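Interleaving, listed above, merges the rankings of two systems into a single result list and credits each click to the system that contributed the clicked document. A minimal sketch of the team-draft variant, with made-up document IDs; this is one common formulation, not a specific production implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Team-draft interleaving: each round, a coin flip decides which
    system picks first; each system then contributes its highest-ranked
    document not already in the merged list."""
    all_docs = set(ranking_a) | set(ranking_b)
    merged, team = [], []
    while len(merged) < len(all_docs):
        order = [("A", ranking_a), ("B", ranking_b)]
        if rng.random() < 0.5:
            order.reverse()
        for label, ranking in order:
            doc = next((d for d in ranking if d not in merged), None)
            if doc is not None:
                merged.append(doc)
                team.append(label)
    return merged, team

def credit_clicks(merged, team, clicked):
    """Credit each click on the interleaved list to the contributing system."""
    wins = {"A": 0, "B": 0}
    for doc, label in zip(merged, team):
        if doc in clicked:
            wins[label] += 1
    return wins
```

Aggregated over many queries, the system with more credited clicks is preferred by users.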
In-situ evaluation
A/B Testing
Baseline (control) vs. Experimental (treatment)
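In an A/B test, incoming users are split between the control and treatment rankers and a behavioral metric is compared across the two groups. A minimal sketch, assuming deterministic hash-based bucketing and a click-through-rate comparison via a two-proportion z-test; both are illustrative choices, not details given in the slides:

```python
import hashlib
from math import sqrt

def bucket(user_id, salt="experiment-1"):
    """Deterministically assign a user to control ("A") or treatment ("B")
    by hashing the user id with an experiment-specific salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-statistic for the difference between two click-through rates,
    using the pooled proportion for the standard error."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se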
Collection-based evaluation
Machine Learning
• Feature vectors
• Labels

Cranfield Collections (Information Retrieval)
• Documents
• Queries
• Labels – relevance judgments

Query 1, Query 2, ..., Query N
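Collection-based evaluation scores a system's ranking for each query against the collection's relevance judgments. A sketch using nDCG@k, one common judgment-based metric; the metric choice here is illustrative, and the document IDs and grades are made up:

```python
from math import log2

def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k: DCG of the submitted ranking divided by the DCG of the
    ideal ranking built from the relevance judgments (qrels)."""
    gains = [qrels.get(d, 0) for d in ranked_docs[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Because the judgments are frozen with the collection, any number of systems can be scored and compared on the same queries without new user involvement; that reusability is the core appeal of the Cranfield approach.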
Evaluation Landscape

Cranfield Paradigm (System Focus)
• Simple user model
• Controlled experiments
• Reusable but static test collections

Online Evaluation (User Focus)
• Full user participation
• Many degrees of freedom
• Unrepeatable experiments

Tracks along this spectrum: TREC Tasks, TREC Session, TREC Total Recall, TREC OpenSearch
TREC Total Recall

[Diagram: query → search algorithm → results → human assessor, with judgments fed back into the search algorithm over the document collection]
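The assessor-in-the-loop design can be sketched as a one-document-at-a-time feedback loop; `score` and `assess` below are placeholder callables standing in for a real relevance model and a real human assessor, and are not part of the track's specification:

```python
def total_recall_loop(collection, score, assess, budget):
    """Repeatedly show the assessor the highest-scoring unjudged
    document and record the judgment, which a real system would fold
    back into retraining the scoring model."""
    judged, found = {}, []
    for _ in range(budget):
        unjudged = [d for d in collection if d not in judged]
        if not unjudged:
            break
        doc = max(unjudged, key=lambda d: score(d, judged))
        judged[doc] = assess(doc)       # human judgment
        if judged[doc]:
            found.append(doc)           # relevant document recovered
    return found, judged
```

The goal in Total Recall is to surface (nearly) all relevant documents with as little assessor effort as possible, so the loop is evaluated by recall as a function of the number of judgments.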
TREC Session Track
TREC Session Track [2010-2014]
1. improve search by using session information
2. improve search over an entire user’s session instead of a single query
Example queries: "Paris Luxurious Hotels" vs. "Paris Hilton"
Test Collection
Evaluating Retrieval over Sessions: The TREC Session Track 2011–2014
Ben Carterette (1), Paul Clough (2), Mark Hall (3), Evangelos Kanoulas (4), Mark Sanderson (5)
(1) University of Delaware, (2) University of Sheffield, (3) Edge Hill University, (4) University of Amsterdam, (5) RMIT University
Objectives
• Test whether the retrieval effectiveness of a query could be improved by using previous queries, ranked results, and user interactions.
Test Collection
Four test collections (2011–2014) comprising N sessions of varying length; each session consisted of:
• m_i blocks of user interactions (the session's length);
• the current query q_{m_i} in the session;
• the m_i − 1 blocks of interactions in the session prior to the current query, composed of:
  – the user queries in the session, q_1, q_2, ..., q_{m_i − 1};
  – the ranked list of URLs seen by the user for each of those queries;
  – the set of clicked URLs/snippets.
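The per-session records described above could be represented as follows; the class and field names are illustrative, not the track's official schema:

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One prior block in the session: a query, the ranked URLs the
    user saw for it, and the URLs/snippets the user clicked."""
    query: str
    results: list
    clicks: list = field(default_factory=list)

@dataclass
class Session:
    """A session: the current query q_m plus the m-1 prior blocks."""
    current_query: str
    history: list = field(default_factory=list)

    @property
    def length(self):
        # Session length in queries: prior blocks plus the current query.
        return len(self.history) + 1
```

A participating system receives everything except judgments for the current query and must produce a ranking for `current_query`, optionally exploiting `history`.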
Test Collection Statistics

|                             | 2011         | 2012         | 2013         | 2014         |
|-----------------------------|--------------|--------------|--------------|--------------|
| collection                  | ClueWeb09    | ClueWeb09    | ClueWeb12    | ClueWeb12    |
| topic set size              | 62           | 48           | 61           | 60           |
| topic category distribution | known-item   | 10 exploratory, 6 interpretive, 20 known-item, 12 known-subj | 10 exploratory, 9 interpretive, 32 known-item, 10 known-subj | 15 exploratory, 15 interpretive, 15 known-item, 15 known-subj |
| user population             | U. Sheffield | U. Sheffield | U. Sheffield + IR researchers | MTurk |
| search engine               | BOSS + CW09 filter | BOSS + CW09 filter | indri  | indri        |
| total sessions              | 76           | 98           | 133          | 1,257        |
| sessions per topic          | 1.2          | 2.0          | 2.2          | 21.0         |
| mean length (in queries)    | 3.7          | 3.0          | 3.7          | 3.7          |
| median time between queries | 68.5 s       | 66.7 s       | 72.2 s       | 25.6 s       |
| topics judged               | 62           | 48           | 49           | 51           |
| total relevance judgments   | 19,413       | 17,861       | 13,132       | 16,949       |
Algorithmic Improvements
• Session history can be used to improve effectiveness over basic ad hoc retrieval.
[Figure: maximum change in nDCG@10 from the RL1 baseline per run, runs on the x-axis; series for 2011, 2012, 2013, and 2014]
Topic - System Analysis
• Known-subject and exploratory topics benefit most from access to session history.
• There is substantial variability across topics due to the way users perform their search and formulate their queries.
[Figure: per-topic difference in ΔnDCG@10 over sessions, topics ordered by median and labeled year–topic number (e.g. 2012–10, 2014–40)]
Conclusions
• Retrieval effectiveness can be improved for ad hoc retrieval using data based on session history.
• The more detailed the session data, the greater the improvement.
SIGIR 2016
TREC Session Track [2010-2014]
1. improve search by using session information
2. improve search over an entire user’s session instead of a single query
TREC Tasks Track
TREC Tasks Track [2015–now]
1. understand the user's underlying task
2. assist the user in completing the task
Make Improvements At Home
TASK UNDERSTANDING
Make Improvements At Home
TASK COMPLETION
TREC Session Track [2010-2014]
1. improve search by using session information
2. improve search over an entire user’s session instead of a single query
CLEF Dynamic Search for Complex Tasks
CLEF Complex Tasks [now]
1. Produce a methodology and algorithms that lead to a dynamic test collection by simulating users
2. Understand and quantify what constitutes a good ranking of documents at different stages of a session, and a good overall session
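One way to drive such a dynamic test collection is a simulated user. The sketch below uses a cascade-style click model, with illustrative `p_click` and `p_stop` parameters; this is a common simulation choice, not a model prescribed by the CLEF lab:

```python
import random

def simulate_user(ranking, qrels, p_click=0.8, p_stop=0.3, seed=0):
    """Cascade-style simulated user: scan the ranking top-down, click a
    relevant document with probability p_click, and after each click
    stop the scan with probability p_stop."""
    rng = random.Random(seed)
    clicks = []
    for doc in ranking:
        if qrels.get(doc, 0) > 0 and rng.random() < p_click:
            clicks.append(doc)
            if rng.random() < p_stop:
                break
    return clicks
```

Because the simulated user's clicks are a deterministic function of the ranking, judgments, and seed, a system can be run against the simulator repeatedly, which restores the repeatability that live online evaluation lacks.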
TREC Open Search