2009 UIIR Workshop Proceedings


Vol-512

Copyright 2009 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors.

Understanding the User - Logging and Interpreting User Interactions in Information Search and Retrieval (UIIR-2009)

Proceedings of the Workshop on Understanding the User - Logging and Interpreting User Interactions in Information Search and Retrieval (UIIR-2009), Boston, MA, USA, July 23, 2009.

Workshop in Conjunction with SIGIR-2009, Boston, MA, USA, July 19-23, 2009.

    Edited by

Nicholas J. Belkin *, Ralf Bierig *, Georg Buscher %, Ludger van Elst %, Jacek Gwizdka *, Joemon Jose #, Jaime Teevan &

    * Rutgers University, USA

    % DFKI GmbH, Kaiserslautern, Germany

    # Glasgow University, Scotland

    & Microsoft Research, USA


    Table of Contents

    Preface, pages i-v

1. Demonstration of Improved Search Result Relevancy Using Real-Time Implicit Relevance Feedback. Mark Cramer, Mike Wertheim and David Hardtke, pages 1-7

2. A User-Centered Experiment and Logging Framework for Interactive Information Retrieval. Ralf Bierig, Jacek Gwizdka and Michael Cole, pages 8-11

3. Incorporating User Behavior Information in IR Evaluation. Emine Yilmaz, Milad Shokouhi, Nick Craswell and Stephen Robertson, pages 12-15

4. Faceted Search for Library Catalogs: Developing Grounded Tasks and Analyzing Eye-Tracking Data. Robert Capra, Bill Kules, Matt Banta and Tito Sierra, pages 16-18

5. How Task Types and User Experiences Affect Information-Seeking Behavior on the Web: Using Eye-tracking and Client-side Search Logs. Hitomi Saito, Hitoshi Terai, Yuka Egusa, Masao Takaku, Makiko Miwa and Noriko Kando, pages 19-22

6. Framework of a Real-Time Adaptive Hypermedia System. Rui Li, Evelyn Rozanski and Anne Haake, pages 23-27

7. Inferring the Public Agenda from Implicit Query Data. Laura Granka, pages 28-31

8. Evaluation of Digital Library Services Using Complementary Logs. Maristella Agosti, Franco Crivellari and Giorgio Maria Di Nunzio, pages 32-35

9. Watching Through the Web: Building Personal Activity and Context-Aware Interfaces using Web Activity Streams. Max Van Kleek, David Karger and mc schraefel, pages 36-39

10. Annotating URLs with Query Terms: What Factors Predict Reliable Annotations? Suzan Verberne, Max Hinne, Maarten van der Heijden, Eva D'hondt, Wessel Kraaij and Theo van der Weide, pages 40-43

11. Evaluating the Impact of Snippet Highlighting in Search. Tereza Iofciu, Nick Craswell and Milad Shokouhi, pages 44-47

12. Using Domain Models for Context-Rich User Logging. Stephen Dignum, Yunhyong Kim, Udo Kruschwitz, Dawei Song, Maria Fasli and Anne De Roeck, pages 48-50

13. Catching the User - User Context through Live Logging in DAFFODIL. Claus-Peter Klas and Matthias Hemmje, pages 51-52

14. Massive Implicit Feedback: Organizing Search Logs into Topic Maps for Collaborative Surfing. Xuanhui Wang and ChengXiang Zhai, pages 53-54

15. HCI Browser: A Tool for Studying Web Search Behavior. Robert Capra, pages 55-56

16. Catching the User - Logging the Information Retrieval Dialogue. Paul Landwich, Claus-Peter Klas and Matthias Hemmje, pages 57-59

17. Identifying User Behaviour Between Logged Interactions. Max Wilson and mc schraefel, pages 60-61

The preface and all papers are available online at http://ceur-ws.org/Vol-512/ (preface.pdf, paper01.pdf through paper17.pdf).

    PREFACE

Proceedings of the SIGIR 2009 Workshop on Understanding the User - Logging and Interpreting User Interactions in Information Search and Retrieval

Georg Buscher
DFKI GmbH
[email protected]

Jacek Gwizdka
Rutgers University
[email protected]

Jaime Teevan
Microsoft Research
[email protected]

Nicholas J. Belkin
Rutgers University
[email protected]

Ralf Bierig
Rutgers University
[email protected]

Ludger van Elst
DFKI GmbH
[email protected]

Joemon Jose
Glasgow University
[email protected]

1 Introduction

Modern information search systems can benefit greatly from using additional information about the user and the user's behavior, and research in this area is active and growing. Feedback data based on direct interaction (e.g., clicks, scrolling, etc.) as well as on user profiles/preferences has proven valuable for personalizing the search process, e.g., from how queries are understood to how relevance is assessed. New technology has made it inexpensive and easy to collect more feedback data and more different types of data (e.g., gaze, emotional, or biometric data).

The workshop "Understanding the User - Logging and Interpreting User Interactions in Information Search and Retrieval" documented in this volume was held in conjunction with the 32nd Annual International ACM SIGIR Conference. It focused on discussing and identifying the most promising research directions with respect to logging, interpreting, integrating, and using feedback data. The workshop aimed at bringing together researchers, especially from the domains of IR and human-computer interaction, interested in the collection, interpretation, and application of user behavior logging for search. Ultimately, one of the main goals was to arrange a commonly shared collection of user interaction logging tools based on a variety of feedback data sources, as well as best practices for their usage.

2 Structure of the Workshop

Since one of the main goals of the workshop was to gather practical information and best practices about logging tools, it was structured in a way that fosters collaboration and discussion among its participants. It was therefore less presentation-intensive (it included only 4 oral paper presentations) and contained more collaboration-supporting elements: participant introductions, poster presentations, a panel discussion, and, most importantly, group discussions.


This was also reflected in the types of possible submissions: Experience papers (4 pages) should describe experiences with acquiring, logging, interpreting and/or using interaction data. Demos of applications or new technology could be presented. Position statements should focus on types of user interaction data, their interpretation, and their use.

Each of the papers and demo descriptions was reviewed by two members of the program committee. The program committee also judged the interestingness of each paper with regard to oral presentation (e.g., suitability to spawn discussion). The final selection of the 4 papers for oral presentation was also made with respect to the diversity of topics and approaches they covered. The accepted demos and all remaining accepted papers were selected for poster presentation.

Table 1: Scenarios workshop participants focused on with respect to logging and using (implicit) user interaction data

Types of information interacted with:
- Information visualizations / search interfaces
- Web text documents
- Personal information (emails, files on the desktop)
- Notes/annotations in documents
- Music
- Images
- Structured or semi-structured data (e.g., medical information)
- Physical content (pictures, books)

Types of (implicit) interaction data:
- Queries
- Clicks, URL visits
  - Identification of interaction patterns, e.g., repeat actions (repeat queries, repeat URL visits)
- Notes/annotations
- Changes made by the author in a document
- Eye movements
- Biometric feedback: EEG, galvanic skin response (GSR), facial expressions

Uses of implicit interaction data:
- Modeling the user
  - Identification of domain knowledge / expertise
  - Better expression of interests
  - Emotion detection (frustration, stress)
  - Identification of good / bad experiences
- Personalization / contextualization
  - Improving relevance
  - Proactive information delivery
- Introspection / reflection (e.g., analyzing what makes a good searcher)
- Finding better ways to display retrieved information

The program of the workshop also reflected the focus on collaboration: it started with an extended participant introduction session in which each participant was asked to briefly present his or her main research interests related to the workshop's topics. A poster and demo session followed, succeeded by oral presentations of the 4 selected papers. After each paper, there was limited time for focused questions. In that way, each participant got the chance to see all workshop submissions (either as posters or presentations) and to talk to the authors, after which a panel with 3 panelists was formed based on submitted position statements. Following the panel discussion, breakout groups were formed based on common research interests and practical issues collected during the participant introduction session. The workshop ended with a summary of the achieved results and the next steps to take.

    In Table 1, we give an overview of the range of scenarios focused on by the different attendees.

    Table 2 shows topics the participants were most interested in.

Table 2: Topics of interest

Topics focused on in the above scenarios:
- Tools for processing low-level logs (e.g., eye tracking, EEG, ...)
- Ways to combine implicit and explicit feedback data (frameworks)
- Ways (tools) to record context (current task, etc.)
- Sharing of logging tools and log data sets (collection of tools, data formats, etc.)
- Uses for implicit data:
  - Improving information experiences in the aggregate
  - Personalizing information experiences
  - Social sciences: reflecting on people in the aggregate
  - Introspection: reflecting on self or the individual
- Validity of collected data (collected in the wild vs. in a user study; dependence on the collection tools used)
- Privacy issues

3 Paper, Poster and Demo Presentations

In this section, we group and briefly list the papers that were accepted for the workshop. Overall, 11 experience papers and 4 demos were accepted; they are arranged into 5 topical groups below. Four papers (one from 4 of the 5 groups) were selected for oral presentation.

    Logging tools / frameworks

- Oral presentation by Ralf Bierig, Jacek Gwizdka and Michael Cole: A User-Centered Experiment and Logging Framework for Interactive Information Retrieval. They presented a framework for multidimensional (interaction) data logging that can be used to conduct interactive IR experiments.

- Demo by Claus-Peter Klas and Matthias Hemmje: Catching the User - User Context through Live Logging in DAFFODIL. This demo presented an interactive IR experimentation framework that can be used to log events during a search session, such as querying, browsing, storing, and modifying contents, on several levels.

- Demo by Robert Capra: HCI Browser: A Tool for Studying Web Search Behavior. This demo showed a browser extension that contains the most important functionalities needed when conducting a browser-based user study, such as logging browser-specific events and presenting questionnaires to the user before and after an experiment.

- Demo by Stephen Dignum, Yunhyong Kim, Udo Kruschwitz, Dawei Song, Maria Fasli and Anne De Roeck: Using Domain Models for Context-Rich User Logging. The demo presented an interface where users can explore a domain using structured representations thereof. The authors propose using the explored paths of the domain model as contextual feedback.


    Analyzing user behavior logs

- Oral presentation by Robert Capra, Bill Kules, Matt Banta and Tito Sierra: Faceted Search for Library Catalogs: Developing Grounded Tasks and Analyzing Eye-Tracking Data. The authors aim at examining how faceted search interfaces are used in a digital library. They conducted an eye-tracking user study and discuss challenges and approaches for analyzing gaze data.

- Poster by Hitomi Saito, Hitoshi Terai, Yuka Egusa, Masao Takaku, Makiko Miwa and Noriko Kando: How Task Types and User Experiences Affect Information-Seeking Behavior on the Web: Using Eye-tracking and Client-side Search Logs. They used screen-capture logs and eye tracking to identify differences in search behavior according to task type and search experience.

- Poster by Maristella Agosti, Franco Crivellari and Giorgio Maria Di Nunzio: Evaluation of Digital Library Services Using Complementary Logs. The authors argue that analyzing query logs alone is not sufficient to study user behavior; rather, analyzing a larger variety of behavior logs (beyond query logs) and combining them leads to more accurate results.

    Analyzing query logs in the aggregate

- Poster by Laura Granka: Inferring the Public Agenda from Implicit Query Data. The author presents an approach for applying query log analysis to create indicators of political interest. As an example, poll ratings of presidential candidates are approximated through query log analysis.

- Poster by Suzan Verberne, Max Hinne, Maarten van der Heijden, Eva D'hondt, Wessel Kraaij and Theo van der Weide: Annotating URLs with Query Terms: What Factors Predict Reliable Annotations? The authors try to determine factors that predict the quality of URL annotations from query terms found in query logs.

    Interpreting interaction feedback for an improved immediate/aggregated search/browsing experience

- Oral presentation by Mark Cramer, Mike Wertheim and David Hardtke: Demonstration of Improved Search Result Relevancy Using Real-Time Implicit Relevance Feedback. The paper reports on Surf Canyon, an existing browser plugin that interprets users' browsing behaviors to immediately improve the ranking of results from commercial search engines. They show that incorporating user behavior can drastically improve overall result relevancy in the wild.

- Poster by Rui Li, Evelyn Rozanski and Anne Haake: Framework of a Real-Time Adaptive Hypermedia System. The authors present an adaptive hypermedia system that makes use of both browsing behavior and eye movement data of a user while interacting with the system. They use this information to automatically re-arrange information for a more suitable presentation to the user.

- Poster by Max Van Kleek, David Karger and mc schraefel: Watching Through the Web: Building Personal Activity and Context-Aware Interfaces using Web Activity Streams. They use user activity logs from Web-based information to build more personalized, activity-sensitive information tools. They particularly focus on activity-based organization of user-created notes.

- Demo by Xuanhui Wang and ChengXiang Zhai: Massive Implicit Feedback: Organizing Search Logs into Topic Maps for Collaborative Surfing. In this demo, search and browsing logs from Web searchers are organized into topic maps so that users can follow the footprints of searchers who had similar information needs before.

    Behavior-based evaluation measures

- Oral presentation by Emine Yilmaz, Milad Shokouhi, Nick Craswell and Stephen Robertson: Incorporating User Behavior Information in IR Evaluation. The authors introduce a new user-centric measure (Expected Browsing Utility, EBU) for information retrieval evaluation which is reconciled with click log information from search engines.

- Poster by Tereza Iofciu, Nick Craswell and Milad Shokouhi: Evaluating the Impact of Snippet Highlighting in Search. The authors present the idea of highlighting important terms in search result snippets to help the user quickly identify whether a result matches their own query interpretation. They use speed and accuracy of clicks to evaluate the effect of highlighting.

4 Conclusions

Over the course of the workshop, we have seen a great variety of types of logged user interactions, of methods for interpreting them, and of ways this information is used and applied. Concerning the latter point, how log data is used and applied, we have seen an especially great variety: from personalization purposes, through a more informed visual design of search systems, to teaching users how to search more effectively.

However, the basis for all those different kinds of applications is the same: logged interaction data between a user and a system. There are basic kinds of interaction data, e.g., based on explicit events from the user while browsing the Web, such as clicks and page transitions as well as mouse movements and scrolling. More advanced and more implicit interaction data logging is becoming increasingly popular, e.g., based on eye tracking, skin conductance, and EEG. During the workshop, we identified common needs and problems with respect to logging interaction data. They ranged from extracting the focused data from different software applications to merging interaction data streams from different sources. Here, we clearly see a need for a common basis of tools and frameworks shared within the community so that individual researchers don't have to re-invent the wheel over and over again.
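To make the stream-merging problem concrete, here is a minimal sketch that aligns two already time-ordered logs on a shared clock. It is a generic illustration in Python rather than any tool discussed at the workshop; the field names and the two example sources (browser events and eye-tracker fixations) are assumptions.

    import heapq

    def merge_event_streams(*streams):
        """Merge several time-ordered interaction-log streams into one.

        Each stream yields dicts carrying at least a 'timestamp' key
        (seconds on a shared clock) and a 'source' key naming the logger.
        heapq.merge keeps the output globally ordered as long as each
        input stream is already sorted by timestamp.
        """
        return heapq.merge(*streams, key=lambda event: event["timestamp"])

    # Hypothetical example: a browser click log and an eye-tracking log.
    clicks = [
        {"timestamp": 12.40, "source": "browser", "event": "click", "url": "http://example.org"},
        {"timestamp": 15.02, "source": "browser", "event": "scroll", "delta": 240},
    ]
    gaze = [
        {"timestamp": 12.38, "source": "eyetracker", "event": "fixation", "x": 310, "y": 520},
        {"timestamp": 14.90, "source": "eyetracker", "event": "fixation", "x": 305, "y": 780},
    ]

    for event in merge_event_streams(clicks, gaze):
        print(event["timestamp"], event["source"], event["event"])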

Acknowledgements

We would like to thank ACM and SIGIR for hosting this workshop, as well as the SIGIR workshop committee and especially its chair Diane Kelly for their very helpful feedback. We are further very thankful to the authors, the members of our program committee, and all participants. They helped to form a very lively, spirited, highly interesting, and successful workshop.

Program Committee

Eugene Agichtein (Emory University, USA)

    Richard Atterer (University of Munich, Germany)

    Nick Craswell (Microsoft Research, England)

    Susan Dumais (Microsoft Research, USA)

    Laura Granka (Stanford, Google Inc., USA)

    Kirstie Hawkey (UBC, Canada)

    Eelco Herder (L3S, Germany)

    Thorsten Joachims (Cornell University, USA)

    Melanie Kellar (Google Inc., USA)

    Douglas Oard (University of Maryland, USA)


Demonstration of Improved Search Result Relevancy Using Real-Time Implicit Relevance Feedback

David Hardtke
Surf Canyon Incorporated
274 14th St., Oakland, CA 94612
[email protected]

Mike Wertheim
Surf Canyon Incorporated
274 14th St., Oakland, CA 94612
[email protected]

Mark Cramer
Surf Canyon Incorporated
274 14th St., Oakland, CA 94612
[email protected]

    ABSTRACT

Surf Canyon has developed real-time implicit personalization technology for web search and implemented the technology in a browser extension that can dynamically modify search engine results pages (Google, Yahoo!, and Live Search). A combination of explicit (queries, reformulations) and implicit (clickthroughs, skips, page reads, etc.) user signals is used to construct a model of instantaneous user intent. This user intent model is combined with the initial search result rankings in order to present recommended search results to the user as well as to reorder subsequent search engine results pages after the initial page. This paper will use data from the first three months of Surf Canyon usage to show that a user intent model built from implicit user signals can dramatically improve the relevancy of search results.

    Keywords

Implicit Relevance Feedback, Personalization, Adaptive Search System

1. INTRODUCTION

It has long been demonstrated that explicit relevance feedback can improve both precision and recall in information retrieval [1]. An initial query is used to retrieve a set of documents. The user is then asked to manually rate a subset of the documents as relevant or not relevant. The terms appearing in the relevant documents are then added to the initial query to produce a new query. Additionally, non-relevant documents can be used to remove or de-emphasize terms for the reformulated query. This process can be repeated iteratively, but it was found that after a few iterations very few new relevant documents are found [2].
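As an illustration of the procedure just described, the classic Rocchio update can be written in a few lines over term-weight vectors. This is the textbook formulation rather than the method of any particular system in this volume; the parameter values are conventional defaults, not values taken from [1].

    import numpy as np

    def rocchio_update(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        """One Rocchio relevance-feedback step over term-weight vectors.

        query_vec    : 1-D array of term weights for the original query
        relevant     : 2-D array, one row per document judged relevant
        non_relevant : 2-D array, one row per document judged non-relevant
        """
        query_vec = np.asarray(query_vec, dtype=float)
        rel = np.atleast_2d(relevant) if len(relevant) else np.zeros((1, query_vec.size))
        non = np.atleast_2d(non_relevant) if len(non_relevant) else np.zeros((1, query_vec.size))
        updated = alpha * query_vec + beta * rel.mean(axis=0) - gamma * non.mean(axis=0)
        # Negative weights are typically clipped rather than used to "anti-match" terms.
        return np.clip(updated, 0.0, None)

For the pseudo relevance feedback variant described next, the same update can be applied with the top-N retrieved documents passed as relevant and non_relevant left empty.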

Explicit relevance feedback as described above requires active user participation. An alternative method that does not require specific user participation is pseudo relevance feedback. In this scheme, the top N documents from the initial query are assumed to be relevant. The important terms in these documents are then used to expand the original query.

Implicit Relevance Feedback aims to improve the precision and recall of information retrieval by utilizing user actions to infer the relevance or non-relevance of documents. Many different user behavior signals can contribute to a probabilistic evaluation of document relevance. Explicit document relevance determinations are more accurate, but implicit relevance determinations are more easily obtained as they require no additional user effort.

SIGIR '09, July 19-23, 2009, Boston, USA. Copyright is held by the author/owner(s).

2. IMPLICIT SIGNALS AND USER INFORMATION NEED

With the large, open nature of the World Wide Web it is very difficult to evaluate the quality of search engine algorithms using explicit human evaluators. Hence, there have been numerous investigations into using implicit user signals for evaluation and optimization of search engine quality. Several studies have investigated the extent to which a clickthrough on a specific search engine result can be interpreted as a user indication of document relevancy (for a review see [3]). The primary issue involving clickthrough data is that users are most likely to click on higher-ranked documents because they tend to read the SERP (search engine results page) from top to bottom. Additionally, users trust that a search engine places the most relevant documents at the highest positions on the SERP.

Joachims et al. used eye tracking studies combined with manual relevance judgements to investigate the accuracy of clickthrough data for implicit relevance feedback [4]. They conclude that clickthrough data can be used to accurately determine relative document relevancies. If, for instance, a user clicks on a search result after skipping other search results, subsequent evaluation by human judges shows that in 80% of cases the clicked document is more relevant to the query than the documents that were skipped.
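A minimal sketch of how such "click > skip above" preference pairs can be extracted from a single logged SERP interaction is shown below; it is a generic rendering of the heuristic, not code from [4] or from Surf Canyon.

    def skip_above_preferences(ranked_urls, clicked_ranks):
        """Derive relative relevance preferences from one SERP interaction.

        ranked_urls   : result URLs in the order they were displayed
        clicked_ranks : set of 0-based positions the user clicked

        Returns (preferred_url, over_url) pairs: each clicked result is
        preferred over every higher-ranked result the user skipped.
        """
        prefs = []
        for clicked in sorted(clicked_ranks):
            for skipped in range(clicked):
                if skipped not in clicked_ranks:
                    prefs.append((ranked_urls[clicked], ranked_urls[skipped]))
        return prefs

    # Example: the user clicked result 3 after skipping results 1 and 2.
    print(skip_above_preferences(["u1", "u2", "u3", "u4"], {2}))
    # -> [('u3', 'u1'), ('u3', 'u2')]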

In addition to clickthroughs, other user behaviors can be related to document relevancy. Fox et al. used a browser add-in to track user behavior for a volunteer sample of office workers [5]. In addition to tracking their search and web usage, the browser add-in would prompt the user for specific relevance evaluations for pages they had visited. Using the observed user behavior and subsequent relevance evaluations, they were able to correlate implicit user signals with explicit user evaluations and determine which user signals are most likely to indicate document relevance. For pages clicked by the user, the user indicated that they were either satisfied or partially satisfied with the document nearly 70% of the time. In the study, two other variables were found to be most important for predicting user satisfaction with a result page visit. The first was the duration of time that the user spent away from the SERP before returning: if the user was away from the SERP for a short period of time, they tended to be dissatisfied with the document. The other important variable for predicting user satisfaction was the exit type: users that closed the browser on a result page tended to be satisfied with that result page. The important outcome of this and other studies is that implicit user behavior can be used instead of explicit user feedback to determine the user's information need.
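As a rough illustration of how such findings can be operationalized, the snippet below scores a single result-page visit from its dwell time and exit type. The rule and the thresholds are assumptions made for illustration, not the values reported by Fox et al.

    def likely_satisfied(dwell_seconds, exit_type):
        """Heuristic satisfaction guess for a single result-page visit.

        exit_type is one of "closed_browser", "back_to_serp", "new_query", ...
        Ending the session on the page or dwelling on it for a while are
        treated as positive signals; a quick bounce back is negative.
        """
        if exit_type == "closed_browser":
            return True
        if dwell_seconds < 10:      # quick return to the SERP: likely dissatisfied
            return False
        return dwell_seconds >= 30  # sustained reading: likely satisfied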

3. IMPLICIT REAL-TIME PERSONALIZATION

As discussed in the previous section, it has been shown that implicit user behavior can often be used to infer satisfaction with visited results pages. The goal of the Surf Canyon technology is to use implicit user behavior to predict which unseen documents in a collection are most relevant to the user and to recommend these documents to the user.

Shen, Tan, and Zhai1 have investigated context-sensitive adaptive information retrieval systems [6]. They use both clickthrough information and query history information to update the retrieval and ranking algorithm. A TREC collection was used since manual relevancy judgements are available. They built an adaptive search interface to this collection, and had 3 volunteers conduct searches on 30 relatively difficult TREC topics. The users could query, re-query, examine document summaries, and examine documents. To quantify the retrieval algorithms, they used Mean Average Precision (MAP) or Precision at 20 documents. As these were difficult TREC topics, users submitted multiple queries for each topic. They found that including query history produced a marginal improvement in MAP, while use of clickthrough information produced dramatic increases (up to nearly 100%) in MAP.
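For reference, the two measures used in that study can be computed directly from a ranked result list and the set of relevant documents; this is the standard definition, sketched here for a single topic (MAP is the mean of average precision over topics).

    def average_precision(ranking, relevant):
        """Average precision of one ranked list against a set of relevant doc ids."""
        hits, precisions = 0, []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / i)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def precision_at_k(ranking, relevant, k=20):
        """Fraction of the top k results that are relevant."""
        return sum(1 for doc in ranking[:k] if doc in relevant) / k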

Shen et al. also built an experimental adaptive search interface called UCAIR (User-Centered Adaptive Information Retrieval) [7]. Their client-side search agent has the capability of automatic query reformulation and active reranking of unseen search results based on a context-driven user model. They evaluated their system by asking 6 graduate students to work on TREC topic distillation tasks. At the end of each topic, the volunteers were asked to manually evaluate the relevance of the 30 top-ranked search results displayed by the system. The top results shown are mixed between Google rankings and UCAIR rankings (some results overlap), and the evaluators could not distinguish the two. UCAIR rankings show a 20% increase in precision for the top 20 results.

The Surf Canyon browser extension represents the first attempt to integrate implicit relevance feedback directly into the major commercial search engines. Hence, we are able to evaluate this technology outside of controlled studies. From a research perspective, this is the first study to investigate this technology in the context of normal searches by normal users. The drawback is that we have no chance to collect a posteriori relevancy judgements from the searchers or to conduct surveys to evaluate the user experience. We can, however, quickly collect large amounts of user data in order to evaluate the technology.

1 Shen, Tan, and Zhai are co-authors on one Surf Canyon patent application but were not actively involved in the work presented here.

4. TECHNOLOGICAL DETAILS

Surf Canyon's technology can be used as both a traditional web search engine and as a browser extension that dynamically modifies the search results page from commercial search engines (currently Google, Yahoo!, and Live Search). The underlying algorithms in the two cases are mostly identical. As the data presented here was gathered using the browser extension, we will describe that here.

Surf Canyon's browser extension was publicly launched on February 19, 2008. From that point forward, visitors to the Surf Canyon website2 were invited to download a small piece of free software that is installed in their browser. The software works with both Internet Explorer and Firefox. Although the implementation differs for the two browsers, the functionality is identical.

Internet Explorer leads in all current studies of web browser market share, with March 2008 market share estimated between 60% and 90%. Among users of the Surf Canyon browser extension, however, about 75% use Firefox. Among users who merely visit the extension download page, the breakdown by browser type is nearly 50/50. Part of the skew towards Firefox in both website visitors and users of the product can be attributed to the fact that marketing of the product has been mainly via technology blogs. Readers of technology blogs are more likely to use operating systems for which Internet Explorer is not available (e.g., Mac, Linux). Additionally, we speculate that Firefox may be more prevalent among readers of technology blogs. The difference between the fraction of visitors to the site using Firefox (50%) and the fraction of people who install and use the product using Firefox (75%) is likely due to the more widespread acceptance of browser extensions in the Firefox community. The Firefox browser was specifically designed to have minimal core functionality augmented by browser add-ons submitted by the developer community. The technologies used to implement Internet Explorer browser extensions are also often used to distribute malware, so there may be a higher level of distrust among IE users.

Once the browser extension is installed, the user never needs to visit the company web site again to use the product. The user enters a Google, Yahoo!, or Live Search web search query just as they would for any search (using either the search bar built into the browser or by navigating to the URL of the search engine). After the initial query, the search engine results page is returned exactly as it would be were Surf Canyon not installed (for most users who have not specified otherwise, the default number of search results is 10). Two minor modifications are made to the SERP. Small bull's eyes are placed next to the title hyperlink for each search result (see Figure 1). Also, the numbered links to subsequent search engine results pages at the bottom of the SERP are replaced by a single "More Results" link.

The client-side browser extension is used to communicate with the central Surf Canyon servers and to dynamically update the search engine results page. The personalization algorithms currently reside on the Surf Canyon servers. This client-server architecture is used primarily to facilitate optimization of the algorithm and to support active research studies. Since web search patterns vary widely by user, the best way to evaluate personalized search algorithms is to vary the algorithms on the same set of users while maintaining an identical user interface.

2 http://www.surfcanyon.com

Figure 1: A screenshot of the Google search result page with Surf Canyon installed. The third link was selected by the user, leading to three recommended search results.


With the client-server architecture, the implicit relevance feedback algorithms can be modified without alerting the user to any changes. Nothing fundamental prevents the technology from becoming exclusively client-side.

In addition to the ten results displayed by the search engine to the user, a larger set of results (typically 200) for the same query is gathered by the server. With few exceptions, the top 10 links in the larger result set are identical to the results displayed by the search engine. While the user reads the search result page, the back-end servers parse the larger result set and prepare to respond to user actions. Each user action on the search result page is sent to the back-end server (note that we are only using the user's actions on the SERP for personalization and do not follow the user after they leave the SERP). For certain actions (selecting a link, selecting a Surf Canyon bull's eye, asking for more results), the back-end server sends recommended search results to the browser. The Surf Canyon real-time implicit personalization algorithm incorporates both the initial rank of the result and personalized instantaneous relevancies. The implicit feedback signals used to calculate the real-time search result ranks are cumulative across all recent related queries by that user. The algorithm does not, however, utilize any long-term user profiling or collaborative filtering. The precise details of the Surf Canyon algorithm are proprietary and are not important for the evaluation of the technology presented below. If an undisplayed result from the larger set of results is deemed by Surf Canyon's algorithm to be more relevant than other results displayed below the last selected link, it is shown as an indented recommendation below the last selected link.
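Because the actual scoring function is proprietary, the following sketch only illustrates the selection mechanics described above: combine the initial rank with an instantaneous-relevance estimate (here an arbitrary placeholder) and surface up to three undisplayed results that outscore everything still shown below the last click. The function names, fields and weighting are all assumptions, not Surf Canyon's algorithm.

    def pick_recommendations(candidates, displayed_below_click, relevance, max_recs=3):
        """Choose up to max_recs undisplayed results to indent below the last click.

        candidates            : list of (initial_rank, url) not currently displayed
        displayed_below_click : list of (initial_rank, url) still shown below the
                                last selected result
        relevance             : callable url -> instantaneous relevance in [0, 1]
                                (stand-in for the proprietary user-intent model)
        """
        def score(item):
            rank, url = item
            # Favor results the engine already ranked highly, boosted by the
            # session-specific relevance estimate; the weighting is made up.
            return relevance(url) / (1.0 + 0.05 * rank)

        bar = max((score(item) for item in displayed_below_click), default=0.0)
        better = [item for item in candidates if score(item) > bar]
        better.sort(key=score, reverse=True)
        return [url for _, url in better[:max_recs]]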

The resulting page is shown in Figure 1. Here, the user entered a query for "implicit relevance feedback" on Google3. Google returned 10 organic search results (only three of which are displayed in Figure 1) of the 1,180,000 documents in their web index that satisfy the query. The user then selected the third organic search result, a paper from an ACM conference entitled "Click data as implicit relevance feedback in web search". Based on the implicit user signals (which include interactions with this SERP, recent similar queries, and interactions with those results pages), the Surf Canyon algorithm recommends three search results. These links were initially given a higher initial rank (> 10) by the Google algorithm in response to the query "implicit relevance feedback". The real-time personalization algorithm has determined, however, that the three recommended links are more pertinent to this user's information need at this particular time than the results displayed by Google with initial ranks 4-10.

3 http://www.google.com

Recommendations are also generated when a user clicks on the small bull's eyes next to the link title. We assume that a selection of a bull's eye indicates that the linked document is similar to, but not precisely, what the user is looking for. For the analysis below, up to three recommendations are generated for each link selection or bull's eye selection. Unless the user specifically removes recommended search results by clicking on the bull's eye or by clicking the close box, they remain displayed on the page. Recommendations can nest up to three levels deep: if the user clicks on the first recommended result, then up to three recommendations are generated immediately below this search result.

At the bottom of the 10 organic search results, there is a link to get "More Results". If the user requests the next page of results, all results shown on the second and subsequent pages are determined using Surf Canyon's instantaneous relevancy algorithm. Unlike the default search engine behavior, subsequent pages of results are added to the existing page. After selecting "More Results", links 1-20 are displayed in the browser, with link 11 focused at the top of the window (the user needs to scroll up to see links 1-10).

5. ANALYSIS OF USER BEHAVIOR

Most previous studies of Interactive Information Retrieval systems have used post-search user surveys to evaluate the efficacy of the systems. These studies also tended to recruit test subjects and use closed collections and/or specific research topics. The data presented here was collected from an anonymous (but not necessarily representative) set of web surfers during the course of their interactions with the three leading search engines (Google, Yahoo, and Live Search). The majority of searches were conducted using Google. Where possible, we have analyzed the user data independently for each of the search engines and have not found any cases where the conclusions drawn from this study would differ depending on the user's choice of search engine. The total number of unique search queries analyzed was 700,000.

Since the users in this study were acquired primarily from technology web blogs, their search behavior can be expected to be significantly different from that of the average web surfer. Thus, we cannot evaluate the real-time personalization technology by comparing to previous studies of web user behavior. Also, since we have changed the appearance of the SERP and also dynamically modify the SERP, any metrics calculated from our data cannot be directly compared to historical data due to the different user interface.

Surf Canyon only shows recommendations after a bull's eye or search result is selected. It is therefore interesting to investigate how many actions a user makes for a given query, as this tells us how frequently implicit personalization within the same query can be of benefit. Jansen and Spink [8] found from a meta-analysis of search engine log studies that user interaction with search engine results pages is decreasing. In 1997, 71% of searchers viewed beyond the first page of search results. In 2002 only 27% of searchers looked past the first page of search results. There is a paucity of data on the number of web pages visited per search. Jansen and Spink [9] reported the mean number of web pages visited per query to be 2.5 for AllTheWeb searches in 2001, but they exclude queries where no pages were visited in this estimate. Analysis of the AOL query logs from 2006 [10] gives a mean number of web pages viewed per unique query of 0.97. For the current data sample, the mean number of search results visited is 0.56. The comparatively low number of search results that were selected in the current study has multiple partial explanations. The search results page now contains multiple additional links (news, videos) that are not counted in this study. Additionally, the information that the user is looking for is often on the SERP (e.g., a search for a restaurant often produces the map, phone number, and address). Search engines have replaced bookmarks and direct URL typing for re-visiting web sites. For such navigational searches the user will have either one or zero clicks, depending on whether the specific web page is listed on the SERP. Additionally, it may be that the current sample of users is biased towards searchers who are less likely to click on links.


Figure 2: Distribution of the total number of selections per query.

Figure 2 shows the distribution of the total number of selections per query. 62% of all queries lead to the selection of zero search results. Since Surf Canyon does nothing until after the first selection, this number is intrinsic to the current users interacting with these particular search engines. A recent study by Downey, Dumais and Horvitz also showed that after a query the user's next action is to re-query or end the search session about half the time [11]. In our study, only 12% of queries lead to more than one user selection. A goal of implicit real-time personalization would be to decrease direct query reformulation and to increase the number of informational queries that lead to multiple selections. The current data sample is insufficient to study whether this goal has been achieved.

In order to evaluate the implicit personalization technology developed by Surf Canyon, we chose to compare the actions of the same set of users with and without the implicit personalization technology enabled. Our baseline control sample was created by randomly replacing recommended search results with random search results selected from among the results with initial ranks 11-200. These Random Recommendations were only shown for 5% of the cases where recommendations were generated. The position (1, 2, or 3) in the recommendation list was also random. These random recommendations were not necessarily poor, as they do come from the list of results generated by the search engine in response to the query.
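The evaluation design amounts to a small randomized experiment: roughly 5% of recommendation slots receive a random deep result instead of the model's pick, and click-through rates are then compared per display position. A sketch of that bookkeeping, with all names assumed rather than taken from Surf Canyon's code:

    import random
    from collections import defaultdict

    counts = defaultdict(lambda: {"shown": 0, "clicked": 0})  # keyed by (arm, position)

    def choose_arm(p_control=0.05):
        """Assign a recommendation slot to the random control arm ~5% of the time."""
        return "random" if random.random() < p_control else "model"

    def log_impression(arm, position):
        counts[(arm, position)]["shown"] += 1

    def log_click(arm, position):
        counts[(arm, position)]["clicked"] += 1

    def ctr(arm, position):
        c = counts[(arm, position)]
        return c["clicked"] / c["shown"] if c["shown"] else 0.0

    # Relative uplift at position 1, once data has been logged:
    # uplift = ctr("model", 1) / ctr("random", 1) - 1.0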

Figure 3 shows the click frequency for Surf Canyon recommendations as a function of the position of the recommendation relative to the last selected search result. Position 1 is immediately below the last selected search result. Also shown are the click frequencies for Random Recommendations placed at the same positions. In both cases, the frequency is relative to the total number of recommendations shown at that position. The increase in click rate (60%) is constant within statistical uncertainties for all recommended link positions. Note that the recommendations are generated each time a user selects a link and are considered to be shown even if the user does not return to the SERP. The low absolute click rates (3% or less) are due to the fact that users do not often click on more than one search result, as discussed above. The important point, however, is that the Surf Canyon implicit relevance feedback technology increases the click frequency by 80% compared to the links presented without any real-time user-intent modelling. The relative increase in clickthrough rate is constant (within statistical errors) for all display positions even though the absolute clickthrough rates drop rapidly as a function of display position.

Figure 3: Probability (%) that a recommended search result will be clicked as a function of display position relative to the last selected search result. The red circles are for recommendations selected using Surf Canyon's instantaneous relevancy algorithm, while the black triangles are for the random control sample that does not incorporate relevance feedback.

Figure 4 shows the per-query distribution of initial search result ranks for all selected search links in the current data sample. The top 10 links are selected most frequently. Search results beyond 10 are all displayed using Surf Canyon's algorithm (either through a bull's eye selection, a link selection, or when the user selects more results). For the results displayed by Surf Canyon (initial ranks > 10), the selection frequency follows a power-law distribution with P(IR) = 38% x IR^(-1.8), where IR is the initial rank.
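A power law of this form is conventionally fit by linear regression in log-log space; the sketch below recovers the prefactor and exponent from (rank, frequency) pairs. The example numbers are placeholders, not data from the paper.

    import numpy as np

    def fit_power_law(ranks, frequencies):
        """Fit frequency ~ a * rank**b by least squares on log-transformed data."""
        x = np.log(np.asarray(ranks, dtype=float))
        y = np.log(np.asarray(frequencies, dtype=float))
        b, log_a = np.polyfit(x, y, 1)  # slope is the exponent b, intercept is log(a)
        return np.exp(log_a), b

    # e.g. fit_power_law([11, 20, 50, 100], [0.25, 0.08, 0.015, 0.004])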

As Surf Canyon's algorithm favors links with higher initial rank, the click frequency distribution does not fully reflect the relevancy of the links as a function of initial rank. Figure 5 shows the probability that a shown recommendation is clicked as a function of the initial rank. This is only for recommendations shown in the first position below the last selected link. After using Surf Canyon's instantaneous relevancy algorithm, this probability shows at most a weak dependence on the initial rank of the search result. The dotted line shows the result of a linear regression to the data, P(IR) = 3.2 - (0.0025 +/- 0.0010) x IR. When sufficient data is available we will repeat the same analysis for Random Recommendations, as that will give us a user-interface-independent estimate of the relative relevance for deep links in the search result set before the application of the implicit feedback algorithms.

For the second and subsequent results pages, the browser extension has complete control over all displayed search results. For a short period of time we produced search results pages that mixed Surf Canyon's top-ranked results with results having the top initial ranks from the search engine.


Figure 4: Frequency per non-repeated search query for link selection as a function of initial search result rank.

Figure 5: Probability that a displayed recommended link is selected as a function of the initial search result rank. These data only include links from the first position immediately below the last selected search result.

This procedure was proposed by Joachims as a way to use clickthrough data to determine relative user preference between two search engine retrieval algorithms [12]. Each time a user requests "More Results", two lists are generated. The first list (SC) contains the remaining search results as ranked by Surf Canyon's instantaneous relevancy algorithm. The second list (IR) contains the same set of results, ranked by their initial display rank from the search engine. The list of results shown to the user is such that the top kSC and kIR results are displayed from each list, with |kSC - kIR| <= 1. Whenever kSC = kIR, the next search result is taken from one of the lists chosen at random. Thus, the topmost search result on the second page will reflect Surf Canyon's ranking half the time and the initial search result order half the time. By mixing the search results this way, the user will see, on average, an equal number of search results from each ranking algorithm in each position on the page. The users have no way of determining which algorithm produced each search result. If users select more search results from one ranking algorithm than from the other, this demonstrates an absolute user preference for the retrieval function that led to more selections.
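The mixing procedure described above is essentially Joachims' balanced interleaving; a minimal sketch of the mechanics (coin-flip tie-break, each result shown only once) might look as follows. This is an illustrative reconstruction, not Surf Canyon's implementation.

    import random

    def interleave(list_sc, list_ir):
        """Interleave two rankings: until one list is exhausted, the pointers
        into the two lists never differ by more than one, ties are broken by
        a coin flip, and a result already contributed by the other list is
        not shown twice."""
        shown, seen = [], set()
        k_sc = k_ir = 0
        while k_sc < len(list_sc) or k_ir < len(list_ir):
            if k_sc < k_ir:
                take_sc = True
            elif k_ir < k_sc:
                take_sc = False
            else:
                take_sc = random.random() < 0.5
            if take_sc and k_sc < len(list_sc):
                doc, k_sc = list_sc[k_sc], k_sc + 1
            elif k_ir < len(list_ir):
                doc, k_ir = list_ir[k_ir], k_ir + 1
            else:
                doc, k_sc = list_sc[k_sc], k_sc + 1
            if doc not in seen:
                seen.add(doc)
                shown.append(doc)
        return shown

Clicks on the interleaved page can then be attributed to whichever list contributed the clicked result, which is what the SC/IR ratio in Figure 6 reports.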

Figure 6 shows the ratio of link clicks for the two retrieval functions. IR is the retrieval function based on the result rank returned from the search engine. SC is the retrieval function incorporating Surf Canyon's implicit relevance feedback technology. The ratio is plotted as a function of the number of links selected previously for that query. Previously selected links are generally considered to be positive content feedback. If, on the other hand, no links were selected, then the algorithm bases its decision exclusively on negative feedback indications (skipped links) and on the user intent model that may have been developed for similar recent related queries.

Figure 6: Ratio of click frequency for second and subsequent search results page links ordered by Surf Canyon's Implicit Relevance Feedback algorithm (SC) compared to links ordered by the initial search engine result rank (IR).

We observe that, independent of the number of previous user link selections in the same query, the number of clicks on links from the relevance feedback algorithm is higher than on links displayed because of their higher initial rank. This demonstrates an absolute user preference for the ranking algorithm that utilizes implicit relevance feedback. Remarkably, the significant user preference for search results retrieved using the implicit feedback algorithm is also apparent when the user had zero positive clickthrough actions on the first 10 results. After skipping the first 10 results and asking for a subsequent set of search links, the users are 35% more likely to click on the top-ranked Surf Canyon result than on result #11 from Google. Clearly, the searcher is not so interested in search results produced by the identical algorithm that produced the 10 skipped links, and an update of the user intent model for this query is appropriate.

6. CONCLUSIONS AND FUTURE DIRECTIONS

Surf Canyon is an interactive information retrieval system that dynamically modifies the SERP from major search engines based on implicit relevance feedback. It was built with the goal of relieving the growing user frustration with the search experience and helping searchers find what they need right now. The system presents recommended search results based on an instantaneous user-intent model. By comparing clickthrough rates, it was shown that real-time implicit personalization can dramatically increase the relevancy of presented search results.

Users of web search engines learn to think like the search engines they are using. As an example, searchers tend to select words with high IDF (inverse document frequency) when formulating queries: they naturally select the rarest terms that they can think of that would be in all documents they desire. Excellent searchers can often formulate sufficiently specific queries after multiple iterations such that they eventually find what they need. Properly implemented implicit relevance feedback would reduce the need for query reformulations, but it should be noted that in the current study most users had not yet adjusted their browsing habits to the modified behavior of the search engine. By tracking the current users in the future we hope to see changes in user behavior that can further improve the utility of this technology. As the user-intent model is cumulative, more interaction will produce better recommendations once users learn to trust the system.

7. REFERENCES

[1] J.J. Rocchio. The Smart Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, 1971.

[2] D. Harman. Relevance feedback revisited. In Proceedings of the Fifteenth International ACM SIGIR Conference, pages 1-10, 1992.

[3] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18-28, 2003.

[4] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR '05, 2005.

[5] Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan Dumais, and Thomas White. Evaluating implicit measures to improve web search. ACM Transactions on Information Systems, 23(2):147-168, April 2005.

[6] Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information retrieval using implicit feedback. In SIGIR '05, 2005.

[7] Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modelling for personalized search. In CIKM '05, 2005.

[8] B. Jansen and A. Spink. How are we searching the world wide web?: a comparison of nine search engine transaction logs. Information Processing and Management, 42(1):248-263, 2006.

[9] B. Jansen and A. Spink. An analysis of web documents retrieved and viewed. In The 4th International Conference on Internet Computing, pages 65-69, 2003.

[10] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In The First International Conference on Scalable Information Systems, 2006.

[11] D. Downey, S. Dumais, and E. Horvitz. Studies of web search with common and rare queries. In SIGIR '07, 2007.

[12] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. In SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.


A User-Centered Experiment and Logging Framework for Interactive Information Retrieval

Ralf Bierig
SC&I, Rutgers University
4 Huntington St., New Brunswick, NJ 08901, USA
[email protected]

Jacek Gwizdka
SC&I, Rutgers University
4 Huntington St., New Brunswick, NJ 08901, USA
[email protected]

Michael Cole
SC&I, Rutgers University
4 Huntington St., New Brunswick, NJ 08901, USA
[email protected]

    ABSTRACT

This paper describes an experiment system framework that enables researchers to design and conduct task-based experiments for Interactive Information Retrieval (IIR). The primary focus is on multidimensional logging to obtain rich behavioral data from participants. We summarize initial experiences and highlight the benefits of multidimensional data logging within the system framework.

    Categories and Subject Descriptors

    H.4 [Information Systems Applications]: Miscellaneous

    Keywords

    User logging, Interactive Information Retrieval, Evaluation

1. INTRODUCTION

Over the last two decades, Interactive Information Retrieval (IIR) has established a new direction within the tradition of IR. Evaluation in traditional IR is often performed in laboratory settings where controlled collections and queries are evaluated against static information needs. IIR places the user at the center of a more naturalistic search environment. Belkin and colleagues [3, 2] suggested the concept of an information seeking episode composed of a sequence of a person's interactions with information objects, determined by a specific goal, conditioned by an initial task, the general context and the more specific situation in which the episode takes place, and the application of a particular information seeking strategy.

Copyright is held by the author/owner(s). SIGIR'09, July 19-23, 2009, Boston, USA. This work is supported, in part, by the Institute of Museum and Library Services (IMLS grant LG-06-07-0105-07).

This poses new challenges for the evaluation of information retrieval systems. An enriched set of possible user behaviors needs to be addressed and included as part of the evaluation process. Systems need to address information about the entire interactive process with which users accomplish a task. This problem has so far only been initially explored [4].

This paper describes an experiment system framework that enables researchers to design and conduct task-based IIR experiments. The paper is focused on the logging features of the system designed to obtain rich behavioral data from participants. The following section describes the overall architecture of the system. Section 3 provides more details about its specific logging features. Section 4 summarizes initial experiences with multidimensional data logging within the system framework based on initial data analysis from three user studies. Future work is proposed in section 5.

2. THE POODLE IIR EXPERIMENT SYSTEM FRAMEWORK

The PooDLE IIR Experiment System Framework is part of an ongoing research project. The goal of PooDLE1 is to investigate ways to improve information seeking in digital libraries; the analysis concentrates on an array of interacting factors involved in such online search activities. The overall aim of the framework is to reduce the complexity of designing and conducting IIR experiments using multidimensional logging of users' interactive search behavior. Such experiments usually require a complex arrangement of system components (e.g. GUI, user management and persistent data storage) including logging facilities that monitor implicit user behavior. Our framework enables researchers to focus on the design of the experiment, including questionnaire and task design and the selection of appropriate logging tools. This can help to reduce the overall time and effort that is needed to design and conduct experiments that support the needs of IIR. As shown in figure 1, the experiment system framework consists of two sides: a server that operates in an Apache webserver environment and a client that resides on the machine where the experiment is conducted. We distinguish the following components:

Login and Authentication manages participants, allows them to authenticate with the system, and enables the system to direct individuals to particular experiment setups; multiple experiments may exist and users can be registered for multiple or multi-part experiments at any time.

1 http://www.scils.rutgers.edu/imls/poodle/index.html

Figure 1: System components of the PooDLE IIR Experiment System Framework. Logging features highlighted in grey.

The Graphical UI allows participants to authenticate with the framework and activate their experiment. Each experiment consists of a number of rotated tasks that are provided with a generic menu that presents the predefined task order to the user. After every completed task, the UI guides the participant back to the menu that now highlights the completed tasks. This allows participants to navigate between tasks and gain feedback that helps them to track their progress. In addition, the interface presents participants with additional information, instructions and warnings when progressing through the tasks of an experiment.

The Experimenter controls and coordinates the core components of the system; these are:

An Extensible Task Framework that provides a range of standard tasks for IIR experiments that are part of the framework (e.g. questionnaires for acquiring background information and general feedback from participants, search tasks with a bookmarking feature and an evaluation procedure, and cognitive tasks to obtain information about individual differences between participants). Tasks are easily added to this basic collection and can be reused as part of the framework in different experiments.

The Task Progress and Control Management provides participants with (rotated) task sequences, monitors their state within the experiment, and allows them to continue interrupted experiments at a later point in time.

The Interaction Logger allows tasks to register and trigger logging messages at strategic points within the task. The system automatically logs the beginning and end of each task at task boundaries.

Remote Logging Application Invocation calls logging applications that reside on the client. This allows for rich client-sided logging of low level user behavior obtained from specific hardware (e.g. mouse movements or eye-tracking information).

The Database interface manages all access to one or more databases that store users' interaction logs as well as the basic experiment design for other system components (e.g. participants, tasks and experiment blocks in the form of task rotations for individual users).

3. USER INTERACTION LOGGING

This section focuses on the logging features of the Experiment System Framework as highlighted in grey in figure 1. The logging features and the arrangement of logging tools within the framework have been informed by the following requirements:

Hybridity: All logging functionality is divided between a more general server architecture and a more specific client; this integrates server-based as well as client-based logging features into a hybrid system framework. Whereas the server logs user interactions uniformly across experiments, client logging is targeted to the capabilities of the particular client machine used for the experiment. Researchers can select from a range of logging tools or integrate their own tools to record user behavior. This enables the system to use low level input devices, normally inaccessible by the server, to be controlled by logging tools residing on the client.

Flexibility: Client logging tools can be combined through a loosely coupled XML-based configuration that is provided at task granularity. The system framework uses these task configurations to start logging tools on the client when the participant enters a task and stops them when the participant completes a task. This gives researchers the flexibility to compose logging tools as part of the experiment design and attach them to the configuration of the task. Such configurations can later be reused as design templates, which promotes uniformity across experiments and ensures important types of user interaction data are being logged.

Scalability: Experiments can be configured to apply a number of different client machines as part of the data collection. A researcher can, for example, trigger another client computer to record video from a second web camera or simultaneously activate several clients for experiments that involve more than one participant. Redundant instances of the same logging tools can be instantiated to produce multiple data streams to overcome potential measurement errors and instabilities on a data stream due to load or general failure of hardware and software.

The client is configured to work with the following selection of open-source and commercial logging tools that record different behavioral aspects of participants:

RUIConsole is an adapted command line version of the RUI tools developed at Pennsylvania State University [5]. RUI logs low level mouse movements, mouse clicks, and keystrokes. Our extension additionally provides full control over its logging features through a command line interface to allow for more efficient automated use within our experiment framework.

UsaProxy is a JavaScript-based HTTP proxy developed at the University of Munich [1] that logs interactive user behavior unobtrusively through injected JavaScript. It monitors page loads as well as resize and focus events. It identifies mouse hover events over page elements, mouse movements, mouse clicks, keystrokes, and scrolling. Our version of UsaProxy is slightly modified as we don't log mouse movements with this tool. UsaProxy can run directly on the client, but can also be activated on a separate computer to balance load.

The URL Tracker is a command line tool that extracts and logs the user's current web location directly from the Internet Explorer (IE) address bar and makes it available to the system framework. This allows any task to determine participants' current position on the web and to monitor their browsing history within a task.

Tobii Eyetracker: We use the Tobii T60 eyetracking hardware which is packaged with Tobii Studio2, a commercial eyetracking recording and analysis software. The software records eye movements, eye fixations, as well as webpage access, mouse events and keystrokes.

Morae is a commercial software package for usability testing and user experience developed and distributed by TechSmith3. It records participants' webcam and computer screen as video, captures audio, and logs screen text, mouse clicks and keystrokes occurring within Internet Explorer.

This extensible list of logging tools is loosely coupled to the Interaction Logger and the Remote Logging Application Framework components through task configurations for individual tasks. The task configuration describes which logging tools are used during a task, and the software framework activates them as soon as participants enter a task and deactivates them as soon as they complete a task.
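For concreteness, the following minimal sketch (in Python) shows what such a loosely coupled, task-level configuration could look like and how the framework might read it; the XML element and attribute names and the helper function are illustrative assumptions, not the framework's actual schema.

    # Hypothetical task configuration; the schema is assumed for illustration only.
    import xml.etree.ElementTree as ET

    TASK_CONFIG = """
    <task id="search-task-1">
      <loggers>
        <logger name="RUIConsole" host="client" />
        <logger name="UsaProxy" host="proxy-machine" />
        <logger name="TobiiStudio" host="client" />
      </loggers>
    </task>
    """

    def loggers_for_task(xml_text):
        # Return the logging tools to start when the participant enters the task
        # and to stop again when the task is completed.
        root = ET.fromstring(xml_text)
        return [(logger.get("name"), logger.get("host")) for logger in root.find("loggers")]

    print(loggers_for_task(TASK_CONFIG))
    # [('RUIConsole', 'client'), ('UsaProxy', 'proxy-machine'), ('TobiiStudio', 'client')]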

The researcher can create a selection of relevant tools for each task of a particular IIR experiment from the available logging tools supported by the system framework. First, the researcher selects all user behaviors of interest. Second, the observable data types that provide evidence for the existence and the structure of these user behaviors are identified. Finally, these data types are linked with relevant logging tools. In the next section we summarize experiences from three distinct experiments that were designed and performed with our experiment system framework. We do not describe these experiments in this paper. Instead, we focus on key points and issues that should be addressed when collecting multidimensional logging data from hybrid logging tools.
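The three-step selection can be sketched as a simple mapping; the behavior categories, data types and tool assignments below are examples chosen for illustration rather than a complete mapping from the framework.

    # Hypothetical mapping: behaviors of interest -> observable data types -> tools.
    DATA_TYPES_FOR_BEHAVIOR = {
        "reading": ["eye_fixations"],
        "navigation": ["page_loads", "urls_visited"],
        "interaction": ["mouse_clicks", "keystrokes"],
    }

    TOOLS_FOR_DATA_TYPE = {
        "eye_fixations": ["Tobii Studio"],
        "page_loads": ["UsaProxy"],
        "urls_visited": ["URL Tracker"],
        "mouse_clicks": ["RUIConsole", "Morae"],
        "keystrokes": ["RUIConsole", "Morae"],
    }

    def tools_for_behaviors(behaviors):
        # Collect every logging tool that records evidence for the selected behaviors.
        tools = set()
        for behavior in behaviors:
            for data_type in DATA_TYPES_FOR_BEHAVIOR[behavior]:
                tools.update(TOOLS_FOR_DATA_TYPE[data_type])
        return sorted(tools)

    print(tools_for_behaviors(["reading", "interaction"]))
    # ['Morae', 'RUIConsole', 'Tobii Studio']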

4. EXPERIENCES FROM MULTIDIMENSIONAL DATA LOGGING

Data logging with an array of hybrid tools, as described in the previous section, has a number of benefits and challenges. This section summarizes our initial experiences from conducting three IIR user experiments with the system framework and some initial processing and integration of its data logs.

2 http://www.tobii.com
3 http://www.techsmith.com


Accuracy and Reliability: Using data streams from multiple logging tools limits the risk of measurement errors entering data analysis. This is especially relevant to IIR due to its need to conduct experiments in naturalistic settings where people perform tasks in conditions that are not fully controlled and therefore less predictable. Such settings allow participants to solve tasks with great degrees of freedom. As a result of this, user actions in such settings tend to be highly variable. Measurement errors or missing data, for example based on varying system performance and network latencies, have a larger impact because the entire interaction is studied. Multiple data streams from different sources improve the overall accuracy of recorded sessions and increase the reliability of detecting features in individual logs. Furthermore, the use of multiple data logs limits the chances that artifacts created by individual logging tools and their assumptions will affect downstream analysis.

Disambiguation: The use of multiple data logs makes it possible to contextualize each log with the logs produced by other tools and to disambiguate uncertainties in the interpretation of logging event sequences. We found that the most common cases are timestamp disambiguation and the synchronization of event accuracies.

Timestamp disambiguation: The timestamp granularity of recorded events usually varies between logging tools. For example, Tobii Studio records eye tracking data with a constant frequency determined by the eye tracking hardware (e.g. 60 logs per second (17 ms) for the T60 model), whereas UsaProxy records events only every full second and RUIConsole records events dynamically only when they occur. The combination of logging data from different tools helps to better determine the real timing of events by providing different viewpoints for the same sequence of actions a user has performed. Low granularity timestamps might collapse a number of user events to a single point of time and, based on that, change the natural order in which these events are recorded. Alternative secondary logging data can help to detect such event sequences and help to disambiguate and correct them (see the merging sketch after this list).

Detail of event structure: Every logging tool imposes a number of assumptions on the data produced by a user: which events to log, which events to differentiate and how to label them. Two logging tools recording the same events can therefore produce different event structures with varying detail. For example, RUIConsole differentiates a mouse click into a press and a release event whereas Tobii Studio considers a mouse click as a single event. Different logging tools recording the same user actions produce events with a structure of different detail that can be used to contextualise conflicting recordings of user actions.
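As a minimal illustration of combining streams with different timestamp granularities, the following sketch merges events from two hypothetical logs into one timeline; the tool names, sample events and field layout are assumptions made for the example and are not taken from the framework's actual log formats.

    # Merge (timestamp_ms, event) streams from different logging tools into a
    # single ordered timeline so coarse timestamps can be checked against a
    # finer-grained stream.
    import heapq

    rui_events = [(17, "mouse_press"), (34, "mouse_release"), (1020, "keystroke")]
    usaproxy_events = [(1000, "scroll"), (2000, "page_load")]

    def merge_streams(*streams):
        return list(heapq.merge(*streams, key=lambda event: event[0]))

    for timestamp_ms, event in merge_streams(rui_events, usaproxy_events):
        print(timestamp_ms, event)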

Scalability: Concurrent use of logging tools may create performance issues on the client machine, especially with tools that produce large amounts of data. Especially the combined use of Morae and Tobii Studio can be demanding when using high quality web camera and screen capture recording. Limited hardware resources may have a direct effect on the recording accuracy of other logging tools. More importantly, however, an overloaded client may have an effect on participants and their ability to accomplish tasks realistically. This can be avoided by choosing a sufficiently equipped client machine and a fast network. As mentioned in section 3, the software framework supports the distribution of logging tools over several machines, while these tools are activated centrally by the server architecture, which can help to better balance the load.

Stability: Concurrent use of multiple logging applications can destabilize the client computer. Individual applications can affect each other, especially when logging from the same resources (e.g. from the same instance of Internet Explorer). Currently, our system framework does not monitor running logging tools and there is no mechanism to recover tools that hang or break during a task. This is a feature we will incorporate into a future version of the system framework.

5. FUTURE WORK

Future work on the experiment system framework will focus on further improvement of logging tool integration and monitoring. We are currently developing a graphical user interface for researchers to more easily design IIR experiments with the system and monitor progress of running experiments and the accuracy of its data logs. An extension to the experiment system framework presented in this paper is a data analysis system that allows us to fully integrate, analyse and develop models from the recorded data. In particular, we are interested in creating higher level constructs from integrated low-level logging data that can be used to personalise interactive search for users. The experiment system framework will be released as open source to the wider research community.

6. REFERENCES

[1] R. Atterer, M. Wnuk, and A. Schmidt. Knowing the User's Every Move - User Activity Tracking for Website Usability Evaluation and Implicit Interaction. In 15th International World Wide Web Conference (WWW 2006), Edinburgh, Scotland, 2006.

[2] N. Belkin. Intelligent Information Retrieval: Whose Intelligence? In Fifth International Symposium for Information Science (ISI), pages 25-31, Konstanz, Germany, 1996. Universitaetsverlag Konstanz.

[3] N. Belkin, C. Cool, A. Stein, and U. Thiel. Cases, Scripts, and Information-Seeking Strategies: On the Design of Interactive Information Retrieval Systems. Expert Systems with Applications, 9(3):379-395, 1995.

[4] A. Edmonds, K. Hawkey, M. Kellar, and D. Turnbull. Workshop on logging traces of web activity: The mechanics of data collection. In 15th International World Wide Web Conference (WWW 2006), Edinburgh, Scotland, 2006.

[5] U. Kukreja, W. E. Stevenson, and F. E. Ritter. RUI - Recording User Input from interfaces under Windows and Mac OS X. Behavior Research Methods, 38(4):656-659, 2006.


    Incorporating user behavior information in IR evaluation

Emine Yilmaz, Milad Shokouhi, Nick Craswell, Stephen Robertson
Microsoft Research Cambridge, Cambridge, UK
{eminey, milads, nickcr, ser}@microsoft.com

    ABSTRACT

Many evaluation measures in Information Retrieval (IR) can be viewed as simple user models. Meanwhile, search logs provide us with information about how real users search. This paper describes our attempts to reconcile click log information with user-centric IR measures, bringing the measures into agreement with the logs. Studying the discount curves of NDCG and RBP leads us to extend them, incorporating the probability of click in their discount curves. We measure accuracy of user models by calculating session likelihood. This leads us to propose a new IR evaluation measure, Expected Browsing Utility (EBU), based on a more sophisticated user model. EBU has better session likelihood than existing measures, therefore we argue it is a better user-centric IR measure.

    1. INTRODUCTION

This paper is concerned with user-centric IR evaluation, where an evaluation measure should model the reaction of a real user to a list of results, evaluating the utility of the list of documents to the user. Web search experiments usually employ an IR measure that focuses only on top-ranked results, under the assumption that Web users deal shallowly with the ranked list. This is probably correct, but we might ask: how can we be sure that Web search users are shallow, and how should we choose the degree of shallowness? In this paper, our solution is to make IR evaluation consistent with real user click behavior. We still evaluate based on relevance judgments on a list of search results, but the importance of each search result is brought in line with the probability of clicking that result.

In our experiments we use click logs of a search engine (bing.com) taken from January 2009, combined with relevance judgments for 2057 queries. For each judged query we extracted the top-10 results for up to 1000 real query instances, and the pattern of clicks in the form of 10 Booleans (so each result is either clicked or not clicked). More than 91% of all top-10 query-URL pairs were judged on the 5-level scale {Perfect, Excellent, Good, Fair, Bad}. Unjudged documents are assumed to be Bad. We divide the queries into two sets of equal size: training and test.

A key difference between user-centric IR evaluation measures, such as Normalized Discounted Cumulative Gain (NDCG) [2] and Rank Biased Precision (RBP) [3], is the choice of discount function. Many experiments with NDCG apply a discount at rank r of 1/log(r + 1). Another metric, RBP, has a persistence parameter p so that the probability of seeing position r is p^(r-1). Note, some evaluation measures such as Average Precision are not easily interpretable as a user model. Such measures are beyond the scope of this paper, since we focus on user-centric evaluation.

Copyright is held by the author/owner(s). SIGIR'09, July 19-23, 2009, Boston, USA.

The next section considers the discount curves of NDCG and RBP, in contrast to real click behavior. Noting a discrepancy, we extend the two metrics based on information about the probability of click on each relevance label. Having done so, the discount curves are more in line with real user behavior. However, the curves do not incorporate information about the user's probability of returning to the results list, having clicked on a result. Therefore the next section introduces our new evaluation measure Expected Browsing Utility (EBU). Finally we introduce Session Likelihood, a test for whether an evaluation measure is in agreement with click logs. Under that test, EBU is most in line with real user behavior, therefore we argue it is a superior user-centric evaluation measure.

2. DISCOUNT FUNCTIONS AND CLICKS

One of the key factors for differentiating between the evaluation metrics is their discount functions. Most user-centric IR evaluation metrics in the literature can be written in the form

$\sum_{r=1}^{N} p(\text{user observes document at rank } r) \cdot \mathrm{gain}(r)$

as the discount function is assumed to be modeling the probability that the user observes a document at a given rank. Therefore, the quality of a metric is directly dependent on how accurately the discount function estimates this probability. In the case of Web search, this probability value should ideally correspond to the probability that the user clicks on a document at rank r. Hence, one can compare the evaluation metrics based on how their discount functions (their assumed probability of click) compare with the actual probability that the user clicks on a document. Discount functions that are more consistent with click patterns are more flexible in explaining and evaluating users' Web search behavior.
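To make this general form concrete, here is a minimal sketch of such a metric with the two discount functions mentioned above; the gain values and the use of a base-2 logarithm for the NDCG-style discount are our own illustrative assumptions.

    # A user-centric metric as sum over ranks of p(observe rank r) * gain(r).
    import math

    def metric(gains, p_observe):
        # gains[i] and p_observe[i] correspond to rank r = i + 1.
        return sum(p * g for p, g in zip(p_observe, gains))

    def ndcg_log_discount(n):
        # NDCG-style discount: 1 / log2(r + 1)
        return [1.0 / math.log2(r + 1) for r in range(1, n + 1)]

    def rbp_discount(n, p=0.5):
        # RBP: probability of seeing position r is p^(r - 1)
        return [p ** (r - 1) for r in range(1, n + 1)]

    gains = [3.0, 1.0, 0.0, 2.0, 0.0]  # hypothetical graded gains for 5 results
    print(metric(gains, ndcg_log_discount(5)))
    print(metric(gains, rbp_discount(5, p=0.5)))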

Next, we compare the user models associated with the underlying discount functions of RBP and NDCG. The top two plots in Figure 1 show the average probability of click (averaged over all sessions in the test data) per rank. We then compare this actual probability of click with the click probability assumed by different evaluation metrics. As mentioned above, this probability corresponds to the discount function used in the definition of the metrics.


Figure 1: P(click) vs. rank for different metrics. (Four panels plot probability of click against ranks 1-5. The upper panels compare the actual click probabilities with the discounts of NDCG (log: RMS=0.385; 1/r: RMS=0.262) and RBP with p = 0.2-0.6 (RMS = 0.169, 0.173, 0.191, 0.227, 0.284). The lower panels show the snippet-extended versions: NDCG (log: RMS=0.148; 1/r: RMS=0.098) and RBP with p = 0.2-0.6 (RMS = 0.070, 0.050, 0.045, 0.069, 0.113).)

The upper left and right plots compare the discount function of NDCG (with the commonly used 1/log(r + 1) and 1/r discounts) and RBP (with p in {0.2, 0.3, 0.4, 0.5, 0.6}) with the actual click probability, respectively. For comparison purposes, the plots report the Root Mean Squared (RMS) error between the probability of click assumed by a metric and the actual probability of click. It can be seen that the probability of click assumed by these two metrics is quite different from the actual click probability.

As the discount functions in NDCG and RBP are not derived from search logs, it is not surprising to see that they are not successful in predicting clicks. In the following section, we show how extending such metrics by incorporating the quality of snippets can significantly improve the discount functions for predicting the probabilities of clicks.
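The RMS comparison itself is straightforward; the sketch below computes it for one assumed discount against an empirical click curve (the per-rank click probabilities shown here are invented for illustration and are not the paper's data).

    # RMS error between a metric's assumed per-rank click probability and the
    # empirically observed probability of click at each rank.
    import math

    def rms_error(assumed, actual):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(assumed, actual)) / len(actual))

    actual_p_click = [0.62, 0.18, 0.11, 0.08, 0.06]               # hypothetical values
    rbp_assumed = [0.3 ** (r - 1) for r in range(1, 6)]           # RBP with p = 0.3
    ndcg_assumed = [1.0 / math.log2(r + 1) for r in range(1, 6)]  # NDCG log discount

    print(rms_error(ndcg_assumed, actual_p_click))
    print(rms_error(rbp_assumed, actual_p_click))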

    3. MODELING THE IMPACT OF SNIPPETS

One reason for the discrepancy between the described discount functions and the click patterns is that these metrics do not account for the fact that users only click on some documents depending on the relevance of the summary (snippets). Both RBP and NDCG assume that the user always clicks on the document at the first rank, whereas the actual probability of click calculated from our training search logs shows that the probability that the user clicks on the first ranked document is only slightly higher than 0.6.

To address this issue, we enhance the NDCG and RBP user models by incorporating the snippet quality factor and considering its impact on the probability of clicks. We hypothesize that the probability that the user clicks on a document (i.e., the quality of the summary) is a direct function of the relevance of the associated document. Table 1 supports our claim by showing p(C|summary) ≈ p(C|relevance) obtained using the training dataset.1

Table 1: Probability of click given the relevance

Relevance    P(click|relevance)
Bad          0.5101
Fair         0.5042
Good         0.5343
Excellent    0.6530
Perfect      0.8371

It can be seen that the probability that the user clicks on a document tends to increase as the level of relevance of the document increases. Note that this behavior is slightly different for Bad and Fair documents, in which case there is a slight difference in the click probability. This is caused by the fact that (1) the documents judged as Fair tend to be only slightly relevant to the user's information need; hence, they are effectively Bad to the user, and (2) the unjudged documents are treated as Bad in our computations.

Motivated by these observations, we extend NDCG and RBP to incorporate the summary quality into their discount functions as follows: if the discount function of the metric dictates that the user visits a document at rank r with probability p(d_r), then the probability that the user clicks on the document at rank r can be computed as p(d_r) · p(C|summary_r) (where the click probabilities are shown in Table 1). The bottom two plots in Figure 1 show how the extended versions of the metrics then compare with the actual click probability.

1 For simplicity, we assume that the quality of summaries and the relevance of documents are strongly correlated. That is, relevant summaries for relevant documents and vice versa.
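As a small illustration of the extension just described, the sketch below multiplies an assumed visit probability by the P(click|relevance) values of Table 1; the visit probabilities and relevance labels in the example are hypothetical.

    # Snippet-extended click probability: p(click at r) = p(visit r) * p(C|summary_r),
    # approximating summary quality by the document's relevance label (Table 1).
    P_CLICK_GIVEN_REL = {
        "Bad": 0.5101, "Fair": 0.5042, "Good": 0.5343,
        "Excellent": 0.6530, "Perfect": 0.8371,
    }

    def extended_click_probs(visit_probs, labels):
        return [p * P_CLICK_GIVEN_REL[label] for p, label in zip(visit_probs, labels)]

    visit = [0.4 ** (r - 1) for r in range(1, 6)]  # RBP-style visit probability, p = 0.4
    labels = ["Perfect", "Good", "Bad", "Fair", "Excellent"]
    print(extended_click_probs(visit, labels))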


Figure 2: The user browsing model associated with the new evaluation metric. (Flowchart: the user examines the next document and, based on its summary, clicks with probability p(click|summary) or does not click; without a click the user examines more results with probability p(continue|noclick) or exits; after a click the user views the webpage and, depending on whether the document is relevant and satisfying, continues with probability p(continue|rel) or p(continue|nonrel), or exits.)

Table 2: Probability of continue given the relevance

Relevance    P(cont|relevance_r)
Bad          0.5171
Fair         0.5727
Good         0.6018
Excellent    0.4082
Perfect      0.1903

It can be seen that the extended versions of these metrics can approximate the actual probability of click substantially better than the standard versions.

We would like to note that Turpin et al. [4] recently also suggested that document summary information should be incorporated in retrieval evaluation, independent of our work. They showed that using the summary information in evaluation may alter the conclusions regarding the relative quality of search engines. However, their work mainly focuses on average precision as the evaluation metric.

4. EXPECTED BROWSING UTILITY (EBU)

All the metrics described so far assume that the probability that the user will continue searching at each rank is independent of (1) whether the user has clicked on a document or not, and (2) the relevance of the document seen by the user. Intuitively, we expect the search behavior of users to change based on the relevance of the last visited document. That is, visiting a highly relevant document that perfectly satisfies the user's information need (e.g. a navigational answer) shall be strongly correlated with the probability of terminating the search session.

We confirmed our hypothesis by computing the probabilities of continuing the search session conditioned on the relevance of the last clicked document. The results generated from our training set are shown in Table 2. It can be seen that if the document is very relevant to the information need (e.g., Perfect), then the user is likely to stop browsing the results as he has found the information he was looking for. On the other hand, if the user clicks on a document that is not relevant to his information need (e.g., Bad), then he is again likely to stop browsing as he is frustrated with the result he has clicked on and thinks documents retrieved lower than that will probably be even less relevant.

Motivated by the probabilities of click and continue shown in Tables 1 and 2, we propose a novel user model in which: (1) when a user visits a document, the user may or may not click the document depending on the quality of the summary, and (2) the relevance of a document visited by a user directly affects whether the user continues the search or not. Figure 2 shows the user model associated with our metric.

The associated user model can be described as follows: the user starts examining the ranked list of documents from top to bottom. At each step, the user first just observes the summary (e.g., the snippet and the URL) of the document. Based on the quality of the summary, with some probability p(C|summary) the user clicks on the document. If the user does not click on the document, then with probability p(cont|noclick) he/she continues examining the next document or terminates the search session with probability 1 - p(cont|noclick).

If the user clicks on the document, then he or she can assess the relevance of the document. If the document did not contain any relevant information, then the user continues examining with probability p(cont|nonrel) or stops with probability 1 - p(cont|nonrel). If the clicked document was relevant, then the user continues examining with probability p(cont|rel) (which depends on the relevance of the clicked document).
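A minimal sketch of this browsing model follows, using the probabilities from Tables 1 and 2; the recursion for the examination probability is our reading of the description above, and p(cont|noclick) is left as a free parameter because its trained value is not given in this excerpt.

    # Examination probability p(E_r) under the browsing model of Figure 2.
    P_CLICK = {"Bad": 0.5101, "Fair": 0.5042, "Good": 0.5343,
               "Excellent": 0.6530, "Perfect": 0.8371}
    P_CONT_GIVEN_REL = {"Bad": 0.5171, "Fair": 0.5727, "Good": 0.6018,
                        "Excellent": 0.4082, "Perfect": 0.1903}

    def examination_probs(labels, p_cont_noclick=0.5):
        # p(E_1) = 1; the user passes rank r either after clicking (and then
        # continuing with p(cont|rel_r)) or without clicking (and then
        # continuing with p(cont|noclick)).
        probs = [1.0]
        for label in labels[:-1]:
            p_click = P_CLICK[label]
            p_continue = p_click * P_CONT_GIVEN_REL[label] + (1 - p_click) * p_cont_noclick
            probs.append(probs[-1] * p_continue)
        return probs

    print(examination_probs(["Perfect", "Good", "Bad", "Fair", "Excellent"]))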

A similar user model has been suggested by Dupret et al. [1]. However, their work is mainly focused on predicting future clicks, while our goal is to integrate the probabilities of clicks with evaluating the search results.

We use past click data together with relevance information to model the user search behavior. At each result position r, our model computes the expected probability of examining the document p(E_r) as follows: we first assume that the user always examines the very first document, hence p(E_1) = 1. Now, suppose the user has examined the document at rank r - 1 and we would like to compute p(E_r). Given that the user has already examined the document at r - 1, according to our model, with probability p(C|summary_{r-1}) the user clicks on the document at rank r - 1, observes the relevance of the document at rank r - 1 and continues browsing the ranked list with probability p(cont|rel_{r-1}). Alternatively, with probability 1 - p(C|summary_{r-1}) the user does not click on the document at rank r - 1 and continues browsing with probability p(cont|noclick). Overall, the prob