
Information Processing and Management 44 (2008) 511–533

www.elsevier.com/locate/infoproman

An evaluation of adaptive filtering in the context of realistic task-based information exploration

Daqing He a,*, Peter Brusilovsky a, Jaewook Ahn a, Jonathan Grady a, Rosta Farzan b, Yefei Peng a, Yiming Yang c, Monica Rogati d

a School of Information Sciences, University of Pittsburgh, 135 N. Bellefield Avenue, Pittsburgh, PA 15256, USA
b Intelligence Systems Program, University of Pittsburgh, 5113 Sennott Square, Pittsburgh, PA 15260, USA

c Language Technologies Institute, Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
d Computer Science Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Received 3 January 2007; received in revised form 12 June 2007; accepted 9 July 2007
Available online 10 September 2007

Abstract

Exploratory search is increasingly becoming an important research topic. Our interests focus on task-based information exploration, a specific type of exploratory search performed by a range of professional users, such as intelligence analysts. In this paper, we present an evaluation framework designed specifically for assessing and comparing the performance of innovative information access tools created to support the work of intelligence analysts in the context of task-based information exploration. The motivation for the development of this framework came from our need to test systems for task-based information exploration, which cannot be satisfied by existing frameworks. The new framework is closely tied to the kind of tasks that intelligence analysts perform: complex, dynamic, and with multiple facets and multiple stages. It views the user rather than the information system as the center of the evaluation, and examines how well users are served by the systems in their tasks. The evaluation framework examines the support provided by the systems at users' major information access stages, such as information foraging and sense-making. The framework is accompanied by a reference test collection that has 18 task scenarios and corresponding passage-level ground truth annotations. To demonstrate the usage of the framework and the reference test collection, we present a specific evaluation study on CAFE, an adaptive filtering engine designed for supporting task-based information exploration. This study is a successful use case of the framework, and it revealed various aspects of the information systems and their roles in supporting task-based information exploration.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Task-based information exploration; Exploratory search; Evaluation framework; Adaptive filtering; CAFE; User study; Intelligence analysts

0306-4573/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

doi:10.1016/j.ipm.2007.07.009

* Corresponding author. Tel.: +1 412 6242477; fax: +1 412 64870001.
E-mail addresses: [email protected] (D. He), [email protected] (P. Brusilovsky), [email protected] (J. Ahn), [email protected] (J. Grady), [email protected] (R. Farzan), [email protected] (Y. Peng), [email protected] (Y. Yang), [email protected] (M. Rogati).


1. Introduction

Studies show that Web searchers are driven by various information needs, and their corresponding search activities can be classified into three types, labeled ''lookup'', ''learn'' and ''investigate'' respectively (Marchionini, 2006). In this classification, lookup aims at finding specific and possibly existing information in the collection, and is often called ''known item'' search. Both searching to learn and searching to investigate are more challenging types of search, which require iterative efforts with interpretation, synthesis, and evaluation of the information returned at each iteration. In the literature these types are called exploratory searches (White, Kules, Drucker, & Schraefel, 2006). Although current Web search engines (e.g., Google) do a reasonably good job in lookup searches, their support of the user's needs during exploratory search is still far from adequate.

A specific type of exploratory search that is considered in this paper is known as task-based information exploration. Here the information needs and the corresponding search processes are heavily influenced by the task assigned to the user. This kind of exploratory search is typical for a range of professional users, such as intelligence analysts. To understand the problems and the needs of task-based information exploration, let's consider the anatomy of an analyst's work on one task. The work starts with a given Request For Information (RFI). An RFI typically contains one overall investigation goal and a set of more specific questions that call for more information related to a seminal event. The analyst's job is to collect relevant and useful information from various sources to answer the RFI questions and to prepare a short (one- to two-page) report called a point paper. The point paper should summarize what has been found and make specific recommendations for certain actions. For example, if the seminal event is an escape of seven inmates from a prison in Texas, an RFI could be issued to ask for more information and potential actions that are useful to coordinate the recapture of those inmates. It may list such specific questions as where and when they were last sighted (location, time/date), whether they were armed, which kind of vehicle they may be driving, and what steps (e.g., rewards, posting, etc.) have been taken so far by the police to facilitate the recapture.

As the example shows, task-based information exploration driven by complex realistic tasks is a challenging exercise. Its complexity is further increased because the events associated with the task frequently evolve during the exploration. A typical task requires multiple searches over several sessions to explore the information space and to obtain accurate and updated information. This puts task-based information exploration in the same category as other kinds of exploratory searches and far from simpler lookup searches. Yet, the presence of a clearly defined task makes task-based exploration special in at least two aspects. The first aspect is related to search system engineering. An intelligent system can build a model of the given task and provide a better level of user support in the process of exploration. The second one is related to evaluation. The presence of the task and the special structure of task-based information exploration make it possible to build a dedicated evaluation framework to assess and compare different exploration-support systems.

The work presented in this paper is related to both aspects listed above. Our team is involved in a large-scale project that focuses on developing a new generation of systems to support the task-based exploratory work of intelligence analysts. While the majority of projects in this area focus on developing visualization systems that help analysts to examine the information space (Acosta-Diaz et al., 2006; Gotz, Zhou, & Aggarwal, 2006; McColgin, Gregory, Hetzler, & Turner, 2006; Wong, Chin, Foote, Mackey, & Thomas, 2006), we focus on artificial intelligence techniques. Among other techniques, our team explores the application of adaptive filtering (AF) (Hanani, Shapira, & Shoval, 2001) to support exploratory searches, especially task-based information exploration. Adaptive filtering is a popular technology in the field of user-adaptive systems. For every user of an information system, an AF engine builds a user model by inferring the user's tasks, interests, preferences, and knowledge from either explicit feedback in the form of relevance judgments or implicit feedback obtained by observing the user's search and browsing activities. The model is then used by the system to predict and recommend potentially relevant information to the user. Such capabilities make AF very attractive in the context of information exploration. For plain search engines, which rely on users' querying skills, the exploratory search context is a disadvantage, because user information needs and search scope are poorly defined. In contrast, for an AF engine, this context is advantageous since users' judgments and search/browsing activities, which are abundant in the process of information exploration, provide a good source of information for user modeling. Task-based information exploration is especially attractive for the use of AF since task modeling is both much easier and more reliable in the presence of a clearly defined task.

To explore the potential of information filtering in an exploratory search context, a part of our team developed an innovative AF engine known as the Carnegie Mellon Adaptive Filtering Engine, or CAFE. The mechanism of this engine is presented in (Yang et al., 2007). After the first version of CAFE was developed, we faced the problem of its evaluation. We wanted to confirm our hypothesis that adaptive filtering provides good value in an exploratory search context (at least in comparison with traditional search) and to measure the effect of using CAFE. The evaluation part, however, appeared to be at least as hard as the development part. As pointed out in (White, Muresan, & Marchionini, 2006), proper evaluation of various information access technologies in the context of exploratory search is a challenge since known evaluation frameworks are limited to those that support minimal to no human-computer interaction. Researchers in the area of exploratory search argue that existing evaluation metrics and methodologies are not adequate in the exploratory search context because of exploratory search's high level of interaction between humans and computers. As we discovered, the evaluation of information filtering in the context of task-based exploration is not an exception: none of the known evaluation frameworks can be applied directly (see Section 2).

To perform a sound evaluation of CAFE we had to answer the following research questions:

• What kinds of data resources are required to support the evaluation of adaptive filtering in realistic task-based information exploration? Are there any test collections that could serve as a source to build the target collection, and how do they have to be expanded?
• How can we develop an appropriate experiment design to include users in their task-based information exploration, where the process is close to real search scenarios, and the evaluation measures are straightforward and meaningful to the users involved?
• How can we analyze the results of the evaluation experiment? What kinds of measures are relevant in this context to assess the user and system performance? What are the appropriate metrics to express the desired information in numeric form?

As a result, our work on the evaluation of CAFE went far beyond evaluating one specific adaptive filtering engine. As a byproduct of this process, we developed a complete evaluation framework for studying various information access techniques in the task-based information exploration context. We started by analyzing the specific evaluation needs of task-based information exploration. Based on this analysis, we prepared an annotated collection of resources that can support the required level of evaluation. We developed the structure of the evaluation process, which is based on humans performing complex tasks and resembles the real information investigation process as much as possible. We also developed evaluation metrics that leverage the users' involvement, including those that use the utility of collected information as the basis for examining the performance.

The framework developed to evaluate the first version of CAFE turned out to be very useful and, in the longer term, more important than the study it was developed for. We have already used the framework to evaluate several tools for task-based information exploration, saving us a great deal of effort and helping to obtain interesting results. The goal of this paper is to present this framework to the community along with a meaningful example of its use, our evaluation of CAFE. Since this evaluation produced interesting results, presenting these results becomes the secondary goal of the paper.

In the remainder of this paper, Section 2 reviews previous research on frameworks developed in the fields of information retrieval, filtering and user modeling for evaluating the performance of the related systems and technologies. In Section 3, we concentrate on the presentation of the evaluation framework we developed for testing adaptive filtering engines in the context of realistic and task-based information exploration. In Sections 4–6, we will use our evaluation effort on the CAFE engine as an example to demonstrate the characteristics of the framework. We will conclude in Section 7 with some discussion of the further development of the framework.


2. Related work

Evaluation has always been an important aspect of developing information systems such as IR systems. The most influential evaluation framework has been based on the Cranfield model, even though the model was published about four decades ago (Cleverdon, Mills, & Keen, 1966). The Cranfield model takes the system-oriented view of evaluation, and concentrates on examining measurements of system performance, such as the precision and the recall of retrieved results. It also relies on a reference collection that consists of a set of documents, search topics/queries, and the corresponding relevance assessments (ground truth) between the documents and the topics. Because of these features, Cranfield-based evaluation frameworks can systematically compare different retrieval algorithms, and the evaluation resources can be reused multiple times. Current widely-used Cranfield-based evaluation frameworks include the Text REtrieval Conference (TREC), which mainly concentrates on issues related to English information retrieval (Voorhees & Harman, 2005), the Cross-Language Evaluation Forum (CLEF), which focuses on multilingual information retrieval (MLIR) among European languages (Peters et al., 2006), the NII Test Collection for IR Systems (NTCIR) on MLIR among Asian languages (Kando, 2005), and the INitiative for the Evaluation of XML Retrieval (INEX) on content-oriented XML retrieval (Fuhr, Govert, Kazai, & Lalmas, 2002).
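For reference, the standard set-based definitions of these two Cranfield-style measures, written in our own notation (not taken from the paper's formulas), are

\mathrm{Precision} = \frac{|\mathrm{Rel} \cap \mathrm{Ret}|}{|\mathrm{Ret}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{Rel} \cap \mathrm{Ret}|}{|\mathrm{Rel}|}

where Rel is the set of documents judged relevant to a topic and Ret is the set of documents retrieved for it.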

However, Cranfield-based evaluation frameworks have many limitations (see discussions in (Borlund, 2003; Borlund & Ingwersen, 1997; Saracevic, 1995)). For example, the information needs in those frameworks are imposed, simple, and do not evolve as true information needs often do. The relevance assessments are based on static topical relevance, which do not change with different user characteristics and retrieval contexts. The retrieval systems are often assumed to be in batch mode for handling requests, where interactions between users and retrieval systems do not exist. Therefore, although this system-driven IR evaluation approach has greatly improved the effectiveness of retrieval systems and their matching algorithms, further development of IR evaluation needs alternative evaluation frameworks that can remove all or some of the limitations.

Evaluation frameworks based on the user-oriented approach have been presented as such alternatives to the Cranfield model. Robertson and Hancock-Beaulieu (1992) present the improvements of the user-oriented evaluation approach over the system-oriented approach as three revolutions. The cognitive revolution means that users' cognitive activities should be considered; the relevance revolution means that users' needs rather than their requests should be the criteria for judging relevance; and the interactive revolution means that IR mechanisms (a better term, they think, than systems) should be examined in light of the whole interactive process, rather than a static step of inputting queries and generating ranked lists. Borlund (2003) points out that in the user-oriented evaluation approach, the user is the focus; users' information seeking and retrieval processes should be treated as a whole; users' information needs may change over time; and their relevance judgments are subjective and situational. Therefore, the evaluation should focus on how well the users, the retrieval mechanism, and the data collection interact for extracting useful information. Many interesting research results have been achieved via this approach (readers should consult (Ingwersen & Kalervo, 2005) for the latest developments in this area). Interestingly, many studies utilizing the Cranfield-based evaluation frameworks actually work on the interactive side of IR, where users are the center of focus, not just the IR systems. For example, TREC had the interactive track from TRECs 3–11, then the HARD track from TRECs 12–13. CLEF has also had an interactive CLEF track since its beginning. However, it is widely accepted that interactive IR experiments are difficult to design, expensive to conduct, limited in their small scale, and hard to compare across sites (He & Demner-Fushman, 2003). As Dumais and Belkin (2005) point out, this is because the performance of interactive retrieval is greatly influenced by the searchers as well as the topics and the systems, and these influences are often complex. Therefore, they state that the key is to reduce variability and separate the effects of searchers, topics and systems.

Both user-oriented and system-oriented evaluation approaches aim at the same goal: the reliability of the IR test performance results. Therefore, it is possible to combine the elements of the two approaches so that the evaluation framework is as close as possible to the actual information retrieval process, and at the same time has a relatively controlled evaluation environment as provided in the Cranfield model (Borlund, 2003). Aiming towards developing an evaluation framework for interactive IR systems, Borlund proposes an evaluation model that contains (1) a commitment to involving potential users, their dynamic information needs, and their relevance assessments, which are multidimensional and dynamic; (2) a set of simulated task situations; and (3) alternative performance measures that are capable of handling non-binary relevance assessments. The involvement of users and their relevance assessments ensures that the IR systems are evaluated under conditions where the systems are useful and meet users' needs in their given situation. The simulated tasks make the evaluation close to the real operating environment, but also provide some flexibility in how the evaluation can be conducted, in the event the real environment is too complex to be modeled or controlled. The alternative measures then avoid the rigid and unrealistic assumption that the relevance assessments have to be binary. The ideas proposed in Borlund's work were quickly adopted in many areas, including the interactive track of INEX (Larsen, Malik, & Tombros, 2005), the study of implicit feedback (White, Jose, & Ruthven, 2004), and the polyrepresentation principle of IR (Larsen, Ingwersen, & Kekalainen, 2006).

At the evaluation framework level, our proposed framework is not restricted to the Cranfield model, because, agreeing with many others, we think that the system-oriented view of the evaluation framework has the limitations discussed above. These limitations are serious, since our goal is to develop truly useful IR systems for supporting task-based information exploration. In terms of methodology, our framework has no fundamental difference from Borlund's ideas. We share the same idea that user-oriented and system-oriented evaluation approaches can be combined to gain the advantages of both. The users should be included in the evaluation process, and the effectiveness of supporting the users' work is the focus of the evaluation. The interactions between users and the systems are important, and the tasks used in the evaluation should be realistic and close to the actual tasks performed by the users in their work. Finally, the current system-oriented measures are limited in telling us how truly useful the systems are.

However, our work does differ from Borlund's. Instead of working on a framework for generic interactive IR processes, ours concentrates on testing specific information processing systems under task-based information exploration, where intelligence analysts are the main potential users. We like the idea of simulated task scenarios proposed by Borlund, but the tasks we assumed in our evaluation framework are tightly connected with the actual work of analysts. Our attention to alternative measures is focused more on utility-oriented measures than on non-binary relevance. By using a simulated task outcome, we force users to develop a balanced measure between topical relevance and content novelty so that the overall utility can be expressed and measured. Finally, in our design of the evaluation framework, we also developed a reference test collection, an idea borrowed from the Cranfield model. Based on the test collection, the ideas about simulated task scenarios are instantiated into concrete tasks. We will discuss our evaluation framework in detail in Section 3.

There have been two previous major reference test collections for evaluating adaptive filtering and its related techniques. Between 1999 and 2002, the Text REtrieval Conference (TREC) organized filtering and routing tracks (Robertson & Soboroff, 2002). Their approach concentrated on batch-mode evaluation methods and used a set of filtering topics based on the Reuters news collection. To avoid human involvement while achieving the goal of evaluating adaptive filtering techniques across different sites (which means they are in the Cranfield model), they simulated the user feedback that is necessary for adaptive filtering by assuming that all relevant documents recommended by the system count as positive feedback from the users. They concentrated on recommending documents rather than useful passages.

The second reference test collection was developed for supporting Topic Detection and Tracking (TDT) research (Allan, 2002). TDT supports five different tasks, in which a tracking task closely resembles adaptive filtering. Each TDT topic is explicitly linked to a seminal event, and the tracking task is to find documents related to the given topic. Similar to the TREC filtering track framework, there is no human involvement in the tracking process, and all feedback is simulated automatically.

As with other evaluation work based on the Cranfield model, both the filtering and tracking tracks in TREC and TDT aim at testing algorithms, using simple topics, and simulating users' interactions rather than involving actual users. Their advantages, however, include simple design, easy execution, and good cross-system comparison.

The state of the art in evaluating AF engines in the field of user modeling is somewhat similar. While it is customary in this field to involve human users in the evaluation process, existing approaches use simple scenarios, small document collections, and formal evaluation metrics that do not take into account a number of factors that are critical in the information exploration context (Díaz & Gervas, 2005; Waern, 2004). Typically, the users are required to rate every document in a set of documents that could be incrementally presented to them over several sessions. The ratings are used in two ways to evaluate the performance of an AF engine post-factum. On one hand, the ratings are fed to the AF engine to produce a ranked list of recommended documents. On the other hand, these ratings are considered as ground truth, and are compared with the ratings produced by the engine to evaluate its performance using classic relevance-precision metrics.

The reference test collection in our evaluation framework has several key differences from previous collections. First, since our framework aims at supporting task-based information exploration, the tasks we have developed resemble intelligence analysts' actual work scenarios, which are complex and evolve with the events. Second, the filtering systems evaluated under our framework return useful passages rather than documents. Third, our evaluation framework pays attention to the involvement of human subjects in the process rather than using simulation. Finally, our framework considers the utility of the selected passages rather than simple topical relevance. We want to make sure that the usefulness of the system can be revealed through our evaluation framework. We will discuss the details of the reference test collection in Section 4.

3. An evaluation framework for task-based information exploration

3.1. Reasons for developing the framework

The goal of our research is to evaluate information systems in the context of task-based information exploration performed by intelligence analysts. Although some ideas and resources can be borrowed from existing evaluation frameworks, we still had to develop a new evaluation framework for our evaluation tasks. The key reasons stem mainly from the tasks that the information systems are supposed to support, which have not been the foci of previous evaluation frameworks:

• The tasks that the information systems participate in are realistic. In our studies, after being given an RFI, an analyst starts to explore large volumes of data from various sources with the help of information systems and generates a short summary of the information collected, typically a two-page point report. This is the real information exploration process, and the outcome is real too.
• The tasks modeled in the framework are complex. Each task is related to a seminal event. The information requests specified in an RFI are focused on that event itself and its different aspects. The events can have multiple aspects. The assigned tasks usually include specific subtasks, all of which are connected to the overall task.
• These tasks are dynamic. Events naturally evolve over time. An intelligence analyst then may have to track the development of the events over several sessions before producing the final point paper.
• Information collected is at the passage level. Although documents still play important roles in task-based information exploration, because the outcome is a point paper the useful information is typically contained in specific text passages (snippets). Because of their high density of useful information and small size, these passages are perfect for investigation or report writing.

Because of these task characteristics, our evaluation framework should have the methodology, measures and other means to reveal how systems perform in the information exploration process, including multiple aspects of the exploration process (such as topical relevance, novelty, and final utility) and multiple stages of the exploration process (such as information foraging and sense-making). We think that current evaluation frameworks are especially weak on measures and reference test collections for our tasks. This is why we decided to develop this framework.

The remainder of this section presents the methodology adopted in the framework for obtaining this information and the general metrics for measuring the systems. In Section 4 we will discuss a specific reference test collection we used for the study presented in this paper, including 18 specific tasks and the ground truth assessments between the tasks and the passages in the collection.

3.2. Methodology

Although (White, Muresan, & Marchionini, 2006) points out that the ideal evaluation approach is longitudinal and in a naturalistic setting, there is a need for a cheaper and quicker evaluation approach for the people who develop information systems. Our evaluation framework takes the latter approach, combining controlled lab experiments with questionnaires and interviews.

However, our evaluation approach does involve human users conducting task-based exploration and interacting with the information system. Sharing the view of (White, Muresan, et al., 2006), we believe that information systems that support task-based exploration are highly interactive, and that human subjects' interaction behaviors and interaction processes are important aspects to be observed and evaluated.

In addition, our evaluation assumes that intelligence analysts' seeking behavior in information exploration contains two major stages (or loops): information foraging and sense-making (Pirolli & Card, 2005). During the foraging stage, the analysts use their domain and search knowledge to collect potentially useful information from various media and sources. In the sense-making stage, analysts process, organize, and distill the collected information into a coherent form so that it can be integrated into their state of knowledge. We believe that the evaluation of the information systems should consider their roles and effects in both stages.

Our evaluation focuses on the quality of selected passages, which is examined not only by how well the contents of the passages match the requests in the RFI, but also by whether and how they are used by the analysts in the final reports. Therefore, topical relevance is viewed as an important factor in relevance assessments. However, the most critical assessment criteria are based on the utility of the system in the task-based exploration (i.e., its ability to support analysts' work). Therefore, a utility measure for passages is needed.

Our evaluation settings also try to simulate certain important aspects of analysts' work scenarios. For example, subjects are required to work on a task in multiple sessions, so that subjects encounter the same task-related issues as analysts, such as event evolution and duplicated information.

3.3. Metrics

Metrics provide a set of examination points for studies based on our framework. We want measures that can support our multiple layers of analyzing information systems. These layers include:

• Measures focused on system performance. Inherited from existing frameworks, there could be measures on the accuracy and coverage of identified information (i.e., precision and recall) and new ones on utility that are related to the point paper for a given task. All these measures concentrate on passage-level rather than document-level examination.
• Measures focused on user formal performance. Our idea here is similar to that of (Marchionini & Shneiderman, 1988): by examining the paths and decisions taken by the users during the process of completing the tasks, we attempt to make inferences regarding the subjects' cognitive activities, which would shed some light on how well they are supported by the systems. The measures are expanded to evaluate the information foraging and sense-making stages separately.
• Measures focused on user activity patterns obtained by mining the user log.
• Measures focused on data collected from subjective evaluation.

Among the measures above, the utility measure deserves special consideration. It is true that the utility measure is related to the information systems' support in task-based exploration, which can be drawn from aspects related to the task outcome, the point paper. However, this approach is less reliable because the quality of the reports is related not only to the support from the information systems, but also to subjects' varying abilities to produce a coherent report based on the collected data. To remove the influence of the latter effect, we suggest using not a point paper, but a set of collected (i.e., foraging) and organized (i.e., sense-making) passages as the product to be evaluated. We assume that these passages are the basis for writing the point paper; therefore the quality of passage selection and organization can be used to reflect the system features.

We impose a word limit on the selected passages for the final product. This has two advantages. First, it makes the final product more closely resemble the point paper, which is about one to two pages long. Second, the word limit brings in a cost function that is needed to represent our utility measure. Every word added to the final selection means that some other words cannot be included. There is, therefore, a trade-off between including as much relevant information on a specific question as possible and covering all the required questions within the specified word limit. Naturally, such decisions involve the concepts of topical relevance (whether the passage contains on-topic information), novelty (whether the passage contains substantially new information not covered by previously selected passages), and usefulness (how useful the passage is to the final selection, i.e., to improving the quality of the point paper). Here, usefulness is our utility, as we think that it is related to the topical relevance and novelty of the passages.
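To make the trade-off concrete, the sketch below shows one simple way such a word-budgeted selection could be operationalized: a greedy procedure that scores each candidate passage by a weighted combination of relevance and novelty (relative to what has already been selected), normalized by passage length. This is only an illustration of the cost-function idea, not the utility measure used in our framework; the alpha weight and the term-overlap notion of novelty are our own assumptions.

```python
# Illustrative sketch only: greedy passage selection under a word budget,
# trading off topical relevance against novelty per word spent.
# Assumptions: relevance scores are given, novelty is approximated by the
# fraction of previously unseen terms, and alpha balances the two.

def select_passages(candidates, budget=1000, alpha=0.7):
    """candidates: list of (passage_text, relevance_score) pairs.
    Returns passages chosen greedily by (relevance + novelty) per word."""
    selected, seen_terms, used = [], set(), 0
    remaining = list(candidates)

    def gain(item):
        text, rel = item
        terms = set(text.lower().split())
        novelty = len(terms - seen_terms) / max(len(terms), 1)
        utility = alpha * rel + (1 - alpha) * novelty
        return utility / max(len(text.split()), 1)  # utility per word: the cost function

    while remaining:
        best = max(remaining, key=gain)
        remaining.remove(best)
        words = best[0].split()
        if used + len(words) > budget:
            continue  # this passage does not fit; try shorter candidates
        selected.append(best[0])
        seen_terms |= {w.lower() for w in words}
        used += len(words)
    return selected
```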

4. A reference test collection

4.1. Document set

It takes a great amount of resources, time and effort to develop a reference test collection from scratch. Therefore, our strategy was to reuse an existing collection, provided that it contains a large number of documents, which saves much time and effort.

Two existing document collections had the potential to be used as the basis for our test collection. The Reuters RCV1 collection, which is used in the TREC filtering track, contains 800,000 news stories covering a time period of 12 months between 1996 and 1997. The TDT4 corpus, developed for TDT evaluation, contains news articles from multiple sources in Arabic, Chinese and English, and covers events that happened between October 2000 and January 2001. It contains 28,390 English documents.

Both collections have their corresponding topics. For example, over the years, several hundred filtering topics have been developed for the RCV1 collection. The same is true for the TDT4 collection. However, because each TDT topic corresponds to a seminal event (Allan, 2002), which shares the same characteristics as the topics in our framework, we decided to choose the TDT4 corpus as the document collection.

4.2. Simulated task scenarios

To simulate the task scenarios used by intelligence analysts, we expanded TDT4 topics according to the task scenario development guidelines discussed in Section 3.1.

During the development of the tasks, we required the developers (i.e., the authors) to search the TDT4 collection to become familiar with the scenarios and their relevant documents. We also used a two-point access strategy to help us generate sub-task questions that are related to the evolution of the events and tasks. This strategy requires us to identify two time periods, one at the beginning of the event and the other approximately 2 or 3 days later.

Some brief background information, including a seed story (i.e., a good relevant document for the scenario), is provided for each scenario to make sure that subjects would have roughly the same level of knowledge about our scenarios. In reality, a superior officer might hand out a sample story as part of the RFI. Fig. 1 shows a sample task scenario that we developed.

4.3. Ground truth assessments

Human annotators were recruited to mark up the ''ground truth'' for each developed task scenario. Our ground truth assessment has the following features (a hypothetical record structure reflecting this schema is sketched after the list):

• the annotations were at the passage level;
• the annotations were independently collected for three aspects: topical relevance, novelty and utility;
• to accommodate the fact that there are different degrees of relevance, novelty and utility, we collected annotator judgments at three levels: highly relevant/novel/useful, slightly relevant/novel/useful, and not relevant/novel/useful;
• at least two annotators marked each scenario along each aspect (relevant/novel/useful).
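As a concrete, purely hypothetical illustration of this annotation schema, one passage-level judgment could be represented by a record such as the following; the field names and span encoding are our own and do not describe the distributed annotation file format.

```python
# Hypothetical record structure for one passage-level ground-truth judgment,
# reflecting the three aspects and three grades listed above. Field names and
# the span encoding are illustrative, not the actual file format.
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple

class Grade(IntEnum):
    NOT = 0        # not relevant / novel / useful
    SLIGHTLY = 1   # slightly relevant / novel / useful
    HIGHLY = 2     # highly relevant / novel / useful

@dataclass
class PassageJudgment:
    scenario_id: str               # one of the 18 task scenarios
    document_id: str               # TDT4 document identifier
    passage_span: Tuple[int, int]  # (start_offset, end_offset) within the document
    annotator_id: str              # at least two annotators per scenario and aspect
    relevance: Grade
    novelty: Grade
    utility: Grade
```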

We did not have the manpower to process the entire TDT4 collection for the ground truth annotation. By taking advantage of the fact that both our scenarios and our ground truth were in fact elaborations of the original TDT4 topics, we used the set of relevant documents associated with the original TDT4 topics as the document pool for ground truth annotation. Later, the adaptive filtering engine we were testing was able to discover a small number of additional relevant documents.1 These additional documents were subsequently annotated too.

Fig. 1. A task scenario example.

In total, we developed 18 task scenarios. 1916 documents were examined with respect to their relevance, novelty and utility. The relevance annotation produced, on average, 644.4 highly relevant passages and 230.5 slightly relevant passages per topic. The novelty annotation produced, on average, 82.4 highly novel passages and 118.3 slightly novel passages per topic. Finally, the utility annotation produced, on average, 130.4 highly useful passages and 122.8 slightly useful passages per topic before the final selection, and 150.8 selected useful passages after the final selection. The annotation files are independent of the source data (the TDT4 collection) and can be used by anyone interested in running similar studies.2

5. Evaluating CAFE: A sample study

To illustrate the evaluation methodology associated with our framework, this section describes a sample application of the evaluation framework, in which we attempted to assess how well an adaptive filtering engine called CAFE can help analysts in their task-driven exploratory search. The methodology adopted in the study utilized a controlled lab experiment involving human subjects, and made comparisons between two information access systems: CAFE and a state-of-the-art information retrieval system. The following subsections discuss the four components of the study: data, experimental systems and procedures, measures, and subjects.

5.1. Data selection and preparation

To simulate a realistic context where analysts explore the information space over an extended period of time, we divided the collection of documents for each topic into 3 or 4 subsets using time thresholds. This segmentation simulates the unfolding of a seminal event (and an analyst's tracking of the event) over time. Subjects were asked to perform their tasks in multiple sessions over a period of one to two weeks. Each new session added a new segment of data to the pool of documents accessible by subjects.

1 Compared to the original TDT4 relevant document set, the add-on discovered by CAFE was relatively small: 61 new documents, or approximately 6% of the annotated set. Therefore, it seems that the original TDT4 relevant document set has reasonably good coverage of the useful information.

2 The tasks and their ground truth annotations can be downloaded from crystal.exp.sis.pitt.edu:8080/gale/GALE-resources.html.


The subjects’ interaction with the system during each session was logged for future analysis and also passedto CAFE as a flow of positive and negative feedback. CAFE used this feedback to model the user task and toproduce a list of passages ranked by their relevance to this task. The baseline system simply executed the samequeries over the shifted document set without any adaptation.

We selected eight of the 18 task scenarios as the test topics. The selected scenarios all have a large number of stories distributed over a reasonable period of time. Each topic was divided into segments along its timeline so that a comparable number of relevant articles (topic information density) is maintained within each segment (see Fig. 2 for an example). This simulates the real-life situation where the frequency of re-assessing information about an event is related to the speed of its unfolding. In addition, it ensures that there is a reasonable number of relevant articles within each session. Four topics with a larger number of articles were divided into 4 segments and the remaining four topics into 3 segments (Table 1).
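A minimal sketch of this kind of density-based segmentation is shown below. It assumes time-ordered documents with known relevance labels and cuts the stream so that relevant articles are spread roughly evenly across segments; the actual cut points in the study may have been chosen differently (e.g., by hand around event milestones).

```python
# Illustrative sketch (assumed, not the exact procedure used in the study):
# split a topic's chronologically ordered documents into n_segments pieces so
# that each piece contains a comparable number of relevant articles.

def segment_by_relevance_density(docs, n_segments):
    """docs: list of (timestamp, doc_id, is_relevant) tuples.
    Returns n_segments lists of doc_ids, cut along the timeline."""
    docs = sorted(docs, key=lambda d: d[0])
    total_relevant = sum(1 for _, _, rel in docs if rel)
    per_segment = max(total_relevant // n_segments, 1)

    segments, current, rel_count = [], [], 0
    for _, doc_id, rel in docs:
        current.append(doc_id)
        rel_count += int(rel)
        # close the segment once it holds its share of relevant articles,
        # leaving the last segment open for the remaining documents
        if rel_count >= per_segment and len(segments) < n_segments - 1:
            segments.append(current)
            current, rel_count = [], 0
    segments.append(current)
    return segments
```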

5.2. Subjects

Recruiting real intelligence analysts as our subjects was infeasible for this study. We instead recruited graduate students in library and information science whose knowledge and experience in information access closely fit the profile of intelligence analysts. Subjects were required to be native English speakers and to have completed at least one information retrieval course. Familiarity with the news content was not required, because analysts are often asked to research topics outside their domain expertise.

Eight subjects participated in the study from July 12th to August 1st, 2006. Four subjects were assigned to the 3-session topic group, and four to the 4-session topic group. All four subjects in each group completed the experiment simultaneously with a session interval of 2–4 days. Subjects were required to attend all of their group's sessions.

Fig. 2. An example of topic segmentation.

Table 1
The eight selected topics

TDT4 topic   Title                                                      # of Sessions
40004        Russian Submarine Kursk Sank                               4
40021        Earthquake in El Salvador                                  3
40055        Edmond Pope Convicted of Espionage in Russia               4
41005        UN Strengthens Sanctions against Kabul                     3
41011        Turkish Prison Riots                                       3
41012        Trouble in the Ivory Coast                                 4
41024        Congolese President Laurent Kabila Feared Dead             3
41025        End of the Line for Peruvian President Alberto Fujimori    4

Six of the subjects were students in the Master of Library and Information Science (MLIS) program at the University of Pittsburgh; one subject was from the Master of Science in Information Science (MSIS) program at the University of Pittsburgh; and one subject was from the Computer Science program at Carnegie Mellon University. Seven of the eight subjects were female, and the age range of all subjects was 22–65. On a ten-point scale (10 being the highest), the subjects' mean rating of their search abilities was 8.375, with a mode of 8. In terms of time spent reading or viewing news each day, five subjects said they spent less than one hour, and the remaining three spent 1–2 hours.

5.3. Experimental and baseline systems

CAFE is an adaptive information filtering system developed at Carnegie Mellon University for utility-based information distillation. It combines the strengths of a state-of-the-art adaptive filtering system (Yang, Yoo, Zhang, & Kisiel, 2005) and a top-performing novelty detection system in benchmark evaluations for Topic Detection and Tracking systems (Fiscus & Wheatley, 2004). Furthermore, to support user interaction with the system in task-based information exploration, CAFE provides chronological segmentation of the input stream of documents, passage ranking per query based on both relevance and novelty, and the utilization of task profiles, query logs and recorded user interactions with system-selected passages.

CAFE uses the rich information in the task description to construct a profile for the task as the initial setting for adaptive filtering. The task profile is incrementally updated (''adapted'') as soon as new user feedback is received: the feedback indicates the relevance, redundancy or both of the currently processed passages. The user may also add new queries or modify the existing queries as part of the feedback. The adapted profile is then used to re-rank passages with respect to the current query. The passages already seen by the user are removed from the re-ranked list to avoid presenting repetitive information for the user to review.
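The adapt-and-re-rank cycle just described can be summarized by the following minimal sketch. It is a simplification under our own assumptions (a generic profile object, an abstract scoring function that may combine relevance and novelty), not CAFE's implementation.

```python
# A minimal sketch of the feedback -> profile update -> re-rank cycle described
# above. The profile representation, update rule and scoring function are
# placeholders, not CAFE's actual components.

def process_feedback(profile, feedback, update_fn):
    """feedback: list of (passage_text, label) pairs, with label in {+1, -1}
    for positive feedback (text copied to the shoebox) or negative feedback
    (passage removed from the list). update_fn adapts the profile incrementally."""
    for text, label in feedback:
        profile = update_fn(profile, text, label)
    return profile

def rerank(passages, profile, seen_ids, score_fn):
    """passages: list of (passage_id, text). Passages already seen by the user
    are dropped; the rest are ordered by the profile-based score."""
    unseen = [(pid, text) for pid, text in passages if pid not in seen_ids]
    return sorted(unseen, key=lambda p: score_fn(profile, p[1]), reverse=True)
```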

CAFE uses a regularized regression algorithm for the training of task profiles. It estimates the posterior probability of a task given a passage using a sigmoid function

P(y = 1 \mid \vec{x}, \vec{w}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}

where \vec{x} is the vector representation of the passage whose elements are term weights, \vec{w} is the vector of regression coefficients, and y ∈ {+1, −1} is the output variable corresponding to ''yes'' or ''no'' for relevance with respect to a particular task. Regularized logistic regression has been found to be one of the most successful algorithms in benchmark evaluations of adaptive filtering (Yang et al., 2005; Zhang, 2004). Technical details about CMU's LR classifier can be found in (Yang et al., 2005).
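As a generic illustration of this kind of profile training (not CMU's LR implementation), the sketch below scores a passage with the sigmoid above and applies one online gradient step of L2-regularized logistic regression to the profile weights; the learning rate and regularization constant are arbitrary choices.

```python
# Generic sketch of regularized logistic regression for a task profile.
# x and w are sparse term-weight vectors represented as dicts; this illustrates
# the formula above, not CAFE's training code.
import math

def p_relevant(x, w):
    """P(y = 1 | x, w) = 1 / (1 + exp(-w . x))."""
    dot = sum(w.get(term, 0.0) * weight for term, weight in x.items())
    return 1.0 / (1.0 + math.exp(-dot))

def sgd_update(w, x, y, learning_rate=0.1, l2=0.01):
    """One online gradient step on the L2-regularized logistic loss,
    for a feedback label y in {+1, -1}."""
    target = 1.0 if y == 1 else 0.0
    error = target - p_relevant(x, w)  # (y - p): the usual logistic gradient term
    for term, weight in x.items():
        w[term] = w.get(term, 0.0) + learning_rate * (error * weight - l2 * w.get(term, 0.0))
    return w
```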

The baseline system is a non-adaptive passage retrieval engine developed by the first author with the help of a group of researchers at the University of Maryland (He & Demner-Fushman, 2003). It uses Indri 2.0 (http://www.lemurproject.org/indri/) as the underlying document retrieval engine. The effectiveness of the baseline engine was demonstrated in TREC HARD 2003 (He & Demner-Fushman, 2003).

5.4. Experimental procedure

To minimize the potential impact of inter-subject differences on the study results, we adopted a within-subject design. During each session, each subject had to work on two tasks with one system, then worked with the other system for another two tasks. We used the Latin square method to rotate the sequence of system and task combinations to remove learning and fatigue effects (Table 2). However, for a given topic, the same system was used to complete the tasks throughout all sessions.
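For illustration, such a rotation can be produced by a simple cyclic Latin square, as in the sketch below; the pairing of topics with systems here is hypothetical and only shows the counterbalancing idea, not the exact assignment listed in Table 2.

```python
# Illustrative counterbalancing sketch: a cyclic Latin square over the
# (topic, system) conditions, so each condition appears once per subject and
# once per position. The condition list below is hypothetical.

def latin_square_rotation(conditions):
    """conditions: list of (topic, system) pairs; returns one ordering per subject."""
    n = len(conditions)
    return [[conditions[(subject + pos) % n] for pos in range(n)]
            for subject in range(n)]

orders = latin_square_rotation([
    ("40021", "CAFE"), ("41005", "CAFE"),
    ("41011", "Baseline"), ("41024", "Baseline"),
])
# orders[0] is subject 1's sequence, orders[1] is subject 2's, and so on.
```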

Subjects were given printed instructions on how to use the system and an entry questionnaire to assess their technical ability, search experience, and familiarity with news. A brief (30-min) training session was conducted on the user interface and experiment tasks, including a 10-min practice on a training topic.



Table 2
An example of the experimental session structure (Session 1, 3-session topic group), showing the points in the experiment when questionnaires were administered and a Latin square rotation of topic and system sequences

Subject:                              1                2                3                4
Entry questionnaire (session 1 only)
                                      40021-CAFE       40021-Baseline   41024-CAFE       41024-Baseline
Post-search questionnaire
                                      41005-CAFE       41005-Baseline   41011-CAFE       41011-Baseline
Post-search questionnaire
Post-system questionnaire
                                      41011-Baseline   41011-CAFE       41005-Baseline   41005-CAFE
Post-search questionnaire
                                      41024-Baseline   41024-CAFE       40021-Baseline   40021-CAFE
Post-search questionnaire
Post-system questionnaire
Exit interview


For the purpose of the experiment, we developed a simple interface, which was used for both CAFE and the baseline engine. By building this interface, we attempted to achieve two specific goals of this study: to separate the contribution of the ranking provided by the adaptive filtering engine or by the baseline system from other factors that may influence user performance and, at the same time, to simulate the exploratory task-driven work of human analysts as closely as possible. To satisfy the first goal, we made two simplifications. First, an ad hoc search function was not provided in the interface. Users are known to differ greatly in their query formulation skills, and we wanted to avoid the influence of this factor on the users' performance. Second, in the sense-making part of the interface (Fig. 4), subjects were not asked to write a final point paper; instead, their tasks were to select and organize the passages that would be used for writing the final report. Through this simplification, we avoid the effect of different report writing skills, which is not the focus of our study. These simplifications may not be necessary when performing a similar study with professional intelligence analysts, but we consider them important for the kind of subjects we used.

The interface simulates the analyst’s activities of collecting and organizing potentially useful passagesrelated to a given event, and the final sets of organized passages were used as the surrogate of the point paperthat summarizes the collected information.

The interface supports both the foraging and sense-making stages of information access. The foraging interface (see Fig. 3) consists of two frames. The left frame shows the list of the passages ordered by their relevance to the perceived user task. The passages are generated either by CAFE or by the baseline engine. The right frame shows a container called the shoebox, which is a traditional information-processing tool used by analysts. The shoebox stores all useful text selected by the user for his/her final report. The user can copy any part of the passages directly to the shoebox, or open a pop-up window to view the complete document and select from there. Darker color boxes to the left of the text fragments in the shoebox indicate that the fragments were selected from the passage list directly, and lighter color boxes indicate that the fragments were selected from the full text window. A text fragment can be removed from the shoebox, or can be ordered by the posting date or by the sequence of the user's selection.

The selection of a text fragment provides some confidence that its content is relevant to the user task and is treated as positive feedback by CAFE. The user can also provide negative feedback by removing irrelevant passages from the list. To make this action reversible, the headlines of all removed passages are retained in the passage list. Positive and negative feedback are used by CAFE to generate the list of passages for the next session.

The sense-making interface helps the user to organize the collected text fragments for inclusion in the future report. In the work of real analysts, the organized set of passages is used as a source to prepare the final report or some other product. The interface (see Fig. 4) allows the user to remove unwanted text fragments and organize the remaining ones by associating each with one of the questions of the task scenario. The questions are presented in the right frame for quick reference.


Fig. 3. Information foraging: assembling text fragments in the shoebox.

Fig. 4. Sense-making: final selecting and organizing in the shoebox.


For each topic, subjects were given 20 min per session to examine, highlight and add text fragments to the shoebox. Subjects could add a maximum of 1000 words to their shoebox during each session. After the 20-min search session, subjects were given 5 min to edit the recently added contents of their shoebox for usefulness and to meet the 1000-word limit. Each search session lasted approximately 2 hours, including breaks and the administration of questionnaires.

At the end of the final session for each topic, subjects were asked to compile their final ''report'' on that topic. In lieu of an actual written report, subjects were presented with all of the contents of their shoeboxes and asked to produce a final shoebox of no more than 2000 words that best summarized the collected information for the topic. Additionally, subjects had to indicate which subtopic question(s) a final selected passage is related to. The final report generation lasted 20 min, followed by a 10-min exit interview.

6. Result analysis

6.1. System comparison based on output ranked lists

Our first result analysis concentrated on examining the two systems intrinsically, i.e., examining the twosystems’ output – the rank lists of passages displayed to the subjects. Due to the nature of evaluating adaptive

Page 14: An evaluation of adaptive filtering in the context of realistic task-based information exploration

524 D. He et al. / Information Processing and Management 44 (2008) 511–533

Due to the nature of evaluating adaptive systems, we paid more attention to precision than to recall. The calculation of precision has been modified to handle the fact that the basic unit of recommendation is a passage (e.g., a text snippet) rather than a whole document. We adopted and expanded the precision calculation used in HARD03 (Allan, 2003): all the words in relevant ground-truth passages that overlap with the words in at least one retrieved passage are marked. Each marked word also has a weight calculated from how many ground-truth annotators selected it into the ground truth. The passage precision is then the weighted sum of those marked words over the weighted sum of all the words in the returned passages, where each word that is not in the ground truth has a weight of one.
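To make the measure concrete, the sketch below computes this weighted passage precision over word positions. The data layout (word positions keyed by document, annotator counts as weights) and the weight of one assigned to returned words outside the ground truth reflect our reading of the description above, not implementation details reported by the authors.

```python
def passage_precision(retrieved_words, gt_votes, n_annotators):
    """Weighted passage precision in the spirit of the HARD-03 measure.

    retrieved_words -- set of (doc_id, word_position) pairs covered by the
                       returned passages
    gt_votes        -- dict mapping (doc_id, word_position) to the number of
                       ground-truth annotators who selected that word
    n_annotators    -- total number of annotators (used to normalize weights)
    """
    if not retrieved_words:
        return 0.0
    marked_weight = 0.0   # weighted sum of retrieved words that hit the ground truth
    total_weight = 0.0    # weighted sum of all retrieved words
    for word in retrieved_words:
        votes = gt_votes.get(word, 0)
        if votes > 0:
            w = votes / n_annotators
            marked_weight += w
            total_weight += w
        else:
            total_weight += 1.0   # assumed weight for non-ground-truth words
    return marked_weight / total_weight
```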

As shown in Fig. 5a, averaged across all topics, the rank lists generated by CAFE have better precision scores (0.74 versus 0.45) than those of the baseline when examining the top 20 passages of the ranked lists. This 65% relative improvement is statistically significant (paired t-test, p ≤ 0.05). The superiority of CAFE is also evident in the results for the top 60 passages: although the improvement is not as great (0.32 versus 0.27, a 19% relative improvement), the difference is still statistically significant. When we examine further down the ranked lists, the performance of CAFE becomes almost equal to the baseline (both 0.19 at the top 100 passages) and slightly inferior to it (0.16 versus 0.17 at the top 120 passages). Although the differences are not significant in these two cases, the overall trend does indicate that CAFE did a better job of pushing high-quality, useful passages to the top of the ranked lists, whereas the baseline performed better at the lower end of the ranked lists. Our analysis stops at the top 120 passages because CAFE generated only around 120 passages for most ranked lists in later sessions, even though we aimed to show the top 200 passages to subjects in our original design; therefore, calculating precision at 200 made little sense.

6.2. System comparison based on usage profiles

When examining the systems' performance extrinsically, there are several aspects to consider. Firstly, there are performance measures on subjects' selection of useful passages; again we look at the precision of the passages.

Fig. 5. Passage precision on the rank lists generated by the two systems: (a) based on top 20 passages; (b) based on top 60 passages; (c) based on top 100 passages; (d) based on top 120 passages.


The examination looks into several key points of the exploration process as indicators of the systems' utility in supporting subjects' information exploration tasks. These key points include: (1) three points in the information foraging stage: after the first session, which tells us about subjects' ability to use the systems to find answers when there is not yet any adaptation difference between the two systems; after the final session, where subjects' selections should reflect any difference in the result sets received from the two systems; and the accumulated final selections of passages in the information foraging stage, which indicate the overall support each system provided for subjects' selection; and (2) the passages kept after the sense-making stage, where the selected passages were viewed as the final product to avoid the unnecessary complication of report writing.

Secondly, to compensate for the fact that differences in topic difficulty might affect the output regardless of the system, we designed a normalized precision within subjects across all the topics that they worked on, defined in Eq. (1), where $p_{i,j,k}$ is the precision of subject $i$ in session $j$ on topic $k$, $\bar{p}_{j,k\,\max}$ is the maximum of the per-topic average precisions for session $j$ over all topics, and $\bar{p}_{j,k}$ is the average precision for session $j$ and topic $k$.

$$ np_{i,j,k} = p_{i,j,k} \cdot \frac{\bar{p}_{j,k\,\max}}{\bar{p}_{j,k}} \qquad (1) $$
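A minimal sketch of Eq. (1), assuming the per-subject precision values are available in a long-format table; the column names and the use of pandas are our choices, not part of the original analysis.

```python
import pandas as pd

def normalized_precision(df):
    """Add np_{i,j,k} from Eq. (1) to a table with columns
    'subject', 'session', 'topic', and 'precision'."""
    # average precision for each (session, topic) pair: \bar{p}_{j,k}
    topic_avg = df.groupby(['session', 'topic'])['precision'].transform('mean')
    # maximum of those averages within each session: \bar{p}_{j,k max}
    session_max = (df.assign(topic_avg=topic_avg)
                     .groupby('session')['topic_avg'].transform('max'))
    out = df.copy()
    out['normalized_precision'] = out['precision'] * session_max / topic_avg
    return out
```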

Thirdly, there are usability measures examining the support provided by the systems during subjects' selection process. These include, for example, how many passages were selected in each session, how quickly subjects were able to select useful passages, and how deep in the ranked lists subjects had to look for useful passages.
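These usage indicators can be derived from interaction logs. The sketch below is only an illustration under an assumed log layout (one row per shoebox selection, with hypothetical columns 'system', 'subject', 'session', 'seconds_into_session', and 'rank'); it is not the instrumentation used in the study.

```python
import pandas as pd

def usage_profile(log):
    """Summarize how many, how quickly, and how deep passages were selected."""
    # how many passages were selected in each session, per system and subject
    per_session = (log.groupby(['system', 'subject', 'session'])
                      .size().rename('selections_per_session'))
    # share of selections made within the first 5, 10, and 15 minutes
    timing = {f'within_{m}_min': (log.groupby('system')['seconds_into_session']
                                     .apply(lambda s: (s <= m * 60).mean()))
              for m in (5, 10, 15)}
    # how deep in the ranked list subjects had to go (cf. Table 3)
    depth = log.groupby('system')['rank'].agg(['mean', 'median',
                                               lambda r: r.mode().iloc[0]])
    depth.columns = ['mean', 'median', 'mode']
    return per_session, pd.DataFrame(timing), depth
```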

Fig. 6 shows the passage precision of subjects' selections of useful passages. On average across topics, CAFE obtained better results than the baseline system in helping subjects forage useful information after the first session (see Fig. 6a), after the final session (b), and in the final information foraging results (c). These differences are not statistically significant. However, when we look at the final results of the sense-making stage (see Fig. 6d), the average precision scores of the passages selected using the two systems are almost identical (0.539 versus 0.537, only a 0.3% relative improvement).

To remove the effects of topic differences, we utilized the normalized precision. Fig. 7 shows that CAFE consistently outperformed the baseline at the three key points of the information foraging stage, as well as at the sense-making stage. The differences, however, are not statistically significant. We do see that different subjects performed quite differently despite our efforts to normalize the difficulty of the tasks. Within-subject performance also changed dramatically in both the information foraging stage and the sense-making stage (for example, subject 6).

Fig. 8 compares the number of passage selections made with the two systems for each of the eight topics. These data provide some evidence that CAFE was better than the baseline in helping subjects discover relevant information. The subjects annotated more passages overall with CAFE (734 versus 598), but this difference was not statistically significant (independent sample t-test, p = 0.168). When we considered the six topics where CAFE performed better, the difference was statistically significant (independent sample t-test, p = 0.008). Looking at the topics where subjects selected more passages with the baseline system, the difference between the two systems was not significant (p = 0.225).

How quickly subjects can locate useful passages can be seen as an indicator of the system's support for passage selection. For example, a system that supported more passage selections at the beginning of a session can be viewed as providing more efficient support. Therefore, we counted the number of passage selections at certain benchmarks: 5, 10, and 15 min after subjects started their tasks (see Fig. 9). These data show that about half of the passage selections were made within the first 5 min in both systems, and that there were slightly more passage selections in the CAFE system in the earlier stages of a session (≤10 min) than in the baseline.

By examining the ranks of selected passages, we hoped to see how deep the subjects had to go in their search for relevant information. Both systems ordered the passages, and the subjects knew this, which is why they all started their selection from the top of the ranked lists. Therefore, the deeper the subjects had to go, the more effort they had to put in, and the less supportive the system was. The descriptive statistics in Table 3 show that passage selections in CAFE were generally concentrated on higher ranks in the list. This may indicate that subjects tended to find useful information closer to the top in CAFE than in the baseline.

One interesting statistic is the mode rank for each system, which was 26 for CAFE and 1 for the baseline. Mode rank 26 corresponds approximately to the second or third page when scrolling down the list, whereas mode rank 1 is the very top of the list.


Fig. 6. Passage precision on the selected useful passages at key points of the information foraging stage and sense-making stage: (a) information foraging after the first session; (b) information foraging after the last session; (c) final information foraging results; (d) final sense-making results.


The topic-level analysis shown in Fig. 10 is consistent with the overall statistics above. In five out of the eight topics, the average selection ranks in CAFE were higher than those of the baseline system. This difference is statistically significant (independent t-test, p = 0.008). The difference in the average ranks for the remaining topics (40004, 41011, and 41012) is not significant (p = 0.114).

The way that subjects selected passages can also be an indicator of the systems' support. While some passages were selected directly from the ranked list of passages, others were selected after subjects examined the full content of the corresponding document. Fig. 11 shows the number of passages selected directly from the ranked lists. Subjects who used CAFE made more direct selections of passages from the ranked lists, which could indicate that the passages generated by CAFE gave them more confidence to select passages directly. However, the difference was not significant (paired t-test, p = 0.077).

An interesting measure we developed concerns the strength of the information scent. A snippet returned by the filtering system may not be sufficiently relevant to select for the shoebox, but may bear enough information scent (Pirolli & Fu, 2003) to cause the user to open the source document. If the document proves useful enough to select a snippet from it, the additional effort of opening the document has paid off. If the document is not useful enough to yield a selection, this effort is wasted. Therefore, the proportion of viewed documents from which nothing was selected is defined as an indicator of wasted effort. From this viewpoint, a system with a larger proportion of viewed documents that lead to selections might be better.

In this analysis, the documents opened according to the transaction log are regarded as viewed documents, and documents from which at least one useful passage was selected and added to the shoebox are called selected documents. The number of viewed documents that are not also selected documents, divided by the total number of viewed documents, is defined as the wasted effort.
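A minimal sketch of this wasted-effort indicator, computed from an assumed log of viewed and selected document identifiers (the function and variable names are ours):

```python
def wasted_effort(viewed_docs, selected_docs):
    """Fraction of opened documents that yielded no shoebox selection.

    viewed_docs   -- set of doc_ids the subject opened from the passage list
    selected_docs -- set of doc_ids from which at least one passage was
                     added to the shoebox
    """
    if not viewed_docs:
        return 0.0
    unproductive = viewed_docs - selected_docs
    return len(unproductive) / len(viewed_docs)
```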


Fig. 8. Number of selected passages in the two systems by topic.

Fig. 7. Normalized precision of subjects' passage selection (information foraging after the 1st session, information foraging after the last session, final information foraging results, and after final sense-making).


As shown in Fig. 12a, the wasted effort observed in the baseline system is higher than that in CAFE for almost all subjects. The difference is statistically significant (p = 0.034).

The analysis of wasted effort by session (see Fig. 12b) reveals that the wasted effort in the baseline system (M = 0.553, SD = 0.088) was again higher than that in CAFE, and the difference is significant (p = 0.006). In this analysis, each session except session 4 had 8 subjects and 8 topics; session 4 had only 4 subjects and 4 topics.


Fig. 10. Average rank of passage selections by topics.

Fig. 9. Number (left) and percentage (right) of selected passages along the duration of a session.

Table 3
Descriptive statistics of the ranks of selected passages

System      Mean    Median    Mode    Standard deviation
Baseline    73.4    48        1       66.6
CAFE        50.6    39        26      43.7


6.3. System comparison based on user feedback

Following each search task, subjects were given a post-search questionnaire to assess their satisfaction with the system. Then, before switching systems, they were given a post-system questionnaire. For all questions, subjects were asked to rate their level of agreement from 1 (Extremely) to 5 (Not at all). For both systems, subjects were asked to rate topic familiarity, sufficiency of news, utility of passages, ability to find useful snippets, ease of use, and overall satisfaction. In sessions 2 through 4, for CAFE only, subjects were asked to rate their impression of how well the system used their negative feedback to generate subsequent passage lists.

A 2 × 4 within-subjects ANOVA was performed on the post-system questionnaire data to determine significant differences in user answers by system and session. Fig. 13 shows the mean post-system questionnaire responses averaged across all sessions. Although the post-system responses averaged across all sessions tended to be higher for CAFE than for the baseline, there were no significant differences between the two systems for any of the questions.


Fig. 11. The number of passages selected directly from the ranked list of snippets.

Fig. 12. Wasted effort (a) by subjects and (b) by sessions.


These results were reinforced by subjects' responses in exit interviews: three indicated they preferred CAFE, two liked the baseline, and three had no preference. However, there were significant differences in mean post-system responses among sessions (Fig. 14), averaged across systems, for perceived sufficiency of news (F(2,14) = 7.834, p = 0.005), ease of use (F(2,14) = 6.921, p = 0.008), and overall satisfaction (F(2,14) = 7.440, p = 0.006).

Simple comparisons revealed significant increases in subjects' ratings from session 1 to session 3 in both systems for perceived sufficiency of news (BASELINE: F(1,7) = 9.000, p = .020; CAFE: F(1,7) = 14.538, p = .007) and overall satisfaction (BASELINE: F(1,7) = 8.795, p = .021; CAFE: F(1,7) = 7.000, p = .033). There was also a significant difference in perceived ability to find useful snippets between sessions 1 and 3 for the baseline system only (F(1,7) = 5.727, p = .048), as well as in ease of use for CAFE only (F(1,7) = 10.309, p = .015). There were no significant differences among sessions in subjects' ratings of CAFE's use of their negative feedback, nor of the utility of passages.
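For readers who wish to run a comparable analysis on their own questionnaire data, the sketch below shows how a 2 (system) × 4 (session) repeated-measures ANOVA and a paired simple comparison could be set up; the long-format table and its column names are assumptions, and this is not the authors' original analysis script.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Assumed long-format table: one row per subject x system x session,
# with the questionnaire rating for one item in the 'response' column.
# AnovaRM requires a balanced design (here: 8 subjects x 2 systems x 4 sessions).

def within_subjects_anova(df, measure='response'):
    """2 (system) x 4 (session) repeated-measures ANOVA on one questionnaire item."""
    return AnovaRM(data=df, depvar=measure, subject='subject',
                   within=['system', 'session']).fit()

def simple_comparison(df, system, sessions=(1, 3), measure='response'):
    """Paired comparison of one system's ratings between two sessions.
    The paper reports these comparisons as F(1,7) values; a paired t-test
    is the equivalent two-level test (F = t**2)."""
    a = df[(df.system == system) & (df.session == sessions[0])].sort_values('subject')[measure]
    b = df[(df.system == system) & (df.session == sessions[1])].sort_values('subject')[measure]
    return stats.ttest_rel(a, b)
```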

The data suggest that, over time, subjects felt more satisfied and had greater success completing their tasks with both systems. One factor that may explain this increased satisfaction is that subjects were given identical task descriptions throughout all sessions; thus, they were presented with all sub-task questions from the outset of the experiment. Although the temporal nature of the data was explained to subjects during training, subjects' inability to find answers to certain sub-task questions in earlier sessions may have led to lower responses. Subjects also had more complaints about slow system response times in sessions 1 and 2 versus 3 and 4, which may have also suppressed earlier ratings.


Fig. 13. Post-system questionnaire responses by system, averaged across all sessions.

Fig. 14. Mean post-system questionnaire responses for baseline and CAFE, by session.


Further analysis of the post-search questionnaires uncovered some interesting patterns in users' subjective feedback among the topics. In the 4-iteration group (see Fig. 15), feedback on all questions for topic 40055 tended to be higher than for other topics for both systems, while feedback for topic 41012 tended to be in the bottom half among all topics.

Fig. 15. Mean post-search questionnaire responses from subjects in the 4-iteration group, by topic.


Topic 41012's feedback correlates positively with subjects' performance, which was lowest or among the lowest for all precision measures and for the percentage of subqueries answered correctly. Feedback from subjects who used CAFE for topic 40004 was also lower relative to other topics (particularly for sufficiency of news), but the feedback from baseline users is less clear-cut. While users found the passages generated by the baseline system highly useful when deciding to open the full article, they rated the overall utility of passages the lowest for topic 40004. Many of the subjects complained about the readability of the articles within this topic, particularly of those articles that were machine-translated from foreign languages, which may help reconcile the difference in ratings between these two questions. The percentage of subqueries answered correctly was also lower than average for topic 40004, while precision measures were average to below average compared with all topics. We found similar results in the 3-iteration group.

7. General discussion and conclusion

In this paper, we presented an evaluation framework designed specifically for assessing and comparing the performance of innovative information access tools created to support the work of intelligence analysts in the context of task-based information exploration. The motivation for developing this evaluation framework came from the evaluation needs of the project our team is working on. As we discovered, none of the existing evaluation frameworks can satisfy the evaluation needs defined by the task-based information exploration context. We needed a framework that connects closely with the kind of tasks that intelligence analysts perform: complex, dynamic, multi-faceted, and multi-staged. We also needed a framework that views the users rather than the information systems as the center of the evaluation, and that examines how well the users are served by the systems in their tasks. Finally, we needed a framework that allows us to evaluate separately the users' performance during each major stage of their work, such as information foraging and sense-making.

In this paper we presented the main components of the developed evaluation framework – the methodology, the idea of simulated tasks, and the set of metrics. We also presented a reference test collection and an evaluation procedure as additional parts of the framework. All components feature valuable innovations, which we have already tested while evaluating several innovative information access systems.

Our reference test collection offers 18 task scenarios complete with corresponding passage-level ground truth annotations along three independent aspects: topical relevance, novelty, and utility. The three annotations support examining the performance of the systems from multiple perspectives. The suggested evaluation approach supports assessing user performance in complex, realistic tasks, and allows separate evaluation of the system's impact on the foraging and sense-making stages. The metrics recommended in the framework provide multiple layers of analysis of the information systems: measures focused on system performance, measures focused on users' formal performance and informal indicators of the process, and measures of subjective evaluation.

To demonstrate the usage of our evaluation framework and the reference test collection, we presented the framework itself in parallel with a specific evaluation study of CAFE, an adaptive filtering engine designed to support task-based information exploration. This study can be considered a successful use case of the framework; the multiple layers of evaluation measures indeed revealed various aspects of the information systems.

One interesting design element of our experiment is the use of multiple sessions to capture the dynamic development of the event and the tasks. According to the results collected from the experiment, this feature indeed exposed the complexity of the tasks, the changes in system performance, and the variety of subjects' behaviors across sessions. The separation of information foraging and sense-making in subjects' tasks, another interesting design element of our study, also helped to reveal the different support provided by the two systems.

We hope that this work will contribute to the establishment of evaluation approaches for exploratory search systems. We also hope that the reusable framework we developed will be further utilized to explore various ideas on study design and evaluation parameters.

Beyond this general contribution, our work produced some interesting observations from the specific study of the CAFE engine. It seems that existing IR systems are polished to work well with existing evaluation frameworks – i.e., to produce good formal relevance. Both CAFE and the Indri baseline were developed to push more relevant passages to the top. However, as our study hints, formal relevance does not equal better user support.


We can’t take for granted that users can perform better with formally better outputs. From the users’ task-oriented view, CAFE system was able to win over the baseline on many aspects, including better rank lists,better adaptation, less wasted effort, subjects’ comments and log analysis, however, it did almost identicalwhen examining how well the passages were selected. Therefore, this gives us a ground to argue that evalua-tion involving human users is essential to obtain truly useful testing results that are meaningful. To make fur-ther progress with developing systems that can provide better support to their users we need frameworks andevaluation approaches that take users and the work context into account. We consider the framework that wedeveloped and the evaluation approach that we suggested as useful steps in this direction.

Another interesting result of our study is the observation that system performance may differ considerably from topic to topic. This phenomenon was observed for every kind of performance measure we explored. Even though our normalized precision measure tries to remove this factor for one set of measures, the difference between topics is still visible. Not only do specific performance measures vary from topic to topic, but the performance balance between the systems varies as well. While in user-oriented measures CAFE was better in general and for most of the topics, there were always topics where the baseline system demonstrated better performance. Our study demonstrated that CAFE is a great enhancement to a search-only system, but not a replacement, since it still cannot beat traditional search for certain topics. A proper approach is to combine both so that one engine's advantages can cover the other's weaknesses.

As we mentioned above, the presented work was performed in the context of a large-scale collaborative project aimed at developing a new generation of systems for task-based information exploration. The current design does combine a search engine and an adaptive filtering engine. In this context, our next evaluation goal is to evaluate the combined system in task-based information exploration using the same evaluation framework and a similar study design. We are also currently working on expanding what we have learned from this study to develop evaluation frameworks and new evaluation measures for a range of similar information exploration systems.

Acknowledgements

We want to thank Dr. Shulman and his QDAP center for their help in developing the ground truth annotations. We also thank Qi Li and Jongdo Park for helping with the results analyses. This work is partially supported by the DARPA GALE project.

References

Acosta-Diaz, R., Guillen, H. M., GarciaRuiz, M. A., Gallardo, A. R., Pulido, J. R. G., & Reyes, P. D. (2006). An open source platform for indexing and retrieval of multimedia information from a digital library of graduate thesis. In Proceedings of the world conference on E-learning in corporate, government, healthcare, & higher education (E-Learning 2006), Honolulu, Hawaii (pp. 1822–1829). AACE.
Allan, J. (2002). Topic detection and tracking: Event-based information organization. Kluwer Academic Publishers.
Allan, J. (2003). HARD track overview in TREC 2003: High accuracy retrieval from documents. In The twelfth text retrieval conference (TREC 2003).
Borlund, P. (2003). The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information Research, 8(3).
Borlund, P., & Ingwersen, P. (1997). The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation, 53(3), 225–250.
Cleverdon, C. W., Mills, J., & Keen, M. (1966). Factors determining the performance of indexing systems. Cranfield: ASLIB Cranfield Project.
Díaz, A., & Gervás, P. (2005). Personalisation in news delivery systems: Item summarization and multi-tier item selection using relevance feedback. Web Intelligence and Agent Systems, 3(3), 135–154.
Dumais, S. T., & Belkin, N. J. (2005). The TREC interactive tracks: Putting the user into search. In E. M. Voorhees & D. K. Harman (Eds.), TREC: Experiment and evaluation in information retrieval (pp. 123–152). MIT Press.
Fiscus, J., & Wheatley, B. (2004). Overview of the TDT 2004 evaluation and results. In Proceedings of TDT-04.
Fuhr, N., Gövert, N., Kazai, G., & Lalmas, M. (2002). INEX: INitiative for the Evaluation of XML retrieval. In Proceedings of the SIGIR 2002 workshop on XML and information retrieval.
Gotz, D., Zhou, M. X., & Aggarwal, V. (2006). Interactive visual synthesis of analytic knowledge. In P. C. Wong & D. Keim (Eds.), IEEE symposium on visual analytics science and technology, VAST 2006 (pp. 51–58). Baltimore, MD: IEEE.
Hanani, U., Shapira, B., & Shoval, P. (2001). Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction, 11(3), 203–259.
He, D., & Demner-Fushman, D. (2003). HARD experiment at Maryland: From need negotiation to automated HARD process. In Proceedings of the text REtrieval conference (TREC) 2003.
Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of information seeking and retrieval in context. Springer.
Kando, N. (Ed.) (2005). Proceedings of the fifth NTCIR workshop meeting on evaluation of information access technologies: Information retrieval, question answering and cross-lingual information access. Tokyo, Japan.
Larsen, B., Malik, S., & Tombros, A. (2005). The interactive track at INEX 2005. In The workshop of INEX 2005.
Larsen, B., Ingwersen, P., & Kekäläinen, J. (2006). The polyrepresentation continuum in IR. In 1st international conference on IR in context.
Marchionini, G. (2006). Exploratory search: From finding to understanding. Communications of the ACM, 49(4), 41–46.
Marchionini, G., & Shneiderman, B. (1988). Finding facts vs. browsing knowledge in hypertext systems. IEEE Computer, 21(1), 70–79.
McColgin, D., Gregory, M., Hetzler, E., & Turner, A. (2006). From question answering to visual exploration. In R. W. White, G. Muresan, & G. Marchionini (Eds.), Workshop on evaluating exploratory search systems at SIGIR 2006 (pp. 47–50).
Peters, C., Gey, F., Gonzalo, J., Mueller, H., Jones, G., & Kluck, M. (2006). Accessing multilingual information repositories: 6th workshop of the cross-language evaluation forum. Springer.
Pirolli, P., & Card, S. K. (2005). The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of the 2005 international conference on intelligence analysis, McLean, VA, 2–4 May 2005.
Pirolli, P., & Fu, W.-T. (2003). SNIF-ACT: A model of information foraging on the World Wide Web. In P. Brusilovsky, A. Corbett, & F. de Rosis (Eds.), 9th international user modeling conference (pp. 45–54). Berlin: Springer-Verlag.
Robertson, S., & Soboroff, I. (2002). The TREC 2002 filtering track report. In Proceedings of TREC 2002.
Robertson, S. E., & Hancock-Beaulieu, M. M. (1992). On the evaluation of IR systems. Information Processing and Management, 28(4), 457–466.
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of SIGIR '95 (pp. 138–146).
Voorhees, E. M., & Harman, D. K. (2005). TREC: Experiment and evaluation in information retrieval. MIT Press.
Waern, A. (2004). User involvement in automatic filtering – an experimental study. User Modeling and User-Adapted Interaction, 14, 201–237.
White, R. W., Muresan, G., & Marchionini, G. (2006). Evaluating exploratory search systems. In Evaluating exploratory search systems, a workshop of ACM SIGIR 2006.
White, R. W., Jose, J. M., & Ruthven, I. (2004). An implicit feedback approach for interactive information retrieval. Information Processing and Management, 42(1), 166–190.
White, R. W., Kules, B., Drucker, S. M., & Schraefel, M. C. (2006). Supporting exploratory search. Communications of the ACM, 49(4), 37–39.
Wong, P. C., Chin, G., Jr., Foote, H., Mackey, P., & Thomas, J. (2006). Have Green – a visual analytics framework for large semantic graphs. In P. C. Wong & D. Keim (Eds.), IEEE symposium on visual analytics science and technology, VAST 2006 (pp. 67–74). Baltimore, MD: IEEE.
Yang, Y., Lad, A., Lao, N., Harpale, A., Kisiel, B., Rogati, M., et al. (2007). Utility-based information distillation over temporally sequenced documents. In Proceedings of ACM SIGIR 2007.
Yang, Y., Yoo, S., Zhang, J., & Kisiel, B. (2005). Robustness of adaptive filtering methods in a cross-benchmark evaluation. In 28th annual international ACM SIGIR conference, Salvador, Brazil (pp. 98–105). ACM Press.
Zhang, Y. (2004). Using Bayesian priors to combine classifiers for adaptive filtering. In 27th annual international ACM SIGIR conference, Sheffield, United Kingdom (pp. 345–352).