Evaluation of information-seeking performance in hypermedia digital libraries
Interacting with Computers 10 (1998) 269-284
Michail Salampasis, John Tait*, Chris Bloor
School of Computing and Information Systems, University of Sunderland, St. Peter's Campus, St. Peter's Way, Sunderland SR6 0DD, UK
Abstract

Nowadays, we are witnessing the development of new information-seeking environments and applications such as hypermedia digital libraries. Information Retrieval (IR) is increasingly embedded in these environments and plays a cornerstone role. However, in hypermedia digital libraries IR is a part of a large and complex user-centred information-seeking environment. In particular, information seeking is also possible using non-analytical, opportunistic and intuitive browsing strategies. This paper discusses the particular evaluation problems posed by these current developments. Current methods based on Recall (R) and Precision (P) for evaluating IR are discussed, and their suitability for evaluating the performance of hypermedia digital libraries is examined. We argue that these evaluation methods cannot be directly applied, mainly because they do not measure the effectiveness of browsing strategies: the underlying notion of relevance ignores the highly interconnected nature of hypermedia information and misses the reality of how information seekers work in these environments. Therefore, we propose a new quantitative evaluation methodology, based on the structural analysis of hypermedia networks and the navigational and search state patterns of information seekers. Although the proposed methodology retains some of the characteristics (and criticisms) of R and P evaluations, it could be more suitable than them for measuring the performance of information-seeking environments where information seekers can utilize arbitrary mixtures of browsing and query-based searching strategies. © 1998 Elsevier Science B.V.
Keywords: Evaluation; Hypermedia digital libraries; Structural analysis of hypermedia; Relative distance relevance
1. Introduction

Evaluation is a major problem in research and development of applications related to information retrieval (IR), and whole working groups are dedicated to the study of
* Corresponding author
0953-5438/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII: S0953-5438(98)00010-1
evaluation of modern information retrieval systems [1]. Currently, the effectiveness of conventional (both batch and interactive) IR systems is mainly measured by two metrics: recall (R) and precision (P). There has been much debate over a long period about whether methods based on these metrics are the most suitable measures of the effectiveness of an IR system. Hersh et al. (pp. 164) say, for example: the evaluation methods currently used have limitations in their ability to measure how well users are able to acquire information. In a recent paper, Saracevic (pp. 142) examined the evaluation of IR systems and pointed out: the most serious criticism and limitation of the TREC evaluations is that they treat IR in a batch mode. Sparck Jones, in a paper discussing the TREC programme, says the TREC programme involves, at the meta level, a strongly shaped evaluation methodology relying on an evaluation paradigm that has to be judged both for its own soundness and pertinence. She also says, in a much earlier introductory paper (pp. 1) of a book dedicated to information retrieval experiment: there is no good reason to suppose that conventional methods are best, even in principle, let alone practice. A more fundamental issue in respect of evaluation methods based on indices like recall and precision is the subjective nature of relevance assessments (pp. 146), and the lack of emphasis on the theory underlying the systems under evaluation.
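For concreteness, the two metrics under discussion have simple set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The following sketch (document identifiers are invented purely for illustration) computes both for a single query:

```python
# Set-based recall (R) and precision (P) for a single query.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) given the set of retrieved document
    ids and the set of ids judged relevant for the query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, two of which are among the five relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d2", "d4", "d5", "d6", "d7"})
print(p, r)  # 0.5 0.4
```

Note that nothing in these definitions refers to the order of a user's actions or to the structure connecting the documents; this is exactly the limitation the rest of the paper addresses.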
This debate and scepticism over evaluation, however, concerned IR systems which were developed as isolated systems with analytical, query-based retrospective information retrieval in mind. In recent years a shift has been observed towards more complex information workplaces, which are usually connected and can be accessed through a wide-area network. The term digital library is often used to describe these rather complex and highly interactive information workplaces. Hypermedia digital libraries are digital libraries based on a hypermedia paradigm. Information in hypermedia digital libraries is richly connected and organized using networked and sometimes, additionally, hierarchical or aggregation structures. IR research convincingly suggests that the highly interconnected nature of hypermedia can be used to increase information retrieval effectiveness [10,11], so this property cannot be ignored. But even more fundamental for the arguments made in this paper is the fact that query-based information retrieval is not the only way available to seek relevant information. Opportunistic, non-analytical browsing strategies can also be effectively deployed by information seekers.
Browsing is a natural approach to many types of information-seeking problems. It can also be effective for information problems which are ill defined, or where the actual information need is to gain an overview, or to clarify an information problem ([12], pp. 103). We can distinguish many different types of browsing (e.g. scanning, observing, navigating). In this paper we are mostly concerned with what may be characterized as across-document browsing [13]. This type of browsing has the goal of identifying relevant documents, in contrast to within-document browsing, which is concerned with locating a relevant passage within a document, or extracting its gist. Across-document browsing can be used in conjunction with other on-line information-seeking strategies in order to solve an information problem more effectively.
Thinking of ways to measure the effectiveness of information seeking in hypermedia digital libraries poses some stimulating questions. For example, the first general question which immediately arises is whether R and P are the most suitable (measurable) quantities for
evaluating the effectiveness of hypermedia digital libraries. And if they are, can we really measure them in a pragmatic distributed and rapidly changing networked environment?
Browsing strategies depend significantly more on interactions between information seekers and the system than do analytical, query-based strategies. Because of this highly interactive nature, browsing is more dependent on the physical, emotive and cognitive abilities of information seekers than are analytical strategies, and also more dependent on the influences the system exerts on the information seeker. Hence, it is not realistic to artificially simulate browsing in the same manner that ad hoc, stateless, non-interactive runs of queries simulate operational query-based environments in P and R evaluations.
In this paper, we argue that the classical evaluation methodologies based on the usual measurements of R and P cannot be directly applied to the evaluation of hypermedia digital libraries. In Section 2 we provide the context to support this argument and discuss the qualitative and quantitative reasoning behind it. In Section 3 we propose a new quantitative methodology for evaluating the performance of information seeking in hypermedia digital libraries. It is based on the structural analysis of a hypermedia network and the step-by-step evaluation of users' movements during an information-seeking process. The proposed methodology could be used to evaluate, in a natural and unified way, the performance of hypermedia-based seeking environments. It could provide a solution to at least some of the problems found in evaluations based on R and P, were these methods applied to evaluate hypermedia digital libraries. Finally, in Section 4 we summarize and attempt to draw some conclusions.
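As a purely illustrative sketch of this step-by-step evaluation of a user's movements (the graph, node names and scoring below are invented for exposition, not definitions committed to by the methodology), one could compute, after each move, the link distance from the user's current node to the nearest relevant node in the hypermedia network:

```python
from collections import deque

def distance_to_relevant(links, start, relevant):
    """Breadth-first search: link distance from `start` to the nearest
    relevant node, or None if no relevant node is reachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in relevant:
            return dist
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def score_path(links, path, relevant):
    """Distance to the nearest relevant node after each move of a session."""
    return [distance_to_relevant(links, node, relevant) for node in path]

# A small hypermedia network and one browsing session towards "rel1".
links = {"home": ["a", "b"], "a": ["rel1"], "b": ["c"], "c": ["rel1"]}
print(score_path(links, ["home", "b", "c", "rel1"], {"rel1"}))  # [2, 2, 1, 0]
```

A monotonically decreasing sequence would indicate that every move brought the seeker closer to relevant material; plateaus and increases expose unproductive moves, which set-based R and P measurements cannot see.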
2. Information seeking in hypermedia digital libraries
One cannot discuss evaluation without considering the context in which information seeking takes place. Hypermedia digital libraries are very different information-seeking environments from conventional IR systems.
Firstly, they can support arbitrary mixtures of query-based IR and browsing mechanisms for information seeking. Secondly, information objects (i.e. documents) are highly interconnected with easily accessed cross-reference links. The semantic interconnection of documents suggests that a different interpretation of the relevance of an information object, given an information need, may be required.
Beyond these qualitative differences there are some other issues which we believe should be taken into account when considering evaluation of hypermedia digital libraries: the issue of electronic publishing [14] and the extremely dynamic nature of these environments [15].
2.1. Measuring the performance of browsing strategies
Experience suggests that individual people apply different mixtures of analytical and browsing strategies when both are available, but IR systems severely limit the strategies information seekers can use. In conventional IR systems browsing is unknown, very cumbersome, or limited (for example, to within-document browsing, or to repeated, often manually reformulated, queries), and it is therefore ignored by evaluation methodologies.
However, browsing cannot be ignored in hypermedia digital libraries, not only because it is a defining feature, but crucially because the overall effectiveness of information seeking in a hypermedia digital library greatly depends on the browsing strategy it supports. Therefore, evaluation methods for hypermedia digital libraries should be able to measure the effectiveness of information seeking, using both browsing and querying in a unified way if possible.
Draper [16] and Marchionini [12] have broken down the IR process into five and eight subtasks respectively: for example, select a source, formulate a query, examine results, and so on. In both classifications retrieving a set of documents is just one subtask. Current evaluation methods based on P and R can be criticized because they measure the effectiveness of the execution of this single subtask, which relates to the internal system operation only. In the case of hypermedia digital libraries this criticism can be amplified because R and P methods can measure only a part of even this single subtask.
Evaluation methods based on P and R cannot measure the effectiveness of browsing strategies, which are an important part of any hypermedia information-seeking environment. They cannot measure the effectiveness of browsing not because of any inability of the R and P metrics themselves, but because of the inappropriateness of R and P evaluation methodologies (at least those which apply stateless, ad hoc approaches) for meaningfully producing the raw data from which R and P could be calculated. A possible solution which can remedy this situation is the involvement of users and the calculation of R and P based on data produced during a series of interactions that users have with the system under evaluation (see an example of such an evaluation in ). This is the approach which our evaluation methodology suggests (without using R and P, but instead another, newly proposed set of metrics), and therefore we believe it is more user-centred, although still quantitative.
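As an illustrative sketch of this user-centred variant (the session representation below is an assumption made for exposition), R and P can be computed from interaction data by treating the documents a user actually examined during a session as the retrieved set:

```python
def session_precision_recall(visited, relevant):
    """Compute (precision, recall) over a browsing session: the documents
    the user examined play the role of the retrieved set."""
    examined = set(visited)  # revisits of the same document collapse
    hits = len(examined & relevant)
    precision = hits / len(examined) if examined else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A session in which the user visited d1, d3, d1 again, then d7.
p, r = session_precision_recall(["d1", "d3", "d1", "d7"], {"d3", "d7", "d9"})
```

Unlike a stateless query run, the input here is produced by a real sequence of user interactions, which is why such data are expensive to collect.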
The reason this approach is not taken in most R and P evaluations is possibly the high cost it incurs due to the involvement of users. Certainly, cost is an important factor which must be taken seriously into account, especially since many IR projects have involved thousands of tests [18]. However, at least some experiments or investigations should involve users regardless of the cost, in order that more realistic and complete evaluations be obtained. This problem stimulates the (ambitious) idea of artificially generating/simulating user behaviour (i.e. users) based on prior knowledge (e.g. for the WWW this could be done using the WWW log files).
2.2. The underlying notion of relevance
The highly interconnected nature of hypermedia suggests another reason why current evaluation methods based on P and R are unsuitable for evaluating these environments. Richly organized and structured information objects bring forward the necessity for a new notion of relevance in hypermedia digital libraries. In conventional IR systems, the support offered to the user is for the retrieval of documents which have one or more exactly matching terms. In a hypermedia digital library, on the other hand, items may be usefully retrieved which are not in themselves apparently relevant (i.e. not exactly matching documents, or even low-ranked documents in a ranked-output approach), but which, if they can serve as useful launching pads for browsing, will serve the user's information need equally well.
It is necessary therefore to determine a different, but still formally defined, derived notion of relevance in hypermedia digital libraries. This notion of relevance should take into account the reality of interconnected information objects stored in such environments. The evaluation methods based on P and R have underlying them a notion of relevance which is at least partly incompatible with this new proposed notion of relevance in hypermedia digital libraries. Consider, for example, that under a P and R based evaluation a retrieved irrelevant document which has five links to relevant documents will be classified the same as another retrieved irrelevant document which does not contain any links to relevant documents. We need evaluation methodologies which apply a less strict and less monolithic notion of the relevance of a document.
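One hypothetical way to make such a derived notion concrete (the scoring function and the weight below are illustrative assumptions, not a committed definition) is to credit a retrieved document both for its own relevance and, with some discount, for each outgoing link that leads to a relevant document:

```python
def derived_relevance(doc, links, relevant, link_weight=0.2):
    """Direct relevance (0 or 1) plus a discounted credit for every
    outgoing link that points at a relevant document."""
    direct = 1.0 if doc in relevant else 0.0
    linked_hits = sum(1 for target in links.get(doc, ()) if target in relevant)
    return direct + link_weight * linked_hits

# Both x and y are themselves irrelevant, but x links to five relevant documents.
links = {"x": ["r1", "r2", "r3", "r4", "r5"], "y": ["n1", "n2"]}
relevant = {"r1", "r2", "r3", "r4", "r5"}
print(derived_relevance("x", links, relevant))  # 1.0
print(derived_relevance("y", links, relevant))  # 0.0
```

Under such a score the two irrelevant documents of the example are no longer indistinguishable: the one that opens five browsing paths to relevant material is credited accordingly.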
2.3. Obtaining relevance assessments in dynamic environments
In the previous paragraphs we discussed the fundamental problems of current evaluation methods based on R and P: the inability to take account of browsing strategies in measuring information-seeking effectiveness, the stateless approach to measuring performance which does not involve information seekers in producing the raw data, and the need for a new notion of relevance for information objects. There is another, broader issue which can greatly affect the evaluation of hypermedia digital libraries and make current evaluation methods inapplicable: the issue of electronic publishing. Davis and Hey [14] claim that if digital libraries are to be more than computerized search engines, it is essential to add value to what is currently avail...