
Interacting with Computers 10 (1998) 269-284


Evaluation of information-seeking performance in hypermedia digital libraries

Michail Salampasis, John Tait*, Chris Bloor

School of Computing and Information Systems, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, Sunderland SR6 0DD, UK

Abstract

Nowadays, we are witnessing the development of new information-seeking environments and applications such as hypermedia digital libraries. Information Retrieval (IR) is increasingly embedded in these environments and plays a cornerstone role. However, in hypermedia digital libraries IR is a part of a large and complex user-centred information-seeking environment. In particular, information seeking is also possible using non-analytical, opportunistic and intuitive browsing strategies. This paper discusses the particular evaluation problems posed by these current developments. Current methods based on Recall (R) and Precision (P) for evaluating IR are discussed, and their suitability for evaluating the performance of hypermedia digital libraries is examined. We argue that these evaluation methods cannot be directly applied, mainly because they do not measure the effectiveness of browsing strategies: the underlying notion of relevance ignores the highly interconnected nature of hypermedia information and misses the reality of how information seekers work in these environments. Therefore, we propose a new quantitative evaluation methodology, based on the structural analysis of hypermedia networks and the navigational and search state patterns of information seekers. Although the proposed methodology retains some of the characteristics (and criticisms) of R and P evaluations, it could be more suitable than them for measuring the performance of information-seeking environments where information seekers can utilize arbitrary mixtures of browsing and query-based searching strategies. © 1998 Elsevier Science B.V.

Keywords: Evaluation; Hypermedia digital libraries; Structural analysis of hypermedia; Relative distance relevance

1. Introduction

Evaluation is a major problem in research and development of applications related to information retrieval (IR), and whole working groups are dedicated to the study of evaluation of modern information retrieval systems [1]. Currently, the effectiveness of conventional (both batch and interactive) IR systems is mainly measured by two metrics: recall (R) and precision (P). There has been much debate over a long period about whether methods based on these metrics are the most suitable measures of the effectiveness of an IR system. Hersh et al. ([2], pp. 164) say for example: “the evaluation methods currently used, have limitations in their ability to measure how well users are able to acquire information”. In a recent paper Saracevic ([3], pp. 142) examined the evaluation of IR systems and pointed out: “the most serious criticism and limitation of the TREC evaluations is that they treat IR in a batch mode”. Sparck Jones [4], in a paper discussing the TREC programme, says “the TREC programme involves, at the meta level, a strongly shaped evaluation methodology relying on an evaluation paradigm that has to be judged both for its own soundness and pertinence”. She also says in a much earlier introductory paper ([5], pp. 1) of a book dedicated to information retrieval experiment: “there is no good reason to suppose that conventional methods are best, even in principle, let alone practice”. A more fundamental issue in respect of evaluation methods based on indices like recall and precision is the subjective nature of relevance assessments ([6], pp. 146), and the lack of emphasis on the theory underlying the systems under evaluation [7].

* Corresponding author

0953-5438/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0953-5438(98)00010-1

This debate and scepticism over evaluation, however, concerned IR systems which were developed as ‘isolated’ systems, designed to support analytical, query-based retrospective information retrieval. In recent years, a shift has been observed towards more complex information workplaces, which usually are connected and can be accessed through a wide-area network. The term digital library is often used to describe these rather complex and highly interactive information workplaces [8]. Hypermedia digital libraries are digital libraries based on a hypermedia paradigm [9]. Information in hypermedia digital libraries is richly connected and organized using networked, and sometimes, additionally, hierarchical, or aggregation structures. IR research convincingly suggests that the highly interconnected nature of hypermedia can be used to increase information retrieval effectiveness [10,11], so this property cannot be ignored. But, even more fundamental for the arguments made in this paper, is the fact that query-based information retrieval is not the only way available to seek relevant information. Opportunistic, non-analytical browsing strategies can also be effectively deployed by information seekers.

Browsing is a natural approach to many types of information-seeking problems. It can also be effective for information problems which are ill defined, or where the actual information need is to gain an overview, or clarify an information problem ([12], pp. 103). We can distinguish many different types of browsing (e.g. scanning, observing, navigating). In this paper, we are mostly concerned with what may be characterized as across-document browsing [13]. This type of browsing has the goal of identifying relevant documents, in contrast to within-document browsing which is concerned with locating a relevant passage within a document, or extracting its gist. Across-document browsing can be used in conjunction with other on-line information-seeking strategies in order to more effectively solve an information problem.

Thinking of ways to measure the effectiveness of information seeking in hypermedia digital libraries poses some stimulating questions. For example, the first general question which immediately arises is whether R and P are the most suitable (measurable) quantities for evaluating the effectiveness of hypermedia digital libraries. And if they are, can we really measure them in a pragmatic, distributed and rapidly changing networked environment?

Browsing strategies depend significantly more on interactions between information seekers and the system than analytical, query-based strategies do. Because of their highly interactive nature, browsing strategies are more dependent on the physical, emotive and cognitive abilities of information seekers than are analytical strategies. Browsing strategies are also more dependent than analytical strategies on influences of the system on the information seeker. Hence, it is not realistic to artificially simulate browsing in the same manner as ad hoc, stateless, non-interactive runs of queries simulate operational query-based environments in P and R evaluations.

In this paper, we argue that the classical evaluation methodologies based on the usual measurements of R and P cannot be directly applied to the evaluation of hypermedia digital libraries. In Section 2 we provide the context to support this argument and discuss the qualitative and quantitative reasoning behind it. In Section 3 we propose a new quantitative methodology for evaluating the performance of information seeking in hypermedia digital libraries. It is based on the structural analysis of a hypermedia network and the step-by-step evaluation of users’ movements during an information-seeking process. The proposed methodology could be used to evaluate in a natural and unified way the performance of hypermedia-based seeking environments. It could provide a solution to at least some of the problems that would arise in evaluations based on R and P if these methods were applied to evaluate hypermedia digital libraries. Finally, in Section 4 we summarize and attempt to draw some conclusions.

2. Information seeking in hypermedia digital libraries

One cannot discuss evaluation without considering the context in which information seeking takes place. Hypermedia digital libraries are very different information-seeking environments from conventional IR systems.

Firstly, they can support arbitrary mixtures of query-based IR and browsing mechanisms for information seeking. Secondly, information objects (i.e. documents) are highly interconnected with easily accessed cross-reference links. The semantic interconnection of documents suggests that a different interpretation of the relevance of an information object, given an information need, may be required.

Beyond these qualitative differences there are some other issues which we believe should be taken into account when considering evaluation of hypermedia digital libraries: the issue of electronic publishing [14] and the extremely dynamic nature of these environments [15].

2.1. Measuring the performance of browsing strategies

Experience suggests that individual people apply different mixtures of analytical and browsing strategies when both are available, but IR systems severely limit the strategies information seekers can use. In conventional IR systems browsing is unknown or very cumbersome, or limited for example to within-document browsing, or requires repeated, often manually reformulated, queries, and is therefore ignored by evaluation methodologies.


However, browsing cannot be ignored in hypermedia digital libraries, not only because it is a defining feature, but crucially because the overall effectiveness of information seeking in a hypermedia digital library greatly depends on the browsing strategy it supports. Therefore, evaluation methods for hypermedia digital libraries should be able to measure the effectiveness of information seeking, using both browsing and querying in a unified way if possible.

Draper [16] and Marchionini [12] have broken down the IR process into five and eight subtasks respectively: for example, select a source, formulate a query, examine results, and so on. In both classifications retrieving a set of documents is just one subtask. Current evaluation methods based on P and R can be criticized because they measure the effectiveness of the execution of this single subtask which relates to the internal system operation only. In the case of hypermedia digital libraries this criticism can be amplified because R and P methods can measure only a part of even this single subtask.

Evaluation methods based on P and R cannot measure the effectiveness of browsing strategies which are an important part of any hypermedia information-seeking environment. They cannot measure the effectiveness of browsing not because of the inability of R and P metrics to do so, but because of the inappropriateness of R and P evaluation methodologies (at least those which apply stateless, ad hoc approaches) to produce meaningfully the raw data on which R and P could be calculated. A possible solution which can remedy this situation is the involvement of users and the calculation of R and P based on data produced during a series of interactions that users have with the system under evaluation (see an example of such an evaluation in [17]). This is the approach which our evaluation methodology suggests (without using R and P, but instead another new proposed set of metrics) and therefore we believe it is more user-centred, although still quantitative.

The reason this approach is not taken with most of the R and P evaluations is possibly the high cost that it incurs due to the involvement of users. Certainly, cost is an important factor which must be taken into account seriously, especially since many IR projects have involved thousands of tests [18]. However, at least some of the experiments or investigations should involve users independently of the cost that it requires, in order that more realistic and complete evaluations be obtained. This problem stimulates the (ambitious) idea of artificially generating/simulating user behaviour (i.e. users) based on prior knowledge (e.g. for the WWW this could be done using the WWW log files).

2.2. The underlying notion of relevance

The highly interconnected nature of hypermedia suggests another reason why current evaluation methods based on P and R are unsuitable for evaluating these environments. Richly organized and structured information objects bring forward the necessity for a new notion of relevance in hypermedia digital libraries. In conventional IR systems, the support offered to the user is for the retrieval of documents which have one or more ‘exactly matching terms’. On the other hand, in a hypermedia digital library items may be usefully retrieved which are not in themselves apparently relevant (i.e. not exactly matching documents, or even low-ranked documents in a ranked output approach) but which, if they can serve as useful ‘launching pads’ for browsing, will serve the user’s information need equally well.


It is necessary therefore to determine a different, but still formally defined, derived notion of relevance in hypermedia digital libraries. This notion of relevance should take into account the reality of interconnected information objects stored in such environments. The evaluation methods based on P and R have underlying them a notion of relevance, which is at least partly incompatible with this new proposed notion of relevance in hypermedia digital libraries. Consider for example that, under a P and R based evaluation, a retrieved irrelevant document which has five links to relevant documents will be classified the same as another retrieved irrelevant document which does not contain any links to relevant documents. We need evaluation methodologies which apply a less strict and less monolithic notion of the relevance of a document.

2.3. Obtaining relevance assessments in dynamic environments

In the previous paragraphs, we discussed the fundamental problems of current evaluation methods based on R and P, i.e. the inability to take account of browsing strategies in measuring information-seeking effectiveness, the stateless approach to measuring performance which does not involve information seekers in producing the raw data, and the need for a new notion of relevance for information objects. There is another broader issue which can greatly affect the evaluation of hypermedia digital libraries, and make current evaluation methods inapplicable: the issue of electronic publishing. Davis and Hey [14] claim that if digital libraries are to be more than computerized search engines, it is essential to add value to what is currently available. Undoubtedly, the promise of electronic publishing (and in particular the spread of the publishing function from traditional publishing organizations to individuals) is of great additional value. However, there is also no doubt that giving individuals the ability to change documents, publish new ones, etc., is ‘incompatible’ with implicit assumptions made by current evaluation methods. These assumptions relate to static document collections, completeness of the indices, etc. [19].

To understand the effects of electronic publishing on evaluation methods, let us accept, for the sake of the argument, that R and P are suitable for evaluating a hypermedia digital library system. It will be recalled that the R for a search query is the proportion of relevant material actually retrieved and the P of a search result is the proportion of retrieved material which is actually relevant. There is an underlying assumption in evaluation methods based on measuring the R of a system for a set of queries and a document collection, that the total number of relevant documents in the collection for a particular query is known. This is possible because test collections (i.e. document collections which are supplemented with queries and the experts’ relevance assessments) are snapshots of changing document collections. There is also an underlying assumption that evaluation of the test collection is representative of the real document collection as well.
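The two definitions recalled above can be made concrete with a minimal sketch. The function name `recall_precision` and the document identifiers are hypothetical and serve only as illustration; the point is that recall needs the complete set of relevant documents as input, which is exactly the quantity that a dynamic collection cannot supply.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant in the whole collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Recall divides by the size of the full relevant set, which is only
    # known when the collection is a static snapshot (a test collection).
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Precision only needs judgments on what was actually retrieved.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical document ids, for illustration only.
r, p = recall_precision(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d2", "d4", "d7"})
print(f"recall={r:.2f} precision={p:.2f}")
```

If a new relevant document were published after the judgments were made, the precision of this run would be unaffected, while its recall would silently become wrong.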

The fact is that most non-test text-based document collections are dynamic, and furthermore networked and distributed hypermedia digital library systems are by their nature especially dynamic. The different document collections are not immediately accessible, thus it is impossible at any moment to know the exact number and identity of all documents relevant to a given test query. This is especially a problem when relevance assessments are required of the relevant documents not retrieved (as in the case of R).


This and other problems in obtaining relevance assessments (see [20], pp. 17), however, have to be accepted by every quantitative evaluation methodology (including the one proposed in this paper) which requires the numbers of the relevant documents in the collection to be known. A compromise must be obtained between realistic modelling of the retrieval environment and the need to utilize static relevance judgments.

2.4. Multi-level evaluations

We have mentioned before that most of the IR evaluation efforts are carried out at the system processing level (algorithms and procedures). Saracevic [3] adds that even when evaluation efforts have been carried out at the user level (qualitative approach), they have never been extensively considered by evaluations at other levels. He also says: “if there is a paradigm shift, it should be toward cooperative efforts and mutual use of results between system and user centred evaluations”. Clearly, there is a scepticism here about the completeness of evaluation procedures of even conventional IR.

The advent of hypermedia digital libraries will increase this scepticism about the use of one-dimensional, system-centred evaluation, because hypermedia digital libraries are more user-centred information-seeking environments than conventional IR systems. For example, in a system which supports browsing, the information-seeking sub-processes proceed in a more parallel fashion, and they require the user to be more actively engaged in coordinating these sub-processes. In contrast, most current IR environments are based on the view of information retrieval as a set of isolated tasks. Rao et al. [21] argue that this view misses the reality of users doing work in real information environments. We believe that it is gradually being recognized that there is a shift from system-centred information retrieval to more user-centred information-seeking environments. Therefore, changes are required for the evaluation of such environments.

In this section we have discussed the reasons why we believe traditional evaluation methods based on P and especially R are to some extent inappropriate for evaluating the effectiveness of information seeking in hypermedia digital libraries. In the next section we propose an evaluation method based on a structural analysis of hypermedia networks and the patterns of information seekers in hypermedia digital libraries.

3. A new method for evaluating hypermedia digital libraries

Botafogo et al. [22] have presented a methodology and used a set of metrics to identify hierarchies in hypermedia systems and to help users navigate in hypermedia systems [23]. Their work is based on a structural analysis of hypermedia, where links are untyped and hypermedia can be represented with a directed graph. Based on their work, and on Belkin’s information theory which formalizes the process of information seeking based on the idea of ‘anomalous states of knowledge’ (ASK; [24]), we suggest here a new approach and a new set of metrics which could be potentially used to evaluate information-seeking performance in hypermedia digital libraries.


3.1. Background information

Suppose that we have a hypermedia network which consists of L nodes, and that it is represented with a directed graph. A distance matrix of a graph is a matrix that has as its entries the distance of every node in the hypermedia to every other node. If M is the distance matrix, the converted distance matrix (C) is defined as follows:

$$C_{ij} = \begin{cases} M_{ij} & \text{if } M_{ij} \neq \infty \\ K & \text{otherwise} \end{cases} \tag{1}$$

where K is a finite constant (typically K = L).

The converted out distance (COD) for a node i is the sum of all entries in row i in the converted distance matrix (C):

$$COD_i = \sum_{j} C_{ij} \tag{2}$$

The converted distance of a hypermedia H ($CD_H$) is given by the sum of all entries in the converted distance matrix:

$$CD_H = \sum_{i} \sum_{j} C_{ij} \tag{3}$$
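The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list is not given in the text and has been read off the distance-1 entries of Table 1 (Fig. 1 itself is not reproduced here), and the names `NODES`, `EDGES`, `converted_distance_matrix`, `cod` and `cd` are hypothetical. Shortest paths are found with a plain breadth-first search, which is sufficient because links are untyped and unweighted.

```python
from collections import deque

# Assumed edge list: inferred from the distance-1 entries of Table 1,
# since the figure of the network is not available in this text.
NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = {
    "a": [],
    "b": ["a", "d"],
    "c": ["b"],
    "d": ["b", "c", "e", "f"],
    "e": ["d"],
    "f": ["d", "e"],
}


def converted_distance_matrix(nodes, edges, k=None):
    """Shortest-path distances, with unreachable pairs replaced by K."""
    k = len(nodes) if k is None else k  # the text suggests K = L
    matrix = {}
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search along outgoing links
            current = queue.popleft()
            for target in edges[current]:
                if target not in dist:
                    dist[target] = dist[current] + 1
                    queue.append(target)
        matrix[source] = {t: dist.get(t, k) for t in nodes}
    return matrix


def cod(matrix, node):
    """Converted out distance: the sum of the node's row."""
    return sum(matrix[node].values())


def cd(matrix):
    """Converted distance of the network: the sum of all entries."""
    return sum(cod(matrix, node) for node in matrix)


C = converted_distance_matrix(NODES, EDGES)
print({n: cod(C, n) for n in NODES}, cd(C))
```

With this assumed edge list the row sums agree with the COD column of Table 1, which is the only check available here that the inferred links are plausible.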

Fig. 1 and Table 1 present a hypermedia network H and the respective converted distance matrix.

The COD value is a good indication of the centrality of a node within a given hypermedia network, with lower figures indicating more central nodes. The need to have a more ‘normalized’ view leads to the introduction of the relative out centrality (ROC):

$$ROC_i = \frac{CD_H}{COD_i} \tag{4}$$

The higher the ROC value of a node in the hypermedia network, the more easily (i.e. with fewer movements) it can access other nodes (high out centrality). In the network presented in Fig. 1, we can see that node {d} has the biggest ROC and node {a} has the lowest ROC. We can also see that node {d} is the most central in the network, while node {a} is the least central.
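As a worked check, assume that ROC is obtained by dividing the converted distance of the network by the COD of the node, a reading that reproduces the COD and ROC columns of Table 1. The converted distance of H is the sum of the six column totals of that table, and the two extreme nodes then give:

```latex
\begin{align*}
CD_H  &= 11 + 12 + 13 + 11 + 13 + 14 = 74 \\
ROC_d &= \frac{CD_H}{COD_d} = \frac{74}{6} \approx 12.33 \\
ROC_a &= \frac{CD_H}{COD_a} = \frac{74}{30} \approx 2.47
\end{align*}
```

Node {a} has no outgoing path to any other node, so every off-diagonal entry of its row equals K = 6, which is why its COD is the largest possible and its ROC the smallest.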

3.2. Rationale and metrics used in the methodology

Fig. 1. A hypermedia network H.

Table 1
Converted distance matrix of hypermedia network H

        a    b    c    d    e    f    COD    ROC
a       0    6    6    6    6    6     30    2.4
b       1    0    2    1    2    2      8    9.25
c       2    1    0    2    3    3     11    6.7
d       2    1    1    0    1    1      6    12.3
e       3    2    2    1    0    2     10    7.4
f       3    2    2    1    1    0      9    8.2
Sum    11   12   13   11   13   14    CD_H = 74

Before we start presenting the methodology and the metrics which we propose for evaluating the effectiveness of information-seeking performance in hypermedia digital libraries, we would like first to discuss the rationale behind this proposal. We said in Section 2 that a new notion of relevance is required for evaluating information seeking in hypermedia digital libraries, which should take into account the reality of highly interconnected information objects. Also it should be able to evaluate both analytical query-based and opportunistic browsing-based information-seeking strategies. Our proposed methodology and metrics achieve these goals since they can measure in a unified way the effectiveness of arbitrary mixtures of query-based and browsing information-seeking strategies. The relevance assessment of information objects to a particular information need is based on the structural analysis of the hypermedia network, and hence it is more objective since it takes into account interconnections between information objects.

Also, the relevance of documents is not measured on a yes/no basis, but on a scale which allows any intermediate value between these two extremes. The relevance of a document is determined by the position that this information object has in the hyperinformation space (provided that it is topologically organized by relevance). That allows documents whose content would prevent them from having a positive relevance (i.e. greater than zero) in R and P evaluations to have some degree of relevance which is relative to their location with respect to the documents satisfying the current information need.

Think for example of a user who is seeking information in the hypermedia network presented in Fig. 1. Let us assume that the user has an information need N which can be fulfilled by the document {a}. Two information retrieval processes $Q_1$ and $Q_2$ which would produce as a result documents {b} and {e}, respectively, will have the same R and P values (0%). Certainly, this misses the reality of how users seek information in hypermedia digital libraries. Intuitively, we can see that $Q_1$ has produced better results than $Q_2$. It is much easier for the user to find the relevant information (i.e. document {a}) beginning from document {b} rather than document {e}. We can say then that the overall performance of the results produced by $Q_1$ is better than that of the results produced by $Q_2$.

The idea that an information object X has some relevance to a particular information need N, which can be fulfilled by information object Y, is expressed in the methodology by the converted distance $C_{XY}$ of node X to node Y. If an information need N is fulfilled by a set of documents $S = \{d_1, d_2, \ldots, d_k\}$, then the distance relevance $DR_N(i)$ of an information object {i}, which is the result of an information-seeking process (based on browsing or querying), is defined as the sum of the $C_{ij}$ of node {i} to all nodes j ($\forall d_j \in S$) which are members of the set S. This is expressed as:

$$DR_N(i) = \sum_{d_j \in S} C_{ij} \tag{5}$$

In other words, this function determines the overall relevance of a document {i} given an information need N which is fulfilled by $S = \{d_1, d_2, \ldots, d_k\}$. Note that the ‘closer’ a document is to the documents fulfilling the information need, the smaller is the DR of this document (it becomes 0 in the best possible match). Similarly, the larger the DR of a document becomes, the less relevant this document is to the information need (the DR becomes $\infty$ in the worst case). However, from eqn (1) we can observe that if we use L (i.e. the total number of nodes in the network) instead of $\infty$ when calculating the distance matrix of a hypermedia graph, none of the documents can become totally irrelevant to a given information need (i.e. no document can have DR equal to $\infty$). This approach in measuring the DR metric captures the idea that even the least relevant (based on content) document could provide useful links to other documents which are closer (i.e. have smaller DR) to relevant documents.

Again the need to normalize the relevance of an information object, given a hypermedia network H, leads to the introduction of the relative distance relevance $RDR_N(X)$ of a document X given an information need N. We normalize by dividing the converted distance of the hypermedia network H ($CD_H$) by the $DR_N(X)$, i.e.

$$RDR_N(X) = \frac{CD_H}{DR_N(X)} \tag{6}$$

Note that RDR is inversely related to DR (RDR increases when the relevance of an information object increases and vice versa). The RDR normalizes for networks of different sizes and it is the metric which is best used for measuring the relevance of a document.

As eqn (6) shows, the RDR is calculated from metrics derived from the structural analysis of a hypermedia network. So, evaluation methodologies based on the RDR take into account the interconnections between documents, as these are expressed by hypermedia links. Recall again the example of a user who is seeking, in the hypermedia network presented in Fig. 1, the document {a} which fulfils an information need N. The RDRs of the two documents {b} and {e} resulting from two different information-seeking processes $Q_1$ and $Q_2$, respectively, are $RDR_N(b) = 74$ and $RDR_N(e) = 24.6$. This simple example demonstrates how the RDR can capture mathematically the intuitive perception that $Q_1$ has produced more effective results than $Q_2$. Additionally, it allows the relevance of an information object to be expressed on a continuous scale rather than on a yes/no basis, and, as will be demonstrated later, it can measure the effectiveness of both query-based and browsing information-seeking strategies.
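The example can be reproduced with a minimal sketch. The matrix below is the converted distance matrix of network H as read from Table 1; the function names are hypothetical, and the handling of an exact match (a DR of zero) is an assumption, since the text does not say what value RDR takes in that case.

```python
import math

NODES = ["a", "b", "c", "d", "e", "f"]
# Converted distance matrix of network H, as read from Table 1 (K = 6).
C = [
    [0, 6, 6, 6, 6, 6],
    [1, 0, 2, 1, 2, 2],
    [2, 1, 0, 2, 3, 3],
    [2, 1, 1, 0, 1, 1],
    [3, 2, 2, 1, 0, 2],
    [3, 2, 2, 1, 1, 0],
]
INDEX = {name: i for i, name in enumerate(NODES)}
CD_H = sum(map(sum, C))  # converted distance of the whole network


def distance_relevance(node, need):
    """DR: summed converted distance from `node` to every node in `need`."""
    return sum(C[INDEX[node]][INDEX[target]] for target in need)


def relative_distance_relevance(node, need):
    """RDR: CD_H divided by DR, so larger values mean more relevant."""
    dr = distance_relevance(node, need)
    # Assumption: an exact match (DR = 0) is reported as infinity here;
    # the text does not state how this case is treated.
    return math.inf if dr == 0 else CD_H / dr


need = {"a"}  # the information need is fulfilled by document {a}
for doc in ("b", "e"):
    print(doc, distance_relevance(doc, need),
          round(relative_distance_relevance(doc, need), 2))
```

Both documents would score 0% on recall and precision, yet the sketch ranks {b} well above {e} because {b} is one link away from {a} while {e} is three links away.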

The RDR gives the relevance of one information object to a given information need. In most cases, however, the result of a query-based search is a set of documents. This is similarly true in advanced hypermedia digital libraries where information objects can be organized in aggregates or hierarchies. Therefore, we need to measure the relevance of a whole set of documents resulting from a search process. We assume that the system does not discriminate between any members of the set; in other words, all the resultant documents have the same possibility of being selected. We can now define the distance relevance $DR_N(S)$ of a search process which results in a state K (i.e. a set of information objects $A = \{d_1, d_2, \ldots, d_m\}$), given an information need N fulfilled by $S = \{d_1, d_2, \ldots, d_k\}$, as:

$$DR_N(S) = \sum_{i=1}^{m} DR_N(i) = \sum_{i=1}^{m} \sum_{j=1}^{k} C_{ij} \tag{7}$$

where i ranges over the members of A and j over the members of S.
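A small sketch of the set-level distance relevance follows, read as a double sum of converted distances over every pair formed by a document of the result set and a document fulfilling the need. The matrix is taken from Table 1; the function name and the two result sets are hypothetical.

```python
NODES = ["a", "b", "c", "d", "e", "f"]
# Converted distance matrix of network H, as read from Table 1 (K = 6).
C = [
    [0, 6, 6, 6, 6, 6],
    [1, 0, 2, 1, 2, 2],
    [2, 1, 0, 2, 3, 3],
    [2, 1, 1, 0, 1, 1],
    [3, 2, 2, 1, 0, 2],
    [3, 2, 2, 1, 1, 0],
]
INDEX = {name: i for i, name in enumerate(NODES)}


def set_distance_relevance(state, need):
    """Sum C[i][j] over every i in the result set and every j in the need."""
    return sum(C[INDEX[i]][INDEX[j]] for i in state for j in need)


# Hypothetical result sets for an information need fulfilled by {a, b}.
need = {"a", "b"}
near = set_distance_relevance({"c", "d"}, need)
far = set_distance_relevance({"e", "f"}, need)
print(near, far)
```

Because the sum grows with the number of documents in the result set, raw values of this kind are only directly comparable between result sets of the same size.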

Again the need to normalize leads to the introduction of the relative distance relevance $RDR_N(S)$ of a search process which results in a state of knowledge K to an information need N fulfilled by $S = \{d_1, d_2, \ldots, d_k\}$ as:

The evaluation methodology which we will describe next uses the metrics presented above for evaluating hypermedia digital libraries. The methodology follows an approach to information seeking which is based on Belkin’s anomalous states of knowledge theory [24]. According to this approach, at any particular time users possess a particular state of knowledge, and they seek information in order to change their state of (anomalous) knowledge. It might be said that in simple terms the ultimate goal is for their new state of knowledge to overlap and match exactly a particular state of knowledge expressed by an information need (ignoring for the moment the possibility that, for example, the search process might itself change the information need). In order to achieve this goal, users can employ different search strategies (i.e. querying and/or browsing). Interaction between users and the system involves the transition of users’ knowledge from one state to another, as a result of the search process.

Evaluations based on R and P are governed by a bottom-up approach, i.e. from the experiments to the understanding of the rules that guide the behaviour of a particular system. It has been suggested that a top-down approach, guided by a theory or model, is preferable ([5], pp. 2). Indeed, it seems more natural, and a better tactic, to have a theory which provides the framework and the tools to study a particular system than to do the opposite and try to invent the theory by investigating existing systems. Our methodology, as we have explained in the paragraph above, is grounded on and guided by a theory and model of information systems (i.e. Belkin’s ASK).

The evaluation methodology we propose here can also be applied when the information need is not stable during the whole information-seeking process, but changes as a result of a sub-process (e.g. after examining results). This is because the methodology is ‘state based’, and in each separate state the set of information objects which satisfy a problem (i.e. the set S in $RDR_N(S)$) can be different. In other words, there is no aspect of the proposed methodology which prevents S from changing between different states of knowledge.

3.3. Utilization of the methodology

Our proposed evaluation methodology tracks the users’ movements in the hypermedia digital library, and calculates the relative distance relevance of each state to the current information need. Each state can be a single information object (i.e. a hypermedia node, a document), or a set of information objects as a result of a query-based search or a movement to a composite/aggregate hypermedia node. It follows that the first thing we need in order to utilize the method is a system which can support a search process divided into many sub-processes. We also need a means (i.e. users) that can generate the different steps (transitions) from one state to another.

For each state of the search process we calculate the RDR of the resultant documents to the documents which in each state satisfy the information need (as we have explained, these can be different in each state). Using this mechanism, for a search process which has a starting state and eventually an ending state, a series of RDR numbers can be calculated. It is this series of numbers which represents the performance of the information-seeking process from the start to the end. Averages can also be produced to obtain a single number which characterizes the overall performance of the whole process.
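The state-by-state procedure can be sketched as follows, restricted to states that consist of a single document so that only the single-document RDR is needed. The session, the function names and the treatment of an exact match as infinity are all assumptions made for illustration; the matrix is the one given in Table 1.

```python
import math
from statistics import mean

NODES = ["a", "b", "c", "d", "e", "f"]
# Converted distance matrix of network H, as read from Table 1 (K = 6).
C = [
    [0, 6, 6, 6, 6, 6],
    [1, 0, 2, 1, 2, 2],
    [2, 1, 0, 2, 3, 3],
    [2, 1, 1, 0, 1, 1],
    [3, 2, 2, 1, 0, 2],
    [3, 2, 2, 1, 1, 0],
]
INDEX = {name: i for i, name in enumerate(NODES)}
CD_H = sum(map(sum, C))


def rdr(node, need):
    """RDR of a single document; an exact match is reported as infinity."""
    dr = sum(C[INDEX[node]][INDEX[target]] for target in need)
    return math.inf if dr == 0 else CD_H / dr


def session_rdr(session):
    """One RDR value per state; each state carries its own need set."""
    return [rdr(node, need) for node, need in session]


# Hypothetical session: the user browses e -> d -> b while the need
# (fulfilled by document a) stays the same in every state.
session = [("e", {"a"}), ("d", {"a"}), ("b", {"a"})]
series = session_rdr(session)
overall = mean(series)
print([round(value, 2) for value in series], round(overall, 2))
```

Pairing every state with its own need set mirrors the remark that the satisfying documents may differ from state to state. A final state that reaches the target exactly would contribute an infinite value under the assumption made here, so it is left out of the hypothetical session before averaging.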

A relative relevance diagram can also be created which depicts the effectiveness of an information-seeking process (session) in relation to a specific information need. This diagram starts from the first interaction (state) that a user has with the system, and ends when the user believes that they have fulfilled their information need (last state).

The RDR/state diagram can graphically show the performance of a hypermedia digital library system as it evolved state by state within one session. The overall performance of the search can be calculated as the mean value of all intermediate states. This overall performance can be used to evaluate how effectively a system helps users fulfil an information need. A diagram can then be created from the mean overall performances derived from many different sessions. This diagram can be used to evaluate the overall information-seeking effectiveness of a hypermedia digital library system. In Table 2 one can see the methodology applied for a user who was seeking document [d] in two different hypothetical systems based on the hypermedia network presented in Fig. 1. In Fig. 2 we present the RDR/state diagrams for systems A and B.

Table 2 presents a series of states (i.e. S1, S2, S3, etc.) and the corresponding RDRN during the information-seeking process. The last line calculates the mean value that RDRN had during the whole process. Although the previous example is based on a very small number of search states, we can hypothesize that system B can support more focused information seeking than system A (i.e. it helps users fulfil an information need faster, without imposing significant information noise).

Similar diagrams can be created for every hypermedia digital library system which supports both querying and browsing strategies. In general, from these diagrams we can determine how effective users are at seeking information given a particular information need. Using averaging techniques a single line can be created which evaluates the effectiveness of a system in supporting users in their information-seeking tasks.
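The mean calculation in the last line of Table 2 can be reproduced with a short sketch. It assumes that the final state, whose RDR is infinite because the sought document has been reached, is left out of the mean, as the sums in Table 2 appear to do. The function name `mean_rdr` is hypothetical, and the value 37 for the second state of system A is the reading consistent with the printed total of 120.25.

```python
import math
from typing import Sequence


def mean_rdr(series: Sequence[float]) -> float:
    """Mean of the finite RDR values of one session.

    Assumption: states with infinite RDR (the need is fully met) are skipped.
    """
    finite = [value for value in series if math.isfinite(value)]
    if not finite:
        raise ValueError("no finite RDR values to average")
    return sum(finite) / len(finite)


# RDR series for the two hypothetical systems, as read from Table 2.
system_a = [9.25, 37.0, 74.0, math.inf]
system_b = [10.5, 74.0, math.inf]

print(round(mean_rdr(system_a), 2))  # 40.08  (120.25 / 3)
print(mean_rdr(system_b))            # 42.25  (84.5 / 2)
```

Under these assumptions system B reaches the sought document in fewer states and with a higher mean RDR, which is the comparison drawn in the text.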

3.4. Discussion

The proposed methodology can measure in a unified way the effectiveness of both analytical query-based and browsing search strategies. The same metric (i.e. the relative distance relevance) can be used regardless of the mixture of query-based and browsing searches used to achieve this performance during the search. Similarly, unique diagrams can be developed and the overall performance can be calculated.


[Figure: RDR/State Diagram; horizontal axis: Search State]

Fig. 2. RDR/state diagram for the information seeking process illustrated in Table 2.

At the heart of the methodology lies an entirely new notion of relevance which does not ignore the highly interconnected nature of hypermedia digital libraries; it is more objective since relevance assessment is based on the distance between information objects. This notion of relevance also makes it possible to apply the proposed methodology in dynamic environments such as hypermedia digital libraries. This is because the relevance of a document is calculated solely on the basis of the relevance distance to a given information need. Even in dynamic environments which support electronic publishing this distance can be calculated and thus evaluations based on the methodology are feasible.

In a recent paper [25] we have suggested that a combination of roughly locating information by query searching and then specifically accessing it by browsing could be a powerful and natural mechanism for information retrieval in hypermedia digital libraries. Our experience (mostly from using the World Wide Web) suggests that this is the way most people prefer to search for information. It is not necessary to retrieve only documents which straightforwardly satisfy the information need; documents which are closely located to relevant documents are also ‘good matches’.

Table 2
A series of information-seeking states for two hypothetical systems (Q = query, B = browse)

Information need N1 = [d]

System A                                System B
Transition  State S        RDRN(S)      Transition  State S      RDRN(S)
Q1          S1 = {a, c}    9.25         Q1          S1 = {o}     10.5
B1          S2 = {c}       37           B1          S2 = {b}     74
B2          S3 = {b}       74           B2          S3 = {d}     ∞
B3          S4 = {d}       ∞

Mean RDR (System A) = (RDRN(Q1) + RDRN(B1) + RDRN(B2)) / 3 = 120.25/3 = 40.1
Mean RDR (System B) = (RDRN(Q1) + RDRN(B1)) / 2 = 84.5/2 = 42.25


We believe that the proposed methodology suits user-centred information-seeking environments like hypermedia digital libraries better than R and P. Evaluation methods based on R and P measure the effectiveness of the system in retrieving relevant information. However, retrieving relevant information does not guarantee that this information will be used by the users, or even that it is really relevant to them. For example, a document might be retrieved which matches one or more words (stems) in the search specification, but these words are being used in different senses from those appropriate to the users’ real information need. Additionally, the proposed methodology evaluates only the ‘real’ movements of information seekers, as they result from query-based or browsing search strategies.

In general, we believe that the proposed methodology, although still quantitative, is more user-centred than R and P evaluations. It is more user-centred because it directly involves users in the evaluation methodology. This is not true of R and P evaluations, which are based on the use of ad hoc runs of queries to produce results. The methodology is state based, in contrast to R and P evaluations which are stateless. Moreover, the series of states which we evaluate in each search session using the RDR are the results of real interactions between users and the information-seeking environments.

Other researchers have also recognized the need for user-centred evaluations and suggested alternatives to R and P evaluation methods. For example, Hersh et al. [2] suggest an alternative criterion for measuring the effectiveness of an IR system: its ability to satisfy the user's information needs. Users can be asked to use an IR system and then to assess the ability of the system to help them acquire information. Other research has been undertaken to analyze sequences of user movements in IR systems, and to utilize this analysis for determining the effectiveness of such systems. Qiu [26] tried to discover search state patterns through which users retrieve information in hypermedia systems. Wildemuth [27] has been involved in a series of studies to analyze the sequence of moves of information seekers. Other research tried to categorize the user actions captured in transaction logs [28].

All these research efforts were carried out in order to identify, analyze and describe the behaviour of information seekers in electronic environments. They are analogous to the evaluation methodology that we propose, in the sense that they all capture user moves. However, our methodology differs in that it uses the users’ transaction logs to evaluate the effectiveness of the information-seeking process, rather than to describe it.

3.5. Applicability and further work

The methodology described in this paper is just the first step towards an evaluation methodology for hypermedia digital libraries. There are some points on which further work is needed. For example, in the calculation of the RDR of an information object, all the links are regarded as having the same weight. Certainly, this is not true in many hypermedia systems. Additionally, in the calculation of the relative distance relevance of a search process S, we assumed that all information objects resulting from S have the same probability of being selected by the user. Again, this is not always true. Further lexical and structural analysis could eliminate these problems. Support for typed links, each with an associated weight affecting the distance relevance, could also address the fact that the methodology currently ignores the distinction between different types of links which exists in most hypermedia digital libraries.
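One way the typed-link extension might look is sketched below. The sketch assumes that the distance between two information objects is taken to be the minimum total link weight along any path between them, and it computes those distances with Dijkstra's algorithm. The link types, their weights and the function name `weighted_distances` are invented purely for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical link types and weights, for illustration only.
LINK_WEIGHTS = {"citation": 1.0, "see-also": 2.0}

# node -> list of (target node, link type)
Graph = Dict[str, List[Tuple[str, str]]]


def weighted_distances(graph: Graph, source: str) -> Dict[str, float]:
    """Minimum total link weight from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for target, link_type in graph.get(node, []):
            candidate = d + LINK_WEIGHTS[link_type]
            if candidate < dist.get(target, float("inf")):
                dist[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return dist
```

Nodes that cannot be reached from the source are simply absent from the returned dictionary; how such pairs should be scored is left open here.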

One could also be concerned about the applicability of the method, because of the potentially large converted matrices and the computational intensity of the calculations required to apply the methodology. If the links between information objects are non-weighted then the problem of constructing a converted matrix is actually similar to the breadth-first traversal of a directed graph ([29], p. 243). On the other hand, if the links are weighted and the resulting graph is a directed weighted graph (like the extensions proposed in the previous paragraph), the problem of constructing a converted matrix is similar to the travelling salesman problem. This is a processing-intensive problem for very large graphs. However, despite the large number of complex calculations that may be required to construct the converted matrices in very large hypermedia networks, the calculations required during retrieval are simpler and substantially fewer.
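For the non-weighted case, the breadth-first construction can be sketched as below. The sketch assumes that the converted matrix holds the shortest path length between every ordered pair of nodes, with unreachable pairs replaced by a chosen constant; the function names and the `unreachable` parameter are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Number of links on the shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in dist:
                dist[target] = dist[node] + 1
                queue.append(target)
    return dist


def distance_matrix(graph: Dict[str, List[str]],
                    unreachable: int) -> Dict[str, Dict[str, int]]:
    """All-pairs distances; pairs with no path get the `unreachable` constant."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    matrix = {}
    for source in nodes:
        reached = bfs_distances(graph, source)
        matrix[source] = {target: reached.get(target, unreachable) for target in nodes}
    return matrix
```

One traversal per node is needed, so the matrix can be built once in advance and then only looked up during a search session.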

We are currently conducting an experimental study which uses the methodology proposed here to evaluate the effectiveness of a WWW information-seeking environment. Additionally, an R and P evaluation will be applied and the results from the two approaches will be compared. The CACM test collection is being used for this study after being transformed into a hypermedia network using the existing links between CACM documents. The collection has also been clustered following the complete link approach [30], in order to (artificially) create and distribute a number of autonomous but interconnected sub-collections across a WWW network. The aim of the study is to investigate and experiment with different collection fusion methods, and in particular how these methods affect the information-seeking performance in distributed hypermedia digital libraries.

We asked the 36 users (divided into three groups) participating in the experimental study to seek information in the WWW-based digital library, using the different fusion strategies available to each group (all the groups could use browsing). Users’ moves and search results were systematically captured by the system (i.e. the WWW server and the search engine). This mechanism resulted in a set of data representing an average of 150 different (knowledge) states during a search session of at most 30 minutes. Most of these states, in our experiment, were the results of browsing (about 60%), while others were the result of distributed parallel searching (15%) and single collection searching (25%). For each user, and for each session executed (users were asked to run two sessions totalling 60 minutes), a series of RDR values can be calculated and graphical representations can be created.

Unfortunately, the data from this experimental study have not yet been processed and results are not yet available. We are looking forward to the results of this study for the insights that they will probably give us into the methodology itself and into how it compares with R and P evaluations; we intend to report them in a future paper.

4. Conclusion

Hypermedia digital libraries are user-centred information-seeking environments which are very different from conventional information retrieval systems. They can support arbitrary mixtures of query-based IR and browsing mechanisms and they are more complex and intrinsically dynamic. The advent of hypermedia digital libraries poses new questions and increases the scepticism about the appropriateness of current system-centred evaluation methods based on R and P. We believe that these evaluation methods cannot be directly applied in hypermedia digital libraries because:

• they do not measure the effectiveness of browsing strategies and thus miss the reality of how information seekers work in these environments;

• the underlying notion of relevance ignores the fact of highly interconnected information;

• they do not involve users directly since the measures of R and P are taken as a result of standardized runs of ad hoc queries. In other words, they are well-suited measurements of the system performance but they are less adequate for the measurement of the performance of information seekers;

• the nature (through electronic publishing) of these environments is extremely dynamic.

Therefore, we proposed a new evaluation methodology, based on the structural analysis of hypermedia networks and the analysis of navigational patterns of information seekers, which can provide a (partial) solution to the problem. The proposed evaluation methodology:

• can measure in a unified way the effectiveness of both analytical query-based and browsing search strategies;

• is underpinned by an entirely new notion of relevance which does not ignore the highly interconnected nature of hypermedia digital libraries and is more objective;

• directly involves users, evaluates their performance in a state-based fashion and, in general, can measure the effectiveness of the system in supporting users seeking information, since it evaluates only the ‘real’ movements of information seekers, as they result from query-based or browsing search strategies;

• can probably be applied easily in dynamic environments such as hypermedia digital libraries which support electronic publishing.

As the importance of electronic environments such as digital libraries continues to increase, it will become more important to evaluate the information-seeking performance of such systems. Hypermedia digital libraries will play an important role in future electronic environments since they can support both query-based and browsing search strategies. We believe that our methodology for evaluating hypermedia digital libraries has several advantages over current evaluation methodologies based on P and R, and can be used to evaluate what should really interest us: how well users seek information in electronic environments.

References

[1] M.D. Dunlop (Ed.), Proceedings of the Second Mira Workshop, Moncelice, Italy, University of Glasgow Computing Science Research Report TR-1997-2, 1997.

[2] W. Hersh, D. Elliot, D. Hickam, et al., Towards new measures of information retrieval evaluation, in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 164-170.

[3] T. Saracevic, Evaluation of evaluation in information retrieval, in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 138-146.

[4] K. Sparck Jones, Reflections on TREC, Information Processing and Management 31 (3) (1995) 291-314.

[5] K. Sparck Jones, Introduction to IR experiment, in: K. Sparck Jones (Ed.), Information Retrieval Experiment, Butterworths, London, 1981.

[6] C.J. van Rijsbergen, Information Retrieval, 2nd ed., Butterworths, London, 1979.

[7] N.J. Belkin, Ineffable concepts in information retrieval, in: K. Sparck Jones (Ed.), Information Retrieval Experiment, Butterworths, London, 1981.

[8] A.E. Fox, M. Akscyn, R. Furuta, J. Legget, Digital libraries, Communications of the ACM 38 (4) (1995) 23-28.

[9] V. Balasubramanian, A hypermedia approach to digital libraries: review of research issues, SIGLINK Newsletter 4 (2) (1995) 26-28.

[10] H. Frei, D. Stieger, Making use of hypertext links when retrieving information, in: Proceedings of ECHT 92, Milano, Italy, 1992, pp. 102-111.

[11] J. Savoy, An extended vector-processing scheme for searching information in hypertext systems, Information Processing and Management 32 (2) (1996) 155-170.

[12] G. Marchionini, Information Seeking in Electronic Environments, Cambridge University Press, 1995.

[13] M. Bates, The design of browsing and berrypicking techniques for the on line search interface, Online Review 13 (5) (1989) 407-424.

[14] H. Davis, P. Hay, Automatic extraction of hypermedia bundles from the digital library, in: Proceedings of the DL 95 Conference, Austin, TX, USA, 1995.

[15] M. Salampasis, J.I. Tait, C. Hardy, An agent-based hypermedia model for digital libraries, in: Proceedings of the 3rd Forum on Research and Technology, Advances in Digital Libraries, Washington DC, USA, May 10-13 1996, pp. 5-13.

[16] S. Draper, Overall task measurement and sub-task measurements, in: Proceedings of the Second Mira Workshop, Moncelice, Italy, University of Glasgow Computing Science Research Report TR-1997-2, 1997.

[17] G. Golovchinsky, What the query told the link: the integration of hypertext and information retrieval, in: Proceedings of the 10th ACM Hypertext Conference, Southampton, 6-10 April 1997.

[18] G. Salton, The Smart environment for retrieval evaluation: advantages and problem areas, in: K. Sparck Jones (Ed.), Information Retrieval Experiment, Butterworths, London, 1981.

[19] J. Ellman, J. Tait, INTERNET challenges for information retrieval, in: Proceedings of the 18th BCS IRSG Annual Colloquium on Information Retrieval Research, 1996, pp. 1-12.

[20] S. Robertson, The methodology of information retrieval experiment, in: K. Sparck Jones (Ed.), Information Retrieval Experiment, Butterworths, London, 1981.

[21] R. Rao, J.O. Pederson, M.A. Hearst, J.O. Mackinlay, S.K. Card, L. Masinter, P.K. Halvorsen, G.G. Robertson, Rich interaction in the digital library, Communications of the ACM 38 (4) (1995) 29-39.

[22] R. Botafogo, E. Rivlin, B. Schneiderman, Structural analysis of hypertext: identifying hierarchies and useful metrics, ACM Transactions on Information Systems 10 (1) (1992) 142-180.

[23] E. Rivlin, R. Botafogo, B. Schneiderman, Navigating in hyperspace: designing a structure based toolbox, Communications of the ACM 37 (2) (1994) 87-96.

[24] N.J. Belkin, Anomalous states of knowledge as a basis for information retrieval, Canadian Journal of Information Science 5 (1980) 133-143.

[25] M. Salampasis, J.I. Tait, C. Bloor, Cooperative information retrieval in digital libraries, in: Proceedings of the 18th BCS IRSG Annual Colloquium on Information Retrieval Research, 1996, pp. 13-26.

[26] L.W. Qiu, Frequency-distributions of hypertext path patterns: a pragmatic approach, Information Processing and Management 30 (1) (1994) 131-140.

[27] M.B. Wildemuth, Defining search success: evaluation of searcher performance in digital libraries, SIGOIS Bulletin 16 (2) (1995) 29-32.

[28] S.J. Shute, P. Smith, Knowledge-based search tactics, Information Processing and Management 29 (3) (1993) 29-46.

[29] H.J. Kingston, Algorithms and Data Structures: Design, Correctness, Analysis, Addison-Wesley, 1990.

[30] G. Salton, J. Araya, On the use of clustered file organization in information search and retrieval, Department of Computer Science, Cornell University Technical Report TR-89-989, 1989.
