an associative search strategy for information retrieval

9
Infommtion Pmcessing & Management Vol. 16, pp. 129-137 Pcrgamon Press Ltd., MIO. Printed in Great Britain AN ASSOCIATIVE SEARCH STRATEGY FOR INFORMATION RETRIEVAL OVAD MANSUR Department of Computer and Information Science, Cleveland State University, Cleveland, OH 44115, U.S.A. (Receioed for publication IO September 1979) Abstract-This paper extends Goffman’s Indirect Method of Information Retrieval by sugges- ting a more flexible search strategy. The suggested strategy in its simplified version is more effective than the Boolean strategy used in existing information retrieval systems and can be implemented at comparable costs. Some of its features might further increase the effective- ness of information retrieval systems. 1. INTRODUCTION There are more than a hundred information retrieval systems (IRS) in operation now, their purpose being to provide the user with documents (journal articles and others) relevant to his needs. What distinguishes such IRS from Data Retrieval Systems (DRS), i.e. banking, airlines reservations, etc., is the element of relevance. In the DRS, the relevance of an answer to a query can be objectively defined, or at least to a larger extent than in the IRS case, where the relevance of the answer, or of its components can be determined only by the user himself. Thus, DRS can be viewed as a special case of IRS. It is the relevance factor that creates the specific problems of the IRS and shifts the emphasis from efficiency (in DRS), to effectiveness (in IRS). The existing IRS, some of which include large data bases with on-line retrieval capabilities, operate in a similar way; they all allow retrieval of documents by matching index terms (descriptors, keywords) pre-assigned to the documents in the file, with the user’s query, which is defined as a set of index terms with Boolean operators. Each document in the file is compared with the query independently of the others and those which match are retrieved. These Boolean systems are often ineffective in retrieving information relevant to the user. During the last forty years the deficiencies of the Boolean systems were pointed out repeatedly and many suggestions made to improve retrieval. They are discussed in detail elsewhere (see for example[l, 2]), and we would like to mention only a few. MARONand KHUNS[~] have suggested the Probabilistic Indexing method which assigns index terms to a document together with a probability which conveys the extent to which the index terms describe the content of the document. The importance of this idea consist in the relevance relation, which can be expressed on a sliding scale rather than as a dichotomy. DOYLE[~] was concerned with the idea of interdependence of the documents in a file. He proposed that documents in a file be displayed in an association map which would express the relationship of the content of the documents. The user would be presented with a map of concepts and he could choose his own way in the map, according to his own interests and needs. The importance of this idea lies in the recognition of the differences among the users and of the necessity to enable the users to find their own way instead of presenting them with a “definite” answer. Tremendous amount of work in all areas of information retrieval systems design was done by SALTON[~-71. His most important contribution related to the content of this paper is in automatic clustering and creation of hierarchial tile organization. Such a structure is very effective for searches, in particular for the browsing query (see discussion below). Some of Salton’s methods have some similarities to the methods suggested here, the main difference lies in our viewing of the relationships among the items in the data base, our viewing of relevance as a subjective changing quality and the information retrieval process as a trial and error one 129

Upload: ovad-mansur

Post on 30-Aug-2016

213 views

Category:

Documents


1 download

TRANSCRIPT

Infommtion Pmcessing & Management Vol. 16, pp. 129-137

Pcrgamon Press Ltd., MIO. Printed in Great Britain

AN ASSOCIATIVE SEARCH STRATEGY FOR INFORMATION RETRIEVAL

OVAD MANSUR

Department of Computer and Information Science, Cleveland State University, Cleveland, OH 44115, U.S.A.

(Receioed for publication IO September 1979)

Abstract-This paper extends Goffman’s Indirect Method of Information Retrieval by sugges- ting a more flexible search strategy. The suggested strategy in its simplified version is more effective than the Boolean strategy used in existing information retrieval systems and can be implemented at comparable costs. Some of its features might further increase the effective- ness of information retrieval systems.

1. INTRODUCTION

There are more than a hundred information retrieval systems (IRS) in operation now, their purpose being to provide the user with documents (journal articles and others) relevant to his needs. What distinguishes such IRS from Data Retrieval Systems (DRS), i.e. banking, airlines reservations, etc., is the element of relevance. In the DRS, the relevance of an answer to a query can be objectively defined, or at least to a larger extent than in the IRS case, where the relevance of the answer, or of its components can be determined only by the user himself. Thus, DRS can be viewed as a special case of IRS.

It is the relevance factor that creates the specific problems of the IRS and shifts the emphasis from efficiency (in DRS), to effectiveness (in IRS).

The existing IRS, some of which include large data bases with on-line retrieval capabilities, operate in a similar way; they all allow retrieval of documents by matching index terms (descriptors, keywords) pre-assigned to the documents in the file, with the user’s query, which is defined as a set of index terms with Boolean operators. Each document in the file is compared with the query independently of the others and those which match are retrieved. These Boolean systems are often ineffective in retrieving information relevant to the user.

During the last forty years the deficiencies of the Boolean systems were pointed out repeatedly and many suggestions made to improve retrieval. They are discussed in detail elsewhere (see for example[l, 2]), and we would like to mention only a few. MARON and KHUNS[~] have suggested the Probabilistic Indexing method which assigns index terms to a document together with a probability which conveys the extent to which the index terms describe the content of the document. The importance of this idea consist in the relevance relation, which can be expressed on a sliding scale rather than as a dichotomy. DOYLE[~] was

concerned with the idea of interdependence of the documents in a file. He proposed that documents in a file be displayed in an association map which would express the relationship of the content of the documents. The user would be presented with a map of concepts and he could choose his own way in the map, according to his own interests and needs. The importance of this idea lies in the recognition of the differences among the users and of the necessity to enable the users to find their own way instead of presenting them with a “definite” answer.

Tremendous amount of work in all areas of information retrieval systems design was done by SALTON[~-71. His most important contribution related to the content of this paper is in

automatic clustering and creation of hierarchial tile organization. Such a structure is very effective for searches, in particular for the browsing query (see discussion below). Some of Salton’s methods have some similarities to the methods suggested here, the main difference lies in our viewing of the relationships among the items in the data base, our viewing of relevance as a subjective changing quality and the information retrieval process as a trial and error one

129

130 0. MANSUR

rather than a process of getting a “definite” relevant answer (see detailed discussion below, especially in Section 7).

Goffman’s Indirect Method for Information Retrieval (IDM)[8] was introduced 10 years ago as a method of information retrieval more effective than the straight Boolean method, and more efficient than Maron and Kuhn’s The main contribution of the IDM is in viewing the relevance of a document to a certain query as dependent on other documents already retrieved in the process.

The improved effectiveness of the IDM over the Boolean systems was demonstrated in several small scale experiments [g-lo]. Recently, CROFT and VAN RIJSBERCEN [ 1 I] have done more comprehensive experiments comparing the effectiveness and the efficiency of the IDM with the Single Link clustering method (see discussion in Appendix).f

However, the IDM was not implemented in large information retrieval systems mainly because of high costs associated with it in the way it was suggested. The strategy proposed here is an extension of the IDM more flexible than the original method and involving much lower costs, comparable to those of the currently operating Boolean system, which makes its implementation economically feasible.

The discussion below deals only with the search strategy of an information retrieval system, although the search strategy is not independent of other elements of the IRS (the updating policy, the selection of relevance indicators, etc.) but rather has some relationships with them. The discussion is limited to information retrieval systems using index terms, manually assigned or automatically derived, as relevance indicators. The same reasoning is valid when other relevance indicators are used, and the strategy can be easily applied to other types of IRS.

2. THE QUERY

Before discussing the search strategy let us define the different types of possible queries which a user might pose to an IRS. One can classify queries by two different criteria:

-by the kind of answer the user expects from the system, or the aim of the query (Type A). -by the way the query is defined (Type B).

Type A queries Type A queries, can be of two kinds: (a) The standard query. This is the most common query in which the user wants a limited

size, preferably ordered, list of items as an answer. Relevance in this case is an important factor and so is the amount of effort required from the user in order to satisfy his needs. Getting as an answer an unordered set of 500 “relevant” documents, for example, means that the answer is likely to be non-relevant, as examining 500 in order to end up with 10 is quite complicated. This occurs rather frequently in the existing IRS, thus limiting their effectiveness.

(b) The non-standard query. In this kind of query the user wants a larger size answer. The emphasis is on a “complete” bibliography. Comprehensivness is more important here than strict relevance and the effort-the user will be willing to invest in order to be satisfied is relatively extensive.+ For this type of query we do not need a search strategy as sensitive as for the standard type query, neither need we a highly discriminative updating policy, or reduncancy

filters.§

Type B queries It is suggested here to distinguish among three kinds of queries according to the way they

are defined by the user. (a) A query which is a defined entry point. In this case the user presents to the system, as a

query, a document he found relevant (we assume that this item is included in the file). The document represents his query, and is used as a starting point for the search. This is a natural, easy and precise way of defining a query (see also SALTON[~]).

(b) A Boolean query. This is a query which does not consist of a specific item, but comes from a user who knows, more or less precisely, the nature of his problem. The common way of

tFor comparison between Salton’s and Goffman’s methods see also OMLOR[I~].

fWe can say that the user prefers, to a large extent, to trade “precision” for increased “recall”.

PA Boolean system is quite effective for this type of query.

An associative search strategy for information retrieval 131

defining such query is by a set of index terms combined with Boolean operators. Defining such a query poses some problems. One should use index terms in the same way they were assigned to the documents in the file, a knowledge which most users do not possess, or be aided by a person who is familiar with the data base and the subject area.

(c) A browsing query. In this case, the user can not define his query in any of the ways described above. He has none or little knowledge of what he is looking for. Browsing is a scanning and selection process which helps the user define a more precise query. Effective browsing will enable the user to find one relevant document or to define a Boolean query. Browsing can be done by scanning the whole file, item by item, which is neither effective nor efficient. A better way of browsing is by limiting the file to a portion and organizing that portion in such a way that only a small number of items will be actually examined, each of them representing a class of closely related documents.?

3. THE SEARCH STRATEGY-ASSUMPTIONS AND GENERALREQUIREMENTS

A search strategy is a technique, a rule, or a set or rules for matching a query with a file and for retrieving or selecting from the file a set of items which constitute an answer to the query.

A good search strategy should be able to supply as an answer all the items which are relevant and none of the non-relevant ones. Since we do not know much about relevance, this is just a general idealized statement. To some extent we can be more specific and contend that:

--It is preferable that the answer be ordered. This means that the set of items which results from the search, should be displayed in the order in which the documents should be read, i.e. first, second, etc.

-The size of the answer should be determined by the user. -The strategy should take into account the change in the relevance of an item occurring

after another item has been consulted. -The possibility of interaction between the user and the system is highly desirable since it

may potentially improve effectiveness. Relevance is a time dependent subjective property, since we lack the knowledge of the

nature of the user, the real nature of his problem, the nature of relevance, and of the user’s perception; therefore, a system can only supply documents which have a possibility (prob- ability) of being relevant, the final judgment being made by the user. This implies that some decision-making by the user is helpful and necessary. On the other hand, this process need not be complicated but should allow some selection possibilities, as many as desired by the user. The most we can expect from an IRS, given the current state of the art, is to preselect for the user a set of documents which have a high probability of being relevant and let the user by a reasonably easy way of interaction, make the final selection.

This implies the following:

-One user presenting the same query at different times, may end up with two different final selected sets, even if he received, as an answer from the system, initial identical preselected sets.

-Two users presenting the same query at the same time, and having, naturally, the same preselected sets as answers from the system may end up with two different final sets.

4.THEINDIRECTMETHOD

Goffman’s IDM is based on a general communication theory[l3] from which a search strategy is derived. Given a set of documents D (the data base), the conditional probability Pji is computed between all pairs of D. Pji is the probability of item j being relevant (to a certain query), given that item i is.S The strategy assumes that associated with any query there is a

tlf our data base had an hierarchical structure, browsing would have been relatively simple.

$P, is computed in the following way:

where m(xil2.q) is the number of index terms shared by documents xi and xi and m(q) is the number of index terms representing x,. 0 5 Pii 5 1.

132 0. MANSUR

threshold probability &, which distinguishes between “the barely relevant and the nonreIevant”[S]. All Pii’s values which are less or equal to .& are considered to be zeroes for this query. The IDM assumes that a user’s query can be defined in two ways:

-an initial document Q’ known to be relevant (type B(a)-an entry point query), and a threshold &,.

-a set of index terms with Boolean operators Q (type E(b)-a Boolean query), and a threshold 4.

When a query is defined in the first way, a chain of documents is retrieved as follows: Assuming that Q’ is relevant, the item which has the highest conditional probability (above &J with Q’ is selected. Then, in the same way, assuming that this one is relevant, another one is

selected, and so forth, till no relevant item is left, i.e. a document which has a Pii > &,. Now, going form Q’ “backwards”: from all items which if relevant, than Q’ is relevant, we select that one Q” which is associated with the highest probability. From this document we continue backwards till none is left.

The items retrieved in this way constitute an answer to the query. The documents are ordered and Q’ may not be the first document in the chain. The answer set (chain) is not necessarily unique, since there might be several chains of relevant documents.

When the query is defined in the second way, the strategy divides the data-base into disjoint classes (equivalence classes) of related documents according to &,. From each class an item is selected and compared with the query. The item which best matches the query leads to a class of related items. The query Q is then compared with each member of the class in order to find the one which best matches it, Q’ which becomes the initial document. From Q’ one proceeds as described above. Obviously, this strategy takes into account the fact that the relevance of a document depends upon the other documents already retrieved.

This strategy was extended by CLEVELAND[~] in his Geometrical Model for Information Retrieval (GM) in the following way: given Q’ and &,, the same process is followed, except that all items which have Po,j > & are retrieved in the first stage. In the second stage, all items which are relevant to those selected in the first one are retrieved, and so forth, till no relevant item is left. The strategy does not go “backwards”. The answer set might be of a different size. For an identical &, it is more likely that the GM answer set will be of a larger size, though only partially ordered (the term “partial order” indicates here that there is an order (“distance”) between Q’ and every other item in the answer set, but several items have the same “distance” from Q’).

As mentioned above, the IDM is a highly effective strategy. However, the need to process the large matrix of the Pij’s, to partition the data-base once or more for each query and to update it for each new document added, inhibited its implementation.

S. A PROPOSEDCHAINING PROCESS

The chaining process of the IDM and the GM after an item Q’ and a threshold &are specified, is such that at each stage an item is selected on the basis of its relationship with the preceding one only. This is, in a sense, a multi-stage Boolean search strategy.

It is evident from the reasoning which led to the formulation of the IDM and from the mathematical formulation of the general communication theory[l3] that the relevance of a document is dependent upon the previous knowledge of the user. It is difficult to assess the user’s background; however, once he has entered the system, we learn something about the knowledge he gains during that particular search. More specifically, in the first stage we know only that Q’ is relevant and based on this knowledge we select the second item J’. In the second stage we know that not only J’ is relevant but also Q’ and can base our selection of the third document on this assumption, instead of assuming that only the last one J’, was relevant. Thus, we are proposing to base the chaining process on the conditional probability that a set of documents is relevant instead of one only. The set, (s), is getting larger at each step. Pij becomes P(s)i, (s) being one document only in the first stage.

The set (s) can be viewed as an unordered set or an ordered one. It is natural to view this set as an ordered one and to give more importance to the last document retrieved and less to the former ones. This scheme involves more computational work than the one stage chaining, but when done by a computer, the computational work is manageable.

An associative search strategy for information retrieval 133

There is an infinite number of schemes for assigning different weights to different items in the chain. Experimentation is needed in order to find the optimal one, or a simple way of letting the user select among a few options.

6. A PROPOSED SEARCH STATEGY

Both strategies (IDM, GM) require specification of a threshold by the user. Usually, the user has no knowledge of how to select his TO. Actually, he can trace it only aft& the search is completed and relevance is determined. In reality, there is a trial and error process starting with an initial TO, gathering results, evaluating them in some way, changing & accordingly and repeating the whole process again and again, until the user is satisfied. This is a long and expensive process.

A search strategy which is a modification of the IDM and the GM is suggested here. For the sake of simplicity the proposed chaining process (computing Pr,,J described above will be disregarded, though it could have been incorporated.

Type B(a) query Given an entry document Q’, the suggested strategy retrieves two items which are the most

relevant to Q’. In the second stage, for each of the two items retrieved in the first stage, two additional items are retrieved, i.e. those with the highest conditional probabilities, and so forth.? The result is a “tree-like” structure presented in Fig. 1, which we call a “map”.S

Starting from the initial relevant document Q’ the user selects one out of two documents. After the selection is made, the user proceeds to the next stage and selects again one out of two documents and so forth. At each stage the map contains as much data as possible in order to help the user select between the next two documents (author(s), title, possibly an abstract; and other data). An item can appear more than once in the map, but not on the same path. Different paths might overlap or might even include the same set of items, differently ordered.

The strategy is not limited to selecting two items at each stage. The user can specify any number n, nf 2, if he wishes to. The number 2 is chosen if no other number is specified, because it is felt that 2 is optimal for most queries. It will give the user decision latitude and will not complicate much his decision making process. The number of comparisons he has to make in order to select 1 out of 2 is 1. If we set n = 3 the number of comparisons the user has to

Fig. 1. An answer map.

tit might be that items which have high Pij values (very close items), are informationally the same, so that the retrieval of the second item after the first has been retrieved, will not convey any new information. In this case it might be useful to define an upper limit threshold & which will eliminate retrieval of item xi after item Xi, if pii > &. See also[lO] p. 75.

*This term is borrowed from bYLE[4].

134 0. MANSUR

make is 3 (A&; A,C; B,C) due to the lack of transitivity. In the case of the non-standard type query it is more likely that the user will specify IZ 12.

For this strategy there is no need to specify &,. It can be viewed as if & is changing at each stage of the tree. Increasing n wit1 have an effect similar to that of lowering & in the GM, but still not the same. In both cases the number of documents retrieved might increase, but the sets are likely to be different. In the GM case, the & is still fixed at each stage, and any document retrieved is assumed to be relevant. In our strategy &, is different at each stage, and not all documents are relevant. Furthermore, only those on a certain path are relevant (for the standard query). Since each path is a possible final set, a document can appear more than once on the map, on different paths.

Type B(b) query It is doubtful whether many Boolean queries would be posed to such system since defining a

Boolean query is difhcult, and requires the help of an “intermediary”. Having such a query in mind one can often associate it with a document. Assuming that we allow Boolean queries, we need to structure our data-base in order to handle effectively and efficiently this type of queries. The structuring algorithm of the IDM seems a good solution to the Boolean query, assuming that the structuring is not done for each query (structuring is time consuming and is an expensive process). An arbitrary threshold & can be selected in order to partition the file into classes in such a way that the size of most classes will be in the order 20-30.1 A representative is selected from each class. This representative can be the item which has the highest “synthesis” value (see [ 141). It is very likely that this will be the item which is m~~ximally linked to the members of the class (see also [2], p. 36). A Boolean query can be matched with the group of representatives the same way it was described for the IDM so that an entry item can be obtained.

Type B(c) query For the browsing query, scanning the whole set of representative items, is neither efficient

nor effective if the file is not small. Considering the set of representative items. we can partition it again into disjoint classes, select representatives from those classes and so on. This creates a quasi-hierarchical structure which fits browsing.

7. DISCUSSION OFTHE PROPOSEDSTRATEGY

The information retrieval process is viewed as a trial and error process, rather than a process which retrieves only absolutely “relevant” documents. It is contended that relevance is a subjective quality which can be finally determined only by the user. A system can not compute absolute relevance but only an a priori probability of relevance. Only the user decides if the Pij is either 0 or 1. This is a subjective decision depending on his specific background, interests and his perception of the document’s content.

Interaction between the user and the system is desirable in order to enable the user to obtain a better answer. In this case, the interaction is achieved between the user and the map supplied by the system. This can be done at the user’s own desk, provided he works with a hard copy of the map. This fact will contribute to lower costs per query.

The size of the answer The size of the answer provided by this strategy is growing fast with each step. Since we do

not have a limiting factor such as 60, the system can not limit the size of the answer unless a certain number of steps is specified. It is suggested here to limit the number of stages for a standard query, to 8, 10 or 12 steps (unless otherwise specified). This answer size (8, 10, 12 final relevant documents) seems su~cient for most purposes. This wilf reduce costs as well as increase immediacy for the standard query without much affecting the non-standard one.S

tA possible criterion is the elasticity of &, i.e. when the size of most groups will be relatively insensitive IO a change in

&.

SThe immediacy factor is not so important in the non-standard query. An answer to such a query is more likely to h submitted in two parts-the first part immediately, the rest--later (off-line).

An associative search strategy for information retrieval 13s

A built-in learning process When the map with several possible paths is created, some information is gained during the

process. When the user selects a path, we already know that he prefers some items over others. As he proceeds farther, we have an increasing amount of information on his preferences. Taking into account all possible paths, we know that if he chose path A (Fig. l), this means that he prefers item Jr to Jz, and Kl to K;, etc. Thus, we might be able to construct crude indifference tables of the user for that specific query at that point in time, and when creating the map, we can be aided in selecting the next documents by the information gained while selecting the previous ones. For example, if we know that the user consistently prefers highly cited articles to those less cited but closer (in terms of keywords for instance), we can select accordingly. This possibility should be studied further.

8. ECONOMIC FEASIBILITY

When considering economic feasibility the various costs of the simplified version of the suggested strategy will be compared with those of the Boolean systems. Before computing the costs, we would like to elaborate on several points:

(i) The size of the matrix The size of the Pij’s matrix depends upon the size of the file. A file which contains S

documents will have a matrix of size S x S. From the way in which the Pij’s are computed it is evident (see also [ 111) that only half of this matrix needs to be stored. The actual size will be significantly smaller since most documents in a large file are not related at all and consequently the appropriate entries will be zeroes. Furthermore, for practical purposes we can define a low limit threshold & for all possible queries, and replace all Pij’s values which are smaller or equal to .$ by zeroes. This will further reduce the size of the matrix for on-line storage requirements.

(ii) The need for the matrix There is no need to store the matrix on-line for searches. Instead we propose to store only

pointers from one item to others. These pointers are computed from the matrix once only (per period of time) and the matrix can be stored off-line. Different number of pointers are needed for different types of queries:

-Type B(a)-A(a) (entry point-standard) query. In this case we need to store, for each item, pointers to only two other items, which are mostly related to it. This is true when there is no “looping” (looping is the case of a document having its highest Pij value associated with another one already retrieved by the same query on the same path). Due to the looping possibility we need to store for each item more than two pointers to other items. This number is determined by the size of the path. For this type of query we have indicated that the optimal size of the path is about 10, so that for each item we need to store at the most 10 pointers. There is no need to use the matrix at all for retrieval.

-Type B(a)-A(b) (entry point-non-standard) query. In this case, the size of the answer is relatively large, and we need to store more pointers than in the previous case. It will be shown later that storing even 30 pointers for each item is not economically unfeasible. However, this might not be the optimal solution. This type of query is rare, and usually not all portions of the answer are needed immediately. It seems reasonable to supply immediately the first portion of the answer, and to retrieve the rest later.

-Type B(b) (Boolean) query. The requirements for this type of query are similar to those for type B(a) with the exception of the need to find an entry item. Earlier we have outlined a way of finding an entry item by partitioning the file into classes according to a relatively high threshold &. This partitioning does not have to be done for each query, but rather, periodically (every 3, 6, 12 months or so).

-Type B(c) (browsing) query. This case was discussed earlier and we have suggested creating a quasi hierarchical structure of the file. From the discussion above it is clear that the search is extremely fast.

(iii) Updating a new item It seems that the matrix needs to be recomputed whenever we add a new document to our

136 0. MANSUR

file. This involves high costs when items are added at a relatively high frequency. We can bypass it by considering each item added to the file as a type B(b) (Boolean) query, comparing it with the group of representative and adding it to the class of the representative item which best matches it. Then we will compute the Pii’s values between this item and each item of this class only. Updating the whole matrix can be done periodically as was suggested above, before each re-partitioning of the whole file. This will not affect much the effectiveness.

Cost structure The total costs of the two IRS considered consists of fixed costs, i.e. costs which are not

dependent on the number of queries but rather on the size of the data-base, and variable costs, i.e. costs which are functions of the number of queries. Each of these two types consists of several elements:

Fixed costs: (a) acquisition (b) processing (indexing, keypunching, etc.) (c) updating the data-base, when documents are added (or deleted) (d) storage of the items (or portions of them) (e) storage of the data needed for retrieval purposes (f) computing relationships and partitioning the data-base

Variables costs: (g) formulating the query (h) searching (including interaction)

Cost comparison We can easily assume that the costs of (a), (b), (c) and (d) are essentially the same for both

systems. Costs of (e) seem to be much higher in the suggested IRS and we will examine them in detail later. Costs of (f) are incurred in the suggested IRS and not in the conventional ones; however, they occur only periodically and can be practically neglected. Costs of (g) occur mostly in the conventional systems, they reflect the assistance needed by the user from an “intermediator” and the user’s time and efforts. Those costs will not be taken into con- sideration. Costs of (h) are considerably higher in the conventional IRS, where a large portion of the file has to be examined in order to retrieve an answer. They also tend to increase with the size of the file as more documents are associated with each index term.

Discussion of costs of (e) In the suggested IRS costs of (e) reflect the need to store, for each item, pointers to several

other items mostly related to it. We will not try to compute those costs directly, but rather to estimate the size of the required storage in relation to the size of the storage required for item

(d).

The conventional IRS store on-line for each document at least the following fields: -author(s) -title -journal -volume, issue, page -8-15 index terms -some inverted files of index-terms and document numbers.

In many cases other data are added, sometimes even an abstract. The storage requirements for the first five fields are estimated at 2000 bits per item. In the suggested IRS we need to store 10-30 pointers. For each pointer, we need 20 bits (for a file size of 1,024,OOO items). Usually, this figure will be lower since most of the related documents belong to the same group. Taking as an average 12 bits per pointer we need at the most 6-18% more storage. This is by no means a prohibitive cost, especially if one takes into consideration the extremely fast search of the suggested IRS, and the fact that on-line storage of index terms is not required at all for type B(a) queries which represent the majority of queries.

An associative search strategy for information retrieval 137

We can infer that the suggested IRS is economically feasible. It is associated with higher fixed costs, but much lower variable costs. This last result has further implications:

-The searching costs in the conventional IRS are even higher than stated due to the amount of interaction needed in order to achieve a manageable answer. This contributes not only to the pure searching costs, but also to the cost of on-line communication.

-The “fixed” costs as defined above, and as a matter of fact almost any fixed costs, are fixed only up to a certain amount of the output (in this case the number of queries). When the number of queries reaches a critical amount which exceeds the physical capacity of the system, the fixed costs are changing, i.e. additional hardware capacity is needed, etc. Since the variable costs are significantly lower in the suggested IRS, the size of the “critical output” will be much higher, thus allowing larger capacity for a given amount of fixed costs.

9. conclusion

The search strategy outlined above is a flexible extension of the IDM, the IDM being a special case of it. In its simplest version it can replace the Boolean systems which are in operation now, achieving much greater effectiveness, with comparable costs. Some of its features might further increase the effectiveness and flexibility of information retrieval systems.

REFERENCES

[I] E. VERHOEFF, W. GOFFMAN and J. BELZER, Inefficiency of the use of Boolean functions for information retrieval. Commun. ACM 1961, 4, 557-559.

[2] C. J. VAN RIJSBERGEN, lnformurjon Reirievaf. Butterworths, London (1975). [3] M. E. MARON and J. L. KUHNS, On relevance, probabilistic indexing and information retrieval. J.

ACM 1960, 7, 3. [4] L. B. DOYLE, Semantic road maps for literature searches. J. ACM 1961,8, 574-578. [5] A. SALTON, Automated Information Organization and Retrieval. McGraw-Hill, New York (1968). [6] G. SALTON (Ed.), The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, New Jersey (1971). 171 G. SALTON. ~~amj~ ~~~o~u~jon and ~jbra~ Processjng. Prentice-Hall, Englewood Cliffs, New

Jersey (1976). [8] W. GOFFMAN, An indirect method of information retrieval. Inform. Star. Retr. 1968.4, 4. [9] D. B. CLEVELAND, An n-Dimensional retrieval model. J. ASIS 1976,27, 5/6.

[lo] D. K. DERINGER, An information retrieval system for a computer center. Ph.D. Dissertation, CWRU Cleveland, Ohio (1972).

[i 11 W. B. CROIT and C. J. VAN RIJSBERC~EN, An evaluation of Goffman’s indirect retrieval method. Inform. Proc. Management 1976, 12, 327-331.

[12] D. OMLOR, An efficiency analysis for file organization and information retrieval. Ph.D. Dissertation, CWRU Cleveland, Ohio (1978).

[13] W. GOFFMAN and V. A. NEWILL, Communication and epidemic processes. Proc. R. Sue. A 1%7, 298, 316-334.

[14] W. GOFFMAN, Dynamics of communication. Proc. of the AAAS Conf. Denver, Colorado, Feb. 1977.

APPENDIX

A Note on the results obtained by Croft and Vnn Rijsbergen [ 1 l] when com~a~ng the indirect method with the single link method.

The effectiveness of the IDM was found to be similar to that of the Single Link Method (SLM) using 3 types of searches:

(a) “a narrow search”-similar to type A(a) query (b) “a broad search”-similar to type A(b) query (c) “a bottom up search”-similar to type B(a) query, The costs associated with the IDM were significantly higher than those of the SLM for searches (a) and

(b) above but lower for search (c). Those costs are “overhead” costs which occur when the Pij matrix is processed and the file is partitioned. When comparing the total costs of a system one should take into consideration the relative frequency of each type of query. We wish to repeat here the argument mentioned in the body of the article that most queries, when allowed to be defined in that way, will be such that a bottom up search will be required. Furthermore when using the simplified version of the suggested strategy, no threshold probability &, has to be stated, neither partitioning of the file for each query is needed. One need not specify a “recall” and “precision” as required by a query posed to the SLM.