
Relevance Judgements for Assessing Recall

Peter Wallis and James A. Thom
Department of Computer Science, Royal Melbourne Institute of Technology,
GPO Box 2476V, Melbourne, Victoria 3001, AUSTRALIA.

March 15, 1995

Corresponding author: Dr J.A. Thom, address as above, email ([email protected]), fax (+61 3 660 1617), telephone (+61 3 660 2348).

Running title: Assessing Recall


Abstract: Recall and precision have become the principal measures of the effectiveness of information retrieval systems. Inherent in these measures of performance is the idea of a relevant document. Although recall and precision are easily and unambiguously defined, selecting the documents relevant to a query has long been recognised as problematic. To compare the performance of different systems, standard collections of documents, queries, and relevance judgements have been used. Unfortunately the standard collections, such as SMART and TREC, have locked in a particular approach to relevance and this has affected subsequent research. Two styles of information need are distinguished, high precision and high recall, and a method of forming relevance judgements suitable for each is described. The issues are illustrated by comparing two retrieval systems, keyword retrieval and semantic signatures, on different sets of relevance judgements.


Introduction

Four decades of testing information retrieval systems suggest that automated keyword search is a very effective means of finding relevant material. This outcome is understandable when a query is expressed using words that are specific to the query's topic, but many have felt there must be room for improvement in the general case. Recent attempts to improve the effectiveness of information retrieval systems include adding natural language processing techniques (Fagan, 1987; Smeaton, 1987; Gallant et al., 1992; Wallis, 1993), broadening the query using automatic and general purpose thesauri (Crouch, 1990; Thom and Wallis, 1992), document clustering techniques (El-hamdouch and Willett, 1989), and other statistical methods (Deerwester et al., 1990). The measured performance of these techniques has not been encouraging. In this paper we argue that the reason for this result may lie in biases associated with the assessment method rather than with the information retrieval techniques being proposed.

The emphasis in information retrieval research has been, it seems, on users who want to find just a few relevant documents quickly. Once such a user has found the required information, he or she stops searching. In the extreme case, these users want a system that finds a single relevant document and as few non-relevant items as possible. In other words, they need a system that emphasises high precision. In many situations, however, a user must be fairly confident that a literature search has found all relevant material; such a user requires high recall. Su (1994) has found that users are more concerned with absolute recall than with precision. Her results were based on information requests by users in an academic environment, but her findings are applicable to other users. For instance, when lodging a new patent at the Patent Office it is necessary to retrieve all relevant, or partially relevant, material. Finding precedent cases in legal work, and intelligence gathering, are other situations in which high recall is desirable.

Some might argue that there is no need for high recall retrieval systems because existing tools are good enough. Cleverdon (1974) has suggested that, because there is significant redundancy in the content of documents, all the relevant information on a topic will be found in only a quarter of the relevant documents. However, a user interested in high recall will need to find much more than a quarter of the relevant documents, because it is unlikely that the first 25% of the relevant documents an information retrieval system finds will be exactly those that contain all the relevant information. Users who need high recall information retrieval get by with existing high precision tools, but there is a need for systems that emphasise this side of the problem.

Apart from practical benefits, improving the recall of information retrieval systems is an interesting research problem. Swanson (1988) has said there are conceptual problems in information retrieval that have been largely ignored. One of the "postulates of impotence" he proposes is that human judgement can bring something to information retrieval that computers cannot because computers cannot understand text. Our work on semantic signatures goes part way toward computer understanding and shows where understanding can contribute to the information retrieval process.
In this paper we show that the high recall part of the information retrieval problem may have been ignored, not because people have found the problem esoteric or uninteresting, but because they have not had the tools to effectively test ideas.

This paper is structured as follows. In the section on relevance judgements and information retrieval we look at what it means for a document to be relevant and how relevance has been assessed in some existing test collections. In that section we also define recall and precision, and describe the tradeoff between them. In the section on information retrieval and assessing recall we describe how information retrieval measurement is usually biased towards high precision rather than high recall. We also discuss the limitations of using the TREC collection and why this collection does not solve the problem of assessing recall for information retrieval systems that support ad hoc queries. In the section comparing two systems for recall we describe information retrieval based on semantic signatures and compare it with standard keyword retrieval. The relevance judgements of one popular test collection are re-evaluated, and we show that, on a small test set of queries, whereas the existing relevance judgements support keyword retrieval, semantic signatures have a significant advantage using the revised set of relevance judgements.

Relevance judgements and information retrieval

Information retrieval systems are usually considered to contain a static set of documents, from which a user is wanting to extract those documents she or he will find interesting. Some documents are relevant to the user, and others are not.

The effectiveness of any query can be measured by computing recall and precision figures based on a list of documents that are considered to be relevant. For a given query and information retrieval system, recall measures the number of relevant documents retrieved as a proportion of all relevant documents, and precision measures the number of relevant documents retrieved as a proportion of all documents retrieved. The perfect system would return a set of documents to the user containing all the relevant documents, and only the relevant documents. Such a query would have a recall of 1.0 and a precision of 1.0. Critical to computing recall and precision figures is determining lists of relevant documents for given queries.

Intuitively a relevant document is one that satisfies some requirement in the user's head. Saracevic (1975) describes this approach to relevance as "a primitive 'y'know' concept, as is information for which we hardly need a definition". This concept of relevance was the basis of early relevance judgements in which the person who formulated the query chose the relevant documents.

    Only one person (the requester) was asked to collect the judgements for each request, and dichotomous assessments were made to declare each document either relevant or not (Lesk and Salton, 1969).

The position that the requester knows what she or he wants, and is therefore the person who knows what is relevant, is entirely reasonable, but it is not necessarily a good means of assessing the effectiveness of a retrieval system. What a user wants may not be what she or he describes. There are several approaches taken to this problem. The simplest is to ignore it; recall and precision can only be used comparatively anyway, and Lesk and Salton (1969) have shown that user judgements can be used effectively to compare the performance of different systems. At the other extreme, many have tried to find a formal definition of relevance that would allow us to say definitively whether a text was relevant or not. A good example of this approach is Cooper's (1971) definition of logical relevance. More recently the TIPSTER and TREC projects employ specially trained relevance assessors (Harman, 1992) who, it is assumed, can make consistent and accurate assessments of relevance. Naturally it is optimistic to expect a third party, be it a machine or an information officer, to find what the author of a query wants rather than what the author actually requests. Having a machine retrieve all and only the documents the user considers relevant from a badly worded query would tell us more about the user's mind than about the ability of the machine to convert queries into useful sets of documents. In many cases users do not express their desires clearly and the same query can be given by two users with significantly different meanings.
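As a concrete illustration of the two measures just defined, here is a minimal sketch in Python; the document identifiers and the second relevant document are purely illustrative.

```python
def recall_precision(retrieved, relevant):
    """Compute recall and precision for one query.

    retrieved -- the set of document ids returned by the system
    relevant  -- the set of document ids judged relevant for the query
    """
    hits = retrieved & relevant                       # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Illustrative document ids only.
retrieved = {"D2651", "D1307", "D2513", "D793"}
relevant = {"D2651", "D0042"}
print(recall_precision(retrieved, relevant))          # (0.5, 0.25)
```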
Consider the four queries shown in Figure 1, which are selected from the Communications of the ACM (CACM) test set of 64 queries and 3204 documents provided with the SMART information retrieval system (Buckley et al., 1988). Query 19 is quite ambiguous. Does the author of this request want examples of parallel algorithms, or information about parallel algorithms? And does the author want everything on parallel algorithms, or simply a few examples? Figure 2 illustrates the difference between user judgements and those of a third party.

    Query 15: Find all discussions of horizontal microcode optimisation with special emphasis on optimisation of loops and global optimisation.
    Query 19: Parallel algorithms
    Query 39: What does type compatibility mean in languages that allow programmer defined types? (You might want to restrict this to "extensible" languages that allow definition of abstract data types or programmer-supplied definitions of operators like *, +.)
    Query 64: List all articles on EL1 and ECL (EL1 may be given as EL/1; I don't remember how they did it).

    Figure 1: Four queries from the CACM test collection.

    [Figure 2: Users, queries, and relevance. The diagram contrasts user judgements with third party judgements for the query "Parallel algorithms" and the request "How can several processors be used for sorting?", over documents such as "Fast Parallel Sorting Algorithms", "Comments on A Paper on Parallel Processing", "Optimal Code for Serial and Parallel Computation", and "Some thoughts on Parallel Processing".]

The problem of judging relevance has been around for a long time and the idea of third party judges is not new. There can be significant disagreement, not only between a third party and the author of the query, but also amongst third party judges working on the same query. In 1953 a large scale experiment was designed to compare the retrieval effectiveness of two information retrieval systems developed by different institutions. Two sets of relevance judgements were provided by the separate groups for a test collection consisting of 15,000 technical documents and 98 questions. There were 1390 documents that the two groups agreed were relevant, and another 1577 that one, but not both, thought relevant: "a colossal disagreement that was never resolved" (Swanson, 1988).

If relevance judgements are so unstable, how can they be used as the basis for objective measurement of information retrieval system performance? Cuadra and Katter think they cannot, and say

    the first and most obvious implication is that one cannot legitimately view 'precision' and 'recall' scores as precise and stable bases for comparison between systems or system components, unless [appropriate controls are introduced]. (quoted in Lesk and Salton, 1969)

In spite of these problems, Lesk and Salton (1969) have argued that, for the purposes of assessing information retrieval systems, it does not matter whether the relevance assessments of the author, or a third party, are used. They found that although their volunteers did indeed give inconsistent relevance judgements, these judgements could still be used to compare the effectiveness of information retrieval systems. They describe experiments in which the relevance judgements of the author of a query and those of a second person were compared over 48 queries. On average there was about a 30% overlap between the two sets of judgements (ranging from an average of about 10% for one author to 53% for another; the actual queries are not provided). Similar differences between user judgements and second person judgements have been reported elsewhere in the literature (Janes, 1994). Lesk and Salton (1969) show that, as long as relevance judgements are used to compare systems, it does not seem to matter which person's relevance judgements are used, and a technique for information retrieval that performs well on one set of judgements will perform well on others as well. The explanation they provide for this phenomenon is that there is substantial overlap on the set of documents that are "most certainly relevant to each query". They conclude:

    ... that although there may be considerable difference in the document sets termed relevant by different judges, there is in fact a considerable amount of agreement for those documents which appear most similar to the queries and which are retrieved early in the search process (assuming retrieval is in decreasing correlation order with the queries). Since it is precisely these documents which largely determine retrieval performance it is not surprising to find that the evaluation output is substantially invariant for the different sets of relevance judgements. (Lesk and Salton, 1969, p355)

Because everyone agrees about which documents should be found first, systems aimed at the high precision end of the information retrieval problem can be tested (comparatively) on anyone's judgements. In the next section we assume that third party judgements are an acceptable source of relevance assessments, but find that accepting Lesk and Salton's position introduces a bias in the test procedure toward systems which emphasise precision. This bias is acknowledged by Lesk and Salton, but it seems the presence of this bias has been forgotten over the last twenty years.

A second bias, again acknowledged by Lesk and Salton, is introduced by the way information retrieval systems which rank documents are assessed. An information retrieval mechanism employing ranking provides the user with a list of documents that the system considers relevant. The list is presented to the user in decreasing order of relevance (according to the system). Ranking systems can deal with what are often called "noise words". This feature allows users to enter their query in natural English. Consider phrases in the four queries shown in Figure 1 such as "I don't remember" in the CACM query 64, "Find all" in query 15, and "what does" in query 39. As far as keyword retrieval is concerned, these add no useful information to the query (and would not be used as part of a boolean query), but they do not seem to hinder the performance of ranking systems that incorporate stop-lists and term weighting. Such phrases can thus be left in and enable users to express their desires in a form that comes naturally.

Given an information retrieval system that ranks documents, performance can be assessed using recall and precision figures by comparing the list of documents returned with the set of documents judged as relevant in the test collection. For instance, a keyword retrieval system may return documents in the following order for query 64.

    D2651 D1307 D2513 D793 ...

According to the relevance judgements provided, the document numbered D2651 is relevant, and indeed it is the only relevant document. The system has done well and the first document the user looks at will be useful.

When a ranking information retrieval system does not perform perfectly, however, assessment is more difficult. Consider a ranking system executing Query 15 from the CACM collection. It has 10 relevant documents, and documents are returned by the system in the following order.

    D820 D2835 D3080 D2685 D307 D2929 D2616 D2344 D3054 D1466 D113 D1461 D658 D1231 ...

Of these, only two of the top fourteen documents are relevant. The precision at the 10% level of recall is calculated as if the system had stopped after the fourth document, giving 25% precision. The next relevant document occurs at position 14, and the next at position 20. We plot the precision at 10% increments of recall as the broken line in Figure 3. The performance of information retrieval systems is fairly uneven, and so results are usually presented as the average precision at fixed levels of recall across all queries in a collection.[1]

[1] Variations on this are sometimes used; for example, one of the evaluation figures produced for the TREC experiments interpolates, for each query, high precision figures to all lower levels of recall. The evaluation procedure trec_eval() is available from the ftp site ftp.cs.cornell.edu with the SMART information retrieval system.

Calculating a precision value for an information retrieval system is relatively simple: the user looks at the texts and decides which texts she or he actually wants. The precision is the number the user wants, divided by the number the user has looked at.
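A minimal sketch of this calculation, assuming (as in the Query 15 example above) 10 relevant documents in total, with three of them at ranks 4, 14 and 20 of the ranking; the document identifiers are illustrative only.

```python
def precision_at_recall_levels(ranked, relevant, levels=10):
    """Precision at `levels` equally spaced recall points (10%, 20%, ... for levels=10).

    ranked   -- document ids in the order the system returned them
    relevant -- set of all document ids judged relevant for the query
    Precision at a recall level is taken at the rank where the corresponding
    relevant document is retrieved; recall levels never reached score zero.
    """
    prec_when_found = []            # precision at the rank of each relevant document found
    found = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            prec_when_found.append(found / rank)
    total = len(relevant)
    precisions = []
    for i in range(1, levels + 1):
        needed = (i * total + levels - 1) // levels   # relevant docs needed to reach this recall level
        precisions.append(prec_when_found[needed - 1] if needed <= found else 0.0)
    return precisions

# Query 15 sketch: 10 relevant documents in total, three of them at ranks 4, 14 and 20.
ranked = ["doc%d" % i for i in range(1, 21)]
relevant = {"doc4", "doc14", "doc20"} | {"rel%d" % i for i in range(7)}
levels = precision_at_recall_levels(ranked, relevant)
print(levels[0])                    # 0.25, the 25% precision at the 10% recall level quoted above
print(sum(levels) / len(levels))    # the 10 point average precision for this query
```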
Finding recall is more difficult in that the calculation requires knowledge of all relevant documents in the collection. Having the user exhaustively search the collection to find all relevant material is an expensive task in document collections of significant size. Methods for predicting the number of remaining relevant documents using statistical methods have not been successful (Miller, 1971). An alternative to exhaustively searching the document collection is what is known as the pooled method. Using this test method, two or more information retrieval systems are used on the same query with the same document collection. The top N documents from each system are pooled, and judged for relevance. The judge does not know which system found which documents. The relevance judgements can then be used to compare the relative recall of the various systems. This is the approach taken in the TIPSTER project and the related TREC project (Harman, 1992). In some cases researchers have claimed that, by using a range of different information retrieval systems, the pooled method has found "the vast majority of relevant items" (Salton et al., 1983). If this assumption holds, then the collection and relevance judgements can be used to assess other systems without having to reassess retrieved documents.

    [Figure 3: Precision at various levels of recall for Query 15 (broken line) and the average over 54 queries, 10 point average 20.0% (solid line). Axes: recall (20% to 100%) against precision (20% to 100%).]

Often the performance of a system is summarised with a single figure that is the average precision at all levels of recall. This provides a crude but easy way to compare the performance of different information retrieval methods. Consider the performance of the information retrieval system plotted as the solid line in Figure 3.

    Recall:    10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
    Precision: .44  .36  .30  .24  .18  .15  .10  .08  .06  .06

When these are averaged, the performance of this particular system can be summarised with the single figure, 20.0% average precision on a 10 point scale.

Information retrieval and assessing recall

Unfortunately, both averaging precision figures and relying on a single set of relevance judgements introduce a bias towards high precision.

First consider our keyword system ranking documents for query 15. The top fifty documents in the ranked list are represented below, from left to right, with dots representing non-relevant documents and hashes representing relevant documents.

    ...#.........#.....#.#............................

The average precision for the system on this query is 9%. Now consider two hypothetical systems. The first system finds a high ranking relevant document sooner (that is, gives it an even higher rank), so improving precision. Let us assume it moves the first relevant document in the list from position 4 to position 1:

    #............#.....#.#............................

This improvement almost doubles the average precision of the system to 16.2%. The second system improves recall by making low ranking relevant documents, which are unlikely to be seen by the user, more accessible. Let us assume it finds all relevant documents in the top fifty.

    ...#.........#.....#.#......................######

This improvement once again almost doubles the average precision of the system to 16.1%. To a user after the first relevant document, the inconvenience of having to scan three irrelevant titles is small; users wanting many more relevant documents are greatly inconvenienced if they have to scan hundreds or thousands of titles to find them. The hypothetical high recall system is far more use to the author of the query who wanted to "find all" material. The first alternative is of questionable benefit even when the user is only after the first relevant document, and the difference in utility of the two mechanisms is not shown when average precision is used to summarise system performance. Those interested in improving the recall of information retrieval systems must compare precision at individual levels of recall and not be misguided by an overall average.

Taking the average precision over various levels of recall has a strong bias toward systems that find a few relevant texts early, but it is a bias that can be recognised and controlled. Unfortunately there is a similar bias built into the way relevant documents have been judged for some test collections.

Lesk and Salton's work, discussed in the previous section, argues that any set of relevance judgements can be used for assessment because all include those documents which are "certainly relevant". They conclude that the effectiveness of information retrieval systems can be compared on such judgements. Note that such relevance judgements give no indication of the effectiveness of an information retrieval system in an absolute sense, and a question of some import would seem to be just how much room there is for improvement in information retrieval systems. Consider a retrieval system that found between 10% and 50% of the documents judged relevant. Such a system appears to be doing as well as any human judge from the Lesk and Salton results and, given this is a reasonable figure for a boolean system to achieve, it would seem there is little point to further research.

Here we suggest that a less comparative method of assessment could be had by explicitly finding all and only the documents that are certainly relevant, and using those for assessment purposes. A method suggested by their work is to take the intersection of several sets of judgements, and use that as the set of relevant documents for a query. Such a test collection would give some idea of the absolute value of a text retrieval system for those users who wish to find something quickly on their topic of interest.
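The averaging bias illustrated above can also be sketched computationally. The fragment below scores the three ranking patterns with a simplified 10 point average that ignores relevant documents ranked below fifty; the figures quoted in the text also account for those lower ranked documents, so the absolute values differ slightly, but the qualitative effect is the same: both changes roughly double the average.

```python
def ten_point_average(pattern, total_relevant=10):
    """Simplified 10 point average precision for a '#'/'.' ranking pattern.

    Assumes one recall level per relevant document (here, 10 relevant documents);
    recall levels never reached within the pattern score zero, so relevant
    documents ranked below the cutoff are ignored.
    """
    found = 0
    precisions = []                  # precision at the rank of each relevant document found
    for rank, mark in enumerate(pattern, start=1):
        if mark == "#":
            found += 1
            precisions.append(found / rank)
    precisions += [0.0] * (total_relevant - found)   # pad the recall levels never reached
    return sum(precisions) / total_relevant

original     = "...#.........#.....#.#............................"
higher_first = "#............#.....#.#............................"
more_found   = "...#.........#.....#.#......................######"
for name, pattern in [("original", original), ("higher first", higher_first), ("more found", more_found)]:
    print(name, round(ten_point_average(pattern), 3))
# original ~0.07, higher first ~0.15, more found ~0.17 under this simplification
```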
But what about users who want to find all relevant documents? Many researchers in the area have had users provide relevance judgements at various levels (Sparck Jones and van Rijsbergen, 1976). However, this is not a simple thing to express to judges, nor does there seem to be any reason to choose 2 levels of relevance over, say, 5. Frei (1991) has proposed a test mechanism that takes this view to its extreme and compares two ranked lists of documents. We advocate instead that the "perfect" information retrieval system aimed at high recall would retrieve all documents considered relevant to any judge. The relevant documents for testing such a system would be the union of all the relevance sets.

With this view of relevance it is not possible for an information retrieval system to achieve perfect retrieval (100% precision at 100% recall) because different users have different ideas as to what is relevant. Without looking into the minds of each user, the ideal information retrieval system would retrieve all documents certainly relevant (those in the intersection test set), followed by all those thought relevant by at least one judge (those in the union), followed by the rest. The performance of such a system would be extremely difficult to assess in absolute terms, and so we advocate using two distinct relevance sets for testing purposes: the intersection of the relevance judgements for those research projects focusing on high precision, and the union of these sets for those interested in high recall. The ultimate system would perform well on both sets of relevance judgements.

Lesk and Salton's conclusions were incorporated in the creation of test collections such as CACM, and defended the status of others. Anyone using these collections must keep in mind the assumptions behind their creation. But some might question the importance of this given the recent TIPSTER initiative. TREC has several appealing features, including significantly more documents, unambiguous queries, and relevance assessment done by professionals. We believe, however, that these advantages introduce problems that must be considered. The TREC project grew out of an interest in routers and filters sponsored by DARPA, the United States Defense Advanced Research Projects Agency. Although there are considerable similarities between such mechanisms and information retrieval systems, the people using filters can have considerably different interests to those wanting an information retrieval system. We have already suggested that many doing information gathering are, unlike Lesk and Salton, going to be interested in finding more than just the first few relevant documents.

The TREC project's professional relevance assessors, in combination with the extended "topics" rather than queries, give the meaning of "relevance" a conciseness unattainable with test collections assembled with volunteer judges. Although the TREC project provides an environment with fewer disputes about what is relevant and what is not, this is not a feature of the average library catalogue search. Ad hoc queries are not only short (a point that has recently been addressed by the TREC organisers with the use of the summarization field but the same relevance judgements), they are also, we claim, inherently ambiguous. "Parallel algorithms" is a genuine ad hoc query, no matter how scientifically unappealing it may be. Proper evaluation of ad hoc queries can only be carried out with multiple judgements and naive users.

The size of the TREC document collection also introduces problems. Creating a test collection with exhaustive relevance judgements is an extremely expensive process and is not feasible with a collection as big as TREC. The alternative is to use the pooled method to compare competing systems. This is expensive in that each new comparison requires more documents to be judged, and thus the number of groups participating in the TREC competition must be limited. Researchers who are not able to participate might use the TREC documents and queries, and the released relevance judgements. A problem with this approach to testing is the status of unjudged documents. There is a tacit assumption that any documents that have not been found by the contractors and participants are not going to be relevant. If one believes that keyword retrieval is effective at finding significant numbers of the actual relevant documents, then, with enough participants, all the relevant material will be found. But this line of argument relies on one's faith in keyword retrieval being good for high recall. The more divergent new systems become from the other systems participating, the more likely it is that the new system will be finding relevant, unjudged documents. Thus, care must be taken when using the released relevance judgements to test novel systems.
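To make the two test set constructions advocated above concrete, here is a minimal sketch; the judge names and document identifiers are purely illustrative.

```python
# Binary relevance judgements from several judges for one query (illustrative data).
judgements = {
    "judge1": {"D12", "D40", "D97", "D101"},
    "judge2": {"D12", "D40", "D55"},
    "smart":  {"D12", "D101"},
}

# High precision test set: documents every judge agrees are relevant.
certainly_relevant = set.intersection(*judgements.values())

# High recall test set: documents at least one judge considered relevant.
possibly_relevant = set.union(*judgements.values())

print(sorted(certainly_relevant))   # ['D12']
print(sorted(possibly_relevant))    # ['D101', 'D12', 'D40', 'D55', 'D97']
```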

Comparing two systems for recall

The remainder of this paper illustrates the problems of measuring high recall by comparing two retrieval mechanisms over a limited set of queries using a new set of relevance judgements designed for assessing high recall. In this section we describe a new information retrieval mechanism that we expect may give better recall than keyword retrieval.

The semantic signature mechanism is based on the assumption that relevant documents will partially paraphrase the user's query. There is thus room to improve information retrieval systems by having them recognise when the same idea is being expressed in different words. Natural languages such as English allow great diversity in the way ideas are expressed. Syntactic variations can be dealt with formally by, for instance, converting all texts to their active form. In information retrieval, a less formal and quicker mechanism is used: all word order is removed and texts are treated as sets of words. This works, we argue, because the semantic structure of a text is primarily carried by the lexical preferences associated with the words themselves. As an extreme example of this process, there is only one way to assemble a meaningful sentence from the three words "police", "door", and "sledge-hammer". Variations in word meanings have been tackled using thesaurus-like mechanisms. This approach however does not capture paraphrases in which single words are replaced with longer texts.

The semantic signature mechanism attempts to deal with variations in the meaning/text mapping at the text level by constraining the vocabulary in which ideas are expressed. Keyword retrieval of documents written in a restricted language, by queries written in the same restricted language, is likely to significantly reduce the number of misses caused by variation in language use. The vocabulary in which semantic signatures attempt to capture ideas is that used in the definitions in the Longman Dictionary of Contemporary English (LDOCE). The definitions in LDOCE have been written from a restricted vocabulary of approximately 2300 words.

The technique we use relies on the assumption that the definitions in LDOCE have been written using Leibniz's substitutability criterion for a good definition: each word in the text of documents and the query is replaced with the appropriate definition from LDOCE. If the definitions are good, the meaning of the text remains unchanged.[2] The aim is to imitate a system that paraphrases both documents and queries in a language with a restricted vocabulary, and then does conventional information retrieval on the new representations of the documents' (and queries') meaning.

In terms of the vector-space model, conventional keyword retrieval places documents and queries in N-dimensional space, where N is the number of unique words in the document collection. When the cosine similarity measure is used, the similarity (relevance) of a document to a query is the cosine of the angle between the appropriate vectors. The semantic signature mechanism is simply using a smaller set of features, with LDOCE as the required mapping function. When a keyword information retrieval mechanism is used, the features are the words used to describe the concept. They need not be, and several attempts have been made to map the words to better features before placing the concept in the concept space (Deerwester et al., 1990; Gallant et al., 1992; Wallis, 1993).
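A minimal sketch of the cosine similarity calculation referred to here, using plain word counts as features; the query and document texts are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)                 # only shared terms contribute
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "parallel algorithms"
doc = "fast parallel sorting algorithms"
print(round(cosine_similarity(query, doc), 3))        # 0.707: both query terms appear in the document
```

Replacing the raw words with a smaller feature set, as the semantic signature mechanism does, changes only the components of the vectors; the similarity computation itself is unchanged.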
The major problems with implementing this technique are in devising a reasonable set of better features, and devising a mapping from the words appearing in the text of documents and queries to a suitable feature-based representation. The semantic signature mechanism uses LDOCE to solve both these problems.

Mapping words to dictionary entries is not straightforward. Two filters are used: one to remove affixes, and another to select the required sense of the word. Both these procedures are known to degrade a retrieval system's performance[3] and so, to maintain a level playing field, the keyword mechanism we use has the affixes removed and words are tagged with their homograph number. This process is compared with a conventional stemming filter in Figure 4.

[2] Leibniz actually wanted the truth of the text to remain the same as before the substitution.
[3] Recent tests by the authors on TREC suggest otherwise.

    [Figure 4: Text processing requirements for semantic signature tests. Components: Raw Text; Stemmer; Morpher; Sense Selection; Signature Generator; Semantic Signatures; Keyword Test; Conventional Text Retrieval.]
The keyword test takes the original text and replaces each word that appears in LDOCE with its sense-selected root word. As an example, query 25 from the CACM test set asks for

    "Performance evaluation and modelling of computer systems".

This text is passed through the above filters to become the set of sense-selected words:

    {performance2 evaluate1 model8 computer1 system4}.

There is now a one-to-one correspondence between the terms in the text and definitions. When the words are replaced with the appropriate definition for each term, the text becomes the set of "primitives":

    act action before character music perform piece play public trick
    calculate degree value
    model
    calculate electric information machine make speed store
    body system usual way work

Note that although the textual representation of query 25 in the semantic signature format is large, the vocabulary is highly constrained and so a practical implementation of this retrieval mechanism could be quite efficient.

To illustrate the utility of combining relevance judgements in different ways, we would have liked to take a collection of documents and a set of queries and have several people make relevance judgements on each query and all the documents. This is an exceedingly labour intensive task and so we have restricted the size of the experiment. We chose instead to use the existing CACM document and query test collection because it is widely known and used by many researchers. The aim of the experiment is to create relevance assessments suitable for information retrieval systems that emphasise recall, and to test the provided relevance judgements for CACM against those documents that are certainly relevant.

There are 3204 documents in the CACM collection; these documents consist of titles and, in most cases, abstracts. Obtaining relevance judgements for all of these is beyond the extent of volunteer labour even for one query. In order to limit the number of documents each person would need to examine, we have followed the TREC procedure and used the pooling method of system assessment.
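Returning to the query 25 example above, the transformation can be sketched as follows; the sketch assumes the text has already been through the morphing and sense selection filters, and the miniature sense-tagged dictionary below is invented for illustration (it is not LDOCE data).

```python
# A tiny, invented stand-in for LDOCE: sense-tagged root word -> defining vocabulary.
mini_dictionary = {
    "performance2": "act action before character music perform piece play public trick",
    "evaluate1":    "calculate degree value",
    "model8":       "model",
    "computer1":    "calculate electric information machine make speed store",
    "system4":      "body system usual way work",
}

def semantic_signature(sense_selected_words):
    """Replace each sense-selected word with the words of its definition."""
    primitives = set()
    for word in sense_selected_words:
        primitives.update(mini_dictionary[word].split())
    return primitives

query_terms = ["performance2", "evaluate1", "model8", "computer1", "system4"]
print(sorted(semantic_signature(query_terms)))
# The query is now represented in the restricted defining vocabulary, and matching
# against documents transformed the same way can proceed as ordinary keyword retrieval.
```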

That is, each system is run on the collection with each query; the best K documents according to each system are collected and judged as relevant or not. The precision for each system is then calculated in the usual manner, and a comparative result for the recall can be given. In other words, instead of saying that system X found 30% of the relevant documents in the collection, one can say that system X found 20% more relevant documents than system Y. In the TREC experiments the number of documents examined from each system, K, is 100 for TREC participants, and 200 for contractors. The relevance assessments are made by a single expert provided by The National Institute of Standards and Technology. In our experiments K is 80, and the judgements are made by the authors of this paper. The information retrieval systems we compare are the semantic signature mechanism described in the previous section and a keyword mechanism. There are 64 queries in the CACM test collection, and between 80 and 160 documents per query requiring consideration, which was still too many relevance judgements to make. We set out to select approximately 10 queries that had similar performance for each system when compared using the supplied relevance judgements. We also wanted the chosen queries to be representative of the overall performance of each system. Both systems achieve around 20% overall average precision on a ten point scale, and so all queries that had an average precision of 20% ± 5% on both systems were chosen for these tests. This gives 7 queries: 6, 7, 19, 20, 25, 36, and 61. The queries themselves are shown in Figure 5. This selection of queries, although not a random[4] selection, provides a relatively diverse range of query styles, and a significant variation in the amount of relevant material.

[4] Since the selection of queries was not random the test cannot be considered a fair test of semantic signatures versus keyword retrieval; however, in this paper we are primarily concerned with the assessment mechanism, not with testing particular information retrieval mechanisms.

    Query id (number of relevant documents provided with SMART)
    Q6 (3): Interested in articles on robotics, motion planning particularly the geometric and combinatorial aspects. We are not interested in the dynamics of arm motion.
    Q7 (28): I am interested in distributed algorithms - concurrent programs in which processes communicate and synchronise by using message passing. Areas of particular interest include fault-tolerance and techniques for understanding the correctness of these algorithms.
    Q19 (11): Parallel algorithms
    Q20 (21): Graph theoretic algorithms applicable to sparse matrices
    Q25 (51): Performance evaluation and modelling of computer systems
    Q36 (20): Fast algorithm for context-free language recognition or parsing
    Q61 (31): Information retrieval articles by Gerard Salton or others about clustering, bibliographic coupling, use of citations or co-citations, the vector space model, Boolean search methods using inverted files, feedback, etc.

    Figure 5: The 7 queries from CACM used in these experiments.

The results of the relevance judgements reported here and characterised below were attained by having the authors of this paper make the relevance judgements. We believe this does not introduce a bias because, using the pooled method, there was no way for the judges to know which system provided which documents.
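A minimal sketch of the pooled assessment procedure described above; the system names and rankings are illustrative, and the shuffle is one possible way of presenting the pool to a judge without any indication of which system retrieved which document.

```python
import random

def build_pool(rankings, k=80, seed=0):
    """Pool the top k documents from each system's ranking for one query.

    rankings -- mapping from system name to its ranked list of document ids
    Returns the pooled document ids in a shuffled order, so the judge cannot
    tell which system (or rank) a document came from.
    """
    pool = set()
    for ranked in rankings.values():
        pool.update(ranked[:k])
    pooled = sorted(pool)                    # deterministic order before shuffling
    random.Random(seed).shuffle(pooled)
    return pooled

# Illustrative rankings for one query.
rankings = {
    "keyword":   [f"D{i}" for i in range(1, 101)],
    "signature": [f"D{i}" for i in range(50, 151)],
}
pool = build_pool(rankings, k=80)
print(len(pool))                             # 129: the two top-80 lists overlap on 31 documents
```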

Although Judge-1 and Judge-2 found significantly more relevant material than the SMART assessors, the overlap between Judge-1 and the SMART judge is empty for query 6, and neither new judge was lenient enough to find all documents judged relevant by the SMART assessors even though both chose more than twice as many relevant documents. As with other non-user judges in similar tests, we considered more documents relevant than the original authors of the queries.

    judge(s)                 Qry 6   Qry 7   Qry 19  Qry 20  Qry 25  Qry 36  Qry 61  sum
    smart                    3(3)    16(28)  8(11)   3(21)   21(51)  13(20)  21(31)  85(147)
    jdge1                    3       34      40      13      48      18      55      211
    jdge2                    7       40      38      11      61      22      47      220
    jdge1 ∩ smart            0       15      8       2       20      9       19      73
    jdge2 ∩ smart            3       16      8       2       21      11      17      78
    jdge1 ∩ jdge2            1       32      37      9       45      15      43      182
    jdge1 ∩ smart ∩ jdge2    0       15      8       2       20      9       17      71
    jdge1 ∪ smart ∪ jdge2    9       42      41      16      64      27      61      260

    Figure 6: Agreement of Relevance Judgements (Wallis and Thom).

Figure 6 provides the same information as that provided for the Lesk and Salton experiment from 1969. The intersection of the two judges' relevance judgements is often larger than the original set of judgements, and this, once again, is presumably a product of the small number of relevance assessors participating.

We compared the semantic signature mechanism with the keyword mechanism using three different test sets of relevance judgements: the original CACM relevance judgements; the intersection of our judgements; and finally the union of our judgements. The second test set is appropriate for comparison of keyword and semantic signature mechanisms when precision is the emphasis, and the third test set is appropriate when recall is important. The results are presented as the number of relevant documents found after fixed numbers of viewed documents. We do not use precision at levels of recall because, using the pooled method, the performance of either system does not indicate anything about the overall number of relevant documents in the collection. We cannot, therefore, calculate actual recall. Figure 7 plots precision over the first ranked 50 documents for the seven queries, using the original CACM judgements, on the two systems. When the number of relevant documents is summed for the first 10 ranked documents for each system, the keyword mechanism finds one more document than the semantic signature mechanism. Over the first 50 ranked documents the performance is about the same. This is not surprising in that the parameters of the semantic signature mechanism were chosen to give the best performance using the original relevance judgements. Figure 8 shows the same comparison when the intersection of the two sets of judgements is used. This is the test aimed at high precision and, although the performance does not vary in accord with the SMART judgements, this is probably because of the bias of the assessors toward things being relevant, and the effect seen using SMART judgements would become obvious with more relevance judges participating.

Figure 9 compares the two systems for high recall. In this case the performance is about the same for the first 20 documents, but then the semantic signature mechanism starts to find more documents. By the time the user has looked at 50 documents, the user has found an average of 4 more documents per query. This represents about a 20% increase in the amount of relevant material found. If the user is interested in finding more later, rather than a few sooner, the semantic signature mechanism has a significant advantage. This performance does not appear to drop off.
We have examined the ranking up to the 80 document level and the semantic signature mechanism maintains a solid 20% advantage.

We conclude that the semantic signature mechanism is worthy of further investigation in the context of high recall. This contrasts sharply with the conclusion drawn from the results based on the SMART relevance judgements.

    [Figure 7: Semantic signatures vs. keyword: assessed with SMART package relevance judgements. Percent precision (10% to 70%) against number of documents viewed (0 to 50); curves for Keywords and Semantic signatures.]

    [Figure 8: Semantic signatures vs. keyword: High recall test. Same axes and curves as Figure 7.]

    [Figure 9: Semantic signatures vs. keyword: High precision test. Same axes and curves as Figure 7.]

The semantic signature mechanism is not, however, the only approach to improving recall. Experiments, such as those on latent semantic indexing (Deerwester et al., 1990) and thesauri (Thom and Wallis, 1992) using the original relevance judgements, have shown some benefits at the higher levels of recall. How accurate are these experiments? It seems these experiments need to be redone using new relevance judgements.

Conclusion

Keyword retrieval of texts is very effective at what it does, and the history of information retrieval suggests that it is very hard to find something better. There are things, though, that keyword retrieval cannot do, and one of those is finding all relevant material. High recall systems are perhaps not as generally useful as systems with high precision, but in several applications there is a need for better recall. There would appear to be considerable room for interesting research on the high recall problem. We have shown that there is a bias in the way information retrieval systems have been assessed, and that this bias means the effectiveness of high recall systems has not been apparent.

The word "relevant" means different things depending on, amongst other things, which individual is asking, and the degree to which the individual is interested in finding all relevant material. The variation between individual ideas of relevance is objectified by combining multiple judgements, and different needs for recall determine the way the relevance judgements are combined. An ideal ad hoc query system that found all relevant documents would find all and only the documents in the union of the sets of all users' relevance judgements. Each individual user would of course consider the precision of such a system to be less than perfect, but this is a byproduct of the nature of relevance. It is indeed misleading to think that a system might simultaneously attain 100% recall and precision. An ideal high precision system would retrieve all and only the documents in the intersection of various users' relevance judgements.

The relevance judgements in existing test collections are suitable for testing high precision systems; however, for testing high recall the relevance judgements in these test collections cannot be used. In the future we hope there will be a test collection suitable for testing high recall mechanisms. Such a collection would consist of documents and diverse naive queries, with multiple, exhaustive relevance judgements. Without such a collection, future work on information retrieval mechanisms for high recall must be run using the pooled method of judging relevance, with multiple relevance judgements on one of the existing ad hoc query test collections, an expensive process.

References

Buckley, C., Voorhees, E., and Salton, G. (1988). The SMART information retrieval system, version 8.8. Fetched from ftp site ftp.cs.cornell.edu.

Cleverdon, C. W. (1974). User evaluation of information retrieval systems. Journal of Documentation, 30(2):170.

Cooper, W. S. (1971). A definition of relevance for information retrieval. Information Storage and Retrieval, 7(1):19-37.

Crouch, C. J. (1990). An approach to the automatic construction of global thesauri. Information Processing and Management, 26(5):629-640.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

El-hamdouch, A. and Willett, P. (1989). Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal, 32(3):220-227.

Fagan, J. L. (1987). Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. PhD thesis, Department of Computer Science, Cornell University, Ithaca, New York.

Frei, H. P. and Schauble, P. (1991). Determining the effectiveness of retrieval algorithms. Information Processing and Management, 27(2):153-164.

Gallant, S. I., Caid, W. R., Carleton, J., Hecht-Nielsen, R., Qing, K. P., and Sudbeck, D. (1992). HNC's MatchPlus system. SIGIR Forum, 26(2):2-5.

Harman, D. (1992). The DARPA TIPSTER project. SIGIR Forum, 26(2):26-28.

Janes, J. W. (1994). Other people's judgements: A comparison of users' and others' judgements of document relevance, topicality and utility. Journal of the American Society for Information Science, 45(3):160-171.

Lesk, M. E. and Salton, G. (1969). Relevance assessments and retrieval system evaluation. Information Storage and Retrieval, 4(4):343-359.

Miller, W. L. (1971). The extension of users' literature awareness as a measure of retrieval performance, and its application to MEDLARS. Journal of Documentation, 27(2):125-135.

Salton, G., Fox, E., and Wu, H. (1983). Extended boolean information retrieval. Communications of the ACM, 26(12):1022-1036.

Saracevic, T. (1975). RELEVANCE: a review of and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science, 26:321-343.

Smeaton, A. F. (1987). Using Parsing of Natural Language as part of Document Retrieval. PhD thesis, Department of Computer Science, University College Dublin, The National University of Ireland.

Sparck Jones, K. and van Rijsbergen, C. (1976). Progress in documentation. Journal of Documentation, 32(1):59-75.

Su, L. T. (1994). The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45(3):207-217.

Swanson, D. R. (1988). Historical note: Information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39(2):92-98.

Thom, J. A. and Wallis, P. (1992). Enhancing retrieval using the Macquarie thesaurus. In 1st Australian Workshop on Natural Language Processing and Information Retrieval, pages 111-117. Monash University, Clayton, Victoria, Australia.

Wallis, P. (1993). Information retrieval based on paraphrase. In Proceedings of the 1st Pacific Association for Computational Linguistics Conference, pages 118-126. Simon Fraser University, Vancouver, British Columbia, Canada.