Improving browsing in digital libraries with keyphrase indexes



Decision Support Systems 27 (1999) 81–104. www.elsevier.com/locate/dsw

Improving browsing in digital libraries with keyphrase indexes

Carl Gutwin a,*, Gordon Paynter b,1, Ian Witten b,2, Craig Nevill-Manning c,3, Eibe Frank b,4

a Department of Computer Science, University of Saskatchewan, 57 Campus Drive, Saskatoon, Saskatchewan, Canada S7N 5A9
b Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand
c Department of Computer Science, Rutgers University, Piscataway, NJ 08855, USA

Abstract

Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built a new kind of search engine, Keyphind, that is explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit of both indexing and presentation, allowing users to interact with the collection at the level of topics and subjects rather than words and documents. The keyphrase index also provides a simple mechanism for clustering documents, refining queries, and previewing results. We compared Keyphind to a traditional query engine in a small usability study. Users reported that certain kinds of browsing tasks were much easier with the new interface, indicating that a keyphrase index would be a useful supplement to existing search tools. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Digital libraries; Browsing interfaces; Text mining; Keyphrase extraction; Machine learning

1. Introduction

Digital libraries and searchable document collections are widely available on the Internet. Unfortunately, people still have difficulty finding what they want in these collections (e.g., Refs. [3,13]). In particular, users often find browsing to be frustrating. In browsing, the user interacts with a collection and carries out searches, without having in mind a specific document or document set [5]. For example, one may wish to ascertain what a collection contains, whether there is any material on a certain topic, or what kinds of queries are likely to produce good results (see Table 1). Such tasks are difficult to carry out in digital libraries — not because the problems are inherently hard, but because the search systems that are provided do not support browsing very well.

* Corresponding author. E-mail: [email protected]
1 E-mail: [email protected].
2 E-mail: [email protected].
3 E-mail: [email protected].
4 E-mail: [email protected].

Standard search engines use a full-text inverted index to find documents that match the user's query, and use a ranking algorithm to order the documents for presentation (e.g., Refs. [35,36]). Although this approach is powerful — users can retrieve a document by entering any word that appears in it — full-text indexing causes several problems when browsing. First, most standard engines index individual words only, whereas people often use short topic

0167-9236/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-9236(99)00038-X


Table 1
Three types of browsing and related task questions

Task type              Questions
Collection evaluation  What's in this collection?
                       What topics does this collection cover?
Subject exploration    How well does this collection cover area X?
                       What topics are available in area X?
Query exploration      What kind of queries will succeed in area X?
                       How can I specialize or generalize my query?

phrases when exploring a collection (e.g., solar energy, programming by demonstration, artificial intelligence)5 [38,42]. Second, standard engines return lists of individual documents, but this level of presentation is usually too specific for users who are browsing [34,45]. Third, standard engines do not provide support for forming new queries, even though browsing involves an iterative process of exploration and query refinement [39].

We have built a new search engine — called Keyphind — that uses a single simple approach to address all of these problems. Keyphind is based on keyphrases (similar to those provided by authors) that have been automatically extracted from the documents in a collection. While the use of phrases in information retrieval is not new, Keyphind goes beyond previous systems by making the phrase the fundamental building block for the entire system — including the index, the presentation and clustering of results, and the query refinement process. The keyphrase approach allows people to interact with a document collection at a relatively high level. Topics and subjects are the basic level of interaction, both when forming queries and when viewing and manipulating query results. Browsing is supported by showing the user what topics are available for a particular query, how many documents there are in those areas, and all possible ways that a query can be legitimately extended. In our initial usability studies, the keyphrase approach was seen to be more useful for browsing tasks than a standard query engine.

5 For example, the typical query to the Infoseek search engine (www.infoseek.com) is a noun phrase with an average length of 2.2 words [23].
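The word-level indexing that causes these problems can be sketched in a few lines. This is a minimal illustration in Python, not the implementation of any particular engine, and the document texts are invented:

```python
from collections import defaultdict

# Build a word-level inverted index: each word maps to the set of
# document IDs it appears in -- the structure standard engines use.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "computer architecture for parallel machines",
    2: "the architecture school installed a new computer lab",
}
index = build_index(docs)

# A topic phrase is reduced to an AND of its words, so any document
# containing both words anywhere matches -- even document 2, which is
# not about "computer architecture" at all.
def query_and(index, *words):
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(query_and(index, "computer", "architecture"))   # {1, 2}
```

The false match for document 2 illustrates the precision problem with topic phrases that the text describes.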

In the first part of this article, we discuss browsing in more detail, and review previous efforts to address the drawbacks of standard search engines. Then we introduce the Keyphind system, outline the text-mining technique used to extract keyphrases from documents, and describe the ways that users can interact with the system. Finally, we evaluate the keyphrase approach, and report on a usability study that compared browsing in Keyphind with browsing in a traditional full-text query system.

2. Browsing document collections

When people interact with large document collections, only part of their activity involves specific searches for particular documents. Many tasks are better characterized as "browsing" — information-seeking activity where the goal is not a particular set of documents (e.g., Refs. [5,29]). For example, people may want to learn about a collection, what it contains, and how well it covers a particular topic; they may want to find documents in a general area, but without knowing exactly what they are looking for; or they may simply be looking for something "interesting." Browsing is an inherently interactive activity: when people browse, their direction and activities evolve as they work, and they may take several different routes to their final destination.

In bricks-and-mortar libraries, people browse by walking among the stacks, looking at displays set up by library staff, even perusing the reshelving carts to see what other people have taken out. Although digital libraries preclude these physical activities, people still browse; they just do it through whatever means they can — namely the collection's query interface. Browsing in digital libraries is characterized by a succession of queries rather than a single search — that is, people determine their next query partly based on the results of the previous one [29].

This paper focuses on three kinds of browsing tasks that we have informally observed in our experience with the New Zealand Digital Library (NZDL; www.nzdl.org): evaluating collections, exploring topic areas, and exploring possible queries. Table 1 shows some of the questions that people have in mind when they undertake these tasks. The following


sections describe the tasks, outline why traditional search engines make the tasks difficult, and review prior research in each area.

2.1. Collection evaluation

When a searcher has several different document collections to choose from, but not enough time to search all of them in detail, they must determine which collections are most likely to provide them with the information they need. This process of assessment often involves browsing — both to gain an overview of the collection and to determine if it contains material in the area of interest [29]. As an example of the number of collections available for a particular search, the following are a few of the possibilities for finding literature on browsing interfaces:

• The ACM Digital Library (www.acm.org/dl)
• Internet search engines (e.g., AltaVista, Infoseek, HotBot)
• The HCI Bibliography (www.hcibib.org)
• The NZDL Computer Science Technical Reports (CSTR) Collection (www.nzdl.org)
• The Cornell Computer Science Technical Reference Library (cs-tr.cs.cornell.edu)
• The INSPEC citation database (local access)
• The NIST TREC document collection (nvl.nist.gov)
• The UnCover citation database (uncweb.carl.org/)
• Local online library catalogues (e.g., library.usask.ca).

There are many ways that these and other collections can be compared and evaluated: for example, size, currency, and quality of information are a few possibilities. Perhaps the most important criterion in choosing between collections, however, is coverage — what is in the collection and whether or not it is likely to contain the user's desired information. With physical collections (that is, books), people often evaluate coverage by scanning tables of contents or by looking up their topics in the index (e.g., Ref. [28]).

In digital collections that use standard search engines, however, ascertaining what the collection contains is difficult or impossible. Although most systems provide a brief description of the collection's contents (e.g., CSTR), they rarely display the range of topics covered. The only way to find this information is by issuing general queries and scanning the lists of returned documents — a slow and arduous process.

Several systems and research projects have sought to address this problem by providing information about the top-level contents of a document collection. These approaches can be split into two groups: those that provide topic information manually, and those that determine it automatically. Manually determined categories involve the classification of each document by a human indexer, and so are laborious to create and maintain. Examples of manual categorization can be seen in commercial systems such as the Yahoo! collection (www.yahoo.com), and in some research collections (e.g., Ref. [14]). In contrast, automatic techniques create high-level overviews by algorithmic examination of the documents in the collection. A variety of techniques for gathering and displaying this information have been investigated. The most well known of these is vector-space document clustering, where document similarities are calculated and used to group similar documents together (e.g., Refs. [26,36]). The process can be repeated until the desired level of generality is reached (although it is not immediately apparent how to choose meaningful names or labels for the clusters [26]). Other researchers have used self-organizing maps to show both the main topics in a collection and their relative frequencies. For example, Lin [28] creates graphical tables of contents from document titles, and Chen and Houston [6] create map representations of a collection of web pages. Finally, when a collection has an internal structure (such as a hierarchical web site), this structure can be used to generate an overview (e.g., Ref. [22]).
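The core of the vector-space clustering mentioned above can be sketched as follows. This is a toy illustration — cosine similarity over raw term-frequency vectors with a simple single-pass grouping rule; the cited systems use more sophisticated weighting and clustering algorithms, and the documents and threshold here are invented:

```python
import math
from collections import Counter

# Represent each document as a term-frequency vector.
def tf_vector(text):
    return Counter(text.lower().split())

# Cosine similarity between two term-frequency vectors.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Single-pass grouping: put a document into the first cluster whose
# representative it resembles closely enough, else start a new cluster.
def cluster(docs, threshold=0.3):
    clusters = []  # list of (representative_vector, member_ids)
    for doc_id, text in docs:
        vec = tf_vector(text)
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((vec, [doc_id]))
    return [members for _, members in clusters]

docs = [
    (1, "digital libraries and document collections"),
    (2, "browsing digital libraries"),
    (3, "solar energy conversion"),
]
print(cluster(docs))   # [[1, 2], [3]]
```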

2.2. Exploring topic areas

A second common browsing activity is the exploration of a subject area (e.g., Refs. [5,34]). When exploring, users may try to gain an understanding of the topics that are part of the area, may wish to gather contextual information for directing and focusing their search, or may simply hope to come across useful material. Exploration is often undertaken by novices and people who need to learn about


a new area; but it may also be undertaken by people who have had a search failure and are interested in the kinds of queries that will definitely succeed.

This kind of browsing is greatly aided by being able to see the range of materials available about the subject, since people can then both learn about the domain and use visual recognition rather than linguistic recall to direct their activity. As Chang and Rice [5] state, "browsing in computer systems is characterized by searching without specifying; it is a recognition-based search strategy" (p. 243).

Traditional search systems make exploration difficult for two reasons. First, subject areas are often described with multi-word phrases, but the word-level indexes used in many systems deal poorly with queries where query terms are meaningful in relation to one another [43]. For example, a standard system will have poor precision when asked about a subject area like "computer architecture," where the meaning of the phrase is different from either component. Second, ranking algorithms are designed to return documents relevant to the query terms, not to show the range of documents that exist. A list of individual documents is usually far too specific for users who are trying to explore an area (e.g., Refs. [34,39]).

The first of these problems has previously been addressed by including phrases as well as individual words in the inverted index (e.g., Refs. [35,43]). However, indexing all phrases consumes considerable space, so researchers have also considered various means of selecting only "important" or content-bearing phrases for indexing (e.g., Refs. [12,18]). We discuss the issues involved in phrase-based indexing in more detail later in the article.

The second problem — that systems return a list of documents rather than a range of topics — has been addressed primarily through the use of document clustering. In this approach, the search engine returns a set of clusters that match a user's query rather than a set of documents. Again, clustering may be based on existing manual document classifications (e.g., Yahoo) or on automatically extracted information. Automatic approaches are primarily based on vector-space based similarity [26]. Clusters are then presented to the user, either as a list (e.g., Refs. [1,11]) or in some graphical format (e.g., Refs. [19,44]). One of the better-known clustering applications that seeks to support browsing is the Scatter/Gather system [11,33]. This system uses a linear-time vector-space algorithm to dynamically cluster the documents resulting from a user's query. The user can then select one or more clusters for reclustering and further analysis.

2.3. Query exploration

Browsing is an activity where the user's queries are not determined beforehand, but evolve as the user interacts with the collection [5]. Therefore, part of the browsing process involves gathering information about how different kinds of queries will work in the search system, and "translating an information need into a searchable query" (Ref. [3], p. 493). People may wish to find out what kinds of queries will succeed, how a query can be specialized or generalized, or what will happen if an existing query is changed.

In a traditional search system, however, there is little support for query exploration. Users have to compose their queries without any assistance from the system, and there is no indication of how likely a query is to be successful or of what kind of results it will produce. It is also difficult to use a previous query as the basis for a new one: since the inner workings of a ranking algorithm are rarely made explicit, it is not easy to predict how changes to a query will affect the size or relevance of the result set. Compounding the difficulty, a concept may be referred to in the collection by any of several related terms (the vocabulary problem [7,17]), and users must somehow determine the correct one. Without guidance in these areas, users will often issue queries that return either nothing or a multitude of documents — the "feast or famine" problem (e.g., Ref. [14]).

Researchers have attempted to provide guidance for query exploration in several ways. Support for query refinement often presents the user with a set of terms related to their query terms (e.g., Refs. [9,47]); they can then add the suggestions to their query, either disjunctively (to expand the results) or conjunctively (to specialize the results). In some systems, terms are automatically added to the user's query before it is returned, depending on the size of the result set (e.g., Ref. [48]). The related terms may come from a thesaurus, which gathers and records


related terms in the collection (e.g., Ref. [10]), or from other lexical means (e.g., Ref. [1]). Hierarchical and dynamic clustering can also be used to help the user form their next query. For example, the Scatter/Gather system regroups the results every time the user selects a set of clusters to examine further; so at any stage of the search, this always provides the user with a set of possibilities that they know will produce results.
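In set terms, the disjunctive/conjunctive refinement described above is simply union versus intersection over result sets. A trivial sketch, with invented document IDs:

```python
# Result sets for an original query term and a suggested related term
# (the terms and document IDs here are hypothetical).
results = {"library": {1, 2, 3, 5}, "catalogue": {2, 3, 4}}

# Adding the suggestion disjunctively (OR) expands the result set,
# while adding it conjunctively (AND) specializes it.
expanded = results["library"] | results["catalogue"]
specialized = results["library"] & results["catalogue"]

print(sorted(expanded))     # [1, 2, 3, 4, 5]
print(sorted(specialized))  # [2, 3]
```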

A few systems also provide results previews — feedback about the results of a query before the query is issued (e.g., Refs. [14,20]). These techniques allow users to manipulate query parameters and see qualities of the results set before the query is issued (for example, the number of documents in the set or their distribution across one or more kinds of meta-data). Results previews help users understand which queries are likely to succeed, and enable powerful interaction techniques like dynamic querying [39], but require that a small and fast version of the entire index be available or downloadable to the local query client [20].

2.4. Supporting browsing tasks using phrases

Our approach to supporting these browsing tasks is to use keyphrases as the basic unit for indexing, retrieval, and presentation. Keyphrases provide us with a simple mechanism for supporting collection evaluation, subject-area exploration, and query exploration. The advantages of this approach are that keyphrases can be automatically extracted from documents; that the resulting indexes are simple, small, and quickly built; and that the workings of the system are easy for users to understand. In Section 3, we consider these issues in more detail. We describe the Keyphind system, the technique used to extract keyphrases from documents, and how the keyphrase indexes are built.

3. Keyphind

Keyphind (from "keyphrase find") is a system that permits browsing, exploring, and searching large collections of text documents (see Fig. 1). Like other search engines, it consists of an index and a graphical query interface; however, both index and presentation schemes differ substantially from conventional systems. First, the index is built from keyphrases that have been automatically extracted from the documents, rather than from the full text of the collection. Second, documents that match a query are grouped by keyphrase in the interface, and users can work with these clusters before looking at individual documents.

More detail on these techniques is given below. First, we take a brief look at what it is like to use the system. Then we describe how it obtains phrases, builds its indexes, and presents documents to the user. The examples and screen snapshots that we discuss use a database built from approximately 26 000 documents, whose source text totals 1 GB. The reports are part of the CSTR collection of the NZDL [49] (www.nzdl.org). Although the example collection is not enormous by today's standards, the algorithms and techniques used in Keyphind will readily scale to larger collections.

3.1. Using Keyphind

Keyphind's interface is shown in Fig. 1. A user initiates a query by typing a word or phrase and pressing the "Search" button, just as with other search engines. However, what is returned is not a list of documents, but rather a list of keyphrases containing the query terms. Since all phrases in the database are extracted from the source texts, every returned phrase represents one or more documents in the collection. Searching on the word text, for example, returns a list of phrases including text editor (a keyphrase for 12 documents), text compression (11 documents), and text retrieval (10 documents) (see Fig. 1). The phrase list provides a high-level view of the topics covered by the collection, and indicates, by the number of documents, the coverage of each topic.

Following the initial query, a user may choose to refine the search using one of the phrases in the list, or examine one of the topics more closely. Since they are derived from the collection itself, any further search with these phrases is guaranteed to produce results — and furthermore, the user knows exactly how many documents to expect. To examine the documents associated with a phrase, the user


Fig. 1. Keyphind user interface.

selects one from the list, and previews of the documents are displayed in the lower panel of the interface. Selecting any document in the preview list shows the document's full text. More detail on Keyphind's capabilities is given in later sections.
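This phrase-level retrieval can be sketched as a dictionary from keyphrase to document set. The phrases echo the "text" example above, but the index contents and document IDs are invented for illustration; this is not Keyphind's actual index format:

```python
# A keyphrase index maps each extracted phrase to the set of documents
# it represents (hypothetical data echoing the example in the text).
phrase_index = {
    "text editor": set(range(1, 13)),        # 12 documents
    "text compression": set(range(13, 24)),  # 11 documents
    "text retrieval": set(range(24, 34)),    # 10 documents
    "machine learning": {34},
}

# A query returns matching phrases with their document counts,
# best-covered topics first, rather than a ranked document list.
def query(index, term):
    hits = [(p, len(d)) for p, d in index.items() if term in p.split()]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

print(query(phrase_index, "text"))
# [('text editor', 12), ('text compression', 11), ('text retrieval', 10)]
```

Because every phrase comes from the collection, each returned entry is a refinement that is guaranteed to produce results, with its document count known in advance.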

3.2. Automatic keyphrase extraction

Keyphrases are words or short phrases, commonly used by authors of academic papers, that characterize and summarize the topics covered by a document. Keyphrases are most often noun phrases of between two and four words. For example, the keyphrases we have chosen for this article are digital libraries, browsing interfaces, text mining, keyphrase extraction, and machine learning. The index terms collected in "back-of-the-book" subject indexes also represent a kind of keyphrase. Keyphrases can be a suitable basis for a searchable index, since they represent the contents and subject matter of a document (e.g., Ref. [42]). In most document collections, however, author-specified keyphrases are not available for the documents; and it is laborious to manually determine and enter terms for each document in


a large collection [10]. A keyphrase index, therefore, is usually feasible only if index terms can be determined automatically.

Automatic determination of keyphrases generally works in one of two ways: meaningful phrases can either be extracted from the text of the document itself, or may be assigned from some other source (but again using the document text as input). The latter technique, keyphrase assignment, attempts to find descriptive phrases from a controlled vocabulary (e.g., MeSH terms) or from a set of terms that have been previously collected from a sample of the collection (e.g., Ref. [37]). Assignment is essentially a document classification problem, where a new document must be placed into one or more of the categories defined by the vocabulary phrases [27,34]. In general, the advantage of keyphrase assignment is that it can find keyphrases that do not appear in the text; the advantage of keyphrase extraction (the technique used in Keyphind) is that it does not require an external phrase source. We now consider keyphrase extraction in more detail.

There are two parts to the problem of extracting keyphrases: first, candidate phrases must be found in the document text, and second, the candidates must be evaluated as to whether they are likely to be good keyphrases. A variety of techniques have been proposed for each of these steps, as described below.

3.2.1. Finding candidate phrases

Since any document will contain many more multi-word sequences than keyphrases, the first step in extracting keyphrases is finding reasonable candidate phrases for further analysis. There are two main approaches to finding phrases in text: syntactic analysis and statistical analysis [15,41]. Syntactic analysis parses the text and reports phrases that match predetermined linguistic categories. Often, these categories define noun-phrase forms that can be determined using a part-of-speech tagger (e.g., Refs. [2,40]). More sophisticated analysis is also possible: for example, Strzalkowski et al. [42] look for head-modifier pairs, where the head is a central verb or noun, and the modifier is an adjunct argument of the head.

In contrast, finding phrases through statistical analysis does not consider the words' parts of speech. Instead, this approach calculates co-occurrence statistics for sequences of words; those words that appear frequently together are likely to indicate a semantically meaningful concept [41]. This technique generally uses a stop-word list to avoid finding phrases made up of function words (e.g., of, the, and, then). In its simplest form, statistical analysis simply finds word sequences that occur more frequently than some fixed threshold [41].
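The simplest form of this statistical approach can be sketched as bigram counting with a stop-word filter and a fixed threshold. The stop-word list and sample text below are invented for illustration:

```python
from collections import Counter

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP = {"of", "the", "and", "then", "a", "in"}

# Count contiguous two-word sequences, skipping any that contain a
# stop word, and keep those at or above a fixed frequency threshold.
def frequent_bigrams(text, threshold=2):
    words = text.lower().split()
    counts = Counter(
        (w1, w2) for w1, w2 in zip(words, words[1:])
        if w1 not in STOP and w2 not in STOP
    )
    return {" ".join(b) for b, n in counts.items() if n >= threshold}

text = ("digital libraries support browsing and "
        "digital libraries support searching in the collection")
print(frequent_bigrams(text))   # {'digital libraries', 'libraries support'}
```

Note that purely frequency-based selection also surfaces incidental sequences such as "libraries support", which is one motivation for the more selective ranking criteria discussed next.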

3.2.2. Evaluating candidate phrases as keyphrases

The second step in finding keyphrases involves determining whether a candidate word sequence is or is not a keyphrase. This process generally means ranking the candidates based on co-occurrence criteria and then selecting the top-ranked phrases. The techniques used are similar to those for term weighting and feature selection in vector-space models, where a few components of a multi-dimensional term vector are chosen to represent the document [18,26]. Two common criteria for ranking candidates are simple frequency (normalized by the length of the document), and TF×IDF measures, which compare the frequency of a phrase in the document with its rarity in general use. Other possible criteria include entropy measures [34], first occurrence in the document (see below), or appearance of the phrase in "important" sections of structured text (e.g., title tags in HTML documents). Once phrases are given a score based on these criteria, the best keyphrases are those candidates with the highest ranks.
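The TF×IDF criterion can be sketched as follows; the counts are invented, and this is one common formulation rather than the specific weighting any cited system uses:

```python
import math

# Score a candidate phrase by TF x IDF: its frequency within the
# document (normalized by document length) times the log-rarity of
# the phrase across the collection.
def tfidf(phrase_count, doc_length, docs_with_phrase, num_docs):
    tf = phrase_count / doc_length
    idf = math.log(num_docs / docs_with_phrase)
    return tf * idf

# A phrase appearing 5 times in a 1000-word document but in only
# 10 of 1000 collection documents outscores an equally frequent
# phrase that occurs in half the collection.
rare = tfidf(5, 1000, 10, 1000)
common = tfidf(5, 1000, 500, 1000)
print(rare > common)   # True
```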

However, one problem of evaluating candidates is that when several attributes contribute to a phrase's score, it is difficult to construct a simple model for combining the variables. Machine learning techniques can be used to build complex decision models that can take several attributes into account, and this is the approach used by Keyphind. Machine learning has been used for a variety of purposes in retrieval systems (e.g., Refs. [8,34]), but only one other project to our knowledge uses it for ranking keyphrases. This is the Extractor system [46], which uses a genetic algorithm to determine optimal variable weights from a set of training documents. In the current project, we use a technique that is faster and simpler than genetic algorithms, but provides equal performance. Our extraction process is described below.


Fig. 2. Kea keyphrase extraction process.

3.3. Keyphrase extraction in Keyphind

We have previously developed a system calledw x 6Kea 16 to extract keyphrases from text. It uses

machine learning techniques and a set of trainingdocuments to build a model of where keyphrasesappear in documents and what their distinguishingcharacteristics are. In particular, we treat extractionas a problem of supervised learning from examples,where the examples are documents in which everyphrase has been manually classified as a keyphraseor nonkeyphrase. This manual classification ofphrases is accomplished using the set of author-specified keyphrases for the document. The learningprocess results in a classification model, and thismodel is used to find keyphrases in new documents.

Like other approaches, Kea extracts keyphrases using the two steps described above: first identifying candidate phrases in a document's text, and then ranking the candidates based on whether they are likely to be keyphrases. However, we differ from previous work in that we rank candidates using a model built with the Naïve Bayes machine learning algorithm [25]. The steps involved in training and extraction are illustrated in Fig. 2; below, we describe how Kea finds and ranks phrases, and then discuss the construction of the Naïve Bayes model in more detail.

⁶ Several versions of Kea have been implemented; the version described in Ref. [16] differs from the present description in that it uses an extensive stopword list rather than a part-of-speech tagger to identify candidate phrases in a document.

3.3.1. Finding phrases

The process used to find candidate phrases for Keyphind is similar to that used in earlier projects (e.g., Refs. [30,40]). First, documents are cleaned and tokenised, and then tagged using the Brill part-of-speech tagger [4]. The tagger adds a part-of-speech indicator (e.g., /NN for noun, /VB for verb) to each word. Words are not currently stemmed, although we do fold plurals to singular forms (e.g., "libraries" to "library") and standardize words that have different spellings (e.g., "labor" to "labour"). Second, all phrases matching certain lexical patterns are reported. In particular, we accept any sequence of words consisting of a string of nouns and adjectives with a final noun or gerund. For example, "probabilistic machine learning" contains an adjective followed by a noun followed by a gerund. This pattern was proposed initially by Turney [46] and is similar to others used in the literature (e.g., Ref. [30]); in our own experiments with author-generated keyphrases, the pattern covered more than 90% of a test set of more than 1800 examples.
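The lexical pattern above can be sketched as a simple filter over part-of-speech-tagged tokens. This is an illustrative reconstruction, not Kea's actual code: the tag names (NN*, JJ*, VBG) follow the Penn Treebank set used by the Brill tagger, and the function name is our own.

```python
# Hedged sketch of the lexical filter: a candidate is any span whose
# interior words are nouns or adjectives and whose final word is a
# noun or gerund.

def candidate_phrases(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs for one sentence."""
    candidates = []
    n = len(tagged_tokens)
    for start in range(n):
        for end in range(start + 1, n + 1):
            words = [w for w, _ in tagged_tokens[start:end]]
            tags = [t for _, t in tagged_tokens[start:end]]
            interior_ok = all(t.startswith(("NN", "JJ")) for t in tags[:-1])
            final_ok = tags[-1].startswith("NN") or tags[-1] == "VBG"
            if interior_ok and final_ok:
                candidates.append(" ".join(words))
    return candidates

# "probabilistic machine learning": adjective, noun, gerund
sentence = [("probabilistic", "JJ"), ("machine", "NN"), ("learning", "VBG")]
print(candidate_phrases(sentence))
```

On the example sentence, the accepted spans include "machine learning" and "probabilistic machine learning", but not "probabilistic" alone, since an adjective cannot end a candidate.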

3.3.2. Building the Naïve Bayes ranking model

The machine learning technique that we use to build the ranking model requires a set of training documents where positive instances (that is, "true" keyphrases) have already been identified. For our training data, we used papers from the CSTR where the authors have already specified a set of keywords and phrases. Although the author's choices are not always the best examples of good keyphrases (for example, there are often good keyphrases in the text that were not chosen by the author), this method is simple and readily available.

The process of building the ranking model involves four preliminary steps. First, author keyphrases are removed from the training documents and stored in separate files. Second, candidate phrases are found in the training documents as described above (note that some author keyphrases do not appear in the document text). Third, we calculate values for three attributes of each phrase:
• distance (d) is the distance from the start of the document to the phrase's first appearance (normalized by document length)
• term frequency (tf) is the number of times the phrase appears in the document (normalized by document length)
• inverse document frequency (idf) is the phrase's frequency of use in the domain of the collection. This attribute is approximated using a random sample of 250 documents from the collection.
Term frequency and inverse document frequency are combined in a standard TF × IDF calculation [50]. These attributes were experimentally determined to be the most useful in characterising keyphrases [16]: that is, keyphrases are more likely to appear early in a document, to appear often in the document, and to appear less often in general use. Fourth, each candidate phrase is marked as a keyphrase or a nonkeyphrase, based on the author-specified keyphrases for that document.
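The attribute calculation can be sketched as follows. This is an illustrative reconstruction: the function and variable names are ours, and the smoothed idf expression is one common variant, since the text does not give the exact formula used.

```python
import math

# Hedged sketch of the distance and TF*IDF attributes described above.
# The idf estimate is an assumption (a common smoothed log variant).

def phrase_attributes(phrase, doc_words, doc_freq, sample_size):
    """Compute (distance, TF*IDF) for one candidate phrase.

    doc_words: the document as a list of word tokens;
    doc_freq: number of sampled documents containing the phrase;
    sample_size: size of the random sample (250 in the paper).
    """
    words = phrase.split()
    n = len(doc_words)
    positions = [i for i in range(n - len(words) + 1)
                 if doc_words[i:i + len(words)] == words]
    if not positions:
        return None  # phrase does not occur in this document
    distance = positions[0] / n          # first appearance, normalized
    tf = len(positions) / n              # occurrence count, normalized
    idf = math.log((sample_size + 1) / (doc_freq + 1))  # smoothed estimate
    return distance, tf * idf
```

A phrase that opens the document gets distance 0, and a phrase rare in the sample but frequent in the document gets a high TF × IDF value, matching the characterisation above.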

Once these preliminary steps are completed, the Naïve Bayes machine-learning algorithm compares attribute values for the known keyphrases with those for all the other candidates, and builds a model for determining whether or not a candidate is a keyphrase. The model predicts whether a candidate is or is not a keyphrase, using the values of the other features. The Naïve Bayes scheme learns two sets of numeric weights from the attribute values, one set applying to positive ("is a keyphrase") examples and the other to negative ("is not a keyphrase") instances.

3.3.3. Ranking candidate phrases using the Naïve Bayes model

To rank each candidate phrase that has been found in a new document (one where there are no author keyphrases), Kea first calculates values for the three attributes described above. The Naïve Bayes model built in the training phase is then used to determine a probability score, as follows. When the model is used on a candidate phrase with feature values t (TF × IDF) and d (distance of first occurrence), two quantities are computed:

P[t, d, yes] = P_TF×IDF[t | yes] · P_distance[d | yes] · P[yes]    (1)

and a similar expression for P[t, d, no], where P_TF×IDF[t | yes] is the proportion of phrases with value t among all keyphrases, P_distance[d | yes] is the proportion of phrases with value d among all keyphrases, and P[yes] is the proportion of keyphrases among all phrases (i.e., the prior probability that a candidate will be a keyphrase). The overall probability that the candidate phrase is a keyphrase can then be calculated:

p = P[t, d, yes] / (P[t, d, yes] + P[t, d, no]).    (2)

Candidate phrases are ranked according to this value. When all of the document's phrases are ranked, the top 12 candidates are selected and written to a file for later indexing.
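The two-class scoring above can be sketched as follows, assuming the attribute values have already been discretized into bins and the conditional probability tables were estimated during training. The function name and the toy numbers are illustrative, not taken from Kea.

```python
# Hedged sketch of Naive Bayes keyphrase scoring. All names and the
# toy probability tables below are our own illustrations.

def keyphrase_probability(t_bin, d_bin, p_t_given, p_d_given, prior_yes):
    """Return the probability that a candidate is a keyphrase.

    p_t_given / p_d_given map 'yes'/'no' to per-bin probability dicts;
    prior_yes is the proportion of keyphrases among all candidates.
    """
    # Joint scores for the 'yes' and 'no' classes
    p_yes = p_t_given["yes"][t_bin] * p_d_given["yes"][d_bin] * prior_yes
    p_no = p_t_given["no"][t_bin] * p_d_given["no"][d_bin] * (1 - prior_yes)
    # Normalize the two scores into a probability
    return p_yes / (p_yes + p_no)

# Toy tables (illustrative): keyphrases tend to have high TF*IDF and
# to appear early in the document.
p_t = {"yes": {"high": 0.7, "low": 0.3}, "no": {"high": 0.2, "low": 0.8}}
p_d = {"yes": {"early": 0.6, "late": 0.4}, "no": {"early": 0.3, "late": 0.7}}
score = keyphrase_probability("high", "early", p_t, p_d, prior_yes=0.001)
```

Because the prior heavily favours the negative class, even a well-supported candidate gets a small absolute probability; what matters for ranking is the relative ordering of candidates, not the absolute value.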

Further details of the extraction process and the Naïve Bayes scheme can be found in other articles (e.g., Ref. [16]).

3.3.4. Performance

We have evaluated the Kea algorithm in terms of the number of author-specified keyphrases that it correctly extracts from text. In our experiments,⁷ Kea's recall and precision were both approximately 0.23; therefore, it finds about one in four author-specified keyphrases. We also examined the question of how many keyphrases to extract, and considered both a rank cutoff and a probability cutoff. The experiments showed that selecting about 12 candidates usually includes all of the high-probability phrases (where "high" is empirically determined), without including too many poor phrases.

⁷ In these experiments, we used a combined measure of recall and precision called the F-measure.

For this research, we have examined Kea's output for about 20 CSTR documents. We estimate that an average of 9 of the 12 keyphrases extracted for each


document could be considered useful to an end-user (some of these were ungrammatical, but still understandable), and about 3 of 12 were unusable. An example of the choices, using this article as input, is shown in Table 2.

The performance of our algorithm represents the current state of the art for machine-learning approaches (see Ref. [16]); however, recall and precision are still quite low. There are several reasons for this. First, only about 80% of the author's keyphrases actually appear in the text of the document, and this forms an upper bound on the number of phrases that can be extracted by any algorithm. Second, attempting to match author phrases complicates the machine-learning problem because the data is hugely skewed towards negative instances. For every author-specified keyphrase in a document of the CSTR, there are about 1000 phrases that are not keyphrases. Third, extracting author keyphrases is a more demanding task than simply extracting "good" keyphrases (although it is much easier to evaluate); even two human indexers will rarely come up with exactly the same index terms for a document [17].

The primary problem of evaluating our technique using author-specified keyphrases is that the scheme does not assess the quality of those keyphrases that do not match author phrases. We make the assumption that maximizing performance on extraction of author keyphrases will also maximize the quality of all extracted phrases. However, we are currently investigating other assessment techniques: in particular, we are constructing a corpus of documents where human indexers have marked (and ranked) all phrases in the text that could be used as valid keyphrases.

Table 2
Example output from Kea

Phrases chosen by the authors    Phrases chosen by Kea
Browsing interfaces              1. Digital libraries
Digital libraries                2. Browsing in digital libraries
Text mining                      3. Keyphrase index
Keyphrase extraction             4. Libraries with keyphrase
Machine learning                 5. Digital libraries with keyphrase
                                 6. Browsing tasks
                                 7. Keyphrase extraction
                                 8. Usability study
                                 9. Browsing interfaces
                                 10. Search engines
                                 11. Document collections
                                 12. Query engines

Kea processed the collection of 26 432 documents in about 4 days (on a Pentium 233-MHz system running Linux); more than half of this time was used by the part-of-speech tagger. The resulting raw keyphrase file contains 309 908 keyphrases, about 6 Mb of text. More than half of the raw phrases are duplicates from some other document; as described below, the list of unique phrases contains only 145 788 entries.

3.4. Indexing

Once keyphrases are extracted from each document in the collection, Keyphind converts the raw keyphrase file into indexes that are used by the query interface. This process, illustrated in Fig. 3, produces four indexes: a phrase list, a word-to-phrase index, a phrase-to-document index, and a document-to-phrase index. The phrase list simply associates each unique phrase with a number that is used in the remaining indexes (for efficiency reasons); the other indexes are described below.

The word-to-phrase index lists all words in the document collection and indicates for each one the phrases that contain it. For example, "browsing" appears in "browsing technique," "collaborative browsing," "browsing interface," etc. This is the only index needed to process basic queries: the sets of phrases for each word in the query are found and then intersected to find the phrases containing all of the query terms.
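The basic query mechanism can be sketched directly from this description: look up the phrase set for every query word, then intersect the sets. This is an illustrative sketch; the function names are ours, not Keyphind's actual code.

```python
from collections import defaultdict

# Hedged sketch of the word-to-phrase index and basic query
# processing described above.

def build_word_to_phrase_index(phrases):
    """Map each word to the set of phrase numbers that contain it."""
    index = defaultdict(set)
    for num, phrase in enumerate(phrases):
        for word in phrase.split():
            index[word].add(num)
    return index

def query(index, phrases, terms):
    """Return, sorted, the phrases containing every query term."""
    sets = [index.get(word, set()) for word in terms.split()]
    if not sets:
        return []
    return sorted(phrases[n] for n in set.intersection(*sets))

phrases = ["browsing technique", "collaborative browsing",
           "browsing interface", "feature extraction"]
idx = build_word_to_phrase_index(phrases)
print(query(idx, phrases, "browsing"))  # the three phrases containing "browsing"
```

A multi-word query such as "browsing interface" intersects the two phrase sets and so returns only the phrases containing both words.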

The phrase-to-document index lists every phrase in the database and indicates which documents it was extracted from. Documents are represented by numbers, so the entry for "browsing technique" might show document numbers 113, 4024, 75 376, and 165 234. This index is used to retrieve the appropriate documents when a phrase is selected, and to calculate the co-occurring phrases for any selected phrase (see below).

The document-to-phrase index lists every document by number and indicates all phrases that were extracted from that document. There are 12 phrases for each document. The document-to-phrase index is used only in the calculation of co-occurring phrases (see below).

Fig. 3. Generation of Keyphind indexes.

Keyphind's indexes are small, and quick to create. It took less than 5 min to build the indexes for the 26 000-document example collection (on a Pentium Pro 200-MHz system running Linux); this speed, however, is dependent on being able to hold the entire phrase list in a hash table. More important than creation time, however, is index size, and the indexes are tiny in comparison to the source text. The file sizes for the four indexes created for the example collection are shown in Table 3. Since these indexes are derived from 1 Gb of source text, the keyphrase approach offers an index that is less than 1% of the size of the source. In comparison, standard full-text inverted indexes are typically larger than the source text, and even modern compressed indexes (e.g., Ref. [50]) are about one-tenth the size of the source. Of course, the small size of the indexes is only achieved because we discard all but 12 phrases for each document; but we believe that for some browsing tasks, indexing 12 meaningful phrases may be as good as indexing every word.

Keyphind stores all of these indexes in memory. The word-to-phrase index is stored as a hash table for quick look-up, and the other three are stored as sequential arrays. The memory space needed is about 20 Mb. The small size of these indexes, both in memory and on disk, means that keyphrase indexes can be downloaded to a networked client at the start of a browsing session, providing quick access and short response times.

Table 3
File sizes of Keyphind indexes

Index                       Number of items          File size
Phrase list                 145 788 unique phrases   2.98 Mb
Word-to-phrase index        35 705 unique words      2.80 Mb
Phrase-to-document index    145 788 entries          1.84 Mb
Document-to-phrase index    26 432 documents         1.79 Mb
Total                                                9.41 Mb

3.5. Presentation and user interaction

Keyphind's query interface is shown in Figs. 1 and 4. There are three activities that users can undertake. They can issue keyphrase queries, they can view and preview documents associated with a keyphrase, and they can work with co-occurring phrases related to their current selection. We describe each of these below.

3.5.1. Keyphrase queries

Being based on keyphrases rather than documents, Keyphind answers a slightly different question than does a full-text search engine. Instead of deciding which documents contain the user's query terms, Keyphind determines which keyphrases contain the query terms. When the user enters a word or topic and presses the "Search" button, the system consults the word-to-phrase index, retrieves those phrases that contain all of the query words, and displays them in the upper-left panel of the interface as the result of the query. In addition, the documents associated with each phrase in the result set are counted (using the phrase-to-document index) and the number is shown alongside the phrase. The result set can be sorted either by phrase or by number of documents. An example of a basic query on extraction is illustrated in Fig. 4a.


Fig. 4. Four types of interaction with Keyphind. (a) Basic query results. (b) Document previews and co-occurring phrases. (c) Document inspection window. (d) Co-occurrence filtering of previews.

3.5.2. Refining the query

The result set contains all keyphrases in the database that contain the query terms. Therefore, this list shows all possible ways that the user's query can be extended with additional words. Additional queries can be issued by double-clicking on phrases in the list.

3.5.3. Previewing and viewing documents

Once a keyphrase query has been made, the user can look at the documents associated with any of the phrases displayed. When the user selects one, Keyphind retrieves the first few lines of each associated document and displays these previews at the bottom of the interface (Fig. 4b). If the phrase of interest occurs in the document preview, it is highlighted in colored text. To look more closely at a document, clicking on the preview presents the full document text in a secondary window (Fig. 4c).

3.5.4. Related phrases and co-occurrence filtering

An additional capability for exploring the collection is based on the co-occurrence of keyphrases. When the user selects a phrase from the result list, additional phrases related to the user's selection are displayed in the upper-right panel of the interface. These related phrases can provide the user with new query ideas and new directions for exploring a topic. The scheme works as follows.

(1) The user selects a phrase from the result list, such as feature extraction (Fig. 4b).

(2) Feature extraction is a keyphrase for 21 documents; however, since we originally extracted 12 keyphrases from each text, each of the 21 documents is also associated with 11 other keyphrases. These are called "co-occurring phrases."

(3) Sometimes, a phrase will co-occur with feature extraction in more than one of the 21 documents, and Keyphind keeps track of this number.

(4) The system puts the co-occurring phrases into a list and displays them in the top right panel of the interface (Fig. 4b), along with the number of co-occurrences. For example, pattern recognition co-occurs with feature extraction in three documents, and image acquisition co-occurs twice.

The co-occurrence number is exactly the number of documents that would be returned if the user formed a query from both the target keyphrase AND the co-occurring phrase. Thus, related phrases can be used to filter the documents in the previews list. When the user selects a phrase in the co-occurrence list, the lower panel of the interface shows only those documents containing both keyphrases. This situation is illustrated in Fig. 4d, where the user has selected feature extraction and then clicked on pattern recognition in the co-occurrence list; the documents shown in the preview panel are the two that contain both feature extraction and pattern recognition as keyphrases.
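The co-occurrence calculation can be sketched by composing the phrase-to-document and document-to-phrase indexes. This is an illustrative sketch; the function and index names are ours, and the toy data merely echoes the example above.

```python
from collections import Counter

# Hedged sketch of the co-occurring-phrase calculation: follow the
# phrase-to-document index to the selected phrase's documents, then
# the document-to-phrase index to the other keyphrases of those
# documents, counting repeats.

def co_occurring(selected, phrase_to_docs, doc_to_phrases):
    counts = Counter()
    for doc in phrase_to_docs[selected]:
        for phrase in doc_to_phrases[doc]:
            if phrase != selected:
                counts[phrase] += 1
    return counts.most_common()

# Toy data echoing the example above: pattern recognition co-occurs
# with feature extraction in three documents, image acquisition in two.
phrase_to_docs = {"feature extraction": [1, 2, 3]}
doc_to_phrases = {
    1: ["feature extraction", "pattern recognition", "image acquisition"],
    2: ["feature extraction", "pattern recognition"],
    3: ["feature extraction", "pattern recognition", "image acquisition"],
}
print(co_occurring("feature extraction", phrase_to_docs, doc_to_phrases))
```

Each count is also the size of the corresponding AND query, which is why the numbers shown beside co-occurring phrases double as result-set predictions.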

3.6. Summary

By using keyphrases as the basic unit of indexing and presentation, Keyphind allows users to interact with a collection at a higher level than individual documents. This approach supports browsing tasks in four ways.

(1) Topical orientation. Queries return a list of keyphrases rather than documents, giving users an immediate idea of the range of topics that the collection covers. This is useful both in evaluating the collection and in exploring a subject area.

(2) Phrase-based clustering. Each keyphrase represents several documents, so users can see and work with a much larger portion of the collection than they could with document-based searching. In addition, the reason why documents have been clustered together is easily understood: they share a keyphrase.

(3) Query refinement. Since the returned phrases are almost always longer than the search string, the phrase list shows all possible ways that the query can be extended and still return a result. In addition, the co-occurring phrases provide a means for finding terms that are related to the user's query, and can be used to filter the query results, explore new directions, and help address the vocabulary problem (e.g., Ref. [7]).

(4) Results prediction. By showing how many documents there are for each phrase, Keyphind indicates the collection's depth and also what will be returned when the query is extended.

4. Evaluation: a usability study of Keyphind

We hypothesized that using keyphrases as the basis for indexing and clustering documents would support browsing in large collections better than conventional full-text search does. To test that hypothesis, and to determine the strengths and weaknesses of our approach, we conducted a small usability study in which people carried out browsing tasks with Keyphind and also with a traditional query interface. The goals of this study were to better understand how people use phrases in searching, to determine differences in people's search strategies when they used the two interfaces, and to make an initial assessment about whether working with keyphrases makes browsing easier. Tasks involved either evaluating the coverage of a collection in an area, or looking for documents relevant to a particular topic. From our observations and from people's comparisons, it was clear that search strategies differed considerably, and that most participants found the tasks to be easier with the Keyphind interface. The next sections outline our methods and summarize our results.

4.1. Method

Ten computer science students and researchers from the University of Waikato participated in the study as unpaid volunteers. Five of the participants used search engines a moderate amount (1–10 sessions per week) and five searched more frequently (more than 15 sessions per week). Participants were given a short training session on the two interfaces, where the capabilities of each system were explained and where the participant carried out a practice task. Once the training session was complete, the participant began their tasks.

4.1.1. Tasks

Participants completed three tasks using two query interfaces. Two tasks involved assessing the coverage of a collection in a particular area, and one task involved exploring the collection to find documents relevant to a topic. A participant carried out each task first with one interface (Keyphind or traditional) and then repeated the task with the other interface. Order was counterbalanced, so that half the participants started with each interface.

4.1.1.1. Coverage evaluation (tasks 1 and 2). Participants were given a task scenario that was intentionally underspecified, asking them to find subtopics of a topic area. For the first task, it was:

"You are a researcher who has been asked to find on-line resources for doing computer science research in planning. As one part of your evaluation, please determine three kinds of planning that are adequately represented in the current collection. Adequate representation means that there should be at least three documents that have some relation to the type of planning you are investigating."

For the second task, the topic area was computer graphics, and the instructions were otherwise the same. None of the participants considered themselves to be experts in planning research or computer graphics, so they did not begin either task with a clear idea of what they were seeking. This task was designed to determine how easily people could reach a high-level assessment of a collection.

4.1.1.2. Exploration task. Participants were asked to name a subject area from their current research or studies. They were then asked to find two articles in the collection that were relevant to that topic. This task was designed to find out how people explored an area, and to determine how easily they could find something of use to their work.

4.1.2. Data collection

One experimenter recorded observations about strategy use during the tasks. To assist observation, participants were asked to "think aloud" as they carried out the task. After each task, participants were asked to compare their experiences with the two interfaces (see questions 1–3 of Table 4). We did not ask these questions after the exploration task, since that task was recognized to be highly variable. At the end of the session, the experimenter conducted a short interview. Four specific questions were asked (see questions 4–7 of Table 4), and then the experimenter asked additional questions to explore particular incidents observed during the tasks.

4.1.3. Systems

Keyphind is illustrated above in Figs. 1 and 4; the traditional interface that we compared it to is shown in Fig. 5. This system is the standard WWW interface used with the NZDL (www.nzdl.org). Participants were allowed to change the search parameters in this interface (turning word stemming on or off, folding case, using boolean or ranked retrieval, phrase searching) as they wished. Search parameters in Keyphind, however, were fixed as described above.

Table 4
Questions asked after tasks and during interview

Questions asked after tasks:
1. Was it easier to carry out the task with one or the other of these systems?
2. If yes, which one?
3. If yes, was the task: slightly easier, somewhat easier, or much easier?

Interview questions:
4. What process did you use to find out about a topic with each system?
5. What were the major differences between the two systems?
6. Would you use a system like this one (Keyphind) in your work?
7. How would you use it?

4.2. Results

All but two participants were able to complete the tasks in both interfaces. In these two cases, technical problems prevented them from using the traditional interface. Our results are organized around two issues: first, which interface (if any) made the task easier; and second, how search strategies varied with the two interfaces.

4.2.1. Which interface made the tasks easier

We asked whether the first two tasks were easier with either interface, and if so, how much easier. We did not do a formal comparison for the third task, due to its open-ended nature. Participants' responses are shown in Table 5. Two participants were unable to compare the interfaces because of technical problems with the traditional interface.

Fig. 5. Traditional WWW query interface.

Table 5
Responses to comparison questions (cells indicate number of people)

Task                     The task was easier with:
                         Keyphind   Traditional   No difference   No answer
1 (Planning)             7          0             1               0
2 (Computer graphics)    3          0             4               1

Task                     How much easier:
                         Slightly   Moderately    Much            No answer
1 (Planning)             2          0             4               1
2 (Computer graphics)    1          0             2               0

Seven of eight people found the first task to be easier using the Keyphind interface, and three found the second task easier as well. No one found the tasks easier with the full-text interface, although several participants felt that there was no difference. When we considered the responses in terms of a participant's stated level of search-engine use, we did not find any clear distinctions between the moderate users (less than 10 searches per week) and the more frequent users (more than 15 searches per week).

4.2.2. How the interfaces affected strategy use

Several differences were observed in people's strategies, differences that appear to relate to the way that the two interfaces present information. Most people began the tasks in the same way, by entering a general query and then looking through the results, but from there, the two groups diverged. In the full-text interface, people looked for information by scanning the document previews returned from the general query; in Keyphind's interface, they looked through the list of returned phrases. For example, the first task asked people to find types of planning covered by the collection. Several participants reported that in the full-text interface, they looked for phrases containing "planning" and kept track in their heads of how often they saw each phrase. Although "planning" was emphasized in the previews, this strategy could be laborious, since the previews were not formatted and showed only the first 200 characters of the file. In contrast, all the phrases in Keyphind's result list contained planning, and so people moved immediately to investigating those phrases that looked reasonable and were associated with several documents. The initial results returned for both interfaces are shown in Fig. 6.

In the second collection-evaluation task, with the topic area computer graphics, strategies with the full-text interface were much the same as in the first. In Keyphind, however, the results of the initial query did not contain many phrases that were clearly subtopics of computer graphics, and the strategy of simply looking through the phrase list did not work as well. In addition, most of the documents were concentrated in one phrase, computer graphics itself (see Fig. 7). Some participants dealt with this situation by showing the previews for this phrase, and then scanning the previews much as they did in the full-text interface. Four people, however, used the co-occurring phrases panel to look for possible subtopics of computer graphics; and there were several reasonable phrases, such as ray tracing, volume rendering, and virtual reality (see Fig. 8).

The final task asked people to find two technical reports relevant to their area of research or study, and there were fewer strategy differences in this task. Participants began in the same way that they began the other tasks, by entering the research area as a query. In the Keyphind interface, people spent more time looking through document previews than in the other tasks, probably because the final task involved looking for documents rather than topics. Most participants used the following strategy: enter the research area as the initial query, choose a likely-looking phrase from the results list, and then scan through the document previews for that phrase.

Fig. 6. Results from the query planning in Keyphind (above) and traditional interface.

With the full-text interface, people once again carried out the task by looking through the document previews returned from the query. In this task, however, people were already familiar with the area, and so they already knew what subtopics they were looking for. That meant that when they did not find what they wanted, they (in most cases) knew of other search terms to try. Although people were on the whole successful in this task, there was some difficulty finding query terms that were appropriate to the ranked retrieval algorithm. For example, one participant entered computer and vision as query terms and was given a large set of documents where computer figured prominently, but that had nothing to do with computer vision. In most cases, changing the system's settings to do a string search for the phrase led to more relevant results.

Fig. 7. Results from the query computer graphics in Keyphind (above) and traditional interface (below).

Fig. 8. Most common phrases co-occurring with computer graphics in Keyphind (see upper right panel).

4.2.3. Other interview results

At the end of the session, we asked participants about the differences they saw between the two systems, and about how they might use a phrase-based system like Keyphind in the queries they performed for their research. The major differences seen by almost all of the participants were that Keyphind presented topics rather than documents, and that it was more difficult to explore an area by scanning through documents than by looking through a list of topics. However, three participants said that while Keyphind provided a faster and easier route to the subtopics of an area, they were less confident that the list of subtopics was complete than they were when they (laboriously) found subtopics by scanning document previews in the traditional interface. In addition, three participants said that they particularly liked being able to see how many documents would be returned from a further search.

When asked about how a phrase-based system could be used in their day-to-day searches, seven of the participants said that they would like to have a system like Keyphind available in addition to their normal search tools (although not as a replacement). Participants said that they would use such a system in situations where they were having difficulty with their traditional tools, or to "familiarize themselves with an area" when they first started to explore.

5. Discussion

The usability study provides evidence that a phrase-based query system can provide better support than traditional interfaces for browsing tasks, using an index that is far smaller than a full-text index. However, our experiences with Keyphind have suggested several issues in the keyphrase approach that must be considered in more detail. In this section, we look at two main areas: first, limitations caused by the structure and format of a keyphrase approach, and second, issues involved with phrase quality and how users react to an imperfect index.

5.1. User queries and automatically extracted keyphrases

Information retrieval literature suggests that using phrases assists some tasks and some queries, but does not greatly affect others (e.g., Refs. [30,32]). This seems to be the case with Keyphind as well, and the problem appears to lie in the fact that some kinds of user queries better fit the structure of the automatically extracted phrase hierarchy. Borman [3] notes that one problem in traditional query systems is that successful searching requires that users know about the underlying IR techniques; and although we believe the keyphrase approach provides a much more transparent model, it does still require that users "think in keyphrases" to a certain extent.

For example, Keyphind was seen as more useful in the first task of the usability study; one likely reason for this is that people's initial queries in the first task returned a broader range of topics. An initial query on planning results in several phrases that can be considered subtopics of planning (e.g., motion planning, case-based planning, collaborative planning); however, computer graphics does not produce such obvious good subtopics. In short, this difference arises because many useful phrases fit the pattern ⟨something⟩ planning, but fewer fit ⟨something⟩ computer graphics. This situation is caused in part by the shallowness of the phrase hierarchy, which reduces the richness of the clusters when using a two-word phrase as a query. Note, however, that the case of computer graphics is exactly the kind of situation where the co-occurring phrases can provide an alternate set of clusters.
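The ⟨something⟩ planning pattern amounts to finding keyphrases that contain the query as a proper sub-phrase. A minimal sketch of this kind of refinement lookup follows; the phrase list is invented for illustration and is not drawn from the actual Keyphind index.

```python
# Sketch of Keyphind-style query refinement: given a query, list the
# keyphrases that extend it (the <something> planning pattern).
# The phrase list is a made-up illustration, not the real index.

def extensions(query, keyphrases):
    """Return keyphrases that contain the query as a proper sub-phrase."""
    q = query.lower().split()
    out = []
    for phrase in keyphrases:
        words = phrase.lower().split()
        # look for q as a contiguous word sequence inside a longer phrase
        if len(words) > len(q) and any(
                words[i:i + len(q)] == q
                for i in range(len(words) - len(q) + 1)):
            out.append(phrase)
    return sorted(out)

index = ["motion planning", "case-based planning", "collaborative planning",
         "planning", "computer graphics", "path planning algorithms"]

print(extensions("planning", index))
```

As the discussion above suggests, a broad single-word query like planning yields several refinements, while a two-word query like computer graphics may yield none.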

A second problem in matching user queries to the index is the vocabulary problem. Keyphind works best when initial queries are general and short, such as planning; the user can peruse the range of possible extensions to the general query and choose those that are most applicable. If users begin with a more specific query, however, they may choose the 'wrong' phrase and so miss related topics. For example, path planning returns only seven phrases, even though there are many more phrases under the closely related motion planning. This problem could be addressed in part by using a thesaurus to automatically or interactively expand searches (although this would add another index; e.g., Ref. [9]).

Finally, some of the queries that people make to systems like Keyphind are legitimate topics, but are unlikely to appear as phrases. For example, there are undoubtedly reports relevant to evaluation of browsing interfaces in a 26 000-paper collection of CSTR, but this phrase is unlikely to appear in the index because it is unlikely to be extracted as a keyphrase. Keyphind will therefore not return anything for this query. There are several possible ways to reduce the problem, such as automatically performing multiple searches on parts of the query, but the fact remains that keyphrases only approximate a true human-built subject index. Having several search tools that can make up for each others' shortfalls may be the only way to truly solve the problem.

5.2. Phrase quality and completeness

A user of Keyphind is in constant interaction with the phrases that have been automatically extracted from documents in the collection. For a phrase-based approach to be considered useful by its users, people must believe that the system has done a reasonable job of extracting important phrases from documents without including too much garbage. Unfortunately, no fully automatic system can come close to the quality of a human indexer, and so users are going to encounter both missing phrases and poor phrases.

In our view, there are three issues that affect the quality of the phrase index: phrases may be linguistically unusable (e.g., garbage strings), phrases may not be good descriptors of the document they were extracted from, or the "good" keyphrases contained in a document may have been missed by the extractor. Each of these problems reduces recall in the results set; however, in our experience with Keyphind, recall has not been a problem. In general, users seem willing to overlook a certain number of nonrelevant documents (perhaps they have been well trained in this regard by traditional search engines). Second, in large collections, recall may not be as much of a problem as precision: there are often so many documents that enough relevant items will be returned for the user's purposes, even though many have been missed by the indexing system. This assumes that users will be willing to work around the incorrect results, as long as they can find a reasonable set of documents in a reasonable time.

However, in our experience with the NZDL, we have also found that some frequent users of traditional query systems prefer to carry out browsing tasks by looking over all of the documents and making up their own minds about the structure of the domain, even if this process is more laborious than having an automatically extracted set of phrases. Since Keyphind cannot always provide a perfect set of keyphrases, it is essential to give people easy access to the documents themselves, so that they are not forced to trust the system, but can draw their own conclusions from the raw data.

6. Comparison with previous work

6.1. Phrase-based systems

Phrases have been used in previous information-retrieval systems to improve indexing (e.g., Refs. [15,43]), to help refine queries (e.g., Refs. [1,37]), and to provide facilities for subject browsing (e.g., Refs. [10,28]). In addition, there are a variety of projects that have extracted different kinds of information from text, such as document summaries (e.g., Ref. [24]). In terms of the extraction of keyphrases, Keyphind is most similar to the Extractor system [46], which is fully automatic, and also approaches keyphrase extraction as a supervised learning problem (see Ref. [16] for a comparison with our extraction methods).

Keyphind differs from previous phrase-based systems in that it treats keyphrases as the fundamental unit of indexing and presentation, rather than an add-on to assist one function of a retrieval engine. This means that we use keyphrases as the only index rather than as a supplement to another index, and that we can provide retrieval, clustering, refinement, and results prediction with a single simple mechanism.

6.2. Document clustering

As described earlier, document clustering is often used in browsing interfaces as a means of providing a high-level view of a database. Keyphind differs from other clustering techniques in two ways. First, the mechanism for grouping documents together is the shared keyphrase, which is simple for users to understand, unlike a distance metric. It is easy to see why documents in Keyphind have been clustered together: they all share a common keyphrase. Second, each document in the collection is always part of 12 clusters, one for each extracted keyphrase. This affords a greater degree of overlap than is generally possible with other similarity measures. Since documents may often be about several different themes or topics, it is possible with keyphrase clustering to form an association between parts of a document, rather than demanding that two documents be similar in their entirety in order to be clustered together.
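The clustering mechanism described above is simple enough to sketch as an inverted index from keyphrase to documents, with each document appearing in one cluster per extracted keyphrase. The document-to-keyphrase assignments below are invented for illustration.

```python
# Sketch of keyphrase-based clustering: each document belongs to one
# cluster per extracted keyphrase, so clusters overlap naturally.
# The keyphrase assignments here are invented for illustration.
from collections import defaultdict

def build_clusters(doc_keyphrases):
    """Map each keyphrase to the set of documents it was extracted from."""
    clusters = defaultdict(set)
    for doc, phrases in doc_keyphrases.items():
        for phrase in phrases:
            clusters[phrase].add(doc)
    return clusters

docs = {
    "d1": ["motion planning", "mobile robots"],
    "d2": ["motion planning", "computer graphics"],
    "d3": ["computer graphics", "ray tracing"],
}
clusters = build_clusters(docs)
print(sorted(clusters["motion planning"]))  # documents sharing a keyphrase
```

Note how d2 falls into two clusters at once, which is the overlap that distance-based clustering generally cannot provide.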

Basing clusters on keyphrases, however, does have some drawbacks not encountered in other methods. First, since documents can fall into more than one cluster, it can be difficult to determine how many documents are actually represented by the phrases returned from a query. Second, the sizes of Keyphind's clusters are variable: a keyphrase may represent hundreds of documents, or only one. Clusters that are too large or too small are less useful to the user; in particular, having many clusters of size 1 is no better than having a list of documents. Therefore, we are considering strategies for splitting large clusters and grouping small ones. For example, we could show some of the phrases within a large cluster (either the extensions of the phrase, or its set of co-occurring phrases); small clusters could be grouped by collapsing component phrases, by folding phrase variants more extensively, or by grouping synonyms based on a thesaurus.
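The first of these strategies, splitting a large cluster by showing the longer phrases that extend its keyphrase, could be sketched as follows. The threshold and the data are hypothetical, and this is a sketch of the idea rather than the Keyphind implementation.

```python
# Sketch of one proposed strategy: split a large cluster by presenting
# the longer phrases that extend its keyphrase. Hypothetical threshold
# and data; not the actual Keyphind implementation.

MAX_CLUSTER = 3  # illustrative threshold for "too large"

def split_by_extension(phrase, clusters):
    """If a cluster is large, present sub-clusters keyed by extending phrases."""
    members = clusters.get(phrase, set())
    if len(members) <= MAX_CLUSTER:
        return {phrase: members}
    q = phrase.split()
    subs = {p: d for p, d in clusters.items()
            if len(p.split()) > len(q) and phrase in p}
    return subs or {phrase: members}

clusters = {
    "planning": {"d1", "d2", "d3", "d4", "d5"},
    "motion planning": {"d1", "d2"},
    "case-based planning": {"d3", "d4"},
}
print(split_by_extension("planning", clusters))
```

A small cluster is simply returned unchanged; only an oversize one is replaced by its extension sub-clusters, falling back to the original cluster when no extensions exist.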

6.3. Projects directly related to Keyphind

Finally, a note on Keyphind's lineage. Keyphind is one of several related projects dealing with the extraction and use of information from text. Kea, discussed earlier, is an ongoing project to extract keyphrases from text. Keyphind is also descended from an earlier system called Phind [31] that builds phrase hierarchies from the full text of a document. These hierarchies were not based on keyphrases, but on any repeating sequence in the text. The phrase indexes used in Keyphind have also given rise to two other systems: Phrasier [21], a text editor that automatically creates links from phrases in a user's text to documents in a collection, and Kniles (www.nzdl.org/Kea/Kniles), a system that links keyphrases in returned documents to others in the collection.

7. Conclusion

In this article, we described a search engine that supports browsing in large document collections and digital libraries. The Keyphind engine is based on a database of keyphrases that can be automatically extracted from text, and uses keyphrases both as the basic unit of indexing and as a way to organize results in the query interface. Keyphind's indexes are much smaller than standard full-text indexes, are efficient, and are relatively easy to build. Keyphind supports browsing in three main ways: by presenting keyphrases that indicate the range of topics in a collection, by clustering documents to show a larger portion of the collection at once, and by explicitly showing all of the ways a query can be extended. A usability study of Keyphind gives evidence that the phrase-based approach provides better support than full-text engines for some kinds of browsing tasks, evaluation of collections in particular.

8. Software availability

Keyphind is available from www.cs.usask.ca/faculty/gutwin/Keyphind/. Kea, the system used to extract keyphrases for Keyphind, is available from www.nzdl.org/Kea/.

Acknowledgements

This research was supported by the New Zealand Foundation for Research, Science, and Technology. Our thanks to the anonymous reviewers for their comments and suggestions.

References

[1] P. Anick, S. Vaithyanathan, Exploiting Clustering and Phrases for Context-Based Information Retrieval, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1997, pp. 314–323.
[2] N.A. Bennett, Q. He, C. Chang, B.R. Schatz, Concept extraction in the interspace prototype, Technical Report, Digital Library Initiative Project, University of Illinois at Urbana-Champaign. Currently available from http://www.canis.uiuc.edu/interspace/technical/canis-report-0001.html.
[3] C.L. Borman, Why are online catalogs still hard to use?, Journal of the American Society for Information Science 47 (7) (1996) 493–503.
[4] E. Brill, Some Advances in Rule-Based Part of Speech Tagging, Proceedings of the Twelfth National Conference on Artificial Intelligence, AAAI Press, 1994.
[5] S.J. Chang, R.E. Rice, Browsing: a multidimensional framework, in: M.E. Williams (Ed.), Annual Review of Information Science and Technology (ARIST), Vol. 28, 1993, pp. 231–276.
[6] H. Chen, A.L. Houston, Internet browsing and searching: user evaluations of category map and concept space techniques, Journal of the American Society for Information Science 49 (7) (1998) 582–603.
[7] H. Chen, T.D. Ng, A concept space approach to addressing the vocabulary problem in scientific information retrieval: an experiment on the worm community system, Journal of the American Society for Information Science 48 (1) (1997) 17–31.
[8] H. Chen, G. Shankaranarayanan, L. She, A machine learning approach to inductive query by examples: an experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing, Journal of the American Society for Information Science 49 (8) (1998) 693–705.
[9] H. Chen, J. Martinez, A. Kirchhoff, T.D. Ng, B.R. Schatz, Alleviating search uncertainty through concept associations: automatic indexing, co-occurrence analysis, and parallel computing, Journal of the American Society for Information Science 49 (3) (1998) 206–216.
[10] Y. Chung, W.M. Pottenger, B.R. Schatz, Automatic Subject Indexing Using an Associative Neural Network, Proceedings of the 3rd ACM International Conference on Digital Libraries (DL '98), ACM Press, 1998, pp. 59–68.
[11] D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey, Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections, Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1992, pp. 318–329.
[12] S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (6) (1990) 391–407.
[13] W.M. Detmer, E.H. Shortliffe, Using the internet to improve knowledge diffusion in medicine, Communications of the ACM 40 (8) (1997) 101–108.
[14] K. Doan, C. Plaisant, B. Shneiderman, T. Bruns, Query Previews for Networked Information Systems: a Case Study with NASA Environmental Data, SIGMOD Record, 26, No. 1, 1997.
[15] J. Fagan, Automatic Phrase Indexing for Document Retrieval: An Examination of Syntactic and Non-Syntactic Methods, Proceedings of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1987, pp. 91–101.
[16] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, C.G. Nevill-Manning, Domain-Specific Keyphrase Extraction, to appear in: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1999. Also available from: http://www.cs.waikato.ac.nz/~eibe/pubs/Z507.ps.gz.
[17] G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The vocabulary problem in human–system communication, Communications of the ACM 30 (11) (1987) 964–971.
[18] D. Harman, Automatic indexing, in: R. Fidel, T. Hahn, E.M. Rasmussen, P.J. Smith (Eds.), Challenges in Indexing Electronic Text and Images, ASIS Press, 1994, pp. 247–264.
[19] M.A. Hearst, C. Karadi, Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results Using a Large Category Hierarchy, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1997, pp. 246–255.
[20] S. Jones, Graphical Query Specification and Dynamic Result Previews for a Digital Library, Proceedings of the ACM Conference on User Interface Software and Technology (UIST '98), ACM Press, 1998, pp. 143–151.
[21] S. Jones, Link as you Type: Using Keyphrases for Automated Dynamic Link Generation, Department of Computer Science Working Paper 98/16, University of Waikato, New Zealand, August 1998.
[22] E. Kandogan, B. Shneiderman, Elastic Windows: A Hierarchical Multi-Window World-Wide Web Browser, Proceedings of the ACM Conference on User Interface Software and Technology (UIST '97), ACM Press, 1997, pp. 169–177.
[23] S. Kirsch, The Future of Internet Search: Infoseek's Experiences Searching the Internet, SIGIR Forum, 32, No. 2, 1998.
[24] J. Kupiec, J. Pedersen, F. Chen, A Trainable Document Summarizer, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1995, pp. 68–73.
[25] P. Langley, W. Iba, K. Thompson, An Analysis of Bayesian Classifiers, Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, 1992, pp. 223–228.
[26] A. Leouski, W.B. Croft, An evaluation of techniques for clustering search results, Technical Report IR-76, Center for Intelligent Information Retrieval, University of Massachusetts, 1996.
[27] C.H. Leung, W.K. Kan, A statistical learning approach to automatic indexing of controlled index terms, Journal of the American Society for Information Science 48 (3) (1997).
[28] X. Lin, Graphical Table of Contents, Proceedings of the 1st ACM International Conference on Digital Libraries (DL '96), ACM Press, 1996, pp. 45–53.
[29] G. Marchionini, Information Seeking in Electronic Environments, Cambridge Univ. Press, 1995.
[30] M. Mitra, C. Buckley, A. Singhal, C. Cardie, An Analysis of Statistical and Syntactic Phrases, Proceedings of the 5th RIAO Conference on Computer-Assisted Information Searching on the Internet, sponsored by the Centre De Hautes Etudes Internationales D'informatique Documentaire (CID), 1997.
[31] C. Nevill-Manning, I.H. Witten, G. Paynter, Browsing in Digital Libraries: a Phrase-Based Approach, Proceedings of the 2nd ACM International Conference on Digital Libraries (DL '97), ACM Press, 1997, pp. 230–236.
[32] R. Papka, J. Allan, Document Classification using Multiword Features, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM '98), ACM Press, 1998.
[33] P. Pirolli, P. Schank, M. Hearst, C. Diehl, Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection, Proceedings of ACM CHI 96 Conference on Human Factors in Computing Systems, Vol. 1, ACM Press, 1996, pp. 213–220.
[34] M. Sahami, S. Yusufali, M.Q.W. Baldonado, SONIA: A Service for Organizing Networked Information Autonomously, Proceedings of the 3rd ACM International Conference on Digital Libraries (DL '98), ACM Press, 1998, pp. 200–209.
[35] G. Salton, Automatic Text Processing, Addison-Wesley, 1989.
[36] G. Salton, J. Allan, C. Buckley, Automatic structuring and retrieval of large text files, Communications of the ACM 37 (2) (1994) 97–108.
[37] B.R. Schatz, E.H. Johnson, P.A. Cochrane, H. Chen, Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-Occurrence Lists for Information Retrieval, Proceedings of the 1st ACM International Conference on Digital Libraries, ACM Press, 1996, pp. 126–133.
[38] B.R. Schatz, Information Analysis in the Net: The Interspace of the 21st Century, White Paper for 'America in the Age of Information: A Forum on Federal Information and Communications R&D', National Library of Medicine, U.S. Committee on Information and Communications, 1995.
[39] B. Shneiderman, D. Byrd, W.B. Croft, Sorting out searching: a user interface framework for text searches, Communications of the ACM 41 (4) (1998) 65–98.
[40] A.F. Smeaton, F. Kelledy, User-Chosen Phrases in Interactive Query Formulation for Information Retrieval, Proceedings of The 20th BCS Colloquium on Information Retrieval (IRSG '98), Springer-Verlag, 1998.
[41] A.F. Smeaton, Information retrieval: still butting heads with natural language processing? in: M.T. Pazienza (Ed.), Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, Springer-Verlag Lecture Notes in Computer Science 1299, 1997, pp. 115–138.
[42] T. Strzalkowski, J. Perez-Carballo, M. Marinescu, Natural Language Information Retrieval in Digital Libraries, Proceedings of the 1st ACM International Conference on Digital Libraries (DL '96), ACM Press, 1996, pp. 117–125.
[43] T. Strzalkowski, F. Lin, Natural Language Information Retrieval, in NIST Special Publication 500-240: The Sixth Text Retrieval Conference (TREC-6), United States Department of Commerce, National Institute of Standards and Technology, 1997.
[44] R.C. Swan, J. Allan, Aspect Windows, 3-D Visualizations, and Indirect Comparisons of Information Retrieval Systems, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1998.
[45] A. Tombros, M. Sanderson, Advantages of Query Biased Summaries in Information Retrieval, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1998.
[46] P. Turney, Learning to Extract Keyphrases from Text, National Research Council Technical Report ERB-1057, 1999.
[47] B. Vélez, R. Weiss, M.A. Sheldon, D.K. Gifford, Fast and Effective Query Refinement, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1997.
[48] R. Weiss, B. Vélez, M.A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, D.K. Gifford, HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering, Proceedings of the Seventh ACM Conference on Hypertext, ACM Press, 1996.
[49] I.H. Witten, C. Nevill-Manning, R. McNab, S. Cunningham, A public library based on full-text retrieval, Communications of the ACM 41 (4) (1998).
[50] I.H. Witten, A. Moffat, T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand-Reinhold, New York, 1994.

Craig Nevill-Manning is an assistant professor of Computer Science at Rutgers, the State University of New Jersey. His research interests include interface and indexing technology for digital libraries, as well as bioinformatics and data compression. He received his doctorate from the University of Waikato in New Zealand in 1996, and spent 2 years as a post-doctoral fellow at Stanford University.

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Calgary, Canada, and a PhD in Electrical Engineering from Essex University, England. He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on machine learning, speech synthesis and signal processing, text compression, hypertext, and computer typography. He has written several books, the latest being Managing Gigabytes (1999) and Data Mining (forthcoming), both from Morgan Kaufmann.

Eibe Frank is a PhD student of Computer Science at The University of Waikato in New Zealand. He holds a degree in Computer Science from the University of Karlsruhe in Germany. His main research interests are machine learning techniques and their applications. He has published several articles on this topic in international journals and conferences, and co-authored the book Data Mining (forthcoming), published by Morgan Kaufmann.

Gordon W. Paynter is a graduate student in computer science at The University of Waikato in New Zealand, where he works with the New Zealand Digital Library research project. His research interests include programming by demonstration, machine learning, information retrieval and visualization, and human–computer interaction. He received a Bachelor of Computing and Mathematical Sciences from Waikato, where he is studying towards his PhD.

Carl Gutwin is an assistant professor of Computer Science at the University of Saskatchewan. His research interests include user interfaces for information retrieval systems, text mining techniques, information visualization, and the design and evaluation of real-time distributed groupware systems. He received BSc and MSc degrees from the University of Saskatchewan, and the PhD degree from the University of Calgary in 1997.