
Topic-Based Browsing Within a Digital Library Using Keyphrases

Steve Jones and Gordon Paynter
Department of Computer Science
The University of Waikato
Private Bag 3105
Hamilton
New Zealand

Tel: +64 7 838 4021

Email: {stevej,gwp}@cs.waikato.ac.nz

ABSTRACT
Many digital libraries are comprised of documents from disparate sources that are independent of the rest of the collection in which they reside. A user's ability to explore is severely curtailed when each document stands in isolation; there is no way to navigate to other, related, documents, or even to tell if such documents exist. We describe a method for automatically introducing topic-based links into documents to support browsing in digital libraries. Automatic keyphrase extraction is exploited to identify link anchors, and keyphrase-based similarity measures are used to select and rank destinations. Two implementations are described: one that applies these techniques to existing WWW-based digital library collections using standard HTML, and one that uses a wider range of interface techniques to provide more sophisticated linking capabilities. An evaluation shows that keyphrase-based similarity measures work as well as a popular full-text retrieval system for finding relevant destination documents.

Keywords: automated hypertext generation, keyphrase extraction, information retrieval, information exploration

1. INTRODUCTION
Serendipitous browsing is a widely used strategy for retrieving information from large collections of documents. For example, the core method of accessing information in the World Wide Web (WWW) is to navigate between documents via embedded hyperlinks. This method of information access is possible because authors develop their documents for the WWW and provide relevant links to other nodes in the document space. Unfortunately, the documents in many large digital library collections do not contain browsable links because they were never intended to belong to that collection, or to a hypertext at all.

This problem is regularly encountered by the New Zealand Digital Library (NZDL, http://www.nzdl.org) [18]. Users cannot navigate between documents that address similar topics because the collections have no evident structure, and lack explicit relationships between their constituent parts. Links to support navigation must therefore be introduced by other means.

This can be done manually, by asking human experts to identify similar documents and introduce links between them. There are two problems with this approach: it is time-consuming, so quickly becomes impracticable as the number of documents increases [11]; and people are inconsistent in their selection of link anchors and destinations, reducing the coherency of the resulting hypertext. Semi-automated (or supervised) techniques help to process larger numbers of documents, but ultimately suffer from the same problems [4]. A third approach, fully automated hypertext generation, holds more promise for large-scale digital libraries.

In this paper we describe two systems, Kniles and Phrasier, that automatically generate links to support browsing in NZDL collections. These systems are novel in the way that they identify link anchors and determine which documents are suitable destinations. Both systems use automatically identified keyphrases as link anchors and employ Information Retrieval (IR) [12] techniques using keyphrases to determine similarities between documents. They work with frequently changing document collections, but can be incrementally updated so that a minimum of effort is required to keep them up-to-date.

The paper is organised as follows: first we present an overview of automatic hypertext generation. The next section describes how we use automatically identified keyphrases to link documents. Following that, two systems for dynamic link generation, Kniles and Phrasier, are described, and we report on an evaluation of document linking using keyphrases. Finally we reflect on the design of the two systems and consider directions for further work.

2. AUTOMATED HYPERTEXT CONSTRUCTION
We are interested in two types of connection between documents: links between similar documents, and links from references to a topic within a document to other documents about that topic. The first supports high-level browsing where links can reflect several common topics; the second offers more precise control over the subject matter that links origin and destination documents.

2.1 Hypertext Nodes and Links
Automated hypertext construction entails two steps: dividing the source text into nodes, and creating links between nodes [2]. Nodes are the objects between which users will navigate, and can range from sets of documents to short excerpts such as paragraphs and sentences. We wish to support browsing between documents in our library collections, so the documents themselves are nodes.

Once nodes have been identified a range of techniques can be used to create links between them. Our focus is on the semantic links that Allan [2] terms equivalence links and describes as connecting "strongly-related discussions of the same topic".

2.2 Information retrieval techniques
Information retrieval (IR) similarity functions [12] have been shown to be effective in determining the statistical similarity between text passages. IR techniques have previously been used to create links in hypertext ([2], [5], [7] and [13], for example) and a useful summary of work in this area is provided by Agosti et al. [1].

Many IR techniques use a vector space model [19] to compare nodes and assess their similarity. Nodes are viewed as vectors of terms (usually single words), and the cosine of the angle between the vectors reflects their degree of similarity. A tf.idf value weights terms: they carry more weight if they appear frequently in a given document (the tf measure), but infrequently in the rest of the document collection (the idf measure). We adapt this approach by using vectors of keyphrases, rather than terms, to calculate node similarity.
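One common textbook form of the tf.idf weight and the cosine measure is given below as an illustration. The symbols are our own assumptions rather than notation from [19]: N is the number of documents in the collection, tf_{d,t} is the frequency of term t in document d, and df_t is the number of documents that contain t. The exact variant used by a particular retrieval system may differ in its normalisation.

```latex
w_{d,t} = \mathrm{tf}_{d,t} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\cos(d_1, d_2) =
  \frac{\sum_{t} w_{d_1,t}\, w_{d_2,t}}
       {\sqrt{\sum_{t} w_{d_1,t}^{2}}\;\sqrt{\sum_{t} w_{d_2,t}^{2}}}
```

A term that occurs in every document has df_t = N and therefore a weight of zero, so it contributes nothing to the similarity of any pair of nodes.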

2.3 Off-line vs on-line linking
The issue of when links are created is an important one. At one extreme a hypertext is fully constructed off-line and then made available to users in its entirety. At the other, links are added to the document text on-line, as each user navigates from node to node.

The off-line approach is attractive because the user incurs no cost of link generation when browsing the hypertext. However, pair-wise similarity comparisons must be carried out across the collection in advance. Whenever the collection is updated the term and phrase frequency statistics change, so new links between documents may be created, or existing links removed. In the worst case, the entire collection of documents will need to be amended when a single new document is added. Computing links at browse-time removes many of these overheads, but can increase the time that the system takes to respond to the user.

On-line linking is more flexible than off-line. It is possible to amend the structure and presentation of the hypertext to meet changing user requirements, and to update document collections without reconstructing the entire hypertext, an important consideration in large, rapidly changing collections. It also provides a basis for the insertion of links into documents that are not part of the original collection.

More computing resources are required to display the document when links are inserted on-line, but fewer are needed to store the collection, as hypertext versions of the original source documents are unnecessary. We have adopted on-line link creation.

3. KEYPHRASE EXTRACTION AND LINKING
Some documents (such as this one) contain key words and phrases specified by the author. They are often used to classify, cluster, summarise, and index documents. They have also been used in interfaces for iterative query refinement, web log analysis, and highlighting important phrases [16]. Keyphrases are succinct descriptions of important topics, and therefore make good candidate link anchors in a hypertext. Because they characterise document content, keyphrase-based IR techniques can be used to select link destinations.

3.1 Automatic Keyphrase Extraction
Unfortunately, not all documents contain author-specified keywords, and even in collections of scientific papers those with keywords are in the minority. For example, one collection of the NZDL is a mirror of the Human Computer Interaction (HCI) Bibliography¹ [10], which contains the bibliographic details of more than 15,000 published articles, fewer than a third of which contain keyword fields. Another, the Computer Science Technical Report collection, contains more than 40,000 documents amassed from electronic repositories around the world. The original format of these documents is PostScript, which can be filtered to produce plain text, but from which it is seldom possible to extract the author's keywords, even when they are provided.

Because of this lack of metadata, the NZDL project has developed Kea [6], a system that utilises machine learning techniques to automate keyphrase extraction. Kea produces a list of extracted keyphrases for each document in a given collection. Evaluation of Kea indicates that its performance rivals the current state-of-the-art [6], and it is the system that we have used here to generate keyphrases.

Once keyphrases have been extracted they are used to create keyphrase indexes [8] for a collection. For example, the keyphrase list associates an identifier with each keyphrase. A keyphrase to document index lists every extracted keyphrase with the set of documents from which it was extracted. A document to keyphrase index lists every document in the collection with the keyphrases extracted from each. Indexes of keyphrases require substantially less storage space than conventional full-text indexes.
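The three indexes can be pictured with a small in-memory sketch. This is only an illustration of the data structures named above, not the on-disk format used by the NZDL; the function name `build_keyphrase_indexes` and the shape of its input are assumptions made for the example.

```python
from collections import defaultdict


def build_keyphrase_indexes(extracted):
    """Build three illustrative indexes from extracted keyphrases.

    `extracted` is assumed to map a document id to the list of
    keyphrases that an extractor such as Kea produced for it.
    """
    phrase_ids = {}                    # keyphrase list: phrase -> identifier
    phrase_to_docs = defaultdict(set)  # keyphrase-to-document index
    doc_to_phrases = {}                # document-to-keyphrase index

    for doc_id, phrases in extracted.items():
        ids = []
        for phrase in phrases:
            # Reuse the identifier if the phrase has been seen before,
            # otherwise allocate the next free one.
            pid = phrase_ids.setdefault(phrase, len(phrase_ids))
            phrase_to_docs[pid].add(doc_id)
            ids.append(pid)
        doc_to_phrases[doc_id] = ids

    return phrase_ids, dict(phrase_to_docs), doc_to_phrases
```

Because each document contributes only a handful of keyphrases, structures of this kind stay small compared with an index over every word of the full text.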

3.2 Keyphrase-based Similarity Measure
We have implemented a keyphrase-based linking engine which uses the keyphrase indexes and a standard information retrieval similarity measure to compare documents for semantic similarity. The measure is the cosine measure used in the MG retrieval system [19] (p. 147) and the NZDL. It has been modified to use keyphrase frequencies in place of the term frequencies of full-text retrieval systems.

Documents are viewed as vectors (in n-dimensional space) of their component keyphrases, and the cosine of the angle between two vectors indicates the level of similarity of the corresponding documents. Keyphrases are weighted so that those that appear infrequently within the given document collection are stronger indicators of similarity than those that appear often.

In term-based measures the weight of a given word is based upon the number of documents in which it occurs. However, an aspect of keyphrases is that they are associated with only a subset of the documents in which they occur. Consequently, a keyphrase weighting scheme might be based on the number of documents to which a given phrase is assigned, rather than the number in which it appears. In a comparative study, described below, we employed both alternatives for keyphrase weighting.
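A minimal sketch of a keyphrase-based cosine measure follows. It assumes a simple tf × log(N/df) weight, which may differ from the exact MG formula, and the names `phrase_weights` and `cosine` are hypothetical. The two weighting alternatives described above correspond to the choice of `doc_freq`: the number of documents in which a phrase appears, or the number to which it is assigned.

```python
import math


def phrase_weights(phrase_counts, doc_freq, n_docs):
    """Weight each keyphrase of one document.

    phrase_counts: phrase -> frequency within the document.
    doc_freq: phrase -> number of documents counted for that phrase
              (documents containing it, or documents it is assigned to).
    n_docs: total number of documents in the collection.
    The tf * log(N / df) form is an assumption made for illustration.
    """
    return {
        phrase: tf * math.log(n_docs / doc_freq[phrase])
        for phrase, tf in phrase_counts.items()
        if doc_freq.get(phrase)  # skip phrases with no collection statistics
    }


def cosine(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

Swapping one `doc_freq` table for the other changes only the weights; the cosine computation itself is unchanged, which makes the two alternatives easy to compare.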

1 http://www.hcibib.org


3.3 Keyphrase Linking
For any given document, a list of related documents ranked by similarity is retrieved. Links can be created to those documents, constrained by thresholds for similarity values or the number of links required. This supports browsing between nodes that as a whole are related, but does not describe the aspects of the nodes that cause them to be similar. The user can navigate to documents that discuss similar topics, but has little control of what those topics are.

Topic-based browsing can be supported by providing anchors within or alongside a document that reflect the range of subjects that appear in the text. Keyphrases are suitable anchors because they provide succinct topic summaries and frequently occur in the text. Anchors might be selected individually to navigate to documents that address a particular topic, or in groups to form more complex queries involving multiple topics.

4. KNILES
Kniles is a World-Wide Web (WWW) based system that facilitates browsing between the documents that form a digital library collection.² It enables the reader of a document to quickly identify and access related material.

The Computer Science Technical Reports collection of the NZDL contains more than 40,000 technical reports from a diverse range of sources. The documents were originally in PostScript form, and contain neither metadata nor embedded hyperlinks. There is no way to browse the documents; they can only be accessed with a full-text search engine. When the user examines the text of a document retrieved by a search, they can follow a link to a version of the same document that has hypertext links added. This is the main entry-point to the Kniles interface.

Kniles allows users to browse the collection by inserting a hypertext link wherever keyphrases found within the collection as a whole occur in the document text. This is shown in the leftmost pane of Figure 1. The keyphrase anchors are coloured and underlined by the user's web browser. Keyphrases that are assigned to the current document are displayed in a bold font, while those that simply appear within it are in the regular weight font. The link destinations are lists of the documents to which the keyphrase has been assigned.
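The link insertion step might be sketched as follows. Kniles itself is written in Perl; this Python version is only an illustration, and the function name `insert_links`, the `/kniles?phrase=` URL scheme, and the assumption that the input is plain text with lower-case phrase sets are all hypothetical.

```python
import re
from urllib.parse import quote_plus


def insert_links(text, collection_phrases, assigned):
    """Wrap each occurrence of a collection keyphrase in an HTML anchor.

    text: plain document text (assumed to contain no markup).
    collection_phrases: lower-case keyphrases found anywhere in the collection.
    assigned: lower-case keyphrases assigned to this document (shown in bold).
    """
    if not collection_phrases:
        return text
    # Try longer phrases first so a long phrase wins over one it contains.
    ordered = sorted(collection_phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        phrase = match.group(1)
        key = phrase.lower()
        # Hypothetical URL: the destination is a list of documents for the phrase.
        anchor = '<a href="/kniles?phrase={}">{}</a>'.format(quote_plus(key), phrase)
        return "<b>{}</b>".format(anchor) if key in assigned else anchor

    return pattern.sub(link, text)
```

For example, `insert_links("A digital library needs a decision model.", {"digital library", "decision model"}, {"decision model"})` links both phrases and renders only "decision model" in bold.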

Not all automatically chosen keyphrases make suitable anchors. Kniles ignores single-word keyphrases; merges keyphrases with identical stems, like digital library and digital libraries; and eliminates ill-chosen phrases incorporating specific words, such as university, proceedings, and journal. Despite these filters, some poor phrases are still chosen, such as paper describes in the abstract of Figure 1. If the author has specified keywords for the paper, and these phrases can easily be detected and extracted, they are added to the automatically generated phrases assigned to the document.
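The three filters can be sketched in a few lines. The stemmer below is a deliberately crude stand-in (the stemming algorithm Kniles uses is not stated here), the stop-word set contains only the three words named above, and the names `crude_stem` and `filter_anchors` are hypothetical.

```python
# Words that disqualify a phrase; only the three examples from the text.
STOP_WORDS = {"university", "proceedings", "journal"}


def crude_stem(word):
    """Toy plural stripper, an assumption standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def filter_anchors(phrases):
    """Return stemmed phrase -> list of surface forms that survive the filters."""
    merged = {}
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < 2:
            continue  # ignore single-word keyphrases
        if any(word in STOP_WORDS for word in words):
            continue  # drop phrases containing a disqualifying word
        key = " ".join(crude_stem(word) for word in words)
        merged.setdefault(key, []).append(phrase)  # merge identical stems
    return merged
```

A filter of this kind is purely lexical, which is why a phrase such as paper describes can still slip through.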

Kniles displays a second frame containing a summary of the keyphrase anchors that have been added to the text (to the right of Figure 1). The upper portion lists the keyphrases that have been assigned to the current document by Kea or the author; the lower portion shows a list of related topics, that is, the keyphrases of other documents that occur within the current document. Two values are shown for each keyphrase: the frequency with which it occurs within the current document, and the number of other documents to which it has been assigned. The list can be sorted by either of these values. For example, in Figure 1 the keyphrase decision model has been assigned to the paper, occurs in it 36 times, and is a keyphrase of four papers (including the one shown).

2 http://www.nzdl.org/Kea/Kniles.html

When the user selects a phrase (from either pane) a new web page, containing a list of the documents for which the phrase is a keyphrase, is loaded. The documents are ranked by the keyphrase similarity measure. Selecting a document from the list loads it, with hyperlinks inserted, into the web browser. In this way users can browse topics and documents in the collection.

The architecture of Kniles is shown in Figure 2. Each collection is based on an NZDL collection, which is independently accessible through the WWW. When the user invokes Kniles from a standard NZDL document, a request for the same document is sent to the Kniles server, which retrieves the document text from the NZDL and uses the keyphrase indexes to insert hypertext links. The resulting Kniles document is displayed in the user's web browser. When the user selects a keyphrase anchor a keyphrase request is sent to the Kniles server, and a document list is created from the keyphrase indexes and displayed in the user's browser. Selecting a document from the list loads it, with links inserted, into the web browser.

Kniles is implemented in the Perl³ scripting language and communicates with the browser in standard HTML using the Common Gateway Interface (CGI) to the WWW. This has a core benefit of accessibility: it is easily integrated with the standard WWW interface to the NZDL and is available through almost any browser. However, there are several limitations to HTML. For instance, each link anchor can have only one destination, but keyphrases are often assigned to more than one document. For this reason we use an intermediate document list, rather than navigating directly to a relevant document. Further, it is not possible to select more than one anchor at a time and fully exploit the similarity measures, or to link arbitrary documents into the collection. We have developed a second system, Phrasier, to circumvent these limitations.

3 http://www.perl.org/

Figure 1: The Kniles user interface. To the left is the text pane and to the right is the link summary pane.

5. PHRASIER
Phrasier extends the basic functionality of Kniles (dynamic insertion of keyphrase-based links into documents and summary views of the keyphrases within a document) using the Computer Science Technical Report and HCI Bibliography collections. The user interface of Phrasier is implemented in Tcl/Tk⁴, a scripting language and interface toolkit, and consequently supports interaction techniques that are not possible with HTML. It connects over the Internet to a collection server and phrase-based retrieval engine, both of which are implemented in Perl.

Figure 3 shows the main Phrasier interface. To the left is the text pane where documents from the collection are viewed, in the middle is the keyphrase pane where lists of phrases occurring within the document are displayed, and to the right is the related documents pane where similar documents are listed. Any document from the user's filestore or the supported collections can be loaded into the text pane. The keyphrases that appear in the document text are immediately highlighted and listed in the keyphrase pane. The related documents pane, which is analogous to Kniles' document list page, is initially empty.

4 http://www.scriptics.com/

When the user selects a keyphrase, from either the keyphrase or the text pane, a ranked list of related documents, generated by the collection server, is displayed in the related documents pane. The documents are ranked according to the IR similarity measure and the user can control the number of displayed documents. Items within the related documents pane can be selected to open the corresponding document.

As with Kniles, keyphrase anchors are highlighted within the document text, but unlike Kniles each anchor can provide immediate access to several documents. These are displayed in the form of a two-level popup menu which provides a gloss for each destination document: some information about the document which allows users to evaluate the link without actually navigating to the destination [20]. The gloss, visible in the lower right foreground of Figure 3, gives the document title, author, date of publication and abstract when these items are available (as is the case for the HCI Bibliography, which is rich in metadata). Documents can be selected from the menu and displayed in another window.

Kniles uses bold type to indicate that a keyphrase was assigned to this document, rather than simply appearing in it. Phrasier extends this cue by displaying keyphrases in different levels of grey which reflect how representative they are of the viewed document, a technique used in XLibris [14] to highlight important sections of the text. The user can control the visual emphasis of keyphrases with respect to the rest of the document text.

Figure 2: The Kniles architecture. User access is through a standard WWW browser (to the left). The Kniles server (in the middle) retrieves document text from the NZDL (to the right) and inserts hyperlinks, generating HTML.

Phrasier lets the user construct queries of more than one phrase. Items in the keyphrase pane can be selected individually, in contiguous blocks, or in multiple disjoint blocks, as can be seen in Figure 3. Users can find documents that relate to segments of the document by selecting a range of text in the document pane; any keyphrases occurring in the selection are immediately selected in the keyphrase pane. The keyphrases are issued as a single query to the keyphrase-based retrieval engine, and the ranked list of documents returned is displayed in the related documents pane. Issuing the complete list of phrases as a query retrieves documents that are similar to the source document as a whole.

The document pane is also an editor into which documents can be typed or loaded from disk, and later saved. Text entered into the editor is analysed "on-the-fly", and keyphrase anchors are inserted as the user types. This process is illustrated in Figure 4, where a typed document is linked into the HCI Bibliography collection. Keyphrase identification and link insertion happen in real-time. In practice, over a local network, this happens quickly enough that it does not impact on the writing process. Insertion of links in this way allows related material to be uncovered as part of the writing process rather than a separate and distinct querying activity.

6. AN EVALUATION OF KEYPHRASE-BASED LINKING
Key to the utility of the two systems is the creation of quality links between documents in the digital library. We carried out a study, using human assessors, to evaluate the links generated by the keyphrase-based linking engine, comparing them to those generated using a standard full-text retrieval system.

We evaluated two versions of the phrase-based engine. The first (labeled P1) weighted phrases using the number of documents containing a keyphrase. The second (labeled P2) weighted them using the number of documents to which a keyphrase was allocated. These measures were compared to the NZDL full-text measure (labeled T1).

We envisage that Kniles and Phrasier will be used in the context of a focussed collection, often by users with expertise in the topic domain of the collection. We therefore required expert assessment of the appropriateness of generated links. Although sample test corpora (such as those produced by TREC [17]) are available, they proved to be unsuitable for our study.

6.1 Method
Each of six subjects (who were faculty members or doctoral students in university Computer Science departments) provided two research papers that they had recently authored. The papers were related to the research area of Human-Computer Interaction. Each paper was converted to a plain text file to remove formatting that did not contribute to the content of the paper, and the reference list was removed.

Figure 3: The Phrasier user interface. A keyphrase anchor has been selected from the document pane (to the left) and a document gloss is displayed (bottom right). Several keyphrases have been selected from the keyphrase pane (center) and the resulting list of related documents is shown in the related documents pane (top right).

The full set of keyphrases appearing in the document was submitted as a query to the phrase-based retrieval engine (for the HCI Bibliography collection), and a ranked list of similar documents was retrieved. The text of each paper (without references) was also submitted as a ranked, full-text query to the HCI Bibliography collection, which returned a ranked result list. The returned documents correspond to links created to related documents. The first ten documents from each ranked list were combined and duplicates removed to create a composite list.
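The construction of the composite list can be sketched as below. The function name `composite_list` is hypothetical, and preserving first-seen order is an assumption made for the example; the order in which documents were presented to subjects is not specified.

```python
def composite_list(ranked_lists, depth=10):
    """Combine the top `depth` items of each ranked list, dropping duplicates.

    ranked_lists: one ranked list of document ids per similarity measure.
    The result keeps the order in which documents are first encountered,
    which is an assumption of this sketch.
    """
    seen = set()
    combined = []
    for ranked in ranked_lists:
        for doc_id in ranked[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                combined.append(doc_id)
    return combined
```

With three measures and a depth of ten, the composite list holds between ten and thirty documents, depending on how much the individual lists overlap.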

We focussed on a small set of documents from the start of each list because usage studies of digital libraries and WWW-based search engines show that most users are unlikely to investigate beyond the first few items in a query result list [9], [15].

Each subject was presented with the composite list for each of their papers and asked to judge how relevant each document was to their original paper. Relevance was scored on a scale of 1 (irrelevant) to 7 (highly relevant). Each subject then scored the top ten list generated by each of the three measures on the same relevance scale.

6.2 Author relevance assessments
The sets of top ten documents returned by full-text and keyphrase retrieval mechanisms were quite distinct. On average, only one document was common to the two mechanisms, whereas there was a noticeable degree of overlap between P1 and P2: an average of more than 7 common documents.

Using the scores allocated to individual documents in the composite list, we calculated a mean returned document relevance score for each combination of measure and paper. There was no significant difference between the mean relevance scores for the measures (Friedman two-way analysis of variance by ranks, p=0.005). That is, subjects judged the different similarity measures to have returned lists whose documents, on average, were equally relevant.

This result is supported by the relevance scores assigned by the subjects to each top ten list as a whole. Although there was little overlap between the full-text and keyphrase based lists, there was no significant difference between the averages of the list ratings assigned by the author. Further, there was no significant difference between the mean number of documents assigned each relevance level (from 1 through 7) for each method.

The measure by which the methods vary significantly is the mean relevance score assigned to the documents in each of the positions in the list. These are significantly higher for the phrase-based measures than for the full-text measure, although there was no difference between the two phrase-based measures (see Figure 5). The perceived relevance tends to decrease from the beginning to the end of the top ten list for all measures, so it appears that they are loosely ranked in relevance order.

6.3 Reference list analysis

A simple second evaluation can be performed using the papers referenced in the source documents as a basis for evaluation. The references must be related to the source document, otherwise they would not have been cited in the text. The recall of referenced papers provides another basis for comparing the ability of the different similarity measures to find related documents.

Recall on the referenced papers that also occur within the collection is almost identical (approximately 25%) at 10 documents for all three methods. It remains similar at a range of thresholds up to 300 documents, over which T1 begins to perform marginally better than either P1 or P2. Recall at 500 documents (the maximum possible for the NZDL implementation of T1), is 58.8% for T1 and 52.9% for P1 and P2. None of the methods returned the full set of relevant reference documents from the collection.
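Recall at a threshold is the fraction of the referenced papers present in the collection that appear among the first k returned documents. A minimal sketch follows; the function name `recall_at_k` and the document identifiers are hypothetical, and the numbers are illustrative rather than taken from the study.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    found = relevant.intersection(ranked_ids[:k])
    return len(found) / len(relevant)


# Made-up ranking for one source paper; r1..r4 stand for referenced papers
# that also occur in the collection.
ranking = ["d7", "r1", "d2", "r3", "d9", "d4", "r2"]
referenced = ["r1", "r2", "r3", "r4"]

recall_at_3 = recall_at_k(ranking, referenced, 3)   # only r1 is found
recall_at_7 = recall_at_k(ranking, referenced, 7)   # r1, r2 and r3 are found
```

Because `r4` never appears in the ranking, recall stays below 1.0 at every threshold, mirroring the observation that no method returned the full set of referenced documents.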

Overall we see little difference in the performance of the two methods, either in recall of referenced documents or perceived relevance of returned documents. Within the phrase-based method there is no evidence of any difference between the two variations of the algorithm.

7. FUTURE WORK

Kniles and Phrasier insert semantic links between documents, but keyphrases can also be used as a source of structural information. Other systems, such as KeyPhind [8], have been developed that use keyphrases to provide structure, and we intend to add a subject index to Kniles using the extracted phrases.

Figure 4: The Phrasier architecture. The user interface is a Tcl/Tk client communicating with a server written in Perl.

Several extensions to Kniles are suggested by the innovations in Phrasier. Kniles only allows the user to select one link at a time. In the future we hope users will be able to use checkboxes to select a number of keyphrases and form more complex queries. Phrasier allows the user to load their own documents and integrate them into the collection; we might approximate this functionality by inserting links into web pages specified by the user. We will also investigate different means of providing a gloss for link anchors.

Further evaluation of the current version of each system is required. In a study that is under way we are testing how well variable keyphrase highlighting supports users in skimming documents and assessing their topic coverage. We will carry out usability evaluations of the two interface designs, and are currently evaluating the quality of the automatically generated links using techniques such as those suggested by Blustein et al. [3].

8. CONCLUSIONS

We have described how the independent documents that form a digital library collection can be linked to help users browse in a useful manner. Our solutions dynamically insert links based on keyphrases: link anchors are identified by the occurrence of keyphrases in the text, and link destinations are found using keyphrase-based similarity measures.
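The two steps summarised above, locating anchors and ranking destinations, can be sketched as follows. This is an illustration only and not the Kniles or Phrasier implementation: the function names are hypothetical, and destinations are ranked by a simple count of shared keyphrases, which is a stand-in assumption rather than the P1 or P2 measure evaluated earlier.

```python
import re


def find_anchors(text, keyphrases):
    """Return sorted (start, end, phrase) spans where keyphrases occur.

    Longer phrases are matched first so that a phrase nested inside an
    already matched span is not linked twice.
    """
    spans = []
    for phrase in sorted(keyphrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(match.start() < end and match.end() > start
                           for start, end, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), phrase))
    return sorted(spans)


def rank_destinations(source_id, keyphrases_by_doc, top_n=10):
    """Rank other documents by the number of keyphrases shared with the source."""
    source = set(keyphrases_by_doc[source_id])
    scored = []
    for doc_id, phrases in keyphrases_by_doc.items():
        if doc_id == source_id:
            continue
        shared = len(source & set(phrases))
        if shared:
            scored.append((shared, doc_id))
    # Most shared keyphrases first; ties broken by document id for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


# Made-up text and keyphrase index.
text = "Keyphrase extraction supports digital library browsing."
anchors = find_anchors(text, ["keyphrase extraction", "digital library", "library"])
index = {"a": ["x", "y", "z"], "b": ["x", "y"], "c": ["z"], "d": ["q"]}
destinations = rank_destinations("a", index)
```

Matching longer phrases first is a design choice of this sketch: it keeps "digital library" as a single anchor instead of also linking the nested word "library".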

Two implementations illustrate these techniques. Kniles works with on-line digital library collections, and is accessible over the WWW using a standard web browser. Phrasier, written in Tcl/Tk, supports more sophisticated interaction techniques, but loses immediacy of access, requiring software installation.

At the heart of our approach are automatically extracted keyphrases. We have explored the use of keyphrases in determining similarity between documents and the results of a user study are encouraging. Judgements from human assessors indicate that keyphrase-based retrieval performs as well as a standard full-text retrieval mechanism in returning lists of relevant documents.

Providing effective access to large collections of documents is critical to the success of digital libraries. We are enthusiastic about the potential of keyphrases to support topic-driven browsing and querying in ways which benefit both users and providers of digital library services.

9. ACKNOWLEDGEMENTS

Many thanks to Mark Staveley who administered the study and carried out initial analysis of the results. Many thanks also to Carl Gutwin for implementation of the first Phrasier prototype, and producing the keyphrase indexes originally used by Phrasier.

10. REFERENCES

[1] Agosti, M., Crestani, F., and Melucci, M. On the Use of Information Retrieval Techniques for the Automatic Construction of Hypertext. Information Processing and Management, 33, 2 (1997), 133-144.

[2] Allan, J. Building Hypertext Using Information Retrieval. Information Processing and Management, 33, 2 (1997), 145-159.

[3] Blustein, J., Webber, R.E., and Tague-Sutcliffe, J. Methods For Evaluating the Quality of Hypertext Links. Information Processing and Management, 33, 2 (1997), 255-271.

[4] Chignell, M.H., Nordhausen, B., Valdez, F. and Waterworth, J.A. The HEFTI Model of Text to Hypertext Conversion. Hypermedia, 3, 3 (1991), 187-205.

[5] Chua, T-S. and Choo, C-H. Automatic Generation and Refinement of Hypertext Links. The New Review of Hypermedia and Multimedia, 1 (1995), 41-66.

[6] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and Nevill-Manning, C.G. Domain-Specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999. Morgan Kaufmann Publishers, San Francisco, CA. In Press.

[7] Golovchinsky, G. What the Query Told the Link: the Integration of Hypertext and Information Retrieval. In Proceedings of ACM Hypertext'97 (Southampton, UK, April, 1997), ACM Press, 67-74.

[8] Gutwin, C., Paynter, G., Witten, I.H., Nevill-Manning, C.G., and Frank, E. Improving Browsing in Digital Libraries With Keyphrase Indexes. Technical Report, Department of Computer Science, University of Saskatchewan, Canada.

[9] Jones, S., Cunningham, S.J. and McNab, R.J. An Analysis of Usage of a Digital Library. In Proceedings of ECDL'98 Second European Conference on Digital Libraries, 1998. Springer, pp 261-277.

[10] Perlman, G. The HCI Bibliography Project. Sigchi Bulletin, 23, 3 (1991), 15-20.

[11] Robertson, J., Merkus, E., and Ginige, A. The Hypermedia Authoring Research Toolkit (HART). In Proceedings of ECHT'94, Edinburgh, UK (1994). ACM Press, pp 177-185.

[12] Salton, G. Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Publishing Co, Reading, MA. 1989.

[13] Salton, G., Singhal, A., Mitra, M., and Buckley, C. Automatic Text Structuring and Summarization. Information Processing and Management, 33, 2 (1997), 193-207.

[14] Schilit, B.N., Price, M.N. and Golovchinsky, G. Digital Library Information Appliances. In Proceedings of ACM Digital Libraries'98 (Pittsburgh, PA, USA, 1998), ACM Press, 217-226.

Figure 5: Mean relevance scores for each position in the top ten list (vertical axis: mean relevance score; horizontal axis: document position in top 10 list).

Position    1     2     3     4     5     6     7     8     9     10
full text   4.33  4.00  4.25  3.75  3.25  2.75  3.33  2.50  2.50  2.58
P1          5.25  4.58  4.92  3.92  4.08  4.58  4.50  3.42  3.08  2.92
P2          4.92  4.58  4.75  5.17  4.75  4.08  4.50  3.67  3.17  3.50


[15] Spink, A., Bateman, J., and Jansen, B.J. Searching Heterogeneous Collections on the Web: Behaviour of Excite Users. Information Research: an Electronic Journal, 4, 2 (1998).

[16] Turney, P. Learning to Extract Keyphrases From Text. Technical Report ERB-1057, National Research Council, Institute for Information Technology, Canada, 1998.

[17] Voorhees, E.M. and Harman, D.K. Proceedings of the Seventh Text Retrieval Conference (Gaithersburg, Maryland, November 9-11, 1998).

[18] Witten, I.H., McNab, R., Jones, S., Cunningham, S.J., Bainbridge, D., and Apperley, M. Managing Multiple Collection, Multiple Languages, and Multiple Media in a Distributed Digital Library. IEEE Computer, 32, 2 (1999), 74-79.

[19] Witten, I.H., Moffat, A. and Bell, T.C. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.

[20] Zellweger, P.T., Chang, B., and Mackinlay, J.D. Fluid Links for Informed and Incremental Link Transitions. In Proceedings of ACM Hypertext'98 (Pittsburgh, PA, USA, Jun 1998), 50-57.