0q000q000qq00ddd000000000iq000000qq00ddd80d0000000q00d0000...

8

Upload: others

Post on 08-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

0 Q 0 0 0 Q 0 0 0 Q Q 0 0 D D D 0 0 0 0 0 0 0 0 0 I Q 0 0 0 0 0 0 Q Q 0 0 D D D 8 0 D 0 0 0 0 0 0 0 Q 0 0 D 0 0 0 0 0 0 0 0 0 0 0 D 0 D Q D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Q 0 0 0 0 O D

Collaborative Filtering

Robin Jareaux ©1997 Artvilie, LLC

Locate, Comprehend, and Organize Collections of

~ b Sites 1 0 W i n t e r 1 9 9 8 * S l G A R T B u l l e ' r . i n

Page 2: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

Collaborative Filtering • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •

Loren 7brveen, Will Hill, and Brian Amento I

AT&T Labs - Research 180 Park Avenue

Florham Park, NJ 07932

{terveen, willhill, s_brian}@research.att.com

Introduction

F inding information is a com- mon and fundamental task for Internet users. Search engines

such as AltaVista and indices such as Yahoo? are the tools commonly used for this task. Users type in queries using keywords to indicate their inter- est or navigate through hierarchical directories of topics, eventually obtain- ing (usually quite large) collections of Web pages.

What then? How do users make sense of the results? No one can wade through hundreds or thousands of pages, and few people can even get through more than a handful. So how does one tell which are the best pages, or what kind of information each page contains, or whether all interesting pages are included? Furthermore, peo- ple who are truly interested in a topic usually want to boil it down to a man- ageable collection of perhaps 10 to 20 key sites and often want to share their results with others. Current tools pro- vide scant support for the process of exploring, comprehending, and orga- nizing collections.

Our research investigates algorithms and interfaces to meet these user needs. We have developed collaborative filter- ing 2 algorithms that construct collec- tions of topically relevant Web sites and

create site profiles that can be used to estimate the relevance, quality, and functionality of a site. We have devel- oped interfaces that make it easy for users to explore and organize collec- tions of sites and a data format that makes it simple to share collections.

The first system we developed is PHOAKS (People Helping One Another Know Stuff) [Hill and Terveen 1996, Terveen et al. 1997]. PHOAKS exploits the thousands of public discussions that take place daily on Usenet newsgroups. One of the things participants in these discussions do is discuss and recommend Web pages related to the newsgroup's topic. PHOAKS processes messages from 6,000 newsgroups and applies rules to identify recommended Web pages. The PHOAKS Web site (http://www. phoaks.com) contains more than 100,000 recommendations and associ- ated contextual information (who rec- ommended the Web pages and what they said about the pages). Whereas PHOAKS mines Usenet messages, our second system, TopicShop [Amento 1999, Terveen and Hill 1998a, Terveen and Hill 1998b], analyzes the structure of the Web itself. Starting from a set of initial "seed" pages for a topic (perhaps taken from PHOAKS), TopicShop uses a Web crawler to group individual pages

into sites (structured multimedia docu- ments), build site profiles, and discover new, topically related sites. TopicShop also includes several user interfaces for exploring and organizing sites.

Although PHOAKS and Topic- Shop are independent systems, the original motivation for TopicShop was to overcome some shortcomings of PHOAKS. Therefore, we begin by dis- cussing PHOAKS, first describing how it works, then examining the set of Web pages PHOAKS found for a par- ticular newsgroup, to highlight both the benefits and limits of its approach. We then will show how TopicShop addresses these limits, describe the TopicShop algorithm, illustrate how it improves on the results available from PHOAKS, and present the TopicShop interface.

PHOAKS The basic job of PHOAKS is to search messages for mentions of Web pages (URLs) and identify which mentions are recommendations. A URL men- tion counts as a recommendation if it passes a number of tests. First, the message must not be cross-posted to too many newsgroups. Messages post- ed to a large number of groups often are worthless (spam), and, even if they are not, they are usually so general that

1 Also affiliated with the Department of Computer Science, Virginia Polytechnic Institute.

2 The basic idea of collabotative filtering is that one person recommends items to another. Computational collaborative filtering techniques attempt to automate this "word-of-mouth" process.

. S l G A I ~ T B u l l e t i n • W i n t e r 1 9 9 8 11

Page 3: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Together, u~ l~,unc, t i ag£ ~ul i~m,

]- .............................................................................. * S~v.tt~ r~cposted.w~1~p~tbat ccctt~m my oftl~ ~ov~,amr~l~

Navigate Up: PI-IO.~.KS Home Page : Newsl~roup :Areas : re__.c, m t t c . titian

Frequent ly M e n t i o n e d 1Resources

Re~urce Title

i) Bob Dylan- Bob Links, Fall-& Winter lYE) ....

2) ]gob D¥1an Boot Database ~DBDB)

3) Bob l~vl~n- B N I t All Back Homepa~e

4)...Same :titles different locatioa...

5) Never p h ~ d iS) DYLAr~IOLO~W; ~ STUDY OF AN AID S RIDDI~,I EX,=

7) RooU, Routes m:u:l R.a.mbl.ir~

8) amb-la~zes

9) CDOutlets and Other ResotL~es

t o) Bob D ~ C h o ~ I 1) Frequently $~ked Questions

t21 cconow : r ~ : ~ - ~ , . e v ~ t 3) ..,Same. title., di~**nt .loca~on... t4) Bob Dylan- F..xpectinl~ Rain

53 L~zt, tile m~m,,.~ Ust directory t 6) A Cat's Bob Dytan Links

1"7) EDLIg ParRes & Catherh~s On Tour

Dis t inc t Ctigk on Ba.rs f o r P o U r s l W u a g e Cor~e,~(s) _

9 /HUl l l l

fi I I I I I

5 IIlU

~ m

4 g

3 II!

m

m 3 I i l

3 I l l

m

3. lU

m

I I I

18) bobdTlan.com 2_ B

Figure 1. FHOAK5 page for Usenet group rec.music.dglan.

they are not likely to be of much inter- est to any of the groups. Second, if the URL is part of the message signature, it is not counted as a recommendation. Third, if the URL occurs in a quoted section of a previous message, it is ruled out. Fourth, if the textual con- text surrounding the URL contains word markers that indicate that it is being recommended and does not con- tain markers that indicate that it is

being advertised or announced (i.e., self-promotion), then it is categorized as a recommendation. We have devel- oped rather complicated rules that implement this basic strategy.

We carried out experiments [Hill and Terveen 1996] that provided empirical evidence of the plausibility of our approach. In particular, we showed that

Usenet messages are a significant

source of recommendations of Web resources,

* Recommendations can be recog- nized automatically with nearly 90 percent accuracy, Some resources are recommended by more than one person, and

* The number of distinct recom- menders of a resource is a plausi- ble measure of the quality of the resource.

1 2 W i n t e r 1 9 9 8 ° S l G A R T B u l l e t i n

Page 4: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 1 shows the PHOAKS page for the newsgroup rec.music.dylan, which discusses the music of Bob Dylan. Recommended Web pages are ordered by the number of distinct rec- ommenders. By clicking on the appro- priate links, a user can browse to a rec- ommended page or explore the set of posters who recommended a particular page (and then go on to see what else they recommended) or the message context that surrounded the recom- mendation. This additional contextual information helps users judge the qual- ity of a recommendat ion and some- times indicates what a page is good for.

Following is a list of the first 20 rec- o m m e n d e d pages f rom tee.music. dylan, including comments that show both the utility and the limits o f harvesting Web pages from Usenet newsgroups. 1. Bob Dylan - Bob Links - Fall &

Winter 199... This is a very useful page fkom a very useful site; however, this is not the ~ont page of the site.

2. Bob Dylan Boot Database (BDBDB)

This is not the J~ont page of the site. 3. Bob Dylan - Bringing It All Back

Homepage 4 . . . . Same title different location...

This V~b site movedj~om one server to another, and messages mentioning both locations were posted to the newsgroup - this shows that di~rent URLs may re3~r to the same document.

5. Never played live 6. DYLANOLOGY: T H E S T U D Y

OF AN AIDS R I D D E N EX... This is a low-qualio~ low-credibility site; nothing prevents people j~om making poor recommendations (or even J~om making "~nti-recommen- dations'~ although our empirical studies showed that these are extremely rare).

7. Roots Routes and Ramblings What is this page about? Can you tell fkom just the title?

8. ambulances 9. C D Outlets and Other Resources

This is not the j~ont page of the site, and other pages j~om the same site are also included in the list.

10. Bob Dylan Chords 11. Frequently Asked Questions

This is not the fkont page of the site, and other pages j~om the same site are also included in the list.

12. C D n o w : Main : Homepage This page points us toward a related, more general resource; it is not about Bob Dylan, but outlets to purchase music are mentioned occasionally on all music-related groups.

Collaborative Fil ..r.i.n. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 ' ! 0 0 •

+

occur," these discussions don't concern

13 .... Same title different location...

This is a dif~rent URL 3~r the same document.

14. Bob Dylan - Expecting Rain This is an important site, with some unique content, yet i f rated simp~ by number of recommenders it does not make it very high in the ordering.

15. Liszt the mailing list directory Somewhat similar to the mention of CDnow; since the newsgroup discus- sion is carried out using Internet technology, mentions of and discus- sions about the technology sometimes

the newsgroup) topic at all. 16. A Cat's Bob Dylan Links 17. EDLIS Parties & Gatherings O n

Tour 18. bobdylan.com

This is the official site; it is very rich in content, some of it unique and very valuable, yet again, the num- ber-ofirecommenders metric did not rate it high~.

19. BOB DYLAN: A Portrait of the

Artist as a Y... 20. MP3 .com - The MP3 resource on

the Internet See note ~ r item 15.

Let us now articulate the issues we have encountered and identify a set of goals for creating more useful collec-

tions of sites. X For many purposes, the Web page

is the wrong basic unit. Individuals and organizations cre-

ate Web sites. Our goal is to group Web pages into sites, without los- ing sight of or obscuring the fact that certain pages within the site have been singled out as being of particular interest.

X Usenet newsgroups are a some- what noisy source of recommen- dations, especially for judging the quality of a Web page. That is, while several people indeed may have recommended a page (at least according to the PHOAKS classi- fication rules), the page may be of low quality or peripheral rele- vance. Conversely, some relevant,

high-quality sites are seldom or never recommended. This can occur for various reasons; for instance, if one person, such as the site maintainer or the newsgroup FAQ maintainer, posts the site regularly (recall that if someone recommends his or her own site, PHOAKS will classify it as a self- promotion and will not count it as a recommendation), no one else needs to. To overcome these prob- lems, we seek additional, more

S l G A I R T B u l l e t i n • W i n t e r 1 9 9 8 1 5

Page 5: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

` o o o • o o o o o o o o • • • o o o o o o o o o o o o o o • o o o o o o o o o o o o o o o o o o o o o o o o o o o o • o o o • o o o o • o o o o o o o o o o o o • • o o o o o o o o o o o

finely tuned quality metrics and techniques to discover new sites that are likely to be on the same topic as a given set of sites.

X Finally, page titles often do not communicate what a site is good for. We care about not only qual- ity, but also function. For exam- ple, bobdylan.com, the official Sony site, contains RealAudio versions of many otherwise unavailable live performances. "Bob Links" contains a huge number of links to other sites as well as tour dates and concert set lists and reviews. And "Roots Routes and Ramblings" contains information about Dylan's musi- cal roots and influences. Our goal is to construct profiles of site con- tent and structure that make it easy for users to determine not just their quality, but also their function.

These were the motivations that led us to develop the TopicShop system.

TopicSho p TopicShop consists of one main algo- rithmic component and several user interfaces:

÷ A Web crawler and site analyzer that follows links from a set of initial "seed" pages, heuristically groups pages into sites, builds profiles of site content and struc- ture, and discovers new sites that are structurally related (and thus likely to be topically related) to the seeds. Other systems with somewhat similar goals are described by Kleinberg [1998] and on the twURL Web site ( http ://www. twurl, corn);

÷ A Java applet that lets users inter- actively evoke and control the crawler/analyzer and offers incre- mental presentation of the results so users can immediately begin to explore sites;

÷ The TopicShop Auditorium Visualization, an interactive

graph visualization that lets users quickly identify topically central sites and investigate relationships between sites; and

÷ The TopicShop Explorer, a pro- gram modeled on Microsoft Windows Explorer, which pro- vides simple, yet powerful means for users to explore, browse, and share collections of sites.

We describe the details of our Web crawling algorithm elsewhere [Terveen and Hill 1998a]. However, the intu- ition is that the algorithm constructs a graph of sites that are "closely connect- ed" with the seed sites. Formally, it constructs the NK-elan graph, the set of all nodes such that each node is in an N-clan with at least K seeds. An N- clan [Scott 1991] is a graph in which (1) every node is connected to every other node by a path of length N or less, and (2) all of the connecting paths only go through nodes in the clan. We are interested primarily in 2-clans, that is, the 2K-clan graph. We showed that this algorithm tends to filter out "noise" in the seeds, that is, off-topic pages, and tends to bring in significant numbers of relevant sites.

As the crawler works, it heuristical- ly groups Web pages into sites. A site is an organized collection of pages on a specific topic maintained by a single person or group. Sites have structure, with pages that play certain roles (front door, table of contents, index). A site is not the same as a domain: for example, thousands of sites are hosted on http://www.geocities.com or http:// www.tripod.com. And what counts as a site may be context dependent. For example, if one is taking a survey of research labs, http://www, media. mit. edu might well be considered a site, whereas if one is investigating social fil- tering projects, sites of individual researchers sites hosted on http://www.media.mit.edu are probably the proper units.

The last observation suggested a way to operationalize the definition of a site. As the crawler/analyzer is constructing a collection, the relevant known context is the set of URLs that the expanded sites have linked to. The intuition is that if sites in the collection link to two URLs, one of which is in a directory that contains the other, then they are likely to be from the same site. More precisely, if URL A has been linked to and URL A/B has been linked to, then assume that A is the root page of the site and that A/B is an internal URL.

This rule applies recursively, so the URLs A/B/C, A/B, and A would be merged into a site with root page A and internal pages A/B and A/B/C.

In addition to this rule, our system uses a number of heuristics. General heuristics include the convention that the tilde (~) character indicates differ- ent user sites (therefore, http://www. research.att.com/~terveen and http:// www. research.att, corn~- willhill are usu- ally distinct sites). Site-specific heuris- tics encode knowledge that enables the identification of distinct user sites, hosted on large servers such as GeoCities or Tripod. O f course, the site aggregation rule and our heuristics can fail; therefore, we also have

1 4 W in te r 1 9 9 8 • ~ I G A I ~ , . T B u l l e t i n

Page 6: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

explored heuristics to detect when sites should be split or merged.

The crawler analyzes the content of pages it fetches in order to build pro- files of site content and structure. Profiles include the following data:

* Title (of the site's root page); * Thumbnail image (of the site's

root page); * Links to and from other sites; * List and count of internal HTML

pages, images, audio files, and

this can be seen as ordering sites by the number of recommendations they have received from their peers, that is, other individuals who have created Bob Dylan sites. (As we shall soon see, our interfaces make it easy to order sites using any of the profile attributes. Different purposes require different orderings; for example, if you were looking for sound clips from a favorite TV series, you could sort by the num- ber of audio files.) Following are the

• 0 0 0

questioned? Where did they end up in

Number of Rank by Site In-links PHOAKS Bob Dylan - Bob Links 19 Fall & Winter 1998 Tour Guide ,5 1 Friends Along The Way 3 The front page of the site was inferred. A seed page was found to belong to this site, and another internal page of the site was linked to by multiple other sites, so it too is included.

Sidewalk 13 Web J-Cards 5 Things Twice 4 Bob Dylan Boot Database 2 Similar to Item 1, the front page was inferred, a seed page was placed under this site, and several other internal pages of the site were discovered.

bobdylan.com 12 18 This important site moves close to the top of the ordering. In addition, the analyzer discovered it was rich in content, particularly in audio files, so it wiJl show up at the top of other orderings.

Expecting Rain 11 14 Seems Like A Freeze Out 10 The Bob Dylan Web Ring 7

Bringing It All Back Homepage 7 3 Olof Bj6rner's Yearly Chronicles 3

Bread Crumb Sins 7 ISIS Home Page 4

Cowboy Angel Sings 7 COME WRITERS AND CRITICS 7 Bill Parr's Christianity and Bob Dylan Page 7

top 10 sites, with annotations that explain what the crawler/analyzer has done for us.

Five of the last six sites in the list were discovered by the crawler. They contain useful information such as lists of cover songs that Dylan has per- formed and essays exploring the role of Christianity in Dylan's writing and performance.

Finally, what about the items in the PHOAKS top 20 whose presence we

movie files; and * Count of user-specified or

domain-specific keywords. Let us now consider the results of

applying our algorithm to the PHOAKS pages for rec.music.dylan. In the following list, we order the sites within the collection by the number of in-links, that is, the number of other sites within the collection that point to them. Recall that we view links as a form of implicit recommendation, so

the ordering? CDnow (Number 12) dropped into the 40s, and DYLANOLOGY (Number 6), "Liszt" (Number 15) and MP3.com (Number 20) fell into the 70s.

Although a simple list of sites, ranked by number of in-links, looks impressive, it still does not meet all users' needs. We designed the TopicShop Explorer (see Figure 2) to support users in exploring, organizing, and sharing collections of sites. It offers two main views. The details view shows site profile information, allowing users to quickly and easily explore and order sites. The icons view allows users to arrange icons spatially, quickly organiz- ing and grouping them even before explicit categories emerge. The views are linked, so selections and operations in one view are reflected in the other. We will explain the user interface further by detailing three main design goals.

The TopicShop Explorer has three main design goals. 1. Make relevant but invisible inj~r-

mation visible. We hypothesize that making site profile information visible will significantly inform users in evaluating a collection of sites. No longer must they decide to visit sites--a time-consuming process--solely on the basis of titles and, sometimes, brief textual annotations. (A chief complaint of subjects in a study by Abrams et al. [1998] was that titles were inade- quate descriptors of site content-- and that was for sites that users already had browsed and decided to bookmark.) Instead, users can choose to visit only sites that have been endorsed (linked to) by many other sites or sites that are rich in a particular type of content (e.g., images or audio files). In addition to site profile data, the thumbnail images also are quite useful; most notably, for sites a user has visited, thumbnails are an effective visual identifier for sites.

~ I G A R T B u l l e t i n • Win te r 1998 15

Page 7: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

Bob Dylan ... Cowboy Ang...

Bill Pair'... Bob Dylan

Rebel with... Never pl~...

Bringing f...

i El'"__] topic d{.,,'~J i i ! ! F ~ a c t i i i i i J act i i i i i -~a"~' { i -"--.I ani i i i i ! -Jan~ i i i i F . j a o l ~n,oe~oe... i i i ! F-~AT i i i i i - ' " J atla i '~ i i i - j b u f i i i i i-Jd~ | ! i i i i . j a y L i ! i i i ! i ~ i i ~- - . -Jeai i

i i i i i-'L__lhig i ~ i i i i i . . . .;jHa-~ i i i i !--.-Jluc Bob Dflan... i i i i i-...J~a i i i i i-':_Jpa' i i i i i-. phi i i i i i--~_.l pu_c i i i i F . j r o a i i i i i-.-.:jso~

Bob Dylan ....

m Bob Dylan ... Bob D~lan ...

=__Pm] Seems Like A Freeze :0 ut 10 1S 14 0 0 ! j Bill PaWs Ch,i~tianity and Bob Dyla.. 7 10 14 0 0 F~ • Beb Dylan- Breed Crumb Sins Bob... 7 26 35 i 0

Bringing It All Back Homepage 7 36 34 1 i ~ COME WRITERS'AND CR1TICS 7 7 29 0 0

Cowb~ Angel Sings 7 3 :8 O 0 [ ] Bob Dylan Chords 6 O 13 0 0 ~Addicted To Noise- 4.08-August... 5 0 86 38 0 [ ] BOLi Dylan 5 1 26 = 86 0

4 0 O: 18 0 E{,~nal Circle 4 8 4 0 0

• Rebel with a Cause/Dylan 7' slee... 4 6 28 0 ~3 Ron Chesler 4 13 14 0 0

• The Ivtany Faces of Bob Dylan 4 O 8 0 0 TWENTY POUNDS 0FHEADLIN... 4 41 44 0 O

[ ] Absolutely Cynthia Marie's Bob Dyl... 3 1 3 11 0 []ANOTHER SITE O F BOB DYLAN 3 6 62 0 0

• -BOB DYLAN PINATA BOOTLEGS 3 2 111 0 0 Bol~ D dan Ticket Sales 3 2 50 0 0 Bob D~larc Tangled Up ~ Jews 3 7 17 0 O

[ ] No title 3 0 3 0 0 Rebel with a Cause/Dylan T' slee... 3 1 101 0 0

[~I Some of Bob Dylan's.Work~ 3 0 2 0 0 [~[] The Unofficial Bob Dylan Free Tap... 3 1 1 O 0 [ ] "You Know You're a Dylan Fa n If.._7 2 1 O" 0 0 [[[[[]AIternate'titles for Bob Dylan songs... 2 O 1 0 0

I i ! ! i !-Jsta !1 Ii[]BOS0 N 2 0 2 0 0

Figure 2. TopicShop Explorer. In the details view (right), each Web site is represented by a small thumbnail image, the site title, properties of the site itself (number of internal pages, images, and audio files) and the site's relations to other sites (number of in-links and out-links). By clicking on a

column, users can sort by the appropriate property. In the icons view (center), each site is represented by a large thumbnail image and the site title. Users can organize sites by arranging them spatially, a technique especially useful in the early stages of exploration.

2. Make it simple J~r users to explore and organize resources. In the details view, users can sort resources by any of the properties (e.g., columns showing the number of in-links, out-links, images, etc.) simply by clicking on the label at the top of the column. Users also can "zoom in" on the internal structure of a site, getting a details view of inter- nal pages of the site (analogous to the view of the sites within a col- lection shown here). For example, this can take users directly to the links page of a site or to a page that contains audio samples or occur- rences of interesting phrases. Double-clicking on a site will send

the user's default Web browser to that site.

Users can organize resources both spatially (in the icons view) and by creating subfolders and moving resources into the subfolders. Nardi and Barreau [1995] found that users of graphical file systems preferred spatial location as a tech- nique for organizing their files. We believe that spatial organization is particularly useful early in the exploration process while users are still discovering important distinc- tions among resources and user- defined categories have not yet explicitly emerged. As categories become explicit, users can create

folders to contain sites in each of the categories.

3. Integrate the task of managing collec- tions of sites into a user~ normal computing and communications environment. The TopicShop Explorer is based on Microsoft Windows Explorer. This enables Windows users to apply their exist- ing knowledge, resulting in little learning time and high ease of use. Further, the data format for repre- senting sites makes it very easy for collections to be shared. All data for a site are stored in a file, and a collection is simply a folder of these files. Thus collections can be shared in all the normal ways files

1 6 Win te r 1998 • S I G A I R T B u l l e t i n

Page 8: 0Q000Q000QQ00DDD000000000IQ000000QQ00DDD80D0000000Q00D0000 …files.grouplens.org/papers/p10-terveen.pdfity of a recommendation and some- times indicates what a page is good for. Following

o e o o o o o o o o o e o o o o o e e o o o o o o o o o o o o o o

are shared, for example, by being zipped and e-mailed.

We conducted an empirical s tudy compar ing user per formance with TopicShop vs. Yahoo. TopicShop subjects found more than 80 percent more high- quality sites (where quality was deter- mined by independent expert judgments) while browsing only 81 percent as many

sites and completing their task in 89 per- cent of the time. The site profile data that TopicShop provides-- in particular, the number of pages on a site and the num- ber of other sites that link to i t - -were the key to these results, as users exploited it to identify the most promising sites quickly

and easily.

Conclusions Making sense of collections of Web sites is an increasingly important problem. We attacked this problem in two ways. First, we developed algorithms that construct collections of topically related sites and that create profiles of site content and

Collaborative Fiiterinfl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

structure. Second, we created interfaces that make it easy for users to explore and organize sites. We built two systems, PHOAKS and TopicShop, that imple- ment this methodology. We conducted empirical studies showing the utility of both systems. PHOAKS has been avail- able on the Web for several years and

draws about 5,000 visitors a day. TopicShop was developed more recently, and we are moving ahead with plans to

also deploy it on the Web.

References Abrams, D., R. Baecker, and M. Chignell. 1998.

Information Archiving with Bookmarks: Personal Web Space Construction and Organization. In Proceedings of CHIT8 (Los Angeles, April 1998), ACM Press, 41-48.

Amento, B., W. Hill, L. Terveen, D. Hix, and P. Ju. 1999. An Empirical Evaluation of User Interfaces for Topic Management of Web Sites.

To appear in Proceedings of CHITg.

Hill, W.C., and L.G. Terveen. 1996. Using Frequency-of-Mention in Public Conversations

for Social Filtering. In Proceedings of CSCW~6

(Boston, November 1996), ACM Press, 106-112. Kleinberg, J.M. Authoritative Sources in a

Hyperlinked Environment. 1998. In Proceedings

of 1998 ACM-SIAM Symposium on Discrete Algorithms. (San Francisco, CA, January 1998),

ACM Press. Nardi, B., and D. Barreau. 1995. Finding and

Reminding: File Organization from the

Desktop. ACM SIGCHI Bulletin 27:3 (July). Scott, J. 1991. Social Network Ana~sis: A

Handbook. SAGE Publications, London. Terveen, L.G., W.C. Hill, B. Amento, D. McDonald,

and J. Creter. 1997. Building Task-Specific Interfaces to High Volume Conversational Data.

In Proceedings of CHIT7 (Atlanta, March 1997), ACM Press, 226-233.

Terveen, L.G., and W.C. Hill. 1998a. Finding and Visualizing Inter-site Clan Graphs. In Proceedings of CHIT8 (Los Angeles, April

1998), ACM Press, 448-455. Terveen, L.G., and W.C. Hill. 1998b. Evaluating

Emergent Collaboration on the Web. In Proceedings of CSCW~8 (Seattle, November

1998), ACM Press. • PERMISSION TO MAKE DIGITAL OR HARD COHES OF ALL OR PART OF THIS WORK FOR PERSONAL OR CLASSROOM USE IS GRANTED WITHOUT FEE I'ROVIDED THAT COPIES ARE NOT MADE OR DISTRIBUTED FOR PROFIT OR

COMMERCIAL ADVANTAGE AND THAT COPIES BEAR THIS NOTICE AND THE FULL CITATION ON THE FIRST PAGE. T O COPY OTHERWISE, TO REPUBLISH, TO POST ON SERVERS OR TO REDIST"RIBUTE TO LISTS, REQUIRES PRIOR

SPECIFIC PERMISSION AND/OR A FEE. Q A C M 0153-4830/98/1200 $ 3 . 0 0

On-The-Spot Interviews With Hiring Managers From The Nation's Premier Companies Chicago: Monday, December 7 Boston:

Washington DC:

Baltimore:

New Jersey:

Tuesday, December 8

Wednesday, January 6 Thursday, January 7

Wednesday, January 13

Monday, January 11 Tuesday, January 12

Monday, January 11 Tuesday, January 12 Wednesday, January 13

Connecticut Wednesday, January 13

San Francisco Monday, January 18

San Jose Wednesday, January 20

San Ramon Thursday, January 21

NO COST OR OBLIGATION TO ATTEND. To receive Priority Access Services including advance distribution of your resume to participating companies, or if you are unable to attend, send your resume (Attn: 12ACM) ON-LINE to the address at right, or by FAX to 757-437-2243, or by MAIL to the address below. Also, please indicate the geographic locations in which you are interested and the type of position you seek. For more info, call 888-765-4473 and ask for Candidate Assistance or visit us at www.lendman.com. Produced by: The Lendman Group, 1081 19th Street, Suite 100, Virginia Beach, VA 23451

CAREER SERVICES T H E L E N D M A N G R O U P