
Enterprise and Desktop Search
Lecture 2: Searching the Enterprise Web

Pavel Dmitriev, Yahoo! Labs, Sunnyvale, CA, USA
Pavel Serdyukov, University of Twente, Netherlands
Sergey Chernov, L3S Research Center, Hannover, Germany

Outline

•  Searching the Enterprise Web
   –  What works and what doesn't (Fagin03, Hawking04)
•  User Feedback in Enterprise Web Search
   –  Explicit vs. Implicit feedback (Joachims02, Radlinski05)
   –  User Annotations (Dmitriev06, Poblete08, Chirita07)
   –  Social Annotations (Millen06, Bao07, Xu07, Xu08)
   –  User Activity (Bilenko08, Xue03)
   –  Short-term User Context (Shen05, Buscher07)

Searching the Enterprise Web

•  How is the Enterprise Web different from the Public Web?
   –  Structural differences
•  What are the most important features for search?
   –  Use Rank Aggregation to experiment with different ranking methods and features

Searching the Workplace Web

Ronald Fagin Ravi Kumar Kevin S. McCurley

Jasmine Novak D. Sivakumar

John A. Tomlin David P. Williamson

IBM Almaden Research Center, 650 Harry Road

San Jose, CA 95120

ABSTRACT

The social impact from the World Wide Web cannot be underestimated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite different, and the determination of a "good answer" for intranet search is quite different than on the Internet. In this paper we study the problem of intranet search. Our approach focuses on the use of rank aggregation, and allows us to examine the effects of different heuristics on ranking of search results.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Measurement, Experimentation

1. INTRODUCTION

The corporate intranet is an organism that is at once very similar to and very unlike the Internet at large. A well-designed intranet, powered by a high-quality enterprise information portal (EIP), is perhaps the most significant step that corporations can make—and have made in recent years—to improve productivity and communication between individuals in an organization. Given the business of EIPs and their impact, it is natural that the search problem for intranets and EIPs is growing into an important business.

Despite its importance, however, there is little scientific work reported on intranet search, or intranets at all for that matter. For example, in the history of the WWW conference, there appear to have been only two papers that refer to an "intranet" at all in their title [28, 22], and as best we can determine, all previous WWW papers on intranets are case studies in their construction for various companies and universities [18, 22, 12, 4, 21, 29, 6]. It is not surprising that few papers are published on intranet search: companies whose livelihood depends on their intranet search products are unlikely to publish their research, academic researchers generally do not have access to corporate intranets, and most researchers with access to an intranet have access only to that of their own institution.

One might wonder why the problem of searching an intranet is in any way different from the problem of searching the web or a small corpus of web documents (e.g., site search). In Internet search, many of the most significant techniques that lead to quality results are successful because they exploit the reflection of social forces in the way people present information on the web. The most famous of these are probably the HITS [24, 10, 11] and the PageRank [7] algorithms. These methods are motivated by the observation that a hyperlink from a document to another is an implicit conveyance of authority to the target page. The HITS algorithm takes advantage of a social process in which authors create "hubs" of links to authoritative documents on a particular topic of interest to them.

Searching an intranet differs from searching the Internet because different social forces are at play, and thus search strategies that work for the Internet may not work on an intranet. While the Internet reflects the collective voice of many authors who feel free to place their writings in public view, an intranet generally reflects the view of the entity that it serves. Moreover, because intranets serve a different purpose than the Internet at large, the kinds of queries made are different, often targeting a single "right answer". The problem of finding this "right answer" on an intranet is very different from the problem of finding the best answers to a query on the Internet.

In this paper we study the problem of intranet search. As suggested by the argument above, part of such a study necessarily involves observations of ways in which intranets differ from the Internet. Thus we begin with a high-level understanding of the structure of intranets. Specifically, we postulate a collection of hypotheses that shed some light on how—and why—intranets are different from the Internet. These "axioms" lead to a variety of ranking functions or heuristics that are relevant in the context of intranet search.

We then describe an experimental system we built to study intranet ranking, using IBM's intranet as a case study. Our system uses a novel architecture that allows us to easily combine our various ranking heuristics through the use of a "rank aggregation" algorithm [16]. Rank aggregation algorithms take as input multiple ranked lists from the various heuristics and produce an ordering of the pages aimed at minimizing the number of "upsets" with respect to the orderings produced by the individual ranking heuristics.

Rank aggregation allows us to easily add and remove heuristics.

Enterprise Web vs Public Web: Structural Differences

Structure of the Public Web [Broder00]

Enterprise Web vs Public Web: Structural Differences

Structure of the Enterprise Web [Fagin03]

•  Implications:
   –  More difficult to crawl
   –  The distribution of PageRank values is such that a larger fraction of pages has high PR values, thus PR may be less effective in discriminating among regular pages

intranets from the Internet; these arise from the internal architecture of the intranet, the kind of servers used, document types specific to an organization (e.g., calendars, bulletin boards that allow discussion threads), etc. For example, the IBM intranet contains many Lotus Domino servers that present many different views of an underlying document format by exposing links to open and close individual sections. For any document that contains sections, there are URLs to open and close any subset of sections in an arbitrary order, which results in exponentially many different URLs that correspond to different views of a single document. This is an example of a more general and serious phenomenon: large portions of intranet documents are not designed to be returned as answers to search queries, but rather to be accessed through portals, database queries, and other specialized interfaces. This results in our fourth axiom.

Axiom 4. Large portions of intranets are not search-engine-friendly.

There is, however, a positive side to the arguments underlying this axiom: adding a small amount of intelligence to the search engine could lead to dramatic improvements in answering fairly large classes of queries. For example, consider queries that are directory look-ups for people, projects, etc., queries that are specific to sites/locations/organizational divisions, and queries that can be easily targeted to specific internal databases. One could add heuristics to the search engine that specifically address queries of these types, and divert them to the appropriate directory/database lookup.

2.1 Intranets vs. the Internet: structural differences

Until now, we have developed a sequence of hypotheses about how intranets differ from the Internet at large, from the viewpoint of building search engines. We now turn to more concrete evidence of the structural dissimilarities between intranets and the Internet. One of the problems in researching intranet search is that it is difficult to obtain unbiased data of sufficient quantity to draw conclusions from. We are blessed (beleaguered?) with working for a very large international corporation that has a very heterogeneous intranet, and this was used for our investigations. With a few notable exceptions, we expect that our experience from studying the IBM intranet would apply to any large multinational corporation, and indeed may also apply to much smaller companies and government agencies in which authority is derived from a single management chain.

The IBM intranet is extremely diverse, with content on at least 7000 different hosts. Because of dynamic content, the IBM intranet contains an essentially unbounded number of URLs. By crawling we discovered links to approximately 50 million unique URLs. Of these the vast majority are dynamic URLs that provide database access to various underlying databases. IBM's intranet has nearly every kind of commercially available web server represented on the intranet as well as some specialty servers, but one feature that distinguishes IBM from the Internet is that a larger fraction of the web servers are Lotus Domino. These servers influence quite a bit of the structure of IBM's intranet, because the URLs generated by Lotus Domino servers are very distinctive and contribute to some of the problems in distinguishing duplicate URLs. Whenever possible we have made an effort to identify and isolate their influence. As it turns out, this effort led us to an architecture that is easily tailored to the specifics of any particular intranet.

From among the approximately 50 million URLs that were identified, we crawled about 20 million. Many of the links that were not crawled were forbidden by robots, or were simply database queries of no consequence to our experiments. Among the 20 million crawled pages, we used a duplicate elimination process to identify approximately 4.6 million non-duplicate pages, and we retained approximately 3.4 million additional pages for which we had anchortext.

The indegree and outdegree distributions for the intranet are remarkably similar to those reported for the Internet [8, 14]. The connectivity properties, however, are significantly different. In Figure 1, SCC refers to the largest strongly connected component of the underlying graph, IN refers to the set of pages not in the SCC from which it is possible to reach pages in the SCC via hyperlinks, OUT refers to the set of pages not in the SCC that are reachable from the SCC via hyperlinks, and P refers to the set of pages reachable from IN but that are not part of IN or SCC. There is an assortment of other kinds of pages that form the remainder of the intranet. On the Internet, SCC is a rather large component that comprises roughly 30% of the crawlable web, while IN and OUT have roughly 25% of the nodes each [8]. On the IBM intranet, however, we note that SCC is quite small, consisting of roughly 10% of the nodes. The OUT segment is quite large, but this is expected since many of these nodes are database queries served by Lotus Domino with no links outside of a site. The more interesting story concerns the component P, which consists of pages that can be reached from the seed set of pages employed in the crawl (a standard set of important hosts and pages), but which do not lead to the SCC. Some examples are hosts dedicated to specific projects, standalone servers serving special databases, and pages that are intended to be "plain information" rather than be navigable documents.

Figure 1: Macro-level connectivity of IBM intranet

One consequence of this structure of the intranet is the distribution of PageRank [7] across the pages in the intranet. The PageRank measure of the quality of documents in a hyperlinked environment corresponds to a probability distribution on the documents, where more important pages are intended to have higher probability mass. We compared the distribution of PageRank values on the IBM intranet with the values from a large crawl of the Internet. On the intranet, a significantly larger fraction of pages have high values of PageRank (probability mass in the distribution alluded to), and a significantly smaller fraction of pages have low values.

Rank Aggregation

•  Input: several ranked lists of objects
•  Output: a single ranked list of the union of all the objects which minimizes the number of "inversions" w.r.t. the initial lists
•  NP-hard to compute for 4 or more lists
•  A variety of heuristic approximations exist for computing either the whole ordering or the top k [Dwork01, Fagin03-1]

Rank Aggregation can also be useful in Enterprise Search for combining rankings from different data sources (see the sketch below)
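To make the input/output contract concrete, here is a minimal Python sketch of one simple positional heuristic (Borda-style scoring). It is only an illustration of combining ranked lists; it is not the footrule/median-rank aggregation analyzed in [Dwork01, Fagin03-1], and the toy page names are made up.

from collections import defaultdict

def borda_aggregate(ranked_lists, top_k=None):
    # Each list contributes (len(list) - position) points to every item it
    # ranks; unranked items get 0 from that list. Items are then sorted by
    # total score. This is one heuristic approximation, not the paper's method.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:top_k] if top_k else ordered

# Toy example: three heuristics rank overlapping sets of pages.
print(borda_aggregate([
    ["pageA", "pageB", "pageC"],
    ["pageB", "pageA", "pageD"],
    ["pageA", "pageD", "pageB"],
], top_k=3))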

What are the most important features?

•  Create 3 indices: Content, Title, Anchortext (aggregated text from the <a> tags pointing to the page)
•  Get the results, rank them by tf·idf, and feed them to the ranking heuristics
•  Combine the results using Rank Aggregation
•  Evaluate all possible subsets of indices and heuristics on very frequent (Q1) and medium-frequency (Q2) queries with manually determined correct answers

it would often be more general and have links to pages that are lower in the hierarchy.

Discriminator. This is a "pure hack" to discriminate in favor of certain classes of URLs over others. The favored URLs in our case consisted of those that end in a slash / or index.html, and those that contain a tilde ∼. Those that were discriminated against are certain classes of dynamic URLs containing a question mark, as well as some Domino URLs. This heuristic is neutral on all other URLs, and is easily tailored to knowledge of a specific intranet.
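As a rough illustration, here is a minimal Python sketch of such a discriminator used as a partial-ordering signal. The favored and disfavored URL patterns follow the description above, but the numeric scores and the ".nsf" test for Domino-style URLs are placeholder assumptions, not the paper's implementation.

def discriminator_score(url: str) -> int:
    # Crude URL preference: +1 for favored URL shapes, -1 for disfavored
    # dynamic URLs, 0 (neutral) otherwise. Patterns and weights are illustrative.
    lowered = url.lower()
    if lowered.endswith("/") or lowered.endswith("index.html") or "~" in lowered:
        return 1            # favored: directory roots, home pages, user pages
    if "?" in lowered or ".nsf" in lowered:
        return -1           # disfavored: dynamic queries, Domino-style URLs
    return 0                # neutral on everything else

# Pages can then be reordered by this score (ties keep their original order,
# since Python's sort is stable).
urls = ["http://w3.example.com/app?id=42",
        "http://w3.example.com/hr/",
        "http://w3.example.com/~jdoe/notes.html"]
print(sorted(urls, key=discriminator_score, reverse=True))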

The exact choice of heuristics is not etched in stone, and our system is designed to incorporate any partial ordering on results. Others that we considered include favoritism for some hosts (e.g., those maintained by the CIO), different content types, age of documents, click distance, HostRank (PageRank on the host graph), etc.

In our experiments these factors were combined in the following way. First, all three indices are consulted to get three ranked lists of documents that are scored purely on the basis of tf·idf (remembering that the title and anchortext index consists of virtual documents). The union of these lists is then taken, and this list is reordered according to each of the seven scoring factors to produce seven new lists. These lists are all combined in rank aggregation. In practice if k results are desired in the final list, then we chose up to 2k documents to go into the individual lists from the indices, and discard all but the top k after rank aggregation.
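A compressed sketch of that flow is shown below, with a simple median-position combiner standing in for the real rank aggregation algorithm; the index objects (callables returning tf·idf-ranked lists) and the heuristic scoring functions are hypothetical placeholders, not the paper's system.

from statistics import median

def aggregate_by_median_rank(lists, k):
    # Order the union of the input lists by each item's median position;
    # items absent from a list are treated as ranked at the bottom of it.
    universe = {item for lst in lists for item in lst}
    def med_pos(item):
        return median(lst.index(item) if item in lst else len(lst) for lst in lists)
    return sorted(universe, key=med_pos)[:k]

def search(query, indices, heuristics, k=10):
    # 1. Each index returns up to 2k results scored purely by tf-idf.
    index_lists = [idx(query)[:2 * k] for idx in indices]
    # 2. The union of those results is reordered by each heuristic score.
    union = list({doc for lst in index_lists for doc in lst})
    heuristic_lists = [sorted(union, key=h, reverse=True) for h in heuristics]
    # 3. All lists are combined by rank aggregation; keep only the top k.
    return aggregate_by_median_rank(index_lists + heuristic_lists, k)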

[Figure 2 diagram: the three indices (Content, Title, Anchortext) and the seven heuristics (Discriminator, URL depth, URL length, Words in URL, Discovery date, Indegree, PageRank) feed into Rank Aggregation, which produces the final result.]

Figure 2: Rank aggregation: The union of results from three indices is ordered by many heuristics and fed with the original index results to rank aggregation to produce a final ordering.

Our primary goal in building this system was to experiment with the effect of various factors on intranet search results, but it is interesting to note that the performance cost of our system is reasonable. The use of multiple indices entails some overhead, but this is balanced by the small sizes of the indices. The rank aggregation algorithm described in Section 4.1 has a running time that is quadratic in the length of the lists, and for small lists is not a serious factor.

5. EXPERIMENTAL RESULTS

In this section we describe our experiments with our intranet search system. The experiments consist of the following steps: choosing queries on which to test the methods, identifying criteria based on which to evaluate the quality of various combinations of ranking heuristics, identifying measures of the usefulness of individual heuristics, and finally, analyzing the outcome of the experiments.

5.1 Ranking methods and queries

By using the 10 ranking heuristics (three directly based on the indices and the seven auxiliary heuristics), one could create 1024 combinations of subsets of heuristics that could be aggregated. However, the constraint that at least one of the three indices needs to be consulted eliminates one-eighth of these combinations, leaving us with 896 combinations.

Our data set consists of the following two sets of queries, which we call Q1 and Q2, respectively.

The first set Q1 consists of the top 200 queries issued to IBM's intranet search engine during the period March 14, 2002 to July 8, 2002. Several of these turned out to be broad topic queries, such as "hr," "vacation," "travel," "autonomic computing," etc.—queries that might be expected to be in the bookmarks of several users. These queries also tend to be short, usually single-word queries, and are usually directed towards "hubs" or commonly visited sites on the intranet. These 200 queries represent roughly 39% of all the queries; this suggests that on intranets there is much to be gained by optimizing the search engine, perhaps via feedback learning, to accurately answer the top few queries.

The second set Q2 consists of 150 queries of median frequency, that is, these are the queries near the 50th percentile of the query frequency histogram (from the query logs in the period mentioned above). These are typically not the "bookmark" type queries; rather, they tend to arise when looking for something very specific. In general, these tend to be longer than the popular queries; a nontrivial fraction of them are fairly common queries disambiguated by terms that add to the specificity of the query (e.g., "american express travel canada"). Other types of queries in this category are misspelled common queries (e.g., "ebusiness"), or queries made by users looking for specific parts in a catalog or a specific invoice code (e.g., "cp 10.12", which refers to a corporate procedure). Thus Q2 represents the "typical" user queries; hence the satisfaction experienced by a user of the search engine will crucially depend on how the search engine handles queries in this category.

We regard both sets of queries as being important, but for different reasons. The distribution of queries to the IBM intranet resembles that found in Internet search engines in the sense that they both have a heavy-tail distribution for query frequency. A very few queries are very common, but most of the workload is on queries that individually occur very rarely. To provide an accurate measure of user satisfaction, we included both classes of queries in our study.

To be able to evaluate the performance of the ranking heuristics and aggregations of them, it is important that we have a clear notion of what "ground truth" is. Therefore, once the queries were identified, the next step was to collect the correct answers for these queries. A subtle issue here is that we cannot use the search engine that we are testing to find out the correct answers, as that would be biasing the results in our favor. Therefore, we resorted to the use of the existing search engine on the IBM intranet, plus a bit of good old browsing; in a handful of cases where we could not find any good page for a query, we did employ our search engine to locate some seed answers, which we then refined by further browsing. If the query was ambiguous (e.g., "american express", which refers to both the corporate credit card and the travel agency) or if there were multiple correct answers to a query (e.g., "sound driver"), we permitted all of them to be included as correct answers.

Results

IRi(α) is the "influence" of the ranking method α

Observations:
•  Anchortext is by far the most influential feature
•  Title is very useful, too
•  Content is ineffective for Q1, but is useful for Q2
•  PR is useful, but does not have a huge impact

(3) Some URL canonicalizations were missed, leading the evaluation to believe that the correct result was missed while it was reported under an alias URL.

Tables 1 and 2 present the results for the influences of various heuristics. For a ranking method α and a measure of goodness µ, recall that Iµ(α) denotes the influence (positive or negative) of α with respect to the goodness measure µ. Our µ's are the recall value at various top k positions—1, 3, 5, 10, and 20; we will abbreviate "recall at 1" as "R1," etc.

Legend. The following abbreviations are used for the 10 ranking heuristics: Ti = index of titles, keywords, etc., An = anchortext, Co = content, Le = URL length, De = URL depth, Wo = query words in URL, Di = discriminator, PR = PageRank, In = indegree, Da = discovery date.

α      IR1(α)   IR3(α)   IR5(α)   IR10(α)   IR20(α)
Ti      29.2     13.6      5.6      6.2       5.6
An      24.0     47.1     58.3     74.4      87.5
Co       3.3     −6.0     −7.0     −4.4      −2.7
Le       3.3      4.2      1.8      0         0
De      −9.7     −4.0     −3.5     −2.9      −4.0
Wo       3.3      0       −1.8      0         1.4
Di       0       −2.0     −1.8      0         0
PR       0       13.6     11.8      7.9       2.7
In       0       −2.0     −1.8      1.5       0
Da       0        4.2      5.6      4.6       0

Table 1: Influences of various ranking heuristics on the recall at various positions on the query set Q1

α      IR1(α)   IR3(α)   IR5(α)   IR10(α)   IR20(α)
Ti       6.7      8.7      3.4      3.0       0
An      23.1     31.6     30.4     21.4      15.2
Co      −6.2     −4.0      3.4      0         5.6
Le       6.7     −4.0      0        0        −5.3
De     −18.8     −8.0    −10       −8.8      −7.9
Wo       6.7     −4.0      0        0         0
Di      −6.2     −4.0      0        0         0
PR       6.7      4.2     11.1      6.2       2.7
In      −6.2     −4.0      0        0         0
Da      14.3      4.2      3.4      0         2.7

Table 2: Influences of various ranking heuristics on the recall at various positions on the query set Q2

Some salient observations. (1) Perhaps the most noteworthy aspect of Tables 1 and 2 is the amazing efficacy of anchortext. In Table 1, the influence of anchortext can be seen to be progressively better as we relax the recall parameter. For example, for recall at position 20, anchortext has an influence of over 87%, which means that using anchortext leads to essentially doubling the recall performance!

(2) Table 1 also shows that the title index (which, the reader may recall, consists of words extracted from titles, meta-tagged information, keywords, etc.) is an excellent contributor to achieving very good recall, especially at the top 1 and top 3. Specifically, notice that adding information from the title index improves the accuracy at top 1 by nearly 30%, the single largest improvement for the top 1. Interestingly enough, at top 20, the role of this index is somewhat diminished, and, compared to anchortext, is quite weak.

One way to interpret the information in the first two rows of Table 1 is that anchortext fetches the important pages into the top 20, and the title index pulls up the most accurate pages to the near top. This is also evidence that different heuristics have different roles, and a good aggregation mechanism serves as a glue to bind them seamlessly.

(3) Considering the role of the anchortext in Table 2, which corresponds to the query set Q2 (the "typical," as opposed to the "popular" queries), we notice that the monotonic increase in contribution (with respect to the position) of anchortext is no longer true. This is not very surprising, since queries in this set are less likely to be extremely important topics with their own web pages (which is the primary cause of a query word being in some anchortext). Nevertheless, anchortext still leads to a 15% improvement in recall at position 20, and is the biggest contributor.

The effect of the title index is also less pronounced in Table 2, with no enhancement to the recall at position 20.

Observations (1)–(3) lead to several inferences.

Inference 1. Information in anchortext, document titles, keyword descriptors and other meta-information in documents, is extremely valuable for intranet search.

Inference 2. Our idea of building separate indices based on this information, as opposed to treating this as auxiliary information and using it to tweak the content index, is particularly effective. These indices are quite compact (roughly 5% and 10% of the size of the content index), fairly easy to build, and inexpensive to access.

Inference 3. Information from these compact indices is query-dependent; thus, these ranking methods are dynamic. The rank aggregation framework allows for easy integration of such information with static rankings such as PageRank.

Continuing with our observations on Tables 1 and 2:

(4) The main index of information on the intranet, namely the content index, is quite ineffective for the popular queries in Q1. However, it becomes increasingly more effective when we consider the query set Q2, especially when we consider recall at position 20. This fact is in line with our expectations, since a large number of queries in Q2 are pointed queries on specialized topics, ones that are more likely to be discussed inside documents rather than in their headers.

Inference 4. Different heuristics have different performances for different types of queries, especially when we compare their performances on "popular" versus "typical" queries. An aggregation mechanism is a convenient way to unify these heuristics, especially in the absence of "classifiers" that tell which type a given query is. Such a classifier, if available, is a bonus, since we could choose the right heuristics for aggregation depending on the query type.

(5) The ranking heuristics based on URL length (Le), presence of query words in the URL (Wo), discovery date (Da), and PageRank (PR) form an excellent support cast in the rank aggregation framework. The first three of these are seen to be especially useful in Q2, the harder set of queries. An interesting example is discovery date (Da), which has quite a significant effect on the top 1 recall for Q2.

(6) While PageRank is uniformly good, contrary to its high-impact role on the Internet, it does not add much value in bringing good pages into the top 20 positions; its value


This study confirms most of the findings of [Fagin03] on 6 different Enterprise Webs (results for 4 datasets are shown)

•  Anchortext and title are still the best
•  Content is also useful

Challenges in Enterprise Search

David Hawking

CSIRO ICT Centre, GPO Box 664,

Canberra, Australia [email protected]

Abstract

Concerted research effort since the nineteen fifties has led to effective methods for retrieval of relevant documents from homogeneous collections of text, such as newspaper archives, scientific abstracts and CD-ROM encyclopaedias. However, the triumph of the Web in the nineteen nineties forced a significant paradigm shift in the Information Retrieval field because of the need to address the issues of enormous scale, fluid collection definition, great heterogeneity, unfettered interlinking, democratic publishing, the presence of adversaries and most of all the diversity of purposes for which Web search may be used. Now, the IR field is confronted with a challenge of similarly daunting dimensions – how to bring highly effective search to the complex information spaces within enterprises. Overcoming the challenge would bring massive economic benefit, but victory is far from assured. The present work characterises enterprise search, hints at its economic magnitude, states some of the unsolved research questions in the domain of enterprise search, proposes an enterprise search test collection and presents results for a small but interesting sub-problem.

Keywords: Information Retrieval; enterprise search; search quality evaluation.

1 Introduction

The IDC report entitled "The High Cost of Not Finding Information" (Feldman and Sherman, 2003) quantifies the significant economic penalties caused by poor quality search within enterprises, both in the form of lost opportunities and through lost productivity. CSIRO's observations of a large number of Australasian organisations suggest that these were not isolated or exceptional cases – in fact, very poor enterprise search is the norm and, while employees and customers may complain or laugh about it, the organisation as a whole typically fails either to recognize the seriousness of the situation or the possibility of doing better.

I interpret the term enterprise search to include:

• any organisation with text content in electronic form;


• search of the organisation’s external website;

• search of the organisation's internal websites (its intranet);

• search of other electronic text held by the organisation in the form of email, database records, documents on fileshares and the like.

In general, search of non-textual and continuous media is included but here I consider only retrieval mediated by text (e.g. retrieving a video by matching textual annotations associated with it against a text query rather than by measuring its visual or auditory similarity to a video query).

An obvious reason for poor enterprise search is that a high performing text retrieval algorithm developed in the laboratory cannot be applied without extensive engineering to the enterprise search problem because of the complexity of typical enterprise information spaces. Out of the hundreds of search engines on the market, so few are able to work with the range of databases, content management systems, email formats, document formats, operational and security requirements typical of medium scale enterprises that quality of search results is often forgotten when purchasing decisions are made. (See Stenmark (1999) for an illustration.)

Not only does enterprise information complexity restrict the range of applicable commercial search products and increase the cost of deploying them but it makes it difficult to measure the quality of search results obtained and also, we hypothesise, makes it hard to approach the effectiveness level achieved by state-of-the-art whole-of-Web search engines.

For researchers in the Information Retrieval (IR) field and for commercial companies, the problem of enterprise search is a formidable challenge but one for which a solution would deliver enormous benefit.

No solution can be presented here, but it is hoped that a characterisation of the problem, a list of open research questions, a discussion of approaches and a preliminary proposal for an evaluation framework may attract researchers to work in the area and thereby accelerate progress toward a solution.

Section 2 characterises the problem of enterprise search in the light of past work and in terms of some typical enterprise search scenarios. Section 3 enumerates and characterises a number of open research problems, including the development of a test collection for enterprise search. Section 4 addresses the specific problem of searching enterprise data held in web format and reports new results across a range of outward-facing enterprise web sites. It also includes

[Figure 3 panels: six bar charts, one per enterprise web site, each reporting P@1 or S@1 (%) on a scale of 10.0 to 70.0 for runs that use only URL words, title, description, subject, content or anchors as evidence. Panels: Aust. Post (94 queries; 2088 documents), Comm. Bank (145 queries; 2967 documents), CSIRO (130 queries; 95,907 documents), Curtin Uni. (332 queries; 79,296 documents), DEST (62 queries; 8416 documents), unimelb (415 queries).]

Figure 3: The relative value of different types of query-dependent evidence for navigational search on the external web sites of six different enterprises. Within a graph, the height of a bar reflects the effectiveness achieved on a navigational search task, when the Okapi BM25 scoring function is applied to pseudo documents containing only parts of the available text: content, title, URL words, subject and description metadata and propagated referring anchortext. In each case the sitemap from which the navigational queries were derived was excluded from the index. Exact duplicates of correct answers were accepted as equally correct.
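For reference, the field-restricted runs summarized in Figure 3 score a query against "pseudo documents" that keep only one kind of evidence. Below is a minimal sketch of Okapi BM25 scoring over such pseudo documents; the k1 and b values are common defaults, and the toy corpus is invented for the example.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    # BM25 of a query against one pseudo document, where the document and
    # the corpus contain only one field's text (title, anchors, URL words, ...).
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (tf[term] * (k1 + 1)) / denom
    return score

# Toy corpus of title-only pseudo documents.
titles = [["staff", "directory"], ["travel", "policy"], ["travel", "booking", "tool"]]
print(bm25_score(["travel", "policy"], titles[1], titles))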

Summary

•  Enterprise Web and Public Web exhibit significant structural differences
•  These differences result in some features that are very effective for web search not being so effective for Enterprise Web Search
   –  Anchortext is very useful (but there is much less of it)
   –  Title is good
   –  Content is questionable
   –  PageRank is not as useful

Using User Feedback in Enterprise Web Search

Using User Feedback
•  One of the most promising directions in Enterprise Search
   –  Can trust the feedback (no spam)
   –  Can provide incentives
   –  Can design a system to facilitate feedback
   –  Can actually implement it
•  We will look at several different sources of feedback
   –  Clicks (very briefly)
   –  Explicit Annotations
   –  Queries
   –  Social Annotations
   –  Browsing Traces

Sources of Feedback in Web Search

•  Explicit Feedback
   –  Overhead for the user
   –  Only a few users give feedback => not representative
•  Implicit Feedback
   –  Queries, clicks, time, mousing, scrolling, etc.
   –  No overhead
   –  More difficult to interpret

[Joachims02, Radlinski05]

[Slide: screenshot of a Google results page for the query "RuSSIR 2009" (August 2009). The top result, the 3rd Russian Summer School in Information Retrieval site at romip.ru, is marked by the user with the comment "good result" via a "Make a public comment" box, and the page notes that such comments are public and tied to the user's Google Account. The screenshot illustrates explicit feedback on search results in the Public Web setting.]

Using Click Data to Improve Search

•  Very active area of research in both academia and industry, mostly in the context of Public Web search, but it can be applied to Enterprise Web search as well
•  The idea is to treat clicks as relevance votes ("clicked" = "relevant"), or as preference votes ("clicked page" > "non-clicked page"), and then use this information to modify the search engine's ranking function (see the sketch below)

See RuSSIR'07, "Machine Learning for Web-Related Problems", lecture 3.
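As a small illustration of the preference-vote view, here is a sketch of the "clicked is preferred over skipped-above" rule in the spirit of [Joachims02]; the result list and click log format are invented for the example.

def click_preferences(ranked_results, clicked):
    # Turn clicks on a result list into pairwise preference votes:
    # a clicked document is preferred over every non-clicked document
    # ranked above it (the "skip-above" rule used in Joachims-style training).
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranked_results):
        if doc in clicked:
            prefs.extend((doc, skipped)
                         for skipped in ranked_results[:pos]
                         if skipped not in clicked)
    return prefs

# Example: the user skipped d1 and d2 but clicked d3.
print(click_preferences(["d1", "d2", "d3", "d4"], ["d3"]))
# -> [('d3', 'd1'), ('d3', 'd2')]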

Explicit and Implicit Annotations

•  Anchortext is the most important ranking feature for Enterprise Web Search
•  But the quantity of anchortext is very limited in the Enterprise
•  Can we use user annotations as a substitute for anchortext?

Using Annotations in Enterprise Search

Pavel A. Dmitriev
Department of Computer Science, Cornell University, Ithaca, NY 14850

[email protected]

Nadav Eiron
Google Inc., 1600 Amphitheatre Pkwy., Mountain View, CA 94043∗

Marcus Fontoura
Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089

[email protected]

Eugene Shekita
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120

[email protected]

ABSTRACT

A major difference between corporate intranets and the Internet is that in intranets the barrier for users to create web pages is much higher. This limits the amount and quality of anchor text, one of the major factors used by Internet search engines, making intranet search more difficult. The social phenomenon at play also means that spam is relatively rare. Both on the Internet and in intranets, users are often willing to cooperate with the search engine in improving the search experience. These characteristics naturally lead to considering using user feedback to improve search quality in intranets. In this paper we show how a particular form of feedback, namely user annotations, can be used to improve the quality of intranet search. An annotation is a short description of the contents of a web page, which can be considered a substitute for anchor text. We propose two ways to obtain user annotations, using explicit and implicit feedback, and show how they can be integrated into a search engine. Preliminary experiments on the IBM intranet demonstrate that using annotations improves the search quality.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Experimentation, Human Factors

Keywords

Anchortext, Community Ranking, Enterprise Search

1. INTRODUCTION

With more and more companies having a significant part of their information shared through a corporate Web space, providing high quality search for corporate intranets becomes increasingly important. It is particularly appealing for large corporations, which often have intranets consisting of millions of Web pages, physically located in multiple cities, or even countries. Recent research shows that employees spend a large percentage of their time searching for information [16]. An improvement in quality of intranet search reduces the time employees spend on looking for information they need to perform their work, directly resulting in increased employee productivity.

∗ This work was done while these authors were at IBM Almaden Research Center.

As it was pointed out in [15], social forces driving the development of intranets are rather different from the ones on the Internet. One particular difference, that has implications on search, is that company employees cannot freely create their own Web pages in the intranet. Therefore, algorithms based on link structure analysis, such as PageRank [24], do not apply to intranets the same way they apply to the Internet. Another implication is that the amount of anchor text, one of the major factors used by Internet search engines [14, 1], is very limited in intranets.

While the characteristics of intranets mentioned above make intranet search more difficult compared to search on the Internet, there are other characteristics that make it easier. One such characteristic is the absence of spam in intranets. Indeed, there is no reason for employees to try to spam their corporate search engine. Moreover, in many cases intranet users are actually willing to cooperate with the search engine to improve search quality for themselves and their colleagues. These characteristics naturally lead to considering using user feedback to improve search quality in intranets.

In this paper we explore the use of a particular form of feedback, user annotations, to improve the quality of intranet search. An annotation is a short description of the contents of a web page. In some sense, annotations are a substitute for anchor text.

One way to obtain annotations is to let users explicitly enter annotations for the pages they browse. In our system, users can do so through a browser toolbar. When trying to obtain explicit user feedback, it is important to provide users with clear immediate benefits for taking their time to give the feedback. In our case, the annotation the user has entered shows up in the browser toolbar every time the user visits the page, providing a quick reminder of what a page is about. The annotation will also appear on the search engine results page, if the annotated page is returned as a search result.

While the methods described above provide the user with useful benefits for entering annotations, we have found many users reluctant to provide explicit annotations. We therefore propose another method for obtaining annotations, which automatically extracts them from the search engine query log. The basic idea is to use the queries users submit to the search engine as annotations for pages users click on.

Explicit Annotations

•  Create a Toolbar to allow users to annotate pages they visit
•  Provide incentives to annotate:
   –  The personal annotation appears in the toolbar every time the user visits the page
   –  Aggregated annotations from all users appear in search engine results

Figure 1: The Trevi Toolbar contains two fields: a search field to search the IBM intranet using Trevi, and an annotation field to submit an annotation for the page currently open in the browser.

However, the naïve approach of assigning a query as an annotation to every page the user clicks on may assign annotations to irrelevant pages. We experiment with several techniques for deciding which pages to attach an annotation to, making use of the users' click patterns and the ways they reformulate their queries.

The main contributions of this paper include:

• A description of the architecture for collecting annotations and for adding annotations to search indexes.
• Algorithms for generating implicit annotations from query logs.
• Preliminary experimental results on a real dataset from the IBM intranet, consisting of 5.5 million web pages, demonstrating that annotations help to improve search quality.

The rest of the paper is organized as follows. Section 2 briefly reviews basic Web IR concepts and terminology. Section 3 describes in detail our methods for collecting annotations. Section 4 explains how annotations are integrated into the search process. Section 5 presents experimental results. Section 6 discusses related work, and section 7 concludes the paper.

2. BACKGROUND

In a Web IR system retrieval of web pages is often based on the pages' content plus the anchor text associated with them. Anchor text is the text given to links to a particular page in other pages that link to it. Anchor text can be viewed as a short summary of the content of the page authored by a person who created the link. In aggregate, anchor text from all incoming links provides an objective description of the page. Thus, it is not surprising that anchor text has been shown to be extremely helpful in Web IR [1, 14, 15].

Most Web IR systems use inverted indexes as their main data structure for full-text indexing [29]. In this paper, we assume an inverted index structure. The occurrence of a term t within a page p is called a posting. The set of postings associated to a term t is stored in a posting list. A posting has the form <pageID, payload>, where pageID is the ID of the page p and where the payload is used to store arbitrary information about each occurrence of t within p. For example, payload can be used to indicate whether the term came from the title of the page, from the regular text, or from the anchor text associated with the page. Here, we use part of the payload to indicate whether the term t came from content, anchor text, or annotation of the page and to store the offset within the document.
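A minimal sketch of what such a posting might look like in code is given below; the field names and the numeric source codes are illustrative and do not reflect Trevi's actual index format.

from dataclasses import dataclass
from collections import defaultdict

# Illustrative source codes packed into the payload.
CONTENT, ANCHOR, ANNOTATION = 0, 1, 2

@dataclass
class Posting:
    page_id: int
    source: int   # where the occurrence came from: content, anchor text, or annotation
    offset: int   # position of the occurrence within that document

# Posting lists: term -> list of Posting objects.
index = defaultdict(list)
index["derby"].append(Posting(page_id=42, source=ANNOTATION, offset=3))
index["derby"].append(Posting(page_id=42, source=CONTENT, offset=17))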

For a given query, a set of candidate answers (pages that match the query words) is selected, and every page is assigned a relevance score. The score for a page usually contains a query-dependent textual component, which is based on the page's similarity to the query, and a query-independent static component, which is based on the static rank of the page. In most Web IR systems, the textual component of the score follows an additive scoring model like tf × idf for each term, with terms of different types, e.g. title, text, anchor text, weighted differently. Here we adopt a similar model, with annotation terms weighted the same as terms from anchor text. The static component can be based on the connectivity of web pages, as in PageRank [24], or on other factors such as source, length, creation date, etc. In our system the static rank is based on the site count, i.e., the number of different sites containing pages that link to the page under consideration.
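A toy sketch of this scoring model: a field-weighted tf × idf textual component, with annotation terms weighted like anchor text, plus a query-independent static component based on site count. The field weights and the mixing factor are invented for illustration and are not the weights used in Trevi.

import math

# Illustrative field weights; annotations weighted the same as anchor text.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "annotation": 2.0, "content": 1.0}

def score(query_terms, page, idf, num_linking_sites, static_weight=0.1):
    # page maps field name -> term frequency dict; idf maps term -> idf value.
    textual = sum(
        FIELD_WEIGHTS[field] * tfs.get(term, 0) * idf.get(term, 0.0)
        for term in query_terms
        for field, tfs in page.items()
    )
    static = math.log(1 + num_linking_sites)   # site-count based static rank
    return textual + static_weight * static

page = {"title": {"derby": 1}, "content": {"download": 2, "derby": 1},
        "anchor": {}, "annotation": {"download": 1, "derby": 1}}
print(score(["download", "derby"], page,
            idf={"download": 1.2, "derby": 2.5}, num_linking_sites=7))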

3. COLLECTING ANNOTATIONS

This section describes how we collect explicit and implicit annotations from users. Though we describe these procedures in the context of the Trevi search engine for the IBM intranet [17], they

Examples of Explicit Annotations

Annotation                                 Annotated Page
change IBM passwords                       Page about changing various passwords in IBM intranet
stockholder account access                 Login page for IBM stock holders
download page for Cloudscape and Derby     Page with a link to Derby download
ESPP home                                  Details on Employee Stock Purchase Plan
EAMT home                                  Enterprise Asset Management homepage
PMR site                                   Problem Management Record homepage
coolest page ever                          Homepage of an IBM employee
most hard-working intern                   an intern's personal information page
good mentor                                an employee's personal information page

Table 1: Examples of Annotations

of annotated pages is 12433 for the first and the third strategies, 8563 for the second strategy, and 4126 for the fourth strategy.

Given the small number of explicit annotations we were able to collect, relative to the size of the dataset (5.5 million pages), the results presented below can only be viewed as very preliminary. Nevertheless, we observed several interesting characteristics of annotations, which highlight their usefulness in enterprise search.

5.1 Types of Annotations

Table 1 shows examples of explicit annotations. The most typical type of annotations were concise descriptions of what a page is about, such as the first set of annotations in Table 1. While these annotations do not usually introduce new words into the description of a page, in many cases they are still able to increase the rank of the page for relevant queries. Consider, for example, the annotation "download page for Cloudscape and Derby", attached to a page with the link to Derby download. Cloudscape is a popular IBM Java RDBMS, which was recently renamed Derby and released under an open source license. Cloudscape has been integrated in many IBM products, which resulted in frequent co-occurrence of "download" and "Cloudscape" on the same page. The renaming also led to replacement of Cloudscape with Derby on some of the pages. As a result, neither of the queries "download Cloudscape" or "download Derby" return the correct page with a high rank. However, with the above annotation added to the index, the page is ranked much higher for both queries, because the keywords occur close to each other in the annotation, and Trevi's ranking function takes this into account.

Another common type of annotations were abbreviations (the second set in Table 1). At IBM, like at any other big company, everything is given a formal sounding, long name. Thus, employees often come up with abbreviations for such names. These abbreviations, widely used in spoken language, are not always mentioned in the content of the web pages describing the corresponding entities. We observed that entering an abbreviation as an annotation for a page describing a program or a service was a very common type of annotations. Such annotations are extremely useful for search, since they augment the page with a keyword that people frequently use to refer to the page, but which is not mentioned in its content.

The third type of annotations reflect an opinion of a person regarding the content of the web page. The last set of annotations in Table 1 gives examples of such annotations. While a few annotations of this type that we have in our dataset do not convey any serious meaning, such annotations have a potential to collect opinions of people regarding services or other people, which employees do not have an opportunity to express otherwise. One can imagine, for example, that a query like "best physical training class at Almaden" will indeed return as the first hit a page describing the most popular physical training program offered to IBM Almaden employees, because many people have annotated this page with the keyword "best".

5.2 Impact on Search QualityTo evaluate the impact of annotations on search quality we gener-

ated 158 test queries by taking explicit annotations to be the queries,

and the annotated pages to be the correct answers. We used perfor-

mance of the search engine without annotations as a baseline for

our experiments. Table 2 shows the performance of explicit and

implicit annotations in terms of the percentage of queries for which

the correct answer was returned in the top 10 results.

Baseline EA IA 1 IA 2 IA 3 IA 4

8.9% 13.9% 8.9% 8.9% 9.5% 9.5%

Table 2: Summary of the results measured by the percentage

of queries for which the correct answer was returned in the top

10. EA = Explicit Annotations, IA = Implicit Annotations.

Adding explicit annotations to the index results in statistically

significant at 95% level improvement over the baseline. This is

expected, given the nice properties of annotations mentioned above.

However, even with explicit annotations the results are rather low.

One reason for that is that many of the annotations were attached to

dynamically generated pages, which are not indexed by the search

engine. As mentioned above, only 14 pages out of 67 annotated

pages were actually in the index.

Implicit annotations, on the other hand, did not show any signifi-

cant improvement over the baseline. There was also little difference

among the different implicit annotation strategies. A closer inves-

tigation showed that there was little overlap between the pages that

received implicit annotations, and the ones that were annotated ex-

plicitly. We suspect that, since the users knew that the primary goal

of annotations is to improve search quality, they mostly tried to

enter explicit annotations for pages they could not find using Trevi.

We conclude that a different experimentation approach is needed to

evaluate the true value of implicit annotations, and the differences

among the four annotation strategies.

6. RELATEDWORKThere are three categories of work related to this paper: enter-

prise search, page annotations on the Web, and improving search

quality using user feedback. We discuss each of these categories in

subsequent sections.

6.1 Enterprise SearchWhile there are many companies that offer enterprise search so-

lutions [2, 3, 5, 7], there is surprisingly little work in this area in

the research community.

Implicit Annotations

•  Mine annotations from query logs
   – Treat queries as annotations for relevant pages
   – While such annotations are of lower quality, a large number of them can be collected easily

•  How to determine "relevant" pages? [Joachims02, Radlinski05]

can be implemented with minor modifications on top of any intranet

search engine. One assumption our implementation does rely on is

the identification of users. On the IBM intranet users are identified

by a cookie that contains a unique user identifier. We believe this

assumption is valid as similar mechanisms are widely used in other

intranets as well.

3.1 Explicit Annotations
The classical approach to collecting explicit user feedback asks

the user to indicate relevance of items on the search engine results

page, e.g. [28]. A drawback of this approach is that, in many cases,

the user needs to actually see the page to be able to provide good

feedback, but once they have reached the page they are unlikely to go back

to the search results page just for the purpose of leaving feedback.

In our system users enter annotations through a toolbar attached

to the Web browser (Figure 1). Each annotation is entered for the

page currently open in the browser. This allows users to submit

annotations for any page, not only for the pages they discovered

through search. This is a particularly promising advantage of our

implementation, as annotating only pages already returned by the

search engine creates a “rich get richer” phenomenon, which pre-

vents new high quality pages from becoming highly ranked in the

search engine [11]. Finally, since annotations appear in the toolbar

every time the user visits the page he or she has annotated, it is easy

for the user to modify or delete their annotations.

Currently, annotations in our system are private, in the sense that

only the user who entered the annotation can see it displayed in the

toolbar or search results. While there are no technical problems

preventing us from allowing users to see and modify each other’s

annotations, we regarded such behavior as undesirable and did not

implement it.

3.2 Implicit Annotations
To obtain implicit annotations, we use Trevi's query log, which

records the queries users submit, and the results they click on.

Every log record also contains an associated userID, a cookie auto-

matically assigned to every user logged into the IBM intranet (Fig-

ure 2). The basic idea is to treat the query as an annotation for

pages relevant to the query. While these annotations are of lower

quality than the manually entered ones, a large number of them can

be collected without requiring direct user input. We propose sev-

eral strategies to determine which pages are relevant to the query,

i.e., which pages to attach an annotation to, based on clickthrough

data associated with the query.

LogRecord ::= <Query> | <Click>
Query     ::= <Time>\t<QueryString>\t<UserID>
Click     ::= <Time>\t<QueryString>\t<URL>\t<UserID>

Figure 2: Format of the Trevi log file.

The first strategy is based on the assumption that if the user

clicked on a page in the search results, they thought that this page

is relevant to the query in some way. For every Click record, the

strategy produces a (URL, Annotation) pair, where Annotation is the QueryString. This strategy is simple to implement, and

gives a large number of annotations.

The problem with the first strategy is that, since the user only

sees short snippets of page contents on the search results page, it is

possible that the page they clicked on ends up not being relevant to

the query. In this case, the strategy will attach the annotation to an

irrelevant page.

Typically, after clicking on an irrelevant link on a search engine

results page, the user goes back to the results page and clicks on

another page. Our second strategy accounts for this type of behav-

ior, and is based on the notion of a session. A session is a time-

ordered sequence of clicks on search results that the user makes

for a given query. We can extract session data from the query log

based on UserIDs. Our second strategy only produces a (URL,

Annotation) pair for a click record which is the last record in a

session. Annotation is still the QueryString.

The two other strategies we propose try to take into account the

fact that users often reformulate their original query. The strategies

are based on the notion of a query chain [25]. A query chain is a

time-ordered sequence of queries, executed over a short period of

time. The assumption behind using query chains is that all subse-

quent queries in the chain are actually refinements of the original

query.

The third and the fourth strategies for extracting implicit anno-

tations are similar to the previous two, except that they use query

chains instead of individual queries. We extract query chains from

the log file based on the time stamps of the log records. The third

strategy, similarly to the first one, produces a (URL, Annotation)

pair for every click record, but Annotation now is the concate-

nation of all QueryStrings from the corresponding query chain.

Finally, the fourth strategy produces a (URL, Annotation) pair

for a click record which is the last record in the last session in

a query chain, and Annotation is, again, the concatenation of

QueryStrings from the corresponding query chain.

Recent work has demonstrated that the naïve strategy of regard-

ing every clicked page as relevant to the query (our first strategy)

produces biased results, due to the fact that the users are more

likely to click on the higher ranked pages irrespective of their rele-

vance [21]. Our hope is that the session-based strategies will help to

eliminate this bias. However, these strategies produce significantly

smaller amounts of data compared to the original strategy. Using

query chains helps to increase the amount of data, by increasing the

size of the annotation data added to a page.
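A minimal Python sketch of how the first two strategies could be implemented on top of a log in the Figure 2 format. This is not the authors' implementation: the numeric timestamps and the (user, query) session key are assumptions made for illustration.

from collections import defaultdict

def parse_log(lines):
    """Parse Trevi-style log lines (Figure 2): tab-separated Query and Click records."""
    queries, clicks = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:       # <Time>\t<QueryString>\t<UserID>
            queries.append({"time": float(fields[0]), "query": fields[1], "user": fields[2]})
        elif len(fields) == 4:     # <Time>\t<QueryString>\t<URL>\t<UserID>
            clicks.append({"time": float(fields[0]), "query": fields[1],
                           "url": fields[2], "user": fields[3]})
    return queries, clicks

def strategy1(clicks):
    """Strategy 1: every clicked page gets the query string as an annotation."""
    return [(c["url"], c["query"]) for c in clicks]

def strategy2(clicks):
    """Strategy 2: only the last click of each (user, query) session gets the annotation."""
    last_click = {}
    for c in sorted(clicks, key=lambda c: c["time"]):
        last_click[(c["user"], c["query"])] = c   # later clicks overwrite earlier ones
    return [(c["url"], c["query"]) for c in last_click.values()]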

3.3 The Value of Annotations
Annotations, both explicit and implicit, have the potential to in-

fluence retrieval and ranking in many ways. One particular instance

in which annotations are helpful is in enriching the language used

to describe a concept. Like anchor text, annotations let users use

their own words to describe a concept that the annotated page talks

about. Users’ vocabulary may be radically different in some cases

than the one used by authors. This dichotomy in vocabulary is par-

ticularly prevalent in intranets, where corporate policies dictate that

certain terms be avoided in official documents or web pages. We

expect that the similarity between anchor text vocabulary and query

vocabulary [14] will also be present in annotations. This is obvi-

ously the case for our method of collecting implicit annotations, as

those annotations are just old queries.

As an example, the United States Federal Government recently went through a restructuring that changed the name of the agency that

is responsible for immigration from “Immigration and Naturaliza-

tion Service”, or INS, to “United States Citizenship and Immigra-

tion Services”, or USCIS1. While all formal web pages describing

government immigration-related activities no longer mention the

words “INS”, users might still search by the old, and better known,

name of the agency. We can therefore expect that annotations will

also use these terms. This is true even of implicit annotations, as

many search engines do not force all terms to be present in a search

1 While this example does not directly apply to most intranets, similar examples are common in many large intranets.

Strategy 1

•  Assume every clicked page is relevant
   – Simple to implement

   – Produces a large number of annotations
   – But may attach an annotation to an irrelevant page

Strategy 2

•  Session = time-ordered sequence of clicks a user makes for a given query

•  Assume only the last click in the session is relevant
   – Produces fewer annotations
   – Avoids assigning annotations to irrelevant pages

Strategies 3 & 4

•  Query Chain = time-ordered sequence of queries executed over a short period of time

•  Strategy 3: Assume every click in the query chain is relevant

•  Strategy 4: Assume only the last click in the last session of the query chain is relevant (see the sketch below)
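A minimal sketch of the query-chain variants (Strategies 3 and 4), assuming per-user records in the Figure 2 format and an arbitrary 5-minute gap as the "short period of time" threshold; the paper does not specify the actual threshold.

def query_chains(records, gap_seconds=300):
    """Group one user's time-ordered log records into query chains: a new chain starts
    whenever the gap between consecutive records exceeds the (assumed) threshold."""
    chains, current = [], []
    for rec in sorted(records, key=lambda r: r["time"]):
        if current and rec["time"] - current[-1]["time"] > gap_seconds:
            chains.append(current)
            current = []
        current.append(rec)
    if current:
        chains.append(current)
    return chains

def strategy3(chain):
    """Strategy 3: annotate every clicked page in the chain with all queries of the chain."""
    annotation = " ".join(dict.fromkeys(r["query"] for r in chain))  # de-duplicated concatenation
    return [(r["url"], annotation) for r in chain if "url" in r]

def strategy4(chain):
    """Strategy 4: annotate only the last clicked page of the last session in the chain."""
    annotation = " ".join(dict.fromkeys(r["query"] for r in chain))
    clicks = [r for r in chain if "url" in r]
    return [(clicks[-1]["url"], annotation)] if clicks else []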

Using Annotations in Enterprise Web Search

Flow of Annotations through the system

[Diagram: the user enters annotations in the browser Toolbar and clicks Search Results; the Web Server saves them to the Annotation database, which stores/updates and retrieves them for display back in the toolbar; the database is periodically exported to the Annotation Store, which is combined with the Content Store and the Anchor text Store to build the Index.]

Figure 3: Flow of annotations through the system.

result. Thus, someone searching for “INS H1-B visa policies” may

actually find a USCIS page talking about the subject. An implicit

annotation of such a page will still be able to add the term “INS”

to that page, improving its ranking for users using the older terms

in future queries, or in queries where the term “INS” is required to

appear.

4. USING ANNOTATIONS IN INTRANET SEARCH
In order for annotations to be integrated into the search process,

they must be included in the search engine index. One way to

achieve this is to augment the content of pages with annotations

collected for these pages [22]. However, this does not take into

account the fact that annotations have different semantics. Being

concise descriptions of content, annotations should be treated as

meta-data, rather than content. From this point of view, they are

similar to anchor text [14]. Thus, we decided to use annotations in

a similar way to how anchor text is used in our system.

The flow of annotations through the system is illustrated in Fig-

ure 3. After being submitted by the user, explicit annotations are stored in

a database. This database is used to display annotations back to the

user in the toolbar and search results. Periodically (currently once

a day), annotations are exported into an annotation store – a special

format document repository used by our indexing system [17]. The

annotation store is combined with the content store and anchor text

store to produce a new index. This is done by sequentially scanning

these three stores in batch mode and using a disk-based sort merge

algorithm for building the index [17]. Once the new index is ready,

it is substituted for the old one.

The index build algorithm takes the pages from the stores as in-

put and produces the inverted index, which is basically a collection

of posting lists, one list for each token that appears in the corpus.

To reduce storage and I/O costs, posting lists are compressed us-

ing a simple variable-byte scheme based on computing the deltas

between positions.
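A minimal sketch of such a variable-byte delta scheme. The exact byte layout used by the indexer is not specified in the paper; this is one common encoding (7 data bits per byte, terminator bit on the last byte of each value).

def encode_postings(positions):
    """Delta-encode a sorted list of positions, then variable-byte encode each delta."""
    out = bytearray()
    prev = 0
    for pos in positions:
        delta = pos - prev
        prev = pos
        chunk = []
        while True:
            chunk.insert(0, delta % 128)
            if delta < 128:
                break
            delta //= 128
        chunk[-1] += 128          # set the high bit on the final byte of the value
        out.extend(chunk)
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild the deltas, then prefix-sum back to positions."""
    positions, value, prev = [], 0, 0
    for b in data:
        if b < 128:
            value = value * 128 + b
        else:
            value = value * 128 + (b - 128)
            prev += value
            positions.append(prev)
            value = 0
    return positions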

As the stores are scanned, tokens from each page are streamed

into a sort, which is used to create the posting lists. The primary

sort key is on token, the secondary sort key is on the pageID, and

the tertiary sort key is on offset within a document. By sorting on

this compound key, token occurrences are effectively grouped into

ordered posting lists.

Since we use an optimized fixed-width radix sort [12, 26], we

hash the variable length tokens to produce a 64-bit token hash.

Therefore our sort key has the form (tokenID, pageID, section/offset),

where tokenID is a 64-bit hash value, pageID is 32 bits, and sec-

tion/offset within a document is 32 bits. The encoding of the sort

key is illustrated in Figure 4. In the posting list data structure the

section/offset is stored in the payload for each posting entry.

As shown in Figure 4, we use two bits to denote the section of

a document. This is so a given document can be streamed into

the sort in different sections. To index anchor text and annotation

tokens, the anchor and annotation stores are scanned and streamed

into the sort after the content store is scanned. The section bits

are used to indicate whether a token is for content, anchor text, or

annotation. Consequently, after sorting, the anchor text tokens for a

page P follow the content tokens for P, and the annotation tokens follow the anchor text.

tokenID (64 bits) | pageID (32 bits) | payload: section (2 bits) + offset (30 bits)
section: 00 = content, 01 = anchor text, 10 = annotation

Figure 4: Sort key used to sort content, anchor text, and annotation tokens

Within each page P, tokens are streamed into the sort in the order in which they appear in P, that is, in offset order. Taking advantage of the fact that radix sort is stable, this allows us to use a 96-bit sort

key that excludes offset, rather than sorting on the full 128-bit key.
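The sketch below illustrates the compound sort key and section bits described above. The hash function and the grouping code are stand-ins for illustration, not the actual Trevi radix-sort implementation.

import hashlib

SECTION_CONTENT, SECTION_ANCHOR, SECTION_ANNOTATION = 0b00, 0b01, 0b10

def token_id(token):
    """64-bit token hash standing in for the fixed-width radix-sort key component."""
    return int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def sort_key(token, page_id):
    """96-bit compound key (tokenID, pageID); offset is omitted because the sort is
    stable and tokens are streamed in offset order."""
    return (token_id(token) << 32) | (page_id & 0xFFFFFFFF)

def payload(section, offset):
    """32-bit payload: 2 section bits (content / anchor text / annotation) + 30-bit offset."""
    return (section << 30) | (offset & 0x3FFFFFFF)

def build_postings(token_stream):
    """token_stream yields (token, page_id, section, offset) in offset order per page;
    sorting on the compound key groups occurrences into ordered posting lists."""
    entries = [(sort_key(t, p), payload(s, o), t) for t, p, s, o in token_stream]
    entries.sort(key=lambda e: e[0])          # Python's sort is stable
    postings = {}
    for key, pay, tok in entries:
        postings.setdefault(tok, []).append((key & 0xFFFFFFFF, pay))  # (pageID, payload)
    return postings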

Currently, for retrieval and ranking purposes annotations are treated

as if they were anchor text. However, one can imagine a ranking

function that would treat annotations and anchor text differently.

Experimenting in this direction is one of the areas for our future

work.
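As one illustration of what treating annotations and anchor text differently could mean, the sketch below scores each field with a separate weight. The weights and the scoring form are hypothetical; they are not Trevi's ranking function.

# Hypothetical per-field weights; the actual Trevi ranking function is not described here.
FIELD_WEIGHTS = {"content": 1.0, "anchor": 2.0, "annotation": 3.0}

def field_weighted_score(query_terms, page_fields, weights=FIELD_WEIGHTS):
    """page_fields maps a field name ('content', 'anchor', 'annotation') to its token list.
    Each field contributes its term matches scaled by a field-specific weight."""
    total = 0.0
    for field, tokens in page_fields.items():
        w = weights.get(field, 1.0)
        total += w * sum(tokens.count(t) for t in query_terms)
    return total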

The process for including implicit annotations is similar, except

that the annotation store is built from search engine log files instead

of the database.

5. EXPERIMENTAL RESULTS
In this section we present experimental results for search with

explicit and implicit annotations. We implemented our approaches

on the Trevi search engine for the IBM intranet, currently searching

more than 5.5 million pages.

The explicit annotations dataset consists of 67 pages, annotated

with a total of 158 annotations by users at IBM Almaden Research

Center over a period of two weeks during summer 2005. Out of

the 67 annotated pages, 14 were contained in the Trevi index. The

implicit annotations dataset consists of annotations extracted from

Trevi log files for a period of approximately 3 months. The number

Experimental Results

•  Dataset: 5.5M-page index of the IBM intranet
•  Queries: 158 test queries with manually identified correct answers

•  Evaluation was conducted after 2 weeks of collecting the annotations


of annotated pages is 12433 for the first and the third strategies,

8563 for the second strategy, and 4126 for the fourth strategy.

Given the small number of explicit annotations we were able to

collect, relative to the size of the dataset (5.5 million pages), the

results presented below can only be viewed as very preliminary.

Nevertheless, we observed several interesting characteristics of an-

notations, which highlight their usefulness in enterprise search.

5.1 Types of Annotations
Table 1 shows examples of explicit annotations. The most typi-

cal type of annotations were concise descriptions of what a page is

about, such as the first set of annotations in Table 1. While these

annotations do not usually introduce new words into the descrip-

tion of a page, in many cases they are still able to increase the rank

of the page for relevant queries. Consider, for example, the annota-

tion "download page for Cloudscape and Derby", attached to a page with a link to the Derby download. Cloudscape is a popular IBM Java RDBMS, which was recently renamed Derby and released under an open source license. Cloudscape has been integrated into many IBM products, which resulted in frequent co-occurrence of "download" and "Cloudscape" on the same page. The renaming also led to the replacement of Cloudscape with Derby on some of the pages. As a result, neither of the queries "download Cloudscape" nor "download Derby" returns the correct page with a high rank. However,

with the above annotation added to the index, the page is ranked

much higher for both queries, because the keywords occur close to

each other in the annotation, and Trevi’s ranking function takes this

into account.

Another common type of annotations were abbreviations (the

second set in Table 1). At IBM, like at any other big company,

everything is given a formal-sounding, long name. Thus, employees

often come up with abbreviations for such names. These abbrevia-

tions, widely used in spoken language, are not always mentioned in

the content of the web pages describing the corresponding entities.

We observed that entering an abbreviation as an annotation for a

page describing a program or a service was a very common type

of annotations. Such annotations are extremely useful for search,

since they augment the page with a keyword that people frequently

use to refer to the page, but which is not mentioned in its content.

The third type of annotation reflects a person's opinion about the content of the web page. The last set of annotations

in Table 1 gives examples of such annotations. While a few an-

notations of this type that we have in our dataset do not convey

any serious meaning, such annotations have a potential to collect

opinions of people regarding services or other people, which em-

ployees do not have an opportunity to express otherwise. One can

imagine, for example, that a query like ”best physical training class

at Almaden” will indeed return as the first hit a page describing the

most popular physical training program offered to IBM Almaden

employees, because many people have annotated this page with the

keyword ”best”.

5.2 Impact on Search Quality
To evaluate the impact of annotations on search quality we gener-

ated 158 test queries by taking explicit annotations to be the queries,

and the annotated pages to be the correct answers. We used perfor-

mance of the search engine without annotations as a baseline for

our experiments. Table 2 shows the performance of explicit and

implicit annotations in terms of the percentage of queries for which

the correct answer was returned in the top 10 results.

Baseline | EA    | IA 1 | IA 2 | IA 3 | IA 4
8.9%     | 13.9% | 8.9% | 8.9% | 9.5% | 9.5%

Table 2: Summary of the results measured by the percentage

of queries for which the correct answer was returned in the top

10. EA = Explicit Annotations, IA = Implicit Annotations.
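A minimal sketch of the metric reported in Table 2, i.e., the fraction of queries whose known correct page appears in the top 10 results; the result and answer containers are hypothetical.

def success_at_10(results_by_query, correct_answer):
    """results_by_query: query -> ranked list of URLs; correct_answer: query -> URL.
    Returns the fraction of queries with the correct page in the top 10."""
    if not results_by_query:
        return 0.0
    hits = sum(1 for q, ranked in results_by_query.items()
               if correct_answer.get(q) in ranked[:10])
    return hits / len(results_by_query)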

Adding explicit annotations to the index results in a statistically significant improvement over the baseline (at the 95% level). This is expected, given the useful properties of annotations mentioned above.

However, even with explicit annotations the results are rather low.

One reason is that many of the annotations were attached to

dynamically generated pages, which are not indexed by the search

engine. As mentioned above, only 14 pages out of 67 annotated

pages were actually in the index.

Implicit annotations, on the other hand, did not show any signifi-

cant improvement over the baseline. There was also little difference

among the different implicit annotation strategies. A closer inves-

tigation showed that there was little overlap between the pages that

received implicit annotations, and the ones that were annotated ex-

plicitly. We suspect that, since the users knew that the primary goal

of annotations is to improve search quality, they mostly tried to

enter explicit annotations for pages they could not find using Trevi.

We conclude that a different experimentation approach is needed to

evaluate the true value of implicit annotations, and the differences

among the four annotation strategies.

6. RELATED WORK
There are three categories of work related to this paper: enter-

prise search, page annotations on the Web, and improving search

quality using user feedback. We discuss each of these categories in

subsequent sections.

6.1 Enterprise Search
While there are many companies that offer enterprise search so-

lutions [2, 3, 5, 7], there is surprisingly little work in this area in

the research community.

•  Want to generate personalized web page annotations based on documents on the user's Desktop

•  Suppose we have an index of Desktop documents on the user's computer (files, email, browser cache, etc.)

P-TAG: Large Scale Automatic Generation ofPersonalized Annotation TAGs for the Web

Paul - Alexandru Chirita1∗, Stefania Costache1, Siegfried Handschuh2, Wolfgang Nejdl1

1L3S Research Center / University of Hannover, Appelstr. 9a, 30167 Hannover, Germany{chirita,costache,nejdl}@l3s.de

2National University of Ireland / DERI, IDA Business Park, Lower Dangan, Galway, [email protected]

ABSTRACT
The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Upon browsing a Web page, P-TAG produces keywords relevant both to its textual content, but also to the data residing on the surfer's Desktop, thus expressing a personalized viewpoint. Empirical evaluations with several algorithms pursuing this approach showed very promising results. We are therefore very confident that such a user oriented automatic tagging approach can provide large scale personalized metadata annotations as an important step towards realizing the Semantic Web.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.5 [Information Storage and Retrieval]: On-line Information Services

General Terms
Algorithms, Experimentation, Design

Keywords
Web Annotations, Tagging, Personalization, User Desktop

1. INTRODUCTION
The World Wide Web has had a tremendous impact on society

and business in recent years by making information instantly andubiquitously available. The Semantic Web is seen as an extensionof the WWW, a vision of a future Web of machine-understandabledocuments and data. One its main instruments are the annotations,which enrich content with metadata in order to ease its automaticprocessing. The traditional paradigm of Semantic Web annotation∗Part of this work was performed while the author was visitingYahoo! Research, Barcelona, Spain.


(i.e., annotating Web sites with the help of external tools) has beenestablished for a number of years by now, for example in the formof applications such as OntoMat [20] or tools based on Annotea[23], and the process continues to develop and improve. However,this paradigm is based on manual or semi-automatic annotation,which is a laborious, time consuming task, requiring a lot of expertknow-how, and thus only applicable to small-scale or Intranet col-lections. For the overall Web though, the growth of a Semantic Weboverlay is restricted because of the lack of annotated Web pages. Inthe same time, the tagging paradigm, which has its roots in socialbookmarking and folksonomies, is becoming more and more pop-ular. A tag is a relevant keyword associated with or assigned to apiece of information (e.g., a Web page), describing the item andenabling keyword-based classification of the information it is ap-plied to. The successful application of the tagging paradigm can beseen as evidence that a lowercase semantic Web1 could be easier tograsp for the millions of Web users and hence easier to introduce,exploit and benefit from. One can then build upon this lowercasesemantic web as a basis for the introduction of more semantics,thus advancing further towards the Web 2.0 ideas.

We argue that a successful and easy achievable approach is toautomatically generate annotation tags for Web pages in a scalablefashion. We use tags in their general sense, i.e., as a mechanism toindicate what a particular document is about [4], rather than for ex-ample to organize one’s tasks (e.g., “todo”). Yet automatically gen-erated tags have the drawback of presenting only a generic view,which does not necessary reflect personal interests. For example, aperson might categorize the home page of Anthony Jameson2 withthe tags “human computer interaction” and “mobile computing”,because this reflects her research interests, while another would an-notate it with the project names “Halo 2” and “MeMo”, becauseshe is more interested in research applications.

The crucial question is then how to automatically tag Web pagesin a personalized way. In many environments, defining a user’sviewpoint would rely on the definition of an interest profile. How-ever, these profiles are laborious to create and need constant main-tenance in order to reflect the changing interest of the user. Fortu-nately, we do have a rich source of user profiling information avail-able: everything stored on her computer. This personal Desktopusually contains a very rich document corpus of personal informa-tion which can and should be exploited for user personalization!There is no need to maintain a dedicated interest profile, since the

1Lowercase semantic web refers to an evolutionary approach forthe Semantic Web by adding simple meaning gradually into thedocuments and thus lowering the barriers for re-using information.2http://www.dfki.de/˜jameson


Extracting tags from Desktop documents

•  Given a web page to annotate, the algorithm proceeds as follows (see the sketch below):
   – Step 1: Extract important keywords from the page
   – Step 2: Retrieve relevant documents using the Desktop search

   – Step 3: Extract important keywords from the retrieved documents as annotations

•  Users judged 70%–80% of annotations created using this algorithm as relevant
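A rough sketch of the three-step tagging idea above. The keyword extractor is a crude term-frequency filter and desktop_search is a hypothetical stand-in for the user's Desktop index, so this illustrates only the data flow, not P-TAG's actual extraction algorithms.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on", "with", "is"}

def keywords(text, k=10):
    """Steps 1 and 3: naive TF-based keyword extraction (the paper evaluates several
    more sophisticated extractors)."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower())
             if t not in STOPWORDS and len(t) > 2]
    return [t for t, _ in Counter(terms).most_common(k)]

def personalized_tags(web_page_text, desktop_search, k=5):
    """Step 2: use the page keywords as a query against the Desktop index
    (hypothetical desktop_search(query) -> list of document texts),
    then extract tags from the retrieved documents."""
    page_terms = keywords(web_page_text)
    desktop_docs = desktop_search(" ".join(page_terms))
    return keywords(" ".join(desktop_docs), k)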

•  When we have lots of annotations for a given page, which ones should we use?

•  This paper proposes to perform frequent itemset mining to extract recurring groups of terms from annotations (see the sketch below)
   – Show that this type of processing is useful for web page classification

   – May also be useful for improving search quality by eliminating noisy terms
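A brute-force sketch of the frequent-itemset idea, assuming each annotation is a whitespace-separated string; a real implementation would use an Apriori- or FP-growth-style miner with candidate pruning.

from collections import Counter
from itertools import combinations

def frequent_term_sets(annotations, min_support=5, max_size=3):
    """Count term sets that recur across a page's annotations; sets that meet the
    support threshold are kept, while noisy one-off terms fall below it."""
    counts = Counter()
    for ann in annotations:
        terms = sorted(set(ann.lower().split()))
        for size in range(1, max_size + 1):
            counts.update(combinations(terms, size))
    return {itemset: c for itemset, c in counts.items() if c >= min_support}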

Query-Sets: Using Implicit Feedback and Query Patternsto Organize Web Documents

Barbara PobleteWeb Research Group

University Pompeu FabraBarcelona, Spain

[email protected]

Ricardo Baeza-YatesYahoo! Research &

Barcelona Media Innovation CenterBarcelona, Spain

[email protected]

ABSTRACT

In this paper we present a new document representationmodel based on implicit user feedback obtained from searchengine queries. The main objective of this model is to achievebetter results in non-supervised tasks, such as clustering andlabeling, through the incorporation of usage data obtainedfrom search engine queries. This type of model allows usto discover the motivations of users when visiting a certaindocument. The terms used in queries can provide a betterchoice of features, from the user’s point of view, for summa-rizing the Web pages that were clicked from these queries. Inthis work we extend and formalize as query model an exist-ing but not very well known idea of query view for documentrepresentation. Furthermore, we create a novel model basedon frequent query patterns called the query-set model. Ourevaluation shows that both query-based models outperformthe vector-space model when used for clustering and labelingdocuments in a website. In our experiments, the query-setmodel reduces by more than 90% the number of featuresneeded to represent a set of documents and improves byover 90% the quality of the results. We believe that thiscan be explained because our model chooses better featuresand provides more accurate labels according to the user’sexpectations.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Clustering; H.2.8 [Information Systems]: Data Mining; H.4.m [Information Systems Applications]: Miscella-neous

General Terms

Algorithms, Experimentation, Human Factors

Keywords

Feature Selection, Labeling, Search Engine Queries, UsageMining, Web Page Organization

1. INTRODUCTION
As the Web's contents grow, it becomes increasingly dif-

ficult to manage and classify its information. Optimal or-ganization is especially important for websites, for example,


where classification of documents into cohesive and relevanttopics is essential to make a site easier to navigate and moreintuitive to its visitors. The high level of competition in theWeb makes it necessary for websites to improve their orga-nization in a way that is both automatic and effective, sousers can reach effortlessly what they are looking for. Webpage organization has other important applications. Searchengine results can be enhanced by grouping documents intosignificant topics. These topics can allow users to disam-biguate or specify their searches quickly. Moreover, searchengines can personalize their results for users by rankinghigher the results that match the topics that are relevant tousers’ profiles. Other applications that can benefit from au-tomatic topic discovery and classification are human editeddirectories, such as DMOZ1 or Yahoo!2. These directoriesare increasingly hard to maintain as the contents of the Webgrow. Also, automatic organization of Web documents isvery interesting from the point of view of discovering newinteresting topics. This would allow to keep up with user’strends and changing interests.

The task of automatically clustering, labeling and clas-sifying documents in a website is not an easy one. Usuallythese problems are approached in a similar way for Web doc-uments and for plain text documents, even if it is known thatWeb documents contain richer and, sometimes, implicit in-formation associated to them. Traditionally, documents arerepresented based on their text, or in some cases, also usingsome kind of structural information of Web documents.

There are two main types of structural information thatcan be found in Web documents: HTML formatting, whichsometimes allows to identify important parts of a document,such as title and headings, and link information betweenpages [15]. The formatting information provided by HTMLis not always reliable, because tags are more often used forstyling purposes than for content structuring. Informationgiven by links, although useful for general Web documents,is not of much value when working with documents from aparticular website, because in this case, we cannot assumethat this data has any objectiveness, i.e.: any informationextracted from the site’s structure about that same site, isa reflection of the webmaster’s criteria, which provides nowarranty of being thorough or accurate, and might be com-pletely arbitrary. A clear example of this, is that manywebsites that have large amounts of contents, use some kindof content management system and/or templates, that givepractically the same structure to all pages and links within

1 http://www.dmoz.org
2 http://www.yahoo.com


Summary

•  User Annotations can help improve search quality in the Enterprise

•  Annotations can be collected by explicitly asking users to provide them, or by mining query logs and users' Desktop contents

•  Post-processing the resulting annotations may help to improve the search quality

Social Annotations

Tagging
•  Easy way for the users to annotate web objects
•  People do it (no one really knows why)

Tagging

•  Very popular on the Web, becoming more and more popular in the Enterprise
   – Users add tags to objects (pages, pictures, messages, etc.)

   – Tagging system keeps track of <user, obj, tag> triples and mines/organizes this information for presenting it to the user (more in Lecture 3)

•  In this lecture we will see how tags can be used to improve search in the enterprise web

Using Tagging to Improve Search

•  Approach 1: Merge tags with content or anchor text

•  Approach 2: Keep tags separate and rank query results by
   α × content_match + (1 – α) × tag_match   (see the sketch below)

•  Other approaches: explore the social/collaborative properties of tags
   – Give more weight to some users and tags vs. others
   – Compute similarities between tags and documents and incorporate them into ranking
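A minimal sketch of Approach 2's interpolation. Plain term-overlap fractions stand in for the content and tag match scores, and the page dictionary fields are assumptions for illustration.

def blended_score(query, page, alpha=0.7):
    """Rank by alpha * content_match + (1 - alpha) * tag_match, where both matches
    are simple query-term overlap fractions and alpha is a tuning parameter."""
    q = set(query.lower().split())
    if not q:
        return 0.0
    content_match = len(q & set(page["content"].lower().split())) / len(q)
    tag_match = len(q & set(t.lower() for t in page["tags"])) / len(q)
    return alpha * content_match + (1 - alpha) * tag_match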

•  Observation: similar (semantically related) annotations are usually assigned to similar (semantically related) web pages
   – The similarity among annotations can be identified by the similar web pages they are assigned to

   – The similarity among web pages can be identified by the similar annotations they are annotated with

•  Proposed iterative algorithm to compute these similarities and use them to improve ranking

!"#$%$&$'()*+,)-+./01)23$'()-40$.5)6''4#.#$4'3)!"#$%"&'()'*+,-(./'*0&'$(1&+,-()#$(2#/3-(4&/5*$%(.&#+-(6"*$%(!&3-('$7(8*$%(8&+(

(+!"'$%"'/(9/'*:*$%(;$/<#5=/>0(!"'$%"'/-(3??3@?-(A"/$'(

B=""C'*-(D&E0-(%5E&#-(00&FG'H#EI=J>&I#7&IK$(

((((3L)M(A"/$'(N#=#'5K"(O'C()#/J/$%-(+???P@-(A"/$'(

BQ#/C#$-(=&R"*$%FGK$I/CSIK*S(

!

!"#$%!&$!"#$%!&'&()!(*&+,)(%!-#(!.%(!,/!%,0$'+!'11,-'-$,1%!-,!$2&),3(!4(5!

%(')0#6! 7,4'8'9%:! 2'19! %()3$0(%:! (6;6! 8(+6$0$,6.%:! #'3(! 5((1!

8(3(+,&(8!/,)!4(5!.%()%!-,!,);'1$<(!'18!%#')(!-#($)!/'3,)$-(!4(5!

&';(%! ,1! +$1(! 59! .%$1;! %,0$'+! '11,-'-$,1%6!=(! ,5%()3(! -#'-! -#(!

%,0$'+! '11,-'-$,1%! 0'1! 5(1(/$-!4(5! %(')0#! $1! -4,! '%&(0-%>! ?@! -#(!

'11,-'-$,1%! ')(! .%.'++9! ;,,8! %.22')$(%! ,/! 0,))(%&,18$1;! 4(5!

&';(%A!B@!-#(!0,.1-!,/!'11,-'-$,1%!$18$0'-(%!-#(!&,&.+')$-9!,/!4(5!

&';(%6! "4,! 1,3(+! '+;,)$-#2%! ')(! &),&,%(8! -,! $10,)&,)'-(! -#(!

'5,3(! $1/,)2'-$,1! $1-,! &';(! )'1C$1;>! ?@! D,0$'+D$2E'1C! FDDE@!

0'+0.+'-(%! -#(! %$2$+')$-9! 5(-4((1! %,0$'+! '11,-'-$,1%! '18! 4(5!

G.()$(%A!B@!D,0$'+H';(E'1C!FDHE@!0'&-.)(%!-#(!&,&.+')$-9!,/!4(5!

&';(%6! H)(+$2$1')9! (*&()$2(1-'+! )(%.+-%! %#,4! -#'-! DDE! 0'1! /$18!

-#(! +'-(1-! %(2'1-$0! '%%,0$'-$,1! 5(-4((1!G.()$(%! '18! '11,-'-$,1%:!

4#$+(!DHE!%.00(%%/.++9!2('%.)(%!-#(!G.'+$-9!F&,&.+')$-9@!,/!'!4(5!

&';(! /),2! -#(! 4(5! .%()%I! &()%&(0-$3(6! =(! /.)-#()! (3'+.'-(! -#(!

&),&,%(8! 2(-#,8%! (2&$)$0'++9! 4$-#! JK! 2'1.'++9! 0,1%-).0-(8!

G.()$(%! '18! LKKK! '.-,M;(1()'-(8! G.()$(%! ,1! '! 8'-'%(-! 0)'4+(8!

/),2! 8(+6$0$,6.%6! N*&()$2(1-%! %#,4! -#'-! 5,-#! DDE! '18! DHE!

5(1(/$-!4(5!%(')0#!%$;1$/$0'1-+96!!

&'()*+,-)./'01/#234)5(/6).5,-7(+,.!O6L6L!P809+,:'(-+0/#;.():.Q>!R1/,)2'-$,1!D(')0#!'18!E(-)$(3'+6!

<)0),'=/$),:.>!S+;,)$-#2%:!N*&()$2(1-'-$,1:!O.2'1!T'0-,)%6!

?);@+,1.>/D,0$'+!'11,-'-$,1:!%,0$'+!&';(!)'1C:!%,0$'+!%$2$+')$-9:!4(5!%(')0#:!(3'+.'-$,1!

AB) 8C$%D6E&$8DC/U3()!-#(!&'%-!8(0'8(:!2'19!%-.8$(%!#'3(!5((1!8,1(!,1!$2&),3$1;!

-#(! G.'+$-9! ,/! 4(5! %(')0#6! V,%-! ,/! -#(2! 0,1-)$5.-(! /),2! -4,!

'%&(0-%>! ?@! ,)8()$1;! -#(! 4(5! &';(%! '00,)8$1;! -,! -#(! G.()9M

8,0.2(1-! %$2$+')$-96! D-'-(M,/M-#(M')-! -(0#1$G.(%! $10+.8(! '10#,)!

-(*-! ;(1()'-$,1! PB?:! BW:! LXQ:! 2(-'8'-'! (*-)'0-$,1! PLYQ:! +$1C!

'1'+9%$%! PLXQ:! '18! %(')0#! +,;! 2$1$1;! P?KQA! B@! ,)8()$1;! -#(! 4(5!

&';(%! '00,)8$1;! -,! -#($)! G.'+$-$(%6! R-! $%! '+%,! C1,41! '%! G.()9M

$18(&(18(1-!)'1C$1;:!,)!%-'-$0!)'1C$1;6!T,)!'!+,1;!-$2(:!-#(!%-'-$0!

)'1C$1;! $%! 8()$3(8! 5'%(8! ,1! +$1C! '1'+9%$%:! (6;6:! H';(E'1C! P?YQ:!

OR"D! P?JQ6! E(0(1-+9:! -#(! /('-.)(%! ,/! 0,1-(1-! +'9,.-:! .%()! 0+$0CM

-#),.;#%! (-06! ')(! '+%,! (*&+,)(8:! (6;6:! /E'1CP?ZQ6! [$3(1! '! G.()9:!

-#(! )(-)$(3(8! )(%.+-%! ')(! )'1C(8! 5'%(8! ,1! 5,-#! &';(! G.'+$-9! '18!

G.()9M&';(!%$2$+')$-96!

E(0(1-+9:!4$-#!-#(!)$%(!,/!=(5!B6K!-(0#1,+,;$(%:!4(5!.%()%!4$-#!

8$//()(1-! 5'0C;),.18%! ')(! 0)('-$1;! '11,-'-$,1%! /,)!4(5! &';(%! '-!

'1! $10)(8$5+(! %&((86! T,)! (*'2&+(:! -#(! /'2,.%! %,0$'+! 5,,C2')C!

4(5! %$-(:! 8(+6$0$,6.%! PXQ! F#(10(/,)-#! )(/())(8! -,! '%! \](+$0$,.%^@:!

#'%! 2,)(! -#'1! ?! 2$++$,1! )(;$%-()(8! .%()%! %,,1! '/-()! $-%! -#$)8!

5$)-#8'9:! '18! -#(! 1.25()! ,/! ](+$0$,.%! .%()%! #'3(! $10)('%(8! 59!

2,)(!-#'1!BKK_!$1!-#(!&'%-!1$1(!2,1-#%!P?LQ6!D,0$'+!'11,-'-$,1%!

')(!(2();(1-!.%(/.+!$1/,)2'-$,1!-#'-!0'1!5(!.%(8!$1!3')$,.%!4'9%6!

D,2(!4,)C!#'%!5((1!8,1(!,1!(*&+,)$1;!-#(!%,0$'+!'11,-'-$,1%!/,)!

/,+C%,1,29!PBQ:!3$%.'+$<'-$,1!P?WQ:!%(2'1-$0!4(5!PL`Q:!(1-()&)$%(!

%(')0#! PBLQ! (-06! O,4(3():! -,! -#(! 5(%-! ,/! ,.)! C1,4+(8;(:! +$--+(!

4,)C!#'%!5((1!8,1(!,1!$1-(;)'-$1;!-#$%!3'+.'5+(!$1/,)2'-$,1!$1-,!

4(5!%(')0#6!O,4!-,!.-$+$<(!-#(!'11,-'-$,1%!(//(0-$3(+9!-,!$2&),3(!

4(5!%(')0#!5(0,2(%!'1!$2&,)-'1-!&),5+(26!!

R1! -#$%! &'&():! 4(! %-.89! -#(! &),5+(2! ,/! .-$+$<$1;! %,0$'+!

'11,-'-$,1%! /,)! 5(--()! 4(5! %(')0#:! 4#$0#! $%! '+%,! )(/())(8! -,! '%!

\%,0$'+!%(')0#^!/,)!%$2&+$0$-96!V,)(!%&(0$/$0'++9:!4(!,&-$2$<(!4(5!

%(')0#! 59! .%$1;! %,0$'+! '11,-'-$,1%! /),2! -#(! /,++,4$1;! -4,!

'%&(0-%>!!!

!( #-:-=',-(;/ ,'0F-0*:! 4#$0#! 2('1%! -#(! (%-$2'-(8!

%$2$+')$-9! 5(-4((1! '! G.()9! '18! '! 4(5! &';(6! "#(!

'11,-'-$,1%:! &),3$8(8! 59! 4(5! .%()%! /),2! 8$//()(1-!

&()%&(0-$3(%:! ')(! .%.'++9! ;,,8! %.22')$(%! ,/! -#(!

0,))(%&,18$1;! 4(5! &';(%6! T,)! (*'2&+(:! -#(! -,&! J!

'11,-'-$,1%! ,/! S2'<,1I%! #,2(&';( ? !$1! ](+$0$,.%! ')(!

"#$%%&'()!*$$+")!,-,.$')!-/"&0!'18!"1$23)!4#$0#!8(&$0-!

-#(! &';(! ,)! (3(1! -#(! 4#,+(! 4(5%$-(! (*'0-+96! "#(%(!

'11,-'-$,1%! &),3$8(! '! 1(4! 2(-'8'-'! /,)! -#(! %$2$+')$-9!

0'+0.+'-$,1! 5(-4((1! '! G.()9! '18! '!4(5! &';(6!O,4(3():!

/,)! '! %&(0$/$0! 4(5! &';(:! -#(! '11,-'-$,1! 8'-'! 2'9! 5(!

%&')%(! '18! $10,2&+(-(6!"#()(/,)(:! '!2'-0#$1;!;'&! (*$%-%!

5(-4((1! -#(! '11,-'-$,1%! '18! G.()$(%! F(6;6:! 5(-4((1!

\%#,&^!'18!\%#,&&$1;^@6!O,4!-,!5)$8;(!-#(!;'&!)(2'$1%!'!

0).0$'+! &),5+(2! $1! /.)-#()! $2&),3$1;! -#(! %$2$+')$-9!

)'1C$1;6! =(! &),&,%(! '! 1(4! %$2$+')$-9! (%-$2'-$,1!

'+;,)$-#2:!D,0$'+D$2E'1C!FDDE@!-,!'88)(%%!-#$%!&),5+(26!

!( #('(-5/ ,'0F-0*:! -#(!'2,.1-!,/!'11,-'-$,1%!'%%$;1(8! -,!'!

&';(! $18$0'-(%! $-%! &,&.+')$-9! '18! $2&+$(%! $-%! G.'+$-9! $1!

%,2(!%(1%(:!9(-!-)'8$-$,1'+!%-'-$0!)'1C$1;!'+;,)$-#2%!%.0#!

'%!H';(E'1C!#'3(!1,!4'9!-,!2('%.)(!-#$%!C$18!,/!G.'+$-96!

T,)! (*'2&+(:! -#,.%'18%! ,/! ](+$0$,.%! .%()%! 0,++(0-! -#(!

&,&.+')!&'--()1!$1-),8.0-$,1!&';(B!'%!-#($)!/'3,)$-(!4$-#!'!

3')$(-9!,/! '11,-'-$,1%:!5.-! -#$%! %$-(! $%! ;$3(1! '!H';(E'1C!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!

a! H')-! ,/! D#(1;#.'! b',! '18!c$',9.'1!=.d%!4,)C! ,/! -#$%! &'&()!

4'%!0,18.0-(8!$1!RbV!e#$1'!E(%(')0#!f'56!

?!#--&>gg4446'2'<,160,2!

B!!#--&>ggLY%$;1'+%60,2g&'&()%g$1-),-,&'--()1%g!

!

e,&9)$;#-! $%!#(+8!59! -#(! R1-()1'-$,1'+!=,)+8!=$8(!=(5!e,1/()(10(!

e,22$--((! FR=LeB@6! ]$%-)$5.-$,1! ,/! -#(%(! &'&()%! $%! +$2$-(8! -,!

0+'%%),,2!.%(:!'18!&()%,1'+!.%(!59!,-#()%6!

444!5667:!V'9!Wh?B:!BKKY:!b'1//:!S+5()-':!e'1'8'6!

SeV!ZYWM?MJZJZLM`JXMYgKYgKKKJ6!

WWW 2007 / Track: Search Session: Search Quality and Precision

501

Similarity of annotations ai and aj
Similarity of pages pi and pj
Sum over all pairs of pages annotated with ai or aj
Sum over all pairs of annotations assigned to ai or aj
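A simplified sketch of this kind of iterative similarity propagation over the annotation-page bipartite graph. It uses uniform weights and omits the user-count-based propagation adjustment of the original SocialSimRank, so it only illustrates the structure of the computation, not the paper's exact update rules.

import numpy as np

def social_sim_rank(M_ap, c_a=0.7, c_p=0.7, iterations=20):
    """M_ap[i, j] > 0 means annotation i was assigned to page j.
    Returns (S_a, S_p): annotation-annotation and page-page similarity matrices."""
    n_a, n_p = M_ap.shape
    S_a, S_p = np.eye(n_a), np.eye(n_p)
    pages_of = [np.nonzero(M_ap[i])[0] for i in range(n_a)]      # pages annotated with a_i
    anns_of = [np.nonzero(M_ap[:, j])[0] for j in range(n_p)]    # annotations of page p_j
    for _ in range(iterations):
        new_S_a = np.eye(n_a)
        for i in range(n_a):
            for j in range(i + 1, n_a):
                Pi, Pj = pages_of[i], pages_of[j]
                if len(Pi) and len(Pj):
                    s = S_p[np.ix_(Pi, Pj)].sum() / (len(Pi) * len(Pj))
                    new_S_a[i, j] = new_S_a[j, i] = c_a * s
        new_S_p = np.eye(n_p)
        for i in range(n_p):
            for j in range(i + 1, n_p):
                Ai, Aj = anns_of[i], anns_of[j]
                if len(Ai) and len(Aj):
                    s = S_a[np.ix_(Ai, Aj)].sum() / (len(Ai) * len(Aj))
                    new_S_p[i, j] = new_S_p[j, i] = c_p * s
        S_a, S_p = new_S_a, new_S_p
    return S_a, S_p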

!"# $%&'# ()$'$*# $%&'# +),'$#&)-# (%".)/"# %"0-# .1'# )""%.)./%"#

2!"!#$!3# '4,4*# 5'6# +),'# %4# 71'"# ,/8'"# .1'# 9:';-# (%".)/"/",#

2&'#!(3*# .1'# +),'# .1).# %"0-# 1)$# 2!"!#$!3# &)-# 6'# </0.';'=# %:.#

/&+;%+';0-# 6-# $/&+0'# .';&>&).(1/",# &'.1%=4# ?8'"# /<# .1'# +),'#

(%".)/"$#6%.1#)""%.)./%"$#2!"!#$!3#)"=#2&'#!(3*#/.#/$#"%.#+;%+';#.%#

()0(:0).'#.1'#$/&/0);/.-#6'.5''"#.1'#9:';-#)"=#.1'#=%(:&'".#:$/",#

.1'# @'-5%;=# 2&'#!(3# %"0-4#A"# 'B+0%;)./%"# %<# $/&/0);/.-# 6'.5''"#

2!"!#$!3#)"=#2&'#!(3#&)-#<:;.1';#/&+;%8'#.1'#+),'#;)"@/",4#

7%#'B+0%;'#.1'#)""%.)./%"$#5/.1#$/&/0);#&')"/",$*#5'#6:/0=#)#

6/+);./.'>,;)+1#6'.5''"#$%(/)0#)""%.)./%"$#)"=#5'6#+),'$#5/.1#/.$#

'=,'$# /"=/()./",# .1'# :$';# (%:".4# A$$:&'# .1).# .1';'# );'# )*#

)""%.)./%"$*#)+#5'6#+),'$#)"=#),#5'6#:$';$4#-*+# /$# .1'#)*C)+#

)$$%(/)./%"# &).;/B# 6'.5''"# )""%.)./%"$# )"=# +),'$4# -*+D%(./0E#

='"%.'$#.1'#":&6';#%<#:$';$#51%#)$$/,"#)""%.)./%"#%(#.%#+),'#/04#

F'../",12*16'#.1'#)*C)*#&).;/B#51%$'#'0'&'".#2*D%'*#%3E#/"=/().'$#

.1'# $/&/0);/.-# $(%;'# 6'.5''"# )""%.)./%"$#%'# )"=#%3# )"=#2+16'# .1'1

)+C)+# &).;/B# ')(1# %<# 51%$'# '0'&'".# $.%;'$# .1'# $/&/0);/.-##

6'.5''"# .5%# 5'6# +),'$*# 5'# +;%+%$'# G%(/)0G/&H)"@DGGHE*# )"#

/.';)./8'# )0,%;/.1&# .%# 9:)"./.)./8'0-# '8)0:).'# .1'# $/&/0);/.-#

6'.5''"#)"-#.5%#)""%.)./%"$4#

!"#$%&'()*+,*-$.&/"-&)0/12*3--04*

-'56*+# !"#$%&&&&&&'($&2*I#D%'*#%3E#J#K#<%;#')(1#%'1J#%3#%.1';5/$'#I#

2+I#D/'*#/3E#J#K#<%;#')(1#/'1J#/3#%.1';5/$'#I&

-'56*7* )*&8&

+*,&(-./&)""%.)./%"#+)/;#D%'.1%3E&0*&

#

! !" "

#"

ELDL

K

ELDL

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3%+

4

%+

#

3#'4

5

+

#3*+4'*+

#3*+4'*+

3'

*3'

5

*

%+%+2/%-/%-

/%-/%-

%+%+

6%%2

1111*& DME

* +*,&(-./&+),'#+)/;#D1/'.1/3E#0*& #

! !" "

#

#"

LEDL

K

LEDL

K

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3/*

4

/*

#

3#'45*

3#*+'4*+

3#*+'4*+

3'

+3'

5+

/*/*2/%-/%-

/%-/%-

/*/*

6//2

*& DNE

* 91"$#2#2*D%'*#%3E#(%"8';,'$4&#

-'56*:# 34$54$%&2*D%'*#%3E#

O';'*# 6*# )"=# 6+# ='"%.'# .1'# =)&+/",# <)(.%;$# %<# $/&/0);/.-#

+;%+),)./%"#<%;#)""%.)./%"$#)"=#5'6#+),'$*#;'$+'(./8'0-4#+D%'E# /$#

.1'#$'.#%<#5'6#+),'$#)""%.).'=#5/.1#)""%.)./%"#%'##)"=#*D/3E##/$#.1'#

$'.# %<# )""%.)./%"$#,/8'"# .%# +),'#/371+48%'9# ='"%.'$# .1'#4.1# +),'#

)""%.).'=#6-#%'#)"=#*48/'9#='"%.'$#.1'#4.1#)""%.)./%"#)$$/,"'=#.%#

+),'#/'4#!"#%:;#'B+';/&'".$*#6%.1#6*#)"=#6+1);'#$'.#.%#I4P4#

Q%.'#.1).#.1'#$/&/0);/.-#+;%+),)./%"#;).'#/$#)=R:$.'=#)((%;=/",#

.%# .1'# ":&6';# %<# :$';$# 6'.5''"# .1'# )""%.)./%"# )"=# 5'6# +),'4#

7)@'#?9:)./%"#DME# <%;#)"#'B)&+0'*# .1'#&)B/&)0#+;%+),)./%"#;).'#

()"#6'#)(1/'8'=#%"0-#/<#-*+D%'.1/4E#/$#'9:)0#.%#-*+1D%3.1/#E4#S/,:;'#

MD6E# $1%5$# .1'# </;$.# /.';)./%"T$# GGH# ;'$:0.# %<# .1'# $)&+0'# =).)#

51';'#6*#)"=#6+1);'#$'.#.%#K4#

71'#(%"8';,'"('#%<# .1'#)0,%;/.1&#()"#6'#+;%8'=# /"#)#$/&/0);#

5)-# )$#G/&H)"@# UVW4# S%;# ')(1# /.';)./%"*# .1'# ./&'# (%&+0'B/.-# %<#

.1'# GGH# )0,%;/.1&# /$# :D)*M)+

ME4# X/.1/"# .1'# =).)# $'.# %<# %:;#

'B+';/&'".*#6%.1#.1'#)""%.)./%"#)"=#5'6#+),'#$/&/0);/.-#&).;/('$#

);'#9:/.'# $+);$'# )"=# .1'# )0,%;/.1&#(%"8';,'$# 9:/(@0-4#Y:.# /<# .1'#

$()0'#%<#$%(/)0#)""%.)./%"$#@''+$#,;%5/",#'B+%"'"./)00-*#.1'#$+''=#

%<#(%"8';,'"('#<%;#%:;#)0,%;/.1&$#&)-#$0%5#=%5"4#7%#$%08'#.1/$#

+;%60'&*# 5'# ()"# :$'# $%&'# %+./&/Z)./%"# $.;).',-# $:(1# )$#

/"(%;+%;)./",# .1'# &/"/&)0# (%:".# ;'$.;/(./%"# UVW# .%# &)@'# .1'#

)0,%;/.1&#(%"8';,'#&%;'#9:/(@0-4#

F'../",#;J[;K*;M*\*;#]#6'#)#9:';-#51/(1#(%"$/$.$#%<1##9:';-#

.';&$#)"=#*D/EJ[%K*%M*\*#%4]#6'#.1'#)""%.)./%"#$'.#%<#5'6#+),'#/*#

?9:)./%"#D^E#$1%5$#.1'#$/&/0);/.-#()0(:0)./%"#&'.1%=#6)$'=#%"#.1'#

G%(/)0G/&H)"@4##

!!" "

"#

'

4

3

3'*22< %;2/;='4K K

E*DE*D# *# D^E

:;:! </#5*=>/"&'?*@A'&)/'&$1*BA&1#*-$.&/"*

!11$'/'&$1A*?B/$./",# $.)./(# ;)"@/",# &'.1%=$# :$:)00-# &')$:;'# +),'$T# 9:)0/.-#

<;%&#.1'#>?"1/%@?1AB?%$CB=D*#%;#.1'#=?%BAE1?#@'#?1!=?B=D1+%/".#%<#

8/'54#H'()00#.1).#/"#S/,:;'#K*#.1'#'$./&)./%"#%<#_),'H)"@#UKPW#/$#

$:6R'(.#.%#>?"1AB?%$CB=*#)"=#.1'#<H)"@#UKVW#/$#()0(:0).'=#6)$'=#%"#

6%.1#>?"1/%@?1AB?%$CB=# )"=# =?%BAE1 ?#@'#?1!=?B=T# )(./8/./'$4#71'#

$%(/)0#)""%.)./%"$#);'#.1'#"'5#/"<%;&)./%"#.1).#()"#6'#:./0/Z'=#.%#

()+.:;'# .1'# 5'6# +),'$T# 9:)0/.-# <;%&# >?"1 /%@?1 %##C$%$CB=D1

+';$+'(./8'4#

F7F7G! 2CA'%&+%@?<%#51*&@CB'$E41CDA5%E/'&$1* 7,* H'@E1 ;!%&'$01 >?"1 /%@?=1 %B?1 !=!%&&01 /C/!&%B&01

%##C$%$?I1%#I1/C/!&%B*>?"1/%@?=.1!/J$CJI%$?1>?"1!=?B=1%#I1EC$1

=CA'%&1 %##C$%$'C#=1 E%K?1 $E?1 LC&&C>'#@1 B?&%$'C#=M1 /C/!&%B1 >?"1

/%@?=1 %B?1 "CC54%B5?I1 "014%#01 !/J$CJI%$?1 !=?B=1 %#I1 %##C$%$?I1

"01 EC$1 %##C$%$'C#=N1 !/J$CJI%$?1 !=?B=1 &'5?1 $C1 "CC54%B51 /C/!&%B1

/%@?=1 %#I1 !=?1 EC$1 %##C$%$'C#=N1 EC$1 %##C$%$'C#=1 %B?1 !=?I1 $C1

%##C$%$?1/C/!&%B1>?"1/%@?=1%#I1!=?I1"01!/J$CJI%$?1!=?B=7##

Y)$'=#%"#.1'#%6$';8)./%"#)6%8'*#5'#+;%+%$'#)#"%8'0#)0,%;/.1&*#

")&'0-#G%(/)0_),'H)"@#DG_HE#.%#9:)"./.)./8'0-#'8)0:).'#.1'#+),'#

9:)0/.-#D+%+:0);/.-E#/"=/().'=#6-#$%(/)0#)""%.)./%"$4#71'#/".:/./%"#

6'1/"=# .1'# )0,%;/.1&# /$# .1'#&:.:)0# '"1)"('&'".# ;'0)./%"# )&%",#

+%+:0);* 5'6# +),'$*# :+>.%>=).'# 5'6# :$';$# )"=# 1%.# $%(/)0#

)""%.)./%"$4# S%00%5/",*# .1'# +%+:0);/.-# %<# 5'6# +),'$*# .1'# :+>.%>

=).'"'$$# %<# 5'6# :$';$# )"=# .1'# 1%."'$$# %<# )""%.)./%"$# );'# )00#

;'<';;'=#.%1)$1/C/!&%B'$01<%;#$/&+0/(/.-4##

A$$:&'# .1).# .1';'# );'#)*# )""%.)./%"$*#)+#5'6#+),'$#)"=#),#

5'6# :$';$4# F'.#-+,# 6'# .1'# )+C),1 )$$%(/)./%"# &).;/B# 6'.5''"#

+),'$# )"=# :$';$*1 -*+1 6'# .1'#)*C)+1 )$$%(/)./%"#&).;/B# 6'.5''"#

)""%.)./%"$# )"=# +),'$# )"=# -,*..1'1 ),C)*1 )$$%(/)./%"# &).;/B#

6'.5''"# :$';$# )"=# )""%.)./%"$4# ?0'&'".#-+,# D/'*!3E# /$# )$$/,"'=#

5/.1#.1'#(%:".#%<#)""%.)./%"$#:$'=#6-#:$';#!3#.%#)""%.).'#+),'#/'4#

?0'&'".$#%<#-*+1)"=1-,*#);'# /"/./)0/Z'=#$/&/0);0-4#F'.#+I#6'# .1'#

8'(.%;# (%".)/"/",# ;)"=%&0-# /"/./)0/Z'=# G%(/)0_),'H)"@# $(%;'$4##

`'.)/0$#%<#.1'#G%(/)0_),'H)"@#)0,%;/.1&#);'#+;'$'".'=#)$#<%00%5$4##

!"#$%&'()*7,*-$.&/"</#50/12*3-<04*

-'56*+*!"54$%&

A$$%(/)./%"# &).;/('$1 -+,*1 -*+.1 )"=1 -,** )"=# .1'#

;)"=%&#/"/./)0#G%(/)0_),'H)"@#$(%;'#+I##

WWW 2007 / Track: Search Session: Search Quality and Precision

504

Using Annotation Similarity for Ranking

•  Given a query q = {q1, …, qn}, a page p, and a set of annotations A(p) = {a1, …, am}, the "social similarity" of q and p can be computed by summing the annotation similarity over all query-term/annotation pairs: SimSSR(q, p) = Σi Σj SA(qi, aj)

•  Combine different ranking features using RankSVM (Joachims02) — see the sketch below

*See (Xu 07) for how to use annotation similarity in a Language Modeling framework
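A sketch of computing the social similarity as the sum of annotation similarities over query-term/annotation pairs, and of combining ranking features linearly. The annotation index and similarity matrix names are assumptions, and the hand-set linear weights merely stand in for a learned RankSVM model.

import numpy as np

def social_similarity(query_terms, page_annotations, ann_index, S_a):
    """Sum S_a between every query term and every annotation of the page;
    terms outside the annotation vocabulary contribute 0."""
    total = 0.0
    for q in query_terms:
        qi = ann_index.get(q)
        if qi is None:
            continue
        for a in page_annotations:
            aj = ann_index.get(a)
            if aj is not None:
                total += S_a[qi, aj]
    return total

def ranking_score(features, weights):
    """Linear combination of ranking features (content match, social similarity,
    PageRank, SocialPageRank, ...); a trained RankSVM would supply the weights."""
    return float(np.dot(features, weights))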

!"# $%&'# ()$'$*# $%&'# +),'$#&)-# (%".)/"# %"0-# .1'# )""%.)./%"#

2!"!#$!3# '4,4*# 5'6# +),'# %4# 71'"# ,/8'"# .1'# 9:';-# (%".)/"/",#

2&'#!(3*# .1'# +),'# .1).# %"0-# 1)$# 2!"!#$!3# &)-# 6'# </0.';'=# %:.#

/&+;%+';0-# 6-# $/&+0'# .';&>&).(1/",# &'.1%=4# ?8'"# /<# .1'# +),'#

(%".)/"$#6%.1#)""%.)./%"$#2!"!#$!3#)"=#2&'#!(3*#/.#/$#"%.#+;%+';#.%#

()0(:0).'#.1'#$/&/0);/.-#6'.5''"#.1'#9:';-#)"=#.1'#=%(:&'".#:$/",#

.1'# @'-5%;=# 2&'#!(3# %"0-4#A"# 'B+0%;)./%"# %<# $/&/0);/.-# 6'.5''"#

2!"!#$!3#)"=#2&'#!(3#&)-#<:;.1';#/&+;%8'#.1'#+),'#;)"@/",4#

7%#'B+0%;'#.1'#)""%.)./%"$#5/.1#$/&/0);#&')"/",$*#5'#6:/0=#)#

6/+);./.'>,;)+1#6'.5''"#$%(/)0#)""%.)./%"$#)"=#5'6#+),'$#5/.1#/.$#

'=,'$# /"=/()./",# .1'# :$';# (%:".4# A$$:&'# .1).# .1';'# );'# )*#

)""%.)./%"$*#)+#5'6#+),'$#)"=#),#5'6#:$';$4#-*+# /$# .1'#)*C)+#

)$$%(/)./%"# &).;/B# 6'.5''"# )""%.)./%"$# )"=# +),'$4# -*+D%(./0E#

='"%.'$#.1'#":&6';#%<#:$';$#51%#)$$/,"#)""%.)./%"#%(#.%#+),'#/04#

F'../",12*16'#.1'#)*C)*#&).;/B#51%$'#'0'&'".#2*D%'*#%3E#/"=/().'$#

.1'# $/&/0);/.-# $(%;'# 6'.5''"# )""%.)./%"$#%'# )"=#%3# )"=#2+16'# .1'1

)+C)+# &).;/B# ')(1# %<# 51%$'# '0'&'".# $.%;'$# .1'# $/&/0);/.-##

6'.5''"# .5%# 5'6# +),'$*# 5'# +;%+%$'# G%(/)0G/&H)"@DGGHE*# )"#

/.';)./8'# )0,%;/.1&# .%# 9:)"./.)./8'0-# '8)0:).'# .1'# $/&/0);/.-#

6'.5''"#)"-#.5%#)""%.)./%"$4#

!"#$%&'()*+,*-$.&/"-&)0/12*3--04*

-'56*+# !"#$%&&&&&&'($&2*I#D%'*#%3E#J#K#<%;#')(1#%'1J#%3#%.1';5/$'#I#

2+I#D/'*#/3E#J#K#<%;#')(1#/'1J#/3#%.1';5/$'#I&

-'56*7* )*&8&

+*,&(-./&)""%.)./%"#+)/;#D%'.1%3E&0*&

#

! !" "

#"

ELDL

K

ELDL

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3%+

4

%+

#

3#'4

5

+

#3*+4'*+

#3*+4'*+

3'

*3'

5

*

%+%+2/%-/%-

/%-/%-

%+%+

6%%2

1111*& DME

* +*,&(-./&+),'#+)/;#D1/'.1/3E#0*& #

! !" "

#

#"

LEDL

K

LEDL

K

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3/*

4

/*

#

3#'45*

3#*+'4*+

3#*+'4*+

3'

+3'

5+

/*/*2/%-/%-

/%-/%-

/*/*

6//2

*& DNE

* 91"$#2#2*D%'*#%3E#(%"8';,'$4&#

-'56*:# 34$54$%&2*D%'*#%3E#

O';'*# 6*# )"=# 6+# ='"%.'# .1'# =)&+/",# <)(.%;$# %<# $/&/0);/.-#

+;%+),)./%"#<%;#)""%.)./%"$#)"=#5'6#+),'$*#;'$+'(./8'0-4#+D%'E# /$#

.1'#$'.#%<#5'6#+),'$#)""%.).'=#5/.1#)""%.)./%"#%'##)"=#*D/3E##/$#.1'#

$'.# %<# )""%.)./%"$#,/8'"# .%# +),'#/371+48%'9# ='"%.'$# .1'#4.1# +),'#

)""%.).'=#6-#%'#)"=#*48/'9#='"%.'$#.1'#4.1#)""%.)./%"#)$$/,"'=#.%#

+),'#/'4#!"#%:;#'B+';/&'".$*#6%.1#6*#)"=#6+1);'#$'.#.%#I4P4#

Q%.'#.1).#.1'#$/&/0);/.-#+;%+),)./%"#;).'#/$#)=R:$.'=#)((%;=/",#

.%# .1'# ":&6';# %<# :$';$# 6'.5''"# .1'# )""%.)./%"# )"=# 5'6# +),'4#

7)@'#?9:)./%"#DME# <%;#)"#'B)&+0'*# .1'#&)B/&)0#+;%+),)./%"#;).'#

()"#6'#)(1/'8'=#%"0-#/<#-*+D%'.1/4E#/$#'9:)0#.%#-*+1D%3.1/#E4#S/,:;'#

MD6E# $1%5$# .1'# </;$.# /.';)./%"T$# GGH# ;'$:0.# %<# .1'# $)&+0'# =).)#

51';'#6*#)"=#6+1);'#$'.#.%#K4#

71'#(%"8';,'"('#%<# .1'#)0,%;/.1&#()"#6'#+;%8'=# /"#)#$/&/0);#

5)-# )$#G/&H)"@# UVW4# S%;# ')(1# /.';)./%"*# .1'# ./&'# (%&+0'B/.-# %<#

.1'# GGH# )0,%;/.1&# /$# :D)*M)+

ME4# X/.1/"# .1'# =).)# $'.# %<# %:;#

'B+';/&'".*#6%.1#.1'#)""%.)./%"#)"=#5'6#+),'#$/&/0);/.-#&).;/('$#

);'#9:/.'# $+);$'# )"=# .1'# )0,%;/.1&#(%"8';,'$# 9:/(@0-4#Y:.# /<# .1'#

$()0'#%<#$%(/)0#)""%.)./%"$#@''+$#,;%5/",#'B+%"'"./)00-*#.1'#$+''=#

%<#(%"8';,'"('#<%;#%:;#)0,%;/.1&$#&)-#$0%5#=%5"4#7%#$%08'#.1/$#

+;%60'&*# 5'# ()"# :$'# $%&'# %+./&/Z)./%"# $.;).',-# $:(1# )$#

/"(%;+%;)./",# .1'# &/"/&)0# (%:".# ;'$.;/(./%"# UVW# .%# &)@'# .1'#

)0,%;/.1&#(%"8';,'#&%;'#9:/(@0-4#

F'../",#;J[;K*;M*\*;#]#6'#)#9:';-#51/(1#(%"$/$.$#%<1##9:';-#

.';&$#)"=#*D/EJ[%K*%M*\*#%4]#6'#.1'#)""%.)./%"#$'.#%<#5'6#+),'#/*#

?9:)./%"#D^E#$1%5$#.1'#$/&/0);/.-#()0(:0)./%"#&'.1%=#6)$'=#%"#.1'#

G%(/)0G/&H)"@4##

!!" "

"#

'

4

3

3'*22< %;2/;='4K K

E*DE*D# *# D^E

:;:! </#5*=>/"&'?*@A'&)/'&$1*BA&1#*-$.&/"*

!11$'/'&$1A*?B/$./",# $.)./(# ;)"@/",# &'.1%=$# :$:)00-# &')$:;'# +),'$T# 9:)0/.-#

<;%&#.1'#>?"1/%@?1AB?%$CB=D*#%;#.1'#=?%BAE1?#@'#?1!=?B=D1+%/".#%<#

8/'54#H'()00#.1).#/"#S/,:;'#K*#.1'#'$./&)./%"#%<#_),'H)"@#UKPW#/$#

$:6R'(.#.%#>?"1AB?%$CB=*#)"=#.1'#<H)"@#UKVW#/$#()0(:0).'=#6)$'=#%"#

6%.1#>?"1/%@?1AB?%$CB=# )"=# =?%BAE1 ?#@'#?1!=?B=T# )(./8/./'$4#71'#

$%(/)0#)""%.)./%"$#);'#.1'#"'5#/"<%;&)./%"#.1).#()"#6'#:./0/Z'=#.%#

()+.:;'# .1'# 5'6# +),'$T# 9:)0/.-# <;%&# >?"1 /%@?1 %##C$%$CB=D1

+';$+'(./8'4#

F7F7G! 2CA'%&+%@?<%#51*&@CB'$E41CDA5%E/'&$1* 7,* H'@E1 ;!%&'$01 >?"1 /%@?=1 %B?1 !=!%&&01 /C/!&%B&01

%##C$%$?I1%#I1/C/!&%B*>?"1/%@?=.1!/J$CJI%$?1>?"1!=?B=1%#I1EC$1

=CA'%&1 %##C$%$'C#=1 E%K?1 $E?1 LC&&C>'#@1 B?&%$'C#=M1 /C/!&%B1 >?"1

/%@?=1 %B?1 "CC54%B5?I1 "014%#01 !/J$CJI%$?1 !=?B=1 %#I1 %##C$%$?I1

"01 EC$1 %##C$%$'C#=N1 !/J$CJI%$?1 !=?B=1 &'5?1 $C1 "CC54%B51 /C/!&%B1

/%@?=1 %#I1 !=?1 EC$1 %##C$%$'C#=N1 EC$1 %##C$%$'C#=1 %B?1 !=?I1 $C1

%##C$%$?1/C/!&%B1>?"1/%@?=1%#I1!=?I1"01!/J$CJI%$?1!=?B=7##

Y)$'=#%"#.1'#%6$';8)./%"#)6%8'*#5'#+;%+%$'#)#"%8'0#)0,%;/.1&*#

")&'0-#G%(/)0_),'H)"@#DG_HE#.%#9:)"./.)./8'0-#'8)0:).'#.1'#+),'#

9:)0/.-#D+%+:0);/.-E#/"=/().'=#6-#$%(/)0#)""%.)./%"$4#71'#/".:/./%"#

6'1/"=# .1'# )0,%;/.1&# /$# .1'#&:.:)0# '"1)"('&'".# ;'0)./%"# )&%",#

+%+:0);* 5'6# +),'$*# :+>.%>=).'# 5'6# :$';$# )"=# 1%.# $%(/)0#

)""%.)./%"$4# S%00%5/",*# .1'# +%+:0);/.-# %<# 5'6# +),'$*# .1'# :+>.%>

=).'"'$$# %<# 5'6# :$';$# )"=# .1'# 1%."'$$# %<# )""%.)./%"$# );'# )00#

;'<';;'=#.%1)$1/C/!&%B'$01<%;#$/&+0/(/.-4##

A$$:&'# .1).# .1';'# );'#)*# )""%.)./%"$*#)+#5'6#+),'$#)"=#),#

5'6# :$';$4# F'.#-+,# 6'# .1'# )+C),1 )$$%(/)./%"# &).;/B# 6'.5''"#

+),'$# )"=# :$';$*1 -*+1 6'# .1'#)*C)+1 )$$%(/)./%"#&).;/B# 6'.5''"#

)""%.)./%"$# )"=# +),'$# )"=# -,*..1'1 ),C)*1 )$$%(/)./%"# &).;/B#

6'.5''"# :$';$# )"=# )""%.)./%"$4# ?0'&'".#-+,# D/'*!3E# /$# )$$/,"'=#

5/.1#.1'#(%:".#%<#)""%.)./%"$#:$'=#6-#:$';#!3#.%#)""%.).'#+),'#/'4#

?0'&'".$#%<#-*+1)"=1-,*#);'# /"/./)0/Z'=#$/&/0);0-4#F'.#+I#6'# .1'#

8'(.%;# (%".)/"/",# ;)"=%&0-# /"/./)0/Z'=# G%(/)0_),'H)"@# $(%;'$4##

`'.)/0$#%<#.1'#G%(/)0_),'H)"@#)0,%;/.1&#);'#+;'$'".'=#)$#<%00%5$4##

!"#$%&'()*7,*-$.&/"</#50/12*3-<04*

-'56*+*!"54$%&

A$$%(/)./%"# &).;/('$1 -+,*1 -*+.1 )"=1 -,** )"=# .1'#

;)"=%&#/"/./)0#G%(/)0_),'H)"@#$(%;'#+I##

WWW 2007 / Track: Search Session: Search Quality and Precision

504

!"#$%&%

%

!"#$

!"#$%

!$#$%&&

!'#$%&&

!(#$%&&

!)#$%&&

!*#$%&&

+

*

++

++

+

!"#!

!#$!

!$"!

!$"!

!#$!

!"#!

#%"

$%#

"%$

$%"

#%$

"%#

&

&

&

!"

!"

!"

!"

!"

!"

#

&

%&'()&"!$,-./01203#&

%$!

!"#$%'(% *+',+'#$

"'4&560&,-./01207&8-,9:;<:20=:.>&3,-10#$

850?&*&7-03&560&9.959:;9@:59-.#&A.&850?&)B&"!B&#!B&:.7&$!&70.-50&

560&?-?C;:195D&/0,5-13&-E&?:203B&C3013B& :.7&:..-5:59-.3& 9.& 560& !56&

9501:59-.#&"!F()#!

FB&:.7&$!F&:10&9.501G079:50&/:;C03#&H3&9;;C351:507&9.&

I92C10& (B& 560& 9.5C959-.& J069.7& KLC:59-.& %$!& 93& 56:5& 560& C3013F&

?-?C;:195D&,:.&J0&7019/07&E1-G&560&?:203&560D&:..-5:507&%$#*!M&560&

:..-5:59-.3F& ?-?C;:195D& ,:.& J0& 7019/07& E1-G& 560& ?-?C;:195D& -E&

C3013& %$#)!M& 39G9;:1;DB& 560& ?-?C;:195D& 93& 51:.3E01107& E1-G&

:..-5:59-.3& 5-&N0J& ?:203& %$#(!B&N0J& ?:203& 5-& :..-5:59-.3& %$#'!B&

:..-5:59-.3&5-&C3013&%$#$!B&:.7&560.&C3013&5-&N0J&?:203&:2:9.&%$#"!#&&

I9.:;;DB&N0&205&"'& :3& 560&-C5?C5&-E&8-,9:;<:20=:.>& %8<=!&N60.&

560&:;2-1956G&,-./01203#&8:G?;0&8<=&/:;C03&:10&29/0.&9.&560&19265&

?:15&-E&I92C10&(#&&

&

)*+,-#&'.&%/00,1"-2"*34%35%6,20*"7%"-241*"*34%8#"9##4%":#%

,1#-1;%2443"2"*341;%24<%$2+#1%*4%":#%!=>%20+3-*":?&

A.& 0:,6& 9501:59-.B& 560& 59G0& ,-G?;0O95D& -E& 560& :;2-1956G& 93&

*%+#+"&P)+$&+"&P)+#&+$&!#&89.,0&560&:7Q:,0.,D&G:519,03&:10&/01D&

3?:130& 9.& -C1& 7:5:& 305B& 560& :,5C:;& 59G0& ,-G?;0O95D& 93& E:1& ;-N01#&

R-N0/01B&9.&S0J&0./91-.G0.5B&560&39@0&-E&7:5:&:10&9.,10:39.2&:5&:&

E:35& 3?007B& :.7& 3-G0& :,,0;01:59-.& 5-& 560& :;2-1956G& %;9>0& TUV& E-1&

<:20=:.>!&36-C;7&J0&70/0;-?07#&

,-,-.! /0123453163)07)8"9)$:504!;<=)R010B& N0& 29/0& :& J190E& ?1--E& -E& 560& ,-./0120.,0& -E& 560& 8<=&

:;2-1956G#&A5&,:.&J0&7019/07&E1-G&560&:;2-1956G&56:54&

W

*

* !%!% "%%"%%" !&

!

&

! !!"!!"#

#B& %"!

N6010&

$"#$"# %%%% !!" #& %U!

H&35:.7:17&103C;5&-E&;9.0:1&:;20J1:&%0#2#&TXV!&35:503&56:5&9E&%>&93&

:& 3DGG0519,) G:519OB& :.7& 2& 93& :& /0,5-1& .-5& -156-2-.:;& 5-& 560&

?19.,9?:;&0920./0,5-1&-E&560&G:519O)!*%%>!B&560.&560&C.95&/0,5-1&9.&

560&7910,59-.&-E&%%>!?2),-./01203&5-&!*%%>!&:3&?&9.,10:303&N956-C5&

J-C.7#&R010&%%&)93&:&3DGG0519,&G:519O&:.7)"W&93&.-5&-156-2-.:;&

5-&!*%%%&!B& 3-& 560& 30LC0.,0&"!& ,-./01203& 5-& :& ;9G95&"

YB& N69,6&

392.:;3&560&501G9.:59-.&-E&560&8<=&:;2-1956G#&

'.@! A742?*B%>24C*4+%9*":%!3B*20%

/453-?2"*34%

,-@-A! BC1D=!6)9D1?!15)%3;<0E)ZC0&5-&560&;:120&.CGJ01&-E&E0:5C103B&G-701.&N0J&30:1,6&0.29.03&

C3C:;;D& 1:.>& 103C;53& JD& ;0:1.9.2& :& 1:.>& EC.,59-.#&[:.D&G056-73&

6:/0&J00.&70/0;-?07&E-1&:C5-G:59,&%-1&30G9\:C5-G:59,!&5C.9.2&-E&

3?0,9E9,& 1:.>9.2& EC.,59-.3#& <10/9-C3& N-1>& 0359G:503& 560& N092653&

561-C26& 10210339-.& T)"V#& =0,0.5& N-1>& -.& 5693& 1:.>9.2& ?1-J;0G&

:550G?53& 5-& 7910,5;D& -?59G9@0& 560& -17019.2& -E& 560& -JQ0,53& T(B& ))B&

()V#&&

H3&793,C3307&9.&T$VB&56010&:10&20.01:;;D&5N-&N:D3&5-&C59;9@0&560&

0O?;-107& 3-,9:;& E0:5C103& E-1& 7D.:G9,& 1:.>9.2& -E& N0J& ?:2034& %:!&

510:59.2& 560& 3-,9:;& :,59-.3& :3& 9.70?0.70.5& 0/970.,0& E-1& 1:.>9.2&

103C;53B& :.7& %J!& 9.5021:59.2& 560& 3-,9:;& E0:5C103& 9.5-& 560& 1:.>9.2&

:;2-1956G#&A.&-C1&N-1>B&N0&9.,-1?-1:50&J-56&39G9;:195D&:.7&35:59,&

E0:5C103& 0O?;-9507& E1-G& 3-,9:;& :..-5:59-.3& 9.5-& 560& 1:.>9.2&

EC.,59-.&JD&C39.2&=:.>8][&T()V#&

,-@-.! F3D;G43>)S0&79/9707&-C1&E0:5C10&305&9.5-&5N-&GC5C:;;D&0O,;C39/0&,:502-19034&

LC01D\?:20&39G9;:195D&E0:5C103&:.7&?:20F3&35:59,&E0:5C103#&^:J;0&*&

703,19J03&0:,6&-E&56030&E0:5C10&,:502-1903&9.&705:9;#&

D280#%E.&I0:5C103&9.&3-,9:;&30:1,6&

H4&&LC01D\7-,CG0.5&E0:5C103&

B068!=!:D4!;C)& 89G9;:195D&J05N00.&LC01D&:.7&?:20&,-.50.5&

&34=%D;6<!15)

%^[!)

89G9;:195D& J05N00.& LC01D& :.7& :..-5:59-.3&

C39.2&560&501G&G:5,69.2&G056-7#&

806!D:8!=9D1?)

%88=!)

89G9;:195D& J05N00.& LC01D& :.7& :..-5:59-.3&

J:307&-.&8-,9:;89G=:.>#&

_4&&7-,CG0.5&35:59,&E0:5C103&

H005:3"D539D1?)

%<=!)

^60&N0J&?:20F3&<:20=:.>&-J5:9.07&E1-G&560&

`--2;0F3&5--;J:1&H<A#&

806!D:"D539D1?)

%8<=!)

^60& ?-?C;:195D& 3,-10& ,:;,C;:507& J:307& -.&

8-,9:;<:20=:.>&:;2-1956G#&

@.! FG=F>/HFIDJK%>F!LKD!%

@.E! A#0*B*3,1%A2"2%^6010& :10& G:.D& 3-,9:;& J-->G:1>& 5--;3& -.& S0J& T(WV#& I-1& 560&

0O?019G0.5B&N0&C30&560&7:5:&,1:N;07&E1-G&Z0;9,9-C3&7C19.2&[:DB&

)WW"B& N69,6& ,-.39353& -E& *BU("B)"X& N0J& ?:203& :.7& )"aB$""&

79EE010.5&:..-5:59-.3#&

H;56-C26& 560&:..-5:59-.3& E1-G&Z0;9,9-C3&:10&0:3D& E-1&6CG:.&

C3013& 5-& 10:7&:.7&C.70135:.7B& 560D&:10&.-5&70392.07& E-1&G:,69.0&

C30#&I-1&0O:G?;0B&C3013&G:D&C30&,-G?-C.7&:..-5:59-.3&9.&/:19-C3&

:&

,&

J&&

&

&

&

8-,9:;&:..-5:59-.3&

%:!& %J!&

!3B*20=2+#>24C1(%

8<=%:!b&W#W*$$&

8<=%J!b&W#WW')&

8<=%,!&bW#W)*X&

S0J&?:203&

S0J&?:20&:..-5:5-13&

&

%$#*!&

%$#"!%$#$!&

%$#)!&

&

&

%$#(!&

%$#'!&

WWW 2007 / Track: Search Session: Search Quality and Precision

505

Experimental Results

•  Data from Delicious: 1,736,268 pages, 269,566 different annotations

•  Example: Top 4 related annotations for different categories

!"#$%&%'()&*%& !"#"$%&'(&"))*+(&"#& !"#",%&'(&"))*+(+&,-&%./01&

1)-%-&*22"1*10"2%& 021"& %1*23*#3&4"#3%&401)& 1)-&)-/.&"!&,"#35-1&

6789&:-!"#-&'%02;&1)-$&02&1)-&-<.-#0$-21%+&

!"#! $%&'(&)*+,-+.-/,,+)&)*+,-0*1*'&2*)*34-=2& "'#& -<.-#0$-21%>& 1)-& ?"(0*/?0$@*2A& */;"#01)$& ("2B-#;-3&

401)02&CD&01-#*10"2%+&E*:/-&D&%)"4%&1)-&%-/-(1-3&*22"1*10"2%&!#"$&

!"'#& (*1-;"#0-%>& 1";-1)-#& 401)& 1)-0#& 1".& F& %-$*210(*//G& #-/*1-3&

*22"1*10"2%+&,01)& 1)-& -<./"#-3& %0$0/*#01G& B*/'-%>&4-& *#-& *:/-& 1"&

!023&$"#-&%-$*210(*//G&#-/*1-3&4-:&.*;-%&*%&%)"42&/*1-#+&

5&6'3-#"-H<./"#-3&%0$0/*#&*22"1*10"2%&:*%-3&"2&?"(0*/?0$@*2A&

5378,+'+9:-23'&)3;<&

3':/02& $-1*3*1*>&%-$*210(>&%1*23*#3>&"4/&

3-:0*2& 30%1#0:'10"2>&30%1#">&':'21'>&/02'<&

$7+,+1:-23'&)3;<&

*3%-2%-& %-2%->&*3B-#10%->&-21#-.#-2-'#>&$"2-G&

IJJ& 2'$:-#>&30#-(1"#G>&.)"2->&:'%02-%%&

$,)32)&*,13,)-23'&)3;<&

*/:'$& ;*//-#G>&.)"1";#*.)G>&.*2"#*$*>&.)"1"& &

()*1& $-%%-2;-#>&K*::-#>&0$>&$*("%<&&

$,)*):--23'&)3;<&

-02%1-02& %(0-2(->&%A-.10(>&-B"/'10"2>&L'*21'$&

()#0%10*2&& 3-B"1->&!*01)>&#-/0;0"2>&;"3&&

!"=! $%&'(&)*+,-+.-0>?-?34(')4-,-& ":1*02-3& -*()& 4-:& .*;-M%& ?N@& %("#-& '%02;& 1)-& */;"#01)$&

3-%(#0:-3& 02& %-(10"2& 7+7+& =2& "'#& -<.-#0$-21%>& 1)-& */;"#01)$&

("2B-#;-3& 401)02& O& 01-#*10"2%+& H*()& .*;-M%& N*;-@*2A& 4*%& */%"&

-<1#*(1-3& !#"$& 1)-& P"";/-M%& 1""/:*#& QN=& 3'#02;& R'/G>& DJJS+&

T-#-*!1-#>& 4-& '%-& N*;-@*2A& 1"& 3-.0(1& 1)-& -<1#*(1-3& P"";/-M%&

N*;-@*2A&:G&3-!*'/1+&

-$.$/! 0123#4$31"(52"+6378*49&*:;9*'+3<+"=>4*4?3U0;'#-&F&%)"4%&1)-&*B-#*;-&("'21%&"!&*22"1*10"2%&*23&*22"1*1"#%&

"B-#& 4-:& .*;-%& 401)& 30!!-#-21& N*;-@*2A& B*/'-%>& *23& !"#& 1)-&

@+*A;53<++'9"9*'+&/02->&1)-&B*/'-&$-*2%&1)-&("'21&"!&*22"1*10"2%&

1)*1& *#-& 30!!-#-21& 401)& -*()& "1)-#+& U#"$& 1)-& !0;'#->& 4-& (*2&

("2(/'3-& 1)*1& 02&$"%1&(*%-%>& 1)-&.*;-&401)&*&)0;)-#&N*;-@*2A& 0%&

/0A-/G&1"&:-&*22"1*1-3&:G&$"#-&'%-#%&401)&$"#-&*22"1*10"2%+&&

&

@*9(23&!"&-/%32&93-7+(,)-;*4)2*6()*+,-+%32->&93?&,A-

E"& !'#1)-#& 02B-%10;*1-& 1)-& 30%1#0:'10"2>& ("'21%& "!& *22"1*10"2%&

*23& *22"1*1"#%& !"#& *//& ("//-(1-3& .*;-%& 401)& 30!!-#-21& N*;-@*2A&

B*/'-%& *#-& ;0B-2& 02& U0;'#-& 8V*W+& =1& 0%& -*%G& 1"& %--& 1)*1& 1)-& .*;-%&

401)& -*()& N*;-@*2A& B*/'-& 30B-#%0!G& *& /"1& "2& 1)-& 2'$:-#& "!&

*22"1*10"2%&*23&'%-#%+&,-:&.*;-%&401)&*&#-/*10B-/G&/"4&N*;-@*2A&

$*G&"42&$"#-&*22"1*10"2%&*23&'%-#%&1)*2&1)"%-&4)"&)*B-&)0;)-#&

N*;-@*2A+&U"#&-<*$./->&%"$-&.*;-%&401)&N*;-@*2A&J&)*B-&$"#-&

'%-#%&*23&*22"1*10"2%&1)*2&1)"%-&4)"&)*B-&N*;-@*2A&CJ+&

&,-&*../0-3&1)-&?N@&*/;"#01)$&1"& 1)-&("//-(1-3&3*1*+&U"#&-*%G&

'23-#%1*2302;>&?N@&0%&2"#$*/0X-3&021"&*&%(*/-&"!&JYCJ&%"&1)*1&?N@&

*23&N*;-@*2A&)*B-&1)-&%*$-&2'$:-#&"!&.*;-%&02&-*()&;#*3-&!#"$&

J&1"&CJ+&U0;'#-&8V:W&%)"4%&1)-&3-1*0/-3&("'21%&"!&*22"1*10"2%&*23&

'%-#%&"!&.*;-%&401)&30!!-#-21&?"(0*/N*;-@*2A&B*/'-%+&=1&0%&-*%G&1"&

%--& 1)*1& ?"(0*/N*;-@*2A& %'((-%%!'//G& ()*#*(1-#0X-%& 1)-& 4-:&

.*;-%M&.".'/*#01G&3-;#--%&*$"2;&4-:&*22"1*1"#%+&

&&&&&&& -

B&C-/,,+)&)*+,4-&,;-(4324-+%32->&93?&,A& B6C-/,,+)&)*+,4-&,;-(4324-+%32-0+7*&'>&93?&,A&

@*9(23&D"-E*4)2*6()*+,-&,&':4*4-+.-4+7*&'-&,,+)&)*+,4--

WWW 2007 / Track: Search Session: Search Quality and Precision

506

Experimental Results

•  Two query sets:
   – MQ50: 50 queries manually generated by students
   – AQ3000: 3000 queries auto-generated from ODP

•  Measure NDCG and MAP:
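For reference, a minimal sketch of how the two metrics can be computed for a single query, assuming binary relevance judgments (graded judgments would only change the gain term in DCG); the function names are illustrative, not taken from the papers discussed here.

```python
import math

def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over the ranks k that hold a relevant result."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ndcg(ranked, relevant, depth=10):
    """NDCG@depth with binary gains: DCG of the ranking divided by DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(k + 1)
              for k, doc in enumerate(ranked[:depth], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(k + 1)
                for k in range(1, min(len(relevant), depth) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# MAP / mean NDCG over a query set is simply the mean of the per-query values.
```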

[Slide background: scanned page 508 of Bao07 (results). Recoverable content: Figure 6 plots NDCG at varying cut-offs for Baseline, Baseline+TM and Baseline+SSR; Figure 7 does the same for BM25, BM25+PR and BM25+SPR. Tables 4-6 compare MAP on MQ50 and AQ3000 for the similarity features (Baseline, Baseline+TM, Baseline+SSR), the static features (Baseline, Baseline+PR, Baseline+SPR) and dynamic ranking with both SSR and SPR; the combined model performs best, and t-tests on MAP are significant at p < 0.05. Table 7 is a case study for the query "airfare": the pages ranked highest by SPR include www.kayak.com, www.travelocity.com, www.tripadvisor.com and www.expedia.com, and the tags SSR finds most similar to "airfare" are "ticket", "flight", "hotel" and "airline".]


What about PageRank?

•  Observation: popular web pages attract hot social annotations and are bookmarked by up-to-date users

•  Use these properties to estimate the popularity of pages (SocialPageRank)
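A minimal numpy sketch of the SocialPageRank idea: popularity is propagated around the page -> user -> annotation cycle (and back) through the three association matrices until the page scores stabilise. The matrix orientation, the L1 normalisation and the convergence test are assumptions made for illustration, not details taken from the Bao07 paper.

```python
import numpy as np

def social_page_rank(M_PU, M_AP, M_UA, tol=1e-8, max_iter=100):
    """M_PU: pages x users, M_AP: annotations x pages, M_UA: users x annotations
    (entries count the annotations linking each pair). Returns a popularity score per page."""
    n_pages = M_PU.shape[0]
    p = np.random.rand(n_pages)
    p /= p.sum()
    for _ in range(max_iter):
        u = M_PU.T @ p          # pages        -> users
        a = M_UA.T @ u          # users        -> annotations
        p2 = M_AP.T @ a         # annotations  -> pages
        a2 = M_AP @ p2          # pages        -> annotations
        u2 = M_UA @ a2          # annotations  -> users
        p_new = M_PU @ u2       # users        -> pages
        p_new /= p_new.sum()    # keep the scores bounded (assumption)
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p
```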

!"#$%&%'()&*%& !"#"$%&'(&"))*+(&"#& !"#",%&'(&"))*+(+&,-&%./01&

1)-%-&*22"1*10"2%& 021"& %1*23*#3&4"#3%&401)& 1)-&)-/.&"!&,"#35-1&

6789&:-!"#-&'%02;&1)-$&02&1)-&-<.-#0$-21%+&

!"#! $%&'(&)*+,-+.-/,,+)&)*+,-0*1*'&2*)*34-=2& "'#& -<.-#0$-21%>& 1)-& ?"(0*/?0$@*2A& */;"#01)$& ("2B-#;-3&

401)02&CD&01-#*10"2%+&E*:/-&D&%)"4%&1)-&%-/-(1-3&*22"1*10"2%&!#"$&

!"'#& (*1-;"#0-%>& 1";-1)-#& 401)& 1)-0#& 1".& F& %-$*210(*//G& #-/*1-3&

*22"1*10"2%+&,01)& 1)-& -<./"#-3& %0$0/*#01G& B*/'-%>&4-& *#-& *:/-& 1"&

!023&$"#-&%-$*210(*//G&#-/*1-3&4-:&.*;-%&*%&%)"42&/*1-#+&

5&6'3-#"-H<./"#-3&%0$0/*#&*22"1*10"2%&:*%-3&"2&?"(0*/?0$@*2A&

5378,+'+9:-23'&)3;<&

3':/02& $-1*3*1*>&%-$*210(>&%1*23*#3>&"4/&

3-:0*2& 30%1#0:'10"2>&30%1#">&':'21'>&/02'<&

$7+,+1:-23'&)3;<&

*3%-2%-& %-2%->&*3B-#10%->&-21#-.#-2-'#>&$"2-G&

IJJ& 2'$:-#>&30#-(1"#G>&.)"2->&:'%02-%%&

$,)32)&*,13,)-23'&)3;<&

*/:'$& ;*//-#G>&.)"1";#*.)G>&.*2"#*$*>&.)"1"& &

()*1& $-%%-2;-#>&K*::-#>&0$>&$*("%<&&

$,)*):--23'&)3;<&

-02%1-02& %(0-2(->&%A-.10(>&-B"/'10"2>&L'*21'$&

()#0%10*2&& 3-B"1->&!*01)>&#-/0;0"2>&;"3&&

!"=! $%&'(&)*+,-+.-0>?-?34(')4-,-& ":1*02-3& -*()& 4-:& .*;-M%& ?N@& %("#-& '%02;& 1)-& */;"#01)$&

3-%(#0:-3& 02& %-(10"2& 7+7+& =2& "'#& -<.-#0$-21%>& 1)-& */;"#01)$&

("2B-#;-3& 401)02& O& 01-#*10"2%+& H*()& .*;-M%& N*;-@*2A& 4*%& */%"&

-<1#*(1-3& !#"$& 1)-& P"";/-M%& 1""/:*#& QN=& 3'#02;& R'/G>& DJJS+&

T-#-*!1-#>& 4-& '%-& N*;-@*2A& 1"& 3-.0(1& 1)-& -<1#*(1-3& P"";/-M%&

N*;-@*2A&:G&3-!*'/1+&

-$.$/! 0123#4$31"(52"+6378*49&*:;9*'+3<+"=>4*4?3U0;'#-&F&%)"4%&1)-&*B-#*;-&("'21%&"!&*22"1*10"2%&*23&*22"1*1"#%&

"B-#& 4-:& .*;-%& 401)& 30!!-#-21& N*;-@*2A& B*/'-%>& *23& !"#& 1)-&

@+*A;53<++'9"9*'+&/02->&1)-&B*/'-&$-*2%&1)-&("'21&"!&*22"1*10"2%&

1)*1& *#-& 30!!-#-21& 401)& -*()& "1)-#+& U#"$& 1)-& !0;'#->& 4-& (*2&

("2(/'3-& 1)*1& 02&$"%1&(*%-%>& 1)-&.*;-&401)&*&)0;)-#&N*;-@*2A& 0%&

/0A-/G&1"&:-&*22"1*1-3&:G&$"#-&'%-#%&401)&$"#-&*22"1*10"2%+&&

&

@*9(23&!"&-/%32&93-7+(,)-;*4)2*6()*+,-+%32->&93?&,A-

E"& !'#1)-#& 02B-%10;*1-& 1)-& 30%1#0:'10"2>& ("'21%& "!& *22"1*10"2%&

*23& *22"1*1"#%& !"#& *//& ("//-(1-3& .*;-%& 401)& 30!!-#-21& N*;-@*2A&

B*/'-%& *#-& ;0B-2& 02& U0;'#-& 8V*W+& =1& 0%& -*%G& 1"& %--& 1)*1& 1)-& .*;-%&

401)& -*()& N*;-@*2A& B*/'-& 30B-#%0!G& *& /"1& "2& 1)-& 2'$:-#& "!&

*22"1*10"2%&*23&'%-#%+&,-:&.*;-%&401)&*&#-/*10B-/G&/"4&N*;-@*2A&

$*G&"42&$"#-&*22"1*10"2%&*23&'%-#%&1)*2&1)"%-&4)"&)*B-&)0;)-#&

N*;-@*2A+&U"#&-<*$./->&%"$-&.*;-%&401)&N*;-@*2A&J&)*B-&$"#-&

'%-#%&*23&*22"1*10"2%&1)*2&1)"%-&4)"&)*B-&N*;-@*2A&CJ+&

&,-&*../0-3&1)-&?N@&*/;"#01)$&1"& 1)-&("//-(1-3&3*1*+&U"#&-*%G&

'23-#%1*2302;>&?N@&0%&2"#$*/0X-3&021"&*&%(*/-&"!&JYCJ&%"&1)*1&?N@&

*23&N*;-@*2A&)*B-&1)-&%*$-&2'$:-#&"!&.*;-%&02&-*()&;#*3-&!#"$&

J&1"&CJ+&U0;'#-&8V:W&%)"4%&1)-&3-1*0/-3&("'21%&"!&*22"1*10"2%&*23&

'%-#%&"!&.*;-%&401)&30!!-#-21&?"(0*/N*;-@*2A&B*/'-%+&=1&0%&-*%G&1"&

%--& 1)*1& ?"(0*/N*;-@*2A& %'((-%%!'//G& ()*#*(1-#0X-%& 1)-& 4-:&

.*;-%M&.".'/*#01G&3-;#--%&*$"2;&4-:&*22"1*1"#%+&

&&&&&&& -

B&C-/,,+)&)*+,4-&,;-(4324-+%32->&93?&,A& B6C-/,,+)&)*+,4-&,;-(4324-+%32-0+7*&'>&93?&,A&

@*9(23&D"-E*4)2*6()*+,-&,&':4*4-+.-4+7*&'-&,,+)&)*+,4--

WWW 2007 / Track: Search Session: Search Quality and Precision

506

Page-User association matrix

!"# $%&'# ()$'$*# $%&'# +),'$#&)-# (%".)/"# %"0-# .1'# )""%.)./%"#

2!"!#$!3# '4,4*# 5'6# +),'# %4# 71'"# ,/8'"# .1'# 9:';-# (%".)/"/",#

2&'#!(3*# .1'# +),'# .1).# %"0-# 1)$# 2!"!#$!3# &)-# 6'# </0.';'=# %:.#

/&+;%+';0-# 6-# $/&+0'# .';&>&).(1/",# &'.1%=4# ?8'"# /<# .1'# +),'#

(%".)/"$#6%.1#)""%.)./%"$#2!"!#$!3#)"=#2&'#!(3*#/.#/$#"%.#+;%+';#.%#

()0(:0).'#.1'#$/&/0);/.-#6'.5''"#.1'#9:';-#)"=#.1'#=%(:&'".#:$/",#

.1'# @'-5%;=# 2&'#!(3# %"0-4#A"# 'B+0%;)./%"# %<# $/&/0);/.-# 6'.5''"#

2!"!#$!3#)"=#2&'#!(3#&)-#<:;.1';#/&+;%8'#.1'#+),'#;)"@/",4#

7%#'B+0%;'#.1'#)""%.)./%"$#5/.1#$/&/0);#&')"/",$*#5'#6:/0=#)#

6/+);./.'>,;)+1#6'.5''"#$%(/)0#)""%.)./%"$#)"=#5'6#+),'$#5/.1#/.$#

'=,'$# /"=/()./",# .1'# :$';# (%:".4# A$$:&'# .1).# .1';'# );'# )*#

)""%.)./%"$*#)+#5'6#+),'$#)"=#),#5'6#:$';$4#-*+# /$# .1'#)*C)+#

)$$%(/)./%"# &).;/B# 6'.5''"# )""%.)./%"$# )"=# +),'$4# -*+D%(./0E#

='"%.'$#.1'#":&6';#%<#:$';$#51%#)$$/,"#)""%.)./%"#%(#.%#+),'#/04#

F'../",12*16'#.1'#)*C)*#&).;/B#51%$'#'0'&'".#2*D%'*#%3E#/"=/().'$#

.1'# $/&/0);/.-# $(%;'# 6'.5''"# )""%.)./%"$#%'# )"=#%3# )"=#2+16'# .1'1

)+C)+# &).;/B# ')(1# %<# 51%$'# '0'&'".# $.%;'$# .1'# $/&/0);/.-##

6'.5''"# .5%# 5'6# +),'$*# 5'# +;%+%$'# G%(/)0G/&H)"@DGGHE*# )"#

/.';)./8'# )0,%;/.1&# .%# 9:)"./.)./8'0-# '8)0:).'# .1'# $/&/0);/.-#

6'.5''"#)"-#.5%#)""%.)./%"$4#

!"#$%&'()*+,*-$.&/"-&)0/12*3--04*

-'56*+# !"#$%&&&&&&'($&2*I#D%'*#%3E#J#K#<%;#')(1#%'1J#%3#%.1';5/$'#I#

2+I#D/'*#/3E#J#K#<%;#')(1#/'1J#/3#%.1';5/$'#I&

-'56*7* )*&8&

+*,&(-./&)""%.)./%"#+)/;#D%'.1%3E&0*&

#

! !" "

#"

ELDL

K

ELDL

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3%+

4

%+

#

3#'4

5

+

#3*+4'*+

#3*+4'*+

3'

*3'

5

*

%+%+2/%-/%-

/%-/%-

%+%+

6%%2

1111*& DME

* +*,&(-./&+),'#+)/;#D1/'.1/3E#0*& #

! !" "

#

#"

LEDL

K

LEDL

K

K

K

EEDE*DDEE*DE**D&)BD

EE*DE**D&/"D

LEDLLEDLE*D

' 3/*

4

/*

#

3#'45*

3#*+'4*+

3#*+'4*+

3'

+3'

5+

/*/*2/%-/%-

/%-/%-

/*/*

6//2

*& DNE

* 91"$#2#2*D%'*#%3E#(%"8';,'$4&#

-'56*:# 34$54$%&2*D%'*#%3E#

O';'*# 6*# )"=# 6+# ='"%.'# .1'# =)&+/",# <)(.%;$# %<# $/&/0);/.-#

+;%+),)./%"#<%;#)""%.)./%"$#)"=#5'6#+),'$*#;'$+'(./8'0-4#+D%'E# /$#

.1'#$'.#%<#5'6#+),'$#)""%.).'=#5/.1#)""%.)./%"#%'##)"=#*D/3E##/$#.1'#

$'.# %<# )""%.)./%"$#,/8'"# .%# +),'#/371+48%'9# ='"%.'$# .1'#4.1# +),'#

)""%.).'=#6-#%'#)"=#*48/'9#='"%.'$#.1'#4.1#)""%.)./%"#)$$/,"'=#.%#

+),'#/'4#!"#%:;#'B+';/&'".$*#6%.1#6*#)"=#6+1);'#$'.#.%#I4P4#

Q%.'#.1).#.1'#$/&/0);/.-#+;%+),)./%"#;).'#/$#)=R:$.'=#)((%;=/",#

.%# .1'# ":&6';# %<# :$';$# 6'.5''"# .1'# )""%.)./%"# )"=# 5'6# +),'4#

7)@'#?9:)./%"#DME# <%;#)"#'B)&+0'*# .1'#&)B/&)0#+;%+),)./%"#;).'#

()"#6'#)(1/'8'=#%"0-#/<#-*+D%'.1/4E#/$#'9:)0#.%#-*+1D%3.1/#E4#S/,:;'#

MD6E# $1%5$# .1'# </;$.# /.';)./%"T$# GGH# ;'$:0.# %<# .1'# $)&+0'# =).)#

51';'#6*#)"=#6+1);'#$'.#.%#K4#

71'#(%"8';,'"('#%<# .1'#)0,%;/.1&#()"#6'#+;%8'=# /"#)#$/&/0);#

5)-# )$#G/&H)"@# UVW4# S%;# ')(1# /.';)./%"*# .1'# ./&'# (%&+0'B/.-# %<#

.1'# GGH# )0,%;/.1&# /$# :D)*M)+

ME4# X/.1/"# .1'# =).)# $'.# %<# %:;#

'B+';/&'".*#6%.1#.1'#)""%.)./%"#)"=#5'6#+),'#$/&/0);/.-#&).;/('$#

);'#9:/.'# $+);$'# )"=# .1'# )0,%;/.1&#(%"8';,'$# 9:/(@0-4#Y:.# /<# .1'#

$()0'#%<#$%(/)0#)""%.)./%"$#@''+$#,;%5/",#'B+%"'"./)00-*#.1'#$+''=#

%<#(%"8';,'"('#<%;#%:;#)0,%;/.1&$#&)-#$0%5#=%5"4#7%#$%08'#.1/$#

+;%60'&*# 5'# ()"# :$'# $%&'# %+./&/Z)./%"# $.;).',-# $:(1# )$#

/"(%;+%;)./",# .1'# &/"/&)0# (%:".# ;'$.;/(./%"# UVW# .%# &)@'# .1'#

)0,%;/.1&#(%"8';,'#&%;'#9:/(@0-4#

F'../",#;J[;K*;M*\*;#]#6'#)#9:';-#51/(1#(%"$/$.$#%<1##9:';-#

.';&$#)"=#*D/EJ[%K*%M*\*#%4]#6'#.1'#)""%.)./%"#$'.#%<#5'6#+),'#/*#

?9:)./%"#D^E#$1%5$#.1'#$/&/0);/.-#()0(:0)./%"#&'.1%=#6)$'=#%"#.1'#

G%(/)0G/&H)"@4##

!!" "

"#

'

4

3

3'*22< %;2/;='4K K

E*DE*D# *# D^E

:;:! </#5*=>/"&'?*@A'&)/'&$1*BA&1#*-$.&/"*

!11$'/'&$1A*?B/$./",# $.)./(# ;)"@/",# &'.1%=$# :$:)00-# &')$:;'# +),'$T# 9:)0/.-#

<;%&#.1'#>?"1/%@?1AB?%$CB=D*#%;#.1'#=?%BAE1?#@'#?1!=?B=D1+%/".#%<#

8/'54#H'()00#.1).#/"#S/,:;'#K*#.1'#'$./&)./%"#%<#_),'H)"@#UKPW#/$#

$:6R'(.#.%#>?"1AB?%$CB=*#)"=#.1'#<H)"@#UKVW#/$#()0(:0).'=#6)$'=#%"#

6%.1#>?"1/%@?1AB?%$CB=# )"=# =?%BAE1 ?#@'#?1!=?B=T# )(./8/./'$4#71'#

$%(/)0#)""%.)./%"$#);'#.1'#"'5#/"<%;&)./%"#.1).#()"#6'#:./0/Z'=#.%#

()+.:;'# .1'# 5'6# +),'$T# 9:)0/.-# <;%&# >?"1 /%@?1 %##C$%$CB=D1

+';$+'(./8'4#

F7F7G! 2CA'%&+%@?<%#51*&@CB'$E41CDA5%E/'&$1* 7,* H'@E1 ;!%&'$01 >?"1 /%@?=1 %B?1 !=!%&&01 /C/!&%B&01

%##C$%$?I1%#I1/C/!&%B*>?"1/%@?=.1!/J$CJI%$?1>?"1!=?B=1%#I1EC$1

=CA'%&1 %##C$%$'C#=1 E%K?1 $E?1 LC&&C>'#@1 B?&%$'C#=M1 /C/!&%B1 >?"1

/%@?=1 %B?1 "CC54%B5?I1 "014%#01 !/J$CJI%$?1 !=?B=1 %#I1 %##C$%$?I1

"01 EC$1 %##C$%$'C#=N1 !/J$CJI%$?1 !=?B=1 &'5?1 $C1 "CC54%B51 /C/!&%B1

/%@?=1 %#I1 !=?1 EC$1 %##C$%$'C#=N1 EC$1 %##C$%$'C#=1 %B?1 !=?I1 $C1

%##C$%$?1/C/!&%B1>?"1/%@?=1%#I1!=?I1"01!/J$CJI%$?1!=?B=7##

Y)$'=#%"#.1'#%6$';8)./%"#)6%8'*#5'#+;%+%$'#)#"%8'0#)0,%;/.1&*#

")&'0-#G%(/)0_),'H)"@#DG_HE#.%#9:)"./.)./8'0-#'8)0:).'#.1'#+),'#

9:)0/.-#D+%+:0);/.-E#/"=/().'=#6-#$%(/)0#)""%.)./%"$4#71'#/".:/./%"#

6'1/"=# .1'# )0,%;/.1&# /$# .1'#&:.:)0# '"1)"('&'".# ;'0)./%"# )&%",#

+%+:0);* 5'6# +),'$*# :+>.%>=).'# 5'6# :$';$# )"=# 1%.# $%(/)0#

)""%.)./%"$4# S%00%5/",*# .1'# +%+:0);/.-# %<# 5'6# +),'$*# .1'# :+>.%>

=).'"'$$# %<# 5'6# :$';$# )"=# .1'# 1%."'$$# %<# )""%.)./%"$# );'# )00#

;'<';;'=#.%1)$1/C/!&%B'$01<%;#$/&+0/(/.-4##

A$$:&'# .1).# .1';'# );'#)*# )""%.)./%"$*#)+#5'6#+),'$#)"=#),#

5'6# :$';$4# F'.#-+,# 6'# .1'# )+C),1 )$$%(/)./%"# &).;/B# 6'.5''"#

+),'$# )"=# :$';$*1 -*+1 6'# .1'#)*C)+1 )$$%(/)./%"#&).;/B# 6'.5''"#

)""%.)./%"$# )"=# +),'$# )"=# -,*..1'1 ),C)*1 )$$%(/)./%"# &).;/B#

6'.5''"# :$';$# )"=# )""%.)./%"$4# ?0'&'".#-+,# D/'*!3E# /$# )$$/,"'=#

5/.1#.1'#(%:".#%<#)""%.)./%"$#:$'=#6-#:$';#!3#.%#)""%.).'#+),'#/'4#

?0'&'".$#%<#-*+1)"=1-,*#);'# /"/./)0/Z'=#$/&/0);0-4#F'.#+I#6'# .1'#

8'(.%;# (%".)/"/",# ;)"=%&0-# /"/./)0/Z'=# G%(/)0_),'H)"@# $(%;'$4##

`'.)/0$#%<#.1'#G%(/)0_),'H)"@#)0,%;/.1&#);'#+;'$'".'=#)$#<%00%5$4##

!"#$%&'()*7,*-$.&/"</#50/12*3-<04*

-'56*+*!"54$%&

A$$%(/)./%"# &).;/('$1 -+,*1 -*+.1 )"=1 -,** )"=# .1'#

;)"=%&#/"/./)0#G%(/)0_),'H)"@#$(%;'#+I##

WWW 2007 / Track: Search Session: Search Quality and Precision

504

!"#$%&%

%

!"#$

!"#$%

!$#$%&&

!'#$%&&

!(#$%&&

!)#$%&&

!*#$%&&

+

*

++

++

+

!"#!

!#$!

!$"!

!$"!

!#$!

!"#!

#%"

$%#

"%$

$%"

#%$

"%#

&

&

&

!"

!"

!"

!"

!"

!"

#

&

%&'()&"!$,-./01203#&

%$!

!"#$%'(% *+',+'#$

"'4&560&,-./01207&8-,9:;<:20=:.>&3,-10#$

850?&*&7-03&560&9.959:;9@:59-.#&A.&850?&)B&"!B&#!B&:.7&$!&70.-50&

560&?-?C;:195D&/0,5-13&-E&?:203B&C3013B& :.7&:..-5:59-.3& 9.& 560& !56&

9501:59-.#&"!F()#!

FB&:.7&$!F&:10&9.501G079:50&/:;C03#&H3&9;;C351:507&9.&

I92C10& (B& 560& 9.5C959-.& J069.7& KLC:59-.& %$!& 93& 56:5& 560& C3013F&

?-?C;:195D&,:.&J0&7019/07&E1-G&560&?:203&560D&:..-5:507&%$#*!M&560&

:..-5:59-.3F& ?-?C;:195D& ,:.& J0& 7019/07& E1-G& 560& ?-?C;:195D& -E&

C3013& %$#)!M& 39G9;:1;DB& 560& ?-?C;:195D& 93& 51:.3E01107& E1-G&

:..-5:59-.3& 5-&N0J& ?:203& %$#(!B&N0J& ?:203& 5-& :..-5:59-.3& %$#'!B&

:..-5:59-.3&5-&C3013&%$#$!B&:.7&560.&C3013&5-&N0J&?:203&:2:9.&%$#"!#&&

I9.:;;DB&N0&205&"'& :3& 560&-C5?C5&-E&8-,9:;<:20=:.>& %8<=!&N60.&

560&:;2-1956G&,-./01203#&8:G?;0&8<=&/:;C03&:10&29/0.&9.&560&19265&

?:15&-E&I92C10&(#&&

&

)*+,-#&'.&%/00,1"-2"*34%35%6,20*"7%"-241*"*34%8#"9##4%":#%

,1#-1;%2443"2"*341;%24<%$2+#1%*4%":#%!=>%20+3-*":?&

A.& 0:,6& 9501:59-.B& 560& 59G0& ,-G?;0O95D& -E& 560& :;2-1956G& 93&

*%+#+"&P)+$&+"&P)+#&+$&!#&89.,0&560&:7Q:,0.,D&G:519,03&:10&/01D&

3?:130& 9.& -C1& 7:5:& 305B& 560& :,5C:;& 59G0& ,-G?;0O95D& 93& E:1& ;-N01#&

R-N0/01B&9.&S0J&0./91-.G0.5B&560&39@0&-E&7:5:&:10&9.,10:39.2&:5&:&

E:35& 3?007B& :.7& 3-G0& :,,0;01:59-.& 5-& 560& :;2-1956G& %;9>0& TUV& E-1&

<:20=:.>!&36-C;7&J0&70/0;-?07#&

,-,-.! /0123453163)07)8"9)$:504!;<=)R010B& N0& 29/0& :& J190E& ?1--E& -E& 560& ,-./0120.,0& -E& 560& 8<=&

:;2-1956G#&A5&,:.&J0&7019/07&E1-G&560&:;2-1956G&56:54&

W

*

* !%!% "%%"%%" !&

!

&

! !!"!!"#

#B& %"!

N6010&

$"#$"# %%%% !!" #& %U!

H&35:.7:17&103C;5&-E&;9.0:1&:;20J1:&%0#2#&TXV!&35:503&56:5&9E&%>&93&

:& 3DGG0519,) G:519OB& :.7& 2& 93& :& /0,5-1& .-5& -156-2-.:;& 5-& 560&

?19.,9?:;&0920./0,5-1&-E&560&G:519O)!*%%>!B&560.&560&C.95&/0,5-1&9.&

560&7910,59-.&-E&%%>!?2),-./01203&5-&!*%%>!&:3&?&9.,10:303&N956-C5&

J-C.7#&R010&%%&)93&:&3DGG0519,&G:519O&:.7)"W&93&.-5&-156-2-.:;&

5-&!*%%%&!B& 3-& 560& 30LC0.,0&"!& ,-./01203& 5-& :& ;9G95&"

YB& N69,6&

392.:;3&560&501G9.:59-.&-E&560&8<=&:;2-1956G#&

'.@! A742?*B%>24C*4+%9*":%!3B*20%

/453-?2"*34%

,-@-A! BC1D=!6)9D1?!15)%3;<0E)ZC0&5-&560&;:120&.CGJ01&-E&E0:5C103B&G-701.&N0J&30:1,6&0.29.03&

C3C:;;D& 1:.>& 103C;53& JD& ;0:1.9.2& :& 1:.>& EC.,59-.#&[:.D&G056-73&

6:/0&J00.&70/0;-?07&E-1&:C5-G:59,&%-1&30G9\:C5-G:59,!&5C.9.2&-E&

3?0,9E9,& 1:.>9.2& EC.,59-.3#& <10/9-C3& N-1>& 0359G:503& 560& N092653&

561-C26& 10210339-.& T)"V#& =0,0.5& N-1>& -.& 5693& 1:.>9.2& ?1-J;0G&

:550G?53& 5-& 7910,5;D& -?59G9@0& 560& -17019.2& -E& 560& -JQ0,53& T(B& ))B&

()V#&&

H3&793,C3307&9.&T$VB&56010&:10&20.01:;;D&5N-&N:D3&5-&C59;9@0&560&

0O?;-107& 3-,9:;& E0:5C103& E-1& 7D.:G9,& 1:.>9.2& -E& N0J& ?:2034& %:!&

510:59.2& 560& 3-,9:;& :,59-.3& :3& 9.70?0.70.5& 0/970.,0& E-1& 1:.>9.2&

103C;53B& :.7& %J!& 9.5021:59.2& 560& 3-,9:;& E0:5C103& 9.5-& 560& 1:.>9.2&

:;2-1956G#&A.&-C1&N-1>B&N0&9.,-1?-1:50&J-56&39G9;:195D&:.7&35:59,&

E0:5C103& 0O?;-9507& E1-G& 3-,9:;& :..-5:59-.3& 9.5-& 560& 1:.>9.2&

EC.,59-.&JD&C39.2&=:.>8][&T()V#&

,-@-.! F3D;G43>)S0&79/9707&-C1&E0:5C10&305&9.5-&5N-&GC5C:;;D&0O,;C39/0&,:502-19034&

LC01D\?:20&39G9;:195D&E0:5C103&:.7&?:20F3&35:59,&E0:5C103#&^:J;0&*&

703,19J03&0:,6&-E&56030&E0:5C10&,:502-1903&9.&705:9;#&

D280#%E.&I0:5C103&9.&3-,9:;&30:1,6&

H4&&LC01D\7-,CG0.5&E0:5C103&

B068!=!:D4!;C)& 89G9;:195D&J05N00.&LC01D&:.7&?:20&,-.50.5&

&34=%D;6<!15)

%^[!)

89G9;:195D& J05N00.& LC01D& :.7& :..-5:59-.3&

C39.2&560&501G&G:5,69.2&G056-7#&

806!D:8!=9D1?)

%88=!)

89G9;:195D& J05N00.& LC01D& :.7& :..-5:59-.3&

J:307&-.&8-,9:;89G=:.>#&

_4&&7-,CG0.5&35:59,&E0:5C103&

H005:3"D539D1?)

%<=!)

^60&N0J&?:20F3&<:20=:.>&-J5:9.07&E1-G&560&

`--2;0F3&5--;J:1&H<A#&

806!D:"D539D1?)

%8<=!)

^60& ?-?C;:195D& 3,-10& ,:;,C;:507& J:307& -.&

8-,9:;<:20=:.>&:;2-1956G#&

@.! FG=F>/HFIDJK%>F!LKD!%

@.E! A#0*B*3,1%A2"2%^6010& :10& G:.D& 3-,9:;& J-->G:1>& 5--;3& -.& S0J& T(WV#& I-1& 560&

0O?019G0.5B&N0&C30&560&7:5:&,1:N;07&E1-G&Z0;9,9-C3&7C19.2&[:DB&

)WW"B& N69,6& ,-.39353& -E& *BU("B)"X& N0J& ?:203& :.7& )"aB$""&

79EE010.5&:..-5:59-.3#&

H;56-C26& 560&:..-5:59-.3& E1-G&Z0;9,9-C3&:10&0:3D& E-1&6CG:.&

C3013& 5-& 10:7&:.7&C.70135:.7B& 560D&:10&.-5&70392.07& E-1&G:,69.0&

C30#&I-1&0O:G?;0B&C3013&G:D&C30&,-G?-C.7&:..-5:59-.3&9.&/:19-C3&

:&

,&

J&&

&

&

&

8-,9:;&:..-5:59-.3&

%:!& %J!&

!3B*20=2+#>24C1(%

8<=%:!b&W#W*$$&

8<=%J!b&W#WW')&

8<=%,!&bW#W)*X&

S0J&?:203&

S0J&?:20&:..-5:5-13&

&

%$#*!&

%$#"!%$#$!&

%$#)!&

&

&

%$#(!&

%$#'!&

WWW 2007 / Track: Search Session: Search Quality and Precision

505

User-Annotation association matrix / Annotation-Page association matrix
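A heavily simplified, SimRank-style sketch of the SocialSimRank propagation over the annotation-page bipartite graph described above. The real algorithm additionally weights each neighbour pair by the number of users behind the annotation-page link; that weighting is omitted here, and the damping factors default to 0.7 only because that is the setting reported for the experiments.

```python
import numpy as np

def social_sim_rank(M_AP, C_A=0.7, C_P=0.7, iters=15):
    """Simplified SimRank-style similarity on the annotation-page bipartite graph.
    M_AP[a, p] > 0 iff annotation a was attached to page p.
    Returns (S_A, S_P): annotation-annotation and page-page similarity matrices."""
    n_a, n_p = M_AP.shape
    S_A, S_P = np.eye(n_a), np.eye(n_p)
    pages_of = [np.nonzero(M_AP[a])[0] for a in range(n_a)]      # pages tagged with a
    anns_of = [np.nonzero(M_AP[:, p])[0] for p in range(n_p)]    # annotations of page p
    for _ in range(iters):
        S_A_new, S_P_new = np.eye(n_a), np.eye(n_p)
        for a in range(n_a):
            for b in range(a + 1, n_a):
                Pa, Pb = pages_of[a], pages_of[b]
                if len(Pa) and len(Pb):
                    s = C_A * S_P[np.ix_(Pa, Pb)].sum() / (len(Pa) * len(Pb))
                    S_A_new[a, b] = S_A_new[b, a] = s
        for p in range(n_p):
            for q in range(p + 1, n_p):
                Ap, Aq = anns_of[p], anns_of[q]
                if len(Ap) and len(Aq):
                    s = C_P * S_A[np.ix_(Ap, Aq)].sum() / (len(Ap) * len(Aq))
                    S_P_new[p, q] = S_P_new[q, p] = s
        S_A, S_P = S_A_new, S_P_new
    return S_A, S_P
```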

Experimental Results


•  Using SocialPageRank significantly improves both MAP and NDCG measures:

[Scan of Bao07 p. 508 (same results page as above): Tables 5-6 compare the MAP of the baseline against Baseline+SPR and against Baseline+SSR,SPR on MQ50 and AQ3000; the combined model gives the largest, statistically significant gains.]

•  Observation: social annotations characterize well the topics of pages and the interests of users

•  Rank query results for query q, page p, user u as follows: r(u, q, p) = γ · r_term(q, p) + (1 − γ) · r_topic(u, p)

•  Compute r_topic(u, p) as the cosine similarity between the annotations of u and the annotations of p (see the sketch below)
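A minimal sketch of this personalized re-ranking step (Weighted Borda-Fuse over the term-matching rank and the topic-matching rank, Equation 2 in the excerpt below). The sparse tag-vector representation and the function names are illustrative assumptions.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse tag vectors (dict: tag -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def personalized_rank(term_ranked, user_tags, page_tags, gamma=0.5):
    """Weighted Borda-Fuse of the term-matching rank and the topic-matching rank.
    term_ranked: pages ordered by query-term matching; user_tags: the user's tag vector;
    page_tags: dict page -> tag vector. Returns the pages re-ranked for this user."""
    r_term = {p: i + 1 for i, p in enumerate(term_ranked)}
    by_topic = sorted(term_ranked,
                      key=lambda p: cosine(user_tags, page_tags.get(p, {})),
                      reverse=True)
    r_topic = {p: i + 1 for i, p in enumerate(by_topic)}
    # lower combined rank = better
    return sorted(term_ranked, key=lambda p: gamma * r_term[p] + (1 - gamma) * r_topic[p])
```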

omy, and then discuss in detail the approach we propose for personalized search.

3.1 Analysis of Folksonomy

What can folksonomy bring us in personalized search? The best way to answer this question is to analyze it.

Social Annotations as Category Names. In folksonomy systems, the users are free to choose any social annotations to classify and organize their bookmarks. Though there may be some noise, each social annotation represents a topic that is related to its semantic meaning [17]. Based on this, the social annotations owned by the web pages and the users reflect their topics and interests respectively.

Social Annotations as Keywords. As discussed in [2, 4, 27], the annotations are very close to human generated keywords. Thus, the social annotations usually can well describe or even complement the content of the web pages.

Collaborative Link Structure. One of the most important benefits that online folksonomy systems bring is the collaborative link structure created by the users unconsciously. The underlying link structure of tagging systems has been explored in many prior efforts [4, 18, 27]. The whole underlying structures of folksonomy systems are rather complex. Different researchers may reduce the complexity of modeling the structure by various simplified models, e.g. in [27] the structure is modeled through a latent semantic layer, while in [4] the relations between the annotations and the web pages are modeled using a bipartite graph. In our work, since the relations between the users and the web pages are very important, we model the structure using a user-web page bipartite graph as shown in Figure 1.

!"!# !$!%

!&

!'

!(

)" )% )# )*

+" +#+% +,

Figure 1: User-web page bipartite structure

where ui, i = 1, 2, · · · , n denote n users, pj , j = 1, 2, · · · , mdenote m web pages, Wk, k = 1, 2, · · · , l are the weights ofthe links, i.e. the bookmarking actions of the users. One ofthe simplest implementation of the weights is the number ofannotations a user assigned to a web page.

3.2 A Personalized Search Framework

In classical non-personalized search engines, the relevance between a query and a document is assumed to be decided only by the similarity of term matching. However, as pointed out in [21], relevance is actually relative for each user. Thus, query term matching alone is not enough to generate satisfactory search results for various users.

In the widely used Vector Space Model (VSM), all the queries and the documents are mapped to vectors in a universal term space. The similarity between a query and a document is calculated through the cosine similarity between the query term vector and the document term vector. Though simple, the model shows amazing effectiveness and efficiency.

Inspired by the VSM model, we propose to model the associations between the users and the web pages using a topic space. Each dimension of the topic space represents a topic. The topics of the web pages and the interests of the users are represented as vectors in this space. Further, we define a topic similarity measurement using the cosine function. Let pt_i = (w_{1,i}, w_{2,i}, ..., w_{α,i}) be the topic vector of the web page p_i, where α is the dimension of the topic space and w_{k,i} is the weight of the kth dimension. Similarly, let ut_j = (w_{1,j}, w_{2,j}, ..., w_{α,j}) be the interest vector of the user u_j. The topic similarity between p_i and u_j is calculated as Equation 1:

    sim_topic(p_i, u_j) = (pt_i · ut_j) / (|pt_i| × |ut_j|)    (1)

Based on the topic space, we make a fundamental personalized search assumption, i.e. Assumption 1.

Assumption 1. The rank of a web page p in the result list when a user u issues a query q is decided by two aspects: a term matching between q and p and a topic matching between u and p.

When a user u issues a query q, we assume two search processes, a term matching process and a topic matching process. The term matching process calculates the similarity between q and each web page to generate a user-unrelated ranked document list. The topic matching process calculates the topic similarity between u and each web page to generate a user-related ranked document list. Then a merge operation is conducted to generate a final ranked document list based on the two sub ranked document lists. We adopt ranking aggregation to implement the merge operation.

Ranking Aggregation is to compute a "consensus" ranking of several sub rankings [11]. There are a lot of rank aggregation algorithms that can be applied in our work. Here we choose one of the simplest, Weighted Borda-Fuse (WBF). Equation 2 shows our idea.

    r(u, q, p) = γ · r_term(q, p) + (1 − γ) · r_topic(u, p)    (2)

where r_term(q, p) is the rank of the web page p in the ranked document list generated by query term matching, r_topic(u, p) is the rank of p in the ranked document list generated by topic matching, and γ is a weight that satisfies 0 ≤ γ ≤ 1.

Obviously, how to select a proper topic space and how to accurately estimate the user interest vectors and the web page topic vectors are two key points in this framework. The next two subsections discuss these problems.

3.3 Topic Space Selection

In web page classification, the web pages are classified into several predefined categories. Intuitively, the categories of web page classification are very similar to the topics of the topic space. In today's World Wide Web, there are two classification systems, the traditional taxonomy such as ODP and the new folksonomy. Both classification systems can be applied in our framework. Since our work focuses on exploring the folksonomy for personalized search, we set the ODP topic space as a baseline.

3.3.1 Folksonomy: Social Annotations as Topics

Based on the categorization feature, we set the social annotations to be the dimensions of the topic space. Thus, the topic vector of a web page can be simply estimated by its social annotations directly. In the same way, the interest

Exploring Folksonomy for Personalized Search (SIGIR 2008)
Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University); Ben Fei, Zhong Su (IBM China Research Lab)

ABSTRACT

As a social service in Web 2.0, folksonomy provides the users the ability to save and organize their bookmarks online with "social annotations" or "tags". Social annotations are high quality descriptors of the web pages' topics as well as good indicators of web users' interests. We propose a personalized search framework to utilize folksonomy for personalized search. Specifically, three properties of folksonomy, namely the categorization, keyword, and structure property, are explored. In the framework, the rank of a web page is decided not only by the term matching between the query and the web page's content but also by the topic matching between the user's interests and the web page's topics. In the evaluation, we propose an automatic evaluation framework based on folksonomy data, which is able to help lighten the common high cost in personalized search evaluations. A series of experiments are conducted using two heterogeneous data sets, one crawled from Del.icio.us and the other from Dogear. Extensive experimental results show that our personalized search approach can significantly improve the search quality.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Information Search and Retrieval—Search Process

General Terms

Algorithms, Measurement, Experimentation, Performance

Keywords

Folksonomy, Personalized Search, Topic Space, Web 2.0, Automatic Evaluation Framework

(* Part of this work was done while Shengliang Xu and Shenghua Bao were interns at IBM China Research Lab.)


1. INTRODUCTION

In today's search market, the most popular search paradigm is keyword search. Despite simplicity and efficiency, keyword queries can not accurately describe what the users really want. People engaged in different areas may have different understandings of the same literal keywords. The authors of [26] concluded that people differ significantly in the search results they considered to be relevant for the same query.

One solution to this problem is Personalized Search. By considering user-specific information [21], search engines can to some extent distinguish the exact meaning the users want to express by the short queries. Along with the evolution of the World Wide Web, many kinds of personal data have been studied for personalized search, including user manually selected interests [16, 8], web browser bookmarks [23], users' personal document corpus [7], search engine click-through history [10, 22, 24], etc. In all, search personalization is one of the most promising directions for the traditional search paradigm to go further.

In recent years, there has been growing interest in the new Web 2.0 environment. One feature of Web 2.0 that distinguishes it from the classical World Wide Web is the social data generation mode. The service providers only provide platforms for the users to collaborate and share their data online. Such services include folksonomy, blog, wiki and so on. Since the data are generated and owned by the users, they form a new set of personal data. In this paper, we focus on exploring folksonomy for personalized search.

The term "folksonomy" is a combination of "Folk" and "Taxonomy" to describe the social classification phenomenon [3]. Online folksonomy services, such as Del.icio.us, Flickr and Dogear [19], enable users to save and organize their bookmarks, including any accessible resources, online with freely chosen short text descriptors, i.e. "social annotations" or "tags", in a flat structure. The users are able to collaborate during bookmarking and tagging explicitly or implicitly. The low barrier and facility of this service have successfully attracted a large number of users to participate.

The folksonomy creates a social association between the users and the web pages through social annotations. More specifically, a user who has a given annotation may be interested in the web pages that have the same annotation. Inspired by this, we propose to model the associations between the users and the web pages using a topic space. The interests of each user and the topics of each web page can

Experimental Results

records, denoted as DEL.gt500. The 3 test beds from the Dogear data set are built in the same way as for Del.icio.us, denoted as DOG.5-10, DOG.80-100 and DOG.gt500 respectively. The purpose of building the 6 test beds is not only to evaluate the model in the two different environments, i.e. web and enterprise, but also to evaluate it in situations with different amounts of data.

Before the experiments we perform two data preprocessing steps. 1) Several of the annotations are too personal or meaningless, such as "toread", "Imported IE Favorites", "system:imported", etc. We remove some of them manually. 2) Some users may concatenate several words to form an annotation, e.g. javaprogramming, java/programming, etc. We split this kind of annotation with the help of a dictionary. Table 2 presents the statistics of the two data sets and the 6 test beds after data preprocessing, where "Num.Users" denotes the number of users, "Max.Tags" denotes the maximum number of distinct tags owned by each user, and the remaining columns have meanings analogous to "Max.Tags". As for

Table 2: Statistics of the user owned tags and web pages of the experiment data

Data Set     Num.Users  Max.Tags  Min.Tags  Avg.Tags  Max.Pages  Min.Pages  Avg.Pages
Delicious         9813      2055         1     56.04       1790          1      40.35
Dogear            5192      2288         1     47.43       4578          1      46.78
DEL.gt500           31      1133        74    464.42       1790        506     727.55
DEL.80-100         100       456         2    107.51        100         80      88.43
DEL.5-10           100        64         1     18.53         10          5       7.44
DOG.gt500           92      2147        42    543.87       4578        500     999.04
DOG.80-100          85       295         9    126.96        100         80      89.32
DOG.5-10           100        41         2     16.11         10          5       6.99

each test bed, we randomly split it into 2 parts, an 80% training part and a 20% test part. The training parts are used to estimate the models while the test parts are used for evaluating. All the preprocessed data sets are used in the experiments. No other filtering is conducted.

5.1.2 Personalized Search Framework Implementation

Our personalized search framework needs two separate ranked lists of web pages. In practice, instead of generating two full ranked lists of all the web pages, an alternative approach that costs less is to rerank only the top ranked results fetched by the text matching model. In the experiments, we conduct such reranking based on two state-of-the-art text retrieval models, BM25 and Language Model for IR (LMIR). Firstly, a ranked list by a text retrieval model is generated. Then the top 100 web pages in the ranked list are reranked by our personalized search model.

5.1.3 Parameter Setting

Before the experiments, there are three sets of parameters that must be set. The first two parameters are the α and β in the Topic Adjusting Algorithm in Section 3.4. We simply set them both to 0.5 to keep the same influence for the initial social annotations' literal contents and the link structure. The second set of parameters are the γs in the ranking aggregation when using various search models under various test beds, i.e. Equation 2. We conduct a simple training process to estimate the γs as shown in Procedure 1. The concrete value of γ under each search model and test bed

Procedure 1. Ranking aggregation parameter training process

  foreach test bed TB ∈ 6 test beds do
      split the training part of TB into 4 parts TN_i, 1 ≤ i ≤ 4
      foreach TN ∈ TN_i do
          train the interest vectors and the topic vectors using the other 3 training parts
          run the evaluation 11 times using TN with γ set to 0.0, 0.1, ..., 1.0 respectively
          record the γ that leads to the optimal performance
      set the average of the 4 γs as the final parameter
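A compact sketch of the cross-validated grid search that Procedure 1 describes; train_fn and evaluate_fn stand in for the framework's own training and MMAP evaluation routines and are assumptions, not part of the paper.

```python
import numpy as np

def tune_gamma(folds, train_fn, evaluate_fn, grid=np.arange(0.0, 1.01, 0.1)):
    """Cross-validated grid search for the aggregation weight gamma (Procedure 1).
    folds: the 4 training parts; train_fn(other_folds) builds interest/topic vectors;
    evaluate_fn(model, fold, gamma) returns MMAP on the held-out part."""
    best = []
    for i, fold in enumerate(folds):
        model = train_fn([f for j, f in enumerate(folds) if j != i])
        scores = [(evaluate_fn(model, fold, g), g) for g in grid]
        best.append(max(scores)[1])   # gamma with the highest MMAP on this fold
    return float(np.mean(best))       # average of the per-fold optima
```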

is listed in Table 3. In addition, we set the three parameters k1, k3 and b in BM25 to 1.2, 7 and 0.75 respectively, which is the default parameter scheme in the Lemur toolkit (http://www.lemurproject.org/). For the LMIR we adopt Jelinek-Mercer smoothing [28] with α set to 0.3.

5.1.4 Baseline Models

In the experiments we select 4 baseline models: one is the non-personalized text matching model using no extra information except for contents; the second is the model using the top 16 ODP categories as the topic space, denoted as "ODP1"; the third is the model using 1171 ODP categories as topics, denoted as "ODP2"; and the last is the model proposed in [20], which is actually a simplified case of our personalized search framework when the topic space is set to be the folksonomy and the topic matching function is set to simply counting the number of matched annotations. We refer to it as the "AC" model.

5.1.5 Evaluation Metric

The main evaluation metric we use in our work is mean average precision (MAP), which is a widely used evaluation metric in the IR community. More specifically, in our work, we calculate MAP for each user and then calculate the mean of all the MAP values. We refer to it as Mean MAP or MMAP.

    MMAP = (1 / N_u) · Σ_{i=1..N_u} MAP_i

where MAP_i represents the MAP value of the ith user and N_u is the number of users.

In addition, we perform t-tests on average precisions over all the queries issued by all the users in each experimental data set to show whether the experimental improvements are statistically significant or not.

5.2 Performance

Table 3 lists all the 120 experimental results. The columns "text", "ODP1", "ODP2", "AC", "f.tfidf" and "f.bm25" denote the non-personalized text model, the 16 top-most ODP topic space personalized model, the 1171 ODP topic space personalized model, the AC model, the folksonomy topic space personalized model using the tfidf weighting scheme, and the folksonomy topic space personalized model using the BM25 weighting scheme. The sub-columns "B. A." and "A. A." denote "Before Adjusting by link structure" and "After Adjusting by link structure", respectively. The "*"s in the "MMAP" row stand for four significance levels of the t-test, satisfying 0.05 ≥ * > 0.01 ≥ ** > 0.001 ≥ ***.

As we can see from the table, the 5 personalized search models all outperform the simple text retrieval models significantly.

•  Observed 75%-250% improvement in MAP for all data sets

•  Improvement is larger for the data sets whose users own fewer bookmarks, because typically their annotations are semantically richer

Summary

•  Social Annotations (tags) can help improve search quality in the Enterprise

•  While they can be directly used as features for the ranking function, exploiting their collaborative properties helps to further improve search quality

•  Annotations can also be used to infer users' interests and provide personalized search results

Users' Browsing Traces

•  Observe users' browsing behavior after entering a query and clicking on a search result

•  Rank websites for a new query based on how heavily they were browsed by users after entering the same or similar queries

•  Use it as a feature in the search ranking algorithm

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity (WWW 2008)
Mikhail Bilenko and Ryen W. White (Microsoft Research, Redmond, WA)

ABSTRACT

The paper proposes identifying relevant information sources from

the history of combined searching and browsing behavior of many

Web users. While it has been previously shown that user inter-

actions with search engines can be employed to improve docu-

ment ranking, browsing behavior that occurs beyond search re-

sult pages has been largely overlooked in prior work. The pa-

per demonstrates that users’ post-search browsing activity strongly

reflects implicit endorsement of visited pages, which allows esti-

mating topical relevance of Web resources by mining large-scale

datasets of search trails. We present heuristic and probabilistic

algorithms that rely on such datasets for suggesting authoritative

websites for search queries. Experimental evaluation shows that

exploiting complete post-search browsing trails outperforms alter-

natives in isolation (e.g., clickthrough logs), and yields accuracy

improvements when employed as a feature in learning to rank for

Web search.

Categories and Subject Descriptors

H.2.8 [Database Management]: Data Mining; H.3.3 [Information

Storage and Retrieval]: Information Search and Retrieval

Keywords

Learning from user behavior, Mining search and browsing logs,

Implicit feedback, Web search

1. INTRODUCTION

Traditional information retrieval (IR) techniques identify docu-

ments relevant to a given query by computing similarity between

the query and the documents’ contents [36]. Challenges posed by

IR on Web scale motivated a number of approaches that exploit

data sources beyond document contents, such as the structure of

the hyperlink graph [10, 23, 26], or users’ interactions with search

engines [17, 2, 4, 42], as well as machine learning methods that

combine multiple features for estimating resource relevance [5, 7,

33, 19].

A common theme unifying many of these recent IR algorithms is

the use of evidence stemming from unorganized behavior of many

individuals for estimating document authority. Hyperlink structure

created by millions of individual Web page authors is one example

of phenomena arising from local activity of many users, which is

exploited by such algorithms as HITS [23] and PageRank [26], as


well as many others. Another group of IR algorithms that lever-

age user behavior on a large scale includes methods that utilize

search engine clickthrough logs, where users’ clicks on search re-

sults provide implicit affirmation of the corresponding pages’ au-

thority and/or relevance to the original query [17, 48, 2, 4]. Addi-

tionally, search engine query logs can be used to incorporate query

context derived from users’ search histories, leading to better query

language models that improve search accuracy [42].

While query and clickthrough logs from search engines have

been shown to be a valuable source of implicit supervision for train-

ing retrieval methods, the vast majority of users’ browsing behav-

ior takes place beyond search engine interactions. It has been re-

ported in previous studies that users’ information seeking behavior

often involves orienteering: navigating to desired resources via a

sequence of steps, instead of attempting to reach the target doc-

ument directly via a search query [43]. Therefore, post-search

browsing behavior provides valuable evidence for identifying doc-

uments relevant to users’ information goals expressed in preceding

search queries.

This paper proposes exploiting a combination of searching and

browsing activity of many users to identify relevant resources for

future queries. To the best of our knowledge, previous approaches

have not considered mining the history of user activity beyond search

results, and our experimental results show that comprehensive logs

of post-search behavior are an informative source of implicit feed-

back for inferring resource relevance. We also depart from prior

work in that we propose term-based models that generalize to pre-

viously unseen queries, which comprise a significant proportion of

real-world search engine submissions.

To demonstrate the utility of exploiting collective search and

browsing behavior for estimating document authority, we describe

several methods that rely on such data to identify relevant web-

sites for new queries. Our initial approach is motivated by heuris-

tic methods used in traditional vector-space information retrieval.

Next, we improve on it by employing a probabilistic generative

model for documents, queries and query terms, and obtain our best

results using a variant of the model that incorporates a simple random-

walk modification. Intuitively, all of these algorithms leverage user

behavior logs to suggest websites for a new query that were heavily

browsed by users after entering similar (or same) queries.

We evaluate the proposed algorithms using an independent dataset

of Internet search engine queries for which human judges identi-

fied relevant Web pages, as well as an automatically constructed

dataset consisting of previously unseen queries. Results demon-

strate that relevant websites are identified most accurately when

complete post-search browsing trails with associated dwell times

are used, compared with using just users’ search result clicks, or

ignoring dwell times. We also show that a query-term model is

Search Trails

•  Start with a search engine query
•  Continue until a terminating event
   – Another search
   – Visit to an unrelated site (social networks, webmail)
   – Timeout, browser homepage, browser closing

q → (p1, p2, p1, p3, p4, p3, p5)
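A rough sketch of how a raw browsing log could be segmented into search trails following the definition above. The event representation, the list of "unrelated" sites and the 30-minute timeout are illustrative assumptions (browser homepage and browser-close events are not modelled here).

```python
from dataclasses import dataclass, field
from typing import List

UNRELATED = {"facebook.com", "mail.google.com"}   # illustrative "unrelated site" list
TIMEOUT = 30 * 60                                 # seconds between events; assumption

@dataclass
class Trail:
    query: str
    pages: List[str] = field(default_factory=list)

def extract_trails(events):
    """events: time-ordered tuples (timestamp, kind, value), kind in {'query', 'pageview'}."""
    trails, current, last_t = [], None, None
    for t, kind, value in events:
        timed_out = last_t is not None and t - last_t > TIMEOUT
        if kind == "query" or timed_out or (kind == "pageview" and value in UNRELATED):
            if current and current.pages:
                trails.append(current)            # close the running trail
            current = Trail(query=value) if kind == "query" else None
        elif kind == "pageview" and current is not None:
            current.pages.append(value)           # page visited within the trail
        last_t = t
    if current and current.pages:
        trails.append(current)
    return trails
```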

Using Search Trails for Ranking

•  Approach 1: Adapt the BM25 scoring function

•  Approach 2: Probabilistic model

4.1 Heuristic Retrieval Model

First, we consider an ad-hoc model motivated by the empiri-

cal success of the term frequency × inverse document frequency

(TF.IDF) heuristic and its variants for traditional content-based re-

trieval. Based on the search trail corpus D, we construct a vector-space representation for documents, where each document di is

represented via the agglomeration of queries following which the

page was visited in the search trails. Every document is thus de-

scribed as a sparse vector, every non-zero element of which en-

codes the relative weight of the corresponding term.

Weights in this model must capture the frequency with which

users have visited the document following queries containing each

term, scaled proportionally to the term’s relative specificity across

the query corpus. Then, given the search trail corpusD, the compo-nent corresponding to term tj in the vector representing document

di can be computed as a product of query-based term frequency

QTFi,j and the term’s inverse query frequency IQFj :

wdi,tj = QTFi,j · IQFj =

=(λ + 1)n(di, tj)

λ((1 − β) + β n(di)n̄(di)

) + n(di, tj)· log

Nd − n(tj) + 0.5n(tj) + 0.5

where:

• λ and β are smoothing parameters; while in this work we useλ = 0.5 and β = 0.75, we found that results are relativelyrobust to the choice of specific values;

• n(di, tj) =∑

q!di,tj∈q f(q ! di) is the term frequency

aggregated over all trails that begin with queries containing

term tj and include document di, where the aggregation is

performed via the feature function f(q ! di) computed forthe document from each trail;

• n(di) is the total number of terms in all queries followed by search trails that include document di;

• n̄(di) is the average value of n(di) over all documents in D;

• n(tj) is the number of documents for which queries leading to them include the term tj;

• Nd is the total number of documents.

This formula is effectively an adaptation of the BM25 scoring

function, which is a variant of the traditional TF.IDF heuristic that

has provided good performance on a number of retrieval bench-

marks [34]. To instantiate the term frequencies computed from all

trails leading from queries containing the term to the document,

n(di, tj), different instantiations of the feature function f(q → di) are possible that weigh the contribution of each particular trail. In

this work, we consider three variants of this feature function:

• Raw visitation count: f(q → di) = 1;

• Dwell time: f(q → di) = τ(q → di), where τ(q → di) is the total dwell time for document di in this particular trail;

• Log of dwell time: f(q → di) = log τ(q → di).

Given the large size of typical search trail datasets D, computation of the document vectors can be performed efficiently in sev-

eral passes over the data for term and document index construc-

tion, term-document frequency computation, and final estimation

of document-term scores.

Given a new query q̂ = {t̂1, . . . , t̂k}, candidate documents are retrieved from the inverted index and their relevance is computed

via the dot product between the document and query vectors:

Rel_H(d_i, \hat{q}) = \sum_{\hat{t}_j \in \hat{q}} w_{d_i, \hat{t}_j} \cdot w_{\hat{t}_j} \qquad (1)

where wt̂j is the relative weight of each term in the query, computed using inverse query frequency over the set of all queries in the dataset: w_{\hat{t}_j} = \log \frac{N_q - n(\hat{t}_j) + 0.5}{n(\hat{t}_j) + 0.5}, with Nq and n(t̂j) being the total number of queries and the number of queries containing term t̂j, respectively.
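
The following is a minimal Python sketch (not the authors' implementation) of how the query-based TF.IDF weighting above and the scoring in Eq. (1) could be computed from a trail corpus; the trail data structure and the parameter defaults are assumptions for illustration.

# Sketch of the heuristic (BM25-style) trail model.  `trails` is assumed to be an
# iterable of (query_terms, {document: dwell_time_in_seconds}) pairs.
import math
from collections import defaultdict

LAMBDA, BETA = 0.5, 0.75   # smoothing parameters quoted in the text

def build_weights(trails, feature=lambda dwell: math.log(max(dwell, 1.0))):
    """Builds w_{d_i,t_j} = QTF_{i,j} * IQF_j using the log-of-dwell-time feature."""
    n_dt = defaultdict(float)            # n(d_i, t_j): aggregated feature values
    n_d = defaultdict(int)               # n(d_i): query terms leading to d_i
    docs_for_term = defaultdict(set)     # used for n(t_j)
    n_q = defaultdict(int)               # number of queries containing t_j (query-side IQF)
    num_queries = 0
    for terms, visited in trails:
        num_queries += 1
        for t in set(terms):
            n_q[t] += 1
        for d, dwell in visited.items():
            n_d[d] += len(terms)
            for t in terms:
                n_dt[(d, t)] += feature(dwell)
                docs_for_term[t].add(d)
    n_bar = sum(n_d.values()) / max(len(n_d), 1)
    num_docs = len(n_d)
    index = defaultdict(dict)            # inverted index: term -> {doc: weight}
    for (d, t), ndt in n_dt.items():
        qtf = (LAMBDA + 1) * ndt / (LAMBDA * ((1 - BETA) + BETA * n_d[d] / n_bar) + ndt)
        iqf = math.log((num_docs - len(docs_for_term[t]) + 0.5) / (len(docs_for_term[t]) + 0.5))
        index[t][d] = qtf * iqf
    return index, n_q, num_queries

def rel_h(query_terms, index, n_q, num_queries):
    """Eq. (1): dot product of document weights and query-term IQF weights."""
    scores = defaultdict(float)
    for t in query_terms:
        w_t = math.log((num_queries - n_q.get(t, 0) + 0.5) / (n_q.get(t, 0) + 0.5))
        for d, w in index.get(t, {}).items():
            scores[d] += w * w_t
    return sorted(scores.items(), key=lambda kv: -kv[1])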

4.2 Probabilistic Retrieval Model

Statistical approaches to content-based information retrieval have

been considered alongside heuristic methods for several decades,

and have attracted increasing attention recently. Besides theoret-

ical elegance, probabilistic retrieval models provide competitive

performance and can be used to explain the empirical success of

the heuristics (e.g., a number of papers have proposed generative

interpretations of the IDF heuristic). Hence, we employ a statisti-

cal framework to formulate an alternative approach for retrieving

documents most relevant to a given query, provided a large dataset

of users’ past searching and browsing behavior.

We consider a generative model for queries, terms, and docu-

ments, where every query q instantiates a multinomial distribution over its terms. Every term in the vocabulary is in turn associated

with a multinomial distribution over the documents, which can be

viewed as the likelihood of a user browsing to the document after

submitting a query that contains the term (or the likelihood of the

user viewing the document per unit time, depending on the partic-

ular instantiation of the distribution). In effect, this probability en-

codes the topical relevance of the document for the particular term.

Then, the probability of selecting document di given a new query

q̂ can be used to estimate the document’s relevance:

Rel_P(d_i, \hat{q}) = p(d_i|\hat{q}) = \sum_{\hat{t}_j \in \hat{q}} p(\hat{t}_j|\hat{q})\, p(d_i|\hat{t}_j) \qquad (2)

Selecting particular parameterizations for the query-term distri-

bution p(t̂j|q̂) and the distribution over documents for a given term p(di|t̂j) allows instantiating different retrieval functions. In this work, we estimate the instantaneous multinomial query-term like-

lihood p(t̂j |q̂) via a negative exponentiated term prior estimated

from the query corpus, which has an effect similar to IDF weight-

ing: less frequent terms have higher influence on selection of doc-

uments:

p(\hat{t}_j|\hat{q}) = \frac{\exp(-p(\hat{t}_j))}{\sum_{\hat{t}_l \in \hat{q}} \exp(-p(\hat{t}_l))} = \frac{\exp\left(-\frac{n(\hat{t}_j) + \mu}{\sum_{\hat{t}_s \in D} n(\hat{t}_s) + \mu}\right)}{\sum_{\hat{t}_l \in \hat{q}} \exp\left(-\frac{n(\hat{t}_l) + \mu}{\sum_{\hat{t}_s \in D} n(\hat{t}_s) + \mu}\right)} \qquad (3)

where n(t̂j) is the number of queries containing term t̂j, as in the previous subsection, and µ is a smoothing constant (we set µ = 10 in all experiments; alternative smoothing approaches are possible,

e.g., those used in language modeling [49]).

Document probabilities for every query term can be estimated as

maximum-likelihood estimates over the training data for all paths

in D that originate with queries containing t̂j. Using Laplace smooth-

ing for more robust estimation, term-document probabilities be-

come:

p(d_i|\hat{t}_j) = \frac{n(d_i, \hat{t}_j)}{\sum_{d_l \in D} n(d_l, \hat{t}_j)} \qquad (4)

Instead of term frequency in a document, use the sum of logs of dwell times on di from queries containing tj

Instead of inverse document frequency, use the number of docs for which queries leading to them include tj
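
Similarly, here is a small sketch of the probabilistic model in Eqs. (2)-(4); it is an assumption-laden illustration rather than the paper's code, the trail data has the same shape as in the previous sketch, and Laplace smoothing of Eq. (4) is omitted for brevity.

# Sketch of the probabilistic trail model (Eqs. 2-4).
import math
from collections import defaultdict

MU = 10.0   # smoothing constant quoted in the text

def train_prob_model(trails):
    n_q = defaultdict(int)                              # n(t_j): queries containing t_j
    n_td = defaultdict(lambda: defaultdict(float))      # term -> {doc: aggregated count}
    for terms, visited in trails:
        for t in set(terms):
            n_q[t] += 1
        for d in visited:
            for t in terms:
                n_td[t][d] += 1.0                       # raw-visitation-count variant of f
    return n_q, n_td, sum(n_q.values())

def rel_p(query_terms, n_q, n_td, total_term_count):
    """Eq. (2): Rel_P(d, q) = sum_j p(t_j|q) * p(d|t_j)."""
    # Eq. (3): negative exponentiated term prior, normalized over the query's terms.
    prior = {t: math.exp(-(n_q.get(t, 0) + MU) / (total_term_count + MU)) for t in query_terms}
    z = sum(prior.values()) or 1.0
    scores = defaultdict(float)
    for t in query_terms:
        p_t = prior[t] / z
        total = sum(n_td.get(t, {}).values())           # denominator of Eq. (4)
        for d, n in n_td.get(t, {}).items():
            scores[d] += p_t * (n / total)              # Eq. (4), maximum-likelihood estimate
    return sorted(scores.items(), key=lambda kv: -kv[1])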


Experimental Results

•  Dataset: 140 million search trails; 33,150 queries with 5-point-scale human judgments (a site gets the highest relevance score of all its pages)
•  Add the website rank feature to RankNet (Burges05)
•  Measure the improvement in NDCG

[Chart: NDCG@1, NDCG@3, and NDCG@10 (y-axis 0.58 to 0.72) for Baseline, Baseline+Heuristic, Baseline+Probabilistic, and Baseline+Probabilistic+RW]

•  Use all users' browsing traces to infer "implicit links" between pairs of web pages
•  Intuitively, there is an implicit link between two pages if they are visited together on many browsing paths
•  Construct a graph with pages as nodes and implicit links as edges and use it to calculate PageRank

Implicit Link Analysis for Small Web Search

Gui-Rong Xue¹  Hua-Jun Zeng²  Zheng Chen²  Wei-Ying Ma²  Hong-Jiang Zhang²  Chao-Jun Lu¹

¹Computer Science and Engineering, Shanghai Jiao-Tong University, Shanghai 200030, P.R.China
[email protected], [email protected]

²Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing 100080, P.R.China
{i-hjzeng, zhengc, wyma, hjzhang}@microsoft.com


Implicit Link Generation

•  Use a gliding window to move over each browsing path, generating all ordered pairs of pages and counting the occurrence of each pair
•  Select pairs which have frequency > t as implicit links (see the sketch below)
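
A minimal sketch of this mining step, under the assumption that browsing sessions are available as sequences of page identifiers. The paper defines support as the fraction of sessions containing a pair; a raw count threshold is used here for brevity, and the window size and threshold values are illustrative.

# Sketch of implicit-link generation via two-item sequential pattern mining.
from collections import Counter

def implicit_links(paths, window=5, min_count=5):
    """paths: iterable of page-id sequences (one per browsing session).
    Returns {(src, dst): weight} with outgoing weights normalized per source page."""
    pair_counts = Counter()
    for path in paths:
        for i, src in enumerate(path):
            for dst in path[i + 1 : i + 1 + window]:   # gliding window of ordered pairs
                if dst != src:
                    pair_counts[(src, dst)] += 1
    links = {pair: c for pair, c in pair_counts.items() if c >= min_count}
    out_total = Counter()
    for (src, _), c in links.items():
        out_total[src] += c
    # normalized weights can be read as transition probabilities for the PageRank step
    return {(src, dst): c / out_total[src] for (src, dst), c in links.items()}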


Using Implicit Links in Ranking

•  Calculate PageRank based on the web graph with implicit links
•  Combine PageRank and content-based similarity using a weighted linear combination (sketched below)
•  Approach 1: use raw scores
•  Approach 2: use ranks instead of scores
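
A compact sketch of this re-ranking step: the implicit-link graph from the previous sketch feeds a PageRank computation whose values are then linearly combined with a content-based score. The damping factor, combination weight, and iteration count are illustrative, not the paper's settings.

def pagerank(links, damping=0.85, iterations=50):
    """links: {(src, dst): weight}, with each source's outgoing weights summing to 1."""
    nodes = {n for pair in links for n in pair}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    outgoing = {}
    for (src, dst), w in links.items():
        outgoing.setdefault(src, []).append((dst, w))
    for _ in range(iterations):                      # simple power/Jacobi-style iteration
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, edges in outgoing.items():
            for dst, w in edges:
                new_rank[dst] += damping * rank[src] * w
        rank = new_rank                              # dangling-node mass is ignored in this sketch
    return rank

def score_based_rerank(content_sim, pr, alpha=0.5):
    """Approach 1: alpha * content similarity + (1 - alpha) * PageRank value."""
    docs = set(content_sim) | set(pr)
    return sorted(docs, key=lambda d: -(alpha * content_sim.get(d, 0.0)
                                        + (1 - alpha) * pr.get(d, 0.0)))

Approach 2 (rank-based combination) would instead linearly combine each page's position in the similarity-ordered list and in the PageRank-ordered list.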


Experimental Results

•  Dataset: 4 months of logs from www.cs.berkeley.edu (300,000 traces; 170,000 pages; 60,000 users)
•  216,748 explicit links; 336,812 implicit links (11% are common to both sets)
•  10 queries; volunteers identify relevant pages and the 10 most authoritative pages for each query out of the top 30 results
•  Measure "Precision@30" and "Authority@10"

Experimental Results

[Figure: Precision and authority of the different ranking methods (full-text search, explicit-link PageRank, DirectHit, modified HITS, and implicit-link PageRank), averaged over the 10 queries]

Summary

•  User browsing traces can be collected easily in the Enterprise
•  Two types of traces:
   –  Traces starting from search engine queries
   –  Arbitrary traces
•  Traces are very useful for calculating the authoritativeness of web pages and websites, and can be successfully used to improve search ranking

Short-term User Context and Eye-tracking-based Feedback

•  Two types of user context information:
   –  Short-term context
   –  Long-term context
•  Long-term context:
   –  User's topics of interest, department and position, accumulated query history, desktop context, etc.
•  Short-term context:
   –  Queries and clicks in the same session, the text the user has read in the past 5 min, etc.

Context-Sensitive Information Retrieval Using Implicit Feedback

Xuehua Shen, Bin Tan, ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign

ABSTRACT

A major limitation of most existing retrieval models and systems is that the retrieval decision is made based solely on the query and document collection; information about the actual user and search context is largely ignored. In this paper, we study how to exploit implicit feedback information, including previous queries and clickthrough information, to improve retrieval accuracy in an interactive information retrieval setting. We propose several context-sensitive retrieval algorithms based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents. We use the TREC AP data to create a test collection with search context information, and quantitatively evaluate our models using this test set. Experiment results show that using implicit feedback, especially the clicked document summaries, can improve retrieval performance substantially.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms

Algorithms

Keywords

Query history, query expansion, interactive retrieval, context

1. INTRODUCTION

In most existing information retrieval models, the retrieval problem is treated as involving one single query and a set of documents. From a single query, however, the retrieval system can only have very limited clue about the user's information need. An optimal retrieval system thus should try to exploit as much additional context information as possible to improve retrieval accuracy, whenever it is available. Indeed, context-sensitive retrieval has been identified as a major challenge in information retrieval research [2].


There are many kinds of context that we can exploit. Relevance feedback [14] can be considered as a way for a user to provide more context of search and is known to be effective for improving retrieval accuracy. However, relevance feedback requires that a user explicitly provides feedback information, such as specifying the category of the information need or marking a subset of retrieved documents as relevant documents. Since it forces the user to engage additional activities while the benefits are not always obvious to the user, a user is often reluctant to provide such feedback information. Thus the effectiveness of relevance feedback may be limited in real applications.

For this reason, implicit feedback has attracted much attention recently [11, 13, 18, 17, 12]. In general, the retrieval results using the user's initial query may not be satisfactory; often, the user would need to revise the query to improve the retrieval/ranking accuracy [8]. For a complex or difficult information need, the user may need to modify his/her query and view ranked documents with many iterations before the information need is completely satisfied. In such an interactive retrieval scenario, the information naturally available to the retrieval system is more than just the current user query and the document collection – in general, all the interaction history can be available to the retrieval system, including past queries, information about which documents the user has chosen to view, and even how a user has read a document (e.g., which part of a document the user spends a lot of time in reading). We define implicit feedback broadly as exploiting all such naturally available interaction history to improve retrieval results.

A major advantage of implicit feedback is that we can improve the retrieval accuracy without requiring any user effort. For example, if the current query is “java”, without knowing any extra information, it would be impossible to know whether it is intended to mean the Java programming language or the Java island in Indonesia. As a result, the retrieved documents will likely have both kinds of documents – some may be about the programming language and some may be about the island. However, any particular user is unlikely searching for both types of documents. Such an ambiguity can be resolved by exploiting history information. For example, if we know that the previous query from the user is “cgi programming”, it would strongly suggest that it is the programming language that the user is searching for.

Implicit feedback was studied in several previous works. In [11], Joachims explored how to capture and exploit the clickthrough information and demonstrated that such implicit feedback information can indeed improve the search accuracy for a group of people. In [18], a simulation study of the effectiveness of different implicit feedback algorithms was conducted, and several retrieval models designed for exploiting clickthrough information were pro-

Problem of Context-Independent Search

[Diagram: the ambiguous query “Jaguar” could refer to a car, an animal, Apple software, or chemistry software]

Putting Search in Context

[Diagram: context around the current query – Query History (e.g., “Apple software”, “Hobby …”), Clickthrough, and Other Context Info such as dwelling time and mouse movement]

Short-term Contexts

•  Will look at 2 types of short-term contexts:
   –  Session Query History: preceding queries issued by the same user in the current session
   –  Session Clicked Summary: concatenation of the displayed text about the clicked URLs in the current session
•  Will use the language modeling framework to incorporate the above data into the ranking function

Using Short-term Contexts for Ranking

•  Basic Retrieval Model (see the sketch below):
   –  For each document D build a unigram language model θD, specifying p(ω|θD)
   –  Given a query Q, build a query language model θQ, specifying p(ω|θQ)
   –  Rank the documents according to the KL divergence of the two models:

      D(\theta_Q \,\|\, \theta_D) = \sum_{\omega} P(\omega|\theta_Q) \log \frac{P(\omega|\theta_Q)}{P(\omega|\theta_D)}

•  Assuming the user has already issued k-1 queries Q1, .., Qk-1, we want to estimate the “context query model” θk, specifying p(ω|θk), for the current query Qk to use instead of θQ
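
A minimal sketch of this basic retrieval model follows. The Jelinek-Mercer smoothing against a background collection model is an assumption, since the slide does not fix a smoothing method.

# KL-divergence ranking with unigram language models.
import math
from collections import Counter

def unigram_model(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_rank_score(query_model, doc_tokens, collection_model, lam=0.1):
    """Returns -D(theta_Q || theta_D); higher is better.  Only words with
    p(w|theta_Q) > 0 contribute, so the sum runs over the query model."""
    doc_model = unigram_model(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        p_d = (1 - lam) * doc_model.get(w, 0.0) + lam * collection_model.get(w, 1e-9)
        score -= p_q * math.log(p_q / p_d)
    return score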

Using Short-term Contexts for Ranking

•  Fixed Coefficient Interpolation:

current query Qk, as well as the query history HQ and clickthrough history HC. We now describe several different language models for exploiting HQ and HC to estimate p(w|θk). We will use c(w, X) to denote the count of word w in text X, which could be either a query or a clicked document's summary or any other text. We will use |X| to denote the length of text X or the total number of words in X.

3.2 Fixed Coefficient Interpolation (FixInt)
Our first idea is to summarize the query history HQ with a unigram language model p(w|HQ) and the clickthrough history HC with another unigram language model p(w|HC). Then we linearly interpolate these two history models to obtain the history model p(w|H). Finally, we interpolate the history model p(w|H) with the current query model p(w|Qk). These models are defined as follows.

p(w|Qi) = c(w, Qi) / |Qi|

p(w|HQ) = (1 / (k − 1)) Σ_{i=1}^{k−1} p(w|Qi)

p(w|Ci) = c(w, Ci) / |Ci|

p(w|HC) = (1 / (k − 1)) Σ_{i=1}^{k−1} p(w|Ci)

p(w|H) = βp(w|HC) + (1 − β)p(w|HQ)

p(w|θk) = αp(w|Qk) + (1 − α)p(w|H)

where β ∈ [0, 1] is a parameter to control the weight on each history model, and where α ∈ [0, 1] is a parameter to control the weight on the current query and the history information.

If we combine these equations, we see that

p(w|θk) = αp(w|Qk) + (1 − α)[βp(w|HC) + (1 − β)p(w|HQ)]

That is, the estimated context query model is just a fixed coefficient

interpolation of three models p(w|Qk), p(w|HQ), and p(w|HC).
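A minimal sketch of the FixInt combination above, assuming whitespace tokenization; the helper names are illustrative, and the defaults α = 0.1, β = 1.0 follow the FixInt setting reported later in Table 1.

```python
from collections import Counter

def unigram(text):
    """Maximum-likelihood unigram model: p(w|text) = c(w, text) / |text|."""
    toks = text.lower().split()
    n = len(toks) or 1
    return {w: c / n for w, c in Counter(toks).items()}

def average(models):
    """Average a list of unigram models, e.g. p(w|HQ) = (1/(k-1)) * sum_i p(w|Q_i)."""
    out = {}
    for m in models:
        for w, p in m.items():
            out[w] = out.get(w, 0.0) + p / len(models)
    return out

def interpolate(m1, m2, lam):
    """lam * m1 + (1 - lam) * m2 over the union of both vocabularies."""
    vocab = set(m1) | set(m2)
    return {w: lam * m1.get(w, 0.0) + (1 - lam) * m2.get(w, 0.0) for w in vocab}

def fixint(current_query, past_queries, clicked_summaries, alpha=0.1, beta=1.0):
    """p(w|theta_k) = alpha*p(w|Qk) + (1-alpha)*[beta*p(w|HC) + (1-beta)*p(w|HQ)]."""
    if not past_queries and not clicked_summaries:
        return unigram(current_query)  # no history yet: fall back to the current query
    h_q = average([unigram(q) for q in past_queries]) if past_queries else {}
    h_c = average([unigram(s) for s in clicked_summaries]) if clicked_summaries else {}
    history = interpolate(h_c, h_q, beta)
    return interpolate(unigram(current_query), history, alpha)
```

The resulting model can be plugged straight into the KL ranker sketched earlier, e.g. theta_k = fixint("jaguar", ["cgi programming"], ["Java tutorial for CGI scripts ..."]).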

3.3 Bayesian Interpolation (BayesInt)
One possible problem with the FixInt approach is that the coefficients, especially α, are fixed across all the queries. But intuitively, if our current query Qk is very long, we should trust the current query more, whereas if Qk has just one word, it may be beneficial to put more weight on the history. To capture this intuition, we treat p(w|HQ) and p(w|HC) as Dirichlet priors and Qk as the observed data, and estimate a context query model using a Bayesian estimator. The estimated model is given by

p(w|θk) = [ c(w, Qk) + µ p(w|HQ) + ν p(w|HC) ] / ( |Qk| + µ + ν )

        = ( |Qk| / (|Qk| + µ + ν) ) p(w|Qk) + ( (µ + ν) / (|Qk| + µ + ν) ) [ (µ / (µ + ν)) p(w|HQ) + (ν / (µ + ν)) p(w|HC) ]

where µ is the prior sample size for p(w|HQ) and ν is the prior sample size for p(w|HC). We see that the only difference between BayesInt and FixInt is that the interpolation coefficients are now adaptive to the query length. Indeed, when viewing BayesInt as FixInt, we see that α = |Qk| / (|Qk| + µ + ν) and β = ν / (ν + µ); thus, with fixed µ and ν, we will have a query-dependent α. Later we will show that such an adaptive α empirically performs better than a fixed α.
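A minimal BayesInt sketch under the same assumptions, reusing Counter, unigram, and average from the FixInt sketch above; the defaults µ = 0.2, ν = 5.0 follow the BayesInt column of Table 1.

```python
def bayesint(current_query, past_queries, clicked_summaries, mu=0.2, nu=5.0):
    """p(w|theta_k) = [c(w,Qk) + mu*p(w|HQ) + nu*p(w|HC)] / (|Qk| + mu + nu).
    Equivalent to FixInt with a query-dependent alpha = |Qk| / (|Qk| + mu + nu)."""
    toks = current_query.lower().split()
    counts = Counter(toks)
    h_q = average([unigram(q) for q in past_queries]) if past_queries else {}
    h_c = average([unigram(s) for s in clicked_summaries]) if clicked_summaries else {}
    vocab = set(counts) | set(h_q) | set(h_c)
    denom = len(toks) + mu + nu
    return {w: (counts.get(w, 0) + mu * h_q.get(w, 0.0) + nu * h_c.get(w, 0.0)) / denom
            for w in vocab}
```

Because the denominator contains |Qk|, a longer current query automatically pulls weight away from the history, which is exactly the adaptive α behaviour discussed above.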

3.4 Online Bayesian Updating (OnlineUp)
Both FixInt and BayesInt summarize the history information by averaging the unigram language models estimated based on previous queries or clicked summaries. This means that all previous queries are treated equally and so are all clicked summaries. However, as the user interacts with the system and acquires more knowledge about the information in the collection, presumably, the reformulated queries will become better and better. Thus assigning decaying weights to the previous queries so as to trust a recent query more than an earlier query appears to be reasonable. Interestingly, if we incrementally update our belief about the user's information need after seeing each query, we could naturally obtain decaying weights on the previous queries. Since such an incremental online updating strategy can be used to exploit any evidence in an interactive retrieval system, we present it in a more general way.

In a typical retrieval system, the retrieval system responds to every new query entered by the user by presenting a ranked list of documents. In order to rank documents, the system must have some model for the user's information need. In the KL divergence retrieval model, this means that the system must compute a query model whenever a user enters a (new) query. A principled way of updating the query model is to use Bayesian estimation, which we discuss below.

3.4.1 Bayesian updating
We first discuss how we apply Bayesian estimation to update a query model in general. Let p(w|φ) be our current query model and T be a new piece of text evidence observed (e.g., T can be a query or a clicked summary). To update the query model based on T, we use φ to define a Dirichlet prior parameterized as

Dir( µT p(w1|φ), ..., µT p(wN|φ) )

where µT is the equivalent sample size of the prior. We use a Dirichlet prior because it is a conjugate prior for multinomial distributions. With such a conjugate prior, the predictive distribution of φ (or equivalently, the mean of the posterior distribution of φ) is given by

p(w|φ′) = ( c(w, T) + µT p(w|φ) ) / ( |T| + µT )     (1)

where c(w, T) is the count of w in T and |T| is the length of T. Parameter µT indicates our confidence in the prior, expressed in terms of an equivalent text sample comparable with T. For example, µT = 1 indicates that the influence of the prior is equivalent to adding one extra word to T.
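A minimal sketch of the update in Eq. (1), assuming whitespace tokenization and reusing Counter from the earlier sketches; the function name is illustrative.

```python
def bayes_update(phi, text, mu_t):
    """Posterior-mean update of Eq. (1): p(w|phi') = (c(w,T) + mu_T * p(w|phi)) / (|T| + mu_T).
    phi: current query model as a dict w -> probability; text: the new evidence T."""
    toks = text.lower().split()
    counts = Counter(toks)
    vocab = set(counts) | set(phi)
    denom = len(toks) + mu_t
    return {w: (counts.get(w, 0) + mu_t * phi.get(w, 0.0)) / denom for w in vocab}
```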

3.4.2 Sequential query model updating
We now discuss how we can update our query model over time during an interactive retrieval process using Bayesian estimation. In general, we assume that the retrieval system maintains a current query model φi at any moment. As soon as we obtain some implicit feedback evidence in the form of a piece of text Ti, we will update the query model.

Initially, before we see any user query, we may already have some information about the user. For example, we may have some information about what documents the user has viewed in the past. We use such information to define a prior on the query model, which is denoted by φ′0. After we observe the first query Q1, we can update the query model based on the new observed data Q1. The updated query model φ1 can then be used for ranking documents in response to Q1. As the user views some documents, the displayed summary text for such documents, C1 (i.e., clicked summaries), can serve as some new data for us to further update the …
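A minimal sketch of the sequential updating loop of Section 3.4.2, reusing bayes_update from the previous sketch. The defaults mirror the OnlineUp column of Table 1 (µ = 5.0, ν = 15.0), under the assumption that µ governs query evidence and ν governs clicked summaries; the event format is an illustrative assumption.

```python
def online_update(prior, events, mu_query=5.0, mu_click=15.0):
    """Sequentially apply bayes_update to each new piece of evidence.
    events: chronological list of ('query', text) or ('click', summary_text) pairs.
    Each update shrinks the influence of everything seen earlier, so older queries
    naturally receive decaying weights (the OnlineUp behaviour described above)."""
    phi = dict(prior)  # phi'_0: prior query model, e.g. built from previously viewed documents
    for kind, text in events:
        phi = bayes_update(phi, text, mu_query if kind == "query" else mu_click)
    return phi
```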


Slide callouts: query history model p(w|HQ); click summary model p(w|HC)

Using Short-term Contexts for Ranking

•  Problem with Fixed Coefficient Interpolation is that the coefficients are the same for all queries. Want to trust the current query more if it is longer, and less if it is shorter

•  Bayesian Interpolation:

p(w|θk) = [ c(w, Qk) + µ p(w|HQ) + ν p(w|HC) ] / ( |Qk| + µ + ν )

Coefficients depend on the query length

Experimental Results

•  Dataset: TREC Associated Press set of news articles (~250,000 articles)

•  Select the 30 most difficult topics, have volunteers issue 4 queries for each topic, and record query reformulation and clickthrough information

•  Measure MAP and Precision@20
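A minimal sketch of the two measures on this slide, assuming binary relevance judgments given as a set of relevant document ids per topic; names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents among the top k results (pr@20docs in Table 1)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```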

Experimental Results

•  Results show that incorporating contextual information significantly improves the results

•  Additional experiments showed that the improvement is mostly due to using Session Clicked Summaries

Table 1: Effect of using query history and clickthrough data for document ranking.

Query         | FixInt (α=0.1, β=1.0) | BayesInt (µ=0.2, ν=5.0) | OnlineUp (µ=5.0, ν=15.0) | BatchUp (µ=2.0, ν=15.0)
              | MAP     pr@20docs     | MAP     pr@20docs       | MAP     pr@20docs        | MAP     pr@20docs
q1            | 0.0095  0.0317        | 0.0095  0.0317          | 0.0095  0.0317           | 0.0095  0.0317
q2            | 0.0312  0.1150        | 0.0312  0.1150          | 0.0312  0.1150           | 0.0312  0.1150
q2 + HQ + HC  | 0.0324  0.1117        | 0.0345  0.1117          | 0.0215  0.0733           | 0.0342  0.1100
Improve.      | 3.8%    -2.9%         | 10.6%   -2.9%           | -31.1%  -36.3%           | 9.6%    -4.3%
q3            | 0.0421  0.1483        | 0.0421  0.1483          | 0.0421  0.1483           | 0.0421  0.1483
q3 + HQ + HC  | 0.0726  0.1967        | 0.0816  0.2067          | 0.0706  0.1783           | 0.0810  0.2067
Improve.      | 72.4%   32.6%         | 93.8%   39.4%           | 67.7%   20.2%            | 92.4%   39.4%
q4            | 0.0536  0.1933        | 0.0536  0.1933          | 0.0536  0.1933           | 0.0536  0.1933
q4 + HQ + HC  | 0.0891  0.2233        | 0.0955  0.2317          | 0.0792  0.2067           | 0.0950  0.2250
Improve.      | 66.2%   15.5%         | 78.2%   19.9%           | 47.8%   6.9%             | 77.2%   16.4%

is 34.4 words. Among the 91 clicked documents, 29 documents are judged relevant according to the TREC judgment file. This data set is publicly available at http://sifaka.cs.uiuc.edu/ir/ucair/QCHistory.zip.

5. EXPERIMENTS

5.1 Experiment design
Our major hypothesis is that using search context (i.e., query history and clickthrough information) can help improve search accuracy. In particular, the search context can provide extra information to help us estimate a better query model than using just the current query. So most of our experiments involve comparing the retrieval performance using the current query only (thus ignoring any context) with that using the current query as well as the search context.

Since we collected four versions of queries for each topic, we make such comparisons for each version of queries. We use two performance measures: (1) Mean Average Precision (MAP): this is the standard non-interpolated average precision and serves as a good measure of the overall ranking accuracy. (2) Precision at 20 documents (pr@20docs): this measure does not average well, but it is more meaningful than MAP and reflects the utility for users who only read the top 20 documents. In all cases, the reported figure is the average over all of the 30 topics.

We evaluate the four models for exploiting search context (i.e., FixInt, BayesInt, OnlineUp, and BatchUp). Each model has precisely two parameters (α and β for FixInt; µ and ν for others). Note that µ and ν may need to be interpreted differently for different methods. We vary these parameters and identify the optimal performance for each method. We also vary the parameters to study the sensitivity of our algorithms to the setting of the parameters.

5.2 Result analysis

5.2.1 Overall effect of search context
We compare the optimal performances of the four models with those using the current query only in Table 1. A row labeled with qi is the baseline performance and a row labeled with qi + HQ + HC is the performance of using search context. We can make several observations from this table:

1. Comparing the baseline performances indicates that on average reformulated queries are better than the previous queries, with the performance of q4 being the best. Users generally formulate better and better queries.

2. Using search context generally has a positive effect, especially when the context is rich. This can be seen from the fact that the improvement for q4 and q3 is generally more substantial compared with q2. Actually, in many cases with q2, using the context may hurt the performance, probably because the history at that point is sparse. When the search context is rich, the performance improvement can be quite substantial. For example, BatchUp achieves 92.4% improvement in the mean average precision over q3 and 77.2% improvement over q4. (The generally low precisions also make the relative improvement deceptively high, though.)

3. Among the four models using search context, the performances of FixInt and OnlineUp are clearly worse than those of BayesInt and BatchUp. Since BayesInt performs better than FixInt and the main difference between BayesInt and FixInt is that the former uses an adaptive coefficient for interpolation, the results suggest that using an adaptive coefficient is quite beneficial and a Bayesian-style interpolation makes sense. The main difference between OnlineUp and BatchUp is that OnlineUp uses decaying coefficients to combine the multiple clicked summaries, while BatchUp simply concatenates all clicked summaries. Therefore the fact that BatchUp is consistently better than OnlineUp indicates that the weights for combining the clicked summaries indeed should not be decaying. While OnlineUp is theoretically appealing, its performance is inferior to BayesInt and BatchUp, likely because of the decaying coefficient. Overall, BatchUp appears to be the best method when we vary the parameter settings.

We have two different kinds of search context – query history and clickthrough data. We now look into the contribution of each kind of context.

5.2.2 Using query history only
In each of the four models, we can “turn off” the clickthrough history data by setting parameters appropriately. This allows us to evaluate the effect of using query history alone. We use the same parameter setting for query history as in Table 1. The results are shown in Table 2. Here we see that in general, the benefit of using query history is very limited, with mixed results. This is different from what is reported in a previous study [15], where using query history is consistently helpful. Another observation is that the context runs perform poorly at q2, but generally perform (slightly) better than the baselines for q3 and q4. This is again likely because at the beginning the initial query, which is the title in the original TREC topic description, may not be a good query; indeed, on average, performances of these “first-generation” queries are clearly poorer than those of all other user-formulated queries in the later generations. Yet another observation is that when using query history only, the BayesInt model appears to be better than other models. Since the clickthrough data is ignored, OnlineUp and BatchUp …

•  Feedback on the sub-document level should allow for better retrieval improvements

•  Use an eye-tracker to automatically detect which portions of the displayed document were read or skimmed

•  Determine which parts of the document are relevant

Attention-Based Information Retrieval

Georg Buscher
German Research Center for Artificial Intelligence (DFKI)

Kaiserslautern, Germany

[email protected]

ABSTRACT
In the proposed PhD thesis, it will be examined how attention data

from the user can be exploited in order to enhance and personalize

information retrieval.

Up to now, nearly all implicit feedback sources that are used for

information retrieval are based on mouse and keyboard input like

clickthrough, scrolling and annotation behavior. In this work, an

unobtrusive eye tracker will be used as an attention evidence source

being able to precisely detect read or skimmed document passages.

This information will be stored in attention-annotated documents

(e.g., containing read, skimmed, highlighted, commented passages).

Based on such annotated documents, the user’s current thematic

context will be estimated.

First, this attention-based context will be utilized for attention-based

changes concerning the index of vector-space based IR methods. It

is intended to regard the contexts as virtual documents, which will

be included in the index. In this way, local desktop or enterprise-

wide search engines could be enhanced by allowing new types of

queries, e.g., “Find a set of documents on my computer concerning

the current topic that I have formerly used in a similar context”.

Second, the thematic context will be utilized for pre- and

postprocessing steps concerning the retrieval process, e.g., for query

expansion and result reranking. Therefore, attention-enhanced

keyword extraction techniques will be developed that take the

degree of attention from the user into account.
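A minimal sketch of the attention-enhanced keyword extraction idea described above, not the thesis' actual method: each viewed passage is assumed to carry an attention label (read, skimmed, or none), and terms are scored by attention-weighted tf-idf; the labels, weights, and scoring are illustrative assumptions.

```python
import math
from collections import Counter

# Illustrative attention weights; a real model would calibrate these from eye-tracking data.
WEIGHTS = {"read": 1.0, "skimmed": 0.4, "none": 0.0}

def attention_keywords(passages, doc_freq, num_docs, top_n=10):
    """passages: list of (text, attention_label) pairs for the documents a user has viewed.
    doc_freq: dict term -> number of documents containing it, over a corpus of num_docs docs.
    Returns the top_n terms by attention-weighted term frequency times idf."""
    scores = Counter()
    for text, label in passages:
        w = WEIGHTS.get(label, 0.0)
        for term in text.lower().split():
            scores[term] += w
    ranked = []
    for term, tf in scores.items():
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        ranked.append((term, tf * idf))
    return sorted(ranked, key=lambda x: x[1], reverse=True)[:top_n]
```

The top-scoring terms could then be appended to the user's query (query expansion) or used to rerank the result list, as proposed above.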

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Relevance feedback

General Terms

Algorithms, Human Factors

Keywords

Eye tracking, implicit feedback, attention-based index

1. INTRODUCTION
In recent years it could be observed that the human-centered

perspective came more and more into the focus of IR research. In

this regard, a recent trend is of strong importance: the environment

of the user gets more and more personalized; the narrower and

broader context of a user is considered in order to better understand

the user’s needs (e.g., see [2], [3], [21]). This context is increasingly

taken into account in information retrieval systems.

However, context is a relatively vague term. For instance, context

can be generated implicitly or explicitly, it can be meant in a

thematic way (based on content) or rather in an environmental way

(e.g., based on application structures, people worked with, etc.), or it

can be seen as a short-term or as a long-term context, etc.

One of the current challenges is to elicit the context of a user. This

can be done explicitly, for example by asking the user, whether a

document is currently relevant or not (i.e., explicit relevance

feedback). Using such explicitly generated context is appealing
and yields better results in IR than not considering any user

context. However, asking the user about explicit feedback requires a

higher effort on the user’s side and should therefore be avoided.

Thus, implicit feedback recently gained in importance, i.e.,

observing the user’s actions and environment and trying to infer

what might be relevant for him.

Up to now, the main sources for implicitly generating thematic

context are the user’s click-through, scrolling and typing behavior

(see [12]); thus, data, which is provided by all normally available

input devices that are used to interact with a computer.

A very interesting new evidence source for implicit feedback is the
user's eye movements, because they mostly reflect the user's visual

attention directly. The eye trackers of today are unobtrusive and are

able to identify the user’s gaze with high precision (e.g., see [24]).

Therefore, applying an eye tracker as a new evidence source for the

user’s attention introduces a potentially very valuable new

dimension of contextual information in information retrieval. It is

clear that eye trackers will not be widespread in the near future due
to their cost. However, if they become less expensive, they

might well be interesting for knowledge workers in middle- or

large-sized enterprises.

Motivated by these thoughts, in this proposed dissertation, the

impact of attention evidence data especially obtained by an eye

tracker on information retrieval methods will be examined.

Therefore, the focus lies on local desktop and enterprise-wide

search. The work will primarily concentrate on attention-based changes

concerning the index of vector-space based IR methods, but also on

attention-enhanced pre- and postprocessing steps like query

expansion and reranking mechanisms. As an eye tracker is not the

only source for attention evidence, a model shall be developed,

which integrates different attention evidence sources so that a

standardized overall level of attention can be derived for any piece

of text in a document.

2. BACKGROUND AND RELATED WORK
Relevance feedback has been considered in information retrieval
research for some decades now. While it has first been focused on

explicit relevance feedback, the focus now lies more on implicit

feedback due to the lower effort it requires from the user. …

How can we use this?

•  For each page, can aggregate the “visual annotations” across the users of the enterprise

•  Can construct a precise short-term user context

Slide figure labels: task/information-need context; terms describing the user's current interest/context
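A minimal sketch of the aggregation idea on this slide, assuming per-user gaze logs that mark passages of a page as read or skimmed; the event format and weights are illustrative assumptions.

```python
from collections import defaultdict

# Assumed per-user gaze log entry: (page_url, passage_id, "read" | "skimmed")
ATTENTION_WEIGHT = {"read": 1.0, "skimmed": 0.3}  # illustrative weights

def aggregate_visual_annotations(gaze_logs):
    """Sum attention weights per (page, passage) across all users of the enterprise.
    gaze_logs: one list of events per user."""
    scores = defaultdict(float)
    for user_events in gaze_logs:
        for page, passage, kind in user_events:
            scores[(page, passage)] += ATTENTION_WEIGHT.get(kind, 0.0)
    return scores

def top_passages(scores, page, n=3):
    """Most-attended passages of a page; candidates for indexing or query expansion."""
    ranked = sorted(((p, s) for (pg, p), s in scores.items() if pg == page),
                    key=lambda x: x[1], reverse=True)
    return ranked[:n]
```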

Summary

•  Using short-term user context to improve search quality is a new and very promising direction of research

•  Initial results show that it can be very effective
•  Using eye tracking can help to improve the quality and increase the amount of the context data

•  Many unexplored applications: on-the-fly reranking, abstract personalization, etc.

Interesting Problems and Promising Research Directions

•  Applying the techniques we talked about to improve Enterprise Web search, extending them to better suit the Enterprise environment

•  Models for the Enterprise Web which take into account its complex structure and allow for expressing different usage data

•  Personalization in Enterprise Web search (usage data + employee personal info)

•  Using context (recent history + desktop info) to improve Enterprise Web search

References
•  [Fagin03] Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J.A., Williamson, D.P. “Searching the Workplace Web”. WWW Conference, May 2003, Budapest, Hungary.
•  [Hawking04] Hawking, D. “Challenges in Enterprise Search”. ADC Conference, Dunedin, NZ.
•  [Dmitriev06] Dmitriev, P., Eiron, N., Fontoura, M., Shekita, E. “Using Annotations in Enterprise Search”. WWW Conference, May 2006, Edinburgh, Scotland.
•  [Poblete08] Poblete, B., Baeza-Yates, R. “Query-Sets: Using Implicit Feedback and Query Patterns to Organize Web Documents”. WWW Conference, April 2008, Beijing, China.
•  [Joachims02] Joachims, T. “Optimizing Search Engines Using Clickthrough Data”. KDD Conference, 2002.
•  [Radlinski05] Radlinski, F., Joachims, T. “Query Chains: Learning to Rank from Implicit Feedback”. KDD Conference, 2005, New York, USA.
•  [Broder00] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J. “Graph Structure in the Web”. WWW Conference, 2000.
•  [Dwork01] Dwork, C., Kumar, R., Naor, M., Sivakumar, D. “Rank Aggregation Methods for the Web”. WWW Conference, 2001.
•  [Shen05] Shen, X., Tan, B., Zhai, C. “Context-Sensitive Information Retrieval Using Implicit Feedback”. SIGIR Conference, 2005.

References
•  [Fagin03-1] Fagin, R., Lotem, A., Naor, M. “Optimal Aggregation Algorithms for Middleware”. Journal of Computer and Systems Sciences, 66:614–656, 2003.
•  [Chirita07] Chirita, P.-A., Costache, S., Handschuh, S., Nejdl, W. “P-TAG: Large Scale Generation of Personalized Annotation TAGs for the Web”. WWW Conference, 2007.
•  [Bao07] Bao, S., Wu, X., Fei, B., Xue, G., Su, Z., Yu, Y. “Optimizing Web Search Using Social Annotations”. WWW Conference, 2007.
•  [Xu07] Xu, S., Bao, S., Cao, Y., Yu, Y. “Using Social Annotations to Improve Language Model for Information Retrieval”. CIKM Conference, 2007.
•  [Millen06] Millen, D.R., Feinberg, J., Kerr, B. “Dogear: Social Bookmarking in the Enterprise”. CHI Conference, 2006.
•  [Bilenko08] Bilenko, M., White, R.W. “Mining the Search Trails of Surfing Crowds: Identifying Relevant Web Sites from User Activity”. WWW Conference, 2008.
•  [Xue03] Xue, G.-R., Zeng, H.-J., Chen, Z., Ma, W.-Y., Zhang, H.-J., Lu, C.-J. “Implicit Link Analysis for Small Web Search”. SIGIR Conference, 2003.
•  [Burges05] Burges, C.J.C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.N. “Learning to Rank Using Gradient Descent”. ICML Conference, 2005.
•  [Buscher07] Buscher, G. “Attention-Based Information Retrieval”. Doctoral Consortium, SIGIR Conference, 2007.