
Level-Based Link Analysis

Guang Feng1,*, Tie-Yan Liu2, Xu-Dong Zhang1, Tao Qin1, Bin Gao3, and Wei-Ying Ma2

1 MSPLAB, Department of Electronic Engineering, Tsinghua University, Beijing 100084, P. R. China
{fengg03, qinshitao99}@mails.tsinghua.edu.cn, [email protected]

2 Microsoft Research Asia, No.49 Zhichun Road, Haidian District, Beijing 100080, P. R. China
{t-tyliu, wyma}@microsoft.com

3 LMAM, School of Mathematical Sciences, Peking University, Beijing 100871, P. R. China
[email protected]

Abstract. In order to get high-quality web pages, search engines often re-sort retrieved pages by their ranks. The rank is a measurement of the importance of a page. Famous ranking algorithms, including PageRank and HITS, make use of hyperlinks to compute this importance. Those algorithms treat all hyperlinks identically in the sense of recommendation. However, we find that the World Wide Web is actually organized with a natural multi-level structure. Benefiting from the level properties of pages, we can describe the recommendation conveyed by hyperlinks more reasonably and precisely. With this motivation, a new level-based link analysis algorithm is proposed in this paper. In the proposed algorithm, the recommendation weight of each hyperlink is computed from the level properties of its two endpoints. Experiments on the topic distillation task of the TREC 2003 web track show that our algorithm evidently improves search results compared to previous link analysis methods.

1 Introduction

With the explosive growth of the Web, it becomes more and more difficult for surfers to find valuable pages in such a huge repository. Consequently, search engines came forth to help them retrieve appropriate and valuable web pages.

At the very beginning, almost all search engines worked in the same manner as conventional information retrieval systems, where only relevance scores are utilized to sort pages for a given query. However, researchers found that this scheme led to poor results on the web. The top-ranking pages were often not the most valuable ones and were sometimes even rubbish. In other words, a high relevance score does not mean high quality.

* This work was performed at Microsoft Research Asia.

Y. Zhang et al. (Eds.): APWeb 2005, LNCS 3399, pp. 183–194, 2005. © Springer-Verlag Berlin Heidelberg 2005


In order to get high-quality pages, search engines turned to re-sorting retrieved pages by their importance. PageRank [2] and HITS [6] are two of the most popular algorithms, which utilize hyperlinks to compute the importance of each page. Taking both relevance and rank into account, the quality of the top-ranking retrieved pages can be improved by much.

More generally speaking, PageRank, HITS, and other methods that use hyperlinks to measure the importance of pages are referred to as link analysis algorithms in the literature. The hyperlink between two web pages is treated as a kind of recommendation from source to destination. If there is a hyperlink from page A to page B, we believe A endorses B for its importance. Hence, the Web can be considered a tremendous voting system. Through the continuous iteration of voting, each page will eventually obtain a stable measurement of its importance.

Previous works [1][2][3][4][6][8] showed the effectiveness and efficiency of link analysis algorithms, which consider each hyperlink identical in the sense of recommendation. However, we argue that this is not the best way of utilizing hyperlinks, although it has worked well. Optimally, different hyperlinks should have different weights in the voting process. For example, a hyperlink from the portal of a website to a common page should carry a stronger recommendation than a hyperlink from the common page back to the portal. We believe that the weight of a hyperlink should be decided by the level properties of its two ending pages.

With such a motivation, we introduce a new concept to link analysis algorithms, named level-based link analysis. Compared to previous algorithms, each hyperlink is assigned a weight to express its strength of recommendation. By applying this concept, almost all previous link analysis algorithms can be refined with only a little modification to the adjacency matrix of the web graph. In the following sections, we show in detail how to combine traditional link analysis methods with this level-based concept.

The rest of this paper is organized as follows. In Section 2, we review some previous works to show the common process of link analysis. In Section 3, we describe level-based link analysis in detail. The experiments and corresponding results are shown in Section 4. Finally, we give concluding remarks and future work in Section 5.

2 Related Works

From the saying "The Web is a hyperlinked environment" [6], we might feel that hyperlinks make up a great part of the Web. In the literature, link analysis algorithms have shown their success in measuring the importance of pages. Among them, PageRank and HITS are two widely-recognized representatives.

Before diving into detailed descriptions of them, we first give some basic definitions. In many works, the Web is modelled as a directed graph

G = <V, E>,

where V = {1, 2, ..., n} is the collection of nodes, each of which represents a page; and E = {<i, j> | i, j ∈ V} is the collection of edges, each of which


represents a hyperlink. For example, <i, j> means a hyperlink from page i to page j.

The adjacency matrix A of the web graph is defined as follows:

A_ij = 1 if <i, j> ∈ E, and A_ij = 0 otherwise. (1)

That is, if there is a hyperlink from page i to page j, then A_ij = 1; otherwise, A_ij = 0. This matrix is the core component of link analysis algorithms.
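To make (1) concrete, the adjacency matrix of a small web graph can be built directly from its edge list. The following sketch uses a hypothetical four-page graph; the page numbering and edges are ours, not from the paper.

```python
# Build the adjacency matrix A of (1) for a small, hypothetical web graph.
n = 4  # pages are numbered 0..3
E = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]  # hyperlinks <i, j>

# A[i][j] = 1 if there is a hyperlink from page i to page j, else 0.
A = [[0] * n for _ in range(n)]
for i, j in E:
    A[i][j] = 1

print(A)  # [[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
```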

2.1 HITS

The HITS algorithm assigns two numeric properties to each page, called the authority score and the hub score. The higher the authority score a page has, the more important it is. If a page points to many pages with high authority scores, it obtains a high hub score. Symmetrically, if a page is pointed to by many pages with high hub scores, it obtains a high authority score. Hub scores and authority scores exhibit a mutually reinforcing relationship. We can obtain the two scores of each page in an iterative manner.

Let a = (a_1, a_2, ..., a_n)^T and h = (h_1, h_2, ..., h_n)^T denote the authority and hub scores of the web graph respectively. Without regard to normalization, the iteration process can be formulated as follows [9]:

a_i^(t+1) = Σ_{j: <j,i> ∈ E} h_j^(t), (2)

h_i^(t+1) = Σ_{j: <i,j> ∈ E} a_j^(t). (3)

Representing them in matrix form, the above equations become

a^(t+1) = A^T h^(t) = (A^T A) a^(t), (4)

h^(t+1) = A a^(t) = (A A^T) h^(t). (5)

It can be proved that the stable values of a and h (denoted by a* and h*) will be the principal eigenvectors of A^T A and A A^T respectively, when each of A^T A and A A^T has a unique principal eigenvector [5].
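The iteration in (2)–(5) can be sketched as a short power-iteration loop. This is a minimal illustration, not the authors' implementation: the graph is hypothetical, and we normalize both vectors at each step (the usual practice, which the equations above omit).

```python
import math

def hits(E, n, iters=50):
    """Iterate (2)-(5): mutually reinforcing authority/hub scores,
    normalized at each step so the vectors converge."""
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # Authority update (2): a_i <- sum of hub scores of pages linking to i.
        new_a = [sum(h[src] for src, dst in E if dst == i) for i in range(n)]
        # Hub update (3): h_i <- sum of authority scores of pages i links to.
        new_h = [sum(new_a[dst] for src, dst in E if src == i) for i in range(n)]
        na = math.sqrt(sum(x * x for x in new_a)) or 1.0
        nh = math.sqrt(sum(x * x for x in new_h)) or 1.0
        a = [x / na for x in new_a]
        h = [x / nh for x in new_h]
    return a, h

E = [(0, 1), (0, 2), (1, 2), (3, 2)]  # hypothetical graph: page 2 is cited most
a, h = hits(E, 4)
print(max(range(4), key=lambda i: a[i]))  # page 2 gets the top authority score
```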

2.2 PageRank

The PageRank algorithm assigns one numeric property, called PageRank, to each page to represent its importance. This algorithm simulates a random walk process on the web graph. Suppose there is a surfer on an arbitrary page of the Web. At each step, he/she will transfer to one of the destination pages of the hyperlinks on the current page with probability ε, or to another page in the whole graph with probability 1 − ε. This process can also be formulated in an iterative manner.


First, normalize each row of the adjacency matrix A by its sum to get a probability matrix Ā. Then the above random walk can be represented as

Ã = εĀ + (1 − ε) U, (6)

where U is a uniform probability transition matrix, all elements of which equal 1/n (n is the dimension of U). Denote by π = (π_1, π_2, ..., π_n)^T the PageRank of the whole web graph. It can be computed through the iterative process below:

π^(k+1) = Ã^T π^(k). (7)

Again, the stable value of π corresponds to the principal eigenvector of Ã^T, when Ã^T has a unique principal eigenvector [5].
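The random walk of (6)–(7) can likewise be sketched as power iteration. This is a minimal illustration under our own assumptions: the graph is hypothetical, and dangling pages (rows with no out-links) are sent to the uniform distribution, a common convention the paper does not discuss.

```python
def pagerank(A, eps=0.85, iters=100):
    """Power iteration of (6)-(7). A is the 0/1 adjacency matrix of (1)."""
    n = len(A)
    # Row-normalize A into a probability matrix (dangling rows -> uniform).
    P = []
    for row in A:
        s = sum(row)
        P.append([x / s if s else 1.0 / n for x in row])
    pi = [1.0 / n] * n
    for _ in range(iters):
        # pi^(k+1) = (eps * P + (1 - eps) * U)^T pi^(k)
        pi = [sum((eps * P[i][j] + (1 - eps) / n) * pi[i] for i in range(n))
              for j in range(n)]
    return pi

A = [[0, 1, 1], [0, 0, 1], [1, 0, 0]]  # hypothetical 3-page graph
pi = pagerank(A)
print(round(sum(pi), 6))  # the stationary distribution sums to 1.0
```

Since each column of the mixed matrix sums to a probability distribution over destinations, the total mass is preserved at every step, which is why the printed sum stays at 1.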

3 Level-Based Link Analysis

In this section, we illustrate the concept of level-based link analysis. First, we discuss how to compute the weight of each hyperlink so as to define the level-based adjacency matrix. Then we show how to add this concept to existing link analysis algorithms.

3.1 Weight of the Hyperlink

As aforementioned, existing link analysis methods treat all hyperlinks identically in the sense of recommendation. However, as we know, the Web is not organized with a flat structure but with a multi-level structure. Thus, hyperlinks should be treated non-identically. Then comes the problem of how to define the difference between two hyperlinks. To tackle it, we make use of the level properties of pages in the website. In particular, this can be illustrated by Fig. 1, where a website is denoted by a tree; the circles denote pages; the solid lines denote hyperlinks while the dashed lines denote the organizational structure.

Suppose i1, i2 and j are three web pages. As we can see, there are two hyperlinks pointing to j, from i1 and i2 respectively. Denote the level property of page i by l_i. If i is on the highest level, let l_i = 1, and l_i increases by one when i goes down to the next level of the tree. Here we use w_{j|i} to represent the weight of the hyperlink from i to j and use anc(i, j) to denote the joint ancestor of these two pages.

We show three cases of organization and hyperlinks in Fig. 1(a) to (c). The question is which hyperlink carries a stronger recommendation with respect to j: the one from i1 or the one from i2. To answer this question, we design two intuitive and reasonable rules as follows.

Rule 1. In the case shown in Fig. 1(a), where the joint ancestor of i1 and j is the same as the joint ancestor of i2 and j, we claim that <i1, j> carries a stronger recommendation than <i2, j>, i.e.

w_{j|i1} > w_{j|i2} when l_{i1} < l_{i2} and l_{anc(i1,j)} = l_{anc(i2,j)}.


Fig. 1. Illustration for weight of the hyperlink

Generally speaking, a higher-level page will choose its hyperlinks more cautiously. Once a hyperlink appears on it, the destination of that hyperlink is endorsed with much deliberation by the author. Thus, the hyperlink would probably carry a strong recommendation. So, we claim that a hyperlink from a higher-level page will have a larger weight than a hyperlink from a lower one when other conditions are the same.

Rule 2. In the case shown in Fig. 1(b), where the joint ancestor m1 of i1 and j is not the same as the joint ancestor m2 of i2 and j, we claim that <i2, j> carries a stronger recommendation than <i1, j>, i.e.

w_{j|i1} < w_{j|i2} when l_{i1} = l_{i2} and l_{anc(i1,j)} > l_{anc(i2,j)}.

This claim is also fairly reasonable. Suppose the site with portal m2 focuses on sports (its URL is http://www.***sports.com), and m1 is the entry point of one sub-topic of that site; for example, the URL of m1 is http://www.***sports.com/football. As i1 and j are in the same site and share the same topic, it is very common that there is a hyperlink between them. However, such a hyperlink may have more organizational sense than recommendation. Comparatively, the hyperlink between i2 and j would represent more recommendation.

According to the above discussion, we can define the primary weight of the hyperlink as follows:

w_{j|i} = 1 / (l_i · l_{anc(i,j)}). (8)

In particular, if i and j have no joint ancestor, as shown in Fig. 1(c), we cannot directly realize (8). In this case, we construct a virtual page at a higher level than any existing page, denoted by mc (the dotted circle in Fig. 1(c)). We consider this virtual page the parent of the pages on the top level, i.e., the root of all web sites. The level property of this virtual page is denoted by l_0. To guarantee that the denominator in (8) is not zero, let l_0 = 0.1.
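Under the tree-shaped sitemap described above, the primary weight (8) can be computed from a parent map. A minimal sketch, assuming a hypothetical five-page sitemap; the helper names are ours, not from the paper.

```python
L0 = 0.1  # level of the virtual super-root mc (see the paragraph above)

def level(page, parent):
    """Level property l: the root of a sitemap tree sits on level 1."""
    l = 1
    while parent[page] is not None:
        page = parent[page]
        l += 1
    return l

def ancestors(page, parent):
    chain = [page]
    while parent[page] is not None:
        page = parent[page]
        chain.append(page)
    return chain

def anc_level(i, j, parent):
    """Level of the joint ancestor of i and j, or l0 when they share none."""
    common = set(ancestors(i, parent)) & set(ancestors(j, parent))
    if not common:
        return L0  # Fig. 1(c): fall back to the virtual root mc
    return max(level(p, parent) for p in common)  # deepest shared node

def weight(i, j, parent):
    """Primary weight w_{j|i} = 1 / (l_i * l_anc(i,j)) from (8)."""
    return 1.0 / (level(i, parent) * anc_level(i, j, parent))

# Hypothetical sitemap: root -> {a, b}, a -> a1, b -> b1.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "b1": "b"}
print(weight("a", "b1", parent))   # 0.5: Rule 1, the higher-level source wins
print(weight("a1", "b1", parent))  # ~0.333: same ancestor, deeper source
```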


With the above discussions, we can define the primary level-based adjacency matrix L as follows:

L_ij = w_{j|i} if <i, j> ∈ E, and L_ij = 0 otherwise. (9)

The reason why we emphasize "primary" here is that we will get the final version of the level-based adjacency matrix after taking more information into consideration, as shown in the next subsection.

3.2 Level-Punishment

Intuitively, the higher the level a page is on, the more important it is. Therefore, a page's importance should be punished by its level property l_i. Take HITS for example. After replacing A by L, (4) and (5) can be rewritten as follows:

a^(t+1) = L^T h^(t) = (L^T L) a^(t), (10)

h^(t+1) = L a^(t) = (L L^T) h^(t). (11)

We denote the level-punishment matrix by P = diag(1/l_1, 1/l_2, ..., 1/l_n) and introduce it into the calculation of (10) and (11). Then we have

a^(t+1) = P · L^T h^(t) = (P L^T P L) a^(t), (12)

h^(t+1) = P · L a^(t) = (P L P L^T) h^(t). (13)

If we define

L̃ = P L, (14)

we can obtain

a^(t+1) = L̃^T h^(t) = (L̃^T L̃) a^(t), (15)

h^(t+1) = L̃ a^(t) = (L̃ L̃^T) h^(t). (16)

Equations (15) and (16) are called the level-based HITS (LBHITS) algorithm. Compared with (4) and (5), we have only replaced A by L̃ in LBHITS, where

L̃_ij = (1/l_j) · w_{j|i} = 1 / (l_i · l_j · l_{anc(i,j)}). (17)

Up to now, we can reformulate the weight of the hyperlink as follows:

w_{j|i} = 1 / (l_i · l_j · l_{anc(i,j)}). (18)

Generally speaking, as long as we replace A by L̃ in a conventional link analysis algorithm, we can always obtain the level-based version of that algorithm accordingly. We omit the corresponding deductions here for simplicity.
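Putting (15)–(18) together, LBHITS is just the HITS loop run on the weighted matrix whose entries follow (18). A minimal sketch with a hypothetical three-page graph and hand-supplied levels and joint-ancestor levels; normalization is added at each step as usual.

```python
import math

def lbhits(edges, lvl, anc_lvl, n, iters=50):
    """LBHITS sketch: weighted adjacency matrix per (18), iteration per (15)-(16)."""
    # L[i][j] = 1 / (l_i * l_j * l_anc(i,j)) for each hyperlink <i, j>.
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] = 1.0 / (lvl[i] * lvl[j] * anc_lvl[(i, j)])
    a, h = [1.0] * n, [1.0] * n
    for _ in range(iters):
        a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]  # (15)
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]  # (16)
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nh = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

edges = [(0, 1), (0, 2), (1, 2)]
lvl = [1, 2, 2]                       # page 0 is the portal (level 1)
anc = {(0, 1): 1, (0, 2): 1, (1, 2): 1}  # all pairs share the level-1 root
a, h = lbhits(edges, lvl, anc, 3)
print(a.index(max(a)))  # page 2: it is pointed to by both other pages
```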


3.3 Convergence of Level-Based Link Analysis

In this subsection, we give the proofs of the convergence of LBHITS and level-based PageRank (LBPR). For other level-based link analysis algorithms, the proofs are similar.

Lemma 1. If A is a symmetric matrix and x is a vector not orthogonal to the principal eigenvector of A, then

lim_{k→∞} A^k x = x*,

when the principal eigenvector of A is unique, and x* equals the unique principal eigenvector [5].

Theorem 1 (Convergence of LBHITS). Replacing A by L̃ in (4) and (5), a and h will converge to a*_L and h*_L respectively.

Proof. Let h^(0) denote an arbitrary initial value of the hub vector. Then after k steps of iteration, we can easily obtain

a^(k) = (L̃^T L̃)^(k−1) L̃^T h^(0),

h^(k) = (L̃ L̃^T)^k h^(0).

In terms of the above lemma, because h^(0) is arbitrary, we may suppose it is not orthogonal to the principal eigenvector of L̃ L̃^T. Hence, h^(k) converges to a limit h*_L. In the same way, L̃^T h^(0) can also be considered not orthogonal to the principal eigenvector of L̃^T L̃, so that a^(k) converges to a limit a*_L. □

Theorem 2 (Convergence of LBPR). Replacing A by L̃ in computing LBPR, π will converge to π*_L.

Proof. After the replacement, we obtain

L̂ = εL̄ + (1 − ε) U,

where L̄ is the row-normalized version of L̃. Because L̃ is finite and non-negative, L̄ is finite and non-negative too. Besides, U is a uniform probability transition matrix with every element positive, so L̂ must be finite and strictly positive. Thereby L̂ is an irreducible probability transition matrix. It must have a stationary distribution, which satisfies

L̂^T π = π. □

4 Experiments

In our experiments, the topic distillation task of the TREC 2003 web track was used to evaluate our algorithms. The data corpus for this task was crawled from the .gov domain in 2002. It contains 1,247,753 documents, 1,053,111 of which are HTML files. We only used these HTML files in our experiments.


Fig. 2. A typical sitemap structure

There are 50 queries in total in this task. The number of positive answers (provided by the TREC committee) for each query ranged from 1 to 86, with an average of 10.32.

In order to realize level-based link analysis, we will first show how to construct the sitemap. After that, we will describe the implementation of the baseline algorithms and then show the improved retrieval performance after introducing level-based link analysis into the retrieval framework.

4.1 Sitemap Construction

The sitemap usually refers to a regular service provided by many websites to represent their organizational structures. However, in our experiments, the sitemap is a data structure and contains more information than its original form. The sitemap records the level properties of pages and the hierarchical structure of the website. In other words, it must record the parent-child relationship of pages. There is a constraint here that each page can have only one parent. In our current implementation, we define the sitemap strictly as a tree. A typical sitemap is shown in Fig. 2. Because the sitemap is implicit in many websites, we need an algorithm to construct it automatically from the relationships among web pages. For this purpose, we make use of the URL of each page to construct the sitemap according to the following rules.

1. For a URL whose format is like http://www.abc.gov/.../index.*, we regularize it to http://www.abc.gov/.../. For the cases of default.*, home.* and homepage.*, we also regularize them in the same way.

2. For those pages with only a one-level URL, such as http://www.usgs.gov/ and http://www.bp.gov/, we treat them as the roots of new sitemaps.

3. For those pages with a multi-level URL, we will find a parent for them. For example, the parent of http://www.aaa.gov/.../bb/ is http://www.aaa.gov/.../, while the parent of http://www.aaa.gov/.../bb/cc.* is http://www.aaa.gov/.../bb/. If such a parent happens not to be included in the data corpus (which means the page was missed in the crawling process), we simply treat the original page as the root of a new sitemap.
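The three rules above can be sketched as two small helpers, one for URL regularization and one for finding a parent. The function names and the sample corpus are ours, not from the paper; real URL handling (query strings, casing, odd schemes) would need more care.

```python
def regularize(url):
    # Rule 1: index.*, default.*, home.*, homepage.* collapse to the directory.
    base, _, last = url.rstrip("/").rpartition("/")
    if last.split(".")[0] in {"index", "default", "home", "homepage"}:
        return base + "/"
    return url

def parent_url(url, corpus):
    # Rules 2 and 3: one-level URLs are roots; otherwise strip one path step.
    trimmed = url.rstrip("/")
    base, _, last = trimmed.rpartition("/")
    if base.endswith(("http:/", "https:/")):   # e.g. http://www.usgs.gov/
        return None  # Rule 2: root of a new sitemap
    cand = base + "/"
    # Rule 3: missing parent in the corpus => treat the page as a new root.
    return cand if cand in corpus else None

corpus = {"http://www.aaa.gov/", "http://www.aaa.gov/bb/"}
print(parent_url("http://www.aaa.gov/bb/cc.html", corpus))  # http://www.aaa.gov/bb/
```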


In such a way, we construct a simple sitemap to display the structure of a website. We acknowledge that the above process is not very accurate, but sitemap construction is not the focus of this paper, although we can foresee that the better the sitemap we have, the more effective our level-based link analysis will be.

4.2 Algorithm Configuration

In a web information retrieval system, when a query is submitted, we first compute the relevance score of each page with respect to the query and select the top n relevant pages. Second, we linearly combine the relevance score and the rank score into a final score that is used to re-sort the top n pages, formulated as follows:

Score = α · relevance + (1 − α) · rank. (19)
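The linear combination in (19) can be sketched as follows. Note one assumption of ours: the paper does not say how relevance and rank scores are scaled before mixing, so this sketch min-max normalizes both to [0, 1] first.

```python
def combine(relevance, rank, alpha):
    """Score = alpha * relevance + (1 - alpha) * rank, per (19),
    after min-max normalizing each score list (our assumption)."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    r, k = norm(relevance), norm(rank)
    return [alpha * a + (1 - alpha) * b for a, b in zip(r, k)]

# Three hypothetical documents: BM25-style relevance scores and rank scores.
scores = combine([3.1, 2.0, 0.5], [0.02, 0.07, 0.01], alpha=0.68)
order = sorted(range(3), key=lambda i: -scores[i])  # re-sorted result list
```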

In our experiments, we use BM2500 [10] as the relevance weighting function. The retrieval result without regard to rank scores is called the baseline, where the mean average precision (MAP) is 0.1367 and the precision at ten (P@10) is 0.108. Compared with the best result of the TREC 2003 participants (with MAP of 0.1543 and P@10 of 0.128), this baseline is reasonable.

Specifically, the rank scores in (19) used for both HITS and LBHITS are the authority values. To illustrate the advantage of our level-based link analysis algorithm, we choose HITS for comparison.

As we know, HITS is a query-dependent algorithm. In the original paper on HITS [6], the size of the root set is 200 and the in-link parameter is 50. Besides, intrinsic links are removed before computing authority and hub scores. We implement the HITS algorithm strictly according to these descriptions.

In the implementation of our LBHITS algorithm, we use the same root set and in-link parameter as in the original HITS algorithm. However, we do not remove intrinsic links. The reason for removing intrinsic links is that "intrinsic links very often exist purely to allow for navigation of the infrastructure of a site" [6]. However, as our algorithm already punishes the weights of these intrinsic links, we do not need to remove them at all.

4.3 Experimental Result

In this subsection, we list the retrieval performance of both HITS and LBHITS on the TREC 2003 topic distillation task.

We re-sort the relevant documents according to the composite score to select the top 1000 pages for evaluation. Both MAP and P@10 are used to evaluate the performance of the algorithms.

In Fig. 3 and Fig. 4, we plot MAP and P@10 with respect to different α. The curves of both HITS+Relevance and LBHITS+Relevance converge to the baseline when α = 1.

From the above figures, we can see that when purely using the importance scores, LBHITS is already better than HITS, although the corresponding performance is very low. After combining with relevance scores, both HITS+Relevance and LBHITS+Relevance improve over the baseline. For LBHITS+Relevance, the best P@10, 0.136, is achieved when α = 0.68. This is better than HITS+Relevance and the baseline by 13.3% and 25.9% respectively. The best MAP, 0.162, is achieved when α = 0.77. This corresponds to 7% and 18.5% improvements over HITS+Relevance and the baseline.

Fig. 3. Comparison of P@10 for Topic Distillation Task on TREC2003

Fig. 4. Comparison of MAP for Topic Distillation Task on TREC2003

Furthermore, HITS+Relevance performed worse than the best result of TREC 2003, while LBHITS+Relevance sometimes performs better than the best result. In the best case, LBHITS+Relevance achieves 6.3% and 5.0% improvements over the best result of TREC 2003 in terms of P@10 and MAP respectively.


Table 1. Retrieval Performance Comparison

Methods                     P@10    MAP
Baseline                    0.108   0.1367
HITS only                   0.040   0.0502
LBHITS only                 0.070   0.0789
HITS+Relevance              0.120   0.1514
LBHITS+Relevance            0.136   0.1620
Best Result on TREC2003     0.128   0.1543

As the experiments exhibit, level-based link analysis is effective and in accordance with our theoretical analysis in the previous sections. At the end of this section, we place the retrieval performance of all algorithms tested in our experiments in Table 1 for a comprehensive comparison.

5 Conclusions

In this paper, we refine previous link analysis algorithms by introducing the level properties of pages. We discuss the reasoning and propose how to define the weight of a hyperlink based on the level properties of its two endpoints. Through our experiments, we show that level-based link analysis can give rise to higher retrieval performance than the previous algorithms.

As for future work, there are still many issues that need to be explored. First, how to define the weight of a hyperlink to better represent the influence of the level property is still a challenge. In Section 3, we only discover the trend of the weight with respect to the changing endpoints; it is just a naïve attempt. There must be more precise and effective approaches to represent the strength of recommendation of a hyperlink. Second, as we have shown, the level property can improve the performance of ranking. It is natural to ask whether it can improve the performance of relevance as well. Because the Web is naturally organized with a multi-level structure, we believe that the level property should influence every aspect of research on the Web, including relevance and many other basic components.

References

1. Bharat, K. and Henzinger, M. R.: Improved algorithms for topic distillation in a hyperlinked environment. In Proc. 21st Annual Intl. ACM SIGIR Conference, pages 104-111. ACM, 1998.

2. Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine. In The Seventh International World Wide Web Conference, 1998.

3. Chakrabarti, S.: Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction. In The 10th International World Wide Web Conference, 2001.

4. Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Raghavan, P. and Rajagopalan, S.: Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proc. of the 7th Int. World Wide Web Conference, May 1998.

5. Golub, G. H. and Van Loan, C. F.: Matrix Computations. Johns Hopkins Univ. Press, 1996.

6. Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM, Vol. 46, No. 5, pp. 604-622, 1999.

7. Langville, A. N. and Meyer, C. D.: Deeper Inside PageRank. Internet Mathematics, 2004.

8. Lempel, R. and Moran, S.: The stochastic approach for link-structure analysis (SALSA) and the TKC effect. In Proc. 9th International World Wide Web Conference, 2000.

9. Ng, A. Y., Zheng, A. X. and Jordan, M. I.: Link analysis, eigenvectors and stability. International Joint Conference on Artificial Intelligence (IJCAI-01), 2001.

10. Robertson, S. E.: Overview of the okapi projects. Journal of Documentation, Vol. 53, No. 1, 1997, pp. 3-7.