
[IEEE 21st International Conference on Advanced Networking and Applications - Niagara Falls, ON, Canada (2007.05.21-2007.05.23)]

Un-biasing the Link Farm Effect in PageRank Computation

Arnon Rungsawang‡±, Komthorn Puntumapon±, Bundit Manaskasemsak±

‡Thai National Grid Center, Software Industry Promotion Agency, Ministry of Information and Communication Technology, Thailand
[email protected]

±Massive Information & Knowledge Engineering, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, Thailand
{por,un}@mikelab.net

Abstract

Link analysis is a critical component of current Internet search engines' results ranking software, which determines the ordering of query results returned to the user. The ordering of query results can have an enormous impact on web traffic and the resulting business activity of an enterprise; hence businesses have a strong interest in having their web pages highly ranked in search engine results. This has led to attempts to artificially inflate page ranks by spamming the link structure of the web. Building an artificial condensed link structure called a "link farm" is one technique to influence a page ranking system, such as the popular PageRank algorithm. In this paper, we present an approach to remove the bias due to link farms from PageRank computation. We propose a method to first measure the PageRank weight accumulated by link farms, and then distribute the weight to other web pages by a modification of the transition matrix in the standard PageRank algorithm. We present results of a selected web graph that is manually spammed. The results show that the proposed approach can effectively reduce the bias from link farms in PageRank computation.

1. Introduction

Users depend heavily upon Internet search engines to locate information and resources on the web, and generally rely on the search engines' ranking system in selecting which results to view. They typically visit at most the top 10 results returned by a search engine [13]. Since organizations compete with each other to sell their products or provide services via commercial web sites, increasing web page viewership is a low-cost way to gain business and market share. Therefore, most such organizations want their commercial web pages to be ranked as high as possible in the results returned by well-known search engines.

The economic value of prominent search engine placement has led to the emergence of search engine optimization (SEO) services that help organizations improve their ranking in search results. An SEO firm can assist clients in constructing legitimate, well-structured web content to improve their ranking in popular search engines; however, its services may extend to grey hat activities such as engineering a link farm [14]: a set of densely connected, artificial web pages whose only purpose is to inflate the position of a client's pages in search engine rankings. This web spam degrades the quality of search results and pollutes them with useless pages [6,9,12].

Most link farms engineered by grey hat SEO organizations aim at boosting the PageRank scores of one or a small number of target web pages. This in turn has motivated research into new algorithms to automatically combat such spam [2,3,7,8]. However, as with all spam, discriminating between link farms and groups of legitimate web pages with highly interconnected link structures, e.g. web communities, is a difficult task. Algorithms to combat web spam may incorrectly identify web communities as link farm pages and demote them in the search results, or exclude them from the next web crawl [7,9]. Consequently, web users may miss some relevant or desirable answers to their queries.

In this paper we propose an approach to mitigate the effect of link farms without excluding their pages from the web database. Instead, our approach attempts to measure the PageRank score being aggregated by each link farm and redistribute this weight to other nodes in the web graph, thereby removing -- or at least diminishing -- the bias due to link farms in PageRank computation.

21st International Conference on Advanced Networking and Applications (AINA'07), 0-7695-2846-5/07 $20.00 © 2007


The rest of this paper is organized as follows. Section 2 reviews related work; Section 3 provides a brief overview of the web model and the PageRank algorithm. In Section 4, we present our approach to un-bias the effect of link farms. We describe our experiments and results in Section 5, and present conclusions in the last section.

2. Related work

Research on web spam encompasses classification, detection and identification, and counter-measures. Perkins [12] proposes a classification for web spam, and Gyöngyi and Garcia-Molina [6] describe a taxonomy. They broadly categorize web spam as term spam and link spam. Term spam aims at increasing the relevancy for queries not related to the page's content, while link spam deceives the link-based ranking algorithm to increase the importance of a target page. They further describe techniques utilized by web spam: boosting techniques seek to achieve higher relevance and/or importance for some pages, while hiding methods attempt to conceal the boosting techniques from human users.

On the detection of web spam, Fetterly et al. [4] demonstrate that statistical analysis of URLs, links, and content can identify a great deal of spam. Acharya et al. [1] use historical data to identify link spam pages. Gyöngyi and Garcia-Molina [8] describe the optimal link structure for link farms and alliances; their quantitative results provide motivation for detection techniques that measure the amplification factor of alliances or the "relative spam mass" that a link farm contributes to a target. Recently, Becchetti et al. [2] propose an efficient, parallelizable approximate neighborhood counting method that identified 80% of the spam pages in a large dataset from the .uk domain.

Detection and counter-measures can be part of a single process. Wu and Davison [14] propose an algorithm that first identifies a spam seed set based on the intersection of incoming and outgoing links of pages, then iteratively extends the set to include pages with several links to pages already marked as bad. They counter link spam by deleting links between bad pages in the web adjacency matrix, and de-weight multiple links from a host to the same web page. PageRank, HITS, or weighted popularity are computed from the modified adjacency matrix.

TrustRank [7] is a modification of PageRank that incorporates a notion of trust. A seed set is manually screened to select a small number of high quality, durable, trustworthy sites with good coverage of the web graph. Trust is then propagated using a variation of the PageRank algorithm. For a web graph of 31 million sites (rather than pages), TrustRank outperformed PageRank at removing bad sites from the upper ranks. The paper also introduces several metrics for assessing the effectiveness of page ranking algorithms.

The SpamRank method proposed by Benczur et al. [3] identifies and re-weights link spam. For each page in the web graph, they check the PageRank distribution of its incoming links; if a large proportion of those links are of low quality, the page is marked as a seed page. They then propagate spam values backwards, under the hypothesis that spam pages will have in-links from other spam pages.

3. Quantitative model

3.1. Web model

We model the web as a directed graph G = (V, E), where V is a set of N web pages (vertices) and E is a set of hyperlinks (edges) that connect pages: (p, q) ∈ E if there is a link from page p to page q, ignoring multiplicity and self-referencing hyperlinks (p = q). Each page may have both incoming and outgoing links; define I(p) as the set of incoming links and O(p) as the set of outgoing links of page p.

The link structure of a web graph can then be represented by a transition matrix T, defined as:

T(p,q) = \begin{cases} 1/|O(q)|, & \text{if } (q,p) \in E \\ 0, & \text{otherwise} \end{cases}    (1)
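Definition (1) can be sketched directly in code. The following is a minimal illustration only; the function name, the dense list-of-lists layout, and the tiny example edge list are our own hypothetical choices, not from the paper:

```python
# Build the column-stochastic transition matrix of equation (1):
# T[p][q] = 1/|O(q)| if there is a link (q, p), else 0.
from collections import defaultdict

def transition_matrix(n, edges):
    """n pages numbered 0..n-1; edges is a list of (source, target) pairs."""
    out = defaultdict(set)
    for q, p in edges:
        if p != q:            # ignore self-referencing hyperlinks
            out[q].add(p)     # a set ignores link multiplicity
    T = [[0.0] * n for _ in range(n)]
    for q, targets in out.items():
        w = 1.0 / len(targets)
        for p in targets:
            T[p][q] = w
    return T

# Hypothetical 3-page graph: page 0 links to 1 and 2, page 1 links to 2.
T = transition_matrix(3, [(0, 1), (0, 2), (1, 2)])
# Column q sums to 1 whenever page q has at least one outgoing link.
```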

3.2. PageRank algorithm

The intuitive idea behind PageRank [5] is that if web page p has a link to web page q, it implies that the author of page p recommends or confers some importance on page q. Thus a page q is important if it has many incoming links from other important pages.

The PageRank score r(p) of web page p is defined as:

r(p) = \alpha \sum_{q \in I(p)} \frac{r(q)}{|O(q)|} + \frac{1-\alpha}{N}    (2)

where α is a decay factor, usually assigned a value of 0.85. The first term in (2) represents the rank score of p conferred by pages that point to p, while the second (constant) term represents the "random surfer" phenomenon and additionally helps avoid the rank sink computational problem [11]. Equation (2) can be used to iteratively compute the PageRank of all nodes. If we let \vec{R}^{(i)} denote the vector of PageRank values computed at iteration i, then an iterative form of (2) expressed in vector notation is:

\vec{R}^{(i)} = \alpha T \vec{R}^{(i-1)} + (1-\alpha) \left[ \tfrac{1}{N} \right]_{N \times 1}    (3)

It can be shown that this iterative computation always converges. In practice, a modification of (3), such as an adaptive PageRank algorithm, may be used to reduce computation and accelerate convergence [10].
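The power iteration of equation (3) can be sketched as follows; the function name, tolerance, and the two-page example are our own assumptions, not part of the paper:

```python
# Power-iteration PageRank per equation (3); T must be column-stochastic
# (e.g. built as in Section 3.1), alpha is the decay factor.
def pagerank(T, alpha=0.85, tol=1e-9, max_iter=1000):
    n = len(T)
    r = [1.0 / n] * n                      # uniform starting vector
    for _ in range(max_iter):
        nxt = [(1 - alpha) / n +
               alpha * sum(T[p][q] * r[q] for q in range(n))
               for p in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Two pages linking to each other: by symmetry each ends up with score 1/2.
r = pagerank([[0.0, 1.0], [1.0, 0.0]])
```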

4. Details of the un-biasing algorithm

Our approach to removing the PageRank bias introduced by link farms is a two-phase process: the first phase identifies the spam sources (the link farms), e.g. using the method of Wu and Davison [14]; the second phase removes the spam bias. The present work focuses only on the second phase.

It has been observed that the PageRank score of a web page in a link farm is generally aggregated from two sources: other pages residing in the farm's own network, and many low quality in-link pages that individually contribute very little weight but may collectively confer significant weight. This motivates the intuitive idea behind our approach: un-bias the effect of a link farm by redistributing the PageRank weight accumulated by the pages in its network to pages outside the farm. The question is, how much of the weight should we redistribute? The amount depends on the structure of the link farm itself.

With this motivation, we propose a way to measure the amount of PageRank weight accumulated by a link farm, and a way to distribute that amount to other web pages via a modification to the transition matrix used in the standard PageRank computation. To measure the score aggregated in a link farm, we introduce two new matrices, the link farm matrix and the non-link farm matrix, which quantify the structure of a link farm within the web graph. We also introduce a virtual link matrix to redistribute the PageRank weights from a link farm.

4.1. Constructing the link farm matrices

For each link farm LF_i in the web graph, the link farm matrix is an N×N transition matrix composed of zero columns for nodes outside the link farm, and normal transition columns for the nodes within the link farm:

LFM_i(p,q) = \begin{cases} 1/|O(q)|, & \text{if } (q,p) \in E \text{ and } q \in LF_i \\ 0, & \text{otherwise} \end{cases}    (4)

The non-link farm matrix (non_LFM) is an N×N transition matrix that has zero columns for nodes in any link farm, and normal transition columns for nodes not in any link farm:

non\_LFM(p,q) = \begin{cases} 1/|O(q)|, & \text{if } (q,p) \in E \text{ and } q \notin \bigcup_i LF_i \\ 0, & \text{otherwise} \end{cases}    (5)

Figure 1. A sample web graph (six pages; the four shaded nodes are spam pages, the other two are normal pages).

Consider the simple web graph in Figure 1; suppose that the four shaded nodes represent web pages that have been identified as a link farm. For this web graph, the single link farm matrix LFM_1 is the 6×6 matrix whose columns for the four farm nodes contain the normalized transition weights 1/|O(q)| of equation (4) and whose other columns are zero, while the non-link farm matrix non_LFM keeps only the transition weights of the two normal pages' outgoing links.

If there are n link farms in a web graph, the transition matrix for the web graph can be written as the sum of the non-link farm matrix and all link farm matrices:

T = non\_LFM + \sum_{i=1}^{n} LFM_i    (6)
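The decomposition in equations (4)-(6) can be sketched as follows, assuming the transition matrix is a dense list of lists and the identified farms are given as sets of node ids (all names here are hypothetical):

```python
# Split the transition matrix T into non_LFM plus one LFM per link farm
# by assigning each column either to its farm's matrix or to non_LFM.
def split_matrices(T, farms):
    """farms is a list of sets of node ids identified as link farms."""
    n = len(T)
    in_any_farm = set().union(*farms) if farms else set()
    lfms = [[[T[p][q] if q in farm else 0.0 for q in range(n)]
             for p in range(n)]
            for farm in farms]
    non_lfm = [[0.0 if q in in_any_farm else T[p][q] for q in range(n)]
               for p in range(n)]
    return non_lfm, lfms

# Hypothetical 3-node graph where node 0 forms a one-node "farm".
T = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
non_lfm, lfms = split_matrices(T, [{0}])
# Equation (6): non_lfm + sum of lfms reproduces T entry by entry.
```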

4.2. Virtual link matrix

To redistribute the PageRank weight from a link farm to other web pages, we utilize a virtual link matrix (VM) representing virtual links from all nodes in a link farm to every other node in the web graph, assigning equal weight to all possible transitions out of the link farm. The VM of link farm i is:

VM_i(p,q) = \begin{cases} 1/(N - |LF_i|), & \text{if } p \notin LF_i \text{ and } q \in LF_i \\ 0, & \text{otherwise} \end{cases}    (7)
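Equation (7) admits a direct sketch (the function name and example dimensions are our own, not from the paper):

```python
# Virtual link matrix of equation (7): every farm-node column spreads
# weight 1/(N - |LF_i|) uniformly over all nodes outside the farm.
def virtual_matrix(n, farm):
    w = 1.0 / (n - len(farm))
    return [[w if (q in farm and p not in farm) else 0.0
             for q in range(n)]
            for p in range(n)]

# Hypothetical graph of 6 nodes with a 4-node farm: each virtual link
# carries weight 1/(6 - 4) = 1/2, and farm columns sum to 1.
VM = virtual_matrix(6, {1, 2, 3, 5})
```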

For the web graph in Figure 1, N = 6 and |LF_1| = 4, so each virtual link carries weight 1/(6−4) = 1/2: the rows of VM for the two pages outside the farm contain 1/2 in each of the four link farm columns, and all other entries are zero.

Figure 2. Reduced web graph for Figure 1.

4.3. Un-biasing the link farm effect

To un-bias the link farm effect, we first compute how much PageRank weight is accumulated in a link farm, and then redistribute that weight equally to other nodes in the web graph. This amount can be calculated from the average change rate of the probability that a random surfer will land in any node outside a link farm (ACF). To calculate the ACF, we create a reduced web graph that includes the link farm nodes plus a single sink node representing all destinations outside the link farm. The steps are:

(1) separate the link farm, including its links, from the web graph,
(2) create a virtual node S which has an incoming link to itself (i.e., a sink node),
(3) for each node in the link farm, replace all links to nodes outside the farm with a single link to S.

Figure 2 depicts the steps in constructing a reduced web graph for the link farm in Figure 1.

1:  n = |LF_i| + 1
    \hat{T}(p,q) = T(p,q) for all p, q ∈ LF_i
    \hat{T}(S,q) = 1 − Σ_{p ∈ LF_i} T(p,q)
    \hat{T}(x,S) = 1 if x = S, 0 otherwise
    α = 0.85
2:  \vec{R}^{(0)} = [1/n]_{n×1}
3:  Pr_i^{(0)} = Σ_{j=1}^{n−1} r^{(0)}(j)
4:  k = 0, sum = 0, error = 1
5:  while (error > δ) {
6:      k = k + 1
7:      \vec{R}^{(k)} = α \hat{T} \vec{R}^{(k−1)} + (1−α) [1/n]_{n×1}
8:      Pr_i^{(k)} = Σ_{j=1}^{n−1} r^{(k)}(j)
9:      sum = sum + (Pr_i^{(k)} − Pr_i^{(k−1)} + (1−α)((N−1)/N) r^{(k−1)}(n)) / Pr_i^{(k−1)}
10:     error = ‖\vec{R}^{(k)} − \vec{R}^{(k−1)}‖ / ‖\vec{R}^{(k−1)}‖
11: }
12: ACF_i = sum / k

Figure 3. Pseudo code of the ACF_i computation.
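Under our reading of the pseudo code in Figure 3, the ACF computation can be sketched as below. The function name, data layout, and the toy test graph are our own assumptions; T is the full column-stochastic transition matrix, and N denotes the size of the full web graph as in equation (8):

```python
# Compute ACF_i for one link farm: build the reduced graph (farm nodes
# plus a sink S), iterate PageRank on it, and average the relative change
# of the farm's total score over all iterations until convergence.
def acf(T, farm, N, alpha=0.85, delta=1e-8):
    nodes = sorted(farm)
    n = len(nodes) + 1                      # +1 for the sink node S
    idx = {v: i for i, v in enumerate(nodes)}
    S = n - 1
    # Reduced transition matrix: links leaving the farm are routed to S.
    Th = [[0.0] * n for _ in range(n)]
    for q in nodes:
        inside = 0.0
        for p in nodes:
            Th[idx[p]][idx[q]] = T[p][q]
            inside += T[p][q]
        Th[S][idx[q]] = 1.0 - inside
    Th[S][S] = 1.0                          # S is a sink
    r = [1.0 / n] * n
    pr_prev = sum(r[:S])                    # farm mass Pr_i^(0)
    total, k, err = 0.0, 0, 1.0
    while err > delta:
        k += 1
        nxt = [(1 - alpha) / n +
               alpha * sum(Th[p][q] * r[q] for q in range(n))
               for p in range(n)]
        pr = sum(nxt[:S])
        # line 9 of Figure 3; r[S] is the sink's score from iteration k-1
        total += (pr - pr_prev + (1 - alpha) * ((N - 1) / N) * r[S]) / pr_prev
        err = sum(abs(a - b) for a, b in zip(nxt, r)) / sum(abs(x) for x in r)
        r, pr_prev = nxt, pr
    return total / k

# Toy 3-node graph (0 -> 1, 1 -> 2) with farm {0, 1} inside a
# hypothetical 1000-node web graph.
T = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
val = acf(T, {0, 1}, N=1000)
```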


For each link farm structure, we compute the PageRank vector corresponding to its reduced web graph using the procedure in Figure 3. At each iteration k of the calculation we also compute the sum of the PageRank scores for nodes in the link farm, denoted by Pr_i^{(k)}. The boosting effect of the link farm is quantified as the average rate of change in this sum of transition probabilities, averaged over all iterations (K) until convergence:

ACF_i = \frac{1}{K} \sum_{k=1}^{K} \frac{Pr_i^{(k)} - Pr_i^{(k-1)} + (1-\alpha)\frac{N-1}{N}\, r^{(k-1)}(S)}{Pr_i^{(k-1)}}    (8)

Experiments indicate that the value of ACF is unique to each link farm and depends on its link structure. In successive iterations, PageRank scores from the link farm gradually migrate to the rank sink node S. Figure 4 plots the value of ACF versus iteration number for the web graph in Figure 2.

Figure 4. Iterative values of ACF for the reduced web graph in Figure 2. (ACF is plotted on a vertical axis from 0.13 to 0.18 against iterations 1 to 51.)

A low ACF value indicates a link farm whose purpose is to boost the rank score of selected outside pages, while a high ACF value indicates that the link farm does not significantly boost the rank of specific outside pages. Rather, the farm is more likely a web community with a naturally condensed link structure; for example, the cross-referencing web sites of a corporate conglomerate.

We now describe how to use the ACF values to remove (or at least diminish) bias from PageRank scores. As shown in equation (6), the PageRank transition matrix can be decomposed into a non-link farm matrix plus a number of link farm matrices. To redistribute weights from each link farm, we reduce the weight of the link farm matrices in equation (6) and reassign the weight to the corresponding virtual link matrices. This has the desired effect of redistributing the link farm's targeted significance to all nodes in the web graph.

The ACF value serves as the weighting factor for a link farm matrix. The modified transition matrix is:

T' = non\_LFM + \sum_{i=1}^{n} \left[ ACF_i\, LFM_i + (1 - ACF_i)\, VM_i \right]    (9)

where n is the number of link farms. This new transition matrix satisfies the Markov property and is column-stochastic. Thus we can replace the original transition matrix T in equation (3) with T', giving:

\vec{R}^{(i)} = \alpha T' \vec{R}^{(i-1)} + (1-\alpha) \left[ \tfrac{1}{N} \right]_{N \times 1}    (10)
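The blending in equation (9) can be sketched as follows; the helper name and the toy matrices are hypothetical, and the inputs are assumed to be built as in Sections 4.1-4.2:

```python
# Equation (9): T' = non_LFM + sum_i [ ACF_i * LFM_i + (1 - ACF_i) * VM_i ].
def unbiased_transition(non_lfm, lfms, vms, acfs):
    n = len(non_lfm)
    Tp = [row[:] for row in non_lfm]        # start from non_LFM
    for lfm, vm, a in zip(lfms, vms, acfs):
        for p in range(n):
            for q in range(n):
                # a farm keeps the fraction ACF_i of its own structure;
                # the rest is spread through the virtual links
                Tp[p][q] += a * lfm[p][q] + (1 - a) * vm[p][q]
    return Tp

# Toy 3-node graph: one-node farm {0} linking to node 1; virtual links
# give nodes 1 and 2 weight 1/2 each in column 0. With ACF = 0.2 the
# blended column stays stochastic.
non_lfm = [[0.0] * 3 for _ in range(3)]
lfm = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
vm = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
Tp = unbiased_transition(non_lfm, [lfm], [vm], [0.2])
```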

For the web graph in Figure 1, the ACF value is 0.158. The modified transition matrix T' therefore keeps each link farm entry at 0.158 of its original weight, and each of the two pages outside the farm receives an additional virtual-link weight of 0.842 × 1/2 = 0.421 in every link farm column.

5. Experimental results

To evaluate the effectiveness of this approach, we used the Yahoo! search API [15] to collect web pages from the Internet. We issued the query phrase “credit card application”, chosen as a likely candidate for spamming, and expanded the web graph of the top 1,000 answers by their outgoing and incoming links to create a final web graph of approximately 250,000 nodes. The top 30 PageRank scores of this web graph are listed in Table 1. We then manually spammed the web graph by inserting three artificial link farms, as illustrated in Figure 5:

• LF1 consists of web pages with ids 250234 to 250239. Web page P0 has id 250234.
• LF2 consists of web pages with ids 250240 to 250257. The three core nodes have ids 250240, 250241, and 250242.
• LF3 consists of web pages with ids 250258 to 250272. The three core nodes have ids 250260, 250261, and 250262.
• Each node in each link farm receives 10 incoming links from the web graph; we selected the ten nodes with the lowest PageRank scores for this purpose.

Table 1. Top 30 PageRank scores of the experimental web graph.

Rank  Page id  PR score | Rank  Page id  PR score
  1       4   0.02083   |  16       9   0.00376
  2     659   0.01635   |  17     343   0.00372
  3     648   0.00885   |  18       6   0.00371
  4      11   0.00882   |  19       3   0.00366
  5      84   0.00818   |  20     499   0.00338
  6    1838   0.00797   |  21      80   0.00302
  7       1   0.00797   |  22    3041   0.00302
  8    1829   0.00766   |  23    3042   0.00302
  9      47   0.00766   |  24      22   0.00269
 10    1830   0.00766   |  25     163   0.00260
 11    2935   0.00535   |  26     652   0.00252
 12    2934   0.00516   |  27     660   0.00251
 13       2   0.00516   |  28      17   0.00238
 14       5   0.00384   |  29     456   0.00236
 15      15   0.00382   |  30     251   0.00233

To evaluate the robustness of our approach, we also added some good links from selected pages in the top 30, as follows:

• add links from web pages 2935 and 2934, ranked 11th and 12th, to web page id 250234;
• add links from web pages 2, 5 and 15, ranked 13th to 15th, to web page id 250240;
• add links from web pages 9, 343, 6, 3 and 499, ranked 16th to 20th, to web page id 250260.

Figure 5. Structure of manually created link farms.

We then calculated the standard PageRank scores of the manually spammed web graph. The web pages with the top 30 PageRank scores are shown in Table 2; web pages that were successfully spammed into the top 30 are highlighted.


Table 2. Top 30 PageRank scores of the manually spammed web graph (* marks successfully spammed pages).

Rank  Page id   PR score | Rank  Page id   PR score
  1        4    0.01998  |  16  250261*   0.00497
  2      659    0.01568  |  17  250258*   0.00497
  3  250240*    0.01157  |  18    2935    0.00415
  4  250260*    0.00934  |  19  250241*   0.00413
  5      648    0.00849  |  20    2934    0.00401
  6       11    0.00846  |  21       2    0.00401
  7     1838    0.00764  |  22       5    0.00368
  8        1    0.00764  |  23       3    0.00351
  9     1829    0.00735  |  24      22    0.00258
 10       47    0.00735  |  25     163    0.00250
 11     1830    0.00735  |  26     652    0.00241
 12       84    0.00517  |  27     660    0.00241
 13  250234*    0.00502  |  28      17    0.00229
 14  250262*    0.00497  |  29     251    0.00223
 15  250259*    0.00497  |  30     242    0.00223

Finally, the PageRank scores of the spammed web graph were recomputed using the un-biasing transition matrix given in equation (9), with ACF values computed as described in Section 4. The top 30 PageRank values for this computation are listed in Table 3.

Table 3. Top 30 PageRank scores of the manually spammed web graph after applying the un-biasing technique.

Rank  Page id  PR score | Rank  Page id  PR score
  1       4   0.02742   |  16     652   0.00326
  2     659   0.02163   |  17     660   0.00325
  3     648   0.01172   |  18      22   0.00316
  4      11   0.01167   |  19       5   0.00313
  5    1838   0.00801   |  20      23   0.00280
  6       1   0.00801   |  21     240   0.00268
  7    1829   0.00742   |  22     272   0.00268
  8      47   0.00742   |  23    7239   0.00258
  9    1830   0.00742   |  24     292   0.00240
 10      84   0.00494   |  25     243   0.00235
 11  250240   0.00473   |  26    7366   0.00235
 12  250260   0.00420   |  27     247   0.00231
 13       3   0.00365   |  28     917   0.00229
 14     251   0.00348   |  29     252   0.00229
 15     242   0.00348   |  30     906   0.00229

Table 3 shows that the un-biasing technique has successfully removed all the spammed web pages from the top 10, and all but two pages from the top 30. Perhaps more importantly, the original top 10 web pages have been restored to their rightful positions. Since, as discussed earlier, users generally visit only the top 10 query results, it is highly desirable for an anti-spam technique to leave the spam-free top 10 results relatively unperturbed.

The two remaining incursions into the top 30 (page ids 250240 and 250260) are the result of the good links manually added from other top 30 pages. We verified that their high PageRank scores under the modified algorithm result from the links added from web pages with ids 2, 3, 5, 6, 9, 15, 343, and 499. This shows that even though some pages may have many low quality links (here, links from the bottom 10 pages) suggestive of link spamming, if those pages also have good references from pages with high PageRank scores then they are not severely penalized by our algorithm.

6. Conclusions

This paper describes a technique to remove the bias introduced by link farms into the PageRank computation, without removing suspected spam pages from the rank calculation. Our approach automatically measures the amount of PageRank weight aggregated by a link farm and redistributes it uniformly to the other nodes in the web graph. This method not only avoids accidentally removing good nodes erroneously classified as link farms during the spam identification process, but also accommodates naturally interlinked web communities that may resemble link farms.

Experimental results confirm that good pages that may have been identified as spam are still available to the user in query results, while spam pages are demoted in the PageRank ordering. The results also suggest that this technique differentiates between link farms and the naturally condensed link structures of web communities, without severely penalizing the latter.

We believe that this approach offers an automatic, adaptive way of combating link spam that can be readily incorporated into an existing PageRank implementation. However, additional experiments should be performed with larger web graphs that incorporate more link spam configurations, and more extensively distributed spam as well as non-spam web communities.


Acknowledgement

We would like to thank Dr. James Brucker for his valuable comments and discussion. We also thank the anonymous reviewers for their comments, which helped us improve the content of this paper.

References

[1] A. Acharya, M. Cutts, J. Dean, P. Haahr, M. Henzinger, U. Hoelzle, S. Lawrence, K. Pfleger, O. Sercinoglu, and S. Tong, “Information retrieval based on historical data”, US Patent Application 20050071741, 2005.

[2] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, “Using rank propagation and probabilistic counting for link-based spam detection”, Proc. of the Workshop on Web Mining and Web Usage Analysis, 2006.

[3] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher, “SpamRank - Fully automatic link spam detection”, Proc. of the 1st International Workshop on Adversarial Information Retrieval on the Web, 2005.

[4] D. Fetterly, M. Manasse, and M. Najork, “Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages”, Proc. of the 7th International Workshop on the Web and Databases, 2004.

[5] A Survey of Google’s PageRank, http://pr.efactory.de/

[6] Z. Gyöngyi and H. Garcia-Molina, “Web spam taxonomy”, Proc. of the 1st International Workshop on Adversarial Information Retrieval on the Web, 2005.

[7] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, “Combating web spam with TrustRank”, Proc. of the 30th International Conference on Very Large Data Bases, 2004.

[8] Z. Gyöngyi and H. Garcia-Molina, “Link spam alliances”, Proc. of the 31st International Conference on Very Large Data Bases, 2005.

[9] M. Henzinger, R. Motwani, and C. Silverstein, “Challenges in web search engines”, SIGIR Forum, 36(2), 2002.

[10] S. Kamvar, T. Haveliwala, and G. Golub, “Adaptive methods for the computation of PageRank”, Linear Algebra and Its Applications, special issue on the numerical solution of Markov chains, 2003.

[11] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web”, Stanford Digital Libraries Working Paper 1999-66, 1999.

[12] A. Perkins, “White paper: The classification of search engine spam”, online at http://www.silverdisc.co.uk/articles/spam-classification/, 2001.

[13] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz, “Analysis of a very large search engine query log”, SIGIR Forum, 33(1), 1999.

[14] B. Wu and B.D. Davison, “Identifying link farm spam pages”, Proc. of the 14th International World Wide Web Conference, 2005.

[15] Yahoo! Developer Network, http://developer.yahoo.com/
