
Page Rank Algorithm

Catherine Benincasa, Adena Calden, Emily Hanlon, Matthew Kindzerske, Kody Law, Eddery Lam,

John Rhoades, Ishani Roy, Michael Satz, Eric Valentine and Nathaniel Whitaker

Department of Mathematics and Statistics
University of Massachusetts, Amherst

May 12, 2006

Abstract

PageRank is the algorithm used by the Google search engine, originally formulated by Sergey Brin and Larry Page in their paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. It is based on the premise, prevalent in the world of academia, that the importance of a research paper can be judged by the number of citations the paper has from other research papers. Brin and Page have simply transferred this premise to its web equivalent: the importance of a web page can be judged by the number of hyperlinks pointing to it from other web pages.


1 Introduction

There are various methods of information retrieval (IR), such as Latent Semantic Indexing (LSI). LSI uses the singular value decomposition (SVD) of a "term by document" matrix to capture latent semantic associations. The LSI method can efficiently handle difficult query terms involving synonyms and polysemes, and the SVD enables LSI to cluster documents and terms into concepts (e.g., car and automobile should belong to the same concept). Unfortunately, computing and storing the SVD of the term-by-document matrix is costly. Secondly, there are enormous numbers of documents on the web, and these documents are not subject to an editorial review process; therefore the web contains redundant documents, broken links, and poor-quality documents. Moreover, the web changes continuously as pages are modified, added, and deleted. The final feature of the IR problem, which has proven to be most worthwhile, is the web's hyperlink structure. The PageRank algorithm introduced by Google effectively represents the link structure of the internet, assigning each page a credibility score based on this structure. Our focus here will be on the analysis and implementation of this algorithm.

2 PageRank Algorithm

PageRank uses the hyperlink structure of the web to view inlinks to a page as a recommendation of that page from the author of the inlinking page. Since inlinks from good pages should carry more weight than inlinks from marginal pages, each webpage is assigned an appropriate rank score, which measures the importance of the page. The PageRank algorithm was formulated by Google founders Larry Page and Sergey Brin as a basis for their search engine. After webpages are retrieved by robot crawlers and are indexed and cataloged (which will be discussed in Section 3), PageRank values are assigned prior to query time according to perceived importance. The importance of each page is determined by the links to that page: the importance of any page is increased by the number of sites which link to it. Thus the rank r(P) of a given page P is given by

r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|},    (1)


where B_P is the set of all pages pointing to P and |Q| is the number of outlinks from Q. The terms of the matrix P are usually

p_{ij} = \begin{cases} \frac{1}{|P_i|} & \text{if } P_i \text{ links to } P_j, \\ 0 & \text{otherwise.} \end{cases}

(These weights can be distributed in a non-uniform fashion as well, which will be explored in the application section. For this particular application, a uniform distribution will suffice.) For theoretical and practical reasons, such as convergence and convergence rates, the matrix P is adjusted. The raw Google matrix P is nonnegative with row sums equal to one or zero. Zero row sums correspond to pages that have no outlinks; these are referred to as dangling nodes. We eliminate the dangling nodes using one of two techniques, so that the rows artificially sum to 1. P is then a row-stochastic matrix, which in turn means that the PageRank iteration represents the evolution of a Markov chain.

2.1 Markov Model

Figure 1


Figure 1 is a simple example of the stationary distribution of a Markov model. This structure accurately represents the probability that a random surfer is at each of the three pages at any point in time. The Markov model represents the web's directed graph as a transition probability matrix P whose element p_{ij} is the probability of moving from page i to page j in one step (click). This is accomplished through a few steps. Step one is to create a binary adjacency matrix to represent the link structure:

      A  B  C
  A   0  1  1
  B   0  0  1
  C   1  0  0

The second step is to transform this adjacency matrix into a probability matrix by normalizing each row:

      A   B    C
  A   0   1/2  1/2
  B   0   0    1
  C   1   0    0

This matrix is the unadjusted or raw Google matrix. The dominant eigenvalue for every stochastic matrix P is λ = 1. Therefore, if the PageRank iteration converges, it converges to the normalized left-hand eigenvector v^T satisfying

v^T = v^T P,    (2)

where v^T e = 1, which is the stationary or steady-state distribution of the Markov chain. Thus Google intuitively characterizes the PageRank value of each site as the long-run proportion of time spent at the site by a web surfer eternally clicking on links at random. In this model we have not yet taken into account clicking back or entering URLs on the command line.

In our basic example, we have:

(R(A)  R(B)  R(C)) * A = (R(A)  R(B)  R(C)),

where A is

      A   B    C
  A   0   1/2  1/2
  B   0   0    1
  C   1   0    0

This gives the linear system

  R(A) = R(C)
  R(B) = (1/2) R(A)
  R(C) = (1/2) R(A) + R(B)
  R(A) + R(B) + R(C) = 1,

and the solution of this linear system is

  (R(A)  R(B)  R(C)) = (0.4  0.2  0.4),

which indeed satisfies (0.4  0.2  0.4) * A = (0.4  0.2  0.4).
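A quick numerical check of this small example (our own sketch, assuming NumPy is available; it is not part of the original computation) forms the 3 x 3 matrix and iterates v^T <- v^T A:

    # Minimal numerical check of the 3-node example.
    import numpy as np

    # Row-stochastic transition matrix for the graph A -> {B, C}, B -> C, C -> A.
    A = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])

    v = np.array([1/3, 1/3, 1/3])      # start from the uniform distribution
    for _ in range(1000):
        v = v @ A                      # left multiplication: v^T <- v^T A
    print(v)                           # approximately [0.4, 0.2, 0.4]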

Let us consider the larger network represented by Figure 2.

Figure 2


This network has 8 nodes, and therefore the corresponding matrix has size 8 x 8, as shown in Figure 3.

Figure 3

Again, we can transform it into a stochastic matrix in the same way; the resulting 8 x 8 stochastic matrix is shown in the corresponding figure.


2.1.1 Generalization

Before going into the logistics of calculating this PageRank vector, we generalize to an N-dimensional system.

Let A_i be the binary vector of outlinks from page i:

A_i = (a_{i1}, a_{i2}, \ldots, a_{iN}), \qquad \|A_i\|_1 = \sum_{j=1}^{N} a_{ij}.    (3)

Then

P = \begin{pmatrix} A_1 / \|A_1\|_1 \\ A_2 / \|A_2\|_1 \\ \vdots \\ A_N / \|A_N\|_1 \end{pmatrix}
  = \begin{pmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & & \vdots \\ p_{N1} & \cdots & p_{NN} \end{pmatrix},

with rows P_i = (p_{i1}, p_{i2}, \ldots, p_{iN}), so that

\|P_i\|_1 = \sum_{j=1}^{N} p_{ij} = 1.    (4)

We now have a row-stochastic probability matrix, unless of course a page (node) points to no others, in which case A_i = P_i = 0. Now let

w^T = (1/N, 1/N, \ldots, 1/N),

and let

d_i = \begin{cases} 1 & \text{if page } i \text{ is a dead end}, \\ 0 & \text{otherwise}. \end{cases}

Then W = d\,w^T, and S = P + W is a stochastic matrix. It should be noted that there is more than one way to deal with dead ends, such as removing them altogether or adding an extra link which points to all the others (a so-called master node). We explore qualitatively the effects these methods have in the results analysis section. (See figure 10 for a dead end.)
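As a concrete illustration of this dangling-node adjustment (a sketch only; the function and variable names are ours, not from the paper), the matrix S can be built from a binary adjacency matrix as follows:

    # Sketch: build the row-stochastic matrix S = P + d w^T from a binary
    # adjacency matrix A, replacing each zero row (dead end) with the
    # uniform vector w^T = (1/N, ..., 1/N).  Assumes NumPy.
    import numpy as np

    def stochastic_from_adjacency(A):
        A = np.asarray(A, dtype=float)
        N = A.shape[0]
        row_sums = A.sum(axis=1)                 # ||A_i||_1 for each page i
        d = (row_sums == 0).astype(float)        # d_i = 1 for dead ends
        # Avoid division by zero on dead-end rows; those rows stay zero in P.
        P = A / np.where(row_sums == 0, 1.0, row_sums)[:, None]
        w = np.full(N, 1.0 / N)                  # uniform row vector w^T
        S = P + np.outer(d, w)                   # S = P + d w^T
        return S

    # The 3-node example of Section 2.1 has no dead ends, so S equals P there.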

2.2 Computing PageRank

The computation of PageRank is essentially an eigenvector problem: we solve the linear system

v^T (I - P) = 0,    (5)

with v^T e = 1. There are several methods which can be utilized in this calculation; provided our matrix is irreducible, we are able to utilize the power method.


2.2.1 Power Method

We are interested in the convergence of the iteration x_m^T G = x_{m+1}^T. For convenience we transpose this expression to G^T x_m = x_{m+1}. Clearly, the eigenvalues of G^T are \lambda_1 = 1 > \lambda_2 \geq \cdots \geq \lambda_n. Let v_1, \ldots, v_n be the corresponding eigenvectors, and let x_0 (of dimension n) be such that \|x_0\|_1 = 1, so that for coefficients a_i \in \mathbb{R},

x_0 = \sum_{i=1}^{n} a_i v_i.

Then

G^T x_0 = \sum_{i=1}^{n} a_i G^T v_i = \sum_{i=1}^{n} a_i \lambda_i v_i = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i v_i = x_1,

G^T x_1 = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^2 v_i = x_2,

\vdots

G^T x_m = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^{m+1} v_i = x_{m+1},

so, since |\lambda_i| < 1 for i \geq 2, the remaining terms vanish and

\lim_{m \to \infty} G^T x_m = a_1 v_1 = \pi,

the stationary state of the Markov chain.
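A minimal power-method sketch (our own illustration, not the paper's code; it assumes a row-stochastic, irreducible NumPy matrix G such as the damped matrix constructed in Section 2.3.1):

    # Power method for the left eigenvector satisfying v^T = v^T G.
    import numpy as np

    def pagerank_power(G, tol=1e-10, max_iter=1000):
        n = G.shape[0]
        v = np.full(n, 1.0 / n)            # start from the uniform distribution
        for _ in range(max_iter):
            v_next = v @ G                 # one step: v^T <- v^T G
            if np.abs(v_next - v).sum() < tol:
                return v_next
            v = v_next
        return v

    # Applied to the 3-node matrix of Section 2.1 (already irreducible),
    # this returns a vector close to (0.4, 0.2, 0.4).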

2.3 Irreducibility and Convergence of Markov Chain

A difficulty that arises in computation is that S can be a reducible matrix when the underlying chain is reducible. Reducible chains are those that contain sets of states in which the chain eventually becomes trapped. For example, if webpage S_i contains only a link to S_j, and S_j contains only a link to S_i, then a random surfer who hits either S_i or S_j is trapped into bouncing between the two pages endlessly, which is the essence of reducibility. The definition of irreducibility is the following: for each pair i, j, there exists an m such that (S^m)_{ij} \neq 0. In the case of an undirected graph, reducibility is equivalent to the graph splitting into disjoint, non-empty subsets (see figure 11). However, the issue of meshing these rankings back together in a meaningful way still remains.
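To see the trapping effect concretely, here is a toy example of ours (not from the paper): four pages where pages 3 and 4 link only to each other, so all of the probability drains into that pair.

    # Toy reducible chain: pages 1 and 2 link into pages 3 and 4, which link
    # only to each other, so the random surfer eventually gets trapped in {3, 4}.
    import numpy as np

    S = np.array([[0.0, 0.5, 0.5, 0.0],    # page 1 -> pages 2 and 3
                  [0.0, 0.0, 0.5, 0.5],    # page 2 -> pages 3 and 4
                  [0.0, 0.0, 0.0, 1.0],    # page 3 -> page 4 only
                  [0.0, 0.0, 1.0, 0.0]])   # page 4 -> page 3 only

    v = np.full(4, 0.25)
    for _ in range(200):
        v = v @ S
    # All of the rank ends up on pages 3 and 4; pages 1 and 2 get rank 0,
    # and the mass keeps bouncing between pages 3 and 4 on successive steps.
    print(v)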

2.3.1 Sink

So far we have been dealing with a directed graph; however, we also have to be concerned with the elusive "sink" (see figures 16 and 17). A Markov chain in which every state is eventually reachable from every other state is guaranteed to possess a unique positive stationary distribution by the Perron-Frobenius theorem. Hence the raw Google matrix P is first modified to produce a stochastic matrix S. Due to the structure of the World Wide Web and the nature of gathering the web structure, such as our "breadth first" method (which will be explained in the section on implementation), the stochastic matrix is almost certainly reducible. One way to force irreducibility is to displace the stochastic matrix S by a scalar α between 0 and 1. In our computation we choose α to be 0.85.

For α between 0 and 1, consider the following:

R(u) = \alpha \sum_{v \in B_u} \frac{R(v)}{n_v} + (1 - \alpha),

where n_v is the number of outlinks from page v and α = 0.85. The new stochastic matrix G then becomes

G = \alpha S + (1 - \alpha) D,    (6)

where

D = e\,w^T, \qquad e = (1, 1, \ldots, 1)^T, \qquad w^T = (1/N, 1/N, \ldots, 1/N).

Again, it should be noted that w^T can be any unit vector. In our basic example, this amounts to

0.85 * A + 0.15 * B = C,

where A is our usual 3 x 3 stochastic matrix, B is a 3 x 3 matrix with 1/3 in every entry, and C is

C = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.05 & 0.9 \\ 0.9 & 0.05 & 0.05 \end{pmatrix}.

This method allows for additional accuracy in our particular model, since it accounts for the possibility of arriving at a particular page by means other than a link. This certainly occurs in reality; hence this method improves the accuracy of our model, as well as providing us with our needed irreducibility and, as we will see, improving the rate of convergence of the power method.
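The damped matrix for the basic example can be checked directly (a sketch of ours; it reuses the 3-node matrix A from Section 2.1, which is already stochastic, so S = A here):

    # Form G = alpha*S + (1 - alpha)*D for the 3-node example and compare with C.
    import numpy as np

    alpha = 0.85
    A = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    N = A.shape[0]
    D = np.full((N, N), 1.0 / N)          # D = e w^T with w^T = (1/N, ..., 1/N)
    G = alpha * A + (1 - alpha) * D
    print(G)
    # Rows come out as (0.05, 0.475, 0.475), (0.05, 0.05, 0.9), (0.9, 0.05, 0.05),
    # matching the matrix C above.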

3 Data Management

Up to this point, we have assumed that we are always able to discover the desired networks or websites containing the information we search for. However, careful readers may notice that we have not really discussed how to determine the structure of these networks. In this section, we switch our attention toward more technical features. How are we going to determine the structure of our networks? Furthermore, once we have the list of websites, is there any way we can compute the ranking more efficiently and economically?

3.1 Breadth First Search

The breadth-first search method is our main approach to identifying the structure of a network, and its algorithm is the following. We begin with one single node (webpage) in our network and assign it the number 1, as in Figure a.


Figure a

This node links to several nodes, and we assign each of those nodes a number, as in Figure b.


Figure b

From Figure b, we observe that there is one node linked to node 2, so we assign this node another number. Then we switch to node 3, assigning a number to the node connected to node 3, and so on. Figure c gives the final result:


Figure c

As you can see, by using the breadth-first search method we are able to recover the complete graph structure, and therefore we are able to create our adjacency matrix.
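A minimal sketch of this numbering scheme (our own illustration; `outlinks` is a hypothetical function returning the pages a given page links to):

    # Breadth-first traversal that assigns consecutive numbers to pages as
    # they are discovered, building the edge list for the adjacency matrix.
    from collections import deque

    def bfs_crawl(start_page, outlinks):
        """outlinks(page) -> iterable of pages linked from `page` (assumed)."""
        number = {start_page: 1}               # page -> assigned number
        edges = []                             # (from_number, to_number) pairs
        queue = deque([start_page])
        while queue:
            page = queue.popleft()
            for target in outlinks(page):
                if target not in number:       # first time we see this page
                    number[target] = len(number) + 1
                    queue.append(target)
                edges.append((number[page], number[target]))
        return number, edges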

3.2 Sparse Matrix

Now we are able to form our adjacency matrix, since we know the structure of the network through breadth-first search. But in reality the network contains millions or even billions of pages, and these matrices will be huge. If we apply our power method directly to these matrices, even with the fastest computer in the world it will take a long time to compute the dominant eigenvector. Therefore, it is economical for us to develop ways to reduce the storage required for these matrices without affecting the ranking of the pages. In this paper, sparse matrix storage and compressed row storage (CRS) are the methods we use to accelerate the calculation. First, let us consider the following network:


Figure d

A program (Link text) formats this information into a from-to file, represented by the table next to the network. Then another program (Sparse PR) reads in the from-to file and computes the ranks, outputting the pages in order of rank. Figure e shows the result for our sample.


Figure e

Sparse matrix storage allows us to use less memory without compromising the final ranking. The full matrix format requires N^2 + 2N memory locations (N is the number of nodes); for 60k nodes that is about 50 GBytes of RAM. The sparse format requires 3N + 2L locations (L is the number of links); for 60k nodes and 1.6M links that is about 50 MBytes of RAM. Obviously, a sparse matrix uses far less memory than a full matrix in the computation, and is therefore much more efficient in terms of memory.
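To make the comparison concrete (our own arithmetic, using the counts quoted above and one memory location per stored number):

N^2 + 2N \approx (6.05 \times 10^4)^2 \approx 3.7 \times 10^9 \ \text{locations (full)}, \qquad 3N + 2L \approx 3(6.05 \times 10^4) + 2(1.6 \times 10^6) \approx 3.4 \times 10^6 \ \text{locations (sparse)},

so the sparse representation needs roughly a thousand times fewer memory locations, consistent with the drop from gigabytes to megabytes of RAM.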

3.3 Compressed Row Vectors

In this section we want to develop a method to accelerate the process of multiplying by the matrix. We decide to compress the row vectors, since we already know how each node points to other nodes. CRS requires two vectors of size L (the number of links) and one of size N (the number of nodes). Consider the following example, where we have 3 nodes and 6 links. First, we construct a vector aa of size L; this vector holds the non-zero entries in reading order. Second, we construct a vector ja of size L; this vector holds the column indices of the non-zero entries. Finally, we create the vector ia of size N; this is a cumulative count of non-zero entries by row. For example, the first row has two non-zero entries, therefore the first element of the ia vector is 2. The second row has one non-zero entry, therefore the second element of this vector is 3, and so on.

Figure f

CRS storage allows us to multiply the matrix by a vector in the following concise form:

  // for each row in the original matrix
  for i = 1 to N
      // for each nonzero entry in that row
      for j = ia(i) to ia(i+1) - 1
          // multiply that entry by the corresponding entry in the vector;
          // accumulate in the result
          result(i) = result(i) + aa(j) * vector(ja(j))

CRS is efficient, since we only need L additions and L multiplications, instead of roughly N^2 additions and N^2 multiplications for a full matrix. Now we can apply the power method and compute those tedious matrix multiplications and additions in a more efficient way.
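For illustration, a small self-contained version of this scheme (a sketch of ours; the array names follow the aa/ja/ia convention above, with ia holding N+1 cumulative counts starting at 0 so the loop bounds work as in the pseudocode):

    # Compressed-row-storage (CRS) matrix-vector product using the aa/ja/ia layout.
    # aa: nonzero values in reading order, ja: their column indices,
    # ia: cumulative counts of nonzeros per row (length N+1, ia[0] = 0),
    # so row i occupies positions ia[i] .. ia[i+1]-1 of aa and ja.
    def crs_matvec(aa, ja, ia, vector):
        n = len(ia) - 1
        result = [0.0] * n
        for i in range(n):                      # for each row of the matrix
            for j in range(ia[i], ia[i + 1]):   # each nonzero entry in that row
                result[i] += aa[j] * vector[ja[j]]
        return result

    # A^T for the 3-node example of Section 2.1, stored in CRS form (0-based
    # column indices).  Multiplying by the rank vector reproduces it, which is
    # exactly the step G^T x_m = x_{m+1} used in the power method.
    aa = [1.0, 0.5, 0.5, 1.0]
    ja = [2, 0, 0, 1]
    ia = [0, 1, 2, 4]
    print(crs_matvec(aa, ja, ia, [0.4, 0.2, 0.4]))   # [0.4, 0.2, 0.4]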


4 Results

To apply the PageRank method, an adjacency matrix is needed which represents a directed graph. The conventional use of PageRank is to rank a subset of the internet. A program called a "webcrawler" must be employed to crawl a desired domain and map its structure (i.e., its links). A simple approach to solving this problem is to use a breadth-first search technique. This technique involves starting at a particular node, say node 1, and discovering all of node 1's neighbors before beginning to search for the neighbors of node 1's first discovered neighbor. Figure 4 demonstrates this graphically. This technique can be contrasted with depth-first search, which starts on a path and continues until the path ends before beginning a second unique path. Breadth-first search is much more appropriate for webcrawlers because it is much more likely that arbitrarily close neighbors won't be excluded during a lengthy crawl.

Figure 4

A crawl in January of 2006 was focused on the "umass.edu" domain and yielded an adjacency matrix of size 60,513 x 60,513. The PageRank method was implemented in conjunction with the CRS scheme to minimize the resources required. A final ranking was obtained and a sample can be seen in Figure 5. Notice that the first and sixth ranked websites are the same. This is due to the fact that the webcrawler did not differentiate between different aliases of a URL.


Figure 5

Another implementation can be applied to a network of airports, with flights representing directed edges. In this implementation, the notion of multilinking comes into play; more precisely, there may exist more than one flight from one airport to the next. In the internet application, the restriction was made to allow only one link from any particular node to another, so handling multiple links requires only slight alterations to the working software to ensure a stochastic matrix. Figure 6 shows a sample of the results of a PageRank application on 1500 North American airports.

Figure 6
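One simple way to make such a multilink matrix stochastic (our own sketch; the paper does not spell out its exact weighting) is to weight each edge by its flight count, in the spirit of the non-uniform weights mentioned in Section 2:

    # Sketch: turn flight counts between airports into a row-stochastic matrix.
    # counts[i][j] = number of flights from airport i to airport j (assumed input).
    import numpy as np

    def stochastic_from_counts(counts):
        C = np.asarray(counts, dtype=float)
        N = C.shape[0]
        P = np.empty_like(C)
        for i in range(N):
            total = C[i].sum()
            if total > 0:
                P[i] = C[i] / total          # weight links by flight counts
            else:
                P[i] = 1.0 / N               # no outgoing flights: treat as dangling
        return P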


A more visible application may be in a sports tournament setting. The methods used for ranking collegiate football teams are annually a hot topic for debate. Currently, an average of seven ranking systems is used by the BCS to select which teams are accepted to the appropriate bowl or title games. Five of these models are computer-based and are arguably special cases of PageRank.

5 Conclusion

This paper presents one possible way of ranking. However, it is clear that the matrices Google deals with are thousands of times larger than the one we used. Therefore, it is safe to assume that Google has more efficient ways to compute ranks for webpages. Furthermore, we have not introduced any method to confirm our results and algorithms. It is easy to check when the network is small, but as the networks get bigger and bigger, verifying the results becomes extremely difficult. One potential solution for this problem is to simulate a web surfer and use a random number generator to determine the linkage between websites. It should be interesting to see the result.

References

[1] Amy N. Langville and Carl D. Meyer. A Survey of Eigenvector Methods for Web Information Retrieval. SIAM Review, Vol. 47, No. 1.

[2] S. Brin, L. Page, et al. The PageRank Citation Ranking: Bringing Order to the Web.
