the mathematics behind google's pagerank

48
The Mathematics Behind Google’s PageRank Ilse Ipsen Department of Mathematics North Carolina State University Raleigh, USA Joint work with Rebecca Wills Man – p.1

Upload: ngoquynh

Post on 14-Feb-2017

223 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: The Mathematics Behind Google's PageRank

The MathematicsBehind Google’s PageRank

Ilse Ipsen

Department of Mathematics

North Carolina State University

Raleigh, USA

Joint work with Rebecca Wills

Man – p.1

Page 2: The Mathematics Behind Google's PageRank

Two Factors

Determine where Google displays a web page onthe Search Engine Results Page:

1. PageRank (links)

A page has high PageRankif many pages with high PageRank link to it

2. Hypertext Analysis (page contents)

Text, fonts, subdivisions, location of words,contents of neighbouring pages

Man – p.2

Page 3: The Mathematics Behind Google's PageRank

PageRank

An objective measure of the citation importance ofa web page [Brin & Page 1998]

• Assigns a rank to every web page• Influences the order in which Google displays

search results• Based on link structure of the web graph• Does not depend on contents of web pages• Does not depend on query

Man – p.3

Page 4: The Mathematics Behind Google's PageRank

More PageRank More Visitors

Man – p.4

Page 5: The Mathematics Behind Google's PageRank

PageRank

. . . continues to provide the basis for all of our websearch tools http://www.google.com/technology/

• “Links are the currency of the web”• Exchanging & buying of links• BO (backlink obsession)• Search engine optimization

Man – p.5

Page 6: The Mathematics Behind Google's PageRank

Overview

• Mathematical Model of Internet• Computation of PageRank• Is the Ranking Correct?• Floating Point Arithmetic Issues

Man – p.6

Page 7: The Mathematics Behind Google's PageRank

Mathematical Model of Internet

1. Represent internet as graph

2. Represent graph as stochastic matrix

3. Make stochastic matrix more convenient=⇒ Google matrix

4. dominant eigenvector of Google matrix=⇒ PageRank

Man – p.7

Page 8: The Mathematics Behind Google's PageRank

The Internet as a Graph

Link from one web page to another web page

Web graph:

Web pages = nodesLinks = edges

Man – p.8

Page 9: The Mathematics Behind Google's PageRank

The Internet as a Graph

Man – p.9

Page 10: The Mathematics Behind Google's PageRank

The Web Graph as a Matrix

3

2

55

4

1

S =

0 12

0 12

0

0 0 13

13

13

0 0 0 1 0

0 0 0 0 1

1 0 0 0 0

Links = nonzero elements in matrix

Man – p.10

Page 11: The Mathematics Behind Google's PageRank

Properties of Matrix S

• Row i of S: Links from page i to other pages• Column i of S: Links into page i

• S is a stochastic matrix:All elements in [0, 1]Elements in each row sum to 1

• Dominant left eigenvector:

ωT S = ωT ω ≥ 0 ‖ω‖1 = 1

• ωi is probability of visiting page i

• But: ω not uniqueMan – p.11

Page 12: The Mathematics Behind Google's PageRank

Google Matrix

Convex combination

G = αS + (1 − α)11vT

︸ ︷︷ ︸

rank 1

• Stochastic matrix S

• Damping factor 0 ≤ α < 1e.g. α = .85

• Column vector of all ones 11• Personalization vector v ≥ 0 ‖v‖1 = 1

Models teleportation

Man – p.12

Page 13: The Mathematics Behind Google's PageRank

PageRank

G = αS + (1 − α)11vT

• G is stochastic, with eigenvalues:

1 > α|λ2(S)| ≥ α|λ3(S)| ≥ . . .

• Unique dominant left eigenvector:

πT G = πT π ≥ 0 ‖π‖1 = 1

• πi is PageRank of web page i

[Haveliwala & Kamvar 2003, Eldén 2003,

Serra-Capizzano 2005] Man – p.13

Page 14: The Mathematics Behind Google's PageRank

How Google Ranks Web Pages

• Model:Internet → web graph → stochastic matrix G

• Computation:PageRank π is eigenvector of G

πi is PageRank of page i

• Display:If πi > πk thenpage i may∗ be displayed before page k

∗ depending on hypertext analysis

Man – p.14

Page 15: The Mathematics Behind Google's PageRank

Facts

• The anatomy of a large-scale hypertextual websearch engine [Brin & Page 1998]

• US patent for PageRank granted in 2001• Google indexes 10’s of billions of web pages

(1 billion = 109)• Google serves ≥ 200 million queries per day• Each query processed by ≥ 1000 machines• All search engines combined process more

than 500 million queries per day

[Desikan, 26 October 2006] Man – p.15

Page 16: The Mathematics Behind Google's PageRank

Computation of PageRank

The world’s largest matrix computation[Moler 2002]

• Eigenvector• Matrix dimension is 10’s of billions• The matrix changes often

250,000 new domain names every day• Fortunately: Matrix is sparse

Man – p.16

Page 17: The Mathematics Behind Google's PageRank

Power Method

Want: π such that πT G = πT

Power method:

Pick an initial guess x(0)

Repeat[x(k+1)]T := [x(k)]T G

until “termination criterion satisfied”

Each iteration is a matrix vector multiply

Man – p.17

Page 18: The Mathematics Behind Google's PageRank

Matrix Vector Multiply

xT G = xT[αS + (1 − α)11vT

]

Cost: # non-zero elements in S

A power method iteration is cheapMan – p.18

Page 19: The Mathematics Behind Google's PageRank

Error Reduction in 1 Iteration

πT G = πT G = αS + (1 − α)11vT

[x(k+1) − π]T = [x(k)]T G − πT G

= α[x(k) − π]T S

Error: ‖x(k+1) − π‖1︸ ︷︷ ︸

iteration k+1

≤ α ‖x(k) − π‖1︸ ︷︷ ︸

iteration k

Man – p.19

Page 20: The Mathematics Behind Google's PageRank

Error in Power Method

πT G = πT G = αS + (1 − α)11vT

Error after k iterations:

‖x(k) − π‖1 ≤ αk ‖x(0) − π‖1︸ ︷︷ ︸

≤2

[Bianchini, Gori & Scarselli 2003]

Error bound does not depend on matrix dimension

Man – p.20

Page 21: The Mathematics Behind Google's PageRank

Advantages of Power Method

• Simple implementation (few decisions)• Cheap iterations (sparse matvec)• Minimal storage (a few vectors)• Robust convergence behaviour• Convergence rate independent of matrix

dimension• Numerically reliable and accurate

(no subtractions, no overflow)

But: can be slowMan – p.21

Page 22: The Mathematics Behind Google's PageRank

PageRank Computation

• Power methodPage, Brin, Motwani & Winograd 1999Bianchini, Gori & Scarselli 2003

• Acceleration of power methodKamvar, Haveliwala, Manning & Golub 2003Haveliwala, Kamvar, Klein, Manning & Golub 2003Brezinski & Redivo-Zaglia 2004, 2006Brezinski, Redivo-Zaglia & Serra-Capizzano 2005

• Aggregation/DisaggregationLangville & Meyer 2002, 2003, 2006Ipsen & Kirkland 2006

Man – p.22

Page 23: The Mathematics Behind Google's PageRank

PageRank Computation

• Methods that adapt to web graphBroder, Lempel, Maghoul & Pedersen 2004 Kamvar,Haveliwala & Golub 2004Haveliwala, Kamvar, Manning & Golub 2003Lee, Golub & Zenios 2003Lu, Zhang, Xi, Chen, Liu, Lyu & Ma 2004Ipsen & Selee 2006

• Krylov methodsGolub & Greif 2004Del Corso, Gullí, Romani 2006

Man – p.23

Page 24: The Mathematics Behind Google's PageRank

PageRank Computation

• Schwarz & asynchronous methodsBru, Pedroche & Szyld 2005Kollias, Gallopoulos & Szyld 2006

• Linear system solutionArasu, Novak, Tomkins & Tomlin 2002Arasu, Novak & Tomkins 2003Bianchini, Gori & Scarselli 2003Gleich, Zukov & Berkin 2004Del Corso, Gullí & Romani 2004Langville & Meyer 2006

Man – p.24

Page 25: The Mathematics Behind Google's PageRank

PageRank Computation

• Surveys of numerical methods:Langville & Meyer 2004Berkhin 2005Langville & Meyer 2006 (book)

Man – p.25

Page 26: The Mathematics Behind Google's PageRank

Is the Ranking Correct?

πT =(.23 .24 .26 .27

)

• xT =(.27 .26 .24 .23

)

‖x − π‖∞ = .04

Small error, but incorrect ranking

• yT =(0 .001 .002 .997

)

‖y − π‖∞ = .727

Large error, but correct ranking

Man – p.26

Page 27: The Mathematics Behind Google's PageRank

What is Important?

Numerical value ↔ ordinal rank

ordinal rank:position of an element in an ordered list

Very little research on ordinal rankingRank-stability, rank-similarity

[Lempel & Moran, 2005][Borodin, Roberts, Rosenthal & Tsaparas 2005]

Man – p.27

Page 28: The Mathematics Behind Google's PageRank

Ordinal Ranking

Largest element gets Orank 1

πT =(.23 .24 .26 .27

)

Orank(π4) = 1, Orank(π1) = 4

• xT =(.27 .26 .24 .23

)

Orank(x1) = 1 6= Orank(π1) = 4

• yT =(0 .001 .002 .997

)

Orank(y4) = 1 = Orank(π4)

Man – p.28

Page 29: The Mathematics Behind Google's PageRank

Problems with Ordinal Ranking

When done with power method:• Popular termination criteria

do not guarantee correct ranking• Additional iterations can destroy ranking• Rank convergence depends on:

α, v, initial guess, matrix dimension,structure of web graph

• Even if successive iterates have the sameranking, their ranking may not be correct

[Wills & Ipsen 2007]Man – p.29

Page 30: The Mathematics Behind Google's PageRank

Ordinal Ranking Criterion

Given:

Approximation x, x ≥ 0Error bound β ≥ ‖x − π‖1

Criterion: xi > xj + β =⇒ πi > πj

Why?

(xi − πi) − (xj − πj) ≤ ‖x − π‖1 ≤ β

xi − (xj + β) ≤ πi − πj

0 < xi − (xj + β) =⇒ 0 < πi − πj

[Kirkland 2006]Man – p.30

Page 31: The Mathematics Behind Google's PageRank

Applicability of Criterion

20 40 60 80 100 120 140 160 1800

10

20

30

40

50

60

70

80

90

100

Iteration Number

Perc

enta

ge o

f E

lem

ents

B1 FP Bound AppliesElement Matches Successor

Man – p.31

Page 32: The Mathematics Behind Google's PageRank

Properties of Ranking Criterion

• Applies to any approximation,provided error bound is available

• Requires well-separated elements• Tends to identify ranks of larger elements• Determines partial ranking• Top-k, bucket and exact ranking• Easy to use with power method

Man – p.32

Page 33: The Mathematics Behind Google's PageRank

Top-k Ranking

Given:

Approximation x to PageRank πPermutation P so that x̃ = Px with

x̃1 ≥ . . . ≥ x̃n

Write: π̃ = Pπ

Suppose x̃k > x̃k+1 + β

Man – p.33

Page 34: The Mathematics Behind Google's PageRank

Top-k Ranking

x̃1 Orank(π̃1) ≤ k... ...

x̃k Orank(π̃k) ≤ k

↑β

↓x̃k+1 Orank(π̃k+1) ≥ k + 1

... ...x̃n Orank(π̃n) ≥ k + 1

Man – p.34

Page 35: The Mathematics Behind Google's PageRank

Exact Ranking

Given:

Approximation x to PageRank πPermutation P so that x̃ = Px with

x̃1 ≥ . . . ≥ x̃n

Write: π̃ = Pπ

If x̃k−1 > x̃k + β and x̃k > x̃k+1 + β then

Orank(π̃k) = k

Man – p.35

Page 36: The Mathematics Behind Google's PageRank

Exact Ranking

x̃1...

x̃k−1

l β Orank(π̃k) ≥ k

x̃k

l β Orank(π̃k) ≤ k

x̃k+1...

x̃n

Man – p.36

Page 37: The Mathematics Behind Google's PageRank

Bucket Ranking

Given:

Approximation x to PageRank πPermutation P so that x̃ = Px with

x̃1 ≥ . . . ≥ x̃n

Write: π̃ = Pπ

Suppose x̃k−i > x̃k + β and x̃k > x̃k+j + β

π̃k is in bucket of width i + j − 1

Man – p.37

Page 38: The Mathematics Behind Google's PageRank

Bucket Ranking

x̃1...

x̃k−i

l β Orank(π̃k)≥ k − i + 1

x̃k

l β Orank(π̃k)≤ k + j − 1

x̃k+j...

x̃n

Man – p.38

Page 39: The Mathematics Behind Google's PageRank

Experiments

n # buckets 1. bucket last bucket9,914 4,307 1 7%

3,148,440 34,911 1 95%

n exact rank exact top 100 lowest rank9,914 32% 79 9,215

3,148,440 0.76% 100 151,794

Man – p.39

Page 40: The Mathematics Behind Google's PageRank

Buckets for Small Matrix

0 500 1000 1500 2000 2500 3000 3500 40000

20

40

60

80

100

120

140

160

180

200

Bucket Number

Num

ber

in E

ach

Buc

ket

Man – p.40

Page 41: The Mathematics Behind Google's PageRank

Power Method Ranking

Simple error bound:

‖x(k) − π‖1 ≤ 2αk︸︷︷︸

β

Simple ranking criterion:

If x(k)i > x

(k)j + 2αk then πi > πj

But: 2αk is too pessimistic (not tight enough)

Man – p.41

Page 42: The Mathematics Behind Google's PageRank

Power Method Ranking

Tighter error bound:

‖x(k) − π‖1 ≤α

1 − α‖x(k) − x(k−1)‖1

︸ ︷︷ ︸

β

More effective ranking criterion:

If x(k)i > x

(k)j + β then πi > πj

Man – p.42

Page 43: The Mathematics Behind Google's PageRank

Floating Point Ranking

If x(k)i > x

(k)j + β then πi > πj

β =α

1 − α‖x(k) − x(k−1)‖1 + e

e is round off error from single matvec

IEEE double precision floating point arithmetic:

e ≈ cm10−16

m ≈ max # links into any web page

Man – p.43

Page 44: The Mathematics Behind Google's PageRank

Expensive Implementation

To prevent accumulation of round off

• Explicit normalization of iteratesx(k+1) = x(k+1)/‖x(k+1)‖1

• Compute norms, inner products, matvecswith compensated summation

• Limited by round off error from single matvec

• Analysis for matrix dimensions n < 1014

in IEEE arithmetic (ǫ ≈ 10−16)

Man – p.44

Page 45: The Mathematics Behind Google's PageRank

Difficulties for XLARGE Problems

• Catastrophic cancellation when computing

β =α

1 − α‖x(k) − x(k−1)‖1 + e

• Bound β dominated by round off e

• Compensated summation insufficient to reducehigher order round off O(nǫ2)

• Doubly compensated summation tooexpensive: O(n log n) flops

Man – p.45

Page 46: The Mathematics Behind Google's PageRank

Possible Remedies

• Lump dangling nodes [Ipsen & Selee 2006]Web pages w/o outlinks:pdf & image files, protected pages, web frontierUp to 50%-80% of all web pages

• Remove unreferenced web pages• Use faster converging method

Then 1 power method iteration for ranking• Relative ranking criteria?

Man – p.46

Page 47: The Mathematics Behind Google's PageRank

Summary

• Google orders web pages according to:PageRank and hypertext analysis

• PageRank = left eigenvector of G

G = αS + (1 − α)11vT

• Power method: simple, robust, accurate• Convergence rate depends on α

but not on matrix dimension• Criterion for ordinal ranking• Round off serious for XLarge problems

Man – p.47

Page 48: The Mathematics Behind Google's PageRank

User-Friendly Resources

• Rebecca Wills:Google’s PageRank: The Math Behind theSearch EngineMathematical Intelligencer, 2006

• Amy Langville & Carl Meyer:Google’s PageRank and BeyondThe Science of Search Engine RankingsPrinceton University Press, 2006

• Amy Langville & Carl Meyer:Broadcast of On-Air Interview, November 2006Carl Meyer’s web page

Man – p.48