lecture #10 pagerank

21
Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems

Upload: akamu

Post on 17-Jan-2016

41 views

Category:

Documents


0 download

DESCRIPTION

Lecture #10 PageRank. CS492 Special Topics in Computer Science: Distributed Algorithms and Systems. Origin of “Google”. Hostnames Active. http://news.netcraft.com. Googol 10^100 Motivation behind Human maintained indices such as Yahoo! Explosive growth. Design Goals of Google. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Lecture  #10 PageRank

Lecture #10PageRank

CS492 Special Topics in Computer Science:Distributed Algorithms and Systems

Page 2: Lecture  #10 PageRank

Origin of “Google”

• Googol– 10^100

• Motivation behind– Human maintained indices such as Yahoo!– Explosive growth

http://news.netcraft.com

HostnamesActive

Page 3: Lecture  #10 PageRank

Design Goals of Google

• Improved search quality– In 1997, 1 out of 4 top search engines found itself– High precision in finding relevant document was neces-

sary

• Academic search engine research– Search engine technology went commercial: an black art– To build systems that a good number of people could use– To build an architecture to support novel research on

large-scale Web data

Page 4: Lecture  #10 PageRank

Weakness of Existing Approaches

• Calculate similarities– Based on flat, vector-space model of each page– Prone to cheating (Web spamming or search en-

gine persuasion)

Page 5: Lecture  #10 PageRank

Basic Idea of PageRank

• Exploit the topological structure of hypertex-tual systems

Page 6: Lecture  #10 PageRank

Simple Example

A

C

B0.2

0.4

0.4

Page 7: Lecture  #10 PageRank

Related Work

• Academic citation analysis– Similarities

• Graph structure; paper = node, web page = node citation = link, URL = link• “node” authority independent of “node” content

– Differences• Uniform unit of info (paper) versus great variability in quality, usage, citations, and length• Equal link weight vs variable importance A backlink from Yahoo! vs. from a friend

Page 8: Lecture  #10 PageRank

Which Page Should Be Ranked Higher?

A B

John Doe

Page 9: Lecture  #10 PageRank

Simple Expression

page rank of

set of pages pointing at

out-degree of

Question: role of c?Answer: total rank of all web pages constant

Page 10: Lecture  #10 PageRank

Dangling links

• Pages without outgoing pointers– Example: Pages not yet downloaded

• Do not affect the calculation much– Remove them, calculate ranks, and add them back

Page 11: Lecture  #10 PageRank

Loop

A

C

B

Question: ranks of A, B, and C?Answer: infinite! (rank sink)

Page 12: Lecture  #10 PageRank

Basic Algorithm

page rank of

set of pages pointing at

out-degree of

dumping factor

Page 13: Lecture  #10 PageRank

Matrix Representation

Question: Where to start?

13

where and

Page 14: Lecture  #10 PageRank

Iterative Algorithm

where and

Question: Will it converge?

Page 15: Lecture  #10 PageRank

Example

[LM04]

Page 16: Lecture  #10 PageRank

Turn the Problem into a Markov Process

[LM04]

Page 17: Lecture  #10 PageRank

Evenly Split Rank of Dangling Links

17

[LM04]

Page 18: Lecture  #10 PageRank

Final Solution

• Eigenvector of P = steady state rank

18

Page 19: Lecture  #10 PageRank

Spam Rank

[BGS05]

Page 20: Lecture  #10 PageRank

Questions

• Where to start?– Find a nondegenerate start vector

• What if there are two pages that point to each other and no one else and there is a page that points to one of them?– Role of dumping factor guarantees no rank sink

Page 21: Lecture  #10 PageRank

References[PBMW] L. Page, S. Brin, R. Motwani, T. Winograd, “The PageRank citation ranking: bringing order

to the web,” WWW 1998[BP98] Sergey Brin, Lawrence Page, “The anatomy of a large-scale hypertextual Web search en-

gine,” Computer Networks and ISDN Systems, Vol. 30, 1998.[BGS05] Monica Bianchini, Marco Gori, Franco Scarselli, “Inside PageRank,” ACM Transactions on

Internet Technology, Vol. 5, No. 1, Feb. 2005.[LM04] Amy N. Langville, Carl Meyer, “Deeper inside PageRank,” Internet Mathematics, Vol. I, No.

3, 2004.[K99] Jon Kleinberg, “Authoritative sources in a Hyperlinked Environment,” Journal of the ACM

46:5 (1999).