page rank

77
1 Importance of Web Pages PageRank Topic-Specific PageRank Hubs and Authorities Combatting Link Spam

Upload: byron-villarreal

Post on 28-Jan-2015

2.322 views

Category:

Technology


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Page rank

1

Importance of Web Pages

PageRank

Topic-Specific PageRank

Hubs and Authorities

Combatting Link Spam

Page 2: Page rank

2

PageRank

�Intuition: solve the recursive equation: “a page is important if important pages link to it.”

�In high-falutin’ terms: importance = the principal eigenvector of the transition matrix of the Web.

� A few fixups needed.

Page 3: Page rank

3

Transition Matrix of the Web

� Enumerate pages.

� Page i corresponds to row and column i.

� M [i, j ] = 1/n if page j links to n pages, including page i ; 0 if j does not link to i.

� M [i, j ] is the probability we’ll next be at page i if we are now at page j.

Page 4: Page rank

4

Example: Transition Matrix

i

j

Suppose page j links to 3 pages, including i but not x.

1/3x

0

Page 5: Page rank

5

Random Walks on the Web

�Suppose v is a vector whose i th

component is the probability that each random walker is at page i at a certain time.

�If each walker follows a link from i at random, the probability distribution for walkers is then given by the vector M v.

Page 6: Page rank

6

Random Walks – (2)

�Starting from any vector v, the limit M (M (…M (M v ) …)) is the long-term distribution of walkers.

�Intuition: pages are important in proportion to how likely a walker is to be there.

�The math: limiting distribution = principal eigenvector of M = PageRank.

Page 7: Page rank

7

Example: The Web in 1839

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 1m 0 1/2 0

y a m

Page 8: Page rank

8

Solving The Equations

�Because there are no constant terms, the equations v = M v do not have a unique solution.

�In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution).

�Can work if you start with a fixed v.

Page 9: Page rank

9

Simulating a Random Walk

�Start with the vector v = [1, 1,…, 1] representing the idea that each Web page is given one unit of importance.

�Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.

�About 50 iterations is sufficient to estimate the limiting solution.

Page 10: Page rank

10

Example: Iterating Equations

�Equations v = M v :

y = y /2 + a /2

a = y /2 + m

m = a /2

ya =m

111

13/21/2

5/41

3/4

9/811/81/2

6/56/53/5

. . .

Note: “=” isreally “assignment.”

Page 11: Page rank

11

The Walkers

Yahoo

M’softAmazon

Page 12: Page rank

12

The Walkers

Yahoo

M’softAmazon

Page 13: Page rank

13

The Walkers

Yahoo

M’softAmazon

Page 14: Page rank

14

The Walkers

Yahoo

M’softAmazon

Page 15: Page rank

15

In the Limit …

Yahoo

M’softAmazon

Page 16: Page rank

16

Real-World Problems

�Some pages are “dead ends” (have no links out).

� Such a page causes importance to leak out.

�Other groups of pages are spider traps(all out-links are within the group).

� Eventually spider traps absorb all importance.

Page 17: Page rank

17

Microsoft Becomes Dead End

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 0m 0 1/2 0

y a m

Page 18: Page rank

18

Example: Effect of Dead Ends

�Equations v = M v :

y = y /2 + a /2

a = y /2

m = a /2

ya =m

111

11/21/2

3/41/21/4

5/83/81/4

000

. . .

Page 19: Page rank

19

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 20: Page rank

20

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 21: Page rank

21

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 22: Page rank

22

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 23: Page rank

23

In the Limit …

Yahoo

M’softAmazon

Page 24: Page rank

24

M’soft Becomes Spider Trap

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 0m 0 1/2 1

y a m

Page 25: Page rank

25

Example: Effect of Spider Trap

�Equations v = M v :

y = y /2 + a /2

a = y /2

m = a /2 + m

ya =m

111

11/23/2

3/41/27/4

5/83/82

003

. . .

Page 26: Page rank

26

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 27: Page rank

27

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 28: Page rank

28

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 29: Page rank

29

In the Limit …

Yahoo

M’softAmazon

Page 30: Page rank

30

PageRank Solution to Traps, Etc.

�“Tax” each page a fixed percentage at each interation.

�Add a fixed constant to all pages.

�Models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step.

Page 31: Page rank

31

Example: Microsoft is a Spider Trap; 20% Tax

�Equations v = 0.8(M v ) + 0.2:

y = 0.8(y /2 + a/2) + 0.2

a = 0.8(y /2) + 0.2

m = 0.8(a /2 + m) + 0.2

ya =m

111

1.000.601.40

0.840.601.56

0.7760.5361.688

7/115/11

21/11

. . .

Page 32: Page rank

32

General Case

�In this example, because there are no dead-ends, the total importance remains at 3.

�In examples with dead-ends, some importance leaks out, but total remains finite.

Page 33: Page rank

33

Solving the Equations

�Because there are constant terms, we can expect to solve small examples by Gaussian elimination.

�Web-sized examples still need to be solved by relaxation.

Page 34: Page rank

34

Finding a Good Starting Vector

1. Newton-like prediction of where components of the principal eigenvector are heading.

2. Take advantage of locality in the Web.

� Each technique can reduce the number of iterations by 50%.

� Important – PageRank takes time!

Page 35: Page rank

35

Predicting Component Values

�Three consecutive values for the importance of a page suggests where the limit might be.

1.0

0.70.6 0.55

Guess for the next round

Page 36: Page rank

36

Exploiting Substructure

�Pages from particular domains, hosts, or directories, like stanford.edu or infolab.stanford.edu/~ullman

tend to have many internal links.

�Initialize PageRank using ranks within your local cluster, then ranking the clusters themselves.

Page 37: Page rank

37

Strategy

�Compute local PageRanks (in parallel?).

�Use local weights to establish weights on edges between clusters.

�Compute PageRank on graph of clusters.

�Initial rank of a page is the product of its local rank and the rank of its cluster.

�“Clusters” are appropriately sized regions with common domain or lower-level detail.

Page 38: Page rank

38

In Pictures

2.0

0.1

Local ranks

2.05

0.05Intercluster weights

Ranks of clusters

1.5

Initial eigenvector

3.0

0.15

Page 39: Page rank

39

Topic-Specific Page Rank

�Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history.”

�Allows search queries to be answered based on interests of the user.

� Example: Query Maccabi wants different

pages depending on whether you are interested in sports or history.

Page 40: Page rank

40

Teleport Sets

� Assume each walker has a small probability of “teleporting” at any tick.

� Teleport can go to:

1. Any page with equal probability.

� As in the “taxation” scheme.

2. A topic-specific set of “relevant” pages (teleport set ).

� For topic-specific PageRank.

Page 41: Page rank

41

Example: Topic = Software

�Only Microsoft is in the teleport set.

�Assume 20% “tax.”

� I.e., probability of a teleport is 20%.

Page 42: Page rank

42

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Dr. Who’sphonebooth.

Page 43: Page rank

43

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 44: Page rank

44

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 45: Page rank

45

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 46: Page rank

46

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 47: Page rank

47

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 48: Page rank

48

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 49: Page rank

49

Picking the Teleport Set

1. Choose the pages belonging to the topic in Open Directory.

2. “Learn” from examples the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.

Page 50: Page rank

50

Application: Link Spam

�Spam farmers today create networks of millions of pages designed to focus PageRank on a few undeserving pages.

�To minimize their influence, use a teleport set consisting of trusted pages only.

� Example: home pages of universities.

Page 51: Page rank

51

Hubs and Authorities

�Mutually recursive definition:

� A hub links to many authorities;

� An authority is linked to by many hubs.

�Authorities turn out to be places where information can be found.

� Example: course home pages.

�Hubs tell where the authorities are.

� Example: CS Dept. course-listing page.

Page 52: Page rank

52

Transition Matrix A

�H&A uses a matrix A [i, j ] = 1 if page ilinks to page j, 0 if not.

�AT, the transpose of A, is similar to the PageRank matrix M, but AT has 1’s where M has fractions.

Page 53: Page rank

53

Example: H&A Transition Matrix

Yahoo

M’softAmazon

y 1 1 1a 1 0 1m 0 1 0

y a m

A =

Page 54: Page rank

54

Using Matrix A for H&A

�Powers of A and AT have elements of exponential size, so we need scale factors.

�Let h and a be vectors measuring the “hubbiness” and authority of each page.

�Equations: h = λAa; a = µAT h.

� Hubbiness = scaled sum of authorities of successor pages (out-links).

� Authority = scaled sum of hubbiness of predecessor pages (in-links).

Page 55: Page rank

55

Consequences of Basic Equations

�From h = λAa; a = µAT h we can derive:� h = λµAAT h

� a = λµATA a

�Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.� Pick an appropriate value of λµ.

Page 56: Page rank

56

Example: Iterating H&A

1 1 1A = 1 0 1

0 1 0

1 1 0AT = 1 0 1

1 1 0

3 2 1AAT= 2 2 0

1 0 1

2 1 2ATA= 1 2 1

2 1 2

a(yahoo)a(amazon)a(m’soft)

===

111

545

241824

11484

114

. . .

. . .

. . .

1+√321+√3

h(yahoo) = 1h(amazon) = 1h(m’soft) = 1

642

1329636

. . .

. . .

. . .

1.0000.7350.268

2820

8

Page 57: Page rank

57

Solving H&A in Practice

�Iterate as for PageRank; don’t try to solve equations.

�But keep components within bounds.

� Example: scale to keep the largest component of the vector at 1.

�Trick: start with h = [1,1,…,1]; multiply by AT to get first a; scale, then multiply by A to get next h,…

Page 58: Page rank

58

Solving H&A – (2)

�You may be tempted to compute AAT

and ATA first, then iterate these matrices as for PageRank.

�Bad, because these matrices are not nearly as sparse as A and AT.

Page 59: Page rank

59

H&A Versus PageRank

�If you talk to someone from IBM, they may tell you “IBM invented PageRank.”

� What they mean is that H&A was invented by Jon Kleinberg when he was at IBM.

�But these are not the same.

�H&A does not appear to be a substitute for PageRank.

�But may be used by Ask.com.

Page 60: Page rank

60

Spam on the Web

�Search has become the default gateway to the web.

�Very high premium to appear on the first page of search results.

Page 61: Page rank

61

What is Web Spam?

� Spamming = any action whose purpose is to boost a web page’s position in search engine results, without providing additional value.

� Spam = Web pages used for spamming.

�Approximately 10-15% of Web pages are spam.

Page 62: Page rank

62

Web-Spam Taxonomy

�Boosting techniques :

� Techniques for increasing the probability a Web page will be a highly ranked answer to a search query.

�Hiding techniques :

� Techniques to hide the use of boosting from humans and Web crawlers.

Page 63: Page rank

63

Hiding techniques

�Content hiding.

� Use same color for text and page background.

�Cloaking.

� Return different page to crawlers and browsers.

�Redirection.

� Redirects are followed by browsers but not crawlers.

Page 64: Page rank

64

Boosting Techniques

�Term spamming :

� Manipulating the text of Web pages in order to appear relevant to queries.

• Why? You can run ads that are relevant to the query.

�Link spamming :

� Creating link structures that boost PageRank.

Page 65: Page rank

65

Term Spamming – (1)

�Repetition : of one or a few specific terms e.g., “free,” “cheap,” “Viagra.”

�Dumping of a large number of unrelated terms.

� E.g., copy entire dictionaries.

Page 66: Page rank

66

Term Spamming – (2)

�Weaving :

� Copy legitimate pages and insert spam terms at random positions.

� Phrase Stitching :

� Glue together sentences and phrases from different sources.

• E.g., use the top-ranked pages on the topic you want to look like.

Page 67: Page rank

67

The Google Solution to Term Spamming

�In addition to PageRank, the original Google engine had another innovation: it trusted what people said about you in preference to what you said about yourself.

�Give more weight to words that appear in or near anchor text than to words that appear in the page itself.

Page 68: Page rank

68

The Google Solution – (2)

�Today, the Google formula for matching terms to documents involves over 250 factors.

� E.g., does the word appear in a header?

�As closely guarded as the formula for Coke.

Page 69: Page rank

69

Link Spam

� Three kinds of Web pages from a spammer’s point of view:

1. Own pages.

• Completely controlled by spammer.

2. Accessible pages.

• E.g., Web-log comment pages: spammer can post links to his pages.

3. Inaccessible pages.

Page 70: Page rank

70

Spam Farms – (1)

� Spammer’s goal:

� Maximize the PageRank of target page t.

� Technique:

1. Get as many links from accessible pages as possible to target page t.

2. Construct “link farm” to get PageRankmultiplier effect.

Page 71: Page rank

71

Spam Farms – (2)

InaccessibleInaccessible

t

Accessible Own

1

2

M

Goal: boost PageRank of page t.One of the most common and effectiveorganizations for a spam farm.

Page 72: Page rank

72

Analysis – (1)

Suppose rank from accessible pages = x.

PageRank of target page = y.

Taxation rate = 1-β.

Rank of each “farm” page = βy/M + (1-β)/N.

InaccessibleInaccessiblet

Accessible Own

12

M

From t

Share of“tax”

Size ofWeb

Page 73: Page rank

73

Analysis – (2)

y = x + βM[βy/M + (1-β)/N] + (1-β)/N

y = x + β2y + β(1-β)M/N

y = x/(1-β2) + cM/N where c = β/(1+β)

InaccessibleInaccessiblet

Accessible Own

12

M

Tax sharefor t.Very small;ignore.

PageRank ofeach “farm” page

Page 74: Page rank

74

Analysis – (3)

�y = x/(1-β2) + cM/N where c = β/(1+β).

�For β = 0.85, 1/(1-β2)= 3.6.

� Multiplier effect for “acquired” page rank.

�By making M large, we can make y as large as we want.

InaccessibleInaccessiblet

Accessible Own

12

M

Page 75: Page rank

75

Detecting Link-Spam

�Topic-specific PageRank, with a set of “trusted” pages as the teleport set is called TrustRank.

�Spam Mass = (PageRank – TrustRank)/PageRank.

� High spam mass means most of your PageRank comes from untrusted sources –you may be link-spam.

Page 76: Page rank

76

Picking the Trusted Set

�Two conflicting considerations:

� Human has to inspect each seed page, so seed set must be as small as possible.

� Must ensure every “good page” gets adequate TrustRank, so all good pages should be reachable from the trusted set by short paths.

Page 77: Page rank

77

Approaches to Picking the Trusted Set

1. Pick the top k pages by PageRank.

� It is almost impossible to get a spam page to the very top of the PageRank order.

2. Pick the home pages of universities.

� Domains like .edu are controlled.