Tutorial 8 (Web Graph Models)


Part of the Search Engine course given at the Technion (2011).

Page 1: Tutorial 8 (web graph models)

Evolutionary Models of the Web Graph

Kira Radinsky

Web size estimation models are based on the Stanford slides by Christopher Manning and Prabhakar Raghavan

Page 2: Tutorial 8 (web graph models)


Stochastic Models for the Web’s Graph

So what can explain the observed Power Law in/out degree distributions of Web pages?

• Standard G(n, p) Erdős–Rényi random graphs:
– A graph contains n nodes, and every two nodes are connected with probability p
– Degrees are distributed B(n−1, p), and since on the Web np << n, they can be viewed as distributed Poisson(np − p)
– Such distributions have light, exponentially decreasing tails, so nodes with very large in-degrees are practically impossible; yet such nodes abound on the Web

Erdős–Rényi random graphs do not model the Web graph
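A minimal simulation sketch (plain Python; n and p are assumed illustrative values) showing how tightly G(n, p) degrees concentrate around np, in contrast to the heavy tails observed on the Web:

```python
import random
from collections import Counter

# Sketch: sample a G(n, p) graph and inspect its degree distribution.
# Degrees are Binomial(n-1, p), approximately Poisson(np) for small p,
# so the tail decays exponentially fast.
def gnp_degrees(n, p, rng):
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:    # each pair is an edge with prob. p
                deg[i] += 1
                deg[j] += 1
    return deg

rng = random.Random(0)
n = 2000
deg = gnp_degrees(n, 5 / n, rng)    # average degree np = 5
hist = Counter(deg)
print("max degree:", max(deg))      # stays near np; e.g. degree 50 is
                                    # astronomically unlikely under Poisson(5)
print(sorted(hist.items()))
```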

Page 3: Tutorial 8 (web graph models)


Evolutionary Models – First Attempt

• The Web wasn’t built in a day; in fact, it is constantly growing and evolving

• Models should (somewhat) reflect the authoring process of Web pages

• Observation: older, well-established nodes should be better connected as they’ve been around longer and are better known

• A corresponding model:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes, chosen uniformly at random
– The expected in-degree at time T of the node added at time t: the node gains the new link at each later step j with probability 1/j, so it equals ∑_{j=t+1}^{T} 1/j ≈ log T − log t = log(T/t)
– This doesn't result in a power law: for a power law, P(2x)/P(x) must be a constant independent of x, and here it isn't
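A short simulation of this uniform-attachment model (the horizon T is an assumed value; a sketch, not from the slides), comparing a node's realized in-degree to the log(T/t) expectation:

```python
import math
import random

# Sketch of the uniform-attachment model above.
def uniform_attachment(T, rng):
    indeg = [0]                       # node 0 exists at time 0
    for t in range(1, T):
        indeg[rng.randrange(t)] += 1  # link to one of t nodes, u.a.r.
        indeg.append(0)               # the new node starts with in-degree 0
    return indeg

rng = random.Random(1)
T = 100_000
indeg = uniform_attachment(T, rng)
for t in (1, 10, 100, 1000):          # realized in-degree vs. ln(T/t)
    print(t, indeg[t], round(math.log(T / t), 2))
```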

Page 4: Tutorial 8 (web graph models)


Preferential Attachment

• Observation: while older, well-established nodes are better known, it is not strictly because of their age but because they have more in-links

• The preferential attachment model:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes
• The probability of linking to node v: (1 + in-degree(v)) / (2t − 1)

• A variant involves a parameter α:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge that connects to node v with probability α/t + (1 − α) · in-degree(v)/(t − 1)

• Both variants indeed result in a Power-Law distribution of in-degrees (different exponents)
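A sketch of the first variant, using the standard repeated-targets trick: keeping each node v in a list (1 + in-degree(v)) times makes a uniform draw pick v with exactly the (1 + in-degree(v)) / (2t − 1) probability above. The P(2x)/P(x) printout gives a rough power-law check (T and the diagnostic values of x are assumptions):

```python
import random
from collections import Counter

# Preferential attachment via the repeated-targets trick: node v appears
# in `targets` (1 + in-degree(v)) times, so a uniform draw from the list
# picks v with probability (1 + indeg(v)) / (2t - 1).
def preferential_attachment(T, rng):
    indeg = [0] * T
    targets = [0]                # node 0, weight 1 + in-degree = 1
    for t in range(1, T):
        v = rng.choice(targets)
        indeg[v] += 1
        targets.append(v)        # v's weight grows with its in-degree
        targets.append(t)        # the new node enters with weight 1
    return indeg

rng = random.Random(2)
hist = Counter(preferential_attachment(200_000, rng))
for x in (1, 2, 4, 8, 16):       # P(2x)/P(x) roughly constant => power law
    print(x, round(hist[2 * x] / hist[x], 3))
```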

Page 5: Tutorial 8 (web graph models)


Preferential Attachment (cont.)

• Another observation: if search engine rankings are influenced by PageRank, then new pages will link to high-PageRank pages more than to low-PageRank pages

• The model uses two positive parameters d, p such that d+p<1

• The evolution:
– Start at time 0 with a single node.
– At step t, add a new node with a single new edge as follows:
• With probability d, connect the edge to one of the existing nodes in proportion to the in-degree (or 1 + in-degree) of that node
• With probability p, connect the edge to a node chosen at random according to the PageRank distribution at time t
• With probability 1 − p − d, connect the edge to an existing node chosen uniformly at random

• With properly chosen parameters, this model can fit both the in-degree and PageRank Power-Law distributions

Raghavan et al., “Using PageRank to characterize Web Structure”, 2002
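A sketch of this degree/PageRank/uniform mixture. The values of d and p, the damping factor, and the per-step PageRank recomputation are all assumptions for illustration, not choices from the paper:

```python
import random

# Simple power-iteration PageRank (damping value is an assumption).
def pagerank(out_links, damping=0.85, iters=30):
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - damping) / n] * n
        for u, vs in enumerate(out_links):
            if vs:
                share = damping * pr[u] / len(vs)
                for v in vs:
                    nxt[v] += share
            else:                          # dangling node: spread its mass
                for w in range(n):
                    nxt[w] += damping * pr[u] / n
        pr = nxt
    return pr

# One step per new node: degree / PageRank / uniform, per the slide.
def evolve(T, d=0.5, p=0.2, rng=random.Random(3)):
    out_links, indeg = [[]], [0]           # start with a single node
    for t in range(1, T):
        pr = pagerank(out_links)           # PageRank distribution at time t
        r = rng.random()
        if r < d:                          # proportional to 1 + in-degree
            v = rng.choices(range(t), weights=[1 + k for k in indeg])[0]
        elif r < d + p:                    # proportional to PageRank
            v = rng.choices(range(t), weights=pr)[0]
        else:                              # uniformly at random
            v = rng.randrange(t)
        indeg[v] += 1
        out_links.append([v])
        indeg.append(0)
    return indeg

print(sorted(evolve(300), reverse=True)[:10])   # a few heavy nodes emerge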

Page 6: Tutorial 8 (web graph models)


The Copy Model

The “Copy Model” assumes the following authoring model:
• Each page is on a topic of interest to its author.
– Some of its links will be copied from a previous page on the same topic that the author found useful
– Some links will be “original”, i.e., chosen independently by the author of the page
• The stochastic process creates nodes with an out-degree of d (parallel edges are allowed):
– Start at time 0 with a single node and d self-loops
– At step t, add a new node with d out-links as follows:
• Choose an intermediate node v, u.a.r. from the t existing nodes
• For j = 1, …, d:
– With probability α, connect link j to a node chosen u.a.r. from the t existing nodes
– With probability 1 − α, copy the j-th link of v

• The copy model results in Power-Law in-degree distributions
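A sketch of the copy model as just described (the out-degree d and α are assumed illustrative values); the roughly constant P(2x)/P(x) ratio again signals a power law:

```python
import random
from collections import Counter

# Copy-model sketch; out-degree d and alpha are assumed values.
def copy_model(T, d=5, alpha=0.3, rng=random.Random(4)):
    links = [[0] * d]                        # node 0 with d self-loops
    for t in range(1, T):
        proto = rng.randrange(t)             # intermediate node v, u.a.r.
        new = []
        for j in range(d):
            if rng.random() < alpha:
                new.append(rng.randrange(t)) # "original" link, u.a.r.
            else:
                new.append(links[proto][j])  # copy v's j-th link
        links.append(new)
    return links

indeg = Counter(v for out in copy_model(100_000) for v in out)
hist = Counter(indeg.values())
for x in (1, 2, 4, 8, 16):                   # rough power-law check
    print(x, round(hist[2 * x] / hist[x], 3))
```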

Page 7: Tutorial 8 (web graph models)


Evolutionary Models - Summary

• Overall, models exist that can simultaneously fit the observed Power-Law distributions of in-degrees, out-degrees and PageRank

– Many other properties of the graph are still unexplained by theoretical evolutionary models

• The accepted models mix-and-match the principles of preferential attachment (degrees/PageRank), copying, and random connectivity

• These models have the “rich get richer” property, and favor seniority (i.e. nodes from earlier rounds tend to have higher degrees)

– One can add some random “fitness” to nodes, with preferential attachment considering fitness as well, to give new nodes better chances of competing with existing nodes (see the sketch after this list)

• Note that there’s a difference between “rich get richer” and “winner takes all” – the Web’s graph doesn’t exhibit the dominance of a single winner
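One simple way to realize the fitness idea from the bullet above. This is a sketch of an assumed variant, not a model from the slides: attachment probability is taken proportional to fitness(v) · (1 + in-degree(v)), so a fit newcomer can catch up with senior nodes:

```python
import random

# Assumed variant: attachment probability proportional to
# fitness(v) * (1 + in-degree(v)), so a fit newcomer can catch up.
def fitness_attachment(T, rng=random.Random(5)):
    fit, indeg = [rng.random()], [0]
    for t in range(1, T):
        weights = [fit[v] * (1 + indeg[v]) for v in range(t)]
        v = rng.choices(range(t), weights=weights)[0]
        indeg[v] += 1
        fit.append(rng.random())
        indeg.append(0)
    return fit, indeg

fit, indeg = fitness_attachment(5000)
top = max(range(len(indeg)), key=indeg.__getitem__)
print("richest node:", top, "with fitness", round(fit[top], 2))
```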

Page 8: Tutorial 8 (web graph models)


Related Research Area: The Science of Networks

• Power-law and scale-free networks

• “Small World” networks and the importance of weak ties

– Kleinberg’s small-world grid

• Social/collaboration networks
– Milgram's “six degrees of separation”

– The six degrees of Kevin Bacon

– Erdős numbers

(Hebrew aside, after Shlomi Bracha of Mashina, joking about chains of acquaintance: “My neighbor's uncle got the deputy-commander post — so said the wife of my sister's son.”)

Page 9: Tutorial 8 (web graph models)

What is the size of the web?

• Issues
– The web is really infinite

• Dynamic content, e.g., calendar

• Soft 404: www.yahoo.com/<anything> is a valid page

– Static web contains syntactic duplication, mostly due to mirroring (~30%)

– Some servers are seldom connected

• Who cares?
– Media, and consequently the user

– Engine design

– Engine crawl policy. Impact on recall.

Page 10: Tutorial 8 (web graph models)

What can we attempt to measure?

(IQ is whatever the IQ tests measure.)

– The statically indexable web is whatever search engines index.

• Different engines have different preferences

– max URL depth, max count/host, anti-spam rules, priority rules, etc.

• Different engines index different things under the same URL:

– frames, meta-keywords, document restrictions, document extensions, ...

Page 11: Tutorial 8 (web graph models)

Relative Size from Overlap

Given two engines A and B:
• Sample URLs randomly from A; check if they are contained in B, and vice versa
• Each test involves: (i) sampling, (ii) checking
• Suppose the tests find A ∩ B = (1/2) · |A| and A ∩ B = (1/6) · |B|. Then
(1/2) · |A| = (1/6) · |B| ⇒ |A| / |B| = (1/6) / (1/2) = 1/3
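The same arithmetic as a tiny script; the 1/2 and 1/6 overlap fractions are the illustrative numbers from the example above, not real measurements:

```python
# Capture-recapture style estimate from overlap fractions.
frac_A_in_B = 1 / 2   # fraction of A's URL sample found in B: A∩B ≈ |A|/2
frac_B_in_A = 1 / 6   # fraction of B's URL sample found in A: A∩B ≈ |B|/6

# (1/2)|A| = (1/6)|B|  =>  |A|/|B| = (1/6)/(1/2) = 1/3
print("estimated |A| / |B| =", frac_B_in_A / frac_A_in_B)
```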

Page 12: Tutorial 8 (web graph models)

Sampling URLs

• Ideal strategy: generate a random URL and check for containment in each index
• Problem: random URLs are hard to find! For relative sizes, it is enough to generate a random URL contained in a given engine

• Approach 1: Generate a random URL contained in a given engine

– Random queries

– Random searches

• Approach 2: methods that give a true estimate of the size of the web (as opposed to just relative sizes of indexes)

– Random IP addresses

– Random walks

Page 13: Tutorial 8 (web graph models)

Random URLs from random queries

• Generate a random query: how?
– Lexicon: 400,000+ words from a web crawl (not an English dictionary)
– Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
• Get the top 100 result URLs from engine A
• Choose a random URL as the candidate to check for presence in engine B
• This distribution induces a probability weight W(p) for each page
• Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
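A sketch of the random-query procedure. The `search(engine, query)` function and the tiny lexicon are hypothetical stand-ins, since no real engine API is specified here; a real study would plug in the crawled 400,000-word lexicon and actual engine queries:

```python
import random

# Hypothetical stand-ins for a crawled lexicon and a search-engine API.
LEXICON = ["vocalists", "rsi", "turbine", "haiku"]   # toy lexicon (assumed)

def search(engine, query):
    return []                  # stub: would return result URLs from `engine`

def random_conjunctive_query(rng):
    w1, w2 = rng.sample(LEXICON, 2)
    return f"{w1} AND {w2}"    # e.g., "vocalists AND rsi"

def sample_url_from_A(rng):
    results = search("A", random_conjunctive_query(rng))[:100]
    return rng.choice(results) if results else None  # random candidate URL

def contained_in_B(url):
    return url in search("B", url)   # check the candidate's presence in B

print(sample_url_from_A(random.Random(7)))   # None with the stub engine
```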

Page 14: Tutorial 8 (web graph models)

Random searches

• Choose random searches extracted from a local log [Lawrence & Giles 97] or build “random searches” [Notess]

– Use only queries with small result sets.

– Count normalized URLs in result sets.

– Use ratio statistics
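A sketch of the ratio statistic over small-result queries; the per-query hit counts below are made-up illustrative numbers, not data from the cited studies:

```python
# Each tuple: (query, URLs returned by engine A, URLs returned by engine B),
# restricted to queries with small result sets and normalized URLs.
queries = [("q1", 12, 30), ("q2", 4, 14), ("q3", 9, 22)]
ratio = sum(a for _, a, _ in queries) / sum(b for _, _, b in queries)
print("estimated |A| / |B| =", round(ratio, 3))
```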

Page 15: Tutorial 8 (web graph models)

Random IP addresses

• Generate random IP addresses

• Find a web server at the given address

– If there’s one

• Collect all pages from server

– From this, choose a page at random
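A minimal sketch of the first two steps: draw a uniformly random IPv4 address and probe port 80 with a short timeout. A real study must also handle virtual hosting (many sites per IP), soft 404s, and robots exclusions:

```python
import random
import socket

# Sketch: draw a uniformly random IPv4 address and probe port 80.
def random_ip(rng):
    return ".".join(str(rng.randrange(256)) for _ in range(4))

def has_web_server(ip, timeout=2.0):
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True                 # something answered on port 80
    except OSError:
        return False

ip = random_ip(random.Random())
print(ip, has_web_server(ip))
```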

Page 16: Tutorial 8 (web graph models)

Random walks

• View the Web as a directed graph

• Build a random walk on this graph
– Includes various “jump” rules back to visited sites
• Does not get stuck in spider traps!
• Can follow all links!
– Converges to a stationary distribution
• Must assume the graph is finite and independent of the walk
• These conditions are not satisfied (cookie crumbs, flooding)
• Time to convergence is not really known
– Sample from the stationary distribution of the walk
– Use the “strong query” method to check coverage by the search engine
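A toy version of such a walk (the graph, jump probability, and step count are assumptions for illustration). With probability `jump` the walker restarts at an already-visited node, so dead ends and spider traps cannot capture it; the final node approximates a draw from the walk's stationary distribution, subject to the convergence caveats above:

```python
import random

# Random-walk sketch on an assumed toy directed graph.
def walk_sample(graph, steps=100_000, jump=0.15, rng=random.Random(6)):
    node = next(iter(graph))
    visited = [node]
    for _ in range(steps):
        out = graph.get(node, [])
        if not out or rng.random() < jump:
            node = rng.choice(visited)   # "jump" rule back to a visited site
        else:
            node = rng.choice(out)       # follow an out-link
        visited.append(node)
    return node

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(walk_sample(toy))
```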