navigating the web graph - caida.org

89
Navigating the Web graph Workshop on Networks and Navigation Santa Fe Institute, August 2008 Filippo Menczer Informatics & Computer Science Indiana University, Bloomington

Upload: others

Post on 26-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Navigating the Web graph - caida.org

Navigating the Web graphWorkshop on Networks and Navigation

Santa Fe Institute, August 2008

Filippo MenczerInformatics & Computer ScienceIndiana University, Bloomington

Page 2: Navigating the Web graph - caida.org

Outline

Topical locality: Content, link, and semantic topologiesImplications for growth models and navigationApplications

Topical Web crawlersDistributed collaborative peer search

Page 3: Navigating the Web graph - caida.org

The Web as a text corpus

Pages close in word vector space tend to be related

Cluster hypothesis (van Rijsbergen 1979)

The WebCrawler (Pinkerton 1994)

The whole first generation of search engines

weapons

mass

destruction

p1p2

Page 4: Navigating the Web graph - caida.org

Enter the Web’s link structure

Broder & al. 2000

p(i) =α

N+ (1 − α)

j:j→i

p(j)

|" : j → "|

Brin & Page 1998

Barabasi & Albert 1999

Page 5: Navigating the Web graph - caida.org

Three network topologies

Text Links

Page 6: Navigating the Web graph - caida.org

Three network topologies

Text LinksMeaning

Page 7: Navigating the Web graph - caida.org

Connection between semantic topology (topicality or relevance) and link topology (hypertext)

G = Pr[rel(p)] ~ fraction of relevant pages (generality)

R = Pr[rel(p) | rel(q) AND link(q,p)]

Related nodes are “clustered” if R > G (modularity)

Necessary and sufficient condition for a random crawler to find pages related to start points

G = 5/15C = 2R = 3/6 = 2/4

The “link-cluster” conjecture

ICML 1997

Page 8: Navigating the Web graph - caida.org

• Stationary hit rate for a random crawler:

Link-cluster conjecture

η(t + 1) = η(t) ⋅R + (1 −η(t)) ⋅G ≥ η(t)

η t→∞ → η∗ =G

1− (R −G)η∗ >G⇔ R >G

η∗

G−1 =

R−G1 − (R−G)

Value added

Conjecture

Page 9: Navigating the Web graph - caida.org

Pages that link to each other tend to be related

Preservation of semantics (meaning)

A.k.a. topic drift

Link-cluster conjecture

L(q,δ) ≡path(q, p)

{p: path(q,p ) ≤δ }∑

{p : path(q, p) ≤ δ}

R(q,δ)G(q)

≡Pr rel(p) | rel(q)∧ path(q, p) ≤ δ[ ]

Pr[rel(p)]

JASIST 2004

Page 10: Navigating the Web graph - caida.org

9

Correlation of lexical and linkage topology

L(δ): average link distance

S(δ): average similarity to start (topic) page from pages up to distance δ

Correlationρ(L,S) = –0.76

The “link-content” conjecture

S(q,δ) ≡sim(q, p)

{p: path(q,p ) ≤δ }∑

{p : path(q, p) ≤ δ}

Page 11: Navigating the Web graph - caida.org

Heterogeneity of link-content correlation

S = c + (1− c)eaLb edu net

gov

com

signif. diff. a only (p<0.05)

signif. diff. a & b (p<0.05)

org

Page 12: Navigating the Web graph - caida.org

Mapping the relationship between links, content, and

semantic topologies• Given any pair of pages, need ‘similarity’ or

‘proximity’ metric for each topology:– Content: textual/lexical (cosine) similarity– Link: co-citation/bibliographic coupling– Semantic: relatedness inferred from manual classification

• Data: Open Directory Project (dmoz.org)– ~ 1 M pages after cleanup– ~ 1.3*1012 page pairs!

Page 13: Navigating the Web graph - caida.org

Content similarity€

σ c p1, p2( ) =p1 ⋅ p2p1 ⋅ p2

term i

term j

term k

p1p2

p1 p2

σ l (p1, p2) =Up1

∩Up2

Up1∪Up2

Link similarity

Page 14: Navigating the Web graph - caida.org

Semantic similarity

• Information-theoretic measure based on classification tree (Lin 1998)

• Classic path distance in special case of balanced tree€

σ s(c1,c2) =2logPr[lca(c1,c2)]logPr[c1]+ logPr[c2]

top

lca

c1

c2

Page 15: Navigating the Web graph - caida.org

Individual metric distributions

semanticcontent

link

Page 16: Navigating the Web graph - caida.org

| Retrieved & Relevant || Retrieved |

| Retrieved & Relevant || Relevant |

Precision =

Recall =

Page 17: Navigating the Web graph - caida.org

| Retrieved & Relevant || Retrieved |

| Retrieved & Relevant || Relevant |

P(sc,sl ) =

σ s(p,q){p,q:σ c = sc ,σ l = sl }

{p,q :σ c = sc,σ l = sl}

R(sc,sl ) =

σ s(p,q){p,q:σ c = sc ,σ l = sl }

σ s(p,q){p,q}∑

Averagingsemantic similarity

Summingsemantic similarity

Precision =

Recall =

Page 18: Navigating the Web graph - caida.org

Science

σc

σllog Recall Precision

Page 19: Navigating the Web graph - caida.org

Adult

σc

σllog Recall Precision

Page 20: Navigating the Web graph - caida.org

News

σc

σllog Recall Precision

Page 21: Navigating the Web graph - caida.org

All pairs

σc

σllog Recall Precision

Page 22: Navigating the Web graph - caida.org

Outline

Topical locality: Content, link, and semantic topologiesImplications for growth models and navigationApplications

Topical Web crawlersDistributed collaborative peer search

Page 23: Navigating the Web graph - caida.org

Link probability vs lexical distance

r =1 σ c −1

Pr(λ | ρ) =(p,q) : r = ρ∧σ l > λ

(p,q) : r = ρ

Page 24: Navigating the Web graph - caida.org

Link probability vs lexical distance

r =1 σ c −1

Pr(λ | ρ) =(p,q) : r = ρ∧σ l > λ

(p,q) : r = ρ

Phasetransition

Power law tail

Pr(λ | ρ) ~ ρ−α(λ)

ρ*

Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002

Page 25: Navigating the Web graph - caida.org

Local content-based growth model

• Similar to preferential attachment (BA)

• Use degree info (popularity/importance) only for nearby (similar/related) pages

Pr(pt → pi< t ) =k(i)mt

if r(pi, pt ) < ρ*

c[r(pi, pt )]−α otherwise

Page 26: Navigating the Web graph - caida.org

So, many models can predict degree distributions...

Which is “right” ?

Need an independent observation (other than degree) to validate models

Distribution of content similarity across linked pairs

Page 27: Navigating the Web graph - caida.org

None of these models is right!

Page 28: Navigating the Web graph - caida.org

The mixture modelPr(i) ∝ ψ ·

1

t+ (1 − ψ) ·

k(i)

mtdegree-uniform mixture

ai2

i1

i3t

bi2

i1

i3t

ci2

i1

i3t

Page 29: Navigating the Web graph - caida.org

The mixture model

Bias choice by content similarity instead of uniform distribution

Pr(i) ∝ ψ ·1

t+ (1 − ψ) ·

k(i)

mtdegree-uniform mixture

ai2

i1

i3t

bi2

i1

i3t

ci2

i1

i3t

Page 30: Navigating the Web graph - caida.org

Degree-similarity mixture model

Pr(i) ∝ ψ · P̂r(i) + (1 − ψ) ·k(i)

mt

Page 31: Navigating the Web graph - caida.org

Degree-similarity mixture model

Pr(i) ∝ ψ · P̂r(i) + (1 − ψ) ·k(i)

mt

ψ = 0.2, α = 1.7

P̂r(i) ∝ [r(i, t)]−α

Page 32: Navigating the Web graph - caida.org

Both mixture models get the degree distribution right…

Page 33: Navigating the Web graph - caida.org

…but the degree-similarity mixture model predicts the similarity distribution better

Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004

Page 34: Navigating the Web graph - caida.org

0

4

DMOZ

0

0.25

0.5

0.75

1

!l

0

2

PNAS

0 0.25 0.5 0.75 1

!c

0

0.25

0.5

0.75

1

!l

Citation networks

15,785 articles

published in PNAS between 1997 and

2002

Page 35: Navigating the Web graph - caida.org

Citation networks

Page 36: Navigating the Web graph - caida.org

Citation networks

Page 37: Navigating the Web graph - caida.org

Open Questions

Understand distribution of content similarity across all pairs of pages

Growth model to explain co-evolution of both link topology and content similarity

The role of search engines

Page 38: Navigating the Web graph - caida.org

Efficient crawling algorithms?

Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:

~ log N (Barabasi & Albert 1999)~ log N / log log N (Bollobas 2001)

Page 39: Navigating the Web graph - caida.org

Efficient crawling algorithms?

Practice: can’t find them!• Greedy algorithms based on location in geographical small

world networks: ~ poly(N) (Kleinberg 2000)• Greedy algorithms based on degree in power law

networks: ~ N (Adamic, Huberman & al. 2001)

Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:

~ log N (Barabasi & Albert 1999)~ log N / log log N (Bollobas 2001)

Page 40: Navigating the Web graph - caida.org

Exception # 1

• Geographical networks (Kleinberg 2000)– Local links to all lattice neighbors– Long-range link probability distribution:

power law Pr ~ r–α

• r: lattice (Manhattan) distance• α: constant clustering exponent

t ~ log2 N⇔α = D

Page 41: Navigating the Web graph - caida.org

Is the Web a geographical network?

local links

long range links (power law tail)

Replace lattice distance by lexical distancer = (1 / σc) – 1

Page 42: Navigating the Web graph - caida.org

Exception # 2

• Hierarchical networks (Kleinberg 2002, Watts & al. 2002)– Nodes are classified at the leaves of tree– Link probability distribution: exponential tail

Pr ~ e–h

• h: tree distance (height of lowest common ancestor)

h=1h=2

t ~ logε N,ε ≥1

Page 43: Navigating the Web graph - caida.org

exponential tail

Is the Web a hierarchical network? Replace tree distance by semantic distance

h = 1 – σs

top

lca

c1

c2

Page 44: Navigating the Web graph - caida.org

Take home message:the Web is a “friendly” place!

Page 45: Navigating the Web graph - caida.org

Outline

Topical locality: Content, link, and semantic topologiesImplications for growth models and navigationApplications

Topical Web crawlersDistributed collaborative peer search

Page 46: Navigating the Web graph - caida.org

Crawler applications• Universal Crawlers

– Search engines!

• Topical crawlers– Live search

(e.g., myspiders.informatics.indiana.edu)

– Topical search engines & portals

– Business intelligence (find competitors/partners)

– Distributed, collaborative search

Page 47: Navigating the Web graph - caida.org

spears

[sic]Topical crawlers

Page 48: Navigating the Web graph - caida.org

Evaluating topical crawlers

• Goal: build “better” crawlers to support applications • Build an unbiased evaluation framework

– Define common tasks of measurable difficulty– Identify topics, relevant targets– Identify appropriate performance measures

• Effectiveness: quality of crawler pages, order, etc.• Efficiency: separate CPU & memory of crawler algorithms from

bandwidth & common utilities

Information Retrieval 2005

Page 49: Navigating the Web graph - caida.org

Evaluating topical crawlers: Topics

• Automate evaluation using edited directories

• Different sources of relevance assessments

Keywords

Description

Targets

Page 50: Navigating the Web graph - caida.org

Evaluating topical crawlers: TasksStart from seeds, find targets

and/or pages similar to target descriptions

d=2

d=3

Page 51: Navigating the Web graph - caida.org

Examples of crawling algorithms

• Breadth-First– Visit links in order encountered

• Best-First– Priority queue sorted by similarity– Variants:

– explore top N at a time– tag tree context– hub scores

• SharkSearch– Priority queue sorted by combination of similarity,

anchor text, similarity of parent, etc.

• InfoSpiders

Page 52: Navigating the Web graph - caida.org

Examples of crawling algorithms

• Breadth-First– Visit links in order encountered

• Best-First– Priority queue sorted by similarity– Variants:

– explore top N at a time– tag tree context– hub scores

• SharkSearch– Priority queue sorted by combination of similarity,

anchor text, similarity of parent, etc.

• InfoSpiders

Page 53: Navigating the Web graph - caida.org

Exploration vs. Exploitation

Pages crawled

Avg

targ

et r

ecal

l

Page 54: Navigating the Web graph - caida.org

Co-citation: hub scoresLink scorehub = linear combination between

link and hub score

Page 55: Navigating the Web graph - caida.org

Recall (159 ODP topics)

Split ODP URLs between seeds and

targets

Add 10 best hubs to seeds for 94 topics

0

5

10

15

20

25

30

35

40

45

0 2000 4000 6000 8000 10000

aver

age

targ

et re

call@

N (%

)

N (pages crawled)

Breadth-FirstNaive Best-First

DOMHub-Seeker

0

5

10

15

20

25

30

35

40

45

0 2000 4000 6000 8000 10000

aver

age

targ

et re

call@

N (%

)

N (pages crawled)

Breadth-FirstNaive Best-First

DOMHub-Seeker

ECDL 2003

Page 56: Navigating the Web graph - caida.org

InfoSpiders adaptive distributed algorithm using an evolving population of learning agents

Page 57: Navigating the Web graph - caida.org

InfoSpiders adaptive distributed algorithm using an evolving population of learning agents

keywordvector neural net

local frontier

Page 58: Navigating the Web graph - caida.org

InfoSpiders adaptive distributed algorithm using an evolving population of learning agents

keywordvector neural net

local frontier

offspring

Page 59: Navigating the Web graph - caida.org
Page 60: Navigating the Web graph - caida.org
Page 61: Navigating the Web graph - caida.org
Page 62: Navigating the Web graph - caida.org

Foreach agent thread: Pick & follow link from local frontier Evaluate new links, merge frontier Adjust link estimator E := E + payoff - cost

If E < 0: Die Elsif E > Selection_Threshold: Clone offspring Split energy with offspring Split frontier with offspring Mutate offspring

Evolutionary Local Selection Algorithm (ELSA)

selectivequery

expansion

matchresource

bias

reinforcementlearning

Page 63: Navigating the Web graph - caida.org

Action selection

Page 64: Navigating the Web graph - caida.org

Q-learningCompare estimated relevance of visited document with estimated relevance of link followed from previous page

Teaching input: E(D) + µ maxl(D) λl

Page 65: Navigating the Web graph - caida.org

Performance

ACM Trans. Internet Technology 2003

Pages crawled

Avg

tar

get

reca

ll

Page 66: Navigating the Web graph - caida.org
Page 67: Navigating the Web graph - caida.org
Page 71: Navigating the Web graph - caida.org

6S: Collaborative Peer Search

AB

queryquery

query

query

query

C

WWW2004 WWW2005 WTAS2005P2PIR2006

bookmarks

localstorage

WWW

IndexCrawler

Peer

Page 72: Navigating the Web graph - caida.org

6S: Collaborative Peer Search

AB

queryquery

hit

hit

Data mining & referral opportunities

query

query

query

C

WWW2004 WWW2005 WTAS2005P2PIR2006

bookmarks

localstorage

WWW

IndexCrawler

Peer

Page 73: Navigating the Web graph - caida.org

6S: Collaborative Peer Search

AB

queryquery

hit

hit

Data mining & referral opportunities

query

query

query

C

Emerging communities

WWW2004 WWW2005 WTAS2005P2PIR2006

bookmarks

localstorage

WWW

IndexCrawler

Peer

Page 74: Navigating the Web graph - caida.org

Reinforcement Learning

Page 75: Navigating the Web graph - caida.org

Query Routing

Page 76: Navigating the Web graph - caida.org

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

Page 77: Navigating the Web graph - caida.org

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

Page 78: Navigating the Web graph - caida.org

Simulating 500 Users

ODP (dmoz.org)

Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006

Page 79: Navigating the Web graph - caida.org

P@10

Page 80: Navigating the Web graph - caida.org

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0 0.05 0.1 0.15 0.2 0.25

Prec

ision

Recall

6SCentralized Search Engine

Google

Distributed vs Centralized

Page 81: Navigating the Web graph - caida.org

! " #! #" $! $"!$!

!

$!

%!

&!

'!

#!!

#$!

#%!

()*+,*-./*+./**+

01*+02*.34052*.678

39)-:*+.3;*<<,3,*5:=,0>*:*+

Pajek Pajek

Small-world

Page 82: Navigating the Web graph - caida.org

Semantic Similarity

Page 83: Navigating the Web graph - caida.org

Semantic Similarity

Arts/Movies/Filmmaking

Business/Arts_and_Entertainment/Fashion

Business/E-Commerce/Developers

Business/Telecommunications/Call_Centers

Computers/Programming/Graphics

Health/Conditions_and_Diseases/Cancer

Health/Mental_Health/Grief,_Loss_and_Bereavement

Health/Professions/Midwifery

Health/Reproductive_Health/Birth_Control

Home/Family/Pregnancy

Shopping/Clothing/Accessories

Shopping/Clothing/Footwear

Shopping/Clothing/Uniforms

Shopping/Sports/Cycling

Shopping/Visual_Arts/Artist_Created_Prints

Society/Issues/Abortion

Society/People/Women

Sports/Cycling/Racing

Page 84: Navigating the Web graph - caida.org

Ongoing Work

• Improve coverage/diversity in query routing algorithm

• Spam protection: trust/reputation subsystem

• User study with 6S application

Page 85: Navigating the Web graph - caida.org

User study

Page 86: Navigating the Web graph - caida.org

Query network

Page 87: Navigating the Web graph - caida.org

Result network

Page 89: Navigating the Web graph - caida.org

Thank you!Questions?

informatics.indiana.edu/fil

Research supported by NSF CAREER Award IIS-0348940