web mapping, modeling, mining, and minglingwaw2006/winterschool/menczer-tutori… · web mapping,...
TRANSCRIPT
![Page 1: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/1.jpg)
Web mapping, modeling, mining, and mingling
Tutorial for the 2006 MITACS Winter SchoolModelling and Mining of Networked Information Spaces
Filippo MenczerInformatics & Computer ScienceIndiana University, Bloomington
![Page 2: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/2.jpg)
NaN:Networks & agents Network
http://homer.informatics.indiana.edu/~nan/
![Page 3: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/3.jpg)
OutlineMapping
Topical localityContent, link, and semantic topologies in the Web
ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling
MiningTopical Web crawlersAdaptive, intelligent crawling techniques
MinglingSocial Web search & recommendationDistributed collaborative peer search
![Page 4: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/4.jpg)
The Web as a text corpus
Pages close in word vector space tend to be related
Cluster hypothesis (van Rijsbergen 1979)
The WebCrawler (Pinkerton 1994)
The whole first generation of search engines
weapons
mass
destruction
p1p2
![Page 5: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/5.jpg)
Enter the Web’s link structure
Broder & al. 2000
p(i) =!
N+ (1 ! !)
!
j:j"i
p(j)
|" : j " "|
Brin & Page 1998
Barabasi & Albert 1999
![Page 6: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/6.jpg)
Three network topologies
Text LinksMeaning
![Page 7: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/7.jpg)
Connection between semantic topology (topicality or relevance) and link topology (hypertext)
G = Pr[rel(p)] ~ fraction of relevant pages (generality)
R = Pr[rel(p) | rel(q) AND link(q,p)]
Related nodes are “clustered” if R > G (modularity)
Necessary and sufficient condition for a random crawler to find pages related to start points
G = 5/15C = 2R = 3/6 = 2/4
The “link-cluster” conjecture
ICML 1997
![Page 8: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/8.jpg)
• Stationary hit rate for a random crawler:
Link-cluster conjecture
η(t + 1) = η(t) ⋅R + (1 −η(t)) ⋅G ≥ η(t)
η t→∞ → η∗ =G
1− (R −G)η∗ >G⇔ R >G
η∗
G−1 =
R−G1 − (R−G)
Value added
Conjecture
![Page 9: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/9.jpg)
Pages that link to each other tend to be related
Preservation of semantics (meaning)
A.k.a. topic drift
Link-cluster conjecture
€
L(q,δ) ≡path(q, p)
{p: path(q,p ) ≤δ }∑
{p : path(q, p) ≤ δ}
€
R(q,δ)G(q)
≡Pr rel(p) | rel(q)∧ path(q, p) ≤ δ[ ]
Pr[rel(p)]
JASIST 2004
![Page 10: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/10.jpg)
10
Correlation of lexical and linkage topology
L(δ): average link distance
S(δ): average similarity to start (topic) page from pages up to distance δ
Correlationρ(L,S) = –0.76
The “link-content” conjecture
€
S(q,δ) ≡sim(q, p)
{p: path(q,p ) ≤δ }∑
{p : path(q, p) ≤ δ}
![Page 11: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/11.jpg)
Heterogeneity of link-content correlation
€
S = c + (1− c)eaLb edu net
gov
com
signif. diff. a only (p<0.05)
signif. diff. a & b (p<0.05)
org
![Page 12: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/12.jpg)
Discussion
Topic drift: Myth or reality?
![Page 13: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/13.jpg)
Mapping the relationship between links, content, and semantic topologies
• Given any pair of pages, need ‘similarity’ or ‘proximity’ metric for each topology:– Content: textual/lexical (cosine) similarity– Link: co-citation/bibliographic coupling– Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org)– ~ 1 M pages after cleanup– ~ 1.3*1012 page pairs!
![Page 14: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/14.jpg)
Content similarity€
σ c p1, p2( ) =p1 ⋅ p2p1 ⋅ p2
term i
term j
term k
p1p2
p1 p2
€
σ l (p1, p2) =Up1
∩Up2
Up1∪Up2
Link similarity
![Page 15: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/15.jpg)
Semantic similarity
• Information-theoretic measure based on classification tree (Lin 1998)
• Classic path distance in special case of balanced tree€
σ s(c1,c2) =2logPr[lca(c1,c2)]logPr[c1]+ logPr[c2]
top
lca
c1
c2
![Page 16: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/16.jpg)
Individual metric distributions
semanticcontent
link
![Page 17: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/17.jpg)
Correlations between
similarities
Games
Arts
Business
Society
All pairs
Science
Shopping
Adult
Health
Recreation
Kids and Teens
Computers
Sports
Reference
Home
News
0 0.1 0.2 0.3 0.4 0.5
content-linkcontent-semanticlink-semantic
IEEE Internet Computing 2005
![Page 18: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/18.jpg)
| Retrieved & Relevant || Retrieved |
| Retrieved & Relevant || Relevant |
€
P(sc,sl ) =
σ s(p,q){p,q:σ c = sc ,σ l = sl }
∑
{p,q :σ c = sc,σ l = sl}
R(sc,sl ) =
σ s(p,q){p,q:σ c = sc ,σ l = sl }
∑
σ s(p,q){p,q}∑
Averagingsemantic similarity
Summingsemantic similarity
Precision =
Recall =
![Page 19: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/19.jpg)
Business
σc
σllog Recall Precision
![Page 20: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/20.jpg)
Science
σc
σllog Recall Precision
![Page 21: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/21.jpg)
Adult
σc
σllog Recall Precision
![Page 22: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/22.jpg)
Computers
σc
σllog Recall Precision
![Page 23: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/23.jpg)
Reference
σc
σllog Recall Precision
![Page 24: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/24.jpg)
Home
σc
σllog Recall Precision
![Page 25: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/25.jpg)
News
σc
σllog Recall Precision
![Page 26: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/26.jpg)
All pairs
σc
σllog Recall Precision
![Page 27: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/27.jpg)
Discussion: So what?
![Page 28: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/28.jpg)
So what?
Recall
Precision
P (f, !)=
!
p,q:f("c(p,q),"l(p,q))!!
"s(p, q)
|p, q : f("c(p, q), "l(p, q)) ! !|
R(f, !)=
!
p,q:f("c(p,q),"l(p,q))!!
"s(p, q)
!
p,q"s(p, q)
Approximate relevance by semantic similarity
![Page 29: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/29.jpg)
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
00.2
0.40.6
0.81
!c 00.2
0.40.6
0.81
!l
0
0.2
0.4
0.6
0.8
1
(!c+!l)/2
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
00.2
0.40.6
0.81
!c 00.2
0.40.6
0.81
!l
0
0.2
0.4
0.6
0.8
1
!l if !c > 0.5
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
00.2
0.40.6
0.81
!c 00.2
0.40.6
0.81
!l
0
0.2
0.4
0.6
0.8
1
max(!c,!l)
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
00.2
0.40.6
0.81
!c 00.2
0.40.6
0.81
!l
0
0.2
0.4
0.6
0.8
1
!c !l
Rank by combining content and link similarity
![Page 30: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/30.jpg)
All pairs!l · H(!c ! 0.5)
2
3!l +
1
3!c
!c · !l
max(σc, σl)
!l
!c
Proc. WWW 2004
![Page 31: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/31.jpg)
News
!l
!c
(!l + !c)/2
!c · !l
![Page 32: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/32.jpg)
What about the “hole”?
σc
σl
Precision
![Page 33: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/33.jpg)
t2
t1
t4t3
t5 t6
t7 t8TSR
Edge Type
![Page 34: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/34.jpg)
Better semantic similarity measureWork w/Ana Maguitman & al.
Include cross-links (symbolic) and see-also links (related)
Transitive closure of topic graph
Compute entropy based on fuzzy membership matrix
!s(t1, t2) = maxk
2 · min (Wk1, Wk2) · log Pr[tk]
log(Pr[t1|tk] · Pr[tk]) + log(Pr[t2|tk] · Pr[tk])
W = T+ ! G ! T+
![Page 35: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/35.jpg)
Differences
![Page 36: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/36.jpg)
0 0.2 0.4 0.6 0.8 1
σsT
0
5
10
15
20
rela
tive
diffe
renc
e %
0 0.1 0.2 0.3 0.40
0.1
0 0.2 0.4 0.6 0.8 1
0.2
0.4
0.6
0.8
1< σsG>
actual difference
![Page 37: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/37.jpg)
mean stderr
tree 5.7% 0.8%
graph 84.7% 1.8%
![Page 38: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/38.jpg)
Combining content & links
-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Spea
rman
rank
cor
rela
tion
σc threshold
σc σlσc H(σl)
σl0.2 σc + 0.8 σl0.8 σc + 0.2 σlσc
Proc. WWW 2005, WWWJ 2006
![Page 39: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/39.jpg)
Discussion: Is content really so bad? Why?
![Page 40: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/40.jpg)
OutlineMapping
Topical localityContent, link, and semantic topologies in the Web
ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling
MiningTopical Web crawlersAdaptive, intelligent crawling techniques
MinglingSocial Web search & recommendationDistributed collaborative peer search
![Page 41: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/41.jpg)
Preferential attachment
Pr(i) ! k(i)
“BA” model (Barabasi & Albert 1999, de Solla Price 1976)
At each step t add new page p Create m new links from p to i (i<t)
Rich-get-richer
?
Pr(i) =k(i)
mt=! Pr(k) " k#!
![Page 42: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/42.jpg)
Other growth models
Web copying
(Kleinberg, Kumar & al 1999, 2000)
same indegree distribution
no need to know degree
Pr(i) ! Pr(j) · Pr(j " i)
![Page 43: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/43.jpg)
Other growth modelsRandom mixture
(Pennock & al. 2002,Cooper & Frieze 2001, Dorogovtsev & al 2000)
winners don’t take all
general indegree distribution
fits non-power-law cases
Pr(i) ! ! ·1
t+ (1 " !) ·
k(i)
mt
![Page 44: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/44.jpg)
Networks, power laws and phase transitions
++
+
+
++ ++
+
+
+
+
+
+
+
+
+
++
+
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
ooo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
Raissa D’Souza, Microsoft Research
Joint with: Noam Berger, Christian Borgs, JenniferChayes, and Bobby Kleinberg.
Other growth modelsMixture with Euclidean distance(HOT: Fabrikant, Koutsoupias & Papadimitriou 2002)
tradeoff between centrality and geometric locality
fits power-law in certain critical trade-off regimes
i = arg min(!rit + gi)
![Page 45: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/45.jpg)
Discussion:What about content?
![Page 46: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/46.jpg)
Link probability vs lexical distance
€
r =1 σ c −1
Pr(λ | ρ) =(p,q) : r = ρ∧σ l > λ
(p,q) : r = ρ
Phasetransition
Power law tail
€
Pr(λ | ρ) ~ ρ−α(λ)
€
ρ*
Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002
![Page 47: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/47.jpg)
Local content-based growth model
γ=4.3±0.1
γ=4.26±0.07
• Similar to preferential attachment (BA)
• Use degree info (popularity/importance) only for nearby (similar/related) pages
€
Pr(pt → pi< t ) =k(i)mt
if r(pi, pt ) < ρ*
c[r(pi, pt )]−α otherwise
![Page 48: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/48.jpg)
So, many models can predict degree distributions...
Which is “right” ?
Need an independent observation (other than degree) to validate models
Distribution of content similarity across linked pairs
![Page 49: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/49.jpg)
None of these models is right!
![Page 50: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/50.jpg)
Back to the mixture model
Bias choice by content similarity instead of uniform distribution
Pr(i) ! ! ·1
t+ (1 " !) ·
k(i)
mtdegree-uniform mixture
ai2
i1
i3t
bi2
i1
i3t
ci2
i1
i3t
![Page 51: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/51.jpg)
Degree-similarity mixture model
Pr(i) ! ! · P̂r(i) + (1 " !) ·k(i)
mt
! = 0.2, " = 1.7
P̂r(i) ! [r(i, t)]"!
![Page 52: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/52.jpg)
Build it...
(M.M.)
![Page 53: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/53.jpg)
Both mixture models get the degree distribution right…
![Page 54: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/54.jpg)
…but the degree-similarity mixture model predicts the similarity distribution better
Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
![Page 55: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/55.jpg)
0
4
DMOZ
0
0.25
0.5
0.75
1
!l
0
2
PNAS
0 0.25 0.5 0.75 1
!c
0
0.25
0.5
0.75
1
!l
Citation networks
15,785 articles
published in PNAS between 1997 and
2002
![Page 56: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/56.jpg)
Citation networks
![Page 57: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/57.jpg)
Discussion
What questions remain open?
What other factors should we consider?
![Page 58: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/58.jpg)
Open QuestionsUnderstand distribution of content similarity across all pairs of pages
Using TF: exponential
Using TF-IDF: power law
Growth model to explain co-evolution of both link topology and content similarity
The role of search engines (keynote)
Pr(!) ! ""#!
Pr(!) ! "!"#
![Page 59: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/59.jpg)
Efficient crawling algorithms?
Practice: can’t find them!
• Greedy algorithms based on location in geographical small world networks: ~ poly(N) (Kleinberg 2000)
• Greedy algorithms based on degree in power law networks: ~ N (Adamic, Huberman & al. 2001)
Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:
~ log N (Barabasi & Albert 1999)~ log N / log log N (Bollobas 2001)
![Page 60: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/60.jpg)
Exception # 1
• Geographical networks (Kleinberg 2000)– Local links to all lattice neighbors– Long-range link probability distribution:
power law Pr ~ r–α
• r: lattice (Manhattan) distance• α: constant clustering exponent
€
t ~ log2 N⇔α = D
![Page 61: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/61.jpg)
Is the Web a geographical network?
local links
long range links (power law tail)
Replace lattice distance by lexical distancer = (1 / σc) – 1
![Page 62: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/62.jpg)
Exception # 2
• Hierarchical networks (Kleinberg 2002, Watts & al. 2002)– Nodes are classified at the leaves of tree– Link probability distribution: exponential tail
Pr ~ e–h
• h: tree distance (height of lowest common ancestor)
h=1h=2
€
t ~ logε N,ε ≥1
![Page 63: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/63.jpg)
exponential tail
Is the Web a hierarchical network? Replace tree distance by semantic distance
h = 1 – σs
top
lca
c1
c2
![Page 64: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/64.jpg)
Discussion: Does this stuff really have any
applications?
![Page 65: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/65.jpg)
OutlineMapping
Topical localityContent, link, and semantic topologies in the Web
ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling
MiningTopical Web crawlersAdaptive, intelligent crawling techniques
MinglingSocial Web search & recommendationDistributed collaborative peer search
![Page 66: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/66.jpg)
Crawler applications• Universal Crawlers
– Search engines!
• Topical crawlers– Live search
(e.g., myspiders.informatics.indiana.edu)
– Topical search engines & portals
– Business intelligence (find competitors/partners)
– Distributed, collaborative search
![Page 67: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/67.jpg)
Topical crawlers
• Seminal work– Cho & Garcia-Molina @ Stanford– Chakrabarti & al @ IBM Almaden / IIT– Amento & al @ ATT Shannon– Ben-Shaul & al @ IBM Haifa– many others…
![Page 68: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/68.jpg)
spears
[sic]Topical crawlers
![Page 69: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/69.jpg)
Evaluating topical crawlers
• Goal: build “better” crawlers to support applications • Build an unbiased evaluation framework
– Define common tasks of measurable difficulty– Identify topics, relevant targets– Identify appropriate performance measures
• Effectiveness: quality of crawler pages, order, etc.• Efficiency: separate CPU & memory of crawler algorithms from
bandwidth & common utilities
Information Retrieval 2005
![Page 70: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/70.jpg)
Evaluating topical crawlers: Topics
• Automate evaluation using edited directories
• Different sources of relevance assessments
Keywords
Description
Targets
![Page 71: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/71.jpg)
Topics and Targets
topic level ~ specificitydepth ~ generality
![Page 72: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/72.jpg)
Evaluating topical crawlers: TasksStart from seeds, find targets
and/or pages similar to target descriptions
d=2
d=3
![Page 73: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/73.jpg)
Performance matrix
targetdepth
“recall” “precision”
targetpages
targetdescriptions
d=0d=1
d=2€
Sct ∩TdTd
€
σ c (p,Dd )p∈Sc
t
∑€
Sct ∩TdSct
€
σ c (p,Dd )p∈Sc
t
∑
Sct
![Page 74: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/74.jpg)
Discussion:What assumption (targets)?
R(Relevant)
S(Crawled)
T(Targets)
T ∩ S
R ∩ S
![Page 75: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/75.jpg)
Evaluating topical crawlers: Architecture
URL
Web
CommonData
Structures
Crawler 1Logic
MainKeywords
Seed URLs
Crawler NLogic
Private DataStructures
(limited resource)
Concurrent Fetch/Parse/Stem Modules
HTTP HTTP HTTP
![Page 76: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/76.jpg)
Examples of crawling algorithms
• Breadth-First– Visit links in order encountered
• Best-First– Priority queue sorted by similarity– Variants:
– explore top N at a time– tag tree context– hub scores
• SharkSearch– Priority queue sorted by combination of similarity,
anchor text, similarity of parent, etc.
• InfoSpiders
![Page 77: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/77.jpg)
InfoSpiders adaptive distributed algorithm using an evolving population of learning agents
keywordvector neural net
local frontier
offspring
![Page 78: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/78.jpg)
![Page 79: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/79.jpg)
![Page 80: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/80.jpg)
![Page 81: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/81.jpg)
![Page 82: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/82.jpg)
Foreach agent thread: Pick & follow link from local frontier Evaluate new links, merge frontier Adjust link estimator E := E + payoff - cost
If E < 0: Die Elsif E > Selection_Threshold: Clone offspring Split energy with offspring Split frontier with offspring Mutate offspring
Evolutionary Local Selection Algorithm (ELSA)
selectivequery
expansion
matchresource
bias
reinforcementlearning
![Page 83: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/83.jpg)
Action selection
![Page 84: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/84.jpg)
Q-learningCompare estimated relevance of visited document with estimated relevance of link followed from previous page
Teaching input: E(D) + µ maxl(D) λl
![Page 85: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/85.jpg)
Performance
ACM Trans. Internet Technology 2003
Pages crawled
Avg
tar
get
reca
ll
![Page 86: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/86.jpg)
Exploration vs. Exploitation
Pages crawled
Avg
targ
et r
ecal
l
![Page 87: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/87.jpg)
Efficiency & scalability
Link frontier size
Perf
orm
ance
/cos
t
![Page 88: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/88.jpg)
DOM contextLink score = linear combination between page-based and context-based similarity score
![Page 89: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/89.jpg)
Co-citation: hub scoresLink scorehub = linear combination between
link and hub score
![Page 90: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/90.jpg)
Recall (159 ODP topics)
Split ODP URLs between seeds and
targets
Add 10 best hubs to seeds for 94 topics
0
5
10
15
20
25
30
35
40
45
0 2000 4000 6000 8000 10000
aver
age
targ
et re
call@
N (%
)
N (pages crawled)
Breadth-FirstNaive Best-First
DOMHub-Seeker
0
5
10
15
20
25
30
35
40
45
0 2000 4000 6000 8000 10000
aver
age
targ
et re
call@
N (%
)
N (pages crawled)
Breadth-FirstNaive Best-First
DOMHub-Seeker
ECDL 2003
![Page 91: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/91.jpg)
Normative lessons about crawlers
• Remember links from previous pages• Exploration is important• Adaptation can help
– to focus search– to expand query– to learn link estimates
• Distributed algorithms – boost efficiency– but watch for premature exploitation
• Improve link evaluation by looking at – parent pages– DOM context
![Page 92: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/92.jpg)
Discussion: Which crawler algorithm would
you use in your app?
![Page 93: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/93.jpg)
OutlineMapping
Topical localityContent, link, and semantic topologies in the Web
ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling
MiningTopical Web crawlersAdaptive, intelligent crawling techniques
MinglingSocial Web search & recommendationDistributed collaborative peer search
![Page 94: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/94.jpg)
hierarchical
flat
centralized social
Structure
CollaborationWWW2006, AAAI2006
![Page 95: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/95.jpg)
news research
+ = σ★ ☆★★☆★ ☆☆
σA★ ☆☆★
σB
fun
![Page 96: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/96.jpg)
FA(x) FA(y)
+ = σ(x,y)
FB(a(x,y))
!(x, y) =1
N
N!
u=1
2 log
"|Fu[a(x,y)]|
|Ru|
#
log|Fu(x)|
|Ru| + log|Fu(y)||Ru|
![Page 97: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/97.jpg)
![Page 98: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/98.jpg)
10-16
10-14
10-12
10-10
10-8
10-6
10-4
10-1
100
101
102
103
104
Fra
ctio
n o
f U
RL
s
Degree
smin = 0
smin = 0.0025
smin = 0.0055
Pr(k) !! exp(-0.002*k)
Pr(k) !! k-2
degree
URLs
![Page 99: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/99.jpg)
![Page 100: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/100.jpg)
GoogleGaL
RelevantSet
UserStudy
![Page 101: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/101.jpg)
c(x) =1
|U |!
y!U
"
#$1 + minx!y
!
(u,v)!x!y
%1
!(u, v)"1
&'
()
"1
pt+1(x) = (1 ! !) + ! ·!
y"U
"(x, y) · pt(y)"
z"U "(y, z)
Centrality&
Prestige
![Page 102: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/102.jpg)
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Pre
cis
ion
Recall
GiveALink s.pGiveALink s
u0
p=0.9
u3
p=0.5
u2
p=1.3
u1
p=1.3
s=0.1
s*p=0.13
s=0.1
s*p=0.13
s=0.2
s*p=0.11
s=0.6
Ranking by Similarity * Prestige
![Page 103: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/103.jpg)
Ranking &
Recommendation by Novelty
!(x, y) =
!1 + min
z!U
"1
"(x, z)+
1
"(z, y)" 2
#$"1
"(x, y)
x
y
z
![Page 104: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/104.jpg)
Personalization
!(x, A) = maxy!A
!"(x, y) · log
"N
N(y)
#$
![Page 105: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/105.jpg)
spam
collaborative filtering, social semantic similarity, unlinked pages, multimedia content, trust
scalability,
density
![Page 106: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/106.jpg)
Donate!givealink.org
![Page 107: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/107.jpg)
Discussion: Critical mass?
![Page 108: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/108.jpg)
Peer distributed crawling andcollaborative searching
• Crawl when idle– Start from bookmarks– Past queries or selected
documents as topics
• Index locally– User relevance feedback
is natural– No centralized
coordination bottleneck• Unlike grub.org,
hyperbee.com
bookmarks
localstorage
WWW
indexcrawler
peer
![Page 109: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/109.jpg)
http://homer.informatics.indiana.edu/~nan/6S/
![Page 110: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/110.jpg)
http://homer.informatics.indiana.edu/~nan/6S/
![Page 111: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/111.jpg)
http://homer.informatics.indiana.edu/~nan/6S/
![Page 112: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/112.jpg)
http://homer.informatics.indiana.edu/~nan/6S/
![Page 113: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/113.jpg)
http://homer.informatics.indiana.edu/~nan/6S/
![Page 114: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/114.jpg)
6S: peer distributed crawling and collaborative searching
AB
query
query
hit
hit
Data mining & referral opportunities
query
query
query
C
Emerging communities
![Page 115: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/115.jpg)
Algorithm 1: Random Known
![Page 116: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/116.jpg)
Algorithm 2: Greedy
![Page 117: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/117.jpg)
Algorithm 3: Reinforcement
![Page 118: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/118.jpg)
Learning Peers
A focused profile Wf of other peers for terms in queries
An expanded profile We of other peers for terms not in queries, but that co-occur with query terms and with higher frequency
Query routing:
wi,p(t + 1) = (1 ! !) · wi,p(t) + ! ·!
Sp + 1
Sl + 1! 1
"
!(p, Q) =!
i!Q
"" · w
fi,p + (1 " ") · we
i,p
#
![Page 119: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/119.jpg)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 0.05 0.1 0.15 0.2 0.25
Prec
ision
Recall
6SCentralized Search Engine
Distributed vs CentralizedWTAS 2005
![Page 120: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/120.jpg)
P@10WWW 2005
![Page 121: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/121.jpg)
! " #! #" $! $"!$!
!
$!
%!
&!
'!
#!!
#$!
#%!
()*+,*-./*+./**+
01*+02*.34052*.678
39)-:*+.3;*<<,3,*5:=,0>*:*+
Pajek Pajek
Small-worldWWW 2004
![Page 122: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/122.jpg)
Semantic similarity
Arts/Movies/Filmmaking
Business/Arts_and_Entertainment/Fashion
Business/E-Commerce/Developers
Business/Telecommunications/Call_Centers
Computers/Programming/Graphics
Health/Conditions_and_Diseases/Cancer
Health/Mental_Health/Grief,_Loss_and_Bereavement
Health/Professions/Midwifery
Health/Reproductive_Health/Birth_Control
Home/Family/Pregnancy
Shopping/Clothing/Accessories
Shopping/Clothing/Footwear
Shopping/Clothing/Uniforms
Shopping/Sports/Cycling
Shopping/Visual_Arts/Artist_Created_Prints
Society/Issues/Abortion
Society/People/Women
Sports/Cycling/Racing
P2PIR 2006
![Page 123: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/123.jpg)
Discussion:
Would you use 6S? Why or why not?
![Page 124: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling](https://reader031.vdocuments.mx/reader031/viewer/2022011921/603aecc954b7934c0f52f397/html5/thumbnails/124.jpg)
Thank you!Questions?
informatics.indiana.edu/fil
Research supported by NSF CAREER Award IIS-0348940
homer.informatics.indiana.edu/~nan
www.dataandsearch.org