cs276a text information retrieval, mining, and …ale/bici/ir/slides/bertinoro10.pdfquery-doc...
TRANSCRIPT
Information Retrieval
Lecture 10
Recap
  Last lecture
    HITS algorithm
    using anchor text
    topic-specific pagerank
Today’s Topics
  Behavior-based ranking
  Crawling and corpus construction
  Algorithms for (near) duplicate detection
  Search engine / Web IR infrastructure
Behavior-based ranking
  For each query Q, keep track of which docs in the results are clicked on
  On subsequent requests for Q, re-order docs in results based on click-throughs
  First due to DirectHit → AskJeeves
  Relevance assessment based on
    Behavior/usage
    vs. content
Query-doc popularity matrix B
  [Figure: matrix B with one row per query q and one column per doc j]
  B_qj = number of times doc j clicked-through on query q
  When query q issued again, order docs by B_qj values.
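A minimal sketch of the popularity matrix, assuming B is held as a sparse nested dictionary and that ties keep their original (text-ranked) order. The names `record_click` and `reorder` and the sample query are illustrative, not from the slides.

```python
from collections import defaultdict

# B[q][j] = number of times doc j was clicked through on query q.
B = defaultdict(lambda: defaultdict(int))

def record_click(query, doc):
    """Increment the click-through count for (query, doc)."""
    B[query][doc] += 1

def reorder(query, results):
    """Order results by descending B[query][doc]; ties keep their original order."""
    counts = B.get(query, {})
    return sorted(results, key=lambda doc: -counts.get(doc, 0))

record_click("jaguar", "d3")
record_click("jaguar", "d3")
record_click("jaguar", "d1")
print(reorder("jaguar", ["d1", "d2", "d3"]))  # ['d3', 'd1', 'd2']
```

A query that has never been seen has no row in B, so its results come back in their original order.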
Issues to consider
  Weighing/combining text- and click-based scores.
  What identifies a query?
    Ferrari Mondial
    Ferrari Mondial
    Ferrari mondial
    ferrari mondial
    “Ferrari Mondial”
  Can use heuristics, but search parsing slowed.
Vector space implementation
  Maintain a term-doc popularity matrix C
    as opposed to query-doc popularity
    initialized to all zeros
  Each column represents a doc j
    If doc j clicked on for query q, update C_j ← C_j + ε q (here q is viewed as a vector).
  On a query q’, compute its cosine proximity to C_j for all j.
  Combine this with the regular text score.
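The update and scoring steps above can be sketched as follows. This is only an illustration under stated assumptions: queries become raw term-count vectors over whitespace tokens, columns of C are stored sparsely, and `EPSILON = 0.1` is an arbitrary value, since the slide gives none for ε.

```python
import math
from collections import defaultdict

EPSILON = 0.1  # assumed step size; the slide does not fix a value for epsilon

# C[j] is the popularity column of doc j, stored sparsely as term -> weight.
C = defaultdict(lambda: defaultdict(float))

def query_vector(query):
    """View a query as a term-count vector (assumption: lowercase whitespace tokens)."""
    vec = defaultdict(float)
    for term in query.lower().split():
        vec[term] += 1.0
    return vec

def record_click(query, doc):
    """C_j <- C_j + epsilon * q for the clicked doc j."""
    for term, weight in query_vector(query).items():
        C[doc][term] += EPSILON * weight

def cosine(u, v):
    """Cosine of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def click_scores(query):
    """Cosine proximity of a new query q' to every column C_j."""
    q = query_vector(query)
    return {doc: cosine(q, column) for doc, column in C.items()}

record_click("white house", "d1")
record_click("house plans", "d2")
print(click_scores("white house tours"))
```

Because the score is built from terms rather than whole queries, "white house tours" gets click evidence from earlier "white house" traffic even though that exact query was never issued. Combining this score with the text score is left open here, as it is on the slide.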
Issues
  Normalization of C_j after updating
  Assumption of query compositionality
    “white house” document popularity derived from “white” and “house”
  Updating - live or batch?
Basic Assumption
  Relevance can be directly measured by number of click throughs
  Valid?
Validity of Basic Assumption
  Click through to docs that turn out to be non-relevant: what does a click mean?
  Self-perpetuating ranking
  Spam
  All votes count the same
Variants
  Time spent viewing page
    Difficult session management
    Inconclusive modeling so far
  Does user back out of page?
  Does user stop searching?
  Does user transact?
Crawling and Corpus Construction
  Crawl order
  Filtering duplicates
  Mirror detection
Crawling Issues
  How to crawl?
    Quality: “Best” pages first
    Efficiency: Avoid duplication (or near duplication)
    Etiquette: Robots.txt, Server load concerns
  How much to crawl? How much to index?
    Coverage: How big is the Web? How much do we cover?
    Relative Coverage: How much do competitors have?
  How often to crawl?
    Freshness: How much has changed?
    How much has really changed? (why is this a different question?)
Crawl Order
  Best pages first
    Potential quality measures:
      Final Indegree
      Final Pagerank
    Crawl heuristic:
      BFS
      Partial Indegree
      Partial Pagerank
      Random walk
[Figure: Perc. overlap with best x% by indegree, plotted against x% crawled by O(u)]
  Stanford Web Base (179K, 1998) [Cho98]
[Figure: Perc. overlap with best x% by pagerank, plotted against x% crawled by O(u)]
  Web Wide Crawl (328M pages, 2000) [Najo01]
  BFS crawling brings in high quality pages early in the crawl
BFS & Spam (Worst case scenario)
  [Figure: Start Page]
  BFS depth = 2
    Normal avg outdegree = 10
    100 URLs on the queue including a spam page.
  Assume the spammer is able to generate dynamic pages with 1000 outlinks
  BFS depth = 3
    2000 URLs on the queue
    50% belong to the spammer
  BFS depth = 4
    1.01 million URLs on the queue
    99% belong to the spammer
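The queue sizes on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes the start page is at depth 0, every normal page has exactly 10 outlinks to normal pages, exactly one URL at depth 2 is a spam page, and every spam page has 1000 outlinks that all lead to further spam pages. The function name `frontier` is illustrative.

```python
def frontier(depth, normal_out=10, spam_out=1000, spam_enters_at=2):
    """Return (normal_urls, spam_urls) on the BFS queue at the given depth."""
    normal, spam = 1, 0  # depth 0: just the start page
    for level in range(1, depth + 1):
        normal, spam = normal * normal_out, spam * spam_out
        if level == spam_enters_at:
            # one of the URLs reached at this depth is a spam page
            normal, spam = normal - 1, spam + 1
    return normal, spam

for d in (2, 3, 4):
    n, s = frontier(d)
    print(d, n + s, round(100 * s / (n + s), 1))
# 2 100 1.0
# 3 1990 50.3
# 4 1009900 99.0
```

The exact figures (1990 URLs with 50.3% spam, then 1,009,900 URLs with 99.0% spam) round to the slide's 2000 / 50% and 1.01 million / 99%.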
Adversarial IR (Spam)
  Motives
    Commercial, political, religious, lobbies
    Promotion funded by advertising budget
  Operators
    Contractors (Search Engine Optimizers) for lobbies, companies
    Web masters
    Hosting services
  Forum
    Web master world (www.webmasterworld.com)
      Search engine specific tricks
      Discussions about academic papers ☺
A few spam technologies
  Cloaking
    Serve fake content to search engine robot
    DNS cloaking: Switch IP address. Impersonate
  Doorway pages
    Pages optimized for a single keyword that re-direct to the real target page
  Keyword Spam
    Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
    Hidden text with colors, CSS tricks, etc.
  Link spamming
    Mutual admiration societies, hidden links, awards
    Domain flooding: numerous domains that point or re-direct to a target page
  Robots
    Fake click stream
    Fake query stream
    Millions of submissions via Add-Url
[Figure: Cloaking. Decision box “Is this a Search Engine spider?” with branches Y and N leading to SPAM and Real Doc]
Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
Can you trust words on the page?
  [Figure: screenshots of auctions.hitsoffice.com/ (Pornographic Content) and www.ebay.com/]
  Examples from July 2002
Search Engine Optimization I
  Adversarial IR (“search engine wars”)
Search Engine Optimization II
  Tutorial on Cloaking & Stealth Technology
The war against spam
  Quality signals - Prefer authoritative pages based on:
    Votes from authors (linkage signals)
    Votes from users (usage signals)
  Policing of URL submissions
    Anti robot test
  Limits on meta-keywords
  Robust link analysis
    Ignore statistically implausible linkage (or text)
    Use link analysis to detect spammers (guilt by association)
The war against spam
  Spam recognition by machine learning
    Training set based on known spam
  Family friendly filters
    Linguistic analysis, general classification techniques, etc.
    For images: flesh tone detectors, source text analysis, etc.
  Editorial intervention
    Blacklists
    Top queries audited
    Complaints addressed
Duplicate/Near-Duplicate Detection
  Duplication: Exact match with fingerprints
  Near-Duplication: Approximate match
    Overview
      Compute syntactic similarity with an edit-distance measure
      Use similarity threshold to detect near-duplicates
        E.g., Similarity > 80% => Documents are “near duplicates”
        Not transitive though sometimes used transitively
Computing Near Similarity
  Features:
    Segments of a document (natural or artificial breakpoints) [Brin95]
    Shingles (Word N-Grams) [Brin95, Brod98]
      “a rose is a rose is a rose” => a_rose_is_a, rose_is_a_rose, is_a_rose_is
  Similarity Measure
    TFIDF [Shiv95]
    Set intersection [Brod98]
      (Specifically, Size_of_Intersection / Size_of_Union)
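As a small illustration of shingling and the set-intersection measure, the sketch below uses word 4-grams (the shingle length implied by the "a rose is a rose" example) and whitespace tokenization, both of which are assumptions. The second sentence in the usage is made up for the comparison.

```python
def shingles(text, n=4):
    """Set of word n-grams of text, with words joined by underscores."""
    words = text.split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(a, b):
    """Size_of_Intersection / Size_of_Union of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose is a flower")
print(sorted(s1))           # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
print(resemblance(s1, s2))  # 0.75
```

The eight-word sentence has five 4-gram positions but only three distinct shingles, which is why the slide lists three.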
Shingles + Set Intersection
  Computing exact set intersection of shingles between all pairs of documents is expensive and infeasible
  Approximate using a cleverly chosen subset of shingles from each (a sketch)
Shingles + Set Intersection
  Estimate size_of_intersection / size_of_union based on a short sketch ([Brod97, Brod98])
    Create a “sketch vector” (e.g., of size 200) for each document
    Documents which share more than t (say 80%) corresponding vector elements are similar
    For doc D, sketch[i] is computed as follows:
      Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
      Let π_i be a specific random permutation on 0..2^m
      Pick sketch[i] := MIN π_i(f(s)) over all shingles s in D
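The sketch construction can be illustrated as below. One substitution is made and should be read as an assumption: instead of a fingerprint f followed by a true random permutation π_i, each position i uses a 64-bit BLAKE2b hash salted with i, which behaves like a different random ordering of the shingles for each i but is not literally a permutation of 0..2^m. Shingle sets are assumed non-empty.

```python
import hashlib

def salted_hash(shingle, i):
    """64-bit hash of a shingle under seed i; stands in for pi_i(f(s))."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def sketch(shingle_set, size=200):
    """sketch[i] = MIN over all shingles s in the document of the i-th salted hash."""
    return [min(salted_hash(s, i) for s in shingle_set) for i in range(size)]

def estimated_resemblance(sketch1, sketch2):
    """Fraction of corresponding sketch elements that are equal."""
    matches = sum(1 for a, b in zip(sketch1, sketch2) if a == b)
    return matches / len(sketch1)

doc1 = {f"shingle{k}" for k in range(0, 30)}
doc2 = {f"shingle{k}" for k in range(10, 40)}  # true resemblance 20/40 = 0.5
print(estimated_resemblance(sketch(doc1), sketch(doc2)))
```

With 200 positions the estimate should land near the true value of 0.5. Comparing it against the threshold t (say 0.8) then decides whether the two documents are treated as near duplicates.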
Computing Sketch[i] for Doc1
  [Figure: shingles of Document 1 placed on number lines running up to 2^64]
  Start with 64 bit shingles
  Permute on the number line with π_i
  Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
  [Figure: number lines up to 2^64 for Document 1 and Document 2, with min values marked A and B]
  Are these equal?
  Test for 200 random permutations: π_1, π_2, … π_200
However…
  [Figure: number lines up to 2^64 for Document 1 and Document 2, with min values marked A and B]
  A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
  This happens with probability: Size_of_intersection / Size_of_union
Question
  Document D1 = D2 iff size_of_intersection = size_of_union ?
Mirror Detection
  Mirroring is systematic replication of web pages across hosts.
    Single largest cause of duplication on the web
  Host1/α and Host2/β are mirrors iff
    For all (or most) paths p such that when http://Host1/α/p exists
    http://Host2/β/p exists as well
    with identical (or near identical) content, and vice versa.
Mirror Detection example
  http://www.elsevier.com/ and http://www.elsevier.nl/
  Structural Classification of Proteins
    http://scop.mrc-lmb.cam.ac.uk/scop
    http://scop.berkeley.edu/
    http://scop.wehi.edu.au/scop
    http://pdb.weizmann.ac.il/scop
    http://scop.protres.ru/
Repackaged Mirrors
  [Figure: screenshots of Auctions.lycos.com and Auctions.msn.com]
  Aug
Motivation
  Why detect mirrors?
    Smart crawling
      Fetch from the fastest or freshest server
      Avoid duplication
    Better connectivity analysis
      Combine inlinks
      Avoid double counting outlinks
    Redundancy in result listings
      “If that fails you can try: <mirror>/samepath”
    Proxy caching
Bottom Up Mirror Detection [Cho00]
  Maintain clusters of subgraphs
  Initialize clusters of trivial subgraphs
    Group near-duplicate single documents into a cluster
  Subsequent passes
    Merge clusters of the same cardinality and corresponding linkage
    Avoid decreasing cluster cardinality
  To detect mirrors we need:
    Adequate path overlap
    Contents of corresponding pages within a small time range
Can we use URLs to find mirrors?
  [Figure: two trees with nodes a, b, c, d, one under www.synthesis.org and one under synthesis.stanford.edu]
  [Figure: URL listing for www.synthesis.org]
    www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
    www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
    www.synthesis.org/Docs/annual.report96.final.html
    www.synthesis.org/Docs/cicee-berlin-paper.html
    www.synthesis.org/Docs/myr5
    www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
    www.synthesis.org/Docs/myr5/cs/cs-meta.html
    www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
    www.synthesis.org/Docs/myr5/mech/mech-take-home.html
    www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
    www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
    www.synthesis.org/Docs/yr5ar
    www.synthesis.org/Docs/yr5ar/assess
    www.synthesis.org/Docs/yr5ar/cicee
    www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
    www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
  [Figure: URL listing for synthesis.stanford.edu; several entries are clipped in the source]
    synthesis…
    synthesis.stanfo…
    synthesis.stanford.edu/Docs/
    synthes…
    synthes…
    synthes…
    synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
    ….stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
    synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
    synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
    synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
    synthesis.stanford.edu/Docs/annual.report96.final.html
    synthesis.stanford.edu/Docs/annual.report96.final_fn.html
    …rd.edu/Docs/myr5/assessment
    …myr5/assessment/assessment-…
    …is.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
    …is.stanford.edu/Docs/myr5/assessment/neato-ucb.html
    synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
    synthesis.stanford.edu/Docs/myr5/cicee
    synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
    synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
    …is.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
Top Down Mirror Detection [Bhar99, Bhar00c]
  E.g.,
    www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
    synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
  What features could indicate mirroring?
    Hostname similarity:
      word unigrams and bigrams: { www, www.synthesis, synthesis, … }
    Directory similarity:
      Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
    IP address similarity:
      3 or 4 octet overlap
      Many hosts sharing an IP address => virtual hosting by an ISP
    Host outlink overlap
    Path overlap
      Potentially, path + sketch overlap
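Two of the URL features listed above can be sketched as follows. The tokenization is an assumption (hostnames split on ".", paths split on "/", with the file name counted as the last path component), and the function names are illustrative.

```python
def hostname_features(host):
    """Word unigrams and bigrams of a hostname."""
    words = host.lower().split(".")
    bigrams = {".".join(words[i:i + 2]) for i in range(len(words) - 1)}
    return set(words) | bigrams

def positional_path_bigrams(path):
    """Bigrams of adjacent path components, each tagged with its position."""
    parts = [p for p in path.split("/") if p]
    return {f"{i}:{parts[i]}/{parts[i + 1]}" for i in range(len(parts) - 1)}

h1 = hostname_features("www.synthesis.org")
h2 = hostname_features("synthesis.stanford.edu")
p1 = positional_path_bigrams("Docs/ProjAbs/synsys/synalysis.html")
p2 = positional_path_bigrams("Docs/ProjAbs/synsys/quant-dev-new-teach.html")

print(sorted(h1 & h2))  # ['synthesis']
print(sorted(p1 & p2))  # ['0:Docs/ProjAbs', '1:ProjAbs/synsys']
```

For the two example URLs, the hosts share only the unigram "synthesis", while the paths share both leading directory bigrams. Hosts that share features like these would be candidates for the more expensive content comparison.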
Implementation
  Phase I - Candidate Pair Detection
    Find features that pairs of hosts have in common
    Compute a list of host pairs which might be mirrors
  Phase II - Host Pair Validation
    Test each host pair and determine extent of mirroring
      Check if 20 paths sampled from Host1 have near-duplicates on Host2 and vice versa
      Use transitive inferences:
        IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
        IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
  Evaluation
    140 million URLs on 230,000 hosts (1999)
    Best approach combined 5 sets of features
      Top 100,000 host pairs had precision = 0.57 and recall = 0.86
Web IR Infrastructure
  Connectivity Server
    Fast access to links to support link analysis
  Term Vector Database
    Fast access to document vectors to augment link analysis
Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
  Fast web graph access to support connectivity analysis
  Stores mappings in memory from
    URL to outlinks, URL to inlinks
  Applications
    HITS, Pagerank computations
    Crawl simulation
    Graph algorithms: web connectivity, diameter etc.
      more on this later
    Visualizations
Usage
  Input: Graph algorithm + URLs + Values
    URLs to FPs to IDs
  Execution: Graph algorithm runs in memory
    IDs to URLs
  Output: URLs + Values
ID assignment
  Translation Tables on Disk
    URL text: 9 bytes/URL (compressed from ~80 bytes)
    FP(64b) -> ID(32b): 5 bytes
    ID(32b) -> FP(64b): 8 bytes
    ID(32b) -> URLs: 0.5 bytes
  E.g., HIGH IDs: Max(indegree, outdegree) > 254
    ID          URL
    …
    9891        www.amazon.com/
    9912        www.amazon.com/jobs/
    …
    9821878     www.geocities.com/
    …
    40930030    www.google.com/
    …
    85903590    www.yahoo.com/
  Partition URLs into 3 sets, sorted lexicographically
    High: Max degree > 254
    Medium: 254 > Max degree > 24
    Low: remaining (75%)
  IDs assigned in sequence (densely)
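A small sketch of the partition-then-number scheme described above. Several details are assumptions rather than facts from the slides: the high set is numbered first, then medium, then low; a max degree of exactly 254 is placed in the medium set; and the degrees in the usage example are invented.

```python
def assign_ids(degrees):
    """degrees maps url -> (indegree, outdegree); returns url -> dense integer ID."""
    def bucket(url):
        max_degree = max(degrees[url])
        if max_degree > 254:
            return 0  # high
        if max_degree > 24:
            return 1  # medium
        return 2      # low
    # Sort by set first, then lexicographically by URL within each set.
    ordered = sorted(degrees, key=lambda url: (bucket(url), url))
    return {url: new_id for new_id, url in enumerate(ordered)}

# Hypothetical degrees, for illustration only.
degrees = {
    "www.yahoo.com/": (300, 40),
    "www.amazon.com/": (10, 500),
    "www.example.org/a": (30, 2),
    "www.example.org/b": (1, 3),
}
print(assign_ids(degrees))
```

Sorting lexicographically within a set keeps URLs from the same host at neighbouring IDs, which is what later makes the deltas in an adjacency list small.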
Adjacency lists
  In memory tables for Outlinks, Inlinks
  List index maps from a Source ID to start of adjacency list
Adjacency List Compression - I
  [Figure: List Index entries 104, 105, 106 pointing into the Sequence of Adjacency Lists: … 98 132 153 98 147 153 …]
  [Figure: List Index entries 104, 105, 106 pointing into the Delta Encoded Adjacency Lists: … -6 34 21 -8 49 6 …]
  • Adjacency List:
    - Smaller delta values are exponentially more frequent (80% to same host)
    - Compress deltas with variable length encoding (e.g., Huffman)
  • List Index pointers: 32b for high, Base+16b for med, Base+8b for low
    - Avg = 12b per pointer
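The delta encoding in the figure can be sketched as below. The numbers are consistent with taking the first delta relative to the source ID and each later delta relative to the previous target (98 − 104 = −6, 132 − 98 = 34, 153 − 132 = 21). Assigning the second list to source ID 106 is an assumption that makes its −8 work out. The variable-length coding step is left out.

```python
def delta_encode(source_id, targets):
    """First delta is relative to the source ID, the rest to the previous target."""
    deltas, previous = [], source_id
    for target in sorted(targets):
        deltas.append(target - previous)
        previous = target
    return deltas

def delta_decode(source_id, deltas):
    """Invert delta_encode by accumulating the deltas from the source ID."""
    targets, previous = [], source_id
    for delta in deltas:
        previous += delta
        targets.append(previous)
    return targets

print(delta_encode(104, [98, 132, 153]))  # [-6, 34, 21]
print(delta_encode(106, [98, 147, 153]))  # [-8, 49, 6]
print(delta_decode(104, [-6, 34, 21]))    # [98, 132, 153]
```

Because most links stay on the same host and IDs follow lexicographic URL order, most deltas are small, which is what makes a variable-length code such as Huffman pay off afterwards.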
Adjacency List Compression - II
  Inter List Compression
    Basis: Similar URLs may share links
      Close in ID space => adjacency lists may overlap
    Approach
      Define a representative adjacency list for a block of IDs
        Adjacency list of a reference ID
        Union of adjacency lists in the block
      Represent adjacency list in terms of deletions and additions when it is cheaper to do so
  Measurements
    Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
    Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
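The deletions-and-additions idea can be sketched as follows. "Cheaper" is approximated here by counting entries, which is an assumption: a real implementation would compare encoded bit sizes. The function names and tuple layout are illustrative.

```python
def encode_against_reference(adjacency, reference):
    """Encode a list as (deletions, additions) relative to a representative list,
    or store it plainly when the diff would not have fewer entries."""
    adj_set, ref_set = set(adjacency), set(reference)
    deletions = sorted(ref_set - adj_set)
    additions = sorted(adj_set - ref_set)
    if len(deletions) + len(additions) < len(adj_set):
        return ("diff", deletions, additions)
    return ("plain", sorted(adj_set))

def decode_against_reference(encoded, reference):
    """Rebuild the sorted adjacency list from either encoding."""
    if encoded[0] == "plain":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))

reference = [98, 132, 153]
print(encode_against_reference([98, 147, 153], reference))  # ('diff', [132], [147])
print(encode_against_reference([1, 2], reference))          # ('plain', [1, 2])
```

A list that overlaps heavily with the representative list costs only its differences, while an unrelated list falls back to plain storage.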
Term Vector Database [Stat00]
  Fast access to 50 word term vectors for web pages
    Term Selection:
      Restricted to middle 1/3rd of lexicon by document frequency
      Top 50 words in document by TF.IDF.
    Term Weighting:
      Deferred till run-time (can be based on term freq, doc freq, doc length)
  Applications
    Content + Connectivity analysis (e.g., Topic Distillation)
    Topic specific crawls
    Document classification
  Performance
    Storage: 33GB for 272M term vectors
    Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
Architecture
  [Figure: 128 Byte TV Record with fields URL Info, Terms (LC:TID LC:TID … LC:TID) and Freq (FRQ:RL FRQ:RL … FRQ:RL)]
  [Figure: URLid to Term Vector Lookup, labelled URLid * 64 / 480, offset, Base (4 bytes), Bit vector For 480 URLids]