Information Retrieval Lecture 10

Post on 25-Mar-2020

TRANSCRIPT

Page 1: CS276A Text Information Retrieval, Mining, and …ale/BICI/IR/Slides/bertinoro10.pdfQuery-doc popularity matrix B Docs Queries q j B qj = number of times doc j clicked-through on query

Information Retrieval Lecture 10

Page 2: Recap

Last lecture
  HITS algorithm
  using anchor text
  topic-specific pagerank

Page 3: Today’s Topics

Behavior-based ranking
Crawling and corpus construction
Algorithms for (near) duplicate detection
Search engine / WebIR infrastructure

Page 4: Behavior-based ranking

For each query Q, keep track of which docs in the results are clicked on
On subsequent requests for Q, re-order docs in results based on click-throughs
First due to DirectHit → AskJeeves
Relevance assessment based on
  Behavior/usage
  vs. content

Page 5: Query-doc popularity matrix B

[Figure: matrix B with rows indexed by Queries (q) and columns indexed by Docs (j)]

B_qj = number of times doc j clicked-through on query q
When query q issued again, order docs by B_qj values.
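
The re-ordering rule on this slide can be sketched in a few lines. This is only an illustration, assuming B is held as a nested dictionary; the names `record_click` and `rerank` and the doc ids are hypothetical, not from the lecture.

```python
def record_click(B, query, doc):
    """Increment B[query][doc], the click-through count for doc on query."""
    row = B.setdefault(query, {})
    row[doc] = row.get(doc, 0) + 1


def rerank(B, query, docs):
    """Order docs by descending B[query][doc]; ties keep their incoming order."""
    row = B.get(query, {})
    return sorted(docs, key=lambda d: row.get(d, 0), reverse=True)


if __name__ == "__main__":
    B = {}
    for doc in ["d3", "d1", "d3"]:
        record_click(B, "ferrari mondial", doc)
    print(rerank(B, "ferrari mondial", ["d1", "d2", "d3"]))
```

Because Python's sort is stable, documents with equal click counts stay in the order the text-based ranker produced, which is one simple way to fall back on the content score.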

Page 6: Issues to consider

Weighing/combining text- and click-based scores.
What identifies a query?
  Ferrari Mondial
  Ferrari Mondial
  Ferrari mondial
  ferrari mondial
  “Ferrari Mondial”
Can use heuristics, but search parsing slowed.
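
One possible heuristic for "what identifies a query" is to map surface variants to a single key before counting clicks. The sketch below is an assumption, not the lecture's method: the function name `canonical_query` is hypothetical, and treating a quoted phrase as the same query as its unquoted form is a judgment call that a real engine might not make.

```python
def canonical_query(query):
    """Fold case, drop double quotes, and collapse runs of whitespace."""
    for quote in ('"', "\u201c", "\u201d"):
        query = query.replace(quote, "")
    return " ".join(query.lower().split())


if __name__ == "__main__":
    variants = ["Ferrari Mondial", "Ferrari   Mondial", "Ferrari mondial",
                "ferrari mondial", "\u201cFerrari Mondial\u201d"]
    print({canonical_query(v) for v in variants})
```

Every extra normalization step runs on each incoming query, which is the parsing cost the slide warns about.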

Page 7: Vector space implementation

Maintain a term-doc popularity matrix C
  as opposed to query-doc popularity
  initialized to all zeros
Each column represents a doc j
  If doc j clicked on for query q, update C_j ← C_j + εq (here q is viewed as a vector).
On a query q’, compute its cosine proximity to C_j for all j.
Combine this with the regular text score.
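
A minimal sketch of the update C_j ← C_j + εq and the cosine lookup, assuming sparse dictionary vectors and raw term counts for the query vector. The value of ε, the whitespace tokenization, and all function names are assumptions made for illustration.

```python
import math

EPSILON = 0.1  # assumed step size; the slide only names it ε


def query_vector(query):
    """View a query as a sparse term-count vector (whitespace tokens)."""
    vec = {}
    for term in query.lower().split():
        vec[term] = vec.get(term, 0.0) + 1.0
    return vec


def update(C, doc, query, eps=EPSILON):
    """Apply C_j <- C_j + eps * q for the clicked doc j."""
    column = C.setdefault(doc, {})
    for term, weight in query_vector(query).items():
        column[term] = column.get(term, 0.0) + eps * weight


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def click_scores(C, query):
    """Cosine proximity of the new query q' to every column C_j."""
    q = query_vector(query)
    return {doc: cosine(q, column) for doc, column in C.items()}
```

Because cosine is scale-invariant, this particular score does not change if C_j is rescaled after each update; normalization matters once the click score is combined with other signals.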

Page 8: Issues

Normalization of C_j after updating
Assumption of query compositionality
  “white house” document popularity derived from “white” and “house”
Updating - live or batch?

Page 9: Basic Assumption

Relevance can be directly measured by number of click throughs
Valid?

Page 10: Validity of Basic Assumption

Click through to docs that turn out to be non-relevant: what does a click mean?
Self-perpetuating ranking
Spam
All votes count the same

Page 11: Variants

Time spent viewing page
  Difficult session management
  Inconclusive modeling so far
Does user back out of page?
Does user stop searching?
Does user transact?

Page 12: Crawling and Corpus Construction

Crawl order
Filtering duplicates
Mirror detection

Page 13: Crawling Issues

How to crawl?
  Quality: “Best” pages first
  Efficiency: Avoid duplication (or near duplication)
  Etiquette: Robots.txt, Server load concerns
How much to crawl? How much to index?
  Coverage: How big is the Web? How much do we cover?
  Relative Coverage: How much do competitors have?
How often to crawl?
  Freshness: How much has changed?
  How much has really changed? (why is this a different question?)

Page 14: Crawl Order

Best pages first
  Potential quality measures:
    Final Indegree
    Final Pagerank
  Crawl heuristic:
    BFS
    Partial Indegree
    Partial Pagerank
    Random walk

Page 15: Stanford WebBase (179K, 1998) [Cho98]

[Figure: two plots. Y-axes: perc. overlap with best x% by indegree, and perc. overlap with best x% by pagerank. X-axis on both: x% crawled by O(u)]

Page 16: Web Wide Crawl (328M pages, 2000) [Najo01]

BFS crawling brings in high quality pages early in the crawl

Page 17: BFS & Spam (Worst case scenario)

[Figure: BFS trees rooted at the Start Page]

BFS depth = 2
  Normal avg outdegree = 10
  100 URLs on the queue including a spam page.
Assume the spammer is able to generate dynamic pages with 1000 outlinks
BFS depth = 3
  2000 URLs on the queue
  50% belong to the spammer
BFS depth = 4
  1.01 million URLs on the queue
  99% belong to the spammer
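
The numbers on this slide can be reproduced with a small calculation. The sketch assumes that normal pages link only to normal pages, that every spam page links to 1000 fresh spam pages, and that no URL is seen twice; the function name `bfs_frontier` is hypothetical.

```python
def bfs_frontier(max_depth, normal_outdegree=10, spam_outdegree=1000):
    """Return (depth, queue_size, spam_share) rows starting at BFS depth 2.

    Depth 2 follows the slide: 100 URLs on the queue, one of them spam.
    """
    normal, spam = 99, 1
    rows = [(2, normal + spam, spam / (normal + spam))]
    for depth in range(3, max_depth + 1):
        # Each queued page contributes its outlinks to the next level.
        normal, spam = normal * normal_outdegree, spam * spam_outdegree
        rows.append((depth, normal + spam, spam / (normal + spam)))
    return rows


if __name__ == "__main__":
    for depth, size, share in bfs_frontier(4):
        print(f"depth {depth}: {size} URLs, {share:.0%} spam")
```

Under these assumptions depth 3 holds 990 + 1000 = 1990 URLs (about 2000, roughly half spam) and depth 4 holds 9900 + 1,000,000 = 1,009,900 URLs (about 1.01 million, roughly 99% spam), matching the slide's rounded figures.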

Page 18: Adversarial IR (Spam)

Motives
  Commercial, political, religious, lobbies
  Promotion funded by advertising budget
Operators
  Contractors (Search Engine Optimizers) for lobbies, companies
  Web masters
  Hosting services
Forum
  Web master world (www.webmasterworld.com)
    Search engine specific tricks
    Discussions about academic papers ☺

Page 19: A few spam technologies

Cloaking
  Serve fake content to search engine robot
  DNS cloaking: Switch IP address. Impersonate
Doorway pages
  Pages optimized for a single keyword that re-direct to the real target page
Keyword Spam
  Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
  Hidden text with colors, CSS tricks, etc.
Link spamming
  Mutual admiration societies, hidden links, awards
  Domain flooding: numerous domains that point or re-direct to a target page
Robots
  Fake click stream
  Fake query stream
  Millions of submissions via Add-Url

[Diagram: Cloaking. Is this a Search Engine spider? Y: SPAM. N: Real Doc]

Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …

Page 20: Can you trust words on the page?

auctions.hitsoffice.com/
  Pornographic Content
www.ebay.com/
Examples from July 2002

Page 21: Search Engine Optimization I

Adversarial IR (“search engine wars”)

Page 22: Search Engine Optimization II

Tutorial on Cloaking & Stealth Technology

Page 23: The war against spam

Quality signals - Prefer authoritative pages based on:
  Votes from authors (linkage signals)
  Votes from users (usage signals)
Policing of URL submissions
  Anti robot test
Limits on meta-keywords
Robust link analysis
  Ignore statistically implausible linkage (or text)
  Use link analysis to detect spammers (guilt by association)

Page 24: The war against spam

Spam recognition by machine learning
  Training set based on known spam
Family friendly filters
  Linguistic analysis, general classification techniques, etc.
  For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
  Blacklists
  Top queries audited
  Complaints addressed

Page 25: Duplicate/Near-Duplicate Detection

Duplication: Exact match with fingerprints
Near-Duplication: Approximate match
  Overview
    Compute syntactic similarity with an edit-distance measure
    Use similarity threshold to detect near-duplicates
      E.g., Similarity > 80% => Documents are “near duplicates”
      Not transitive though sometimes used transitively

Page 26: Computing Near Similarity

Features:
  Segments of a document (natural or artificial breakpoints) [Brin95]
  Shingles (Word N-Grams) [Brin95, Brod98]
    “a rose is a rose is a rose” =>
      a_rose_is_a
      rose_is_a_rose
      is_a_rose_is
Similarity Measure
  TFIDF [Shiv95]
  Set intersection [Brod98]
    (Specifically, Size_of_Intersection / Size_of_Union)
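
As a rough illustration of shingling and the Size_of_Intersection / Size_of_Union measure, the sketch below uses word 4-grams joined with underscores, as in the slide's example. The function names and the choice to treat two empty shingle sets as identical are assumptions.

```python
def shingles(text, n=4):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a, b):
    """Size of intersection divided by size of union of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    full = shingles("a rose is a rose is a rose")
    part = shingles("a rose is a rose")
    print(sorted(full))
    print(resemblance(full, part))
```

The eight-word example yields five 4-gram positions but only three distinct shingles, which is why the slide lists three.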

Page 27: Shingles + Set Intersection

Computing exact set intersection of shingles between all pairs of documents is expensive and infeasible
Approximate using a cleverly chosen subset of shingles from each (a sketch)

Page 28: Shingles + Set Intersection

Estimate size_of_intersection / size_of_union based on a short sketch ([Brod97, Brod98])
  Create a “sketch vector” (e.g., of size 200) for each document
  Documents which share more than t (say 80%) corresponding vector elements are similar
  For doc D, sketch[i] is computed as follows:
    Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
    Let π_i be a specific random permutation on 0..2^m
    Pick sketch[i] := MIN π_i(f(s)) over all shingles s in D
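
The sketch computation above might look like the following. Two substitutions are assumptions and not the slide's construction: f is a truncated BLAKE2b hash reduced modulo a prime P, and each π_i is the linear map x → (a·x + b) mod P, which is a permutation of 0..P-1 but only an approximation of a uniformly random one. All names are hypothetical.

```python
import hashlib
import random

P = (1 << 61) - 1  # Mersenne prime; plays the role of the slide's 2^m range


def shingles(text, n=4):
    words = text.split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(shingle):
    """f: map a shingle to an integer in 0..P-1."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P


def make_permutations(k=200, seed=0):
    """k pairs (a, b); each defines pi(x) = (a*x + b) mod P with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]


def sketch(shingle_set, perms):
    """sketch[i] = min over shingles s of pi_i(f(s)); needs a non-empty set."""
    fps = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % P for x in fps) for a, b in perms]


def estimated_resemblance(sk1, sk2):
    """Fraction of sketch positions on which two documents agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)
```

Both documents must be sketched with the same list of permutations, since only corresponding positions are compared. A pair would be flagged as near-duplicates when the estimate exceeds the threshold t (say 0.8).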

Page 29: Computing Sketch[i] for Doc1

[Figure: Document 1 mapped onto number lines running up to 2^64]

Start with 64 bit shingles
Permute on the number line with π_i
Pick the min value

Page 30: Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Figure: Document 1 and Document 2 mapped onto number lines running up to 2^64, with minimum values A and B]

Are these equal?
Test for 200 random permutations: π_1, π_2, … π_200

Page 31: However…

[Figure: Document 1 and Document 2 mapped onto number lines running up to 2^64, with minimum values A and B]

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
This happens with probability: Size_of_intersection / Size_of_union

Page 32: Question

Document D1 = D2 iff size_of_intersection = size_of_union ?

Page 33: Mirror Detection

Mirroring is systematic replication of web pages across hosts.
  Single largest cause of duplication on the web
Host1/α and Host2/β are mirrors iff
  For all (or most) paths p such that when http://Host1/α/p exists
  http://Host2/β/p exists as well
  with identical (or near identical) content, and vice versa.

Page 34: Mirror Detection example

http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins
  http://scop.mrc-lmb.cam.ac.uk/scop
  http://scop.berkeley.edu/
  http://scop.wehi.edu.au/scop
  http://pdb.weizmann.ac.il/scop
  http://scop.protres.ru/

Page 35: Repackaged Mirrors

Auctions.lycos.com
Auctions.msn.com
Aug

Page 36: Motivation

Why detect mirrors?
  Smart crawling
    Fetch from the fastest or freshest server
    Avoid duplication
  Better connectivity analysis
    Combine inlinks
    Avoid double counting outlinks
  Redundancy in result listings
    “If that fails you can try: <mirror>/samepath”
  Proxy caching

Page 37: Bottom Up Mirror Detection [Cho00]

Maintain clusters of subgraphs
Initialize clusters of trivial subgraphs
  Group near-duplicate single documents into a cluster
Subsequent passes
  Merge clusters of the same cardinality and corresponding linkage
  Avoid decreasing cluster cardinality
To detect mirrors we need:
  Adequate path overlap
  Contents of corresponding pages within a small time range

Page 38: Can we use URLs to find mirrors?

[Figure: two site trees with matching structure, www.synthesis.org (a, b, c, d) and synthesis.stanford.edu (a, b, c, d)]

www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html

synthesis…
synthesis.stanfo…
synthesis.stanford.edu/Docs/
synthes…
synthes…
synthes…
synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
….stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
…rd.edu/Docs/myr5/assessment
…myr5/assessment/assessment-…
…is.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
…is.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
…is.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html

Page 39: Top Down Mirror Detection [Bhar99, Bhar00c]

E.g.,
  www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
  synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
What features could indicate mirroring?
  Hostname similarity:
    word unigrams and bigrams: { www, www.synthesis, synthesis, … }
  Directory similarity:
    Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
  IP address similarity:
    3 or 4 octet overlap
    Many hosts sharing an IP address => virtual hosting by an ISP
  Host outlink overlap
  Path overlap
    Potentially, path + sketch overlap
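
The hostname and directory features listed above can be extracted roughly as follows. Splitting hostnames on dots and paths on slashes is an assumption about tokenization, and the function names are hypothetical.

```python
def hostname_ngrams(host):
    """Word unigrams and bigrams of a hostname, split on dots."""
    parts = host.split(".")
    bigrams = {".".join(parts[i:i + 2]) for i in range(len(parts) - 1)}
    return set(parts) | bigrams


def positional_path_bigrams(path):
    """Bigrams of adjacent path segments, tagged with their depth."""
    segments = [s for s in path.split("/") if s]
    return {f"{i}:{segments[i]}/{segments[i + 1]}"
            for i in range(len(segments) - 1)}


if __name__ == "__main__":
    print(sorted(hostname_ngrams("www.synthesis.org")))
    print(sorted(positional_path_bigrams("Docs/ProjAbs/synsys/synalysis.html")))
```

For the two example URLs on this slide, the positional bigrams 0:Docs/ProjAbs and 1:ProjAbs/synsys are shared even though the file names differ, which is the kind of overlap a candidate-pair detector could count.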

Page 40: Implementation

Phase I - Candidate Pair Detection
  Find features that pairs of hosts have in common
  Compute a list of host pairs which might be mirrors
Phase II - Host Pair Validation
  Test each host pair and determine extent of mirroring
    Check if 20 paths sampled from Host1 have near-duplicates on Host2 and vice versa
    Use transitive inferences:
      IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
      IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
Evaluation
  140 million URLs on 230,000 hosts (1999)
  Best approach combined 5 sets of features
    Top 100,000 host pairs had precision = 0.57 and recall = 0.86
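
The two transitive inference rules from Phase II can be sketched as a lookup over already-validated pairs. This is only an illustration: it assumes Mirror is symmetric, stores verdicts keyed by unordered host pairs, and returns the first rule that fires, so inconsistent inputs are not detected. The function name `infer_mirror` is hypothetical.

```python
def infer_mirror(known, a, b):
    """Return True/False if a rule decides Mirror(a, b), otherwise None.

    known maps frozenset({host1, host2}) to a validated True/False verdict.
    """
    def lookup(u, v):
        return known.get(frozenset((u, v)))

    direct = lookup(a, b)
    if direct is not None:
        return direct
    hosts = {h for pair in known for h in pair}
    for x in sorted(hosts - {a, b}):
        ax, xb = lookup(a, x), lookup(x, b)
        if ax is True and xb is True:
            return True   # Mirror(A,x) AND Mirror(x,B) => Mirror(A,B)
        if (ax is True and xb is False) or (ax is False and xb is True):
            return False  # Mirror(A,x) AND !Mirror(x,B) => !Mirror(A,B)
    return None
```

A pair for which the function returns None would still need the sampled-path test; the inference only saves validation work for pairs that share an already-tested neighbor.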

Page 41: WebIR Infrastructure

Connectivity Server
  Fast access to links to support link analysis
Term Vector Database
  Fast access to document vectors to augment link analysis

Page 42: Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]

Fast web graph access to support connectivity analysis
Stores mappings in memory from
  URL to outlinks, URL to inlinks
Applications
  HITS, Pagerank computations
  Crawl simulation
  Graph algorithms: web connectivity, diameter etc.
    more on this later
  Visualizations

Page 43: Usage

Input
  Graph algorithm + URLs + Values
  URLs to FPs to IDs
Execution
  Graph algorithm runs in memory
Output
  IDs to URLs
  URLs + Values

Translation Tables on Disk
  URL text: 9 bytes/URL (compressed from ~80 bytes)
  FP(64b) -> ID(32b): 5 bytes
  ID(32b) -> FP(64b): 8 bytes
  ID(32b) -> URLs: 0.5 bytes

Page 44: ID assignment

E.g., HIGH IDs: Max(indegree, outdegree) > 254

ID          URL
…
9891        www.amazon.com/
9912        www.amazon.com/jobs/
…
9821878     www.geocities.com/
…
40930030    www.google.com/
…
85903590    www.yahoo.com/

Partition URLs into 3 sets, sorted lexicographically
  High: Max degree > 254
  Medium: 254 > Max degree > 24
  Low: remaining (75%)
IDs assigned in sequence (densely)

Adjacency lists
  In memory tables for Outlinks, Inlinks
  List index maps from a Source ID to start of adjacency list
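
A rough sketch of the three-way partition and dense ID assignment. It assumes IDs start at 0 and that a URL whose max degree is exactly 254 or 24 falls into the lower set, since the slide's strict inequalities leave those boundaries open; `assign_ids` is a hypothetical name.

```python
def assign_ids(degrees, high=254, medium=24):
    """Assign dense IDs: High set first, then Medium, then Low.

    degrees maps url -> (indegree, outdegree). Within each set the URLs
    are sorted lexicographically.
    """
    def partition(url):
        max_degree = max(degrees[url])
        if max_degree > high:
            return 0  # High
        if max_degree > medium:
            return 1  # Medium
        return 2      # Low

    ordered = sorted(degrees, key=lambda url: (partition(url), url))
    return {url: new_id for new_id, url in enumerate(ordered)}
```

Sorting lexicographically within a set keeps URLs from the same host next to each other in ID space, which is what makes the delta and inter-list compression on the following slides effective.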

Page 45: Adjacency List Compression - I

[Figure: a List Index with entries 104, 105, 106 pointing into a Sequence of Adjacency Lists (… 98 132 153 | 98 147 153 …), and the same List Index pointing into Delta Encoded Adjacency Lists (… -6 34 21 | -8 49 6 …)]

• Adjacency List:
  - Smaller delta values are exponentially more frequent (80% to same host)
  - Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
  - Avg = 12b per pointer
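
The delta encoding in the figure can be sketched as below. The figure's numbers are consistent with taking the first delta relative to the source ID and each later delta relative to the previous neighbor, with the lists belonging to source IDs 104 and 106; that reading is inferred from the arithmetic rather than stated on the slide, and the function names are hypothetical.

```python
def delta_encode(source_id, neighbors):
    """First delta is relative to source_id, the rest to the previous neighbor."""
    deltas, previous = [], source_id
    for neighbor in sorted(neighbors):
        deltas.append(neighbor - previous)
        previous = neighbor
    return deltas


def delta_decode(source_id, deltas):
    """Invert delta_encode by accumulating the deltas from source_id."""
    neighbors, previous = [], source_id
    for delta in deltas:
        previous += delta
        neighbors.append(previous)
    return neighbors


if __name__ == "__main__":
    print(delta_encode(104, [98, 132, 153]))
    print(delta_encode(106, [98, 147, 153]))
```

Only the first delta can be negative, since the neighbors are sorted; a variable-length code such as Huffman would then assign short codewords to the small deltas that dominate.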

Page 46: Adjacency List Compression - II

Inter List Compression
  Basis: Similar URLs may share links
    Close in ID space => adjacency lists may overlap
  Approach
    Define a representative adjacency list for a block of IDs
      Adjacency list of a reference ID
      Union of adjacency lists in the block
    Represent adjacency list in terms of deletions and additions when it is cheaper to do so
Measurements
  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
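
One way to picture inter-list compression is to store a list as edits against a representative list and fall back to the raw list otherwise. The cost model here (number of integers stored) is an assumption chosen for simplicity, and the names `encode_against_reference` and `decode_against_reference` are hypothetical; a real server would compare encoded bit lengths.

```python
def encode_against_reference(reference, neighbors):
    """Store deletions and additions relative to reference when cheaper."""
    ref, cur = set(reference), set(neighbors)
    deletions = sorted(ref - cur)
    additions = sorted(cur - ref)
    if len(deletions) + len(additions) < len(cur):
        return ("ref", deletions, additions)
    return ("raw", sorted(cur))


def decode_against_reference(reference, encoded):
    """Rebuild the sorted adjacency list from either representation."""
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

With the lists from the previous slide, [98, 147, 153] against the reference [98, 132, 153] needs one deletion and one addition instead of three entries; a list with little overlap stays raw.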

Page 47: Term Vector Database [Stat00]

Fast access to 50 word term vectors for web pages
  Term Selection:
    Restricted to middle 1/3rd of lexicon by document frequency
    Top 50 words in document by TF.IDF.
  Term Weighting:
    Deferred till run-time (can be based on term freq, doc freq, doc length)
Applications
  Content + Connectivity analysis (e.g., Topic Distillation)
  Topic specific crawls
  Document classification
Performance
  Storage: 33GB for 272M term vectors
  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
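
The term selection step might be sketched as follows. The TF.IDF formula tf · log(N / df) and the rounding of the "middle third" boundaries are assumptions, as are the function names; only raw term frequencies are returned, in keeping with weighting being deferred till run-time.

```python
import math
from collections import Counter


def middle_third(doc_freq):
    """Terms in the middle third of the lexicon ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda term: (doc_freq[term], term))
    n = len(ranked)
    return set(ranked[n // 3: n - n // 3])


def term_vector(doc_terms, doc_freq, num_docs, k=50):
    """Top-k allowed terms of a document by TF.IDF, as (term, tf) pairs."""
    allowed = middle_third(doc_freq)
    tf = Counter(term for term in doc_terms if term in allowed)
    score = {t: c * math.log(num_docs / doc_freq[t]) for t, c in tf.items()}
    top = sorted(score, key=lambda t: (-score[t], t))[:k]
    return [(t, tf[t]) for t in top]
```

Dropping the most frequent third removes stop-word-like terms and dropping the rarest third removes typos and one-off strings, so the stored 50 terms are more likely to be topical.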

Page 48: Architecture

[Figure: URLid to Term Vector Lookup. Labels: URL Info; Base (4 bytes); Bit vector for 480 URLids; offset; URLid * 64 / 480; 128 Byte TV Record holding Terms (LC:TID LC:TID … LC:TID) and Freq (FRQ:RL FRQ:RL … FRQ:RL)]