Efficient and Effective Link Analysis with Precomputed SALSA Maps

Marc Najork (Microsoft Research, Mt View, CA, USA)
Nick Craswell (Microsoft Live Search, Cambridge, UK)


Post on 12-Feb-2016




Outline

• The problem
• Framework & previous results
• Review of SALSA; introduction of CS-SALSA
• Four pre-computed variants of SALSA:
  – Strawman: SS-SALSA-0
  – Woodman: SS-SALSA-1
  – Tinman: SS-SALSA-2
  – Ironman: SS-SALSA-3
• Recap: Comparing old & new
• Breakdown by query specificity
• Related work
• Critique


The problem we are addressing

• Hyperlinks are a valuable feature for ranking of web search results
  – Combined with many other features (text, traffic)

• Known query-dependent link-based ranking algorithms (SALSA & variants) provide better signal than known query-independent ones (PageRank, in-degree)

• But: SALSA requires substantial query-time work; PageRank etc. is pre-computed

• Can we pre-compute SALSA while preserving signal?


Our experimental framework

• Large web graph
  – 464 million crawled pages
  – 2.9 billion distinct URLs
  – 17.7 billion distinct edges
• Large test set
  – 28,043 queries (sampled from Live Search logs)
  – 66.8 million result URLs (~2838/query)
  – 485,656 judgments (~17.3/query); six-point scale
• Standard performance measures: MAP, MRR, NDCG
• Same data & measures as used in other work (SIGIR 2007, CIKM 2007, WAW 2007, WSDM 2009)
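Since the results on the following slides are reported as NDCG@10, a minimal sketch of that measure may help. The slides do not state the gain or discount functions, so the exponential gain `2**rel - 1` and the `log2(rank + 1)` discount below are assumptions (a common convention), and the function names are illustrative only.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain of the first k gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k for one query.

    ranked_labels: relevance labels (e.g. 0..5 on a six-point scale)
    of the judged results, in the order the ranker returned them.
    """
    gains = [2 ** rel - 1 for rel in ranked_labels]   # assumed gain function
    ideal = dcg_at_k(sorted(gains, reverse=True), k)  # best possible ordering
    if ideal == 0:
        return 0.0  # no relevant result for this query
    return dcg_at_k(gains, k) / ideal
```

A ranking that already lists results in decreasing order of relevance scores 1.0; any other ordering of the same labels scores less.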


Previous results on this data set

See CIKM 2007 (for SALSA), SIGIR 2007 (all other results)

[Bar chart, NDCG@10 per algorithm: BM25F .221; SALSA(RS,ID,all,3) .158; inter-domain in-degree .106; HITS(RS,ID,all,25) .104; PageRank .092; Random .011]


Some notation

• Web graph $G = (V, E)$, $E \subseteq V \times V$ (eliminating intra-domain edges from E)
• URLs $u, v, w \in V$
• Parent/in-linker set: $I(v) = \{ u \in V : (u,v) \in E \}$
• Children/out-linker set: $O(u) = \{ v \in V : (u,v) \in E \}$
• Result set $R \subseteq V$ of a query q


Random vs. consistent sampling

• $U_n(X)$ denotes uniform random sample of n elements from X

• $C_n(X)$ denotes consistent sample of n elements from X

• Properties:
  – Deterministic
  – Unbiased
  – Preserves set similarity:

$$E\left[\frac{|C_n(A) \cap C_m(B)|}{|C_n(A) \cup C_m(B)|}\right] = \frac{|A \cap B|}{|A \cup B|}$$
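The slide does not say how a consistent sample is drawn. One common way to get a deterministic sample that behaves consistently across overlapping sets is to hash every element with a fixed hash function and keep the n elements with the smallest hash values. The sketch below assumes that approach; `consistent_sample` and `_h` are illustrative names, not taken from the talk.

```python
import hashlib

def _h(item: str) -> int:
    """Fixed pseudo-random rank for an item (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def consistent_sample(items, n):
    """Return the (at most) n items with the smallest hash values.

    The result depends only on the set of items, not on their order,
    so repeated calls on the same set always return the same sample.
    """
    return set(sorted(set(items), key=_h)[:n])
```

With this construction, an element sampled from a larger set is also sampled from any subset that contains it, which is what lets samples of similar sets overlap heavily.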


SALSA algorithm (Lempel & Moran 2000)

• Input: Web graph (V,E); result set R of query q
• Form neighborhood graph (B,N):
  – Expand R to base set B by including all children and n parents (sampled uniformly at random) of each result in R:

$$B = \bigcup_{u \in R} \{u\} \cup O(u) \cup U_n(I(u))$$

  – Neighborhood edge set N includes all edges in E with endpoints in B:

$$N = \{ (u,v) \in E : u \in B \wedge v \in B \}$$


SALSA algorithm: Authority scores

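• Form two matrices based on (B,N):

Inverse-indegree matrix:

$$I_{uv} = \begin{cases} \dfrac{1}{\mathrm{in}(v)} & \text{if } (u,v) \in N \\ 0 & \text{otherwise} \end{cases}$$

Inverse-outdegree matrix:

$$O_{uv} = \begin{cases} \dfrac{1}{\mathrm{out}(u)} & \text{if } (u,v) \in N \\ 0 & \text{otherwise} \end{cases}$$

• Authority score vector = principal eigenvector of $I^T O$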
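As a rough illustration of the authority-score definition, the sketch below runs a power iteration on the neighborhood edge set N. It assumes the usual random-walk reading of $I^T O$: from an authority v, step back to a parent u with probability 1/in(v), then forward to a child w of u with probability 1/out(u). The function name, the uniform starting vector, and the fixed iteration count are assumptions made for the example.

```python
from collections import defaultdict

def salsa_authority(edges, iterations=50):
    """Approximate SALSA authority scores by power iteration.

    edges: iterable of (u, v) pairs, the neighborhood edge set N.
    Returns {vertex: score} for vertices with at least one in-link;
    vertices without in-links receive no authority score.
    """
    parents = defaultdict(set)    # v -> in-linkers of v within N
    children = defaultdict(set)   # u -> out-linkers of u within N
    for u, v in set(edges):
        parents[v].add(u)
        children[u].add(v)

    authorities = list(parents)
    if not authorities:
        return {}
    score = {v: 1.0 / len(authorities) for v in authorities}

    for _ in range(iterations):
        # Backward step: each authority splits its score among its parents.
        hub = defaultdict(float)
        for v, s in score.items():
            for u in parents[v]:
                hub[u] += s / len(parents[v])
        # Forward step: each parent splits what it received among its children.
        nxt = dict.fromkeys(authorities, 0.0)
        for u, h in hub.items():
            for w in children[u]:
                nxt[w] += h / len(children[u])
        score = nxt
    return score
```

On the three-edge graph `[("a", "x"), ("a", "y"), ("b", "y")]` the scores settle at 1/3 for `x` and 2/3 for `y`, i.e. proportional to in-degree within the connected component.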


SALSA scores: Operational definition


CS-SALSA

• “Consistent-sampling SALSA” (CS-SALSA)
  – Identical to standard SALSA, except:
  – Sample in-linkers as well as out-linkers
  – using consistent sampling (as opposed to random)
  – Two free sampling parameters a and b
  – What are the best settings?

$$B = \bigcup_{u \in R} \{u\} \cup C_a(I(u)) \cup C_b(O(u))$$


Effectiveness of CS-SALSA

[Chart: NDCG@10]

• CS-SALSA(2,1) more effective than standard SALSA (whose NDCG@10 was 0.158)


Basic ideas of “Singleton-seed SALSA”

• Offline (at indexing time):
  – Pretend that each $v \in V$ is a singleton result set
  – Form neighborhood graph around {v}
  – Compute SALSA scores on that graph
• Online (at query time):
  – Look up pre-computed scores of each $v \in R$ and use them


Strawman: SS-SALSA-0

• Offline:
  – Input: Web graph (V,E), sampling parameters a, b
  – Output: Score map $g: V \to \mathbb{R}$
  – For each $v \in V$:
    • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
    • Compute SALSA scores s[u] for each $u \in B$
    • Set g[v] := s[v]
• Online, given result set R and score map g:
  – For each $u \in R$: Assign score g[u]
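The offline/online split of SS-SALSA-0 can be sketched in a few lines. The helper `seed_scores` is a hypothetical callback standing in for "build the CS-SALSA neighborhood graph around {v} and compute SALSA scores on it". Giving a page that is missing from the map a score of 0.0 is likewise an assumption of this sketch.

```python
def precompute_ss_salsa_0(vertices, seed_scores):
    """Offline step: build the score map g (one number per page).

    seed_scores(v) is assumed to return {u: s[u]} for the neighborhood
    graph built around the singleton result set {v}.
    """
    g = {}
    for v in vertices:
        s = seed_scores(v)
        g[v] = s.get(v, 0.0)   # keep only the seed's own score
    return g

def score_results(results, g):
    """Online step: one lookup per result, as with PageRank or in-degree."""
    return {u: g.get(u, 0.0) for u in results}
```

The query-time cost is a single lookup per result, which is what makes this variant comparable in cost to query-independent features.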


Effectiveness of SS-SALSA-0

[Chart: NDCG@10]

• Computed off-line, looking up one score per result at query-time (like in-degree, PageRank)

• Substantially less effective than PageRank and in-degree


Woodman: SS-SALSA-1

• Offline:
  – Input: Web graph (V,E), sampling parameters a, b
  – Output: Score map $g: V \to (V \to \mathbb{R})$
  – For each $v \in V$:
    • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
    • Compute SALSA scores s[u] for each $u \in B$; s[u] = 0 for $u \notin B$
    • Set g[v] := s (which is of type $V \to \mathbb{R}$)
• Online, given result set R and score map g:
  – For each $u \in R$: Assign score

$$\sum_{v \in B} g[v][u]$$


Effectiveness of SS-SALSA-1

[Chart: NDCG@10]

• Looking up |B| (≤ a+b+1) scores per result at query time
• More effective than PageRank; less effective than CS-SALSA
• Better to sample no parents, more children
  – Counter-intuitive when viewing hyperlinks as endorsements


Tinman: SS-SALSA-2

• Same as SS-SALSA-1, except that the offline step uses a modified definition of B
  – Sample a parents and b children of the “result” (the seed vertex) as before
  – Additionally, include c children (“siblings”) of each sampled parent, and d parents (“mates”) of each sampled child
  – So, SS-SALSA-2 has four free parameters a, b, c, d
  – Neighborhood graph and score maps are potentially much larger
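The modified base set can be sketched as follows. The graph is passed as two plain dictionaries (`parents` and `children`), and `smallest` is a deliberately simple stand-in for consistent sampling; both are assumptions of this example, as is the convention that `None` means "take all".

```python
def smallest(items, n):
    """Stand-in sampler: the n smallest items; n=None keeps everything."""
    return set(sorted(items)[:n])

def ss_salsa_2_base_set(seed, parents, children, a, b, c, d, sample=smallest):
    """Base set B around one seed vertex, following the SS-SALSA-2 recipe.

    parents / children: dicts mapping a vertex to its set of in-linkers /
    out-linkers. a, b, c, d are the four sampling parameters.
    """
    sampled_parents = sample(parents.get(seed, set()), a)
    sampled_children = sample(children.get(seed, set()), b)
    base = {seed} | sampled_parents | sampled_children
    for p in sampled_parents:
        base |= sample(children.get(p, set()), c)   # siblings of the seed
    for ch in sampled_children:
        base |= sample(parents.get(ch, set()), d)   # mates of the seed
    return base
```

Because every sampled parent contributes up to c vertices and every sampled child up to d, the base set (and therefore the stored score map) can grow to roughly a·c + b·d entries per seed.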


Effectiveness of SS-SALSA-2

• Effectiveness increases monotonically as b (number of sampled children per result) is increased

• Increases further as d (number of sampled mates per sampled child) is increased

• Setting a (number of sampled parents per result) to 0 is best; other values are fairly indistinguishable

• SS-SALSA-2(0,all,0,75) has NDCG@10 of 0.157
  – Compared to 0.182 for CS-SALSA(2,1)
  – Huge space cost: ~7500 scores for every page in the corpus!


Ironman: SS-SALSA-3

• Idea: Bound size of score map
• For every seed vertex v:
  – Fix neighborhood vertex set B and compute scores s in the same way as in SS-SALSA-2
  – Set g[v] := $\mathrm{top}_k(s)$, the vertex-to-score mapping of the k highest-scoring vertices in B
• Note that v itself might not be part of $\mathrm{top}_k(s)$
• SS-SALSA-3 has five free parameters a, b, c, d, k
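The truncation step is small enough to show directly. This is a sketch of the top-k operation on a vertex-to-score mapping, using Python's standard `heapq.nlargest`; the function name `top_k` is illustrative.

```python
import heapq

def top_k(scores, k):
    """Keep only the k highest-scoring entries of a vertex-to-score mapping."""
    return dict(heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]))
```

For example, `top_k({"v": 0.1, "x": 0.5, "y": 0.4}, 2)` keeps `x` and `y` and drops the seed `v`, which illustrates the note above that the seed need not survive truncation.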


Effectiveness of SS-SALSA-3

• Fixed a=0, b=all, c=0, d=75
• SS-SALSA-3 outperforms PageRank starting at two-entry score maps


Recap: Comparing algorithms

[Bar chart, NDCG@10 per algorithm (legend: new, all-online; new, pre-computed): BM25F .221; CS-SALSA(2,1) .182; SALSA(RS,ID,all,3) .158; SS-SALSA-2(0,all,0,75) .157; SS-SALSA-3(0,all,0,75,10) .153; SS-SALSA-1(0,5) .140; inter-domain in-degree .106; HITS(RS,ID,all,25) .104; PageRank .092; Random .011]


Breakdown by query specificity

• How do SALSA variants, PageRank, and BM25F perform for different classes of queries?
• Different ways to classify queries:
  – Informational, navigational, transactional (Broder’s taxonomy)
  – Commercial vs. non-commercial intent
  – General vs. specific
• How to measure specificity?
  – Ideally, by size of result set
  – Approximation: Sum of IDFs of query terms
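The IDF-based approximation of specificity can be sketched as below. The slide does not say which IDF variant was used, so the plain `log(N / df)` form is an assumption, as are the function name and the choice to skip terms with no known document frequency.

```python
import math

def query_specificity(query_terms, doc_freq, num_docs):
    """Approximate query specificity as the sum of the terms' IDF values.

    doc_freq: dict mapping a term to the number of documents containing it.
    num_docs: total number of documents in the collection.
    Terms missing from doc_freq are skipped (an assumption of this sketch).
    """
    return sum(math.log(num_docs / doc_freq[t])
               for t in query_terms
               if doc_freq.get(t))
```

Queries made of rare terms get a high value (specific), while queries made of very common terms get a value near zero (general).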


Breakdown by query specificity

• CS-SALSA >> SS-SALSA-* for general queries
• SS-SALSA-3 as good as SS-SALSA-2 for general queries


Related work

• The quest for correct information on the web: hyper search engines (Marchiori 1997)
• The PageRank citation ranking: Bringing order to the web (Page, Brin, Motwani, Winograd 1998)
• Authoritative sources in a hyperlinked environment (Kleinberg 1998)
• The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect (Lempel & Moran 2000)
• Using Bloom filters to speed up HITS-like ranking algorithms (Gollapudi, Najork, Panigrahy 2007)
• Less is More: Sampling the neighborhood graph makes SALSA better and faster (Najork, Gollapudi, Panigrahy 2009)


Critique

• Data sets not publicly available

“I have a serious problem with the data set used by the authors. It is large, apparently well built, and not publicly available. There is by now stream of papers using these data and making strong claims about the effectiveness of all ranking methods for the web at major conferences; for these papers no claim can be confirmed or evaluated.” (anonymous WSDM 2009 reviewer)

Plan to repeat using standard collections.


Critique

• Issues with data sets:
  – Web graph is old
  – Small fraction of results are judged
  – Intersection between graph & results is modest
  See above – plan to repeat using public collection
• Examined only effectiveness of isolated features
  – Linear combination with BM25F still improves over PageRank & BM25F, but improvement much smaller
  – Use better methods for combining evidence?
• Good point on speed/quality curve?
  – You be the judge …


QUESTIONS?