
Efficient and Effective Link Analysis with Precomputed SALSA Maps

Marc Najork (Microsoft Research, Mt View, CA, USA) Nick Craswell (Microsoft Live Search, Cambridge, UK)

Outline
• The problem
• Framework & previous results
• Review of SALSA; introduction of CS-SALSA
• Four pre-computed variants of SALSA:
– Strawman: SS-SALSA-0
– Woodman: SS-SALSA-1
– Tinman: SS-SALSA-2
– Ironman: SS-SALSA-3
• Recap: Comparing old & new
• Breakdown by query specificity
• Related work
• Critique

The problem we are addressing

• Hyperlinks are a valuable feature for ranking of web search results
– Combined with many other features (text, traffic)

• Known query-dependent link-based ranking algorithms (SALSA & variants) provide better signal than known query-independent ones (PageRank, in-degree)

• But: SALSA requires substantial query-time work, whereas PageRank and in-degree are pre-computed

• Can we pre-compute SALSA while preserving signal?

Our experimental framework
• Large web graph
– 464 million crawled pages
– 2.9 billion distinct URLs
– 17.7 billion distinct edges
• Large test set
– 28,043 queries (sampled from Live Search logs)
– 66.8 million result URLs (~2,838/query)
– 485,656 judgments (~17.3/query); six-point scale
• Standard performance measures: MAP, MRR, NDCG
• Same data & measures as used in other work (SIGIR 2007, CIKM 2007, WAW 2007, WSDM 2009)

Previous results on this data set

See CIKM 2007 (for SALSA), SIGIR 2007 (all other results)

[Bar chart: NDCG@10]
BM25F                    .221
SALSA(RS,ID,all,3)       .158
inter-domain in-degree   .106
HITS(RS,ID,all,25)       .104
PageRank                 .092
Random                   .011

Some notation

• Web graph G = (V, E), E ⊆ V × V (eliminating intra-domain edges from E)
• URLs u, v, w ∈ V
• Parent/in-linker set I(v) = { u ∈ V : (u,v) ∈ E }
• Children/out-linker set O(u) = { v ∈ V : (u,v) ∈ E }
• Result set R ⊆ V of a query q

Random vs. consistent sampling

• U_n(X) denotes a uniform random sample of n elements from X
• C_n(X) denotes a consistent sample of n elements from X
• Properties of consistent sampling:
– Deterministic
– Unbiased
– Preserves set similarity: the overlap between C_n(A) and C_n(B) reflects the Jaccard similarity |A ∩ B| / |A ∪ B| of the underlying sets (a sketch of one such sampler follows below)
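The deck does not spell out how the consistent sampler C_n is built. A common construction, assumed here purely for illustration, is to hash every element with a fixed hash function and keep the n elements with the smallest hash values; the same element always hashes the same way, which yields the deterministic, similarity-preserving behavior listed above. A minimal Python sketch:

    import hashlib

    def consistent_sample(elements, n):
        # Deterministic: an element's hash never changes, so overlapping
        # input sets tend to produce overlapping samples.
        def h(x):
            return hashlib.md5(str(x).encode("utf-8")).hexdigest()
        return set(sorted(elements, key=h)[:n])

    # Two largely overlapping sets yield largely overlapping samples.
    A = set(range(0, 100))
    B = set(range(10, 110))
    print(consistent_sample(A, 8) & consistent_sample(B, 8))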

SALSA algorithm (Lempel & Moran 2000)

• Input: Web graph (V,E); result set R of query q
• Form neighborhood graph (B,N) (sketched in code below):
– Expand R to base set B by including all children and n parents (sampled uniformly at random) of each result in R:
  B = ∪_{u∈R} ( {u} ∪ O(u) ∪ U_n(I(u)) )
– Neighborhood edge set N includes all edges in E with endpoints in B:
  N = { (u,v) ∈ E : u ∈ B, v ∈ B }
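A sketch of the neighborhood-graph construction, assuming the web graph is available as two adjacency maps, parents (for I) and children (for O); these names and the dictionary representation are illustrative, not taken from the paper:

    import random

    def build_neighborhood(R, parents, children, n):
        # B: each result, all of its children, and n uniformly sampled parents.
        B = set()
        for u in R:
            B.add(u)
            B |= children.get(u, set())
            in_linkers = list(parents.get(u, set()))
            B |= set(random.sample(in_linkers, min(n, len(in_linkers))))
        # N: every edge whose endpoints both lie in B.
        N = {(u, v) for u in B for v in children.get(u, set()) if v in B}
        return B, N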

SALSA algorithm: Authority scores

• Form two matrices based on (B,N):
– Inverse-indegree matrix I: I_uv = 1/in(v) if (u,v) ∈ N, 0 otherwise
– Inverse-outdegree matrix O: O_uv = 1/out(u) if (u,v) ∈ N, 0 otherwise

• Authority score vector = principal eigenvector of IᵀO (power-iteration sketch below)
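One way to obtain that vector is power iteration, treating IᵀO as the transition matrix of SALSA's authority walk and computing its stationary distribution. The dense numpy sketch below is illustrative only (a real implementation would exploit the sparsity of the neighborhood graph):

    import numpy as np

    def salsa_authority_scores(B, N):
        nodes = sorted(B)
        idx = {v: i for i, v in enumerate(nodes)}
        k = len(nodes)
        indeg = {v: 0 for v in nodes}
        outdeg = {v: 0 for v in nodes}
        for (u, v) in N:
            outdeg[u] += 1
            indeg[v] += 1
        I = np.zeros((k, k))   # inverse-indegree matrix
        O = np.zeros((k, k))   # inverse-outdegree matrix
        for (u, v) in N:
            I[idx[u], idx[v]] = 1.0 / indeg[v]
            O[idx[u], idx[v]] = 1.0 / outdeg[u]
        M = I.T @ O            # authority-walk transition matrix
        s = np.ones(k) / k
        for _ in range(100):   # power iteration toward the stationary vector
            s = s @ M
            if s.sum() > 0:
                s /= s.sum()
        return {v: s[idx[v]] for v in nodes}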

SALSA scores: Operational definition

CS-SALSA

• “Consistent-sampling SALSA” (CS-SALSA)
– Identical to standard SALSA, except:
– Sample in-linkers as well as out-linkers, using consistent sampling (as opposed to random); see the sketch below:
  B = ∪_{u∈R} ( {u} ∪ C_a(I(u)) ∪ C_b(O(u)) )
– Two free sampling parameters a and b
– What are the best settings?
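Under those definitions the base-set construction differs from standard SALSA only in the sampling step. A sketch reusing the hypothetical consistent_sample helper from above:

    def cs_salsa_base_set(R, parents, children, a, b):
        # Each result plus a consistently sampled in-linkers and
        # b consistently sampled out-linkers (deterministic, unlike U_n).
        B = set()
        for u in R:
            B.add(u)
            B |= consistent_sample(parents.get(u, set()), a)
            B |= consistent_sample(children.get(u, set()), b)
        return B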

Effectiveness of CS-SALSA

[Chart: NDCG@10]
• CS-SALSA(2,1) is more effective than standard SALSA (NDCG@10 of 0.182 vs. 0.158)

Basic ideas of “Singleton-seed SALSA”

• Offline (at indexing time):
– Pretend that each v ∈ V is a singleton result set
– Form neighborhood graph around {v}
– Compute SALSA scores on that graph
• Online (at query time):
– Look up pre-computed scores of each v ∈ R and use them

Strawman: SS-SALSA-0

• Offline:
– Input: Web graph (V,E), sampling parameters a, b
– Output: Score map g: V → ℝ
– For each v ∈ V:
  • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
  • Compute SALSA scores s[u] for each u ∈ B
  • Set g[v] := s[v]
• Online, given result set R and score map g:
– For each u ∈ R: assign score g[u] (see the sketch below)
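A sketch of that offline/online split, built on the helpers sketched earlier (all names are illustrative, and the edge set is formed as in CS-SALSA):

    def ss_salsa_0_offline(V, parents, children, a, b):
        g = {}
        for v in V:
            # Pretend {v} is the result set and build its neighborhood graph.
            B = cs_salsa_base_set({v}, parents, children, a, b)
            N = {(u, w) for u in B for w in children.get(u, set()) if w in B}
            s = salsa_authority_scores(B, N)
            g[v] = s.get(v, 0.0)   # keep only the seed's own score
        return g

    def ss_salsa_0_online(R, g):
        # One table lookup per result, just like in-degree or PageRank.
        return {u: g.get(u, 0.0) for u in R}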

Effectiveness of SS-SALSA-0

[Chart: NDCG@10]

• Computed off-line, looking up one score per result at query-time (like in-degree, PageRank)

• Substantially less effective than PageRank and in-degree

Woodman: SS-SALSA-1

• Offline:
– Input: Web graph (V,E), sampling parameters a, b
– Output: Score map g: V → (V → ℝ)
– For each v ∈ V:
  • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
  • Compute SALSA scores s[u] for each u ∈ B; s[u] = 0 for u ∉ B
  • Set g[v] := s (which is of type V → ℝ)
• Online, given result set R and score map g:
– For each u ∈ R: assign score Σ_{v∈R} g[v][u] (see the sketch below)
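Under the reading above, the query-time step sums, for each result u, the scores that the precomputed maps of all results in R assign to u. A sketch:

    def ss_salsa_1_online(R, g):
        # g[v] is the per-seed score map (type V -> R) computed offline.
        scores = {u: 0.0 for u in R}
        for v in R:
            score_map = g.get(v, {})
            for u in R:
                scores[u] += score_map.get(u, 0.0)
        return scores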

Effectiveness of SS-SALSA-1

[Chart: NDCG@10]

• Looking up |B| (≤ a+b+1) scores per result at query-time
• More effective than PageRank; less effective than CS-SALSA
• Better to sample no parents, more children

– Counter-intuitive when viewing hyperlinks as endorsements

Tinman: SS-SALSA-2

• Same as SS-SALSA-1, except that the offline step uses a modified definition of B (sketched below):
– Sample a parents and b children of the “result” (the seed vertex) as before
– Additionally, include c children (“siblings”) of each sampled parent, and d parents (“mates”) of each sampled child
– So, SS-SALSA-2 has four free parameters a, b, c, d
– Neighborhood graph and score maps are potentially much larger
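A sketch of the enlarged singleton-seed neighborhood, again using the hypothetical consistent_sample helper:

    def ss_salsa_2_base_set(v, parents, children, a, b, c, d):
        sampled_parents = consistent_sample(parents.get(v, set()), a)
        sampled_children = consistent_sample(children.get(v, set()), b)
        B = {v} | sampled_parents | sampled_children
        for p in sampled_parents:
            B |= consistent_sample(children.get(p, set()), c)        # "siblings"
        for c_child in sampled_children:
            B |= consistent_sample(parents.get(c_child, set()), d)   # "mates"
        return B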

Effectiveness of SS-SALSA-2

• Effectiveness increases monotonically as b (number of sampled children per result) is increased

• Increases further as d (number of sampled mates per sampled child) is increased

• Setting a (number of sampled parents per result) to 0 is best; other values are fairly indistinguishable
• SS-SALSA-2(0,all,0,75) has NDCG@10 of 0.157
– Compared to 0.182 for CS-SALSA(2,1)
– Huge space cost: ~7500 scores for every page in the corpus!

Ironman: SS-SALSA-3

• Idea: Bound the size of the score map
• For every seed vertex v:

– Fix neighborhood vertex set B and compute scores s in the same way as in SS-SALSA-2

– Set g[v] := topk(s), the vertex-to-score mapping of the k highest-scoring vertices in B

• Note that v itself might not be part of topk(s)
• SS-SALSA-3 has five free parameters a, b, c, d, k (truncation sketched below)
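The truncation step itself is just a top-k selection over the seed's score map; a minimal sketch:

    import heapq

    def truncate_score_map(s, k):
        # Keep the k highest-scoring vertices; note that the seed itself
        # may be dropped if it is not among the top k.
        return dict(heapq.nlargest(k, s.items(), key=lambda kv: kv[1]))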

Effectiveness of SS-SALSA-3

• Fixed a=0, b=all, c=0, d=75
• SS-SALSA-3 outperforms PageRank starting at two-entry score maps

Recap: Comparing algorithms

[Bar chart: NDCG@10; legend: “new, all-online” vs. “new, pre-computed”]
BM25F                       .221
CS-SALSA(2,1)               .182
SALSA(RS,ID,all,3)          .158
SS-SALSA-2(0,all,0,75)      .157
SS-SALSA-3(0,all,0,75,10)   .153
SS-SALSA-1(0,5)             .140
inter-domain in-degree      .106
HITS(RS,ID,all,25)          .104
PageRank                    .092
Random                      .011

Breakdown by query specificity

• How do SALSA variants, PageRank, and BM25F perform for different classes of queries?

• Different ways to classify queries:
– Informational, navigational, transactional (Broder’s taxonomy)
– Commercial vs. non-commercial intent
– General vs. specific
• How to measure specificity?
– Ideally, by size of result set
– Approximation: sum of IDFs of query terms (see the sketch below)
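A minimal sketch of that approximation, assuming a document-frequency table df and a corpus size num_docs (both hypothetical inputs, not part of the slides):

    import math

    def query_specificity(query_terms, df, num_docs):
        # Sum of IDFs: rare terms contribute large IDF values,
        # so higher totals indicate more specific queries.
        return sum(math.log(num_docs / (1 + df.get(t, 0))) for t in query_terms)

    # A query of rare terms scores as more specific than one of common terms.
    df = {"the": 900_000, "salsa": 1_200, "precomputed": 300}
    print(query_specificity(["precomputed", "salsa"], df, 1_000_000))
    print(query_specificity(["the"], df, 1_000_000))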

Breakdown by query specificity

• CS-SALSA >> SS-SALSA-* for general queries
• SS-SALSA-3 as good as SS-SALSA-2 for general queries

Related work
• The quest for correct information on the web: hyper search engines (Marchiori 1997)
• The PageRank citation ranking: Bringing order to the web (Page, Brin, Motwani, Winograd 1998)
• Authoritative sources in a hyperlinked environment (Kleinberg 1998)
• The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect (Lempel & Moran 2000)
• Using Bloom filters to speed up HITS-like ranking algorithms (Gollapudi, Najork, Panigrahy 2007)
• Less is More: Sampling the neighborhood graph makes SALSA better and faster (Najork, Gollapudi, Panigrahy 2009)

Critique

• Data sets not publicly available:
“I have a serious problem with the data set used by the authors. It is large, apparently well built, and not publicly available. There is by now stream of papers using these data and making strong claims about the effectiveness of all ranking methods for the web at major conferences; for these papers no claim can be confirmed or evaluated.” (anonymous WSDM 2009 reviewer)

Plan to repeat using standard collections.

Critique
• Issues with data sets:
– Web graph is old
– Small fraction of results are judged
– Intersection between graph & results is modest
– See above: plan to repeat using public collections
• Examined only effectiveness of isolated features
– Linear combination with BM25F still improves over PageRank & BM25F, but the improvement is much smaller
– Use better methods for combining evidence?
• Good point on the speed/quality curve?
– You be the judge …

QUESTIONS?
