
Efficient and Effective Link Analysis with Precomputed SALSA Maps

Marc Najork (Microsoft Research, Mt View, CA, USA) Nick Craswell (Microsoft Live Search, Cambridge, UK)

Outline
• The problem
• Framework & previous results
• Review of SALSA; introduction of CS-SALSA
• Four pre-computed variants of SALSA:
– Strawman: SS-SALSA-0
– Woodman: SS-SALSA-1
– Tinman: SS-SALSA-2
– Ironman: SS-SALSA-3
• Recap: Comparing old & new
• Breakdown by query specificity
• Related work
• Critique

The problem we are addressing

• Hyperlinks are a valuable feature for ranking of web search results
– Combined with many other features (text, traffic)

• Known query-dependent link-based ranking algorithms (SALSA & variants) provide better signal than known query-independent ones (PageRank, in-degree)

• But: SALSA requires substantial query-time work, whereas PageRank and in-degree are pre-computed

• Can we pre-compute SALSA while preserving signal?

Our experimental framework
• Large web graph
– 464 million crawled pages
– 2.9 billion distinct URLs
– 17.7 billion distinct edges
• Large test set
– 28,043 queries (sampled from Live Search logs)
– 66.8 million result URLs (~2,838/query)
– 485,656 judgments (~17.3/query); six-point scale
• Standard performance measures: MAP, MRR, NDCG
• Same data & measures as used in other work (SIGIR 2007, CIKM 2007, WAW 2007, WSDM 2009)

Previous results on this data set

See CIKM 2007 (for SALSA), SIGIR 2007 (all other results)

[Bar chart: NDCG@10]
BM25F                    .221
SALSA(RS,ID,all,3)       .158
inter-domain in-degree   .106
HITS(RS,ID,all,25)       .104
PageRank                 .092
Random                   .011

Some notation

• Web graph G = (V, E), E ⊆ V × V (eliminating intra-domain edges from E)
• URLs u, v, w ∈ V
• Parent/in-linker set I(v) = { u ∈ V : (u,v) ∈ E }
• Children/out-linker set O(u) = { v ∈ V : (u,v) ∈ E }
• Result set R ⊆ V of a query q

Random vs. consistent sampling

• U_n(X) denotes a uniform random sample of n elements from X
• C_n(X) denotes a consistent sample of n elements from X
• Properties of consistent sampling:
– Deterministic
– Unbiased
– Preserves set similarity: the overlap between C_n(A) and C_n(B) reflects the Jaccard similarity |A ∩ B| / |A ∪ B| of the underlying sets (a sketch of one such sampler follows below)
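The deck does not spell out how the consistent sampler C_n is built. A common construction, assumed here purely for illustration, is to hash every element with a fixed hash function and keep the n elements with the smallest hash values; the same element always hashes the same way, which yields the deterministic, similarity-preserving behavior listed above. A minimal Python sketch:

    import hashlib

    def consistent_sample(elements, n):
        # Deterministic: an element's hash never changes, so overlapping
        # input sets tend to produce overlapping samples.
        def h(x):
            return hashlib.md5(str(x).encode("utf-8")).hexdigest()
        return set(sorted(elements, key=h)[:n])

    # Two largely overlapping sets yield largely overlapping samples.
    A = set(range(0, 100))
    B = set(range(10, 110))
    print(consistent_sample(A, 8) & consistent_sample(B, 8))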

SALSA algorithm (Lempel & Moran 2000)

• Input: Web graph (V,E); result set R of query q
• Form neighborhood graph (B,N) (sketched in code below):
– Expand R to base set B by including all children and n parents (sampled uniformly at random) of each result in R:
  B = ∪_{u∈R} ( {u} ∪ O(u) ∪ U_n(I(u)) )
– Neighborhood edge set N includes all edges in E with endpoints in B:
  N = { (u,v) ∈ E : u ∈ B, v ∈ B }
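A sketch of the neighborhood-graph construction, assuming the web graph is available as two adjacency maps, parents (for I) and children (for O); these names and the dictionary representation are illustrative, not taken from the paper:

    import random

    def build_neighborhood(R, parents, children, n):
        # B: each result, all of its children, and n uniformly sampled parents.
        B = set()
        for u in R:
            B.add(u)
            B |= children.get(u, set())
            in_linkers = list(parents.get(u, set()))
            B |= set(random.sample(in_linkers, min(n, len(in_linkers))))
        # N: every edge whose endpoints both lie in B.
        N = {(u, v) for u in B for v in children.get(u, set()) if v in B}
        return B, N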

SALSA algorithm: Authority scores

• Form two matrices based on (B,N):
– Inverse-indegree matrix I: I_uv = 1/in(v) if (u,v) ∈ N, 0 otherwise
– Inverse-outdegree matrix O: O_uv = 1/out(u) if (u,v) ∈ N, 0 otherwise

• Authority score vector = principal eigenvector of IᵀO (power-iteration sketch below)
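One way to obtain that vector is power iteration, treating IᵀO as the transition matrix of SALSA's authority walk and computing its stationary distribution. The dense numpy sketch below is illustrative only (a real implementation would exploit the sparsity of the neighborhood graph):

    import numpy as np

    def salsa_authority_scores(B, N):
        nodes = sorted(B)
        idx = {v: i for i, v in enumerate(nodes)}
        k = len(nodes)
        indeg = {v: 0 for v in nodes}
        outdeg = {v: 0 for v in nodes}
        for (u, v) in N:
            outdeg[u] += 1
            indeg[v] += 1
        I = np.zeros((k, k))   # inverse-indegree matrix
        O = np.zeros((k, k))   # inverse-outdegree matrix
        for (u, v) in N:
            I[idx[u], idx[v]] = 1.0 / indeg[v]
            O[idx[u], idx[v]] = 1.0 / outdeg[u]
        M = I.T @ O            # authority-walk transition matrix
        s = np.ones(k) / k
        for _ in range(100):   # power iteration toward the stationary vector
            s = s @ M
            if s.sum() > 0:
                s /= s.sum()
        return {v: s[idx[v]] for v in nodes}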

SALSA scores: Operational definition

CS-SALSA

• “Consistent-sampling SALSA” (CS-SALSA)
– Identical to standard SALSA, except:
– Sample in-linkers as well as out-linkers, using consistent sampling (as opposed to random); see the sketch below:
  B = ∪_{u∈R} ( {u} ∪ C_a(I(u)) ∪ C_b(O(u)) )
– Two free sampling parameters a and b
– What are the best settings?
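Under those definitions the base-set construction differs from standard SALSA only in the sampling step. A sketch reusing the hypothetical consistent_sample helper from above:

    def cs_salsa_base_set(R, parents, children, a, b):
        # Each result plus a consistently sampled in-linkers and
        # b consistently sampled out-linkers (deterministic, unlike U_n).
        B = set()
        for u in R:
            B.add(u)
            B |= consistent_sample(parents.get(u, set()), a)
            B |= consistent_sample(children.get(u, set()), b)
        return B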

Effectiveness of CS-SALSA

[Chart: NDCG@10]
• CS-SALSA(2,1) is more effective than standard SALSA (NDCG@10 of 0.182 vs. 0.158)

Basic ideas of “Singleton-seed SALSA”

• Offline (at indexing time):
– Pretend that each v ∈ V is a singleton result set
– Form neighborhood graph around {v}
– Compute SALSA scores on that graph
• Online (at query time):
– Look up pre-computed scores of each v ∈ R and use them

Strawman: SS-SALSA-0

• Offline:
– Input: Web graph (V,E), sampling parameters a, b
– Output: Score map g: V → ℝ
– For each v ∈ V:
  • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
  • Compute SALSA scores s[u] for each u ∈ B
  • Set g[v] := s[v]
• Online, given result set R and score map g:
– For each u ∈ R: assign score g[u] (see the sketch below)
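A sketch of that offline/online split, built on the helpers sketched earlier (all names are illustrative, and the edge set is formed as in CS-SALSA):

    def ss_salsa_0_offline(V, parents, children, a, b):
        g = {}
        for v in V:
            # Pretend {v} is the result set and build its neighborhood graph.
            B = cs_salsa_base_set({v}, parents, children, a, b)
            N = {(u, w) for u in B for w in children.get(u, set()) if w in B}
            s = salsa_authority_scores(B, N)
            g[v] = s.get(v, 0.0)   # keep only the seed's own score
        return g

    def ss_salsa_0_online(R, g):
        # One table lookup per result, just like in-degree or PageRank.
        return {u: g.get(u, 0.0) for u in R}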

Effectiveness of SS-SALSA-0

[Chart: NDCG@10]

• Computed off-line, looking up one score per result at query-time (like in-degree, PageRank)

• Substantially less effective than PageRank and in-degree

Woodman: SS-SALSA-1

• Offline:
– Input: Web graph (V,E), sampling parameters a, b
– Output: Score map g: V → (V → ℝ)
– For each v ∈ V:
  • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA
  • Compute SALSA scores s[u] for each u ∈ B; s[u] = 0 for u ∉ B
  • Set g[v] := s (which is of type V → ℝ)
• Online, given result set R and score map g:
– For each u ∈ R: assign score Σ_{v∈R} g[v][u] (see the sketch below)
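Under the reading above, the query-time step sums, for each result u, the scores that the precomputed maps of all results in R assign to u. A sketch:

    def ss_salsa_1_online(R, g):
        # g[v] is the per-seed score map (type V -> R) computed offline.
        scores = {u: 0.0 for u in R}
        for v in R:
            score_map = g.get(v, {})
            for u in R:
                scores[u] += score_map.get(u, 0.0)
        return scores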

Effectiveness of SS-SALSA-1

[Chart: NDCG@10]

• Looking up |B| (≤ a+b+1) scores per result at query-time
• More effective than PageRank; less effective than CS-SALSA
• Better to sample no parents, more children

– Counter-intuitive when viewing hyperlinks as endorsements

Tinman: SS-SALSA-2

• Same as SS-SALSA-1, except that the offline step uses a modified definition of B (sketched below):
– Sample a parents and b children of the “result” (the seed vertex) as before
– Additionally, include c children (“siblings”) of each sampled parent, and d parents (“mates”) of each sampled child
– So, SS-SALSA-2 has four free parameters a, b, c, d
– Neighborhood graph and score maps are potentially much larger
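A sketch of the enlarged singleton-seed neighborhood, again using the hypothetical consistent_sample helper:

    def ss_salsa_2_base_set(v, parents, children, a, b, c, d):
        sampled_parents = consistent_sample(parents.get(v, set()), a)
        sampled_children = consistent_sample(children.get(v, set()), b)
        B = {v} | sampled_parents | sampled_children
        for p in sampled_parents:
            B |= consistent_sample(children.get(p, set()), c)        # "siblings"
        for c_child in sampled_children:
            B |= consistent_sample(parents.get(c_child, set()), d)   # "mates"
        return B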

Effectiveness of SS-SALSA-2

• Effectiveness increases monotonically as b (number of sampled children per result) is increased

• Increases further as d (number of sampled mates per sampled child) is increased

• Setting a (number of sampled parents per result) to 0 is best; other values are fairly indistinguishable
• SS-SALSA-2(0,all,0,75) has NDCG@10 of 0.157
– Compared to 0.182 for CS-SALSA(2,1)
– Huge space cost: ~7500 scores for every page in the corpus!

Ironman: SS-SALSA-3

• Idea: Bound the size of the score map
• For every seed vertex v:

– Fix neighborhood vertex set B and compute scores s in the same way as in SS-SALSA-2

– Set g[v] := topk(s), the vertex-to-score mapping of the k highest-scoring vertices in B

• Note that v itself might not be part of topk(s)
• SS-SALSA-3 has five free parameters a, b, c, d, k (truncation sketched below)
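The truncation step itself is just a top-k selection over the seed's score map; a minimal sketch:

    import heapq

    def truncate_score_map(s, k):
        # Keep the k highest-scoring vertices; note that the seed itself
        # may be dropped if it is not among the top k.
        return dict(heapq.nlargest(k, s.items(), key=lambda kv: kv[1]))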

Effectiveness of SS-SALSA-3

• Fixed a=0, b=all, c=0, d=75
• SS-SALSA-3 outperforms PageRank starting at two-entry score maps

Recap: Comparing algorithms

[Bar chart: NDCG@10; legend: “new, all-online” vs. “new, pre-computed”]
BM25F                       .221
CS-SALSA(2,1)               .182
SALSA(RS,ID,all,3)          .158
SS-SALSA-2(0,all,0,75)      .157
SS-SALSA-3(0,all,0,75,10)   .153
SS-SALSA-1(0,5)             .140
inter-domain in-degree      .106
HITS(RS,ID,all,25)          .104
PageRank                    .092
Random                      .011

Breakdown by query specificity

• How do SALSA variants, PageRank, and BM25F perform for different classes of queries?

• Different ways to classify queries:
– Informational, navigational, transactional (Broder’s taxonomy)
– Commercial vs. non-commercial intent
– General vs. specific
• How to measure specificity?
– Ideally, by size of result set
– Approximation: sum of IDFs of query terms (see the sketch below)
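A minimal sketch of that approximation, assuming a document-frequency table df and a corpus size num_docs (both hypothetical inputs, not part of the slides):

    import math

    def query_specificity(query_terms, df, num_docs):
        # Sum of IDFs: rare terms contribute large IDF values,
        # so higher totals indicate more specific queries.
        return sum(math.log(num_docs / (1 + df.get(t, 0))) for t in query_terms)

    # A query of rare terms scores as more specific than one of common terms.
    df = {"the": 900_000, "salsa": 1_200, "precomputed": 300}
    print(query_specificity(["precomputed", "salsa"], df, 1_000_000))
    print(query_specificity(["the"], df, 1_000_000))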

Breakdown by query specificity

• CS-SALSA >> SS-SALSA-* for general queries
• SS-SALSA-3 as good as SS-SALSA-2 for general queries

Related work
• The quest for correct information on the web: hyper search engines (Marchiori 1997)
• The PageRank citation ranking: Bringing order to the web (Page, Brin, Motwani, Winograd 1998)
• Authoritative sources in a hyperlinked environment (Kleinberg 1998)
• The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect (Lempel & Moran 2000)
• Using Bloom filters to speed up HITS-like ranking algorithms (Gollapudi, Najork, Panigrahy 2007)
• Less is More: Sampling the neighborhood graph makes SALSA better and faster (Najork, Gollapudi, Panigrahy 2009)

Critique

• Data sets not publicly available:
“I have a serious problem with the data set used by the authors. It is large, apparently well built, and not publicly available. There is by now stream of papers using these data and making strong claims about the effectiveness of all ranking methods for the web at major conferences; for these papers no claim can be confirmed or evaluated.” (anonymous WSDM 2009 reviewer)

Plan to repeat using standard collections.

Critique
• Issues with data sets:
– Web graph is old
– Small fraction of results are judged
– Intersection between graph & results is modest
– See above: plan to repeat using public collections
• Examined only effectiveness of isolated features
– Linear combination with BM25F still improves over PageRank & BM25F, but the improvement is much smaller
– Use better methods for combining evidence?
• Good point on the speed/quality curve?
– You be the judge …

QUESTIONS?
