User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information
[Diagram: sources A, B, C (e.g., sensors, web sites) feed a central monitoring node under resource constraints]
Central monitoring node:
• Support integrated querying
• Maintain historical archive
Workload-driven Approach
Goal: meet usage needs, while adhering to resource constraints
Tactic: pay attention to workload
• workload = usage + data dynamics
Current focus: autonomous sources (this talk)
– Data archival from Web sources [VLDB’04]
– Supporting Web search [WWW’05]
Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]
Outline
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Web Crawling to Support Search
[Diagram: a crawler fetches pages from web sites A, B, C (under a resource constraint) into the search engine's repository and index; users issue search queries against the index]
Q: Given a full repository, when to refresh each page?
Approach
Faced with an optimization problem
Others:
– Maximize freshness, age, or similar
– Boolean model of document change
Our approach:
– User-centric optimization objective
– Rich notion of document change, attuned to the user-centric objective
Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Illustration: ranked list of result documents]
Objective: Maximize Repository Quality, from Search Perspective
Suppose a user issues search query q
Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q)

Given a workload W of user queries:

Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
Viewing Likelihood
[Plot: probability of viewing a result vs. its rank, dropping steeply over ranks 1–150]
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:
ViewProbability(r) ∝ r^(–1.5)
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each D/Q pair → numerical score in [0,1]
Combination of many factors, e.g.:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
(Caveat)
Using the scoring function for absolute relevance (normally it is only used for relative ranking)
– Need to ensure the scoring function has meaning on an absolute scale
  Probabilistic IR models, PageRank: okay
  Unclear whether TF.IDF does (still debated, I believe)
Bottom line: stricter interpretability requirement than “good relative ordering”
Measuring Quality
Avg. Quality = Σ_q (freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q))
where:
– freq_q comes from the query logs
– likelihood of viewing d = ViewProb( Rank(d, q) ), with ViewProb estimated from usage logs and Rank computed by the scoring function over the (possibly stale) repository
– relevance of d to q is the scoring function applied to the “live” copy of d
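A minimal sketch of this computation (Python; the dictionary-based inputs and the names view_prob / avg_quality are illustrative assumptions, not code from the talk):

```python
def view_prob(rank):
    # Viewing likelihood by result rank; ViewProbability(r) ~ r^(-1.5), per the AltaVista data above.
    return rank ** -1.5

def avg_quality(query_freqs, stale_rank, live_rel):
    # query_freqs: {query: frequency in the query logs}
    # stale_rank:  {query: {doc: rank assigned using the (possibly stale) repository copy}}
    # live_rel:    {query: {doc: relevance of the "live" copy of doc to query}}
    k = sum(query_freqs.values())
    total = 0.0
    for q, freq in query_freqs.items():
        quality_q = sum(view_prob(stale_rank[q][d]) * rel for d, rel in live_rel[q].items())
        total += freq * quality_q
    return total / k
```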
Lessons from Quality Metric
• ViewProb(r) is monotonically nonincreasing ⇒ quality is maximized when the ranking function orders documents in descending order of true relevance
• Out-of-date repository: scrambles the ranking ⇒ lowers quality

Avg. Quality = Σ_q (freq_q × Σ_d (ViewProb( Rank(d, q) ) × Relevance(d, q)))

• Let ΔQ_D = loss in quality due to inaccurate information about D
  (alternatively, the improvement in quality if we (re)download D)
ΔQD: Improvement in Quality
[Diagram: RE-DOWNLOAD replaces the repository copy of D (stale) with the web copy of D (fresh); Repository Quality += ΔQ_D]
Formula for Quality Gain (ΔQD)
Re-download document D at time t.
Quality beforehand:
  Q(t–) = Σ_q (freq_q × Σ_d (ViewProb( Rank_{t–}(d, q) ) × Relevance(d, q)))
Quality after re-download:
  Q(t) = Σ_q (freq_q × Σ_d (ViewProb( Rank_t(d, q) ) × Relevance(d, q)))
Quality gain:
  ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d (ΔVP × Relevance(d, q)))
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
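A small sketch of ΔQ_D(t) under these definitions (Python, illustrative only; only the view probabilities change between t– and t, since relevance is always measured against the live copies):

```python
view_prob = lambda r: r ** -1.5   # same power-law viewing likelihood as before

def quality_gain(query_freqs, rank_before, rank_after, live_rel):
    # rank_before / rank_after: {query: {doc: rank}} just before / just after re-downloading D.
    # live_rel: {query: {doc: relevance of the live copy}}; query_freqs: {query: frequency}.
    gain = 0.0
    for q, freq in query_freqs.items():
        delta_vp = sum((view_prob(rank_after[q][d]) - view_prob(rank_before[q][d])) * rel
                       for d, rel in live_rel[q].items())
        gain += freq * delta_vp
    return gain
```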
Download Prioritization
Three difficulties:
1. ΔQ_D depends on the order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. The live copy is usually unavailable
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly
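For illustration, the prioritization step itself is just a top-k selection over forecast gains (a sketch; predicted_gain and the per-cycle budget are assumed inputs, not names from the talk):

```python
import heapq

def choose_downloads(predicted_gain, budget):
    # predicted_gain: {doc: forecast ΔQ_D}; budget: number of (re)downloads affordable this cycle.
    # Highest expected quality gain first.
    return heapq.nlargest(budget, predicted_gain, key=predicted_gain.get)
```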
Difficulty 1: Order of Downloading Matters
ΔQ_D depends on the relative rank positions of D; hence, ΔQ_D depends on the order of downloading
To reduce implementation complexity, avoid tracking inter-document ordering dependencies
⇒ Assume ΔQ_D is independent of the downloading of other documents:
  ΔQ_D(t) = Σ_q (freq_q × Σ_d (ΔVP × Relevance(d, q)))
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
Difficulty 3: Live Copy Unavailable
Take measurements upon re-downloading D (the live copy is available at that time)
Use forecasting techniques to project forward
[Timeline: ΔQ_D(t1), ΔQ_D(t2) measured at past re-downloads; forecast ΔQ_D(t_now) projected to the present]
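One plausible forecasting choice (the slide does not commit to a specific technique, so this is an assumption) is an exponentially weighted moving average over the ΔQ_D values measured at past re-downloads:

```python
def forecast_gain(past_gains, alpha=0.3):
    # past_gains: ΔQ_D measurements from past re-downloads of a document, oldest to newest.
    # alpha is a smoothing parameter chosen here purely for illustration.
    estimate = None
    for g in past_gains:
        estimate = g if estimate is None else alpha * g + (1 - alpha) * estimate
    return 0.0 if estimate is None else estimate
```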
Ability to Forecast ΔQD
[Scatter plot: avg. weekly ΔQ_D (log scale) in the first 24 weeks vs. the second 24 weeks, with the top 50%, top 80%, and top 90% of documents marked]
Data: 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
Docs downloaded once per week, in random order
Strategy So Far
Measure the shift in quality (ΔQ_D) each time document D is re-downloaded
Forecast future ΔQ_D
– Treat each D independently
Prioritize re-downloading by ΔQ_D
Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
Difficulty 2: Metric Expensive to Compute
Example: the “live” copy of D becomes less relevant to query q than before
• Now D is ranked too high
• Some users visit D in lieu of Y, which is more relevant
• Result: less-than-ideal quality

Results for q:
  Actual    Ideal
  1. X      1. X
  2. D      2. Y
  3. Y      3. Z
  4. Z      4. D

One problem: measurements of other documents are required
– Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z
Solution: estimate! Use approximate relevance→rank mapping functions, fit in advance for each query
Estimation Procedure
Focus on query q (later we’ll see how to sum across all affected queries)
Let F_q(rel) be the relevance→rank mapping for q
– We use a piecewise linear function in log-log space
– Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))), r2 = D’s new rank (r2 = F_q(Rel(D, q)))
– Use an integral approximation of the summation
DETAIL
ΔQ_{D,q} = Σ_d (ΔVP(d,q) × Rel(d,q)) = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} (ΔVP(d,q) × Rel(d,q))
with Σ_{d≠D} (ΔVP(d,q) × Rel(d,q)) ≈ Σ_{r=r1+1…r2} (VP(r–1) – VP(r)) × F_q^{–1}(r)
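A rough Python rendering of this estimation step (illustrative; rank_of_rel plays the role of F_q and rel_of_rank of F_q^{–1}, both assumed fit in advance; the sign handling for upward vs. downward rank moves is my own bookkeeping, not spelled out on the slide):

```python
view_prob = lambda r: r ** -1.5

def estimate_gain_for_query(rel_new, rel_old, rank_of_rel, rel_of_rank):
    # rel_new / rel_old: relevance of the live vs. repository copy of D to query q.
    r1 = max(1, int(round(rank_of_rel(rel_old))))   # D's old rank, F_q(Rel(D_old, q))
    r2 = max(1, int(round(rank_of_rel(rel_new))))   # D's new rank, F_q(Rel(D, q))
    gain = (view_prob(r2) - view_prob(r1)) * rel_new   # D's own change in expected views
    lo, hi = min(r1, r2), max(r1, r2)
    sign = 1 if r2 > r1 else -1                     # displaced docs move up (+) or down (-) one rank
    for r in range(lo + 1, hi + 1):
        gain += sign * (view_prob(r - 1) - view_prob(r)) * rel_of_rank(r)
    return gain
```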
Where we stand …
DETAIL
ΔQ_{D,q} = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} (ΔVP(d,q) × Rel(d,q))
  where ΔVP(D,q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
  and Σ_{d≠D} (ΔVP(d,q) × Rel(d,q)) ≈ f(Rel(D,q), Rel(D_old,q))
⇒ ΔQ_{D,q} ≈ g(Rel(D,q), Rel(D_old,q))
Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})
Difficulty 2, continued
Additional problem: must measure effect of shift in rank across all queries.
Solution: couple measurements with index updating operations
Sketch:
– Basic index unit: posting. Conceptually: [ term ID | document ID | scoring factors ]
– Each time a posting is inserted/deleted/updated, compute the old & new relevance contributions of the term/document pair*
– Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D
* assumes the scoring function treats term/document pairs independently
Background: Text Indexes
Dictionary                          Postings (Doc #, Freq)
Term   # docs   Total freq
aid    1        1             →    (58, 1)
all    2        2             →    (37, 1), (62, 1)
cold   1        1             →    (15, 1)
duck   1        2             →    (41, 2)

Basic index unit: posting
– One posting for each term/document pair
– Contains the information needed by the scoring function
  (number of occurrences, font size, etc.)
DETAIL
Pre-Processing: Approximate the Workload
Break multi-term queries into a set of single-term queries (a sketch follows the table below)
– Now, term = query
– Index has one posting for each query/document pair
DETAIL
Dictionary                          Postings (Doc #, Freq)
Term (= query)   # docs   Total freq
aid              1        1        →    (58, 1)
all              2        2        →    (37, 1), (62, 1)
cold             1        1        →    (15, 1)
duck             1        2        →    (41, 2)
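The workload approximation mentioned above, as a tiny sketch (Python; treating the query log as a {query string: frequency} mapping is an assumption for illustration):

```python
from collections import Counter

def single_term_workload(query_log):
    # Each multi-term query contributes its frequency to every constituent term,
    # so from here on "query" always means a single term.
    freqs = Counter()
    for query, freq in query_log.items():
        for term in query.split():
            freqs[term] += freq
    return freqs
```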
Taking Measurements During Index Maintenance
While updating index:
– Initialize a bank of ΔQ_D accumulators, one per document
  (actually, materialized on demand using a hash table)
– Each time a posting is inserted/deleted/updated:
  • Compute the new & old relevance contributions for the query/document pair: Rel(D,q), Rel(D_old,q)
  • Compute ΔQ_{D,q} using the estimation procedure, and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D,q), Rel(D_old,q))
DETAIL
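A sketch of the accumulator bookkeeping described above (Python rather than the talk's Lucene implementation; estimate_gain stands in for g(Rel(D,q), Rel(D_old,q)) and term_freq for the single-term query frequencies):

```python
from collections import defaultdict

class QualityAccumulator:
    def __init__(self, term_freq, estimate_gain):
        self.term_freq = term_freq            # {term (= query): frequency in the workload}
        self.estimate_gain = estimate_gain    # callable implementing g(Rel(D,q), Rel(D_old,q))
        self.delta_q = defaultdict(float)     # ΔQ_D accumulators, materialized on demand per doc

    def posting_updated(self, term, doc, rel_new, rel_old):
        # Called once per posting inserted/deleted/updated during index maintenance;
        # rel_new / rel_old are the relevance contributions of this term/document pair.
        self.delta_q[doc] += self.term_freq.get(term, 0) * self.estimate_gain(rel_new, rel_old)
```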
Measurement Overhead
Caveat: does not handle factors that do not depend on a single term/document pair, e.g., term proximity and anchortext inclusion
Implemented in Lucene
Summary of Approach
User-centric metric of search repository quality
(Re)downloading document improves quality
Prioritize downloading by expected quality gain
Metric adaptations to enable feasible+efficient implementation
Next: Empirical Results
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement, comparing the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Reasons for Improvement
• Does not rely on the size of a text change to estimate importance
Example (boston.com): tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload
Reasons for Improvement
• Accounts for “false negatives”
• Does not always ignore frequently-updated pages
Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
General-purpose web crawling
– [Cho & Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
– Maximize average freshness or age
– Balance new downloads vs. re-downloading old documents
Focused / topic-specific crawling
– [Chakrabarti, many others]
– Select a subset of documents that match user interests
– Our work: given a set of documents, decide when to (re)download each
Most Closely Related Work
[Wolf et al., WWW’02]:
– Maximize weighted average freshness
– Document weight = probability of “embarrassment” if the page is not fresh
User-Centric Crawling:
– Measure the interplay between update and query workloads
  When document X is updated, which queries are affected by the update, and by how much?
– Metric penalizes false negatives
  A document ranked #1000 for a popular query that should be ranked #2: small embarrassment, but a big loss in quality
Future Work: Detecting Change-Rate Changes
Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQD)
No provision to explore change-rates explicitly
Explore/exploit tradeoff
– Ongoing work on a Bandit Problem formulation
Bad case: change-rate = 0, so the page is never monitored
– Won’t notice a future increase in its change-rate
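As a toy illustration of the explore/exploit idea (an ε-greedy sketch of my own, not the bandit formulation under development):

```python
import random

def pick_page_to_monitor(predicted_gain, epsilon=0.1):
    # With probability epsilon, monitor a page chosen uniformly at random, so a page whose
    # estimated change-rate has dropped to zero is still revisited occasionally;
    # otherwise exploit the current ΔQ_D forecasts.
    pages = list(predicted_gain)
    if random.random() < epsilon:
        return random.choice(pages)
    return max(pages, key=predicted_gain.get)
```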
Summary
Approach:
– User-centric metric of search engine quality
– Schedule downloading to maximize quality
Empirical results:
– High quality with few downloads
– Good at picking the “right” documents to re-download