User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information
[Diagram: sources A, B, C (e.g., sensors, web sites) feed a central monitoring node under resource constraints]
Central monitoring node:
• Support integrated querying
• Maintain historical archive
Workload-driven Approach
Goal: meet usage needs, while adhering to resource constraints
Tactic: pay attention to workload
• workload = usage + data dynamics
Current focus: autonomous sources (this talk)
– Data archival from Web sources [VLDB’04]
– Supporting Web search [WWW’05]
Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b]
Outline
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Web Crawling to Support Search
[Diagram: a crawler fetches pages from web sites A, B, C (under a resource constraint) into the search engine's repository and index; users issue search queries against the index]
Q: Given a full repository, when to refresh each page?
Approach
Faced with an optimization problem
Others:
– Maximize freshness, age, or similar
– Boolean model of document change
Our approach:
– User-centric optimization objective
– Rich notion of document change, attuned to the user-centric objective
Web Search User Interface
1. User enters keywords
2. Search engine returns ranked list of results
3. User visits subset of results
[Illustration: ranked list of result documents]
Objective: Maximize Repository Quality, from Search Perspective
Suppose a user issues search query q
Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q)

Given a workload W of user queries:

Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
Viewing Likelihood
[Plot: probability of viewing a result vs. its rank, dropping steeply over ranks 1–150]
• Depends primarily on rank in list [Joachims KDD’02]
• From AltaVista data [Lempel et al. WWW’03]:
ViewProbability(r) ∝ r^(–1.5)
Relevance Scoring Function
Search engines’ internal notion of how well a document matches a query
Each D/Q pair → numerical score in [0,1]
Combination of many factors, e.g.:
– Vector-space similarity (e.g., TF.IDF cosine metric)
– Link-based factors (e.g., PageRank)
– Anchortext of referring pages
(Caveat)
Using the scoring function for absolute relevance (normally it is only used for relative ranking)
– Need to ensure the scoring function has meaning on an absolute scale
  Probabilistic IR models, PageRank: okay
  Unclear whether TF.IDF does (still debated, I believe)
Bottom line: stricter interpretability requirement than “good relative ordering”
Measuring Quality
Avg. Quality = Σ_q (freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q))
where:
– freq_q comes from the query logs
– likelihood of viewing d = ViewProb( Rank(d, q) ), with ViewProb estimated from usage logs and Rank computed by the scoring function over the (possibly stale) repository
– relevance of d to q is the scoring function applied to the “live” copy of d
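A minimal sketch of this computation (Python; the dictionary-based inputs and the names view_prob / avg_quality are illustrative assumptions, not code from the talk):

```python
def view_prob(rank):
    # Viewing likelihood by result rank; ViewProbability(r) ~ r^(-1.5), per the AltaVista data above.
    return rank ** -1.5

def avg_quality(query_freqs, stale_rank, live_rel):
    # query_freqs: {query: frequency in the query logs}
    # stale_rank:  {query: {doc: rank assigned using the (possibly stale) repository copy}}
    # live_rel:    {query: {doc: relevance of the "live" copy of doc to query}}
    k = sum(query_freqs.values())
    total = 0.0
    for q, freq in query_freqs.items():
        quality_q = sum(view_prob(stale_rank[q][d]) * rel for d, rel in live_rel[q].items())
        total += freq * quality_q
    return total / k
```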
Lessons from Quality Metric
• ViewProb(r) is monotonically nonincreasing ⇒ quality is maximized when the ranking function orders documents in descending order of true relevance
• Out-of-date repository: scrambles the ranking ⇒ lowers quality

Avg. Quality = Σ_q (freq_q × Σ_d (ViewProb( Rank(d, q) ) × Relevance(d, q)))

• Let ΔQ_D = loss in quality due to inaccurate information about D
  (alternatively, the improvement in quality if we (re)download D)
ΔQD: Improvement in Quality
[Diagram: RE-DOWNLOAD replaces the repository copy of D (stale) with the web copy of D (fresh); Repository Quality += ΔQ_D]
Formula for Quality Gain (ΔQD)
Re-download document D at time t.
Quality beforehand:
  Q(t–) = Σ_q (freq_q × Σ_d (ViewProb( Rank_{t–}(d, q) ) × Relevance(d, q)))
Quality after re-download:
  Q(t) = Σ_q (freq_q × Σ_d (ViewProb( Rank_t(d, q) ) × Relevance(d, q)))
Quality gain:
  ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d (ΔVP × Relevance(d, q)))
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
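A small sketch of ΔQ_D(t) under these definitions (Python, illustrative only; only the view probabilities change between t– and t, since relevance is always measured against the live copies):

```python
view_prob = lambda r: r ** -1.5   # same power-law viewing likelihood as before

def quality_gain(query_freqs, rank_before, rank_after, live_rel):
    # rank_before / rank_after: {query: {doc: rank}} just before / just after re-downloading D.
    # live_rel: {query: {doc: relevance of the live copy}}; query_freqs: {query: frequency}.
    gain = 0.0
    for q, freq in query_freqs.items():
        delta_vp = sum((view_prob(rank_after[q][d]) - view_prob(rank_before[q][d])) * rel
                       for d, rel in live_rel[q].items())
        gain += freq * delta_vp
    return gain
```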
Download Prioritization
Three difficulties:
1. ΔQ_D depends on the order of downloading
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
3. The live copy is usually unavailable
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly
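For illustration, the prioritization step itself is just a top-k selection over forecast gains (a sketch; predicted_gain and the per-cycle budget are assumed inputs, not names from the talk):

```python
import heapq

def choose_downloads(predicted_gain, budget):
    # predicted_gain: {doc: forecast ΔQ_D}; budget: number of (re)downloads affordable this cycle.
    # Highest expected quality gain first.
    return heapq.nlargest(budget, predicted_gain, key=predicted_gain.get)
```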
Difficulty 1: Order of Downloading Matters
ΔQ_D depends on the relative rank positions of D; hence, ΔQ_D depends on the order of downloading
To reduce implementation complexity, avoid tracking inter-document ordering dependencies
⇒ Assume ΔQ_D is independent of the downloading of other documents:
  ΔQ_D(t) = Σ_q (freq_q × Σ_d (ΔVP × Relevance(d, q)))
  where ΔVP = ViewProb( Rank_t(d, q) ) – ViewProb( Rank_{t–}(d, q) )
Difficulty 3: Live Copy Unavailable
Take measurements upon re-downloading D (the live copy is available at that time)
Use forecasting techniques to project forward
[Timeline: ΔQ_D(t1), ΔQ_D(t2) measured at past re-downloads; forecast ΔQ_D(t_now) projected to the present]
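One plausible forecasting choice (the slide does not commit to a specific technique, so this is an assumption) is an exponentially weighted moving average over the ΔQ_D values measured at past re-downloads:

```python
def forecast_gain(past_gains, alpha=0.3):
    # past_gains: ΔQ_D measurements from past re-downloads of a document, oldest to newest.
    # alpha is a smoothing parameter chosen here purely for illustration.
    estimate = None
    for g in past_gains:
        estimate = g if estimate is None else alpha * g + (1 - alpha) * estimate
    return 0.0 if estimate is None else estimate
```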
Ability to Forecast ΔQD
[Scatter plot: avg. weekly ΔQ_D (log scale) in the first 24 weeks vs. the second 24 weeks, with the top 50%, top 80%, and top 90% of documents marked]
Data: 15 web sites sampled from OpenDirectory topics
Queries: AltaVista query log
Docs downloaded once per week, in random order
Strategy So Far
Measure the shift in quality (ΔQ_D) each time document D is re-downloaded
Forecast future ΔQ_D
– Treat each D independently
Prioritize re-downloading by ΔQ_D
Remaining difficulty:
2. Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive
Difficulty 2: Metric Expensive to Compute
Example: the “live” copy of D becomes less relevant to query q than before
• Now D is ranked too high
• Some users visit D in lieu of Y, which is more relevant
• Result: less-than-ideal quality

Results for q:
  Actual    Ideal
  1. X      1. X
  2. D      2. Y
  3. Y      3. Z
  4. Z      4. D

One problem: measurements of other documents are required
– Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z
Solution: estimate! Use approximate relevance→rank mapping functions, fit in advance for each query
Estimation Procedure
Focus on query q (later we’ll see how to sum across all affected queries)
Let F_q(rel) be the relevance→rank mapping for q
– We use a piecewise linear function in log-log space
– Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))), r2 = D’s new rank (r2 = F_q(Rel(D, q)))
– Use an integral approximation of the summation
DETAIL
ΔQ_{D,q} = Σ_d (ΔVP(d,q) × Rel(d,q)) = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} (ΔVP(d,q) × Rel(d,q))
with Σ_{d≠D} (ΔVP(d,q) × Rel(d,q)) ≈ Σ_{r=r1+1…r2} (VP(r–1) – VP(r)) × F_q^{–1}(r)
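A rough Python rendering of this estimation step (illustrative; rank_of_rel plays the role of F_q and rel_of_rank of F_q^{–1}, both assumed fit in advance; the sign handling for upward vs. downward rank moves is my own bookkeeping, not spelled out on the slide):

```python
view_prob = lambda r: r ** -1.5

def estimate_gain_for_query(rel_new, rel_old, rank_of_rel, rel_of_rank):
    # rel_new / rel_old: relevance of the live vs. repository copy of D to query q.
    r1 = max(1, int(round(rank_of_rel(rel_old))))   # D's old rank, F_q(Rel(D_old, q))
    r2 = max(1, int(round(rank_of_rel(rel_new))))   # D's new rank, F_q(Rel(D, q))
    gain = (view_prob(r2) - view_prob(r1)) * rel_new   # D's own change in expected views
    lo, hi = min(r1, r2), max(r1, r2)
    sign = 1 if r2 > r1 else -1                     # displaced docs move up (+) or down (-) one rank
    for r in range(lo + 1, hi + 1):
        gain += sign * (view_prob(r - 1) - view_prob(r)) * rel_of_rank(r)
    return gain
```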
Where we stand …
DETAIL
ΔQ_{D,q} = ΔVP(D,q) × Rel(D,q) + Σ_{d≠D} (ΔVP(d,q) × Rel(d,q))
  where ΔVP(D,q) ≈ VP( F_q(Rel(D, q)) ) – VP( F_q(Rel(D_old, q)) )
  and Σ_{d≠D} (ΔVP(d,q) × Rel(d,q)) ≈ f(Rel(D,q), Rel(D_old,q))
⇒ ΔQ_{D,q} ≈ g(Rel(D,q), Rel(D_old,q))
Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})
Difficulty 2, continued
Additional problem: must measure effect of shift in rank across all queries.
Solution: couple measurements with index updating operations
Sketch:
– Basic index unit: posting. Conceptually: [ term ID | document ID | scoring factors ]
– Each time a posting is inserted/deleted/updated, compute the old & new relevance contributions of the term/document pair*
– Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D
* assumes the scoring function treats term/document pairs independently
Background: Text Indexes
Dictionary                          Postings (Doc #, Freq)
Term   # docs   Total freq
aid    1        1             →    (58, 1)
all    2        2             →    (37, 1), (62, 1)
cold   1        1             →    (15, 1)
duck   1        2             →    (41, 2)

Basic index unit: posting
– One posting for each term/document pair
– Contains the information needed by the scoring function
  (number of occurrences, font size, etc.)
DETAIL
Pre-Processing: Approximate the Workload
Break multi-term queries into a set of single-term queries (a sketch follows the table below)
– Now, term = query
– Index has one posting for each query/document pair
DETAIL
Dictionary                          Postings (Doc #, Freq)
Term (= query)   # docs   Total freq
aid              1        1        →    (58, 1)
all              2        2        →    (37, 1), (62, 1)
cold             1        1        →    (15, 1)
duck             1        2        →    (41, 2)
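The workload approximation mentioned above, as a tiny sketch (Python; treating the query log as a {query string: frequency} mapping is an assumption for illustration):

```python
from collections import Counter

def single_term_workload(query_log):
    # Each multi-term query contributes its frequency to every constituent term,
    # so from here on "query" always means a single term.
    freqs = Counter()
    for query, freq in query_log.items():
        for term in query.split():
            freqs[term] += freq
    return freqs
```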
Taking Measurements During Index Maintenance
While updating index:
– Initialize a bank of ΔQ_D accumulators, one per document
  (actually, materialized on demand using a hash table)
– Each time a posting is inserted/deleted/updated:
  • Compute the new & old relevance contributions for the query/document pair: Rel(D,q), Rel(D_old,q)
  • Compute ΔQ_{D,q} using the estimation procedure, and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D,q), Rel(D_old,q))
DETAIL
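A sketch of the accumulator bookkeeping described above (Python rather than the talk's Lucene implementation; estimate_gain stands in for g(Rel(D,q), Rel(D_old,q)) and term_freq for the single-term query frequencies):

```python
from collections import defaultdict

class QualityAccumulator:
    def __init__(self, term_freq, estimate_gain):
        self.term_freq = term_freq            # {term (= query): frequency in the workload}
        self.estimate_gain = estimate_gain    # callable implementing g(Rel(D,q), Rel(D_old,q))
        self.delta_q = defaultdict(float)     # ΔQ_D accumulators, materialized on demand per doc

    def posting_updated(self, term, doc, rel_new, rel_old):
        # Called once per posting inserted/deleted/updated during index maintenance;
        # rel_new / rel_old are the relevance contributions of this term/document pair.
        self.delta_q[doc] += self.term_freq.get(term, 0) * self.estimate_gain(rel_new, rel_old)
```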
Measurement Overhead
Caveat: does not handle factors that do not depend on a single term/document pair, e.g., term proximity and anchortext inclusion
Implemented in Lucene
Summary of Approach
User-centric metric of search repository quality
(Re)downloading document improves quality
Prioritize downloading by expected quality gain
Metric adaptations to enable feasible+efficient implementation
Next: Empirical Results
• Introduction: monitoring distributed sources
• User-centric web crawling
  – Model + approach
  – Empirical results
  – Related & future work
Overall Effectiveness
Staleness = fraction of out-of-date documents* [Cho et al. 2000]
Embarrassment = probability that user visits irrelevant result* [Wolf et al. 2002]
* Used “shingling” to filter out “trivial” changes
Scoring function: PageRank (similar results for TF.IDF)
[Plot: quality (fraction of ideal) vs. resource requirement, comparing the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Reasons for Improvement
• Does not rely on the size of a text change to estimate importance
Example (boston.com): tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload
Reasons for Improvement
• Accounts for “false negatives”
• Does not always ignore frequently-updated pages
Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
General-purpose web crawling
– [Cho & Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01]
– Maximize average freshness or age
– Balance new downloads vs. re-downloading old documents
Focused / topic-specific crawling
– [Chakrabarti, many others]
– Select a subset of documents that match user interests
– Our work: given a set of documents, decide when to (re)download each
Most Closely Related Work
[Wolf et al., WWW’02]:
– Maximize weighted average freshness
– Document weight = probability of “embarrassment” if the page is not fresh
User-Centric Crawling:
– Measure the interplay between update and query workloads
  When document X is updated, which queries are affected by the update, and by how much?
– Metric penalizes false negatives
  A document ranked #1000 for a popular query that should be ranked #2: small embarrassment, but a big loss in quality
Future Work: Detecting Change-Rate Changes
Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQD)
No provision to explore change-rates explicitly
Explore/exploit tradeoff
– Ongoing work on a Bandit Problem formulation
Bad case: change-rate = 0, so the page is never monitored
– Won’t notice a future increase in its change-rate
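As a toy illustration of the explore/exploit idea (an ε-greedy sketch of my own, not the bandit formulation under development):

```python
import random

def pick_page_to_monitor(predicted_gain, epsilon=0.1):
    # With probability epsilon, monitor a page chosen uniformly at random, so a page whose
    # estimated change-rate has dropped to zero is still revisited occasionally;
    # otherwise exploit the current ΔQ_D forecasts.
    pages = list(predicted_gain)
    if random.random() < epsilon:
        return random.choice(pages)
    return max(pages, key=predicted_gain.get)
```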
Summary
Approach:
– User-centric metric of search engine quality
– Schedule downloading to maximize quality
Empirical results:
– High quality with few downloads
– Good at picking the “right” documents to re-download