Download - Scheduling Algorithms for Web Crawling

Outline Motivation Algorithms Experiments Summary References

Scheduling Algorithms for Web Crawling

C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates

Center for Web Researchwww.cwr.cl

LA-WEB 2004

C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl

Motivation

Algorithms

Experiments

Summary

References

Page 3: Scheduling Algorithms for Web Crawling

Motivation

Web search generates more than 13% of the traffic to Websites [StatMarket, 2003].

No search engine indexes more than one third of the publiclyavailable Web [Lawrence and Giles, 1998].

If we cannot download all of the pages, we should at leastdownload the most “important” ones.

Page 4: Scheduling Algorithms for Web Crawling

Motivation

Page 5: Scheduling Algorithms for Web Crawling

Motivation

Page 6: Scheduling Algorithms for Web Crawling

The problem of Web crawling

We must download pages with sizes given by Pi , over a connectionof bandwidth B. Trivial solution: we download all the pagessimultaneously at a speed proportional to the size of each page:

Bi =Pi

T ∗

T ∗ is the optimal time to use all the available bandwidth:

T∗ =

∑Pi

B

Page 7: Scheduling Algorithms for Web Crawling

Optimal scenario

Page 8: Scheduling Algorithms for Web Crawling

Restrictions

Robot exclusion protocol [Koster, 1995]

Waiting time ≈ 10− 30 seconds

Web sites bandwidth BMAXi lower than the crawler bandwidth

B

Distribution of Web site sizes is very skewed

Page 9: Scheduling Algorithms for Web Crawling

Restrictions

B

Page 10: Scheduling Algorithms for Web Crawling

Restrictions

B

Page 11: Scheduling Algorithms for Web Crawling

Restrictions

B

Page 12: Scheduling Algorithms for Web Crawling

Distribution of site sizes

Page 13: Scheduling Algorithms for Web Crawling

Realistic scenario

Page 14: Scheduling Algorithms for Web Crawling

Number of active robots in a batch

Page 15: Scheduling Algorithms for Web Crawling

Goal

If each page has a certain score, capture most of the total value ofthis score downloading just a fraction of the pages.We will use the total Pagerank of the downloaded set vs. thefraction of downloaded pages as a measure of quality

Algorithms

Algorithms are based on a scheduler with two levels of queues:

Queue of Web sites

Queue of Web pages in each Web site

Algorithms

Queue of Web sites

Algorithms

Queue of Web sites

Queues used for the scheduling

Page 20: Scheduling Algorithms for Web Crawling

Algorithms based on Pagerank

Optimal/Oracle: crawler asks for the Pagerank value of eachpage in the frontier using an “Oracle”. This is not available ina real crawl as we do not have the entire graphThe average relative error for estimating the Pagerank fourmonths ahead is about 78% [Cho and Adams, 2004], sohistorical information from previous crawls is not too useful

Batch-Pagerank: Pagerank calculations are executed overthe subset of known pages [Cho et al., 1998]

Partial-Pagerank: a “temporary” Pagerank value is assignedto the pages in between batch-Pagerank calculations

Page 21: Scheduling Algorithms for Web Crawling

Page 22: Scheduling Algorithms for Web Crawling

Page 23: Scheduling Algorithms for Web Crawling

Algorithms not based on Pagerank

Depth: pages are given a priority based on their depths. Thisis graph traversal in breadth-first ordering[Najork and Wiener, 2001]

Length: pages from the Web sites which seem to be biggerare crawled first. We do not know which are really the biggerWeb sites until the end of the crawl. We use partialinformation

Page 24: Scheduling Algorithms for Web Crawling

Algorithms not based on Pagerank

Depth: pages are given a priority based on their depths. Thisis graph traversal in breadth-first ordering[Najork and Wiener, 2001]

Length: pages from the Web sites which seem to be biggerare crawled first. We do not know which are really the biggerWeb sites until the end of the crawl. We use partialinformation

Page 25: Scheduling Algorithms for Web Crawling

Experiments

Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]

3.5 million pages from over 50,000 Web sites in .CL

At most 25,000 pages from each Web site

Strategies are simulated on a graph built using actual data

Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.

Page 26: Scheduling Algorithms for Web Crawling

Experiments

Page 27: Scheduling Algorithms for Web Crawling

Experiments

Page 28: Scheduling Algorithms for Web Crawling

Experiments

Page 29: Scheduling Algorithms for Web Crawling

Experiments

Simulation parameters

Algorithm

Waiting time between pages from the same Web site w

Number of pages downloaded per connection when re-usingthe HTTP connection k

Number of robots r

Algorithm

Number of robots r

Algorithm

Number of robots r

Algorithm

Number of robots r

Results with one robot

Page 35: Scheduling Algorithms for Web Crawling

Results with many robots

Page 36: Scheduling Algorithms for Web Crawling

Speed-ups with the “Length” strategy

Page 37: Scheduling Algorithms for Web Crawling

Crawling the real Web using the “Length” strategy

Page 38: Scheduling Algorithms for Web Crawling

Pagerank vs day of crawl

Page 39: Scheduling Algorithms for Web Crawling

Depth is not correlated with Pagerank

When depth is ≥ 2 links from the home page

Page 40: Scheduling Algorithms for Web Crawling

Summary

The restrictions, specially waiting time, create a difficultproblem for scheduling

An strategy with an “oracle” was too greedy

We try to keep Web sites in the frontier for as long aspossible, so we always have several Web sites to choose from

Simulation ensures the same conditions, which is criticalbecause the Web is very dynamic

Page 41: Scheduling Algorithms for Web Crawling

Summary

Page 42: Scheduling Algorithms for Web Crawling

Summary

Page 43: Scheduling Algorithms for Web Crawling

Summary

Page 44: Scheduling Algorithms for Web Crawling

Open problems

Scheduling using historical information

Exploiting the Web’s structure

Adversarial IR: Spam detection before downloading the pages

Page 45: Scheduling Algorithms for Web Crawling

Open problems

Page 46: Scheduling Algorithms for Web Crawling

Open problems

Page 47: Scheduling Algorithms for Web Crawling

Baeza-Yates, R. and Castillo, C. (2002).Balancing volume, quality and freshness in web crawling.In Soft Computing Systems - Design, Management andApplications, pages 565–572, Santiago, Chile. IOS PressAmsterdam.

Cho, J. and Adams, R. (2004).Page quality: In search of an unbiased Web ranking.Technical report, UCLA Computer Science.

Cho, J., Garcıa-Molina, H., and Page, L. (1998).Efficient crawling through URL ordering.In Proceedings of the seventh conference on World Wide Web,Brisbane, Australia.

Koster, M. (1995).Robots in the web: threat or treat ?ConneXions, 9(4).

Page 48: Scheduling Algorithms for Web Crawling

Lawrence, S. and Giles, C. L. (1998).Searching the World Wide Web.Science, 280(5360):98–100.

Najork, M. and Wiener, J. L. (2001).Breadth-first crawling yields high-quality pages.In Proceedings of the Tenth Conference on World Wide Web,pages 114–118, Hong Kong. Elsevier Science.

StatMarket (2003).Search engine referrals nearly double worldwide.http://websidestory.com/pressroom/pressreleases.html-?id=181.

Download - Scheduling Algorithms for Web Crawling

Top Related