Download - Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Scheduling Algorithms for Web Crawling
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates
Center for Web Researchwww.cwr.cl
LA-WEB 2004
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Motivation
Algorithms
Experiments
Summary
References
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Motivation
Web search generates more than 13% of the traffic to Websites [StatMarket, 2003].
No search engine indexes more than one third of the publiclyavailable Web [Lawrence and Giles, 1998].
If we cannot download all of the pages, we should at leastdownload the most “important” ones.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Motivation
Web search generates more than 13% of the traffic to Websites [StatMarket, 2003].
No search engine indexes more than one third of the publiclyavailable Web [Lawrence and Giles, 1998].
If we cannot download all of the pages, we should at leastdownload the most “important” ones.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Motivation
Web search generates more than 13% of the traffic to Websites [StatMarket, 2003].
No search engine indexes more than one third of the publiclyavailable Web [Lawrence and Giles, 1998].
If we cannot download all of the pages, we should at leastdownload the most “important” ones.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
The problem of Web crawling
We must download pages with sizes given by Pi , over a connectionof bandwidth B. Trivial solution: we download all the pagessimultaneously at a speed proportional to the size of each page:
Bi =Pi
T ∗
T ∗ is the optimal time to use all the available bandwidth:
T∗ =
∑Pi
B
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Optimal scenario
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Restrictions
Robot exclusion protocol [Koster, 1995]
Waiting time ≈ 10− 30 seconds
Web sites bandwidth BMAXi lower than the crawler bandwidth
B
Distribution of Web site sizes is very skewed
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Restrictions
Robot exclusion protocol [Koster, 1995]
Waiting time ≈ 10− 30 seconds
Web sites bandwidth BMAXi lower than the crawler bandwidth
B
Distribution of Web site sizes is very skewed
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Restrictions
Robot exclusion protocol [Koster, 1995]
Waiting time ≈ 10− 30 seconds
Web sites bandwidth BMAXi lower than the crawler bandwidth
B
Distribution of Web site sizes is very skewed
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Restrictions
Robot exclusion protocol [Koster, 1995]
Waiting time ≈ 10− 30 seconds
Web sites bandwidth BMAXi lower than the crawler bandwidth
B
Distribution of Web site sizes is very skewed
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Distribution of site sizes
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Realistic scenario
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Number of active robots in a batch
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Goal
If each page has a certain score, capture most of the total value ofthis score downloading just a fraction of the pages.We will use the total Pagerank of the downloaded set vs. thefraction of downloaded pages as a measure of quality
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms
Algorithms are based on a scheduler with two levels of queues:
Queue of Web sites
Queue of Web pages in each Web site
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms
Algorithms are based on a scheduler with two levels of queues:
Queue of Web sites
Queue of Web pages in each Web site
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms
Algorithms are based on a scheduler with two levels of queues:
Queue of Web sites
Queue of Web pages in each Web site
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Queues used for the scheduling
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms based on Pagerank
Optimal/Oracle: crawler asks for the Pagerank value of eachpage in the frontier using an “Oracle”. This is not available ina real crawl as we do not have the entire graphThe average relative error for estimating the Pagerank fourmonths ahead is about 78% [Cho and Adams, 2004], sohistorical information from previous crawls is not too useful
Batch-Pagerank: Pagerank calculations are executed overthe subset of known pages [Cho et al., 1998]
Partial-Pagerank: a “temporary” Pagerank value is assignedto the pages in between batch-Pagerank calculations
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms based on Pagerank
Optimal/Oracle: crawler asks for the Pagerank value of eachpage in the frontier using an “Oracle”. This is not available ina real crawl as we do not have the entire graphThe average relative error for estimating the Pagerank fourmonths ahead is about 78% [Cho and Adams, 2004], sohistorical information from previous crawls is not too useful
Batch-Pagerank: Pagerank calculations are executed overthe subset of known pages [Cho et al., 1998]
Partial-Pagerank: a “temporary” Pagerank value is assignedto the pages in between batch-Pagerank calculations
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms based on Pagerank
Optimal/Oracle: crawler asks for the Pagerank value of eachpage in the frontier using an “Oracle”. This is not available ina real crawl as we do not have the entire graphThe average relative error for estimating the Pagerank fourmonths ahead is about 78% [Cho and Adams, 2004], sohistorical information from previous crawls is not too useful
Batch-Pagerank: Pagerank calculations are executed overthe subset of known pages [Cho et al., 1998]
Partial-Pagerank: a “temporary” Pagerank value is assignedto the pages in between batch-Pagerank calculations
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms not based on Pagerank
Depth: pages are given a priority based on their depths. Thisis graph traversal in breadth-first ordering[Najork and Wiener, 2001]
Length: pages from the Web sites which seem to be biggerare crawled first. We do not know which are really the biggerWeb sites until the end of the crawl. We use partialinformation
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Algorithms not based on Pagerank
Depth: pages are given a priority based on their depths. Thisis graph traversal in breadth-first ordering[Najork and Wiener, 2001]
Length: pages from the Web sites which seem to be biggerare crawled first. We do not know which are really the biggerWeb sites until the end of the crawl. We use partialinformation
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Experiments
Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Experiments
Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Experiments
Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Experiments
Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Experiments
Download a sample of pages using the WIRE crawler[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed ofdifferent Web sites, page sizes, waiting time, latency, etc.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Simulation parameters
Algorithm
Waiting time between pages from the same Web site w
Number of pages downloaded per connection when re-usingthe HTTP connection k
Number of robots r
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Simulation parameters
Algorithm
Waiting time between pages from the same Web site w
Number of pages downloaded per connection when re-usingthe HTTP connection k
Number of robots r
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Simulation parameters
Algorithm
Waiting time between pages from the same Web site w
Number of pages downloaded per connection when re-usingthe HTTP connection k
Number of robots r
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Simulation parameters
Algorithm
Waiting time between pages from the same Web site w
Number of pages downloaded per connection when re-usingthe HTTP connection k
Number of robots r
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Results with one robot
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Results with many robots
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Speed-ups with the “Length” strategy
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Crawling the real Web using the “Length” strategy
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Pagerank vs day of crawl
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Depth is not correlated with Pagerank
When depth is ≥ 2 links from the home page
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Summary
The restrictions, specially waiting time, create a difficultproblem for scheduling
An strategy with an “oracle” was too greedy
We try to keep Web sites in the frontier for as long aspossible, so we always have several Web sites to choose from
Simulation ensures the same conditions, which is criticalbecause the Web is very dynamic
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Summary
The restrictions, specially waiting time, create a difficultproblem for scheduling
An strategy with an “oracle” was too greedy
We try to keep Web sites in the frontier for as long aspossible, so we always have several Web sites to choose from
Simulation ensures the same conditions, which is criticalbecause the Web is very dynamic
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Summary
The restrictions, specially waiting time, create a difficultproblem for scheduling
An strategy with an “oracle” was too greedy
We try to keep Web sites in the frontier for as long aspossible, so we always have several Web sites to choose from
Simulation ensures the same conditions, which is criticalbecause the Web is very dynamic
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Summary
The restrictions, specially waiting time, create a difficultproblem for scheduling
An strategy with an “oracle” was too greedy
We try to keep Web sites in the frontier for as long aspossible, so we always have several Web sites to choose from
Simulation ensures the same conditions, which is criticalbecause the Web is very dynamic
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Open problems
Scheduling using historical information
Exploiting the Web’s structure
Adversarial IR: Spam detection before downloading the pages
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Open problems
Scheduling using historical information
Exploiting the Web’s structure
Adversarial IR: Spam detection before downloading the pages
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Open problems
Scheduling using historical information
Exploiting the Web’s structure
Adversarial IR: Spam detection before downloading the pages
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Baeza-Yates, R. and Castillo, C. (2002).Balancing volume, quality and freshness in web crawling.In Soft Computing Systems - Design, Management andApplications, pages 565–572, Santiago, Chile. IOS PressAmsterdam.
Cho, J. and Adams, R. (2004).Page quality: In search of an unbiased Web ranking.Technical report, UCLA Computer Science.
Cho, J., Garcıa-Molina, H., and Page, L. (1998).Efficient crawling through URL ordering.In Proceedings of the seventh conference on World Wide Web,Brisbane, Australia.
Koster, M. (1995).Robots in the web: threat or treat ?ConneXions, 9(4).
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
Lawrence, S. and Giles, C. L. (1998).Searching the World Wide Web.Science, 280(5360):98–100.
Najork, M. and Wiener, J. L. (2001).Breadth-first crawling yields high-quality pages.In Proceedings of the Tenth Conference on World Wide Web,pages 114–118, Hong Kong. Elsevier Science.
StatMarket (2003).Search engine referrals nearly double worldwide.http://websidestory.com/pressroom/pressreleases.html-?id=181.
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline Motivation Algorithms Experiments Summary References
C. Castillo, M. Marin, A. Rodrıguez and R. Baeza-Yates Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling