# Scheduling Algorithms for Web Crawling

Post on 11-May-2015

TRANSCRIPT

1. Scheduling Algorithms for Web Crawling. C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates, Center for Web Research (www.cwr.cl). LA-WEB 2004.

2. Outline: Motivation, Algorithms, Experiments, Summary, References.

3. Motivation
   - Web search generates more than 13% of the traffic to Web sites [StatMarket, 2003].
   - No search engine indexes more than one third of the publicly available Web [Lawrence and Giles, 1998].
   - If we cannot download all of the pages, we should at least download the most important ones.

6. The problem of Web crawling
   - We must download pages with sizes given by $P_i$, over a connection of bandwidth $B$.
   - Trivial solution: we download all the pages simultaneously, each at a speed proportional to its size: $B_i = P_i / T$
   - $T$ is the optimal time to use all the available bandwidth: $T = \sum_i P_i / B$

7. Optimal scenario [figure]

8. Restrictions
   - Robot exclusion protocol [Koster, 1995]
   - Waiting time of 10 to 30 seconds
   - Web sites' bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$
   - Distribution of Web site sizes is very skewed
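As a worked illustration of the trivial solution on slide 6, the sketch below computes the time T and the per-page speeds B_i. The function name, page sizes, and bandwidth are made-up values for illustration only; none of them come from the talk.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T, speeds) for downloading all pages simultaneously.

    T = sum(P_i) / B is the time needed when the whole bandwidth is used,
    and B_i = P_i / T is the speed given to page i, so all pages finish at T.
    """
    total_size = sum(page_sizes)
    t_opt = total_size / bandwidth
    speeds = [size / t_opt for size in page_sizes]
    return t_opt, speeds


# Hypothetical input: three pages (bytes) and a 20,000 bytes/s connection.
sizes = [10_000, 30_000, 60_000]
B = 20_000
t, speeds = optimal_schedule(sizes, B)
print(t, speeds)
```

The speeds add up to the full bandwidth, which is why this schedule is optimal when nothing else constrains the crawler. The restrictions on slide 8 (waiting times and per-site bandwidth limits) are what break this ideal in practice.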
12. Distribution of site sizes [figure]

13. Realistic scenario [figure]

14. Number of active robots in a batch [figure]

15. Goal
   - If each page has a certain score, capture most of the total value of this score while downloading just a fraction of the pages.
   - We will use the total Pagerank of the downloaded set vs. the fraction of downloaded pages as a measure of quality.

16. Algorithms
   - Algorithms are based on a scheduler with two levels of queues:
     - Queue of Web sites
     - Queue of Web pages in each Web site
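A minimal sketch of a scheduler with two levels of queues, as described on the Algorithms slide, might look like the following. The class name, the method names, and the rule for ordering Web sites (by their best pending page) are assumptions made for this example; the slides do not specify them.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Hypothetical two-level queue: Web sites, then pages within each site."""

    def __init__(self):
        # site -> heap of (-priority, url); negated so the best page pops first
        self._pages = defaultdict(list)

    def add(self, site, url, priority):
        heapq.heappush(self._pages[site], (-priority, url))

    def next_batch(self, pages_per_site=1):
        """Take up to pages_per_site pages from every site that has pending pages.

        Sites are visited in order of their best pending page (an assumption).
        """
        sites = sorted(
            (s for s, q in self._pages.items() if q),
            key=lambda s: self._pages[s][0],
        )
        batch = []
        for site in sites:
            queue = self._pages[site]
            for _ in range(min(pages_per_site, len(queue))):
                neg_priority, url = heapq.heappop(queue)
                batch.append((site, url, -neg_priority))
        return batch


scheduler = TwoLevelScheduler()
scheduler.add("a.example", "a.example/1", 0.5)
scheduler.add("a.example", "a.example/2", 0.9)
scheduler.add("b.example", "b.example/1", 0.7)
first = scheduler.next_batch()
second = scheduler.next_batch()
```

Limiting each batch to a few pages per site is one way to respect the per-site waiting time while still keeping many sites active at once.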
19. Queues used for the scheduling [figure]

20. Algorithms based on Pagerank
   - Optimal/Oracle: the crawler asks an oracle for the Pagerank value of each page in the frontier. This is not available in a real crawl, as we do not have the entire graph.
     - The average relative error for estimating the Pagerank four months ahead is about 78% [Cho and Adams, 2004], so historical information from previous crawls is not too useful.
   - Batch-Pagerank: Pagerank calculations are executed over the subset of known pages [Cho et al., 1998].
   - Partial-Pagerank: a temporary Pagerank value is assigned to the pages in between batch-Pagerank calculations.
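To make the Batch-Pagerank idea concrete, here is a rough sketch that runs a plain power iteration over the pages known so far and then picks the best page in the frontier. The damping factor of 0.85, the iteration count, the uniform handling of pages without known out-links, and the tiny example graph are all assumptions for illustration, not details taken from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the known part of the graph.

    links maps a downloaded page to the list of pages it links to.
    Pages that only appear as link targets (the frontier) have no known
    out-links, so their rank is spread uniformly over all pages.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in links.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


# Hypothetical crawl state: "a" and "b" are downloaded, "c" and "d" are not.
known_links = {"a": ["b", "c"], "b": ["c", "d"]}
downloaded = set(known_links)
rank = pagerank(known_links)
frontier = [u for u in rank if u not in downloaded]
best = max(frontier, key=rank.get)
```

In a batch scheme this computation would be repeated only every so often, which is why a Partial-Pagerank variant needs a temporary value for pages discovered between two runs.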
23. Algorithms not based on Pagerank
   - Depth: pages are given a priority based on their depths. This is graph traversal in breadth-first ordering [Najork and Wiener, 2001].
   - Length: pages from the Web sites which seem to be bigger are crawled first. We do not know which are really the bigger Web sites until the end of the crawl, so we use partial information.
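The two strategies that do not use Pagerank can be sketched as simple orderings. In this example the function names are hypothetical, and the "apparent size" of a Web site is assumed to be the number of its pages currently known in the frontier, which is only one possible reading of "partial information".

```python
from collections import Counter, deque


def breadth_first_order(links, seeds):
    """Depth strategy: visit pages in breadth-first order from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def length_order(frontier, site_of):
    """Length strategy: pages from sites that currently look bigger go first.

    Site size is estimated from the frontier only (an assumption); ties are
    broken by URL so the ordering is deterministic.
    """
    known_pages = Counter(site_of(url) for url in frontier)
    return sorted(frontier, key=lambda url: (-known_pages[site_of(url)], url))


# Hypothetical link graph and frontier.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
bfs = breadth_first_order(graph, ["s"])

pending = ["x.example/1", "y.example/1", "y.example/2"]
by_length = length_order(pending, lambda url: url.split("/")[0])
```

Both orderings need only information the crawler already has, which makes them much cheaper than recomputing Pagerank during the crawl.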
25. Experiments: Download a sample
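The quality measure stated on the Goal slide (total Pagerank of the downloaded set vs. the fraction of downloaded pages) could be computed along these lines. The function name and the Pagerank values are invented for this example and are not experimental data from the talk.

```python
def cumulative_quality(download_order, pagerank):
    """Return (fraction of pages downloaded, fraction of total Pagerank) points."""
    total = sum(pagerank.values())
    n = len(pagerank)
    curve = []
    accumulated = 0.0
    for k, url in enumerate(download_order, start=1):
        accumulated += pagerank[url]
        curve.append((k / n, accumulated / total))
    return curve


# Hypothetical Pagerank values for a four-page collection.
scores = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
good = cumulative_quality(["a", "b", "c", "d"], scores)
bad = cumulative_quality(["d", "c", "b", "a"], scores)
```

A better ordering pushes the curve up early: in this toy case the first ordering has captured three quarters of the total score after half the pages, while the reversed ordering has captured only one quarter.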
