Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram: today's centralized search pipeline — Crawl → Index → Search]
What's Wrong?
Users have a limited search interface.
Today's web is dynamic and growing: timely re-crawls are required, but not feasible for all web sites.
Search engines control your search results:
- They decide which sites get crawled: 550 billion documents were estimated in 2001 (BrightPlanet), while Google indexes 3.3 billion documents.
- They decide which sites get updated more frequently.
- They may censor or skew result rankings.
Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
P2P users donate excess bandwidth and computation resources to crawl the web.
Crawlers are organized using Distributed Hash Tables (DHTs).
A DHT- and query-processor-agnostic crawler:
- Designed to work over any DHT.
- Crawls can be expressed as declarative recursive queries, which makes user customization easy.
- Queries can be executed over PIER, a DHT-based relational P2P query processor.
Crawlers: PIER nodes. Crawlees: web servers.
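To make "crawl as a recursive query" concrete, here is a minimal sketch of the fixpoint semantics such a query encodes (plain Python rather than PIER's query language; the relation names follow the slides, everything else is illustrative): WebPage(url) facts seed the crawl, each downloaded page yields Link(sourceUrl, destUrl) facts, and every new destination URL becomes a new WebPage fact until no new pages are derived.

```python
# Illustrative sketch of the recursive crawl semantics; not PIER code.
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    web_page = set(seed_urls)          # WebPage(url) relation
    link = set()                       # Link(sourceUrl, destUrl) relation
    frontier = list(seed_urls)

    while frontier and len(web_page) < max_pages:
        url = frontier.pop(0)
        try:                                              # Downloader
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        # Extractor: a naive href regex stands in for a real link extractor.
        for href in re.findall(r'href="([^"]+)"', html):
            dest = urljoin(url, href)
            link.add((url, dest))             # publish Link(sourceUrl, destUrl)
            if dest not in web_page:          # duplicate elimination
                web_page.add(dest)            # publish WebPage(dest): the recursive step
                frontier.append(dest)
    return web_page, link
```

In PIER the same recursion is written declaratively, and the frontier, duplicate elimination, and publication steps are distributed across the DHT rather than kept in local sets.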
Potential
Infrastructure for crawl personalization:
- User-defined focused crawlers.
- Collaborative crawling/filtering (special-interest groups).
Other possibilities:
- A bigger, better, faster web crawler that enables new search and indexing technologies.
- P2P web search; web archival and storage (with OceanStore).
- A generalized crawler for querying distributed graph structures: monitoring file-sharing networks such as Gnutella, and P2P network maintenance (routing information, OceanStore metadata).
Challenges that We Investigated
Scalability and throughput: DHT communication overheads.
Balancing network load on crawlers: two components of network load, download bandwidth and DHT bandwidth.
Network proximity: exploiting the network locality of crawlers.
Limiting download rates on web sites: prevents denial-of-service on the crawlees.
Main tradeoff: tension between coordination and communication.
- Balance load either on crawlers or on crawlees.
- Exploit network proximity at the cost of communication.
Crawl as a Recursive Query
[Dataflow diagram] Seed URLs and a DHT scan over WebPage(url) feed a CrawlWrapper, which rate-throttles and reorders URLs and may redirect them to another crawler. Each crawler thread runs a Downloader followed by an Extractor. Output links are published as Link(sourceUrl, destUrl); after filters and duplicate elimination, each Link.destUrl is published as a new WebPage(url) tuple, which re-enters the query recursively.
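The "rate throttle & reorder" stage of the CrawlWrapper is what protects crawlees. A minimal sketch of one way such a stage could work (the class name and the one-download-per-host-per-interval policy are assumptions, not the crawler's exact mechanism):

```python
import time
from collections import defaultdict, deque

class RateThrottle:
    """Reorder queued URLs so that no single host is fetched more than
    once per `min_interval` seconds (illustrative policy)."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.last_fetch = {}               # host -> time of the last download

    def enqueue(self, host, url):
        self.queues[host].append(url)

    def next_url(self):
        now = time.time()
        for host, pending in self.queues.items():
            if pending and now - self.last_fetch.get(host, 0.0) >= self.min_interval:
                self.last_fetch[host] = now
                return pending.popleft()
        return None   # nothing is eligible yet; the caller retries later
```

Because partition-by-hostname routes every URL of a site to the same crawler, a purely local structure like this is enough to enforce a per-site rate; partition-by-URL would need distributed coordination to give the same guarantee.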
Crawl Distribution Strategies
Partition by URL:
- Ensures even distribution of crawler workload.
- High DHT communication traffic.
Partition by hostname:
- One crawler per hostname: creates a "control point" for per-server rate throttling.
- May lead to uneven crawler load distribution.
- Single point of failure: a "bad" choice of crawler hurts per-site crawl throughput.
- Slight variation: X crawlers per hostname.
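The difference between the schemes comes down to which key is hashed into the DHT identifier space. A sketch under that assumption (the helper names are illustrative, not the crawler's API):

```python
import hashlib
from urllib.parse import urlparse

def dht_key(data: str) -> int:
    # Hash into a 160-bit identifier space, as in Chord/Bamboo-style DHTs.
    return int(hashlib.sha1(data.encode()).hexdigest(), 16)

def partition_key(url: str, scheme: str, crawlers_per_host: int = 8) -> int:
    host = urlparse(url).netloc
    if scheme == "url":
        return dht_key(url)                  # each URL may land on a different crawler
    if scheme == "hostname":
        return dht_key(host)                 # all URLs of a site go to one crawler
    if scheme == "hostname-x":
        # X crawlers per hostname: spread a hot site over a few fixed nodes.
        bucket = dht_key(url) % crawlers_per_host
        return dht_key(f"{host}#{bucket}")
    raise ValueError(f"unknown scheme: {scheme}")
```

Partition-by-URL spreads download and DHT load evenly, but almost every extracted link must be shipped to a different node, which is the source of its high DHT traffic.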
Redirection
• A simple technique that lets a crawler redirect, or pass on, its assigned work to another crawler (and so on).
• A second-chance distribution mechanism, orthogonal to the partitioning scheme.
• Example, partition by hostname: the node responsible for www.google.com dispatches work, by URL, to other nodes.
• Gives the load-balancing benefits of partition by URL with the control benefits of partition by hostname.
• When? Policy-based: crawler load (queue size), network proximity.
• Why not? Cost of redirection: increased DHT control traffic. Hence, limit the number of redirections per URL.
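A sketch of how such a redirection policy could be expressed (the queue-size threshold, the RTT-based neighbor choice, and the per-URL redirect counter are illustrative assumptions):

```python
MAX_REDIRECTS = 2        # cap redirections per URL to bound DHT control traffic
QUEUE_THRESHOLD = 500    # redirect only when the local crawl queue is overloaded

def maybe_redirect(url, redirect_count, local_queue_len, neighbors, rtt_to_site):
    """Return None to crawl `url` locally, or the crawler to redirect it to."""
    if redirect_count >= MAX_REDIRECTS:
        return None                          # second chances are used up
    if local_queue_len < QUEUE_THRESHOLD:
        return None                          # not overloaded; keep the control point
    # Policy: pick the neighbor with the lowest measured RTT to the target site.
    return min(neighbors, key=rtt_to_site, default=None)
```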
Experiments
Deployment:
- Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
- 3 crawl threads per crawler; 15-minute crawl duration.
Distribution (partition) schemes:
- URL; Hostname; Hostname with 8 crawlers per unique host; Hostname with one level of redirection on overload.
Crawl workloads:
- Exhaustive crawl: seed URL http://www.google.com; 78,244 different web servers.
- Crawl of a fixed number of sites: seed URL http://www.google.com; 45 web servers within google.
- Crawl of a single site: http://groups.google.com.
Crawl of Multiple Sites I
- Hostname can exploit at most 45 crawlers (one per crawled site).
- Redirect (the hybrid hostname/URL scheme) does the best.
- Partition by hostname shows severe imbalance (70% of crawlers idle); throughput is better when more crawlers are kept busy.
[Charts: CDF of per-crawler downloads (80 nodes); crawl throughput scale-up]
Crawl of Multiple Sites II
- Redirection incurs higher overheads only after the queue size exceeds its threshold.
- Hostname incurs low overheads, since the crawl only looks at google.com, which has many self-links.
- Redirect: the per-URL DHT overheads hit their maximum at around 70 nodes.
[Chart: per-URL DHT overheads]
Network Proximity
- Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
- Partition by hostname approximates random assignment.
- Best-of-3 random assignment is "close enough" to best-of-5 random.
- Sanity check: what if a single host crawls all targets?
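The "best-of-k random" comparison can be read as a power-of-choices assignment: for each target, probe a few random crawlers and keep the closest one. A sketch of that assignment, assuming the ping measurements are available as a nested dictionary (all names are illustrative):

```python
import random

def assign_best_of_k(targets, crawlers, ping_ms, k=3):
    """For each crawl target, sample k random crawlers and assign the one
    with the smallest measured ping time (mirrors the proximity experiment)."""
    assignment = {}
    for target in targets:
        candidates = random.sample(crawlers, k)
        assignment[target] = min(candidates, key=lambda c: ping_ms[c][target])
    return assignment
```

The slide's observation is that k = 3 already captures most of the proximity benefit of k = 5, while plain partition by hostname behaves like a single random choice.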
Summary of Schemes

| Scheme | Load-balance download bandwidth | Load-balance DHT bandwidth | Rate-limit crawlees | Network proximity | DHT communication overheads |
|----------|---|---|---|---|----|
| URL      | + | + | - | - | -  |
| Hostname | - | - | + | ? | +  |
| Redirect | + | ? | + | + | -- |
Related Work
Herodotus, at MIT (Chord-based):
- Partition by URL; batching with ring-based forwarding.
- Experimented on 4 local machines.
Apoidea, at GaTech (Chord-based):
- Partition by hostname; forwards the crawl to the DHT neighbor closest to the website.
- Experimented on 12 local machines.
Conclusion
Our main contributions:
- Propose a DHT- and QP-agnostic distributed crawler; express the crawl as a query, which permits user-customizable refinement of crawls.
- Identify important trade-offs in distributed crawling: coordination comes with extra communication costs.
- Deployment and experimentation on PlanetLab: examine crawl distribution strategies under different workloads on live web sources, and measure the potential benefits of network proximity.
Backup slides
Existing Crawlers
Cluster-based crawlers:
- Google: a centralized dispatcher sends URLs to be crawled.
- Hash-based parallel crawlers.
Focused crawlers:
- BINGO!: crawls the web given a basic training set.
Peer-to-peer:
- Grub: SETI@Home infrastructure; 23,993 members.
Exhaustive Crawl
- Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
- Little difference in throughput: most crawler threads are kept busy.
Single Site
URL is best, followed by redirect and hostname.
Future Work
- Fault tolerance.
- Security.
- Single-node throughput.
- Work-sharing between crawl queries: essential for overlapping users.
- Global crawl prioritization: a requirement of personalized crawls; online relevance feedback.
- Deep-web retrieval.