Crawlers and Spiders
[Overview diagram: the Web feeds a Web crawler; the crawler's output goes to an Indexer, which builds the Indexes; a User's Search query goes to the Query Engine, which consults the Indexes.]
Crawling picture
[Diagram: the Web partitioned into URLs crawled and parsed, the URL frontier, and the unseen Web; the crawl starts from seed pages.]
Sec. 20.2
Basic crawler operation
Create a queue with "seed" pages
Repeat:
- Fetch each URL on the queue
- Parse the fetched pages
- Extract the URLs they point to
- Place the extracted URLs on the queue
Until the queue is empty or time runs out
Q: What type of search is this? DFS? BFS? Other?
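The loop above can be sketched in Python; with a FIFO queue the crawl visits pages in breadth-first order. Here `fetch_links` is a hypothetical stand-in for fetching and parsing a page; a real crawler would issue HTTP requests and parse HTML at that step.

```python
from collections import deque

def crawl(seed_urls, fetch_links, limit=100):
    """Basic crawler loop: a FIFO queue of URLs (the frontier).

    `fetch_links(url)` stands in for "fetch the page, parse it, and
    extract the URLs it points to".
    """
    frontier = deque(seed_urls)        # queue seeded with the seed pages
    seen = set(seed_urls)              # don't enqueue a URL twice
    crawled = []                       # URLs in the order they were fetched
    while frontier and len(crawled) < limit:
        url = frontier.popleft()       # FIFO: oldest URL first
        crawled.append(url)
        for link in fetch_links(url):  # extract out-links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

# Toy link graph standing in for the Web (hypothetical URLs):
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl(["a"], lambda u: links.get(u, [])))  # ['a', 'b', 'c', 'd']
```

Popping from the same end as pushes (a stack) would give depth-first order instead.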
Simple picture – complications
Web crawling isn't feasible with one machine
- All of the above steps should be distributed
Malicious pages
- Spam pages
- Spider traps, including dynamically generated ones
Even non-malicious pages pose challenges
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
- Site mirrors and duplicate pages
Politeness: don't hit a server too often
What any crawler must do
Be polite: respect implicit and explicit politeness considerations for a website
- Only crawl pages you're allowed to
- Respect robots.txt (more on this shortly)
Be robust: be immune to spider traps and other malicious behavior from web servers
What any crawler should do
Be capable of distributed operation: designed to run on multiple distributed machines
Be scalable: designed to increase the crawl rate by adding more machines
Be efficient: permit full use of available processing and network resources
Fetch pages of "higher quality" first
Have continuous operation: continue fetching fresh copies of previously fetched pages
Be extensible: adapt to new data formats and protocols
Distributed crawling picture
[Diagram: as before (seed pages, the URL frontier, URLs crawled and parsed, the unseen Web), now with multiple crawling threads operating in parallel.]
URL frontier
- Can include multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy
Explicit and implicit politeness
Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
Implicit politeness: even with no specification, avoid hitting any site too often
Robots.txt
Protocol for giving spiders ("robots") limited access to a website, originally from 1994
- www.robotstxt.org/wc/norobots.html
A website announces its requests on what can(not) be crawled
- For a site rooted at URL, place a file at URL/robots.txt
- This file specifies the access restrictions
Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
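Python's standard library can parse this example. A sketch using `urllib.robotparser` to check the rules above (`example.com` is a placeholder host):

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "searchengine" matches its own group; an empty Disallow means allow all:
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # True
# Any other robot falls under "User-agent: *" and is barred from /yoursite/temp/:
print(rp.can_fetch("otherbot", "http://example.com/yoursite/temp/x.html"))      # False
print(rp.can_fetch("otherbot", "http://example.com/yoursite/index.html"))       # True
```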
Processing steps in crawling
- Pick a URL from the frontier (which one?)
- Fetch the document at the URL
- Parse the fetched document
  - Extract links from it to other docs (URLs)
- Check if the document's content was already seen
  - If not, add it to the indexes
- For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)
Basic crawl architecture
[Diagram, basic crawl architecture: the URL frontier feeds a fetch stage (with DNS resolution) against the WWW; each fetched page goes through parse, a "content seen?" test backed by Doc FP's, a URL filter consulting the robots filters, and duplicate URL elimination backed by the URL set, before new URLs return to the frontier.]
Why you need a DNS
Every server is identified by an IP address
IPv4: 32 bits written as 4 decimal numbers, each from 0 to 255, e.g. 149.130.12.213 (Wellesley College)
- How many addresses can it represent?
IPv6: 128 bits written as 8 blocks of 4 hex digits each, e.g. AF43:23BC:CAA1:0045:A5B2:90AC:FFEE:8080
- How many addresses are in IPv6?
The client translates hostnames in URLs to IP addresses, e.g. cs.wellesley.edu → 149.130.136.19
It uses authoritative sites for address translation, a.k.a. "Domain Name Servers" (DNS)
DNS (Domain Name Server)
A lookup service on the Internet
- Given a URL, retrieve its IP address
- Service provided by a distributed set of servers, so lookup latencies can be high (even seconds)
Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
Solutions:
- DNS caching
- Batch DNS resolver: collects requests and sends them out together
[Crawl architecture diagram repeated, with the DNS stage highlighted.]
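The first solution, DNS caching, can be sketched with a memoized resolver. The lookup function below is a stand-in for a real blocking resolver such as `socket.gethostbyname`; the IP address is the one quoted earlier for cs.wellesley.edu.

```python
import functools

resolver_calls = []  # records which hostnames actually reached the resolver

@functools.lru_cache(maxsize=100_000)
def resolve(hostname):
    """Stand-in for a blocking DNS lookup; lru_cache means each hostname
    pays the lookup latency at most once."""
    resolver_calls.append(hostname)
    table = {"cs.wellesley.edu": "149.130.136.19"}  # toy authoritative data
    return table.get(hostname, "0.0.0.0")

print(resolve("cs.wellesley.edu"))   # 149.130.136.19
print(resolve("cs.wellesley.edu"))   # same answer, served from the cache
print(resolver_calls)                # ['cs.wellesley.edu']: one real lookup
```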
Parsing: URL normalization
When a fetched document is parsed, some of the extracted links are relative URLs
E.g., at http://en.wikipedia.org/wiki/Main_Page there is a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
During parsing, must normalize (expand) such relative URLs
[Crawl architecture diagram repeated, with the parse stage highlighted.]
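Python's `urllib.parse.urljoin` performs exactly this expansion, here on the Wikipedia example from the slide:

```python
from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"
# Resolve the relative link against the page it was found on:
print(urljoin(base, relative))
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```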
Content seen?
Duplication is widespread on the web
If the page just fetched is already in the index, do not process it further
This is verified using document fingerprints (Doc FP's) or shingles
[Crawl architecture diagram repeated, with the "content seen?" stage highlighted.]
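A sketch of both checks: an MD5 fingerprint of the whole document catches exact duplicates, while a set of word k-shingles compared by Jaccard overlap catches near-duplicates (the sample sentences are made up):

```python
import hashlib

def fingerprint(text):
    """Whole-document fingerprint: equal only for exact duplicates."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of k-word shingles: near-duplicates share most shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
print(fingerprint(d1) == fingerprint(d2))            # False: not exact copies
print(jaccard(shingles(d1), shingles(d2)))           # 0.4: near-duplicates
```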
Filters and robots.txt
Filters: regular expressions for URLs to be crawled or not
Once a robots.txt file has been fetched from a site, there is no need to fetch it repeatedly
- Doing so burns bandwidth and hits the web server
Cache robots.txt files
[Crawl architecture diagram repeated, with the URL filter stage highlighted.]
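A sketch of both pieces: a regular-expression URL filter (using the "only crawl .edu" example from the processing-steps slide) and a per-host cache of robots.txt files. The names and the `fetch_robots` callback are hypothetical.

```python
import re

# Example filter: crawl only .edu hosts.
EDU_ONLY = re.compile(r"^https?://[^/]*\.edu(/|$)")

def passes_filter(url):
    return bool(EDU_ONLY.match(url))

robots_cache = {}  # host -> robots.txt text, fetched at most once per host

def robots_for(host, fetch_robots):
    """Return the host's robots.txt, fetching it only on the first request."""
    if host not in robots_cache:
        robots_cache[host] = fetch_robots(host)
    return robots_cache[host]

print(passes_filter("http://cs.wellesley.edu/courses.html"))  # True
print(passes_filter("http://example.com/page.html"))          # False

fetches = []  # records which hosts were actually contacted
fake_fetch = lambda h: fetches.append(h) or "User-agent: *\nDisallow:"
robots_for("cs.wellesley.edu", fake_fetch)
robots_for("cs.wellesley.edu", fake_fetch)  # cache hit: no second fetch
print(fetches)  # ['cs.wellesley.edu']
```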
Duplicate URL elimination
For a non-continuous (one-shot) crawl, test whether an extracted and filtered URL has already been passed to the frontier
For a continuous crawl, see the details of the frontier implementation
[Crawl architecture diagram repeated, with the duplicate URL elimination stage highlighted.]
Distributing the crawler
Run multiple crawl threads, under different processes – potentially at different nodes
Geographically distributed nodes
Partition the hosts being crawled among the nodes
- A hash of the hostname is used for the partition
How do these nodes communicate?
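The hash-based partition can be sketched as follows. Using a stable hash (here MD5, an assumed choice) rather than Python's salted built-in `hash` ensures every node assigns a host to the same partition:

```python
import hashlib

def node_for_host(host, num_nodes):
    """Map a hostname to one of `num_nodes` crawler nodes."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Every URL on a given host maps to the same node, so per-host
# politeness can still be enforced locally at that node:
n = node_for_host("cs.wellesley.edu", 4)
print(n == node_for_host("cs.wellesley.edu", 4))  # True: assignment is stable
```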
Communication between nodes
The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes
[Crawl architecture diagram, extended for distribution: after the URL filter, a host splitter sends URLs belonging to other nodes' hosts to those nodes and receives URLs from them, before duplicate URL elimination.]
URL frontier: two main considerations
Politeness: do not hit a web server too frequently
Freshness: crawl some pages more often than others
- E.g., pages (such as news sites) whose content changes often
These goals may conflict with each other
- E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site
Politeness – challenges
Even if we restrict only one thread to fetch from a host, it can hit the host repeatedly
Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host
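The heuristic can be sketched with a per-host table of next-allowed fetch times; the multiplier K = 10 is an assumed tuning constant, not from the slide.

```python
K = 10            # assumed gap multiplier (much larger than one fetch time)
next_allowed = {}  # host -> earliest permitted time of the next request

def record_fetch(host, started, finished):
    """After fetching from `host`, wait K times the fetch duration."""
    next_allowed[host] = finished + K * (finished - started)

def may_fetch(host, now):
    return now >= next_allowed.get(host, 0.0)

record_fetch("example.com", started=100.0, finished=100.5)  # fetch took 0.5 s
print(may_fetch("example.com", now=102.0))  # False: wait until t = 105.5
print(may_fetch("example.com", now=106.0))  # True
```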
URL frontier: Mercator scheme
[Diagram, Mercator scheme: incoming URLs pass through a prioritizer into K front queues; a biased front queue selector and back queue router move them into B back queues, a single host on each; a back queue selector hands URLs to a crawl thread requesting a URL.]
Mercator URL frontier
URLs flow in from the top into the frontier
Front queues manage prioritization
- Refresh rate sampled from previous crawls
- Application-specific (e.g., "crawl news sites more often")
Back queues enforce politeness
- Each queue is FIFO
[Mercator diagram repeated: prioritizer over front queues 1…K, biased front queue selector, back queue router over back queues 1…B, and a back queue selector driven by a heap.]
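The back-queue half of the scheme can be sketched as follows: one FIFO queue per host, plus a heap keyed on each host's earliest next-fetch time. The 5-second politeness gap is an assumed constant, and a full Mercator frontier would add the prioritized front queues on top of this.

```python
import heapq
from collections import deque

GAP = 5.0  # assumed politeness gap between fetches from one host (seconds)

class BackQueues:
    def __init__(self):
        self.queues = {}  # host -> FIFO of that host's URLs (one host per queue)
        self.heap = []    # (earliest next-fetch time, host), one entry per queue

    def add(self, host, url, now=0.0):
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Hand a crawl thread a URL from the host that has waited longest,
        or None if every host is still inside its politeness gap."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        if self.queues[host]:                             # host has more URLs:
            heapq.heappush(self.heap, (now + GAP, host))  # re-enter after gap
        else:
            del self.queues[host]
        return url

bq = BackQueues()
bq.add("a.edu", "http://a.edu/1")
bq.add("a.edu", "http://a.edu/2")
bq.add("b.edu", "http://b.edu/1")
print(bq.next_url(0.0))  # http://a.edu/1
print(bq.next_url(0.0))  # http://b.edu/1 (a.edu is inside its gap)
print(bq.next_url(1.0))  # None: a.edu may not be re-fetched until t = 5.0
print(bq.next_url(5.0))  # http://a.edu/2
```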