Crawling the Web - Presentation
TRANSCRIPT
Outline
Crawlers within a search engine
Motivation and taxonomy of crawlers
Architecture of a Web crawler
Basic crawlers and implementation issues
Crawler ethics and conflicts
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Organizing the Web
The Web is big. Really big. Over 3 billion pages, just in the indexable Web.
The Web is dynamic.
Problems:
How to store a database of links?
How to crawl the web?
How to recommend pages that match a query?
Architecture of a Search Engine
1. A web crawler gathers a snapshot of the Web.
2. The gathered pages are indexed for easy retrieval.
3. The user submits a search query.
4. The search engine ranks pages that match the query and returns an ordered list.
Search Engine: major components
Crawlers
Collect documents by recursively fetching links from a set of starting pages. Each crawler has different policies, so the pages indexed by various search engines differ.
The Indexer
Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representation differs among search engines. It might also build additional structures (e.g. LSI).
The Query Processor
Processes user queries and returns matching answers in an order determined by a ranking algorithm.
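To make the indexer's central data structure concrete, here is a minimal sketch of building an inverted index in Python; the tokenizer and the sample documents are illustrative assumptions, not part of the slides:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it.

    `docs` is a dict of {doc_id: text}. Real indexers also store term
    positions and frequencies for ranking, which this sketch omits.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "crawling the web", 2: "the web is big"}
print(build_inverted_index(docs)["web"])  # -> [1, 2]
```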
Motivation for crawlers
Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential competitors, partners
Monitor Web sites of interest
Can you think of some others?
Crawler: basic idea
[Figure: the crawl expands outward from the starting pages (seeds), following links to new pages]
Research on crawlers
1993: The first crawler, Matthew Gray's Wanderer.
1994:
David Eichmann. The RBSE Spider: Balancing Effective Search Against Web Load. In Proceedings of the First International World Wide Web Conference, 1994.
Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the First International World Wide Web Conference, 1994.
Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, 1994.
Many names
Crawler
Spider
Robot (or bot)
Web agent
Wanderer, worm, ...
And famous instances: googlebot, scooter, slurp, msnbot, ...
A crawler within a search engine
[Figure: the crawler (googlebot) fetches pages from the Web into a page repository; text & link analysis builds the text index and PageRank data; given a query, the ranker uses these to return hits]
Crawler basic algorithm
1. Remove a URL from the unvisited URL list
2. Determine the IP address of its host name
3. Download the corresponding document
4. Extract any links contained in it
5. For each extracted URL that is new, add it to the list of unvisited URLs
6. Process the downloaded document
7. Go back to step 1
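The steps above map directly onto a small sequential crawler. Below is a minimal, hedged sketch in Python using the widely available `requests` and `beautifulsoup4` packages; the seed URL and the 100-page stop criterion are assumptions made for the example, not part of the slides:

```python
import socket
from urllib.parse import urljoin, urlparse

import requests                      # third-party HTTP client
from bs4 import BeautifulSoup        # third-party HTML parser

frontier = ["https://example.com/"]              # seed the unvisited URL list
visited = set()

while frontier and len(visited) < 100:           # stop criterion: 100 pages
    url = frontier.pop(0)                        # 1. remove a URL (FIFO order)
    if url in visited:
        continue
    try:
        socket.gethostbyname(urlparse(url).hostname)  # 2. resolve the host name
        page = requests.get(url, timeout=10)          # 3. download the document
    except (socket.gaierror, TypeError, requests.RequestException):
        continue                                 # skip unreachable pages
    visited.add(url)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a", href=True):   # 4. extract the links
        new_url = urljoin(url, link["href"])     # resolve relative links
        if new_url not in visited:               # 5. enqueue new URLs
            frontier.append(new_url)
    # 6. process the downloaded document (index it, store it, ...)
    # 7. the while loop takes us back to step 1
```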
Architecture of a crawler
[Figure: the URL Frontier feeds URLs to the www via DNS resolution and Fetch; Parse extracts content and links; content passes a Content Seen? test against doc fingerprints, and extracted URLs pass a URL Filter (robots templates) and Dup URL Elim (against the URL set) before re-entering the frontier]
URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
DNS: domain name service resolution; look up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the URL.
Parse: the page is parsed, and its content (text, images, videos, etc.) and links are extracted.
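As a rough illustration of the frontier component, here is a toy Python frontier that is seeded once and hands out unfetched URLs; the class name and seed URL are invented for the example, and production frontiers additionally handle politeness and prioritization:

```python
from collections import deque

class URLFrontier:
    """Toy FIFO frontier: holds the URLs yet to be fetched in a crawl."""

    def __init__(self, seeds):
        self._queue = deque(seeds)   # the crawl starts from the seed set
        self._known = set(seeds)     # every URL ever enqueued

    def next_url(self):
        return self._queue.popleft() if self._queue else None

    def add(self, url):
        if url not in self._known:   # enqueue each URL at most once
            self._known.add(url)
            self._queue.append(url)

frontier = URLFrontier(["https://example.com/"])
print(frontier.next_url())           # -> https://example.com/
```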
Architecture of a crawler (cont'd)
Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g. by robots.txt). URLs should also be normalized: a relative link such as "Disclaimers" on en.wikipedia.org/wiki/Main_Page must be resolved against the page's URL.
Dup URL Elim: the URL is checked against the URL set for duplicate elimination.
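The two checks can be sketched in a few lines of Python using only the standard library; note that the plain hash below is a simplifying assumption, since real crawlers use near-duplicate fingerprints (shingles, simhash) for the Content Seen? test:

```python
import hashlib
from urllib.parse import urldefrag, urljoin

def fingerprint(html: str) -> str:
    """Naive content fingerprint: a hash of the raw page text.

    Exact hashing only catches byte-identical pages; near-duplicate
    schemes (shingles, simhash) are what production crawlers use.
    """
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def normalize(base_url: str, href: str) -> str:
    """Resolve a relative link against the page it was found on."""
    absolute = urljoin(base_url, href)    # "Disclaimers" -> absolute URL
    url, _fragment = urldefrag(absolute)  # drop #fragments
    return url

print(normalize("https://en.wikipedia.org/wiki/Main_Page", "Disclaimers"))
# -> https://en.wikipedia.org/wiki/Disclaimers
```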
Basic crawlers
This is a sequential crawler.
Seeds can be any list of starting URLs.
The order of page visits is determined by the frontier data structure.
The stop criterion can be anything.
Graph traversal (BFS or DFS?)
Breadth First Search
Implemented with a QUEUE (FIFO)
Finds pages along shortest paths
If we start with good pages, this keeps us close; maybe near other good stuff
Depth First Search
Implemented with a STACK (LIFO)
Wanders away (lost in cyberspace)
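The only difference between the two strategies is which end of the frontier the crawler pops from. A minimal Python illustration, where `frontier` is simply a deque of URLs and the seed names are placeholders:

```python
from collections import deque

frontier = deque(["seed1", "seed2"])   # hypothetical seed URLs

# Breadth First Search: treat the frontier as a FIFO queue.
url = frontier.popleft()               # oldest URL first -> stays near seeds

# Depth First Search: treat the same deque as a LIFO stack.
url = frontier.pop()                   # newest URL first -> wanders deep
```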
Implementation issues
Don't want to fetch the same page twice!
Keep a lookup table (hash) of visited pages. What if a page is not yet visited but is already in the frontier?
The frontier grows very fast! May need to prioritize for large crawls.
The fetcher must be robust! Don't crash if a download fails; use a timeout mechanism.
We can also conflate synonyms into a single form using a thesaurus, giving a 30-50% smaller index.
Doing this in both pages and queries allows us to retrieve pages about "automobile" when the user asks for "car".
The thesaurus can be implemented as a hash table.
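To illustrate that last point, such a thesaurus can be nothing more than a dictionary lookup in Python; the synonym entries below are invented examples:

```python
# Toy thesaurus: map each term to its canonical form before indexing.
THESAURUS = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}

def conflate(term: str) -> str:
    """Return the canonical form of a term; unknown terms pass through."""
    return THESAURUS.get(term.lower(), term.lower())

# Applied to both pages and queries, so a query for "car"
# matches pages indexed under "automobile".
print(conflate("Car"))  # -> automobile
```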
MORE ABOUT CRAWLERS
Honor the Robot Exclusion Protocol
A server can specify which parts of its document tree any crawler is or is not allowed to crawl via a file named robots.txt placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt. A crawler should always check, parse, and obey this file before sending any requests to the server.
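Python's standard library ships a parser for this protocol. The snippet below is a minimal example of checking a URL against a site's robots.txt before fetching it; the user-agent string and the page URL are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.indiana.edu/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Check permission before sending any request for this page.
if rp.can_fetch("MyCrawler/1.0", "http://www.indiana.edu/some/page.html"):
    pass  # allowed: safe to fetch
else:
    pass  # excluded: skip this URL
```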
Gray areas for crawler ethics
If you write a crawler that unwittingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?
Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
If you write a browser extension that performs some useful service, should you comply with robot exclusion?
Thank you!