Crawling the Web: Presentation


TRANSCRIPT

  • Slide 1/21

  • Slide 2/21

    Outline

    Crawlers within a search engine

    Motivation and taxonomy of crawlers

    Architecture of a web crawler

    Basic crawlers and implementation issues

    Crawler ethics and conflicts


  • Slide 3/21

    Q: How does a search engine know that all these pages contain the query terms?

    A: Because all of those pages have been crawled


  • Slide 4/21

    Organizing the Web

    The Web is big. Really big. Over 3 billion pages, just in the indexable Web.

    The Web is dynamic.

    Problems:

    How to store a database of links?

    How to crawl the web?

    How to recommend pages that match a query?

  • Slide 5/21

    Architecture of a Search Engine

    1. A web crawler gathers a snapshot of the Web
    2. The gathered pages are indexed for easy retrieval
    3. User submits a search query
    4. Search engine ranks pages that match the query and returns an ordered list

  • Slide 6/21

    Search Engine: major components

    Crawlers
    Collect documents by recursively fetching links from a set of starting pages.
    Each crawler has different policies; the pages indexed by the various search engines differ.

    The Indexer
    Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representation differs among search engines. Might also build additional structures (e.g. LSI). A toy inverted index sketch follows below.

    The Query Processor
    Processes user queries and returns matching answers in an order determined by a ranking algorithm.
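    As a toy illustration of the indexer's central data structure, here is a minimal inverted index sketch in Python; the three documents and the whitespace tokenization are made-up examples, not how any particular engine does it.

```python
from collections import defaultdict

# Toy document collection (made-up examples).
docs = {
    1: "web crawlers fetch pages from the web",
    2: "search engines index pages for fast retrieval",
    3: "the query processor ranks matching pages",
}

# Inverted index: term -> set of IDs of documents containing the term.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["pages"]))  # [1, 2, 3]
print(sorted(inverted_index["query"]))  # [3]
```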

  • Slide 7/21

    Motivation for crawlers

    Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)

    Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.

    Business intelligence: keep track of potential competitors, partners

    Monitor Web sites of interest

    Can you think of some others?


  • Slide 8/21

    Crawler: basic idea

    [Diagram: the crawl expands outward from the starting pages (seeds), following links to new pages.]

  • Slide 9/21

    Research on crawlers

    1993: First crawler, Matthew Gray's Wanderer

    1994:

    David Eichmann. The RBSE Spider: Balancing Effective Search Against Web Load. In Proceedings of the First International World Wide Web Conference, 1994.

    Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the First International World Wide Web Conference, 1994.

    Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, 1994.

  • Slide 10/21

    Many names

    Crawler

    Spider

    Robot (or bot)

    Web agent

    Wanderer, worm, ...

    And famous instances: googlebot, scooter, slurp, msnbot, ...


  • Slide 11/21

    A crawler within a search engine

    [Diagram: a crawler (googlebot) fetches pages from the Web into a page repository; text & link analysis builds the text index and PageRank; at query time a ranker combines them to return hits.]

  • Slide 12/21

    Crawler basic algorithm

    1. Remove a URL from the unvisited URL list
    2. Determine the IP address of its host name
    3. Download the corresponding document
    4. Extract any links contained in it
    5. If an extracted URL is new, add it to the list of unvisited URLs
    6. Process the downloaded document
    7. Back to step 1
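    The steps above map almost line-for-line onto a minimal sequential crawler. A sketch in Python using only the standard library; the seed URL is a placeholder, and a real crawler would add politeness delays, robots.txt checks, and far more robust error handling.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier = list(seeds)              # unvisited URL list
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)           # 1. remove a URL from the list
        if url in visited:
            continue
        try:
            # 2-3. urlopen resolves the host name and downloads the document
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                    # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)               # 4. extract any links contained in it
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:  # 5. add new URLs to the list
                frontier.append(absolute)
        process(url, html)              # 6. process the downloaded document

def process(url, html):
    print(url, len(html), "bytes")      # placeholder for indexing

# crawl(["http://example.com/"])       # hypothetical seed
```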

  • Slide 13/21

    Architecture of a crawler

    [Diagram: URL Frontier -> DNS -> Fetch (www) -> Parse -> Content Seen? (doc fingerprints) -> URL Filter (robots templates) -> Dup URL Elim (URL set) -> back to the URL Frontier]

    URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.

    DNS: domain name service resolution. Look up the IP address for a domain name (a lookup sketch follows below).

    Fetch: generally use the HTTP protocol to fetch the URL.

    Parse: the page is parsed. Text and links are extracted (images, videos, etc.).
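    For the DNS step, a minimal lookup sketch using Python's standard library; the host name is just an example. Since every fetch needs a resolution, crawlers typically cache these results rather than query DNS every time.

```python
import socket

# Resolve a domain name to an IP address before fetching (example host).
host = "www.example.com"
ip = socket.gethostbyname(host)
print(host, "->", ip)
```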

  • Slide 14/21

    Architecture of a crawler (cont'd)

    [Same diagram as the previous slide.]

    Content Seen?: test whether a web page with the same content has already been seen at another URL. Need a way to compute a fingerprint of a web page (a sketch follows below).

    URL Filter:

    Whether the extracted URL should be excluded from the frontier (robots.txt).

    URLs should be normalized (relative links expanded), e.g. a relative link to the disclaimers page on en.wikipedia.org/wiki/Main_Page must be expanded to an absolute URL.

    Dup URL Elim: the URL is checked for duplicate elimination.
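    A minimal sketch of the content-seen test and URL normalization, assuming an MD5 digest of the page body serves as the fingerprint; production crawlers often use shingle-based fingerprints instead, so they can also catch near-duplicates.

```python
import hashlib
from urllib.parse import urljoin, urldefrag

seen_fingerprints = set()

def content_seen(html: str) -> bool:
    """Return True if an identical page body was already crawled."""
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def normalize(base_url: str, link: str) -> str:
    """Expand a relative link against its base URL and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    return absolute

# e.g. the relative disclaimers link on Wikipedia's main page:
print(normalize("https://en.wikipedia.org/wiki/Main_Page",
                "/wiki/Wikipedia:General_disclaimer"))
```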

  • Slide 15/21

    Basic crawlers

    This is a sequential crawler

    Seeds can be any list of starting URLs

    Order of page visits is determined by the frontier data structure

    Stop criterion can be anything

  • Slide 16/21

    Graph traversal (BFS or DFS?)

    Breadth First Search
    Implemented with a QUEUE (FIFO)
    Finds pages along shortest paths
    If we start with good pages, this keeps us close; maybe other good stuff

    Depth First Search
    Implemented with a STACK (LIFO)
    Wanders away (lost in cyberspace)

    (A frontier sketch for both traversals follows below.)
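    The only difference between the two traversals is which end of the frontier the crawler pops from; a sketch using Python's collections.deque, with placeholder URLs:

```python
from collections import deque

frontier = deque(["http://example.com/a", "http://example.com/b"])

# BFS: treat the frontier as a FIFO queue.
oldest = frontier.popleft()   # visit the URL discovered first

# DFS: treat the same deque as a LIFO stack.
newest = frontier.pop()       # visit the URL discovered last
```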


  • Slide 17/21

    Implementation issues

    Don't want to fetch the same page twice!
    Keep a lookup table (hash) of visited pages
    What if a page is not yet visited but is already in the frontier?

    The frontier grows very fast! May need to prioritize for large crawls

    Fetcher must be robust!
    Don't crash if a download fails
    Timeout mechanism

    We can also conflate synonyms into a single form using a thesaurus (30-50% smaller index)
    Doing this in both pages and queries allows us to retrieve pages about automobile when the user asks for car
    The thesaurus can be implemented as a hash table (a sketch follows below)
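    A minimal sketch of the thesaurus-as-hash-table idea, with a made-up synonym map; both indexed pages and incoming queries would pass through the same conflation step.

```python
# Map each synonym to one canonical form (entries are made-up examples).
thesaurus = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}

def conflate(term: str) -> str:
    """Return the canonical form of a term, or the term itself."""
    return thesaurus.get(term.lower(), term.lower())

print(conflate("Car"))         # automobile
print(conflate("automobile"))  # automobile
print(conflate("crawler"))     # crawler (no thesaurus entry)
```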


  • Slide 18/21

    MORE ABOUT CRAWLERS

    Honor the Robot Exclusion Protocol
    A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named robots.txt placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
    A crawler should always check, parse, and obey this file before sending any requests to the server

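    Python's standard library ships a parser for this protocol; a minimal sketch using urllib.robotparser, where the user-agent string MyCrawler/1.0 is a made-up example:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.indiana.edu/robots.txt")
rp.read()   # fetch and parse the robots.txt file

# Check before sending any request to the server.
if rp.can_fetch("MyCrawler/1.0", "http://www.indiana.edu/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```
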
  • Slide 19/21

  • Slide 20/21

    Gray areas for crawler ethics

    If you write a crawler that unwittingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?

    Is non-compliance with Google's robots.txt in this case equivalent to click fraud?

    If you write a browser extension that performs some useful service, should you comply with robot exclusion?


  • Slide 21/21

    Thank you!