8/3/2019 B Level Project Combined Index
1/59
WEB SPIDER
A Focused Crawler
Acknowledgements
It would be truly unfair not to express our gratitude to all those who helped
us complete this project. We would like to show our deepest gratitude to our
project guide Dr. Anupam Agarwal, without whom this project would not
have been possible. It was he who motivated us for this cause and was
always present with his precious guidance and ideas, besides being extremely
supportive and understanding at all times. This fueled our enthusiasm even
further and encouraged us to boldly step into what was a totally dark and
unexplored expanse before us.
We would also like to thank our batch mates and seniors who were ready
with a positive comment all the time, whether it was an off-hand comment to
encourage us or a constructive piece of criticism. Their positive as well as
critical comments were of great help in giving the project its present form.
Abstract
The world-wide web, having over 350 million pages, continues to grow
rapidly at a million pages per day. About 600 GB of text changes every
month. Such growth and flux poses basic limits of scale for today's generic
crawlers and search engines. In spite of using high-end multiprocessors and
exquisitely crafted crawling software, the largest crawls cover only 30-40%
of the web, and refreshes take weeks to a month. With such unprecedented
scaling challenges for general-purpose crawlers and search engines, we
propose a hypertext resource discovery system called a Focused Crawler.
The goal of a focused crawler is to selectively seek out pages that are
relevant to a pre-defined set of topics. The topics are specified not using
keywords, but using exemplary documents.
To achieve such goal-directed crawling, we
evaluate the relevance of a hypertext document with respect to the focus
topics, thereby discarding the irrelevant pages and following the hyperlinks
of relevant pages only. Focused crawling thus steadily acquires relevant
pages while standard crawling quickly loses its way. It is therefore
very effective for building high-quality collections of Web documents on
specific topics, using modest desktop hardware.
Indian Institute of Information Technology, Allahabad
Contents
Student Declaration
Supervisor Recommendation
Acknowledgement
Abstract
List of figures used
Chapter 1: Introduction
1.1 Objective
1.2 Motivation
1.3 Problem Definition
Chapter 2: Literature Survey
2.1 Literature survey
2.2 Previous Work
Chapter 3: Project Model
3.1 Basic Architecture
3.2 Crawler Policies
3.3 Issues
Chapter 4: Algorithm Implementation
4.1 Outline
4.2 Parsing and Stemming
4.3 Threshold calculation
4.4 Document Frequency
4.5 Robots.txt
Chapter 5: Discussion and Results
5.1 Retrieval of relevant pages only
5.2 Multithreading
5.3 Crawl space reduction
5.4 Reduction of server overload
5.5 Robustness of Acquisition
5.6 Snapshots
Chapter 6: Conclusion
6.1 Conclusion
6.2 Challenges and Future work
Appendices
Appendix A: Term Vector Model
Appendix B: Basic Authentication Scheme
Appendix C: Term Frequency-Inverse Document Frequency
References
Technical references
Other references
List of figures:
Fig 1.1: Performance of an unfocused crawler
Fig 1.2: Performance of focused crawler
Fig 2.1: Basic Components of the crawler
Fig 2.2: Integration of crawler, classifier and distiller
Fig 2.3: Domain of focused web crawler
Fig 3.1: Simple Crawler Configuration
Fig 3.2: Control Flow of a Crawler Frontier
Fig 4.1: Basic functioning of crawl frontier
Fig 5.1: Comparison Analysis
Fig 5.2: Crawl Space reduction
Fig 5.3: Snapshot 1
Fig 5.4: Snapshot 2
Chapter I
Introduction
This section covers:
Objective
Motivation
Problem definition
1.1: Objective
To build a customized, multithreaded, focused crawler that crawls the web based
on the relevance of each web page, thus reducing the crawl space.
1.2: Motivation
The World Wide Web has grown from a few thousand pages in 1993 to more than two
billion pages at present. It continues to grow rapidly at a million pages per day.
About 600 GB of text changes every month. Due to this explosion in size, web search
engines are becoming increasingly important as the primary means of locating relevant
information [2]. Such search engines rely on massive collections of web pages that are
acquired with the help of web crawlers, which traverse the web by following hyperlinks
and storing downloaded pages in a large database that is later indexed for efficient
execution of user queries. Many researchers have looked at web search technology over
the last few years, including crawling strategies, storage, indexing, ranking techniques,
and a significant amount of work on the structural analysis of the web and web graph.
In spite of using high-end multiprocessors and exquisitely crafted crawling software, the
largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. The
overwhelming engineering challenges are in part due to the one-size-fits-all philosophy:
the crawler trying to cater to every possible query.
Serious web users adopt the strategy of filtering by relevance and quality. The growth of
the web matters little to a physicist if at most a few dozen pages dealing with quantum
electrodynamics are added or updated per week. Seasoned users also rarely roam
aimlessly; they have bookmarked sites important to them, and their primary need is to
expand and maintain a community around these examples while preserving the quality. A
focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics.
It is crucial that the harvest rate of the focused crawler (the fraction of page
fetches which are relevant to the user's interest) be high; otherwise it would be
easier to crawl the whole web and bucket the results into topics as a post-processing step.
Fig 1.1: Performance of unfocused crawler[10]
Fig 1.2: Performance of focused crawler[10]
As Fig 1.2 shows, the fraction of page fetches relevant to the user's interest is
much higher for the focused crawler than for the unfocused crawler (Fig 1.1). The
crawl space of a focused crawler can also be reduced to a large extent compared
to that of a normal crawler.
1.3: Problem Definition
Our project aims to build a customized, multithreaded, focused crawler that
crawls the web based on the relevance of each web page. The approach
concentrates specifically on a particular domain.
In order to achieve the objectives it should be able to perform the following:
Efficient Preprocessing: This involves the preprocessing of the input
documents. We aim to provide efficient parsing and stemming of pages. Initially,
the user will be required to provide a set of example pages along with his search
query. These example pages will be parsed, removing all the stop words, and
finally the text will be stemmed.
Knowledge Retrieval: To provide efficient retrieval of information-containing
words. Once the text has been stemmed, the information-containing words will be
picked, and these will form the information that the crawler carries with it.
Crawling: To build a crawler that starts from a root node or URL, called the
seed. As the crawler visits these URLs, it will identify all the hyperlinks in the
page and add them to the list of URLs to visit, called the crawl frontier. URLs
from the frontier will then be recursively visited.
Retrieving relevant pages: We aim to retrieve only those pages which are
closely related to the corresponding query. In our case we will deal with the most
relevant pages only. This will reduce the burden on the user of scanning through
all the retrieved pages to find the pages of his interest.
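The crawling step described above can be sketched as a simple breadth-first loop over a frontier. The sketch below is only an illustration: the in-memory web dictionary and the fetch_links callback stand in for real HTTP fetching and HTML parsing, and are not part of our implementation.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl: maintain a frontier of unvisited URLs,
    starting from the seed, and never fetch the same URL twice."""
    frontier = deque([seed])      # the crawl frontier
    visited = []                  # pages fetched, in order
    seen = {seed}                 # URLs already queued
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):   # hyperlinks found on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A tiny in-memory "web" standing in for real HTTP fetching:
web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl("a", lambda u: web.get(u, []))
```

In the focused crawler, a relevance test is applied before links are added to the frontier; the loop structure itself is unchanged.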
Chapter II
Literature Survey
This section covers:
Literature survey
Background and Previous work
2.1: Literature survey
2.1.1: Basic Crawler
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for
general-purpose crawlers and search engines. We want to implement a new hypertext
resource discovery system called a Focused Crawler. The goal of a focused crawler is
to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are
specified not using keywords, but using exemplary documents. Rather than collecting and
indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a
focused crawler analyzes its crawl boundary to find the links that are likely to be most
relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant
savings in hardware and network resources, and helps keep the crawl more up-to-date.
To achieve such goal-directed crawling, we will design two hypertext
mining programs that guide our crawler: a classifier that evaluates the relevance of a
hypertext document with respect to the focus topics, and a distiller that identifies
hypertext nodes that are great access points to many relevant pages within a few links. [7]
We report on extensive focused-crawling experiments using several topics at different
levels of specificity.
Focused crawling acquires relevant pages steadily while standard
crawling quickly loses its way, even though they are started from the same root set.
Focused crawling is robust against large perturbations in the starting set of URLs. It
discovers largely overlapping sets of resources in spite of these perturbations. It is also
capable of exploring out and discovering valuable resources that are dozens of links away
from the start set, while carefully pruning the millions of pages that may lie within this
same radius.[5]
As a result it is highly efficient compared to normal crawlers. A normal crawler
works well for some time after it starts crawling but then loses its way, which is
its biggest disadvantage relative to a focused crawler. Our anecdotes suggest that
focused crawling is very effective for building high-quality collections of Web
documents on specific topics, using modest desktop hardware.[3]
Fig 2.1: Basic Components of the crawler[2]
The focused crawler has three main components: a classifier which makes relevance
judgments on pages crawled to decide on link expansion, a distiller which determines a
measure of centrality of crawled pages to determine visit priorities, and a crawler with
dynamically reconfigurable priority controls which is governed by the classifier and
distiller.[2]
Its block diagram can be shown as
Fig 2.2: Focused crawler showing how the crawler, classifier and distiller are integrated. [1]
2.1.2: Classification
Relevance is enforced on the focused crawler using a hypertext classifier. We assume that
the category taxonomy induces a hierarchical partition on Web documents. (In real life,
documents are often judged to belong to multiple categories.) The aim is to acquire
useful pages, not merely to eliminate irrelevant ones. Human judgment, although
subjective and even erroneous,
would be best for measuring relevance. Clearly, even for an experimental crawler that
acquires only ten thousand pages per hour, this is impossible. Therefore we use our
classifier to estimate the relevance of the crawl graph. It is to be noted carefully that we
are not, for instance, training and testing the classifier on the same set of documents, or
checking the classifier's earlier evaluation of a document using the classifier itself. Just as
human judgment is prone to variation and error, a statistical program can make mistakes.
Based on such imperfect recommendations, we choose whether or not to expand pages. Later,
when a page that was chosen is visited, we evaluate its relevance, and thus the value of
that decision.[8]
2.1.3: Distillation
Relevance is not the only attribute used to evaluate a page while crawling. A long essay
very relevant to the topic but without links is only a finishing point in the crawl. A good
strategy for the crawler is to identify hubs: pages that are almost exclusively a collection
of links to authoritative resources that are relevant to the topic.* Social network analysis
is concerned with the properties of graphs formed between entities such as people,
organizations, papers, etc., through coauthoring, citations, mentoring, paying,
telephoning, infecting, etc. Prestige is an important attribute of nodes in a social network,
especially in the context of academic papers and Web documents. The number of
citations to a paper is a reasonable but crude measure of its prestige. Also many hubs are
multi-topic in nature, e.g., a published bookmark file pointing to sports car sites and
photography sites.[4]
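Hub and authority scores of the kind the distiller relies on can be estimated with the standard HITS iteration. The sketch below is a generic textbook version on a toy link graph, not the report's actual distiller:

```python
from math import sqrt

def hits(graph, iterations=20):
    """Compute hub and authority scores for a link graph given
    as {page: [pages it links to]}, normalizing each round."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to the node
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        # hub score: sum of authority scores of pages the node links to
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        for scores in (auth, hub):   # normalize to unit length
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# "hub" links to two resources, "other" to one; "hub" scores highest as a hub.
hub, auth = hits({"hub": ["r1", "r2"], "other": ["r1"]})
```

Pages with high hub scores are exactly the link collections a focused crawler wants to prioritize.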
2.1.4: Integration with the crawler
The crawler has one watchdog thread and many worker threads. The watchdog is in
charge of checking out new work from the crawl frontier, which is stored on disk. New
work is passed to workers using shared memory buffers. Workers save details of newly
explored pages in private per-worker disk structures. In bulk-synchronous fashion,
workers are stopped, and their results are collected and integrated into the central pool of
work.[4]
While it is fairly easy to build a slow crawler that downloads a few pages
per second for a short period of time, building a high-performance system that can
download hundreds of millions of pages over several weeks presents a number of
challenges in system design, I/O and network efficiency, and robustness and
manageability.[2]
* Refer Appendix B
Perhaps the most crucial evaluation of focused crawling is to measure the rate at which
relevant pages are acquired, and how effectively irrelevant pages are filtered off from the
crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot
of time merely eliminating irrelevant pages, and it may be better to use an ordinary
crawler instead! It would be good to judge the relevance of the crawl by human
inspection, even though it is subjective and inconsistent. But this is not possible for the
hundreds of thousands of pages our system crawled. Therefore we have to take recourse
to running an automatic classifier over the collected pages. Specifically, we can use our
classifier. It may appear that using the same classifier to guide the crawler and judge the
relevance of crawled pages is flawed methodology, but it is not so. We are evaluating not
the classifier but the basic crawling heuristic that neighbors of highly relevant pages tend
to be relevant.
Fig 2.3: Domain of focused web crawler[11]
The unfocused crawler starts out from the same set of dozens of highly relevant links as
the focused crawler, but is completely lost within the next hundred page fetches: the
relevance goes quickly to zero. In contrast, the focused crawl keeps up a healthy pace
of acquiring relevant pages over thousands of pages, in spite of some short-range rate
fluctuations, which is expected. On average, between a third and half of all page
fetches result in success over the first several thousand fetches, and there seems to be no
sign of stagnation.
Crawling the Web, in a certain way, resembles watching the sky on a clear night: what we
see reflects the state of the stars at different times, as their light travels different distances.
What a Web crawler gets is not a snapshot of the Web, because it does not represent
the Web at any given instant of time. The last pages being crawled are probably very
accurately represented, but the first pages that were downloaded have a high probability
of having been changed. [6]
2.2: Previous work
The following is a list of published crawler architectures for general-purpose crawlers
(excluding focused Web crawlers), with a brief description that includes the names given
to the different components and outstanding features:
2.2.1: RBSE was the first published web crawler. It was based on two programs: the first
program, "spider", maintains a queue in a relational database, and the second program,
"mite", is a modified www ASCII browser that downloads the pages from the Web. It was
presented at the First International Conference on the World Wide Web, Geneva,
Switzerland.[12]
2.2.2: Google Crawler is described in some detail, but the reference is only about an
early version of its architecture, which was based on C++ and Python. The crawler was
integrated with the indexing process, because text parsing was done for full-text indexing
and also for URL extraction. There is a URL server that sends lists of URLs to be fetched
by several crawling processes. During parsing, the URLs found were passed to a URL
server that checked if the URL had been previously seen. If not, the URL was added to
the queue of the URL server.[16]
2.2.3: Mercator is a distributed, modular web crawler written in Java. Its modularity
arises from the usage of interchangeable "protocol modules" and "processing modules".
Protocol modules are related to how to acquire the Web pages (e.g. by HTTP), and
processing modules are related to how to process Web pages. The standard processing
module just parses the pages and extracts new URLs, but other processing modules can
be used to index the text of the pages, or to gather statistics from the Web.[15]
2.2.4: WebRACE is a crawling and caching module implemented in Java, and used as a
part of a more generic system called eRACE. The system receives requests from users for
downloading Web pages, so the crawler acts in part as a smart proxy server. The system
also handles requests for "subscriptions" to Web pages that must be monitored: when the
pages change, they must be downloaded by the crawler and the subscriber must be
notified. The most outstanding feature of WebRACE is that, while most crawlers start
with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to
crawl from.[18]
2.2.5: Ubicrawler is a distributed crawler written in Java, and it has no central process. It
is composed of a number of identical "agents"; and the assignment function is calculated
using consistent hashing of the host names. There is zero overlap, meaning that no page
is crawled twice, unless a crawling agent crashes (then, another agent must re-crawl the
pages from the failing agent). The crawler is designed to achieve high scalability and to
be tolerant to failures.[13]
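The consistent-hashing assignment that UbiCrawler is described as using can be sketched as a hash ring with virtual nodes. The agent names, replica count and hash function below are illustrative, not taken from UbiCrawler itself:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Assign host names to crawling agents via consistent hashing.
    Each agent gets several points (virtual nodes) on a hash ring;
    a host is handled by the first agent point clockwise of its hash."""

    def __init__(self, agents, points=64):
        self.ring = sorted(
            (self._h(f"{agent}#{i}"), agent)
            for agent in agents for i in range(points)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def agent_for(self, host):
        i = bisect(self.keys, self._h(host)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["agent-0", "agent-1", "agent-2"])
owner = ring.agent_for("www.example.com")   # stable: same host, same agent
```

The key property is that when one agent crashes, only the hosts that were assigned to it move; every other host keeps its agent, so no page is ever crawled twice.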
2.2.6: Some Open-source crawlers [11]
DataparkSearch
GNU Wget
Heritrix
HTTrack
Methabot
Nutch
WebSPHINX
Sherlock Holmes
YaCy
Chapter III
Project Model
This section covers:
Basic Architecture
Crawler Policies
Issues
3.1: Basic Architecture
In this project we will develop a web crawler that will start with a list of URLs to visit,
called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the
page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the
frontier are recursively visited according to a set of policies.
While it is fairly easy to build a slow crawler that downloads a few pages per second for a
short period of time, building a high-performance system that can download hundreds of
millions of pages over several weeks presents a number of challenges in system design,
I/O and network efficiency, and robustness and manageability.
3.1.1: Basic Concept -
A web crawler, also known as a Web spider or Web robot, is a program or automated
script which browses the World Wide Web in a methodical, automated manner. Web
crawlers are mainly used to create a copy of all the visited pages for later processing by a
search engine, which will index the downloaded pages to provide fast searches. Crawlers
can also be used for automating maintenance tasks on a Web site, such as checking links
or validating HTML code.
Web crawlers start by parsing a specified web page, noting any hypertext links on that
page that point to other web pages. They then parse those pages for new links, and so on,
recursively. Web-crawler software doesn't actually move around to different computers
on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine.
The crawler simply sends HTTP requests for documents to other machines on the Internet
just as a web browser does when the user clicks on links. All the crawler really does is to
automate the process of following links. [2]
3.1.2: Architecture -
The input to the focused crawler is the search query of the user. Also, a set of example
pages relating to the query has to be given to the crawler. A series of parses are done on
these example pages to finally extract the information-containing words. These
information-containing words are given as input to the crawler, which carries them along
with it. Based on this information, the crawler calculates the relevance of an encountered
page, and only if the relevance is satisfying, will it be stored for further crawling. [4]
Fig 3.1: Simple Crawler Configuration [4]
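The relevance test can be illustrated with a simple cosine similarity between the term vector of a fetched page and the profile built from the example pages, in the spirit of the term vector model of Appendix A. The sample texts and the threshold value are purely illustrative; our actual threshold calculation is described in Chapter 4:

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Raw term-frequency vector of a (preprocessed) text."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

profile = term_vector("focused crawler relevance topic crawler")
page = term_vector("a focused crawler follows relevant links")
score = cosine(profile, page)
is_relevant = score >= 0.2   # illustrative threshold
```

Only pages whose score clears the threshold have their outgoing links added to the frontier.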
The architecture can be classified into two major components: the crawling system and
the crawling application. The crawling system itself consists of several specialized
components, in particular a crawl manager, a downloader and a DNS resolver.
The crawl manager is responsible for receiving the URL input stream from
the applications. After loading the URLs of a request file, the manager queries the DNS
resolvers for the IP addresses of the servers, unless a recent address is already cached.
The manager then requests the file robots.txt in the web server's root directory, unless it
already has a recent copy of the file. A downloader is a high-performance asynchronous
HTTP client capable of downloading hundreds of web pages in parallel, while a DNS
resolver is an optimized stub DNS resolver that forwards queries to local DNS servers.[6]
Finally, after parsing the robots files and removing excluded URLs, the requested URLs
are sent in batches to the downloader. The manager later notifies the application of the
pages that have been downloaded and are available for processing.
The crawling application starts out by giving a URL to the crawl
manager. The application then parses each downloaded page for hyperlinks, checks
whether these URLs have already been encountered before, and if not, sends them to the
manager in batches of a few hundred or thousand.[9] The downloaded files are then
forwarded to a storage manager for compression and storage in a repository.
3.1.3: Control flow
As the crawler gets the relevant pages, it retrieves their URLs and builds a list from
which it takes the URLs one by one and downloads the corresponding web page. The
downloaded page is then converted to a text file for simplicity. This text file is parsed,
removing all the stop words from it and stemming the remaining words using the Porter Stemmer.
Then its relevance is tested. If relevant, the URLs present on the page are extracted and
added to the list of URLs for further crawling.
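The parsing-and-stemming step above can be sketched as follows. The stop-word list here is a tiny illustrative sample, and the suffix stripping is a crude stand-in for the full Porter algorithm used in the project:

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}
SUFFIXES = ("ing", "ed", "es", "s")   # crude stand-in for Porter's rules

def stem(word):
    """Strip the first matching suffix, keeping a minimal word stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case the text, drop stop words, and stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

tokens = preprocess("The crawler is downloading and parsing pages")
```

The resulting tokens are what the relevance test operates on, so the same preprocessing must be applied to the example pages and to every fetched page.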
Fig 3.2: Control Flow of a Crawler Frontier
3.2: Crawler policies
There are three important characteristics of the Web that generate a scenario in which
Web crawling is very difficult: its large volume, its fast rate of change, and dynamic page
generation, which combine to produce a wide variety of possible crawlable URLs.
The large volume implies that the crawler can only download a fraction of the Web
pages within a given time, so it needs to prioritize all of its downloads. The high rate of
change implies that by the time the crawler is downloading the last pages from a site, it is
very likely that new pages have been added to the site, or that pages have already been
updated or even deleted.
The recent increase in the number of pages being generated by server-side scripting
languages has also created difficulty, in that endless combinations of HTTP GET
parameters exist, only a small selection of which will actually return unique content. For
example, a simple online photo gallery may offer three options to users, as specified
through HTTP GET parameters. If there exist four ways to sort images, three choices of
thumbnail size, two file formats, and an option to disable user-provided content, then
the same set of content can be accessed with forty-eight different URLs, all of which will
be present on the site. This mathematical combination creates a problem for crawlers, as
they must sort through endless combinations of relatively minor scripted changes in order
to retrieve unique content.
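The gallery arithmetic can be checked directly: 4 x 3 x 2 x 2 = 48. The parameter names and values below are made up for illustration; only the counts come from the example in the text:

```python
from itertools import product

sort_orders = ["name", "date", "size", "rating"]   # 4 ways to sort
thumb_sizes = ["s", "m", "l"]                      # 3 thumbnail sizes
formats = ["jpg", "png"]                           # 2 file formats
user_content = ["on", "off"]                       # toggle user content

urls = [
    f"/gallery?sort={s}&thumb={t}&fmt={f}&uc={u}"
    for s, t, f, u in product(sort_orders, thumb_sizes, formats, user_content)
]
# 4 * 3 * 2 * 2 = 48 distinct URLs, all serving the same underlying images
```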
The behavior of a Web crawler is the outcome of a combination of policies:
A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.
3.2.1: Selection policy -
Given the current size of the Web, even large search engines cover only a portion of the
publicly available Internet; a recent study showed that no search engine indexes more
than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it
is highly desirable that the downloaded fraction contains the most relevant pages, and not
just a random sample of the Web. [2]
This requires a metric of importance for prioritizing Web pages. The importance of a
page is a function of its intrinsic quality, its popularity in terms of links or visits, and
even of its URL (the latter is the case of vertical search engines restricted to a single top-
level domain, or search engines restricted to a fixed Web site). Designing a good
selection policy has an added difficulty: it must work with partial information, as the
complete set of Web pages is not known during crawling.
Crawling can be combined with different strategies. The ordering metrics can be breadth-
first, backlink-count and partial PageRank calculations. One conclusion was that if the
crawler wants to download pages with high PageRank early during the crawling
process, then the partial-PageRank strategy is the better one, followed by breadth-first
and backlink-count. However, these results are for just a single domain.
Nevertheless, the breadth-first strategy is generally considered better than PageRank
ordering. The explanation is simple: it has been shown that the most important
pages have many links to them from numerous hosts, and those links will be found early,
regardless of on which host or page the crawl originates.
3.2.2: Re-visit policy -
The Web has a very dynamic nature, and crawling a fraction of the Web can take a really
long time, usually measured in weeks or months. By the time a Web crawler has finished
its crawl, many events could have happened. These events can include creations, updates
and deletions. [2]
From the search engine's point of view, there is a cost associated with not detecting an
event, and thus having an outdated copy of a resource. The most used cost functions are
freshness and age.
Freshness: This is a binary measure that indicates whether the local copy is
accurate or not.
Age: This is a measure that indicates how outdated the local copy is.
The objective of the crawler is to keep the average freshness of pages in its collection as
high as possible, or to keep the average age of pages as low as possible. These objectives
are not equivalent: in the first case, the crawler is just concerned with how many pages
are out-dated, while in the second case, the crawler is concerned with how old the local
copies of pages are.
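The two cost functions can be written down directly. The sketch below uses the standard definitions (freshness is binary; age counts the time elapsed since the page last changed while our copy is stale); the timestamps are illustrative:

```python
def freshness(local_copy, live_copy):
    """Binary measure: 1 if the local copy matches the live page, else 0."""
    return 1 if local_copy == live_copy else 0

def age(now, last_change, is_fresh):
    """How outdated the local copy is: 0 while it is still fresh,
    otherwise the time elapsed since the page last changed."""
    return 0 if is_fresh else now - last_change

# A page changed at t=100 and our copy is stale, so at t=130 its age is 30.
a = age(130, 100, is_fresh=False)
```

A crawler maximizing average freshness and one minimizing average age can schedule re-visits quite differently, which is exactly the trade-off discussed above.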
Two simple re-visiting policies are:
Uniform policy: This involves re-visiting all pages in the collection with the same
frequency, regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change
more frequently. The visiting frequency is directly proportional to the (estimated)
change frequency.
In terms of average freshness, the uniform policy outperforms the proportional policy in
both a simulated Web and a real Web crawl. The explanation for this result comes from
the fact that, when a page changes too often, the crawler will waste time by trying to re-
crawl it too fast and still will not be able to keep its copy of the page fresh.
3.2.3: Politeness policy -
Crawlers can retrieve data much quicker and in greater depth than human searchers, so
they can have a crippling impact on the performance of a site. Needless to say if a single
crawler is performing multiple requests per second and downloading large files, a server
would have a hard time keeping up with requests from multiple crawlers. [2]
The use of Web crawlers is useful for a number of tasks, but comes with a price for the
general community. The costs of using Web crawlers include:
Network resources, as crawlers require considerable bandwidth and operate with a high
degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages
they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web
servers.
A partial solution to these problems is the robots exclusion protocol, also known as the
robots.txt protocol, which is a standard for administrators to indicate which parts of their
Web servers should not be accessed by crawlers. This standard does not include a
suggestion for the interval of visits to the same server, even though this interval is the
most effective way of avoiding server overload. Recently, commercial search engines like
Ask Jeeves, MSN and Yahoo have begun to use an extra "Crawl-delay:" parameter in the
robots.txt file to indicate the number of seconds to delay between requests. However, if
pages were downloaded at this rate from a website with more than 100,000 pages over a
perfect connection with zero latency and infinite bandwidth, it would take more than 2
months to download that entire website; also, only a fraction of the resources of that
Web server would be used. This does not seem acceptable.
Normally, an interval of 10 seconds is used between accesses, and some crawlers use 15
seconds as the default. Some even follow an adaptive politeness policy: if it took t
seconds to download a document from a given server, the crawler waits for 10t seconds
before downloading the next page.
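The fixed and adaptive intervals described above can be combined in a small helper (the function and parameter names are illustrative assumptions):

```python
def politeness_delay(last_download_secs, factor=10, min_delay=10):
    """Seconds to wait before the next request to the same server: 10t under
    the adaptive policy, but never less than the common 10-second default."""
    return max(min_delay, factor * last_download_secs)

# A fast download (0.4 s) still gets the default 10 s interval;
# a slow one (3 s) backs off to 30 s.
print(politeness_delay(0.4))  # 10
print(politeness_delay(3))    # 30
```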
3.2.4: Parallelization policy -
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to
maximize the download rate while minimizing the overhead from parallelization and to
avoid repeated downloads of the same page. To avoid downloading the same page more
than once, the crawling system requires a policy for assigning the new URLs discovered
during the crawling process, as the same URL can be found by two different crawling
processes. There are two basic policies:
Dynamic Assignment: With this type of policy, a central server assigns new URLs to
different crawlers dynamically. This allows the central server to, for instance,
dynamically balance the load of each crawler.
With dynamic assignment, typically the systems can also add or remove downloader
processes. The central server may become the bottleneck, so most of the workload must
be transferred to the distributed crawling processes for large crawls.
There are two configurations of crawling architectures with dynamic assignment: [2]
A small crawler configuration, in which there is a central DNS resolver and
central queues per Web site, and distributed downloaders.
A large crawler configuration, in which the DNS resolver and the queues are also
distributed.
Static Assignment: With this type of policy, there is a fixed rule stated from the
beginning of the crawl that defines how to assign new URLs to the crawlers.
For static assignment, a hashing function can be used to transform URLs (or, even better,
complete website names) into a number that corresponds to the index of the
corresponding crawling process. As there are external links that will go from a Web site
assigned to one crawling process to a website assigned to a different crawling process,
some exchange of URLs must occur.[14]
To reduce the overhead due to the exchange of URLs between crawling processes, the
exchange should be done in batch, several URLs at a time, and the most cited URLs in
the collection should be known by all crawling processes before the crawl (e.g. using
data from a previous crawl).
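A static assignment by hashing, as described above, might be sketched like this (hashing the complete web-site name so that every URL of a site maps to the same process; the function name is an illustrative assumption):

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_processes):
    """Static assignment sketch: hash the host name so that all URLs of a
    site go to the same crawling process."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_processes

# All pages of the same host land on the same crawling process:
a = assign_crawler("http://example.com/page1", 4)
b = assign_crawler("http://example.com/page2", 4)
print(a == b)  # True
```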
An effective assignment function must have three main properties: each crawling process
should get approximately the same number of hosts (balancing property), if the number
of crawling processes grows, the number of hosts assigned to each process must shrink
(contra-variance property), and the assignment must be able to add and remove crawling
processes dynamically. Consistent hashing, which replicates the buckets so that adding or
removing a bucket does not require re-hashing the whole table, achieves all of the
desired properties. Crawling is an effective process synchronization tool between the
users and the search engine.
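A minimal consistent-hashing ring with replicated buckets could look like the following sketch (the class, replica count, and process names are illustrative assumptions, not the report's implementation):

```python
import bisect
import hashlib

def _h(key):
    # Stable hash of a string as a large integer.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each crawling process owns several points ("replicated buckets") on a
    ring; a host maps to the first process point clockwise from its hash.
    Adding or removing a process only remaps the hosts near its own points."""
    def __init__(self, processes, replicas=64):
        self._ring = sorted(
            (_h(f"{p}#{i}"), p) for p in processes for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def assign(self, host):
        idx = bisect.bisect(self._keys, _h(host)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-0", "crawler-1", "crawler-2"])
print(ring.assign("example.com"))  # one of the three processes
```

Because removing a process deletes only its own points, any host that was assigned to a surviving process keeps its assignment, which is exactly the property plain modulo hashing lacks.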
3.3: Issues
3.3.1: How to Re-visit web pages
The optimal method of re-visiting the web while keeping the average freshness of web
pages high is to ignore the pages that change too often.
The approaches could be: [2]
Re-visiting all pages in the collection with the same frequency, regardless of their
rates of change.
Re-visiting more often the pages that change more frequently.
In both cases, the repeated crawling order of pages can be done either at random or with a
fixed order.
The re-visiting methods considered here regard all pages as homogeneous in terms of
quality ("all pages on the Web are worth the same"), something that is not a realistic
scenario.
3.3.2: How to avoid overloading websites
Crawlers can retrieve data much quicker and in greater depth than human searchers, so
they can have a crippling impact on the performance of a site. Needless to say, if a single
crawler is performing multiple requests per second and/or downloading large files, a
server would have a hard time keeping up with requests from multiple crawlers.
The use of Web crawlers is useful for a number of tasks, but comes with a price for the
general community.
The costs of using Web crawlers include: [1]
Network resources, as crawlers require considerable bandwidth and operate with a
high degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too
high.
Poorly written crawlers, which can crash servers or routers, or which download
pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and
Web servers.
To resolve this problem, we can use the robots exclusion protocol, also known as the
robots.txt protocol.
The robots exclusion standard or robots.txt protocol is a convention to prevent
cooperating web spiders and other web robots from accessing all or part of a website. We
can specify a top-level directory of the web site in a file called robots.txt, and this will
prevent the crawler from accessing that directory. This protocol uses simple substring
comparisons to match the patterns defined in the robots.txt file. So, while using this
robots.txt file, we need to make sure that a final "/" character is appended to the
directory path. [17] Otherwise, files with names starting with that substring will be
matched rather than the directory.
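The substring matching and the trailing-slash pitfall can be illustrated in a few lines (a simplified model of the prefix matching, not a full robots.txt parser):

```python
def is_disallowed(path, disallow_rules):
    """A rule matches any path that starts with it, which is why directory
    rules need a trailing '/' to avoid matching look-alike file names."""
    return any(path.startswith(rule) for rule in disallow_rules)

# Without the trailing slash, '/private' also blocks '/private-notes.html':
print(is_disallowed("/private-notes.html", ["/private"]))   # True
# With it, only the directory's contents are blocked:
print(is_disallowed("/private-notes.html", ["/private/"]))  # False
print(is_disallowed("/private/a.html", ["/private/"]))      # True
```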
Chapter IV
Algorithm Implementation
This section covers:
Outline
Parsing and Stemming
Threshold calculation
Document Frequency
Robots.txt
4.1: Outline
The input to the focused crawler is the search query of the user in the form of example
pages. Parsing is done and words are retrieved. These information-carrying words are
given as input to the crawler, which carries them along with it. Based on this information,
the crawler calculates the relevance of an encountered page, and only if the relevance is
satisfactory will the page be stored for further crawling.
Fig 4.1: Basic functioning of crawl frontier
A pseudo-code summary of the crawler:

Ask the user to specify the starting URL on the web and the file type that the crawler
should crawl.
Add the URL to the empty list of URLs to search.
While not empty ( the list of URLs to search )
{
    Take the first URL from the list of URLs.
    Mark this URL as an already searched URL.
    If the URL protocol is not HTTP then
        continue with the next URL
    If a robots.txt file exists on the site then
        If the file includes a "Disallow" statement for this URL then
            continue with the next URL
    Open the URL.
    If the opened URL is not an HTML file then
        continue with the next URL
    Iterate over the HTML file:
    While the HTML text contains another link
    {
        If a robots.txt file exists on the linked URL/site then
            If the file includes a "Disallow" statement then
                skip this link
        If the linked URL is an HTML file then
            If the URL isn't marked as searched then
                add it to the list of URLs to search
        Else if the type of the file is the one the user requested then
            Add it to the list of files found.
    }
}
4.2: Parsing and Stemming
4.2.1: Parsing (more formally, syntactic analysis) is the process of analyzing a sequence
of tokens to determine its grammatical structure with respect to a given formal grammar.
A parser is the component of a compiler that carries out this task.
Parsing transforms input text into a data structure, usually a tree, which is suitable for
later processing and which captures the implied hierarchy of the input. Lexical analysis
creates tokens from a sequence of input characters, and it is these tokens that are
processed by a parser to build a data structure such as a parse tree or an abstract syntax tree.
Parsing is also an earlier term for the diagramming of sentences in the grammar of a natural
language, and is still used to diagram the grammar of inflected languages, such as the
Romance languages or Latin.
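In the crawler's setting, this token-to-structure pipeline can be illustrated with Python's standard-library HTML tokenizer collecting hyperlinks (the class is a sketch, not the report's parser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """The tokenizer feeds us start tags; we collect the hyperlinks the
    crawler needs from each fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

p = LinkExtractor()
p.feed('<html><body><a href="http://x.com/a">A</a><a href="/b">B</a></body></html>')
print(p.links)  # ['http://x.com/a', '/b']
```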
4.2.2: Removal of Stop Words
Firstly, the web page is converted into a text file for convenience. An initial parse
removes all the stop words from the file. Stop words are words which are filtered out
prior to, or after, processing of natural language data (text). Some of the most frequently
used stop words include "a", "of", "the", "I", "it", "you", and "and". These are generally
regarded as 'functional words' which do not carry meaning (they are not as important for
communication). [16] The assumption is that, when assessing the contents of the web
page, the meaning can be conveyed more clearly, or interpreted more easily, by ignoring
the functional words. A Stop List is maintained in a separate text file and all the words of
that file are removed from the file being parsed.
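A minimal sketch of this first parse (with a tiny inline stop list standing in for the report's stop-list file):

```python
def remove_stop_words(text, stop_words):
    """Keep only the tokens that are not on the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]

stop = {"a", "of", "the", "i", "it", "you", "and"}
print(remove_stop_words("The meaning of a page", stop))
# ['meaning', 'page']
```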
4.2.3: Stemming
Next, a Porter Stemmer is run on this file. Stemming is the process of reducing inflected
(or sometimes derived) words to their stem, base or root form, generally a written word
form. The stem need not be identical to the morphological root of the word; it is usually
sufficient that related words map to the same stem, even if this stem is not in itself a valid
root. A stemmer for English, for example, should identify the string "cats" (and possibly
"catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed"
as based on "stem". Porter Stemmer is one algorithm for doing this process effectively.
Some examples of the rules include:
if the word ends in 'ed', remove the 'ed'
if the word ends in 'ing', remove the 'ing'
if the word ends in 'ly', remove the 'ly'
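Those three example rules, applied as a crude suffix stripper (deliberately not the full Porter algorithm the project uses), might look like:

```python
def crude_stem(word):
    """Strip one of the example suffixes, leaving very short words alone."""
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("stemming"))  # 'stemm' (crude, unlike Porter's 'stem')
print(crude_stem("jumped"))    # 'jump'
print(crude_stem("quickly"))   # 'quick'
```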
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than
brute-force algorithms, assuming the maintainer is sufficiently knowledgeable in the
challenges of linguistics and morphology and in encoding suffix stripping rules. Suffix
stripping algorithms are sometimes regarded as crude, given their poor performance when
dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix
stripping algorithms are limited to those lexical categories which have well-known
suffixes, with few exceptions. This, however, is a problem, as not all parts of speech have
such a well-formulated set of rules. Lemmatization attempts to improve upon this
challenge. [16]
When this is done for all the example pages, we perform the frequency analysis of each
word and the number of pages in which it has appeared and then select the words that are
most likely to carry the information content of the page. These words are given to the
crawler.
4.3: Threshold Calculation
4.3.1: Identification of Information carrying words -
Firstly, the frequency of each word in each of the example pages is found out. The
number of pages in which it appears is also kept track of. Based on these two criteria,
we select our information-containing words. *
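A sketch of this selection step (the thresholds are illustrative assumptions; the report fixes its own cut-offs from the example pages):

```python
from collections import Counter

def select_info_words(pages, min_freq=3, min_pages=2):
    """Keep words whose total frequency and document count both cross
    the given thresholds; pages are lists of already-stemmed words."""
    total = Counter()
    doc_count = Counter()
    for words in pages:
        c = Counter(words)
        total.update(c)
        doc_count.update(c.keys())  # count each page at most once per word
    return sorted(
        w for w in total
        if total[w] >= min_freq and doc_count[w] >= min_pages
    )

pages = [["crawl", "web", "crawl"], ["crawl", "web"], ["crawl", "page"]]
print(select_info_words(pages))  # ['crawl']
```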
Fixing the threshold:
We have used the Vector Space Method to fix the threshold from the initial set of pages.
Then we used the following formula to find out the relevance of a particular page:

    Relevance = (No. of info words with mean freq) / (Total no. of information words)
4.3.2: Vector space model -
It is an algebraic model used for information filtering, information retrieval, indexing and
relevancy rankings. It represents natural language documents (or any objects, in general)
in a formal manner through the use of vectors (of identifiers, such as, for example, index
terms) in a multi-dimensional linear space. Its first use was in the SMART Information
Retrieval System. Documents are represented as vectors of index terms (keywords). The
set of terms is a predefined collection of terms, for example the set of all unique words
occurring in the document corpus.*
*Refer Appendix A
4.4: Document frequency
4.4.1: Term frequency in the given document is simply the number of times a given
term appears in that document. This count is usually normalized to prevent a bias towards
longer documents (which may have a higher term frequency regardless of the actual
importance of that term in the document) to give a measure of the importance of the term
ti within the particular document. [10]
    tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the
number of occurrences of all terms in the document.
4.4.2: The inverse document frequency is a measure of the general importance of the
term (obtained by dividing the number of all documents by the number of documents
containing the term, and then taking the logarithm of that quotient).*
A high tf-idf weight is reached by a high term frequency (in the given document) and
a low document frequency of the term in the whole collection of documents; the weight
hence tends to filter out common terms.
* Refer Appendix C
4.5: Robots.txt :-
The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt
protocol is a convention to prevent cooperating web spiders and other web robots from
accessing all or part of a website which is, otherwise, publicly viewable. Robots are often
used by search engines to categorize and archive web sites, or by webmasters to
proofread source code. A robots.txt file on a website will function as a request that
specified robots ignore specified files or directories in their search. This might be, for
example, out of a preference for privacy from search engine results, or the belief that the
content of the selected directories might be misleading or irrelevant to the categorization
of the site as a whole, or out of a desire that an application only operate on certain data.
The protocol, however, is purely advisory. It relies on the cooperation of the web robot,
so that marking an area of a site out of bounds with robots.txt does not guarantee privacy.
Some web site administrators have tried to use the robots file to make private
parts of a website invisible to the rest of the world, but the file is necessarily publicly
available and its content is easily checked by anyone with a web browser.
An example robots.txt file
# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
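Instead of matching substrings by hand, the example file above can be checked with Python's standard-library robots.txt parser:

```python
import urllib.robotparser

# Parse the example robots.txt from above and query it for two paths.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /registration",
    "Disallow: /login",
])
print(rp.can_fetch("*", "http://somehost.com/cgi-bin/search"))  # False
print(rp.can_fetch("*", "http://somehost.com/index.html"))      # True
```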
Chapter V
Discussion and Results
This section covers:
Retrieval of relevant pages
Multithreading
Crawl Space reduction
Server overload
Robustness of Acquisition
Snapshots
5.1: Retrieval of the relevant pages only
Relevant pages are those which are closely related to the input document to the crawler.
Our focused crawler achieves a relevance of downloaded web pages of up to 80-85%,
while for a normal crawler it is only up to 20-25%.
Moreover, the number of relevant pages downloaded is 50-100 per hour, as compared to
a normal crawler, which goes on downloading pages most of which are irrelevant.
Fig 5.1: Comparison Analysis
5.2: Multithreading
This issue has been dealt with successfully: on receiving more than one hyperlink from a
file, a number of parallel threads are generated that work together to download the
pages and parse them.
5.3: Crawl space reduction
Crawl space is the number of pages visited by the crawler on the web. Our focused
crawler reduces the crawl space to a great extent, as it visits the hyperlinks of only the
relevant pages, thus pruning most of the web tree.
Fig 5.2: Crawl Space reduction
5.4: Reduction of server overload
We have used robots exclusion protocol, also known as the robots.txt protocol to prevent
web spider from accessing all or part of a website.
5.5: Robustness of acquisition
Web Spider has the ability to ramp up to and maintain a healthy acquisition rate without
being too sensitive to the start set.
5.6: Snapshots
Fig 5.3: When the page is downloaded, it is parsed and stemmed. The frequency of the
words in the page is calculated and its relevance is checked using the cosine similarity
method.
Fig 5.4: Initially a URL http://www.cert.org/research/papers.html is given as input to the
Web Spider and the links it visits are shown above out of which only the links of the
relevant page are downloaded.
Chapter VI
Conclusion
This section covers:
Conclusion
Challenges and Future Work
6.1: Conclusion
The objectives of our project as per the problem definition have been fully achieved.
Web Spider, the customized multithreaded focused crawler, with all its
functionalities properly running, is ready to be used.
We have achieved a relatively high reduction in crawl space. The rate of pages being
downloaded varies from 50 to 100 pages an hour and they are the ones most relevant to
the user input document. We have taken care of the Robots Exclusion protocol and have
achieved a healthy acquisition rate without being too sensitive to the start document. Our
project can successfully perform using modest desktop hardware.
No doubt, the process of developing Web Spider was extremely instructive and
enjoyable. We got to learn the deep concepts of information retrieval and their
practical implementation, and are proud to have completed the project to a satisfactory level.
6.2: Challenges and future work
6.2.1: Challenges
Server side checking: Web Spider in its present form downloads all the URLs
present in a relevant page and discards the irrelevant URLs post-download. It is
a challenge for us to implement a server side check, i.e. checking the URLs on the
server side and thus downloading only the relevant ones.
Distributed web crawler: Our project presently works on a single system; to make
it scalable, it is a challenge to make it a distributed system with many parallel
crawlers running.
6.2.2: Future work
Extending the project to file formats on the web other than HTML and text.
Ranking the downloaded pages with respect to their priority. A page having
high cosine similarity with the example pages carries high priority.
Increasing the harvest rate. Presently the relevant pages are downloaded at the
rate of 50-100 pages per hour. Implementing a better focused crawler can increase
the rate.
Implementing better preprocessing algorithms. We have presently employed up to
650 stop words and implemented the Porter Stemmer algorithm. Using more stop
words and implementing a better stemming algorithm may further enhance the
crawler's performance.
Appendices
This section covers:
Term Vector Model
Basic Authentication Scheme
Term Frequency-Inverse Document Frequency
Appendix A: Term vector model
Term vector model is an algebraic model used for information filtering, information
retrieval, indexing and relevancy rankings. It represents natural language documents (or
any objects, in general) in a formal manner through the use of vectors (of identifiers, such
as, for example, index terms) in a multi-dimensional linear space. Its first use was in the
SMART Information Retrieval System.
Documents are represented as vectors of index terms (keywords). The set of terms is a
predefined collection of terms, for example the set of all unique words occurring in the
document corpus.
Relevancy rankings of documents in a keyword search can be calculated, using the
assumptions of document similarities theory, by comparing the deviation of angles
between each document vector and the original query vector where the query is
represented as same kind of vector as the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors instead of
the angle itself:

    cos θ = (d · q) / (||d|| ||q||)
A cosine value of zero means that the query and document vector were orthogonal and
had no match (i.e. the query term did not exist in the document being considered).
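The cosine computation can be sketched in a few lines (vectors are plain lists of term weights over a shared term set):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Orthogonal vectors (no shared terms) score 0; identical vectors score 1.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
print(cosine_similarity([1, 2], [1, 2]))  # ~1.0
```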
Assumptions and Limitations of the Vector Space Model
The Vector Space Model has the following limitations:
Long documents are poorly represented because they have poor similarity values
(a small scalar product and a large dimensionality).
Search keywords must precisely match document terms; word substrings might
result in a "false positive match"
Semantic sensitivity; documents with similar context but different term
vocabulary won't be associated, resulting in a "false negative match".
Appendix B: Basic authentication scheme
In the context of an HTTP transaction, the basic authentication scheme is a method
designed to allow a web browser, or other client program, to provide credentials in the
form of a user name and password when making a request. Although the scheme is
easily implemented, it relies on the assumption that the connection between the client and
server computers is secure and can be trusted. Specifically, the credentials are passed as
plaintext and could be intercepted easily. The scheme also provides no protection for the
information passed back from the server.
To prevent the user name and password being read directly by a person, they are encoded
as a sequence of base-64 characters before transmission. For example, the user name
"Aladdin" and password "open sesame" would be combined as "Aladdin:open sesame"
which is equivalent to QWxhZGRpbjpvcGVuIHNlc2FtZQ== when encoded in base-64.
Little effort is required to translate the encoded string back into the user name and
password, and many popular security tools will decode the strings "on the fly", so an
encrypted connection should always be used to prevent interception.
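The encoding step for the "Aladdin" example can be reproduced with the standard base64 module:

```python
import base64

# Join user name and password with ':' and base-64 encode, as described above.
credentials = "Aladdin:open sesame"
token = base64.b64encode(credentials.encode()).decode()
print(token)  # QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# The client then sends the header: Authorization: Basic <token>.
# Decoding back takes no effort, which is why the scheme needs an
# encrypted connection:
print(base64.b64decode(token).decode())  # Aladdin:open sesame
```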
One advantage of the basic authentication scheme is that it is supported by almost all
popular web browsers. It is rarely used on normal Internet web sites but may sometimes
be used by small, private systems. A later mechanism, digest access authentication, was
developed in order to replace the basic authentication scheme and enable credentials to be
passed in a relatively secure manner over an otherwise insecure channel.
Example
Here is a typical transaction between an HTTP client and an HTTP server running on the
local machine (localhost). It comprises the following steps.
The client asks for a page that requires authentication but does not provide a user name
and password. Typically this is because the user simply entered the address or followed a
link to the page.
The server responds with the 401 response code and provides the authentication realm.
At this point, the client will present the authentication realm (typically a description of
the computer or system being accessed) to the user and prompt for a user name and
password. The user may decide to cancel at this point.
Once a user name and password have been supplied, the client re-sends the same request
but includes the authentication header.
In this example, the server accepts the authentication and the page is returned. If the user
name is invalid or the password incorrect, the server might return the 401 response code
and the client would prompt the user again.
Appendix C: Term frequency-inverse
document frequency
The tf-idf weight (term frequency-inverse document frequency) is a weight often used
in information retrieval and text mining. This weight is a statistical measure used to
evaluate how important a word is to a document in a collection or corpus. The importance
increases proportionally to the number of times a word appears in the document but is
offset by the frequency of the word in the corpus. Variations of the tf-idf weighting
scheme are often used by search engines to score and rank a document's relevance given
a user query. In addition to tf-idf weighting, Internet search engines use link-analysis-based
ranking to determine the order in which the scored documents are presented to the user.
The term frequency in the given document is simply the number of times a given term
appears in that document. This count is usually normalized to prevent a bias towards
longer documents (which may have a higher term frequency regardless of the actual
importance of that term in the document) to give a measure of the importance of the term
ti within the particular document:

    tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the
number of occurrences of all terms in the document.
The inverse document frequency is a measure of the general importance of the term
(obtained by dividing the number of all documents by the number of documents
containing the term, and then taking the logarithm of that quotient):
    idf_i = log( |D| / |{d : ti ∈ d}| )

with
|D| : the total number of documents in the corpus
|{d : ti ∈ d}| : the number of documents where the term ti appears (that is, n_i ≠ 0).
A numeric application of tf-idf:
There are many different formulas used to calculate tf-idf. The term frequency (TF) is
the number of times the word appears in a document divided by the total number of
words in the document. If a document contains 100 total words and the word cow appears
3 times, then the term frequency of the word cow in the document is 0.03 (3/100). One
way of calculating the document frequency (DF) is to determine how many documents
contain the word cow, divided by the total number of documents in the collection. So if
cow appears in 1,000 documents out of a total of 10,000,000, then the document
frequency is 0.0001 (1,000/10,000,000). The final tf-idf score is then calculated by
dividing the term frequency by the document frequency. For our example, the tf-idf score
for cow in the collection would be 300 (0.03/0.0001). Alternatives to this formula
take the log of the document frequency.
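The numeric example above can be reproduced directly (this implements the simple tf/df variant used in the example, not the logarithmic idf):

```python
def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """Term frequency divided by document frequency, as in the cow example."""
    tf = term_count / doc_length          # 3 / 100 = 0.03
    df = docs_with_term / total_docs      # 1,000 / 10,000,000 = 0.0001
    return tf / df

# "cow": 3 occurrences in a 100-word document, appearing in 1,000 of
# 10,000,000 documents -> a score of about 300.
print(tf_idf(3, 100, 1_000, 10_000_000))
```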
Applications in Vector Space Model
The tf-idf weighting scheme is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
References
This section covers:
Technical references
Other references
Technical references
[1] Soumen Chakrabarti, Martin van den Berg, Byron Dom, "Focused crawling: a new
approach to topic-specific Web resource discovery", The Eighth International World Wide
Web Conference, Toronto, 1999. Elsevier Science B.V., 1999.
[2] Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance
Distributed Web Crawler", Proceedings of the 18th International Conference
on Data Engineering (ICDE '02), 1063-6382/02, IEEE, 2002.
[3] Ke Hu, Wing Shing Wong, "A probabilistic model for intelligent Web crawlers",
Proceedings of the 27th Annual International Computer Software and Applications
Conference (COMPSAC 2003), pp. 278-282, ISSN 0730-3157, IEEE, 2003.
[4] Castillo, C., "Effective Web Crawling", PhD thesis, University of Chile, 2004.
[5] Padmini Srinivasan, Gautam Pant, "Learning to Crawl: Comparing Classification
Schemes", ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4,
pp. 430-462, ISSN 1046-8188, ACM Press, 2005.
[6] Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L., "Modeling and managing content in
text databases", Proceedings of the 21st IEEE International Conference on Data
Engineering, pp. 606-617, ISSN 1084-4627, ISBN 0-7695-2285-8, IEEE, 2005.
[7] Baeza-Yates, R., Castillo, C., Marin, M., Rodriguez, A., "Crawling a Country:
Better Strategies than Breadth-First for Web Page Ordering", Proceedings of the
Industrial and Practical Experience track of the 14th Conference on World Wide Web,
pp. 864-872, Chiba, Japan, ACM Press, 2005.
[8] Gautam Pant, Padmini Srinivasan, "Link Contexts in Classifier-Guided Topical
Crawlers", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1,
January 2006, IEEE.
[9] Jamali, M., Sayyadi, H., Hariri, B.B., Abolhassani, H., "A Method for Focused
Crawling Using Combination of Link Structure and Content", IEEE/WIC/ACM
International Conference on Web Intelligence (WI 2006), December 2006,
pp. 753-756, ISBN 0-7695-2747-7, IEEE, 2006.
Other References
[10] http://www.devbistro.com/articles/Misc/Effective-Web-Crawler
[11] http://en.wikipedia.org/wiki/Web_crawler
[12] http://www.depspid.net/
[13] http://www-db.stanford.edu/~backrub/google.html
[14] http://www.webtechniques.com/archives/1997/05/burner/
[15] http://www.ils.unc.edu/keyes/java/porter/
[16] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
[17] http://combine.it.lth.se/
[18] http://www.cse.iitb.ac.in/~soumen/focus/