8/3/2019 B Level Project Combined Index
1/59
WEB SPIDER
A Focused Crawler
Acknowledgements
It would be truly unfair not to express our gratitude to all those who helped
us complete this project. We would like to show our deepest gratitude to our
project guide Dr. Anupam Agarwal, without whom this project would not
have been possible. It was he who motivated us for this cause and was
always present with his precious guidance and ideas, besides being extremely
supportive and understanding at all times. This fueled our enthusiasm even
further and encouraged us to boldly step into what was a totally dark and
unexplored expanse before us.
We would also like to thank our batch mates and seniors who were ready
with a positive comment all the time, whether it was an off-hand comment to
encourage us or a constructive piece of criticism. Their positive as well as
critical comments were of great help in giving the project its present form.
Abstract
The world-wide web, having over 350 million pages, continues to grow
rapidly at a million pages per day. About 600 GB of text changes every
month. Such growth and flux poses basic limits of scale for today's generic
crawlers and search engines. In spite of using high-end multiprocessors and
exquisitely crafted crawling software, the largest crawls cover only 30-40%
of the web, and refreshes take weeks to a month. With such unprecedented
scaling challenges for general-purpose crawlers and search engines, we
propose a hypertext resource discovery system called a Focused Crawler.
The goal of a focused crawler is to selectively seek out pages that are
relevant to a pre-defined set of topics. The topics are specified not using
keywords, but using exemplary documents.
To achieve such goal-directed crawling, we
evaluate the relevance of a hypertext document with respect to the focus
topics, thereby discarding the irrelevant pages and following the hyperlinks
of relevant pages only. Focused crawling thus steadily acquires relevant
pages while standard crawling quickly loses its way. It is therefore
very effective for building high-quality collections of Web documents on
specific topics, using modest desktop hardware.
Indian Institute of Information Technology, Allahabad
Contents
Student Declaration
Supervisor Recommendation
Acknowledgement
Abstract
List of figures used
Chapter 1: Introduction
1.1 Objective
1.2 Motivation
1.3 Problem Definition
Chapter 2: Literature Survey
2.1 Literature survey
2.2 Previous Work
Chapter 3: Project Model
3.1 Basic Architecture
3.2 Crawler Policies
3.3 Issues
Chapter 4: Algorithm Implementation
4.1 Outline
4.2 Parsing and Stemming
4.3 Threshold calculation
4.4 Document Frequency
4.5 Robots.txt
Chapter 5: Discussion and Results
5.1 Retrieval of relevant pages only
5.2 Multithreading
5.3 Crawl space reduction
5.4 Reduction of server overload
5.5 Robustness of Acquisition
5.6 Snapshots
Chapter 6: Conclusion
6.1 Conclusion
6.2 Challenges and Future work
Appendices
Appendix A: Term Vector Model
Appendix B: Basic Authentication Scheme
Appendix C: Term Frequency-Inverse Document Frequency
References
Technical references
Other references
List of figures:
Fig 1.1: Performance of an unfocused crawler
Fig 1.2: Performance of focused crawler
Fig 2.1: Basic Components of the crawler
Fig 2.2: Integration of crawler, classifier and distiller
Fig 2.3: Domain of focused web crawler
Fig 3.1: Simple Crawler Configuration
Fig 3.2: Control Flow of a Crawler Frontier
Fig 4.1: Basic functioning of crawl frontier
Fig 5.1: Comparison Analysis
Fig 5.2: Crawl Space reduction
Fig 5.3: Snapshot 1
Fig 5.4: Snapshot 2
Chapter I
Introduction
This section covers:
Objective
Motivation
Problem definition
1.1: Objective
To build a customized, multithreaded, focused crawler that crawls the web based
on the relevance of each web page, thus reducing the crawl space.
1.2: Motivation
The World Wide Web has grown from a few thousand pages in 1993 to more than two
billion pages at present. It continues to grow rapidly at a million pages per day.
About 600 GB of text changes every month. Due to this explosion in size, web search
engines are becoming increasingly important as the primary means of locating relevant
information [2]. Such search engines rely on massive collections of web pages that are
acquired with the help of web crawlers, which traverse the web by following hyperlinks
and storing downloaded pages in a large database that is later indexed for efficient
execution of user queries. Many researchers have looked at web search technology over
the last few years, including crawling strategies, storage, indexing, ranking techniques,
and a significant amount of work on the structural analysis of the web and web graph.
In spite of using high-end multiprocessors and exquisitely crafted crawling software, the
largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. The
overwhelming engineering challenges are in part due to the one-size-fits-all philosophy:
the crawler trying to cater to every possible query.
Serious web users adopt the strategy of filtering by relevance and quality. The growth of
the web matters little to a physicist if at most a few dozen pages dealing with quantum
electrodynamics are added or updated per week. Seasoned users also rarely roam
aimlessly; they have bookmarked sites important to them, and their primary need is to
expand and maintain a community around these examples while preserving the quality. A
focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics.
It is crucial that the harvest rate of the focused crawler (the fraction of page
fetches which are relevant to the user's interest) be high; otherwise it would be
easier to crawl the whole web and bucket the results into topics as a post-processing step.
Fig 1.1: Performance of unfocused crawler[10]
Fig 1.2: Performance of focused crawler[10]
As Fig 1.2 shows, the fraction of page fetches relevant to the user's interest is
much higher for the focused crawler than for the unfocused crawler (Fig 1.1). The
crawl space of a focused crawler can also be reduced to a large extent compared
to that of a normal crawler.
1.3: Problem Definition
Our project aims to build a customized, multithreaded, focused crawler that
crawls the web based on the relevance of each web page. The approach
concentrates specifically on a particular domain.
In order to achieve the objectives it should be able to perform the following:
Efficient Preprocessing: This involves the preprocessing of the input
documents. We aim to provide efficient parsing and stemming of pages. Initially,
the user will be required to provide a set of example pages along with his search
query. These example pages will be parsed, removing all the stop words, and
finally the text will be stemmed.
Knowledge Retrieval: To provide efficient retrieval of information-containing
words. Once the text has been stemmed, the information-containing words will be
picked, and these will form the information that the crawler carries with it.
Crawling: To build a crawler that starts from a root node or URL, called the
seed. As the crawler visits these URLs, it will identify all the hyperlinks in the
page and add them to the list of URLs to visit, called the crawl frontier. URLs
from the frontier will then be recursively visited.
Retrieving relevant pages: We aim to retrieve only those pages which are
closely related to the corresponding query. In our case we will deal with the most
relevant pages only. This will reduce the burden on the user of scanning through
all the retrieved pages to find the pages of his interest.
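The crawling step described above can be sketched as a simple breadth-first loop over a frontier. The sketch below is only an illustration: the in-memory web dictionary and the fetch_links callback stand in for real HTTP fetching and HTML parsing, and are not part of our implementation.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl: maintain a frontier of unvisited URLs,
    starting from the seed, and never fetch the same URL twice."""
    frontier = deque([seed])      # the crawl frontier
    visited = []                  # pages fetched, in order
    seen = {seed}                 # URLs already queued
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):   # hyperlinks found on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A tiny in-memory "web" standing in for real HTTP fetching:
web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl("a", lambda u: web.get(u, []))
```

In the focused crawler, a relevance test is applied before links are added to the frontier; the loop structure itself is unchanged.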
Chapter II
Literature Survey
This section covers:
Literature survey
Background and Previous work
2.1: Literature survey
2.1.1: Basic Crawler
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for
general-purpose crawlers and search engines. We want to implement a new hypertext
resource discovery system called a Focused Crawler. The goal of a focused crawler is
to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are
specified not using keywords, but using exemplary documents. Rather than collecting and
indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a
focused crawler analyzes its crawl boundary to find the links that are likely to be most
relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant
savings in hardware and network resources, and helps keep the crawl more up-to-date.
To achieve such goal-directed crawling, we will design two hypertext
mining programs that guide our crawler: a classifier that evaluates the relevance of a
hypertext document with respect to the focus topics, and a distiller that identifies
hypertext nodes that are great access points to many relevant pages within a few links. [7]
We report on extensive focused-crawling experiments using several topics at different
levels of specificity.
Focused crawling acquires relevant pages steadily while standard
crawling quickly loses its way, even though they are started from the same root set.
Focused crawling is robust against large perturbations in the starting set of URLs. It
discovers largely overlapping sets of resources in spite of these perturbations. It is also
capable of exploring out and discovering valuable resources that are dozens of links away
from the start set, while carefully pruning the millions of pages that may lie within this
same radius.[5]
As a result it is highly efficient compared to normal crawlers. A normal crawler
works well for some time after it starts crawling but then loses its way, which is
its biggest disadvantage relative to a focused crawler. Our anecdotes suggest that
focused crawling is very effective for building high-quality collections of Web
documents on specific topics, using modest desktop hardware.[3]
Fig 2.1: Basic Components of the crawler[2]
The focused crawler has three main components: a classifier which makes relevance
judgments on pages crawled to decide on link expansion, a distiller which determines a
measure of centrality of crawled pages to determine visit priorities, and a crawler with
dynamically reconfigurable priority controls which is governed by the classifier and
distiller.[2]
Its block diagram can be shown as
Fig 2.2: Focused crawler showing how the crawler, classifier and distiller are integrated. [1]
2.1.2: Classification
Relevance is enforced on the focused crawler using a hypertext classifier. We assume that
the category taxonomy induces a hierarchical partition on Web documents. (In real life,
documents are often judged to belong to multiple categories.) The aim is to acquire
useful pages, not merely to eliminate irrelevant ones. Human judgment, although
subjective and even erroneous,
would be best for measuring relevance. Clearly, even for an experimental crawler that
acquires only ten thousand pages per hour, this is impossible. Therefore we use our
classifier to estimate the relevance of the crawl graph. It is to be noted carefully that we
are not, for instance, training and testing the classifier on the same set of documents, or
checking the classifier's earlier evaluation of a document using the classifier itself. Just as
human judgment is prone to variation and error, a statistical program can make mistakes.
Based on such imperfect recommendations, we choose whether or not to expand pages. Later,
when a page that was chosen is visited, we evaluate its relevance, and thus the value of
that decision.[8]
2.1.3: Distillation
Relevance is not the only attribute used to evaluate a page while crawling. A long essay
very relevant to the topic but without links is only a finishing point in the crawl. A good
strategy for the crawler is to identify hubs: pages that are almost exclusively a collection
of links to authoritative resources that are relevant to the topic.* Social network analysis
is concerned with the properties of graphs formed between entities such as people,
organizations, papers, etc., through coauthoring, citations, mentoring, paying,
telephoning, infecting, etc. Prestige is an important attribute of nodes in a social network,
especially in the context of academic papers and Web documents. The number of
citations to a paper is a reasonable but crude measure of its prestige. Also many hubs are
multi-topic in nature, e.g., a published bookmark file pointing to sports car sites and
photography sites.[4]
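Hub and authority scores of the kind the distiller relies on can be estimated with the standard HITS iteration. The sketch below is a generic textbook version on a toy link graph, not the report's actual distiller:

```python
from math import sqrt

def hits(graph, iterations=20):
    """Compute hub and authority scores for a link graph given
    as {page: [pages it links to]}, normalizing each round."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to the node
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        # hub score: sum of authority scores of pages the node links to
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        for scores in (auth, hub):   # normalize to unit length
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# "hub" links to two resources, "other" to one; "hub" scores highest as a hub.
hub, auth = hits({"hub": ["r1", "r2"], "other": ["r1"]})
```

Pages with high hub scores are exactly the link collections a focused crawler wants to prioritize.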
2.1.4: Integration with the crawler
The crawler has one watchdog thread and many worker threads. The watchdog is in
charge of checking out new work from the crawl frontier, which is stored on disk. New
work is passed to workers using shared memory buffers. Workers save details of newly
explored pages in private per-worker disk structures. In bulk-synchronous fashion,
workers are stopped, and their results are collected and integrated into the central pool of
work.[4]
While it is fairly easy to build a slow crawler that downloads a few pages
per second for a short period of time, building a high-performance system that can
download hundreds of millions of pages over several weeks presents a number of
challenges in system design, I/O and network efficiency, and robustness and
manageability.[2]
* Refer Appendix B
Perhaps the most crucial evaluation of focused crawling is to measure the rate at which
relevant pages are acquired, and how effectively irrelevant pages are filtered off from the
crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot
of time merely eliminating irrelevant pages, and it may be better to use an ordinary
crawler instead! It would be good to judge the relevance of the crawl by human
inspection, even though it is subjective and inconsistent. But this is not possible for the
hundreds of thousands of pages our system crawled. Therefore we have to take recourse
to running an automatic classifier over the collected pages. Specifically, we can use our
classifier. It may appear that using the same classifier to guide the crawler and judge the
relevance of crawled pages is flawed methodology, but it is not so. We are evaluating not
the classifier but the basic crawling heuristic that neighbors of highly relevant pages tend
to be relevant.
Fig 2.3: Domain of focused web crawler[11]
The unfocused crawler starts out from the same set of dozens of highly relevant links as
the focused crawler, but is completely lost within the next hundred page fetches: the
relevance goes quickly to zero. In contrast, the focused crawl keeps up a healthy pace
of acquiring relevant pages over thousands of pages, in spite of some short-range rate
fluctuations, which is expected. On average, between a third and half of all page
fetches result in success over the first several thousand fetches, and there seems to be no
sign of stagnation.
Crawling the Web, in a certain way, resembles watching the sky on a clear night: what we
see reflects the state of the stars at different times, as their light travels different distances.
What a Web crawler gets is not a snapshot of the Web, because it does not represent
the Web at any given instant of time. The last pages being crawled are probably very
accurately represented, but the first pages that were downloaded have a high probability
of having been changed. [6]
2.2: Previous work
The following is a list of published crawler architectures for general-purpose crawlers
(excluding focused Web crawlers), with a brief description that includes the names given
to the different components and outstanding features:
2.2.1: RBSE was the first published web crawler. It was based on two programs: the first
program, "spider", maintains a queue in a relational database, and the second program,
"mite", is a modified www ASCII browser that downloads the pages from the Web. It was
presented at the First International Conference on the World Wide Web, Geneva,
Switzerland.[12]
2.2.2: Google Crawler is described in some detail, but the reference is only about an
early version of its architecture, which was based on C++ and Python. The crawler was
integrated with the indexing process, because text parsing was done for full-text indexing
and also for URL extraction. There is a URL server that sends lists of URLs to be fetched
by several crawling processes. During parsing, the URLs found were passed to a URL
server that checked if the URL had been previously seen. If not, the URL was added to
the queue of the URL server.[16]
2.2.3: Mercator is a distributed, modular web crawler written in Java. Its modularity
arises from the usage of interchangeable "protocol modules" and "processing modules".
Protocol modules are related to how to acquire the Web pages (e.g. by HTTP), and
processing modules are related to how to process Web pages. The standard processing
module just parses the pages and extracts new URLs, but other processing modules can
be used to index the text of the pages, or to gather statistics from the Web.[15]
2.2.4: WebRACE is a crawling and caching module implemented in Java, and used as a
part of a more generic system called eRACE. The system receives requests from users for
downloading Web pages, so the crawler acts in part as a smart proxy server. The system
also handles requests for "subscriptions" to Web pages that must be monitored: when the
pages change, they must be downloaded by the crawler and the subscriber must be
notified. The most outstanding feature of WebRACE is that, while most crawlers start
with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to
crawl from.[18]
2.2.5: Ubicrawler is a distributed crawler written in Java, and it has no central process. It
is composed of a number of identical "agents"; and the assignment function is calculated
using consistent hashing of the host names. There is zero overlap, meaning that no page
is crawled twice, unless a crawling agent crashes (then, another agent must re-crawl the
pages from the failing agent). The crawler is designed to achieve high scalability and to
be tolerant to failures.[13]
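The consistent-hashing assignment that UbiCrawler is described as using can be sketched as a hash ring with virtual nodes. The agent names, replica count and hash function below are illustrative, not taken from UbiCrawler itself:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Assign host names to crawling agents via consistent hashing.
    Each agent gets several points (virtual nodes) on a hash ring;
    a host is handled by the first agent point clockwise of its hash."""

    def __init__(self, agents, points=64):
        self.ring = sorted(
            (self._h(f"{agent}#{i}"), agent)
            for agent in agents for i in range(points)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def agent_for(self, host):
        i = bisect(self.keys, self._h(host)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["agent-0", "agent-1", "agent-2"])
owner = ring.agent_for("www.example.com")   # stable: same host, same agent
```

The key property is that when one agent crashes, only the hosts that were assigned to it move; every other host keeps its agent, so no page is ever crawled twice.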
2.2.6: Some Open-source crawlers [11]
DataparkSearch
GNU Wget
Heritrix
HTTrack
Methabot
Nutch
WebSPHINX
Sherlock Holmes
YaCy
Chapter III
Project Model
This section covers:
Basic Architecture
Crawler Policies
Issues
3.1: Basic Architecture
In this project we will develop a web crawler that will start with a list of URLs to visit,
called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the
page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the
frontier are recursively visited according to a set of policies.
While it is fairly easy to build a slow crawler that downloads a few pages per second for a
short period of time, building a high-performance system that can download hundreds of
millions of pages over several weeks presents a number of challenges in system design,
I/O and network efficiency, and robustness and manageability.
3.1.1: Basic Concept -
A web crawler, also known as a Web spider or Web robot, is a program or automated
script which browses the World Wide Web in a methodical, automated manner. Web
crawlers are mainly used to create a copy of all the visited pages for later processing by a
search engine, which will index the downloaded pages to provide fast searches. Crawlers
can also be used for automating maintenance tasks on a Web site, such as checking links
or validating HTML code.
Web crawlers start by parsing a specified web page, noting any hypertext links on that
page that point to other web pages. They then parse those pages for new links, and so on,
recursively. Web-crawler software doesn't actually move around to different computers
on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine.
The crawler simply sends HTTP requests for documents to other machines on the Internet
just as a web browser does when the user clicks on links. All the crawler really does is to
automate the process of following links. [2]
3.1.2: Architecture -
The input to the focused crawler is the search query of the user. Also, a set of example
pages relating to the query has to be given to the crawler. A series of parses are done on
these example pages to finally extract the information-containing words. These
information-containing words are given as input to the crawler, which carries them along
with it. Based on this information, the crawler calculates the relevance of an encountered
page, and only if the relevance is satisfying, will it be stored for further crawling. [4]
Fig 3.1: Simple Crawler Configuration [4]
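The relevance test can be illustrated with a simple cosine similarity between the term vector of a fetched page and the profile built from the example pages, in the spirit of the term vector model of Appendix A. The sample texts and the threshold value are purely illustrative; our actual threshold calculation is described in Chapter 4:

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Raw term-frequency vector of a (preprocessed) text."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

profile = term_vector("focused crawler relevance topic crawler")
page = term_vector("a focused crawler follows relevant links")
score = cosine(profile, page)
is_relevant = score >= 0.2   # illustrative threshold
```

Only pages whose score clears the threshold have their outgoing links added to the frontier.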
The architecture can be classified into two major components: the crawling system and
the crawling application. The crawling system itself consists of several specialized
components, in particular a crawl manager, a downloader and a DNS resolver.
The crawl manager is responsible for receiving the URL input stream from
the applications. After loading the URLs of a request file, the manager queries the DNS
resolvers for the IP addresses of the servers, unless a recent address is already cached.
The manager then requests the file robots.txt in the web server's root directory, unless it
already has a recent copy of the file. A downloader is a high-performance asynchronous
HTTP client capable of downloading hundreds of web pages in parallel, while a DNS
resolver is an optimized stub DNS resolver that forwards queries to local DNS servers.[6]
Finally, after parsing the robots files and removing excluded URLs, the requested URLs
are sent in batches to the downloader. The manager later notifies the application of the
pages that have been downloaded and are available for processing.
The crawling application starts out by giving a URL to the crawl
manager. The application then parses each downloaded page for hyperlinks, checks
whether these URLs have already been encountered before, and if not, sends them to the
manager in batches of a few hundred or thousand.[9] The downloaded files are then
forwarded to a storage manager for compression and storage in a repository.
3.1.3: Control flow
As the crawler gets the relevant pages, it retrieves their URLs and builds a list from
which it takes the URLs one by one and downloads the corresponding web page. The
downloaded page is then converted to a text file for simplicity. This text file is parsed,
removing all the stop words from it and stemming the remaining words using the Porter Stemmer.
Then its relevance is tested. If relevant, the URLs present on the page are extracted and
added to the list of URLs for further crawling.
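The parsing-and-stemming step above can be sketched as follows. The stop-word list here is a tiny illustrative sample, and the suffix stripping is a crude stand-in for the full Porter algorithm used in the project:

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}
SUFFIXES = ("ing", "ed", "es", "s")   # crude stand-in for Porter's rules

def stem(word):
    """Strip the first matching suffix, keeping a minimal word stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case the text, drop stop words, and stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

tokens = preprocess("The crawler is downloading and parsing pages")
```

The resulting tokens are what the relevance test operates on, so the same preprocessing must be applied to the example pages and to every fetched page.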
Fig 3.2: Control Flow of a Crawler Frontier
3.2: Crawler policies
There are three important characteristics of the Web that generate a scenario in which
Web crawling is very difficult: its large volume, its fast rate of change, and dynamic page
generation, which combine to produce a wide variety of possible crawlable URLs.
The large volume implies that the crawler can only download a fraction of the Web
pages within a given time, so it needs to prioritize all of its downloads. The high rate of
change implies that by the time the crawler is downloading the last pages from a site, it is
very likely that new pages have been added to the site, or that pages have already been
updated or even deleted.
The recent increase in the number of pages being generated by server-side scripting
languages has also created difficulty, in that endless combinations of HTTP GET
parameters exist, only a small selection of which will actually return unique content. For
example, a simple online photo gallery may offer three options to users, as specified
through HTTP GET parameters. If there exist four ways to sort images, three choices of
thumbnail size, two file formats, and an option to disable user-provided content, then
the same set of content can be accessed with forty-eight different URLs, all of which will
be present on the site. This mathematical combination creates a problem for crawlers, as
they must sort through endless combinations of relatively minor scripted changes in order
to retrieve unique content.
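The gallery arithmetic can be checked directly: 4 x 3 x 2 x 2 = 48. The parameter names and values below are made up for illustration; only the counts come from the example in the text:

```python
from itertools import product

sort_orders = ["name", "date", "size", "rating"]   # 4 ways to sort
thumb_sizes = ["s", "m", "l"]                      # 3 thumbnail sizes
formats = ["jpg", "png"]                           # 2 file formats
user_content = ["on", "off"]                       # toggle user content

urls = [
    f"/gallery?sort={s}&thumb={t}&fmt={f}&uc={u}"
    for s, t, f, u in product(sort_orders, thumb_sizes, formats, user_content)
]
# 4 * 3 * 2 * 2 = 48 distinct URLs, all serving the same underlying images
```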
The behavior of a Web crawler is the outcome of a combination of policies:
A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.
3.2.1: Selection policy -
Given the current size of the Web, even large search engines cover only a portion of the
publicly available Internet; a recent study showed that no search engine indexes more
than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it
is highly desirable that the downloaded fraction contains the most relevant pages, and not
just a random sample of the Web. [2]
This requires a metric of importance for prioritizing Web pages. The importance of a
page is a function of its intrinsic quality, its popularity in terms of links or visits, and
even of its URL (the latter is the case of vertical search engines restricted to a single top-
level domain, or search engines restricted to a fixed Web site). Designing a good
selection policy has an added difficulty: it must work with partial information, as the
complete set of Web pages is not known during crawling.
Crawling can be combined with different strategies. The ordering metrics can be breadth-
first, backlink-count and partial PageRank calculations. One conclusion was that if the
crawler wants to download pages with high PageRank early during the crawling
process, then the partial-PageRank strategy is the better one, followed by breadth-first
and backlink-count. However, these results are for just a single domain.
Nevertheless, the breadth-first strategy is generally considered better than PageRank
ordering. The explanation is simple: it has been shown that the most important
pages have many links to them from numerous hosts, and those links will be found early,
regardless of on which host or page the crawl originates.
3.2.2: Re-visit policy -
The Web has a very dynamic nature, and crawling a fraction of the Web can take a really
long time, usually measured in weeks or months. By the time a Web crawler has finished
its crawl, many events could have happened. These events can include creations, updates
and deletions. [2]
From the search engine's point of view, there is a cost associated with not detecting an
event, and thus having an outdated copy of a resource. The most used cost functions are
freshness and age.
Freshness: This is a binary measure that indicates whether the local copy is
accurate or not.
Age: This is a measure that indicates how outdated the local copy is.
The objective of the crawler is to keep the average freshness of pages in its collection as
high as possible, or to keep the average age of pages as low as possible. These objectives
are not equivalent: in the first case, the crawler is just concerned with how many pages
are out-dated, while in the second case, the crawler is concerned with how old the local
copies of pages are.
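The two cost functions can be written down directly. The sketch below uses the standard definitions (freshness is binary; age counts the time elapsed since the page last changed while our copy is stale); the timestamps are illustrative:

```python
def freshness(local_copy, live_copy):
    """Binary measure: 1 if the local copy matches the live page, else 0."""
    return 1 if local_copy == live_copy else 0

def age(now, last_change, is_fresh):
    """How outdated the local copy is: 0 while it is still fresh,
    otherwise the time elapsed since the page last changed."""
    return 0 if is_fresh else now - last_change

# A page changed at t=100 and our copy is stale, so at t=130 its age is 30.
a = age(130, 100, is_fresh=False)
```

A crawler maximizing average freshness and one minimizing average age can schedule re-visits quite differently, which is exactly the trade-off discussed above.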
Two simple re-visiting policies are:
Uniform policy: This involves re-visiting all pages in the collection with the same
frequency, regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change
more frequently. The visiting frequency is directly proportional to the (estimated)
change frequency.
In terms of average freshness, the uniform policy outperforms the proportional policy in
both a simulated Web and a real Web crawl. The explanation for this result comes from
the fact that, when a page changes too often, the crawler will waste time by trying to re-
crawl it too fast and still will not be able to keep its copy of the page fresh.
3.2.3: Politeness policy -
Crawlers can retrieve data much quicker and in greater depth than human searchers, so
they can have a crippling impact on the performance of a site. Needless to say if a single
crawler is performing multiple requests per second and downloading large files, a server
would have a hard time keeping up with requests from multiple crawlers. [2]
The use of Web crawlers is useful for a number of tasks, but comes with a price for the
general community. The costs of using Web crawlers include:
Network resources, as crawlers require considerable bandwidth and operate with a high
degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages
they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web
servers.
A partial solution to these problems is the robots exclusion protocol, also known as the
robots.txt protocol, which is a standard for administrators to indicate which parts of their
Web servers should not be accessed by crawlers. This standard does not include a
suggestion for the interval of visits to the same server, even though this interval is the
most effective way of avoiding server overload. Recently, commercial search engines like
Ask Jeeves, MSN and Yahoo have begun to use an extra "Crawl-delay:" parameter in the
robots.txt file to indicate the number of seconds to delay between requests. However, if
pages were downloaded at this rate from a website with more than 100,000 pages over a
perfect connection with zero latency and infinite bandwidth, it would take more than 2
months to download that entire website; also, only a fraction of the resources of that
Web server would be used. This does not seem acceptable.
Normally, an interval of 10 seconds is used between accesses, and some crawlers use 15
seconds as the default. Some even follow an adaptive politeness policy: if it took t
seconds to download a document from a given server, the crawler waits for 10t seconds
before downloading the next page.
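The fixed and adaptive intervals described above can be combined in a small helper (the function and parameter names are illustrative assumptions):

```python
def politeness_delay(last_download_secs, factor=10, min_delay=10):
    """Seconds to wait before the next request to the same server: 10t under
    the adaptive policy, but never less than the common 10-second default."""
    return max(min_delay, factor * last_download_secs)

# A fast download (0.4 s) still gets the default 10 s interval;
# a slow one (3 s) backs off to 30 s.
print(politeness_delay(0.4))  # 10
print(politeness_delay(3))    # 30
```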
3.2.4: Parallelization policy -
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to
maximize the download rate while minimizing the overhead from parallelization and to
avoid repeated downloads of the same page. To avoid downloading the same page more
than once, the crawling system requires a policy for assigning the new URLs discovered
during the crawling process, as the same URL can be found by two different crawling
processes. There are two basic policies:
Dynamic Assignment: With this type of policy, a central server assigns new URLs to
different crawlers dynamically. This allows the central server to, for instance,
dynamically balance the load of each crawler.
With dynamic assignment, typically the systems can also add or remove downloader
processes. The central server may become the bottleneck, so most of the workload must
be transferred to the distributed crawling processes for large crawls.
There are two configurations of crawling architectures with dynamic assignment: [2]
A small crawler configuration, in which there is a central DNS resolver and
central queues per Web site, and distributed downloaders.
A large crawler configuration, in which the DNS resolver and the queues are also
distributed.
Static Assignment: With this type of policy, there is a fixed rule stated from the
beginning of the crawl that defines how to assign new URLs to the crawlers.
For static assignment, a hashing function can be used to transform URLs (or, even better,
complete website names) into a number that corresponds to the index of the
corresponding crawling process. As there are external links that will go from a Web site
assigned to one crawling process to a website assigned to a different crawling process,
some exchange of URLs must occur.[14]
To reduce the overhead due to the exchange of URLs between crawling processes, the
exchange should be done in batch, several URLs at a time, and the most cited URLs in
the collection should be known by all crawling processes before the crawl (e.g. using
data from a previous crawl).
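A static assignment by hashing, as described above, might be sketched like this (hashing the complete web-site name so that every URL of a site maps to the same process; the function name is an illustrative assumption):

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_processes):
    """Static assignment sketch: hash the host name so that all URLs of a
    site go to the same crawling process."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_processes

# All pages of the same host land on the same crawling process:
a = assign_crawler("http://example.com/page1", 4)
b = assign_crawler("http://example.com/page2", 4)
print(a == b)  # True
```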
An effective assignment function must have three main properties: each crawling process
should get approximately the same number of hosts (balancing property), if the number
of crawling processes grows, the number of hosts assigned to each process must shrink
(contra-variance property), and the assignment must be able to add and remove crawling
processes dynamically. Consistent hashing, which replicates the buckets so that adding or
removing a bucket does not require re-hashing the whole table, achieves all of the
desired properties. Crawling is an effective process synchronization tool between the
users and the search engine.
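A minimal consistent-hashing ring with replicated buckets could look like the following sketch (the class, replica count, and process names are illustrative assumptions, not the report's implementation):

```python
import bisect
import hashlib

def _h(key):
    # Stable hash of a string as a large integer.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each crawling process owns several points ("replicated buckets") on a
    ring; a host maps to the first process point clockwise from its hash.
    Adding or removing a process only remaps the hosts near its own points."""
    def __init__(self, processes, replicas=64):
        self._ring = sorted(
            (_h(f"{p}#{i}"), p) for p in processes for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def assign(self, host):
        idx = bisect.bisect(self._keys, _h(host)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-0", "crawler-1", "crawler-2"])
print(ring.assign("example.com"))  # one of the three processes
```

Because removing a process deletes only its own points, any host that was assigned to a surviving process keeps its assignment, which is exactly the property plain modulo hashing lacks.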
3.3: Issues
3.3.1: How to Re-visit web pages
The optimal method of re-visiting the web while keeping the average freshness of web
pages high is to ignore the pages that change too often.
The approaches could be: [2]
Re-visiting all pages in the collection with the same frequency, regardless of their
rates of change.
Re-visiting more often the pages that change more frequently.
In both cases, the repeated crawling order of pages can be done either at random or with a
fixed order.
The re-visiting methods considered here regard all pages as homogeneous in terms of
quality ("all pages on the Web are worth the same"), something that is not a realistic
scenario.
3.3.2: How to avoid overloading websites
Crawlers can retrieve data much quicker and in greater depth than human searchers, so
they can have a crippling impact on the performance of a site. Needless to say, if a single
crawler is performing multiple requests per second and/or downloading large files, a
server would have a hard time keeping up with requests from multiple crawlers.
The use of Web crawlers is useful for a number of tasks, but comes with a price for the
general community.
The costs of using Web crawlers include: [1]
Network resources, as crawlers require considerable bandwidth and operate with a
high degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too
high.
Poorly written crawlers, which can crash servers or routers, or which download
pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and
Web servers.
To resolve this problem, we can use the robots exclusion protocol, also known as the
robots.txt protocol.
The robots exclusion standard or robots.txt protocol is a convention to prevent
cooperating web spiders and other web robots from accessing all or part of a website. We
can specify a top-level directory of the web site in a file called robots.txt, and this will
prevent the crawler from accessing that directory. This protocol uses simple substring
comparisons to match the patterns defined in the robots.txt file. So, while using this
robots.txt file, we need to make sure that a final "/" character is appended to the
directory path. [17] Otherwise, files with names starting with that substring will be
matched rather than the directory.
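The substring matching and the trailing-slash pitfall can be illustrated in a few lines (a simplified model of the prefix matching, not a full robots.txt parser):

```python
def is_disallowed(path, disallow_rules):
    """A rule matches any path that starts with it, which is why directory
    rules need a trailing '/' to avoid matching look-alike file names."""
    return any(path.startswith(rule) for rule in disallow_rules)

# Without the trailing slash, '/private' also blocks '/private-notes.html':
print(is_disallowed("/private-notes.html", ["/private"]))   # True
# With it, only the directory's contents are blocked:
print(is_disallowed("/private-notes.html", ["/private/"]))  # False
print(is_disallowed("/private/a.html", ["/private/"]))      # True
```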
Chapter IV
Algorithm Implementation
This section covers:
Outline
Parsing and Stemming
Threshold calculation
Document Frequency
Robots.txt
4.1: Outline
The input to the focused crawler is the search query of the user in the form of example
pages. Parsing is done and words are retrieved. These information-carrying words are
given as input to the crawler, which carries them along with it. Based on this information,
the crawler calculates the relevance of an encountered page, and only if the relevance is
satisfactory will the page be stored for further crawling.
Fig 4.1: Basic functioning of crawl frontier
A pseudo-code summary of the crawler:

Ask the user to specify the starting URL on the web and the file type that the crawler
should crawl.
Add the URL to the empty list of URLs to search.
While not empty ( the list of URLs to search )
{
    Take the first URL from the list of URLs.
    Mark this URL as an already searched URL.
    If the URL protocol is not HTTP then
        continue with the next URL
    If a robots.txt file exists on the site then
        If the file includes a "Disallow" statement for this URL then
            continue with the next URL
    Open the URL.
    If the opened URL is not an HTML file then
        continue with the next URL
    Iterate over the HTML file:
    While the HTML text contains another link
    {
        If a robots.txt file exists on the linked URL/site then
            If the file includes a "Disallow" statement then
                skip this link
        If the linked URL is an HTML file then
            If the URL isn't marked as searched then
                add it to the list of URLs to search
        Else if the type of the file is the one the user requested then
            Add it to the list of files found.
    }
}
4.2: Parsing and Stemming
4.2.1: Parsing (more formally, syntactic analysis) is the process of analyzing a sequence
of tokens to determine its grammatical structure with respect to a given formal grammar.
A parser is the component of a compiler that carries out this task.
Parsing transforms input text into a data structure, usually a tree, which is suitable for
later processing and which captures the implied hierarchy of the input. Lexical analysis
creates tokens from a sequence of input characters, and it is these tokens that are
processed by a parser to build a data structure such as a parse tree or an abstract syntax tree.
Parsing is also an earlier term for the diagramming of sentences in the grammar of a natural
language, and is still used to diagram the grammar of inflected languages, such as the
Romance languages or Latin.
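In the crawler's setting, this token-to-structure pipeline can be illustrated with Python's standard-library HTML tokenizer collecting hyperlinks (the class is a sketch, not the report's parser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """The tokenizer feeds us start tags; we collect the hyperlinks the
    crawler needs from each fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

p = LinkExtractor()
p.feed('<html><body><a href="http://x.com/a">A</a><a href="/b">B</a></body></html>')
print(p.links)  # ['http://x.com/a', '/b']
```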
4.2.2: Removal of Stop Words
Firstly, the web page is converted into a text file for convenience. An initial parse
removes all the stop words from the file. Stop words are words which are filtered out
prior to, or after, processing of natural language data (text). Some of the most frequently
used stop words include "a", "of", "the", "I", "it", "you", and "and". These are generally
regarded as 'functional words' which do not carry meaning (they are not as important for
communication). [16] The assumption is that, when assessing the contents of the web
page, the meaning can be conveyed more clearly, or interpreted more easily, by ignoring
the functional words. A Stop List is maintained in a separate text file and all the words of
that file are removed from the file being parsed.
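A minimal sketch of this first parse (with a tiny inline stop list standing in for the report's stop-list file):

```python
def remove_stop_words(text, stop_words):
    """Keep only the tokens that are not on the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]

stop = {"a", "of", "the", "i", "it", "you", "and"}
print(remove_stop_words("The meaning of a page", stop))
# ['meaning', 'page']
```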
4.2.3: Stemming
Next, a Porter Stemmer is run on this file. Stemming is the process of reducing inflected
(or sometimes derived) words to their stem, base or root form, generally a written word
form. The stem need not be identical to the morphological root of the word; it is usually
sufficient that related words map to the same stem, even if this stem is not in itself a valid
root. A stemmer for English, for example, should identify the string "cats" (and possibly
"catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed"
as based on "stem". Porter Stemmer is one algorithm for doing this process effectively.
Some examples of the rules include:
if the word ends in 'ed', remove the 'ed'
if the word ends in 'ing', remove the 'ing'
if the word ends in 'ly', remove the 'ly'
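Those three example rules, applied as a crude suffix stripper (deliberately not the full Porter algorithm the project uses), might look like:

```python
def crude_stem(word):
    """Strip one of the example suffixes, leaving very short words alone."""
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("stemming"))  # 'stemm' (crude, unlike Porter's 'stem')
print(crude_stem("jumped"))    # 'jump'
print(crude_stem("quickly"))   # 'quick'
```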
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than
brute-force algorithms, assuming the maintainer is sufficiently knowledgeable in the
challenges of linguistics and morphology and in encoding suffix stripping rules. Suffix
stripping algorithms are sometimes regarded as crude, given their poor performance when
dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix
stripping algorithms are limited to those lexical categories which have well-known
suffixes, with few exceptions. This, however, is a problem, as not all parts of speech have
such a well-formulated set of rules. Lemmatization attempts to improve upon this
challenge. [16]
When this is done for all the example pages, we perform the frequency analysis of each
word and the number of pages in which it has appeared and then select the words that are
most likely to carry the information content of the page. These words are given to the
crawler.
4.3: Threshold Calculation
4.3.1: Identification of Information carrying words -
Firstly, the frequency of each word in each of the example pages is found out. The
number of pages in which it appears is also kept track of. Based on these two criteria,
we select our information-containing words. *
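A sketch of this selection step (the thresholds are illustrative assumptions; the report fixes its own cut-offs from the example pages):

```python
from collections import Counter

def select_info_words(pages, min_freq=3, min_pages=2):
    """Keep words whose total frequency and document count both cross
    the given thresholds; pages are lists of already-stemmed words."""
    total = Counter()
    doc_count = Counter()
    for words in pages:
        c = Counter(words)
        total.update(c)
        doc_count.update(c.keys())  # count each page at most once per word
    return sorted(
        w for w in total
        if total[w] >= min_freq and doc_count[w] >= min_pages
    )

pages = [["crawl", "web", "crawl"], ["crawl", "web"], ["crawl", "page"]]
print(select_info_words(pages))  # ['crawl']
```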
Fixing the threshold:
We have used the Vector Space Method to fix the threshold from the initial set of pages.
Then we used the following formula to find out the relevance of a particular page:

    Relevance = (No. of info words with mean freq) / (Total no. of information words)
4.3.2: Vector space model -
It is an algebraic model used for information filtering, information retrieval, indexing and
relevancy rankings. It represents natural language documents (or any objects, in general)
in a formal manner through the use of vectors (of identifiers, such as, for example, index
terms) in a multi-dimensional linear space. Its first use was in the SMART Information
Retrieval System. Documents are represented as vectors of index terms (keywords). The
set of terms is a predefined collection of terms, for example the set of all unique words
occurring in the document corpus.*
*Refer Appendix A
4.4: Document frequency
4.4.1: Term frequency in the given document is simply the number of times a given
term appears in that document. This count is usually normalized to prevent a bias towards
longer documents (which may have a higher term frequency regardless of the actual
importance of that term in the document) to give a measure of the importance of the term
ti within the particular document. [10]
    tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the
number of occurrences of all terms in the document.
4.4.2: The inverse document frequency is a measure of the general importance of the
term (obtained by dividing the number of all documents by the number of documents
containing the term, and then taking the logarithm of that quotient).*
A high tf-idf weight is reached by a high term frequency (in the given document) and
a low document frequency of the term in the whole collection of documents; the weight
hence tends to filter out common terms.
* Refer Appendix C
4.5: Robots.txt :-
The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt
protocol is a convention to prevent cooperating web spiders and other web robots from
accessing all or part of a website which is, otherwise, publicly viewable. Robots are often
used by search engines to categorize and archive web sites, or by webmasters to
proofread source code. A robots.txt file on a website will function as a request that
specified robots ignore specified files or directories in their search. This might be, for
example, out of a preference for privacy from search engine results, or the belief that the
content of the selected directories might be misleading or irrelevant to the categorization
of the site as a whole, or out of a desire that an application only operate on certain data.
The protocol, however, is purely advisory. It relies on the cooperation of the web robot,
so that marking an area of a site out of bounds with robots.txt does not guarantee privacy.
Some web site administrators have tried to use the robots file to make private
parts of a website invisible to the rest of the world, but the file is necessarily publicly
available and its content is easily checked by anyone with a web browser.
An example robots.txt file
# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
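Instead of matching substrings by hand, the example file above can be checked with Python's standard-library robots.txt parser:

```python
import urllib.robotparser

# Parse the example robots.txt from above and query it for two paths.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /registration",
    "Disallow: /login",
])
print(rp.can_fetch("*", "http://somehost.com/cgi-bin/search"))  # False
print(rp.can_fetch("*", "http://somehost.com/index.html"))      # True
```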
Chapter V
Discussion and Results
This section covers:
Retrieval of relevant pages
Multithreading
Crawl Space reduction
Server overload
Robustness of Acquisition
Snapshots
5.1: Retrieval of the relevant pages only
Relevant pages are those which are closely related to the input document to the crawler.
Our focused crawler achieves a relevance of downloaded web pages of up to 80-85%,
while for a normal crawler it is only up to 20-25%.
Moreover, the number of relevant pages downloaded is 50-100 per hour, as compared to
a normal crawler, which goes on downloading pages most of which are irrelevant.
Fig 5.1: Comparison Analysis
5.2: Multithreading
This issue has been dealt with successfully: on receiving more than one hyperlink from a
file, a number of parallel threads are generated that work together to download the
pages and parse them.
5.3: Crawl space reduction
Crawl space is the number of pages visited by the crawler on the web. Our focused
crawler reduces the crawl space to a great extent, as it visits the hyperlinks of only the
relevant pages, thus pruning most of the web tree.
Fig 5.2: Crawl Space reduction
5.4: Reduction of server overload
We have used robots exclusion protocol, also known as the robots.txt protocol to prevent
web spider from accessing all or part of a website.
5.5: Robustness of acquisition
Web Spider has the ability to ramp up to and maintain a healthy acquisition rate without
being too sensitive to the start set.
5.6: Snapshots
Fig 5.3: When the page is downloaded, it is parsed and stemmed. The frequency of the
words in the page is calculated and its relevance is checked using the cosine similarity
method.
Fig 5.4: Initially a URL http://www.cert.org/research/papers.html is given as input to the
Web Spider and the links it visits are shown above out of which only the links of the
relevant page are downloaded.
Chapter VI
Conclusion
This section covers:
Conclusion
Challenges and Future Work
6.1: Conclusion
The objectives of our project as per the problem definition have been fully achieved.
Web Spider, the customized multithreaded focused crawler, with all its
functionalities properly running, is ready to be used.
We have achieved a relatively high reduction in crawl space. The rate of pages being
downloaded varies from 50 to 100 pages an hour and they are the ones most relevant to
the user input document. We have taken care of the Robots Exclusion protocol and have
achieved a healthy acquisition rate without being too sensitive to the start document. Our
project can successfully perform using modest desktop hardware.
No doubt, the process of developing Web Spider was extremely instructive and
enjoyable. We got to learn the deep concepts of information retrieval and their
practical implementation, and are proud to have completed the project to a satisfactory level.
6.2: Challenges and future work
6.2.1: Challenges
Server side checking: Web Spider in its present form downloads all the URLs
present in a relevant page and discards the irrelevant URLs post-download. It is
a challenge for us to implement a server side check, i.e. checking the URLs on the
server side and thus downloading only the relevant ones.
Distributed web crawler: Our project presently works on a single system; to make
it scalable, it is a challenge to make it a distributed system with many parallel
crawlers running.
6.2.2: Future work
Extending the project to file formats on the web other than HTML and text.
Ranking the downloaded pages with respect to their priority. A page having
high cosine similarity with the example pages carries high priority.
Increasing the harvest rate. Presently the relevant pages are downloaded at the
rate of 50-100 pages per hour. Implementing a better focused crawler can increase
the rate.
Implementing better preprocessing algorithms. We have presently employed up to
650 stop words and implemented the Porter Stemmer algorithm. Using more stop
words and implementing a better stemming algorithm may further enhance the
crawler's performance.
Appendices
This section covers:
Term Vector Model
Basic Authentication Scheme
Term Frequency-Inverse Document Frequency
Appendix A: Term vector model
Term vector model is an algebraic model used for information filtering, information
retrieval, indexing and relevancy rankings. It represents natural language documents (or
any objects, in general) in a formal manner through the use of vectors (of identifiers, such
as, for example, index terms) in a multi-dimensional linear space. Its first use was in the
SMART Information Retrieval System.
Documents are represented as vectors of index terms (keywords). The set of terms is a
predefined collection of terms, for example the set of all unique words occurring in the
document corpus.
Relevancy rankings of documents in a keyword search can be calculated, using the
assumptions of document similarities theory, by comparing the deviation of angles
between each document vector and the original query vector where the query is
represented as same kind of vector as the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors instead of
the angle itself:

    cos θ = (d · q) / (||d|| ||q||)
A cosine value of zero means that the query and document vector were orthogonal and
had no match (i.e. the query term did not exist in the document being considered).
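The cosine computation can be sketched in a few lines (vectors are plain lists of term weights over a shared term set):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Orthogonal vectors (no shared terms) score 0; identical vectors score 1.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
print(cosine_similarity([1, 2], [1, 2]))  # ~1.0
```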
Assumptions and Limitations of the Vector Space Model
The Vector Space Model has the following limitations:
Long documents are poorly represented because they have poor similarity values
(a small scalar product and a large dimensionality).
Search keywords must precisely match document terms; word substrings might
result in a "false positive match"
Semantic sensitivity; documents with similar context but different term
vocabulary won't be associated, resulting in a "false negative match".
Appendix B: Basic authentication scheme
In the context of an HTTP transaction, the basic authentication scheme is a method
designed to allow a web browser, or other client program, to provide credentials in the
form of a user name and password when making a request. Although the scheme is
easily implemented, it relies on the assumption that the connection between the client and
server computers is secure and can be trusted. Specifically, the credentials are passed as
plaintext and could be intercepted easily. The scheme also provides no protection for the
information passed back from the server.
To prevent the user name and password being read directly by a person, they are encoded
as a sequence of base-64 characters before transmission. For example, the user name
"Aladdin" and password "open sesame" would be combined as "Aladdin:open sesame"
which is equivalent to QWxhZGRpbjpvcGVuIHNlc2FtZQ== when encoded in base-64.
Little effort is required to translate the encoded string back into the user name and
password, and many popular security tools will decode the strings "on the fly", so an
encrypted connection should always be used to prevent interception.
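The encoding step for the "Aladdin" example can be reproduced with the standard base64 module:

```python
import base64

# Join user name and password with ':' and base-64 encode, as described above.
credentials = "Aladdin:open sesame"
token = base64.b64encode(credentials.encode()).decode()
print(token)  # QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# The client then sends the header: Authorization: Basic <token>.
# Decoding back takes no effort, which is why the scheme needs an
# encrypted connection:
print(base64.b64decode(token).decode())  # Aladdin:open sesame
```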
One advantage of the basic authentication scheme is that it is supported by almost all
popular web browsers. It is rarely used on normal Internet web sites but may sometimes
be used by small, private systems. A later mechanism, digest access authentication, was
developed in order to replace the basic authentication scheme and enable credentials to be
passed in a relatively secure manner over an otherwise insecure channel.
Example
Here is a typical transaction between an HTTP client and an HTTP server running on the
local machine (localhost). It comprises the following steps.
The client asks for a page that requires authentication but does not provide a user name
and password. Typically this is because the user simply entered the address or followed a
link to the page.
The server responds with the 401 response code and provides the authentication realm.
At this point, the client will present the authentication realm (typically a description of
the computer or system being accessed) to the user and prompt for a user name and
password. The user may decide to cancel at this point.
Once a user name and password have been supplied, the client re-sends the same request
but includes the authentication header.
In this example, the server accepts the authentication and the page is returned. If the user
name is invalid or the password incorrect, the server might return the 401 response code
and the client would prompt the user again.
Appendix C: Term frequency-inverse
document frequency
The tf-idf weight (term frequency-inverse document frequency) is a weight often used
in information retrieval and text mining. This weight is a statistical measure used to
evaluate how important a word is to a document in a collection or corpus. The importance
increases proportionally to the number of times a word appears in the document but is
offset by the frequency of the word in the corpus. Variations of the tf-idf weighting
scheme are often used by search engines to score and rank a document's relevance given
a user query. In addition to tf-idf weighting, Internet search engines use link-analysis-based
ranking to determine the order in which the scored documents are presented to the user.
The term frequency in the given document is simply the number of times a given term
appears in that document. This count is usually normalized to prevent a bias towards
longer documents (which may have a higher term frequency regardless of the actual
importance of that term in the document) to give a measure of the importance of the term
ti within the particular document:

    tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the
number of occurrences of all terms in the document.
The inverse document frequency is a measure of the general importance of the term
(obtained by dividing the number of all documents by the number of documents
containing the term, and then taking the logarithm of that quotient):
    idf_i = log( |D| / |{d : ti ∈ d}| )

with
|D| : the total number of documents in the corpus
|{d : ti ∈ d}| : the number of documents where the term ti appears (that is, n_i ≠ 0).
A numeric application of tf-idf:
There are many different formulas used to calculate tf-idf. The term frequency (TF) is
the number of times the word appears in a document divided by the total number of
words in the document. If a document contains 100 total words and the word cow appears
3 times, then the term frequency of the word cow in the document is 0.03 (3/100). One
way of calculating the document frequency (DF) is to determine how many documents
contain the word cow, divided by the total number of documents in the collection. So if
cow appears in 1,000 documents out of a total of 10,000,000, then the document
frequency is 0.0001 (1,000/10,000,000). The final tf-idf score is then calculated by
dividing the term frequency by the document frequency. For our example, the tf-idf score
for cow in the collection would be 300 (0.03/0.0001). Alternatives to this formula
take the log of the document frequency.
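The numeric example above can be reproduced directly (this implements the simple tf/df variant used in the example, not the logarithmic idf):

```python
def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """Term frequency divided by document frequency, as in the cow example."""
    tf = term_count / doc_length          # 3 / 100 = 0.03
    df = docs_with_term / total_docs      # 1,000 / 10,000,000 = 0.0001
    return tf / df

# "cow": 3 occurrences in a 100-word document, appearing in 1,000 of
# 10,000,000 documents -> a score of about 300.
print(tf_idf(3, 100, 1_000, 10_000_000))
```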
Applications in Vector Space Model
The tf-idf weighting scheme is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
References
This section covers:
Technical references
Other references
Technical references
[1] Soumen Chakrabarti, Martin van den Berg, Byron Dom, "Focused crawling: a new
approach to topic-specific Web resource discovery", The Eighth International World Wide
Web Conference, Toronto, 1999. Elsevier Science B.V., 1999.
[2] Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance
Distributed Web Crawler", Proceedings of the 18th International Conference
on Data Engineering (ICDE '02), 1063-6382/02, IEEE, 2002.
[3] Ke Hu, Wing Shing Wong, "A probabilistic model for intelligent Web crawlers",
Proceedings of the 27th Annual International Computer Software and Applications
Conference (COMPSAC 2003), pp. 278-282, ISSN 0730-3157, IEEE, 2003.
[4] Castillo, C., "Effective Web Crawling", PhD thesis, University of Chile, 2004.
[5] Padmini Srinivasan, Gautam Pant, "Learning to Crawl: Comparing Classification
Schemes", ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4,
pp. 430-462, ISSN 1046-8188, ACM Press, 2005.
[6] Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L., "Modeling and managing content in
text databases", Proceedings of the 21st IEEE International Conference on Data
Engineering, pp. 606-617, ISSN 1084-4627, ISBN 0-7695-2285-8, IEEE, 2005.
[7] Baeza-Yates, R., Castillo, C., Marin, M., Rodriguez, A., "Crawling a Country:
Better Strategies than Breadth-First for Web Page Ordering", Proceedings of the
Industrial and Practical Experience track of the 14th Conference on World Wide Web,
pp. 864-872, Chiba, Japan, ACM Press, 2005.
[8] Gautam Pant, Padmini Srinivasan, "Link Contexts in Classifier-Guided Topical
Crawlers", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1,
January 2006, IEEE.
[9] Jamali, M., Sayyadi, H., Hariri, B.B., Abolhassani, H., "A Method for Focused
Crawling Using Combination of Link Structure and Content", IEEE/WIC/ACM
International Conference on Web Intelligence (WI 2006), December 2006,
pp. 753-756, ISBN 0-7695-2747-7, IEEE, 2006.
Other References
[10] http://www.devbistro.com/articles/Misc/Effective-Web-Crawler
[11] http://en.wikipedia.org/wiki/Web_crawler
[12] http://www.depspid.net/
[13] http://www-db.stanford.edu/~backrub/google.html
[14] http://www.webtechniques.com/archives/1997/05/burner/
[15] http://www.ils.unc.edu/keyes/java/porter/
[16] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
[17] http://combine.it.lth.se/
[18] http://www.cse.iitb.ac.in/~soumen/focus/