PRIORITY CRAWLER
PRATEEK KHANDELWAL | KHUSHBU PAREEK | RAHUL JAIN | NIKHIL BHAT [email protected] | [email protected]
[email protected] | [email protected]
Mentor: Bhanukiran Vinzamuri | Group ID: Web Miners
Abstract: A crawler is a computer program that browses the Web in a methodical, automated manner or in an orderly fashion. Our task here is to build a priority crawler that crawls the URL data set provided for 120 days and selectively chooses the URLs to be kept in a limited-size queue, based on data quality and the information provided, so that we can gain a better view of the working of a crawler.
KEYWORDS: Crawler, URL, Feature Selection, SVM.
I: INTRODUCTION
The data set provided to us was highly anonymized and unstructured. The data set under our experimentation consisted of 121 files in SVM format, provided as Dayx.svm where x ranges from 0 to 120. Each of these files contains 16,000-20,000 URLs, and each URL is described over a feature space of about 3.2 million attributes. The most challenging task was the correct understanding of the data set and correct feature selection, since the attributes were anonymous and their properties were unknown. The only information directly provided was whether a URL is benign or malignant, depicted by +1 or -1 respectively at the beginning of its line.
II: PROBLEM
In this project, we have found the top 10 URLs for each day over the data of 121 days by appropriate feature selection. The crawler we have implemented had to be time efficient, since the results for a query in the real world should be instantaneous, as well as accurate in finding the top 20 URLs.
III: MOTIVATION
We carried out this implementation to understand the working of a priority-based crawler and to find out what advancements could be made to make the crawling process more time efficient. Since the experimentation data was huge, with 16,000-20,000 URLs and over 3.2 million attributes per URL, the biggest challenge was obviously handling such a huge amount of anonymous data. The task became even more interesting because we had no prior experience of handling such data.
IV: OVERVIEW
The processing pipeline, starting from the raw file:
o Raw file
o Extraction of key:value pairs using REGEX
o Removal of unnecessary data
o Calculation of entropy as well as weighted sum of each URL
o Interactive crawling module
V: METHOD
Pre-processing:
Since the experimentation data is noisy, incomplete and inconsistent, we have to make it ready to use by pre-processing, i.e. eliminating the unimportant and redundant data, since quality decisions must be based on quality data.
First of all, to counter the incompleteness of the data, we extracted all the <key : value> pairs using regular expression matching and stored them in a 2-D hash map. By doing this we did not have to bother about the incompleteness of the data and did not have to manually add 0s in place of incomplete attributes, hence saving a lot of precious preprocessing time.
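A minimal Perl sketch of this extraction step is given below; it assumes each line of a Dayx.svm file has the form <label> <index>:<value> <index>:<value> ..., and the hash names %data and %label are illustrative rather than the exact code we ran.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %data{url_id}{attribute} = value   (2-D hash map of <key : value> pairs)
    # %label{url_id}           = +1 (benign) or -1 (malignant)
    my (%data, %label);

    open my $fh, '<', 'Day0.svm' or die "Cannot open Day0.svm: $!";
    my $url_id = 0;
    while (my $line = <$fh>) {
        chomp $line;
        # The first token of each line is the +1/-1 label.
        my ($lab) = $line =~ /^\s*([+-]1)/;
        $label{$url_id} = $lab if defined $lab;
        # Pull out every <index>:<value> pair with a regex; attributes that are
        # missing simply stay absent from the hash, so no 0-filling is needed.
        while ($line =~ /(\d+):([0-9.eE+-]+)/g) {
            $data{$url_id}{$1} = $2;
        }
        $url_id++;
    }
    close $fh;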
Secondly, we did not want highly redundant data to appear in our calculations every time, so we calculated the standard deviation and mean of every attribute to find out its importance over all the URLs, and removed all those features which had standard deviation = 0 and mean = 1, since both of these together imply a very high number of occurrences.
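Continuing the sketch above, the per-attribute mean and standard deviation can be computed roughly as follows; treating an attribute that is absent from a URL as 0 is an assumption made for illustration.

    # Mean and standard deviation of each attribute over all URLs,
    # counting an absent attribute as 0 for that URL.
    my (%mean, %sd, %count);
    my (%sum, %sumsq);
    my $n_urls = scalar keys %data;

    for my $url (keys %data) {
        for my $att (keys %{ $data{$url} }) {
            $sum{$att}   += $data{$url}{$att};
            $sumsq{$att} += $data{$url}{$att} ** 2;
            $count{$att}++;    # number of URLs that contain this attribute
        }
    }
    for my $att (keys %sum) {
        $mean{$att} = $sum{$att} / $n_urls;
        my $var     = $sumsq{$att} / $n_urls - $mean{$att} ** 2;
        $sd{$att}   = sqrt($var > 0 ? $var : 0);
    }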
Even after doing the above steps, some amount of redundancy was still left, since there were many redundant columns. Since columns represent attributes of a particular line/row, it is important to remove attributes that have only a minute effect on the actual data. If 40% of the lines had an attribute in common, it might be considered unimportant.
After all this, our most important concern was that we did not want to lose rare attributes of high importance, so we found the attributes which were present in less than 1% of the total number of lines and kept them, since they had a very high importance value.
The whole pre-processing step can be summed up mathematically as: do not keep an attribute if
(SD == 0 || M == 1 || f > 0.4) && !(f < 0.01)
where SD = standard deviation of the attribute, M = mean of the attribute, and f = frequency of the attribute over the total number of lines.
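The three conditions combine into a single filter; a sketch is shown below, reusing %mean, %sd, %count and $n_urls from the sketch above, with the 0.4 and 0.01 thresholds taken from the rule just stated.

    # Discard an attribute if (SD == 0 or mean == 1 or frequency > 0.4),
    # unless it is rare (frequency < 0.01), in which case it is kept.
    my %keep;
    for my $att (keys %count) {
        my $f = $count{$att} / $n_urls;   # fraction of URLs containing the attribute
        my $discard = ($sd{$att} == 0 || $mean{$att} == 1 || $f > 0.4)
                      && !($f < 0.01);
        $keep{$att} = 1 unless $discard;
    }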
HEURISTICS
Not all pages are necessarily of equal interest to a crawler's client. For instance, if the client is building a specialized database on a particular topic, then pages that refer to that topic are more important and should be visited as early as possible. To make this possible we should find the important attributes in the data set, and the URLs which contain more important attributes should be fetched first.
To implement this, we have considered two important heuristics.
o Entropy
The main heuristic we have used is calculating the entropy metric over the set of attributes left after the pre-processing step, across the 16,000 URLs.
Entropy = - Σ pi * log(pi)
where pi = (number of times the attribute has taken the value it has in this URL) / (total number of URLs containing that attribute).
For example, consider line number 100 and suppose it contains att1:100 att2:50. If att1 is present in 1000 URLs and its value has been 100 in 300 of them, and att2 is present in 400 URLs and has assumed the value 50 in only twenty of them, then the entropy score for this URL will be
-(300/1000) log(300/1000) - (20/400) log(20/400) - ...
with one further term for each remaining attribute of the URL.
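A sketch of how this entropy score can be computed over the filtered attributes, reusing %data, %keep and %count from the earlier sketches (the natural logarithm is assumed, since no base is specified):

    # How many URLs give each attribute each particular value.
    my %value_count;   # $value_count{att}{value} = number of URLs with att:value
    for my $url (keys %data) {
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            $value_count{$att}{ $data{$url}{$att} }++;
        }
    }

    # Entropy score of a URL: -sum over its attributes of p * log(p), where
    # p = (URLs where the attribute takes this value) / (URLs containing the attribute).
    my %entropy;
    for my $url (keys %data) {
        my $score = 0;
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            my $p = $value_count{$att}{ $data{$url}{$att} } / $count{$att};
            $score -= $p * log($p);
        }
        $entropy{$url} = $score;
    }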
o Weighted sum
The second heuristic we have used is assigning a weighted sum to each URL based on the goodness of the attributes present in it. In this method, we have ranked attributes according to how rare they are and calculated a score for each URL accordingly.
Again consider the same example of line number 100 containing att1:100 att2:50. If att1 is now present in 50 URLs and att2 is present in 200 URLs, then the weighted sum for this particular URL will be 100/50 + 50/200.
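A matching sketch of the weighted-sum score under that reading, again reusing %data, %keep and %count: the score of a URL is the sum, over its attributes, of the attribute value divided by the number of URLs containing that attribute, so rare attributes contribute more.

    my %weighted;
    for my $url (keys %data) {
        my $score = 0;
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            $score += $data{$url}{$att} / $count{$att};   # value / rarity weight
        }
        $weighted{$url} = $score;
    }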
After completing the calculation of entropy or weighted sum, we sort the URLs by these values. After all the processing, an interactive, user-driven crawling module appears which asks whether we want to fetch or drop n URLs and correspondingly displays the results as the top 20 URLs.
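A sketch of the final sorting step and the interactive fetch/drop prompt; the command syntax ('fetch n' / 'drop n') and the use of the %entropy scores here are illustrative only.

    # Sort URLs by score, highest first, to form the priority queue.
    my @queue = sort { $entropy{$b} <=> $entropy{$a} } keys %entropy;

    print "fetch or drop how many URLs? (e.g. 'fetch 20' or 'drop 5'): ";
    while (my $cmd = <STDIN>) {
        chomp $cmd;
        last unless $cmd =~ /^(fetch|drop)\s+(\d+)$/;
        my ($action, $n) = ($1, $2);
        if ($action eq 'fetch') {
            my @top = splice(@queue, 0, $n);   # take the n best URLs
            print "URL $_ (score $entropy{$_})\n" for @top;
        } else {
            splice(@queue, 0, $n);             # drop the n best URLs
        }
        print "fetch or drop how many URLs? ";
    }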
IMPLEMENTATION
We have implemented our priority crawler in PERL, as it contains many inbuilt features, such as REGEX, for handling such huge and anonymous data. Also, instead of using traditional data structures like queues and stacks, we have used hash maps for storing the attribute values, so that they can easily be compared for their occurrences across different URLs and accurate results can be found.
VI: RESULTS
INFERENCES:
Calculation by weighted sum takes almost half the run time taken by the entropy heuristic, but we also saw a drastic change in the URL queue between the two.
The total run time of the code depends on the day we are calculating for. The average run time of the code is around 250 seconds to fetch the top 20 URLs for each day's data.
Snapshot of the results over the data of Day 120.
For day 0, the variation in entropy values can be enumerated as follows:
URL ID Entropy Value
7289 4.06722374538319
3237 4.03754144448823
1243 3.97515162331633
3492 3.90347165343665
3454 3.90347165343665
14133 3.89863541884464
13922 3.77368213065205
3374 3.76249595546937
3297 3.76249595546937
3694 3.71368047511625
3465 3.71368047511625
4694 3.70846175620895
4658 3.70846175620895
2764 3.6962117775535
9955 3.61183637301763
10357 3.61183637301763
11144 3.60009816371488
13441 3.60009816371488
2721 3.60009816371488
11225 3.60009816371488
2794 3.60009816371488
3540 3.53217880206689
3466 3.53217880206689
3570 3.53217880206689
3557 3.53217880206689
3550 3.53217880206689
3460 3.53217880206689
14279 3.51733817831168
13950 3.49387686344043
11721 3.39696069252859
6277 3.3452109708853
Scatter plot of day (X axis) vs. run time in seconds (Y axis).
Snapshot of the 2nd heuristic, i.e. weighted sum, on the data of day 120.
VII: SURVEY
Surveying other possible approaches is a very important part of this paper, since we have to look at all the pros and cons of our approach to make it better in the future. In this paper, we have surveyed five possible approaches for a priority-based crawler and have compared them with the method we have implemented.
The team Quarks has used the approach of first preprocessing the data by removing the threshold 0s and inserting 0s in place of incomplete data, and then normalizing all the features to remove the redundant ones. Their preprocessing takes 1.5 hours, which is definitely not preferable when designing something like a time-constrained crawler. Also, after preprocessing they arrive at a surprisingly small figure of 23 important attributes out of a total of 3.2 million. Calculating the entropy metric of 16,000 URLs over these 23 attributes, they produce the top 100 URLs in a total of 5 minutes.
The other team, Spiders 2011, has done simple preprocessing by just adding 0s in place of incomplete data, which takes around 1 minute, and removing redundancy, which takes around 3 minutes. This preprocessing step yields the top 84,000 attributes, over which they calculated entropy gain and information gain on just 1,000 URLs to give the top 100 URLs. There are basically two loopholes in their approach. Firstly, in the preprocessing step they have not taken care to preserve the rarely occurring but important data, which could get removed during redundancy removal. Secondly, they have worked only on 1,000 URLs of Day 0; since all 16,000 URLs are not considered, this would definitely affect the quality of the results they obtain by this approach.
The third approach we shall consider is that of team SKIM. In preprocessing, they have considered only the 67,000 attributes occurring in 1,000 randomly selected URLs. On those attributes they have done redundant-0 removal and variance classification, i.e. if the variance of an attribute is greater than 0.9, they have removed it. After this step on the randomly selected 1,000 URLs and attributes, they have done PCA dimension reduction on 3 matrices of sizes 1000*100, 500*100 and 100*100. At the final step they have calculated the importance of each URL by the formula
Importance of a URL = Σ (eigenvalue of a feature * feature value),
where the summation runs over all features in that particular URL.
Finally, they have sorted the URL values in descending order to present the top 10 URLs.
The main flaw in their approach, according to us, is the random sampling of 1,000 URLs, because of which their final result depends only on that sample and not on the complete data set, though their idea of PCA reduction is appreciable since it gives an accurate idea about the importance of a URL.
The fourth team we have surveyed is A3M. They have worked only on 5,040 URLs which they obtained through random sampling from the total of 16,000 URLs, and they have considered only the features occurring in these URLs, so the number of attributes is reduced from the total of 3.2 million to 33,000. After calculating the information gain of each and every feature, they selected the top 100 features after sorting them according to their entropy gain value. They took only those URLs out of the total 16,000 which had +1 assigned to them (there were 5,963 of them), and then features were assigned a value 0 or 1 depending on their occurrence in these URLs (assign 0 if the feature is not present in the URL).
Then an importance matrix of size 5963*100 was formed, in which the importance value was calculated through entropy.
Finally they calculated the top 100 URLs according to their importance value.
The approach of team A3M can be criticised on the basis that they have neglected the importance of preprocessing and have relied on random sampling, which may or may not give reliable results.
The last team we surveyed was Web Spiders. They have worked only on the top 1,000 attributes and 50 URLs. In the preprocessing step they have just used the naive approach of inserting 0s in place of missing data. After this they calculated the entropy of each attribute and then subtracted from it the conditional entropy of that attribute.
After that they calculated the information gain for each attribute and bubble-sorted the attributes. At the last step they applied IR metrics to the URLs accordingly and selected the URLs for enqueueing and dequeueing. Although they have applied really good techniques like entropy, information gain and IR metrics, they should have worked on the whole range of attributes and URLs to get a clear picture of the crawler they have designed and to obtain more reliable results.
Finally, we can conclude that the teams we have surveyed have either sampled k URLs or taken a fixed number of attributes. On the other hand, we have taken care of all the attributes, reduced them, and then assigned a weighted sum to each URL based on the goodness of its attributes, so as to have more reliable and accurate results.
We can also sum up the time taken by our approach and the others, to give a clear picture of which approach is better:
Team Name       Preprocessing time    Time taken in displaying the results after preprocessing
Web Miners      18-20 seconds         Around 250 seconds
Quarks          1.5 hours             5 minutes
SPIDERS 2011    1-2 minutes           10 minutes
SKIM            N/A                   6.712 seconds
Web Spiders     N/A                   1-2 seconds
A3M             N/A                   0.386 seconds
VIII: REFERENCES
1) http://en.wikipedia.org/wiki/Web_crawler
2) Slides and class notes of the Web Mining and Knowledge Management course.
3) Information provided by the other coding teams.