PRIORITY CRAWLER
PRATEEK KHANDELWAL | KHUSHBU PAREEK | RAHUL JAIN | NIKHIL BHAT [email protected] | [email protected]
[email protected] | [email protected]
Mentor: Bhanukiran Vinzamuri | Group ID: Web Miners
Abstract: A crawler is a computer program that browses the Web in a methodical, automated manner or in an orderly fashion. Our task here is to build a priority crawler that crawls the URL data set provided for 120 days and selectively chooses the URLs to be kept in a limited-size queue, based on data quality and the information provided, so that we can gain a better view of the working of a crawler.
KEYWORDS: Crawler, URL, Feature Selection, SVM.
I: INTRODUCTION
The data set provided to us was highly anonymized and unstructured. The data set under our experimentation consisted of 121 files in SVM format, provided as Dayx.svm where x ranges from 0 to 120. Each of these files contains 16,000-20,000 URLs, and each URL is described over a feature space of about 3.2 million attributes. The most challenging task was the correct understanding of the data set and correct feature selection, since the attributes were anonymous and their properties were unknown. The only information directly provided was whether a URL is benign or malignant, depicted by +1 or -1 respectively at the beginning of its line.
II: PROBLEM
In this project, we have found the top 10 URLs for each day over the data of 121 days by appropriate feature selection. The crawler we have implemented had to be time efficient, since the results for a query in the real world should be instantaneous, as well as accurate in finding the top 20 URLs.
III: MOTIVATION
We carried out this implementation to understand the working of a priority-based crawler and to find out what advancements could be made to make the crawling process more time efficient. Since the experimentation data was huge, with 16,000-20,000 URLs and over 3.2 million attributes per URL, the biggest challenge was obviously handling such a huge amount of anonymous data. The task became even more interesting because we had no prior experience of handling such data.
IV: OVERVIEW
The processing pipeline, starting from the raw file:
o Raw file
o Extraction of key:value pairs using REGEX
o Removal of unnecessary data
o Calculation of entropy as well as weighted sum of each URL
o Interactive crawling module
V: METHOD
Pre-processing:
Since the experimentation data is noisy, incomplete and inconsistent, we have to make it ready to use by pre-processing, i.e. eliminating the unimportant and redundant data, since quality decisions must be based on quality data.
First of all, to counter the incompleteness of the data, we extracted all the <key : value> pairs using regular expression matching and stored them in a 2-D hash map. By doing this we did not have to bother about the incompleteness of the data and did not have to manually add 0s in place of incomplete attributes, hence saving a lot of precious preprocessing time.
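A minimal Perl sketch of this extraction step is given below; it assumes each line of a Dayx.svm file has the form <label> <index>:<value> <index>:<value> ..., and the hash names %data and %label are illustrative rather than the exact code we ran.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %data{url_id}{attribute} = value   (2-D hash map of <key : value> pairs)
    # %label{url_id}           = +1 (benign) or -1 (malignant)
    my (%data, %label);

    open my $fh, '<', 'Day0.svm' or die "Cannot open Day0.svm: $!";
    my $url_id = 0;
    while (my $line = <$fh>) {
        chomp $line;
        # The first token of each line is the +1/-1 label.
        my ($lab) = $line =~ /^\s*([+-]1)/;
        $label{$url_id} = $lab if defined $lab;
        # Pull out every <index>:<value> pair with a regex; attributes that are
        # missing simply stay absent from the hash, so no 0-filling is needed.
        while ($line =~ /(\d+):([0-9.eE+-]+)/g) {
            $data{$url_id}{$1} = $2;
        }
        $url_id++;
    }
    close $fh;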
Secondly, we did not want highly redundant data to appear in our calculations every time, so we calculated the standard deviation and mean of every attribute to find out its importance over all the URLs, and removed all those features which had standard deviation = 0 and mean = 1, since both of these together imply a very high number of occurrences.
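Continuing the sketch above, the per-attribute mean and standard deviation can be computed roughly as follows; treating an attribute that is absent from a URL as 0 is an assumption made for illustration.

    # Mean and standard deviation of each attribute over all URLs,
    # counting an absent attribute as 0 for that URL.
    my (%mean, %sd, %count);
    my (%sum, %sumsq);
    my $n_urls = scalar keys %data;

    for my $url (keys %data) {
        for my $att (keys %{ $data{$url} }) {
            $sum{$att}   += $data{$url}{$att};
            $sumsq{$att} += $data{$url}{$att} ** 2;
            $count{$att}++;    # number of URLs that contain this attribute
        }
    }
    for my $att (keys %sum) {
        $mean{$att} = $sum{$att} / $n_urls;
        my $var     = $sumsq{$att} / $n_urls - $mean{$att} ** 2;
        $sd{$att}   = sqrt($var > 0 ? $var : 0);
    }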
Even after doing the above steps, some amount of redundancy was still left, since there were many redundant columns. Since columns represent attributes of a particular line/row, it is important to remove attributes that have only a minute effect on the actual data. If 40% of the lines had an attribute in common, it might be considered unimportant.
After all this, our most important concern was that we did not want to lose rare attributes of high importance, so we found the attributes which were present in less than 1% of the total number of lines and kept them, since they had a very high importance value.
The whole pre-processing step can be summed up mathematically as: do not keep an attribute if
(SD == 0 || M == 1 || f > 0.4) && !(f < 0.01)
where SD = standard deviation of the attribute, M = mean of the attribute, and f = frequency of the attribute over the total number of lines.
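The three conditions combine into a single filter; a sketch is shown below, reusing %mean, %sd, %count and $n_urls from the sketch above, with the 0.4 and 0.01 thresholds taken from the rule just stated.

    # Discard an attribute if (SD == 0 or mean == 1 or frequency > 0.4),
    # unless it is rare (frequency < 0.01), in which case it is kept.
    my %keep;
    for my $att (keys %count) {
        my $f = $count{$att} / $n_urls;   # fraction of URLs containing the attribute
        my $discard = ($sd{$att} == 0 || $mean{$att} == 1 || $f > 0.4)
                      && !($f < 0.01);
        $keep{$att} = 1 unless $discard;
    }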
HEURISTICS
Not all pages are necessarily of equal interest to a crawler's client. For instance, if the client is building a specialized database on a particular topic, then pages that refer to that topic are more important and should be visited as early as possible. To make this possible we should find the important attributes in the data set, and the URLs which contain more important attributes should be fetched first.
To implement this, we have considered two important heuristics.
o Entropy
The main heuristic we have used is calculating the entropy metric over the set of attributes left after the pre-processing step, across the 16,000 URLs.
Entropy = - Σ pi * log(pi)
where pi = (number of times the attribute has taken the value it has in this URL) / (total number of URLs containing that attribute).
For example, consider line number 100 and suppose it contains att1:100 att2:50. If att1 is present in 1000 URLs and its value has been 100 in 300 of them, and att2 is present in 400 URLs and has assumed the value 50 in only twenty of them, then the entropy score for this URL will be
-(300/1000) log(300/1000) - (20/400) log(20/400) - ...
with one further term for each remaining attribute of the URL.
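A sketch of how this entropy score can be computed over the filtered attributes, reusing %data, %keep and %count from the earlier sketches (the natural logarithm is assumed, since no base is specified):

    # How many URLs give each attribute each particular value.
    my %value_count;   # $value_count{att}{value} = number of URLs with att:value
    for my $url (keys %data) {
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            $value_count{$att}{ $data{$url}{$att} }++;
        }
    }

    # Entropy score of a URL: -sum over its attributes of p * log(p), where
    # p = (URLs where the attribute takes this value) / (URLs containing the attribute).
    my %entropy;
    for my $url (keys %data) {
        my $score = 0;
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            my $p = $value_count{$att}{ $data{$url}{$att} } / $count{$att};
            $score -= $p * log($p);
        }
        $entropy{$url} = $score;
    }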
o Weighted sum
The second heuristic we have used is assigning a weighted sum to each URL based on the goodness of the attributes present in it. In this method, we have ranked attributes according to how rare they are and calculated a score for each URL accordingly.
Again consider the same example of line number 100 containing att1:100 att2:50. If att1 is now present in 50 URLs and att2 is present in 200 URLs, then the weighted sum for this particular URL will be 100/50 + 50/200.
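A matching sketch of the weighted-sum score under that reading, again reusing %data, %keep and %count: the score of a URL is the sum, over its attributes, of the attribute value divided by the number of URLs containing that attribute, so rare attributes contribute more.

    my %weighted;
    for my $url (keys %data) {
        my $score = 0;
        for my $att (keys %{ $data{$url} }) {
            next unless $keep{$att};
            $score += $data{$url}{$att} / $count{$att};   # value / rarity weight
        }
        $weighted{$url} = $score;
    }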
After completing the calculation of entropy or weighted sum, we sort the URLs by these values. After all the processing, an interactive, user-driven crawling module appears which asks whether we want to fetch or drop n URLs and correspondingly displays the results as the top 20 URLs.
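A sketch of the final sorting step and the interactive fetch/drop prompt; the command syntax ('fetch n' / 'drop n') and the use of the %entropy scores here are illustrative only.

    # Sort URLs by score, highest first, to form the priority queue.
    my @queue = sort { $entropy{$b} <=> $entropy{$a} } keys %entropy;

    print "fetch or drop how many URLs? (e.g. 'fetch 20' or 'drop 5'): ";
    while (my $cmd = <STDIN>) {
        chomp $cmd;
        last unless $cmd =~ /^(fetch|drop)\s+(\d+)$/;
        my ($action, $n) = ($1, $2);
        if ($action eq 'fetch') {
            my @top = splice(@queue, 0, $n);   # take the n best URLs
            print "URL $_ (score $entropy{$_})\n" for @top;
        } else {
            splice(@queue, 0, $n);             # drop the n best URLs
        }
        print "fetch or drop how many URLs? ";
    }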
IMPLEMENTATION
We have implemented our priority crawler in PERL, as it contains many inbuilt features, such as REGEX, for handling such huge and anonymous data. Also, instead of using traditional data structures like queues and stacks, we have used hash maps for storing the attribute values, so that they can easily be compared for their occurrences across different URLs and accurate results can be found.
VI: RESULTS
INFERENCES:
Calculation by weighted sum takes almost half the run time taken by the entropy heuristic, but we also saw a drastic change in the URL queue between the two.
The total run time of the code depends on the day we are calculating for. The average run time of the code is around 250 seconds to fetch the top 20 URLs for each day's data.
Snapshot of the results over the data of Day 120.
For day 0, the variation in entropy values can be enumerated as follows:
URL ID Entropy Value
7289 4.06722374538319
3237 4.03754144448823
1243 3.97515162331633
3492 3.90347165343665
3454 3.90347165343665
14133 3.89863541884464
13922 3.77368213065205
3374 3.76249595546937
3297 3.76249595546937
3694 3.71368047511625
3465 3.71368047511625
4694 3.70846175620895
4658 3.70846175620895
2764 3.6962117775535
9955 3.61183637301763
10357 3.61183637301763
11144 3.60009816371488
13441 3.60009816371488
2721 3.60009816371488
11225 3.60009816371488
2794 3.60009816371488
3540 3.53217880206689
3466 3.53217880206689
3570 3.53217880206689
3557 3.53217880206689
3550 3.53217880206689
3460 3.53217880206689
14279 3.51733817831168
13950 3.49387686344043
11721 3.39696069252859
6277 3.3452109708853
Scatter plot of day (X axis) vs. run time in seconds (Y axis).
Snapshot of the 2nd heuristic, i.e. weighted sum, on the data of day 120.
VII: SURVEY
Surveying other possible approaches is a very important part of this paper, since we have to look at all the pros and cons of our approach to make it better in the future. In this paper, we have surveyed five possible approaches for a priority-based crawler and have compared them with the method we have implemented.
The team Quarks has used the approach of first preprocessing the data by removing the threshold 0s and inserting 0s in place of incomplete data, and then normalizing all the features to remove the redundant ones. Their preprocessing takes 1.5 hours, which is definitely not preferable when designing something like a time-constrained crawler. Also, after preprocessing they arrive at a surprisingly small figure of 23 important attributes out of a total of 3.2 million. Calculating the entropy metric of 16,000 URLs over these 23 attributes, they produce the top 100 URLs in a total of 5 minutes.
The other team, Spiders 2011, has done simple preprocessing by just adding 0s in place of incomplete data, which takes around 1 minute, and removing redundancy, which takes around 3 minutes. This preprocessing step yields the top 84,000 attributes, over which they calculated entropy gain and information gain on just 1,000 URLs to give the top 100 URLs. There are basically two loopholes in their approach. Firstly, in the preprocessing step they have not taken care to preserve the rarely occurring but important data, which could get removed during redundancy removal. Secondly, they have worked only on 1,000 URLs of Day 0; since all 16,000 URLs are not considered, this would definitely affect the quality of the results they obtain by this approach.
The third approach we shall consider is that of team SKIM. In preprocessing, they have considered only the 67,000 attributes occurring in 1,000 randomly selected URLs. On those attributes they have done redundant-0 removal and variance classification, i.e. if the variance of an attribute is greater than 0.9, they have removed it. After this step on the randomly selected 1,000 URLs and attributes, they have done PCA dimension reduction on 3 matrices of sizes 1000*100, 500*100 and 100*100. At the final step they have calculated the importance of each URL by the formula
Importance of a URL = Σ (eigenvalue of a feature * feature value),
where the summation runs over all features in that particular URL.
Finally, they have sorted the URL values in descending order to present the top 10 URLs.
The main flaw in their approach, according to us, is the random sampling of 1,000 URLs, because of which their final result depends only on that sample and not on the complete data set, though their idea of PCA reduction is appreciable since it gives an accurate idea about the importance of a URL.
The fourth team we have surveyed is A3M. They have worked only on 5,040 URLs which they obtained through random sampling from the total of 16,000 URLs, and they have considered only the features occurring in these URLs, so the number of attributes is reduced from the total of 3.2 million to 33,000. After calculating the information gain of each and every feature, they selected the top 100 features after sorting them according to their entropy gain value. They took only those URLs out of the total 16,000 which had +1 assigned to them (there were 5,963 of them), and then features were assigned a value 0 or 1 depending on their occurrence in these URLs (assign 0 if the feature is not present in the URL).
Then an importance matrix of size 5963*100 was formed, in which the importance value was calculated through entropy.
Finally they calculated the top 100 URLs according to their importance value.
The approach of team A3M can be criticised on the basis that they have neglected the importance of preprocessing and have relied on random sampling, which may or may not give reliable results.
The last team we surveyed was Web Spiders. They have worked only on the top 1,000 attributes and 50 URLs. In the preprocessing step they have just used the naive approach of inserting 0s in place of missing data. After this they calculated the entropy of each attribute and then subtracted from it the conditional entropy of that attribute.
After that they calculated the information gain for each attribute and bubble-sorted the attributes. At the last step they applied IR metrics to the URLs accordingly and selected the URLs for enqueueing and dequeueing. Although they have applied really good techniques like entropy, information gain and IR metrics, they should have worked on the whole range of attributes and URLs to get a clear picture of the crawler they have designed and to obtain more reliable results.
Finally, we can conclude that the teams we have surveyed have either sampled k URLs or taken a fixed number of attributes. On the other hand, we have taken care of all the attributes, reduced them, and then assigned a weighted sum to each URL based on the goodness of its attributes, so as to have more reliable and accurate results.
We can also sum up the time taken by our approach and the others, to give a clear picture of which approach is better:
Team Name       Preprocessing time    Time taken in displaying the results after preprocessing
Web Miners      18-20 seconds         Around 250 seconds
Quarks          1.5 hours             5 minutes
SPIDERS 2011    1-2 minutes           10 minutes
SKIM            N/A                   6.712 seconds
Web Spiders     N/A                   1-2 seconds
A3M             N/A                   0.386 seconds
VIII: REFERENCES
1) http://en.wikipedia.org/wiki/Web_crawler
2) Slides and class notes of the Web Mining and Knowledge Management course.
3) Information provided by the other coding teams.