Technical University of Crete
Department of Electronic and Computer
Engineering
DESIGN AND EVALUATION OF TOPIC
DRIVEN FOCUSED CRAWLERS
FOR THE WORLD WIDE WEB
By
BATSAKIS SOTIRIOS
A Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Computer Engineering
Chania, November 2007
Design and evaluation of topic driven
focused crawlers for the World Wide Web
Batsakis Sotirios
Abstract
Focused crawlers are programs designed to browse the Web and download pages on a specific topic. They are used for answering user queries or for building digital libraries on a topic specified by the user. They are divided into classic, semantic and learning focused crawlers. Classic focused crawlers estimate the relevance of Web pages to the topic by computing the similarity of Web pages with a user provided list of keywords that describe the topic of interest. Semantic crawlers are a variation of classic focused crawlers that use conceptual relations between terms (e.g. retrieved from an ontology) for estimating the relevance of a Web page to the topic. Learning crawlers employ a training process that guides the crawler towards pages related to the topic.
This work addresses issues related to the design and implementation of classic, semantic and learning focused crawlers. Several variants of classic focused crawlers relying on Web page content and link anchor text for estimating the relevance of Web pages to a given topic are examined and implemented. A novelty of this work is the introduction of a new category of semantic crawlers that use WordNet as the underlying ontology for obtaining terms conceptually related (but not necessarily lexicographically similar) to the topic. Learning crawlers based on the Hidden Markov Model (HMM), which learn not only the content of relevant pages but also paths leading to relevant pages through a certain number of routing hops, are examined as well. An additional contribution of this work is the introduction of a new category of hybrid crawlers combining the strengths of both classic and learning focused crawlers.
The crawlers referred to above are all implemented and a comparative analysis of their performance is presented. All crawlers achieve their maximum performance when a combination of Web page content and anchor text is used for assigning download priorities to Web pages. Semantic similarity methods combined with a general purpose ontology source such as WordNet do not actually improve performance, except for the implementation that restricts semantic similarity to synonym terms. Hybrid crawlers improved the performance of state-of-the-art HMM crawlers, yielding very promising results.
Contents
Chapter 1. Introduction
1.1 Background
1.2 Present work
1.3 Contribution of the current thesis
1.4 Thesis outline
Chapter 2. Related Work
2.1 Introduction
2.2 Non Focused Crawlers
2.3 Classic Focused Crawlers
2.4 Semantic Crawlers
2.5 Learning Crawlers
2.6 Summary
Chapter 3. Crawler Design
3.1 Introduction
3.2 Classic Crawlers
3.2.2 Best First Crawler with anchor text similarity
3.2.3 Best First Crawler with page content and anchor text
3.3 Semantic Crawlers
3.3.1 Ehrig Crawler
3.3.2 SSRM Crawler
3.3.3 Semantic Crawler with synonym set expansion
3.4 Learning Crawlers
3.4.1 Hidden Markov Model Crawler
3.4.2 Hybrid Crawlers
3.5 Summary
Chapter 4. Experimental Results
4.1 Introduction
4.2 Performance measures
4.3 Experiment setup
4.4 Classic Focused Crawlers
4.5 Semantic Crawlers
4.6 Learning Crawlers
4.7 Discussion
Chapter 5. Conclusions and future work
References
Chapter 1. Introduction
The World Wide Web is a huge information source with billions of Web pages on every conceivable subject. General purpose search engines such as Google [5], Yahoo [7], MSN [8] and Ask [9] have appeared in order to assist users in finding information on the Web. These search engines are very complicated and sizable systems [1, 2], but they do not achieve full coverage of the Web. Google achieves up to 76% and Yahoo up to 69% coverage, while other search engines index an even smaller percentage of the entire Web [3]. Information searches issued through Web search engines are not propagated over the Web in real time. Instead, search engines index, analyze and categorize Web information accumulated locally in data repositories, and this information is then used for answering user queries. The general purpose search engine approach effectively addresses the need of the end user to find specific information in real time.
Crawlers (also known as Robots or Spiders [20]) are tools for assembling information from the Web locally. Focused crawlers in particular have been introduced for satisfying the need of individuals (e.g. domain experts) or organizations to create and maintain local digital libraries on a subject, or for answering complicated queries (for which a Web search engine would yield limited or no satisfactory results). Users of such applications typically require high quality, up-to-date results while minimizing the amount of resources dedicated to the search task. Focused crawlers download as many pages relevant to the subject as they can, while keeping the amount of irrelevant data downloaded to a minimum [30]. Besides the creation of specialized digital libraries, applications of focused crawlers also include guiding intelligent agents on the Web for locating specialized information (e.g. flight schedules and ticket prices for a voyage planning agent). As the importance and the size of the Web grow, so does the importance of focused crawlers.
1.1 Background
Crawlers are given a starting set of Web pages (seed pages) as their input, extract the outgoing links appearing in the seed pages and determine which links to visit next based on certain criteria. Next, the Web pages pointed to by these links are downloaded, and those satisfying certain selection criteria are stored in a local repository. Crawlers continue visiting Web pages until a predefined number of pages have been downloaded or until local resources (such as storage) are exhausted.
The crawlers used by general purpose search engines retrieve Web pages massively regardless of their topic. Methods for implementing such crawlers include:
a) Breadth First Crawlers: The outgoing links from the given set of pages are extracted and inserted in a First In First Out (FIFO) queue, and their corresponding Web pages are downloaded. The process continues similarly with the new pages.
b) Page importance Crawlers: They assign higher visit priority to Web pages (i.e. to their corresponding URLs) linked to from more important pages. Page importance estimation criteria for assigning priorities to extracted URLs include Backlink count (i.e. the number of Web pages containing links to a given page) [22] and PageRank (the importance estimation method used in the Google search engine) [6].
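
As a minimal illustration of the two strategies listed above, the following Python sketch performs a breadth first traversal with a FIFO queue and computes Backlink counts. It is only a sketch under stated assumptions: the function names (`breadth_first_crawl`, `backlink_counts`), the `get_links` callback and the toy link graph are hypothetical and stand in for real page downloading and link extraction; they are not taken from the implementation evaluated in this thesis.

```python
from collections import deque


def breadth_first_crawl(seeds, get_links, max_pages):
    """Visit pages in FIFO order, starting from the seed URLs.

    `get_links(url)` is a hypothetical callback returning the outgoing
    links of a page; a real crawler would download and parse the page here.
    """
    frontier = deque(seeds)   # First In First Out queue of URLs to visit
    seen = set(seeds)         # URLs already queued, so none is visited twice
    visited = []              # URLs in the order they were "downloaded"
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


def backlink_counts(graph):
    """Count, for each URL, how many distinct pages contain a link to it."""
    counts = {url: 0 for url in graph}
    for source, links in graph.items():
        for target in set(links):
            counts[target] = counts.get(target, 0) + 1
    return counts


# Toy in-memory link graph standing in for the Web (illustrative data only).
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": [],
}


def toy_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_WEB.get(url, [])
```

Calling `breadth_first_crawl(["A"], toy_links, max_pages=10)` visits the toy pages level by level, while `backlink_counts(TOY_WEB)` could be used to reorder the frontier by page importance instead of arrival order.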
Although simple, Breadth First Crawlers achieve good performance (measured as the average quality of downloaded pages using the PageRank criterion) [19], and are effective for implementing non-focused crawlers. The major disadvantage of Breadth First Crawlers (and of the other non topic driven crawlers) is that they use only the link structure of the Web, and not Web page content, in assigning visit priorities to URLs; consequently they fail to focus on pages on a topic. Because pages on a specific topic are a minor fraction of the overall Web, crawling on that topic using non focused crawlers will result in downloading a large number of irrelevant pages, thus quickly exhausting the available resources. Therefore building a specialized digital library calls for focused crawlers.
Focused crawlers work by combining the content of the retrieved Web pages and the link structure of the Web for assigning higher visiting priority to pages relevant to the topic. They are divided into the following categories:
a) Classic Focused Crawlers [26] take as input a user query that describes the topic and a set of starting URLs (seeds). The crawling starts from the user provided seed URLs. The crawlers assign a priority value to visited pages according to their relevance to the topic. The Web pages are ordered by relevance and the crawlers proceed by visiting the most relevant Web pages first. The most common criterion for estimating the relevance of a retrieved page to a user query is the similarity between the text of the visited page and the query (topic). Typically this is computed using a text similarity model such as the Boolean or the Vector Space Model [12]. Focused crawlers using the Vector Space Model for relevance estimation (Best First Crawlers [25]) are the most effective classic focused crawling methods so far [26]. Existing work on classic focused crawlers is presented in section 2.3. Our proposed variants and implementations of classic focused crawlers are discussed in section 3.2.
b) Semantic Crawlers are a variation of classic focused crawlers. Page visit priority is assigned to pages using their content and by applying semantic criteria for computing page-to-topic relevance. A page and the query can be relevant if they share conceptually similar (but not necessarily lexically similar) terms. Conceptual relations between terms are defined using an underlying topic specific or general purpose ontology. Thus, semantic crawlers differ from classic focused crawlers in the way content relevance is computed. To the best of our knowledge, semantic crawlers have not been compared with state-of-the-art classic focused crawlers such as those referred to above, nor have they been combined with modern semantic similarity methods (such as those presented in [11]) so as to achieve their full potential. The present work addresses all these issues (section 3.3).
c) Learning Crawlers [33] apply a training process for assigning visit priorities to Web pages and for guiding the crawling process. They are characterized by the way relevant Web pages, or paths through Web links for reaching relevant pages, are learned by the crawler (typically by machine learning or other processes) so that the crawler can distinguish between relevant and non relevant pages. Building upon this idea, a number of approaches for learning which Web pages are relevant to the topic have appeared in the literature and include:
1. Approaches based on machine learning: The crawler is supplied with a training set consisting of relevant and non relevant Web pages which is used to train the learning crawler [33, 34]. During crawling higher visit priority is assigned to Web pages classified as relevant to the topic.
2. Approaches that take into account not only the page content and the corresponding classification of Web pages as relevant or non relevant to the topic, but also the link structure of the Web and the probability that a given page (which can be non relevant to the topic) will lead to a relevant page within the minimum number of steps (hops). Methods based on Context Graphs [31] and Hidden Markov Models (HMM) [16] are examples of this category of methods. Section 2.5 contains a detailed description of these methods and Section 3.4 presents the enhancements proposed in this work.
3. Hybrid methods that combine learning crawlers with ideas of classic focused crawlers [35]. Our work focuses on hybrid crawlers and proposes an approach that combines the strengths of classic focused crawlers (variations of Best First Crawlers) with Hidden Markov Models for learning not only how to distinguish between relevant and non relevant Web pages based on content, but also how to guide the search for such relevant Web pages through a sequence of routing hops between Web pages (sometimes through non relevant pages). This method is described in section 3.4 and the experimental results obtained (Section 4.6) indicate that it is a very effective crawling method.
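Fig. 1: Crawler Classification. Crawlers are divided into non topic oriented crawlers and focused crawlers; focused crawlers are further divided into classic focused crawlers, semantic crawlers and learning crawlers.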
1.2 Present work
This work deals with the design and evaluation of focused crawlers. State of the art approaches for building topic driven focused crawlers are considered, including classic, semantic and learning crawlers. Several variants of these approaches are also proposed and evaluated. The emphasis of this work is on hybrid crawlers combining text and link information for reaching promising pages on the topic of interest faster.
The first crawler implemented is the Breadth First Crawler. This is a classic non topic-oriented crawler which is used as a reference in all comparisons with focused crawlers. Several variants of the Best First Crawler [25] are also implemented and evaluated. The Best First Crawler works by estimating the relevance of the retrieved page to the user query (both represented using term vectors) using the Vector Space Model (VSM) [12]; then it visits the links extracted from the most relevant page. A URL can be represented either by the term vector of the Web page it was extracted from, or by the term vector of its corresponding anchor text (the text that appears on the link pointing to that URL). All solutions (using page content, anchor text or their combination) are implemented and evaluated. These methods are described in section 3.2.
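
The Best First strategy described above can be sketched in Python as follows. This is a simplified illustration and not the implementation of section 3.2: term vectors use raw term frequencies, the combined variant is an equal-weight average of the page and anchor text scores, and all names (`term_vector`, `cosine_similarity`, `link_priority`, `Frontier`) are assumptions introduced here.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequency vector over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (VSM similarity)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(query, page_text, anchor_text, mode="both"):
    """Score a link by the text of its source page, its anchor text, or both."""
    q = term_vector(query)
    page_score = cosine_similarity(q, term_vector(page_text))
    anchor_score = cosine_similarity(q, term_vector(anchor_text))
    if mode == "page":
        return page_score
    if mode == "anchor":
        return anchor_score
    # Assumption: the combined variant is an equal-weight average.
    return 0.5 * (page_score + anchor_score)


class Frontier:
    """Priority queue of URLs; the URL with the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that preserves insertion order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A crawler built on this sketch would push every extracted link into the `Frontier` with the score returned by `link_priority`, and always download the URL returned by `pop()` next.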
The second category of methods includes Semantic Crawlers, which estimate the conceptual (semantic) relevance of a Web page to the query. The method by Ehrig et al. [13] combines focused crawlers and semantic relations from an ontology (in [13] topic specific ontologies were used) for assigning visit priorities to pages. In our implementation of semantic crawlers, term vectors are enhanced with synonyms and semantically similar terms from WordNet [4] (thus making our implementation the first general purpose semantic crawler implementation). Topic relevance can then be computed by VSM [12], by the Semantic Similarity Retrieval Model (SSRM) [14] or by the method of Mihalcea et al. [15].
Our proposed approach to Learning Crawlers is influenced by work on HMM Crawlers [16, 18] for learning paths leading to relevant pages in addition to the content of the desired Web pages. The user of an HMM Crawler provides a training set of pages (both relevant and non relevant to the topic of interest). These pages are clustered according to their content. Transition probabilities between the resulting clusters, which represent relevant pages or non relevant pages leading to relevant ones, are computed and are used to estimate (given the cluster a Web page is assigned to) the probability that it will lead to relevant pages. The higher this probability is, the higher the visit priority given to the page's extracted links will be. K-Means [47] and X-Means [17] can be applied for the clustering of Web pages. K-Means clustering is an algorithm for grouping objects, based on their attributes/features, into K groups (K is a predefined positive integer). The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids. X-Means is an extension of K-Means that dynamically estimates the number of clusters from the data. In this work focused crawlers based on both clustering approaches are implemented and evaluated.
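
The K-Means grouping described above can be sketched in a few lines of Python. This is a minimal version of the standard iterative (Lloyd) procedure, given only as an illustration: pages are assumed to be already represented as numeric feature tuples, the initial centroids are simply the first K points (a simplification chosen to keep the example deterministic), and the function names are introduced here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Group `points` into k clusters; returns (centroids, labels).

    Assumption: the first k points serve as initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return centroids, labels


def within_cluster_sum_of_squares(points, centroids, labels):
    """The quantity K-Means minimizes: total squared distance to centroids."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
```

X-Means would wrap such a procedure in a search over candidate values of K; that search is not shown here.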
Based on the HMM Crawler, Hybrid Crawlers are proposed that combine classic focused crawlers, for assigning priorities to URLs based on topic relevance, with learning crawlers, for learning access paths for reaching relevant pages (possibly through non relevant ones). Two hybrid crawlers, combining HMM with page text or with both page and anchor text, are implemented and evaluated. Our proposed approach to hybrid crawlers is presented in section 3.4.
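
The following Python sketch illustrates, in a deliberately simplified form, how a content score and a learned path score might be blended into one visit priority. It should not be read as the method of section 3.4: it uses a first-order Markov chain over cluster labels instead of a full HMM, it looks only one hop ahead, and the linear weighting in `hybrid_priority`, the label `"relevant"` and all function names are assumptions made for this example.

```python
from collections import defaultdict


def transition_probabilities(paths):
    """Estimate P(next cluster | current cluster) from observed link paths.

    Each path is the sequence of cluster labels of the pages visited in order.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probabilities[current] = {
            nxt: n / total for nxt, n in successors.items()
        }
    return probabilities


def hybrid_priority(content_score, cluster, probabilities,
                    relevant="relevant", weight=0.5):
    """Blend content similarity with the chance of reaching a relevant page.

    `weight` balances the two scores; the linear blend and its default of
    0.5 are assumptions for illustration only.
    """
    path_score = probabilities.get(cluster, {}).get(relevant, 0.0)
    return weight * content_score + (1 - weight) * path_score
```

Under this sketch, two pages with the same low content similarity receive different priorities when one of them belongs to a cluster that often links to relevant pages, which is the behavior that lets a hybrid crawler pass through non relevant pages on the way to relevant ones.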
The crawlers referred to above (and their variations) are all implemented and their performance is compared based on results obtained from the Web using several different topics and seed (starting) pages. Chapter 4 presents a comparative study of the performance of all crawler variants by category, along with a critical analysis of their performance.
1.3 Contribution of the current thesis
T h e c on t r ib u t i on s o f t h i s w or k a r e su mm a r i z e d be lo w:
a) This thesis presents a critical evaluation of state-of-the-art approaches to Web crawling, including Classic, Semantic and Learning Focused Crawlers. To our knowledge, a similar evaluation has not appeared in the literature before.
b) It proposes several variants of existing crawling methodologies, based on recent semantic relevance estimation methods, and compares their performance with classic focused crawling methods.
c) It proposes a novel hybrid approach to learning crawling, combining classic focused crawlers for assigning priorities to URLs with ideas from learning crawlers for learning paths that reach web pages relevant to the topic.
1.4 Thesis outline
The work in this thesis is organized as follows. Related work on focused crawling is presented in Section 2, which is organized in six subsections: the first is the introduction, the second (2.2) presents non topic driven crawlers, the third (2.3) presents classic implementations of focused crawlers, the fourth (2.4) presents preliminary related work on semantic crawlers, the fifth (2.5) presents previous work on Learning Crawlers, and the sixth summarizes the above.
Issues related to the design and implementation of Web crawlers are presented in Section 3. Subsection 3.1 is an introduction to the topic, subsection 3.2 provides a detailed description of the classic crawlers implemented in this work, and subsection 3.3 deals with issues related to the design of semantic crawlers. In subsection 3.4, particular emphasis is given to learning crawlers and to the subsequent design of hybrid crawlers.
Section 4 provides a description of the experimental results. Subsection 4.1 presents the purpose of the experiments, and subsection 4.2 describes the performance measures used to evaluate the crawlers. The experimental setup is discussed in subsection 4.3. Experimental results on Classic Crawlers are presented in subsection 4.4, followed by results obtained by semantic and learning crawlers in subsections 4.5 and 4.6 respectively. Subsection 4.7 presents a critical analysis of the performance of the various crawling methods considered in this work. Finally, conclusions and issues for further research are discussed in Section 5.
Chapter 2. Related Work
2.1 Introduction
Related work on crawlers includes contributions regarding both classic (non-topic oriented) and focused (topic-oriented) crawlers. Existing work on the design and implementation of non focused crawlers and of focused (classic, semantic and learning) crawlers proposed in the literature is presented in this chapter.
Classic non focused crawlers (e.g. crawlers used by web search engines for assembling web pages into local repositories) download Web pages massively, regardless of content, in order to create vast page repositories. Focused crawlers, on the other hand, are more selective, downloading only pages related to a known (user provided) topic. Issues related to the design and implementation of classic as well as of focused crawlers are discussed in the following and include:
a) Search strategy: The crawler can browse the Web in breadth first order or select links to follow using importance estimation criteria. Focused crawlers assign visiting priorities to pages according to the relevance of the page to a topic specified by a user.
b) Refreshing policy: Due to the dynamic nature of the Web, pages must be revisited in order to keep page repositories up-to-date. Finding the optimal page refresh policy, one that keeps page repositories up-to-date without unnecessarily downloading pages that have not changed, is a very important issue in crawler design [21]. Also, satisfying the conflicting demands for a high downloading rate without putting excessive load on the visited Web sites is a major concern when designing a crawler for a search engine.
c) Synchronization: Crawlers used by commercial search engines use multiple parallel processes that massively retrieve Web pages, regardless of their topic. These processes must be synchronized in order to avoid duplicate page downloading [20].
2.2 Non Focused Crawlers
Typically, non focused crawlers are used by general purpose search engines for assembling Web information locally. Methods for implementing such crawlers include:
a) Breadth First Crawler: After downloading the initial pages (called seed pages), the outgoing links extracted from these pages are put in a FIFO queue. The links extracted first point to pages that are given the highest priority for downloading and further crawling. Breadth first crawling is one of the most commonly used crawling approaches for assembling Web content locally for use by Web search engines. Google Bot [5], Slurp [7], MSN Bot [8] and Teoma [9] are examples of crawler implementations used by commercial search engines. Page refresh policy, synchronization, and optimal downloading rate are important issues here [20, 21]. Technical issues such as the supported file formats, file size limitations and the visiting policy are also of great importance. Breadth first crawlers are capable of crawling a large part of the Web [3]. The downloaded pages are then analyzed (e.g. by content, type), indexed and subsequently stored in data repositories composed of thousands of computers and Terabytes of data [1, 2]. This approach requires huge resources, which are available only to large companies or organizations such as Google or Yahoo.
   Crawlers such as Mercator [45] and Larbin [10] are examples of Breadth First Crawlers which are freely available to programmers for testing and system development. When limited resources are available, they can crawl a small part of the indexed Web and retrieve web content for further processing. Breadth first crawlers yield high quality pages [19] but are not topic oriented.
b) Page importance Crawlers: They assign higher visit priority to URLs retrieved from more important pages. Typically, page importance for assigning priorities to extracted URLs is computed by Backlink count (where higher priority is given to pages pointed to by many other Web pages) and PageRank [6]. Other criteria can be used as well, such as the position of the page within the Web site hierarchy (e.g. low depth, as indicated by fewer or no slashes in the page URL, leads to higher priority) or the number of outgoing links of that page (Outlink count). Cho et al. [22] provide a survey of this type of crawler. Page importance criteria have been shown to improve the quality of downloaded pages [22].
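The FIFO behaviour of a breadth first crawler described in item (a) can be sketched as follows. This is a minimal illustration rather than the implementation of any crawler cited above: the link graph `TOY_WEB`, the helper `toy_links` and the `max_pages` limit are hypothetical stand-ins for real page downloading and link extraction.

```python
from collections import deque

def breadth_first_crawl(seeds, get_links, max_pages=100):
    """Visit pages in FIFO order, starting from the seed URLs."""
    frontier = deque(dict.fromkeys(seeds))  # FIFO queue of URLs to visit
    seen = set(frontier)                    # URLs already queued (no duplicates)
    visited = []                            # order in which pages were "downloaded"
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()            # oldest extracted link goes first
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical in-memory link graph standing in for the Web.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}

def toy_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_WEB.get(url, [])

print(breadth_first_crawl(["seed"], toy_links))
```

A page importance crawler would replace the FIFO queue with a priority queue keyed by a score such as backlink count.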
2.3 Classic Focused Crawlers
Crawlers used by search engines (such as those referred to in section 2.2) are designed to maximize the total number and probably the quality of downloaded web pages. Topic oriented or Focused Crawlers take as input a user query (Classic Focused Crawlers), or example pages provided by the user as a training set (Learning Crawlers), and focus the crawling process on pages relevant to the topic. Focused crawlers keep the overall number of downloaded Web pages to a minimum while maximizing the percentage of relevant pages.
The performance of a focused crawler depends on the selection of good starting pages (seed pages). Good seed pages can be either web pages relevant to the query topic or pages from which relevant pages can be accessed through a small number of routing hops. For example, if the topic is scientific publications, a good seed page can be the publications page of an author, lab or department, or alternatively the home page of the author, lab or department (although the latter may contain no publications at all, it is known to lead to pages containing publications). Seed pages should also be important (where importance is defined using link analysis methods such as HITS [46] and PageRank [6]). The rationale behind this requirement is that important Web pages, when used as starting pages (seeds) for crawling, may guide the crawling process to other important Web pages faster, thus improving the quality of the results. The seed pages are often selected by submitting the query that describes the topic of interest to a search engine and using the top search engine results.
Early approaches to Focused Crawling include, among others, the Fish-Search algorithm [23]. The basic idea of the algorithm is that when several pages are candidates for link browsing and downloading, priority is given to pages relevant to the topic (a page is labeled as relevant if it contains the query text). Every candidate page is assigned a Boolean value derived by a simple lexicographic rule (and it is downloaded by a separate application thread). Threads corresponding to relevant pages create new threads for their outgoing links, while threads corresponding to irrelevant pages are stopped. This work examined the separate use of the anchor text in assigning priorities to URLs.
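The Boolean relevance rule described above can be illustrated with a small single-threaded sketch. It is a simplification under stated assumptions, not the Fish-Search algorithm of [23] itself: the thread-per-page design is replaced by a plain queue, relevance is a case-insensitive substring test, and `PAGES`, `fetch` and `max_pages` are hypothetical names used only for the example.

```python
from collections import deque

def is_relevant(page_text, query):
    """Boolean relevance: True if the page contains the query text."""
    return query.lower() in page_text.lower()

def boolean_focused_crawl(seeds, fetch, query, max_pages=50):
    """fetch(url) -> (text, outgoing_links). Returns the relevant URLs found."""
    frontier = deque(seeds)
    seen = set(seeds)
    relevant = []
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        downloaded += 1
        if not is_relevant(text, query):
            continue                 # irrelevant page: its links are not followed
        relevant.append(url)
        for link in links:           # relevant page: its links are explored
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return relevant

# Hypothetical toy pages: url -> (text, outgoing links).
PAGES = {
    "s": ("focused crawler survey", ["p1", "p2"]),
    "p1": ("cooking recipes", ["p3"]),
    "p2": ("a focused crawler for the Web", ["p3"]),
    "p3": ("Focused Crawler evaluation", []),
}

print(boolean_focused_crawl(["s"], PAGES.__getitem__, "focused crawler"))
```

Because every relevant page receives the same value, the order among relevant candidates is arbitrary (here, FIFO).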
The main disadvantage of the Fish-Search algorithm is that priorities take Boolean values; therefore all relevant pages are assigned the same priority. The Shark-Search algorithm [24] is a direct successor of Fish-Search, where VSM [12] is used for assigning non Boolean priority values to candidate pages. This improved the results of crawling [24]. The Vector Space Model has been the basis of classic focused crawlers ever since.
According to VSM, documents are represented as term vectors and the weight $w_{ij}$ of a term j in document i is computed as:
$$w_{ij} = tf_{ij} \cdot idf_j, \qquad tf_{ij} = \frac{f_{ij}}{\max f_i}, \qquad idf_j = \log\frac{N}{n_j} \qquad (1)$$
where $tf_{ij}$ is the term frequency of term j in document i, $idf_j$ is the inverse document frequency of term j, $f_{ij}$ is the frequency of appearance of term j in document i, $\max f_i$ is the maximum frequency of all terms in document i, $N$ is the total number of documents and $n_j$ is the number of documents containing term j.
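The weighting scheme of equation (1) can be sketched in a few lines. This is an illustrative sketch, assuming documents are already tokenized into lists of terms and that the logarithm is the natural logarithm (the text does not state the base); the function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()                     # n_j: number of documents containing term j
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)             # f_ij: raw frequency of term j in document i
        max_f = max(counts.values()) if counts else 1
        vectors.append({
            term: (f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return vectors

docs = [["web", "crawler", "web"], ["web", "page"]]
for vector in tf_idf_vectors(docs):
    print(vector)
```

A term that occurs in every document (here "web") gets weight zero, since its idf is log(1).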
Recent approaches to focused crawling include InfoSpiders and the Best-First Crawler [25]. InfoSpiders use Neural Networks, while Best First Crawlers use text similarity by VSM for assigning priority values to candidate pages.
Given a query and a Web page, the priority of the Web page is computed by Best First Crawlers as the cosine similarity between their document vectors, where $w_{qk}$, $w_{pk}$ are the term weights of the query and the web page respectively:
$$similarity(query, page) = \frac{\sum_{k=1}^{n} w_{qk} \cdot w_{pk}}{\sqrt{\sum_{k=1}^{n} w_{qk}^{2}} \, \sqrt{\sum_{k=1}^{n} w_{pk}^{2}}} \qquad (2)$$
where $n$ is the total number of terms in the query and the page content.
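Equation (2) can be computed directly on sparse term-weight dictionaries. The sketch below assumes raw term counts as weights and whitespace tokenization; both are simplifications chosen for the example, and the names `cosine_similarity` and `tf_vector` are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(query_weights, page_weights):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    shared = set(query_weights) & set(page_weights)
    dot = sum(query_weights[t] * page_weights[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_p = math.sqrt(sum(w * w for w in page_weights.values()))
    if norm_q == 0 or norm_p == 0:
        return 0.0                           # an empty vector has no direction
    return dot / (norm_q * norm_p)

def tf_vector(text):
    """Term-count vector from whitespace-separated, lower-cased tokens."""
    return dict(Counter(text.lower().split()))

query = tf_vector("focused crawler")
page = tf_vector("a focused crawler downloads pages on a topic")
print(cosine_similarity(query, page))
```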
In fact, the Best First Crawler is a simplified version of the Shark-Search crawler: it does not combine link anchor text and the scores of previously visited pages into the page priority function, as Shark-Search does. Also, Best First Crawlers use only term frequency (tf) vectors for computing topic relevance. The use of inverse document frequency (idf) values (as suggested by VSM) in the case of focused crawling is problematic, since this might require recalculation of all term vectors at every crawling step. In addition, idf values are highly inaccurate at the early stages of crawling because of the small number of retrieved documents. Best First Crawlers have been shown to outperform InfoSpiders and Shark-Search, as well as non-focused crawling approaches such as Breadth First and PageRank [26]. Best first crawling is considered to be the most established approach to focused crawling due to its simplicity and efficiency. The N-Best First Crawler is a generalized version of the Best First Crawler: at each step, instead of choosing one Web page for link extraction and downloading of the pages pointed to by these links, the N pages with the highest priority are chosen [27].
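The selection step of an N-Best First crawler can be sketched with a priority queue. The sketch makes several assumptions that are not taken from the text: each extracted link inherits the score of the page it was found on, seeds receive the maximum priority, ties are broken alphabetically by URL, and `fetch`, `score`, `PAGES` and `max_pages` are hypothetical names.

```python
import heapq

def n_best_first_crawl(seeds, fetch, score, n=1, max_pages=50):
    """fetch(url) -> (text, links); score(text) -> priority in [0, 1]."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        # Take the N highest-priority URLs in one step (N = 1 is plain best first).
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            text, links = fetch(url)
            visited.append(url)
            priority = score(text)           # links inherit the parent page's score
            for link in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-priority, link))
    return visited

# Hypothetical toy pages: url -> (text, outgoing links).
PAGES = {
    "s": ("crawler", ["a", "b"]),
    "a": ("crawler crawler", ["c"]),
    "b": ("other", ["d"]),
    "c": ("x", []),
    "d": ("x", []),
}

def toy_score(text):
    """Toy relevance score: share of a two-word budget taken by 'crawler'."""
    return text.split().count("crawler") / 2

print(n_best_first_crawl(["s"], PAGES.__getitem__, toy_score, n=1))
print(n_best_first_crawl(["s"], PAGES.__getitem__, toy_score, n=2))
```

With N = 1 the crawler follows the link found on the best page ("c") before returning to "b"; with N = 2 both "a" and "b" are expanded in the same step.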
Along the same lines, an approach referred to as “intelligent crawling” [28] suggests combining page content, URL string and statistics about relevant/irrelevant pages and sibling pages for assigning priorities to candidate URLs. These statistics are updated and combined during crawling to guide the selection of the next links to follow, yielding a highly effective crawling algorithm that learns to crawl without direct user training.
2.4 Semantic Crawlers
Semantic Crawlers are implemented by combining an ontology with semantic similarity measures [14] for detecting topic relevance between retrieved Web pages and user queries. Semantic similarity plays an important role here: it can be used to detect topic relevance by associating terms in a query and the Web page using the ontology, and by assigning a degree of relevance to each such term association.
Ehrig et al. [13] propose the use of topic oriented ontologies for finding pages relevant to the topic of interest. Every term in a Web page is examined and contributes positively to the priority score if it is a query term or if it is semantically related to the user query terms. The following variations for evaluating semantic relations of page terms with query terms were used:
a) If a term is directly connected (distance 1) to a query term, then it is considered relevant (distance is defined as the length of the shortest path connecting two terms represented as vertices in the ontology graph, where edges represent relations between adjacent terms).
b) If a term is close to a query term (distance 2 or less) using only IS-A relations, then it is relevant to the query term.
c) Every page term appearing in the ontology graph is assigned a relevance value depending on its distance from the query terms. The greater the distance, the lower the relevance value will be. Specifically, using a topic specific underlying ontology, the semantic similarity between terms is computed as:
$$sim(t_1, t_2) = \delta^{\,d(t_1, t_2)} \qquad (3)$$
   where $\delta$ is a decreasing factor (0.5 in this work) and $d(t_1, t_2)$ is the length of the shortest path connecting terms $t_1$ and $t_2$ in the ontology graph (0 if the terms belong to the same synonym set). The longer the distance between the terms in the graph, the smaller their similarity is. This method is a variation of the shortest path semantic similarity method.
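The distance-based similarity of equation (3) can be sketched with a breadth first search over an ontology graph. The graph below is a hypothetical toy ontology written for the example, synonym sets are simplified to single nodes, and unconnected terms are assumed to have similarity 0, which the text does not specify.

```python
from collections import deque

def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def path_decay_similarity(graph, t1, t2, delta=0.5):
    """delta ** d(t1, t2); 0.0 if the terms are not connected (assumption)."""
    d = shortest_path_length(graph, t1, t2)
    return 0.0 if d is None else delta ** d

def undirected(edges):
    """Build a symmetric adjacency map from a list of (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

# Hypothetical toy ontology.
ONTOLOGY = undirected([
    ("vehicle", "craft"),
    ("craft", "aircraft"),
    ("craft", "vessel"),
    ("aircraft", "airplane"),
])

print(path_decay_similarity(ONTOLOGY, "airplane", "vessel"))
```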
The last approach has the best performance for computing the conceptual similarity between terms and was also used in our work for comparison with other semantic relation methods and state-of-the-art classic focused crawling approaches. Another state-of-the-art term similarity method used in the present work is the method of Li et al. [42]:
The semantic similarity between two terms $t_1$ and $t_2$ is computed as a function of the length of the path connecting the terms in the underlying ontology graph and the depth of the terms in the taxonomy:
$$sim_{Li}(t_1, t_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}} \qquad (4)$$
where $L$ is the shortest path length between $t_1$ and $t_2$, $H$ is the depth of the most specific common concept of $t_1$, $t_2$ in the taxonomy, and $\alpha$, $\beta$ are constants ($\alpha = 0.2$ and $\beta = 0.6$ in our implementation).
According to the results reported in [14], this method has been shown to be fast and accurate (achieving accuracy of up to 82% compared to results obtained by humans).
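Equation (4) is the product of an exponential decay in the path length and a hyperbolic tangent of the scaled depth, so it can be evaluated directly once L and H are known. The sketch below assumes both values have already been obtained from the taxonomy; the function name `li_similarity` is hypothetical.

```python
import math

def li_similarity(path_length, depth, alpha=0.2, beta=0.6):
    """exp(-alpha * L) * tanh(beta * H), with L the path length and H the depth
    of the most specific common concept."""
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)

# Similarity drops as the path gets longer and rises as the common concept gets deeper.
print(li_similarity(path_length=1, depth=3))
print(li_similarity(path_length=4, depth=3))
print(li_similarity(path_length=1, depth=1))
```

Note that the depth term alone bounds the result: two terms whose only common concept is the root (H = 0) receive similarity 0 regardless of path length.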
General purpose taxonomies such as WordNet can also be applied to focused crawling. WordNet is an online lexical reference system developed at Princeton University. WordNet attempts to model the lexical knowledge of a native speaker of English. WordNet can also be seen as an ontology for natural language terms. It contains around 100,000 terms, organized into taxonomic hierarchies. Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets). The synsets are also organized into senses (i.e. corresponding to different meanings of the same term or concept). The synsets (or concepts) are related to other synsets higher or lower in the hierarchy defined by different types of relationships. The most common relationships are the Hyponym/Hypernym (i.e., Is-A relationships) and the Meronym/Holonym (i.e., Part-of relationships). There are nine noun and several verb Is-A hierarchies (adjectives and adverbs are not organized into Is-A hierarchies). Figure 2 illustrates a fragment of the WordNet Is-A hierarchy.
Fig. 2 WordNet hypernym/hyponym synset relations example
[Figure: fragment of the WordNet Is-A hierarchy with nodes Vehicle; Craft; Rocket, projectile; Sled, sledge, …; Aircraft; Vessel, watercraft; Spacecraft, …; Hovercraft; Airplane, aeroplane, plane; Airship, …; Drone, …; Glider, …; Airliner; Amphibian; Jet; Fighter; Bomber; Biplane; Monoplane]
To the best of our knowledge, a comparative study between semantic and other focused crawling approaches has not been reported in the literature before. The implementations in [13] are compared only with a basic focused crawler (assigning each page a simple binary priority value depending on the presence of query terms) rather than with the widely used Best First Crawlers, which use VSM for estimating topic relevance [29]. The present work deals with exactly this issue and presents a comparative study between classic and several variants of semantic crawling approaches (including Ehrig et al. [13]).
2.5 Learning Crawlers
Early approaches to developing learning crawlers applied a learning classifier (that relied on web taxonomies such as Yahoo [7]) for distinguishing between relevant and non relevant pages [30]. Every page containing links that are candidates for downloading is classified as relevant or not relevant and assigned a priority value according to that classification (higher priority was assigned to relevant pages). This work is considered to be one of the first contributions in the field of Learning Crawlers. Recent approaches involving machine learning methods for focused crawling include decision trees [34], Neural Networks and Support Vector Machines [33].
Building upon similar ideas, the crawler in [31] introduced the concept of Context Graphs: for every relevant page, a search engine's backlink service is applied to retrieve its predecessor pages. Then, a classifier is built according to the distance (Level) of pages from the set of relevant pages. Download priorities are estimated using this classifier: the closer a candidate page is to a relevant one, the greater the priority of that page will be.
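The level assignment behind a context graph can be sketched as a backward breadth first search from the target pages. The sketch covers only the labelling step, not the classifier of [31]; the `backlinks` map is a hypothetical stand-in for a search engine's backlink service, and `max_level` is an assumed cut-off.

```python
from collections import deque

def context_graph_levels(targets, backlinks, max_level=2):
    """Label each page with its link distance (level) from the nearest target page.

    backlinks maps a page to the pages that link to it (its predecessors).
    """
    levels = {t: 0 for t in targets}         # target pages are level 0
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue                         # do not expand beyond the last level
        for predecessor in backlinks.get(page, ()):
            if predecessor not in levels:    # first visit gives the shortest distance
                levels[predecessor] = levels[page] + 1
                queue.append(predecessor)
    return levels

# Hypothetical backlink data: page -> pages linking to it.
BACKLINKS = {
    "t": ["a", "b"],
    "a": ["c"],
    "b": ["c", "t"],
    "c": ["d"],
}

print(context_graph_levels(["t"], BACKLINKS))
```

The resulting (page, level) pairs would form the training set for the level classifier.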
[Figure: context graph with a target page surrounded by Level 1 pages and Level 2 pages]
Fig. 3 Context graph: Pages are classified according to their distance (Level) from target pages.
An extension to the Context Graph method was the Hidden Markov Model (HMM) crawler [16]. The user browses the Web and indicates whether a downloaded page is relevant to the topic or not. The visiting sequence is also recorded and is used for training the algorithm to identify paths leading to relevant pages. The downloaded pages are clustered and a Hidden Markov Model [44] is created. Every page is characterized by two states: (a) the visible state, corresponding to the cluster that the page belongs to according to its content, and (b) the hidden state, corresponding to the distance of the page from a relevant page (0 if the page is a target/relevant page). During crawling, every page is assigned a value equal to the probability that, given the cluster the page belongs to, crawling will lead to a target page. This probability is computed using the Hidden Markov Model.
Specifically, all pages are represented by their term vectors
according to VSM and they are clustered. Thus every page in the
training set is characterized by the cluster it belongs to and by
its distance (level) from a target page (Fig. 4).
Fig. 4 Representation of the HMM training set using distance from
target pages (Level) and clusters of pages in the training set.
(Legend: L3 page, L2 page, L1 page, L0 page.)
In Figure 4, green pages indicate target or level 0 pages, yellow
pages are level 1 pages (1 link away from target pages), orange
pages are level 2 (2 links away from target pages), and red pages
are 3 or more links away from target pages. Labels on pages
represent the cluster the page belongs to (e.g. C0, C1 and C2
labels corresponding to Cluster 0, Cluster 1 and Cluster 2
respectively). Notice that pages within the same cluster can
belong to different levels, and that pages in the same level can
belong to different clusters. Every page is characterized by its
level or hidden state $L_i$, where $i$ is the level, and by the
cluster $C_j$ it belongs to (or visible state). That set of pages
with hidden and visible
states form a Hidden Markov Model [44]. The following summarizes
the parameters and notation used by the HMM crawler:
I. Initial probability matrix:
$$\pi = \{P(L_0), \ldots, P(L_{\mathit{states}-1})\}$$
Where $\mathit{states}$ denotes the number of hidden states and
$P(L_i)$ represents the probability of being at hidden state $i$
at time 1. This probability is computed by assigning to each page
a value equal to the percentage of pages with the same hidden
state in the training set.
II. Transition Probabilities Matrix A:
$$A = [a_{ij}]_{0 \le i < \mathit{states},\ 0 \le j < \mathit{states}}$$
Where $a_{ij}$ represents the probability of being at state $L_j$
at time $t+1$ if at state $L_i$ at time $t$. This probability is
estimated by counting the corresponding transitions from state
$L_i$ to $L_j$ in the user training set, and by normalizing by the
overall number of transitions from state $L_i$.
III. Emission Probabilities Matrix B:
$$B = [b_{ij}]_{0 \le i < \mathit{states},\ 0 \le j < \mathit{clusters}}$$
Where $b_{ij}$ represents the probability of being at cluster
$C_j$ given state $L_i$ and $\mathit{clusters}$ is the number of
clusters of pages. Probabilities are computed by counting the
number of pages in cluster $C_j$ with hidden state $L_i$ and
normalizing by the overall number of pages with hidden state
$L_i$.
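The counting and normalization steps described for π, A and B can be
illustrated with a short Python sketch. It is not the code of the
HMM crawler [16]: the input format (a list of visit sequences whose
elements are `(level, cluster)` pairs) and the function names are
assumptions made only for this example.

```python
from collections import Counter


def normalize(row):
    """Divide a row of counts by its sum (all zeros if the row is empty)."""
    total = sum(row)
    return [x / total if total else 0.0 for x in row]


def estimate_hmm(paths, n_states, n_clusters):
    """Estimate pi, A and B by counting over (level, cluster) visit sequences."""
    state_count = Counter()
    trans = [[0] * n_states for _ in range(n_states)]
    emit = [[0] * n_clusters for _ in range(n_states)]
    for path in paths:
        for level, cluster in path:
            state_count[level] += 1
            emit[level][cluster] += 1      # pages of cluster C_j with state L_i
        for (li, _), (lj, _) in zip(path, path[1:]):
            trans[li][lj] += 1             # observed transition L_i -> L_j
    total = sum(state_count.values())
    pi = [state_count[i] / total for i in range(n_states)]
    A = [normalize(row) for row in trans]
    B = [normalize(row) for row in emit]
    return pi, A, B


# Two hypothetical user browsing paths ending at a target (level 0) page.
paths = [[(2, 1), (1, 0), (0, 0)], [(1, 1), (0, 0)]]
pi, A, B = estimate_hmm(paths, n_states=3, n_clusters=2)
```

Rows of A for states with no observed outgoing transition are left
at zero in this sketch; how such states are treated is a design
choice that the description above does not specify.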
During crawling, page content is processed and the HMM crawler
assigns the page to a cluster using the K-Nearest Neighbors
algorithm [43]. Given the page cluster and the Hidden Markov Model
parameters (π, A and B matrices), the probability that the next
page visited will be a target page is
computed using the Viterbi algorithm [40]. That probability also
represents the visit priority of the link. The Viterbi algorithm
computes a prediction of the state in the next time step given the
sequence of web pages observed thus far. In order to calculate the
prediction value, each visited page is associated with values
$a(L_j, t)$, $j = 0, 1, \ldots, \mathit{states}-1$. Value
$a(L_j, t)$ is the probability that the system is in hidden state
$L_j$ at time $t$, based on observations made thus far. Given
values $a(L_j, t-1)$ of parent pages, values $a(L_j, t)$ are
computed using the following recursion:
$$a(L_j, t) = b_{j c_t} \sum_{i=0}^{\mathit{states}-1} a(L_i, t-1) \cdot a_{ij} \tag{5}$$
Where $a_{ij}$ is the transition probability of state $L_i$ to
$L_j$ from matrix A and $b_{j c_t}$ is the emission probability of
cluster $c_t$ from hidden state $L_j$ from matrix B. Values
$a(L_j, 0)$ at the final recursion step are taken from the initial
probability matrix π. Given values $a(L_j, t)$, the probability
that the system will be in state $L_j$ at the next time step is
computed as follows:
$$a(L_j, t+1) = \sum_{i=0}^{\mathit{states}-1} a(L_i, t) \cdot a_{ij} \tag{6}$$
The probability of being at state $L_0$ (relevant page) in the
next step is the priority assigned to pages.
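The two-step computation of Eq. 5 and Eq. 6 can be traced with a
small Python sketch. It is an illustration under stated assumptions
and not the crawler of [16]: the function names are hypothetical,
the values are left unnormalized exactly as in the two equations,
and the 2-state parameters below are made-up numbers.

```python
def forward_step(prev, A, B, cluster):
    """Eq. 5: a(L_j, t) = b[j][c_t] * sum_i a(L_i, t-1) * a[i][j]."""
    n = len(A)
    return [B[j][cluster] * sum(prev[i] * A[i][j] for i in range(n))
            for j in range(n)]


def predict_next(current, A):
    """Eq. 6: a(L_j, t+1) = sum_i a(L_i, t) * a[i][j]."""
    n = len(A)
    return [sum(current[i] * A[i][j] for i in range(n)) for j in range(n)]


def link_priority(parent_values, A, B, cluster):
    """Priority of a page's links = predicted value of state L_0 at t+1."""
    current = forward_step(parent_values, A, B, cluster)
    return predict_next(current, A)[0]


# Made-up model with two hidden states (L_0 = target, L_1 = one link away).
pi = [0.5, 0.5]
A = [[0.5, 0.5], [0.25, 0.75]]
B = [[0.75, 0.25], [0.25, 0.75]]
priority = link_priority(pi, A, B, cluster=0)  # page observed in cluster 0
```

Here the parent values are taken directly from π, which corresponds
to a page reached in the first crawling step.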
Chakrabarti et al. [32] proposed a two-classifier approach. The
Open Directory (DMOZ) [39] Web taxonomy is used to classify
downloaded pages as relevant or not, and to feed a second
classifier which is trained using these pages. The second
classifier is used to evaluate the probability that the given page
will lead to a target page. An extensive study of Learning
Crawlers and the evaluation of several
classifiers used to assign visit priority values to pages is
presented in [33]. Classifiers based on Support Vector Machines
(SVM) [38] seem to outperform Bayes classifiers and classifiers
based on Neural Networks on that task.
Recent contributions to the field of learning crawling include
Hybrid crawlers [35], combining ideas from learning and classic
focused crawlers. In [35] a Hybrid Crawler is proposed: The
crawler works by acting alternately either as a learning crawler
guided by genetic algorithms (for learning the link sequence
leading to target pages) or as a breadth first crawler. In our
work, we apply a hybrid method for improving the performance of
learning crawlers. However, instead of alternating crawlers
between two modes of operation (Learning or Breadth first
crawler), we combine the page priority functions computed by a
Hidden Markov Model Crawler and that of the Best First Crawler in
order to evaluate the overall priority value of a Web page.
2.6 Summary
Related work on focused crawlers includes classic, semantic and
learning approaches. The Best First Crawler and variations of this
method (e.g. N-Best First Crawler) form a common and effective
approach for focused crawling [26]. Semantic crawlers presented in
[13] are not well studied and a comparison with state of the art
classic focused crawlers such as Best-First hasn't appeared in the
literature before. Learning crawlers form a distinctive category
of focused crawlers based on a training set provided by the user
for topic description. Learning crawlers based on SVM classifiers
for assigning page visiting priorities achieve good performance
[33], while methods that learn paths leading to pages relevant to
the topic, such as the Context Graph method [31] and Hidden Markov
Model Crawlers [16,18], are of great
importance. Also, the newly proposed hybrid methods [35] are a
very promising approach to focused crawling.
Chapter 3. Crawler Design
3.1 Introduction
Issues related to the design and implementation of focused
crawlers are discussed next. Given an application (general purpose
web search engine or topic specific digital library), the
appropriate type of web crawler has to be determined first. For
the first application type, a breadth first crawler is a
reasonable solution. Focused crawlers (classic, semantic or
learning crawlers) are best suited for the latter application
type.
Fig. 5 Crawler Operation (panels: Breadth First Crawler, Focused
Crawlers). Green circles denote pages relevant to the topic and
arcs denote links between Web pages. Arrows denote the visit
sequence using different crawlers. Focused Crawlers assign higher
visit priorities to links contained in pages relevant to the
topic.
Fig. 5 demonstrates the search stages of a crawler. Web pages are
denoted by circles (green circles correspond to pages relevant to
the topic at hand) while arcs denote outgoing links from a page.
The crawler retrieves pages from the web starting with a seed page
shown at the root of the tree. As discussed in the introduction,
the outgoing links (URLs) of each visited page are placed in a
queue from which the web page to visit next is selected in some
order. The crawler gets the URL, downloads the page and places
URLs extracted from the downloaded page in the queue. This process
is repeated until the crawler decides to stop (e.g. disk space is
exhausted, time has elapsed or the user is satisfied with the
results). Focused crawlers introduce a number of criteria (e.g.
page importance, relevance to topic) for assigning priorities to
web pages in the queue and for selecting which page to visit next.
Fig. 6 illustrates the operation stages of a crawler:
Fig. 6 Overview of Crawler operation. (Flowchart stages: User
input; Page downloading; Content processing; Priority assignment;
Crawling termination criteria satisfied? Yes/No; Output: Web pages
satisfying user needs.)
a) Input: Crawlers take as input a number of starting (seed) URLs
and (in the case of focused crawlers) the topic description. This
description can be a list of keywords for classic and semantic
focused crawlers or a training set for learning crawlers.
b) Page downloading: Pages from the queue are downloaded in some
order. Focused crawlers may decide to exclude pages not satisfying
the topic criteria from further investigation. Pages are stored
locally in a page repository for further processing.
c) Content processing: The page content is lexically analyzed and
reduced into term vectors (all terms are reduced to their
morphological roots by applying Porter's stemming algorithm [48]
and stop words are removed). Each term in a vector is represented
by its term frequency-inverse document frequency (tf-idf) weight
according to VSM. The outgoing links of the page are also
extracted and placed in the priority queue.
d) Priority assignment: URLs extracted from downloaded pages are
placed in a priority queue where priorities are determined based
on the type of crawler and user preferences. They range from
simple criteria such as page importance or relevance to the query
topic (computed by matching the query with page or anchor text) to
more involved criteria (e.g. criteria determined by a learning
process).
e) Expansion: URLs are selected for further expansion and steps
(b)-(e) are repeated until some criteria (e.g. the desired number
of pages have been downloaded) are satisfied or system resources
are exhausted.
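Steps (b)-(e) above can be sketched as a loop around a priority
queue. The Python fragment below is only an illustration of the
general design, not the Java implementation used in this work: the
`fetch` and `score` callables, the in-memory `web` dictionary and
the keyword-ratio scoring function are all hypothetical stand-ins
chosen so that the example runs without network access.

```python
import heapq


def crawl(seeds, fetch, score, max_pages):
    """Generic crawl loop: repeat steps (b)-(e) until max_pages are downloaded."""
    frontier = [(-1.0, url) for url in seeds]  # negated priority -> max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        _, url = heapq.heappop(frontier)       # (e) select URL to expand
        page = fetch(url)                      # (b) page downloading
        if page is None:
            continue
        text, links = page                     # (c) content processing
        downloaded.append(url)
        priority = score(text)                 # (d) priority assignment
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))
    return downloaded


def keyword_ratio(text):
    """Toy relevance score: fraction of words equal to 'crawler'."""
    words = text.split()
    return words.count("crawler") / len(words)


# Hypothetical miniature Web: url -> (page text, outgoing links).
web = {
    "seed": ("crawler index", ["a", "b"]),
    "a": ("cooking recipes", ["c"]),
    "b": ("focused crawler topic", ["d"]),
    "c": ("more recipes", []),
    "d": ("crawler design", []),
}
visited = crawl(["seed"], web.get, keyword_ratio, max_pages=4)
```

In this sketch a link inherits the score of the page it was found
on, so the link found on the relevant page "b" is expanded before
the link found on the irrelevant page "a".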
All Crawlers follow the above design. The Breadth First Crawler
requires only seed pages as input. Best-First and Semantic
Crawlers take the seed pages and a user query as input, while
Learning Crawlers accept a training set of URLs of pages instead
of a query. Crawlers also differ in the way priorities are
assigned to extracted URLs. This is the most crucial part in the
implementation of focused crawlers.
All Crawlers in this work are implemented in Java [36] using
Eclipse [37]. The downloaded pages must be of text/html format and
their content size must not exceed 100KB. Restrictions are also
imposed on connection timeout and downloading times for
performance reasons. Those restrictions apply to all implemented
crawlers. The crawling process is repeated until the predefined
number of pages is retrieved (Fig. 6). In our experiments this
number is set equal to 1000 web pages.
3.2 Classic Crawlers
The Breadth First Crawler forms the baseline for implementing Best
First, Semantic and Learning Crawlers. It is a simple program that
gets one or more seed pages as input and follows the links in a
breadth first way until the desired number of Web pages is
downloaded. Fig. 7 illustrates the interface of the Breadth First
crawler implemented. It accepts one or more seed pages as input.
Downloaded pages are shown below.
Fig. 7 Screenshot of Breadth First Crawler
3.2.1 Best First Crawler with page content criteria
The second classic Crawler (and the first focused) implemented is
the Best First Crawler using page content for prioritizing
candidate URLs. When a Web page is downloaded its content is
lexically analyzed and represented by term vectors. Each term in
such a vector is represented by its tf-idf weight according to VSM
[12]. The priority assigned to a link equals the cosine similarity
(Eq. 2) of the page containing the link and the user query.
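The priority computation for this crawler can be sketched in Python
as follows. The sketch assumes that Eq. 2 is the standard cosine
similarity between weighted term vectors; stemming and stop word
removal are omitted, and the `idf` table and its default weight of
1.0 for unknown terms are assumptions made only for this example.

```python
import math
from collections import Counter


def tf_idf_vector(text, idf):
    """Build a sparse term vector: term frequency times (assumed) idf weight."""
    tf = Counter(text.lower().split())
    return {term: count * idf.get(term, 1.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical idf statistics.
idf = {"focused": 2.0, "crawler": 1.0}
page = tf_idf_vector("focused crawler crawler", idf)
query = tf_idf_vector("crawler", idf)
link_priority = cosine(page, query)  # shared by every link on this page
```

Every link extracted from the page would receive the same
`link_priority` value under this scheme.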
Notice that using inverse document frequency (idf) weights can be
problematic because idf weights need to be updated at every
crawling step. For this reason idf weights can be inaccurate at
the initial steps of crawling, when the number of retrieved pages
is small [25]. Most Best First Crawler implementations use only
term frequency (tf) weights. In this work idf weights are provided
by the IntelliSearch web search engine [41] holding idf statistics
for English terms. At the next step the link with the highest
priority is selected for downloading.
3.2.2 Best First Crawler with anchor text similarity
The second variation of Best First Crawler is the Best First
Crawler using anchor text similarity. The anchor text of a URL is
the clickable text that appears on the link inside a Web page
pointing to that URL. In this work we implemented a variant of the
above Best First Crawler which, instead of page content, uses the
anchor text of URLs as the representation of page content and for
assigning download priorities. Notice that links from the same
page may be assigned different priority values, as opposed to the
first implementation, using page text content for assigning
priorities, where all links in the same page are given the same
priority. As will be shown in the results, selection of anchor
text for assigning priority values improved the general
performance of the crawler, using both harvest ratio and average
similarity criteria (section 4.3).
3.2.3 Best First Crawler with page content and anchor text
The third variation of Best First Crawler combines the previous
two implementations using page content and link
anchor text respectively. Each URL is assigned a priority value
defined as:
$$\mathit{priority}_{link_i} = \frac{\mathit{similarity}(p_i, q) + \mathit{similarity}(a_i, q)}{2} \tag{7}$$
Where $\mathit{priority}_{link_i}$ is the priority value assigned
to link $i$, $\mathit{similarity}(p_i, q)$ is the similarity of
query $q$ and $p_i$ (the content of the page where the link $i$ is
located) and $\mathit{similarity}(a_i, q)$ is the similarity of
anchor text $a_i$ of link $i$ and query $q$.
The idea behind the Best First Crawler with page content only is
that a page relevant to the topic is more likely to point to a
relevant page than to a non relevant one. Thus, the higher the
relevance of the page containing the link is, the higher the
probability that the link will point to a relevant page is.
The second implementation (Best First Crawler using anchor text
similarity) tries to overcome a disadvantage of the Best First
Crawler with page content only: all links within a page have the
same priority regardless of anchor text. Anchor text may be
regarded as a summary of the content of the page that the link
points to. Therefore it is reasonable to use this descriptor for
assigning priorities to pages. However, anchor text isn't always
descriptive of page contents, and by ignoring the page content
useful information may not be used. So the third Best First
Crawler implementation uses both anchor text and page content.
3.3 Semantic Crawlers
Best First crawlers estimate the relevance between the page
content or anchor text and a user query. There may exist
conceptually related terms in both the query and the page (or
anchor text), indicating a relevance to the topic. However, if
these terms aren't lexicographically similar their relevance
will be ignored because VSM computes text similarity as a function
of similarities between identical terms found in the vectors which
are compared. This can be resolved using ontologies or term
taxonomies. In ontologies conceptually similar terms are related
by virtue of IS-A links. All terms conceptually similar to user
query terms are retrieved from the ontology and used for enhancing
the description of the topic (e.g. by adding synonym terms to the
topic keywords) and for computing the similarity between query and
candidate pages. For this, various methods have been proposed
including, among others, the Semantic Similarity Retrieval Model
(SSRM) [14] and Mihalcea et al. [15]. The most important
representatives of this category of methods are implemented within
Best First crawlers, forming what are hereafter called Semantic
Crawlers.
In this work, the WordNet [4] term taxonomy is used as an ontology
for retrieving conceptually similar terms. WordNet was selected
because it provides a vast coverage of the English vocabulary, so
it can be used for focused crawling on almost every topic, making
our implementation the first general purpose Semantic Crawler. The
general design remains similar to that of Classic Focused Crawlers
(Fig. 6) but the priorities assigned to links are evaluated using
methods such as SSRM [14] and Ehrig et al. [13]. Other parts of
the system such as downloading, link and anchor text extraction,
preprocessing and representing texts using Vector Space Model term
vectors, remain the same.
In the following, candidate links for downloading are represented
by their anchor texts. Each candidate link is assigned a priority
value which is computed as the semantic similarity between its
anchor text and the topic [14, 15]. In turn, semantic text
similarity is computed as a function of
the semantic (conceptual) similarities between the terms they
contain. This can be defined in many different ways [11], leading
to the implementation of three semantic crawlers.
3.3.1 Ehrig Crawler
In this implementation Web pages are represented by the anchor
text of the links pointing to them (instead of page content as in
[13]). Anchor texts and the user query are represented by term
vectors using tf weights [13]. Page priorities are computed as:
$$\mathit{priority}_{Ehrig}(a) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathit{sim}(t_i, t_j) \cdot w_{ai} \cdot w_{qj} \tag{8}$$
Where $n$ is the total number of terms in the anchor text and
query, $\mathit{sim}$ is the term semantic similarity computed
using equation 3, and $w_{ai}$, $w_{qj}$ are the tf weights of
terms $t_i$ and $t_j$ in the anchor text $a$ and in the query $q$
respectively. Note that only tf weights are used without
normalizing by vector length (as it is recommended for short topic
and page descriptions), and that WordNet is used instead of topic
specific ontologies as in [13].
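The unnormalized double sum of this crawler can be illustrated with
a short Python sketch. It is not the implementation used in this
work: `toy_term_sim` and its lookup table are hypothetical
placeholders for the WordNet-based term similarity of equation 3,
and terms are obtained by simple whitespace splitting.

```python
from collections import Counter

# Hypothetical term-to-term similarities standing in for a WordNet measure.
TOY_SIM = {frozenset(("car", "automobile")): 0.75}


def toy_term_sim(t1, t2):
    """Return 1.0 for identical terms, a table value or 0.0 otherwise."""
    if t1 == t2:
        return 1.0
    return TOY_SIM.get(frozenset((t1, t2)), 0.0)


def ehrig_priority(anchor_text, query, term_sim):
    """Sum sim(t_i, t_j) * tf_anchor(t_i) * tf_query(t_j) over all term pairs."""
    a = Counter(anchor_text.lower().split())
    q = Counter(query.lower().split())
    # Terms absent from a text have weight 0, so only present terms are summed.
    return sum(term_sim(ti, tj) * wa * wq
               for ti, wa in a.items()
               for tj, wq in q.items())


priority = ehrig_priority("used car car", "automobile", toy_term_sim)
```

Because no length normalization is applied, repeating a related
term in the anchor text raises the priority proportionally.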
3.3.2 SSRM Crawler
SSRM [14] is used for assigning visit priorities to web pages.
Specifically, the priority of a URL (represented by its anchor
text) is defined as follows:
$$\mathit{priority}_{SSRM}(a) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathit{sim}(t_i, t_j) \cdot w_{ai} \cdot w_{qj}}{\sqrt{\sum_{j=1}^{n} w_{aj}^{2}} \, \sqrt{\sum_{j=1}^{n} w_{qj}^{2}}} \tag{9}$$
Where $n$ is the total number of terms in the anchor text and the
query. Li et al. [42] is the term similarity method
u s ed in o u r i mp l em e n t a t i on . Th e U RL w i t h h i gh es t p r io r i t y
v a lu e i s d o wnl o ad ed f i r s t .
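The two priority functions above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: `toy_sim` is a hypothetical stand-in for a WordNet-based term similarity, tokenization is a plain whitespace split, and the SSRM variant assumes that equation 9 divides the double sum by the lengths of the two tf vectors.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency weights for a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def ehrig_priority(anchor_tf, query_tf, term_sim):
    """Unnormalized double sum of term similarity times tf weights."""
    return sum(
        term_sim(k, l) * w_anchor * w_query
        for k, w_anchor in anchor_tf.items()
        for l, w_query in query_tf.items()
    )


def ssrm_priority(anchor_tf, query_tf, term_sim):
    """Same double sum, divided by the two vector lengths (assumed normalization)."""
    norm_anchor = math.sqrt(sum(w * w for w in anchor_tf.values()))
    norm_query = math.sqrt(sum(w * w for w in query_tf.values()))
    if norm_anchor == 0 or norm_query == 0:
        return 0.0
    return ehrig_priority(anchor_tf, query_tf, term_sim) / (norm_anchor * norm_query)


def toy_sim(a, b):
    """Hypothetical stand-in for a WordNet-based term similarity in [0, 1]."""
    if a == b:
        return 1.0
    related = {frozenset(("linux", "unix")): 0.8}
    return related.get(frozenset((a, b)), 0.0)


anchor = tf_vector("Linux kernel")
query = tf_vector("unix linux")
print(ehrig_priority(anchor, query, toy_sim))
print(ssrm_priority(anchor, query, toy_sim))
```

Iterating only over terms that actually occur is equivalent to summing over all n terms, since absent terms have zero weight.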
3.3.3 Semantic Crawler with synonym set expansion
An obvious improvement is to expand text vectors with synonym sets in WordNet and use both anchor text and page content for computing text similarity and assigning priorities:
$$\mathrm{priority}_{\mathrm{synset\ expansion}}(i) = \frac{\mathrm{similarity}(p_i, q') + \mathrm{similarity}(a'_i, q')}{2} \qquad (10)$$
Where priority(i) is the priority value assigned to link i, similarity(p_i, q') is the cosine similarity of the expanded query q' (using WordNet synonym sets) and p_i (the content of the page where link i is located), and similarity(a'_i, q') is the cosine similarity of the expanded anchor text a'_i of link i and the expanded query q'.
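As a rough illustration of equation 10, the sketch below expands the query and the anchor text with synonyms and averages two cosine similarities. The `SYNONYMS` table is a hypothetical stand-in for WordNet synonym sets, and raw term counts are used as weights; both are assumptions made only for this example.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for WordNet synonym sets.
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}


def expand(text, synonyms=SYNONYMS):
    """Term counts of the text plus one count for every synonym of each term."""
    terms = text.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(synonyms.get(term, []))
    return Counter(expanded)


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def synset_priority(page_text, anchor_text, query):
    """Average of page-to-query and expanded-anchor-to-query similarity."""
    q = expand(query)
    p = Counter(page_text.lower().split())  # page content is not expanded here
    a = expand(anchor_text)
    return (cosine(p, q) + cosine(a, q)) / 2


print(synset_priority("car car", "automobile", "car"))
```

Without expansion the anchor text "automobile" would share no term with the query "car"; with expansion the two vectors coincide.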
3.4 Learning Crawlers
The main idea behind Learning Crawlers is that the crawler learns user preferences on the topic from a set of example pages (the training set). Training may involve learning the path leading to the desired content. In most cases the training set consists of relevant and irrelevant pages. Every downloaded page is classified (based on the results of learning) as relevant or irrelevant and is assigned a priority.
The Context Graph method [31] works not only by classifying the crawled pages as relevant or not relevant, but also by learning the distance (in number of routing hops) that may lead from an irrelevant page to a relevant one (Fig. 3). The non relevant pages in the training set were downloaded by recursively using Google's backlink service, starting from relevant pages, in order to compute their distance (level) from the relevant or target pages. During crawling, pages similar to those closer to target pages are given higher priority.
The Hidden Markov Model Crawler [16, 18] extends the previous idea by categorizing pages not only by their distance from a target page but also by using their content, thus estimating a relation between page content and the path leading to relevant pages. Initially, a user browses a sequence of pages, labeling them as relevant or not. As pages are downloaded, the visiting sequence is recorded and a context graph is created without the need of a backlink service as in [31].
Fig. 8 Outline of learning focused crawling (components: User training module, Hidden Markov Model Initialization, Crawling Component)
Figure 8 illustrates the functional components of the HMM crawler implemented:
I. Training component: The first component records the URLs visited by the user and the page view sequence. Then it downloads pages and computes the tf-idf vectors representing their content. Finally pages are clustered using a clustering algorithm. In our implementation K-Means and X-Means [17] were used for clustering.
II. HMM initialization: The second component takes the HMM representation of the user training set (as in Fig. 4) as input and computes the Hidden Markov Model parameters (i.e. the π, A and B matrices). This component is applied during the initialization phase before crawling.
III. Crawling component: It downloads selected pages, extracts content and links, processes page content and assigns the page to a cluster using the K-Nearest Neighbors algorithm [43]. Given the page cluster and the Hidden Markov Model parameters (the π, A and B matrices), the probability that the next page visited will be a target page is computed using the Viterbi algorithm [40]. That probability also represents the visit priority of the link. If two clusters yield almost identical probabilities (i.e. the difference of probabilities is below a predefined threshold ε) then higher priority is assigned to the cluster leading with higher probability to target pages in two steps (also computed by applying the Viterbi algorithm).
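The tie-breaking rule of the crawling component can be sketched as follows. This is only an illustration under stated assumptions: `one_step` and `two_step` are hypothetical lists holding, for each cluster, the probability of reaching a target page in one and in two link steps, and the default value of `eps` is arbitrary.

```python
def pick_cluster(one_step, two_step, eps=0.01):
    """Return the index of the cluster whose links should be expanded first."""
    best = max(one_step)
    # Clusters whose one-step probability is within eps of the best one are tied.
    tied = [i for i, p in enumerate(one_step) if best - p < eps]
    # Among tied clusters, prefer the higher two-step probability.
    return max(tied, key=lambda i: two_step[i])


# Clusters 0 and 1 are nearly tied in one step; cluster 0 wins on two steps.
print(pick_cluster([0.30, 0.305, 0.10], [0.5, 0.2, 0.9], eps=0.01))
```

With a much smaller threshold the clusters are no longer considered tied and the one-step probability alone decides.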
Three Learning crawlers have been implemented: the first is the Hidden Markov Model Crawler (variants proposed in [16] and [18]). The next two variants (Hybrid Crawlers) are proposed in this thesis. They combine the page priority function computed by the Hidden Markov Model Crawler with that of the Best First Crawler in order to evaluate the overall priority value of a Web page.
3.4.1 Hidden Markov Model Crawler
Two variants of this crawler have been implemented:
a) The first Hidden Markov Model implementation uses the K-Means algorithm for clustering as described in [16]. In this work the dimensionality reduction step is omitted. K was set to 5, and the last (fifth) cluster holds the relevant pages. Page priorities (priority_hmm) are computed using the Viterbi [40] algorithm (Fig. 9).
b) The second variant is almost identical to the previous one but instead of K-Means, X-Means [17] is used. Other minor modifications are (a) idf weights are not precomputed (as in the previous variant), but are computed using the training set and (b) the relevant pages don't form a separate cluster but may belong to the same cluster with non relevant pages.
As will be shown in the experiments the two variants demonstrated identical performance. The first variant was used for comparisons with the Hybrid Crawlers proposed in this work.
Figure 9 summarizes the HMM Crawler's priority assignment procedure:
Fig. 9 HMM Crawler priority estimation algorithm
Input: Training set, candidate page (p).
Output: priority value priority_hmm(p) assigned to candidate page p.
1. Cluster training set using K-Means or X-Means algorithm.
2. Compute π, A, B matrices.
3. Classify candidate page p to a cluster T1 using K-Nearest Neighbor algorithm.
4. Compute hidden state probabilities for current step using Viterbi formula:
$$a(T_j, t) = b_{j k_t} \sum_{i=1}^{\mathrm{states}} a(T_i, t-1) \cdot a_{ij}$$
5. Compute hidden state probabilities estimation for next step using formula:
$$a(T_j, t+1) = \sum_{i=1}^{\mathrm{states}} a(T_i, t) \cdot a_{ij}$$
6. Assign priority priority_hmm(p) = a(T_1, t+1) to page p.
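Steps 4 to 6 of the procedure can be sketched in plain Python as below. This is a sketch under assumptions, not the thesis code: matrices are nested lists, `obs_seq` is a hypothetical sequence of observed cluster indices along the path to the candidate page, and `target_state` (the index of the hidden state treated as the target) is a parameter of the example rather than something fixed by the source. The recursion in step 4 sums over predecessor states, which is the form usually called the forward recursion; scores are left unnormalized.

```python
def forward_step(prev, A, B, obs):
    """a(T_j, t) = b_j(obs) * sum_i a(T_i, t-1) * a_ij for every state j."""
    n = len(prev)
    return [B[j][obs] * sum(prev[i] * A[i][j] for i in range(n)) for j in range(n)]


def predict_next(cur, A):
    """a(T_j, t+1) = sum_i a(T_i, t) * a_ij, with no observation for step t+1."""
    n = len(cur)
    return [sum(cur[i] * A[i][j] for i in range(n)) for j in range(n)]


def hmm_priority(pi, A, B, obs_seq, target_state):
    """Score of being in target_state one step after the observed sequence."""
    # Initialization: a(T_j, 1) = pi_j * b_j(o_1).
    cur = [pi[j] * B[j][obs_seq[0]] for j in range(len(pi))]
    for obs in obs_seq[1:]:
        cur = forward_step(cur, A, B, obs)
    return predict_next(cur, A)[target_state]


# Toy two-state model; all numbers are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.4, 0.6]]   # state transition probabilities a_ij
B = [[0.8, 0.2], [0.3, 0.7]]   # emission probabilities b_j(cluster)
print(hmm_priority(pi, A, B, obs_seq=[1], target_state=0))
```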
3.4.2 Hybrid Crawlers
Two variants of hybrid crawlers are implemented:
a) Hybrid Markov Model Crawler: The Hidden Markov Model Crawler suffers from at least two drawbacks: (a) it doesn't assign different priorities to pages belonging to the same cluster and (b) it is very difficult to represent the set of Web pages not relevant to the topic by clusters (it is a very heterogeneous set).
A hybrid approach that also uses the text similarity (computed using VSM) between a page and the centroid of the cluster containing the positive example pages is proposed here for dealing with these two problems. The centroid is computed as the average vector of the pages belonging to the cluster. The text similarity between candidate pages and the centroid may differ even if the pages belong to the same cluster, which addresses the first problem. Similarity with the centroid of relevant pages is not affected by the way non relevant pages are represented, which addresses the second problem as well.
The Hybrid Markov Model Crawler differs from the HMM Crawler in the way priorities are assigned to candidate pages. It computes a priority score for a page using the Hidden Markov Model (priority_hmm) and also computes the text similarity of page content with the centroid of the cluster containing the relevant pages from the user training set using equation 2. Finally, the priority of page p_i is computed as follows:
$$\mathrm{priority}_{\mathrm{hybrid}}(p_i) = \frac{\mathrm{similarity}(p_i, c_r) + \mathrm{priority}_{\mathrm{hmm}}(p_i)}{2} \qquad (11)$$
Where c_r is the centroid of relevant pages in the training set, similarity(p_i, c_r) is the cosine similarity of page content p_i with the centroid c_r of relevant pages, priority_hmm(p_i) is the priority assigned to links in page p_i using the Hidden Markov Model and priority_hybrid(p_i) is the priority assigned to links in page p_i by the Hybrid Crawler.
b) Hybrid HMM Crawler with page content and anchor text: An obvious extension to method (a) is to use both anchor and page text in the computation of page priorities. This leads to the following equation:
$$\mathrm{priority}_{\mathrm{hybrid,\,anchor}}(a_i) = \frac{\mathrm{priority}_{\mathrm{hmm}}(p_i) + \dfrac{\mathrm{similarity}(p_i, c_r) + \mathrm{similarity}(a_i, c_r)}{2}}{2} \qquad (12)$$
Where c_r is the centroid of relevant pages in the training set, similarity(p_i, c_r) is the cosine similarity of page content p_i with the centroid c_r of relevant pages, similarity(a_i, c_r) is the cosine similarity of link anchor text a_i with the centroid c_r of relevant pages, priority_hmm(p_i) is the priority value assigned to links in page p_i using the Hidden Markov Model and priority_hybrid,anchor(a_i) is the priority assigned to the link with anchor text a_i by the Hybrid HMM Crawler with page content and anchor text.
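The two hybrid priority functions can be sketched as below, assuming sparse term-weight vectors stored as dictionaries and an HMM priority that has already been computed elsewhere. Function and variable names are illustrative, not taken from the thesis implementation.

```python
import math


def centroid(vectors):
    """Average of sparse term-weight vectors (dicts mapping term to weight)."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_priority(priority_hmm, page_vec, c_r):
    """Average of page-to-centroid similarity and the HMM priority."""
    return (cosine(page_vec, c_r) + priority_hmm) / 2


def hybrid_anchor_priority(priority_hmm, page_vec, anchor_vec, c_r):
    """Average of the HMM priority and the mean page/anchor similarity."""
    text_sim = (cosine(page_vec, c_r) + cosine(anchor_vec, c_r)) / 2
    return (priority_hmm + text_sim) / 2


# Toy training set of relevant pages and one candidate link.
c_r = centroid([{"linux": 1.0}, {"linux": 1.0, "kernel": 2.0}])
page = {"linux": 1.0}
anchor = {"linux": 1.0, "kernel": 1.0}
print(hybrid_priority(0.5, page, c_r))
print(hybrid_anchor_priority(0.5, page, anchor, c_r))
```

Because the text similarity term differs from page to page, two candidates in the same cluster (and hence with the same HMM priority) can still receive different hybrid priorities.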
The priority function of equation 12 improves the performance of the hybrid Crawler. As will be shown in the experiments, when anchor text is used the crawler is even more focused on the topic. Figure 10 illustrates the operation of hybrid crawlers:
Fig. 10 Hybrid crawlers operation. (Diagram labels: Cluster 0 with its centroid c_r, Cluster 1, Cluster 2; L0, L1, L2 and L3 pages; candidate pages P1 and P2.)
In Figure 10 two pages (blue circles) are candidates for downloading. The HMM Crawler will assign higher priority to candidate page p1 belonging to cluster 1, since this cluster leads with higher probability to target pages (cluster 0) in two link steps (since the probability of leading to cluster 0 in one step is identical for clusters 1 and 2). Instead, a Hybrid crawler will select for expansion the page p2 belonging to cluster 2 because of its proximity (similarity) to the centroid of cluster 0 (the cluster containing the relevant pages from the training set).
3.5 Summary
Classic crawlers, including the well known Breadth-First crawler and variations of the Best-First Crawler presented in this chapter, have been implemented in the current thesis.
Semantic crawlers, including a variation of the Ehrig crawler using WordNet and the novel SSRM and Synonym set expansion crawlers, have been implemented and compared with state of the art Best First Crawlers. Finally, a set of state of the art HMM crawlers including [16, 18] and the hybrid crawlers proposed here are also implemented and their performance is evaluated.
Chapter 4. Experimental Results
4.1 Introduction
The following set of experiments is designed to:
a) Provide a critical evaluation of the various types of crawlers examined in this work including classic (Breadth-First), topic driven (Best-First and its variants including Semantic crawlers), Learning and Hybrid crawlers.
b) Demonstrate the superiority of the new Hybrid crawler proposed in this work over state of the art HMM learning crawlers such as [16, 18].
Six different topics were used (“linux”, “asthma”, “robotics”, “dengue fever”, “java programming” and “first aid”) and the ability of the crawlers to download pages on the above topics was measured. Their performance was computed using two well established measures referred to as harvest ratio and average similarity. Each crawler downloaded 1000 pages and its average performance (over all topics) was computed using both criteria. Pages judged relevant were provided by the user, who manually inspected results obtained by the Google search engine on each topic. These results were used as ground truth and compared with results obtained by the crawlers. The more similar (to ground truth) the results of a crawler are, the more successful the crawler is (the higher the probability that the crawler retrieves results similar to the topic). Page to topic relevance is computed by VSM in all cases.
4.2 Performance measures
Two different evaluation criteria were used:
a) Harvest ratio: For every page its cosine similarity with all pages judged as relevant by the user is computed and the maximum of these cosine similarities is taken. If the maximum similarity is greater than a predefined threshold (0.75 in this work) then the page is marked as relevant (otherwise the page is marked as irrelevant). The harvest ratio is defined as the percentage of downloaded pages with similarity greater than the threshold (in this thesis the number of relevant pages was used instead of the fraction of them among the total number of downloaded pages).
b) Average similarity: The maximum similarity of each downloaded page with all pages marked as relevant is computed. The average similarity is defined as the average value of these similarities for all downloaded pages.
The first criterion is more selective than the second. Harvest ratio can be adjusted (by using a higher threshold) to measure the ability of the crawler to download pages highly relevant to the topic. An application called “evaluator” was developed for automating the evaluation process. It receives as input the positive pages set (50 relevant pages on every topic in our experiment) and the 1000 evaluated pages downloaded by the crawler, and computes the performance of the crawler at hand with both criteria.
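The two criteria can be sketched as follows. This is not the "evaluator" application itself, only a minimal illustration that assumes pages are already represented as sparse term-weight vectors (dictionaries); the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def evaluate(crawled, relevant, threshold=0.75):
    """Return (relevant page count, harvest ratio, average similarity)."""
    if not crawled or not relevant:
        return 0, 0.0, 0.0
    # Maximum similarity of each downloaded page with the relevant page set.
    max_sims = [max(cosine(page, ref) for ref in relevant) for page in crawled]
    relevant_count = sum(1 for s in max_sims if s > threshold)
    harvest_ratio = relevant_count / len(max_sims)
    average_similarity = sum(max_sims) / len(max_sims)
    return relevant_count, harvest_ratio, average_similarity


relevant_pages = [{"a": 1.0}]
crawled_pages = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
print(evaluate(crawled_pages, relevant_pages))
```

The relevant page count is returned alongside the ratio because the plots in this chapter report the number of relevant pages rather than the fraction.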
4.3 Experiment setup
The following crawlers are compared:
1) Non Focused Crawlers:
a) Breadth First Crawler
2) Classic Focused Crawlers:
b) Best First Crawler with page content
c) Best First Crawler with anchor text
d) Best First Crawler with page content & anchor text
3) Semantic Crawlers:
e) Semantic Crawler using the Ehrig et al. [13] method for text similarity estimation.
f) Semantic Crawler using the SSRM [14] method for text similarity estimation.
g) Semantic Crawler with Synset Expansion.
4) Learning Crawlers:
h) Hidden Markov Model Crawler
i) Hybrid Hidden Markov Model Crawler
j) Hybrid Hidden Markov Model Crawler with page content & anchor text.
All Crawlers were evaluated using the following topics and seed pages:
Query               Seed
Linux               http://dir.yahoo.com/Computers_and_Internet/Software/Operating_Systems/UNIX/Linux
Asthma              http://dir.yahoo.com/Health/Diseases_and_Conditions/Asthma/
Robotics            http://dir.yahoo.com/Science/Computer_Science/
Dengue Fever        http://health.yahoo.com/
Java programming    http://dir.yahoo.com/Computers_and_Internet/
First Aid           http://dir.yahoo.com/Health/
Fig. 11 Experiment setup
1000 pages were downloaded for each crawler and for each topic. Notice that in four out of the six topics the seed page doesn't directly link to target pages.
The experiments in this section are organized by crawler type, showing a comparison between various implementations of crawlers of the same type. Specifically the experiments are organized as follows:
a) Classic Focused Crawlers Experiment
Crawlers (a)-(d) were evaluated using the six topics of Fig. 11.
b) Semantic Crawlers Experiments
Crawlers (e)-(g), and (c)-(d) for comparison, were evaluated using the six topics of Fig. 11.
c) Learning Crawlers Experiment
Crawlers (h)-(j) were evaluated using four topics (“Robotics”, “Dengue Fever”, “Java Programming” and “First Aid”).
In the experiments below each method is represented by a plot showing the number of relevant pages on the Y axis as a function of the total number of pages retrieved. Each point in a plot corresponds to the harvest ratio or the average similarity measured, respectively.
Notice that Learning Crawlers have different input (the training set) than the Classic and Semantic focused Crawlers (that have the user query as input), so direct comparisons between the performance of learning and other categories of crawlers are not really meaningful.
4.4 Classic Focused Crawlers
Fig. 12 Harvest ratio for classic crawlers
The comparison in Fig. 12 indicates the poor performance of the Breadth First Crawler, as expected for a non focused crawler. The fact that the Best First Crawler using anchor text only outperforms the crawler using only page content indicates the value of anchor text for computing page to topic relevance.
The crawler combining page and anchor text demonstrated superior performance. This result indicates that Web content relevance is not captured by page content or anchor text alone. Instead, the combination of page content and anchor text forms a more reliable page description.
[Fig. 12 plot. X axis: crawled pages (50 to 1000); Y axis: relevant pages (0 to 350). Series: Breadth First, Best First-page content, Best First-anchor text, Best First-content & anchor text.]
Fig. 13 Average similarity for classic focused crawlers
Fig. 13 confirms the results of the previous comparison. Overall, a best first crawler combining page and anchor text achieves superior performance over all its competitors with both criteria.
4.5 Semantic Crawlers
The second experiment measures the performance of semantic crawlers using the six topics of Fig. 11 (as in the previous experiment).
[Fig. 13 plot. X axis: crawled pages (50 to 1000); Y axis: average similarity (0% to 70%). Series: Breadth First, Best First-page content, Best First-anchor text, Best First-content & anchor text.]
Fig. 14: Harvest Ratio for Semantic Crawlers.
Fig. 14 illustrates only marginal performance improvements of semantic crawlers over best first crawlers. It is conjectured that the poor performance of semantic crawlers should not be regarded as a failure of semantic crawlers but rather as a failure of WordNet to provide terms conceptually similar to the topic. WordNet is a general taxonomy for English terms and not all linked terms are actually very similar, implying that the results can be improved by using topic specific ontologies. Such topic specific ontologies on several diverse topics were not available to us for these experiments.
[Fig. 14 plot. X axis: crawled pages (50 to 1000); Y axis: relevant pages (0 to 350). Series: Semantic Crawler Ehrig method, Semantic Crawler SSRM method, Best First-anchor text, Best First-content & anchor text, Semantic Crawler with synset expansion.]
Fi g 1 5 : A v e ra ge S im i l a r i t y f o r S em an t i c C r aw le r s
R e su l t s wi th a v er age s i mi l a r i t y a c t u a l l y c o n f i rm e d t h e
r e s u l t s o f F i g . 14 . H e r e s em an t i c c ra w l e r s i mp ro v ed aga i n
t h e r es u l t s o f b es t f i r s t c r a wl e r s b u t o n l y m a r g i n a l l y ,
i nd i c a t i n g t h a t av e r a ge s imi l a r i t y ( a s l e s s s t r i c t c r i t e r i on ) i s
m o re t o l e r a n t t o r e l ax e d in t e r p r e t a t i o ns o f c on c ep tu a l
s imi l a r i t y a s p r o v id e d b y W o r dN et a nd t e r m s i mi l a r i t y
m e as u r es ( su c h a s Li e t . a l [ 4 2 ] ) .
4 .6 Learning Craw lers
T h e r esu l t s b e lo w a r e t a k en on fou r t op i cs ( “ r ob o t i c s ” ,
“ d e n gu e f e ve r ” , “ j a v a p ro gr a mmi n g” a nd “ f i r s t a id ” ) a n d
m e as u r ed o n th e f i r s t 10 00 w e b pa ge s r e t u rn ed b y e a c h
c r a w le r on e a c h t op i c . O n l y Le a r n i n g c r a wl e r s w e r e
e v a lu a t e d i n t h i s ex p e r im e n t : T wo v ar i an t s o f HM M C r aw l e r s
w e r e t e s t ed c o r r e spo n d in g t o d i f f e re n t im p l em e n t a t i o n o f t h e
c l us t e r i n g c omp on e n t (w i t h K -M e a ns an d X -M e a ns
respectively). The results indicate that the K-Means (using K=5
as suggested in [16]) and X-Means Hidden Markov Model
Crawlers have identical performance. Both crawlers
demonstrated poor performance (Figs. 16-17), which can be
attributed to several reasons. Neither variant assigns
different priorities to pages in the same cluster, or
to different links within the same page. Both variants must also be
provided with a training set very similar in content and link
structure to the part of the Web that will be crawled
(something not always achievable). Because the two HMM
Crawlers (using X-Means and K-Means) have identical
performance, the first variant (HMM Crawler using K-Means)
was chosen for comparison with the other Learning Crawlers.
In Fig. 16 the performance of the HMM crawler is
compared with the performance of the new Hybrid crawlers
(using a combination of page content and anchor text)
proposed in this work. The first (Hybrid HMM using page
content) prioritizes links using equation 11 (the similarity of the
page containing the links with the centroid of the relevant
pages in the training set). The second
implementation (Hybrid HMM Crawler with anchor text) additionally
combines the similarity of the centroid with the anchor text
of links pointing to candidate pages for priority assignment,
as suggested by equation 12.
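
The two priority schemes described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the whitespace tokenizer, the raw term-frequency weighting, the function names and the equal-weight linear mix used for the anchor-text variant are all assumptions made for illustration, and the exact forms of equations 11 and 12 are those given where the equations are defined.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lower-cased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dict-like)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of the term vectors of the relevant training pages."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def priority_page_content(parent_text, topic_centroid):
    """Every link on a page inherits the page's similarity to the centroid."""
    return cosine(term_vector(parent_text), topic_centroid)


def priority_with_anchor(parent_text, anchor_text, topic_centroid, weight=0.5):
    """Mix page-level and anchor-level similarity.

    ASSUMPTION: the linear mix and weight=0.5 are illustrative only.
    """
    page_sim = cosine(term_vector(parent_text), topic_centroid)
    anchor_sim = cosine(term_vector(anchor_text), topic_centroid)
    return weight * page_sim + (1.0 - weight) * anchor_sim


# Toy training set of relevant pages (hypothetical data).
relevant = [term_vector("robot arm control"), term_vector("robot motion planning")]
topic = centroid(relevant)
print(priority_page_content("robot control systems", topic))
print(priority_with_anchor("robot control systems", "robot planning", topic))
```

In this sketch the anchor-text variant lets two links on the same page receive different priorities, which the page-content variant cannot do.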
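Fig. 16: Harvest Ratio for HMM & Hybrid Crawlers. Vertical axis
labelled "relative pages" (0 to 50); horizontal axis: crawled pages
(50 to 1000). Series: HMM Crawler, Hybrid HMM Crawler with page
content, Hybrid HMM with page content & anchor text.
Fig. 17: Average Cosine Similarity for HMM & Hybrid Crawlers.
Vertical axis: average similarity (0% to 50%); horizontal axis:
crawled pages (50 to 1000). Series as in Fig. 16.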
The Hybrid crawlers outperform the Hidden Markov Model crawler
on both criteria. Using the centroid of the positive examples as
a query clearly increases performance because it overcomes
the problems of HMM crawlers. As Fig. 16 and Fig. 17
indicate, the results obtained by the Hybrid crawlers are
promising and may lead to further research in this direction.
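
As a rough illustration of how curves such as those in Figs. 16 and 17 can be produced, the following Python sketch computes the harvest ratio (the fraction of crawled pages judged relevant) and the average similarity over the first n crawled pages, sampled every 50 pages. The relevance threshold, the function names and the input format (one similarity score per page, in crawl order) are assumptions for illustration and are not taken from the experiments above.

```python
def harvest_ratio(relevance_flags):
    """Fraction of crawled pages that were judged relevant."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


def average_similarity(similarities):
    """Mean page-to-topic similarity over the crawled pages."""
    if not similarities:
        return 0.0
    return sum(similarities) / len(similarities)


def metric_curve(similarities, threshold, step=50):
    """Return (n, harvest ratio, average similarity) for n = step, 2*step, ...

    similarities: page-to-topic similarity of each page, in crawl order.
    threshold:    ASSUMPTION, a page counts as relevant when its
                  similarity is at least this value.
    """
    points = []
    for n in range(step, len(similarities) + 1, step):
        prefix = similarities[:n]
        flags = [s >= threshold for s in prefix]
        points.append((n, harvest_ratio(flags), average_similarity(prefix)))
    return points


# Hypothetical crawl: 50 on-topic pages followed by 50 off-topic pages.
scores = [0.8] * 50 + [0.2] * 50
for n, ratio, avg in metric_curve(scores, threshold=0.5):
    print(n, ratio, avg)
```

Because average similarity does not depend on a relevance cut-off, it is the less strict of the two criteria.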
4.7 Discussion
The Classic Focused Crawler results show that combining page
content and anchor text (Best First Crawler - page content
and anchor text) yields the best results. Together, page content
and anchor text form a representative content descriptor for
web pages. Semantic Crawlers, when combined with a
general purpose ontology, performed poorly compared to
Best First crawlers. By restricting semantic relations to
synonym sets (Semantic Crawler - Synset expand method) the
performance was improved marginally. Synonyms, although
not lexically similar, succeed in identifying pages with
content similar to the topic, indicating that it is possible to
expect further performance improvements by using topic
specific ontologies rich in terms very similar to the terms of
the topic. At this point, ontologies of this type are not
available to us. Both Hybrid Crawlers achieve better
performance than the Hidden Markov Model Crawler. The
results obtained indicate that positive examples are more
important than negative ones during training in an
environment such as the World Wide Web. Using only
positive examples, the performance of learning crawlers is
expected to improve.
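
To make the first observation concrete, the sketch below shows a best-first frontier in Python in which each outgoing link is scored from both the content of the page it was found on and its own anchor text. It is a simplified illustration only: the in-memory `pages` dictionary stands in for real page fetching, `overlap_score` is a stand-in for the similarity measure, and the equal weighting of the two scores is an assumption.

```python
import heapq


def overlap_score(text, topic_terms):
    """Fraction of topic terms occurring in the text (stand-in similarity)."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def best_first_crawl(seed, pages, topic_terms, limit):
    """Visit up to `limit` pages, always expanding the best-scored link first.

    pages: hypothetical in-memory web, url -> (text, [(anchor_text, target_url)]).
    """
    frontier = [(-1.0, seed)]  # max-heap emulated with negated scores
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        visited.append(url)
        page_score = overlap_score(text, topic_terms)
        for anchor, target in links:
            if target in seen or target not in pages:
                continue  # already queued, or not fetchable in this toy web
            seen.add(target)
            # ASSUMPTION: equal weights for page content and anchor text.
            score = 0.5 * page_score + 0.5 * overlap_score(anchor, topic_terms)
            heapq.heappush(frontier, (-score, target))
    return visited


toy_web = {
    "seed": ("java programming tutorials",
             [("holiday photos", "b"), ("java programming guide", "a")]),
    "a": ("java programming language reference", []),
    "b": ("beach holiday photos", []),
}
print(best_first_crawl("seed", toy_web, {"java", "programming"}, limit=3))
```

With page content alone, both links on the seed page would receive the same priority; the anchor text is what separates them in this toy example.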
Chapter 5. Conclusions and future work
In the present thesis, several variants of focused crawlers
were implemented and evaluated using common evaluation
criteria. First, the Breadth First Crawler and variants of the
Best First Crawler using page content, anchor text or both
were compared. Then semantic relations were used in the
implementation of three Semantic Crawlers, which were
compared with classic focused crawlers (variations of the best
first crawler). Finally, based on the Hidden Markov Model
learning crawler, two novel hybrid crawlers combining
elements from learning and classic focused crawlers were
implemented and evaluated.
The experimental results indicate that the
implementation of focused crawlers is a process where minor
changes in the crawler design have a great effect on
performance. The combination of anchor text and page
content yields a great performance improvement in the case of
classic, semantic and learning focused crawlers. The addition
of semantic relations did not improve performance, with the
exception of expansion with synonyms, where semantic
relations are restricted to synonym terms. Performance is
expected to improve by using application specific ontologies
(related to the topic) instead of general purpose ontologies
such as WordNet.
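
The effect of synonym expansion can be shown with a small Python sketch. The hand-made synonym table below is hypothetical and merely stands in for the synonym sets of an ontology such as WordNet; the function names and the simple term-overlap score are likewise assumptions for illustration.

```python
def expand_with_synonyms(topic_terms, synsets):
    """Add every term that shares a synonym set with a topic term."""
    expanded = set(topic_terms)
    for term in topic_terms:
        for synset in synsets:
            if term in synset:
                expanded |= synset
    return expanded


def match_count(text, terms):
    """Number of distinct query terms that occur in the text."""
    return len(set(text.lower().split()) & terms)


# Hypothetical synonym sets (a stand-in for ontology synsets).
toy_synsets = [
    {"car", "automobile", "auto"},
    {"repair", "fix", "mend"},
]
topic = {"car", "repair"}
expanded = expand_with_synonyms(topic, toy_synsets)

page = "how to fix an automobile"
print(match_count(page, topic))     # no lexical overlap with the topic terms
print(match_count(page, expanded))  # matched through synonyms
```

Restricting the expansion to synonyms keeps the added terms very close in meaning to the topic terms, whereas looser relations in a general purpose ontology add terms that are only loosely related to the topic.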
Learning Crawlers take as input user selected pages that are not
described by a simple query. Not only do Learning
crawlers receive different input from other focused
crawlers, but they are also intended to perform a very
difficult task: they attempt to learn web crawling patterns
leading to relevant pages, possibly through other non-relevant
pages, thus increasing the probability of failure (since web
structures cannot always be modeled by such link patterns).
However, the idea looks promising overall and may lead to
even more successful implementations of learning crawlers in
the future. The present work can be regarded as a
contribution in that direction.
Another direction for future work would be to carry out more
elaborate tests with semantic crawlers, making use of topic
specific ontologies (e.g. medical ontologies for applications
related to health care). The positive results obtained by the
hybrid crawlers indicate that the relevance of a candidate
page to the set of positive examples alone is an effective
basis for assigning priorities to candidate pages. Using only
positive examples (instead of positive and negative) might
improve the performance of learning crawlers in terms of
speed and accuracy.
References:
[1] “Web Search for a Planet: The Google Cluster Architecture” LA Barroso, J Dean, U Holzle - Micro, IEEE, 2003.
[2] “Very Large Scale Retrieval and Web Search” D Hawking, N Craswell, In E. Voorhees and D. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[3] “The Indexable Web is More than 11.5 Billion Pages” A Gulli, A Signorini - International World Wide Web Conference, 2005.
[4] http://wordnet.princeton.edu
[5] http://www.google.com
[6] “The Anatomy of a Large-Scale Hypertextual Web Search Engine” S Brin, L Page - WWW7 / Computer Networks, 1998.
[7] http://www.yahoo.com
[8] http://www.msn.com
[9] http://www.ask.com
[10] http://larbin.sourceforge.net/index-eng.html
[11] “Information Retrieval by Semantic Similarity” Angelos Hliaoutakis, Giannis Varelas, Epimenidis Voutsakis, Euripides G.M. Petrakis, Evangelos Milios, International Journal on Semantic Web and Information Systems (IJSWIS), Special Issue of Multimedia Semantics, Vol. 3, No. 3, July/September, 2006, pp. 55-73.
[12] “A Vector Space Model for Automatic Indexing” G Salton, A Wong, CS Yang - Communications of the ACM, 1975.
[13] “Ontology-Focused Crawling of Documents and Relational Metadata” Alexander Maedche, Marc Ehrig, Siegfried Handschuh, Raphael Volz, and Ljiljana Stojanovic. Proceedings of the Eleventh International World Wide Web Conference WWW-2002.
[14] “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web” Varelas G., Voutsakis E., Raftopoulou P., Petrakis E., Milios E. In: 7th ACM International Workshop on Web Information and Data Management (WIDM 2005), Bremen, Germany (2005).
[15] “Measuring the Semantic Similarity of Texts.” Corley, C., Mihalcea, R., Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. Ann Arbor, June 2005.
[16] “Focused Crawling by Learning HMM from user’s Topic-Specific Browsing.” H. Liu, E. Milios, and J. Janssen. In Proceedings of 2004 IEEE/WIC/ACM International Conference on Web Intelligence, pages 732-735, Beijing, China, September 20-24, 2004.
[17] “X-means: Extending K-means with Efficient Estimation of the Number of Clusters.” D. Pelleg and A. Moore. In Proceedings of the 17th International Conf. on Machine Learning, pages 727-734. Morgan Kaufmann, San Francisco, CA, 2000.
[18] “Using HMM to Learn User Browsing Patterns for Focused Web Crawling” H Liu, J Janssen, E Milios - Data & Knowledge Engineering, 2006.
[19] “Breadth-First Search Crawling Yields High-Quality Pages.” M. Najork and J. L. Wiener. In Proc. 10th International World Wide Web Conference, 2001.
[20] “Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data.” Cho, J. 2001. Ph.D. thesis, Stanford University.
[21] “Searching the Web.” Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Transactions on Internet Technology, 2001.
[22] “Efficient Crawling Through URL Ordering.” Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
[23] “Information Retrieval in Distributed Hypertexts” P. De Bra, G.-J. Houben, Y. Kornatzky, and R. Post, in: Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[24] “The Shark-Search Algorithm - An Application: Tailored Web Site Mapping” Hersovici, M., Jacovi, M., Maarek, Y.S., Pelleg, D., Shtalhaim, M. and Ur, S. (1998), Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 317-26.
[25] “Evaluating Topic-Driven Web Crawlers” F. Menczer, G. Pant, M. Ruiz, P. Srinivasan, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, ACM Press, New York, NY, 2001.
[26] “Topical Web Crawlers: Evaluating Adaptive Algorithms” F Menczer, G Pant, P Srinivasan - ACM Transactions on Internet Technology (TOIT), 2004.
[27] “A General Evaluation Framework for Topical Crawlers” P Srinivasan, F Menczer, G Pant - Information Retrieval, 2005 - Springer.
[28] “Intelligent Crawling on the World Wide Web with Arbitrary Predicates.” C. Aggarwal, F. Al-Garawi, and P. Yu. In Proc. 10th Intl. World Wide Web Conference, pages 96-105, 2001.
[29] “A Survey of Focused Web Crawling Algorithms.” Novak, B. Proceedings of the 7th International multi-conference Information Society IS-2004, Ljubljana: Institut “Jožef Stefan”, 2004.
[30] “Focused Crawling: A New Approach for Topic Specific Resource Discovery” S Chakrabarti, M van den Berg, B Dom - WWW Conference, 1999.
[31] “Focused Crawling Using Context Graphs.” M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. In Proc. 26th International Conference on Very Large Databases (VLDB 2000), pages 527-534, Cairo, Egypt, 2000.
[32] “Accelerated Focused Crawling through Online Relevance Feedback” Chakrabarti, S., Punera, K., and Subramanyam, M., In Proceedings of the eleventh international conference on World Wide Web (WWW2002), 2002, pp. 148-159.
[33] “Learning to Crawl: Comparing Classification Schemes” G Pant, P Srinivasan - ACM Transactions on Information Systems (TOIS), 2005.
[34] “Focused Crawling by Exploiting Anchor Text Using Decision Tree” Li Jun, Furuse K, Yamaguchi K.C, Proceedings of the 14th International World Wide Web Conference. 2005: 1190-1191.
[35] “A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections” Y Chen, PhD thesis - 2007.
[36] http://java.sun.com/
[37] http://www.eclipse.org/
[38] “A Tutorial on Support Vector Machines for Pattern Recognition” CJC Burges - Data Mining and Knowledge Discovery, 1998.
[39] http://www.dmoz.org/
[40] “The Viterbi Algorithm” GD Forney - Proceedings of the IEEE, 1973.
[41] “IntelliSearch: Intelligent Search for Images and Text on the Web” E Voutsakis, EGM Petrakis, E Milios. 3rd Intern. Conference on Image Analysis and Recognition (ICIAR 2006), pp. 697-708, Sept. 18-20, 2006, Povoa de Varzim, Portugal.
[42] “An Approach for Measuring Semantic Similarity between words using Multiple Information Sources” Y Li, Z Bandar - IEEE Transactions on Knowledge and Data Engineering, 2003.
[43] “Nearest Neighbor Pattern Classification” T Cover, P Hart - Information Theory, IEEE Transactions on, 1967.
[44] “An Introduction to Hidden Markov Models” L Rabiner, B Juang - ASSP Magazine, 1986.
[45] “Mercator: A Scalable, Extensible Web Crawler” A Heydon, M Najork - World Wide Web, 1999 - Springer.
[46] “Mining the Link Structure of the World Wide Web” Soumen Chakrabarti, Byron E. Dom, David Gibson, Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. IEEE Computer, 32(8):60-67, 1999.
[47] “Data Clustering: a Review” AK Jain, MN Murty, PJ Flynn - ACM Computing Surveys (CSUR), 1999.
[48] “An Algorithm for Suffix Stripping” Porter, M.F. (1980) Program, 14(3): 130-137.