[VNSGU JOURNAL OF SCIENCE AND TECHNOLOGY] Vol. 5, No. 1, July 2016, 119-133, ISSN: 0975-5446
Comparative study of link analysis web page ranking algorithms based on weights of links to extract pertinent links
PATEL Hemangini S., Bhagwan Mahavir College of Computer Applications (BCA), Bharthana, Vesu.
DESAI Apurva A., Department of Computer Science, Veer Narmad South Gujarat University, Surat.
Abstract
Ranking is a key factor in information retrieval. The web is a huge set of pages that provides an almost unlimited source of information and contains myriad hyperlinks. A search engine database includes a huge quantity of web pages, so the ranking of web pages is crucial to satisfying user requirements. Hypertext Induced Topic Search (HITS) computes hub and authority scores, and the PageRank algorithm computes the rank of a particular web page. In this paper, the significance of web pages is compared by utilizing ranking algorithms. A modified weighted HITS (WHITS) algorithm is proposed, and its performance is compared with existing algorithms such as HITS, PageRank, Norm(p) and sNorm(p), to help the search engine extract pertinent and valuable links. The algorithms are tested over various short-term queries.
Keywords: Information Retrieval, HITS, weighted HITS (WHITS), PageRank, SALSA, Norm(p), sNorm(p).
1. Introduction
The prime objective of information retrieval is to discover all the pertinent documents for a user query within a set of documents. With the advent of the web, novel sources of information became obtainable [1]. Link analysis ranking algorithms rank the qualitative and pertinent documents; their main task is to recognize the high-ranking authority documents within the bulk of the pages. This research has centred on presenting an improved link-based ranking algorithm, weighted HITS (WHITS), to increase the computational effectiveness of existing ones (primarily HITS). HITS allocates an equivalent weight to every link, and this assumption results in topic drift.
For web search, the HITS algorithm is said to be the most significant; to some extent, one may expect HITS to be more useful than other link-based approaches because it is query dependent: it attempts to determine the importance of pages with respect to a specified query [2]. PageRank is a query-independent algorithm used by the Google search engine, based on the connectivity structure of the web pages. The PageRank importance of a page is weighted by every link to the page, proportionally to the quality of the page holding the link; i.e., the PageRank importance of a page is spread uniformly over all the pages it points to [3].
Hence, the PageRank of a web page is computed as the summation of the PageRanks of all pages linking towards it (its in-links), each divided by the number of links on that linking page (its out-links).
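As a rough sketch of the summation just described (the function names, damping factor and toy graph are our own illustrative assumptions, not taken from the paper), the computation can be written as a simple power iteration:

```python
# Minimal PageRank sketch: a page's rank is the sum, over the pages linking
# to it, of each linker's rank divided by that linker's out-degree.
# Pages with no out-links are ignored here for simplicity.

def page_rank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # contribution from every page q that links to p
            incoming = sum(rank[q] / len(graph[q])
                           for q in pages if p in graph[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank
```

With `damping` set to 1 this reduces to the plain summation above; the damping term is the standard correction commonly added to aid convergence.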
HITS initially constructs a neighbourhood graph for the query: a content-based web search engine is used to collect the top 200 corresponding web results, and the graph holds all the pages that the top 200 web pages link to, together with the web pages that link to those top 200 pages. Then, through an iterative computation, authority and hub values are calculated for each web page p. The authority score of web page p is the summation of the hub scores of all web pages that point to p, and the hub score of page p is the summation of the authority scores of all web pages that p points to. Iteration proceeds on the neighbourhood graph until the values converge.
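The iterative hub/authority computation described above can be sketched as follows (the function names, the normalization step and the toy graph are illustrative assumptions; scores are normalized each round so that the iteration converges):

```python
import math

# Minimal HITS sketch: authority(p) = sum of hub scores of pages linking
# to p; hub(p) = sum of authority scores of pages p links to.

def hits(graph, iterations=20):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update: sum hub scores of in-linking pages
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # hub update: sum authority scores of out-linked pages
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {p: h / norm for p, h in hub.items()}
    return auth, hub
```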
The SALSA algorithm combines ideas from both PageRank and HITS. It performs a random walk on the bipartite hub-authority graph, alternating between hubs and authorities. When at a node on the authority side of the bipartite graph, the algorithm chooses one of its in-links uniformly at random and moves to a node on the hub side; when at a node on the hub side, it chooses one of its out-links uniformly at random and moves to an authority. The Norm(p) and sNorm(p) algorithms belong to the family of additive online learning algorithms [4, 5, 6]. A norm is a function that assigns a positive length to every vector in a given vector space. These algorithms work on the principle of treating the authority weights specially, which can be implemented using a norm or an operator; this exploits the fact that small authority weights should contribute less to the hub weight. The simplest approach is to scale the weights, and the most common way to choose the scaling factor is from the authority weight itself. Since higher authority weights are more significant in the calculation of the hub weight, the hub weight of a given node i is set to the p-norm of the vector of authority weights of all the nodes that i points to. Norm(p) scales only the hub weight, but here the interest is in finding authorities with higher weight; sNorm(p) provides this through a symmetric setting of the authority and hub weights.
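A minimal sketch of the p-norm updates just described, with assumed names and data: Norm(p) sets the hub weight of node i to the p-norm of the authority weights of the nodes i points to, and sNorm(p) applies the symmetric update to the authority weights as well.

```python
# Norm(p)-style hub update: larger authority weights dominate the hub
# score as p grows, so small authority weights contribute less.

def pnorm_hub_update(graph, auth, p=2):
    """graph: node -> list of nodes it links to; auth: node -> authority weight."""
    return {i: sum(auth[j] ** p for j in graph[i]) ** (1.0 / p) for i in graph}

# sNorm(p)-style symmetric update: the authority weight of node j is the
# p-norm of the hub weights of the nodes that point to j.

def snorm_auth_update(graph, hub, p=2):
    return {j: sum(hub[i] ** p for i in graph if j in graph[i]) ** (1.0 / p)
            for j in graph}
```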
In the present study, the relative effectiveness of the ranking algorithms HITS, PageRank [2], SALSA [3], Norm(p) and sNorm(p) [4] is compared with the proposed weighted HITS (WHITS) algorithm, and the algorithms are also compared according to several criteria.
2. Related Work
The idea of using hyperlink analysis arose around 1997 and manifested itself in the PageRank [7, 8] and HITS [2] algorithms for ranking web search results. There have been various attempts to improve the effectiveness of link analysis algorithms. Several researchers [9, 10, 11, 12] have addressed the difficulty of searching and querying the Web by considering its structural information as well as the meta-information embedded in the hyperlinks and the text adjacent to them. One line of research is based on analyzing the mathematical properties of link analysis algorithms. Langville and Meyer inspected PageRank for essential properties such as the existence and uniqueness of an eigenvector and the convergence of power iteration [13]. Dang and Croft [14] proposed utilizing anchor text because of its similarity to queries. Borodin et al. [15] examined a variety of theoretical properties of PageRank, HITS and SALSA, including their similarity, locality and stability. Kraft et al. analyzed anchor text for query refinement, again because of the similarity between queries and anchor text [16]. Lee et al. [17] determined whether the intent of a query is navigational or informational by locating all anchors with the same text as the query term, so as to discover an exact web page or else visit multiple pages. Zhang et al. [18] and Liu et al. [19] developed the I-HITS algorithm, based on the similarity of the page and the query, by considering the target page and the query together with the similarity of the anchor text; this improves the ability to differentiate link importance and to avoid topic drift. Craswell et al. [20] treated the anchor text of in-links as (an element of) a "query" in the Okapi BM25 model, to which the linked document is an "answer". This can succeed even when the required page is outside the crawl or contains no text. They also concluded that anchor-text information, without additional pre-processing or fine-tuning, is more helpful than content information for the site-discovery task.
By considering anchor texts, one can further improve Web search rankings, especially for navigational queries, named-page finding, homepage finding, and ad-hoc search tasks [21]. Furthermore, Eiron and McCurley [22] have argued that anchor text is extremely short, that web search engine users usually tend to submit very short queries, and that anchor text summarizes the target document rather than the source document. One of their interesting statistical observations is the similarity between anchor texts and real user search queries, which initially motivated the investigation of anchor text for query refinement. An ARC (Automatic Resource Compiler) was built as a component of the CLEVER system to automatically produce lists of hubs and authorities for broad queries [23]; it uses the anchor text, as well as a window of terms around the anchor text, to decide whether a target page is related to the query topic, and adjusts the weights of the links in its web graph accordingly. By contrast, our study is based on a link-based modified HITS, known as weighted HITS (WHITS), which collects all anchors of out-links and titles of in-links and doubles the weight of links whose anchors and titles match the query term.
3. Data Set
The study presented in this paper is based on data sets collected from the Google AJAX Search API and the Bing Search API for short-term queries Q, which generally consist of one or two terms, as shown in Table 1. The root set contains around 100 highest-ranked nodes (t) after eliminating duplicates. Using the nodes of the root set, the out-links, the anchors of out-links, the in-links and the titles of in-links are collected to build the base set (S). A web graph is then built from the base set, taking the maximum number of nodes (after normalizing links embedded in the same web page) in the neighbourhood graph that are interconnected by links. These nodes are represented as connections between links in order to manage and calculate their respective rank values effectively. On this web graph, the PageRank, HITS, weighted HITS, Norm(p) and sNorm(p) algorithms are applied to check the relative effectiveness of their ranking orders. The base set for each query (Q) is built in the manner described by Kleinberg; the numerical data are given in Table 1.
Table 1: Experimental data for various queries

| Sr. No. | Query (Q)        | Nodes (t) | Out-links | In-links | Links | Base set (S) after normalization |
|---------|------------------|-----------|-----------|----------|-------|----------------------------------|
| 1       | Java             | 102       | 11546     | 1912     | 13458 | 10806                            |
| 2       | Jaguar           | 102       | 16527     | 744      | 17373 | 12711                            |
| 3       | Harvard          | 95        | 27243     | 4271     | 31514 | 13192                            |
| 4       | Search engine    | 100       | 8264      | 2273     | 10637 | 9152                             |
| 5       | Kyoto University | 94        | 6393      | 700      | 7093  | 6070                             |
4. Web Page Ranking Algorithms
Jon Kleinberg's HITS algorithm discovers good authorities and hubs for a given topic by assigning two statistics to a page: an authority weight and a hub weight. PageRank is one of the most significant ranking techniques used to measure the importance of web pages. SALSA is based on the concept of Markov chains and uses the stochastic properties of random walks performed on the collection of web pages. In this approach, a neighbourhood graph is determined first; on that neighbourhood graph, a one-step backward and a one-step forward random walk are performed [3]. Norm(p) works on the principle that small authority weights should contribute a lesser amount to the computation of the hub weights [4]. The authority threshold algorithm works on the principle of special treatment of the authority weights: high authority weights should be more significant in the calculation of the hub weight [5, 6]. The parameter passed to the algorithm is the value of p; it is assumed that p ∈ [1, ∞], and as p rises, the p-norm is increasingly dominated by the top weights. For example, for p = 2 we basically scale each weight by itself. The sNorm(p) algorithm [5, 6] makes this symmetric by setting the authority weight of a node to be the p-norm of the vector of the hub weights of the hubs that point to that node.
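The alternating backward/forward random walk underlying SALSA can be simulated directly. In this sketch the names, step count and tiny graph are our own assumptions, and a real implementation would compute the stationary distribution rather than sample; here we simply count how often each authority is visited:

```python
import random

# SALSA-style alternating walk: from a hub, follow a random out-link to an
# authority; from an authority, follow a random in-link back to a hub.

def salsa_walk(graph, start_hub, steps=1000, seed=0):
    """graph: hub -> list of authorities it links to. Returns visit counts."""
    rng = random.Random(seed)
    inlinks = {}
    for h, outs in graph.items():
        for a in outs:
            inlinks.setdefault(a, []).append(h)
    visits = {a: 0 for a in inlinks}
    hub = start_hub
    for _ in range(steps):
        authority = rng.choice(graph[hub])    # forward step: hub -> authority
        visits[authority] += 1
        hub = rng.choice(inlinks[authority])  # backward step: authority -> hub
    return visits
```

Authorities with more in-links are visited more often, which mirrors SALSA's tendency to favour high in-degree authorities within a connected component.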
4.1 Weighted HITs Algorithm
Authorities are often not particularly self-descriptive. If one tries to discover the major "search engines", it would be a severe fault to confine one's attention to the set of all pages containing the expression "search engines". Even though this set is huge, it does not contain most of the natural authorities one would like to discover (e.g. pages like Google, Yahoo!, Excite, Bing, AltaVista, InfoSeek). Likewise, there is no reason to expect the home pages of Honda or Toyota to contain the words "Japanese automobile manufacturers", or the home pages of Lotus or Microsoft to contain the words "software companies" [9]. As such, non-self-descriptive web pages are not selected by a term-based search engine; they cannot be included in the relevance set and would not be selected by PageRank. For the HITS and SALSA algorithms, as well as Norm(p) and sNorm(p), the hub and authority concept makes it easy to identify popular pages on a given subject even if the query words do not appear anywhere in the page. Based on these observations, the following method is recommended for compiling the subset of the Web: create the root set from an existing search engine, include every page that either links to a page in the root set or is linked to by a page in the root set, and construct the base set for which hub and authority scores are computed.
The base set is created in this way because a good authority page may not contain the query text (such as "search engine"). The "expansion" of the root set into the base set improves the universal pool of good hubs and authorities and resolves the ambiguity to some extent. To overcome the remaining issue, an adjacency matrix is constructed that doubles the weight of links whose out-link anchors and in-link titles contain the query word, so as to capture pages that are highly authoritative but not self-descriptive. This is helpful for mining pertinent links for queries that are not directly connected to the related links.
Hence, in order to decrease the computational complexity, calculating the similarity between the destination page and the query Q is simplified to calculating the similarity of the anchor text of out-links and the titles of in-links with the query Q. In contrast to the HITS algorithm, WHITS calculates authority and hub scores using the weighted matrix; this ultimately increases the weights of links that are mostly not self-descriptive, on the basis of the similarity of out-link anchors and in-link titles with the query Q. For a given page i in S, a weighted authority score Wa(i) and a weighted hub score Wh(i) are assigned using the weighted matrix:
Wa(i) = Σ_{(j,i) ∈ E} Wh(j)        (1)
Wh(i) = Σ_{(i,j) ∈ E} Wa(j)        (2)
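A hypothetical sketch of how the weighted scores can be evaluated with the double-weighted matrix: links whose out-link anchor text or in-link title contains the query term receive weight 2, all others weight 1. The function names, matching rule and normalization are our own illustrative assumptions, not the paper's exact procedure:

```python
# Build the double-weighted adjacency matrix described above.

def weighted_matrix(links, anchors, titles, query):
    """links: list of (i, j) edges; anchors[(i, j)]: anchor text of the link;
    titles[j]: title of the target page. Returns edge -> weight."""
    q = query.lower()
    return {(i, j): 2.0 if q in anchors.get((i, j), "").lower()
                         or q in titles.get(j, "").lower() else 1.0
            for (i, j) in links}

# Weighted HITS iteration: hub/authority sums use the link weights.

def whits(nodes, weights, iterations=20):
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: sum(w * hub[i] for (i, j), w in weights.items() if j == n)
                for n in nodes}
        s = sum(auth.values()) or 1.0
        auth = {n: a / s for n, a in auth.items()}
        hub = {n: sum(w * auth[j] for (i, j), w in weights.items() if i == n)
               for n in nodes}
        s = sum(hub.values()) or 1.0
        hub = {n: h / s for n, h in hub.items()}
    return auth, hub
```

A link whose anchor matches the query thus pulls its target's authority score up relative to an otherwise identical non-matching link.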
The HITS algorithm was run on our data set as described and compared with the modified weighted HITS (WHITS) algorithm. The results, showing relatively increased weights for the highest authorities, are given in Table 2 for the top 10 authority ranks for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University". They show that nodes whose weights are doubled, by considering the anchors of out-links as well as the titles of in-links, are ranked highest. For the query "search engine", the top authoritative pages describing the major search engines appear among the top 10 authorities produced by the WHITS algorithm. As shown in Table 2, WHITS increased the weights of the authoritative pages and surfaced almost all the available major search engines. In a similar way, it increases the weights of pertinent links for the queries "Java", "Jaguar", "Harvard" and "Kyoto University".
Table 2: Top ten authorities and weighted authorities for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University"

| Query | HITS authority weight | Link (HITS) | WHITS authority weight | Link (WHITS) |
|---|---|---|---|---|
| Java | 0.3490 | https://plus.google.com | 0.3221 | http://www.oracle.com/technetwork/java/index.html |
| | 0.2510 | http://www.oracle.com/technetwork/java/index.html | 0.2997 | http://www.oracle.com |
| | 0.2071 | http://www.youtube.com | 0.2663 | http://java.com |
| | 0.1885 | http://www.oracle.com | 0.2590 | http://www.oracle.com/technetwork/java/javase/downloads/index.html |
| | 0.1698 | http://java.com | 0.2262 | https://www.oracle.com |
| | 0.1661 | http://www.facebook.com | 0.2147 | https://cloud.oracle.com |
| | 0.1605 | http://www.oracle.com/technetwork/java/javase/downloads/index.html | 0.2147 | http://www.java.net |
| | 0.1592 | https://twitter.com | 0.1871 | https://community.oracle.com |
| | 0.1573 | http://twitter.com | 0.1841 | http://education.oracle.com |
| | 0.1530 | https://www.oracle.com | 0.1757 | https://blogs.oracle.com |
| Jaguar | 0.2188 | http://www.jaguarusa.com/index.html | 0.6969 | http://www.jaguarusa.com/index.html |
| | 0.1582 | http://www.jaguar.co.uk/index.html | 0.6147 | http://www.jaguarusa.com/ |
| | 0.1507 | http://www.jaguar.com/index.html | 0.1449 | http://www.jaguar.com/index.html |
| | 0.1434 | http://www.jaguar.com.au/index.html | 0.1360 | http://www.jaguar.co.uk/index.html |
| | 0.1390 | http://www.jaguar.ie/index.html | 0.1144 | http://www.jaguar.co.za/index.html |
| | 0.1384 | http://www.jaguar.in/index.html | 0.1141 | http://www.jaguar.com.au/index.html |
| | 0.1361 | http://www.jaguar.co.za/index.html | 0.1113 | http://www.jaguar.in/index.html |
| | 0.1177 | http://www.jaguar.com | 0.1023 | http://www.jaguar.ie/index.html |
| | 0.1086 | http://jaguar.pl | 0.0572 | https://twitter.com |
| | 0.1071 | http://www.jaguar.com.my | 0.0366 | http://instagram.com |
| Harvard | 0.3300 | http://twitter.com | 0.3306 | http://www.hbs.edu/ |
| | 0.2914 | https://twitter.com | 0.3048 | http://hms.harvard.edu/ |
| | 0.2682 | http://www.harvard.edu | 0.2864 | https://www.hsph.harvard.edu/ |
| | 0.2634 | https://www.facebook.com | 0.2672 | https://www.hms.harvard.edu/ |
| | 0.2239 | http://www.harvard.edu/ | 0.2545 | http://www.gsd.harvard.edu/ |
| | 0.2118 | https://plus.google.com | 0.2500 | http://alumni.harvard.edu/ |
| | 0.2087 | http://www.facebook.com | 0.2409 | https://college.harvard.edu/ |
| | 0.1850 | http://www.youtube.com | 0.2284 | https://www.gocrimson.com/ |
| | 0.1816 | http://www.linkedin.com | 0.1856 | https://library.harvard.edu/ |
| | 0.1639 | http://news.harvard.edu | 0.1611 | http://www.harvard.edu/ |
| | 0.1558 | http://trademark.harvard.edu | 0.1585 | http://hpac.harvard.edu/ |
| Search Engine | 0.1199 | http://www.google.com | 0.1209 | http://www.google.com |
| | 0.1176 | http://www.bing.com | 0.1191 | http://www.bing.com |
| | 0.1129 | http://www.ask.com | 0.1136 | http://www.ask.com |
| | 0.1083 | http://www.yahoo.com | 0.1088 | http://www.yahoo.com |
| | 0.1008 | http://www.lycos.com | 0.1011 | http://www.lycos.com |
| | 0.0975 | http://www.facebook.com | 0.0987 | http://www.facebook.com |
| | 0.0960 | http://www.ixquick.com | 0.0971 | http://www.ixquick.com |
| | 0.0960 | http://www.webcrawler.com | 0.0962 | http://www.webcrawler.com |
| | 0.0929 | http://www.galaxy.com | 0.0931 | http://www.excite.com |
| | 0.0929 | http://www.excite.com | 0.0931 | http://www.galaxy.com |
| Kyoto University | 0.7126 | http://www.kyoto-u.ac.jp/en | 0.7216 | http://www.kyoto-u.ac.jp/en |
| | 0.6204 | http://www.kyoto-u.ac.jp/en/ | 0.5809 | http://www.kyoto-u.ac.jp/en/ |
| | 0.1158 | http://www.kyoto-u.ac.jp | 0.2267 | http://www.kyoto-u.ac.jp |
| | 0.0740 | http://www.opir.kyoto-u.ac.jp | 0.1037 | http://www.opir.kyoto-u.ac.jp |
| | 0.0523 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/ | 0.0866 | http://www.opir.kyoto-u.ac.jp/kuprofile/ |
| | 0.0523 | http://www.oc.kyoto-u.ac.jp/en/ | 0.0473 | http://www.t.kyoto-u.ac.jp/en |
| | 0.0513 | http://twitter.com | 0.0452 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/ |
| | 0.0499 | http://www.asafas.kyoto-u.ac.jp/en/ | 0.0452 | http://www.oc.kyoto-u.ac.jp/en/ |
| | 0.0494 | https://www.facebook.com | 0.0444 | http://sph.med.kyoto-u.ac.jp |
| | 0.0440 | http://www.med.kyoto-u.ac.jp | 0.0437 | http://www.t.kyoto-u.ac.jp |
HITS, PageRank, Norm(p), sNorm(p) and weighted HITS (WHITS) were then run on our data set and their results compared with WHITS. The comparison shows that nodes whose weights are doubled, by considering the anchors of links, have their rankings enhanced and are ranked highest. The top authoritative pages describing the major search engines appear among the top 10 authorities produced by the WHITS algorithm; as shown in Table 3, WHITS increased the weight of the authoritative pages and surfaced almost all the available search engines. The results in Table 3 give the top 10 authority ranks for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University", for which WHITS outperformed the other algorithms in extracting pertinent links; it discovered links that are generally not discovered by the rest of the algorithms.
Table 3: Comparison of the weights of the top ten weighted authorities for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University" with the rest of the algorithms

| Query | Rank | HITS authority weight | PageRank | Norm(p), p=2 | Norm(p), p=3 | sNorm(p), p=2 | WHITS authority weight | Link |
|---|---|---|---|---|---|---|---|---|
| Java | 1 | 0.2510 | 0.0095 | 0.2434 | 0.2486 | 0.0000 | 0.3221 | http://www.oracle.com/technetwork/java/index.html |
| | 2 | 0.1885 | 0.0017 | 0.1851 | 0.1873 | 0.0000 | 0.2997 | http://www.oracle.com |
| | 3 | 0.1698 | 0.0012 | 0.1668 | 0.1687 | 0.0000 | 0.2663 | http://java.com |
| | 4 | 0.1605 | 0.0010 | 0.1572 | 0.1593 | 0.0000 | 0.2590 | http://www.oracle.com/technetwork/java/javase/downloads/index.html |
| | 5 | 0.1530 | 0.0044 | 0.1500 | 0.1520 | 0.0000 | 0.2262 | https://www.oracle.com |
| | 6 | 0.1398 | 0.0010 | 0.1369 | 0.1389 | 0.0000 | 0.2147 | https://cloud.oracle.com |
| | 7 | 0.1398 | 0.0010 | 0.1369 | 0.1389 | 0.0000 | 0.2147 | http://www.java.net |
| | 8 | 0.1507 | 0.0010 | 0.1477 | 0.1497 | 0.0000 | 0.1871 | https://community.oracle.com |
| | 9 | 0.1461 | 0.0010 | 0.1431 | 0.1451 | 0.0000 | 0.1841 | http://education.oracle.com |
| | 10 | 0.1101 | 0.0015 | 0.1082 | 0.1094 | 0.0000 | 0.1757 | https://blogs.oracle.com |
| Jaguar | 1 | 0.2188 | 0.0146 | 0.2155 | 0.2050 | 0.0015 | 0.6969 | http://www.jaguarusa.com/index.html |
| | 2 | 0.0970 | 0.0131 | 0.0937 | 0.0863 | 0.0001 | 0.6147 | http://www.jaguarusa.com/ |
| | 3 | 0.1507 | 0.0091 | 0.1352 | 0.1265 | 0.0001 | 0.1449 | http://www.jaguar.com/index.html |
| | 4 | 0.1582 | 0.0100 | 0.1519 | 0.1454 | 0.0001 | 0.1360 | http://www.jaguar.co.uk/index.html |
| | 5 | 0.1361 | 0.0019 | 0.1240 | 0.1168 | 0.0001 | 0.1144 | http://www.jaguar.co.za/index.html |
| | 6 | 0.1434 | 0.0029 | 0.1356 | 0.1294 | 0.0001 | 0.1141 | http://www.jaguar.com.au/index.html |
| | 7 | 0.1384 | 0.0024 | 0.1250 | 0.1174 | 0.0001 | 0.1113 | http://www.jaguar.in/index.html |
| | 8 | 0.1390 | 0.0027 | 0.1268 | 0.1195 | 0.0001 | 0.1023 | http://www.jaguar.ie/index.html |
| | 9 | 0.0932 | 0.0080 | 0.2531 | 0.2976 | 0.0001 | 0.0572 | https://twitter.com |
| | 10 | 0.0545 | 0.0043 | 0.1358 | 0.1572 | 0.0000 | 0.0366 | http://instagram.com |
| Harvard | 1 | 0.1419 | 0.0126 | 0.1419 | 0.1420 | 0.0189 | 0.3306 | http://www.hbs.edu/ |
| | 2 | 0.1423 | 0.0078 | 0.1423 | 0.1424 | 0.0043 | 0.3048 | http://hms.harvard.edu/ |
| | 3 | 0.0443 | 0.0128 | 0.0443 | 0.0443 | 0.0000 | 0.2864 | https://www.hsph.harvard.edu/ |
| | 4 | 0.0751 | 0.0068 | 0.0752 | 0.0752 | 0.0043 | 0.2672 | https://www.hms.harvard.edu/ |
| | 5 | 0.0785 | 0.0093 | 0.0785 | 0.0785 | 0.0045 | 0.2545 | http://www.gsd.harvard.edu/ |
| | 6 | 0.0326 | 0.0082 | 0.0326 | 0.0657 | 0.0084 | 0.2500 | http://alumni.harvard.edu/ |
| | 7 | 0.0657 | 0.0080 | 0.0657 | 0.0697 | 0.0076 | 0.2409 | https://college.harvard.edu/ |
| | 8 | 0.0697 | 0.0059 | 0.0697 | 0.0196 | 0.0000 | 0.2284 | https://www.gocrimson.com/ |
| | 9 | 0.0196 | 0.0106 | 0.0196 | 0.2239 | 0.0000 | 0.1856 | https://library.harvard.edu/ |
| | 10 | 0.2239 | 0.0153 | 0.2239 | 0.0327 | 0.0595 | 0.1611 | http://www.harvard.edu/ |
| Search Engine | 1 | 0.1199 | 0.0034 | 0.1199 | 0.1213 | 0.0000 | 0.1209 | http://www.google.com |
| | 2 | 0.1176 | 0.0007 | 0.1176 | 0.1191 | 0.0000 | 0.1191 | http://www.bing.com |
| | 3 | 0.1129 | 0.0004 | 0.1129 | 0.1139 | 0.0000 | 0.1136 | http://www.ask.com |
| | 4 | 0.1083 | 0.0009 | 0.1083 | 0.1092 | 0.0000 | 0.1088 | http://www.yahoo.com |
| | 5 | 0.1008 | 0.0011 | 0.1008 | 0.1016 | 0.0000 | 0.1011 | http://www.lycos.com |
| | 6 | 0.0975 | 0.0042 | 0.0975 | 0.0983 | 0.0000 | 0.0987 | http://www.facebook.com |
| | 7 | 0.0960 | 0.0003 | 0.0960 | 0.0968 | 0.0000 | 0.0971 | http://www.ixquick.com |
| | 8 | 0.0960 | 0.0003 | 0.0960 | 0.0967 | 0.0000 | 0.0962 | http://www.webcrawler.com |
| | 9 | 0.0929 | 0.0003 | 0.0929 | 0.0936 | 0.0000 | 0.0931 | http://www.excite.com |
| | 10 | 0.0929 | 0.0003 | 0.0929 | 0.0936 | 0.0000 | 0.0931 | http://www.galaxy.com |
| Kyoto University | 1 | 0.7126 | 0.0272 | 0.7077 | 0.6723 | 0.0250 | 0.7216 | http://www.kyoto-u.ac.jp/en |
| | 2 | 0.6204 | 0.0199 | 0.6139 | 0.5698 | 0.0250 | 0.5809 | http://www.kyoto-u.ac.jp/en/ |
| | 3 | 0.1158 | 0.0608 | 0.1258 | 0.1734 | 0.0009 | 0.2267 | http://www.kyoto-u.ac.jp |
| | 4 | 0.0740 | 0.0084 | 0.0770 | 0.0912 | 0.0011 | 0.1037 | http://www.opir.kyoto-u.ac.jp |
| | 5 | 0.0440 | 0.0109 | 0.0448 | 0.0476 | 0.0028 | 0.0866 | http://www.opir.kyoto-u.ac.jp/kuprofile/ |
| | 6 | 0.0384 | 0.0014 | 0.0399 | 0.0475 | 0.0007 | 0.0473 | http://www.t.kyoto-u.ac.jp/en |
| | 7 | 0.0523 | 0.0046 | 0.0541 | 0.0628 | 0.1129 | 0.0452 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/ |
| | 8 | 0.0523 | 0.0046 | 0.0541 | 0.0628 | 0.0007 | 0.0452 | http://www.oc.kyoto-u.ac.jp/en/ |
| | 9 | 0.0330 | 0.0007 | 0.0348 | 0.0442 | 0.0000 | 0.0444 | http://sph.med.kyoto-u.ac.jp |
| | 10 | 0.0276 | 0.0015 | 0.0275 | 0.0267 | 0.0000 | 0.0437 | http://www.t.kyoto-u.ac.jp |

Figures 1-5 show the HITS, PageRank, Norm(p), sNorm(p) and WHITS authority weights for the data-set queries; as depicted in the figures, WHITS outperforms the other algorithms on these queries.
5. Comparison of Various Link-Based Web Page Ranking Algorithms
On the basis of this analysis, a comparison of various link-based web page ranking algorithms is carried out against some basic criteria, such as main technique, key parameters, relevancy and size of matrix.
Table 5: Comparison of link-based ranking algorithms (eigenvector-based, or linear link analysis, algorithms)
Figure 1: "Java" authority weights. Figure 2: "Jaguar" authority weights. Figure 3: "Harvard" authority weights. Figure 4: "Search Engine" authority weights. Figure 5: "Kyoto University" authority weights.
Basic criteria
- HITS: link analysis algorithm.
- PageRank: link analysis algorithm based on the random surfer.
- SALSA: link analysis algorithm based on Markov chains.
- Norm(p): ranking algorithm.
- sNorm(p): ranking algorithm.
- WHITS: link analysis algorithm.

Mining procedure used
- HITS: web structure and web content.
- PageRank: web structure.
- SALSA: web structure and web content.
- Norm(p): web structure.
- sNorm(p): web structure.
- WHITS: web structure and web content.

Key parameters
- HITS: back and forward links.
- PageRank: back-links.
- SALSA: back and forward links.
- Norm(p): back and forward links.
- sNorm(p): back and forward links.
- WHITS: back and forward links.

Used by
- HITS: IBM CLEVER.
- PageRank: Google.
- SALSA: Twitter uses a SALSA-like algorithm.
- Norm(p): research model.
- sNorm(p): research model.
- WHITS: research model.

Query dependency
- HITS: query dependent.
- PageRank: query independent.
- SALSA: query dependent.
- Norm(p): query dependent.
- sNorm(p): query dependent.
- WHITS: query dependent.

Neighbourhood
- HITS: applied to the neighbourhood of pages adjacent to the results of a query; directed sub-graph (1000-5000 nodes).
- PageRank: whole Web.
- SALSA: sub-graph (bipartite undirected graph).
- Norm(p): sub-graph of nodes.
- sNorm(p): sub-graph of nodes.
- WHITS: applied to the neighbourhood of pages adjacent to the results of a query; directed sub-graph (1000-5000 nodes).

Model
- HITS: hubs and authorities.
- PageRank: authorities and a Markov model of a random walk.
- SALSA: hubs, authorities and Markov chains.
- Norm(p): hubs and authorities, with weights scaled by themselves.
- sNorm(p): hubs and authorities, with weights scaled by themselves.
- WHITS: hubs and authorities with double weighting.

Authority and hub computation
- HITS: calculates authority and hub scores with the un-weighted matrix.
- PageRank: calculates the authority score with a row-weighted matrix.
- SALSA: calculates hub and authority scores with both row and column weighting.
- Norm(p): small authority weights contribute less to the calculation of the hub weights.
- sNorm(p): symmetric for authority and hub weights.
- WHITS: calculates authority and hub scores with the weighted matrix.

Complexity (worst case)
- HITS: < O(n^2).
- PageRank: O(log n).
- SALSA: < O(n^2).
- Norm(p): not reported.
- sNorm(p): not reported.
- WHITS: < O(n^2).

Mutual reinforcement
- HITS: mutual reinforcement between authority and hub web pages.
- PageRank: does not distinguish hubs and authorities; ranks pages by authority.
- SALSA: departure from HITS, i.e. the Tightly Knit Community (TKC) effect.
- Norm(p): mutual reinforcement between authority and hub web pages.
- sNorm(p): mutual reinforcement between authority and hub web pages.
- WHITS: mutual reinforcement between authority and hub web pages.

Stability
- HITS: can be unstable: changing a few links can lead to quite different rankings.
- PageRank: can be unstable: changing a few links can lead to quite different rankings.
- SALSA: can be unstable.
- Norm(p): can be unstable.
- sNorm(p): can be unstable.
- WHITS: can be unstable: changing a few links can lead to quite different rankings.

Response time
- HITS: crucial, because the rank is calculated at query time.
- PageRank: rank calculation is offline, so response time is not a crucial issue.
- SALSA: crucial, because the rank is calculated at query time.
- Norm(p): larger response time.
- sNorm(p): larger response time.
- WHITS: crucial, because the rank is calculated at query time.

Relevancy
- HITS: more relevant, as it uses the hyperlink structure and considers the content.
- PageRank: less relevant, as it ranks the page at indexing time.
- SALSA: more relevant.
- Norm(p): more relevant.
- sNorm(p): more relevant.
- WHITS: more relevant, as it uses the hyperlink structure and considers the content.

Size of matrix
- HITS: matrix calculation is done on the basis of the neighbourhood directed graph.
- PageRank: the world's largest matrix calculation problem.
- SALSA: matrix calculation is done on the basis of the bipartite undirected graph.
- Norm(p): matrix calculation is done on the basis of the neighbourhood graph.
- sNorm(p): matrix calculation is done on the basis of the neighbourhood graph.
- WHITS: double-weighted matrix calculation is done on the basis of the neighbourhood directed graph.

Dual ranking
- HITS: hub and authority rank.
- PageRank: does not supply a dual ranking; random walk.
- SALSA: hub Markov chain and authority Markov chain.
- Norm(p): a higher authority weight is more essential in computing the hub weight.
- sNorm(p): symmetric, by setting the authority weight of a node to the p-norm of the vector of the hub weights of the hubs that point to that node.
- WHITS: hub and authority rank with the double-weighted matrix.

Efficiency
- HITS: less efficient for today's search engines, which need to handle millions of queries per day.
- PageRank: scores are computed ahead of query time, giving much greater efficiency.
- SALSA: less efficient.
- Norm(p): less efficient.
- sNorm(p): less efficient.
- WHITS: less efficient for today's search engines, which need to handle millions of queries per day.

Analysis scope
- Single page, for all six algorithms.

Quality of result
- HITS: less than PageRank.
- PageRank: medium.
- SALSA: average.
- Norm(p): medium.
- sNorm(p): medium.
- WHITS: improved over HITS.

Computing the eigenvector
- HITS: summing the weights of linked nodes at each step.
- PageRank: computes the eigenvector using a Markov chain.
- SALSA: computes the eigenvector using a Markov chain.
- Norm(p): computes the eigenvector by scaling the weights by themselves.
- sNorm(p): computes the eigenvector by scaling the weights by themselves.
- WHITS: summing double-weighted linked nodes, matched with the query, anchor text and titles, at each step.

Convergence
- HITS: converges to a fixed point if iterated indefinitely; the authority vector a converges to the principal eigenvector of A^T A, the hub vector h converges to the principal eigenvector of A A^T, and generally 20 iterations yield fairly stable results.
- PageRank: the percentage of time spent at each page converges to a fixed value; the algorithm generally converged in about 52 iterations.
- SALSA: the matrices A and H should be irreducible; if the neighbourhood graph G is connected, both A and H are irreducible, and if it is not connected, performing the power method on A and H will not converge to a unique dominant eigenvector.
- Norm(p): varies with the value of p passed as a parameter; converges to a fixed point.
- sNorm(p): varies with the value of p passed as a parameter; converges to a fixed point.
- WHITS: converges to a fixed point if iterated indefinitely; the authority vector converges to the principal eigenvector of A^T A with weighted input, the hub vector converges to the principal eigenvector of A A^T with weighted input, and generally 20 iterations yield fairly stable results.

Vulnerability
- HITS: query dependency; irrelevant authorities/hubs problem; mutually reinforcing relationships; topic drift; effect of additional pages.
- SALSA: query dependency; TKC effect.
- Norm(p): query dependent; mutually reinforcing.
- sNorm(p): query dependent; mutually reinforcing.
- WHITS: query dependency; mutually reinforcing relationships.
6. Conclusions
Most web information retrieval tools use textual information while ignoring link information that could be very valuable. This study explored link analysis algorithms for ranking and the utilization of anchor text in IR. Experimentation showed that weighted HITs (WHITs), which doubly weights links of nodes that match the query along with the anchor texts of out-links and the titles of in-links, outperformed other algorithms such as HITs, Page Rank and the norm(p) family for authority pages. Hence, with WHITs and anchor texts, one can improve Web search rankings for pages that are not self-descriptive. In future, it will also be helpful for homepage finding, named page finding, navigational queries and ad hoc search tasks.
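As an illustration of the double-weighting idea, the sketch below doubles a link's weight when the query matches the out-link's anchor text and again when it matches the in-linked page's title, then runs the usual HITs iteration on the weighted matrix. The page set, matching rule and weight values are hypothetical choices for illustration, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical toy data: page titles and links annotated with anchor text.
pages = ["p0", "p1", "p2", "p3"]
titles = {"p0": "home", "p1": "java tutorial", "p2": "java guide", "p3": "contact"}
links = [  # (source, target, anchor text)
    ("p0", "p1", "learn java"),
    ("p0", "p2", "misc"),
    ("p1", "p2", "java reference"),
    ("p3", "p2", "java guide"),
]
query = "java"

def whits_weights(links, titles, query):
    """Assumed weighting: a link counts double when the query matches the
    anchor text of the out-link, and double again when it matches the
    title of the in-linked (target) page."""
    idx = {p: i for i, p in enumerate(pages)}
    W = np.zeros((len(pages), len(pages)))
    for src, dst, anchor in links:
        w = 1.0
        if query in anchor:       # out-link anchor text matches the query
            w *= 2.0
        if query in titles[dst]:  # in-linked page title matches the query
            w *= 2.0
        W[idx[src], idx[dst]] = w
    return W

def whits(W, iters=20):
    """Standard HITs power iteration run on the weighted adjacency matrix."""
    h = np.ones(W.shape[0])
    for _ in range(iters):
        a = W.T @ h
        a /= np.linalg.norm(a)
        h = W @ a
        h /= np.linalg.norm(h)
    return a, h

W = whits_weights(links, titles, query)
a, h = whits(W)
```

Under this weighting, the query-matched links into p2 accumulate far more weight than the unmatched ones, so p2 emerges as the top authority for the query.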
References
[1] M. Henzinger, Link analysis in web information retrieval, IEEE Data Engineering Bulletin. (2000) 1-6.
[2] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM.
46(5) (1999) 604–632.
[3] R. Lempel, S. Moran, The stochastic approach for link-structure analysis (SALSA) and the
TKC effect, Computer Networks. 33(2000) 387–401.
[4] C. Rudin, Ranking with a P-Norm Push, Springer-Verlag Berlin Heidelberg. (2006) 1-20.
[5] M. Kumar, Web Page Ranking Solution through Snorm (P) Algorithm Implementation
(Doctoral dissertation, Thapar University Patiala). (2008).
[6] M. Kumar, A New Approach for Web Page Ranking Solution: sNorm(p) Algorithm, International Journal of Computer Applications (0975-8887). Vol. 9, No. 10, November (2010).
[7] S. Brin and L. Page, The anatomy of a large-scale hyper textual Web search engine. Computer
Networks and ISDN Systems. 30(1–7): (1998)107–117.
[8] L. Page, S. Brin, R. Motwani and T. Winograd, The page rank citation ranking: Bringing order
to the web. (1998).
[9] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. Kumar, P. Raghavan and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer, 32(8), (1999), 60-67.
[10] T.H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search
on the web. In Proceedings of the 11th international conference on World Wide Web, ACM.
(2002), 432-442.
[11] I. Varlamis, M. Vazirgiannis, M. Halkidi, and B. Nguyen. THESUS, a closer view on web
content management enhanced with link semantics. Knowledge and Data Engineering, IEEE
Transactions on. (2004), 16(6), 685-700.
[12] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: links, objects, time and space---structure in hypermedia systems, ACM. (1998), 225-234.
[13] A. N. Langville, C. D. Meyer. Deeper inside Page Rank, Journal of Internet Mathematics. Vol. 1, No. 3, (2003), 335-380.
[14] V. Dang and W. B. Croft, Query reformulation using anchor text, In Proc. of the 3rd ACM Int.
Conf. on Web Search and Data Mining, WSDM'10. (2010) 41-50.
[15] A. Borodin, G. Roberts, J. Rosenthal, P. Tsaparas, Finding authorities and hubs from link
structures on the world wide web, Proceedings of the 10th International World Wide Web
Conference. (2001) 415–429.
[16] R. Kraft and J. Zien, Mining anchor text for query refinement, In Proceedings of WWW.
(2004) 666-674.
[17] U. Lee, Z. Liu, and J. Cho, Automatic identification of user goals in web search, In
Proceedings of the 14th Int. Conf. on WWW’05, ACM. (2005) 391–400.
[18] X. Zhang, H. Yu, C. Zhang and X. Liu, An Improved Weighted HITS Algorithm Based on Similarity and Popularity, Computer and Computational Sciences, IMSCCS 2007, Second International Multi-Symposiums on, IEEE. (2007), 477-480.
[19] X. Liu, H. Lin and C. Zhang, An improved HITS algorithm based on page-query similarity
and page popularity. Journal of Computers. 7(1), (2012) 130-134.
[20] N. Craswell, D. Hawking, and S. Robertson, Effective site finding using link anchor
information. In Proceedings of the 24th Annual Int. ACM SIGIR Conf. on Research and
Development in Information Retrieval ACM. (2001) 250–257.
[21] H. S. Patel, A. A. Desai, An Anchor Based Information Retrieval For Link Analysis: A Survey, VNSGU Journal of Science and Technology. Vol. 4, No. 1, 22-35, ISSN: 0975-5446, July (2015).
[22] N. Eiron and K. S. McCurley. Analysis of anchor text for web search. In Proceedings of the
26th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, ACM Press. (2003), 459–460.
[23] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic
resource compilation by analyzing hyperlink structure and associated text. Proceedings of the
7th World Wide Web Conference. (1998).