
A Novel Algorithm to Determine the Quality of a Web Page for Improving the Search Engine Ranking Framework

Sheikh Muhammad Sarwar, Md. Mustafizur Rahman and Mosaddek Hossain Kamal

Abstract: This paper proposes and develops a general and formal static rank computation algorithm for ranking web documents

considering the availability, significance, appeal and relevance of the images present in them. Different types of images appear in a

web document; some of them increase the content quality of the web page and some are deemed irrelevant considering the content.

Moreover, some images are not appealing enough to attract the attention of users, and they do not necessarily improve the content. In

this paper, a static ranking algorithm like PageRank has been proposed, which works based on the analysis of the images appearing

in the web document. A method for integrating this algorithm with a complete ranking framework, which is based on Markov Random

Field Model has also been presented. The algorithm computes a metric IBQV (Image Based Quality Value) that demonstrates the

extent to which images in a web document increase its value. The theoretical and practical implications of IBQV have been shown and

experimental results indicate that incorporating IBQV increases the correctness of the search result.

Index Terms: Information Retrieval, Web Page Ranking, Image Search Engine, Markov Random Field Model

1 INTRODUCTION

Existing document retrieval models usually assume that all

the documents in the collection have the same quality. The

equal quality assumption does not, however, hold for large

and heterogeneous web corpora [1]. As the number of web

documents is growing exponentially, it is becoming very

difficult to find the required web documents with respect to

a user query. Even when many documents can satisfy the user's information need, many web documents contain unnecessary textual and multimedia information

which hampers readability, makes navigation difficult and

tiresome and scatters the presentation and layout. As a result,

quality of a web document becomes a crucial factor when

designing a ranking function or framework.

Most current research focuses on structure mining of web pages to assign them a rank value. Two page ranking algorithms, HITS [2] and PageRank [3], are commonly used in web structure mining. Both algorithms measure the importance of a web page based on the non-local link

structure. They rely solely on the votes from the neighbors

of the document in the link graph to determine the quality of

the document [1]. So, they are not sufficient for determining the static rank of a web page, as content representation is not considered as a quality criterion.

Sheikh Muhammad Sarwar is with the Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh.

Md. Mustafizur Rahman is with the Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh.

Mosaddek Hossain Kamal is with the Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh.

Little research has been done on integrating document quality into the ranking framework. Text based features of a web page have been integrated in a Markov Random Field ranking model [1]. The Markov Random Field Model for Information Retrieval (MRF-IR) was proposed by Metzler and Croft [4]. This model can make use of features based on single terms, ordered phrases, and unordered phrases. MRF-IR has been used as an effective tool in different search

tasks. It performs reasonably well for text based queries. However, web documents also contain quality images, which can act as discriminators when choosing pages to present to a user. A document with images that support its text content can be useful and engaging for the user. A study of the learning process of students showed the importance of images in teaching, and its findings confirmed the benefit of incorporating images in teaching and learning. If images are selected and used appropriately, they can enhance learning and lead to a deep approach to learning amongst students [5]. So, images that are coherent with the textual content will enhance the quality of a web document.

Images that are appealing increase the aesthetics of a web page and attract users. Finding appealing images for automatic album creation is a popular research topic [6]. Several algorithms and techniques for computing the appeal of an image or a frame in a video have been proposed in the literature [7] [8]. From a user's point of view, appealing images

increase the quality of web documents and should have an

impact in ranking them.

Based on the above assumptions, the design and architecture of a ranking model is proposed in this work, in which a document containing images that are appealing and coherent with the content of the document is given preference when ranked with respect to a user query. In this context, a new metric, Image Based Quality Value (IBQV), is proposed and its calculation process is illustrated. A new kind of image with respect to a web page, the Stop Image, is also proposed.

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 10, OCTOBER 2012, ISSN (Online) 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 49
© 2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

2 BACKGROUND

There are some topics on which a short discussion is needed

before describing the new image based quality model for

integrating into the ranking framework for web pages. These

topics form the basic idea for developing the new quality

model. A new topic has been introduced in this context which

is Stop Image, and we have provided the definition of Stop

Image which can be treated as a part of our contribution in

this paper.

2.1 Vector Space Model (VSM) for Document Similarity Analysis

Vector Space Model (VSM) treats each document as a vector.

The dimensions of the vector are terms. If a term occurs in the

document, its value in the vector is non-zero. Several different

ways of computing these values, also known as (term) weights,

have been developed. One of the best known schemes is tf-idf

weighting. The definition of term depends on the application.

Typically terms are single words, keywords, or longer phrases

[9]. To measure the similarity between two documents, vector representations of the two documents are first created. Then, the cosine similarity measure is used to compute the similarity score [9]. Other similarity measures can also be used.
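For reference, one common form of tf-idf weighting and the cosine similarity between two document vectors can be written as below. The symbols are notation introduced here for illustration: tf is the frequency of term t in document d, df is the number of documents containing t, and N is the collection size. The exact tf-idf variant used in this work is not specified, so this should be read as one standard choice rather than the definitive one.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t},
\qquad
\cos(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{\lVert \vec{d}_1 \rVert \, \lVert \vec{d}_2 \rVert}
```

Because term weights are non-negative, the cosine value lies between 0 and 1.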

2.2 Web Image Search Engine

A web image search engine combines the functionality of

keyword based image searching and content based image

searching. For keyword based image searching, the user gives

text queries to the search engine and the search engine returns

a list of web pages where the object mentioned in the keyword

occurs. For content based image searching, the user presents an image query to the search engine, and the search engine uses an image distance measure to find the web pages that contain the matched image. An image distance measure

compares the similarity of two images in various dimensions

such as color, texture, shape, and others. A distance of 0 signifies an exact match with the query with respect to the dimensions that were considered, while a value greater than 0 indicates varying degrees of similarity between the images. Search results can then be sorted based on their distance to the queried image [10]. In this work, we only consider exact matches, that is, an image distance of 0.

2.3 Stop Image

Stop images are those that can be deleted from the document without the document losing any of the visual information it provides. An analogous text based concept is the stop word. In a document text, there are terms and words that are not necessary from a retrieval perspective. Articles and prepositions are words that do not improve the knowledge inside a document. They are stop words. Analogously, there are

images in a web document which do not necessarily improve

the content of the document. They are the background images,

images with low resolution, images of very small size and

single color images.
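A minimal sketch of a Stop Image filter built on the four criteria above is given here. The `ImageInfo` record, the function names and all numeric thresholds are hypothetical; the paper lists the criteria but gives no cut-off values, and pixel dimensions are used as a stand-in for both "low resolution" and "very small size".

```python
from dataclasses import dataclass

# Hypothetical thresholds: the paper names the criteria but gives no numbers.
MIN_WIDTH = 50            # pixels
MIN_HEIGHT = 50           # pixels
MIN_DISTINCT_COLORS = 2   # fewer than this means a single color image


@dataclass
class ImageInfo:
    """Illustrative summary of one image extracted from a web document."""
    width: int
    height: int
    distinct_colors: int
    is_background: bool = False


def is_stop_image(img: ImageInfo) -> bool:
    """Return True if the image matches any Stop Image criterion."""
    if img.is_background:
        return True
    if img.width < MIN_WIDTH or img.height < MIN_HEIGHT:
        return True
    if img.distinct_colors < MIN_DISTINCT_COLORS:
        return True
    return False


def quality_images(images):
    """Keep only the images that are not Stop Images."""
    return [img for img in images if not is_stop_image(img)]
```

In practice the thresholds would need to be tuned on real pages; the point of the sketch is only that each criterion reduces to a cheap per-image test.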

2.4 Image Appeal Value

Image Appeal Value or Image Aesthetic Value (IAV) is necessary for choosing images in automatic albuming applications [6]. The image appeal value of an image depends on the contrast and colorfulness of the image [7]. There are several methods for calculating the contrast and colorfulness of an image or a region of an image, and different images can be compared using these metrics. The contrast value can be calculated using the following equations [7]:

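$$CN_i = \left( \frac{1}{n_i - 1} \sum_{j \in \mathrm{region}_i} (x_j - \bar{x})^2 \right)^{1/2}, \qquad \bar{x} = \frac{1}{n_i} \sum_{j \in \mathrm{region}_i} x_j$$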

An established method described in [11] is used to compute

the colorfulness of an image. It combines chroma magnitude

and color variance in the CIE-Lab color space. The following

equation is used to compute the colorfulness:

$$CF_i = \sigma_{a_i b_i} + 0.37 \, \mu_{a_i b_i}$$

According to the equation above, $\sigma_{a_i b_i}$ is the trigonometric length of the standard deviation in CIE-Lab space, and $\mu_{a_i b_i}$ is the distance of the centre of gravity in CIE-Lab space to the neutral color axis [7]. IAV is calculated as the average of contrast and colorfulness values in this work.
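The following is a rough sketch of how the contrast, colorfulness and IAV computations could be put together. It assumes the image region has already been converted to CIE-Lab and is passed in as plain lists of luminance, a and b values. The function names are hypothetical. Using the population standard deviation for the chroma channels, and averaging the two scores without any normalisation, are assumptions made here; the paper does not specify these details.

```python
import math
import statistics


def contrast(luminance):
    """CN_i: sample standard deviation (n_i - 1 denominator) of a region."""
    if len(luminance) < 2:
        return 0.0  # assumption: a single pixel has no contrast
    return statistics.stdev(luminance)


def colorfulness(a_values, b_values):
    """CF_i = sigma_ab + 0.37 * mu_ab in CIE-Lab space."""
    # Length of the (sigma_a, sigma_b) vector; population std is an assumption.
    sigma_ab = math.hypot(statistics.pstdev(a_values),
                          statistics.pstdev(b_values))
    # Distance of the chroma centre of gravity from the neutral axis.
    mu_ab = math.hypot(statistics.mean(a_values),
                       statistics.mean(b_values))
    return sigma_ab + 0.37 * mu_ab


def image_appeal_value(luminance, a_values, b_values):
    """IAV as the plain average of contrast and colorfulness."""
    return (contrast(luminance) + colorfulness(a_values, b_values)) / 2.0
```

Since contrast and colorfulness live on different scales, a real implementation would probably rescale both before averaging; that step is left out because the source does not describe it.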

3 RELATED WORKS

Web document ranking without any query is a part of the ranking framework or ranking function. It is usually referred to as static ranking. There have been many approaches for calculating link-based priors, such as PageRank [3], HITS [2], and SALSA [12], which are often used in web search. Research showed that including these priors can significantly

improve the performance of the ranking function [13] [14]

[15]. Click-based priors can also be a measure of relevance as

they are measured by information like how frequently a user

clicks a web document and how much time they spend on that

page. They have been proven beneficial for web search [16]

[17]. Document text quality was also taken into account to

calculate document quality based priors and incorporated into

the ranking function [1].

Link-based, click-based and document text quality based

priors do not take into account the quality of images present







in a web page. Features, based on document text, are used

to get a numerical value against the text quality of the

document text [1]. There has been research on the images

present in the web pages and their role in summarising the

contents of the document [18]. One study showed that, in the

evaluation of web pages, the presentation plays an effective

role [19]. Evaluation in the context of presentation considered

two aspects:

• Text presentation (font size, character).

• Multimedia presentation (image quality, image size, number of images in a page, resolution of video etc.).

However, there was no approach for integrating the presentation

aspect of a web page into the ranking function of a search

engine.

4 TWO NOVEL METRICS: ITSV AND IBQV

In this work, we have proposed a new image based feature

to incorporate into the ranking model, which is Image Based

Quality Value (IBQV). To calculate IBQV, we have used

a new metric ITSV (Image Text Similarity Value). In

this section, the computation process of these two values is

illustrated.

4.1 Image Text Similarity Value (ITSV)

ITSV is a simple variation of the document similarity value. If two documents D1 and D2 contain the same image, then the text based similarity value of the two documents is defined as their Image Text Similarity Value (ITSV). This value is between 0 and 1. To assess document similarity, the Vector Space Model (VSM) is used to represent each of the documents. Then, using the cosine similarity measure, the similarity value is obtained, and this value is the ITSV.
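As an illustration, ITSV for a pair of documents could be sketched as below. This is not the Lucene-based computation used in the experiments: raw term counts stand in for the term weights, tokenisation is a simple regular expression, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def itsv(text_1, text_2):
    """Cosine similarity between the texts of two documents sharing an image."""
    v1, v2 = term_vector(text_1), term_vector(text_2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm_1 = math.sqrt(sum(c * c for c in v1.values()))
    norm_2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_1 * norm_2)
```

With non-negative counts the result always falls in the range 0 to 1, matching the range stated above.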

4.2 Image Based Quality Value (IBQV)

IBQV is calculated with the help of an image search engine. IBQV is the value that is added as a static rank value of the web document. To find the value of IBQV for a web document, first we extract all the images from the web document. Then, from the set of images, Stop Images are identified and removed. The remaining images form the set of quality images for that web page. Next, each of the remaining images is presented to an image search engine and the top-k documents are retrieved. Each of the top-k documents is then taken one by one, and its ITSV is calculated with respect to the web document being considered for the calculation of IBQV. The sum of the ITSV values over these document pairs is divided by k to obtain the average ITSV for the image. The IAV of the image is calculated as described in Section 2.4 and combined with the average ITSV, and the combined value is added to a running IBQV total. Finally, IBQV is obtained by dividing this total by the number of quality images. Figure 1 shows the process of calculating IBQV for a single web document. The pseudo-code for calculating IBQV is given in Algorithm 1.

Algorithm 1. IBQV Calculation Algorithm

Input:
A web document D.
Set of images in D, I = {i1, i2, ..., in}, where n is the number of images in D.
Final Image Set F = {}.

Output:
IBQV value of D.

Method:
1. FOR each image ik ∈ I do
2.   if ik is not a Stop Image
3.     then F = F ∪ {ik}
4. END FOR
5. IBQV = 0
6. FOR each image fk ∈ F do
7.   submit fk to an image search engine as query
8.   K = { top-k documents from the search engine }
9.   ITSV = 0
10.  FOR each element ki ∈ K do
11.    find ITSVi of D and ki
12.    ITSV = ITSV + ITSVi
13.  END FOR
14.  ITSV = ITSV / |K|
15.  IAV = calculate IAV(fk)
16.  IBQV = IBQV + (ITSV + IAV) / 2
17. END FOR
18. IBQV = IBQV / |F|
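The steps of Algorithm 1 can be sketched in Python as follows. The image search engine, the Stop Image test, ITSV and IAV are passed in as callables, since their concrete implementations are outside the algorithm itself. All names are hypothetical. Two points are assumptions of this sketch: it reads line 16 as averaging ITSV and IAV (the operator is not legible in the source), and it returns 0 when there are no quality images or no retrieved documents, a case the pseudo-code does not cover.

```python
def compute_ibqv(document_text, images, is_stop_image, search_top_k, itsv, iav):
    """Image Based Quality Value of one web document (sketch of Algorithm 1).

    document_text : text of the document D being scored
    images        : images extracted from D
    is_stop_image : callable(image) -> bool
    search_top_k  : callable(image) -> list of texts of the top-k documents
    itsv          : callable(text, text) -> similarity in [0, 1]
    iav           : callable(image) -> image appeal value
    """
    # Lines 1-4: keep only the images that are not Stop Images.
    quality = [img for img in images if not is_stop_image(img)]
    if not quality:
        return 0.0  # assumption: no quality images means no image based value

    total = 0.0  # Line 5
    for img in quality:  # Lines 6-17
        retrieved = search_top_k(img)
        if retrieved:
            # Lines 9-14: average ITSV between D and each retrieved document.
            avg_itsv = sum(itsv(document_text, t) for t in retrieved) / len(retrieved)
        else:
            avg_itsv = 0.0  # assumption: nothing retrieved, no textual support
        # Lines 15-16: combine with the appeal value (read here as an average).
        total += (avg_itsv + iav(img)) / 2.0

    return total / len(quality)  # Line 18
```

Passing the collaborators in as functions keeps the sketch independent of any particular search engine API and makes it easy to exercise with stubs.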

5 INTEGRATION OF IBQV WITH MARKOV RANDOM FIELD MODEL

A Markov random field (MRF) is a graphical model where

the nodes correspond to random variables and the edges

are undirected. The edges define dependencies among the

variables. MRF can represent cyclic dependencies. Metzler

and Croft [4] proposed to model a joint relevance distribution

over a query Q = q1, ..., qn and a document D using MRF [1]. Figure 2 shows the MRF model for a document and a three

term query. As shown in the model, given D, non-adjacent

query terms are independent, but adjacent query terms are

dependent on each other as there is an edge between them.

Using the MRF, the joint distribution over the random

variables in the graph G can be calculated. In this process, a set of cliques C(G) in the graph G is found and a non-negative potential function is defined over the set of cliques. Given a query, Q, and document, D, the joint relevance distribution is expressed as:

$$P_{G,\Lambda}(Q,D) = \frac{1}{Z} \prod_{c \in C(G)} \psi(c;\Lambda)$$







Fig. 1. Computation procedure of IBQV

Fig. 2. MRF model with a sequential dependence assumption [1]

In the above equation, $Z$ is a normalizing constant and $\Lambda$ is a set of free parameters that are used within the potential functions. The potential function usually takes the following form:

$$\psi(c;\Lambda) = \exp\Big(\sum_i \lambda_i f_i(c)\Big)$$

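The score of a document D with respect to the query Q can be defined as [4]:

$$
\begin{aligned}
\mathrm{score}(Q,D) &= \log P_{G,\Lambda}(D \mid Q) \\
&= \log P_{G,\Lambda}(Q,D) - \log P_{G,\Lambda}(Q) \\
&= \sum_{c \in C(G)} \log \psi(c;\Lambda) - \log Z - \log P_{G,\Lambda}(Q) \\
&\stackrel{\mathrm{rank}}{=} \sum_{c \in C(G)} \log \psi(c;\Lambda)
\end{aligned}
$$

The last step is a rank equivalence: $\log Z$ and $\log P_{G,\Lambda}(Q)$ do not depend on D, so dropping them does not change the ordering of documents for a given query.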

Now, to instantiate the MRF model, a set of cliques, $C(G)$, and a set of potential functions, $\psi(c;\Lambda)$, over the cliques have to be defined. There are several possible instantiations,

based on the different dependence assumptions between the

document and the query terms [4]. The sequential dependence

instantiation has been shown as an effective instantiation

[4], [1]. The sequential dependence instantiation of the MRF

model, shown in Figure 2, assumes dependence only between

the adjacent query terms.

There are three types of cliques that can be found when considering the sequential dependency assumption. The first type of clique involves a single term node and a document node. The potential function for these cliques is defined as follows [1]:

$$\log \psi(q_i, D; \Lambda) = \lambda_T f_T(q_i, D)$$

Here $f_T(q_i, D)$ is a feature function defined over the query term $q_i$ and the document $D$, and $\lambda_T$ is a free parameter. The second type of clique involves two query terms and the document node. The potential functions over these cliques are defined as:

$$\log \psi(q_i, q_{i+1}, D; \Lambda) = \lambda_O f_O(q_i, q_{i+1}, D) + \lambda_U f_U(q_i, q_{i+1}, D)$$

where $f_O(q_i, q_{i+1}, D)$ and $f_U(q_i, q_{i+1}, D)$ are feature functions, and $\lambda_O$ and $\lambda_U$ are free parameters. These potentials are made up of two distinct components. The first considers ordered (exact phrase) matches and is denoted by the O subscript. The second, denoted by the U subscript, considers unordered matches [1].

The third type of clique contains the document node only. Bendersky et al. [1] defined the query independent potential function over this clique based on a set of quality based factors that reflect content clarity, document readability and ease of navigation. This query independent potential function, which relies only on text based features, takes the following form:

$$\log \psi(D; \Lambda) = \sum_{L \in \mathcal{L}(D)} \lambda_L f_L(D)$$

In the equation above, $\mathcal{L}(D)$ is the set of quality based factors associated with the document node $D$. A feature value is calculated for each quality based factor and multiplied by its parameter value. In this way, the quality value of the document can be obtained [1].

We propose the integration of IBQV with the final ranking

function, and after the integration, the equation for the final

score computation will take the following form:

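$$
\begin{aligned}
\mathrm{score}(Q,D) ={}& \lambda_T f_T(q_i, D) \\
&+ \lambda_O f_O(q_i, q_{i+1}, D) + \lambda_U f_U(q_i, q_{i+1}, D) \\
&+ \sum_{L \in \mathcal{L}(D)} \lambda_L f_L(D) \\
&+ \lambda_i \, \mathrm{IBQV}
\end{aligned}
$$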

Here, $\lambda_i$ is a free parameter which indicates the importance of IBQV in the ranking function. The value of $\lambda_i$ can be determined by a learning-to-rank [20] scheme. The parameter values for the other features were determined in [1] using a coordinate ascent algorithm proposed by Metzler and Croft [21]. In total, parameters for 13 features

were tuned using the algorithm [1]. But in that model, 10

features were document quality based and all of them were

text based features. No image based features were considered

for ranking the documents. Our assumption is that including the novel metric IBQV would improve the ranking framework, since quality images are assets of a web document. They increase the quality of the document and can act as discriminating agents when the overall ranking is performed.
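To make the combined score concrete, a small sketch of the linear combination is given below. It assumes the feature values have already been computed and are supplied as plain numbers. Following the rank-equivalent form of the score, which sums log-potentials over all cliques, the per-term and per-term-pair features are summed here; the `Weights` record, the function name and any weight values used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Weights:
    """Free parameters of the ranking function (illustrative container)."""
    term: float                 # lambda_T
    ordered: float              # lambda_O
    unordered: float            # lambda_U
    quality: Sequence[float]    # one lambda_L per text quality feature
    ibqv: float                 # lambda_i


def final_score(f_term, f_ordered, f_unordered, f_quality, ibqv, w):
    """Linear combination of query dependent, text quality and IBQV terms."""
    if len(f_quality) != len(w.quality):
        raise ValueError("one weight is needed per quality feature")
    # Query dependent part: summed over single-term and adjacent-pair cliques.
    query_part = (w.term * sum(f_term)
                  + w.ordered * sum(f_ordered)
                  + w.unordered * sum(f_unordered))
    # Query independent text quality part.
    quality_part = sum(lam * f for lam, f in zip(w.quality, f_quality))
    # Image based part proposed in this work.
    return query_part + quality_part + w.ibqv * ibqv
```

Because IBQV enters as one more weighted feature, setting its weight to zero recovers the text-only score, which makes the contribution of the image based term easy to isolate when tuning.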

6 EXPERIMENTAL RESULTS

For the purpose of the experiment, 40 documents from the Yahoo web directory were downloaded. Among them, 20 documents were

from the general health directory and the other 20 documents

were downloaded from general business directory. The web

pages were carefully chosen, so that they contained significant

number of images. Human judgment values for the image based quality rating of each web page were taken from three experts, and their judgment values ranged between 0 and 10. A value of 0 indicated that the images in the web document do not possess any quality to support the actual text content of the document and are less appealing. A value of 10 indicated that the images in the web document highly support the actual text content of the document and the document quality is excellent considering the images and text. The average of the values provided by the human judges for each document was calculated. Then, our

developed program calculated the IBQV values for each of

the documents. Apache Lucene [22] was used for calculating

document similarity based on vector space model. Google

Image Search Engine was used to find the web documents

containing a specific image.

Figure 3 graphically shows the IBQV values given by both the human judges and our program for general health related documents. The correlation coefficient between these values was calculated and found to be 0.65. Figure 4 graphically shows the IBQV values given by both the human judges and our program for general business related documents. In the case of Figure 4, the correlation coefficient was 0.71.
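A comparison of this kind could be reproduced with a sketch like the following. The paper does not state which correlation coefficient was used; Pearson's coefficient is assumed here, and the function name is hypothetical. No experimental data is included, so the caller supplies the averaged human scores and the computed IBQV values as two equal-length lists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Pearson's coefficient is insensitive to the scale of either list, so the 0 to 10 human ratings and the program's IBQV values can be compared directly without rescaling.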

The values provided by the human judges do not exactly match the values computed by our program. But in most cases, the computed values were close to the values provided by the human judges. This suggests that the proposed metric IBQV can estimate the appeal of the images in the web documents and the degree to which they support the content.

7 CONCLUSIONS

In this work, an attempt has been made to enhance the existing

ranking framework used by the current web search engines

by integrating image based features. Images have become

an indispensable part of a web page as they provide visual

information which is essential for prompt understanding of the

content. The appeal and relevance of the information conveyed

by images are crucial factors for determining the static rank

of a web document. Moreover, some images are advertisements, spam images, or images that are unnecessary for content enrichment. They drive users in the wrong direction and leave them unsatisfied with the low quality of information. Our proposed method can find documents that contain quality visual information and lift them up in the ranked list presented to the user. The metrics we proposed are quite scalable and can be easily integrated with a ranking framework because of the simplicity of the computation process.

REFERENCES

[1] M. Bendersky, W. B. Croft, and Y. Diao, "Quality-biased ranking of web documents," in Proceedings of the fourth ACM international conference on Web search and data mining, ser. WSDM '11. New York, NY, USA: ACM, 2011, pp. 95-104. [Online]. Available: http://doi.acm.org/10.1145/1935826.1935849

[2] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, pp. 604-632, September 1999. [Online]. Available: http://doi.acm.org/10.1145/324133.324140

[3] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Comput. Netw. ISDN Syst., vol. 30, pp. 107-117, April 1998. [Online]. Available: http://dx.doi.org/10.1016/S0169-7552(98)00110-X

[4] D. Metzler and W. B. Croft, "A markov random field model for term dependencies," in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '05. New York, NY, USA: ACM, 2005, pp. 472-479. [Online]. Available: http://doi.acm.org/10.1145/1076034.1076115

[5] S. N. Keegan, "Importance of visual images in lectures: Case study on tourism management students," Journal of Hospitality, Leisure, Sports and Tourism Education, vol. 6, pp. 58-65, 2007.







Fig. 3. IBQV values given by both human judges and our program for general health related documents.

Fig. 4. IBQV values given by both human judges and our program for general business related documents.

[6] A. E. Savakis, S. P. Etz, and A. C. P. Loui, "Evaluation of image appeal in consumer photography," B. E. Rogowitz and T. N. Pappas, Eds., vol. 3959, no. 1. SPIE, 2000, pp. 111-120. [Online]. Available: http://dx.doi.org/10.1117/12.387147

[7] P. Obrador and N. Moroney, "Low level features for image appeal measurement," pp. 72420T-72420T-12, 2009. [Online]. Available: http://dx.doi.org/10.1117/12.806140

[8] A. K. Moorthy, P. Obrador, and N. Oliver, "Towards computational models of the visual aesthetic appeal of consumer videos," in Proceedings of the 11th European conference on Computer vision: Part V, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 1-14. [Online]. Available: http://dl.acm.org/citation.cfm?id=1888150.1888152

[9] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, pp. 613-620, November 1975. [Online]. Available: http://doi.acm.org/10.1145/361219.361220

[10] L. Shapiro, Computer Vision, 1st ed. Prentice Hall, 2001.

[11] D. Hasler and S. E. Suesstrunk, "Measuring colorfulness in natural images," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, B. E. Rogowitz and T. N. Pappas, Eds., vol. 5007, Jun. 2003, pp. 87-95.

[12] M. A. Najork, "Comparing the effectiveness of hits and salsa," in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, ser. CIKM '07. New York, NY, USA: ACM, 2007, pp. 157-164. [Online]. Available: http://doi.acm.org/10.1145/1321440.1321465

[13] N. Craswell, S. Robertson, H. Zaragoza, and M. Taylor, "Relevance weighting for query independent evidence," in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '05. New York, NY, USA: ACM, 2005, pp. 416-423. [Online]. Available: http://doi.acm.org/10.1145/1076034.1076106

[14] W. Kraaij, T. Westerveld, and D. Hiemstra, "The importance of prior probabilities for entry page search," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '02. New York, NY, USA: ACM, 2002, pp. 27-34. [Online]. Available: http://doi.acm.org/10.1145/564376.564383

[15] J. Peng, C. Macdonald, B. He, and I. Ounis, "Combination of document priors in web information retrieval," in Large Scale Semantic Access to Content (Text, Image, Video, and Sound), ser. RIAO '07. Paris, France: LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE, 2007, pp. 596-611. [Online]. Available: http://dl.acm.org/citation.cfm?id=1931390.1931446

[16] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li, "Browserank: letting web users vote for page importance," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '08. New York, NY, USA: ACM, 2008, pp. 451-458. [Online]. Available: http://doi.acm.org/10.1145/1390334.1390412

[17] M. Richardson, A. Prakash, and E. Brill, "Beyond pagerank: machine learning for static ranking," in Proceedings of the 15th international conference on World Wide Web, ser. WWW '06. New York, NY, USA: ACM, 2006, pp. 707-715. [Online]. Available: http://doi.acm.org/10.1145/1135777.1135881

[18] E. Baratis, E. G. M. Petrakis, and E. E. Milios, "Automatic website summarization by image content: A case study with logo and trademark images," IEEE Trans. Knowl. Data Eng., vol. 20, no. 9, pp. 1195-1204, 2008.

[19] O. Signore, "A comprehensive model for web sites quality," in WSE, 2005, pp. 30-38.

[20] T.-Y. Liu, "Learning to rank for information retrieval," Found. Trends Inf. Retr., vol. 3, pp. 225-331, March 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1618303.1618304

[21] D. Metzler and W. Bruce Croft, "Linear feature-based models for information retrieval," Inf. Retr., vol. 10, pp. 257-274, June 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1265488.1265494

[22] A. S. Foundation, "Apache lucene - apache lucene core," http://lucene.apache.org/core/, 2011.

Sheikh Muhammad Sarwar is currently working as a researcher in the department of CSE, University of Dhaka, Bangladesh. He completed his M.Sc. and B.Sc. from University of Dhaka. His research interests include Information Retrieval, Image Processing, Quantum Computing etc. He published his research paper in an international conference. He received scholarship for his result in B.Sc. from University of Dhaka.

Md. Mustafizur Rahman is currently working as an Associate Professor in Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh. He obtained his B.Sc. and M.Sc. from University of Dhaka. He completed his PhD. from Kyung Hee University, South Korea. His research interests include Mobile Ad-hoc Network, Wireless Mesh Network, Information Retrieval etc.

Mosaddek Hossain Kamal is currently working as an Associate Professor in Department of Computer Science and Engineering, University of Dhaka, Dhaka, Bangladesh. He obtained his B.Sc. from University of Dhaka, Bangladesh and M.Sc. from University of New South Wales, Australia. His research interests include Mobile Middleware, Routing Algorithms, Information Retrieval etc.