
Text-Based Web Page Classification with Use of Visual Information

Vladimír Bartík
Dept. of Information Systems, Faculty of Information Technology
Brno University of Technology, Brno, Czech Republic
e-mail: [email protected]

Abstract—As the number of pages on the web grows continuously, there is a need to classify pages into categories to facilitate indexing and search. The method proposed here uses both textual and visual information to find a suitable representation of web page content. Several term weights based on TF or TF-IDF weighting are proposed; the modifications draw on the visual areas in which the text appears and on their visual properties. Results of experiments are included in the final part of the paper.

Keywords - web page classification, term weights, text classification, TF-IDF weight, visual information, visual blocks.

I. INTRODUCTION

As the amount of information provided by the World Wide Web (WWW) permanently increases, there is a need to obtain useful knowledge from it. Web page classification, also known as web page categorization, is the process of assigning web pages to one of several predefined classes. Classification of web pages is essential to many web information retrieval tasks, such as indexing pages, improving web search, or constructing web directories.

In the beginning, classification methods were applied primarily to structured databases. To classify semi-structured data in the form of web pages, we have to find a representation of web page content that is suitable for classification methods. Two main types of information are contained on a web page: the visual structure formed by the HTML code, which carries information about the visual blocks on a page, and unstructured information in the form of text.

Contemporary methods for the classification of Web data work mainly with the text information present on a Web page. These text-based classification methods usually use the bag-of-words representation of document content: a document is represented by a vector of TF/IDF weights assigned to the individual terms.

However, this representation of a document does not capture visual information. On the other hand, some classification methods based on web page structure have been proposed; there, the structure of a document, along with its pictures, serves as the input of the classification algorithm, but the text information is not taken into consideration.

Because text plays a crucial role in representing the content of most web pages, it is appropriate to enrich the text representation with visual information. One possibility is to use information from HTML tags to improve term weighting. This allows capturing several properties of the text, such as font size or text color, and using them to modify the weights of text terms. However, this approach does not reflect other visual properties of a web page, such as the layout or the location of an element.

To capture this type of visual information, we can use page segmentation methods. Segmentation is the process of detecting the organization of visual blocks on a page and analyzing the properties of its component visual elements. The data obtained by segmentation can also be used to improve the web page representation. Segmentation algorithms usually work with rendered web documents and their visual representation.

In this paper, we introduce a new way of modifying term weighting with visual information. A segmentation algorithm is used to detect the visual blocks that form the web page. We are then able to classify the visual blocks into predefined categories, such as heading, main text of a page, advertisement, or navigation.

These types of visual blocks differ in their importance for the representation of a page, and this importance should be taken into account when classifying whole web pages. Therefore, we present modifications of term weights that capture this information.

First, we briefly describe the classification of visual blocks. Next, the modifications of term weighting are introduced, and the results of web page classification experiments are described in the final part of this paper.

II. RELATED WORK

A. Term Weighting for Classification

Classification of web pages based on text extracted from the pages is the most common way to classify them. The basic methods use a bag-of-words representation with TF or TF/IDF weights [1]. TF denotes the term frequency in a document. It should be normalized, for example as:

\mathrm{TF}(t,d) = 0.5 + 0.5 \cdot \frac{f(t,d)}{\mathrm{MaxFreq}(d)} ,   (1)


where f(t, d) is the frequency of term t in document d and MaxFreq(d) is the maximum frequency of any term in the document. It is also necessary to consider the inverse document frequency (IDF), which represents the general importance of a term among all documents. The resulting TF/IDF weight is obtained as:

\mathrm{TF/IDF}(t,d) = \mathrm{TF}(t,d) \cdot \left( 1 + \log \frac{n}{k} \right) ,   (2)

where n is the number of all documents in the dataset and k is the number of documents containing the term t.
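As a concrete illustration, the following minimal Python sketch computes the weights of equations (1) and (2). The toy corpus, the function names, and the use of the natural logarithm are assumptions for illustration only; the paper does not fix a logarithm base or a particular implementation.

```python
import math
from collections import Counter

def tf(term, counts, max_freq):
    # Normalized term frequency, equation (1):
    # TF(t, d) = 0.5 + 0.5 * f(t, d) / MaxFreq(d)
    return 0.5 + 0.5 * counts[term] / max_freq

def tf_idf(term, counts, max_freq, n_docs, doc_freq):
    # TF/IDF weight, equation (2): TF(t, d) * (1 + log(n / k)).
    return tf(term, counts, max_freq) * (1.0 + math.log(n_docs / doc_freq))

# Toy corpus: each document is a list of already preprocessed terms.
docs = [["web", "page", "web"], ["page", "text"], ["web", "text", "text"]]
counts = [Counter(d) for d in docs]
n = len(docs)
k = sum(1 for c in counts if "web" in c)   # documents containing "web"
print(tf_idf("web", counts[0], max(counts[0].values()), n, k))  # ~1.405
```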

Another frequently used text representation is the N-gram representation [2], which allows a document to be represented by terms consisting of more than one word.

As mentioned above, the importance of various visual blocks differs; therefore, it is necessary to reflect this in the document representation. In [3], the use of information derived from the HTML tags of a page for classification is proposed. A similar method, in which the HTML tags are divided into three groups with a different term importance in each group, is described in [4]. The main disadvantage of these two approaches is that different web pages are formed differently, and the same information is often represented in different ways in HTML.

The method proposed in [5] takes four aspects of term frequency into consideration: the term frequency itself, the term frequency in headings, the frequency of emphasized words in the text, and a word position function assuming that the first and last quarters of the text are the most relevant. These four aspects are combined to form the resulting term weights.

B. Other Methods for Document Representation

In [6], TF/IDF weights are replaced by keywords extracted from informative web page blocks. Informative blocks are discovered via the DOM tree of a page, using the assumption that non-content blocks appear repeatedly in the DOM tree.

Methods that use a graph representation instead of weight vectors have also been proposed. A graph representation can also keep the structural information about the document. If the similarity between two graphs can be computed, lazy classification algorithms (e.g., k-nearest neighbor) can be used for classification [7]. A combination of graph and vector representations is proposed in [8]: graphs are processed by a frequent sub-graph mining method, and the frequent sub-graphs obtained become attributes of the vector representation.

A method that represents a web document according to its visual properties is proposed in [9]. A visual adjacency multigraph representation is presented there, which is able to capture information about the mutual position of the visual parts on a page as well as the contents of the component parts.

C. Segmentation of Web Pages

To perform segmentation of a web document, it is necessary to render the document and obtain information about the visual areas of the rendered document (not about the HTML tags).

Several segmentation algorithms have been proposed. Probably the best-known segmentation algorithm is VIPS, presented in [10]. Its result is a tree of visual areas independent of the HTML tags. A document is divided into visual blocks based on its visual cues, such as different fonts or colors, lines, and other separators.

Another segmentation method, the one used in the approach described here, is proposed in [11]. This method uses a bottom-up approach to find the visual areas. The result is a set of rectangular visual blocks that are visually separated from the remainder of the web page.

III. ROLE OF VISUAL INFORMATION IN WEB PAGE CLASSIFICATION

As mentioned above, visual information about web page areas can be used to improve web page classification. If visual information is considered, it is necessary to distinguish between content and non-content blocks.

If we are able to obtain the visual properties and the position of each visual area automatically, it is possible to determine the importance of each area. This is essential because most of the space of a web document is occupied by information that is not relevant to the main topic of the document and should not be considered in the document representation: advertisements, navigation bars, links to other pages, and copyright information are examples of unimportant areas of a page.

A. Classification of Web Page Visual Blocks

According to these remarks, to obtain a good document representation resulting in accurate classification of pages, we have to distinguish between the following types of visual areas:

• Headings: the main heading and the subheadings contained on the page, typically characterized by a larger font size than the rest of the page.

• Main text: the main content of the page; the most important information, which should be included in the representation of a document.

• Date/authors: information about the web page authors or the date of page creation; unimportant for further processing.

• Navigation bar: links to other parts of a web site, typically constant for all pages of the site; also unimportant for the page representation.

• Links: links to other pages related to the current page; some of these links can be irrelevant.

• Others: remaining parts of a page, e.g., the caption of a whole page or advertisements; also unimportant.

If we have the necessary information about the visual properties and the position of a block, the categories described above can be assigned to page blocks via classification. A detailed description of this classification approach, of the visual properties used to classify web page areas, and of the experimental results is given in [12].


B. Role of Categories of Visual Blocks in Page Classification

If we have information about the category of each visual block of a web page, obtained by classification, it is possible to use it to enrich the standard TF/IDF scheme. Text terms contained in the most important parts receive the highest weight (a standard weight multiplied by a coefficient), terms in less important parts keep a standard weight, and the least important areas are omitted from the representation of the document contents. Then, some classification method can be used to assign a category to whole pages.

In the next section, several variants of the TF/IDF scheme are described in detail. They differ primarily in the coefficients used to express the weight of the constituent visual block categories.

IV. WEB PAGE REPRESENTATION WITH MODIFIED WEIGHTS

A. Text Preprocessing

Before we can count and assign weights to the terms, it is necessary to perform stop word removal and stemming.

Stop word removal discards non-content words that appear in most text documents, such as “the”, “in”, or “this”. Its main purpose is to reduce the number of index terms while keeping the words that are important for representing the contents of a document.

Stemming is the process of reducing words to their stems (root forms). The reason for stemming is to unify words with a similar meaning into one index term. It is implemented by a set of rules that remove prefixes and suffixes of words. In our implementation, the Porter stemmer [13] has been used.

After that, we remove the words that appear in only a very small number of documents, because they are likewise unimportant for document classification.
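A minimal sketch of this preprocessing pipeline might look as follows. NLTK is an assumption here (the paper cites the Porter algorithm [13] but does not name a library), and the min_df threshold is an illustrative value.

```python
from collections import Counter
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stop words, then stem.
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOP]
    return [stemmer.stem(w) for w in words]

def prune_rare_terms(token_docs, min_df=2):
    # Drop terms appearing in fewer than min_df documents; min_df = 2 is an
    # assumed threshold ("a very small number of documents" in the paper).
    df = Counter(t for doc in token_docs for t in set(doc))
    return [[t for t in doc if df[t] >= min_df] for doc in token_docs]
```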

B. Modifications of Standard Weights

In this subsection, we describe various possibilities for modifying the TF, IDF, and TF/IDF weights according to knowledge about the categories of the visual blocks to which the parts of the text belong. First, the possible modifications of the TF weight are summarized:

• We can set the weight of a term according to the visual block in which it appears. For example, a word in a “heading” block can have a higher importance than a word from a “links” block. This is ensured by multiplying the weight by a coefficient set for each category.

• Some parts of a web page can be omitted from weighting entirely. Some categories, such as navigation bars or date/authors, do not have content related to the topic of the whole page; these non-content categories must be selected.

• It is also possible to modify the equation for the normalized term frequency – see equation (1). The MaxFreq value is the maximum term frequency in a document. We can decide whether this value is counted over the whole document, consisting of all visual blocks, or over the content parts only. The first way better reflects the overall size of a page; the second reflects the length of the main text content of a page.

• It is not certain whether blocks of the category “links” should be omitted from the representation. If these blocks are omitted, we can lose some information about links to related pages; on the other hand, many links refer to irrelevant pages.

According to the remarks above, we can define a modified term frequency to represent the text contents of a web document.

Assume that we have a set of web documents D = {d1, …, dn} and a set of terms (words) T = {t1, …, tm} that occur in the documents from D.

After rendering the documents from D, we can divide each document into visual blocks, which can be classified into several classes. Let us denote the vector of class labels as C = (c1, …, ck). Each class is assigned a coefficient according to the significance of the corresponding visual blocks; we denote the vector of coefficients as V = (v1, …, vk), where vj is the coefficient of the visual block class cj.

Then, the modified term frequency of a term t ∈ T in a web document d ∈ D is defined as:

\mathrm{MTF}(t,d) = \sum_{i=1}^{k} F(t,d,c_i) \cdot v_i ,   (3)

where F(t, d, ci) is the frequency of term t in all blocks of class ci in document d. The resulting term weight is thus obtained as the sum of the weights over the component visual block classes.

This modified term frequency should also be normalized, as in equation (1). The normalized modified term frequency is defined as:

\mathrm{nMTF}(t,d) = 0.5 + 0.5 \cdot \frac{\mathrm{MTF}(t,d)}{\mathrm{MaxFreq}_V(d)} ,   (4)

where MaxFreqV(d) is the maximum frequency of any term in the content parts of the document. The contents of visual blocks classified into classes whose coefficient in the vector V equals zero are not considered when computing this MaxFreq value.
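A minimal Python sketch of equations (3) and (4) follows. The representation of a segmented document as (class, terms) pairs is an assumption, the coefficient values mirror the setting used in the experiments in Section V (headings 5, main text 2, links 1, others 0), and MaxFreq_V is read here as the raw maximum term frequency over content blocks, which is one plausible reading of the definition above.

```python
from collections import Counter

# Coefficient vector V; values follow the experimental setting in Section V
# (headings 5, main text 2, links 1, remaining classes 0).
V = {"heading": 5, "main_text": 2, "links": 1,
     "date_authors": 0, "navigation": 0, "others": 0}

def block_counts(doc):
    # doc: a segmented, block-classified document as (class, terms) pairs;
    # returns per-class term counts.
    per_class = {}
    for cls, terms in doc:
        per_class.setdefault(cls, Counter()).update(terms)
    return per_class

def mtf(term, per_class):
    # Modified term frequency, equation (3): sum_i F(t, d, c_i) * v_i.
    return sum(cnt[term] * V.get(cls, 0) for cls, cnt in per_class.items())

def max_freq_v(per_class):
    # MaxFreq_V(d): maximum raw frequency of any term over content blocks
    # only (classes with a non-zero coefficient in V); assumed reading.
    content = Counter()
    for cls, cnt in per_class.items():
        if V.get(cls, 0) > 0:
            content.update(cnt)
    return max(content.values()) if content else 1

def nmtf(term, per_class):
    # Normalized modified term frequency, equation (4).
    return 0.5 + 0.5 * mtf(term, per_class) / max_freq_v(per_class)

doc = [("heading", ["web", "classification"]),
       ("main_text", ["web", "page", "classification", "web"]),
       ("navigation", ["home", "contact"])]
print(nmtf("web", block_counts(doc)))  # MTF = 1*5 + 2*2 = 9, MaxFreq_V = 3 -> 2.0
```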

This kind of information can also be used during the preprocessing of documents. We can remove from the representation the words that occur only in non-content blocks, and thus reduce the size of the weight vectors even before the representation is created.
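Reusing the vector V from the previous sketch, such a pruning step might look as follows; content_vocabulary is a hypothetical helper, not a function from the paper.

```python
def content_vocabulary(block_docs):
    # Collect terms that occur in at least one content block (non-zero
    # coefficient in V) of at least one document; words confined to
    # non-content blocks are then dropped before weighting.
    vocab = set()
    for doc in block_docs:
        for cls, terms in doc:
            if V.get(cls, 0) > 0:
                vocab.update(terms)
    return vocab
```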

C. Modification of Inverse Document Frequency

The modification of the inverse document frequency is similar to the modifications of the term frequency. In this case, the categories of visual blocks are also considered when determining the number of documents containing the term – see equation (2). The modified inverse document frequency is defined as:


\mathrm{MIDF}(t) = 1 + \log \frac{n}{k_V} ,   (5)

where t is a term from the set of terms T, n is the number of all documents in the dataset, and kV is the number of documents whose content visual blocks (those with a coefficient higher than zero in the vector V) contain the term t at least once.

The resulting modified TF/IDF weight is obtained as the product of the modified term frequency and the (possibly modified) inverse document frequency.
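Completing the sketch, equation (5) and the resulting weight can be written as follows, reusing V and nmtf from above; returning 0 for a term that never occurs in content blocks (k_V = 0) is an assumption, since the paper does not discuss that case.

```python
import math

def midf(term, corpus):
    # Modified inverse document frequency, equation (5):
    # MIDF(t) = 1 + log(n / k_V), where k_V counts the documents whose
    # content blocks (non-zero coefficient in V) contain the term.
    n = len(corpus)
    k_v = sum(1 for doc in corpus
              if any(V.get(cls, 0) > 0 and term in terms for cls, terms in doc))
    return 1.0 + math.log(n / k_v) if k_v else 0.0

def mtf_idf(term, per_class, corpus):
    # Resulting modified weight: nMTF(t, d) * MIDF(t).
    return nmtf(term, per_class) * midf(term, corpus)
```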

V. EXPERIMENTAL RESULTS OF CLASSIFICATION

Several variants of the modifications and their influence on classification accuracy are described in this section.

The WEKA tool has been used for the experiments. We have chosen the four classifiers that gave the best results on our data – two Bayesian classifiers (Naïve Bayes and Bayes Net), a tree-based classifier (FT – Functional Trees), and a Support Vector Machines classifier (SMO – Sequential Minimal Optimization).

A. Description of Datasets Used for Experiments

Two datasets containing web documents have been used for the experiments described here.

The first of them, the freely available WebKB corpus of web pages, has been used to verify the functionality of the method. It contains 4518 web pages from computer science department websites, classified into six categories – course, department, faculty, project, staff, and student.

The second dataset was created manually. It contains web pages taken from several English-language news websites (CNN.com, Reuters.com, nytimes.com, boston.com, and usatoday.com). These pages have been manually annotated and categorized into six topics: politics, business, sport, art, health, and science. In total, the dataset contains almost 500 pages, distributed approximately evenly over the six topics.

B. Experiments with the WebKB Dataset

The WebKB dataset can be used to verify the functionality of the classification, but not to compare the standard term weighting with the modified weights, because this dataset contains relatively old web pages with a minimum of non-content blocks. Most pages contain no navigation bars, links, or advertisements; most of their content consists of the main text, headings, and some date/authors information.

TABLE I. CLASSIFICATION RESULTS FOR THE WEBKB DATASET (ACCURACY, %)

Classifier      TF     TF/IDF   MTF/IDF
Naïve Bayes     68.8   74.8     78.6
Bayes Net       76.4   77.3     80.7
Funct. Trees    83.0   81.4     78.8
SMO             75.1   80.1     72.7

Therefore, the difference in classification accuracy between the standard and modified weights is very small. This is caused by the fact that the visual information has very little influence on the modified term weighting here. The accuracy was approximately 80% for both weightings, as shown in Table I.

The modified TF/IDF weights were used with the following coefficients for the visual blocks: main text 2, headings 5, links 1, and all other blocks 0.

C. Experiments with a Dataset of Pages from News Websites

The second dataset consists of web pages with many non-content blocks; therefore, the modified weighting is expected to have a greater influence on the accuracy of page classification.

First, it is necessary to compare the classification accuracy of the standard and the modified term weighting to see the effect of the modifications. Table II shows the results with the standard TF and TF/IDF weights and with the modified weight.

TABLE II. COMPARISON OF STANDARD AND MODIFIED WEIGHTING (ACCURACY, %)

Classifier      TF     TF/IDF   MTF/IDF
Naïve Bayes     86.7   80.9     86.1
Bayes Net       89.3   90.6     93.4
Funct. Trees    87.0   88.6     90.9
SMO             85.4   88.0     90.5

The coefficients for the modified weights of the visual blocks are set as in the previous experiment.

As Table II shows, the modified weights lead to better classification accuracy for most of the classification methods; the only exception is Naïve Bayes, which performs slightly better with the plain TF weights.

The objective of the second experiment was to discover whether the visual blocks classified as “links” are worth including in the web document representation. This is controlled by the coefficient for the “links” category in the vector V. Three values of this coefficient have been tested: 0 (links excluded), 1 (low importance), and 5 (high importance).

TABLE III. COMPARISON OF VARIOUS “LINKS” COEFFICIENT SETTINGS (ACCURACY, %)

Classifier      v_links = 0   v_links = 1   v_links = 5
Naïve Bayes     83.4          86.1          84.9
Bayes Net       87.8          93.4          92.3
Funct. Trees    83.1          90.9          88.2
SMO             80.0          90.5          90.4

The results lead to the conclusion that links are also useful for the representation of the whole web document. Including links with a small coefficient yields a small increase in classification accuracy; increasing the coefficient further, however, does not bring better classification results.

The third experiment focuses on the MaxFreq value, which can be computed in two different ways – see equations (1) and (4): over the whole document, or with respect to the content blocks only. The coefficient settings are again the same as in the first experiment.


TABLE IV. CLASSIFICATION RESULTS FOR DIFFERENT MAXFREQ (ACCURACY, %)

Classifier      MaxFreq   MaxFreq_V
Naïve Bayes     86.1      81.6
Bayes Net       93.4      92.2
Funct. Trees    90.9      86.0
SMO             90.5      89.2

The modification of the maximum frequency does not improve classification accuracy. As the results in Table IV show, the accuracy with the modified MaxFreq value is worse for all four classification methods.

The last experiment examines the influence of the inverse document frequency and of its modification on the classification.

Three measurements were performed: in the first, only the modified TF weight was used (without IDF); in the second, the modified TF/IDF weight with the standard IDF computation was used; and in the last, the modified TF/IDF with the modified IDF weight – see equation (5) – was used. The coefficients for the visual blocks are again set as in the first experiment.

TABLE V. COMPARISON OF STANDARD AND MODIFIED IDF WEIGHTING (ACCURACY, %)

Classifier      MTF    MTF/IDF   MTF/MIDF
Naïve Bayes     88.6   86.1      85.3
Bayes Net       91.9   93.4      92.6
Funct. Trees    84.5   90.9      88.9
SMO             81.6   90.5      90.8

The results are clearly better when either of the IDF weights is used (the Naïve Bayes method is the only exception). The difference between the two IDF weights is small; the modified IDF does not bring any improvement in accuracy.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented a new way of representing web page content based on visual features. The visual features are used to modify the term weights commonly used to represent the text content of a document. The visual information is obtained by page rendering and segmentation, and it is then used to express the significance of the component text terms on a page. This is achieved by various modifications of the standard TF/IDF term weights.

Several ways of modification have been proposed here, and the experiments demonstrated the improvement in web page classification achieved with these modifications. A comparison of all the variants of the weighting modifications has also been presented.

In future research, we are going to join the process of visual block classification and the text-based classification into a single two-phase classification process, which will make the whole procedure automatic. Another open issue is finding the optimal setting of the coefficients expressing visual block significance; in this paper, we have presented only a few possible settings. The concept of modified term weights could also be useful for other web mining tasks, for example clustering of web pages, which can be used to find similar pages within a dataset.

ACKNOWLEDGMENT

This research has been supported by the Research Plan No. MSM0021630528 – “Security-Oriented Research in Information Technology” and by the BUT FIT grant No. FIT-10-S-2 – “Recognition and Presentation of Multimedia Data”.

REFERENCES

[1] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, Vol. 24, No. 5, pp. 513–523, 1988.

[2] D. Mladenic, “Turning Yahoo into an automatic Web-page classifier,” in Proceedings of the European Conference on Artificial Intelligence (ECAI’98), pp. 473–474, 1998.

[3] K. Golub and A. Ardo, “Importance of HTML structural elements and metadata in automated subject classification,” in Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Lecture Notes in Computer Science, Vol. 3652, pp. 368–378, Springer, Berlin, Germany, 2005.

[4] O.-W. Kwon and J.-H. Lee, “Text categorization based on k-nearest neighbor approach for Web site classification,” Information Processing and Management, Vol. 39, No. 1, pp. 25–44, Pergamon Press, 2003.

[5] V. Fresno and A. Ribeiro, “An analytical approach to concept extraction in HTML environments,” Journal of Intelligent Information Systems, Vol. 22, No. 3, pp. 215–235, Springer, 2004.

[6] S. Lee, M. Jung, and E. Lee, “A novel Web page analysis method for efficient reasoning of user preference,” in Proceedings of the 8th Asia-Pacific Conference on Computer-Human Interaction, Seoul, Korea, Lecture Notes in Computer Science, Vol. 5068, pp. 86–93, Springer-Verlag, Berlin, Heidelberg, 2008.

[7] A. Schenker, M. Last, H. Bunke, and A. Kandel, “Classification of Web documents using graph matching,” International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, pp. 475–496, 2004.

[8] A. Markov and M. Last, “A simple, structure-sensitive approach for Web document classification,” in Proceedings of the Third International Atlantic Web Intelligence Conference (AWIC 2005), Lodz, Poland, Lecture Notes in Computer Science, Vol. 3528, pp. 293–298, Springer, 2005.

[9] M. Kovacevic, M. Diligenti, M. Gori, and V. Milutinovic, “Visual adjacency multigraphs – a novel approach for a Web page classification,” in Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM 2004), pp. 38–49, 2004.

[10] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: a vision-based page segmentation algorithm,” Microsoft Research, 2003.

[11] R. Burget, “Automatic document structure detection for data integration,” in Proceedings of Business Information Systems (BIS 2007), Poznan, Poland, Lecture Notes in Computer Science, Vol. 4439, pp. 391–397, 2007.

[12] R. Burget and I. Rudolfová, “Web page element classification based on visual features,” in Proceedings of the First Asian Conference on Intelligent Information and Database Systems (ACIIDS 2009), pp. 67–72, 2009.

[13] M. F. Porter, “An algorithm for suffix stripping,” Program, Vol. 14, No. 3, pp. 130–137, 1980.
