

THE EFFECT OF TERM WEIGHTING MEASURES ON FEATURE SELECTION

LATIFAH LOR ABDULLAH

A dissertation submitted in partial fulfillment of the requirements for the degree of

Master of Advanced Information Technology

Faculty of Computer Science and Information Technology UNIVERSITI MALAYSIA SARAWAK

2007


Declaration

I certify that all work in this dissertation was carried out between October 2004 and March 2007 and has not been submitted for any academic award at other colleges, institutes or universities. The work presented here was carried out under the supervision of Mr. Bong Chih How and Associate Professor Narayanan Kulathuramaiyer. All other work in this thesis is my own except where otherwise noted.

Signed,

Latifah Loh Abdullah

March, 2007

Acknowledgment

I would like to thank Assoc. Prof. Narayanan Kulathuramaiyer for his inspiring

comments and proofreading. Many thanks to Mr. Bong Chih How for his valuable ideas,

helpful suggestions and comments, careful proofreading and for extending the code used

in the experiments. Special thanks to my husband and families for their love and

support, which gave me courage to complete this dissertation. Last but not least, I would

like to thank all my colleagues (CICTS and FCSIT) for their cooperation in many ways.


Table of Contents

Declaration
Acknowledgment
Table of Contents
List of Figures
List of Tables
Abstract
Abstrak

Chapter 1: Overview
    1.1 Background
    1.2 Objectives
    1.3 Problem Statement
    1.4 Purpose of Study
    1.5 Research Significance
    1.6 Scope of Work
    1.7 Chapters Overview

Chapter 2: Literature Review
    2.1 Introduction
    2.2 Feature Selection
        2.2.1 Information Gain
        2.2.2 Chi-Square
        2.2.3 Correlation Coefficient
        2.2.4 Odds Ratio
        2.2.5 GSS Coefficient
        2.2.6 Categorical Term Descriptor
    2.3 Text Classifier
    2.4 Measurement for Category Assignments
    2.5 Conclusions

Chapter 3: Methodology
    3.1 Introduction
    3.2 The Framework
    3.3 Datasets
    3.4 Feature Selection Methods
        3.4.1 Categorical Term Descriptor (CTD)
        3.4.2 Other Feature Selection Methods (CTF, CTFICF, CTFIDF)
        3.4.3 Common Term Factor (CommTF)
    3.5 Multinomial Naive Bayes
    3.6 Measurement
    3.7 Conclusion

Chapter 4: Experimental Approach
    4.1 Introduction
    4.2 Experiments
    4.3 The Evaluations of CTF, CTFICF, CTFIDF and CTD as Feature Selection
    4.4 Common Term Factor (CommTF)
    4.5 Semantic Analysis of Selected Terms to Their Category
    4.6 Conclusion

Chapter 5: Experimental Results and Analysis
    5.1 Introduction
    5.2 CTF, CTFICF, CTFIDF and CTD as a Feature Selection Method
        5.2.1 Average of Overall Categories in SITE95-99 and Reuters-21578
        5.2.2 Individual Categories in SITE95-99 and Reuters-21578
            5.2.2.1 SITE95-99
            5.2.2.2 Reuters-21578
        5.2.3 Discussion on Feature Selection Analysis
    5.3 Misclassification
    5.4 Common Term Factor (CommTF)
    5.5 Semantic Analysis of Relevance of the Selected Terms to Their Categories
        5.5.1 Mathematics
        5.5.2 Carcass
        5.5.3 Discussion on Semantic Analysis
    5.6 Conclusion

Chapter 6: Conclusion
    6.1 Introduction
    6.2 Discussion
    6.3 Future Work

References
Appendices
    A. The 80 selected terms for Mathematics category
    B. The 80 selected terms for Mathematics, in ascending order
    C. The 80 selected terms for Carcass category
    D. The 80 selected terms for Carcass, in ascending order


List of Figures

Figure 1. Micro-average F-Measure for six feature selection measures on Reuters-21578
Figure 2. Micro-average F1 for six feature selection measures on SITE95-99
Figure 3. Macro-averaged F-Measure in the SITE95-99 dataset
Figure 4. Macro-averaged F-Measure in the Reuters-21578 dataset
Figure 5. The percentage numbers of categories under each group
Figure 6. Macro-average F-Measure for CommTF in the SITE95-99 dataset
Figure 7. Macro-average F-Measure for CommTF in the Reuters-21578 dataset


List of Tables

Table 1. Feature selection measures
Table 2. SITE95-99 training and testing set
Table 3. Categories in the SITE95-99 dataset
Table 4. Categories in the Reuters-21578 dataset
Table 5. The proposed feature selection methods used in the experiment
Table 6. The highest score for F-Measure in each category of SITE95-99
Table 7. Top 40 terms selected for the Simulation(14) category
Table 8. Example of 5 unique terms selected by feature selection, CTFICF
Table 9. Example of terms selected by feature selection for the Instructional Design(6) category
Table 10. Comparison between the existence of the terms within and outside the Instructional Design(6) category
Table 11. Summary of categories in each effect comparison of CTFICF, CTFIDF and CTD
Table 12. The categories selected from the Reuters-21578 dataset
Table 13. The highest score for the selected categories of Reuters-21578
Table 14. The top 11 terms selected by CTD
Table 15. The highest score for CTFIDF and CTD before and after misclassification for the selected categories of SITE95-99 and Reuters-21578
Table 16. Top 80 terms for the Mathematics(7) category
Table 17. Terms selected in CTD, CTF and CTFICF, in ascending order
Table 18. Terms selected in CTF and CTFICF but not selected by CTD, in ascending order
Table 19. Selected terms ranked by CTF and CTFICF
Table 20. Terms selected in CTD and CTFICF, sorted in ascending order
Table 21. Terms selected in CTD but not in CTF or CTFICF, in ascending order
Table 22. Comparison of a term's frequency and the number of documents containing the term
Table 23. Terms selected in CTD, CTF and CTFICF, in ascending order
Table 24. Terms selected in CTF and CTFICF, in ascending order
Table 25. Terms selected in CTFICF and CTD, in ascending order
Table 26. Terms selected in CTD but not selected by CTF and CTFICF, sorted in ascending order


Abstract

Feature selection is an important stage in any text mining classification technique. In this dissertation, we study and analyze the Categorical Term Descriptor (CTD) (Bong, C.H., 2001) feature selection method, which gives accuracy comparable to other well-known feature selection methods such as Information Gain and Chi-Square. Our goal is to evaluate the significance of each term weighting measure that forms the CTD method. Our experimental results have shown that CTD does not handle datasets that contain misclassifications. We have shown that CTD performs well on categories that are distinct, as opposed to general and miscellaneous categories. We have identified that a categorical term's frequency (CTF) and its discriminative capability across categories (ICF) are the two most significant factors in the feature selection method. They not only enhance classification performance but also help to select the most relevant terms. A term's discriminative capability across documents in a category (IDF), by contrast, degrades overall performance by selecting rare terms, and the terms it chooses are mostly irrelevant. IDF has also been shown to be biased towards misclassification cases. This work therefore highlights IDF as the least significant factor in CTD.


Abstrak

The process of selecting features that appropriately represent the category of a text is an important step in any classification technique. In this thesis, we study and analyze the Categorical Term Descriptor (CTD) feature selection method introduced by Bong (Bong, C.H., 2001), which has achieved classification accuracy comparable to other well-known feature selection methods such as Information Gain and Chi-Square. Our goal is to evaluate the significance of each factor that forms the CTD method. Our experimental results show that CTD is not suitable when a dataset contains misclassified documents. We have also shown that CTD achieves high accuracy on datasets containing fairly distinct categories, but not on general categories. We have found that the within-category frequency factor (CTF) and the discriminative power across categories (ICF) are very important in feature selection, as they improve classification accuracy and help to select relevant features. The discriminative power across documents within a category (IDF), on the other hand, fails to improve text classification performance or to select features that are truly relevant to their categories. When documents are misclassified, this factor also gives high readings, which biases the classification results. This thesis therefore shows that the discriminative power across documents within a category is the least important factor in the CTD feature selection method.


Chapter 1: Overview

1.1 Background

Nowadays, numerous database systems are being built to store and capture textual data. Without adequate knowledge of the implicitly stored data, it becomes difficult to retrieve results from these databases. To deal with such large volumes of data, people have resorted to text mining as a means of discovering insights. Text mining operates on unstructured text-based data with the intention of classifying electronic documents automatically. In this research field, much effort has been put into finding effective algorithms, techniques and methods that achieve results comparable to humans. A great deal of effort has also gone into finding a good feature selection method.

A large number of algorithms have been proposed for performing feature selection. The problem of feature selection is that of acquiring a set of candidate terms and selecting the subset of terms that performs best under a particular classification system. This procedure not only reduces the cost of recognition by reducing the number of terms, but in some cases also provides better classification accuracy due to finite sample size effects.

The ultimate objective of feature selection is to obtain a feature space with (1) low dimensionality, (2) retention of sufficient information, (3) enhancement of the feature space, for example by removing effects due to noisy features, and (4) comparability of features among examples in the same category (Meisel, W.S., 1972).


The work described in this dissertation focuses on a simple feature selection method

proposed by Bong, C.H. (2001) in his thesis "A Machine Learning Framework for

Automated Text Classification". He has introduced a new feature selection approach,

namely Categorical Term Descriptor (CTD) with a potential to enhance the degree of

dimensionality reduction and classification accuracy.

Bong, in his work, took advantage of the Term Frequency Inverse Document Frequency (TFIDF) methods (Salton, G., and Buckley, C., 1987; Korfhage, R.R., 1997) to overcome common feature selection problems faced by term frequency, term weighting and information gain in measuring the "importance" of a term in the text classification task. In CTD, the TFIDF method has been refined by considering term frequency over a category's document collection. CTD has been able to produce better results on some datasets compared to other well-known and more complex feature selection methods such as Information Gain (Mladenic, D., 1996; Yang, Y., and Pederson, J., 1997) and Chi-Square (Schutze, H., Hull, D.A., and Pederson, J., 1995).

Term weighting is applied in the indexing of terms to identify a subset of terms that best reflects a collection of documents. Bong, C.H., and Kulathuramaiyer, N. (2004) have shown the effects of various term weighting schemes across datasets and highlighted the potential capability of CTD. We extend their research to investigate the effects of the component weighting schemes within the CTD scheme, namely CTF, ICF and IDF.

In this dissertation, we will mainly focus on the discovery of the significant factors that

form the CTD method. Factors that are not significant for term selection would be


ignored. We will also describe how Bong's findings have been biased towards cases of misclassification. As a result, the CTD method proposed earlier needs to be reformulated to minimize the impact of such cases.

1.2 Objectives

Past literature has shown that CTD works surprisingly well on experimental datasets (Bong, C.H., Kulathuramaiyer, N., and Wong, T.K., 2005). We would like to explore the use of CTD on a real-life dataset, the Conference Papers of the Society for Information Technology and Teacher Education Annual (SITE), to see whether CTD is able to perform well consistently. In particular, we pursue the following objectives. We will identify the factors in CTD that work well, and explore variations to enhance it for datasets with a larger number of documents and a great deal of overlap among documents.

i) To perform an experimental study highlighting the strengths and weaknesses of CTD (determining the situations under which it works well and when it does not).

ii) To determine the significance of the factors in CTD and to validate the need for each factor.

iii) To determine the usefulness of common terms, as opposed to unique terms, in characterizing statistical feature selection.

iv) To explore the significance of discriminative terms with regard to categorization accuracy.


1.3 Problem Statement

Although a vast amount of literature on feature selection methods exists, there is no single best method that works well for all datasets. There are datasets on which Document Frequency (Yang, Y., and Pederson, J., 1997; Sebastiani, F., 1999) performs better than, or almost as well as, popularly used methods such as Information Gain (Mladenic, D., 1996; Yang, Y., and Pederson, J., 1997) and Chi-Square (Schutze, H., Hull, D.A., and Pederson, J., 1995).

For CTD, a term is assumed important based on the number of times it occurs within a category and on its discriminative capability across categories as well as among the documents in a category. Consider a case where, given the term 'mining' in a Computer category, the term occurs with a very high frequency in only a single document of the Computer category and has fewer occurrences in other categories. According to Bong (2001), this term is important for representing its category because it is not common across categories or across documents. It is thus considered a unique term that differentiates its category from others.

Our intuition is that this method may be biased towards misclassification cases. In such cases, a term such as 'food' is also uncommon in the Computer category, as there are certainly not many documents that contain it, so its discriminative capability among the documents of this category is high. Thus, instead of selecting a term that uniquely represents its category, the method may actually select a rare term that is irrelevant to the category.
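To make this intuition concrete, the following minimal sketch (our own illustration, not code from Bong (2001)) computes a standard inverse document frequency for both terms in a hypothetical 100-document Computer category. The rare but relevant 'mining' and the rare but irrelevant 'food' receive exactly the same maximal score, so IDF alone cannot tell them apart:

    import math

    def idf(doc_freq, num_docs):
        # Standard inverse document frequency: the rarer a term is
        # across the category's documents, the higher its score.
        return math.log(num_docs / doc_freq)

    num_docs = 100                 # documents in the Computer category
    print(idf(1, num_docs))        # 'mining', unique but relevant     -> 4.61
    print(idf(1, num_docs))        # 'food', from a misclassified doc  -> 4.61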


Due to this, our research explores the discriminative capability of a term among the documents in a category and the possibility of misclassification. We believe that terms that are less discriminative across documents in a category, in other words terms common across documents, are more significant for a real-life dataset. Hence, in this study we will show that the discriminative capability across documents in a category is not important and should therefore be ignored in document classification.

1.4 Purpose of Study

We believe that the simple but effective feature selection method CTD can be further improved. Therefore, the purpose of our study is to explore the reasons why it works well and to identify refinements that enhance the method.

1.5 Research Significance

The significance of this study lies in defining a good, simple feature selection method that not only serves as a dimensionality reduction method and enhances classification accuracy, but is also able to find relevant terms that strongly represent their categories.

1.6 Scope of Work

Here, we will mainly focus on analyzing the significance of the two factors that form the CTD method, namely ICF (Inverse Category Frequency) and IDF (Inverse Document Frequency). As this work extends Bong (2001), who proposed "A Machine Learning Framework for Automated Text Classification", we employ the same


framework and machine learning tools. As the previous work mainly focused on the effects of CTD on an experimental dataset, we further conduct a detailed experimental study on a second dataset, SITE. Using the same datasets, tools and techniques keeps the efficiency and accuracy of the feature selection method directly comparable.

We will compare the significance of the parameters in both the Reuters and SITE datasets. Then, we will analyze whether common terms can help to increase the efficiency of the feature selection method.

Term analysis will be conducted to compare the terms' relevancy to a category. The outcomes will help to support the findings on the factors' significance.

1.7 Chapters Overview

This chapter provides a general overview of the study carried out for this dissertation.

Chapter 2 describes the literature review performed in the area of text classification. It highlights the importance of the text classification area, as well as of having a good feature selection method, and briefly describes the processes involved in a text classification framework. Chapter 3 presents the feature selection methods, datasets, text classifier and measurements intended to be used in the experiments described in Chapter 4. Chapter 4 highlights the recommendations for effective experiments which can help to determine each parameter's significance. The results of these experiments are discussed in Chapter 5. Finally, Chapter 6 summarizes the conclusions of the dissertation, highlighting the contributions of the study and directions for future work.


Chapter 2: Literature Review

2.1 Introduction

This chapter provides the literature review conducted in the area of text classification. Section 2.2 briefly describes the processes involved in a text classification framework and surveys the literature on feature selection methods. Section 2.3 describes the text classifier approaches used to assign category labels to documents, while the measurements for these assignments are discussed in Section 2.4.

2.2 Feature Selection

Research on text mining started in the 1960s (Maron, M.E., and Kuhns, J.L., 1960; Stiles, H.E., 1961; Doyle, L., 1962; Lesk, M.E., and Salton, G., 1969) and it still receives much attention today. Text classification serves a variety of purposes, such as classifying text by topic for topic tracking and predicting documents of interest (Fan, W., et al., 2006), classifying email for spam filtering (McEntire, J., 2003), or even finding English words that are later translated into Chinese text (Li, H., Cao, Y., and Li, C., 2003).

In a document classification framework, many processes are involved, namely preprocessing of raw data, building the document representation model, calculating document similarities, identifying significant features, and using a sophisticated machine learning scheme that enables automatic induction of a classifier by "observing" the characteristics of a set of examples classified by experts. Among these processes, feature selection plays a crucial role in document classification, as the construction of classification rules in the learning process


heavily relies on it. The major "role" of feature selection is to measure the "importance" of a term for the classification task. It finds the subset of the original set of features that is most useful for the classification task, thereby reducing the computational complexity of classification. A term is weighted for its "importance" and only those that are "important" enough are used for further processing and for the text representation model.

The challenge in feature selection is choosing a small subset of features that is ideally necessary and sufficient to describe the target concept (Kira, K., and Rendell, L., 1992). A goal of feature selection is therefore to avoid selecting too many or too few features. If too few features are selected, there is a good chance that the information content of this set will be low. On the other hand, if too many features are selected, the effects of irrelevant terms may overshadow the information content and further increase the computational complexity. Hence, this is a tradeoff that any feature selection method must strike.

There exists a vast amount of literature on feature selection. Researchers have attempted feature selection through varied means, such as statistical (Kittler, J., 1975), geometrical (Elomaa, T., and Ukkonen, E., 1994), information-theoretic (Battiti, R., 1994) and mathematical programming (Bradley, P.S., Mangasarian, O.L., and Street, W.N., 1998) approaches, among others. The wrapper method (Kohavi, R., 1995), for example, searches for a good feature subset using the induction algorithm as a black box; the feature selection algorithm exists as a wrapper around the induction algorithm. The induction algorithm is run on datasets with subsets of features, and the subset with the highest


estimated value of a performance criterion is chosen. The induction algorithm is then used to evaluate the chosen features on an independent dataset.
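A minimal sketch of the wrapper idea follows (our illustration, using scikit-learn names for brevity; the cited work predates that library). Each candidate subset is scored by cross-validating the induction algorithm itself, treated as a black box:

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def wrapper_select(X, y, candidate_subsets):
        # Score every candidate feature subset by running the induction
        # algorithm on it, and keep the best-performing subset.
        best_score, best_subset = -1.0, None
        for subset in candidate_subsets:
            score = cross_val_score(MultinomialNB(), X[:, subset], y, cv=5).mean()
            if score > best_score:
                best_score, best_subset = score, subset
        return best_subset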

A few other well-known feature selection methods are Information Gain (Yang, Y., and Pederson, J., 1997), Chi-Square (Schutze, H., Hull, D.A., and Pederson, J., 1995), Correlation Coefficient (Zheng, Z., and Srihari, R., 2003), Odds Ratio (Rijsbergen, V., 1979), GSS Coefficient (Galavotti, L., Sebastiani, F., and Simi, M., 2000) and our local finding, the Categorical Term Descriptor (Bong, C.H., 2001). Table 1 shows the feature selection measures that have been used in past research.

Table 1. Feature selection measures.

Measure                       Formula
Information Gain              IG(t) = -Σ P(c) log P(c) + P(t) Σ P(c|t) log P(c|t) + P(~t) Σ P(c|~t) log P(c|~t)
Chi-Square                    CHI(t,c) = N(AD - CB)² / [(A+C)(B+D)(A+B)(C+D)]
Correlation Coefficient       CC(t,c) = √N (AD - CB) / √[(A+C)(B+D)(A+B)(C+D)]
Odds Ratio                    OR(t,c) = P(t|c)(1 - P(t|~c)) / [(1 - P(t|c)) P(t|~c)]
GSS Coefficient               GSS(t,c) = P(t,c)P(~t,~c) - P(t,~c)P(~t,c)
Categorical Term Descriptor   CTD(t,c) = CTF × ICF × IDF (the three component factors; see Section 3.4.1)

where, for a term t and category c, A, B, C and D denote the numbers of documents in c containing t, outside c containing t, in c without t, and outside c without t, respectively; N = A + B + C + D; and ~ denotes absence.


2.2.1 Information Gain

Information Gain (IG) is a statistical property that measures the worth of a feature for a classifier. For IG, a feature's worth is defined as the difference between the prior uncertainty and the expected posterior uncertainty. For example, feature X is preferred to feature Y if the information gain from feature X is greater than that from feature Y. The IG formula presented in Table 1 is used to measure the goodness of terms throughout all categories in a document collection; for all the documents in the collection, each unique term's information gain is computed.

Basically, employing IG for feature selection means computing the amount of category information gained by the presence and absence of a term in documents. Even though IG can perform well, as reported in the past literature, the method is very complex and its derivation is not readily interpretable (Bong, C.H., 2001).
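As a minimal sketch of this computation (our own illustration using the standard IG formulation, not code from the thesis):

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(p_cats, p_term, p_cats_present, p_cats_absent):
        # IG = prior category uncertainty minus the expected uncertainty
        # after observing whether the term is present or absent.
        posterior = (p_term * entropy(p_cats_present)
                     + (1 - p_term) * entropy(p_cats_absent))
        return entropy(p_cats) - posterior

    # Two equiprobable categories; the term occurs in 30% of documents
    # and strongly indicates the first category.
    print(information_gain([0.5, 0.5], 0.3, [0.9, 0.1], [0.33, 0.67]))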

2.2.2 Chi-Square

Chi-Square (CHI) is a well-known categorical data measurement in statistics. CHI computes how closely an observed probability distribution corresponds to an expected probability distribution. The observed probability distribution is formed by the observed frequencies of documents, either belonging to a category or not, that contain a term; a similar measurement is also computed for the term's absence. The expected probability distribution assumes that all expected frequencies of the presence or absence of a term are equal inside and outside a category. This measure is described in Table 1.


CHI is used to test the hypothesis that the observed and expected outcomes come from the same probability distribution. Similar to IG, CHI is complex, and it has been shown not to provide reliable scores for low-frequency terms (Yang, Y., and Pederson, J., 1997).
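A minimal sketch of the statistic for one term and one category, computed from the usual 2x2 contingency counts (an illustration we add here; A, B, C and D follow the notation of Table 1):

    def chi_square(A, B, C, D):
        # A: in-category docs containing the term; B: out-of-category docs
        # containing it; C: in-category docs without it; D: out-of-category
        # docs without it.
        N = A + B + C + D
        return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

    # Term present in 40 of 50 in-category docs but only 10 of 150 others.
    print(chi_square(40, 10, 10, 140))   # -> approx. 107.6, strongly category-dependent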

2.2.3 Correlation Coefficient

Correlation Coefficient (CC) is a variant of the CHI metric, where CC² = CHI. CC can be viewed as a "one-sided" CHI metric: positive values correspond to features indicative of membership, while negative values indicate non-membership. Feature selection using CC selects terms with maximum CC values; the rationale is that terms coming from the non-relevant texts of a category are considered useless. CHI, by contrast, is non-negative, and its value indicates either membership or non-membership of a term in a category, so ambiguous features are ranked lower. In contrast with CC, CHI considers terms coming from both the relevant and non-relevant texts.

2.2.4 Odds Ratio

Odds Ratio (OR) was originally proposed by Van Rijsbergen et al. for selecting terms for relevance feedback. The basic idea is that the distribution of features in relevant documents differs from the distribution of features in non-relevant documents. Its formula is defined in Table 1.


2.2.5 GSS Coefficient

GSS Coefficient (GSS) is another simplified variant of the CHI statistic, proposed by Galavotti et al. Similar to CC and OR, GSS considers only the positive features. Its formula is defined in Table 1.

2.2.6 Categorical Term Descriptor

Categorical Term Descriptor (CTD) adapts Term Frequency Inverse Document Frequency (TFIDF) by deploying the term weighting technique in the context of the documents in a category instead of using only the term frequency in a document. CTD uses a simple method yet serves well as an efficient feature selection method (Bong, C.H., and Kulathuramaiyer, N., 2004). Figures 1 and 2 compare the performance of the six feature selection methods mentioned above, namely CTD, CC, CHI, GSS, IG and OR, on the Reuters-21578 and SITE95-99 datasets.

Figure 1 shows that CTD's performance is comparable to the other measures, and when applied to the operational dataset SITE95-99 (Figure 2), CTD is promising and able to achieve the best performance.


[Figure 1. Micro-average F-Measure for six feature selection measures (CTD, CC, CHI, GSS, IG and OR) on Reuters-21578, plotted against the number of features (10 to 2000).]

[Figure 2. Micro-average F1 for six feature selection measures (CTD, CC, CHI, GSS, IG and OR) on SITE95-99, plotted against the number of features (10 to 2000).]

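Based on the three factors this dissertation names as forming CTD (CTF, ICF and IDF; see Section 3.4.1), a CTD-style score can be sketched as their product. This is a hedged reading of the method for illustration, not the exact formula from Bong (2001):

    import math

    def ctd_score(ctf, cats_with_term, num_cats, docs_with_term, docs_in_cat):
        # ctf: the term's frequency within the category's documents (CTF).
        icf = math.log(num_cats / cats_with_term)      # discrimination across categories
        idf = math.log(docs_in_cat / docs_with_term)   # discrimination across documents
        return ctf * icf * idf

    # A term frequent in 1 of 10 categories and in 5 of its category's 50 docs.
    print(ctd_score(ctf=30, cats_with_term=1, num_cats=10,
                    docs_with_term=5, docs_in_cat=50))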

2.3 Text Classifier

Other than feature selection, the text classifier is another area of interest to many researchers. Text classification is the process of assigning predefined category labels to new documents based on a classifier learnt from training examples. Many text classifier techniques have been proposed, for example the Rocchio algorithm (Rocchio, J., 1971), the naive Bayes (NB) method (McCallum, A., Nigam, K., and Seymore, K., 1999), support vector machines (SVM) (Vapnik, V., 1995) and many others. Among the various types of text classifiers, naive Bayes has been widely used because of its simplicity. The multinomial naive Bayes text classifier is the most commonly used version of naive Bayes (McCallum, A., Nigam, K., and Seymore, K., 1999). This classifier adopts a per-topic unigram language model to estimate the term probabilities given a class.
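As a brief illustration of how such a classifier is trained on term counts (a hypothetical sketch using scikit-learn as a modern stand-in for the tools actually used in this work):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["stock prices fell", "wheat harvest up"]   # toy training data
    train_labels = ["finance", "agriculture"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)      # term-count representation
    clf = MultinomialNB().fit(X, train_labels)    # unigram model per class

    print(clf.predict(vectorizer.transform(["wheat prices fell"])))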

2.4 Measurement for Category Assignments

To evaluate the effectiveness of the category assignments made by classifiers to documents, three common measurements are used, namely

(1) Precision: the proportion of proposed classifications that are correct

(2) Recall: the proportion of true classifications that are correctly proposed

(3) F-Measure: the harmonic mean of precision and recall.

F-Measure provides an overall measure of accuracy and is therefore commonly adopted for text classification. F-Measure combines Recall (R) and Precision (P) with equal weight as follows:

F-Measure = 2RP / (R + P)
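These definitions translate directly into code; the following minimal sketch (ours, not from the dissertation) computes all three measures from raw assignment counts:

    def precision_recall_f(true_pos, false_pos, false_neg):
        p = true_pos / (true_pos + false_pos)   # correct / proposed
        r = true_pos / (true_pos + false_neg)   # correct / true assignments
        f = 2 * r * p / (r + p)                 # harmonic mean of P and R
        return p, r, f

    print(precision_recall_f(80, 20, 40))       # -> (0.8, 0.667, 0.727)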


2.5 Conclusions

Results from previous studies indicate that feature selection is of paramount importance for any learning algorithm; when it is done poorly (i.e., a poor set of features is selected), problems associated with incomplete information and noisy or irrelevant features follow. The learning algorithm is slowed down unnecessarily by the high dimensionality of the feature space, and it also experiences lower text classification accuracy due to learning irrelevant features. For this dissertation, we have selected CTD because it uses a simple method and yet serves well as an efficient feature selection method. The subsequent chapter describes in detail the methodology used in our study.


Chapter 3: Methodology

3.1 Introduction

The methodology used in this dissertation is based on the earlier work by Bong, C.H. (2001). This ensures that the efficiency and accuracy of the feature selection method are maintained and easily benchmarked. This chapter discusses the design of the study and the multi-method approaches used in the evaluation. Section 3.2 describes the common framework used for text categorization. Section 3.3 highlights the datasets involved and Section 3.4 describes the feature selection methods used to identify the significance of Category Term Frequency (CTF) as compared to Inverse Document Frequency (IDF). We also include the proposed Common Term Factor (CommTF) parameter, which we hope will improve the performance of the feature selection method. Section 3.5 describes our choice of text classifier and, lastly, Section 3.6 provides details on the measurement used to evaluate performance.

3.2 The Framework

The text categorization framework is made up of five components, namely document preprocessing, feature selection, learning tools, the learned classifier and categorization tools, as illustrated in Figure 3.
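In outline, the five components chain together as in the following hypothetical sketch (scikit-learn names are ours for illustration; they are not the tools used in this study):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB

    # Preprocessing/representation (CountVectorizer), feature selection
    # (SelectKBest), and the learning tool (MultinomialNB); fitting yields
    # the learned classifier, and predict() performs the categorization.
    pipeline = Pipeline([
        ("counts", CountVectorizer(stop_words="english")),
        ("select", SelectKBest(chi2, k=2)),
        ("learn", MultinomialNB()),
    ])

    docs = ["interest rates rise", "rates fall again",
            "wheat crop grows", "corn crop fails"]
    labels = ["finance", "finance", "agriculture", "agriculture"]
    pipeline.fit(docs, labels)
    print(pipeline.predict(["crop prices rise"]))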
