THE EFFECT OF TERM WEIGHTING MEASURES ON FEATURE SELECTION
LATIFAH LOR ABDULLAH
A dissertation submitted in partial fulfillment of the requirements for the degree of
Master of Advanced Information Technology
Faculty of Computer Science and Information Technology
UNIVERSITI MALAYSIA SARAWAK
2007
Declaration
I certify that all work in this dissertation was carried out between October 2004 and
March 2007 and has not been submitted for any academic award at other
colleges, institutes or universities. The work presented here was carried out under the
supervision of Mr. Bong Chih How and Associate Professor Narayanan
Kulathuramaiyer. All other work in the thesis is my own except where noted.
Signed,
Latifah Loh Abdullah
March, 2007
Acknowledgment
I would like to thank Assoc. Prof. Narayanan Kulathuramaiyer for his inspiring
comments and proofreading. Many thanks to Mr. Bong Chih How for his valuable ideas,
helpful suggestions and comments, careful proofreading and for extending the code used
in the experiments. Special thanks to my husband and family for their love and
support, which gave me the courage to complete this dissertation. Last but not least, I would
like to thank all my colleagues (CICTS and FCSIT) for their cooperation in many ways.
Table of Contents
Declaration ... i
Acknowledgment ... i
Table of Contents ... ii
List of Figures ... iv
List of Tables ... v
Abstract ... vi
Abstrak ... vii
Chapter 1: Overview ... 1
  1.1 Background ... 1
  1.2 Objectives ... 3
  1.3 Problem Statement ... 4
  1.4 Purpose of Study ... 5
  1.5 Research Significance ... 5
  1.6 Scope of Work ... 5
  1.7 Chapters Overview ... 6
Chapter 2: Literature Review ... 7
  2.1 Introduction ... 7
  2.2 Feature Selection ... 7
    2.2.1 Information Gain ... 10
    2.2.2 Chi-Square ... 10
    2.2.3 Correlation Coefficient ... 11
    2.2.4 Odd Ratio ... 11
    2.2.5 GSS Coefficient ... 12
    2.2.6 Categorical Term Descriptor ... 12
  2.3 Text Classifier ... 14
  2.4 Measurement for Category Assignments ... 14
  2.5 Conclusions ... 15
Chapter 3: Methodology ... 16
  3.1 Introduction ... 16
  3.2 The Framework ... 16
  3.3 Datasets ... 19
  3.4 Feature Selection Methods ... 21
    3.4.1 Categorical Term Descriptor (CTD) ... 22
    3.4.2 Other Feature Selection Methods (CTF, CTFICF, CTFIDF) ... 24
    3.4.3 Common Term Factor (CommTF) ... 25
  3.5 Multinomial Naive Bayes ... 26
  3.6 Measurement ... 27
  3.7 Conclusion ... 27
Chapter 4: Experimental Approach ... 29
  4.1 Introduction ... 29
  4.2 Experiments ... 29
  4.3 The Evaluations of CTF, CTFICF, CTFIDF and CTD as Feature Selection ... 30
  4.4 Common Term Factor (CommTF) ... 31
  4.5 Semantic Analysis of Selected Term to Their Category ... 31
  4.6 Conclusion ... 32
Chapter 5: Experimental Result and Analysis ... 33
  5.1 Introduction ... 33
  5.2 CTF, CTFICF, CTFIDF and CTD as a Feature Selection Method ... 33
    5.2.1 Average of Overall Categories in SITE95-99 and Reuters-21578 ... 33
    5.2.2 Individual Categories in SITE95-99 and Reuters-21578 ... 35
      5.2.2.1 SITE95-99 ... 35
      5.2.2.2 Reuters-21578 ... 46
    5.2.3 Discussion on Feature Selection Analysis ... 48
  5.3 Misclassification ... 50
  5.4 Common Term Factor (CommTF) ... 51
  5.5 Semantic Analysis of Relevance of the Selected Terms to Their Categories ... 53
    5.5.1 Mathematics ... 54
    5.5.2 Carcass ... 64
    5.5.3 Discussion on Semantic Analysis ... 68
  5.6 Conclusion ... 68
Chapter 6: Conclusion ... 70
  6.1 Introduction ... 70
  6.2 Discussion ... 70
  6.3 Future Work ... 74
References ... 76
Appendices ... 84
  A. The 80 selected terms for Mathematics category ... 84
  B. The 80 selected terms for Mathematics, in ascending order ... 85
  C. The 80 selected terms for Carcass category ... 88
  D. The 80 selected terms for Carcass, in ascending order ... 89
List of Figures
Figure 1. Micro-average F-Measure for six feature selection measures on Reuters-21578 ... 13
Figure 2. Micro-average F1 for six feature selection measures on SITE95-99 ... 13
Figure 3. Macro averaged F-Measure in the SITE95-99 dataset ... 34
Figure 4. Macro averaged F-Measure in Reuters-21578 datasets ... 34
Figure 5. The percentage numbers of categories under each group ... 37
Figure 6. Macro Average of F-Measure for CommTF in SITE95-99 dataset ... 52
Figure 7. Macro Average of F-Measure for CommTF in Reuters-21578 dataset ... 52
List of Tables
Table 1. Feature selection measures ... 9
Table 2. SITE95-99 training and testing set ... 19
Table 3. Categories in SITE95-99 dataset ... 20
Table 4. Categories in Reuters-21578 dataset ... 21
Table 5. The proposed feature selection methods used in the experiment ... 24
Table 6. The highest score for F-Measure in each category of SITE95-99 ... 36
Table 7. Top 40 terms selected for Simulation(14) category ... 39
Table 8. Example of 5 unique terms selected by feature selection, CTFICF ... 41
Table 9. Example of terms selected by feature selection Instructional Design(6) ... 42
Table 10. Comparison between the existence of the terms within and outside Instructional Design(6) category ... 43
Table 11. Summary of categories in each effect comparison of CTFICF, CTFIDF and CTD ... 44
Table 12. The categories selected from Reuters-21578 dataset ... 46
Table 13. The highest score for the selected categories of Reuters-21578 ... 46
Table 14. The top 11 terms selected by CTD ... 47
Table 15. The highest score for CTFIDF and CTD before and after misclassification for the selected categories of SITE95-99 and Reuters-21578 ... 51
Table 16. Top 80 terms for Mathematic(7) category ... 55
Table 17. Terms selected in CTD, CTF and CTFICF in ascending order ... 57
Table 18. Terms selected in CTF and CTFICF but are not selected by CTD, in ascending order ... 57
Table 19. Selected terms ranked by CTF and CTFICF ... 59
Table 20. Terms selected in CTD and CTFICF sorted in ascending order ... 60
Table 21. Terms selected in CTD but not in CTF or CTFICF, in ascending order ... 62
Table 22. Comparison of term's frequency and number of documents that contain the term ... 63
Table 23. Terms selected in CTD, CTF and CTFICF, in ascending order ... 64
Table 24. Terms selected in CTF and CTFICF in ascending order ... 65
Table 25. Terms selected in CTFICF and CTD, in ascending order ... 66
Table 26. Terms selected in CTD but not selected by CTF and CTFICF, sorted in ascending order ... 67
Abstract
Feature selection is an important stage in any text mining classification
techniques. In this dissertation, we study and analyze Categorical Term
Descriptor (CTD) (Bong, C.H., 2001) feature selection method, which gives
accuracy comparable to that of other well-known feature selection
methods such as Information Gain and Chi-Square. Our goal is to evaluate the
significance of each term weighting measure that forms the CTD method. Our
experimental results have shown that CTD does not handle datasets that contain
misclassifications. We have proven that CTD performs well in categories which
are distinct as opposed to general and miscellaneous categories. We have
identified that categorical term's frequency (CTF) and its discriminative
capability across categories (ICF) are the two most significant factors for feature
selection. They not only enhance the classification performance but
also help to select the most relevant terms. A term's discriminative capability across
documents in a category (IDF) not only degrades the overall performance
by selecting rare terms but also chooses terms which are mostly irrelevant. IDF
has also been shown to be biased towards misclassification cases. This work has
also highlighted that IDF is the least significant factor in CTD.
Abstrak
The process of selecting features that suitably represent the category of a text is an
important step in any classification technique. In this dissertation, we study and analyze
the Categorical Term Descriptor (CTD) feature selection method introduced by Bong
(Bong, C.H., 2001), which has produced classification accuracy comparable to that of
other well-known feature selection methods such as Information Gain and Chi-Square.
Our goal is to evaluate the significance of each factor that forms the CTD method. Our
experimental results show that CTD is not suitable when a dataset contains misclassified
documents. Beyond that, we have also shown that CTD achieves high accuracy on
datasets containing fairly distinct categories, but not on general categories. We have
found that the within-category frequency factor (CTF) and the discriminative power
across categories (ICF) are the most important factors in feature selection, as they
increase classification accuracy and help select relevant features. The factor measuring
discriminative power across the documents of a category (IDF), on the other hand, fails
to improve text classification performance or to select features that are truly relevant to
their category. When documents are misclassified, this factor also gives high scores,
which biases the classification results. This dissertation therefore shows that the factor
measuring discriminative power across the documents of a category is the least
important factor in the CTD feature selection method.
Chapter 1: Overview
1.1 Background
Nowadays, numerous database systems are being built to store and capture
textual data. Without adequate knowledge of the implicitly stored data, it becomes
difficult to retrieve results from databases. To deal with these large volumes of data,
people have turned to text mining as a means of discovering insights. Text mining
works on unstructured, text-based data with the intention of classifying electronic
documents automatically. In this research field, a lot of effort has been put into finding
effective algorithms, techniques and methods to achieve results comparable to humans.
A great effort has also been placed in finding a good feature selection method.
A large number of algorithms have been proposed for performing feature selection. The
problem of feature selection is in acquiring a set of candidate terms and selecting the
subset of terms that performs best under a particular classification system. This
procedure not only reduces the cost of recognition by reducing the number of
terms, but in some cases also provides better classification accuracy due to
finite sample size effects.
The ultimate objective of feature selection is to obtain a feature space with (1) a low
dimensionality, (2) retention of sufficient information, (3) enhancement of the feature
space, for example by removing effects due to noisy features, and (4)
comparability of features among examples in the same category (Meisel, W.S., 1972).
The work described in this dissertation focuses on a simple feature selection method
proposed by Bong, C.H. (2001) in his thesis "A Machine Learning Framework for
Automated Text Classification". He has introduced a new feature selection approach,
namely Categorical Term Descriptor (CTD) with a potential to enhance the degree of
dimensionality reduction and classification accuracy.
Bong, in his work, takes advantage of the Term Frequency Inverse Document
Frequency (TFIDF) (Salton, G., and Buckley, C., 1987; Korfhage, R.R., 1997) method to
overcome common feature selection problems faced by term frequency, term
weighting and information gain in measuring the "importance" of a term in the text
classification task. In CTD, the TFIDF method has been refined by considering
term frequency over a category's collection of documents. CTD has been able to produce better
results on some datasets as compared to other well-known and more complex feature
selection methods like Information Gain (Mladenic, D., 1996; Yang, Y., and Pederson, J.,
1997) and Chi-Square (Schutze, H., Hull, D.A., and Pederson, J., 1995).
Term weighting is applied in the indexing of terms to identify a subset of terms that best
reflects a collection of documents. Bong, C.H., and Kulathuramaiyer, N. (2004) have
shown the effects of various term weighting schemes across datasets. The potential
capability of CTD has been highlighted. We further extend their research to indicate the
effects of component weighting schemes within their CTD scheme, namely the CTF, ICF
and IDF.
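To make the roles of these three component weights concrete, the following is a minimal sketch on a hypothetical toy corpus (our own illustration; the exact normalization and smoothing in Bong's CTD formulation may differ):

```python
import math

# Hypothetical toy corpus (not from the thesis): category -> tokenized documents.
corpus = {
    "computer": [["mining", "data", "algorithm"], ["data", "network"]],
    "food":     [["recipe", "data"], ["recipe", "cook", "cook"]],
}

def ctd_weight(term, category, corpus):
    """CTD-style score combining the three component weights:
    CTF - the term's frequency within the category's document collection,
    ICF - inverse category frequency (discrimination across categories),
    IDF - inverse document frequency inside the category
          (discrimination across the category's own documents).
    The (1 + log) smoothing used here is our assumption."""
    docs = corpus[category]
    ctf = sum(doc.count(term) for doc in docs)
    cf = sum(any(term in d for d in ds) for ds in corpus.values())
    icf = math.log(len(corpus) / cf) if cf else 0.0
    df = sum(term in doc for doc in docs)
    idf = math.log(len(docs) / df) if df else 0.0
    return ctf * (1 + icf) * (1 + idf)
```

On this toy corpus, 'mining' outranks 'data' for the computer category: 'data' appears in every category and every document, driving both its ICF and its IDF components to zero.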
In this dissertation, we will mainly focus on the discovery of the significant factors that
form the CTD method. Factors that are not significant for term selection would be
ignored. We will also describe how Bong's findings have been biased towards cases of
misclassification. As a result, the CTD method proposed earlier needs to be
reformulated to minimize the impact of such cases.
1.2 Objectives
Past literature has shown that CTD surprisingly works well on experimental datasets
(Bong, C.H., Kulathuramaiyer, N., and Wong, T.K., 2005). We would like to explore the
use of CTD on a live dataset, conference papers from the Society for Information Technology
and Teacher Education Annual (SITE), to see whether CTD is able to perform well
consistently. In particular, we will pursue the following objectives. We will identify
the factors in CTD that work well, and explore variations to enhance it for datasets with
a larger number of documents and a great deal of overlap among documents.
i) To perform an experimental study to highlight the strengths and weaknesses
of CTD (to determine the situations under which it works well and when it does
not).
ii) To determine the significance of factors in CTD and to validate the need of
each factor.
iii) To determine the usage of common terms as opposed to unique terms in
characterizing statistical feature selection.
iv) To explore the significance of discriminative terms with regards to
categorization accuracy.
1.3 Problem Statement
Although a vast amount of literature on feature selection methods exists, there is no single best
method that works well for all datasets. There are datasets whereby Document
Frequency (Yang, Y., and Pederson, J., 1997; Sebastiani, F., 1999) performs better than,
or almost as well as, the popularly used methods, such as Information Gain (Mladenic, D.,
1996; Yang, Y., and Pederson, J., 1997) and Chi-Square (Schutze, H., Hull, D.A., and
Pederson, J., 1995).
For CTD, a term is assumed important based on the number of times it occurs within a
category and on its discriminative capability across categories as well as among
documents in a category. Consider a case where, given a term 'mining' in a Computer
category, the term occurs with a very high frequency in only a single document of the
Computer category and has fewer occurrences in other categories. According to Bong (2001),
this term is important for representing its category because it is not common across
categories or across documents. This term is thus considered a unique term that
differentiates its category from others.
An intuition about this method is that it may be biased towards misclassification cases.
Similarly, a term such as 'food' is also uncommon in the Computer
category, as there are definitely not many documents that contain the term. Therefore, its
discriminative capability among documents in this category is higher. So, instead of
selecting a term that is unique and representative of its category, the method may actually select a rare
term that is irrelevant to the category.
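A small numeric sketch of this intuition (toy numbers of our own, not taken from the experiments): IDF sees only document counts, so it gives an irrelevant rare term such as 'food' exactly the same maximal weight as a genuinely unique term such as 'mining':

```python
import math

n_docs = 100  # hypothetical size of the Computer category

def idf(df):
    """Inverse document frequency within a category."""
    return math.log(n_docs / df)

print(idf(1))   # 'mining': genuinely unique, relevant term
print(idf(1))   # 'food': rare only because a document was misclassified
print(idf(80))  # 'computer': common, relevant term gets a near-zero weight
```

The two rare terms are indistinguishable to IDF, while the common, relevant term is heavily penalized.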
Due to this, our research explores the discriminative capability of a term among
documents in a category and the possibility of misclassification. We believe that terms
that are less discriminative across documents in a category (in other words, terms that
are common across documents) are more significant for a live dataset.
Hence, in this study we will show that the discriminative capability across documents in
a category is not important, and therefore, should be ignored in document classification.
1.4 Purpose of Study
We believe that the simple but effective feature selection method, namely CTD, can be
further improved. Therefore, the purpose of our study is to explore reasons why it works
well and identify refinements in order to enhance the method.
1.5 Research Significance
The significance of this study lies in defining a good and simple feature selection method
which can not only serve as a dimensionality reduction method and enhance
classification accuracy, but is also able to find relevant terms that are significant in
representing their category.
1.6 Scope of Work
Here, we will mainly focus on analyzing the significance of the two factors that form
the CTD method, namely ICF (Inverse Category Frequency) and IDF (Inverse
Document Frequency). As this work extends Bong (2001), who proposed "A
Machine Learning Framework for Automated Text Classification", we employ the same
framework and machine learning tools. As the previous work mainly focused on the
effects of CTD on an experimental dataset, we further explore a detailed experimental
study on a second dataset, SITE. It is important to use the same datasets, tools and
techniques as mentioned above in order to maintain the efficiency and accuracy of the
feature selection method.
We will compare the significance of parameter in both Reuters and SITE dataset. Then,
we will analyze whether the common terms can help to increase the efficiency of the
feature selection method.
Term analysis will be conducted to compare the terms' relevance to their categories. The
resulting outcomes will help support the findings on the factors' significance.
1.7 Chapters Overview
This chapter provides a general overview of the study carried out for this dissertation.
Chapter 2 describes the literature review performed in the areas of text classification. It
highlights the importance of the text classification area, as well as, having a good feature
selection method and also briefly describes the processes involved in a text classification
framework. Chapter 3 presents the feature selection methods, dataset, text classifier
and measurements that are intended to be used in the experiments as described in
Chapter 4. Chapter 4 outlines the experiments designed to determine the significance of
each parameter. The interesting results of these
experiments are discussed in Chapter 5. Finally, Chapter 6 summarizes the conclusions
of the dissertation, highlighting the contributions of the study and its
directions for future work.
Chapter 2: Literature Review
2.1 Introduction
This chapter reviews the literature in the area of text
classification. Section 2.2 briefly describes the processes involved in a text classification
framework and focuses on the literature on feature selection methods. Section 2.3
describes the text classifier approaches that are used to assign a category label to a
document, while the measurements for these assignments are discussed in Section 2.4.
2.2 Feature Selection
Research on text mining started in the 1960s (Maron, M.E., and Kuhns, J.L.,
1960; Stiles, H.E., 1961; Doyle, L., 1962; Lesk, M.E., and Salton, G., 1969) and it still
receives much attention today. Text classification serves a variety of purposes, such as
classifying text by topic for topic tracking and predicting documents of interest (Fan, W.,
et al., 2006), classifying email for the purpose of spam filtering (McEntire, J., 2003), or
even finding English words to be translated into Chinese text (Li, H., Cao, Y.,
and Li, C., 2003).
In a document classification framework, there are many processes involved, namely
pre-processing of raw data, building the document representation model, calculating the
similarities between documents, identifying significant features, and the use of a sophisticated
machine learning scheme that enables automatic induction of a classifier by
"observing" the characteristics of a set of examples classified by
experts. Among these processes, feature selection plays a crucial role in
document classification, as the construction of classification rules in a learning process
heavily relies on it. The major "role" of feature selection is to measure the "importance"
of the term for the classification task. It will find the subset of the original set of
features, which is most useful in its classification task to reduce the computational
complexity in classification. A term is weighted for its "importance" and only those that
are "important" enough will be used for further processing and for the text
representation model.
The challenge faced in the feature selection method is in choosing a small subset of
features that is ideally necessary and sufficient to describe the target concept (Kira, K.,
and Rendell, L., 1992). A goal of feature selection is therefore, to avoid selecting too
many or too few features. If too few features are selected, there is a good chance that the
information content in this set of features will be low. On the other hand, if too many
features are selected, the effects due to irrelevant terms may overshadow the
information content and further increase the computational complexity. Hence, this is a
tradeoff which must be achieved by any feature selection method.
There exists a vast amount of literature on feature selection. Researchers have
attempted feature selection through varied means, such as statistical (Kittler, J., 1975),
geometrical (Elomaa, T., and Ukkonen, E., 1994), information-theoretic measures
(Battiti, R., 1994), mathematical programming (Bradley, P.S., Mangasarian, O.L., and
Street, W. N., 1998), among others. The wrapper method (Kohavi, R., 1995) for example,
searches for a good feature subset using the induction algorithm as a black box. The feature
selection algorithm exists as a wrapper around the induction algorithm. The induction
algorithm is run on datasets with different subsets of features, and the subset with the highest
estimated value of a performance criterion will be chosen. The induction algorithm is
used to evaluate the dataset with the chosen features on an independent dataset.
A few other well-known feature selection methods are Information Gain (Yang, Y., and
Pederson, J., 1997), Chi-Square (Schutze, H., Hull, D.A., and Pederson, J., 1995),
Correlation Coefficient (Zheng, Z., Srihari, R., 2003), Odd Ratio (Rijsbergen, V., 1979),
GSS Coefficient (Galavotti, L., Sebastiani, F., & Simi, M., 2000) and our local finding of
Categorical Term Descriptor (Bong, C.H., 2001). Table 1 shows the feature selection
methods that have been used in past research.
Table 1. Feature selection measures.
[Table body not reproduced: a Description and a Formula column for each of Information Gain, Chi-Square, Correlation Coefficient, Odd Ratio, GSS Coefficient, and Categorical Term Descriptor.]
2.2.1 Information Gain
Information Gain (IG) is a statistical property that measures the worth of a feature for a
classifier. The IG of a feature is defined as the difference between the prior uncertainty and the
expected posterior uncertainty. For example, feature X is preferred over feature Y
if the information gain of feature X is greater than that of feature Y. The IG formula
presented in Table 1 is used to measure the goodness of terms throughout all categories
in a document collection. For all the documents in the collections, each unique term's
"Information Gain" is computed.
Basically, to employ IG as feature selection is to compute the amount of category
information gained from the presence and absence of a term in documents. Even though IG,
as reported in past literature, can perform well, the method is
complex and its derivation is not readily interpretable (Bong, C.H., 2001).
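As a sketch of this computation for a single binary category, using the standard presence/absence formulation from document counts (the thesis's Table 1 formula may use a multi-category variant):

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(n_tc, n_t, n_c, n):
    """IG of a term for one binary category, from document counts:
    n_tc - documents in the category that contain the term
    n_t  - documents containing the term
    n_c  - documents in the category
    n    - total documents
    """
    p_t = n_t / n
    h_c = entropy([n_c / n, 1 - n_c / n])                  # prior uncertainty
    h_t = entropy([n_tc / n_t, 1 - n_tc / n_t]) if n_t else 0.0
    n_rest_c, n_rest = n_c - n_tc, n - n_t
    h_tbar = entropy([n_rest_c / n_rest, 1 - n_rest_c / n_rest]) if n_rest else 0.0
    return h_c - p_t * h_t - (1 - p_t) * h_tbar            # expected posterior
```

A term present in exactly the category's documents recovers the full prior entropy, while a term distributed identically inside and outside the category yields zero gain.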
2.2.2 Chi-Square
Chi-Square (CHI) is a well-known categorical data measurement in the study of
statistics. CHI computes how closely an observed probability distribution corresponds to
an expected probability distribution. An observed probability distribution is formed by
observed frequencies of a number of documents either belonging to a category or not that
contain a term. A similar measurement is also computed for the term's absence. The
expected probability distribution assumes that the frequencies of the presence or
absence of a term are the same whether or not documents belong to the category. This measure is
described in Table 1.
CHI is used to test the hypothesis that the observed and expected outcomes come
from the same probability distribution. Similar to IG, CHI is also complex, and it has been
shown that it does not provide reliable scores for low-frequency terms (Yang, Y., and
Pederson, J., 1997).
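For one term and one category, the observed-versus-expected comparison reduces to the familiar 2x2 contingency form; a sketch (standard text-classification formulation, which may differ cosmetically from the thesis's Table 1):

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 table:
    a - docs in the category containing the term
    b - docs outside the category containing the term
    c - docs in the category without the term
    d - docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

A term distributed identically inside and outside the category scores zero; a strongly associated term scores high.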
2.2.3 Correlation Coefficient
Correlation Coefficient (CC) is a variant of the CHI metric, where CC² = CHI. CC can be
viewed as a "one-sided" CHI metric. The positive values correspond to features indicative
of membership, while negative values indicate non-membership. Feature selection using
CC selects terms with maximum CC values. The rationale behind is that terms coming
from non-relevant texts of a category are considered useless. On the other hand, CHI is
non-negative, whose value indicates either membership or non-membership of a term to
one category. Accordingly the ambiguous features will be ranked lower. In contrast with
CC, CHI considers the term coming from both the relevant and non-relevant texts.
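The signed relationship CC² = CHI can be sketched directly (again a minimal illustration with our own variable names; the counts follow the same 2x2 table as for CHI):

```python
import math

def correlation_coefficient(a, b, c, d):
    """One-sided variant of chi-square: CC**2 == CHI, and the sign of
    (a*d - b*c) indicates membership (+) or non-membership (-).

    a, b, c, d follow the 2x2 term/category contingency table:
    in-category with term, out-of-category with term,
    in-category without term, out-of-category without term.
    """
    n = a + b + c + d
    denom = math.sqrt((a + c) * (b + d) * (a + b) * (c + d))
    return math.sqrt(n) * (a * d - b * c) / denom if denom else 0.0
```

A term concentrated in the category's documents scores positive, a term concentrated outside scores negative, and squaring the value recovers the CHI statistic, which discards that sign.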
2.2.4 Odds Ratio
Odds Ratio (OR) was originally proposed by Van Rijsbergen et al. for selecting terms for
relevance feedback. The basic idea is that the distribution of features in the relevant
documents differs from the distribution of features in the non-relevant documents.
Its formula is defined as in Table 1.
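To illustrate the idea (a sketch of the commonly used form of OR, not necessarily the exact formula in Table 1; the smoothing constant is our own guard against division by zero):

```python
def odds_ratio(p_term_rel, p_term_nonrel, eps=1e-9):
    """Odds ratio of a term: odds of occurring in relevant documents
    divided by odds of occurring in non-relevant documents.

    p_term_rel: P(term | relevant), p_term_nonrel: P(term | non-relevant).
    eps is a small smoothing constant to avoid division by zero for terms
    that never occur (or always occur) in one of the document sets.
    """
    odds_rel = p_term_rel * (1 - p_term_nonrel)
    odds_nonrel = (1 - p_term_rel) * p_term_nonrel
    return (odds_rel + eps) / (odds_nonrel + eps)
```

A term equally likely in both sets scores 1; a term much more frequent in the relevant documents scores well above 1, which is exactly the asymmetry OR is designed to exploit.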
2.2.5 GSS Coefficient
GSS Coefficient (GSS) is another simplified variant of the CHI statistic, proposed by
Galavotti et al. Similar to CC and OR, GSS considers only the positive features. GSS's
formula is defined in Table 1.
2.2.6 Categorical Term Descriptor
Categorical Term Descriptor (CTD) adapts Term Frequency Inverse Document
Frequency (TFIDF) by applying the term weighting technique in the context of
documents in the categories, instead of using only term frequency in a document. CTD
uses a simple method, yet serves well as an efficient feature selection method (Bong,
C.H., and Kulathuramaiyer, N., 2004). Figures 1 and 2 show a comparison of the
performance of the six feature selection methods mentioned, namely CTD, CC, CHI, GSS,
IG and OR, on the Reuters-21578 and SITE95-99 datasets.
Figure 1 shows that CTD's performance is comparable to the other measures, and when
applied to the operational dataset SITE95-99 (Figure 2), CTD is promising and able to
achieve the best performance.
Figure 1. Micro-average F-Measure for six feature selection measures on Reuters-21578
Figure 2. Micro-average Fl for six feature selection measures on SITE95-99
2.3 Text Classifier
Other than feature selection, text classification is another area of interest to many
researchers. Text classification is the process of assigning predefined category labels to new
documents, based on a classifier learnt from training examples. Many text
classification techniques have been proposed, for example the Rocchio algorithm (Rocchio,
J., 1971), the naïve Bayes method (NB) (McCallum, A., Nigam, K., and Seymore, K.,
1999), support vector machines (SVM) (Vapnik, V., 1995) and many others. Among the
various types of text classifiers, naïve Bayes has been widely used because of its
simplicity. The multinomial naïve Bayes text classifier is the most widely used version of naïve
Bayes (McCallum, A., Nigam, K., and Seymore, K., 1999). This classifier adopts
a topic unigram language model approach to estimate the term probabilities given a
class.
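The multinomial model described above can be sketched from scratch as follows (a minimal illustration with Laplace smoothing, not the implementation used in this study; the toy corpus and function names are our own):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial naive Bayes model.

    docs: list of (label, token_list). Returns class log-priors and
    per-class term log-probabilities estimated with Laplace smoothing,
    i.e. a unigram language model per class.
    """
    vocab = {t for _, tokens in docs for t in tokens}
    class_counts = Counter(label for label, _ in docs)
    term_counts = defaultdict(Counter)
    for label, tokens in docs:
        term_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + len(vocab)  # Laplace smoothing
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_cond

def classify(log_prior, log_cond, tokens):
    """Assign the class with the highest posterior score.
    Tokens outside the training vocabulary are ignored (a simplification)."""
    def score(c):
        return log_prior[c] + sum(log_cond[c].get(t, 0.0) for t in tokens)
    return max(log_prior, key=score)
```

Because the unigram model multiplies per-term probabilities (added in log space here), the classifier remains simple and fast even with large vocabularies, which is the main reason for its popularity.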
2.4 Measurement for Category Assignments
For evaluating the effectiveness of category assignments by classifiers to documents,
three common measurements are used, namely
(1) Precision: the proportion of proposed classifications that are correct
(2) Recall: the proportion of true classifications that are correctly proposed
(3) F-Measure: the harmonic mean of precision and recall.
F-Measure provides an overall measure of accuracy and is therefore commonly adopted
for text classification. F-Measure combines Recall (R) and Precision (P) with equal
weight as follows: F-Measure = 2RP / (R + P).
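The three measurements above can be computed directly from classification counts; the following sketch (our own helper, with zero-division guards added as an assumption) shows the relationships:

```python
def precision_recall_f(true_positives, proposed, actual):
    """Precision, recall and F-measure from classification counts.

    true_positives: correct assignments made by the classifier;
    proposed: total assignments made by the classifier;
    actual: total correct assignments in the gold standard.
    Zero denominators are treated as 0.0 scores (a common convention).
    """
    p = true_positives / proposed if proposed else 0.0
    r = true_positives / actual if actual else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0  # F-Measure = 2RP / (R + P)
    return p, r, f
```

For example, a classifier that proposes 10 assignments of which 8 are correct, against 16 true assignments, has precision 0.8 and recall 0.5; the harmonic mean pulls the F-measure toward the weaker of the two, which is why it is preferred over a simple average.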
2.5 Conclusions
Results from previous studies indicate that feature selection is of paramount
importance for any learning algorithm: when it is poorly done (i.e., a poor set of
features is selected), it may lead to problems associated with incomplete information and noisy
or irrelevant features. The learning algorithm is slowed down unnecessarily by the
high dimensionality of the feature space, and it also suffers lower accuracy in text
classification due to learning irrelevant features. For this dissertation, we have selected
CTD because it uses a simple method and yet serves well as an efficient feature
selection method. The subsequent chapter describes in detail the methodology that
we used in our study.
Chapter 3: Methodology
3.1 Introduction
The methodology used in this dissertation is based on the earlier
work by Bong, C.H. (2001). This is to ensure that the efficiency and accuracy of the
feature selection method are maintained and easily benchmarked. This chapter discusses
the design of the study and the multi-method approaches used in the evaluation. Section
3.2 describes the common framework used for text categorization. Section 3.3
highlights the datasets involved, and Section 3.4 describes the feature selection methods
used to identify the significance of Category Term Frequency (CTF) as compared to
Inverse Document Frequency (IDF). We also include the proposed parameter for
Common Term Factor (CommTF), which we hope will improve the performance of the feature
selection method. Section 3.5 describes our choice of text classifier and, lastly, Section 3.6
provides details on the measurement used to assess its performance.
3.2 The Framework
The text categorization framework is made up of five components, namely
document preprocessing, feature selection, learning tools, learned classifier and
categorization tools, as illustrated in Figure 3.