
A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification

Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, Member, IEEE

Abstract—Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can be avoided. Experimental results show that our method runs faster and obtains better extracted features than other methods.

Index Terms—Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification.


1 INTRODUCTION

In text classification, the dimensionality of the feature vector is usually huge. For example, 20 Newsgroups [1] and Reuters21578 top-10 [2], two real-world data sets, both have more than 15,000 features. Such high dimensionality can be a severe obstacle for classification algorithms [3], [4]. To alleviate this difficulty, feature reduction approaches are applied before document classification tasks are performed [5]. Two major approaches, feature selection [6], [7], [8], [9], [10] and feature extraction [11], [12], [13], have been proposed for feature reduction. In general, feature extraction approaches are more effective than feature selection techniques, but are more computationally expensive [11], [12], [14]. Therefore, scalable and efficient feature extraction algorithms are highly desirable for dealing with high-dimensional document data sets.

Classical feature extraction methods convert the original high-dimensional data set into a lower-dimensional one by a projection process based on algebraic transformations. For example, Principal Component Analysis [15], Linear Discriminant Analysis [16], Maximum Margin Criterion [12], and the Orthogonal Centroid algorithm [17] perform the projection by linear transformations, while Locally Linear Embedding [18], ISOMAP [19], and Laplacian Eigenmaps [20] perform feature extraction by nonlinear transformations. In practice, linear algorithms are in wider use due to their efficiency. Several scalable online linear feature extraction algorithms [14], [21], [22], [23] have been proposed to improve the computational complexity, but the complexity of these approaches is still high. Feature clustering [24], [25], [26], [27], [28], [29] is one of the effective techniques for feature reduction in text classification. The idea of feature clustering is to group the original features into clusters with a high degree of pairwise semantic relatedness. Each cluster is treated as a single new feature, and feature dimensionality can thus be drastically reduced.

The first feature extraction method based on feature clustering was proposed by Baker and McCallum [24], derived from the "distributional clustering" idea of Pereira et al. [30]. Al-Mubaid and Umair [31] used distributional clustering to generate an efficient representation of documents and applied a learning logic approach for training text classifiers. The Agglomerative Information Bottleneck approach was proposed by Tishby et al. [25], [29]. The divisive information-theoretic feature clustering algorithm proposed by Dhillon et al. [27] is an information-theoretic feature clustering approach that is more effective than other feature clustering methods. In these feature clustering methods, each new feature is generated by combining a subset of the original words. However, these methods have several difficulties. A word is assigned to exactly one subset, i.e., hard clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small. Also, the mean and the variance of a cluster are not considered when similarity with respect to the cluster is computed. Furthermore, these methods require that the number of new features be specified in advance by the user.

We propose a fuzzy similarity-based self-constructing feature clustering algorithm, an incremental feature clustering approach that reduces the number of features for the text classification task. The words in the feature vector of a document set are represented as distributions and processed one after another.


The authors are with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, 804, Taiwan.

Manuscript received 18 Feb. 2009; revised 27 July 2009; accepted 4 Nov. 2009; published online 26 July 2010. Recommended for acceptance by K.-L. Tan. IEEECS Log Number TKDE-2009-02-0077. Digital Object Identifier no. 10.1109/TKDE.2010.122.



Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for this word. Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. Three ways of weighting, hard, soft, and mixed, are introduced. By this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can be avoided. Experiments on real-world data sets show that our method runs faster and obtains better extracted features than other methods.

The remainder of this paper is organized as follows: Section 2 gives a brief background on feature reduction. Section 3 presents the proposed fuzzy similarity-based self-constructing feature clustering algorithm. An example illustrating how the algorithm works is given in Section 4. Experimental results are presented in Section 5. Finally, we conclude this work in Section 6.

2 BACKGROUND AND RELATED WORK

To process documents, the bag-of-words model [32], [33] is commonly used. Let $D = \{\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n\}$ be a document set of $n$ documents, where $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n$ are individual documents and each document belongs to one of the classes in the set $\{c_1, c_2, \ldots, c_p\}$. If a document belongs to two or more classes, then two or more copies of the document with different classes are included in $D$. Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set. Each document $\mathbf{d}_i$, $1 \le i \le n$, is represented as $\mathbf{d}_i = \langle d_{i1}, d_{i2}, \ldots, d_{im} \rangle$, where each $d_{ij}$ denotes the number of occurrences of $w_j$ in the $i$th document. The feature reduction task is to find a new word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k \ll m$, such that $W$ and $W'$ work equally well for all the desired properties with $D$. After feature reduction, each document $\mathbf{d}_i$ is converted into a new representation $\mathbf{d}'_i = \langle d'_{i1}, d'_{i2}, \ldots, d'_{ik} \rangle$, and the converted document set is $D' = \{\mathbf{d}'_1, \mathbf{d}'_2, \ldots, \mathbf{d}'_n\}$. If $k$ is much smaller than $m$, the computation cost of subsequent operations on $D'$ can be drastically reduced.
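To make the representation above concrete, here is a minimal illustrative sketch (not from the paper) that builds the document-by-word count matrix $D$ for a toy corpus; the corpus, labels, and variable names are ours.

# Build the bag-of-words count matrix D described above for a toy corpus.
from collections import Counter

docs = [["price", "rise", "oil"], ["oil", "price"], ["match", "goal"]]
labels = ["economy", "economy", "sport"]           # one class per document

vocab = sorted({w for doc in docs for w in doc})   # the word set W
index = {w: j for j, w in enumerate(vocab)}

# d_ij = number of occurrences of word w_j in document d_i
D = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for w, cnt in Counter(doc).items():
        D[i][index[w]] = cnt

print(vocab)   # ['goal', 'match', 'oil', 'price', 'rise']
print(D)       # [[0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]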

2.1 Feature Reduction

In general, there are two ways of doing feature reduction: feature selection and feature extraction. In feature selection approaches, a new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ is obtained as a subset of the original feature set $W$, and $W'$ is then used as input for classification tasks. Information Gain (IG) is frequently employed in feature selection [10]. It measures the reduction in uncertainty by an information-theoretic measure and gives each word a weight. The weight of a word $w_j$ is calculated as follows:

$$\mathrm{IG}(w_j) = -\sum_{l=1}^{p} P(c_l)\log P(c_l) + P(w_j)\sum_{l=1}^{p} P(c_l|w_j)\log P(c_l|w_j) + P(\bar{w}_j)\sum_{l=1}^{p} P(c_l|\bar{w}_j)\log P(c_l|\bar{w}_j), \qquad (1)$$

where $P(c_l)$ denotes the prior probability of class $c_l$, $P(w_j)$ denotes the prior probability of feature $w_j$, $P(\bar{w}_j)$ is identical to $1 - P(w_j)$, and $P(c_l|w_j)$ and $P(c_l|\bar{w}_j)$ denote the probability of class $c_l$ given the presence and absence, respectively, of $w_j$. The words with the top $k$ weights in $W$ are selected as the features in $W'$.
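As an illustration of (1), the following sketch computes an IG weight for every word from the count matrix $D$ and the document labels, treating a word as present when its count is nonzero and using base-2 logarithms; the function and variable names are ours, and NumPy is assumed to be available.

import numpy as np

def information_gain(D, y):
    # IG weight of each word w_j per (1); D is the n-by-m count matrix,
    # y holds one class label per document.
    D, y = np.asarray(D), np.asarray(y)
    n, m = D.shape
    classes = np.unique(y)
    P_c = np.array([np.mean(y == c) for c in classes])       # P(c_l)
    ig = np.full(m, -np.sum(P_c * np.log2(P_c)))             # -sum P(c_l) log P(c_l)
    present = D > 0                                          # w_j occurs in the document
    for j in range(m):
        for mask in (present[:, j], ~present[:, j]):         # presence / absence of w_j
            P_w = np.mean(mask)                              # P(w_j) or P(~w_j)
            if P_w == 0.0:
                continue
            P_c_w = np.array([np.mean(y[mask] == c) for c in classes])
            P_c_w = P_c_w[P_c_w > 0]                         # skip 0*log(0) terms
            ig[j] += P_w * np.sum(P_c_w * np.log2(P_c_w))
    return ig

# Keep the k words with the largest IG weights as W':
# top_k = np.argsort(information_gain(D, y))[::-1][:k]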

In feature extraction approaches, extracted features are obtained by a projection process through algebraic transformations. An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents be represented as an $m \times n$ matrix $X \in \mathbb{R}^{m \times n}$, where $m$ is the number of features in the feature set and $n$ is the number of documents in the document set. IOC tries to find an optimal transformation matrix $F^* \in \mathbb{R}^{m \times k}$, where $k$ is the desired number of extracted features, according to the following criterion:

$$F^* = \arg\max \operatorname{trace}(F^T S_b F), \qquad (2)$$

where $F \in \mathbb{R}^{m \times k}$ and $F^T F = I$, and

$$S_b = \sum_{q=1}^{p} P(c_q)(M_q - M_{all})(M_q - M_{all})^T, \qquad (3)$$

with $P(c_q)$ being the prior probability of a pattern belonging to class $c_q$, $M_q$ being the mean vector of class $c_q$, and $M_{all}$ being the mean vector of all patterns.
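The between-class scatter matrix $S_b$ of (3) can be computed directly. The sketch below is an illustrative reading of (3), with class priors estimated from class frequencies (our assumption); it is not the authors' IOC implementation.

import numpy as np

def between_class_scatter(X, y):
    # S_b of (3). X is the m-by-n term-document matrix (documents are columns),
    # y gives the class of each column.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    M_all = X.mean(axis=1, keepdims=True)                 # mean vector of all patterns
    Sb = np.zeros((X.shape[0], X.shape[0]))
    for c in np.unique(y):
        Xc = X[:, y == c]
        P_c = Xc.shape[1] / X.shape[1]                    # prior P(c_q)
        diff = Xc.mean(axis=1, keepdims=True) - M_all     # M_q - M_all
        Sb += P_c * (diff @ diff.T)
    return Sb

# IOC then seeks F with orthonormal columns maximizing trace(F.T @ Sb @ F), per (2).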

2.2 Feature Clustering

Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features into clusters such that the features in a cluster are similar to each other. The feature clustering methods proposed in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original features belongs to exactly one word cluster. Therefore, each word contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let $D$ be the matrix consisting of all the original documents with $m$ features and $D'$ be the matrix consisting of the converted documents with the new $k$ features. The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set $W$, i.e., $W_t \cap W_q = \emptyset$ for $1 \le q, t \le k$ and $t \ne q$. Note that a cluster corresponds to an element in the partition. The $t$th feature value of the converted document $\mathbf{d}'_i$ is then calculated as follows:

$$d'_{it} = \sum_{w_j \in W_t} d_{ij}, \qquad (4)$$

which is a linear sum of the feature values in $W_t$.

The divisive information-theoretic feature clustering (DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, $P(C|w_j)$, $1 \le j \le m$, where $C = \{c_1, c_2, \ldots, c_p\}$, and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions. The distribution of a cluster $W_t$ is calculated as follows:

$$P(C|W_t) = \sum_{w_j \in W_t} \frac{P(w_j)}{\sum_{w_j \in W_t} P(w_j)} P(C|w_j). \qquad (5)$$

The goal of DC is to minimize the following objective function:

$$\sum_{t=1}^{k} \sum_{w_j \in W_t} P(w_j)\,\mathrm{KL}\big(P(C|w_j), P(C|W_t)\big), \qquad (6)$$

which takes the sum over all the $k$ clusters, where $k$ is specified by the user in advance.
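The following sketch illustrates the hard-clustering transform of (4), realized as the binary weighting matrix of (29) applied as $D' = DT$, together with the cluster distribution of (5). The partition and the word priors $P(w_j)$ are assumed to be given, and all names are ours.

import numpy as np

def hard_cluster_transform(D, partition):
    # (4): the t-th new feature of a document is the sum of its counts over the
    # words assigned to cluster W_t. `partition` maps each word index to a cluster id.
    D = np.asarray(D, dtype=float)
    partition = np.asarray(partition)
    k = partition.max() + 1
    T = np.zeros((D.shape[1], k))
    T[np.arange(D.shape[1]), partition] = 1.0      # binary weighting matrix of (29)
    return D @ T                                   # D' = D T

def cluster_distribution(P_C_given_w, P_w, members):
    # (5): class distribution of a cluster W_t as the P(w_j)-weighted average of
    # the member words' distributions P(C|w_j). `members` lists the word indices in W_t.
    weights = P_w[members] / P_w[members].sum()
    return weights @ P_C_given_w[members]          # P(C|W_t)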

3 OUR METHOD

There are several issues pertinent to most of the existing feature clustering methods. First, the parameter $k$, indicating the desired number of extracted features, has to be specified in advance. This places a burden on the user, since trial-and-error has to be done until the appropriate number of extracted features is found. Second, when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a cluster have the same degree of contribution to the resulting extracted feature. Sometimes, it may be better if more similar words are allowed to have larger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues.

Suppose we are given a document set $D$ of $n$ documents $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n$, together with the feature vector $W$ of $m$ words $w_1, w_2, \ldots, w_m$ and $p$ classes $c_1, c_2, \ldots, c_p$, as specified in Section 2. We construct one word pattern for each word in $W$. For word $w_i$, its word pattern $\mathbf{x}_i$ is defined, similarly as in [27], by

$$\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle = \langle P(c_1|w_i), P(c_2|w_i), \ldots, P(c_p|w_i) \rangle, \qquad (7)$$

where

$$P(c_j|w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}}, \qquad (8)$$

for $1 \le j \le p$. Note that $d_{qi}$ indicates the number of occurrences of $w_i$ in document $\mathbf{d}_q$, as described in Section 2. Also, $\delta_{qj}$ is defined as

$$\delta_{qj} = \begin{cases} 1, & \text{if document } \mathbf{d}_q \text{ belongs to class } c_j, \\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$

Therefore, we have $m$ word patterns in total. For example, suppose we have four documents $\mathbf{d}_1$, $\mathbf{d}_2$, $\mathbf{d}_3$, and $\mathbf{d}_4$ belonging to $c_1$, $c_1$, $c_2$, and $c_2$, respectively. Let the occurrences of $w_1$ in these documents be 1, 2, 3, and 4, respectively. Then the word pattern $\mathbf{x}_1$ of $w_1$ is:

$$P(c_1|w_1) = \frac{1 \times 1 + 2 \times 1 + 3 \times 0 + 4 \times 0}{1 + 2 + 3 + 4} = 0.3,$$
$$P(c_2|w_1) = \frac{1 \times 0 + 2 \times 0 + 3 \times 1 + 4 \times 1}{1 + 2 + 3 + 4} = 0.7,$$
$$\mathbf{x}_1 = \langle 0.3, 0.7 \rangle. \qquad (10)$$
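A small sketch of (7)-(9), with names of our own choosing: it builds the $m \times p$ matrix of word patterns from the count matrix and the document labels, and reproduces the example of (10). It assumes every word occurs at least once, so the denominator of (8) is nonzero.

import numpy as np

def word_patterns(D, y, classes):
    # (7)-(9): x_i = <P(c_1|w_i), ..., P(c_p|w_i)>, where P(c_j|w_i) is the fraction
    # of the total occurrences of w_i that fall in documents of class c_j.
    D = np.asarray(D, dtype=float)                       # n documents by m words
    delta = np.array([[1.0 if y[q] == c else 0.0 for c in classes] for q in range(len(y))])
    counts = D.T @ delta                                 # numerator of (8), m by p
    return counts / D.sum(axis=0).reshape(-1, 1)         # divide by total occurrences of w_i

# The example of (10): w_1 occurs 1, 2, 3, 4 times in d_1..d_4 of classes c_1, c_1, c_2, c_2.
D_toy = np.array([[1], [2], [3], [4]])
print(word_patterns(D_toy, ["c1", "c1", "c2", "c2"], ["c1", "c2"]))   # [[0.3  0.7]]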

It is on these word patterns that our clustering algorithm works. Our goal is to group the words in $W$ into clusters based on these word patterns. A cluster contains a certain number of word patterns and is characterized by the product of $p$ one-dimensional Gaussian functions. Gaussian functions are adopted because of their superiority over other functions in performance [34], [35]. Let $G$ be a cluster containing $q$ word patterns $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_q$, with $\mathbf{x}_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp} \rangle$, $1 \le j \le q$. Then the mean $\mathbf{m} = \langle m_1, m_2, \ldots, m_p \rangle$ and the deviation $\boldsymbol{\sigma} = \langle \sigma_1, \sigma_2, \ldots, \sigma_p \rangle$ of $G$ are defined as

$$m_i = \frac{\sum_{j=1}^{q} x_{ji}}{|G|}, \qquad (11)$$

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}}, \qquad (12)$$

for $1 \le i \le p$, where $|G|$ denotes the size of $G$, i.e., the number of word patterns contained in $G$. The fuzzy similarity of a word pattern $\mathbf{x} = \langle x_1, x_2, \ldots, x_p \rangle$ to cluster $G$ is defined by the following membership function:

$$\mu_G(\mathbf{x}) = \prod_{i=1}^{p} \exp\left[-\left(\frac{x_i - m_i}{\sigma_i}\right)^2\right]. \qquad (13)$$

Notice that $0 \le \mu_G(\mathbf{x}) \le 1$. A word pattern close to the mean of a cluster is regarded as very similar to this cluster, i.e., $\mu_G(\mathbf{x}) \approx 1$. On the contrary, a word pattern far distant from a cluster is hardly similar to this cluster, i.e., $\mu_G(\mathbf{x}) \approx 0$. For example, suppose $G_1$ is an existing cluster with mean vector $\mathbf{m}_1 = \langle 0.4, 0.6 \rangle$ and deviation vector $\boldsymbol{\sigma}_1 = \langle 0.2, 0.3 \rangle$. The fuzzy similarity of the word pattern $\mathbf{x}_1$ shown in (10) to cluster $G_1$ becomes

$$\mu_{G_1}(\mathbf{x}_1) = \exp\left[-\left(\frac{0.3 - 0.4}{0.2}\right)^2\right] \times \exp\left[-\left(\frac{0.7 - 0.6}{0.3}\right)^2\right] = 0.7788 \times 0.8948 = 0.6969. \qquad (14)$$
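The membership function of (13) is straightforward to compute; the sketch below (our own naming) reproduces the value 0.6969 of (14).

import numpy as np

def fuzzy_similarity(x, mean, dev):
    # (13): product over the p class dimensions of Gaussian memberships.
    x, mean, dev = map(np.asarray, (x, mean, dev))
    return float(np.prod(np.exp(-((x - mean) / dev) ** 2)))

# Reproducing (14): x_1 = <0.3, 0.7>, cluster mean <0.4, 0.6>, deviation <0.2, 0.3>.
print(round(fuzzy_similarity([0.3, 0.7], [0.4, 0.6], [0.2, 0.3]), 4))   # 0.6969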

3.1 Self-Constructing Clustering

Our clustering algorithm is an incremental, self-constructing learning approach. Word patterns are considered one by one. The user does not need to have any idea about the number of clusters in advance. No clusters exist at the beginning, and clusters can be created as necessary. For each word pattern, the similarity of this word pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. Once a new cluster is created, the corresponding membership function should be initialized. On the contrary, when the word pattern is combined into an existing cluster, the membership function of that cluster should be updated accordingly.



Let $k$ be the number of currently existing clusters $G_1, G_2, \ldots, G_k$. Each cluster $G_j$ has mean $\mathbf{m}_j = \langle m_{j1}, m_{j2}, \ldots, m_{jp} \rangle$ and deviation $\boldsymbol{\sigma}_j = \langle \sigma_{j1}, \sigma_{j2}, \ldots, \sigma_{jp} \rangle$. Let $S_j$ be the size of cluster $G_j$. Initially, we have $k = 0$, so no clusters exist at the beginning. For each word pattern $\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle$, $1 \le i \le m$, we calculate, according to (13), the similarity of $\mathbf{x}_i$ to each existing cluster, i.e.,

$$\mu_{G_j}(\mathbf{x}_i) = \prod_{q=1}^{p} \exp\left[-\left(\frac{x_{iq} - m_{jq}}{\sigma_{jq}}\right)^2\right] \qquad (15)$$

for $1 \le j \le k$. We say that $\mathbf{x}_i$ passes the similarity test on cluster $G_j$ if

$$\mu_{G_j}(\mathbf{x}_i) \ge \rho, \qquad (16)$$

where $\rho$, $0 \le \rho \le 1$, is a predefined threshold. If the user intends to have larger clusters, then he/she can give a smaller threshold; otherwise, a larger threshold can be given. As the threshold increases, the number of clusters also increases. Note that, as usual, the power in (15) is 2 [34], [35]. Its value has an effect on the number of clusters obtained. A larger value makes the boundaries of the Gaussian function sharper, and more clusters will be obtained for a given threshold. On the contrary, a smaller value makes the boundaries of the Gaussian function smoother, and fewer clusters will be obtained instead.

Two cases may occur. First, there is no existing fuzzy cluster on which $\mathbf{x}_i$ has passed the similarity test. In this case, we assume that $\mathbf{x}_i$ is not similar enough to any existing cluster, and a new cluster $G_h$, $h = k + 1$, is created with

$$\mathbf{m}_h = \mathbf{x}_i, \quad \boldsymbol{\sigma}_h = \boldsymbol{\sigma}_0, \qquad (17)$$

where $\boldsymbol{\sigma}_0 = \langle \sigma_0, \ldots, \sigma_0 \rangle$ is a user-defined constant vector. Note that the new cluster $G_h$ contains only one member, the word pattern $\mathbf{x}_i$, at this point. Estimating the deviation of a cluster by (12) is impossible, or inaccurate, if the cluster contains few members. In particular, the deviation of a new cluster is 0, since it contains only one member, and we cannot use a zero deviation in the calculation of fuzzy similarities. Therefore, we initialize the deviation of a newly created cluster by $\boldsymbol{\sigma}_0$, as indicated in (17). Of course, the number of clusters is increased by 1 and the size of cluster $G_h$, $S_h$, should be initialized, i.e.,

$$k = k + 1, \quad S_h = 1. \qquad (18)$$

Second, if there are existing clusters on which $\mathbf{x}_i$ has passed the similarity test, let cluster $G_t$ be the cluster with the largest membership degree, i.e.,

$$t = \arg\max_{1 \le j \le k} \mu_{G_j}(\mathbf{x}_i). \qquad (19)$$

In this case, we regard $\mathbf{x}_i$ as most similar to cluster $G_t$, and $\mathbf{m}_t$ and $\boldsymbol{\sigma}_t$ of cluster $G_t$ should be modified to include $\mathbf{x}_i$ as a member. The modification to cluster $G_t$ is described as follows:

$$m_{tj} = \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}, \qquad (20)$$

$$\sigma_{tj} = \sqrt{A - B} + \sigma_0, \qquad (21)$$

$$A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t}, \qquad (22)$$

$$B = \frac{S_t + 1}{S_t}\left(\frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}\right)^2, \qquad (23)$$

for $1 \le j \le p$, and

$$S_t = S_t + 1. \qquad (24)$$

Equations (20) and (21) can be derived easily from (11) and (12); note that (20) is applied first, so the $m_{tj}$ appearing in (22) and (23) is the updated mean. Also note that $k$ is not changed in this case. Let us give an example, following (10) and (14). Suppose $\mathbf{x}_1$ is most similar to cluster $G_1$, the size of cluster $G_1$ is 4, and the initial deviation $\sigma_0$ is 0.1. Then cluster $G_1$ is modified as follows:

$$m_{11} = \frac{4 \times 0.4 + 0.3}{4 + 1} = 0.38, \quad m_{12} = \frac{4 \times 0.6 + 0.7}{4 + 1} = 0.62, \quad \mathbf{m}_1 = \langle 0.38, 0.62 \rangle,$$

$$A_{11} = \frac{(4 - 1)(0.2 - 0.1)^2 + 4 \times 0.38^2 + 0.3^2}{4}, \quad B_{11} = \frac{4 + 1}{4}\left(\frac{4 \times 0.38 + 0.3}{4 + 1}\right)^2, \quad \sigma_{11} = \sqrt{A_{11} - B_{11}} + 0.1 = 0.1937,$$

$$A_{12} = \frac{(4 - 1)(0.3 - 0.1)^2 + 4 \times 0.62^2 + 0.7^2}{4}, \quad B_{12} = \frac{4 + 1}{4}\left(\frac{4 \times 0.62 + 0.7}{4 + 1}\right)^2, \quad \sigma_{12} = \sqrt{A_{12} - B_{12}} + 0.1 = 0.2769,$$

$$\boldsymbol{\sigma}_1 = \langle 0.1937, 0.2769 \rangle, \quad S_1 = 4 + 1 = 5.$$

The above process is iterated until all the word patterns have been processed. Consequently, we have $k$ clusters. The whole clustering algorithm can be summarized below.

Initialization:
  # of original word patterns: $m$
  # of classes: $p$
  Threshold: $\rho$
  Initial deviation: $\sigma_0$
  Initial # of clusters: $k = 0$
Input:
  $\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle$, $1 \le i \le m$
Output:
  Clusters $G_1, G_2, \ldots, G_k$

procedure Self-Constructing-Clustering-Algorithm
  for each word pattern $\mathbf{x}_i$, $1 \le i \le m$
    temp_W = $\{G_j \mid \mu_{G_j}(\mathbf{x}_i) \ge \rho,\ 1 \le j \le k\}$;
    if (temp_W is empty)
      A new cluster $G_h$, $h = k + 1$, is created by (17)-(18);
    else
      Let $G_t \in$ temp_W be the cluster to which $\mathbf{x}_i$ is closest by (19);
      Incorporate $\mathbf{x}_i$ into $G_t$ by (20)-(24);
    endif;
  endfor;
  return the created $k$ clusters;
endprocedure
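Below is a compact Python sketch of the procedure above, applying (15)-(16) and (17)-(24). Following the worked example ($\boldsymbol{\sigma}_1 = \langle 0.1937, 0.2769 \rangle$), (20) is applied first and the updated mean is used inside (22) and (23). All function and variable names are ours; this is a reading of the algorithm, not the authors' MATLAB implementation.

import numpy as np

def self_constructing_clustering(patterns, rho, sigma0):
    # `patterns` is the m-by-p array of word patterns, rho the similarity
    # threshold of (16), sigma0 the initial deviation of (17).
    means, devs, sizes, assign = [], [], [], []
    for x in np.asarray(patterns, dtype=float):
        sims = [np.prod(np.exp(-((x - m) / d) ** 2)) for m, d in zip(means, devs)]
        passed = [j for j, s in enumerate(sims) if s >= rho]
        if not passed:                                  # (17)-(18): create a new cluster
            means.append(x.copy())
            devs.append(np.full_like(x, sigma0))
            sizes.append(1)
            assign.append(len(means) - 1)
        else:                                           # (19): most similar existing cluster
            t = max(passed, key=lambda j: sims[j])
            S, d = sizes[t], devs[t]
            m_new = (S * means[t] + x) / (S + 1)        # (20)
            A = ((S - 1) * (d - sigma0) ** 2 + S * m_new ** 2 + x ** 2) / S       # (22)
            B = (S + 1) / S * ((S * m_new + x) / (S + 1)) ** 2                    # (23)
            devs[t] = np.sqrt(np.maximum(A - B, 0.0)) + sigma0                    # (21)
            means[t] = m_new
            sizes[t] = S + 1                            # (24)
            assign.append(t)
    return np.array(means), np.array(devs), np.array(sizes), assign

The feeding order of the patterns matters; the heuristic the paper uses to determine this order is described next.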

Note that the word patterns in a cluster have a high degree of similarity to each other. Besides, when new training patterns are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of generating the whole set of clusters from scratch.

Note also that the order in which the word patterns are fed in influences the clusters obtained. We apply a heuristic to determine the order: we sort all the patterns, in decreasing order, by their largest components, and the word patterns are fed in this order. In this way, more significant patterns are fed in first and are likely to become the cores of the underlying clusters. For example, let $\mathbf{x}_1 = \langle 0.1, 0.3, 0.6 \rangle$, $\mathbf{x}_2 = \langle 0.3, 0.3, 0.4 \rangle$, and $\mathbf{x}_3 = \langle 0.8, 0.1, 0.1 \rangle$ be three word patterns. The largest components in these word patterns are 0.6, 0.4, and 0.8, respectively. The sorted list is 0.8, 0.6, 0.4, so the order of feeding is $\mathbf{x}_3$, $\mathbf{x}_1$, $\mathbf{x}_2$. This heuristic works well in practice.

We briefly discuss here the computational cost of our method and compare it with DC [27], IOC [14], and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every existing cluster. Each pattern consists of $p$ components, where $p$ is the number of classes in the document set. Therefore, in the worst case, the time complexity of our method is $O(mkp)$, where $m$ is the number of original features and $k$ is the number of clusters finally obtained. For DC, the complexity is $O(mkpt)$, where $t$ is the number of iterations to be done. The complexity of IG is $O(mp + m\log m)$, and the complexity of IOC is $O(mkpn)$, where $n$ is the number of documents involved. Clearly, IG is the fastest, and our method is faster than DC and IOC.

3.2 Feature Extraction

Formally, feature extraction can be expressed in the following form:

$$D' = D T, \qquad (25)$$

where

$$D = [\mathbf{d}_1\ \mathbf{d}_2\ \cdots\ \mathbf{d}_n]^T, \qquad (26)$$
$$D' = [\mathbf{d}'_1\ \mathbf{d}'_2\ \cdots\ \mathbf{d}'_n]^T, \qquad (27)$$

$$T = \begin{bmatrix} t_{11} & \cdots & t_{1k} \\ t_{21} & \cdots & t_{2k} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mk} \end{bmatrix}, \qquad (28)$$

with

$$\mathbf{d}_i = [d_{i1}\ d_{i2}\ \cdots\ d_{im}], \quad \mathbf{d}'_i = [d'_{i1}\ d'_{i2}\ \cdots\ d'_{ik}],$$

for $1 \le i \le n$. Clearly, $T$ is a weighting matrix. The goal of feature reduction is achieved by finding an appropriate $T$ such that $k$ is smaller than $m$. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.2, the elements of $T$ in (25) are binary and can be defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } w_i \in W_j, \\ 0, & \text{otherwise,} \end{cases} \qquad (29)$$

where $1 \le i \le m$ and $1 \le j \le k$. That is, if a word $w_i$ belongs to cluster $W_j$, $t_{ij}$ is 1; otherwise, $t_{ij}$ is 0.

By applying our clustering algorithm, word patterns have been grouped into clusters, and the words in the feature vector $W$ are clustered accordingly. For each cluster, we have one extracted feature. Since we have $k$ clusters, we have $k$ extracted features. The elements of $T$ are derived based on the obtained clusters, and feature extraction is then done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting approach, each word is allowed to belong to only one cluster, and so it contributes to only one new extracted feature. In this case, the elements of $T$ in (25) are defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1 \le \alpha \le k} \mu_{G_\alpha}(\mathbf{x}_i), \\ 0, & \text{otherwise.} \end{cases} \qquad (30)$$

Note that if $j$ is not unique in (30), one of them is chosen randomly. In the soft-weighting approach, each word is allowed to contribute to all new extracted features, with degrees depending on the values of the membership functions. The elements of $T$ in (25) are defined as follows:

$$t_{ij} = \mu_{G_j}(\mathbf{x}_i). \qquad (31)$$

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting approach. In this case, the elements of $T$ in (25) are defined as follows:

$$t_{ij} = \gamma \cdot t^H_{ij} + (1 - \gamma) \cdot t^S_{ij}, \qquad (32)$$

where $t^H_{ij}$ is obtained by (30), $t^S_{ij}$ is obtained by (31), and $\gamma$ is a user-defined constant lying between 0 and 1. Note that $\gamma$ is not related to the clustering; it concerns the merging of the component features in a cluster into a resulting feature. The merge can be made "hard" or "soft" by setting $\gamma$ to 1 or 0. By selecting the value of $\gamma$, we provide flexibility to the user. When the similarity threshold is small, the number of clusters is small, and each cluster covers more training patterns. In this case, a smaller $\gamma$ will favor soft-weighting and yield a higher accuracy. On the contrary, when the similarity threshold is large, the number of clusters is large, and each cluster covers fewer training patterns. In this case, a larger $\gamma$ will favor hard-weighting and yield a higher accuracy.
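A sketch of (30)-(32), with our own naming: it builds the hard, soft, and mixed weighting matrices from the word patterns and the cluster parameters returned by the clustering step, after which $D' = DT$ of (25) gives the reduced representation. Ties in (30) are broken here by taking the first maximum rather than randomly.

import numpy as np

def weighting_matrices(patterns, means, devs, gamma=0.5):
    # (30)-(32): hard, soft, and mixed weighting matrices built from the clusters.
    # `patterns` is m-by-p, `means`/`devs` are k-by-p; gamma is the mixing constant.
    patterns = np.asarray(patterns, dtype=float)
    means, devs = np.asarray(means, dtype=float), np.asarray(devs, dtype=float)
    diff = (patterns[:, None, :] - means[None, :, :]) / devs[None, :, :]
    T_soft = np.exp(-diff ** 2).prod(axis=2)            # (31): t_ij = mu_Gj(x_i)
    T_hard = np.zeros_like(T_soft)                      # (30): winner takes all
    T_hard[np.arange(T_soft.shape[0]), T_soft.argmax(axis=1)] = 1.0
    T_mixed = gamma * T_hard + (1.0 - gamma) * T_soft   # (32)
    return T_hard, T_soft, T_mixed

# Feature extraction of (25): D_prime = D @ T for whichever T is chosen.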

3.3 Text Classification

Given a set $D$ of training documents, text classification can be done as follows. We specify the similarity threshold $\rho$ for (16) and apply our clustering algorithm. Assume that $k$ clusters are obtained for the words in the feature vector $W$. Then we find the weighting matrix $T$ and convert $D$ to $D'$ by (25). Using $D'$ as training data, a classifier based on support vector machines (SVM) is built. Note that any classification technique other than SVM could be applied. Joachims [36] showed that SVM is better than other methods for text categorization.



SVM is a kernel method, which finds the maximum margin hyperplane in feature space separating the images of the training patterns into two groups [37], [38], [39]. To make the method more flexible and robust, some patterns need not be correctly classified by the hyperplane, but the misclassified patterns should be penalized. Therefore, slack variables $\xi_i$ are introduced to account for misclassifications. The objective function and constraints of the classification problem can be formulated as:

$$\min_{\mathbf{w}, b}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l} \xi_i$$
$$\text{s.t.}\quad y_i\big(\mathbf{w}^T\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l, \qquad (33)$$

where $l$ is the number of training patterns, $C$ is a parameter giving a tradeoff between maximum margin and classification error, and $y_i$, being $+1$ or $-1$, is the target label of pattern $\mathbf{x}_i$. Note that $\phi: X \to F$ is a mapping from the input space to the feature space $F$, where patterns are more easily separated, and $\mathbf{w}^T\phi(\mathbf{x}_i) + b = 0$ is the hyperplane to be derived, with $\mathbf{w}$ and $b$ being the weight vector and offset, respectively.

An SVM as described above can only separate two classes, $y_i = +1$ and $y_i = -1$. We follow the idea in [36] to construct an SVM-based classifier. For $p$ classes, we create $p$ SVMs, one for each class. For the SVM of class $c_v$, $1 \le v \le p$, the training patterns of class $c_v$ are treated as having $y_i = +1$, and the training patterns of the other classes are treated as having $y_i = -1$. The classifier is then the aggregation of these SVMs. We are now ready to classify unknown documents. Suppose $\mathbf{d}$ is an unknown document. We first convert $\mathbf{d}$ to $\mathbf{d}'$ by

$$\mathbf{d}' = \mathbf{d} T. \qquad (34)$$

Then we feed $\mathbf{d}'$ to the classifier and get $p$ values, one from each SVM. Document $\mathbf{d}$ then belongs to those classes whose corresponding SVMs output 1. For example, consider a case of three classes $c_1$, $c_2$, and $c_3$. If the three SVMs output 1, $-1$, and 1, respectively, then the predicted classes are $c_1$ and $c_3$ for this document. If the three SVMs output $-1$, 1, and 1, respectively, the predicted classes are $c_2$ and $c_3$.
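The one-SVM-per-class scheme just described can be sketched as follows. This is not the authors' implementation (they used MATLAB); it assumes scikit-learn's LinearSVC as a stand-in for the kernel SVM of (33), and it assumes each document's label set is given as a collection of class names.

import numpy as np
from sklearn.svm import LinearSVC   # assumed available; any SVM package would do

def train_one_vs_rest(D_prime, label_sets, classes, C=1.0):
    # One SVM per class c_v: documents labeled with c_v get target +1, all others -1.
    svms = {}
    for c in classes:
        y = np.where(np.asarray([c in ls for ls in label_sets]), 1, -1)
        svms[c] = LinearSVC(C=C).fit(D_prime, y)
    return svms

def classify(d_prime, svms):
    # A document is assigned every class whose SVM outputs +1.
    return [c for c, clf in svms.items() if clf.predict(d_prime.reshape(1, -1))[0] == 1]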

4 AN EXAMPLE

We give an example here to illustrate how our method works. Let $D$ be a simple document set containing nine documents $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_9$ of two classes $c_1$ and $c_2$, with ten words "office," "building," ..., "fridge" in the feature vector $W$, as shown in Table 1. For simplicity, we denote the ten words as $w_1, w_2, \ldots, w_{10}$, respectively.

We calculate the ten word patterns $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{10}$ according to (7) and (8). For example, $\mathbf{x}_6 = \langle P(c_1|w_6), P(c_2|w_6) \rangle$ and $P(c_2|w_6)$ is calculated by (35):

$$P(c_2|w_6) = \frac{1 \times 0 + 2 \times 0 + 0 \times 0 + 1 \times 0 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 0 \times 1}{1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0} = 0.50. \qquad (35)$$

The resulting word patterns are shown in Table 2. Note that each word pattern is a two-dimensional vector, since there are two classes involved in $D$.

We run our self-constructing clustering algorithm, setting $\sigma_0 = 0.5$ and $\rho = 0.64$, on the word patterns and obtain three clusters $G_1$, $G_2$, and $G_3$, which are shown in Table 3. The fuzzy similarity of each word pattern to each cluster is shown in Table 4. The weighting matrices $T_H$, $T_S$, and $T_M$ obtained by hard-weighting, soft-weighting, and mixed-weighting (with $\gamma = 0.8$), respectively, are shown in Table 5. The transformed data sets $D'_H$, $D'_S$, and $D'_M$ obtained by (25) for the different weighting cases are shown in Table 6.

Based on $D'_H$, $D'_S$, or $D'_M$, a classifier with two SVMs is built. Suppose $\mathbf{d}$ is an unknown document with $\mathbf{d} = \langle 0, 1, 1, 1, 1, 1, 0, 1, 1, 1 \rangle$. We first convert $\mathbf{d}$ to $\mathbf{d}'$ by (34). The transformed document is obtained as $\mathbf{d}'_H = \mathbf{d} T_H = \langle 2, 4, 2 \rangle$, $\mathbf{d}'_S = \mathbf{d} T_S = \langle 2.5591, 4.3478, 3.9964 \rangle$, or $\mathbf{d}'_M = \mathbf{d} T_M = \langle 2.1118, 4.0696, 2.3993 \rangle$. The transformed unknown document is then fed to the classifier. For this example, the classifier concludes that $\mathbf{d}$ belongs to $c_2$.

5 EXPERIMENTAL RESULTS

In this section, we present experimental results to show the effectiveness of our fuzzy self-constructing feature clustering method.


TABLE 1. A Simple Document Set D

TABLE 2. Word Patterns of W


Three well-known data sets for text classification research, 20 Newsgroups [1], RCV1 [40], and Cade12 [41], are used in the following experiments. We compare our method with three other feature reduction methods: IG [10], IOC [14], and DC [27]. As described in Section 2, IG is one of the state-of-the-art feature selection approaches, IOC is an incremental feature extraction algorithm, and DC is a feature clustering approach. For convenience, our method is abbreviated in this section as FFC, standing for Fuzzy Feature Clustering. In our method, $\sigma_0$ is set to 0.25; this value was determined by investigation and is kept constant for all the following experiments. Furthermore, we use H-FFC, S-FFC, and M-FFC to denote hard-weighting, soft-weighting, and mixed-weighting, respectively. We run a classifier built on support vector machines [42], as described in Section 3.3, on the extracted features obtained by the different methods. We use a computer with an Intel(R) Core(TM)2 Quad CPU Q6600 at 2.40 GHz and 4 GB of RAM to conduct the experiments. The programming language used is MATLAB 7.0.

To compare the classification effectiveness of each method, we adopt the performance measures microaveraged precision (MicroP), microaveraged recall (MicroR), microaveraged F1 (MicroF1), and microaveraged accuracy (MicroAcc) [4], [43], [44], defined as follows:

$$\mathrm{MicroP} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FP_i)}, \qquad (36)$$

$$\mathrm{MicroR} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FN_i)}, \qquad (37)$$

$$\mathrm{MicroF1} = \frac{2 \cdot \mathrm{MicroP} \cdot \mathrm{MicroR}}{\mathrm{MicroP} + \mathrm{MicroR}}, \qquad (38)$$

$$\mathrm{MicroAcc} = \frac{\sum_{i=1}^{p} (TP_i + TN_i)}{\sum_{i=1}^{p} (TP_i + TN_i + FP_i + FN_i)}, \qquad (39)$$

where $p$ is the number of classes. $TP_i$ (true positives with respect to $c_i$) is the number of $c_i$ test documents that are correctly classified to $c_i$. $TN_i$ (true negatives with respect to $c_i$) is the number of non-$c_i$ test documents that are classified to non-$c_i$. $FP_i$ (false positives with respect to $c_i$) is the number of non-$c_i$ test documents that are incorrectly classified to $c_i$. $FN_i$ (false negatives with respect to $c_i$) is the number of $c_i$ test documents that are classified to non-$c_i$.
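The measures (36)-(39) can be computed from the per-class counts as in the sketch below (our own naming, NumPy assumed).

import numpy as np

def micro_averages(TP, FP, FN, TN):
    # (36)-(39): micro-averaged precision, recall, F1, and accuracy from the
    # per-class counts (each argument is a length-p array).
    TP, FP, FN, TN = map(lambda a: np.asarray(a, dtype=float), (TP, FP, FN, TN))
    micro_p = TP.sum() / (TP.sum() + FP.sum())
    micro_r = TP.sum() / (TP.sum() + FN.sum())
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    micro_acc = (TP.sum() + TN.sum()) / (TP + TN + FP + FN).sum()
    return micro_p, micro_r, micro_f1, micro_acc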

5.1 Experiment 1: 20 Newsgroups Data Set

The 20 Newsgroups collection contains about 20,000 articles taken from the Usenet newsgroups.


TABLE 3. Three Clusters Obtained

TABLE 5. Weighting Matrices: Hard T_H, Soft T_S, and Mixed T_M

TABLE 6. Transformed Data Sets: Hard D'_H, Soft D'_S, and Mixed D'_M

TABLE 4. Fuzzy Similarities of Word Patterns to Three Clusters


These articles are evenly distributed over 20 classes, and each class has about 1,000 articles, as shown in Fig. 1a. In this figure, the x-axis indicates the class number, and the y-axis indicates the fraction of the articles of each class. We use two-thirds of the documents for training and the rest for testing. After preprocessing, we have 25,718 features, or words, for this data set.

Fig. 2 shows the execution time (in seconds) of the different feature reduction methods on the 20 Newsgroups data set. Since H-FFC, S-FFC, and M-FFC share the same clustering phase, they have the same execution time, and we therefore denote them collectively as FFC in Fig. 2. In this figure, the horizontal axis indicates the number of extracted features. To obtain different numbers of extracted features, different values of $\rho$ are used in FFC; the number of extracted features is identical to the number of clusters. For the other methods, the number of extracted features has to be specified in advance by the user. Table 7 lists values of certain points in Fig. 2, together with the values of $\rho$ used in FFC. Note that for IG, each word is given a weight and the words with the top $k$ weights in $W$ are selected as the extracted features in $W'$; therefore, the execution time is basically the same for any value of $k$. Obviously, our method runs much faster than DC and IOC. For example, our method needs 8.61 seconds for 20 extracted features, while DC requires 88.38 seconds and IOC requires 6,943.40 seconds. For 84 extracted features, our method needs only 17.68 seconds, but DC and IOC require 293.98 and 28,098.05 seconds, respectively. As the number of extracted features increases, DC and IOC slow down significantly. In particular, when the number of extracted features exceeds 280, IOC does not finish within 100,000 seconds, as indicated by dashes in Table 7.

Fig. 3 shows the MicroAcc (percent) of the 20 Newsgroups data set obtained by the different methods, based on the extracted features previously obtained. The vertical axis indicates the MicroAcc values (percent), and the horizontal axis indicates the number of extracted features. Table 8 lists values of certain points in Fig. 3. Note that $\gamma$ is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 280. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG gets only 95.79 percent accuracy for 20 extracted features. S-FFC works very well when the number of extracted features is smaller than 58; for example, it gets 98.46 percent accuracy for 20 extracted features. H-FFC and M-FFC perform well in accuracy all the time, except for the case of 20 extracted features.


Fig. 1. Class distributions of three data sets. (a) 20 Newsgroups. (b) RCV1. (c) Cade12.

Fig. 2. Execution time (sec.) of different methods on 20 Newsgroups data.

TABLE 7. Sampled Execution Times (Seconds) of Different Methods on 20 Newsgroups Data

Fig. 3. Microaveraged accuracy (percent) of different methods for 20 Newsgroups data.


For example, H-FFC and M-FFC get 98.43 percent and 98.54 percent accuracy, respectively, for 58 extracted features, 98.63 percent and 98.56 percent for 203 extracted features, and 98.65 percent and 98.59 percent for 521 extracted features. Although the number of extracted features is affected by the threshold value, the classification accuracies obtained stay fairly stable. DC and IOC perform a little worse in accuracy when the number of extracted features is small; for example, they get 97.73 percent and 97.09 percent accuracy, respectively, for 20 extracted features. Note that when all 25,718 features are used for classification, the accuracy is 98.45 percent.

Table 9 lists the MicroP, MicroR, and MicroF1 values (percent) of the 20 Newsgroups data set obtained by the different methods. Note that $\gamma$ is set to 0.5 for M-FFC. From this table, we can see that S-FFC generally gets the best results for MicroF1, followed by M-FFC, H-FFC, and DC in that order. For MicroP, H-FFC performs a little better than S-FFC, but for MicroR, S-FFC outperforms H-FFC by a larger margin than in the MicroP case. M-FFC performs well for MicroP, MicroR, and MicroF1. For MicroF1, M-FFC performs only a little worse than S-FFC, with almost identical performance when the number of features is less than 500. For MicroP, M-FFC performs better than S-FFC most of the time. H-FFC performs well when the number of features is less than 200, while it performs worse than DC and M-FFC when the number of features is greater than 200. DC performs well when the number of features is greater than 200, but worse when the number of features is less than 200; for example, DC gets 88.00 percent and M-FFC gets 89.65 percent when the number of features is 20. For MicroR, S-FFC performs best in all cases. M-FFC performs well, especially when the number of features is less than 200. Note that when all 25,718 features are used for classification, MicroP is 94.53 percent, MicroR is 73.18 percent, and MicroF1 is 82.50 percent.

In summary, IG runs very fast in feature reduction, but the extracted features it obtains are not good for classification. FFC not only runs much faster than DC and IOC in feature reduction, but also provides comparably good or better extracted features for classification.


TABLE 8. Microaveraged Accuracy (Percent) of Different Methods for 20 Newsgroups Data

TABLE 9. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for 20 Newsgroups Data

Fig. 4. Microaveraged F1 (percent) of M-FFC with different γ values for 20 Newsgroups data.


Fig. 4 shows the influence of $\gamma$ values on the MicroF1 performance of M-FFC for three numbers of extracted features, 20, 203, and 1,453, indicated by M-FFC20, M-FFC203, and M-FFC1453, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.

5.2 Experiment 2: Reuters Corpus Volume 1 (RCV1) Data Set

The RCV1 data set consists of 804,414 news stories produced by Reuters from 20 August 1996 to 19 August 1997. All news stories are in English and have 109 distinct terms per document on average. The documents are divided, by the "LYRL2004" split defined in Lewis et al. [40], into 23,149 training documents and 781,265 test documents. There are 103 Topic categories, and the distribution of the documents over the classes is shown in Fig. 1b. All 103 categories have one or more positive test examples in the test set, but only 101 of them have one or more positive training examples in the training set. It is these 101 categories that we use in this experiment. After preprocessing, we have 47,152 features for this data set.

Fig. 5 shows the execution time (in seconds) of the different feature reduction methods on the RCV1 data set. Table 10 lists values of certain points in Fig. 5. Clearly, our method runs much faster than DC and IOC. For example, for 18 extracted features, DC requires 609.63 seconds and IOC requires 94,533.88 seconds, but our method needs only 14.42 seconds. For 65 extracted features, our method needs only 37.40 seconds, but DC and IOC require 1,709.77 and over 100,000 seconds, respectively. As the number of extracted features increases, DC and IOC slow down significantly.

Fig. 6 shows the MicroAcc (percent) of the RCV1 data set obtained by the different methods, based on the extracted features previously obtained. Table 11 lists values of certain points in Fig. 6. Note that $\gamma$ is set to 0.5 for M-FFC. No accuracies are listed for IOC when the number of extracted features exceeds 18. As shown in the figure, IG performs the worst in classification accuracy, especially when the number of extracted features is small. For example, IG gets only 96.95 percent accuracy for 18 extracted features. H-FFC, S-FFC, and M-FFC perform well in accuracy all the time. For example, H-FFC, S-FFC, and M-FFC get 98.03 percent, 98.26 percent, and 97.85 percent accuracy, respectively, for 18 extracted features, 98.13 percent, 98.39 percent, and 98.31 percent for 65 extracted features, and 98.41 percent, 98.53 percent, and 98.56 percent for 421 extracted features. Note that when all 47,152 features are used for classification, the accuracy is 98.83 percent. Table 12 lists the MicroP, MicroR, and MicroF1 values (percent) of the RCV1 data set obtained by the different methods. From this table, we can see that S-FFC generally gets the best results for MicroF1, followed by M-FFC, DC, and H-FFC in that order. S-FFC performs better than the other methods, especially when the number of features is less than 500. For MicroP, all the methods perform about equally well, with values around 80-85 percent. IG gets a high MicroP when the number of features is smaller than 50. M-FFC performs well for MicroP, MicroR, and MicroF1. Note that when all 47,152 features are used for classification, MicroP is 86.66 percent, MicroR is 75.03 percent, and MicroF1 is 80.43 percent.

In summary, IG runs very fast in feature reduction, but the extracted features it obtains are not good for classification. For example, IG gets 96.95 percent in MicroAcc, 91.58 percent in MicroP, 5.80 percent in MicroR, and 10.90 percent in MicroF1 for 18 extracted features, and 97.24 percent in MicroAcc, 68.82 percent in MicroP, 25.72 percent in MicroR, and 37.44 percent in MicroF1 for 65 extracted features. FFC not only runs much faster than DC and IOC in feature reduction, but also provides comparably good or better extracted features for classification.

In summary, IG runs very fast in feature reduction, but theextracted features obtained are not good for classification. Forexample, IG gets 96.95 percent in MicroAcc, 91.58 percentin MicroP, 5.80 percent in MicroR, and 10.90 percent inMicroF1 for 18 extracted features, and 97.24 percent inMicroAcc, 68.82 percent in MicroP, 25.72 percent in MicroR,and 37.44 percent in MicroF1 for 65 extracted features. FFCcan not only run much faster than DC and IOC in featurereduction, but also provide comparably good or betterextracted features for classification. For example, S-FFCgets 98.26 percent in MicroAcc, 85.18 percent in MicroP,55.33 percent in MicroR, and 67.08 percent in MicroF1 for18 extracted features, and 98.39 percent in MicroAcc,82.34 percent in MicroP, 63.41 percent in MicroR, and


Fig. 5. Execution time (sec) of different methods on RCV1 data.

Fig. 6. Microaveraged accuracy (percent) of different methods for RCV1 data.

TABLE 10. Sampled Execution Times (sec.) of Different Methods on RCV1 Data


For example, S-FFC gets 98.26 percent in MicroAcc, 85.18 percent in MicroP, 55.33 percent in MicroR, and 67.08 percent in MicroF1 for 18 extracted features, and 98.39 percent in MicroAcc, 82.34 percent in MicroP, 63.41 percent in MicroR, and 71.65 percent in MicroF1 for 65 extracted features. Fig. 7 shows the influence of $\gamma$ values on the MicroF1 performance of M-FFC for three numbers of extracted features, 18, 202, and 1,700, indicated by M-FFC18, M-FFC202, and M-FFC1700, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.

5.3 Experiment 3: Cade12 Data

Cade12 is a set of classified Web pages extracted from the Cadê Web directory [41]. This directory points to Brazilian Web pages that were classified by human experts into 12 classes. The Cade12 collection has a skewed distribution, and the three most popular classes represent more than 50 percent of all documents. A version of this data set, 40,983 documents in total with 122,607 features, was obtained from [45]; two-thirds of the documents (27,322) are used for training and the remaining 13,661 documents for testing. The distribution of the Cade12 data is shown in Fig. 1c and Table 13.

Fig. 8 shows the execution time (in seconds) of the different feature reduction methods on the Cade12 data set. Clearly, our method runs much faster than DC and IOC. For example, for 22 extracted features, DC requires 689.63 seconds and IOC requires 57,958.10 seconds, but our method needs only 24.02 seconds. For 84 extracted features, our method needs only 47.48 seconds, while DC requires 2,939.35 seconds and IOC has difficulty finishing. As the number of extracted features increases, DC and IOC slow down significantly. In particular, IOC can only finish within 100,000 seconds for 22 extracted features, as shown in Table 14.

Fig. 9 shows the MicroAcc (percent) of the Cade12 data set obtained by the different methods, based on the extracted features previously obtained. Table 15 lists values of certain points in Fig. 9. Table 16 lists the MicroP, MicroR, and MicroF1 values (percent) of the Cade12 data set obtained by the different methods. Note that $\gamma$ is set to 0.5 for M-FFC. Also, no accuracies are listed for IOC when the number of extracted features exceeds 22. As shown in the tables, all the methods work equally well for MicroAcc, but none of the methods work satisfactorily well for MicroP, MicroR, and MicroF1.


TABLE 12. Microaveraged Precision, Recall and F1 (Percent) of Different Methods for RCV1 Data

TABLE 11. Microaveraged Accuracy (Percent) of Different Methods for RCV1 Data

Fig. 7. Microaveraged F1 (percent) of M-FFC with different γ values for RCV1 data.


In fact, it is hard to do effective feature reduction for Cade12, since only loose correlations exist among the original features. IG performs best for MicroP; for example, IG gets 78.21 percent in MicroP when the number of features is 38. However, IG performs worst for MicroR, getting only 7.04 percent in MicroR when the number of features is 38. S-FFC, M-FFC, and H-FFC are a little worse than IG in MicroP, but they are good in MicroR. For example, S-FFC gets 75.83 percent in MicroP and 38.33 percent in MicroR when the number of features is 38. In general, S-FFC gets the best results, followed by M-FFC, H-FFC, and DC in that order. Note that when all 122,607 features are used for classification, MicroAcc is 93.55 percent, MicroP is 69.57 percent, MicroR is 40.11 percent, and MicroF1 is 50.88 percent. Again, this experiment shows that FFC not only runs much faster than DC and IOC in feature reduction, but also provides comparably good or better extracted features for classification. Fig. 10 shows the influence of $\gamma$ values on the MicroF1 performance of M-FFC for three numbers of extracted features, 22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and M-FFC1338, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.

6 CONCLUSIONS

We have presented a fuzzy self-constructing feature clustering (FFC) algorithm, which is an incremental clustering approach to reduce the dimensionality of the features in text classification. Features that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for this word.


TABLE 13. The Number of Documents Contained in Each Class of the Cade12 Data Set

Fig. 8. Execution time (seconds) of different methods on Cade12 data.

TABLE 14. Sampled Execution Times (Seconds) of Different Methods on Cade12 Data

TABLE 15. Microaveraged Accuracy (Percent) of Different Methods for Cade12 Data

Fig. 9. Microaveraged accuracy (percent) of different methods for Cade12 data.


Similarity between a word and a cluster is defined by considering both the mean and the variance of the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can be avoided. Experiments on three real-world data sets have demonstrated that our method runs faster and obtains better extracted features than other methods.

Other projects on text classification, with or without feature reduction, have been carried out and their evaluation results published. Some methods adopted the same performance measures as we did, i.e., microaveraged accuracy or microaveraged F1; others adopted unilabeled accuracy (ULAcc). Unilabeled accuracy is basically the same as normal accuracy, except that a pattern, either training or testing, with $m$ class labels is copied $m$ times, each copy being associated with a distinct class label. We discuss some of these methods here; the evaluation results given for them are taken directly from the corresponding papers. Joachims [33] applied the Rocchio classifier over a TFIDF-weighted representation and obtained 91.8 percent unilabeled accuracy on the 20 Newsgroups data set. Lewis et al. [40] used the SVM classifier over a form of TFIDF weighting and obtained 81.6 percent microaveraged F1 on the RCV1 data set. Al-Mubaid and Umair [31] applied a classifier called Lsquare, which combines the distributional clustering of words and a learning logic technique, and achieved 98.00 percent microaveraged accuracy and 86.45 percent microaveraged F1 on the 20 Newsgroups data set. Cardoso-Cachopo [45] applied the SVM classifier over normalized TFIDF term weighting and achieved 52.84 percent unilabeled accuracy on the Cade12 data set and 82.48 percent on the 20 Newsgroups data set. Al-Mubaid and Umair [31] applied AIB [25] for feature reduction, and the number of features was reduced to 120; no feature reduction was applied in the other three methods. The published evaluation results of these methods are summarized in Table 17. Note that, by definition, unilabeled accuracy is usually lower than microaveraged accuracy for multilabel classification. This explains why both Joachims [33] and Cardoso-Cachopo [45] report lower unilabeled accuracies than the microaveraged accuracies shown in Tables 8 and 15. However, Lewis et al. [40] report a higher microaveraged F1 than that shown in Table 11. This may be because Lewis et al. [40] used TFIDF weighting, which is more elaborate than the TF weighting we used.


TABLE 17. Published Evaluation Results for Some Other Text Classifiers

TABLE 16. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for Cade12 Data

Fig. 10. Microaveraged F1 (percent) of M-FFC with different � values for Cade12 data.


However, it is hard for us to explain the difference between the microaveraged F1 obtained by Al-Mubaid and Umair [31] and that shown in Table 9, since a kind of sampling was applied in [31].

Similarity-based clustering is one of the techniques we have developed in our machine learning research. In this paper, we apply this clustering technique to text categorization problems. We are also applying it to other problems, such as image segmentation, data sampling, fuzzy modeling, and web mining. The work of this paper was motivated by the distributional word clustering proposed in [24], [25], [26], [27], [29], [30]. We found that when a document set is transformed into a collection of word patterns, as by (7), the relevance among word patterns can be measured, and the word patterns can be grouped by applying our similarity-based clustering algorithm. Our method is well suited to text categorization problems because it builds on the distributional word clustering concept.
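As a rough illustration of this transformation (a plausible reading of the distributional idea behind (7), with illustrative names and a simplified single-label class assignment, not a transcription of the paper's formula), each word can be represented by the share of its total weight that falls in each class:

```python
import numpy as np

def build_word_patterns(doc_term, doc_class, n_classes):
    """doc_term: (n_docs, n_words) term counts or weights;
    doc_class: (n_docs,) class index of each training document.
    Returns an (n_words, n_classes) matrix of word patterns in which
    row i gives the class distribution of word i's total weight."""
    n_docs, n_words = doc_term.shape
    patterns = np.zeros((n_words, n_classes))
    for q in range(n_docs):
        patterns[:, doc_class[q]] += doc_term[q]
    totals = patterns.sum(axis=1, keepdims=True)
    return patterns / np.maximum(totals, 1e-12)  # rows with nonzero total sum to 1
```

Each row can then be handed to the similarity-based clustering routine, so that words with alike class distributions end up in the same cluster.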

ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers for their comments, which were very helpful in improving the quality and presentation of the paper. They would also like to thank the National Institute of Standards and Technology for providing the RCV1 data set. This work was supported by the National Science Council under grants NSC-95-2221-E-110-055-MY2 and NSC-96-2221-E-110-009.

REFERENCES

[1] http://people.csail.mit.edu/jrennie/20Newsgroups/, 2010.

[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, 2010.

[3] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.

[4] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley Longman, 1999.

[6] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, nos. 1-2, pp. 245-271, 1997.

[7] E.F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones, "Introducing a Family of Linear Measures for Feature Selection in Text Categorization," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1223-1232, Sept. 2005.

[8] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.

[9] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.

[10] Y. Yang and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. 14th Int'l Conf. Machine Learning, pp. 412-420, 1997.

[11] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.

[12] H. Li, T. Jiang, and K. Zhang, "Efficient and Robust Feature Extraction by Maximum Margin Criterion," Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, eds., pp. 97-104, Springer, 2004.

[13] E. Oja, Subspace Methods of Pattern Recognition. Research Studies Press, 1983.

[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333, Mar. 2006.

[15] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.

[16] A.M. Martinez and A.C. Kak, "PCA versus LDA," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, Feb. 2001.

[17] H. Park, M. Jeon, and J. Rosen, "Lower Dimensional Representation of Text Data Based on Centroids and Least Squares," BIT Numerical Math., vol. 43, pp. 427-448, 2003.

[18] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.

[19] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.

[20] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, The MIT Press, 2002.

[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and S. Yoshizawa, "Successive Learning of Linear Discriminant Analysis: Sanger-Type Algorithm," Proc. IEEE CS Int'l Conf. Pattern Recognition, pp. 2664-2667, 2000.

[22] J. Weng, Y. Zhang, and W.S. Hwang, "Candid Covariance-Free Incremental Principal Component Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034-1040, Aug. 2003.

[23] J. Yan, B.Y. Zhang, S.C. Yan, Z. Chen, W.G. Fan, Q. Yang, W.Y. Ma, and Q.S. Cheng, "IMMC: Incremental Maximum Margin Criterion," Proc. 10th ACM SIGKDD, pp. 725-730, 2004.

[24] L.D. Baker and A. McCallum, "Distributional Clustering of Words for Text Classification," Proc. ACM SIGIR, pp. 96-103, 1998.

[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters versus Words for Text Categorization," J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.

[26] M.C. Dalmau and O.W.M. Florez, "Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection," Proc. 29th European Conf. IR Research, pp. 678-681, 2007.

[27] I.S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.

[28] D. Ienco and R. Meo, "Exploration and Reduction of the Feature Space by Hierarchical Clustering," Proc. SIAM Conf. Data Mining, pp. 577-587, 2008.

[29] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification," Proc. 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.

[30] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of the ACL, pp. 183-190, 1993.

[31] H. Al-Mubaid and S.A. Umair, "A New Text Categorization Technique Using Distributional Clustering and Learning Logic," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 9, pp. 1156-1165, Sept. 2006.

[32] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.

[33] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th Int'l Conf. Machine Learning, pp. 143-151, 1997.

[34] J. Yen and R. Langari, Fuzzy Logic: Intelligence, Control, and Information. Prentice-Hall, 1999.

[35] J.S. Wang and C.S.G. Lee, "Self-Adaptive Neurofuzzy Inference Systems for Classification Applications," IEEE Trans. Fuzzy Systems, vol. 10, no. 6, pp. 790-802, Dec. 2002.

[36] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Technical Report LS-8-23, Univ. of Dortmund, 1998.

[37] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.

[38] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

[40] D.D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf, 2004.

[41] The Cade Web Directory, http://www.cade.com.br/, 2010.



[42] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[43] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. ACM SIGIR, pp. 42-49, 1999.

[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, "Mining Multi-Label Data," Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, eds., second ed., Springer, 2009.

[45] http://web.ist.utl.pt/~acardoso/datasets/, 2010.

Jung-Yi Jiang received the BS degree from I-Shou University, Taiwan, in 2002, and the MSEE degree from Sun Yat-Sen University, Taiwan, in 2004. He is currently pursuing the PhD degree at the Department of Electrical Engineering, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and information retrieval.

Ren-Jia Liou received the BS degree from National Dong Hwa University, Taiwan, in 2005, and the MSEE degree from Sun Yat-Sen University, Taiwan, in 2009. He is currently a research assistant in the Department of Chemistry, National Sun Yat-Sen University. His main research interests include machine learning, data mining, and web programming.

Shie-Jue Lee received the BS and MS degrees in electrical engineering, in 1977 and 1979, respectively, from National Taiwan University, and the PhD degree in computer science from the University of North Carolina, Chapel Hill, in 1990. He joined the faculty of the Department of Electrical Engineering at National Sun Yat-Sen University, Taiwan, in 1983, where he has been working as a professor since 1994. Now, he is the deputy dean of academic affairs and director of the NSYSU-III Research Center, National Sun Yat-Sen University. His research interests include artificial intelligence, machine learning, data mining, information retrieval, and soft computing. He served as director of the Center for Telecommunications Research and Development, National Sun Yat-Sen University (1997-2000), director of the Southern Telecommunications Research Center, National Science Council (1998-1999), and chair of the Department of Electrical Engineering, National Sun Yat-Sen University (2000-2003). He is a recipient of the Distinguished Teachers Award of the Ministry of Education, Taiwan, 1993. He was awarded by the Chinese Institute of Electrical Engineering for Outstanding MS Thesis Supervision, 1997. He received the Distinguished Paper Award of the Computer Society of the Republic of China, 1998, and the Best Paper Award of the Seventh Conference on Artificial Intelligence and Applications, 2002. He received the Distinguished Research Award of National Sun Yat-Sen University, 1998. He received the Distinguished Teaching Award of National Sun Yat-Sen University, 1993 and 2008. He received the Best Paper Award of the International Conference on Machine Learning and Cybernetics, 2008. He also received the Distinguished Mentor Award of National Sun Yat-Sen University, 2008. He served as the program chair for the International Conference on Artificial Intelligence (TAAI-96), Kaohsiung, Taiwan, December 1996, the International Computer Symposium-Workshop on Artificial Intelligence, Tainan, Taiwan, December 1998, and the Sixth Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, November 2001. He is a member of the IEEE Society of Systems, Man, and Cybernetics, the Association for Automated Reasoning, the Institute of Information and Computing Machinery, the Taiwan Fuzzy Systems Association, and the Taiwanese Association of Artificial Intelligence. He is also a member of the IEEE.

