
Under consideration for publication in Knowledge and Information Systems

Single Pass Text Classification by Direct Feature Weighting

Hassan H. Malik¹, Dmitriy Fradkin² and Fabian Moerchen²

¹ Thomson Reuters, 195 Broadway, New York, NY 10007, USA
² Integrated Data Systems, Siemens Corporate Research, 755 College Rd. East, Princeton, NJ 08540, USA

Keywords: Text Classification, Feature Weighting, Linear Classifiers, Information Gain, Scalable Learning

Abstract. The Feature Weighting Classifier (FWC) is an efficient multi-class classification algorithm for text data that uses Information Gain to directly estimate per-class feature weights in the classifier. This classifier requires only a single pass over the dataset to compute the feature frequencies per class, is easy to implement, and has memory usage that is linear in the number of features. Results of experiments performed on 128 binary and multi-class text and web datasets show that FWC's performance is at least comparable to, and often better than that of Naive Bayes, TWCNB, Winnow, Balanced Winnow and linear SVM. On a large-scale web dataset with 12,294 classes and 135,973 training instances, FWC trained in 13 seconds and yielded comparable classification performance to a state of the art multi-class SVM implementation, which took over 15 minutes to train.

Received February 04, 2010; Revised May 03, 2010; Accepted May 22, 2010

1. Introduction

Supervised classification of documents into predefined categories is a very common task. Text classification applications include web page categorization, email spam filtering, internet advertising, topic identification, document indexing, word sense disambiguation, etc.

Supervised text classifiers are typically constructed by an inductive process (Sebastiani 2002), i.e., by automatically learning a model from a set of previously labeled documents and then applying this model to obtain labels for previously unseen documents. Some of the popular inductive classifiers that have successfully been applied to text classification include probabilistic classifiers such as Naive Bayes (McCallum & Nigam 1998), decision tree classifiers such as ID3 (Quinlan 1986) and C4.5 (Quinlan 1993), rule-based classifiers such as FOIL (Quinlan & Cameron-Jones 1993), RIPPER (Cohen 1995) and CPAR (Yin & Han 2003), and maximum margin classifiers such as linear SVMs (Joachims 2002).

With the amount of information available online increasing at an unprecedented rate (Lyman & Varian 2003), popular web catalogues such as the Open Directory (http://www.dmoz.org) have grown to hundreds of thousands of categories. Consequently, modern text classification systems sometimes encounter training datasets that contain hundreds of thousands to millions of documents (e.g., the datasets used in the recent Pascal challenge on large-scale hierarchical text classification, http://lshtc.iit.demokritos.gr). In such situations training speed becomes a concern, in addition to the more traditional concern over the quality of classification. Therefore, recent research (Malik & Kender 2008, Fan et al. 2008, Joachims 2006, Anagnostopoulos et al. 2008) has focused on improving the runtime performance of text classification algorithms while maintaining or improving the quality.

Many of these efficient approaches rely on linear classifiers, which tend to be faster to train and to apply, while maintaining a high level of quality on text data. Given a set of labeled training instances $D$, with $F$ features and $C$ classes, the models produced by linear classifiers, such as Naive Bayes and linear SVM, can be represented by a $|F| \times |C|$ matrix, where the entry in the $i$th row and $j$th column is the learned weight for the feature-class pair $(f_i, c_j)$. We explore the question of whether expensive training procedures are necessary for linear classifiers, or whether feature weights can be directly estimated using information-theoretic measures to construct a high-quality classifier.

We propose the Feature Weighting Classifier (FWC), a simple classification algorithm that uses Information Gain (IG) to directly compute per-class feature weights. FWC obtains per-class feature frequencies in one pass over the dataset, and then uses these frequencies to compute feature IG and feature support in each class. The final weights for each feature-class pair are derived by combining feature IG with feature class support using a user-specified parameter $\alpha$.

Test instances are classified by computing a score for each class and selecting the class with the highest score. Per-class scores are computed as the product of the vector of weights for a class with the term frequency vector of the document. Experiments performed on diverse binary and multi-class text and web datasets indicate that FWC's predictive performance is at least comparable to that of SVM and noticeably better than that of Naive Bayes and Winnow.

A number of recent papers have explored the use of probabilistic and information-theoretic measures as feature weights. These approaches assign a weight $w_{fc}$ to each feature-class pair $f \in F$ and $c \in C$, based on the discriminative information that $f$ provides for distinguishing class $c$ from the remaining classes. Feature weighting was used as a pre-processing step by (Forman 2008) to improve the quality of existing text classification algorithms, whereas (Junejo & Karim 2008) used feature weighting to map documents into a two-dimensional space (for each class), where linear discriminant functions for classes were learned. Unlike these approaches, where feature weighting is used to change the document representation as a pre-processing step to an induction algorithm, whether by transformation in the same feature space (Forman 2008) or by mapping into a new one (Junejo & Karim 2008), FWC directly computes feature weights for the classifier from the document collection, requiring no further learning. This distinguishes FWC from typical feature weighting schemes and makes it similar to classification schemes such as Naive Bayes.

In a similar spirit to our work, a recently published approach (Madani et al. 2009) involves direct learning of feature-class weights on a feature index. Each feature contributes only to its top classes, with the number of such connections per feature limited by a parameter. The feature-class weights are directly updated in an online fashion as training examples become available. This approach was shown to be competitive in terms of accuracy but more efficient than state of the art classifiers.

The rest of the paper is organized as follows. We describe and analyze the FWC algorithm in detail in Section 2. The empirical evaluation is described in Section 3. Section 4 describes related work. Discussion and future work are presented in Section 5. Conclusions are presented in Section 6.

2. FWC Algorithm

In this section we describe the FWC algorithm. We first discuss the steps involved in constructing the FWC classification model. Next, we describe applying this model to classify test instances and discuss the computational complexity of the FWC algorithm. Finally, we provide motivation for the FWC weights and compare them with the weights produced by Naive Bayes.

2.1. Training the FWC Classifier

Let $a_{ij}$ be the number of documents of class $c_j$ where feature $f_i$ occurs, let $b_j$ be the number of documents in class $c_j$, let $\kappa_i$ be the number of classes where $f_i$ occurs, and let $r_i$ be the number of documents where feature $f_i$ occurs.

The support of feature $f_i$ in class $c_j$ is defined as:

$$p_{ij} = \frac{a_{ij}}{b_j} \qquad (1)$$

This is the empirical conditional probability $P(f_i|c_j)$. The Information Gain (IG) of feature $f_i$, which in this case is the same as the Mutual Information of $f_i$ and the class label distribution, is defined as:

$$IG_i = \sum_{c \in C} \sum_{x_i \in \{0,1\}} P(c, x_i) \log_2 \frac{P(c, x_i)}{P(c)\,P(x_i)} \qquad (2)$$

where $x_i$ takes values 0 or 1 depending on the absence or presence of feature $f_i$ in a document. $P(c_j, x_i = 1)$ can be computed as $a_{ij}/|D|$ and $P(x_i = 1) = r_i/|D|$ (McCallum & Nigam 1998).
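For clarity, the remaining probabilities in Equation 2 also follow directly from the counts collected in the single pass; this expansion is implied by the definitions above and is spelled out here, not in the original text:

$$P(c_j) = \frac{b_j}{|D|}, \quad P(c_j, x_i = 0) = \frac{b_j - a_{ij}}{|D|}, \quad P(x_i = 0) = \frac{|D| - r_i}{|D|}$$

so $IG_i$ can be evaluated for every feature without a second pass over the documents.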

The construction of the FWC model (Algorithm 1) begins by making one pass over the training data to determine $a_{ij}$, $b_j$ and $\kappa_i$ (Line 1). Knowing these quantities enables us to compute the weight for each feature for each class in a single pass over all features (Lines 2-7). The weight for each feature with respect to a class is calculated in Line 5 using three components: a global significance measure, a class significance measure, and a discriminative penalty. Here we use Information Gain as the global significance measure (computed in Line 3 according to Equation 2), representing the discriminative information of a feature across all classes. The class significance measure is the class support of a feature $p_{ij}$, and $\kappa_i$ is the penalty factor. A user-defined parameter $\alpha$ is used to control the tradeoff between the global feature significance and the feature significance in individual classes. Note that the features used here are essentially binary and FWC does not utilize feature frequencies within individual training documents. In contrast, feature frequencies in test documents are utilized in computing class scores for test instances, as we discuss in Section 2.2.

Algorithm 1 Training the FWC model

Require: A set $D$ of sparse labeled instances, with a set of features $F$ and a set of classes $C$; a parameter $\alpha \ge 0$.

1: Read the data, keeping track of $a_{ij}$, $b_j$, $\kappa_i$, and computing $p_{ij}$ (Eq. 1)
2: for $i = 1, \ldots, |F|$ do
3:   Compute $IG_i$ (Eq. 2)
4:   for $j = 1, \ldots, |C|$ do
5:     $w_{ij} = \frac{IG_i}{\kappa_i} (p_{ij})^{\alpha}$
6:   end for
7: end for
8: $W_j = \{w_{ij}\}$ // weight vector for class $c_j$
9: return $M = \{W_j\}$, $j = 1, \ldots, |C|$
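To make the procedure concrete, here is a minimal Java sketch of Algorithm 1. It assumes each training document is given as a set of distinct feature indices plus a class label; all names (Doc, FwcTrainer, train) are illustrative, not taken from the authors' implementation.

```java
import java.util.List;

/** A minimal sketch of FWC training (Algorithm 1). */
public class FwcTrainer {

    /** A labeled document: the set of distinct feature indices it contains. */
    public record Doc(int label, int[] features) {}

    /** Returns the |F| x |C| weight matrix w[i][j]. */
    public static double[][] train(List<Doc> docs, int numFeatures, int numClasses, double alpha) {
        int[][] a = new int[numFeatures][numClasses]; // a_ij: docs of class j containing feature i
        int[] b = new int[numClasses];                // b_j: docs in class j
        int[] r = new int[numFeatures];               // r_i: docs containing feature i

        // Single pass over the data (Line 1 of Algorithm 1).
        for (Doc d : docs) {
            b[d.label()]++;
            for (int f : d.features()) {
                a[f][d.label()]++;
                r[f]++;
            }
        }

        int n = docs.size();
        double[][] w = new double[numFeatures][numClasses];
        for (int i = 0; i < numFeatures; i++) {
            // kappa_i: number of classes in which feature i occurs.
            int kappa = 0;
            for (int j = 0; j < numClasses; j++) if (a[i][j] > 0) kappa++;
            if (kappa == 0) continue; // feature never occurs; its weights stay zero
            double ig = informationGain(a[i], b, r[i], n);
            for (int j = 0; j < numClasses; j++) {
                double p = (double) a[i][j] / b[j];        // class support p_ij (Eq. 1)
                w[i][j] = ig / kappa * Math.pow(p, alpha); // Line 5 of Algorithm 1
            }
        }
        return w;
    }

    /** Information gain of one feature (Eq. 2), computed from the collected counts. */
    static double informationGain(int[] ai, int[] b, int ri, int n) {
        double ig = 0.0;
        double pPresent = (double) ri / n;
        for (int j = 0; j < b.length; j++) {
            double pc = (double) b[j] / n;
            ig += term((double) ai[j] / n, pc * pPresent);                 // feature present
            ig += term((double) (b[j] - ai[j]) / n, pc * (1 - pPresent));  // feature absent
        }
        return ig;
    }

    /** One summand P(c,x) * log2(P(c,x) / (P(c)P(x))), with the usual 0 log 0 = 0 convention. */
    static double term(double joint, double indep) {
        return (joint > 0 && indep > 0) ? joint * (Math.log(joint / indep) / Math.log(2)) : 0.0;
    }
}
```

The counting loop is the single pass over the data; everything after it touches only per-feature and per-class counts, matching the training complexity derived in Section 2.3.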

2.2. Using the FWC Classifier

Test instances are classified by computing a score for each class and selecting the class with the highest score (Algorithm 2). Each feature in the test instance that has a non-zero weight in the model for a class contributes towards the class score. In addition, feature frequencies in the test instance are used to scale feature weights, which allows locally-frequent features to have a higher contribution towards class scores. Per-class scores are therefore computed as a product of the vector of weights for a class with the term frequency vector of the document.

Algorithm 2 Applying the FWC model

Require: A sparse instance $d$, FWC model $M = \{W_j\}$

1: for $j = 1, \ldots, |C|$ do
2:   $s_j = \sum_{f_i \in (d \cap W_j)} d_{ij} \cdot w_{ij}$
3: end for
4: return $c_m$, where $m = \arg\max_j s_j$
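A matching Java sketch of Algorithm 2, under the same assumptions as the training sketch above; the representation of the test instance as a sparse map from feature index to term frequency is a hypothetical choice for illustration.

```java
import java.util.Map;

/** A minimal sketch of FWC prediction (Algorithm 2). */
public class FwcPredictor {

    /** Returns the index of the highest-scoring class. */
    public static int predict(Map<Integer, Integer> termFreqs, double[][] w, int numClasses) {
        double[] s = new double[numClasses];
        // Each feature of the test instance contributes tf * w_ij to the score of class j.
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            int i = e.getKey();
            if (i >= w.length) continue; // feature unseen at training time
            int tf = e.getValue();
            for (int j = 0; j < numClasses; j++) {
                s[j] += tf * w[i][j];
            }
        }
        int best = 0;
        for (int j = 1; j < numClasses; j++) if (s[j] > s[best]) best = j;
        return best;
    }
}
```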

2.3. Algorithm Analysis

In this section we discuss the computational complexity of FWC. The first line in Algorithm 1 involves a single pass over the data, and requires $\Theta(|D||F|)$ time when the dataset is dense, or $\Theta(|E|)$, where $E$ is the set of non-zero entries, when the dataset is sparse. During this pass, the values $a_{ij}$, $b_j$ and $\kappa_i$ are stored and the $p_{ij}$ values are computed, requiring $\Theta(|F||C|)$ storage.

The double loop in Lines 2-7 takes $\Theta(|F||C|)$ time, since the computation of IG for a feature (Line 3) is linear in the number of classes, as follows from Equation 2, and so is the inner loop (Lines 4-7), since the computation of $w_{ij}$ takes constant time. The amount of memory required is $\Theta(|F|)$ for the IG values, and $\Theta(|F||C|)$ for the weights.

Therefore, training FWC takes $\Theta(|E| + |F||C|)$ time and $\Theta(|F||C|)$ space on sparse data. Effectively, training FWC requires a single pass over the data, followed by a single pass over the coefficients.

The time complexity of testing, i.e., assignment of new instances, is linear (a single pass) in the number of features in the new document times the number of classes in the model.

These characteristics are exactly the same as for the Naive Bayes classifier. In Section 3, we show that FWC tends to be more accurate than Naive Bayes.

A linear time algorithm for training SVM has recently been developed (Joachims 2006). However, this algorithm is complicated and involves multiple passes over the data, making it much slower than Naive Bayes or FWC. We show below (Section 3) that FWC's accuracy is comparable to that of SVM, which is considered "state of the art".

2.4. Motivation for the FWC Weights

There have been several probabilistic and information-theoretic feature weighting and selection schemes proposed in the literature (Forman 2003). Some of the commonly used schemes include Accuracy, Bi-Normal Separation (BNS), Chi-Square, Document Frequency (i.e., Global Support), F1-Measure, Information Gain, Odds Ratio, Odds Ratio Numerator, Power and Probability Ratio. Among these schemes, only Chi-Square, Information Gain and Document Frequency generalize to multi-class problems (Forman 2003). Since FWC focuses on direct multi-class classification, we limit ourselves to these three schemes. Furthermore, our experiments (Section 3.8) indicate that Information Gain is more stable and results in better classification performance as compared to Chi-Square and Global Support.

Information Gain considers both the presence and absence of a word to measure its significance. This means that Information Gain expects a word to provide information when it does not occur, whereas in a typical text classification scenario words occur sparsely and only provide information when they occur (Rennie 2001). Because of this property, words that occur just a few times in the dataset receive low IG scores even though these words may be highly predictive of some classes.

Because of this undesirable side effect, Information Gain, while extremely useful as a global feature significance measure, may not be suitable as the sole feature weighting technique (see Section 3.7 for experimental results).

On the other hand, using class significance as the sole feature weighting technique may not be suitable either, as this approach would assign very high weights to features found in a large fraction of a rare class. For example, in the extreme case, a class with only one instance that contains all features in the dataset would receive the highest weights for most features, resulting in assigning almost all test instances to this class.

Therefore, we attempt to balance global significance with class significance by adjusting feature Information Gain with its class support (Lines 5-7) using the parameter $\alpha$. A suitable value for $\alpha$ can be estimated with cross-validation on the training set. Alternatively, since wide ranges of values for $\alpha$ result in similar performance, as demonstrated in Section 3.9, reasonable defaults can be selected. Intuitively, features that are shared across many classes are less useful for classification than features that are observed in only a few classes. We therefore use $\kappa_i$ as a penalty factor (multi-class penalty) in Line 5. Section 3.7 shows that penalizing weights of features that are shared across classes almost always improves the classification accuracy.
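As a concrete illustration of the weight formula in Line 5 (the numbers here are made up for exposition), consider a feature with $IG_i = 0.4$ occurring in $\kappa_i = 2$ classes, with supports $p_{i1} = 0.8$ and $p_{i2} = 0.05$, and take $\alpha = 0.3$:

$$w_{i1} = \frac{0.4}{2}(0.8)^{0.3} \approx 0.187, \qquad w_{i2} = \frac{0.4}{2}(0.05)^{0.3} \approx 0.081$$

Both weights remain positive, but the class where the feature is well supported receives more than twice the weight; smaller values of $\alpha$ flatten this contrast and larger values sharpen it.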

2.5. Comparison of FWC Weights With Naive Bayes

FWC is a heuristic approach where the model weights are not derived as a solution to optimizing some objective function. We have discussed the motivation for the particular choice of the formula for the weights (Section 2.4), and our experimental results in Section 3 show that FWC performs well on a variety of text classification problems.

In this section we explore in some detail the relationship between FWC and a similar method, Naive Bayes. Multivariate Bernoulli Naive Bayes (McCallum & Nigam 1998) can be seen as associating with each class $c_j$ a set of weights $w^{NB}_{ij}$ for each feature $f_i$:

$$w^{NB}_{ij} = \log\left(\frac{1 + a_{ij}}{2 + b_j}\right) \qquad (3)$$

These weights are combined linearly with document features, and a prediction for a new instance $d$ is made by selecting the class $c_j$ with the highest score:

$$\sum_{f_i \in (d \cap W_j)} d_{ij} \cdot w^{NB}_{ij} \qquad (4)$$

FWC weights are described by the formula in Line 5 of Algorithm 1. Examining this formula, we observe that each weight is a product of two parts. The first part is independent of classes, i.e., it is the same regardless of the class of the weight vector:

$$\frac{IG_i}{\kappa_i} \qquad (5)$$

and can therefore be seen as a form of feature weighting similar to IDF (Salton & Buckley 1988), but taking advantage of label information.

The second part depends on both the class and the feature:

$$(p_{ij})^{\alpha} = \left(\frac{a_{ij}}{b_j}\right)^{\alpha} \qquad (6)$$

and can be seen as a counterpart to the Naive Bayes weights (Equation 3). Whereas NB uses a log of an expression, FWC uses an exponential ($0 < \alpha < 1$) of a very similar expression. The use of the exponential as opposed to the logarithm limits the effect of features that are rare in a class: their contribution to the class score is limited to a very small positive value, rather than a large negative value. This means that in FWC, features that are rare in all classes will not significantly affect the scores for any classes. In contrast, features that are frequent in some classes but infrequent in other classes will boost the scores of the classes where the feature is frequent.
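A small numeric example, not from the original text, makes the contrast concrete. For a feature occurring in $a_{ij} = 1$ of $b_j = 100$ documents of a class, with $\alpha = 0.3$ and taking the logarithm in Equation 3 as the natural logarithm:

$$w^{NB}_{ij} = \log\frac{1 + 1}{2 + 100} \approx -3.93, \qquad (p_{ij})^{\alpha} = (0.01)^{0.3} \approx 0.25$$

The NB weight is a large negative number that can dominate the document score, while the FWC class factor is merely a small positive contribution.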

In Multinomial NB (McCallum & Nigam 1998), used in our experiments, the weights are:

$$w^{MNNB}_{ij} = \log\left(\frac{1 + n_{ij}}{|F| + n_j}\right) \qquad (7)$$

where $n_{ij}$ is the number of occurrences of feature $f_i$ in documents of class $c_j$, and $n_j$ is the total number of occurrences of all features in documents of class $c_j$. Again, the main difference with the class-specific part of the FWC weight (Equation 6) is in the use of the logarithm rather than the exponential, and the same reasoning as when comparing FWC and Multivariate Bernoulli NB applies.


dataset     |C|    |D|     |F|     avg cls    min cls   max cls
cacmcisi      2    4663    14409   2331.5     1460      3203
cranmed       2    2431    31720   1215.5     1033      1398
fbis         17    2463     2000    144.882     38       506
hitech        6    2301    22498    383.5      116       603
k1a          20    2340    21839    117          9       494
k1b           6    2340    21839    390         60      1389
la1           6    3204    29714    534        273       943
la2           6    3075    19692    512.5      248       905
mm            2    2521    29973   1260.5     1133      1388
new3         44    9558    70822    217.227    104       696
ohscal       10   11465    11465   1116.2      709      1621
re0          13    1504     2886    115.692     11       608
re1          25    1657     3758     66.28      10       371
reviews       5    4069    36746    813.8      137      1388
sports        7    8580    27673   1225.71     122      3412
tr11          9     414     6429     46          6       132
tr12          8     313     5804     39.125      9        93
tr23          6     204     5832     34          6        91
tr31          7     927    10128    132.429      2       352
tr41         10     878     7454     87.8        9       243
tr45         10     690     8261     69         14       160
wap          20    1560     8460     78          5       341

Table 1. Summary of Cluto datasets

Thus, FWC can also be seen as a combination of specific term weighting with a weight function that aims to compensate for some potentially damaging behaviors of Naive Bayes weights.

3. Experimental Evaluation

3.1. Datasets

The main evaluation was performed on the TechTC-100 (Davidov et al. 2004) and Cluto (Karypis 2003) collections of text datasets. Both are publicly available and frequently used to evaluate classification algorithms. Additional datasets were used to validate the initial findings (Section 3.4) and to evaluate the runtime performance (Section 3.6). Table 1 summarizes the properties of the datasets in the Cluto collection. These datasets represent text classification problems collected from various sources, such as Reuters news articles, TREC tasks, and OHSUMED (Medline records). The TechTC-100 collection consists of 100 binary classification problems with 100-200 documents each. The problems were generated using real web sites classified by human editors as part of the Open Directory Project (http://www.dmoz.org). They are designed to have a varying difficulty for classification algorithms (Davidov et al. 2004).

TechTC-100 datasets were preprocessed using standard stemming and stop-word elimination techniques and converted to the Cluto file format.


3.2. Methods

The FWC classification algorithm was evaluated against Naive Bayes (McCallum & Nigam 1998) and linear SVM. Both methods are known to perform well on textual data. FWC was also evaluated against Winnow (Littlestone 1988) and Balanced Winnow (Littlestone 1989) on binary datasets (Section 3.5) and against TWCNB (Rennie 2001) on large-scale unbalanced datasets (Section 3.6). The LibLinear (Fan et al. 2008) implementation was used for linear SVM training in linear time (see also (Joachims 2006)). All methods were evaluated with repeated randomized cross-validation with the same splits, obtained by using the same random seed. On the Cluto datasets 10 times 10-fold cross-validation was used, and on the TechTC-100 datasets 50 times 2-fold cross-validation was used, resulting in 100 evaluation runs each. The TechTC-100 datasets have relatively few documents each; fewer folds result in less training data but avoid quantization effects in the quality evaluation.

The evaluation measures used were accuracy, i.e., the fraction of correctly classified documents, and the macro-averaged F1 measure, i.e., the average of the per-class harmonic means of precision and recall scores, without considering class sizes.

FWC: FWC training is described in Section 2. The exponent of the weighting function (i.e., the $\alpha$ parameter) was automatically selected from the set $\{10^{-3}, k \cdot 10^{-2}, k \cdot 10^{-1} \mid k = 1, \ldots, 9\}$ using 5-fold cross-validation on the given training data. The model with the best accuracy was used to predict class labels for the test instances.

Naive Bayes: The multinomial NB implementation closely followed the description of (McCallum & Nigam 1998).

SVM: SVM was trained with TFIDF vectors using the following commonly used weighting (Lewis et al. 2004):

$$TFIDF = \log(TF + 1) \cdot \log\left(\frac{N}{DF}\right) \qquad (8)$$

where $TF$ is the term frequency (how many times the term appeared in the document), $N$ is the number of documents processed, and $DF$ is the document frequency (in how many documents the term appeared). The regularization parameter $C$ was selected from the set $\{10^k \mid k = -4, \ldots, 1\}$ using 5-fold cross-validation on the given training data. The model with the best accuracy was used to predict class labels for the test instances. For multi-class problems, the default LibLinear setting was used, which uses a one-vs-rest strategy.
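A one-method Java sketch of the weighting in Equation 8; the paper does not state the logarithm base, so the natural logarithm is assumed here (a different base only rescales all weights uniformly):

```java
/** TFIDF weight of one term (Equation 8): log(TF + 1) * log(N / DF). */
static double tfidf(int tf, int numDocs, int docFreq) {
    // Assumes docFreq >= 1 for any term that is actually weighted.
    return Math.log(tf + 1.0) * Math.log((double) numDocs / docFreq);
}
```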

The comparison of the methods is done without additional feature selection methods. While feature selection is known to improve the performance of classification methods in many cases (Gabrilovich & Markovitch 2004), it would need to be tuned separately for each method and dataset, obscuring the direct comparison of the approaches that is the focus of this evaluation. We plan to pursue this direction in future work.

3.3. Comparison with NB and SVM

Figures 1 and 2 present the performance of the three classifiers on the Cluto and TechTC-100 collections for each dataset individually, as well as accuracy and F1 score summaries for each collection. The summaries show the mean and one standard deviation of the difference in accuracy or macro-averaged F1 score for each dataset collection. FWC is found to be comparable to NB and SVM on the Cluto datasets and consistently better than both competitors on the TechTC-100 datasets. NB is comparable to FWC and to SVM on the Cluto datasets but is consistently worse than both FWC and SVM on the TechTC-100. Note that the baseline SVM results on TechTC-100 presented in (Gabrilovich & Markovitch 2004) are different from ours (but still significantly worse as compared to FWC) because of differences in the experimental setup.

                        accuracy                 macro-averaged F1
dataset           SVM     NB      FWC       SVM     NB      FWC
bbc               0.982   0.932   0.955     0.982   0.930   0.954
bbc sport         0.991   0.957   0.994     0.992   0.959   0.995
review polarity   0.809   0.659   0.811     0.809   0.657   0.809

Table 2. Summary of additional datasets and classifier performance on these datasets

3.4. Results on Additional Datasets

We used three additional "hold-out" datasets to validate our findings. The BBC dataset (Greene & Cunningham 2006) contains 2225 stories with 9636 features from the British Broadcasting Corporation website that correspond to five topical classes. The BBC sports dataset (Greene & Cunningham 2006) contains 737 news articles with 4613 features that correspond to five sports-related categories. The review polarity 2.0 dataset (Pang & Lee 2004) is commonly used for sentiment analysis and contains 1000 positive and 1000 negative movie reviews from the Internet Movie Database (http://www.imdb.com), with 26187 features. Table 2 presents the ten-fold cross-validated accuracies and macro-averaged F1 scores achieved by the three classifiers.

We observe that SVM is somewhat better than FWC on the BBC dataset, and slightly worse on the BBC sports and review datasets. Naive Bayes is ranked 3rd on all datasets, and in the case of the review dataset, by a very large margin. This is consistent with our previous analysis.

3.5. Comparison with Winnow

In this section we compare FWC with Winnow (Littlestone 1988) and Balanced Winnow (Littlestone 1989). Unlike SVM, Winnow is a completely online learning algorithm, i.e., it can be trained in a single pass over the data, one example at a time. Winnow is a linear classifier that associates with each feature $f_i$ a weight $w_i$. During training, a prediction is made for an example. If the prediction is correct, no change to the model is made. If the prediction is wrong, then the weights of the features present in the example are increased (promotion step) or decreased (demotion step) by some factor $\alpha$ (Littlestone 1988). Since the updates involve only the features that occurred in the example, Winnow is very efficient. Note that FWC models can also be updated one example at a time, but this requires re-computing all IG values and therefore all of the weights, making FWC less efficient than Winnow in the online mode.
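The update rule just described can be sketched in a few lines of Java. This is an illustrative rendering of the classic rule from (Littlestone 1988), with hypothetical names and a conventional choice of threshold; it is not the Weka implementation used in the experiments below.

```java
/** One-weight-per-feature Winnow with multiplicative updates on mistakes. */
public class Winnow {
    final double[] w;       // one positive weight per feature, initialized to 1
    final double threshold; // conventionally set to the number of features
    final double alpha;     // promotion/demotion factor, e.g. 2.0

    Winnow(int numFeatures, double alpha) {
        this.w = new double[numFeatures];
        java.util.Arrays.fill(this.w, 1.0);
        this.threshold = numFeatures;
        this.alpha = alpha;
    }

    /** features: indices present in the example; label: true for the positive class. */
    void update(int[] features, boolean label) {
        double score = 0.0;
        for (int f : features) score += w[f];
        boolean predicted = score >= threshold;
        if (predicted == label) return;               // correct: leave the model unchanged
        // Promote on a false negative, demote on a false positive
        // (Winnow2-style demotion divides rather than zeroing the weights).
        double factor = label ? alpha : 1.0 / alpha;
        for (int f : features) w[f] *= factor;        // only active features are touched
    }
}
```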

For the evaluation, we used the Weka (Hall et al. 2009) implementations of both Winnow methods. Since these support only two-class problems, we used the cacmcisi, cranmed, mm and review polarity datasets, as well as all 100 TechTC problems (essentially, all two-class datasets used in our experiments). The results, obtained with 4-fold cross-validation on TechTC and 10-fold cross-validation on the other datasets, are presented in Table 3. They show that Winnow lags behind FWC in terms of accuracy and F1 on all datasets except for cacmcisi, which is the one dataset where FWC does the worst. Of the 100 TechTC problems, FWC outperforms both Winnow methods in terms of accuracy on 89 problems, and in terms of F1 on 90 problems. The results on TechTC are shown in Figure 3.

[Fig. 1 (plots omitted). Accuracies and F1 scores of FWC, NB and SVM on Cluto datasets: (a) accuracies on Cluto; (b) macro-averaged F1 on Cluto; (c) Cluto accuracy summary; (d) Cluto F1 summary. 1(a) and 1(b) show results per dataset ordered by increasing performance of FWC. 1(c) and 1(d) show mean ± one standard deviation of the differences in performance between FWC and its competitors. Values below the x-axis indicate worse performance than that of FWC.]

[Fig. 2 (plots omitted). Accuracies and F1 scores of FWC, NB and SVM on TechTC-100 datasets: (a) accuracies on TechTC-100; (b) macro-averaged F1 on TechTC-100; (c) TechTC-100 accuracy summary; (d) TechTC-100 F1 summary. 2(a) and 2(b) show results per dataset ordered by increasing performance of FWC. 2(c) and 2(d) show mean ± one standard deviation of the differences in performance between FWC and its competitors. Values below the x-axis indicate worse performance than that of FWC.]

3.6. Runtime Comparisons

While we argued that FWC is comparable to Naive Bayes and much more efficient than SVM, it is interesting to compare them on actual large-scale text classification problems. Our evaluation is done on three large datasets.

The SRAA dataset (http://www.cs.umass.edu/~mccallum/code-data.html) contains 73,218 documents in four classes. The methods were evaluated using 10-fold cross-validation on this dataset.

                 accuracy                  macro-averaged F1
dataset          FWC     W       BW        FWC     W       BW
cacmcisi         0.673   0.983   0.983     0.673   0.980   0.980
cranmed          0.999   0.922   0.948     0.999   0.920   0.946
mm               0.980   0.878   0.956     0.980   0.878   0.955
review polarity  0.811   0.637   0.726     0.809   0.636   0.723
TechTC-100*      0.891   0.748   0.792     0.890   0.736   0.785

Table 3. Results of the comparison between FWC, Winnow (W) and Balanced Winnow (BW). The results for TechTC-100 are averages over all 100 problems.

[Fig. 3 (plots omitted). Accuracies (a) and F1 scores (b) of FWC and the two Winnow variants on TechTC-100.]

Two multi-class classification problems were derived from the recent Pascal challenge on large-scale hierarchical text classification (i.e., the LSHTC challenge, http://lshtc.iit.demokritos.gr). The LSHTC challenge uses a five-level deep classification hierarchy and each document is assigned to exactly one leaf node. Since hierarchical classification is not the main focus of this research, we flattened the hierarchy by considering each leaf node as a direct class in the multi-class classification problem. The first classification problem contains 128,710 web page description vectors in 12,294 classes. The classification methods were evaluated using 10-fold cross-validation on this problem. The second classification problem uses 135,973 description vectors for training (i.e., all web page description vectors from the first problem and additional category description vectors) and 34,880 content vectors for testing. This problem corresponds to the "cheap" task in the LSHTC challenge and simulates a scenario where training and test sets follow different word distributions. (We did not use the content vectors from other LSHTC tasks for training classifiers because they contain about 400,000 unique features, making it resource prohibitive to train an in-memory flat classifier on our test machine; the memory required to allocate the |F| × |C| matrices exceeds the available physical memory.) Since the public datasets do not include actual labels for these test instances, we have used the evaluation "oracle" provided on the LSHTC challenge web site to measure the accuracies and Macro-F1 scores.

Both of the LSHTC classification problems are highly unbalanced. The smallest classes contain as few as two instances and the largest classes contain as many as 3,000 instances, whereas the average number of instances assigned to the classes is 10.5. Considering that basic Naive Bayes is known to perform poorly on unbalanced data (Rennie et al. 2003), we have also included the TWCNB algorithm (Rennie 2001) in our comparisons.

All algorithm implementations were in Java and the same runtime environment was used for all experiments. The runtime environment consisted of a 64-bit Java Virtual Machine deployed on a dedicated 64-bit system with two 2.67GHz Intel quad-core processors. The Weka toolkit (Hall et al. 2009) was used for training multinomial Naive Bayes and TWCNB, and the Java version of the LibLinear (Fan et al. 2008) library was used for linear SVM training. The SVM results in all other sections of this paper used the default LibLinear setting for training multi-class SVMs, which uses a one-vs-rest strategy. However, the default setting turned out to be very inefficient on the LSHTC datasets because of the large number of classes. Therefore, for these experiments we switched to the LibLinear implementation of the multi-class SVM by Crammer and Singer (Crammer & Singer 2002), which uses an improved and more efficient formulation (Keerthi et al. 2008). This resulted in an over 80% reduction in SVM training times on the LSHTC datasets.

The results reported here do not include any I/O times or the time needed to prepare implementation-specific dataset representations (such as preparing the in-memory Weka dataset for training). We also do not report classification times, because for all methods used in our experiments classification time is linear in the size of the test sample, and any differences are likely to be implementation related. Finally, the training times reported here used the optimal parameter for each method on each dataset.

Table 4 presents the average 10-fold cross-validation accuracies and macro-F1 scores as well as the total training times for all folds on the SRAA dataset. It is interesting to note that, while SRAA is clearly an easy classification problem, SVM and FWC are comparable to each other and better than both Naive Bayes methods. Also, TWCNB (Rennie 2001) did not perform better than regular NB on this dataset. This finding is consistent with recent results (Kibriya et al. 2004), which also show that TWCNB is not always better than regular NB. In terms of training times, SVM was about 10 times slower than FWC.

Tables 5 and 6 present the classification and runtime performance of the four classification methods on the LSHTC datasets. On both datasets, TWCNB outperformed regular NB on the predictive measures, which indicates that TWCNB indeed performs better than NB on unbalanced data. However, FWC outperformed both Naive Bayes methods by a very large margin. While SVM achieved the highest classification accuracies, its Macro-F1 scores are comparable to or lower than FWC's, and it was also substantially more expensive to train. We also noticed that the SVM training times are highly sensitive to the regularization parameter values. For example, with $C = 10^{-3}$, SVM training on the first LSHTC dataset took almost twice as much time as training with $C = 10^{-5}$. This observation is consistent with (Joachims 2006). In contrast, FWC training times do not vary with $\alpha$. This could greatly simplify workload balancing for concurrent selection of FWC parameters on multi-core and parallel architectures.

Note that the Naive Bayes methods show somewhat longer training times than FWC on all three datasets. Since the computational complexity of Naive Bayes is equivalent to that of FWC, these times are at least partially due to the overhead of Weka, which extensively uses Java objects. A more efficient Naive Bayes implementation is expected to result in training times similar to FWC's.

Finally, we note that the FWC results reported here used $\alpha$ values that maximize the classification accuracies on each dataset. For situations where the classification performance on small classes is considered more important, $\alpha$ may be selected to maximize Macro-F1 instead of accuracy. On the first LSHTC dataset, this method improved the Macro-F1 score to 0.313, and on the second dataset it improved the Macro-F1 to 0.28, while slightly reducing the accuracies.

method   accuracy   macro-averaged F1   training time (seconds)
SVM      1.000      1.000                59.86
NB       0.989      0.981                22.85
TWCNB    0.953      0.929                13.23
FWC      0.999      0.999                 5.53

Table 4. Classification and runtime performance of SVM, NB, TWCNB and FWC on the SRAA dataset.

method   accuracy   macro-averaged F1   training time (seconds)
SVM      0.456      0.294               8556.3
NB       0.115      0.014                355.7
TWCNB    0.304      0.157                478.1
FWC      0.417      0.297                111.3

Table 5. Classification and runtime performance of SVM, NB, TWCNB and FWC on the first LSHTC classification problem.

3.7. Comparing Variants of FWC

As discussed in Section 2.1, the feature weight for each class combines Information Gain, a class support factor, and a multi-class penalty factor. Here we evaluate the effect on performance of removing the class support or the penalty for features shared by multiple classes. The results are presented in Figure 4. It is clear from these results that without class support FWC performs extremely poorly. This behavior is observed both on Cluto and on TechTC. Removing the penalty for features shared by multiple classes has little effect on TechTC but a somewhat larger effect on Cluto. This is because the TechTC problems are binary and most features occur in both classes, so the class penalty affects few weights. On Cluto, where many of the problems are multi-class, removing the penalty factor leads to a loss of accuracy.

method   accuracy   macro-averaged F1   training time (seconds)
SVM      0.370      0.255               1083.6
NB       0.026      0.0004                34.8
TWCNB    0.236      0.126                 59.9
FWC      0.352      0.266                 13.6

Table 6. Classification and runtime performance of SVM, NB, TWCNB and FWC on the second LSHTC classification problem.

[Fig. 4 (plots omitted). Accuracy of FWC with default settings (DEFAULT), FWC without class support (NO CLASS SUPPORT), and FWC without the multi-class penalty (NO PENALTY) on (a) Cluto and (b) TechTC-100 datasets. The problems are ordered by increasing accuracy of FWC with default settings.]

Hence, these experiments provide an empirical justification for using class support to balance Information Gain and for using the multi-class penalty factor.

3.8. Alternative Feature Weighting Schemes

In this section we compare the classification performance of Information Gain with Chi-Square and Global Support (i.e., Document Frequency) when used as the global feature significance scheme in FWC (Algorithm 1). As we have discussed in Section 2, we do not consider other feature weighting and selection schemes such as BNS (Forman 2003) and Odds Ratio in this paper because they do not have a multi-class form.

Figure 5 presents the classification accuracies and macro-averaged F1 scores of FWC on the Cluto and TechTC-100 dataset collections using Information Gain, Chi-Square and Global Support as global significance measures. Clearly, Information Gain is the most stable measure and consistently outperformed the alternative measures on both dataset collections. Global Support is also stable but resulted in poor classification performance on small classes: on the Cluto datasets, the gap between Information Gain and Global Support in F1 scores is higher than the gap in accuracies. This behavior is not observed on the TechTC-100 datasets because they are highly balanced. Finally, Chi-Square outperformed Information Gain in some cases but is highly unstable in general. This is not surprising because Chi-Square is known to behave erratically for very small expected counts, which are common in text classification, as others have also noted (Forman 2003).

3.9. FWC Parameter Sensitivity

FWC has only one tunable parameter, $\alpha$, which trades off the information gain and class support of a feature. We have shown in Section 3.7 that all of these components are necessary for high performance of FWC. However, it is reasonable to ask how FWC performance changes with different values of $\alpha$. To address this question, we varied $\alpha$ over the range [0.001, 0.7]. Figure 6 shows results on the TechTC datasets (qualitatively similar results on Cluto are not included).

[Fig. 5 (plots omitted). Classification accuracies and macro-averaged F1 scores of FWC using Information Gain (FWC IG), Chi-Square (FWC CHI) and Global Support (FWC GS) on Cluto and TechTC-100 datasets: (a) classification accuracies on Cluto; (b) F1-measure on Cluto; (c) classification accuracies on TechTC-100; (d) F1-measure on TechTC-100.]

[Fig. 6 (plots omitted). Comparison of classification accuracies (left) and macro-averaged F1 scores (right) of FWC on TechTC datasets when α is varied. The plots show min, max, mean and standard deviation of the differences. The baseline used α = 0.001; the y-axis is the improvement over α = 0.001.]

These plots show the minimum, maximum and mean differences in FWC accuracy and F1 for different values of α when compared against α = 0.001. The average improvement over α = 0.001 increases slowly. The standard deviations and the ranges of the differences do increase as the α values are taken further apart. However, note that the difference in accuracy or F1 between α = 0.001 and α = 0.7 on any topic is not greater than 2.5%, suggesting that FWC performance is rather stable over a large range of values of α. Note that these plots do not suggest that higher values of α lead to improvement on all topics.

Also, consider that the average tuned FWC accuracy and macro-F1 on TechTC are 0.8717 and 0.8679, respectively. The same measures for α = 0.7 are 0.8684 and 0.8632, and decrease by several points to 0.8336 and 0.8277 for a ten-fold decrease to α = 0.07. For comparison, the SVM results are still below these (at 0.8170 and 0.8142 for accuracy and macro-F1). Therefore, one can simply pick a value of α in this range without performing any tuning and still obtain high quality results.

4. Related Work

Feature selection is a well-known problem and many papers have addressed it by suggesting a score indicative of a feature's usefulness (Yang & Pedersen 1997, Forman 2003). Traditionally, the top-k features with the highest scores, or all features with scores above a certain threshold, are retained and the rest are discarded. Another way of using these scores, however, is as weights for the corresponding features.

Feature weighting or scaling is the task of assigning each feature (a word) in an instance (a document) a weight that would correspond to that feature in a vector representation of the document. The basic idea is to assign higher weights to more discriminative features to simplify the task of learning predictive models. A set of such weighted vectors is usually provided to a learning algorithm with a set of relevance labels in order to construct predictive models. The TFIDF scheme (Salton & Buckley 1988) and related variants are the most common vector representations used in text analysis. In addition, various alternative weighting schemes have been proposed in the literature. Recently, Forman (Forman 2008) suggested using BNS, a feature selection measure proposed in (Forman 2003), for term scaling in the context of binary text classification. He argued that using BNS leads to better performance than many other representations, and also obviates the need for feature selection.

In the recently proposed Democratic Classifier (Malik & Kender 2008), a weight is directly assigned to each feature in the model without the intermediate steps of creating a new document representation or explicitly training a classifier. However, the Democratic Classifier requires many additional steps, such as ensuring instance coverage by some minimum number of features, constructing new features from pairs of words, and computing additional weights. The FWC approach presented here is significantly simpler than the Democratic Classifier, with faster training and comparable accuracy. For example, FWC was as much as ten times faster to train on the sports dataset, while achieving an accuracy of 0.983, which is about 1% better than the Democratic Classifier.

The Discriminative Term Weighting Classifier (DTWC) (Junejo & Karim 2008) explores similar ideas. Odds, log-odds and information gain (KL divergence) were evaluated as discriminative term weighting schemes. Positive and negative scores were computed for each document in the training set, and a linear classifier in the two-dimensional space of such scores was trained. The evaluation, performed on subsets of three text classification problems, Spam (http://www.ecmlpkdd2006.org/challenge.html), Movie Reviews (http://www.cs.cornell.edu/People/pabo/movie-review-data/) and SRAA (http://www.cs.umass.edu/~mccallum/code-data.html), indicates that DTWC, when tuned with the best performing weighting scheme on each dataset, results in accuracies that are comparable to (and in some cases, better than) those of SVM and Naive Bayes. However, none of the three term weighting schemes consistently outperformed existing classifiers. Unlike DTWC, FWC constructs only one model that contains scores for all classes, and does not require training a separate linear classifier, resulting in a simpler and more efficient training procedure.

Winnow (Littlestone 1988) is an incremental linear-threshold algorithm that attempts to reduce the classification error with each incoming example. Winnow responds to each training example according to the current hypothesis, and then updates the hypothesis based on the correct classification, if necessary. Winnow is especially useful when the majority of the attributes are irrelevant. Balanced Winnow (Littlestone 1989) is a variant of Winnow that maintains separate positive and negative weights for each feature, thus allowing for negative coefficients. These negative coefficients make Balanced Winnow more robust for documents with varying lengths.

5. Discussion and Future Work

Recent advances in automatic training data expansion (Wang et al. 2009) and the unprecedented ongoing growth in the sizes of real-life text databases such as the Open Directory, PubMed (http://www.ncbi.nlm.nih.gov/PubMed/), blogs, and web content in general are making it difficult to apply traditional learning methods to modern text classification problems. This requires researchers to investigate new methods that are capable of handling large-scale training datasets with tens of thousands to hundreds of thousands of categories and millions of documents.

With FWC, we attempt to address this problem by directly constructing a classifier from feature frequency counts obtained in a single pass over the training data. In terms of simplicity, FWC resembles Naive Bayes but goes beyond utilizing conditional probabilities. FWC draws inspiration from (Malik & Kender 2008) and (Junejo & Karim 2008), which combined several significance measures for short pattern-based classification, and from (Forman 2008), which evaluated many feature weighting functions.

While there may not be a direct theoretical justification for FWC (which is also true for many other methods that we have cited in the previous section), FWC is based on a well-studied measure from information theory (i.e., Information Gain), and our extensive experimental study suggests that it works well on a variety of text classification problems. Information Gain was also found useful in many other applications, such as online recommender systems (Zhang & Tran 2010).

In the future we plan to apply FWC to data streams with an online formulation of the learning algorithm.

6. Conclusions

We proposed FWC, a novel single pass text classifier that is constructed directly from feature frequency counts, following the spirit of Naive Bayes. For each class, FWC assigns a weight to each feature, obtained by incorporating global significance, class significance, and multi-class presence. The time complexity is linear in the number of entries in the training dataset and the space complexity is linear in the number of features. Experiments performed on 128 binary and multi-class text and web datasets from various domains show that FWC's performance is at least comparable to, and often better than that of linear SVM, while being much easier to train. FWC also performs better than Naive Bayes, and has comparable training complexity.

FWC is an efficient text classifier that is very easy to implement and rivals more complex linear SVMs in terms of classification performance on a variety of datasets. We recommend including it as one of the choices to evaluate on any text classification problem, in particular if scalability is an issue.

7. Acknowledgments

We would like to thank the editors and the anonymous reviewers for their constructive and detailed comments that greatly helped us improve this paper.

References

Anagnostopoulos, A., Broder, A. & Punera, K. (2008), 'Effective and efficient classification on a search-engine model', Knowledge and Information Systems 16(2), 129–154.

Cohen, W. (1995), Fast effective rule induction, in 'Proceedings of the International Conference on Machine Learning (ICML)', pp. 115–123.

Crammer, K. & Singer, Y. (2002), 'On the learnability and design of output codes for multiclass problems', Machine Learning 47.

Davidov, D., Gabrilovich, E. & Markovitch, S. (2004), Parameterized generation of labeled datasets for text categorization based on a hierarchical directory, in 'The 27th Annual International ACM SIGIR Conference', pp. 250–257.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. (2008), 'LIBLINEAR: a library for large linear classification', Journal of Machine Learning Research 9, 1871–1874.

Forman, G. (2003), 'An extensive empirical study of feature selection metrics for text classification', Journal of Machine Learning Research (JMLR) 3, 1289–1305.

Forman, G. (2008), BNS feature scaling: An improved representation over TF-IDF for SVM text classification, in 'Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM)', pp. 263–270.

Gabrilovich, E. & Markovitch, S. (2004), Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5, in 'The 21st International Conference on Machine Learning (ICML)', pp. 321–328.

Greene, D. & Cunningham, P. (2006), Practical solutions to the problem of diagonal dominance in kernel document clustering, in 'Proceedings of the 23rd International Conference on Machine Learning (ICML)', pp. 377–384.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. (2009), 'The WEKA data mining software: An update', SIGKDD Explorations 11.

Joachims, T. (2002), Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Springer.

Joachims, T. (2006), Training linear SVMs in linear time, in 'Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)', pp. 217–226.

Junejo, K. N. & Karim, A. (2008), A robust discriminative term weighting based linear discriminant method for text classification, in 'Proceedings of the IEEE International Conference on Data Mining (ICDM)', pp. 323–332.

Karypis, G. (2003), 'CLUTO: A software package for clustering high dimensional datasets', http://www-users.cs.umn.edu/~karypis/cluto/.

Keerthi, S. S., Sundararajan, S., Chang, K.-W., Hsieh, C.-J. & Lin, C.-J. (2008), A sequential dual method for large scale multi-class linear SVMs, in 'Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining'.

Kibriya, A. M., Frank, E., Pfahringer, B. & Holmes, G. (2004), Multinomial Naive Bayes for text categorization revisited, in G. Webb & X. Yu, eds, 'AI 2004, LNAI 3339', Springer-Verlag, pp. 488–499.

Lewis, D. D., Yang, Y., Rose, T. & Li, F. (2004), 'RCV1: a new benchmark collection for text categorization', Journal of Machine Learning Research 5, 361–397.

Littlestone, N. (1988), 'Learning quickly when irrelevant attributes abound: A new linear threshold algorithm', Machine Learning 2, 285–318.

Littlestone, N. (1989), Mistake bounds and logarithmic linear-threshold learning algorithms, Technical report UCSC-CRL-89-11, University of California, Santa Cruz.

Lyman, P. & Varian, H. R. (2003), 'How much information?', http://www2.sims.berkeley.edu/research/projects/how-much-info-2003.

Madani, O., Connor, M. & Greiner, W. (2009), 'Learning when concepts abound', Journal of Machine Learning Research 10, 2571–2613.

Malik, H. H. & Kender, J. R. (2008), Classifying high-dimensional text and web data using very short patterns, in 'Proceedings of the IEEE International Conference on Data Mining (ICDM)', pp. 923–928.

McCallum, A. & Nigam, K. (1998), A comparison of event models for Naive Bayes text classification, in 'Proceedings of the AAAI-98 Workshop on Learning for Text Categorization', pp. 41–48.

Pang, B. & Lee, L. (2004), A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in 'Proceedings of the ACL'.

Quinlan, J. R. (1986), 'Induction of decision trees', Machine Learning 1, 81–106.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.

Quinlan, J. R. & Cameron-Jones, R. M. (1993), FOIL: A midterm report, in 'Proceedings of the European Conference on Machine Learning (ECML)', pp. 3–20.

Rennie, J. D. (2001), Improving multi-class text classification with Naive Bayes, AI Technical Report 2001-04, Massachusetts Institute of Technology.

Rennie, J. D., Shih, L., Teevan, J. & Karger, D. (2003), Tackling the poor assumptions of Naive Bayes text classifiers, in 'Proceedings of the 20th International Conference on Machine Learning (ICML)'.

Salton, G. & Buckley, C. (1988), 'Term-weighting approaches in automatic text retrieval', Information Processing and Management 24(5), 513–523.

Sebastiani, F. (2002), 'Machine learning in automated text categorization', ACM Computing Surveys 34, 1–47.

Wang, P., Hu, J., Zeng, H.-J. & Chen, Z. (2009), 'Using Wikipedia knowledge to improve text classification', Knowledge and Information Systems 19(3).

Yang, Y. & Pedersen, J. O. (1997), A comparative study on feature selection in text categorization, in 'Proceedings of ICML-97, 14th International Conference on Machine Learning', pp. 412–420.

Yin, X. & Han, J. (2003), CPAR: Classification based on predictive association rules, in 'Proceedings of the SIAM International Conference on Data Mining (SDM)', pp. 331–335.

Zhang, R. & Tran, T. (2010), 'An information gain-based approach for recommending useful product reviews', Knowledge and Information Systems.

Author Biographies

Hassan Malik is currently a Senior Technical Specialist at Thomson Reuters, where he conducts data mining and machine learning research for large-scale text processing applications. Prior to joining Thomson Reuters, he was a Research Scientist at Siemens Corporate Research. Concurrent with his undergraduate and graduate studies, he held several senior technical and management positions at companies in New Jersey, Silicon Valley, North Carolina and Karachi, Pakistan. Dr. Malik obtained his Ph.D. in Computer Science from Columbia University in the City of New York in May 2008. His doctoral research focused on investigating efficient algorithms for mining unstructured data. He also holds a Master of Engineering degree in Computer Science from North Carolina State University in Raleigh, NC, 2003, and undergraduate degrees in Computer Science from SZABIST and the University of Karachi, 1999.

Dmitriy Fradkin received his B.A. in Mathematics and Computer Science from Brandeis University, Waltham, MA in 1999 and his Ph.D. from Rutgers, The State University of New Jersey in 2006. He then worked for 1.5 years at Ask.com, and since 2007 has been at Siemens Corporate Research in Princeton, NJ. His research interests include pattern mining, information retrieval, classification and cluster analysis. Dr. Fradkin is a member of the ACM and the ACM SIGKDD.

Fabian Moerchen graduated with a Ph.D. in February 2006 from the University of Marburg, Germany, after just over 3 years, with summa cum laude. In his thesis he proposed a radically different approach to temporal interval patterns that uses itemset and sequential pattern mining paradigms. Since 2006 he has been working at Siemens Corporate Research, a division of Siemens Corporation, leading data mining projects with applications in predictive maintenance, text mining, healthcare, and sustainable energy. He has continued the study of temporal data mining in the context of industrial and scientific problems and has served the community as a reviewer, organizer of workshops, and presenter of tutorials.

Correspondence and offprint requests to: Hassan H. Malik, Thomson Reuters, 195 Broadway, New York, NY 10007, USA. Email: [email protected]