training text classiﬁers with svm on very few positive ... · training text classiﬁers with svm...

Training text classifiers with SVM on very few positive examples

Janez [email protected]

Natasa [email protected]

Marko [email protected]

Dunja [email protected]

April 2003

Technical ReportMSR-TR-2003-34

Text categorization is the problem of automatically assigning text documents into one or morecategories. Typically, an amount of labelled data, positive and negative examples for a category, isavailable for training automatic classifiers. We are particularly concerned with text classificationwhen the training data is highly imbalanced, i.e., the number of positive examples is very small.We show that the linear support vector machine (SVM) learning algorithm is adversely affected byimbalance in the training data. While the resulting hyperplane has a reasonable orientation, theproposed score threshold (parameter b) is too conservative. In our experiments we demonstratethat the SVM-specific cost-learning approach is not effective in dealing with imbalanced classes.We obtained better results with methods that directly modify the score threshold. We propose amethod based on the conditional class distributions for SVM scores that works well when very fewtraining examples is available to the learner.

Microsoft ResearchMicrosoft CorporationOne Microsoft Way

Redmond, WA 98052http://www.research.microsoft.com

1 Introduction

Text categorization involves a predefined set of cat-egories and a set of documents that need to be clas-sified using that categorization scheme. Each docu-ment can be assigned one or multiple categories (orperhaps none at all). We address the multi-class cate-gorization problem as a set of binary problems where,for each category, the set of positive examples con-sists of documents belonging to the category whileall other documents are considered negative exam-ples. Labelled documents are used as input to variouslearning algorithms to train classifiers and automat-ically categorize new unlabelled documents.

Traditionally, machine learning research has as-sumed that the class distribution in the training datais reasonably balanced. More recently it has been rec-ognized that this is often not the case with realisticdata sets where many more negative examples thanpositive ones are available. The question then ariseshow best to utilize the available labelled data.

It has been observed that a disproportional abun-dance of negative examples decreases the perfor-mance of learning algorithms such as naive Bayes anddecision trees [KM97]. Thus, research has been con-ducted into balancing the training set by duplicat-ing positive examples (oversampling) or discardingnegative ones (downsizing) [Jap00]. When discard-ing negative examples, the emphasis has sometimesbeen on those that are close to positive ones. In thisway, one reduces the chance that the learning methodmight produce a classifier that would misclassify pos-itive examples as negatives [KM97]. An alternativeapproach has been explored in [CS98]. It involvestraining several classifiers on different balanced datasubsets, each constructed to include all positive train-ing examples and a sample of negative examples of acomparable size. The predictions of these classifiersare then combined through stacking.

On the other hand, systematic experiments by[WP01] indicate that neither natural nor balanceddistributions are necessarily best for training. It isevident that having a majority of positive examplesin the training set is important. However, the pro-portion of positive and negative examples that leadsto best classifiers generally depends on the data set.

Implications of rebalancing the training set by modi-fying the number of negative examples have also beenexplored analytically in [Elk01]. Most of this workhas been done on symbolic domains. However, dis-carding negative examples has also been used in textcategorization [NGL97] in conjunction with the per-ceptron as the learning algorithm.

In this paper we investigate the problem of textcategorization with the linear SVM classifier whenvery few positive examples are available for training.In order to investigate this issue systematically, wevary the number of positive examples in the trainingset and apply the resulting classifiers to test datawith a natural distribution of positive and negativedocuments. We are particularly interested in highlyimbalanced distributions with the ratio of positive tonegative documents ranging from less than 0.1 % upto 10 %.

The linear SVM learner produces an optimal hy-perplane wT x = b that separates the positive exam-ple from negative ones in the high-dimensional spaceof features that represent documents. The optimiza-tion problem involves determining the trade-off be-tween the error margin and misclassification cost forthe examples in the training data. The resulting hy-perplane is then used to classify the new, unlabeleddata.

Our results, presented in Section 6, show that im-balanced class distributions negatively affect the per-formance of the SVM classifiers. This appears to bemostly due to the poor estimation of the parameterb, also referred to as the score threshold since thedocuments with the score wT x above the thresholdb are assigned the positive label. The value of b ob-tained by the original learner is too high, resultingin a too conservative assignment of positive labels totest documents. In addition, our experiments withcost-based adjustment of the dividing hyperplane,through modification of the optimization function,have shown that the learner achieves improved per-formance mostly through the change in the parameterb. Thus, we investigate several methods for alteringthe score threshold directly.

In Section 6 we show that this approach is verysuccessful in dealing with highly imbalanced train-ing data. Cross-validation over the training data en-

1

ables us to set the threshold successfully even whenthe training data contains less than 0.01 % of pos-itive examples: the macroaveraged F1 performanceover the test data is increased from 0.0166 to 0.3603.This improvement is also a considerable step towardsthe optimal F1 performance achievable through themodification of b: F1 = 0.4724.

We structured our paper in the following way. Firstwe provide a brief overview of the SVM learningmethod, discussing its cost-based learning and scorethresholding aspects. Then we describe the exper-imental set-up and results of the experiments thatexplore each of these two aspects in detail. We con-clude with summary of our findings and discussion offuture work.

2 Support vector machines

The support vector machine (SVM) [CV95] is a rela-tively recent machine learning algorithm proven to bevery successful in a variety of domains. It is currentlyconsidered the state-of-the-art algorithm in text cat-egorization.

SVM treats learning as an optimization problem.Training and test examples are represented as d-dimensional real vectors in the space of features de-scribing the data. Given a category, each example be-longs to one of two classes referred to as positive andnegative class, respectively. Thus, the training setconsists of pairs (xi, yi), i = 1, . . . , l, where xi ∈ Rd

is the i-th training vector and yi ∈ {+1,−1} is theclass label of this instance.

The learner attempts to separate positive fromnegative instances using a hyperplane of the formwT x = b. Here, w is the “normal” of the hyper-plane, i.e., a vector perpendicular to it, which definesthe orientation of the plane. The constant b definesthe position of the plane in space. To choose w andb, SVM minimizes the following criterion function:

f(w, b) := 12 ||w||

2 + C∑

i ξi

subject to ∀i : yi(wT xi − b) ≥ 1− ξi, ξi ≥ 0.

The space between the planes wT x = b± 1 is calledthe “margin” and its width can be expressed as afunction of w: 2/||w||. The above criterion function

causes the learner to look for a trade-off between min-imizing the term ||w||2, which is equivalent to max-imizing the width of the margin, on one hand, andminimizing the classification errors over the trainingdata, expressed by the term containing

∑i ξi. The

constant C, chosen before learning, defines the rela-tive importance of these two terms. The associatedinequality constraints require each training exampleto lie on the correct side of the hyperplane and suf-ficiently far from it. Otherwise the slack variable ξi

becomes positive and decreases the value of the abovecriterion function for the hyperplane under consider-ation.

Once w and b are calculated, the classifier usesthe following criterion to predict class labels of newdocuments:

prediction(x) := sgn(wT x− b).

For that reason, the parameter b is referred to as thescore threshold. Documents x with the score wT xabove the value b are assigned the positive label.

This basic formulation has also been extended toaccommodate various classification types and addressissues such multiple classes and use of nonlinear clas-sifiers.

2.1 Cost-based learning

One important extension of the SVM model is aimedat highly imbalanced training data where the ratio ofpositive examples to negative examples is very small.It allows us to treat classification errors on positivetraining examples more seriously than those on nega-tive training examples. The general criterion functiontakes the form

f(w, b) :=12||w||2 + jC

∑i:yi=+1

ξi + C∑

i:yi=−1

ξi; (1)

this enables one to set the cost of misclassifying pos-itive examples j times higher than in the originalformulation. This is, in fact, equivalent to over-sampling, i.e., simulating the situation in which jcopies of each positive training example is used inthe original optimization problem.

2

Based on the definition of the optimization prob-lem we expect that the SVM learner will be ratherrobust with respect to the large number of negativeexamples. The negative examples that lie below themargin (i.e., wT xi < b − 1) do not have an influ-ence on the learning process. However, the negativeexamples that are close to the positives, or even in-terspersed among them, could still be relatively detri-mental to the learning process, given that the posi-tives are not so numerous.

2.2 Thresholding strategies

Modifying the score threshold in order to improve theSVM classification performance has been exploredby Platt [Pla99], who assumed that the probabil-ity that a document with the score wT x is posi-tive, P (y = +1|wT x), can be described by a sig-moid function of the score wT x. The parameters ofthe sigmoid are chosen using a validation set and anew threshold is placed where the sigmoid functionachieved the value of 1/2. This, in effect, aims atpredicting the class with the maximum a posterioriprobability. However, the separation of positive andnegative examples is not perfect and such a thresh-old could cause a non-negligible number of positiveexamples to be misclassified as negative. In case ofimbalanced classes, where there were few positive ex-amples to begin with, this can lead to a poor recalland, consequently, poor F1-measure.

Chakrabarti et al. [CRS02] have also observed thatthe F1 performance of an SVM classifier can be im-proved by tuning the threshold with the aid of avalidation set. They investigated this issue onlybriefly, without systematic consideration of imbal-anced training data.

Methods for setting score thresholds are com-monly considered in the area of information retrieval[Yang01] which also comprises various text catego-rization tasks. Among them is the proportional as-signment of positive labels, also known as “PCut”.First, the test documents are ranked based on thescore which estimates the likelihood that a documentis positive. A number of top ranked documents isthen assigned the positive label so that the percent-age of positive predictions on the test set matches

the percentage of positive documents in the trainingset. However, such a technique is less suitable whenone needs to model classification of individual docu-ments rather than a collection of documents at once.Other thresholding strategies mentioned in [Yang01]can be used in such situations, however. In particu-lar, “SCut” consists of selecting a separate thresholdfor each category using a separate validation set ofdocuments. “RCut” assigns each document to a fixednumber of categories that appear to be the most likelyfor this document.

3 Design of the experiments

We designed our experiments to examine cost basedand threshold based improvements of the SVM learn-ing algorithm for highly imbalanced classes. We con-trolled the proportion of positive documents in thetraining data but evaluated the methods on the testdata with the natural class distributions.

We used documents from the Reuters Corpus, Vol-ume 1 [Reu00] that consists of 806,791 Reuters arti-cles dated 20 August 1996 through 19 August 1997.Out of those, we chose 302,323 documents dated laterthan 14 April 1997 for the test set. For training, werandomly selected document sets from those dated14 April 1997 or earlier. To ensure that the train-ing data contains a particular number or percentageof positive examples we used a different training setfor each category. Thus, we treated classification ofeach category completely independently of the clas-sification of the others. For each category C we builttraining sets Tr(C,P,N) with a predefined numberP of positive examples, i.e., documents belonging tocategory C, and a number N of negative examples,i.e., documents not belonging to C. In all our exper-iments, we fix N = 10,000 and vary the values of Pfrom 8 to 1024. The training sets for different valuesof P are nested in the sense that P < P ′ implies thatpositive examples in Tr(C,P,N) are a subset of thosein Tr(C,P ′, N). The negative examples are the samein both.

We used the familiar bag-of-words model to repre-sent the documents, eliminating stopwords and wordsoccurring in less than four training documents. Each

3

document was represented by a vector of featuresweighted by the standard TF-IDF score and was nor-malized to the Euclidean length of 1.

The original Reuters classification scheme involves103 categories. In our experiments we used a subsetof 16 categories: c13, c15, c185, c313, e121, e13, e132,e142, e21, ghea, gobit, godd, gpol, gspo, m14, m143,which were chosen on the basis of preliminary exper-iments with the full set of categories. We observedthe performance of the SVM classifier on a smaller setof training and test data and selected the categoriesthat cover a range of category sizes and classificationdifficulty, as measured by the breakeven point. Thesame set of categories has been already been used inour earlier work [BGMM02].

To quantify the classifier performance we calcu-lated the F1 measure, a function of the precision pand recall r, F1 := 2pr/(p + r), where p is the pro-portion of correct predictions among documents thatwere predicted to be positive, and r is the proportionof correct predictions among the documents that weretruly positive.

We describe the performance in terms of themacroaveraged F1 values over the set of categories.More precisely, given a particular value of P anda particular thresholding strategy, we train a classi-fier for each category, using Tr(C,P,N) as a trainingset. The classifier is then applied to the test dataand its F1 performance on the test set is recorded.The macroaveraged F1 is then simply the average ofF1 values across all 16 categories. Thus the influ-ence of smaller categories on this performance mea-sure is probably smaller than it would be if macroav-eraged F1 were computed over all Reuters cate-gories (where the smaller categories would predomi-nate more strongly). At the same time, the influenceof the smaller categories is probably larger than itwould be if microaveraged performance values wereused.

We also observe the precision-recall breakevenpoint (BEP) [Lew91, p. 105] for individual classifiersand the macroaverage across the classifiers. As thethreshold is gradually decreased, more documents areassigned positive labels, resulting in increased recallbut, generally, a decreased precision. At some pointthe precision and recall become equal and this value

of precision and recall is defined as the breakevenpoint (BEP). Thus, by definition, BEP does not de-pend upon a particular choice of a threshold but onlyon the ranking of document scores wT x. Therefore,it is particularly useful for evaluating the quality ofthe hyperplane orientation w determined by the SVMlearner.

We used Thorsten Joachims’ SVMlight program[Joa99] to train the SVM models. We conductedthree sets of experiments. The first set explores theeffect that the cost parameter j has on the perfor-mance of the classifier. The second set investigatesstrategies for setting the score threshold value b. Fi-nally, in the third set we experiment with the combi-nation of varying j and b simultaneously.

4 Experiments withcost-based learning

In this section we explore the effects of cost adjust-ment in the SVM optimization problem by varyingthe value of j (see Section 2.1). We gradually in-crease j starting with the default value j = 1. This isexpected to increase the assignment level of positivelabels to documents, thus leading to an increase inrecall at the expense of decrease in precision. Theresults of our experiments are presented in Figure 1.

As it can be seen, the j parameter has a relativelysmall effect on the categorization of test data. Thiseffect is adverse if the training set contains a suffi-ciently large ratio of positive examples (in this case10 %, i.e., 1024 documents). Then the gain in recallis outweighed by the loss of precision and, as a result,the F1 measure decreases.

We also observe that most significant changes oc-cur as j increases from its default value of 1 to j = 2.The effect of further increasing j is rather small. Itquickly reaches the value for which no misclassifica-tions of positive examples in the training set occur.Given that j only affects errors on positive examples,1

1Strictly speaking, it influences all the positive training ex-amples that have ξi > 0. They need not actually be misclassi-fied; they might also lie within the margin (i.e., on the correctside of the separating hyperplane but not far enough from it).

4

0

0.2

0.4

0.6

0.8

1

1 2 5 10 20 30Value of the J parameter

Macroaverages, 11 positive training examples

training precisiontraining recall

test precisiontest recall

0

0.2

0.4

0.6

0.8

1



training F1training breakeven point

test F1test breakeven point

0

0.2

0.4

0.6

0.8

1





0

0.2

0.4

0.6

0.8

1





0

0.2

0.4

0.6

0.8

1





0

0.2

0.4

0.6

0.8

1





Figure 1: The dependence of macroaveraged performance measures on the cost of false negatives (the jparameter). Number of positive training examples: top row, 11 (0.1 %); middle row, 128 (1 %) bottom row,1024 (10%). Charts in the left column show precision and recall, and those in the right column show F1 andthe break-even point. Precision and recall behave similarly on both the training and the test set: precisiondecreases and recall increases with increasing values of j. F1 increases with j, except when there are manypositive training examples. Interestingly, the breakeven point hardly changes with the increase of j.5

increases in j beyond that value does not influencethe learning process.

The BEP hardly changes with the increase of j.On the training set it remains practically unchanged.On the test set it remains almost unchanged for smallnumber of positive examples but shows decrease inperformance for larger number of positive examples.Since BEP is only affected by the orientation of thedividing hyperplane, this suggests that the parame-ter j is not effective in adjusting the orientation ofthe hyperplane in order to improve the classificationperformance. As an illustration, in experiments with11 positive examples, increasing j does change theorientation of the normal by 13,6◦ on average (aver-age over the 16 categories, that is). Thus, the maineffect of j on the precision, recall, and F1 is throughthe change in the score threshold b.

The following sections provide detailed analysis ofthe effect that j has on the classifier performance forP = 11, 128, and 1,024 positive documents in thetraining corpus, which correspond to 0.1 %, 1 %, and10 % of the training data. (Figure 1.)

Influence of j for very few positive examples(P = 11, i.e., 0.1 % of the training set). Recallat j = 1 is quite low even when applied to the train-ing set (49.3 %), and extremely low (1.0 %) on the testset. In other words, the scarcity of positive trainingexamples yields a very conservative classifier; on thetest set, with the natural distribution of positive ex-amples, it hardly labels any document as belongingto the positive class. However, increasing the valuesof j quickly improves recall over the training data,reaching recall of 97.9 % at j = 2 and 100 % at j = 5.Note that from that point on there are no misclassi-fied positive examples and thus increasing j furtherdoes not seriously affect the learning process. In fact,the data shows that the separating hyperplane itselftypically does not change from j = 8 onwards.

Increase in recall on the test set is much less dra-matic, reaching 5.8 % at j = 2 and 8.1 % at j = 5. Atthe same time, precision for j = 1 is 100 % for boththe training and the test set as the classifier is veryconservative. It decreases to 99.1 % on the trainingset and 86.5 % on the test set as j is increased. The

net result is, in some respect, captured by F1. It in-creases from 50.3 % to 99.1 % on the training set andfrom 1.6 % to 12.9 % on the test set.

It is also instructive to observe the precision-recallbreakeven point (BEP), which helps us analyze the ef-fect of j on the orientation of the hyperplane (w) as itis independent of the threshold b. BEP on the train-ing data remains completely unchanged (at 66.3 %),and decreases on the test data from 46.0 % to 45.4 %.It is, thus, clear that most of the classification per-formance change, captured by other measures, is dueto the adjustment of the score threshold.

A closer look at the data shows that changing thej parameter caused the orientation of the hyperplaneto be adjusted by 13.6◦ on average but without apositive effect. On the other hand, the average scorethreshold value for 16 categories increased from 1.08to 1.23 and the average ||w|| increased from 3.06 to7.36. This implies that the learner has been willingto accept a narrower margin (which is always 2/||w||units wide) as j increased. It turns out that the hy-perplane has moved closer towards the origin of thecoordinate system by 0.192 units on average (whichis reasonable, because it makes the classifier more lib-eral as the threshold b was positive). This is, in fact, arather modest shift in the hyperplane position whencompared with those observed in the threshold ad-justment experiments presented further in the paper.Holding the hyperplane normal fixed (as learned forj = 1) and tuning b to maximize F1 over the test datamoves the plane by 0.302 units on average. Settingthe value of b using cross-validation on the trainingset moves the plane by 0.287 units on average whilechoosing b to be the score value where BEP occurson the training set moves the plane by 0.232 units onaverage.

Thus we conclude that incrementing j has not suc-ceeded in finding a better orientation of the hyper-plane nor modifying the score threshold b appropri-ately to produce an effective SVM classifier.

Furthermore, comparing the weights of individualfeatures in the resulting normals for j = 1 (say w)and j = 8 (say w) shows high overlap in features withnonzero weights in the two vectors. Indeed, hardlyany new features have been introduced into the nor-mal by changing j: w has 4,829 nonzero components

6

on average while w has 5,097. Furthermore, if thenewly-introduced features are discarded from w, theangle ∠(w, w) hardly changes.

A more detailed analysis of the normals shows thatpractically all positive features in w remain positivein w. The features that were dropped (i.e., are zeroin w but not in w) were all negative and almost all ofthe newly introduced features have negative weights.Furthermore, hardly any weights have changed theirsign: on average 10 from negative to positive, and12 from positive to negative. The absolute valuesof the weights in w and w probably shouldn’t becompared directly as length of the vector w is about2.4 times larger, on average (because of the narrowermargin). However, if we normalize both normals tounit length, we observe that the number of weightswith large absolute values has not changed by much,but that w has more smaller weights.

The error analysis shows that for j = 1 there wereon average 8.625 false negative classifications per cat-egory for the training set. Thus, most of the 11positive training examples have been misclassified asnegative. For j = 8 there are no false negatives onthe training set. This suggests that the weights ofpositive support vectors (i.e., the corresponding αi

coefficients) have increased considerably. Generally,this could cause too many false positive classifica-tions and, in order to prevent that, the SVM learnerincludes features with negative weights into w, as ithas been observed above.

Moderate number of positive examples (P =128, i.e., 1 % of the training set). When thenumber of positive examples in the training set isP = 128 we observe that for j = 1 the recall is betterand precision worse when compared with the perfor-mance for P = 11. Thus, having 1% instead of 0.1 %of positive examples in the training data producesSVM classifiers that are less conservative in assigningthe positive class label to documents. The recall onthe training set now increases from 72.9 % for j = 1to 99.9 % for j = 4 (note that it takes values of j be-tween 10 and 30 to reach 100% recall). On the testset it increases from 42.0 % to 52.7 %. At the sametime, test precision drops from 80.0 % to 67.6 %, while

F1 grows from 47.8 % to 53.6 % on the test set. Asbefore, most of these changes occurs between j = 1and j = 2.

Abundance of positive examples (P =1, 024,i.e., 10 % of the training set). For P =1,024 the performance for j = 1 is already relativelygood with 92.1 % recall on the training and 76.6 %on the test set. Thus, the number of positive exam-ples is sufficiently large to ensure training of effectiveclassifiers. Increasing j to 4 yields further increase inrecall, to 99.9 % on the training set and to 82.3 % onthe test set. Precision on the test set decreased from59.2 % to 50.5 % causing F1 to decrease from 62.4 %to 58.3 %. Note that this is in contrast with the per-formance for small P where increasing j improvedrecall noticeably at the expense of precision but F1

was nevertheless increased, despite the precision loss.This is consistent with the fact that the value of F1

tends to reflect the lesser of precision and recall. Forexample, for small P , the recall on the test set wasrather low while the precision was very high. In-creasing j improved recall noticeably at the expenseof precision but F1 was nevertheless increased despitethe loss in precision. For large P , the recall is not suf-ficiently low to begin with and the increase in recallwas not sufficient to counterbalance the decrease inprecision.

The fact that we are not gaining anything by in-creasing j in this situation is confirmed by the BEPvalues, which grow slightly on the training set, from96.6 % → 99.0 % but decrease on the test set from67.0 % → 64.9 %.

5 Score distributions and classprobabilites

In the previous section, we have seen that increment-ing the j parameter affects the classification perfor-mance through modifying the score threshold value b.However, we also observed that this modification wasnot adequate. Thus, we focus our effort on adjustingthe score threshold directly. This calls for an analy-sis of score distributions wT x of positive and negative

7

examples, both in the training and test data sets.In this section we first investigate the score distri-

bution of positive and negative examples with respectto the default threshold obtained by SVM over thetraining data. Then we estimate the class probabili-ties, i.e., the probability that a document with a par-ticular score belongs to a class. Based on the findingswe propose a thresholding heuristic based on classprobability distributions over the training data giventhe SVM normal w in addition to the more obviousones such as the threshold that leads to maximumF1 measure over the training data or the thresholdrecommended by cross validation over the trainingdata.

5.1 Analysis of score distributions

In order to study the distribution of values b(x) :=wT x for a given normal w and documents x from thetraining and test sets, we discretized the distributionby dividing the range [minx wT x,maxx wT x] into 50subintervals of equal width. We recorded the num-ber of positive and negative examples in each scoreinterval. Figure 2 shows the resulting distributions(C183).

We are particularly interested in score distributionobtained by the SVM learner with j = 1 over thetraining data and the score value b∗ that correspondsto the maximum F1 measure performance over thetest data. The objective of our thresholding methodsis to adjust b0 automatically to values close to b∗.

Of interest are also the score values of b0 − 1 andb0 +1 where we expect concentration of support vec-tors.

From Figure 2 we observe that, for P = 1024 and128 positive documents in the training data, SVMscores of positive and negative documents lead tosimilar distributions over the positive and negativedocuments. The distributions are not necessarilyseparable but the consistency in distributions pro-vides solid foundation for performance prediction us-ing score thresholds.

We note that the distribution of negative train-ing documents tends to reach maximum around b =b0−1, at the location of negative support vectors. Infact, the actual peak is typically at some b strictly be-

low b0−1. This is consistent with the SVM optimiza-tion strategy: any data instance positioned withinthe margin entails a cost, measured by the slack vari-able ξi.

Similarly, the distribution of positive training ex-amples tends to have a spike at b = b0 + 1 becauseof the concentration of positive support vectors atthe unit distance from the hyperplane. Interestingly,b = b0 + 1 is typically the peak of the distributionin contrast to the score distribution of negative ex-amples. We also observe that the score distributionsof positive documents have more symmetrical shapethan that of the negative documents. In the futurewe will investigate whether this characteristic of scoredistributions is correlated with the relative propor-tion of positive and negative examples in the trainingset.

We note that in case of P = 128, the score distri-bution for positive test documents is much broaderthan the one for positive training documents, indi-cating that many positive test examples are assignedlow values of wT x and will therefore be misclassifiedas negative. This is the sign of overfitting to the onepercent of positive examples.

In the case of extremely imbalanced data, e. g.,P = 11 (the top row of Figure 2), we see that thelearner separates the positive and negative trainingexamples well but is much less effective in doing so onthe test data. The classifier suffers from serious over-fitting. Furthermore, estimation of positive score dis-tribution over the test data is unreliable with very fewdata points. Thus, it is hard to modify the thresholdbased on score distribution solely. However, from theexperiments we also observe that, in the case of veryfew positive training examples (in this case 0.1 %),the SVM learner typically selects a threshold so thatpractically all of the positive documents are supportvectors; some actually lie within the margin, but usu-ally none lie on the negative side of the margin (i.e.below the plane wT x = b0 − 1). This suggests an-other strategy selecting a better threshold by usingb = b0 − 1 instead of b0. This simple heuristic isto a certain degree incorporated in the method wepropose in the following section.

8

0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

292)

Threshold value (b)

Category C183, j = 1, positive examples: 11 training (natural: 2240), 2691 test

training positivestraining negatives

test positivestest negatives

original threshold b0 (F1=0.0000)best test threshold b* (F1=0.4747)

0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

314)

Threshold value (b)

Category M14, j = 1, positive examples: 11 training (natural: 9812), 34557 test




0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

938)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

03)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6 7b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

77)

Threshold value (b)





0

1

10

100

1000

10000

100000

-3 -2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

73)

Threshold value (b)





Figure 2: These charts show how many documents fall at a certain distance from the plane. For illustration,we show the analysis for two categories: C183 is shown on the left as an example of a smaller category,and M14 on the right as an example of a large category (charts for other categories can be found in theappendix, pp. 21–23). Number of positive training examples: top row, 11; middle row, 128; bottom row,1024. The two vertical lines show the original threshold obtained by training SVM and the hypotheticalbest threshold calculated from the test data, just for comparison. The bottom two graphs show the analysisfor a very small number of positive examples (P = 11): note how the distribution of positive test exampleshas hardly any relation to that of the positive training examples.

9

5.2 Analysis of the class probabilities

We explore the possibility of using class probabilitiesto identify successful score thresholds in case of highlyimbalanced classes.

Given the orientation of the hyperplane w andsome value b we estimate the probability that a doc-ument x assigned the score wT x belongs to the pos-itive class as P (⊕|b) := P (y = +1|wT x = b). Weapply similar methods as in the previous section.We divide the range of values wT x into subintervals(about three times as long as those in the analysis ofscore distributions) and record the relative frequencyof positive examples among all the documents withscores within a given interval.

The distribution curves (Figure 3) have a famil-iar sigmoidal shape. For training examples the esti-mated P (⊕|b) is 0 for low scores b and grows rapidlytowards 1, the value that is maintained for the rest ofthe document scores, confirming that SVM scoring oftraining examples does not introduce false positivesamong highly scored documents.

On the graphs we indicate the default score thresh-old b0 that SVM produces during the learning processand the threshold b∗ that maximizes the F1 perfor-mance measure on the test set. We note that b∗ usu-ally occurs close to the value of b where the probabil-ity P (⊕|b) for the training set first begins to grow invalue, i.e., the score interval with significant numberof positive examples is encountered. This is consis-tent with our observation about the distribution ofscores in the previous section: in case of highly im-balanced training data it seems suitable to adjust thedefault threshold to the value of b0 − 1 (thus movingthe separating plane to the lower edge of the originalmargin area).

We, thus, suggest a simple heuristic for selecting anew score threshold. We choose the smallest value ofb for which the probability that an instance is posi-tive, given that its score belongs to the correspondinginterval, is above a certain small predefined thresh-old, e.g., η = 2 %. The exact value of η should notbe very important, considering how immediate andlarge the first increase in P (⊕|b) appears to be onthese distribution charts for P = 11.

6 Experiments with thresholdmodification

In this section we evaluate several methods for settingscore thresholds b. We assume that the SVM learnerhas provided a hyperplane wT x = b0 under the de-fault setting of j = 1. We intend to modify only thethreshold (replace b0 by some other value b), keepingits normal (w) unchanged. This, therefore, preservesthe orientation of the plane but influences the levelof assignment of the positive labels to the test docu-ments; a higher (lower) threshold leads to more con-servative (more liberal) assignment. Success of thethresholding mechanism is measured by comparisonwith the optimal value of the F1 measure for the testdata.

6.1 Comparison of different thresh-olding heuristics

Several methods for modifying the threshold are con-sidered: selecting the score (1) that maximizes thevalue of F1 over the training data; (2) cross-validationover subsets of training data, observing the F1 mea-sure on individual subsets; and (3) based on a scorethat corresponds to a specified value of positive classprobability as described in Section 5.2. For the sakeof comparison, we cosider the threshold that maxi-mizes F1 over the test data which is, of course, un-available in practice. The results of these experimentsare summarized in Figure 5 and Table 2.

We observe that although the performance of theoriginal threshold obtained by the SVM learner isvery poor when the number of positive training ex-amples was low, the same model can be consider-ably improved by choosing a different threshold. Infact, by choosing a threshold using cross-validationon the training set, we get to within 70 % of theoptimal threshold, the one that maximizes the F1

performance measure on the test data. Additionally,placing a threshold on the training break-even pointor using the threshold that maximizes training F1

also turns out to be a reasonably successful strategy;it achieves about 50% of the optimal test F1.

The following is a detailed analysis of these results

10

0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.087

7)

Threshold value (b)


P(positive|training set)P(positive|test set)


0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.094

2)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.281

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.31)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6 7b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.53)

Threshold value (b)




0

0.5

1

-3 -2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.519

)

Threshold value (b)




Figure 3: These charts show the proportion of positive documents among those that fall at a certain distancefrom the plane. For illustration, we show the analysis of two categories: C183 is shown on the left as anexample of a smaller category, and M14 on the right as an example of a large category (charts for othercategories can be found in the appendix, pp. 24–26). Number of positive training examples: top row, 11;middle row, 128; bottom row, 1024. The two vertical lines show the original threshold obtained by trainingSVM and the hypothetical best threshold calculated from the test data, just for comparison.11

0

0.25

0.5

0.75

1

0 1 b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1

original threshold b0 (F1=0.0000)best test threshold b*(F1=0.4747)

0

0.25

0.5

0.75

1

-1 0 1 b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


0

0.25

0.5

0.75

1

-1 0 1 2 3 4b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


0

0.25

0.5

0.75

1

-2 -1 0 1 2 3 4b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


0

0.25

0.5

0.75

1

-2 -1 0 1 2 3 4 5 6 7b0 b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


0

0.25

0.5

0.75

1

-3 -2 -1 0 1 2 3 4 5 6b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


Figure 4: These charts show how F1 depends on the threshold b of the separating hyperplane wT x = b whileits normal (w) remains unchanged. C183 is shown on the left as an example of a smaller category, and M14on the right as an example of a large category.

12

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

10 100 1000

Mac

roav

erag

ed F

1 pe

rform

ance

on

the

test

set

Number of positive training examples

Comparison of different thresholding strategies at J = 1

original thresholdP(pos|w.x) > 0.02

cross-validation (avg. b)training breakeven point

test breakeven pointmaximize test F1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

10 100 1000

Mac

roav

erag

ed F

1 pe

rform

ance

on

the

test

set


Comparison of different thresholding strategies at J = 1

original thresholdP(pos|w.x) > 0.5

cross-validation (avg. dist.)maximize training F1

maximize test F1

Figure 5: A comparison of various strategies for modifying the threshold of the separation hyperplaneobtained by the SVM learner. Note that the macroaveraged test F1 performance measure is shown in allcases. The explanations in the legend refer to the thresholding strategy used to modify the threshold of thehyperplane found by the SVM learner.

for P = 11 positive training examples. The thresh-old as proposed by the SVM learner is usually muchtoo high, already for the training set but even moreso for the test set (as we can see from Figure 5; re-call: 21.6 % on the training set, 1.7 % on the test set;precision: 56.3 % and 18.8 %). It is interesting thatwhile many categories are linearly separable problemsand a threshold exists that would achieve 100% ac-curacy on the training set, the SVM prefers to choosea much higher threshold that misclassifies most pos-itive training examples as negative. The reason isprobably that this gives it the illusion of a muchwider margin (the misclassified positive examples areignored for the purposes of margin classificaton andwhile their slack variables do penalize the plane a lit-tle, this is apparently not enough to overcome theattraction of the wide margin).2

6.1.1 Maximizing training F1

Threshold modification by choosing b to be the valuethat maximizes F1 on the training set improvesthe macroaverage F1 performance from and 1% to

2To avoid this, one might be tempted to increase the over-all misclassification cost, i.e., the C parameter. However, Sec-tion 8 will show that this is not helpful.

23.2 % on the test set (Figure 5; on the training set,macroaveraged value rises to 99.7 %).

An alternative to this heuristic might be to placethe threshold at the break-even point, i.e., so that theprecision and recall on the training set are roughlythe same. It is not unreasonable to expect a good F1

performance around the break-even point, becausethe F1-measure tends to rewards circumstances whereprecision and recall are similar rather than thosewhen one of them has been increased at the expenseof the other. It turns out that the thresholds placedat the breakeven point are very close to those at max-imum training F1, and the resulting performance onthe test set is very similar as well.

6.1.2 Cross-validation

We are aware that by considering the maximum F1

value over the whole training we run the risk of over-fitting the value of b to the training data. In orderto reduce that risk we apply stratified 5-fold cross-validation, hoping to identify a more robust scorethreshold. We train the SVM classifier on 80 % ofthe training examples and modify the threshold tomaximize F1 on the remaining 20% of the trainingset. In the end, each of the five iterations during

13

cross-validation proposes a possibly different normalwi and score threshold bi. As our final model, weuse the normal trained over the entire training setbut calculate the threshold b from the five bi param-eters obtained from cross-validation. The average,b := 1

5

∑5i=1 bi, is one obvious choice. However, we

might be concerned by the fact that b is not reallyexpressed in any absolute units but is relative to thewidth of the margin. The role of b, as far as theclassifier is concerned, is really to define the distancebetween the hyperplane and the origin of the coor-dinate system. For the plane wT x = b, this dis-tance is b/||w||. Thus, we can average the distances,d := 1

5

∑5i=1 bi/||wi||, and set the threshold of the fi-

nal plane so as to place the plane at this distance fromthe origin: b := d · ||w||. It turns out that the thresh-olds proposed by these two approaches are usuallyquite similar. Averaging the values of b gives betterperformance at P = 11 (F1 = 36.0 % compared to33.5 % obtained from averaging the distance), whileat P = 128 and P = 1024 averaging the distance isslightly more successful, although the differences aresmall.

6.1.3 Thresholding from class probabilities

Based on the observations in section 5.2, we considerthe class probability P (⊕|b) that x is positive givenb = wT x, and place the score threshold at the small-est b where this probability grows beyond some smallpositive value η. In this case we used η = 0.02.

As Figure 5 shows, this heuristic is very successfulwhen very few positive training examples are avail-able. However, from P = 64 upwards, this heuris-tic leads to relatively poorer performance, indicat-ing that the resulting classifier is too liberal in as-signing the positive class values. This is reasonable,given that the probability P (⊕|b) as a function of bno longer has such a sudden and steep increase from0 but instead assumes a more continuous sigmoidalshape (see the bottom two rows of Figure 3).

If we take the “Bayesian” approach, i.e. choosingthe threshold b such that at P (⊕|b) = 1/2, in order toclassify each instance into the class with the greater aposteriori probability, we see that the resulting clas-sifier is less successful for small values of P . However,

for higher values of P this approach improves the per-formance and for P > 64 it matches the effectivenessof the cross-validation method (Sec. 6.1.2).

6.1.4 Maximizing test F1

In order to put the above results into a perspective,it is informative to look at the maximum F1 thatcan be achieved on the test data just by modifyingthe threshold. If the threshold is chosen optimallyfor each category, the macroaveraged F1 achievesthe value of 47.2 % on the test set. Incidentally,if the threshold is placed where the precision-recallbreakeven point occurs on the test set, the resultingclassifier is too conservative and its F1 is on averageabout 0.05 less than the optimal F1.

6.2 Range of good thresholds

As a way of assessing how difficult it is to select agood score threshold, we examine the distribution ofscores over the test data. We determine the range ofscore thresholds that, if chosen, would lead to a spec-ified performance over the test data, e.g. 80 % of theoptimal F1 value on the test set. The narrower thisrange is, more difficult it is to choose a good thresh-old. In other words, we are interested in the width ofthe bell-shaped portion of the curves in Figure 4.

Given a hyperplane normal w, suppose that b∗ isthe threshold that maximizes test F1 for this w. Leta constant a specified the level of F1 performance tobe achieved. We define:

b1 := max{b < b∗ : F1(w, b) < aF1(w, b∗)}b2 := min{b > b∗ : F1(w, b) < aF1(w, b∗)}

and let bmin := mini wT xi, bmax := maxi wT xi,where i goes over all test documents. Then we canregard κa(w) := (b2 − b1)/(bmax − bmin) as an in-dicator of the difficulty of choosing a good thresh-old for the particular classifier under consideration.The measure is relative to bmax − bmin to make iteasier to compare different classifiers. Alternatively,we might convert b2 − b1 to Euclidean distance be-tween the hyperplanes wT x = b1 and wT x = b2, i.e.κ′a(w) := (b2 − b1)/||w||. The values of these indi-cators, shown in Table 1 for a = 0.8, suggest that,

14

Macroaveraged valuesj P F1(b∗) κ0.8 κ′0.8

1 11 0.4724 0.1136 0.05401 128 0.6490 0.2186 0.06501 1024 0.6855 0.2159 0.0586

Table 1: The κ0.8 and κ′0.8 values indicate how wide an inter-

val of thresholds (b) around the optimal threshold b∗ (i.e., thethreshold that maximizes F1 on the test set) result in F1 that isat least 80 % of the optimal. The lower these values, the moredifficult it is to choose a good threshold without looking at thetest set. See the text for a formal definition of κ and κ′.

Macroaveraged F1 on the test setj P At original At training Cross-validation At optimal Thr. from

threshold BEP (avg. b) (avg. dist.) threshold P (⊕|wT x) distr.1 11 0.0166 0.2324 0.3603 0.3351 0.4724 0.33341 128 0.4795 0.5901 0.5889 0.5944 0.6490 0.46681 1024 0.6249 0.5987 0.5803 0.5958 0.6855 0.3753

100 11 0.1311 0.0916 0.3500 0.3336 0.4651 0.2997100 128 0.5375 0.2595 0.5666 0.5853 0.6334 0.3750100 1024 0.5841 0.5590 0.5666 0.5853 0.6661 0.6025

Table 2: The performance of classifiers based on different thresholding strategies.

indeed, it is more difficult to choose a good thresholdin situations when there were few positive training ex-amples (i.e., small value of P ), although the results,particularly for κ′, are somewhat inconclusive.

7 Using both j and b

In this section, we consider applying both cost-basedlearning (via the j parameter) and threshold mod-ification (the b value), using several thresholdingheuristics.

If we decide to retain the original threshold aslearned by the SVM, we saw (in Section 4) that in-creasing the value of j improves performance if thenumber of positive examples is small (e.g. P = 11 and128) but degrades performance slightly if the numberof positive examples is large enough: at P = 1024,F1 = 62.5 % for j = 1 but 58.4 % for j = 100.

Due to the scarcity of positive examples on thetraining set, our classifiers are too conservative. Thethreshold b needs to be lowered to improve the per-formance on the test set. For j = 1, setting thevalue of b to the score of the BEP for the trainingdata makes the classifier more liberal. For j = 100 itactually increases the threshold and makes the clas-sifier more conservative, leading to performance that

is worse than that of the original threshold.In the models obtained at j = 100, all the 11 pos-

itive training examples become unbounded supportvectors, i.e., they have 0 < αi < jC and lie on theplane wT x = b0 + 1, parallel to the separating hy-perplane wT x = b0. The negative examples all liebelow the plane wT x = b0 + 1; some of them are un-bounded (i.e., lie on wT x = b0−1), some are bounded(i.e. have αi = C) and thus lie above wT x = b0 − 1;some of these latter even lie above wT x = b0, mean-ing that they are misclassified by this model. Nonelie above wT x = b0 + 1, however.

Now suppose that the threshold of the plane ismodified from b0 to some new value b. For valuesof b > b0 + 1, the preceding paragraph shows that allthe training examples would be predicted negative bysuch a plane, resulting in 0 % recall and (by defini-tion) 100 % precision; for b < b0 + 1, recall wouldbe 100 % as all positive examples would be predictedpositive.

Let b := maxi:yi=−1 wT xi be the lowest thresh-old below which all the negative training examplesare located. For values of b ∈ (b, b0 +1), all positiveexamples are declared positive, so precision is still100 %, but, as b drops below b, more and more neg-ative training examples are misclassified as positive

15

0

0.25

0.5

0.75

1

-1 0 1 2 3b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


0

0.25

0.5

0.75

1

-1 0 1 2 3b0b*

Per

form

ance

if th

e se

para

ting

hype

rpla

ne is

pla

ced

at th

is th

resh

old

Threshold value (b)


training F1test F1


Figure 6: The dependency of F1 on the threshold b while the normal w, obtained by learning at j = 100,remains unchanged.

and precision begins to decrease.

Thus we see that precision and recall are both100 % for all b ∈ (b, b0 + 1); for b > b0 + 1, pre-cision is greater than recall, and for b < b, recall isgreater than precision. If we decide to choose b so asto maximize the F1-measure on the training set, or toplace it on the precision-recall break-even point, bothcriteria give the same proposal: b must lie somewherein (b, b0 +1), but they cannot be more specific thanthat.

The results shown in Table 2 above use the middleof the range thus proposed, i.e., b := (b + b0 + 1)/2.Note that the existence of bounded negative sup-port vectors implies that b > b0 − 1 and henceb > b0. Thus our newly proposed threshold b actu-ally makes the classifier more conservative than theoriginal threshold at b0. (Even if C were increased,thereby forcing the learner to avoid placing negativetraining examples above the plane wT x = b0 − 1, bwould still be ≥ b0 − 1 and the new threshold pro-posed by these heuristics would still be ≥ b0, thusnot making the classifier any more liberal.) Indeedit is hard to expect a heuristic to propose a usefulnew threshold for a hyperplane that cannot distin-guish between different positive training examples atall (because they are all equally distant from it, i.e.,all on wT x = b0 + 1): the new threshold can really

only tell how far from the positive examples shouldthe new hyperplane be placed, and it is not obvioushow to do this in a principled way and so as to achieveas much as possible for performance on the test set.

If cross-validation on the training set is used tomodify the threshold, the performance is roughly thesame in both cases (for j = 1 and j = 100), andinsofar as there is a difference it is in favour of j = 1.The same is true for the optimal threshold (i.e., theone that maximizes the F1 measure on the test set).Both of these results suggest that the orientation ofthe original plane (the one obtained for j = 1) isactually slightly better than of the one for j = 100;in a sense, j = 100 caused the learner to slightlyoverfit the training set.

The general conclusion of this section is that itconfirms the observations of Section 4 that the hyper-planes obtained at large values of j do not really havea better orientation than those obtained at j = 1;that modifying the threshold is a much more success-ful way of improving the original classifier; and that,if one decides to modify the threshold, the j param-eter can be left at its original value of 1 and there isno need to tune it.

16

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

10 100 1000

Mac

roav

erag

ed F

1 pe

rform

ance

on

the

test

set


Comparison of different instance weighting strategies (solid: J = C = 1; dashed: J = 100, C = 1; dotted: J = 1, C = 100)

maximize test F1, J = C = 1maximize test F1, J = 100, C = 1maximize test F1, J = 1, C = 100

cross-validation (avg. b), J = C = 1cross-validation (avg. b), J = 100, C = 1cross-validation (avg. b), J = 1, C = 100

original threshold, J = C = 1original threshold, J = 100, C = 1original threshold, J = 1, C = 100

Figure 7: A comparison of various strategies of increasing the influence of misclassified training exampleson the criterion function optimized by the SVM learner. C = j = 1 is the default setting; C = 1, j = 100increases the influence of positive examples by a factor of 100; C = 100, j = 1 increases the influence of allexamples by a factor of 100. The margin maximization term of the criterion function remains the same inall cases.

8 A comparison of weightingstrategies

Recall that the criterion function (1) optimized bySVM during training requires the learner to seek atradeoff between maximizing the margin and classi-fying the training examples correctly. The relativeimportance of these two goals is determined by the Cparameter: the greater the C, the more important itwill seem to classify the training examples correctly.

In effect, this can be seen as a kind of noise handling.In Section 6, we have observed that the poor per-

formance of the classifiers obtained by SVM on asmall number of positive examples is chiefly due tothe fact that optimization leads to the maximummargin at a low cost of placing the positive exam-ples within the margin. Indeed, at the default valueC = 1 it can afford to practically ignore most of thepositive training examples when computing the widthof the margin.

The most straightforward way of forcing the SVM

17

learner to take the training examples more seriouslyis by increasing the C parameter. This can be seen asan alternative to the j parameter: instead of increas-ing the weight of errors on positive training examplesonly, C affects the weight of errors on both positiveand negative examples. The influence of the marginmaximization criterion on the learner’s choice of themodel is thus correspondingly smaller.

Our experiments show that increasing C is notmore effective than increasing j. This is illustrated bythe chart in Figure 7. If extremely few positive train-ing examples are available, increasing C has practi-cally the same effect as increasing j, whereas in allother circumstances increasing C is in fact slightlyless successful than increasing j (this can be seenby comparing the dashed and dotted lines in Fig-ure 7). Both these approaches are inferior to leavingall the weights at their default values and adjustingthe threshold b using cross-validation on the trainingset. In other words, increasing the influence of mis-classification costs above the default values causedSVM to produce hyperplanes of a slightly less suc-cessful orientation.

9 Conclusions

We have explored the use of SVM for text categoriza-tion in cases where only a small percentage of posi-tive examples are available in the training set. Ourexperiments indicate that, although the straightfor-ward cost-based approach yields an increase in per-formance, this improvement is of limited extent com-pared to the effect of modifying the threshold b. Forexample, the macroaveraged F1 increases from 0.0157to 0.1295 using the cost-based approach while di-rect modification of the threshold achieves F1 per-formance of 0.3603. The increase in performancebrought about by the cost-based approach is, in fact,due largely to threshold adjustment rather than im-provement in the orientation of the separating hyper-plane found by the SVM learner.

In the case of few positive training examples it isdifficult to infer the distribution of the scores of pos-itive test examples (Section 5.1). Thus, it seems un-likely that one can successfully make use of the prop-

erties of score distributions to modify the threshold.As the number of positive training examples increasesthis is much easier. However, by observing the classprobabilities (Section 5.2) we find that setting thescore threshold at the point where the probability ofthe positive class begins to grow beyond negligiblysmall values (e.g., at P (⊕|b) = 0.02) is more effec-tive for cases where very few positive training ex-amples are available, while placing the threshold atP (⊕|b) = 0.5 (a kind of maximum a posteriori pre-dictor) is very effective for large positive classes.

The most successful thresholding technique ex-plored in our experiments, cross-validation over thetraining data, achieves more than 70% of the bestF1 performance that could possibly be attained bywithout changing the orientation of the SVM deter-mined hyperplane. For example, with the hyperplaneobtained after learning with 11 positive training ex-amples, the best threshold could achieve a macroav-eraged F1 of 0.4724, while the threshold based oncross-validation achieves F1 = 0.3603.

Future work will extend this research to compareour results with alternative thresholding techniquesused in information retrieval and other techniquesthat deal with imbalanced class distributions, suchas discarding negative examples or training severalclassifiers and combining them with stacking. In ad-dition, it would be interesting to explore the effectsof thresholding in cases when feature selection is per-formed before training, as is commonly done in textcategorization. And since our experiments indicatethat the hyperplane found by the SVM algorithmwhen presented with very few positive training exam-ples has a reasonably good orientation but an overlyconservative threshold, it would be interesting to ex-plore whether the criterion function used by SVMcould be redefined so as to focus on looking for a goodorientation without regard for the threshold, with theunderstanding that the threshold would be selectedafterwards (and that efforts to force SVM to choosea good threshold by itself, e.g. via the j parameter,tend to lead it to a hyperplane with a slightly betterthreshold and a less useful normal). However, thismay be problematic because a threshold is necessaryin order to define the notion of a margin, which isone of the key components that made SVM such a

18

successful learning algorithm.

References

[BGMM02] Janez Brank, Marko Grobelnik, NatasaMilic-Frayling, Dunja Mladenic: Interaction offeature selection methods and linear classifica-tion models. Proc. ICML-2002 Workshop onText Learning, pp. 12–17, 2002. Longer versionavailable as Microsoft Research Technical Re-port MSR-TR-2002-63, 12 June 2002.

[CRS02] Soumen Chakrabarti, Shourya Roy, Ma-hesh V. Soundalgekar: Fast and accurate textclassification via multiple linear discriminantprojections. Proc. 28th Int. VLDB Conf., HongKong, China, August 20–23, 2002.

[CS98] Philip K. Chan, Salvatore J. Stolfo: TowardScalable Learning with Non-Uniform Class andCost Distributions: A Case Study in Credit CardFraud Detection. Proc. 4th Int. Conf. on Knowl-edge Discovery and Data Mining (KDD-98), Au-gust 27–31, 1998, New York City, New York,USA, pp. 164–168. AAAI Press.

[CV95] Corinna Cortes, Vladimir N. Vapnik:Support-vector networks. Machine Learning,20(3):273–297, September 1995.

[Elk01] Charles Elkan: The foundations of cost-sensitive learning. Proc. 17th IJCAI, Seattle,Washington, USA, August 4–10, 2001, pp. 973–978. Morgan Kaufmann.

[Jap00] Nathalie Japkowicz: Learning from Imbal-anced Data Sets: A Comparison of VariousStrategies. In Nathalie Japkowicz (ed.), Learn-ing from Imbalanced Data Sets: Papers from theAAAI Workshop (Austin, Texas, Monday, July31, 2000), AAAI Press, Technical Report WS-00-05, pp. 10–15.

[Joa99] Thorsten Joachims: Making large-Scale SVMLearning Practical.

In: B. Scholkopf, C. Burges, A. Smola (eds.),Advances in Kernel Methods — Support Vec-tor Learning. Cambridge, MA: MIT Press, 1999,pp. 169–184.

[KM97] Miroslav Kubat, Stan Matwin: Addressingthe curse of imbalanced training sets: one-sidedselection. Proc. 14th ICML, Nashville, Ten-nessee, USA, July 8–12, 1997, pp. 179–186.

[Lew91] David D. Lewis: Representation and Learn-ing in Information Retrieval. PhD thesis (andTech. Rept. UM-CS-1991-093), CS Dept., Univ.of Massachusetts, Amherst, MA, USA, 1992.

[Pla99] John Platt: Probabilistic outputs for supportvector machines and comparisons to regularizedlikelihood methods. In: A. Smola, P. Bartlett, B.Scholkopf, D. Schuurmans (eds.): Advances inlarge margin classifiers. Cambridge, MA: MITPress, 1999, pp. 61–74.

[NGL97] Hwee Tou Ng, Wei Boon Goh, Kok LeongLow: Feature Selection, Perceptron Learning,and a Usability Case Study for Text Catego-rization. Proceedings of the 20th InternationalACM SIGIR Conference on Research and De-velopment in Information Retrieval (pp. 67–73).Philadelphia, PA, USA, July 27–31, 1997.

[Reu00] Reuters: Reuters Corpus, Volume 1: En-glish Language, 1996-08-20 to 1997-08-19.Available from http://about.reuters.com/researchandstandards/corpus/

[Yang01] Yiming Yang: A study on thresholdingstrategies for text categorization. Proc. 24th An-nual International ACM SIGIR Conference onResearch and Development in Information Re-trieval, New Orleans, Louisiana, USA, Septem-ber 9–13, 2001, pp. 137–145.

[WP01] Gary M. Weiss, Foster Provost: The effectof class distribution on classifier learning: Anempirical study. Tech. Rept. ML-TR-44, Dept.of CS, Rutgers University. August 2, 2001.

19

Appendix: per-category charts

These charts are like those in figures 2 and 3, except that they are given here for all 16 categories ratherthan for just two.

20

0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

232)

Threshold value (b)





0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

29)

Threshold value (b)





0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

292)

Threshold value (b)





0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

24)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

351)

Threshold value (b)

Category E121, j = 1, positive examples: 11 training (natural: 1444), 738 test




0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

29)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

333)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

504)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

259)

Threshold value (b)





0

1

10

100

1000

10000

100000

0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

264)

Threshold value (b)

Category GHEA, j = 1, positive examples: 11 training (natural: 2616), 2067 test




0

1

10

100

1000

10000

100000

0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

457)

Threshold value (b)

Category GOBIT, j = 1, positive examples: 11 training (natural: 551), 293 test




0

1

10

100

1000

10000

100000

0 1b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

279)

Threshold value (b)

Category GODD, j = 1, positive examples: 11 training (natural: 1642), 1004 test




0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

309)

Threshold value (b)

Category GPOL, j = 1, positive examples: 11 training (natural: 9867), 20228 test




0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

365)

Threshold value (b)

Category GSPO, j = 1, positive examples: 11 training (natural: 1931), 12441 test




0

1

10

100

1000

10000

100000

-1 0 1 b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

314)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

391)

Threshold value (b)





These charts show the number of training/test posi-tive/negative examples found at a certain distance of theplane. The planes have been trained using 11 positivetraining examples. There is one chart for each category.The vertical lines show the original threshold obtained bythe SVM, as well as the threshold that gives the best F1

performance on the test set.

21

0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

462)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

887)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

938)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

759)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

891)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

04)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

05)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

937)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

856)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

749)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

866)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

514)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

768)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

05)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

03)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

891)

Threshold value (b)







22

0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

13)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

56)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6 7b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

77)

Threshold value (b)





0

1

10

100

1000

10000

100000

-3 -2 -1 0 1 2 3 4 5 6b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

7)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

43)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6 7b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

84)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

28)

Threshold value (b)





0

1

10

100

1000

10000

100000

-1 0 1 2 3 4b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.0

951)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

5)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

31)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

12)

Threshold value (b)





0

1

10

100

1000

10000

100000

-3 -2 -1 0 1 2 3 4b0 b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

31)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

54)

Threshold value (b)





0

1

10

100

1000

10000

100000

-3 -2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

56)

Threshold value (b)





0

1

10

100

1000

10000

100000

-3 -2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

73)

Threshold value (b)





0

1

10

100

1000

10000

100000

-2 -1 0 1 2 3 4 5 6b0b*

Num

ber o

f exa

mpl

es(in

a b

ucke

t of w

idth

0.1

47)

Threshold value (b)







23

0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.069

6)

Threshold value (b)




0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.087

1)

Threshold value (b)




0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.087

7)

Threshold value (b)




0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.071

9)

Threshold value (b)




0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.105

)

Threshold value (b)




0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.086

9)

Threshold value (b)




0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.099

9)

Threshold value (b)




0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.151

)

Threshold value (b)




0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.077

8)

Threshold value (b)




0

0.5

1

0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.079

3)

Threshold value (b)




0

0.5

1

0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.137

)

Threshold value (b)




0

0.5

1

0 1b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.083

7)

Threshold value (b)




0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.092

7)

Threshold value (b)




0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.11)

Threshold value (b)




0

0.5

1

-1 0 1 b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.094

2)

Threshold value (b)




0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.117

)

Threshold value (b)




These charts show the proportion of positive examplesamong all examples that are found at a certain distancefrom the plane. The planes have been trained using 11positive training examples. There is one chart for eachcategory. The vertical lines show the original thresholdobtained by the SVM, as well as the threshold that givesthe best F1 performance on the test set.

24

0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.139

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.266

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.281

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.228

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.267

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.311

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.316

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.281

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.257

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.225

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.26)

Threshold value (b)




0

0.5

1

-1 0 1 2b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.154

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.23)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.315

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.31)

Threshold value (b)




0

0.5

1

-1 0 1 2 3b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.267

)

Threshold value (b)





25

0

0.5

1

-2 -1 0 1 2 3 4b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.339

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.468

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6 7b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.53)

Threshold value (b)




0

0.5

1

-3 -2 -1 0 1 2 3 4 5 6b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.509

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.43)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6 7b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.551

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.384

)

Threshold value (b)




0

0.5

1

-1 0 1 2 3 4b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.285

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.45)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.392

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.335

)

Threshold value (b)




0

0.5

1

-3 -2 -1 0 1 2 3 4b0 b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.392

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.461

)

Threshold value (b)




0

0.5

1

-3 -2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.467

)

Threshold value (b)




0

0.5

1

-3 -2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.519

)

Threshold value (b)




0

0.5

1

-2 -1 0 1 2 3 4 5 6b0b*

Rel

ativ

e fre

quen

cy o

f pos

itive

doc

umen

ts (i

n a

buck

et o

f wid

th 0

.44)

Threshold value (b)





26