
P. Perner (Ed.): MLDM 2011, LNAI 6871, pp. 46–59, 2011. © Springer-Verlag Berlin Heidelberg 2011

Smoothing Multinomial Naïve Bayes in the Presence of Imbalance

Alexander Y. Liu and Cheryl E. Martin

Applied Research Laboratories, The University of Texas at Austin,

P.O. Box 8029, Austin, Texas 78713-8029

{aliu,cmartin}@arlut.utexas.edu

Abstract. Multinomial naïve Bayes is a popular classifier used for a wide variety of applications. When applied to text classification, this classifier requires some form of smoothing when estimating parameters. Typically, Laplace smoothing is used, and researchers have proposed several other successful forms of smoothing. In this paper, we show that common preprocessing techniques for text categorization have detrimental effects when using several of these well-known smoothing methods. We also introduce a new form of smoothing for which these detrimental effects are less severe: ROSE smoothing, which can be derived from methods for cost-sensitive learning and imbalanced datasets. We show empirically on text data that ROSE smoothing performs well compared to known methods of smoothing, and is the only method tested that performs well regardless of the type of text preprocessing used. It is particularly effective compared to existing methods when the data is imbalanced.

Keywords: text classification, multinomial naïve Bayes, smoothing, imbalanced dataset, preprocessing.

1 Introduction

Multinomial naïve Bayes [1] is a popular classifier for text classification because of its simplicity, speed, and good performance. The classifier estimates the conditional probability that the $i$th document $d_i$ belongs to class $y_j$ via Bayes rule:

$P(y_j \mid d_i) = \frac{P(y_j)\,P(d_i \mid y_j)}{P(d_i)}$.

Multinomial naïve Bayes uses a multinomial model to estimate $P(d_i \mid y_j)$. In particular,

$P(d_i \mid y_j) \propto \prod_{k=1}^{V} P(w_k \mid y_j)^{x_{ik}}$,

where $V$ is the number of features, $x_{ik}$ is the number of times the $k$th feature occurs in the $i$th document $d_i$, and $P(w_k \mid y_j)$ is the probability of the $k$th feature occurring given class $y_j$. In text classification, if the features used are from a typical bag-of-words model, $P(w_k \mid y_j)$ corresponds to the conditional probability that the $k$th word in the vocabulary occurs in a document, given that the document belongs to class $y_j$.
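To make the decision rule concrete, the following minimal sketch (ours, not code from [1]) scores a document under a multinomial naïve Bayes model in log space; the priors and conditional probabilities are assumed to have been estimated already.

```python
import numpy as np

def predict_class(x, log_priors, log_cond_probs):
    """Score one document under multinomial naive Bayes.

    x              -- length-V array of word counts for one document (x_ik)
    log_priors     -- length-C array of log P(y_j)
    log_cond_probs -- C x V array of log P(w_k | y_j)
    Returns the index of the highest-scoring class.
    """
    # log P(y_j | d_i) is proportional to log P(y_j) + sum_k x_ik * log P(w_k | y_j)
    scores = log_priors + log_cond_probs @ x
    return int(np.argmax(scores))
```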


In text classification, the conditional feature probabilities are typically estimated using Laplace smoothing as follows:

$P(w_k \mid y_j) = \frac{\alpha + \sum_{i=1}^{N_j} x_{ik}}{V\alpha + \sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}}$,   (1)

where $N_j$ is the number of data points in the training set from class $y_j$ (the sums run over those documents), $V$ is the number of features, and $\alpha$ is a parameter known as the Laplace smoothing constant. $\alpha$ is typically set to 1.

A non-zero value for $\alpha$ prevents $P(w_k \mid y_j)$ from equaling zero in certain degenerate cases. In particular, if a feature does not occur in any document in the training set, $P(d_i \mid y_j)$ for all documents in the test set that contain this feature will be zero for all classes $y_j$, causing multinomial naïve Bayes to lose all discriminative power. Rarely occurring features may also be problematic if smoothing is not performed. For example, a rare feature that occurs in some classes in the training set but does not occur in class $y_j$ will dominate probability estimates, since it will force $P(d_i \mid y_j)$ to be zero regardless of the values of the remaining features. This problem can be exacerbated by data scarcity and class imbalance, since there may be insufficient numbers of these rarely occurring features to ensure non-zero probability estimates.
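As an illustration, here is a minimal sketch of Eq. (1) for a single class, assuming that class's training documents are given as a dense count matrix (variable names are ours):

```python
import numpy as np

def laplace_smoothed_probs(X_class, alpha=1.0):
    """Estimate P(w_k | y_j) for one class with Laplace smoothing (Eq. 1).

    X_class -- N_j x V matrix of word counts for the training documents in class y_j
    alpha   -- Laplace smoothing constant (typically 1)
    """
    counts = X_class.sum(axis=0)              # sum_i x_ik for each feature k
    V = X_class.shape[1]                      # number of features
    return (alpha + counts) / (V * alpha + counts.sum())

# With alpha > 0, a feature that never occurs in this class still receives a small
# non-zero probability, so a single unseen word cannot zero out P(d_i | y_j).
```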

Thus, some form of smoothing is necessary to prevent cases where missing or rarely occurring features inappropriately dominate the probability estimates in multinomial naïve Bayes. Laplace smoothing is arguably the most prevalent form of smoothing. However, several researchers have shown that other forms of smoothing are useful. In this paper, we introduce a new form of smoothing: Random OverSampling Expected (ROSE) smoothing, which can be derived from methods used in cost-sensitive learning and imbalanced datasets. We show that, empirically, the new approach performs well against existing forms of smoothing.

Moreover, in this paper, we show that different forms of smoothing interact in unexpectedly different ways with several common preprocessing techniques for text. The performance of a particular smoothing approach can be impacted significantly by the type of preprocessing used to create features. In particular, many existing smoothing methods can react poorly to normalizing feature vectors, while the proposed approach tends to react less poorly.

2 Related Work

Several different forms of smoothing have been devised for the multinomial naïve Bayes classifier. As mentioned, the most popular method is Laplace smoothing. In this section, we describe four other smoothing methods that have been proposed in past research. Three have been previously benchmarked in [2]: absolute discounting, linear discounting, and Witten-Bell smoothing. In particular, we use the equations for these methods as described in [2], repeated here in terms of the notation used in this paper. The fourth is given in [3] and will be referred to as "Frank06" in this paper.

The following notation will be used to define these methods: let $V_0$ be the number of features in the training set in class $y_j$ that occur zero times, $V_1$ be the number of features in the training set in class $y_j$ that occur exactly once, and $V_2$ be the number of features in the training set in class $y_j$ that occur exactly twice. As above, $V$ is the number of features in the dataset. Finally, let $F_j = \sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}$, the sum of all feature values of all documents in class $y_j$.

Absolute discounting is defined as follows:

$P(w_k \mid y_j) = \begin{cases} \dfrac{\sum_{i=1}^{N_j} x_{ik} - b}{F_j} & \text{if } \sum_{i=1}^{N_j} x_{ik} > 0 \\[1ex] \dfrac{b\,(V - V_0)}{V_0\,F_j} & \text{if } \sum_{i=1}^{N_j} x_{ik} = 0 \end{cases}$,   (2)

where $b = \dfrac{V_1}{V_1 + 2 V_2}$.

Witten-Bell smoothing is given by the following:

$P(w_k \mid y_j) = \begin{cases} \dfrac{\sum_{i=1}^{N_j} x_{ik}}{F_j + V - V_0} & \text{if } \sum_{i=1}^{N_j} x_{ik} > 0 \\[1ex] \dfrac{V - V_0}{V_0\,(F_j + V - V_0)} & \text{if } \sum_{i=1}^{N_j} x_{ik} = 0 \end{cases}$.   (3)

Linear discounting is defined as:

$P(w_k \mid y_j) = \begin{cases} \left(1 - \dfrac{V_1}{F_j}\right) \dfrac{\sum_{i=1}^{N_j} x_{ik}}{F_j} & \text{if } \sum_{i=1}^{N_j} x_{ik} > 0 \\[1ex] \dfrac{V_1}{V_0\,F_j} & \text{if } \sum_{i=1}^{N_j} x_{ik} = 0 \end{cases}$.   (4)

Finally, Frank06 smoothing is defined as:

$P(w_k \mid y_j) = \dfrac{1 + \mu \dfrac{\sum_{i=1}^{N_j} x_{ik}}{F_j}}{V + \mu}$.   (5)

The Frank06 approach is comparatively recent and can be written in the same form as Laplace smoothing, with $\alpha = F_j / \mu$. Frank and Bouckaert [3] show that $\mu = 1$ works well on a number of text datasets. In particular, this approach is designed for imbalanced datasets, and [3] contains a good discussion of the effect of imbalance on Laplace smoothing.
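For reference, the sketch below implements Eqs. (2)-(5) as reconstructed above. Since the exact formulations in [2] and [3] may differ in detail, treat it as illustrative rather than a faithful reimplementation of those papers; degenerate cases (e.g., $V_0 = 0$ or $V_1 + 2V_2 = 0$) are not handled.

```python
import numpy as np

def class_stats(X_class):
    """Per-class quantities used by Eqs. (2)-(5)."""
    counts = X_class.sum(axis=0)          # per-feature counts in class y_j
    F = counts.sum()                      # total feature count in the class (F_j)
    V = counts.size                       # vocabulary size
    V0 = int((counts == 0).sum())         # features occurring zero times
    V1 = int((counts == 1).sum())         # features occurring exactly once
    V2 = int((counts == 2).sum())         # features occurring exactly twice
    return counts, F, V, V0, V1, V2

def absolute_discounting(X_class):
    counts, F, V, V0, V1, V2 = class_stats(X_class)
    b = V1 / (V1 + 2.0 * V2)                        # discount, Eq. (2)
    seen = (counts - b) / F
    unseen = b * (V - V0) / (V0 * F)
    return np.where(counts > 0, seen, unseen)

def witten_bell(X_class):
    counts, F, V, V0, V1, V2 = class_stats(X_class)
    seen = counts / (F + V - V0)                    # Eq. (3)
    unseen = (V - V0) / (V0 * (F + V - V0))
    return np.where(counts > 0, seen, unseen)

def linear_discounting(X_class):
    counts, F, V, V0, V1, V2 = class_stats(X_class)
    seen = (1.0 - V1 / F) * counts / F              # Eq. (4)
    unseen = V1 / (V0 * F)
    return np.where(counts > 0, seen, unseen)

def frank06(X_class, mu=1.0):
    counts, F, V, V0, V1, V2 = class_stats(X_class)
    # Eq. (5): equivalent to Laplace smoothing applied to counts normalised
    # so that each class's total count equals mu.
    return (1.0 + mu * counts / F) / (V + mu)
```

A useful sanity check on each reconstruction is that the returned vector sums to one, i.e., it forms a proper conditional distribution over the vocabulary.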

This paper is not intended to offer an exhaustive comparison of all smoothing methods. Several other forms of smoothing exist in addition to the methods described (e.g., see [4] for a list of smoothing methods tested on n-grams), and existing studies such as [2] already provide benchmarks for various smoothing methods. Instead, this paper describes how the performance of common smoothing methods can be differentially impacted by preprocessing approaches commonly used for text, and it presents a new smoothing technique that is robust to preprocessing choices.

3 Random OverSampling Expected Smoothing

In this paper, we present a method of smoothing called Random OverSampling Expected (ROSE) smoothing. ROSE smoothing is derived from random oversampling, a method that can be used to perform cost-sensitive multinomial naïve Bayes classification and that can handle imbalanced class priors. In particular, ROSE smoothing allows for many of the advantages of random oversampling without the additional overhead (in terms of computation and memory) needed when performing random oversampling. Unlike many other smoothing methods, ROSE smoothing automatically learns a separate smoothing parameter for each feature for each class. Below we discuss the use of random oversampling for multinomial naïve Bayes, introduce ROSE smoothing, and discuss the connection of ROSE smoothing to random oversampling.

3.1 ROSE Smoothing Background

The use of resampling to address imbalanced datasets has been well studied. In short, it has been shown that, when one class has a prior probability that is much smaller than that of the other class, classifiers will learn models that are not very useful at distinguishing between classes (e.g., [5, 6]). Resampling can effectively address the imbalanced dataset problem; it is a class of techniques in which data points are added to the class with the smaller prior (oversampling) or removed from the class with the larger prior (undersampling).

Random oversampling, the duplication of randomly selected points from the minority class, is one method of adjusting class priors. However, in addition to adjusting class priors, random oversampling will also adjust the term $P(w_k \mid y_j)$ when used with multinomial naïve Bayes. It turns out that it is beneficial to use the changed estimates of $P(w_k \mid y_j)$ after random oversampling, even if one uses the imbalanced priors present in the dataset before random oversampling.

If one were to randomly oversample and use Laplace smoothing to estimate $P(w_k \mid y_j)$, then, in expectation,

$P(w_k \mid y_j) = \frac{\alpha + \beta_k + \sum_{i=1}^{N_j} x_{ik}}{V\alpha + \sum_{l=1}^{V} \beta_l + \sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}}$,   (6)

where $\alpha = 1$ is the Laplace smoothing constant. The value $\beta_k$ is the expected number of times $w_k$ occurs in the resampled documents, and is discussed in more detail below. As before, $\sum_{i=1}^{N_j} x_{ik}$ is the number of times $w_k$ occurs in the training set in class $y_j$ (before resampling).


Cost-sensitive learning is another approach that can work well on imbalanced data. For multinomial naïve Bayes, one method of cost-sensitive learning is equivalent to artificially adjusting class priors [7]. Random oversampling can be used as a classifier-agnostic method of changing class priors, although, as noted, random oversampling will change more than just the class priors.

Note that a longer version of the above discussion (including derivations) can be found in Liu et al. [7], where we examined the relationship of resampling and cost-sensitive versions of multinomial naïve Bayes in depth. In that paper, we were primarily concerned with analyzing the effect of various oversampling methods on naïve Bayes and SVMs. In the current paper, we empirically show that the effect of random oversampling can be leveraged directly (without the need to perform resampling) as a form of smoothing.

3.2 ROSE Smoothing Approach

ROSE smoothing entails using the expected value of $P(w_k \mid y_j)$ after random oversampling directly as an estimate for $P(w_k \mid y_j)$. In essence, this is a form of smoothing wherein the smoothing parameter is equal to $\beta_k$ and is determined primarily from the data itself.

If random oversampling is applied, $\beta_k$ is affected by the amount of resampling. For example, if one were to randomly duplicate $s$ documents in class $y_j$, then, in expectation, the number of times that feature $w_k$ will occur in those documents is

$\beta_k = s \cdot \bar{n} \cdot \frac{\sum_{i=1}^{N_j} x_{ik}}{\sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}}$,

where $\bar{n}$ is the average number of words per document. Intuitively, $\beta_k$ is the expected number of times $w_k$ occurs in the resampled documents, since $s\bar{n}$ is the expected number of words in all resampled documents and $\frac{\sum_{i=1}^{N_j} x_{ik}}{\sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}}$ is the fraction of feature occurrences in class $y_j$ that are equal to $w_k$.

When using ROSE smoothing, one can use the parameter $\beta_k$ for each feature determined by calculating the expected value of $\beta_k$ if one were to actually perform resampling (i.e., by calculating $\beta_k = s \cdot \bar{n} \cdot \frac{\sum_{i=1}^{N_j} x_{ik}}{\sum_{l=1}^{V} \sum_{i=1}^{N_j} x_{il}}$). This calculation shows that ROSE smoothing essentially creates a smoothing parameter custom tailored for each feature for each class. This is in contrast to smoothing techniques, such as Laplace smoothing, which simply use a single smoothing parameter regardless of the values of features for a given class and for all classes.

In this paper, we use the expected value of $\beta_k$ that arises if one were to resample until all class priors were equal. As an alternative, $s$ can potentially be tuned for each class in future experiments. It has been shown (in Weiss, McCarthy and Zabar [8], among others) that, when resampling, it is not always best to balance the priors. Thus, tuning $s$ (and therefore $\beta_k$) could potentially produce even better results than those obtained in this paper. However, as shown in the next section, setting $s$ without tuning works well enough to outperform existing smoothing techniques.


Finally, in the imbalanced dataset problem, it is customary to resample only data points from the minority class. Thus, in our experiments, $s$ is set for the minority class such that the effective number of documents if one were to resample would be equal in the minority and majority classes, and $s$ is set to zero for the majority class.
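The following is a minimal sketch of ROSE smoothing as described above, under the assumptions that $s$ is chosen to balance the class priors and that $\alpha = 1$ inside Eq. (6); it is our illustration, not the authors' reference code.

```python
import numpy as np

def rose_smoothed_probs(X_min, n_majority, alpha=1.0):
    """ROSE-smoothed estimate of P(w_k | y_minority), combining Eq. (6) with Sec. 3.2.

    X_min      -- N_j x V word-count matrix for the minority-class training documents
    n_majority -- number of majority-class training documents
    alpha      -- Laplace constant used inside Eq. (6), typically 1
    """
    counts = X_min.sum(axis=0)                 # sum_i x_ik per feature
    total = counts.sum()                       # total word count in the minority class
    n_min, V = X_min.shape

    s = max(n_majority - n_min, 0)             # documents to duplicate to balance priors
    avg_words = total / n_min                  # average words per minority document
    beta = s * avg_words * counts / total      # expected extra counts, one beta_k per feature

    # Eq. (6): Laplace smoothing applied to the expected post-oversampling counts
    return (alpha + beta + counts) / (V * alpha + beta.sum() + total)

# For the majority class, s = 0, so beta is all zeros and the estimate reduces
# to ordinary Laplace smoothing.
```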

4 Experiments

We run two sets of experiments using various preprocessing methods: experiments on naturally imbalanced datasets and experiments on data artificially manipulated to be imbalanced. The purpose of these experiments is to empirically examine the variation in performance of smoothing methods across various combinations of common preprocessing techniques for text classification and across various levels of imbalance.

We use L2 normalization and TF-IDF as the two possible preprocessing techniques in our experiments. TF-IDF reweights features such that the value of the $k$th feature in document $d_i$ is proportional to the number of times that feature occurs in $d_i$ and is inversely proportional to the number of documents in which the $k$th feature occurs. The motivation is to reduce the influence of a feature if that feature occurs in a large number of documents in the corpus. L2 normalization normalizes the feature vector for document $d_i$ using the L2 norm (i.e., for each feature $k$, replace $x_{ik}$ with $x_{ik} / \sqrt{\sum_{l=1}^{V} x_{il}^2}$). The motivation for performing L2 normalization is to reduce a possible confounding factor arising from document length, such that documents with similar distributions of words will have similar feature values, regardless of whether documents contain few or many words.
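One common way to apply these two preprocessing steps is sketched below using scikit-learn; note that scikit-learn's TF-IDF variant may differ in detail from the weighting used in the paper, so this is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog chased the cat"]
X_counts = CountVectorizer().fit_transform(docs)      # raw bag-of-words counts

# TF-IDF weighting only (no length normalisation)
X_tfidf = TfidfTransformer(norm=None, use_idf=True).fit_transform(X_counts)

# L2 normalisation only (each document vector rescaled to unit length)
X_l2 = normalize(X_counts, norm="l2")

# Both: TF-IDF weighting followed by L2 normalisation
X_both = TfidfTransformer(norm="l2", use_idf=True).fit_transform(X_counts)
```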

We test the following smoothing methods:

• Laplace smoothing
• Absolute discounting
• Witten-Bell smoothing
• Linear discounting
• Frank06
• ROSE smoothing

Note that several of these methods (including our own and Laplace smoothing) have parameters that can potentially be tuned. However, it is not common to tune the Laplace smoothing constant using cross-validation, so we do not tune any of the above methods using cross-validation in our experiments.

4.1 Experiment 1: Standard Datasets

In the first set of experiments, we apply different forms of smoothing to seven datasets that have naturally occurring imbalanced class priors: hitech, k1b, la12, ohscal, reviews, sports, and a subset of the Enron e-mail corpus. These datasets are drawn from several different sources, namely:

• la12 consists of news from the LA Times
• hitech, reviews, and sports contain news from the San Jose Mercury
• ohscal contains text related to medicine
• k1b contains documents from the Yahoo! subject hierarchy
• Enron e-mail data consists of e-mails labeled with work/non-work class labels

The six datasets other than the Enron e-mail data are included in the CLUTO toolkit [9]1. For each dataset, we chose the smallest class within that dataset as one class and aggregated all other classes to create the class with the larger prior. The prior of the minority class varies from around 2.5% to 15%.

We use fifty percent of the data for training (selected using stratified random sampling) and the remainder as a test set. Results are averaged over ten independent runs of the experiment. We present both the micro-averaged and macro-averaged f1-measure, where the average is taken over all datasets and all runs of the experiment. Finally, as discussed earlier, we vary whether TF-IDF weighting is used and whether the document vectors are normalized using the L2 norm, resulting in four possible preprocessing combinations. As mentioned, the goal of this set of experiments is to examine the relative performance of different smoothing methods on naturally imbalanced data, particularly when combined with different forms of preprocessing.
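The evaluation protocol can be sketched as follows with scikit-learn. Note that scikit-learn's MultinomialNB only provides Laplace/Lidstone smoothing, so this illustrates the protocol with the Laplace baseline; the other smoothing methods would require a custom estimator. Dataset loading is omitted, and the variables X and y are assumed to hold the preprocessed feature matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

def run_once(X, y, seed):
    # 50/50 stratified split, as described above
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    pred = MultinomialNB(alpha=1.0).fit(X_tr, y_tr).predict(X_te)
    return (f1_score(y_te, pred, average="micro"),
            f1_score(y_te, pred, average="macro"))

# Average over ten independent runs:
# micro, macro = zip(*(run_once(X, y, s) for s in range(10)))
# print(np.mean(micro), np.mean(macro))
```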

Results for the first set of experiments are presented in Fig. 1 and Fig. 2 for micro and macro-averaged f1-measure, respectively. For micro-averaged f1-measure (Fig. 1), ROSE smoothing works well regardless of what preprocessing techniques are used, and is the only approach that has either the best micro-averaged f1-measure or close to the best f1-measure regardless of preprocessing. Absolute discounting is the baseline that results in the highest f1-measure (when TF-IDF weighting is used), but performs very poorly when L2 normalization is used.

Similar trends for macro-averaged f1-measure can be seen in Fig. 2. However, the detrimental effect of L2 normalization on most smoothing methods is much clearer. TF-IDF also degrades performance when combined with four of the six tested smoothing methods (although this degradation is usually slight) when L2 normalization is not performed. Both absolute discounting and Witten-Bell smoothing work very well from the perspective of macro-averaged f1-measure, but only if L2 normalization is not performed. Once again, ROSE smoothing is the only approach that both performs well and is robust to the choice of preprocessing methods.

Finally, Laplace smoothing is outperformed by most competing methods for both micro-averaged and macro-averaged f1-measure in terms of best possible f1-measure. An unexpected result is that, on average, it is best not to perform any preprocessing when using Laplace smoothing. While this is not always true (e.g., in the next set of experiments), we have observed in practice that the best set of preprocessing techniques is typically dependent on the classifier being used (and choice of smoothing if using multinomial naïve Bayes) as well as the dataset.

1 We use the version of the data available at http://www.ideal.ece.utexas.edu/data/docdata.tar.gz


Fig. 1. Micro-averaged F1-Measure for Experiment 1 Datasets

Fig. 2. Macro-averaged F1-Measure for Experiment 1 Datasets

(Each figure plots the f1-measure of the six smoothing methods under the four preprocessing conditions: no L2/no TF-IDF, no L2/TF-IDF, L2/no TF-IDF, and L2/TF-IDF.)


4.2 Experiment 2: Class Prior Controlled Data Sets

In the second set of experiments, we more closely examine the effect of imbalance on the relative performance of the smoothing methods and preprocessing techniques. We create three two-class problems by taking three pairs of classes from the 20-newsgroup dataset 2: alt.atheism versus comp.graphics, rec.autos versus sci.space, and rec.sport.baseball versus rec.sport.hockey. These datasets are naturally balanced (1000 data points in each class). In our experiments, we modify each dataset by removing data points from one class until a certain ratio of class priors is achieved. We present results where the minority class prior is equal to 0.1, 0.2, 0.3, 0.4, and 0.5. As in experiment one, the presented results are averaged over ten independent runs of the experiment, and TF-IDF weighting and L2 normalization are experimental controls. Only macro-averaged f1-measure is discussed for this set of experiments, since results for micro-averaged f1-measure are similar (Figs. 8 and 9 include example results for micro-averaged f1-measure for reference).
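A sketch of how one class can be subsampled to reach a target minority prior is shown below; the helper function and labels are ours, for illustration only.

```python
import numpy as np

def subsample_to_prior(X, y, minority_label, target_prior, rng):
    """Remove minority-class points until that class has the target prior."""
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # n_min / (n_min + n_maj) = target_prior  =>  n_min = target * n_maj / (1 - target)
    n_keep = int(round(target_prior * len(maj_idx) / (1.0 - target_prior)))
    keep = rng.choice(min_idx, size=min(n_keep, len(min_idx)), replace=False)
    idx = np.concatenate([keep, maj_idx])
    return X[idx], y[idx]

# Example (hypothetical labels): make class 0 the minority class with prior 0.1
# X_imb, y_imb = subsample_to_prior(X, y, minority_label=0, target_prior=0.1,
#                                   rng=np.random.default_rng(0))
```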

When controlling for imbalance (Figs. 3-7), the best f1-measure is obtained by using ROSE smoothing with no special preprocessing for the most imbalanced case (Fig. 3), although both absolute discounting and Witten-Bell smoothing are competitive. As the amount of imbalance decreases, the difference in smoothing methods and the differential effects of preprocessing choices tend to decrease as well. For these experiments, Witten-Bell and Linear discounting both work fairly well, and are more competitive with ROSE smoothing and absolute discounting in terms of highest f1-measure achievable than in the previous experiments. However, it is still true regardless of imbalance that absolute discounting reacts very poorly to L2 normalization. In addition, Laplace smoothing performs very poorly if L2 normalization is applied, although, unlike absolute discounting, this problem is exacerbated by class imbalance.

Frank06 does surprisingly poorly in our experiments. In [3], the authors only compare Frank06 against Laplace smoothing when both L2 normalization and TF-IDF weighting are used, using AUC as the evaluation metric. In our results, Laplace smoothing and Frank06 perform competitively when both L2 normalization and TF-IDF are used, but both are outperformed by many other smoothing methods. Results in [3] also indicate that using different values for the parameter $\mu$ in Frank06 can change the quality of results. In particular, using $\mu = 1$ worked best for 3 of the 4 datasets used in [3], but using a different value for $\mu$ (namely, the minimum value of $F_j$ across all classes) performed best on the 20 Newsgroups dataset. In preliminary experiments, we tried both methods of setting $\mu$ and found that $\mu = 1$ gave the best results for both experiments 1 and 2. Tuning $\mu$ via cross-validation may be useful, but, as mentioned, for the sake of a fair comparison, none of the smoothing methods were tuned using cross-validation in our experiments.

2 http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups


Fig. 3. Macro-averaged F1: Minority Class Prior = 0.1

Fig. 4. Macro-averaged F1: Minority Class Prior = 0.2



Fig. 5. Macro-averaged F1: Minority Class Prior = 0.3

Fig. 6. Macro-averaged F1: Minority Class Prior = 0.4



Fig. 7. Macro-averaged F1: Minority Class Prior = 0.5

Fig. 8. Micro-averaged F1: Minority Class Prior = 0.1



Fig. 9. Micro-averaged F1: Minority Class Prior = 0.5

5 Conclusion

In this paper, we introduce ROSE smoothing—a new form of smoothing for multinomial naïve Bayes models. ROSE smoothing, which can be derived from the effects of random oversampling on multinomial naïve Bayes, performs well on imbalanced datasets and is relatively robust to choice of preprocessing methods compared to other existing smoothing methods. When data is not imbalanced, the differences among many of the most competitive smoothing methods, including our proposed method, are less severe.

Laplace smoothing, perhaps the most common form of smoothing used with multinomial naïve Bayes, is outperformed by many of the tested smoothing methods for imbalanced data. This is not a new result, and our experiments support known results that show that Laplace smoothing is often outperformed by competing methods. A new insight provided by this paper is the adverse effect of common preprocessing methods such as L2 normalization and TF-IDF on Laplace smoothing and other smoothing approaches. While the Laplace smoothing constant can be adjusted to improve performance (especially after L2 normalization occurs), users who are new to text classification may be unaware that such an adjustment of a default parameter setting needs to be performed under these conditions. Moreover, in some systems, the Laplace smoothing constant is hard-coded to equal “1” and cannot be adjusted. Our results also show that even other smoothing methods that outperform Laplace smoothing can be sensitive to choice of preprocessing.



The proposed ROSE smoothing method is more robust to choice of preprocessing method than many existing smoothing methods and learns a separate smoothing parameter for each feature for each class. While other existing smoothing methods can also perform well, ROSE smoothing outperforms or is competitive with all other smoothing methods benchmarked, regardless of dataset or what preprocessing is used. Thus, ROSE smoothing could be used to combat errors of novice users in software systems designed for those who are not experts in machine learning and are unsure of how to best preprocess the data and tune classifier parameters. The impact will be greater for applications with larger class imbalances.

References

1. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: The AAAI 1998 Workshop on Learning for Text Categorization, pp. 41–48. AAAI Press, Menlo Park (1998)

2. He, F., Ding, X.: Improving Naive Bayes Text Classifier Using Smoothing Methods. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 703–707. Springer, Heidelberg (2007)

3. Frank, E., Bouckaert, R.R.: Naive Bayes for Text Classification with Unbalanced Classes. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 503–510. Springer, Heidelberg (2006)

4. Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. In: The 34th Annual Meeting of the Association for Computational Linguistics (1996)

5. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6, 429–449 (2002)

6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

7. Liu, A., Martin, C., La Cour, B., Ghosh, J.: Effects of oversampling versus cost-sensitive learning for Bayesian and SVM classifiers. Data Mining: Special Issue in Annals of Information Systems 8, 159–192 (2010)

8. Weiss, G.M., McCarthy, K., Zabar, B.: Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? In: The 2007 International Conference on Data Mining, DMIN 2007 (2007)

9. Karypis, G.: CLUTO - A Clustering Toolkit. TR 02-017, University of Minnesota, Department of Computer Science and Engineering (2002)