

Feature Selection “Tomography” — Illustrating that Optimal Feature Filtering is Hopelessly Ungeneralizable

George Forman

HP Laboratories
HPL-2010-19R1

Keyword(s): Feature selection; text categorization; visualization; machine learning classification


External Posting Date: February 6, 2012 [Fulltext]  Approved for External Publication
Internal Posting Date: February 6, 2012 [Fulltext]

Copyright 2012 Hewlett-Packard Development Company, L.P.


Feature Selection “Tomography” — Illustrating that Optimal Feature Filtering is Hopelessly Ungeneralizable

George Forman*

Abstract Feature filtering methods are often used in text classification and other high-dimensional domains to quickly score each feature independently and pass only the best to the learning algorithm. The panoply of available methods grows over the years, with frequent research publications touting new functions that seem to yield superior learning: variants on Information Gain, Chi Squared, Mutual Information, and others. This slow generate-and-test search in the literature is usually counted as progress towards finding superior filter methods. But this is illusory.

We provide a new empirical method to reveal the feature preference surface for a given dataset and classifier: cross-validating with an additional feature whose noise characteristics are controlled. By evaluating for every sensitivity and specificity level—though computationally expensive—we determine the expected benefit of adding a feature with a given characteristic. We desire features with high expected gain.

Ideally this preference surface would be easily generalizable across many datasets and depend on just a few parameters. However, by visualizing the preference surface under different conditions while holding other factors constant, we demonstrate graphically that it depends on more factors than are ever considered in feature selection papers: training set size and class distribution, classifier model and parameters, and—even holding all of these constant—the individual target concept itself. Thus, the ongoing sequence of published papers in this area will not yield an optimal feature selection function of significant generality.

Keywords Feature selection, visualization, machine learning.

1 Introduction

Text classification technology is increasingly being applied to documents and text snippets in order to recognize key topics, genres, or sentiments. Feature selection is often used to select a subset of the words or other high-dimensional features for input to the (often slow) learning algorithm, in order to improve its accuracy or scalability. Feature filtering methods, which are highly scalable and parallelizable, independently score each feature in a single scan of the training dataset and select the top scoring features. Each score is determined by a feature selection function chosen in advance, such as Information Gain or Mutual Information.

* HP Labs, Palo Alto, CA, USA. [email protected]

Over the past decade there has been an ongoing sequence of research papers on feature selection that each report improved classification via new feature selection functions [e.g. 2,4,7,9,14,15,20,21,23,28,29,30], sometimes within a specialized domain of classification [e.g. 1,3,5,12,13,18,19,32]. The ‘holy grail’ for this line of research might be for some inventive person to engineer just the right function such that it dominates the performance of the others—at least in some limited domain, such as text classification for sentiment analysis with linear SVMs. This has proven to be a slow and elusive generate-and-test search, on which we shed some light.

In this paper we offer a method we call ‘feature selection tomography,’ which allows one to directly determine the empirical feature preference surface for a specific dataset and induction algorithm setting. The method is explained in Section 2, but basically it works by determining how much the induction algorithm improves with the addition of a single artificial feature with controlled characteristics. By measuring across the gamut of sensitivity and specificity characteristics for the feature, we obtain the entire preference surface. Figure 1 shows an example, detailed later. If the added feature mirrors the class label perfectly, the classifier obtains 100% accuracy (left and right peaks). Notably, if two features with different noise characteristics offer the same expected benefit to the classifier (any two points on a contour line), then they should receive the same score by an ideal feature selection function, since they are equally valuable. Thus, the empirical surface reveals the contour lines of an ideal feature selection function in a specific situation.

Interestingly, the contours found experimentally do not match the shapes of known feature selection functions in the literature, as we show in Section 3.1. This alone is revealing.

Figure 1. Empirical feature preference surface.

Inspired by this new source of information, one might hope to model the many measured surfaces via a multi-dimensional regression and thereby deliver an empirically ideal feature filter, leapfrogging the generate-and-test search. However, feature tomography also makes it easy to illustrate that the preference surface depends on a great number of parameters and characteristics—more factors than any previous feature selection paper has considered. Thus, with such a high-dimensional space it becomes impractical to construct an ideal feature selection function that would serve generally. Tsamardinos and Aliferis showed that feature selection depends on both the classifier and the performance evaluation metric [25]. We find, moreover, that under identical problem conditions of training set size, class prior, dataset features, classifier model, and model parameters, the empirical preference surface depends on the individual target class chosen for the same dataset. Thus, this paper illustrates that there is no single ideal function to be found by the ongoing sequence of papers on feature filtering methods. This message may seem obvious in hindsight and in view of the no free lunch theorem [27], but it can now be demonstrated in a lucid visualization via feature selection tomography.

In this paper we restrict our attention to binary classification problems and binary features. Section 2 lays out our methodology and Section 3 expounds the findings, compares them with known feature selection functions, and demonstrates that the preference surfaces depend on many factors. This is followed by some philosophical discussion of validity and related work in Sections 4 and 5. We conclude with future work in Section 6.

2 Feature Selection Tomography

Given a training set and a specific classifier learning algorithm, the well-known methods of bootstrapping or cross-validation can be used to estimate the generalization accuracy of the classifier model on such data, via multiple invocations of the learning algorithm on different subsets of the data for training and testing.

For feature selection tomography we insert in the dataset one additional feature with controlled characteristics, and by bootstrapping we determine the benefit of such a feature to classification accuracy (or any other performance objective).

We restrict our attention to binary classification and binary features. Hence, a feature can be characterized by its true positive rate tpr = P(feature | positive class) and its false positive rate fpr = P(feature | negative class). For example, to generate a controlled feature with tpr=40% and fpr=1%, the feature is turned on with 40% probability for each positive case and with 1% probability for each negative case. This can be repeated to determine the average benefit of having such a feature. In independent bootstraps, we test the gamut of feature characteristics in a grid: tpr=0–100% × fpr=0–100%.
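To make this construction concrete, here is a minimal sketch of the controlled-feature generator (the paper's experiments used the WEKA v3.6 library in Java; this NumPy rendering and the name make_controlled_feature are ours):

    import numpy as np

    def make_controlled_feature(y, tpr, fpr, rng):
        """Generate one binary feature whose on-probability is tpr for
        positive cases (y == 1) and fpr for negative cases (y == 0)."""
        p_on = np.where(y == 1, tpr, fpr)              # per-case Bernoulli parameter
        return (rng.random(len(y)) < p_on).astype(int)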

The remaining question is the choice of dataset to which the controlled feature is appended. The premise in feature selection papers is that the ideal function should work across datasets. Therefore, we start off with a neutral dataset: we generate 1000 independent binary noise features and add a single controlled feature that reflects the class label to some degree. (We also examine real datasets later in the paper.) Thus, any improvement over majority voting is due to the controlled feature. Each training set has 1000 examples, with 5% to 50% being positive, depending on the experiment.

(To obtain smooth curves, we repeated each bootstrap at least eight times and continued as long as the standard error of the accuracy exceeded 1%. On each iteration the controlled feature is regenerated. By sampling every 4% across the tpr × fpr grid, the large number of runs complete on our batch job cluster in much less than an hour using the WEKA v3.6 library in Java.)
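Putting the pieces together, the following sketch illustrates the whole tomography loop under simplifying assumptions of ours: scikit-learn's LinearSVC stands in for the linear SVM, 3-fold cross-validation stands in for the bootstrap, and a fixed number of repetitions replaces the adaptive stopping rule on the standard error. It reuses make_controlled_feature from the sketch above.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def tomography_surface(make_clf=LinearSVC, n=1000, n_noise=1000,
                           pos_rate=0.5, step=0.04, reps=8, seed=0):
        """Estimate the mean cross-validated accuracy obtained when a single
        controlled feature with each (tpr, fpr) is appended to pure-noise data."""
        rng = np.random.default_rng(seed)
        grid = np.arange(0.0, 1.0 + 1e-9, step)
        surface = np.zeros((len(grid), len(grid)))
        y = (np.arange(n) < pos_rate * n).astype(int)      # fixed class prior
        for i, tpr in enumerate(grid):
            for j, fpr in enumerate(grid):
                accs = []
                for _ in range(reps):                      # regenerate data each repetition
                    X_noise = rng.integers(0, 2, size=(n, n_noise))
                    f = make_controlled_feature(y, tpr, fpr, rng)
                    X = np.column_stack([X_noise, f])
                    accs.append(cross_val_score(make_clf(), X, y, cv=3).mean())
                surface[i, j] = float(np.mean(accs))
        return grid, surface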

3 Empirical Preference Surfaces

Figure 1 shows the preference surface we obtain for a linear Support Vector Machine (SVM) classifier with 50% positives in the dataset. The z-axis shows the accuracy of the SVM classifier when it receives the benefit of a single feature with the given tpr,fpr characteristics. Note that a perfect feature that ‘gives away’ the true label would have tpr=100% and fpr=0%; we see at the far left peak that such a feature does give the greatest benefit (and symmetrically for the opposite corner, where the feature is 1 iff the case is labeled negative). By contrast, a useless feature with tpr=fpr gives no information to the classifier, and we see that this diagonal valley of the preference surface gives least preference, i.e. 50% accuracy. While these particular test cases are intuitive, the exact shape of the curvature between these extremes is where all the disagreement lies among feature selection functions.

Figure 2. Feature selection functions for comparison with the empirical surface in Figure 1 (panels: Pointwise Mutual Information, Accuracy, Log Odds Ratio).

3.1 Comparison with Known Functions

Strikingly, the shape revealed in Figure 1 does not correspond to any of a dozen known feature selection functions we have examined. To illustrate, Figure 2 shows the preference surfaces generated by just three example feature selection functions: Pointwise Mutual Information, Accuracy (of the single feature as the classifier), and Log Odds Ratio. None matches Figure 1 closely: the first shape is wildly different, the second is too flat, and the third curls upwards at the edges too much for a good fit.
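For orientation, these three functions can be written directly in terms of tpr, fpr, and the positive class prior p (standard definitions, in our notation; PMI is evaluated at the cell where the feature is present and the class is positive):

    PMI(tpr, fpr; p) = \log \frac{tpr}{p\,tpr + (1-p)\,fpr}

    Acc(tpr, fpr; p) = p\,tpr + (1-p)(1-fpr)

    LogOddsRatio(tpr, fpr) = \log \frac{tpr\,(1-fpr)}{fpr\,(1-tpr)}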

Since feature filtering uses the selection function only to rank features, the range of specific values from a function is irrelevant. For example, Odds Ratio will rank identically to Log Odds Ratio. Thus, we only need to compare the shapes of the preference surfaces. We can do this by comparing their contour lines (the colored lines of equal z-value, mirrored on the base of each graph). We can compare shapes more closely with a refined visualization: Figure 3 shows the contour lines of six feature selection functions against a background of the contours from Figure 1. (For details of their equations refer to [7].) Each function correctly gives highest preference to a perfect feature in the opposing corners and lowest preference to features along the diagonal tpr=fpr. But the interior shapes disagree with one another and, by our visualization, we can see that each disagrees somewhat with the empirically determined preference surface. Disagreement is most evident wherever contours cross, i.e. two points of equal value according to one surface are scored differently by the other surface; if two surfaces had identical contour lines, then they would rank features consistently. The shape of the Accuracy surface is too flat to fit well. Odds Ratio and Bi-Normal Separation are too sharp in the corners. Even Information Gain and Chi-Squared do not quite match; it is somewhat difficult to see here, but the bend in their contours consistently exceeds that of the empirical surface. We will further illustrate the mismatch in the next subsection.
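The contour-overlay comparison itself is easy to reproduce. Below is a minimal sketch, assuming matplotlib and SciPy, the tomography_surface sketch from Section 2, and the standard definitions of Information Gain and Bi-Normal Separation written in terms of tpr, fpr, and the prior p (the helper names info_gain and bns are ours):

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    def info_gain(tpr, fpr, p=0.2):
        """Information Gain of a binary feature, in terms of tpr, fpr, and prior p."""
        eps = 1e-12
        def H(q):
            q = np.clip(q, eps, 1 - eps)
            return -q * np.log2(q) - (1 - q) * np.log2(1 - q)
        pf = p * tpr + (1 - p) * fpr                      # P(feature = 1)
        pos_f = p * tpr / np.maximum(pf, eps)             # P(+ | feature = 1)
        pos_nf = p * (1 - tpr) / np.maximum(1 - pf, eps)  # P(+ | feature = 0)
        return H(p) - pf * H(pos_f) - (1 - pf) * H(pos_nf)

    def bns(tpr, fpr):
        """Bi-Normal Separation: |inverse-normal(tpr) - inverse-normal(fpr)|, rates clipped away from 0 and 1."""
        clip = lambda r: np.clip(r, 0.0005, 0.9995)
        return np.abs(norm.ppf(clip(tpr)) - norm.ppf(clip(fpr)))

    grid, surface = tomography_surface(pos_rate=0.2)       # empirical background
    T, F = np.meshgrid(grid, grid, indexing="ij")          # axis 0 = tpr, axis 1 = fpr
    plt.contour(F, T, surface, colors="gray")              # empirical contours
    plt.contour(F, T, info_gain(T, F), colors="red", linestyles="dotted")
    # plt.contour(F, T, bns(T, F), colors="blue", linestyles="dashed")
    plt.xlabel("fpr"); plt.ylabel("tpr")
    plt.show()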

3.2 Many Influencing Factors

The ground-truth empirical shape in Figure 1 appears simple enough that we might design a functional form that models the surface well. But this would be premature. As noted in the introduction, we found, unfortunately, that the surface is strongly affected by a wide variety of factors, which we illustrate next.

Figure 3. Contours of several common feature selection functions compared against background contours from Figure 1 (panels: Pointwise Mutual Information, Accuracy, (Log) Odds Ratio, Information Gain, Chi-Squared, Bi-Normal Separation). Wherever the foreground and background lines cross, the two surfaces disagree on the local gradient at that point, illustrating that their shapes disagree in their relative preferences for feature selection. (This comparison is nicely independent of absolute output values; thus Odds Ratio and Log Odds Ratio induce exactly the same contours.)


First, the surface depends on the class prior. In Figure 4 we vary the percentage of positives in the dataset. Intuitively, when positives are scarce, there is a stronger preference for features with a very low false positive rate. With increasing imbalance, observe the steeper slope of the red contours near the y-axis (and symmetrically for negatively correlated features).

This figure is also useful to illustrate that Chi-Squared, which appeared in the previous subsection to be a close match at 50% positives, is not such a good match under class imbalance. The background dotted lines show how the shape of Chi-Squared changes for 5% to 50% positives. Under imbalance it becomes clear that Chi-Squared is not a perfect fit (see top left). Similarly, though Information Gain too adapts to the class prior, it also does not fit under class imbalance (illustrated later).

Second, we illustrate that the empirical surface depends on the classifier model. In Figure 5 we vary the classifier algorithm, fixing the percentage of positives hereafter at 20%. The first plot in the figure uses SVM, as before. The second and third plots show the empirical preference surfaces for the traditional (binomial) Naïve Bayes classifier as well as the multinomial Naïve Bayes variant [17]. The fourth plot is for Logistic Regression. The absolute accuracy of each model is irrelevant. For our purposes here, it is enough to see that their shapes (as reflected by the red contour lines) differ substantially—illustrating clearly that ideal feature selection depends on the classification model.

(The dotted background in each plot shows the contours of Information Gain at 20% positives—again not an excellent match.) It would be acceptable to require a separate, empirically derived preference surface for each different classifier. The model thus far would depend on four parameters: fpr, tpr, the percentage of positives, and the classifier model. Modeling three or four dimensions well can sometimes be workable. But the story worsens.
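In the re-implementation sketched in Section 2, varying the classifier amounts to swapping the learner passed into the tomography loop; a brief sketch with scikit-learn stand-ins for the models named above (the paper's experiments used the corresponding WEKA implementations):

    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.linear_model import LogisticRegression

    # One tomography surface per classifier model, all at a 20% positive prior.
    models = {
        "SVM (linear)": LinearSVC,
        "Binomial Naive Bayes": BernoulliNB,
        "Multinomial Naive Bayes": MultinomialNB,
        "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    }
    surfaces = {name: tomography_surface(make_clf, pos_rate=0.2)
                for name, make_clf in models.items()}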

Third, the empirical surface also depends on the size of the training set. Hereafter we focus again on the linear SVM and fix the percentage of positives at 20%. Figure 6a shows the preference surface when training on 50 positives and 200 negatives (solid red foreground contours) vs. that of our default training size at the same class prior (200:800, dotted background). The graph is zoomed into the top left corner of the gamut in order to make it easier to see the preference differences at low false positive rates. Near the top left corner the preference contours become decidedly inconsistent with the shape of all feature selection functions we considered.

Fourth, the empirical preferences even depend on the tuning parameter within the SVM itself. Figure 6b shows that the shape of the preference surface changes if the SVM complexity constant C is set to 0.01 (foreground) instead of the default 1.0 (background). Such parameters are often tuned via cross-validation after feature filtering has completed. But this result shows that the ideal feature filtering function would need to be included in the search loop, since it apparently depends strongly on the tuning parameter (consistent with [25]).

Figure 4. Empirical feature preferences vary with the percentage of positives in the dataset (panels: 50%, 20%, 10%, and 5% positives). Background compares Chi-Squared at the given prior. Axes range from 0–100%; these redundant labels are omitted hereafter to reduce visual clutter.

Figure 5. Empirical preferences vary by classifier (panels: SVM, binomial Naïve Bayes, multinomial Naïve Bayes, and Logistic Regression). 20% positives. Background compares Information Gain.


Additionally, we tested whether the empirical preference surface depends on the performance objective. We found substantial differences among accuracy, precision, and recall, but no substantial difference between accuracy and F-measure.
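In the sketch from Section 2, swapping the performance objective is a one-line change to the cross-validation scoring argument (scikit-learn scorer names; an assumption of our re-implementation, not the paper's WEKA setup):

    from sklearn.model_selection import cross_val_score

    # Variant of the Section 2 sketch: pass scoring="f1", "precision", or "recall"
    # instead of the default accuracy to change the objective being mapped.
    def fold_score(clf, X, y, scoring="accuracy"):
        return cross_val_score(clf, X, y, cv=3, scoring=scoring).mean()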

4 Discussion

Taking a step back, the unstated premise behind the many research papers on feature filtering is that a superior selection function may be found—perhaps limited within some problem domain, such as sentiment classification. According to this line of thought, in the limit we might expect to uncover some ideal feature selection function that dominates the others. This paper demonstrates empirically that the feature preferences vary depending on a large number of parameters—more parameters than any existing paper considers. Thus, any claims of generality for a feature selection function are very restricted (even if not stated so), unless the function can successfully take into account all these parameters.

But could such a highly parameterized ideal function exist? If so, then it could be quite valuable, even if it required a great deal of computation and modeling to construct it in a one-time effort. We argue that it does not: the feature preference surface depends on the individual dataset and even on its target concept. To show this, we replace the generated noise dataset by independent random draws of 200 positives and 800 negatives from the OHSCAL text classification benchmark of 11,162 cases, available online. We show that the preference surface depends on which subclass is sought; we study two separate binary partitions of the data: the binary task of recognizing class 0 (antibodies, 1159 positive cases) and, separately, recognizing class 5 (pregnancy, 1621 cases). We apply feature selection tomography to both tasks and compare their preference surfaces, shown in Figure 7. Even here there is significant disagreement in the shape of the two surfaces. This is in spite of identical conditions in the number of positives, the number of negatives, the SVM model, the SVM tuning parameters, and the number of binary features (11,465 distinct words).

As an additional demonstration of this dependence on the specific binary task, we compare the preference surfaces for distinguishing different classes of the UCI datasets ‘Letter Recognition’ and ‘Optical Recognition of Handwritten Digits.’ For each binary task, we train a Naïve Bayes classifier on samples of 200 positives and 800 negatives. Figure 8 compares pairs of surfaces. (The contour lines are irregular in the valley where the gradient is low and random experimental variation is a larger factor. Focus toward the top and left, where the most predictive features are to be found.) Each pair shows substantial differences. Thus, the ideal preference surface depends also on the particular target concept within the dataset—the final nail.

While unfortunate, this is not actually surprising. A fundamental assumption of feature filter methods is that the value of each feature can be judged independently of what other features are available. This assumption is clearly not true, since xor-features can become valuable when used in combination. Even so, feature filter methods find favor for their simplicity and often prove highly effective, much as Naïve Bayes frequently excels despite violations of its independence assumption.
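As a toy illustration of the xor point (a construction of ours, not an experiment from the paper): each of the two features below has tpr = fpr = 50%, so any filter would score it as useless, yet together they determine the label exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, 10000)               # first feature: a fair coin
    b = rng.integers(0, 2, 10000)               # second feature: another fair coin
    y = a ^ b                                   # the label is their XOR

    # Individually each feature is uninformative: tpr == fpr == 0.5 ...
    print(a[y == 1].mean(), a[y == 0].mean())   # both approximately 0.5
    # ... yet jointly the two features predict y perfectly:
    print(((a ^ b) == y).all())                 # True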

To be clear, the philosophical argument is only whether an optimal feature filter can exist, not the more practical issues of whether sub-optimal feature filtering can be useful or whether there is a substantial difference in classification accuracy when using ideal vs. sub-ideal selection functions. We continue to advocate the use of feature filtering for very high-dimensional domains.

Figure 6. Empirical surfaces also depend on (a) the training set size (¼ training set, zoomed) and (b) the classifier tuning parameter (SVM -C 0.01). 20% positives.

Figure 7. Empirical preference surfaces based on two OHSCAL text classification tasks. 20% positives.


But this section underscores that the characteristics of the particular dataset have a substantial effect on the resulting surface, and hence, any empirical study of selection functions will depend largely on the benchmark tasks that have been selected. This point deserves more than a nod. To amplify it: gathering a set of text classification tasks or sentiment analysis tasks together does not necessarily constitute a homogeneous group. We have shown here that the class distribution of each task has a substantial effect on the preference surface, among other factors. The ‘winning’ selection function found will have more to do with covering the distribution of test surfaces well than with fitting any one specific shape.

That said, in a specialized domain where there is a large stream of learning tasks that are expected to be substantially homogeneous, one might be able to use the tomography method on a sample of such datasets to determine an empirical preference surface, which could generalize to future tasks of that type. Demonstrating this practically is left to future work. Although the computational demands of feature selection tomography are substantial, in some circumstances they may still be insignificant compared with wrapper methods [11,31], which iteratively search for better subsets of features by evaluating each subset with the slow learning algorithm.

Advantages include: (1) the computation of the surface may be done in advance, before the training set arrives, and amortized over multiple similar training sets; (2) once the surface is determined, it can be used quickly for new datasets, since filtering does not involve the learning algorithm; (3) the surface can be interpolated to nearby characteristics that have not been evaluated (e.g., we can interpolate for a feature with tpr=4.2%, fpr=1.3%), whereas wrapper methods cannot interpolate to subsets not actually tested; (4) wrapper methods test subsets of features, and there are extremely many subsets for high-dimensional datasets. Furthermore, the computational demands of tomography can be judiciously reduced: if symmetry is assumed for negatively correlated features, half the computation can be folded. Also, the sampling grid may be coarsened, especially in regions where smoothness of the surface is assumed. In practical use with a stream of similar tasks, one could first determine the typical range of characteristics of actual features, and only evaluate the preference surface in these regions. For example, in text classification it is rare to encounter words with fpr > 25%, except for useless stopwords. While all this may be technically possible, it involves much greater complexity than the simple state-of-the-practice: simply trying a set of known feature selection functions and selecting the apparent best via cross-validation when each actual dataset arrives.
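To make advantage (3) concrete, a measured surface can be turned into a scoring function for the actual features of a new dataset by interpolating at each feature's empirical (tpr, fpr); a minimal sketch under our earlier assumptions (SciPy's RegularGridInterpolator; the helper score_features is ours):

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    def score_features(X, y, grid, surface):
        """Score each binary column of X by interpolating the measured
        tomography surface at that feature's empirical (tpr, fpr)."""
        lookup = RegularGridInterpolator((grid, grid), surface)   # axes: (tpr, fpr)
        tpr = X[y == 1].mean(axis=0)
        fpr = X[y == 0].mean(axis=0)
        return lookup(np.column_stack([tpr, fpr]))

    # Example usage: keep the 1000 highest-scoring features.
    # scores = score_features(X, y, grid, surface)
    # top = np.argsort(scores)[-1000:]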

Figure 8. Comparison of pairs of individual binary tasks drawn from UCI Letter and OptDigits. The Naïve Bayes classifier was trained with 200 positives and 800 negatives in each case. Differences are often apparent at the left and/or top, indicating that the feature preference surface depends substantially on the individual dataset.



5 Related Work

Most papers that explore feature selection functions for feature filtering observe only the aggregate end effect on the improvement of a classifier. Some have included a discussion of the selection biases or shape of the preference surface, either by analyzing the mathematical form [19,30] or by graphically depicting the 3D shape of the function [7]. No papers have evaluated the functions based on their fit with the empirical preference surface, which is only now revealed graphically by the feature selection tomography method.

Almost all the papers use simple characteristics to measure features—such as fpr, tpr, precision, recall, document frequency, and the training class prior. But none discuss all the parameters studied here. Often a paper will limit its scope to a particular classifier, such as Naïve Bayes [4,19], so naturally its conclusions are limited to that classifier (though not always stated). A number of papers focus on particular sub-domains, such as class imbalance [5,7,19,32], sentiment classification [1,13], genre classification [12], foreign languages [1,18], or computer virus detection [3]. Most limit their scope to binary classification problems and often binary features, but some consider multi-class [30]. Work by Tsamardinos and Aliferis proves that optimal feature selection, whether via filters or wrappers, is tied to both the specific classification algorithm and the performance evaluation metric [25]; we showed a variety of additional dependencies, even including the target concept itself for a fixed training set.

The idea of adding a randomized feature also appears in some other works [e.g. 24,26], but their goal is instead to determine a score threshold below which features may be discarded. We add only one feature at a time because the preference surface depends on the other features available; multiple features added at once may interact with one another.

Recently Joachims demonstrated the ability to train a linear SVM over all of the Reuters RCV1 dataset, with over 800,000 documents and 47,000 word features, in just minutes on a PC [10]. This raises the question of whether feature filtering is still useful for scalability (leaving aside the question of whether it can still improve accuracy). While—especially in research—a bag-of-words model is commonplace, in practical settings one may often find phrases of two or more words helpful for accurate classification. When classifying technical text, it can also be helpful to include character n-grams that include symbols, e.g. terms such as ‘OS/2’ or ‘> 4 GB RAM’ in tech-support documents. When including these and other kinds of terms, the number of potential features can grow by several orders of magnitude, requiring scalable feature filtering methods, partly just considering the storage space for the extracted features. Also, the number of potential features can be tremendous in web-scale analytics or genomic/micro-array domains, as well as with feature construction methods, such as those produced by automatically linking multiple database tables.

While feature selection functions simply assume that the worth of each feature can be evaluated independently, there are many methods that consider feature interactions—for a substantial increase in computational complexity [22]. Some multivariate filtering methods eliminate feature redundancy, e.g. perfectly correlated features. Wrapper methods and embedded methods both account for the full interaction of the features with the classifier [e.g. 26]. Embedded methods generally scale somewhat better than wrappers, but cannot approach the efficiency of simple feature filtering methods when it comes to genuinely massive feature spaces.

6 Conclusion and Future Work

The key contribution of this paper is the method of feature selection tomography, which applies bootstrapping to measure the benefit to classification accuracy of adding a feature with known characteristics (parameterized by fpr and tpr for binary classification and binary features). By varying the characteristics across the gamut, we obtain a feature preference surface. In a sense, this is backwards compared with real feature selection, but it is consistent with the premise behind feature filtering: each potential feature is evaluated independently for its expected benefit to the classification problem, and those with the highest scores are best to include.

Using this technique, we produced example surfaces that (1) demonstrate substantial variability depending on many different factors, and (2) do not match a variety of known feature selection functions, such as Information Gain. Some might hope that a clever researcher will one day produce a better function that dominates—motivated by theory or not. But this paper rebuts that hope with graphic illustrations: such a function would have to depend on very many different parameters; furthermore, datasets under identical conditions induce different preference surfaces depending on the target concept alone. The demonstrations in this paper emphasize that the sequence of papers published in this area is not leading to an end-point that dominates all others. Such observations have been made by others (for example, the no free lunch theorem [27] and [6]), but the tomography visualization illustrates the point.

One implication is that superior methods published in the literature are bounded within a narrower domain than is often stated or appreciated. An optimistic outcome may be that future publications in this area will more strongly qualify the domain they study and hope to generalize over. Most past claims are likely tied more closely to the particular characteristics of the datasets and classifiers used than previously thought. Poetically, we thought we were like blind researchers feeling different parts of an elephant, but it appears instead that we have before us a multi-headed chimera.

One central avenue for future work is to explore the use of feature tomography to produce an empirically optimal feature selector that generalizes within some limited domain of fairly similar tasks. To be of practical use, this should be in some very high dimensional domain or for frequently repeated feature selection problems, in order that the computation of the preference surface is dwarfed by its repeated use for feature selection. Reducing such an empirical model to a functional form is the work of a large body of literature, but we note that modeling near the axis boundaries of the empirical surface is critical for sparse features—whereas most functional modeling methods do not excel near the surface boundary. We have found this to be an area of some difficulty, and it may call for research. One avenue may be to represent the space as Detection Error Tradeoff (DET) curves [16] instead of ROC curves, which would transform the region of interest to a large area in the graph, rather than a sliver near the axis boundaries as with ROC.

Throughout this paper we have limited our scope to binary features. It would be challenging to extend this work to other feature types, where additional parameters besides fpr and tpr would be needed to characterize variables. Even many types of numerical features, such as word counts, are typically binarized and treated with binary methods for feature selection. But there are plenty of wrapper methods for feature selection that can deal with heterogeneous feature types, e.g. via Random Forests [26].

Finally, although this study has focused on feature selection, the tomography technique can more generally find the sensitivity to various features. Such a preference surface might be used in future work for feature weighting (weighting the feature dimensions of a k-Nearest-Neighbor distance function) or feature scaling for linear SVM kernels, in analogy to the use of feature selection functions for feature weighting instead of the traditional TF*IDF weighting in text classification [8]. An empirically determined preference surface could also be leveraged simply for showing users which features are likely to be more predictive, without involving classification. This could be helpful for domain understanding or for manually designing new features to best aid in an important classification task. In fact, this effort originally began as a way to determine which inductive transfer features would likely be most useful to add to a given classification dataset. We believe the tomography technique will open several avenues of pursuit in machine learning research.

References

1. Abbasi, A., Chen, H., and Salem, A.: Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums. ACM Trans. Info. Syst. 26, 3, 1–34, 2008.

2. Aizawa, A.: The feature quantity: an information theoretic perspective of TFIDF-like measures. In: 23rd Int’l SIGIR Conf. on Research and Development in Info. Retrieval, SIGIR, 104–111, 2000.

3. Cai, D. M., Gokhale, M., and Theiler, J.: Comparison of feature selection and classification algorithms in identifying malicious executables. Comput. Stat. Data Anal. 51, 6 (Mar), 3156–3172, 2007.

4. Chen, J., Huang, H., Tian, S., and Qu, Y.: Feature selection for text classification with Naïve Bayes. Expert Systems Applications 36, 3, 5432–5435, 2009.

5. Chen, X., and Wasikowski, M.: FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems. In: 14th Int’l Conf. on Knowledge Discovery and Data Mining, KDD, 124–132, 2008.

6. Daelemans, W. and Hoste, V.: Evaluation of machine learning methods for natural language processing tasks. In: 3rd International Conference on Language Resources and Evaluation, 2002.

7. Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research (JMLR), 3, 1289–1305, Mar 2003.

8. Forman, G.: BNS feature scaling: an improved representation over TF*IDF for SVM text classification. In: 17th ACM Conf. on Info. and Knowledge Mgmt., CIKM, 263–270, 2008.

9. Galavotti, L., Sebastiani, F., and Simi, M.: Experiments on the use of feature selection and negative evidence in automated text categorization. In: 4th European Conference on Research and Advanced Technologies for Digital Libraries, LNCS 1923, 59–68, 2000.

10. Joachims, T.: Training Linear SVMs in Linear Time. In: 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, KDD, 217–226, 2006.

11. Kohavi, R. and John, G. H.: Wrappers for feature subset selection. Artificial Intelligence, 1–2, 273–324, Dec 1997.

12. Lee, Y. and Myaeng, S. H.: Text genre classification with genre-revealing and subject-revealing features. In: 25th Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR, 145–150, 2002.

13. Li, S. and Zong, C.: Multi-domain sentiment classification. In: 46th Association for Computational Linguistics on Human Language Technologies, 257–260, 2008.

14. Li, S., Xia, R., Zong, C., and Huang, C.R.: A framework of feature selection methods for text categorization. In: 47th ACL and 4th Int'l Joint Conf. on Natural Language Processing, 692–700, 2009.


15. Liu, H., Su, Z., Yao, Z., and Liu, S.: An improved feature selection for categorization based on mutual information. In: Int’l Conf. on Web Info. Systems and Mining. LNCS 5854, 80–87, 2009.

16. Martin, A. F. et al.: The DET curve in assessment of detection task performance. In: Proc. Eurospeech '97, Rhodes, Greece, v.4, 1899–1903, 1997.

17. McCallum, A., and Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, 41–48, 1998.

18. Mesleh, A.: Support vector machines based Arabic language text classification system: feature selection comparative study. In: 12th WSEAS Int’l Conf. on Applied Mathematics, 228–233, 2007.

19. Mladenic, D. and Grobelnik, M.: Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: 16th Int’l Conf. on Machine Learning. ICML, 258–267, 1999.

20. Montanes, E., Alonso, P., Combarro, E. F., Diaz, I., Cortina, R., and Ranilla, J.: Using Laplace and angular measures for Feature Selection in Text Categorisation. Int. J. Adv. Intell. Paradigms 1, 1, 40–59, 2008.

21. Olsson, J. S. and Oard, D. W.: Combining feature selectors for text classification. In: 15th ACM Int’l Conf. on Info. & Knowledge Mgmt., CIKM, 798–799, 2006.

22. Saeys, Y., Inza, I., and Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517, 2007.

23. Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., and Wang, Z.: A novel feature selection algorithm for text categorization. Expert Systems Applications, 33, 1, 1–5, Jul 2007.

24. Stoppiglia, H., Dreyfus, G., Dubois, R., and Oussar, Y.: Ranking a random feature for variable and feature selection. JMLR 3:1399–1414, Mar 2003.

25. Tsamardinos, I., & Aliferis, C. F.: Towards principled feature selection: Relevancy, filters and wrappers. 9th Int’l Workshop on AI and Statistics, 2003.

26. Tuv, E., Borisov, A., Runger, G., and Torkkola, K.: Feature selection with ensembles, artificial variables, and redundancy elimination. JMLR, 10, 1341–1366, 2009.

27. Wolpert, D.: The lack of a priori distinctions between learning algorithms. Neural Computation, 1341–1390, 1996.

28. Wu, O., Zuo, H., Zhu, M., Hu, W., Gao, J., and Wang, H.: Rank aggregation based text feature selection. In: IEEE/WIC/ACM Int’l Joint Conf. on Web Intelligence and Intelligent Agent Technology, 165–172, 2009.

29. Yang, Y. and Liu, X.: A re-examination of text categorization methods. In: 22nd Int’l ACM SIGIR Conf. on Research and Development in Info. Retrieval, SIGIR, 42–49, 1999.

30. Yang, Y. and Pedersen, J. O.: A Comparative Study on Feature Selection in Text Categorization. In: 14th Int’l Conf. on Machine Learning, ICML, 412–420, 1997.

31. Zhang, Q., Weng, F., and Feng, Z.: A progressive feature selection algorithm for ultra large feature spaces. In: 44th meeting of the Association for Computational Linguistics, ACL, 561–568, 2006.

32. Zheng, Z., Wu, X., and Srihari, R.: Feature selection for text categorization on imbalanced data. SIGKDD Explorations Newsletter, 6, 1, 80–89, Jun 2004.