

Bayesian Classification and Feature Reduction Using Uniform Dirichlet Priors

Robert S. Lynch, Jr., Member, IEEE, and Peter K. Willett, Senior Member, IEEE

Abstract—In this paper, a method of classification referred to as the Bayesian data reduction algorithm (BDRA) is developed. The algorithm is based on the assumption that the discrete symbol probabilities of each class are a priori uniformly Dirichlet distributed, and it employs a “greedy” approach (which is similar to a backward sequential feature search) for reducing irrelevant features from the training data of each class. Notice that reducing irrelevant features is synonymous here with selecting those features that provide best classification performance; the metric for making data-reducing decisions is an analytic formula for the probability of error conditioned on the training data. To illustrate its performance, the BDRA is applied both to simulated and to real data, and it is also compared to other classification methods. Further, the algorithm is extended to deal with the problem of missing features in the data. Results demonstrate that the BDRA performs well despite its relative simplicity. This is significant because the BDRA differs from many other classifiers; as opposed to adjusting the model to obtain a “best fit” for the data, the data, through its quantization, is itself adjusted.

Index Terms—Class-specific features, discrete features, feature selection, neural networks, noninformative prior, UCI repository.

I. INTRODUCTION

IN THIS paper, a method of classification referred to as the Bayesian data reduction algorithm (BDRA) is developed.

All training and test data are assumed discrete; by “discrete,” it is meant that each feature of each measurement can take on one of a finite number of values; the overall cardinality $M$ is the Cartesian product of these. For example, three binary valued features can take on a possible $M = 2^3 = 8$ discrete symbols, corresponding to the eight feature vectors $(0,0,0), (0,0,1), \ldots, (1,1,1)$. Features may be naturally discrete, or they may have been continuous valued and discretized. As will be seen, the Dirichlet statistical model is concerned only with the overall cardinality $M$ and not with features or quantization: Each of the $M$ possible observations is accorded the same prior likelihood under all hypotheses. The BDRA, however, is interested in the details, since it changes $M$ by removal of features’ thresholds. For example, a ternary feature can become binary by removal of one threshold, and the feature can be ignored by removal of the other.
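To make the cardinality bookkeeping concrete, the following sketch (ours, not from the paper; the feature sizes are hypothetical) maps quantized feature vectors to the $M$ discrete symbols via a mixed-radix encoding.

```python
# A minimal sketch of the discrete-symbol view used by the BDRA: each
# quantized feature vector is mapped to one of M symbols, where M is the
# Cartesian product of the per-feature level counts.
from itertools import product

def symbol_index(vec, levels):
    """Map a quantized feature vector to a single symbol index in [0, M)."""
    idx = 0
    for v, q in zip(vec, levels):
        idx = idx * q + v  # mixed-radix encoding
    return idx

levels = [2, 2, 2]                      # three binary features
M = 1
for q in levels:
    M *= q                              # M = 2 * 2 * 2 = 8
print(M)                                # -> 8
for vec in product(*[range(q) for q in levels]):
    print(vec, "->", symbol_index(vec, levels))
```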

Manuscript received July 24, 1999; revised December 2, 2001. R. S. Lynch, Jr. was supported by a Naval Undersea Warfare Center In-House Laboratory Independent Research Grant. P. K. Willett was supported by the Office of Naval Research under Contract N00014-97-1-0502 and by the Air Force Office of Scientific Research under Contract F49620-00-1-0052. This paper was recommended by Associate Editor L. O. Hall.

R. S. Lynch, Jr. is with the Naval Undersea Warfare Center, Newport, RI 02841 USA (e-mail: [email protected]).

P. K. Willett is with the University of Connecticut, Storrs, CT 06269 USA.
Digital Object Identifier 10.1109/TSMCB.2003.811121

Certain labeled realizations of the ($M$-valued) feature vectors are referred to as “training” data under all classes, that is, there are $N_k$ realizations of the training data under class $k$. As an illustration, Fig. 1 shows representative histograms of the Iris plant training data set, which is found at the University of California at Irvine’s (UCI) Repository of Machine Learning Databases [44] (also see the URL address http://www.ics.uci.edu/~mlearn/). Each class in this figure contains 45 samples of training data ($N_k = 45$), where the classes are labeled respectively as Iris Setosa, Iris Versicolor, and Iris Virginica. Classification using this data set is based on four ternary valued discrete features¹ ($M = 3^4 = 81$). Notice that these features were originally continuously valued but were quantized using a method of percentiles (see item 5 of Appendix A), which is a necessary step for the BDRA. In this case, three discrete levels per feature provides best classification performance (see Table I and Table IV in the Results section). Thus, the basic distributional model contained in the BDRA for modeling each class is based on the frequency of occurrence of all possible quantized feature vectors, and the difference between the histograms constitutes all relevant classification information.
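Since the BDRA requires discrete features, a percentile quantization in the spirit of Appendix A, item 5, can be sketched as follows (our illustration; the stand-in data and the equal-frequency binning details are assumptions, not the paper's exact procedure).

```python
# A minimal sketch, assuming percentile (equal-frequency) binning: thresholds
# are placed at the 100*j/q percentiles of the training values so that each
# of the q discrete levels receives roughly the same number of samples.
import numpy as np

def percentile_thresholds(values, q):
    """Return the q - 1 thresholds splitting `values` into q equal-mass bins."""
    return np.percentile(values, [100.0 * j / q for j in range(1, q)])

def quantize(values, thresholds):
    """Map each continuous value to a discrete level in {0, ..., q-1}."""
    return np.searchsorted(thresholds, values)

rng = np.random.default_rng(0)
sepal_length = rng.normal(5.8, 0.8, size=45)   # stand-in for one Iris feature
thr = percentile_thresholds(sepal_length, q=3) # ternary, as used for the Iris data
print(thr, np.bincount(quantize(sepal_length, thr)))  # ~15 samples per level
```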

The BDRA is based on what is referred to as the combined Bayes test in [40] and [42], which classifies discrete observations given an assumed uniform Dirichlet (completely noninformative) prior for the symbol probabilities of each class (see [29]). A noninformative prior is used here as a prior of “ignorance” to model the situation in which the true probabilistic structure of each class is unknown and has to be inferred from the training data (for more on this model, see Appendix B). The motivation for utilizing a discrete model is that it enables the placement of a noninformative prior on the feature vectors, or histogram cell probabilities (shown in Fig. 1) controlling the frequency of occurrence of each vector type—this would be unrewarding in the continuous-feature case. Also, with the assumption that the frequency of occurrence of each symbol type is multinomially distributed [see (10) and (13) in Appendix B], the underlying Bayesian model for each class is naturally related to the multinomial-Dirichlet distribution [8]. Later, the steps in developing the basic distributional model of the BDRA are fully described.

In addition to the training data, unlabeled and quantized “test” observations are to be used by the classifier to form a decision. In all of the results shown here, $N_y = 1$. Results for larger values of $N_y$ are straightforward; however, the complexity of the computation of the requisite probabilities of error is exponential in $N_y$.

¹Sepal length in centimeters, sepal width in centimeters, petal length in centimeters, and petal width in centimeters.



Fig. 1. Representative discrete data: Histograms of training sets for the Iris Plant data. (The frequency of occurrence of each symbol is represented as a point instead of a bar to reduce clutter in the figure.)

TABLE I
ILLUSTRATION OF APPLYING THE BDRA TO THE IRIS PLANT DATA OF FIG. 1, SHOWING, FOR EACH CLASS, THE NUMBERS OF OUTCOMES OF EACH DISCRETE SYMBOL AFTER DATA REDUCTION. THE TOP OF EACH COLUMN REFERS TO THE SPECIFIC BINARY FEATURE VALUE PAIRS ASSOCIATED WITH EACH OF THE FOUR POSSIBLE SYMBOLS ($M = 4$)

The goal is to determine, with minimum probability of error, from which class the unknown test data have been generated, conditioned on knowledge of the training data. Now, finding the minimum probability of error can present problems in situations where the “curse of dimensionality” [7] predominates, in that the training set size, relative to the number of features, is insufficient to estimate accurately a probabilistic model for each class. An important aspect of the BDRA’s performance is its ability to reduce the quantization complexity of feature vectors (and hence the overall cardinality $M$) to improve classification performance while simultaneously selecting relevant classification features [10]. Notice that a related idea appeared in [37] that is based on the Chi-squared statistic, and for another related approach, see [25].

The basic algorithmic approach underlying the BDRA is to search for the quantization complexity that minimizes the probability of error. This differs from many other methods, such as neural networks, where the model is adjusted, or tuned, to find the best fit for the unquantized feature data. In any case, it is important to note that the intent is not to show the BDRA as a “universally best” classifier, as such a classifier does not exist [19]; the emphasis here is on illustrating the ability of the BDRA to often provide superior performance when applied to difficult situations where the curse of dimensionality is a factor. This has added significance due to the relative simplicity of the BDRA.

For example, to obtain best performance (i.e., obtain an acceptably low error probability as in the Iris plant data example of Table I), the only tuning required for the BDRA is to determine how finely continuous valued features should be initially quantized (see Appendix A, item 5), which makes it straightforward to apply in most classification problems.

A substantial amount of work exists with respect to Bayesian approaches to classification. For example, with speech recognition, many works have appeared on the subject of mismatches in the information contained in the training and test data [30], [31], [33], [43], [52]. Although not explicitly addressed in this paper, previous results related to the BDRA, based on the joint statistics of training and test data, have addressed this problem [42] (see also [40]). Bayesian methods are also associated with decision trees [53], and in [13], the Dirichlet distribution [defined in (11) and (12) in Appendix B] is employed as a prior on all nodes of the tree. The Dirichlet’s use here is different in that it represents a prior on the quantization cells of each class model. Another Bayesian approach often used in practice is a belief network, a typical example of which is the naive Bayes classifier with the root node of the network containing the class labels and a leaf node for each feature. A potential drawback with this classifier is that it assumes all features are conditionally independent, which is a weakness from which the BDRA does not suffer. In addition to belief networks, a Bayesian technique for neural networks, also based on naive Bayesian learning, has been developed in [49] to reduce training times. The precise mathematical expression representing the BDRA’s training metric, which is key to its dimensionality reduction capability, substantially differentiates it from the method shown in [49], which utilizes boosting (see [6]) for training. The BDRA further differs from this method by having the capability to eliminate entire features, and the final quantization of any remaining features is in most cases not the same (results in [49] suggest that the opposite is true in many cases). Bayesian techniques have also been developed for unsupervised training, and some typical examples of this can be found in [48] and [50]. Note that the primary emphasis in this paper is on supervised training where the data are all correctly labeled. However, for future work, unsupervised methods have been developed for adapting the BDRA [39].

As previously mentioned, an important driver in the performance of the BDRA is in selecting relevant classification features from data, and many of the commonly used algorithms that can accomplish this have appeared in references such as [32] and [34]. In general, these algorithms find the best feature subset by mitigating the effects of the curse of dimensionality (for various related perspectives on this, see [2], [12], [19], [21], [23], [35], [46], [56], and [58]). Of these algorithms, one of the most commonly used is the sequential search, which can be either forward (i.e., bottom-up) or backward (i.e., top-down). In either case, a distribution for the features is assumed (often Gaussian), and the features are either added in (bottom-up) or taken away (top-down) one at a time. Notice that a variation on this approach is to do a floating search that can alternate between bottom-up and top-down searches (e.g., see [34]). In any case, the search continues iteratively as long as a performance metric such as the Mahalanobis distance between classes is being increased. A drawback of sequential feature searching


is that it tends to be heuristic and suboptimal. It is possible to search through all arrangements of the features to find the optimal subset or to use techniques such as branch and bound [23]; however, in high-dimensional cases, these methods can be computationally demanding.

In the BDRA, the backward sequential search is used to select relevant feature information. There are, however, two primary differences often not found with typical implementations. First, the BDRA does not necessarily remove whole features, but rather, it iteratively reduces the quantization complexity by removing discrete levels of each feature one level at a time. In other words, a feature’s information is reduced by gradually coarsening its quantization: It will be eliminated only if on a previous iteration it was binary valued. Second, the multinomial-Dirichlet distribution model on which the BDRA is based allows a closed-form analytic expression for the probability of error to be used as a feature reduction metric. That is, the Dirichlet prior of ignorance provides the necessary performance metric for the BDRA. Typically, with many other feature selection implementations, such as those utilizing a multivariate Gaussian model (see [32]), a theoretical expression for the probability of error is not utilized. In these cases, feature selection is usually based on partitioning the data into training and testing sets, and reduction metrics employ Monte Carlo-type methods. This is in contrast to the BDRA, which uses the entire training set only once and marginalizes over [i.e., sums as in (5)] all possible test observations.

An example of applying the BDRA [see (1)–(6) and the accompanying six itemized dimensionality reduction steps following (6)] to the Iris plant data of Fig. 1 is shown in Table I. In this case, the BDRA improves classification performance by eliminating both sepal features from the data and reducing both petal features to be binary valued. The improvement in performance can be evidenced for the Iris plant training data by computing the probability of error [see (5)] for both Fig. 1 and Table I. It turns out that the probability of error conditioned on the training data before data reduction in Fig. 1 is larger than the value of 0.111 to which it is reduced after feature reduction in Table I. Notice that only 45 of the original 50 data samples for each class were used for training, which means five samples were also set aside and used as unlabeled test data. Interestingly, after partitioning the data using one iteration of the hold-out 10% method (Fig. 1 and Table I are the result of one such partition) described in Appendix A, item 6, it was found that performance on the unlabeled test data also turned out to be 0.111 (see Table IV for the results of an average of seven iterations of the hold-out 10% method on the Iris plant data). This illustrates the BDRA’s ability to predict its own performance (i.e., from its training metric) on the test data, which not only has provided motivation for its development but also helps validate it as a reasonable method for selecting the most relevant features in data. With that, based on the remaining thresholds after data reduction, the BDRA also reveals the specific values of the relevant features most important to correct classification.

Performance results are presented in subsequent sections in an attempt to demonstrate comprehensively the classification capability of the BDRA by applying it to both simulated and real data (for an application of the BDRA to mislabeled training data, see [41]). Simulation-based results demonstrate the effect of increasing the dimensionality of the feature set and of the use of class-specific features [2]. Additionally, 28 data sets (from the UCI Repository)² are explored; interestingly, many of these data sets contain missing features, and the BDRA is extended to deal with this problem (for an alternate approach, see [38]). In the simulated applications, performance of the BDRA is compared with several different types of neural networks and a linear classifier (that represents each class using a vector of the mean values of all features). For the neural networks, results are obtained for backpropagation, radial basis function, and learning vector quantization-type networks. The linear classifier is also compared with the BDRA for the real data applications. With respect to simulated data, the BDRA is compared to different types of classifiers in order to validate the results; that is, the use of more than one classifier helps guard against unfair comparisons due to the possibility that a specific classifier might be improperly tuned to the data. In the case of the real data, many previously published results already exist; therefore, the linear classifier serves as an easy-to-understand baseline comparison classifier.

In obtaining the results, no other processing is applied to the data (both simulated and real) or the classifiers in order to improve performance, except for the handling of missing feature values and the discretizing of continuous-valued features that is necessary for the BDRA. Typical examples of performance improvement processing methods not used here include data whitening (see [9]) and the adaptive boosting techniques, based on voting, such as AdaBoost [6]. However, the intent here is to illustrate the inherent capability of the BDRA, including comparisons to other methods, with as little pre- or post-processing as possible. This implies that the performance results shown do not necessarily represent the best achievable (either utilizing the methods found here or elsewhere) for any of the scenarios or data sets examined. The idea is to demonstrate with empirical results, and based on a large number of data sets containing different statistical characteristics, that the BDRA performs comparably well on average with a minimal amount of user intervention. This is related to the fact that the most practical applications of the BDRA involve those situations where little tuning (i.e., data/classifier adjustments to improve performance) by the user is desired. This is an important capability not found with the majority of other classifiers, an exception being the linear classifier. This represents another reason the BDRA is compared to the linear classifier for the real data results.

II. DEVELOPMENT OF THE BAYESIAN DATA REDUCTION ALGORITHM (BDRA)

An outline of the development of the BDRA, previously discussed in [40] and [41], is given in this section. Before proceeding with this development, the notation that is used throughout this paper is itemized below. Unless otherwise indicated, a boldface font represents a vector quantity.

²As previously noted, the UCI data sets can be found at the URL address http://www.ics.uci.edu/~mlearn/. Many references exist on previous results with these data sets; for example, see those cited within the repository, and for additional examples not mentioned elsewhere in this paper, see [1], [22], [24], [26], [28], [51], and [55].


$C$: Total number of classes, with $k \in \{1, \ldots, C\}$.
$M$: Number of discrete symbols.
$\mathbf{p}_k$: Vector representing the true symbol probabilities for class $k$.
$\mathbf{p}_y$: Vector representing the true symbol probabilities for the test data.
$H_k$: Hypothesis defined as $\mathbf{p}_y = \mathbf{p}_k$.
$\mathbf{X}$: Collection of training data from all classes, with $\mathbf{X}_k$ the training data of class $k$.
$x_{k,i}$: Number of occurrences of the $i$th symbol in the training data for class $k$.
$N_k$: Total number of training data for class $k$.
$y_i$: Number of occurrences of the $i$th symbol in the test data $\mathbf{y}$.
$N_y$: Total number of test data.
$I(\cdot)$: Indicator function (unity valued when the condition is true, else zero).
$e$: Error in making a classification decision.

In the development of the BDRA, recall that the goal is to determine, with minimum probability of error, to which of $C$ classes an unknown test vector ($\mathbf{y}$) belongs, conditioned on the training data ($\mathbf{X}$). Thus, a primary component of the BDRA for modeling each class is the conditional distribution $p(\mathbf{y} \mid H_k, \mathbf{X})$. However, under the assumption that the training data of each class is independent (e.g., $\mathbf{X}_k$ is independent of $\mathbf{X}_l$ for $l \neq k$), this distribution is equivalent to $p(\mathbf{y} \mid H_k, \mathbf{X}_k)$. Now, with equiprobable classes, the classification decision rule for the BDRA can be specified as

$$\hat{k} = \arg\max_{k \in \{1, \ldots, C\}} p(\mathbf{y} \mid H_k, \mathbf{X}_k) \qquad (1)$$

where ties are broken arbitrarily (a similar rule without specifying distributions was shown in [43]).

Notice that the distribution $p(\mathbf{y} \mid H_k, \mathbf{X}_k)$ of (1) can also be written as the ratio of $p(\mathbf{y}, \mathbf{X}_k \mid H_k)$ to $p(\mathbf{X}_k \mid H_k)$, where these latter distributions are given, respectively, by

$$p(\mathbf{y}, \mathbf{X}_k \mid H_k) = \frac{N_y!\, N_k!\, (M-1)!}{(N_y + N_k + M - 1)!} \prod_{i=1}^{M} \frac{(y_i + x_{k,i})!}{y_i!\, x_{k,i}!} \qquad (2)$$

and

$$p(\mathbf{X}_k \mid H_k) = \frac{N_k!\, (M-1)!}{(N_k + M - 1)!}. \qquad (3)$$

The development of (2) and (3) is outlined and shown in Appendix B (see also [40]). Observe that a Bayesian approach to this problem is preferred based on the results of a comparison to the appropriate maximum likelihood (ML) based approach, as shown in [40].

After taking the ratio of (2) and (3), the desired conditional distribution of (1) becomes

$$p(\mathbf{y} \mid H_k, \mathbf{X}_k) = \frac{N_y!\, (N_k + M - 1)!}{(N_y + N_k + M - 1)!} \prod_{i=1}^{M} \frac{(y_i + x_{k,i})!}{y_i!\, x_{k,i}!}. \qquad (4)$$

The associated conditional probability of error for the test of (1) can be straightforwardly written by letting $\hat{k}(\mathbf{y})$ denote the class selected by (1) for a given $\mathbf{y}$, and it is given by

$$P(e \mid \mathbf{X}) = \frac{1}{C} \sum_{k=1}^{C} \sum_{\mathbf{y}} p(\mathbf{y} \mid H_k, \mathbf{X}_k)\, I\big(\hat{k}(\mathbf{y}) \neq k\big). \qquad (5)$$

In most typical classification situations, including those contained in this paper, no more than one observation of test data (i.e., a single feature vector) is tested for class membership at each trial. For the $P(e \mid \mathbf{X})$ shown above, this means that $N_y = 1$, and this simplifies (1) to

$$\hat{k} = \arg\max_{k \in \{1, \ldots, C\}} \frac{x_{k,i} + 1}{N_k + M} \qquad (6)$$

where $i$ is the discrete symbol taken on by the single test vector. With a single test observation, and independent identically distributed data, (6) is intuitive as it strongly depends on the number of training data that are of the same type as the test datum.
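As an illustration of (5) and (6), the following sketch (ours; the training counts are hypothetical) computes the single-observation predictive probabilities and the conditional probability of error for two classes.

```python
# A minimal sketch, assuming equiprobable classes and N_y = 1: training
# counts x[k][i] give the posterior predictive of (6), and summing over all
# M symbols as in (5) yields the error probability conditioned on X.
import numpy as np

def predictive(x_k):
    """Eq. (6): p(y = i | H_k, X_k) = (x_{k,i} + 1) / (N_k + M), each symbol i."""
    x_k = np.asarray(x_k, dtype=float)
    return (x_k + 1.0) / (x_k.sum() + x_k.size)

def prob_error(X):
    """Eq. (5) with N_y = 1: average the predictive mass that each class k
    places on symbols where the rule (1) does NOT select class k."""
    P = np.array([predictive(x_k) for x_k in X])   # shape (C, M)
    winners = P.argmax(axis=0)                     # decision (1) per symbol
    C = P.shape[0]
    return sum(P[k, winners != k].sum() for k in range(C)) / C

# Hypothetical training counts over M = 4 symbols for C = 2 classes.
X = [[8, 1, 0, 1],    # class 0
     [1, 2, 6, 1]]    # class 1
print(prob_error(X))  # conditional probability of error, Eq. (5)
```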

Given $P(e \mid \mathbf{X})$ as shown in (5), the definition of the BDRA is completed by describing the method used to select relevant classification features. As discussed earlier, the BDRA incorporates a somewhat modified backward sequential search (see [32] and [34]), which is contained in the following iterative steps.

1) Beginning with the initial training data having quantization complexity $M$ (i.e., the Cartesian product of the number of discrete levels for all features), (5) is used to compute $P(e \mid \mathbf{X}, M)$.

2) a) Remove the lowest-valued threshold for feature 1. This coarsening of the data makes the two lowest values of this first feature indistinguishable; for an illustration of this, see Table II. Note that all classes’ data are coarsened in the same way.

b) Calculate the probability of error $P(e \mid \mathbf{X}', M')$ for the merged (coarsened) data from the previous step. Note that the cardinality has been reduced to $M' < M$ by merging.

c) Replace the threshold, that is, return to $\mathbf{X}$ and $M$.

3) Repeat step 2) for all thresholds of feature 1.

4) Repeat steps 2) and 3) for all features.

5) From step 4), select the minimum of all computed $P(e \mid \mathbf{X}', M')$ (in the event of a tie, use an arbitrary selection), and choose this as the new training data configuration for each class.

6) Repeat items two through five until the probability of error does not decrease any further, or until $M = 1$, at which point the final quantization complexity has been found.

A code sketch of this greedy loop is given below.
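The following sketch is our rendering of steps 1)–6) under the single-test-observation model; the array layout and the tie handling (ties resolved in favor of the coarser configuration) are our assumptions.

```python
# A sketch of the greedy backward search: counts are held as an array T of
# shape (C, q_1, ..., q_d); candidate merges collapse two adjacent levels of
# one feature, and the merge minimizing Eq. (5) is kept until no merge helps.
import numpy as np

def prob_error(T):
    """Eq. (5) with N_y = 1 and equiprobable classes."""
    flat = T.reshape(T.shape[0], -1).astype(float)
    M = flat.shape[1]
    P = (flat + 1.0) / (flat.sum(axis=1, keepdims=True) + M)  # Eq. (6)
    win = P.argmax(axis=0)
    return sum(P[k, win != k].sum() for k in range(flat.shape[0])) / flat.shape[0]

def remove_threshold(T, f, t):
    """Step 2a: merge levels t and t+1 of feature f for every class."""
    ax = f + 1
    before = np.take(T, range(t), axis=ax)
    pair = np.take(T, [t, t + 1], axis=ax).sum(axis=ax, keepdims=True)
    after = np.take(T, range(t + 2, T.shape[ax]), axis=ax)
    return np.concatenate([before, pair, after], axis=ax)

def bdra_reduce(T):
    best = prob_error(T)                                   # step 1
    while True:
        candidates = [remove_threshold(T, f, t)            # steps 2-4
                      for f in range(T.ndim - 1)
                      for t in range(T.shape[f + 1] - 1)]
        if not candidates:
            break                   # every feature is down to one level
        scored = min(candidates, key=prob_error)           # step 5
        if prob_error(scored) > best:
            break                   # step 6: error no longer decreases
        T, best = scored, prob_error(scored)
    return T, best

rng = np.random.default_rng(0)
T = rng.integers(0, 6, size=(2, 3, 3))   # 2 classes, two ternary features
T_red, pe = bdra_reduce(T)
print(T_red.shape[1:], pe)               # final per-feature levels, Eq. (5)
```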

The backward sequential search of the BDRA is suboptimal and “greedy” as it chooses a best training data configuration at each iteration. A better but computationally more complex approach is to do a global search over all possible feature level reductions and corresponding training data configurations. However, a limited study involving simulated binary valued features and hundreds of independent trials revealed that only about 3% of the time did the suboptimal approach produce results different from a global feature search.


TABLE II
ILLUSTRATION OF MERGING. THE UPPER TABLE SHOWS THE NUMBERS OF TRAINING DATA $\mathbf{X}$ FOR CLASSES A AND B FOR THE CASE THAT FEATURE 1 IS TERNARY AND FEATURE 2 IS BINARY. AFTER REMOVAL OF THE LOWEST THRESHOLD FROM THE FIRST FEATURE, THE LOWER TABLE $\mathbf{X}'$ RESULTS. NOTE THAT THE CARDINALITY HAS CHANGED FROM $M = 6$ ABOVE TO $M = 4$ BELOW

In this case, the overall average probability of error for the global search was less than 1% lower than that of the suboptimal one.

It should also be noted that reducing the quantization complexity in the BDRA, via any searching method, implies a change in the Dirichlet prior [see the parameter $M$ in (11) and (12) of Appendix B] with each reduced feature of the training data. An argument in favor of this is that it essentially removes irrelevant feature information by reducing the curse of dimensionality with respect to the available training data, but, in particular with real data applications, the appropriate value of $M$ to use in the Dirichlet prior is rarely known a priori. Thus, the value of $M$ empirically estimated by the BDRA will more than likely not be the true value, even though, with respect to the available training data, performance improves. However, note that most feature selection methods involve the difference of a metric for goodness of fit (for example, the log-likelihood) that always increases with model complexity, minus a penalty term whose bias is toward low-complexity models. Generally, the penalty term is heuristic; an appealing feature of the BDRA is that this penalization of complex (i.e., fine quantization) models comes directly from the Dirichlet prior density.
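As a worked illustration (ours, not from the paper) of this built-in penalty, evaluate the evidence (3) for a class with $N_k = 4$ training samples at two cardinalities:

$$p(\mathbf{X}_k \mid H_k) = \frac{N_k!\,(M-1)!}{(N_k + M - 1)!} = \begin{cases} \dfrac{4!\,1!}{5!} = 0.200, & M = 2 \\[1.5ex] \dfrac{4!\,3!}{7!} \approx 0.0286, & M = 4. \end{cases}$$

A finer quantization thus lowers the prior predictive mass of the training data and must earn this loss back through better class separation in (4).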

As an additional note about the BDRA, in general, (5) should decrease as the number of test data is made larger than one ($N_y > 1$; see [40] and [42]). However, this should not affect the BDRA’s dimensionality reduction performance, as this is entirely conditioned on the training data and not the test data.

III. RESULTS

Performance of the BDRA is shown in the sections below using both simulated data and real data from the UCI Repository (see [44]). Appendix A summarizes some of the assumptions and conditions underlying the analysis in this section. Results for simulated data are presented first, and the BDRA is compared to several different neural networks and to a linear classifier (analogous to weighted voting, where the weights for each class are given by the mean values of each feature). In this case, a variety of classifiers helps to validate the results. With respect to real data, typically having unknown true characteristics, the BDRA is compared to only the linear classifier, as it represents a relatively easy-to-understand baseline for comparing performance. Additionally, the linear classifier, like the BDRA, requires a minimal amount of tuning by the user in order to obtain its best classification performance. However, many of these data sets have previously been applied to a variety of classification methods.

A. Performance of the BDRA Using Simulated Data

In this section, the BDRA is compared to various classifiers using simulated data (see items 1–4 of Appendix A, and Appendix C). Within these results, it will be observed that an important aspect is the ability of each classifier to deal with the curse of dimensionality inherent to the data. For the BDRA, it has been stated that the dimensionality problem is addressed by reducing the quantization of each feature through a feature search. With neural networks, the dimensionality of the data is reduced by self-adjusting neuron weights (see [9] and [27]). Therefore, it is of interest to compare the BDRA to a neural network using data with a variable and increasing dimensionality. Additionally, to supplement these results, the dimensionality problem is further explored by comparing the BDRA to the class-specific classifier of [2], which a priori knows those features most relevant to correct classification.

The experimental results of Fig. 2 demonstrate the training method of the BDRA, which improves performance by reducing the dimensionality of the data. In this figure, the probability of error is plotted versus the number of levels (i.e., of the original unreduced training data) for each of a total of six discrete-valued features, where the number of levels is varied from two to five (ignore noninteger values in all plots). This implies, recalling the discussion accompanying Fig. 1, that the initial quantization complexity is 64 for two levels per feature (i.e., $M = 2^6$), and for five levels per feature, it is 15 625 (i.e., $M = 5^6$).

The results in Fig. 2 (and in Figs. 3–6) are based on an average of 100 independent trials of randomly generated true symbol probabilities and training data according to Appendix C. The following notation describes the error probabilities contained in Fig. 2:

Unreduced (Training Data): the conditional probability of error computed using (5) conditioned on the initial training data configuration of each class before data reduction;
Reduced (Training Data): the probability of error computed using (5) conditioned on the final reduced training data configuration for each class (i.e., after the BDRA has trained). This training metric of the BDRA reveals whether reducing the data results in a performance improvement;
Unreduced (True): the probability of error computed using (5) with the decision rule based on the initial unreduced training data (the same as in the first item above) and conditioned on the true symbol probabilities [i.e., with $p(\mathbf{y} \mid H_k, \mathbf{X}_k)$ replaced by the true $p(\mathbf{y} \mid \mathbf{p}_k)$];
Reduced (True): the probability of error computed using (5) with the decision rule based on the reduced training data (the same as in the second item above) and conditioned on the true symbol probabilities (as in the previous item). This and the previous item are used to gauge actual classification performance, as opposed to generating an independent test data set along with the training data;


Fig. 2. Performance of the BDRA with two relevant features.

Optimal: the probability of error computed given the $\mathbf{p}_k$ are known, a priori, for each class (i.e., the classifier with infinite training data). This is the “clairvoyant” situation and is unrealistic in practice but does serve as a bound on performance. Also, in this case, the probabilities are intentionally constrained to be less than 0.1 in magnitude. For more on the method of generating true symbol probabilities and training data, see Appendix C.

Basically, Unreduced and Reduced refer to the cases, respectively, that the BDRA has not and has been applied. The probabilities of error reported under Training Data are the predictions of the BDRA based on what it “knows,” while those under True are the actual probabilities of error. It is interesting, we hope, that these latter two follow one another, meaning that in this situation the BDRA is actually able to predict its own true performance. It is also interesting that the BDRA appears to be conservative.

The results in Fig. 2 are based on two classes, each containing 25 samples of training data, and where only two features are providing all relevant classification information. By “relevant,” it is meant that those features are distributed uniquely (independently) for each class, with the remaining features out of the total of six being distributed the same amongst the classes. Thus, in this scenario, most of the features are not providing any useful discriminating information.

The effectiveness of the BDRA at improving overall classification performance is clearly shown in Fig. 2. This remains true even as the initial number of discrete levels per feature increases, where before data reduction the error probabilities are approaching 0.5. Intuitively, with a small fixed number of training data, it becomes difficult to estimate probabilities for an increasing number of discrete symbols. However, after training, a reduction in the number of discrete symbols makes estimating the probabilities more accurate. In fact, before training, the initial quantization complexities of the four cases shown totaled 64, 729, 4096, and 15 625, respectively, and after training, these are reduced by the BDRA to an average of 2.6, 3.9, 4.3, and 5.1. These final quantization complexities are actually lower than expected for two relevant features, implying that the BDRA has “over-reduced” the data to improve performance.

In Fig. 3, performance of the BDRA is compared with several neural networks and a linear classifier. The classification scenario in this figure is the same as in Fig. 2, and Optimal and Reduced (True) results are repeated. With respect to the other classifiers, the following notation is used for the error probabilities shown (for a description of each neural network type, see Appendix D):

NN(BP): the true probability of error computed using a trained backpropagation type neural network (with an adaptive learning rate and momentum) as a decision rule conditioned on the true symbol probabilities. This and the error probabilities of the following neural networks are computed similarly to that for Unreduced (True) of the BDRA;
NN(RBF): the same as in the previous item, except that the decision rule is a trained radial basis function type neural network;
NN(LVQ): the same as in the previous two items, except that the decision rule is a trained learning vector quantization type neural network;
Linear: the true probability of error computed using a linear discriminant (a mean vector estimated from the training data for each class) as a decision rule conditioned on the true symbol probabilities.

In all cases of Fig. 3, the BDRA is superior to the other classifiers in that it achieves a lower probability of error and a smaller sample standard deviation (for a definition of this metric, see [15], for example). In this and other cases, the sample standard deviation is computed by averaging over all numbers of discrete levels per feature and is used in place of error bars to avoid cluttering the figure. The best performing classifier other than the BDRA is the neural network trained using backpropagation that has a single hidden layer containing eight nodes. Note that the same network was also trained using a four-node hidden layer, but its performance was inferior in all situations appearing here, and it was left out to avoid redundancy in the results. The neural networks appear to show greater degradation in performance than the BDRA as the number of discrete levels per feature increases.

In Fig. 4, the scenario of Fig. 3 is repeated, except that the training data of each class is increased to 100 samples, and as expected, all classifiers are performing better with the additional information. Once again, the BDRA is the best performing classifier; however, relative performance differences have decreased for all classifiers, and the best performing alternative classifier is still NN(BP).

In Fig. 5, results similar to those shown in Fig. 3 are plotted, except that now, each data vector contains no irrelevant features. Observe that even though all features are relevant to correct classification (i.e., less opportunity for gains from feature reduction), the BDRA is still able to reduce model complexity to improve performance. As compared with Fig. 3, performance of the BDRA has diminished with respect to the other classifiers, which appear to “prefer” more relevant features.


Fig. 3. Performance comparison of the BDRA to several neural networks and a linear classifier using two relevant features.

Fig. 4. Performance comparison of Fig. 3 with one hundred samples of training data for each class.

This is natural, and it is perhaps surprising that the BDRA is able to offer any performance advantage at all; intuitively, however, the BDRA is removing features that, while relevant, are comparatively uninformative given the training data volumes available.

In this situation, the over-reducing tendency of the BDRA, which is less of a problem in Fig. 3, has lowered its relative ability, as the likelihood has increased that useful feature information will be thrown away. With respect to that, final average quantization complexities were reduced to 3.4, 4.8, 5.7, and 6.2, respectively. In other words, the BDRA reduced the data in all cases of Fig. 5, on average, to fewer than two of the most predominant relevant features. Comparing Fig. 3 with Fig. 5, it appears that the BDRA tends to favor situations in which the

Fig. 5. Performance comparison of Fig. 3 with six relevant features for each class.

Fig. 6. Performance comparison of Fig. 5 using one hundred samples of training data for each class.

feature set contains a smaller number of very useful features and a relatively large number of useless features. Intuitively, such a case is most suited to feature selection.

In Fig. 6, through an increase of the training data size to 100 samples, the BDRA is somewhat better able to deal with the situation of Fig. 5. In fact, the BDRA even shows a slight improvement in its relative performance in this figure with respect to the other classifiers. This finding, and that of Fig. 4, helps to validate the BDRA’s relative performance superiority, as neural networks tend to prefer more data [60].

B. BDRA Applied to the Selection of Class-Specific Features

1) The Class-Specific Classifier: In [2], an approach to reducing the dimensionality of a feature set was developed by


reformulating the optimum Bayesian classifier for $C$ classes, given by

$$\hat{k} = \arg\max_{k} p(\mathbf{y} \mid H_k) \qquad (7)$$

into an equivalent class-specific classifier having the form

$$\hat{k} = \arg\max_{k} \frac{p(z_k \mid H_k)}{p(z_k \mid H_0)} \qquad (8)$$

where (for more on notation, see Section II) we have the following.

$\theta_k$ completely parameterizes the data representing the $k$th class.
$z_k$ is a sufficient statistic for $\theta_k$.
$p(z_k \mid H_0)$ is a normalizing distribution under the common null hypothesis $H_0$ and is the same for all $k$.

Equation (7) can be expressed as (8) when each class has a unique sufficient statistic $z_k$, which captures all relevant information about the parameter $\theta_k$. Additionally, under the null hypothesis, all classes must be distributed the same (i.e., $p(\mathbf{y} \mid H_0)$ is identical for all $k$). Notice that (8) is an alternative method of reducing the dimensionality of a training data set. That is, the class-specific idea is useful for cases in which the low-dimensional $p(z_k \mid H_k)$ of (8) can be estimated, as opposed to the higher dimensional $p(\mathbf{y} \mid H_k)$ of (7). Thus, it is of interest in this section to determine if the BDRA can effectively reduce irrelevant information (i.e., null distributed features that are common to each class) from a training data set that contains class-specific features. As a measure of performance, the probability of error for the BDRA [i.e., Reduced (True) from Fig. 2] is compared with the probability of error for (7) and (8).
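A toy numerical sketch of the class-specific rule (8) follows (ours, with assumed unit-variance Gaussian densities rather than the discrete model of the BDRA; the mean shift MU is hypothetical).

```python
# Each target class k (k = 1, 2) is summarized by one sufficient statistic
# z_k (the value of its class-specific feature), and its density is taken
# relative to a common null in which every feature is N(0, 1), as in Eq. (8).
import numpy as np

MU = 1.5  # assumed class-specific mean shift

def classify_class_specific(y):
    """Class k shifts only feature k - 1; z_k is that feature's value."""
    # N(MU,1)/N(0,1) likelihood ratio at z: exp(MU*z - MU^2/2), i.e., Eq. (8).
    ratios = [np.exp(MU * y[k] - 0.5 * MU ** 2) for k in range(2)]
    return int(np.argmax(ratios)) + 1

rng = np.random.default_rng(1)
trials, errors = 10_000, 0
for _ in range(trials):
    k = int(rng.integers(1, 3))          # target class 1 or 2
    y = rng.normal(size=2)               # null-distributed features
    y[k - 1] += MU                       # the class-specific feature
    errors += classify_class_specific(y) != k
print(errors / trials)                   # empirical probability of error
```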

2) Performance of the BDRA With Class-Specific Features: Performance results of applying the BDRA to simulated class-specific features appear in Fig. 7, where it is assumed that there are a total of three classes (i.e., $C = 3$). Specifically, there are two target classes of interest and one null class representing the distribution of irrelevant features. In the class-specific paradigm, the irrelevant feature distribution (i.e., common null) is considered a separate class, which is not necessary in the BDRA. Therefore, class-specific features are really relevant features, except that, as opposed to the cases shown in Figs. 2–6, they are not constrained to be the same features for the nonnull distributed classes. The results presented in Fig. 7 are averaged over 250 independent trials of randomly generated symbol probabilities and associated training data, and for each trial, 10 000 independent samples of test data are generated. Further, the data sets of each class contain six binary valued features so that initially, $M = 2^6 = 64$.

In Fig. 7, the probability of error is plotted versus the true number of class-specific features (out of a total of six binary valued features) for classifying the two target classes of interest, labeled as classes 1 and 2. Observe that class 0 is the common null class that does not directly affect the performance results. With respect to notation, the following items describe the error probabilities appearing in Fig. 7, and note that optimal error probabilities (for a description, see the text accompanying Fig. 2)

Fig. 7. Performance of the BDRA compared with the class-specific classifier.

have been constrained to be between 0.05 and 0.1, which keeps the values relatively constant over all numbers of relevant features (see Appendix C for a discussion on generating true symbol probabilities):

BDRA (Initial): the empirical probability of error computed using the decision rule of (1) and the initial training data configuration of the target classes before data reduction [i.e., all feature information, which is the same as using (7)];
BDRA (Trained): the probability of error computed using the decision rule of (1) and the final reduced training data configuration for the target classes (i.e., the remaining feature information after the BDRA has trained on the data of the target classes);³
CLASSP: the empirical probability of error computed using the decision rule of (8), which uses the appropriate class-specific features of the target classes, and based on the initial training data of each class (this classifier is partially clairvoyant in that it knows the relevant features of each class);
CLASSP (True null): the empirical probability of error computed using the decision rule of (8), the appropriate class-specific features extracted from the initial training data of the target classes, and the true probabilities for the null class (i.e., the difference between this case and CLASSP above is that the null class uses the appropriate clairvoyant probabilities).

³The results BDRA (Initial) and BDRA (Trained) are analogous, respectively, to Unreduced (True) and Reduced (True) of Figs. 2–6, except that in the former case, performance is now determined by generating a separate test set as opposed to using the actual symbol probabilities. This was done to simplify obtaining results for the class-specific classifier, due to its form, described in the next two items.


In Fig. 7, all given true numbers of class-specific features (out of the total six) are determined randomly for each class. That is, the only instance in which both target classes are guaranteed to have the same class-specific features is when all six features are class-specific. Thus, for a given class, any feature that is not class-specific is distributed according to the common null distribution. Further, notice that the training data consists of only five samples for each class (including the null class), which ensures that the curse of dimensionality is predominant in the data.

Observe in Fig. 7 that the BDRA (Trained) classifier can outperform the class-specific classifier of (8) by achieving an overall lower probability of error, independent of whether the null probabilities are actual [CLASSP (True null)] or estimated from training samples (CLASSP). The associated sample standard deviations of these classifiers turned out to be similar and much larger than optimal due to a small sample size. As expected, the worst performing method is BDRA (Initial), which uses all feature information based on (7), although this method also produces a lower sample standard deviation than the other classifiers. It is further seen that as more class-specific features are added to the feature vectors of each class, the performance of the class-specific classifier intuitively becomes more like BDRA (Initial), and with all class-specific features, they are the same.

An interesting observation from this result is that the BDRA’s performance is essentially based on automatically identifying the best performing features without directly observing the null. On the other hand, the class-specific classifier of (8) not only knows a priori which features are more relevant to correct classification, but it also has access to the null distribution. It is suspected that the BDRA’s performance is superior due to its ability to more adequately measure the feature information, using (5), that is jointly important to correct classification. That is, where the class-specific classifier is constrained to a fixed number of features (higher dimensionality), the BDRA, being unconstrained, can reduce even class-specific features and take the dimensionality down lower, as determined by the probability of error [i.e., (5)].

C. Performance of the BDRA Using Real Data

In this section, the BDRA is trained and tested (see items 1, 2, 5, and 6 of Appendix A) on 28 real data sets that are contained at the University of California at Irvine (UCI) Repository of Machine Learning Databases [44]. Additionally, in this section, the BDRA is also compared to the linear classifier that was used with simulated data. The linear classifier represents a relatively easy-to-understand metric for comparing and interpreting performance results based on data with unknown characteristics. In particular, the linear classifier requires no tuning or specialized training algorithm, and it works best when the joint feature spaces of each class are linearly separable. However, most of the data sets found at the UCI Repository have been explored using many different classification techniques, and in Section III-C1, the performance of the BDRA is discussed against previous work.

Before discussing results of applying the BDRA to real data, it is necessary to modify the algorithm in order to account for the possibility that the data of either class contains missing feature information (see item 7 of Appendix A). As it turns out, slightly more than one third of the UCI data sets used in this work have missing features. Basically, by missing features, it is meant that the feature vectors of the $k$th class are assumed to be made up of either or both of two observation types: features that are represented by discrete values and missing features that have no values. For example, with three binary features, a possible feature vector that is missing a single feature might appear as $(1, \ast, 0)$, where $\ast$ represents the missing value. In this case, $\ast$ can have the value of 0 or 1, so that this feature vector has a cardinality of two. In general, all missing features are assumed to appear according to an unknown probability distribution.

With respect to missing features, an often-used approach is to “fill in” missing values with estimates obtained from all known feature values (e.g., such as the sample mean) [9]. In fact, this method is used for the linear classifier employed here (if the feature is discrete, the median is used instead). However, to model missing feature information in the BDRA, an additional fill-in type approach is employed. In this approach, the number of discrete levels for each feature is increased by one so that all missing values for that feature are assigned to the same level. For example, starting with six binary valued features, and if it is known that any of these features can be missing from either the training or test data, the initial quantization complexity is increased from $M = 2^6 = 64$ to $M = 3^6 = 729$. In other words, in this case, each feature is increased from having binary to ternary values.
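A small sketch of this encoding (ours; the rows below are hypothetical) shows how each feature gains one extra discrete level that is reserved for “missing,” so a missing entry is just another observable symbol value.

```python
# Map None entries of each feature to the extra level q: levels 0..q-1 are
# the observed values, and level q means "missing."
import numpy as np

def encode_missing(data, n_levels):
    """Return an integer array in which missing values occupy level q."""
    out = np.empty((len(data), len(n_levels)), dtype=int)
    for r, row in enumerate(data):
        for c, v in enumerate(row):
            out[r, c] = n_levels[c] if v is None else v
    return out

# Hypothetical rows with three binary features; None marks a missing value.
rows = [(1, None, 0), (0, 1, 1), (None, None, 0)]
print(encode_missing(rows, [2, 2, 2]))
# Each feature is now ternary, so the cardinality grows from 2^3 = 8 to 3^3 = 27.
```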

An alternate approach to missing features was also applied to the BDRA, in which the distribution of (4) is extended such that each missing feature is assumed to be uniformly distributed over its range of values. That is, each data vector (training or test) with missing features is replaced by its entire cardinality of values (i.e., all possible discrete symbols the feature vector can take on if all possible arrangements of values are substituted in for the missing features), each contributing one divided by that cardinality to a symbol’s count. Results from this method are not discussed here, as the observed performance differences have not, in the cases explored, warranted the extra computational burden.

Tables III–V give the characteristics of each data set and classification results for the BDRA and the linear classifier for all real data sets analyzed here. Listed in the columns of these tables (and in the header of Table III), from left to right, are a numerical label for each data set (DS),⁴ the total number of data ($N$), the total number of classes ($C$), the total number of continuous ($f_c$) and discrete ($f_d$) valued features, the percent of feature vectors with at least one missing feature value (% Miss), the ratio of the final (as determined by the BDRA) and initial average quantization complexities ($M_f/M_i$),⁵ the best number of discrete levels used to initially quantize all continuous valued features ($q^\star$),⁶ and the average probability of error ($P(e)$) and sample standard deviation ($\sigma$) results, respectively, for both the BDRA and linear classifiers.

⁴At the bottom of each table is a DS name key, and a subscript refers to a specific note that appears in Appendix E.
⁵The $\star$ symbol is synonymous with the best performing value.
⁶For each data set, results were determined for 2, 3, 4, and 5 initial levels of quantization for each continuous valued feature. Also, n/a indicates that the original data contains no continuous valued features.


TABLE III
DATA CHARACTERISTICS AND CLASSIFICATION PERFORMANCE. THE COLUMNS HAVE THE FOLLOWING MEANING: DATA SET (DS) (SUBSCRIPT REFERS TO NOTE IN APPENDIX E); TOTAL NUMBER OF DATA ($N$); NUMBER OF CLASSES ($C$); NUMBER OF CONTINUOUS ($f_c$) AND DISCRETE ($f_d$) VALUED FEATURES; PERCENT OF FEATURE VECTORS MISSING FEATURE VALUES (% MISS); RATIO OF FINAL AND INITIAL QUANTIZATION COMPLEXITIES ($M_f/M_i$); BEST NUMBER OF LEVELS TO INITIALLY QUANTIZE CONTINUOUS VALUED FEATURES ($q^\star$), WHERE N/A INDICATES THE DATA CONTAINS NO CONTINUOUS VALUES; AND THE AVERAGE PROBABILITY OF ERROR ($P(e)$) AND SAMPLE STANDARD DEVIATION ($\sigma$) FOR THE BDRA AND LINEAR CLASSIFIERS

All estimated entries found in Tables III–V are based on the hold-out 10% method of training and testing, as described in item 6 of Appendix A. In the tables, and for each data set, the hold-out 10% method was applied seven times so that final performance results are based on an average of 70 trials.

Table VI summarizes the results appearing in Tables III–V by showing mean values for specific collections of data sets defined by the left-most column of each table. In particular, mean values appear for the entire collection of data (All), for all data sets that have no missing features (No missing), for all data sets that have missing features (All missing), for all data sets that have only continuous valued features (Cont. features only), for all data sets that have only discrete valued features (Disc. features only), and for all data sets that have both continuous and discrete valued features (Mixed features only). In addition, in this table, and for each classifier, the number of times the respective classifier obtained the lowest value of $P(e)$ (# wins) appears.

From the row “All” of Table VI, it can be seen that, overall, the BDRA is able to substantially outperform the linear classifier by obtaining a lower average error probability for 20 out of 28 data sets. However, the standard deviations of each classifier are also relatively similar, indicating that the linear classifier tends to have more bias error with respect to the unknown optimal classifiers of each data set [60]. Additionally, in reducing the data, the BDRA tends to drive the quantization complexity down to an average of about 10% of its initial value. Therefore, and referring back to Figs. 1–7, this result implies that the data contains a significant number of “irrelevant” features, as the BDRA’s relative performance tends to diminish when it reduces completely relevant features.

With these results, notice that, before reducing the data, the BDRA prefers to initially quantize continuous valued features

TABLE IV
DATA CHARACTERISTICS AND CLASSIFICATION PERFORMANCE, CONTINUED. SEE TABLE III FOR THE MEANING OF EACH COLUMN

TABLE V
DATA CHARACTERISTICS AND CLASSIFICATION PERFORMANCE, CONTINUED. SEE TABLE III FOR THE MEANING OF EACH COLUMN

TABLE VI
MEAN VALUES FOR TABLES III–V. SEE TABLE III FOR THE MEANING OF EACH COLUMN

to an average of less than three levels per feature (see the quantity q′). In this case, it is suspected that if more than five initial quantization levels had been used for each feature, then average q′ values would have increased. When applying the BDRA to continuous valued data, this is the only tuning parameter that must be adjusted (no tuning is required for all-discrete data), and in general, it can be determined by observing the value of q at which the conditional probability of error in (5) is minimized.
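Concretely, this tuning rule can be sketched as below. It is an illustration only: the quantizer and the analytic conditional error of (5) are passed in as user-supplied callables, since their internals are not reproduced here.

    def select_initial_levels(X, y, quantize, conditional_error,
                              candidate_q=(2, 3, 4, 5)):
        """Pick the initial number of quantization levels q' by
        minimizing the BDRA's conditional probability of error.
        'quantize' bins continuous features into q levels and
        'conditional_error' evaluates P(e | training data); both are
        placeholders standing in for the paper's machinery."""
        scored = [(conditional_error(quantize(X, q), y), q) for q in candidate_q]
        best_pe, best_q = min(scored)
        return best_q, best_pe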


Another significant observation in Table VI is that a large part of the BDRA's advantage over the linear classifier appears to lie in its ability to deal with missing feature values. Observe that the linear classifier's worst performance occurs with "Mixed feature" data, which, as it turns out, has a large percentage of missing feature values. On the other hand, the BDRA shows its best performance with data that contain missing features. It is suspected that the BDRA's performance in this case is due to the additional level of quantization employed for each missing feature. That is, relevant information for correct classification is contained in each class having a different number of missing feature values, which is effectively modeled by an additional level of quantization. For example, an examination of the Credit Screening data of Table III reveals that the class defined as "those people who were denied credit" has more missing values (i.e., more blank answers on credit applications) than those who were issued credit.
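The extra-level device itself is simple; a minimal sketch (our illustration, assuming missing entries are encoded as NaN) is:

    import numpy as np

    def add_missing_level(X_quantized, q):
        """Map each missing entry (NaN) of a q-level discrete feature
        array to an extra level q, so that 'missing' becomes its own
        symbol; levels 0..q-1 remain the observed values."""
        Xq = np.asarray(X_quantized, dtype=float).copy()
        Xq[np.isnan(Xq)] = q
        return Xq.astype(int)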

1) Performance of the BDRA With Respect to Other Results: It is, in general, somewhat difficult to compare the results of one classification method with previously published results using the same data, because the experimental setups of different authors can vary widely. For example, different results with the same data can involve different partitionings of the training and test sets and different numbers of statistical trials. Also, in some cases, previous results may not even utilize the complete set of available features for training (see the next two paragraphs for examples of this). Despite these potential difficulties, the purpose of this section is to compare the performance of the BDRA on a single data set from the tables above with a random sampling of other published results utilizing the same data.

A data set that is often utilized for illustrating performance results is the Credit Screening data of Table III, which is also widely referred to as the Australian Credit Card Data. From this table, the BDRA obtained an average probability of error of 0.136 (the linear classifier obtained 0.379) on this data, where all 15 available features were utilized for training. Notice that in obtaining this performance, the BDRA removed much feature information from the data (i.e., it drove the ratio of final to initial quantization complexities far below unity; see Table III). Recall that the results in Table III were based on randomly shuffling the data seven times and, for each shuffle, applying the hold-out 10% method, with stratification, for training and testing. In comparison, the following items summarize other (best for each item) results obtained with the Credit Screening data, itemized according to reference number:7

Reference [4]: A 0.141 probability of error was obtained using a nearest neighbor classifier with a forward sequential feature search.

Reference [20]: A 0.152 probability of error was obtained using the decision tree type classifier C4.5, where the results are based on an average of 20 independent trials, and two thirds of the data were used for training.

7 Additional performance comparisons appear in Appendix E, where the cases in which the BDRA performed much worse than other average results are noted.

Reference [25]: A 0.125 probability of error was obtained using a dynamic discretizer and 14 of the 15 available features.

Reference [56]: A 0.143 probability of error was obtained using C4.5 and a feature subset selection method based on dynamic relevance, where 14 out of 15 features were used for training.

Reference [59]: A 0.115 probability of error was obtained using an evolutionary type neural network, where the results are based on an average of 30 independent trials, and 75% of the data were used for training.

Many of the above results arise from considerable author tuning and intervention; hence, we feel it significant that the BDRA, which requires very little of either, performs in the thick of the best reports.

2) Training Time Performance of the BDRA: In Fig. 8, the time the BDRA takes to train is considered based on the data set labeled Yeast (with eight continuous valued features) in Table V. In this case, the average CPU time in seconds is shown versus the initial number of levels (ignore noninteger values) to which each continuous valued feature is quantized before training. Results are shown using all ten classes of the Yeast data containing 1484 samples and a two-class subset that contains 892 samples. It can be seen that the amount of time the BDRA takes to train depends strongly on the initial quantization complexity of the data, which for the Yeast data varies from 256 to 390 625 discrete symbols. As a comparison, the same results were obtained for the linear classifier used in Tables III–VI. For two initial quantization levels per feature, the average training time for the linear classifier is 0.06 s, and for five levels, this increases to 0.49 s. The linear classifier requires much less training time than the BDRA, except when the features are initially quantized to two discrete levels. However, the results in Tables III–VI revealed that, in many cases, good performance can be obtained with fewer than five discrete symbols per continuous feature, and this helps keep training times for the BDRA at reasonable levels.
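The quoted range of discrete symbols follows from the initial quantization complexity M = q^f for f features at q levels each; with the Yeast data's f = 8, the snippet below (arithmetic illustration only) reproduces it:

    # Initial quantization complexity M = q ** f for f = 8 features.
    for q in (2, 3, 4, 5):
        print(q, q ** 8)  # -> 256, 6561, 65536, 390625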

IV. SUMMARY

In this paper, a classification approach known as the Bayesian data reduction algorithm (BDRA) is discussed. The main features of the BDRA are the following.

• It operates on discrete data; as such, continuous-valued data must be quantized prior to use of the BDRA. There is no set limit on this quantization fineness, however.

• The decision function is the "combined Bayes test," which combines the training and test data in a likelihood ratio. In a sense, this coupling allows the test data to contribute to training.

• The combined Bayes test is a Bayes factor [3], [8], meaning that it can be approximated using the Bayesian information criterion (BIC) [54], which can also be used to reduce the dimensionality of data (or for model order selection). However, the disadvantage of using BIC is that dimensionality reduction is performed on each class independent of the others, or (unlike with the BDRA)


Fig. 8. Elapsed time in seconds for the BDRA to complete training based on the data set labeled Yeast.

the reduction does not pay respect to best classification performance. (The drawbacks of other dimensionality reduction methods that do not pay respect to discriminating capability are discussed in [21] and [47].)

• A uniform Dirichlet prior [as defined in (11) and (12) of Appendix B], corresponding to complete prior ignorance, is placed on the training data for each class.

• The Dirichlet prior allows an analytic expression for the probability of error. This is usually applied in its conditional form, meaning the probability of error given the training data. It has been found that this probability of error expression often matches the true probability of error.

• The probability of error expression can be used for feature reduction. Specifically, the BDRA uses a greedy approach to coarsen discrete-valued features, and to remove a feature entirely when it is already binary and the probability of error expression indicates that further coarsening is appropriate (a sketch of this greedy search appears after this list).

• The Dirichlet prior has an implicit "penalty term" for model complexity, and hence, feature reduction can be done in a rational and nonheuristic way. In fact, this item and the previous two have strongly motivated this work.
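A schematic Python sketch of the greedy search (similar to a backward sequential feature search) follows; all three callables are user-supplied placeholders, not the authors' implementation, and tie-breaking in favor of reduction is one plausible variant:

    def greedy_reduce(state, conditional_error, coarsenings):
        """'state' is the current quantization of the training data;
        coarsenings(state) enumerates every single-threshold removal
        (merging two adjacent levels of one feature; a binary feature
        whose last threshold is removed drops out entirely), with the
        data re-encoded accordingly; conditional_error(state) is the
        analytic P(e | training data)."""
        best_pe = conditional_error(state)
        while True:
            # Evaluate every one-step coarsening of the current state.
            candidates = [(conditional_error(s), s) for s in coarsenings(state)]
            if not candidates:
                break
            pe, s = min(candidates, key=lambda t: t[0])
            if pe > best_pe:         # no coarsening helps: stop
                break
            best_pe, state = pe, s   # keep the best single reduction and repeat
        return state, best_pe

Each accepted step removes one threshold, so the search terminates once no further coarsening lowers the conditional probability of error.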

The BDRA has been discussed in a number of prior conference papers, and it has several features not explored here; for example, the BDRA can deal analytically with training data that may have been mislabeled, and it has a sound probabilistic model to treat (genuinely8) missing features. The purpose of this paper is to record the BDRA, to explore its properties, and to demonstrate its performance on real data sets.

The BDRA was shown to be superior to various classifiers using simulated data, and it was shown that the BDRA's self-confidence (its own analytic probability of error calculated conditionally on the training data) is reasonably accurate and usually conservative. This result remained true when the BDRA

8 Some features are missing because, due to some exogenous and random obfuscation, they are simply not observed. In other cases, meaning may be attached to the fact that a feature is missing, as discussed for the Credit Screening data set; presumably, in that case, certain questions are left unanswered simply because the respondents did not want to answer them.

was also used to classify simulated class-specific features. Apparently, the feature reduction ability of the BDRA is useful to "prune" features that may be relevant but are not adequately supported by the training data. The BDRA tends to perform well in cases in which the "curse of dimensionality" is active, that is, where there are too many features to be supported by a fairly sparse set of training data. When the classification problem is difficult, the BDRA is a sound choice. However, when the classification problem is a matter of learning a mapping, other algorithms may be preferable; this was illustrated by the narrowing of the performance gap as the training data size increased.

The BDRA was also applied to 28 real data sets from the UCI Repository. Results demonstrated that the BDRA was overall superior to a baseline linear classifier. In a comparison to previous work using the Credit Screening data, the BDRA's performance was similar to the "best" performances reported in the literature. The training time of the BDRA was also observed and indicates that good performance can be obtained at reasonable complexity.

APPENDIX A
ASSUMPTIONS AND CONDITIONS UNDERLYING THE ANALYSIS IN THIS WORK

In order to aid in interpreting and understanding the results, the following items describe all of the primary assumptions and conditions underlying this work.

1) No prior information is assumed to exist about the underlying distributions of the features for each class. In the BDRA, this uncertainty is modeled using the uniform Dirichlet distribution.

2) In all cases, the feature vectors (i.e., data samples) are assumed to be independent from datum to datum.

3) For the simulated data results, each classifier trains on discretized feature vectors that have been generated according to Appendix C and contain the respective number of discrete levels given in Figs. 2–6. The results in each of these figures are based on an average of one hundred independent trials of randomly generating true symbol probabilities and training data. P(e) is determined for each classifier by averaging its associated trained decision metric over all possible test observations, conditioned on the true symbol probability distributions of each class.

4) The class-specific results in Fig. 7 are only shown for binary valued features. The results in this figure are based on an average of 250 independent trials of randomly generating true symbol probabilities and training data. For each trial, P(e) is computed by also generating 10 000 independent samples of test data and averaging the results. P(e) is computed in this way, as opposed to the approach described in the previous item, due to the form of the class-specific classifier.

5) With the real data results (see Fig. 1 and Tables III–V), the BDRA is the only classifier that always trains on discretized feature vectors. That is, any continuous valued features are first binned according to percentiles (e.g., for three discrete levels, thresholds are computed to place the feature values in equally sized cells, with each cell receiving 33% of the data) before training; a sketch of this binning appears after this list. In this case, the linear classifier trains on the data as it appears in the original data set. Error probabilities are estimated for the real data by averaging the results of seven iterations of applying the hold-out 10% method with stratification.

6) In the hold-out 10% method used to generate the results in Tables III–V, the data are partitioned into ten disjoint sets; each set is used in turn for testing, with the remaining data (the nine sets not used for testing) used for training. Stratification is also utilized so that training and test sets have roughly the same class proportions. Final tabulated results are then obtained by averaging over the values computed with the ten disjoint partitions.

7) Missing features in the training and test data are assumed to occur with an unknown distribution and are handled using separate "fill-in" type methods for the BDRA and the linear classifier. For the BDRA, missing feature values are assigned to an additional level of quantization; for the linear classifier, missing values are assigned the mean value (median for discrete features) of the known training data.
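A minimal sketch of the percentile binning of item 5, assuming missing entries are encoded as NaN (our illustration, pairing with the missing-level mapping sketched in Section III):

    import numpy as np

    def quantize_percentile(X, q):
        """Bin each continuous feature (column) into q discrete levels
        using percentile thresholds, so each cell receives roughly 1/q
        of the data (33% per cell for q = 3, as in item 5); missing
        entries (NaN) are preserved for the fill-in step of item 7."""
        X = np.asarray(X, dtype=float)
        Xq = np.empty_like(X)
        for j in range(X.shape[1]):
            col = X[:, j]
            cuts = np.nanpercentile(col, np.linspace(0, 100, q + 1)[1:-1])
            Xq[:, j] = np.digitize(col, cuts)  # levels 0 .. q-1
            Xq[np.isnan(col), j] = np.nan      # keep 'missing' marked
        return Xq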

APPENDIX B
OUTLINE OF THE DEVELOPMENT OF (2) AND (3)

The development of the posterior distributions shown in (2) and (3) is based on solving an integral expression of the type given by

\int f(\mathbf{y} \mid \mathbf{p}) f(\mathbf{p}) \, d\mathbf{p}    (9)

with each integration taken over the (M-1)-dimensional unit hyperplane, or simplex [46], where \mathbf{p} = (p_1, \ldots, p_M) is the vector of discrete symbol probabilities, y_i is the combined number of training and test observations of symbol i with N = \sum_{i=1}^{M} y_i, and where

f(\mathbf{y} \mid \mathbf{p}) = \frac{N!}{\prod_{i=1}^{M} y_i!} \prod_{i=1}^{M} p_i^{y_i}    (10)

and

f(\mathbf{p}) = (M-1)!    (11)

Observe that f(\mathbf{y} \mid \mathbf{p}) is multinomially distributed (with combined training and test observations) and that \mathbf{p} is uniformly distributed on the positive unit hyperplane, as represented by a Dirichlet distribution (which is also referred to in [45] as the multivariate beta density). Here, the Dirichlet distribution is used as an ignorance prior; that is, no prior information is assumed to exist about the probability vector \mathbf{p}. In its general form, this distribution is given by

f(\mathbf{p}) = \frac{\Gamma\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)} \prod_{i=1}^{M} p_i^{\alpha_i - 1}    (12)

where, as in (11), it becomes uniformly distributed when its parameters are selected to be unity [11]. It should be noted that if the \alpha_i in (12) are each set to 1/2, then this represents Jeffreys' prior for the multinomial distribution (see [14]). Jeffreys' prior is often chosen as a noninformative prior, as it is invariant to transformations on the parameters [16]. However, it is not used here in the BDRA, as it does not treat each value of the unknown probability vector as equally likely.

The integral of (9) is solved using the methods shown in [40] and [45] [i.e., by factoring the Dirichlet distribution and repeatedly applying the definite integral \int_0^1 x^a (1-x)^b \, dx = a! \, b! / (a+b+1)! for nonnegative integers a and b], which results directly in (2). Equation (3) is obtained in the same manner, except that an "uncombined" multinomial distribution

f(\mathbf{x} \mid \mathbf{p}) = \frac{n!}{\prod_{i=1}^{M} x_i!} \prod_{i=1}^{M} p_i^{x_i}    (13)

based on the n training observations alone, is used in (9) in place of (10). In [8], the posterior distribution that results from using (11) and (13) in (9) is referred to as the multinomial-Dirichlet distribution.

APPENDIX C
GENERATION OF THE TRUE SYMBOL PROBABILITIES AND DATA

As related to the simulated data reported here, it can be seen in all performance figures that the optimal probabilities of error were constrained to be less than 0.1 (except in Fig. 7, where an additional constraint of greater than 0.05 was also used to make all true error probabilities relatively constant). To achieve this constraint on a consistent basis, Gaussian mixture densities were used for generating the underlying true symbol probabilities. In doing so, two equiprobable Gaussian mixtures were used for each relevant feature, and three equiprobable mixtures were used for irrelevant ones. That is, for each feature, a Gaussian random variable and the complementary error function are used to generate probabilities for each of its associated discrete levels. Thus, for example, the probability of observing a binary one for a feature is equivalent to the probability of observing a positive value for the associated element of the corresponding Gaussian mixture. These probabilities are then normalized to sum to unity, and this procedure is repeated either two or three times, depending on the feature type (i.e., relevant or irrelevant). The resulting feature probabilities of each mixture component are then summed together. Repeating this procedure for each feature defines the probabilities for all of the discrete levels of the complete set of features, which, when appropriately multiplied, defines the probabilities for each of the discrete symbols. Using this model, controlling the probability of error to meet the specified constraints was done by adjusting the spread of the means (i.e., the probabilities of the discrete symbols for the features). Gaussian mixtures were chosen instead of the Dirichlet distribution, or a similar uniform type distribution, because the probability of error using the Dirichlet converges to 0.25 with symbol quantity, whereas its variance approaches zero (see Appendix F). Therefore, constraining the probability of error to small values using the Dirichlet can be computationally impractical.
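One way to realize this construction in code is sketched below. It is an illustration under the stated description, with the component spread as a free tuning knob (two equiprobable components for relevant features, three for irrelevant ones), not the authors' exact generator:

    import numpy as np
    from scipy.stats import norm

    def feature_level_probs(q, n_components, spread, rng):
        """Discrete-level probabilities for one feature: each
        equiprobable Gaussian component contributes the probability
        mass it places between fixed thresholds (the erfc-based
        construction), and the components are then averaged."""
        edges = norm.ppf(np.linspace(0.0, 1.0, q + 1))  # -inf, ..., +inf
        probs = np.zeros(q)
        for _ in range(n_components):
            mu = rng.normal(0.0, spread)                # random component mean
            probs += np.diff(norm.cdf(edges, loc=mu))
        return probs / n_components

    def symbol_probs(per_feature_probs):
        """Joint probabilities of all discrete symbols as the product
        of per-feature level probabilities (features independent)."""
        p = np.array([1.0])
        for lp in per_feature_probs:
            p = np.outer(p, lp).ravel()
        return p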


APPENDIX D
NEURAL NETWORK DESCRIPTION

In generating results for the neural networks, all training and testing was performed using the Neural Network Toolbox of Matlab [18]. In the following paragraphs, each network is briefly described; in addition, observe that each is of the feed-forward type. With that, the input layer for each network contains six nodes corresponding to the six discrete features.

The neural network NN(BP) (whose neuron model is the log-sigmoid transfer function) was trained using backpropagation, momentum, and an adaptive learning rate. This network was specified to contain two layers, including one hidden layer consisting of eight nodes and an output layer with a number of nodes equal to the total number of classes. With this, initialization of network weights was random, and the following items describe the relevant NN(BP) parameter settings required by the Matlab software:

• maximum number of epochs to train (100);
• sum-squared error goal (0.02);
• learning rate (0.01);
• learning rate increase when adapting (1.05);
• learning rate decrease when adapting (0.7);
• momentum constant (0.9);
• maximum error ratio (1.04).

The neural network NN(RBF) uses the radial basis transfer function (an exponential), and it was trained using the solverb function of Matlab, which iteratively creates a radial basis network by adding neurons, one at a time, until the sum-squared error falls beneath the goal or a maximum number of neurons is reached. Radial basis networks consist of hidden and output layers (the output layer contains one node). In addition, the input consisted of six nodes corresponding to the six discrete features, and initialization of network weights was random. The following items describe the NN(RBF) parameter settings required by Matlab:

• maximum number of neurons (1000);
• sum-squared error goal (0.02);
• spread of radial basis layer (1.0).

The neural network NN(LVQ) was trained using learning vector quantization, which is a supervised method of training a competitive layer, where the learning is based on the Kohonen rule. There are two layers in this network, consisting of a competitive layer and a linear output layer with a number of nodes equal to the number of classes. In addition, initialization of network weights assigned six neurons to the hidden competitive layer, and the following items describe all relevant parameter settings.

• maximum number of presentations (100), that is, the training vectors are randomly selected and presented to the network until this number is reached;

• learning rate (0.01).

APPENDIX E
NOTES FOR TABLES III–V

1) Three classes were defined based on the miles per gallon attribute according to the following scheme: class 1 ( ), class 2 ( ), class 3 ( ).

2) Only features 3, 4, 5, 6, 7, and 9 are used from the original training set. Additionally, the class labels are given by attribute 13, and only those data without missing values are used.

3) The BDRA's performance is much worse than the average error probability obtained using [4], [20], and [56], which is approximately 0.2.

4) Three classes were defined based on the average median price of a home in thousands of dollars according to the following scheme: class 1 ( ), class 2 ( ), class 3 ( ).

5) This is another case in which the BDRA's performance is much worse than the average error probability, which, based on [4], is approximately 0.094.

6) The BDRA's worst relative performance occurred in this case, in which the average error probability based on [4] and [56] is approximately 0.046.

APPENDIX F
MEAN AND VARIANCE OF THE PROBABILITY OF ERROR FOR DIRICHLET DISTRIBUTED SYMBOL PROBABILITIES

Under the assumption of an optimal test, and that there are two classes labeled H0 and H1 with symbol probability vectors p and q, the test chooses H0 for an observed symbol i if p_i >= q_i. Thus, for the probability of error, we have (see [29])

(14)

(15)

(16)

where (14) and (15) result from total probability and symmetry in the probabilities, and (16) is based on the definition of an error under H0. Notice that (16) is also equivalent to

(17)

(18)

where (17) and (18) result from another application of total probability and conditional independence.

Now, using the following marginal probability for the Dirichlet, given by (see [40])

(19)

(20)

(21)


Equation (18) then becomes

(22)

(23)

(24)

Equation (24) was obtained by integrating over the positive unit hyperplane, and it is clear in this result that, under a uniform Dirichlet distribution, as M approaches infinity, the quantity E[P(e)] approaches 1/4. Using the results above, the variance of P(e) can be determined by first finding

(25)

(26)

and subtracting the square of (24) from (26) produces the result

Var

(27)

In this case, it can be seen that, as M approaches infinity, the variance of P(e) approaches a limit of zero.
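For reference, the two-class computation behind these limits can be reconstructed compactly (our notation and derivation, assuming two equiprobable classes with independent uniform-Dirichlet symbol probability vectors p and q, so that each marginal p_i is Beta(1, M-1)):

% Optimal two-class test: P(e) = \tfrac{1}{2} \sum_{i=1}^{M} \min(p_i, q_i).
% Since \Pr[\min(p_i, q_i) > t] = (1-t)^{2(M-1)} for p_i, q_i \sim \mathrm{Beta}(1, M-1),
\begin{align*}
E[\min(p_i, q_i)] &= \int_0^1 (1-t)^{2(M-1)}\,dt = \frac{1}{2M-1},\\
E[P(e)] &= \frac{M}{2(2M-1)} \longrightarrow \frac{1}{4} \quad (M \to \infty),
\end{align*}
% and a similar second-moment computation drives \mathrm{Var}[P(e)] to zero.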

As a final supplementary result, observe that (19) can be extended to any number of classes by redefining (16) as

(28)

and using (19)–(21), independence between the classes, and

(29)

Equation (28) thus becomes

(30)

Fig. 9. Mean of the probability of error for Dirichlet distributed symbol probabilities as a function of the total number of classes.

To solve (30), use is made of a standard property of binomial coefficients; with an appropriate change of variables, (30) is rewritten as

(31)

which after integrating and rearranging becomes

(32)

Fig. 9 shows a plot of (32) as a function of the total number of classes (varying from one to 50). In this figure, a separate curve appears for each discrete symbol quantity considered, including the limiting case of an infinite number of discrete symbols. The results demonstrate that the expected probability of error increases rapidly from one up to approximately ten classes; however, for larger numbers of classes, the rate of increase steadily decreases. Another observation is that the probability of error is lower for an infinite number of discrete symbols. Intuitively, with analytical distributions that are unhindered by the curse of dimensionality, a larger number of discrete symbols provides more discriminating information for any number of classes.

REFERENCES

[1] F. M. Alkoot and J. Kittler, "Multiple expert system design by combined feature selection and probability level fusion," in Proc. Int. Conf. Information Fusion, July 2000.

[2] P. M. Baggenstoss, "Class-specific feature sets in classification," IEEE Trans. Signal Processing, vol. 47, pp. 3428–3432, Dec. 1999.

[3] S. Basu and N. Ebrahimi, "Estimating the number of undetected errors: Bayesian model selection," in Proc. Ninth Int. Symp. Software Engineering, 1998, pp. 22–31.


[4] S. D. Bay, "Nearest neighbor classification from multiple feature subsets," Intell. Data Anal., vol. 34, no. 3, pp. 191–209, Aug. 1999.

[5] Y. Bar-Shalom and X. Li, Multitarget-Multisensor Tracking: Principles and Techniques. Storrs: Univ. Connecticut, 1995, Course Notes.

[6] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 36, pp. 105–142, 1999.

[7] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton, NJ: Princeton Univ. Press, 1961.

[8] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. New York: Wiley, 1994.

[9] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1995.

[10] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," in Proc. Artificial Intelligence, 1997, pp. 245–271.

[11] C. G. E. Boender and A. H. G. Rinnooy Kan, "A multinomial Bayesian approach to the estimation of population and vocabulary size," Biometrika, vol. 74, no. 4, pp. 849–856, 1987.

[12] L. J. Buturovic, "Toward Bayes-optimal linear dimension reduction," IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 420–423, Apr. 1994.

[13] W. L. Buntine, "Learning classification trees," Statist. Comput., vol. 2, pp. 63–73, 1992.

[14] L. L. Campbell, "Averaging entropy," IEEE Trans. Inform. Theory, vol. 41, pp. 338–339, Jan. 1995.

[15] G. Casella and R. L. Berger, Statistical Inference. Belmont, CA: Duxbury, 1990.

[16] B. P. Carlin and T. A. Louis, Bayes and Empirical Bayes Methods for Data Analysis. London, U.K.: Chapman & Hall, 1996.

[17] M. Delampady and J. O. Berger, "Lower bounds on Bayes factors for multinomial distributions, with applications to chi-square tests of fit," Ann. Statist., vol. 18, no. 3, pp. 1295–1316, 1990.

[18] H. Demuth and M. Beale, Neural Network Toolbox. Natick, MA: The Math Works, Inc., 1994.

[19] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.

[20] P. Domingos, "Two way induction," in Proc. Int. Conf. Tools With Artificial Intelligence, Nov. 1995.

[21] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

[22] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proc. IEEE, vol. 80, no. 10, pp. 1526–1555, Oct. 1992.

[23] K. Fukunaga, Statistical Pattern Recognition. Boston, MA: Academic, 1990.

[24] K. Fukunaga and R. R. Hayes, "Effects of sample size in classifier design," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 873–885, Aug. 1989.

[25] J. Gama, L. Toro, and C. Soares, "Dynamic discretization of continuous attributes," in Proc. 6th Ibero-Amer. Conf. Artificial Intell., 1998.

[26] R. Hanson, J. Stutz, and P. Cheeseman, "Bayesian classification theory," NASA Ames Res. Center Tech. Rep., no. FIA-90-12-7-01, Dec. 1990.

[27] S. Haykin, Neural Networks: A Comprehensive Foundation, Second Edition. Upper Saddle River, NJ: Prentice-Hall, 1999.

[28] J. P. Hoffbeck and D. A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 763–767, July 1996.

[29] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inform. Theory, vol. 14, pp. 55–63, Jan. 1968.

[30] Q. Huo, H. Jiang, and C. Lee, "A Bayesian predictive classification approach to robust speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 1997, pp. 1547–1550.

[31] Q. Huo and C. Lee, "A Bayesian predictive classification approach to robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 8, pp. 200–204, Mar. 2000.

[32] A. Jain and D. Zongker, "Feature selection: Evaluation, application, and small sample size performance," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 153–158, Feb. 1997.

[33] H. Jiang, K. Hirose, and Q. Huo, "Improving Viterbi Bayesian predictive classification via sequential Bayesian learning in robust recognition," Speech Commun., vol. 28, no. 4, pp. 313–326, Aug. 1999.

[34] J. Kittler, "Feature set search algorithms," in Pattern Recognition and Signal Processing, C. H. Chen, Ed. Alphen aan den Rijn, The Netherlands: Sijthoff and Noordhoff, 1978, pp. 41–60.

[35] Q. Li and D. W. Tufts, "Principal feature classification," IEEE Trans. Neural Networks, vol. 8, pp. 155–160, Jan. 1997.

[36] T. Lim, W. Loh, and Y. Shih, "A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms," Machine Learning J., 1999.

[37] H. Liu and R. Setiono, "Feature selection via discretization," IEEE Trans. Knowledge Data Eng., vol. 9, pp. 642–645, July/Aug. 1997.

[38] R. S. Lynch, Jr. and P. K. Willett, "Utilizing a class labeling feature in an adaptive Bayesian classifier," in Proc. SPIE Int. Conf. AeroSense, Apr. 2001.

[39] R. S. Lynch, Jr. and P. K. Willett, "Adaptive Bayesian classification using noninformative Dirichlet priors," in Proc. IEEE Int. Conf. Systems, Man, Cybern., Oct. 2000.

[40] R. S. Lynch, Jr., "Bayesian classification using noninformative Dirichlet priors," Ph.D. dissertation, Univ. Connecticut, Storrs, May 1999.

[41] R. S. Lynch, Jr. and P. K. Willett, "Bayesian classification using mislabeled training data and a noninformative prior," J. Franklin Inst., July 1999.

[42] R. S. Lynch, Jr. and P. K. Willett, "Performance considerations for a combined information classification test using Dirichlet priors," IEEE Trans. Signal Processing, vol. 47, pp. 1711–1715, June 1999.

[43] N. Merhav and Y. Ephraim, "A Bayesian classification approach with application to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 39, pp. 2157–2166, Oct. 1991.

[44] C. J. Merz and P. M. Murphy, "UCI repository of machine learning databases," Dept. Inform. Comput. Sci., Univ. California, Irvine, CA, 1996.

[45] J. E. Mosimann, "On the compound multinomial distribution, the multivariate beta-distribution, and correlations among proportions," Biometrika, vol. 49, no. 1, pp. 65–82, 1962.

[46] K. L. Oehler and R. M. Gray, "Combining image compression and classification using vector quantization," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 461–473, May 1995.

[47] M. Padmanabhan and L. R. Bahl, "Model complexity adaptation using a discriminative measure," IEEE Trans. Speech Audio Processing, vol. 8, pp. 205–208, Mar. 2000.

[48] K. Peleg and U. Ben Hanan, "Adaptive classification by neural net based prototype populations," Int. J. Pattern Recognition Artificial Intell., vol. 7, no. 4, pp. 917–933, Aug. 1993.

[49] N. S. Philip and K. B. Joseph, "Boosting the differences: A fast Bayesian classifier neural network," Intell. Data Anal., vol. 4, no. 6, pp. 463–473, 2000.

[50] W. Pieczynski, J. Bouvrais, and C. Michel, "Estimation of generalized mixture in the case of correlated sensors," IEEE Trans. Image Processing, vol. 9, pp. 308–312, Feb. 2000.

[51] S. Puuronen, V. Terziyan, A. Katasonov, and A. Tsymbal, "Dynamic integration of multiple data mining techniques in a knowledge discovery management system," in Proc. SPIE Conf. Data Mining Knowledge Discovery, Apr. 1999.

[52] H. Qiang, "Adaptive learning and compensation of hidden Markov model for robust speech recognition," Commun. COLIPS, vol. 8, no. 2, pp. 161–189, Dec. 1998.

[53] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.

[54] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, pp. 461–464, 1978.

[55] B. M. Shahshahani and D. A. Landgrebe, "The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon," IEEE Trans. Geosci. Remote Sensing, vol. 32, pp. 1087–1095, Sept. 1994.

[56] H. Wang, D. Bell, and F. Murtagh, "Axiomatic approach to feature subset selection based on relevance," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, pp. 271–277, Mar. 1999.

[57] F. Wen, P. Willett, and S. Deb, "Condition monitoring of helicopter data," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 2000.

[58] Q. Xie, C. A. Laszlo, and R. K. Ward, "Vector quantization technique for nonparametric classifier design," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1326–1329, Dec. 1993.

[59] X. Yao and Y. Liu, "A new evolutionary system for evolving artificial neural networks," IEEE Trans. Neural Networks, vol. 8, pp. 694–713, May 1997.

[60] G. P. Zhang, "Neural networks for classification: A survey," IEEE Trans. Syst., Man, Cybern., vol. 30, no. 4, pp. 451–462, Nov. 2000.


Robert S. Lynch, Jr. (M'00) was born in Albany, NY, on May 18, 1960. In 1980 and 1982, respectively, he received the A.A. and A.S. degrees from Hudson Valley Community College, Troy, NY. In 1984 and 1991, respectively, he received the B.S. and M.S. degrees, both in electrical engineering, from Union College, Schenectady, NY. In 1999, he received the Ph.D. degree in electrical engineering from the University of Connecticut, Storrs.

Since 1991, he has been with the Naval Undersea Warfare Center, Newport, RI, where he is involved in the research and development of sonar systems. His research interests are in the areas of pattern recognition and classification, detection, data fusion, tracking, and signal processing. Since 1996, he has been awarded grants from the In-House Laboratory Independent Research (ILIR) Program of NUWC to conduct basic research in classification. He has also been a previous recipient of Bid and Proposal grants and is the principal investigator for several projects at NUWC.

Dr. Lynch is a member of the International Society of Information Fusion (ISIF) and is an associate member of the IEEE Sensor, Array, and Multichannel (SAM) Technical Committee.

Peter K. Willett (SM'97) received the B.A.Sc. degree from the University of Toronto, Toronto, ON, Canada, in 1982, and the Ph.D. degree from Princeton University, Princeton, NJ, in 1986.

He is a Professor at the University of Connecticut, Storrs, where he has worked since 1986. Previously, he was with the University of Toronto. His interests are generally in the areas of communications, tracking, detection theory, and signal processing.

Dr. Willett is an associate editor both for the IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS and for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS.