
© Journal of the American Statistical Association, June 1972, Volume 67, Number 338

Theory & Methods Section

A Comparison of Some Multivariate Discrimination Procedures

M. P. GESSAMAN and P. H. GESSAMAN*

A sequence of observations is obtained; each observation is known to be a value on one of two distinct absolutely continuous p-dimensional random variables, X and Y. The problem is to decide whether each observation is on X or Y. Some discrimination procedures are suggested and, using Monte Carlo methods, they are compared with other discrimination procedures to be found in the literature. The overall best performer in the comparisons of this study is one of the suggested procedures, a "nearest neighbor" type procedure based on statistically equivalent blocks.

1. A GENERAL MODEL

Let X and Y be two absolutely continuous p-dimensional random variables; their probability density functions will be denoted by f1 and f2, respectively. Observations z_i, i = 1, 2, ..., k, are obtained at random from a set assumed to contain only representatives of X and Y. Using the observations and any available information about the distributions, a decision is to be made whether each z_i is an observation on X or Y.

Every discrimination procedure for a given problem can be discussed in the following framework: First, choose two measurable subsets, A1 and A2, of R^p. Let the discrimination procedure be defined by the rule

$$
\begin{aligned}
&\text{If } z_i \in A_1 \cap A_2',\ \text{decide } z_i \text{ is an observation on } X;\\
&\text{if } z_i \in A_1' \cap A_2,\ \text{decide } z_i \text{ is an observation on } Y;\\
&\text{if } z_i \in (A_1 \cap A_2) \cup (A_1' \cap A_2'),\ \text{reserve judgment.}
\end{aligned}
\tag{1.1}
$$

The discrimination problem obviously becomes one of choosing the sets A1 and A2. The natural choice for A1 would be a portion of R^p on which it appears more likely that an observation on X would appear than an observation on Y. Similarly, the natural choice for A2 would be a portion on which an observation on Y would seem more likely to appear.

If the probability of reserving judgment is required to be zero under both distributions, the resulting problem will be called a forced discrimination problem, since a decision about z_i must be made; otherwise, the problem will be called a partial discrimination problem.
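To make the framework concrete, here is a minimal sketch in Python of rule (1.1); the indicator functions in_A1 and in_A2 are hypothetical stand-ins for whatever measurable sets A1 and A2 the investigator has chosen, and are not part of the paper.

```python
def discriminate(z, in_A1, in_A2):
    """Apply rule (1.1). in_A1 and in_A2 are indicator functions
    (hypothetical stand-ins) for the chosen sets A1 and A2."""
    a1, a2 = in_A1(z), in_A2(z)
    if a1 and not a2:
        return "X"        # z in A1 but not A2: decide X
    if a2 and not a1:
        return "Y"        # z in A2 but not A1: decide Y
    return "reserve"      # z in both or neither: reserve judgment
```

Under forced discrimination the sets are chosen so that the "reserve" branch has probability zero; under partial discrimination it is allowed.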

2. SOME MULTIVARIATE DISCRIMINATION PROCEDURES BASED ON STATISTICALLY EQUIVALENT BLOCKS

Each of the procedures suggested in this section is based on a construction for a single sample X_1, X_2, ..., X_m described by Gessaman [7]; this construction will be outlined here for the reader's convenience.

* M. P. Gessaman is associate professor, Department of Mathematics, University of Nebraska, Omaha, Neb. 68101. P. H. Gessaman is assistant professor, Department of Agricultural Economics, University of Nebraska, Lincoln, Neb. 68503.

Let X_1, ..., X_m be a random sample on X. For purposes of discussion, let p = 2; the reader can easily generalize the method to any p-dimensional space. Let k = k(m) = [m^{1/3}], the greatest integer less than or equal to m^{1/3}. Rank the m observations on the first coordinate in increasing order. Partition the plane into [(m/k)^{1/2}] "blocks" by making [(m/k)^{1/2}] - 1 "cuts" on the ranked observations, spacing the cuts as evenly as possible on the ranks 1, ..., m. Each cut is a vertical line on the plane; the cut itself is considered to belong to the block of which it forms the left boundary. Since the distributions are absolutely continuous, ties occur with probability zero. A numerical example should help to clarify the method up to this point.

Example 2.1: Let m = 729; then k = 9. The eight cuts are vertical lines determined by the first coordinate values of the observations with the following ranks on the first coordinate: 81, 162, 243, ..., 648. The resulting nine blocks each contain 81 observations.

Delete the observations which were used to make the cuts just described. Take each block, rank the remaining observations in the block on the second coordinate, and partition it into [(m/k)^{1/2}] subblocks by making [(m/k)^{1/2}] - 1 horizontal line cuts, spacing the cuts as evenly as possible on the ranks in that block. After this construction has been completed, the plane will be partitioned into [(m/k)^{1/2}]^2 subblocks, each containing k-1, k or k+1 observations. Let us return to the numerical example.

Example 2.1 (cont.): Consider the block formed above by the first 81 ordered observations on the first coordinate. Delete the cut observation and rank the remaining 80 observations on the second coordinate. Make horizontal cuts on the observations with the following ranks: 9, 18, 27, ..., 72. The resulting subblocks each contain from eight to ten observations from the sample. The remaining blocks can be similarly partitioned, yielding a total of 81 subblocks, each containing from eight to ten observations.
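For readers who prefer code to prose, the following Python sketch carries out the construction for p = 2. It is a minimal reading of the text and example above: the cut-rank spacing [i(n + 1)/b] is inferred from the ranks in Example 2.1, and the function names and the returned representation of the partition are ours, not the authors'.

```python
import numpy as np

def cut_ranks(n, b):
    # b - 1 cut positions spaced as evenly as possible on n ranks
    # (0-indexed); reproduces Example 2.1: ranks 81, ..., 648 and 9, ..., 72.
    return [(i * (n + 1)) // b - 1 for i in range(1, b)]

def probability_squares(sample):
    """Partition the plane into probability squares from a bivariate
    sample (an m-by-2 array), following the construction of Example 2.1."""
    sample = np.asarray(sample, dtype=float)
    m = len(sample)
    k = round(m ** (1 / 3))        # k = [m^(1/3)]; m = 729 gives k = 9
    b = round((m / k) ** 0.5)      # blocks per axis; m = 729 gives b = 9

    # First stage: b - 1 vertical cuts on the first-coordinate ranks.
    order = sample[:, 0].argsort()
    idx = cut_ranks(m, b)
    x_cuts = sample[order[idx], 0]

    # Delete the cut observations, then work within each vertical block.
    rest = sample[np.delete(order, idx)]
    block = np.searchsorted(x_cuts, rest[:, 0], side="right")
    y_cuts = []
    for j in range(b):
        ys = np.sort(rest[block == j, 1])
        # Second stage: b - 1 horizontal cuts on the ranks in this block.
        y_cuts.append(ys[cut_ranks(len(ys), b)])
    return x_cuts, y_cuts          # defines b * b probability squares
```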

Those readers who are familiar with statistically equivalent blocks will recognize their use in this construction; those who are not will find useful references in [13, 6, 2]. Due to properties of these blocks, the final subblocks of the preceding construction will be called probability squares in the following discussion. This terminology is suggested by the fact that for large m, with high probability, the probabilities of the subblocks are approximately equal.

Let X_1, ..., X_m be a random sample on X and Y_1, ..., Y_n a random sample on Y. Using the observations z_i and the information contained in the samples, each z_i is to be classified as an observation on X or Y (or possibly the decision to reserve judgment may be allowed). A number of discrimination procedures have been suggested in the literature, and many of these will be mentioned in the next section. However, the construction in Example 2.1 seems to suggest rather naturally some discrimination procedures not previously appearing in the literature. These procedures belong to the class of "nearest neighbor" procedures, where "neighbor" is interpreted to mean belonging to the same probability square.

2.1 Forced Discrimination Procedures

Let X_1, ..., X_m be a random sample on X. Partition the plane by the construction in Example 2.1 into [(m/k)^{1/2}]^2 probability squares. The set A1 can be formed by selecting from among these squares those that seem more likely to contain observations on X than on Y; the remaining squares will be placed in A2 = A1'. The discrimination procedure is then defined by (1.1). The criteria used to make this selection will depend on such considerations as the relative seriousness of misclassification errors. One of the simplest methods is to assign a given square to A1 or A2 according to whether it contains a majority of X sample points or Y sample points, respectively.

Let Y_1, ..., Y_n be a random sample on Y. Using this sample, a procedure can be constructed by a method similar to that above. In general, this procedure will differ from that based on the X sample. The investigator may well want to find both procedures and choose the one which performs better at classifying the samples themselves.
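A minimal sketch of this majority rule in Python, continuing from the probability_squares sketch above; the tie-breaking default below is our simplification (the authors instead look at the surrounding squares):

```python
import numpy as np
from collections import Counter

def square_index(z, x_cuts, y_cuts):
    # Locate the probability square containing the point z.
    j = int(np.searchsorted(x_cuts, z[0], side="right"))
    i = int(np.searchsorted(y_cuts[j], z[1], side="right"))
    return j, i

def forced_rule(x_sample, y_sample, x_cuts, y_cuts):
    """Assign each square to A1 (decide X) or A2 (decide Y) by majority."""
    net = Counter()                  # (# X points) - (# Y points) per square
    for z in x_sample:
        net[square_index(z, x_cuts, y_cuts)] += 1
    for z in y_sample:
        net[square_index(z, x_cuts, y_cuts)] -= 1
    # A tied square (net == 0) falls to "Y" here purely for brevity.
    return {sq: ("X" if c > 0 else "Y") for sq, c in net.items()}
```

One would build this rule from both the X-based and the Y-based partitions and keep whichever misclassifies fewer of the sample observations, as described above.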

2.2 Partial Discrimination Procedures

a. Consider the partition determined by the sample X_1, ..., X_m. Select squares for A1 where it appears that the decision for X should definitely be made and squares for A2 where it appears that the decision for Y should definitely be made. The procedure is then defined by (1.1); those squares where the distributions seem too close to decide will go into the reserve judgment set A1' ∩ A2'. The selection of squares for A1 and A2 will depend on such considerations as the relative importance of misclassification and failure to classify. Another consideration is the possibility of controlling the probability of misclassification on X observations. This control is discussed by Quesenberry and Gessaman [11] and is accomplished by determining the number of squares to go into A1. Of course, a procedure could also be based on the sample Y_1, ..., Y_n, with the possibility of control of the probability of misclassification on Y observations.

b. Use the X-sample construction to form the X probability squares and select the squares to go into A1; separately use the Y-sample construction to form the Y probability squares and select the squares to go into A2. The procedure is then defined by (1.1). The criteria for these selections are similar to those previously described, except that it is now possible to control the probabilities of both types of misclassification under the respective distributions (see [11]). If this control is not desired, one of the simplest methods is to put X squares into A1 if they contain more X observations than Y observations and to put Y squares into A2 if they contain more Y observations than X observations.
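A sketch of variant (a) in the same style, reusing square_index and Counter from the earlier sketch; the margin parameter is our own, though a margin of 2 is the value used in the trials of Section 3.2:

```python
def partial_rule(x_sample, y_sample, x_cuts, y_cuts, margin=2):
    """Variant (a): from one partition, a square is decided for X (A1)
    or Y (A2) only when one sample outnumbers the other there by more
    than `margin`; otherwise it joins the reserve judgment set."""
    net = Counter()
    for z in x_sample:
        net[square_index(z, x_cuts, y_cuts)] += 1
    for z in y_sample:
        net[square_index(z, x_cuts, y_cuts)] -= 1
    return {sq: ("X" if c > margin else "Y" if c < -margin else "reserve")
            for sq, c in net.items()}
```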

3. A COMPARISON OF SOME MULTIVARIATE DISCRIMINATION PROCEDURES

To investigate the relative efficiency of some of the procedures suggested in the preceding section with others in the literature, Monte Carlo methods were used to simulate three two-distribution discrimination problems. Samples were generated from three pairs of bivariate normal distributions and the procedures were determined for each pair from these samples. Each procedure was based on three sample sizes: 729, 200 and 64. Then further samples of size 250 were generated from each of these distributions and classified by the procedures. The proportions of the 500 observations for each pair which were misclassified or (in the case of partial discrimination only) not classified for each of the procedures considered are given in Table 1 and Table 2.

The bivariate normal distributions sampled were as follows:

$$\text{I(a): } \mu_1 = (0, 0),\ \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{versus} \quad \text{I(b): } \mu_2 = (2, 0),\ \Sigma_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$

$$\text{II(a): } \mu_3 = (0, 0),\ \Sigma_3 = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \quad \text{versus} \quad \text{II(b): } \mu_4 = (0, 0),\ \Sigma_4 = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$$

$$\text{III(a): } \mu_5 = (1, 1),\ \Sigma_5 = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix} \quad \text{versus} \quad \text{III(b): } \mu_6 = (-1, 0),\ \Sigma_6 = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}$$

3.1 Forced Discrimination

For the forced discrimination problem the following procedures were compared:

A. Fisher Discriminant. This is the classical linear discriminant analysis (see [4, 1]). Note that the assumption of multivariate normal distributions with identical dispersion matrices is satisfied only in the third pair.

B. Density Estimators. The following procedure was used:


For $\hat{f}_1(z)$ and $\hat{f}_2(z)$ consistent estimators of f1(z) and f2(z), respectively,

$$
\begin{aligned}
\text{if } \hat{f}_1(z)/\hat{f}_2(z) &> 1, \text{ decide } z \text{ is an observation on } X;\\
&< 1, \text{ decide } z \text{ is an observation on } Y;\\
&= 1, \text{ decide in an arbitrary fashion for } X \text{ or } Y.
\end{aligned}
\tag{3.1}
$$

The procedure (3.1) with f1(z) and f2(z) known is optimal in several senses (see [1, 12, 5]), and the use of consistent density estimators as in (3.1) yields procedures consistent with this optimal procedure (see [5]). The density estimators used in this comparison were

a. A Parzen-Cacoullos kernel estimator (see [10, 3]),

$$\hat{f}(x) = \frac{1}{m\,h(m)^2} \sum_{j=1}^{m} K\!\left(\frac{x - X_j}{h(m)}\right),$$

where $h(m) = m^{-1/8}$ and $K(\omega) = \exp\{-(\omega_1^2 + \omega_2^2)/2\}/2\pi$;

b. The Loftsgaarden-Quesenberry estimator (see [9]);

c. The Gessaman estimator [7].
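As an illustration, here is a sketch of rule (3.1) with the Parzen-Cacoullos estimator, using the bandwidth h(m) = m^{-1/8} and the bivariate normal kernel given above; the function names are ours, and no attempt is made at numerical refinement:

```python
import numpy as np

def kernel_estimate(z, sample):
    """Parzen-Cacoullos density estimate at the point z (p = 2)."""
    sample = np.asarray(sample, dtype=float)
    m = len(sample)
    h = m ** (-1 / 8)                        # h(m) = m^(-1/8)
    w = (np.asarray(z) - sample) / h         # scaled differences, shape (m, 2)
    k = np.exp(-(w ** 2).sum(axis=1) / 2) / (2 * np.pi)   # normal kernel K
    return k.sum() / (m * h ** 2)

def density_ratio_rule(z, x_sample, y_sample):
    """Rule (3.1): classify by the ratio of estimated densities."""
    r = kernel_estimate(z, x_sample) / kernel_estimate(z, y_sample)
    if r > 1:
        return "X"
    if r < 1:
        return "Y"
    return np.random.default_rng().choice(["X", "Y"])   # arbitrary on a tie
```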

C. Nearest Neighbor with Probability Squares. The plane was partitioned into probability squares using the X-sample, and the number of elements from the two samples in each square was compared. A square was assigned to A1 or A2 according to the parent of the majority of elements it contained; where ties occurred, decisions were made on the basis of the surrounding squares. Using the Y-sample produced another procedure; the procedure which misclassified fewer of the sample observations themselves was used.

The results of these trials are given in Table 1.

1. FORCED DISCRIMINATION^a

                                       I(a) vs I(b)      II(a) vs II(b)    III(a) vs III(b)
  Procedure                           729   200    64   729   200    64   729   200    64
  A. Fisher discriminant             .116  .118  .112  .502  .496  .498  .150  .158  .166
  B. Density estimators
     (i) Parzen-Cacoullos kernel     .090  .100  .100  .328  .332  .354  .172  .168  .176
     (ii) Loftsgaarden-Quesenberry   .096  .108  .102  .320  .336  .370  .160  .162  .178
     (iii) Gessaman                  .110  .108  .140  .360  .356  .436  .160  .160  .200
  C. Nearest neighbor with
     probability squares             .090  .096  .098  .246  .292  .304  .142  .158  .178

^a Proportions of the 500 observations for each pair misclassified by procedures based on sample sizes 729, 200, 64.

Comment on Table 1: Only in the third pair are the assumptions for the Fisher discriminant analysis satisfied. In that case its performance is uniformly better than any of the density estimators, and only in the case of the largest sample is it surpassed by the nearest neighbor procedure. For the other pairs its performance is uniformly worse than that of all other procedures for all sample sizes, except for the Gessaman estimator on the smallest sample in the first pair.

The average proportion of misclassification by the kernel estimator is .0013 less than the average proportion for the Loftsgaarden-Quesenberry estimator. But this difference is very small, and the superiority of the kernel is not consistent, since the other estimator does a uniformly better job on the third pair. In the absence of any assumptions regarding the probability distributions, there does not seem to be a significant difference between the discriminatory powers of these procedures when the distributions are actually normal, at least on samples of size 200 or more. The Gessaman estimator does not perform as well as the others, particularly for the second pair and for the smallest sample size. For the larger samples when the distributions are somewhat separated, however, its performance compares well. This estimator has an advantage over the other two in that the sets A1 and A2 are easily determined and are unions of rectangles, thus giving the investigator insight into the relative density of probability for the distributions on any portion of R^p. Using the other estimators, the sets A1 and A2 are difficult to find and probably extremely complex.

The nearest neighbor procedure compares very favorably with all procedures for all sample sizes, except with the Fisher discriminant for the smallest sample size on the third pair. It appears to be a nonparametric procedure to be highly recommended for the sample sizes of these trials when the distributions are approximately normal.

The average increase in proportion of misclassification over all procedures as sample size decreased from 729 to 200 is only .0071; the range of such increases was -.006 to .046, the latter occurring with the nearest neighbor procedure on the second pair, where its performance still exceeds that of all other procedures. Exclusive of this, the greatest increase was .016 and the average increase was .004. From these trials it appears that little is gained by going from sample size 200 to 729, an increase of over 350 percent. When the possible additional expense of the larger sample size is added to the consideration, the smaller sample size looks even more attractive. However, when sample size is dropped from 729 to 64 a larger effect is noticeable; the average increase in proportion of misclassification is .0247 and the range is -.004 to .076. The decrease in discriminatory power as sample size decreases is most marked in the second pair, where the distributions are very close. Unless the investigator wishes to go to very large samples, a sample of size 200 seems from these trials to be adequately efficient for discrimination.

3.2 Partial Discrimination

For the partial discrimination problem few procedures have been suggested in the literature. An optimal procedure exists for this problem (see [11]), and the use of consistent density estimators yields a procedure consistent with the optimal procedure. Since we have already compared the performance of estimators for forced discrimination, it does not seem necessary to repeat that comparison for partial discrimination. The following procedures were compared:

A. Procedures with No Control of Error

1. Kendall. In his Order-Statistic Method, Kendall [8] suggests that in the case of multivariate distributions the sample component variables be considered one at a time, assigning the final common p-dimensional range to the reserve judgment set. His method is completely distribution-free, does not misclassify any of the observations from the samples on which the procedure is based, and is easily constructed even for very large samples of high dimensionality. However, if the ranges of the variables are unlimited, for large samples it may actually classify few observations; and if the distributions differ only in dispersion matrix, as in the second pair, the proportion of sample observations actually classified will also be very low.
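The following Python sketch is one simplified reading of this method, not Kendall's exact sequential construction (for which see [8]): coordinates are examined one at a time, a point lying outside the other sample's observed range on some coordinate is classified, and a point remaining inside the common range on every coordinate is reserved. Samples are assumed to be n-by-p numpy arrays.

```python
def kendall_rule(z, x_sample, y_sample):
    """Simplified order-statistic discrimination (one reading of [8])."""
    for j in range(len(z)):
        if not (y_sample[:, j].min() <= z[j] <= y_sample[:, j].max()):
            return "X"        # beyond every observed Y value on coordinate j
        if not (x_sample[:, j].min() <= z[j] <= x_sample[:, j].max()):
            return "Y"        # beyond every observed X value on coordinate j
    return "reserve"          # inside the common p-dimensional range
```

Note that, as the text observes, no training observation can be misclassified by such a rule, and as the samples grow the common range widens, so fewer and fewer points are classified.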

2. Nearest Neighbor with Probability Squares. Refer to Section 2.2a under Partial Discrimination Procedures. The (a) squares were assigned to A1 if the number of observations from (a) outnumbered those from (b) by more than 2; similarly, (a) squares were assigned to A2 if the number of observations from (b) outnumbered those from (a) by more than 2. Each square in the reserve judgment set, then, contained "almost" equal numbers of observations from (a) and (b). A procedure can be constructed in the analogous way using (b) squares. In some cases, these procedures differed considerably; therefore, the results of both procedures are recorded in Table 2.

B. Procedures with .1 Probability of Error

For the purposes of this comparison, an error level of .1 under each distribution was used.

1. Quesenberry-Gessaman. In [11] these authors suggested the use of tolerance regions to form convex hulls about the sample elements of each distribution. The interiors of the respective hulls are A1 and A2; the procedure is then defined by (1.1). In this comparison, eight-sided polygonal regions were used for the two largest sample sizes, and four-sided polygonal regions with sides of slope 1 and -1 were used on the smallest sample size; these shapes were chosen because they are somewhat sensitive to the dispersion matrices of the distributions. Elliptical regions might also have been used, since normality was known to exist. This approach gives procedures which are easily constructed from the data, as well as giving insight into relative locations of the spaces. However, it is not very likely to produce procedures that approximate optimal procedures, since no consideration is given to comparing the two spaces; thus much information in the samples is ignored.
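A loose sketch of the idea follows. It is not the authors' construction: where they calibrate polygonal tolerance regions exactly to the .1 level, this stand-in simply takes the convex hull of each sample after trimming the ten percent of points farthest from the sample mean.

```python
import numpy as np
from scipy.spatial import Delaunay

def trimmed_hull(sample, trim=0.1):
    """Convex hull of the sample with the fraction `trim` of points
    farthest from the sample mean discarded (a crude surrogate for a
    tolerance region at level `trim`)."""
    sample = np.asarray(sample, dtype=float)
    d = np.linalg.norm(sample - sample.mean(axis=0), axis=1)
    return Delaunay(sample[d <= np.quantile(d, 1 - trim)])

def hull_rule(z, hull_x, hull_y):
    """Rule (1.1) with A1 and A2 the hull interiors."""
    in1 = hull_x.find_simplex(np.atleast_2d(z))[0] >= 0
    in2 = hull_y.find_simplex(np.atleast_2d(z))[0] >= 0
    if in1 and not in2:
        return "X"
    if in2 and not in1:
        return "Y"
    return "reserve"
```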

2. Nearest Neighbor with Probability Squares. Refer to Section 2.2b under Partial Discrimination Procedures. Squares from the (a) construction were chosen for A1 and squares from the (b) construction were chosen for A2, in numbers to achieve the .1 level for probability of misclassification under each distribution.

The results of these trials are recorded in Table 2.

Comment on Table 2: As one would expect, the Kendall Order-Statistic Method leads to a high proportion of observations not classified, even in the first and third pairs, where the distributions are separated in mean. As sample size decreases from 729 to 200, remarkably little difference in performance is seen, except in the proportion misclassified in the first pair. However, as sample size decreases to 64, some changes are very apparent; the proportion of misclassifications uniformly increases and the proportion not classified uniformly and significantly decreases. This is consistent with our earlier observation that if the distributions are not bounded in range, then an increase in sample size will lead to fewer classifications, both correct and incorrect. In other words, discriminatory power decreases as sample size increases. This cannot help but be distressing to a statistician who might want to use the method. However, the advantages of this method which were mentioned earlier still remain, and one must ask if there exists an alternative which holds promise of better performance.

In comparison with the Kendall method, the nearest neighbor procedure without control of the level of errors of misclassification shows a uniformly lower proportion of observations not classified. This improvement in the proportion not classified is made partly at the expense of a higher proportion of misclassified observations. This effect can be altered by changing the rules for forming the reserve judgment set, but it will still be present to some degree. It is interesting to note that increasing the sample size has the same result as that noted for the Kendall method, i.e., as sample size increases, the proportion misclassified decreases (uniformly) while the proportion not classified increases (uniformly, except for the first pair using (a) squares on the smaller sample sizes).

2. PARTIAL DISCRIMINATION^a

                                 I(a) vs I(b)                 II(a) vs II(b)                III(a) vs III(b)
                             729       200       64        729       200       64        729       200       64
  Procedure                 MC   NC   MC   NC   MC   NC   MC   NC   MC   NC   MC   NC   MC   NC   MC   NC   MC   NC

  A. Procedures with no control of error
  (i) Kendall             .002 .686 .002 .508 .036 .290  .002 .996 .012 .966 .062 .876  .004 .846 .004 .826 .024 .492
  (ii) Nearest neighbor with probability squares
      using (a) squares   .078 .076 .098 .024 .152 .032  .092 .400 .194 .256 .242 .126  .102 .102 .128 .056 .178 .000
      using (b) squares   .032 .166 .072 .066 .098 .000  .098 .382 .202 .200 .254 .144  .084 .160 .122 .108 .184 .000

  B. Procedures with .1 probability of error
  (i) Quesenberry-
      Gessaman            .028 .394 .020 .428 .040 .336  .010 .872 .018 .828 .042 .800  .016 .442 .024 .544 .028 .522
  (ii) Nearest neighbor
      with probability
      squares             .082 .100 .074 .128 .054 .206  .082 .614 .132 .426 .100 .540  .156 .124 .150 .028 .138 .276

^a Proportions of the 500 observations for each pair misclassified (MC) and not classified (NC) by procedures based on sample sizes 729, 200, 64.


An explanation of this tendency may be that as sample size increases, the number of squares increases and their areas decrease. For smaller sample sizes, larger squares make it more likely that one distribution will dominate, hence classifying more observations, both correctly and incorrectly. In the case of the Kendall method, a large proportion of observations not classified is inevitable as sample size increases when the variable ranges are unbounded. On the other hand, for the nearest neighbor procedure it seems reasonable that as sample size increases this procedure should approximate the optimal procedure. The choice between using (a) squares and (b) squares can be made by the investigator's criteria for a "good" procedure and the performance of the procedures on the samples used to construct them.

The Quesenberry-Gessaman hulls gave procedures where the error levels were controlled at .1; in fact, the proportion of misclassifications was never more than .050. However, the proportion not classified was never less than one-third using this procedure on these distributions. In comparison, the nearest neighbor procedure had a uniformly higher proportion of observations correctly classified and a uniformly lower proportion not classified. The reason seems to be that blocks are placed in A1 by the former procedure with no consideration given to whether (a) or (b) observations seem more likely to be found there. A block may well go into A1 when it contains no (b) observations at all; it is very likely to go into A2 as well, and then into the reserve judgment set. Thus many observations which should be correctly classified are not classified at all. Misclassifications can only occur among observations which fall outside their hulls, and many of these will go into the reserve judgment set. Control of the error level does not seem to be so dependable for the nearest neighbor procedure, since it tends more to choose squares for A1 and A2 which are "between" the distributions. Overall, the nearest neighbor procedure with probability squares appears to be the better procedure unless control of error is a very important concern. The effect of sample size is not at all clear; there seems to be no consistent tendency for either procedure as sample size increases.

The results of these trials are, of course, not conclusive, but they are at least provocative. It is hoped that this article may suggest some of the available alternatives open in the discrimination problem and give some feeling for how they might be expected to perform.

[Received November 1969. Revised October 1971.]

REFERENCES

[1] Anderson, T. W., An Introduction to Multivariate Statistical Analysis, New York: John Wiley and Sons, Inc., 1958.

[2] Anderson, T. W., "Some Nonparametric Multivariate Procedures Based on Statistically Equivalent Blocks," Proceedings of an International Symposium on Multivariate Analysis, New York: Academic Press, Inc., 1966.

[3] Cacoullos, T., "Estimation of a Multivariate Density," Technical Report No. 40, Department of Statistics, University of Minnesota, Minneapolis, Minn., 1964.

[4] Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, Part 4 (1936), 179-88.

[5] Fix, E. and Hodges, J. L., Jr., "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Project No. 21-49-004, Report No. 4, 1951.

[6] Fraser, D. A. S., Nonparametric Methods in Statistics, New York: John Wiley and Sons, Inc., 1957.

[7] Gessaman, M. P., "A Consistent Nonparametric Multivariate Density Estimator Based on Statistically Equivalent Blocks," Annals of Mathematical Statistics, 41 (August 1970), 1344-6.

[8] Kendall, M. G., "Discrimination and Classification," Proceedings of an International Symposium on Multivariate Analysis, New York: Academic Press, Inc., 1966.

[9] Loftsgaarden, D. O. and Quesenberry, C. P., "A Nonparametric Estimate of a Multivariate Density Function," Annals of Mathematical Statistics, 36 (June 1965), 1049-51.

[10] Parzen, E., "On Estimation of a Probability Density Function and Mode," Annals of Mathematical Statistics, 33 (August 1962), 1065-76.

[11] Quesenberry, C. P. and Gessaman, M. P., "Nonparametric Discrimination Using Tolerance Regions," Annals of Mathematical Statistics, 39 (April 1968), 664-73.

[12] Welch, B. L., "Note on Discriminant Functions," Biometrika, 31 (December 1939), 218-20.

[13] Wilks, S. S., Mathematical Statistics, New York: John Wiley and Sons, Inc., 1962.
