
Pattern Recognition Letters

A Class-Dependent Weighted Dissimilarity Measure for Nearest Neighbor Classification Problems

Roberto Paredes and Enrique Vidal
Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Spain.
[email protected], [email protected]

Abstract

A class-dependent weighted (CDW) dissimilarity measure in vector spaces is proposed to improve the performance of the nearest neighbor classifier. In order to optimize the required weights, an approach based on Fractional Programming is presented. Experiments with several standard benchmark data sets show the effectiveness of the proposed technique.

Keywords: Nearest Neighbour Classification, Weighted Dissimilarity Measures, Iterative Optimization, Fractional Programming.

1 Introduction

Let $P$ be a finite set of prototypes, which are class-labelled points in a vector space $E$, and let $d(\cdot,\cdot)$ be a dissimilarity measure defined in $E$. For any given point $x \in E$, the Nearest Neighbor (NN) classification rule assigns to $x$ the label of a prototype $p \in P$ such that $d(p,x)$ is minimum. The NN rule can be extended to the $k$-NN rule by classifying $x$ in the class most heavily represented among the labels of its $k$ nearest neighbours. The great effectiveness of these rules as the number of prototypes grows to infinity is well known [Cover (1967)]. However, in most real situations the number of available prototypes is usually very small, which often leads to dramatic degradations of ($k$-)NN classification accuracy.

Consider the following general statistical statement of a two-class Pattern Recognition classification problem. Let $D_n = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ be a training data set of independent, identically distributed random variable pairs, where $Y_i \in \{0,1\}$, $1 \le i \le n$, are classification labels, and let $X$ be an observation from the same distribution. Let $Y$ be the true label of $X$ and $g_n(\cdot)$ a classification rule based on $D_n$. The probability of error is $R_n = P\{Y \ne g_n(X)\}$. Devroye et al. show that, for any integer $n$ and classification rule $g_n$, there exists a distribution of $(X,Y)$ with Bayes risk $R^* = 0$ such that the expectation of $R_n$ is $E(R_n) \ge \frac{1}{2} - \varepsilon$, where $\varepsilon > 0$ is an arbitrarily small number [Devroye (1996)]. This theorem states that even though we have rules, such as the $k$-NN rule, that are universally consistent (that is, they asymptotically provide optimal performance for any distribution), their finite-sample performance can be extremely bad for some distributions.

This explains the increasing interest in finding variants of the NN rule and adequate distance measures that help improve NN classification performance in small data set situations [Tomek (1976), Fukunaga (1985), Luk (1986), Urahama (1995), Short (1980), Short (1981), Fukunaga (1982), Fukunaga (1984), Myles (1990)].

Here we propose a weighted measure which can be seen as a generalization of the simple weighted $L_2$ dissimilarity in a $d$-dimensional space:

$$d(y,x) \;=\; \sqrt{\sum_{j=1}^{d} \mu_j^2\,(x_j - y_j)^2} \qquad (1)$$

where $\mu_j$ is the weight of the $j$-th dimension. Assuming an $m$-class classification problem, our proposed generalization is just a natural extension of (1):

$$d(y,x) \;=\; \sqrt{\sum_{j=1}^{d} \mu_{cj}^2\,(x_j - y_j)^2} \qquad (2)$$

where $c = \mathrm{class}(x)$. We will refer to this extension as the "Class-Dependent Weighted (CDW)" measure. If $\mu_{ij} = 1$, $1 \le i \le m$, $1 \le j \le d$, the weighted measure is just the $L_2$ metric. On the other hand, if the squared weights are the inverses of the per-dimension variances, the (diagonal) Mahalanobis distance (MD) is obtained. Taking instead the class-dependent variances leads to a measure that will be referred to as the class-dependent Mahalanobis (CDM) dissimilarity.

In the general case, (2) is not a metric, since $d(x,y)$ can differ from $d(y,x)$ when $\mathrm{class}(x) \ne \mathrm{class}(y)$, so the symmetry property is not satisfied.

In this most general setting, we are interested in finding an $m \times d$ weight matrix $M$ which optimizes the CDW-based NN classification performance:

$$M \;=\; \begin{pmatrix} \mu_{11} & \dots & \mu_{1d} \\ \vdots & & \vdots \\ \mu_{m1} & \dots & \mu_{md} \end{pmatrix} \qquad (3)$$
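To make (1)-(3) concrete, here is a minimal NumPy sketch (illustration only, not code from the paper; the array name `mu` for the weight matrix and integer class labels 0..m-1 are our own conventions) that evaluates the CDW dissimilarity of Eq. (2):

```python
import numpy as np

def cdw_dissimilarity(y, x, class_of_x, mu):
    """CDW measure of Eq. (2): d(y, x) = sqrt(sum_j mu[c, j]^2 * (x_j - y_j)^2),
    where c = class(x) selects one row of the m x d weight matrix mu."""
    w = mu[class_of_x]                      # weights of the class of the second argument
    return float(np.sqrt(np.sum((w * (x - y)) ** 2)))

# With mu = all ones this reduces to the plain L2 distance; with each row set to the
# inverse per-class standard deviations it becomes the CDM dissimilarity of Section 1.
mu = np.ones((3, 2))                        # m = 3 classes, d = 2 features
print(cdw_dissimilarity(np.array([0.0, 0.0]), np.array([3.0, 4.0]), class_of_x=1, mu=mu))  # 5.0
```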

2 Approach

In order to find a matrix $M$ that results in a low error rate of the NN classifier with the CDW dissimilarity measure, we propose the minimization of a specific criterion index.

Under the proposed framework, we expect NN accuracy to improve by using a dissimilarity measure such that distances between points belonging to the same class are small while interclass distances are large. This simple idea suggests the following criterion index, computed over the training set $S$:

$$J(M) \;=\; \frac{\sum_{x \in S} d(x, x^{=}_{nn})}{\sum_{x \in S} d(x, x^{\ne}_{nn})} \qquad (4)$$

where $x^{=}_{nn}$ is the nearest neighbour of $x$ in the same class ($\mathrm{class}(x) = \mathrm{class}(x^{=}_{nn})$) and $x^{\ne}_{nn}$ is the nearest neighbour of $x$ in a different class ($\mathrm{class}(x) \ne \mathrm{class}(x^{\ne}_{nn})$). In the sequel, $\sum_{x \in S} d(x, x^{=}_{nn})$ will be denoted as $f(M)$ and $\sum_{x \in S} d(x, x^{\ne}_{nn})$ as $g(M)$; that is, $J(M) = \frac{f(M)}{g(M)}$.

Minimizing this index amounts to minimizing a ratio between sums of distances, a problem which is difficult to solve by conventional gradient descent. In fact, the gradient with respect to a weight $\mu_{ij}$ takes the form

$$\frac{\partial J(M)}{\partial \mu_{ij}} \;=\; \frac{\big(\partial f(M)/\partial \mu_{ij}\big)\,g(M) \;-\; f(M)\,\big(\partial g(M)/\partial \mu_{ij}\big)}{g(M)^2}.$$

Taking into account that $f(M) = \sum_{x \in S} d(x, x^{=}_{nn})$ and $g(M) = \sum_{x \in S} d(x, x^{\ne}_{nn})$, this leads to an exceedingly complex expression. Clearly, an alternative technique for minimizing (4) is needed.
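For illustration (a sketch with our own function and variable names, not code from the paper), the index (4) can be evaluated by brute force for a labelled training set, keeping in mind that, by Eq. (2), the distance from $x$ to a candidate prototype $p$ uses the weights of $\mathrm{class}(p)$:

```python
import numpy as np

def criterion_index(X, y, mu):
    """J(M) = f(M) / g(M) of Eq. (4): the sum of CDW distances to the same-class
    nearest neighbour divided by the sum of CDW distances to the nearest
    different-class prototype.  X: (n, d) samples, y: (n,) integer labels, mu: (m, d)."""
    W = mu[y]                                # weight row for each prototype's class
    f = g = 0.0
    for i, (x, c) in enumerate(zip(X, y)):
        d = np.sqrt(((W * (X - x)) ** 2).sum(axis=1))   # d(x, p) for every prototype p
        d[i] = np.inf                        # exclude x itself
        f += d[y == c].min()                 # d(x, x_nn^=)
        g += d[y != c].min()                 # d(x, x_nn^!=)
    return f / g
```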

2.1 Fractional Programming

In order to find a matrix $M$ that minimizes (4), a Fractional Programming procedure [Sniedovich (1992)] is proposed. Fractional Programming aims at solving problems of the following type:¹

Problem $Q$:   $q = \min_{z \in Z} \dfrac{v(z)}{w(z)}$

where $v$ and $w$ are real-valued functions on some set $Z$, and $w(z) > 0\ \forall z \in Z$. Let $Z^{*}$ denote the set of optimal solutions to this problem. An optimal solution can be obtained via the solution of a parametric problem of the following type:

Problem $Q(\lambda)$:   $q(\lambda) = \min_{z \in Z}\big(v(z) - \lambda\,w(z)\big), \quad \lambda \in \mathbb{R}.$

Let $Z^{*}(\lambda)$ denote the set of optimal solutions to Problem $Q(\lambda)$. The justification for seeking the solution of Problem $Q$ via Problem $Q(\lambda)$ is that there exists a $\lambda \in \mathbb{R}$ such that every optimal solution to Problem $Q(\lambda)$ is also an optimal solution to Problem $Q$. The algorithm for finding this $\lambda$ is known as Dinkelbach's Algorithm [Sniedovich (1992)].

¹ As in [Vidal (1995)], where another application of Fractional Programming in Pattern Recognition is described, here we consider minimization problems rather than maximization problems as in [Sniedovich (1992)]. It can easily be verified that the same results of [Sniedovich (1992)] also hold in our formulation.

Dinkelbach's Algorithm

Step 1: Select $z \in Z$, set $k = 1$ and $\lambda^{(k)} = \dfrac{v(z)}{w(z)}$.

Step 2: Set $\lambda' = \lambda^{(k)}$; solve Problem $Q(\lambda^{(k)})$ and select $z \in Z^{*}(\lambda^{(k)})$.

Step 3: Set $k = k + 1$ and $\lambda^{(k)} = \dfrac{v(z)}{w(z)}$; if $\lambda' = \lambda^{(k)}$ stop, else go to Step 2.
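The following small sketch (ours, not from the paper) applies Dinkelbach's scheme to a toy ratio minimized over a finite set $Z$, where each parametric problem $Q(\lambda)$ can be solved exactly by enumeration; the stopping test uses a numerical tolerance instead of exact equality:

```python
def dinkelbach(Z, v, w, tol=1e-12):
    """Minimize v(z)/w(z) over a finite set Z (with w(z) > 0) via the parametric
    problems Q(lambda): min_z v(z) - lambda * w(z)."""
    z = Z[0]
    lam = v(z) / w(z)                                     # Step 1
    while True:
        lam_prev = lam
        z = min(Z, key=lambda t: v(t) - lam_prev * w(t))  # Step 2: solve Q(lambda) exactly
        lam = v(z) / w(z)                                 # Step 3
        if abs(lam_prev - lam) <= tol:
            return z, lam

# Toy usage: minimize (z^2 + 1) / (z + 2) over a grid on [0, 3].
Z = [i / 100.0 for i in range(301)]
z_opt, q = dinkelbach(Z, v=lambda z: z * z + 1.0, w=lambda z: z + 2.0)
print(z_opt, q)        # about 0.24 and 0.472 (the continuous optimum is at sqrt(5) - 2)
```

In the FPGD algorithm below, the subproblem $Q(\lambda)$ is only solved approximately by gradient descent, which is why only a local optimum of (4) can be guaranteed.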

Step 2 requires an optimal solution to the problem $Q(\lambda)$: $q(\lambda) = \min_{z \in Z}\big(v(z) - \lambda w(z)\big)$. If this optimal solution can be found², then the algorithm finds, in a finite number of iterations, a $\lambda$ for which every optimal solution to $Q(\lambda)$ is an optimal solution to $Q$ as well. Unfortunately, however, if $Q(\lambda)$ cannot be solved optimally (only local solutions can be found), then the algorithm does not guarantee that the global optimal solution to the original problem $Q$ can be found. Since we will use gradient descent techniques to solve $Q(\lambda)$, which do not guarantee a globally optimal solution, in general we will not find the optimal solution to Problem $Q$, but we expect to find a good local optimum.

² ...and other basic conditions are met [Sniedovich (1992)].

In our case, $Z$ is the set $\mathcal{M}$ of $m \times d$ matrices of the form (3), and $z$ is one of these matrices, $M \in \mathcal{M}$. Thus, using gradient descent to obtain a locally optimal solution to the problem $Q(\lambda) = \min_{M \in \mathcal{M}}\big(f(M) - \lambda\,g(M)\big)$ leads to the following equations:

$$\mu'_{ij} \;=\; \mu_{ij} \;-\; \rho_{ij}\,\frac{\partial\big(f(M) - \lambda\,g(M)\big)}{\partial \mu_{ij}}, \qquad 1 \le i \le m,\; 1 \le j \le d \qquad (5)$$

where $\mu_{ij}$ is a component of $M$ at a certain iteration of the descent algorithm, $\mu'_{ij}$ is the value of this component at the next iteration, and $\rho_{ij}$ is a step factor (or "learning rate") for dimension $j$ and class $i$ (typically $\rho_{ij} = \rho\ \forall i,j$). By developing the partial derivatives in (5) for our $m$-class classification problem and defining $S_i = \{x \in S : \mathrm{class}(x) = i\}$, $1 \le i \le m$, the following update equations are obtained:

$$\mu'_{ij} \;=\; \mu_{ij} \;-\; \sum_{x \in S_i} \rho_{ij}\,\frac{\mu_{ij}\,(x^{=}_{nn,j} - x_j)^2}{d(x, x^{=}_{nn})} \qquad (6)$$

$$\mu'_{ij} \;=\; \mu_{ij} \;+\; \sum_{x \notin S_i \,\wedge\, x^{\ne}_{nn} \in S_i} \lambda\,\rho_{ij}\,\frac{\mu_{ij}\,(x^{\ne}_{nn,j} - x_j)^2}{d(x, x^{\ne}_{nn})} \qquad (7)$$

Finally, by embedding this gradient descent procedure into Dinkelbach's Algorithm, we obtain the "Fractional Programming Gradient Descent" (FPGD) algorithm for finding a (local) minimum of the index (4). In this algorithm, shown in Figure 1, the parameters $\varepsilon_g$ and $\varepsilon_\lambda$ are used to control the precision of the minimum required to assess convergence. They are typically set to adequately small fixed values. On the other hand, the learning rates $\rho_{ij}$ are generally set to a single constant value or to values that depend on the variances observed in the training data (cf. Section 4).

FPGD(S, M)  {   /* M is an initial weight matrix and S the training set */
    λ' = ∞;  λ = f(M)/g(M);  iterations = 0;
    while (λ' - λ > ε_λ)  {
        λ' = λ;  λ'' = ∞;
        while (λ'' - λ > ε_g)  {
            λ'' = λ;  M' = M;
            for all x ∈ S  {
                i = class(x);  k = class(x^≠_nn);
                for j = 1 ... d  {
                    μ'_ij = μ'_ij - ρ_ij μ_ij (x^=_nn,j - x_j)² / d(x, x^=_nn);
                    μ'_kj = μ'_kj + λ' ρ_kj μ_kj (x^≠_nn,j - x_j)² / d(x, x^≠_nn);
                }
            }
            M = M';  λ = f(M)/g(M);  iterations = iterations + 1;
        }
    }
}

Figure 1: The Fractional Programming Gradient Descent algorithm.
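The following Python rendering of Figure 1 is a sketch under our own conventions (NumPy arrays, integer class labels 0..m-1, a brute-force NN search, a single constant learning rate, and an added epoch cap), not the authors' implementation; it also records the leave-one-out error estimate discussed next, returning the weights that achieved the lowest estimate as suggested in Section 2.2:

```python
import numpy as np

def cdw_dists(x, i, X, y, mu):
    """d(x, p) of Eq. (2) for every prototype p, using the weights of class(p);
    the point x itself (index i) is excluded, as in a leave-one-out setting."""
    d = np.sqrt(((mu[y] * (X - x)) ** 2).sum(axis=1))
    d[i] = np.inf
    return d

def index_and_loo_error(X, y, mu):
    """f(M), g(M) of Eq. (4) and the LOO NN error estimate under the current weights."""
    f = g = errors = 0.0
    for i, (x, c) in enumerate(zip(X, y)):
        d = cdw_dists(x, i, X, y, mu)
        d_same, d_diff = d[y == c].min(), d[y != c].min()
        f, g = f + d_same, g + d_diff
        errors += d_diff < d_same            # LOO NN error for this sample
    return f, g, errors / len(X)

def fpgd(X, y, mu, rho=1e-3, eps_lambda=1e-4, eps_g=1e-4, max_epochs=200):
    """Fractional Programming Gradient Descent (Figure 1) with minimum-LOO-error tracking.
    Assumes no duplicate prototypes (so the distances in the denominators stay non-zero)."""
    f, g, err = index_and_loo_error(X, y, mu)
    lam, lam_outer = f / g, np.inf
    best_mu, best_err = mu.copy(), err
    epoch = 0
    while lam_outer - lam > eps_lambda and epoch < max_epochs:
        lam_outer, lam_inner = lam, np.inf
        while lam_inner - lam > eps_g and epoch < max_epochs:
            lam_inner = lam
            new_mu = mu.copy()
            for i, (x, c) in enumerate(zip(X, y)):
                d = cdw_dists(x, i, X, y, mu)
                same, diff = np.flatnonzero(y == c), np.flatnonzero(y != c)
                p = same[d[same].argmin()]   # x_nn^=  (same-class NN)
                q = diff[d[diff].argmin()]   # x_nn^!= (nearest different-class prototype)
                k = y[q]
                new_mu[c] -= rho * mu[c] * (X[p] - x) ** 2 / d[p]              # Eq. (6)
                new_mu[k] += lam_outer * rho * mu[k] * (X[q] - x) ** 2 / d[q]  # Eq. (7)
            mu = new_mu
            f, g, err = index_and_loo_error(X, y, mu)
            lam = f / g
            if err < best_err:               # keep the minimum-error weights (Section 2.2)
                best_err, best_mu = err, mu.copy()
            epoch += 1
    return best_mu, best_err
```

Note that, as in Figure 1, the different-class update modifies the weights of the neighbour's class $k = \mathrm{class}(x^{\ne}_{nn})$, not those of $\mathrm{class}(x)$.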

It is interesting to note that the computations involved in (6) and (7) implicitly entail computing the NN of each $x \in S$, according to the CDW dissimilarity corresponding to the current values of the weights $\mu_{ij}$ and the prototype set $S - \{x\}$. Therefore, as a byproduct, a Leave-One-Out (LOO) estimation of the error rate of the NN classifier with the weighted measure can readily be obtained. This issue will be further explored in the next section.

Figure 2: Behaviour of the FPGD algorithm as applied to the "Monkey Problem" data set. Classification error is estimated through Leave One Out. (The plot shows the criterion index and the LOO error estimate as a function of the number of iterations.)

Figure 2 shows a typical evolution of this algorithm, as applied to the so-called "Monkey Problem" data set, which is described in Section 3.
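Returning to the LOO byproduct mentioned above: in code terms (a sketch with our own names, consistent with the one given for Eq. (4)), the estimate only reuses the two nearest-neighbour distances that updates (6) and (7) already require:

```python
import numpy as np

def loo_nn_error(X, y, mu):
    """LOO error estimate of the CDW-NN classifier: x is an error exactly when its
    nearest different-class prototype is closer than its nearest same-class one."""
    errors = 0
    for i, (x, c) in enumerate(zip(X, y)):
        d = np.sqrt(((mu[y] * (X - x)) ** 2).sum(axis=1))   # d(x, p), weights of class(p)
        d[i] = np.inf                                       # leave x out
        errors += d[y != c].min() < d[y == c].min()
    return errors / len(X)
```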


2.2 Finding adequate solutions in adverse situations

A negative side effect of the fact that only locally optimal solutions can be obtained in each step of the Fractional Programming procedure is that, if the additive factor in (7) is not sufficiently large, the algorithm may tend to drive the $\mu$-values to zero.

As an example of this kind of divergent behaviour, consider the following two-class problem, with each class having 500 two-dimensional points (Figure 3). Class A is a mixture of two Gaussian distributions, both centred at (0,0). The first has a standard deviation of $\sqrt{10}$ in the $x_1$ dimension and a unit standard deviation in the $x_2$ dimension, while the second has a unit standard deviation in the $x_1$ dimension and a standard deviation of $\sqrt{10}$ in the $x_2$ dimension. Class B is a Gaussian distribution with unit standard deviation in the $x_1$ dimension and a standard deviation of $\sqrt{10}$ in the $x_2$ dimension, centred at (6,0). Note the relatively large interclass overlap in the $x_1$ dimension.

Figure 3: Two-class problem with the Gaussian mixture distributions and interclass overlapping (scatter plot of Class A and Class B in the $(x_1, x_2)$ plane).
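This synthetic set can be reproduced, for instance, as follows (a sketch; the mixture proportions of Class A are not given in the text, so an equal split is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                        # points per class

# Class A: equal mixture of two Gaussians centred at (0, 0), one elongated along x1
# (std sqrt(10), 1) and the other along x2 (std 1, sqrt(10)).
a1 = rng.normal([0.0, 0.0], [np.sqrt(10.0), 1.0], size=(n // 2, 2))
a2 = rng.normal([0.0, 0.0], [1.0, np.sqrt(10.0)], size=(n - n // 2, 2))
class_a = np.vstack([a1, a2])

# Class B: a single Gaussian centred at (6, 0) with std (1, sqrt(10)).
class_b = rng.normal([6.0, 0.0], [1.0, np.sqrt(10.0)], size=(n, 2))

X = np.vstack([class_a, class_b])
y = np.array([0] * n + [1] * n)
```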

As shown in Figure 4, with this data set (and using unit initialization weights and a constant value for the step factor $\rho$), the estimated error rate tends to worsen, while the proposed criterion index (4) effectively decreases through successive iterations. This undesirable effect is actually due to the fact that all $\mu_{ij}$ tend to zero until the algorithm stops.

It is interesting to note that, despite this "divergent" behaviour, a minimum error estimate is reached at a certain step of the procedure, as can be seen in Figure 4. In other words, a low value of $J(M)$ does not always mean a low value of the NN classifier error rate; the connection between the two is only the working assumption introduced in Section 2. Nevertheless, it is possible to find a minimum of the estimated error somewhere along the path that leads towards the minimum index value. This suggests that, rather than supplying the weight values obtained at the end of the FPGD procedure, a better choice for $M$ would generally be to supply the weights that led to the minimum estimated error rate. In typical cases, such as that shown in Figure 2, this minimum is achieved at the convergence point of the FPGD procedure, while in adverse situations, such as that in Figure 4, the minimum-error weights will hopefully be a better choice than the standard ($L_2$ or Mahalanobis) distance.

Figure 4: "Divergent" evolution of the FPGD algorithm with the "adverse" synthetic data shown in Figure 3. The CDW index converges as expected, but the error rate tends to increase; nevertheless, there is a step at which the estimated error is minimum. (The plot shows the LOO error estimate and the criterion index as a function of the number of iterations, with the minimum estimated error marked.)

It is worth noting that this simple heuristic guarantees a LOO error estimate for the resulting weights which is never larger than the one obtained with the initial weights. Consequently, if the weights are initialized with values corresponding to a certain conventional (adequate) metric, the final weights can be expected to behave at least as well as this metric would.

2.3 Asymptotic behaviour

The previous section introduced an essential feature of our approach, namely the estimation of the error rate of the classifier by LOO using the weights at each step of the process. At the end of the process, the weights with the best estimate are selected.

Let $n$ be the size of the training set. If $M$ is initialized to the unit matrix, the first step of the process yields a LOO error estimate, $\hat{\epsilon}^{\,n}_{NN}$, of the standard Nearest Neighbor classifier. At the end of the process, the weight matrix with the best error estimate, $\hat{\epsilon}^{\,n}_{W}$, is selected. Therefore $\hat{\epsilon}^{\,n}_{W} \le \hat{\epsilon}^{\,n}_{NN}$.

It is well known that, under suitable conditions [Devroye (1996)], when $n$ tends to infinity the LOO error estimate of an NN classifier tends to the true error rate of this classifier. Therefore:

$$\left.\begin{array}{l} \hat{\epsilon}^{\,n}_{W} \le \hat{\epsilon}^{\,n}_{NN} \\ \lim_{n\to\infty} \hat{\epsilon}^{\,n}_{NN} = \epsilon_{NN} \\ \lim_{n\to\infty} \hat{\epsilon}^{\,n}_{W} = \epsilon_{W} \end{array}\right\} \;\Longrightarrow\; \epsilon_{W} \le \epsilon_{NN} \qquad (8)$$

In conclusion, in this asymptotic case the classifier using the optimized weight matrix is guaranteed to yield an error rate less than or equal to that of the standard Nearest Neighbor classifier.

3 Experiments

Several standard benchmark corpora from the UCI Repository of Machine Learning Databases and Domain Theories [UCI] and the Statlog Project [Statlog] have been used. A short description of these corpora is given below:

- Statlog Australian Credit Approval (Australian): 690 prototypes, 14 features, 2 classes. Divided into 10 sets for cross-validation.
- UCI Balance (Balance): 625 prototypes, 4 features, 3 classes. Divided into 10 sets for cross-validation. A different design of the experiment was made in [Shultz (1994)].
- Statlog Pima Indians Diabetes (Diabetes): 768 prototypes, 8 features, 2 classes. Divided into 11 sets for cross-validation.
- Statlog DNA (DNA): Training set of 2000 prototypes, test set of 1186 vectors, 180 features, 3 classes.
- Statlog German Credit Data (German): 1000 prototypes, 20 features, 2 classes. Divided into 10 sets for cross-validation.
- Statlog Heart (Heart): 270 prototypes, 13 features, 2 classes. Divided into 9 sets for cross-validation.
- UCI Ionosphere (Ionosphere): Training set of 200 prototypes (the first 200, as in [Sigilito (1989)]), test set of 151 vectors, 34 features, 2 classes.
- Statlog Letter Image Recognition (Letter): Training set of 15000 prototypes, test set of 5000 vectors, 16 features, 26 classes.
- UCI Monkey-Problem-1 (Monkey): Training set of 124 prototypes, test set of 432 vectors, 6 features, 2 classes.
- Statlog Satellite Image (Satimage): Training set of 4435 prototypes, test set of 2000 prototypes, 36 features, 6 classes.
- Statlog Image Segmentation (Segmen): 2310 prototypes, 19 features, 7 classes. Divided into 10 sets for cross-validation.
- Statlog Shuttle (Shuttle): Training set of 43,500 prototypes, test set of 14,500 vectors, 9 features, 7 classes.
- Statlog Vehicle (Vehicle): 846 prototypes, 18 features, 4 classes. Divided into 9 sets for cross-validation.

Most of these data sets involve both numeric and categorical features. In our experiments, each categorical feature has been replaced by $n$ binary features, where $n$ is the number of different values allowed for the categorical feature. For example, in a hypothetical data set with two features, Age (continuous) and Sex (categorical: M, F), the categorical feature would be replaced by two binary features; i.e., Sex=M will be represented as (1,0) and Sex=F as (0,1). The continuous feature will not undergo any change, leading to an overall three-dimensional representation.
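A minimal sketch of this encoding (hypothetical values, not taken from the corpora):

```python
def one_hot(value, categories):
    """Replace a categorical value by len(categories) binary features."""
    return [1.0 if value == c else 0.0 for c in categories]

# Hypothetical record with Age (continuous) and Sex (categorical: M, F):
age, sex = 37.0, "F"
vector = [age] + one_hot(sex, ["M", "F"])
print(vector)        # [37.0, 0.0, 1.0] -> overall three-dimensional representation
```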

Many UCI and Statlog data sets are small. In these cases, N-fold cross-validation [Raudys (1991)] has been applied to obtain the classification results: each corpus is divided into $N$ blocks, using $N-1$ blocks as a training set and the remaining block as a test set, so that each block is used exactly once as a test set. The number of cross-validation blocks, $N$, is specified for each corpus in the UCI and Statlog documentation. For DNA, Letter, Monkey, Satimage and Shuttle, which are relatively larger corpora, a single specific partition into training and test sets was provided by Statlog and, in these cases, no cross-validation was carried out. It should finally be mentioned that, although classification-cost penalties are available in a few cases, for the sake of presentation homogeneity we have decided not to make use of them, neither for training nor for classification.
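The cross-validation protocol just described can be sketched as follows (our own helper, shown only to make the procedure explicit):

```python
def n_fold_splits(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds blocks; each block is used exactly
    once as the test set, and the remaining N-1 blocks form the training set."""
    blocks = [list(range(b, n_samples, n_folds)) for b in range(n_folds)]
    for b, test in enumerate(blocks):
        train = [i for k, block in enumerate(blocks) if k != b for i in block]
        yield train, test

# Example: 10-fold cross-validation over 690 samples (as for the Australian corpus).
for train_idx, test_idx in n_fold_splits(690, 10):
    pass   # train the CDW weights on train_idx, evaluate the NN rule on test_idx
```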


4 Results

Experiments with both the NN and the k-NN rules were carried out using the $L_2$ metric, the Mahalanobis distance (MD), the "class-dependent" Mahalanobis (CDM), and our CDW dissimilarity measures. As mentioned in Section 1, CDM consists in weighting each dimension by the inverse of the variance of this dimension in each class.

In the case of the CDM dissimilarity, computational singularities can appear when dealing with categorical features, which often exhibit null class-dependent variances. This problem was solved by using the overall variance as a "back-off" for smoothing the null values.
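In code terms this back-off can look as follows (a sketch; we read the CDM weights as inverse per-class standard deviations, so that the squared weights in (2) are the inverse class-conditional variances, and we assume no feature is constant over the whole training set):

```python
import numpy as np

def cdm_weights(X, y, n_classes):
    """Class-dependent inverse standard deviations, backing off to the overall
    variance for dimensions whose within-class variance is zero (e.g. categorical
    features that take a single value inside a class)."""
    overall_var = X.var(axis=0)
    mu = np.empty((n_classes, X.shape[1]))
    for c in range(n_classes):
        var_c = X[y == c].var(axis=0)
        var_c = np.where(var_c > 0.0, var_c, overall_var)   # back-off for null variances
        mu[c] = 1.0 / np.sqrt(var_c)
    return mu
```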

Initialization values for training the CDW weights were selected according to the following simple rule, which is based on the LOO NN performance of the conventional methods on the training data: if raw $L_2$ outperforms CDM, then all initial $\mu_{ij}$ are set to 1; otherwise, they are set to the inverse of the corresponding training data standard deviations. Similarly, the step factors $\rho_{ij}$ are set to a small constant (0.001) in the former case and to the inverse of the standard deviation in the latter.
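A sketch of this selection rule (our own names; we read "the corresponding training data standard deviations" as per-class, per-feature standard deviations, which is an assumption on our part):

```python
import numpy as np

def initial_weights_and_steps(X, y, n_classes, l2_beats_cdm, const_step=1e-3):
    """Initialization rule of Section 4.  l2_beats_cdm is the outcome of comparing
    the LOO NN error of plain L2 against CDM on the training data."""
    d = X.shape[1]
    if l2_beats_cdm:
        mu = np.ones((n_classes, d))                 # start from the L2 metric
        rho = np.full((n_classes, d), const_step)    # small constant step factor (0.001)
    else:
        # start from inverse standard deviations; step factors follow the same scale
        # (assumes non-zero, or back-off smoothed, standard deviations)
        std = np.stack([X[y == c].std(axis=0) for c in range(n_classes)])
        mu = 1.0 / std
        rho = 1.0 / std
    return mu, rho
```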

Tables 1 and 2 summarize the results for NN and k-NN classification, respectively. In the case of k-NN, only the results for the optimal value of $k$, $1 < k < 21$, observed with each method are reported.

For the NN classification rule (Table 1), CDW outperforms the conventional methods on most of the corpora. The greatest improvement (+13%) was obtained on the Monkey Problem, a categorical corpus with a small number of features and only two classes. A similarly good improvement (+9.2%) was obtained for the DNA corpus, which also contains categorical data but has far more features (180) and 3 classes. CDW has only been slightly outperformed (by less than 1.6%) by other methods in a few cases: Australian, Ionosphere and Shuttle.

For the k-NN classification rule (Table 2), CDW outperforms the conventional methods on many corpora: DNA, Ionosphere, Letter, Monkey, Segmen and Vehicle; again, Monkey and DNA yielded the most significant improvements (+12.7% and +7.7%, respectively). Also in this k-NN case, on the corpora where CDW is outperformed by some other method, the difference in accuracy was generally small.

Error-estimation 95% confidence intervals³ [Duda (1973)] for the best method are also shown in Tables 1 and 2. It is interesting to note that, in the few cases where CDW is outperformed by other methods, the difference is generally well within the corresponding confidence intervals.

³ Computed by numerically solving the equations $\sum_{k \le K} P(k; n; p_1) = \frac{1-A}{2}$ and $\sum_{k \ge K} P(k; n; p_0) = \frac{1-A}{2}$, where $P(k; n; p)$ is the binomial distribution, $A = 0.95$ is the confidence value and $[p_0, p_1]$ is the confidence interval.
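For reference, such intervals can be computed, for example, with SciPy (a sketch of our own; we read $K$ as the number of correctly classified test samples):

```python
from scipy.stats import binom
from scipy.optimize import brentq

def confidence_interval(k, n, A=0.95):
    """Numerically solve sum_{i<=K} P(i; n; p1) = (1-A)/2 and
    sum_{i>=K} P(i; n; p0) = (1-A)/2 for the interval [p0, p1]."""
    alpha = (1.0 - A) / 2.0
    p0 = 0.0 if k == 0 else brentq(lambda p: 1.0 - binom.cdf(k - 1, n, p) - alpha,
                                   1e-12, 1.0 - 1e-12)
    p1 = 1.0 if k == n else brentq(lambda p: binom.cdf(k, n, p) - alpha,
                                   1e-12, 1.0 - 1e-12)
    return p0, p1

# Example: 432 Monkey test samples, all correctly classified (100% accuracy):
print(confidence_interval(432, 432))   # about (0.9915, 1.0), close to the +0.0, -0.8 reported
```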

Table 1: Classification accuracy (in %) of different methods, using the NN rule, on several data sets. The best accuracy for each corpus is marked with an asterisk. The last column is the 95% confidence interval of the best method.

             L2      MD      CDM     CDW     CI
Australian   65.73   81.03   82.94*  81.37   +2.7, -3.0
Balance      78.83   80.16   68.0    82.63*  +2.9, -3.2
Diabetes     69.94   70.62   68.3    71.72*  +3.2, -3.3
DNA          76.55   74.28   84.99   94.18*  +1.3, -1.5
German       66.3    66.9    67.6    70.7*   +2.8, -2.9
Heart        59.72   76.21   76.14   77.31*  +4.8, -5.5
Ionosphere   92.05*  85.22   82.95   91.39   +3.8, -5.5
Letter       95.8    95.26   92.98   96.6*   +0.5, -0.5
Monkey       78.7    86.34   87.04   100*    +0.0, -0.8
Satimage     89.45   89.35   85.3    90.15*  +1.3, -1.4
Segmen       96.32   96.27   95.97   96.92*  +0.7, -0.8
Shuttle      99.88   99.91   99.93*  99.86   +0.04, -0.05
Vehicle      65.3    68.51   66.79   69.5*   +3.1, -3.2

Table 2: Classification accuracy (in %) of different methods, using the k-NN rule, on several data sets. The best accuracy for each corpus is marked with an asterisk. The last column is the 95% confidence interval of the best method.

             L2      MD      CDM     CDW     CI
Australian   69.26   85.44*  85.29   84.8    +2.5, -2.8
Balance      91.16   91.66*  91.16   90.83   +2.0, -2.4
Diabetes     76.5    77.32*  73.77   75.13   +2.9, -3.1
DNA          86.76   83.64   85.16   94.43*  +1.2, -1.4
German       71.2    73.2    74.5*   71.8    +2.7, -2.8
Heart        67.89   85.13*  82.14   80.6    +4.0, -4.8
Ionosphere   94.7    85.22   90.34   97.35*  +1.9, -4.0
Letter       96.1    95.56   92.98   96.6*   +0.5, -0.5
Monkey       83.33   86.34   87.33   100*    +0.0, -0.8
Satimage     90.75*  90.65   87.25   90.75*  +1.2, -1.3
Segmen       96.32   96.27   95.97   96.92*  +0.7, -0.8
Shuttle      99.88   99.92   99.93*  99.86   +0.04, -0.05
Vehicle      66.54   71.72   70.25   71.85*  +3.0, -3.2

On the other hand, in many cases where CDW was the best method, the confidence intervals were small (notably for DNA, Monkey and Letter), thus indicating a statistically significant advantage of CDW.

Comparisons with the best method known for each corpus [UCI, Statlog, Sigilito (1989)] are summarized in Table 3, while Table 4 shows the results achieved by several methods on a few corpora⁴. From these comparisons and the previously discussed results (Tables 1 and 2) it can be seen that CDW exhibits a uniformly good behaviour across all the corpora, while other procedures may work very well for some corpora (usually only one corpus) but typically tend to worsen (dramatically in many cases) on the rest.

⁴ Corpora that make use of classification-cost penalties (Section 3), such as Heart and German, as well as other corpora which are not comparable because of other differences in experiment design, are excluded. Only methods with results available on many corpora, and corpora for which results with many methods are available, have been chosen for the comparisons in Table 4.

5 Concluding remarks

A weighted dissimilarity measure for NN classification has been presented. The required matrix of weights is obtained through Fractional-Programming-based minimization of an appropriate criterion index. Results obtained for several standard benchmark data sets are promising.

Current results using the CDW index and the FPGD algorithm are uniformly better than those achieved by other, more traditional methods. This also applies to comparing FPGD with the direct Gradient Descent technique previously proposed in [Paredes (1998)] to minimize a simpler criterion index.

Other, more sophisticated optimization methods can be devised to minimize the proposed index (4), and new indexes can be proposed which would probably lead to improved performance. In this sense, an index which relates the k-NN distances to the prototypes of the same class and the k-NN distances to the prototypes of the nearest class (rather than the plain NN distances as in (4)) would be expected to improve the current CDW k-NN results.

Another new weighting scheme that deserves to be studied is one in which weights are assigned to each prototype, rather than (or in addition to) each class. This "Prototype-Dependent Weighted (PDW)" measure would involve a more "local" configuration of the dissimilarity function and is expected to lead to an overall behaviour of the corresponding k-NN classifiers which is even more data-independent.

Table 3: Comparing CDW classification accuracy (in %) with the best accuracy achieved by other methods.

             CDW      Other (Method)
Australian   84.80    86.9  (Cal5)
Diabetes     75.13    77.7  (LogDisc)
DNA          94.43    95.9  (Radial)
Ionosphere   97.35    96.7  (IB3)
Letter       96.60    93.6  (Alloc80)
Monkey       100.00   100.0 (AQ17-DCI)⁵
Satimage     90.75    90.75 (KNN)
Segmen       96.92    97.0  (Alloc80)
Shuttle      99.86    99.0  (NewId)
Vehicle      71.85    85.0  (QuaDisc)

⁵ Many other algorithms also achieve 100% accuracy.

Table 4: Comparing classification error rates (in %) achieved by several methods. The best result for each corpus is marked with an asterisk.

            Alloc80  CART   C4.5   Discrim  NBayes  QDisc  Cal5   Radial  CDW
Australian  20.1     14.5   15.5   14.1     15.1    20.7   13.1*  14.5    15.2
DNA         5.7      8.5    7.6    5.9      6.8     5.9    13.1   4.1*    5.5
Letter      6.4      --     13.2   30.2     52.9    11.3   25.3   23.3    3.4*
Satimage    13.2     13.8   15     17.1     --      15.5   15.1   12.1    9.2*
Segmen      3*       4      4      11.6     26.5    15.7   6.2    6.9     3.1
Vehicle     17.3     23.5   26.6   21.6     55.8    15*    27.9   30.7    28.1

Local prototype weighting can also be made feature-independent; i.e., a single scalar weight is assigned to each prototype, intended to measure the value of that prototype for improving classification accuracy. Such a prototype weighting scheme can be seen from the viewpoint of prototype editing. This kind of weights can be learned using techniques similar to those introduced in this paper, leading to a recently studied and very successful editing-oriented weighting method which we call WP-Edit [Paredes (2000)].

References

[Cover (1967)] T.M. Cover and P.E. Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.

[Devroye (1996)] L. Devroye, L. Györfi and G. Lugosi. 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc.

[Tomek (1976)] I. Tomek. 1976. A generalization of the k-NN rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(2), 121-126.

[Fukunaga (1985)] K. Fukunaga and T.E. Flick. 1985. The 2-NN rule for more accurate NN risk estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1), 107-112.

[Luk (1986)] A. Luk and J.E. Macleod. 1986. An alternative nearest neighbour classification scheme. Pattern Recognition Letters, 4, 375-381.

[Urahama (1995)] K. Urahama and Y. Furukawa. 1995. Gradient descent learning of nearest neighbor classifiers with outlier rejection. Pattern Recognition, 28(5), 761-768.

[Short (1980)] R.D. Short and K. Fukunaga. 1980. A new nearest neighbor distance measure. In Proc. 5th IEEE Int. Conf. on Pattern Recognition, Miami Beach, FL.

[Short (1981)] R.D. Short and K. Fukunaga. 1981. An optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27, 622-627.

[Fukunaga (1982)] K. Fukunaga and T.E. Flick. 1982. A parametrically defined nearest neighbour measure. Pattern Recognition Letters, 1, 3-5.

[Fukunaga (1984)] K. Fukunaga and T.E. Flick. 1984. An optimal global nearest neighbour metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 314-318.

[Myles (1990)] J.P. Myles and D.J. Hand. 1990. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition, 23(11), 1291-1297.

[Paredes (1998)] R. Paredes and E. Vidal. 1998. A nearest neighbor weighted measure in classification problems. VIII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao, Spain, July 1998.

[Paredes (2000)] R. Paredes and E. Vidal. 2000. Weighting prototypes: a new editing approach. 15th International Conference on Pattern Recognition, ICPR 2000, Barcelona, Spain, September 2000.

[Sniedovich (1992)] M. Sniedovich. 1992. Dynamic Programming. Marcel Dekker Inc.

[Vidal (1995)] E. Vidal, A. Marzal and P. Aibar. 1995. Fast computation of normalized edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(9), 899-902.

[UCI] C. Blake, E. Keogh and C.J. Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Dept. of Information and Computer Sciences.

[Statlog] Statlog Corpora. Dept. of Statistics and Modelling Science (Stams), Strathclyde University. ftp.strath.ac.uk

[Sigilito (1989)] V.G. Sigilito, S.P. Wing, L.V. Hutton and K.B. Baker. 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.

[Shultz (1994)] T.R. Shultz, D. Mareschal and W.C. Schmidt. 1994. Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 57-86.

[Raudys (1991)] S.J. Raudys and A.K. Jain. 1991. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3), 252-264.

[Duda (1973)] R. Duda and P. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
