


Study on the Improvement of K-Nearest-Neighbor Algorithm

Sun Bo, Du Junping, Gao Tian

Beijing Key Lab of Intelligent Telecommunication Software and Multimedia

Beijing University of Posts and Telecommunications, Beijing, China, 100876 [email protected], [email protected]

Abstract—As one of the instance-based learning methods, the K-nearest-neighbor (KNN) algorithm has been widely used in many fields. This paper improves the algorithm in two aspects. First, to improve the efficiency of classifying, we move some computations from the classifying period to the training period, which greatly reduces the computational cost. Second, to improve the accuracy of classifying, we take into account the contributions of different attributes and obtain an optimal attribute weight set using quadratic programming. Finally, this paper validates the improvements through practical experiments.

Keywords—K-nearest-neighbor; quadratic programming; attribute weight sets

I. INTRODUCTION

As a basic instance-based learning method, the K-nearest-neighbor (KNN) algorithm is widely used in many fields thanks to its efficiency and robustness [1-2]. However, the inherent weaknesses of KNN are the main obstacles to its wider application. During the training period, KNN simply stores the training instances and postpones most computations to the classifying period, which leads to a tremendous computational cost. Meanwhile, KNN does not take into account the contributions of the various attributes, which affects the accuracy of classifying.

As to the two drawbacks mentioned above, this paper presents improvements in both the efficiency and the accuracy of classifying. By moving some computational work from the classifying period to the training period, we greatly reduce the computational cost, because this part of the computation can be reused when classifying new instances. To account for the contributions of different attributes, quadratic programming is applied to calculate an optimal attribute weight set, and the similarities between instances are computed from this weight set. Finally, this paper validates the improvements through practical experiments.

----------------
This work is supported by the Beijing Natural Science Foundation of China (No. 4082021), the National Programs for High Technology Research and Development of China (863 Key Project, No. 2008AA018308) and a Project of the Beijing Education Committee.
----------------

II. K-NEAREST-NEIGHBOR ALGORITHM

The K-nearest-neighbor (KNN) algorithm is an instance-based learning method [3]. KNN assumes that each instance corresponds to a point in an n-dimensional space and can be described as a sequence of attributes, i.e. <a_1(x), a_2(x), …, a_n(x)>, where n is the number of attributes. The distance between instance x_i and instance x_j is calculated by formula (1).

d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}    (1)

Aiming to classify a new instance x, KNN selects the k instances in the training database nearest to x and uses them to determine the class of x. The working procedure of KNN is as follows.

Training period: for each training sample <x, f(x)>, add it to the training database.

Classifying period: to classify a new instance y, choose the k nearest instances in the training database, i.e. x_1 … x_k. Then return the classification of y as shown in formula (2), where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.

f(y) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))    (2)
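To make the procedure above concrete, the following is a minimal sketch of standard KNN in Python (illustrative only, not the authors' code; the function and variable names are assumptions of this sketch). It evaluates formula (1) for every training instance at query time and applies the majority vote of formula (2).

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=1):
    """Standard KNN: all distance work happens at classification time."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Formula (1): Euclidean distance from the query to every training instance.
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Formula (2): majority vote over the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example usage with made-up two-attribute instances.
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, [1.2, 1.9], k=3))   # -> "A"
```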

KNN has drawbacks in two aspects. First, because most computations happen when classifying a new instance rather than during training, the time complexity of classifying a new instance is enormous, even unacceptable, especially when the database is large [4]. Second, the distance between instances is calculated over all attributes of the instances. If there are many irrelevant attributes, the distance is distorted, which harms classifying accuracy [5]. Therefore, the training instances should be effectively indexed and modeled by an active learning method, so that classifying a new instance can reuse the outcomes of the training process instead of repeating it. Meanwhile, we try to obtain a subset of all attributes by filtering out irrelevant attributes in order to improve classifying accuracy.

III. K-NEAREST-NEIGHBOR ALGORITHM IMPROVEMENT

A. KNN Classifying Efficiency Improvement

During the training period, KNN performs almost no computation and simply stores the training instances. The computational work of classifying must therefore be repeated for every new instance, and when the database is large the time cost increases dramatically. Given an instance y with m attributes to classify and a training database of n instances, it takes O(m·n) to calculate the distances



between y and the other instances and O(n·log n) to sort the results, so the total time complexity is O(m·n + n·log n). Obviously, the key to improving classifying efficiency is the database size n. The improvement presented in this paper decreases the effective value of n, i.e. it compresses the scale of the search for the k nearest neighbors. By moving some computations from the classifying period to the training period, the amount of computation at classifying time decreases sharply and the efficiency of KNN is improved.

To illustrate the mechanism of the improvement, we simply assume that each instance has only two attributes. The working procedure of the improved KNN is shown in Fig. 1.

Figure 1. Improvement of KNN

Training period: randomly select an instance O from the training database as the center instance. Then calculate the distances from all other instances to the center instance and arrange the results in ascending order. Finally, assign a value to the parameter r (here r is temporarily set to 1/5 of the largest distance to the center instance).

Classifying period: for a new instance y, calculate the distance d between y and O; choose the instances whose distance to O lies in [d−r, d+r] from the ordered training database and calculate their distances to y. Then select the k instances whose distance to y is less than r and store them in ascending order, i.e. x_1 … x_k; return the result f(y) of classifying y, as in formula (3):

f(y) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))    (3)
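The following is a sketch of this improved procedure (again illustrative Python, not the authors' code; the class name and the r_fraction parameter are assumptions). The training step picks a random center instance O, sorts the training set by distance to O and fixes r; the classifying step binary-searches those precomputed distances to restrict the candidates to the ring [d−r, d+r]. One simplification is made: the sketch simply takes the k nearest ring candidates instead of enforcing the additional "distance lower than r" condition described above, and falls back to a full scan if the ring is empty.

```python
import numpy as np
from collections import Counter

class RingKNN:
    """Illustrative sketch of the improved KNN with a precomputed center ordering."""

    def fit(self, X, y, r_fraction=0.2, seed=0):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        rng = np.random.default_rng(seed)
        self.center = self.X[rng.integers(len(self.X))]        # random center instance O
        d_center = np.sqrt(((self.X - self.center) ** 2).sum(axis=1))
        order = np.argsort(d_center)                            # ascending distance to O
        self.X, self.y, self.d_center = self.X[order], self.y[order], d_center[order]
        self.r = r_fraction * self.d_center[-1]                 # r = 1/5 of the largest distance
        return self

    def predict(self, query, k=1):
        query = np.asarray(query, dtype=float)
        d = np.sqrt(((query - self.center) ** 2).sum())         # distance from the query to O
        # Binary search: candidates whose distance to O lies in [d - r, d + r].
        lo = np.searchsorted(self.d_center, d - self.r, side="left")
        hi = np.searchsorted(self.d_center, d + self.r, side="right")
        if hi <= lo:                                            # empty ring: fall back to a full scan
            lo, hi = 0, len(self.X)
        cand_X, cand_y = self.X[lo:hi], self.y[lo:hi]
        dists = np.sqrt(((cand_X - query) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]                         # k nearest among the ring candidates
        return Counter(cand_y[i] for i in nearest).most_common(1)[0][0]
```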

We add a sorting process to the training period, with time complexity O(m·n + n·log n). This process needs to be performed only once and its result is reused when classifying other instances. In the second step of classifying, we adopt the binary search algorithm to retrieve candidate instances with time complexity O(log n). Assuming that n1 instances are located in the ring shown in Fig. 1, the time complexity of calculating the distances between these n1 instances and the target instance is O(m·n1). If n2 instances are located within the circle of radius r, the time complexity of sorting them is O(n2·log n2). Finally, the overall time cost of this step is O(log n + m·n1 + n2·log n2). As n1 and n2 are usually much smaller than n in practical applications, the improved KNN is far more efficient than the standard KNN.

B. KNN Classifying Accuracy Improvement

Another problem of KNN is that the calculation of distances between instances involves all attributes. If each instance has 20 attributes but only two contribute to correct classification, the irrelevant attributes distort the distance calculation. One practical solution is to assign a weight to each attribute according to its contribution. With this improvement, the similarity between instance X and instance Y is given by formula (4), where w_i is the weight of attribute i.

Sim(X, Y) = \sum_{i=1}^{m} w_i \cdot Sim_i(X_i, Y_i)    (4)
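As an illustration (not from the paper), a per-attribute match/no-match similarity works naturally for nominal data such as the Mushroom attributes used later; formula (4) then reduces to a weighted dot product. The helper below is an assumption of this sketch.

```python
import numpy as np

def weighted_similarity(x, y, w):
    """Formula (4): Sim(X, Y) = sum_i w_i * Sim_i(X_i, Y_i).

    Here Sim_i is 1 when the i-th attribute values match and 0 otherwise;
    any other per-attribute similarity could be substituted."""
    per_attr = (np.asarray(x) == np.asarray(y)).astype(float)
    return float(np.dot(w, per_attr))

# Example: the second attribute carries most of the weight.
print(weighted_similarity(["red", "flat"], ["red", "bell"], w=[0.2, 0.8]))   # -> 0.2
```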

Suppose we already have a set of attribute weights; this set is used in formula (4) to calculate the similarity between two instances. We now try to obtain an optimal set of attribute weights, i.e. one that yields the smallest number of misclassified instances when classifying new instances. Obviously, the optimal attribute weights depend on both the training instances and the instances to be classified. For a large database, we assume that the training instances fully represent the instances to be classified, so the optimal attribute weights can be calculated using only the training database. This is what we aim to do.

1) Quadratic programming model: As an effective tool for optimization problems, the quadratic programming model is widely used. A quadratic programming model computes the maximum or minimum value of an objective function over a set of variables subject to a set of constraints on those variables. Each constraint is a linear equation or a linear inequality, and the objective function is at most quadratic [6-7]. The model can be represented as in formula (5).

\mathrm{Max/Min:}\ \sum_{j=1}^{n} c_j x_j + \sum_{j=1}^{n} \sum_{k=j}^{n} C_{jk} x_j x_k    (5)

Constraint_i (i = 1, …, m) takes one of the following forms, where x_j ≥ 0 (j = 1, …, n).

\sum_{j=1}^{n} a_{ij} x_j \le b_i

\sum_{j=1}^{n} a_{ij} x_j = b_i

\sum_{j=1}^{n} a_{ij} x_j \ge b_i

Formula (5) is the objective function, where x_1, …, x_n are the variables, n is the number of variables, and m is the number of constraints.

Many commercial software packages can solve quadratic programming problems, for example MATLAB and IBM OSL. Here we use MATLAB.
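Purely as an illustration of the general form of formula (5), the sketch below solves a tiny quadratic program with made-up coefficients using SciPy's SLSQP solver; this is an assumption of the example, not the tool used by the authors.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up coefficients: minimize  c'x + x'C x  subject to  A x <= b  and  x >= 0.
c = np.array([1.0, -2.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

objective = lambda x: c @ x + x @ C @ x
constraints = [{"type": "ineq", "fun": lambda x: b - A @ x}]   # b - A x >= 0  <=>  A x <= b
bounds = [(0, None)] * len(c)                                   # x_j >= 0

res = minimize(objective, x0=np.zeros(len(c)), method="SLSQP",
               bounds=bounds, constraints=constraints)
print(res.x, res.fun)
```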

2) Optimal attribute weight set calculation: We take binary classification as an example to illustrate how to use the quadratic programming model to obtain the optimal attribute weight set. With the simple assumption that the similarity of two instances in the same class is 1 and 0 otherwise, we calculate the attribute weight set during the training period and compute similarities using formula (4).

Based on the assumption above, calculating the attribute weight set is transformed into an optimization problem. This problem aims to minimize the differences between the similarities calculated by formula (4) and the real similarities. Because a quadratic programming model has only one objective function, the similarity differences of all instance pairs must be summed into that single objective. Since the final aim is to keep every individual difference small, we use the sum of the squares of the differences instead of their arithmetic sum. The optimization problem can then be represented by the following quadratic programming model.

Suppose that there are n instances in the training database and each instance has m attributes; the constraints are shown in formula (6).

\sum_{k=1}^{m} S_{ijk} W_k + L_{ij} - M_{ij} = R_{ij} \qquad (i, j = 1, \ldots, n;\ i < j)    (6)

In formula (6), S_ijk is the similarity on the kth attribute between instance i and instance j, so \sum_{k=1}^{m} S_{ijk} W_k is the similarity between instance i and instance j calculated by formula (4). R_ij is the real similarity between instance i and instance j according to their actual classes. L_ij and M_ij are the slack variables by which the calculated similarity falls below or exceeds the real similarity.

Our real aim is to minimize every L_ij and M_ij individually. Since a single objective function must express this aim, we minimize the sum of the squares of all L_ij and M_ij, which prevents any individual difference from becoming too large. The objective function is presented in formula (7).

\mathrm{Min:}\ \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( L_{ij}^2 + M_{ij}^2 \right)    (7)

From the analysis above, a typical quadratic programming model is formed by the constraints in formula (6) and the objective in formula (7). We use MATLAB to solve this model and obtain the attribute weight set.
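One way to see the structure of this model: for any fixed weight vector W, the constraints in (6) force L_ij = max(R_ij − Σ_k S_ijk W_k, 0) and M_ij = max(Σ_k S_ijk W_k − R_ij, 0) at the optimum, so (7) reduces to minimizing Σ_{i<j} (Σ_k S_ijk W_k − R_ij)² subject to W ≥ 0, i.e. a non-negative least-squares fit. The sketch below (illustrative, not the authors' MATLAB code) exploits this with scipy.optimize.nnls; the per-attribute similarities are the match/no-match values assumed earlier.

```python
import numpy as np
from scipy.optimize import nnls

def learn_attribute_weights(X, y):
    """Attribute weights from a training set of nominal attributes (binary classes).

    Builds one row per instance pair (i < j): the row holds S_ijk (1 if the two
    instances agree on attribute k, else 0) and the target holds R_ij (1 if they
    share a class, else 0).  Solving min ||S W - R||^2 with W >= 0 is the reduced
    form of the quadratic program in formulas (6)-(7)."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    rows, targets = [], []
    for i in range(n):
        for j in range(i + 1, n):                          # each unordered pair once
            rows.append((X[i] == X[j]).astype(float))      # S_ij1 ... S_ijm
            targets.append(1.0 if y[i] == y[j] else 0.0)   # real similarity R_ij
    W, _ = nnls(np.array(rows), np.array(targets))
    return W                                               # one non-negative weight per attribute
```

The simplification described in the next subsection would amount to restricting the inner loop to the n2 instances inside the circle of radius r rather than to all other instances.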

3) Complexity analysis and simplification: As there is still no comprehensive analysis of the complexity of quadratic programming problems, we only give a brief analysis of the size of this model. There are m weight variables and on the order of n·(n−1) difference variables; together with the constraints from formula (6), the model has roughly n·(n−1)+m variables and n·(n−1) constraints. Obviously, the size of this model grows quadratically rather than linearly with the number of training instances. As the training database grows, solving this problem may therefore become very complicated, even unmanageable, so we apply some simplifications to reduce the size of the model.

In the process of calculating the attribute weight set, each instance needs to be compared with all the other instances in the training database. If we compare each instance with only a subset of the other instances, we can greatly reduce the size of this model. Using the improvement discussed in Section III.A, we choose the n2 instances located in the circle of radius r as a representative subset. The model then has m weight variables, n·n2 difference variables and n·n2 constraints. As discussed in Section III.A, n2 is usually much smaller than n−1, so the process of calculating the attribute weight set is greatly simplified and accelerated, and moreover without extra computational cost.

IV. EXPERIMENTAL ANALYSIS

To test the performance of the improved KNN with the two improvements combined, we use it to classify the Mushroom data set acquired from the UCI Machine Learning Repository. This data set comprises 8124 instances, each characterized by 22 attributes and classified as either "edible" or "poisonous". We select 1000 instances at random as training instances to calculate the attribute weight set and then classify the remaining instances. We record P1, the rate of correct classification, and T1, the overall time for classifying the remaining instances. For comparison, we repeat the process with the standard KNN and record P2 and T2. All experiments are performed on a PC with an Intel Core 2 Duo CPU at 2.66 GHz, 2.00 GB of RAM, and Windows Server 2003. The value of K also affects the efficiency and accuracy of classifying, which we do not investigate here, so we simply set K to one. To avoid incidental results, we perform the experiment ten times; the summary and comparison are shown in Table I, Fig. 2 and Fig. 3.



TABLE I. STATISTICAL RESULTS ABOUT CLASSIFICATION

Times | Improved KNN T1 (s) | Improved KNN P1 (%) | Standard KNN T2 (s) | Standard KNN P2 (%)
1     | 50.4  | 95.01 | 132.9 | 93.54
2     | 39.1  | 95.50 | 126.2 | 93.83
3     | 42.7  | 97.75 | 135.0 | 95.87
4     | 45.2  | 97.60 | 139.4 | 97.08
5     | 56.8  | 95.95 | 142.8 | 93.90
6     | 34.5  | 96.56 | 127.5 | 96.83
7     | 36.9  | 97.50 | 130.6 | 96.54
8     | 40.7  | 96.00 | 140.5 | 95.89
9     | 33.4  | 97.78 | 131.8 | 96.50
10    | 30.3  | 96.10 | 125.3 | 95.32

Figure 2. Comparison on classification time

Figure 3. Comparison on classification accuracy

For the improved KNN, the average time for classifying the remaining instances is 41.0 seconds and the correct-classification rate is 96.58%. For the standard KNN, the average time is 133.2 seconds and the correct-classification rate is 95.53%. The standard deviations of the classification accuracy are 1.02% and 1.33% respectively. The improved KNN thus achieves a substantial gain in classifying efficiency, with the classification time reduced by about 69.22% (calculated as (T2−T1)/T2). Meanwhile, the correct-classification rate rises by about 1.05 percentage points. From the comparison of the standard deviations shown in Fig. 3, the correct-classification rate of the improved KNN is also more stable.

V. CONCLUSION

The K-nearest-neighbor algorithm is a basic instance-based learning method widely used in similarity-based classification. Based on the discussion of the drawbacks of the standard K-nearest-neighbor algorithm, this paper presents improvements in two aspects. First, combining lazy learning with active learning, we move some repeated computations from the classifying period to the training period in order to improve the efficiency of classifying. Second, to improve the accuracy of classifying, we take into account the contributions of different attributes and obtain an optimal attribute weight set using a quadratic programming model. Finally, the paper validates the improvements through practical experiments. The experimental results show that the improved KNN achieves a remarkable increase in classifying efficiency and a stable rise in classifying accuracy.

REFERENCES

[1] Xue Ling, Li Chao, and Xiong Zhang, "Design of content-based video retrieval system using MPEG-7", Journal of Beijing University of Aeronautics and Astronautics, 2006, 32(07), pp. 865-868.

[2] Wang Wei, Xu Wei, Xiong Zhihui, and Zhang Maojun, "Tow Phase Auto Identification and Query of Similar Video Clips Based on Fuzzy Histogram", Journal of Chinese Computer Systems, 2007, 28(08), pp. 1477-1481.

[3] Aha D. W., Kibler D., and Albert M. K., "Instance-based learning algorithms", Machine Learning, 1991, 6, pp. 37-66.

[4] Aha D. W., Lazy Learning, Dordrecht: Kluwer Academic, 1997.

[5] K. S. Beyer, J. Goldstein, and R. Ramakrishnan, "When is 'Nearest Neighbor' Meaningful?", Proc. of the 7th International Conference on Database Theory (ICDT), 1999.

[6] Bogdan Gavrea and Gosmin Petra, "Large quadratic programming problems generated by rigid body simulation", General Mathematics, 2008, 16(4), pp. 73-79.

[7] Domingos P., "Unifying Instance-Based and Rule-Based Induction", Machine Learning, 1996, 24(2), pp. 141-168.
