
Page 1: [IEEE 2011 Fourth International Workshop on Advanced Computational Intelligence (IWACI) - Wuhan, China (2011.10.19-2011.10.21)] The Fourth International Workshop on Advanced Computational

Clustering-based Extraction of Near Border Training Samples for Classification of Remote Sensing Image

Xiaoyong Bian and Xiaolong Zhang

I. INTRODUCTION

In the last decades, remote sensing images with high spatial and spectral resolution have been emerging rapidly and have driven the introduction of a set of novel algorithms in the field. Clustering and classification techniques have been introduced in information theory, image processing and computer vision. Clustering divides data into groups of similar objects [1], [2], and one of its goals is to approximate the real data distribution as closely as possible. There are many clustering methods that vary in their complexity and effectiveness. Clustering has recently received increasing attention for identifying structure in complex and large data. This is the case for remote sensing images, where a clustering method is usually run prior to supervised classification to generate the multi-cluster structure of the data distribution in an unsupervised manner, so that "valuable" training samples for support vector machine (SVM) classification can be extracted. Although the SVM has shown good generalization performance, it suffers from excessive computation caused by multispectral and hyperspectral images. It is therefore very important to locate training samples near the decision hyperplane, the so-called support vectors (SVs), which determine the decision boundary in the higher-dimensional feature space.

Manuscript received September 17, 2011. This work was supported in part by National Natural Science Foundation of China (60975031), the Program of Wuhan Subject Chief Scientist (201150530152), and the Project (2008CDB344, 2009CDA034) from Hubei Provincial Natural Science Foundation, P. R. China, the Open Foundation (2010D11) of State Key Laboratory of Bioelectronics, Southeast University, as well as the Project (2008TD04) from Science Foundation of Wuhan University of Science and Technology.

X. Bian is with School of Computer Science, Wuhan University of Science and Technology, Wuhan, 430065, P.R. China. (e-mail: [email protected]).

X. Zhang is with School of Computer Science, Wuhan University of Science and Technology, Wuhan, 430065, P.R. China. (e-mail: [email protected]).

The spatial and spectral information contained in remote sensing images can be mined and used for different learning tasks, including automatic or semiautomatic land cover classification. However, when dealing with multispectral and hyperspectral remote sensing images, several critical problems must be considered [3]: 1) the high cost of human labeling; 2) the poor quality of training samples. The manual definition of a suitable labeled data set is usually time consuming and highly redundant (often lacking the spatial variability of the spectral signature), yet supervised classification algorithms require a certain quantity of labeled samples to train a reliable classifier. In real applications, the high input dimensionality and small number of labeled data often incur the curse of dimensionality. A sufficient number of informative training samples must be found to achieve accurate learning of the classifier, and an alternative approach that expands the original training data is known as active learning [4], [5], which has received much attention in the remote sensing community. The learning procedure in active learning repeatedly queries a batch of available unlabeled data from an unlabeled pool and updates the training set by adding the newly labeled data. Nevertheless, the repeated selection of the most informative unlabeled data and retraining of the current classifier is still computationally expensive and converges slowly [6]-[8]. Moreover, the risk of classification error when selecting the most informative samples is not effectively restrained.

In this work, we propose a modified k-means clustering and a selection of training samples on the basis of the data cluster structure that explicitly takes the statistics of the class distribution into account [9], [10]. In the clustering procedure, a bundle of informative data points located between the cluster centroid and the cluster border can be extracted according to the cosine similarity measure. The near boundary within each individual cluster can be found, and only the data points near the border of each cluster are labeled and quickly picked out. Consequently, the qualified training samples obtained can accelerate SVM training. Note that the identified cluster borders are expected to approximate the decision boundary learned by the subsequent SVM.

The performance of the proposed approach is compared to cluster-center-based sampling and to the random sampling used in active learning approaches that have been presented for actively selecting training data for margin-based classifiers. Different from the active learning


Fourth International Workshop on Advanced Computational Intelligence Wuhan, Hubei, China; October 19-21, 2011

978-1-61284-375-9/11/$26.00 ©2011 IEEE

Abstract— In this paper, we investigate the joint use of modified k-means clustering and the selection of a set of data near the decision boundary for the classification of remote sensing images with a support vector machine (SVM). We propose to find informative samples and label them as near border training samples, which are induced by the cluster center and the cosine similarity measure between these data points and the cluster center. The proposed approach works in two consecutive steps: near border training samples are first produced in the clustering step and then fed to SVM classification. A comparison with cluster centers and random sampling is provided, suggesting the importance of the location of labeled data within individual clusters. Experiments on real remote sensing images show a better classification accuracy of the proposed method compared to state-of-the-art methods.



methods, the proposed approach works in two consecutive steps combining the unsupervised and supervised fashions, and the initially labeled data are only the cluster centers induced by the aforementioned clustering method.

This paper is organized as follows. Section II introduces the proposed two-step learning approach and discusses key problems. In Section III, the data and setup of the experiments are provided, presenting the different experimental results obtained on the considered data sets. Finally, Section IV draws the conclusion of the work.

II. LABEL AND SELECTION OF NEAR BORDER TRAINING SAMPLES

A. Modified k-means clustering

Clustering methods address the issue of finding a partition of the available data which maximizes the ratio of inter-class variance to intra-class variance. Traditional k-means is sensitive to the initial centers, is prone to getting stuck in local minima, and often provides no guarantee of producing balanced clusters, especially for large nonconvex clusters. In this case, a good initialization of the cluster centers is required in order to avoid being trapped in a local minimum. Variant methods are desired for improving the scalability of k-means, e.g., bisecting k-means [2] and divisive hierarchical clustering. They start from a single cluster of the whole data and repeatedly bisect the largest remaining cluster into two sub-clusters at each step, until the desired value of k is reached, which typically converges to balanced clusters of similar sizes.
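The bisecting strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions (random 2-means restarts, a fixed number of Lloyd iterations, and the function name `bisecting_kmeans` are ours), not the exact implementation cited as [2]:

```python
import numpy as np

def bisecting_kmeans(X, k, n_trials=5, rng=None):
    """Start from one cluster of all data and repeatedly bisect the
    largest remaining cluster with 2-means until k clusters remain."""
    rng = np.random.default_rng(rng)
    clusters = [np.arange(len(X))]              # one cluster of the whole data
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                    # largest remaining cluster
        best_split, best_sse = None, np.inf
        for _ in range(n_trials):               # a few random 2-means restarts
            centers = X[rng.choice(idx, size=2, replace=False)].astype(float)
            for _ in range(10):                 # a few Lloyd iterations
                d = np.linalg.norm(X[idx, None, :] - centers[None, :, :], axis=2)
                assign = d.argmin(axis=1)
                for j in (0, 1):
                    if np.any(assign == j):
                        centers[j] = X[idx[assign == j]].mean(axis=0)
            sse = (d[np.arange(len(idx)), assign] ** 2).sum()
            if sse < best_sse and np.all(np.bincount(assign, minlength=2) > 0):
                best_sse, best_split = sse, assign
        if best_split is None:                  # degenerate case: keep last split
            best_split = assign
        clusters.append(idx[best_split == 0])
        clusters.append(idx[best_split == 1])
    labels = np.empty(len(X), dtype=int)
    for c, ci in enumerate(clusters):
        labels[ci] = c
    return labels
```

Because the largest cluster is always the one bisected, the resulting clusters tend toward similar sizes, as the text notes.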

In this context, the clustering analysis is straightforward and based on an intuitive notion: on the one hand, a clustering method works on the whole multispectral or hyperspectral image and can produce an accurate cluster structure of the real data distribution; on the other hand, true labeling of a suitable training set for SVM classification is very costly and redundant as well. In this study, we propose k-means with meliorated initial centers instead of random initialization. We obtain k initial cluster centers from high-density regions, which are produced by the combined conditions of both a mean gray threshold within each channel and a maximum distance measure between the new cluster and the already obtained clusters (see Fig. 1(a)). Our clustering algorithm requires no more prior knowledge than the cluster number k, which is equal to the number of classes in the given data.
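As a rough sketch of this initialization, the snippet below approximates the high-density candidate pool with a per-channel mean-gray threshold and then applies a greedy maximum-distance rule for picking the next center. The thresholding criterion and the function name are assumptions for illustration; the paper's exact density condition may differ:

```python
import numpy as np

def meliorated_init_centers(X, k):
    """Pick k initial centers: restrict candidates to pixels passing a
    per-channel mean-gray threshold (an assumed density proxy), then
    greedily choose centers maximally distant from those already chosen."""
    # Candidate pool: samples at or above the mean in every channel.
    candidates = X[np.all(X >= X.mean(axis=0), axis=1)]
    if len(candidates) < k:                     # fall back to the full data
        candidates = X
    centers = [candidates[0]]                   # seed with an arbitrary candidate
    while len(centers) < k:
        # distance of every candidate to its nearest already-chosen center
        d = np.min(
            np.linalg.norm(
                candidates[:, None, :] - np.array(centers)[None, :, :], axis=2
            ),
            axis=1,
        )
        centers.append(candidates[d.argmax()])  # maximum-distance condition
    return np.array(centers)
```

The returned centers would then seed a standard k-means run in place of random initialization.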

B. Label the confident data points

We propose the use of the derived cluster structure encoded in the clustering result to guide the labeling of the most confident data points, which are essentially the cluster centers and are each assigned a definite label. The label of a cluster center can then be propagated within the cluster based on the cosine similarity measure. Motivated by SVs, the interior samples in each cluster need not be labeled and can be removed, thus every cluster can be reduced to a concentric ring. Assume the data objects $X = (x_i)_{i=1}^{n}$, $x_i \in R^D$, the set of classes $K = \{1, \ldots, c\}$, $k \in K$; then $\delta |c_k|$ $(0 < \delta < 1)$ data points near the cluster border can be retained (see Fig. 1(b)), denoted as $c'_k$. According to the cosine similarity measure, we have Equation (1):

$$\cos(x'_i, c_k) = \frac{x'^{T}_{i} c_k}{\|x'_i\|_2 \, \|c_k\|_2} \qquad (1)$$

where $\|\cdot\|_2$ is the L2-norm of the given expression, $c_k$ is the cluster center, and $x'_i$ represents a tested data point from the $k$th cluster. The larger the cosine value is, the more similar the tested data point and the cluster center are. Given a threshold $t$, a small set of competent data points with their labels $(y_i, i = 1, 2, \ldots)$ can be formed from cluster $c'_k$, and the whole training data from all clusters are obtained, defined by Equations (2) and (3), respectively:

$$TS_k = \{(x_i, y_i) \mid \cos(x_i, c_k) \geq t\} \qquad (2)$$

$$TS = \bigcup_{k=1}^{c} TS_k \qquad (3)$$

Furthermore, the above ring region is reduced to a smaller concentric region (see Fig. 1(c)), where the data points most similar to the cluster center yet close to the cluster border should be more discriminative, in accordance with the notion of SVs. The selection of different $\delta, t$ is a tradeoff between the number of data points to be labeled and the possible noise introduced by labeling error. If we choose a larger value for the parameter $\delta$ (or a smaller value for $t$), many more data points are assigned the same label as the cluster center's, which may introduce labeling error and degrade classification accuracy. Therefore different $\delta, t$ are assessed in the experiments.
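A minimal sketch of the two-stage selection defined by Equations (1)-(3): keep a δ-fraction ring of the points farthest from each center, then retain those whose cosine similarity to the center is at least t and propagate the center's label to them. Function and parameter names are illustrative, not the paper's:

```python
import numpy as np

def select_border_samples(X, labels, centers, delta=0.2, t=0.9):
    """Two-stage near-border selection: a delta-fraction ring per cluster,
    then a cosine-similarity threshold t against the cluster center."""
    TS_X, TS_y = [], []
    for k, c in enumerate(centers):
        pts = X[labels == k]
        # ring: the delta * |c_k| points farthest from the center, Fig. 1(b)
        n_keep = max(1, int(np.ceil(delta * len(pts))))
        dist = np.linalg.norm(pts - c, axis=1)
        ring = pts[np.argsort(dist)[-n_keep:]]
        # cosine similarity between ring points and the center, Eq. (1)
        cos = ring @ c / (np.linalg.norm(ring, axis=1) * np.linalg.norm(c) + 1e-12)
        keep = ring[cos >= t]                   # threshold, Eq. (2)
        TS_X.append(keep)
        TS_y.extend([k] * len(keep))            # label propagated from the center
    return np.vstack(TS_X), np.array(TS_y)     # union over clusters, Eq. (3)
```

The result is the training set TS fed to SVM training in the classification step.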

Each time a batch of data points is labeled, and the final training set comprising the border labeled data is used for SVM training. It is worth noting that cluster centers generally deviate from the SVs [3], [9], and such a training set is also surveyed.

C. Proposed approach

In this section, we present the detailed algorithm, consisting of two consecutive stages: a clustering step and a subsequent classification step. During the clustering step, the prior k is set equal to the number of classes in the given data. For the classification step, we use the LIBSVM package to train the SVM. Algorithm 1 gives a depiction in pseudocode.

Algorithm 1: Proposed clustering based classification process

Input: Data objects $X = (x_i)_{i=1}^{n}$, $x_i \in R^D$; cluster number $k$

1) Clustering step


Step 1: Obtain $k$ initial cluster centers using the modified k-means
Step 2: Run k-means with the $k$ cluster centers obtained above
Step 3: Label each cluster center and fit the cluster structure with an ellipsoid morphology
Step 4: Initialize $\delta, t, TS$ and generate the initial concentric ring region according to Equation (1), and the reduced concentric ring region according to Equation (2)
Step 5: Add the newly labeled data $TS_k$ to the current training set
Step 6: Repeat Step 5 until each cluster is done and the whole training data $TS$ is acquired

2) Classification step

Step 7: Train the RBF SVM classifier on $TS$ using grid search with 10-fold cross-validation
Step 8: Classify the investigated images and evaluate performance based on classification accuracy
Step 9: Output the classification model and the classification map
Step 10: Repeat Step 4 to Step 9 until the stop criterion is satisfied.
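The classification step (Steps 7-8) can be sketched with scikit-learn, whose `SVC` wraps the LIBSVM library used in the paper. The grid below is an assumed coarse discretization of the search range $[2^{-5}, 2^{10}]$ reported in Section III, and the function name is ours:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_rbf_svm(TS_X, TS_y, cv=10):
    """Train an RBF-kernel SVM on the near-border training set TS via
    grid search over (C, g) with cross-validation (10-fold by default)."""
    grid = {
        "C": [2.0 ** e for e in range(-5, 11, 3)],      # assumed coarse grid
        "gamma": [2.0 ** e for e in range(-5, 11, 3)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=cv)
    search.fit(TS_X, TS_y)
    return search.best_estimator_
```

A finer grid (or a two-pass coarse-then-fine search) would better match a full LIBSVM grid search, at a proportionally higher training cost.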

III. EXPERIMENTAL RESULTS

A. Data and setup

Two data sets are used in the experiments. The first data set is a Landsat ETM-7 image taken over Jiaodong Biland in Shandong Province, China. This data set consists of 332 × 464 pixels with 7 bands; after removing the panchromatic band, the remaining 6 bands are preprocessed by a PCA transformation. There are 4 land-cover types, and a small number of available labeled data were collected by ground survey. The second data set is the AVIRIS Indian Pine scene acquired over Northwestern Indiana in June 1992, covering the primary crops of the area and consisting of an initial 220 bands sized 145 × 145 pixels. Uncalibrated and noisy bands were removed, and 200 bands remained. The total number of samples for the second data set is 9345 (Corn-min till: 834, Corn-no till: 1434, Grass/Pasture: 497, Grass/Trees: 747, Soybean-no till: 968, Soybean-min till: 2468, Soybean-clean till: 614, Woods: 1294, Hay-windrowed: 489).

The proposed approach is unsupervised. Hence, it starts with no labeled information except the prior knowledge of the number of clusters. Our approach has been compared with only cluster centers, full clusters, and random sampling. For SVM learning, the LIBSVM package (one-against-all strategy, OAA) has been used, and the algorithm presented in the paper was implemented on an Intel Duo CPU 2.4 GHz Windows XP PC with 2 GB RAM. The RBF kernel was adopted, and the optimal parameters $(C, g)$ can be obtained via 10-fold cross-validation in the searching range $[2^{-5}, 2^{10}]$. In order to compare the proposed approach with other data sampling methods, the evaluation metric, classification accuracy ($OA$), is defined as follows:

$$OA = \frac{1}{n} \sum_{i=1}^{n} C(l_i, g_i) \qquad (4)$$

where $n$ denotes the total number of data points, $C(l_i, g_i)$ is the counter function that increments by 1 if and only if $l_i = g_i$, and the predicted label $l_i$ is permuted to match the label $g_i$ given by the groundtruth.
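Since Equation (4) permutes the predicted labels to best match the groundtruth, a small brute-force implementation (adequate for the small class counts used here; the function name is illustrative) might look like:

```python
import itertools
import numpy as np

def overall_accuracy(pred, gt):
    """Eq. (4): overall accuracy with the predicted labels permuted to
    best match the groundtruth (brute force over label permutations)."""
    classes = np.unique(np.concatenate([pred, gt]))
    best = 0.0
    for perm in itertools.permutations(classes):
        mapping = dict(zip(classes, perm))     # candidate relabeling of pred
        acc = np.mean([mapping[p] == g for p, g in zip(pred, gt)])
        best = max(best, acc)
    return best
```

For many classes, the brute force becomes infeasible and an optimal assignment solver (e.g. the Hungarian algorithm) would be the usual substitute.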

B. Results and Discussion

To investigate the effectiveness of the proposed approach, named near border data labeling and SVM (BL-SVM), we compared it with three other methods: 1) cluster centers + SVM (CC-SVM); 2) full clusters + SVM (FC-SVM); and 3) random sampling + SVM (RS-SVM). The last method has been presented in the remote sensing literature [8]. They all work on the basis of the clustered result. Fig. 1 gives an illustrative experiment on the first investigated image.

Fig. 1 Example of data labeling and sample selection. (a) Full cluster selection and labeling. (b) Concentric ring region selection and labeling. (c) Near border selection and labeling.

Fig. 1(a) shows four full clusters, drawn as four white ellipsoidal regions obtained in this experiment; the label of each cluster's points is the same as that of its cluster center. In this case, many non-SVs are added to the final training set, and the possible labeling error degrades classification accuracy. However, this can be improved by removing a certain quantity of interior data points of the cluster according to the parameter δ, as in Fig. 1(b); thus a small set of near border data points is reserved, where the possible labeling error is expected to be mitigated compared to Fig. 1(a). Because of the introduction of biased and redundant samples in Fig. 1(a) and (b), the data points in each ring region of Fig. 1(b) are further refined based on the cosine similarity measure between them and their cluster center, i.e., the most similar and spatially separated data points are selected, and a reduced set of near border data points is drawn, as in Fig. 1(c). In order to obtain statistically significant results, 10 trials were run by varying the selected training set, and the average classification accuracy is reported in Table 1.

Table 1 Classification accuracy comparison on the Landsat ETM-7 image

Training   Overall classification accuracy (%)
samples    RS-SVM    FC-SVM    CC-SVM    BL-SVM
12         88.15     88.56     90.30     90.80
20         88.83     88.92     90.83     91.71
50         90.21     90.50     92.21     92.56
120        91.40     91.23     92.50     93.92
150        91.73     91.60     93.16     93.75
180        92.67     92.55     93.74     94.62

Table 1 shows the overall classification accuracy of RS-SVM, FC-SVM, CC-SVM and our proposed approach BL-SVM for training data rates between 0.01% and 0.1% for the


first investigated data set. As can be seen in Table 1, the overall classification accuracy of our proposed approach outperforms the three other methods. In the best case, the increase in classification accuracy is higher than 2.88% for the Landsat ETM-7 data compared to RS-SVM classification at a training data rate of 0.02%. In this first experiment, the classical setting of δ, t is 0.02 and 0.975, respectively, and the lower the threshold t is set, the more samples are selected.

In summary, the proposed approach gains a stable increase in accuracy with a minimal number of training samples and demonstrates its effectiveness. It is noted that this also benefits from the SVM's low sensitivity to limited training samples and a small number of features.

Fig. 2 RGB composite and test performance for the analyzed AVIRIS hyperspectral image. (a) RGB composite image of Indian Pine. (b) Test performance (1 − OA/100) obtained by SVM using the different training sets induced by the different strategies.

Fig. 2(a) shows the RGB composite image of Indian Pine. For this data set, a large number of labeled samples were defined from ground reference data, and the high number of channels was reduced by PCA. In this study, nine different land-cover classes available in the original ground truth are investigated. As the data set presents a very challenging land-cover classification task, the aforementioned clustering is done directly within individual regions from the different classes rather than on the whole image. Fig. 2(b) illustrates the results for the four different sampling methods. The observed errors (computed as 1 − OA/100) are considered respectively, and the performance difference among the four strategies can be observed. In Fig. 2(b), the method BL-SVM provides the best results because it avoids the biased training samples that the other three methods introduce.

IV. CONCLUSIONS

In this paper, we presented a joint use of clustering and a supervised classification approach. An automatic procedure for extracting data points near the cluster border and labeling them is presented. The training samples extracted from each cluster are probably SVs that approximate the optimal separating hyperplane, under the assumption that the identified cluster structure of the unlabeled data is consistent with the SVM classification model. The process of choosing unlabeled data and labeling them is simple, fast and unsupervised. The selection criterion of the training data gives priority to the informative data points as well as the representative ones from different clusters, making the training set statistically stable. The experiments have shown that the proposed approach can increase the classification accuracy.

In the future, we plan to extend this work by combining information-theoretic clustering and active learning, in order to accurately yield the cluster structure and appropriately measure the informativeness and representativeness of the unlabeled data.

ACKNOWLEDGMENT

The authors would like to thank Dr. C.-J. Lin for the use of the LIBSVM library and the anonymous reviewers for their comments.

REFERENCES

[1] G. B. Hu, S. G. Zhou, J. H. Guan and X. H. Hu, “Towards effective document clustering: A constrained k-means based approach,” Information Processing and Management, vol. 44, no. 4, pp. 1397-1409, 2008.

[9] X. Y. Bian, T. X. Zhang and X. L. Zhang, "Combining clustering and classification for remote-sensing images," Chin. Opt. Lett., vol. 9, no. 1, pp. 011002-1:4, 2011.

[10] B. Demir and S. Erturk, "Clustering-based extraction of border training patterns for accurate SVM classification of hyperspectral images," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 4, pp. 840-844, 2009.
