
Support Vector Machines for SAR Automatic Target Recognition

QUN ZHAO, Member, IEEE

JOSE C. PRINCIPE, Fellow, IEEE

University of Florida

Algorithms that produce classifiers with large margins, such as support vector machines (SVMs), AdaBoost, etc. are receiving more and more attention in the literature. A real application of SVMs for synthetic aperture radar automatic target recognition (SAR/ATR) is presented and the result is compared with conventional classifiers. The SVMs are tested for classification both in closed and open sets (recognition). Experimental results showed that SVMs outperform conventional classifiers in target classification. Moreover, SVMs with Gaussian kernels are able to form a local “bounded” decision region around each class that presents better rejection to confusers.

Manuscript received April 5, 2000; revised November 11, 2000; released for publication December 26, 2000.

IEEE Log No. T-AES/37/2/06336.

Refereeing of this contribution was handled by L. M. Kaplan.

This work was supported by DARPA Grant F33615-97-1-1019.

Authors’ address: Computational NeuroEngineering Laboratory, EB 451, Bldg. #33, P.O. Box 116130, University of Florida, Gainesville, FL 32611, E-mail: (fzhao,[email protected]).

0018-9251/01/$10.00 © 2001 IEEE

I. INTRODUCTION

The training of a learning machine is statistical in nature, which means that an appropriate criterion is needed to fit both the model order and the parameters. This implies that the design procedure should take into consideration both the performance on the training set and the model complexity. In the statistical literature, various criteria for model complexity design have been described, such as the Akaike information-theoretic criterion (AIC) [1], and the minimum description length (MDL) criterion [27, 28]. The model complexity criterion can be regarded as a sum of two terms [17, 26] involving a log-likelihood function and a model complexity penalty. According to this theory, the task of training a learning machine is to find a weight vector that minimizes the following cost functional $J(\mathbf{w})$ [5],

$$J(\mathbf{w}) = R_{\mathrm{emp}}(\mathbf{w}) + \lambda R_{\mathrm{mdl}}(\mathbf{w}) \tag{1}$$

where $R_{\mathrm{emp}}(\mathbf{w})$ is the empirical risk or the standard performance measure resulting from the training set, such as the minimum squared error (MSE), and the second term $R_{\mathrm{mdl}}(\mathbf{w})$ is a complexity penalty term depending upon the network topology (capacity). In fact, this risk equation (1) is a simple form of regularization theory [33], where $\lambda$, the regularization parameter, is normally difficult to determine. When $\lambda$ is zero, (1) is called the empirical risk minimization (ERM) principle [35] and no capacity control is utilized, which normally leads to overfitting the training data and producing bad generalization. When $\lambda$ is increased, more emphasis is put on the complexity penalty to specify the network, and the error rate in the training set increases, but better generalization is achieved. This means that a suitable balance should be struck between the accuracy attained on the particular training set and the capacity of the classifier.

Statistical pattern recognition has lived with this compromise since its early days [13]. Recently, structural risk minimization (SRM) has been proposed by Vapnik as an alternate inductive principle for learning, which is able to control the generalization ability of learning machines in the small sample set limit [34, 35]. Vapnik proposed to minimize a confidence interval derived from the capacity of the set of functions implemented by the learning machine (Vapnik-Chervonenkis or VC dimension) instead of striking the compromise between empirical risk and machine complexity. This is a remarkably different way of thinking about generalization and is linked to Popper’s principle of falsifiability. The same author showed later that a practical way to minimize the VC dimension is to design classifiers that maximize the margin. The margin is defined as the minimum distance between the training set samples and the decision surface.

IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS VOL. 37, NO. 2 APRIL 2001 643

The theoretical and experimental results show that many learning algorithms, such as SVMs [35], AdaBoost [10, 31], and Bagging [3], will produce classifiers with large margins and lead to better generalization performance. As a large margin classifier, the SVM has been used successfully in many pattern recognition applications [5], including isolated handwritten digit recognition [6], speaker identification [32], face detection in images, and very recently also automatic target recognition (ATR) [21, 40-41].

ATR refers to the use of computer processing to detect and recognize target signatures, in our case in synthetic aperture radar (SAR) images. The conventional ATR architecture comprises a focus of attention (detector and discriminator) followed by a classifier [24]. The role of the focus of attention is to discard image chips that do not contain potential targets. Model based approaches are being investigated in the MSTAR (moving and stationary target acquisition and recognition) literature, but here we concentrate on comparing statistical classifiers. Statistical classifiers can be broadly divided into two types following the taxonomy in [20]: one class in one network (OCON) and all class in one network (ACON). Template matching is typical in the OCON group, while classifiers trained discriminantly (i.e., with all the classes at once) such as the multilayer perceptron (MLP) or radial basis function (RBF) networks [38] appear in the second group.

Fig. 1. Illustration of two-class classification problem. In the left figure, a “global” discriminant function divides whole sample space into two parts. In the right figure, two “local” decision regions are formed to keep confusers away from class region of interest.

Pattern classification can be grouped into closed set and open set applications [16]. In closed-set classification, one needs to perform the classification into a fixed number of classes, and we expect that the test set samples are drawn from the same classes. However, in some practical cases, some exemplars presented to the classifier during testing do not belong to the learned classes. This has been called open-set classification or verification [16]. For instance, in face recognition, a security system has to be able to reject intruders while being able to cope with variations of a known face due to lighting or pose differences [4]. ATR falls into the second group, where it is impossible to create a training set with all possible vehicles and the ATR system is required, for instance, to discriminate between military and civilian vehicles [23]. Similar problems arise in speaker identification [14], recognition and verification of fingerprints, signatures, etc. Open-set classification is an important problem that falls between classification and detection [18] and it is much more demanding than simply requiring accuracy and generalization from the classifier. Here rejection of confusers (in this paper confusers are vehicles that are not included in the training set) is needed. One common way of implementing rejection is the thresholding criterion [23], which defines a decision region in the pattern space with a threshold $T$ given in advance as

$$D(T) = \{\mathbf{x} \mid g(\mathbf{x}) \ge T,\ \forall \mathbf{x} \in R(I)\} \tag{2}$$

where $g(\mathbf{x})$ is the decision function of a classifier, and $R(I)$ represents the pattern space. The thresholding criterion states that we are taking the value of the discriminant function as a representation of the proximity to the in-class samples (Fig. 1). This implies that the discriminant function should be able to create a “local” decision region, instead of a “global” one. Otherwise a confuser far away from the class center can easily be accepted as an object of interest. Hence in verification systems the actual topology of the classifier is rather important for rejection of confusers.

For the problem of SAR/ATR, the classifier should be able to classify the targets in the training set as well as their variants (different serial numbers), and to reject confusers, all at a reasonable level. Support vector machines (SVMs) are utilized here to perform the task of target recognition and confuser rejection. As a comparison, perceptrons trained with the MSE criterion (the delta rule) [17] and SRM are also used to perform the same tasks. The theoretical background of MSE and SRM is given in Section II. Experimental results and discussion are given in Sections III and IV, respectively.

II. LEARNING CRITERIA FOR EMPIRICAL AND STRUCTURAL RISK MINIMIZATION

Let us consider a two-class classification problem, where the training set is described as

$$X := \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}, \quad \mathbf{x}_i \in \mathbb{R}^n, \tag{3}$$

and labels as

$$Y := \{y_1, \ldots, y_m\} \subseteq \{-1, 1\}. \tag{4}$$

A. Perceptron Criterion and the Minimum Squared Error Criterion

A simple classifier that can solve a linearly separable task is the perceptron [29], and its linear decision function is represented as

$$g(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b) \tag{5}$$

where $(\mathbf{w} \cdot \mathbf{x})$ indicates the inner product, and $\mathrm{sgn}(\cdot)$ is a signum function. The perceptron criterion (or the risk functional) is defined as [29]

$$J(\mathbf{w}) = \sum_{\mathbf{x}_i \in E_x} |\mathbf{w} \cdot \mathbf{x}_i + b| \tag{6}$$

where the summation is over the set $E_x$ of patterns that are misclassified by the perceptron, and $|\cdot|$ represents the absolute value. For linearly separable problems, the algorithm converges in a finite number of iterations.

Unlike the perceptron criterion (6), which considers only the misclassified patterns, the MSE criterion takes into account the entire training set. It is defined as the squared error ($L_2$ norm) between the desired output and actual output,

$$J(\mathbf{w}) = \sum_{i=1}^{m} (y_i - g(\mathbf{w}, \mathbf{x}_i))^2. \tag{7}$$

To get a continuous, differentiable output, a sigmoid function is used instead of the signum,

$$g(\mathbf{w}, \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{x} + b)}. \tag{8}$$

The delta rule, which is a gradient-based algorithm, can be used to train the network [17]. Since the samples that produce larger errors are closer to the boundary, the MSE risk functional (7) will place the decision surface at a location that predicts the correct class assignment better and more consistently than the perceptron criterion does.

Taking the model complexity into consideration, one can use regularization theory as indicated in (1). Here a cost functional with a weight decay [37] term is utilized in the experiments,

$$J(\mathbf{w}) = \sum_{i} (y_i - g(\mathbf{w}, \mathbf{x}_i))^2 + \lambda \cdot \|\mathbf{w}\|^2. \tag{9}$$

The parameter $\lambda$ has to be experimentally determined.
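To make the weight-decay criterion (9) concrete, the following is a minimal sketch of gradient descent on that cost for a single sigmoid unit. It is an illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, the update is a batch step rather than a per-sample one, the targets are taken as 0/1 so that they match the range of the sigmoid, the bias enters as `w.x + b`, and the toy data, step size and value of `lam` are chosen only for the demonstration.

```python
import math

def sigmoid(a):
    """Logistic function used as the smooth replacement for the signum."""
    return 1.0 / (1.0 + math.exp(-a))

def output(w, b, x):
    """Unit output; the sign convention of the bias is an assumption."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(w, b, X, y, lam):
    """Cost (9): summed squared error plus lam * ||w||^2."""
    error = sum((yi - output(w, b, xi)) ** 2 for xi, yi in zip(X, y))
    return error + lam * sum(wj * wj for wj in w)

def delta_rule_step(w, b, X, y, lam, eta):
    """One batch gradient-descent step on cost (9)."""
    grad_w = [2.0 * lam * wj for wj in w]   # gradient of the weight-decay term
    grad_b = 0.0                            # the bias is not decayed
    for xi, yi in zip(X, y):
        g = output(w, b, xi)
        common = -2.0 * (yi - g) * g * (1.0 - g)  # chain rule through the sigmoid
        for j, xj in enumerate(xi):
            grad_w[j] += common * xj
        grad_b += common
    new_w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - eta * grad_b

# Toy data (logical AND) with 0/1 targets; purely illustrative.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.0, 0.0, 0.0, 1.0]
w, b = [0.0, 0.0], 0.0
initial_cost = cost(w, b, X, y, lam=1e-4)
for _ in range(200):
    w, b = delta_rule_step(w, b, X, y, lam=1e-4, eta=0.2)
final_cost = cost(w, b, X, y, lam=1e-4)
```

Increasing `lam` in this sketch shrinks the weights at the expense of training error, which is the trade-off described by (1).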

B. Criteria for Structural Risk Minimization

The delta rule with weight decay implements criterion (1). In this section, an alternative learning criterion called SRM is considered. Two applications of this learning methodology are briefly reviewed, i.e., Optimal Hyperplane (OH) and SVM. The interested reader is referred to [35] for a full treatment.

Fig. 2. Two-class linearly separable problem (balls versus triangles). OH (solid line) intersects itself halfway between the two classes, and keeps margin maximal. Samples across boundary $H_1$ or $H_2$ are support vectors.

1) Optimal Hyperplane: The training set of Section II is said to be separated by an OH if the following two conditions are satisfied. First, all the samples are separated without error (keep the empirical risk zero), and second, as illustrated in Fig. 2, the distances between the closest vectors to the hyperplane are maximal. The separating hyperplane is described in the canonical form,

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, m. \tag{10}$$

It is easy to prove that the margin between the two hyperplanes $H_1: \mathbf{w} \cdot \mathbf{x}_i + b = 1$ and $H_2: \mathbf{w} \cdot \mathbf{x}_i + b = -1$ is $d = 2/\|\mathbf{w}\|$. Thus, to find a hyperplane that satisfies the second condition, one has to solve the quadratic programming problem of minimizing $\|\mathbf{w}\|^2$, subject to constraint (10). The solution to this optimization problem is given by the saddle point of a primal Lagrange functional,

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m} \alpha_i [y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1] \tag{11}$$

where $\alpha_i$, $i = 1, \ldots, m$, are nonnegative Lagrange multipliers. Since (11) is a convex quadratic programming problem, it is equivalent to solving the “dual” problem [9]: maximize $L_P$, subject to the constraints that the gradient of $L_P$ with respect to $\mathbf{w}$ and $b$ vanish, which gives the simpler conditions,

$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \tag{12}$$

$$\sum_{i} \alpha_i y_i = 0 \tag{13}$$
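The step from the primal functional to the dual can be written out explicitly. The expansion below is a worked sketch that uses only (11), (12) and (13); the double-sum notation is introduced here for illustration and does not appear in the original text.

$$\begin{aligned}
L_P &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i} \alpha_i y_i\, \mathbf{w} \cdot \mathbf{x}_i - b \sum_{i} \alpha_i y_i + \sum_{i} \alpha_i \\
&= \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - 0 + \sum_{i} \alpha_i \\
&= \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
\end{aligned}$$

The first line expands the bracket in (11), the second line replaces $\mathbf{w}$ by (12) and drops the bias term by (13), and the last line collects the two double sums. The result depends on the data only through the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$.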

Substituting (12) and (13) into (11), we get the dual formulation in the following vector representation,

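$$\begin{aligned}
L_D &= \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T C \Lambda \\
\text{s.t.} \quad & \Lambda^T Y = 0 \\
& \Lambda \ge 0
\end{aligned} \tag{14}$$

where $\Lambda^T = (\alpha_1, \ldots, \alpha_m)$, $Y^T = (y_1, \ldots, y_m)$, $\mathbf{1}^T = (1, \ldots, 1)$ is an $m$-dimensional all-ones vector, and $C$ is a symmetric $m$ by $m$ correlation matrix with elements $C_{ij} = y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j$, $i, j = 1, \ldots, m$. Notice that there is a Lagrange multiplier $\alpha_i$ for every training sample. In the solution, those points with $\alpha_i > 0$ are called support vectors (SV), and lie on either $H_1$ or $H_2$. The decision surface is made by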

$$g(\mathbf{x}) = \mathrm{sgn}\left(\sum_{i \in SV} y_i \alpha_i \mathbf{x}_i \cdot \mathbf{x} + b\right) \tag{15}$$
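A two-point toy problem shows how (12), (15) and the margin $2/\|\mathbf{w}\|$ fit together. The sketch below is illustrative only: the data, the helper names, and the multiplier values are assumptions made for this example (for these two points the dual (14) reduces to maximizing $2a - 4a^2$ with both multipliers equal to $a$, which gives $a = 1/4$ and $b = 0$).

```python
import math

def weight_from_multipliers(alphas, labels, points):
    """Weight vector of (12): w = sum_i alpha_i * y_i * x_i."""
    w = [0.0] * len(points[0])
    for a, y, x in zip(alphas, labels, points):
        for k, xk in enumerate(x):
            w[k] += a * y * xk
    return w

def decide(w, b, x):
    """Decision rule (15), written with the explicit weight vector."""
    s = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1 if s >= 0 else -1

# Hypothetical toy set: one sample per class, both are support vectors.
points = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]   # hand-derived solution of the dual for this toy set
b = 0.0

w = weight_from_multipliers(alphas, labels, points)
margin = 2.0 / math.sqrt(sum(wk * wk for wk in w))
# Canonical-form check (10): y_i * (w . x_i + b) equals 1 for support vectors.
canonical = [y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
             for x, y in zip(points, labels)]
```

Here the margin equals the distance between the two samples, as expected when the hyperplane lies halfway between them.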


2) Soft Margin Hyperplane: More generally, when dealing with nonlinearly separable patterns, we introduce positive slack variables $\xi_i$, $i = 1, \ldots, m$, in the constraint (10), i.e.,

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i \tag{16}$$

For an error to occur, the corresponding $\xi_i$ must exceed unity, thus $\sum_i \xi_i$ is an upper bound on the number of training errors. In this case, the risk functional to minimize is,

$$L = \|\mathbf{w}\|^2/2 + \lambda \left(\sum_{i} \xi_i^{\sigma}\right)^k \tag{17}$$

subject to constraint (16), where $\lambda$ is a parameter to assign a penalty to training errors. For any positive integer $k$, this is a convex programming problem. For sufficiently large $\lambda$ and sufficiently small $\sigma$, the parameters $\mathbf{w}$ and bias $b$ determine the hyperplane that minimizes the number of errors on the training set and separates the rest of the elements with maximal margin. Note that the problem of constructing a hyperplane which minimizes the error on the training set is in general NP-complete. To avoid this difficulty the case of $\sigma = 1$ is considered in this paper, where the solution is called the soft margin hyperplane. If we take $k = 2$ in (17), the optimization remains a quadratic programming problem of maximizing

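$$\begin{aligned}
L_D &= \Lambda^T \mathbf{1} - \frac{1}{2}\left[\Lambda^T C \Lambda + \frac{\delta^2}{\lambda}\right] \\
\text{s.t.} \quad & \Lambda^T Y = 0 \\
& \delta \ge 0 \\
& 0 \le \Lambda \le \delta \mathbf{1}
\end{aligned} \tag{18}$$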

where $\delta$ is a scalar, and the constraint implies that the smallest admissible value is $\delta = \alpha_{\max} = \max(\alpha_1, \ldots, \alpha_m)$ [6]. Therefore, to construct a soft margin hyperplane, one can either solve the convex programming problem in the $m$-dimensional space of the parameter vector $\Lambda$, or solve the quadratic programming problem in the dual $m+1$ dimensional space of $\Lambda$ and $\delta$ [6].

3) Support Vector Machine: Until now, all the previous architectures create decision functions in the input space that are linear functions of the data. Then one may ask how the above method can be generalized to the case of nonlinear decision functions in a feature space. One alternative is to map the data to some other high-dimensional Euclidean space (feature space) using a nonlinear mapping $\phi: \mathbb{R}^d \to E$. There is evidence provided by Cover’s Theorem [22] that a complex pattern classification problem cast in a higher dimensional space is more likely to be linearly separable than in a lower dimensional space. The advantage of this method is that it decouples the number of free parameters of the learning machine from the input space dimensionality [22]. In this way, the decision rule of (15) is implemented in the new feature space, i.e.,

$$g(\mathbf{x}) = \mathrm{sgn}\left(\sum_{i \in SV} y_i \alpha_i \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}) + b\right).$$

Using kernel functions $K(\mathbf{x}, \mathbf{y})$ that obey the Mercer condition [6, 28], the discriminant can be written

$$g(\mathbf{x}) = \mathrm{sgn}\left(\sum_{i \in SV} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b\right). \tag{19}$$
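The discriminant (19) can be sketched in a few lines for a Gaussian kernel. This is an illustration under assumptions: the kernel is written in the common form $\exp(-\|\mathbf{x} - \mathbf{z}\|^2 / (2\sigma^2))$, which this section does not spell out, and the support vectors, multipliers and bias below are hypothetical hand-picked values rather than a trained machine. The point of the example is that the value inside the signum decays toward $b$ far from every support vector, which is what allows a threshold on that value to bound the decision region.

```python
import math

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel in the common form exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def raw_output(x, support_vectors, labels, alphas, b, sigma):
    """Argument of the signum in (19)."""
    return sum(y * a * gaussian_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

def decide(x, support_vectors, labels, alphas, b, sigma):
    """Decision rule (19)."""
    return 1 if raw_output(x, support_vectors, labels, alphas, b, sigma) >= 0 else -1

# Hypothetical machine with two support vectors (values chosen by hand).
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
b, sigma = 0.0, 1.0

near_positive = raw_output((0.0, 0.0), support_vectors, labels, alphas, b, sigma)
far_away = raw_output((100.0, 100.0), support_vectors, labels, alphas, b, sigma)
```

In this sketch `near_positive` is close to 1 while `far_away` is essentially equal to the bias, so a distant confuser produces a small output rather than a confident one.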

The advantage of using a kernel function is that instead of calculating the inner product $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x})$ in the feature space, we can do it in the input space by using $K(\mathbf{x}_i, \mathbf{x})$. This learning machine is the so-called support vector machine. Correspondingly, the task of training an SVM is to maximize

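$$\begin{aligned}
L_D &= \Lambda^T \mathbf{1} - \frac{1}{2}\left[\Lambda^T K \Lambda + \frac{\delta^2}{\lambda}\right] \\
\text{s.t.} \quad & \Lambda^T Y = 0 \\
& \delta \ge 0 \\
& 0 \le \Lambda \le \delta \mathbf{1}
\end{aligned} \tag{20}$$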

where $K$ is a symmetric $m$ by $m$ kernel matrix with elements $k_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$.

To describe the classification ability of either the OH or the SVM, the margin of an example $(\mathbf{x}_j, y_j)$ is defined as [31]

$$\rho_f(\mathbf{x}_j, y_j) = y_j g(\mathbf{x}_j) \tag{21}$$

where $g(\mathbf{x}_j)$ is the decision function defined by (19). It is observed in the following experiments that the SVM tends to increase the margins associated with training examples and converge to a distribution in which most examples have large margins.

Instead of quadratic programming, an iterative algorithm called the Adatron [2] is used in our experiments to train the OH. This algorithm was recently extended to the kernel Adatron [11, 12], so it can also train the SVM. The advantage of the Adatron algorithm is its conceptual and implementation simplicity. It precomputes the inner products (or the kernel computation), so it is memory intensive and cannot be applied to large data sets. As it is an iterative algorithm, we have to choose the step size experimentally, but its conceptual simplicity and straightforward implementation make it a good choice for our case. We present the algorithm in the Appendix.

One final point to make is that large margin training is intrinsically a discriminant training method that is applied to two classes. Hence if the problem at hand has more than two classes we have to train several classifiers (either one class versus all the others, or pairwise classification). Since we only used three classes, we utilized pairwise training here (two different classifiers were trained per class).
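The flavor of the Adatron-style training mentioned above can be conveyed with a short sketch. This is not the algorithm of the Appendix: it is a simplified, generic kernel Adatron update in the spirit of [11, 12], written under the assumptions that the bias and the soft margin are omitted, the kernel is Gaussian, and the toy XOR data, kernel width and step size are chosen only for illustration. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel in the common form exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def kernel_adatron(points, labels, sigma, step=0.5, epochs=200):
    """Simplified kernel Adatron: no bias, hard margin (assumptions)."""
    m = len(points)
    # The kernel matrix is precomputed, so memory grows as m^2.
    K = [[gaussian_kernel(points[i], points[j], sigma) for j in range(m)]
         for i in range(m)]
    alphas = [0.0] * m
    for _ in range(epochs):
        for i in range(m):
            z = sum(alphas[j] * labels[j] * K[i][j] for j in range(m))
            margin_i = labels[i] * z
            # Move the multiplier toward margin 1 and keep it nonnegative.
            alphas[i] = max(0.0, alphas[i] + step * (1.0 - margin_i))
    return alphas, K

def margins(alphas, labels, K):
    """Margins of the training samples, using the output before the signum."""
    m = len(alphas)
    return [labels[i] * sum(alphas[j] * labels[j] * K[i][j] for j in range(m))
            for i in range(m)]

# Toy XOR problem, not separable by a hyperplane in the input space.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
alphas, K = kernel_adatron(points, labels, sigma=0.5)
training_margins = margins(alphas, labels, K)
margin_sum = sum(training_margins)
```

Tracking `margin_sum` over the epochs gives the kind of margin curve that the experiments use as a training monitor.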


Fig. 3. (a) Illustration of pose. (b) SAR images of targets T72, BTR70, and BMP2 taken at different aspect angles.

III. EXPERIMENTAL RESULTS

In this work, SAR ATR experiments were performed using the MSTAR database to classify three targets. The image data are composed of 80 by 80 SAR image chips roughly centered on three types of military vehicles: the T72, BTR70, and BMP2 (the T72 is a tank and the other two vehicles are armored personnel carriers). Examples of the SAR images are shown in Fig. 3. These images are a subset of the 9/95 MSTAR Public Release Data, where the pose (aspect angle) of the vehicles lies between 0 and 360 deg. Only target images are used here so there is no need for the focus of attention. This image set was chosen because a pilot study is available in the open literature that can be used as a base for further comparisons [36].

We normalized the $L_2$-norm of all the images from the training and testing sets, and utilized the images directly as inputs to the classifier. This preprocessing was kept at a minimum because the targets in the MSTAR database were in the same open field background, and the radar was carefully calibrated. If the operations of recentering, intensity matching, and background masking (as done in [36]) were performed, better accuracy should be possible, but a longer effort would have been necessary to conduct the testing.

The training set for the closed set classification contained SAR images taken at a depression angle of 17 deg, while the testing set depression angle is 15 deg. Therefore the SAR images in the training and the testing sets for the same vehicle at the same pose are different. Variants (different serial numbers) of the three targets were also used in the testing set for the open set experiments, as illustrated in Table I. The sizes of the training and testing sets are 698 and 1365, respectively.

TABLE I
Training and Testing Set

| Training Set | Size | Testing Set | Size |
|---|---|---|---|
| T72 (Sn 132) | 232 | T72 (Sn 132) | 196 |
| | | T72 (Sn 812) | 195 |
| | | T72 (Sn s7) | 191 |
| BTR70 (Sn c71) | 233 | BTR70 (Sn c71) | 196 |
| BMP2 (Sn c21) | 233 | BMP2 (Sn c21) | 196 |
| | | BMP2 (Sn c9566) | 196 |
| | | BMP2 (Sn c9563) | 195 |

The SAR images are very noisy due to the image formation and lack resolution due to the radar signal bandwidth in the range dimension and integration angle in the cross-range dimension, which makes the classification of SAR vehicles a non-trivial problem [23]. Unlike optical images, the SAR images of the same target taken at different aspect angles show great differences, which precludes the existence of a rotation invariant transform. This results from the fact that a SAR image reflects the fine target structure (point scatter distribution on the target surface) at a certain pose. Parts of the target structure will be occluded when illuminated by the radar from another pose, which results in dramatic differences from image to image taken with angular increments of only a few degrees.

In order to cope with these problems, template matching uses closely spaced poses (~10 deg) to form the template [36]. Template matchers are OCON classifiers, which are not trained discriminantly [20]. We are experimenting with a different classifier architecture, based on more powerful classifiers trained discriminantly (ACON) and a pose estimator (see Fig. 4). The input space is divided using the pose information [25] and twelve sub-classifiers were trained, one for each 30 deg sector of aspect angle, with data from all three classes. We have created a pose estimator based on mutual information that is able to determine the pose of all MSTAR targets with an error of less than 8 deg [39]. In these results we assume that the pose estimator is error free. The large sector size was chosen to differentiate our approach from the closely spaced poses, but the sector size has not been optimized for best performance.

Fig. 4. Classifier topology is depicted. First a pose estimator is applied to image and determines approximate pose of target, then classifier is chosen according to result of pose estimation.

We compared three classifiers. 1) A perceptron trained with the delta rule and weight decay, with a single layer structure of 6,400 input units and 3 output units. 2) An OH (the same perceptron as in 1) but trained for large margin). 3) An SVM based on the Gaussian kernel, with the kernel size chosen as the average Euclidean distance between training patterns. Both the OH and the SVM were trained with the Adatron algorithm with bias and soft margin [11, 12] (see Appendix).

Fig. 5. (a) Weight distribution of perceptron trained with (9). Weights connect input nodes with the three output nodes, respectively, of the classifier covering aspect angles from zero to 30 deg. Compared with images in Fig. 3, they resemble the targets’ features. (b) Weight distribution of OH, obtained with Lagrange multipliers (T72-BTR70, BTR70-BMP2, BMP2-T72).
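The three bookkeeping steps described above (unit $L_2$-norm inputs, selection of one of twelve 30 deg sector classifiers from the pose, and a kernel size equal to the average Euclidean distance between training patterns) can be sketched as follows. The function names are hypothetical, images are assumed to be flattened into plain lists of pixel values, and the sector boundaries are assumed to start at 0 deg.

```python
import math

def l2_normalize(image):
    """Scale a flattened image so that its L2 norm is one."""
    norm = math.sqrt(sum(p * p for p in image))
    if norm == 0.0:
        return list(image)   # leave an all-zero chip unchanged
    return [p / norm for p in image]

def sector_index(pose_deg, sector_width=30.0):
    """Index (0..11 for 30 deg sectors) of the sub-classifier for a pose."""
    return int((pose_deg % 360.0) // sector_width)

def average_pairwise_distance(patterns):
    """Mean Euclidean distance over all distinct pairs of patterns."""
    total, count = 0.0, 0
    for i in range(len(patterns)):
        for j in range(i + 1, len(patterns)):
            d2 = sum((a - c) ** 2 for a, c in zip(patterns[i], patterns[j]))
            total += math.sqrt(d2)
            count += 1
    return total / count

# Tiny illustrative inputs (not MSTAR data).
unit_image = l2_normalize([3.0, 4.0])
kernel_size = average_pairwise_distance([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

With 80 by 80 chips each flattened image would have 6,400 entries, matching the number of input units quoted for the perceptron.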

A. Classification Results

A closed-set classification experiment was performed first. The perceptron was trained with a learning rate of 0.5 and a weight decay of 10E-5. It is known that a perceptron is still very much related to the template matcher, but each training image is nonlinearly weighted to create a discriminant template for the class. Fig. 5(a) depicts images of the input weight matrices (white means high value) for each of the three output nodes for one of the sector classifiers (0 to 30 deg aspect angle). Compared with the input SAR images in Fig. 3, one can see that the weight images emphasize some of the point scatters of the targets, and suppress the background. Thus, effectively the perceptron is working with the discriminant and persistent scatters of each class for the given sector. This has been a goal of model-based ATR [21], but here the result is obtained through training and in a much simpler way.

The OH uses the same architecture as the perceptron, but the training principle is different. Instead of training the weight vector itself using MSE, one trains the Lagrange multipliers using the SRM principle. Here we employed the Adatron algorithm [11, 12] in a pairwise training among the three classes (i.e., T72 versus BMP2, BMP2 versus BTR70, and BTR70 versus T72). In our experiments the advantage is obvious, since the number of training examples is much smaller than the dimensionality of the input space, so there are fewer parameters to be trained. Fig. 5(b) shows the weight distributions for the OH obtained from the Lagrange multipliers using (12). Comparing with Fig. 5(a), one can see the differences that result from the ERM and SRM training criteria. While the perceptron emphasizes the discriminant point scatters for each class, the OH works more with the pairwise differences and does not seem to concentrate as much on the point scatters.

We used the Adatron and kernel Adatron algorithms with a step size of 0.01 to train the OH and SVM, which use the same training principle, but have different architectures. Fig. 6 illustrates the learning curves and margin distribution for the OH classifier and SVM based on the training and testing result of one (covering approximately 30 deg) of the twelve classifiers, where the margin distribution graph is defined as the sum of the margins (see equation (21)) of the training set as a function of the number of iterations [31]. We also monitored the error in the test set (dashed line) for analysis purposes. In Fig. 6(a), the learning curve for the OH reveals that the training error dropped to zero in 70 iterations, but the testing error continued to drop from 22 to 8 in the next 450 iterations. In Fig. 6(b) the error for the SVM only took about 20 iterations to reach zero while the testing error continued to drop from 22 to 6 in the following 200 iterations. Meanwhile, the sum of margins of the training set continued to increase quickly even after the training error reached zero. This means that there is fine-tuning of the decision boundary even after the training set error reaches zero. This is impossible with the delta rule, and shows the intrinsic difference between SRM and ERM training. The test error plots also show another difference with respect to the MSE training. We observed no over-training with the SRM principle, since the test error decreases monotonically as the margins are maximized. Hence, the stop criterion for training should be based not on the error but on monitoring the margin. Comparing the learning curves and margin distribution graphs in Fig. 6(a) and (b), we find that the SVM converges faster than the OH, but the margins are identical for our data. This indicates similar performance.

Fig. 6. Learning curves and margin distribution graphs for (a) OH and (b) SVM, where margin distribution graph is defined as sum of margins of training set as function of number of iterations. Learning curves are shown above corresponding margin distribution graphs. Each learning curve and margin distribution graph shows training error/margin (with solid line) and testing error/margin (with dash-dot line), respectively. It is revealed that after training error dropped to zero, testing error still continued dropping, and sum of margins continued increasing.

Table II shows the misclassification rates of the three classification methods using the data from the three targets. The misclassification rates $P_e$ of the large margin classifiers were around 9% while the perceptron achieved approximately 12%. This reveals that the OH and the SVM had a better classification performance than the perceptron. We also conclude that there is no advantage in classification accuracy of the SVM over the OH. This may be related to the high dimensionality of our data set. The networks were run several times with different initial conditions and learning rates and the results of Table II were repeatable.

TABLE II
Misclassification Rates (%) of Classifiers

| | BMP2 | BTR70 | T72 | Average |
|---|---|---|---|---|
| Perceptron | 9.88 | 0.51 | 17.87 | 11.94 |
| OH | 6.13 | 1.02 | 15.64 | 9.45 |
| SVM | 9.03 | 0.51 | 11.86 | 9.01 |
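The pairwise scheme used for the OH and the SVM needs some rule for combining the three two-class outputs into one label. This section does not state which rule was used, so the sketch below is only an assumption: a majority vote over the three pairwise machines, with ties broken by the accumulated magnitude of the winning outputs. The function name and the example outputs are hypothetical.

```python
def pairwise_vote(outputs):
    """Combine pairwise classifier outputs into a single class label.

    `outputs` maps (class_a, class_b) to the raw output of the machine
    trained on that pair; a positive value favours class_a and a negative
    value favours class_b. Majority voting is an assumed combination rule.
    """
    votes = {}
    confidence = {}
    for (class_a, class_b), value in outputs.items():
        for name in (class_a, class_b):
            votes.setdefault(name, 0)
            confidence.setdefault(name, 0.0)
        winner = class_a if value >= 0 else class_b
        votes[winner] += 1
        confidence[winner] += abs(value)
    # Most votes wins; accumulated |output| breaks three-way ties.
    return max(votes, key=lambda name: (votes[name], confidence[name]))

# Hypothetical outputs of the three pairwise machines for one test chip.
clear_case = {("T72", "BMP2"): 0.8, ("BMP2", "BTR70"): -0.3, ("BTR70", "T72"): -0.5}
tied_case = {("T72", "BMP2"): 0.8, ("BMP2", "BTR70"): 0.3, ("BTR70", "T72"): 0.5}
clear_label = pairwise_vote(clear_case)
tied_label = pairwise_vote(tied_case)
```

In the first example T72 wins two of the three pairwise decisions; in the second each class wins once and the largest output decides.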

B. Verification Results

A critical problem in ATR is how to discriminate between targets and confusers. To reject confusers, thresholds are set for all three classifiers, and performance in terms of missed detections and false alarms is measured in a receiver operating characteristic (ROC) curve. In the verification experiment the previously trained classifiers were tested in an enlarged test set with two confusers, D7 and 2S1. The baseline for the comparison is the template matching method using basically the same MSTAR target mix [36], where a power normalized template matcher with a mask individualizing the targets was developed with templates at 10 deg increments. This preprocessing is much more involved than the one used in our design and may bias the results in favor of the template matcher. For all four classifiers, a threshold was individually set for each method to keep the probability of detection $P_d$ equal to 0.9 in the testing set. Here $P_d$ is defined as the ratio of the number of targets detected to the total number of targets (a $P_d$ of 0.9 is typically used in MSTAR).

The recognition results are listed in Table III. The misclassification rate $P_e$ is defined as the ratio of the number of targets incorrectly classified to the number of targets tested. The first row gives the result for the template matcher as reported in [36]. The average $P_e$ of the perceptron, OH, and SVM were 6.67%, 5.42% and 5.13%, respectively. They are all better than the template matching, which is 9.60%.

TABLE III
Misclassification Rates and Confuser Rejection Rates (%)

| | BMP2 | BTR70 | T72 | Average | Confuser Rejection |
|---|---|---|---|---|---|
| Template | 11.58 | 2.04 | 10.14 | 9.60 | 53.47 |
| Perceptron | 9.71 | 0.00 | 5.84 | 6.67 | 27.19 |
| OH | 8.69 | 0.51 | 3.78 | 5.42 | 38.50 |
| SVM | 4.94 | 0.00 | 7.04 | 5.13 | 68.80 |

Concomitantly with these experiments conducted in our laboratory, another group was applying SVMs to the MSTAR data [21]. In their paper, an SVM classifier was used to classify the same target mix in MSTAR, but using a polynomial instead of a Gaussian kernel function, and trained with the standard and more involved quadratic programming approach. The reported misclassification errors are around 6.6% to 7.2%, slightly worse than our results.

When confusers were added to the test set, the SVM showed the highest rejection rate of 68.80%, while the optimal hyperplane presented a rejection rate of 38.50%, and the perceptron 27.19%. The template matching showed a rejection rate of 53.47%, which is better than the perceptron and OH, but still worse than the SVM. These rejection results may seem to contradict the classification results, but in fact they are easily interpreted if one characterizes the classifier’s discriminant function type as local or global. In fact, the best performers in rejection are the SVM (which uses the Gaussian kernel) and the matched filter, which is also a local discriminant. The global discriminant classifiers (OH and perceptron) were unable to reject confusers as effectively.

We present in Table IV the confusion matrices for each classifier at the operating point of $P_d = 0.9$, which is typically used in MSTAR. The rejection column in Table IV refers to false rejection for the three targets BMP2, BTR70, and T72, but for the two confusers it indicates that the classifier correctly rejects the vehicle as not a target. One can see from Table IV that each system makes its own type of mistakes. For instance, the template matcher has difficulty with the 2S1 (confused with the BMP2), while the other three classifiers have a more balanced performance, and are progressively better at rejecting confusers. The SVM improves the confuser rejection of the template matcher from 53.47% to 68.80% (Table III) while providing a better recognition rate, hence it is a better classifier for verification.

We note that Table III only provides results for probability of detection $P_d = 0.9$, corresponding to only one point on the ROC curve. To give an overall performance comparison, the ROC curves of the three classifiers are shown in Fig. 7. It is observed that the SVM shows much better target recognition and confuser rejection performance than the two other classifiers. We have to point out that the ROC for the BTR70 is much better than the others because there are no variants for this vehicle in the test set.
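The operating point used above can be illustrated with a short sketch of the thresholding rule: the threshold is chosen so that a fraction $P_d$ of the target test samples is accepted, and the confuser rejection rate is then read off at that threshold. The helper names are hypothetical, and the scores are made-up numbers standing in for the largest classifier output of each test chip; sweeping `pd` over a range of values would trace out one ROC curve.

```python
import math

def threshold_for_detection_rate(target_scores, pd=0.9):
    """Largest threshold that still accepts at least a fraction pd of targets."""
    ranked = sorted(target_scores, reverse=True)
    # Small guard against floating-point round-up in pd * len(ranked).
    keep = math.ceil(pd * len(ranked) - 1e-9)
    return ranked[keep - 1]

def detection_rate(target_scores, threshold):
    """Fraction of targets whose score reaches the threshold."""
    return sum(1 for s in target_scores if s >= threshold) / len(target_scores)

def rejection_rate(confuser_scores, threshold):
    """Fraction of confusers whose score falls below the threshold."""
    return sum(1 for s in confuser_scores if s < threshold) / len(confuser_scores)

# Made-up scores, for illustration only.
target_scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.1]
confuser_scores = [0.2, 0.5, 0.6, 0.3]

T = threshold_for_detection_rate(target_scores, pd=0.9)
pd_achieved = detection_rate(target_scores, T)
confusers_rejected = rejection_rate(confuser_scores, T)
```

A classifier whose output stays small away from the training classes pushes more confuser scores below `T` at the same detection rate.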

TABLE IV
Confusion Matrices (Counts) of Classifiers and Confuser Rejection when Pd = 0.9

(a) Template matching

         BMP2   BTR70    T72   Rejection
BMP2      483      59      9          36
BTR70       4     188      0           4
T72        43      16    427          96
2S1       111      83     38          42
D7         16       4      3         251

(b) Perceptron

         BMP2   BTR70    T72   Rejection
BMP2      436      16     41          83
BTR70       0     194      0           2
T72        18      16    502          51
2S1         9     105    100          60
D7         29      68     88          89

(c) Optimal hyperplane

         BMP2   BTR70    T72   Rejection
BMP2      443       9     42          88
BTR70       0     193      1           2
T72        16       6    519          46
2S1         9      50    117          98
D7         53       9     99         113

(d) Support vector machine

         BMP2   BTR70    T72   Rejection
BMP2      511      14     15          47
BTR70       0     195      0           1
T72        31      10    453          88
2S1        57      24     10         183
D7         53       0     27         145
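As a worked check of how the rates quoted in the text relate to the counts, the following sketch recomputes them for the template matching panel of Table IV. It assumes that rows are the true vehicles and columns are the decisions, and that Pd is the fraction of target samples that are not rejected; the variable names are illustrative.

```python
import numpy as np

# Counts copied from Table IV(a), template matching.
# Rows: true vehicle; columns: decision (BMP2, BTR70, T72, Rejection).
counts = np.array([
    [483,  59,   9,  36],   # BMP2
    [  4, 188,   0,   4],   # BTR70
    [ 43,  16, 427,  96],   # T72
    [111,  83,  38,  42],   # 2S1 (confuser)
    [ 16,   4,   3, 251],   # D7  (confuser)
])
targets, confusers = counts[:3], counts[3:]

# Fraction of target samples that were not rejected.
pd = 1.0 - targets[:, 3].sum() / targets.sum()
# Fraction of target samples assigned to their own class.
correct = np.trace(targets[:, :3]) / targets.sum()
# Fraction of confuser samples that were correctly rejected.
confuser_rejection = confusers[:, 3].sum() / confusers.sum()

print(f"Pd = {pd:.3f}")                                          # Pd = 0.900
print(f"correct classification = {correct:.3f}")                 # 0.804
print(f"confuser rejection = {100 * confuser_rejection:.2f}%")   # 53.47%
```

The confuser rejection obtained this way agrees with the 53.47% quoted for template matching, and the target acceptance rate agrees with the stated operating point of Pd = 0.9.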

V. DISCUSSION AND CONCLUSION

Our tests in SAR/ATR and the choice of our classifiers enable us to draw some interesting conclusions about classifier topology and training criterion. This is illustrated briefly in Table V.

TABLE V
Topology of Classifier Versus Training Criterion

                      ERM                     SRM
Global discriminant   Perceptron              Optimal Hyperplane
Local discriminant    Radial Basis Function   Support Vector Machine

Notice that we have the same classifier topology trained with different principles: the global discriminant perceptron is trained with either the ERM (e.g., delta rule with weight decay) or with the SRM (or large margin rule). We also have the same basic training principle (SRM or large margin) applied to two different topologies, the OH perceptron with a global discriminant and the SVM with a Gaussian kernel (the local discriminant). Finally, we have two different classification problems: the closed and open sets. Hence we can confidently draw the following conclusions. For closed set classification problems the generalization of the large margin classifiers outperforms the ERM principle in the MSTAR data set. We say this because the OH outperforms the perceptron trained with the delta rule. The performance improvement is substantial but not dramatic. We thought that the SRM had the potential to produce even better classifiers, but obviously, this depends on the structure of the data clusters.

Fig. 7. ROC curves of (a) perceptron, (b) OH, (c) SVM using test sets against two confusers.

By analyzing the learning criteria, we can see
that the modified ERM learning criterion (9) has an L2 norm in both the training set error and the regularization. Although never designed for large margin, the MSE positions the decision surface at a location with a “safe margin” between the vehicle classes. This was unexpected, in particular because the size of the training data is much smaller than the input data dimensionality, which raises many concerns about generalization. Different from (9), the SRM criterion (11) maximizes the margin by using an L1 norm in the training set error and an L2 norm in the regularization. The major difference between these schemes is the training set error norm.

Fig. 8. Values of Lagrange multipliers of trained (a) OH and (b) SVM, where most of them are greater than zero and thus support vectors by definition.

We analyzed what this difference in norm
effectively means in terms of the number of degrees of freedom of the classifier. The perceptron trained with the delta rule and weight decay has 6,400 weights per class; the weight distribution is shown in Fig. 5. In the OH classifier the Lagrange multipliers play a similar role to the weights in the perceptron. Fig. 8 shows the Lagrange multipliers $\alpha_i$ for the OH and SVM. We first note that the number of $\alpha_i$ is related to the number of training samples, instead of the dimensionality of inputs to the classifier. This is the big difference between the two methods: large margin training decouples in a very effective way the input space dimension from the number of features used, while the MSE is stuck with the input data dimensionality. Only changing the topology (increasing the number of layers) can help a classifier trained with MSE decouple the feature space from the input space, but then the designer is faced with the problem of choosing the number of hidden processing elements (PEs) without a good theory to guide the design.

Moreover, the large margin training can still assign
“importance” to each input data sample by the value of the Lagrange multipliers: samples whose $\alpha_i$ are far from zero are called support vectors, because they are the ones that define the position of the decision surface. In Fig. 8 the values of the multipliers are shown for both the OH and SVM of the first sector among the 12. It is seen that most of the $\alpha_i$ are greater than zero, which means that for our ATR application almost all the samples act as support vectors. This is due to the small data set and the high dimensionality of the input space: each sample in the data set has to play a role in forming the decision boundary. This small ratio of data samples to space dimensionality may also explain why the SVM did not outperform the OH in our closed-set classification results.

When the classification is open set, the classifier
topology plays a more important role than the training principle. This means that open and closed-set classifications are really two different problems, where generalization is not the only difficulty. In open set classification the learning machine is presented with samples that are beyond the training set classes, i.e., that appear in very different areas of the pattern space. We (and others) use the idea of thresholding the output of the classifiers to decide about the degree of similarity to the in-class samples. Note that this implies that the class posterior distribution must be a good representation for the probability density function. With this thresholding methodology the design for large margin was not the major factor in performance.

What is important here is to choose topologies
that enforce local class discriminants. The SVM maps a confuser far away from the “local” decision region onto a location close to the origin of the feature space, which promises a reliable rejection. Perceptrons are notorious for their global discriminants (hyperplanes), so they perform poorly in these tasks (irrespective of the learning principle) even when compared with template matchers, which are known to generalize poorly (because they are linear systems [8]). When we choose topologies that enforce local discriminants, a large margin classifier still seems more robust to confusers. Unfortunately, the comparison of the SVM with the template matcher is not very appropriate, since the template matcher is unable to exploit covariance information, and the two are really embedded in two different classifier designs (see below). A more appropriate test would have been the comparison of the SVM with an RBF network. This point was addressed in [30], where the SVM and RBF were compared using a real-world pattern recognition experiment (handwritten digit recognition), and the result showed that the SVM achieved higher recognition accuracy than the RBF.

The performance comparison between template
matchers and the other three methods (SVMs, OH, and perceptrons) is not straightforward because the differences are not only in classifier structure but also in classifier design. In fact the template matcher is one example of the OCON structure, while all the other classifiers are trained as ACON structures, but only across 30 deg of aspect angle. Our classifier design exploits a divide-and-conquer approach in the sense that we realize that one of the difficulties in SAR/ATR is the huge dependence of the signatures on aspect angle. Hence, we reason that it should be much simpler to discriminate targets if we compare vehicles aligned for pose. In order to implement this principle we first estimate the pose and then derive classifiers for a subset of the pose angles within a sector (30 deg in these experiments).

The results presented here show that in fact the
SVM outperforms the template matcher in both accuracy and confuser rejection (the OH and perceptron also outperform the template matcher in accuracy). But there are so many differences between SVMs and template matchers that we cannot say unequivocally that this difference is due to our classifier design. There are a number of design parameters that were not fully considered in our approach. Probably the most important is the sector size. We saw that the discriminability of targets varied widely over aspect, but we kept the same sector size in the design. Sectors that produce poor results should be broken down into smaller sectors to improve discriminability. But of course there is a lower limit on the sector size due to pose estimator accuracy and also to the data available to train the classifiers. This trade-off should be further investigated.
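The contrast between local and global discriminants discussed above can be illustrated with a small sketch. It is not the classifier used in the experiments: the support points, weights, bias values, kernel width, and test inputs below are hypothetical, chosen only to show how the two discriminant types respond to an input far from the training region.

```python
import numpy as np

def gaussian_discriminant(x, support, weights, bias, sigma=1.0):
    """Kernel expansion sum_j w_j exp(-||x - s_j||^2 / (2 sigma^2)) + bias."""
    d2 = np.sum((support - x) ** 2, axis=1)
    return float(weights @ np.exp(-d2 / (2.0 * sigma ** 2)) + bias)

def linear_discriminant(x, w, bias):
    """Global (hyperplane) discriminant w . x + bias."""
    return float(w @ x + bias)

# Hypothetical two-dimensional setup (illustration only).
support = np.array([[0.0, 0.0], [1.0, 0.0]])
weights = np.array([1.0, 1.0])
in_class = np.array([0.5, 0.0])     # sample near the training region
confuser = np.array([10.0, 10.0])   # sample far from the training region

g_in = gaussian_discriminant(in_class, support, weights, bias=-0.5)
g_far = gaussian_discriminant(confuser, support, weights, bias=-0.5)
l_in = linear_discriminant(in_class, np.array([1.0, 0.0]), bias=0.0)
l_far = linear_discriminant(confuser, np.array([1.0, 0.0]), bias=0.0)

print(g_in, g_far)  # kernel output falls back to the bias far from the support points
print(l_in, l_far)  # linear output keeps growing with distance along w
```

With the Gaussian kernel every kernel term vanishes for the distant input, so the output settles at the bias and a threshold on the output rejects it. The linear output for the same input exceeds the in-class output, so no threshold that accepts the in-class sample can reject it.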

APPENDIX. ALGORITHM OF KERNEL ADATRON WITH BIAS [12]

0) Define

$$f_{AD}(x_i) = y_i \left( \sum_{j=1}^{m} y_j \alpha_j k(x_i, x_j) + b \right), \qquad M_{AD} = \min_{i \in \{1,\ldots,m\}} f_{AD}(x_i).$$

1) Initialization setup: Lagrange multipliers $\alpha_i$, $i \in \{1,\ldots,m\}$, learning rate $\eta$, bias $b$, and a small threshold $t$.
2) While $M_{AD} < t$
3) Choose pattern $x_i$, $i \in \{1,\ldots,m\}$
4) Calculate an update $\delta_i = \eta (1 - f_{AD}(x_i))$
5) If $(\alpha_i + \delta_i) > 0$, $\alpha_i = \alpha_i + \delta_i$, $b = b + y_i \delta_i$
6) End While
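The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments. Several choices are assumptions that the appendix does not fix: the multipliers and bias start at zero, patterns are visited cyclically, the stopping value `t = 0.9` and the rate `eta = 0.1` are arbitrary, `max_iter` is an added safeguard, and the Gaussian kernel width and the four-point data set are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix with entries exp(-||x - z||^2 / (2 sigma^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_adatron(K, y, eta=0.1, t=0.9, max_iter=10000):
    """Kernel Adatron with bias; K is the Gram matrix, y holds labels in {-1, +1}."""
    m = len(y)
    alpha = np.zeros(m)   # step 1: initial multipliers (assumed zero)
    b = 0.0               # step 1: initial bias (assumed zero)
    for it in range(max_iter):
        f = y * (K @ (alpha * y) + b)   # step 0: f_AD(x_i) for all i
        if f.min() >= t:                # step 2: stop once M_AD reaches t
            break
        i = it % m                      # step 3: cyclic pattern choice (assumption)
        delta = eta * (1.0 - f[i])      # step 4: proposed update
        if alpha[i] + delta > 0:        # step 5: keep the multiplier positive
            alpha[i] += delta
            b += y[i] * delta
    return alpha, b

# Hypothetical toy problem with two well separated classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = gaussian_kernel(X, X, sigma=1.0)
alpha, b = kernel_adatron(K, y)
margin = float((y * (K @ (alpha * y) + b)).min())
print(alpha, b, margin)
```

On a data set this small every multiplier ends up positive, which mirrors the observation in Fig. 8 that almost all training samples act as support vectors when data are scarce.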


REFERENCES

[1] Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (1974), 716–723.

[2] Anlauf, J., and Biehl, M. (1989) The Adatron: An adaptive perceptron algorithm. Europhysics Letters, 10, 7 (1989), 687–692.

[3] Breiman, L. (1994) Bagging predictors. Technical report 421, University of California, Berkeley, 1994.

[4] Chellappa, R., Wilson, C., and Sirohey, S. (1995) Human and machine recognition of faces: A survey. Proceedings of the IEEE, 83, 5 (1995).

[5] Burges, C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery (1998).

[6] Cortes, C., and Vapnik, V. (1995) Support vector networks. Machine Learning, 20 (1995), 273–297.

[7] Courant, R., and Hilbert, D. (1953) Methods of Mathematical Physics. Interscience, 1953.

[8] Fisher, J., and Principe, J. C. (1998) Recent advances to nonlinear MACE filters. Optical Engineering, 36, 10 (1998), 2697–2709.

[9] Fletcher, R. (1987) Practical Methods of Optimization (2nd ed.). New York: Wiley, 1987.

[10] Freund, Y., and Schapire, R. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the 2nd European Conference on Computational Learning Theory, 1995.

[11] Frieß, T., Cristianini, N., and Campbell, C. (1998) The kernel-Adatron algorithm: A fast and simple learning procedure for support vector machines. In Shavlik, J. (Ed.), Machine Learning: Proceedings of the 15th International Conference. San Francisco, CA: Morgan Kaufmann Publishers, 1998.

[12] Frieß, T. (1998) Support vector neural networks: The kernel Adatron with bias and soft-margin. Research report, University of Sheffield, UK, 1998.

[13] Fukunaga, K. (1972) Statistical Pattern Recognition (2nd ed.). San Diego, CA: Academic Press, 1972.

[14] Gish, H., and Schmidt, M. (1994) Text-independent speaker identification. IEEE Signal Processing Magazine, 11 (1994), 18–32.

[15] Girosi, F. (1998) An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 6 (1998), 1455–1480.

[16] Gori, M., and Scarselli, F. (1998) Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 11 (1998), 1121–1132.

[17] Haykin, S. (1994) Neural Networks: A Comprehensive Foundation. Englewood, NJ: Macmillan College Co., 1994.

[18] Helstrom, C. W. (1968) Statistical Theory of Signal Detection (2nd ed.). Elmsford, NY: Pergamon Press, 1968.

[19] Juang, B., and Katagiri, S. (1992) Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40, 12 (1992), 3043–3054.

[20] Lin, S. H., Kung, S. Y., and Lin, L. J. (1997) Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks, 8, 1 (1997), 114–132.

[21] Bryant, M., and Garber, F. (1999) SVM classifier applied to the MSTAR public data set. Algorithms for Synthetic Aperture Radar Imagery VI, E. Zelnio, Ed., Proceedings of the SPIE, 3721 (1999), 355–360.

[22] Nilsson, N. (1965) Learning Machines: Foundations of Trainable Pattern-Classifying Systems. New York: McGraw-Hill, 1965.

[23] Novak, L., Owirka, G., Brower, W., and Weaver, A. (1997) The automatic target recognition system in SAIP. The Lincoln Lab Journal, 10, 2 (1997), 187–202.

[24] Principe, J., Radisavljevic, A., Fisher, J., and Novak, L. (1998) Target prescreening based on a quadratic Gamma discriminator. IEEE Transactions on Aerospace and Electronic Systems, 34, 3 (1998), 706–715.

[25] Principe, J., Zhao, Q., and Xu, D. (1998) A novel ATR classifier exploiting pose information. In Proceedings of Image Understanding Workshop, Monterey, CA., Nov. 1998, 833–836.

[26] Priestley, M. (1981) Spectral Analysis and Time Series. New York: Academic Press, 1981.

[27] Rissanen, J. (1978) Modeling by shortest data description. Automatica, 14 (1978), 465–471.

[28] Rissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.

[29] Rosenblatt, F. (1958) The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65 (1958), 386–408.

[30] Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. (1997) Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45, 11 (1997), 2758–2765.

[31] Schapire, R., Freund, Y., Bartlett, P., and Lee, W. (1998) Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.

[32] Schmidt, M. (1996) Identifying speaker with support vector networks. In Interface’96 Proceedings, Sydney, Australia, 1996.

[33] Tikhonov, A., and Arsenin, V. (1977) Solutions of Ill-Posed Problems. Washington, DC: W. H. Winston, 1977.

[34] Vapnik, V. (1998) Statistical Learning Theory. New York: Wiley, 1998.

[35] Vapnik, V. (1995) The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[36] Velten, V., Ross, T., Mossing, J., Worrell, S., and Bryant, M. (1998) Standard SAR ATR evaluation experiments using the MSTAR public release data set. Research Report, Wright State University, 1998.

[37] Weigend, A. S. (1991) Generalization by weight elimination with application to forecasting. Advances in Neural Information Processing Systems, 3 (1991), 875–882.

[38] Zhao, Q., and Bao, Z. (1996) Radar target recognition using a radial basis function neural network. Neural Networks, 9, 4 (1996), 709–720.

[39] Zhao, Q., Xu, D. X., and Principe, J. (1998) Pose estimation of SAR automatic target recognition. In Proceedings of Image Understanding Workshop, Monterey, CA., Nov. 1998, 827–832.

[40] Zhao, Q., and Principe, J. (1999) From hyperplanes to large margin classifiers: Applications to SAR ATR. Automatic Target Recognition IX, F. Sadjadi (Ed.), Proceedings of the SPIE, 3718 (1999), 101–109.

[41] Zhao, Q., Principe, J., Brennan, V., Xu, D., and Wang, Z. (2000) SAR automatic target recognition with three strategies of learning and representation. Optical Engineering, 39, 5 (May 2000), 1230–1244.

Qun Zhao received his B.S. degree from Xian Jiaotong University, China, in 1989, and M.S. and Ph.D. degrees from Xidian University, China, in 1992 and 1995, respectively, all in electrical engineering.

In 1996 he joined the Center for Information Science, Peking University, China, where in Sept. 1997 he became an Associate Research Professor. He joined the University of Florida in 1998. He is presently working at the Department of Electrical and Computer Engineering, University of Florida, and the Center for Study of Emotion and Attention, National Institute of Mental Health. His current research work includes information-theoretic learning, automatic target recognition, and functional magnetic resonance imaging. His main interests include pattern recognition, statistical learning, medical imaging, time-frequency distribution, and speech analysis.

Dr. Zhao has published around 30 papers in journals and conference proceedings. He serves as referee for IEEE Transactions on Neural Networks, Pattern Analysis and Machine Intelligence, and Biomedical Engineering. He has been a member of IEEE since 1998.

Jose C. Principe (M’83–SM’90–F’00) is Professor of Electrical and Computer Engineering at the University of Florida, Gainesville, where he teaches signal processing and artificial neural networks. He is the Founder and Director of the University of Florida Computational NeuroEngineering Laboratory (CNEL). His primary area of interest is processing of nonstationary signals with adaptive neural models.

Dr. Principe is a member of the advisory board of the University of Florida Brain Institute.

