

Parametric Subspace Analysis for dimensionality reduction and classification

Nhat Vo, University of Melbourne, [email protected]

Duc Vo, University of Technology, Sydney, [email protected]

Subhash Challa, University of Melbourne, [email protected]

Bill Moran, University of Melbourne, [email protected]

Abstract— Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two popular techniques for dimensionality reduction and classification. By extracting discriminant features, LDA is optimal when the distributions of the features for each class are unimodal and separated by the scatter of means. PCA, on the other hand, extracts descriptive features, which allows it to outperform LDA in some classification tasks and makes it less sensitive to different training data sets. The idea of Parametric Subspace Analysis (PSA) proposed in this paper is to include a parameter that regulates the combination of PCA and LDA. By combining the descriptive features of PCA with the discriminant features of LDA, PSA achieves better performance on dimensionality reduction and classification tasks, as shown by our experimental results.

I. INTRODUCTION

Principal component analysis (PCA) and linear discriminant analysis (LDA) are the most popular subspace analysis approaches for learning the low-dimensional structure of high-dimensional data. PCA is a subspace projection technique widely used to reduce multidimensional data sets to lower dimensions for analysis. Depending on the field of application, it is also named the discrete Karhunen-Loève transform (or KLT, named after Kari Karhunen and Michel Loève), the Hotelling transform (in honor of Harold Hotelling), or proper orthogonal decomposition (POD) [1]. It finds a set of representative projection vectors such that the projected samples retain most information about the original samples. The most representative vectors are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix. Unlike PCA, LDA finds a set of vectors that maximizes Fisher's criterion

$$\frac{\mathrm{tr}\{W^T S_b W\}}{\mathrm{tr}\{W^T S_w W\}} \qquad (1)$$

where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix. Thus, by applying this method, we find the projection directions that, on the one hand, maximize the Euclidean distance between samples of different classes and, on the other, minimize the distance between samples of the same class. This ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of S_w^{-1} S_b.

There has been a tendency in the computer vision community to prefer LDA over PCA. This is mainly because LDA deals directly with discrimination between classes, while PCA pays no attention to the underlying class structure.

Generally speaking, LDA is optimal when the distributions of the features for each class are unimodal and separated by the scatter of means [1]. However, PCA can still outperform LDA in some classification tasks. The overall conclusion of [2] is that when the training data set is small, PCA can outperform LDA and, also, that PCA is less sensitive to different training data sets (see Fig. 1 for a depiction of a failure of PCA and of LDA). Based on these analyses, it seems clear that a projection technique which combines the descriptive features of PCA and the discriminant features of LDA may lead to better performance on dimensionality reduction and classification tasks. In this paper, we propose a subspace analysis technique called Parametric Subspace Analysis (PSA). It includes a parameter for regulating the combination of PCA and LDA. By combining the descriptive features of PCA and the discriminant features of LDA, the proposed technique achieves better performance on dimensionality reduction and classification tasks than either PCA or LDA alone. In Section 2, we briefly present the two most popular subspace analysis techniques, PCA and LDA. The proposed idea is described in Section 3. Section 4 presents experimental results and a comparison of PSA with PCA and LDA. Section 5 concludes the paper with some directions for future work.

II. SUBSPACE ANALYSIS

One approach to cope with the problem of excessive dimensionality of data is to reduce the dimensionality by combining features. Linear combinations are particularly attractive because they are simple to compute and analytically tractable. In effect, linear methods project the high-dimensional data onto a lower-dimensional subspace. Suppose that we have N samples {x_1, x_2, ..., x_N} taking values in an n-dimensional space. Let us also consider a linear transformation mapping the original n-dimensional space into an m-dimensional feature space, where m < n. The new feature vectors y_k ∈ ℝ^m are defined by the following linear transformation:

$$y_k = W^T x_k \qquad (2)$$

where k = 1, 2, ..., N, μ ∈ ℝ^n is the mean of all samples, and W ∈ ℝ^{n×m} is a matrix with orthonormal columns. After the linear transformation, each data point x_k can be represented by a feature vector y_k ∈ ℝ^m, which is then used for classification.
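The paper itself contains no code; as a purely illustrative sketch of the linear projection in (2), the following Python/NumPy snippet (the data, the dimensions and the orthonormal W are placeholder assumptions) reduces N samples from n to m dimensions:

```python
import numpy as np

# Illustrative sketch of the projection y_k = W^T x_k in (2).
# X holds the samples as rows; W is any n x m matrix with orthonormal columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                  # N = 100 samples, n = 50 dimensions

# Placeholder orthonormal W from a QR factorization
# (PCA, LDA and PSA below choose W in different ways).
W, _ = np.linalg.qr(rng.normal(size=(50, 10)))  # n x m with m = 10

Y = X @ W                                       # row k of Y is y_k = W^T x_k
print(Y.shape)                                  # (100, 10)
```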



[Fig. 1: Two examples of classification tasks, one where PCA fails and one where LDA fails. Each panel plots samples of Class 1 and Class 2 together with the projection directions found by PCA, LDA and PSA. (a) A case where LDA fails to find the optimal projection for the classification task. (b) A case where PCA fails to find the optimal projection for the classification task.]

A. Principal Component Analysis - PCA

Different objective functions will yield different algorithms with different properties. PCA aims to extract a subspace in which the variance is maximized. Its objective function is

$$W_{PCA} = \arg\max_{W^T W = I} \mathrm{tr}\{W^T S_t W\}$$

where the total scatter matrix is defined as

$$S_t = \frac{1}{N} \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T \qquad (3)$$

where μ = (1/N) Σ_{i=1}^{N} x_i is the mean of all samples. The optimal projection W_PCA = [w_1 w_2 ... w_m] is the set of n-dimensional eigenvectors of S_t corresponding to the m largest eigenvalues.
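As a minimal sketch of this construction (not code from the paper; it assumes NumPy, a data matrix X with one sample per row, and the illustrative function name pca_projection), W_PCA can be obtained from the eigendecomposition of S_t:

```python
import numpy as np

def pca_projection(X, m):
    """Return W_PCA (n x m): eigenvectors of the total scatter matrix S_t in (3)
    corresponding to the m largest eigenvalues."""
    N, n = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    St = (Xc.T @ Xc) / N                  # total scatter matrix, equation (3)
    evals, evecs = np.linalg.eigh(St)     # S_t is symmetric; eigenvalues ascending
    order = np.argsort(evals)[::-1]       # reorder to descending eigenvalues
    return evecs[:, order[:m]]            # n x m projection matrix
```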

B. Linear Discriminant Analysis - LDA

While PCA seeks directions that are efficient for representation, LDA seeks directions that are efficient for discrimination. Assume that each sample belongs to one of C classes {Π_1, Π_2, ..., Π_C}. Let N_i be the number of samples in class Π_i (i = 1, 2, ..., C), and let μ_i = (1/N_i) Σ_{x ∈ Π_i} x be the mean of the samples in class Π_i. Then the between-class scatter matrix S_b and the within-class scatter matrix S_w are defined as

$$S_b = \frac{1}{N} \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (4)$$

$$S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{x_k \in \Pi_i} (x_k - \mu_i^*)(x_k - \mu_i^*)^T \qquad (5)$$

where μ*_i denotes the mean of the class containing x_k, i.e., μ*_i = μ_i for x_k ∈ Π_i. In LDA, the projection is chosen to maximize the ratio of the trace of the between-class scatter matrix of the projected samples to the trace of the within-class scatter matrix of the projected samples, i.e.,

$$W_{LDA} = \arg\max_{W^T W = I} \frac{\mathrm{tr}\{W^T S_b W\}}{\mathrm{tr}\{W^T S_w W\}} \qquad (6)$$

The optimal projection for LDA is W_LDA = [w_1 w_2 ... w_m], where {w_i | i = 1, 2, ..., m} is the set of generalized eigenvectors of S_b and S_w corresponding to the m largest generalized eigenvalues {λ_i | i = 1, 2, ..., m}, i.e.,

$$S_b w_i = \lambda_i S_w w_i \;\Leftrightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i, \qquad i = 1, 2, \ldots, m \qquad (7)$$

In some real application tasks, this method cannot be applied directly, since the dimension of the sample space is typically larger than the number of samples in the training set. As a consequence, S_w is singular. This problem is known as the "small sample size problem" [1]. To overcome the singularity of S_w while solving (7), several approaches have been proposed. The first is the Fisherface method [3], in which PCA is first used for dimension reduction to make S_w nonsingular before LDA is applied. The second is null space based LDA (NLDA) [4], where the between-class scatter is maximized in the null space of the within-class scatter matrix; the singularity problem is thus implicitly avoided. The third is the Direct-LDA method [5]: first, the null space of S_b is removed and, then, the projection vectors that minimize the within-class scatter in the transformed space are selected from the range space of S_b. Due to the limited length of this paper, details of these algorithms can be found in the respective references.
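A sketch of the LDA projection via the generalized eigenproblem (7), assuming NumPy/SciPy and a nonsingular S_w (it does not implement the Fisherface, NLDA or Direct-LDA remedies discussed above); the helper name lda_projection is ours, not the paper's:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, m):
    """Return W_LDA (n x m): generalized eigenvectors of (S_b, S_w) in (4)-(5)
    for the m largest generalized eigenvalues. Assumes S_w is nonsingular."""
    labels = np.asarray(labels)
    N, n = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter (4)
        Dc = Xc - mu_c
        Sw += Dc.T @ Dc                                   # within-class scatter (5)
    Sb /= N
    Sw /= N
    evals, evecs = eigh(Sb, Sw)           # generalized symmetric eigenproblem (7)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:m]]
```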


III. PARAMETRIC SUBSPACE ANALYSIS - PSA

In this section, we introduce a subspace analysis technique called Parametric Subspace Analysis (PSA). Similar to LDA, our goal is to maximize the between-class scatter while minimizing the within-class scatter. However, in order to combine PCA and LDA, we introduce another criterion for PSA as follows:

$$W_{PSA} = \arg\max_{W^T W = I} \frac{\mathrm{tr}\{W^T (S_b + \alpha S_w) W\}}{\mathrm{tr}\{W^T ((1 - \alpha) S_w + \alpha I) W\}} \qquad (8)$$

where I is the identity matrix. The PSA criterion includes the parameter α to regulate the combination of the descriptive (PCA) and discriminant (LDA) components. To balance the influence of PCA and LDA on the PSA criterion, we let α range from 0 to 1. When α = 0,

$$W_{PSA} = \arg\max_{W^T W = I} \frac{\mathrm{tr}\{W^T S_b W\}}{\mathrm{tr}\{W^T S_w W\}} = W_{LDA} \qquad (9)$$

and when α = 1,

$$W_{PSA} = \arg\max_{W^T W = I} \frac{\mathrm{tr}\{W^T S_t W\}}{\mathrm{tr}\{W^T W\}} = W_{PCA} \qquad (10)$$

Note that S_b + S_w = S_t, which gives the equality in (10). It is thus clear that PCA and LDA are special cases of PSA. Similar to LDA, the optimal projection for PSA is W_PSA = [w_1 w_2 ... w_m], where {w_i | i = 1, 2, ..., m} is the set of generalized eigenvectors of $\tilde{S}_b = S_b + \alpha S_w$ and $\tilde{S}_w = (1 - \alpha) S_w + \alpha I$ corresponding to the m largest generalized eigenvalues {λ_i | i = 1, 2, ..., m}, i.e.,

$$\tilde{S}_b w_i = \lambda_i \tilde{S}_w w_i \;\Leftrightarrow\; \tilde{S}_w^{-1} \tilde{S}_b w_i = \lambda_i w_i, \qquad i = 1, 2, \ldots, m \qquad (11)$$

Because $\tilde{S}_w$ is a full-rank matrix (for α > 0), PSA explicitly overcomes the small sample size problem encountered in LDA.
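Since PSA only changes the pair of scatter matrices in the generalized eigenproblem, a sketch of the projection defined by (8)-(11) follows directly. This is an illustrative NumPy/SciPy implementation rather than code from the paper; it assumes α > 0 so that $\tilde{S}_w$ is positive definite, and psa_projection is our name for the routine.

```python
import numpy as np
from scipy.linalg import eigh

def psa_projection(X, labels, m, alpha):
    """Return W_PSA (n x m) for a given alpha in (0, 1]: generalized eigenvectors
    of (S_b + alpha*S_w, (1 - alpha)*S_w + alpha*I), as in (11)."""
    labels = np.asarray(labels)
    N, n = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter (4)
        Dc = Xc - mu_c
        Sw += Dc.T @ Dc                                   # within-class scatter (5)
    Sb /= N
    Sw /= N
    Sb_t = Sb + alpha * Sw                          # regularized between-class scatter
    Sw_t = (1.0 - alpha) * Sw + alpha * np.eye(n)   # positive definite for alpha > 0
    evals, evecs = eigh(Sb_t, Sw_t)                 # generalized eigenproblem (11)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:m]]
```

For α = 0 the criterion reduces to LDA as in (9), and the routine above then requires a nonsingular S_w.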

IV. EXPERIMENTAL RESULTS

TABLE II: Recognition rates of PCA, LDA and PSA (with the optimal value α*).

Database            PCA (%)   LDA (%)   PSA (%)   α*
Synthetic Control   89.00     92.67     95.67     0.40
FaceAll             65.20     79.98     85.89     0.20
Swedish Leaf        80.80     74.72     84.48     0.05
Wafer               91.51     92.20     95.55     0.65
ECG                 69.00     60.00     83.00     0.10
Adiac               61.13     52.43     73.40     0.05

In this section, we test our approach with comprehensive experiments on six datasets: Adiac, Swedish Leaf, FaceAll, Synthetic Control, Wafer, and ECG, collected by [7] (see Table I for brief descriptions of these datasets). Adiac [10] is a set of diatom contour files (samples); the set consists of 781 files comprising 37 different taxa (classes). The Swedish Leaf dataset comes from a leaf classification project at Linköping University and the Swedish Museum of Natural History [8]; it contains isolated leaves from 15 different Swedish tree species, with 75 leaves per species. The Synthetic Control dataset [6] consists of 600 examples of control charts synthetically generated by the process in Alcock and Manolopoulos (1999). The Wafer and ECG datasets in [9] were analyzed by appropriate domain experts, and a label of normal or abnormal was assigned to each sample. The ECG dataset contains 200 samples, of which 133 were identified as normal and 67 as abnormal. Of the 7,164 time series in the Wafer dataset, 762 were identified as abnormal and 6,402 as normal. The FaceAll dataset [7] contains 2,250 face image samples from 14 classes. In our experiments, some samples serve as training samples while the remaining samples serve as testing samples; we use the exact datasets provided by [7], where the training and testing sets are already chosen (see Table I for details). Fig. 2 shows the performance comparisons of PCA, LDA and PSA (with varying values of α) on the six datasets, while Table II gives the numerical recognition rates of these algorithms. Note that in Table II, the value α* is the one giving the highest recognition rate of PSA; a sketch of this α-sweep protocol is given after the list of observations below. Based on these experiments, we observe the following points:

• LDA does not outperform PCA in some classification tasks, and vice versa. Among the six datasets in our experiments, PCA gives better performance than LDA on the Adiac, Swedish Leaf and ECG datasets, while LDA outperforms PCA on the FaceAll, Wafer and Synthetic Control datasets.

• By integrating the regulating parameter α into the PSA criterion, the performance of PSA is clearly improved. For example, on Adiac, Swedish Leaf, ECG and Wafer, PSA gives much better performance than PCA and LDA for most values of α ranging from 0 to 1. For the FaceAll and Synthetic Control datasets, PSA outperforms PCA and LDA for approximately α ≤ 0.5, and lies between the PCA and LDA performance for α ≥ 0.5.

• The performance of PSA with the optimal value of α is considerably better than that of PCA and LDA, as can be seen in Table II.
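As referenced above, the following sketch illustrates how an α* of the kind reported in Table II could be selected by sweeping α over a grid. The paper does not state which classifier operates in the projected space, so the 1-nearest-neighbour rule and the grid over (0, 1] below are our assumptions; project stands for any routine with the signature of the psa_projection sketch in Section III.

```python
import numpy as np

def nn_accuracy(Ytr, ytr, Yte, yte):
    """Recognition rate of a 1-nearest-neighbour classifier in the projected space
    (the choice of classifier is an assumption, not specified in the paper)."""
    ytr, yte = np.asarray(ytr), np.asarray(yte)
    d = ((Yte[:, None, :] - Ytr[None, :, :]) ** 2).sum(axis=2)   # squared distances
    pred = ytr[d.argmin(axis=1)]
    return float((pred == yte).mean())

def best_alpha(Xtr, ytr, Xte, yte, m, project,
               alphas=np.linspace(0.05, 1.0, 20)):
    """Sweep alpha and return the value giving the highest recognition rate.
    The grid is illustrative and avoids alpha = 0, where S_w may be singular."""
    best_a, best_score = alphas[0], -1.0
    for a in alphas:
        W = project(Xtr, ytr, m, a)                    # e.g. the psa_projection sketch
        score = nn_accuracy(Xtr @ W, ytr, Xte @ W, yte)
        if score > best_score:
            best_a, best_score = a, score
    return best_a, best_score
```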

V. CONCLUSION

In this paper, we show how to combine the descriptive features of PCA and the discriminant features of LDA by proposing a new criterion, PSA. We derived the method and conducted experiments showing the improved classification performance of PSA. The results are quite promising, which encourages future work on the following aspects:

• A way to automatically choose the optimal value of α should be considered. This could be done, for example, using a genetic algorithm.

• Kernelization of PSA should be studied and comparedto that of PCA and LDA.

REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed.). San Diego, CA, USA: Academic Press Professional, Inc., 1990.


TABLE I: Description of the datasets used in the experiments.

Name                Reference   No. of classes   Training samples   Testing samples   Dimension of sample
Synthetic Control   [6]         6                300                300               60
FaceAll             [7]         14               560                1,690             131
Swedish Leaf        [8]         15               500                625               128
Wafer               [9]         2                1,000              6,164             152
ECG                 [9]         2                100                100               96
Adiac               [10]        37               390                391               176

[Fig. 2: Performance comparisons of PCA, LDA and PSA (with varying values of α) on the six datasets. Each panel plots the correct recognition rate (%) against the value of alpha for PCA, LDA and PSA: (a) Adiac, (b) Swedish Leaf, (c) ECG, (d) FaceAll, (e) Wafer, (f) Synthetic Control.]

[2] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 2, pp. 228–233, 2001.

[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.

[4] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, "A new LDA-based face recognition system which can solve the small sample size problem," Pattern Recognition, vol. 33, no. 10, pp. 1713–1726, 2000.

[5] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data - with application to face recognition," Pattern Recognition, vol. 34, no. 10, pp. 2067–2070, 2001.

[6] D. T. Pham and A. B. Chan, "Control chart pattern recognition using a new type of self-organizing neural network," Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 212, no. 2, pp. 115–127, 1998.

[7] E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana, "The UCR time series classification/clustering homepage," 2006.

[8] O. J. O. Söderkvist, "Computer vision classification of leaves from Swedish trees," Master's thesis, Linköping University, SE-581 83 Linköping, Sweden, September 2001, LiTH-ISY-EX-3132.

[9] R. T. Olszewski, "Generalized feature extraction for structural pattern recognition in time-series data," Ph.D. dissertation, 2001 (co-chairs: Roy Maxion and Dan Siewiorek).

[10] A. Jalba, M. Wilkinson, J. Roerdink, M. Bayer, and S. Juggins, "Automatic diatom identification using contour analysis by morphological curvature scale spaces," vol. 16, no. 4, pp. 217–228, September 2005.