
INDEPENDENT COMPONENT DISCRIMINANT ANALYSIS

Umberto Amato
Istituto per le Applicazioni del Calcolo ‘M. Picone’, CNR - Sezione di Napoli
Via Pietro Castellino 111, 80131 Napoli (Italy)
E-mail: [email protected]

Anestis Antoniadis and Gérard Grégoire
LMC-IMAG, Université Joseph Fourier
BP 53, 38041 Grenoble Cedex 09 (France)
E-mail (Antoniadis): [email protected]
E-mail (Grégoire): [email protected]

Abstract. We introduce a nonparametric method for discriminant analysis based on the search for independent components in a signal (ICDA). Key points of the method are: reformulation of the classification problem in terms of transform matrices; use of Independent Component Analysis (ICA) to choose a transform matrix so that the transformed components are as independent as possible; nonparametric estimation of the density function of each independent component; and application of a Bayes rule for class assignment. Convergence of the method is proved and its performance is illustrated on simulated and real data examples.

Keywords: Classification, Discriminant Analysis, Principal Component Analysis, Independent Component Analysis, Kernel regression.


AMS Subject Classification: Primary 62H30; Secondary 62G08.

1 Introduction

The purpose of classical discriminant analysis is to classify a p-variate vector x as having come from one of K populations. In the standard setting, the independent observations are described by multivariate random vectors, such that all the measurements of each population k compose a distribution of values characterized by a probability density f_k(x). An observation is assumed to be drawn from one and only one class and an error is incurred if it is assigned to a different one. The cost or loss associated with such an error is usually defined by

L(k, \hat{k}),   1 \le k, \hat{k} \le K,

where k is the correct population class and \hat{k} is the assignment that was actually made. A special but commonly occurring loss L is the 0-1 loss defined by

L(k, \hat{k}) = 1 - \delta(k, \hat{k}),

where \delta denotes the Kronecker symbol; it assigns a loss of one unit for each mistake, irrespective of its type. This is the loss that we assume throughout this paper.

The most often applied classification rules are based on normal-theory classification, which assumes that the class-conditional densities f_k, k = 1, ..., K, are Gaussian with mean vectors \mu_k and variance-covariance matrices \Sigma_k. Such standard parametric rules include linear and quadratic discriminant analysis (e.g. Anderson (1984)), which have been shown to be quite useful in a wide variety of problems. However, in practice, the form of the class-conditional densities is seldom known. One way to mitigate this problem is to estimate these densities by nonparametric methods. Indeed, much attention has recently been given to the application of nonparametric methods to the classification problem, including methods such as neural networks (Ripley, 1994), classification and regression trees (Breiman et al., 1984), flexible discriminant analysis (Hastie, Tibshirani and Buja (1994)) and multivariate adaptive regression splines (Friedman (1991)). These methods have often been shown to exhibit superior performance over standard parametric rules. A disadvantage of such models may be a lack of parsimony in the final model and a sensitivity to the “curse of dimensionality” when the dimension p is large and the sample sizes are moderate.

In this paper we present a nonparametric discriminant analysis method that is a simple generalization of the model assumed by linear and quadratic discriminant analysis. This generalization relies upon a transformation of the data based on independent component analysis (ICA), a statistical method for transforming an observed multivariate vector into components that are stochastically as independent as possible from each other.

In Section 2, we briefly review the classification problem and nonparametric classification rules based on multivariate kernel density estimators. The motivation of our method is also given in that section through an alternative interpretation of linear and quadratic discriminant analysis. Section 3 is devoted to a brief presentation of independent component analysis, together with some results on the optimal behavior of the resulting estimated transforms. A new family of discrimination rules is presented in Section 4, together with some theoretical asymptotic results on consistency; suggestions and practical guidelines are also provided in that section. Finally, simulations and performance comparisons with other methods on real examples confirming the usefulness of our approach are presented in Section 5, together with some conclusions.

2 Parametric and nonparametric classification rules

When allocating an object with measurement vector x into one of K possible unordered populations, arbitrarily labeled 1, ..., K, the risk incurred is given by

R(\hat{k} \mid x) = \frac{\sum_{k=1}^{K} L(k, \hat{k})\, f_k(x)\, \pi_k}{\sum_{k=1}^{K} f_k(x)\, \pi_k},   (1)

where \pi_k is the a priori probability of observing an individual from population k and f_k is the class-conditional density of population k. The risk can be minimized by choosing \hat{k} to minimize the numerator in Equation (1), leading to the so-called Bayes decision rule (see for example Anderson (1984)). For the special case of a 0-1 loss, the Bayes decision rule reduces to the following simple rule: allocate x to population \hat{k} such that

\hat{k} = d(x) = \arg\max_{k=1,\dots,K} \{ f_k(x)\, \pi_k \}.   (2)
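As a concrete illustration, the following minimal Python sketch (ours, not part of the paper; function and argument names are purely illustrative) applies rule (2) once class-conditional density estimates and class prior estimates are available.

```python
import numpy as np

def bayes_rule(x, densities, priors):
    """Plug-in Bayes rule (2) under 0-1 loss: allocate x to the class k
    that maximizes f_k(x) * pi_k.  `densities` is a list of callables
    giving (estimated) class-conditional densities and `priors` the
    class prior probabilities; both names are illustrative."""
    scores = [f(x) * pi for f, pi in zip(densities, priors)]
    return int(np.argmax(scores))  # label in 0, ..., K-1
```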

The class-conditional densities and the unconditional class prior probabilities are usually unknown. More often one is able to obtain a sample of observations from each class that are correctly classified by some external mechanism. When the training sample data can be regarded as drawn randomly from the pooled population, the prior probabilities are estimated by the fraction of each class in the pooled sample,

\hat{\pi}_k = N_k / N,

where N_k is the sample size of class k and N = \sum_{k=1}^{K} N_k. The most often applied classification rules are derived by assuming that the class-conditional densities are p-variate normal with mean vectors \mu_k and nonsingular variance-covariance matrices \Sigma_k, k = 1, ..., K. Substitution of the corresponding Gaussian densities in expression (2) leads to the classification rule

d_{Gauss}(x) = \arg\min_{k=1,\dots,K} \left\{ (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln \det(\Sigma_k) - 2 \ln \pi_k \right\}.   (3)

The above classification rule leads to the quadratic discriminant function. An important special case is the one where all the variance-covariance matrices are assumed to be equal, resulting in what is called linear discriminant analysis (see for example Friedman (1989)).

Quadratic and linear discriminant analysis work well when the class-conditional densities are approximately normal and good estimates can be obtained for the population parameters \mu_k and \Sigma_k. These parameters are usually estimated by their sample analogs

\hat{\mu}_k = \bar{x}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} x_{ik}

and

\hat{\Sigma}_k = S_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (x_{ik} - \bar{x}_k)(x_{ik} - \bar{x}_k)^T,

where {x_{ik}, i = 1, ..., N_k} is the training sample from population k. Although reasonable, these two approaches enjoy optimal properties only when the population distributions are normal. Under substantial departures from normality these discriminant procedures are highly biased.

The parametric approach to discriminant analysis has been naturally extended to the case where nothing is known about the densities f_k except possibly for some assumptions about their general behavior (see Fix and Hodges, 1951). The suggested approach is to estimate the densities f_k on IR^p using nonparametric density estimates based on the training samples and to substitute these estimates into the Bayes decision rule (2), giving a nonparametric discriminant rule. The most often used procedure for nonparametric density estimation is kernel density estimation with appropriate smoothing parameter selection (see e.g. Silverman (1986)). More precisely, in nonparametric classification by kernel density estimation the class-conditional densities are estimated with multivariate kernel density estimators of the form

\hat{f}_k(x) = \frac{1}{N_k} \sum_{i=1}^{N_k} K(x - x_{ik}; H_k),   (4)

where K denotes a multivariate kernel function from IR^p into IR, and H_k is usually a p-dimensional vector of appropriate bandwidths. When the dimension p is large, the density estimation problem can be difficult. Most often the kernel K is taken to be a product of univariate Gaussian kernel functions, leading to estimates of the form

\hat{f}_k(x) = (2\pi)^{-p/2} \, (h_{k1} h_{k2} \cdots h_{kp})^{-1} \, N_k^{-1} \sum_{\ell=1}^{N_k} \prod_{j=1}^{p} \exp\left\{ -\frac{(x_j - x_{\ell k j})^2}{2 h_{kj}^2} \right\},   (5)

usually called Gaussian product kernel estimators (see e.g. Scott (1992)). However, it is known that while in one-dimensional density estimation it is not crucial to estimate the tails particularly accurately, this is no longer true in high-dimensional spaces, where regions of relatively low density can still be extremely important parts of the multidimensional density. A Gaussian product kernel estimator may therefore be inappropriate because of the short tails of the normal density. Another problem with multivariate kernel density estimators is the so-called “curse of dimensionality”: apparently large regions of high density may be completely devoid of observations in a sample of moderate size (the empty space phenomenon, as Scott et al. (1977) call it). Therefore, multivariate density estimation is usually not applied in practice when p > 5. It can be shown, however, that when the dimension p is moderate relative to the sample sizes of the classes, and under severe departures from normality, the rewards of nonparametric classification by kernel methods are substantial.
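For reference, here is a minimal sketch (ours, not the authors' code; function and argument names are illustrative) of the Gaussian product kernel estimator (5) for one class.

```python
import numpy as np

def gaussian_product_kde(x, train, bandwidths):
    """Gaussian product kernel estimate (5) of one class-conditional
    density at a point x.  `train` is the (N_k, p) training sample of
    the class, `bandwidths` a length-p vector of bandwidths h_kj."""
    train = np.asarray(train, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    n, p = train.shape
    # (2*pi)^(-p/2) * (prod_j h_j)^(-1) * N^(-1)
    #   * sum_l prod_j exp(-(x_j - x_lj)^2 / (2 h_j^2))
    z = (np.asarray(x, dtype=float) - train) / h      # shape (N_k, p)
    kern = np.exp(-0.5 * np.sum(z**2, axis=1))        # product over j
    return kern.sum() * (2 * np.pi)**(-p / 2) / (np.prod(h) * n)
```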

In order to circumvent the biased tail estimation and the empty space phenomenon common to these nonparametric multivariate kernel density estimators, we now present an alternative view of QDA and LDA. This view will allow us to extend the nonparametric classification problem appropriately to a small-sample, high-dimensional setting.

Under QDA and LDA the class-conditional densities f_k are replaced by Gaussian density estimates. Every variance-covariance matrix \Sigma_k is positive definite and nonsingular. Therefore, for each k = 1, ..., K, there exists an orthonormal matrix H_k such that H_k \Sigma_k H_k^T = D_k = \mathrm{Diag}(\sigma_{1k}^2, \dots, \sigma_{pk}^2). The matrices H_k and D_k are usually obtained by means of a singular value decomposition of \Sigma_k. Now, if X is N_p(\mu_k, \Sigma_k) distributed, the random vector Y defined by

Y = H_k X

is Gaussian with a N_p(H_k \mu_k, D_k) distribution and therefore has stochastically independent components. By orthonormality of H_k, one also has

X = H_k^T Y.

A change of variables gives

f_k(x) = f_{H_k^T Y}(x) = f_Y(H_k x) \, |\det(H_k^T)| = f_Y(H_k x) = \prod_{j=1}^{p} \frac{1}{\sigma_{jk}} \, \phi\!\left( \frac{(H_k x)_j - (H_k \mu_k)_j}{\sigma_{jk}} \right),   (6)

where \phi(t) = (2\pi)^{-1/2} \exp(-t^2/2) denotes the density of a standard normal variable.

Now, an equivalent formulation of LDA and QDA can be given as follows. For each population k, k = 1, ..., K, estimate H_k by an appropriate estimator \hat{H}_k and compute the transformed sample vectors Y_{\ell,k} = \hat{H}_k X_{\ell,k}, \ell = 1, ..., N_k. Using the transformed data, fit a univariate Gaussian density in each direction of Y_k and estimate the joint density of the p components of Y_k by the product of the univariate Gaussian density estimates. If the resulting multivariate density estimate of f_k is called \hat{f}_k, back-transform it by expression (6) to obtain

\hat{f}_k^N(x) = \hat{f}_k(\hat{H}_k x).

Linear and quadratic discriminant analysis therefore differ only in the way the \hat{H}_k are estimated.
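A minimal sketch of this reformulated QDA fit for a single class (ours, not the authors' code; names are illustrative) might look as follows. It uses an eigendecomposition of the per-class sample covariance, which coincides with the SVD mentioned above for a symmetric positive definite matrix.

```python
import numpy as np

def qda_transform_fit(samples):
    """Reformulated QDA fit for one class: estimate mu_k and Sigma_k,
    diagonalize Sigma_k to obtain the orthonormal transform H_k, and
    return the standard deviations of the transformed (uncorrelated)
    components."""
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(sigma)   # Sigma = V diag(eigvals) V^T
    H = eigvecs.T                              # rows of H are eigenvectors
    return mu, H, np.sqrt(eigvals)

def qda_transform_density(x, mu, H, sds):
    """Back-transformed density (6): product of univariate normal
    densities in the transformed coordinates (|det H| = 1)."""
    y = H @ (np.asarray(x, dtype=float) - mu)
    return float(np.prod(np.exp(-0.5 * (y / sds)**2) / (np.sqrt(2 * np.pi) * sds)))
```

Replacing the per-class covariance by a pooled estimate would give the corresponding LDA variant of this sketch.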

The transforms H_k make the components of the transformed vectors independent and Gaussian. The independence allows us to use as a density estimator the product of univariate densities, which, by normality, are fitted with normal densities. A natural generalization is therefore to seek a transform that makes the components of a random vector mutually independent irrespective of its distribution. Independent component analysis achieves such a task.

3 Independent Component Analysis

Independent component analysis (ICA) is a statistical method for linearly transforming an observed multidimensional random vector X into a random vector Y whose components are stochastically as independent from each other as possible. Several procedures to find such transformations have recently been developed in the signal processing literature, relying either on Comon's information-theoretic approach (Comon, 1994) or on Hyvarinen's maximum negentropy approach (Hyvarinen, 1997). The basic goal of ICA is to find a representation Y = MX (M not necessarily a square matrix) in which the transformed components Y_i are the least statistically dependent. ICA leads to meaningful results whenever the probability distribution of X is far from Gaussian, and this is the case of interest in this paper. In ICA, the (pseudo)inverse A of M is called the mixing matrix. The basis vectors a_i (rows of A) are generally not mutually orthogonal. This can be compared to Principal Component Analysis (PCA), where the matrix H_k defining an analogous transform has orthonormal rows. When dealing with non-Gaussian data, the orthonormality requirement of PCA is an unnatural constraint, whereas the independence assumption of ICA is roughly valid. When the data are normally distributed, PCA leads to the H_k transform of the previous section and therefore performs an independent component analysis, but this is the only case where PCA and ICA lead to the same transforms. On the other hand, PCA is easier to derive because it requires only second-order statistics.

The foundations of ICA rely upon the concept of mutual information. Indeed, it is known from the work of Joe (see Joe (1987, 1989)) that relative entropy measures are very efficient in measuring stochastic dependence. It is not the purpose of this paper to describe the theoretical foundations of ICA; for a detailed review the reader is referred to the original work of Comon (1994) and to the papers of Hyvarinen (1997, 1999). Let us only say here that the linear ICA transform is found by minimizing the mutual entropy between the resulting transformed vector and the product of its marginals, or equivalently by maximizing the negentropy (or differential entropy). In order to obtain a simple and fast computable optimization algorithm, Hyvarinen approximates the negentropy J(Y_M) of Y_M = MX by an approximation of the form

J_G(Y_M) = \sum_{i=1}^{p} \left\{ IE(G(Y_i)) - IE(G(Z)) \right\}^2,

where Z is a zero-mean standard normal random variable. In the ICA terminology, the functions G are called contrast functions and several choices of G are possible. The most usual one is the power-3 transform, G(t) = t^3.
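As an illustration (ours, not part of the paper), the approximation J_G can be evaluated on a sample of the transformed vector as follows; IE(G(Z)) is estimated here by Monte Carlo, a simplification of the closed-form constants used in practice.

```python
import numpy as np

def negentropy_approximation(Y, G=lambda t: t**3, n_mc=100_000, seed=0):
    """Approximate negentropy J_G of the transformed vector Y_M:
    sum over components of (E[G(Y_i)] - E[G(Z)])^2 with Z ~ N(0, 1).
    `Y` is an (n, p) sample of the transformed vector; E[G(Z)] is
    estimated by Monte Carlo (for G(t) = t^3 it is exactly 0)."""
    rng = np.random.default_rng(seed)
    eg_z = G(rng.standard_normal(n_mc)).mean()      # E[G(Z)]
    eg_y = G(np.asarray(Y, dtype=float)).mean(axis=0)  # E[G(Y_i)] per component
    return float(np.sum((eg_y - eg_z)**2))
```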

The behavior of the transform obtained by the ICA procedure has been analyzed thoroughly by Hyvarinen (1997). For the sake of completeness, and since we are going to use these results later, we restate them below. We assume hereafter that the ICA model holds, i.e., there exists a matrix M such that the random vector Y = MX has stochastically independent components. We then have

Theorem 3.1 (Hyvarinen (1997), Th. 1) Assume that the contrast function G is a sufficiently smooth even function. The set of local minima of J_G(Y_M), under the constraint that the variances of the components of Y_M are equal to 1, includes the true matrix M. In particular, if the marginals of Y_M have a distribution with non-zero kurtosis and if G(t) = t^4, there are no spurious optima.

We also have a result on the asymptotic behavior when the expectations in the ICA criterion are replaced by their empirical estimates based on a sample of n independent random vectors with the same distribution as X. An equivalent formulation of Theorem 2 of Hyvarinen (1997), more convenient for our purposes, is the following.

Theorem 3.2 (Hyvarinen (1997), Th. 2) Under the same assumptions as in Theorem 3.1, the empirical ICA transform M_n converges a.s. and in the mean squared sense towards M as n goes to \infty. Moreover we have

IE(\| M_n - M \|^2) = O\!\left( \frac{1}{n} \right).

This theorem follows easily from the strong law of large numbers and the asymptotic variance of \sqrt{n}\, M_n given in the above-mentioned paper.

4 Independent Component Discriminant Analysis

The contents of the previous sections suggest the following approach for nonparametric classification. In order to avoid the curse of dimensionality, we first try to find a linear transformation Y = MX of the vector X by means of the ICA procedure. Since the components of Y are now approximately independent, the density of the probability distribution of Y is approximately given by the product of its univariate marginal densities. These marginal densities are then fitted by univariate nonparametric kernel density estimates, mimicking the QDA and LDA cases where the marginal densities were estimated by univariate Gaussian densities. The overall classification procedure may be summarized by the following algorithm:

1. For each class k, k = 1, ..., K, use the training sample of size N_k to estimate the mean \mu_k of class k, and use this estimate to center the data within the k-th class. Then use the ICA algorithm on the centered data to derive the optimal transform M_k.

2. Using the matrix M_k, compute the (centered) transformed sample Y_{\ell,k} = M_k X_{\ell,k}, \ell = 1, ..., N_k, and for each direction j (j = 1, ..., p) use an adaptive univariate kernel density estimator to estimate by \hat{f}_{jk} the density of the j-th component of Y_k.

3. For a new observation x, compute for each class k the product of the estimated marginal densities at the point x,

\prod_{j=1}^{p} \hat{f}_{jk}\big( (M_k x)_j \big) \cdot |\det(M_k)|,

and substitute the results into the Bayes rule (2) to get the estimated label of x (a code sketch of the full procedure is given after the algorithm).
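The sketch below (ours, not the authors' implementation, which is in Matlab with FastICA and an adaptive kernel density estimator) mirrors the three steps under stated assumptions: it uses scikit-learn's FastICA (assuming scikit-learn >= 1.1 for the `whiten="unit-variance"` option) and scipy's fixed-bandwidth `gaussian_kde` as a stand-in for the adaptive estimator; all function names are ours.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import FastICA   # assumes scikit-learn is available

def icda_fit(class_samples):
    """Step 1 + step 2 for each class.  `class_samples` is a list of
    (N_k, p) arrays, one per class.  For each class: center the data,
    run FastICA to obtain the unmixing matrix M_k, and fit a univariate
    kernel density to every transformed component."""
    models = []
    for X in class_samples:
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        ica = FastICA(whiten="unit-variance", random_state=0).fit(X - mu)
        M = ica.components_                       # unmixing matrix M_k
        Y = (X - mu) @ M.T                        # ~independent components
        kdes = [gaussian_kde(Y[:, j]) for j in range(Y.shape[1])]
        models.append((mu, M, kdes))
    return models

def icda_classify(x, models, priors):
    """Step 3: evaluate the product of marginal density estimates
    (times |det M_k|) for each class and apply the Bayes rule (2).
    x is centered with the class mean, as in step 1."""
    scores = []
    for (mu, M, kdes), pi in zip(models, priors):
        y = M @ (np.asarray(x, dtype=float) - mu)
        dens = np.prod([kde(yj)[0] for kde, yj in zip(kdes, y)])
        scores.append(pi * dens * abs(np.linalg.det(M)))
    return int(np.argmax(scores))
```

For a two-class problem with equal priors, usage would be along the lines of `models = icda_fit([X1, X2]); label = icda_classify(x, models, [0.5, 0.5])`.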


This is the procedure that we have used in our examples. In order to justify this generalization, a fundamental property of our classification rule is its asymptotic consistency as the sample size of each class in the training sample goes to \infty. In order to show that the resulting decision rule is asymptotically optimal it is necessary to prove that it converges to the Bayes rule (2). This amounts to showing that our estimated kernel product density \prod_{j=1}^{p} \hat{f}_{jk}\big( (\hat{M}_k x)_j \big) \cdot |\det(\hat{M}_k)| converges uniformly towards \prod_{j=1}^{p} f_{jk}\big( (M_k x)_j \big) \cdot |\det(M_k)|, for each k, as N_k \to \infty. More precisely we can state the following theorem.

Theorem 4.1 Assume that the ICA model holds. Assume also that the class-conditional densities f_k, k = 1, ..., K, are compactly supported, differentiable and lower bounded by a strictly positive constant, which may differ from one class to the other. Let K be a compactly supported, continuously differentiable univariate density kernel such that

\int x K(x)\, dx = 0 \quad \text{and} \quad 0 < \int x^2 K(x)\, dx < \infty.

Let h be a kernel bandwidth such that h \to 0 and \min_k N_k h^2 \to \infty as \min_k N_k \to \infty. Then

\prod_{j=1}^{p} \hat{f}_{jk}\big( (\hat{M}_k x)_j \big) \cdot |\det(\hat{M}_k)|

converges uniformly in L^1 towards

\prod_{j=1}^{p} f_{jk}\big( (M_k x)_j \big) \cdot |\det(M_k)|.

The fact that our decision rule converges uniformly in probability towards the Bayes rule and is asymptotically optimal follows immediately as a corollary of Theorem 4.1.

Proof. Note first that in order to prove the assertion it is enough to prove the result componentwise, i.e., that for any j, j = 1, ..., p, and any k, k = 1, ..., K, \hat{f}_{jk}(\hat{m}_k^T x) converges uniformly in L^1 towards f_{jk}(m_k^T x), where m_k^T denotes the j-th row of the matrix M_k. To do so, it is again sufficient to control the convergence of

\hat{f}_{jk}\Big( \hat{m}_{k1} x_1 + \sum_{i>1} m_{ki} x_i \Big) \to f_{jk}\Big( m_{k1} x_1 + \sum_{i>1} m_{ki} x_i \Big).   (7)

Now, to prove (7) we will first prove that

\hat{f}_{jk}\Big( \hat{m}_{k1} x_1 + \sum_{i>1} m_{ki} x_i \Big) \to \hat{f}_{jk}\Big( m_{k1} x_1 + \sum_{i>1} m_{ki} x_i \Big),   (8)

and use standard results on the uniform convergence of \hat{f}_{jk} to f_{jk} on the support of f_{jk} (see Wand and Jones, 1995). The proof of (8) follows from a Taylor expansion of \hat{f}_{jk} in a neighborhood of m_{k1} x_1:

|\hat{f}_{jk}(\hat{m}_{k1} x_1 + u) - \hat{f}_{jk}(m_{k1} x_1 + u)| \le \Big| \frac{\partial \hat{f}_{jk}}{\partial m_{k1}}(m_{k1} x_1) \Big|\, |\hat{m}_{k1} - m_{k1}| + o(|\hat{m}_{k1} - m_{k1}|).

A simple calculation, the Cauchy-Schwarz inequality and the asymptotic properties of \hat{m}_{k1} show that

IE\big( |\hat{f}_{jk}(\hat{m}_{k1} x_1 + u) - \hat{f}_{jk}(m_{k1} x_1 + u)| \big) \le \big( O(n^{-2} h^{-3}) + O(n^{-1} h^{-2}) \big)^{1/2} + O(n^{-1/2}).

The result of the theorem follows from our assumptions on the asymptotic behavior of h.

5 Examples and discussion

In this section we present two simulated examples and compare the performance of our ICDA procedure with that of other parametric and nonparametric rules on some real data sets that have appeared elsewhere in the literature.

5.1 Simulations

The first example (waveform) is taken from Breiman et al. (1984). It is a three-class problem with 21 variables. The predictors are defined by

x_i = u h_1(i) + (1 - u) h_2(i) + \varepsilon_i   (Class 1)
x_i = u h_1(i) + (1 - u) h_3(i) + \varepsilon_i   (Class 2)
x_i = u h_2(i) + (1 - u) h_3(i) + \varepsilon_i   (Class 3)

where i = 1, ..., 21, u is uniform on [0, 1], the \varepsilon_i are standard normal random variables and the h_i are the shifted triangular forms defined by h_1(i) = \max(6 - |i - 11|, 0), h_2(i) = h_1(i - 4) and h_3(i) = h_1(i + 4). The training sample has 500 observations, and equal priors were used. We have used a test sample of size 300. Training samples are plotted in Figure 1 for each class.
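For reproducibility, a short generator for this waveform data might look as follows (a sketch of ours; the function name and interface are illustrative, not from the paper).

```python
import numpy as np

def waveform_sample(n, rng=None):
    """Generate n observations of the 21-variable, 3-class waveform
    data of Breiman et al. (1984), as described above."""
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(1, 22)
    h1 = np.maximum(6 - np.abs(i - 11), 0)
    h2 = np.maximum(6 - np.abs(i - 15), 0)     # h2(i) = h1(i - 4)
    h3 = np.maximum(6 - np.abs(i - 7), 0)      # h3(i) = h1(i + 4)
    pairs = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}
    X, labels = [], []
    for _ in range(n):
        k = int(rng.integers(1, 4))            # equal priors over 3 classes
        u = rng.uniform()
        a, b = pairs[k]
        X.append(u * a + (1 - u) * b + rng.standard_normal(21))
        labels.append(k)
    return np.array(X), np.array(labels)
```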

Table 1 shows the averaged success rates for the train dataset and the test dataset over 200 simulations, with the standard error of the average in parentheses. The number of independent components is 21 (equal to the number of variables). Comparison is also shown with some parametric and nonparametric classification methods, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and Flexible Discriminant Analysis endowed with BRUTO (FDA, Hastie et al., 1994).


Figure 1: Plot of the training samples for each of the 3 classes in the waveform example.


Method   Train         Test
LDA      87 (±1)       82 (±2)
QDA      93 (±2)       81 (±2)
FDA      88 (±2)       82 (±2)
ICDA     96.2 (±0.8)   80 (±2)
PCDA     95 (±1)       79 (±2)

Table 1: Success percentage (average value over 200 repetitions and corresponding standard deviation in parentheses) for Linear Discriminant Analysis, Quadratic Discriminant Analysis, Flexible Discriminant Analysis (with BRUTO), Independent Components Discriminant Analysis and Principal Components Discriminant Analysis on the train and test datasets. The table refers to the waveform example.

ICDA compares favourably with the other methods, especially on the train dataset.

The second example (normal) is a two-class problem with 2 variables. For class 1 the first predictor is sampled from the equal-probability mixture of two normal distributions 0.5 N(0, 1) + 0.5 N(4, 1), while the second component is drawn independently from a N(0, 1) distribution. For class 2 the first predictor is sampled from the equal-probability mixture 0.5 N(2, 1) + 0.5 N(6, 1), while again the second component is drawn independently from a N(0, 1) distribution. The final training and test samples are obtained by transforming each sampled vector with the linear transform \Sigma^{1/2}, where

\Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.

The training sample has 100 observations, and equal priors were used. We have used a test sample of size 900. Figure 2 displays a scatterplot of the training samples. The two mixture distributions were chosen in such a way that the two populations overlap. Moreover, the fact that the second component in each class before transformation by \Sigma^{1/2} is independent and Gaussian was set up to test the ability of the ICA algorithm to reduce the discrimination to a one-dimensional space. Figure 3 shows the same plot as Figure 2 for the independent components.
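A short generator for this example, following the description above (a sketch of ours; the function name is illustrative), could be:

```python
import numpy as np
from scipy.linalg import sqrtm

def normal_example(n, rng=None):
    """Generate n observations of the two-class 'normal' example:
    component 1 is a two-component normal mixture with class-dependent
    means, component 2 is N(0, 1), and each vector is then transformed
    by the symmetric square root of Sigma."""
    rng = np.random.default_rng() if rng is None else rng
    Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
    S = np.real(sqrtm(Sigma))                    # Sigma^(1/2)
    means = {1: (0.0, 4.0), 2: (2.0, 6.0)}       # mixture means of component 1
    X, labels = [], []
    for _ in range(n):
        k = int(rng.integers(1, 3))              # equal priors over 2 classes
        m = means[k][rng.integers(0, 2)]         # equal-probability mixture
        x = np.array([rng.normal(m, 1.0), rng.normal(0.0, 1.0)])
        X.append(S @ x)
        labels.append(k)
    return np.array(X), np.array(labels)
```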

Table 2 shows the averaged success rates of ICDA for the train dataset and the test dataset over 200 simulations, with the standard error of the average in parentheses (the number of independent components is 2). Comparison is also shown with other parametric and nonparametric methods. The performance of ICDA is better than that of LDA and QDA, but worse than that of FDA-BRUTO on the test dataset.



Figure 2: Scatterplot of the 2 original components for the training dataset in the normal example. Dots: class 1; circles: class 2.


Figure 3: Scatterplot of the 2 independent transformed components for the training dataset in the normal example. Dots: class 1; circles: class 2.


Method   Train     Test
LDA      62 (±5)   41 (±5)
QDA      63 (±5)   42 (±5)
FDA      76 (±5)   65 (±8)
ICDA     75 (±4)   58 (±8)
PCDA     73 (±4)   54 (±7)

Table 2: Success percentage (average value over 200 repetitions and corresponding standard deviation in parentheses) for Linear Discriminant Analysis, Quadratic Discriminant Analysis, Flexible Discriminant Analysis (with BRUTO), Independent Components Discriminant Analysis and Principal Components Discriminant Analysis on the train and test datasets. The table refers to the normal example.

5.2 Real data examples

The first real data example comes from speech processing (vowel). Fifteen speakers (8 for training and 7 for test) pronounced 11 different words corresponding to different vowel sounds; 6 frames of speech were considered for each record, giving rise to 528 (training) and 462 (test) samples. After proper processing, each sample consisted of 10 components coming from a spectral analysis of the frames (see Hastie et al. (1994) for more details). Therefore we have 11 classes and 10 components. Table 3 shows the performance of ICDA compared with other parametric and nonparametric methods (10 independent components were found by ICA). Again ICDA performs well, attaining the top performance on the train dataset.

Method   Train   Test
LDA      68.4    44.4
QDA      98.9    47.2
FDA      75.8    49.8
ICDA     99.4    46.8
PCDA     98.3    46.1

Table 3: Success percentage for Linear Discriminant Analysis, Quadratic Discriminant Analysis, Flexible Discriminant Analysis, Independent Components Discriminant Analysis and Principal Components Discriminant Analysis on the train and test datasets. The table refers to the vowel example.

The second real data example (eeg) concerns human event-related potential (ERP) records described in Makeig et al. (1999). Ten subjects were presented with five squares, one of which was randomly coloured green (the attended square); then a circle randomly appears inside one of the boxes. If the circle fills the attended green box, the subject is required to press a thumb button as soon as possible. ERP records are sampled at 512 Hz for 1000 msec (512 points) from 29 scalp electrodes mounted in an electrode cap. For a more detailed description of the experiment we refer to Makeig et al. (1999). Even though the experiment was originally intended for the decomposition of late positive complexes (LPCs), it is also well suited for classification purposes, where two classes can be defined according to whether or not the subject pushes the button on time when the circle fills the (attended) green box. Twenty-five records are available for each scalp electrode, 5 successes and 20 failures of the subjects, obtained by averaging a larger number of experiments. In the train example these 25 samples were considered both for training and classification. In the test example 13 randomly chosen samples (3 successes and 10 failures) were considered for training, while the remaining samples form the test dataset. In order to improve the effectiveness of the classification methods, the data were first subjected to a noise removal phase by wavelet regularization. Figure 4 shows a plot of the (filtered) records for the two classes.

Table 4 shows the success percentage of the ICDA method for the train and test examples, for the records corresponding to the first electrode. In the test example 200 different choices of the train dataset were considered (13 records extracted from the whole set of 25) and the table shows the average percentage together with its standard deviation. The number of independent components selected was 4 for the first class and 19 for the second in the train experiment, and 2 for the first class and 9 for the second in the test experiments. Comparison with the other methods is not shown because they fail when the dimension of the samples is larger than the number of samples in each class.

Method   Train   Test
ICDA     100     62 (±20)
PCDA     100     61 (±20)

Table 4: Success percentage for Independent Components Discriminant Analysis and Principal Components Discriminant Analysis on the train and test datasets. The table refers to the eeg example.

5.3 Discussion and some conclusions

Summarizing the results of the experiments, the performance of ICDA is competitive with that of parametric and nonparametric methods. In particular, the top performance is reached on the training examples, which suggests that ICDA performs particularly well when the training dataset is well representative of the problem at hand.



Figure 4: Plot of the training samples for each of the 2 classes in the eeg example.



The algorithm developed for ICDA has been implemented in Matlab. It relies on the FastICA algorithm by Hyvarinen (1999). The number of independent components sought can be less than the dimension of the samples (indeed, the preliminary PCA step is already able to remove some redundant components). On the other hand, the performance of ICDA depends significantly on the number of independent components found. In fact, since components are defined only up to their order and up to a multiplicative constant, it is not possible to pick the most significant ones, and different seeds for starting the iterations give rise, in general, to different components at each iteration (for more details see Hyvarinen (1999)). Sometimes the algorithm fails to extract a component in a reasonable number of iterations; in this circumstance a new iteration is started with a different seed, so that the number of independent components is kept fixed for all experiments within the same problem. As a general rule, the best results are obtained when the maximum number of components is allowed.

The code developed is available from the authors upon request.

Different contrast functions were also considered; the results in the tables of Sections 5.1 and 5.2 refer to the contrast function G(t) = t^3. Analogous computations made on the same datasets with the contrast functions G(t) = tanh t and G(t) = t exp(-t^2/2) did not show substantial differences and are not reported here for the sake of brevity.

Methods other than FastICA could be used to find the independent components, even though significant differences are not to be expected. Extension of the transform to more general transformations (e.g., nonlinear ICA) could also be considered.

The method introduced in the present paper, and in particular the general formulation (6), has a validity that goes beyond the application of ICA for estimating the transforms H_k. Different choices of H_k yield different classification methods. A simple variant replaces the ICA step by a PCA one (PCDA). Even though the former is a generalization of the latter (indeed they are equivalent only in the case of Gaussian models, where uncorrelatedness implies independence), computational efficiency suggests also considering PCA. Results for the examples of Sections 5.1 and 5.2 are shown in Tables 1-4. It can be seen that ICDA compares slightly more favourably than PCDA, even though the performance of the latter is very good and, in any case, significantly better than that of LDA and QDA. Another interesting perspective, under investigation by the authors, is to choose H_k from a wavelet packet family so that independence is obtained among the packets (see Saito and Coifman (1995) for an attempt in this regard).

Indeed, decorrelation (i.e., PCA) is the first step of ICA; therefore more robust estimates of the covariance matrix on which PCA is based could improve the performance of the ICDA classification, especially for small samples.

A final consideration concerns the kernel method. We used the Kernel Density Estimation Toolbox developed by C.C. Beardah (1995) with Epanechnikov bandwidth. This choice is particularly effective from a computational point of view, even though it results in bandwidths that are larger than the optimal ones. However, this does not affect the overall convergence of the method; on the other hand, the use of an asymptotically optimal criterion for choosing the bandwidth did not improve the misclassification rates significantly, while heavily degrading computational efficiency.

Acknowledgments

The paper was supported by the Agenzia Spaziale Italiana and by CNR/CNRS in the framework of the project “Metodi statistici non parametrici avanzati per l'analisi di dati e applicazioni”. Anestis Antoniadis and Gérard Grégoire are grateful to the Istituto per le Applicazioni della Matematica, where this work was completed, for its warm hospitality.

References

[1] T.W. Anderson, An introduction to multivariate statistical analysis, 2nd ed., John Wiley & Sons, New York (1984).

[2] C.C. Beardah, The Kernel Density Estimation toolbox for Matlab, available at http://euler.ntu.ac.uk/maths.html (1995).

[3] L. Breiman, J.H. Friedman, R. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth (1984).

[4] P. Comon, Independent component analysis, a new concept?, Signal Processing 36, 287–314 (1994).

[5] J. Friedman, Exploratory projection pursuit, J. Amer. Statist. Assoc. 82, 249–266 (1987).

[6] J. Friedman, Multivariate adaptive regression splines (with discussion), Annals of Statistics 19, 1–141 (1991).

[7] E. Fix and J.L. Hodges, Discriminatory analysis - nonparametric discrimination: Consistency properties, Int. Stat. Rev. 57, 238–247 (1989).

[8] T. Hastie, R. Tibshirani and A. Buja, Flexible Discriminant Analysis by optimal scoring, J. Am. Stat. Assoc. 89, 1255–1270 (1994).

[9] A. Hyvarinen, Independent component analysis by minimization of mutual information, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'97), 3917–3920 (1997).

[10] A. Hyvarinen, Fast and Robust Fixed-Point Algorithms for Independent Component Analysis, IEEE Transactions on Neural Networks 10, 626–634 (1999).

[11] H. Joe, Majorization, randomness and dependence for multivariate distributions, Ann. Probab. 15, 1217–1225 (1987).

[12] H. Joe, Relative entropy measures of multivariate dependence, J. Am. Stat. Assoc. 84, 157–164 (1989).

[13] S. Makeig, M. Westerfield, T. Jung, J. Covington, J. Townsend, T.J. Sejnowski and E. Courchesne, Functionally independent components of the late positive event-related potential during visual spatial attention, J. Neuroscience 19, 2665–2680 (1999).

[14] B. Ripley, Neural networks and related methods for classification, J. R. Stat. Soc. Ser. B 56, 409–456 (1994).

[15] N. Saito and R. Coifman, Local discriminant bases and their applications, J. Mathematical Imaging and Vision 5, 337–358 (1995).

[16] D.W. Scott, Multivariate density estimation: theory, practice, and visualization, Wiley, New York (1992).

[17] D.W. Scott, R.A. Tapia and J.R. Thompson, Kernel density estimation revisited, Nonlinear Analysis. Theory Meth. Applic. 1, 339–372 (1977).

[18] B.W. Silverman, Density estimation for statistics and data analysis, Chapman and Hall, London (1986).

[19] M.P. Wand and M.C. Jones, Kernel smoothing, Chapman & Hall, London (1995).