Independent Component Analysis: A flexible non-linearity and decorrelating manifold approach


Richard Everson and Stephen Roberts
Department of Electrical and Electronic Engineering,
Imperial College of Science, Technology & Medicine, London, UK.
{r.everson, [email protected]}

March 3, 1998

Abstract

Independent Components Analysis finds a linear transformation to variables which are maximally statistically independent. We examine ICA and algorithms for finding the best transformation from the point of view of maximising the likelihood of the data. In particular, we examine the way in which scaling of the unmixing matrix permits a "static" nonlinearity to adapt to various marginal densities. We demonstrate a new algorithm that uses generalised exponential functions to model the marginal densities and is able to separate densities with light tails. Numerical experiments show that the manifold of decorrelating matrices lies along the ridges of high-likelihood unmixing matrices in the space of all unmixing matrices. We show how to find the optimum ICA matrix on the manifold of decorrelating matrices and, as an example, use the algorithm to find independent component basis vectors for an ensemble of portraits.

1 Introduction

Finding a natural coordinate system for empirical data is an essential first step in its analysis. Principal components analysis (PCA) is often used to find a basis set which is determined by the dataset itself. The principal components are orthogonal, and projections of the data onto them are linearly decorrelated, properties which can be ensured by considering only the second-order statistics of the data. Independent components analysis (ICA), which has enjoyed recent theoretical (Bell and Sejnowski 1995; Cardoso and Laheld 1996; Cardoso 1997; Pham 1996; Lee et al. 1998) and empirical (Makeig et al. 1996; Makeig et al. 1997) attention, aims at a loftier goal: it seeks a linear transformation to coordinates in which the data are maximally statistically independent, not merely decorrelated. Viewed from another perspective, ICA is a method of separating independent sources which have been linearly mixed to produce the data.

Despite its recent popularity, aspects of ICA algorithms are still poorly understood. In this paper we seek to better understand and improve the technique. To this end we explicitly calculate the likelihood landscape and examine the way in which the maximum likelihood basis is achieved. We report on methods for learning the nonlinearities implicit in the ICA algorithm, and we have implemented an ICA algorithm which can separate leptokurtic (i.e., heavy-tailed) and platykurtic (i.e., light-tailed) sources by modelling marginal densities with the family of generalised exponential densities. We examine ICA in the context of decorrelating transformations, and derive an algorithm which operates on the manifold of decorrelating matrices. As an illustration we apply the algorithm to the "Rogues Gallery", an ensemble of portraits, in order to find the independent component basis vectors for the ensemble.

2 Background

Consider a set of T observations, $x(t) \in \mathbb{R}^N$. Independent components analysis seeks a linear transformation $W \in \mathbb{R}^{M \times N}$ to a new set of variables,

    a = Wx,                                                         (1)

in which the components of a, $a_m(t)$, are maximally independent in a statistical sense. The degree of independence is measured by the mutual information between the components of a:

    I(a) = \int p(a) \log \frac{p(a)}{\prod_m p_m(a_m)} \, da.      (2)

When the joint probability p(a) can be factored into the product of the marginal densities $p_m(a_m)$, the various components of a are statistically independent and the mutual information is zero. ICA thus finds a factorial coding (Barlow 1961) for the observations.

The model we have in mind is that the observations were generated by the noiseless linear mixing of M independent sources $s_m(t)$, so that

    x = Ms.                                                         (3)

The matrix W is thus to be regarded as the (pseudo-)inverse of the mixing matrix M, so successful estimation of W constitutes blind source separation. Note that W is not the exact inverse of M, since the mutual information does not change under permutation of the coordinates or under scaling of any coordinate. So, if P is a permutation matrix and D a diagonal matrix, we have

    WM = PD.                                                        (4)

It should be noted, however, that it may not be possible to find a factorial coding with a linear change of variables, in which case there will be some remaining statistical dependence between the $a_m$.

2.1 Neuro-mimetic Algorithm

ICA has been brought to the fore by Bell & Sejnowski's (1995) neuro-mimetic formulation, which we now review. For simplicity, we keep to the standard assumption that M = N.

Bell & Sejnowski introduce a non-linear, component-wise mapping of a into a space in which the marginal densities are uniform. Thus

    y = g(a),    y_m = g_m(a_m).                                    (5)

The linear transformation followed by the non-linear map may be accomplished by a single-layer neural network in which the elements of W are the weights and the M neurons have transfer functions $g_m$.

Since the mutual information is constant under one-to-one, component-wise changes of variables, I(a) = I(y), and since the $g_m$ are, in theory at least, chosen to generate uniform marginal densities $p_m(y_m)$, the mutual information I(y) is equal to the negative of the entropy of y:

    I(y) = -H(y) = \int p(y) \log p(y) \, dy = \langle \log p(y) \rangle_y,   (6)

where $\langle \cdot \rangle_y$ denotes the expectation over the y space.

We emphasise that the point of the non-linear change of variables is to bring the problem into a space in which the marginal densities are uniform and so can be ignored in the mutual information expression (2). This trick means that one does not have to take the awkward step of estimating (and dividing by) the marginal densities.

Under a one-to-one change of variables $x \to y$, probabilities are transformed as

    p_y(y) = \frac{p_x(x)}{|J|},                                    (7)

where in this case

    |J| = \det\frac{\partial y_i}{\partial x_j} = \det W \prod_m g'_m(a_m)   (8)

is the determinant of the Jacobian. Changing variables, the entropy can therefore be written as

    H(y) = -\int p_x(x) \log\frac{p_x(x)}{|J|}\, dx                 (9)
         = -\left\langle \log\frac{p_x}{|J|} \right\rangle          (10)
         = \langle \log|J| \rangle - \langle \log p_x(x) \rangle    (11)
         = \log|W| + \left\langle \sum_m \log g'_m(a_m) \right\rangle - \langle \log p_x(x) \rangle,   (12)

where expectations are now over x space.

In order to perform any gradient-based optimisation towards the maximum entropy, and therefore the minimum mutual information, the gradient of H with respect to the elements of W is needed. Since the final term of (12) does not depend on W,

    \frac{\partial H}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}} \log|W| + \sum_m \left\langle \frac{\partial}{\partial W_{ij}} \log g'_m(a_m) \right\rangle   (13)
                                       = \left[W^{-T}\right]_{ij} + \left\langle \frac{g''_i}{g'_i}(a_i)\, x_j \right\rangle.   (14)

Writing

    z_i = \phi_i(a_i) = \frac{g''_i}{g'_i}                          (15)

gives

    \frac{\partial H}{\partial W} = W^{-T} + \langle z x^T \rangle.  (16)

If a gradient-ascent method is applied, the estimates of W are updated according to $\Delta W = \eta\, \partial H/\partial W$ for some learning rate $\eta$. Bell & Sejnowski drop the expectation operator in order to perform a stochastic gradient descent to the minimum mutual information. Various modifications of this scheme, such as MacKay's covariant algorithm (MacKay 1996) and Amari's natural gradient scheme (Amari et al. 1996), enhance the convergence rate, but the basic ingredients remain the same.

Accelerated convergence can often be obtained by magnifying the role of the $\langle z x^T \rangle \approx T^{-1}\sum_t z x^T$ term in (16), which indicates the direction in which to change W to maximise the likelihood of the data (MacKay 1996). This can be achieved by, for example, replacing it with $\max(1, T-i)\, z x^T / T$, where i is the iteration number, so that in the early stages of learning the term carries weight $(T-i)$, but is weighted correctly after T iterations.

If one sacrifices the plausibility of a biological interpretation of the ICA algorithm, much more efficient optimisation of the unmixing matrix is possible. In particular, quasi-Newton methods, such as the BFGS scheme (Press et al. 1992), which approximate the Hessian $\partial^2 H / \partial W_{ij}\partial W_{kl}$, can speed up finding the unmixing matrix by at least an order of magnitude.
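To make the update concrete, here is a minimal numerical sketch of batch gradient ascent on the entropy of equation (16), using the common fixed nonlinearity φ(a) = −tanh(a). This is our own illustration rather than the authors' code; the data matrix X, the learning rate and the iteration count are placeholder choices.

```python
import numpy as np

def ica_gradient_ascent(X, n_iter=500, eta=0.01, seed=0):
    """Batch gradient ascent on the entropy, equation (16).

    X : (N, T) array of observations, one column per sample.
    Uses the fixed nonlinearity phi(a) = -tanh(a), which implicitly assumes
    heavy-tailed (1/cosh-like) marginal densities.
    """
    rng = np.random.default_rng(seed)
    N, T = X.shape
    W = np.eye(N) + 0.01 * rng.standard_normal((N, N))   # initial unmixing matrix
    for _ in range(n_iter):
        A = W @ X                                  # unmixed variables a(t)
        Z = -np.tanh(A)                            # z = phi(a)
        grad = np.linalg.inv(W).T + (Z @ X.T) / T  # equation (16)
        W += eta * grad                            # gradient-ascent step
    return W
```

The relative (natural) gradient and covariant schemes mentioned above amount to replacing this update direction by (I + ⟨z aᵀ⟩)W, which removes the matrix inversion; the ingredients are otherwise the same.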

Other approaches to ICA are possible: explicit manipulations with moments were the principal tool (see Comon, Jutten, and Herault (1991)) before Bell & Sejnowski's work. Cardoso (1997) and MacKay (1996) have each shown that the neuro-mimetic formulation is equivalent to a maximum likelihood approach, and Roberts (1998) has addressed ICA from a Bayesian point of view.

3 Likelihood Landscape

As previously noted, Cardoso and MacKay have each formulated the ICA problem in terms of maximum likelihood. MacKay in particular shows that the log likelihood for a single observation x(t) is given by

    \log P(x(t)|W) = \log|\det W| + \sum_m \log p_m(a_m(t)).        (17)

The normalised log likelihood for the entire set of observations is therefore

    \log L = \frac{1}{T}\sum_{t=1}^{T} \log P(x(t)|W)               (18)
           = \log|\det W| + \frac{1}{T}\sum_{t=1}^{T}\sum_m \log p_m(a_m(t))   (19)
           = \log|\det W| - \sum_m H_m(a_m),                        (20)

where

    H_m(a_m) = -\frac{1}{T}\sum_{t=1}^{T} \log p_m(a_m(t)) \approx -\int p_m(a_m)\log p_m(a_m)\, da_m   (21)

is an estimate of the marginal entropy of the mth unmixed variable.

Note also that the mutual information is given by

    I(a) = \int p(a)\log p(a)\, da + \sum_m H_m(a_m)                (22)
         = -\log|\det W| - H(x) + \sum_m H_m(a_m)                   (23)
         = -H(x) - \log L.                                          (24)

Thus the mutual information differs from the negative log likelihood only by a constant, the entropy H(x) of the data, so that hills in the log likelihood are valleys in the mutual information.

Equation (18) requires some sort of normalisation, since multiplication of W by a diagonal matrix does not change the mutual information but would permit the likelihood to take on arbitrary values. We therefore choose to normalise W so that the sum of the squares of the elements in each row is unity: $\sum_j W_{ij}^2 = 1$ for all i.

When only two sources are mixed, the row-normalised W may be parameterised by two angles:

    W = \begin{pmatrix} \cos\theta_1 & \sin\theta_1 \\ \cos\theta_2 & \sin\theta_2 \end{pmatrix}   (25)

and the likelihood plotted as a function of θ1 and θ2. Figure 1 shows the log likelihood for the mixture of a Gaussian and a bi-exponential source with $M = \begin{pmatrix} 2 & 1 \\ 3 & 1 \end{pmatrix}$. Also plotted are the constituent components of logL, namely log|det W| and Hm.
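The landscape of Figure 1 can be reproduced, at least qualitatively, by evaluating equation (20) on a grid of angles. The sketch below is our own code: it estimates the marginal entropies by histogramming the unmixed variables (the coarser of the two estimation methods mentioned below) and uses synthetic Gaussian and bi-exponential sources.

```python
import numpy as np

def marginal_entropy(a, bins=100):
    """Histogram estimate of the marginal entropy -E[log p(a)], equation (21)."""
    counts, edges = np.histogram(a, bins=bins, density=True)
    widths = np.diff(edges)
    mask = counts > 0
    return -np.sum(counts[mask] * np.log(counts[mask]) * widths[mask])

def log_likelihood_landscape(X, n_grid=181):
    """Normalised log likelihood (20) over row-normalised 2x2 matrices (25)."""
    thetas = np.linspace(-np.pi, np.pi, n_grid)
    L = np.full((n_grid, n_grid), -np.inf)
    for i, t1 in enumerate(thetas):
        for j, t2 in enumerate(thetas):
            W = np.array([[np.cos(t1), np.sin(t1)],
                          [np.cos(t2), np.sin(t2)]])
            det = abs(np.linalg.det(W))
            if det < 1e-6:            # theta1 = theta2: W is singular
                continue
            A = W @ X
            L[i, j] = np.log(det) - sum(marginal_entropy(a) for a in A)
    return thetas, L

# Gaussian and bi-exponential sources mixed with M = [[2, 1], [3, 1]]
rng = np.random.default_rng(0)
S = np.vstack([rng.standard_normal(5000), rng.laplace(size=5000)])
X = np.array([[2.0, 1.0], [3.0, 1.0]]) @ S
thetas, L = log_likelihood_landscape(X)
```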

Figure 1: Likelihood landscape for a mixture of bi-exponential and Gaussian sources. a: Log likelihood, logL, plotted as a function of θ1 and θ2. Dark gray indicates low-likelihood matrices and white indicates high-likelihood matrices. The maximum likelihood matrix (i.e., the ICA unmixing matrix) is indicated by the star. b: log|det W(θ1, θ2)|. c: Log likelihood along the "ridge" θ2 = const., passing through the maximum likelihood. d: The marginal entropy, Hm(am) = h(θm).

Here the entropies were calculated by modelling the marginal densities with a generalised exponential (see below) and numerical quadrature, but histogramming the a(t) and applying numerical quadrature gives very similar, though coarser, results.

Various features deserve comment.

1. When θ1 = θ2, W is not of full rank, so log|det W| and hence logL are singular.

2. Symmetries. Clearly logL is doubly periodic in θ1 and θ2. Additional symmetries are conferred by the facts that

   (a) log|det W| is symmetric in the line θ1 = θ2;

   (b) Hm(am) depends only on the angle, and not on the particular m; that is, Hm(am) may be written as h(θm) for some function h, which depends, of course, on the data x(t). Consequently

       \log L = \log|\det W| - \sum_m h(\theta_m).                  (26)

   h(θ) is graphed in Figure 1d for the Gaussian/bi-exponential example. Analogous symmetries are retained in higher-dimensional examples.

3. The maximum likelihood is achieved for several (θ1, θ2) related by symmetry, one instance of which is marked by a star in the figure. The maximum likelihood lies on a ridge with steep sides and a flat top. Figure 1c shows a section along the ridge. The rapid convergence of ICA algorithms is probably due to the ease of ascending the sides of the ridge; arriving at the very best solution requires a lot of extra work.

Note, however, that this picture gives a slightly distorted view of the likelihood landscape faced by learning algorithms, because they generally work in terms of the full matrix W rather than with the row-normalised form.

3.1 Mixture of images

As a more realistic example, Figure 2 shows the likelihood landscape for a pair of images mixed with the mixing matrix $M = \begin{pmatrix} 0.7 & 0.3 \\ 0.55 & 0.45 \end{pmatrix}$. Since the distributions of pixel values in the images are certainly not unimodal, the marginal entropies were calculated by histogramming the am and numerical quadrature.

The overall likelihood is remarkably similar in structure to the bi-exponential/Gaussian mixture shown above. The principal difference is that the top of the ridge is now bumpy, and gradient-based algorithms may get stuck at a local maximum, as illustrated in Figure 3. The imperfect unmixing by the matrix at a local maximum is evident as the ghost of Einstein haunting the house. Unmixing by the maximum likelihood matrix is not quite perfect because the source images are not in fact independent; indeed the correlation matrix is $\langle s s^T \rangle = \begin{pmatrix} 1.0000 & -0.2354 \\ -0.2354 & 1.0000 \end{pmatrix}$.

Figure 2: Likelihood landscape for a mixture of two images. a: Log likelihood, logL, plotted as a function of θ1 and θ2. Dark gray indicates low-likelihood matrices and white indicates high-likelihood matrices. The maximum likelihood matrix is indicated by the star, and M⁻¹ is indicated by the square. Note that symmetry means that an equivalent maximum likelihood matrix is almost coincident with M⁻¹. Crosses indicate the trajectory of estimates of W by the relative gradient algorithm, starting from W0 = I. b: The marginal entropy, Hm(am) = h(θm). c: Log likelihood along the "ridge" θ2 = const., passing through the maximum likelihood.

Figure 3: Unmixing of images by maximum likelihood and sub-optimal matrices. Top row: images unmixed by a matrix at a local maximum (logL = −0.0611). The trajectory followed by the relative gradient algorithm that arrived at this local maximum is shown in Figure 2. Bottom row: images unmixed by the maximum likelihood matrix (logL = 0.9252).

4 Choice of squashing function

The algorithm, as outlined above, leaves open the choice of the "squashing functions" gm, whose role is to map the transformed variables am into a space in which their marginal densities are uniform. It should be pointed out that what is actually needed are the functions φm(am) rather than the gm themselves.

If the marginal densities are known it is, in theory, a simple matter to find the appropriate squashing function, since the cumulative marginal density maps into a space in which the density is uniform. That is,

    g(a) = P(a) = \int_{-\infty}^{a} p(x)\, dx,                     (27)

where the subscripts m have been dropped for ease of notation. Combining equations (27) and (15) gives alternative forms for the ideal φ:

    \phi(a) = \frac{\partial p}{\partial P}(a) = \frac{\partial \log p}{\partial a}(a) = \frac{p'(a)}{p(a)}.   (28)

In practice, however, the marginal densities are not known. Bell & Sejnowski recognised this and investigated a number of forms for g and hence φ. Current folklore maintains that so long as the marginal densities are heavy-tailed (leptokurtic) almost any function that squashes will do; the generalised sigmoid function and the negative hyperbolic tangent are common choices. Solving (15) with φ(a) = −tanh(a) shows that using the hyperbolic tangent is equivalent to assuming p(a) = 1/(π cosh(a)).

4.1 Learning the nonlinearity

In section 3 it was argued that, by multiplying the unmixing matrix by a diagonal matrix D, the likelihood could be made to take on arbitrary values without changing the mutual information between the unmixed variables. The estimated mutual information is not, however, independent of D if the unmixed densities appearing in (19) are not the true source densities. This is precisely the case faced by learning algorithms using an a priori fixed marginal density. As Figure 4 shows, the likelihood landscape for row-normalised unmixing matrices using p(a) = 1/(π cosh(a)) is similar in form to the likelihood shown in Figure 1, though the ridges are not so sharp and the maximum likelihood is only −3.9975, which is to be compared with the true maximum likelihood of −3.1178.

Multiplying the row-normalised unmixing matrix by a diagonal matrix D opens up the possibility of better fitting the unmixed densities to 1/(π cosh(a)). Choosing W* to be the row-normalised M⁻¹, Figure 4b shows the log likelihood of DW* as a function of D. The maximum log likelihood of −3.1881 is achieved for $D = \begin{pmatrix} 1.67 & 0 \\ 0 & 5.094 \end{pmatrix}$.
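This correspondence is easy to verify symbolically. The short check below is our own (using sympy): it confirms that the score of p(a) = 1/(π cosh a), computed through equation (28), is exactly −tanh(a).

```python
import sympy as sp

a = sp.symbols('a', real=True)

# Density implicitly assumed by the fixed nonlinearity phi(a) = -tanh(a);
# it is normalised, since the integral of sech over the real line is pi.
p = 1 / (sp.pi * sp.cosh(a))

# Equation (28): the ideal score function is phi(a) = p'(a)/p(a) = d(log p)/da
phi = sp.diff(sp.log(p), a)

# The difference from -tanh(a) simplifies to zero, confirming the equivalence
print(sp.simplify(phi - (-sp.tanh(a))))    # 0
```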

Figure 4: Likelihood for a bi-exponential/Gaussian mixture, assuming 1/cosh sources. a: Normalised log likelihood plotted in the space of two-dimensional, row-normalised unmixing matrices. M⁻¹ is marked with a star. b: The log likelihood plotted as a function of the elements D11 and D22 of a diagonal matrix multiplying the row-normalised maximum likelihood matrix. Since the log likelihood becomes very large and negative for large Dmm, the gray scale is −log10(|logL|). The axes are labelled with log10 Dmm. c: Section, D11 = const., through the maximum in (a). d: Convergence of the diagonal elements of D as W is found by the relative gradient algorithm.

In fact, by adjusting the overall scaling of each row of W, ICA algorithms are "learning the nonlinearity". Conceptually we may think of the diagonal terms being incorporated into the nonlinearity as adjustable parameters, which are learned along with the row-normalised unmixing matrix. Let $W = D\bar{W}$, where D is diagonal and $\bar{W}$ is row-normalised. Then, in this example, we may think of the marginal densities as being modelled by $1/\cosh(a_m/D_{mm})$, with $a_m(t) = \bar{W}_m x(t)$. Figure 4d shows the convergence of D as W is located for the bi-exponential/Gaussian mixture using the relative gradient algorithm. The component for which D tends to approximately 5 is the unmixed Gaussian component, while the component for which D tends to approximately 1.67 is the bi-exponential component.

The manner in which the nonlinearity is learned can be seen by noting that the weights are adjusted so that the variance of the unmixed Gaussian component is small, while the width of the unmixed bi-exponential component remains relatively large. This means that the Gaussian component really only "feels" the linear part of tanh close to the origin, and direct substitution in (15) shows that φ(a) = −a is the correct nonlinearity for a Gaussian distribution. On the other hand, the unmixed bi-exponential component sees the nonlinearity more like a step function, which is appropriate for a bi-exponential density.

Densities with tails lighter than Gaussian (i.e., platykurtic densities) require a sub-linear φ, and it might be expected that the −tanh(a) nonlinearity would be unable to cope with such marginal densities. Indeed, with φ = −tanh, the relative gradient, covariant and BFGS variations of the ICA algorithm all fail to separate a mixture of a uniform source, a Gaussian source and a bi-exponential source.

This point of view gives a partial explanation for the spectacular ability of ICA algorithms to separate sources with different heavy-tailed densities using a "single" nonlinearity, and their inability to unmix light-tailed sources (such as uniform densities).

4.2 Adapting φ

By refining the estimate of φ as the calculation proceeds one might expect to be able to estimate a more accurate W and to be able to separate sources with different densities. In order to address this point we have investigated a number of methods of estimating φ(a) from the T instances of a(t), t = 1, ..., T.

4.2.1 Non-parametric methods

Cumulative density: Since the cumulative density P(a) is easily calculated by sorting the T values of a, a straightforward estimate of φ is obtained from φ(a) = ∂p/∂P(a). Two numerical differentiations are required, however: one to find p = ∂P/∂a and another to estimate ∂p/∂P. Without considerable smoothing the estimated φ is far too noisy to be of practical value.

Kernel density estimators: The marginal densities may also be estimated non-parametrically with a kernel density estimator (Silverman 1986; Wand and Jones 1995), after which φ follows from differentiating log p.

The kernel density estimate, p*, of p is given by

    p^*(a) = \frac{1}{T}\sum_{t=1}^{T} \frac{1}{h} K\!\left(\frac{a - a(t)}{h}\right).   (29)

Convolution with the (non-negative) kernel K provides a degree of smoothing and extends the estimate of p to regions where there are no data points. Common choices for the kernel, which should integrate to 1, are: the boxcar function, in which case p* is the conventional histogram estimate; the Gaussian kernel; and the Epanechnikov kernel

    K(x) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - x^2/5\right) & -\sqrt{5} \le x \le \sqrt{5} \\ 0 & \text{otherwise,} \end{cases}   (30)

which can be shown to yield the approximate minimum mean squared error in p* (Silverman 1986). We have used the Epanechnikov kernel for this reason, and because its finite support makes it computationally less expensive than the Gaussian kernel.
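For reference, here is a minimal sketch of this approach (our own code, not the paper's): an Epanechnikov kernel density estimate p*(a), followed by a numerical differentiation of log p* to estimate φ. The bandwidth h is a placeholder and, as discussed next, its choice is the weak point of the method.

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel of equation (30); integrates to one."""
    k = 3.0 / (4.0 * np.sqrt(5.0)) * (1.0 - x ** 2 / 5.0)
    return np.where(np.abs(x) <= np.sqrt(5.0), k, 0.0)

def kde(a_grid, samples, h):
    """Kernel density estimate p*(a) of equation (29) on a grid of points."""
    u = (a_grid[:, None] - samples[None, :]) / h
    return epanechnikov(u).mean(axis=1) / h

def phi_estimate(a_grid, samples, h):
    """phi(a) = d(log p)/da, estimated by differencing the KDE."""
    p = kde(a_grid, samples, h) + 1e-12        # guard against log(0)
    return np.gradient(np.log(p), a_grid)

# Example: for bi-exponential samples the estimate should be roughly -sign(a)
rng = np.random.default_rng(3)
samples = rng.laplace(size=2000)
a_grid = np.linspace(-4, 4, 201)
phi = phi_estimate(a_grid, samples, h=0.5)
```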

The choice of the bandwidth h, which sets the degree of smoothing, is however something of a black art. Unfortunately the maximum likelihood estimate of h is zero; that is, the best estimate of the empirical density is T Dirac delta functions located at the observations! Clearly too little smoothing yields poor estimates of φ, which involves a further numerical differentiation, while over-smoothing misses the very features of p which we are trying to estimate. Numerical experiments show that the requirement of calculating φ necessitates a considerably larger h than the conventional rule of thumb h = 1.06 σ T^{-1/5}. Even with larger h, estimates of p and φ in the sparse tails are poor, and the method appears to offer no advantage over fixed-φ methods, at least for problems in which the source densities are unimodal.

Adaptive kernel density estimators, in which h is tailored to the local p*, appear at first sight to be attractive. Silverman (1986) outlines schemes in which h is adapted to be small where an initial rough estimate of p* is large, but larger where p* is smaller and the data points are sparse. Unfortunately, this scheme does not converge to a realistic estimate of p when applied iteratively: it really needs human supervision and is therefore unsuitable for incorporation into an iterative method for ICA.

It is possible to make better kernel density estimates of p, and therefore of φ, if constraints are placed on φ (for example, that φ be a monotonically decaying function). Such constraints, however, limit the densities which may be modelled and move the method into the realm of parametric models.

4.2.2 Parametric methods

All of the fixed nonlinearities implicitly model a marginal density, which is discovered by solving (15) for p. In particular, φ = −tanh(a) assumes that p(a) ∝ 1/cosh(a).

Hyperbolic tangent: MacKay (1996) has suggested generalising the usual φ = −tanh(a) to use a gain β; that is, φ(a) = −tanh(βa). Solving (15) for p shows that this models distributions of the form p(a) ∝ 1/[cosh(βa)]^{1/β}. As β → 0, p(a) approaches a Gaussian density with variance 1/β, while for large β, φ is suited to the bi-exponential distribution.

MacKay (1996) has also suggested using the Student-t distributions.

Generalised exponential: The hyperbolic tangent with gain can model Gaussian densities and densities with heavier tails than Gaussian; however, it cannot adequately model densities with tails that are lighter than Gaussian. To provide a little more flexibility than the hyperbolic tangent with gain, we have used the generalised exponential distribution

    p(a \mid \beta, R) = \frac{R\,\beta^{1/R}}{2\,\Gamma(1/R)} \exp\{-\beta|a|^R\}.   (31)

The width of the distribution is set by 1/β, while the weight of its tails is determined by R. Clearly p is Gaussian when R = 2, bi-exponential when R = 1, and the uniform distribution is approximated in the limit R → ∞.

Rather than learn R and β along with the elements of W, which magnifies the size of the search space, they may be calculated for each am at any, and perhaps every, stage of learning. For T values of a, the normalised log likelihood is

    L = \frac{1}{T}\sum_{t=1}^{T} \log p(a(t) \mid \beta, R) = \log\frac{R\,\beta^{1/R}}{2\,\Gamma(1/R)} - \frac{1}{T}\sum_{t=1}^{T} \beta|a(t)|^R.   (32)


Setting ∂L/∂β = 0 yields

    \frac{1}{\beta} = \frac{R}{T}\sum_t |a(t)|^R,                   (33)

so R can be found by solving the one-dimensional problem

    T\frac{dL}{dR} = \frac{T}{R} + \frac{T}{R^2}\left\{\psi(1/R) - \log\beta\right\} + \frac{T}{R\beta}\frac{\partial\beta}{\partial R} - \frac{\partial\beta}{\partial R}\sum_t |a(t)|^R - \beta\sum_t |a(t)|^R \log|a(t)| = 0,   (34)

where ψ(x) = Γ'(x)/Γ(x) is the digamma function and

    \frac{\partial\beta}{\partial R} = -\frac{T}{R\sum_t |a(t)|^R}\left(\frac{1}{R} + \frac{\sum_t |a(t)|^R \log|a(t)|}{\sum_t |a(t)|^R}\right).   (35)

Since d²L/dR² < 0, this is readily and robustly accomplished.

This parametric model, like the hyperbolic tangent, assumes that the marginal densities are unimodal and symmetric about the mean.

Figure 5: Separating a mixture of uniform, Gaussian and bi-exponential sources. a: Likelihood logL of the unmixing matrix plotted against iteration. b: Fidelity of the unmixing, ε(WM), plotted against iteration. c: Estimates of Rm for the unmixed variables. R describes the power of the generalised exponential: the two lower curves converge to approximately 1 and 2, describing the separated bi-exponential and Gaussian components, while the upper curve (limited to 25 for numerical reasons) describes the unmixed uniform source.

4.3 Example

We have implemented an adaptive ICA algorithm using the generalised exponential to model the marginal densities. Schemes based on the relative gradient algorithm and the BFGS method have been used, but the quasi-Newton scheme is much more efficient and we discuss that here. The BFGS scheme minimises −logL (see equation (18)). At each stage of the minimisation the parameters Rm and βm, describing the distribution of the mth unmixed variable, were calculated. With these on hand, −logL can be calculated by numerical integration of the marginal entropies (21) and the gradient found from

    -\frac{\partial \log L}{\partial W} = -W^{-T} - \langle z x^T \rangle,   (36)

where zm = φ(am | βm, Rm) is evaluated using the generalised exponential. Note that (36) assumes that R and β are fixed, though in fact they depend on Wij because they are evaluated from am(t) = Wmk xk(t). In practice this leads to small errors in the gradient (largest at the beginning of the optimisation, before R and β have reached their final values), to which the quasi-Newton scheme is tolerant.
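A compact way to carry out this per-component fit is to maximise the profile likelihood over R, using equation (33) to eliminate β. The sketch below is our own code (scipy's bounded scalar minimiser stands in for the root-finding of equation (34)); it also provides the score φ(a|β,R) = −βR|a|^{R−1} sign(a), obtained from (31) via (28), which is needed for z in equation (36).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def genexp_fit(a, r_max=25.0):
    """Fit the generalised exponential (31) to samples a by maximum likelihood.

    beta is eliminated with equation (33); the remaining one-dimensional
    profile likelihood is maximised numerically over R.
    """
    a = np.abs(np.asarray(a, dtype=float)) + 1e-12   # only |a| is needed; avoid 0
    T = a.size

    def beta_of(R):
        return T / (R * np.sum(a ** R))              # equation (33)

    def neg_profile_loglik(R):
        beta = beta_of(R)
        # normalised log likelihood, equation (32)
        L = (np.log(R) + np.log(beta) / R - np.log(2.0) - gammaln(1.0 / R)
             - beta * np.mean(a ** R))
        return -L

    res = minimize_scalar(neg_profile_loglik, bounds=(0.1, r_max), method='bounded')
    R = res.x
    return beta_of(R), R

def genexp_score(a, beta, R):
    """phi(a | beta, R) = d(log p)/da for the generalised exponential (31)."""
    return -beta * R * np.abs(a) ** (R - 1.0) * np.sign(a)

# Example: R is recovered as roughly 1 for bi-exponential (Laplacian) samples
rng = np.random.default_rng(1)
beta, R = genexp_fit(rng.laplace(size=5000))
```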

Two measures were used to assess the scheme's performance: first, the log likelihood (18) was calculated; the second measures how well W approximates M⁻¹. Recall that changes of scale in the am and permutation of the order of the unmixed variables do not affect the mutual information, so rather than WM = I we expect

    WM = PD                                                         (37)

for some diagonal matrix D and permutation matrix P. Under the Frobenius norm, the nearest diagonal matrix to any given matrix A is just its diagonal, diag(A). Consequently the error in W may be assessed by

    \epsilon(MW) = \epsilon(WM) = \min_P \frac{\|WMP - \operatorname{diag}(WMP)\|}{\|WM\|},   (38)

where the minimum is taken over all permutation matrices P. Of course, when the sources are independent ε(WM) should be zero, though when they are not independent the maximum likelihood unmixing matrix may not correspond to ε(WM) = 0.

Figure 5 shows the progress of the scheme in separating a bi-exponential source, s1(t), a uniformly distributed source, s2(t), and a Gaussian source, s3(t), mixed with

    M = \begin{pmatrix} 0.2519 & 0.0513 & 0.0771 \\ 0.5174 & 0.6309 & 0.4572 \\ 0.1225 & 0.6074 & 0.4971 \end{pmatrix}.   (39)

There were T = 1000 observations. The log likelihood and ε(WM) show that the generalised exponential adaptive algorithm (unlike the φ = −tanh algorithm) succeeds in separating the sources.

4.4 Mixtures of Gaussians

ICA folklore maintains that the algorithm cannot separate mixtures of Gaussian sources. This, however, is not the case. Figure 6 shows the log likelihood for two unit-variance Gaussian sources mixed with the same mixing matrix $M = \begin{pmatrix} 2 & 1 \\ 3 & 1 \end{pmatrix}$. The overall structure is very similar to the non-Gaussian case (Figure 1), and there is, unsurprisingly, a global maximum in the same place as in the non-Gaussian examples. There is certainly a gradient to ascend, and the relative gradient algorithm and MacKay's covariant scheme, as well as quasi-Newton schemes, all successfully locate the maximum. It is interesting that the maximum likelihood unmixing matrix is not equivalent to the PCA unmixing matrix, even with Gaussian sources. PCA produces orthogonal basis vectors for the dataset, whereas ICA finds basis vectors that correspond to M⁻¹. When the mixing matrix is merely a rotation (i.e., M is orthogonal), the PCA matrix has the same likelihood as the ICA matrix; however, this is a consequence of the orthogonality of M, not of the fact that the sources are Gaussian. If the joint density of the data is a multivariate Gaussian, the ICA and PCA unmixing matrices coincide; in this case the data may be thought of as having been produced by an orthogonal mixing of univariate Gaussian sources.

The appropriate nonlinearity for a mixture of Gaussians is actually linear, φ(a) = −a, and the accelerated relative gradient algorithm (see section 2) with this φ is able to locate the correct unmixing matrix (Figure 6), although convergence is very slow.
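For a modest number of sources the minimisation over permutation matrices in equation (38) can be done exhaustively. A short sketch with our own naming:

```python
import numpy as np
from itertools import permutations

def unmixing_error(W, M):
    """Fidelity measure of equation (38): how close WM is to PD."""
    A = W @ M
    m = A.shape[1]
    best = np.inf
    for perm in permutations(range(m)):
        P = np.eye(m)[:, perm]                       # permutation matrix
        AP = A @ P
        err = np.linalg.norm(AP - np.diag(np.diag(AP)), 'fro')
        best = min(best, err)
    return best / np.linalg.norm(A, 'fro')
```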

Figure 6: Likelihood landscape for a mixture of Gaussian sources. Dark gray indicates low-likelihood matrices and white indicates high-likelihood matrices. The inverse of the mixing matrix is marked by the square and the PCA unmixing matrix is indicated by the star. The trajectory of the accelerated relative gradient algorithm with the φ(a) = −tanh(a) nonlinearity is indicated by triangles, while the trajectory of the relative gradient algorithm with the φ(a) = −a "nonlinearity" is marked by crosses. Note that symmetry means that M⁻¹ and the matrix located by the relative gradient algorithms are equivalent.

5 Decorrelating matrices

If an unmixing matrix can be found, the unmixed variables are, by definition, independent. One consequence is that the cross-correlation between any pair of unmixed variables is zero:

    \langle a_n a_m \rangle \equiv \frac{1}{T}\sum_{t=1}^{T} a_m(t) a_n(t) = \frac{1}{T}(a_m, a_n)_t = \delta_{mn} d_n,   (40)

where (·,·)_t denotes the inner product with respect to t, and dn is a scale factor.

Since all the unmixed variables are pairwise decorrelated we may write

    A A^T = D^2,                                                    (41)

where A is the matrix whose mth row is am(t) and D is a diagonal matrix of scaling factors. We will say that a decorrelating matrix for data X is a matrix which, when applied to X, leaves the rows of A uncorrelated.

Equation (41) comprises M(M−1)/2 relations which must be satisfied if W is to be a decorrelating matrix. (There are only M(M−1)/2 relations rather than M² because (1) AAᵀ is symmetric, so demanding that [D²]ij = 0 (i ≠ j) is equivalent to requiring that [D²]ji = 0; and (2) the diagonal elements of D are not specified: we are only demanding that cross-correlations are zero.)

Clearly there are many decorrelating matrices, of which the ICA unmixing matrix is just one, and we mention a few others below. The decorrelating matrices comprise an M(M+1)/2-dimensional manifold in the MN-dimensional space of possible unmixing matrices, and we may seek the ICA unmixing matrix on this manifold.

If W is a decorrelating matrix, we have

    A A^T = W X X^T W^T = D^2,                                      (42)

and if none of the rows of A is identically zero,

    D^{-1} W X X^T W^T D^{-1} = I_M.                                (43)

Now, if $Q \in \mathbb{R}^{M \times M}$ is a real orthogonal matrix and $\bar{D}$ another diagonal matrix,

    \bar{D} Q D^{-1} W X X^T W^T D^{-1} Q^T \bar{D} = \bar{D}^2,    (44)

so $\bar{D} Q D^{-1} W$ is also a decorrelating matrix.

Note that the matrix D⁻¹W not only decorrelates, but makes the rows of A orthonormal. It is straightforward to produce a matrix which does this. Let

    X = U \Sigma V^T                                                (45)

be a singular value decomposition of the data matrix X = [x(1), x(2), ..., x(T)]; $U \in \mathbb{R}^{M \times M}$ and $V \in \mathbb{R}^{T \times T}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{M \times T}$ is a matrix with the singular values, σi > 0, arranged along the leading diagonal and zeros elsewhere. Then let W0 = Σ⁻¹Uᵀ. Clearly the rows of W0X = Vᵀ are orthonormal, so the class of decorrelating matrices is characterised as

    W = D Q W_0 = D Q \Sigma^{-1} U^T.                              (46)

The columns of U are the familiar principal components of principal components analysis, and Σ⁻¹UᵀX is the PCA representation of the data X, but normalised or "whitened" so that the variance of the data projected onto each principal component is unity.

The manifold of decorrelating matrices is seen to be M(M+1)/2-dimensional: it is the Cartesian product of the M-dimensional manifold D of scaling matrices and the M(M−1)/2-dimensional manifold Q of orthogonal matrices.
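The transformation onto the decorrelating manifold is therefore just the whitening step. A brief sketch (our own code) computes W0 = Σ⁻¹Uᵀ from an SVD of the data matrix and checks that the rows of W0X are orthonormal; any decorrelating matrix then has the form DQW0 of equation (46).

```python
import numpy as np

def whitening_matrix(X):
    """W0 = Sigma^{-1} U^T from an SVD of the data matrix X (one column per sample)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
    return np.diag(1.0 / s) @ U.T

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 1000))
W0 = whitening_matrix(X)
A = W0 @ X
print(np.allclose(A @ A.T, np.eye(3)))   # True: rows of W0 X are orthonormal
```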

Explicit coordinates on Q are given by

    Q = e^S,                                                        (47)

where S is an anti-symmetric matrix (Sᵀ = −S). Each of the above-diagonal elements of S may be used as a coordinate for Q.

Particularly well-known decorrelating matrices (Bell and Sejnowski 1997; Penev and Atick 1996) are:

PCA: Q = I and D = Σ. In this case W simply produces the principal components representation. The columns of U form a new orthogonal basis for the data, and the mean squared projection onto the mth coordinate is σm². The PCA solution holds a special position among decorrelating transforms because it simultaneously finds orthonormal bases for both the row (V) and column (U) spaces of X. Viewed in these bases, the data are decomposed into a sum of products which are linearly decorrelated in both space and time. The demand by ICA of independence in time, rather than just linear decorrelation, can only be achieved by sacrificing orthogonality between the elements of the spatial basis, i.e., the rows of W.

ZCA: Q = U and D = I. Bell and Sejnowski (1997) call decorrelation with the symmetrical decorrelating matrix, Wᵀ = W, zero-phase components analysis. Unlike PCA, whose basis functions are global, ZCA basis functions are local and whiten each row of WX so that it has unit variance.

ICA: In the sense that it is neither local nor global, ICA is intermediate between ZCA and PCA. No general analytic form for Q and D can be given, and the optimum Q must be sought by minimising equation (2) (the value of D is immaterial since I(Da) = I(a)). It is important to note that if the optimal W is found within the space of decorrelating matrices it may not minimise the mutual information, which also depends on higher moments, as well as some other W which does not yield an exact linear decorrelation.

When M = 2 the manifold of decorrelating matrices is 3-dimensional, since two parameters are required to specify D and a single angle parameterises Q. Since multiplication by a diagonal matrix does not change the decorrelation, D is relatively unimportant, and the manifold of row-normalised decorrelating matrices (which lies in D × Q) may be plotted on the likelihood landscape; this has been done for the Gaussian/bi-exponential example in Figure 7. Also plotted on the figure are the locations of the orthogonal matrices corresponding to the PCA and ZCA decorrelating matrices. The manifold consists of two non-intersecting leaves, corresponding to det Q = ±1, which run along the tops of the ridges in the likelihood. Figure 8 shows the likelihood and the error in inverting the mixing matrix as the det Q = +1 leaf is traversed. Table 1 gives the likelihoods and errors in inverting the mixing matrix for PCA, ZCA and ICA.

            ε(WM)     logL
    PCA     0.3599    -3.1385
    ZCA     0.3198    -3.1502
    ICA     0.1073    -3.1210

Table 1: PCA, ZCA and ICA errors in inverting the mixing matrix (equation (38)) and logL, the log likelihood of the unmixing matrix (equation (18)).

This characterisation of the decorrelating matrices does not assume that the number of observation sequences N is equal to the assumed number of sources M, and it is interesting to observe that if M < N the reduction in dimension from x to a is accomplished by $\Sigma^{-1} U_M^T \in \mathbb{R}^{M \times N}$, where UM consists of the first M columns of U. This is the transformation onto the decorrelating manifold and is the same irrespective of whether the final result is PCA, ZCA or ICA.
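The parameterisation (47) is straightforward to use in practice: an orthogonal Q is generated from the M(M−1)/2 above-diagonal entries of an antisymmetric S by the matrix exponential. A minimal sketch with our own naming, using scipy's expm:

```python
import numpy as np
from scipy.linalg import expm

def orthogonal_from_params(s_params, M):
    """Build Q = exp(S) from the M(M-1)/2 above-diagonal elements of an
    antisymmetric matrix S (equation (47))."""
    S = np.zeros((M, M))
    iu = np.triu_indices(M, k=1)
    S[iu] = s_params
    S -= S.T                        # make S antisymmetric: S^T = -S
    return expm(S)

Q = orthogonal_from_params([0.3, -1.2, 0.7], M=3)
print(np.allclose(Q @ Q.T, np.eye(3)))   # True: Q is orthogonal
```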

Figure 7: The manifold of (row-normalised) decorrelating matrices plotted on the likelihood function for the mixture of Gaussian and bi-exponential sources. Leaves of the manifold corresponding to det Q = +1 and det Q = −1 are shown as solid and dashed lines respectively. The symbols mark the locations of the decorrelating matrices corresponding to PCA, ZCA and ICA.

Figure 8: Likelihood and errors in inverting the mixing matrix as the det Q = +1 leaf of the decorrelating manifold is traversed. The symbols mark the locations of the decorrelating matrices corresponding to PCA, ZCA and ICA.

An important characteristic of the PCA basis (the columns of U) is that it minimises reconstruction error. A vector x is approximated by x̃ by projecting x onto the first M columns of U:

    \tilde{x} = U_M U_M^T x,                                        (48)

where UM denotes the first M columns of U. The mean squared approximation error

    \epsilon_M^{\mathrm{PCA}} = \langle \|x - \tilde{x}\|^2 \rangle   (49)

is minimised amongst all linear bases by the PCA basis for any M. Indeed the PCA decomposition is easily derived by minimising this error functional with the additional constraint that the columns of UM are orthonormal. It is a surprising fact that this minimum reconstruction error property is shared by all the decorrelating matrices, and in particular by the (non-orthogonal) ICA basis which is formed by the rows of W. This fact is easily seen by noting that the approximation in terms of M ICA basis functions is

    \tilde{x} = W^\dagger W x,                                      (50)

where the pseudo-inverse of W is

    W^\dagger = U_M \Sigma Q^T D^{-1}.                              (51)

The approximation error is therefore

    \epsilon_M^{\mathrm{ICA}} = \langle \|x - W^\dagger W x\|^2 \rangle   (52)
                              = \langle \|x - (U_M \Sigma Q^T D^{-1})(D Q \Sigma^{-1} U_M^T) x\|^2 \rangle   (53)
                              = \langle \|x - U_M U_M^T x\|^2 \rangle   (54)
                              = \epsilon_M^{\mathrm{PCA}}.          (55)

Penev and Atick (1996) have also noticed this property in connection with local feature analysis.

5.1 Algorithms

Here we examine algorithms which seek to minimise the mutual information using an unmixing matrix W which is drawn from the class of linearly decorrelating matrices D × Q and therefore has the form of equation (46). Since I(Da) = I(a) for any diagonal D, at first sight it appears that we may choose D = I_M. However, as the discussion in section 4.1 points out, the elements of D serve as adjustable parameters tuning a model marginal density to the densities generated by the am. If a "fixed" nonlinearity is to be used, it is therefore crucial to permit D to vary and to seek W on the full manifold of decorrelating matrices.

A straightforward algorithm is to use one of the popular minimisation schemes (Bell and Sejnowski 1995; Amari et al. 1996; MacKay 1996) to take one or several steps towards the minimum, and then to replace the current estimate of W with the nearest decorrelating matrix.

Finding the nearest decorrelating matrix requires finding the D and Q that minimise

    \|W - D Q W_0\|^2.                                              (56)

When D = I_M (i.e., when an adaptive φ is being used) this is a simple case of the matrix Procrustes problem (Golub and Van Loan 1983; Horn and Johnson 1985). The minimising Q is the orthogonal polar factor of W W0ᵀ. That is, if W W0ᵀ = Y S Zᵀ is an SVD of W W0ᵀ, then Q = Y Zᵀ. When D ≠ I_M, (56) must be minimised numerically to find D and Q (Everson 1998).

This scheme permits estimates of W to leave the decorrelating manifold, because derivatives are taken in the full space of M × M matrices.
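For the case D = I_M the projection step just described is a one-line Procrustes solution. A sketch (our own code) that replaces the current estimate W by the nearest matrix of the form QW0:

```python
import numpy as np

def nearest_decorrelating(W, W0):
    """Project W onto the decorrelating manifold with D = I: find the
    orthogonal Q minimising ||W - Q W0|| and return Q W0.

    Q is the orthogonal polar factor of W W0^T (equation (56)).
    """
    Y, s, Zt = np.linalg.svd(W @ W0.T)
    Q = Y @ Zt
    return Q @ W0
```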

It might be anticipated that a more efficient algorithm would be one which constrains W to remain on the manifold of decorrelating matrices, and we now examine algorithms which enforce this constraint.

5.1.1 Optimising on the decorrelating manifold

When the marginal densities are modelled with an adaptive nonlinearity, D may be held constant and the unmixing matrix sought on Q, using the parameterisation (47); however, with fixed nonlinearities it is essential to allow D to vary. In this case the optimum is sought in terms of the M(M−1)/2 above-diagonal elements of S and the M elements of D.

Optimisation schemes perform best if the independent variables have approximately equal magnitude. To ensure the correct scaling we write

    D = \Sigma \tilde{D}                                            (57)

and optimise the likelihood with respect to the elements of D̃ (which are O(1)) along with Spq. An initial pre-processing step is to transform the data into the whitened PCA coordinates; thus

    \bar{X} = \Sigma^{-1} U^T X.                                    (58)

The normalised log likelihood is

    \log L(\bar{X} \mid \tilde{D}, Q) = \log|\det \Sigma\tilde{D}Q| + \left\langle \sum_m \log p_m(a_m(t)) \right\rangle.   (59)

The gradient of logL with respect to D̃ is

    \frac{\partial \log L}{\partial \tilde{D}_i} = \tilde{D}_i^{-1} + \left\langle \phi_i(a_i)\,\sigma_i \sum_j Q_{ij}\bar{x}_j(t) \right\rangle   (60)

and

    \frac{\partial \log L}{\partial Q_{ij}} = \langle \phi_i(a_i) D_i \bar{x}_j(t) \rangle \equiv Z_{ij}.   (61)

Using the parameterisation (47), equation (61), and

    \frac{\partial Q_{ij}}{\partial S_{pq}} = Q_{ip}\delta_{qj} - Q_{iq}\delta_{pj} + Q_{qj}\delta_{pi} - Q_{pj}\delta_{qi},   (62)

the gradient of logL with respect to the above-diagonal elements Spq (p < q ≤ M) of the anti-symmetric matrix is given by

    \frac{\partial \log L}{\partial S_{pq}} = -Q_{kp}Z_{kq} + Q_{kq}Z_{kp} - Q_{qk}Z_{pk} + Q_{pk}Z_{qk}   (63)

(summation on repeated indices). With the gradient on hand, gradient descent or, more efficiently, quasi-Newton schemes may be used.

When the nonlinearity is adapted to the unmixed marginal densities, one simply sets D̃ = I_M in (59) and (61), and the optimisation is conducted in the M(M−1)/2-dimensional manifold Q. Clearly, a natural starting guess for W is the PCA unmixing matrix, given by S = 0, D̃ = I_M.

Finding the ICA unmixing matrix on the manifold of decorrelating matrices has a number of advantages.

1. The unmixing matrix is guaranteed to be linearly decorrelating.

2. The optimum unmixing matrix is sought in the M(M−1)/2-dimensional (or, if fixed nonlinearities are used, M(M+1)/2-dimensional) space of decorrelating matrices rather than in

the full M²-dimensional space of order-M matrices. For large problems, and especially if the Hessian matrix is being used, this provides considerable computational savings in locating the optimum matrix.

3. The scaling matrix D, which does not provide any additional information, is effectively removed from the numerical solution.

A potentially serious disadvantage is that with small amounts of data the optimum matrix on Q may not coincide with the maximum likelihood ICA solution, because an unmixing matrix which does not produce exactly linear decorrelation may more effectively minimise the mutual information. Of course, with sufficient data, linear decorrelation is a necessary condition for independence and the optimum ICA matrix will lie on the decorrelating manifold. Nonetheless, the decorrelating matrix is generally very close to the optimum matrix and provides a good starting point from which to find it.

6 Rogues Gallery

The hypothesis that human faces are composed from a small admixture of canonical or basis faces has inspired much research in the pattern recognition (Kirby and Sirovich 1990; Atick et al. 1995) and psychological (O'Toole et al. 1991; O'Toole et al. 1991) communities. Much of this work has focused on eigenfaces, which are the principal components of an ensemble of faces and are therefore mutually orthogonal. As an application of our adaptive ICA algorithm on the decorrelating manifold we have computed the independent components for an ensemble of faces. The model we have in mind is that a particular face, x, is an admixture of M basis functions, the coefficients of the admixture being drawn from M independent sources s. If the ensemble of faces is subjected to ICA, the rows of the unmixing matrix are estimates of the basis functions, which (unlike the eigenfaces) need not be orthogonal.

There were 143 clean-shaven, male Caucasian faces in the original ensemble, but the ensemble was augmented by the reflection of each face in its midline to make 286 faces in all. The mean face was subtracted from each face of the ensemble before ICA. Independent components were estimated using a quasi-Newton scheme on the decorrelating manifold with generalised exponential modelling of the source densities. Since the likelihood surface has many local maxima, the optimisation was run repeatedly (M + 1 times for M assumed sources), each run starting from a different initial decorrelating matrix. The PCA unmixing matrix was always included among the initial conditions, and it was found that this initial matrix always led to the ICA unmixing matrix with the highest likelihood. It was also always the case that the ICA unmixing matrix had a higher likelihood than the PCA unmixing matrix. We remark that an adaptive optimisation scheme using our generalised exponential approach was essential: several of the unmixed variables had densities with tails lighter than Gaussian.

Principal components are naturally ordered by the associated singular value, σm, which measures the standard deviation of the projection of the data onto the mth principal component: σm² = ⟨(umᵀx)²⟩. In an analogous manner we may order the ICA basis vectors by the scaling matrix D. Hereafter we assume that the unmixing matrix is row-normalised, and denote the ICA basis vectors by wm. Then Dm² = ⟨(wmᵀx)²⟩ measures the mean squared projection of the data onto the mth normalised ICA basis vector. We then order the wm according to Dm, starting with the largest.

Figure 9 shows the first eight ICA basis vectors together with the first eight principal components from the same dataset.
The wm were calculated with M = 20, i.e., it was assumed that there are 20 significant ICA basis vectors. Although independent components have to be recalculated for each M, we have found a fair degree of concurrence between the basis vectors calculated for different M. For example, Figure 10a shows the matrix of inner products between the basis vectors calculated with M = 10 and M = 20, i.e., |W⁽²⁰⁾W⁽¹⁰⁾ᵀ|. The large elements on or close to the diagonal and the small elements in the lower half of the matrix indicate that the basis vectors retain their identities as the assumed number of sources increases.
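Both the ordering by Dm and the cross-model comparison of Figure 10 are immediate once the row-normalised unmixing matrices are on hand. In the sketch below (our own, with hypothetical names) W20 and W10 stand for unmixing matrices estimated with M = 20 and M = 10, and X for the mean-removed ensemble.

```python
import numpy as np

def order_basis_vectors(W, X):
    """Order the rows of a row-normalised unmixing matrix by the mean squared
    projection D_m^2 = <(w_m^T x)^2> of the data onto each basis vector."""
    D2 = np.mean((W @ X) ** 2, axis=1)
    order = np.argsort(D2)[::-1]               # largest projection first
    return W[order], np.sqrt(D2[order])

# Concurrence between models with different numbers of assumed sources,
# as in Figure 10: inner products |W20 @ W10.T|
# concurrence = np.abs(W20 @ W10.T)
```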

Figure 9: Independent basis functions and principal components of faces. a: The first 8 independent component basis faces, wm, from an M = 20 ICA of the faces ensemble. b: The first 8 principal components, um, from the same ensemble.

Figure 10: The matrix |W⁽²⁰⁾W⁽¹⁰⁾ᵀ| showing the inner products between the independent component basis vectors for M = 10 and M = 20 assumed sources.

0 5 10 15 20 0

5

10

15

20

0.2 0.4 0.6 0.8Figure 11: The matrix jW (20)UT20j showing the inner products between the independent components basisvectors (M = 20) and the �rst 20 principal components.23

Seeking the ICA unmixing matrix on the decorrelating manifold naturally incorporates the case in which there are more observations, N, than sources, M. However, in common with other authors, we note that the real cocktail party problem, separating many voices from few observations, remains to be solved (for machines).

Finally, independent components analysis depends on minimising the mutual information between the unmixed variables, which is identical to minimising the Kullback-Leibler divergence between the joint density p(a) and the product of the marginal densities ∏m pm(am). The Kullback-Leibler divergence is one of many measures of disparity between densities (see, for example, Basseville (1989)), and one might well consider using a different one. Particularly attractive is the Hellinger distance, which is a metric and not just a divergence. When an unmixing matrix which makes the mutual information zero can be found, the Hellinger distance is also zero. However, when some residual dependence between the unmixed variables remains, these various divergences will vary in their estimate of the best unmixing matrix.

Acknowledgements

We are grateful for discussions with Will Penny and Iead Rezek, and we thank Michael Kirby and Larry Sirovich for supplying the rogues gallery ensemble. Part of this research was supported by funding from British Aerospace, to whom the authors are most grateful.

References

Amari, S., A. Cichocki, and H. Yang (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, Cambridge, MA, pp. 757-763. MIT Press.

Atick, J., P. Griffin, and A. Redlich (1995). Statistical Approach to Shape from Shading: Reconstruction of 3D Face Surfaces from Single 2D Images. Neural Computation. (To appear).

Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication. MIT Press.

Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing 18, 349-369.

Bell, A. and T. Sejnowski (1995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7(6), 1129-1159.

Bell, A. and T. Sejnowski (1997). The "Independent Components" of Natural Scenes are Edge Filters. Vision Research 37(23), 3327-3338.

Cardoso, J.-F. (1997). Infomax and Maximum Likelihood for Blind Separation. IEEE Signal Processing Letters 4(4), 112-114.

Cardoso, J.-F. and B. Laheld (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing 45(2), 434-444.

Comon, P., C. Jutten, and J. Herault (1991). Blind Separation of Sources. 2. Problems Statement. Signal Processing 24(1), 11-20.

Everson, R. (1998). Orthogonal, but not orthonormal, Procrustes problems. Advances in Computational Mathematics. (Submitted). Available from http://www.ee.ic.ac.uk/research/neural/everson.

Golub, G. and C. Van Loan (1983). Matrix Computations. Oxford: North Oxford Academic.

Horn, R. and C. Johnson (1985). Matrix Analysis. Cambridge University Press.

Kirby, M. and L. Sirovich (1990). Application of the Karhunen-Loève Procedure for the Characterization of Human Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 103-108.

Lee, T.-W., M. Girolami, A. Bell, and T. Sejnowski (1998). A Unifying Information-theoretic Framework for Independent Component Analysis. International Journal on Mathematical and Computer Modeling. (In press). Available from http://www.cnl.salk.edu/tewon/Public/mcm.ps.gz.

MacKay, D. (1996, December). Maximum Likelihood and Covariant Algorithms for Independent Component Analysis. Technical report, University of Cambridge. Available from http://wol.ra.phy.cam.ac.uk/mackay/.

Makeig, S., T.-P. Jung, A. Bell, D. Ghahremani, and T. Sejnowski (1997). Transiently Time-locked fMRI Activations Revealed by Independent Components Analysis. Proceedings of the National Academy of Sciences 95, 803-810.

Makeig, S., A. Bell, T.-P. Jung, and T. Sejnowski (1996). Independent Component Analysis of Electroencephalographic Data. In Advances in Neural Information Processing Systems, Volume 8. Cambridge, MA: MIT Press.

O'Toole, A., H. Abdi, K. Deffenbacher, and J. Bartlett (1991). Classifying Faces by Race and Sex Using an Autoassociative Memory Trained for Recognition. In K. Hammond and D. Gentner (Eds.), Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, pp. 847-851.

O'Toole, A., K. Deffenbacher, H. Abdi, and J. Bartlett (1991). Simulating the "Other-Race Effect" As a Problem in Perceptual Learning. Connection Science 3(2), 163-178.

Penev, P. and J. Atick (1996). Local Feature Analysis: A general statistical theory for object representation. Network: Computation in Neural Systems 7(3), 477-500.

Pham, D. (1996). Blind Separation of Instantaneous Mixture of Sources via an Independent Component Analysis. IEEE Transactions on Signal Processing 44(11), 2668-2779.

Press, W., S. Teukolsky, W. Vetterling, and B. Flannery (1992). Numerical Recipes in C (2nd ed.). Cambridge: Cambridge University Press.

Roberts, S. (1998). Independent Components Analysis: Source Assessment and Separation, a Bayesian Approach. (In press). Available from http://www.ee.ic.ac.uk/hp/staff/sroberts.html.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Number 26 in Monographs on Statistics and Applied Probability. London: Chapman and Hall.

Wand, M. and M. Jones (1995). Kernel Smoothing. Number 60 in Monographs on Statistics and Applied Probability. London: Chapman and Hall.