clase08-mlss05au hyvarinen ica 02


TRANSCRIPT

  • Slide 1/87

    Independent Component Analysis

    Aapo Hyvarinen

    HIIT Basic Research Unit

    University of Helsinki, Finland

    http://www.cs.helsinki.fi/aapo.hyvarinen/

  • Slide 2/87

    Blind source separation

    Four source signals:

    [Figure: the four source signals.]

    Due to some external circumstances, only linear mixtures of the source signals are observed.

    [Figure: the four observed mixture signals.]

    Estimate (separate) the original signals!

  • Slide 3/87

    Solution by independence

    Use only information on statistical independence to recover:

    [Figure: the four recovered signals.]

    These are the independent components!

  • Slide 4/87

    Independent Component Analysis.

    (Herault and Jutten, 1984-1991)

    The observed random vector x is modelled by a linear latent variable model

    $x_i = \sum_{j=1}^{m} a_{ij}\, s_j, \qquad i = 1, \ldots, n$   (1)

    or in matrix form:

    x = As   (2)

    where

    the mixing matrix A is constant (a parameter matrix), and

    the s_i are latent random variables called the independent components.

    Estimate both A and s, observing only x.
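
    As a concrete illustration of equation (2), here is a minimal NumPy sketch (not from the slides; all names and the choice of source densities are illustrative) that generates two nongaussian sources, mixes them with an unknown square A, and leaves only x observable:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    T = 10000                                         # number of observations

    # Two nongaussian unit-variance sources: Laplacian and uniform.
    s = np.vstack([
        rng.laplace(scale=1 / np.sqrt(2), size=T),    # supergaussian
        rng.uniform(-np.sqrt(3), np.sqrt(3), size=T), # subgaussian
    ])

    A = rng.normal(size=(2, 2))                       # unknown square mixing matrix
    x = A @ s                                         # observed mixtures: x = A s

    # The ICA problem: given only x, estimate both A and s.
    print(x.shape)                                    # (2, 10000)
    ```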

  • Slide 5/87

    Basic properties of the ICA model

    Must assume:
    The s_i are mutually independent.
    The s_i are nongaussian.
    For simplicity: the matrix A is square.

    The s_i are defined only up to a multiplicative constant.
    The s_i are not ordered.

  • Slide 6/87

    ICA and decorrelation

    First approach: decorrelate variables.

    Whitening or sphering: decorrelate and normalize so that

    E{x x^T} = I

    Simple to do by an eigenvalue decomposition of the covariance matrix.

    But: decorrelation uses only the correlation matrix (about n^2/2 equations), while A has n^2 elements.

    Not enough information!
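
    A minimal sketch of whitening by eigenvalue decomposition of the covariance matrix (illustrative code, not from the slides; x is assumed to be an (n, T) data matrix):

    ```python
    import numpy as np

    def whiten(x):
        """Whiten an (n, T) data matrix so that E{z z^T} = I."""
        x = x - x.mean(axis=1, keepdims=True)    # center the data
        C = np.cov(x)                            # covariance matrix E{x x^T}
        d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
        V = E @ np.diag(d ** -0.5) @ E.T         # whitening matrix V = E D^{-1/2} E^T
        return V @ x, V

    # check: covariance of z is (close to) the identity
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 3)) @ rng.laplace(size=(3, 5000))
    z, V = whiten(x)
    print(np.round(np.cov(z), 2))
    ```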

  • Slide 7/87

    Independence is better

    Fortunately, independence is stronger than uncorrelatedness. For independent variables we have

    $E\{h_1(y_1)\, h_2(y_2)\} - E\{h_1(y_1)\}\, E\{h_2(y_2)\} = 0$   (3)

    Still, decorrelation (whitening) is usually done before ICA for various technical reasons.

    For example: after decorrelation and standardization, A can be considered orthogonal.

    Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data.

  • Slide 8/87

    Illustration of whitening

    Two ICs with uniform distributions:

    [Figure: scatter plots of the latent variables, the observed variables, and the whitened variables.]

    Original variables, observed mixtures, whitened mixtures.

    Cf. gaussian density: symmetric in all directions.

  • Slide 9/87

    Basic intuitive principle of ICA estimation.

    (Sloppy version of) the Central Limit Theorem (Donoho, 1982):

    Consider a linear combination w^T x = q^T s.
    A sum q_i s_i + q_j s_j is more gaussian than s_i alone.
    By maximizing the nongaussianity of q^T s, we can find s_i.

    Also known as projection pursuit.

  • Slide 10/87

    Marginal and joint densities, uniform distributions.

    Marginal and joint densities, whitened mixtures of uniform ICs

  • Slide 11/87

    Marginal and joint densities, supergaussian distributions.

    Whitened mixtures of supergaussian ICs

  • Slide 12/87

    Kurtosis as nongaussianity measure.

    Problem: how to measure nongaussianity?

    Definition:

    $\mathrm{kurt}(x) = E\{x^4\} - 3\,(E\{x^2\})^2$   (4)

    If the variance is constrained to unity, this is essentially the 4th moment.

    Simple algebraic properties because it is a cumulant (for independent s_1, s_2 and a scalar α):

    $\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2)$   (5)

    $\mathrm{kurt}(\alpha s_1) = \alpha^4\, \mathrm{kurt}(s_1)$   (6)

    Zero for a gaussian RV, non-zero for most nongaussian RVs.
    Positive vs. negative kurtosis correspond to typical forms of pdf.
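
    A small sketch of the sample version of equation (4), checked on gaussian, Laplacian (supergaussian) and uniform (subgaussian) samples (illustrative code, not from the slides):

    ```python
    import numpy as np

    def kurt(x):
        """kurt(x) = E{x^4} - 3 (E{x^2})^2, estimated on a zero-mean sample."""
        x = x - x.mean()
        return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

    rng = np.random.default_rng(0)
    n = 100000
    print(kurt(rng.normal(size=n)))          # close to 0 (gaussian)
    print(kurt(rng.laplace(size=n)))         # > 0 (supergaussian)
    print(kurt(rng.uniform(-1, 1, size=n)))  # < 0 (subgaussian)
    ```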

  • Slide 13/87

    Left: Laplacian pdf, positive kurtosis (supergaussian). Right: Uniform pdf, negative kurtosis (subgaussian).

  • Slide 14/87

    The extrema of kurtosis

    By the properties of kurtosis:

    $\mathrm{kurt}(w^T x) = \mathrm{kurt}(q^T s) = q_1^4\, \mathrm{kurt}(s_1) + q_2^4\, \mathrm{kurt}(s_2)$   (7)

    Constrain the variance to equal unity:

    $E\{(w^T x)^2\} = E\{(q^T s)^2\} = q_1^2 + q_2^2 = 1$   (8)

    For simplicity, consider kurtoses equal to one.
    The maxima of kurtosis give the independent components (see figure).
    General result: the absolute value of kurtosis is maximized by the s_i

    (Delfosse and Loubaton, 1995).

    Note: extrema are orthogonal due to whitening.

  • Slide 15/87

    Optimization landscape for kurtosis. Thick curve is unit sphere, thin

    curves are contours where kurtosis is constant.

  • Slide 16/87

    [Figure: kurtosis as a function of the angle of w.]

    Kurtosis as a function of the direction of projection. For positive kurtosis,

    kurtosis (and its absolute value) are maximized in the directions of the

    independent components.

  • Slide 17/87

    [Figure: kurtosis as a function of the angle of w.]

    Case of negative kurtosis. Kurtosis is minimized, and its absolute value

    maximized, in the directions of the independent components.

  • Slide 18/87

    Basic ICA estimation procedure

    1. Whiten the data to give z.
    2. Set the iteration count i = 1.
    3. Take a random vector w_i.
    4. Maximize the nongaussianity of w_i^T z, under the constraints ||w_i||^2 = 1 and w_i^T w_j = 0 for j < i.
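
    A minimal sketch of this deflation scheme, using kurtosis as the nongaussianity measure and a plain gradient step (illustrative, not the exact algorithm from the slides; z is assumed whitened, of shape (n, T), and the step size is an arbitrary choice):

    ```python
    import numpy as np

    def deflation_ica(z, n_components, n_iter=200, step=0.1, seed=0):
        """Estimate ICs one by one by maximizing |kurtosis| of w^T z,
        under ||w|| = 1 and orthogonality to the previously found vectors."""
        rng = np.random.default_rng(seed)
        n, T = z.shape
        W = np.zeros((n_components, n))
        for i in range(n_components):
            w = rng.normal(size=n)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                y = w @ z
                k = np.mean(y ** 4) - 3.0                   # kurtosis (unit variance)
                grad = (z * y ** 3).mean(axis=1) - 3.0 * w  # gradient of kurtosis
                w = w + step * np.sign(k) * grad            # ascend |kurtosis|
                w -= W[:i].T @ (W[:i] @ w)                  # enforce w_i^T w_j = 0, j < i
                w /= np.linalg.norm(w)                      # enforce ||w_i|| = 1
            W[i] = w
        return W   # estimated separating vectors; sources are W @ z
    ```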

  • Slide 19/87

    Why kurtosis is not optimal

    Sensitive to outliers: consider a sample of 1000 values with unit variance, and one value equal to 10.

    The kurtosis then equals at least 10^4/1000 - 3 = 7.

    For supergaussian variables, the statistical performance is not optimal even without outliers.

    Other measures of nongaussianity should be considered.

  • Slide 20/87

    Differential entropy as nongaussianity measure

    Generalization of ordinary discrete Shannon entropy:

    $H(x) = -E\{\log p(x)\}$   (9)

    For fixed variance, it is maximized by the gaussian distribution.

    Often normalized to give the negentropy

    $J(x) = H(x_{\mathrm{gauss}}) - H(x)$   (10)

    Good statistical properties, but computationally difficult.

  • Slide 21/87

    Approximation of negentropy

    Approximations of negentropy (Hyvarinen, 1998):

    $J_G(x) = (E\{G(x)\} - E\{G(x_{\mathrm{gauss}})\})^2$   (11)

    where G is a nonquadratic function.

    Generalization of the (square of) kurtosis (which is G(x) = x^4). A good compromise?

    statistical properties not bad (for suitable choice of G)

    computationally simple

    Further possibility: Skewness (for nonsymmetric ICs)
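
    A small sketch of the approximation (11) with the common choice G(x) = log cosh(x), compared on a gaussian and a Laplacian sample (illustrative, not from the slides; the E{G(x_gauss)} term is estimated by Monte Carlo):

    ```python
    import numpy as np

    def negentropy_approx(x, n_gauss=100000, seed=0):
        """J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 with G(x) = log cosh(x).
        x is standardized to zero mean and unit variance first."""
        rng = np.random.default_rng(seed)
        G = lambda u: np.log(np.cosh(u))
        x = (x - x.mean()) / x.std()
        g_gauss = G(rng.normal(size=n_gauss)).mean()   # E{G(nu)}, nu ~ N(0, 1)
        return (G(x).mean() - g_gauss) ** 2

    rng = np.random.default_rng(1)
    print(negentropy_approx(rng.normal(size=50000)))    # close to 0
    print(negentropy_approx(rng.laplace(size=50000)))   # clearly > 0
    ```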

  • Slide 22/87

    Information-theoretic approach.

    (Comon 1994)

    Mutual information of y = (y_1, ..., y_n)^T:

    $I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y)$   (12)

    where H is the differential entropy.

    A measure of the redundancy of y. Equals zero iff the y_i are independent.

    For y = Wx, we obtain

    $I(y) = \sum_{i=1}^{n} H(y_i) - \log|\det W| + \mathrm{const.}$   (13)

  • Slide 23/87

    Mutual information and nongaussianity

    If W is constrained to be orthogonal (whitened data):

    $I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) + \mathrm{const.}$   (14)

    Sum of nongaussianities! (though opposite sign)

    Rigorous derivation of maximization of nongaussianities.

  • Slide 24/87

    Maximum likelihood estimation.

    (Pham et al, 1992)

    Log-likelihood of the model (with W = A^{-1}):

    $L = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_{s_i}(w_i^T x(t)) + T \log|\det W|$   (15)

    Equivalent to the infomax approach in neural networks.

    Needs estimates of the p_{s_i}, but these need not be exact at all. Roughly: consistent if p_{s_i} is of the right type (sub- or supergaussian).

    Very similar to mutual information:

    $I(y) = -E\Big\{\sum_{i=1}^{n} \log p_{y_i}(y_i)\Big\} - \log|\det W| + C$   (16)

  • Slide 25/87

    Overview of ICA estimation principles.

    Most approaches can be interpreted as maximizing the nongaussianity of the ICs.

    Basic choice: the nonquadratic function in the nongaussianity measure:

    kurtosis: fourth power
    entropy/likelihood: log of density
    approx. of entropy: G(s) = log cosh(s), or others.

    One-by-one estimation vs. estimation of the whole model.
    Estimates constrained to be white vs. no constraint.

  • Slide 26/87

    Algorithms (1). Adaptive gradient methods

    Gradient methods for one-by-one estimation are straightforward.

    Stochastic gradient ascent of the likelihood (Bell-Sejnowski 1995):

    $\Delta W \propto (W^{-1})^T + g(Wx)\, x^T$   (17)

    with g = (log p_s)'. Problem: needs a matrix inversion!

    Better: natural/relative gradient ascent of the likelihood (Amari et al, 1996; Cardoso and Laheld, 1994):

    $\Delta W \propto [I + g(y)\, y^T]\, W$   (18)

    with y = Wx. Obtained by multiplying the gradient by W^T W.
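
    A minimal batch sketch of the natural/relative-gradient update (18), with the nonlinearity g(y) = -tanh(y), a choice suited to supergaussian sources (illustrative code, not the authors' implementation; the step size mu and the initialization are assumptions):

    ```python
    import numpy as np

    def natural_gradient_ica(x, n_iter=300, mu=0.1, seed=0):
        """Batch natural-gradient ascent: W <- W + mu * [I + g(y) y^T] W,
        with g(y) = -tanh(y).  x: (n, T) array of zero-mean mixtures."""
        rng = np.random.default_rng(seed)
        n, T = x.shape
        W = np.eye(n) + 0.1 * rng.normal(size=(n, n))
        for _ in range(n_iter):
            y = W @ x
            g = -np.tanh(y)
            # expectation approximated by the sample average over the batch
            W = W + mu * (np.eye(n) + (g @ y.T) / T) @ W
        return W    # y = W @ x estimates the independent components
    ```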

  • Slide 27/87

    Algorithms (2). The FastICA fixed-point algorithm

    (Hyvarinen 1997,1999)

    An approximate Newton method in block (batch) mode.

    No matrix inversion, but still quadratic (or cubic) convergence.
    No parameters to be tuned.

    For a single IC (whitened data):

    w ← E{x g(w^T x)} − E{g'(w^T x)} w,   then normalize w

    where g is the derivative of G.

    For the likelihood:

    W ← W + D_1 [D_2 + E{g(y) y^T}] W,   then orthonormalize W

    (with D_1, D_2 diagonal matrices).
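
    A sketch of the one-unit FastICA fixed-point iteration for whitened data z, with G(u) = log cosh(u), so g = tanh and g' = 1 - tanh^2 (illustrative implementation; the convergence test and tolerances are assumptions):

    ```python
    import numpy as np

    def fastica_one_unit(z, max_iter=100, tol=1e-6, seed=0):
        """One-unit FastICA on whitened data z (shape (n, T)):
        w <- E{z g(w^T z)} - E{g'(w^T z)} w, then normalize w."""
        rng = np.random.default_rng(seed)
        n, T = z.shape
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ z
            g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
            w_new = (z * g).mean(axis=1) - g_prime.mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < tol:   # converged (up to sign)
                return w_new
            w = w_new
        return w          # one row of the separating matrix; the IC is w @ z
    ```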

  • Slide 28/87

    [Figure: kurtosis as a function of iteration count.]

    Convergence of FastICA. Vectors after 1 and 2 iterations, values of

    kurtosis.

  • Slide 29/87

    [Figure: kurtosis as a function of iteration count.]

    Convergence of FastICA (2). Vectors after 1 and 2 iterations, values of

    kurtosis.

  • Slide 30/87

    Relations to other methods (1): Projection pursuit (Friedman and Tukey, 1974; Huber, 1985)

    Projection pursuit is a method for visualization and exploratory data analysis.

    Attempts to show the clustering structure of the data by finding interesting projections.

    PCA is not designed to find clustering structure.

    Interestingness is usually measured by nongaussianity.

    For example, bimodal distributions are very nongaussian.

  • Slide 31/87

    Illustration of projection pursuit. The projection pursuit direction is

    horizontal, the principal component vertical.

  • Slide 32/87

    Relations to other methods (2)

    Factor analysis: ICA is a nongaussian (usually noise-free) version.

    Blind deconvolution: obtained by constraining the mixing matrix.

    Principal component analysis: often the same applications, but very different statistical principles.

  • Slide 33/87

    Basic ICA estimation: conclusions

    ICA is very simple as a model:

    a linear nongaussian latent variable model.

    Estimation is not so simple, due to nongaussianity: objective functions cannot be quadratic.

    Estimation by maximizing the nongaussianity of the independent components.
    Equivalently (?), maximum likelihood or minimization of mutual information.

    Algorithms: adaptive (natural gradient) vs. block/batch mode (FastICA).

    Choice of nonlinearity: cubic (kurtosis) vs. non-polynomial functions

  • Slide 34/87

    Applications (1)

    Brain Imaging Data

  • Slide 35/87

    Main application areas so far:

    Audio noise cancelling (cocktail-party problem). Very difficult... mainly historical.

    Biomedical signals: Electro????grams

    Brain images

    Microarray data (gene expression)

    Vision modelling and image processing
    Econometric time series
    Telecommunications

  • Slide 36/87

    Brain imaging data analysis

    EEG, MEG: high temporal resolution.
    PET, fMRI: global activity maps.

    Huge amounts of data: need for neuroinformatics.
    Physiological models vs. unsupervised methods.

  • Slide 37/87

    Electric and magnetic fields

    [Figure: magnetic field and electric potential patterns over the scalp.]

    EEG and MEG measurements over the scalp. (From Vigario et al, 2000.)

    Many sources are mixed in the measurements.

  • Slide 38/87

    Magnetoencephalography

    [Figure: the Neuromag-122 device, with dewar, liquid helium, sensor array, and planar and axial gradiometers.]

    Neuromag-122 whole scalp magnetometer.

    (From Vigario et al, 2001.)

  • Slide 39/87

    Artefact removal from MEG

    [Figure: MEG channels recorded during saccades, blinking and biting.]

    A subset of 12 spontaneous MEG signals.

    (From Vigario et al, 1998.)

  • Slide 40/87

    [Figure: components IC1-IC9, 10 s of data.]

    Artefacts found from MEG data, using the FastICA algorithm. (From Vigario et al, 1998.)

  • Slide 41/87

    Analysis of evoked magnetic fields

    [Figure: evoked responses at channels MEG25, MEG83, MEG60 (MEG-L) and MEG10 (MEG-R), left and right side.]

    Averaged auditory evoked responses to 200 tones, using MEG. (From

    Vigario et al, 1998.)

  • Slide 42/87

    [Figure: (a) PC1-PC5 and (b) IC1-IC4.]

    Principal (a) and independent (b) components found from the auditory evoked field study. (From Vigario et al, 1998.)

  • Slide 43/87

    Applications (2)

    Image and Vision Modelling

  • Slide 44/87

    ICA and image data

    Models of image data are always useful.

    In computational neuroscience: evolution + development give optimal receptive fields.

    In image processing: essential for denoising, prediction, etc.

    ICA gives an interesting model
    (Olshausen and Field, 1996; Bell and Sejnowski, 1997).

    Important connection to sparse coding.

  • Slide 45/87

    Linear models of images.

    The observed variables x_i are gray-scale values of pixels in an image.
    Modelled by a linear latent variable model:

    $x = As = \sum_i a_i s_i$   (19)

    The columns a_i are called basis vectors.
    An image is a superposition of basis vectors.

    Well-known basis vector sets:
    Fourier analysis (sines, cosines)
    DCT
    Wavelets
    Gabor analysis

    What could be the best basis vectors?

  • Slide 46/87

    Some DCT (top) and wavelet (bottom) basis vectors

  • Slide 47/87

    What is sparseness?

    A form of nongaussianity (higher-order structure) often encountered in natural signals.

    The variable is active only rarely.

    [Figure: a gaussian signal and a sparse signal.]

  • Slide 48/87

    What is sparseness? (2)

    A random variable is sparse if its density has heavy tails and a peak at zero.

    Kurtosis is (strongly) positive, i.e. supergaussianity.

    Typical sparse pdf (Laplace):

    [Figure: Laplace density.]

    (dash-dot: Gaussian density)

  • Slide 49/87

    Linear Sparse Coding

    For a random vector x, find a linear representation

    x = As   (20)

    so that the components s_i are as sparse as possible.

    A given data point x(t) is represented using only a limited number of active components s_i.

    ICA is sparse coding since sparseness is supergaussianity.

    Sparse coding/ICA gives an optimal basis.

  • Slide 50/87

    ICA basis vectors of image windows.

  • Slide 51/87

    Why sparse coding?

    Good fit to V1 simple cell receptive fields (Van Hateren et al, 1998).
    Compress images: code only the nonzero components.
    In biological networks: saves energy.
    Internal model for recovery of structure.
    Denoising: use thresholding to leave only the components that are really active.

    Wavelet methods use the same principles.

  • Slide 52/87

    Sparse Coding: Denoising by Shrinkage

    (Hyvarinen, 1999)

    Assume the data is corrupted by additive white Gaussian noise n:

    x = As + n   (21)

    and constrain A to be orthogonal.

    Estimate s_i from x by the ML method, assuming the s_i to be independent:

    $\hat{s}_i = f(w_i^T x)$   (22)

    and reconstruct $\hat{x} = A\hat{s}$.

  • Slide 53/87

    Shrinkage nonlinearity as denoising

    E.g., if the s_i have a Laplace distribution, we have

    $f(u) = \mathrm{sign}(u)\, \max(0, |u| - \sqrt{2}\,\sigma^2)$   (23)

    where σ² is the noise variance.

  • Slide 54/87

    Sparse Code Shrinkage algorithm

    1. Estimate the sparse coding matrix W = A^{-1}, and the shrinkage nonlinearities f_i.

    2. Compute for each noisy observation x(t) the corresponding noisy sparse components w_i^T x(t).

    3. Reduce noise by applying the shrinkage nonlinearity f_i(.) on the noisy sparse components:
    $\hat{s}_i(t) = f_i(w_i^T x(t))$

    4. Invert the coding to obtain the estimate $\hat{x}(t) = W^T \hat{s}(t)$.

    Can be considered as an adaptive version of wavelet shrinkage (Donoho et al, 1995).
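
    A minimal sketch of steps 2-4 with an orthogonal sparse-coding matrix W and the Laplace shrinkage of equation (23) (illustrative code, not the authors' implementation; W is assumed to have been estimated already from noise-free data, and sigma2 is the assumed known noise variance):

    ```python
    import numpy as np

    def laplace_shrinkage(u, sigma2):
        """f(u) = sign(u) * max(0, |u| - sqrt(2) * sigma2), cf. eq. (23)."""
        return np.sign(u) * np.maximum(0.0, np.abs(u) - np.sqrt(2) * sigma2)

    def sparse_code_shrinkage(x, W, sigma2):
        """Denoise observations x (shape (n, T)) given an orthogonal W = A^{-1}."""
        u = W @ x                              # noisy sparse components w_i^T x(t)
        s_hat = laplace_shrinkage(u, sigma2)   # shrink each component
        return W.T @ s_hat                     # invert the coding: x_hat = W^T s_hat
    ```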

  • Slide 55/87

    Experiments on Sparse Code Shrinkage

    The input data x consisted of 8×8 windows from images.

    Basis vectors were estimated from noise-free images, using a modification of FastICA.

    Sparse code shrinkage was applied to sliding windows in the noisy images; averages of the 8×8 reconstructions were taken as the final reconstruction.

  • Slide 56/87

    Noise level: 0.3

    Left: Noisy image. Middle: Wiener filtered. Right: Sparse code shrinkage result.

  • Slide 57/87

    Conclusion: Image feature extraction by ICA

    ICA gives an interesting model for image data.
    Takes into account nongaussianity, here: sparseness.

    Performs sparse coding.
    Features related to Gabor functions, wavelets, V1 simple cells.

    Shrinkage denoising possible as with wavelets.

  • Slide 58/87

    Extensions:

    Subspace and topography formalisms

  • Slide 59/87

    Relaxing independence

    For most data sets, the estimated components are not very independent.

    In fact, independent components cannot be found in general.

    We attempt to model some of the remaining dependencies.
    Basic models group the components:

    Multidimensional ICA, and

    Independent Subspace Analysis.

  • Slide 60/87

    Multidimensional ICA (Cardoso 1998)

    One approach to relaxing independence: the s_i can be divided into n-tuples, such that

    the s_i inside a given n-tuple may be dependent on each other, but

    dependencies between different n-tuples are not allowed.

    Every n-tuple corresponds to a subspace.

  • Slide 61/87

    Invariant-feature subspaces (Kohonen 1996)

    Linear filters (like in ICA) necessarily lack any invariance.

    Invariant-feature subspaces are an abstract approach to representing invariant features.

    Principle: an invariant feature is a linear subspace in a feature space. The value of the invariant feature is given by the norm of the projection on that subspace:

    $\sum_{i=1}^{k} (w_i^T x)^2$   (24)
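
    A tiny sketch of the feature value in (24): the squared norm of the projection of x onto the subspace spanned by filters w_1, ..., w_k (illustrative code; the filter matrix and dimensions are made up):

    ```python
    import numpy as np

    def subspace_feature(W_sub, x):
        """Invariant feature value for one subspace:
        sum_{i=1}^k (w_i^T x)^2, with w_1..w_k stacked as rows of W_sub."""
        return np.sum((W_sub @ x) ** 2)

    # example: a 2-dimensional subspace in a 4-dimensional feature space
    rng = np.random.default_rng(0)
    W_sub = rng.normal(size=(2, 4))
    x = rng.normal(size=4)
    print(subspace_feature(W_sub, x))
    ```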

  • Slide 62/87

    Independent Subspace Analysis (Hyvarinen and Hoyer, 2000)

    Combination of multidimensional ICA and invariant-feature subspaces.

    The probability density inside each subspace is spherically symmetric, i.e. it depends only on the norm of the projection.

    Simplifies the model considerably.

    The nature of the invariant features is not specified.

  • Slide 63/87

    [Figure: network diagram of independent subspace analysis: the input passes through linear filters, whose outputs are squared, (.)², and summed within each subspace.]

  • Slide 64/87

    Application on image data

    Applied to image data, our model shows the emergence of complex-cell properties:

    We have phase invariance and some translation invariance, as well as orientation and frequency selectivity.

    Each subspace can be interpreted as a complex cell.
    Similar to energy models for complex cells (the norm is like a local energy).

  • Slide 65/87

    Independent Subspaces of natural image data.

  • Slide 66/87

    Independent Subspace Analysis: Conclusions

    A simple way of relaxing the independence constraint in ICA.
    Instead of scalar components, only subspaces are independent.

    Densities inside the subspaces are spherically symmetric.
    Can be interpreted as invariant-feature subspaces.

    When applied on image data, complex cell properties emerge.

  • Slide 67/87

    Problem: Dependencies still remain

    Linear decomposition often does not give independence, even for subspaces.

    Remaining dependencies could be visualized or otherwise utilized.

    Components can be decorrelated, so only higher-order correlations are interesting.

    How to visualize them? E.g. using topographic order

  • Slide 68/87

    Extending the model to include topography

    Instead of having unordered components,

    they are arranged on a two-dimensional lattice

    [Figure: components on a lattice; near-by components are dependent, distant ones independent.]

    The components are typically sparse, but not independent.
    Near-by components have higher-order correlations.

  • Slide 69/87

    Dependence through local variances

    Often encountered in e.g. image data

    Components are independent given their variances

    In our model, variances are not independent

    instead: correlated for near-by components

    e.g. generated by another ICA model, with topographic mixing

    [Figure: two regimes, independent vs. topographic variance dependence.]

  • Slide 70/87

    Two signals that are independent given their variances.

  • Slide 71/87

    Topographic ICA model (Hyvarinen et al, 2000)

    [Figure: generative graph of topographic ICA, with variance-generating variables u_i, variances σ_i, components s_i, mixing matrix A, and observations x_i.]

    Variance-generating variables u_i are generated randomly, and mixed linearly inside their topographic neighbourhoods. The mixtures are transformed using a nonlinearity, thus giving the variances σ_i of the s_i. Finally, ordinary linear mixing.

  • Slide 72/87

    Approximation of likelihood

    The likelihood of the model is intractable.

    Approximation:

    $\sum_{t=1}^{T} \sum_{j=1}^{n} G\Big(\sum_{i=1}^{n} h(i,j)\,(w_i^T x(t))^2\Big) + T \log|\det W|$   (25)

    where h(i, j) is a neighborhood function, and G a nonlinear function.

    Generalization of independent subspace analysis.
    A function of local energies only!

  • Slide 73/87

    Top-down modulated Hebbian learning

    Approximation of likelihood can be maximized by gradient ascent

    Learning rule:

    $\Delta w_i \propto E\{x\,(w_i^T x)\, r_i\} + \text{normalization} + \text{feedback}$   (26)

    where

    $r_i = \sum_{k=1}^{n} h(i,k)\, g\Big(\sum_{j=1}^{n} h(k,j)\,(w_j^T x)^2\Big)$   (27)

    Hebbian learning, with r_i a function of the outputs of higher-order (complex) cells.

  • Slide 74/87

    Topographic ICA of natural image data. Topographically ordered

    Gabor-like basis vectors for image patches.

  • Slide 75/87

    Independent subspace analysis and topographic ICA

    In ISA, single components are not independent, but subspaces are.

    In topographic ICA, dependencies modelled continuously. No strict division into subspaces.

    For image data, each neighbourhood is a complex cell. Local energies are their outputs.

    Topographic ICA is a generalization of ISA, incorporating the invariant-feature subspace principle as invariant-feature neighbourhoods.

  • Slide 76/87

    Topographic ICA: Conclusion

    A more sophisticated way of relaxing independence.

    Dependencies that cannot be cancelled by ICA define a similarity measure.

    A new principle for topographic mappings.
    Formulated as a modification of the ICA model.
    Approximation of the likelihood gives tractable algorithms.
    For image data, the topography is similar to V1.

  • Slide 77/87

    Using time dependencies

  • Slide 78/87

    Using autocorrelations for ICA estimation

    Take the basic linear mixture model

    x(t) = As(t) (28)

    Cannot be estimated in general (take gaussian RVs).
    Usually in ICA, we assume the s_i to be nongaussian:
    higher-order statistics provide the missing information.

    Alternatively: assume the s_i are time-dependent signals, and use time correlations to give more information.

    For example, a lagged covariance matrix

    $C_x^\tau = E\{x(t)\, x(t-\tau)^T\}$   (29)

    measures covariances of lagged signals.
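
    A small sketch of estimating the lagged covariance matrix (29) from a finite sample (illustrative code; x is an (n, T) array and tau a nonnegative integer lag):

    ```python
    import numpy as np

    def lagged_cov(x, tau):
        """C_x^tau = E{x(t) x(t - tau)^T}, estimated by a sample average."""
        x = x - x.mean(axis=1, keepdims=True)
        T = x.shape[1]
        return (x[:, tau:] @ x[:, : T - tau].T) / (T - tau)
    ```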

  • Slide 79/87

    The AMUSE algorithm for using autocorrelations

    (Tong et al, 1991; Molgedey and Schuster, 1994)

    Basic principle: decorrelate each signal in y = Wx with the other signals, lagged as well as not lagged.

    In other words: E{y_i(t) y_j(t − τ)} = 0 for all i ≠ j. To do this:

    1. Whiten the data to obtain z(t) = Vx(t).

    2. Find an orthogonal transformation W so that the lagged covariance matrix of y(t) = Wz(t) is diagonal.

    This is a matrix diagonalization problem,

    $C_x^\tau = E\{x(t)\, x(t-\tau)^T\} = E\{A s(t)\, s(t-\tau)^T A^T\} = A\, C_s^\tau A^T$

    of a (more or less) symmetric matrix.
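
    A compact sketch of this AMUSE-style procedure: whiten, then diagonalize a symmetrized lagged covariance with a single eigen-decomposition (illustrative code, not the original algorithm's implementation; symmetrization and the default lag tau=1 are assumptions):

    ```python
    import numpy as np

    def amuse(x, tau=1):
        """Separate x (shape (n, T)) by whitening and then rotating so that
        the symmetrized lagged covariance of y = W z is diagonal."""
        x = x - x.mean(axis=1, keepdims=True)
        T = x.shape[1]

        # 1. whiten: z = V x with E{z z^T} = I
        d, E = np.linalg.eigh(np.cov(x))
        V = E @ np.diag(d ** -0.5) @ E.T
        z = V @ x

        # 2. lagged covariance of z, symmetrized
        C = (z[:, tau:] @ z[:, : T - tau].T) / (T - tau)
        C = (C + C.T) / 2

        # 3. its eigenvectors give the orthogonal rotation (needs distinct eigenvalues)
        _, W = np.linalg.eigh(C)
        y = W.T @ z               # estimated sources
        return y, W.T @ V         # separating matrix mapping x to y
    ```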

  • Slide 80/87

    Pros and cons of separation by autocorrelations

    Very fast to compute: a single eigen-value decomposition, like PCA

    Can only separate ICs with different autocorrelations, because the lagged covariance matrix must have distinct eigenvalues.

    Some improvement can be achieved by using several lags in the algorithm (Belouchrani et al, 1997, SOBI).

    But if the signals have identical Fourier spectra, autocorrelations simply cannot separate them.

  • Slide 81/87

    Combining nongaussianity and autocorrelations

    Best results should be obtained by using these two kinds of information.

    E.g.: model the temporal structure of the signals with, e.g., ARMA models.

    A more general approach: minimize coding complexity.

    Find a decomposition y = Wx so that the y_i are easy to code. Rigorously defined by Kolmogorov complexity.

    Signals are easy to code if they are nongaussian and have time dependencies.

  • Slide 82/87

    Coding complexity as a general framework

    (Pajunen, 1998)

    For whitened data z, and an orthogonal W: minimize the sum of coding lengths of the y = Wz.

    If only marginal distributions are used,

    coding length is given by entropy, i.e. nongaussianity.

    If only autocorrelations are used, coding length is related to

    autocorrelations.

    Thus we have a generalization of both frameworks.

  • Slide 83/87

    Approximation of coding complexity

    The value of y(t) is predicted from the preceding values:

    $\hat{y}(t) = f(y(t-1), y(t-2), \ldots, y(1))$   (30)

    The residuals y(t) − ŷ(t) are coded independently of each other.

    The predictor could be linear. The coding length is approximated by the entropy of the residuals:

    $H(y - \hat{y})$   (31)

    Many other approximations can be developed.

  • Slide 84/87

    Estimation using variance nonstationarity

    (Matsuoka et al, 1995)

    An alternative to autocorrelations (and nongaussianity)

    Variance changes slowly over time

    [Figure: a signal whose variance changes slowly over time.]

    This gives enough information to estimate the model.

  • Slide 85/87

    Convolutive ICA

    Often the signals do not arrive at the same time at the sensors.

    There may be echoes as well (multi-path phenomena).

    Include convolution in the model:

    $x_i(t) = \sum_{j=1}^{n} a_{ij}(t) * s_j(t) = \sum_{j=1}^{n} \sum_{k} a_{ij}(k)\, s_j(t-k), \qquad i = 1, \ldots, n$   (32)

    In theory: Estimation by the same principles as ordinary ICA

    In practice: huge number of parameters since (de)convolving filters

    may be very long

    special methods may need to be used

  • Slide 86/87

    Final Summary

    ICA is a very simple model. Simplicity implies wide applicability.
    A nongaussian alternative to PCA or factor analysis.
    Decorrelation or whitening is only half of ICA.

    The other half uses the higher-order statistics of nongaussian variables
    (or alternatively: autocorrelations, variance nonstationarity, complexity).

    The basic principle is to find maximally nongaussian directions.
    Essentially equivalent to the maximum likelihood or information-theoretic formulations.

  • Slide 87/87

    Final Summary (2)

    Applications:

    Blind source separation: biomedical signals, econometrics, etc.
    Feature extraction: images, etc.
    Exploratory data analysis: like projection pursuit.
    New applications are coming all the time.

    Since dependencies cannot always be cancelled, subspace or topographic versions may be useful.

    Alternatively, separation is possible using time dependencies.

    Nongaussianity is beautiful!?
