clase08-mlss05au hyvarinen ica 02


TRANSCRIPT

  • Slide 1/87

    Independent Component Analysis

    Aapo Hyvarinen

    HIIT Basic Research Unit

    University of Helsinki, Finland

    http://www.cs.helsinki.fi/aapo.hyvarinen/

  • Slide 2/87

    Blind source separation

    Four source signals:

    [Figure: the four source signals.]

    Due to some external circumstances, only linear mixtures of the source signals are observed.

    [Figure: the four observed mixture signals.]

    Estimate (separate) the original signals!

  • Slide 3/87

    Solution by independence

    Use only information on statistical independence to recover:

    [Figure: the four recovered signals.]

    These are the independent components!

  • Slide 4/87

    Independent Component Analysis.

    (Herault and Jutten, 1984-1991)

    The observed random vector x is modelled by a linear latent variable model

    $x_i = \sum_{j=1}^{m} a_{ij}\, s_j, \qquad i = 1, \ldots, n$   (1)

    or in matrix form:

    x = As   (2)

    where

    the mixing matrix A is constant (a parameter matrix), and

    the s_i are latent random variables called the independent components.

    Estimate both A and s, observing only x.
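
    As a concrete illustration of equation (2), here is a minimal NumPy sketch (not from the slides; all names and the choice of source densities are illustrative) that generates two nongaussian sources, mixes them with an unknown square A, and leaves only x observable:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    T = 10000                                         # number of observations

    # Two nongaussian unit-variance sources: Laplacian and uniform.
    s = np.vstack([
        rng.laplace(scale=1 / np.sqrt(2), size=T),    # supergaussian
        rng.uniform(-np.sqrt(3), np.sqrt(3), size=T), # subgaussian
    ])

    A = rng.normal(size=(2, 2))                       # unknown square mixing matrix
    x = A @ s                                         # observed mixtures: x = A s

    # The ICA problem: given only x, estimate both A and s.
    print(x.shape)                                    # (2, 10000)
    ```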

  • Slide 5/87

    Basic properties of the ICA model

    Must assume:
    The s_i are mutually independent.
    The s_i are nongaussian.
    For simplicity: the matrix A is square.

    The s_i are defined only up to a multiplicative constant.
    The s_i are not ordered.

  • Slide 6/87

    ICA and decorrelation

    First approach: decorrelate variables.

    Whitening or sphering: decorrelate and normalize so that

    E{x x^T} = I

    Simple to do by an eigenvalue decomposition of the covariance matrix.

    But: decorrelation uses only the correlation matrix (about n^2/2 equations), while A has n^2 elements.

    Not enough information!
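
    A minimal sketch of whitening by eigenvalue decomposition of the covariance matrix (illustrative code, not from the slides; x is assumed to be an (n, T) data matrix):

    ```python
    import numpy as np

    def whiten(x):
        """Whiten an (n, T) data matrix so that E{z z^T} = I."""
        x = x - x.mean(axis=1, keepdims=True)    # center the data
        C = np.cov(x)                            # covariance matrix E{x x^T}
        d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
        V = E @ np.diag(d ** -0.5) @ E.T         # whitening matrix V = E D^{-1/2} E^T
        return V @ x, V

    # check: covariance of z is (close to) the identity
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 3)) @ rng.laplace(size=(3, 5000))
    z, V = whiten(x)
    print(np.round(np.cov(z), 2))
    ```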

  • Slide 7/87

    Independence is better

    Fortunately, independence is stronger than uncorrelatedness. For independent variables we have

    $E\{h_1(y_1)\, h_2(y_2)\} - E\{h_1(y_1)\}\, E\{h_2(y_2)\} = 0$   (3)

    Still, decorrelation (whitening) is usually done before ICA for various technical reasons.

    For example: after decorrelation and standardization, A can be considered orthogonal.

    Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data.

  • Slide 8/87

    Illustration of whitening

    Two ICs with uniform distributions:

    [Figure: scatter plots of the latent variables, the observed variables, and the whitened variables.]

    Original variables, observed mixtures, whitened mixtures.

    Cf. gaussian density: symmetric in all directions.

  • Slide 9/87

    Basic intuitive principle of ICA estimation.

    (Sloppy version of) the Central Limit Theorem (Donoho, 1982):

    Consider a linear combination w^T x = q^T s.
    A sum q_i s_i + q_j s_j is more gaussian than s_i alone.
    By maximizing the nongaussianity of q^T s, we can find s_i.

    Also known as projection pursuit.

  • Slide 10/87

    Marginal and joint densities, uniform distributions.

    Marginal and joint densities, whitened mixtures of uniform ICs

  • Slide 11/87

    Marginal and joint densities, supergaussian distributions.

    Whitened mixtures of supergaussian ICs

  • Slide 12/87

    Kurtosis as nongaussianity measure.

    Problem: how to measure nongaussianity?

    Definition:

    $\mathrm{kurt}(x) = E\{x^4\} - 3\,(E\{x^2\})^2$   (4)

    If the variance is constrained to unity, this is essentially the 4th moment.

    Simple algebraic properties because it is a cumulant (for independent s_1, s_2 and a scalar α):

    $\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2)$   (5)

    $\mathrm{kurt}(\alpha s_1) = \alpha^4\, \mathrm{kurt}(s_1)$   (6)

    Zero for a gaussian RV, non-zero for most nongaussian RVs.
    Positive vs. negative kurtosis correspond to typical forms of pdf.
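
    A small sketch of the sample version of equation (4), checked on gaussian, Laplacian (supergaussian) and uniform (subgaussian) samples (illustrative code, not from the slides):

    ```python
    import numpy as np

    def kurt(x):
        """kurt(x) = E{x^4} - 3 (E{x^2})^2, estimated on a zero-mean sample."""
        x = x - x.mean()
        return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

    rng = np.random.default_rng(0)
    n = 100000
    print(kurt(rng.normal(size=n)))          # close to 0 (gaussian)
    print(kurt(rng.laplace(size=n)))         # > 0 (supergaussian)
    print(kurt(rng.uniform(-1, 1, size=n)))  # < 0 (subgaussian)
    ```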

  • Slide 13/87

    Left: Laplacian pdf, positive kurtosis (supergaussian). Right: Uniform pdf, negative kurtosis (subgaussian).

  • Slide 14/87

    The extrema of kurtosis

    By the properties of kurtosis:

    $\mathrm{kurt}(w^T x) = \mathrm{kurt}(q^T s) = q_1^4\, \mathrm{kurt}(s_1) + q_2^4\, \mathrm{kurt}(s_2)$   (7)

    Constrain the variance to equal unity:

    $E\{(w^T x)^2\} = E\{(q^T s)^2\} = q_1^2 + q_2^2 = 1$   (8)

    For simplicity, consider kurtoses equal to one.
    The maxima of kurtosis give the independent components (see figure).
    General result: the absolute value of kurtosis is maximized by the s_i

    (Delfosse and Loubaton, 1995).

    Note: extrema are orthogonal due to whitening.

  • Slide 15/87

    Optimization landscape for kurtosis. Thick curve is unit sphere, thin

    curves are contours where kurtosis is constant.

  • Slide 16/87

    [Figure: kurtosis as a function of the angle of w.]

    Kurtosis as a function of the direction of projection. For positive kurtosis,

    kurtosis (and its absolute value) are maximized in the directions of the

    independent components.

  • Slide 17/87

    [Figure: kurtosis as a function of the angle of w.]

    Case of negative kurtosis. Kurtosis is minimized, and its absolute value

    maximized, in the directions of the independent components.

  • Slide 18/87

    Basic ICA estimation procedure

    1. Whiten the data to give z.
    2. Set the iteration count i = 1.
    3. Take a random vector w_i.
    4. Maximize the nongaussianity of w_i^T z, under the constraints ||w_i||^2 = 1 and w_i^T w_j = 0 for j < i.
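
    A minimal sketch of this deflation scheme, using kurtosis as the nongaussianity measure and a plain gradient step (illustrative, not the exact algorithm from the slides; z is assumed whitened, of shape (n, T), and the step size is an arbitrary choice):

    ```python
    import numpy as np

    def deflation_ica(z, n_components, n_iter=200, step=0.1, seed=0):
        """Estimate ICs one by one by maximizing |kurtosis| of w^T z,
        under ||w|| = 1 and orthogonality to the previously found vectors."""
        rng = np.random.default_rng(seed)
        n, T = z.shape
        W = np.zeros((n_components, n))
        for i in range(n_components):
            w = rng.normal(size=n)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                y = w @ z
                k = np.mean(y ** 4) - 3.0                   # kurtosis (unit variance)
                grad = (z * y ** 3).mean(axis=1) - 3.0 * w  # gradient of kurtosis
                w = w + step * np.sign(k) * grad            # ascend |kurtosis|
                w -= W[:i].T @ (W[:i] @ w)                  # enforce w_i^T w_j = 0, j < i
                w /= np.linalg.norm(w)                      # enforce ||w_i|| = 1
            W[i] = w
        return W   # estimated separating vectors; sources are W @ z
    ```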

  • Slide 19/87

    Why kurtosis is not optimal

    Sensitive to outliers: consider a sample of 1000 values with unit variance, and one value equal to 10.

    The kurtosis then equals at least 10^4/1000 - 3 = 7.

    For supergaussian variables, the statistical performance is not optimal even without outliers.

    Other measures of nongaussianity should be considered.

  • Slide 20/87

    Differential entropy as nongaussianity measure

    Generalization of ordinary discrete Shannon entropy:

    $H(x) = -E\{\log p(x)\}$   (9)

    For fixed variance, it is maximized by the gaussian distribution.

    Often normalized to give the negentropy

    $J(x) = H(x_{\mathrm{gauss}}) - H(x)$   (10)

    Good statistical properties, but computationally difficult.

  • Slide 21/87

    Approximation of negentropy

    Approximations of negentropy (Hyvarinen, 1998):

    $J_G(x) = (E\{G(x)\} - E\{G(x_{\mathrm{gauss}})\})^2$   (11)

    where G is a nonquadratic function.

    Generalization of the (square of) kurtosis (which is G(x) = x^4). A good compromise?

    statistical properties not bad (for suitable choice of G)

    computationally simple

    Further possibility: Skewness (for nonsymmetric ICs)
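
    A small sketch of the approximation (11) with the common choice G(x) = log cosh(x), compared on a gaussian and a Laplacian sample (illustrative, not from the slides; the E{G(x_gauss)} term is estimated by Monte Carlo):

    ```python
    import numpy as np

    def negentropy_approx(x, n_gauss=100000, seed=0):
        """J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 with G(x) = log cosh(x).
        x is standardized to zero mean and unit variance first."""
        rng = np.random.default_rng(seed)
        G = lambda u: np.log(np.cosh(u))
        x = (x - x.mean()) / x.std()
        g_gauss = G(rng.normal(size=n_gauss)).mean()   # E{G(nu)}, nu ~ N(0, 1)
        return (G(x).mean() - g_gauss) ** 2

    rng = np.random.default_rng(1)
    print(negentropy_approx(rng.normal(size=50000)))    # close to 0
    print(negentropy_approx(rng.laplace(size=50000)))   # clearly > 0
    ```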

  • Slide 22/87

    Information-theoretic approach.

    (Comon 1994)

    Mutual information of y = (y_1, ..., y_n)^T:

    $I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y)$   (12)

    where H is the differential entropy.

    A measure of the redundancy of y. Equals zero iff the y_i are independent.

    For y = Wx, we obtain

    $I(y) = \sum_{i=1}^{n} H(y_i) - \log|\det W| + \mathrm{const.}$   (13)

  • Slide 23/87

    Mutual information and nongaussianity

    If W is constrained to be orthogonal (whitened data):

    $I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) + \mathrm{const.}$   (14)

    Sum of nongaussianities! (though opposite sign)

    Rigorous derivation of maximization of nongaussianities.

  • Slide 24/87

    Maximum likelihood estimation.

    (Pham et al, 1992)

    Log-likelihood of the model (with W = A^{-1}):

    $L = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_{s_i}(w_i^T x(t)) + T \log|\det W|$   (15)

    Equivalent to the infomax approach in neural networks.

    Needs estimates of the p_{s_i}, but these need not be exact at all. Roughly: consistent if p_{s_i} is of the right type (sub- or supergaussian).

    Very similar to mutual information:

    $I(y) = -E\Big\{\sum_{i=1}^{n} \log p_{y_i}(y_i)\Big\} - \log|\det W| + C$   (16)

  • Slide 25/87

    Overview of ICA estimation principles.

    Most approaches can be interpreted as maximizing the nongaussianity of the ICs.

    Basic choice: the nonquadratic function in the nongaussianity measure:

    kurtosis: fourth power
    entropy/likelihood: log of density
    approx. of entropy: G(s) = log cosh(s), or others.

    One-by-one estimation vs. estimation of the whole model.
    Estimates constrained to be white vs. no constraint.

  • Slide 26/87

    Algorithms (1). Adaptive gradient methods

    Gradient methods for one-by-one estimation are straightforward.

    Stochastic gradient ascent of the likelihood (Bell-Sejnowski 1995):

    $\Delta W \propto (W^{-1})^T + g(Wx)\, x^T$   (17)

    with g = (log p_s)'. Problem: needs a matrix inversion!

    Better: natural/relative gradient ascent of the likelihood (Amari et al, 1996; Cardoso and Laheld, 1994):

    $\Delta W \propto [I + g(y)\, y^T]\, W$   (18)

    with y = Wx. Obtained by multiplying the gradient by W^T W.
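
    A minimal batch sketch of the natural/relative-gradient update (18), with the nonlinearity g(y) = -tanh(y), a choice suited to supergaussian sources (illustrative code, not the authors' implementation; the step size mu and the initialization are assumptions):

    ```python
    import numpy as np

    def natural_gradient_ica(x, n_iter=300, mu=0.1, seed=0):
        """Batch natural-gradient ascent: W <- W + mu * [I + g(y) y^T] W,
        with g(y) = -tanh(y).  x: (n, T) array of zero-mean mixtures."""
        rng = np.random.default_rng(seed)
        n, T = x.shape
        W = np.eye(n) + 0.1 * rng.normal(size=(n, n))
        for _ in range(n_iter):
            y = W @ x
            g = -np.tanh(y)
            # expectation approximated by the sample average over the batch
            W = W + mu * (np.eye(n) + (g @ y.T) / T) @ W
        return W    # y = W @ x estimates the independent components
    ```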

  • Slide 27/87

    Algorithms (2). The FastICA fixed-point algorithm

    (Hyvarinen 1997,1999)

    An approximate Newton method in block (batch) mode.

    No matrix inversion, but still quadratic (or cubic) convergence.
    No parameters to be tuned.

    For a single IC (whitened data):

    w ← E{x g(w^T x)} − E{g'(w^T x)} w,   then normalize w

    where g is the derivative of G.

    For the likelihood:

    W ← W + D_1 [D_2 + E{g(y) y^T}] W,   then orthonormalize W

    (with D_1, D_2 diagonal matrices).
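
    A sketch of the one-unit FastICA fixed-point iteration for whitened data z, with G(u) = log cosh(u), so g = tanh and g' = 1 - tanh^2 (illustrative implementation; the convergence test and tolerances are assumptions):

    ```python
    import numpy as np

    def fastica_one_unit(z, max_iter=100, tol=1e-6, seed=0):
        """One-unit FastICA on whitened data z (shape (n, T)):
        w <- E{z g(w^T z)} - E{g'(w^T z)} w, then normalize w."""
        rng = np.random.default_rng(seed)
        n, T = z.shape
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ z
            g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
            w_new = (z * g).mean(axis=1) - g_prime.mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < tol:   # converged (up to sign)
                return w_new
            w = w_new
        return w          # one row of the separating matrix; the IC is w @ z
    ```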

  • Slide 28/87

    [Figure: kurtosis as a function of iteration count.]

    Convergence of FastICA. Vectors after 1 and 2 iterations, values of

    kurtosis.

  • Slide 29/87

    [Figure: kurtosis as a function of iteration count.]

    Convergence of FastICA (2). Vectors after 1 and 2 iterations, values of

    kurtosis.

  • Slide 30/87

    Relations to other methods (1): Projection pursuit (Friedman and Tukey, 1974; Huber, 1985)

    Projection pursuit is a method for visualization and exploratory data analysis.

    Attempts to show the clustering structure of the data by finding interesting projections.

    PCA is not designed to find clustering structure.

    Interestingness is usually measured by nongaussianity.

    For example, bimodal distributions are very nongaussian.

  • Slide 31/87

    Illustration of projection pursuit. The projection pursuit direction is

    horizontal, the principal component vertical.

  • Slide 32/87

    Relations to other methods (2)

    Factor analysis: ICA is a nongaussian (usually noise-free) version.

    Blind deconvolution: obtained by constraining the mixing matrix.

    Principal component analysis: often the same applications, but very different statistical principles.

  • Slide 33/87

    Basic ICA estimation: conclusions

    ICA is very simple as a model:

    a linear nongaussian latent variable model.

    Estimation is not so simple, due to nongaussianity: objective functions cannot be quadratic.

    Estimation by maximizing the nongaussianity of the independent components.
    Equivalently (?), maximum likelihood or minimization of mutual information.

    Algorithms: adaptive (natural gradient) vs. block/batch mode (FastICA).

    Choice of nonlinearity: cubic (kurtosis) vs. non-polynomial functions

  • Slide 34/87

    Applications (1)

    Brain Imaging Data

  • Slide 35/87

    Main application areas so far:

    Audio noise cancelling (cocktail-party problem). Very difficult... mainly historical.

    Biomedical signals: Electro????grams

    Brain images

    Microarray data (gene expression)

    Vision modelling and image processing
    Econometric time series
    Telecommunications

  • Slide 36/87

    Brain imaging data analysis

    EEG, MEG: high temporal resolution.
    PET, fMRI: global activity maps.

    Huge amounts of data: need for neuroinformatics.
    Physiological models vs. unsupervised methods.

  • Slide 37/87

    Electric and magnetic fields

    [Figure: magnetic field and electric potential patterns over the scalp.]

    EEG and MEG measurements over the scalp. (From Vigario et al, 2000.)

    Many sources are mixed in the measurements.

  • Slide 38/87

    Magnetoencephalography

    [Figure: the Neuromag-122 device, with dewar, liquid helium, sensor array, and planar and axial gradiometers.]

    Neuromag-122 whole scalp magnetometer.

    (From Vigario et al, 2001.)

  • Slide 39/87

    Artefact removal from MEG

    [Figure: MEG channels recorded during saccades, blinking and biting.]

    A subset of 12 spontaneous MEG signals.

    (From Vigario et al, 1998.)

  • Slide 40/87

    [Figure: components IC1-IC9, 10 s of data.]

    Artefacts found from MEG data, using the FastICA algorithm. (From Vigario et al, 1998.)

  • Slide 41/87

    Analysis of evoked magnetic fields

    [Figure: evoked responses at channels MEG25, MEG83, MEG60 (MEG-L) and MEG10 (MEG-R), left and right side.]

    Averaged auditory evoked responses to 200 tones, using MEG. (From

    Vigario et al, 1998.)

  • Slide 42/87

    [Figure: (a) PC1-PC5 and (b) IC1-IC4.]

    Principal (a) and independent (b) components found from the auditory evoked field study. (From Vigario et al, 1998.)

  • Slide 43/87

    Applications (2)

    Image and Vision Modelling

  • Slide 44/87

    ICA and image data

    Models of image data are always useful.

    In computational neuroscience: evolution + development give optimal receptive fields.

    In image processing: essential for denoising, prediction, etc.

    ICA gives an interesting model
    (Olshausen and Field, 1996; Bell and Sejnowski, 1997).

    Important connection to sparse coding.

  • Slide 45/87

    Linear models of images.

    The observed variables x_i are gray-scale values of pixels in an image.
    Modelled by a linear latent variable model:

    $x = As = \sum_i a_i s_i$   (19)

    The columns a_i are called basis vectors.
    An image is a superposition of basis vectors.

    Well-known basis vector sets:
    Fourier analysis (sines, cosines)
    DCT
    Wavelets
    Gabor analysis

    What could be the best basis vectors?

  • Slide 46/87

    Some DCT (top) and wavelet (bottom) basis vectors

  • Slide 47/87

    What is sparseness?

    A form of nongaussianity (higher-order structure) often encountered in natural signals.

    The variable is active only rarely.

    [Figure: a gaussian signal and a sparse signal.]

  • Slide 48/87

    What is sparseness? (2)

    A random variable is sparse if its density has heavy tails and a peak at zero.

    Kurtosis is (strongly) positive, i.e. supergaussianity.

    Typical sparse pdf (Laplace):

    [Figure: Laplace density.]

    (dash-dot: Gaussian density)

  • Slide 49/87

    Linear Sparse Coding

    For a random vector x, find a linear representation

    x = As   (20)

    so that the components s_i are as sparse as possible.

    A given data point x(t) is represented using only a limited number of active components s_i.

    ICA is sparse coding since sparseness is supergaussianity.

    Sparse coding/ICA gives an optimal basis.

  • Slide 50/87

    ICA basis vectors of image windows.

  • Slide 51/87

    Why sparse coding?

    Good fit to V1 simple cell receptive fields (Van Hateren et al, 1998).
    Compress images: code only the nonzero components.
    In biological networks: saves energy.
    Internal model for recovery of structure.
    Denoising: use thresholding to leave only the components that are really active.

    Wavelet methods use the same principles.

  • Slide 52/87

    Sparse Coding: Denoising by Shrinkage

    (Hyvarinen, 1999)

    Assume the data is corrupted by additive white Gaussian noise n:

    x = As + n   (21)

    and constrain A to be orthogonal.

    Estimate s_i from x by the ML method, assuming the s_i to be independent:

    $\hat{s}_i = f(w_i^T x)$   (22)

    and reconstruct $\hat{x} = A\hat{s}$.

  • Slide 53/87

    Shrinkage nonlinearity as denoising

    E.g., if the s_i have a Laplace distribution, we have

    $f(u) = \mathrm{sign}(u)\, \max(0, |u| - \sqrt{2}\,\sigma^2)$   (23)

    where σ² is the noise variance.

  • Slide 54/87

    Sparse Code Shrinkage algorithm

    1. Estimate the sparse coding matrix W = A^{-1}, and the shrinkage nonlinearities f_i.

    2. Compute for each noisy observation x(t) the corresponding noisy sparse components w_i^T x(t).

    3. Reduce noise by applying the shrinkage nonlinearity f_i(.) on the noisy sparse components:
    $\hat{s}_i(t) = f_i(w_i^T x(t))$

    4. Invert the coding to obtain the estimate $\hat{x}(t) = W^T \hat{s}(t)$.

    Can be considered as an adaptive version of wavelet shrinkage (Donoho et al, 1995).
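
    A minimal sketch of steps 2-4 with an orthogonal sparse-coding matrix W and the Laplace shrinkage of equation (23) (illustrative code, not the authors' implementation; W is assumed to have been estimated already from noise-free data, and sigma2 is the assumed known noise variance):

    ```python
    import numpy as np

    def laplace_shrinkage(u, sigma2):
        """f(u) = sign(u) * max(0, |u| - sqrt(2) * sigma2), cf. eq. (23)."""
        return np.sign(u) * np.maximum(0.0, np.abs(u) - np.sqrt(2) * sigma2)

    def sparse_code_shrinkage(x, W, sigma2):
        """Denoise observations x (shape (n, T)) given an orthogonal W = A^{-1}."""
        u = W @ x                              # noisy sparse components w_i^T x(t)
        s_hat = laplace_shrinkage(u, sigma2)   # shrink each component
        return W.T @ s_hat                     # invert the coding: x_hat = W^T s_hat
    ```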

  • Slide 55/87

    Experiments on Sparse Code Shrinkage

    The input data x consisted of 8×8 windows from images.

    Basis vectors were estimated from noise-free images, using a modification of FastICA.

    Sparse code shrinkage was applied to sliding windows in the noisy images; averages of the 8×8 reconstructions were taken as the final reconstruction.

  • Slide 56/87

    Noise level: 0.3

    Left: Noisy image. Middle: Wiener filtered. Right: Sparse code shrinkage result.

  • Slide 57/87

    Conclusion: Image feature extraction by ICA

    ICA gives an interesting model for image data.
    Takes into account nongaussianity, here: sparseness.

    Performs sparse coding.
    Features related to Gabor functions, wavelets, V1 simple cells.

    Shrinkage denoising possible as with wavelets.

  • Slide 58/87

    Extensions:

    Subspace and topography formalisms

  • Slide 59/87

    Relaxing independence

    For most data sets, the estimated components are not very independent.

    In fact, independent components cannot be found in general.

    We attempt to model some of the remaining dependencies.
    Basic models group the components:

    Multidimensional ICA, and

    Independent Subspace Analysis.

  • Slide 60/87

    Multidimensional ICA (Cardoso 1998)

    One approach to relaxing independence: the s_i can be divided into n-tuples, such that

    the s_i inside a given n-tuple may be dependent on each other, but

    dependencies between different n-tuples are not allowed.

    Every n-tuple corresponds to a subspace.

  • Slide 61/87

    Invariant-feature subspaces (Kohonen 1996)

    Linear filters (like in ICA) necessarily lack any invariance.

    Invariant-feature subspaces are an abstract approach to representing invariant features.

    Principle: an invariant feature is a linear subspace in a feature space. The value of the invariant feature is given by the norm of the projection on that subspace:

    $\sum_{i=1}^{k} (w_i^T x)^2$   (24)
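
    A tiny sketch of the feature value in (24): the squared norm of the projection of x onto the subspace spanned by filters w_1, ..., w_k (illustrative code; the filter matrix and dimensions are made up):

    ```python
    import numpy as np

    def subspace_feature(W_sub, x):
        """Invariant feature value for one subspace:
        sum_{i=1}^k (w_i^T x)^2, with w_1..w_k stacked as rows of W_sub."""
        return np.sum((W_sub @ x) ** 2)

    # example: a 2-dimensional subspace in a 4-dimensional feature space
    rng = np.random.default_rng(0)
    W_sub = rng.normal(size=(2, 4))
    x = rng.normal(size=4)
    print(subspace_feature(W_sub, x))
    ```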

  • Slide 62/87

    Independent Subspace Analysis (Hyvarinen and Hoyer, 2000)

    Combination of multidimensional ICA and invariant-feature subspaces.

    The probability density inside each subspace is spherically symmetric, i.e. it depends only on the norm of the projection.

    Simplifies the model considerably.

    The nature of the invariant features is not specified.

  • Slide 63/87

    [Figure: network diagram of independent subspace analysis: the input passes through linear filters, whose outputs are squared, (.)², and summed within each subspace.]

  • Slide 64/87

    Application on image data

    Applied to image data, our model shows the emergence of complex-cell properties:

    We have phase invariance and some translation invariance, as well as orientation and frequency selectivity.

    Each subspace can be interpreted as a complex cell.
    Similar to energy models for complex cells (the norm is like a local energy).

  • Slide 65/87

    Independent Subspaces of natural image data.

  • Slide 66/87

    Independent Subspace Analysis: Conclusions

    A simple way of relaxing the independence constraint in ICA.
    Instead of scalar components, only subspaces are independent.

    Densities inside the subspaces are spherically symmetric.
    Can be interpreted as invariant-feature subspaces.

    When applied on image data, complex cell properties emerge.

  • Slide 67/87

    Problem: Dependencies still remain

    Linear decomposition often does not give independence, even for subspaces.

    Remaining dependencies could be visualized or otherwise utilized.

    Components can be decorrelated, so only higher-order correlations are interesting.

    How to visualize them? E.g. using topographic order

  • Slide 68/87

    Extending the model to include topography

    Instead of having unordered components,

    they are arranged on a two-dimensional lattice

    [Figure: components on a lattice; near-by components are dependent, distant ones independent.]

    The components are typically sparse, but not independent.
    Near-by components have higher-order correlations.

  • Slide 69/87

    Dependence through local variances

    Often encountered in e.g. image data

    Components are independent given their variances

    In our model, variances are not independent

    instead: correlated for near-by components

    e.g. generated by another ICA model, with topographic mixing

    [Figure: two regimes, independent vs. topographic variance dependence.]

  • Slide 70/87

    Two signals that are independent given their variances.

  • Slide 71/87

    Topographic ICA model (Hyvarinen et al, 2000)

    [Figure: generative graph of topographic ICA, with variance-generating variables u_i, variances σ_i, components s_i, mixing matrix A, and observations x_i.]

    Variance-generating variables u_i are generated randomly, and mixed linearly inside their topographic neighbourhoods. The mixtures are transformed using a nonlinearity, thus giving the variances σ_i of the s_i. Finally, ordinary linear mixing.

  • Slide 72/87

    Approximation of likelihood

    The likelihood of the model is intractable.

    Approximation:

    $\sum_{t=1}^{T} \sum_{j=1}^{n} G\Big(\sum_{i=1}^{n} h(i,j)\,(w_i^T x(t))^2\Big) + T \log|\det W|$   (25)

    where h(i, j) is a neighborhood function, and G a nonlinear function.

    Generalization of independent subspace analysis.
    A function of local energies only!

  • Slide 73/87

    Top-down modulated Hebbian learning

    Approximation of likelihood can be maximized by gradient ascent

    Learning rule:

    $\Delta w_i \propto E\{x\,(w_i^T x)\, r_i\} + \text{normalization} + \text{feedback}$   (26)

    where

    $r_i = \sum_{k=1}^{n} h(i,k)\, g\Big(\sum_{j=1}^{n} h(k,j)\,(w_j^T x)^2\Big)$   (27)

    Hebbian learning, with r_i a function of the outputs of higher-order (complex) cells.

  • Slide 74/87

    Topographic ICA of natural image data. Topographically ordered

    Gabor-like basis vectors for image patches.

  • Slide 75/87

    Independent subspace analysis and topographic ICA

    In ISA, single components are not independent, but subspaces are.

    In topographic ICA, dependencies modelled continuously. No strict division into subspaces.

    For image data, each neighbourhood is a complex cell. Local energies are their outputs.

    Topographic ICA is a generalization of ISA, incorporating the invariant-feature subspace principle as invariant-feature neighbourhoods.

  • Slide 76/87

    Topographic ICA: Conclusion

    A more sophisticated way of relaxing independence.

    Dependencies that cannot be cancelled by ICA define a similarity measure.

    A new principle for topographic mappings.
    Formulated as a modification of the ICA model.
    Approximation of the likelihood gives tractable algorithms.
    For image data, the topography is similar to V1.

  • Slide 77/87

    Using time dependencies

  • Slide 78/87

    Using autocorrelations for ICA estimation

    Take the basic linear mixture model

    x(t) = As(t) (28)

    Cannot be estimated in general (take gaussian RVs).
    Usually in ICA, we assume the s_i to be nongaussian:
    higher-order statistics provide the missing information.

    Alternatively: assume the s_i are time-dependent signals, and use time correlations to give more information.

    For example, a lagged covariance matrix

    $C_x^\tau = E\{x(t)\, x(t-\tau)^T\}$   (29)

    measures covariances of lagged signals.
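
    A small sketch of estimating the lagged covariance matrix (29) from a finite sample (illustrative code; x is an (n, T) array and tau a nonnegative integer lag):

    ```python
    import numpy as np

    def lagged_cov(x, tau):
        """C_x^tau = E{x(t) x(t - tau)^T}, estimated by a sample average."""
        x = x - x.mean(axis=1, keepdims=True)
        T = x.shape[1]
        return (x[:, tau:] @ x[:, : T - tau].T) / (T - tau)
    ```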

  • Slide 79/87

    The AMUSE algorithm for using autocorrelations

    (Tong et al, 1991; Molgedey and Schuster, 1994)

    Basic principle: decorrelate each signal in y = Wx with the other signals, lagged as well as not lagged.

    In other words: E{y_i(t) y_j(t − τ)} = 0 for all i ≠ j. To do this:

    1. Whiten the data to obtain z(t) = Vx(t).

    2. Find an orthogonal transformation W so that the lagged covariance matrix of y(t) = Wz(t) is diagonal.

    This is a matrix diagonalization problem,

    $C_x^\tau = E\{x(t)\, x(t-\tau)^T\} = E\{A s(t)\, s(t-\tau)^T A^T\} = A\, C_s^\tau A^T$

    of a (more or less) symmetric matrix.
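
    A compact sketch of this AMUSE-style procedure: whiten, then diagonalize a symmetrized lagged covariance with a single eigen-decomposition (illustrative code, not the original algorithm's implementation; symmetrization and the default lag tau=1 are assumptions):

    ```python
    import numpy as np

    def amuse(x, tau=1):
        """Separate x (shape (n, T)) by whitening and then rotating so that
        the symmetrized lagged covariance of y = W z is diagonal."""
        x = x - x.mean(axis=1, keepdims=True)
        T = x.shape[1]

        # 1. whiten: z = V x with E{z z^T} = I
        d, E = np.linalg.eigh(np.cov(x))
        V = E @ np.diag(d ** -0.5) @ E.T
        z = V @ x

        # 2. lagged covariance of z, symmetrized
        C = (z[:, tau:] @ z[:, : T - tau].T) / (T - tau)
        C = (C + C.T) / 2

        # 3. its eigenvectors give the orthogonal rotation (needs distinct eigenvalues)
        _, W = np.linalg.eigh(C)
        y = W.T @ z               # estimated sources
        return y, W.T @ V         # separating matrix mapping x to y
    ```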

  • Slide 80/87

    Pros and cons of separation by autocorrelations

    Very fast to compute: a single eigen-value decomposition, like PCA

    Can only separate ICs with different autocorrelations, because the lagged covariance matrix must have distinct eigenvalues.

    Some improvement can be achieved by using several lags in the algorithm (Belouchrani et al, 1997, SOBI).

    But if the signals have identical Fourier spectra, autocorrelations simply cannot separate them.

  • Slide 81/87

    Combining nongaussianity and autocorrelations

    Best results should be obtained by using these two kinds of information.

    E.g.: model the temporal structure of the signals with, e.g., ARMA models.

    A more general approach: minimize coding complexity.

    Find a decomposition y = Wx so that the y_i are easy to code. Rigorously defined by Kolmogorov complexity.

    Signals are easy to code if they are nongaussian and have time dependencies.

  • Slide 82/87

    Coding complexity as a general framework

    (Pajunen, 1998)

    For whitened data z, and an orthogonal W: minimize the sum of coding lengths of the y = Wz.

    If only marginal distributions are used,

    coding length is given by entropy, i.e. nongaussianity.

    If only autocorrelations are used, coding length is related to

    autocorrelations.

    Thus we have a generalization of both frameworks.

  • Slide 83/87

    Approximation of coding complexity

    The value of y(t) is predicted from the preceding values:

    $\hat{y}(t) = f(y(t-1), y(t-2), \ldots, y(1))$   (30)

    The residuals y(t) − ŷ(t) are coded independently of each other.

    The predictor could be linear. The coding length is approximated by the entropy of the residuals:

    $H(y - \hat{y})$   (31)

    Many other approximations can be developed.

  • Slide 84/87

    Estimation using variance nonstationarity

    (Matsuoka et al, 1995)

    An alternative to autocorrelations (and nongaussianity)

    Variance changes slowly over time

    [Figure: a signal whose variance changes slowly over time.]

    This gives enough information to estimate the model.

  • Slide 85/87

    Convolutive ICA

    Often the signals do not arrive at the same time at the sensors.

    There may be echoes as well (multi-path phenomena).

    Include convolution in the model:

    $x_i(t) = \sum_{j=1}^{n} a_{ij}(t) * s_j(t) = \sum_{j=1}^{n} \sum_{k} a_{ij}(k)\, s_j(t-k), \qquad i = 1, \ldots, n$   (32)

    In theory: Estimation by the same principles as ordinary ICA

    In practice: huge number of parameters since (de)convolving filters

    may be very long

    special methods may need to be used

  • Slide 86/87

    Final Summary

    ICA is a very simple model. Simplicity implies wide applicability.
    A nongaussian alternative to PCA or factor analysis.
    Decorrelation or whitening is only half of ICA.

    The other half uses the higher-order statistics of nongaussian variables
    (or alternatively: autocorrelations, variance nonstationarity, complexity).

    The basic principle is to find maximally nongaussian directions.
    Essentially equivalent to the maximum likelihood or information-theoretic formulations.

  • Slide 87/87

    Final Summary (2)

    Applications:

    Blind source separation: biomedical signals, econometrics, etc.
    Feature extraction: images, etc.
    Exploratory data analysis: like projection pursuit.
    New applications are coming all the time.

    Since dependencies cannot always be cancelled, subspace or topographic versions may be useful.

    Alternatively, separation is possible using time dependencies.

    Nongaussianity is beautiful!?
