TRANSCRIPT
Biomedical signal processing — application of optimization methods for machine learning problems
Fabian J. Theis
Computational Modeling in Biology, Institute of Bioinformatics and Systems Biology
Helmholtz Zentrum München
http://cmb.helmholtz-muenchen.de
Grenoble, 16-Sep-2008
Data mining
cocktail-party problem
• mixture model x(t) = f(s(t))
• estimate mixing process f and sources s(t)
• often linear f = A
[Figure: sources s(t) are mixed into observations x(t); a neural network W recovers estimates of s(t)]
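To make the linear mixture concrete, here is a minimal sketch (not part of the talk; the signal shapes and mixing matrix are made up) of generating observations x(t) = A s(t) from two sources:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
s = np.vstack([np.sin(2 * np.pi * 5 * t),   # source 1: a pure tone
               rng.laplace(size=t.size)])   # source 2: a noisy voice
A = np.array([[0.8, 0.3],                   # mixing matrix: microphone gains
              [0.2, 0.9]])
x = A @ s                                   # observed microphone signals x(t) = A s(t)
```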
Outline
1 Supervised methods: Motivation 1: classification · Motivation 2: image segmentation · Statistical decision theory
2 Unsupervised methods: Clustering · k-means · Partitional clustering
3 Signal component analysis: Independent component analysis · Independent subspace analysis · Sparse component analysis · Nonlinear sparse component analysis
4 Conclusions
Motivation 1: classification
data analysis: classification
• decide between (two or multiple) classes s(t) ∈ {0, 1}
• learn by example
Neural networks
Classification: example
• observations:
  • immunological data set
  • 30 cell parameters of 37 children with pulmonary diseases
• goal:
  • interpretation using supervised and unsupervised analysis
  • disease classification into chronic bronchitis or interstitial lung disease: CB ⇔ ILD?
cooperation with D. Hartl, Pediatric Immunology, Munich
Data visualization & dimension reduction
parameter interpretation?
[Figure: self-organizing map of the immunological data with k-means clusters; map units labelled by patient class CB/ILD]
• visualization by a self-organizing map network
• topology-preserving nonlinear dimension reduction/scaling
• detect new parameter dependencies
Disease classification
dimension-reducing network:
z(i) = B_supervised A_unsup. x(i)
results:
• down-scaling to 5 hidden neurons suffices
• classification rate of > 90%
[Theis, Hartl, Krauss-Etschmann, Lang. Neural network signal analysis in immunology. Proc. ISSPA 2003.]
Motivation 2: image segmentation
classification
• application in image processing
• ⇒ object classification
Problem: How many labelled cells lie in this section image?
Biological background: neurogenesis
• adult neurogenesis
  • new neurons emerge even in the adult human brain
  • level depends on external stimuli
  • Are there neural ancestral cells?
• goal
  • automated quantification of neurogenesis in adult mice
cooperation with Z. Kohl, Department of Neurology, University of Regensburg
Automated cell counting
directional neural network
• train a cell patch classifier ζ using a directional neural network
• scan the image using ζ to get cell positions
• speed-up via hierarchical and multiscale methods
Results
• counting comparison with 2 experts (variability ±5%) yields 90% ± 4% accuracy
• application: considerable cell proliferation in the hippocampus of epileptic mice
[Theis, Kohl, Guggenberger, Kuhn, Lang. ZANE - an algorithm for counting labelled cells in section images. Proc. MEDSIP 2004]
Statistical decision theory
setup
• input: random vector X : Ω → R^p
• output: random vector Y : Ω → R, or categorical output, possibly Y ∈ {0, 1}
• input-output relation measured by the joint density P(X, Y)
• realization by samples (training data) (x_i, y_i) for i = 1, ..., N
• often collected in an (N × p)-matrix X and a vector y ∈ R^N
Goal: prediction
• goal: learn a classifier from the training data ⇒ predict y* for a new sample x*
Linear model
$$y = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$$
set $x_0 := 1$, then $y = x^\top \beta$
least squares: minimize
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^\top \beta)^2 = (y - X\beta)^\top (y - X\beta)$$
$\Rightarrow X^\top (y - X\beta) = 0$, so
$$\beta = (X^\top X)^{-1} X^\top y$$
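A minimal numerical check of this closed-form solution (a sketch with synthetic data, not from the talk); numpy's `lstsq` solves the same normal equations in a numerically stabler way:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # x_0 := 1 absorbs beta_0
beta = np.array([0.5, 2.0, -1.0, 0.3])
y = X @ beta + 0.1 * rng.normal(size=N)

# least squares: equivalent to (X^T X)^{-1} X^T y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```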
Linear model
[Figure: linear fit on the two-class data]
decision boundary: $\{x \mid x^\top \beta = 1/2\}$
Linear model
nice, but what about more complex data?
[Figures: two harder two-class data sets]
(r = 2 and r = 10 Gaussians per class, σ = 0.2, with the r means sampled from N((1, 0), I) and N((0, 1), I), respectively)
the ‘global’ linear model is too rigid
Nearest-neighbor method
$$y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where $N_k(x)$ denotes the set of the $k$ closest points $x_i$ to $x$
• local model
• needs a metric (here Euclidean)
• how to determine k?
  • smaller k ⇒ higher learning accuracy
  • larger k ⇒ smoother, higher generalizability
  • least-squares learning would yield k = 1
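A sketch of the nearest-neighbor estimator above (assuming numpy arrays; not code from the talk):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=10):
    """Average the labels of the k closest training points (Euclidean metric)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]        # indices of N_k(x)
    return y_train[neighbors].mean()         # threshold at 1/2 for a {0,1} decision
```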
Nearest-neighbor method, k = 10
[Figures: k = 10 nearest-neighbor fits on the two example data sets]
decision boundary: $\{x \mid y(x) = 1/2\}$
Nearest-neighbor method, k = 1, 2, 10
[Figures: nearest-neighbor decision boundaries for k = 1, 2, 10 on the r = 10 data set]
Statistical decisions
probabilistic view: P(X, Y) = P(Y|X) P(X)
find a function f(X) predicting Y as well as possible w.r.t. the squared error loss L(Y, f(X)) = (Y − f(X))²
expected prediction error:
$$\mathrm{EPE}(f) = E(Y - f(X))^2 = \int (y - f(x))^2\, P(dx, dy) = E_X E_{Y|X}\big((Y - f(X))^2 \mid X\big)$$
pointwise minimization suffices:
$$f(x) = \operatorname{argmin}_c E_{Y|X}\big((Y - c)^2 \mid X = x\big)$$
solved by the conditional expectation (regression function):
$$f(x) = E(Y \mid X = x)$$
Statistical decisions
$$f(x) = E(Y \mid X = x)$$
can be estimated by
$$\hat f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
• approximate the expectation by a sample average
• approximate point conditioning by local conditioning
• note: $\hat f(x) \to E(Y \mid X = x)$ for $N, k \to \infty$, $k/N \to 0$
• but:
  • (very) finite samples
  • ‘curse’ of dimensionality
    • a fraction r of the unit cube in p dimensions is covered by a cube of edge length $e_p(r) = r^{1/p}$
    • $e_2(0.01) = 0.1$, $e_2(0.1) = 0.32$
    • $e_{10}(0.01) = 0.63$, $e_{10}(0.1) = 0.80$
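The edge-length formula is easy to verify; a one-line computation, added here only for illustration:

```python
def edge_length(r, p):
    """Edge length e_p(r) = r**(1/p) of a sub-cube covering a fraction r of the unit cube."""
    return r ** (1.0 / p)

print(edge_length(0.01, 2), edge_length(0.01, 10))  # 0.1 vs. ~0.63: neighborhoods stop being local
```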
Statistical decisions
if instead, for approximating $f(x) = E(Y \mid X = x)$, we assume the linear model $f(x) = x^\top \beta$, we get
$$\beta = E(X X^\top)^{-1} E(XY)$$
• no conditioning, global approximation
Statistical decisions for discrete Y
if $Y \in \{0, 1\}$, consider the loss function
$$L(Y, f(X)) = \begin{cases} 0 & \text{if } f(X) = Y \\ 1 & \text{otherwise} \end{cases}$$
then $\mathrm{EPE} = E_X \sum_{y \in \{0,1\}} L(y, f(X))\, P(y \mid X)$ and hence
$$Y(x) = \operatorname{argmin}_{y_0 \in \{0,1\}} \sum_{y \in \{0,1\}} L(y, y_0)\, P(y \mid X = x) = \operatorname{argmin}_{y_0 \in \{0,1\}} \big(1 - P(y_0 \mid X = x)\big)$$
which yields the Bayes classifier
$$Y(x) = \operatorname{argmax}_y P(y \mid X = x)$$
question: how to model $P(Y \mid X)$?
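One standard answer is to model class-conditional densities and apply Bayes' rule, P(y | x) ∝ p(x | y) P(y). A sketch with gaussian class densities (an assumption made here for illustration, not the slides' method):

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, means, covs, priors):
    """Bayes classifier: argmax_y P(y | X = x) with gaussian class-conditional densities."""
    posteriors = [multivariate_normal.pdf(x, mean=m, cov=c) * p
                  for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(posteriors))
```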
Bayes classifier results
[Figures: Bayes classifier decision boundaries on the three example data sets]
Method combinations
• nonlinear models, e.g. $f(x) = \sum_{j=1}^{p} f_j(x_j)$, or basis expansion $f(x) = \sum_j h_j(x) \beta_j$ with polynomial, Fourier or sigmoidal bases (→ neural networks)
• prediction/function approximation by maximum-likelihood estimation of the parameters
• enhance generalizability by adding a regularization term $+\lambda J(f)$ to $\mathrm{RSS}(f)$ for $f$ from some function class
• generalize inner-product methods to nonlinear situations by a high-dimensional embedding $x \mapsto \Phi(x)$ and kernels $k(x, x') = \Phi(x)^\top \Phi(x')$
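For intuition on the last point, a tiny hypothetical example (degree-2 polynomial kernel in R²): the kernel evaluates the inner product of an explicit embedding Φ without ever forming it.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 embedding of x in R^2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, xp):
    """Polynomial kernel k(x, x') = (x^T x')^2 = phi(x)^T phi(xp)."""
    return float(x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(k(x, xp), phi(x) @ phi(xp))
```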
Clustering
• explanation by example
  • goal: differentiate the hand-written digits 2 and 4
  • given a set of unknown gray-scale images of 2s and 4s, find the subset of 2s and the subset of 4s
• versus: unsupervised learning by example
• like a baby
Example data set
• here: machine learning, i.e. a statistical approach
• needs many test cases: here 1000 28×28 images each
• interpret each 28×28 image as an element of R^784
• dimension reduction via PCA to only 2 dimensions
[Figure: the digit images projected onto the first two principal components]
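A sketch of such a PCA projection via the SVD (illustration only; assumes the images are flattened into the rows of X):

```python
import numpy as np

def pca_project(X, d=2):
    """Project rows of X (e.g. 1000 x 784 digit vectors) onto the top d principal components."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                               # coordinates in the d leading directions
```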
k-means
• clustering:
  • data vectors (samples) x(1), x(2), ..., x(T) ∈ R^n
  • distance measure d(x, y) between samples
• algorithm: k-means
  • given number k of clusters
  • initialize centroids randomly
  • update rules: batch or sequential (online)
• cost function
  • minimize $E(c_i, C_i) := \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, c_i)^2$
[Theis, Gruber. Grassmann clustering. Proc. EUSIPCO 2006]
[Animation: batch k-means — samples and centroids, partition of the samples, assignment to the nearest centroid, centroid update; sequential k-means — pick an arbitrary sample, find the nearest centroid, update it]
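A sketch of batch k-means with the Euclidean distance (illustration only; assumes no cluster runs empty):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Batch k-means: alternate cluster assignment and centroid update until convergence."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # assignment step
        c_new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(c_new, c):
            break
        c = c_new                                      # update step
    return labels, c
```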
Batch k-means
[Figures: batch k-means on the digit data after 1-7 iterations; done: error 4.5%]
Partitional clustering
• goal:
  • given a set A of points in a metric space (M, d)
  • find a partition of A into $B_i$, with $\bigcup_i B_i = A$, and centroids $c_i \in M$ minimizing
$$E(B_1, c_1, \ldots, B_k, c_k) := \sum_{i=1}^{k} \sum_{a \in B_i} d(a, c_i)^2 \qquad (1)$$
• $A = \{a_1, \ldots, a_T\}$ ⇒ constrained nonlinear optimization problem
• minimize
$$E(W, C) := \sum_{i=1}^{k} \sum_{t=1}^{T} w_{it}\, d(a_t, c_i)^2 \qquad (2)$$
subject to
$$w_{it} \in \{0, 1\}, \quad \sum_{i=1}^{k} w_{it} = 1 \quad \text{for } 1 \le i \le k,\ 1 \le t \le T \qquad (3)$$
• $C := \{c_1, \ldots, c_k\}$ centroid locations, $W := (w_{it})$ partition matrix
Minimize this!
• common approach: partial optimization for W and C
  • alternately minimize over W and C while keeping the other one fixed
• ⇒ batch k-means algorithm
  • initial random choice of centroids $c_1, \ldots, c_k$
  • iterate until convergence:
    • cluster assignment: for each $a_t$ determine an index $i(t)$ such that $i(t) = \operatorname{argmin}_i d(a_t, c_i)$
    • cluster update: within each cluster $B_i := \{a_t \mid i(t) = i\}$ determine the centroid by minimizing $c_i := \operatorname{argmin}_c \sum_{a \in B_i} d(a, c)^2$
• convergence to a local minimum
Euclidean case
• special case: M := R^n with the Euclidean distance $d(x, y) := \|x - y\|$
• centroids can be calculated in closed form:
  • the centroid is given by the cluster mean
$$c_i := \frac{1}{|B_i|} \sum_{a \in B_i} a$$
  • this follows directly from
$$\sum_{a \in B_i} \|a - c_i\|^2 = \sum_{a \in B_i} \sum_{j=1}^{n} (a_j - c_{ij})^2 = \sum_{j=1}^{n} \sum_{a \in B_i} (a_j^2 - 2 a_j c_{ij} + c_{ij}^2)$$
Extensions
$$c_i := \operatorname{argmin}_c \sum_{a \in B_i} d(a, c)^p$$
• more difficult optimization problems:
  • non-Euclidean spaces, e.g. RP^n or Grassmann manifolds
  • extensions from p = 2 to e.g. p = 1 or p < …
  • p = 1 corresponds to finding the spatial median of $B_i$
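For p = 1 there is no closed form, but the spatial median can be found by the classical Weiszfeld iteration; a sketch, added here for illustration and not taken from the slides:

```python
import numpy as np

def spatial_median(A, iters=100, eps=1e-9):
    """Weiszfeld iteration for argmin_c sum_{a in B_i} ||a - c|| (the p = 1 centroid)."""
    c = A.mean(axis=0)                  # start from the p = 2 centroid, the mean
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(A - c, axis=1), eps)  # guard against division by zero
        w = 1.0 / d
        c_new = (w[:, None] * A).sum(axis=0) / w.sum()
        if np.linalg.norm(c_new - c) < eps:
            break
        c = c_new
    return c
```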
Independent component analysis
example: cocktail-party problem of the brain
[Figure: fMRI components — auditory cortex, word detection, decision, auditory cortex 2]
[Keck, Theis, Gruber, Lang, Specht, Puntonet. 3D spatial analysis of fMRI data on a word perception task. LNCS, 3195:977-984]
BSS model
• blind source separation (BSS) problem
$$x(t) = A s(t) + \varepsilon(t)$$
  • x(t): observed m-dimensional random vector
  • A: (unknown) full-rank m × n matrix
  • s(t): (unknown) n-dimensional source signals (here: n ≤ m)
  • ε(t): (unknown) white noise
• goal: given x, recover A and s!
• additional assumptions necessary:
  • stochastically independent s(t): $p_s(s_1, \ldots, s_n) = p_{s_1}(s_1) \cdots p_{s_n}(s_n)$ ⇒ independent component analysis (ICA)
  • sparse source signals $s_i(t)$ ⇒ sparse component analysis (SCA)
  • nonnegative s and A ⇒ nonnegative matrix factorization (NMF)
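A small end-to-end BSS sketch using scikit-learn's FastICA (one of many ICA implementations, not the algorithms cited on these slides; the sources and mixing matrix are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # two independent, non-gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])             # unknown full-rank mixing matrix
x = s @ A.T + 0.01 * rng.normal(size=s.shape)      # observations x(t) = A s(t) + noise

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)   # recovered sources, up to scaling L and permutation P
```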
• important questions in data analysis
  • model? (restrictions on A and s)
  • indeterminacies of the model?
  • algorithmic identification given x?
• identifiability
  • obvious indeterminacies: scaling L and permutation P

Theorem
Let the independent random vector $s \in L^2$ contain at most one gaussian component. Given two ICA solutions $As = A's'$, then $A = A'LP$.

Note: the theorem does not hold for gaussian sources s.
[Theis. A new concept for separability problems in blind source separation. Neural Computation, 2004]
ICA algorithms
• basic scheme of ICA algorithms (case m = n)
  • search for an invertible demixing matrix W that minimizes some dependence measure of Wx
• some contrasts
  • minimize the mutual information I(Wx)
  • maximize the neural network output entropy H(f(Wx))
  • extend PCA by performing nonlinear decorrelation
  • maximize the non-gaussianity of the output components (Wx)_i
  • minimize the off-diagonal error of $H_{\ln p_{Wx}}$
  • minimize the median deviation of Wx
[Theis et al. Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 2003]
[Theis, Lang, Puntonet. A geometric algorithm for overcomplete linear ICA. Neurocomputing, 2004]
Optimization
• problem: minimize a cost function f(W) on Gl(n) or O(n)
• often: gradient descent, ΔW ∝ −∇f(W)
• in high dimensions: simulated annealing or genetic algorithms
• use the non-Euclidean structure of Gl(n)
  • the Euclidean gradient is not compatible with the group Gl(n)
  • define the natural gradient
$$\nabla^{\mathrm{nat}} f(W) = \nabla^{\mathrm{euc}} f(W)\, W^\top W$$
  ⇒ considerable performance increase
[Stadlthanner, Theis, Puntonet, Lang. Extended sparse nonnegative matrix factorization. LNCS, 3512:249-256]
[Squartini, Theis. New Riemannian metrics for speeding-up the convergence of over- and underdetermined ICA. In preparation]
[Theis. Gradients on matrix manifolds and their chain rule. Submitted to NIPS LR, 2005]
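A minimal sketch of one descent step with the natural gradient on Gl(n), implementing the update rule above (the cost function and its Euclidean gradient are placeholders to be supplied by the caller):

```python
import numpy as np

def natural_gradient_step(W, euc_grad, lr=0.1):
    """Descent step using the natural gradient: grad_nat f(W) = grad_euc f(W) W^T W."""
    return W - lr * euc_grad @ W.T @ W
```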
fMRI analysis
• functional magnetic resonance imaging
• noninvasive brain imaging technique ⇒ information on brain activation patterns
• activation maps help identify task-related brain regions
• BSS techniques are applicable to fMRI
[Figure: spatial-only BSS of the fMRI sequence]
Experimental setup
• experiment
  • block design protocol:
    • 5 time instants of visual stimulation
    • 5 instants of rest
  • 100 scans taking 3 s each
• data set
  • well-known design → expected activity in the visual cortex
  • here: use only a single horizontal slice
• preprocessing
  • motion correction
  • smoothing
data acquired by D. Auer, MPI of Psychiatry, Munich
Results
[Figures: (a) spatial sources s^S, components 1-4; (b) temporal sources t^S with stimulus cross-correlations cc = 0.18, 0.00, 0.05, 0.90]
• component 2 partially represents the frontal eye fields
• component 4: stimulus component, cc = 0.9 with the stimulus
[Theis, Gruber, Keck, Lang. Functional MRI analysis by a novel spatiotemporal ICA algorithm. LNCS 3696:677-682]
Independent subspace analysis
Why extend ICA?
• identifiability of ICA only holds if the data follows the generative model with independent sources
• simulation
  • apply ICA to data not fulfilling the ICA model
  • here the sources consist of a 2-d and a 1-d irreducible component
  • plot the Amari error over 100 runs
[Figure: crosstalking error of FastICA, JADE, and Extended Infomax over the 100 runs]
result: no recovery of the mixing matrix
Independent subspace analysis
• require stochastic independence only between groups of source components
• the nk-dimensional S is said to be k-independent, i.e.
$$\begin{pmatrix} S_1 \\ \vdots \\ S_k \end{pmatrix}, \ldots, \begin{pmatrix} S_{nk-k+1} \\ \vdots \\ S_{nk} \end{pmatrix} \text{ mutually independent}$$
⇒ independent subspace analysis (ISA)
• recent result: extension to arbitrary group sizes
• major advantage: general independent subspace analysis (ISA) always exists
[Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 2004]
PCA
[Diagram omitted; labels: X, S, A]
ICA
[Diagram omitted; labels: X, S, L, P, A]
ISA with fixed group size
[Diagram omitted; labels: X, S, L, P, A]
General ISA
[Diagram omitted; labels: X, S, L, P, A]
ISA framework
Definition
Y is an independent component of X :⇔ there exists an invertible A with X = A(Y, Z) such that Y and Z are stochastically independent.

Definition (general ISA)
• S is irreducible if it contains no lower-dimensional independent component.
• W ∈ Gl(n) is an independent subspace analysis of X :⇔ WX = (S1, . . . , Sk) with pairwise independent, irreducible Si.

Theorem
Given a random vector X with existing covariance, an ISA of X exists and is unique except for scaling and permutation.
Algebraic ISA algorithms
• main idea: source condition matrices Ci(S) are block-diagonal
• subspace JADE (see the sketch below)
  • after whitening, assume orthogonal A
  • group-independence of S: the contracted quadricovariance matrices Cij(S) are block-diagonal
  • perform joint block diagonalization of the Cij(X) to get A⊤
• for general ISA, estimate the block structure after diagonalization

  Cij(X) = A Cij(S) A⊤
[Theis. Towards a general independent subspace analysis. NIPS 2006 accepted]
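As a concrete illustration of the quantities subspace JADE works with, here is a minimal NumPy sketch (not the slide's implementation) estimating the contracted quadricovariance matrices Cij(x); it assumes the data is already whitened and zero-mean, and the function name and shapes are choices of this sketch.

```python
import numpy as np

def quadricov_matrices(x):
    """Estimate contracted quadricovariance matrices of whitened data.

    x: array of shape (n, T), assumed zero-mean with identity covariance.
    Returns the matrices C_ij with entries Cum(x_k, x_l, x_i, x_j), which
    for whitened data equal E[x_i x_j x x^T] - d_ij I - e_i e_j^T - e_j e_i^T.
    """
    n, T = x.shape
    I = np.eye(n)
    mats = []
    for i in range(n):
        for j in range(i, n):                    # C_ij = C_ji by symmetry
            C = (x * x[i] * x[j]) @ x.T / T      # fourth-moment part
            C -= I[i, j] * I                     # subtract Gaussian contributions
            C -= np.outer(I[i], I[j]) + np.outer(I[j], I[i])
            mats.append(C)
    return mats
```

For data following the ISA model these matrices are, up to estimation error, of the form A Cij(S) A⊤ with block-diagonal Cij(S), which is exactly what the joint block diagonalization on the next slide exploits.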
Joint Block Diagonalization with unknown block sizes
Joint Block Diagonalization (JBD)
• given n × n matrices C1, . . . , CK and a partition m = (m1, . . . , mr) of n with m1 + · · · + mr = n
• goal: find an orthogonal A such that A⊤CkA is m-block-diagonal for all k
⇒ minimize (e.g. by applying iterative Givens rotations)

  $f^{\mathbf{m}}(A) := \sum_{k=1}^{K} \left\| A^\top C_k A - \operatorname{diag}_{\mathbf{m}}\!\left(A^\top C_k A\right) \right\|_F^2$

• unknown block size m ⇒ general JBD then searches for the maximal-length block structure, i.e.

  $(A, \mathbf{m}) = \operatorname{argmax}_{\mathbf{m} \,\mid\, \exists A:\, f^{\mathbf{m}}(A) = 0} |\mathbf{m}|$

• result (JBD by JD): any block-optimal JBD, i.e. any zero of $f^{\mathbf{m}}$, is a local minimum of ordinary joint diagonalization; one can therefore run plain joint diagonalization first and recover the blocks afterwards (see the sketch below)
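The "JBD by JD" result suggests a simple route: jointly diagonalize the Ck with ordinary Jacobi/Givens sweeps, then read off the block structure. Below is a minimal NumPy sketch of that route; the rotation angle is the standard Cardoso-Souloumiac formula, and the function names and coupling threshold are assumptions of this sketch, not the slide's actual code.

```python
import numpy as np

def joint_diagonalize(Cs, sweeps=100, tol=1e-8):
    """Orthogonal joint diagonalization of symmetric matrices C_1..C_K
    by iterative Givens rotations (Jacobi sweeps)."""
    Cs = [C.copy() for C in Cs]
    n = Cs[0].shape[0]
    A = np.eye(n)
    for _ in range(sweeps):
        changed = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # 2x2 subproblem over all matrices (Cardoso-Souloumiac angle)
                g = np.array([[C[p, p] - C[q, q], C[p, q] + C[q, p]] for C in Cs])
                G = g.T @ g
                ton, toff = G[0, 0] - G[1, 1], G[0, 1] + G[1, 0]
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > tol:
                    changed = True
                    R = np.eye(n)
                    R[p, p] = R[q, q] = c
                    R[p, q], R[q, p] = -s, s
                    Cs = [R.T @ C @ R for C in Cs]
                    A = A @ R
        if not changed:
            break
    return A

def block_structure(Cs, A, rel_thresh=1e-3):
    """After JD, couple components whose off-diagonal coupling stays large,
    and return the connected components as the estimated blocks."""
    n = A.shape[0]
    M = sum(np.abs(A.T @ C @ A) for C in Cs)
    coupled = M > rel_thresh * M.max()
    blocks, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        comp, stack = set(), [i]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack += [v for v in range(n) if v != u and coupled[u, v]]
        seen |= comp
        blocks.append(sorted(comp))
    return blocks
```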
Example
[Figure: 40 × 40 heatmaps of the (unknown) C1; of A⊤Â without recovery of the permutation P; and of A⊤Â after recovery]
• performance of the proposed general JBD
• (unknown) block partition 40 = 1 + 2 + 2 + 3 + 3 + 5 + 6 + 6 + 6 + 6
• additive noise with an SNR of 5 dB, K = 100 matrices
• result: the estimated A equals A after permutation recovery (a toy check of the sketch above follows)
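A toy check of the JD sketch above in the same spirit as this experiment, though with smaller, hypothetical sizes (the block partition, dimensions and seed here are assumptions, not the slide's 40-dimensional setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 8, 20
blocks = [1, 2, 2, 3]                          # assumed block partition of n = 8
Ds = []
for _ in range(K):
    D = np.zeros((n, n))
    i = 0
    for b in blocks:                           # random symmetric diagonal blocks
        B = rng.standard_normal((b, b))
        D[i:i + b, i:i + b] = B + B.T
        i += b
    Ds.append(D)
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal mixing
Cs = [A0 @ D @ A0.T for D in Ds]
A = joint_diagonalize(Cs)
print(block_structure(Cs, A))   # should recover blocks of sizes 1, 2, 2, 3
```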
Extraction of fetal electrocardiograms
• separate fetal ECG (FECG) recordings from the mother's ECG (MECG)
• apply Hessian-based MICA with k = 2 and 500 Hessians
[Figure: 500-sample traces; (a) ECG recordings, (b) extracted sources, (c) MECG part, (d) FECG part]
[Theis. Blind signal separation into groups of dependent signals using joint block diagonalization. Proc. ISCAS 2005]
Sparse component analysis
[Theis, Puntonet, Lang. Median-based clustering for underdetermined blind signal processing. IEEE SPL, 2005]
Model
• Sparse Component Analysis (SCA) problem
x(t) = As(t)
• observed mixtures x(t) ∈ R^m
• A: (unknown) real matrix with linearly independent columns
• s(t): (unknown) (m − 1)-sparse sources s(t) ∈ R^n, i.e. s(t) has at most (m − 1) non-zero entries
• goal: recover the unknown A and s(t) given only x(t)

Theorem
If s(t) is (m − 1)-sparse and A and s(t) are in 'general position', both A and s(t) are identifiable (except for scaling and permutation).
[Georgiev, Theis, Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE TNN, 2005]
SCA algorithm
• matrix identification by multiple hyperplane detection
  • e.g. using a Hough transform
  • robust against outliers and noise
• source recovery using sparsity and the known matrix (see the sketch after the figure)
[Figure: 3-d scatter plot of the mixtures, concentrated on hyperplanes; all axes from −1 to 1]
[Theis, Georgiev, Cichocki. Robust sparse component analysis based on a generalized Hough transform. Signal Processing 2006]
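For the second step, source recovery given A, a minimal NumPy sketch of the nearest-hyperplane idea follows; this is a plain illustration, not the median-based or Hough-based algorithms of the cited papers, and the function name is an assumption of this sketch.

```python
import numpy as np
from itertools import combinations

def sca_recover_sources(X, A):
    """Recover (m-1)-sparse sources given the mixing matrix A (m x n).

    Each sample x(t) lies near a hyperplane spanned by m-1 columns of A;
    pick the closest such hyperplane, then solve the restricted
    least-squares problem on that support.
    """
    m, n = A.shape
    T = X.shape[1]
    supports = list(combinations(range(n), m - 1))
    normals = []
    for sup in supports:               # unit normal of span{a_i : i in sup}
        _, _, Vt = np.linalg.svd(A[:, sup].T)
        normals.append(Vt[-1])
    dists = np.abs(np.array(normals) @ X)     # |normal . x| per hyperplane
    best = np.argmin(dists, axis=0)
    S = np.zeros((n, T))
    for t in range(T):
        sup = list(supports[best[t]])
        s_hat, *_ = np.linalg.lstsq(A[:, sup], X[:, t], rcond=None)
        S[sup, t] = s_hat
    return S
```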
SCA of surface electromyograms
• electromyogram (EMG): electric signal generated by a contracting muscle
• surface EMG: non-invasive; however, the recorded sources overlap

cooperation with G. García, Bioinformatic Engineering, Osaka
Results
[Figure: source and SCA recovery within 8 artificial, dependent mixtures]

• results on toy data: sparseness works as a separation criterion
• real data
  • relative sEMG enhancement of 24.6 ± 21.4% (mean over 9 subjects)
  • beats standard signal processing and ICA

[Theis, García. On the use of sparse signal decomposition in the analysis of multi-channel surface EMGs. Signal Processing, 2006]
SCA of functional MRI data
[Figure: component maps (S) and time courses (A) of 5 components; crosscorrelations with the stimulus: 1: −0.16, 2: −0.28, 3: 0.13, 4: −0.04, 5: −0.88]
• complete SCA was performed using k-means hyperplane clustering (see the sketch below)
• components 2 and 3 represent the inner ventricles; component 1 contains the frontal eye fields
• component 5 is the desired visual stimulus component — active in the visual cortex (crosscorrelation with the stimulus |cc| = 0.88; fastICA yields a similar |cc| = 0.9)
Postnonlinear SCA
Given an m-dimensional random vector x, find a representation

  x = f(As)

with unknown
• n-dimensional random vector s (sources)
• m × n matrix A (mixing matrix)
• diagonal invertible f = f1 × · · · × fm (postnonlinearities)

postnonlinear ICA: s independent; here (SCA model): s is (m − 1)-sparse
Overcomplete postnonlinear cocktail-party problem
[Diagram: cocktail-party mixture with sensor postnonlinearities f1 and f2]
Postnonlinearity identification lemma
Given an invertible 2 × 2 matrix A, define an L at 0 as

  L := A([0, ε) × {0} ∪ {0} × [0, ε)).

Lemma
If a diagonal analytic diffeomorphism h := h1 × h2 maps an L (in 'general position') at 0 again onto an L at 0, then it is a linear scaling.
Identifiability
• due to linear identifiability it is enough to show: if $f(As) = \tilde f(\tilde A \tilde s)$, then $h = \tilde f^{-1} \circ f$ is a linear scaling
• case m = 2: the images of $As$ and $\tilde A \tilde s$ are finite unions of L's, so this follows from the previous lemma
Identifiability: proof
[Diagram: proof sketch; the two mixing chains $R^3 \to R^2 \to R^2$ given by $A, f$ and $\tilde A, \tilde f$]
Algorithm
• multistage separation algorithm:
  • find separating nonlinearities g
  • estimate the mixing matrix A of the linearized model g(x)
  • estimate the sources given A and g(x)
• how can g be found algorithmically?
Postnonlinearity detection
• for simplicity assume m = 2
• geometrical preprocessing: determine two 1-dimensional submanifolds in the image of x
• find curves y(t) and z(t) in R² which are mapped onto an L by g
• simple method (sketched below):
  • choose arbitrary starting points y(t1) and z(t1) among the samples of x
  • iteratively pick the closest sample to the previous y(ti−1) resp. z(ti−1) with smaller modulus
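A minimal sketch of this curve-tracing step, assuming the samples are the columns of a (2, T) array; the greedy neighbor selection follows the description above, and the names are choices of this sketch.

```python
import numpy as np

def trace_curve(X, start):
    """Greedily trace a 1-d submanifold from a starting sample toward 0.

    X: (2, T) array of mixture samples; start: column index of y(t1).
    Repeatedly jump to the nearest unvisited sample whose norm is smaller
    than the current one, and return the visited points ordered outward
    from the origin.
    """
    pts = X.T
    norms = np.linalg.norm(pts, axis=1)
    cur, visited, path = start, {start}, [pts[start]]
    while True:
        cands = [i for i in np.flatnonzero(norms < norms[cur]) if i not in visited]
        if not cands:
            break
        d = np.linalg.norm(pts[cands] - pts[cur], axis=1)
        cur = cands[int(np.argmin(d))]
        visited.add(cur)
        path.append(pts[cur])
    return np.asarray(path)[::-1]
```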
Postnonlinearity detection
[Figure: three 100-sample source signals mapped by f ∘ A to two mixture signals; scatter plots of the geometrical preprocessing and of the mixture density]
Postnonlinearity detection
• reparametrization (y := y ∘ y1⁻¹) of the curves gives y1 = z1 = id
• hence g ∘ y = (g1, a g1) and g ∘ z = (g1, b g1)
• so g2 ∘ y2 = a g1 = (a/b) (g2 ∘ z2)
• analytical geometrical postnonlinearity detection: find an analytic 1-d diffeomorphism g with

  g ∘ y = c (g ∘ z)

  for c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0
• note c = y′(0)/z′(0)
Postnonlinearity detection
• the equation g ∘ y = c (g ∘ z) can be solved in different ways:
  • calculate composite derivatives using Faà di Bruno's formula ⇒ derivatives of y and z lead to estimates of the derivatives of g
  • least-squares polynomial fit of g using the energy function

    $E = \frac{1}{2T} \sum_{i=1}^{T} \big( g(y(t_i)) - c\, g(z(t_i)) \big)^2$

  • MLP approximation of g using E from above (polynomial variant sketched below)
• fix g(0) = 0 and g′(0) = 1
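The polynomial variant becomes linear in the unknown coefficients once g(0) = 0 and g′(0) = 1 are fixed, so it reduces to a single least-squares solve. A minimal NumPy sketch under these assumptions (the degree and names are choices of this sketch, not the MLP used in the experiments):

```python
import numpy as np

def fit_separating_pnl(y, z, c, degree=7):
    """Least-squares polynomial fit of g with g(0) = 0 and g'(0) = 1.

    Minimizes E = sum_i (g(y_i) - c g(z_i))^2 over
    g(x) = x + sum_{d=2..degree} w_d x^d, which is linear in w.
    y, z: samples y(t_i), z(t_i) as 1-d arrays; c = y'(0)/z'(0).
    """
    d = np.arange(2, degree + 1)
    M = y[:, None] ** d - c * (z[:, None] ** d)   # columns: y^d - c z^d
    b = -(y - c * z)                              # from the fixed linear term x
    w, *_ = np.linalg.lstsq(M, b, rcond=None)
    return lambda x: np.asarray(x) + (np.asarray(x)[..., None] ** d) @ w
```

Applied per coordinate (with the curves y2, z2 from the geometrical preprocessing), this yields an estimate of each separating postnonlinearity g_i.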
Artificial mixtures
• artificial example (data generation sketched below)
  • postnonlinear mixture of n = 3 uniform sources (10⁵ samples) to m = 2 observations
  • postnonlinear mixing model x = f1 × f2 (As)
  • mixing matrix $A = \begin{pmatrix} 4.3 & 7.8 & 0.5 \\ 99 & 6.2 & 10 \end{pmatrix}$
  • postnonlinearities f1(x) = tanh(x) + 0.1x and f2(x) = x
• algorithm
  • MLP-based postnonlinearity detection algorithm
  • natural gradient-descent learning
  • parameters: 9 hidden neurons, learning rate η = 0.01, 10⁵ iterations
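A sketch generating data as described above; the uniform range and the way (m − 1)-sparsity is enforced are assumptions (the slide only states "uniform sources"), and the matrix entries are as recovered from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 10**5
S = rng.uniform(-0.5, 0.5, size=(n, T))      # uniform sources (range assumed)
active = rng.integers(0, n, size=T)          # (m-1)-sparse: one active source
mask = np.zeros((n, T), dtype=bool)
mask[active, np.arange(T)] = True
S *= mask
A = np.array([[4.3, 7.8, 0.5],
              [99., 6.2, 10.]])
Z = A @ S
X = np.vstack([np.tanh(Z[0]) + 0.1 * Z[0],   # f1
               Z[1]])                        # f2 = id
```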
PNL detection
[Figure: mixing pnls f (f1, f2) and separating pnls g (g1, g2); recovered 100-sample sources with SIRs of 26, 71 and 46 dB; density of the recovered sources]
Conclusions
• analyze statistical patterns in data sets x(t)
• method: factorization model x(t) = f(s(t))
  • supervised training of f ⇒ nearest neighbor (local), regression (global)
  • unsupervised identification (often linear) ⇒ clustering (local model), blind source separation (linear model)
• applications: biomedical data analysis, signal processing, financial markets etc.
Current application — with T. Schroder, HMGU
• unsupervised clustering of subtrees
• supervised learning of cell shapes
• parameter estimation of dynamical system for cell fate decision