LETTER Communicated by Erkki Oja
Variational Bayesian Learning of ICA with Missing Data
Kwokleung Chan, kwchan@salk.edu, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, USA
Te-Won Lee, tewon@salk.edu, Institute for Neural Computation, University of California at San Diego, La Jolla, CA 92093, USA
Terrence J. Sejnowski, terry@salk.edu, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, USA, and Department of Biology, University of California at San Diego, La Jolla, CA 92093, USA
Missing data are common in real-world data sets and are a problem for many estimation techniques. We have developed a variational Bayesian method to perform independent component analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixtures of gaussians allows sources to be estimated with different kurtosis and skewness. Unlike the maximum likelihood approach, the variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. The technique is also extended to the clusters of ICA and supervised classification frameworks.
1 Introduction
Data density estimation is an important step in many machine learning problems. Often we are faced with data containing incomplete entries. The data may be missing due to measurement or recording failure. Another frequent cause is difficulty in collecting complete data. For example, it could be expensive and time-consuming to perform some biomedical tests. Data scarcity is not uncommon, and it would be very undesirable to discard those data points with missing entries when we already have a small data set. Traditionally, missing data are filled in by mean imputation or regression imputation during preprocessing. This could introduce biases into the data cloud density and adversely affect subsequent analysis. A more principled way would be to use probability density estimates of the missing entries instead of point estimates. A well-known example of this approach is the use of the expectation-maximization (EM) algorithm in fitting incomplete data with a single gaussian density (Little & Rubin, 1987).

Neural Computation 15, 1991–2011 (2003). © 2003 Massachusetts Institute of Technology.
Independent component analysis (ICA; Hyvarinen, Karhunen, & Oja, 2001) assumes the observed data x are generated from a linear combination of independent sources s,

    x = A s + ν,    (1.1)

where A is the mixing matrix, which can be nonsquare. The sources s have nongaussian density, such as p(s_l) ∝ exp(−|s_l|^q). The noise term ν can have nonzero mean. ICA tries to locate independent axes within the data cloud and was developed for blind source separation. It has been applied to speech separation and to analyzing fMRI and EEG data (Jung et al., 2001). ICA is also used to model data density, describing data as linear mixtures of independent features, and to find projections that may uncover interesting structure in the data. Maximum likelihood learning of ICA with incomplete data has been studied by Welling and Weber (1999) in the limited case of a square mixing matrix and predefined source densities.
Many real-world data sets have intrinsic dimensionality smaller than that of the observed data. With missing data, principal component analysis cannot be used to perform dimension reduction as preprocessing for ICA. Instead, the variational Bayesian method applied to ICA can handle small data sets with high observed dimension (Chan, Lee, & Sejnowski, 2002; Choudrey & Roberts, 2001; Miskin, 2000). The Bayesian method prevents overfitting and performs automatic dimension reduction. In this article, we extend the variational Bayesian ICA method to problems with missing data. More important, the probability density estimate of the missing entries can be used to fill in the missing values. This allows the density model to be refined and made more accurate.
2 Model and Theory
2.1 ICA Generative Model with Missing Data. Consider a data set of T data points in an N-dimensional space, X = {x_t ∈ R^N}, t ∈ {1, ..., T}. Assume a noisy ICA generative model for the data:

    P(x_t | θ) = ∫ N(x_t | A s_t + ν, Ψ) P(s_t | θ_s) ds_t,    (2.1)

where A is the mixing matrix, and ν and [Ψ]^(−1) are the observation mean and diagonal noise variance, respectively. The hidden source s_t is assumed to have L dimensions. Similar to the independent factor analysis of Attias (1999), each component of s_t will be modeled by a mixture of K gaussians to allow for source densities of various kurtosis and skewness:
    P(s_t | θ_s) = ∏_{l=1}^{L} ( Σ_{k_l=1}^{K} π_{l,k_l} N(s_tl | φ_{l,k_l}, β_{l,k_l}) ).    (2.2)
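As a concrete illustration (not from the paper; NumPy and all variable names are assumptions), the factorized mixture-of-gaussians source density of equation 2.2 can be evaluated as follows, with precisions β rather than variances:

```python
import numpy as np

def source_density(s, pi, phi, beta):
    """P(s | theta_s) for one source vector under equation 2.2.

    s    : (L,) source vector
    pi   : (L, K) mixture weights, each row sums to 1
    phi  : (L, K) component means
    beta : (L, K) component precisions (inverse variances)
    """
    # N(s_l | phi_lk, beta_lk) for every source l and component k
    comp = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (s[:, None] - phi) ** 2)
    # mix over k within each source, then multiply across the independent sources
    return float(np.prod(np.sum(pi * comp, axis=1)))

# one source, one standard gaussian component: density at 0 is 1/sqrt(2*pi)
p = source_density(np.array([0.0]), np.ones((1, 1)), np.zeros((1, 1)), np.ones((1, 1)))
print(round(p, 4))  # 0.3989
```

With K ≥ 2 components per source, this parameterization can capture skewed and heavy- or light-tailed source densities, which is the point of equation 2.2.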
Split each data point into a missing part and an observed part, x_tᵀ = (x_t^oᵀ, x_t^mᵀ). In this article, we consider only the random missing case (Ghahramani & Jordan, 1994); that is, the probability for the missing entries x_t^m is independent of the value of x_t^m but could depend on the value of x_t^o. The likelihood of the data set is then defined to be

    L(θ; X) = ∏_t P(x_t^o | θ),    (2.3)
where

    P(x_t^o | θ) = ∫ P(x_t | θ) dx_t^m
      = ∫ [ ∫ N(x_t | A s_t + ν, Ψ) dx_t^m ] P(s_t | θ_s) ds_t
      = ∫ N(x_t^o | [A s_t + ν]_t^o, [Ψ]_t^o) P(s_t | θ_s) ds_t.    (2.4)
Here we have introduced the notation [·]_t^o, which means taking only the observed dimensions (corresponding to the tth data point) of whatever is inside the square brackets. Since equation 2.4 is similar to equation 2.1, the variational Bayesian ICA (Chan et al., 2002; Choudrey & Roberts, 2001; Miskin, 2000) can be extended naturally to handle missing data, but only if care is taken in discounting missing entries in the learning rules.
2.2 Variational Bayesian Method. In a full Bayesian treatment, the posterior distribution of the parameters θ is obtained by

    P(θ | X) = P(X | θ) P(θ) / P(X) = [ ∏_t P(x_t^o | θ) ] P(θ) / P(X),    (2.5)

where P(X) is the marginal likelihood, given as

    P(X) = ∫ ∏_t P(x_t^o | θ) P(θ) dθ.    (2.6)
The ICA model for P(X) is defined with the following priors on the parameters, P(θ):

    P(A_nl) = N(A_nl | 0, α_l),    P(π_l) = D(π_l | d_oπ_l),
    P(α_l) = G(α_l | a_oα_l, b_oα_l),    P(φ_{l,k_l}) = N(φ_{l,k_l} | µ_oφ_{l,k_l}, Λ_oφ_{l,k_l}),    (2.7)
    P(β_{l,k_l}) = G(β_{l,k_l} | a_oβ_{l,k_l}, b_oβ_{l,k_l}),
    P(ν_n) = N(ν_n | µ_oν_n, Λ_oν_n),    P(Ψ_n) = G(Ψ_n | a_oΨ_n, b_oΨ_n),    (2.8)
where N(·), G(·), and D(·) are the normal, gamma, and Dirichlet distributions, respectively:

    N(x | µ, Λ) = √( |Λ| / (2π)^N ) e^{ −(1/2)(x−µ)ᵀ Λ (x−µ) },    (2.9)
    G(x | a, b) = ( b^a / Γ(a) ) x^{a−1} e^{−bx},    (2.10)
    D(π | d) = ( Γ(Σ_k d_k) / ∏_k Γ(d_k) ) π_1^{d_1−1} × ··· × π_K^{d_K−1}.    (2.11)
Here a_o(·), b_o(·), d_o(·), µ_o(·), and Λ_o(·) are prechosen hyperparameters for the priors. Notice that Λ in the normal distribution is an inverse covariance parameter.
Under the variational Bayesian treatment, instead of performing the integration in equation 2.6 to solve for P(θ | X) directly, we approximate it by Q(θ) and opt to minimize the Kullback-Leibler distance between them (Mackay, 1995; Jordan, Ghahramani, Jaakkola, & Saul, 1999):

    −KL( Q(θ) ‖ P(θ | X) ) = ∫ Q(θ) log [ P(θ | X) / Q(θ) ] dθ
      = ∫ Q(θ) [ Σ_t log P(x_t^o | θ) + log ( P(θ) / Q(θ) ) ] dθ − log P(X).    (2.12)
Since −KL( Q(θ) ‖ P(θ | X) ) ≤ 0, we get a lower bound for the log marginal likelihood,

    log P(X) ≥ ∫ Q(θ) Σ_t log P(x_t^o | θ) dθ + ∫ Q(θ) log [ P(θ) / Q(θ) ] dθ,    (2.13)
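The bound in equation 2.13 can be checked numerically on a toy problem. The sketch below (illustrative numbers, NumPy assumed; a two-point discrete parameter stands in for θ) verifies that the bound holds for an arbitrary Q(θ) and is tight when Q(θ) equals the true posterior:

```python
import numpy as np

# Toy discrete "parameter": theta takes two values with prior P(theta).
P_theta = np.array([0.3, 0.7])        # prior P(theta), hypothetical numbers
P_X_given = np.array([0.2, 0.6])      # likelihood P(X | theta)
log_PX = np.log(P_theta @ P_X_given)  # exact log marginal, cf. equation 2.6

# An arbitrary approximate posterior Q(theta): the bound of equation 2.13.
Q = np.array([0.5, 0.5])
bound = Q @ np.log(P_X_given) + Q @ np.log(P_theta / Q)
assert bound <= log_PX  # the lower bound never exceeds log P(X)

# With Q equal to the true posterior, the bound is tight.
post = P_theta * P_X_given / np.exp(log_PX)
tight = post @ np.log(P_X_given) + post @ np.log(P_theta / post)
assert abs(tight - log_PX) < 1e-12
```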
which can also be obtained by applying Jensen's inequality to equation 2.6. Q(θ) is then solved by functional maximization of the lower bound. A separable approximate posterior Q(θ) will be assumed:

    Q(θ) = Q(ν) Q(Ψ) × Q(A) Q(α) × ∏_l [ Q(π_l) ∏_{k_l} Q(φ_{l,k_l}) Q(β_{l,k_l}) ].    (2.14)
The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior Q(θ) and the prior P(θ), is then expanded as

    ∫ Q(θ) log [ P(θ) / Q(θ) ] dθ
      = Σ_l ∫ Q(π_l) log [ P(π_l) / Q(π_l) ] dπ_l
      + Σ_{l,k_l} ∫ Q(φ_{l,k_l}) log [ P(φ_{l,k_l}) / Q(φ_{l,k_l}) ] dφ_{l,k_l}
      + Σ_{l,k_l} ∫ Q(β_{l,k_l}) log [ P(β_{l,k_l}) / Q(β_{l,k_l}) ] dβ_{l,k_l}
      + ∫∫ Q(A) Q(α) log [ P(A | α) / Q(A) ] dA dα
      + ∫ Q(α) log [ P(α) / Q(α) ] dα
      + ∫ Q(ν) log [ P(ν) / Q(ν) ] dν
      + ∫ Q(Ψ) log [ P(Ψ) / Q(Ψ) ] dΨ.    (2.15)
2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that P(x_t | θ) is replaced by P(x_t^o | θ) in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain Q(x_t^m | x_t^o), that is, the approximate distribution on the missing entries, which is given by

    Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ.    (2.16)

As noted by Welling and Weber (1999), elements of s_t given x_t^o are dependent. More important, under the ICA model, Q(s_t) is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data x and hidden variable s. The insets show the sample data in the two spaces. Here the hidden sources assume the density P(s_l) ∝ exp(−|s_l|^0.7). They are mixed noiselessly to give P(x) in the upper graph. The cut in the upper graph represents P(x_1 | x_2 = −0.5), which transforms into a highly correlated and nongaussian P(s | x_2 = −0.5).
Unless we are interested in only the first- and second-order statistics of Q(x_t^m | x_t^o), we should try to capture as much structure as possible of
Figure 1: Probability density functions for the data x (top) and hidden sources s (bottom). Insets show the sample data in the two spaces. The "cuts" show P(x_1 | x_2 = −0.5) and P(s | x_2 = −0.5).
P(s_t | x_t^o) in Q(s_t). In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down P(s_t) into a mixture of K^L gaussians in the L-dimensional s space:

    P(s_t) = ∏_l ( Σ_{k_l} π_{l,k_l} N(s_tl | φ_{l,k_l}, β_{l,k_l}) )
      = Σ_{k_1} ··· Σ_{k_L} [ π_{1,k_1} × ··· × π_{L,k_L} × N(s_t1 | φ_{1,k_1}, β_{1,k_1}) × ··· × N(s_tL | φ_{L,k_L}, β_{L,k_L}) ]
      = Σ_k π_k N(s_t | φ_k, β_k).    (2.17)
Here we have defined k to be a vector index. The "kth" gaussian is centered at φ_k, with inverse covariance β_k, in the source s space:

    k = (k_1, ..., k_l, ..., k_L)ᵀ,    k_l ∈ {1, ..., K},
    φ_k = (φ_{1,k_1}, ..., φ_{l,k_l}, ..., φ_{L,k_L})ᵀ,
    β_k = diag( β_{1,k_1}, ..., β_{L,k_L} ),
    π_k = π_{1,k_1} × ··· × π_{L,k_L}.    (2.18)
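The expansion in equations 2.17 and 2.18 is a Cartesian product over the per-source components. A hypothetical helper (names assumed; NumPy and itertools) that enumerates the K^L collapsed gaussians could look like this:

```python
import itertools
import numpy as np

def expand_source_mixture(pi, phi, beta):
    """Enumerate the K**L gaussians of equations 2.17-2.18.

    pi, phi, beta : (L, K) per-source mixture weights, means, precisions
    Returns a list of (pi_k, phi_k, beta_k) for every vector index k.
    """
    L, K = pi.shape
    expanded = []
    for k in itertools.product(range(K), repeat=L):  # vector index k = (k_1,...,k_L)
        idx = (np.arange(L), np.array(k))
        pi_k = float(np.prod(pi[idx]))               # pi_k = pi_{1,k1} x ... x pi_{L,kL}
        phi_k = phi[idx]                             # center phi_k in source space
        beta_k = np.diag(beta[idx])                  # diagonal precision matrix beta_k
        expanded.append((pi_k, phi_k, beta_k))
    return expanded

mix = expand_source_mixture(np.full((2, 2), 0.5),
                            np.array([[-1.0, 1.0], [-2.0, 2.0]]),
                            np.ones((2, 2)))
print(len(mix))  # 4, i.e., K**L expanded gaussians
```

The exponential growth in the number of expanded components is exactly the T × K^L complexity the paper discusses in section 4.2.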
The log likelihood of x_t^o is then expanded using Jensen's inequality:

    log P(x_t^o | θ) = log ∫ P(x_t^o | s_t, θ) Σ_k π_k N(s_t | φ_k, β_k) ds_t
      = log Σ_k π_k ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t
      ≥ Σ_k Q(k_t) log ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t + Σ_k Q(k_t) log [ π_k / Q(k_t) ].    (2.19)
Here Q(k_t) is a short form for Q(k_t = k); k_t is a discrete hidden variable, and Q(k_t = k) is the probability that the tth data point belongs to the kth gaussian. Recognizing that s_t is just a dummy variable, we introduce Q(s_kt),
Figure 2: A simplified directed graph for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters. The k_t indicates which of the K^L expanded gaussians generated s_t.
apply Jensen's inequality again, and get

    log P(x_t^o | θ) ≥ Σ_k Q(k_t) [ ∫ Q(s_kt) log P(x_t^o | s_kt, θ) ds_kt
      + ∫ Q(s_kt) log ( N(s_kt | φ_k, β_k) / Q(s_kt) ) ds_kt ]
      + Σ_k Q(k_t) log [ π_k / Q(k_t) ].    (2.20)
Substituting log P(x_t^o | θ) back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters, where k_t indicates which of the K^L expanded gaussians generated s_t.
3 Learning Rules

Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood, log P(X), with regard to Q(θ) (see equation 2.14), Q(k_t), and Q(s_kt) (see equation 2.20), for example,

    log Q(ν) = log P(ν) + ∫ Q(θ\ν) Σ_t log P(x_t^o | θ) dθ\ν + const.,    (3.1)
where θ\ν is the set of parameters excluding ν. This gives

    Q(ν) = ∏_n N(ν_n | µ_ν_n, Λ_ν_n),
    Λ_ν_n = Λ_oν_n + ⟨Ψ_n⟩ Σ_t o_nt,
    µ_ν_n = [ Λ_oν_n µ_oν_n + ⟨Ψ_n⟩ Σ_t o_nt Σ_k Q(k_t) ⟨x_nt − A_n· s_kt⟩ ] / Λ_ν_n.    (3.2)
Similarly,

    Q(Ψ) = ∏_n G(Ψ_n | a_Ψ_n, b_Ψ_n),
    a_Ψ_n = a_oΨ_n + (1/2) Σ_t o_nt,
    b_Ψ_n = b_oΨ_n + (1/2) Σ_t o_nt Σ_k Q(k_t) ⟨(x_nt − A_n· s_kt − ν_n)²⟩.    (3.3)
    Q(A) = ∏_n N(A_n· | µ_A_n·, Λ_A_n·),
    Λ_A_n· = diag( ⟨α_1⟩, ..., ⟨α_L⟩ ) + ⟨Ψ_n⟩ Σ_t o_nt Σ_k Q(k_t) ⟨s_kt s_ktᵀ⟩,
    µ_A_n· = ( ⟨Ψ_n⟩ Σ_t o_nt (x_nt − ⟨ν_n⟩) Σ_k Q(k_t) ⟨s_ktᵀ⟩ ) Λ_A_n·^{−1}.    (3.4)
    Q(α) = ∏_l G(α_l | a_α_l, b_α_l),
    a_α_l = a_oα_l + N/2,
    b_α_l = b_oα_l + (1/2) Σ_n ⟨A_nl²⟩.    (3.5)
    Q(π_l) = D(π_l | d_π_l),
    d_π_lk = d_oπ_lk + Σ_t Σ_{k: k_l = k} Q(k_t).    (3.6)
    Q(φ_{l,k_l}) = N(φ_{l,k_l} | µ_φ_{l,k_l}, Λ_φ_{l,k_l}),
    Λ_φ_{l,k_l} = Λ_oφ_{l,k_l} + ⟨β_{l,k_l}⟩ Σ_t Σ_{k: k_l = k} Q(k_t),
    µ_φ_{l,k_l} = [ Λ_oφ_{l,k_l} µ_oφ_{l,k_l} + ⟨β_{l,k_l}⟩ Σ_t Σ_{k: k_l = k} Q(k_t) ⟨s_ktl⟩ ] / Λ_φ_{l,k_l}.    (3.7)
    Q(β_{l,k_l}) = G(β_{l,k_l} | a_β_{l,k_l}, b_β_{l,k_l}),
    a_β_{l,k_l} = a_oβ_{l,k_l} + (1/2) Σ_t Σ_{k: k_l = k} Q(k_t),
    b_β_{l,k_l} = b_oβ_{l,k_l} + (1/2) Σ_t Σ_{k: k_l = k} Q(k_t) ⟨(s_ktl − φ_{l,k_l})²⟩.    (3.8)
    Q(s_kt) = N(s_kt | µ_s_kt, Λ_s_kt),
    Λ_s_kt = diag( ⟨β_{1,k_1}⟩, ..., ⟨β_{L,k_L}⟩ ) + ⟨ Aᵀ diag( o_1t Ψ_1, ..., o_Nt Ψ_N ) A ⟩,
    Λ_s_kt µ_s_kt = ( ⟨β_{1,k_1} φ_{1,k_1}⟩, ..., ⟨β_{L,k_L} φ_{L,k_L}⟩ )ᵀ + ⟨ Aᵀ diag( o_1t Ψ_1, ..., o_Nt Ψ_N ) (x_t − ν) ⟩.    (3.9)
In the above equations, ⟨·⟩ denotes the expectation over the posterior distributions Q(·), and A_n· is the nth row of the mixing matrix A. Σ_{k: k_l = k} means picking out those gaussians such that the lth element of their indices k has the value of k, and o_t is an indicator variable for observed entries in x_t:

    o_nt = { 1 if x_nt is observed; 0 if x_nt is missing.    (3.10)
For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of Q(π_l), Q(φ_{l,k_l}), and Q(β_{l,k_l}), it is better to rescale P(s_tl) to have zero mean and unit variance; Q(s_kt), Q(A), Q(α), Q(ν), and Q(Ψ) have to be adjusted correspondingly. Finally, Q(k_t) is given by

    log Q(k_t) = ⟨log P(x_t^o | s_kt, θ)⟩ + ⟨log N(s_kt | φ_k, β_k)⟩ − ⟨log Q(s_kt)⟩ + ⟨log π_k⟩ − log z_t,    (3.11)
where z_t is a normalization constant. The lower bound E(X, Q(θ)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(θ)) can be expressed as

    E(X, Q(θ)) = Σ_t log z_t + ∫ Q(θ) log [ P(θ) / Q(θ) ] dθ.    (3.12)
4 Missing Data

4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

    x_t^o = [A]_t^o · s_t    (4.1)

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In
this case, Q(s_t) is likely to be unimodal and peaked, point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with a small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is now underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single-gaussian approximation of Q(s_t) is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m | x_t^o). The approximate distribution Q(x_t^m | x_t^o) can be obtained by

    Q(x_t^m | x_t^o) = Σ_k Q(k_t) ∫ δ(x_t^m − x_kt^m) Q(x_kt^m | x_t^o, k) dx_kt^m,    (4.2)
where δ(·) is the delta function and

    Q(x_kt^m | x_t^o, k) = ∫ Q(θ) ∫ N(x_kt^m | [A s_kt + ν]_t^m, [Ψ]_t^m) Q(s_kt) ds_kt dθ
      = ∫∫ Q(A) Q(Ψ) N(x_kt^m | µ_x_kt^m, Λ_x_kt^m) dA dΨ,    (4.3)
    µ_x_kt^m = [ A µ_s_kt + µ_ν ]_t^m,    (4.4)
    Λ_x_kt^m^{−1} = [ A Λ_s_kt^{−1} Aᵀ + Λ_ν^{−1} + diag(Ψ^{−1}) ]_t^m.    (4.5)
Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x_t^m | x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m | x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing Σ_t
Figure 3: The approximation of Q(x_t^m | x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m | x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contribution from the individual Q(x_kt^m | x_t^o, k).
with Σ_t o_nt, and Ψ_n with o_nt Ψ_n, in their learning rules. The dashed line is the approximation Q(x_t^m | x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m | x_t^o) and a better estimate for ⟨x_t^m | x_t^o⟩. From equation 2.16,

    Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ,    (4.6)
and from the above formalism, Q(s_t) becomes

    Q(s_t) = Σ_k Q(k_t) ∫ δ(s_t − s_kt) Q(s_kt) ds_kt,    (4.7)
which is a mixture of K^L gaussians. The missing values can then be filled in by

    ⟨s_t | x_t^o⟩ = ∫ s_t Q(s_t) ds_t = Σ_k Q(k_t) µ_s_kt,    (4.8)
    ⟨x_t^m | x_t^o⟩ = ∫ x_t^m Q(x_t^m | x_t^o) dx_t^m
      = Σ_k Q(k_t) µ_x_kt^m = [A]_t^m ⟨s_t | x_t^o⟩ + [µ_ν]_t^m,    (4.9)
where µ_s_kt and µ_x_kt^m are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of Q(s_t) and Q(x_t^m | x_t^o) may be obtained, but then numerical methods are needed.

4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of the learning rules, suffers from many more local maxima, and has an inferior marginal likelihood lower bound. The problems are more serious at high missing data rates, and a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case, when all entries of a data point are missing, that is, empty x_t^o, Q(s_t) is the same as P(s_t | θ) and would not interfere with the learning of P(s_t | θ) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too. This is very undesirable when learning ICA structure.
5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easily extended to model data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_kt for each cluster are given a superscript index c. The parameter ρ = {ρ^1, ..., ρ^C} is introduced to represent the weights on the clusters; ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, ..., θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

    P(x_t | Θ) = Σ_c P(c_t = c | ρ) P(x_t | θ^c)
      = Σ_c P(c_t = c | ρ) ∫ N(x_t | A^c s_t^c + ν^c, Ψ^c) P(s_t^c | θ_s^c) ds_t^c.    (5.1)
The objective function in equation 2.13 remains the same but with θ replaced by Θ. The separable posterior Q(Θ) is given by

    Q(Θ) = Q(ρ) ∏_c Q(θ^c),    (5.2)
and similar to equation 2.15,

    ∫ Q(Θ) log [ P(Θ) / Q(Θ) ] dΘ = ∫ Q(ρ) log [ P(ρ) / Q(ρ) ] dρ + Σ_c ∫ Q(θ^c) log [ P(θ^c) / Q(θ^c) ] dθ^c.    (5.3)
Equation 2.20 now becomes

    log P(x_t^o | Θ) ≥ Σ_c Q(c_t) log [ P(c_t) / Q(c_t) ]
      + Σ_{c,k} Q(c_t) Q(k_t^c) [ ∫ Q(s_kt^c) log P(x_t^o | s_kt^c, θ^c) ds_kt^c
      + ∫ Q(s_kt^c) log ( N(s_kt^c | φ_k^c, β_k^c) / Q(s_kt^c) ) ds_kt^c ]
      + Σ_{c,k} Q(c_t) Q(k_t^c) log [ π_k^c / Q(k_t^c) ].    (5.4)
We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain
the same, only with Σ_t replaced by Σ_t Q(c_t). Finally, we need two more learning rules:

    d_ρ_c = d_oρ_c + Σ_t Q(c_t),    (5.5)
    log Q(c_t) = ⟨log ρ_c⟩ + log z_t^c − log Z_t,    (5.6)

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11), and Z_t is for normalizing Q(c_t).
6 Supervised Classification

It is generally difficult for discriminative classifiers such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998) to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.
Consider a data set (X_T, Y_T) = {x_t, y_t; t ∈ {1, ..., T}}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, ..., y, ..., Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

    P(y_{T+1} | x_{T+1}, X_T, Y_T, M)
      = P(x_{T+1} | y_{T+1}, X_T, Y_T, M) P(y_{T+1} | X_T, Y_T, M) / P(x_{T+1} | X_T, Y_T, M).    (6.1)
Here M denotes our generative model for the observations {x_t, y_t}:

    P(x_t, y_t | M) = P(x_t | y_t, M) P(y_t | M),    (6.2)

where P(x_t | y_t, M) could be a mixture model, as given by equation 5.1.
6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y and P(y_t | M) be parameterized by ω = (ω_1, ..., ω_Y):

    P(x_t | y_t = y, M) = P(x_t | Θ_y),    (6.3)
    P(y_t | M) = P(y_t = y | ω) = ω_y.    (6.4)

If ω is given a Dirichlet prior, P(ω | M) = D(ω | d_oω), its posterior also has a Dirichlet distribution:

    P(ω | Y_T, M) = D(ω | d_ω),    (6.5)
    d_ω_y = d_oω_y + Σ_t I(y_t = y).    (6.6)
I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.
Under the generative model of equation 6.2, it can be shown that

    P(Θ_y | X_T, Y_T, M) = P(Θ_y | X_y),    (6.7)

where X_y is the subset of X_T that contains only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 on the subset X_y.
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

    P(y_{T+1} = y | X_T, Y_T, M) = ∫ P(y_{T+1} = y | ω) P(ω | X_T, Y_T) dω = d_ω_y / Σ_{y'} d_ω_{y'}.    (6.8)
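Equations 6.6 and 6.8 amount to Dirichlet counting. A minimal sketch (hypothetical helper; NumPy assumed, classes indexed from 0 for convenience):

```python
import numpy as np

def class_prior(d_o, labels, num_classes):
    """P(y_{T+1} = y | X_T, Y_T, M) from equations 6.6 and 6.8.

    d_o         : scalar symmetric Dirichlet hyperparameter d_o
    labels      : training labels y_t in {0, ..., num_classes - 1}
    num_classes : Y, the number of classes
    """
    # d_y = d_o + Sum_t I(y_t = y), equation 6.6
    d = d_o + np.bincount(labels, minlength=num_classes)
    # normalize the Dirichlet counts, equation 6.8
    return d / d.sum()

print(class_prior(1.0, np.array([0, 0, 1]), 2))  # [0.6 0.4]
```

With two positives out of three training labels and a unit pseudocount per class, the predictive class prior is (1+2)/5 = 0.6 and (1+1)/5 = 0.4, which is the smoothing behavior equation 6.8 encodes.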
The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

    log P(x_{T+1} | y_{T+1} = y, X_T, Y_T, M)
      = log P(x_{T+1} | X_y, M)
      = log P(x_{T+1}, X_y | M) − log P(X_y | M)    (6.9)
      ≈ E({x_{T+1}, X_y}, Q′(Θ_y)) − E(X_y, Q(Θ_y)).    (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q′(Θ_y) and E({x_{T+1}, X_y}, Q′(Θ_y)). The error in the approximation is the difference KL( Q′(Θ_y) ‖ P(Θ_y | {x_{T+1}, X_y}) ) − KL( Q(Θ_y) ‖ P(Θ_y | X_y) ). If we assume further that Q′(Θ_y) ≈ Q(Θ_y),

    log P(x_{T+1} | X_y, M) ≈ ∫ Q(Θ_y) log P(x_{T+1} | Θ_y) dΘ_y = log Z_{T+1},    (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise
Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)
Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the modeled pdf by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.
Figure 6: E(X, Q(θ)) as a function of the number of hidden source dimensions (horizontal axis: number of dimensions; vertical axis: log marginal likelihood lower bound). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot the E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(θ | X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low
Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
uncertainty; and 9.6% of the pixels were missing from any two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We have also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of 1D gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.
Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.
Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.
Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.
Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
1992 K Chan T Lee and T Sejnowski
cloud density and adversely affect subsequent analysis A more principledway would be to use probability density estimates of the missing entriesinstead of point estimates A well-known example of this approach is theuse of the expectation-maximization (EM) algorithm in tting incompletedata with a single gaussian density (Little amp Rubin 1987)
Independent component analysis (ICA; Hyvarinen, Karhunen, & Oja, 2001) assumes the observed data x are generated from a linear combination of independent sources s:

x = A s + ν,   (1.1)

where A is the mixing matrix, which can be nonsquare. The sources s have nongaussian density, such as P(s_l) ∝ exp(−|s_l|^q). The noise term ν can have nonzero mean. ICA tries to locate independent axes within the data cloud and was developed for blind source separation. It has been applied to speech separation and to analyzing fMRI and EEG data (Jung et al., 2001). ICA is also used to model data density, describing data as linear mixtures of independent features and finding projections that may uncover interesting structure in the data. Maximum likelihood learning of ICA with incomplete data has been studied by Welling and Weber (1999) in the limited case of a square mixing matrix and predefined source densities.
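A minimal generative sketch of equation 1.1 (the sizes, seed, and the Laplacian stand-in for the exp(−|s|^q) source density are our illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

N, L, T = 7, 4, 200          # observed dims, hidden sources, samples (illustrative)
A = rng.normal(size=(N, L))  # mixing matrix, possibly nonsquare (here N > L)
nu = rng.normal(size=N)      # noise/bias term; may have nonzero mean

# Heavy-tailed (super-gaussian) sources, P(s_l) ~ exp(-|s_l|^q) with q < 2;
# a Laplacian (q = 1) is a convenient stand-in.
S = rng.laplace(size=(L, T))

# x = A s + noise, with a small gaussian observation noise added
X = A @ S + nu[:, None] + 0.05 * rng.normal(size=(N, T))
print(X.shape)  # (7, 200)
```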
Many real-world data sets have intrinsic dimensionality smaller than that of the observed data. With missing data, principal component analysis cannot be used to perform dimension reduction as preprocessing for ICA. Instead, the variational Bayesian method applied to ICA can handle small data sets with high observed dimension (Chan, Lee, & Sejnowski, 2002; Choudrey & Roberts, 2001; Miskin, 2000). The Bayesian method prevents overfitting and performs automatic dimension reduction. In this article, we extend the variational Bayesian ICA method to problems with missing data. More important, the probability density estimate of the missing entries can be used to fill in the missing values. This allows the density model to be refined and made more accurate.
2 Model and Theory
2.1 ICA Generative Model with Missing Data. Consider a data set of T data points in an N-dimensional space: X = {x_t ∈ R^N}, t ∈ {1, …, T}. Assume a noisy ICA generative model for the data:

P(x_t | θ) = ∫ N(x_t | A s_t + ν, Ψ) P(s_t | θ_s) ds_t,   (2.1)

where A is the mixing matrix, and ν and [Ψ]⁻¹ are the observation mean and diagonal noise variance, respectively. The hidden source s_t is assumed to have L dimensions. Similar to the independent factor analysis of Attias (1999), each component of s_t will be modeled by a mixture of K gaussians to allow for source densities of various kurtosis and skewness:

P(s_t | θ_s) = ∏_l ( ∑_{k_l} π_{lk_l} N(s_{tl} | φ_{lk_l}, β_{lk_l}) ).   (2.2)
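Equation 2.2 is a product over sources of one-dimensional mixtures of gaussians. A small sketch with made-up parameters, using the precision convention of equation 2.9:

```python
import numpy as np

def gauss(x, mu, beta):
    """Normal density with mean mu and *precision* beta (eq. 2.9 convention)."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

def source_density(s, pi, phi, beta):
    """P(s | theta_s) = prod_l sum_k pi[l,k] N(s[l] | phi[l,k], beta[l,k])  (eq. 2.2)."""
    per_dim = (pi * gauss(s[:, None], phi, beta)).sum(axis=1)
    return per_dim.prod()

# Two gaussians per source (K = 2) already allow skewed, kurtotic densities;
# all numbers below are illustrative.
pi   = np.array([[0.5, 0.5], [0.8, 0.2]])   # mixture weights per source
phi  = np.array([[-1.0, 1.0], [0.0, 3.0]])  # component means
beta = np.array([[4.0, 4.0], [1.0, 0.25]])  # component precisions

p = source_density(np.array([0.0, 0.0]), pi, phi, beta)
```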
Split each data point into a missing part and an observed part, x_t^T = (x_t^{oT}, x_t^{mT}). In this article, we consider only the random missing case (Ghahramani & Jordan, 1994); that is, the probability for the missing entries x_t^m is independent of the value of x_t^m but could depend on the value of x_t^o. The likelihood of the data set is then defined to be

L(θ; X) = ∏_t P(x_t^o | θ),   (2.3)

where

P(x_t^o | θ) = ∫ P(x_t | θ) dx_t^m
             = ∫ [ ∫ N(x_t | A s_t + ν, Ψ) dx_t^m ] P(s_t | θ_s) ds_t
             = ∫ N(x_t^o | [A s_t + ν]_t^o, [Ψ]_t^o) P(s_t | θ_s) ds_t.   (2.4)

Here we have introduced the notation [·]_t^o, which means taking only the observed dimensions (corresponding to the t-th data point) of whatever is inside the square brackets. Since equation 2.4 is similar to equation 2.1, the variational Bayesian ICA (Chan et al., 2002; Choudrey & Roberts, 2001; Miskin, 2000) can be extended naturally to handle missing data, but only if care is taken in discounting missing entries in the learning rules.
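The [·]_t^o notation corresponds to boolean-mask indexing when missing entries are coded as NaN; a sketch with illustrative values:

```python
import numpy as np

x_t = np.array([0.2, np.nan, -1.3, np.nan, 0.7])  # NaN marks a missing entry
o_t = ~np.isnan(x_t)                              # indicator of observed dimensions

A = np.arange(15.0).reshape(5, 3)                 # a hypothetical 5x3 mixing matrix
s_t = np.array([1.0, -1.0, 0.5])
nu = np.zeros(5)

# [A s_t + nu]^o_t : evaluate the model mean, then keep observed rows only
mean_obs = (A @ s_t + nu)[o_t]
x_obs = x_t[o_t]
print(mean_obs.shape, x_obs.shape)  # (3,) (3,)
```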
2.2 Variational Bayesian Method. In a full Bayesian treatment, the posterior distribution of the parameters θ is obtained by

P(θ | X) = P(X | θ) P(θ) / P(X) = [ ∏_t P(x_t^o | θ) ] P(θ) / P(X),   (2.5)

where P(X) is the marginal likelihood, given as

P(X) = ∫ ∏_t P(x_t^o | θ) P(θ) dθ.   (2.6)
The ICA model for P(X) is defined with the following priors on the parameters, P(θ):

P(A_nl) = N(A_nl | 0, α_l),        P(π_l) = D(π_l | d_o(π_l)),
P(α_l) = G(α_l | a_o(α_l), b_o(α_l)),        P(φ_{lk_l}) = N(φ_{lk_l} | μ_o(φ_{lk_l}), Λ_o(φ_{lk_l})),   (2.7)

P(β_{lk_l}) = G(β_{lk_l} | a_o(β_{lk_l}), b_o(β_{lk_l})),
P(ν_n) = N(ν_n | μ_o(ν_n), Λ_o(ν_n)),        P(ψ_n) = G(ψ_n | a_o(ψ_n), b_o(ψ_n)),   (2.8)

where N(·), G(·), and D(·) are the normal, gamma, and Dirichlet distributions, respectively:

N(x | μ, Λ) = √( |Λ| / (2π)^N ) exp( −(1/2) (x − μ)^T Λ (x − μ) ),   (2.9)

G(x | a, b) = ( b^a / Γ(a) ) x^{a−1} e^{−bx},   (2.10)

D(π | d) = ( Γ(∑_k d_k) / ∏_k Γ(d_k) ) π_1^{d_1 − 1} × ··· × π_K^{d_K − 1}.   (2.11)

Here a_o(·), b_o(·), d_o(·), μ_o(·), and Λ_o(·) are prechosen hyperparameters for the priors. Notice that Λ in the normal distribution is an inverse covariance parameter.
Under the variational Bayesian treatment, instead of performing the integration in equation 2.6 to solve for P(θ | X) directly, we approximate it by Q(θ) and opt to minimize the Kullback-Leibler distance between them (Mackay, 1995; Jordan, Ghahramani, Jaakkola, & Saul, 1999):

−KL( Q(θ) ‖ P(θ | X) ) = ∫ Q(θ) log ( P(θ | X) / Q(θ) ) dθ
 = ∫ Q(θ) [ ∑_t log P(x_t^o | θ) + log ( P(θ) / Q(θ) ) ] dθ − log P(X).   (2.12)

Since −KL( Q(θ) ‖ P(θ | X) ) ≤ 0, we get a lower bound for the log marginal likelihood:

log P(X) ≥ ∫ Q(θ) ∑_t log P(x_t^o | θ) dθ + ∫ Q(θ) log ( P(θ) / Q(θ) ) dθ,   (2.13)

which can also be obtained by applying Jensen's inequality to equation 2.6. Q(θ) is then solved by functional maximization of the lower bound. A separable approximate posterior Q(θ) will be assumed:

Q(θ) = Q(ν) Q(Ψ) × Q(A) Q(α) × ∏_l [ Q(π_l) ∏_{k_l} Q(φ_{lk_l}) Q(β_{lk_l}) ].   (2.14)
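The bound in equation 2.13 can be checked numerically on a toy problem with a discrete parameter; this sketch (our own construction, not from the paper) uses a two-valued θ so the integrals become sums, and shows that the bound is tight exactly when Q equals the true posterior:

```python
import numpy as np

def lower_bound(Q, prior, lik):
    """F(Q) = E_Q[log P(X|theta)] + E_Q[log P(theta)/Q(theta)]  (eq. 2.13, discrete theta)."""
    return np.sum(Q * np.log(lik)) + np.sum(Q * np.log(prior / Q))

prior = np.array([0.5, 0.5])   # P(theta), illustrative
lik = np.array([0.2, 0.6])     # P(X | theta), illustrative
log_evidence = np.log(prior @ lik)

posterior = prior * lik / (prior @ lik)   # exact P(theta | X)

F_any = lower_bound(np.array([0.3, 0.7]), prior, lik)  # arbitrary Q: strict bound
F_opt = lower_bound(posterior, prior, lik)             # optimal Q: equality
```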
The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior Q(θ) and the prior P(θ), is then expanded as

∫ Q(θ) log ( P(θ) / Q(θ) ) dθ
 = ∑_l ∫ Q(π_l) log ( P(π_l) / Q(π_l) ) dπ_l
 + ∑_{l,k_l} ∫ Q(φ_{lk_l}) log ( P(φ_{lk_l}) / Q(φ_{lk_l}) ) dφ_{lk_l}
 + ∑_{l,k_l} ∫ Q(β_{lk_l}) log ( P(β_{lk_l}) / Q(β_{lk_l}) ) dβ_{lk_l}
 + ∫∫ Q(A) Q(α) log ( P(A | α) / Q(A) ) dA dα + ∫ Q(α) log ( P(α) / Q(α) ) dα
 + ∫ Q(ν) log ( P(ν) / Q(ν) ) dν + ∫ Q(Ψ) log ( P(Ψ) / Q(Ψ) ) dΨ.   (2.15)
2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that P(x_t | θ) is replaced by P(x_t^o | θ) in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain Q(x_t^m | x_t^o), that is, the approximate distribution on the missing entries, which is given by

Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ.   (2.16)

As noted by Welling and Weber (1999), elements of s_t given x_t^o are dependent. More important, under the ICA model, Q(s_t) is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data x and the hidden variable s. The inserts show the sample data in the two spaces. Here the hidden sources assume the density P(s_l) ∝ exp(−|s_l|^{0.7}). They are mixed noiselessly to give P(x) in the upper graph. The cut in the upper graph represents P(x_1 | x_2 = −0.5), which transforms into a highly correlated and nongaussian P(s | x_2 = −0.5).

Unless we are interested in only the first- and second-order statistics of Q(x_t^m | x_t^o), we should try to capture as much structure as possible of
Figure 1: Probability density functions for the data x (top) and the hidden sources s (bottom). Inserts show the sample data in the two spaces. The "cuts" show P(x_1 | x_2 = −0.5) and P(s | x_2 = −0.5).
P(s_t | x_t^o) in Q(s_t). In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down P(s_t) into a mixture of K^L gaussians in the L-dimensional s space:

P(s_t) = ∏_l ( ∑_{k_l} π_{lk_l} N(s_{tl} | φ_{lk_l}, β_{lk_l}) )
       = ∑_{k_1} ··· ∑_{k_L} [ π_{1k_1} × ··· × π_{Lk_L} × N(s_{t1} | φ_{1k_1}, β_{1k_1}) × ··· × N(s_{tL} | φ_{Lk_L}, β_{Lk_L}) ]
       = ∑_k π_k N(s_t | φ_k, β_k).   (2.17)

Here we have defined k to be a vector index. The "k-th" gaussian is centered at φ_k, with inverse covariance β_k, in the source s space:

k = (k_1, …, k_l, …, k_L)^T,   k_l ∈ {1, …, K},
φ_k = (φ_{1k_1}, …, φ_{lk_l}, …, φ_{Lk_L})^T,
β_k = diag(β_{1k_1}, …, β_{Lk_L}),
π_k = π_{1k_1} × ··· × π_{Lk_L}.   (2.18)

The log likelihood of x_t^o is then expanded using Jensen's inequality:

log P(x_t^o | θ) = log ∫ P(x_t^o | s_t, θ) ∑_k π_k N(s_t | φ_k, β_k) ds_t
 = log ∑_k π_k ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t
 ≥ ∑_k Q(k_t) log ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t + ∑_k Q(k_t) log ( π_k / Q(k_t) ).   (2.19)

Here Q(k_t) is a short form for Q(k_t = k); k_t is a discrete hidden variable, and Q(k_t = k) is the probability that the t-th data point belongs to the k-th gaussian. Recognizing that s_t is just a dummy variable, we introduce Q(s_{kt}),
Figure 2: A simplified directed graph for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters. The k_t indicates which of the K^L expanded gaussians generated s_t.
apply Jensen's inequality again, and get

log P(x_t^o | θ) ≥ ∑_k Q(k_t) [ ∫ Q(s_{kt}) log P(x_t^o | s_{kt}, θ) ds_{kt} + ∫ Q(s_{kt}) log ( N(s_{kt} | φ_k, β_k) / Q(s_{kt}) ) ds_{kt} ] + ∑_k Q(k_t) log ( π_k / Q(k_t) ).   (2.20)

Substituting log P(x_t^o | θ) back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation for the generative model of variational ICA: x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters, where k_t indicates which of the K^L expanded gaussians generated s_t.
3 Learning Rules

Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood log P(X) with regard to Q(θ) (see equation 2.14), Q(k_t), and Q(s_{kt}) (see equation 2.20); for example,

log Q(ν) = log P(ν) + ∫ Q(θ\ν) ∑_t log P(x_t^o | θ) dθ\ν + const.,   (3.1)

where θ\ν is the set of parameters excluding ν. This gives

Q(ν) = ∏_n N(ν_n | μ(ν_n), Λ(ν_n)),

Λ(ν_n) = Λ_o(ν_n) + ⟨ψ_n⟩ ∑_t o_nt,

μ(ν_n) = [ Λ_o(ν_n) μ_o(ν_n) + ⟨ψ_n⟩ ∑_t o_nt ∑_k Q(k_t) ⟨x_nt − A_n· s_{kt}⟩ ] / Λ(ν_n).   (3.2)
Similarly,

Q(Ψ) = ∏_n G(ψ_n | a(ψ_n), b(ψ_n)),

a(ψ_n) = a_o(ψ_n) + (1/2) ∑_t o_nt,

b(ψ_n) = b_o(ψ_n) + (1/2) ∑_t o_nt ∑_k Q(k_t) ⟨(x_nt − A_n· s_{kt} − ν_n)²⟩.   (3.3)
Q(A) = ∏_n N(A_n· | μ(A_n·), Λ(A_n·)),

Λ(A_n·) = diag(⟨α_1⟩, …, ⟨α_L⟩) + ⟨ψ_n⟩ ∑_t o_nt ∑_k Q(k_t) ⟨s_{kt} s_{kt}^T⟩,

μ(A_n·) = ( ⟨ψ_n⟩ ∑_t o_nt (x_nt − ⟨ν_n⟩) ∑_k Q(k_t) ⟨s_{kt}^T⟩ ) Λ(A_n·)⁻¹.   (3.4)
Q(α) = ∏_l G(α_l | a(α_l), b(α_l)),

a(α_l) = a_o(α_l) + N/2,

b(α_l) = b_o(α_l) + (1/2) ∑_n ⟨A_nl²⟩.   (3.5)

Q(π_l) = D(π_l | d(π_l)),

d(π_l)_k = d_o(π_l)_k + ∑_t ∑_{k_l = k} Q(k_t).   (3.6)
Q(φ_{lk}) = N(φ_{lk} | μ(φ_{lk}), Λ(φ_{lk})),

Λ(φ_{lk}) = Λ_o(φ_{lk}) + ⟨β_{lk}⟩ ∑_t ∑_{k_l = k} Q(k_t),

μ(φ_{lk}) = [ Λ_o(φ_{lk}) μ_o(φ_{lk}) + ⟨β_{lk}⟩ ∑_t ∑_{k_l = k} Q(k_t) ⟨s_{ktl}⟩ ] / Λ(φ_{lk}).   (3.7)

Q(β_{lk}) = G(β_{lk} | a(β_{lk}), b(β_{lk})),

a(β_{lk}) = a_o(β_{lk}) + (1/2) ∑_t ∑_{k_l = k} Q(k_t),

b(β_{lk}) = b_o(β_{lk}) + (1/2) ∑_t ∑_{k_l = k} Q(k_t) ⟨(s_{ktl} − φ_{lk})²⟩.   (3.8)
Q(s_{kt}) = N(s_{kt} | μ(s_{kt}), Λ(s_{kt})),

Λ(s_{kt}) = diag(⟨β_{1k_1}⟩, …, ⟨β_{Lk_L}⟩) + ⟨ A^T diag(o_1t ψ_1, …, o_Nt ψ_N) A ⟩,

Λ(s_{kt}) μ(s_{kt}) = (⟨β_{1k_1} φ_{1k_1}⟩, …, ⟨β_{Lk_L} φ_{Lk_L}⟩)^T + ⟨ A^T diag(o_1t ψ_1, …, o_Nt ψ_N) (x_t − ν) ⟩.   (3.9)
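As a rough sketch of equation 3.9 (with point estimates substituted for the posterior expectations ⟨·⟩, and all sizes and values illustrative), the posterior precision and mean of s_{kt} can be computed as follows; note how the indicator o_t removes the missing dimensions from the data term:

```python
import numpy as np

def q_s_update(A, nu, psi, beta_k, phi_k, x_t, o_t):
    """Posterior precision and mean of s for expanded gaussian k (cf. eq. 3.9).
    Missing dimensions are discounted through the indicator o_t."""
    D = np.diag(o_t * psi)                   # diag(o_1t psi_1, ..., o_Nt psi_N)
    Lam = np.diag(beta_k) + A.T @ D @ A      # posterior precision
    x_filled = np.where(o_t == 1, x_t, 0.0)  # missing values drop out through D anyway
    rhs = beta_k * phi_k + A.T @ D @ (x_filled - nu)
    mu = np.linalg.solve(Lam, rhs)
    return Lam, mu

# Toy demo: 3 observed dims, 2 sources, middle dimension missing
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Lam, mu = q_s_update(A, np.zeros(3), np.ones(3), np.ones(2), np.zeros(2),
                     np.array([1.0, 999.0, 2.0]), np.array([1, 0, 1]))
```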
In the above equations, ⟨·⟩ denotes the expectation over the posterior distributions Q(·), A_n· is the n-th row of the mixing matrix A, ∑_{k_l = k} means picking out those gaussians such that the l-th element of their indices k has the value k, and o_t is an indicator variable for the observed entries in x_t:

o_nt = 1 if x_nt is observed, 0 if x_nt is missing.   (3.10)

For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of Q(π_l), Q(φ_{lk_l}), and Q(β_{lk_l}), it is better to rescale P(s_{tl}) to have zero mean and unit variance; Q(s_{kt}), Q(A), Q(α), Q(ν), and Q(Ψ) have to be adjusted correspondingly. Finally, Q(k_t) is given by

log Q(k_t) = ⟨log P(x_t^o | s_{kt}, θ)⟩ + ⟨log N(s_{kt} | φ_k, β_k)⟩ − ⟨log Q(s_{kt})⟩ + ⟨log π_k⟩ − log z_t,   (3.11)
where z_t is a normalization constant. The lower bound E(X, Q(θ)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(θ)) can be expressed as

E(X, Q(θ)) = ∑_t log z_t + ∫ Q(θ) log ( P(θ) / Q(θ) ) dθ.   (3.12)
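Equation 3.11 defines Q(k_t) only up to the normalizer z_t; in practice the normalization is done in the log domain. A minimal sketch (our own helper, with hypothetical per-component scores):

```python
import numpy as np

def q_k(log_scores):
    """Normalize log-domain scores into Q(k_t) (eq. 3.11); log z_t via log-sum-exp."""
    m = log_scores.max()
    log_z = m + np.log(np.sum(np.exp(log_scores - m)))  # numerically stable
    return np.exp(log_scores - log_z), log_z

scores = np.array([-10.2, -9.1, -12.7])   # hypothetical per-component terms
Qk, log_z = q_k(scores)
```

The per-point log z_t values are exactly what accumulates into the lower bound of equation 3.12.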
4 Missing Data

4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

x_t^o = [A]_t^o · s_t   (4.1)

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked; point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with a small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is now underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of Q(s_t) is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m | x_t^o). The approximate distribution Q(x_t^m | x_t^o) can be obtained by

Q(x_t^m | x_t^o) = ∑_k Q(k_t) ∫ δ(x_t^m − x_{kt}^m) Q(x_{kt}^m | x_t^o, k) dx_{kt}^m,   (4.2)
where δ(·) is the delta function, and

Q(x_{kt}^m | x_t^o, k) = ∫ Q(θ) ∫ N(x_{kt}^m | [A s_{kt} + ν]_t^m, [Ψ]_t^m) Q(s_{kt}) ds_{kt} dθ
 = ∫∫ Q(A) Q(Ψ) N(x_{kt}^m | μ(x_{kt}^m), Λ(x_{kt}^m)) dA dΨ,   (4.3)

μ(x_{kt}^m) = [A μ(s_{kt}) + μ(ν)]_t^m,   (4.4)

Λ(x_{kt}^m)⁻¹ = [A Λ(s_{kt})⁻¹ A^T + Λ(ν)⁻¹ + diag(Ψ)⁻¹]_t^m.   (4.5)

Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x_t^m | x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m | x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation by equations 4.2 through 4.5. We have also modified the variational ICA of Chan et al. (2002) by discounting missing entries; this is done by replacing ∑_t with ∑_t o_nt, and ψ_n with o_nt ψ_n, in their learning rules.

Figure 3: The approximation of Q(x_t^m | x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m | x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contributions from the individual Q(x_{kt}^m | x_t^o, k).

The dashed line is the approximation of Q(x_t^m | x_t^o) from this modified method, which we refer to as the polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians, discussed in section 2.3, is named the full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m | x_t^o) and a better estimate for ⟨x_t^m | x_t^o⟩. From equation 2.16,
Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ,   (4.6)

and under the above formalism, Q(s_t) becomes

Q(s_t) = ∑_k Q(k_t) ∫ δ(s_t − s_{kt}) Q(s_{kt}) ds_{kt},   (4.7)

which is a mixture of K^L gaussians. The missing values can then be filled in by

⟨s_t | x_t^o⟩ = ∫ s_t Q(s_t) ds_t = ∑_k Q(k_t) μ(s_{kt}),   (4.8)

⟨x_t^m | x_t^o⟩ = ∫ x_t^m Q(x_t^m | x_t^o) dx_t^m = ∑_k Q(k_t) μ(x_{kt}^m) = [A]_t^m ⟨s_t | x_t^o⟩ + [μ(ν)]_t^m,   (4.9)

where μ(s_{kt}) and μ(x_{kt}^m) are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of Q(s_t) and Q(x_t^m | x_t^o) may be obtained, but then numerical methods are needed.
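Equations 4.8 and 4.9 reduce to a responsibility-weighted average once Q(k_t) and the component means are available. An illustrative sketch with made-up values:

```python
import numpy as np

def fill_missing(A, nu, Qk, mu_s, x_t, o_t):
    """<x_t^m | x_t^o> = [A]^m <s_t | x_t^o> + [nu]^m  (eqs. 4.8-4.9).
    Qk[k] are component responsibilities; mu_s[k] are component means of s."""
    s_mean = (Qk[:, None] * mu_s).sum(axis=0)   # eq. 4.8
    x_mean = A @ s_mean + nu                    # model mean in x space
    filled = np.where(o_t == 1, x_t, x_mean)    # keep observed, impute missing
    return filled, s_mean

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
nu = np.array([0.0, 0.0, 0.5])
Qk = np.array([0.25, 0.75])                     # hypothetical responsibilities
mu_s = np.array([[0.0, 2.0], [2.0, 0.0]])       # hypothetical component means
x_t = np.array([1.0, -1.0, np.nan])             # third dimension missing
o_t = np.array([1, 1, 0])

filled, s_mean = fill_missing(A, nu, Qk, mu_s, x_t, o_t)
```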
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_{tl}) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of the learning rules, suffers from many more local maxima, and has an inferior marginal likelihood lower bound. The problems are more serious at high missing data rates, and a local maximum solution is then usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing (that is, an empty x_t^o), Q(s_t) is the same as P(s_t | θ) and would not interfere with the learning of P(s_t | θ) from the other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too, which is very undesirable when learning the ICA structure.
5 Clusters of ICA

The variational Bayesian ICA for missing data described above can easily be extended to model data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. A parameter ρ = (ρ_1, …, ρ_C) is introduced to represent the weights on the clusters; ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, …, θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

P(x_t | Θ) = ∑_c P(c_t = c | ρ) P(x_t | θ^c)
 = ∑_c P(c_t = c | ρ) ∫ N(x_t | A^c s_t^c + ν^c, Ψ^c) P(s_t^c | θ_s^c) ds_t^c.   (5.1)
The objective function in equation 2.13 remains the same, but with θ replaced by Θ. The separable posterior Q(Θ) is given by

Q(Θ) = Q(ρ) ∏_c Q(θ^c),   (5.2)

and, similar to equation 2.15,

∫ Q(Θ) log ( P(Θ) / Q(Θ) ) dΘ = ∫ Q(ρ) log ( P(ρ) / Q(ρ) ) dρ + ∑_c ∫ Q(θ^c) log ( P(θ^c) / Q(θ^c) ) dθ^c.   (5.3)
Equation 2.20 now becomes

log P(x_t^o | Θ) ≥ ∑_c Q(c_t) log ( P(c_t) / Q(c_t) )
 + ∑_{c,k} Q(c_t) Q(k_t^c) [ ∫ Q(s_{kt}^c) log P(x_t^o | s_{kt}^c, θ^c) ds_{kt}^c + ∫ Q(s_{kt}^c) log ( N(s_{kt}^c | φ_k^c, β_k^c) / Q(s_{kt}^c) ) ds_{kt}^c ]
 + ∑_{c,k} Q(c_t) Q(k_t^c) log ( π_k^c / Q(k_t^c) ).   (5.4)

We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain the same, only with ∑_t replaced by ∑_t Q(c_t). Finally, we need two more learning rules:

d(ρ_c) = d_o(ρ_c) + ∑_t Q(c_t),   (5.5)

log Q(c_t) = ⟨log ρ_c⟩ + log z_t^c − log Z_t,   (5.6)

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11), and Z_t is for normalizing Q(c_t).
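Equation 5.6 normalizes over clusters just as equation 3.11 normalizes over the expanded gaussians. A small sketch with hypothetical values for ⟨log ρ_c⟩ and log z_t^c:

```python
import numpy as np

def q_c(log_rho, log_z_c):
    """log Q(c_t) = <log rho_c> + log z_t^c - log Z_t  (eq. 5.6)."""
    logits = log_rho + log_z_c
    logits -= logits.max()          # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum()

log_rho = np.log(np.array([0.6, 0.4]))   # expected log cluster weights (illustrative)
log_z_c = np.array([-5.0, -4.0])         # per-cluster normalizers from eq. 3.11
Qc = q_c(log_rho, log_z_c)
```

Note how the second cluster wins despite its smaller prior weight, because its per-cluster evidence z_t^c is e times larger.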
6 Supervised Classification

It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = {(x_t, y_t), t ∈ 1, …, T}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, …, y, …, Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

P(y_{T+1} | x_{T+1}, X_T, Y_T, M) = P(x_{T+1} | y_{T+1}, X_T, Y_T, M) P(y_{T+1} | X_T, Y_T, M) / P(x_{T+1} | X_T, Y_T, M).   (6.1)

Here M denotes our generative model for the observations {x_t, y_t}:

P(x_t, y_t | M) = P(x_t | y_t, M) P(y_t | M),   (6.2)

where P(x_t | y_t, M) could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y, and P(y_t | M) be parameterized by ω = (ω_1, …, ω_Y):

P(x_t | y_t = y, M) = P(x_t | Θ_y),   (6.3)

P(y_t | M) = P(y_t = y | ω) = ω_y.   (6.4)

If ω is given a Dirichlet prior, P(ω | M) = D(ω | d_o(ω)), its posterior is also a Dirichlet distribution:

P(ω | Y_T, M) = D(ω | d(ω)),   (6.5)

d(ω_y) = d_o(ω_y) + ∑_t I(y_t = y).   (6.6)
Here I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

P(Θ_y | X_T, Y_T, M) = P(Θ_y | X_y),   (6.7)

where X_y is the subset of X_T that contains only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 on the subset X_y.
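The count update of equation 6.6 is straightforward; a sketch with toy labels (the symbol ω for the class weights is notation used here, reconstructed from the garbled source):

```python
import numpy as np

def dirichlet_posterior(d_o, labels, Y):
    """d_y = d_o(y) + sum_t I(y_t = y)  (eq. 6.6)."""
    counts = np.bincount(labels, minlength=Y)
    return d_o + counts

labels = np.array([0, 1, 1, 0, 1])          # toy training labels, Y = 2 classes
d = dirichlet_posterior(np.ones(2), labels, 2)
class_prob = d / d.sum()                    # posterior mean of omega, used in eq. 6.8
```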
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

P(y_{T+1} = y | X_T, Y_T, M) = ∫ P(y_{T+1} = y | ω) P(ω | X_T, Y_T) dω = d(ω_y) / ∑_{y′} d(ω_{y′}).   (6.8)

The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

log P(x_{T+1} | y_{T+1} = y, X_T, Y_T, M)
 = log P(x_{T+1} | X_y, M)
 = log P(x_{T+1}, X_y | M) − log P(X_y | M)   (6.9)
 ≈ E({x_{T+1}, X_y}, Q′(Θ_y)) − E(X_y, Q(Θ_y)).   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q′(Θ_y) and E({x_{T+1}, X_y}, Q′(Θ_y)). The error in the approximation is the difference KL(Q′(Θ_y) ‖ P(Θ_y | {x_{T+1}, X_y})) − KL(Q(Θ_y) ‖ P(Θ_y | X_y)). If we assume further that Q′(Θ_y) ≈ Q(Θ_y),

log P(x_{T+1} | X_y, M) ≈ ∫ Q(Θ_y) log P(x_{T+1} | Θ_y) dΘ_y = log Z_{T+1},   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
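Combining equations 6.8 and 6.11, the predictive class posterior is proportional to the Dirichlet mean times the per-class evidence Z_{T+1}(y). A hypothetical numeric sketch:

```python
import numpy as np

def classify(d, log_Z_new):
    """P(y | x_new, data) proportional to [d_y / sum d] * exp(log Z_{T+1}(y))
    (eqs. 6.1, 6.8, 6.11); evidence handled in the log domain."""
    log_post = np.log(d / d.sum()) + log_Z_new
    log_post -= log_post.max()
    p = np.exp(log_post)
    return p / p.sum()

d = np.array([3.0, 4.0])                 # Dirichlet counts per class (illustrative)
log_Z_new = np.array([-12.0, -10.5])     # per-class evidence of the new point
p = classify(d, log_Z_new)
```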
7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by the mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by the variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: probability densities modeled by mixtures of gaussians; dotted lines: individual gaussian contributions.
Figure 6: E(X, Q(θ)) as a function of the number of hidden source dimensions. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with a minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(θ | X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images, so estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
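The percentages quoted above follow from the independent per-pixel missing probability of 0.2 in each of the three mixtures; the counts are binomial:

```python
from math import comb

p_miss = 0.2  # per-pixel missing probability in each of the 3 mixed images

def p_missing_from(m, n=3, p=p_miss):
    """Probability that a pixel is missing from exactly m of the n mixed images."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

p1, p2, p3 = p_missing_from(1), p_missing_from(2), p_missing_from(3)
# 0.384, 0.096, 0.008 -> the 38.4%, 9.6%, and 0.8% quoted above
```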
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict the survival of the patient one year after heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
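The reported rates translate into the usual screening metrics; a quick check:

```python
tp, pos = 16, 24   # true positives out of positive examples
tn, neg = 42, 50   # true negatives out of negative examples

sensitivity = tp / pos            # true positive rate, 2/3
specificity = tn / neg            # true negative rate, 0.84
accuracy = (tp + tn) / (pos + neg)  # overall fraction correct
```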
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18 2002 accepted January 30 2003
Variational Bayesian Learning of ICA with Missing Data 1993
to have L dimensions Similar to the independent factor analysis of Attias(1999) each component of st will be modeled by a mixture of K gaussiansto allow for source densities of various kurtosis and skewness
Pst j micros DLY
l
AacuteKX
kl
frac14lklN
iexclstl j Aacutelkl macrlkl
cent
(22)
Split each data point into a missing part and an observed part xgtt D xogt
t
xmgtt In this article we consider only the random missing case (Ghahra-
mani amp Jordan 1994) that is the probability for the missing entries xmt is
independent of the value of xmt but could depend on the value of xo
t Thelikelihood of the data set is then dened to be
L micro I X DY
tPxo
t j micro (23)
where
Pxot j micro D
ZPxt j micro dxm
t
DZ microZ
N xt j Ast C ordm ordf dxmt
paraPst j micros dst
DZ
N xot j [Ast C ordm]o
t [ordf]ot Pst j micros dst (24)
Here we have introduced the notation [cent]ot which means taking only the
observed dimensions (corresponding to the tth data point) of whatever isinside the square brackets Since equation 24 is similar to equation 21the variational Bayesian ICA (Chan et al 2002 Choudrey amp Roberts 2001Miskin 2000) can be extended naturally to handle missing data but only ifcare is taken in discounting missing entries in the learning rules
22 Variational Bayesian Method In a full Bayesian treatment the pos-terior distribution of the parameters micro is obtained by
Pmicro j X D PX j microPmicro
PXD
Qt Pxo
t j micro Pmicro
PX (25)
where PX is the marginal likelihood and given as
PX DZ Y
tPxo
t j microPmicro dmicro (26)
1994 K Chan T Lee and T Sejnowski
The ICA model for PX is dened with the following priors on the param-eters Pmicro
PAnl D N Anl j 0 regl Pfrac14l D Dfrac14l j dofrac14l
Pregl D G regl j aoregl boregl PAacutelkl D N Aacutelklj sup1oAacutelkl 3oAacutelkl (27)
Pmacrlkl D G macrlklj aomacrlkl bomacrlkl
Pordmn D N ordmn j sup1oordmn 3oordmn P9n D G 9n j ao9n bo9n (28)
where N cent G cent and Dcent are the normal gamma and Dirichlet distribu-tions respectively
N x j sup1 curren D
sjcurrenj
2frac14N eiexcl 12 xiexclsup1gtcurrenxiexclsup1I (29)
G x j a b D ba
0axaiexcl1eiexclbxI (210)
Dfrac14 j d D 0P
dkQ0dk
frac14d1iexcl11 pound cent cent cent pound frac14
dKiexcl1K (211)
Here aocent bocent docent sup1ocent and 3ocent are prechosen hyperparameters forthe priors Notice that curren in the normal distribution is an inverse covarianceparameter
Under the variational Bayesian treatment, instead of performing the integration in equation 2.6 to solve for P(θ | X) directly, we approximate it by Q(θ) and opt to minimize the Kullback-Leibler distance between them (Mackay, 1995; Jordan, Ghahramani, Jaakkola, & Saul, 1999):

    −KL(Q(θ), P(θ | X)) = ∫ Q(θ) log [P(θ | X) / Q(θ)] dθ
      = ∫ Q(θ) [∑_t log P(x_t^o | θ) + log (P(θ) / Q(θ))] dθ − log P(X).   (2.12)

Since −KL(Q(θ), P(θ | X)) ≤ 0, we get a lower bound for the log marginal likelihood:

    log P(X) ≥ ∫ Q(θ) ∑_t log P(x_t^o | θ) dθ + ∫ Q(θ) log [P(θ) / Q(θ)] dθ,   (2.13)
which can also be obtained by applying Jensen's inequality to equation 2.6. Q(θ) is then solved by functional maximization of the lower bound. A separable approximate posterior Q(θ) will be assumed:

    Q(θ) = Q(ν) Q(Ψ) × Q(A) Q(α) × ∏_l [Q(π_l) ∏_{k_l} Q(φ_lk_l) Q(β_lk_l)].   (2.14)
The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior Q(θ) and the prior P(θ), is then expanded as

    ∫ Q(θ) log [P(θ) / Q(θ)] dθ
      = ∑_l ∫ Q(π_l) log [P(π_l) / Q(π_l)] dπ_l
      + ∑_{l,k_l} ∫ Q(φ_lk_l) log [P(φ_lk_l) / Q(φ_lk_l)] dφ_lk_l
      + ∑_{l,k_l} ∫ Q(β_lk_l) log [P(β_lk_l) / Q(β_lk_l)] dβ_lk_l
      + ∫∫ Q(A) Q(α) log [P(A | α) / Q(A)] dA dα
      + ∫ Q(α) log [P(α) / Q(α)] dα
      + ∫ Q(ν) log [P(ν) / Q(ν)] dν
      + ∫ Q(Ψ) log [P(Ψ) / Q(Ψ)] dΨ.   (2.15)
2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that P(x_t | θ) is replaced by P(x_t^o | θ) in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain Q(x_t^m | x_t^o), that is, the approximate distribution on the missing entries, which is given by

    Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ.   (2.16)

As noted by Welling and Weber (1999), elements of s_t given x_t^o are dependent. More important, under the ICA model, Q(s_t) is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data x and the hidden variable s; the inserts show the sample data in the two spaces. Here the hidden sources assume the density P(s_l) ∝ exp(−|s_l|^0.7). They are mixed noiselessly to give P(x) in the upper graph. The cut in the upper graph represents P(x_1 | x_2 = −0.5), which transforms into a highly correlated and nongaussian P(s | x_2 = −0.5).

Unless we are interested in only the first- and second-order statistics of Q(x_t^m | x_t^o), we should try to capture as much structure as possible of
Figure 1: Probability density functions for the data x (top) and the hidden sources s (bottom). Inserts show the sample data in the two spaces. The "cuts" show P(x_1 | x_2 = −0.5) and P(s | x_2 = −0.5).
P(s_t | x_t^o) in Q(s_t). In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down P(s_t) into a mixture of K^L gaussians in the L-dimensional s space:

    P(s_t) = ∏_{l=1}^L ( ∑_{k_l} π_lk_l N(s_tl | φ_lk_l, β_lk_l) )
      = ∑_{k_1} ⋯ ∑_{k_L} [π_1k_1 × ⋯ × π_Lk_L × N(s_t1 | φ_1k_1, β_1k_1) × ⋯ × N(s_tL | φ_Lk_L, β_Lk_L)]
      = ∑_k π_k N(s_t | φ_k, β_k).   (2.17)

Here we have defined k to be a vector index. The "k-th" gaussian is centered at φ_k, with inverse covariance β_k, in the source s space:

    k = (k_1, …, k_l, …, k_L)ᵀ,   k_l = 1, …, K,
    φ_k = (φ_1k_1, …, φ_lk_l, …, φ_Lk_L)ᵀ,
    β_k = diag(β_1k_1, …, β_Lk_L),
    π_k = π_1k_1 × ⋯ × π_Lk_L.   (2.18)
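Under our reading of equation 2.18, the expanded parameters can be enumerated directly with `itertools.product`. The sketch below (hypothetical names, random per-source parameters) just verifies the bookkeeping: there are K^L expanded components, and their weights π_k still sum to 1.

```python
from itertools import product
import numpy as np

L, K = 3, 2
rng = np.random.default_rng(1)
pi   = rng.dirichlet(np.ones(K), size=L)   # per-source weights, rows sum to 1
phi  = rng.normal(size=(L, K))             # per-source component means
beta = rng.gamma(2.0, 1.0, size=(L, K))    # per-source component precisions

expanded = []
idx = np.arange(L)
for k in product(range(K), repeat=L):      # vector index k = (k_1, ..., k_L)
    pi_k   = np.prod(pi[idx, k])           # pi_k = pi_{1k_1} * ... * pi_{Lk_L}
    phi_k  = phi[idx, k]                   # center of the "k-th" gaussian
    beta_k = np.diag(beta[idx, k])         # diagonal inverse covariance
    expanded.append((k, pi_k, phi_k, beta_k))

print(len(expanded))                       # K**L = 8 expanded gaussians
total = sum(w for _, w, _, _ in expanded)
print(round(total, 6))                     # weights still sum to 1
```

This exhaustive expansion is exactly what makes the method exponential in L, which section 4.2 revisits.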
The log likelihood for x_t^o is then expanded using Jensen's inequality:

    log P(x_t^o | θ) = log ∫ P(x_t^o | s_t, θ) ∑_k π_k N(s_t | φ_k, β_k) ds_t
      = log ∑_k π_k ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t
      ≥ ∑_k Q(k_t) log ∫ P(x_t^o | s_t, θ) N(s_t | φ_k, β_k) ds_t + ∑_k Q(k_t) log [π_k / Q(k_t)].   (2.19)

Here Q(k_t) is a short form for Q(k_t = k); k_t is a discrete hidden variable, and Q(k_t = k) is the probability that the t-th data point belongs to the k-th gaussian. Recognizing that s_t is just a dummy variable, we introduce Q(s_kt),
Figure 2: A simplified directed graph for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters. The k_t indicates which of the K^L expanded gaussians generated s_t.
apply Jensen's inequality again, and get

    log P(x_t^o | θ) ≥ ∑_k Q(k_t) [ ∫ Q(s_kt) log P(x_t^o | s_kt, θ) ds_kt
      + ∫ Q(s_kt) log ( N(s_kt | φ_k, β_k) / Q(s_kt) ) ds_kt ]
      + ∑_k Q(k_t) log [π_k / Q(k_t)].   (2.20)

Substituting log P(x_t^o | θ) back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation for the generative model of variational ICA: x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters, where k_t indicates which of the K^L expanded gaussians generated s_t.
3 Learning Rules
Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood log P(X) with regard to Q(θ) (see equation 2.14), Q(k_t), and Q(s_kt) (see equation 2.20), for example,

    log Q(ν) = log P(ν) + ∫ Q(θ\ν) ∑_t log P(x_t^o | θ) dθ\ν + const.,   (3.1)
where θ\ν is the set of parameters excluding ν. This gives

    Q(ν) = ∏_n N(ν_n | μ_νn, Λ_νn),
    Λ_νn = Λ_oνn + ⟨ψ_n⟩ ∑_t o_nt,
    μ_νn = [Λ_oνn μ_oνn + ⟨ψ_n⟩ ∑_t o_nt ∑_k Q(k_t) ⟨x_nt − A_n· s_kt⟩] / Λ_νn.   (3.2)
Similarly,

    Q(Ψ) = ∏_n G(ψ_n | a_ψn, b_ψn),
    a_ψn = a_oψn + ½ ∑_t o_nt,
    b_ψn = b_oψn + ½ ∑_t o_nt ∑_k Q(k_t) ⟨(x_nt − A_n· s_kt − ν_n)²⟩.   (3.3)
    Q(A) = ∏_n N(A_n· | μ_An·, Λ_An·),
    Λ_An· = diag(⟨α_1⟩, …, ⟨α_L⟩) + ⟨ψ_n⟩ ∑_t o_nt ∑_k Q(k_t) ⟨s_kt s_ktᵀ⟩,
    μ_An· = [⟨ψ_n⟩ ∑_t o_nt (x_nt − ⟨ν_n⟩) ∑_k Q(k_t) ⟨s_ktᵀ⟩] (Λ_An·)⁻¹.   (3.4)
    Q(α) = ∏_l G(α_l | a_αl, b_αl),
    a_αl = a_oαl + N/2,
    b_αl = b_oαl + ½ ∑_n ⟨A_nl²⟩.   (3.5)
    Q(π_l) = D(π_l | d_πl),
    d_πlk = d_oπlk + ∑_t ∑_{k_l=k} Q(k_t).   (3.6)
    Q(φ_lk_l) = N(φ_lk_l | μ_φlk_l, Λ_φlk_l),
    Λ_φlk_l = Λ_oφlk_l + ⟨β_lk_l⟩ ∑_t ∑_{k_l=k} Q(k_t),
    μ_φlk_l = [Λ_oφlk_l μ_oφlk_l + ⟨β_lk_l⟩ ∑_t ∑_{k_l=k} Q(k_t) ⟨s_ktl⟩] / Λ_φlk_l.   (3.7)
    Q(β_lk_l) = G(β_lk_l | a_βlk_l, b_βlk_l),
    a_βlk_l = a_oβlk_l + ½ ∑_t ∑_{k_l=k} Q(k_t),
    b_βlk_l = b_oβlk_l + ½ ∑_t ∑_{k_l=k} Q(k_t) ⟨(s_ktl − φ_lk_l)²⟩.   (3.8)
    Q(s_kt) = N(s_kt | μ_skt, Λ_skt),
    Λ_skt = diag(⟨β_1k_1⟩, …, ⟨β_Lk_L⟩) + ⟨Aᵀ diag(o_1t ψ_1, …, o_Nt ψ_N) A⟩,
    Λ_skt μ_skt = (⟨β_1k_1 φ_1k_1⟩, …, ⟨β_Lk_L φ_Lk_L⟩)ᵀ + ⟨Aᵀ diag(o_1t ψ_1, …, o_Nt ψ_N) (x_t − ν)⟩.   (3.9)
In the above equations, ⟨·⟩ denotes the expectation over the posterior distributions Q(·), A_n· is the n-th row of the mixing matrix A, ∑_{k_l=k} means picking out those gaussians such that the l-th element of their indices k has the value k, and o_t is an indicator variable for the observed entries in x_t:

    o_nt = 1 if x_nt is observed, 0 if x_nt is missing.   (3.10)
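In practice, the indicator o_nt is simply a binary mask over the data matrix, and "discounting missing entries" means using masked counts and sums in place of full ones. A minimal sketch (our own variable names, NaN used to mark missing entries):

```python
import numpy as np

# Data matrix with NaN marking missing entries (N dims x T points).
x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0,    np.nan]])

o = (~np.isnan(x)).astype(float)   # o[n, t] = 1 if x[n, t] observed, else 0
x_filled = np.nan_to_num(x)        # zeros at missing entries; o masks them out

# Discounting missing entries in a learning rule: effective counts and sums
# per dimension use sum_t o_nt instead of T.
counts = o.sum(axis=1)
sums   = (o * x_filled).sum(axis=1)
means  = sums / counts
print(counts, means)               # [2. 2.] [2.  4.5]
```

The same mask pattern reappears in every rule above: each per-dimension statistic is accumulated only over the t for which o_nt = 1.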
For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. Note that there exist scale and translational degeneracies in the model, as given by equations 2.1 and 2.2. After each update of Q(π_l), Q(φ_lk_l), and Q(β_lk_l), it is better to rescale P(s_tl) to have zero mean and unit variance; Q(s_kt), Q(A), Q(α), Q(ν), and Q(Ψ) have to be adjusted correspondingly. Finally, Q(k_t) is given by

    log Q(k_t) = ⟨log P(x_t^o | s_kt, θ)⟩ + ⟨log N(s_kt | φ_k, β_k)⟩ − ⟨log Q(s_kt)⟩ + ⟨log π_k⟩ − log z_t,   (3.11)
where z_t is a normalization constant. The lower bound E(X, Q(θ)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(θ)) can be expressed as

    E(X, Q(θ)) = ∑_t log z_t + ∫ Q(θ) log [P(θ) / Q(θ)] dθ.   (3.12)
4 Missing Data
4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

    x_t^o = [A]_t^o · s_t   (4.1)

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked, point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of Q(s_t) is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m | x_t^o). The approximate distribution Q(x_t^m | x_t^o) can be obtained by

    Q(x_t^m | x_t^o) = ∑_k Q(k_t) ∫ δ(x_t^m − x_kt^m) Q(x_kt^m | x_t^o, k) dx_kt^m,   (4.2)
where δ(·) is the delta function, and

    Q(x_kt^m | x_t^o, k) = ∫ Q(θ) ∫ N(x_kt^m | [A s_kt + ν]_t^m, [Ψ]_t^m) Q(s_kt) ds_kt dθ
      = ∫∫ Q(A) Q(Ψ) N(x_kt^m | μ_xkt^m, Λ_xkt^m) dA dΨ,   (4.3)

    μ_xkt^m = [A μ_skt + μ_ν]_t^m,   (4.4)

    (Λ_xkt^m)⁻¹ = [A (Λ_skt)⁻¹ Aᵀ + (Λ_ν)⁻¹ + diag(Ψ)⁻¹]_t^m.   (4.5)
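Equations 4.4 and 4.5 translate directly into a small routine once the posterior moments are in hand. The sketch below is an assumption-laden transcription (our own function and argument names; diagonal posterior for ν; made-up posterior moments in the usage), not the authors' code.

```python
import numpy as np

def missing_predictive(A, mu_s, Lambda_s, mu_nu, Lambda_nu, psi, miss):
    """Mean and covariance of x_t^m for one expanded gaussian (eqs. 4.4-4.5).

    A         : (N, L) posterior mean of the mixing matrix
    mu_s      : (L,)   posterior mean of s_kt
    Lambda_s  : (L, L) posterior precision of s_kt
    mu_nu     : (N,)   posterior mean of the bias nu
    Lambda_nu : (N,)   posterior precisions of nu (diagonal)
    psi       : (N,)   noise precisions
    miss      : (N,)   boolean mask of missing dimensions
    """
    mean = (A @ mu_s + mu_nu)[miss]                       # eq. 4.4
    cov = (A @ np.linalg.inv(Lambda_s) @ A.T              # eq. 4.5, inverted
           + np.diag(1.0 / Lambda_nu) + np.diag(1.0 / psi))
    return mean, cov[np.ix_(miss, miss)]

# Tiny usage with made-up posterior moments:
A = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]])
miss = np.array([False, True, True])
mean, cov = missing_predictive(A, np.zeros(2), np.eye(2),
                               np.zeros(3), np.full(3, 10.0),
                               np.full(3, 100.0), miss)
print(mean.shape, cov.shape)   # (2,) (2, 2)
```

Summing such component predictives with weights Q(k_t), as in equation 4.2, gives the mixture approximation to Q(x_t^m | x_t^o).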
Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x_t^m | x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m | x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing ∑_t
Figure 3: The approximation of Q(x_t^m | x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m | x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contributions from the individual Q(x_kt^m | x_t^o, k).
with ∑_t o_nt, and ψ_n with o_nt ψ_n, in their learning rules. The dashed line is the approximation Q(x_t^m | x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m | x_t^o) and a better estimate for ⟨x_t^m | x_t^o⟩. From equation 2.16,
    Q(x_t^m | x_t^o) = ∫ Q(θ) ∫ N(x_t^m | [A s_t + ν]_t^m, [Ψ]_t^m) Q(s_t) ds_t dθ,   (4.6)

and with the above formalism, Q(s_t) becomes

    Q(s_t) = ∑_k Q(k_t) ∫ δ(s_t − s_kt) Q(s_kt) ds_kt,   (4.7)
which is a mixture of K^L gaussians. The missing values can then be filled in by

    ⟨s_t | x_t^o⟩ = ∫ s_t Q(s_t) ds_t = ∑_k Q(k_t) μ_skt,   (4.8)

    ⟨x_t^m | x_t^o⟩ = ∫ x_t^m Q(x_t^m | x_t^o) dx_t^m = ∑_k Q(k_t) μ_xkt^m = [A]_t^m ⟨s_t | x_t^o⟩ + [μ_ν]_t^m,   (4.9)

where μ_skt and μ_xkt^m are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of Q(s_t) and Q(x_t^m | x_t^o) may be obtained, but then numerical methods are needed.
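The fill-in of equations 4.8 and 4.9 is a responsibility-weighted average of the component means. A minimal sketch, with our own names and toy numbers standing in for the learned posterior quantities:

```python
import numpy as np

def fill_missing(A, mu_nu, Q_k, mu_s_k, miss):
    """<x_t^m | x_t^o> = [A]_t^m <s_t | x_t^o> + [mu_nu]_t^m  (eqs. 4.8-4.9).

    Q_k    : (Kexp,)   responsibilities Q(k_t) over the expanded gaussians
    mu_s_k : (Kexp, L) posterior means mu_skt of each component
    miss   : (N,)      boolean mask of missing dimensions
    """
    s_hat = Q_k @ mu_s_k                  # mixture mean of Q(s_t), eq. 4.8
    return A[miss] @ s_hat + mu_nu[miss]  # eq. 4.9

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu_nu = np.zeros(3)
Q_k = np.array([0.25, 0.75])              # two expanded components
mu_s_k = np.array([[1.0, -1.0], [2.0, 0.0]])
miss = np.array([False, False, True])
print(fill_missing(A, mu_nu, Q_k, mu_s_k, miss))   # [1.5]
```

Note that this returns only the posterior mean; the full mixture of equation 4.2 also carries the uncertainty that a MAP or mean fill-in discards.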
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of the learning rules, suffers from many more local maxima, and has an inferior marginal-likelihood lower bound. The problems are more serious at high missing-data rates, where a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing, that is, x_t^o is empty, Q(s_t) is the same as P(s_t | θ) and would not interfere with the learning of P(s_t | θ) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too. This is very undesirable when learning ICA structure.
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can easily be extended to model data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_kt for each cluster are given a superscript index c. A parameter ρ = (ρ_1, …, ρ_C) is introduced to represent the weights on the clusters; ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, …, θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

    P(x_t | Θ) = ∑_c P(c_t = c | ρ) P(x_t | θ^c)
      = ∑_c P(c_t = c | ρ) ∫ N(x_t | A^c s_t^c + ν^c, Ψ^c) P(s_t^c | θ_s^c) ds_t^c.   (5.1)
The objective function in equation 2.13 remains the same, but with θ replaced by Θ. The separable posterior Q(Θ) is given by

    Q(Θ) = Q(ρ) ∏_c Q(θ^c),   (5.2)

and, similar to equation 2.15,

    ∫ Q(Θ) log [P(Θ) / Q(Θ)] dΘ = ∫ Q(ρ) log [P(ρ) / Q(ρ)] dρ + ∑_c ∫ Q(θ^c) log [P(θ^c) / Q(θ^c)] dθ^c.   (5.3)
Equation 2.20 now becomes

    log P(x_t^o | Θ) ≥ ∑_c Q(c_t) log [P(c_t) / Q(c_t)]
      + ∑_{c,k} Q(c_t) Q(k_t^c) [ ∫ Q(s_kt^c) log P(x_t^o | s_kt^c, θ^c) ds_kt^c
      + ∫ Q(s_kt^c) log ( N(s_kt^c | φ_k^c, β_k^c) / Q(s_kt^c) ) ds_kt^c ]
      + ∑_{c,k} Q(c_t) Q(k_t^c) log [π_k^c / Q(k_t^c)].   (5.4)
We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain the same, only with ∑_t replaced by ∑_t Q(c_t). Finally, we need two more learning rules:

    d_ρc = d_oρc + ∑_t Q(c_t),   (5.5)

    log Q(c_t) = ⟨log ρ_c⟩ + log z_t^c − log Z_t,   (5.6)

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11) and Z_t is for normalizing Q(c_t).
6 Supervised Classification
It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.
Consider a data set (X_T, Y_T) = {(x_t, y_t), t = 1, …, T}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, …, y, …, Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

    P(y_{T+1} | x_{T+1}, X_T, Y_T, M) = P(x_{T+1} | y_{T+1}, X_T, Y_T, M) P(y_{T+1} | X_T, Y_T, M) / P(x_{T+1} | X_T, Y_T, M).   (6.1)

Here M denotes our generative model for the observations {x_t, y_t}:

    P(x_t, y_t | M) = P(x_t | y_t, M) P(y_t | M),   (6.2)

where P(x_t | y_t, M) could be a mixture model as given by equation 5.1.
6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y, and P(y_t | M) by γ = (γ_1, …, γ_Y):

    P(x_t | y_t = y, M) = P(x_t | Θ_y),   (6.3)
    P(y_t | M) = P(y_t = y | γ) = γ_y.   (6.4)

If γ is given a Dirichlet prior, P(γ | M) = D(γ | d_oγ), its posterior is also a Dirichlet distribution:

    P(γ | Y_T, M) = D(γ | d_γ),   (6.5)
    d_γy = d_oγy + ∑_t I(y_t = y).   (6.6)
Here I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

    P(Θ_y | X_T, Y_T, M) = P(Θ_y | X_y),   (6.7)

where X_y is the subset of X_T containing only those x_t whose training labels y_t have the value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 on the subset X_y.
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

    P(y_{T+1} = y | X_T, Y_T, M) = ∫ P(y_{T+1} = y | γ) P(γ | X_T, Y_T) dγ = d_γy / ∑_{y′} d_γy′.   (6.8)
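Equations 6.6 and 6.8 together amount to smoothed class counts. A minimal sketch (hypothetical names; symmetric prior d_o = 1 is our choice):

```python
import numpy as np

def class_prior_predictive(y_train, Y, d_o=1.0):
    """P(y_{T+1} = y | Y_T) = d_y / sum_y' d_y', with d_y = d_o + count(y)
    (eqs. 6.6 and 6.8, assuming a symmetric Dirichlet prior d_o)."""
    d = d_o + np.bincount(y_train, minlength=Y).astype(float)
    return d / d.sum()

y_train = np.array([0, 0, 1, 1, 1, 2])     # 3 classes, 6 labeled points
print(class_prior_predictive(y_train, 3))  # [0.33333333 0.44444444 0.22222222]
```

With the prior counts d_o > 0, a class unseen in the training set still receives nonzero predictive probability.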
The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

    log P(x_{T+1} | y_{T+1} = y, X_T, Y_T, M) = log P(x_{T+1} | X_y, M)
      = log P(x_{T+1}, X_y | M) − log P(X_y | M)   (6.9)
      ≈ E({x_{T+1}, X_y}, Q′(Θ_y)) − E(X_y, Q(Θ_y)).   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q′(Θ_y) and E({x_{T+1}, X_y}, Q′(Θ_y)). The error in the approximation is the difference KL(Q′(Θ_y), P(Θ_y | {x_{T+1}, X_y})) − KL(Q(Θ_y), P(Θ_y | X_y)). If we further assume that Q′(Θ_y) ≈ Q(Θ_y),

    log P(x_{T+1} | X_y, M) ≈ ∫ Q(Θ_y) log P(x_{T+1} | Θ_y) dΘ_y = log Z_{T+1},   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise
Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)
Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
at the −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4; dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.
Figure 6: E(X, Q(θ)), the log marginal likelihood lower bound, as a function of the number of hidden source dimensions. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(θ | X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low
Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
uncertainty; and 9.6% of the pixels were missing from any two of the mixed images, for which estimation of their values is possible but carries high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we obtained a true positive rate of 16/24 and a true negative rate of 42/50.
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed, and the application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
1994 K Chan T Lee and T Sejnowski
The ICA model for PX is dened with the following priors on the param-eters Pmicro
PAnl D N Anl j 0 regl Pfrac14l D Dfrac14l j dofrac14l
Pregl D G regl j aoregl boregl PAacutelkl D N Aacutelklj sup1oAacutelkl 3oAacutelkl (27)
Pmacrlkl D G macrlklj aomacrlkl bomacrlkl
Pordmn D N ordmn j sup1oordmn 3oordmn P9n D G 9n j ao9n bo9n (28)
where N cent G cent and Dcent are the normal gamma and Dirichlet distribu-tions respectively
N x j sup1 curren D
sjcurrenj
2frac14N eiexcl 12 xiexclsup1gtcurrenxiexclsup1I (29)
G x j a b D ba
0axaiexcl1eiexclbxI (210)
Dfrac14 j d D 0P
dkQ0dk
frac14d1iexcl11 pound cent cent cent pound frac14
dKiexcl1K (211)
Here aocent bocent docent sup1ocent and 3ocent are prechosen hyperparameters forthe priors Notice that curren in the normal distribution is an inverse covarianceparameter
Under the variational Bayesian treatment instead of performing the in-tegration in equation 26 to solve for Pmicro j X directly we approximate itby Qmicro and opt to minimize the Kullback-Leibler distance between them(Mackay 1995 Jordan Ghahramani Jaakkola amp Saul 1999)
iexclKLQmicro j Pmicro j X DZ
Qmicro logPmicro j X
Qmicrodmicro
DZ
Qmicro
X
t
log Pxot j micro C log
Pmicro
Qmicro
dmicro
iexcl log PX (212)
Since iexclKLQmicro j Pmicro j X middot 0 we get a lower bound for the log marginallikelihood
log PX cedilZ
Qmicro X
t
log Pxot j micro dmicro C
ZQmicro log
Pmicro
Qmicrodmicro (213)
which can also be obtained by applying Jensenrsquos inequality to equation 26Qmicro is then solved by functional maximization of the lower bound A sep-
Variational Bayesian Learning of ICA with Missing Data 1995
arable approximate posterior Qmicro will be assumed
Qmicro D QordmQordf pound QAQreg
poundY
l
Qfrac14lY
kl
QAacutelkl Qmacrlkl
(214)
The second term in equation 213 which is the negative Kullback-Leiblerdivergence between approximate posterior Qmicro and prior Pmicro is then ex-panded as
ZQmicro log
Pmicro
Qmicrodmicro
DX
l
ZQfrac14l log
Pfrac14l
Qfrac14ldfrac14l
CX
l kl
ZQAacutelkl
logPAacutelkl
QAacutelkl
dAacutelklC
X
l kl
ZQmacrlkl
logPmacrlkl
Qmacrlkl
dmacrlkl
CZ Z
QAQreg logPA j reg
QAdA dreg C
ZQreg log
Preg
Qregdreg
CZ
Qordm logPordm
Qordmdordm C
ZQordf log
Pordf
Qordfdordf (215)
23 Special Treatment for Missing Data Thus far the analysis followsalmost exactly that of the variational Bayesian ICA on complete data exceptthat Pxt j micro is replaced by Pxo
t j micro in equation 26 and consequently themissing entries are discounted in the learning rules However it would beuseful to obtain Qxm
t j xot that is the approximate distribution on the
missing entries which is given by
Qxmt j xo
t DZ
Qmicro
ZN xm
t j [Ast C ordm]mt [ordf]m
t Qst dst dmicro (216)
As noted by Welling and Weber (1999) elements of st given xot are depen-
dent More important under the ICA model Qst is unlikely to be a singlegaussian This is evident from Figure 1 which shows the probability den-sity functions of the data x and hidden variable s The inserts show thesample data in the two spaces Here the hidden sources assume density ofPsl expiexcljslj07 They are mixed noiselessly to give Px in the uppergraph The cut in the upper graph represents Px1 j x2 D iexcl05 whichtransforms into a highly correlated and nongaussian Ps j x2 D iexcl05
Unless we are interested in only the rst- and second-order statisticsof Qxm
t j xot we should try to capture as much structure as possible of
1996 K Chan T Lee and T Sejnowski
shy 1
shy 05
005
1
shy 1
shy 05
0
05
10
02
04
06
08
1
12
14
x1
x2
shy 1
shy 05
0
05
1
shy 1
shy 05
0
05
10
02
04
06
08
1
12
14
16
s1s2
Figure 1 Probability density functions for the data x (top) and hidden sourcess (bottom) Inserts show the sample data in the two spaces The ldquocutsrdquo showPx1 j x2 D iexcl05 and Ps j x2 D iexcl05
Variational Bayesian Learning of ICA with Missing Data 1997
Pst j xot in Qst In this article we take a slightly different route from Chan
et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space
Pst DLY
l
AacuteX
kl
frac14lklN stl j Aacutelkl macrlkl
DX
k1
cent cent centX
kL
[frac141k1 pound cent cent cent pound frac14LkL
pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]
DX
k
frac14k N st j Aacutek macrk (217)
Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space
k D k1 kl kLgt kl D 1 K
Aacutek D Aacute1k1 Aacutelkl AacuteLkL
gt
macrk D
0
Bmacr1k1
macrLkL
1
CA
frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)
Log likelihood for xot is then expanded using Jensenrsquos inequality
log Pxot j micro D log
ZPxo
t j st microX
k
frac14k N st j Aacutek macrk dst
D logX
k
frac14k
ZPxo
t j st micro N st j Aacutek macrk dst
cedilX
k
Qkt logZ
Pxot j st micro N st j Aacutek macrk dst
CX
k
Qkt logfrac14k
Qkt (219)
Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt
1998 K Chan T Lee and T Sejnowski
xt
Y
n
Aa
st
b
f
kt p
Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st
apply Jensenrsquos inequality again and get
log Pxot j micro cedil
X
k
Qkt
microZQskt log Pxo
t j skt micro dskt
CZ
Qskt logN skt j Aacutek macrk
Qsktdskt
para
CX
k
Qkt logfrac14k
Qkt (220)
Substituting log Pxot j micro back into equation 213 the variational Bayesian
method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st
3 Learning Rules
Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample
log Qordm D log Pordm CZ
Qmicro nordmX
t
log Pxot j micro dmicronordm C const (31)
Variational Bayesian Learning of ICA with Missing Data 1999
where micro nordm is the set of parameters excluding ordm This gives
Qordm DY
nN ordmn j sup1ordmn 3ordmn
3ordmn D 3oordmn C h9niX
t
ont
sup1ordmn D 3oordmnsup1oordmn C h9niP
t ontP
k Qkthxnt iexcl Ancentskti3ordmn
(32)
Similarly
Qordf DY
nG 9n j a9n b9n
a9n D ao9n C12
X
t
ont
b9n D bo9n C 12
X
t
ont
X
k
Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)
QA DY
nN Ancent j sup1Ancent currenAncent
currenAncent D
0
Bhreg1i
hregLi
1
CA C h9niX
t
ont
X
k
Qkthsktsgtkti
sup1Ancent DAacute
h9niX
t
ontxnt iexcl hordmniX
k
Qkthsgtkti
currenAncent
iexcl1 (34)
Qreg DY
l
G regl j aregl bregl
aregl D aoregl C N2
bregl D boregl C12
X
nhA2
nli (35)
Qfrac14l D Dfrac14 j dfrac14l
dfrac14lk D dofrac14lk CX
t
X
klDk
Qkt (36)
2000 K Chan T Lee and T Sejnowski
QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl
3Aacutelkl D 3oAacutelkl C hmacrlkliX
t
X
klDk
Qkt
sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl
iP
tP
klDk Qkthsktli3Aacutelkl
(37)
Qmacrlkl D G macrlkl
j amacrlkl bmacrlkl
amacrlkl D aomacrlkl C12
X
t
X
klDk
Qkt
bmacrlkl D bomacrlkl C12
X
t
X
klDk
Qkthsktl iexcl Aacutelkl 2i (38)
$$Q(s_{kt}) = \mathcal{N}(s_{kt} \mid \mu_{s_{kt}}, \Lambda_{s_{kt}})$$
$$\Lambda_{s_{kt}} = \begin{pmatrix} \langle \beta_{1k_1} \rangle & & \\ & \ddots & \\ & & \langle \beta_{Lk_L} \rangle \end{pmatrix} + \Big\langle A^\top \begin{pmatrix} o_{1t}\psi_1 & & \\ & \ddots & \\ & & o_{Nt}\psi_N \end{pmatrix} A \Big\rangle$$
$$\Lambda_{s_{kt}}\,\mu_{s_{kt}} = \begin{pmatrix} \langle \beta_{1k_1}\phi_{1k_1} \rangle \\ \vdots \\ \langle \beta_{Lk_L}\phi_{Lk_L} \rangle \end{pmatrix} + \Big\langle A^\top \begin{pmatrix} o_{1t}\psi_1 & & \\ & \ddots & \\ & & o_{Nt}\psi_N \end{pmatrix} (x_t - \nu) \Big\rangle \tag{3.9}$$
In the above equations, $\langle \cdot \rangle$ denotes the expectation over the posterior distributions $Q(\cdot)$, $A_{n\cdot}$ is the $n$th row of the mixing matrix $A$, $\sum_{k_l = k}$ means picking out those gaussians whose vector index $k$ has the value $k$ in its $l$th element, and $o_t$ is an indicator variable for the observed entries in $x_t$:
$$o_{nt} = \begin{cases} 1 & \text{if } x_{nt} \text{ is observed} \\ 0 & \text{if } x_{nt} \text{ is missing} \end{cases} \tag{3.10}$$
For a model with equal noise variance across all observation dimensions, the summation in the learning rules for $Q(\Psi)$ would be over both $t$ and $n$. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of $Q(\pi_l)$, $Q(\phi_{lk_l})$, and $Q(\beta_{lk_l})$, it is better to rescale $P(s_{tl})$ to have zero mean and unit variance; $Q(s_{kt})$, $Q(A)$, $Q(\alpha)$, $Q(\nu)$, and $Q(\Psi)$ have to be adjusted correspondingly. Finally, $Q(k_t)$ is given by
$$\log Q(k_t) = \langle \log P(x_t^o \mid s_{kt}, \theta) \rangle + \langle \log \mathcal{N}(s_{kt} \mid \phi_k, \beta_k) \rangle - \langle \log Q(s_{kt}) \rangle + \langle \log \pi_k \rangle - \log z_t \tag{3.11}$$
where $z_t$ is a normalization constant. The lower bound $E(X, Q(\theta))$ on the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used to compare different solutions or models. After some manipulation, $E(X, Q(\theta))$ can be expressed as
$$E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta)\, \log \frac{P(\theta)}{Q(\theta)}\, d\theta \tag{3.12}$$
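The rescaling noted above, which removes the scale and translation degeneracy by standardizing each source density to zero mean and unit variance while compensating the corresponding column of $A$ and the bias $\nu$, can be sketched as follows. This is a minimal single-source version with illustrative names, not the paper's exact procedure:

```python
import numpy as np

def rescale_source(pi, phi, beta, A_col, nu):
    """Standardize one source's mixture P(s_tl) to zero mean, unit variance,
    and adjust the l-th column of A and the bias nu so A s + nu is unchanged."""
    m = np.sum(pi * phi)                           # mixture mean
    v = np.sum(pi * (1.0 / beta + phi**2)) - m**2  # mixture variance
    s = np.sqrt(v)
    phi_new = (phi - m) / s      # shifted, scaled component means
    beta_new = beta * v          # component precisions scale with variance
    return phi_new, beta_new, A_col * s, nu + A_col * m

pi = np.array([0.3, 0.7]); phi = np.array([1.0, -1.0]); beta = np.array([2.0, 4.0])
phi2, beta2, A2, nu2 = rescale_source(pi, phi, beta, np.array([1.0, 0.5]), np.zeros(2))
```

After the call, the mixture has zero mean and unit variance, while $A_{\cdot l} s_{tl} + \nu$ is left invariant because the column of $A$ absorbs the scale and $\nu$ absorbs the shift.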
4 Missing Data
4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have $N > L$. More specifically, if the number of observed dimensions in $x_t$ is greater than $L$, the equation
$$x_t^o = [A]^o_t \cdot s_t \tag{4.1}$$
would be overdetermined in $s_t$, unless $[A]^o_t$ has rank smaller than $L$. In this case, $Q(s_t)$ is likely to be unimodal and peaked; point estimates of $s_t$ would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modifications to account for missing entries, would give a reasonable approximation. When $Q(s_t)$ is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in $x_t$ is less than $L$, equation 4.1 is underdetermined in $s_t$, and $Q(s_t)$ would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single-gaussian approximation of $Q(s_t)$ is undesirable and the formalism discussed in this article is needed to capture the higher-order statistics of $Q(s_t)$ and produce a more faithful $Q(x_t^m \mid x_t^o)$. The approximate distribution $Q(x_t^m \mid x_t^o)$ can be obtained by
$$Q(x_t^m \mid x_t^o) = \sum_k Q(k_t) \int \delta(x_t^m - x_{kt}^m)\, Q(x_{kt}^m \mid x_t^o, k)\, dx_{kt}^m \tag{4.2}$$
where $\delta(\cdot)$ is the delta function and
$$Q(x_{kt}^m \mid x_t^o, k) = \int Q(\theta) \int \mathcal{N}(x_{kt}^m \mid [A s_{kt} + \nu]^m_t, [\Psi]^m_t)\, Q(s_{kt})\, ds_{kt}\, d\theta = \int\!\!\int Q(A)\, Q(\Psi)\, \mathcal{N}(x_{kt}^m \mid \mu_{x_{kt}^m}, \Lambda_{x_{kt}^m})\, dA\, d\Psi \tag{4.3}$$
$$\mu_{x_{kt}^m} = [A \mu_{s_{kt}} + \mu_\nu]^m_t \tag{4.4}$$
$$\Lambda_{x_{kt}^m}^{-1} = [A \Lambda_{s_{kt}}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1}]^m_t \tag{4.5}$$
Unfortunately, the integration over $Q(A)$ and $Q(\Psi)$ cannot be carried out analytically, but we can substitute $\langle A \rangle$ and $\langle \Psi \rangle$ as an approximation. Estimation of $Q(x_t^m \mid x_t^o)$ using the above equations is demonstrated in Figure 3: the shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ for the noiseless mixing in Figure 1 with observed $x_2 = -2$, and the solid line is the approximation by equations 4.2 through 4.5.

Figure 3: The approximation of $Q(x_t^m \mid x_t^o)$ from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ corresponding to the noiseless mixture in Figure 1 with observed $x_2 = -2$. Dotted lines are the contributions from the individual $Q(x_{kt}^m \mid x_t^o, k)$.

We have also modified the variational ICA of Chan et al. (2002) by discounting missing entries; this is done by replacing $\sum_t$ with $\sum_t o_{nt}$ and $\psi_n$ with $o_{nt}\psi_n$ in their learning rules. The dashed line in Figure 3 is the approximation $Q(x_t^m \mid x_t^o)$ from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the $K^L$ hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for $P(x_t^m \mid x_t^o)$ and a better estimate for $\langle x_t^m \mid x_t^o \rangle$. From equation 2.16,
$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}(x_t^m \mid [A s_t + \nu]^m_t, [\Psi]^m_t)\, Q(s_t)\, ds_t\, d\theta \tag{4.6}$$
and, under the above formalism, $Q(s_t)$ becomes
$$Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt})\, Q(s_{kt})\, ds_{kt} \tag{4.7}$$
which is a mixture of $K^L$ gaussians. The missing values can then be filled in by
$$\langle s_t \mid x_t^o \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_k Q(k_t)\, \mu_{s_{kt}} \tag{4.8}$$
$$\langle x_t^m \mid x_t^o \rangle = \int x_t^m\, Q(x_t^m \mid x_t^o)\, dx_t^m = \sum_k Q(k_t)\, \mu_{x_{kt}^m} = [A]^m_t\, \langle s_t \mid x_t^o \rangle + [\mu_\nu]^m_t \tag{4.9}$$
where $\mu_{s_{kt}}$ and $\mu_{x_{kt}^m}$ are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of $Q(s_t)$ and $Q(x_t^m \mid x_t^o)$ may be obtained, but then numerical methods are needed.
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of gaussians used to model the density of each source. If we set $K = 2$, the five parameters in the source density model $P(s_{tl})$ are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of the learning rules and suffers from many more local maxima; it also has an inferior marginal likelihood lower bound. The problems are more serious at high missing data rates, where a local maximum solution is usually found instead. In the full missing ICA, $Q(s_t)$ is a mixture of gaussians. In the extreme case when all entries of a data point are missing (that is, an empty $x_t^o$), $Q(s_t)$ is the same as $P(s_t \mid \theta)$ and would not interfere with the learning of $P(s_t \mid \theta)$ from the other data points. On the other hand, the single-gaussian $Q(s_t)$ in the polynomial missing ICA would drive $P(s_t \mid \theta)$ to become gaussian too. This is very undesirable when learning ICA structure.
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can easily be extended to model data density with $C$ clusters of ICA. First, all parameters $\theta$ and hidden variables $k_t$, $s_{kt}$ for each cluster are given a superscript index $c$. A parameter $\rho = \{\rho^1, \ldots, \rho^C\}$ is introduced to represent the weights on the clusters; $\rho$ has a Dirichlet prior (see equation 2.11). $\Theta = \{\rho, \theta^1, \ldots, \theta^C\}$ is now the collection of all parameters. Our density model in equation 2.1 becomes
$$P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c) = \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s_t^c + \nu^c, \Psi^c)\, P(s_t^c \mid \theta_s^c)\, ds_t^c \tag{5.1}$$
The objective function in equation 2.13 remains the same but with $\theta$ replaced by $\Theta$. The separable posterior $Q(\Theta)$ is given by
$$Q(\Theta) = Q(\rho) \prod_c Q(\theta^c) \tag{5.2}$$
and, similar to equation 2.15,
$$\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c \tag{5.3}$$
Equation 2.20 now becomes
$$\log P(x_t^o \mid \Theta) \ge \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,k} Q(c_t)\, Q(k_t^c) \Bigg[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c)\, ds_{kt}^c + \int Q(s_{kt}^c) \log \frac{\mathcal{N}(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)}\, ds_{kt}^c \Bigg] + \sum_{c,k} Q(c_t)\, Q(k_t^c) \log \frac{\pi_k^c}{Q(k_t^c)} \tag{5.4}$$
We have introduced one more hidden variable, $c_t$, and $Q(c_t)$ is to be interpreted in the same fashion as $Q(k_t^c)$. All learning rules in section 3 remain the same, only with $\sum_t$ replaced by $\sum_t Q(c_t)$. Finally, we need two more learning rules:
$$d_{\rho^c} = d^o_{\rho^c} + \sum_t Q(c_t) \tag{5.5}$$
$$\log Q(c_t) = \langle \log \rho^c \rangle + \log z_t^c - \log Z_t \tag{5.6}$$

where $z_t^c$ is the normalization constant for $Q(k_t^c)$ (see equation 3.11) and $Z_t$ is for normalizing $Q(c_t)$.
6 Supervised Classification
It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.
Consider a data set $(X_T, Y_T) = \{(x_t, y_t),\ t \in 1, \ldots, T\}$. Here $x_t$ contains the input attributes and may have missing entries, and $y_t \in \{1, \ldots, y, \ldots, Y\}$ indicates which of the $Y$ classes $x_t$ is associated with. When given a new data point $x_{T+1}$, we would like to compute $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$:
$$P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)} \tag{6.1}$$
Here $M$ denotes our generative model for the observations $\{x_t, y_t\}$:

$$P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M) \tag{6.2}$$
$P(x_t \mid y_t, M)$ could be a mixture model as given by equation 5.1.
6.1 Learning of Model Parameters. Let $P(x_t \mid y_t, M)$ be parameterized by $\Theta_y$, and let $P(y_t \mid M)$ be parameterized by $\varpi = (\varpi_1, \ldots, \varpi_Y)$ (the symbol was lost in extraction; $\varpi$ is used here throughout):

$$P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y) \tag{6.3}$$
$$P(y_t \mid M) = P(y_t = y \mid \varpi) = \varpi_y \tag{6.4}$$
If $\varpi$ is given a Dirichlet prior, $P(\varpi \mid M) = \mathcal{D}(\varpi \mid d^o_\varpi)$, its posterior is also a Dirichlet distribution:
$$P(\varpi \mid Y_T, M) = \mathcal{D}(\varpi \mid d_\varpi) \tag{6.5}$$
$$d_{\varpi_y} = d^o_{\varpi_y} + \sum_t I(y_t = y) \tag{6.6}$$
$I(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise.
Under the generative model of equation 6.2, it can be shown that
$$P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y) \tag{6.7}$$
where $X_y$ is the subset of $X_T$ containing only those $x_t$ whose training labels $y_t$ have value $y$. Hence, $P(\Theta_y \mid X_T, Y_T, M)$ can be approximated with $Q(\Theta_y)$ by applying the learning rules in sections 3 and 5 to the subset $X_y$.
6.2 Classification. First, $P(y_{T+1} \mid X_T, Y_T, M)$ in equation 6.1 can be computed by
$$P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \varpi_y)\, P(\varpi \mid X_T, Y_T)\, d\varpi = \frac{d_{\varpi_y}}{\sum_{y'} d_{\varpi_{y'}}} \tag{6.8}$$
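Equations 6.5, 6.6, and 6.8 together reduce to Dirichlet counting; a minimal sketch (hypothetical names):

```python
import numpy as np

def class_prior(y, n_classes, d0=1.0):
    """Eq. 6.6 accumulates d_y = d0_y + #{t : y_t = y}; eq. 6.8 then gives
    the predictive class probability d_y / sum_y' d_y'."""
    d = np.full(n_classes, d0)
    for yt in y:
        d[yt] += 1.0                 # the indicator I(y_t = y)
    return d, d / d.sum()

d, p = class_prior([0, 1, 1, 1], n_classes=2)   # one negative, three positives
```

With a unit prior count, the predictive probabilities are the familiar Laplace-smoothed class frequencies.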
The other term, $P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)$, can be computed as
$$\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M) = \log P(\{x_{T+1}, X_y\} \mid M) - \log P(X_y \mid M) \tag{6.9}$$
$$\approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)) \tag{6.10}$$
The above requires adding $x_{T+1}$ to $X_y$ and iterating the learning rules to obtain $Q'(\Theta_y)$ and $E(\{x_{T+1}, X_y\}, Q'(\Theta_y))$. The error in the approximation is the difference $\mathrm{KL}(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})) - \mathrm{KL}(Q(\Theta_y), P(\Theta_y \mid X_y))$. If we assume further that $Q'(\Theta_y) \approx Q(\Theta_y)$,
$$\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1} \tag{6.11}$$
where $Z_{T+1}$ is the normalization constant in equation 5.6.
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5).

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.

Noise at a $-26$ dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4; dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). The two gaussians gave an adequate fit to the source histograms and densities.
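A setup of this kind can be sketched as follows. This is not the paper's exact generator (the true sources were generalized gaussian, gamma, and beta); it only illustrates the random mixing and the 30% missing mask:

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                         # data points, sources, mixtures
S = rng.gamma(2.0, 1.0, size=(L, T))        # skewed toy sources (stand-in)
A = rng.normal(size=(N, L))                 # random mixing into 7 dimensions
X = A @ S + 0.05 * rng.normal(size=(N, T))  # low-level additive noise
missing = rng.random(size=(N, T)) < 0.3     # each entry missing w.p. 0.3
X_obs = np.where(missing, np.nan, X)        # NaN marks a missing entry
```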
Figure 6: $E(X, Q(\theta))$ as a function of the number of hidden source dimensions (horizontal axis: 1 to 7; vertical axis: log marginal likelihood lower bound, $-2000$ to $-1500$). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot $E(X, Q(\theta))$ from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between $Q(\theta)$ and $P(X \mid \theta)$, it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or $N > L$. The top row in Figure 7 shows the two original $380 \times 380$ pixel images. They were linearly mixed into three images, and $-20$ dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low
Figure 7: A demonstration of recovering missing values when $N > L$. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
uncertainty; 9.6% of the pixels were missing from any two of the mixed images, for which estimation of their values is possible but has high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if $x_1$, $x_2$, $x_3$ represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
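The quoted fractions follow directly from the binomial distribution over how many of the three mixed images lose a given pixel, each independently with probability 0.2; a quick check:

```python
from math import comb

p, n = 0.2, 3                     # per-image missing probability, 3 mixtures
frac = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
# frac[1] (missing from exactly one image) ~ 0.384: recoverable, low uncertainty
# frac[2] (missing from exactly two)      ~ 0.096: recoverable, high uncertainty
# frac[3] (missing from all three)        ~ 0.008: unrecoverable
```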
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples, and the data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$, computed using equation 6.10, at 0.5, we obtained a true positive rate of 16/24 and a true negative rate of 42/50.
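The reported rates are simple counts over the leave-one-out outputs; a sketch of the evaluation (illustrative data, not the echocardiogram set):

```python
def loo_rates(y_true, p_pos, thresh=0.5):
    """True-positive and true-negative rates after thresholding the
    leave-one-out posteriors P(y_{T+1} | x_{T+1}, X_T, Y_T, M) at 0.5."""
    y_hat = [p >= thresh for p in p_pos]
    tp = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h)
    tn = sum(1 for y, h in zip(y_true, y_hat) if y == 0 and not h)
    pos = sum(y_true)
    return tp / pos, tn / (len(y_true) - pos)

tpr, tnr = loo_rates([1, 1, 0, 0], [0.8, 0.3, 0.2, 0.6])
```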
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed, and the application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
A separable approximate posterior $Q(\theta)$ will be assumed:
$$Q(\theta) = Q(\nu)\, Q(\Psi) \times Q(A)\, Q(\alpha) \times \prod_l \Big( Q(\pi_l) \prod_{k_l} Q(\phi_{lk_l})\, Q(\beta_{lk_l}) \Big) \tag{2.14}$$
The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior $Q(\theta)$ and the prior $P(\theta)$, is then expanded as
$$\int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta = \sum_l \int Q(\pi_l) \log \frac{P(\pi_l)}{Q(\pi_l)}\, d\pi_l + \sum_{l,k_l} \int Q(\phi_{lk_l}) \log \frac{P(\phi_{lk_l})}{Q(\phi_{lk_l})}\, d\phi_{lk_l} + \sum_{l,k_l} \int Q(\beta_{lk_l}) \log \frac{P(\beta_{lk_l})}{Q(\beta_{lk_l})}\, d\beta_{lk_l} + \int\!\!\int Q(A)\, Q(\alpha) \log \frac{P(A \mid \alpha)}{Q(A)}\, dA\, d\alpha + \int Q(\alpha) \log \frac{P(\alpha)}{Q(\alpha)}\, d\alpha + \int Q(\nu) \log \frac{P(\nu)}{Q(\nu)}\, d\nu + \int Q(\Psi) \log \frac{P(\Psi)}{Q(\Psi)}\, d\Psi \tag{2.15}$$
2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that $P(x_t \mid \theta)$ is replaced by $P(x_t^o \mid \theta)$ in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain $Q(x_t^m \mid x_t^o)$, that is, the approximate distribution on the missing entries, which is given by
$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}(x_t^m \mid [A s_t + \nu]^m_t, [\Psi]^m_t)\, Q(s_t)\, ds_t\, d\theta \tag{2.16}$$
As noted by Welling and Weber (1999), the elements of $s_t$ given $x_t^o$ are dependent. More important, under the ICA model, $Q(s_t)$ is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data $x$ and the hidden variable $s$; the insets show the sample data in the two spaces. Here the hidden sources assume the density $P(s_l) \propto \exp(-|s_l|^{0.7})$. They are mixed noiselessly to give $P(x)$ in the upper graph. The cut in the upper graph represents $P(x_1 \mid x_2 = -0.5)$, which transforms into a highly correlated and nongaussian $P(s \mid x_2 = -0.5)$.

Figure 1: Probability density functions for the data $x$ (top) and hidden sources $s$ (bottom). Insets show the sample data in the two spaces. The "cuts" show $P(x_1 \mid x_2 = -0.5)$ and $P(s \mid x_2 = -0.5)$.

Unless we are interested in only the first- and second-order statistics of $Q(x_t^m \mid x_t^o)$, we should try to capture as much structure as possible of $P(s_t \mid x_t^o)$ in $Q(s_t)$. In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down $P(s_t)$ into a mixture of $K^L$ gaussians in the $L$-dimensional $s$ space:
$$P(s_t) = \prod_l^L \Bigg( \sum_{k_l} \pi_{lk_l}\, \mathcal{N}(s_{tl} \mid \phi_{lk_l}, \beta_{lk_l}) \Bigg) = \sum_{k_1} \cdots \sum_{k_L} \big[ \pi_{1k_1} \times \cdots \times \pi_{Lk_L} \times \mathcal{N}(s_{t1} \mid \phi_{1k_1}, \beta_{1k_1}) \times \cdots \times \mathcal{N}(s_{tL} \mid \phi_{Lk_L}, \beta_{Lk_L}) \big] = \sum_k \pi_k\, \mathcal{N}(s_t \mid \phi_k, \beta_k) \tag{2.17}$$
Here we have defined $k$ to be a vector index. The "$k$th" gaussian is centered at $\phi_k$, with inverse covariance $\beta_k$, in the source $s$ space:
$$k = (k_1, \ldots, k_l, \ldots, k_L)^\top, \qquad k_l = 1, \ldots, K$$
$$\phi_k = (\phi_{1k_1}, \ldots, \phi_{lk_l}, \ldots, \phi_{Lk_L})^\top$$
$$\beta_k = \begin{pmatrix} \beta_{1k_1} & & \\ & \ddots & \\ & & \beta_{Lk_L} \end{pmatrix}$$
$$\pi_k = \pi_{1k_1} \times \cdots \times \pi_{Lk_L} \tag{2.18}$$
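Constructing the expanded parameters of equation 2.18 from the per-source one-dimensional mixtures is a direct product over the vector index $k$; a sketch (illustrative names):

```python
from itertools import product
import numpy as np

def expand_mog(pi, phi, beta):
    """Build the K**L joint gaussians of eqs. 2.17-2.18: for each vector
    index k, the stacked means phi_k, diagonal precisions beta_k, and
    product weights pi_k."""
    L, K = pi.shape
    ks = list(product(range(K), repeat=L))   # all K**L vector indices
    idx = np.arange(L)
    pi_k = np.array([pi[idx, k].prod() for k in ks])
    phi_k = np.array([phi[idx, k] for k in ks])
    beta_k = np.array([beta[idx, k] for k in ks])
    return ks, pi_k, phi_k, beta_k

pi = np.array([[0.4, 0.6], [0.5, 0.5]])      # L = 2 sources, K = 2 each
phi = np.array([[-1.0, 1.0], [0.0, 2.0]])
beta = np.ones((2, 2))
ks, pi_k, phi_k, beta_k = expand_mog(pi, phi, beta)
```

The weights $\pi_k$ multiply out to a proper distribution over the $K^L$ expanded gaussians, which is what makes the full method's cost grow as $T \times K^L$.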
The log likelihood for $x_t^o$ is then expanded using Jensen's inequality:
$$\log P(x_t^o \mid \theta) = \log \int P(x_t^o \mid s_t, \theta) \sum_k \pi_k\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t = \log \sum_k \pi_k \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t \ge \sum_k Q(k_t) \log \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t + \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)} \tag{2.19}$$
Here $Q(k_t)$ is a short form for $Q(k_t = k)$; $k_t$ is a discrete hidden variable, and $Q(k_t = k)$ is the probability that the $t$th data point belongs to the $k$th gaussian.

Figure 2: A simplified directed graph for the generative model of variational ICA. $x_t$ is the observed variable, $k_t$ and $s_t$ are hidden variables, and the rest are model parameters. $k_t$ indicates which of the $K^L$ expanded gaussians generated $s_t$.

Recognizing that $s_t$ is just a dummy variable, we introduce $Q(s_{kt})$, apply Jensen's inequality again, and get
$$\log P(x_t^o \mid \theta) \ge \sum_k Q(k_t) \Bigg[ \int Q(s_{kt}) \log P(x_t^o \mid s_{kt}, \theta)\, ds_{kt} + \int Q(s_{kt}) \log \frac{\mathcal{N}(s_{kt} \mid \phi_k, \beta_k)}{Q(s_{kt})}\, ds_{kt} \Bigg] + \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)} \tag{2.20}$$
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
References
Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851
Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress
Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California
Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114
Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation
Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT
Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley
Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233
Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122
Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley
Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge
Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge
Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete
data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation
Received July 18 2002 accepted January 30 2003
Figure 1: Probability density functions for the data $x$ (top) and hidden sources $s$ (bottom). Inserts show the sample data in the two spaces. The "cuts" show $P(x_1 \mid x_2 = -0.5)$ and $P(s \mid x_2 = -0.5)$.
$P(s_t \mid x_t^o)$ in $Q(s_t)$. In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down $P(s_t)$ into a mixture of $K^L$ gaussians in the $L$-dimensional $s$ space:

$$P(s_t) = \prod_l^L \Bigg( \sum_{k_l} \pi_{lk_l}\, \mathcal{N}(s_{tl} \mid \phi_{lk_l}, \beta_{lk_l}) \Bigg) = \sum_{k_1} \cdots \sum_{k_L} \big[ \pi_{1k_1} \times \cdots \times \pi_{Lk_L} \times \mathcal{N}(s_{t1} \mid \phi_{1k_1}, \beta_{1k_1}) \times \cdots \times \mathcal{N}(s_{tL} \mid \phi_{Lk_L}, \beta_{Lk_L}) \big] = \sum_{\mathbf{k}} \pi_{\mathbf{k}}\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}}). \quad (2.17)$$
Here we have defined $\mathbf{k}$ to be a vector index. The "$\mathbf{k}$th" gaussian is centered at $\phi_{\mathbf{k}}$, with inverse covariance $\beta_{\mathbf{k}}$, in the source $s$ space:

$$\mathbf{k} = (k_1, \ldots, k_l, \ldots, k_L)^\top, \qquad k_l = 1, \ldots, K,$$
$$\phi_{\mathbf{k}} = (\phi_{1k_1}, \ldots, \phi_{lk_l}, \ldots, \phi_{Lk_L})^\top,$$
$$\beta_{\mathbf{k}} = \mathrm{diag}(\beta_{1k_1}, \ldots, \beta_{Lk_L}),$$
$$\pi_{\mathbf{k}} = \pi_{1k_1} \times \cdots \times \pi_{Lk_L}. \quad (2.18)$$
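The vector-index construction in equations 2.17 and 2.18 can be sketched in a few lines of Python; the variable names and parameter values below are toy choices of ours, not from the paper:

```python
import itertools
import numpy as np

# Expand per-source 1-D mixtures (pi, phi, beta indexed by [l, k]) into the
# K^L joint gaussians of equations 2.17-2.18. Toy values, L = K = 2.
L, K = 2, 2
pi = np.array([[0.3, 0.7],
               [0.5, 0.5]])          # mixture weights; rows sum to 1
phi = np.array([[-1.0, 1.0],
                [-2.0, 2.0]])        # component means
beta = np.ones((L, K))               # component precisions

components = []
for k in itertools.product(range(K), repeat=L):         # vector index k
    pi_k = np.prod([pi[l, k[l]] for l in range(L)])     # product of weights
    phi_k = np.array([phi[l, k[l]] for l in range(L)])  # stacked means
    beta_k = np.diag([beta[l, k[l]] for l in range(L)]) # diagonal precision
    components.append((pi_k, phi_k, beta_k))

assert len(components) == K ** L                        # K^L expanded gaussians
assert np.isclose(sum(c[0] for c in components), 1.0)   # weights stay normalized
```

The loop makes explicit why the expansion is exponential in $L$: every combination of per-source components becomes one joint gaussian.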
The log likelihood of $x_t^o$ is then expanded using Jensen's inequality:

$$\log P(x_t^o \mid \theta) = \log \int P(x_t^o \mid s_t, \theta) \sum_{\mathbf{k}} \pi_{\mathbf{k}}\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t = \log \sum_{\mathbf{k}} \pi_{\mathbf{k}} \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t \geq \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})\, ds_t + \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \frac{\pi_{\mathbf{k}}}{Q(\mathbf{k}_t)}. \quad (2.19)$$
Here, $Q(\mathbf{k}_t)$ is a short form for $Q(\mathbf{k}_t = \mathbf{k})$; $\mathbf{k}_t$ is a discrete hidden variable, and $Q(\mathbf{k}_t = \mathbf{k})$ is the probability that the $t$th data point belongs to the $\mathbf{k}$th gaussian. Recognizing that $s_t$ is just a dummy variable, we introduce $Q(s_{\mathbf{k}t})$,
Figure 2: A simplified directed graph for the generative model of variational ICA. $x_t$ is the observed variable, $\mathbf{k}_t$ and $s_t$ are hidden variables, and the rest are model parameters. $\mathbf{k}_t$ indicates which of the $K^L$ expanded gaussians generated $s_t$.
apply Jensen's inequality again, and get

$$\log P(x_t^o \mid \theta) \geq \sum_{\mathbf{k}} Q(\mathbf{k}_t) \Bigg[ \int Q(s_{\mathbf{k}t}) \log P(x_t^o \mid s_{\mathbf{k}t}, \theta)\, ds_{\mathbf{k}t} + \int Q(s_{\mathbf{k}t}) \log \frac{\mathcal{N}(s_{\mathbf{k}t} \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}})}{Q(s_{\mathbf{k}t})}\, ds_{\mathbf{k}t} \Bigg] + \sum_{\mathbf{k}} Q(\mathbf{k}_t) \log \frac{\pi_{\mathbf{k}}}{Q(\mathbf{k}_t)}. \quad (2.20)$$
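The direction of the Jensen bound in equation 2.19 is easy to verify numerically. In this sketch, the per-component integrals are replaced by arbitrary positive stand-in values; everything else follows the equation:

```python
import numpy as np

# For any responsibilities Q(k) on the simplex,
#   log sum_k pi_k I_k  >=  sum_k Q(k) log I_k + sum_k Q(k) log(pi_k / Q(k)),
# with equality when Q(k) is proportional to pi_k I_k.
pi = np.array([0.25, 0.75])          # mixture weights pi_k
I = np.array([0.1, 0.9])             # stand-ins for the integrals in eq. 2.19
lhs = np.log(pi @ I)

rng = np.random.default_rng(0)
for _ in range(100):
    q = rng.dirichlet(np.ones(2))    # an arbitrary Q(k)
    rhs = q @ np.log(I) + q @ np.log(pi / q)
    assert rhs <= lhs + 1e-12        # Jensen's bound holds

q_opt = pi * I / (pi @ I)            # optimal responsibilities
rhs_opt = q_opt @ np.log(I) + q_opt @ np.log(pi / q_opt)
assert np.isclose(rhs_opt, lhs)      # the bound is tight at the optimum
```

The tightness at $Q(\mathbf{k}) \propto \pi_{\mathbf{k}} I_{\mathbf{k}}$ is what the $Q(\mathbf{k}_t)$ update later exploits.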
Substituting $\log P(x_t^o \mid \theta)$ back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation of the generative model of variational ICA: $x_t$ is the observed variable, $\mathbf{k}_t$ and $s_t$ are hidden variables, and the rest are model parameters, where $\mathbf{k}_t$ indicates which of the $K^L$ expanded gaussians generated $s_t$.
3 Learning Rules
Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood $\log P(X)$ with regard to $Q(\theta)$ (see equation 2.14), $Q(\mathbf{k}_t)$, and $Q(s_{\mathbf{k}t})$ (see equation 2.20), for example,

$$\log Q(\nu) = \log P(\nu) + \int Q(\theta_{\setminus\nu}) \sum_t \log P(x_t^o \mid \theta)\, d\theta_{\setminus\nu} + \text{const}, \quad (3.1)$$
where $\theta_{\setminus\nu}$ is the set of parameters excluding $\nu$. This gives

$$Q(\nu) = \prod_n \mathcal{N}(\nu_n \mid \mu_{\nu_n}, \Lambda_{\nu_n}),$$
$$\Lambda_{\nu_n} = \Lambda_{o\nu_n} + \langle \Psi_n \rangle \sum_t o_{nt},$$
$$\mu_{\nu_n} = \frac{\Lambda_{o\nu_n} \mu_{o\nu_n} + \langle \Psi_n \rangle \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle x_{nt} - A_{n\cdot} s_{\mathbf{k}t} \rangle}{\Lambda_{\nu_n}}. \quad (3.2)$$
Similarly,

$$Q(\Psi) = \prod_n \mathcal{G}(\Psi_n \mid a_{\Psi_n}, b_{\Psi_n}),$$
$$a_{\Psi_n} = a_{o\Psi_n} + \frac{1}{2} \sum_t o_{nt},$$
$$b_{\Psi_n} = b_{o\Psi_n} + \frac{1}{2} \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle (x_{nt} - A_{n\cdot} s_{\mathbf{k}t} - \nu_n)^2 \rangle. \quad (3.3)$$
$$Q(A) = \prod_n \mathcal{N}(A_{n\cdot} \mid \mu_{A_{n\cdot}}, \Lambda_{A_{n\cdot}}),$$
$$\Lambda_{A_{n\cdot}} = \mathrm{diag}(\langle \alpha_1 \rangle, \ldots, \langle \alpha_L \rangle) + \langle \Psi_n \rangle \sum_t o_{nt} \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle s_{\mathbf{k}t} s_{\mathbf{k}t}^\top \rangle,$$
$$\mu_{A_{n\cdot}} = \Bigg( \langle \Psi_n \rangle \sum_t o_{nt} \big( x_{nt} - \langle \nu_n \rangle \big) \sum_{\mathbf{k}} Q(\mathbf{k}_t) \langle s_{\mathbf{k}t}^\top \rangle \Bigg) \Lambda_{A_{n\cdot}}^{-1}. \quad (3.4)$$
$$Q(\alpha) = \prod_l \mathcal{G}(\alpha_l \mid a_{\alpha_l}, b_{\alpha_l}),$$
$$a_{\alpha_l} = a_{o\alpha_l} + \frac{N}{2}, \qquad b_{\alpha_l} = b_{o\alpha_l} + \frac{1}{2} \sum_n \langle A_{nl}^2 \rangle. \quad (3.5)$$
$$Q(\pi_l) = \mathcal{D}(\pi_l \mid d_{\pi_l}),$$
$$d_{\pi_{lk}} = d_{o\pi_{lk}} + \sum_t \sum_{k_l = k} Q(\mathbf{k}_t). \quad (3.6)$$
$$Q(\phi_{lk_l}) = \mathcal{N}(\phi_{lk_l} \mid \mu_{\phi_{lk_l}}, \Lambda_{\phi_{lk_l}}),$$
$$\Lambda_{\phi_{lk_l}} = \Lambda_{o\phi_{lk_l}} + \langle \beta_{lk_l} \rangle \sum_t \sum_{k_l = k} Q(\mathbf{k}_t),$$
$$\mu_{\phi_{lk_l}} = \frac{\Lambda_{o\phi_{lk_l}} \mu_{o\phi_{lk_l}} + \langle \beta_{lk_l} \rangle \sum_t \sum_{k_l = k} Q(\mathbf{k}_t) \langle s_{\mathbf{k}tl} \rangle}{\Lambda_{\phi_{lk_l}}}. \quad (3.7)$$
$$Q(\beta_{lk_l}) = \mathcal{G}(\beta_{lk_l} \mid a_{\beta_{lk_l}}, b_{\beta_{lk_l}}),$$
$$a_{\beta_{lk_l}} = a_{o\beta_{lk_l}} + \frac{1}{2} \sum_t \sum_{k_l = k} Q(\mathbf{k}_t),$$
$$b_{\beta_{lk_l}} = b_{o\beta_{lk_l}} + \frac{1}{2} \sum_t \sum_{k_l = k} Q(\mathbf{k}_t) \langle (s_{\mathbf{k}tl} - \phi_{lk_l})^2 \rangle. \quad (3.8)$$
$$Q(s_{\mathbf{k}t}) = \mathcal{N}(s_{\mathbf{k}t} \mid \mu_{s_{\mathbf{k}t}}, \Lambda_{s_{\mathbf{k}t}}),$$
$$\Lambda_{s_{\mathbf{k}t}} = \mathrm{diag}(\langle \beta_{1k_1} \rangle, \ldots, \langle \beta_{Lk_L} \rangle) + \Big\langle A^\top \mathrm{diag}(o_{1t} \Psi_1, \ldots, o_{Nt} \Psi_N)\, A \Big\rangle,$$
$$\Lambda_{s_{\mathbf{k}t}} \mu_{s_{\mathbf{k}t}} = \big( \langle \beta_{1k_1} \phi_{1k_1} \rangle, \ldots, \langle \beta_{Lk_L} \phi_{Lk_L} \rangle \big)^\top + \Big\langle A^\top \mathrm{diag}(o_{1t} \Psi_1, \ldots, o_{Nt} \Psi_N)\, (x_t - \nu) \Big\rangle. \quad (3.9)$$
In the above equations, $\langle \cdot \rangle$ denotes the expectation over the posterior distributions $Q(\cdot)$, $A_{n\cdot}$ is the $n$th row of the mixing matrix $A$, $\sum_{k_l = k}$ means picking out those gaussians such that the $l$th element of their indices $\mathbf{k}$ has the value $k$, and $o_t$ is an indicator variable for observed entries in $x_t$:

$$o_{nt} = \begin{cases} 1 & \text{if } x_{nt} \text{ is observed,} \\ 0 & \text{if } x_{nt} \text{ is missing.} \end{cases} \quad (3.10)$$
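As a small illustration of how the indicator $o_{nt}$ of equation 3.10 enters the learning rules, the sketch below computes the precision update for $Q(\nu)$ from equation 3.2, building the mask from NaN-coded missing entries; all numbers are toy values of ours:

```python
import numpy as np

# N = 2 observed dimensions, T = 3 data points; NaN marks a missing entry.
X = np.array([[1.0, np.nan, 0.5],
              [np.nan, 2.0, 1.5]])
o = ~np.isnan(X)                          # o[n, t] = 1 iff x_nt is observed (eq. 3.10)

psi_mean = np.array([4.0, 2.0])           # assumed posterior means <Psi_n>
lambda_nu_prior = np.array([1e-3, 1e-3])  # prior precisions Lambda_o,nu_n

# Equation 3.2: Lambda_nu_n = Lambda_o,nu_n + <Psi_n> * sum_t o_nt;
# each dimension accumulates evidence only from its own observed entries.
lambda_nu = lambda_nu_prior + psi_mean * o.sum(axis=1)
assert np.allclose(lambda_nu, [8.001, 4.001])
```

The same masking pattern (multiply each per-sample term by $o_{nt}$ before summing over $t$) applies to every rule in this section.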
For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for $Q(\Psi)$ would be over both $t$ and $n$. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of $Q(\pi_l)$, $Q(\phi_{lk_l})$, and $Q(\beta_{lk_l})$, it is better to rescale $P(s_{tl})$ to have zero mean and unit variance; $Q(s_{\mathbf{k}t})$, $Q(A)$, $Q(\alpha)$, $Q(\nu)$, and $Q(\Psi)$ have to be adjusted correspondingly. Finally, $Q(\mathbf{k}_t)$ is given by
$$\log Q(\mathbf{k}_t) = \langle \log P(x_t^o \mid s_{\mathbf{k}t}, \theta) \rangle + \langle \log \mathcal{N}(s_{\mathbf{k}t} \mid \phi_{\mathbf{k}}, \beta_{\mathbf{k}}) \rangle - \langle \log Q(s_{\mathbf{k}t}) \rangle + \langle \log \pi_{\mathbf{k}} \rangle - \log z_t, \quad (3.11)$$

where $z_t$ is a normalization constant. The lower bound $E(X, Q(\theta))$ for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, $E(X, Q(\theta))$ can be expressed as

$$E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta. \quad (3.12)$$
4 Missing Data
4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have $N > L$. More specifically, if the number of observed dimensions in $x_t$ is greater than $L$, the equation

$$x_t^o = [A]_t^o \cdot s_t \quad (4.1)$$

would be overdetermined in $s_t$ unless $[A]_t^o$ has a rank smaller than $L$. In this case, $Q(s_t)$ is likely to be unimodal and peaked; point estimates of $s_t$ would be sufficient and reliable, and the learning rules of Chan et al. (2002), with a small modification to account for missing entries, would give a reasonable approximation. When $Q(s_t)$ is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in $x_t$ is less than $L$, equation 4.1 is now underdetermined in $s_t$, and $Q(s_t)$ would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of $Q(s_t)$ is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of $Q(s_t)$ and produce a more faithful $Q(x_t^m \mid x_t^o)$. The approximate distribution $Q(x_t^m \mid x_t^o)$ can be obtained by

$$Q(x_t^m \mid x_t^o) = \sum_{\mathbf{k}} Q(\mathbf{k}_t) \int \delta(x_t^m - x_{\mathbf{k}t}^m)\, Q(x_{\mathbf{k}t}^m \mid x_t^o, \mathbf{k})\, dx_{\mathbf{k}t}^m, \quad (4.2)$$
where $\delta(\cdot)$ is the delta function and

$$Q(x_{\mathbf{k}t}^m \mid x_t^o, \mathbf{k}) = \int Q(\theta) \int \mathcal{N}\big(x_{\mathbf{k}t}^m \mid [A s_{\mathbf{k}t} + \nu]_t^m, [\Psi]_t^m\big)\, Q(s_{\mathbf{k}t})\, ds_{\mathbf{k}t}\, d\theta = \int\!\!\int Q(A)\, Q(\Psi)\, \mathcal{N}\big(x_{\mathbf{k}t}^m \mid \mu_{x_{\mathbf{k}t}^m}, \Lambda_{x_{\mathbf{k}t}^m}\big)\, dA\, d\Psi, \quad (4.3)$$

$$\mu_{x_{\mathbf{k}t}^m} = [A \mu_{s_{\mathbf{k}t}} + \mu_\nu]_t^m, \quad (4.4)$$

$$\Lambda_{x_{\mathbf{k}t}^m}^{-1} = \big[ A \Lambda_{s_{\mathbf{k}t}}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1} \big]_t^m. \quad (4.5)$$
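A numeric sketch of equations 4.4 and 4.5, with $\langle A \rangle$ and $\langle \Psi \rangle$ substituted for the intractable integrals as the text suggests; all matrices below are toy values of ours:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.8, 0.3]])            # <A>: N = 3 mixtures, L = 2 sources
nu = np.zeros(3)                      # <nu>
psi = np.full(3, 25.0)                # <Psi>: diagonal noise precisions
lambda_nu_inv = np.full(3, 1e-4)      # Lambda_nu^{-1}: bias posterior variances

mu_s = np.array([0.4, -1.2])          # mu_skt from equation 3.9
cov_s = 0.1 * np.eye(2)               # Lambda_skt^{-1}
m = [2]                               # suppose dimension 2 of x_t is missing

mu_xm = (A @ mu_s + nu)[m]                      # equation 4.4
cov_xm = (A @ cov_s @ A.T                       # equation 4.5 (inverted precision)
          + np.diag(lambda_nu_inv)
          + np.diag(1.0 / psi))[np.ix_(m, m)]

assert np.isclose(mu_xm[0], -0.04)
assert np.isclose(cov_xm[0, 0], 0.1131)
```

The `[m]` and `np.ix_(m, m)` selections play the role of the $[\cdot]_t^m$ restriction to the missing dimensions.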
Unfortunately, the integration over $Q(A)$ and $Q(\Psi)$ cannot be carried out analytically, but we can substitute $\langle A \rangle$ and $\langle \Psi \rangle$ as an approximation. Estimation of $Q(x_t^m \mid x_t^o)$ using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ for the noiseless mixing in Figure 1 with observed $x_2 = -2$, and the solid line is the approximation by equations 4.2 through 4.5.

Figure 3: The approximation of $Q(x_t^m \mid x_t^o)$ from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ corresponding to the noiseless mixture in Figure 1 with observed $x_2 = -2$. Dotted lines are the contributions from the individual $Q(x_{\mathbf{k}t}^m \mid x_t^o, \mathbf{k})$.

We have modified the variational ICA of Chan et al. (2002) by discounting missing entries; this is done by replacing $\sum_t$ with $\sum_t o_{nt}$ and $\Psi_n$ with $o_{nt} \Psi_n$ in their learning rules. The dashed line in Figure 3 is the approximation $Q(x_t^m \mid x_t^o)$ from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the $K^L$ hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for $P(x_t^m \mid x_t^o)$ and a better estimate for $\langle x_t^m \mid x_t^o \rangle$. From equation 2.16,
$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}\big(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m\big)\, Q(s_t)\, ds_t\, d\theta, \quad (4.6)$$
and under the above formalism, $Q(s_t)$ becomes

$$Q(s_t) = \sum_{\mathbf{k}} Q(\mathbf{k}_t) \int \delta(s_t - s_{\mathbf{k}t})\, Q(s_{\mathbf{k}t})\, ds_{\mathbf{k}t}, \quad (4.7)$$
which is a mixture of $K^L$ gaussians. The missing values can then be filled in by

$$\langle s_t \mid x_t^o \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_{\mathbf{k}} Q(\mathbf{k}_t)\, \mu_{s_{\mathbf{k}t}}, \quad (4.8)$$

$$\langle x_t^m \mid x_t^o \rangle = \int x_t^m\, Q(x_t^m \mid x_t^o)\, dx_t^m = \sum_{\mathbf{k}} Q(\mathbf{k}_t)\, \mu_{x_{\mathbf{k}t}^m} = [A]_t^m\, \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m, \quad (4.9)$$
where $\mu_{s_{\mathbf{k}t}}$ and $\mu_{x_{\mathbf{k}t}^m}$ are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of $Q(s_t)$ and $Q(x_t^m \mid x_t^o)$ may be obtained, but numerical methods would then be needed.
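Equations 4.8 and 4.9 amount to a responsibility-weighted average of the component means; a toy sketch (all values are ours):

```python
import numpy as np

q_k = np.array([0.7, 0.3])            # Q(k_t) over two expanded gaussians
mu_s = np.array([[0.5, -1.0],         # mu_skt for each k (one row per k)
                 [1.5,  0.0]])
A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.8, 0.3]])            # mixing matrix, N = 3, L = 2
mu_nu = np.zeros(3)                   # bias posterior mean
m = [1]                               # missing dimension(s) of x_t

s_mean = q_k @ mu_s                   # equation 4.8: <s_t | x_t^o>
x_fill = (A @ s_mean + mu_nu)[m]      # equation 4.9: <x_t^m | x_t^o>

assert np.allclose(s_mean, [0.8, -0.7])
assert np.allclose(x_fill, [-0.54])
```

This is the posterior-mean imputation; the per-component variances from equation 4.5 would quantify the fill-in uncertainty.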
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of gaussians used to model the density of each source. If we set $K = 2$, the five parameters in the source density model $P(s_{tl})$ are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of learning rules, suffers from many more local maxima, and has an inferior marginal likelihood lower bound. These problems are more serious at high missing-data rates, and a local maximum solution is usually found instead. In the full missing ICA, $Q(s_t)$ is a mixture of gaussians. In the extreme case when all entries of a data point are missing, that is, an empty $x_t^o$, $Q(s_t)$ is the same as $P(s_t \mid \theta)$ and would not interfere with the learning of $P(s_t \mid \theta)$ from the other data points. On the other hand, the single gaussian $Q(s_t)$ in the polynomial missing ICA would drive $P(s_t \mid \theta)$ to become gaussian too. This is very undesirable when learning ICA structure.
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can be easily extended to model data density with $C$ clusters of ICA. First, all parameters $\theta$ and hidden variables $\mathbf{k}_t$, $s_{\mathbf{k}t}$ for each cluster are given a superscript index $c$. A parameter $\rho = \{\rho_1, \ldots, \rho_C\}$ is introduced to represent the weights on the clusters; $\rho$ has a Dirichlet prior (see equation 2.11). $\Theta = \{\rho, \theta^1, \ldots, \theta^C\}$ is now the collection of all parameters. Our density model in equation 2.1 becomes

$$P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c) = \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s_t^c + \nu^c, \Psi^c)\, P(s_t^c \mid \theta_s^c)\, ds_t^c. \quad (5.1)$$
The objective function in equation 2.13 remains the same but with $\theta$ replaced by $\Theta$. The separable posterior $Q(\Theta)$ is given by

$$Q(\Theta) = Q(\rho) \prod_c Q(\theta^c), \quad (5.2)$$

and similar to equation 2.15,

$$\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c. \quad (5.3)$$
Equation 2.20 now becomes

$$\log P(x_t^o \mid \Theta) \geq \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,\mathbf{k}} Q(c_t)\, Q(\mathbf{k}_t^c) \Bigg[ \int Q(s_{\mathbf{k}t}^c) \log P(x_t^o \mid s_{\mathbf{k}t}^c, \theta^c)\, ds_{\mathbf{k}t}^c + \int Q(s_{\mathbf{k}t}^c) \log \frac{\mathcal{N}(s_{\mathbf{k}t}^c \mid \phi_{\mathbf{k}}^c, \beta_{\mathbf{k}}^c)}{Q(s_{\mathbf{k}t}^c)}\, ds_{\mathbf{k}t}^c \Bigg] + \sum_{c,\mathbf{k}} Q(c_t)\, Q(\mathbf{k}_t^c) \log \frac{\pi_{\mathbf{k}}^c}{Q(\mathbf{k}_t^c)}. \quad (5.4)$$
We have introduced one more hidden variable, $c_t$, and $Q(c_t)$ is to be interpreted in the same fashion as $Q(\mathbf{k}_t^c)$. All learning rules in section 3 remain the same, only with $\sum_t$ replaced by $\sum_t Q(c_t)$. Finally, we need two more learning rules:

$$d_{\rho_c} = d_{o\rho_c} + \sum_t Q(c_t), \quad (5.5)$$

$$\log Q(c_t) = \langle \log \rho_c \rangle + \log z_{ct} - \log Z_t, \quad (5.6)$$

where $z_{ct}$ is the normalization constant for $Q(\mathbf{k}_t^c)$ (see equation 3.11) and $Z_t$ is for normalizing $Q(c_t)$.
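The two extra rules translate directly into a log-domain softmax over clusters followed by a Dirichlet count update; a minimal sketch with toy numbers of ours:

```python
import numpy as np

log_rho = np.log([0.6, 0.4])             # <log rho_c> (toy values)
log_z = np.array([[-3.0, -5.0],          # log z_ct for T = 2 points, C = 2 clusters
                  [-4.0, -2.5]])

# Equation 5.6: log Q(c_t) = <log rho_c> + log z_ct - log Z_t.
log_q = log_rho + log_z
log_q -= log_q.max(axis=1, keepdims=True)    # stabilize before exponentiating
q_c = np.exp(log_q)
q_c /= q_c.sum(axis=1, keepdims=True)        # the division by Z_t

# Equation 5.5: d_rho_c = d_o,rho_c + sum_t Q(c_t).
d_rho = np.ones(2) + q_c.sum(axis=0)

assert np.allclose(q_c.sum(axis=1), 1.0)
assert np.isclose(d_rho.sum(), 2 + 2)        # prior mass plus one unit per data point
```

Because the responsibilities sum to one per data point, the Dirichlet counts grow by exactly one unit of mass per observation, as in an ordinary mixture model.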
6 Supervised Classification
It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set $(X_T, Y_T) = \{(x_t, y_t);\ t \in 1, \ldots, T\}$. Here $x_t$ contains the input attributes and may have missing entries; $y_t \in \{1, \ldots, y, \ldots, Y\}$ indicates which of the $Y$ classes $x_t$ is associated with. When given a new data point $x_{T+1}$, we would like to compute $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$:
$$P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}. \quad (6.1)$$

Here $M$ denotes our generative model for the observations $\{x_t, y_t\}$:

$$P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M), \quad (6.2)$$

where $P(x_t \mid y_t, M)$ could be a mixture model as given by equation 5.1.
6.1 Learning of Model Parameters. Let $P(x_t \mid y_t, M)$ be parameterized by $\Theta_y$ and $P(y_t \mid M)$ be parameterized by $\varpi = (\varpi_1, \ldots, \varpi_Y)$:

$$P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y), \quad (6.3)$$
$$P(y_t \mid M) = P(y_t = y \mid \varpi) = \varpi_y. \quad (6.4)$$

If $\varpi$ is given a Dirichlet prior, $P(\varpi \mid M) = \mathcal{D}(\varpi \mid d_o)$, its posterior is also a Dirichlet distribution:

$$P(\varpi \mid Y_T, M) = \mathcal{D}(\varpi \mid d), \quad (6.5)$$
$$d_y = d_{oy} + \sum_t I(y_t = y). \quad (6.6)$$
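Equations 6.5 and 6.6 reduce to simple label counting; a sketch with toy labels of ours:

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])          # training labels, Y = 2 classes
d_prior = np.ones(2)                   # d_o: symmetric Dirichlet prior

counts = np.bincount(y, minlength=2)   # sum_t I(y_t = y), equation 6.6
d = d_prior + counts                   # Dirichlet posterior counts
p_y = d / d.sum()                      # posterior-mean class prior

assert np.allclose(p_y, [3 / 7, 4 / 7])
```

The normalized counts $d_y / \sum_{y'} d_{y'}$ are exactly the predictive probabilities used in the classification step below.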
$I(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise. Under the generative model of equation 6.2, it can be shown that

$$P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y), \quad (6.7)$$

where $X_y$ is the subset of $X_T$ containing only those $x_t$ whose training labels $y_t$ have value $y$. Hence, $P(\Theta_y \mid X_T, Y_T, M)$ can be approximated with $Q(\Theta_y)$ by applying the learning rules in sections 3 and 5 on the subset $X_y$.
6.2 Classification. First, $P(y_{T+1} \mid X_T, Y_T, M)$ in equation 6.1 can be computed by

$$P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \varpi)\, P(\varpi \mid X_T, Y_T)\, d\varpi = \frac{d_y}{\sum_{y'} d_{y'}}. \quad (6.8)$$
The other term, $P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)$, can be computed as

$$\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M) = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M) \quad (6.9)$$
$$\approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)). \quad (6.10)$$

The above requires adding $x_{T+1}$ to $X_y$ and iterating the learning rules to obtain $Q'(\Theta_y)$ and $E(\{x_{T+1}, X_y\}, Q'(\Theta_y))$. The error in the approximation is the difference $\mathrm{KL}\big(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})\big) - \mathrm{KL}\big(Q(\Theta_y), P(\Theta_y \mid X_y)\big)$. If we assume further that $Q'(\Theta_y) \approx Q(\Theta_y)$,

$$\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1}, \quad (6.11)$$

where $Z_{T+1}$ is the normalization constant in equation 5.6.
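Putting equations 6.1, 6.8, and 6.11 together, classification becomes a log-domain combination of the per-class evidence $\log Z_{T+1}$ with the Dirichlet class prior; the numbers below are toy values of ours:

```python
import numpy as np

log_Z = np.array([-10.2, -11.0])   # log Z_{T+1} per class (equation 6.11), toy
d = np.array([25.0, 51.0])         # Dirichlet counts d_y (equation 6.6), toy

# Equation 6.1, up to the shared denominator P(x_{T+1} | X_T, Y_T, M):
log_post = log_Z + np.log(d / d.sum())
log_post -= log_post.max()         # stabilize before exponentiating
post = np.exp(log_post)
post /= post.sum()                 # P(y_{T+1} | x_{T+1}, X_T, Y_T, M)

assert np.isclose(post.sum(), 1.0)
assert post[0] > post[1]           # here the larger evidence outweighs the smaller prior
```

Working in the log domain matters because the per-class evidences $Z_{T+1}$ can differ by many orders of magnitude.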
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered sources distribution; dashed lines: original probability densities; solid line: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
[Figure 6 plot: log marginal likelihood lower bound (roughly −2000 to −1500) versus number of dimensions (1 to 7), for full missing ICA and polynomial missing ICA.]

Figure 6: $E(X, Q(\theta))$ as a function of hidden source dimensions. Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot $E(X, Q(\theta))$ from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between $Q(\theta)$ and $P(X \mid \theta)$, it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or $N > L$. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated, and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if $x_1$, $x_2$, $x_3$ represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when $N > L$. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom row).
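The quoted fractions follow from independent masking with probability $p = 0.2$ across the three mixed images, as this quick check confirms:

```python
p = 0.2  # per-image missing probability

all_three = p ** 3                  # pixel missing from all 3 mixtures: unrecoverable
exactly_one = 3 * p * (1 - p) ** 2  # missing from one mixture: low-uncertainty fill-in
exactly_two = 3 * p ** 2 * (1 - p)  # missing from two mixtures: high-uncertainty fill-in

assert abs(all_three - 0.008) < 1e-12    # 0.8%
assert abs(exactly_one - 0.384) < 1e-12  # 38.4%
assert abs(exactly_two - 0.096) < 1e-12  # 9.6%
```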
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output $P(y_{T+1} \mid X_T, Y_T, M)$, computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
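For reference, the leave-one-out counts above correspond to the following summary rates:

```python
tp, pos = 16, 24        # true positives out of 24 positive examples
tn, neg = 42, 50        # true negatives out of 50 negative examples

tpr = tp / pos                    # sensitivity, about 0.667
tnr = tn / neg                    # specificity, 0.84
acc = (tp + tn) / (pos + neg)     # overall leave-one-out accuracy, about 0.784

assert abs(tpr - 2 / 3) < 1e-12
assert abs(tnr - 0.84) < 1e-12
assert abs(acc - 58 / 74) < 1e-12
```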
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of one-dimensional gaussians used to model the density of each source. This exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
Variational Bayesian Learning of ICA with Missing Data 1997
P(s_t | x_t^o) in Q(s_t). In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down P(s_t) into a mixture of K^L gaussians in the L-dimensional s space:

    P(s_t) = \prod_l \Big( \sum_{k_l} \pi_{l k_l} N(s_{tl} \mid \phi_{l k_l}, \beta_{l k_l}) \Big)
           = \sum_{k_1} \cdots \sum_{k_L} \big[ \pi_{1 k_1} \times \cdots \times \pi_{L k_L}
             \times N(s_{t1} \mid \phi_{1 k_1}, \beta_{1 k_1}) \times \cdots \times N(s_{tL} \mid \phi_{L k_L}, \beta_{L k_L}) \big]
           = \sum_k \pi_k N(s_t \mid \phi_k, \beta_k).     (2.17)

Here we have defined k to be a vector index. The "k-th" gaussian is centered at \phi_k, of inverse covariance \beta_k, in the source s space:

    k = (k_1, \ldots, k_l, \ldots, k_L)^\top,  k_l = 1, \ldots, K,
    \phi_k = (\phi_{1 k_1}, \ldots, \phi_{l k_l}, \ldots, \phi_{L k_L})^\top,
    \beta_k = diag(\beta_{1 k_1}, \ldots, \beta_{L k_L}),
    \pi_k = \pi_{1 k_1} \times \cdots \times \pi_{L k_L}.     (2.18)

The log likelihood for x_t^o is then expanded using Jensen's inequality:

    \log P(x_t^o \mid \theta) = \log \int P(x_t^o \mid s_t, \theta) \sum_k \pi_k N(s_t \mid \phi_k, \beta_k) \, ds_t
      = \log \sum_k \pi_k \int P(x_t^o \mid s_t, \theta) N(s_t \mid \phi_k, \beta_k) \, ds_t
      \geq \sum_k Q(k_t) \log \int P(x_t^o \mid s_t, \theta) N(s_t \mid \phi_k, \beta_k) \, ds_t
        + \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)}.     (2.19)

Here Q(k_t) is a short form for Q(k_t = k); k_t is a discrete hidden variable, and Q(k_t = k) is the probability that the t-th data point belongs to the k-th gaussian. Recognizing that s_t is just a dummy variable, we introduce Q(s_{kt}), apply Jensen's inequality again, and get

    \log P(x_t^o \mid \theta) \geq \sum_k Q(k_t) \Big[ \int Q(s_{kt}) \log P(x_t^o \mid s_{kt}, \theta) \, ds_{kt}
        + \int Q(s_{kt}) \log \frac{N(s_{kt} \mid \phi_k, \beta_k)}{Q(s_{kt})} \, ds_{kt} \Big]
        + \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)}.     (2.20)

Figure 2: A simplified directed graph for the generative model of variational ICA. x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters. The k_t indicates which of the K^L expanded gaussians generated s_t.

Substituting \log P(x_t^o \mid \theta) back into equation 2.13, the variational Bayesian method can be continued as usual. We have drawn in Figure 2 a simplified graphical representation for the generative model of variational ICA: x_t is the observed variable, k_t and s_t are hidden variables, and the rest are model parameters, where k_t indicates which of the K^L expanded gaussians generated s_t.
3 Learning Rules
Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood \log P(X) with regard to Q(\theta) (see equation 2.14), Q(k_t), and Q(s_{kt}) (see equation 2.20), for example,

    \log Q(\nu) = \log P(\nu) + \int Q(\theta_{\setminus\nu}) \sum_t \log P(x_t^o \mid \theta) \, d\theta_{\setminus\nu} + \text{const.},     (3.1)

where \theta_{\setminus\nu} is the set of parameters excluding \nu. This gives

    Q(\nu) = \prod_n N(\nu_n \mid \mu_{\nu_n}, \lambda_{\nu_n}),
    \lambda_{\nu_n} = \lambda^o_{\nu_n} + \langle \psi_n \rangle \sum_t o_{nt},
    \mu_{\nu_n} = \Big[ \lambda^o_{\nu_n} \mu^o_{\nu_n} + \langle \psi_n \rangle \sum_t o_{nt} \sum_k Q(k_t) \langle x_{nt} - A_{n\cdot} s_{kt} \rangle \Big] \big/ \lambda_{\nu_n}.     (3.2)

Similarly,

    Q(\Psi) = \prod_n G(\psi_n \mid a_{\psi_n}, b_{\psi_n}),
    a_{\psi_n} = a^o_{\psi_n} + \tfrac{1}{2} \sum_t o_{nt},
    b_{\psi_n} = b^o_{\psi_n} + \tfrac{1}{2} \sum_t o_{nt} \sum_k Q(k_t) \langle (x_{nt} - A_{n\cdot} s_{kt} - \nu_n)^2 \rangle.     (3.3)

    Q(A) = \prod_n N(A_{n\cdot} \mid \mu_{A_{n\cdot}}, \Lambda_{A_{n\cdot}}),
    \Lambda_{A_{n\cdot}} = diag(\langle \alpha_1 \rangle, \ldots, \langle \alpha_L \rangle) + \langle \psi_n \rangle \sum_t o_{nt} \sum_k Q(k_t) \langle s_{kt} s_{kt}^\top \rangle,
    \mu_{A_{n\cdot}} = \Big( \langle \psi_n \rangle \sum_t o_{nt} (x_{nt} - \langle \nu_n \rangle) \sum_k Q(k_t) \langle s_{kt}^\top \rangle \Big) \Lambda_{A_{n\cdot}}^{-1}.     (3.4)

    Q(\alpha) = \prod_l G(\alpha_l \mid a_{\alpha_l}, b_{\alpha_l}),
    a_{\alpha_l} = a^o_{\alpha_l} + N/2,
    b_{\alpha_l} = b^o_{\alpha_l} + \tfrac{1}{2} \sum_n \langle A_{nl}^2 \rangle.     (3.5)

    Q(\pi_l) = D(\pi_l \mid d_{\pi_l}),
    d_{\pi_{lk}} = d^o_{\pi_{lk}} + \sum_t \sum_{k_l = k} Q(k_t).     (3.6)

    Q(\phi_{l k_l}) = N(\phi_{l k_l} \mid \mu_{\phi_{l k_l}}, \lambda_{\phi_{l k_l}}),
    \lambda_{\phi_{l k_l}} = \lambda^o_{\phi_{l k_l}} + \langle \beta_{l k_l} \rangle \sum_t \sum_{k_l = k} Q(k_t),
    \mu_{\phi_{l k_l}} = \Big[ \lambda^o_{\phi_{l k_l}} \mu^o_{\phi_{l k_l}} + \langle \beta_{l k_l} \rangle \sum_t \sum_{k_l = k} Q(k_t) \langle s_{ktl} \rangle \Big] \big/ \lambda_{\phi_{l k_l}}.     (3.7)

    Q(\beta_{l k_l}) = G(\beta_{l k_l} \mid a_{\beta_{l k_l}}, b_{\beta_{l k_l}}),
    a_{\beta_{l k_l}} = a^o_{\beta_{l k_l}} + \tfrac{1}{2} \sum_t \sum_{k_l = k} Q(k_t),
    b_{\beta_{l k_l}} = b^o_{\beta_{l k_l}} + \tfrac{1}{2} \sum_t \sum_{k_l = k} Q(k_t) \langle (s_{ktl} - \phi_{l k_l})^2 \rangle.     (3.8)

    Q(s_{kt}) = N(s_{kt} \mid \mu_{s_{kt}}, \Lambda_{s_{kt}}),
    \Lambda_{s_{kt}} = diag(\langle \beta_{1 k_1} \rangle, \ldots, \langle \beta_{L k_L} \rangle)
        + \big\langle A^\top diag(o_{1t} \psi_1, \ldots, o_{Nt} \psi_N) \, A \big\rangle,
    \Lambda_{s_{kt}} \mu_{s_{kt}} = (\langle \beta_{1 k_1} \phi_{1 k_1} \rangle, \ldots, \langle \beta_{L k_L} \phi_{L k_L} \rangle)^\top
        + \big\langle A^\top diag(o_{1t} \psi_1, \ldots, o_{Nt} \psi_N) (x_t - \nu) \big\rangle.     (3.9)

In the above equations, \langle \cdot \rangle denotes the expectation over the posterior distributions Q(\cdot), A_{n\cdot} is the n-th row of the mixing matrix A, \sum_{k_l = k} means picking out those gaussians such that the l-th element of their indices k has the value of k, and o_t is an indicator variable for observed entries in x_t:

    o_{nt} = 1 if x_{nt} is observed, 0 if x_{nt} is missing.     (3.10)

For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(\Psi) would be over both t and n. Note that there exist scale and translational degeneracies in the model, as given by equations 2.1 and 2.2. After each update of Q(\pi_l), Q(\phi_{l k_l}), and Q(\beta_{l k_l}), it is better to rescale P(s_{tl}) to have zero mean and unit variance; Q(s_{kt}), Q(A), Q(\alpha), Q(\nu), and Q(\Psi) have to be adjusted correspondingly. Finally, Q(k_t) is given by

    \log Q(k_t) = \langle \log P(x_t^o \mid s_{kt}, \theta) \rangle + \langle \log N(s_{kt} \mid \phi_k, \beta_k) \rangle
        - \langle \log Q(s_{kt}) \rangle + \langle \log \pi_k \rangle - \log z_t,     (3.11)

where z_t is a normalization constant. The lower bound E(X, Q(\theta)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used for comparison of different solutions or models. After some manipulation, E(X, Q(\theta)) can be expressed as

    E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)} \, d\theta.     (3.12)
4 Missing Data
4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

    x_t^o = [A]_t^o \cdot s_t     (4.1)

would be overdetermined in s_t, unless [A]_t^o has a rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked; point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modification to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is now underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single gaussian approximation of Q(s_t) is undesirable, and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x_t^m \mid x_t^o). The approximate distribution Q(x_t^m \mid x_t^o) can be obtained by

    Q(x_t^m \mid x_t^o) = \sum_k Q(k_t) \int \delta(x_t^m - x_{kt}^m) \, Q(x_{kt}^m \mid x_t^o, k) \, dx_{kt}^m,     (4.2)

where \delta(\cdot) is the delta function and

    Q(x_{kt}^m \mid x_t^o, k) = \int Q(\theta) \int N(x_{kt}^m \mid [A s_{kt} + \nu]_t^m, [\Psi]_t^m) \, Q(s_{kt}) \, ds_{kt} \, d\theta
      = \int\!\!\int Q(A) Q(\Psi) N(x_{kt}^m \mid \mu_{x_{kt}^m}, \Lambda_{x_{kt}^m}) \, dA \, d\Psi,     (4.3)

    \mu_{x_{kt}^m} = [A \mu_{s_{kt}} + \mu_\nu]_t^m,     (4.4)

    \Lambda_{x_{kt}^m}^{-1} = [A \Lambda_{s_{kt}}^{-1} A^\top + \Lambda_\nu^{-1} + diag(\Psi)^{-1}]_t^m.     (4.5)

Unfortunately, the integration over Q(A) and Q(\Psi) cannot be carried out analytically, but we can substitute \langle A \rangle and \langle \Psi \rangle as an approximation. Estimation of Q(x_t^m \mid x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m \mid x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = -2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing \sum_t with \sum_t o_{nt}, and \psi_n with o_{nt} \psi_n, in their learning rules. The dashed line is the approximation Q(x_t^m \mid x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m \mid x_t^o) and a better estimate for \langle x_t^m \mid x_t^o \rangle.

Figure 3: The approximation of Q(x_t^m \mid x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m \mid x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = -2. Dotted lines are the contribution from the individual Q(x_{kt}^m \mid x_t^o, k).

From equation 2.16,

    Q(x_t^m \mid x_t^o) = \int Q(\theta) \int N(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m) \, Q(s_t) \, ds_t \, d\theta,     (4.6)

and with the above formalism, Q(s_t) becomes

    Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt}) \, Q(s_{kt}) \, ds_{kt},     (4.7)

which is a mixture of K^L gaussians. The missing values can then be filled in by

    \langle s_t \mid x_t^o \rangle = \int s_t \, Q(s_t) \, ds_t = \sum_k Q(k_t) \, \mu_{s_{kt}},     (4.8)

    \langle x_t^m \mid x_t^o \rangle = \int x_t^m \, Q(x_t^m \mid x_t^o) \, dx_t^m
      = \sum_k Q(k_t) \, \mu_{x_{kt}^m} = [A]_t^m \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m,     (4.9)

where \mu_{s_{kt}} and \mu_{x_{kt}^m} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate on Q(s_t) and Q(x_t^m \mid x_t^o) may be obtained, but then numerical methods are needed.
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of learning rules and suffers from many more local maxima; it has an inferior marginal likelihood lower bound. The problems are more serious at high missing-data rates, and a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case, when all entries of a data point are missing (that is, empty x_t^o), Q(s_t) is the same as P(s_t | \theta) and would not interfere with the learning of P(s_t | \theta) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | \theta) to become gaussian, too. This is very undesirable when learning ICA structure.
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can be easily extended to model data density with C clusters of ICA. First, all parameters \theta and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. Parameter \rho = \{\rho_1, \ldots, \rho_C\} is introduced to represent the weights on the clusters; \rho has a Dirichlet prior (see equation 2.11). \Theta = \{\rho, \theta^1, \ldots, \theta^C\} is now the collection of all parameters. Our density model in equation 2.1 becomes

    P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho) \, P(x_t \mid \theta^c)
      = \sum_c P(c_t = c \mid \rho) \int N(x_t \mid A^c s_t^c + \nu^c, \Psi^c) \, P(s_t^c \mid \theta_s^c) \, ds_t^c.     (5.1)

The objective function in equation 2.13 remains the same but with \theta replaced by \Theta. The separable posterior Q(\Theta) is given by

    Q(\Theta) = Q(\rho) \prod_c Q(\theta^c),     (5.2)

and similar to equation 2.15,

    \int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)} \, d\Theta
      = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)} \, d\rho
      + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)} \, d\theta^c.     (5.3)

Equation 2.20 now becomes

    \log P(x_t^o \mid \Theta) \geq \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)}
      + \sum_{c,k} Q(c_t) Q(k_t^c) \Big[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c) \, ds_{kt}^c
      + \int Q(s_{kt}^c) \log \frac{N(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)} \, ds_{kt}^c \Big]
      + \sum_{c,k} Q(c_t) Q(k_t^c) \log \frac{\pi_k^c}{Q(k_t^c)}.     (5.4)

We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain the same, only with \sum_t replaced by \sum_t Q(c_t). Finally, we need two more learning rules:

    d_{\rho_c} = d^o_{\rho_c} + \sum_t Q(c_t),     (5.5)

    \log Q(c_t) = \langle \log \rho_c \rangle + \log z_t^c - \log Z_t,     (5.6)

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11) and Z_t is for normalizing Q(c_t).
6 Supervised Classification

It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = \{x_t, y_t;\; t \text{ in } 1, \ldots, T\}. Here x_t contains the input attributes and may have missing entries; y_t \in \{1, \ldots, y, \ldots, Y\} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M):

    P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)
      = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M) \, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}.     (6.1)

Here M denotes our generative model for the observation \{x_t, y_t\}:

    P(x_t, y_t \mid M) = P(x_t \mid y_t, M) \, P(y_t \mid M).     (6.2)

P(x_t \mid y_t, M) could be a mixture model, as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t \mid y_t, M) be parameterized by \Theta_y, and P(y_t \mid M) be parameterized by \omega = \{\omega_1, \ldots, \omega_Y\}:

    P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y),     (6.3)

    P(y_t \mid M) = P(y_t = y \mid \omega) = \omega_y.     (6.4)

If \omega is given a Dirichlet prior, P(\omega \mid M) = D(\omega \mid d^o_\omega), its posterior has also a Dirichlet distribution:

    P(\omega \mid Y_T, M) = D(\omega \mid d_\omega),     (6.5)

    d_{\omega_y} = d^o_{\omega_y} + \sum_t I(y_t = y).     (6.6)

I(\cdot) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

    P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y),     (6.7)

where X_y is a subset of X_T but contains only those x_t whose training labels y_t have value y. Hence, P(\Theta_y \mid X_T, Y_T, M) can be approximated with Q(\Theta_y) by applying the learning rules in sections 3 and 5 on subset X_y.

6.2 Classification. First, P(y_{T+1} \mid X_T, Y_T, M) in equation 6.1 can be computed by

    P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \omega_y) \, P(\omega_y \mid X_T, Y_T) \, d\omega_y
      = \frac{d_{\omega_y}}{\sum_{y'} d_{\omega_{y'}}}.     (6.8)

The other term, P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M), can be computed as

    \log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M)
      = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M)     (6.9)
      \approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)).     (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(\Theta_y) and E(\{x_{T+1}, X_y\}, Q'(\Theta_y)). The error in the approximation is the difference KL(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})) - KL(Q(\Theta_y), P(\Theta_y \mid X_y)). If we assume further that Q'(\Theta_y) \approx Q(\Theta_y),

    \log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y) \, d\Theta_y = \log Z_{T+1},     (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
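As a rough sketch of the classification step, the snippet below combines the Dirichlet class prior of equation 6.8 with per-class evidence terms of the kind in equations 6.10 and 6.11, via Bayes' rule (equation 6.1). The log-evidence numbers and the unit prior count are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Predictive class posterior for a new point x_{T+1}:
#   prior    P(y | X_T, Y_T)  = d_y / sum_y' d_y'            (equation 6.8)
#   evidence log P(x | X_y)   approximated by a lower bound  (equations 6.10/6.11)
counts = np.array([24.0, 50.0])      # class counts, as in section 7.3
d_omega = 1.0 + counts               # posterior Dirichlet counts, d0 = 1 assumed
log_prior = np.log(d_omega / d_omega.sum())
log_evidence = np.array([-12.1, -13.5])   # stand-in per-class log evidence

log_post = log_prior + log_evidence  # numerator of equation 6.1, in log space
post = np.exp(log_post - log_post.max())
post /= post.sum()                   # normalizing implements the denominator

assert np.isclose(post.sum(), 1.0)
```

Here the evidence gap (1.4 nats in favor of class 0) outweighs the prior's tilt toward class 1, so the minority class wins; thresholding `post` at 0.5 reproduces the decision rule used in section 7.3.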
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at -26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plotted the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the modeled pdf by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
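The synthetic setup above can be sketched in a few lines. Distribution choices mirror the text, except that the generalized gaussian is replaced here by a Laplacian for simplicity; the seed, noise scale (10^(-26/20) ≈ 0.05 for -26 dB), and shape parameters are our assumptions:

```python
import numpy as np

# Generate 200 points: four standardized non-gaussian sources, mixed into
# seven dimensions, plus low-level noise, with entries deleted at rate 0.3.
rng = np.random.default_rng(3)
T, L, N = 200, 4, 7

S = np.vstack([
    rng.laplace(size=T),              # heavy-tailed (stand-in for gen. gaussian)
    rng.gamma(2.0, 1.0, size=T),      # skewed source
    rng.beta(0.5, 0.5, size=T),       # bimodal, sub-gaussian source
    rng.uniform(-1.0, 1.0, size=T),   # flat source
])
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

A = rng.normal(size=(N, L))                   # random mixing matrix
X = A @ S + 0.05 * rng.normal(size=(N, T))    # approx. -26 dB sensor noise

mask = rng.random((N, T)) < 0.3               # True marks a missing entry
X_obs = np.where(mask, np.nan, X)

assert X_obs.shape == (N, T)
```

A data set in this form (NaN-coded gaps, more sensors than sources) is exactly what the learning rules of section 3 consume through the indicator o_nt.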
Figure 6: E(X, Q(\theta)) as a function of the number of hidden source dimensions (log marginal likelihood lower bound versus number of dimensions, 1 through 7). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.

Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot the E(X, Q(\theta)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(\theta) and P(\theta \mid X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 x 380 pixel images. They were linearly mixed into three images, and -20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We have also tried filling in the missing pixels by EM with a gaussian model. Variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).

7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} \mid X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803-851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99-114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90-95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183-233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107-1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162-168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
References
Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851
Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress
Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California
Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114
Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation
Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT
Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley
Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233
Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122
Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley
Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge
Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge
Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete
data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation
Received July 18 2002 accepted January 30 2003
Variational Bayesian Learning of ICA with Missing Data 1999
where θ∖ν is the set of parameters excluding ν. This gives

Q(\nu) = \prod_n \mathcal{N}(\nu_n \mid \mu_{\nu_n}, \lambda_{\nu_n})

\lambda_{\nu_n} = \lambda^o_{\nu_n} + \langle\Psi_n\rangle \sum_t o_{nt}

\mu_{\nu_n} = \Big[ \lambda^o_{\nu_n}\mu^o_{\nu_n} + \langle\Psi_n\rangle \sum_t o_{nt} \sum_k Q(k_t)\,\langle x_{nt} - A_{n\cdot}\, s_{kt} \rangle \Big] \Big/ \lambda_{\nu_n}   (3.2)
Similarly,

Q(\Psi) = \prod_n \mathcal{G}(\Psi_n \mid a_{\Psi_n}, b_{\Psi_n})

a_{\Psi_n} = a^o_{\Psi_n} + \tfrac{1}{2} \sum_t o_{nt}

b_{\Psi_n} = b^o_{\Psi_n} + \tfrac{1}{2} \sum_t o_{nt} \sum_k Q(k_t)\,\langle (x_{nt} - A_{n\cdot}\, s_{kt} - \nu_n)^2 \rangle   (3.3)
Q(A) = \prod_n \mathcal{N}(A_{n\cdot} \mid \mu_{A_{n\cdot}}, \Lambda_{A_{n\cdot}})

\Lambda_{A_{n\cdot}} = \mathrm{diag}\big(\langle\alpha_1\rangle, \ldots, \langle\alpha_L\rangle\big) + \langle\Psi_n\rangle \sum_t o_{nt} \sum_k Q(k_t)\,\langle s_{kt} s_{kt}^\top \rangle

\mu_{A_{n\cdot}} = \Big( \langle\Psi_n\rangle \sum_t o_{nt} \big( x_{nt} - \langle\nu_n\rangle \big) \sum_k Q(k_t)\,\langle s_{kt}^\top \rangle \Big)\, \Lambda_{A_{n\cdot}}^{-1}   (3.4)
Q(\alpha) = \prod_l \mathcal{G}(\alpha_l \mid a_{\alpha_l}, b_{\alpha_l})

a_{\alpha_l} = a^o_{\alpha_l} + \tfrac{N}{2}

b_{\alpha_l} = b^o_{\alpha_l} + \tfrac{1}{2} \sum_n \langle A_{nl}^2 \rangle   (3.5)

Q(\pi_l) = \mathcal{D}(\pi_l \mid d_{\pi_l})

d_{\pi_l k} = d^o_{\pi_l k} + \sum_t \sum_{k_l = k} Q(k_t)   (3.6)
2000 K Chan T Lee and T Sejnowski
Q(\phi_{l k_l}) = \mathcal{N}(\phi_{l k_l} \mid \mu_{\phi_{l k_l}}, \lambda_{\phi_{l k_l}})

\lambda_{\phi_{l k_l}} = \lambda^o_{\phi_{l k_l}} + \langle\beta_{l k_l}\rangle \sum_t \sum_{k_l = k} Q(k_t)

\mu_{\phi_{l k_l}} = \Big[ \lambda^o_{\phi_{l k_l}} \mu^o_{\phi_{l k_l}} + \langle\beta_{l k_l}\rangle \sum_t \sum_{k_l = k} Q(k_t)\,\langle s_{ktl} \rangle \Big] \Big/ \lambda_{\phi_{l k_l}}   (3.7)

Q(\beta_{l k_l}) = \mathcal{G}(\beta_{l k_l} \mid a_{\beta_{l k_l}}, b_{\beta_{l k_l}})

a_{\beta_{l k_l}} = a^o_{\beta_{l k_l}} + \tfrac{1}{2} \sum_t \sum_{k_l = k} Q(k_t)

b_{\beta_{l k_l}} = b^o_{\beta_{l k_l}} + \tfrac{1}{2} \sum_t \sum_{k_l = k} Q(k_t)\,\langle (s_{ktl} - \phi_{l k_l})^2 \rangle   (3.8)
Q(s_{kt}) = \mathcal{N}(s_{kt} \mid \mu_{s_{kt}}, \Lambda_{s_{kt}})

\Lambda_{s_{kt}} = \mathrm{diag}\big(\langle\beta_{1 k_1}\rangle, \ldots, \langle\beta_{L k_L}\rangle\big) + \big\langle A^\top \mathrm{diag}(o_{1t}\Psi_1, \ldots, o_{Nt}\Psi_N)\, A \big\rangle

\Lambda_{s_{kt}}\, \mu_{s_{kt}} = \big( \langle\beta_{1 k_1} \phi_{1 k_1}\rangle, \ldots, \langle\beta_{L k_L} \phi_{L k_L}\rangle \big)^\top + \big\langle A^\top \mathrm{diag}(o_{1t}\Psi_1, \ldots, o_{Nt}\Psi_N)\,(x_t - \nu) \big\rangle   (3.9)
In the above equations, ⟨·⟩ denotes the expectation over the posterior distributions Q(·), A_{n·} is the nth row of the mixing matrix A, \sum_{k_l = k} means picking out those gaussians whose index vector k has lth element equal to k, and o_t is an indicator variable for the observed entries in x_t:

o_{nt} = \begin{cases} 1 & \text{if } x_{nt} \text{ is observed} \\ 0 & \text{if } x_{nt} \text{ is missing} \end{cases}   (3.10)
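The indicator o_nt turns every sum over t in the learning rules above into a masked sum over observed entries only. A minimal numpy sketch of this bookkeeping (array names and hyperparameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 100
X = rng.normal(size=(N, T))        # complete data: N dimensions x T points
O = rng.random((N, T)) > 0.3       # o_nt: True where x_nt is observed

# Equation 3.3: a_{Psi_n} = a^o_{Psi_n} + (1/2) sum_t o_nt counts only the
# observed entries of dimension n.
a0 = 1e-3
a_psi = a0 + 0.5 * O.sum(axis=1)

# Any other sum over t is masked the same way, e.g. sum_t o_nt * x_nt:
masked_sum = (O * X).sum(axis=1)
assert a_psi.shape == (N,) and np.all(a_psi > a0)
```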
For a model with equal noise variance across all observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. Note that there are scale and translational degeneracies in the model, as given by equations 2.1 and 2.2. After each update of Q(π_l), Q(φ_{l k_l}), and Q(β_{l k_l}), it is better to rescale P(s_tl) to have zero mean and unit variance; Q(s_kt), Q(A), Q(α), Q(ν), and Q(Ψ) then have to be adjusted correspondingly. Finally, Q(k_t) is given by
\log Q(k_t) = \langle \log P(x^o_t \mid s_{kt}, \theta) \rangle + \langle \log \mathcal{N}(s_{kt} \mid \phi_k, \beta_k) \rangle - \langle \log Q(s_{kt}) \rangle + \langle \log \pi_k \rangle - \log z_t   (3.11)

where z_t is a normalization constant. The lower bound E(X, Q(θ)) for the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used to compare different solutions or models. After some manipulation, E(X, Q(θ)) can be expressed as

E(X, Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta   (3.12)
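In practice, equation 3.11 is evaluated by normalizing the bracketed log terms with a log-sum-exp, which also yields the log z_t needed for the bound in equation 3.12. A small sketch (the input values are made up):

```python
import numpy as np

def normalize_log_q(log_terms):
    """Given the unnormalized log Q(k_t) terms of equation 3.11,
    return (Q(k_t), log z_t) via a numerically stable log-sum-exp."""
    m = log_terms.max()
    log_zt = m + np.log(np.exp(log_terms - m).sum())
    return np.exp(log_terms - log_zt), log_zt

q_kt, log_zt = normalize_log_q(np.array([-1.0, -2.0, -4.0]))
assert abs(q_kt.sum() - 1.0) < 1e-12      # a proper posterior over k
assert q_kt[0] > q_kt[1] > q_kt[2]        # ordering follows the log terms
```

The per-point log z_t values are then summed over t to track E(X, Q(θ)) during learning.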
4 Missing Data
4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have N > L. More specifically, if the number of observed dimensions in x_t is greater than L, the equation

x^o_t = [A]^o_t \cdot s_t   (4.1)

would be overdetermined in s_t unless [A]^o_t has rank smaller than L. In this case, Q(s_t) is likely to be unimodal and peaked, point estimates of s_t would be sufficient and reliable, and the learning rules of Chan et al. (2002), with small modifications to account for missing entries, would give a reasonable approximation. When Q(s_t) is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in x_t is less than L, equation 4.1 is underdetermined in s_t, and Q(s_t) would have a broad multimodal structure. This corresponds to overcomplete ICA, where a single-gaussian approximation of Q(s_t) is undesirable and the formalism discussed in this article is needed to capture the higher-order statistics of Q(s_t) and produce a more faithful Q(x^m_t | x^o_t). The approximate distribution Q(x^m_t | x^o_t) can be obtained by

Q(x^m_t \mid x^o_t) = \sum_k Q(k_t) \int \delta(x^m_t - x^m_{kt})\, Q(x^m_{kt} \mid x^o_t, k)\, dx^m_{kt}   (4.2)
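Whether equation 4.1 is over- or underdetermined can be checked per data point by comparing the rank of [A]^o_t with L. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 7, 4
A = rng.normal(size=(N, L))                         # mixing matrix
o_t = np.array([1, 0, 1, 1, 0, 1, 1], dtype=bool)   # observed entries of x_t

A_obs = A[o_t]                                      # [A]^o_t
# Five observed rows of a generic 7x4 matrix have rank 4 = L, so equation 4.1
# is overdetermined and Q(s_t) is unimodal and peaked.
assert np.linalg.matrix_rank(A_obs) == L

# With fewer than L observed dimensions the system is underdetermined, and
# Q(s_t) must be kept as a mixture (the overcomplete regime).
few = np.array([1, 0, 0, 0, 0, 0, 1], dtype=bool)
assert np.linalg.matrix_rank(A[few]) < L
```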
where δ(·) is the delta function, and

Q(x^m_{kt} \mid x^o_t, k) = \int Q(\theta) \int \mathcal{N}\big(x^m_{kt} \mid [A s_{kt} + \nu]^m_t, [\Psi]^m_t\big)\, Q(s_{kt})\, ds_{kt}\, d\theta
 = \iint Q(A)\, Q(\Psi)\, \mathcal{N}\big(x^m_{kt} \mid \mu_{x^m_{kt}}, \Lambda_{x^m_{kt}}\big)\, dA\, d\Psi   (4.3)

\mu_{x^m_{kt}} = [A\, \mu_{s_{kt}} + \mu_\nu]^m_t   (4.4)

\Lambda_{x^m_{kt}}^{-1} = [A\, \Lambda_{s_{kt}}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1}]^m_t   (4.5)
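Equations 4.4 and 4.5 only require posterior moments already computed in equation 3.9. A sketch with ⟨A⟩ and ⟨Ψ⟩ substituted for the intractable integrals, as the text below suggests (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 4, 2
A = rng.normal(size=(N, L))                       # <A>
mu_s = rng.normal(size=L)                         # mu_{s_kt} from equation 3.9
Lam_s = 4.0 * np.eye(L)                           # Lambda_{s_kt} (precision)
mu_nu, lam_nu = np.zeros(N), np.full(N, 100.0)    # Q(nu) mean and precision
psi = np.full(N, 50.0)                            # <Psi_n>, noise precisions
m = np.array([True, False, False, True])          # missing entries of x_t

# Equation 4.4: mean of the missing entries for mixture component k.
mu_xm = (A @ mu_s + mu_nu)[m]
# Equation 4.5: covariance (inverse precision), restricted to missing dims.
cov = A @ np.linalg.inv(Lam_s) @ A.T + np.diag(1.0 / lam_nu) + np.diag(1.0 / psi)
cov_xm = cov[np.ix_(m, m)]
assert mu_xm.shape == (2,) and np.all(np.linalg.eigvalsh(cov_xm) > 0)
```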
Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x^m_t | x^o_t) using the above equations is demonstrated in Figure 3.

Figure 3: The approximation of Q(x^m_t | x^o_t) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x^m_t | x^o_t) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contributions from the individual Q(x^m_{kt} | x^o_t, k).

The shaded area is the exact posterior P(x^m_t | x^o_t) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing \sum_t with \sum_t o_{nt}, and \Psi_n with o_{nt}\Psi_n, in their learning rules. The dashed line is the approximation Q(x^m_t | x^o_t) from this modified method, which we refer to as the polynomial missing ICA. The treatment that fully expands the K^L hidden source gaussians, discussed in section 2.3, is named the full missing ICA. The full missing ICA gives a more accurate fit for P(x^m_t | x^o_t) and a better estimate for ⟨x^m_t | x^o_t⟩. From equation 2.16,

Q(x^m_t \mid x^o_t) = \int Q(\theta) \int \mathcal{N}\big(x^m_t \mid [A s_t + \nu]^m_t, [\Psi]^m_t\big)\, Q(s_t)\, ds_t\, d\theta   (4.6)
and, in the above formalism, Q(s_t) becomes

Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt})\, Q(s_{kt})\, ds_{kt}   (4.7)
which is a mixture of K^L gaussians. The missing values can then be filled in by

\langle s_t \mid x^o_t \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_k Q(k_t)\, \mu_{s_{kt}}   (4.8)

\langle x^m_t \mid x^o_t \rangle = \int x^m_t\, Q(x^m_t \mid x^o_t)\, dx^m_t = \sum_k Q(k_t)\, \mu_{x^m_{kt}} = [A]^m_t\, \langle s_t \mid x^o_t \rangle + [\mu_\nu]^m_t   (4.9)

where \mu_{s_{kt}} and \mu_{x^m_{kt}} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate from Q(s_t) and Q(x^m_t | x^o_t) may be obtained, but numerical methods are then needed.
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of the learning rules, suffers from many more local maxima, and attains an inferior marginal-likelihood lower bound. The problems are more serious at high missing-data rates, where a local-maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing, that is, when x^o_t is empty, Q(s_t) is the same as P(s_t | θ) and does not interfere with the learning of P(s_t | θ) from the other data points. On the other hand, the single-gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too, which is very undesirable when learning ICA structure.
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can easily be extended to model the data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. A parameter ρ = {ρ_1, …, ρ_C} is introduced to represent the weights on the clusters; ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, …, θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c)
 = \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s^c_t + \nu^c, \Psi^c)\, P(s^c_t \mid \theta^c_s)\, ds^c_t   (5.1)
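To make equation 5.1 concrete, here is a sketch of the mixture density in the special case of unit-gaussian sources, where the integral over s^c_t has a closed form: x ~ N(ν^c, A^c A^{c⊤} + Ψ^{−1}). This is a simplification for illustration only (the paper's sources are mixtures of gaussians), and all values are made up:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density, evaluated directly from the definition."""
    d = x - mean
    k = len(x)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / np.sqrt(
        (2 * np.pi) ** k * np.linalg.det(cov))

rng = np.random.default_rng(4)
N, L, C = 3, 2, 2
rho = np.array([0.7, 0.3])                        # cluster weights
A  = [rng.normal(size=(N, L)) for _ in range(C)]  # per-cluster mixing matrices
nu = [rng.normal(size=N) for _ in range(C)]       # per-cluster biases
noise_cov = 0.01 * np.eye(N)                      # Psi^{-1}

x = rng.normal(size=N)
p_x = sum(rho[c] * gaussian_pdf(x, nu[c], A[c] @ A[c].T + noise_cov)
          for c in range(C))
assert p_x > 0.0
```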
The objective function in equation 2.13 remains the same, but with θ replaced by Θ. The separable posterior Q(Θ) is given by

Q(\Theta) = Q(\rho) \prod_c Q(\theta^c)   (5.2)

and, similar to equation 2.15,

\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c   (5.3)
Equation 2.20 now becomes

\log P(x^o_t \mid \Theta) \ge \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,k} Q(c_t)\, Q(k^c_t) \times \Big[ \int Q(s^c_{kt}) \log P(x^o_t \mid s^c_{kt}, \theta^c)\, ds^c_{kt} + \int Q(s^c_{kt}) \log \frac{\mathcal{N}(s^c_{kt} \mid \phi^c_k, \beta^c_k)}{Q(s^c_{kt})}\, ds^c_{kt} \Big] + \sum_{c,k} Q(c_t)\, Q(k^c_t) \log \frac{\pi^c_k}{Q(k^c_t)}   (5.4)
We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k^c_t). All learning rules in section 3 remain the same, only with \sum_t replaced by \sum_t Q(c_t). Finally, we need two more learning rules:

d_{\rho_c} = d^o_{\rho_c} + \sum_t Q(c_t)   (5.5)

\log Q(c_t) = \langle \log \rho_c \rangle + \log z_{ct} - \log Z_t   (5.6)

where z_{ct} is the normalization constant for Q(k^c_t) (see equation 3.11) and Z_t is for normalizing Q(c_t).
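The two new rules can be sketched directly: equation 5.6 is a per-point softmax over clusters, and equation 5.5 accumulates the resulting soft assignments into the Dirichlet counts (all input numbers here are made up):

```python
import numpy as np

def cluster_posterior(exp_log_rho, log_z_ct):
    """Equation 5.6: log Q(c_t) = <log rho_c> + log z_ct - log Z_t."""
    log_q = exp_log_rho + log_z_ct
    log_q -= log_q.max()                 # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum()                   # dividing by Z_t

q_ct = cluster_posterior(np.log([0.6, 0.4]), np.array([-10.0, -9.0]))
assert abs(q_ct.sum() - 1.0) < 1e-12
assert q_ct[1] > q_ct[0]                 # stronger per-cluster evidence wins

# Equation 5.5: Dirichlet counts accumulate soft assignments over t.
d0 = 1.0
Q_ct = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])   # T=3 points, C=2
d_rho = d0 + Q_ct.sum(axis=0)
assert np.allclose(d_rho, [2.8, 2.2])
```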
6 Supervised Classification
It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = {(x_t, y_t), t ∈ 1, …, T}. Here, x_t contains the input attributes and may have missing entries; y_t ∈ {1, …, y, …, Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}   (6.1)

Here, M denotes our generative model for the observations {x_t, y_t}:

P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M)   (6.2)

P(x_t | y_t, M) could be a mixture model, as given by equation 5.1.
6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y, and let P(y_t | M) be parameterized by λ = {λ_1, …, λ_Y}:

P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y)   (6.3)

P(y_t \mid M) = P(y_t = y \mid \lambda) = \lambda_y   (6.4)

If λ is given a Dirichlet prior, P(λ | M) = \mathcal{D}(λ | d^o), its posterior is also a Dirichlet distribution:

P(\lambda \mid Y_T, M) = \mathcal{D}(\lambda \mid d)   (6.5)

d_y = d^o_y + \sum_t I(y_t = y)   (6.6)
I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y)   (6.7)

where X_y is the subset of X_T containing only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 to the subset X_y.
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \lambda)\, P(\lambda \mid X_T, Y_T)\, d\lambda = \frac{d_y}{\sum_{y'} d_{y'}}   (6.8)
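Equations 6.6 and 6.8 amount to smoothed class counts; a minimal sketch with made-up labels:

```python
import numpy as np

y_train = np.array([0, 1, 1, 2, 1, 0])        # T = 6 labels, Y = 3 classes
d0 = np.ones(3)                               # symmetric Dirichlet prior d^o
d = d0 + np.bincount(y_train, minlength=3)    # equation 6.6: posterior counts
p_y = d / d.sum()                             # equation 6.8: predictive prior
assert np.allclose(d, [3, 4, 2])
assert np.allclose(p_y, [3/9, 4/9, 2/9])
```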
The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M) = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M)   (6.9)

 \approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y))   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(\Theta_y) and E(\{x_{T+1}, X_y\}, Q'(\Theta_y)). The error in the approximation is the difference KL(Q'(\Theta_y), P(\Theta_y \mid \{x_{T+1}, X_y\})) − KL(Q(\Theta_y), P(\Theta_y \mid X_y)). If we assume further that Q'(\Theta_y) ≈ Q(\Theta_y),

\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1}   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
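Putting the two terms of equation 6.1 together is then a softmax over per-class log evidences plus log priors. A sketch in log space (the evidence values are hypothetical, standing in for log Z_{T+1} per class):

```python
import numpy as np

# Per-class log evidence log P(x_{T+1} | X_y, M), e.g. from equation 6.11,
# after running the learning rules once per class (made-up values):
log_px_given_y = np.array([-12.3, -10.1])
p_y = np.array([0.4, 0.6])                    # from equation 6.8

log_post = log_px_given_y + np.log(p_y)       # numerator of equation 6.1
post = np.exp(log_post - log_post.max())
post /= post.sum()                            # denominator normalizes over y
assert abs(post.sum() - 1.0) < 1e-12
assert post[1] > post[0]
```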
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by randomly mixing four sources in a seven-dimensional space. Generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4; dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling of the synthetic data by the variational missing ICA. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
Figure 6: E(X, Q(θ)) as a function of hidden source dimensions (horizontal axis: number of dimensions, 1 to 7; vertical axis: log marginal likelihood lower bound, −2000 to −1500). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modifications.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(X | θ), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, that is, N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were then introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; 9.6% of the pixels were missing from two of the three mixed images, and estimation of their values is possible but carries higher uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, x_3 represented, for example, cholesterol, blood sugar, and uric acid levels, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom row).
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict the survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we obtained a true positive rate of 16/24 and a true negative rate of 42/50.
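In standard terms, the reported counts correspond to a sensitivity of 16/24 ≈ 0.67 and a specificity of 42/50 = 0.84:

```python
tp, n_pos = 16, 24    # true positives among the 24 positive examples
tn, n_neg = 42, 50    # true negatives among the 50 negative examples

sensitivity = tp / n_pos
specificity = tn / n_neg
accuracy = (tp + tn) / (n_pos + n_neg)
assert round(sensitivity, 2) == 0.67
assert round(specificity, 2) == 0.84
assert round(accuracy, 2) == 0.78
```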
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. This exponential growth in complexity is nevertheless manageable, and worthwhile, for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise for analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model the data density with clusters of ICA was discussed, and the application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

MacKay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
2000 K Chan T Lee and T Sejnowski
QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl
3Aacutelkl D 3oAacutelkl C hmacrlkliX
t
X
klDk
Qkt
sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl
iP
tP
klDk Qkthsktli3Aacutelkl
(37)
Qmacrlkl D G macrlkl
j amacrlkl bmacrlkl
amacrlkl D aomacrlkl C12
X
t
X
klDk
Qkt
bmacrlkl D bomacrlkl C12
X
t
X
klDk
Qkthsktl iexcl Aacutelkl 2i (38)
Qskt D N skt j sup1skt currenskt
currenskt D
0
Bhmacr1k1 i
hmacrLkLi
1
CA
C
Agt
0
Bo1t91
oNt9N
1
CA A
+
currensktsup1skt D
0
Bhmacr1k1 Aacute1k1i
hmacrLkL AacuteLkLi
1
CA
C
Agt
0
Bo1t91
oNt9N
1
CA xt iexcl ordm
+ (39)
In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A
PklDk means picking
out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt
ont Draquo
1 if xnt is observed0 if xnt is missing (310)
For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and
Variational Bayesian Learning of ICA with Missing Data 2001
n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl
it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by
log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki
iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)
where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as
EX Qmicro DX
t
log zt CZ
Qmicro logPmicro
Qmicro dmicro (312)
4 Missing Data
41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation
xot D [A]o
t cent st (41)
would be overdetermined in st unless [A]ot has a rank smaller than L In
this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm
t j xot The
approximate distribution Qxmt j xo
t can be obtained by
Qxmt j xo
t DX
k
Qkt
Zplusmnxm
t iexcl xmktQxm
kt j xot k dxm
kt (42)
2002 K Chan T Lee and T Sejnowski
where \delta(\cdot) is the delta function and

Q(x_{kt}^m \mid x_t^o, k) = \int Q(\theta) \int N(x_{kt}^m \mid [A s_{kt} + \nu]_t^m, [\Psi]_t^m) \, Q(s_{kt}) \, ds_{kt} \, d\theta
 = \int\!\!\int Q(A) \, Q(\Psi) \, N(x_{kt}^m \mid \mu_{x,kt}, \Lambda_{x,kt}) \, dA \, d\Psi,   (4.3)

\mu_{x,kt} = [A \mu_{s,kt} + \mu_\nu]_t^m,   (4.4)

\Lambda_{x,kt}^{-1} = [A \Lambda_{s,kt}^{-1} A^\top + \Lambda_\nu^{-1} + \mathrm{diag}(\Psi)^{-1}]_t^m.   (4.5)

Unfortunately, the integration over Q(A) and Q(\Psi) cannot be carried out analytically, but we can substitute \langle A \rangle and \langle \Psi \rangle as an approximation. Estimation of Q(x_t^m \mid x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m \mid x_t^o) for the noiseless mixing in Figure 1 with observed x_2 = -2, and the solid line is the approximation by equations 4.2 through 4.5.

We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing \sum_t with \sum_t o_{nt} and \Psi_n with o_{nt} \Psi_n in their learning rules. The dashed line is the approximation Q(x_t^m \mid x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m \mid x_t^o) and a better estimate for \langle x_t^m \mid x_t^o \rangle. From equation 2.16,

Q(x_t^m \mid x_t^o) = \int Q(\theta) \int N(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m) \, Q(s_t) \, ds_t \, d\theta,   (4.6)

and from the above formalism, Q(s_t) becomes

Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt}) \, Q(s_{kt}) \, ds_{kt},   (4.7)

which is a mixture of K^L gaussians. The missing values can then be filled in by

\langle s_t \mid x_t^o \rangle = \int s_t \, Q(s_t) \, ds_t = \sum_k Q(k_t) \, \mu_{s,kt},   (4.8)

\langle x_t^m \mid x_t^o \rangle = \int x_t^m \, Q(x_t^m \mid x_t^o) \, dx_t^m = \sum_k Q(k_t) \, \mu_{x,kt} = [A]_t^m \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m,   (4.9)

where \mu_{s,kt} and \mu_{x,kt} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of Q(s_t) and Q(x_t^m \mid x_t^o) may be obtained, but then numerical methods are needed.

Figure 3: The approximation of Q(x_t^m \mid x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m \mid x_t^o) corresponding to the noiseless mixture in Figure 1 with observed x_2 = -2. Dotted lines are the contributions from the individual Q(x_{kt}^m \mid x_t^o, k).
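Under the plug-in approximation with ⟨A⟩ and ⟨Ψ⟩, the fill-in rules of equations 4.8 and 4.9 reduce to simple matrix arithmetic. The sketch below assumes the responsibilities Q(k_t) and the component means μ_{s,kt} have already been produced by the learning rules; all names and numbers are illustrative, not the authors' implementation:

```python
import numpy as np

def fill_missing(A, nu, q_k, mu_s, missing):
    """Fill missing entries via eqs. 4.8-4.9, using posterior means
    <A>, <nu> as plug-in point estimates (the integration over Q(A)
    and Q(Psi) is not analytic; see the text).

    A: (N, L) posterior mean mixing matrix <A>
    nu: (N,) posterior mean bias <nu>
    q_k: (K**L,) responsibilities Q(k_t) for one data point
    mu_s: (K**L, L) per-component source posterior means mu_{s,kt}
    missing: boolean mask over the N dimensions (True = missing)
    """
    s_mean = q_k @ mu_s                 # <s_t | x^o_t>, eq. 4.8
    x_mean = A @ s_mean + nu            # eq. 4.9 over all dimensions
    return x_mean[missing]              # keep only the missing rows [.]^m_t

# Toy example (all numbers hypothetical): N=3 mixtures, L=2 sources, K=2.
A = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]])
nu = np.zeros(3)
q_k = np.array([0.25, 0.25, 0.25, 0.25])           # K**L = 4 components
mu_s = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
x_m = fill_missing(A, nu, q_k, mu_s, np.array([False, False, True]))
assert np.allclose(x_m, 0.0)   # symmetric component means cancel out
```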
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_{tl}) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of the learning rules and suffers from many more local maxima; it also has an inferior marginal likelihood lower bound. These problems are more serious at high missing data rates, where a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing, that is, x_t^o is empty, Q(s_t) is the same as P(s_t \mid \theta) and would not interfere with the learning of P(s_t \mid \theta) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t \mid \theta) to become gaussian too, which is very undesirable when learning ICA structure.
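The K^L factor arises because the product of L independent K-component source mixtures expands into one joint mixture with one gaussian per joint source state; a small sketch of the enumeration:

```python
# Expanding L independent K-component mixtures yields one gaussian per
# joint source state, so the cost per data point scales as K**L.
from itertools import product

K, L = 2, 4
states = list(product(range(K), repeat=L))   # all joint source states
assert len(states) == K ** L == 16
```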
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can easily be extended to model data density with C clusters of ICA. First, all parameters \theta and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. A parameter \rho = \{\rho_1, \ldots, \rho_C\} is introduced to represent the weights on the clusters; \rho has a Dirichlet prior (see equation 2.11). \Theta = \{\rho, \theta^1, \ldots, \theta^C\} is now the collection of all parameters. Our density model in equation 2.1 becomes

P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho) \, P(x_t \mid \theta^c)
 = \sum_c P(c_t = c \mid \rho) \int N(x_t \mid A^c s_t^c + \nu^c, \Psi^c) \, P(s_t^c \mid \theta_s^c) \, ds_t^c.   (5.1)

The objective function in equation 2.13 remains the same, but with \theta replaced by \Theta. The separable posterior Q(\Theta) is given by

Q(\Theta) = Q(\rho) \prod_c Q(\theta^c),   (5.2)

and, similar to equation 2.15,

\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)} \, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)} \, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)} \, d\theta^c.   (5.3)
Equation 2.20 now becomes

\log P(x_t^o \mid \Theta) \ge \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,k} Q(c_t) \, Q(k_t^c) \left[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c) \, ds_{kt}^c + \int Q(s_{kt}^c) \log \frac{N(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)} \, ds_{kt}^c \right] + \sum_{c,k} Q(c_t) \, Q(k_t^c) \log \frac{\pi_k^c}{Q(k_t^c)}.   (5.4)

We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain the same, only with \sum_t replaced by \sum_t Q(c_t). Finally, we need two more learning rules:

d_{\rho_c} = d_{o,\rho_c} + \sum_t Q(c_t),   (5.5)

\log Q(c_t) = \langle \log \rho_c \rangle + \log z_t^c - \log Z_t,   (5.6)

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11) and Z_t is for normalizing Q(c_t).
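The two extra rules above can be sketched in a few lines of numpy. This is a hypothetical illustration: the expectations ⟨log ρ_c⟩ (from the Dirichlet posterior) and the per-cluster normalizers log z^c_t (from equation 3.11) are assumed to be supplied by the rest of the algorithm:

```python
import numpy as np

def update_cluster_weights(d_o, q_c):
    """Eq. 5.5: d_rho_c = d_o,rho_c + sum_t Q(c_t = c).
    q_c: (T, C) responsibilities Q(c_t)."""
    return d_o + q_c.sum(axis=0)

def q_c_posterior(log_rho_mean, log_z):
    """Eq. 5.6: log Q(c_t) = <log rho_c> + log z^c_t - log Z_t,
    where Z_t is fixed by normalization over c.
    log_rho_mean: (C,) expectations <log rho_c> under Q(rho)
    log_z: (T, C) per-cluster normalizers log z^c_t."""
    log_q = log_rho_mean + log_z
    log_q -= log_q.max(axis=1, keepdims=True)   # numerical stabilization
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)     # divide by Z_t

q_c = q_c_posterior(np.log([0.5, 0.5]), np.array([[-1.0, -2.0]]))
d = update_cluster_weights(np.array([1.0, 1.0]), q_c)
assert np.allclose(q_c.sum(axis=1), 1.0)        # Q(c_t) normalized
assert np.allclose(d.sum(), 3.0)                # prior counts + one point
```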
6 Supervised Classification
It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = \{x_t, y_t ; t \in 1, \ldots, T\}. Here x_t contains the input attributes and may have missing entries; y_t \in \{1, \ldots, y, \ldots, Y\} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M):

P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M) \, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}.   (6.1)

Here M denotes our generative model for the observations \{x_t, y_t\}:

P(x_t, y_t \mid M) = P(x_t \mid y_t, M) \, P(y_t \mid M),   (6.2)

where P(x_t \mid y_t, M) could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t \mid y_t, M) be parameterized by \Theta_y and P(y_t \mid M) be parameterized by \omega = \{\omega_1, \ldots, \omega_Y\}:

P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y),   (6.3)

P(y_t = y \mid M) = P(y_t = y \mid \omega) = \omega_y.   (6.4)

If \omega is given a Dirichlet prior, P(\omega \mid M) = D(\omega \mid d_o), its posterior is also a Dirichlet distribution:

P(\omega \mid Y_T, M) = D(\omega \mid d),   (6.5)

d_y = d_{o,y} + \sum_t I(y_t = y).   (6.6)
I(\cdot) is an indicator function that equals 1 if its argument is true and 0 otherwise.

Under the generative model of equation 6.2, it can be shown that

P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y),   (6.7)

where X_y is the subset of X_T containing only those x_t whose training labels y_t have value y. Hence, P(\Theta_y \mid X_T, Y_T, M) can be approximated with Q(\Theta_y) by applying the learning rules in sections 3 and 5 on the subset X_y.
6.2 Classification. First, P(y_{T+1} \mid X_T, Y_T, M) in equation 6.1 can be computed by

P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \omega_y) \, P(\omega \mid X_T, Y_T) \, d\omega = \frac{d_y}{\sum_{y'} d_{y'}}.   (6.8)
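Equations 6.6 and 6.8 amount to smoothed class counting. A minimal sketch, assuming a symmetric prior count d_{o,y} (the labels and prior value below are illustrative):

```python
import numpy as np

def class_prior(labels, n_classes, d_o=1.0):
    """Eqs. 6.6 and 6.8: Dirichlet counts d_y = d_o,y + #{t : y_t = y},
    then P(y_{T+1} = y | Y_T, M) = d_y / sum_y' d_y'."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    d = d_o + counts
    return d / d.sum()

# Two class-0 examples and one class-1 example, unit prior counts.
p = class_prior(np.array([0, 0, 1]), n_classes=2)
assert np.allclose(p, [3 / 5, 2 / 5])   # d = [1+2, 1+1], normalized
```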
The other term, P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M), can be computed as

\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M)
 = \log P(\{x_{T+1}, X_y\} \mid M) - \log P(X_y \mid M)   (6.9)
 \approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)).   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(\Theta_y) and E(\{x_{T+1}, X_y\}, Q'(\Theta_y)). The error in the approximation is the difference KL(Q'(\Theta_y) \,\|\, P(\Theta_y \mid \{x_{T+1}, X_y\})) - KL(Q(\Theta_y) \,\|\, P(\Theta_y \mid X_y)). If we assume further that Q'(\Theta_y) \approx Q(\Theta_y),

\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y) \, d\Theta_y = \log Z_{T+1},   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
7 Experiment
7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
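A data set of this kind can be simulated along the following lines. This is a sketch, not the authors' exact protocol: the particular source distributions, noise scale, and random seed are stand-ins chosen only to produce skewed and kurtotic sources:

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                        # data points, sources, mixtures

# Sources with assorted skewness/kurtosis (stand-ins for the paper's
# generalized gaussian, gamma, and beta sources).
s = np.vstack([
    rng.laplace(size=T),                   # heavy-tailed
    rng.gamma(2.0, size=T) - 2.0,          # skewed, zero-mean
    rng.beta(0.5, 0.5, size=T) - 0.5,      # bimodal
    rng.standard_normal(T),
])

A = rng.standard_normal((N, L))            # random mixing into 7 dims
x = A @ s + 0.05 * rng.standard_normal((N, T))   # additive noise

mask = rng.random((N, T)) < 0.3            # ~30% of entries missing
x_missing = np.where(mask, np.nan, x)

assert x_missing.shape == (7, 200)
assert 0.2 < np.isnan(x_missing).mean() < 0.4
```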
Figure 6: E(X, Q(\theta)) as a function of the number of hidden source dimensions (x-axis: 1 to 7; y-axis: log marginal likelihood lower bound, from −2000 to −1500). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(\theta)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(\theta) and P(\theta \mid X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, or N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; and 9.6% of the pixels were missing from any two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied on the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports that it is crucial to have the correct density model when filling in missing values and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) computed using equation 6.10 at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
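From the reported counts, the usual summary rates follow directly:

```python
# Reported leave-one-out results: 16/24 true positives, 42/50 true negatives.
tp, pos = 16, 24
tn, neg = 42, 50

sensitivity = tp / pos               # true positive rate
specificity = tn / neg               # true negative rate
accuracy = (tp + tn) / (pos + neg)   # overall fraction correct

assert abs(sensitivity - 2 / 3) < 1e-12
assert abs(specificity - 0.84) < 1e-12
assert abs(accuracy - 58 / 74) < 1e-12
```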
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T \times K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied the variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
Variational Bayesian Learning of ICA with Missing Data 2001
n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl
it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by
log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki
iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)
where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as
EX Qmicro DX
t
log zt CZ
Qmicro logPmicro
Qmicro dmicro (312)
4 Missing Data
41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation
xot D [A]o
t cent st (41)
would be overdetermined in st unless [A]ot has a rank smaller than L In
this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm
t j xot The
approximate distribution Qxmt j xo
t can be obtained by
Qxmt j xo
t DX
k
Qkt
Zplusmnxm
t iexcl xmktQxm
kt j xot k dxm
kt (42)
2002 K Chan T Lee and T Sejnowski
where plusmncent is the delta function and
Qxmkt j xo
t k DZ
Qmicro
ZN xm
kt j [Askt C ordm]mt [ordf]m
t Qskt dskt dmicro
DZ Z
QAQordfN xmkt j sup1xm
kt currenxmkt dA dordf (43)
sup1xmkt D [Asup1skt C sup1ordm]m
t (44)
currenxmkt
iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m
t (45)
Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm
t j xot using the above equations is demonstrated in Figure 3 The
shaded area is the exact posterior Pxmt j xo
t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing
Pt
shy 4 shy 3 shy 2 shy 1 0 1 20
01
02
03
04
05
06
07
08
09
Figure 3 The approximation of Qxmt j xo
t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm
t j xot corresponding to the noiseless mixture in Figure 1
with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm
kt j xot k
Variational Bayesian Learning of ICA with Missing Data 2003
withP
t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm
t j xot from this modied method which we refer to
as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm
t j xot and a better estimate
for hxmt j xo
t i From equation 216
Qxmt j xo
t DZ
Qmicro
ZN xm
t j [Ast C ordm]mt [ordf]m
t Qst dst dmicro (46)
and the above formalism Qst becomes
Qst DX
k
Qkt
Zplusmnst iexcl sktQskt dskt (47)
which is a mixture of KL gaussians The missing values can then be lled inby
hst j xot i D
ZstQst dst D
X
k
Qktsup1skt (48)
hxmt j xo
t i DZ
xmt Qxm
t j xot dxm
t
DX
k
Qktsup1xmkt D [A]m
t hst j xot i C [sup1ordm]m
t (49)
where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively
a maximum a posterior (MAP) estimate on Qst and Qxmt j xo
t may beobtained but then numerical methods are needed
42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo
t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure
2004 K Chan T Lee and T Sejnowski
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes
Pxt j 2 DX
cPct D c j frac12Pxt j micro c
DX
cPct D c j frac12
ZN xt j Acsc
t C ordmc ordfcPsct j micro c
s dsct (51)
The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by
Q2 D Qfrac12Y
cQmicro c (52)
and similar to equation 215
ZQ2 log
P2
Q2d2 D
ZQfrac12 log
Pfrac12
Qfrac12dfrac12
CX
c
ZQmicro c log
Pmicro c
Qmicro cdmicro c (53)
Equation 220 now becomes
log Pxot j 2 cedil
X
cQct log
Pct
QctC
X
ck
QctQkct
poundmicroZ
Qsckt log Pxo
t j sckt micro c dsc
kt
CZ
Qsckt log
N sckt j Aacutec
k macrck
Qsckt
dsckt
para
CX
ck
QctQkct log
frac14 ck
Qkct
(54)
We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc
t All learning rules in section 3 remain
Variational Bayesian Learning of ICA with Missing Data 2005
the same only withP
t replaced byP
t Qct Finally we need two morelearning rules
dfrac12c D dofrac12c C
X
tQct (55)
log Qct D hlog frac12ci C log zct iexcl log Zt (56)
where zct is the normalization constant for Qkc
t (see equation 311) and Ztis for normalizing Qct
6 Supervised Classication
It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication
Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M
PyTC1 j xTC1 XT YT M
D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M
PxTC1 j XT YT M (61)
Here M denotes our generative model for observation fxt ytg
Pxt yt j M D Pxt j yt MPyt j M (62)
Pxt j yt M could be a mixture model as given by equation 51
61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y
Pxt j yt D y M D Pxt j 2y (63)
Pyt j M D Pyt D y j D y (64)
If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution
P j YT M D D j d (65)
dy D doy CX
tIyt D y (66)
2006 K Chan T Lee and T Sejnowski
Icent is an indicator function that equals 1 if its argument is true and 0 other-wise
Under the generative model of equation 62 it can be shown that
P2y j XT YT M D P2y j Xy (67)
where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y
by applying the learning rules in sections 3 and 5 on subset Xy
62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by
PyTC1 D y j XT YT M DZ
PyTC1 D y j yPy j XT YT dy
DdyPy dy
(68)
The other term PxTC1 j yTC1 XT YT M can be computed as
log PxTC1 j yTC1 D y XT YT M
D log PxTC1 j Xy M
D log PxTC1 Xy j M iexcl log PXy j M (69)
frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)
The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y
log PxTC1 j Xy M frac14Z
Q2y log PxTC1 j 2y d2y
D log ZTC1 (611)
where ZTC1 is the normalization constant in equation 56
7 Experiment
71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise
Variational Bayesian Learning of ICA with Missing Data 2007
Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
K. Chan, T.-W. Lee, and T. J. Sejnowski
where δ(·) is the delta function, and

$$Q(x_{kt}^m \mid x_t^o, k) = \int Q(\theta) \int \mathcal{N}\big(x_t^m \mid [A s_{kt} + \nu]_t^m, [\Psi]_t^m\big)\, Q(s_{kt})\, ds_{kt}\, d\theta$$
$$= \iint Q(A)\, Q(\Psi)\, \mathcal{N}\big(x_{kt}^m \mid \mu_{x_{kt}^m}, \Lambda_{x_{kt}^m}\big)\, dA\, d\Psi, \tag{4.3}$$

$$\mu_{x_{kt}^m} = [A \mu_{s_{kt}} + \mu_\nu]_t^m, \tag{4.4}$$

$$\Lambda_{x_{kt}^m}^{-1} = [A \Lambda_{s_{kt}}^{-1} A^\top + \Lambda_\nu^{-1} + \operatorname{diag}(\Psi)^{-1}]_t^m. \tag{4.5}$$
Unfortunately, the integration over Q(A) and Q(Ψ) cannot be carried out analytically, but we can substitute ⟨A⟩ and ⟨Ψ⟩ as an approximation. Estimation of Q(x_t^m | x_t^o) using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior P(x_t^m | x_t^o) for the noiseless mixing in Figure 1, with observed x_2 = −2, and the solid line is the approximation given by equations 4.2 through 4.5. We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing Σ_t
Figure 3: The approximation of Q(x_t^m | x_t^o) from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior P(x_t^m | x_t^o) corresponding to the noiseless mixture in Figure 1, with observed x_2 = −2. Dotted lines are the contributions from the individual Q(x_{kt}^m | x_t^o, k).
with Σ_t o_t^n, and Ψ_n with o_t^n Ψ_n, in their learning rules. The dashed line is the approximation Q(x_t^m | x_t^o) from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the K^L hidden-source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for P(x_t^m | x_t^o) and a better estimate for ⟨x_t^m | x_t^o⟩. From equation 2.16,
$$Q(x_t^m \mid x_t^o) = \int Q(\theta) \int \mathcal{N}\big(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m\big)\, Q(s_t)\, ds_t\, d\theta, \tag{4.6}$$

and with the above formalism, Q(s_t) becomes

$$Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt})\, Q(s_{kt})\, ds_{kt}, \tag{4.7}$$

which is a mixture of K^L gaussians. The missing values can then be filled in by

$$\langle s_t \mid x_t^o \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_k Q(k_t)\, \mu_{s_{kt}}, \tag{4.8}$$

$$\langle x_t^m \mid x_t^o \rangle = \int x_t^m\, Q(x_t^m \mid x_t^o)\, dx_t^m = \sum_k Q(k_t)\, \mu_{x_{kt}^m} = [A]_t^m \langle s_t \mid x_t^o \rangle + [\mu_\nu]_t^m, \tag{4.9}$$
where μ_{s_kt} and μ_{x_kt^m} are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate of Q(s_t) and Q(x_t^m | x_t^o) may be obtained, but then numerical methods are needed.
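Equations 4.8 and 4.9 reduce filling-in to a Q(k_t)-weighted average of the per-state posterior means. A minimal numpy sketch of that computation (variable names are hypothetical; the posterior quantities ⟨A⟩, μ_ν, Q(k_t), and μ_{s_kt} are assumed to come from an already fitted model):

```python
import numpy as np

def fill_missing(A, mu_nu, Qk, mu_s, miss):
    """Fill in the missing dimensions of one data point (eqs. 4.8-4.9 sketch).

    A     : (N, L) posterior mean of the mixing matrix, <A>
    mu_nu : (N,)   posterior mean of the bias nu
    Qk    : (K,)   responsibilities Q(k_t) over the source-state combinations
    mu_s  : (K, L) source posterior means mu_{s_kt}, one row per state k
    miss  : (N,)   boolean mask selecting the missing dimensions
    """
    s_hat = Qk @ mu_s                       # <s_t | x_t^o>, eq. 4.8
    return A[miss] @ s_hat + mu_nu[miss]    # <x_t^m | x_t^o>, eq. 4.9

# Hypothetical numbers: three mixtures, two sources, two source states.
x_fill = fill_missing(
    A=np.array([[1., 0.], [0., 1.], [1., 1.]]),
    mu_nu=np.array([0., 0., 1.]),
    Qk=np.array([0.5, 0.5]),
    mu_s=np.array([[1., 2.], [3., 4.]]),
    miss=np.array([False, False, True]),
)
print(x_fill)                               # [6.]
```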
4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of gaussians used to model the density of each source. If we set K = 2, the five parameters in the source density model P(s_tl) are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The "polynomial missing ICA" converges more slowly per epoch of learning rules, suffers from many more local maxima, and has an inferior marginal-likelihood lower bound. The problems are more serious at high missing-data rates, where a local maximum solution is usually found instead. In the full missing ICA, Q(s_t) is a mixture of gaussians. In the extreme case when all entries of a data point are missing (that is, an empty x_t^o), Q(s_t) is the same as P(s_t | θ) and would not interfere with the learning of P(s_t | θ) from other data points. On the other hand, the single gaussian Q(s_t) in the polynomial missing ICA would drive P(s_t | θ) to become gaussian too, which is very undesirable when learning ICA structure.
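The T × K^L factor arises because the full method expands the product of L source densities, each a mixture of K one-dimensional gaussians, into K^L joint gaussian states per data point. A short illustration of that enumeration (K and L are illustrative values, not the paper's):

```python
from itertools import product

# Hypothetical sizes: K = 2 gaussians per source density, L = 3 hidden sources.
K, L = 2, 3
states = list(product(range(K), repeat=L))   # all K^L joint source states
print(len(states))                           # 8
print(states[0], states[-1])                 # (0, 0, 0) (1, 1, 1)
```

With K = 2 the growth is 2^L, which stays manageable for the small, high-dimensional data sets the method targets.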
5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easily extended to model data density with C clusters of ICA. First, all parameters θ and hidden variables k_t, s_{kt} for each cluster are given a superscript index c. A parameter ρ = {ρ_1, ..., ρ_C} is introduced to represent the weights on the clusters; ρ has a Dirichlet prior (see equation 2.11). Θ = {ρ, θ^1, ..., θ^C} is now the collection of all parameters. Our density model in equation 2.1 becomes

$$P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c)$$
$$= \sum_c P(c_t = c \mid \rho) \int \mathcal{N}\big(x_t \mid A^c s_t^c + \nu^c, \Psi^c\big)\, P(s_t^c \mid \theta_s^c)\, ds_t^c. \tag{5.1}$$
The objective function in equation 2.13 remains the same, but with θ replaced by Θ. The separable posterior Q(Θ) is given by

$$Q(\Theta) = Q(\rho) \prod_c Q(\theta^c), \tag{5.2}$$

and, similar to equation 2.15,

$$\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta = \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho + \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c. \tag{5.3}$$

Equation 2.20 now becomes

$$\log P(x_t^o \mid \Theta) \ge \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)} + \sum_{c,k} Q(c_t)\, Q(k_t^c) \times \left[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c)\, ds_{kt}^c + \int Q(s_{kt}^c) \log \frac{\mathcal{N}(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)}\, ds_{kt}^c \right] + \sum_{c,k} Q(c_t)\, Q(k_t^c) \log \frac{\pi_k^c}{Q(k_t^c)}. \tag{5.4}$$
We have introduced one more hidden variable, c_t, and Q(c_t) is to be interpreted in the same fashion as Q(k_t^c). All learning rules in section 3 remain
the same, only with Σ_t replaced by Σ_t Q(c_t). Finally, we need two more learning rules:

$$d_{\rho_c} = d_{o\rho_c} + \sum_t Q(c_t), \tag{5.5}$$

$$\log Q(c_t) = \langle \log \rho_c \rangle + \log z_t^c - \log Z_t, \tag{5.6}$$

where z_t^c is the normalization constant for Q(k_t^c) (see equation 3.11) and Z_t is for normalizing Q(c_t).
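As a sketch, the two extra learning rules 5.5 and 5.6 could look like the following in numpy (function and argument names are hypothetical; the per-cluster constants z_t^c and the expectations ⟨log ρ_c⟩ are assumed to be supplied by the rest of the learning loop):

```python
import numpy as np

def update_cluster_weights(d_o, Qc):
    """Eq. 5.5 sketch: Dirichlet parameters for the cluster weights rho.

    d_o : (C,)   prior parameters d_{o rho_c}
    Qc  : (T, C) responsibilities Q(c_t) for all data points
    """
    return d_o + Qc.sum(axis=0)

def cluster_responsibilities(log_rho_mean, log_z):
    """Eq. 5.6 sketch: Q(c_t) proportional to exp(<log rho_c>) * z_t^c."""
    log_q = log_rho_mean + log_z        # <log rho_c> + log z_t^c
    log_q = log_q - log_q.max()         # subtract max for numerical stability
    q = np.exp(log_q)
    return q / q.sum()                  # dividing by Z_t normalizes Q(c_t)
```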
6 Supervised Classification

It is generally difficult for discriminative classifiers, such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998), to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification.

Consider a data set (X_T, Y_T) = {(x_t, y_t); t = 1, ..., T}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, ..., y, ..., Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

$$P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}. \tag{6.1}$$

Here M denotes our generative model for the observations {x_t, y_t}:

$$P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M). \tag{6.2}$$

P(x_t | y_t, M) could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y and P(y_t | M) be parameterized by ω = (ω_1, ..., ω_Y):

$$P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y), \tag{6.3}$$
$$P(y_t \mid M) = P(y_t = y \mid \omega) = \omega_y. \tag{6.4}$$
If ω is given a Dirichlet prior, P(ω | M) = D(ω | d_o), its posterior is also a Dirichlet distribution:

$$P(\omega \mid Y_T, M) = \mathcal{D}(\omega \mid d), \tag{6.5}$$
$$d_y = d_{oy} + \sum_t I(y_t = y). \tag{6.6}$$
I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.
Under the generative model of equation 6.2, it can be shown that

$$P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y), \tag{6.7}$$

where X_y is a subset of X_T that contains only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 to the subset X_y.
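The Dirichlet update of equations 6.5 and 6.6 is simply the prior parameters plus the per-class label counts. A minimal sketch (names are hypothetical; labels are assumed encoded as integers 0, ..., Y−1):

```python
import numpy as np

def class_prior_posterior(d_o, labels, Y):
    """Eqs. 6.5-6.6 sketch: Dirichlet posterior parameters for class weights.

    d_o    : (Y,) prior parameters d_{oy}
    labels : training labels y_t, encoded as integers 0, ..., Y-1
    """
    counts = np.bincount(np.asarray(labels), minlength=Y)   # sum_t I(y_t = y)
    return d_o + counts

print(class_prior_posterior(np.ones(2), [0, 0, 1], Y=2))    # [3. 2.]
```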
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed by

$$P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \omega)\, P(\omega \mid X_T, Y_T)\, d\omega = \frac{d_y}{\sum_{y'} d_{y'}}. \tag{6.8}$$
The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

$$\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M) = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M) \tag{6.9}$$
$$\approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)). \tag{6.10}$$

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q′(Θ_y) and E({x_{T+1}, X_y}, Q′(Θ_y)). The error in the approximation is the difference KL(Q′(Θ_y), P(Θ_y | {x_{T+1}, X_y})) − KL(Q(Θ_y), P(Θ_y | X_y)). If we assume further that Q′(Θ_y) ≈ Q(Θ_y), then

$$\log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1}, \tag{6.11}$$

where Z_{T+1} is the normalization constant in equation 5.6.
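Putting equations 6.1, 6.8, and 6.11 together, classification reduces to adding per-class log evidence to the Dirichlet class prior and normalizing. A numpy sketch (hypothetical names; the per-class log P(x_{T+1} | X_y, M) values, e.g. the log Z_{T+1} terms, are assumed precomputed):

```python
import numpy as np

def classify(log_evidence, d):
    """Posterior class probabilities for a new point (eqs. 6.1, 6.8, 6.11 sketch).

    log_evidence : (Y,) per-class log P(x_{T+1} | X_y, M)
    d            : (Y,) Dirichlet parameters of P(omega | Y_T, M)
    """
    log_prior = np.log(d) - np.log(d.sum())   # eq. 6.8 in log form
    log_post = log_evidence + log_prior       # numerator of eq. 6.1
    log_post = log_post - log_post.max()      # stabilize before exponentiating
    p = np.exp(log_post)
    return p / p.sum()                        # denominator of eq. 6.1
```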
7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise
Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)
Figure 5: Source density modeling of the synthetic data by variational missing ICA. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians model probability densities; dotted lines: individual gaussian contributions.
at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4; dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.
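A sketch of how such a data set could be generated (assumptions: gamma and beta variates stand in for the three source families; sources are standardized to unit variance; noise amplitude 0.05, roughly −26 dB relative to a unit-variance signal; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                  # data points, hidden sources, mixtures

# Sources of various skewness and kurtosis (gamma: skewed; beta: sub-gaussian).
s = np.vstack([
    rng.gamma(2.0, 1.0, T),
    rng.beta(0.5, 0.5, T),
    rng.gamma(1.0, 1.0, T),
    rng.beta(2.0, 5.0, T),
])
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

A = rng.standard_normal((N, L))      # random mixing into seven dimensions
x = A @ s + 0.05 * rng.standard_normal((N, T))   # weak additive noise

mask = rng.random((N, T)) < 0.3      # each entry missing with probability 0.3
x_obs = np.where(mask, np.nan, x)    # observed data with missing entries
```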
Figure 6: E(X, Q(θ)), the lower bound on the log marginal likelihood (y-axis, roughly −2000 to −1500), as a function of the number of hidden source dimensions (x-axis, 1 to 7). Full missing ICA refers to the full expansion of gaussians discussed in section 2.3, and polynomial missing ICA refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(X | θ), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, that is, N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered; 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low
Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and the separated images (bottom row).
uncertainty; and 9.6% of the pixels were missing from two of the three mixed images, so estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was only 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
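With each pixel masked independently with probability 0.2 in each of the three mixed images, the quoted fractions follow directly from the binomial probabilities of a pixel being missing from exactly m of the three images:

```python
from math import comb

p = 0.2        # per-image probability that a given pixel is missing
prob = {m: comb(3, m) * p**m * (1 - p)**(3 - m) for m in range(4)}

print(round(prob[3], 4))   # 0.008 -> 0.8% missing from all three images
print(round(prob[1], 4))   # 0.384 -> 38.4% missing from exactly one image
print(round(prob[2], 4))   # 0.096 -> 9.6% missing from exactly two images
```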
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict the survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we obtained a true positive rate of 16/24 and a true negative rate of 42/50.
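A sketch of the leave-one-out protocol used here (names are hypothetical; `train_and_predict` stands in for fitting the generative classifier of section 6 on the retained examples and returning P(y = 1 | x) for the held-out one):

```python
import numpy as np

def loo_rates(X, y, train_and_predict):
    """Leave-one-out cross-validation sketch, reporting TPR and TNR."""
    preds = []
    for t in range(len(y)):
        keep = np.arange(len(y)) != t            # hold out example t
        preds.append(train_and_predict(X[keep], y[keep], X[t]) >= 0.5)
    preds = np.array(preds)
    tpr = preds[y == 1].mean()                   # true positive rate
    tnr = 1.0 - preds[y == 0].mean()             # true negative rate
    return float(tpr), float(tnr)
```

On the echocardiogram task the rates quoted above are 16/24 ≈ 0.67 and 42/50 = 0.84.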
8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to modeling data density with clusters of ICA was discussed, and the application of the technique in a supervised classification setting was also covered. We have applied variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

MacKay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.
Variational Bayesian Learning of ICA with Missing Data 2003
withP
t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm
t j xot from this modied method which we refer to
as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm
t j xot and a better estimate
for hxmt j xo
t i From equation 216
Qxmt j xo
t DZ
Qmicro
ZN xm
t j [Ast C ordm]mt [ordf]m
t Qst dst dmicro (46)
and the above formalism Qst becomes
Qst DX
k
Qkt
Zplusmnst iexcl sktQskt dskt (47)
which is a mixture of KL gaussians The missing values can then be lled inby
hst j xot i D
ZstQst dst D
X
k
Qktsup1skt (48)
hxmt j xo
t i DZ
xmt Qxm
t j xot dxm
t
DX
k
Qktsup1xmkt D [A]m
t hst j xot i C [sup1ordm]m
t (49)
where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively
a maximum a posterior (MAP) estimate on Qst and Qxmt j xo
t may beobtained but then numerical methods are needed
42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo
t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure
2004 K Chan T Lee and T Sejnowski
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes
Pxt j 2 DX
cPct D c j frac12Pxt j micro c
DX
cPct D c j frac12
ZN xt j Acsc
t C ordmc ordfcPsct j micro c
s dsct (51)
The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by
Q2 D Qfrac12Y
cQmicro c (52)
and similar to equation 215
ZQ2 log
P2
Q2d2 D
ZQfrac12 log
Pfrac12
Qfrac12dfrac12
CX
c
ZQmicro c log
Pmicro c
Qmicro cdmicro c (53)
Equation 220 now becomes
log Pxot j 2 cedil
X
cQct log
Pct
QctC
X
ck
QctQkct
poundmicroZ
Qsckt log Pxo
t j sckt micro c dsc
kt
CZ
Qsckt log
N sckt j Aacutec
k macrck
Qsckt
dsckt
para
CX
ck
QctQkct log
frac14 ck
Qkct
(54)
We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc
t All learning rules in section 3 remain
Variational Bayesian Learning of ICA with Missing Data 2005
the same only withP
t replaced byP
t Qct Finally we need two morelearning rules
dfrac12c D dofrac12c C
X
tQct (55)
log Qct D hlog frac12ci C log zct iexcl log Zt (56)
where zct is the normalization constant for Qkc
t (see equation 311) and Ztis for normalizing Qct
6 Supervised Classication
It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication
Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M
PyTC1 j xTC1 XT YT M
D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M
PxTC1 j XT YT M (61)
Here M denotes our generative model for observation fxt ytg
Pxt yt j M D Pxt j yt MPyt j M (62)
Pxt j yt M could be a mixture model as given by equation 51
61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y
Pxt j yt D y M D Pxt j 2y (63)
Pyt j M D Pyt D y j D y (64)
If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution
P j YT M D D j d (65)
dy D doy CX
tIyt D y (66)
2006 K Chan T Lee and T Sejnowski
Icent is an indicator function that equals 1 if its argument is true and 0 other-wise
Under the generative model of equation 62 it can be shown that
P2y j XT YT M D P2y j Xy (67)
where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y
by applying the learning rules in sections 3 and 5 on subset Xy
62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by
PyTC1 D y j XT YT M DZ
PyTC1 D y j yPy j XT YT dy
DdyPy dy
(68)
The other term PxTC1 j yTC1 XT YT M can be computed as
log PxTC1 j yTC1 D y XT YT M
D log PxTC1 j Xy M
D log PxTC1 Xy j M iexcl log PXy j M (69)
frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)
The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y
log PxTC1 j Xy M frac14Z
Q2y log PxTC1 j 2y d2y
D log ZTC1 (611)
where ZTC1 is the normalization constant in equation 56
7 Experiment
71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise
Variational Bayesian Learning of ICA with Missing Data 2007
Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
References
Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851
Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress
Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California
Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114
Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation
Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT
Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley
Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233
Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122
Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley
Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge
Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge
Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete
data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation
Received July 18 2002 accepted January 30 2003
2004 K Chan T Lee and T Sejnowski
5 Clusters of ICA
The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes
Pxt j 2 DX
cPct D c j frac12Pxt j micro c
DX
cPct D c j frac12
ZN xt j Acsc
t C ordmc ordfcPsct j micro c
s dsct (51)
The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by
Q2 D Qfrac12Y
cQmicro c (52)
and similar to equation 215
ZQ2 log
P2
Q2d2 D
ZQfrac12 log
Pfrac12
Qfrac12dfrac12
CX
c
ZQmicro c log
Pmicro c
Qmicro cdmicro c (53)
Equation 220 now becomes
log Pxot j 2 cedil
X
cQct log
Pct
QctC
X
ck
QctQkct
poundmicroZ
Qsckt log Pxo
t j sckt micro c dsc
kt
CZ
Qsckt log
N sckt j Aacutec
k macrck
Qsckt
dsckt
para
CX
ck
QctQkct log
frac14 ck
Qkct
(54)
We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc
t All learning rules in section 3 remain
Variational Bayesian Learning of ICA with Missing Data 2005
the same only withP
t replaced byP
t Qct Finally we need two morelearning rules
dfrac12c D dofrac12c C
X
tQct (55)
log Qct D hlog frac12ci C log zct iexcl log Zt (56)
where zct is the normalization constant for Qkc
t (see equation 311) and Ztis for normalizing Qct
6 Supervised Classication
It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication
Consider a data set (X_T, Y_T) = {x_t, y_t; t = 1, ..., T}. Here x_t contains the input attributes and may have missing entries; y_t ∈ {1, ..., y, ..., Y} indicates which of the Y classes x_t is associated with. When given a new data point x_{T+1}, we would like to compute P(y_{T+1} | x_{T+1}, X_T, Y_T, M):

  P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M) = \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}.   (6.1)
Here M denotes our generative model for the observations {x_t, y_t}:

  P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M),   (6.2)

where P(x_t | y_t, M) could be a mixture model as given by equation 5.1.
6.1 Learning of Model Parameters. Let P(x_t | y_t, M) be parameterized by Θ_y and P(y_t | M) be parameterized by γ = {γ_1, ..., γ_Y} (the symbol γ is our reconstruction; the extracted text lost the original glyph):

  P(x_t \mid y_t = y, M) = P(x_t \mid \Theta_y),   (6.3)

  P(y_t \mid M) = P(y_t = y \mid \gamma) = \gamma_y.   (6.4)

If γ is given a Dirichlet prior, P(γ | M) = D(γ | d_o), its posterior is also a Dirichlet distribution:

  P(\gamma \mid Y_T, M) = \mathcal{D}(\gamma \mid d),   (6.5)

  d_y = d_{oy} + \sum_t I(y_t = y).   (6.6)
I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.
Under the generative model of equation 6.2, it can be shown that

  P(\Theta_y \mid X_T, Y_T, M) = P(\Theta_y \mid X_y),   (6.7)

where X_y is the subset of X_T containing only those x_t whose training labels y_t have value y. Hence, P(Θ_y | X_T, Y_T, M) can be approximated with Q(Θ_y) by applying the learning rules in sections 3 and 5 to the subset X_y.
6.2 Classification. First, P(y_{T+1} | X_T, Y_T, M) in equation 6.1 can be computed as

  P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \gamma)\, P(\gamma \mid X_T, Y_T)\, d\gamma = \frac{d_y}{\sum_{y'} d_{y'}}.   (6.8)
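Equations 6.6 and 6.8 amount to counting labels and normalizing the Dirichlet counts. A minimal sketch (function name ours):

```python
import numpy as np

def class_prior_posterior(y_train, n_classes, d_o=1.0):
    """Dirichlet posterior counts for the class prior (eq. 6.6)
    and the resulting predictive class probabilities (eq. 6.8)."""
    # d_y = d_{oy} + #{t : y_t = y}
    d = d_o + np.bincount(y_train, minlength=n_classes)
    return d, d / d.sum()

# With a symmetric prior of 1 and labels [0, 1, 1, 1]:
d, p = class_prior_posterior(np.array([0, 1, 1, 1]), n_classes=2, d_o=1.0)
# d = [2, 4], so the predictive prior is p = [1/3, 2/3]
```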
The other term, P(x_{T+1} | y_{T+1}, X_T, Y_T, M), can be computed as

  \log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X_y, M)
    = \log P(x_{T+1}, X_y \mid M) - \log P(X_y \mid M)   (6.9)
    \approx E(\{x_{T+1}, X_y\}, Q'(\Theta_y)) - E(X_y, Q(\Theta_y)).   (6.10)

The above requires adding x_{T+1} to X_y and iterating the learning rules to obtain Q'(Θ_y) and E({x_{T+1}, X_y}, Q'(Θ_y)). The error in the approximation is the difference KL(Q'(Θ_y) || P(Θ_y | {x_{T+1}, X_y})) − KL(Q(Θ_y) || P(Θ_y | X_y)). If we assume further that Q'(Θ_y) ≈ Q(Θ_y),

  \log P(x_{T+1} \mid X_y, M) \approx \int Q(\Theta_y) \log P(x_{T+1} \mid \Theta_y)\, d\Theta_y = \log Z_{T+1},   (6.11)

where Z_{T+1} is the normalization constant in equation 5.6.
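Once each class model supplies its approximate log-likelihood log Z_{T+1} (equation 6.11), Bayes' rule in equation 6.1 reduces to adding log priors and log-likelihoods and renormalizing. A hypothetical sketch, with the per-class log Z values assumed given by already-trained models:

```python
import numpy as np

def classify(log_Z, d):
    """Combine the class prior (eq. 6.8, from Dirichlet counts d)
    with per-class approximate log-likelihoods log Z (eq. 6.11)."""
    log_post = np.log(d / d.sum()) + np.asarray(log_Z)  # log prior + log lik
    log_post -= log_post.max()                          # stabilize
    post = np.exp(log_post)
    return post / post.sum()                            # eq. 6.1 normalization

# Equal priors, class 0's model explains x_{T+1} better by 2 nats:
p = classify(log_Z=[-10.0, -12.0], d=np.array([4.0, 4.0]))
# p[0] = 1 / (1 + e^{-2}), about 0.88
```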
7 Experiments

7.1 Synthetic Data. In the first experiment, 200 data points were generated by randomly mixing four sources in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise at a −26 dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4. Dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources and the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). This shows that the two gaussians gave an adequate fit to the source histograms and densities.

Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)

Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
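A setup of this kind is easy to reproduce. The sketch below generates sources of varied skewness and kurtosis, mixes them into seven dimensions, and masks entries at random; the particular distributions and the noise scale are our stand-ins, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7                  # data points, hidden sources, mixtures

# Sources with varied skewness/kurtosis (gamma: skewed; beta(0.5, 0.5):
# bimodal; Laplace: heavy-tailed; gaussian: baseline).
S = np.column_stack([
    rng.standard_gamma(2.0, T),
    rng.beta(0.5, 0.5, T),
    rng.laplace(0.0, 1.0, T),
    rng.standard_normal(T),
])
S = (S - S.mean(0)) / S.std(0)       # zero-mean, unit-variance sources

A = rng.standard_normal((N, L))      # random mixing matrix
X = S @ A.T
X += 0.05 * rng.standard_normal(X.shape)  # additive sensor noise (arbitrary level)

mask = rng.random(X.shape) < 0.3     # each entry missing with probability 0.3
X_obs = np.where(mask, np.nan, X)    # NaNs mark the missing entries
```

Some rows of X_obs will have fewer than four observed dimensions, exactly the regime the missing-data ICA is meant to handle.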
Figure 6: E(X, Q(θ)) as a function of the number of hidden source dimensions. The vertical axis shows the log marginal likelihood lower bound (approximately −2000 to −1500); the horizontal axis shows the number of dimensions (1 to 7). "Full missing ICA" refers to the full expansion of gaussians discussed in section 2.3, and "polynomial missing ICA" refers to the Chan et al. (2002) method with minor modification.
Figure 6 plots the lower bound of the log marginal likelihood (see equation 3.12) for models assuming different numbers of intrinsic dimensions. As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. In the figure, we also plot the E(X, Q(θ)) from the polynomial missing ICA. Since a less negative lower bound represents a smaller Kullback-Leibler divergence between Q(θ) and P(θ | X), it is clear from the figure that the full missing ICA gave a better fit to the data density.
7.2 Mixing Images. This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. This is made possible if we have more mixtures than hidden sources, that is, N > L. The top row in Figure 7 shows the two original 380 × 380 pixel images. They were linearly mixed into three images, and −20 dB noise was added. Missing entries were introduced randomly with probability 0.2. The denoised mixtures are shown in the third row of Figure 7, and the recovered sources are in the bottom row. Only 0.8% of the pixels were missing from all three mixed images and could not be recovered. 38.4% of the pixels were missing from only one mixed image, and their values could be filled in with low uncertainty; 9.6% of the pixels were missing from two of the mixed images, and estimation of their values is possible but carries higher uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was only 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x_1, x_2, and x_3 represented, say, cholesterol, blood sugar, and uric acid levels, it would be possible to fill in the third when only two are available.

Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom row).
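Since each pixel is dropped independently with probability 0.2 in each of the three mixtures, the quoted fractions follow directly from the binomial distribution; a quick check:

```python
from math import comb

p = 0.2                         # per-image missing probability
n = 3                           # number of mixed images
# P(a pixel is missing from exactly m of the n mixtures)
probs = {m: comb(n, m) * p**m * (1 - p)**(n - m) for m in range(n + 1)}

# probs[3] = 0.008 (0.8%: missing everywhere, unrecoverable)
# probs[1] = 0.384 (38.4%: fill-in with low uncertainty)
# probs[2] = 0.096 (9.6%: recoverable with higher uncertainty)
```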
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 on an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). The input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict the survival of the patient one year after a heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} | X_T, Y_T, M), computed using equation 6.10, at 0.5, we obtained a true positive rate of 16/24 and a true negative rate of 42/50.
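The evaluation step is straightforward once the leave-one-out posteriors are collected: threshold each held-out posterior at 0.5 and tally per-class hit rates. A minimal sketch (function name and toy numbers are ours):

```python
import numpy as np

def tp_tn_rates(p_survive, labels, threshold=0.5):
    """True positive / true negative rates from per-example posterior
    probabilities, e.g. gathered over leave-one-out folds."""
    pred = (np.asarray(p_survive) >= threshold).astype(int)
    labels = np.asarray(labels)
    tp_rate = (pred[labels == 1] == 1).mean()  # fraction of positives caught
    tn_rate = (pred[labels == 0] == 0).mean()  # fraction of negatives caught
    return tp_rate, tn_rate

# Toy check: 3 of 4 positives and 1 of 2 negatives classified correctly.
tp, tn = tp_tn_rates([0.9, 0.8, 0.6, 0.2, 0.4, 0.7], [1, 1, 1, 1, 0, 0])
# tp = 0.75, tn = 0.5
```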
8 Conclusion

In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

Mackay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Variational Bayesian Learning of ICA with Missing Data 2005
the same only withP
t replaced byP
t Qct Finally we need two morelearning rules
dfrac12c D dofrac12c C
X
tQct (55)
log Qct D hlog frac12ci C log zct iexcl log Zt (56)
where zct is the normalization constant for Qkc
t (see equation 311) and Ztis for normalizing Qct
6 Supervised Classication
It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication
Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M
PyTC1 j xTC1 XT YT M
D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M
PxTC1 j XT YT M (61)
Here M denotes our generative model for observation fxt ytg
Pxt yt j M D Pxt j yt MPyt j M (62)
Pxt j yt M could be a mixture model as given by equation 51
61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y
Pxt j yt D y M D Pxt j 2y (63)
Pyt j M D Pyt D y j D y (64)
If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution
P j YT M D D j d (65)
dy D doy CX
tIyt D y (66)
2006 K Chan T Lee and T Sejnowski
Icent is an indicator function that equals 1 if its argument is true and 0 other-wise
Under the generative model of equation 62 it can be shown that
P2y j XT YT M D P2y j Xy (67)
where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y
by applying the learning rules in sections 3 and 5 on subset Xy
62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by
PyTC1 D y j XT YT M DZ
PyTC1 D y j yPy j XT YT dy
DdyPy dy
(68)
The other term PxTC1 j yTC1 XT YT M can be computed as
log PxTC1 j yTC1 D y XT YT M
D log PxTC1 j Xy M
D log PxTC1 Xy j M iexcl log PXy j M (69)
frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)
The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y
log PxTC1 j Xy M frac14Z
Q2y log PxTC1 j 2y d2y
D log ZTC1 (611)
where ZTC1 is the normalization constant in equation 56
7 Experiment
71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise
Variational Bayesian Learning of ICA with Missing Data 2007
Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
References
Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851
Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress
Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California
Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114
Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation
Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT
Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley
Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233
Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122
Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley
Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge
Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge
Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete
data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation
Received July 18 2002 accepted January 30 2003
2006 K Chan T Lee and T Sejnowski
Icent is an indicator function that equals 1 if its argument is true and 0 other-wise
Under the generative model of equation 62 it can be shown that
P2y j XT YT M D P2y j Xy (67)
where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y
by applying the learning rules in sections 3 and 5 on subset Xy
62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by
PyTC1 D y j XT YT M DZ
PyTC1 D y j yPy j XT YT dy
DdyPy dy
(68)
The other term PxTC1 j yTC1 XT YT M can be computed as
log PxTC1 j yTC1 D y XT YT M
D log PxTC1 j Xy M
D log PxTC1 Xy j M iexcl log PXy j M (69)
frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)
The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y
log PxTC1 j Xy M frac14Z
Q2y log PxTC1 j 2y d2y
D log ZTC1 (611)
where ZTC1 is the normalization constant in equation 56
7 Experiment
71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise
Variational Bayesian Learning of ICA with Missing Data 2007
Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)
2010 K Chan T Lee and T Sejnowski
uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3
represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available
73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250
8 Conclusion
In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle
Variational Bayesian Learning of ICA with Missing Data 2011
References
Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851
Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress
Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California
Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114
Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation
Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT
Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley
Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233
Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122
Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley
Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge
Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge
Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete
data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation
Received July 18 2002 accepted January 30 2003
Variational Bayesian Learning of ICA with Missing Data 2007
Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)
Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution
at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities
2008 K Chan T Lee and T Sejnowski
1 2 3 4 5 6 7shy 2000
shy 1900
shy 1800
shy 1700
shy 1600
shy 1500
Number of dimensions
log
mar
gina
l lik
elih
ood
low
er b
ound
full missing ICA polynomial missing ICA
Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication
Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity
72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low
Variational Bayesian Learning of ICA with Missing Data 2009
Figure 7: A demonstration of recovering missing values when N > L. The original images are in the top row. Twenty percent of the pixels in the mixed images (second row) are missing at random. Only 0.8% are missing from the denoised mixed images (third row) and separated images (bottom).
uncertainty; and 9.6% of the pixels were missing from two of the mixed images. Estimation of their values is possible but would have high uncertainty. From Figure 7, we can see that the source images were well separated and the mixed images were nicely denoised. The signal-to-noise ratio (SNR) in the separated images was 14 dB. We also tried filling in the missing pixels by EM with a gaussian model; variational Bayesian ICA was then applied to the "completed" data. The SNR achieved in the unmixed images was 5 dB. This supports the claim that it is crucial to have the correct density model when filling in missing values, and important to learn the density model and the missing values concurrently. The denoised mixed images in this example were meant only to illustrate the method visually. However, if x1, x2, and x3 represent cholesterol, blood sugar, and uric acid levels, for example, it would be possible to fill in the third when only two are available.
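Under the stated missing-at-random mask (each pixel independently missing with probability 0.2 in each of the three mixtures), the quoted fractions follow directly from the binomial distribution; a quick check:

```python
from math import comb

p, n = 0.2, 3  # missing probability per mixture, number of mixed images

# P(a pixel is missing from exactly k of the n mixed images), assuming
# independent missingness across mixtures, as in the random mask above.
frac = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

print(round(frac[1], 3))  # 0.384 -> 38.4%: fill in with low uncertainty
print(round(frac[2], 3))  # 0.096 -> 9.6%: fill in with high uncertainty
print(round(frac[3], 3))  # 0.008 -> 0.8%: unrecoverable (missing everywhere)
```

The three probabilities reproduce the 38.4%, 9.6%, and 0.8% figures reported in the text.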
7.3 Survival Prediction. We demonstrate the supervised classification discussed in section 6 with an echocardiogram data set downloaded from the UCI Machine Learning Repository (Blake & Merz, 1998). Input variables are age-at-heart-attack, fractional-shortening, epss, lvdd, and wall-motion-index. The goal is to predict survival of the patient one year after the heart attack. There are 24 positive and 50 negative examples. The data matrix has a missing rate of 5.4%. We performed leave-one-out cross-validation to evaluate our classifier. Thresholding the output P(y_{T+1} = 1 | X_T, Y_T, M), computed using equation 6.10, at 0.5, we got a true positive rate of 16/24 and a true negative rate of 42/50.
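The evaluation tally can be sketched as follows. The classifier itself is the variational Bayesian model of section 6; the function name and the prediction lists below are ours, constructed only to reproduce the reported 16/24 and 42/50 rates after thresholding the predictive probability at 0.5.

```python
def rates(y_true, y_pred):
    """True positive rate and true negative rate from 0/1 label lists."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == p == 0)
    return tp / pos, tn / neg

# 24 positive and 50 negative examples, as in the echocardiogram data set.
y_true = [1] * 24 + [0] * 50
# Hypothetical leave-one-out predictions matching the reported counts:
# 16 of 24 positives and 42 of 50 negatives classified correctly.
y_pred = [1] * 16 + [0] * 8 + [0] * 42 + [1] * 8

tpr, tnr = rates(y_true, y_pred)
print(tpr, tnr)  # 16/24 ~= 0.667 and 42/50 = 0.84
```

In leave-one-out cross-validation, each of the 74 patients is held out in turn, the model is trained on the remaining 73, and the held-out prediction contributes one entry to these tallies.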
8 Conclusion
In this article, we derived the learning rules for variational Bayesian ICA with missing data. The complexity of the method is proportional to T × K^L, where T is the number of data points, L is the number of hidden sources assumed, and K is the number of one-dimensional gaussians used to model the density of each source. However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. The proposed method shows promise in analyzing and identifying projections of data sets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. The extension to model the data density with clusters of ICA was discussed. The application of the technique in a supervised classification setting was also covered. We have applied variational Bayesian missing ICA to a primates' brain volumetric data set containing 44 examples in 57 dimensions. Very encouraging results were obtained and will be reported in another article.
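The T × K^L scaling can be made concrete with a toy cost count: the full gaussian expansion gives each of the L sources K states, hence K^L joint source states evaluated per data point. The T, K, and L values below are illustrative only.

```python
def sweep_cost(T, K, L):
    """Mixture-component evaluations per sweep: T data points times
    K**L joint states from K gaussians on each of L sources."""
    return T * K**L

print(sweep_cost(500, 2, 4))  # 8000 evaluations: small and manageable
print(sweep_cost(500, 3, 7))  # 1093500 evaluations: exponential in L
```

This is why the method is best suited to modest L, as in the small, high-value data sets targeted here.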
References
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California.

Chan, K., Lee, T.-W., & Sejnowski, T. J. (2002). Variational learning of clusters of undercomplete nonsymmetric independent components. Journal of Machine Learning Research, 3, 99–114.

Choudrey, R. A., & Roberts, S. J. (2001). Flexible Bayesian independent component analysis for blind source separation. In 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90–95). San Diego, CA: Institute for Neural Computation.

Ghahramani, Z., & Jordan, M. (1994). Learning from incomplete data (Tech. Rep. CBCL Paper No. 108). Cambridge, MA: Center for Biological and Computational Learning, MIT.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

MacKay, D. J. (1995). Ensemble learning and evidence maximization (Tech. Rep.). Cambridge: Cavendish Laboratory, University of Cambridge.

Miskin, J. (2000). Ensemble learning for independent component analysis. Unpublished doctoral dissertation, University of Cambridge.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Welling, M., & Weber, M. (1999). Independent component analysis of incomplete data. In 1999 6th Joint Symposium on Neural Computation Proceedings (Vol. 9, pp. 162–168). San Diego, CA: Institute for Neural Computation.
Received July 18, 2002; accepted January 30, 2003.