variational bayesian learning of ica with missing datakwchan/publications/nc03-mica.pdfla jolla, ca...

LETTER Communicated by Erkki Oja

Variational Bayesian Learning of ICA with Missing Data

Kwokleung ChankwchansalkeduComputational Neurobiology Laboratory Salk Institute La Jolla CA 92037 USA

Te-Won LeetewonsalkeduInstitute for Neural Computation University of California at San DiegoLa Jolla CA 92093 USA

Terrence J SejnowskiterrysalkeduComputational Neurobiology Laboratory Salk Institute La Jolla CA 92037 USAand Department of Biology University of California at San DiegoLa Jolla CA 92093 USA

Missing data are common in real-world data sets and are a problem formany estimation techniques We have developed a variational Bayesianmethod to perform independent component analysis (ICA) on high-dimensional data containing missing entries Missing data are handlednaturally in the Bayesian framework by integrating the generative den-sity model Modeling the distributions of the independent sources withmixture of gaussians allows sources to be estimated with different kurto-sis and skewness Unlike the maximum likelihood approach the varia-tional Bayesian method automatically determines the dimensionality ofthe data and yields an accurate density model for the observed data with-out overtting problems The technique is also extended to the clustersof ICA and supervised classication framework

1 Introduction

Data density estimation is an important step in many machine learningproblems Often we are faced with data containing incomplete entries Thedata may be missing due to measurement or recording failure Anotherfrequent cause is difculty in collecting complete data For example it couldbe expensive and time-consuming to perform some biomedical tests Datascarcity is not uncommon and it would be very undesirable to discard thosedata points with missing entries when we already have a small data setTraditionally missing data are lled in by mean imputation or regressionimputation during preprocessing This could introduce biases into the data

Neural Computation 15 1991ndash2011 (2003) cdeg 2003 Massachusetts Institute of Technology

1992 K Chan T Lee and T Sejnowski

cloud density and adversely affect subsequent analysis A more principledway would be to use probability density estimates of the missing entriesinstead of point estimates A well-known example of this approach is theuse of the expectation-maximization (EM) algorithm in tting incompletedata with a single gaussian density (Little amp Rubin 1987)

Independent component analysis (ICA Hyvarinen Karhunen amp Oja2001) assumes the observed data x are generated from a linear combinationof independent sources s

x D A s C ordm (11)

where A is the mixing matrix which can be nonsquare The sources s havenongaussian density such as psl expiexcljsljq The noise term ordm can havenonzero mean ICA tries to locate independent axes within the data cloudand was developed forblind source separation It has been applied to speechseparation and analyzing fMRI and EEG data (Jung et al 2001) ICA is alsoused to model data density describing data as linear mixtures of indepen-dent features and nding projections that may uncover interesting structurein the data Maximum likelihood learning of ICA with incomplete data hasbeen studied by Welling and Weber (1999) in the limited case of a squaremixing matrix and predened source densities

Many real-world data sets have intrinsic dimensionality smaller thanthat of the observed data With missing data principal component analysiscannot be used to perform dimension reduction as preprocessing for ICAInstead the variational Bayesian method applied to ICA can handle smalldata sets with high observed dimension (Chan Lee amp Sejnowski 2002Choudrey amp Roberts 2001 Miskin 2000) The Bayesian method preventsovertting and performs automatic dimension reduction In this article weextend the variational Bayesian ICA method to problemswith missing dataMore important the probability density estimate of the missing entries canbe used to ll in the missing values This allows the density model to berened and made more accurate

2 Model and Theory

21 ICA Generative Model with Missing Data Consider a data set ofT data points in an N-dimensional space X D fxt 2 RNg t in f1 TgAssume a noisy ICA generative model for the data

Pxt j micro DZ

N xt j Ast C ordm ordfPst j micros dst (21)

where A is the mixing matrix and ordm and [ordf]iexcl1 are the observation meanand diagonal noise variance respectively The hidden source st is assumed

Variational Bayesian Learning of ICA with Missing Data 1993

to have L dimensions Similar to the independent factor analysis of Attias(1999) each component of st will be modeled by a mixture of K gaussiansto allow for source densities of various kurtosis and skewness

Pst j micros DLY

AacuteKX

frac14lklN

iexclstl j Aacutelkl macrlkl

Split each data point into a missing part and an observed part xgtt D xogt

xmgtt In this article we consider only the random missing case (Ghahra-

mani amp Jordan 1994) that is the probability for the missing entries xmt is

independent of the value of xmt but could depend on the value of xo

t Thelikelihood of the data set is then dened to be

L micro I X DY

t j micro (23)

Pxot j micro D

ZPxt j micro dxm

DZ microZ

N xt j Ast C ordm ordf dxmt

paraPst j micros dst

N xot j [Ast C ordm]o

t [ordf]ot Pst j micros dst (24)

Here we have introduced the notation [cent]ot which means taking only the

observed dimensions (corresponding to the tth data point) of whatever isinside the square brackets Since equation 24 is similar to equation 21the variational Bayesian ICA (Chan et al 2002 Choudrey amp Roberts 2001Miskin 2000) can be extended naturally to handle missing data but only ifcare is taken in discounting missing entries in the learning rules

22 Variational Bayesian Method In a full Bayesian treatment the pos-terior distribution of the parameters micro is obtained by

Pmicro j X D PX j microPmicro

Qt Pxo

t j micro Pmicro

PX (25)

where PX is the marginal likelihood and given as

PX DZ Y

t j microPmicro dmicro (26)

The ICA model for PX is dened with the following priors on the param-eters Pmicro

PAnl D N Anl j 0 regl Pfrac14l D Dfrac14l j dofrac14l

Pregl D G regl j aoregl boregl PAacutelkl D N Aacutelklj sup1oAacutelkl 3oAacutelkl (27)

Pmacrlkl D G macrlklj aomacrlkl bomacrlkl

Pordmn D N ordmn j sup1oordmn 3oordmn P9n D G 9n j ao9n bo9n (28)

where N cent G cent and Dcent are the normal gamma and Dirichlet distribu-tions respectively

N x j sup1 curren D

sjcurrenj

2frac14N eiexcl 12 xiexclsup1gtcurrenxiexclsup1I (29)

G x j a b D ba

0axaiexcl1eiexclbxI (210)

Dfrac14 j d D 0P

dkQ0dk

frac14d1iexcl11 pound cent cent cent pound frac14

dKiexcl1K (211)

Here aocent bocent docent sup1ocent and 3ocent are prechosen hyperparameters forthe priors Notice that curren in the normal distribution is an inverse covarianceparameter

Under the variational Bayesian treatment instead of performing the in-tegration in equation 26 to solve for Pmicro j X directly we approximate itby Qmicro and opt to minimize the Kullback-Leibler distance between them(Mackay 1995 Jordan Ghahramani Jaakkola amp Saul 1999)

iexclKLQmicro j Pmicro j X DZ

Qmicro logPmicro j X

Qmicrodmicro

Qmicro

log Pxot j micro C log

Pmicro

Qmicro

dmicro

iexcl log PX (212)

Since iexclKLQmicro j Pmicro j X middot 0 we get a lower bound for the log marginallikelihood

log PX cedilZ

Qmicro X

log Pxot j micro dmicro C

ZQmicro log

Pmicro

Qmicrodmicro (213)

which can also be obtained by applying Jensenrsquos inequality to equation 26Qmicro is then solved by functional maximization of the lower bound A sep-

arable approximate posterior Qmicro will be assumed

Qmicro D QordmQordf pound QAQreg

poundY

Qfrac14lY

QAacutelkl Qmacrlkl

The second term in equation 213 which is the negative Kullback-Leiblerdivergence between approximate posterior Qmicro and prior Pmicro is then ex-panded as

ZQmicro log

Pmicro

Qmicrodmicro

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Qregdreg

Qordm logPordm

Qordmdordm C

ZQordf log

Qordfdordf (215)

23 Special Treatment for Missing Data Thus far the analysis followsalmost exactly that of the variational Bayesian ICA on complete data exceptthat Pxt j micro is replaced by Pxo

t j micro in equation 26 and consequently themissing entries are discounted in the learning rules However it would beuseful to obtain Qxm

t j xot that is the approximate distribution on the

missing entries which is given by

Qxmt j xo

Qmicro

t j [Ast C ordm]mt [ordf]m

t Qst dst dmicro (216)

As noted by Welling and Weber (1999) elements of st given xot are depen-

dent More important under the ICA model Qst is unlikely to be a singlegaussian This is evident from Figure 1 which shows the probability den-sity functions of the data x and hidden variable s The inserts show thesample data in the two spaces Here the hidden sources assume density ofPsl expiexcljslj07 They are mixed noiselessly to give Px in the uppergraph The cut in the upper graph represents Px1 j x2 D iexcl05 whichtransforms into a highly correlated and nongaussian Ps j x2 D iexcl05

Unless we are interested in only the rst- and second-order statisticsof Qxm

t j xot we should try to capture as much structure as possible of

shy 05

Figure 1 Probability density functions for the data x (top) and hidden sourcess (bottom) Inserts show the sample data in the two spaces The ldquocutsrdquo showPx1 j x2 D iexcl05 and Ps j x2 D iexcl05

Pst j xot in Qst In this article we take a slightly different route from Chan

et al (2002) or Choudrey and Roberts (2001) when performing variationalBayesian learning First we break down Pst into a mixture of KL gaussiansin the L-dimensional s space

Pst DLY

AacuteX

frac14lklN stl j Aacutelkl macrlkl

cent cent centX

[frac141k1 pound cent cent cent pound frac14LkL

pound N st1 j Aacute1k1 macr1k1 pound cent cent cent pound N stL j AacuteLkL macrLkL ]

frac14k N st j Aacutek macrk (217)

Here we have dened k to be a vector index The ldquokthrdquo gaussian is centeredat Aacutek of inverse covariance macrk in the source s space

k D k1 kl kLgt kl D 1 K

Aacutek D Aacute1k1 Aacutelkl AacuteLkL

macrk D

Bmacr1k1

macrLkL

frac14k D frac141k1 pound cent cent cent pound frac14LkL (218)

Log likelihood for xot is then expanded using Jensenrsquos inequality

log Pxot j micro D log

t j st microX

frac14k N st j Aacutek macrk dst

D logX

frac14k

t j st micro N st j Aacutek macrk dst

cedilX

Qkt logZ

Pxot j st micro N st j Aacutek macrk dst

Qkt logfrac14k

Qkt (219)

Here Qkt is a short form for Qkt D k kt is a discrete hidden variableand Qkt D k is the probability that the tth data point belongs to the kthgaussian Recognizing that st is just a dummy variable we introduce Qskt

Figure 2 A simplied directed graph for the generative model of variationalICA xt is the observed variable kt and st are hidden variables and the restare model parameters The kt indicates which of the KL expanded gaussiansgenerated st

apply Jensenrsquos inequality again and get

log Pxot j micro cedil

microZQskt log Pxo

t j skt micro dskt

Qskt logN skt j Aacutek macrk

Qsktdskt

Qkt logfrac14k

Qkt (220)

Substituting log Pxot j micro back into equation 213 the variational Bayesian

method can be continued as usual We have drawn in Figure 2 a simpliedgraphical representation for the generative model of variational ICA xtis the observed variable kt and st are hidden variables and the rest aremodel parameters where kt indicates which of the KL expanded gaussiansgenerated st

3 Learning Rules

Combining equations 213 215 and 220 we perform functional maximiza-tion on the lower bound of the log marginal likelihood log PX with re-gard to Qmicro (see equation 214) Qkt and Qskt (see equation 220)mdashforexample

log Qordm D log Pordm CZ

Qmicro nordmX

log Pxot j micro dmicronordm C const (31)

where micro nordm is the set of parameters excluding ordm This gives

Qordm DY

nN ordmn j sup1ordmn 3ordmn

3ordmn D 3oordmn C h9niX

sup1ordmn D 3oordmnsup1oordmn C h9niP

t ontP

k Qkthxnt iexcl Ancentskti3ordmn

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

Qkthxnt iexcl Ancentskt iexcl ordmn2i (33)

nN Ancent j sup1Ancent currenAncent

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

ontxnt iexcl hordmniX

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

G regl j aregl bregl

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qfrac14l D Dfrac14 j dfrac14l

dfrac14lk D dofrac14lk CX

Qkt (36)

QAacutelkl D N Aacutelklj sup1Aacutelkl 3Aacutelkl

3Aacutelkl D 3oAacutelkl C hmacrlkliX

sup1Aacutelkl D3oAacutelkl sup1oAacutelkl C hmacrlkl

klDk Qkthsktli3Aacutelkl

Qmacrlkl D G macrlkl

j amacrlkl bmacrlkl

amacrlkl D aomacrlkl C12

bmacrlkl D bomacrlkl C12

Qkthsktl iexcl Aacutelkl 2i (38)

Qskt D N skt j sup1skt currenskt

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

Bhmacr1k1 Aacute1k1i

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

In the above equations hcenti denotes the expectation over the posterior distri-butions Qcent Ancent is the nth row of the mixing matrix A

PklDk means picking

out those gaussians such that the lth element of their indices k has the valueof k and ot is an indicator variable for observed entries in xt

ont Draquo

1 if xnt is observed0 if xnt is missing (310)

For a model of equal noise variance among all the observation dimensionsthe summation in the learning rules for Qordf would be over both t and

n Note that there exists scale and translational degeneracy in the modelas given by equation 21 and 22 After each update of Qfrac14l QAacutelkl andQmacrlkl

it is better to rescale Pstl to have zero mean and unit varianceQskt QA Qreg Qordm and Qordf have to be adjusted correspondinglyFinally Qkt is given by

log Qkt D hlog Pxot j skt microi C hlog N skt j Aacutek macrki

iexcl hlog Qskti C hlog frac14ki iexcl log zt (311)

where zt is a normalization constant The lower bound EX Qmicro for the logmarginal likelihood computed using equations 213 215 and 220 can bemonitored during learning and used for comparison of different solutionsor models After some manipulation EX Qmicro can be expressed as

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

41 Filling in Missing Entries Recovering missing values while per-forming demixing is possible if we have N gt L More specically if thenumber of observed dimensions in xt is greater than L the equation

xot D [A]o

t cent st (41)

would be overdetermined in st unless [A]ot has a rank smaller than L In

this case Qst is likely to be unimodal and peaked point estimates of stwould be sufcient and reliable and the learning rules of Chan et al (2002)with small modication to account for missing entries would give a rea-sonable approximation When Qst is a single gaussian the exponentialgrowth in complexity is avoided However if the number of observed di-mensions in xt is less than L equation 41 is now underdetermined in stand Qst would have a broad multimodal structure This corresponds toovercomplete ICA where single gaussian approximation of Qst is unde-sirable and the formalism discussed in this article is needed to capture thehigher-order statistics of Qst and produce a more faithful Qxm

t j xot The

approximate distribution Qxmt j xo

t can be obtained by

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

where plusmncent is the delta function and

Qxmkt j xo

t k DZ

Qmicro

kt j [Askt C ordm]mt [ordf]m

t Qskt dskt dmicro

QAQordfN xmkt j sup1xm

kt currenxmkt dA dordf (43)

sup1xmkt D [Asup1skt C sup1ordm]m

t (44)

currenxmkt

iexcl1 D [Acurrensktiexcl1Agt C currenordmiexcl1 C diagordfiexcl1]m

t (45)

Unfortunately the integration over QA and Qordf cannot be carried out an-alyticallybut we can substitute hAiand hordfi as an approximationEstimationof Qxm

t j xot using the above equations is demonstrated in Figure 3 The

shaded area is the exact posterior Pxmt j xo

t for the noiseless mixing inFigure 1 with observed x2 D iexcl2 and the solid line is the approximation byequations 42 through 45 We have modied the variational ICA of Chanet al (2002) by discounting missing entries This is done by replacing

shy 4 shy 3 shy 2 shy 1 0 1 20

Figure 3 The approximation of Qxmt j xo

t from the full missing ICA (solidline) and the polynomial missing ICA (dashed line) The shaded area is theexact posterior Pxm

t j xot corresponding to the noiseless mixture in Figure 1

with observed x2 D iexcl2 Dotted lines are the contribution from the individualQxm

kt j xot k

t ont and 9n with ont9n in their learning rules The dashed line is theapproximation Qxm

t j xot from this modied method which we refer to

as polynomial missing ICA The treatment of fully expanding the KL hiddensource gaussians discussed in section 23 is named full missing ICA The fullmissing ICA gives a more accurate t for Pxm

t j xot and a better estimate

for hxmt j xo

t i From equation 216

Qxmt j xo

Qmicro

and the above formalism Qst becomes

Qst DX

Zplusmnst iexcl sktQskt dskt (47)

which is a mixture of KL gaussians The missing values can then be lled inby

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t hst j xot i C [sup1ordm]m

t (49)

where sup1skt and sup1xmkt are given in equations 39 and 44 Alternatively

a maximum a posterior (MAP) estimate on Qst and Qxmt j xo

t may beobtained but then numerical methods are needed

42 The ldquoFullrdquo and ldquoPolynomialrdquo Missing ICA The complexity of thefull variational Bayesian ICA method is proportional to T pound KL where Tis the number of data points L is the number of hidden sources assumedand K is the number of gaussians used to model the density of each sourceIf we set K D 2 the ve parameters in the source density model Pstlare already enough to model the mean variance skewness and kurtosis ofthe source distribution The full missing ICA should always be preferredif memory and computational time permit The ldquopolynomial missing ICArdquoconverges more slowly per epoch of learning rules and suffers from manymore local maxima It has an inferior marginal likelihood lower bound Theproblems are more serious at high missing data rates and a local maximumsolution is usually found instead In the full missing ICA Qst is a mixtureof gaussians In the extreme case when all entries of a data point are missingthat is empty xo

t Qst is the same as Pst j micro and would not interferewith the learning of Pst j micro from other data point On the other hand thesingle gaussian Qst in the polynomial missing ICA would drive Pst j micro tobecome gaussian too This is very undesirable when learning ICA structure

5 Clusters of ICA

The variational Bayesian ICA for missing data described above can be easilyextended to model data density with C clusters of ICA First all parametersmicro and hidden variables kt skt for each cluster are given a superscript indexc Parameter frac12 D ffrac121 frac12Cg is introduced to represent the weights onthe clusters frac12 has a Dirichlet prior (see equation 211) 2 D ffrac12 micro1 microCgis now the collection of all parameters Our density model in equation 21becomes

Pxt j 2 DX

cPct D c j frac12Pxt j micro c

cPct D c j frac12

ZN xt j Acsc

t C ordmc ordfcPsct j micro c

s dsct (51)

The objective function in equation 213 remains the same but with micro replacedby 2 The separable posterior Q2 is given by

Q2 D Qfrac12Y

cQmicro c (52)

and similar to equation 215

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

Qmicro cdmicro c (53)

Equation 220 now becomes

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

t j sckt micro c dsc

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

We have introduced one more hidden variable ct and Qct is to be inter-preted in the same fashion as Qkc

t All learning rules in section 3 remain

the same only withP

t replaced byP

t Qct Finally we need two morelearning rules

dfrac12c D dofrac12c C

tQct (55)

log Qct D hlog frac12ci C log zct iexcl log Zt (56)

where zct is the normalization constant for Qkc

t (see equation 311) and Ztis for normalizing Qct

6 Supervised Classication

It is generally difcult for discriminative classiers such as multilayer per-ceptron (Bishop 1995) or support vector machine (Vapnik 1998) to handlemissing data In this section we extend the variational Bayesian techniqueto supervised classication

Consider a data set XT YT D fxt yt t in 1 Tg Here xt containsthe input attributes and may have missing entries yt 2 f1 y Ygindicates which of the Y classes xt is associated with When given a newdata point xTC1 we would like to compute PyTC1 j xTC1 XT YT M

PyTC1 j xTC1 XT YT M

D PxTC1 j yTC1 XT YT MPyTC1 j XT YT M

PxTC1 j XT YT M (61)

Here M denotes our generative model for observation fxt ytg

Pxt yt j M D Pxt j yt MPyt j M (62)

Pxt j yt M could be a mixture model as given by equation 51

61 Learning of Model Parameters Let Pxt j yt M be parameterizedby 2y and Pyt j M be parameterized by D 1 Y

Pxt j yt D y M D Pxt j 2y (63)

Pyt j M D Pyt D y j D y (64)

If is given a Dirichlet prior P j M D D j do its posterior hasalso a Dirichlet distribution

P j YT M D D j d (65)

dy D doy CX

tIyt D y (66)

Icent is an indicator function that equals 1 if its argument is true and 0 other-wise

Under the generative model of equation 62 it can be shown that

P2y j XT YT M D P2y j Xy (67)

where Xy is a subset of XT but contains only those xt whose training labelsyt have value y Hence P2y j XT YT M can be approximated with Q2y

by applying the learning rules in sections 3 and 5 on subset Xy

62 Classication First PyTC1 j XT YT M in equation 61 can be com-puted by

PyTC1 D y j XT YT M DZ

PyTC1 D y j yPy j XT YT dy

DdyPy dy

The other term PxTC1 j yTC1 XT YT M can be computed as

log PxTC1 j yTC1 D y XT YT M

D log PxTC1 j Xy M

D log PxTC1 Xy j M iexcl log PXy j M (69)

frac14 EfxTC1 Xyg Q02y iexcl EXy Q2y (610)

The above requires adding xTC1 to Xy and iterating the learning rules toobtain Q02y and EfxTC1 Xyg Q02y The error in the approximation isthe difference KLQ02y P2y j fxTC1 Xyg iexcl KLQ2y P2y j Xy If weassume further that Q02y frac14 Q2y

log PxTC1 j Xy M frac14Z

Q2y log PxTC1 j 2y d2y

D log ZTC1 (611)

where ZTC1 is the normalization constant in equation 56

7 Experiment

71 Synthetic Data In the rst experiment 200 data points were gener-ated by mixing four sources randomly in a seven-dimensional space Thegeneralized gaussian gamma and beta distributions were used to repre-sent source densities of various skewness and kurtosis (see Figure 5) Noise

Figure 4 In the rst experiment 30 of the entries in the seven-dimensionaldata set are missing as indicated by the black entries (The rst 100 data pointsare shown)

Figure 5 Source density modeling by variational missing ICA of the syntheticdata Histograms recovered sources distribution dashed lines original proba-bility densities solid line mixture of gaussians modeled probability densitiesdotted lines individual gaussian contribution

at iexcl26 dB level was added to the data and missing entries were createdwith a probability of 03 The data matrix for the rst 100 data points isplotted in Figure 4 Dark pixels represent missing entries Notice that somedata points have fewer than four observed dimensions In Figure 5 weplotted the histograms of the recovered sources and the probability densityfunctions (pdf) of the four sources The dashed line is the exact pdf usedto generate the data and the solid line is the modeled pdf by mixture oftwo one-dimensional gaussians (see equation 22) This shows that the twogaussians gave adequate t to the source histograms and densities

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

Number of dimensions

full missing ICA polynomial missing ICA

Figure 6 E X Qmicro as a function of hidden source dimensions Full missing ICArefers to the full expansions of gaussians discussed in section 23 and polynomialmissing ICA refers to the Chan et al (2002) method with minor modication

Figure 6 plots the lower bound of log marginal likelihood (see equa-tion 312) for models assuming different numbers of intrinsic dimensionsAs expected the Bayesian treatment allows us to the infer the intrinsic di-mension of the data cloud In the gure we also plot the EX Qmicro fromthe polynomial missing ICA Since a less negative lower bound representsa smaller Kullback-Leibler divergence between Qmicro and PX j micro it isclear from the gure that the full missing ICA gave a better t to the datadensity

72 Mixing Images This experiment demonstrates the ability of the pro-posed method to ll in missing values while performing demixing This ismade possible if we have more mixtures than hidden sources or N gt L Thetop row in Figure 7 shows the two original 380 pound 380 pixel images Theywere linearly mixed into three images and iexcl20 dB noise was added Miss-ing entries were introduced randomly with probability 02 The denoisedmixtures are shown in the third row of Figure 7 and the recovered sourcesare in the bottom row Only 08 of the pixels were missing from all threemixed images and could not be recovered 384 of the pixels were missingfrom only one mixed image and their values could be lled in with low

Figure 7 A demonstration of recovering missing values when N gt L Theoriginal images are in the top row Twenty percent of the pixels in the mixedimages (second row) are missing at random Only 08 are missing from thedenoised mixed images (third row) and separated images (bottom)

uncertainty and 96 of the pixels were missing from any two of the mixedimages Estimation of their values is possible but would have high uncer-tainty From Figure 7 we can see that the source images were well separatedand the mixed images were nicely denoised The signal-to-noise ratio (SNR)in the separated images was 14 dB We have also tried lling in the missingpixels by EM with a gaussian model Variational Bayesian ICA was then ap-plied on the ldquocompletedrdquo data The SNR achieved in the unmixed imageswas 5 dB This supports that it is crucial to have the correct density modelwhen lling in missing values and important to learn the density model andmissing values concurrently The denoised mixed images in this examplewere meant only to illustrate the method visually However if x1 x2 x3

represent cholesterol blood sugar and uric acid level for example it wouldbe possible to ll in the third when only two are available

73 Survival Prediction We demonstrate the supervised classicationdiscussed in section 6 with an echocardiogram data set downloaded fromthe UCI Machine Learning Repository (Blake amp Merz 1998) Input variablesare age-at-heart-attack fractional-shortening epss lvdd and wall-motion-indexThe goal is to predict survival of the patient one year after heart attack Thereare 24 positive and 50 negative examples The data matrix has a missingrate of 54 We performed leave-one-out cross-validation to evaluate ourclassier Thresholding the output PyTC1 j XT YT M computed usingequation 610 at 05 we got a true positive rate of 1624 and a true negativerate of 4250

8 Conclusion

In this article we derived the learning rules for variational Bayesian ICAwith missing data The complexity of the method is proportional to T pound KLwhere T is the number of data points L is the number of hidden sourcesassumed and K is the number of 1D gaussians used to model the densityof each source However this exponential growth in complexity is man-ageable and worthwhile for small data sets containing missing entries in ahigh-dimensional space The proposed method shows promise in analyzingand identifying projections of data sets that have a very limited number ofexpensive data points yet contain missing entries due to data scarcity Theextension to model data density with clusters of ICA was discussed Theapplication of the technique in a supervised classication setting was alsocovered We have applied the variational Bayesian missing ICA to a pri-matesrsquo brain volumetric data set containing 44 examples in 57 dimensionsVery encouraging results were obtained and will be reported in anotherarticle

References

Attias H (1999) Independent factor analysis Neural Computation 11(4) 803ndash851

Bishop C M (1995) Neural networks for pattern recognition Oxford ClarendonPress

Blake C amp Merz C (1998) UCI repository of machine learning databases IrvineCA University of California

Chan K Lee T-W amp Sejnowski T J (2002) Variational learning of clusters ofundercomplete nonsymmetric independent components Journal of MachineLearning Research 3 99ndash114

Choudrey R A amp Roberts S J (2001) Flexible Bayesian independent compo-nent analysis for blind source separation In 3rd International Conference onIndependent Component Analysis and Blind Signal Separation (pp 90ndash95) SanDiego CA Institute for Neural Computation

Ghahramani Z amp Jordan M (1994) Learning from incomplete data (Tech RepCBCL Paper No 108) Cambridge MA Center for Biological and Computa-tional Learning MIT

Hyvarinen A Karhunen J amp Oja E (2001) Independent component analysisNew York Wiley

Jordan M I Ghahramani Z Jaakkola T amp Saul L K (1999) An introductionto variational methods for graphical models Machine Learning 37(2) 183ndash233

Jung T-P Makeig S McKeown M J Bell A Lee T-W amp Sejnowski T J(2001) Imaging brain dynamics using independent component analysisProceedings of the IEEE 89(7) 1107ndash1122

Little R J A amp Rubin D B (1987) Statistical analysis with missing data NewYork Wiley

Mackay D J (1995) Ensemble learning and evidence maximization (Tech Rep)Cambridge Cavendish Laboratory University of Cambridge

Miskin J (2000) Ensemble learning for independent component analysis Unpub-lished doctoral dissertation University of Cambridge

Vapnik V (1998) Statistical learning theory New York WileyWelling M amp Weber M (1999) Independent component analysis of incomplete

data In 1999 6th Joint Symposium on Neural Compuatation Proceedings (Vol 9pp 162ndash168) San Diego CA Institute for Neural Computation

Received July 18 2002 accepted January 30 2003

cloud density and adversely affect subsequent analysis A more principledway would be to use probability density estimates of the missing entriesinstead of point estimates A well-known example of this approach is theuse of the expectation-maximization (EM) algorithm in tting incompletedata with a single gaussian density (Little amp Rubin 1987)

Independent component analysis (ICA Hyvarinen Karhunen amp Oja2001) assumes the observed data x are generated from a linear combinationof independent sources s

x D A s C ordm (11)

where A is the mixing matrix which can be nonsquare The sources s havenongaussian density such as psl expiexcljsljq The noise term ordm can havenonzero mean ICA tries to locate independent axes within the data cloudand was developed forblind source separation It has been applied to speechseparation and analyzing fMRI and EEG data (Jung et al 2001) ICA is alsoused to model data density describing data as linear mixtures of indepen-dent features and nding projections that may uncover interesting structurein the data Maximum likelihood learning of ICA with incomplete data hasbeen studied by Welling and Weber (1999) in the limited case of a squaremixing matrix and predened source densities

Many real-world data sets have intrinsic dimensionality smaller thanthat of the observed data With missing data principal component analysiscannot be used to perform dimension reduction as preprocessing for ICAInstead the variational Bayesian method applied to ICA can handle smalldata sets with high observed dimension (Chan Lee amp Sejnowski 2002Choudrey amp Roberts 2001 Miskin 2000) The Bayesian method preventsovertting and performs automatic dimension reduction In this article weextend the variational Bayesian ICA method to problemswith missing dataMore important the probability density estimate of the missing entries canbe used to ll in the missing values This allows the density model to berened and made more accurate

2 Model and Theory

21 ICA Generative Model with Missing Data Consider a data set ofT data points in an N-dimensional space X D fxt 2 RNg t in f1 TgAssume a noisy ICA generative model for the data

Pxt j micro DZ

N xt j Ast C ordm ordfPst j micros dst (21)

where A is the mixing matrix and ordm and [ordf]iexcl1 are the observation meanand diagonal noise variance respectively The hidden source st is assumed

Pst j micros DLY

AacuteKX

frac14lklN

L micro I X DY

t j micro (23)

Pxot j micro D

ZPxt j micro dxm

DZ microZ

Qt Pxo

t j micro Pmicro

PX (25)

PX DZ Y

N x j sup1 curren D

sjcurrenj

G x j a b D ba

Dfrac14 j d D 0P

dkQ0dk

dKiexcl1K (211)

Qmicrodmicro

Qmicro

Pmicro

Qmicro

dmicro

iexcl log PX (212)

log PX cedilZ

Qmicro X

ZQmicro log

Pmicro

Qmicrodmicro (213)

poundY

Qfrac14lY

QAacutelkl Qmacrlkl

ZQmicro log

Pmicro

Qmicrodmicro

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Qregdreg

Qordm logPordm

Qordmdordm C

ZQordf log

Qordfdordf (215)

Qxmt j xo

Qmicro

shy 05

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

Pst j micros DLY

AacuteKX

frac14lklN

L micro I X DY

t j micro (23)

Pxot j micro D

ZPxt j micro dxm

DZ microZ

Qt Pxo

t j micro Pmicro

PX (25)

PX DZ Y

N x j sup1 curren D

sjcurrenj

G x j a b D ba

Dfrac14 j d D 0P

dkQ0dk

dKiexcl1K (211)

Qmicrodmicro

Qmicro

Pmicro

Qmicro

dmicro

iexcl log PX (212)

log PX cedilZ

Qmicro X

ZQmicro log

Pmicro

Qmicrodmicro (213)

poundY

Qfrac14lY

QAacutelkl Qmacrlkl

ZQmicro log

Pmicro

Qmicrodmicro

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Qregdreg

Qordm logPordm

Qordmdordm C

ZQordf log

Qordfdordf (215)

Qxmt j xo

Qmicro

shy 05

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

N x j sup1 curren D

sjcurrenj

G x j a b D ba

Dfrac14 j d D 0P

dkQ0dk

dKiexcl1K (211)

Qmicrodmicro

Qmicro

Pmicro

Qmicro

dmicro

iexcl log PX (212)

log PX cedilZ

Qmicro X

ZQmicro log

Pmicro

Qmicrodmicro (213)

poundY

Qfrac14lY

QAacutelkl Qmacrlkl

ZQmicro log

Pmicro

Qmicrodmicro

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Qregdreg

Qordm logPordm

Qordmdordm C

ZQordf log

Qordfdordf (215)

Qxmt j xo

Qmicro

shy 05

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

poundY

Qfrac14lY

QAacutelkl Qmacrlkl

ZQmicro log

Pmicro

Qmicrodmicro

ZQfrac14l log

Pfrac14l

Qfrac14ldfrac14l

ZQAacutelkl

logPAacutelkl

QAacutelkl

dAacutelklC

ZQmacrlkl

logPmacrlkl

Qmacrlkl

dmacrlkl

QAQreg logPA j reg

QAdA dreg C

ZQreg log

Qregdreg

Qordm logPordm

Qordmdordm C

ZQordf log

Qordfdordf (215)

Qxmt j xo

Qmicro

shy 05

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

shy 05

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

Pst DLY

AacuteX

cent cent centX

macrk D

Bmacr1k1

macrLkL

t j st microX

D logX

frac14k

cedilX

Qkt logZ

Qkt logfrac14k

Qkt (219)

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

microZQskt log Pxo

t j skt micro dskt

Qsktdskt

Qkt logfrac14k

Qkt (220)

3 Learning Rules

Qmicro nordmX

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

Qordm DY

t ontP

Similarly

Qordf DY

nG 9n j a9n b9n

a9n D ao9n C12

b9n D bo9n C 12

currenAncent D

Bhreg1i

hregLi

CA C h9niX

Qkthsktsgtkti

sup1Ancent DAacute

Qkthsgtkti

currenAncent

iexcl1 (34)

Qreg DY

aregl D aoregl C N2

bregl D boregl C12

nli (35)

Qkt (36)

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

j amacrlkl bmacrlkl

currenskt D

Bhmacr1k1 i

hmacrLkLi

Bo1t91

currensktsup1skt D

hmacrLkL AacuteLkLi

Bo1t91

CA xt iexcl ordm

+ (39)

PklDk means picking

ont Draquo

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

EX Qmicro DX

log zt CZ

Qmicro logPmicro

Qmicro dmicro (312)

4 Missing Data

xot D [A]o

t cent st (41)

t j xot The

Qxmt j xo

Zplusmnxm

t iexcl xmktQxm

kt j xot k dxm

kt (42)

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

Qxmkt j xo

t k DZ

Qmicro

t Qskt dskt dmicro

t (44)

currenxmkt

t (45)

kt j xot k

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

for hxmt j xo

Qxmt j xo

Qmicro

Qst DX

hst j xot i D

ZstQst dst D

Qktsup1skt (48)

hxmt j xo

t i DZ

xmt Qxm

t j xot dxm

Qktsup1xmkt D [A]m

t (49)

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

5 Clusters of ICA

Pxt j 2 DX

cPct D c j frac12

ZN xt j Acsc

s dsct (51)

Q2 D Qfrac12Y

cQmicro c (52)

ZQ2 log

Q2d2 D

ZQfrac12 log

Pfrac12

Qfrac12dfrac12

ZQmicro c log

Pmicro c

log Pxot j 2 cedil

cQct log

QctQkct

poundmicroZ

Qsckt log Pxo

Qsckt log

N sckt j Aacutec

k macrck

QctQkct log

frac14 ck

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

the same only withP

t replaced byP

tQct (55)

dy D doy CX

tIyt D y (66)

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

DdyPy dy

D log PxTC1 j Xy M

D log ZTC1 (611)

7 Experiment

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

1 2 3 4 5 6 7shy 2000

shy 1900

shy 1800

shy 1700

shy 1600

shy 1500

8 Conclusion

References

8 Conclusion

References

8 Conclusion

References

variational bayesian learning of ica with missing datakwchan/publications/nc03-mica.pdfla jolla, ca...

Documents

yen-lin lee and truong nguyen ece dept., ucsd, la jolla, ca...

citywestetnssecondclass.weebly.com · web viewthere are 9...

imputation of missing data under missing not at … ·...

ersity of california la jolla, ca 92093 · the regents of...

missing disinflation and missing inflation: a var...

missing secret

screening the data tedious but essential!. missing data...

la jolla, california 92093-0990 - environment, health ......

list of persons reported missing/traced...

missing mittens

nc03 completo

missing information: missing sasid’s missing disabilities...

missing money and missing markets: reliability, capacity...

missing data missing data methods in ml multiple...

gradsko vijeĆe grada pule...program gradnje 2019. grafi...

orona 3g aesthetic · 2016. 3. 24. · packs reference...

holcims draft eis - missing data, missing analysis missing...

dmissing missing missing missing after decismissing ... ·...

amber silver blue endangered missing persons...endangered...

missing links, missing markets: internal exchanges