Subspace Clustering with Missing Data

Lianli Liu, Dejiao Zhang, Jiabei Zheng (authors listed in alphabetical order)

Abstract

Subspace clustering with missing data can be seen as the combination of subspace clustering and low-rank matrix completion, and it is essentially equivalent to high-rank matrix completion under the assumption that the columns of the matrix X ∈ R^{d×N} belong to a union of subspaces. It is a challenging problem, both in terms of computation and in terms of inference. In this report, we study two efficient algorithms proposed for solving this problem, the EM-type algorithm of [1] and a k-means style algorithm, k-GROUSE [2], and implement both algorithms on simulated and real datasets. We also give a brief description of recently developed theory [1] on the sampling complexity of subspace clustering, which is a substantial improvement over the previously existing theoretical analysis [3] that requires an impractically large sample size, e.g. the number of observed vectors per subspace must be super-polynomial in the dimension d.

1 Problem Statement

Consider a matrix X ∈ R^{d×N}, with entries missing at random, whose columns lie in the union of at most K unknown subspaces of R^d, each of dimension at most r_k < d. The goal of subspace clustering is to infer the underlying K subspaces from the partially observed matrix X_Ω, where Ω denotes the indices of the observed entries, and to cluster the columns of X_Ω into groups that lie in the same subspaces.

2 Sampling Complexity for Generic Subspace Clustering and EM Algorithm

2.1 Sampling Complexity

[1] shows that subspace clustering is possible without the impractically large sample size required by previous work [3]. The main assumptions used in the theoretical analysis of [1] are different from those arising in traditional low-rank matrix completion, e.g. [3]. It assumes that the columns of X are drawn from a non-atomic distribution supported on a union of low-dimensional generic subspaces, and that the entries of each column are missing uniformly at random. Before stating the main result of [1], we first provide the following definition.

Definition 1. We denote the set of d × n matrices with rank r by M(r, d × n). A generic (d × n) matrix of rank r is a continuous M(r, d × n)-valued random variable. We say a subspace S is generic if a matrix whose columns are drawn i.i.d. according to a non-atomic distribution with support on S is generic a.s.

The key assumptions of this method are the following:

• A1. The columns of the d × N data matrix X are drawn according to a non-atomic distribution with support on the union of at most K generic subspaces. The subspaces, denoted by S = {S_k | k = 0, . . . , K − 1}, each have rank exactly r < d.

• A2. The probability that a column is drawn from subspace k is ρ_k. Let ρ* be a lower bound on min_k ρ_k.


• A3. We observe X only on a set of entries Ω and denote the observation X_Ω. Each entry of X is observed independently with probability p.

We now state the main result of [1], which implies that if we observe at least on the order of d^{r+1}(log(d/r) + log K) columns and at least O(r log² d) entries in each column, then identification of the subspaces is possible with high probability. Although by assumption A1 the rank of each subspace is exactly r, this result can be generalized to a relaxed version in which the dimension of each subspace is only upper bounded by r.

Theorem 1. [1] Suppose A1-A3 hold. Let ε > 0 be given. Assume the number of subspaces K ≤ (ε/6) e^{d/4}, the total number of columns N ≥ (2d + 4M)/ρ*, and

p ≥ (1/d) · 128 µ_1² r β_0 log²(2d),   β_0 = sqrt( 1 + log((6K/ε) · 12 log(d)) / (2 log(2d)) ),

M = ( de/(r + 1) )^{r+1} [ (r + 1) log( de/(r + 1) ) + log( 8K/ε ) ]   (1)

where µ_1² := max_k (d²/r) ‖U_k V_k^*‖_∞², U_k Σ_k V_k^* is the singular value decomposition of X^[k], and X^[k] denotes the columns of X corresponding to the kth subspace. Then with probability at least 1 − ε, S can be uniquely determined from X_Ω.
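As a rough numerical illustration, the orders quoted above can be evaluated at the parameters used in our simulations (d = 100, r = 5, K = 4); the short sketch below (our own, with constants ignored) only computes the quantities d^{r+1}(log(d/r) + log K) and r log² d.

```python
import numpy as np

# Order-of-magnitude sample requirements from the discussion above,
# evaluated at the simulation parameters of Section 4; constants ignored.
d, r, K = 100, 5, 4

columns_per_subspace = d**(r + 1) * (np.log(d / r) + np.log(K))  # ~ d^(r+1)(log(d/r) + log K)
entries_per_column = r * np.log(d)**2                            # ~ O(r log^2 d)

print(f"columns needed (order of magnitude): {columns_per_subspace:.2e}")
print(f"observed entries per column (order): {entries_per_column:.1f}")
```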

2.2 EM Algorithm

In [4], computing the principal subspaces of a set of data vectors is formulated in a probabilistic framework and obtained by a maximum-likelihood method. Here we use the same model but extend it to the case of missing data:

[ x_i^o ; x_i^m ] = Σ_{k=1}^K 1_{z_i = k} ( [ W_k^{o_i} ; W_k^{m_i} ] y_i + [ µ_k^{o_i} ; µ_k^{m_i} ] + η_i )   (2)

where z_i ∼ ρ takes values in {1, . . . , K}, independently of y_i ∼ N(0, I); W_k is a d × r matrix whose span is S_k, and η_i | z_i ∼ N(0, σ²_{z_i} I) is the noise in the z_i-th subspace. To find the maximum likelihood estimate of θ = {W, µ, ρ, σ²}, i.e.

θ = argmax_θ l(θ; x^o) = argmax_θ Σ_{i=1}^N log( Σ_{k=1}^K ρ_k P(x_i^o | z_i = k; θ) ),   (3)

an EM algorithm is proposed, where X^o := {x_i^o}_{i=1}^N is the observed data and X^m := {x_i^m}_{i=1}^N, Y := {y_i}_{i=1}^N, Z := {z_i}_{i=1}^N are hidden variables. The iterates of the algorithm are as follows, with a detailed derivation provided in Appendix A.

W_k = [ Σ_{i=1}^N p_{i,k} E_k⟨x_i y_i^T⟩ − (Σ_{i=1}^N p_{i,k} E_k⟨x_i⟩)(Σ_{i=1}^N p_{i,k} E_k⟨y_i⟩^T) / Σ_{i=1}^N p_{i,k} ]
      [ Σ_{i=1}^N p_{i,k} E_k⟨y_i y_i^T⟩ − (Σ_{i=1}^N p_{i,k} E_k⟨y_i⟩)(Σ_{i=1}^N p_{i,k} E_k⟨y_i⟩^T) / Σ_{i=1}^N p_{i,k} ]^{-1}   (4)

µ_k = Σ_{i=1}^N p_{i,k} ( E_k⟨x_i⟩ − W_k E_k⟨y_i⟩ ) / Σ_{i=1}^N p_{i,k}   (5)

σ_k² = 1/(d Σ_{i=1}^N p_{i,k}) · Σ_{i=1}^N p_{i,k} [ tr(E_k⟨x_i x_i^T⟩) − 2 µ_k^T E_k⟨x_i⟩ + µ_k^T µ_k
        − 2 tr(E_k⟨y_i x_i^T⟩ W_k) + 2 µ_k^T W_k E_k⟨y_i⟩ + tr(E_k⟨y_i y_i^T⟩ W_k^T W_k) ]   (6)

ρ_k = (1/N) Σ_{i=1}^N p_{i,k},   p_{i,k} := P_{z_i | x_i^o, θ}(k) = ρ_k P_{x_i^o | z_i = k, θ}(x_i^o) / Σ_{j=1}^K ρ_j P_{x_i^o | z_i = j, θ}(x_i^o)   (7)

where E_k⟨·⟩ denotes the conditional expectation E[· | x_i^o, z_i = k, θ]. This conditional expectation can be easily calculated from the corresponding conditional distribution. Denote M_k := σ² I + W_k^{o_i T} W_k^{o_i}; then by the Gauss-Markov theorem,

[ x_i^m ; y_i ] | x_i^o, z_i = k, θ ∼ N( [ µ_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T} (x_i^o − µ_k^{o_i}) ; M_k^{-1} W_k^{o_i T} (x_i^o − µ_k^{o_i}) ],
                                         σ² [ I + W_k^{m_i} M_k^{-1} W_k^{m_i T},  W_k^{m_i} M_k^{-1} ;  M_k^{-1} W_k^{m_i T},  M_k^{-1} ] )   (8)

Detailed derivation is provided in Appendix B.
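For reference, the conditional moments in (8) are cheap to evaluate column by column. The following numpy sketch is our own helper with hypothetical argument names; it computes the conditional mean and the diagonal covariance blocks for a single column and one subspace.

```python
import numpy as np

def conditional_moments(W, mu, sigma2, x, obs_mask):
    """Conditional moments of [x_m; y] given x_o for one subspace, per Eq. (8).

    W: (d, r) subspace matrix, mu: (d,) mean, sigma2: noise variance,
    x: (d,) data column (missing entries may hold any value),
    obs_mask: boolean (d,) array indicating observed entries.
    """
    Wo, Wm = W[obs_mask], W[~obs_mask]
    mu_o, mu_m = mu[obs_mask], mu[~obs_mask]
    r = W.shape[1]

    M = sigma2 * np.eye(r) + Wo.T @ Wo                      # M_k = sigma^2 I + W_o^T W_o
    Ey = np.linalg.solve(M, Wo.T @ (x[obs_mask] - mu_o))    # E[y | x_o, z = k]
    Exm = mu_m + Wm @ Ey                                    # E[x_m | x_o, z = k]

    # Diagonal covariance blocks of Eq. (8); the cross block is sigma2 * Wm @ Minv.
    Minv = np.linalg.inv(M)
    cov_y = sigma2 * Minv
    cov_xm = sigma2 * (np.eye(Wm.shape[0]) + Wm @ Minv @ Wm.T)
    return Exm, Ey, cov_xm, cov_y
```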

3 Projection Residual Based K-MEANS Form Algorithm

[2] proposes a form of the k-means algorithm adapted to clustering K subspaces. It alternates between assigning vectors to subspaces based on their projection residuals and updating each subspace by performing an incremental gradient descent along a geodesic curve of the Grassmannian manifold³ (GROUSE [6]).

3.1 Subspace Assignment

For subspace assignment, [2] proves that an incomplete data vector can be assigned to the correct subspace with high probability if the angles between the vector and the various subspaces satisfy some mild conditions.

For each data vector, when there are no missing entries, it is natural to use the decision rule

‖v − P_{S_i} v‖₂² < ‖v − P_{S_j} v‖₂²   ∀ j ≠ i, with i, j ∈ {0, . . . , K − 1},   (9)

where P_{S_i} is the projection operator onto subspace S_i, to decide whether v should be assigned to S_i. Now suppose we only observe a subset of the entries of v, denoted v_Ω, where Ω ⊂ {1, . . . , d}, |Ω| = m < d, and define the projection operator restricted to Ω as

P_{S_Ω} = U_Ω (U_Ω^T U_Ω)^{-1} U_Ω^T.   (10)

In [7], the authors prove that U_Ω^T U_Ω is invertible w.h.p. as long as the assumptions of Theorem 2 are satisfied.
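A minimal numpy sketch of the restricted projection residual ‖v_Ω − P_{S_Ω} v_Ω‖₂² used in the assignment rule (the helper name and signature are ours):

```python
import numpy as np

def projection_residual(U, v, omega):
    """Squared residual of the partially observed vector v onto span(U),
    restricted to the observed index set omega (cf. Eq. (10)).

    U: (d, r) orthonormal basis of S, v: (d,) vector, omega: integer index array.
    """
    U_omega = U[omega]              # rows of U at the observed indices
    v_omega = v[omega]
    # least-squares weights w = (U_omega^T U_omega)^{-1} U_omega^T v_omega
    w, *_ = np.linalg.lstsq(U_omega, v_omega, rcond=None)
    res = v_omega - U_omega @ w     # v_Omega - P_{S_Omega} v_Omega
    return float(res @ res)
```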

For each k, let U_k ∈ R^{d×r_k} be a matrix whose orthonormal columns span the r_k-dimensional subspace S_k. A natural question is under what conditions we can use a decision rule similar to (9) for subspace assignment in the incomplete-data case. To answer this question, we first restate a theorem from [2] showing that the projection residual concentrates around that of the full-data case, up to a scaling factor, provided the sample size m satisfies a certain condition. Let v = x + y, where x ∈ S and y ∈ S^⊥.

Theorem 2. [2]⁴ Let δ > 0 and m ≥ (8/3) µ(S) r log(2r/δ). Then with probability at least 1 − 3δ,

[ m(1 − α) − r µ(S) (1 + β)² / (1 − γ) ] / d · ‖v − P_S v‖₂² ≤ ‖v_Ω − P_{S_Ω} v_Ω‖₂²,   (11)

and with probability at least 1 − δ,

‖v_Ω − P_{S_Ω} v_Ω‖₂² ≤ (1 + α) (m/d) ‖v − P_S v‖₂²,   (12)

where α = sqrt( (2 µ²(y)/m) log(1/δ) ), β = sqrt( 2 µ(y) log(1/δ) ) and γ = sqrt( (8 r µ(S)/(3m)) log(2r/δ) ).

³ The Grassmannian is a compact Riemannian manifold, and its geodesics can be explicitly computed [5].
⁴ In this theorem, µ(S) and µ(y) are the coherence parameters of the subspace S and the vector y; here we follow the definitions proposed in [8], µ(S) := (d/r) max_j ‖P_S e_j‖₂² and µ(y) := d ‖y‖_∞² / ‖y‖₂².
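The coherence parameters in footnote 4 and the sample-size condition of Theorem 2 can be evaluated directly; a short sketch with our own helper names:

```python
import numpy as np

def coherence_subspace(U):
    """mu(S) = (d/r) * max_j ||P_S e_j||^2 for an orthonormal basis U (d x r)."""
    d, r = U.shape
    row_norms_sq = np.sum(U**2, axis=1)      # ||P_S e_j||^2 = ||U^T e_j||^2
    return d / r * row_norms_sq.max()

def coherence_vector(y):
    """mu(y) = d * ||y||_inf^2 / ||y||_2^2."""
    d = y.size
    return d * np.max(np.abs(y))**2 / (y @ y)

def min_samples(U, delta=0.1):
    """Sample-size condition of Theorem 2: m >= (8/3) mu(S) r log(2r/delta)."""
    r = U.shape[1]
    return 8.0 / 3.0 * coherence_subspace(U) * r * np.log(2 * r / delta)
```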


Next we state the requirement on the angles between the vector and the subspaces under which the incomplete data vector can be assigned to the correct subspace w.h.p. Define the angle between the vector v and its projection onto the r_k-dimensional subspace S_k, k = 0, . . . , K − 1, as θ_k = sin^{-1}( ‖v − P_{S_k} v‖₂ / ‖v‖₂ ), and let

C_k(m) = [ m(1 − α_k) − r_k µ(S_k) (1 + β_k)² / (1 − γ_k) ] / [ m(1 + α_0) ],

where α_k, β_k and γ_k are defined as in Theorem 2 for the corresponding r_k. Notice that C_k(m) < 1 and C_k(m) → 1 as m → ∞. Without loss of generality, we assume θ_0 < θ_k, ∀ k ≠ 0.

Corollary 1. [2] Let m ≥ (8/3) max_{i ≠ 0} ( r_i µ(S_i) log(2r_i/δ) ) for fixed δ > 0. Assume that

sin²(θ_0) ≤ C_i(m) sin²(θ_i),   ∀ i ≠ 0.

Then with probability at least 1 − 4(K − 1)δ,

‖v_Ω − P_{S_0Ω} v_Ω‖₂² < ‖v_Ω − P_{S_iΩ} v_Ω‖₂²,   ∀ i ≠ 0.

Proof. A detailed proof is provided in Appendix C.

3.2 Subspace Update

GROUSE [6] is a very efficient algorithm for single-subspace estimation, although a full theoretical justification of the method is still unavailable⁵. k-GROUSE [2] can be seen as a combination of k-subspaces and GROUSE for multiple-subspace estimation, and is detailed in Algorithm 1.

Algorithm 1 k-subspaces with the GROUSE
Require: A collection of vectors v_Ω(t), t = 1, . . . , T, and the observed indices Ω(t). An integer number of subspaces K and dimensions r_k, k = 1, . . . , K. A maximum number of iterations, maxIter. A fixed step size η.
Initialize subspaces: Zero-fill the vectors and collect them in a matrix V. Initialize the K subspace estimates using probabilistic farthest insertion.
Calculate orthonormal bases U_k, k = 1, . . . , K. Let Q_{kΩ} = (U_{kΩ}^T U_{kΩ})^{-1} U_{kΩ}^T.
for i = 1, . . . , maxIter do
  Select a vector at random, v_Ω.
  for k = 1, . . . , K do
    Calculate the weights of the projection onto the kth subspace: w(k) = Q_{kΩ} v_Ω.
    Calculate the projection residual to the kth subspace: r_Ω(k) = v_Ω − U_{kΩ} w(k).
  end for
  Select the minimum residual: k = argmin_k ‖r_Ω(k)‖₂². Set w = w(k). Define p = U_k w and r_{Ω^c} = 0, i.e., r is the zero-filled r_Ω(k).
  Update the subspace:

    U_k := U_k + [ (cos(ση) − 1) p/‖p‖ + sin(ση) r/‖r‖ ] w^T/‖w‖,   (13)

  where σ = ‖r‖ ‖p‖.
end for
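A sketch of one pass of Algorithm 1 for a single partially observed vector is given below; this is our own simplified rendering (fixed step size, no initialization logic), not the authors' implementation.

```python
import numpy as np

def kgrouse_step(U_list, v, omega, eta):
    """One assignment + update step of Algorithm 1 for one vector.

    U_list: list of (d, r_k) orthonormal bases (updated in place),
    v: (d,) vector, omega: observed indices, eta: step size.
    """
    d = v.shape[0]
    # Subspace assignment: pick the subspace with the smallest restricted residual.
    best, best_res, best_w = None, np.inf, None
    for k, U in enumerate(U_list):
        Uo = U[omega]
        w, *_ = np.linalg.lstsq(Uo, v[omega], rcond=None)    # w(k) = Q_kOmega v_Omega
        res = v[omega] - Uo @ w                               # r_Omega(k)
        if res @ res < best_res:
            best, best_res, best_w = k, float(res @ res), w

    # GROUSE update of the selected subspace, Eq. (13).
    U, w = U_list[best], best_w
    p = U @ w                                  # predicted full vector
    r_full = np.zeros(d)
    r_full[omega] = v[omega] - U[omega] @ w    # zero-filled residual
    sigma = np.linalg.norm(r_full) * np.linalg.norm(p)
    if np.linalg.norm(r_full) > 0 and np.linalg.norm(w) > 0:
        U_list[best] = U + ((np.cos(sigma * eta) - 1) * p / np.linalg.norm(p)
                            + np.sin(sigma * eta) * r_full / np.linalg.norm(r_full)
                            )[:, None] @ (w / np.linalg.norm(w))[None, :]
    return best
```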

4 Experiment on simulated data

In this section we explore the performance of k-GROUSE and the EM algorithm on simulated data. Before presenting the results, we first describe two key points of the implementation of the algorithms. First, we use a k-means++-like method to initialize the subspaces for both k-GROUSE and the EM algorithm. Specifically, we randomly pick a point x_0 ∈ X, select its q⁶ nearest neighbors, and compute the best S_0 with respect to these selected q points. Next, we recursively initialize S_j by first randomly picking a point x with probability proportional to min[dist(x, S_0), . . . , dist(x, S_{j−1})], and then finding the best-fit S_j of its q-neighborhood. Second, the clustering error is calculated by examining the cluster assignments of each set of vectors that come from the same underlying subspace (which is known). For each such set, we take the cluster ID that owns the largest number of its vectors to be the true one, and count the remaining vectors as incorrectly clustered.

⁵ Local convergence of GROUSE is established in [9].
⁶ q is a nonnegative parameter; usually we set q = 2 max_k r_k.
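A small sketch of this error computation, assuming integer cluster IDs and known ground-truth labels (the helper is ours):

```python
import numpy as np

def clustering_error(assigned, true_labels):
    """Proportion of misclassified points: for each true subspace, the cluster ID
    holding most of its vectors is taken as correct; all others count as errors.

    assigned, true_labels: integer arrays of length N.
    """
    assigned = np.asarray(assigned)
    true_labels = np.asarray(true_labels)
    errors = 0
    for s in np.unique(true_labels):
        ids = assigned[true_labels == s]
        majority = np.bincount(ids).max()    # size of the dominant cluster
        errors += ids.size - majority
    return errors / true_labels.size
```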

4.1 EM Algorithm

The first experiment with the EM algorithm deals with simulated data generated by the Gaussian mixture model of Section 2. We used d = 100, K = 4 and r = 5 for this part. For the simulation, each of the K subspaces is the span of an orthonormal basis generated from r i.i.d. standard Gaussian d-dimensional vectors. The performance of the EM algorithm strongly depends on the initialization. With a poor initial estimate of W and µ, the estimated θ gets stuck at a poor local optimum of l(θ; x^o). Fig. 1(a) gives an example of processing the same X_Ω with different initial estimates, where we observe an obvious difference in performance.
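For concreteness, the simulated data can be generated as follows; this is a sketch under our own conventions (zero subspace means, uniformly chosen observed entries), not the exact script used for the figures.

```python
import numpy as np

def simulate_union_of_subspaces(d=100, K=4, r=5, Nk=200, n_obs=24, sigma2=1e-4, seed=0):
    """Generate partially observed columns from a union of K random r-dimensional
    subspaces of R^d, with Nk columns per subspace and n_obs observed entries each."""
    rng = np.random.default_rng(seed)
    X, labels, mask = [], [], []
    for k in range(K):
        U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal basis of S_k
        Y = rng.standard_normal((r, Nk))                   # coefficients y_i ~ N(0, I)
        Xk = U @ Y + np.sqrt(sigma2) * rng.standard_normal((d, Nk))
        for _ in range(Nk):
            m = np.zeros(d, dtype=bool)
            m[rng.choice(d, size=n_obs, replace=False)] = True
            mask.append(m)
        X.append(Xk)
        labels += [k] * Nk
    return np.hstack(X), np.array(labels), np.array(mask).T   # mask: d x (K*Nk)
```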

We ran 50 trials for each setting of |ω|, N_k and σ², where |ω| is the number of observed entries in each column, N_k is the number of columns per subspace and σ² is the noise level. The stopping criterion is l(θ^{(t+1)}; x^o) − l(θ^{(t)}; x^o) < 10^{-4} l(θ^{(t)}; x^o). The results are summarized in Fig. 1(b)-(d). The difference between our results and those reported in [1] mainly comes from the different initialization strategies. Generally, larger |ω| and N_k and smaller σ² give fewer misclassified points, which is reasonable. In some cases, all misclassified points come from getting stuck at a poor local optimum.


Figure 1: (a) Good initial estimate: converges after 83 iterations, label correct rate = 100%. Bad initial estimate: converges after 395 iterations, label correct rate = 45.6%. (N_k = 200, |ω| = 24, σ² = 10^{-4}.) (b)-(d) Proportion of misclassified points: (b) as a function of N_k with |ω| = 24, σ² = 10^{-4}; (c) as a function of |ω| with σ² = 10^{-4}, N_k = 300; (d) as a function of σ² with |ω| = 24, N_k = 210.


4.2 k-GROUSE

In this section, we explore the performance of k-GROUSE in two simulation scenarios. We set the ambient dimension d = 40, and the data matrix X consists of 80 vectors per subspace. In the first case, we let the number of subspaces be K_1 = 4 with r = 5 for each subspace, i.e., the sum of the dimensions of the subspaces is R = Σ_{k=1}^K r_k < d. In the second case, we set the number of subspaces to K_2 = 5 and make one of them dependent on the other four. From Fig. 2(a) we can see that k-GROUSE performs well as long as the number of observations is a little more than r log² d, which is within the order predicted by the theory. However, the performance of k-GROUSE also depends on the initialization. Although the k-means++ initialization can significantly improve the convergence of k-GROUSE, it can still converge to a poor local solution in the worst case.


Figure 2: (a) Performance of k-GROUSE with independent and dependent subspaces, obtained over 50 trials; (b) k-GROUSE vs EM Algorithm, obtained over 50 trials

4.3 k-GROUSE vs EM Algorithm

Now we compare the performance of k-GROUSE and the EM algorithm on simulated data. In this setting, we set K = 4, r = 5, d = 100. From Fig. 2(b) we can see that both k-GROUSE and the EM algorithm perform well in this situation, while k-GROUSE is more efficient. However, as we will see in the next section, the performance gap between the two becomes obvious when we apply them to worse-conditioned real data: when the condition number of the data matrix X is large, k-GROUSE can perform poorly while the EM algorithm still works under the same conditions. In traditional matrix completion, the required sampling density can depend heavily on the condition number of the matrix, and k-GROUSE seems to inherit this dependence.

5 Experiment on real data

5.1 EM algorithm

We test the algorithm on real data by applying it to image compression. A 264×536 image is used, the same as the one used for mixtures of probabilistic PCA [4]. As in [4], the image was segmented into 8×8 non-overlapping blocks, giving a dataset of 2211 64-dimensional vectors. In the experiment, we set the number of subspaces to 3 and test the performance of the algorithm under different missing rates (number of missing pixels per image block) and compression ratios. To make sure the resulting coded image has the same bit rate as a conventional SVD-coded image, after the EM algorithm converges the image coding is performed in a hard fashion, i.e. the coding coefficients of each image block are estimated using only the subspace it has the largest probability of belonging to.
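The block transformation is a simple reshape; a sketch (our own helper) that maps the 264×536 image to the 64×2211 data matrix:

```python
import numpy as np

def image_to_blocks(img, b=8):
    """Split an image into non-overlapping b x b blocks, one column per block.
    For the 264 x 536 image used here this yields 33 * 67 = 2211 vectors of length 64."""
    H, W = img.shape
    blocks = (img[:H - H % b, :W - W % b]
              .reshape(H // b, b, W // b, b)
              .transpose(0, 2, 1, 3)
              .reshape(-1, b * b))
    return blocks.T          # d x N data matrix, d = b*b
```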

We test the algorithm under different missing rates and compression ratios. An initialization scheme slightly different from that of the simulated experiments is used: after setting all missing data to zero, a simple k-means clustering is performed on the dataset. µ_k, W_k and σ_k of each subspace are initialized from the kth data cluster {x_i ∈ R^d} using probabilistic PCA [4], where µ_k is initialized by the sample mean. Denoting the singular value decomposition of the sample covariance matrix by S_k = U Λ V, W_k ∈ R^{d×r} is initialized as

W_k = U_r (Λ_r − σ² I)^{1/2}   (14)

where σ² is the initial value σ_k² = (1/(d − r)) Σ_{j=r+1}^d λ_j, U_r ∈ R^{d×r} consists of the leading r left singular vectors of U, and Λ_r ∈ R^{r×r} is the corresponding diagonal matrix of singular values.
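A compact sketch of this initialization for one cluster, following Eq. (14) (the helper name is ours, and it assumes the leading r eigenvalues exceed the initial σ²):

```python
import numpy as np

def ppca_init(Xk, r):
    """Probabilistic-PCA initialization of (mu_k, W_k, sigma_k^2) from the
    vectors Xk (d x n) of one k-means cluster, per Eq. (14)."""
    d = Xk.shape[0]
    mu = Xk.mean(axis=1)
    S = np.cov(Xk)                               # sample covariance, d x d
    U, lam, _ = np.linalg.svd(S)                 # S = U diag(lam) U^T
    sigma2 = lam[r:].sum() / (d - r)             # average of the trailing eigenvalues
    W = U[:, :r] @ np.diag(np.sqrt(lam[:r] - sigma2))   # W_k = U_r (Lambda_r - sigma^2 I)^(1/2)
    return mu, W, sigma2
```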

Sample images reconstructed under different missing rates (equivalently, the number of observed pixels per block ω) and compression ratios (the number of components r in each subspace) are shown in Fig. 3. The mean squared error (MSE) between the compressed image and the original image is plotted in Fig. 5. As the compression ratio decreases, the quality of the image improves significantly, as can be seen from Fig. 3, and the mean squared error also decreases. For different missing rates, the images appear visually the same: no significant difference is observed between the missing-data case and the complete-data case (though the mean squared error decreases slightly). This is consistent with the theory in [1]: as long as certain conditions are met and a reasonable number of data points are observed, with high probability we can recover the true subspaces from an incomplete dataset.

(a) ω = 14, r = 4  (b) ω = 64, r = 4  (c) ω = 14, r = 12
(d) ω = 64, r = 12  (e) ω = 14, r = 32  (f) ω = 64, r = 32

Figure 3: Image compression using EM algorithm

5.2 k-GROUSE algorithm

We also test the k-GROUSE algorithm on image compression. We use the same block transformation scheme and number of subspaces as for the EM algorithm. Sample compressed images are shown in Fig. 4.

6 Conclusion

In this project, we learned about different subspace learning techniques, especially in the missing-data case. We studied the theoretical requirements for recovering subspaces when data are incomplete, and we practiced implementing the EM algorithm and k-GROUSE.

7 Contribution

Dejiao Zhang was in charge of the abstract, Sections 1, 2.1, 3, 4.2, 4.3 and Appendix C. Jiabei Zheng rederived the EM algorithm for the Gaussian mixture model (Appendix A), did coding and made the plots in Section 4.1. Lianli Liu rederived the conditional distribution (Appendix B), carried out the experiments on real data (Section 5) and debugged the EM algorithm.


(a) missing rate 30%, number of modes 32 (b) missing rate 30%, number of modes 50

Figure 4: Image compression using k-GROUSE

(a) MSE under different missing rates, mode = 12  (b) MSE under different compression ratios, missing rate = 78%

Figure 5: Plot of the mean squared error

References

[1] D. Pimentel and R. Nowak, "On the sample complexity of subspace clustering with missing data."
[2] L. Balzano, A. Szlam, B. Recht, and R. Nowak, "k-subspaces with missing data," in Statistical Signal Processing Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612-615.
[3] B. Eriksson, L. Balzano, and R. Nowak, "High-rank matrix completion and subspace clustering with missing data," arXiv preprint arXiv:1112.5629, 2011.
[4] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[5] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303-353, 1998.
[6] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 704-711.
[7] L. Balzano, B. Recht, and R. Nowak, "High-dimensional matched subspace detection when data are missing," in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1638-1642.
[8] E. J. Candes and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717-772, 2009.
[9] L. Balzano and S. J. Wright, "Local convergence of an algorithm for subspace identification from partial data," arXiv preprint arXiv:1306.3391, 2013.


Appendix A

Here is the complete derivation of the EM algorithm. The complete-data probability is

Pr(x, y, z; θ) = Π_{i=1}^N Pr(x_i, y_i, z_i)
             = Π_{i=1}^N Pr(x_i | y_i, z_i) Pr(y_i | z_i) Pr(z_i)
             = Π_{i=1}^N Pr(x_i | y_i, z_i) Pr(y_i) Pr(z_i)
             = Π_{i=1}^N ρ_{z_i} φ(x_i; µ_{z_i} + W_{z_i} y_i, σ²_{z_i} I) φ(y_i; 0, I)   (15)

The complete-data log-likelihood is

l(θ; x, y, z) = Σ_{i=1}^N log( ρ_{z_i} φ(x_i; µ_{z_i} + W_{z_i} y_i, σ²_{z_i} I) φ(y_i; 0, I) )
             = Σ_{i=1}^N Σ_{k=1}^K ∆_{ik} log( ρ_k φ(x_i; µ_k + W_k y_i, σ_k² I) φ(y_i; 0, I) )   (16)

where

∆_{ik} = 1 if z_i = k, and 0 otherwise.   (17)

The E-step is

Q(θ; θ) = E[ l(θ; x, y, z) | x^o, θ ]
 = Σ_{k=1}^K Σ_{i=1}^N E[ ∆_{ik} log( ρ_k φ(x_i; µ_k + W_k y_i, σ_k² I) φ(y_i; 0, I) ) | x^o, θ ]
 = Σ_{k=1}^K Σ_{i=1}^N Σ_{l=1}^K E[ ∆_{ik} log( ρ_k φ(x_i; µ_k + W_k y_i, σ_k² I) φ(y_i; 0, I) ) | x^o, z_i = l, θ ] Pr(z_i = l | x^o, θ)
 = Σ_{k=1}^K Σ_{i=1}^N E[ log( ρ_k φ(x_i; µ_k + W_k y_i, σ_k² I) φ(y_i; 0, I) ) | x^o, z_i = k, θ ] Pr(z_i = k | x^o, θ)
 = Σ_{k=1}^K Σ_{i=1}^N p_{ik} [ log(ρ_k) − ((d + r)/2) log(2π) − (d/2) log(σ_k²)
     − (1/(2σ_k²)) E_k⟨(x_i − µ_k − W_k y_i)^T (x_i − µ_k − W_k y_i)⟩ − (1/2) E_k⟨y_i^T y_i⟩ ]
 = Σ_{k=1}^K (Σ_{i=1}^N p_{ik}) [ log(ρ_k) − ((d + r)/2) log(2π) − (d/2) log(σ_k²) − (1/(2σ_k²)) U_k − (1/2) E_k⟨y^T y⟩ ]   (18)

where

p_{ik} = Pr(z_i = k | x_i^o, θ)   (19)


E_k⟨·⟩ = E[· | x_i^o, z_i = k, θ]   (20)

E_k⟨u⟩ = Σ_{i=1}^N p_{ik} E_k⟨u_i⟩ / Σ_{i=1}^N p_{ik}   (21)

E_k⟨u^T v⟩ = Σ_{i=1}^N p_{ik} E_k⟨u_i^T v_i⟩ / Σ_{i=1}^N p_{ik}   (22)

U_k = E_k⟨x^T x⟩ + µ_k^T µ_k + tr(E_k⟨y^T y⟩ W_k^T W_k) − 2 µ_k^T E_k⟨x⟩ + 2 µ_k^T W_k E_k⟨y⟩ − 2 tr(E_k⟨y x^T⟩ W_k)   (23)

The M-step is

θ = argmax_θ Q(θ, θ)   (24)

First we derive ρ, which satisfies Σ_{k=1}^K ρ_k = 1. Using a Lagrange multiplier λ,

∂( Q + λ Σ_{k=1}^K ρ_k ) / ∂ρ_k = 0,   (25)

Σ_{i=1}^N p_{ik} + λ ρ_k = 0.   (26)

Summing over k and using Σ_{k} ρ_k = 1 and Σ_{k} Σ_{i} p_{ik} = N,

Σ_{k=1}^K Σ_{i=1}^N p_{ik} + λ = 0,   (27)

λ = −N.   (28)

Substituting back into (26), we have

ρ_k = (1/N) Σ_{i=1}^N p_{ik}.   (29)

Next we solve for σ² with W and µ fixed:

∂Q/∂σ_k² = 0  ⇒  σ_k² = U_k / d,   (30)

σ_k² = (1/d) ( E_k⟨x^T x⟩ + µ_k^T µ_k + tr(E_k⟨y^T y⟩ W_k^T W_k) − 2 µ_k^T E_k⟨x⟩ + 2 µ_k^T W_k E_k⟨y⟩ − 2 tr(E_k⟨y x^T⟩ W_k) ).   (31)

The elimination of µ is similar:

∂Q/∂µ_k = 0  ⇒  ∂U_k/∂µ_k = 0,   (32)


µk = Ek〈x〉 − WkEk〈y〉 (33)

Then we derive the expression for W. From Eq. (18) we have

∂Q/∂W_k = −(d/(2σ_k²)) ∂σ_k²/∂W_k − (1/(2σ_k²)) ∂U_k/∂W_k + (U_k/(2σ_k⁴)) ∂σ_k²/∂W_k = −(1/(2σ_k²)) ∂U_k/∂W_k = 0.   (34)

As a result,

∂Q/∂W_k = 0  ⇒  ∂U_k/∂W_k = 0.   (35)

Plugging the expression for µ_k into U_k, we get

∂/∂W_k [ E_k⟨x^T x⟩ + (E_k⟨x⟩ − W_k E_k⟨y⟩)^T (E_k⟨x⟩ − W_k E_k⟨y⟩) + tr(E_k⟨y^T y⟩ W_k^T W_k)
 − 2 (E_k⟨x⟩ − W_k E_k⟨y⟩)^T E_k⟨x⟩ + 2 (E_k⟨x⟩ − W_k E_k⟨y⟩)^T W_k E_k⟨y⟩ − 2 tr(E_k⟨y x^T⟩ W_k) ] = 0,   (36)

W_k = ( E_k⟨x y^T⟩ − E_k⟨x⟩ E_k⟨y^T⟩ ) ( E_k⟨y y^T⟩ − E_k⟨y⟩ E_k⟨y^T⟩ )^{-1}.   (37)

p_{ik} is calculated with Bayes' rule:

p_{ik} = Pr(z_i = k | x_i^o, θ)
      = ρ_k P(x_i^o | z_i = k; θ) / Σ_{j=1}^K ρ_j P(x_i^o | z_i = j; θ)
      = ρ_k φ(x_i^o; µ_k^{o_i}, W_k^{o_i} W_k^{o_i T} + σ_k² I) / Σ_{j=1}^K ρ_j φ(x_i^o; µ_j^{o_i}, W_j^{o_i} W_j^{o_i T} + σ_j² I)   (38)

Appendix B

By the model definition,

[ x_i^o ; x_i^m ; y_i ] | z_i = k, θ = [ W_k^{o_i} ; W_k^{m_i} ; I ] y_i + [ µ_k^{o_i} ; µ_k^{m_i} ; 0 ] + [ η_i^o ; η_i^m ; 0 ].   (39)

Eq. (39) is a linear transform of y_i ∼ N(0, I). By the properties of the multivariate Gaussian distribution,

[ x_i^o ; x_i^m ; y_i ] | z_i = k, θ ∼ N( [ µ_k^{o_i} ; µ_k^{m_i} ; 0 ],
   [ W_k^{o_i} W_k^{o_i T} + σ² I,  W_k^{o_i} W_k^{m_i T},  W_k^{o_i} ;
     W_k^{m_i} W_k^{o_i T},  W_k^{m_i} W_k^{m_i T} + σ² I,  W_k^{m_i} ;
     W_k^{o_i T},  W_k^{m_i T},  I ] ).   (40)

By the Bayesian Gauss-Markov theorem,

[ x_i^m ; y_i ] | x_i^o, z_i = k, θ ∼ N( µ_{my_i} + R_{myo_i} R_{o_i}^{-1} (x_i^o − µ_k^{o_i}),  R_{my_i} − R_{myo_i} R_{o_i}^{-1} R_{omy_i} )   (41)

where

µ_{my_i} = [ µ_k^{m_i} ; 0 ],   R_{myo_i} = [ W_k^{m_i} W_k^{o_i T} ; W_k^{o_i T} ],   R_{my_i} = [ W_k^{m_i} W_k^{m_i T} + σ² I,  W_k^{m_i} ;  W_k^{m_i T},  I ],   (42)

R_{o_i} = W_k^{o_i} W_k^{o_i T} + σ² I,   R_{omy_i} = [ W_k^{o_i} W_k^{m_i T}   W_k^{o_i} ].   (43)

Plugging Eq. (42) and Eq. (43) into Eq. (41),

R_{myo_i} R_{o_i}^{-1} = [ W_k^{m_i} W_k^{o_i T} ; W_k^{o_i T} ] ( W_k^{o_i} W_k^{o_i T} + σ² I )^{-1}.   (44)


By the matrix identity

(I + PQ)^{-1} P = P (I + QP)^{-1},   (45)

we have

W_k^{o_i T} ( W_k^{o_i} W_k^{o_i T} + σ² I )^{-1} = ( W_k^{o_i T} W_k^{o_i} + σ² I )^{-1} W_k^{o_i T} = M_k^{-1} W_k^{o_i T}.   (46)

Putting Eq. (46) and Eq. (44) together,

R_{myo_i} R_{o_i}^{-1} = [ W_k^{m_i} W_k^{o_i T} ; W_k^{o_i T} ] ( W_k^{o_i} W_k^{o_i T} + σ² I )^{-1} = [ W_k^{m_i} M_k^{-1} W_k^{o_i T} ; M_k^{-1} W_k^{o_i T} ].   (47)

Therefore

µ_{my_i} + R_{myo_i} R_{o_i}^{-1} (x_i^o − µ_k^{o_i}) = [ µ_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T} (x_i^o − µ_k^{o_i}) ; M_k^{-1} W_k^{o_i T} (x_i^o − µ_k^{o_i}) ],   (48)

which is exactly the formula for the mean given in the paper.

To prove the formula for the variance, by Eq. (47),

R_{myo_i} R_{o_i}^{-1} R_{omy_i} = [ W_k^{m_i} W_k^{o_i T} ; W_k^{o_i T} ] ( W_k^{o_i} W_k^{o_i T} + σ² I )^{-1} [ W_k^{o_i} W_k^{m_i T}   W_k^{o_i} ]
 = [ W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T},  W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} ;
     M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T},  M_k^{-1} W_k^{o_i T} W_k^{o_i} ].   (49)

Thus

R_{my_i} − R_{myo_i} R_{o_i}^{-1} R_{omy_i} = [ σ² I + W_k^{m_i} ( I − M_k^{-1} W_k^{o_i T} W_k^{o_i} ) W_k^{m_i T},  W_k^{m_i} ( I − M_k^{-1} W_k^{o_i T} W_k^{o_i} ) ;
                                               ( I − M_k^{-1} W_k^{o_i T} W_k^{o_i} ) W_k^{m_i T},  I − M_k^{-1} W_k^{o_i T} W_k^{o_i} ].   (50)

Notice that

M_k − W_k^{o_i T} W_k^{o_i} = σ² I  ⇒  I − M_k^{-1} W_k^{o_i T} W_k^{o_i} = σ² M_k^{-1}.   (51)

Thus Eq. (50) simplifies to

R_{my_i} − R_{myo_i} R_{o_i}^{-1} R_{omy_i} = σ² [ I + W_k^{m_i} M_k^{-1} W_k^{m_i T},  W_k^{m_i} M_k^{-1} ;  M_k^{-1} W_k^{m_i T},  M_k^{-1} ],   (52)

which is exactly the formula for the variance given in the paper.
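The closed forms (48) and (52) can also be checked numerically against direct conditioning of the joint Gaussian (40); a small sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_obs, sigma2 = 8, 3, 5, 0.1
W = rng.standard_normal((d, r))
mu = rng.standard_normal(d)
obs, mis = np.arange(n_obs), np.arange(n_obs, d)    # observed / missing indices
Wo, Wm, mu_o, mu_m = W[obs], W[mis], mu[obs], mu[mis]
x_o = rng.standard_normal(n_obs)

# Closed forms (48) and (52)
M = sigma2 * np.eye(r) + Wo.T @ Wo
Minv = np.linalg.inv(M)
mean_closed = np.concatenate([mu_m + Wm @ Minv @ Wo.T @ (x_o - mu_o),
                              Minv @ Wo.T @ (x_o - mu_o)])
cov_closed = sigma2 * np.block([[np.eye(d - n_obs) + Wm @ Minv @ Wm.T, Wm @ Minv],
                                [Minv @ Wm.T, Minv]])

# Direct conditioning of the joint Gaussian (40) on x_o
joint_cov = np.block([[Wo @ Wo.T + sigma2 * np.eye(n_obs), Wo @ Wm.T, Wo],
                      [Wm @ Wo.T, Wm @ Wm.T + sigma2 * np.eye(d - n_obs), Wm],
                      [Wo.T, Wm.T, np.eye(r)]])
R_oo = joint_cov[:n_obs, :n_obs]
R_mo = joint_cov[n_obs:, :n_obs]
R_mm = joint_cov[n_obs:, n_obs:]
mean_direct = np.concatenate([mu_m, np.zeros(r)]) + R_mo @ np.linalg.solve(R_oo, x_o - mu_o)
cov_direct = R_mm - R_mo @ np.linalg.solve(R_oo, R_mo.T)

print(np.allclose(mean_closed, mean_direct), np.allclose(cov_closed, cov_direct))
```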

Appendix C

Before we give the proof of Corollary 1, we examine the case of binary subspace assignment, i.e., K = 2. Let v ∈ R^d and S_0, S_1 ⊂ R^d with dimensions r_0, r_1 respectively. Define the angles between v and its projections onto the subspaces S_0, S_1 as θ_0, θ_1:

θ_0 = sin^{-1}( ‖v − P_{S_0} v‖₂ / ‖v‖₂ )   and   θ_1 = sin^{-1}( ‖v − P_{S_1} v‖₂ / ‖v‖₂ ).   (53)

Theorem 3. [2] Let δ > 0 and m ≥ (8/3) r_1 µ(S_1) log(2r_1/δ). Assume that

sin²(θ_0) < C(m) sin²(θ_1).

Then with probability at least 1 − 4δ,

‖v_Ω − P_{S_0Ω} v_Ω‖₂² < ‖v_Ω − P_{S_1Ω} v_Ω‖₂².

Proof. By Theorem 2 and the union bound, the following two statements hold simultaneously with probability at least 1 − 4δ:

‖v_Ω − P_{S_0Ω} v_Ω‖₂² ≤ (1 + α_0) (m/d) ‖v − P_{S_0} v‖₂²   and
[ m(1 − α_1) − r_1 µ(S_1) (1 + β_1)² / (1 − γ_1) ] / d · ‖v − P_{S_1} v‖₂² ≤ ‖v_Ω − P_{S_1Ω} v_Ω‖₂².   (54)


Hence ‖v_Ω − P_{S_0Ω} v_Ω‖₂² < ‖v_Ω − P_{S_1Ω} v_Ω‖₂² holds whenever

‖v − P_{S_0} v‖₂² < C(m) ‖v − P_{S_1} v‖₂²,

which, combining with (53), holds if and only if

sin²(θ_0) < C(m) sin²(θ_1),

which completes the proof.

Proof of Corollary 1. The result of Theorem 3 can be generalized to the situation where there are multiple subspaces S_i, i = 0, . . . , K − 1. Again, without loss of generality, we assume θ_0 < θ_i for all i ≠ 0, and define

C_i(m) = [ m(1 − α_i) − r_i µ(S_i) (1 + β_i)² / (1 − γ_i) ] / [ m(1 + α_0) ].

Then the conclusion of Corollary 1 is obtained by applying the union bound and following an argument similar to that in the proof of Theorem 3.
