TRANSCRIPT
(Source: djhsu/papers/multiview-slides.pdf)
Efficient algorithms for estimatingmulti-view mixture models
Daniel Hsu
Microsoft Research, New England
Outline
Multi-view mixture models
Multi-view method-of-moments
Some applications and open questions
Concluding remarks
Part 1. Multi-view mixture models
Multi-view mixture models
  - Unsupervised learning and mixture models
  - Multi-view mixture models
  - Complexity barriers
Multi-view method-of-moments
Some applications and open questions
Concluding remarks
Unsupervised learning
- Many modern applications of machine learning involve:
  - high-dimensional data from many diverse sources,
  - but mostly unlabeled.
- Unsupervised learning: extract useful information from this data.
  - Disentangle sub-populations in the data source.
  - Discover useful representations for downstream stages of the learning pipeline (e.g., supervised learning).
Mixture models
Simple latent variable model: the mixture model.

    h → ~x    (h hidden, ~x observed)

- h ∈ [k] := {1, 2, ..., k} (hidden);
- ~x ∈ R^d (observed);
- Pr[h = j] = w_j;  ~x | h ∼ P_h;

so ~x has the mixture distribution

    P(~x) = w_1 P_1(~x) + w_2 P_2(~x) + · · · + w_k P_k(~x).

Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.
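As a concrete illustration of sampling from such a model (a minimal sketch; the two-component Gaussian mixture and all its parameters are made up, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture with k = 2 components.
w = np.array([0.3, 0.7])        # mixing weights w_j (sum to 1)
means = np.array([-2.0, 3.0])   # component means
sigma = 0.5                     # shared component standard deviation

n = 100_000
h = rng.choice(2, size=n, p=w)     # hidden label: Pr[h = j] = w_j
x = rng.normal(means[h], sigma)    # observed ~x | h ~ P_h

# The mixture mean is the w-weighted average of the component means.
assert abs(x.mean() - w @ means) < 0.05
```

Replacing the Gaussian components with any other family P_j leaves the two-stage sampling scheme (draw h, then draw ~x from P_h) unchanged.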
Multi-view mixture models
Can we take advantage of diverse sources of information?
    h → (~x_1, ~x_2, ..., ~x_ℓ)

h ∈ [k],
~x_1 ∈ R^{d_1}, ~x_2 ∈ R^{d_2}, ..., ~x_ℓ ∈ R^{d_ℓ}.

k = # components, ℓ = # views (e.g., audio, video, text).
Multi-view mixture models
Multi-view assumption: views are conditionally independent given the component.

View 1: ~x_1 ∈ R^{d_1}    View 2: ~x_2 ∈ R^{d_2}    View 3: ~x_3 ∈ R^{d_3}

Larger k (# components): more sub-populations to disentangle.
Larger ℓ (# views): more non-redundant sources of information.
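The conditional-independence assumption is easy to exercise on synthetic data (a sketch with made-up parameters): three views share only the hidden label h, so within one component their fluctuations are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-view mixture: k = 2 components, each view in R^2.
k, d, n = 2, 2, 50_000
w = np.array([0.4, 0.6])
mu = rng.normal(size=(3, k, d))   # mu[v, j] = E[~x_v | h = j] (made up)

h = rng.choice(k, size=n, p=w)
# Each view draws its own fresh noise given h: conditional independence.
views = [mu[v, h] + 0.1 * rng.normal(size=(n, d)) for v in range(3)]

# Within component 0, view-1 and view-2 fluctuations are uncorrelated.
a = views[0][h == 0, 0] - views[0][h == 0, 0].mean()
b = views[1][h == 0, 0] - views[1][h == 0, 0].mean()
assert abs(np.mean(a * b)) < 0.01
```

Marginally (without conditioning on h) the views *are* correlated, since all of them depend on h; that residual correlation is exactly what the moment methods below exploit.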
Semi-parametric estimation task
"Parameters" of the component distributions:

    Mixing weights      w_j := Pr[h = j],                        j ∈ [k];
    Conditional means   ~µ_{v,j} := E[~x_v | h = j] ∈ R^{d_v},   j ∈ [k], v ∈ [ℓ].

Goal: estimate the mixing weights and conditional means from independent copies of (~x_1, ~x_2, ..., ~x_ℓ).

Questions:
1. How do we estimate w_j and ~µ_{v,j} without observing h?
2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
Some barriers to efficient estimation
Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).

Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if the components are well-separated (Moitra-Valiant, '10).

In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.
Making progress: Gaussian mixture model
Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, '99):

    sep := min_{i ≠ j} ‖~µ_i − ~µ_j‖ / max{σ_i, σ_j}.

- sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00)
- sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05)
  - Also works for mixtures of log-concave distributions.
- No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10)
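The "PCA to k dimensions" step can be sketched as follows (a toy instance with made-up, well-separated parameters; this illustrates the spirit of the cited methods, not their exact algorithms): project onto the top-k singular directions, which approximately preserves the span of the component means, then cluster in R^k.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical well-separated mixture: k = 2 components in d = 50 dims.
d, k, n = 50, 2, 2000
means = np.zeros((k, d))
means[0, 0], means[1, 0] = -10.0, 10.0   # separated along one direction
h = rng.choice(k, size=n)
X = means[h] + rng.normal(size=(n, d))   # unit-variance noise per coord

# PCA to k dimensions: project centered data onto top-k singular vectors;
# the span of the component means survives the projection.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:k].T                        # n x k projected data

# After projection the components remain separated: thresholding the
# first projected coordinate recovers the labels (up to permutation).
pred = (Y[:, 0] > 0).astype(int)
acc = max(np.mean(pred == h), np.mean(pred != h))
assert acc > 0.99
```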
Making progress: discrete hidden Markov models
Hardness reductions create HMMs with degenerate output and next-state distributions.

[Figure: the output distribution of state 1 is a mixture of those of states 2 and 3:
    Pr[~x_t = · | h_t = 1] ≈ 0.6 Pr[~x_t = · | h_t = 2] + 0.4 Pr[~x_t = · | h_t = 3].]

These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, '06; Hsu-Kakade-Zhang, '09).
What we do
This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

- Non-degeneracy condition for the multi-view mixture model: the conditional means ~µ_{v,1}, ~µ_{v,2}, ..., ~µ_{v,k} are linearly independent for each view v ∈ [ℓ], and ~w > ~0.
  This requires high-dimensional observations (d_v ≥ k)!
- New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).
- General tensor decomposition framework applicable to a wide variety of estimation problems.
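The non-degeneracy condition is simply a full-rank requirement on the d_v × k matrix of conditional means; a quick numerical illustration on made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stack the k conditional means of one view as columns of a d x k matrix.
# Non-degeneracy = linear independence = rank(M) = k (needs d >= k).
d, k = 10, 3
M = rng.normal(size=(d, k))            # hypothetical means ~µ_{v,j}
assert np.linalg.matrix_rank(M) == k   # generic matrices are full-rank

# A degenerate case: one mean is a combination of the others.
M_bad = M.copy()
M_bad[:, 2] = 0.5 * M[:, 0] + 0.5 * M[:, 1]
assert np.linalg.matrix_rank(M_bad) < k
```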
Part 2. Multi-view method-of-moments
Multi-view mixture models
Multi-view method-of-moments
  - Overview
  - Structure of moments
  - Uniqueness of decomposition
  - Computing the decomposition
  - Asymmetric views
Some applications and open questions
Concluding remarks
The plan
- First, assume the views are (conditionally) exchangeable, and derive the basic algorithm.
- Then, provide a reduction from the general multi-view setting to the exchangeable case.
Simpler case: exchangeable views
(Conditionally) exchangeable views: assume the views have the same conditional means, i.e.,

    E[~x_v | h = j] ≡ ~µ_j,   j ∈ [k], v ∈ [ℓ].

Motivating setting: bag-of-words model, where ~x_1, ~x_2, ..., ~x_ℓ are ℓ exchangeable words in a document.

One-hot encoding:

    ~x_v = ~e_i  ⇔  the v-th word in the document is the i-th word in the vocabulary

(where ~e_i ∈ {0,1}^d has a 1 in the i-th position and 0 elsewhere). Then

    (~µ_j)_i = E[(~x_v)_i | h = j] = Pr[~x_v = ~e_i | h = j],   i ∈ [d], j ∈ [k].
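A minimal sketch of this encoding (vocabulary size and word indices are made up):

```python
import numpy as np

d = 5                    # hypothetical vocabulary size
words = [2, 0, 2, 4]     # word indices of an ℓ = 4 word document (made up)
X = np.eye(d)[words]     # row v is the one-hot vector ~x_v = ~e_{word_v}

assert X.shape == (4, 5)
assert (X.sum(axis=1) == 1).all()   # exactly one 1 per row
# Averaging one-hot vectors estimates a probability vector, which is why
# E[(~x_v)_i | h = j] equals Pr[~x_v = ~e_i | h = j].
```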
Key ideas
1. Method-of-moments: conditional means are revealed by appropriate low-rank decompositions of moment matrices and tensors.

2. Third-order tensor decomposition is uniquely determined by directions of (locally) maximum skew.

3. The required local optimization can be efficiently performed in poly time.
Algebraic structure in moments
Recall: E[~x_v | h = j] = ~µ_j.

By conditional independence and exchangeability of ~x_1, ~x_2, ..., ~x_ℓ given h,

    Pairs := E[~x_1 ⊗ ~x_2]
           = E[ E[~x_1 | h] ⊗ E[~x_2 | h] ]
           = E[~µ_h ⊗ ~µ_h]
           = Σ_{i=1}^k w_i ~µ_i ⊗ ~µ_i ∈ R^{d×d};

    Triples := E[~x_1 ⊗ ~x_2 ⊗ ~x_3]
             = Σ_{i=1}^k w_i ~µ_i ⊗ ~µ_i ⊗ ~µ_i ∈ R^{d×d×d},  etc.

(If only we could extract these "low-rank" decompositions . . . )
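These identities can be checked numerically (a sketch with made-up parameters): empirical averages of outer products converge to the low-rank population forms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical exchangeable 3-view mixture: k = 2 components in R^4.
k, d, n = 2, 4, 200_000
w = np.array([0.3, 0.7])
mu = np.array([[2.0, 0.0, 0.0, 0.0],
               [1.0, 1.5, 0.0, 0.0]])     # rows are ~µ_j (made up)

h = rng.choice(k, size=n, p=w)
noise = lambda: 0.1 * rng.normal(size=(n, d))
x1, x2, x3 = mu[h] + noise(), mu[h] + noise(), mu[h] + noise()

# Empirical moments: Pairs = E[x1 ⊗ x2], Triples = E[x1 ⊗ x2 ⊗ x3].
pairs = np.einsum('na,nb->ab', x1, x2) / n
triples = np.einsum('na,nb,nc->abc', x1, x2, x3) / n

# Population low-rank forms: Σ_i w_i µ_i⊗µ_i and Σ_i w_i µ_i⊗µ_i⊗µ_i.
pairs_pop = np.einsum('i,ia,ib->ab', w, mu, mu)
triples_pop = np.einsum('i,ia,ib,ic->abc', w, mu, mu, mu)

assert np.abs(pairs - pairs_pop).max() < 0.05
assert np.abs(triples - triples_pop).max() < 0.1
assert np.linalg.matrix_rank(pairs_pop) == k   # rank k, not d
```

The cross terms vanish because the per-view noise is independent given h, so no noise-variance correction is needed in Pairs or Triples — a key advantage of the multi-view setup.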
2nd moment: subspace spanned by conditional means
Non-degeneracy assumption (~µ_i linearly independent)

⟹ Pairs = Σ_{i=1}^k w_i ~µ_i ⊗ ~µ_i is symmetric psd with rank k;
⟹ Pairs equips the k-dim subspace span{~µ_1, ~µ_2, ..., ~µ_k} with the inner product

    Pairs(~x, ~y) := ~x^T Pairs ~y.

However, the ~µ_i are not generally determined by Pairs alone (e.g., the ~µ_i are not necessarily orthogonal). Must look at higher-order moments?
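One standard way to use this inner product (an assumed implementation detail, not spelled out on this slide) is to whiten: build a matrix W with W^T Pairs W = I_k, under which the √w_j-scaled conditional means become orthonormal.

```python
import numpy as np

wts = np.array([0.3, 0.7])
mu = np.array([[2.0, 0.0, 0.0],
               [1.0, 1.5, 0.0]])                 # rows are ~µ_j (made up)
pairs = np.einsum('i,ia,ib->ab', wts, mu, mu)    # Σ_i w_i µ_i ⊗ µ_i

# Whitening matrix from the rank-k part of Pairs' eigendecomposition.
vals, vecs = np.linalg.eigh(pairs)
keep = vals > 1e-8                               # k nonzero eigenvalues
W = vecs[:, keep] / np.sqrt(vals[keep])          # d x k whitener

assert np.allclose(W.T @ pairs @ W, np.eye(2), atol=1e-8)
# In the whitened space, the vectors √w_j W^T µ_j are orthonormal.
M = (np.sqrt(wts)[:, None] * mu) @ W             # k x k
assert np.allclose(M @ M.T, np.eye(2), atol=1e-8)
```

Whitening reduces the constraint Pairs(~η, ~η) ≤ 1 to the ordinary unit ball, which is what makes the third-moment optimization on the next slide tractable.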
3rd moment: (cross) skew maximizers
Claim: up to the third moment (i.e., 3 views) suffices.

View Triples : R^d × R^d × R^d → R as a trilinear form.

Theorem. Each isolated local maximizer ~η* of

    max_{~η ∈ R^d}  Triples(~η, ~η, ~η)   s.t.  Pairs(~η, ~η) ≤ 1

satisfies, for some i ∈ [k],

    Pairs ~η* = √w_i ~µ_i,    Triples(~η*, ~η*, ~η*) = 1/√w_i.

Also: these maximizers can be found efficiently and robustly.
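A sketch of how such maximizers can be found, using a whitening-plus-tensor-power-iteration scheme (a standard approach consistent with the theorem, but not spelled out on this slide; all model parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up model with exact population moments.
wts = np.array([0.3, 0.7])
mu = np.array([[2.0, 0.0, 0.0],
               [1.0, 1.5, 0.0]])                       # rows are ~µ_j
pairs = np.einsum('i,ia,ib->ab', wts, mu, mu)
triples = np.einsum('i,ia,ib,ic->abc', wts, mu, mu, mu)

# Whiten so the constraint Pairs(η, η) ≤ 1 becomes the unit ball.
vals, vecs = np.linalg.eigh(pairs)
keep = vals > 1e-8
W = vecs[:, keep] / np.sqrt(vals[keep])                # d x k
T = np.einsum('abc,ai,bj,cl->ijl', triples, W, W, W)   # whitened Triples

# Tensor power iteration: repeatedly apply T and renormalize; the
# fixed points are exactly the (local) skew maximizers.
theta = rng.normal(size=2)
theta /= np.linalg.norm(theta)
for _ in range(50):
    theta = np.einsum('ijl,j,l->i', T, theta, theta)
    theta /= np.linalg.norm(theta)

eta = W @ theta                                         # a maximizer ~η*
lam = np.einsum('ijl,i,j,l->', T, theta, theta, theta)  # = 1/√w_i
mu_hat = (pairs @ eta) * lam                            # (√w_i µ_i)·(1/√w_i)

# As the theorem predicts, this recovers one of the component means.
assert any(np.allclose(mu_hat, m, atol=1e-6) for m in mu)
```

Repeating from fresh random starts (and deflating found components) recovers all k means; here population moments are used, so the recovery is exact.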
Variational analysis

    max_{~η ∈ R^d}  Triples(~η, ~η, ~η)   s.t.  Pairs(~η, ~η) ≤ 1

Substituting Pairs = Σ_{i=1}^k w_i ~µ_i ⊗ ~µ_i and Triples = Σ_{i=1}^k w_i ~µ_i ⊗ ~µ_i ⊗ ~µ_i, this becomes

    max_{~η ∈ R^d}  Σ_{i=1}^k w_i (~η^T ~µ_i)^3   s.t.  Σ_{i=1}^k w_i (~η^T ~µ_i)^2 ≤ 1.

Letting θ_i := √w_i (~η^T ~µ_i) for i ∈ [k], this is

    max_{~θ}  Σ_{i=1}^k (1/√w_i) θ_i^3   s.t.  Σ_{i=1}^k θ_i^2 ≤ 1.

The isolated local maximizers ~θ* (found via gradient ascent) are

    ~e_1 = (1, 0, 0, ...), ~e_2 = (0, 1, 0, ...), etc.,

which means that each ~η* satisfies, for some i ∈ [k],

    √w_j (~η*^T ~µ_j) = 1 if j = i,  0 if j ≠ i.

Therefore

    Pairs ~η* = Σ_{j=1}^k w_j ~µ_j (~η*^T ~µ_j) = √w_i ~µ_i.
which means that each ~η∗ satisfies, for some i ∈ [k ],√wj(~η∗>~µj
)=
1 j = i0 j 6= i .
Therefore
Pairs ~η∗ =k∑
j=1
wj~µj(~η∗>~µj
)=√
wi~µi .
![Page 50: Efficient algorithms for estimating multi-view mixture modelsdjhsu/papers/multiview-slides.pdfCryptographic barrier: discrete HMM pa-rameter estimation as hard aslearning parity functions](https://reader034.vdocuments.mx/reader034/viewer/2022050300/5f6939d01595a2090706b085/html5/thumbnails/50.jpg)
Variational analysis
max~θ∈Rk
k∑i=1
1√wi
θ3i s.t.
k∑i=1
θ2i ≤ 1
(Let θi :=√
wi (~η>~µi ) for i ∈ [k ].)
Isolated local maximizers ~θ∗ (found via gradient ascent) are
~e1 = (1,0,0, . . . ), ~e2 = (0,1,0, . . . ), etc.
which means that each ~η∗ satisfies, for some i ∈ [k ],√wj(~η∗>~µj
)=
1 j = i0 j 6= i .
Therefore
Pairs ~η∗ =k∑
j=1
wj~µj(~η∗>~µj
)=√
wi~µi .
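After the change of variables, the program over $\vec\theta$ is easy to simulate. A minimal NumPy sketch (random hypothetical weights; a power-iteration form of the ascent rather than the slide's exact gradient method) showing the iterates landing on a coordinate vector:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
w = rng.dirichlet(np.ones(k))          # hypothetical mixing weights

def power_step(theta):
    # fixed-point map for  max sum_i theta_i^3 / sqrt(w_i)  on the unit sphere:
    # theta_i <- theta_i^2 / sqrt(w_i), then renormalize
    t = theta ** 2 / np.sqrt(w)
    return t / np.linalg.norm(t)

theta = np.abs(rng.standard_normal(k))
theta /= np.linalg.norm(theta)
for _ in range(100):
    theta = power_step(theta)

# the iteration lands on a coordinate vector e_i, as the analysis predicts
i = int(np.argmax(theta))
assert np.allclose(theta, np.eye(k)[i], atol=1e-8)
```

Which coordinate vector is reached depends on the random start; every $\vec e_i$ is an isolated local maximizer.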
Extracting all isolated local maximizers

1. Start with $T := \mathrm{Triples}$.
2. Find an isolated local maximizer of
   $$T(\vec\eta, \vec\eta, \vec\eta) \quad \text{s.t.} \quad \mathrm{Pairs}(\vec\eta, \vec\eta) \le 1$$
   via gradient ascent from a random $\vec\eta \in \mathrm{range}(\mathrm{Pairs})$.
   Say the maximum is $\lambda^*$ and the maximizer is $\vec\eta^*$.
3. Deflation: replace $T$ with $T - \lambda^* \vec\eta^* \otimes \vec\eta^* \otimes \vec\eta^*$. Go to step 2.

A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations of Pairs and Triples.
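An illustrative toy version of this loop (not the robust polynomial-time variant the slide refers to): in whitened coordinates where Pairs is the identity, Triples becomes an orthogonally decomposable tensor $\sum_i \lambda_i \vec v_i^{\otimes 3}$, and power iteration plus deflation recovers each $\lambda_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
# whitened setting: orthonormal components and positive weights
V = np.linalg.qr(rng.standard_normal((k, k)))[0]
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)   # sum_i lam_i v_i x v_i x v_i

def power_iteration(T, iters=100):
    # repeated map eta <- T(., eta, eta) / ||.||, converging to some v_i
    eta = rng.standard_normal(T.shape[0])
    eta /= np.linalg.norm(eta)
    for _ in range(iters):
        eta = np.einsum('abc,b,c->a', T, eta, eta)
        eta /= np.linalg.norm(eta)
    return eta, float(np.einsum('abc,a,b,c', T, eta, eta, eta))

recovered = []
for _ in range(k):
    eta, val = power_iteration(T)
    recovered.append(val)
    T = T - val * np.einsum('a,b,c->abc', eta, eta, eta)   # deflation step

assert np.allclose(sorted(recovered, reverse=True), lam, atol=1e-6)
```

Each round finds one component (not necessarily in decreasing order of $\lambda_i$); deflation removes it so later rounds find the rest.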
General case: asymmetric views

Each view $v$ has a different set of conditional means $\vec\mu_{v,1}, \vec\mu_{v,2}, \dots, \vec\mu_{v,k} \subset \mathbb{R}^{d_v}$.

Reduction: transform $\vec x_1$ and $\vec x_2$ to "look like" $\vec x_3$ via linear transformations.
Asymmetric cross moments

Define the asymmetric cross moment
$$\mathrm{Pairs}_{u,v} := \mathbb{E}[\vec x_u \otimes \vec x_v].$$

Transforming view $v$ to view 3:
$$C_{v \to 3} := \mathbb{E}[\vec x_3 \otimes \vec x_u]\ \mathbb{E}[\vec x_v \otimes \vec x_u]^\dagger \in \mathbb{R}^{d_3 \times d_v},$$
where $\dagger$ denotes the Moore-Penrose pseudoinverse.

It is a simple exercise to show that
$$\mathbb{E}[C_{v \to 3}\,\vec x_v \mid h = j] = \vec\mu_{3,j},$$
so $C_{v \to 3}\,\vec x_v$ behaves like $\vec x_3$ (as far as our algorithm can tell).
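The "simple exercise" can be checked numerically: with conditionally independent views, $\mathrm{Pairs}_{u,v} = M_u\,\mathrm{diag}(w)\,M_v^\top$ where the columns of $M_v$ are the view-$v$ means, and then $C_{1\to 3} M_1 = M_3$. A NumPy sketch on hypothetical random parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3
d1, d2, d3 = 5, 4, 6                      # per-view dimensions (hypothetical)
w = rng.dirichlet(np.ones(k))
M1, M2, M3 = (rng.standard_normal((d, k)) for d in (d1, d2, d3))

# population cross moments: Pairs_{u,v} = E[x_u (x) x_v] = M_u diag(w) M_v^T
def pairs(Mu, Mv):
    return (Mu * w) @ Mv.T

# C_{1->3} = E[x3 (x) x2] E[x1 (x) x2]^+   (view u = 2 is the helper view)
C = pairs(M3, M2) @ np.linalg.pinv(pairs(M1, M2))

# transformed view-1 means coincide with view-3 means: C mu_{1,j} = mu_{3,j}
assert np.allclose(C @ M1, M3)
```

The identity relies on the $M_v$ having full column rank $k$ (non-degeneracy), which holds generically here.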
Part 3. Some applications and open questions
Multi-view mixture models
Multi-view method-of-moments
Some applications and open questions
    Mixtures of Gaussians
    Hidden Markov models and other models
    Topic models
    Open questions
Concluding remarks
Mixtures of axis-aligned Gaussians

Mixture of axis-aligned Gaussians in $\mathbb{R}^n$, with component means $\vec\mu_1, \vec\mu_2, \dots, \vec\mu_k \in \mathbb{R}^n$; no minimum separation requirement.

(Figure: hidden $h$ with conditionally independent coordinates $x_1, x_2, \dots, x_n$.)

Assumptions:

- Non-degeneracy: component means span a $k$-dimensional subspace.
- Weak incoherence condition: component means not perfectly aligned with the coordinate axes, similar to the spreading condition of (Chaudhuri-Rao, '08).

Then, randomly partitioning the coordinates into $\ell \ge 3$ views guarantees (w.h.p.) that non-degeneracy holds in all $\ell$ views.
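A quick numerical illustration of the last claim (hypothetical random instance): split the coordinates of generic component means into three views and check that non-degeneracy survives in each view.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, ell = 30, 4, 3
# component means in R^n (columns of M span a k-dim subspace generically)
M = rng.standard_normal((n, k))

# randomly partition the n coordinates into ell views
perm = rng.permutation(n)
views = np.array_split(perm, ell)

# non-degeneracy: within each view, the restricted means still span k dims
for coords in views:
    assert np.linalg.matrix_rank(M[coords]) == k
```
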
Hidden Markov models and others

(Figure: an HMM with hidden chain $h_1 \to h_2 \to h_3$ and observations $\vec x_1, \vec x_2, \vec x_3$ reduces to a single hidden variable $h$ with three conditionally independent views $\vec x_1, \vec x_2, \vec x_3$.)

Other models:

1. Mixtures of Gaussians (Hsu-Kakade, ITCS'13)
2. HMMs (Anandkumar-Hsu-Kakade, COLT'12)
3. Latent Dirichlet allocation (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS'12)
4. Latent parse trees (Hsu-Kakade-Liang, NIPS'12)
5. Independent component analysis (Arora-Ge-Moitra-Sachdeva, NIPS'12; Hsu-Kakade, ITCS'13)
Bag-of-words clustering model

$(\vec\mu_j)_i = \Pr[\,\text{see word } i \text{ in document} \mid \text{document topic is } j\,]$.

- Corpus: New York Times (from UCI), 300000 articles.
- Vocabulary size: d = 102660 words.
- Chose k = 50.
- For each topic j, show the top 10 words i (one topic per column):

    sales       run      school     drug       player
    economic    inning   student    patient    tiger_wood
    consumer    hit      teacher    million    won
    major       game     program    company    shot
    home        season   official   doctor     play
    indicator   home     public     companies  round
    weekly      right    children   percent    win
    order       games    high       cost       tournament
    claim       dodger   education  program    tour
    scheduled   left     district   health     right
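Given an estimated matrix of word probabilities $\hat\mu$, top-word lists like the ones above are just column-wise sorts; a minimal sketch on a synthetic (hypothetical) vocabulary:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 1000, 50                            # toy vocabulary size and topic count
mu = rng.dirichlet(np.ones(d), size=k).T   # column j: word distribution of topic j
vocab = [f"word{i}" for i in range(d)]     # hypothetical vocabulary

# for each topic j, list the 10 highest-probability words
top10 = {j: [vocab[i] for i in np.argsort(mu[:, j])[::-1][:10]] for j in range(k)}

assert np.allclose(mu.sum(axis=0), 1.0)    # each column is a distribution
assert all(len(words) == 10 for words in top10.values())
```
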
Bag-of-words clustering model

    palestinian    tax       cup         point   yard
    israel         cut       minutes     game    game
    israeli        percent   oil         team    play
    yasser_arafat  bush      water       shot    season
    peace          billion   add         play    team
    israeli        plan      tablespoon  laker   touchdown
    israelis       bill      food        season  quarterback
    leader         taxes     teaspoon    half    coach
    official       million   pepper      lead    defense
    attack         congress  sugar       games   quarter
Bag-of-words clustering model

    percent     al_gore       car     book       taliban
    stock       campaign      race    children   attack
    market      president     driver  ages       afghanistan
    fund        george_bush   team    author     official
    investor    bush          won     read       military
    companies   clinton       win     newspaper  u_s
    analyst     vice          racing  web        united_states
    money       presidential  track   writer     terrorist
    investment  million       season  written    war
    economy     democratic    lap     sales      bin
Bag-of-words clustering model

    com          court       show        film       music
    www          case        network     movie      song
    site         law         season      director   group
    web          lawyer      nbc         play       part
    sites        federal     cb          character  new_york
    information  government  program     actor      company
    online       decision    television  show       million
    mail         trial       series      movies     band
    internet     microsoft   night       million    show
    telegram     right       new_york    part       album

etc.
Some open questions

What if $k > d_v$? (Relevant to overcomplete dictionary learning.)

- Apply some non-linear transformations $\vec x_v \mapsto f_v(\vec x_v)$?
- Combine views, e.g., via tensor products:
  $$\vec x_{1,2} := \vec x_1 \otimes \vec x_2, \quad \vec x_{3,4} := \vec x_3 \otimes \vec x_4, \quad \vec x_{5,6} := \vec x_5 \otimes \vec x_6, \ \text{etc.}?$$

Can we relax the multi-view assumption?

- Allow for richer hidden state? (e.g., independent component analysis)
- "Gaussianization" via random projection?
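The tensor-product idea can be illustrated numerically: with $d_v = 2 < k = 3$, no single view is non-degenerate, but the combined view $\vec x_1 \otimes \vec x_2$ has conditional means $\vec\mu_{1,j} \otimes \vec\mu_{2,j}$ that generically span $k$ dimensions (a hypothetical NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
k, dv = 3, 2                 # more components than per-view dimensions
M1 = rng.standard_normal((dv, k))
M2 = rng.standard_normal((dv, k))

# per-view means cannot span k dimensions...
assert np.linalg.matrix_rank(M1) < k

# ...but E[x1 (x) x2 | h = j] = mu_{1,j} (x) mu_{2,j} lives in R^{dv*dv},
# and these combined means generically do span k dimensions
K = np.stack([np.kron(M1[:, j], M2[:, j]) for j in range(k)], axis=1)
assert np.linalg.matrix_rank(K) == k
```
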
Part 4. Concluding remarks
Multi-view mixture models
Multi-view method-of-moments
Some applications and open questions
Concluding remarks
Concluding remarks

Take-home messages:

- Power of multiple views: can take advantage of diverse / non-redundant sources of information in unsupervised learning.
- Overcoming complexity barriers: some provably hard estimation problems become easy after ruling out "degenerate" cases.
- "Blessing of dimensionality" for estimators based on the method of moments.
Thanks!
(Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)
http://arxiv.org/abs/1210.7559