Part 2: Unsupervised Learning (Machine Learning Techniques for Computer Vision)


TRANSCRIPT

1. Part 2: Unsupervised Learning
   Machine Learning Techniques for Computer Vision
   Microsoft Research Cambridge
   ECCV 2004, Prague
   Christopher M. Bishop

2. Overview of Part 2
   - Mixture models
   - EM
   - Variational inference
   - Bayesian model complexity
   - Continuous latent variables

3. The Gaussian Distribution
   - Multivariate Gaussian
   - Maximum likelihood estimates of the mean and covariance

4. Gaussian Mixtures
   - Linear super-position of Gaussians
   - Normalization and positivity require the mixing coefficients to be non-negative and to sum to one

5. Example: Mixture of 3 Gaussians

6. Maximum Likelihood for the GMM
   - Log likelihood function
   - Sum over components appears inside the log
     - no closed-form ML solution

7. EM Algorithm: Informal Derivation

8. EM Algorithm: Informal Derivation
   - M-step equations

9. EM Algorithm: Informal Derivation
   - E-step equation

10. EM Algorithm: Informal Derivation
    - Can interpret the mixing coefficients as prior probabilities
    - Corresponding posterior probabilities (responsibilities)

11. Old Faithful Data Set (plot: duration of eruption (minutes) vs. time between eruptions (minutes))

12.-17. (figures only)

18. Latent Variable View of EM
    - To sample from a Gaussian mixture:
      - first pick one of the components with probability given by its mixing coefficient
      - then draw a sample from that component
      - repeat these two steps for each new data point

19. Latent Variable View of EM
    - Goal: given a data set, find the mixture parameters
    - Suppose we knew the colours
      - maximum likelihood would involve fitting each component to the corresponding cluster
    - Problem: the colours are latent (hidden) variables

20. Incomplete and Complete Data (figures labelled complete and incomplete)

21. Latent Variable Viewpoint

22. Latent Variable Viewpoint
    - Binary latent variables describing which component generated each data point
    - Conditional distribution of the observed variable
    - Prior distribution of the latent variables
    - Marginalizing over the latent variables recovers the Gaussian mixture

23. Graphical Representation of GMM

24. Latent Variable View of EM
    - Suppose we knew the values of the latent variables
      - maximize the complete-data log likelihood
      - trivial closed-form solution: fit each component to the corresponding set of data points
    - We don't know the values of the latent variables
      - however, for given parameter values we can compute the expected values of the latent variables

25. Posterior Probabilities (colour coded)

26. Over-fitting in Gaussian Mixture Models
    - Infinities in the likelihood function when a component collapses onto a data point (its variance shrinking to zero)
    - Also, maximum likelihood cannot determine the number K of components
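The E-step and M-step equations of slides 7-10 appear in the original slides only as images. As a concrete illustration of how those updates fit together, here is a minimal NumPy sketch of maximum-likelihood EM for a Gaussian mixture (illustrative code only, not the tutorial's own; the function name em_gmm and all variable names are made up):

```python
import numpy as np

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component Gaussian mixture to data X (N x D) by maximum-likelihood EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise: equal mixing coefficients, means at randomly chosen data points,
    # and every covariance set to the overall data covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iters):
        # E step: responsibilities r[n, k] = p(component k | x_n), computed in log space.
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('nd,dc,nc->n', diff, np.linalg.inv(Sigma[k]), diff)
            log_r[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
        log_r -= log_r.max(axis=1, keepdims=True)   # subtract the max for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: re-estimate mixing coefficients, means and covariances.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```

Calling em_gmm on the two-dimensional Old Faithful data with K = 2 and colouring each point by its responsibilities gives the kind of fit shown around slides 11-17. The small diagonal term added to each covariance is only a crude guard against the likelihood singularities described on slide 26; the Bayesian treatment later in the tutorial removes them properly.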
27. Cross Validation
    - Can select model complexity using an independent validation data set
    - If data is scarce, use cross-validation:
      - partition the data into S subsets
      - train on S-1 subsets
      - test on the remainder
      - repeat and average
    - Disadvantages
      - computationally expensive
      - can only determine one or two complexity parameters

28. Bayesian Mixture of Gaussians
    - Parameters and latent variables appear on an equal footing
    - Conjugate priors

29. Data Set Size
    - Problem 1: learn a function from 100 (slightly) noisy examples
      - data set is computationally small but statistically large
    - Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images
      - data set is computationally large but statistically small
    - Bayesian inference
      - computationally more demanding than ML or MAP (but see the discussion of Gaussian mixtures later)
      - significant benefit for statistically small data sets

30. Variational Inference
    - Exact Bayesian inference is intractable
    - Markov chain Monte Carlo
      - computationally expensive
      - issues of convergence
    - Variational inference
      - broadly applicable deterministic approximation
      - let a single vector denote all latent variables and parameters
      - approximate the true posterior using a simpler distribution
      - minimize the Kullback-Leibler divergence

31. General View of Variational Inference
    - For an arbitrary distribution q, the log marginal likelihood decomposes into a lower bound plus the Kullback-Leibler divergence between q and the true posterior
    - Maximizing over q would give the true posterior
      - this is intractable by definition

32. Variational Lower Bound

33. Factorized Approximation
    - Goal: choose a family of q distributions which are:
      - sufficiently flexible to give a good approximation
      - sufficiently simple to remain tractable
    - Here we consider factorized distributions
    - No further assumptions are required!
    - Optimal solution for one factor, keeping the remainder fixed
      - coupled solutions, so initialize and then cyclically update
      - message-passing view (Winn and Bishop, 2004)

34. (figure only)

35. Lower Bound
    - Can also be evaluated
    - Useful for maths/code verification
    - Also useful for model comparison

36. Illustration: Univariate Gaussian
    - Likelihood function
    - Conjugate prior
    - Factorized variational distribution

37. Initial Configuration

38. After Updating

39. After Updating

40. Converged Solution

41. Variational Mixture of Gaussians
    - Assume a factorized posterior distribution
    - No other approximations needed!

42. Variational Equations for GMM
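The variational update equations of slides 36-42 are likewise shown only as images. As a concrete example of the cyclic update scheme of slide 33, here is a sketch of coordinate-ascent variational inference for the univariate Gaussian of slides 36-40, with unknown mean and precision and a conjugate Normal-Gamma prior (the choice of prior, the hyperparameter names mu0, lam0, a0, b0 and the function name are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def vb_univariate_gaussian(x, mu0=0.0, lam0=1.0, a0=1e-3, b0=1e-3, n_iters=50):
    """Factorized variational inference q(mu, tau) = q(mu) q(tau) for a univariate
    Gaussian with unknown mean mu and precision tau, under the (assumed) conjugate
    priors mu | tau ~ N(mu0, (lam0 * tau)^-1) and tau ~ Gamma(a0, b0)."""
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), x.mean()
    E_tau = 1.0                                    # initial guess for E[tau]
    for _ in range(n_iters):
        # Update q(mu) = N(mu_N, lam_N^-1), given the current expectation of tau.
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Update q(tau) = Gamma(a_N, b_N), given the current q(mu).
        a_N = a0 + 0.5 * (N + 1)
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N     # E_q(mu)[sum_n (x_n - mu)^2]
        E_pr = (mu_N - mu0) ** 2 + 1.0 / lam_N         # E_q(mu)[(mu - mu0)^2]
        b_N = b0 + 0.5 * (E_sq + lam0 * E_pr)
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```

Each sweep updates q(mu) given the current expectation of the precision and then q(tau) given the current q(mu); iterating to a fixed point corresponds to the converged solution illustrated on slide 40.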
43. Lower Bound for GMM

44. VIBES
    - Bishop, Spiegelhalter and Winn (2002)

45. ML Limit
    - If we instead choose point estimates (delta functions) for the parameter factors, we recover the maximum likelihood EM algorithm

46. Bound vs. K for Old Faithful Data

47. Bayesian Model Complexity

48. Sparse Bayes for Gaussian Mixture
    - Corduneanu and Bishop (2001)
    - Start with a large value of K
      - treat the mixing coefficients as parameters
      - maximize the marginal likelihood
      - prunes out excess components

49.-50. (figures only)

51. Summary: Variational Gaussian Mixtures
    - Simple modification of maximum likelihood EM code
    - Small computational overhead compared to EM
    - No singularities
    - Automatic model order selection

52. Continuous Latent Variables
    - Conventional PCA
      - data covariance matrix
      - eigenvector decomposition
    - Minimizes the sum-of-squares projection error
      - not a probabilistic model
      - how should we choose L?

53. Probabilistic PCA
    - Tipping and Bishop (1998)
    - L-dimensional continuous latent space
    - D-dimensional data space
    - (figure labels: PCA, factor analysis)

54. Probabilistic PCA
    - Marginal distribution
    - Advantages
      - exact ML solution
      - computationally efficient EM algorithm
      - captures dominant correlations with few parameters
      - mixtures of PPCA
      - Bayesian PCA
      - building block for more complex models

55.-61. EM for PCA (figure sequence; a code sketch follows slide 71)

62. Bayesian PCA
    - Bishop (1998)
    - Gaussian prior over the columns of the weight matrix
    - Automatic relevance determination (ARD)
    - (figures labelled ML PCA and Bayesian PCA)

63. Non-linear Manifolds
    - Example: images of a rigid object

64. Bayesian Mixture of BPCA Models

65. (figure only)

66. Flexible Sprites
    - Jojic and Frey (2001)
    - Automatic decomposition of a video sequence into
      - a background model
      - an ordered set of masks (one per object per frame)
      - a foreground model (one per object per frame)

67. (figure only)

68. Transformed Component Analysis
    - Generative model
    - Now include transformations (translations)
    - Extend to L layers
    - Inference is intractable, so use the variational framework

69. (figure only)

70. Bayesian Constellation Model
    - Li, Fergus and Perona (2003)
    - Object recognition from small training sets
    - Variational treatment of a fully Bayesian model

71. Bayesian Constellation Model
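Slides 55-61 step through EM for PCA as a figure sequence. Under the probabilistic PCA model of slide 53 (an L-dimensional latent space, D-dimensional data and isotropic Gaussian observation noise), the standard EM updates can be sketched as follows (illustrative code, not the authors' implementation):

```python
import numpy as np

def em_ppca(X, L, n_iters=100, seed=0):
    """EM for probabilistic PCA: x = W z + mu + noise, with z ~ N(0, I_L) and
    isotropic noise of variance sigma2. Returns W (D x L), mu (D,) and sigma2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                                   # centred data
    W = rng.standard_normal((D, L))
    sigma2 = 1.0

    for _ in range(n_iters):
        # E step: posterior moments of the latent variables z_n.
        M = W.T @ W + sigma2 * np.eye(L)          # L x L matrix
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                        # rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez       # sum_n E[z_n z_n^T]

        # M step: re-estimate the weight matrix and the noise variance.
        W_new = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W_new) * Ez)
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, mu, sigma2
```

Because the maximum-likelihood solution for probabilistic PCA is also available in closed form from the top L eigenvectors of the data covariance, the eigendecomposition of conventional PCA (slide 52) provides a useful check on the EM code.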
72. Summary of Part 2
    - Discrete and continuous latent variables
      - EM algorithm
    - Build complex models from simple components
      - represented graphically
      - incorporates prior knowledge
    - Variational inference
      - Bayesian model comparison
