CS 59000 Statistical Machine Learning, Lecture 24
DESCRIPTION
CS 59000 Statistical Machine Learning, Lecture 24. Yuan (Alan) Qi, Purdue CS, Nov. 20, 2008. Outline: review of K-medoids, mixture of Gaussians, Expectation Maximization (EM), alternative view of EM.
TRANSCRIPT
CS 59000 Statistical Machine Learning
Lecture 24
Yuan (Alan) Qi, Purdue CS
Nov. 20, 2008
Outline
• Review of K-medoids, mixture of Gaussians, Expectation Maximization (EM), alternative view of EM
• Hidden Markov Models, forward-backward algorithm, EM for learning HMM parameters, Viterbi algorithm, linear state space models, Kalman filtering and smoothing
K-medoids Algorithm
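The algorithm details are on the slide figure; for reference, a minimal K-medoids sketch in the standard alternating form (assign each point to its nearest medoid, then re-select the medoid within each cluster), with illustrative names and squared Euclidean distance as one possible dissimilarity:

```python
import numpy as np

def k_medoids(X, K, dissimilarity=None, n_iters=100, seed=0):
    """Minimal K-medoids sketch: like K-means, but prototypes are restricted
    to data points and a general dissimilarity measure can be used."""
    if dissimilarity is None:
        dissimilarity = lambda a, b: np.sum((a - b) ** 2)  # squared Euclidean by default
    rng = np.random.default_rng(seed)
    N = len(X)
    # Pairwise dissimilarity matrix (O(N^2); fine for small data sets).
    D = np.array([[dissimilarity(X[i], X[j]) for j in range(N)] for i in range(N)])
    medoids = rng.choice(N, size=K, replace=False)
    for _ in range(n_iters):
        # Assignment step: attach each point to its closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, pick the member that minimizes
        # the total dissimilarity to the other cluster members.
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(labels == k)[0]
            if len(members) > 0:
                costs = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[k] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```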
Mixture of Gaussians
Mixture of Gaussians: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Introduce latent variables $\mathbf{z}$ with 1-of-$K$ coding: $p(z_k = 1) = \pi_k$ and $p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Marginal distribution: $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Conditional Probability
Responsibility that component $k$ takes for explaining the observation:
$\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
Maximum Likelihood
Maximize the log likelihood function
$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
Severe Overfitting by Maximum Likelihood
When a Gaussian component collapses onto a single data point, its variance goes to 0 and the log likelihood diverges to infinity, so maximum likelihood can overfit severely.
Maximum Likelihood Conditions (1)
Setting the derivative of $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\mu}_k$ to zero:
$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
Maximum Likelihood Conditions (2)
Setting the derivative of $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}_k$ to zero:
$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}}$
Maximum Likelihood Conditions (3)
Lagrange function: $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$
Setting its derivative to zero and using the normalization constraint, we obtain:
$\pi_k = \dfrac{N_k}{N}$
Expectation Maximization for Mixtures of Gaussians
Although the previous conditions do not give closed-form solutions (the responsibilities themselves depend on the parameters), we can use them to construct iterative updates:
E step: compute the responsibilities $\gamma(z_{nk})$ with the current parameters.
M step: compute the new means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$, and mixing coefficients $\pi_k$ from the current responsibilities.
Loop over the E and M steps until the log likelihood stops increasing.
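A minimal numpy/scipy sketch of this E/M loop for Gaussian components in the form above (function and variable names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal EM for a mixture of Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]        # means initialized at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])            # shape (N, K)
        ll = np.sum(np.log(dens.sum(axis=1)))                  # current log likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        if ll - prev_ll < tol:                                 # stop when the log likelihood stops increasing
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```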
General EM Algorithm
EM and Jensen's Inequality
Goal: maximize $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$
Define: $\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \dfrac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$
We have $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \sum_{\mathbf{Z}} q(\mathbf{Z}) \dfrac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \ge \mathcal{L}(q, \boldsymbol{\theta})$. From Jensen's inequality, we see that $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound of $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
Lower Bound
$\mathcal{L}(q, \boldsymbol{\theta})$ is a functional of the distribution $q(\mathbf{Z})$.
Since $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)$ and $\mathrm{KL}(q \,\|\, p) \ge 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound of the log likelihood function $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$. (This is another way to see the lower bound without using Jensen's inequality.)
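Written out, the decomposition used here is the standard one from the textbook treatment of EM:

$$
\ln p(\mathbf{X} \mid \boldsymbol{\theta})
= \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}}_{\mathcal{L}(q,\, \boldsymbol{\theta})}
\;+\;
\underbrace{\left( - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} \right)}_{\mathrm{KL}(q \,\|\, p) \;\ge\; 0}
$$

since $\ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} - \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} = \ln p(\mathbf{X} \mid \boldsymbol{\theta})$ for every $\mathbf{Z}$, and $q(\mathbf{Z})$ sums to one.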
Lower Bound Perspective of EM
• Expectation step: maximize the functional lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ over the distribution $q(\mathbf{Z})$ with the parameters fixed, which sets $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$.
• Maximization step: maximize the lower bound over the parameters $\boldsymbol{\theta}$ with $q(\mathbf{Z})$ fixed.
Illustration of EM Updates
Sequential Data
There is temporal dependence between data points.
Markov Models
By the chain rule, a joint distribution can be rewritten as
$p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_{n-1})$
Assuming conditional independence, we have
$p(\mathbf{x}_1, \dots, \mathbf{x}_N) = p(\mathbf{x}_1) \prod_{n=2}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1})$
This is known as a first-order Markov chain.
High Order Markov Chains
Second-order Markov assumption:
$p(\mathbf{x}_1, \dots, \mathbf{x}_N) = p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) \prod_{n=3}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1}, \mathbf{x}_{n-2})$
This can be generalized to higher-order Markov chains, but the number of parameters explodes exponentially with the order (for discrete variables with $K$ states, an $M$-th order chain needs $K^{M}(K-1)$ parameters per conditional distribution).
State Space Models
Important graphical models for many dynamical models, including Hidden Markov Models (HMMs) and linear dynamical systems.
Question: what order of Markov assumption do these models make?
Hidden Markov Models
Many applications, e.g., speech recognition, natural language processing, handwriting recognition, bio-sequence analysis
From Mixture Models to HMMs
By turning a mixture model into a dynamic model, we obtain the HMM.
We model the dependence between two consecutive latent variables by a transition probability:
$p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A}) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{\,z_{n-1,j}\, z_{nk}}$
HMMs
Prior on the initial latent variable: $p(\mathbf{z}_1 \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{\,z_{1k}}$
Emission probabilities: $p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi}) = \prod_{k=1}^{K} p(\mathbf{x}_n \mid \boldsymbol{\phi}_k)^{\,z_{nk}}$
Joint distribution: $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = p(\mathbf{z}_1 \mid \boldsymbol{\pi}) \left[ \prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A}) \right] \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})$
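As an illustration of this generative process (and of how samples like those on the next slide can be drawn), a minimal ancestral-sampling sketch, assuming Gaussian emission densities and illustrative variable names:

```python
import numpy as np

def sample_hmm(pi, A, means, covs, N, seed=0):
    """Ancestral sampling from an HMM with Gaussian emissions.
    pi: (K,) initial state distribution; A: (K, K) transitions with
    A[j, k] = p(z_n = k | z_{n-1} = j); means/covs: emission parameters."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    states = np.empty(N, dtype=int)
    obs = []
    states[0] = rng.choice(K, p=pi)                        # z_1 ~ p(z_1)
    obs.append(rng.multivariate_normal(means[states[0]], covs[states[0]]))
    for n in range(1, N):
        states[n] = rng.choice(K, p=A[states[n - 1]])      # z_n ~ p(z_n | z_{n-1})
        obs.append(rng.multivariate_normal(means[states[n]], covs[states[n]]))  # x_n ~ p(x_n | z_n)
    return states, np.array(obs)
```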
Samples from HMM
(a) Contours of constant probability density for the emission distributions corresponding to each of the three states of the latent variable. (b) A sample of 50 points drawn from the hidden Markov model, with lines connecting the successive observations.
Inference: Forward-backward Algorithm
Goal: compute the marginals of the latent variables.
The forward-backward algorithm performs exact inference as a special case of the sum-product algorithm on the HMM.
Factor graph representation (grouping the emission density and the transition probability into one factor at each step):
$h(\mathbf{z}_1) = p(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1), \qquad f_n(\mathbf{z}_{n-1}, \mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\, p(\mathbf{x}_n \mid \mathbf{z}_n)$
Forward-backward Algorithm as Message Passing Method (1)
Forward messages: $\alpha(\mathbf{z}_n) \equiv p(\mathbf{x}_1, \dots, \mathbf{x}_n, \mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})$, initialized with $\alpha(\mathbf{z}_1) = p(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1)$
Forward-backward Algorithm as Message Passing Method (2)
Backward messages (Q: how do we compute them?): $\beta(\mathbf{z}_n) = \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1})\, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1})\, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n)$, initialized with $\beta(\mathbf{z}_N) = 1$
The messages actually involve the observations $\mathbf{X}$: $\beta(\mathbf{z}_n) = p(\mathbf{x}_{n+1}, \dots, \mathbf{x}_N \mid \mathbf{z}_n)$
Similarly, we can compute the following (Q: why?): $p(\mathbf{z}_n \mid \mathbf{X}) \propto \alpha(\mathbf{z}_n)\, \beta(\mathbf{z}_n)$ and $p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}) \propto \alpha(\mathbf{z}_{n-1})\, p(\mathbf{x}_n \mid \mathbf{z}_n)\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\, \beta(\mathbf{z}_n)$
Rescaling to Avoid Underflow
When a sequence is long, the forward messages become too small to be represented within the dynamic range of the computer. We redefine the forward message as
$\hat{\alpha}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n) = \alpha(\mathbf{z}_n) \big/ p(\mathbf{x}_1, \dots, \mathbf{x}_n)$
Similarly, we redefine the backward message as
$\hat{\beta}(\mathbf{z}_n) = \beta(\mathbf{z}_n) \big/ p(\mathbf{x}_{n+1}, \dots, \mathbf{x}_N \mid \mathbf{x}_1, \dots, \mathbf{x}_n)$
Then we can compute $p(\mathbf{z}_n \mid \mathbf{X}) = \hat{\alpha}(\mathbf{z}_n)\, \hat{\beta}(\mathbf{z}_n)$.
See the detailed derivation in the textbook.
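A minimal numpy sketch of the rescaled recursions under these conventions (the emission likelihoods are precomputed into a matrix; names are illustrative):

```python
import numpy as np

def forward_backward(pi, A, B):
    """Scaled forward-backward. pi: (K,) initial distribution, A: (K, K)
    transitions, B: (N, K) with B[n, k] = p(x_n | z_n = k).
    Returns state posteriors gamma, pairwise posteriors xi, and the
    log likelihood assembled from the scaling factors c_n."""
    N, K = B.shape
    alpha_hat = np.zeros((N, K))
    beta_hat = np.ones((N, K))
    c = np.zeros(N)                                   # scaling factors c_n
    # Forward pass: alpha_hat(z_n) = p(z_n | x_1..x_n)
    a = pi * B[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = (alpha_hat[n - 1] @ A) * B[n]
        c[n] = a.sum()
        alpha_hat[n] = a / c[n]
    # Backward pass, rescaled by the same c_n
    for n in range(N - 2, -1, -1):
        beta_hat[n] = (A @ (B[n + 1] * beta_hat[n + 1])) / c[n + 1]
    gamma = alpha_hat * beta_hat                      # p(z_n | X)
    xi = np.zeros((N - 1, K, K))                      # p(z_{n-1}, z_n | X)
    for n in range(1, N):
        xi[n - 1] = (alpha_hat[n - 1][:, None] * A * (B[n] * beta_hat[n])[None, :]) / c[n]
    return gamma, xi, np.sum(np.log(c))
```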
Viterbi Algorithm
• Finds the most probable sequence of states.
• Special case of the max-sum (max-product) algorithm on the HMM.
What if we want to find the most probable individual states instead?
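A minimal log-space Viterbi sketch, under the same conventions as the forward-backward sketch above (illustrative names):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most probable state sequence. pi: (K,) initial distribution,
    A: (K, K) transitions, B: (N, K) emission likelihoods p(x_n | z_n = k)."""
    N, K = B.shape
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    omega = np.zeros((N, K))            # max log joint probability ending in state k at step n
    backptr = np.zeros((N, K), dtype=int)
    omega[0] = log_pi + log_B[0]
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A      # (K, K): from state j to state k
        backptr[n] = np.argmax(scores, axis=0)
        omega[n] = np.max(scores, axis=0) + log_B[n]
    # Backtrack from the best final state.
    path = np.zeros(N, dtype=int)
    path[-1] = np.argmax(omega[-1])
    for n in range(N - 2, -1, -1):
        path[n] = backptr[n + 1, path[n + 1]]
    return path, omega[-1].max()
```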
Maximum Likelihood Estimation for HMM
Goal: maximize $p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi})$.
Looks familiar? Recall EM for the mixture of Gaussians; indeed the updates are similar.
EM for HMM
E step: compute the posteriors $\gamma(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ and $\xi(\mathbf{z}_{n-1}, \mathbf{z}_n) = p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$,
computed from the forward-backward (sum-product) algorithm.
M step: update the parameters, e.g.
$\pi_k = \dfrac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}, \qquad A_{jk} = \dfrac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}$
and maximize $\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(\mathbf{x}_n \mid \boldsymbol{\phi}_k)$ with respect to the emission parameters $\boldsymbol{\phi}$.
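A minimal sketch of these M-step updates, given the gamma and xi posteriors from the forward-backward sketch above and assuming Gaussian emission densities for concreteness (illustrative names, single observation sequence):

```python
import numpy as np

def hmm_m_step(X, gamma, xi):
    """M step for an HMM with Gaussian emissions (illustrative sketch).
    X: (N, D) observations; gamma: (N, K) state posteriors;
    xi: (N-1, K, K) pairwise posteriors from the E step."""
    N, D = X.shape
    K = gamma.shape[1]
    pi = gamma[0] / gamma[0].sum()                          # initial state distribution
    A = xi.sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)                       # row-normalized transitions
    Nk = gamma.sum(axis=0)
    means = (gamma.T @ X) / Nk[:, None]                     # responsibility-weighted means
    covs = np.empty((K, D, D))
    for k in range(K):
        diff = X - means[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, A, means, covs
```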
Linear Dynamical Systems
The latent and observed variables are continuous, with linear-Gaussian dependences:
$p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) = \mathcal{N}(\mathbf{z}_n \mid \mathbf{A}\mathbf{z}_{n-1}, \boldsymbol{\Gamma}), \qquad p(\mathbf{x}_n \mid \mathbf{z}_n) = \mathcal{N}(\mathbf{x}_n \mid \mathbf{C}\mathbf{z}_n, \boldsymbol{\Sigma}), \qquad p(\mathbf{z}_1) = \mathcal{N}(\mathbf{z}_1 \mid \boldsymbol{\mu}_0, \mathbf{V}_0)$
Equivalently, we have
$\mathbf{z}_n = \mathbf{A}\mathbf{z}_{n-1} + \mathbf{w}_n, \qquad \mathbf{x}_n = \mathbf{C}\mathbf{z}_n + \mathbf{v}_n, \qquad \mathbf{z}_1 = \boldsymbol{\mu}_0 + \mathbf{u}$
where $\mathbf{w}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$, $\mathbf{v}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, and $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}_0)$.
Kalman Filtering and Smoothing
Inference on linear-Gaussian systems.
Kalman filtering: sequentially update the scaled forward message $\hat{\alpha}(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n)$.
Kalman smoothing: sequentially update the state beliefs $p(\mathbf{z}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_N)$ based on the scaled forward and backward messages.
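A minimal sketch of the Kalman filtering recursion for the linear dynamical system above (the smoother would add a backward pass; matrix names follow A, C, Gamma, Sigma from the model and are otherwise illustrative):

```python
import numpy as np

def kalman_filter(X, A, C, Gamma, Sigma, mu0, V0):
    """Sequentially update the filtered belief p(z_n | x_1..x_n) = N(mu_n, V_n).
    X: (N, D) observations; A, Gamma: latent dynamics and noise covariance;
    C, Sigma: emission matrix and noise covariance; mu0, V0: initial state prior."""
    N = len(X)
    d = len(mu0)
    mus, Vs = np.zeros((N, d)), np.zeros((N, d, d))
    for n in range(N):
        if n == 0:
            pred_mu, pred_V = mu0, V0                     # prior on z_1
        else:
            pred_mu = A @ mus[n - 1]                      # predict p(z_n | x_1..x_{n-1})
            pred_V = A @ Vs[n - 1] @ A.T + Gamma
        S = C @ pred_V @ C.T + Sigma                      # innovation covariance
        K = pred_V @ C.T @ np.linalg.inv(S)               # Kalman gain
        mus[n] = pred_mu + K @ (X[n] - C @ pred_mu)       # correct with observation x_n
        Vs[n] = (np.eye(d) - K @ C) @ pred_V
    return mus, Vs
```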
Learning in LDS
EM again…
Extension of HMM and LDS
Discrete latent variables: factorized HMMs.
Continuous latent variables: switching Kalman filter models.