Markov Chain Monte Carlo (MCMC)
Posted on 23-Jan-2022
Outline
Monte Carlo: sample from a distribution to estimate the distribution
Markov Chain Monte Carlo (MCMC)
• Applied to clustering, unsupervised learning, Bayesian inference
Importance Sampling
Metropolis-Hastings Algorithm
Gibbs Sampling
Markov Blankets in Sampling for Bayesian Networks
Example: Estimation of a Gaussian Mixture Model
p(x | D) = Σ_{z,θ} p(x, z, θ | D) = ?,   p(z | x, θ) = ?,   p(θ | x, z) = ?
Markov Chain Monte Carlo (MCMC)
Monte Carlo: sample from a distribution
- to estimate the distribution, e.g., for GMM estimation and clustering
(labeling, unsupervised learning)
- to compute maxima and means
Markov Chain Monte Carlo: sampling using "local" information
- a generic "problem-solving technique"
- for decision, inference, optimization, and learning problems
- generic, but not necessarily very efficient
Monte Carlo Integration
General problem: evaluating
E_p[h(X)] = ∫ h(x) p(x) dx, which can be difficult.  (Assume ∫ h(x) p(x) dx < ∞.)
If we can draw samples x^(s) ~ p(x), then we can estimate
E_p[h(X)] ≈ h̄_S = (1/S) Σ_{s=1}^{S} h(x^(s)).
Monte Carlo integration is great if you can sample from the target distribution
• But what if you can't sample from the target?
• Importance sampling: use a simpler proposal distribution
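The estimator above is easy to try numerically. A minimal sketch (not from the slides), using a standard normal target and h(x) = x², for which the true value E_p[h(X)] = Var(X) = 1 is known:

```python
import random

def mc_estimate(h, sampler, n=100_000):
    """Estimate E_p[h(X)] by averaging h over samples drawn from p."""
    return sum(h(sampler()) for _ in range(n)) / n

random.seed(0)
# Example: X ~ N(0, 1) and h(x) = x^2, so E_p[h(X)] = Var(X) = 1.
est = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
print(round(est, 3))
```

The error shrinks as O(1/√n) regardless of the dimension of x, which is what makes Monte Carlo integration attractive.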
Importance Sampling
Idea of importance sampling:
Draw the sample from a proposal distribution q(·) and re-weight the integral using importance weights so that the correct distribution is targeted:
E_p[h(X)] = ∫ h(x) [p(x) / q(x)] q(x) dx = E_q[h(X) p(X) / q(X)].
Hence, given an i.i.d. sample x^(1), …, x^(S) from q, our estimator becomes
Ê_q = (1/S) Σ_{s=1}^{S} h(x^(s)) p(x^(s)) / q(x^(s)).
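A minimal sketch of this re-weighted estimator, assuming for illustration a target p = N(0, 1) that we pretend we cannot sample, and a wider proposal q = N(0, 2):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
S = 100_000
# Draw from the proposal q = N(0, 2); it covers the support of the target p = N(0, 1).
xs = [random.gauss(0.0, 2.0) for _ in range(S)]
# Importance weights w(x) = p(x) / q(x) correct for sampling from the wrong distribution.
ws = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
# Estimate E_p[h(X)] for h(x) = x^2; the true value is 1.
est = sum(w * x * x for w, x in zip(ws, xs)) / S
print(round(est, 3))
```

A proposal that is too narrow relative to p produces a few huge weights and a high-variance estimate, which is the failure mode discussed on the next slide.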
Limitations of Monte Carlo
Direct (unconditional) sampling
• Hard to get rare events in high-dimensional spaces → addressed by Gibbs sampling
Importance sampling
• Does not work well if the proposal q(x) is very different from the target p(x)
• Yet constructing a q(x) similar to p(x) can be difficult → addressed by Markov chains
Intuition: instead of a fixed proposal q(x), what if we could use an adaptive proposal?
• X_{t+1} depends only on X_t, not on X_0, X_1, …, X_{t−1} → a Markov chain
Markov Chains: Notation & Terminology
Countable (finite) state space Ω (e.g., ℕ)
Sequence of random variables X_t on Ω for t = 0, 1, 2, …
Definition: (X_t) is a Markov chain if
P(X_{t+1} = y | X_t = x_t, …, X_0 = x_0) = P(X_{t+1} = y | X_t = x_t)
Notation: P(X_{t+1} = j | X_t = i) = p_ij
- Random walks
Example:
p_AA = P(X_{t+1} = A | X_t = A) = 0.6,  p_AE = P(X_{t+1} = E | X_t = A) = 0.4,
p_EA = P(X_{t+1} = A | X_t = E) = 0.7,  p_EE = P(X_{t+1} = E | X_t = E) = 0.3
Markov Chains: Notation & Terminology
Let T = (p_ij) be the transition probability matrix
- dimension |Ω| × |Ω|
Let π_t(i) = P(X_t = i)
- π_0 : the initial probability distribution
Then π_t(j) = Σ_i π_{t−1}(i) p_ij = (π_{t−1} T)(j) = (π_0 T^t)(j), i.e.,
π_t = π_{t−1} T = π_{t−2} T² = ··· = π_0 T^t
Markov Chains: Fundamental Properties
Theorem:
- If the limit lim_{t→∞} π_t = π exists and Ω is finite, then
  (π T)(j) = π(j) and Σ_j π(j) = 1,
  and such a π is the unique solution to π T = π (π is called a stationary distribution)
- No matter where we start, after some time, we will be in any state j with probability ≈ π(j)
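The theorem can be checked numerically on the two-state example above (a sketch, not from the slides): iterating π_t = π_{t−1} T from any π_0 converges to the stationary distribution π = (7/11, 4/11).

```python
# Transition matrix from the earlier example (states A, E):
# p_AA = 0.6, p_AE = 0.4, p_EA = 0.7, p_EE = 0.3
T = [[0.6, 0.4],
     [0.7, 0.3]]

def step(pi, T):
    """One step of pi_t = pi_{t-1} T (row vector times matrix)."""
    return [sum(pi[i] * T[i][j] for i in range(len(pi))) for j in range(len(T[0]))]

pi = [1.0, 0.0]        # start deterministically in state A
for _ in range(50):    # pi_50 = pi_0 T^50
    pi = step(pi, T)

print(pi)              # approaches the stationary distribution (7/11, 4/11)
```

Starting from (0, 1) instead gives the same limit, illustrating that the stationary π does not depend on π_0.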
Markov Chain Monte Carlo
MCMC algorithms feature adaptive proposals
- Instead of q(x'), they use q(x'|x), where x' is the new state being sampled and x is the previous sample
- As x changes, q(x'|x) can also change (as a function of x')
- The acceptance probability is set to A(x'|x) = min{1, [p(x')/q(x'|x)] / [p(x)/q(x|x')]}
- No matter where we start, after sufficiently many draws, we will be in any state x with probability ≈ p(x)
q(x'|x) = q(x|x') for a Gaussian proposal. Why?
Metropolis-Hastings
Draws a sample x' from q(x'|x), where x is the previous sample
The new sample x' is accepted or rejected with some probability A(x'|x)
• This acceptance probability is A(x'|x) = min{1, [p(x')/q(x'|x)] / [p(x)/q(x|x')]}
• A(x'|x) is like a ratio of importance sampling weights
• p(x')/q(x'|x) is the importance weight for x'; p(x)/q(x|x') is the importance weight for x
• We divide the importance weight for x' by that of x
• Notice that we only need the ratio p(x')/p(x), rather than p(x') or p(x) separately
• A(x'|x) ensures that, after sufficiently many draws, our samples will come from the true distribution p(x)
The MH Algorithm
Initialize the starting state x^(0), set t = 0
Burn-in: while samples have "not converged":
• x = x^(t)
• t = t + 1
• Sample x* ~ q(x*|x) // draw from proposal
• Sample u ~ Uniform(0, 1) // draw acceptance threshold
• If u < A(x*|x) = min{1, [p(x*) q(x|x*)] / [p(x) q(x*|x)]}: set x^(t) = x* // transition
• Else: set x^(t) = x // stay in the current state
Repeat until converged
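The loop above can be sketched in Python. The bimodal target (an equal mixture of N(−2, 1) and N(3, 1)) and the proposal width are assumptions chosen for illustration; with a symmetric Gaussian proposal, q(x*|x) = q(x|x*), so A reduces to min(1, p(x*)/p(x)):

```python
import math
import random

def p(x):
    """Unnormalized bimodal target: equal mixture of N(-2, 1) and N(3, 1)."""
    return math.exp(-0.5 * (x + 2) ** 2) + math.exp(-0.5 * (x - 3) ** 2)

def metropolis_hastings(p, x0=0.0, n=50_000, proposal_sigma=2.0, burn_in=1_000):
    x, samples = x0, []
    for t in range(n):
        x_star = random.gauss(x, proposal_sigma)   # draw x* ~ q(x*|x)
        # Symmetric proposal: the q terms cancel in the acceptance ratio.
        a = min(1.0, p(x_star) / p(x))
        if random.random() < a:                    # u ~ Uniform(0,1); accept if u < A
            x = x_star
        if t >= burn_in:                           # discard burn-in samples
            samples.append(x)
    return samples

random.seed(0)
samples = metropolis_hastings(p)
print(round(sum(samples) / len(samples), 2))  # mixture mean is 0.5; estimate is noisy
```

Note that only the ratio p(x*)/p(x) is needed, so p may be unnormalized, exactly as the slides point out.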
The MH Algorithm
Example:
• Let q(x'|x) be a Gaussian centered on x
• We're trying to sample from a bimodal distribution p(x)
A(x'|x) = min{1, [p(x')/q(x'|x)] / [p(x)/q(x|x')]}
Gibbs Sampling
Gibbs Sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time
โข GS is a special case of the MH algorithm
Consider a factored state space
• x ∈ Ω is a vector x = (x_1, …, x_n)
• Notation: x_{−i} = (x_1, …, x_{i−1}, x_{i+1}, …, x_n)
Gibbs Sampling
The GS algorithm:
1. Suppose the graphical model contains variables x_1, …, x_n
2. Initialize starting values for x_1, …, x_n
3. Do until convergence:
   1. Pick a component i ∈ {1, …, n}
   2. Sample z ~ p(x_i | x_{−i}), and update x_i ← z
When we update x_i, we immediately use its new value when sampling the other variables x_j
Using the full conditional p(x_i | x_{−i}) as the proposal makes the MH acceptance probability equal to 1.
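As a sketch of the algorithm (a textbook illustration, not the slides' example), consider a zero-mean bivariate normal with correlation ρ, where each full conditional p(x_i | x_{−i}) is itself a Gaussian we can sample directly:

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n=50_000, burn_in=1_000):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    Full conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2."""
    sd = math.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0
    samples = []
    for t in range(n):
        x1 = random.gauss(rho * x2, sd)   # sample x1 ~ p(x1 | x2)
        x2 = random.gauss(rho * x1, sd)   # immediately reuse the new x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_normal()
corr = sum(a * b for a, b in samples) / len(samples)
print(round(corr, 2))  # empirical E[x1*x2] should be close to rho = 0.8
```

Every draw is accepted, but with high ρ the chain moves in small correlated steps: high acceptance does not imply fast exploration.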
Markov Blankets
The conditional p(x_i | x_{−i}) can be obtained using the Markov blanket
• Let MB(x_i) be the Markov blanket of x_i; then
p(x_i | x_{−i}) = p(x_i | MB(x_i))
For a Bayesian network, the Markov blanket of x_i is the set containing its parents, children, and co-parents
Gibbs Sampling: An Example
Consider a GMM:
• The data x (positions) are drawn from two Gaussian distributions
• We do NOT know the class y of each data point, nor the parameters of the Gaussian distributions
• Initialize the class of each data point randomly at t = 0
(Figure: a Gaussian with mean (3, −3) and variance 3, and a Gaussian with mean (1, 2) and variance 2)
Gibbs Sampling: An Example
Sampling from p(y_i | x_{−i}, y_{−i}) at t = 1, we compute:
p(y_i = 0 | x_{−i}, y_{−i}) ∝ N(x_i | μ_{−i,0}, σ_{−i,0})
p(y_i = 1 | x_{−i}, y_{−i}) ∝ N(x_i | μ_{−i,1}, σ_{−i,1})
where μ_{−i,K} = MEAN(S_iK), σ_{−i,K} = VAR(S_iK), and S_iK = { x_j | x_j ∈ x_{−i}, y_j = K }
Then update y_i by sampling from p(y_i | x_{−i}, y_{−i}), and repeat for all data points
(Figure: class assignments as i iterates over the data points within the same step t; classes 0 and 1)
Gibbs Sampling: An Example
Now t = 2, and we repeat the procedure to sample a new class for each data point
And similarly for t = 3, 4, …
Gibbs Sampling: An Example
Each data point i's class can be read off from the tendency of its sampled y_i
• The sampled classes may keep oscillating even after sufficiently many iterations
• So we take the most frequently selected class as the final class of each data point
In the simulation, the final class is correct with probability 94.9% at t = 100
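A simplified 1-D re-creation of this experiment (the means ±3, 100 points per class, plug-in empirical means/variances, and 100 sweeps are all assumptions for illustration, not the slides' exact setup):

```python
import math
import random

random.seed(0)
# Two Gaussians with unknown labels (means -3 and +3 chosen for illustration).
data = [random.gauss(-3, 1) for _ in range(100)] + [random.gauss(3, 1) for _ in range(100)]
true_labels = [0] * 100 + [1] * 100
y = [random.randrange(2) for _ in data]        # random class assignment at t = 0

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

counts = [[0, 0] for _ in data]                # per-point tally of sampled labels
for t in range(100):
    for i in range(len(data)):
        likes = []
        for k in (0, 1):
            # empirical mean/variance of the OTHER points currently labeled k
            pts = [data[j] for j in range(len(data)) if j != i and y[j] == k]
            if len(pts) < 2:                   # guard against a near-empty cluster
                pts = data
            mu = sum(pts) / len(pts)
            var = max(sum((p - mu) ** 2 for p in pts) / len(pts), 1e-6)
            likes.append(normal_pdf(data[i], mu, var))
        # sample y_i ~ p(y_i | x, y_-i) and use the new value immediately
        y[i] = 0 if random.random() < likes[0] / (likes[0] + likes[1]) else 1
        counts[i][y[i]] += 1

final = [0 if c[0] > c[1] else 1 for c in counts]   # most frequently chosen class
agree = sum(f == t for f, t in zip(final, true_labels))
acc = max(agree, 200 - agree) / 200            # class ids are only defined up to a swap
print(acc)
```

Because the class ids 0/1 are arbitrary, accuracy is measured up to a label swap; with well-separated clusters the labeling is near-perfect after the symmetry breaks.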
Interim Summary
Markov Chain Monte Carlo methods use adaptive proposals q(x'|x) to sample from the true distribution p(x)
Metropolis-Hastings allows you to specify any proposal q(x'|x)
• But choosing a good q(x'|x) requires care
Gibbs sampling sets the proposal q(x_i'|x_{−i}) to the conditional distribution p(x_i'|x_{−i})
• The acceptance rate is always 1
• But remember that high acceptance usually entails slow exploration
• In fact, there are better MCMC algorithms for certain models