Markov Chain Monte Carlo (MCMC)
Outline
Monte Carlo: sample from a distribution to estimate the distribution
Markov Chain Monte Carlo (MCMC)
- Applied to clustering, unsupervised learning, Bayesian inference
Importance Sampling
Metropolis-Hastings Algorithm
Gibbs Sampling
Markov Blanket in Sampling for Bayesian Networks
Example: Estimation of a Gaussian Mixture Model
p(x|D) = Σ_{z,θ} p(x, z, θ | D) = ?,  p(z | x, θ) = ?,  p(θ | x, z) = ?
Markov Chain Monte Carlo (MCMC)
Monte Carlo: sample from a distribution
- to estimate the distribution, e.g. for GMM estimation and clustering
  (labeling, unsupervised learning)
- to compute a max or a mean
Markov Chain Monte Carlo: sampling using "local" information
- Generic "problem-solving technique" for decision, inference, optimization, and learning problems
- generic, but not necessarily very efficient
Monte Carlo Integration
General problem: evaluating
E_p[f(X)] = ∫ f(x) p(x) dx
can be difficult (assuming ∫ f(x) p(x) dx < ∞).
If we can draw samples x^(s) ~ p(x), then we can estimate
E_p[f(X)] ≈ (1/N) Σ_{s=1}^{N} f(x^(s)).
Monte Carlo integration is great if you can sample from the target distribution
• But what if you can't sample from the target?
• Importance sampling: use a simple proposal distribution instead
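As a quick sketch (a hypothetical example, not from the slides): for X ~ N(0, 1) and f(x) = x², the true value of E_p[f(X)] is Var(X) = 1, and the sample average recovers it:

```python
import random

def mc_expectation(f, sampler, n=100_000):
    """Estimate E_p[f(X)] by averaging f over N samples drawn from p."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# X ~ N(0, 1) and f(x) = x^2, so E_p[f(X)] = Var(X) = 1
est = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

The estimate's error shrinks like 1/√N regardless of the dimension of x, which is what makes Monte Carlo integration attractive.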
Importance Sampling
Idea of importance sampling:
Draw the sample from a proposal distribution q(·) and re-weight the integral using importance weights so that the correct distribution is targeted:
E_p[f(X)] = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[f(X) p(X)/q(X)].
Hence, given an i.i.d. sample x^(s) from q, our estimator becomes
E_q[f(X) p(X)/q(X)] ≈ (1/N) Σ_{s=1}^{N} f(x^(s)) p(x^(s)) / q(x^(s)).
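A sketch of this estimator under hypothetical choices (target p = N(0, 1), a wider proposal q = N(0, 2), and f(x) = x², so the true value is again 1):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, p_pdf, q_pdf, q_sampler, n=200_000):
    """E_p[f(X)] ~ (1/N) * sum of f(x_s) p(x_s)/q(x_s), with x_s drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sampler()
        total += f(x) * p_pdf(x) / q_pdf(x)   # re-weight each sample by p/q
    return total / n

random.seed(1)
est = importance_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),   # target p(x)
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # proposal q(x)
    q_sampler=lambda: random.gauss(0.0, 2.0),
)
```

Note the proposal is deliberately wider than the target; if q had lighter tails than p, the weights p/q could blow up and the estimator's variance could be huge.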
Limitations of Monte Carlo
Direct (unconditional) sampling
• Hard to get rare events in high-dimensional spaces → Gibbs sampling
Importance sampling
• Does not work well if the proposal q(x) is very different from the target p(x)
• Yet constructing a q(x) similar to p(x) can be difficult → Markov chain
Intuition: instead of a fixed proposal q(x), what if we could use an adaptive proposal?
• Q_{t+1} depends only on Q_t, not on Q_0, Q_1, …, Q_{t−1} → Markov chain
Markov Chains: Notation & Terminology
Countable (finite) state space Ω (e.g. ℕ)
Sequence of random variables X_t on Ω for t = 0, 1, 2, …
Definition: X_t is a Markov chain if
P(X_{t+1} = y | X_t = x_t, …, X_0 = x_0) = P(X_{t+1} = y | X_t = x_t)
Notation: P(X_{t+1} = j | X_t = i) = p_ij
- Random walks
Example (two states A and E):
p_AA = P(X_{t+1} = A | X_t = A) = 0.6
p_AE = P(X_{t+1} = E | X_t = A) = 0.4
p_EA = P(X_{t+1} = A | X_t = E) = 0.7
p_EE = P(X_{t+1} = E | X_t = E) = 0.3
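The two-state example can be simulated directly; a minimal sketch with the transition probabilities above:

```python
import random

# Transition probabilities from the two-state example above
P = {"A": {"A": 0.6, "E": 0.4},
     "E": {"A": 0.7, "E": 0.3}}

def step(state):
    """Sample the next state from the current state's transition row."""
    return "A" if random.random() < P[state]["A"] else "E"

random.seed(0)
state = "A"
visits = {"A": 0, "E": 0}
for _ in range(100_000):
    state = step(state)
    visits[state] += 1

frac_A = visits["A"] / 100_000
# The long-run fraction of time spent in A approaches 0.7/1.1, about 0.636
```

The empirical fraction of time spent in each state converges to the stationary distribution discussed on the following slides.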
Markov Chains: Notation & Terminology
Let P = (p_ij) be the transition probability matrix
- dimension |Ω| × |Ω|
Let π_t(i) = P(X_t = i)
- π_0 : initial probability distribution
Then π_t(j) = Σ_i π_{t−1}(i) p_ij = (π_{t−1} P)(j) = (π_0 P^t)(j)
π_t = π_{t−1} P = π_{t−2} P² = ··· = π_0 P^t
Markov Chains: Fundamental Properties
Theorem:
- If the limit lim_{t→∞} π_t = π exists and Ω is finite, then
  (πP)(j) = π(j) and Σ_j π(j) = 1,
  and such a π is the unique solution to πP = π (π is called a stationary distribution)
- No matter where we start, after some time, we will be in any state j with probability ≈ π(j)
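The theorem can be checked numerically: iterating π_t = π_{t−1}P drives any initial distribution to the stationary one. A sketch for the two-state chain of the earlier example:

```python
def mat_vec(pi, P):
    """Row vector times matrix: (pi P)(j) = sum over i of pi(i) * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0.6, 0.4],   # transitions from A
     [0.7, 0.3]]   # transitions from E

pi = [1.0, 0.0]    # start surely in state A
for _ in range(50):        # pi_t = pi_0 P^t
    pi = mat_vec(pi, P)

# pi converges to the stationary distribution (7/11, 4/11),
# which solves pi P = pi with the entries summing to 1
```

Starting from [0.0, 1.0] instead gives the same limit, illustrating that the limit does not depend on the initial state.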
Markov Chain Monte Carlo
MCMC algorithms feature adaptive proposals
- Instead of q(x′), they use q(x′|x), where x′ is the new state being sampled and x is the previous sample
- As x changes, q(x′|x) can also change (as a function of x′)
- The acceptance probability is set to
  A(x′|x) = min(1, [p(x′)/q(x′|x)] / [p(x)/q(x|x′)])
- No matter where we start, after some time, we will be in any state x with probability ≈ p(x)
q(x′|x) = q(x|x′) for a Gaussian proposal centered on x. Why?
Metropolis-Hastings
Draws a sample x′ from q(x′|x), where x is the previous sample
The new sample x′ is accepted or rejected with some probability A(x′|x)
• This acceptance probability is
  A(x′|x) = min(1, [p(x′)/q(x′|x)] / [p(x)/q(x|x′)])
• A(x′|x) is like a ratio of importance sampling weights
• p(x′)/q(x′|x) is the importance weight for x′, and p(x)/q(x|x′) is the importance weight for x
• We divide the importance weight for x′ by that of x
• Notice that we only need to compute the ratio p(x′)/p(x) rather than p(x′) or p(x) separately
• A(x′|x) ensures that, after sufficiently many draws, our samples will come from the true distribution p(x)
The MH Algorithm
Initialize starting state x^(0); set t = 0
Burn-in: while samples have "not converged":
• x = x^(t)
• t = t + 1
• Sample x* ~ q(x*|x) // draw from proposal
• Sample u ~ Uniform(0,1) // draw acceptance threshold
• If u < A(x*|x) = min(1, [p(x*) q(x|x*)] / [p(x) q(x*|x)]), then x^(t) = x* // transition
• Else x^(t) = x // stay in current state
• Repeat until convergence
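The loop above can be sketched in Python. This is a hypothetical example, not from the slides: the target is an unnormalized bimodal mixture, and because the Gaussian proposal is symmetric, q cancels and A reduces to min(1, p(x*)/p(x)):

```python
import math
import random

def p_unnorm(x):
    """Unnormalized bimodal target: mixture of N(-2, 1) and N(2, 1)."""
    return math.exp(-0.5 * (x + 2) ** 2) + math.exp(-0.5 * (x - 2) ** 2)

def metropolis_hastings(p, x0=0.0, n=50_000, step=2.0, burn_in=1_000):
    samples, x = [], x0
    for t in range(n + burn_in):
        x_star = random.gauss(x, step)           # symmetric proposal q(x*|x)
        # For a symmetric proposal q cancels: A = min(1, p(x*)/p(x))
        if random.random() < min(1.0, p(x_star) / p(x)):
            x = x_star                           # accept the move
        if t >= burn_in:                         # otherwise keep the current x
            samples.append(x)
    return samples

random.seed(0)
samples = metropolis_hastings(p_unnorm)
mean = sum(samples) / len(samples)                        # symmetric target: near 0
frac_right = sum(s > 0 for s in samples) / len(samples)   # both modes visited
```

Note that only the unnormalized p is needed, since A depends on the ratio p(x*)/p(x). The step size is a tuning choice: too small and the chain rarely crosses between the two modes; too large and most proposals are rejected.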
The MH Algorithm
Example:
• Let q(x′|x) be a Gaussian centered on x
• We're trying to sample from a bimodal distribution p(x)
• A(x′|x) = min(1, [p(x′)/q(x′|x)] / [p(x)/q(x|x′)])
Gibbs Sampling
Gibbs sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time
• GS is a special case of the MH algorithm
Consider a factored state space:
• x ∈ Ω is a vector x = (x_1, …, x_n)
• Notation: x_{−i} = (x_1, …, x_{i−1}, x_{i+1}, …, x_n)
Gibbs Sampling
The GS algorithm:
1. Suppose the graphical model contains variables x_1, …, x_n
2. Initialize starting values for x_1, …, x_n
3. Do until convergence:
   1. Pick a component i ∈ {1, …, n}
   2. Sample a value z ~ p(x_i|x_{−i}), and update x_i ← z
When we update x_i, we immediately use its new value for sampling other variables x_j
With the proposal p(x_i|x_{−i}), the MH acceptance probability A(x′|x) = min(1, [p(x′)/q(x′|x)] / [p(x)/q(x|x′)]) is always 1.
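A minimal sketch of the algorithm, for a case where both full conditionals are known in closed form (a standard bivariate normal with correlation ρ, a setup assumed here rather than taken from the slides):

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n=50_000, burn_in=1_000):
    """Gibbs sampling for a standard bivariate normal with correlation rho.
    Both full conditionals are Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2)."""
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho ** 2)
    samples = []
    for t in range(n + burn_in):
        x1 = random.gauss(rho * x2, sd)   # sample x1 ~ p(x1 | x2)
        x2 = random.gauss(rho * x1, sd)   # immediately reuse the new x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_normal()
corr = sum(a * b for a, b in samples) / len(samples)   # E[x1 * x2] = rho
```

Each update immediately uses the freshly sampled value, as noted above, and every draw is accepted; the cost is that strongly correlated components make the chain move slowly.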
Markov Blankets
The conditional p(x_i|x_{−i}) can be obtained using the Markov blanket:
• Let MB(x_i) be the Markov blanket of x_i; then
  p(x_i|x_{−i}) = p(x_i|MB(x_i))
For a Bayesian network, the Markov blanket of x_i is the set containing its parents, children, and co-parents
Gibbs Sampling: An Example
Consider a GMM:
• The data x (positions) are drawn from two Gaussian distributions
  - Gaussian with mean (3, −3), variance 3
  - Gaussian with mean (1, 2), variance 2
• We do NOT know the class y of each datum, nor the parameters of the Gaussian distributions
• Initialize the class of each datum randomly at t = 0
Gibbs Sampling: An Example
Sampling p(y_i|x_{−i}, y_{−i}) at t = 1, we compute:
p(y_i = 0 | x_{−i}, y_{−i}) ∝ N(x_i | μ_{x_{−i},0}, σ_{x_{−i},0})
p(y_i = 1 | x_{−i}, y_{−i}) ∝ N(x_i | μ_{x_{−i},1}, σ_{x_{−i},1})
where μ_{x_{−i},K} = MEAN(S_{iK}), σ_{x_{−i},K} = VAR(S_{iK}), and S_{iK} = {x_j | x_j ∈ x_{−i}, y_j = K}
Then update y_i with a draw from p(y_i|x_{−i}, y_{−i}), and repeat for all data (iterating over i at the same t)
Gibbs Sampling: An Example
Now t = 2, and we repeat the procedure to sample a new class for each datum
And similarly for t = 3, 4, …
Gibbs Sampling: An Example
• Each datum i's class is drawn with a tendency to agree with the current classes y_{−i} of the other data
• The classes of the data may keep oscillating even after sufficiently many iterations
• We can take the most frequently selected class as the final class of each datum
In the simulation, the final class is correct with probability 94.9% at t = 100
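A one-dimensional analogue of this simulation can be sketched as follows (the means, variances, sample sizes, and number of sweeps here are hypothetical, not the slides' 2-D setting):

```python
import math
import random

random.seed(0)
# Two well-separated 1-D Gaussians; the labels are unknown to the sampler
data = ([random.gauss(-4.0, 1.0) for _ in range(100)] +
        [random.gauss(4.0, 1.0) for _ in range(100)])
y = [random.randint(0, 1) for _ in data]   # random classes at t = 0

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

for t in range(60):                         # Gibbs sweeps
    for i in range(len(data)):
        probs = []
        for k in (0, 1):
            # empirical mean/variance of the OTHER points currently labeled k
            pts = [data[j] for j in range(len(data)) if j != i and y[j] == k]
            if len(pts) < 2:
                probs.append(1e-12)
                continue
            mu = sum(pts) / len(pts)
            var = max(sum((p - mu) ** 2 for p in pts) / len(pts), 1e-6)
            probs.append(normal_pdf(data[i], mu, var))
        # draw y_i from p(y_i | x_{-i}, y_{-i})
        y[i] = 0 if random.random() < probs[0] / (probs[0] + probs[1]) else 1

# Up to a label swap, the sampled classes separate the two clusters
match = sum(y[i] == 0 for i in range(100)) + sum(y[i] == 1 for i in range(100, 200))
accuracy = max(match, 200 - match) / 200
```

The two possible labelings (0/1 swapped) are equally likely, which is why the accuracy is measured up to a label swap; this is the usual label-switching issue in mixture models.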
Interim Summary
Markov Chain Monte Carlo methods use adaptive proposals q(x′|x) to sample from the true distribution p(x)
Metropolis-Hastings allows you to specify any proposal q(x′|x)
• But choosing a good q(x′|x) requires care
Gibbs sampling sets the proposal q(x_i′|x_{−i}) to the conditional distribution p(x_i′|x_{−i})
• Acceptance rate is always 1
• But remember that high acceptance usually entails slow exploration
• In fact, there are better MCMC algorithms for certain models