Short Course: Markov chains & mixing times
Lecture 1: Yuval Peres, Microsoft Research

TRANSCRIPT

Introduction
Definition of Markov chains
• A process (X_t)_{t=0,1,...} is a (time-homogeneous) Markov chain if
  P(X_{k+1} = y | X_0 = x_0, . . . , X_{k−1} = x_{k−1}, X_k = x) = p(x, y) ,
  where p(x, y) ≥ 0 and Σ_y p(x, y) = 1 for all x.
• The transition kernel is P = (p(x, y))_{x,y}.
• The stationary distribution π satisfies πP = π.
• Aperiodic: gcd{t ≥ 1 : P^t(x, x) > 0} = 1 for all x.
• Irreducible: for all x, y, there exists t ∈ ℕ such that P^t(x, y) > 0.
• Reversible: π(x)p(x, y) = π(y)p(y, x) for all x, y.
Most examples in this talk are aperiodic, irreducible and reversible.
Classical theory vs. modern focus
• An important feature of aperiodic and irreducible (finite) Markov chains: the distribution at time t converges to the (unique) stationary distribution as t → ∞.
• Classical theory: Fix the chain and study the rate of convergence of the distribution at time t to stationarity, as t → ∞.
• Modern focus: Fix the target distance to stationarity and study the asymptotics of the required time to reach that target, for a family of chains as the size goes to infinity. (Aldous, Diaconis, . . . )
Motivations
• Statistical physics.
  • Monte Carlo simulation.
  • Models of dynamical processes.
  • Deep connections between the convergence rate and the spatial properties of the physical model.
• Biology.
  • Models of DNA evolution.
  • A much simplified chain of Durrett: Given a permutation, take a random segment of fixed length and reverse it.
• Computer science.
  • Sampling.
  • Approximate counting of combinatorial structures.
• Card shuffling.
• . . .
More on sampling
• Random permutation: an example that is easy to sample directly.
• Knuth Algorithm: For i = 1, . . . , n − 1, select U_i out of {i, . . . , n} uniformly at random and swap the two elements at positions i and U_i.
• Many other models are hard to sample directly.
  • Ising model: more in later lectures.
  • Coloring model: a uniform measure on all the proper colorings of a graph with q colors, where in a proper coloring any two adjacent vertices are assigned different colors.
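As a quick illustration, the Knuth (Fisher-Yates) algorithm above can be sketched in Python; the function name and the 0-indexed loop are my choices, not from the slide:

```python
import random

def knuth_shuffle(n):
    """Return a uniformly random permutation of 0..n-1 (Knuth / Fisher-Yates).

    0-indexed version of the slide: for i = 0, ..., n-2, pick U_i uniformly
    from {i, ..., n-1} and swap the entries at positions i and U_i.
    """
    perm = list(range(n))
    for i in range(n - 1):
        u = random.randrange(i, n)
        perm[i], perm[u] = perm[u], perm[i]
    return perm

print(knuth_shuffle(10))
```

Each of the n! permutations is produced with probability 1/n!, since the i-th draw is uniform over the n − i remaining choices.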
An illustration of coloring model
[Figure: a proper q-coloring of a graph, adjacent vertices receiving different colors. Omitted.]
Total variation distance
For two distributions µ and ν on a (discrete) space Ω, the total variation distance ‖µ − ν‖TV is defined in the following equivalent ways:
• ‖µ − ν‖TV := (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|.
• ‖µ − ν‖TV := sup_{A⊂Ω} (µ(A) − ν(A)).
• For all couplings (X, Y) where X ∼ µ and Y ∼ ν, we have P(X ≠ Y) ≥ ‖µ − ν‖TV. Furthermore, there exists a coupling such that equality holds.
An important feature: advancing a Markov chain can only decrease the total variation distance. That is,
  ‖µP − νP‖TV ≤ ‖µ − ν‖TV .
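The first two equivalent definitions are easy to check numerically; here is a small sketch (the function names are mine):

```python
def tv_distance(mu, nu):
    """Total variation distance via the half-L1 formula.
    mu and nu are dicts mapping states to probabilities."""
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in states)

def tv_distance_sup(mu, nu):
    """Equivalent form: the maximizing event is A = {x : mu(x) > nu(x)},
    so sup_A (mu(A) - nu(A)) is the sum of the positive parts of mu - nu."""
    states = set(mu) | set(nu)
    return sum(max(mu.get(x, 0.0) - nu.get(x, 0.0), 0.0) for x in states)

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.25, "b": 0.25, "c": 0.5}
print(tv_distance(mu, nu), tv_distance_sup(mu, nu))  # both equal 0.5
```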
Mixing time and its first property
• We are interested in the decay of d(t) := max_{x∈Ω} ‖P^t(x, ·) − π‖TV.
• ‖µP − νP‖TV ≤ ‖µ − ν‖TV ⇒ d(t) is non-increasing in t.
• The mixing time tMIX(ε) := min{t ≥ 0 : d(t) ≤ ε}, for some ε ∈ (0, 1). Furthermore, tMIX := tMIX(1/4) by convention.
• Since d(t + s) ≤ 2d(t)d(s), we have tMIX(2^{−k}) ≤ k · tMIX.
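For a small chain, d(t) and tMIX can be computed by brute force. A sketch for the lazy walk on the n-cycle (the chain studied later in this lecture); the function names are mine:

```python
def lazy_cycle_kernel(n):
    """Transition matrix of the lazy simple random walk on the n-cycle:
    stay with probability 1/2, move to each neighbor with probability 1/4."""
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        P[x][x] += 0.5
        P[x][(x + 1) % n] += 0.25
        P[x][(x - 1) % n] += 0.25
    return P

def step(mu, P):
    """One step of the chain: row vector mu times matrix P."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

def mixing_time(P, eps=0.25):
    """Smallest t with ||P^t(0,.) - pi||_TV <= eps; on the cycle all
    starting states are equivalent by symmetry, so this equals t_MIX(eps)."""
    n = len(P)
    pi = [1.0 / n] * n            # uniform is stationary for this walk
    mu = [1.0] + [0.0] * (n - 1)  # start at state 0
    t = 0
    while 0.5 * sum(abs(mu[y] - pi[y]) for y in range(n)) > eps:
        mu = step(mu, P)
        t += 1
    return t

print(mixing_time(lazy_cycle_kernel(16)))
```

Doubling n should roughly quadruple the printed value, in line with tMIX ≍ n² for the cycle.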
Mixing time vs. relaxation time
• The transition kernel P always has 1 as the largest eigenvalue.
• Define λ⋆ := max{|λ| : λ is an eigenvalue of P, λ ≠ 1} and the spectral gap 1 − λ⋆. Furthermore, define the relaxation time tREL := 1/(1 − λ⋆).
• An important relation between the relaxation time and the mixing time:
  (tREL − 1) log(1/2ε) ≤ tMIX(ε) ≤ log(1/(ε πmin)) tREL ,
  where πmin = min_{x∈Ω} π(x).
• Seemingly, tREL and tMIX are roughly the same. However, the possible difference between tREL and tMIX can reveal deep features of a Markov chain.
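As a sanity check of this sandwich inequality, here is a sketch for the lazy walk on the n-cycle, using the second eigenvalue λ2 = (1 + cos(2π/n))/2 stated on the next slide (the function name is mine):

```python
from math import cos, pi, log

def t_rel_lazy_cycle(n):
    """Relaxation time 1/(1 - lambda_2) for the lazy walk on the n-cycle,
    using the known second eigenvalue lambda_2 = (1 + cos(2*pi/n))/2.
    All eigenvalues of this lazy chain are non-negative, so lambda_* = lambda_2."""
    lam2 = (1.0 + cos(2.0 * pi / n)) / 2.0
    return 1.0 / (1.0 - lam2)

n, eps = 64, 0.25
t_rel = t_rel_lazy_cycle(n)
pi_min = 1.0 / n  # stationary distribution is uniform on the cycle
lower = (t_rel - 1.0) * log(1.0 / (2.0 * eps))
upper = t_rel * log(1.0 / (eps * pi_min))
print(lower, upper)  # t_MIX(1/4) is squeezed between these two numbers
```

Since t_rel ≈ n²/π², both bounds are of order n² up to a log n factor, consistent with tMIX ≍ n² for the cycle.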
Random walks on cycles and hypercubes
A chain is called lazy if P(x, x) ≥ 1/2 for all x. All the eigenvalues of a lazy chain are non-negative.
• Lazy random walk on a cycle of length n.
  • The second eigenvalue is λ2 = (1 + cos(2π/n))/2, and the corresponding eigenfunction is f2(k) = cos(2πk/n).
  • The relaxation time tREL ≍ n².
  • The mixing time tMIX ≍ n².
• Lazy random walk on the hypercube {−1, 1}^n.
  • The second eigenvalue is λ2 = 1 − 1/n, and one of the corresponding eigenfunctions is f2 = Σ_{i=1}^n x_i, where x_i ∈ {−1, 1} denotes the i-th coordinate.
  • The relaxation time tREL = n.
  • The mixing time tMIX ≍ n log n.
• In the cycle, the relaxation time and the mixing time have the same order, while in the hypercube the mixing time is of larger magnitude.
Couplings of Markov Chains
Let P be a transition matrix for a Markov chain. A coupling of a P-Markov chain started at x and a P-Markov chain started at y is a sequence (X_n, Y_n)_{n=0}^∞ such that
• all variables X_n and Y_n are defined on the same probability space,
• (X_n) is a P-Markov chain started at x, and
• (Y_n) is a P-Markov chain started at y.
Example: The lazy random walk on the n-cycle.
• This chain remains at its current position with probability 1/2, and moves to each of the two adjacent sites with probability 1/4.
• Can couple the chains started from x and y as follows:
  • Flip a fair coin to decide if the X-chain moves or the Y-chain moves.
  • Move the selected chain to one of its two neighboring sites, chosen with equal probability.
• Both the x-particle and the y-particle are performing lazy simple random walks on the n-cycle.
Mixing and Coupling
• Let (X_t, Y_t)_{t=0}^∞ be a coupling of a P-chain started from x and a P-chain started at y.
• Let
  τ := min{t ≥ 0 : X_t = Y_t} .
  If the coupling is Markovian (and we will only consider those), it can always be redefined so that X_t = Y_t for t ≥ τ. So, let us assume this.
[Figure: two coupled trajectories started from x and from y that coalesce at time τ and move together afterwards. Omitted.]
• The pair (X_t, Y_t) (for given t) is a coupling of P^t(x, ·) and P^t(y, ·).
Mixing and Coupling
• Since X_t has distribution P^t(x, ·) and Y_t has distribution P^t(y, ·), using the coupling characterization of total variation distance,
  P(τ > t) = P(X_t ≠ Y_t) ≥ dTV(P^t(x, ·), P^t(y, ·)) .
• Combined with the inequality
  dTV(P^t(x, ·), π) ≤ max_{y∈Ω} dTV(P^t(x, ·), P^t(y, ·)) ,
  if there is a coupling (X_t, Y_t) for every pair of initial states (x, y), then this shows that
  d(t) = max_{x∈Ω} dTV(P^t(x, ·), π) ≤ max_{x,y} dTV(P^t(x, ·), P^t(y, ·)) ≤ max_{x,y} P_{x,y}(τ > t) .
Mixing for lazy random walk on the n-cycle
• Use the coupling which selects at each move one of the "particles" at random; the chosen particle is equally likely to move clockwise as counter-clockwise.
• The clockwise difference between the particles, D_t, is a simple random walk on {0, 1, . . . , n}.
• When D_t ∈ {0, n}, the two particles have collided.
• If τ is the time until a simple random walk on {0, 1, . . . , n} hits an endpoint when started at k, then
  E_k τ = k(n − k) ≤ n²/4 .
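The identity E_k τ = k(n − k) is easy to check by simulation; a Monte Carlo sketch (function name and trial count are my choices):

```python
import random

def mean_hitting_time(n, k, trials=4000):
    """Monte Carlo estimate of E_k[tau]: the expected time for a simple
    random walk on {0, 1, ..., n} started at k to hit 0 or n."""
    total = 0
    for _ in range(trials):
        pos, t = k, 0
        while 0 < pos < n:
            pos += random.choice((-1, 1))  # one step of the difference walk
            t += 1
        total += t
    return total / trials

n, k = 10, 3
print(mean_hitting_time(n, k), "vs exact", k * (n - k))  # estimate near 21
```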
RW on n-cycle, continued
• By Markov's inequality,
  P(τ > t) ≤ Eτ/t ≤ n²/(4t) .
• Using the coupling inequality,
  d(t) ≤ max_{x,y} P_{x,y}(τ > t) ≤ n²/(4t) .
• Taking t ≥ n² yields d(t) ≤ 1/4, whence
  tMIX ≤ n² .
Random Walk on d-dimensional Torus
• Ω = (ℤ/nℤ)^d. The walk remains at its current position with probability 1/2.
• Couple two particles as follows:
  • Select one of the d coordinates at random.
  • If the particles agree in the selected coordinate, move the walks together in this coordinate: both walks either make a clockwise move, a counterclockwise move, or remain put.
  • If the particles disagree in the chosen coordinate, flip a coin to decide which walker will move. Move the selected walk either clockwise or counterclockwise, each with probability 1/2.
[Figure: one step of the coupling of two walks x and y on the torus (ℤ/nℤ)^6, in which a single selected coordinate is updated. Omitted.]
• Consider the clockwise difference between the i-th coordinates of the two particles. It moves at rate 1/d, and when it does move, it performs simple random walk on {0, 1, . . . , n}, with absorption at 0 and n. Thus the expected time to couple the i-th coordinate is bounded above by dn²/4.
• Since there are d coordinates, the expected time for all of them to couple is not more than
  d × dn²/4 = d²n²/4 .
• By the coupling theorem,
  tMIX ≤ d²n² .
Exercise: Improve the d-dependence in this bound.
RW on hypercube
[Figure: one coupling step. The same coordinate is selected for updating in both walks, and a coin is tossed to decide the replacement bit. Omitted.]
• Consider the lazy random walk on the hypercube {0, 1}^n. Sites are neighbors if they differ in exactly one coordinate.
• To update the two walks, first pick a coordinate at random. The same coordinate is used for both walks.
• Toss a coin to determine if the bit at the chosen coordinate is replaced by a 1 or a 0. The same bit is used for both walks.
• No matter the initial positions of the two walks, once every coordinate has been selected, the two walks agree.
• Reduces to a "coupon collector's" problem: how many times must a coordinate be drawn at random before every coordinate is chosen?
Coupon collector
• Let A_k(t) be the event that the k-th coupon has not been collected by time t.
• Observe
  P(A_k(t)) = (1 − 1/n)^t ≤ e^{−t/n} .
• Consequently,
  P(⋃_{k=1}^n A_k(t)) ≤ Σ_{k=1}^n e^{−t/n} = n e^{−t/n} .
• In other words, if τ is the time until all coupons have been collected,
  P(τ > n log n + cn) = P(⋃_{k=1}^n A_k(n log n + cn)) ≤ e^{−c} .
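The tail bound P(τ > n log n + cn) ≤ e^{−c} can be checked by simulation; a sketch with parameter choices of my own:

```python
import random
from math import log, exp

def coupon_time(n):
    """Number of uniform draws from {0, ..., n-1} until all n values appear."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        t += 1
    return t

n, c, trials = 50, 2.0, 4000
threshold = n * log(n) + c * n
tail = sum(coupon_time(n) > threshold for _ in range(trials)) / trials
print(tail, "<=", exp(-c))  # empirical tail vs the union bound e^{-c}
```

The empirical tail typically lands just below e^{−c}, as expected: the union bound is close to sharp here (the limiting tail is 1 − exp(−e^{−c})).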
Returning to the hypercube,
d(n log n + cn) ≤ P(τ > n log n + cn) ≤ e^{−c} ,
whence
tMIX(ε) ≤ n log n + n log(1/ε) .
Strong stationary times
• Random mapping representation of a Markov chain P: an i.i.d. sequence (Z_t) and a map f such that X_t = f(X_{t−1}, Z_t), with X_0 = x.
• The sequence (Z_t) carries more information than the chain (X_t).
• A randomized stopping time is a stopping time for the sequence (Z_t). Not necessarily a stopping time of (X_t)!
• A strong stationary time is a randomized stopping time τ such that X_τ has distribution π and is independent of τ. That is,
  P_x(τ = t, X_τ = y) = P_x(τ = t) π(y) .
Proposition
If τ is a strong stationary time, then
  d(t) = max_x ‖P^t(x, ·) − π‖TV ≤ max_x P_x(τ > t) .
Top-to-random shuffle
• Top-to-random shuffle: take the top card and insert it uniformly at random in the deck.
• Strong stationary time τtop: one move after the first occasion on which the original bottom card has reached the top of the deck.
• τtop has the same law as the coupon collector's time, and hence P(τtop > n log n + cn) ≤ e^{−c}.
• tMIX(ε) ≤ n log n + log(ε^{−1}) n.
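The strong stationary time τtop can be simulated directly (a sketch; the function name is mine):

```python
import random

def tau_top(n):
    """Top-to-random shuffle on n cards: return the strong stationary time,
    i.e. one move after the original bottom card first reaches the top."""
    deck = list(range(n))   # deck[0] is the top card
    bottom = deck[-1]       # remember the original bottom card
    t = 0
    while deck[0] != bottom:
        card = deck.pop(0)                      # take the top card...
        deck.insert(random.randrange(n), card)  # ...insert uniformly (n slots)
        t += 1
    return t + 1            # one more move inserts the bottom card uniformly

print(tau_top(52))
```

The bottom card rises by one position each time a card is inserted below it, which happens with the coupon-collector probabilities 1/n, 2/n, . . . , matching the claim on the slide.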
Random walk on the hypercube - revisited
• The lazy walk on the hypercube can be viewed as a dynamics on {−1, 1}^n, where at each step a coordinate is selected uniformly at random and its value is refreshed uniformly at random.
• The strong stationary time τrefresh: the first time at which every coordinate has been selected at least once for updating.
• τrefresh is the same as the coupon collector's time.
• tMIX(ε) ≤ n log n + log(ε^{−1}) n.
• Is this tight?
Answer: In fact tMIX = (1/2) n log n + O(n), and there is a cutoff.