Random discrete structures (MA3H4)

Amin Coja-Oghlan
University of Warwick, Mathematics
Zeeman building B2.23
[email protected]

February 1, 2012

Abstract

Random discrete structures such as random graphs or matrices play a crucial role in discrete mathematics, because they enjoy properties that are difficult (or impossible) to obtain via deterministic constructions. For example, random structures are essential in the design of algorithms or error correcting codes. Furthermore, random discrete structures can be used to model a large variety of objects in physics, biology, or computer science (e.g., social networks). The goal of this course is to convey the most important models and the main analysis techniques, as well as a few instructive applications. Topics include

• fundamentals of discrete probability distributions,

• techniques for the analysis of rare events,

• random trees and graphs,

• applications in statistical mechanics,

• sampling and rapid mixing,

• applications in efficient decoding.



1 Introduction and motivation

The theory of random discrete structures is a relatively new area of mathematics. It was "founded" in the 1950s and 60s by the Hungarian mathematicians Paul Erdos and Alfred Renyi. Their primary motivation was to address problems in combinatorics, particularly extremal combinatorics. In that area, it is often necessary to produce discrete objects (e.g., graphs) with properties that seem contradictory. We will see an example below, the lower bound on the Ramsey number. Coming up with an explicit or 'deterministic' construction of such objects is often difficult, precisely because the desired properties of the object appear contradictory. The stroke of genius that marks the beginning of the theory of random discrete structures is that sometimes there is a simple random experiment whose likely outcome is the desired object. This way of producing discrete objects with particular properties has become known as the probabilistic method.

The probabilistic method has since become a key tool in far broader areas than merely extremal combinatorics. Modern applications abound in areas ranging from information theory (the design of efficient codes) to computer science (algorithm design, complexity theory). We will discuss (some of) the techniques and various applications of the probabilistic method in due course.

A second area where random discrete structures play an important role is statistical mechanics. This is a branch of physics that is concerned with the interaction of myriads of tiny particles, with the goal of studying the macroscopic phenomena that this interaction produces. The system is usually modeled probabilistically. That is, either the particles and the interactions between them are described by a random discrete structure, or the interactions are described by a probability distribution on a given discrete structure. An example is the Ising model of ferromagnetism.

Their importance as models in statistical mechanics motivates the study of random discrete structures as mathematical objects. At the same time, physicists have developed a host of ingenious ideas and methods, which are, however, often mathematically non-rigorous. Thus, this area exhibits the well-known 'division of labour' between mathematicians and physicists, with complementary insights and techniques.

Sections 2, 3, 4, 7, 8, 9.3, and 11 are based on [1]. The material of Section 5 is based on [5] and [7]. Section 10 follows [6]. Moreover, Section 12 is based on [5, 9]. Only a very basic knowledge of probability theory will be assumed throughout the course.

2 A first example: lower bounds on Ramsey numbers

Ramsey's theorem is one of the cornerstones of extremal combinatorics. To state the theorem, we need to recap a few concepts from graph theory. Recall that a graph is a pair $G = (V, E)$ consisting of a finite set $V$ of vertices and a set $E$ of edges, each of which is a set containing two elements of $V$. A set $S \subset V$ is a clique if $\{v, w\} \in E$ for any two (distinct) $v, w \in S$. Moreover, a set $T \subset V$ is independent if $\{v, w\} \notin E$ for any two $v, w \in T$. Now, Ramsey's theorem can be stated as follows.

Theorem 2.1 For any s ≥ 1 there is a number r such that any graph on at least r vertices contains either a clique or an independent set of size s.

A proof of this theorem will be given in the seminar. The obvious question associated with Ramsey's theorem is how big r has to be (as a function of s). More precisely, we let R(s) be the minimum r so that any graph on at least r vertices contains either a clique or an independent set of size s. Then the problem is to figure out R(s).

There are several possible ways to answer this question. The most ambitious one would be an explicit formula that yields R(s) for any value of s. Such a formula is not known. In fact, if one exists, it may well be so complicated as to be rather unilluminating. A second possibility is to figure out R(s) for 'small' s. For instance, one could hope for a fast algorithm that computes R(s). Such an algorithm is not known. In fact, for small s the only values of R(s) that are known exactly are R(2) = 2, R(3) = 6, R(4) = 18. The Ramsey number R(5) is unknown, although 43 ≤ R(5) ≤ 49. With respect to an algorithm for computing the Ramsey number, the state of affairs was described neatly by Joel Spencer, who attributes the metaphor to Paul Erdos:


Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5) or they will destroy our planet. In that case, we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6). In that case, we should attempt to destroy the aliens.

This scepticism is due to the sheer number of possible graphs on a given number n of vertices, namely
\[ 2^{\binom{n}{2}} = 2^{n(n-1)/2}. \]
Essentially the only known way to compute R(s) explicitly is by inspecting all graphs on n vertices. For instance, to check whether R(5) = 43, the number of possible graphs that one would have to check is
\[ 2^{\binom{43}{2}} = 2^{903}, \]
a number with 272 decimal digits. Even if a billion possibilities could be checked in a second, this would take $10^{255}$ years. Thus, in the absence of a far smarter method for finding R(s), the exact computation of these numbers is hopeless.

Instead, what one could hope for is reasonably tight upper and lower bounds on R(s). Ideally, these should describe exactly how R(s) scales as a function of s as s → ∞. What we will see is that
\[ 2^{s/2} \le R(s) \le 2^{2s}. \]
You will see the proof of the upper bound in the seminar. The proof of the lower bound, which is arguably the far greater challenge, is the first example of the use of the probabilistic method in extremal combinatorics (Erdos 1947).

To prove the lower bound, we need to produce a graph that has neither a 'large' independent set nor a large clique. We will obtain this graph by a simple random experiment, namely: just choose a graph with vertex set $V = \{1, \ldots, n\}$ uniformly at random from the set of all $2^{\binom{n}{2}}$ such graphs. We denote this random object by $G(n, \frac{1}{2})$. Formally, $G(n, \frac{1}{2})$ is a random variable ranging over the set of all graphs with n vertices. To prove the desired lower bound on R(s), we start with the following lemma.

Lemma 2.2 Let $I_s$ be the number of independent sets of size s in $G(n, \frac{1}{2})$. Similarly, let $C_s$ be the number of cliques of size s in $G(n, \frac{1}{2})$. Then
\[ \mathrm{E}(I_s) = \mathrm{E}(C_s) = \binom{n}{s}\cdot 2^{-\binom{s}{2}}. \]

Proof. For a specific set $S \subset V$ of size s we define a random variable
\[ I_S = \begin{cases} 1 & \text{if } S \text{ is independent in } G(n,\tfrac{1}{2}), \\ 0 & \text{otherwise.} \end{cases} \]
Then $I_s = \sum_S I_S$, and thus
\[ \mathrm{E}(I_s) = \sum_S \mathrm{E}(I_S) \le \binom{n}{s}\cdot\max_S \mathrm{E}(I_S). \quad (1) \]
Here S ranges over subsets of V of size s. Moreover, we have used the fact that the binomial coefficient $\binom{n}{s}$ counts the number of subsets of V of size s.

Thus, we need to compute $\mathrm{E}(I_S)$ for a fixed set S. Of course, $\mathrm{E}(I_S)$ is nothing but the probability that S is an independent set. In order to compute this probability, recall that $G(n,\frac{1}{2})$ is just a uniformly chosen graph on the vertex set V. Hence, if we look at the subgraph of $G(n,\frac{1}{2})$ induced on the vertices S, then this is just a uniformly distributed subgraph with vertex set S. Since |S| = s, the total number of possible graphs on S equals $2^{\binom{s}{2}}$. Consequently,
\[ \mathrm{E}(I_S) = \mathrm{P}[S \text{ is independent}] = 2^{-\binom{s}{2}}. \]


Plugging this estimate into (1), we obtain the assertion. □

To proceed, we need Stirling's formula, which approximates k! for integers k: it reads
\[ \sqrt{2\pi k}\cdot\left(\frac{k}{e}\right)^k \le k! \le \exp\left(\frac{1}{12k}\right)\cdot\sqrt{2\pi k}\cdot\left(\frac{k}{e}\right)^k. \]

Theorem 2.3 For s ≥ 3 we have $R(s) > \frac{1}{e\sqrt{2}}\cdot s\,2^{s/2}$.

Proof. We claim that there is a graph on
\[ n = \left\lfloor \frac{s\cdot 2^{s/2}}{e\sqrt{2}} \right\rfloor \]
vertices that has neither a clique nor an independent set of size s. For consider the random graph $G(n, \frac{1}{2})$. Then by Markov's inequality,
\[ \mathrm{P}[I_s > 0 \vee C_s > 0] = \mathrm{P}[I_s + C_s > 0] \le \mathrm{E}(I_s + C_s). \]
Therefore, by Lemma 2.2 and Stirling's formula,
\[ \mathrm{P}[I_s > 0 \vee C_s > 0] \le \mathrm{E}(I_s + C_s) = \mathrm{E}(I_s) + \mathrm{E}(C_s) = \binom{n}{s}\cdot 2^{1-\binom{s}{2}} < \frac{n^s}{s!}\cdot 2^{1-\binom{s}{2}} \le \frac{n^s}{\sqrt{2\pi s}(s/e)^s}\cdot 2^{1-\binom{s}{2}} \le \frac{s^s 2^{s^2/2}\exp(-s)2^{-s/2}}{\sqrt{2\pi s}(s/e)^s}\cdot 2^{1-\binom{s}{2}} = \frac{2}{\sqrt{2\pi s}} < 1. \]
In effect,
\[ \mathrm{P}[I_s = 0 \wedge C_s = 0] = 1 - \mathrm{P}[I_s > 0 \vee C_s > 0] > 0, \]
whence R(s) > n. □

The proof of Theorem 2.3 shows that there exists a graph with $n = \lfloor s\cdot 2^{s/2}/(e\sqrt{2})\rfloor$ vertices that does not have a clique or an independent set of size s. Rather than by constructing such a graph deterministically, the proof is by showing that a random graph has this property with a strictly positive probability. This kind of argument is what is called the probabilistic method. In spite of more than 60 years of intensive research, no deterministic construction of a 'Ramsey graph' is known that could hold a candle to the simple probabilistic argument above.
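This probabilistic argument is easy to try out on a computer for small s. The following sketch (illustrative only, not part of the original notes; the function names are ours) samples G(n, 1/2) on n = ⌊s·2^{s/2}/(e√2)⌋ vertices and checks by brute force that the sample usually has neither a clique nor an independent set of size s.

```python
import itertools, math, random

def random_graph(n):
    """Sample G(n, 1/2): include each possible edge independently with probability 1/2."""
    return {e for e in itertools.combinations(range(n), 2) if random.random() < 0.5}

def has_homogeneous_set(edges, n, s):
    """True if some s-subset of the vertices forms a clique or an independent set."""
    for subset in itertools.combinations(range(n), s):
        pairs = list(itertools.combinations(subset, 2))
        present = sum(1 for e in pairs if e in edges)
        if present == 0 or present == len(pairs):
            return True
    return False

s = 6
n = math.floor(s * 2 ** (s / 2) / (math.e * math.sqrt(2)))   # n = 12 for s = 6
successes = sum(1 for _ in range(50) if not has_homogeneous_set(random_graph(n), n, s))
print(f"{successes}/50 samples on {n} vertices avoid cliques and independent sets of size {s}")
```

By Lemma 2.2 the expected number of 'bad' sets is well below one for these parameters, so almost every trial should succeed.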

The proof of Theorem 2.3 relies on a so-called first moment argument. The name stems from the fact that in order to show the absence of a clique/independent set of size s, we compute the expected number (aka "first moment") of cliques/independent sets.

3 A second example: high chromatic number and high girth

Let G = (V, E) be a graph. Let k ≥ 1 be an integer. A k-coloring of G is a function $f: V \to \{1, \ldots, k\}$ such that for any edge $\{v, w\} \in E$ we have $f(v) \ne f(w)$. The chromatic number $\chi(G)$ is the least number k such that G has a k-coloring. Observe that if $\chi(G) \le k$, then G has an independent set of size at least |V|/k.

A cycle of length l ≥ 3 is a sequence of distinct vertices $v_1, \ldots, v_l \in V$ such that $\{v_1, v_l\} \in E$ and $\{v_i, v_{i+1}\} \in E$ for all 1 ≤ i < l. Finally, the girth of G is the least number l such that G has a cycle of length l (resp. ∞ if G has no cycle at all).

The most obvious reason for a graph to have a high chromatic number is that it contains a clique. In fact, if the largest clique in the graph contains k vertices, then $\chi(G) \ge k$. But how large can the gap be between the size of the largest clique and the chromatic number? For instance, are there graphs with $\chi(G) > 100$ that do not even contain a clique on three vertices (aka a 'triangle')?


More generally, it may not be very easy to imagine a graph that has simultaneously a high chromatic number and a large girth. But there is a simple probabilistic experiment that (almost) produces such a graph. This 'experiment' is the random graph G(n, p), where n ≥ 1 is an integer and p ∈ [0, 1] is a real. The vertex set of G(n, p) is $V = \{1, \ldots, n\}$, and each of the $\binom{n}{2}$ possible edges is present in the random graph with probability p independently. Formally, G(n, p) is a random variable ranging over all graphs on V, such that for any fixed graph G = (V, E) we have
\[ \mathrm{P}[G(n, p) = G] = p^{|E|}(1-p)^{\binom{n}{2}-|E|}. \quad (2) \]
There is an apparent clash of notation with the symbol $G(n, \frac{1}{2})$ used in the previous section. However, a moment's reflection shows that, in fact, the uniform distribution on all graphs is the same as the distribution obtained by including each of the $\binom{n}{2}$ edges with probability $\frac{1}{2}$ independently. To see this, just plug $p = \frac{1}{2}$ into (2).

In probabilistic arguments, we will frequently need a few elementary inequalities, often involving factorials or binomial coefficients. Two particularly useful ones are
\[ \binom{n}{s} \le \left(\frac{en}{s}\right)^s \text{ for any } 0 < s \le n, \qquad 1 - x \le \exp(-x) \text{ for any real } x. \]
We are going to employ the random graph G(n, p) to prove the following.

Theorem 3.1 For any g ≥ 4 and any k ≥ 4 there is a graph with girth at least g and chromatic number at least k.

Proof. Let $n = 100\cdot k^{3g}$ and $p = 2k^2/n$. Let $Z_l$ denote the number of cycles of length l in the random graph G(n, p); thus, $Z_l$ is a random variable. We need to compute $\mathrm{E}(Z_l)$. To this end, we let $\mathcal{C}$ be the set of all sequences $(v_1, \ldots, v_l) \in V^l$ such that the vertices $v_1, \ldots, v_l$ are pairwise distinct. For any $C \in \mathcal{C}$ we let
\[ Z_C = \begin{cases} 1 & \text{if } (v_1, \ldots, v_l) \text{ is a cycle in } G(n, p), \\ 0 & \text{otherwise.} \end{cases} \]
Then
\[ \mathrm{E}(Z_C) = \mathrm{P}[Z_C = 1] = p^l, \]
because $(v_1, \ldots, v_l)$ is a cycle iff the l edges $\{v_1, v_l\}$ and $\{v_i, v_{i+1}\}$ for 1 ≤ i < l are present in G(n, p). Furthermore, each cycle in the random graph G(n, p) corresponds to 2l elements of $\mathcal{C}$. Therefore,
\[ Z_l = \frac{1}{2l}\sum_{C\in\mathcal{C}} Z_C, \]
and thus
\[ \mathrm{E}(Z_l) = \frac{1}{2l}\sum_{C\in\mathcal{C}}\mathrm{E}(Z_C) = \frac{1}{2l}\cdot|\mathcal{C}|\,p^l. \]
As $|\mathcal{C}| = n(n-1)\cdots(n-l+1) < n^l$, we obtain
\[ \mathrm{E}(Z_l) < \frac{(np)^l}{2l} = \frac{2^l k^{2l}}{2l}, \]
due to our choice of p. Letting $Z = \sum_{l=3}^{g-1} Z_l$, we get
\[ \mathrm{E}(Z) = \sum_{l=3}^{g-1}\mathrm{E}(Z_l) < \sum_{l=3}^{g-1}\frac{2^l k^{2l}}{2l} < \frac{2^{g-1}k^{2g-2}}{3}, \]
by bounding the geometric sum in terms of the highest summand. Let $\mathcal{A}$ be the event that $Z \le 2^{g-1}k^{2g-2}$. By Markov's inequality, we have
\[ \mathrm{P}[\mathcal{A}] = 1 - \mathrm{P}[\neg\mathcal{A}] = 1 - \mathrm{P}\left[Z > 2^{g-1}k^{2g-2}\right] \ge 1 - \mathrm{P}[Z > 3\,\mathrm{E}(Z)] > \frac{2}{3}. \quad (3) \]


Thus, it is quite likely that G(n, p) has only 'few' short cycles.

As a next step, we are going to show that with probability greater than 2/3 the random graph G(n, p) does not have an independent set of size n/(2k). More precisely, let $\mathcal{B}$ be the event that there is no independent set of size n/(2k) in G(n, p). Furthermore, let I be the number of independent sets of size $s = \lceil n/(2k)\rceil$. Then
\[ \mathrm{P}[\mathcal{B}] = \mathrm{P}[I = 0] = 1 - \mathrm{P}[I \ge 1] \ge 1 - \mathrm{E}(I). \quad (4) \]
Moreover, by a similar argument as in the proof of Lemma 2.2,
\[ \mathrm{E}(I) = \binom{n}{s}(1-p)^{\binom{s}{2}} \le \left(\frac{en}{s}\right)^s\exp\left[-\binom{s}{2}p\right] = \exp\left[s\left(1 + \ln(n/s) - (s-1)p/2\right)\right] = \exp\left[s\left(1 + \ln(2k) - k^2 + p/2\right)\right] \le \exp\left[s\left(2 + \ln(k) - k^2\right)\right] \le \exp(-s) < 1/3. \]
The last step follows from our choice of n. Hence, (4) entails that $\mathrm{P}[\mathcal{B}] \ge 2/3$. Combining this with (3), we see that there exists a graph on n vertices that satisfies both $\mathcal{A}$ and $\mathcal{B}$.

Let G be such a graph. Due to $\mathcal{A}$, the total number of cycles of length less than g is $Z(G) \le 2^{g-1}k^{2g-2}$. By our choice of n, we have
\[ \frac{Z(G)}{n} \le \frac{2^{g-1}k^{2g-2}}{100\cdot k^{3g}} \le 1/100. \]
Choosing one vertex from each cycle of length less than g, we obtain a set R of vertices of size |R| ≤ n/100 such that each cycle contains a vertex from R. Let G′ be the graph obtained from G by removing all vertices from R (and the edges in which they occur). Clearly, G′ has girth at least g. Furthermore, any independent set of G′ also is an independent set in G. Therefore, $\mathcal{B}$ implies that G′ does not contain an independent set of size n/(2k). Consequently, the chromatic number of G′ satisfies
\[ \chi(G') \ge \frac{|V\setminus R|}{n/(2k)} \ge 2\cdot 0.99\cdot k > k. \]
Hence, G′ has girth at least g and chromatic number greater than k, as desired. □
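The first-moment count of short cycles at the start of this proof is easy to check numerically. The sketch below (illustrative only; the parameters and function name are ours) estimates the expected number of triangles in G(n, p) by simulation and compares it with the bound $(np)^l/(2l)$ for l = 3.

```python
import itertools, random

def count_triangles(n, p):
    """Sample G(n, p) and count cycles of length 3 by brute force."""
    edges = {e for e in itertools.combinations(range(n), 2) if random.random() < p}
    def has_edge(u, v):
        return (min(u, v), max(u, v)) in edges
    return sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if has_edge(a, b) and has_edge(b, c) and has_edge(a, c))

n, p, trials = 20, 0.2, 300
avg = sum(count_triangles(n, p) for _ in range(trials)) / trials
print("estimated E(Z_3) =", round(avg, 2))
print("exact E(Z_3)     =", round(n * (n - 1) * (n - 2) / 6 * p ** 3, 2))
print("bound (np)^3/6   =", round((n * p) ** 3 / 6, 2))
```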

4 A third example: the Erdos-Ko-Rado theorem

Let n ≥ 1 be an integer and let $V = \{0, \ldots, n-1\}$. A family $\mathcal{F}$ of subsets of V is intersecting if for any two $A, B \in \mathcal{F}$ we have $A\cap B \ne \emptyset$.

Theorem 4.1 (Erdos-Ko-Rado) Let 1 ≤ k ≤ n/2. If $\mathcal{F}$ is an intersecting family such that all sets in $\mathcal{F}$ have size k, then $|\mathcal{F}| \le \binom{n-1}{k-1}$.

Before we come to the proof, note that this bound is tight: the family of all subsets of V of size k that contain 0 contains $\binom{n-1}{k-1}$ sets.

To prove the theorem, let us denote by $m \bmod n \in V$ the remainder of the integer m upon division by n. We need the following lemma.

Lemma 4.2 Suppose that $\mathcal{F}$ is an intersecting family such that all sets in $\mathcal{F}$ have size k. For any $s \in V$ define a set
\[ A_s = \{ s + i \bmod n : 0 \le i < k \}. \]
Then $\mathcal{F}$ contains at most k of these sets $A_s$. (In symbols, $|\mathcal{F}\cap\{A_s : s\in V\}| \le k$.)


Proof. Suppose that $A_s \in \mathcal{F}$ for some s. The sets $A_t$ such that $A_s\cap A_t \ne \emptyset$ can be partitioned into pairs $\{A_{s-i \bmod n}, A_{s+k-i \bmod n}\}$ with 1 ≤ i ≤ k−1. Since $A_{s-i \bmod n}\cap A_{s+k-i \bmod n} = \emptyset$, $\mathcal{F}$ can only contain one set from each of these k−1 pairs. This implies that $\mathcal{F}$ contains at most k of the sets $A_t$, $t \in V$, in total. □

Proof of Theorem 4.1. Choose a permutation $\sigma: V\to V$ and an element $i \in V$ uniformly at random from all $n\cdot n!$ possible choices. Let
\[ A_{i,\sigma} = \{ \sigma(i), \sigma(i+1 \bmod n), \ldots, \sigma(i+k-1 \bmod n) \}. \]
Lemma 4.2 shows that for any permutation $\sigma_0: V\to V$ we have
\[ \mathrm{P}[A_{i,\sigma}\in\mathcal{F} \mid \sigma = \sigma_0] = \frac{|\mathcal{F}\cap\{A_{j,\sigma_0} : j\in V\}|}{n} \le \frac{k}{n}, \quad (5) \]
because $i \in V$ is chosen uniformly and independently of σ. Furthermore, for any $i_0 \in V$ and any set $S \subset V$ of size k we have
\[ \mathrm{P}[A_{i,\sigma} = S \mid i = i_0] = 1\Big/\binom{n}{k}, \quad (6) \]
because σ is chosen uniformly and independently of i. Combining (5) and (6), we get
\[ \frac{k}{n} \ge \mathrm{P}[A_{i,\sigma}\in\mathcal{F}] = \frac{|\mathcal{F}|}{\binom{n}{k}}. \]
Hence, $|\mathcal{F}| \le \frac{k}{n}\binom{n}{k} = \binom{n-1}{k-1}$. □
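The identity $\mathrm{P}[A_{i,\sigma}\in\mathcal{F}] = |\mathcal{F}|/\binom{n}{k}$ used in this proof can be checked by simulation. The sketch below (illustrative only; all names are ours) does so for the extremal "star" family of all k-sets containing 0, for which the answer is exactly k/n.

```python
import random
from math import comb

n, k, trials = 12, 4, 200_000
hits = 0
for _ in range(trials):
    sigma = list(range(n))
    random.shuffle(sigma)                        # uniform permutation of V = {0, ..., n-1}
    i = random.randrange(n)                      # uniform starting position
    A = {sigma[(i + j) % n] for j in range(k)}   # the random cyclic interval A_{i, sigma}
    hits += (0 in A)                             # A lies in the star family iff it contains 0
print(round(hits / trials, 4), "should be close to k/n =", k / n,
      "=", comb(n - 1, k - 1), "/", comb(n, k))
```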

5 The binomial distribution

One of the most important probability distributions that occurs in the theory of random discrete structures is the binomial distribution. To define it, we first define a probability distribution on sets of 0/1 sequences. More precisely, let n ≥ 1 be an integer, and let 0 ≤ p ≤ 1 be a real. Let $\Omega = \{0,1\}^n$ be the set of all 0/1 sequences of length n. For any sequence $x = (x_1, \ldots, x_n)\in\Omega$, we let
\[ S(x) = \sum_{i=1}^n x_i \]
be the number of 'ones' in that sequence. We define a probability distribution $\mu_{n,p}$ on Ω by letting
\[ \mathrm{P}\{x\} = p^{S(x)}(1-p)^{n-S(x)}. \quad (7) \]
Thus, for any subset $A\subset\Omega$ we have
\[ \mathrm{P}(A) = \sum_{x\in A}\mathrm{P}\{x\}. \]
Now, $S: \Omega\to\mathbb{R}$ is a random variable. The distribution $\mathrm{P}_S$ of S is called the binomial distribution Bin(n, p). Thus, the binomial distribution is a probability distribution on $\mathbb{R}$, and for any $s\in\mathbb{R}$ we have
\[ \mathrm{P}_S\{s\} = \mu_{n,p}\{x\in\Omega : S(x) = s\} = \sum_{x\in\Omega: S(x)=s}\mu_{n,p}\{x\}. \]
Clearly, S can only take the values 0, 1, ..., n, and thus $\mathrm{P}_S\{s\} = 0$ for all $s\notin\{0, 1, \ldots, n\}$. Furthermore, for any $k\in\{0, 1, \ldots, n\}$ we have
\[ \mathrm{P}_S\{k\} = \binom{n}{k}p^k(1-p)^{n-k}. \]


The definition (7) of the probability measure on Ω ensures that for $x\in\Omega$ each entry $x_i$ is equal to one with probability p independently of all the others. (You may want to verify this!) Therefore, by the linearity of the expectation,
\[ \mathrm{E}(S) = n\cdot p. \]
Furthermore, since the variance of a sum of independent variables equals the sum of the variances, we see that
\[ \mathrm{Var}(S) = n\cdot p(1-p). \]
This expression for the variance can sometimes be used in combination with Chebyshev's inequality:

Lemma 5.1 (Chebyshev's inequality) Let X be a real-valued random variable such that Var(X) < ∞. Then for any t > 0 we have
\[ \mathrm{P}\left[|X - \mathrm{E}(X)| > t\cdot\sqrt{\mathrm{Var}(X)}\right] \le t^{-2}. \]

While we're at it, we might as well recall Markov's inequality.

Lemma 5.2 (Markov's inequality) Let X be a non-negative real-valued random variable such that E(X) < ∞. Then for any t > 1 we have
\[ \mathrm{P}[X > t\cdot\mathrm{E}(X)] \le t^{-1}. \]

A key feature of the binomial distribution is that it is 'concentrated about its expectation'. This can be quantified precisely by the so-called Chernoff bounds. The method for deriving these bounds is instructive and can be applied to many other distributions as well.

The fundamental idea is to apply Markov's inequality to the random variable exp(u · S) for a suitable real u. More precisely, for any u, t ≥ 0 we have
\[ \mathrm{P}[S \ge \mathrm{E}(S) + t] = \mathrm{P}[\exp(uS) \ge \exp(u(\mathrm{E}(S)+t))] \le \frac{\mathrm{E}(\exp(uS))}{\exp(u(\mathrm{E}(S)+t))}. \quad (8) \]
Analogously, if u ≤ 0 and t ≥ 0, then
\[ \mathrm{P}[S \le \mathrm{E}(S) - t] = \mathrm{P}[\exp(uS) \ge \exp(u(\mathrm{E}(S)-t))] \le \frac{\mathrm{E}(\exp(uS))}{\exp(u(\mathrm{E}(S)-t))}. \quad (9) \]
To proceed, we need to compute E(exp(uS)), and then optimize over u. Since $S(x) = \sum_{i=1}^n x_i$ and the entries $x_i$ are mutually independent, we have
\[ \mathrm{E}(\exp(uS)) = \mathrm{E}\left[\exp\left(u\sum_{i=1}^n x_i\right)\right] = \mathrm{E}\left[\prod_{i=1}^n\exp(ux_i)\right] = \prod_{i=1}^n\mathrm{E}[\exp(ux_i)] = \prod_{i=1}^n(1-p+p\exp(u)) = (1-p+p\exp(u))^n. \quad (10) \]
To continue, let λ = E(S) = np, and assume that λ + t < n. Plugging (10) into (8), we obtain
\[ \mathrm{P}[S \ge \lambda + t] \le \exp(-u(\lambda+t))\cdot(1-p+p\exp(u))^n \quad \text{for any } u \ge 0. \quad (11) \]
Differentiating this expression shows that the r.h.s. is minimized if u is chosen so that
\[ \exp(u) = \frac{(\lambda+t)(1-p)}{(n-\lambda-t)p}. \]


Substituting this value of u in (11), we obtain
\[ \mathrm{P}[S \ge \lambda+t] \le \left(\frac{(n-\lambda-t)p}{(\lambda+t)(1-p)}\right)^{\lambda+t}\cdot\left(1-p+(1-p)\cdot\frac{\lambda+t}{n-\lambda-t}\right)^n = \left(\frac{n-\lambda-t}{\lambda+t}\right)^{\lambda+t}\left(\frac{p}{1-p}\right)^{\lambda+t}(1-p)^n\left(1+\frac{\lambda+t}{n-\lambda-t}\right)^n = \left(\frac{n-\lambda-t}{\lambda+t}\right)^{\lambda+t}\left(\frac{p}{1-p}\right)^{\lambda+t}(1-p)^n\left(\frac{n}{n-\lambda-t}\right)^n = \left(\frac{n-\lambda-t}{\lambda+t}\right)^{\lambda+t}\left(\frac{\lambda}{n-\lambda}\right)^{\lambda+t}(n-\lambda)^n(n-\lambda-t)^{-n} = \left(\frac{\lambda}{\lambda+t}\right)^{\lambda+t}\left(\frac{n-\lambda}{n-\lambda-t}\right)^{n-\lambda-t}. \quad (12) \]
This bound holds for all 0 ≤ t < n − λ. With the convention that the second factor in the last line is 1, it extends to t = n − λ as well. Furthermore, for t > n − λ, the probability on the left hand side is trivially zero.

The bound on P[S ≥ λ + t] that we just obtained is often a bit awkward to apply. The following theorem provides a more handy one.

Theorem 5.3 Let S be a random variable that has a binomial distribution Bin(n, p); that is, for any integer 0 ≤ k ≤ n we have
\[ \mathrm{P}[S = k] = \binom{n}{k}p^k(1-p)^{n-k}. \]
Let λ = np = E(S). Furthermore, define a function
\[ \varphi: (-1,\infty)\to\mathbb{R}_{\ge 0}, \quad x\mapsto (1+x)\ln(1+x) - x. \]
Then
\[ \mathrm{P}[S \ge \lambda+t] \le \exp(-\lambda\cdot\varphi(t/\lambda)) \le \exp\left(-\frac{t^2}{2(\lambda+t/3)}\right) \quad \text{for any } t \ge 0, \]
\[ \mathrm{P}[S \le \lambda-t] \le \exp(-\lambda\cdot\varphi(-t/\lambda)) \le \exp\left(-\frac{t^2}{2\lambda}\right) \quad \text{for any } 0 \le t < \lambda. \]

Proof. In terms of φ, (12) reads
\[ \mathrm{P}[S \ge \lambda+t] \le \exp\left[-\lambda\varphi(t/\lambda) - (n-\lambda)\varphi(-t/(n-\lambda))\right] \quad (13) \]
for 0 ≤ t ≤ n − λ. Replacing S by n − S (which has binomial distribution Bin(n, 1 − p)), we obtain from (12)
\[ \mathrm{P}[S \le \lambda-t] \le \exp\left[-\lambda\varphi(-t/\lambda) - (n-\lambda)\varphi(t/(n-\lambda))\right] \quad (14) \]
for 0 ≤ t ≤ λ. Since φ is positive, (13) and (14) directly yield the 'middle' bounds in the theorem (i.e., the ones stated in terms of φ). Using elementary calculus, one verifies that $\varphi(x) \ge x^2/2$ for −1 < x < 0, and that
\[ \varphi(x) \ge \frac{x^2}{2(1+x/3)} \]
for x > 0. □
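The bounds of Theorem 5.3 can be compared with the exact binomial tail for concrete parameters. The following sketch (illustrative only; the parameter choices are ours) does this for a single upper-tail probability.

```python
from math import comb, exp, log

def upper_tail_exact(n, p, t):
    """P[S >= np + t] for S ~ Bin(n, p), computed directly from the pmf."""
    lam = n * p
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k >= lam + t)

def phi(x):
    return (1 + x) * log(1 + x) - x

n, p, t = 1000, 0.3, 60
lam = n * p
print("exact tail              :", upper_tail_exact(n, p, t))
print("exp(-lam*phi(t/lam))    :", exp(-lam * phi(t / lam)))
print("exp(-t^2/(2(lam+t/3)))  :", exp(-t**2 / (2 * (lam + t / 3))))
```

Both bounds are valid but conservative; the exact tail is roughly two orders of magnitude smaller for these values.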

The previous results concern the properties of the binomial distribution for any fixed n, p. In the rest of this section, we will deal with the asymptotics of the binomial distribution as n → ∞. That is, we will consider a sequence of random variables $(S_n)_{n\ge 1}$ such that $S_n$ has a binomial distribution Bin(n, p(n)), where p(n) ∈ [0, 1]. To simplify the notation, we will usually omit the index n and just write S instead of $S_n$. Similarly, we will just write p instead of p(n). Nevertheless, we will have to keep in mind that


formally p depends on n, and that we're talking about a sequence of random variables rather than a single one.

There are two fundamentally different cases. The first case is that the sequence $(n\cdot p(n))_{n\ge 1}$ is bounded. The second case is that $n\cdot p(n)\to\infty$ as n → ∞. Let us begin with the first case. In fact, let us assume that there is a real λ > 0 such that
\[ \lim_{n\to\infty} n\cdot p(n) = \lambda. \]
We will see that, in this case, the binomial variable S can be approximated well by a random variable with another distribution, the so-called Poisson distribution. More precisely, we say that a random variable L has a Poisson distribution Po(λ) if
\[ \mathrm{P}[L = k] = \frac{\lambda^k}{k!\exp(\lambda)} \quad \text{for any integer } k \ge 0. \]

Theorem 5.4 Suppose that np → λ for a real λ > 0. Then for any integer k we have
\[ \lim_{n\to\infty}\mathrm{P}[S = k] = \frac{\lambda^k}{k!\exp(\lambda)}. \]

Proof. We may assume that n > k. Then
\[ \mathrm{P}[S = k] = \binom{n}{k}p^k(1-p)^{n-k} \le \frac{(np)^k}{k!}\exp(-p(n-k)) = \frac{(np)^k}{k!}\exp(-np)\cdot\exp(kp), \quad (15) \]
\[ \mathrm{P}[S = k] = \binom{n}{k}p^k(1-p)^{n-k} \ge \frac{(n-k)^kp^k}{k!}\cdot(1-p)^n \ge \frac{(np)^k}{k!}(1-p)^n(1-k/n)^k. \quad (16) \]
Using elementary calculus, one easily verifies that $1-x \ge \exp(-x-x^2)$ for $0 \le x \le \frac{1}{2}$. As np → λ, we know that p → 0. Similarly, for any fixed integer k we have k/n → 0. Therefore, for n sufficiently large we obtain from (16)
\[ \mathrm{P}[S = k] \ge \frac{(np)^k}{k!}\exp(-np)\cdot\exp(-k^2/n - k^3/n^2 - p^2n). \quad (17) \]
Since p → 0, $k^2/n \to 0$, and $p^2n \to 0$ for large n, (15) and (17) imply the assertion. □
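The Poisson approximation of Theorem 5.4 is already very accurate for moderate n. A quick numerical comparison (illustrative only; the parameter choices are ours):

```python
from math import comb, exp, factorial

lam, n = 3.0, 2000
p = lam / n
for k in range(8):
    binom_pk = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson_pk = lam**k / (factorial(k) * exp(lam))
    print(k, round(binom_pk, 5), round(poisson_pk, 5))
```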

The second case is that np → ∞. We may confine ourselves to p ≤ 1/2, because otherwise we could simply replace S by n − S. What we will be interested in is the so-called center of the binomial distribution, i.e., the values close to λ = np. For this part, we have the following limit theorem, showing convergence to the normal distribution.

Theorem 5.5 Suppose that $p = p(n) \le \frac{1}{2}$ and np → ∞. Let $\sigma = \sigma(n) = \sqrt{np(1-p)}$. Then for any reals a < b we have
\[ \lim_{n\to\infty}\mathrm{P}\left[\frac{S-\lambda}{\sigma}\in[a,b]\right] = \frac{1}{\sqrt{2\pi}}\int_a^b\exp(-x^2/2)\,dx. \]

To prove the theorem, we need the following estimate.

Lemma 5.6 Suppose that $p = p(n) \le \frac{1}{2}$ and np → ∞. Let $\sigma = \sigma(n) = \sqrt{np(1-p)}$. Then for any real C > 0 we have
\[ \lim_{n\to\infty}\max_{k\in\mathbb{Z}:|k|\le C\sigma}\left|\log\left[\mathrm{P}[S=\lambda+k]\cdot\sqrt{2\pi}\,\sigma\exp\left(\frac{k^2}{2\sigma^2}\right)\right]\right| = 0. \]


Proof. Fix any k such that |k| ≤ Cσ. By Stirling's formula,
\[ \mathrm{P}[S=\lambda+k] = \binom{n}{\lambda+k}p^{\lambda+k}(1-p)^{n-\lambda-k} = (1+o(1))\cdot\sqrt{\frac{n}{2\pi(\lambda+k)(n-\lambda-k)}}\cdot\left(\frac{np}{\lambda+k}\right)^{\lambda+k}\left(\frac{n(1-p)}{n-\lambda-k}\right)^{n-\lambda-k} = (1+o(1))\cdot\sqrt{\frac{n}{2\pi(\lambda+k)(n-\lambda-k)}}\cdot\left(\frac{\lambda}{\lambda+k}\right)^{\lambda+k}\left(\frac{n-\lambda}{n-\lambda-k}\right)^{n-\lambda-k}. \]
Here and throughout, the o(1) hides a term that tends to zero as n → ∞. Let us first take a look at the square root. Since λ → ∞, λ ≤ n/2, and $k \le C\sqrt{\lambda}$, we get
\[ \sqrt{\frac{n}{2\pi(\lambda+k)(n-\lambda-k)}} = (1+o(1))\cdot\sqrt{\frac{n}{2\pi\lambda(n-\lambda)}} = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}. \]
Thus,
\[ \mathrm{P}[S=\lambda+k] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\cdot\left(\frac{\lambda}{\lambda+k}\right)^{\lambda+k}\left(\frac{n-\lambda}{n-\lambda-k}\right)^{n-\lambda-k}. \quad (18) \]
The right part of the expression in (18) is identical to the expression we had in (12). In (13) we expressed this in terms of the function φ(x) = (1+x)ln(1+x) − x. Namely, we have
\[ \mathrm{P}[S=\lambda+k] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\cdot\exp\left[-\lambda\varphi(k/\lambda) - (n-\lambda)\varphi(-k/(n-\lambda))\right]. \quad (19) \]
Since |k| ≤ Cσ, the quotients k/λ and k/(n−λ) tend to zero. Furthermore, Taylor expanding φ around 0, we get $\varphi(x) = x^2/2 + O(x^3)$. Hence,
\[ -\lambda\varphi(k/\lambda) - (n-\lambda)\varphi(-k/(n-\lambda)) = -\lambda\cdot\left[\frac{1}{2}\left(\frac{k}{\lambda}\right)^2 + O(k/\lambda)^3\right] - (n-\lambda)\cdot\left[\frac{1}{2}\left(\frac{k}{n-\lambda}\right)^2 + O(k/(n-\lambda))^3\right] = -\frac{k^2}{2\lambda} - \frac{k^2}{2(n-\lambda)} + O(k^3/\lambda^2) = -\frac{k^2}{2\sigma^2} + O(k^3/\lambda^2). \]
Since $k \le C\sigma \le C\sqrt{\lambda}$, we have $k^3/\lambda^2 \to 0$. Hence, (19) yields
\[ \mathrm{P}[S=\lambda+k] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{k^2}{2\sigma^2}\right], \]
as claimed. □

Proof of Theorem 5.5. We will only consider the case that 0 ≤ a < b. The case a < b ≤ 0 can be dealt with analogously. Moreover, in the case a < 0 < b the assertion follows by applying the previous two cases to the two intervals [a, 0] and [0, b]. Thus, assume from now on that 0 ≤ a < b.

Letting $K = \{k\in\mathbb{Z} : a\sigma \le k \le b\sigma\}$, we have
\[ \mathrm{P}\left[\frac{S-\lambda}{\sigma}\in[a,b]\right] = \sum_{k\in K}\mathrm{P}[S=\lambda+k] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\sum_{k\in K}\exp\left(-\frac{k^2}{2\sigma^2}\right) \quad \text{[by Lemma 5.6]}. \quad (20) \]
Recall that the o(1) hides an expression that tends to zero as n → ∞. The function $x\mapsto\exp(-x^2/(2\sigma^2))$ is monotonically decreasing on the positive reals. Therefore,
\[ \sum_{k\in K}\exp\left(-\frac{k^2}{2\sigma^2}\right) = \delta + \int_{a\sigma}^{b\sigma}\exp(-x^2/(2\sigma^2))\,dx, \quad (21) \]


with $|\delta| \le 2\exp(-a^2/2) \le 2$. Plugging (21) into (20), we get
\[ \mathrm{P}\left[\frac{S-\lambda}{\sigma}\in[a,b]\right] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\left[\delta + \int_{a\sigma}^{b\sigma}\exp(-x^2/(2\sigma^2))\,dx\right] = \frac{1+o(1)}{\sqrt{2\pi}\,\sigma}\left[\delta + \sigma\int_a^b\exp(-z^2/2)\,dz\right] = \frac{1+o(1)}{\sqrt{2\pi}}\int_a^b\exp(-z^2/2)\,dz, \]
where the last step follows because |δ| ≤ 2 and $\sigma \ge \sqrt{\lambda/2}\to\infty$. □

Remark 5.7 Theorem 5.5 actually is a special case of the ‘central limit theorem’ from probability theory.

6 Random walks

Random walks are the simplest example of a stochastic process. Think of a particle that moves around on the real line, starting at the origin. At each (discrete) time step, the particle moves either one step (of a fixed length, say 1) to the left or to the right, each with probability $\frac{1}{2}$ independently of the current position. We are going to address several natural questions related with this process, such as: what is the distribution of the position at time t?

Formally, the random walk of length n is described by equipping the set $\Omega = \{-1,1\}^n$ with the uniform probability distribution. The idea is that a vector $x = (x_1, \ldots, x_n)\in\Omega$ encodes for each time step 1 ≤ t ≤ n the increment $x_t$ of the position of our particle. Thus, the position at time 0 ≤ t ≤ n is
\[ S_t = \sum_{s=1}^t x_s. \]

Lemma 6.1 Suppose that 1 ≤ t ≤ n and r are integers such that t + r is even. Then
\[ \mathrm{P}[S_t = r] = \binom{t}{(t+r)/2}2^{-t}. \]

Proof. We consider the map
\[ B: \Omega\to\Omega' = \{0,1\}^n, \quad x\mapsto\frac{1}{2}(x+1). \]
This is a bijection. Let $T: \Omega'\to\mathbb{R}$ be the map
\[ \omega = (\omega_1, \ldots, \omega_n)\mapsto\sum_{i=1}^t\omega_i. \]
Then T has a binomial distribution Bin(t, 1/2). Therefore,
\[ \mathrm{P}[T = (t+r)/2] = \binom{t}{(t+r)/2}2^{-t}. \quad (22) \]
Furthermore, the map B has the property that
\[ \frac{S_t(x)+t}{2} = T(B(x)). \quad (23) \]
Combining (22) and (23) yields the assertion. □

Corollary 6.2 We have $\lim_{t\to\infty}\sqrt{2t}\cdot\mathrm{P}[S_{2t} = 0] = \sqrt{2/\pi}$.


Proof. This follows from Lemma 6.1 and Stirling's formula. □

Theorem 6.3 For any real x < y we have
\[ \lim_{t\to\infty}\mathrm{P}\left[x\sqrt{t}\le S_t\le y\sqrt{t}\right] = \frac{1}{\sqrt{2\pi}}\int_x^y\exp(-z^2/2)\,dz. \]

Proof. Relate $S_t$ to the binomial distribution as in the proof of Lemma 6.1 and apply Theorem 5.5. □

Corollary 6.4 We have $\lim_{t\to\infty}\mathrm{E}\left[|S_t|/\sqrt{t}\right] = \sqrt{2/\pi}$.

Proof. This will be discussed in the seminar. □
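Both corollaries are easy to illustrate by simulation. The sketch below (illustrative only; all names and parameter choices are ours) samples walks of a fixed even length and compares the empirical quantities with √(2/π).

```python
import math, random

def walk_position(length):
    """Final position of a simple +/-1 random walk of the given length."""
    return sum(random.choice((-1, 1)) for _ in range(length))

length, trials = 2000, 4000                  # an even walk length, i.e. length = 2t
positions = [walk_position(length) for _ in range(trials)]
mean_abs = sum(abs(s) for s in positions) / trials
p_zero = sum(1 for s in positions if s == 0) / trials
print("E|S_t|/sqrt(t)       ~", round(mean_abs / math.sqrt(length), 3))
print("sqrt(2t)*P[S_2t = 0] ~", round(math.sqrt(length) * p_zero, 3))
print("sqrt(2/pi)           =", round(math.sqrt(2 / math.pi), 3))
```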

We are going to discuss an application of random walks to an algorithmic problem called 2-SAT. Actually we will define the slightly more general k-SAT problem, which we are going to deal with later. Let n, k ≥ 2 be integers and let $V = \{v_1, \ldots, v_n\}$ be a set of n Boolean variables. A literal over V is an expression of the form $v_i$ or $\neg v_i$ for 1 ≤ i ≤ n. A k-clause is an expression of the form $l_1\vee\cdots\vee l_k$, where $l_1, \ldots, l_k$ are literals. Furthermore, a k-CNF formula (for 'conjunctive normal form') is an expression $\phi = C_1\wedge\cdots\wedge C_m$, where $C_1, \ldots, C_m$ are k-clauses. Finally, an assignment is a map $\sigma: V\to\{0,1\}$. It satisfies φ if the value of the Boolean formula is 1 when the values $\sigma(v_i)$ are substituted for the variables $v_i$ for 1 ≤ i ≤ n, with 0 interpreted as 'false' and 1 interpreted as 'true'.

In the rest of this section, we let k = 2. An example of a 2-CNF formula with n = 2 and m = 3 is
\[ \phi = (v_1\vee\neg v_2)\wedge(\neg v_1\vee\neg v_2)\wedge(v_1\vee v_2). \]
A satisfying assignment is $\sigma: v_1\mapsto 1,\ v_2\mapsto 0$. (However, not every 2-CNF formula has a satisfying assignment.)

Our goal in the rest of this section is to devise an algorithm that will find a satisfying assignment of a 2-CNF φ if it has one. Clearly, this goal could be accomplished by trying out all $2^n$ assignments $\sigma: V\to\{0,1\}$. However, this is quite inefficient: even for n = 100, and even with the use of the fastest computer available, it would be completely infeasible. We will see that the task can be accomplished much faster, essentially with $n^2$ attempts.

The algorithm, called Walksat, is randomized, i.e., it performs 'coin flips'. Given a 2-CNF $\phi = C_1\wedge\cdots\wedge C_m$, it works as follows.

1. Initially, let σ(vi) = 1 for all 1 ≤ i ≤ n.

2. Repeat the following 16n2 times.

3. If σ is satisfying, halt and output σ.

4. Else let 1 ≤ i ≤ m be such that σ fails to satisfy clause $C_i = l_{i1}\vee l_{i2}$. Let $v_{i1}, v_{i2}$ be the variables underlying the literals $l_{i1}, l_{i2}$.

5. Choose $j\in\{1,2\}$ uniformly at random ("flip a coin").

6. Flip the value $\sigma(v_{ij})$.

7. Output ‘failure’.
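For concreteness, here is a minimal Python sketch of the procedure just described (illustrative only; the function name and the clause encoding are ours, not part of the notes).

```python
import random

def walksat_2sat(n, clauses, max_steps=None):
    """Schematic Walksat for 2-SAT.

    Variables are 1..n; a literal is +v or -v; a clause is a pair of literals.
    Returns a satisfying assignment (dict) or None after 16*n^2 steps.
    """
    if max_steps is None:
        max_steps = 16 * n * n
    sigma = {v: 1 for v in range(1, n + 1)}           # step 1: the all-ones assignment
    def satisfied(lit):
        v = abs(lit)
        return sigma[v] == 1 if lit > 0 else sigma[v] == 0
    for _ in range(max_steps):                        # step 2: repeat 16 n^2 times
        unsat = [c for c in clauses if not any(satisfied(l) for l in c)]
        if not unsat:                                 # step 3: sigma satisfies phi
            return sigma
        clause = unsat[0]                             # step 4: pick an unsatisfied clause
        lit = random.choice(clause)                   # step 5: flip a coin
        sigma[abs(lit)] ^= 1                          # step 6: flip the chosen variable
    return None                                       # step 7: failure

# The example formula from above: (v1 or not v2) and (not v1 or not v2) and (v1 or v2).
print(walksat_2sat(2, [(1, -2), (-1, -2), (1, 2)]))
```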

Corollary 6.5 Suppose that φ is satisfiable. Then Walksat will find a satisfying assignment with probability at least 0.4 − o(1).

Proof. Let $\sigma^*$ be a satisfying assignment. Let $T \le 16n^2$ be the total number of iterations that Walksat performs before either outputting a satisfying assignment or failing. Thus, T is a random variable. For 1 ≤ t ≤ T let $i_t, j_t$ be the indices chosen by Walksat in the t-th iteration. We say that step t is a success if t ≤ T and
\[ (\sigma^*(l_{i_tj_t}) = 1 \wedge \sigma^*(l_{i_t,3-j_t}) = 0) \vee (j_t = 1 \wedge \sigma^*(l_{i_tj_t}) = \sigma^*(l_{i_t,3-j_t}) = 1). \]


In words: either under $\sigma^*$ the literal $l_{i_tj_t}$ is satisfied, and the other literal $l_{i_t,3-j_t}$ in $C_{i_t}$ is not; or both of these literals are satisfied, and $j_t = 1$. Intuitively, this means that σ 'moves towards' $\sigma^*$. Because $j_t\in\{1,2\}$ is chosen randomly, we have
\[ \mathrm{P}[\text{step } t \text{ is a success} \mid t \le T] = \frac{1}{2}. \]
Now, for 1 ≤ t ≤ T we let $X_t = 1$ if step t is a success, and $X_t = -1$ otherwise. In addition, for $T < t \le 16n^2$ we let $X_t = 1$ with probability 1/2 and $X_t = -1$ with probability 1/2, independently of all other events. Thus, $\mathrm{E}(X_t) = 0$ for all $1 \le t \le 16n^2$, and
\[ S_t = \sum_{s=1}^t X_s \]
is the position of a random walk at time t, for any $1 \le t \le 16n^2$.

Let $\nu = |\sigma^{*-1}(0)|$ be the initial distance between σ and $\sigma^*$. Then ν ≤ n. Furthermore, if the algorithm fails, then $\nu - S_t > 0$ for all $1 \le t \le 16n^2$, because $\nu - S_t$ is the distance between σ and $\sigma^*$ at the end of step t. In particular, $S_{16n^2} < \nu \le n$. However, by Theorem 6.3
\[ \mathrm{P}[S_{16n^2} \le n] \le \frac{1+o(1)}{\sqrt{2\pi}}\int_{-\infty}^{1/4}\exp(-z^2/2)\,dz < 0.6 + o(1). \]
Thus, the probability that Walksat finds a satisfying assignment is at least 0.4 − o(1). □

The 'success probability' of 0.4 − o(1) guaranteed by Corollary 6.5 may seem disappointing. However, it can be boosted by repeating the algorithm several times independently. For instance, if we repeat the algorithm 10 times on a satisfiable φ, then the probability that at least one of the 10 attempts produces a satisfying assignment is
\[ 1 - (1 - 0.4 + o(1))^{10} = 1 - 0.6^{10} - o(1) > 0.99 - o(1). \]
But even if the algorithm fails a million times, would you accept that as a proof that φ is unsatisfiable?

As a point of interest, Walksat and related methods are amongst the practically most successful algorithms for solving the k-SAT problem, for any k ≥ 2. This problem is of great practical significance, as it plays a large role in computer science in software and hardware verification.

7 Eigenvalues of non-random matrices

For two vectors $\xi, \eta\in\mathbb{R}^n$, let $\langle\xi,\eta\rangle = \sum_{i=1}^n\xi_i\eta_i$ denote their inner product.

Theorem 7.1 Let $A = (a_{ij})_{1\le i,j\le n}$ be an n × n matrix with entries $a_{ij}\in\{-1,1\}$. There exist two vectors $x, y\in\{-1,1\}^n$ such that
\[ |\langle Ay, x\rangle| \ge \left(\sqrt{\frac{2}{\pi}} + o(1)\right)n^{3/2}. \]

Proof. Choose $y = (y_1, \ldots, y_n)\in\{-1,1\}^n$ uniformly at random and let
\[ R_i = \sum_{j=1}^n a_{ij}y_j, \qquad R = \sum_{i=1}^n|R_i|. \]
Then setting $x_i = R_i/|R_i|$ (and, say, $x_i = 1$ if $R_i = 0$) yields $\langle Ay, x\rangle = R$. Thus, we just need to show that $R \ge \left(\sqrt{2/\pi} + o(1)\right)n^{3/2}$ with a non-zero probability.


Fix 1 ≤ i ≤ n. Since $y_j\in\{-1,1\}$ is chosen uniformly, $a_{ij}y_j\in\{-1,1\}$ is also uniform. This means that $R_i$ is the sum of n independent random variables that take either value ±1 with probability 1/2. In other words, $R_i$ is the same as the position of a random walk after n steps. Therefore, Corollary 6.4 shows that $\mathrm{E}|R_i| = (\sqrt{2/\pi} - o(1))\sqrt{n}$. Consequently,
\[ \mathrm{E}[R] = \left(\sqrt{\frac{2}{\pi}} - o(1)\right)n^{3/2}. \]
In particular, there exists a vector y such that $R \ge \left(\sqrt{2/\pi} - o(1)\right)n^{3/2}$. □

If A is symmetric, then Theorem 7.1 can be interpreted in terms of the eigenvalues of the matrix A. Recall that the norm of A is defined as
\[ \|A\| = \max_{x,y\in\mathbb{R}^n\setminus\{0\}}\frac{\langle Ax, y\rangle}{\|x\|\cdot\|y\|}. \]
Since any symmetric matrix is diagonalizable, there is an orthogonal matrix U and reals $\lambda_1, \ldots, \lambda_n$ such that
\[ A = U\begin{pmatrix}\lambda_1 & & \\ & \ddots & \\ & & \lambda_n\end{pmatrix}U^T. \]
The numbers $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of A. They are closely related to the norm as $\|A\| = \max_{1\le i\le n}|\lambda_i|$. Now, Theorem 7.1 shows that for any n × n matrix A with entries ±1 we have
\[ \|A\| = \max_{1\le i\le n}|\lambda_i| \ge \left(\sqrt{\frac{2}{\pi}} - o(1)\right)\sqrt{n}. \quad (24) \]
This leads quite naturally to the question of whether there exist A such that indeed
\[ \|A\| = \max_{1\le i\le n}|\lambda_i| = O(\sqrt{n}). \]
The answer to this question is 'yes', but the proof is relatively complicated. In the next section, we will obtain a slightly weaker result that is, however, far easier to prove.

Remark 7.2 Actually, a slightly stronger bound than (24) can be obtained by observing that the matrix $A^2$ has trace $n^2$. Since the trace of $A^2$ is the sum of the squares of the eigenvalues of A, this implies that $\|A\| \ge \sqrt{n}$.

8 Eigenvalues of random matrices

The goal in this section is to analyze the eigenvalues of a random symmetric matrix with entries ±1.

Lemma 8.1 (Brubaker, Vempala 2009) Let $A = (a_{ij})_{1\le i,j\le n}$ be an n × n matrix with entries $a_{ij}\in[-1,1]$. Let
\[ U = \left\{\frac{k}{\sqrt{|S|}}\cdot\mathbf{1}_S : S\subset\{1,\ldots,n\},\ k\in\{-1,1\}\right\}, \]
where $\mathbf{1}_S\in\{0,1\}^n$ is the indicator vector of S. Then
\[ \|A\| \le o(1) + (2\lceil 2\log_2 n\rceil)^2\cdot\max_{x,y\in U}\langle Ax, y\rangle. \]


Proof. For a unit vector $x = (x_i)_{1\le i\le n}\in\mathbb{R}^n$ we let $x^+ = (x_i^+)_{1\le i\le n}\in\mathbb{R}^n$ be the vector with entries $x_i^+ = \max\{0, x_i\}$. In addition, let $x^- = x - x^+$. Furthermore, we define a sequence of sets $S_1(x), S_2(x), \ldots$ recursively by letting
\[ S_j(x) = \left\{ i\in\{1,\ldots,n\} : \left[x^+ - \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{S_k}\right]_i > 2^{-j} \right\}. \]
In words, $S_j(x)$ contains all indices i such that the i-th component of the vector $x^+ - \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{S_k}$, which is defined in terms of the previous j−1 sets in the sequence, is greater than $2^{-j}$. (Of course, for j = 1 the sum is interpreted as the zero vector.) Moreover, we define another sequence $T_1(x), T_2(x), \ldots$ by letting
\[ T_j(x) = \left\{ i\in\{1,\ldots,n\} : \left[x^- + \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{T_k}\right]_i < -2^{-j} \right\}. \]
Finally, letting $N = \lceil 2\log_2 n\rceil$, we define
\[ \bar{x} = \sum_{j=1}^N 2^{-j}\left[\mathbf{1}_{S_j(x)} - \mathbf{1}_{T_j(x)}\right]. \]
We claim that
\[ \|x - \bar{x}\| \le \sqrt{n}\cdot 2^{1-N}. \quad (25) \]
To see this, we are going to show that for any 1 ≤ i ≤ n and any 1 ≤ j ≤ N we have
\[ 0 \le \left[x^+ - \sum_{k=1}^j 2^{-k}\mathbf{1}_{S_k(x)}\right]_i \le 2^{-j}. \quad (26) \]
The proof of this is by induction on j. Since $\|x^+\| \le \|x\| = 1$, all entries of $x^+$ are bounded by 1. This implies the bound (26) for j = 1. For general j > 1, assume inductively that
\[ 0 \le \left[x^+ - \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{S_k(x)}\right]_i \le 2^{1-j}. \]
If $\left[x^+ - \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{S_k(x)}\right]_i > 2^{-j}$, then $i\in S_j(x)$, which implies (26). On the other hand, if
\[ \left[x^+ - \sum_{k=1}^{j-1}2^{-k}\mathbf{1}_{S_k(x)}\right]_i \le 2^{-j}, \]
then $i\notin S_j(x)$, whence (26) follows. The same argument applies to $x^-$; namely, we have
\[ 0 \le \left[-x^- - \sum_{k=1}^j 2^{-k}\mathbf{1}_{T_k(x)}\right]_i \le 2^{-j}. \quad (27) \]
Finally, (26) and (27) yield (25).

Let $x, y\in\mathbb{R}^n$ be any two unit vectors. Then by the triangle inequality
\[ |\langle Ax, y\rangle - \langle A\bar{x}, \bar{y}\rangle| \le |\langle Ax, y\rangle - \langle A\bar{x}, y\rangle| + |\langle A\bar{x}, y\rangle - \langle A\bar{x}, \bar{y}\rangle| = |\langle A(x-\bar{x}), y\rangle| + \left|\left\langle\bar{x}, A^T(y-\bar{y})\right\rangle\right| \le \|A\|\cdot\|x-\bar{x}\|\cdot\|y\| + \|\bar{x}\|\cdot\|A^T\|\cdot\|y-\bar{y}\| \le \|A\|\left(\|x-\bar{x}\| + \|y-\bar{y}\|\right) \overset{(25)}{\le} \sqrt{n}\,2^{2-N}\|A\|. \]


Since all entries of A are bounded by one in absolute value, we have $\|A\| \le n$. Hence,
\[ |\langle Ax, y\rangle - \langle A\bar{x}, \bar{y}\rangle| \le n^{3/2}\cdot 2^{2-N}. \quad (28) \]
To complete the proof, we need to bound $|\langle A\bar{x}, \bar{y}\rangle|$. Letting $u = \max_{\xi,\eta\in U}\langle A\xi, \eta\rangle$, we have
\[ \langle A\bar{x}, \bar{y}\rangle = \sum_{i,j=1}^N\frac{\left\langle A\mathbf{1}_{S_i(x)},\mathbf{1}_{S_j(y)}\right\rangle - \left\langle A\mathbf{1}_{S_i(x)},\mathbf{1}_{T_j(y)}\right\rangle - \left\langle A\mathbf{1}_{T_i(x)},\mathbf{1}_{S_j(y)}\right\rangle + \left\langle A\mathbf{1}_{T_i(x)},\mathbf{1}_{T_j(y)}\right\rangle}{2^{i+j}} \le u\sum_{i,j=1}^N\frac{\sqrt{|S_i(x)|\,|S_j(y)|} + \sqrt{|S_i(x)|\,|T_j(y)|} + \sqrt{|T_i(x)|\,|S_j(y)|} + \sqrt{|T_i(x)|\,|T_j(y)|}}{2^{i+j}} = u\left[\sum_{i=1}^N\frac{\sqrt{|S_i(x)|}+\sqrt{|T_i(x)|}}{2^i}\right]\cdot\left[\sum_{j=1}^N\frac{\sqrt{|S_j(y)|}+\sqrt{|T_j(y)|}}{2^j}\right] \le u\sqrt{N}\cdot\sqrt{\sum_{i=1}^N\frac{\left(\sqrt{|S_i(x)|}+\sqrt{|T_i(x)|}\right)^2}{2^{2i}}}\cdot\sqrt{N}\cdot\sqrt{\sum_{j=1}^N\frac{\left(\sqrt{|S_j(y)|}+\sqrt{|T_j(y)|}\right)^2}{2^{2j}}} \le 2Nu\|x\|\|y\| \le 2Nu. \]
(In the second step from the bottom we applied the Cauchy-Schwarz inequality.) Thus, (28) implies that
\[ \langle Ax, y\rangle \le n^{3/2}2^{2-N} + \langle A\bar{x}, \bar{y}\rangle \le n^{3/2}2^{2-N} + 2Nu \le 4n^{-1/2} + 2Nu = 2Nu + o(1). \quad (29) \]
As this holds for any x, y, the assertion follows. □

Theorem 8.2 Let $A = A_n = (a_{ij})$ be a symmetric n × n matrix with entries ±1 chosen uniformly at random. There is a constant c > 0 such that
\[ \lim_{n\to\infty}\mathrm{P}\left[\|A\| \le c\sqrt{n}\cdot\ln^2 n\right] = 1. \]

Proof. By Lemma 8.1 it suffices to prove that
\[ \lim_{n\to\infty}\mathrm{P}\left[\exists x, y\in U : \langle Ax, y\rangle > c\sqrt{n}\right] = 0, \quad (30) \]
for some constant c > 0. To simplify matters further, we claim that it is enough to show that
\[ \lim_{n\to\infty}\mathrm{P}\left[\exists x, y\in U,\ x\perp y : \langle Ax, y\rangle > c\sqrt{n}\right] = 0, \quad (31) \]
for some constant c > 0. To see this, assume for contradiction that $x, y\in U$ are such that $|\langle Ax, y\rangle| > 10c\sqrt{n}$, while
\[ \max_{\xi,\eta\in U:\xi\perp\eta}\langle A\xi, \eta\rangle \le c\sqrt{n}. \quad (32) \]
Let $S, T\subset\{1,\ldots,n\}$ be sets such that $x = |S|^{-1/2}\mathbf{1}_S$ and $y = |T|^{-1/2}\mathbf{1}_T$. Then
\[ 10c\sqrt{n} \le |\langle Ax, y\rangle| \le \frac{|\left\langle A\mathbf{1}_S,\mathbf{1}_{T\setminus S}\right\rangle| + |\left\langle A\mathbf{1}_{S\setminus T},\mathbf{1}_{T\cap S}\right\rangle| + |\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{T\cap S}\rangle|}{\sqrt{|S\times T|}} \le 2c\sqrt{n} + \frac{|\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{T\cap S}\rangle|}{\sqrt{|S\times T|}} \le 2c\sqrt{n} + \frac{|\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{T\cap S}\rangle|}{|S\cap T|}. \]
Now, let $R\subset S\cap T$ be a subset chosen uniformly at random. For $s\in S$ and $t\in T$ let $r(s,t) = 1$ if $s\in R$ and $t\notin R$. If $s\ne t$, then $\mathrm{E}[r(s,t)] = \frac{1}{4}$. Therefore, letting $\bar{R} = S\cap T\setminus R$, we get
\[ \mathrm{E}\langle A\mathbf{1}_R,\mathbf{1}_{\bar{R}}\rangle = \frac{1}{4}\left[\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{S\cap T}\rangle - \sum_{z\in S\cap T}a_{zz}\right]. \]


Hence, there exists a set R such that $|\langle A\mathbf{1}_R,\mathbf{1}_{\bar{R}}\rangle| \ge \frac{1}{4}\left[|\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{S\cap T}\rangle| - |S\cap T|\right]$, and thus for sufficiently large n,
\[ \frac{|\langle A\mathbf{1}_R,\mathbf{1}_{\bar{R}}\rangle|}{\sqrt{|R\times\bar{R}|}} \ge \frac{|\langle A\mathbf{1}_{S\cap T},\mathbf{1}_{T\cap S}\rangle|}{4|S\cap T|} - 1 \ge 2c\sqrt{n} - 1/4 > c\sqrt{n}, \]
which contradicts (32). This shows that proving (31) will be sufficient.

To continue, fix two disjoint sets $S, T\subset\{1,\ldots,n\}$ such that $s = |S| \ge t = |T|$. Let $b_{ij} = (1+a_{ij})/2\in\{0,1\}$. Then the random variable $B(S,T) = \sum_{(s,t)\in S\times T}b_{st}$ has a binomial distribution Bin(st, 1/2). Furthermore,
\[ \frac{|\langle A\mathbf{1}_S,\mathbf{1}_T\rangle|}{\sqrt{|S\times T|}} = \frac{2|B(S,T) - |S\times T|/2|}{\sqrt{|S\times T|}} = \frac{2|B(S,T) - \mathrm{E}[B(S,T)]|}{\sqrt{|S\times T|}}. \]
Hence,
\[ \mathrm{P}\left[\frac{|\langle A\mathbf{1}_S,\mathbf{1}_T\rangle|}{\sqrt{|S\times T|}} > c\sqrt{n}\right] = \mathrm{P}\left[\frac{|B(S,T) - \mathrm{E}[B(S,T)]|}{\sqrt{|S\times T|}} > c\sqrt{n}/2\right] = 2\cdot\mathrm{P}\left[B(S,T) < \mathrm{E}[B(S,T)] - c\sqrt{nst}/2\right], \]
by the symmetry of Bin(st, 1/2). Letting λ = st/2 and $\delta = c\sqrt{nst}/2$, we obtain from Theorem 5.3 that
\[ \mathrm{P}\left[\frac{|\langle A\mathbf{1}_S,\mathbf{1}_T\rangle|}{\sqrt{|S\times T|}} > c\sqrt{n}\right] \le 2\exp(-\delta^2/(2\lambda)) \le 2\exp(-c^2n/4) \le 8^{-n}, \]
provided that c is chosen sufficiently big. Hence, by the union bound
\[ \mathrm{P}\left[\exists S, T : S\cap T = \emptyset,\ \frac{|\langle A\mathbf{1}_S,\mathbf{1}_T\rangle|}{\sqrt{|S\times T|}} > c\sqrt{n}\right] \le 2^{2n}\cdot 8^{-n} \le 2^{-n}. \quad (33) \]
This implies (31), and thus the proof is complete. □
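The conclusion of Theorem 8.2 is easy to inspect numerically for moderate n. The sketch below (illustrative only; uses numpy, which is our choice) samples random symmetric ±1 matrices and prints the ratio of the operator norm to √n; the ratio stays bounded as n grows, consistent with (and in fact stronger than) the $O(\sqrt{n}\ln^2 n)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    A = rng.choice((-1.0, 1.0), size=(n, n))
    A = np.triu(A) + np.triu(A, 1).T              # make the matrix symmetric
    norm = np.max(np.abs(np.linalg.eigvalsh(A)))  # ||A|| = largest |eigenvalue|
    print(n, round(norm / np.sqrt(n), 2))
```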

Theorem 8.2 can be generalized substantially. A far more comprehensive result is the following.

Theorem 8.3 (Furedi-Komlos 1981) Suppose that $(a_{ij})_{1\le i\le j\le n}$ is a family of independent random variables such that
\[ |a_{ij}| \le 1, \quad \mathrm{E}(a_{ij}) = 0, \quad \text{and} \quad \mathrm{Var}(a_{ij}) \le \sigma^2 \]
for all i, j, where $\sigma^2 \ge \ln^{10}(n)/n$. Let $a_{ji} = a_{ij}$ for i > j, and let $A = (a_{ij})_{1\le i,j\le n}$. Then
\[ \lim_{n\to\infty}\mathrm{P}\left[\|A\| \le 10\sigma\sqrt{n}\right] = 1. \]

The proof of this result is based on a slightly different technique than we used in the proof of Theorem 8.2. However, the simple technique that we used is rather versatile and generalizes to many other types of random matrices.

An application: matrix sparsification. This paragraph follows Achlioptas, McSherry (2007). Suppose that we are given a symmetric n × n matrix $A = (a_{ij})$ with entries in [−1, 1]. In practice, the operations that can be performed on A are often limited by the storage space that A uses up in a computer. If we store A naively by one memory cell for each matrix entry, the number of cells required is n(n + 1)/2 (as the matrix is symmetric), which is quadratic in n. This is often infeasible. Of course, if A has a lot of entries that are equal to zero, we could economize memory usage by skipping the zero entries. This has the additional benefit that various matrix operations (such as multiplying the matrix by a vector, or computing the eigenvalues, or solving linear equations) can be sped up.

But what if A does not have a lot of zero entries? An appealing idea might be to approximate A by another matrix $\hat{A}$ that has a lot of zeros. For instance, it would be nice if we could find $\hat{A}$ such that
\[ \|A - \hat{A}\| \]


is 'small'. Our knowledge of random matrices (Theorem 8.3) can be harnessed to achieve this end. In fact, consider the random matrix $\hat{A} = (\hat a_{ij})$ defined as follows. For each 1 ≤ i ≤ j ≤ n we let $\xi_{ij}$ be a random variable that takes the value 1 with probability p and the value 0 with probability 1 − p independently of all others, for some 0 < p < 1. Let
\[ \hat a_{ij} = \frac{\xi_{ij}}{p}\cdot a_{ij} \]
for i ≤ j, and let $\hat a_{ji} = \hat a_{ij}$. Then for each i, j we have $\mathrm{E}(\hat a_{ij}) = a_{ij}$. Furthermore, $\mathrm{Var}(\hat a_{ij}) \le p^{-1}$. Hence, the random matrix $E = \frac{p}{2}(\hat{A} - A)$ satisfies the assumptions of Theorem 8.3 with $\sigma^2 = p/4$, provided that $p \ge 4\ln^{10}n/n$. Theorem 8.3 shows that with probability tending to one we have $\|E\| \le 10\sigma\sqrt{n}$, and thus
\[ \|A - \hat{A}\| \le 10\sqrt{n/p}. \quad (34) \]
Furthermore, the expected number of non-zero entries of $\hat{A}$ is bounded by $n^2p$. This means that we can reduce the number of non-zero entries by choosing a smaller p, at the expense of deteriorating the approximation guarantee (34).
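A minimal sketch of this sparsification scheme (illustrative only; the test matrix, parameters, and use of numpy are ours): keep each entry with probability p, rescale by 1/p, and compare the spectral norm of the error with the guarantee of (34).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 0.1
A = rng.uniform(-1.0, 1.0, size=(n, n))
A = (A + A.T) / 2                               # a symmetric matrix with entries in [-1, 1]

keep = rng.random((n, n)) < p
keep = np.triu(keep) | np.triu(keep, 1).T       # one coin per pair {i, j}, used symmetrically
A_hat = np.where(keep, A / p, 0.0)              # E[A_hat] = A entrywise

err = np.max(np.abs(np.linalg.eigvalsh(A_hat - A)))
print("||A_hat - A||        =", round(err, 2))
print("bound 10*sqrt(n/p)   =", round(10 * np.sqrt(n / p), 2))
print("fraction of nonzeros =", round(float(np.mean(A_hat != 0)), 3))
```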

9 The second moment method

9.1 Connectivity of random graphs

Remember the random graph G(n, p) on $V = \{1,\ldots,n\}$, where each edge is present with probability p ∈ [0, 1] independently? We are going to study G(n, p) "in the limit" n → ∞. More precisely, let $(p(n))_{n\ge 1}$ be a sequence of numbers p(n) ∈ [0, 1]. We say, somewhat sloppily, that G(n, p(n)) has some property $\mathcal{E}$ with high probability ('w.h.p.') if
\[ \lim_{n\to\infty}\mathrm{P}[G(n, p(n))\in\mathcal{E}] = 1. \]
As usual, we will usually just write p instead of p(n).

A graph G = (V, E) is connected if for any set $S\subset V$ such that $\emptyset\ne S\ne V$ there is an edge $e\in E$ such that $e\cap S\ne\emptyset\ne e\setminus S$. In other words, there is an edge that leads from S to V \ S.

Theorem 9.1 Fix any ε > 0.

1. If p ≤ (1− ε) lnn/n, then G(n, p) is disconnected w.h.p.

2. If p ≥ (1 + ε) lnn/n, then G(n, p) is connected w.h.p.

Proof. Suppose that p ≤ (1 − ε)ln n/n. We call a vertex v of G(n, p) isolated if there is no edge e such that v ∈ e. Let $I_v = 1$ if v is isolated, and set $I_v = 0$ otherwise. Let $I = \sum_{v\in V}I_v$ be the number of isolated vertices. Since for any v ∈ V the number of edges e such that v ∈ e is binomially distributed Bin(n − 1, p), we have
\[ \mathrm{E}[I_v] = (1-p)^{n-1} \ge \exp(-p(n-1) + n\cdot O(p^2)) = (1+o(1))n^{\varepsilon-1}. \]
Hence,
\[ \mathrm{E}[I] = \sum_{v\in V}\mathrm{E}[I_v] = (1+o(1))n^{\varepsilon}. \quad (35) \]
Furthermore, the second moment of I is
\[ \mathrm{E}[I^2] = \sum_{(v,w)\in V\times V}\mathrm{E}[I_vI_w] = n(n-1)\mathrm{P}[I_1 = I_2 = 1] + \mathrm{E}[I]. \]
Thus, we need to compute the probability that the first two vertices are isolated. Clearly, 1 and 2 are isolated iff there is no edge that touches either. The total number of possible edges touching 1 or 2 is 2(n − 1) − 1 = 2n − 3. Hence,
\[ \mathrm{P}[I_1 = I_2 = 1] = (1-p)^{2(n-1)-1} = (1-p)^{-1}(1-p)^{2(n-1)}. \]


Consequently, we obtain
\[ \mathrm{E}[I^2] - \mathrm{E}[I]^2 - \mathrm{E}[I] = n(n-1)\mathrm{P}[I_1 = I_2 = 1] - \sum_{(v,w)\in V\times V}\mathrm{E}[I_v]\,\mathrm{E}[I_w] = n(n-1)\left[(1-p)^{-1}(1-p)^{2(n-1)} - (1-p)^{2(n-1)}\right] \le c\cdot\mathrm{E}[I] \]
for a constant c > 0. Thus, $\mathrm{Var}[I] = \mathrm{E}[I^2] - \mathrm{E}[I]^2 \le (c+1)\mathrm{E}I$. Now, Chebyshev's inequality and (35) show that
\[ \mathrm{P}[I = 0] \le \mathrm{P}[|I - \mathrm{E}[I]| \ge \mathrm{E}[I]] \le \frac{\mathrm{Var}[I]}{\mathrm{E}[I]^2} \le \frac{c+1}{\mathrm{E}[I]} = o(1), \]
as ε > 0 is fixed as n → ∞. This means that I > 0 w.h.p., i.e., w.h.p. the random graph G(n, p) has at least one isolated vertex. In particular, it is disconnected.

Now, assume that np ≥ (1 + ε)ln n. We are going to show that w.h.p. there is no set $S\subset V$ of size 1 ≤ |S| ≤ n/2 such that there is no edge connecting S and V \ S. If we fix the size 1 ≤ s ≤ n/2 of this set S, then the number of potential edges equals s(n − s). Hence, for any one set $S\subset V$ of size |S| = s the probability that there is no S-(V \ S) edge equals $(1-p)^{s(n-s)}$. Thus, letting $X_s$ denote the number of such sets S, we obtain
\[ \mathrm{E}[X_s] \le \binom{n}{s}\cdot(1-p)^{s(n-s)} \le \left(\frac{en}{s}\right)^s\exp(-ps(n-s)) \le \exp\left(s\left[1 + \ln(n/s) - (1+\varepsilon)\ln n\cdot\frac{n-s}{n}\right]\right) = \exp\left(s\left[1 + (1+\varepsilon)\frac{s}{n}\ln n - \varepsilon\ln n - \ln s\right]\right). \]
The function $s\mapsto 1 + (1+\varepsilon)\frac{s}{n}\ln n - \varepsilon\ln n - \ln s$ is strictly decreasing on the interval [1, n/2]. Thus, its maximum value is attained at s = 1, whence
\[ \mathrm{E}[X_s] \le \binom{n}{s}\cdot(1-p)^{s(n-s)} \le \exp(-(1+o(1))\varepsilon\ln(n)\cdot s) = n^{(o(1)-1)\varepsilon\cdot s}, \]
and thus Markov's inequality shows that $\sum_{s=1}^{n/2}X_s = 0$ w.h.p. □
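The sharp threshold at p = ln(n)/n shows up clearly even for modest n. The following sketch (illustrative only; all names and parameters are ours) samples G(n, p) at p = c·ln(n)/n for c below, at, and above 1 and reports how often the sample is connected.

```python
import math, random

def connected(n, p):
    """Sample G(n, p) and test connectivity with a depth-first search from vertex 0."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i].append(j); adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w); stack.append(w)
    return len(seen) == n

n, trials = 400, 100
for c in (0.7, 1.0, 1.3):
    p = c * math.log(n) / n
    freq = sum(connected(n, p) for _ in range(trials)) / trials
    print(f"c = {c}: fraction connected = {freq}")
```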

9.2 Cliques in random graphs

For a graph G let ω(G) denote the number of vertices in the largest clique of G. For integers k, n we let
\[ f(k) = f_n(k) = \binom{n}{k}2^{-\binom{k}{2}} \]
denote the expected number of cliques of size k in the random graph G(n, 1/2). As usual, mostly the n in the index is omitted.

The aim in this section is to estimate ω(G(n, 1/2)). More precisely, we would like to find the 'typical' value of this random variable. Clearly, Markov's inequality provides an easy upper bound on this value.

Lemma 9.2 Fix ε > 0. Then $\omega(G(n, 1/2)) \le (2+\varepsilon)\log_2 n$ w.h.p.

Proof. Let $k = k(n) = \lceil(2+\varepsilon)\log_2 n\rceil$. The expected number of cliques of size k is
\[ f(k) = \binom{n}{k}2^{-\binom{k}{2}} \le \left(\frac{en}{k}\right)^k 2^{-k(k-1)/2} = \left(\frac{en}{k\,2^{(k-1)/2}}\right)^k \le \left(\frac{n}{2^{k/2}}\right)^k, \]
provided that n (and thus k) is sufficiently large. Since $k \ge (2+\varepsilon)\log_2 n$, we have $n/2^{k/2} \le n^{-\varepsilon/2} \le 1/e$ (assuming, once more, that n is large), and thus $f(k) \le \exp(-k) = o(1)$. □

Using the second moment method, we can also derive the following lower bound.


Theorem 9.3 Fix ε > 0. Then w.h.p.
\[ (2-\varepsilon)\log_2 n \le \omega(G(n, 1/2)) \le (2+\varepsilon)\log_2 n. \]

Proof. The upper bound follows from Lemma 9.2. With respect to the lower bound, let $k = \lceil(2-\varepsilon)\log_2 n\rceil$. Let X be the number of cliques of size k in G(n, 1/2). Then the expectation of X is
\[ \mathrm{E}[X] = f(k) \ge \left(\frac{n}{k}\right)^k 2^{-\binom{k}{2}} = 2^{k[\log_2(n/k) - (k-1)/2]} = 2^{k[-\log_2 k + (\frac{\varepsilon}{2}-o(1))\log_2 n]} = 2^{k[(\frac{\varepsilon}{2}-o(1))\log_2 n - O(\log_2\log_2 n)]} \to\infty. \quad (36) \]
We are going to show that
\[ \mathrm{Var}[X] = o(\mathrm{E}[X]^2). \quad (37) \]
Then Chebyshev's inequality implies
\[ \mathrm{P}[X = 0] \le \mathrm{P}[|X - \mathrm{E}X| \ge \mathrm{E}X] \le \frac{\mathrm{Var}X}{(\mathrm{E}X)^2} = o(1), \]
i.e., G(n, 1/2) has a clique of size k w.h.p.

Thus, we need to establish (37). Since $\mathrm{Var}X = \mathrm{E}[X^2] - \mathrm{E}[X]^2$ and E[X] = f(k), the main issue is the computation of $\mathrm{E}[X^2]$. For a set $S\subset V = \{1,\ldots,n\}$, we let $X_S = 1$ if S is a clique in G(n, 1/2), and $X_S = 0$ otherwise. Then $X = \sum_S X_S$. Thus, with S, T ranging over subsets of V of size k,
\[ \mathrm{E}[X^2] = \sum_{S,T}\mathrm{P}[X_S = X_T = 1] = \mathrm{E}X + \sum_{S\ne T}\mathrm{P}[X_S = X_T = 1] = \mathrm{E}X + \sum_{S\ne T}\mathrm{P}[X_S = 1]\cdot\mathrm{P}[X_T = 1 \mid X_S = 1] \le \mathrm{E}[X]\cdot\left[1 + \max_S\sum_{T\ne S}\mathrm{P}[X_T = 1 \mid X_S = 1]\right]. \quad (38) \]
If we fix any two sets $S, T\subset V$ of size k, then
\[ \mathrm{P}[X_T = 1 \mid X_S = 1] = 2^{\binom{|S\cap T|}{2} - \binom{k}{2}}. \]
For given that $X_S = 1$, we know that $S\cap T$ is a clique, but conditioning on $X_S = 1$ does not yield a conditioning on edges that contain at most one vertex from S. Thus, for a given $S\subset V$ we should rearrange the sum in (38) according to the 'overlaps' $|S\cap T|$:
\[ \sum_{T\ne S}\mathrm{P}[X_T = 1 \mid X_S = 1] = \sum_{l=0}^{k-1}\sum_{T:|S\cap T|=l}2^{\binom{l}{2}-\binom{k}{2}} = \sum_{l=0}^{k-1}\binom{n-k}{k-l}\binom{k}{l}2^{\binom{l}{2}-\binom{k}{2}}. \quad (39) \]
Since $k \le 2\log_2 n$ is 'small' relative to n, the 'entropy term' $\binom{n-k}{k-l}\binom{k}{l}$ is maximized for l = 0. However, this is also the term that minimizes the 'energy term' $2^{\binom{l}{2}-\binom{k}{2}}$. To figure this tradeoff out, let
\[ g(l) = \binom{n-k}{k-l}\binom{k}{l}2^{\binom{l}{2}-\binom{k}{2}} \]
be the contribution for a given 0 ≤ l ≤ k. Then for 0 ≤ l < k we have
\[ \frac{g(l+1)}{g(l)} = \frac{(k-l)^2}{(l+1)(n-k-l)}\cdot 2^{\binom{l+1}{2}-\binom{l}{2}} = (1+o(1))\frac{2^l\cdot(k-l)^2}{(l+1)n}. \quad (40) \]
Fix a sufficiently small δ > 0. We will consider three cases.


Case 1: $0 \le l \le (1-\delta)\log_2 n$. The r.h.s. of (40) is
\[ \frac{2^l(k-l)^2}{(l+1)n} \le \frac{k^22^l}{n} \le k^2n^{-\delta} = o(1), \]
and thus g(l + 1) = o(g(l)).

Case 2: $(1+\delta)\log_2 n \le l < k$. We obtain
\[ \frac{2^l(k-l)^2}{(l+1)n} \ge \frac{2^l}{kn} \ge n^{\delta}/k, \]
whence (40) yields g(l) = o(g(l + 1)).

Case 3: $(1-\delta)\log_2 n \le l < (1+\delta)\log_2 n$. Estimating g(l) directly, we get for large enough n
\[ g(l) \le n^{k-l}k^l2^{\binom{l}{2}-\binom{k}{2}} = (kn)^l2^{\binom{l}{2}-\binom{2l}{2}}\cdot n^{k-2l}2^{\binom{2l}{2}-\binom{k}{2}} \le (kn)^l2^{\binom{l}{2}-\binom{2l}{2}}\cdot n^{2\delta}2^{2(\delta+\varepsilon+o(1))\log_2 n} \le \left[kn\cdot 2^{-3l/2}\right]^l n^{4\delta+2\varepsilon+o(1)} \le \left[kn^{1-3(1-\delta)/2}\right]^l\cdot n^{4\delta+2\varepsilon+o(1)} \le n^{\frac{l}{2}(-1+3\delta/2+o(1))+4\delta+2\varepsilon+o(1)} \le n^{-1/2}, \]
provided that ε, δ > 0 are chosen sufficiently small.

Thus, in summary, we obtain
\[ \sum_{l=0}^{k-1}g(l) \le g(0) + o(g(k)) + o(1) = (1+o(1))\mathrm{E}[X]. \quad (41) \]
(For the last step observe that g(0) = f(k) = E[X].) Hence, (38) and (39) yield
\[ \mathrm{E}[X^2] = (1+o(1))\mathrm{E}[X]^2, \]
and thus
\[ \mathrm{Var}[X] = (1+o(1))\mathrm{E}[X]^2 - \mathrm{E}[X]^2 = o(\mathrm{E}[X]^2). \quad (42) \]
Finally, (37) follows from (36) and (42). □

9.3 Prime factorizations

For an integer n ≥ 2 let $P_n$ denote the set of all primes p ≤ n. For an integer x we let ν(x) signify the number of primes that divide x. Recall that the covariance of two random variables X, Y is Cov(X, Y) = E(XY) − E(X)E(Y).

Lemma 9.4 We have $\sum_{p\in P_n}\frac{1}{p} = O(1) + \ln\ln n$.

Proof. The proof proceeds in several steps. First, we claim that
\[ \sum_{p\in P_n}\ln p = O(n). \quad (43) \]
We may assume that $n = 2^J$ is a power of two. Each prime $p\in P_{2n}\setminus P_n$ divides $\binom{2n}{n}$, because p divides the numerator (2n)! but not the denominator $(n!)^2$. Therefore,
\[ \sum_{p\in P_{2n}\setminus P_n}\ln p \le \ln\binom{2n}{n} \le n\ln 4. \]


Hence, ∑p∈P2n

ln p ≤J∑j=0

∑p∈P2j+1\P2j

ln p ≤J∑j=0

2j+1 ln 2 ≤ 2J+1 ln 2 = O(n),

which shows (43). As a next step, we are going to see that

\sum_{p ∈ P_n} \frac{\ln p}{p} = \ln n + O(1).   (44)

Decomposing each integer j ≤ n into primes, we see that

n! = \prod_{p ∈ P_n} p^{\sum_{k ≥ 1} \lfloor n/p^k \rfloor}.

Hence, Stirling's formula yields

n \ln n + O(n) = \sum_{p ∈ P_n} \sum_{k ≥ 1} \lfloor n/p^k \rfloor \ln p.   (45)

The sum over k is dominated by the first summand, i.e., \sum_{k ≥ 1} \lfloor n/p^k \rfloor = n/p + O(1) + O(n/p^2). Hence, (45) gives

\sum_{p ∈ P_n} \frac{n \ln p}{p} = n \ln n + O\Big(n + \sum_{p ∈ P_n} \ln p + \sum_{p ≤ n} \frac{n \ln p}{p^2}\Big).

Invoking (43), recalling that the sum \sum_{i ≥ 1} \frac{\ln i}{i^2} converges, and cancelling n, we obtain (44).

Finally, we are going to derive the assertion from (44). By elementary calculus, for any p ∈ P_n we have

\frac{1}{p} = \frac{\ln p}{p} \cdot \frac{1}{\ln p} = \frac{\ln p}{p} \int_p^\infty (t \ln^2 t)^{−1} dt.

Let 1_{(p,\infty)}(t) be equal to one if t > p, and 0 if t ≤ p. Then

\sum_{p ∈ P_n} \frac{1}{p} = \sum_{p ∈ P_n} \frac{\ln p}{p} \int_2^\infty \frac{1_{(p,\infty)}(t)}{t \ln^2 t} dt
  = \int_2^n \Big(\sum_{p ∈ P_n : p ≤ t} \frac{\ln p}{p}\Big) \frac{dt}{t \ln^2 t} + \int_n^\infty \frac{dt}{t \ln^2 t} \cdot \sum_{p ∈ P_n} \frac{\ln p}{p}
  = \int_2^n \frac{(\ln t + O(1)) dt}{t \ln^2 t} + O(1)   [by (44)].   (46)

Since \int \frac{dt}{t \ln t} = \ln \ln t and \int_2^\infty (t \ln^2 t)^{−1} dt < ∞, (46) implies the assertion. 2
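
Lemma 9.4 is easy to check empirically. The following sketch (standard-library Python; the values of n are illustrative) sums 1/p over the primes up to n; the difference to ln ln n indeed stays bounded (it approaches Mertens' constant ≈ 0.2615).

    from math import log

    def primes_up_to(n):
        """Sieve of Eratosthenes."""
        sieve = bytearray([1]) * (n + 1)
        sieve[0] = sieve[1] = 0
        for i in range(2, int(n ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
        return [p for p, is_p in enumerate(sieve) if is_p]

    for n in (10 ** 3, 10 ** 5, 10 ** 7):
        s = sum(1 / p for p in primes_up_to(n))
        print(f"n = {n:>8}: sum 1/p = {s:.4f}, ln ln n = {log(log(n)):.4f}, difference = {s - log(log(n)):.4f}")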

Theorem 9.5 Let ω = ω(n) be a function such that ω → ∞. For all but o(n) integers 0 < x ≤ n we have

|ν(x) − ln ln n| ≤ ω(n) \sqrt{\ln \ln n}.

Proof. For a uniformly random x ∈ {1, . . . , n} and p ∈ P_n we let X_p = 1 if p divides x, and X_p = 0 otherwise. For any prime p ≤ n we have

E[X_p] = P[p divides x] = \frac{\lfloor n/p \rfloor}{n}   and   Var[X_p] = \frac{\lfloor n/p \rfloor}{n} \Big(1 − \frac{\lfloor n/p \rfloor}{n}\Big).   (47)

Thus, \sum_{p ≤ n} X_p is the total number ν(x) of primes dividing x. Technically, it is a little easier to work with a slightly modified random variable. Set N = n^{0.1} and define

X = \sum_{p ∈ P_N} X_p.


Then

ν(x) ≥ X ≥ ν(x) − 10.   (48)

By Lemma 9.4,

E[X] = \sum_{p ∈ P_N} \frac{\lfloor n/p \rfloor}{n} = O(1) + \sum_{p ∈ P_N} \frac{1}{p} = O(1) + \ln \ln N = O(1) + \ln \ln n.   (49)

In order to prove the theorem, we will use Chebyshev's inequality. Thus, we need to bound Var[X]. By (47) we have

Var[X] = \sum_{p ∈ P_N} Var[X_p] + \sum_{p,q ∈ P_N : p \ne q} Cov[X_p, X_q]
       = O(1) + \sum_{p ∈ P_N} \frac{1}{p}(1 − 1/p) + \sum_{p,q ∈ P_N : p \ne q} Cov[X_p, X_q]
       ≤ O(1) + \ln \ln n + \sum_{p,q ∈ P_N : p \ne q} Cov[X_p, X_q].   (50)

If p, q ∈ P_N are distinct, then the event that pq divides x has probability \lfloor n/(pq) \rfloor / n. Hence, (47) yields

Cov[X_p, X_q] = \frac{\lfloor n/(pq) \rfloor}{n} − \frac{\lfloor n/p \rfloor}{n} \cdot \frac{\lfloor n/q \rfloor}{n} ≤ \frac{1}{n} \Big(\frac{1}{p} + \frac{1}{q}\Big).

Therefore,

\sum_{p \ne q} Cov[X_p, X_q] ≤ \frac{1}{n} \sum_{p \ne q} \Big(\frac{1}{p} + \frac{1}{q}\Big) ≤ \frac{2N}{n} \sum_{p ∈ P_N} \frac{1}{p}.

Thus, Lemma 9.4 entails that \sum_{p \ne q} Cov[X_p, X_q] ≤ o(1). Similarly, −\sum_{p \ne q} Cov[X_p, X_q] ≤ o(1). Plugging these estimates into (50), we see that Var[X] = O(1) + ln ln n. Thus, the assertion follows from Chebyshev's inequality. 2
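
The concentration claimed in Theorem 9.5 is also visible in simulations. Here is a minimal sketch; n = 10^6 and ω = 2 are hypothetical choices. It samples random integers x ≤ n, computes ν(x) by trial division, and reports the fraction falling within 2√(ln ln n) of ln ln n.

    import random
    from math import log, sqrt

    def nu(x):
        """Number of distinct primes dividing x (trial division)."""
        count, d = 0, 2
        while d * d <= x:
            if x % d == 0:
                count += 1
                while x % d == 0:
                    x //= d
            d += 1
        return count + (1 if x > 1 else 0)

    n = 10 ** 6
    mu = log(log(n))
    sample = [random.randint(1, n) for _ in range(2000)]
    inside = sum(1 for x in sample if abs(nu(x) - mu) <= 2 * sqrt(mu))   # omega(n) = 2, hypothetical
    print(f"ln ln n = {mu:.3f}; fraction within 2*sqrt(ln ln n): {inside / len(sample):.3f}")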

10 The minimum spanning tree

Strictly speaking, in this section we go beyond the realm of discrete probability. For an integer n let K_n denote the complete graph on V = {1, . . . , n}, i.e., the graph in which any two vertices are adjacent. Let E_n be the edge set of K_n. Furthermore, let (X_e)_{e ∈ E_n} be a family of mutually independent random variables so that each X_e is uniformly distributed on the interval [0, 1].

A spanning tree of K_n is a graph T = (V, E_T) on V = {1, . . . , n} that is connected but has no cycles. The length of T is

ℓ(T) = \sum_{e ∈ E_T} X_e.

We would like to figure out the expected minimum length of a spanning tree. To this end, we will need the following result from combinatorics.

Lemma 10.1 (Cayley's formula) The total number of spanning trees of K_n equals n^{n−2}.

Proof. You will see the proof of this in the seminar. 2

Theorem 10.2 We have \lim_{n→∞} E[\min_T ℓ(T)] = ζ(3) = \sum_{j=1}^{∞} 1/j^3 = 1.202 . . .


Proof. Fix n sufficiently large. With probability 1 the edge lengths X_e are pairwise distinct. In this case, the minimum spanning tree is unique. Let T = (V, E_T) be the minimum spanning tree. Then

ℓ(T) = \sum_{e ∈ E_T} X_e = \sum_{e ∈ E_T} \int_0^1 1_{p ≤ X_e} dp = \int_0^1 \sum_{e ∈ E_T} 1_{p ≤ X_e} dp = \int_0^1 |\{e ∈ E_T : X_e ≥ p\}| dp.   (51)

For 0 ≤ p ≤ 1 let E_p be the set of all edges e of K_n such that X_e < p. Furthermore, let G_p = (V, E_p). This graph may be disconnected. Let κ(G_p) be the number of connected components of G_p. Since T is the minimum spanning tree, the number of edges e ∈ E_T with X_e ≥ p equals the number of components of G_p minus 1. Hence, (51) yields

ℓ(T) = \int_0^1 (κ(G_p) − 1) dp.

By Fubini's theorem, we have

E\Big[\min_T ℓ(T)\Big] = \int_0^1 E[κ(G_p) − 1] dp.   (52)

Thus, we need to estimate E[κ(G_p)]. Since the random variables X_e are uniformly distributed, G_p has the same distribution as the random graph G(n, p).

Case 1: p ≥ 6 ln(n)/n. We already saw that G(n, p) is connected w.h.p. if np > (1 + ε) ln n for a fixed ε > 0. In fact, the proof of Theorem 9.1 implies that for p > 6 ln n/n and n sufficiently large we have

P[G(n, p) is disconnected] ≤ n^{−2}.

This implies that for p ≥ 6 ln(n)/n we have

E[κ(G_p)] ≤ 1 + n · P[G(n, p) is disconnected] ≤ 1 + 1/n = 1 + o(1).   (53)

Case 2: p < 6 ln(n)/n. For each 1 ≤ k ≤ n let A_k be the number of components of G(n, p) with k vertices that are trees. Moreover, let B_k be the number of components on k vertices that are not trees. Finally, let C be the number of components on more than ln^2 n vertices. By Lemma 10.1, for any k ≤ ln^2 n we have

E[A_k] = \binom{n}{k} k^{k−2} p^{k−1} (1 − p)^{k(n−k) + \binom{k}{2} − k + 1}
       = (1 + o(1)) \frac{n^k k^{k−2}}{k!} p^{k−1} (1 − p)^{kn} · (1 − p)^{\binom{k}{2} − k − k^2}
       = (1 + o(1)) \frac{n^k k^{k−2}}{k!} p^{k−1} (1 − p)^{kn},

because p · (\binom{k}{2} − k − k^2) = o(1). Furthermore, again for k ≤ ln^2 n,

E[B_k] ≤ \binom{n}{k} k^{k−2} p^{k−1} (1 − p)^{k(n−k)} · \binom{k}{2} p
       ≤ \frac{1 + o(1)}{2} \Big(\frac{en}{k}\Big)^k k^k p^k \exp(−knp)
       ≤ (1 + o(1)) [np · \exp(1 − np)]^k ≤ c,

for some constant c > 0. (The last step is because the function z ↦ z · \exp(1 − z) is bounded on (0, ∞).) Finally, we can say deterministically that C ≤ n/\ln^2 n. Hence,

\Big| E[κ(G_p)] − (1 + o(1)) \sum_{1 ≤ k ≤ \ln^2 n} \frac{n^k k^{k−2}}{k!} p^{k−1} (1 − p)^{kn} \Big| ≤ c + \frac{n}{\ln^2 n}.   (54)


Since

\int_0^{6\ln(n)/n} \sum_{1 ≤ k ≤ \ln^2 n} E[B_k] dp ≤ \frac{6 \ln n}{n} · c = o(1),
\int_0^{6\ln(n)/n} E[C] dp ≤ \frac{6 \ln n}{n} · \frac{n}{\ln^2 n} = o(1),

we obtain from (54) that

\int_0^{6\ln n/n} E[κ(G_p)] dp + o(1) = (1 + o(1)) \sum_{1 ≤ k ≤ \ln^2 n} \frac{n^k k^{k−2}}{k!} \int_0^{6\ln n/n} p^{k−1} (1 − p)^{kn} dp.   (55)

Now, by Stirling's formula,

\sum_{1 ≤ k ≤ \ln^2 n} \frac{n^k k^{k−2}}{k!} \int_{6\ln n/n}^1 p^{k−1} (1 − p)^{kn} dp ≤ \sum_{1 ≤ k ≤ \ln^2 n} (en)^k \int_{6\ln n/n}^1 \exp(−knp) dp ≤ \sum_{1 ≤ k ≤ \ln^2 n} (en)^k n^{−6k} = o(1).

Therefore, (55) yields

(1 + o(1)) \int_0^{6\ln n/n} E[κ(G_p)] dp + o(1) = \sum_{1 ≤ k ≤ \ln^2 n} \frac{n^k k^{k−2}}{k!} \int_0^1 p^{k−1} (1 − p)^{kn} dp
  = \sum_{1 ≤ k ≤ \ln^2 n} \frac{n^k k^{k−2}}{k!} · \frac{(k − 1)! (kn)!}{(kn + k)!}
  = \sum_{1 ≤ k ≤ \ln^2 n} n^k k^{k−3} \prod_{i=1}^{k} \frac{1}{kn + i}
  = (1 + o(1)) \sum_{1 ≤ k ≤ \ln^2 n} 1/k^3 = o(1) + \sum_{k=1}^{∞} 1/k^3.   (56)

Finally, the assertion follows by combining (52), (53), and (56). 2
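
Theorem 10.2 can be checked by simulation. The sketch below (the value of n and the number of runs are arbitrary) generates uniform edge weights on K_n, computes the minimum spanning tree with Prim's algorithm, and compares the average length to ζ(3) ≈ 1.202.

    import random

    def random_mst_length(n):
        """Length of the minimum spanning tree of K_n with i.i.d. uniform [0,1] edge weights
        (Prim's algorithm)."""
        w = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                w[i][j] = w[j][i] = random.random()
        in_tree = [False] * n
        in_tree[0] = True
        dist = w[0][:]                       # cheapest edge from the tree to each vertex
        total = 0.0
        for _ in range(n - 1):
            v = min((u for u in range(n) if not in_tree[u]), key=lambda u: dist[u])
            total += dist[v]
            in_tree[v] = True
            for u in range(n):
                if not in_tree[u] and w[v][u] < dist[u]:
                    dist[u] = w[v][u]
        return total

    n, runs = 300, 20
    avg = sum(random_mst_length(n) for _ in range(runs)) / runs
    zeta3 = sum(1 / j ** 3 for j in range(1, 10 ** 5))
    print(f"average MST length for n = {n}: {avg:.4f}   (zeta(3) = {zeta3:.4f})")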

11 Codes

Suppose we want to transmit data over a noisy channel. Instead of transmitting the data in plaintext, it might be a good idea to encode the data so as to maximize the probability that the receiver obtains the right information. In information theory, the goal is to find ways to encode data for this kind of transmission as effectively as possible. Throughout, we assume that our data is just a string of bits.

A code is a map f : {0, 1}^m → {0, 1}^n together with a map g : {0, 1}^n → {0, 1}^m. The map f is called the encoding function, and g is called the decoding function. The rate of the code is m/n. Given 0 < p < 1, let e = (e_1, . . . , e_n) ∈ {0, 1}^n be chosen randomly so that P[e_i = 1] = p for each 1 ≤ i ≤ n, and the individual components (e_i)_{1 ≤ i ≤ n} are mutually independent. Furthermore, let ⊕ signify componentwise addition of vectors over the field F_2 with two elements. Then we define the error probability of the code as

\max_{x ∈ \{0,1\}^m} P[g(f(x) ⊕ e) \ne x].

We let

h : (0, 1) → (0, 1],   z ↦ −z \log_2(z) − (1 − z) \log_2(1 − z)

denote the binary entropy function.


Theorem 11.1 (Shannon’s theorem) For any p ∈ (0, 1/2) and any ε > 0 there is a code with ratem/n ≥ 1− h(p)− ε and error probability at most ε.

Proof. Choose δ > 0 small so that p+ δ < 1/2 and h(p+ δ) < h(p) + ε/2. Picking n large enough, we letm = n(1− h(p)− ε). This ensures the desired rate. Choose a function f : 0, 1m → 0, 1n uniformlyat random from the set of all 2n2m

such functions. To define g, let d(a, b) be the Hamming distance of twowords a, b ∈ 0, 1n, i.e., the number of components where a, b differ. Then for any y ∈ 0, 1n we letg(y) be the word x ∈ 0, 1m such that the distance d(g(y), f(x)) is minimum. (If there are several x withthis distance, choose one of them arbitrarily.)

For x ∈ 0, 1m let Ax be the event that there is x′ ∈ 0, 1m such that d(f(x), f(x′)) ≤ n(p + δ).Moreover, let A =

⋃xAx. Then by the union bound, Stirling’s formula, and our choice of m,

P [A] ≤∑

x∈0,1mP [Ax] ≤ 2m

(n

bn(p+ δ)c

)2−n

≤ 2m+(h(p+δ)+o(1))n−n ≤ 2−Ω(n).

Hence, there exist maps f, g such that the event A does not occur. Fix such f, g.If A does not occur, then we can bound the error probability as follows. If g(f(x) + e) 6= x, then

d(f(x) + e, f(x)) > n(p+ δ). Hence, the total number of ones in the vector e exceeds n(p+ δ). But thisnumber has a binomial distribution Bin(n, p). Therefore, by the Chernoff bound (Theorem 5.3), we havefor n large enough

P [g(f(x) + e) 6= x] ≤ P [Bin(n, p) > np+ δ] ≤ exp(−Ω(n)) < ε.

This shows that the error probability of the code is less than ε, as desired. 2
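
The quantities appearing in Theorem 11.1 are easy to compute. Here is a small sketch; p, ε, δ and n are hypothetical parameters. It evaluates the binary entropy h(p), the resulting rate, and estimates the tail probability P[Bin(n, p) > n(p + δ)] from the proof by Monte Carlo.

    import random
    from math import log2

    def h(z):
        """Binary entropy function."""
        return -z * log2(z) - (1 - z) * log2(1 - z)

    p, eps, delta, n = 0.1, 0.05, 0.02, 1000
    m = int(n * (1 - h(p) - eps))
    print(f"h(p) = {h(p):.4f}, rate m/n = {m / n:.4f}, 1 - h(p) - eps = {1 - h(p) - eps:.4f}")

    trials = 5000
    bad = sum(1 for _ in range(trials)
              if sum(random.random() < p for _ in range(n)) > n * (p + delta))
    print(f"estimated P[Bin(n, p) > n(p + delta)] = {bad / trials:.4f}  (shrinks exponentially in n)")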

12 Markov Chain Monte Carlo

12.1 Basics

A Markov chain is a sequence (X_n)_{n ≥ 0} of random variables with values in a finite set Ω such that there is a matrix P = (P(x, y))_{x,y ∈ Ω} such that

P[X_{n+1} = x_{n+1} | X_0 = x_0, . . . , X_n = x_n] = P(x_n, x_{n+1})   for any n ≥ 0, x_0, . . . , x_{n+1} ∈ Ω.   (57)

Intuitively, you could think of the Markov chain as a walk on Ω. In each step, the probability distribution for the transition from the current 'state' x_n of the chain is given by the row (P(x_n, y))_{y ∈ Ω}. Thus, the distribution of the next state only depends on the current state, but not on the history of the process, or on the 'time'.

The condition (57) implies directly that P(x, y) ≥ 0 for all x, y, and \sum_{y ∈ Ω} P(x, y) = 1 for all x. A matrix with these two properties is called row-stochastic.

Assume that the distribution of X_0 is given by a vector q^{[0]} ∈ [0, 1]^Ω, i.e., P[X_0 = x] = q^{[0]}_x for any x ∈ Ω. Then (57) implies that for any t ≥ 1 the distribution of X_t is

q^{[t]} = q^{[0]} P^t.

A probability distribution q ∈ [0, 1]^Ω is called stationary if qP = q. Furthermore, the Markov chain is called irreducible if for any x, y ∈ Ω there is t ≥ 1 such that P^t(x, y) > 0. (Here P^t(x, y) denotes the x, y entry of the matrix P^t.) In addition, the Markov chain is aperiodic if for all x, y ∈ Ω we have

gcd\{t ≥ 1 : P^t(x, y) > 0\} = 1.

Finally, a Markov chain that is both irreducible and aperiodic is called ergodic. The single most important fact about Markov chains is the following.


Theorem 12.1 An ergodic Markov chain has a unique stationary distribution π, and for any x, y we have

\lim_{t→∞} P^t(x, y) = π(y).

The proof of Theorem 12.1 is from [8]. We assume that the Markov chain (X_0, X_1, . . .) under discussion has finite state space and transition matrix P. For the proof of Theorem 12.1 we need the notion of total variation distance between distributions. If µ, ν are two probability distributions on Ω, we define their total variation distance as

d(µ, ν) = ‖µ − ν‖_{TV} = \frac{1}{2} \sum_{ω ∈ Ω} |µ(ω) − ν(ω)|.

This defines a metric. For our case, it suffices to show that for an ergodic Markov chain with transition matrix P there is a stationary distribution π such that for any x ∈ Ω it holds that

\lim_{t→∞} d(P^t_x, π) = 0,

where P^t_x is the t-step distribution of the Markov chain conditioned on starting from state x.

For x ∈ Ω, define the hitting time for x to be

τ_x := \min\{t ≥ 0 : X_t = x\},

the first time at which the chain visits state x. For situations where only a visit to x at a positive time will do, we also define

τ^+_x := \min\{t ≥ 1 : X_t = x\}.

When X_0 = x, we call τ^+_x the first return time.

Lemma 12.2 For any states x and y of an irreducible chain, E_x(τ^+_y) < ∞.

Proof. The definition of irreducibility implies that there exists an integer r > 0 and a real ε > 0 with the following property: for any states z, w ∈ Ω, there exists a j ≤ r with P^j(z, w) > ε. Thus for any value of X_t, the probability of hitting state y at a time between t and t + r is at least ε. Hence for k > 0 we have

P_x[τ^+_y > kr] ≤ (1 − ε) P_x[τ^+_y > (k − 1)r].   (58)

Repeated application of (58) yields

P_x[τ^+_y > kr] ≤ (1 − ε)^k.   (59)

Recall that when Y is a non-negative integer-valued random variable, we have

E[Y] = \sum_{t ≥ 0} P[Y > t].

Since P_x[τ^+_y > t] is a decreasing function of t, (59) suffices to bound all terms of the corresponding expression for E_x(τ^+_y):

E_x[τ^+_y] = \sum_{t ≥ 0} P_x[τ^+_y > t] ≤ \sum_{k ≥ 0} r P_x[τ^+_y > rk] ≤ r \sum_{k ≥ 0} (1 − ε)^k < ∞.

2

Proposition 12.3 Let P be the transition matrix of an irreducible Markov chain. Then

1. there exists a probability distribution π on Ω such that π = πP and π(x) > 0 for all x ∈ Ω, and moreover,

2. π(x) = 1/E_x[τ^+_x].


Proof. Let z ∈ Ω be an arbitrary state of the Markov chain. We will closely examine the time the chain spends, on average, at each state in between visits to z. Hence define

π̃(y) = E_z[number of visits to y before returning to z] = \sum_{t=0}^{∞} P_z[X_t = y, τ^+_z > t].   (60)

For any state y, we have π̃(y) ≤ E_z τ^+_z. Hence Lemma 12.2 ensures that π̃(y) < ∞ for all y ∈ Ω. We check that π̃ is stationary, starting from the definition:

\sum_{x ∈ Ω} π̃(x) P(x, y) = \sum_{x ∈ Ω} \sum_{t=0}^{∞} P_z[X_t = x, τ^+_z > t] P(x, y).   (61)

Because the event \{τ^+_z ≥ t + 1\} = \{τ^+_z > t\} is determined by X_0, . . . , X_t,

P_z[X_t = x, X_{t+1} = y, τ^+_z ≥ t + 1] = P_z[X_t = x, τ^+_z ≥ t + 1] P(x, y).   (62)

Reversing the order of summation in (61) and using the identity (62) shows that

\sum_{x ∈ Ω} π̃(x) P(x, y) = \sum_{t=0}^{∞} P_z[X_{t+1} = y, τ^+_z ≥ t + 1] = \sum_{t=1}^{∞} P_z[X_t = y, τ^+_z ≥ t].   (63)

The expression in (63) is very similar to (60), so we are almost done. In fact,

\sum_{t=1}^{∞} P_z[X_t = y, τ^+_z ≥ t] = π̃(y) − P_z[X_0 = y, τ^+_z > 0] + \sum_{t=1}^{∞} P_z[X_t = y, τ^+_z = t]
  = π̃(y) − P_z[X_0 = y] + P_z[X_{τ^+_z} = y]   (64)
  = π̃(y).   (65)

The equality (65) follows by considering two cases:

y = z : Since X_0 = z and X_{τ^+_z} = z, the last two terms of (64) are both 1 (they cancel each other out).

y \ne z : Here both terms of (64) are 0.

Therefore, combining (63) with (65) shows that π̃ = π̃P. Finally, to get a probability measure, we normalize by \sum_x π̃(x) = E_z(τ^+_z):

π(x) = \frac{π̃(x)}{E_z(τ^+_z)}   satisfies π = πP.

In particular, for any x ∈ Ω,

π(x) = \frac{1}{E_x[τ^+_x]}.

2

Now, Theorem 12.1 follows as a corollary from the following proposition.

Proposition 12.4 Suppose that P is irreducible and aperiodic, with stationary distribution π. Then there exist constants α ∈ (0, 1) and C > 0 such that

\max_{x ∈ Ω} ‖P^t(x, ·) − π‖_{TV} ≤ C α^t.


Proof. Since P is irreducible and aperiodic, there exists an r such that P^r has strictly positive entries. Let Π be the matrix with |Ω| rows, each of which is the row vector π. For sufficiently small δ > 0, we have

P^r(x, y) ≥ δ π(y)

for all x, y ∈ Ω. Letting θ = 1 − δ, the equation

P^r = (1 − θ)Π + θQ,   (66)

defines a stochastic matrix Q. It is a straightforward computation to check that MΠ = Π for any stochastic matrix M and that ΠM = Π for any matrix M such that πM = π.

Next, we use induction to demonstrate that

P^{rk} = (1 − θ^k)Π + θ^k Q^k   (67)

for k ≥ 1. If k = 1, this holds by (66). Assuming that (67) holds for k = n,

P^{r(n+1)} = P^{rn} P^r = [(1 − θ^n)Π + θ^n Q^n] P^r.   (68)

Distributing and expanding P^r in the second term (using (66)) gives

P^{r(n+1)} = (1 − θ^n)Π P^r + (1 − θ)θ^n Q^n Π + θ^{n+1} Q^{n+1}.   (69)

Using that ΠP^r = Π and Q^nΠ = Π shows that

P^{r(n+1)} = (1 − θ^{n+1})Π + θ^{n+1} Q^{n+1}.   (70)

This establishes (67) for k = n + 1 (assuming it holds for k = n), and hence it holds for all k. Multiplying by P^j and rearranging terms now yields

P^{rk+j} − Π = θ^k (Q^k P^j − Π).   (71)

To complete the proof, sum the absolute values of the elements in row x_0 on both sides of (71) and divide by 2. On the right, the second factor is at most the largest possible total variation distance between distributions, which is 1. Hence for any x_0 we have

‖P^{rk+j}(x_0, ·) − π‖_{TV} ≤ θ^k.

2
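
Theorem 12.1 and the geometric convergence of Proposition 12.4 are easy to observe numerically. A minimal sketch with a hypothetical three-state chain: the distribution of X_t, started from a point mass, approaches the stationary distribution π at a geometric rate in total variation distance.

    # Transition matrix of a small ergodic chain (rows sum to 1).
    P = [[0.5, 0.3, 0.2],
         [0.2, 0.6, 0.2],
         [0.1, 0.4, 0.5]]

    def step(q, P):
        """One step of the distribution, q -> qP (row-vector convention)."""
        return [sum(q[x] * P[x][y] for x in range(len(P))) for y in range(len(P))]

    def tv(mu, nu):
        """Total variation distance."""
        return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

    pi = [1.0, 0.0, 0.0]
    for _ in range(10_000):                 # iterate long enough to approximate pi
        pi = step(pi, P)

    q = [0.0, 0.0, 1.0]                     # start from a different point mass
    for t in range(1, 9):
        q = step(q, P)
        print(f"t = {t}:  d(q_t, pi) = {tv(q, pi):.2e}")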

12.2 Examples

12.2.1 Random walks

Maybe the simplest natural example of a Markov chain is the simple random walk on the line. But in order to obtain a finite state space Ω, let us discuss the random walk on a cycle of length N for some large enough N > 0. More precisely, let Ω = Z/NZ be the cyclic group with N elements. Let (I_t)_{t ≥ 1} be a sequence of independent random variables such that I_t = ±1 with probability 1/2 each. Now, define

X_t = X_{t−1} + I_t ∈ Ω   for any t ≥ 1,

where addition is in Z/NZ, of course. Then (X_t)_{t ≥ 0} is a Markov chain.

The chain is easily seen to be irreducible. For clearly, for any two states x, y ∈ Ω there is a non-zero probability that given X_t = x the chain will reach X_s = y for some s > t (for instance, the probability of going straight from x to y is 2^{−|x−y|} > 0).

However, for even N the chain is not aperiodic. In fact, if X_0 is even, then at even times the chain will be at even positions, and at odd times the state will be an odd number.

But there is an easy fix to make the chain aperiodic, which indeed applies to any chain. Namely, we could introduce a non-zero probability in each step for the chain to just stay at the current state, i.e., P[X_{t+1} = X_t] > 0. For instance, instead of choosing I_t ∈ {−1, 1} uniformly at random, we could choose I_t ∈ {−1, 0, 1} uniformly. The modified chain is sometimes called the 'lazy' random walk. Its stationary distribution is the uniform one, for any N.
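
A quick simulation of the lazy walk (N and the number of steps below are arbitrary choices) illustrates that the chain spends an asymptotically equal fraction of its time in every state, as the uniform stationary distribution suggests.

    import random
    from collections import Counter

    N, T = 12, 200_000
    x, counts = 0, Counter()
    for _ in range(T):
        x = (x + random.choice((-1, 0, 1))) % N      # lazy step: stay put with probability 1/3
        counts[x] += 1

    print(f"target frequency 1/N = {1 / N:.4f}")
    for v in range(N):
        print(f"state {v:2d}: {counts[v] / T:.4f}")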


12.2.2 The Ising model

This is a statistical mechanics model of ferromagnetism pioneered in the 1920s by Lenz and Ising. Consider the set V = {1, . . . , N}^d for integers N > 1, d ≥ 1, and turn this set into a graph by letting

E = \{\{v, w\} : v, w ∈ V, ‖v − w‖_1 = 1\}.

This is the so-called 'square lattice' in d dimensions. Let Ω = {−1, 1}^V. The elements σ : V → {−1, 1} are also called configurations. Furthermore, define a map, the so-called Hamiltonian,

H : Ω → R,   σ ↦ −\sum_{e = \{v,w\} ∈ E} σ(v)σ(w).

For a given number β > 0, called inverse temperature, we define a 'weight function'

W : Ω → R_{≥0},   σ ↦ \exp(−βH(σ)),   and we let   Z = \sum_{σ ∈ Ω} W(σ),

which is called the partition function. This gives rise to a probability distribution

π : Ω → [0, 1],   σ ↦ W(σ)/Z,

the so-called Gibbs measure.

We would like to design a Markov chain whose stationary distribution is the Gibbs measure. Starting from any X_0 ∈ Ω, we define X_{t+1} for t ≥ 0 inductively as follows. Given X_t, we let X_{t+1} = X_t with probability 1/2. Otherwise, we define X_{t+1} as follows.

• Choose a vertex v ∈ V uniformly at random and let σ(v) = −X_t(v), and σ(w) = X_t(w) for all w ∈ V \ {v}.

• Let p = \min\{1, W(σ)/W(X_t)\}. (Note that W(σ)/W(X_t) = π(σ)/π(X_t).)

• Let X_{t+1} = σ with probability p, and let X_{t+1} = X_t with probability 1 − p.

This Markov chain is ergodic, and it is easily verified that π is the stationary distribution.

Observe that the quotient W(σ)/W(X_t) depends only on the neighbors of the chosen vertex v. More precisely, letting τ = X_t for notational convenience, we have

\frac{W(σ)}{W(X_t)} = \exp[−β(H(σ) − H(τ))] = \exp\Big[β \sum_{w : \{v,w\} ∈ E} (σ(v)σ(w) − τ(v)τ(w))\Big] = \exp\Big[−2βτ(v) \sum_{w : \{v,w\} ∈ E} τ(w)\Big].

Thus, W(σ)/W(X_t) ≥ 1 iff the signs of τ(v) and \sum_{w : \{v,w\} ∈ E} τ(w) differ.

The above construction of a Markov chain extends to many other models in statistical mechanics. It goes by the name of Metropolis process.
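
The following sketch implements one Metropolis step for the Ising model as described above, using the local form of the acceptance ratio. The lattice size N, the dimension d = 2 and the inverse temperature β are hypothetical choices.

    import random
    from math import exp

    def metropolis_step(sigma, neighbors, beta):
        """With probability 1/2 do nothing; otherwise flip the spin of a random vertex v
        with probability min(1, exp(-2*beta*sigma[v]*S)), S being the sum of neighboring spins."""
        if random.random() < 0.5:
            return
        v = random.randrange(len(sigma))
        S = sum(sigma[w] for w in neighbors[v])
        if random.random() < min(1.0, exp(-2.0 * beta * sigma[v] * S)):
            sigma[v] = -sigma[v]

    # An N x N square lattice (d = 2) with free boundary.
    N, beta = 16, 0.3
    V = [(i, j) for i in range(N) for j in range(N)]
    idx = {v: a for a, v in enumerate(V)}
    neighbors = [[idx[(i + di, j + dj)] for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= i + di < N and 0 <= j + dj < N] for (i, j) in V]

    sigma = [random.choice((-1, 1)) for _ in V]
    for _ in range(100_000):
        metropolis_step(sigma, neighbors, beta)
    print("magnetisation per site:", sum(sigma) / len(sigma))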

12.2.3 Sampling graph colorings

Let G = (V, E) be a graph. Remember that a proper k-coloring of G is a map σ : V → {1, . . . , k} such that σ(v) \ne σ(w) for all \{v, w\} ∈ E. Let Ω be the set of all proper k-colorings of G. Assuming that Ω \ne ∅, we would like to sample uniformly from that set.

To this end, we define a Markov chain as follows. Start from any X_0 ∈ Ω. To obtain X_t for t ≥ 1 from X_{t−1}, proceed as follows.

• Choose a vertex v ∈ V uniformly at random.


• Choose a color c ∈ {1, . . . , k} uniformly at random.

• Let σ(v) = c and σ(w) = X_{t−1}(w) for all w \ne v.

• If σ ∈ Ω, then let X_t = σ; otherwise set X_t = X_{t−1}.

The chain is aperiodic because there is a non-zero probability that X_t = X_{t−1}. Moreover, it is easily seen that the uniform distribution is stationary. In fact, the transition matrix of this Markov chain is symmetric, and thus doubly-stochastic.

However, the chain may not be irreducible. Let

∆(G) = \max_{v ∈ V} |\{e ∈ E : v ∈ e\}|

denote the maximum degree of G. It is fairly easy to see that the chain is irreducible if k > 2∆. Indeed, it is irreducible even for k ≥ ∆ + 2. On the other hand, if k ≤ ∆, the chain may not be irreducible (think of a clique).
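
Here is a sketch of one step of this chain. The instance below, a cycle with k = 4 colours, is a hypothetical example chosen so that k ≥ ∆ + 2 and the chain is irreducible.

    import random

    def recoloring_step(coloring, neighbors, k):
        """Pick a uniform vertex and a uniform colour; recolor if the coloring stays proper."""
        v = random.randrange(len(coloring))
        c = random.randrange(1, k + 1)
        if all(coloring[w] != c for w in neighbors[v]):
            coloring[v] = c

    n, k = 8, 4                                         # cycle of length 8, Delta = 2, k = Delta + 2
    neighbors = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
    coloring = [1 + (v % 2) for v in range(n)]          # a proper coloring to start from
    for _ in range(10_000):
        recoloring_step(coloring, neighbors, k)
    print("final coloring:", coloring)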

12.3 Rapid mixing

Let Ω \ne ∅ be a finite set. Recall that a probability distribution on Ω is a map µ : Ω → [0, 1] such that \sum_{ω ∈ Ω} µ(ω) = 1. Also, recall that if µ, ν are two probability distributions on Ω, the total variation distance is defined as

d(µ, ν) = \frac{1}{2} \sum_{ω ∈ Ω} |µ(ω) − ν(ω)|.

This defines a metric. Recall that for a set A ⊂ Ω we let µ(A) = \sum_{a ∈ A} µ(a).

Fact 12.5 We have d(µ, ν) = \max_{A ⊂ Ω} |µ(A) − ν(A)|.
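
For a concrete example, the following lines compute d(µ, ν) for two distributions on a three-element set, both as half the ℓ1-distance and as µ(A) − ν(A) for the maximizing event A = {ω : µ(ω) ≥ ν(ω)} from Fact 12.5; the two numbers agree.

    mu = {"a": 0.5, "b": 0.3, "c": 0.2}
    nu = {"a": 0.2, "b": 0.3, "c": 0.5}

    d_l1 = 0.5 * sum(abs(mu[w] - nu[w]) for w in mu)
    A = [w for w in mu if mu[w] >= nu[w]]              # the maximizing event in Fact 12.5
    d_event = sum(mu[w] for w in A) - sum(nu[w] for w in A)
    print(d_l1, d_event)                               # both equal 0.3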

A coupling of two probability distributions µ, ν on Ω is a probability distribution λ on Ω × Ω such that for any x ∈ Ω we have

µ(x) = \sum_{y ∈ Ω} λ(x, y)   and   ν(x) = \sum_{y ∈ Ω} λ(y, x).

Lemma 12.6 ('coupling lemma') Let µ, ν be probability distributions on Ω. Let X, Y : Ω × Ω → Ω be the maps

X : (x, y) ↦ x,   Y : (x, y) ↦ y.

For any coupling λ of µ, ν we have λ\{X \ne Y\} ≥ d(µ, ν). Furthermore, there is a coupling λ such that λ\{X \ne Y\} = d(µ, ν).

Proof. Let S = \{a ∈ Ω : µ(a) ≥ ν(a)\} and let T = Ω \setminus S. Let

A = \bigcup_{a ∈ S} \{a\} × (Ω \setminus \{a\}),   B = \bigcup_{b ∈ T} (Ω \setminus \{b\}) × \{b\}.

Then

λ\{X \ne Y\} ≥ \max\{λ(A), λ(B)\}.   (72)

Furthermore, as λ is a coupling, for any a, b ∈ Ω we have

λ(\{a\} × (Ω \setminus \{a\})) = λ\{X = a ∧ Y \ne a\} ≥ λ\{X = a\} − λ\{Y = a\} = µ(a) − ν(a),   and similarly   (73)
λ((Ω \setminus \{b\}) × \{b\}) ≥ ν(b) − µ(b).   (74)


We obtain

d(µ, ν) = \frac{1}{2} \sum_{ω ∈ Ω} |ν(ω) − µ(ω)|
        = \frac{1}{2} \Big[\sum_{ω ∈ S} (µ(ω) − ν(ω)) + \sum_{ω ∈ T} (ν(ω) − µ(ω))\Big]
        ≤ \max\Big\{\sum_{ω ∈ S} (µ(ω) − ν(ω)), \sum_{ω ∈ T} (ν(ω) − µ(ω))\Big\}   (75)
        ≤ \max\{λ(A), λ(B)\}   [by (73), (74)]
        ≤ λ\{X \ne Y\}   [by (72)],

thereby proving the first assertion.

To construct the coupling for the second assertion, note that actually (75) can be strengthened to

d(µ, ν) = \sum_{ω ∈ S} (µ(ω) − ν(ω)) = \sum_{ω ∈ T} (ν(ω) − µ(ω)),   (76)

because µ, ν are probability distributions. For each a ∈ Ω let l_a = \min\{µ(a), ν(a)\}. Then by (76)

\sum_{a ∈ Ω} l_a = \sum_{ω ∈ S} ν(ω) + \sum_{ω ∈ T} µ(ω) = \sum_{ω ∈ Ω} ν(ω) + \sum_{ω ∈ T} (µ(ω) − ν(ω)) = 1 − d(µ, ν).   (77)

Now, choose a pair (X, Y) ∈ Ω × Ω as follows.

• With probability 1 − d(µ, ν), choose an element a ∈ Ω from the distribution (l_ω/(1 − d(µ, ν)))_{ω ∈ Ω}. Let X = Y = a. (This is well-defined due to (77).)

• With probability d(µ, ν), choose an element a ∈ S from the distribution ((µ(ω) − ν(ω))/d(µ, ν))_{ω ∈ S} and independently an element b ∈ T from the distribution ((ν(ω) − µ(ω))/d(µ, ν))_{ω ∈ T}. Let X = a and Y = b. (This is well-defined due to (76).)

A trite calculation shows that this distribution λ on Ω × Ω is a coupling, and that λ\{X \ne Y\} = d(µ, ν). 2

Suppose that (X_j)_{j ≥ 0} is an ergodic Markov chain on Ω with a transition matrix p and stationary distribution π. For x ∈ Ω let p^t_x be the probability distribution on Ω defined by p^t_x : y ↦ p^t(x, y). That is, p^t_x is the distribution of the state of the chain after t steps given that the initial state X_0 was x. Define for ε > 0 and t ≥ 1

∆_x(t) = d(p^t_x, π),   ∆(t) = \max_{x ∈ Ω} ∆_x(t),   τ_x(ε) = \min\{t : ∆_x(t) ≤ ε\},   τ(ε) = \max_{x ∈ Ω} τ_x(ε).

The mixing time of the chain is τ_{mix} = τ(1/(2e)).

Proposition 12.7 In an ergodic chain we have τ(ε) ≤ τ_{mix} · ⌈\ln(1/ε)⌉ for any 0 < ε < 0.01.

Proof. The proof consists of several steps. As a first step, we will show that

for any x ∈ Ω the sequence (∆_x(t))_{t ≥ 1} is monotonically decreasing.   (78)

We run two 'copies' of the chain: let X_0 = x, and let (X_t)_{t ≥ 1} be the resulting sequence of random variables, so that X_t has distribution p^t_x. In addition, let X′_0 have the stationary distribution π, and let X′_t be the resulting sequence of random variables induced by the operation of the chain. Then each X′_t has distribution π. Now, fix a time t ≥ 0. By Lemma 12.6, there is a coupling λ_t of the random variables X_t, X′_t such that λ_t\{X_t \ne X′_t\} = d(p^t_x, π). We use this coupling to define a coupling λ_{t+1} of X_{t+1} and X′_{t+1}. Under the coupling λ_{t+1}, we define

• X_{t+1} = X′_{t+1} if X_t = X′_t, and otherwise


• X_{t+1} and X′_{t+1} are chosen independently from the conditional distributions given the values of X_t and X′_t.

It is easily verified that λ_{t+1} is indeed a coupling of X_{t+1}, X′_{t+1}. Therefore, Lemma 12.6 implies

∆_x(t + 1) = d(p^{t+1}_x, π) ≤ λ_{t+1}\{X_{t+1} \ne X′_{t+1}\} ≤ λ_t\{X_t \ne X′_t\} = d(p^t_x, π) = ∆_x(t).

To continue, we let

D(t) = \max_{x,y ∈ Ω} d(p^t_x, p^t_y).

We claim that for any s, t

D(s + t) ≤ D(s)D(t).   (79)

To see this, fix any x, y ∈ Ω. Let (X_t)_{t ≥ 1} and (Y_t)_{t ≥ 1} be copies of our chain with X_0 = x and Y_0 = y. By Lemma 12.6, there is a coupling λ_t of X_t, Y_t such that d(p^t_x, p^t_y) = λ_t\{X_t \ne Y_t\}. Given this coupling λ_t, we define a coupling λ_{s+t} of X_{s+t} and Y_{s+t} as follows.

• Choose (X_t, Y_t) from the distribution λ_t.

• If X_t = Y_t, then run the chain X_{t+i} for i = 1, . . . , s given the value of X_t and set Y_{t+i} = X_{t+i} for i = 1, . . . , s; in particular, X_{t+s} = Y_{t+s}.

• Otherwise, given the values x′ = X_t, y′ = Y_t, x′ \ne y′, obtain a coupling λ_{s+t,x′,y′} of the conditional variables X_{t+s} | X_t = x′ and Y_{t+s} | Y_t = y′ from Lemma 12.6 such that

λ_{s+t,x′,y′}\{X_{t+s} \ne Y_{t+s}\} = d(p^s_{x′}, p^s_{y′}) ≤ D(s).

Then, once more by Lemma 12.6,

d(p^{s+t}_x, p^{s+t}_y) ≤ λ_{s+t}\{X_{s+t} \ne Y_{s+t}\} ≤ λ_t\{X_t \ne Y_t\} · D(s) = d(p^t_x, p^t_y) · D(s) ≤ D(t)D(s).

Since the r.h.s. is independent of x, y, we obtain (79). Now, observe that

∆(t) ≤ D(t) ≤ 2∆(t).

Therefore, (79) yields for any integer k ≥ 1

∆(k · τ_{mix}) ≤ D(k · τ_{mix}) ≤ D(τ_{mix})^k ≤ (2∆(τ_{mix}))^k ≤ \exp(−k),   (80)

by the definition of τ_{mix}. The assertion is immediate from (80) and the monotonicity (78). 2

Let (X_t, Y_t)_{t ≥ 0} be a sequence of random variables in which each step (X_t, Y_t) is a pair in Ω × Ω. Suppose that the sequences (X_t)_{t ≥ 0} and (Y_t)_{t ≥ 0} are Markov chains with identical transition probabilities. We call the sequence (X_t, Y_t)_t a coupling of the two chains (X_t)_{t ≥ 0}, (Y_t)_{t ≥ 0} if X_t = Y_t implies X_{t+1} = Y_{t+1} for any t ≥ 0.

Let (X_t, Y_t)_t be a coupling of two Markov chains. For x, y ∈ Ω define T_{xy} to be the minimum t ≥ 0 such that X_t = Y_t, given that X_0 = x and Y_0 = y. Thus, T_{xy} is a random variable. The following proposition is a key instrument for the analysis of the mixing time.

Proposition 12.8 Let (X_t, Y_t)_t be a coupling of two Markov chains. Then

∆(t) ≤ \max_{x,y ∈ Ω} P[T_{xy} > t].

Proof. As in the proof of Proposition 12.7, we let D(t) = \max_{x,y ∈ Ω} d(p^t_x, p^t_y), so that ∆(t) ≤ D(t). Then by Lemma 12.6

∆(t) ≤ D(t) = \max_{x,y} d(p^t_x, p^t_y) ≤ \max_{x,y} P[X_t \ne Y_t | X_0 = x, Y_0 = y] = \max_{x,y} P[T_{xy} > t],

as claimed. 2


12.4 Random walks on the hypercube

The n-dimensional Hamming cube is the graph H = (V, E) with vertex set V = {0, 1}^n and edge set E = \{\{v, w\} : δ(v, w) = 1\}, where the Hamming distance δ(v, w) is the number of coordinates in which v, w differ. Letting Ω = V, we define a Markov chain (X_t)_{t ≥ 0}, which we think of as a random walk on the vertices of H. Starting with any X_0 ∈ Ω, we define X_{t+1} for t ≥ 0 as follows:

• With probability 1/2, let Xt+1 = Xt.

• Otherwise, choose a neighbor v of the present vertex Xt uniformly at random and let Xt+1 = v.

Due to the first rule, this chain is aperiodic. Moreover, it is immediate that the chain is irreducible, because H is connected. (As an aside, the same definition of a Markov chain applies to any connected graph, not just the Hamming cube; however, the following analysis of the mixing time does not.)

Due to the particular structure of the Hamming cube, the above chain allows for a different description that is easier to analyze. Starting from any X_0 ∈ Ω, we could obtain X_{t+1} for t ≥ 0 as follows.

• Let X_t = (x_1, . . . , x_n) ∈ Ω = {0, 1}^n be the present state.

• Draw j ∈ {1, . . . , n} uniformly at random.

• Choose x_j ∈ {0, 1} uniformly at random.

• Let X_{t+1} = (x_1, . . . , x_{j−1}, x_j, x_{j+1}, . . . , x_n).

This second description directly leads to a coupling. Let X_0, Y_0 ∈ Ω be two arbitrary initial states. Adapting the above experiment, we define a coupling (X_t, Y_t) for t ≥ 0 as follows.

• Let X_t = (x_1, . . . , x_n), Y_t = (y_1, . . . , y_n) ∈ Ω = {0, 1}^n be the present states.

• Draw j ∈ {1, . . . , n} uniformly at random.

• Choose z_j ∈ {0, 1} uniformly at random.

• Let X_{t+1} = (x_1, . . . , x_{j−1}, z_j, x_{j+1}, . . . , x_n) and Y_{t+1} = (y_1, . . . , y_{j−1}, z_j, y_{j+1}, . . . , y_n).

Using this coupling, we obtain the following result.

Proposition 12.9 The mixing time of the random walk on H is O(n lnn).

Proof. Let c > 1 be a constant. By Proposition 12.8, we need to bound the probability that T_{xy} > cn \ln n for any two initial states X_0 = x, Y_0 = y. The definition of the coupling ensures that if an index j ∈ {1, . . . , n} is chosen at time t, then the jth components of X_s and Y_s coincide for all s > t. Thus, once each j ∈ {1, . . . , n} has been chosen, the states of both copies coincide. Now, the probability that one fixed index j ∈ {1, . . . , n} did not get chosen in the first L = ⌈cn \ln n⌉ steps is bounded by

(1 − 1/n)^L ≤ \exp(−L/n) ≤ \exp(−c \ln n) = n^{−c}.

Hence, the expected number of indices j ∈ {1, . . . , n} that do not get chosen in the first L steps is bounded by n^{1−c} = o(1). Hence, P[T_{xy} ≥ L] = o(1) for any x, y ∈ Ω. Therefore, the assertion follows from Proposition 12.8. 2
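
The coupon-collector behaviour exploited in the proof is easy to watch in a simulation. The sketch below (n and the number of runs are arbitrary) runs the coordinate coupling from two opposite corners of the cube and records the coalescence time T_{xy}, which indeed is of order n ln n.

    import random
    from math import log

    def coalescence_time(n):
        """Coordinate coupling of two lazy walks on {0,1}^n started from opposite corners;
        returns the first time the two copies agree (i.e. every coordinate has been refreshed)."""
        x, y, t = [0] * n, [1] * n, 0
        while x != y:
            t += 1
            j = random.randrange(n)
            x[j] = y[j] = random.randint(0, 1)
        return t

    n, runs = 64, 200
    avg = sum(coalescence_time(n) for _ in range(runs)) / runs
    print(f"average coalescence time: {avg:.1f}   (n ln n = {n * log(n):.1f})")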

12.5 Sampling graph colorings

We return to the problem (and the chain) discussed in Section 12.2.3.

Theorem 12.10 If k ≥ 4∆ + 1, then the mixing time of the chain from Section 12.2.3 is O(kn lnn).


Proof. Let (X_t), (Y_t) be two copies of the chain, starting at two arbitrary initial configurations. We consider the simplest possible coupling: in each step, the chains choose the same vertex v and the same color c, and recolor if possible. Let d_t be the number of vertices that have different colors in the two copies of the chain. Our goal is to analyze the expected difference E[d_{t+1} − d_t | X_t, Y_t] given the current configurations of the two chains. We need to consider two cases.

Case 1: d_{t+1} = d_t − 1. This happens if X_t, Y_t assign different colors to v but in both configurations X_t, Y_t the 'new' color c can be assigned to vertex v. Since in either chain at most ∆ colors can be 'forbidden' for v, the probability of this event is at least d_t(k − 2∆)/(kn).

Case 2: d_{t+1} = d_t + 1. The chosen vertex v has the same color in the two configurations X_t, Y_t, but it has a neighbor w that is colored differently in X_t, Y_t, and the chosen color c is the color of w in either X_t or Y_t. The probability of this event is at most 2d_t∆/(kn).

Combining the two cases, we see that

E[d_{t+1} − d_t | X_t, Y_t] ≤ \frac{−d_t(k − 2∆) + 2d_t∆}{kn} = −\frac{d_t}{n}\Big(1 − \frac{4∆}{k}\Big) < 0.

Inductively, we get

E[d_t] ≤ d_0 \Big(1 − \frac{k − 4∆}{kn}\Big)^t ≤ n \Big(1 − \frac{k − 4∆}{kn}\Big)^t ≤ n \Big(1 − \frac{1}{kn}\Big)^t,

as k ≥ 4∆ + 1. Since d_t is integer-valued, P[d_t ≥ 1] ≤ E[d_t] ≤ n \exp(−t/(kn)). Thus, by Proposition 12.8, for t ≥ kn(\ln n − \ln ε) we get ∆(t) ≤ ε, whence τ_{mix} = O(kn \ln n). 2
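
The contraction of d_t under the identity coupling can also be observed experimentally. A minimal sketch on a hypothetical instance (a long cycle, so ∆ = 2, with k = 9 ≥ 4∆ + 1 colours): starting from two colorings that disagree everywhere, the coupled chains coalesce after roughly kn ln n steps.

    import random
    from math import log

    def coupled_step(X, Y, neighbors, k):
        """One step of the identity coupling: both chains propose the same vertex and colour."""
        v = random.randrange(len(X))
        c = random.randrange(1, k + 1)
        if all(X[w] != c for w in neighbors[v]):
            X[v] = c
        if all(Y[w] != c for w in neighbors[v]):
            Y[v] = c

    n, k = 60, 9
    neighbors = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
    X = [1 + (v % 2) for v in range(n)]                 # two proper colorings disagreeing everywhere
    Y = [1 + ((v + 1) % 2) for v in range(n)]
    t = 0
    while X != Y:
        coupled_step(X, Y, neighbors, k)
        t += 1
    print(f"coalesced after {t} steps   (k*n*ln n is about {int(k * n * log(n))})")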

References

[1] N. Alon, J. Spencer: The probabilistic method. Wiley (2nd ed. or later).

[2] B. Bollobas: Random graphs. Cambridge University Press (2nd ed.).

[3] R. Durrett: Probability theory and examples. Thomson (3rd ed.).

[4] R. Durrett: Random graph dynamics. Cambridge University Press.

[5] W. Feller: An introduction to probability theory and its applications, volume 1. Wiley 1966.

[6] A. Frieze: Random Graphs. Lecture notes, Carnegie Mellon University, 2010.

[7] S. Janson, T. Łuczak, A. Rucinski: Random Graphs, Wiley 2000.

[8] D. Levin, Y. Peres and E. Wilmer: Markov Chains and Mixing Times, AMS 2009.

[9] A. Sinclair: Markov Chain Monte Carlo: Foundations & Applications. Lecture notes, UC Berkeley, 2009.