Convergence of random processes - NYU Courant

DS-GA 1002 Lecture notes 6 Fall 2016

Convergence of random processes

1 Introduction

In these notes we study convergence of discrete random processes. This allows us to characterize phenomena such as the law of large numbers, the central limit theorem and the convergence of Markov chains, which are fundamental in statistical estimation and probabilistic modeling.

2 Types of convergence

Convergence for a deterministic sequence of real numbers x_1, x_2, . . . is simple to define:

lim_{i→∞} x_i = x    (1)

if x_i is arbitrarily close to x as i grows. More formally, for any ε > 0 there is an i_0 such that for all i > i_0, |x_i − x| < ε. This allows us to define convergence for a realization of a discrete random process X(ω, i), i.e. when we fix the outcome ω and X(ω, i) is just a deterministic function of i. It is more challenging to define convergence of the random process to a random variable X, since both of these objects are only defined through their distributions. In this section we describe several alternative definitions of convergence for random processes.

2.1 Convergence with probability one

Consider a discrete random process X and a random variable X defined on the same probability space. If we fix an element ω of the sample space Ω, then X(i, ω) is a deterministic sequence and X(ω) is a constant. It is consequently possible to verify whether X(i, ω) converges deterministically to X(ω) as i → ∞ for that particular value of ω. In fact, we can ask: what is the probability that this happens? To be precise, this would be the probability that if we draw ω we have

lim_{i→∞} X(i, ω) = X(ω).    (2)

If this probability equals one then we say that X(i) converges to X with probability one.

Definition 2.1 (Convergence with probability one). A discrete random process X converges with probability one to a random variable X belonging to the same probability space (Ω, F, P)

Figure 1: Convergence to zero of the discrete random process D defined in Example 2.2 of Lecture Notes 5.

if

P({ω ∈ Ω : lim_{i→∞} X(ω, i) = X(ω)}) = 1.    (3)

Recall that in general the sample space Ω is very difficult to define and manipulate explicitly, except for very simple cases.

Example 2.2 (Puddle (continued)). Let us consider the discrete random process D defined in Example 2.2 of Lecture Notes 5. If we fix ω ∈ (0, 1),

lim_{i→∞} D(ω, i) = lim_{i→∞} ω/i    (4)
                  = 0.    (5)

It turns out that the realizations tend to zero for all possible values of ω in the sample space. This implies that D converges to zero with probability one.

2.2 Convergence in mean square and in probability

To verify convergence with probability one we fix the outcome ω and check whether the corresponding realizations of the random process converge deterministically. An alternative viewpoint is to fix the indexing variable i and consider how close the random variable X(i) is to another random variable X as we increase i.


A possible measure of the distance between two random variables is the mean square of their difference. Recall that if E((X − Y)²) = 0 then X = Y with probability one by Chebyshev's inequality. The mean square deviation between X(i) and X is a deterministic quantity (a number), so we can evaluate its convergence as i → ∞. If it converges to zero then we say that the random sequence converges in mean square.

Definition 2.3 (Convergence in mean square). A discrete random process X converges in mean square to a random variable X belonging to the same probability space if

lim_{i→∞} E((X − X(i))²) = 0.    (6)

Alternatively, we can consider the probability that X(i) is separated from X by a certain fixed ε > 0. If for any ε, no matter how small, this probability converges to zero as i → ∞ then we say that the random sequence converges in probability.

Definition 2.4 (Convergence in probability). A discrete random process X converges in probability to another random variable X belonging to the same probability space if for any ε > 0

lim_{i→∞} P(|X − X(i)| > ε) = 0.    (7)

Note that as in the case of convergence in mean square, the limit in this definition is deterministic, as it is a limit of probabilities, which are just real numbers.

As a direct consequence of Markov's inequality, convergence in mean square implies convergence in probability.

Theorem 2.5. Convergence in mean square implies convergence in probability.

Proof. We have

lim_{i→∞} P(|X − X(i)| > ε) = lim_{i→∞} P((X − X(i))² > ε²)    (8)
                             ≤ lim_{i→∞} E((X − X(i))²) / ε²    by Markov's inequality    (9)
                             = 0,    (10)

if the sequence converges in mean square.

It turns out that convergence with probability one also implies convergence in probability. However, convergence with probability one does not imply convergence in mean square, or vice versa. The difference between these three types of convergence is not very important for the purposes of this course.


2.3 Convergence in distribution

In some cases, a random process X does not converge to the value of any random variable, but the pdf or pmf of X(i) converges pointwise to the pdf or pmf of another random variable X. In that case, the actual values of X(i) and X will not necessarily be close, but in the limit they have the same distribution. We say that X converges in distribution to X.

Definition 2.6 (Convergence in distribution). A discrete-state discrete random process X converges in distribution to a discrete random variable X belonging to the same probability space if

lim_{i→∞} p_{X(i)}(x) = p_X(x)    for all x ∈ R_X,    (11)

where R_X is the range of X.

A continuous-state discrete random process X converges in distribution to a continuous random variable X belonging to the same probability space if

lim_{i→∞} f_{X(i)}(x) = f_X(x)    for all x ∈ R,    (12)

assuming the pdfs are well defined (otherwise we can use the cdfs¹).

Note that convergence in distribution is a much weaker notion than convergence with probability one, in mean square or in probability. If a discrete random process X converges to a random variable X in distribution, this only means that as i becomes large the distribution of X(i) tends to the distribution of X, not that the values of the two random variables are close. However, convergence in probability (and hence convergence with probability one or in mean square) does imply convergence in distribution.

Example 2.7 (Binomial converges to Poisson). Let us define a discrete random process X(i) such that the distribution of X(i) is binomial with parameters i and p := λ/i. X(i) and X(j) are independent for i ≠ j, which completely characterizes the n-order distributions of the process for all n > 1. Consider a Poisson random variable X with parameter λ that is independent of X(i) for all i. Do you expect the values of X and X(i) to be close as i → ∞?

¹One can also define convergence in distribution of a discrete-state random process to a continuous random variable through the deterministic convergence of the cdfs.

No! In fact even X(i) and X(i + 1) will not be close in general. However, X converges in distribution to X, as established in Example 3.7 of Lecture Notes 2:

lim_{i→∞} p_{X(i)}(x) = lim_{i→∞} \binom{i}{x} p^x (1 − p)^{i−x}    (13)
                      = λ^x e^{−λ} / x!    (14)
                      = p_X(x).    (15)
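The limit (13)-(15) can be verified numerically. The following sketch uses only the Python standard library; the values λ = 3 and x = 2 are arbitrary illustrative choices, not part of the example above.

```python
from math import comb, exp, factorial

lam, x = 3.0, 2  # arbitrary illustrative choices

def binom_pmf(i):
    """pmf of a binomial with parameters i and p = lam / i, evaluated at x."""
    p = lam / i
    return comb(i, x) * p**x * (1 - p) ** (i - x)

# Poisson pmf with parameter lam, evaluated at x, as in (14).
poisson_pmf = lam**x * exp(-lam) / factorial(x)

# The binomial pmf approaches the Poisson pmf as i grows.
for i in (10, 100, 10_000):
    print(i, binom_pmf(i), poisson_pmf)
```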

3 Law of Large Numbers

Let us define the average of a discrete random process.

Definition 3.1 (Moving average). The moving or running average A of a discrete random process X, defined for i = 1, 2, . . . (i.e. 1 is the starting point), is equal to

A(i) := (1/i) ∑_{j=1}^{i} X(j).    (16)

Consider an iid sequence. A very natural interpretation for the moving average is that it is a real-time estimate of the mean. In fact, in statistical terms the moving average is the empirical mean of the process up to time i (we will discuss the empirical mean later on in the course). The celebrated law of large numbers establishes that the average does indeed converge to the mean of the iid sequence.
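The convergence of the moving average can be sketched in a few lines. NumPy is assumed here; the seed and the sequence length n are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)  # realization of an iid standard Gaussian sequence

# Moving average A(i) = (1/i) * sum_{j=1}^{i} X(j), computed for every i at once.
a = np.cumsum(x) / np.arange(1, n + 1)

# The moving average approaches the mean of the distribution (0 here);
# its standard deviation at time i is sigma / sqrt(i).
print(abs(a[-1]))
```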

Theorem 3.2 (Weak law of large numbers). Let X be an iid discrete random process with mean µ_X := µ and bounded variance σ². Then the average A of X converges in mean square to µ.

Proof. First, we establish that the mean of A(i) is constant and equal to µ:

E(A(i)) = E( (1/i) ∑_{j=1}^{i} X(j) )    (17)
        = (1/i) ∑_{j=1}^{i} E(X(j))    (18)
        = µ.    (19)


Due to the independence assumption, the variance of the sum scales linearly in i, so the variance of the average decays as 1/i. Recall that for independent random variables the variance of the sum equals the sum of the variances:

Var(A(i)) = Var( (1/i) ∑_{j=1}^{i} X(j) )    (20)
          = (1/i²) ∑_{j=1}^{i} Var(X(j))    (21)
          = σ²/i.    (22)

We conclude that

lim_{i→∞} E((A(i) − µ)²) = lim_{i→∞} E((A(i) − E(A(i)))²)    by (19)    (23)
                          = lim_{i→∞} Var(A(i))    (24)
                          = lim_{i→∞} σ²/i    by (22)    (25)
                          = 0.    (26)

By Theorem 2.5 the average also converges to the mean of the iid sequence in probability. In fact, one can also prove convergence with probability one under the same assumptions. This result is known as the strong law of large numbers, but the proof is beyond the scope of these notes. We refer the interested reader to more advanced texts in probability theory.

Figure 2 shows averages of realizations of several iid sequences. When the iid sequence is Gaussian or geometric we observe convergence to the mean of the distribution; however, when the sequence is Cauchy the moving average diverges. The reason is that, as we saw in Example 3.2 of Lecture Notes 4, the Cauchy distribution does not have a well-defined mean! Intuitively, extreme values have non-negligible probability under the Cauchy distribution, so from time to time the iid sequence takes values with very large magnitudes and this makes the moving average diverge.
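The misbehavior of the Cauchy average can also be seen numerically. The sketch below (NumPy assumed; the seed and the number of replicates are arbitrary choices) estimates the spread of the sample mean for different sequence lengths. For a Cauchy iid sequence the mean of n terms is itself standard Cauchy for every n, so the spread does not shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def iqr_of_sample_means(n, reps=1000):
    """Interquartile range of the mean of n iid standard Cauchy variables,
    estimated from `reps` independent replicates."""
    means = rng.standard_cauchy((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

# For a distribution with finite variance this spread would decay like
# 1 / sqrt(n); for the Cauchy distribution it does not decay at all.
print(iqr_of_sample_means(10), iqr_of_sample_means(10_000))
```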

4 Central Limit Theorem

In the previous section we established that the moving average of a sequence of iid random variables converges to the mean of their distribution (as long as the mean is well defined and the variance is finite). In this section, we characterize the distribution of the average A(i) as i increases.


Figure 2: Realization of the moving average of an iid standard Gaussian sequence (top), an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom). For the Cauchy sequence the median of the iid sequence is shown instead of the mean, which is not well defined.


It turns out that A converges to a Gaussian random variable in distribution, which is very useful in statistics as we will see later on.

This result, known as the central limit theorem, justifies the use of Gaussian distributions to model data that are the result of many different independent factors. For example, the distribution of height or weight of people in a certain population often has a Gaussian shape (as illustrated by Figure 10 of Lecture Notes 2) because the height and weight of a person depend on many different factors that are roughly independent. In many signal-processing applications noise is well modeled as having a Gaussian distribution for the same reason.

Theorem 4.1 (Central limit theorem). Let X be an iid discrete random process with mean µ_X := µ and bounded variance σ². The random process √n (A − µ), which corresponds to the centered and scaled moving average of X, converges in distribution to a Gaussian random variable with mean 0 and variance σ².

Proof. The proof of this remarkable result is beyond the scope of these notes. It can be found in any advanced text on probability theory. However, we would still like to provide some intuition as to why the theorem holds. In Theorem 3.18 of Lecture Notes 3, we established that the pdf of the sum of two independent random variables is equal to the convolution of their individual pdfs. The same holds for discrete random variables: the pmf of the sum is equal to the convolution of the pmfs, as long as the random variables are independent.

If each of the entries of the iid sequence has pdf f, then the pdf of the sum of the first i elements can be obtained by convolving i copies of f together:

f_{∑_{j=1}^{i} X(j)}(x) = (f ∗ f ∗ · · · ∗ f)(x).    (27)

If the sequence has a discrete state and each of the entries has pmf p, the pmf of the sum of the first i elements can be obtained by convolving i copies of p together:

p_{∑_{j=1}^{i} X(j)}(x) = (p ∗ p ∗ · · · ∗ p)(x).    (28)

Normalizing by i just results in scaling the result of the convolution, so the pmf or pdf of the moving average A is the result of repeated convolutions of a fixed function. These convolutions have a smoothing effect, which eventually transforms the pmf/pdf into a Gaussian! We show this numerically in Figure 3 for two very different distributions: a uniform distribution and a very irregular one. Both converge to Gaussian-like shapes after just 3 or 4 convolutions. The central limit theorem makes this precise, establishing that the shape of the pmf or pdf becomes Gaussian asymptotically.
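This smoothing effect is easy to reproduce. The sketch below (NumPy assumed; the irregular pmf and the number of copies are arbitrary choices) convolves a pmf with itself repeatedly and compares the result with a Gaussian of matching mean and variance.

```python
import numpy as np

# An irregular pmf on {0, 1, 2, 3} (values chosen arbitrarily).
p = np.array([0.5, 0.05, 0.05, 0.4])
mu = np.dot(np.arange(4), p)
var = np.dot(np.arange(4) ** 2, p) - mu**2

# pmf of the sum of n iid copies: n copies of p convolved together.
n = 30
q = p.copy()
for _ in range(n - 1):
    q = np.convolve(q, p)

# Gaussian density with the mean and variance of the sum, evaluated
# on the support of q.
k = np.arange(q.size)
gauss = np.exp(-((k - n * mu) ** 2) / (2 * n * var)) / np.sqrt(2 * np.pi * n * var)

# The repeatedly convolved pmf is already very close to the Gaussian.
print(np.abs(q - gauss).max())
```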

In statistics the central limit theorem is often invoked to justify treating averages as if they have a Gaussian distribution. The idea is that for large enough n, √n (A − µ) is


Figure 3: Result of convolving two different distributions with themselves several times (i = 1 to 5). The shapes quickly become Gaussian-like.


Figure 4: Empirical distribution of the moving average of an iid exponential sequence with parameter λ = 2 (top), an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom), for i = 10², 10³ and 10⁴. The empirical distribution is computed from 10⁴ samples in all cases. For the first two rows the estimate provided by the central limit theorem is plotted in red.


approximately Gaussian with mean 0 and variance σ², which implies that A is approximately Gaussian with mean µ and variance σ²/n. It's important to remember that we have not established this rigorously. The rate of convergence will depend on the particular distribution of the entries of the iid sequence.

In practice convergence is usually very fast. Figure 4 shows the empirical distribution of the moving average of an exponential and a geometric iid sequence. In both cases the approximation obtained by the central limit theorem is very accurate even for an average of 100 samples. The figure also shows that for a Cauchy iid sequence, the distribution of the moving average does not become Gaussian, which does not contradict the central limit theorem since the distribution does not have a well-defined mean. To close this section we derive a useful approximation to the binomial distribution using the central limit theorem.

Example 4.2 (Gaussian approximation to the binomial distribution). Let X have a binomial distribution with parameters n and p, such that n is large. Computing the probability that X is in a certain interval requires summing its pmf over all the values in that interval. Alternatively, we can obtain a quick approximation using the fact that for large n the distribution of a binomial random variable is approximately Gaussian. Indeed, we can write X as the sum of n independent Bernoulli random variables with parameter p,

X = ∑_{i=1}^{n} B_i.    (29)

The mean of B_i is p and its variance is p(1 − p). By the central limit theorem, (1/n) X is approximately Gaussian with mean p and variance p(1 − p)/n. Equivalently, by Lemma 6.1 in Lecture Notes 2, X is approximately Gaussian with mean np and variance np(1 − p).

Assume that a basketball player makes each shot she takes with probability p = 0.4. If we assume that each shot is independent, what is the probability that she makes more than 420 shots out of 1000? We can model the shots made as a binomial X with parameters 1000 and 0.4. The exact answer is

P(X ≥ 420) = ∑_{x=420}^{1000} p_X(x)    (30)
           = ∑_{x=420}^{1000} \binom{1000}{x} 0.4^x 0.6^{1000−x}    (31)
           = 10.4 · 10^{−2}.    (32)

If we apply the Gaussian approximation, by Lemma 6.1 in Lecture Notes 2, X being larger than 420 is the same as a standard Gaussian U being larger than (420 − µ)/σ, where µ and σ are the mean and standard deviation of X, equal to np = 400 and √(np(1 − p)) ≈ 15.5 respectively:

P(X ≥ 420) ≈ P( √(np(1 − p)) U + np ≥ 420 )    (33)
           = P(U ≥ 1.29)    (34)
           = 1 − Φ(1.29)    (35)
           = 9.85 · 10^{−2}.    (36)
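The computation in this example can be reproduced with a short sketch using only the Python standard library; here Φ is evaluated via the error function.

```python
from math import comb, erf, sqrt

n, p, k = 1000, 0.4, 420

# Exact binomial tail probability P(X >= 420), summing the pmf as in (30)-(31).
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

# Gaussian approximation: X is approximately N(np, np(1 - p)).
mu, sigma = n * p, sqrt(n * p * (1 - p))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard Gaussian cdf
approx = 1 - Phi((k - mu) / sigma)

print(exact, approx)  # roughly 0.104 and 0.098, matching (32) and (36)
```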

5 Convergence of Markov chains

In this section we study under what conditions a finite-state time-homogeneous Markov chain X converges in distribution. If a Markov chain converges in distribution, then its state vector ~p_{X(i)}, which contains the first-order pmf of X, converges to a fixed vector ~p_∞,

~p_∞ := lim_{i→∞} ~p_{X(i)}.    (37)

This implies that the probability of the Markov chain being in each state tends to a specific value.

By Lemma 4.1 in Lecture Notes 5, we can express (37) in terms of the initial state vector and the transition matrix of the Markov chain:

~p_∞ = lim_{i→∞} T_X^i ~p_{X(0)}.    (38)

Computing this limit analytically for a particular T_X and ~p_{X(0)} may seem challenging at first sight. However, it is often possible to leverage the eigendecomposition of the transition matrix (if it exists) to find ~p_∞. This is illustrated in the following example.

Example 5.1 (Mobile phones). A company that makes mobile phones wants to model the sales of a new model they have just released. At the moment 90% of the phones are in stock, 10% have been sold locally and none have been exported. Based on past data, the company determines that each day a phone in stock is sold with probability 0.2 and exported with probability 0.1. The initial state vector and the transition matrix of the Markov chain are

~a := (0.9, 0.1, 0)^T,    T_X =
[ 0.7  0  0 ]
[ 0.2  1  0 ]
[ 0.1  0  1 ].    (39)


Figure 5: State diagram of the Markov chain described in Example 5.1 (top). Below we show three realizations of the Markov chain.


Figure 6: Evolution of the state vector of the Markov chain in Example 5.1 for different values of the initial state vector ~p_{X(0)} (left: ~a, center: ~b, right: ~c).

We have used ~a to denote ~p_{X(0)} because later we will consider other possible initial state vectors. Figure 5 shows the state diagram and some realizations of the Markov chain.

The company is interested in the fate of the new model. In particular, it would like to compute what fraction of mobile phones will end up exported and what fraction will be sold locally. This is equivalent to computing

lim_{i→∞} ~p_{X(i)} = lim_{i→∞} T_X^i ~p_{X(0)}    (40)
                    = lim_{i→∞} T_X^i ~a.    (41)

The transition matrix T_X has three eigenvectors:

~q_1 := (0, 0, 1)^T,    ~q_2 := (0, 1, 0)^T,    ~q_3 := (0.80, −0.53, −0.27)^T.    (42)

The corresponding eigenvalues are λ_1 := 1, λ_2 := 1 and λ_3 := 0.7. We gather the eigenvectors and eigenvalues into two matrices

Q := [ ~q_1 ~q_2 ~q_3 ],    Λ := diag(λ_1, λ_2, λ_3),    (43)

so that the eigendecomposition of T_X is

T_X = Q Λ Q^{−1}.    (44)


It will be useful to express the initial state vector ~a in terms of the different eigenvectors. This is achieved by computing

Q^{−1} ~p_{X(0)} = (0.3, 0.7, 1.122)^T,    (45)

so that

~a = 0.3 ~q_1 + 0.7 ~q_2 + 1.122 ~q_3.    (46)

We conclude that

lim_{i→∞} T_X^i ~a = lim_{i→∞} T_X^i (0.3 ~q_1 + 0.7 ~q_2 + 1.122 ~q_3)    (47)
                   = lim_{i→∞} 0.3 T_X^i ~q_1 + 0.7 T_X^i ~q_2 + 1.122 T_X^i ~q_3    (48)
                   = lim_{i→∞} 0.3 λ_1^i ~q_1 + 0.7 λ_2^i ~q_2 + 1.122 λ_3^i ~q_3    (49)
                   = lim_{i→∞} 0.3 ~q_1 + 0.7 ~q_2 + 1.122 · 0.7^i ~q_3    (50)
                   = 0.3 ~q_1 + 0.7 ~q_2    (51)
                   = (0, 0.7, 0.3)^T.    (52)

This means that eventually the probability that each phone has been sold locally is 0.7 and the probability that it has been exported is 0.3. The left graph in Figure 6 shows the evolution of the state vector. As predicted, it eventually converges to the vector in equation (52).
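The limit (52) can also be checked by applying the transition matrix repeatedly. This is a sketch assuming NumPy; 200 iterations is an arbitrary cutoff, enough for the 0.7^i term to vanish.

```python
import numpy as np

# Transition matrix and initial state vector from Example 5.1.
T = np.array([[0.7, 0.0, 0.0],
              [0.2, 1.0, 0.0],
              [0.1, 0.0, 1.0]])
a = np.array([0.9, 0.1, 0.0])

# Iterate p <- T p; the component along q3 decays as 0.7^i.
p = a.copy()
for _ in range(200):
    p = T @ p

print(p.round(3))  # approaches (0, 0.7, 0.3), as in (52)
```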

In general, because of the special structure of the two eigenvectors with eigenvalues equal to one in this example, we have

lim_{i→∞} T_X^i ~p_{X(0)} = ( 0, (Q^{−1} ~p_{X(0)})_2, (Q^{−1} ~p_{X(0)})_1 )^T.    (53)

This is illustrated in Figure 6, where you can see the evolution of the state vector when it is initialized to these other two distributions:

~b := (0.6, 0, 0.4)^T,    Q^{−1} ~b = (0.6, 0.4, 0.75)^T,    (54)

~c := (0.4, 0.5, 0.1)^T,    Q^{−1} ~c = (0.23, 0.77, 0.50)^T.    (55)


The transition matrix of the Markov chain in Example 5.1 has two eigenvectors with eigenvalue equal to one. If we set the initial state vector to equal either of these eigenvectors (note that we must make sure to normalize them so that the state vector contains a valid pmf) then

T_X ~p_{X(0)} = ~p_{X(0)},    (56)

so that

~p_{X(i)} = T_X^i ~p_{X(0)}    (57)
          = ~p_{X(0)}    (58)

for all i. In particular,

lim_{i→∞} ~p_{X(i)} = ~p_{X(0)},    (59)

so X converges in distribution to a random variable with pmf ~p_{X(0)}. A distribution that satisfies (56) is called a stationary distribution of the Markov chain.

Definition 5.2 (Stationary distribution). Let X be a finite-state time-homogeneous Markov chain and let ~p_stat be a state vector containing a valid pmf over the possible states of X. If ~p_stat is an eigenvector of T_X associated to an eigenvalue equal to one, so that

T_X ~p_stat = ~p_stat,    (60)

then the distribution corresponding to ~p_stat is a stationary or steady-state distribution of X.

Establishing whether a distribution is stationary by checking whether (60) holds may be challenging computationally if the state space is very large. We now derive an alternative condition that implies stationarity. Let us first define reversibility of Markov chains.

Definition 5.3 (Reversibility). Let X be a finite-state time-homogeneous Markov chain with s states and transition matrix T_X. Assume that X(i) is distributed according to the state vector ~p ∈ R^s. If

P(X(i) = x_j, X(i+1) = x_k) = P(X(i) = x_k, X(i+1) = x_j),    for all 1 ≤ j, k ≤ s,    (61)

then we say that X is reversible with respect to ~p. This is equivalent to the detailed-balance condition

(T_X)_{kj} ~p_j = (T_X)_{jk} ~p_k,    for all 1 ≤ j, k ≤ s.    (62)


As proved in the following theorem, reversibility implies stationarity, but the converse does not hold: a Markov chain is not necessarily reversible with respect to a stationary distribution (and often won't be). The detailed-balance condition therefore only provides a sufficient condition for stationarity.

Theorem 5.4 (Reversibility implies stationarity). If a time-homogeneous Markov chain X is reversible with respect to a distribution p_X, then p_X is a stationary distribution of X.

Proof. Let ~p be the state vector containing p_X. By assumption T_X and ~p satisfy (62), so for 1 ≤ j ≤ s

(T_X ~p)_j = ∑_{k=1}^{s} (T_X)_{jk} ~p_k    (63)
           = ∑_{k=1}^{s} (T_X)_{kj} ~p_j    (64)
           = ~p_j ∑_{k=1}^{s} (T_X)_{kj}    (65)
           = ~p_j.    (66)

The last step follows from the fact that the columns of a valid transition matrix must add to one (the chain always has to go somewhere).
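The detailed-balance check and its consequence can be sketched for a small chain. NumPy is assumed; the two-state transition matrix and the candidate distribution are hypothetical choices, not part of the examples in these notes.

```python
import numpy as np

# Hypothetical two-state chain; columns of T sum to one, as in the notes,
# so T[k, j] is the probability of moving from state j to state k.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
p = np.array([2 / 3, 1 / 3])  # candidate distribution

# Detailed balance (62): T[k, j] * p[j] == T[j, k] * p[k] for all j, k.
# T * p scales column j by p[j], so (62) says this matrix is symmetric.
reversible = np.allclose(T * p, (T * p).T)

# Reversibility implies stationarity (Theorem 5.4): T p == p.
print(reversible, np.allclose(T @ p, p))  # True True
```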

In Example 5.1 the Markov chain has two stationary distributions. It turns out that this is not possible for irreducible Markov chains.

Theorem 5.5. Irreducible Markov chains have a single stationary distribution.

Proof. This follows from the Perron-Frobenius theorem, which states that the transition matrix of an irreducible Markov chain has a single eigenvector with eigenvalue equal to one and nonnegative entries.

If, in addition, the Markov chain is aperiodic, then it is guaranteed to converge in distribution to a random variable with its stationary distribution for any initial state vector. Such Markov chains are called ergodic.

Theorem 5.6 (Convergence of Markov chains). If a discrete-time time-homogeneous Markov chain X is irreducible and aperiodic, its state vector converges to the stationary distribution ~p_stat of X for any initial state vector ~p_{X(0)}. This implies that X converges in distribution to a random variable with pmf given by ~p_stat.


Figure 7: Evolution of the state vector of the Markov chain in Example 5.7 for the initial state vectors ~p_{X(0)} = (1/3, 1/3, 1/3)^T (left), (1, 0, 0)^T (center) and (0.1, 0.2, 0.7)^T (right).

The proof of this result is beyond the scope of the course.

Example 5.7 (Car rental (continued)). The Markov chain in the car rental example is irreducible and aperiodic. We will now check that it indeed converges in distribution. Its transition matrix has the following eigenvectors:

~q_1 := (0.273, 0.545, 0.182)^T,    ~q_2 := (−0.577, 0.789, −0.211)^T,    ~q_3 := (−0.577, −0.211, 0.789)^T.    (67)

The corresponding eigenvalues are λ_1 := 1, λ_2 := 0.573 and λ_3 := 0.227. As predicted by Theorem 5.5, the Markov chain has a single stationary distribution.

For any initial state vector, the component that is collinear with ~q_1 will be preserved by the transitions of the Markov chain, but the other two components will become negligible after a while. The chain consequently converges in distribution to a random variable with pmf ~q_1 (note that ~q_1 has been normalized to be a valid pmf), as predicted by Theorem 5.6. This is illustrated in Figure 7. No matter how the company allocates the new cars, eventually 27.3% will end up in San Francisco, 54.5% in LA and 18.2% in San Jose.
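Using the eigendecomposition quoted in (67), we can rebuild the transition matrix and verify the convergence numerically. This is a sketch assuming NumPy; since the eigenvector entries are rounded to three decimal places, the check is only accurate to a few decimal places.

```python
import numpy as np

# Eigenvectors (as columns) and eigenvalues from Example 5.7, rounded.
Q = np.array([[0.273, -0.577, -0.577],
              [0.545,  0.789, -0.211],
              [0.182, -0.211,  0.789]])
lams = np.array([1.0, 0.573, 0.227])

# Rebuild the transition matrix from T = Q Lambda Q^{-1}.
T = Q @ np.diag(lams) @ np.linalg.inv(Q)

# Any initial pmf converges to the stationary distribution q1.
p = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    p = T @ p

print(p.round(2))  # close to (0.27, 0.55, 0.18)
```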
