markov chain order estimation and the chi-square divergence
TRANSCRIPT
-
7/28/2019 MARKOV CHAIN ORDER ESTIMATION AND THE CHI-SQUARE DIVERGENCE
1/19
Applied Probability Trust (1 April 2013)
MARKOV CHAIN ORDER ESTIMATION
AND THE CHI-SQUARE DIVERGENCE
A. R. BAIGORRI,
C. R. GONÇALVES and
P. A. A. RESENDE, University of Brasília
Email addresses: [email protected], [email protected] and [email protected]
Abstract
We define a few objects that capture relevant information from the sample of
a Markov chain, and we use the chi-square divergence as a measure of diversity
between probability densities to define the estimators LDL and GDL for a
Markov chain sample. After exploring their properties, we propose a new
estimator for the Markov chain order. Finally, we show numerical simulation
results comparing the performance of the proposed alternative with the
well-known and already established AIC, BIC and EDC estimators.
Keywords: GDL; LDL; AIC; BIC; EDC; Markov order estimation; chi-square
divergence.
2010 Mathematics Subject Classification: Primary 62M05; Secondary 60J10; 62F12
1. Introduction
A sequence $\{X_i\}_{i\ge 0}$ of random variables taking values in $E = \{1, \dots, m\}$ is a Markov chain of order $\kappa$ if, for all $(a_0, \dots, a_{n+1}) \in E^{n+2}$,
$$P(X_{n+1} = a_{n+1} \mid X_0 = a_0, \dots, X_n = a_n) = P(X_{n+1} = a_{n+1} \mid X_{n-\kappa+1} = a_{n-\kappa+1}, \dots, X_n = a_n) \quad (1)$$
and $\kappa$ is the smallest integer with this property. For simplicity, we assume $\{X_i\}_{i\ge 0}$ time homogeneous, define $a_1^{l+k} = a_1^{l}\, a_{l+1}^{l+k} = (a_1, \dots, a_{k+l}) \in E^{k+l}$ and
Postal address: Department of Mathematics, University of Brasília, 70910-900, Brasília-DF, Brazil
$$p(a_{\kappa+1} \mid a_1^{\kappa}) = P(X_{\kappa+1} = a_{\kappa+1} \mid X_1 = a_1, \dots, X_\kappa = a_\kappa).$$
Also, we have the i.i.d. case for $\kappa = 0$. The class of processes satisfying condition (1) for a given $\kappa \ge 0$ will be denoted by $M_\kappa$. In this setting, the order of a process in $\bigcup_{i\ge 0} M_i$ is the smallest integer $\kappa$ such that $X = \{X_i\} \in M_\kappa$.
Over the last few decades there has been a great deal of research on the
estimation of the order of a Markov chain, starting with Bartlett [4], Hoel [13], Good
[12], Anderson and Goodman [3] and Billingsley [5], among others, dealing with hypothesis
tests; more recently, using information criteria, Tong [19], Schwarz [17], Katz [14],
Csiszár and Shields [6], Zhao et al. [20] and Dorea [11] have contributed new Markov
chain order estimators.
Akaike's [1] entropic information criterion, known as AIC, has had a fundamental
impact on statistical model evaluation problems. The AIC has been applied by Tong,
for example, to the problem of estimating the order of autoregressive processes,
autoregressive integrated moving average processes, and Markov chains. The Akaike-Tong
(AIC) estimator was derived as an asymptotic estimate of the Kullback-Leibler
information discrepancy and provides a useful tool for evaluating models estimated
by the maximum likelihood method. Later on, Katz [14] derived the asymptotic
distribution of the estimator and showed its inconsistency, proving that there is a
positive probability of overestimating the true order no matter how large the sample
size. Nevertheless, AIC is the most used Markov chain order estimator at the present
time, mainly because it is more efficient than BIC for small samples.
The main consistent alternative, the BIC estimator, does not perform well for
relatively small samples, as pointed out by Katz [14] and Csiszár and Shields
[6]. It is natural to admit that the growth of the Markov chain complexity (size
of the state space and order) has significant influence on the sample size required for
the identification of the unknown order, even though, most of the time, it is difficult
to obtain sufficiently large samples in practical settings. In this sense, looking for a
better strongly consistent alternative, Zhao et al. [20] and Dorea [11] established the EDC,
properly adjusting the penalty term used in AIC and BIC.
All the estimators mentioned above are based on the penalized log-likelihood
method and, as such, have their common roots in the likelihood ratio for hypothesis tests.
In these notes we use a different entropic object, the $\chi^2$-divergence, and study its
behaviour when applied to samples of random variables with multinomial empirical
distributions derived from a Markov chain sample. Finally, we shall propose a new
strongly consistent Markov chain order estimator, more efficacious than the already
established AIC, BIC and EDC, as shall be exhibited through the outcomes of
several numerical simulations.
This paper is organized as follows. Section 2 presents the $f$-divergences and a
first order Markov chain derived from $X$, which is useful to extend the already known
asymptotic results to orders larger than one. Section 3 introduces the proposed order
estimator, namely GDL, and proves its strong consistency. Finally, Section 4 provides
numerical simulations, where one can observe the better performance of GDL compared
to AIC, BIC and EDC.
2. Auxiliary Results
2.1. Entropy and $f$-divergences
Basically, an $f$-divergence is a function that measures the discrepancy between two
probability distributions $P$ and $Q$. The divergence is intuitively an average, weighted by $Q$, of the
function $f$ applied to the odds ratio given by $P$ and $Q$.
These divergences were introduced and studied independently by Csiszár and Shields
[7, 8] and Ali and Silvey [2], among others. Sometimes these divergences are referred
to as Ali-Silvey distances.
Definition 2.1. Let $P$ and $Q$ be discrete probability densities with support $E = \{1, \dots, m\}$. For a convex function $f(t)$, defined for $t > 0$, with $f(1) = 0$, the $f$-divergence for the distributions $P$ and $Q$ is
$$D_f(P\|Q) = \sum_{a \in E} Q(a)\, f\!\left( \frac{P(a)}{Q(a)} \right).$$
Here we take $0 f(\tfrac{0}{0}) = 0$, $f(0) = \lim_{t \to 0} f(t)$, and $0 f(\tfrac{a}{0}) = \lim_{t \to 0} t f(\tfrac{a}{t}) = a \lim_{u \to \infty} \tfrac{f(u)}{u}$.
For example, taking $f(t) = t \log(t)$ or $f(t) = (1-t)^2$, we have:
$$f(t) = t \log(t) \;\Rightarrow\; D_f(P\|Q) = \sum_{a\in E} P(a) \log \frac{P(a)}{Q(a)},$$
$$f(t) = (1-t)^2 \;\Rightarrow\; D_f(P\|Q) = \sum_{a\in E} \frac{(P(a) - Q(a))^2}{Q(a)},$$
which are called relative entropy and $\chi^2$-divergence, respectively. From now on, the
$\chi^2$-divergence shall be denoted by $D_{\chi^2}(P\|Q)$.
Observe that the triangle inequality is not satisfied in general, so $D_{\chi^2}(P\|Q)$
is not a distance in the strict sense.
The $\chi^2$-divergence $D_{\chi^2}(P\|Q)$ underlies a well known statistical test procedure closely
related to the chi-square distribution [15].
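As a concrete illustration of Definition 2.1, the sketch below (ours, not part of the paper) evaluates $D_f(P\|Q)$ for the two choices of $f$ above; it assumes $P$ and $Q$ are given as dictionaries over $E$ and, for simplicity, skips states where $Q(a) = 0$ instead of applying the $a \lim_{u\to\infty} f(u)/u$ convention.

```python
import math

def f_divergence(P, Q, f):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)), with the convention 0*f(0/0) = 0.
    States with Q(a) = 0 are skipped in this simplified sketch."""
    total = 0.0
    for a, q in Q.items():
        p = P.get(a, 0.0)
        if q > 0:
            total += q * f(p / q)
    return total

# f(t) = t log t  -> relative entropy (Kullback-Leibler divergence)
kl = lambda t: t * math.log(t) if t > 0 else 0.0
# f(t) = (1 - t)^2 -> chi-square divergence, sum_a (P(a) - Q(a))^2 / Q(a)
chi2 = lambda t: (1.0 - t) ** 2

# hypothetical toy densities on E = {1, 2, 3}
P = {1: 0.5, 2: 0.3, 3: 0.2}
Q = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
d = f_divergence(P, Q, chi2)  # equals sum_a (P(a) - Q(a))^2 / Q(a)
```

Note that $D_{\chi^2}(P\|P) = 0$, while the divergence grows as $P$ moves away from $Q$, which is what the order estimator below exploits.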
2.2. Derived Markov Chains
Let $X_1^n = (X_1, \dots, X_n)$ be a sample from a Markov chain $X = \{X_i\}$ of unknown order $\kappa$, as already defined. Assume that, for all $x_1^{\kappa+1} \in E^{\kappa+1}$,
$$p(x_{\kappa+1} \mid x_1^{\kappa}) = P(X_{n+1} = x_{\kappa+1} \mid X_{n-\kappa+1}^{n} = x_1^{\kappa}) > 0. \quad (2)$$
Following Doob [10], from the process $X$ we can derive a first order Markov chain
$Y^{(\kappa)} = \{Y^{(\kappa)}_n\}$ by setting $Y^{(\kappa)}_n = (X_n, \dots, X_{n+\kappa-1})$, so that, for $v = (i_1, \dots, i_\kappa)$ and $w = (i'_1, \dots, i'_\kappa)$,
$$P(Y^{(\kappa)}_{n+1} = w \mid Y^{(\kappa)}_n = v) = \begin{cases} p(i'_\kappa \mid i_1 \dots i_\kappa), & \text{if } i'_j = i_{j+1},\ j = 1, \dots, \kappa - 1,\\ 0, & \text{otherwise.} \end{cases}$$
Clearly $Y^{(\kappa)}$ is a first order, time homogeneous Markov chain, which from now on
shall be called the derived process; by (2) it is irreducible and positive recurrent,
having a unique stationary distribution, say $\pi$. It is well known [10] that the derived
Markov chain $Y^{(\kappa)}$ is irreducible and aperiodic, consequently ergodic. Thus, there
exists an equilibrium distribution $\pi(\cdot)$ satisfying, for any initial distribution on $E^\kappa$,
$$\lim_{n\to\infty} |P(Y^{(\kappa)}_n = x_1^{\kappa}) - \pi(x_1^{\kappa})| = 0,$$
and
and
$$\pi(x_1^{\kappa}) = \sum_{z_1^{\kappa}} \pi(z_1^{\kappa})\, p(x_\kappa \mid z_1^{\kappa}) = \sum_{x_0} \pi(x_0\, x_1^{\kappa-1})\, p(x_\kappa \mid x_0\, x_1^{\kappa-1}).$$
Likewise, we can define, for $l > \kappa$, the chain $Y^{(l)}$ and verify that
$$\pi_l(x_1^l) = \pi(x_1^{\kappa})\, p(x_{\kappa+1} \mid x_1^{\kappa}) \cdots p(x_l \mid x_{l-\kappa}^{l-1}) = \sum_{x_0} \pi_l(x_0\, x_1^{l-1})\, p(x_l \mid x_0\, x_1^{l-1}), \quad (3)$$
which shows that $\pi_l$ defined above is a stationary distribution for $Y^{(l)}$. For the sake
of notational simplicity, we use from now on
$$\pi(a_1^l) = \pi_l(a_1^l), \quad l \ge \kappa, \quad (4)$$
$$\pi(a_1^l) = \sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l), \quad l < \kappa \quad (5)$$
and
$$p(j \mid a_1^l) = \frac{\sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l)\, p(j \mid b_1^{\kappa-l}\, a_1^l)}{\sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l)}, \quad l < \kappa. \quad (6)$$
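To make the construction concrete, the following sketch (our illustration; the function name and data layout are assumptions, not the authors') assembles the transition matrix of the derived chain $Y^{(\kappa)}$ on the enlarged state space $E^\kappa$ from a mapping of order-$\kappa$ transition probabilities.

```python
from itertools import product

def derived_transition_matrix(p, E, kappa):
    """Transition kernel of the derived chain Y^(kappa): state v = (i_1, ..., i_kappa)
    moves to w = (i_2, ..., i_kappa, j) with probability p(j | i_1 ... i_kappa);
    every other transition has probability 0."""
    states = list(product(E, repeat=kappa))
    index = {v: k for k, v in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for v in states:
        for j in E:
            w = v[1:] + (j,)                  # shift the window by one step
            Q[index[v]][index[w]] = p[v, j]   # p(j | v), assumed given
    return states, Q

# hypothetical order-2 example on E = {0, 1}: the next symbol repeats the
# symbol seen two steps back with probability 0.9
E = [0, 1]
p = {((a, b), j): (0.9 if j == a else 0.1) for a in E for b in E for j in E}
states, Q = derived_transition_matrix(p, E, 2)
```

Each row of the resulting matrix sums to 1, and at most $|E|$ of the $|E|^\kappa$ entries per row are nonzero, reflecting the sliding-window structure of $Y^{(\kappa)}$.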
Now, let us return to $X_1^n = (X_1, X_2, \dots, X_n)$ and define
$$N(a_1^{\lambda}) = \sum_{j=1}^{n-\lambda+1} \mathbf{1}(X_j = a_1, \dots, X_{j+\lambda-1} = a_\lambda), \quad (7)$$
that is, the number of occurrences of $a_1^{\lambda}$ in $X_1^n$. If $\lambda = 0$, we take $N(\,\cdot\,) = n$.
From now on, the sums involving $N(a_1^{\lambda})$ are taken over positive terms; that is, we
adopt the conventions $0/0 = 0$ and $0 \cdot \infty = 0$.

The main interest in defining the derived process is the possibility of using the well
established asymptotic results for first order Markov chains. Lemma 2.1 below
is a version of the Law of the Iterated Logarithm, used by Dorea [11] to conclude Lemma
2.2, which will be used in the establishment of subsequent results. The Strong Law of
Large Numbers (SLLN) is needed too and can be found in [9].
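Computationally, the counts $N(a_1^\lambda)$ of (7) are just sliding-window frequencies of the sample; a minimal sketch (ours), following the convention $N(\cdot) = n$ for $\lambda = 0$:

```python
from collections import Counter

def window_counts(sample, lam):
    """N(a_1^lam): number of occurrences of each word of length lam in the
    sample X_1^n, counted over the n - lam + 1 sliding windows; for lam = 0
    we take N(.) = n, following the convention in (7)."""
    n = len(sample)
    if lam == 0:
        return Counter({(): n})
    return Counter(tuple(sample[j:j + lam]) for j in range(n - lam + 1))

X = [1, 2, 1, 2, 1]
N2 = window_counts(X, 2)
# windows: (1,2), (2,1), (1,2), (2,1) -> N(1,2) = 2 and N(2,1) = 2
```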
Lemma 2.1. (Meyn and Tweedie (1993).) Let $X = \{X_n\}_{n \ge 0}$ be an ergodic Markov
chain with finite state space $E$ and stationary distribution $\pi$, let $g: E \to \mathbb{R}$, $S_n(g) = \sum_{j=1}^{n} g(X_j)$ and
$$\sigma_g^2 = E_\pi(\bar g^{\,2}(X_1)) + 2 \sum_{j=2}^{\infty} E_\pi(\bar g(X_1)\, \bar g(X_j)), \qquad \bar g = g - E_\pi(g).$$
(a) If $\sigma_g^2 = 0$, then
$$\lim_{n\to\infty} \frac{1}{\sqrt{n}}\, [S_n(g) - E_\pi(S_n(g))] = 0 \quad \text{a.s.}$$
(b) If $\sigma_g^2 > 0$, then
$$\limsup_{n\to\infty} \frac{S_n(g) - E_\pi(S_n(g))}{\sqrt{2\, \sigma_g^2\, n \log(\log(n))}} = 1 \quad \text{a.s.}$$
and
$$\liminf_{n\to\infty} \frac{S_n(g) - E_\pi(S_n(g))}{\sqrt{2\, \sigma_g^2\, n \log(\log(n))}} = -1 \quad \text{a.s.},$$
where $E_\pi$ denotes expectation under initial distribution $\pi$ and a.s. abbreviates almost surely.
Lemma 2.2. (Dorea (2008).) If $Y^{(\kappa)}$ is an ergodic Markov chain with finite state
space $E^\kappa$, initial distribution $\pi$, $\kappa \ge 1$ and $i\, a_1^{\kappa}\, j \in E^{\kappa+2}$, then
$$\limsup_{n\to\infty} \frac{\big| N(i a_1^{\kappa} j) - N(i a_1^{\kappa})\, p(j \mid i\, a_1^{\kappa}) \big|}{\sqrt{2\, n \log(\log(n))}} = \sigma_{i a_1^{\kappa} j} \quad \text{a.s.}, \qquad \sigma^2_{i a_1^{\kappa} j} = \pi(i a_1^{\kappa} j)\big(1 - p(j \mid i\, a_1^{\kappa})\big).$$
3. Main Results
Basically, our approach consists in defining, for each sequence $a_1^{\lambda} \in E^{\lambda}$ and $i, j \in E$,
two densities $P_{a_1^{\lambda}}(i,j)$ and $Q_{a_1^{\lambda}}(i,j)$. Comparing them via the $\chi^2$-divergence, we
capture relevant information about the dependency related to $i\, a_1^{\lambda}\, j$. In the sequel, we
sum over all possible $i, j \in E$ and obtain an object carrying local information on the
dependency order for $a_1^{\lambda}$. Finally, summing over all $a_1^{\lambda}$, rescaling properly and making
some adjustments, we define the GDL Markov chain order estimator.
Definition 3.1. With $N(a_1^{\lambda})$ as defined in (7), consider
$$P_{a_1^{\lambda}}(i,j) = \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)}, \qquad Q_{a_1^{\lambda}}(i,j) = \frac{N(i a_1^{\lambda})}{\sum_{i} N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_{j} N(a_1^{\lambda} j)}$$
and
$$\chi^2(P\|Q) := n\, D_{\chi^2}\big(P_{a_1^{\lambda}} \,\big\|\, Q_{a_1^{\lambda}}\big) = n \sum_{i,j\in E} \frac{\big(P_{a_1^{\lambda}}(i,j) - Q_{a_1^{\lambda}}(i,j)\big)^2}{Q_{a_1^{\lambda}}(i,j)}. \quad (8)$$
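In code, Definition 3.1 amounts to forming, for a fixed context $a_1^\lambda$, the empirical joint distribution of the flanking pair $(i, j)$ and the product of its marginals, then comparing them with the rescaled $\chi^2$-divergence. The sketch below is our illustration; by the equivalence (13), the marginals of the $(\lambda+2)$-word counts stand in for $N(i a_1^\lambda)$ and $N(a_1^\lambda j)$ up to $O(1)$ terms.

```python
from collections import Counter

def chi2_statistic(sample, context, E):
    """n * D_chi2(P_a || Q_a) of (8): P_a is the empirical distribution of
    (i, j) over occurrences of i a j; Q_a is the product of its marginals."""
    lam = len(context)
    n = len(sample)
    # counts of all words of length lam + 2
    words = Counter(tuple(sample[t:t + lam + 2]) for t in range(n - lam - 1))
    N = {(i, j): words.get((i,) + tuple(context) + (j,), 0) for i in E for j in E}
    total = sum(N.values())
    row = {i: sum(N[i, j] for j in E) for i in E}   # stands in for N(i a)
    col = {j: sum(N[i, j] for i in E) for j in E}   # stands in for N(a j)
    stat = 0.0
    for i in E:
        for j in E:
            if total == 0:
                continue
            P = N[i, j] / total
            Q = (row[i] / total) * (col[j] / total)
            if Q > 0:   # convention: sums are taken over positive terms
                stat += (P - Q) ** 2 / Q
    return n * stat
```

For a context at or beyond the true order, the pair $(i, j)$ is asymptotically independent given $a_1^\lambda$ and the statistic stays of order $\log\log n$; below the true order it diverges linearly in $n$, which is the content of Theorem 3.1.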
Using the SLLN and assuming $\lambda \ge \kappa$, we conclude
$$\lim_{n\to\infty} P_{a_1^{\lambda}}(i,j) = \lim_{n\to\infty} \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)} = \lim_{n\to\infty} \frac{N(i a_1^{\lambda})}{\sum_{i,j} N(i a_1^{\lambda} j)} \cdot \frac{N(i a_1^{\lambda} j)}{N(i a_1^{\lambda})} = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid i\, a_1^{\lambda}) \quad \text{a.s.} \quad (9)$$
and, analogously,
$$\lim_{n\to\infty} Q_{a_1^{\lambda}}(i,j) = \lim_{n\to\infty} \frac{N(i a_1^{\lambda})}{\sum_i N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_j N(a_1^{\lambda} j)} = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid a_1^{\lambda}) \quad \text{a.s.} \quad (10)$$
In the same manner, but using the notation defined in (5), we conclude for $\lambda < \kappa$ that
$$\lim_{n\to\infty} P_{a_1^{\lambda}}(i,j) = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid i\, a_1^{\lambda}) \quad \text{a.s.} \quad (11)$$
and
$$\lim_{n\to\infty} Q_{a_1^{\lambda}}(i,j) = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid a_1^{\lambda}) \quad \text{a.s.} \quad (12)$$
In (9) and (10) we used the easily verified equivalence$^1$
$$\sum_i N(i a_1^{\lambda}) = \sum_j N(a_1^{\lambda} j) + O(1) = \sum_{i,j} N(i a_1^{\lambda} j) + O(1) = N(a_1^{\lambda}) + O(1). \quad (13)$$
Theorem 3.1. For $\chi^2(P\|Q)$ as defined in (8):
(a) If $\lambda \ge \kappa$, there exists $L \in \mathbb{R}$ such that
$$P\Big( \limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} \le L \Big) = 1. \quad (14)$$
(b) If $\lambda = \kappa - 1$, there exist $a_1^{\lambda}$ and $i, j, k \in E$ with $k \ne i$ such that $p(j \mid i\, a_1^{\lambda}) \ne p(j \mid k\, a_1^{\lambda})$; for these,
$$P\Big( \limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \infty \Big) = 1.$$
Proof.
(a) Replacing $P_{a_1^{\lambda}}(i,j)$ and $Q_{a_1^{\lambda}}(i,j)$ by their definitions, and using (12) and (13), we have
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \limsup_{n\to\infty} \frac{n\, \big(P_{a_1^{\lambda}}(i,j) - Q_{a_1^{\lambda}}(i,j)\big)^2}{2\log\log(n)\, Q_{a_1^{\lambda}}(i,j)}$$
$$= \sum_{i,j\in E} \frac{\pi(a_1^{\lambda})}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)} - \frac{N(i a_1^{\lambda})}{\sum_i N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_j N(a_1^{\lambda} j)} \Big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{\pi(a_1^{\lambda})}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} - \frac{N(i a_1^{\lambda})}{N(a_1^{\lambda}) + O(1)} \cdot \frac{N(a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} \Big)^2}{n \log\log(n)}. \quad (15)$$
$^1$ Here we use the $O$ notation: $g(n) = O(f(n))$ means that $\lim_{n\to\infty} g(n)/f(n)$ is a constant $> 0$.
By the SLLN,
$$\frac{n}{N(a_1^{\lambda}) + O(1)} \xrightarrow{\ \text{a.s.}\ } \frac{1}{\pi(a_1^{\lambda})} \qquad \text{and} \qquad \frac{N(a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} \xrightarrow{\ \text{a.s.}\ } p(j \mid a_1^{\lambda}).$$
Applying these limits in (15), we have
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{n \log\log(n)} \quad \text{a.s.} \quad (16)$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid i\, a_1^{\lambda}) \big)^2}{n \log\log(n)}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\; 2\, \pi(i a_1^{\lambda} j)\, \big(1 - p(j \mid i\, a_1^{\lambda})\big) \quad \text{a.s.} \;<\; \infty.$$
In the last two equalities we used that $p(j \mid a_1^{\lambda}) = p(j \mid i\, a_1^{\lambda})$, a consequence of $\lambda \ge \kappa$, and Lemma 2.2. Now, taking $L$ sufficiently large, we conclude (14).
(b) $\lambda = \kappa - 1$. Continuing from (16), and using the notation of (6) and (12),
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{n} - \frac{N(i a_1^{\lambda})}{n}\, p(j \mid a_1^{\lambda}) \Big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n\, \big( \pi(i a_1^{\lambda})\, p(j \mid i\, a_1^{\lambda}) - \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{\log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n\, \big[ \pi(i a_1^{\lambda})\, \big( p(j \mid i\, a_1^{\lambda}) - p(j \mid a_1^{\lambda}) \big) \big]^2}{\log\log(n)} = \infty \quad \text{a.s.} \quad (17)$$
In the last equality we used the hypothesis that there exist $i, j, k \in E$ with $p(j \mid i\, a_1^{\lambda}) \ne p(j \mid k\, a_1^{\lambda})$, so that $p(j \mid i\, a_1^{\lambda}) = p(j \mid a_1^{\lambda})$ cannot hold for all $i, j \in E$.
Herein we define the Local Dependency Level (LDL) and the Global Dependency Level (GDL).

Definition 3.2. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of a Markov chain $X$ of order $\kappa \ge 0$.
Assume $\lambda \ge 0$, $P$, $Q$ and $\chi^2(P\|Q)$ as previously defined. Also, consider $V$ a $\chi^2$ random
variable with $(m-1)^2$ degrees of freedom and $\bar P: \mathbb{R}^+ \to [0,1]$ the continuous, strictly
decreasing function defined by
$$\bar P(x) = P(V \ge x), \quad x \in \mathbb{R}^+.$$
(a) The Local Dependency Level $LDL_n(a_1^{\lambda})$ for $a_1^{\lambda}$ is
$$LDL_n(a_1^{\lambda}) = \frac{\chi^2(P\|Q)}{2 \log(\log(n))},$$
computed for the sequence $a_1^{\lambda}$.
(b) The Global Dependency Level $GDL_n(\lambda)$ is
$$GDL_n(\lambda) = \bar P\Big( \sum_{a_1^{\lambda} \in E^{\lambda}} \frac{N(a_1^{\lambda})}{n}\, LDL_n(a_1^{\lambda}) \Big).$$
The LDL provides a measure of dependency for a specific $a_1^{\lambda}$, which could be
analysed separately. In the GDL we rescale a weighted average of LDLs to fit a proper variability.

Observe that, if the true order is $\kappa$, then for all $a_1^{\lambda}$ with $\lambda \ge \kappa$,
$$P\Big( \liminf_{n\to\infty} GDL_n(\lambda) \ge \bar P(L) \Big) = 1 \quad (18)$$
and, for $\lambda = \kappa - 1$,
$$P\Big( \lim_{n\to\infty} GDL_n(\lambda) = \bar P(\infty) = 0 \Big) = 1. \quad (19)$$
Consequently, for a Markov chain $X$ of order $\kappa$,
$$\kappa = 0 \iff \liminf_{n\to\infty} GDL_n(\lambda) \ge \bar P(L) > 0 \quad \text{for } \lambda = 0, 1, \dots, B;$$
otherwise,
$$\kappa = \max_{0 \le \lambda \le B} \{\lambda : \lim_{n\to\infty} GDL_n(\lambda) = 0\} + 1.$$
Finally, let us define the Markov chain order estimator based on the information
contained in the vector $GDL_n = (GDL_n(0), \dots, GDL_n(B))$.
Definition 3.3. Given a fixed number $B \in \mathbb{N}$, define the set $S = \{0,1\}^{B+1}$ and the map $T: S \to \mathbb{N} \cup \{-1\}$ by
$$T(s_0, \dots, s_B) = \begin{cases} -1, & \text{if } s_i = 1 \text{ for all } i = 0, \dots, B,\\ \max_{0 \le i \le B} \{i : s_i = 0\}, & \text{otherwise.} \end{cases}$$
Definition 3.4. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of the Markov chain $X$ of order $\kappa$,
$0 \le \kappa \le B \in \mathbb{N}$, and $\{GDL_n(i)\}_{i=0}^{B}$ as above. We define the order estimator $\hat\kappa_{GDL}(X^n)$ as
$$\hat\kappa_{GDL}(X^n) = T(\hat s_n) + 1, \qquad \hat s_n = \arg\min_{s \in S} \sum_{i=0}^{B} \big( GDL_n(i) - s(i) \big)^2,$$
where $s(i)$ is the projection onto the $i$-th coordinate.
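Because the objective in Definition 3.4 is a sum of independent coordinatewise terms, the minimizer $\hat s_n$ simply rounds each $GDL_n(i)$ to the nearer of 0 and 1. The sketch below (ours) implements $T(\hat s_n) + 1$ under that observation, reading the first branch of Definition 3.3 as $T = -1$ so that $\hat\kappa = 0$ when every coordinate rounds to 1; in practice the values $GDL_n(i)$ would come from $\bar P$ (a $\chi^2$ upper-tail probability, e.g. `scipy.stats.chi2.sf`) applied to the averaged LDLs.

```python
def gdl_order_estimator(gdl):
    """kappa_hat = T(s_hat) + 1 of Definition 3.4: the minimizer s_hat of
    sum_i (GDL_n(i) - s_i)^2 over {0,1}^(B+1) rounds each coordinate, and
    T picks the largest index rounded to 0 (or -1 if there is none)."""
    s = [1 if g >= 0.5 else 0 for g in gdl]   # coordinatewise argmin
    zeros = [i for i, si in enumerate(s) if si == 0]
    T = max(zeros) if zeros else -1           # T(s) = -1 when all s_i = 1
    return T + 1

# hypothetical GDL vector: lags 0 and 1 rejected (near 0), lags 2, 3 accepted
estimate = gdl_order_estimator([0.01, 0.03, 0.92, 0.88])  # -> order 2
```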
By (18) and (19) it is clear that the order estimator converges almost surely to the true order, i.e.,
$$P\Big( \lim_{n\to\infty} \hat\kappa_{GDL}(X^n) = \kappa \Big) = 1.$$

4. Numerical Simulation
In what follows we compare the non-asymptotic performance, mainly for small
samples, of some of the most used Markov chain order estimators. Consider the
notation $N(a_1^{\lambda})$, as defined in (7), and denote
$$L(\lambda) = \prod_{a_1^{\lambda+1}} \left( \frac{N(a_1^{\lambda+1})}{N(a_1^{\lambda})} \right)^{N(a_1^{\lambda+1})}.$$
The estimators of the Markov chain order are defined under the hypothesis that
there exists a known $B$ such that $0 \le \kappa \le B$.
The best known order estimators are
$$\hat\kappa_{AIC} = \arg\min\{AIC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
$$\hat\kappa_{BIC} = \arg\min\{BIC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
$$\hat\kappa_{EDC} = \arg\min\{EDC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
where
$$AIC(\lambda) = -2\log L(\lambda) + 2\,|E|^{\lambda}(|E|-1),$$
$$BIC(\lambda) = -2\log L(\lambda) + |E|^{\lambda}(|E|-1)\log(n),$$
$$EDC(\lambda) = -2\log L(\lambda) + 2\,|E|^{\lambda+1}\log\log(n).$$
By simple inspection, for large enough $n$, we verify that
$$AIC(\lambda) \le EDC(\lambda) \le BIC(\lambda).$$
Clearly, for a given $\lambda$, the statistic $GDL(\lambda)$, as well as $AIC(\lambda)$, $BIC(\lambda)$ and
$EDC(\lambda)$, contains much of the information concerning the sample's relative dependency.
Nevertheless, numerical simulations as well as theoretical considerations anticipate a
great deal of variability for small samples.
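Under these definitions, the three penalized estimators can be sketched as follows (our illustration; helper names are ours). The term $\log L(\lambda)$ is computed from the window counts $N(\cdot)$ of (7), and each penalty follows the formulas above.

```python
import math
from collections import Counter

def log_likelihood(sample, lam):
    """log L(lam) = sum over words a of length lam + 1 of
    N(a) * log(N(a) / N(prefix of a)), with N(.) = n for lam = 0."""
    n = len(sample)
    Nk1 = Counter(tuple(sample[j:j + lam + 1]) for j in range(n - lam))
    Nk = (Counter(tuple(sample[j:j + lam]) for j in range(n - lam + 1))
          if lam > 0 else Counter({(): n}))
    return sum(c * math.log(c / Nk[w[:-1]]) for w, c in Nk1.items())

def order_estimate(sample, B, penalty):
    """argmin over lam = 0..B of -2 log L(lam) + penalty(lam, n)."""
    n = len(sample)
    return min(range(B + 1),
               key=lambda lam: -2 * log_likelihood(sample, lam) + penalty(lam, n))

m = 2  # |E| for the toy example below
aic = lambda lam, n: 2 * m ** lam * (m - 1)
bic = lambda lam, n: m ** lam * (m - 1) * math.log(n)
edc = lambda lam, n: 2 * m ** (lam + 1) * math.log(math.log(n))

# a deterministic alternating sequence is an order-1 chain on E = {0, 1}
k_bic = order_estimate([0, 1] * 100, 3, bic)  # expected to pick order 1
```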
The following numerical simulation, based on an algorithm due to Raftery [16],
starts with the generation of a Markov chain transition matrix $Q = (q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}})$
with entries
$$q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}} = \sum_{t=1}^{\kappa} \lambda_t\, R(i_t, i_{\kappa+1}), \qquad i_1^{\kappa+1} \in E^{\kappa+1}, \quad (20)$$
where the matrix $R$ satisfies
$$R(i,j) \ge 0, \qquad \sum_{j=1}^{m} R(i,j) = 1, \quad 1 \le i \le m,$$
and the positive numbers $\{\lambda_i\}_{i=1}^{\kappa}$, with $\sum_{i=1}^{\kappa} \lambda_i = 1$,
are arbitrarily chosen in advance.
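The generation step (20) can be sketched as follows (our illustration, with states relabelled $0, \dots, m-1$): each row of $Q$ is a $\lambda$-mixture of rows of $R$, and samples are then drawn recursively from the resulting order-$\kappa$ kernel.

```python
from itertools import product
import random

def raftery_kernel(R, lams, E):
    """Q(i_1...i_kappa; i_{kappa+1}) = sum_t lam_t * R(i_t, i_{kappa+1}),
    equation (20); since R is stochastic and the lam_t are positive weights
    summing to 1, each row of Q is again a probability distribution."""
    kappa = len(lams)
    Q = {}
    for past in product(E, repeat=kappa):
        for nxt in E:
            Q[past, nxt] = sum(l * R[i][nxt] for l, i in zip(lams, past))
    return Q

def sample_chain(Q, E, kappa, n, seed=0):
    """Generate a sample of size n from the order-kappa chain with kernel Q,
    starting from a uniformly chosen initial window."""
    rng = random.Random(seed)
    x = [rng.choice(E) for _ in range(kappa)]
    while len(x) < n:
        past = tuple(x[-kappa:])
        x.append(rng.choices(E, weights=[Q[past, j] for j in E])[0])
    return x

R = [[0.05, 0.05, 0.90], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]  # R4 below
Q = raftery_kernel(R, [1 / 3, 1 / 3, 1 / 3], E=[0, 1, 2])
X = sample_chain(Q, [0, 1, 2], kappa=3, n=1000)
```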
Once the matrix $Q = (q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}})$ is obtained, two hundred replications of a
Markov chain sample of size $n$, state space $E$ and transition matrix $Q$ are generated
to compare the performance of GDL against the standard, well known and already
established order estimators mentioned above.

Katz [14] obtained the asymptotic distribution of AIC and proved its inconsistency,
showing the existence of a positive probability of overestimating the order; see also [18].
Besides that, Csiszár and Shields [6] and Zhao et al. [20] proved strong consistency
for the estimators BIC and EDC, respectively.

It is quite intuitive that the random information regarding the order of a Markov
chain is spread over an exponentially growing set of empirical distributions, of
cardinality $m^{B+1}$, where $B$ is the assumed upper bound for $\kappa$. It seems reasonable to think that a
viable sample, i.e. a sample able to retrieve enough information to estimate the
chain order, should have size $n \approx O(m^{B+1})$. Using this, we have chosen the sample
sizes for each case.

Finally, after applying all estimators to each one of the replicated samples, the final
results are registered in tables.
Case I: Markov chain examples with $\kappa = 0$, $|E| = 3$.

Firstly, we choose the matrices $\{R_1, R_2, R_3\}$ to produce samples with sizes $500 \le n \le 2000$, originated from Markov chains of order $\kappa = 0$ with quite different probability distributions, given by:
$$R_1 = \begin{pmatrix} 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \end{pmatrix}, \quad R_2 = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}, \quad R_3 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \end{pmatrix}.$$
Table 1: Rates of fitness for case |E| = 3, κ = 0, n ∈ {500, 1000, 1500} and distribution given by R1.

           n = 500                     n = 1000                    n = 1500
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   75.5%  100%  100%  99%     80%    100%  100%  99.5%    71.5%  100%  100%  99%
  1   24.5%  -     -     1%      18%    -     -     0.5%     22.5%  -     -     1%
  2   -      -     -     -       2%     -     -     -        6%     -     -     -
Table 2: Rates of fitness for case |E| = 3, κ = 0, n ∈ {1000, 1500, 2000} and distribution given by R2.

           n = 1000                    n = 1500                    n = 2000
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   63.5%  100%  100%  99%     63%    100%  100%  99%      59%    100%  100%  99%
  1   29%    -     -     1%      34.5%  -     -     1%       37%    -     -     1%
  2   7.5%   -     -     -       2.5%   -     -     -        4%     -     -     -
Table 3: Rates of fitness for case |E| = 3, κ = 0, n ∈ {1000, 1500, 2000} and distribution given by R3.

           n = 1000                    n = 1500                    n = 2000
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   43%    100%  100%  98%     47%    100%  99.5% 96%      46%    100%  100%  97%
  1   53%    -     -     2%      51.5%  -     0.5%  4%       50.5%  -     -     2%
  2   4%     -     -     -       1.5%   -     -     -        3.5%   -     -     1%
Notice that, for each fixed sample size $n \in \{500, 1000, 1500, 2000\}$, the order estimator $\hat\kappa_{AIC}$ steadily overestimates the real order $\kappa = 0$, with the excess depending on the probability distribution of the Markov chain. Differently, the order estimators
$\hat\kappa_{BIC}$, $\hat\kappa_{EDC}$ and $\hat\kappa_{GDL}$ show consistent performance, mainly obtaining the right order, free from the influence of the sample size and of the generating matrix. The apparent efficiency of $\hat\kappa_{BIC}$ and $\hat\kappa_{EDC}$ in this case ($\kappa = 0$) is a consequence of the strong tendency of these estimators to underestimate the order.
Case II: Markov chain examples with $\kappa = 3$, $|E| = 3$, and $\kappa \in \{2, 3, 0\}$, $|E| = 4$.

Secondly, we choose the matrices $\{R_4, R_5\}$ to produce samples with sizes $n \in \{500, 1000, 1500, 2000\}$, originated from Markov chains with $|E| = 3$ of order $\kappa = 3$:
$$R_4 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.90 & 0.05 \\ 0.90 & 0.05 & 0.05 \end{pmatrix}, \qquad R_5 = \begin{pmatrix} 0.475 & 0.475 & 0.05 \\ 0.475 & 0.05 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}.$$
Table 4: Rates of fitness for case |E| = 3, κ = 3, n ∈ {1000, 1500, 2000}, distribution given by R4 and λi = 1/3, i = 1, 2, 3.

           n = 1000                     n = 1500                     n = 2000
  k   AIC    BIC    EDC    GDL    AIC    BIC    EDC    GDL     AIC    BIC    EDC    GDL
  2   -      99.5%  88.5%  41%    -      76.5%  16.5%  5%      -      17%    0.5%   1%
  3   100%   0.5%   11.5%  59%    100%   23.5%  83.5%  95%     100%   83%    99.5%  99%
Table 5: Rates of fitness for case |E| = 3, κ = 3, n ∈ {1000, 1500, 2500}, distribution given by R5 and λi = 1/3, i = 1, 2, 3.

           n = 1000                     n = 1500                     n = 2500
  k   AIC    BIC    EDC    GDL    AIC    BIC    EDC    GDL     AIC    BIC    EDC    GDL
  0   -      0.5%   -      -      -      -      -      -       -      -      -      -
  1   -      92.5%  69.5%  6.5%   -      54.5%  19.5%  1%      -      -      -      -
  2   16.5%  7%     30.5%  92%    2%     45.5%  80.5%  80.5%   -      100%   98.5%  8.5%
  3   83.5%  -      -      1.5%   98%    -      -      18.5%   100%   -      1.5%   91.5%
For $|E| = 3$, $\kappa = 3$, the estimator $\hat\kappa_{AIC}$ overestimates the order to a lesser extent
than in the previous case, while $\hat\kappa_{BIC}$ and $\hat\kappa_{EDC}$, overweighted by their respective penalty
terms, underestimate the order more than expected. Concerning $\hat\kappa_{GDL}$, it rapidly converges to the right order as the sample size $n$ grows.

For $|E| = 4$, the greater complexity of a Markov chain of order $\kappa = 3$ imposes the use
of a larger sample size to achieve some reliability. Finally, we choose the matrices
$\{R_6, R_7\}$ to produce samples of size $n = 5000$, originated from Markov chains of
order $\kappa \in \{2, 3, 0\}$, as in the previous cases:
$$R_6 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.85 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.85 & 0.05 & 0.05 & 0.05 \end{pmatrix}, \qquad R_7 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \end{pmatrix}.$$
Table 6: Rates of fitness for case |E| = 4, κ ∈ {2, 3, 0}, n = 5000 and distributions given by R6, R7 and λi = 1/κ, i = 1, ..., κ if κ > 0.

      R6, λi = 1/2 and κ = 2      R6, λi = 1/3 and κ = 3      R7 and κ = 0
  k   AIC   BIC   EDC   GDL       AIC   BIC   EDC   GDL       AIC   BIC   EDC   GDL
  0   -     -     -     -         -     -     -     -         85%   100%  100%  100%
  1   -     -     -     -         -     -     -     -         15%   -     -     -
  2   100%  100%  100%  100%      -     99%   -     4%        -     -     -     -
  3   -     -     -     -         100%  1%    100%  96%       -     -     -     -
For $|E| = 4$, $\kappa = 0$, apparently $\hat\kappa_{AIC}$ keeps overestimating the order to some degree, while $\hat\kappa_{BIC}$, as in the example with $\kappa = 3$, severely underestimates the order, presumably due to the excessive weight of its penalty term. On the contrary, $\hat\kappa_{EDC}$ and $\hat\kappa_{GDL}$ behave quite well in the same setting.
Conclusion
The pioneering research started with the contributions of Bartlett [4], Hoel [13], Good
[12], Anderson and Goodman [3] and Billingsley [5], among others, who developed
tests of hypothesis for the estimation of the order of a given Markov chain.

Later on, these procedures were adapted and improved with the use of penalty
functions [19, 14], together with other tools created in the realm of model selection
[1, 17]. Since then, there has been a considerable number of subsequent contributions
on this subject, several of them consisting in the enhancement of already existing
techniques [6, 20].

In these notes we proposed a new Markov chain order estimator based on a different idea,
which makes it behave quite differently. This estimator is strongly consistent
and more efficient than AIC (which is inconsistent), outperforming the well established and
consistent BIC and EDC, mainly on relatively small samples.
References
[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.
[2] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological) 28, 131-142.
[3] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. The Annals of Mathematical Statistics 28, 89-110.
[4] Bartlett, M. S. (1951). The frequency goodness of fit test for probability chains. Proceedings of the Cambridge Philosophical Society.
[5] Billingsley, P. (1961). Statistical methods in Markov chains. The Annals of Mathematical Statistics 32, 12-40.
[6] Csiszár, I. and Shields, P. C. (2000). The consistency of the BIC Markov order estimator. The Annals of Statistics 28, 1601-1619.
[7] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299-318.
[8] Csiszár, I. and Shields, P. C. (2004). Information Theory and Statistics: A Tutorial. Now Publishers Inc.
[9] Dacunha-Castelle, D., Duflo, M. and McHale, D. (1986). Probability and Statistics, Vol. II. Springer.
[10] Doob, J. L. (1966). Stochastic Processes (Wiley Publications in Statistics). John Wiley & Sons Inc.
[11] Dorea, C. C. Y. (2008). Optimal penalty term for EDC Markov chain order estimator. Annales de l'Institut de Statistique de l'Université de Paris (l'ISUP) 52, 15-26.
[12] Good, I. J. (1955). The likelihood ratio test for Markoff chains. Biometrika 42, 531-533.
[13] Hoel, P. G. (1954). A test for Markoff chains. Biometrika 41, 430-433.
[14] Katz, R. W. (1981). On some criteria for estimating the order of a Markov chain. Technometrics 23, 243-249.
[15] Pardo, L. (2005). Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC.
[16] Raftery, A. E. (1985). A model for high-order Markov chains. J. R. Statist. Soc. B.
[17] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.
[18] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
[19] Tong, H. (1975). Determination of the order of a Markov chain by Akaike's information criterion. Journal of Applied Probability 12, 488-497.
[20] Zhao, L. C., Dorea, C. C. Y. and Gonçalves, C. R. (2001). On determination of the order of a Markov chain. Statistical Inference for Stochastic Processes 4, 273-282.