markov chain order estimation and the chi-square divergence
TRANSCRIPT
-
7/28/2019 MARKOV CHAIN ORDER ESTIMATION AND THE CHI-SQUARE DIVERGENCE
1/19
Applied Probability Trust (1 April 2013)
MARKOV CHAIN ORDER ESTIMATION
AND THE CHI-SQUARE DIVERGENCE
A. R. BAIGORRI,
C. R. GONÇALVES and
P. A. A. RESENDE, University of Brasília
Email addresses: [email protected], [email protected] and [email protected]
Abstract
We define a few objects that capture relevant information from the sample of
a Markov chain, and we use the chi-square divergence as a measure of diversity
between probability densities to define the estimators LDL and GDL for a
Markov chain sample. After exploring their properties, we propose a new
estimator for the Markov chain order. Finally, we show numerical simulation
results comparing the performance of the proposed alternative with the
well-known and already established AIC, BIC and EDC estimators.
Keywords: GDL; LDL; AIC; BIC; EDC; Markov order estimation; chi-square
divergence.
2010 Mathematics Subject Classification: Primary 62M05; Secondary 60J10; 62F12
1. Introduction
A sequence $\{X_i\}_{i\ge 0}$ of random variables taking values in $E = \{1, \dots, m\}$ is a Markov chain of order $\kappa$ if, for all $(a_0, \dots, a_{n+1}) \in E^{n+2}$,
$$P(X_{n+1} = a_{n+1} \mid X_0 = a_0, \dots, X_n = a_n) = P(X_{n+1} = a_{n+1} \mid X_{n-\kappa+1} = a_{n-\kappa+1}, \dots, X_n = a_n) \quad (1)$$
and $\kappa$ is the smallest integer with this property. For simplicity, we assume $\{X_i\}_{i\ge 0}$ time homogeneous, define $a_1^{l+k} = a_1^{l}\, a_{l+1}^{l+k} = (a_1, \dots, a_{k+l}) \in E^{k+l}$ and
Postal address: Department of Mathematics, University of Brasília, 70910-900, Brasília-DF, Brazil
$$p(a_{\kappa+1} \mid a_1^{\kappa}) = P(X_{\kappa+1} = a_{\kappa+1} \mid X_1 = a_1, \dots, X_\kappa = a_\kappa).$$
Also, we have the i.i.d. case for $\kappa = 0$. The class of processes satisfying condition (1) for a given $\kappa \ge 0$ will be denoted by $M_\kappa$. In this setting, the order of a process in $\bigcup_{i\ge 0} M_i$ is the smallest integer $\kappa$ such that $X = \{X_i\} \in M_\kappa$.
Over the last few decades there has been a great deal of research on the
estimation of the order of a Markov chain, starting with Bartlett [4], Hoel [13], Good
[12], Anderson and Goodman [3] and Billingsley [5], among others, dealing with hypothesis
tests; more recently, using information criteria, Tong [19], Schwarz [17], Katz [14],
Csiszár and Shields [6], Zhao et al. [20] and Dorea [11] have contributed new Markov
chain order estimators.
Akaike's [1] entropic information criterion, known as AIC, has had a fundamental
impact on statistical model evaluation problems. The AIC has been applied by Tong,
for example, to the problem of estimating the order of autoregressive processes,
autoregressive integrated moving average processes, and Markov chains. The Akaike-Tong
(AIC) estimator was derived as an asymptotic estimate of the Kullback-Leibler
information discrepancy and provides a useful tool for evaluating models estimated
by the maximum likelihood method. Later on, Katz [14] derived the asymptotic
distribution of the estimator and showed its inconsistency, proving that there is a
positive probability of overestimating the true order no matter how large the sample
size. Nevertheless, AIC is the most used Markov chain order estimator at the present
time, mainly because it is more efficient than BIC for small samples.
The main consistent alternative, the BIC estimator, does not perform well for
relatively small samples, as pointed out by Katz [14] and Csiszár and Shields
[6]. It is natural to admit that the growth of the Markov chain complexity (size
of the state space and order) has significant influence on the sample size required for
the identification of the unknown order, even though, most of the time, it is difficult
to obtain sufficiently large samples in practical settings. In this sense, looking for a
better strongly consistent alternative, Zhao et al. [20] and Dorea [11] established the EDC,
properly adjusting the penalty term used in AIC and BIC.
All the estimators mentioned above are based on the penalized log-likelihood
method and, as such, have their common roots in the likelihood ratio for hypothesis tests.
In these notes we use a different entropic object, the $\chi^2$-divergence, and study its
behaviour when applied to samples of random variables with multinomial empirical
distributions derived from a Markov chain sample. Finally, we shall propose a new
strongly consistent Markov chain order estimator, more efficacious than the already
established AIC, BIC and EDC, as shall be exhibited through the outcomes of
several numerical simulations.
This paper is organized as follows. Section 2 presents the $f$-divergences and a
first order Markov chain derived from $X$, which is useful to extend the already known
asymptotic results to orders larger than one. Section 3 introduces the proposed order
estimator, namely GDL, and proves its strong consistency. Finally, Section 4 provides
numerical simulations, where one can observe the better performance of GDL compared
to AIC, BIC and EDC.
2. Auxiliary Results
2.1. Entropy and $f$-divergences
Basically, an $f$-divergence is a function that measures the discrepancy between two
probability distributions $P$ and $Q$. The divergence is intuitively an average, weighted by $Q$, of the
function $f$ applied to the odds ratio given by $P$ and $Q$.
These divergences were introduced and studied independently by Csiszár and Shields
[7, 8] and Ali and Silvey [2], among others. Sometimes these divergences are referred
to as Ali-Silvey distances.
Definition 2.1. Let $P$ and $Q$ be discrete probability densities with support $E = \{1, \dots, m\}$. For a convex function $f(t)$, defined for $t > 0$, with $f(1) = 0$, the $f$-divergence for the distributions $P$ and $Q$ is
$$D_f(P\|Q) = \sum_{a \in E} Q(a)\, f\!\left( \frac{P(a)}{Q(a)} \right).$$
Here we take $0 f(\tfrac{0}{0}) = 0$, $f(0) = \lim_{t \to 0} f(t)$, and $0 f(\tfrac{a}{0}) = \lim_{t \to 0} t f(\tfrac{a}{t}) = a \lim_{u \to \infty} \tfrac{f(u)}{u}$.
For example, taking $f(t) = t \log(t)$ or $f(t) = (1-t)^2$, we have:
$$f(t) = t \log(t) \;\Rightarrow\; D_f(P\|Q) = \sum_{a\in E} P(a) \log \frac{P(a)}{Q(a)},$$
$$f(t) = (1-t)^2 \;\Rightarrow\; D_f(P\|Q) = \sum_{a\in E} \frac{(P(a) - Q(a))^2}{Q(a)},$$
which are called relative entropy and $\chi^2$-divergence, respectively. From now on, the
$\chi^2$-divergence shall be denoted by $D_{\chi^2}(P\|Q)$.
Observe that the triangle inequality is not satisfied in general, so $D_{\chi^2}(P\|Q)$
is not a distance in the strict sense.
The $\chi^2$-divergence $D_{\chi^2}(P\|Q)$ underlies a well known statistical test procedure closely
related to the chi-square distribution [15].
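As a concrete illustration of Definition 2.1, the sketch below (ours, not part of the paper) evaluates $D_f(P\|Q)$ for the two choices of $f$ above; it assumes $P$ and $Q$ are given as dictionaries over $E$ and, for simplicity, skips states where $Q(a) = 0$ instead of applying the $a \lim_{u\to\infty} f(u)/u$ convention.

```python
import math

def f_divergence(P, Q, f):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)), with the convention 0*f(0/0) = 0.
    States with Q(a) = 0 are skipped in this simplified sketch."""
    total = 0.0
    for a, q in Q.items():
        p = P.get(a, 0.0)
        if q > 0:
            total += q * f(p / q)
    return total

# f(t) = t log t  -> relative entropy (Kullback-Leibler divergence)
kl = lambda t: t * math.log(t) if t > 0 else 0.0
# f(t) = (1 - t)^2 -> chi-square divergence, sum_a (P(a) - Q(a))^2 / Q(a)
chi2 = lambda t: (1.0 - t) ** 2

# hypothetical toy densities on E = {1, 2, 3}
P = {1: 0.5, 2: 0.3, 3: 0.2}
Q = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
d = f_divergence(P, Q, chi2)  # equals sum_a (P(a) - Q(a))^2 / Q(a)
```

Note that $D_{\chi^2}(P\|P) = 0$, while the divergence grows as $P$ moves away from $Q$, which is what the order estimator below exploits.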
2.2. Derived Markov Chains
Let $X_1^n = (X_1, \dots, X_n)$ be a sample from a Markov chain $X = \{X_i\}$ of unknown order $\kappa$, as already defined. Assume that, for all $x_1^{\kappa+1} \in E^{\kappa+1}$,
$$p(x_{\kappa+1} \mid x_1^{\kappa}) = P(X_{n+1} = x_{\kappa+1} \mid X_{n-\kappa+1}^{n} = x_1^{\kappa}) > 0. \quad (2)$$
Following Doob [10], from the process $X$ we can derive a first order Markov chain
$Y^{(\kappa)} = \{Y^{(\kappa)}_n\}$ by setting $Y^{(\kappa)}_n = (X_n, \dots, X_{n+\kappa-1})$, so that, for $v = (i_1, \dots, i_\kappa)$ and $w = (i'_1, \dots, i'_\kappa)$,
$$P(Y^{(\kappa)}_{n+1} = w \mid Y^{(\kappa)}_n = v) = \begin{cases} p(i'_\kappa \mid i_1 \dots i_\kappa), & \text{if } i'_j = i_{j+1},\ j = 1, \dots, \kappa - 1,\\ 0, & \text{otherwise.} \end{cases}$$
Clearly $Y^{(\kappa)}$ is a first order, time homogeneous Markov chain, which from now on
shall be called the derived process; by (2) it is irreducible and positive recurrent,
having a unique stationary distribution, say $\pi$. It is well known [10] that the derived
Markov chain $Y^{(\kappa)}$ is irreducible and aperiodic, consequently ergodic. Thus, there
exists an equilibrium distribution $\pi(\cdot)$ satisfying, for any initial distribution on $E^\kappa$,
$$\lim_{n\to\infty} |P(Y^{(\kappa)}_n = x_1^{\kappa}) - \pi(x_1^{\kappa})| = 0,$$
and
and
$$\pi(x_1^{\kappa}) = \sum_{z_1^{\kappa}} \pi(z_1^{\kappa})\, p(x_\kappa \mid z_1^{\kappa}) = \sum_{x_0} \pi(x_0\, x_1^{\kappa-1})\, p(x_\kappa \mid x_0\, x_1^{\kappa-1}).$$
Likewise, we can define, for $l > \kappa$, the chain $Y^{(l)}$ and verify that
$$\pi_l(x_1^l) = \pi(x_1^{\kappa})\, p(x_{\kappa+1} \mid x_1^{\kappa}) \cdots p(x_l \mid x_{l-\kappa}^{l-1}) = \sum_{x_0} \pi_l(x_0\, x_1^{l-1})\, p(x_l \mid x_0\, x_1^{l-1}), \quad (3)$$
which shows that $\pi_l$ defined above is a stationary distribution for $Y^{(l)}$. For the sake
of notational simplicity, we use from now on
$$\pi(a_1^l) = \pi_l(a_1^l), \quad l \ge \kappa, \quad (4)$$
$$\pi(a_1^l) = \sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l), \quad l < \kappa \quad (5)$$
and
$$p(j \mid a_1^l) = \frac{\sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l)\, p(j \mid b_1^{\kappa-l}\, a_1^l)}{\sum_{b_1^{\kappa-l} \in E^{\kappa-l}} \pi(b_1^{\kappa-l}\, a_1^l)}, \quad l < \kappa. \quad (6)$$
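To make the construction concrete, the following sketch (our illustration; the function name and data layout are assumptions, not the authors') assembles the transition matrix of the derived chain $Y^{(\kappa)}$ on the enlarged state space $E^\kappa$ from a mapping of order-$\kappa$ transition probabilities.

```python
from itertools import product

def derived_transition_matrix(p, E, kappa):
    """Transition kernel of the derived chain Y^(kappa): state v = (i_1, ..., i_kappa)
    moves to w = (i_2, ..., i_kappa, j) with probability p(j | i_1 ... i_kappa);
    every other transition has probability 0."""
    states = list(product(E, repeat=kappa))
    index = {v: k for k, v in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for v in states:
        for j in E:
            w = v[1:] + (j,)                  # shift the window by one step
            Q[index[v]][index[w]] = p[v, j]   # p(j | v), assumed given
    return states, Q

# hypothetical order-2 example on E = {0, 1}: the next symbol repeats the
# symbol seen two steps back with probability 0.9
E = [0, 1]
p = {((a, b), j): (0.9 if j == a else 0.1) for a in E for b in E for j in E}
states, Q = derived_transition_matrix(p, E, 2)
```

Each row of the resulting matrix sums to 1, and at most $|E|$ of the $|E|^\kappa$ entries per row are nonzero, reflecting the sliding-window structure of $Y^{(\kappa)}$.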
Now, let us return to $X_1^n = (X_1, X_2, \dots, X_n)$ and define
$$N(a_1^{\lambda}) = \sum_{j=1}^{n-\lambda+1} \mathbf{1}(X_j = a_1, \dots, X_{j+\lambda-1} = a_\lambda), \quad (7)$$
that is, the number of occurrences of $a_1^{\lambda}$ in $X_1^n$. If $\lambda = 0$, we take $N(\,\cdot\,) = n$.
From now on, the sums involving $N(a_1^{\lambda})$ are taken over positive terms; that is, we
adopt the conventions $0/0 = 0$ and $0 \cdot \infty = 0$.

The main interest in defining the derived process is the possibility of using the well
established asymptotic results for first order Markov chains. Lemma 2.1 below
is a version of the Law of the Iterated Logarithm, used by Dorea [11] to conclude Lemma
2.2, which will be used in the establishment of subsequent results. The Strong Law of
Large Numbers (SLLN) is needed too and can be found in [9].
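Computationally, the counts $N(a_1^\lambda)$ of (7) are just sliding-window frequencies of the sample; a minimal sketch (ours), following the convention $N(\cdot) = n$ for $\lambda = 0$:

```python
from collections import Counter

def window_counts(sample, lam):
    """N(a_1^lam): number of occurrences of each word of length lam in the
    sample X_1^n, counted over the n - lam + 1 sliding windows; for lam = 0
    we take N(.) = n, following the convention in (7)."""
    n = len(sample)
    if lam == 0:
        return Counter({(): n})
    return Counter(tuple(sample[j:j + lam]) for j in range(n - lam + 1))

X = [1, 2, 1, 2, 1]
N2 = window_counts(X, 2)
# windows: (1,2), (2,1), (1,2), (2,1) -> N(1,2) = 2 and N(2,1) = 2
```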
Lemma 2.1. (Meyn and Tweedie (1993).) Let $X = \{X_n\}_{n \ge 0}$ be an ergodic Markov
chain with finite state space $E$ and stationary distribution $\pi$, let $g: E \to \mathbb{R}$, $S_n(g) = \sum_{j=1}^{n} g(X_j)$ and
$$\sigma_g^2 = E_\pi(\bar g^{\,2}(X_1)) + 2 \sum_{j=2}^{\infty} E_\pi(\bar g(X_1)\, \bar g(X_j)), \qquad \bar g = g - E_\pi(g).$$
(a) If $\sigma_g^2 = 0$, then
$$\lim_{n\to\infty} \frac{1}{\sqrt{n}}\, [S_n(g) - E_\pi(S_n(g))] = 0 \quad \text{a.s.}$$
(b) If $\sigma_g^2 > 0$, then
$$\limsup_{n\to\infty} \frac{S_n(g) - E_\pi(S_n(g))}{\sqrt{2\, \sigma_g^2\, n \log(\log(n))}} = 1 \quad \text{a.s.}$$
and
$$\liminf_{n\to\infty} \frac{S_n(g) - E_\pi(S_n(g))}{\sqrt{2\, \sigma_g^2\, n \log(\log(n))}} = -1 \quad \text{a.s.},$$
where $E_\pi$ denotes expectation under initial distribution $\pi$ and a.s. abbreviates almost surely.
Lemma 2.2. (Dorea (2008).) If $Y^{(\kappa)}$ is an ergodic Markov chain with finite state
space $E^\kappa$, initial distribution $\pi$, $\kappa \ge 1$ and $i\, a_1^{\kappa}\, j \in E^{\kappa+2}$, then
$$\limsup_{n\to\infty} \frac{\big| N(i a_1^{\kappa} j) - N(i a_1^{\kappa})\, p(j \mid i\, a_1^{\kappa}) \big|}{\sqrt{2\, n \log(\log(n))}} = \sigma_{i a_1^{\kappa} j} \quad \text{a.s.}, \qquad \sigma^2_{i a_1^{\kappa} j} = \pi(i a_1^{\kappa} j)\big(1 - p(j \mid i\, a_1^{\kappa})\big).$$
3. Main Results
Basically, our approach consists in defining, for each sequence $a_1^{\lambda} \in E^{\lambda}$ and $i, j \in E$,
two densities $P_{a_1^{\lambda}}(i,j)$ and $Q_{a_1^{\lambda}}(i,j)$. Comparing them via the $\chi^2$-divergence, we
capture relevant information about the dependency related to $i\, a_1^{\lambda}\, j$. In the sequel, we
sum over all possible $i, j \in E$ and obtain an object carrying local information on the
dependency order for $a_1^{\lambda}$. Finally, summing over all $a_1^{\lambda}$, rescaling properly and making
some adjustments, we define the GDL Markov chain order estimator.
Definition 3.1. With $N(a_1^{\lambda})$ as defined in (7), consider
$$P_{a_1^{\lambda}}(i,j) = \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)}, \qquad Q_{a_1^{\lambda}}(i,j) = \frac{N(i a_1^{\lambda})}{\sum_{i} N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_{j} N(a_1^{\lambda} j)}$$
and
$$\chi^2(P\|Q) := n\, D_{\chi^2}\big(P_{a_1^{\lambda}} \,\big\|\, Q_{a_1^{\lambda}}\big) = n \sum_{i,j\in E} \frac{\big(P_{a_1^{\lambda}}(i,j) - Q_{a_1^{\lambda}}(i,j)\big)^2}{Q_{a_1^{\lambda}}(i,j)}. \quad (8)$$
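In code, Definition 3.1 amounts to forming, for a fixed context $a_1^\lambda$, the empirical joint distribution of the flanking pair $(i, j)$ and the product of its marginals, then comparing them with the rescaled $\chi^2$-divergence. The sketch below is our illustration; by the equivalence (13), the marginals of the $(\lambda+2)$-word counts stand in for $N(i a_1^\lambda)$ and $N(a_1^\lambda j)$ up to $O(1)$ terms.

```python
from collections import Counter

def chi2_statistic(sample, context, E):
    """n * D_chi2(P_a || Q_a) of (8): P_a is the empirical distribution of
    (i, j) over occurrences of i a j; Q_a is the product of its marginals."""
    lam = len(context)
    n = len(sample)
    # counts of all words of length lam + 2
    words = Counter(tuple(sample[t:t + lam + 2]) for t in range(n - lam - 1))
    N = {(i, j): words.get((i,) + tuple(context) + (j,), 0) for i in E for j in E}
    total = sum(N.values())
    row = {i: sum(N[i, j] for j in E) for i in E}   # stands in for N(i a)
    col = {j: sum(N[i, j] for i in E) for j in E}   # stands in for N(a j)
    stat = 0.0
    for i in E:
        for j in E:
            if total == 0:
                continue
            P = N[i, j] / total
            Q = (row[i] / total) * (col[j] / total)
            if Q > 0:   # convention: sums are taken over positive terms
                stat += (P - Q) ** 2 / Q
    return n * stat
```

For a context at or beyond the true order, the pair $(i, j)$ is asymptotically independent given $a_1^\lambda$ and the statistic stays of order $\log\log n$; below the true order it diverges linearly in $n$, which is the content of Theorem 3.1.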
Using the SLLN and assuming $\lambda \ge \kappa$, we conclude
$$\lim_{n\to\infty} P_{a_1^{\lambda}}(i,j) = \lim_{n\to\infty} \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)} = \lim_{n\to\infty} \frac{N(i a_1^{\lambda})}{\sum_{i,j} N(i a_1^{\lambda} j)} \cdot \frac{N(i a_1^{\lambda} j)}{N(i a_1^{\lambda})} = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid i\, a_1^{\lambda}) \quad \text{a.s.} \quad (9)$$
and, analogously,
$$\lim_{n\to\infty} Q_{a_1^{\lambda}}(i,j) = \lim_{n\to\infty} \frac{N(i a_1^{\lambda})}{\sum_i N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_j N(a_1^{\lambda} j)} = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid a_1^{\lambda}) \quad \text{a.s.} \quad (10)$$
In the same manner, but using the notation defined in (5), we conclude for $\lambda < \kappa$ that
$$\lim_{n\to\infty} P_{a_1^{\lambda}}(i,j) = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid i\, a_1^{\lambda}) \quad \text{a.s.} \quad (11)$$
and
$$\lim_{n\to\infty} Q_{a_1^{\lambda}}(i,j) = \frac{\pi(i a_1^{\lambda})}{\pi(a_1^{\lambda})}\, p(j \mid a_1^{\lambda}) \quad \text{a.s.} \quad (12)$$
In (9) and (10) we used the easily verified equivalence$^1$
$$\sum_i N(i a_1^{\lambda}) = \sum_j N(a_1^{\lambda} j) + O(1) = \sum_{i,j} N(i a_1^{\lambda} j) + O(1) = N(a_1^{\lambda}) + O(1). \quad (13)$$
Theorem 3.1. For $\chi^2(P\|Q)$ as defined in (8):
(a) If $\lambda \ge \kappa$, there exists $L \in \mathbb{R}$ such that
$$P\Big( \limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} \le L \Big) = 1. \quad (14)$$
(b) If $\lambda = \kappa - 1$, there exist $a_1^{\lambda}$ and $i, j, k \in E$ with $k \ne i$ such that $p(j \mid i\, a_1^{\lambda}) \ne p(j \mid k\, a_1^{\lambda})$; for these,
$$P\Big( \limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \infty \Big) = 1.$$
Proof.
(a) Replacing $P_{a_1^{\lambda}}(i,j)$ and $Q_{a_1^{\lambda}}(i,j)$ by their definitions, and using (12) and (13), we have
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \limsup_{n\to\infty} \frac{n\, \big(P_{a_1^{\lambda}}(i,j) - Q_{a_1^{\lambda}}(i,j)\big)^2}{2\log\log(n)\, Q_{a_1^{\lambda}}(i,j)}$$
$$= \sum_{i,j\in E} \frac{\pi(a_1^{\lambda})}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{\sum_{i,j} N(i a_1^{\lambda} j)} - \frac{N(i a_1^{\lambda})}{\sum_i N(i a_1^{\lambda})} \cdot \frac{N(a_1^{\lambda} j)}{\sum_j N(a_1^{\lambda} j)} \Big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{\pi(a_1^{\lambda})}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} - \frac{N(i a_1^{\lambda})}{N(a_1^{\lambda}) + O(1)} \cdot \frac{N(a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} \Big)^2}{n \log\log(n)}. \quad (15)$$
$^1$ Here we use the $O$ notation: $g(n) = O(f(n))$ means that $\lim_{n\to\infty} g(n)/f(n)$ is a constant $> 0$.
By the SLLN,
$$\frac{n}{N(a_1^{\lambda}) + O(1)} \xrightarrow{\ \text{a.s.}\ } \frac{1}{\pi(a_1^{\lambda})} \qquad \text{and} \qquad \frac{N(a_1^{\lambda} j)}{N(a_1^{\lambda}) + O(1)} \xrightarrow{\ \text{a.s.}\ } p(j \mid a_1^{\lambda}).$$
Applying these limits in (15), we have
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{n \log\log(n)} \quad \text{a.s.} \quad (16)$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid i\, a_1^{\lambda}) \big)^2}{n \log\log(n)}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\; 2\, \pi(i a_1^{\lambda} j)\, \big(1 - p(j \mid i\, a_1^{\lambda})\big) \quad \text{a.s.} \;<\; \infty.$$
In the last two equalities we used that $p(j \mid a_1^{\lambda}) = p(j \mid i\, a_1^{\lambda})$, a consequence of $\lambda \ge \kappa$, and Lemma 2.2. Now, taking $L$ sufficiently large, we conclude (14).
(b) $\lambda = \kappa - 1$. Continuing from (16), and using the notation of (6) and (12),
$$\limsup_{n\to\infty} \frac{\chi^2(P\|Q)}{2\log\log(n)} = \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{\big( N(i a_1^{\lambda} j) - N(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n^2 \Big( \frac{N(i a_1^{\lambda} j)}{n} - \frac{N(i a_1^{\lambda})}{n}\, p(j \mid a_1^{\lambda}) \Big)^2}{n \log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n\, \big( \pi(i a_1^{\lambda})\, p(j \mid i\, a_1^{\lambda}) - \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda}) \big)^2}{\log\log(n)} \quad \text{a.s.}$$
$$= \sum_{i,j\in E} \frac{1}{2\, \pi(i a_1^{\lambda})\, p(j \mid a_1^{\lambda})}\, \limsup_{n\to\infty} \frac{n\, \big[ \pi(i a_1^{\lambda})\, \big( p(j \mid i\, a_1^{\lambda}) - p(j \mid a_1^{\lambda}) \big) \big]^2}{\log\log(n)} = \infty \quad \text{a.s.} \quad (17)$$
In the last equality we used the hypothesis that there exist $i, j, k \in E$ with $p(j \mid i\, a_1^{\lambda}) \ne p(j \mid k\, a_1^{\lambda})$, so that $p(j \mid i\, a_1^{\lambda}) = p(j \mid a_1^{\lambda})$ cannot hold for all $i, j \in E$.
Herein we define the Local Dependency Level (LDL) and the Global Dependency Level (GDL).

Definition 3.2. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of a Markov chain $X$ of order $\kappa \ge 0$.
Assume $\lambda \ge 0$, $P$, $Q$ and $\chi^2(P\|Q)$ as previously defined. Also, consider $V$ a $\chi^2$ random
variable with $(m-1)^2$ degrees of freedom and $\bar P: \mathbb{R}^+ \to [0,1]$ the continuous, strictly
decreasing function defined by
$$\bar P(x) = P(V \ge x), \quad x \in \mathbb{R}^+.$$
(a) The Local Dependency Level $LDL_n(a_1^{\lambda})$ for $a_1^{\lambda}$ is
$$LDL_n(a_1^{\lambda}) = \frac{\chi^2(P\|Q)}{2 \log(\log(n))},$$
computed for the sequence $a_1^{\lambda}$.
(b) The Global Dependency Level $GDL_n(\lambda)$ is
$$GDL_n(\lambda) = \bar P\Big( \sum_{a_1^{\lambda} \in E^{\lambda}} \frac{N(a_1^{\lambda})}{n}\, LDL_n(a_1^{\lambda}) \Big).$$
The LDL provides a measure of dependency for a specific $a_1^{\lambda}$, which could be
analysed separately. In the GDL we rescale a weighted average of LDLs to fit a proper variability.

Observe that, if the true order is $\kappa$, then for all $a_1^{\lambda}$ with $\lambda \ge \kappa$,
$$P\Big( \liminf_{n\to\infty} GDL_n(\lambda) \ge \bar P(L) \Big) = 1 \quad (18)$$
and, for $\lambda = \kappa - 1$,
$$P\Big( \lim_{n\to\infty} GDL_n(\lambda) = \bar P(\infty) = 0 \Big) = 1. \quad (19)$$
Consequently, for a Markov chain $X$ of order $\kappa$,
$$\kappa = 0 \iff \liminf_{n\to\infty} GDL_n(\lambda) \ge \bar P(L) > 0 \quad \text{for } \lambda = 0, 1, \dots, B;$$
otherwise,
$$\kappa = \max_{0 \le \lambda \le B} \{\lambda : \lim_{n\to\infty} GDL_n(\lambda) = 0\} + 1.$$
Finally, let us define the Markov chain order estimator based on the information
contained in the vector $GDL_n = (GDL_n(0), \dots, GDL_n(B))$.
Definition 3.3. Given a fixed number $B \in \mathbb{N}$, define the set $S = \{0,1\}^{B+1}$ and the map $T: S \to \mathbb{N} \cup \{-1\}$ by
$$T(s_0, \dots, s_B) = \begin{cases} -1, & \text{if } s_i = 1 \text{ for all } i = 0, \dots, B,\\ \max_{0 \le i \le B} \{i : s_i = 0\}, & \text{otherwise.} \end{cases}$$
Definition 3.4. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of the Markov chain $X$ of order $\kappa$,
$0 \le \kappa \le B \in \mathbb{N}$, and $\{GDL_n(i)\}_{i=0}^{B}$ as above. We define the order estimator $\hat\kappa_{GDL}(X^n)$ as
$$\hat\kappa_{GDL}(X^n) = T(\hat s_n) + 1, \qquad \hat s_n = \arg\min_{s \in S} \sum_{i=0}^{B} \big( GDL_n(i) - s(i) \big)^2,$$
where $s(i)$ is the projection onto the $i$-th coordinate.
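Because the objective in Definition 3.4 is a sum of independent coordinatewise terms, the minimizer $\hat s_n$ simply rounds each $GDL_n(i)$ to the nearer of 0 and 1. The sketch below (ours) implements $T(\hat s_n) + 1$ under that observation, reading the first branch of Definition 3.3 as $T = -1$ so that $\hat\kappa = 0$ when every coordinate rounds to 1; in practice the values $GDL_n(i)$ would come from $\bar P$ (a $\chi^2$ upper-tail probability, e.g. `scipy.stats.chi2.sf`) applied to the averaged LDLs.

```python
def gdl_order_estimator(gdl):
    """kappa_hat = T(s_hat) + 1 of Definition 3.4: the minimizer s_hat of
    sum_i (GDL_n(i) - s_i)^2 over {0,1}^(B+1) rounds each coordinate, and
    T picks the largest index rounded to 0 (or -1 if there is none)."""
    s = [1 if g >= 0.5 else 0 for g in gdl]   # coordinatewise argmin
    zeros = [i for i, si in enumerate(s) if si == 0]
    T = max(zeros) if zeros else -1           # T(s) = -1 when all s_i = 1
    return T + 1

# hypothetical GDL vector: lags 0 and 1 rejected (near 0), lags 2, 3 accepted
estimate = gdl_order_estimator([0.01, 0.03, 0.92, 0.88])  # -> order 2
```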
By (18) and (19) it is clear that the order estimator converges almost surely to the true order, i.e.,
$$P\Big( \lim_{n\to\infty} \hat\kappa_{GDL}(X^n) = \kappa \Big) = 1.$$

4. Numerical Simulation
In what follows we compare the non-asymptotic performance, mainly for small
samples, of some of the most used Markov chain order estimators. Consider the
notation $N(a_1^{\lambda})$, as defined in (7), and denote
$$L(\lambda) = \prod_{a_1^{\lambda+1}} \left( \frac{N(a_1^{\lambda+1})}{N(a_1^{\lambda})} \right)^{N(a_1^{\lambda+1})}.$$
The estimators of the Markov chain order are defined under the hypothesis that
there exists a known $B$ such that $0 \le \kappa \le B$.
The best known order estimators are
$$\hat\kappa_{AIC} = \arg\min\{AIC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
$$\hat\kappa_{BIC} = \arg\min\{BIC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
$$\hat\kappa_{EDC} = \arg\min\{EDC(\lambda)\,;\ \lambda = 0, 1, \dots, B\},$$
where
$$AIC(\lambda) = -2\log L(\lambda) + 2\,|E|^{\lambda}(|E|-1),$$
$$BIC(\lambda) = -2\log L(\lambda) + |E|^{\lambda}(|E|-1)\log(n),$$
$$EDC(\lambda) = -2\log L(\lambda) + 2\,|E|^{\lambda+1}\log\log(n).$$
By simple inspection, for large enough $n$, we verify that
$$AIC(\lambda) \le EDC(\lambda) \le BIC(\lambda).$$
Clearly, for a given $\lambda$, the statistic $GDL(\lambda)$, as well as $AIC(\lambda)$, $BIC(\lambda)$ and
$EDC(\lambda)$, contains much of the information concerning the sample's relative dependency.
Nevertheless, numerical simulations as well as theoretical considerations anticipate a
great deal of variability for small samples.
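Under these definitions, the three penalized estimators can be sketched as follows (our illustration; helper names are ours). The term $\log L(\lambda)$ is computed from the window counts $N(\cdot)$ of (7), and each penalty follows the formulas above.

```python
import math
from collections import Counter

def log_likelihood(sample, lam):
    """log L(lam) = sum over words a of length lam + 1 of
    N(a) * log(N(a) / N(prefix of a)), with N(.) = n for lam = 0."""
    n = len(sample)
    Nk1 = Counter(tuple(sample[j:j + lam + 1]) for j in range(n - lam))
    Nk = (Counter(tuple(sample[j:j + lam]) for j in range(n - lam + 1))
          if lam > 0 else Counter({(): n}))
    return sum(c * math.log(c / Nk[w[:-1]]) for w, c in Nk1.items())

def order_estimate(sample, B, penalty):
    """argmin over lam = 0..B of -2 log L(lam) + penalty(lam, n)."""
    n = len(sample)
    return min(range(B + 1),
               key=lambda lam: -2 * log_likelihood(sample, lam) + penalty(lam, n))

m = 2  # |E| for the toy example below
aic = lambda lam, n: 2 * m ** lam * (m - 1)
bic = lambda lam, n: m ** lam * (m - 1) * math.log(n)
edc = lambda lam, n: 2 * m ** (lam + 1) * math.log(math.log(n))

# a deterministic alternating sequence is an order-1 chain on E = {0, 1}
k_bic = order_estimate([0, 1] * 100, 3, bic)  # expected to pick order 1
```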
The following numerical simulation, based on an algorithm due to Raftery [16],
starts with the generation of a Markov chain transition matrix $Q = (q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}})$
with entries
$$q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}} = \sum_{t=1}^{\kappa} \lambda_t\, R(i_t, i_{\kappa+1}), \qquad i_1^{\kappa+1} \in E^{\kappa+1}, \quad (20)$$
where the matrix $R$ satisfies
$$R(i,j) \ge 0, \qquad \sum_{j=1}^{m} R(i,j) = 1, \quad 1 \le i \le m,$$
and the positive numbers $\{\lambda_i\}_{i=1}^{\kappa}$, with $\sum_{i=1}^{\kappa} \lambda_i = 1$,
are arbitrarily chosen in advance.
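The generation step (20) can be sketched as follows (our illustration, with states relabelled $0, \dots, m-1$): each row of $Q$ is a $\lambda$-mixture of rows of $R$, and samples are then drawn recursively from the resulting order-$\kappa$ kernel.

```python
from itertools import product
import random

def raftery_kernel(R, lams, E):
    """Q(i_1...i_kappa; i_{kappa+1}) = sum_t lam_t * R(i_t, i_{kappa+1}),
    equation (20); since R is stochastic and the lam_t are positive weights
    summing to 1, each row of Q is again a probability distribution."""
    kappa = len(lams)
    Q = {}
    for past in product(E, repeat=kappa):
        for nxt in E:
            Q[past, nxt] = sum(l * R[i][nxt] for l, i in zip(lams, past))
    return Q

def sample_chain(Q, E, kappa, n, seed=0):
    """Generate a sample of size n from the order-kappa chain with kernel Q,
    starting from a uniformly chosen initial window."""
    rng = random.Random(seed)
    x = [rng.choice(E) for _ in range(kappa)]
    while len(x) < n:
        past = tuple(x[-kappa:])
        x.append(rng.choices(E, weights=[Q[past, j] for j in E])[0])
    return x

R = [[0.05, 0.05, 0.90], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]  # R4 below
Q = raftery_kernel(R, [1 / 3, 1 / 3, 1 / 3], E=[0, 1, 2])
X = sample_chain(Q, [0, 1, 2], kappa=3, n=1000)
```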
Once the matrix $Q = (q_{i_1 i_2 \dots i_\kappa;\, i_{\kappa+1}})$ is obtained, two hundred replications of a
Markov chain sample of size $n$, state space $E$ and transition matrix $Q$ are generated
to compare the performance of GDL against the standard, well known and already
established order estimators mentioned above.

Katz [14] obtained the asymptotic distribution of AIC and proved its inconsistency,
showing the existence of a positive probability of overestimating the order; see also [18].
Besides that, Csiszár and Shields [6] and Zhao et al. [20] proved strong consistency
for the estimators BIC and EDC, respectively.

It is quite intuitive that the random information regarding the order of a Markov
chain is spread over an exponentially growing set of empirical distributions, of
cardinality $m^{B+1}$, where $B$ is the assumed upper bound for $\kappa$. It seems reasonable to think that a
viable sample, i.e. a sample able to retrieve enough information to estimate the
chain order, should have size $n \approx O(m^{B+1})$. Using this, we have chosen the sample
sizes for each case.

Finally, after applying all estimators to each one of the replicated samples, the final
results are registered in tables.
Case I: Markov chain examples with $\kappa = 0$, $|E| = 3$.

Firstly, we choose the matrices $\{R_1, R_2, R_3\}$ to produce samples with sizes $500 \le n \le 2000$, originated from Markov chains of order $\kappa = 0$ with quite different probability distributions, given by:
$$R_1 = \begin{pmatrix} 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \end{pmatrix}, \quad R_2 = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}, \quad R_3 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \end{pmatrix}.$$
Table 1: Rates of fitness for case |E| = 3, κ = 0, n ∈ {500, 1000, 1500} and distribution given by R1.

           n = 500                     n = 1000                    n = 1500
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   75.5%  100%  100%  99%     80%    100%  100%  99.5%    71.5%  100%  100%  99%
  1   24.5%  -     -     1%      18%    -     -     0.5%     22.5%  -     -     1%
  2   -      -     -     -       2%     -     -     -        6%     -     -     -
Table 2: Rates of fitness for case |E| = 3, κ = 0, n ∈ {1000, 1500, 2000} and distribution given by R2.

           n = 1000                    n = 1500                    n = 2000
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   63.5%  100%  100%  99%     63%    100%  100%  99%      59%    100%  100%  99%
  1   29%    -     -     1%      34.5%  -     -     1%       37%    -     -     1%
  2   7.5%   -     -     -       2.5%   -     -     -        4%     -     -     -
Table 3: Rates of fitness for case |E| = 3, κ = 0, n ∈ {1000, 1500, 2000} and distribution given by R3.

           n = 1000                    n = 1500                    n = 2000
  k   AIC    BIC   EDC   GDL     AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL
  0   43%    100%  100%  98%     47%    100%  99.5% 96%      46%    100%  100%  97%
  1   53%    -     -     2%      51.5%  -     0.5%  4%       50.5%  -     -     2%
  2   4%     -     -     -       1.5%   -     -     -        3.5%   -     -     1%
Notice that, for each fixed sample size $n \in \{500, 1000, 1500, 2000\}$, the order estimator $\hat\kappa_{AIC}$ steadily overestimates the real order $\kappa = 0$, with the excess depending on the probability distribution of the Markov chain. Differently, the order estimators
$\hat\kappa_{BIC}$, $\hat\kappa_{EDC}$ and $\hat\kappa_{GDL}$ show consistent performance, mainly obtaining the right order, free from the influence of the sample size and of the generating matrix. The apparent efficiency of $\hat\kappa_{BIC}$ and $\hat\kappa_{EDC}$ in this case ($\kappa = 0$) is a consequence of the strong tendency of these estimators to underestimate the order.
Case II: Markov chain examples with $\kappa = 3$, $|E| = 3$, and $\kappa \in \{2, 3, 0\}$, $|E| = 4$.

Secondly, we choose the matrices $\{R_4, R_5\}$ to produce samples with sizes $n \in \{500, 1000, 1500, 2000\}$, originated from Markov chains with $|E| = 3$ of order $\kappa = 3$:
$$R_4 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.90 & 0.05 \\ 0.90 & 0.05 & 0.05 \end{pmatrix}, \qquad R_5 = \begin{pmatrix} 0.475 & 0.475 & 0.05 \\ 0.475 & 0.05 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}.$$
Table 4: Rates of fitness for case |E| = 3, κ = 3, n ∈ {1000, 1500, 2000}, distribution given by R4 and λi = 1/3, i = 1, 2, 3.

           n = 1000                     n = 1500                     n = 2000
  k   AIC    BIC    EDC    GDL    AIC    BIC    EDC    GDL     AIC    BIC    EDC    GDL
  2   -      99.5%  88.5%  41%    -      76.5%  16.5%  5%      -      17%    0.5%   1%
  3   100%   0.5%   11.5%  59%    100%   23.5%  83.5%  95%     100%   83%    99.5%  99%
Table 5: Rates of fitness for case |E| = 3, κ = 3, n ∈ {1000, 1500, 2500}, distribution given by R5 and λi = 1/3, i = 1, 2, 3.

           n = 1000                     n = 1500                     n = 2500
  k   AIC    BIC    EDC    GDL    AIC    BIC    EDC    GDL     AIC    BIC    EDC    GDL
  0   -      0.5%   -      -      -      -      -      -       -      -      -      -
  1   -      92.5%  69.5%  6.5%   -      54.5%  19.5%  1%      -      -      -      -
  2   16.5%  7%     30.5%  92%    2%     45.5%  80.5%  80.5%   -      100%   98.5%  8.5%
  3   83.5%  -      -      1.5%   98%    -      -      18.5%   100%   -      1.5%   91.5%
For $|E| = 3$, $\kappa = 3$, the estimator $\hat\kappa_{AIC}$ overestimates the order to a lesser extent
than in the previous case, while $\hat\kappa_{BIC}$ and $\hat\kappa_{EDC}$, overweighted by their respective penalty
terms, underestimate the order more than expected. Concerning $\hat\kappa_{GDL}$, it rapidly converges to the right order as the sample size $n$ grows.

For $|E| = 4$, the greater complexity of a Markov chain of order $\kappa = 3$ imposes the use
of a larger sample size to achieve some reliability. Finally, we choose the matrices
$\{R_6, R_7\}$ to produce samples of size $n = 5000$, originated from Markov chains of
order $\kappa \in \{2, 3, 0\}$, as in the previous cases:
$$R_6 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.85 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.85 & 0.05 & 0.05 & 0.05 \end{pmatrix}, \qquad R_7 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \end{pmatrix}.$$
Table 6: Rates of fitness for case |E| = 4, κ ∈ {2, 3, 0}, n = 5000 and distributions given by R6, R7 and λi = 1/κ, i = 1, ..., κ if κ > 0.

      R6, λi = 1/2 and κ = 2      R6, λi = 1/3 and κ = 3      R7 and κ = 0
  k   AIC   BIC   EDC   GDL       AIC   BIC   EDC   GDL       AIC   BIC   EDC   GDL
  0   -     -     -     -         -     -     -     -         85%   100%  100%  100%
  1   -     -     -     -         -     -     -     -         15%   -     -     -
  2   100%  100%  100%  100%      -     99%   -     4%        -     -     -     -
  3   -     -     -     -         100%  1%    100%  96%       -     -     -     -
For $|E| = 4$, $\kappa = 0$, apparently $\hat\kappa_{AIC}$ keeps overestimating the order to some degree, while $\hat\kappa_{BIC}$, as in the example with $\kappa = 3$, severely underestimates the order, presumably due to the excessive weight of its penalty term. On the contrary, $\hat\kappa_{EDC}$ and $\hat\kappa_{GDL}$ behave quite well in the same setting.
Conclusion
The pioneering research started with the contributions of Bartlett [4], Hoel [13], Good
[12], Anderson and Goodman [3] and Billingsley [5], among others, who developed
tests of hypothesis for the estimation of the order of a given Markov chain.

Later on, these procedures were adapted and improved with the use of penalty
functions [19, 14], together with other tools created in the realm of model selection
[1, 17]. Since then, there has been a considerable number of subsequent contributions
on this subject, several of them consisting in the enhancement of already existing
techniques [6, 20].

In these notes we proposed a new Markov chain order estimator based on a different idea,
which makes it behave quite differently. This estimator is strongly consistent
and more efficient than AIC (which is inconsistent), outperforming the well established and
consistent BIC and EDC, mainly on relatively small samples.
References
[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.
[2] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological) 28, 131-142.
[3] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. The Annals of Mathematical Statistics 28, 89-110.
[4] Bartlett, M. S. (1951). The frequency goodness of fit test for probability chains. Proceedings of the Cambridge Philosophical Society.
[5] Billingsley, P. (1961). Statistical methods in Markov chains. The Annals of Mathematical Statistics 32, 12-40.
[6] Csiszár, I. and Shields, P. C. (2000). The consistency of the BIC Markov order estimator. The Annals of Statistics 28, 1601-1619.
[7] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299-318.
[8] Csiszár, I. and Shields, P. C. (2004). Information Theory and Statistics: A Tutorial. Now Publishers Inc.
[9] Dacunha-Castelle, D., Duflo, M. and McHale, D. (1986). Probability and Statistics, Vol. II. Springer.
[10] Doob, J. L. (1966). Stochastic Processes (Wiley Publications in Statistics). John Wiley & Sons Inc.
[11] Dorea, C. C. Y. (2008). Optimal penalty term for EDC Markov chain order estimator. Annales de l'Institut de Statistique de l'Université de Paris (l'ISUP) 52, 15-26.
[12] Good, I. J. (1955). The likelihood ratio test for Markoff chains. Biometrika 42, 531-533.
[13] Hoel, P. G. (1954). A test for Markoff chains. Biometrika 41, 430-433.
[14] Katz, R. W. (1981). On some criteria for estimating the order of a Markov chain. Technometrics 23, 243-249.
[15] Pardo, L. (2005). Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC.
[16] Raftery, A. E. (1985). A model for high-order Markov chains. J. R. Statist. Soc. B.
[17] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.
[18] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
[19] Tong, H. (1975). Determination of the order of a Markov chain by Akaike's information criterion. Journal of Applied Probability 12, 488-497.
[20] Zhao, L. C., Dorea, C. C. Y. and Gonçalves, C. R. (2001). On determination of the order of a Markov chain. Statistical Inference for Stochastic Processes 4, 273-282.