

    Applied Probability Trust (1 April 2013)

    MARKOV CHAIN ORDER ESTIMATION

    AND THE CHI-SQUARE DIVERGENCE

A. R. BAIGORRI,

C. R. GONÇALVES and

P. A. A. RESENDE, University of Brasília

Email addresses: [email protected], [email protected] and [email protected]

    Abstract

We define a few objects which capture relevant information from the sample of a Markov chain, and we use the chi-square divergence as a measure of diversity between probability densities to define the estimators LDL and GDL for a Markov chain sample. After exploring their properties, we propose a new estimator for the Markov chain order. Finally, we show numerical simulation results comparing the performance of the proposed alternative with the well-known and already established AIC, BIC and EDC estimators.

Keywords: GDL; LDL; AIC; BIC; EDC; Markov order estimation; chi-square divergence.

2010 Mathematics Subject Classification: Primary 62M05; Secondary 60J10; 62F12

    1. Introduction

A sequence $\{X_i\}_{i \ge 0}$ of random variables taking values in $E = \{1, \dots, m\}$ is a Markov chain of order $\ell$ if, for all $(a_0, \dots, a_{n+1}) \in E^{n+2}$,

$$P(X_{n+1} = a_{n+1} \mid X_0 = a_0, \dots, X_n = a_n) = P(X_{n+1} = a_{n+1} \mid X_{n-\ell+1} = a_{n-\ell+1}, \dots, X_n = a_n) \quad (1)$$

and $\ell$ is the smallest integer with this property. For simplicity, we assume $\{X_i\}_{i \ge 0}$ time homogeneous, define $a_1^{l+k} = a_1^{l} \, a_{l+1}^{l+k} = (a_1, \dots, a_{k+l}) \in E^{k+l}$ and

Postal address: Department of Mathematics, University of Brasília, 70910-900, Brasília-DF, Brazil


$$p(a_{\ell+1} \mid a_1^{\ell}) = P(X_{\ell+1} = a_{\ell+1} \mid X_1 = a_1, \dots, X_{\ell} = a_{\ell}).$$

Also, we have the i.i.d. case for $\ell = 0$. The class of processes satisfying condition (1) for a given $\ell \ge 0$ will be denoted by $\mathcal{M}_\ell$. In this case, the order of a process in $\bigcup_{i \ge 0} \mathcal{M}_i$ is the smallest integer $\ell$ such that $X = \{X_i\} \in \mathcal{M}_\ell$.

Over the last few decades there has been a great deal of research on the estimation of the order of a Markov chain, starting with Bartlett [4], Hoel [13], Good [12], Anderson and Goodman [3] and Billingsley [5], among others, dealing with hypothesis tests; more recently, using information criteria, Tong [19], Schwarz [17], Katz [14], Csiszár and Shields [6], Zhao et al. [20] and Dorea [11] have contributed new Markov chain order estimators.

Akaike's entropic information criterion [1], known as AIC, has had a fundamental impact on statistical model evaluation problems. The AIC has been applied by Tong, for example, to the problem of estimating the order of autoregressive processes, autoregressive integrated moving average processes, and Markov chains. The Akaike-Tong (AIC) estimator was derived as an asymptotic estimate of the Kullback-Leibler information discrepancy and provides a useful tool for evaluating models estimated by the maximum likelihood method. Later on, Katz [14] derived the asymptotic distribution of the estimator and showed its inconsistency, proving that there is a positive probability of overestimating the true order no matter how large the sample size. Nevertheless, AIC is the most used Markov chain order estimator at the present time, mainly because it is more efficient than BIC for small samples.

The main consistent alternative, the BIC estimator, does not perform well for relatively small samples, as pointed out by Katz [14] and Csiszár and Shields [6]. It is natural to admit that growth in the complexity of the Markov chain (size of the state space and order) has a significant influence on the sample size required for identification of the unknown order; even so, most of the time it is difficult to obtain sufficiently large samples in practical settings. In this sense, looking for a better strongly consistent alternative, Zhao et al. [20] and Dorea [11] established the EDC, properly adjusting the penalty term used in AIC and BIC.

All the estimators mentioned above are based on the penalized log-likelihood method and, because of that, have their common roots in the likelihood ratio for hypothesis tests. In these notes we use a different entropic object, the $\chi^2$-divergence, and study its behaviour when applied to samples from random variables with multinomial empirical distributions derived from a Markov chain sample. Finally, we propose a new strongly consistent Markov chain order estimator, more efficacious than the already established AIC, BIC and EDC, as shall be exhibited through the outcomes of several numerical simulations.

This paper is organized as follows. Section 2 presents the $f$-divergences and a first order Markov chain derived from $X$, which is useful for extending the already known asymptotic results to orders larger than one. Section 3 provides the proposed order estimator, namely GDL, and proves its strong consistency. Finally, Section 4 provides numerical simulations, where one can observe the better performance of GDL compared to AIC, BIC and EDC.

    2. Auxiliary Results

2.1. Entropy and f-divergences

Basically, an $f$-divergence is a function that measures the discrepancy between two probability distributions $P$ and $Q$. The divergence is intuitively an average of the function $f$ of the odds ratio given by $P$ and $Q$. These divergences were introduced and studied independently by Csiszár and Shields [7, 8] and Ali and Silvey [2], among others. Sometimes these divergences are referred to as Ali-Silvey distances.

Definition 2.1. Let $P$ and $Q$ be discrete probability densities with support $E = \{1, \dots, m\}$. For a convex function $f(t)$ defined for $t > 0$, with $f(1) = 0$, the $f$-divergence for the distributions $P$ and $Q$ is

$$D_f(P \| Q) = \sum_{a \in E} Q(a) \, f\!\left(\frac{P(a)}{Q(a)}\right).$$

Here we take $0 f(\tfrac{0}{0}) = 0$, $f(0) = \lim_{t \to 0} f(t)$ and $0 f(\tfrac{a}{0}) = \lim_{t \to 0} t f(\tfrac{a}{t}) = a \lim_{u \to \infty} \tfrac{f(u)}{u}$.

For example, taking $f(t) = t \log(t)$ or $f(t) = (t-1)^2$ we have:

$$f(t) = t \log(t) \implies D_f(P \| Q) = \sum_{a \in E} P(a) \log\frac{P(a)}{Q(a)},$$

$$f(t) = (t-1)^2 \implies D_f(P \| Q) = \sum_{a \in E} \frac{(P(a) - Q(a))^2}{Q(a)},$$

which are called relative entropy and $\chi^2$-divergence, respectively. From now on, the $\chi^2$-divergence shall be denoted by $D_{\chi^2}(P \| Q)$.

Observe that the triangular inequality is not satisfied in general, so that $D_{\chi^2}(P \| Q)$ is not a distance in the strict sense.

The $\chi^2$-divergence $D_{\chi^2}(P \| Q)$ underlies a well-known statistical test procedure closely related to the chi-square distribution [15].
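As a concrete illustration of Definition 2.1, the two divergences above can be computed directly from their sums. This is a minimal sketch; the function names are ours, not the paper's.

```python
import math

def kl_divergence(p, q):
    """Relative entropy, f(t) = t*log(t): sum_a P(a) * log(P(a)/Q(a))."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def chi2_divergence(p, q):
    """Chi-square divergence, f(t) = (t-1)^2: sum_a (P(a)-Q(a))^2 / Q(a)."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q) if qa > 0)

p = [0.2, 0.3, 0.5]
q = [1 / 3, 1 / 3, 1 / 3]
print(chi2_divergence(p, q))   # close to 0.14 for these densities
print(chi2_divergence(p, p))   # 0.0, consistent with f(1) = 0
```

Note that, as the text remarks, `chi2_divergence(p, q)` and `chi2_divergence(q, p)` generally differ, so this is a divergence rather than a metric.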

    2.2. Derived Markov Chains

Let $X_1^n = (X_1, \dots, X_n)$ be a sample from a Markov chain $X = \{X_i\}$ of unknown order $\ell$, as already defined. Assume that, for all $x_1^{\ell+1} \in E^{\ell+1}$,

$$p(x_{\ell+1} \mid x_1^{\ell}) = P(X_{n+1} = x_{\ell+1} \mid X_{n-\ell+1}^{n} = x_1^{\ell}) > 0. \quad (2)$$

Following Doob [10], from the process $X$ we can derive a first order Markov chain $Y^{(\ell)} = \{Y^{(\ell)}_n\}$ by setting $Y^{(\ell)}_n = (X_n, \dots, X_{n+\ell-1})$, so that, for $v = (i_1, \dots, i_\ell)$ and $w = (i'_1, \dots, i'_\ell)$,

$$P(Y^{(\ell)}_{n+1} = w \mid Y^{(\ell)}_n = v) = \begin{cases} p(i'_\ell \mid i_1 \dots i_\ell), & \text{if } i'_j = i_{j+1},\ j = 1, \dots, \ell - 1, \\ 0, & \text{otherwise.} \end{cases}$$

Clearly $Y^{(\ell)}$ is a first order and time homogeneous Markov chain, which from now on shall be called the derived process, and which by (2) is irreducible and positive recurrent, having a unique stationary distribution, say $\pi$. It is well known [10] that the derived Markov chain $Y^{(\ell)}$ is irreducible and aperiodic, consequently ergodic. Thus, there exists an equilibrium distribution $\pi(\cdot)$ satisfying, for any initial distribution on $E^{\ell}$,

$$\lim_{n \to \infty} \left| P(Y^{(\ell)}_n = x_1^{\ell}) - \pi(x_1^{\ell}) \right| = 0,$$

and


$$\pi(x_1^{\ell}) = \sum_{z_1^{\ell}} \pi(z_1^{\ell}) \, p(x_\ell \mid z_1^{\ell}) = \sum_{x_0} \pi(x_0 \, x_1^{\ell-1}) \, p(x_\ell \mid x_0 \, x_1^{\ell-1}).$$

Likewise, we can define, for $l > \ell$, $Y^{(l)}$ and verify that

$$\pi_l(x_1^{l}) = \pi(x_1^{\ell}) \, p(x_{\ell+1} \mid x_1^{\ell}) \cdots p(x_l \mid x_{l-\ell}^{l-1}) = \sum_{x_0} \pi_l(x_0 \, x_1^{l-1}) \, p(x_l \mid x_0 \, x_1^{l-1}), \quad (3)$$

which shows that $\pi_l$ defined above is a stationary distribution for $Y^{(l)}$. For the sake of notational simplicity we will use, from now on,

$$\pi(a_1^{l}) = \pi_l(a_1^{l}), \quad l \ge \ell, \quad (4)$$

$$\pi(a_1^{l}) = \sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} \, a_1^{l}), \quad l < \ell, \quad (5)$$

and

$$p(j \mid a_1^{l}) = \frac{\sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} \, a_1^{l}) \, p(j \mid b_1^{\ell-l} \, a_1^{l})}{\sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} \, a_1^{l})}, \quad l < \ell. \quad (6)$$

Now, let us return to $X_1^n = (X_1, X_2, \dots, X_n)$ and define

$$N(a_1^{k}) = \sum_{j=1}^{n-k+1} \mathbf{1}(X_j = a_1, \dots, X_{j+k-1} = a_k), \quad (7)$$

that is, the number of occurrences of $a_1^{k}$ in $X_1^n$. If $k = 0$, we take $N(\,\cdot\,) = n$. From now on, the sums related with $N(a_1^{k})$ are taken over positive terms, or else we adopt the convention that $0/0$ and $0 \cdot \infty$ are $0$.

The main interest in defining the derived process is the possibility of using the well-established asymptotic results for first order Markov chains. Lemma 2.1 below is a version of the Law of the Iterated Logarithm, used by Dorea [11] to conclude Lemma 2.2, which will be used in the establishment of subsequent results. The Strong Law of Large Numbers (SLLN) is needed too and can be found in [9].
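The count $N(a_1^{k})$ of (7) can be sketched directly as a sliding-window count over the sample; the helper name `count_word` is ours.

```python
def count_word(sample, word):
    """N(a_1^k): occurrences of the tuple `word` as a contiguous block of sample."""
    k = len(word)
    if k == 0:
        return len(sample)           # convention N(.) = n for k = 0
    word = tuple(word)
    return sum(1 for j in range(len(sample) - k + 1)
               if tuple(sample[j:j + k]) == word)

x = [1, 2, 1, 2, 1, 1, 2]
print(count_word(x, (1, 2)))   # 3: occurrences start at positions 0, 2 and 5
print(count_word(x, ()))       # 7: the sample size n
```

Occurrences are allowed to overlap, which matches the indicator sum in (7).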


Lemma 2.1. (Meyn and Tweedie (1993).) Let $X = \{X_n\}_{n \ge 0}$ be an ergodic Markov chain with finite state space $E$ and stationary distribution $\pi$, $g: E \to \mathbb{R}$,

$$S_n(g) = \sum_{j=1}^{n} g(X_j)$$

and

$$\sigma_g^2 = E_\pi\!\left( g^2(X_1) \right) + 2 \sum_{j=2}^{\infty} E_\pi\!\left( g(X_1) g(X_j) \right).$$

(a) If $\sigma_g^2 = 0$, then

$$\lim_{n \to \infty} \frac{1}{\sqrt{n}} \left[ S_n(g) - E_\mu(S_n(g)) \right] = 0 \quad \text{a.s.}$$

(b) If $\sigma_g^2 > 0$, then

$$\limsup_{n \to \infty} \frac{S_n(g) - E_\mu(S_n(g))}{\sqrt{2 \sigma_g^2 \, n \log(\log(n))}} = 1 \quad \text{a.s.}$$

and

$$\liminf_{n \to \infty} \frac{S_n(g) - E_\mu(S_n(g))}{\sqrt{2 \sigma_g^2 \, n \log(\log(n))}} = -1 \quad \text{a.s.},$$

where we consider $E_\mu$ the expectation with initial distribution $\mu$, and a.s. is the abbreviation of almost surely.

Lemma 2.2. (Dorea (2008).) If $Y^{(\ell)}$ is an ergodic Markov chain with finite state space $E^{\ell}$, initial distribution $\mu$, $\ell \ge 1$ and $i a_1^{\ell} j \in E^{\ell+2}$, then

$$\limsup_{n \to \infty} \frac{\left( N(i a_1^{\ell} j) - N(i a_1^{\ell}) \, p(j \mid i a_1^{\ell}) \right)^2}{2 n \log(\log(n))} = 2 \pi(i a_1^{\ell} j) \left( 1 - p(j \mid i a_1^{\ell}) \right) \quad \text{a.s.}$$

    3. Main Results

Basically, our approach consists in defining, for each sequence $a_1^{l} \in E^{l}$ and $i, j \in E$, two densities $P_{a_1^{l}}(i, j)$ and $Q_{a_1^{l}}(i, j)$. Comparing them using the $\chi^2$-divergence, we capture relevant information on the dependency related to $i a_1^{l} j$. In the sequel, we take a sum over all possible $i, j \in E$ and achieve an object holding local information on the dependency order for $a_1^{l}$. Finally, summing over all $a_1^{l}$, rescaling properly and making some adjustments, we define the GDL Markov chain order estimator.


Definition 3.1. Assuming $N(a_1^{l})$ as defined at (7), consider

$$P_{a_1^{l}}(i, j) = \frac{N(i a_1^{l} j)}{\sum_{i,j} N(i a_1^{l} j)},$$

$$Q_{a_1^{l}}(i, j) = \frac{N(i a_1^{l}) \, N(a_1^{l} j)}{\sum_{i} N(i a_1^{l}) \, \sum_{j} N(a_1^{l} j)}$$

and

$$\chi^2_{a_1^{l}}(P \| Q) := n \, D_{\chi^2}\!\left( P_{a_1^{l}} \| Q_{a_1^{l}} \right) = n \sum_{i,j \in E} \frac{\left( P_{a_1^{l}}(i, j) - Q_{a_1^{l}}(i, j) \right)^2}{Q_{a_1^{l}}(i, j)}. \quad (8)$$

Using the SLLN and assuming $l \ge \ell$, we conclude

$$\lim_{n \to \infty} P_{a_1^{l}}(i, j) = \lim_{n \to \infty} \frac{N(i a_1^{l} j)}{\sum_{i,j} N(i a_1^{l} j)} = \lim_{n \to \infty} \frac{n}{\sum_{i,j} N(i a_1^{l} j)} \, \frac{N(i a_1^{l})}{n} \, \frac{N(i a_1^{l} j)}{N(i a_1^{l})} = \frac{\pi(i a_1^{l})}{\pi(a_1^{l})} \, p(j \mid i a_1^{l}) \quad \text{a.s.} \quad (9)$$

and, analogously,

$$\lim_{n \to \infty} Q_{a_1^{l}}(i, j) = \lim_{n \to \infty} \frac{N(i a_1^{l}) \, N(a_1^{l} j)}{\sum_{i} N(i a_1^{l}) \, \sum_{j} N(a_1^{l} j)} = \frac{\pi(i a_1^{l})}{\pi(a_1^{l})} \, p(j \mid a_1^{l}) \quad \text{a.s.} \quad (10)$$

In the same manner, but using the notation defined at (5), we conclude for $l < \ell$ that

$$\lim_{n \to \infty} P_{a_1^{l}}(i, j) = \frac{\pi(i a_1^{l})}{\pi(a_1^{l})} \, p(j \mid i a_1^{l}) \quad \text{a.s.} \quad (11)$$


and

$$\lim_{n \to \infty} Q_{a_1^{l}}(i, j) = \frac{\pi(i a_1^{l})}{\pi(a_1^{l})} \, p(j \mid a_1^{l}) \quad \text{a.s.} \quad (12)$$

At (9) and (10) we used the easily computed equivalence¹

$$\sum_{i} N(i a_1^{l}) = \sum_{j} N(a_1^{l} j) + O(1) = \sum_{i,j} N(i a_1^{l} j) + O(1) = N(a_1^{l}) + O(1). \quad (13)$$

Theorem 3.1. For $\chi^2_{a_1^{l}}(P \| Q)$ as defined at (8):

(a) if $l \ge \ell$, there exists $L \in \mathbb{R}$ such that

$$P\left( \limsup_{n \to \infty} \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log\log(n)} \le L \right) = 1; \quad (14)$$

(b) if $l = \ell - 1$, there exist $a_1^{l}$ and $i, j, k \in E$ with $k \ne i$ such that $p(j \mid i a_1^{l}) \ne p(j \mid k a_1^{l})$; for these,

$$P\left( \limsup_{n \to \infty} \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log\log(n)} = \infty \right) = 1.$$

Proof.

(a) Replacing $P_{a_1^{l}}(i, j)$ and $Q_{a_1^{l}}(i, j)$, and using (12) and (13), we have

$$\limsup_{n} \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \limsup_{n} \frac{n \left( P_{a_1^{l}}(i, j) - Q_{a_1^{l}}(i, j) \right)^2}{2 \log\log(n) \, Q_{a_1^{l}}(i, j)}$$

$$= \sum_{i,j \in E} \frac{\pi(a_1^{l})}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{n^2 \left( \dfrac{N(i a_1^{l} j)}{\sum_{i,j} N(i a_1^{l} j)} - \dfrac{N(i a_1^{l}) \, N(a_1^{l} j)}{\sum_{i} N(i a_1^{l}) \, \sum_{j} N(a_1^{l} j)} \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{\pi(a_1^{l})}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{n^2 \left( \dfrac{N(i a_1^{l} j)}{N(a_1^{l}) + O(1)} - \dfrac{N(i a_1^{l})}{N(a_1^{l}) + O(1)} \, \dfrac{N(a_1^{l} j)}{N(a_1^{l}) + O(1)} \right)^2}{n \log\log(n)}. \quad (15)$$

¹ Here we used the $O$ notation: $g(n) = O(f(n))$ means that $\lim_{n} g(n)/f(n) = \text{constant} > 0$.


By the SLLN,

$$\frac{n}{N(a_1^{l}) + O(1)} \xrightarrow{\ \text{a.s.}\ } \frac{1}{\pi(a_1^{l})} \qquad \text{and} \qquad \frac{N(a_1^{l} j)}{N(a_1^{l}) + O(1)} \xrightarrow{\ \text{a.s.}\ } p(j \mid a_1^{l}).$$

Applying these at (15), and using Lemma 2.2, we have

$$\limsup_{n} \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \frac{\pi(a_1^{l})}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l}) \, \pi(a_1^{l})} \, \limsup_{n} \frac{\left( N(i a_1^{l} j) - N(i a_1^{l}) \, p(j \mid a_1^{l}) \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{\left( N(i a_1^{l} j) - N(i a_1^{l}) \, p(j \mid a_1^{l}) \right)^2}{n \log\log(n)} \quad (16)$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{\left( N(i a_1^{l} j) - N(i a_1^{l}) \, p(j \mid i a_1^{l}) \right)^2}{n \log\log(n)}$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \; 2 \, \pi(i a_1^{l} j) \left( 1 - p(j \mid i a_1^{l}) \right) \ < \ \infty \quad \text{a.s.}$$

At the third equality we used that $p(j \mid a_1^{l}) = p(j \mid i a_1^{l})$, which is a consequence of $l \ge \ell$. Now, taking $L$ sufficiently large, we conclude (14).

(b) $l = \ell - 1$. Continuing from (16) and considering the notation of (6) and (12),

$$\limsup_{n} \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{\left( N(i a_1^{l} j) - N(i a_1^{l}) \, p(j \mid a_1^{l}) \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{n^2 \left( \dfrac{N(i a_1^{l} j)}{n} - \dfrac{N(i a_1^{l})}{n} \, p(j \mid a_1^{l}) \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{n \left( \pi(i a_1^{l}) \, p(j \mid i a_1^{l}) - \pi(i a_1^{l}) \, p(j \mid a_1^{l}) \right)^2}{\log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \, \pi(i a_1^{l}) \, p(j \mid a_1^{l})} \, \limsup_{n} \frac{n \left[ \pi(i a_1^{l}) \left( p(j \mid i a_1^{l}) - p(j \mid a_1^{l}) \right) \right]^2}{\log\log(n)} = \infty \quad \text{a.s.} \quad (17)$$

In the last equation we used the hypothesis that there exist $i, j, k \in E$ with $p(j \mid i a_1^{l}) \ne p(j \mid k a_1^{l})$, so that $p(j \mid i a_1^{l}) = p(j \mid a_1^{l})$ cannot be true for all $i, j \in E$.


Herein we define the Local Dependency Level (LDL) and the Global Dependency Level (GDL).

Definition 3.2. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of a Markov chain $X$ of order $\ell \ge 0$. Assume $l \ge 0$, $P$, $Q$ and $\chi^2_{a_1^{l}}(P \| Q)$ as previously defined. Also, consider $V$ a $\chi^2$ random variable with $(m-1)^2$ degrees of freedom and $\bar{P}: \mathbb{R}^+ \to [0, 1]$ the continuous, strictly decreasing function defined by

$$\bar{P}(x) = P(V \ge x), \quad x \in \mathbb{R}^+.$$

(a) The Local Dependency Level $LDL_n(a_1^{l})$ for $a_1^{l}$ is

$$LDL_n(a_1^{l}) = \frac{\chi^2_{a_1^{l}}(P \| Q)}{2 \log(\log(n))}.$$

(b) The Global Dependency Level $GDL_n(l)$ is

$$GDL_n(l) = \bar{P}\left( \sum_{a_1^{l} \in E^{l}} \frac{N(a_1^{l})}{n} \, LDL_n(a_1^{l}) \right).$$

The LDL provides a measure of dependency for a specific $a_1^{l}$, which could be analysed separately. In the GDL we rescale an average of LDLs to fit a proper variability.

Observe that, if the true order is $\ell$, then for all $l \ge \ell$ and $a_1^{l} \in E^{l}$,

$$P\left( \liminf_{n \to \infty} GDL_n(l) \ge \bar{P}(L) \right) = 1 \quad (18)$$

and, for $l = \ell - 1$,

$$P\left( \lim_{n \to \infty} GDL_n(l) = \bar{P}(\infty) = 0 \right) = 1. \quad (19)$$

Consequently, for a Markov chain $X$ of order $\ell$,

$$\ell = 0 \quad \text{if} \quad \lim_{n \to \infty} GDL_n(l) \ge \bar{P}(L) > 0 \ \text{ for } \ l = 0, 1, \dots, B,$$

and otherwise

$$\ell = \max_{0 \le l \le B} \left\{ l : \lim_{n \to \infty} GDL_n(l) = 0 \right\} + 1.$$

Finally, let us define the Markov chain order estimator based on the information contained in the vector $GDL_n = \left( GDL_n(0), \dots, GDL_n(B) \right)$.


Definition 3.3. Given a fixed number $B \in \mathbb{N}$, let us define the set $S = \{0, 1\}^{B+1}$ and the map $T: S \to \mathbb{N} \cup \{-1\}$ defined by

$$T(s_0, \dots, s_B) = \begin{cases} -1, & \text{if } s_i = 1 \text{ for } i = 0, \dots, B, \\ \max_{0 \le i \le B} \{ i : s_i = 0 \}, & \text{otherwise.} \end{cases}$$

Definition 3.4. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of the Markov chain $X$ of order $\ell$, $0 \le \ell \le B \in \mathbb{N}$ and $\{GDL_n(i)\}_{i=0}^{B}$ as above. We define the order estimator $\hat{\ell}_{GDL}(X^n)$ as

$$\hat{\ell}_{GDL}(X^n) = T(\hat{s}_n) + 1$$

with $\hat{s}_n \in S$ defined by

$$\hat{s}_n = \arg\min_{s \in S} \sum_{i=0}^{B} \left( GDL_n(i) - s(i) \right)^2,$$

where $s(i)$ is the projection of the $i$-th coordinate.
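Since the squared error in Definition 3.4 decouples across coordinates, the minimizing $\hat{s}_n$ can be found by rounding each $GDL_n(i)$ to the nearest element of $\{0, 1\}$. The sketch below does exactly that (resolving the tie at $0.5$ upward, a choice of ours); function names are ours.

```python
def T(s):
    """T(s) = -1 if every s_i = 1, else the largest i with s_i = 0."""
    zeros = [i for i, si in enumerate(s) if si == 0]
    return max(zeros) if zeros else -1

def gdl_order_estimate(gdl_values):
    """Round each GDL_n(i) to the nearest of {0, 1}, then return T(s) + 1."""
    s = [1 if g >= 0.5 else 0 for g in gdl_values]
    return T(s) + 1

# hypothetical GDL vector for B = 4: the dependency signal vanishes from lag 2 on
print(gdl_order_estimate([0.0, 0.02, 0.97, 0.99, 1.0]))  # 2
print(gdl_order_estimate([0.9, 1.0, 1.0]))               # 0: all coordinates round to 1
```

The second call shows the role of the $-1$ branch of $T$: when every $GDL_n(i)$ is near one, no dependency is detected at any lag and the estimated order is $0$.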

By (18) and (19) it is clear that the order estimator converges almost surely to the true value, i.e.,

$$P\left( \lim_{n \to \infty} \hat{\ell}_{GDL}(X^n) = \ell \right) = 1.$$

4. Numerical Simulation

In what follows we shall compare the non-asymptotic performance, mainly for small samples, of some of the most used Markov chain order estimators. Consider the notation $N(a_1^{k})$, as defined in (7), and denote

$$L(k) = \prod_{a_1^{k+1}} \left( \frac{N(a_1^{k+1})}{N(a_1^{k})} \right)^{N(a_1^{k+1})}.$$

The estimators of the Markov chain order are defined under the hypothesis: there exists a known $B$ so that $0 \le \ell \le B$.


The best known order estimators are

$$\hat{\ell}_{AIC} = \arg\min \{ AIC(k) \,;\ k = 0, 1, \dots, B \},$$
$$\hat{\ell}_{BIC} = \arg\min \{ BIC(k) \,;\ k = 0, 1, \dots, B \},$$
$$\hat{\ell}_{EDC} = \arg\min \{ EDC(k) \,;\ k = 0, 1, \dots, B \},$$

where

$$AIC(k) = -2 \log L(k) + 2 |E|^{k} (|E| - 1),$$
$$BIC(k) = -2 \log L(k) + |E|^{k} (|E| - 1) \log(n),$$
$$EDC(k) = -2 \log L(k) + 2 |E|^{k+1} \log\log(n).$$

By a simple observation, for large enough $n$, we verify that $AIC(k) \le EDC(k) \le BIC(k)$.
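The three penalized criteria above can be sketched directly from the counts $N(\cdot)$ and the likelihood $L(k)$; the counting helper is re-declared inline so the block is self-contained, and all function names are ours.

```python
import math
from itertools import product

def count_word(sample, word):
    k = len(word)
    if k == 0:
        return len(sample)
    return sum(1 for j in range(len(sample) - k + 1)
               if tuple(sample[j:j + k]) == tuple(word))

def log_likelihood(sample, k, states):
    """log L(k) = sum over words a_1^{k+1} of N(a_1^{k+1}) * log(N(a_1^{k+1})/N(a_1^k))."""
    ll = 0.0
    for word in product(states, repeat=k + 1):
        N_full = count_word(sample, word)
        if N_full > 0:
            ll += N_full * math.log(N_full / count_word(sample, word[:-1]))
    return ll

def aic(sample, k, states):
    m = len(states)
    return -2 * log_likelihood(sample, k, states) + 2 * m ** k * (m - 1)

def bic(sample, k, states):
    m, n = len(states), len(sample)
    return -2 * log_likelihood(sample, k, states) + m ** k * (m - 1) * math.log(n)

def edc(sample, k, states):
    m, n = len(states), len(sample)
    return -2 * log_likelihood(sample, k, states) + 2 * m ** (k + 1) * math.log(math.log(n))

def estimate_order(criterion, sample, states, B):
    return min(range(B + 1), key=lambda k: criterion(sample, k, states))

# a deterministic alternating chain has order 1, which all three criteria recover
print(estimate_order(aic, [1, 2] * 200, (1, 2), B=2))  # 1
```

The boundary terms of the counts differ by $O(1)$ from the exact conditional likelihood, which is immaterial for a sketch and vanishes asymptotically.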

Clearly, for a given $l$, the order estimator $GDL_n(l)$, as well as $AIC(l)$, $BIC(l)$ and $EDC(l)$, contains much of the information concerning the sample's relative dependency. Nevertheless, numerical simulations as well as theoretical considerations anticipate a great deal of variability for small samples.

The following numerical simulation, based on an algorithm due to Raftery [16], starts with the generation of a Markov chain transition matrix $Q = (q_{i_1 i_2 \dots i_\ell; i_{\ell+1}})$ with entries

$$q_{i_1 i_2 \dots i_\ell;\, i_{\ell+1}} = \sum_{t=1}^{\ell} \lambda_t \, R(i_t, i_{\ell+1}), \quad i_1^{\ell+1} \in E^{\ell+1}, \quad (20)$$

where the matrix $R(i, j)$, $1 \le i, j \le m$, with

$$\sum_{j=1}^{m} R(i, j) = 1, \quad 1 \le i \le m,$$

and the positive numbers $\{\lambda_i\}_{i=1}^{\ell}$, with $\sum_{i=1}^{\ell} \lambda_i = 1$,


    are arbitrarily chosen in advance.
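Raftery's construction (20) and the subsequent sampling step can be sketched as follows, with an arbitrary stochastic matrix `R` and weights `lam`; all names are ours, and states are coded $0, \dots, m-1$ for convenience.

```python
import random
from itertools import product

def raftery_q(R, lam):
    """q_{i_1...i_l; j} = sum_t lam[t] * R[i_t][j]; returned as a dict keyed by context."""
    m, l = len(R), len(lam)
    q = {}
    for ctx in product(range(m), repeat=l):
        q[ctx] = [sum(lam[t] * R[ctx[t]][j] for t in range(l)) for j in range(m)]
    return q

def sample_chain(q, l, n, rng):
    """Sample n steps of the order-l chain with transition probabilities q."""
    m = len(next(iter(q.values())))
    x = [rng.randrange(m) for _ in range(l)]     # arbitrary initial context
    while len(x) < n:
        ctx = tuple(x[-l:])
        x.append(rng.choices(range(m), weights=q[ctx])[0])
    return x

R4 = [[0.05, 0.05, 0.90], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]   # the paper's R4
q = raftery_q(R4, [1 / 3, 1 / 3, 1 / 3])
x = sample_chain(q, l=3, n=2000, rng=random.Random(0))
print(len(x), len(q))   # 2000 27
```

Since each row of `R` sums to one and the weights sum to one, every row of the resulting `q` is automatically a probability vector, which is the point of the construction.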

Once the matrix $Q = (q_{i_1 i_2 \dots i_\ell; i_{\ell+1}})$ is obtained, two hundred replications of the Markov chain sample of size $n$, state space $E$ and transition matrix $Q$ are generated to compare the performance of GDL against the standard, well known and already established order estimators just mentioned above.

Katz [14] obtained the asymptotic distribution of AIC and proved its inconsistency, showing the existence of a positive probability of overestimating the order; see also [18]. Besides that, Csiszár and Shields [6] and Zhao et al. [20] proved strong consistency for the estimators BIC and EDC, respectively.

It is quite intuitive that the random information regarding the order of a Markov chain is spread over an exponentially growing set of empirical distributions of size $m^{B+1}$, where $B$ is the maximum admissible order. It seems reasonable to think that a small viable sample, i.e. a sample able to retrieve enough information to estimate the chain order, should have size $n \approx O(m^{B+1})$. Using this, we have chosen the sample sizes for each case.

Finally, after applying all estimators to each one of the replicated samples, the final results are registered in tables.

Case I: Markov chain examples with $\ell = 0$, $|E| = 3$.

Firstly, we choose the matrices $\{R_1, R_2, R_3\}$ to produce samples with sizes $500 \le n \le 2000$, originated from Markov chains of order $\ell = 0$ with quite different probability distributions, given by:

$$R_1 = \begin{pmatrix} 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \end{pmatrix}, \quad R_2 = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}, \quad R_3 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \end{pmatrix}.$$


Table 1: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {500, 1000, 1500} and distribution given by R1.

         |        n = 500          |         n = 1000         |        n = 1500
 k       | AIC    BIC   EDC   GDL  | AIC    BIC   EDC   GDL   | AIC    BIC   EDC   GDL
 0       | 75.5%  100%  100%  99%  | 80%    100%  100%  99.5% | 71.5%  100%  100%  99%
 1       | 24.5%              1%   | 18%                0.5%  | 22.5%              1%
 2       |                         | 2%                       | 6%
 3       |                         |                          |
 4       |                         |                          |

Table 2: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {1000, 1500, 2000} and distribution given by R2.

         |        n = 1000         |        n = 1500          |        n = 2000
 k       | AIC    BIC   EDC   GDL  | AIC    BIC   EDC   GDL   | AIC    BIC   EDC   GDL
 0       | 63.5%  100%  100%  99%  | 63%    100%  100%  99%   | 59%    100%  100%  99%
 1       | 29%                1%   | 34.5%              1%    | 37%                1%
 2       | 7.5%                    | 2.5%                     | 4%
 3       |                         |                          |
 4       |                         |                          |

Table 3: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {1000, 1500, 2000} and distribution given by R3.

         |        n = 1000         |         n = 1500          |        n = 2000
 k       | AIC    BIC   EDC   GDL  | AIC    BIC   EDC    GDL   | AIC    BIC   EDC   GDL
 0       | 43%    100%  100%  98%  | 47%    100%  99.5%  96%   | 46%    100%  100%  97%
 1       | 53%                2%   | 51.5%        0.5%   4%    | 50.5%              2%
 2       | 4%                      | 1.5%                      | 3.5%               1%
 3       |                         |                           |
 4       |                         |                           |

Notice that, for each fixed sample size $n \in \{500, 1000, 1500, 2000\}$, the order estimator AIC steadily overestimates the real order $\ell = 0$, with the excess depending on the probability distribution of the Markov chain. Differently, the order estimators


BIC, EDC and GDL show consistent performance, mostly obtaining the right order, free from the influence of the sample size and the generating matrix. The apparent efficiency of BIC and EDC in this case ($\ell = 0$) is a consequence of the strong tendency of these estimators to underestimate the order.

Case II: Markov chain examples with $\ell = 3$, $|E| = 3$ and $\ell \in \{2, 3, 0\}$, $|E| = 4$.

Secondly, we choose the matrices $\{R_4, R_5\}$ to produce samples with sizes $n \in \{500, 1000, 1500, 2000\}$, originated from Markov chains with $|E| = 3$ of order $\ell = 3$:

$$R_4 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.90 & 0.05 \\ 0.90 & 0.05 & 0.05 \end{pmatrix}, \quad R_5 = \begin{pmatrix} 0.475 & 0.475 & 0.05 \\ 0.475 & 0.05 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}.$$

Table 4: Rates of fitness for case |E| = 3, ℓ = 3, n ∈ {1000, 1500, 2000}, distribution given by R4 and λi = 1/3, i = 1, 2, 3.

         |         n = 1000          |         n = 1500          |         n = 2000
 k       | AIC   BIC    EDC    GDL   | AIC   BIC    EDC    GDL   | AIC   BIC   EDC    GDL
 0       |                           |                           |
 1       |                           |                           |
 2       |       99.5%  88.5%  41%   |       76.5%  16.5%  5%    |       17%   0.5%   1%
 3       | 100%  0.5%   11.5%  59%   | 100%  23.5%  83.5%  95%   | 100%  83%   99.5%  99%
 4       |                           |                           |

Table 5: Rates of fitness for case |E| = 3, ℓ = 3, n ∈ {1000, 1500, 2500}, distribution given by R5 and λi = 1/3, i = 1, 2, 3.

         |         n = 1000           |         n = 1500          |         n = 2500
 k       | AIC    BIC    EDC    GDL   | AIC   BIC    EDC    GDL   | AIC   BIC    EDC    GDL
 0       |        0.5%                |                           |
 1       |        92.5%  69.5%  6.5%  |       54.5%  19.5%  1%    |
 2       | 16.5%  7%     30.5%  92%   | 2%    45.5%  80.5%  80.5% |       100%   98.5%  8.5%
 3       | 83.5%                1.5%  | 98%                 18.5% | 100%         1.5%   91.5%
 4       |                            |                           |


For $|E| = 3$, $\ell = 3$, the estimator AIC overestimates the order to a lesser extent than in the previous case, while BIC and EDC, overweighted by their respective penalty terms, underestimate the order more than they were supposed to. Concerning GDL, it rapidly converges to the right order as the sample size $n$ grows.

For $|E| = 4$, the greater complexity of a Markov chain of order $\ell = 3$ imposes the use of a larger sample size to accomplish some reliability. Finally, we choose the matrices $\{R_6, R_7\}$ to produce samples of size $n = 5000$, originated from Markov chains of order $\ell \in \{2, 3, 0\}$, as in the previous cases:

$$R_6 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.85 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.85 & 0.05 & 0.05 & 0.05 \end{pmatrix}, \quad R_7 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \end{pmatrix}.$$

Table 6: Rates of fitness for case |E| = 4, ℓ ∈ {2, 3, 0}, n = 5000 and distributions given by R6, R7, with λi = 1/ℓ, i = 1, ..., ℓ if ℓ > 0.

         | R6, λi = 1/2 and ℓ = 2   | R6, λi = 1/3 and ℓ = 3   | R7 and ℓ = 0
 k       | AIC   BIC   EDC   GDL    | AIC   BIC   EDC   GDL    | AIC   BIC   EDC   GDL
 0       |                          |                          | 85%   100%  100%  100%
 1       |                          |                          | 15%
 2       | 100%  100%  100%  100%   |       99%         4%     |
 3       |                          | 100%  1%    100%  96%    |
 4       |                          |                          |
 5       |                          |                          |
 6       |                          |                          |

For $|E| = 4$, AIC apparently keeps overestimating the order to some degree in the case $\ell = 0$, while BIC, as in the $\ell = 3$ example, severely underestimates the order, presumably due to its excessively weighted penalty term. On the contrary, EDC and GDL behave quite well in the same settings.


    Conclusion

The pioneering research started with the contributions of Bartlett [4], Hoel [13], Good [12], Anderson and Goodman [3] and Billingsley [5], among others, who developed hypothesis tests for the estimation of the order of a given Markov chain. Later on, these procedures were adapted and improved with the use of penalty functions [19, 14], together with other tools created in the realm of model selection [1, 17]. Since then, there have been a considerable number of subsequent contributions on this subject, several of them consisting in the enhancement of the already existing techniques [6, 20].

In these notes we propose a new Markov chain order estimator based on a different idea, which makes it behave in a quite different form. This estimator is strongly consistent and more efficient than AIC (which is inconsistent), outperforming the well established and consistent BIC and EDC, mainly on relatively small samples.

    References

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.

[2] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological) 28, 131-142.

[3] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. The Annals of Mathematical Statistics 28, 89-110.

[4] Bartlett, M. S. (1951). The frequency goodness of fit test for probability chains. Proceedings of the Cambridge Philosophical Society.

[5] Billingsley, P. (1961). Statistical methods in Markov chains. The Annals of Mathematical Statistics 32, 12-40.

[6] Csiszár, I. and Shields, P. C. (2000). The consistency of the BIC Markov order estimator. The Annals of Statistics 28, 1601-1619.


[7] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299-318.

[8] Csiszár, I. and Shields, P. C. (2004). Information Theory and Statistics: A Tutorial. Now Publishers Inc.

[9] Dacunha-Castelle, D., Duflo, M. and McHale, D. (1986). Probability and Statistics, vol. II. Springer.

[10] Doob, J. L. (1966). Stochastic Processes (Wiley Publications in Statistics). John Wiley & Sons Inc.

[11] Dorea, C. C. Y. (2008). Optimal penalty term for EDC Markov chain order estimator. Annales de l'Institut de Statistique de l'Université de Paris (l'ISUP) 52, 15-26.

[12] Good, I. J. (1955). The likelihood ratio test for Markoff chains. Biometrika 42, 531-533.

[13] Hoel, P. G. (1954). A test for Markoff chains. Biometrika 41, 430-433.

[14] Katz, R. W. (1981). On some criteria for estimating the order of a Markov chain. Technometrics 23, 243-249.

[15] Pardo, L. (2005). Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC.

[16] Raftery, A. E. (1985). A model for high-order Markov chains. Journal of the Royal Statistical Society, Series B.

[17] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.

[18] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.

[19] Tong, H. (1975). Determination of the order of a Markov chain by Akaike's information criterion. Journal of Applied Probability 12, 488-497.


[20] Zhao, L. C., Dorea, C. C. Y. and Gonçalves, C. R. (2001). On determination of the order of a Markov chain. Statistical Inference for Stochastic Processes 4, 273-282.