
  • 7/27/2019 10.1.1.12.1862

    1/4

    ONLINE EXPECTATION-MAXIMIZATION TYPE ALGORITHMS FOR PARAMETER

    ESTIMATION IN GENERAL STATE SPACE MODELS

    Christophe Andrieu

    - Arnaud Doucet

    Department of Mathematics, Bristol University,

    Bristol, BS8 1TW, UK.

    Department of Engineering, Cambridge University

    Cambridge, CB2 1PZ, UK.

    Email: [email protected] - [email protected]

    ABSTRACT

    In this paper we present new online algorithms to estimate static

parameters in nonlinear non-Gaussian state space models. These algorithms rely on online Expectation-Maximization (EM) type algorithms. Contrary to standard Sequential Monte Carlo (SMC) methods recently proposed in the literature, these algorithms do not degenerate over time.

    1 Introduction

    1.1 State-Space Models and Problem Statement

Let $\{X_n\}_{n \geq 0}$ and $\{Y_n\}_{n \geq 1}$ be $\mathbb{R}^p$- and $\mathbb{R}^q$-valued stochastic processes defined on a measurable space $(\Omega, \mathcal{F})$, and let $\theta \in \Theta$ where $\Theta$ is an open subset of $\mathbb{R}^k$. The process $\{X_n\}_{n \geq 0}$ is an unobserved (hidden) Markov process of initial density $\mu_\theta$; i.e. $X_0 \sim \mu_\theta$, and Markov transition density $f_\theta(x, x')$; i.e.

$$X_{n+1} \mid X_n = x \sim f_\theta(x, \cdot). \quad (1)$$

One observes the process $\{Y_n\}_{n \geq 1}$. It is assumed that the observations are conditionally independent given $\{X_n\}_{n \geq 0}$, with marginal density $g_\theta(x, y)$; i.e.

$$Y_n \mid X_n = x \sim g_\theta(x, \cdot). \quad (2)$$

    This class of models includes many nonlinear and non-Gaussian

    time series models such as

$$X_{n+1} = \varphi_\theta(X_n, V_{n+1}), \qquad Y_n = \psi_\theta(X_n, W_n)$$

    where {Vn}n1 and {Wn}n1 are independent sequences.
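For concreteness, the following sketch (ours, not from the paper) simulates a model of this form, instantiated as the stochastic volatility model studied in Section 4; the parameter values are illustrative only.

```python
import math
import random

def simulate_sv(n, phi=0.9, sigma=math.sqrt(0.1), beta=1.0, seed=0):
    """Simulate the stochastic volatility model of Section 4:
        X_{n+1} = phi * X_n + sigma * V_{n+1},
        Y_n     = beta * exp(X_n / 2) * W_n,
    with V_n, W_n i.i.d. N(0, 1). X_0 is drawn from the stationary
    distribution N(0, sigma^2 / (1 - phi^2))."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi * phi))
    xs, ys = [], []
    for _ in range(n):
        y = beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        x = phi * x + sigma * rng.gauss(0.0, 1.0)  # latent AR(1) transition
    return xs, ys

xs, ys = simulate_sv(1000)
```

Only the observations `ys` would be available to an estimation algorithm; the latent path `xs` is returned here purely for inspection.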

We assume here that the model depends on some unknown parameter denoted $\theta$. The true value of $\theta$ is $\theta^*$. We are interested in deriving recursive algorithms to estimate $\theta^*$ for the class of stationary state-space models, i.e. when the Markov chain $\{X_n\}_{n \geq 0}$ admits a limiting distribution. This problem has numerous applications in electrical engineering, econometrics, statistics, etc. It is extremely complex: even if $\theta^*$ were known, the simpler problem of optimal filtering, i.e. estimating the posterior distribution of $X_n$ given $(Y_1, \ldots, Y_n)$, does not usually admit a closed-form solution.

From now on, for any sequence $z_k$ / random process $Z_k$, we will write $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$.

    1.2 A Brief Literature Review

    1.2.1 Filtering methods

A standard approach followed in the literature consists of setting a prior distribution on the unknown parameter and then considering the extended state $S_n \triangleq (X_n, \theta)$. This converts the parameter estimation problem into an optimal filtering problem. One can then apply, at least theoretically, standard particle filtering techniques [5] to estimate $p(x_n, \theta \mid Y_{1:n})$ and thus $p(\theta \mid Y_{1:n})$. In this approach, the parameter space is only explored at the initialization of the algorithm. Consequently the algorithm is inefficient: after a few iterations, the marginal posterior distribution of the parameter is approximated by a single Dirac delta function. To limit this problem, several authors have proposed to use kernel density estimation methods [13]. However, this has the effect of transforming the fixed parameter into a slowly time-varying one. A pragmatic approach consists of explicitly introducing an artificial dynamic on the parameter of interest; see [10], [11]. To avoid the introduction of an artificial dynamic model, an approach proposed in [8] consists of adding Markov chain Monte Carlo (MCMC) steps so as to add diversity among the particles. However, this approach does not solve the fixed-parameter estimation problem. More precisely, the addition of MCMC steps does not make the dynamic model ergodic. Thus, there is an accumulation of errors over time and the algorithm can diverge, as observed in [1]. Similar problems arise with the recent method proposed in [16].
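The initialization-only exploration described above can be demonstrated in a few lines. The sketch below (our toy, not the paper's algorithm) resamples particles that carry a static parameter component under an arbitrary stand-in likelihood peaked at $\theta = 1$; since resampling can only duplicate existing values and never creates new ones, parameter diversity is non-increasing.

```python
import math
import random

def resample(particles, weights, rng):
    # Multinomial resampling: duplicates high-weight particles and kills
    # low-weight ones; static components are never rejuvenated.
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(1)
N = 500
# Each particle carries a static parameter value theta, drawn once at t = 0.
thetas = [rng.uniform(0.5, 1.5) for _ in range(N)]
unique_counts = []
for t in range(50):
    # Arbitrary likelihood peaked at theta = 1 (a stand-in for p(y_t | theta)).
    weights = [math.exp(-5.0 * (th - 1.0) ** 2) for th in thetas]
    thetas = resample(thetas, weights, rng)
    unique_counts.append(len(set(thetas)))
```

`unique_counts` shrinks monotonically toward a handful of surviving values: the parameter marginal collapses toward a point mass even though the true posterior does not.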

    1.2.2 Recursive maximum likelihood

Consider the log-likelihood function

$$l(Y_{1:n}) = \sum_{k=1}^{n} \log \int g_\theta(x_k, Y_k)\, p_\theta(x_k \mid Y_{1:k-1})\, dx_k, \quad (3)$$

where $p_\theta(x_k \mid Y_{1:k-1})$ is the predictive density of the state $X_k$ given the observations $Y_{1:k-1}$ and $\theta$. Under regularity assumptions, including the stationarity of the state-space model, one has

$$\frac{1}{n}\, l(Y_{1:n}) \to l(\theta), \quad (4)$$

where

$$l(\theta) \triangleq \int_{\mathbb{R}^q \times \mathcal{P}(\mathbb{R}^p)} \log\left( \int g_\theta(x, y)\, \mu(x)\, dx \right) \lambda_{\theta, \theta^*}(dy, d\mu),$$


where $\mathcal{P}(\mathbb{R}^p)$ is the space of probability distributions on $\mathbb{R}^p$ and $\lambda_{\theta, \theta^*}(dy, d\mu)$ is the joint invariant distribution of the couple $(Y_k, p_\theta(x_k \mid Y_{1:k-1}))$. It depends on both $\theta$ and the true parameter $\theta^*$. Maximizing $l(\theta)$ corresponds to minimizing the following Kullback-Leibler information measure:

$$K(\theta, \theta^*) \triangleq l(\theta^*) - l(\theta) \geq 0. \quad (5)$$

To optimize this cost function, Recursive Maximum Likelihood (RML) is based on the stochastic gradient algorithm

$$\theta_{n+1} = \theta_n + \gamma_n\, \nabla_\theta \log \int g_{\theta_n}(x_n, Y_n)\, p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})\, dx_n,$$

where $p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})$ denotes the predictive distribution computed using the parameter $\theta_k$ at time $k$; this requires computing $p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})$ and its derivatives with respect to $\theta$. This is the approach followed in [12] for finite state-space HMM and in [6] for general state-space models. In the general state-space case, the filter and its derivatives are approximated numerically. This method does not suffer from the problems mentioned above. However, it is sensitive to initialization and in practice it can be difficult to scale the components of the gradient properly.
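For intuition only, here is a minimal RML-style recursion for the simplest conceivable case, i.i.d. observations $Y_n \sim \mathcal{N}(\theta^*, 1)$, where the score $\nabla_\theta \log p_\theta(Y_n) = Y_n - \theta$ is available in closed form and no filter or filter derivatives are needed. This is our illustration of the stochastic gradient principle, not the scheme of [6] or [12].

```python
import random

rng = random.Random(2)
theta_star = 0.7   # true parameter (unknown to the algorithm)
theta = 0.0        # initial guess

for n in range(1, 20001):
    y = rng.gauss(theta_star, 1.0)
    gamma = 1.0 / n  # stepsize satisfying sum gamma = inf, sum gamma^2 < inf
    # Score of N(theta, 1): d/dtheta log p_theta(y) = y - theta.
    theta += gamma * (y - theta)
```

With $\gamma_n = 1/n$ this recursion is exactly the running mean of the observations, so `theta` converges to $\theta^* = 0.7$. In the state-space setting the score is not available in closed form, which is precisely the difficulty discussed above.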

    1.2.3 Online EM algorithm

    Another stochastic gradient type algorithm is based on an online

    version of the EM. This approach is better from a practical point

    of view as it is easy to implement and numerically well-behaved.

One can actually show that online EM corresponds to a Newton-Raphson type algorithm, the difference being that the gradient is scaled not by the information matrix but by the complete information matrix. Online EM algorithms have been proposed for finite state-space HMM and linear Gaussian state-space models; e.g. [7]. It is formally possible to come up with a similar algorithm for general state-space models. However, it requires one to compute quantities such as

$$Q_n(\theta \mid \theta') = \mathbb{E}_{\theta'}\left( \log p_\theta(x_{0:n}, Y_{1:n}) \mid Y_{1:n} \right)$$

$$= \mathbb{E}_{\theta'}\left( \log \mu_\theta(x_0) \mid Y_{1:n} \right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\left( \log f_\theta(x_{k-1}, x_k) \mid Y_{1:n} \right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\left( \log g_\theta(x_k, Y_k) \mid Y_{1:n} \right).$$

It is trivially possible to estimate $Q_n(\theta \mid \theta')$ using particle methods, as one gets an estimate of the joint posterior density $p_{\theta'}(x_{1:n} \mid Y_{1:n})$. However, as $n$ increases, only the approximation of the marginal densities $p_{\theta'}(x_{n-L:n} \mid Y_{1:n})$, for a reasonable value of $L$, say $L$ around 5, is good for a reasonable number of particles [5]. It thus seems impossible to obtain a good approximation of $Q_n(\theta \mid \theta')$ with an online algorithm as $n$ increases. It is necessary to come up with an alternative method.
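The contrast between good marginal approximations and a degenerate joint-path approximation can be illustrated with a bootstrap filter on a hypothetical linear-Gaussian toy model (ours, not from the paper): resampling whole trajectories collapses the number of distinct values at time 0, while the current-state cloud stays diverse.

```python
import math
import random

rng = random.Random(5)
N, T = 200, 60

# Toy model: X_{n+1} = 0.9 X_n + V_{n+1}, Y_n = X_n + W_n, V, W ~ N(0, 1).
ys, x_true = [], 0.0
for _ in range(T):
    ys.append(x_true + rng.gauss(0.0, 1.0))
    x_true = 0.9 * x_true + rng.gauss(0.0, 1.0)

# Each particle stores its entire trajectory x_{0:t}.
paths = [[rng.gauss(0.0, 1.0)] for _ in range(N)]
for t in range(T):
    # Gaussian observation weights, then multinomial resampling of whole paths.
    w = [math.exp(-0.5 * (ys[t] - p[-1]) ** 2) for p in paths]
    paths = [list(p) for p in rng.choices(paths, weights=w, k=N)]
    # Propagate: fresh noise keeps the *current-state* marginal diverse.
    for p in paths:
        p.append(0.9 * p[-1] + rng.gauss(0.0, 1.0))

unique_x0 = len({p[0] for p in paths})     # ancestors at time 0: collapsed
unique_last = len({p[-1] for p in paths})  # current states: all distinct
```

This is why estimates of $Q_n(\theta \mid \theta')$, which involve expectations over the whole path $x_{0:n}$, degrade as $n$ grows even though filtering estimates remain usable.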

    1.3 Contributions

We propose here three new algorithms to address the problem of recursive parameter estimation in general state-space models. These algorithms rely on online EM type algorithms, namely online EM, Stochastic EM (SEM) and Data Augmentation (DA). To prevent the degeneracy inherent in all previous approaches (except [6]), the key point of our paper is to modify the contrast function to optimize. Instead of considering the likelihood function, which leads to (5), we will consider here the so-called split-data likelihood (SDL), as originally proposed in [14], [15] for finite state-space HMM. In this approach, the data set is divided into blocks of, say, $L$ data, and one maximizes the average of the resulting log-SDL. This leads to an alternative Kullback-Leibler contrast function. It can be shown under regularity assumptions that the set of parameters optimizing this contrast function includes the true parameter. One approach consists of using the Fisher identity to obtain a gradient estimate. We will not discuss this approach here; instead, we will maximize the average log-SDL using online EM type algorithms. As we work on data blocks of fixed dimension, the crucial point is that there is no longer any degeneracy problem. Moreover, contrary to [6], [12], [15], these algorithms are numerically well-behaved. An additional annealing schedule can also be added to the DA algorithm to make it more robust to initialization.

The rest of this paper is organized as follows. In Section 2, we introduce the average log-SDL. In Section 3, we present three recursive algorithms to optimize the resulting Kullback-Leibler contrast function and discuss implementation issues. Finally, in Section 4, we present an application to stochastic volatility.

2 Split-data likelihood

The standard likelihood function of $Y_{1:nL}$ is defined by

$$L(Y_{1:nL}) = \int \mu_\theta(x_0) \prod_{k=1}^{nL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k)\, dx_{0:nL}.$$

The SDL consists of dividing the data $Y_{1:nL}$ into $n$ blocks of $L$ data. For each data block, the pseudo-likelihood $L\left( Y_{(i-1)L+1:iL} \right)$ is given by

$$\int p_\theta\left( x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL} \right) dx_{(i-1)L+1:iL}, \quad (6)$$

with $p_\theta\left( x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL} \right)$ equal to

$$\nu_\theta\left( x_{(i-1)L+1} \right) g_\theta\left( x_{(i-1)L+1}, Y_{(i-1)L+1} \right) \prod_{k=(i-1)L+2}^{iL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k), \quad (7)$$

where $\nu_\theta$ corresponds to the invariant density of the latent Markov process. The SDL corresponds to the product of the $n$ pseudo-likelihoods (6). It is important to remark that our algorithms require knowledge of the analytical expression of this invariant density up to a normalizing constant. This is a restriction. However, this density is known in many important applications; this is satisfied, for instance, for all the examples addressed in [16].

We propose to maximize the average log-SDL

$$\frac{1}{n} \sum_{i=1}^{n} l\left( Y_{(i-1)L+1:iL} \right) \to \bar{l}(\theta), \quad (8)$$

where $l\left( Y_{(i-1)L+1:iL} \right) \triangleq \log L\left( Y_{(i-1)L+1:iL} \right)$ and

$$\bar{l}(\theta) = \int l(Y_{1:L})\, p_{\theta^*}(Y_{1:L})\, dY_{1:L};$$

$p_{\theta^*}(Y_{1:L})$ corresponds to the invariant distribution of the observations under the true parameter $\theta^*$, i.e. it is the marginal of (7) for $i = 1$, $\theta = \theta^*$. Note the difference with the standard RML approach, see (4). Maximizing $\bar{l}(\theta)$ is equivalent to minimizing $\bar{K}(\theta, \theta^*) \triangleq \bar{l}(\theta^*) - \bar{l}(\theta)$, which satisfies under regularity


assumptions $\bar{K}(\theta^*, \theta^*) = 0$ [14]. There is a trade-off associated with the choice of $L$. If $L$ is small, the algorithm is typically easier to implement, but its convergence might be slow. If $L$ is large, the algorithm will converge faster, as one mimics the convergence properties of the RML estimate, but the algorithm becomes more complex.
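To make the block structure concrete, the sketch below splits a data set into blocks of $L$ points and evaluates the average block log-likelihood. The block likelihood here is a tractable i.i.d. Gaussian stand-in of our own choosing, not a state-space SDL; it only illustrates the bookkeeping and the fact that the average is larger near the true parameter than far from it.

```python
import math
import random

def block_loglik(block, theta):
    # Stand-in for l(Y_{(i-1)L+1:iL}): log-likelihood of one block under
    # i.i.d. N(theta, 1) observations (tractable, unlike the state-space SDL).
    return sum(-0.5 * (y - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for y in block)

def average_log_sdl(data, L, theta):
    # Average of the block log-likelihoods over n = len(data) // L blocks.
    n = len(data) // L
    return sum(block_loglik(data[i * L:(i + 1) * L], theta)
               for i in range(n)) / n

rng = random.Random(3)
data = [rng.gauss(0.5, 1.0) for _ in range(1000)]  # true parameter 0.5
near = average_log_sdl(data, 2, 0.5)
far = average_log_sdl(data, 2, 2.0)
```

Note that doubling $L$ simply doubles the per-block average here (same total log-likelihood, half as many blocks); in the state-space case, larger $L$ additionally changes how much of the dependence structure each block captures.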

    3 Recursive Algorithms and Implementation

We describe briefly in this section the algorithms. A detailed example is given in the next section. In practice we consider only models for which the joint density $p_\theta(x_{0:n}, y_{1:n})$ belongs to the exponential family, so that one only needs to propagate a set of sufficient statistics. We will denote by $\Sigma$ the set of (typically multivariate) sufficient statistics. All algorithms rely on a non-increasing positive stepsize sequence $\{\gamma_i\}_{i \geq 0}$ satisfying $\sum_i \gamma_i = \infty$ and $\sum_i \gamma_i^2 < \infty$. For models of interest, $p(\theta \mid x_{1:nL}, Y_{1:nL})$ only depends on a set of sufficient statistics (similar to those of the EM and SEM) and the online DA algorithm proceeds as follows.

Online DA

Initialization: $i = 0$, $\Sigma^{(0)} = 0$ and $\theta^{(0)}$.

Iteration $i$, $i \geq 1$:

Sample $X_{(i-1)L+1:iL} \sim p_{\theta^{(i-1)}}\left( x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL} \right)$.

Set $\Sigma^{(i)} = (1 - \gamma_i)\, \Sigma^{(i-1)} + \gamma_i\, \Sigma\left( X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL} \right)$.

Sample $\theta^{(i)} \sim p\left( \theta \mid \Sigma^{(i)} \right)$.

Remark. The different components of the vector $\theta$ might be updated one at a time using a Gibbs sampling strategy.

Remark. Note that instead of using non-overlapping data blocks $\{Y_{(i-1)L+1:iL}\}_{i \geq 1}$, it is possible to use a sliding window. This enables one to update the parameter estimate at the data rate, for example [14].
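The recursion above can be written as a generic skeleton. The three model-specific callables (conditional state sampler, sufficient-statistic map, and sampler for $p(\theta \mid \Sigma)$) are hypothetical placeholders to be supplied per model, and the stepsize $\gamma_i = i^{-0.7}$ is just one admissible choice. The demo instantiation is a fully observed i.i.d. Gaussian toy of our own making, not the stochastic volatility model of Section 4.

```python
import random

def online_da(y_blocks, sample_x_given_y, suff_stat, sample_theta_given_stat,
              theta0, gamma=lambda i: i ** -0.7):
    """Skeleton of the online DA recursion. The three callables are
    model-specific and assumed supplied by the user (hypothetical here):
      sample_x_given_y(theta, y_block)   ~ p_theta(x_block | y_block)
      suff_stat(x_block, y_block)        -> list of sufficient statistics
      sample_theta_given_stat(stat)      ~ p(theta | stat)
    """
    theta, stat = theta0, None
    for i, y in enumerate(y_blocks, start=1):
        x = sample_x_given_y(theta, y)
        s = suff_stat(x, y)
        g = gamma(i)
        # Stochastic approximation update of the running statistics.
        stat = s if stat is None else [(1 - g) * a + g * b
                                       for a, b in zip(stat, s)]
        theta = sample_theta_given_stat(stat)
    return theta

# Demo: fully observed toy where x = y and the "posterior" is peaked at the
# running block mean, so the recursion tracks the true mean 1.0.
rng = random.Random(4)
blocks = [[rng.gauss(1.0, 1.0) for _ in range(5)] for _ in range(2000)]
theta_hat = online_da(
    blocks,
    sample_x_given_y=lambda theta, y: y,            # x fully observed
    suff_stat=lambda x, y: [sum(x) / len(x)],       # block mean
    sample_theta_given_stat=lambda s: s[0] + 0.01 * rng.gauss(0.0, 1.0),
    theta0=0.0,
)
```

Because the statistics vector has fixed dimension regardless of how many blocks have been processed, memory and per-iteration cost stay constant, which is the sense in which the algorithm is online.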

    3.2 Implementation Issues

The algorithms presented above assume that we know how to integrate with respect to $p_\theta\left( x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL} \right)$ or to sample from this density. This is typically impossible to do exactly, but one can perform these integration/sampling steps approximately using modern simulation techniques, i.e. Markov chain Monte Carlo (MCMC) and SMC. When one uses SMC to estimate this density, the approximation one gets will be reasonable only if $L$ is not too large. Otherwise, one can use the forward filtering backward sampling algorithm proposed in [9] to sample from the joint density based on the particle approximations of the marginal filtering densities $p_\theta\left( x_k \mid Y_{(i-1)L+1:k} \right)$. One should keep in mind that even if one uses this method, the algorithm is still an online algorithm.

    4 Application

    Let us consider the following stochastic volatility model arising in

    finance

$$X_{n+1} = \phi X_n + \sigma V_{n+1}, \qquad Y_n = \beta \exp(X_n / 2)\, W_n,$$

where $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ are two mutually independent sequences, independent of the initial state $X_0$. We are interested in estimating the parameter $\theta \triangleq (\phi, \sigma, \beta)$. In this case, the stationary distribution of the hidden process is $\mathcal{N}\left( 0, \frac{\sigma^2}{1 - \phi^2} \right)$. One has

$$\log p_\theta(x_{1:L}, Y_{1:L}) = \text{cste} + \frac{1}{2}\log\left( 1 - \phi^2 \right) - L \log \sigma - L \log \beta - \frac{1 - \phi^2}{2\sigma^2}\, x_1^2 - \frac{1}{2\beta^2} \sum_{k=1}^{L} Y_k^2 \exp(-x_k) - \frac{1}{2\sigma^2} \sum_{k=2}^{L} \left( x_k - \phi x_{k-1} \right)^2$$


so clearly $\Sigma(x_{1:L}, Y_{1:L})$ is given by

$$\left( x_1^2,\ \sum_{k=1}^{L} Y_k^2 \exp(-x_k),\ \sum_{k=1}^{L-1} x_k^2,\ \sum_{k=2}^{L} x_k^2,\ \sum_{k=2}^{L} x_k x_{k-1} \right).$$

If one writes $\Sigma = (\Sigma_1, \ldots, \Sigma_5)$, then the mapping between the sufficient statistics and the parameter space necessary for the EM and SEM algorithms is defined as follows: $\beta^2 = \frac{\Sigma_2}{L}$ and

$$\sigma^2 = \frac{1}{L}\left[ \left( 1 - \phi^2 \right) \Sigma_1 + \phi^2 \Sigma_3 - 2\phi \Sigma_5 + \Sigma_4 \right],$$

$$-\frac{\sigma^2 \phi}{1 - \phi^2} + \left( \Sigma_1 - \Sigma_3 \right) \phi + \Sigma_5 = 0.$$

Plugging the expression of $\sigma^2$ into the last equation gives an equation in $\phi$ that can be solved by a simple dichotomic search. For the DA algorithm, we used a rejection technique to sample from $p(\theta \mid \Sigma)$. We present here a numerical experiment on synthetic data. The true parameter values are $\phi = 0.9$, $\beta = 1.0$, $\sigma^2 = 0.1$, and the RDA algorithm was used for $L = 2$ with a temperature of $n^{0.5}$ at iteration $n$.
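A sketch of this M-step mapping, using our reconstruction of the equations above, with the dichotomic search implemented as a bisection over $\phi \in (-1, 1)$. The demo statistics are computed from one long simulated path rather than from the online algorithm, so the recovered values should simply be close to the simulation parameters.

```python
import math
import random

def m_step(S, L):
    """Map the sufficient statistics S = (S1, ..., S5) of a length-L block
    to (phi, sigma^2, beta^2) for the stochastic volatility model.
    phi is found by a dichotomic (bisection) search on the score equation."""
    S1, S2, S3, S4, S5 = S
    beta2 = S2 / L

    def sigma2_of(phi):
        return ((1 - phi ** 2) * S1 + phi ** 2 * S3 - 2 * phi * S5 + S4) / L

    def h(phi):
        # Score equation in phi after substituting sigma^2(phi).
        return -sigma2_of(phi) * phi / (1 - phi ** 2) + (S1 - S3) * phi + S5

    lo, hi = -0.9999, 0.9999
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    phi = 0.5 * (lo + hi)
    return phi, sigma2_of(phi), beta2

# Demo: build the statistics from one simulated path and recover the parameters.
rng = random.Random(0)
phi_t, sig_t, beta_t, L = 0.9, math.sqrt(0.1), 1.0, 5000
x = rng.gauss(0.0, sig_t / math.sqrt(1 - phi_t ** 2))
xs, ys = [], []
for _ in range(L):
    xs.append(x)
    ys.append(beta_t * math.exp(x / 2) * rng.gauss(0.0, 1.0))
    x = phi_t * x + sig_t * rng.gauss(0.0, 1.0)
S = (xs[0] ** 2,
     sum(y * y * math.exp(-xk) for y, xk in zip(ys, xs)),
     sum(v * v for v in xs[:-1]),
     sum(v * v for v in xs[1:]),
     sum(a * b for a, b in zip(xs[1:], xs[:-1])))
phi_hat, sigma2_hat, beta2_hat = m_step(S, L)
```

The bisection is justified because, for data with positive serial correlation, the score is positive near $\phi = -1$ and driven negative near $\phi = 1$ by the $-\sigma^2 \phi / (1 - \phi^2)$ term, so a sign change is bracketed.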

[Plot omitted: trace of the estimate over 10,000 iterations; vertical axis from 0.65 to 1.05.]

Figure 1: Convergence of the estimate of $\phi$.

[Plot omitted: trace of the estimate over 10,000 iterations; vertical axis from 0.5 to 1.5.]

Figure 2: Convergence of the estimate of $\beta$.

[Plot omitted: trace of the estimate over 10,000 iterations; vertical axis from 0.05 to 0.5.]

Figure 3: Convergence of the estimate of $\sigma^2$.

    5 Discussion

    In this paper, we have proposed three original algorithms to per-

    form online parameter estimation in general state-space models.

    These algorithms are online EM type algorithms. Provided the in-

    variant distribution of the hidden Markov process is known, these

    algorithms are simple and we have demonstrated their efficiency in

    practice. An alternative and computationally very efficient strategy

    which does not require state estimation has been recently proposed

    in [3].

    6 Acknowledgments

This work was initiated while the second author was visiting the Department of Mathematics of Bristol University, thanks to an EPSRC visiting fellowship.

    7 REFERENCES

[1] Andrieu, C., De Freitas, J.F.G. and Doucet, A. (1999). Sequential MCMC for Bayesian model selection, Proc. Work. IEEE HOS.

    [2] Andrieu, C. and Doucet A. (2000) Stochastic algorithms for

    marginal MAP retrieval of sinusoids in non-Gaussian noise,

    Proc. Work. IEEE SSAP.

[3] Andrieu, C., Doucet, A. and Tadic, V.B. (2003). Online sampling for parameter estimation in general state space models, Proc. Workshop SYSID, Rotterdam.

    [4] Doucet, A., Godsill, S.J. and Andrieu, C. (2000). On sequen-

    tial Monte Carlo sampling methods for Bayesian filtering.

    Statist. Comput., 10, 197-208.

[5] Doucet, A., de Freitas, J.F.G. and Gordon N.J. (eds.) (2001). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.

    [6] Doucet A. & Tadic V.B. (2003) Parameter estimation in gen-

    eral state-space models using particle methods. Ann. Inst.

    Stat. Math., to appear.

[7] Elliott R.J. & Krishnamurthy V. (1999) New finite dimensional filters for estimation of linear Gauss-Markov models, IEEE Trans. Automatic Control, Vol. 44, No. 5, pp. 938-951.

    [8] Gilks, W.R. and Berzuini, C. (2001). Following a moving

    target - Monte Carlo inference for dynamic Bayesian models.

    J. R. Statist. Soc. B, 63, pp. 127-146.

    [9] Godsill, S.J., Doucet, A. & West M. (2000) Monte Carlo

smoothing for nonlinear time series. Technical Report TR00-01, Institute of Statistics and Decision Sciences, Duke University.

[10] Higuchi T. (1997) Monte Carlo filter using the genetic algorithm operators. J. Statist. Comp. Simul., 59, 1-23.

    [11] Kitagawa, G. (1998). A self-organizing state-space model. J.

    Am. Statist. Ass., 93, 1203-1215.

    [12] LeGland F. and Mevel L. (1997) Recursive identification in

    hidden Markov models, Proc. 36th IEEE Conf. Decision and

    Control, pp. 3468-3473.

    [13] Liu, J. and West, M. (2001). Combined parameter and state

    estimation in simulation-based filtering. In Sequential Monte

    Carlo Methods in Practice (Doucet, A., de Freitas, J.F.G. and

Gordon N.J. (eds.)). New York: Springer-Verlag.

[14] Ryden T. (1994) Consistent and asymptotically normal parameter estimates for hidden Markov models. Annals of Statistics, 22, 1884-1895.

[15] Ryden T. (1997) On recursive estimation for hidden Markov models. Stochastic Processes and their Applications, 66, 79-96.

[16] Storvik G. (2002) Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Processing, 50, 281-289.