ONLINE EXPECTATION-MAXIMIZATION TYPE ALGORITHMS FOR PARAMETER
ESTIMATION IN GENERAL STATE SPACE MODELS
Christophe Andrieu
- Arnaud Doucet
Department of Mathematics, Bristol University,
Bristol, BS8 1TW, UK.
Department of Engineering, Cambridge University
Cambridge, CB2 1PZ, UK.
Email: [email protected] - [email protected]
ABSTRACT
In this paper we present new online algorithms to estimate static parameters in nonlinear non-Gaussian state space models. These algorithms rely on online Expectation-Maximization (EM) type algorithms. Contrary to standard Sequential Monte Carlo (SMC) methods recently proposed in the literature, these algorithms do not degenerate over time.
1 Introduction
1.1 State-Space Models and Problem Statement
Let $\{X_n\}_{n\ge 0}$ and $\{Y_n\}_{n\ge 1}$ be $\mathbb{R}^p$- and $\mathbb{R}^q$-valued stochastic processes defined on a measurable space $(\Omega, \mathcal{F})$, and let $\theta \in \Theta$ where $\Theta$ is an open subset of $\mathbb{R}^k$. The process $\{X_n\}_{n\ge 0}$ is an unobserved (hidden) Markov process of initial density $\pi_\theta$, i.e. $X_0 \sim \pi_\theta$, and Markov transition density $f_\theta(x, x')$, i.e.

$$X_{n+1} \mid X_n = x \sim f_\theta(x, \cdot). \quad (1)$$

One observes the process $\{Y_n\}_{n\ge 1}$. It is assumed that the observations are conditionally independent given $\{X_n\}_{n\ge 0}$, of marginal density $g_\theta(x, y)$, i.e.

$$Y_n \mid X_n = x \sim g_\theta(x, \cdot). \quad (2)$$
This class of models includes many nonlinear and non-Gaussian
time series models such as
$$X_{n+1} = \varphi_\theta(X_n, V_{n+1}), \qquad Y_n = \psi_\theta(X_n, W_n),$$

where $\{V_n\}_{n\ge 1}$ and $\{W_n\}_{n\ge 1}$ are independent noise sequences.
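Such a model is straightforward to simulate forward in time. The sketch below is ours, not the paper's: the hypothetical functions `phi_fn` and `psi_fn` play the roles of the state and observation maps, driven here by standard Gaussian noise for concreteness (the model class allows arbitrary noise distributions).

```python
import numpy as np

def simulate_ssm(n, phi_fn, psi_fn, x0, rng):
    """Draw (X_0,...,X_n) and (Y_1,...,Y_n) from
    X_{k+1} = phi_fn(X_k, V_{k+1}), Y_k = psi_fn(X_k, W_k)."""
    xs = np.empty(n + 1)
    ys = np.empty(n)
    xs[0] = x0
    for k in range(1, n + 1):
        xs[k] = phi_fn(xs[k - 1], rng.standard_normal())  # state transition
        ys[k - 1] = psi_fn(xs[k], rng.standard_normal())  # observation
    return xs, ys
```

For instance, `simulate_ssm(1000, lambda x, v: 0.9 * x + v, lambda x, w: np.exp(x / 2) * w, 0.0, rng)` draws a path from a stochastic-volatility-type model of the kind studied in Section 4.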
We assume here that the model depends on some unknown parameter denoted $\theta$; the true value of $\theta$ is $\theta^*$. We are interested in deriving recursive algorithms to estimate $\theta^*$ for the class of stationary state-space models, i.e. when the Markov chain $\{X_n\}_{n\ge 0}$ admits a limiting distribution. This problem has numerous applications in electrical engineering, econometrics, statistics, etc., and it is extremely complex: even if $\theta^*$ were known, the simpler problem of optimal filtering, i.e. estimating the posterior distribution of $X_n$ given $(Y_1, \ldots, Y_n)$, does not usually admit a closed-form solution.
In the following, for any sequence $z_k$ or random process $Z_k$, we write $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)$ and $Z_{i:j} = (Z_i, Z_{i+1}, \ldots, Z_j)$.
1.2 A Brief Literature Review
1.2.1 Filtering methods
A standard approach followed in the literature consists of setting a prior distribution on the unknown parameter $\theta$ and then considering the extended state $S_n \triangleq (X_n, \theta)$. This converts the parameter estimation problem into an optimal filtering problem. One can then apply, at least theoretically, standard particle filtering techniques [5] to estimate $p(x_n, \theta \mid Y_{1:n})$ and thus $p(\theta \mid Y_{1:n})$. In this approach, the parameter space is only explored at the initialization of the algorithm. Consequently the algorithm is inefficient; after
a few iterations the marginal posterior distribution of the parameter is approximated by a single Dirac delta function. To limit this
problem, several authors have proposed to use kernel density estimation methods [13]. However, this has the effect of transforming
the fixed parameter into a slowly time-varying one. A pragmatic
approach consists of introducing explicitly an artificial dynamic on
the parameter of interest; see [10], [11]. To avoid the introduction
of an artificial dynamic model, an approach proposed in [8] consists of adding Markov chain Monte Carlo (MCMC) steps so as to
add diversity among the particles. However, this approach does
not solve the fixed-parameter estimation problem. More precisely,
the addition of MCMC steps does not make the dynamic model
ergodic. Thus, there is an accumulation of errors over time and the
algorithm can diverge as observed in [1]. Similar problems arise
with the recent method proposed in [16].
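The degeneracy of this extended-state approach is easy to reproduce. The following sketch (ours, not from the paper) runs a bootstrap particle filter on $S_n = (X_n, \theta)$ for a toy linear Gaussian model in which $\theta$ is the autoregressive coefficient: the $\theta$-particles are drawn once from their prior and afterwards only resampled, so the $\theta$-marginal collapses onto a few distinct values.

```python
import numpy as np

def extended_state_pf(ys, n_particles, rng):
    """Bootstrap particle filter on the extended state (X_n, theta) for the
    toy model X_{n+1} = theta X_n + V_{n+1}, Y_n = X_n + W_n."""
    thetas = rng.uniform(0.0, 1.0, n_particles)  # prior; explored only here
    xs = rng.standard_normal(n_particles)        # initial states
    for y in ys:
        xs = thetas * xs + rng.standard_normal(n_particles)   # propagate
        w = np.exp(-0.5 * (y - xs) ** 2)                      # N(x, 1) likelihood
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        xs, thetas = xs[idx], thetas[idx]
    return xs, thetas

rng = np.random.default_rng(1)
true_theta = 0.7
x, ys = 0.0, []
for _ in range(200):
    x = true_theta * x + rng.standard_normal()
    ys.append(x + rng.standard_normal())
xs, thetas = extended_state_pf(ys, 500, rng)
n_distinct = np.unique(thetas).size  # far fewer than 500 after 200 resampling steps
```

After a few hundred observations `n_distinct` is a small fraction of the particle count, which is exactly the impoverishment described above.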
1.2.2 Recursive maximum likelihood
Consider the log-likelihood function

$$l_\theta(Y_{1:n}) = \sum_{k=1}^{n} \log \int g_\theta(x_k, Y_k)\, p_\theta(x_k \mid Y_{1:k-1})\, dx_k, \quad (3)$$

where $p_\theta(x_k \mid Y_{1:k-1})$ is the posterior density of the state $X_k$ given the observations $Y_{1:k-1}$ and $\theta$. Under regularity assumptions, including the stationarity of the state-space model, one has

$$\frac{1}{n}\, l_\theta(Y_{1:n}) \to l(\theta) \quad (4)$$

where

$$l(\theta) \triangleq \int_{\mathbb{R}^q \times \mathcal{P}(\mathbb{R}^p)} \log \left( \int g_\theta(x, y)\, \mu(x)\, dx \right) \lambda_{\theta,\theta^*}(dy, d\mu),$$
where $\mathcal{P}(\mathbb{R}^p)$ is the space of probability distributions on $\mathbb{R}^p$ and $\lambda_{\theta,\theta^*}(dy, d\mu)$ is the joint invariant distribution of the couple $\left(Y_k,\, p_\theta(x_k \mid Y_{1:k-1})\right)$. It depends on both $\theta$ and the true parameter $\theta^*$. Maximizing $l(\theta)$ corresponds to minimizing the following Kullback-Leibler information measure:

$$K(\theta^*, \theta) \triangleq l(\theta^*) - l(\theta) \ge 0. \quad (5)$$

To optimize this cost function, Recursive Maximum Likelihood (RML) relies on the stochastic gradient algorithm

$$\theta_{n+1} = \theta_n + \gamma_n \nabla_\theta \log \int g_{\theta_n}(x_n, Y_n)\, p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})\, dx_n,$$

where $p_{\theta_{1:n}}(x_n \mid Y_{1:n-1})$ denotes the predictive distribution computed using parameter $\theta_k$ at time $k$; this distribution and its derivatives with respect to $\theta$ must be computed recursively. This is the approach followed in [12] for finite state-space HMM and in [6] for general state-space models. In the general state-space case, the filter and its derivatives are
approximated numerically. This method does not suffer from the
problems mentioned above. However, it is sensitive to initialization and in practice it can be difficult to scale the components of
the gradient properly.
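The Robbins-Monro structure of the RML update can be illustrated without the filter-derivative machinery of [6], [12]. The toy sketch below (ours) applies the update $\theta_{n+1} = \theta_n + \gamma_n \times$ (score of the new observation) to i.i.d. data, where the incremental score is available in closed form; in a state-space model this score would instead involve the prediction filter and its $\theta$-derivative.

```python
import numpy as np

# Toy recursive ML: Y_n ~ N(theta*, 1) i.i.d., so the score of the n-th
# observation at the current estimate theta_n is simply (Y_n - theta_n).
rng = np.random.default_rng(2)
theta_star, theta = 2.0, 0.0
for n in range(1, 20001):
    y = theta_star + rng.standard_normal()  # new observation
    gamma = n ** -0.7                       # sum gamma_n = inf, sum gamma_n^2 < inf
    theta += gamma * (y - theta)            # stochastic gradient step
```

With this decreasing stepsize the estimate converges to $\theta^* = 2$; the sensitivity to gradient scaling mentioned above appears as soon as the score components have very different magnitudes.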
1.2.3 Online EM algorithm
Another stochastic gradient type algorithm is based on an online
version of the EM. This approach is better from a practical point
of view as it is easy to implement and numerically well-behaved.
One can actually show that online EM corresponds to a Newton-
Raphson like algorithm; the difference being that the gradient is
not scaled by the information matrix but by the complete information matrix. Online EM algorithms have been proposed for finite
state-space HMM and linear Gaussian state-space models; e.g. [7].
It is formally possible to come up with a similar algorithm for
general state-space models. However, it requires one to compute quantities such as

$$Q_n(\theta \mid \theta') = \mathbb{E}_{\theta'}\left(\log p_\theta(x_{0:n}, Y_{1:n}) \mid Y_{1:n}\right) = \mathbb{E}_{\theta'}\left(\log \pi_\theta(x_0) \mid Y_{1:n}\right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\left(\log f_\theta(x_{k-1}, x_k) \mid Y_{1:n}\right) + \sum_{k=1}^{n} \mathbb{E}_{\theta'}\left(\log g_\theta(x_k, Y_k) \mid Y_{1:n}\right).$$

It is trivially possible to estimate $Q_n(\theta \mid \theta')$ using particle methods, as one gets an estimate of the joint posterior density $p_{\theta'}(x_{1:n} \mid Y_{1:n})$. However, as $n$ increases, only the approximation of the marginal densities $p_{\theta'}(x_{n-L:n} \mid Y_{1:n})$, for a reasonable value of $L$ (say $L$ around 5), is good for a reasonable number of particles [5]. It thus seems impossible to obtain a good approximation of $Q_n(\theta \mid \theta')$ with an online algorithm as $n$ increases, and it is necessary to come up with an alternative method.
1.3 Contributions
We propose here three new algorithms to address the problem
of recursive parameter estimation in general state-space models.
These algorithms rely on online EM type algorithms, namely online EM, Stochastic EM (SEM) and Data Augmentation (DA). To
prevent the degeneracy inherent to all previous approaches (except
[6]), the key point of our paper is to modify the contrast function
to optimize. Instead of considering the likelihood function which
leads to (5), we will consider here the so-called split-data likelihood (SDL) as originally proposed in [14], [15] for finite state-
space HMM. In this approach, the data set is divided into blocks of, say, $L$ data and one maximizes the average of the resulting log-SDL. This leads to an alternative Kullback-Leibler contrast function. It can be shown under regularity assumptions that the set
of parameters optimizing this contrast function includes the true
parameter. An approach consists of using the Fisher identity to
obtain a gradient estimate. We will not discuss this approach here
and we will maximize the average log-SDL using online EM type
algorithms. As we work on data blocks of fixed dimension, the
crucial point is that there is no longer a degeneracy problem. Moreover, contrary to [6], [12], [15], these algorithms are numerically
well-behaved. An additional annealing schedule can also be added
to the DA algorithm to make it more robust to initialization.
The rest of this paper is organized as follows. In Section 2, we
introduce the average log-SDL. In Section 3, we present three recursive algorithms to optimize the resulting Kullback-Leibler contrast function and discuss implementation issues. Finally, in
Section 4, we present an application to stochastic volatility.
2 Split-Data Likelihood
The standard likelihood function of $Y_{1:nL}$ is defined by

$$\mathcal{L}_\theta(Y_{1:nL}) = \int \pi_\theta(x_0) \prod_{k=1}^{nL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k)\, dx_{0:nL}.$$

The SDL consists of dividing the data $Y_{1:nL}$ into $n$ blocks of $L$ data. For each data block, the pseudo-likelihood $\mathcal{L}_\theta\left(Y_{(i-1)L+1:iL}\right)$ is given by

$$\int p_\theta\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right) dx_{(i-1)L+1:iL}, \quad (6)$$

with $p_\theta\left(x_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right)$ equal to

$$\pi_\theta\left(x_{(i-1)L+1}\right) g_\theta\left(x_{(i-1)L+1}, Y_{(i-1)L+1}\right) \prod_{k=(i-1)L+2}^{iL} f_\theta(x_{k-1}, x_k)\, g_\theta(x_k, Y_k), \quad (7)$$

where $\pi_\theta$ corresponds to the invariant density of the latent Markov process. The SDL corresponds to the product of the $n$ pseudo-likelihoods (6). It is important to remark that our algorithms will
require the knowledge of the analytical expression of this invariant
density up to a normalizing constant. This is a restriction. However, this density is known in many important applications; it is
satisfied for instance for all the examples addressed in [16].
We propose to maximize the average log-SDL

$$\frac{1}{n} \sum_{i=1}^{n} l_\theta\left(Y_{(i-1)L+1:iL}\right) \to l(\theta) \quad (8)$$

where $l_\theta\left(Y_{(i-1)L+1:iL}\right) \triangleq \log \mathcal{L}_\theta\left(Y_{(i-1)L+1:iL}\right)$ and

$$l(\theta) = \int l_\theta(Y_{1:L})\, p_{\theta^*}(Y_{1:L})\, dY_{1:L};$$

here $p_{\theta^*}(Y_{1:L})$ corresponds to the invariant distribution of the observations under the true parameter $\theta^*$, i.e. it is the marginal of (7) for $i = 1$, $\theta = \theta^*$. Note the difference with the standard RML approach, see (4). Maximizing $l(\theta)$ is equivalent to minimizing $K(\theta^*, \theta) \triangleq l(\theta^*) - l(\theta)$, which satisfies under regularity
assumptions $K(\theta^*, \theta^*) = 0$ [14]. There is a trade-off associated with the choice of $L$. If $L$ is small, the algorithm is typically easier to implement but its convergence might be slow. If $L$ is large, the algorithm converges faster, as one mimics the convergence properties of the RML estimate, but it becomes more complex.
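Evaluating the left-hand side of (8) amounts to cutting the observation sequence into non-overlapping blocks of length $L$ and averaging the block log-likelihoods. A minimal sketch, where `block_loglik` is a placeholder for $\log \mathcal{L}_\theta(Y_{\text{block}})$ (in practice this integral is intractable and must itself be estimated, e.g. by SMC):

```python
import numpy as np

def average_log_sdl(ys, L, block_loglik, theta):
    """Average split-data log-likelihood (1/n) sum_i log L_theta(block_i)."""
    n = len(ys) // L                                # number of complete blocks
    blocks = np.asarray(ys[: n * L]).reshape(n, L)  # non-overlapping blocks
    return sum(block_loglik(theta, b) for b in blocks) / n
```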
3 Recursive Algorithms and Implementation
We briefly describe in this section the algorithms; a detailed example is given in the next section. In practice we consider only models for which the joint density $p_\theta(x_{0:n}, y_{1:n})$ is in the exponential family, so that one only needs to propagate a set of sufficient statistics. We will denote by $\Phi$ the set of (typically multivariate) sufficient statistics. All algorithms rely on a non-increasing positive stepsize sequence $\{\gamma_i\}_{i\ge 1}$ satisfying $\sum_i \gamma_i = \infty$ and $\sum_i \gamma_i^2 < \infty$ (e.g. $\gamma_i = i^{-\alpha}$ with $\alpha \in (0.5, 1]$). For models of interest, $p(\theta \mid x_{1:nL}, Y_{1:nL})$ only depends on a set of sufficient statistics (similar to those of the EM and SEM), and the online DA algorithm proceeds as follows.

Online DA
Initialization: $i = 0$, $\bar{\Phi}^{(0)} = 0$ and $\theta^{(0)}$.
Iteration $i$, $i \ge 1$:
1. Sample $X_{(i-1)L+1:iL} \sim p_{\theta^{(i-1)}}\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$.
2. Set $\bar{\Phi}^{(i)} = (1 - \gamma_i)\, \bar{\Phi}^{(i-1)} + \gamma_i\, \Phi\left(X_{(i-1)L+1:iL}, Y_{(i-1)L+1:iL}\right)$.
3. Sample $\theta^{(i)} \sim p\left(\theta \mid \bar{\Phi}^{(i)}\right)$.
Remark. The different components of the vector $\theta$ might be updated one at a time using a Gibbs sampling strategy.
Remark. Note that instead of using non-overlapping data blocks $\{Y_{(i-1)L+1:iL}\}_{i\ge 1}$, it is possible to use a sliding window. This enables one, for example, to update the parameter estimate at the data rate [14].
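The three steps of the Online DA iteration can be sketched structurally as follows. This is a skeleton, not a full implementation: `sample_states` stands in for a draw from $p_\theta(x_{\text{block}} \mid Y_{\text{block}})$ (obtained via MCMC or SMC, see Section 3.2) and `sample_theta` for a draw from $p(\theta \mid \bar{\Phi})$; both are model-specific placeholders.

```python
import numpy as np

def online_da(y_blocks, sample_states, suff_stats, sample_theta,
              theta0, phi_dim, rng, gamma=lambda i: i ** -0.7):
    """One pass of Online DA over an iterable of non-overlapping data blocks."""
    phi_bar = np.zeros(phi_dim)
    theta = theta0
    for i, block in enumerate(y_blocks, start=1):
        x_block = sample_states(theta, block, rng)                    # step 1
        g = gamma(i)
        phi_bar = (1 - g) * phi_bar + g * suff_stats(x_block, block)  # step 2
        theta = sample_theta(phi_bar, rng)                            # step 3
    return theta, phi_bar
```

Since `gamma(1) = 1`, the initial value of `phi_bar` is forgotten after the first block, and the stochastic approximation then averages the per-block statistics with decreasing weights.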
3.2 Implementation Issues
The algorithms presented above assume that we know how to integrate with respect to $p_\theta\left(x_{(i-1)L+1:iL} \mid Y_{(i-1)L+1:iL}\right)$ or to sample from this density. This is typically impossible in closed form, but one can perform these integration/sampling steps exactly or approximately using modern simulation techniques, i.e. MCMC and SMC. When one uses SMC to estimate this density, the approximation one gets will be reasonable only if $L$ is not too large. Otherwise, one can use the forward filtering backward sampling algorithm proposed in [9] to sample from the joint density based on the particle approximations of the marginal filtering densities $p_\theta\left(x_k \mid Y_{(i-1)L+1:k}\right)$. One should keep in mind that even with this method, the algorithm remains an online algorithm.
4 Application
Let us consider the following stochastic volatility model arising in finance:

$$X_{n+1} = \phi X_n + \sigma V_{n+1}, \qquad Y_n = \beta \exp(X_n/2)\, W_n,$$

where $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ are two mutually independent sequences, independent of the initial state $X_0$. We are interested in estimating the parameter $\theta \triangleq (\phi, \sigma, \beta)$. In this case, the stationary distribution of the hidden process is $\mathcal{N}\left(0, \sigma^2/(1 - \phi^2)\right)$. One has
$$\log p_\theta(x_{1:L}, Y_{1:L}) = \text{cste} + \frac{1}{2}\log\left(1 - \phi^2\right) - L \log \sigma - L \log \beta - \frac{1 - \phi^2}{2\sigma^2}\, x_1^2 - \frac{1}{2\beta^2} \sum_{k=1}^{L} Y_k^2 \exp(-x_k) - \frac{1}{2\sigma^2} \sum_{k=2}^{L} \left(x_k - \phi x_{k-1}\right)^2,$$
so clearly $\Phi(x_{1:L}, Y_{1:L})$ is given by

$$\left(x_1^2,\; \sum_{k=1}^{L} Y_k^2 \exp(-x_k),\; \sum_{k=1}^{L-1} x_k^2,\; \sum_{k=2}^{L} x_k^2,\; \sum_{k=2}^{L} x_k x_{k-1}\right).$$

If one writes $\Phi = (\Phi_1, \ldots, \Phi_5)$, then the mapping between the sufficient statistics and the parameter space necessary for the EM and SEM algorithms is defined as follows: $\beta^2 = \Phi_2/L$ and

$$\sigma^2 = \frac{1}{L}\left[\left(1 - \phi^2\right)\Phi_1 + \phi^2 \Phi_3 - 2\phi\, \Phi_5 + \Phi_4\right],$$
$$-\phi\, \sigma^2 + \left(1 - \phi^2\right)\left[\phi\left(\Phi_1 - \Phi_3\right) + \Phi_5\right] = 0.$$

Plugging the expression of $\sigma^2$ into the last equation gives an equation in $\phi$ that can be solved by a simple dichotomic search. For the DA algorithm, we used a rejection technique to sample from $p(\theta \mid \bar{\Phi})$. We present here a numerical experiment on synthetic data. The true parameter values are $\phi = 0.9$, $\beta = 1.0$, $\sigma^2 = 0.1$, and the RDA algorithm was used for $L = 2$ with a temperature sequence $n^{-0.5}$.
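The M-step mapping can be implemented in a few lines. The sketch below follows the equations as reconstructed here (the expressions for $\sigma^2$ and the $\phi$ stationarity equation are our reconstruction of the M-step for this model), with the dichotomic search realized as plain bisection:

```python
import numpy as np

def m_step(stats, L):
    """Map sufficient statistics Phi = (Phi_1,...,Phi_5) to (phi, sigma^2, beta^2)."""
    p1, p2, p3, p4, p5 = stats
    beta2 = p2 / L  # beta^2 = Phi_2 / L

    def sigma2(phi):
        return ((1 - phi ** 2) * p1 + phi ** 2 * p3 - 2 * phi * p5 + p4) / L

    def h(phi):  # stationarity equation in phi; zero at the M-step maximizer
        return -phi * sigma2(phi) + (1 - phi ** 2) * (phi * (p1 - p3) + p5)

    lo, hi = -0.999, 0.999   # h is assumed to change sign on this bracket
    for _ in range(100):     # dichotomic (bisection) search
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    phi = 0.5 * (lo + hi)
    return phi, sigma2(phi), beta2
```

As a consistency check, feeding in the exact stationary expectations of the statistics (computed from $\phi = 0.9$, $\sigma^2 = 0.1$, $\beta = 1$) recovers those same parameter values.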
Figure 1: Convergence of the estimate of $\phi$.
Figure 2: Convergence of the estimate of $\beta$.
Figure 3: Convergence of the estimate of $\sigma^2$.
5 Discussion
In this paper, we have proposed three original algorithms to perform online parameter estimation in general state-space models.
These algorithms are online EM type algorithms. Provided the invariant distribution of the hidden Markov process is known, these algorithms are simple, and we have demonstrated their efficiency in
practice. An alternative and computationally very efficient strategy
which does not require state estimation has been recently proposed
in [3].
6 Acknowledgments
This work was initiated while the second author was visiting the
Department of Mathematics of Bristol University thanks to an
EPSRC visiting fellowship.
7 REFERENCES
[1] Andrieu, C., De Freitas, J.F.G. and Doucet, A. (1999). Sequential MCMC for Bayesian model selection, Proc. Work.
IEEE HOS.
[2] Andrieu, C. and Doucet A. (2000) Stochastic algorithms for
marginal MAP retrieval of sinusoids in non-Gaussian noise,
Proc. Work. IEEE SSAP.
[3] Andrieu, C., Doucet, A. and Tadic, V.B. (2003). Online sampling for parameter estimation in general state space models, Proc. Workshop SYSID, Rotterdam.
[4] Doucet, A., Godsill, S.J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering.
Statist. Comput., 10, 197-208.
[5] Doucet, A., de Freitas, J.F.G. and Gordon N.J. (eds.) (2001).
Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
[6] Doucet A. & Tadic V.B. (2003) Parameter estimation in general state-space models using particle methods. Ann. Inst.
Stat. Math., to appear.
[7] Elliott R.J. & Krishnamurthy V. (1999) New finite dimensional filters for estimation of linear Gauss-Markov models,
IEEE Trans. Automatic Control, Vol. 44, No. 5, pp. 938-951.
[8] Gilks, W.R. and Berzuini, C. (2001). Following a moving
target - Monte Carlo inference for dynamic Bayesian models.
J. R. Statist. Soc. B, 63, pp. 127-146.
[9] Godsill, S.J., Doucet, A. & West M. (2000) Monte Carlo
smoothing for nonlinear time series. Technical Report TR00-
01, Institute of Statistics and Decision Sciences, Duke University.
[10] Higuchi T. (1997) Monte Carlo filter using the genetic algorithm operators. J. Statist. Comp. Simul., 59, 1-23.
[11] Kitagawa, G. (1998). A self-organizing state-space model. J.
Am. Statist. Ass., 93, 1203-1215.
[12] LeGland F. and Mevel L. (1997) Recursive identification in
hidden Markov models, Proc. 36th IEEE Conf. Decision and
Control, pp. 3468-3473.
[13] Liu, J. and West, M. (2001). Combined parameter and state
estimation in simulation-based filtering. In Sequential Monte
Carlo Methods in Practice (Doucet, A., de Freitas, J.F.G. and
Gordon N.J. (eds.)). New York: Springer-Verlag.
[14] Ryden T. (1994) Consistent and asymptotically normal parameter estimates for hidden Markov models. Annals of
Statistics 22, 1884-1895.
[15] Ryden T. (1997) On recursive estimation for hidden Markov
models. Stochastic Processes and their Applications 66, 79-
96.
[16] Storvik G. (2002) Particle filters in state space models with
the presence of unknown static parameters. IEEE Trans. Signal Processing, 50, 281-289.