
Particle Filtering and Smoothing Methods

Arnaud Doucet Department of Statistics, Oxford University

University College London

3rd October 2012


State-Space Models

Let $\{X_t\}_{t\geq 1}$ be a latent/hidden $\mathcal{X}$-valued Markov process with

$$X_1 \sim \mu(\cdot) \quad \text{and} \quad X_t \mid (X_{t-1} = x) \sim f(\cdot \mid x).$$

Let $\{Y_t\}_{t\geq 1}$ be a $\mathcal{Y}$-valued observation process such that the observations are conditionally independent given $\{X_t\}_{t\geq 1}$ and

$$Y_t \mid (X_t = x) \sim g(\cdot \mid x).$$

This is a general class of time series models, also known as Hidden Markov Models (HMMs), which includes

$$X_t = \Psi(X_{t-1}, V_t), \qquad Y_t = \Phi(X_t, W_t)$$

where $\{V_t\}$ and $\{W_t\}$ are two sequences of i.i.d. random variables.

Aim: infer $\{X_t\}$ given the observations $\{Y_t\}$, on-line or off-line.
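
To make the generative structure concrete, here is a minimal simulation sketch. The callables `sample_mu`, `Psi` and `Phi` are hypothetical placeholders for a specific model (they are not defined in the slides); the i.i.d. noises $V_t, W_t$ are drawn inside them.

```python
import numpy as np

def simulate_ssm(T, sample_mu, Psi, Phi, rng=np.random.default_rng(0)):
    """Draw (X_1:T, Y_1:T) with X_1 ~ mu, X_t = Psi(X_{t-1}, V_t), Y_t = Phi(X_t, W_t).

    sample_mu(rng), Psi(x, rng), Phi(x, rng) are user-supplied callables that
    generate the noise variables V_t, W_t internally via rng.
    """
    x = sample_mu(rng)                      # X_1 ~ mu(.)
    xs, ys = [], []
    for _ in range(T):
        xs.append(x)
        ys.append(Phi(x, rng))              # Y_t = Phi(X_t, W_t)
        x = Psi(x, rng)                     # X_{t+1} = Psi(X_t, V_{t+1})
    return np.array(xs), np.array(ys)
```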


State-Space Models

State-space models are ubiquitous in control, data mining, econometrics, geosciences, systems biology, etc. Since January 2012 alone, more than 13,500 papers on them have appeared (source: Google Scholar).

Finite State-space HMM: $\mathcal{X}$ is a finite set, i.e. $\{X_t\}$ is a finite-state Markov chain, with

$$Y_t \mid (X_t = x) \sim g(\cdot \mid x).$$

Linear Gaussian state-space model:

$$X_t = A X_{t-1} + B V_t, \quad V_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I),$$

$$Y_t = C X_t + D W_t, \quad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I).$$
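
For the linear Gaussian model, exact filtering is available in closed form via the Kalman filter (this is the exact computation referred to later). A minimal sketch, assuming $X_1 \sim \mathcal{N}(m_0, P_0)$, which is one common choice for $\mu$:

```python
import numpy as np

def kalman_filter(ys, A, B, C, D, m0, P0):
    """Exact filtering for X_t = A X_{t-1} + B V_t, Y_t = C X_t + D W_t,
    V_t, W_t i.i.d. N(0, I). Returns (m_t, P_t) with p(x_t | y_1:t) = N(m_t, P_t)."""
    Q, R = B @ B.T, D @ D.T                 # state / observation noise covariances
    m, P = m0, P0                           # assumed prior: X_1 ~ N(m0, P0)
    means, covs = [], []
    for t, y in enumerate(ys):
        if t > 0:                           # prediction: p(x_t | y_1:t-1)
            m, P = A @ m, A @ P @ A.T + Q
        S = C @ P @ C.T + R                 # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)      # Kalman gain
        m = m + K @ (y - C @ m)             # update with y_t
        P = P - K @ S @ K.T                 # equals (I - K C) P
        means.append(m)
        covs.append(P)
    return means, covs
```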


State-Space Models

Stochastic Volatility model:

$$X_t = \phi X_{t-1} + \sigma V_t, \quad V_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),$$

$$Y_t = \beta \exp(X_t/2)\, W_t, \quad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$
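
This model is a convenient running example because it is trivial to simulate. A sketch, with illustrative parameter values and a stationary draw for $X_1$ (the slides leave $\mu$ unspecified):

```python
import numpy as np

def simulate_sv(T, phi=0.91, sigma=1.0, beta=0.5, rng=np.random.default_rng(1)):
    """Simulate the stochastic volatility model above; parameters are illustrative."""
    x = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))  # stationary X_1 (an assumption)
    xs, ys = np.empty(T), np.empty(T)
    for t in range(T):
        xs[t] = x
        ys[t] = beta * np.exp(x / 2.0) * rng.normal()   # Y_t = beta exp(X_t/2) W_t
        x = phi * x + sigma * rng.normal()              # X_{t+1} = phi X_t + sigma V_{t+1}
    return xs, ys
```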

Biochemical Network model:

$$\Pr\left(X^1_{t+dt} = x^1_t + 1,\, X^2_{t+dt} = x^2_t \,\middle|\, x^1_t, x^2_t\right) = \alpha\, x^1_t\, dt + o(dt),$$

$$\Pr\left(X^1_{t+dt} = x^1_t - 1,\, X^2_{t+dt} = x^2_t + 1 \,\middle|\, x^1_t, x^2_t\right) = \beta\, x^1_t x^2_t\, dt + o(dt),$$

$$\Pr\left(X^1_{t+dt} = x^1_t,\, X^2_{t+dt} = x^2_t - 1 \,\middle|\, x^1_t, x^2_t\right) = \gamma\, x^2_t\, dt + o(dt),$$

with observations

$$Y_k = X^1_{k\Delta T} + W_k, \quad W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$

Nonlinear Diffusion model:

$$dX_t = \alpha(X_t)\, dt + \beta(X_t)\, dV_t, \quad V_t \text{ a Brownian motion},$$

$$Y_k = \gamma(X_{k\Delta T}) + W_k, \quad W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$


Inference in State-Space Models

Given observations $y_{1:t} := (y_1, y_2, \ldots, y_t)$, inference about $X_{1:t} := (X_1, \ldots, X_t)$ relies on the posterior

$$p(x_{1:t} \mid y_{1:t}) = \frac{p(x_{1:t}, y_{1:t})}{p(y_{1:t})}$$

where

$$p(x_{1:t}, y_{1:t}) = \underbrace{\mu(x_1) \prod_{k=2}^{t} f(x_k \mid x_{k-1})}_{p(x_{1:t})} \; \underbrace{\prod_{k=1}^{t} g(y_k \mid x_k)}_{p(y_{1:t} \mid x_{1:t})},$$

$$p(y_{1:t}) = \int \cdots \int p(x_{1:t}, y_{1:t})\, dx_{1:t}.$$

When $\mathcal{X}$ is a finite set, and for linear Gaussian models, $\{p(x_t \mid y_{1:t})\}_{t \geq 1}$ can be computed exactly. For other non-linear models, approximations are required: EKF, UKF, Gaussian sum filters, etc. Approximations of the marginals $\{p(x_t \mid y_{1:t})\}_{t=1}^{T}$ provide an approximation of $p(x_{1:T} \mid y_{1:T})$.
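
The factorization of $p(x_{1:t}, y_{1:t})$ translates line by line into a log-density computation; a minimal sketch, assuming hypothetical callables `log_mu`, `log_f`, `log_g` for the log densities of $\mu$, $f$ and $g$:

```python
def log_joint(xs, ys, log_mu, log_f, log_g):
    """log p(x_1:t, y_1:t) = log mu(x_1) + sum_{k=2}^t log f(x_k | x_{k-1})
                           + sum_{k=1}^t log g(y_k | x_k)."""
    lp = log_mu(xs[0]) + log_g(ys[0], xs[0])
    for k in range(1, len(xs)):
        lp += log_f(xs[k], xs[k - 1]) + log_g(ys[k], xs[k])
    return lp
```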


Monte Carlo Methods Basics

Assume you can generate $X^{(i)}_{1:t} \sim p(x_{1:t} \mid y_{1:t})$ for $i = 1, \ldots, N$; then the Monte Carlo approximation is

$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}).$$

Integration is straightforward:

$$\int \varphi_t(x_{1:t})\, p(x_{1:t} \mid y_{1:t})\, dx_{1:t} \approx \int \varphi_t(x_{1:t})\, \widehat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} = \frac{1}{N} \sum_{i=1}^{N} \varphi_t\left(X^{(i)}_{1:t}\right).$$

Marginalization is straightforward:

$$\widehat{p}(x_k \mid y_{1:t}) = \int \widehat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:k-1}\, dx_{k+1:t} = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_k}(x_k).$$

Basic and "key" property:

$$\mathbb{V}\left[\frac{1}{N} \sum_{i=1}^{N} \varphi_t\left(X^{(i)}_{1:t}\right)\right] = \frac{C(\varphi_t)}{N},$$

i.e. the rate of convergence to zero is independent of $t \dim(\mathcal{X})$.
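
In code, both operations are one-liners over an array of sampled paths; a sketch assuming a hypothetical `(N, t)` array `paths` whose rows are exact draws from $p(x_{1:t} \mid y_{1:t})$:

```python
import numpy as np

def mc_expectation(paths, phi):
    """(1/N) sum_i phi(X^(i)_{1:t}) estimates E[phi(X_1:t) | y_1:t]."""
    return np.mean([phi(path) for path in paths])

def marginal_samples(paths, k):
    """The k-th coordinate of each path is a draw from p(x_k | y_1:t)."""
    return paths[:, k]

# e.g. mc_expectation(paths, lambda x: np.sum(x**2))
```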

Monte Carlo Methods

Problem 1: we typically cannot generate exact samples from $p(x_{1:t} \mid y_{1:t})$ for non-linear non-Gaussian models.

Problem 2: even if we could, algorithms generating samples from $p(x_{1:t} \mid y_{1:t})$ would have complexity at least $O(t)$.

Particle methods partially solve Problems 1 and 2 by breaking the task of sampling from $p(x_{1:t} \mid y_{1:t})$ into a collection of simpler subproblems: first approximate $p(x_1 \mid y_1)$ and $p(y_1)$ at time 1, then $p(x_{1:2} \mid y_{1:2})$ and $p(y_{1:2})$ at time 2, and so on.


Bayesian Recursion on Path Space

We have

$$p(x_{1:t} \mid y_{1:t}) = \frac{p(x_{1:t}, y_{1:t})}{p(y_{1:t})} = \frac{g(y_t \mid x_t)\, f(x_t \mid x_{t-1})}{p(y_t \mid y_{1:t-1})} \, \frac{p(x_{1:t-1}, y_{1:t-1})}{p(y_{1:t-1})} = \frac{g(y_t \mid x_t) \overbrace{f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}^{\text{predictive } p(x_{1:t} \mid y_{1:t-1})}}{p(y_t \mid y_{1:t-1})}$$

where

$$p(y_t \mid y_{1:t-1}) = \int g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})\, dx_{1:t}.$$

Prediction-Update formulation:

$$p(x_{1:t} \mid y_{1:t-1}) = f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1}),$$

$$p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}.$$


Monte Carlo Implementation of Prediction Step

Assume that at time $t-1$ you have

$$\widehat{p}(x_{1:t-1} \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t-1}}(x_{1:t-1}).$$

By sampling $X^{(i)}_t \sim f\left(x_t \mid X^{(i)}_{t-1}\right)$ and setting $X^{(i)}_{1:t} = \left(X^{(i)}_{1:t-1}, X^{(i)}_t\right)$, we obtain

$$\widehat{p}(x_{1:t} \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}).$$

Sampling from $f(x_t \mid x_{t-1})$ is usually straightforward, and it is feasible even when $f(x_t \mid x_{t-1})$ does not admit any analytical expression, e.g. for biochemical network models.
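
A sketch of this prediction step for scalar states, assuming a hypothetical `sample_f(x, rng)` that draws from $f(\cdot \mid x)$:

```python
import numpy as np

def predict_step(paths, sample_f, rng):
    """Extend each path X^(i)_{1:t-1} (rows of an (N, t-1) array) with a draw
    X^(i)_t ~ f(. | X^(i)_{t-1}) from the transition density."""
    new_states = np.array([sample_f(path[-1], rng) for path in paths])
    return np.hstack([paths, new_states[:, None]])
```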


Importance Sampling Implementation of Updating Step

Our target at time $t$ is

$$p(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, p(x_{1:t} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})},$$

so by substituting $\widehat{p}(x_{1:t} \mid y_{1:t-1})$ for $p(x_{1:t} \mid y_{1:t-1})$ we obtain

$$\widehat{p}(y_t \mid y_{1:t-1}) = \int g(y_t \mid x_t)\, \widehat{p}(x_{1:t} \mid y_{1:t-1})\, dx_{1:t} = \frac{1}{N} \sum_{i=1}^{N} g\left(y_t \mid X^{(i)}_t\right).$$

We now have

$$\overline{p}(x_{1:t} \mid y_{1:t}) = \frac{g(y_t \mid x_t)\, \widehat{p}(x_{1:t} \mid y_{1:t-1})}{\widehat{p}(y_t \mid y_{1:t-1})} = \sum_{i=1}^{N} W^{(i)}_t \delta_{X^{(i)}_{1:t}}(x_{1:t})$$

with $W^{(i)}_t \propto g\left(y_t \mid X^{(i)}_t\right)$, $\sum_{i=1}^{N} W^{(i)}_t = 1$.
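
A sketch of the corresponding update, assuming a hypothetical `g_density(y, x)` that evaluates $g(y \mid x)$; it returns the normalized weights together with the likelihood factor $\widehat{p}(y_t \mid y_{1:t-1})$:

```python
import numpy as np

def update_step(new_states, y, g_density):
    """Compute W^(i)_t ∝ g(y_t | X^(i)_t), normalized to sum to one, and
    p_hat(y_t | y_1:t-1) = (1/N) sum_i g(y_t | X^(i)_t)."""
    g = np.array([g_density(y, x) for x in new_states])
    return g / g.sum(), g.mean()
```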


Multinomial Resampling

We have a "weighted" approximation $\overline{p}(x_{1:t} \mid y_{1:t})$ of $p(x_{1:t} \mid y_{1:t})$:

$$\overline{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W^{(i)}_t \delta_{X^{(i)}_{1:t}}(x_{1:t}).$$

To obtain $N$ samples $X^{(i)}_{1:t}$ approximately distributed according to $p(x_{1:t} \mid y_{1:t})$, resample $N$ times with replacement,

$$X^{(i)}_{1:t} \sim \overline{p}(x_{1:t} \mid y_{1:t}),$$

to obtain

$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}) = \sum_{i=1}^{N} \frac{N^{(i)}_t}{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}),$$

where $\left\{N^{(i)}_t\right\}$ follows a multinomial distribution with parameters $N$ and $\left\{W^{(i)}_t\right\}$.

This can be achieved in $O(N)$ operations.
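
One standard $O(N)$ construction (a sketch; the slides do not specify which scheme is meant) generates $N$ sorted uniforms in a single pass via normalized exponential spacings and walks the weight CDF once:

```python
import numpy as np

def multinomial_resample(weights, rng):
    """O(N) multinomial resampling: return N ancestor indices drawn with
    replacement according to `weights` (assumed normalized)."""
    N = len(weights)
    e = rng.exponential(size=N + 1)
    u = np.cumsum(e[:-1]) / e.sum()   # N sorted Uniform(0,1) order statistics
    cdf = np.cumsum(weights)
    idx = np.empty(N, dtype=int)
    j = 0
    for i in range(N):                # single pass: j never decreases
        while j < N - 1 and u[i] > cdf[j]:
            j += 1
        idx[i] = j
    return idx                        # N^(i)_t = number of times i appears in idx
```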

Vanilla Particle Filter

At time $t = 1$:

Sample $X^{(i)}_1 \sim \mu(x_1)$; then

$$\overline{p}(x_1 \mid y_1) = \sum_{i=1}^{N} W^{(i)}_1 \delta_{X^{(i)}_1}(x_1), \quad W^{(i)}_1 \propto g\left(y_1 \mid X^{(i)}_1\right).$$

Resample $X^{(i)}_1 \sim \overline{p}(x_1 \mid y_1)$ to obtain $\widehat{p}(x_1 \mid y_1) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_1}(x_1)$.

At time $t \geq 2$:

Sample $X^{(i)}_t \sim f\left(x_t \mid X^{(i)}_{t-1}\right)$, set $X^{(i)}_{1:t} = \left(X^{(i)}_{1:t-1}, X^{(i)}_t\right)$, and

$$\overline{p}(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} W^{(i)}_t \delta_{X^{(i)}_{1:t}}(x_{1:t}), \quad W^{(i)}_t \propto g\left(y_t \mid X^{(i)}_t\right).$$

Resample $X^{(i)}_{1:t} \sim \overline{p}(x_{1:t} \mid y_{1:t})$ to obtain

$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}).$$
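
Putting the steps together for the stochastic volatility model gives the following sketch (illustrative parameters, stationary initialization, log-weights for numerical stability; only $p(x_t \mid y_{1:t})$ is tracked, so memory is $O(N)$):

```python
import numpy as np

def bootstrap_filter_sv(ys, N=1000, phi=0.91, sigma=1.0, beta=0.5,
                        rng=np.random.default_rng(3)):
    """Vanilla particle filter for the SV model, resampling at every step.
    Returns estimates of E[X_t | y_1:t] and of log p(y_1:T)."""
    x = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), size=N)  # X_1^(i) ~ mu (assumed stationary)
    filt_means = np.empty(len(ys))
    log_lik = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            x = phi * x + sigma * rng.normal(size=N)          # X_t^(i) ~ f(. | X_{t-1}^(i))
        s = beta * np.exp(x / 2.0)                            # st. dev. of Y_t given X_t
        logw = -0.5 * (y / s) ** 2 - np.log(s)                # log g(y_t | X_t^(i)) + const
        w = np.exp(logw - logw.max())
        log_lik += logw.max() + np.log(w.mean()) - 0.5 * np.log(2 * np.pi)
        w /= w.sum()                                          # W_t^(i)
        filt_means[t] = np.sum(w * x)                         # estimate of E[X_t | y_1:t]
        x = x[rng.choice(N, size=N, p=w)]                     # multinomial resampling
    return filt_means, log_lik
```

Here `rng.choice` implements the multinomial scheme of the previous slide; the $O(N)$ version sketched there can be substituted directly.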


Particle Estimates

At time $t$, we get

$$\widehat{p}(x_{1:t} \mid y_{1:t}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{1:t}}(x_{1:t}).$$

The marginal likelihood estimate is given by

$$\widehat{p}(y_{1:t}) = \prod_{k=1}^{t} \widehat{p}(y_k \mid y_{1:k-1}) = \prod_{k=1}^{t} \left( \frac{1}{N} \sum_{i=1}^{N} g\left(y_k \mid X^{(i)}_k\right) \right).$$

The computational complexity is $O(N)$ at each time step, and the memory requirements are $O(tN)$. If we are only interested in $p(x_t \mid y_{1:t})$, or in $p(s_t(x_{1:t}) \mid y_{1:t})$ where $s_t(x_{1:t}) = \Psi_t(x_t, s_{t-1}(x_{1:t-1}))$ is fixed-dimensional (e.g. $s_t(x_{1:t}) = \sum_{k=1}^{t} x_k^2$), then the memory requirements are only $O(N)$.
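
Two practical notes, as sketches: the product over $k$ is best accumulated in log space to avoid underflow, and a fixed-dimensional statistic $s_t$ is carried per particle and resampled with the same ancestor indices as the states (the helper names below are illustrative):

```python
import numpy as np

def accumulate_log_marginal(log_lik, g_values):
    """Add log p_hat(y_k | y_1:k-1) = log((1/N) sum_i g(y_k | X_k^(i))) in log space."""
    return log_lik + np.log(np.mean(g_values))

def update_statistic(s_prev, x_new, ancestors):
    """Carry s_t = Psi_t(x_t, s_{t-1}) per particle, here for s_t = sum_k x_k^2:
    resample s along the ancestor indices used for x, then add x_t^2."""
    return s_prev[ancestors] + x_new ** 2
```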


Some Convergence Results

Numerous convergence results are available; see (Del Moral, 2004; Del Moral, Doucet & Singh, 2013). Let $\varphi_t : \mathcal{X}^t \to \mathbb{R}$ and consider

$$\overline{\varphi}_t = \int \varphi_t(x_{1:t})\, p(x_{1:t} \mid y_{1:t})\, dx_{1:t},$$

$$\widehat{\varphi}_t = \int \varphi_t(x_{1:t})\, \widehat{p}(x_{1:t} \mid y_{1:t})\, dx_{1:t} = \frac{1}{N} \sum_{i=1}^{N} \varphi_t\left(X^{(i)}_{1:t}\right).$$

We can prove that for any bounded function $\varphi_t$ and any $p \geq 1$,

$$\mathbb{E}\left[\left|\widehat{\varphi}_t - \overline{\varphi}_t\right|^p\right]^{1/p} \leq \frac{B(t)\, c(p)\, \|\varphi_t\|_\infty}{\sqrt{N}},$$

$$\sqrt{N}\left(\widehat{\varphi}_t - \overline{\varphi}_t\right) \Rightarrow \mathcal{N}\left(0, \sigma_t^2\right) \text{ as } N \to \infty.$$

These are very weak results: for a path-dependent $\varphi_t(x_{1:t})$, $B(t)$ and $\sigma_t^2$ typically grow with $t$.